Subject: Sum: Resources for Spanish (Spanish resources)

Here is a summary of the replies I received about existing linguistic resources for Spanish.

Thanks to: Gerardo Arrarte, Fernando Sanchez Leon, Ruthanna Barnett, Alice Carlberger, Rodrigo Santurio, James L. Fidelholtz, Cesar Romani, Joerge Koch, Jose L. Rodrigo, Martin Beaumont Franowsky, Steve Helmreich, Eduardo A. Martinez Labrada, Mon Alameda, Erik Oltmans ... and many more.

----------------------------------------------------------------------

The Instituto Cervantes, a Spanish public body devoted mainly to promoting the Spanish language and the culture of the Spanish-speaking peoples around the world, carries out various activities intended to encourage research on the Spanish language. Among other activities related to language technology, we are setting up an office whose aim will be to promote the language industries as applied to Spanish. To that end, we have considered it essential to collect and disseminate information on ongoing activities and on the linguistic resources available at different research centres.

So far, we have carried out a survey of Spanish corpora, existing or under development, at research centres in Spain, and we have gathered the results of that survey in a 56-page report, which I will be very glad to send you. In the future, we plan to extend this inventory with data on other types of linguistic resources, as well as with data from projects under way in other countries.

......................................................................
Gerardo Arrarte Carriquiry
Programas de Tecnologia Linguistica
Instituto Cervantes
Libreros, 23
E-28801 Alcala de Henares (Madrid)
E-mail: g.arrarte@cervantes.es
Tel: +34 1 885 62 03
Fax: +34 1 883 50 10
......................................................................

----------------------------------------------------------------------

The ITU corpus is available in the ECI (European Corpus Initiative) corpus, which can be obtained through ELSNET. The address is:

Email: elsnet@let.ruu.nl
Mail: OTS, Trans 10, 3512 JK, Utrecht, The Netherlands
Tel: +31 30 53 6039
Fax: +31 30 53 6000
WWW: http://www.cogsci.ed.ac.uk/elsnet/home.html

It is a trilingual corpus (Spanish, English, French). The version we are preparing includes hand-corrected morphosyntactic tagging of 1 million words of the corpus. This version will be in the public domain from October of this year. The Spanish version of the Xerox tagger will also be in the public domain on that date.

Our laboratory has other corpora, as you will have seen on the Corpora list (I include part of an announcement in English):

There are some Spanish corpora that you can retrieve from our laboratory. They are all documented. The corpora can be downloaded from the following address:

Host: lola.lllf.uam.es
Login: anonymous
Password: <send your e-mail address>

At this moment, we have a corpus of spoken Spanish in orthographic transcription

Directory: pub/corpus/oral

and a corpus of written Spanish texts from Argentina and Chile

Directory: pub/corpus/argentina
           pub/corpus/chile

All the corpora include texts in one of the topics you are interested in. Note that the oral corpus is compressed using the Unix command 'compress', while the other two are .zip files produced with DOS compression utilities (take a look at the README files).

Fernando Sanchez Leon
fsanchez@ccuam3.uam.es

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

Note: more information about the Xerox tagger can be obtained from:

Consortium for Lexical Research
Email: lexical@crl.nmsu.edu
FTP: ftp://clr.nmsu.edu
FTP directory: members-only/tools/ling-analysis/syntax/xerox-tagger/

This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen at Xerox, was written in ANSI Common Lisp. Its development was done in Franz Allegro Common Lisp version 4.1 on SunOS 4.x and Macintosh Common Lisp 2.0p2. The following code is provided: source code, a tokenizer for plain ASCII English, an English lexicon induced from the Brown corpus, a table of mappings from word suffixes to likely ambiguity classes, and an HMM trained on the odd-numbered sentences in the Brown corpus.

More info: info/xerox.o
           ftp://parcftp.xerox.com/pub/tagger

If you need to install Common Lisp to run it, several good free implementations are listed at http://www.cs.rochester.edu/users/staff/miller/alu.html.
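The note above says the distributed HMM was trained on the odd-numbered sentences of the Brown corpus, which leaves the even-numbered sentences free for testing. As a rough illustration only (this is not code from the tagger distribution, which is written in Common Lisp), such a split might be sketched in Python as follows. The function name split_odd_even and the toy sentences are hypothetical, and the sketch assumes sentences are counted from 1.

```python
from typing import List, Sequence, Tuple


def split_odd_even(sentences: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split a corpus into odd- and even-numbered sentences.

    Sentences are counted from 1, so the first sentence is "odd".
    Hypothetical helper for illustration; not part of the Xerox tagger.
    """
    odd = list(sentences[0::2])   # sentences 1, 3, 5, ...
    even = list(sentences[1::2])  # sentences 2, 4, 6, ...
    return odd, even


if __name__ == "__main__":
    # Toy stand-in for a sentence-segmented corpus.
    corpus = [
        "The jury praised the election.",
        "The city paid for the audit.",
        "Voters approved the bond issue.",
        "The mayor declined to comment.",
    ]
    train, held_out = split_odd_even(corpus)
    print(len(train), "training sentences;", len(held_out), "held-out sentences")
```

Interleaving the two halves in this way keeps the training and held-out portions similar in genre and topic, since both are drawn evenly from every part of the corpus.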
----------------------------------------------------------------------

European Corpus Initiative corpora available on CD-ROM:

eci1/mul06/msp06/spa16a: information technology, EU, 26,000 words.
eci1/spa02a-j: El Diario Sur, a local newspaper from Malaga that belongs to a national publisher and has been in existence for 40 years. Different writing styles, 500,000 words.
eci2/mul04/msp04a-j: telecommunication user manual, several 100,000 words.
eci2/mul09/spa19a: Xerox ScanWorX user manual, 45,000 words.
eci2/mul12/msp12/msp12a-c: civil law, Switzerland, 600,000 words.
eci4/spa03: El Diario Vasco, newspaper. Minimally processed by ECI; contains errors and duplication, but the clean and fc files are clean (?).
    clean files: news, few errors, 300,000 words
    fc files: 177,000 words

The national newspaper ABC has just released a CD-ROM with last year's literary supplement, which can be purchased for under $50. More than 4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles, available on CD-ROM. Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley.

The EU MULTEXT project is collecting a corpus that will contain parallel texts from the European Parliament and financial newspaper articles (Spanish from the newspaper Expansion). Licence agreements for these data are still being finalized.

The RELATOR language resources server supports the distribution of NLP resources. Currently available through RELATOR: speech and text corpora, lexicons, NLP programs and tools, and related databases and systems.

ftp://de.relator.research.ec.org/relator
afs://afs/research.ec.org/projects/relator
Multilingual web pages: http://www.xx.relator.research.ec.org (xx = two-letter country codes of the EU countries, such as de, uk, etc.)

Only speech materials.
Alice Carlberger
alice@speech.kth.se

----------------------------------------------------------------------

We have been working on a Spanish-to-English machine translation system, and so have access to a large corpus of Spanish text and have developed a tagger for general newspaper articles. Although the tagger uses proprietary information (the Collins Spanish-English on-line dictionary), we will shortly make the results available on-line. That is, you will be able to e-mail Spanish texts and they will be returned tagged with part of speech.

Steve Helmreich
shelmrei@crl.nmsu.edu

----------------------------------------------------------------------

Hello; I am the co-author of a frequency dictionary of Spanish ...

Mon Alameda
cmsfi52@vmesa.cpd.uniovi.es

----------------------------------------------------------------------

You may find the list "Terminometro electronico en espanol" useful. The address of the list is latin-te@frmop11.cnusc.fr. The list server is listserv@frmop11.cnusc.fr.

Martin Beaumont Franowsky
beaumont@desco.org.pe

----------------------------------------------------------------------

There is the long-standing work of El Colegio de México (the Diccionario del Español de México), a project whose principal investigator is Luis Fernando Lara. He has an Internet account, but I do not have the address at hand, so I give you his snail-mail address:

Dr. Luis Fernando Lara
DEM
El Colegio de México
Camino al Ajusco
México, D.F.
México
They have compiled frequency counts based on a corpus of approximately 2 million words (if I remember correctly), and they have a program that assigns words to their part of speech.

James L. Fidelholtz
jfidel@udlapvms.pue.udlap.mx
jfidel@unm.edu

----------------------------------------------------------------------

We work with large language corpora, and we have built tools for extracting linguistic information:

- a program for the automatic search and extraction of lemmas with their context: REAL
- a program for the segmentation and morphological tagging of lemmas: SMORPH

Jose L. Rodrigo

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
jose@gril.univ-bpclermont.fr
GRIL: Groupe de Recherche dans les Industries de la Langue
Universite Blaise Pascal - Clermont II
34 av. Carnot, F-63037 Clermont-Ferrand Cedex
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
rodrigo@eucmax.sim.ucm.es
Facultad de Filologia
Universidad Complutense de Madrid

----------------------------------------------------------------------

You might want to check out the AGFL Grammar Worklab, which also contains a small grammar for the Spanish noun phrase. The author, Paula Maria Santalla, can be contacted at paula@cs.kun.nl. The URL of the AGFL home page is: http://www.cs.kun.nl/agfl/

Erik Oltmans
Department of Computer Science
University of Nijmegen
Nijmegen, The Netherlands
http://www.cs.kun.nl/agfl/eriko

----------------------------------------------------------------------

The Autonomous University of Nuevo Leon College of Medicine, Monterrey, Mexico, and California State University at Fullerton (CSUF) make available "Spanish 92" (the first 2,000 most frequent words of Spanish), based on Español 92 (E92), a computational linguistic analysis of a million-word corpus of contemporary Spanish carried out between 1986 and 1992 under a grant from the Secretariat of Public Education of the Mexican government. "Spanish 92" is available from the FTP server at CSUF:

ftp wintermute.fullerton.edu
user> anonymous
pw> username@host.domain
ftp> cd /pub/research/chandler

Prof. R. M. Chandler-Burns
College of Medicine
Autonomous University of Nuevo Leon
Monterrey, Mexico

Sent by:
Gabriel Amores
Departamento de Lengua Inglesa
Universidad de Sevilla

Note: Prof. Chandler-Burns's address is rchandlr@ccr.dsi.uanl.mx

----------------------------------------------------------------------

Consortium for Lexical Research
Email: lexical@crl.nmsu.edu
FTP: ftp://clr.nmsu.edu

Parallel text in English and Spanish
Pan American Health Organization
FTP directory: members-only/corpora/paho/

The Pan American Health Organization (PAHO), Conferences and General Services Division, has kindly allowed this group of sample parallel texts to be released for NLP research purposes. There are 180 pairs of texts (360 individual files), which amount to about 8 MB of data. The documents cover the general domains of public health and Latin America, but vary greatly in content and in length. Some are short memos or letters; most are longer reports and conference proceedings. The Spanish documents do contain the Spanish character encoding.
Other formatting commands, such as tabs, centering, italicizing, etc., have been removed. Special thanks to Dr. Marjorie Leon for her assistance in making these texts available.

----------------------------------------------------------------------

The PAPPI System: A Principle-Based Parser

Announcing the first public release of PAPPI, a Prolog-based natural language parser for theories in the principles-and-parameters framework. PAPPI is designed to run on Sun SPARCstations with Quintus Prolog.

The PAPPI system includes:

* An X Window System-based user interface to the underlying Prolog-based parser.
* A sample implementation of classic GB theory, based on the theory described in Lasnik and Uriagereka's textbook "A Course in GB Syntax". The implementation also includes sets of example sentences and sample parameterization for six languages. Currently, these are English, Japanese, Dutch, French, Spanish and German.

(This software was recently demoed at COLING '94.)

PAPPI is a parser designed to be a high-level research tool for experimenting with and learning about linguistic theory. This release represents just one possible instantiation within the principles-and-parameters framework. Users are encouraged to experiment with and modify the sample principles.

The PAPPI system represents code written to support research work. It is still very much under development. Alternate theories (and more sophisticated parsing models) will be made publicly available at a later stage. Upcoming releases may also support other platforms and may not need Quintus Prolog.

This is free software developed at the NEC Research Institute, Inc., an institute for conducting long-term, fundamental research in computer and physical sciences. Comments and suggestions for improvement to the system will be gratefully accepted!
I would also like to hear from those interested in extending the system. The PAPPI project also welcomes unencumbered software contributions, including (but not limited to) support for additional languages, theory and debugging tools.

The system is available for anonymous FTP as:

external.nj.nec.com:/pub/sandiway/pappi-2.0x.tar.Z

[Note: x is an alphabetic character denoting the current minor release.]

A .gz compressed version of the same tar file is also available as:

external.nj.nec.com:/pub/sandiway/pappi-2.0x.tar.gz

This version is recommended for installations that have GNU gzip.

Current requirements:

Sun SPARCstation
SunOS 4.1.3 or 5.3 (aka Solaris 2.3)
Quintus Prolog 3.1.4 or 3.1.1 (June 1992)
Approx. 35 MB of disk space (55-70 MB to install)

Contact address:

Dr. Sandiway Fong
NEC Research Institute, Inc.
Princeton NJ 08540 USA
Email: sandiway@research.nj.nec.com
Fax: (609) 951-2482

----------------------------------------------------------------------

Please send any other information about Spanish resources to my e-mail address (I will no longer be subscribed to the list).

Thank you very much!!

Pablo Accuosto
Facultad de Ingenieria
Universidad de la Republica
Montevideo - Uruguay
E-mail: accuosto@fing.edu.uy
