Subject: summary

Dear all,

I am sending you a summary of the answers I got to my query on Spanish corpora. My apologies for the repetitions: I haven't had the time to really "summarize". Ta.

Yours,

Albert Llorens
Spanish-English Development Group
INCYTA, S.A.
C. Lluis Muntadas 5
08940 Cornella de Llobregat
Barcelona
Spain
E-mail: albert@incyta.es

___________________________________________________________________________

There's a CD-ROM edited by the European Corpus Initiative which includes a number of texts in several European languages. Among others, it includes CEE law in Spanish, English and Portuguese, and a Xerox manual in English and Spanish. A somewhat more detailed account of the contents of this CD-ROM follows:

European Corpus Initiative corpora available on CD-ROM:

eci1/mul06/msp06/spa16a: information technology, EU, 26,000 words.
eci1/spa02a-j: El Diario Sur, local newspaper from Malaga, belongs to a national publisher, in existence for 40 years. Different writing styles, 500,000 words.
eci2/mul04/msp04a-j: telecommunication user manual, several hundred thousand words.
eci2/mul09/spa19a: Xerox ScanWorX user manual, 45,000 words.
eci2/mul12/msp12/msp12a-c: civil law, Switzerland, 600,000 words.
eci4/spa03: El Diario Vasco, newspaper. Minimally processed by ECI; contains errors and duplication, but the clean and fc files seem to be clean.
    clean files: news, few errors, 300,000 words
    fc files: 177,000 words

___________________________________________________________________________

Apart from the ECI CD-ROM, the following corpora are available:

ftp lola.lllf.uam.es
    /pub/corpus/argentina    2 million words
    /pub/corpus/chile        2 million words

Fernando Sanchez Leon, Laboratorio de Linguistica Informatica:

The CRATER project: ITU corpus, in the process of postediting. The trilingual (French/English/Spanish) corpus has more than 3 million words and is the so-called "White Book on Telecommunications" released by the International Telecommunications Union. Fernando et al. are working with a 1-million-word subcorpus, which will also be postedited. This corpus, along with the tagger developed for its tagging and all the resources associated with the tagger, will be in the public domain in October 1995. There is a lexicon with more than 35,000 words (full forms, not lemmas), part-of-speech annotated, that can be used as a starting point in lexicon-building tasks.

The national newspaper ABC has just released a CD-ROM with last year's literary supplement that can be purchased for under $50. More than 4 million words of clean, high-quality written text.

Archivo Digital de Manuscritos y Textos Españoles, available on CD-ROM. Charles Faulhaber, Dept. of Spanish & Portuguese, U of California, Berkeley.

The EU MULTEXT project is collecting a corpus which will contain parallel texts from the European Parliament and financial newspaper articles (Spanish from the Expansion newspaper). Licence agreements for these data are still being finalized.

The RELATOR language resources server supports distribution of NLP resources. Currently available through RELATOR: speech and text corpora, lexicons, NLP programs and tools, and related databases and systems.
    ftp://de.relator.research.ec.org/relator
    afs://afs/research.ec.org/projects/relator
    Multilingual web pages: http://www.xx.relator.research.ec.org
    (xx = two-letter country codes of the EU countries such as de, uk, etc.)
Only speech materials.

Briscoe et al. paper reports a 17,000-word tagged corpus. (This is all the info I have on this paper.)

ftp://parcftp.xerox.com/pub/tagger
Spanish tagger, implemented in Common Lisp. Comes with documentation, works very well. If you need to install Common Lisp to run it, there are several good free implementations at http://www.cs.rochester.edu/users/staff/miller/alu.html.

___________________________________________________________________________

A last report.

> 1. /pub/corpus/:
>    a. Oral corpus of Spanish (7 MB, about 2,000,000 words)
>    b. Some written corpora of South American Spanish
>
> 2. The LDS is the best source, but joining costs money.
>
> 3. The Oxford Text Archive
>    13 Banbury Road
>    Oxford OX2 6NN
>    Fax: +44 865 273275
>
>    Catalogue of over 1300 titles, available in paper
>    or electronic form on the Oxford VAX cluster as ox$doc:textarchive.list and
>    ox$textarchive.sgml, from various listservers, e.g., listserv@brownvm (send
>    the mail message get humanist filelist for details), by anonymous ftp from
>    Internet site ota.ox.ac.uk (163.1.2.4) in the directory pub/ota/public.
>    Also, wherever you are, you can send a note to archive@vax.oxford.ac.uk
>    specifying which form you want.
>
>    Spanish
>
>    a. Literary works, poems.
>
> 4. 1066108 words (approx.)
>    Origin: Grupo Eurotra, Universidad Autonoma de Madrid
>    Contact: Manuel Campos, eurotrac@ccuam3.sdi.uam.es or
>             Fernando Sanchez Leon, Laboratorio de L
>    Available: publicly via anonymous ftp, node lola.lllf.uam.es,
>               directory pub/corpus
>    Contents: transcriptions of spoken language (conferences, conversations, etc.)
>
> 5. 121051 words (approx.)
>    Origin: CHILDES (Child Language Data Exchange System) database, Carnegie Mellon
>            Univ.
>    Contact: Brian MacWhinney, brian@andrew.cmu.edu
>    Available: publicly, after prior communication with Brian MacWhinney
>    Contents: database of corpora of parent-child and child-child interactions
>              from children speaking.
>
> 6. 9,000,000 words (approx.)
>    Origin: this is the European Corpus Initiative Multilingual Corpus I CD-ROM
>    Cost: 20 pounds
>    Contact: eucorp@cogsci.ed.ac.uk
>    Available: all use of this corpus is subject to a licence agreement
>    The CD-ROM is available in the US from the Linguistic Data Consortium (LDC),
>    for members of the LDC or those making a bulk purchase, and otherwise from
>    ELSNET, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland. The cost from ELSNET
>    is 20 UK pounds plus postage, handling and tax where applicable. Ordering
>    procedure is detailed in
>
>    http://www.cogsci.ed.ac.uk/elsnet/eci.html
>
> 7. University of Barcelona: spoken corpus

___________________________________________________________________________
