Subject: review of boguraev and pustejovsky

branimir boguraev and james pustejovsky . 1996 . corpus processing for lexical acquisition . mit press : cambridge , massachusetts . 245 pages . $ 32 . 50 .

the term " acquisition " in the title of this book refers to automatic learning - - acquisition not by human children , but by natural language systems . the papers in this book deal with the topic of building and refining lexica for natural language systems automatically - - i . e . by computer , with little or no human intervention - - from large corpora . building lexica for natural language systems by hand is difficult , expensive , and labor-intensive , and the result may be out of date before it is completed . furthermore , by the standards of earlier systems , lexica have become enormous . continuous speech dictation systems ship with active vocabularies in the range of 30 , 000 lexical items . lexica in production by one company are expected to have 200 , 000 entries for american english and 700 , 000 entries for german . so , from an industrial point of view , work on the automatic acquisition of lexical knowledge is very welcome .

this is not to say that automatic lexical acquisition should be of interest only to applied linguists . lexical information is also necessary in psycholinguistic research , and some of the work in this volume shows such application . furthermore , the sorts of data that researchers in this field are attempting to acquire are just the sorts of data that are needed for large-scale applications of formalisms like head-driven phrase structure grammar . so , the work described in this book should be of interest to academic , as well as industrial , linguists .

this book is the result of a workshop , and as such , it has the usual scattering of topics seen in proceedings . this should be seen as a feature , not a bug : the result is that there is something here for everyone .
various papers come from the fields of corpus linguistics , statistical analysis of language , psycholinguistics , rule acquisition , semantics , and lexical acquisition . the papers are divided into five broad categories : ( 1 ) unknown words , ( 2 ) building representations , ( 3 ) categorization , ( 4 ) lexical semantics , and ( 5 ) evaluation . in addition , a paper by the editors lays out the reasons for , and challenges of , automatic acquisition of lexical information .

( 1 ) introduction

issues in text-based lexicon acquisition . branimir boguraev and james pustejovsky .

this paper presents an in-depth answer to the question with which lexicon builders are perennially plagued by anyone to whom they try to explain their work : why not just use an on-line dictionary ? the short answer is that such dictionaries are static and do not evolve at the same pace as the language that they are attempting to describe . the long answer is that natural language systems require information that is not reflected in traditional dictionaries - - semantic feature geometries , subcategorization frames , and so on . so : " the fundamental problem of lexical acquisition . . . is how to provide , fully and adequately , the systems with the lexical knowledge they need to operate with the proper degree of efficiency . the answer . . . to which the community is converging today . . . is to extract the lexicon from the texts themselves " ( 3 ) . automatic lexical acquisition can trivially solve the short-answer problem by allowing updating as frequently as new data can be acquired . more importantly , it allows the linguist to define the questions that they would like the lexicon to answer , rather than having those questions chosen for them by the dictionary maker .

( 2 ) dealing with unknown words

consider a spell-checking program that encounters the ( unknown ) word " horowitz . 
" the spell checker would like to know the best action to take with this word : is it a mis-spelling that should be replaced with something else , or is it a precious datum that should be added to its lexicon ? the spell-checker asks its user ; the papers in this section discuss attempts to answer these questions automatically . linguists tend not to pay much attention to proper nouns . as mcdonald puts it in an epigram to his paper in this volume , " proper names are the rodney dangerfield of linguistics . they do n't get no respect " ( 21 ) . thus , it might surprise the reader to find that all three of the papers in this section deal with names . the identification and classification of names is , in fact , of considerable interest in natural language systems . for relatively uninflected languages like english , names may constitute the majority of unknown words encountered in a corpus . names raise special issues for classification , including the facts that they may have multiple forms ; multiple forms may have the same referent in a single text , raising problems for reference and coindexation ; and , on a less theoretically interesting but no less morally and legally compelling level , they may require special treatment in the corpus . for instance , proper names are routinely removed from medical data , and may need to be removed from sociolinguistic data , as well . internal and external evidence in the identification and semantic categorization of proper names . david d . mcdonald . this paper is written in the language of artificial intelligence . it describes the proper name facility of the sparser system . it describes the use of context-sensitive rewrite rules to analyze " external evidence " for proper names , e . g . their combinatorial properties . a surprising and impressive aspect of the system described here is that it does not use stored lists of proper nouns . identifying unknown proper names in newswire text . inderjeet mani , t . 
richard macmillan .

this paper describes a method of using contextual clues such as appositives ( " < name > , the daughter of a prominent local physician " or " a niloticist of great repute , < name > " ) and felicity conditions for identifying names . the contextual clues themselves are then tapped for data about the referents of the names .

categorizing and standardizing proper nouns for efficient information retrieval . woojin paik , elizabeth d . liddy , edmund yu , and mary mckenna .

this paper deals with discovering and encoding relationships between groups and their members . paik et al . state the problem as follows : " proper nouns are . . . important sources of information for detecting relevant document in information retrieval . . . . group proper nouns ( e . g . , " middle east " ) and group common nouns ( e . g . , " third world " ) will not match on their constituents unless the group entity is mentioned in the document " ( 61 ) . the problem , then , is to allow a search on " health care third world " to find a document on " health care in nicaragua . " the paper includes a short but useful discussion of the problems that can arise with respect to prepositions when noun phrases containing proper nouns are parsed as common noun phrases . ( the authors solved this problem by changing the ordering of two bracketing routines . )

( 3 ) building representations

customizing a lexicon to better suit a computational task . marti a . hearst , hinrich schuetze .

as mentioned above , lexicon building is expensive ; this paper describes a method for reducing development costs by customizing a pre-existing lexicon , rather than building a new one . the project described here uses as its pre-existing lexicon wordnet , an on-line lexicon that contains information about semantic relationships such as hypernymy , hyponymy , etc . 
this was customized by reducing the resolution of the semantic hierarchies to simple categories , and by combining categories from " distant parts of the hierarchy . . . . . we are interested in finding grouping of terms that contribute to a frame or schema-like representation . . . this can be achieved by finding associational lexical relations among the existing taxonymic relations " ( 79 ) . crucially , these relations should be derived from a particular corpus . the paper includes a nice description of the algorithm used for collapsing semantic categories .

towards building contextual representations of word senses using statistical models . claudia leacock , geoffrey towell , and ellen m . voorhees .

this paper describes a method for differentiating amongst the multiple senses of a polysemous word . the authors discuss using " topical context , " or content words occurring in the vicinity , and " local context , " which includes not just content words but function morphemes , word order , and syntactic structure . they test three methods of acquiring topical context : bayesian , context vector , and neural network . they also give the results of a psycholinguistic experiment comparing human performance with machine performance , given the topical contexts created by the three types of " classifiers . " local context acquisition is based on acquiring " templates , " or specific sequences of words . this paper gives a particularly nice description of its algorithms , and is so clearly written as to be suitable for presentation in courses on statistics or psycholinguistics .

( 4 ) categorization

a context driven conceptual clustering method for verb classification . roberto basili , maria-teresa pazienza , paola velardi .

this paper describes a method of categorizing verbs with respect to thematic roles , drawing on the cobweb and ariosto_lex systems . 
its aim is to do categorization without relying on " defining features , " and to categorize with respect to the domain of discourse . the authors describe their algorithms , and the paper has a nice literature review , covering both psycholinguistic and computational perspectives on classification .

distinguished usage . scott a . waterman .

this paper tackles the syntax / semantics interface . the author attempts to give a linguistic grounding to systems that map text to some knowledge base by means of pattern matching : " by relating lexical pattern-based approaches to a lexical semantic framework , such as the generative lexicon theory [ pustejovsky , 1991 ] , my aim is to provide a basis through which pattern-based understanding systems can be understood in more conventional linguistic terms . . . . . my main contention is that such a framework can be developed by viewing the lexical patterns as structural mappings from text to denotation in a compositional lexical semantics . . . obviating the need for separate syntactic and semantic analysis " ( 144 ) . this paper features an excellent presentation of background ideas and explication of the issues that it discusses .

( 5 ) lexical semantics

detecting dependencies between semantic verb subclasses and subcategorization frames in text corpora . victor poznanski , antonio sanfilippo .

this paper describes " a suite of programs . . . . which elicit dependencies between semantic verb classes and their . . . subcategorization frames using machine readable thesauri to assist in semantic tagging of texts " ( 176 ) . the system uses a commercially available thesaurus-like online lexicon to do semantic tagging . a " subcategorization frame " is then automatically extracted , and the subcategorization frames are analyzed and classified .

acquiring predicate-argument mapping information from multilingual texts . chinatsu aone , douglas mckee . 
the authors hold predicate-argument mapping to be equivalent to conceptual representation ; as such , it is clearly important to language understanding . this is the only paper in the volume that deals with bilingual corpora .

( 6 ) evaluating acquisition

evaluation techniques for automatic semantic extraction : comparing syntactic and window based approaches . gregory grefenstette .

this paper proposes techniques for comparing " knowledge-poor " approaches to determining the degree of semantic similarity between two words . a syntax-based method is compared to a windowing technique . the syntax-based method is shown to perform better for high-frequency words , while the windowing method is the better performer for low-frequency words .

conclusion

this is by no means an introductory text on automatic lexical acquisition . nonetheless , this volume contains papers that will appeal to workers in a variety of linguistic disciplines .

the reviewer k . bretonnel cohen is a linguist at voice input technologies in dublin , ohio , where his responsibilities include the construction of tools for lexicon building and analysis .
