C-Perform: Methods and Tools for Collocation Extraction and Performance-Oriented Parsing

The aim of this project was to lay the foundations for a new generation of systems that enable fast, efficient and robust natural language processing while remaining sufficiently general. Based on the assumption that particular aspects of performance are grammaticalized, we pursued a novel approach to grammar in which performance and competence aspects are interleaved within the grammar model. In particular, we aimed at modeling the interaction of generativity, the distinctive feature of competence, and lexicalization, a feature of language usage. To achieve this goal, the influence of lexicalization on generativity was studied within the phenomenon of collocations. The interaction of lexical and structural information was modeled by means of corpus-based statistical techniques.

Due to the impact of generative grammar on linguistics, collocations have been regarded as a phenomenon outside the grammar. In general, reducing grammar to competence aspects has led to grammar models that account for the dichotomy of syntactically correct versus incorrect utterances but ignore the fact that some of the correct analyses are more adequate than others. This emphasis on competence information leads to ambiguity--a severe problem for processing, as the search space becomes large--and thus to fairly slow systems.

To arrive at efficient yet sufficiently general systems, we combined statistical models with elaborate linguistic knowledge. One way to achieve this is to provide corpora with linguistically elaborate annotation schemes. Grammatical competence can also alleviate another inherent problem of statistical models: since the number of model parameters is limited by the size of the training corpus, a linguistically guided pre-selection of appropriate candidate parameters is crucial.

Within the project, stochastic grammars with different degrees of lexicalization were induced from a German newspaper corpus. Parametrization of the grammar models was guided by insights gained from corpus-based retrieval of collocations. The initial model was trained on annotated portions of the corpus. The parameters were systematically varied and tested in a number of parsing experiments. With parsing, an additional aspect of performance comes into play. For collocation extraction, corpus pre-processing tools were adapted to automatically enrich raw text with the required structural information.
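To make the corpus-based retrieval step concrete, the following is a minimal, illustrative sketch of association-based collocation ranking. It scores adjacent word pairs by pointwise mutual information (PMI); the actual project tools are not described at this level of detail and additionally rely on structural annotation, POS filtering and stronger association measures, so this is a toy stand-in, not the project's method.

```python
import math
from collections import Counter

def extract_collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    A toy stand-in for corpus-based collocation retrieval: real systems
    add POS filtering, structural pre-processing, and association
    measures such as the log-likelihood ratio.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi = math.log2((c / (n - 1)) /
                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append(((w1, w2), pmi))
    # highest-scoring pairs are the best collocation candidates
    return sorted(scored, key=lambda x: -x[1])
```

On a larger corpus, recurrent support-verb constructions such as "in Kraft (treten)" would surface near the top of such a ranking, which is the kind of lexical evidence that guided the grammar parametrization.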

As a theoretical result, the project provided insights into the interaction of generativity and lexicalization within collocations and, as a consequence, into the interaction of competence and performance aspects of natural language.

As a practical outcome, the project provides methods and tools for automatic high-precision extraction of collocations from raw text, methods and tools to induce a highly lexicalized stochastic grammar model from arbitrary corpora, and a CKY-type stochastic parser parametrizable with respect to the grammar. Both the grammar model and the parser are designed in particular for the requirements of robust and efficient processing of real-world German text, and thus overcome the disadvantages of existing stochastic parsers for German, which have largely been developed on the basis of English--a language which, in contrast to German, has little inflection, rigid word order and a fairly restricted amount of non-local phenomena.
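The core idea behind a CKY-type stochastic parser can be sketched as a Viterbi CKY pass over a probabilistic context-free grammar in Chomsky normal form. The grammar, lexicon and sentence below are invented for illustration; the project's parser is far richer (lexicalized categories, robustness strategies for German word order), so this shows only the chart-filling scheme, not the actual system.

```python
from collections import defaultdict

def cky_parse(words, lexicon, rules):
    """Viterbi CKY for a PCFG in Chomsky normal form.

    lexicon: {(tag, word): prob}; rules: {(parent, left, right): prob}.
    Returns the probability of the best parse rooted in 'S' (0.0 if none).
    """
    n = len(words)
    chart = defaultdict(float)  # (i, j, symbol) -> best inside probability
    # initialize length-1 spans from the lexicon
    for i, w in enumerate(words):
        for (tag, word), p in lexicon.items():
            if word == w:
                chart[i, i + 1, tag] = max(chart[i, i + 1, tag], p)
    # fill longer spans bottom-up by combining two adjacent sub-spans
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), p in rules.items():
                    score = p * chart[i, k, left] * chart[k, j, right]
                    if score > chart[i, j, parent]:
                        chart[i, j, parent] = score
    return chart[0, n, 'S']
```

With a toy grammar (S -> NP VP, NP -> Det N, VP -> V NP) a sentence like "die Katze jagt die Maus" receives the product of its rule and lexical probabilities; in a lexicalized model, the rule probabilities would additionally be conditioned on head words, which is where collocational information enters.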

Duration: 1998 - 2003
Sponsor: Austrian Science Foundation (FWF)
Researchers: Brigitte Krenn, Antonio Pareja-Lora, Harald Trost
Partners: Department of Computational Linguistics (University of Saarbrücken)