A Web-based Text Corpora Development System (summary)

Dan Bohus, Marian Boldea Natural Language Processing Laboratory, “Politehnica” University of Timisoara [email protected], [email protected]

Keywords: text corpora; WWW; diacritic character restoration; lexical disambiguation; morphosyntactic annotation.

One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that focuses both on the size and on the quality of these corpora. The quantitative problem is solved by using the Internet as a practically limitless resource of texts. To ensure quality, we enrich the text with information relevant for further use by resolving, in an integrated manner, the problems of diacritic character restoration, lexical disambiguation and morphosyntactic annotation. Although at this moment the system is targeted at texts in Romanian, a number of mechanisms have been provided that allow it to be easily adapted to other languages.

1. System Overview

The system was built by adapting existing tools to the necessities of the intended task and the peculiarities of the Romanian language, and by creating a couple of new ones. Its modules are organized in a layered structure, each of them performing some basic operation on the data: text collection, tokenization, part-of-speech tagging, etc. The modules were then encapsulated by a graphical user interface built as a Tcl/Tk wrapper. Each module can be easily adapted to the necessities of other languages, making the whole system very flexible. The only language-specific resources necessary are a large-scale morphosyntactic dictionary and a tagged text corpus for the initial training of the stochastic POS tagger.
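The layered organization described above can be sketched as a simple composition of stages, each consuming the output of the previous one. This is a minimal illustrative sketch; the stage names and their stand-in bodies are assumptions, not the system's actual API:

```python
def collect(source):
    # Stand-in for the text-collection layer (the real module
    # fetches raw texts from the web).
    return source

def tokenize(text):
    # Stand-in for the segmentation layer (the real module is a
    # flex-generated segmenter with extensible rules).
    return text.split()

def tag(tokens):
    # Stand-in for the POS-tagging layer (the real module is a
    # probabilistic tagger); here every token gets a dummy tag.
    return [(tok, "UNK") for tok in tokens]

# The layered structure: each module performs one basic operation,
# and swapping a module adapts the pipeline to another language.
PIPELINE = [collect, tokenize, tag]

def run_pipeline(data, stages=PIPELINE):
    for stage in stages:
        data = stage(data)
    return data
```

Because each stage only depends on the shape of its input, replacing, say, the tokenizer with one for another language leaves the rest of the pipeline untouched.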

2. Text Acquisition

The first issue addressed is plain text acquisition: lately the number of WWW sites that can be used as sources of texts (e.g. newspaper and radio-TV broadcast station sites) has grown tremendously, so we turned to the Internet as a source of raw texts. The actual task of downloading information from WWW sites was automated by means of standard UNIX tools such as wget and cron. The acquisition module can gather data from both plain text and HTML files, but acquisition from HTML files raises some problems, since a simple HTML/text conversion is not sufficient in most cases due to the intricate structure of the respective pages. Such a direct conversion introduces a lot of undesirable text (links, advertisements, etc.), and to solve this problem a dedicated scripting language for HTML/text conversion was developed. Once downloaded, the text is translated into a stream of tokens by a segmentation module built using the GNU flex lexical analyzer generator. A basic set of segmentation rules (which can later be dynamically extended by the user) was defined to control the behavior of the segmenter. The module saves the texts in SGML format, with appropriate annotations for each identified token. Attributes regarding the file itself (e.g. collection date, source) are also saved for corpora management purposes.
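The segmentation step above can be sketched as follows. This is a toy illustration: the token classes, the SGML element name `<tok>`, and its `type` attribute are assumptions, since the paper does not show the system's actual annotation schema:

```python
import re

# Toy segmenter: split raw text into word and punctuation tokens,
# then emit one SGML-style annotation per token. The real system
# uses a flex-generated scanner with user-extensible rules.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def segment_to_sgml(text):
    lines = []
    for tok in TOKEN_RE.findall(text):
        kind = "WORD" if tok[0].isalnum() else "PUNCT"
        lines.append(f'<tok type="{kind}">{tok}</tok>')
    return "\n".join(lines)
```

A flex-based implementation would express the same two rules (`[[:alnum:]]+` and single punctuation characters) as scanner patterns, which is what makes the rule set easy to extend at run time.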

3. Text Processing

The primary purpose of text processing in our system is diacritic character restoration, as the largest part of the Romanian electronic texts gathered from the Internet exhibits a troublesome lack of diacritics, and this might also be the case for many other languages. Several solutions have been proposed for this problem (Scheytt, Geutner & Waibel 1998, Tufis & Chitu 1999); we combined two technologies to solve it: a compressed representation of large-scale dictionaries by means of finite state automata, and probabilistic POS tagging. The dictionary used for Romanian contains about 420,000 entries and 33,000 lemmas. It has been shown that very good look-up times and compression ratios can be obtained by implementing the dictionary as a finite state automaton, and the Finite State Library from AT&T was adapted to compile the dictionary as a transducer and implement the look-up. For diacritics restoration, a dictionary look-up is first performed, resulting in a list of words from which the current token could have been obtained, each with its morphosyntactic description. When the list contains more than one word, a decision as to which is the correct one is made using a probabilistic POS tagger. In the current implementation of the system we demonstrate that a probabilistic POS tagger (ISSCO-TATOO) can be adapted successfully to a highly inflectional language such as Romanian, using a tiered tagging approach (Tufis, 1999). As POS tagging is already included in the system, it can also be used to enrich the text with valuable information in other processing steps, i.e. lexical disambiguation and morphosyntactic annotation, where necessary.
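The look-up step of the restoration procedure can be illustrated with a toy in-memory dictionary. The word list and the character mapping below are assumptions for illustration only; the real system uses a 420,000-entry dictionary compiled into a finite state transducer, and a probabilistic POS tagger to choose among candidates:

```python
# Map each Romanian diacritic character to its diacritic-less form.
STRIP = str.maketrans("ăâîșț", "aaist")

def strip_diacritics(word):
    return word.translate(STRIP)

# Toy dictionary of full forms with diacritics (illustrative only).
DICTIONARY = ["fată", "față", "pește", "casa"]

# Index: diacritic-less form -> list of candidate full forms.
# This plays the role of the inverse transducer look-up.
LOOKUP = {}
for w in DICTIONARY:
    LOOKUP.setdefault(strip_diacritics(w), []).append(w)

def restore(token):
    candidates = LOOKUP.get(token, [token])
    if len(candidates) == 1:
        return candidates[0]
    # More than one candidate: in the real system the probabilistic
    # POS tagger picks the contextually correct form; this sketch
    # just returns the ambiguous list.
    return candidates
```

For example, "peste" maps back to the single form "pește", while "fata" is ambiguous between "fată" and "față" and would be passed to the tagger for a context-based decision.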

4. The Corpora Development Workbench

The described modules are all accessible through the Corpora Development Workbench, a single graphical user interface created using Tcl/Tk. The goals were to make the system easily usable even by non-specialists, and to provide an efficient way of performing manual tagging and/or disambiguation, which otherwise can be a time-consuming, error-prone operation (Day et al.). Two issues had to be solved here: first, the user must be able to easily spot and manually tag the words that are still ambiguous after the automated process; and secondly, the user must be able to correct possible mistakes of the tagger. To address both problems, an approach similar to the Penn standard of "slash encoding" was used. The workbench can also be used to perform corpora management functions (statistics, logging, inspection, etc.), edit HTML/text conversion scripts, modify and compile the dictionary, add segmentation rules, export texts from corpora into different formats, etc.
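The slash-encoding convention mentioned above can be sketched briefly: an unambiguous token carries one tag (word/TAG), while a token the tagger could not fully disambiguate keeps all of its candidate tags (word/TAG1/TAG2), which makes residual ambiguities easy to spot and resolve in the workbench. The tag names below are illustrative, not the system's actual tagset:

```python
def encode(token, tags):
    # One tag means the token is disambiguated; several tags mean
    # the annotator still has to choose among the candidates.
    return token + "/" + "/".join(tags)

def decode(encoded):
    token, *tags = encoded.split("/")
    return token, tags

def is_ambiguous(encoded):
    # A token is still ambiguous if it carries more than one tag.
    return len(decode(encoded)[1]) > 1
```

A manual-correction pass then amounts to scanning for ambiguous entries and rewriting each with the single tag the annotator selects.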

5. Conclusion

An integrated web-based text corpora development system that focuses not only on the size of the corpora, but also on their quality, was built. Due to its layered structure and to the flexibility of each module, the system can easily be adapted to many languages. Work on creating Romanian text corpora with the system is in progress, and quantitative performance evaluations will be conducted in the process.

References:

Day D., Aberdeen J., Caskey S., Hirschman L., Robinson P., Vilain M.: "Alembic Workbench Corpus Development Tool"

Scheytt P., Geutner P., Waibel A. (1998): “Serbo-Croatian LVCSR on the Dictation and Broadcast News Domain”, ICASSP 1998

Tufis D. (1999): "Tiered Tagging and Combined Language Models Classifiers". In Jelinek, F., Noth, E. (eds.) "Text, Speech and Dialogue", Lecture Notes in Artificial Intelligence 1692, Springer, 28-33

Tufis D., Chitu A. (1999): "Automatic Diacritics Insertion in Romanian Texts", COMPLEX 1999