Using SGML as a Basis for Data-Intensive NLP
David McKelvie, Chris Brew & Henry Thompson
Language Technology Group, Human Communication Research Centre,
University of Edinburgh, Edinburgh, Scotland
David.McKelvie@ed.ac.uk & Chris.Brew@ed.ac.uk & H.Thompson@ed.ac.uk

Abstract

This paper describes the LT NSL system (McKelvie et al, 1996), an architecture for writing corpus processing tools. This system is then compared with two other systems which address similar issues, the GATE system (Cunningham et al, 1995) and the IMS Corpus Workbench (Christ, 1994). In particular we address the advantages and disadvantages of an SGML approach compared with a non-SGML database approach.

1 Introduction

The theme of this paper is the design of software and data architectures for natural language processing using corpora. Two major issues in corpus-based NLP are: how best to deal with medium to large scale corpora, often with complex linguistic annotations, and what system architecture best supports the reuse of software components in a modular and interchangeable fashion.

In this paper we describe the LT NSL system (McKelvie et al, 1996), an architecture for writing corpus processing tools, which we have developed in an attempt to address these issues. This system is then compared with two other systems which address some of the same issues, the GATE system (Cunningham et al, 1995) and the IMS Corpus Workbench (Christ, 1994). In particular we address the advantages and disadvantages of an SGML approach compared with a non-SGML database approach. Finally, in order to back up our claims about the merits of SGML-based corpus processing, we present a number of case studies of the use of the LT NSL system for corpus preparation and linguistic analysis.

2 The LT NSL system

LT NSL is a tool architecture for SGML-based processing of (primarily) text corpora. It generalises the UNIX pipe architecture, making it possible to use pipelines of general-purpose tools to process annotated corpora. The original UNIX architecture allows the rapid construction of efficient pipelines of conceptually simple processes to carry out relatively complex tasks, but is restricted to a simple model of streams as sequences of bytes, lines or fields. LT NSL lifts this restriction, allowing tools access to streams which are sequences of tree-structured text (a representation of SGML marked-up text).

The use of SGML as an I/O stream format between programs has the advantage that SGML is a well defined standard for representing structured text. Its value is precisely that it closes off the option of a proliferation of ad-hoc notations and the associated software needed to read and write them. The most important reason why we use SGML for all corpus linguistic annotation is that it forces us to formally describe the markup we will be using, and provides software for checking that these markup invariants hold in an annotated corpus. In practice this is extremely useful. SGML is human readable, so that intermediate results can be inspected and understood. It also means that it is easy for programs to access the information which is relevant to them, while ignoring additional markup. A further advantage is that many text corpora are available in SGML, for example, the British National Corpus (Burnage & Dunlop, 1992).

The LT NSL system is released as C source code. The software consists of a C-language Application Program Interface (API) of function calls, and a number of stand-alone programs which use this API. The current release is known to work on UNIX (SunOS 4.1.3, Solaris 2.4 and Linux), and a Windows-NT version will be released during 1997. There is also an API for the Python programming language.

One question which arises with respect to using SGML as an I/O format is: what about the cost of parsing SGML? Surely that makes pipelines too inefficient? Parsing SGML in its full generality, and providing validation and adequate error detection, is indeed rather hard. For efficiency reasons, you would not want to use long pipelines of tools if each tool had to reparse the SGML and deal with the full language. Fortunately, LT NSL does not require this. The first stage of processing normalises the input, producing a simplified, but informationally equivalent, form of the document. Subsequent tools can and often will use the LT NSL API, which parses normalised SGML (henceforth NSGML) approximately ten times more efficiently than the best parsers for full SGML. The API then returns this parsed SGML to the calling program as data-structures.

NSGML is a fully expanded text form of SGML, informationally equivalent to the ESIS output of SGML parsers. This means that all markup minimisation is expanded to its full form, SGML entities are expanded into their values (except for SDATA entities), and all SGML names (of elements, attributes, etc.) are normalised. The result is a format easily readable by humans and programs.

The LT NSL programs consist of mknsg, a program for converting arbitrary valid SGML into normalised SGML (based on James Clark's SP parser (Clark, 1996)), the first stage in a pipeline of LT NSL tools; and a number of programs for manipulating normalised SGML files, such as sggrep, which finds SGML elements which match some query. Other of our software packages, such as LT POS (a part of speech tagger) and LT WB (Mikheev & Finch, 1997), also use the LT NSL library.

In addition to the normalised SGML, the mknsg program writes a file containing a compiled form of the Document Type Definition (DTD), SGML's way of describing the structure (or grammar) of the allowed markup in a document, which LT NSL programs read in order to know what the structure of their NSGML input or output is.

How fast is it? Processes requiring sequential access to large text corpora are well supported. It is unlikely that LT NSL will prove the rate-limiting step in sequential corpus processing. The kinds of repeated search required by lexicographers are more of a problem, since the system was not designed for that purpose. The standard distribution is fast enough for use as a search engine with files of up to several million words. Searching 1% of the British National Corpus (a total of 700,000 words (18 Mb)) is currently only 6 times slower using LT NSL sggrep than using fgrep, and sggrep allows more complex structure-sensitive queries. A prototype indexing mechanism (Mikheev & McKelvie, 1997), not yet in the distribution, improves the performance of LT NSL to acceptable levels for much larger datasets.

Why did we say "primarily for text corpora"? Because much of the technology is directly applicable to multimedia corpora such as the Edinburgh Map Task corpus (Anderson et al, 1991). There are tools which interpret SGML elements in the corpus text as offsets into files of audio-data, allowing very flexible retrieval and output of audio information using queries defined over the corpus text and its annotations. The same could be done for video clips, etc.

2.1 Hyperlinking

We are inclined to steer a middle course between a monolithic comprehensive view of corpus data, in which all possible views, annotations, structurings etc. of a corpus component are combined in a single heavily structured document, and a massively decentralised view in which a corpus component is organised as a hyper-document, with all its information stored in separate documents, utilising inter-document pointers. Aspects of the LT NSL library are aimed at supporting this approach. It is necessary to distinguish between files, which are storage units; (SGML) documents, which may be composed of a number of files by means of external entity references; and hyper-documents, which are linked ensembles of documents, using e.g. HyTime or TEI (Sperberg-McQueen & Burnard, 1994) link notation.

The implication of this is that corpus components can be hyper-documents, with low-density (i.e. above the token level) annotation being expressed indirectly in terms of links. In the first instance, this is constrained to situations where element content at one level of one document is entirely composed of elements from another document. Suppose, for example, we had already segmented a file, resulting in a single document marked up with SGML headers and paragraphs, and with the word segmentation marked with <w> tags:

  <p id=p4>
  <w id=p4.w1>Time</w>
  <w id=p4.w2>flies</w>
  <w id=p4.w3>.</w>
  </p>

The output of a phrase-level segmentation might then be stored as follows:

  <p id=p4>
  <phr id=p4.ph1 type=n doc=file1 from='id p4.w1'>
  <phr id=p4.ph2 type=v from='id p4.w2'>
  </p>

Linking is specified using one of the available TEI mechanisms. Details are not relevant here; suffice it to say that doc=file1 resolves to the word-level file and establishes a default for subsequent links. At a minimum, links are able to target single elements or sequences of contiguous elements. LT NSL implements a textual inclusion semantics for such links, inserting the referenced material as the content of the element bearing the linking attributes.

For example, a document recording a spelling correction can be stored as corrected material interleaved with links back to the unchanged spans of the original token-level document:

  <p id=p325>
  <repl from='id p325.t1' to='id p325.t15'>
  <!-- the correction itself -->
  <corr sic='procede' resp='ispell'>
  <token id=p325.t16>proceed</token>
  </corr>
  <!-- more unchanged text -->
  <repl from='id p325.t17' to='id p325.t96'>
  </p>
  <!-- the rest of the unchanged text -->
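The flavour of the NSGML normalisation step can be illustrated with a small sketch. This is not the LT NSL implementation: it is a hypothetical Python fragment with a toy entity table (a real SGML parser takes entity definitions from the DTD), showing the two kinds of transformation described above, expanding entity references into their values and normalising element names.

```python
import re

# Toy entity table; in real SGML these definitions come from the DTD.
ENTITIES = {"amp": "&", "lt": "<", "eacute": "\u00e9"}

def normalise(sgml: str) -> str:
    """Expand entity references and normalise element names to lower case,
    loosely mimicking the NSGML normalisation described in the text."""
    # Expand &name; entity references into their values.
    sgml = re.sub(r"&(\w+);",
                  lambda m: ENTITIES.get(m.group(1), m.group(0)), sgml)
    # Normalise element names in start and end tags to lower case.
    def fix_tag(m):
        return "<" + m.group(1) + m.group(2).lower() + m.group(3) + ">"
    return re.sub(r"<(/?)([A-Za-z][\w.-]*)([^>]*)>", fix_tag, sgml)

print(normalise("<P ID=p4>caf&eacute;</P>"))  # prints: <p ID=p4>café</p>
```

A real normaliser also expands markup minimisation (omitted tags, attribute defaults), which requires the DTD and is well beyond a regex sketch; the point here is only the shape of the input/output contract.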
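The pipeline model, in which tools consume and produce streams of tree-structured elements rather than lines of bytes, can also be sketched. The following hypothetical Python fragment is not the LT NSL API and does not use the actual sggrep query language; it simply shows a stream of parsed elements being filtered by a structure-sensitive predicate, the kind of query sggrep supports.

```python
# Each element is (tag, attributes, children); children mix elements and text.
doc = ("p", {"id": "p4"}, [
    ("w", {"id": "p4.w1"}, ["Time"]),
    ("w", {"id": "p4.w2"}, ["flies"]),
    ("w", {"id": "p4.w3"}, ["."]),
])

def elements(node):
    """Depth-first stream of elements: the tree-structured analogue of a
    UNIX byte/line stream."""
    yield node
    for child in node[2]:
        if isinstance(child, tuple):
            yield from elements(child)

def query(stream, tag, **attrs):
    """Keep elements with the given tag whose attributes match (sggrep-like)."""
    for t, a, c in stream:
        if t == tag and all(a.get(k) == v for k, v in attrs.items()):
            yield (t, a, c)

for el in query(elements(doc), "w", id="p4.w2"):
    print(el[2][0])  # prints: flies
```

Because both stages are generators, they compose like UNIX pipes: each tool reads already-parsed elements, which is the efficiency argument made above for normalising once and parsing NSGML thereafter.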
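Finally, the textual inclusion semantics for links can be sketched as follows. This is a hypothetical, much simplified Python fragment (not LT NSL's link resolver): given a token-level document and a link's from/to ids, it returns the contiguous run of token texts the link targets, which under inclusion semantics becomes the content of the element bearing the linking attributes.

```python
# Token-level document: ordered (id, text) pairs, as in the word-level file.
tokens = [("p4.w1", "Time"), ("p4.w2", "flies"), ("p4.w3", ".")]

def include(from_id, to_id=None):
    """Resolve a from/to link to the contiguous token texts it targets,
    implementing a simplified textual inclusion semantics."""
    to_id = to_id or from_id            # a link may target a single element
    ids = [i for i, _ in tokens]
    lo, hi = ids.index(from_id), ids.index(to_id)
    return [text for _, text in tokens[lo:hi + 1]]

print(include("p4.w1", "p4.w2"))  # prints: ['Time', 'flies']
```

The from='...' to='...' attributes on the <repl> elements in the p325 example above would be resolved in exactly this spirit, pulling the unchanged token spans of the original document into the correction document by reference rather than by copy.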