Usage of XSL Stylesheets for the Annotation of the Sámi Language Corpora
Total Page:16
File Type:pdf, Size:1020Kb
Usage of XSL Stylesheets for the annotation of the Sami´ language corpora Saara Huhmarniemi Sjur N. Moshagen Trond Trosterud University of Tromsø Norwegian Sami´ Parliament University of Tromsø [email protected] [email protected] [email protected] Abstract texts. The small amount of those texts forced us to take into account a wide variety of sources and This paper describes an annotation system formats, and also to include texts that were of low for Sami´ language corpora, which consists technical quality. That introduced problems for the of structured, running texts. The annotation automatic processing of the texts. The solution to of the texts is fully automatic, starting from this problem was the document-specific processing the original documents in different formats. instructions that were implemented in XSLT. The texts are first extracted from the origi- nal documents preserving the original struc- 2 The Project tural markup. The markup is enhanced by a document-specific XSLT script which con- The corpus described here is the first structurally an- tains document-specific formatting instruc- notated text corpus for any Sami´ language. The cor- tions. The overall maintenance is achieved pus database was developed in parallel with the spell by system-wide XSLT scripts. checker and the syntactic analyzer projects for North 1 1 Introduction and Lule Sami´ . The new texts became test mate- rial for these two applications as soon as they were Corpus building for a specific language is consid- added to the corpus database. The requirements for ered to require much human effort and time. To the markup were constantly being re-evaluated dur- overcome this difficulty, there is a recent develop- ing the project. The infrastructure was designed ment of applications for automatic corpus building flexible so that it would accomodate to the differ- using often the Web as a resource e.g. (Baroni and ent needs of the two projects in different phases of Bernardini eds., 2006; Sharoff, 2006). For minority the application development. languages, the resources for building a text corpus At the moment, the corpus database consists of al- are often limited. Automatic tools for building cor- most 6 million words for North Sami´ and some 240 pus database specifically for the minority languages 000 for Lule Sami.´ Even though the system was pri- are developed e.g. by (Ghani et al., 2005; Scannell, marily designed for the Sami´ languages, there are 2004). no strictly language-dependent sections in the sys- The requirement to have the corpus building pro- tem; it has already been tested with Norwegian and cess automatized as much as possible was also cen- Finnish, among others. tral in the Sami´ language corpora project. How- One of the main applications of the text corpus ever, the collection of texts is done in a ”traditional” database is the syntactically annotated and fully dis- manner: the files are gathered and classified man- ambiguated corpus database for Sami´ languages. ually. For North Sami,´ there are texts available in The syntactic annotation is done automatically us- electronic form which can be exploited in a cor- pus database, mainly administrative and newspaper 1http://www.divvun.no/, http://giellatekno.uit.no/ 45 Proceedings of the Linguistic Annotation Workshop, pages 45–48, Prague, June 2007. c 2007 Association for Computational Linguistics ing the tools developed in the syntactic analyzer result is the final XML-document with structural project, but the process is out of the scope of this markup, see Fig. 1. paper. There is also some parallel texts with Norwe- gian, and plans for extending parallel text corpora to different Sami´ languages and Finnish and Swedish. The corpus database is freely available for research purposes. There will be a web-based corpus inter- face for the syntactically annotated corpus and a re- stricted access to the system for examining the cor- pus data directly. 3 XSLT and corpus maintenace Flexibility and reusability are in general the design requirements of annotated text corpora. XML has become the standard annotation system in physical storage representation. XML Transformation Lan- guage (XSLT) (Clark ed., 1999) provides an easy Figure 1: The overall architecture of the conversion data transformation between different formats and process. applications. XSLT is commonly used in the con- temporary corpus development. The power of XSLT The conversion of a document always starts from mainly comes from its sublanguage XPath (Clark the original file, which makes it possible to adapt and DeRose eds., 1999). XPath provides an access for the latest versions of the text extraction tools and to the XML structure, elements, attributes and text other tools used in the process as well as the changes through concise path expressions. in XML-markup. In the Sami´ language corpora, XSLT is used in The annotation process is fully automatic and corpus establishment and maintenance. The raw can be rerun at will. Some documents may con- structural format is produced by text extraction tools tain errors or formatting that are not taken into ac- and coverted to a preliminary XML-format using count by the automatic tools. On the other hand, XSLT. The markup is further enhanced by document the automatized annotation process does not allow specific information and a system-wide processing manual correction of the texts, nor manual XML- instruction, both implemented in XSLT. markup. Those exceptions can be taken into account 4 The Sami´ corpus database by document-specific processing instructions, which are implemented using XSLT. The script can be used 4.1 Overall architecture for adding XML-annotation for specific parts of the The corpus database is organized so that the original document, fixing smaller errors in the document, or text resources, which are the documents in various even to rescue a corrupted file that would be other- formats (Word, PDF, HTML, text) form the source wise unusable. This is a useful feature when build- base. The text is extracted from the original docu- ing a corpus for a minority language with diverse ments using various freely available text extraction and often limited text resources. tools, such as antiword and HTML Tidy. They al- ready provide a preliminary structural markup: anti- 4.2 XML-annotation word produces DocBook and HTML Tidy provides In the Sami´ language corpora, markup of running output in XHTML. There are XSLT scripts for con- text is simple, containing no more structural infor- verting the different preliminary formats the to an mation than what is generally available in the orig- intermediate document format. inal text. The body text can contain sections and The intermediate format is further processed to paragraphs and each section can contain sections the desired XML-format using XSLT-scripts. The and paragraphs. There are four paragraph types: ti- 46 tle, text, table and list. The paragraphs are classified tion of wrongly encoded characters. whenever the information is available in the origi- Another example of a template that can be called nal document. Lists and especially tables contain in- from the document-specific XSLT script is the complete sentences and in many cases numeric data. string-replacement, that can be used for marking When conducting e.g. syntactic analysis, it might spelling errors in the text. Due to the variety of con- be better to leave tables and even lists or titles out, ventions of writing Sami,´ the texts tend to contain whereas for e.g. terminological work the tables are lot of strings that are classified as spelling errors. highly relevant. Tagging for paragraph type makes it The errors disturb the testing of the analyzer, but are possible to include or exclude the relevant paragraph on the other hand interesting from the point of view types at will. of the spell checker project. When a spelling error Inside a paragraph, there is a possibility to is discovered in the text, the erroneous strings and add emphasis markup and other span information, their corrections are added to the document-specific such as quotes. The sentence-level and word-level metafile from where they are picked by the con- markup is not included in the text corpus. The version process. The errors are thus preserved in markup is added when the text corpus is moved to the XML-format but with a correction which can be the syntactically annotated corpus database. used instead of the erroneous string. This is achieved The XML-annotation does not follow any stan- by a special markup: dardized XML-format, but it is, in essence, a subset <error correct="text">tetx</error> of the XCES (Ide et al., 2000) format. Furthermore, In this way the original documents stay intact and the system is designed so that changing the XML- the information is preserved also when the file is re- annotation and moving to a standardized format is a converted. If the error is not just a single word but straightforward process. involves more context, it is possible to add the con- text to the error string. 4.3 XSLT processing In addition, the document-specific XSLT script Each original document in the corpus database is contains variables that may be used already in the paired with an XSLT script. The document-specific text extraction phase. An example would be the text XSLT script contains processing instructions that are alignment information of a pdf-file. applied to the document during the conversion from the preliminary document format to the final XML- 4.4 Language identification format (see Fig. 1.). The XPath expressions are pow- Most documents in the Sami´ corpus database con- erful tools for accessing portions of text in a docu- tain sections of text that are not in the document’s ment and modifying the XML-markup without edit- main language.