XLike Project Language Analysis Services
Xavier Carreras∗, Lluís Padró∗, Lei Zhang♠, Achim Rettinger♠, Zhixing Li, Esteban García-Cuesta, Željko Agić, Božo Bekavac, Blaž Fortuna†, Tadej Štajner†

∗ Universitat Politècnica de Catalunya, Barcelona, Spain.
iSOCO S.A., Madrid, Spain.
University of Zagreb, Zagreb, Croatia.
University of Potsdam, Germany.
† Jožef Stefan Institute, Ljubljana, Slovenia.
Tsinghua University, Beijing, China.
♠ Karlsruhe Institute of Technology, Karlsruhe, Germany.

Abstract

This paper presents the linguistic analysis infrastructure developed within the XLike project. The main goal of the implemented tools is to provide a set of functionalities supporting the main XLike objectives: enabling cross-lingual services for publishers, media monitoring, and the development of new business intelligence applications. The services cover seven major and minor languages: English, German, Spanish, Chinese, Catalan, Slovenian, and Croatian. These analyzers are provided as web services following a lightweight SOA architecture approach, and they are publicly accessible and shared through META-SHARE.¹

¹ "Accessible and shared" here means that the services are publicly callable, not that the code is open-source.

1 Introduction

The goal of the XLike project² is to develop technology able to gather documents in a variety of languages and genres (news, blogs, tweets, etc.) and to extract language-independent knowledge from them, in order to provide new and better services to publishers, media monitoring, and business intelligence. Project use cases are provided by STA (Slovenian Press Agency) and Bloomberg, with the New York Times as an associated partner.

Research partners in the project are Jožef Stefan Institute (JSI), Karlsruhe Institute of Technology (KIT), Universitat Politècnica de Catalunya (UPC), University of Zagreb (UZG), and Tsinghua University (THU). The Spanish company iSOCO is in charge of integrating all components developed in the project.

This paper deals with the language technology developed within the XLike project to convert input documents into a language-independent representation that afterwards enables knowledge aggregation.

To achieve this goal, a set of linguistic processing pipelines is devised as the first step in the document processing flow. Then, a cross-lingual semantic annotation method, based on Wikipedia and Linked Open Data (LOD), is applied. The semantic annotation stage enriches the linguistic analysis with links to knowledge bases for different languages, or links to language-independent representations.

2 Linguistic Analyzers

Apart from basic state-of-the-art tokenizers, lemmatizers, PoS/MSD taggers, and NE recognizers, each pipeline requires deeper processors able to build the target language-independent semantic representation. For that, we rely on three steps: dependency parsing, semantic role labeling and word sense disambiguation. These three processes, combined with multilingual ontological resources such as different WordNets and PredicateMatrix (López de la Calle et al., 2014), a lexical semantics resource combining WordNet, FrameNet, and VerbNet, are the key to the construction of our semantic representation.

2.1 Dependency Parsing

We use graph-based methods for dependency parsing: MSTParser³ (McDonald et al., 2005) is used for Chinese and Croatian, and Treeler⁴ is used for the other languages. Treeler is a library developed by the UPC team that implements several statistical methods for tagging and parsing.

We use these tools to train dependency parsers for all XLike languages using standard available treebanks.
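To make the idea of graph-based dependency parsing in Section 2.1 concrete, the following is a minimal sketch of the arc-factored objective: every possible head-to-modifier arc gets a score, and the parser returns the tree whose arcs have the highest total score. Everything here is an assumption for illustration: the function names `is_tree` and `parse_brute_force` are hypothetical, the scores are made up rather than learned, and the search is exhaustive enumeration, not the Chu-Liu/Edmonds or Eisner algorithms that practical parsers rely on. It does not reproduce the MSTParser or Treeler APIs.

```python
from itertools import product


def is_tree(heads):
    """Check that a head assignment forms a tree rooted at node 0.

    heads[i] is the head of token i + 1; tokens are numbered 1..n and
    0 is the artificial ROOT. The assignment is valid when every token
    reaches ROOT by following heads without revisiting a node.
    """
    for token in range(1, len(heads) + 1):
        seen = set()
        node = token
        while node != 0:
            if node in seen:  # cycle (this also catches self-loops)
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def parse_brute_force(scores):
    """Return (best_heads, best_score) under an arc-factored model.

    scores[h][m] is the score of the arc from head h to modifier m,
    with h in 0..n and m in 1..n (column 0 is unused). Exhaustive
    search is exponential in n, so this only suits toy sentences.
    """
    n = len(scores) - 1
    best_heads, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(scores[h][m] for m, h in enumerate(heads, start=1))
        if total > best_score:
            best_heads, best_score = heads, total
    return best_heads, best_score


# Illustrative (made-up) scores for the tokens 1=Acme, 2=plans, 3=products.
# Picking the best head for each token independently would create the cycle
# Acme -> plans -> Acme, so the tree constraint changes the result.
scores = [
    [0, 2, 6, 1],   # ROOT     -> Acme, plans, products
    [0, 0, 10, 1],  # Acme     -> plans, products
    [0, 9, 0, 8],   # plans    -> Acme, products
    [0, 1, 2, 0],   # products -> Acme, plans
]

if __name__ == "__main__":
    heads, total = parse_brute_force(scores)
    # Prints ((2, 0, 2), 23): Acme <- plans, plans <- ROOT, products <- plans
    print((heads, total))
```

The example is chosen so that the highest-scoring arc for each token taken in isolation does not yield a tree. This is why graph-based parsers search over whole trees rather than making one local decision per token.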
¹ http://www.meta-share.eu
² http://www.xlike.org
³ http://sourceforge.net/projects/mstparser
⁴ http://treeler.lsi.upc.edu

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 9–12, Gothenburg, Sweden, April 26-30 2014. © 2014 Association for Computational Linguistics

2.2 Semantic Role Labeling

As with syntactic parsing, we are developing SRL methods with the Treeler library. To train models, we will use the treebanks made available by the CoNLL-2009 shared task, which provided data annotated with predicate-argument relations for English, Spanish, Catalan, German and Chinese. No treebank annotated with semantic roles exists for Slovene or Croatian. A prototype of SRL has been integrated in all pipelines except the Slovene and Croatian ones. The implemented method follows the pipeline architecture described in (Lluís et al., 2013).

2.3 Word Sense Disambiguation

Word sense disambiguation is performed for all languages with a publicly available WordNet. This includes all languages in the project except Chinese. The goal of WSD is to map words in specific languages to a common semantic space, in this case WN synsets. Thanks to existing connections between WN and other resources, SUMO and OpenCYC sense codes are also output when available.

Thanks to PredicateMatrix, the obtained concepts can be projected to FrameNet, achieving a normalization of the semantic roles produced by the SRL (which are treebank-dependent and thus not the same for all languages). The WSD engine used is the UKB (Agirre and Soroa, 2009) implementation provided by FreeLing (Padró and Stanilovsky, 2012).

2.4 Frame Extraction

The final step is to convert all the gathered linguistic information into a semantic representation. Our method is based on the notion of frame: a semantic frame is a schematic representation of a situation involving various participants. In a frame, each participant plays a role. There is a direct correspondence between roles in a frame and semantic roles; namely, frames correspond to predicates, and participants correspond to the arguments of the predicate. We distinguish three types of participants: entities, words, and frames.

Entities are nodes in the graph connected to real-world entities as described in Section 3. Words are common words or concepts, linked to general ontologies such as WordNet. Frames correspond to events or predicates described in the document. Figure 1 shows an example sentence, the extracted frames and their arguments.

Figure 1: Graphical representation of frames in the sentence "Acme, based in New York, now plans to make computer and electronic products."

It is important to note that frames are a more general representation than SVO-triples. While SVO-triples represent a binary relation between two participants, frames can represent n-ary relations (e.g. predicates with more than two arguments, or with adjuncts). Frames also allow representing sentences where one of the arguments is in turn a frame (as is the case with "plan to make" in the example).

Finally, although frames are extracted at sentence level, the resulting graphs are aggregated into a single semantic graph representing the whole document via a very simple coreference resolution based on detecting named entity aliases and repetitions of common nouns. Future improvements include using a state-of-the-art coreference resolution module for languages where it is available.

3 Cross-lingual Semantic Annotation

This step adds further semantic annotations on top of the results obtained by linguistic processing. All XLike languages are covered. The goal is to map word phrases in different languages into the same semantic interlingua, which consists of resources specified in knowledge bases such as Wikipedia and Linked Open Data (LOD) sources. Cross-lingual semantic annotation is performed in two stages: (1) first, candidate concepts in the knowledge base are linked to the linguistic resources using a newly developed cross-lingual linked data lexicon, called xLiD-Lexica; (2) next, the candidate concepts are disambiguated with the personalized PageRank algorithm, which exploits the structure of the information contained in the knowledge base.

The xLiD-Lexica is stored in RDF format and contains about 300 million triples of cross-lingual groundings. It is extracted from Wikipedia dumps of July 2013 in English, German, Spanish, Catalan, Slovenian and Chinese, and is based on the canonicalized datasets of DBpedia 3.8, which contain triples extracted from the respective Wikipedia whose subject and object resources have an equivalent English article.

4 Web Service Architecture Approach

The different language functionalities are implemented following the service oriented architecture (SOA) approach defined in the XLike project. Therefore all the pipelines (one for each language) have been implemented as web services and may be requested to produce different levels of analysis (e.g. tokenization, lemmatization, NERC, parsing, relation extraction). This approach is very appealing because it allows every language to be treated independently and the whole language analysis process to be executed on different threads or computers, allowing easier parallelization (e.g. using external high-performance platforms such as Amazon Elastic Compute Cloud EC2⁵) as needed. Furthermore it also provides independent develop-

disambiguation are based on FreeLing and Treeler. Frame extraction is rule-based since no SRL corpus is available for Slovene.

• Croatian: Croatian shallow processing is based on proprietary tokenizer, POS/MSD-tagging and lemmatisation system (Agić et al., 2008), NERC system (Bekavac and Tadić, 2007) and dependency parser (Agić, 2012). Word sense disambiguation is based on