
XLike Project Language Analysis Services

Xavier Carreras∗, Lluís Padró∗, Lei Zhang♠, Achim Rettinger♠, Zhixing Li, Esteban García-Cuesta, Željko Agić, Božo Bekavac, Blaž Fortuna†, Tadej Štajner†

∗ Universitat Politècnica de Catalunya, Barcelona, Spain.
iSOCO S.A., Madrid, Spain.
University of Zagreb, Zagreb, Croatia.
University of Potsdam, Germany.
† Jožef Stefan Institute, Ljubljana, Slovenia.
Tsinghua University, Beijing, China.
♠ Karlsruhe Institute of Technology, Karlsruhe, Germany.

Abstract

This paper presents the linguistic analysis infrastructure developed within the XLike project. The main goal of the implemented tools is to provide a set of functionalities supporting the main XLike objectives: enabling cross-lingual services for publishers, media monitoring, or developing new business intelligence applications. The services cover seven major and minor languages: English, German, Spanish, Chinese, Catalan, Slovenian, and Croatian. These analyzers are provided as web services following a lightweight SOA architecture approach, and they are publicly accessible and shared through META-SHARE. [1]

1 Introduction

The goal of the XLike project [2] is to develop technology able to gather documents in a variety of languages and genres (news, blogs, tweets, etc.) and to extract language-independent knowledge from them, in order to provide new and better services to publishers, media monitoring, and business intelligence. Project use cases are provided by STA (Slovenian Press Agency) and Bloomberg, as well as the New York Times as an associated partner.

Research partners in the project are Jožef Stefan Institute (JSI), Karlsruhe Institute of Technology (KIT), Universitat Politècnica de Catalunya (UPC), University of Zagreb (UZG), and Tsinghua University (THU). The Spanish company iSOCO is in charge of integrating all components developed in the project.

This paper deals with the language technology developed within the XLike project to convert input documents into a language-independent representation that afterwards enables knowledge aggregation.

To achieve this goal, a set of linguistic processing pipelines is devised as the first step in the document processing flow. Then, a cross-lingual semantic annotation method, based on Wikipedia and Linked Open Data (LOD), is applied. The semantic annotation stage enriches the linguistic analysis with links to knowledge bases for different languages, or links to language-independent representations.

2 Linguistic Analyzers

Apart from basic state-of-the-art tokenizers, lemmatizers, PoS/MSD taggers, and NE recognizers, each pipeline requires deeper processors able to build the target language-independent semantic representation. For that, we rely on three steps: dependency parsing, semantic role labeling, and word sense disambiguation. These three processes, combined with multilingual ontological resources such as different WordNets and PredicateMatrix (López de la Calle et al., 2014), a lexical semantics resource combining WordNet, FrameNet, and VerbNet, are the key to the construction of our semantic representation.

2.1 Dependency Parsing

We use graph-based methods for dependency parsing: MSTParser [3] (McDonald et al., 2005) is used for Chinese and Croatian, and Treeler [4] is used for the other languages. Treeler is a library developed by the UPC team that implements several statistical methods for tagging and parsing.

We use these tools to train dependency parsers for all XLike languages using standard available treebanks.

[1] "Accessible and shared" here means that the services are publicly callable, not that the code is open-source. http://www.meta-share.eu
[2] http://www.xlike.org
[3] http://sourceforge.net/projects/mstparser
[4] http://treeler.lsi.upc.edu

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 9–12, Gothenburg, Sweden, April 26-30 2014. © 2014 Association for Computational Linguistics

2.2 Semantic Role Labeling

As with syntactic parsing, we are developing SRL methods with the Treeler library. To train models, we will use the treebanks made available by the CoNLL-2009 shared task, which provided data annotated with predicate-argument relations for English, Spanish, Catalan, German, and Chinese. No treebank annotated with semantic roles exists for Slovene or Croatian. A prototype of SRL has been integrated in all pipelines except the Slovene and Croatian ones. The implemented method follows the pipeline architecture described in (Lluís et al., 2013).

2.3 Word Sense Disambiguation

Word sense disambiguation is performed for all languages with a publicly available WordNet. This includes all languages in the project except Chinese. The goal of WSD is to map words in specific languages to a common semantic space, in this case WN synsets. Thanks to existing connections between WN and other resources, SUMO and OpenCYC sense codes are also output when available.

Thanks to PredicateMatrix, the obtained concepts can be projected to FrameNet, achieving a normalization of the semantic roles produced by the SRL (which are treebank-dependent, and thus not the same for all languages). The WSD engine used is the UKB (Agirre and Soroa, 2009) implementation provided by FreeLing (Padró and Stanilovsky, 2012).

2.4 Frame Extraction

The final step is to convert all the gathered linguistic information into a semantic representation. Our method is based on the notion of frame: a semantic frame is a schematic representation of a situation involving various participants. In a frame, each participant plays a role. There is a direct correspondence between roles in a frame and semantic roles; namely, frames correspond to predicates, and participants correspond to the arguments of the predicate. We distinguish three types of participants: entities, words, and frames.

Entities are nodes in the graph connected to real-world entities as described in Section 3. Words are common words or concepts, linked to general ontologies such as WordNet. Frames correspond to events or predicates described in the document. Figure 1 shows an example sentence, the extracted frames, and their arguments.

Figure 1: Graphical representation of frames in the sentence "Acme, based in New York, now plans to make computer and electronic products."

It is important to note that frames are a more general representation than SVO-triples. While SVO-triples represent a binary relation between two participants, frames can represent n-ary relations (e.g. predicates with more than two arguments, or with adjuncts). Frames also allow representing sentences where one of the arguments is in turn a frame (as is the case with "plan to make" in the example).

Finally, although frames are extracted at sentence level, the resulting graphs are aggregated into a single semantic graph representing the whole document via a very simple coreference resolution based on detecting named entity aliases and repetitions of common nouns. Future improvements include using a state-of-the-art coreference resolution module for languages where one is available.

3 Cross-lingual Semantic Annotation

This step adds further semantic annotations on top of the results obtained by linguistic processing. All XLike languages are covered. The goal is to map word phrases in different languages into the same semantic interlingua, which consists of resources specified in knowledge bases such as Wikipedia and Linked Open Data (LOD) sources. Cross-lingual semantic annotation is performed in two stages: (1) first, candidate concepts in the knowledge base are linked to the linguistic resources based on a newly developed cross-lingual linked data lexicon, called xLiD-Lexica; (2) next, the candidate concepts are disambiguated with the personalized PageRank algorithm, which exploits the structure of the information contained in the knowledge base.

The xLiD-Lexica is stored in RDF format and contains about 300 million triples of cross-lingual groundings. It is extracted from Wikipedia dumps of July 2013 in English, German, Spanish, Catalan, Slovenian, and Chinese, and is based on the canonicalized datasets of DBpedia 3.8, which contain triples extracted from the respective Wikipedia whose subject and object resources have an equivalent English article.

4 Web Service Architecture Approach

The different language functionalities are implemented following the service oriented architecture (SOA) approach defined in the XLike project. Therefore, all the pipelines (one for each language) have been implemented as web services and may be requested to produce different levels of analysis (e.g. tokenization, lemmatization, NERC, parsing, relation extraction). This approach is appealing because it allows each language to be treated independently and the whole language analysis process to be executed on different threads or computers, which eases parallelization (e.g. using external high-performance platforms such as Amazon Elastic Compute Cloud EC2 [5]) as needed. Furthermore, it provides independent development lifecycles for each language, which is crucial in this type of research project. Note that these web services can be deployed locally or remotely, maintaining the option of using them in a stand-alone configuration.

The main structure of each pipeline is described below:

• Spanish, English, and Catalan: all modules are based on FreeLing (Padró and Stanilovsky, 2012) and Treeler.

• German: German shallow processing is based on OpenNLP [6] and the Stanford POS tagger and NE extractor (Toutanova et al., 2003; Finkel et al., 2005). Dependency parsing, semantic role labeling, word sense disambiguation, and SRL-based frame extraction are based on FreeLing and Treeler.

• Slovene: Slovene shallow processing is provided by JSI Enrycher [7] (Štajner et al., 2010), which consists of the Obeliks morphosyntactic analysis library (Grčar et al., 2012), the LemmaGen lemmatizer (Juršič et al., 2010), and a CRF-based entity extractor (Štajner et al., 2012). Dependency parsing and word sense disambiguation are based on FreeLing and Treeler. Frame extraction is rule-based since no SRL corpus is available for Slovene.

• Croatian: Croatian shallow processing is based on a proprietary tokenizer, POS/MSD-tagging and lemmatisation system (Agić et al., 2008), NERC system (Bekavac and Tadić, 2007), and dependency parser (Agić, 2012). Word sense disambiguation is based on FreeLing. Frame extraction is rule-based since no SRL corpus is available for Croatian.

• Chinese: Chinese shallow and deep processing is based on the word segmentation component ICTCLAS [8] and a semantic dependency parser trained on the CSDN corpus. Then, rule-based frame extraction is performed (neither an SRL corpus nor a WordNet is available for Chinese).

Each language analysis service is able to process thousands of words per second when performing shallow analysis (up to NE recognition), and hundreds of words per second when producing the semantic representation based on full analysis. Moreover, the web service architecture enables the same server to run a different thread for each client, thus taking advantage of multiprocessor capabilities.

The components of the cross-lingual semantic annotation stage are:

• xLiD-Lexica: The cross-lingual groundings in xLiD-Lexica are translated into RDF data and are accessible through a SPARQL endpoint, based on OpenLink Virtuoso [9] as the back-end database engine.

[5] http://aws.amazon.com/ec2/
[6] http://opennlp.apache.org
[7] http://enrycher.ijs.si
[8] http://ictclas.org/
[9] http://virtuoso.openlinksw.com/

• Semantic Annotation: The cross-lingual semantic annotation service is based on the xLiD-Lexica for entity mention recognition and the JUNG Framework [10] for graph-based disambiguation.

[10] Java Universal Network/Graph Framework, http://jung.sourceforge.net/

5 Conclusion

We presented the web service based architecture used in the XLike FP7 project to linguistically analyze large amounts of documents in seven different languages. The analysis pipelines perform basic processing such as tokenization, PoS-tagging, and named entity extraction, as well as deeper analysis such as dependency parsing, word sense disambiguation, and semantic role labelling. The result of these linguistic analyzers is a semantic graph capturing the main events described in the document and their core participants.

On top of that, the cross-lingual semantic annotation component links the resulting linguistic resources in one language to resources in knowledge bases in any other language, or to language-independent representations. This semantic representation is later used in XLike for document mining purposes such as enabling cross-lingual services for publishers, media monitoring, or developing new business intelligence applications.

The described analysis services are currently available via META-SHARE as callable RESTful services.

Acknowledgments

This work was funded by the European Union through project XLike (FP7-ICT-2011-288342).

References

Željko Agić, Marko Tadić, and Zdravko Dovedan. 2008. Improving part-of-speech tagging accuracy for Croatian by morphological analysis. Informatica, 32(4):445–451.

Željko Agić. 2012. K-best spanning tree dependency parsing with verb valency lexicon reranking. In Proceedings of COLING 2012: Posters, pages 1–12, Mumbai, India, December. The COLING 2012 Organizing Committee.

Eneko Agirre and Aitor Soroa. 2009. Personalizing pagerank for word sense disambiguation. In Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (EACL-2009), Athens, Greece.

Božo Bekavac and Marko Tadić. 2007. Implementation of Croatian NERC system. In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing (BSNLP2007), Special Theme: Information Extraction and Enabling Technologies, pages 11–18. Association for Computational Linguistics.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL'05), pages 363–370.

Miha Grčar, Simon Krek, and Kaja Dobrovoljc. 2012. Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik. In Zbornik Osme konference Jezikovne tehnologije, Ljubljana, Slovenia.

Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. Lemmagen: Multilingual lemmatisation with induced ripple-down rules. Journal of Universal Computer Science, 16(9):1190–1214.

Xavier Lluís, Xavier Carreras, and Lluís Màrquez. 2013. Joint arc-factored parsing of syntactic and semantic dependencies. Transactions of the Association for Computational Linguistics, 1:219–230.

Maddalen López de la Calle, Egoitz Laparra, and German Rigau. 2014. First steps towards a predicate matrix. In Proceedings of the Global WordNet Conference (GWC 2014), Tartu, Estonia, January. GWA.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 91–98, Ann Arbor, Michigan, June.

Lluís Padró and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Tadej Štajner, Delia Rusu, Lorand Dali, Blaž Fortuna, Dunja Mladenić, and Marko Grobelnik. 2010. A service oriented framework for natural language text enrichment. Informatica, 34(3):307–313.

Tadej Štajner, Tomaž Erjavec, and Simon Krek. 2012. Razpoznavanje imenskih entitet v slovenskem besedilu. In Proceedings of 15th International Multiconference on Information Society - Jezikovne Tehnologije, Ljubljana, Slovenia.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL'03).
