Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
Kata Gábor, Haïfa Zargayouna, Davide Buscaldi, Isabelle Tellier, Thierry Charnois

To cite this version: Kata Gábor, Haïfa Zargayouna, Davide Buscaldi, Isabelle Tellier, Thierry Charnois. Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature. LREC 2016, May 2016, Portorož, Slovenia. Proceedings of the LREC 2016 Conference.

HAL Id: hal-01360407
https://hal.archives-ouvertes.fr/hal-01360407
Submitted on 6 Sep 2016
Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
Kata Gábor∗, Haïfa Zargayouna∗, Davide Buscaldi∗, Isabelle Tellier†, Thierry Charnois∗
∗ LIPN, CNRS (UMR 7030), Université Paris 13, Sorbonne Paris Cité
† LaTTiCe (UMR 8094), CNRS, ENS Paris, Université Sorbonne Nouvelle - Paris 3, PSL Research University, Université Sorbonne Paris Cité
(gabor,haifa.zargayouna,davide.buscaldi,thierry.charnois)@lipn.univ-paris13.fr, [email protected]
Abstract
This paper describes the process of creating a corpus annotated for concepts and semantic relations in the scientific domain. A part of the ACL Anthology Corpus was selected for annotation, but the annotation process itself is not specific to the computational linguistics domain and could be applied to any scientific corpus. Concepts were identified and annotated fully automatically, based on a combination of terminology extraction and available ontological resources. A typology of semantic relations between concepts is also proposed. This typology, consisting of 18 domain-specific and 3 generic relations, is the result of a corpus-based investigation of the text sequences occurring between concepts in sentences. A sample of 500 abstracts from the corpus is currently being manually annotated with these semantic relations. Only explicit relations are taken into account, so that the data could serve to train or evaluate pattern-based semantic relation classification systems.
Keywords: semantic annotation, semantic relations, ACL Anthology
1. Introduction

One of the emerging trends of natural language technologies is their use for the humanities and sciences. There is a constant increase in the production of scientific papers, and experts are faced with an explosion of information that makes it difficult to have an overview of the state of the art in a given domain (Larsen and von Ins, 2010). This phenomenon gave rise to efforts from the semantic web, scientometrics and natural language processing communities to improve access to scientific literature. Most of these works concentrate on unfolding links between papers in order to build cartographies and analyze scientific networks based on bibliographic references and topic taxonomies (Osborne and Motta, 2012; Osborne and Motta, 2014). Shotton (2010) designed an ontology of different types of citations, while Presutti et al. (2014) aim to improve access to semantic content and inter-document navigation by identifying links between documents. Other studies focus on the evolution of a scientific domain over time (Chavalarias and Cointet, 2013; Omodei et al., 2014).

As opposed to these lines of research, our analysis focuses on the semantic content of individual scientific papers instead of inter-document links and author networks. The goal of our work is to automatically build a state of the art of a scientific domain. Focusing on the abstract and the introduction, we represent the semantic content of a paper as a set of domain-specific concepts and typed relations between them. By identifying instances of concepts and relations, we can extract the contribution of a research paper. Applying this method to a corpus [...] such as ontologies. Ontologies allow a fine-tuned semantic analysis, as opposed to approaches that primarily rely on unstructured resources or terminology extraction on the fly (Omodei et al., 2014; Chavalarias and Cointet, 2013; Skupin, 2004). Sateli and Witte (2015) and Presutti et al. (2014) rely on DBPedia to identify key concepts. Of particular interest for us is that systems using an ontology can also benefit from different kinds of typed relations. While statistical systems can do well on identifying key concepts, a typology of relations is much more difficult to extract from domain corpora. In the context of our research, we are interested in exploiting the relations from external ontologies, as well as in populating ontologies by discovering new types of semantic relations between concepts using pattern mining techniques (Béchet et al., 2012). The semantic analysis of scientific corpora makes it possible to add new relation types and instances to existing ontologies (Petasis et al., 2011) or thesauri (Wang et al., 2013).

As a first step towards the automated analysis of scientific corpora, the production of an annotated gold standard corpus was undertaken. We decided to concentrate on the computational linguistics domain and to use part of the ACL Anthology Corpus (Radev et al., 2009) for our purposes. This article describes the (ongoing) work on annotating this corpus to create a resource for training and evaluating relation extraction systems. Our semantic annotations are conceived in terms of entities representing domain-relevant concepts and the possible semantic relations linking them.
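To make the target representation concrete, here is a minimal Python sketch of how a concept mention and a typed relation instance could be stored. The class names, offsets and the `used_for` label are illustrative assumptions for this sketch, not the authors' actual annotation schema.

```python
from dataclasses import dataclass

# Hypothetical, minimal data model for the kind of annotation described
# above: domain concepts anchored in text, plus typed relations between them.

@dataclass(frozen=True)
class ConceptMention:
    text: str    # surface form, e.g. "language model"
    start: int   # character offset in the sentence (inclusive)
    end: int     # character offset (exclusive)

@dataclass(frozen=True)
class RelationInstance:
    rel_type: str         # a label from the relation typology, e.g. "used_for"
    arg1: ConceptMention
    arg2: ConceptMention

# Toy example sentence with two annotated concepts and one relation.
sent = "We train a language model for speech recognition."
lm = ConceptMention("language model", 11, 25)
sr = ConceptMention("speech recognition", 30, 48)
rel = RelationInstance("used_for", lm, sr)
print(rel.rel_type, "->", rel.arg1.text, "/", rel.arg2.text)
```

Freezing the dataclasses makes mentions hashable, so the same mention object can appear as an argument of several relation instances.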
Our annotation process is composed of two consecutive steps:
• The identification of domain-specific concepts in the papers. These concepts will be linked to external ontological resources. Concepts were automatically annotated in the totality of the corpus.
• The annotation of the semantic relations that hold between these concepts (if any); this is done manually on the gold standard corpus. These data will serve to evaluate consecutive experiments on relation extraction and classification. A sample of 500 abstracts is currently being manually annotated.

Section 2. describes how the texts of the corpus were selected and pre-processed. Section 3. presents an analysis of the annotation of concepts and the adequacy of the available resources. Section 4. explains the two-step process followed to annotate the relations: creating a typology (4.1.) and applying it (4.2.). Finally, section 5. draws the conclusions and presents directions for future work.

2. Composition of the Corpus and Pre-processing

We decided to focus on the abstract and introduction parts of scientific papers since they express essential information in a compact and often repetitive manner, which makes them an optimal source for mining sequential patterns in further experiments.

The core of the corpus is the database pre-processed by E. Omodei (2014), which contains the abstracts and different kinds of meta-information for 13,322 papers from the ACL Anthology. We relied on the paper IDs in the database to identify and extract the introductions for the same articles from the 2009 version of the ACL Anthology Corpus (Radev et al., 2009). As a first cleaning step, the CRF tagger¹ was run on the ACL Anthology Corpus to filter out files that cannot be processed: those with a significant number of OCR errors or tokenisation difficulties. We then extracted the introductions from the remaining files using the following heuristics:
• the introduction starts with a line that contains "Introduction/INTRODUCTION" or starts with "1" or "1.1" followed by a capitalized word;
• the last line of the introduction is the one followed by a line starting with "2" or "1.2" followed by a capitalized word;
• if the end of the introduction cannot be identified, the first 6 lines are kept.

The introductions were paired with the corresponding abstracts: the resulting corpus contains 4,200,000 words from 11,000 papers.

People's names and bibliographic references were automatically annotated. For bibliographic references, we adapted the specific grammar available in SxPipe (Sagot and Boullier, 2008). People's names were recognized using the Stanford Named Entity Recognizer (Finkel et al., 2005) and its pre-trained model for three classes (among which only PERSON was used).

¹ http://crftagger.sourceforge.net

3. A First Analysis on the Annotation of Concepts

Linking entities to external ontological resources is an important aspect in the context of our work. The corpus we prepare will serve to experiment with different methods for populating ontologies by unsupervised semantic relation extraction. These methods may rely on different kinds of information: sequential patterns, distributional vectors, but also the semantic properties of the entities, as extracted from the ontology. Therefore, entities representing concepts were annotated based on available resources.

3.1. Goals in terms of precision/annotation density

Ontologies differ with respect to their domain and to the concepts and types of relations they encode. Two aspects come into play when evaluating an ontology as a knowledge resource: the precision and recall of the resource itself (i.e. the quality and quantity of the information it contains), and the compatibility between the resource and the corpus to which it is applied. Our focus was to explore the adequacy of different available resources for concept annotation in a scientific corpus. Brewster et al. (2004) propose data-driven methodologies to evaluate the "fit" between an ontology and a domain corpus. Unlike their study, we prioritized simple coverage over structural fit. The following three criteria were considered in accordance with our purposes.

The coverage of a resource can be interpreted in two different ways. Coverage with respect to the domain vocabulary shows the proportion of the concepts of a domain that are included in the ontology: it can be interpreted as a measure of recall. However, as we do not have a "standard" domain model, we resort to the hypothesis that the ACL Anthology corpus is representative of the domain.

This brings us to the second criterion: coverage with respect to the corpus. Brewster et al. (2004) suggest using lexical keyword extraction and query expansion to extract relevant domain vocabulary. This vocabulary is mapped to the ontology, and the ratio of words found in the ontology gives the measure of coverage for the corpus vocabulary. We propose another interpretation of coverage, as annotation density: the proportion of running words that are annotated by the resource. This aspect is of particular importance in the context of pattern mining: we need to assess whether relevant patterns between concepts will be well represented in the corpus. Knowing that not every pair of entity instances participates in an explicit relation, we need several recognized entities in the same sentence and within a reasonable distance to make sure that meaningful repetitions can be spotted. Annotation density will be measured as the proportion of annotated entities per 100 words (the average length of an abstract).

The third, conflicting criterion is the precision of the annotations produced by a resource. It can be measured as the proportion of relevant annotations over the total number of annotations. This aspect is correlated with the specificity of the resource: a good quality specialized ontology coupled with a corpus of the same domain makes it possible to annotate the important concepts of the domain while restricting ambiguities and irrelevant annotations. It is important to note that besides the quality of the resource and its adequacy with the corpus, precision also depends on the accuracy of the annotation process itself.

While general ontologies such as WordNet (Fellbaum, 1998) may ensure a good annotation density, domain-specific resources are less likely to produce irrelevant annotations. On the other hand, domain ontologies are costly to construct as they require substantial manual labor and domain expertise. Consequently, specialized resources are usually less available and more restricted in size: this is why there is currently significant research aiming to enrich ontologies with concept and relation candidates from corpora (Petasis et al., 2011). Therefore, we experimented with combining domain-specific and generic resources to achieve a satisfying balance between annotation density and precision.

3.2. Ontological resources and coverage issues

First, existing ontological/lexical resources were considered for entity annotation. To our knowledge there is no available hand-crafted ontology specific to the NLP domain. The Saffron Knowledge Extraction Framework² proposes specific terminological resources for different domains, automatically extracted from corpora (Bordea, 2013; Bordea et al., 2013).

Saffron's underlying methodology extracts terms completely automatically in two steps, without relying on comparable corpora. First, high-level terms (domain models) are selected and filtered by their weight, with a threshold set empirically for the given corpus. Domain models contain frequent words; they constitute "the preferred level of naming, that is the taxonomical level at which categories are most cognitively efficient" (Bordea et al., 2013). We used the available domain models for computer science and for computational linguistics (120 and 200 simple words respectively, with 56 overlapping concepts). Second, intermediate-level terms are extracted according to their distributional similarity to high-level terms. As opposed to methods based on a comparison between different corpora, this approach favors terms that are less specific but more frequent. The intermediate-level terms (topic hierarchies) for computational linguistics, containing 500 units (mostly multi-word expressions), were included in our resources. Both high-level and intermediate terms are relatively frequent: Saffron's terms "are specific to a domain but broad enough to be usable for summarisation or classification" (Bordea, 2013).

Despite the quality reported for Saffron's resources (Bordea, 2013), the annotation density they provide is definitely too limited for our purposes, especially with respect to pattern mining. An annotation test run on 1,100,000 words yielded an average of 13 annotated concepts/100 words. To increase density, extensive general ontologies with a presumably good coverage of the NLP domain were considered, such as WordNet, BabelNet (Navigli and Ponzetto, 2012) or DBpedia (Mendes et al., 2012). BabelNet being the most extensive (with merged synsets from WordNet, Wikipedia and Wiktionary), it was selected for the annotation experiments. This resource has the advantage of providing significantly increased density; on the other hand, it also increases ambiguity and introduces less relevant annotations. Keeping every BabelNet-annotated entity as a concept candidate was clearly not an option, as test runs showed that more than 80% of words would qualify as candidates, most of them being irrelevant to the domain. The next section presents the filtering approaches we tried to reduce the number of irrelevant entities.

3.3. Combining Resources with Data-driven Methods

We examined several solutions to filter out relevant annotations from BabelNet while keeping a good annotation density. The first logical step would be to exploit the structure of the resource: in our case, WordNet synsets and Wikipedia categories. By identifying domain-relevant synsets and categories, we could limit the entities annotated from BabelNet to those specific to our domain. However, we found that DBpedia classes³, retrieved from Wikipedia categories, are too generic for our purposes. The YAGO ontology (Suchanek et al., 2007) of English Wikipedia classes was considered unfit for our purposes, as it is specifically designed for named entities and information extraction. A general problem we encountered with taxonomies of Wikipedia classes stems from the large number of classes and the fact that Wikipedia pages are not linked to classes in a systematic way.

Another solution to limit the number of annotations from BabelNet is to filter concept candidates before looking them up in BabelNet. The idea is to combine terminology extraction / multi-word expression (MWE) extraction with ontology lookup. Both vocabulary extraction methods are expected to be useful for filtering: MWEs are presumably more specific than simple words due to the composition, and terms are extracted according to domain specificity.

First, we applied the phrase tool⁴ included in word2vec (Mikolov et al., 2013) to extract multi-word units from the corpus. The abstract part of the corpus was used for training, and was annotated with the identified MWEs. Only the recognised MWEs were looked up in BabelNet. This process yielded a density of 3.7 concepts/100 words. This coverage was estimated insufficient.

In the following experiment, the term extraction tool TermSuite (Daille et al., 2013) was applied to the corpus. The extraction process takes as input a set of documents (one abstract per document) and returns a list of term candidates together with a specificity value. Several additional filtering parameters can be applied. We filtered out any candidate whose part of speech is not a common noun (or a multi-word unit ending with a common noun). Words that were unknown to the CRF tagger were also excluded, as well as nouns with less than 5 occurrences. Finally, the specificity threshold was set empirically. After setting these parameters, the resulting list was manually validated.

The concept candidates coming from terminology extraction were compared with BabelNet concept candidates, and produced an overlap of 3,805 candidates. The corresponding synsets from BabelNet resources (Wikipedia, WordNet or Wiktionary) were thus selected for entity annotation. This combination resulted in an annotation density of 23 concepts/100 words, which is advantageous for our purposes.

Resource                  density /100 w   precision
Saffron - all             13               0.98
BabelNet filtered - all   23               0.97

Table 1: Quantity and precision of entities annotated

² http://saffron.insight-centre.org/
³ http://wiki.dbpedia.org/services-resources/ontology
⁴ https://code.google.com/p/word2vec/

3.4. Evaluation and Error Analysis

We manually evaluated annotation precision on a sample containing 100 sentences, with 358 annotated entities and 932 annotations (an entity is thus linked to 2.6 resources on average). The sample shows the contexts and the annotations together with their sources. Concept candidates were classified as correct or incorrect, with the following error subcategories:
• "non relevant", e.g.: written in the early XX century by [...]

Table 2 shows the annotation precision of specific resources (with minor types). Every annotation produced by each specific resource was considered, hence the higher density. POS tagging was still discarded. These data confirm that filtering helped to maintain consistency in the quality of the different annotation resources.
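The measures used in this section reduce to simple ratios. As a sketch (the function names are ours, not from the paper): annotation density is the number of annotated entities per 100 running words, precision is the share of relevant annotations, and the average number of resource links per entity is annotations divided by distinct entities.

```python
# Sketch of the evaluation measures described above (names are ours):
# annotation density = annotated entities per 100 running words,
# precision = relevant annotations / all annotations,
# links per entity = annotations / distinct annotated entities.

def density_per_100_words(n_entities: int, n_words: int) -> float:
    return 100.0 * n_entities / n_words

def precision(n_relevant: int, n_total: int) -> float:
    return n_relevant / n_total

def links_per_entity(n_annotations: int, n_entities: int) -> float:
    return n_annotations / n_entities

# The manually evaluated sample: 358 annotated entities, 932 annotations.
print(round(links_per_entity(932, 358), 1))  # -> 2.6
```

The 932/358 ratio reproduces the "2.6 resources per entity" figure reported for the evaluation sample.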
Generic Relations: antonyms [...]
Domain-specific relations: affects (ARG1: specific property of data; ARG2: results) [...]

Table 4: Semantic Relation Typology

[...] entity mention pairs. First, semantic relation types were identified in a sample of 100 abstracts extracted from the corpus.

Relation           Frequency in corpus
usedfor            27%
composed_of        16%
propose            11%
yields             6%
study              6%
taskapplied        5%
uses_information   4%
affects            4%

Table 5: Most frequent semantic relations

On the textual level, a semantic relation will be conceived as a text span linking two annotated instances of concepts within the same sentence. Only explicit relations were taken into account, i.e. examples where it is realistic to expect a sequential data mining algorithm to recognize and annotate the sequence as an instance of a relation. If the two entities are in a semantic relation with each other but this relation is not expressed in the text, the instance was not annotated.

The annotation covers the text span between the two entities, and specifies the type of the relation as well as the two arguments. Sequences can contain gaps: not every word in the context is expected to be relevant for the relation. In the example below, only the highlighted text span
is actually informative for identifying the semantic relation: [...]

However, the two relation instances are overlapping: [...]

Figure 1: Manual annotation with the GATE interface
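The span-between-entities idea above can be sketched as follows. The tokenization and entity boundaries are idealized toy inputs, and a real sequence miner would additionally tolerate gaps inside the extracted span.

```python
# Toy sketch: collect the token sequence between two annotated concept
# mentions as a candidate relation pattern (real pattern mining would
# also allow gaps inside this span).

def inter_entity_span(tokens, ent1_end, ent2_start):
    """Return the tokens strictly between two entity mentions,
    given end-exclusive token indices."""
    return tokens[ent1_end:ent2_start]

tokens = "a parser is used for syntactic analysis".split()
# entity 1 = tokens[0:2] ("a parser"); entity 2 = tokens[5:7]
print(inter_entity_span(tokens, 2, 5))  # -> ['is', 'used', 'for']
```

Collecting such spans over many entity pairs gives the raw sequences on which the frequent-pattern statistics of Table 5 could be computed.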
[...] annotation confirmed that ontology specificity correlates with annotation precision but comes with limited coverage. The experiments confirmed that data-driven term extraction methods can be efficiently combined with large-coverage external resources, and make it possible to expand coverage while maintaining the same level of precision that is achieved using domain-specific resources. As a result, the entire corpus was automatically annotated for domain concepts.

We also presented a typology of semantic relations in the science/engineering domain and are carrying out a manual annotation experiment. The first findings on applying the typology to the corpus are encouraging in that valuable information can be identified, and the limitations revealed by the error analysis can be further addressed.

This corpus is currently being exploited for experiments in unsupervised relation extraction. Sequence mining methods are used to identify relevant text patterns between concepts. The distribution of concept pairs over sequential patterns allows the instances to be clustered according to their semantic relation. We also expect to be able to discover new, domain-specific relation types via unsupervised clustering.

The next scheduled step is the manual annotation of semantic relations in 500 abstracts. This work should provide information on the plausibility and usability of our relation typology by measuring inter-annotator agreement. The resulting annotated corpus will be shared with the research community, allowing relation extraction algorithms to be compared.

6. Acknowledgments

This work is part of the program "Investissements d'Avenir" overseen by the French National Research Agency, ANR-10-LABX-0083 (Labex EFL). The authors are grateful to Elisa Omodei for sharing the preprocessed corpus of abstracts from the ACL Anthology, and to Georgeta Bordea for putting the ACL domain models at our disposal.

7. Bibliographical References

Béchet, N., Cellier, P., Charnois, T., and Crémilleux, B. (2012). Discovering linguistic patterns using sequence mining. In Computational Linguistics and Intelligent Text Processing - 13th International Conference, CICLing, pages 154–165.

Bordea, G. (2013). Domain Adaptive Extraction of Topical Hierarchies for Expertise Mining. PhD thesis, National University of Ireland, Galway.

Bordea, G., Buitelaar, P., and Polajnar, T. (2013). Domain-independent term extraction through domain modelling. In 10th International Conference on Terminology and Artificial Intelligence (TIA 2013).

Brewster, C., Alani, H., and Dasmahapatra, A. (2004). Data driven ontology evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Chavalarias, D. and Cointet, J.-P. (2013). Phylomemetic patterns in science evolution - the rise and fall of scientific fields. PLOS ONE, 8(2).

Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the ACL 2002 Conference.

Daille, B., Jacquin, C., Monceaux, L., Morin, E., and Rocheteau, J. (2013). TTC TermSuite: Une chaîne de traitement pour la fouille terminologique multilingue. In Proceedings of the Traitement Automatique des Langues Naturelles Conference (TALN).

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge (MA).

Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the ACL Conference.

Herrera, M., Roberts, D. C., and Gulbahce, N. (2010). Mapping the evolution of scientific fields. PLOS ONE, 5.

Larsen, P. O. and von Ins, M. (2010). The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index.

Mendes, P., Jakob, M., and Bizer, C. (2012). DBpedia for NLP: A multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Navigli, R. and Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Omodei, E., Cointet, J.-P., and Poibeau, T. (2014). Mapping the natural language processing domain: Experiments using the ACL Anthology. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Osborne, F. and Motta, E. (2012). Mining semantic relations between research areas. In International Semantic Web Conference, Boston (MA).

Osborne, F. and Motta, E. (2014). Rexplore: unveiling the dynamics of scholarly data. Digital Libraries, 8(12).

Petasis, G., Karkaletsis, V., Paliouras, G., Krithara, A., and Zavitsanos, E. (2011). Ontology population and enrichment: State of the art. In Knowledge-driven multimedia information extraction and ontology evolution, pages 134–166. Springer-Verlag.

Presutti, V., Consoli, S., Nuzzolese, A. G., Recupero, D. R., Gangemi, A., Bannour, I., and Zargayouna, H. (2014). Uncovering the semantics of Wikipedia pagelinks. In Knowledge Engineering and Knowledge Management, pages 413–428. Springer.

Radev, D., Muthukrishnan, P., and Qazvinian, V. (2009). The ACL Anthology Network corpus. In Proceedings of the 2009 ACL Workshop on Text and Citation Analysis for Scholarly Digital Libraries.

Sagot, B. and Boullier, P. (2008). SxPipe 2: architecture pour le traitement pré-syntaxique de corpus bruts. Traitement Automatique des Langues, 49(2).

Sateli, B. and Witte, R. (2015). What's in this paper? Combining rhetorical entities with linked open data for semantic literature querying. In Proceedings of the 24th International Conference on World Wide Web.

Shotton, D. (2010). CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics, 1(S-1).

Skupin, A. (2004). The world of geography: Visualizing a knowledge domain with cartographic means. Proceedings of the National Academy of Sciences, 101.

Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO - a core of semantic knowledge. In Proceedings of the 16th International World Wide Web Conference.

Wang, W., Besançon, R., Ferret, O., and Grau, B. (2013). Extraction et regroupement de relations entre entités pour l'extraction d'information non supervisée. Traitement Automatique des Langues, 54.