Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
Kata Gábor, Haïfa Zargayouna, Davide Buscaldi, Isabelle Tellier, Thierry Charnois

To cite this version: Kata Gábor, Haïfa Zargayouna, Davide Buscaldi, Isabelle Tellier, Thierry Charnois. Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature. LREC 2016, May 2016, Portorož, Slovenia. Proceedings of the LREC 2016 Conference.

HAL Id: hal-01360407
https://hal.archives-ouvertes.fr/hal-01360407
Submitted on 6 Sep 2016
Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
Kata Gábor∗, Haïfa Zargayouna∗, Davide Buscaldi∗, Isabelle Tellier†, Thierry Charnois∗
∗ LIPN, CNRS (UMR 7030), Université Paris 13, Sorbonne Paris Cité
† LaTTiCe (UMR 8094), CNRS, ENS Paris, Université Sorbonne Nouvelle - Paris 3, PSL Research University, Université Sorbonne Paris Cité
(gabor,haifa.zargayouna,davide.buscaldi,thierry.charnois)@lipn.univ-paris13.fr, [email protected]
Abstract
This paper describes the process of creating a corpus annotated for concepts and semantic relations in the scientific domain. A part of the ACL Anthology Corpus was selected for annotation, but the annotation process itself is not specific to the computational linguistics domain and could be applied to any scientific corpus. Concepts were identified and annotated fully automatically, based on a combination of terminology extraction and available ontological resources. A typology of semantic relations between concepts is also proposed. This typology, consisting of 18 domain-specific and 3 generic relations, is the result of a corpus-based investigation of the text sequences occurring between concepts in sentences. A sample of 500 abstracts from the corpus is currently being manually annotated with these semantic relations. Only explicit relations are taken into account, so that the data could serve to train or evaluate pattern-based semantic relation classification systems.
Keywords: semantic annotation, semantic relations, ACL Anthology
1. Introduction

One of the emerging trends of natural language technologies is their use for the humanities and sciences. There is a constant increase in the production of scientific papers, and experts are faced with an explosion of information that makes it difficult to have an overview of the state of the art in a given domain (Larsen and von Ins, 2010). This phenomenon gave rise to efforts from the semantic web, scientometrics and natural language processing communities to improve access to scientific literature. Most of these works concentrate on unfolding links between papers in order to build cartographies and analyze scientific networks based on bibliographic references and topic taxonomies (Osborne and Motta, 2012; Osborne and Motta, 2014). Shotton (2010) designed an ontology of different types of citations, while Presutti et al. (2014) aim to improve access to semantic content and inter-document navigation by identifying links between documents. Other studies focus on the evolution of a scientific domain over time (Chavalarias and Cointet, 2013; Omodei et al., 2014).

As opposed to these lines of research, our analysis focuses on the semantic content of individual scientific papers instead of inter-document links and author networks. The goal of our work is to automatically build a state of the art of a scientific domain. Focusing on the abstract and the introduction, we represent the semantic content of a paper as a set of domain-specific concepts and typed relations between them. By identifying instances of concepts and relations, we can extract the contribution of a research paper. Applying this method to a corpus [...] such as ontologies. Ontologies allow a fine-tuned semantic analysis, as opposed to approaches that primarily rely on unstructured resources or terminology extraction on the fly (Omodei et al., 2014; Chavalarias and Cointet, 2013; Skupin, 2004). Sateli and Witte (2015) and Presutti et al. (2014) rely on DBPedia to identify key concepts. Of particular interest for us is that systems using an ontology can also benefit from different kinds of typed relations. While statistical systems can do well on identifying key concepts, a typology of relations is much more difficult to extract from domain corpora. In the context of our research, we are interested in exploiting the relations from external ontologies, as well as in populating ontologies by discovering new types of semantic relations between concepts using pattern mining techniques (Béchet et al., 2012). The semantic analysis of scientific corpora makes it possible to add new relation types and instances to existing ontologies (Petasis et al., 2011) or thesauri (Wang et al., 2013).

As a first step towards the automated analysis of scientific corpora, the production of an annotated gold standard corpus was undertaken. We decided to concentrate on the computational linguistics domain and to use part of the ACL Anthology Corpus (Radev et al., 2009) for our purposes. This article describes the (ongoing) work on annotating this corpus to create a resource for training and evaluating relation extraction systems. Our semantic annotations are conceived in terms of entities representing domain-relevant concepts and the possible semantic relations linking them.
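To make the target representation concrete, here is a minimal Python sketch of how a concept mention and a typed relation instance could be stored. The class names, offsets and the `used_for` label are illustrative assumptions for this sketch, not the authors' actual annotation schema.

```python
from dataclasses import dataclass

# Hypothetical, minimal data model for the kind of annotation described
# above: domain concepts anchored in text, plus typed relations between them.

@dataclass(frozen=True)
class ConceptMention:
    text: str    # surface form, e.g. "language model"
    start: int   # character offset in the sentence (inclusive)
    end: int     # character offset (exclusive)

@dataclass(frozen=True)
class RelationInstance:
    rel_type: str         # a label from the relation typology, e.g. "used_for"
    arg1: ConceptMention
    arg2: ConceptMention

# Toy example sentence with two annotated concepts and one relation.
sent = "We train a language model for speech recognition."
lm = ConceptMention("language model", 11, 25)
sr = ConceptMention("speech recognition", 30, 48)
rel = RelationInstance("used_for", lm, sr)
print(rel.rel_type, "->", rel.arg1.text, "/", rel.arg2.text)
```

Freezing the dataclasses makes mentions hashable, so the same mention object can appear as an argument of several relation instances.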
Our annotation process is composed of two consecutive steps:
• The identification of domain-specific concepts in the papers. These concepts will be linked to external ontological resources. Concepts were automatically annotated in the totality of the corpus.
• The annotation of the semantic relations that hold between these concepts (if any); this is done manually on the gold standard corpus. These data will serve to evaluate consecutive experiments on relation extraction and classification. A sample of 500 abstracts is currently being manually annotated.

Section 2. describes how the texts of the corpus were selected and pre-processed. Section 3. presents an analysis of the annotation of concepts and the adequacy of the available resources. Section 4. explains the two-step process followed to annotate the relations: creating a typology (4.1.) and applying it (4.2.). Finally, section 5. draws the conclusions and presents directions for future work.

2. Composition of the Corpus and Pre-processing

We decided to focus on the abstract and introduction parts of scientific papers since they express essential information in a compact and often repetitive manner, which makes them an optimal source for mining sequential patterns in further experiments.

The core of the corpus is the database pre-processed by E. Omodei (2014), which contains the abstracts and different kinds of meta-information for 13,322 papers from the ACL Anthology. We relied on the paper IDs in the database to identify and extract the introductions for the same articles from the 2009 version of the ACL Anthology Corpus (Radev et al., 2009). As a first cleaning step, the CRF tagger¹ was run on the ACL Anthology Corpus to filter out files that cannot be processed: those with a significant number of OCR errors or tokenisation difficulties. We then extracted the introductions from the remaining files using the following heuristics:
• the introduction starts with a line that contains "Introduction/INTRODUCTION" or starts with "1" or "1.1" followed by a capitalized word;
• the last line of the introduction is the one followed by a line starting with "2" or "1.2" followed by a capitalized word;
• if the end of the introduction cannot be identified, the first 6 lines are kept.

The introductions were paired with the corresponding abstracts: the resulting corpus contains 4,200,000 words from 11,000 papers.

People's names and bibliographic references were automatically annotated. For bibliographic references, we adapted the specific grammar available in SxPipe (Sagot and Boullier, 2008). People's names were recognized using the Stanford Named Entity Recognizer (Finkel et al., 2005) and its pre-trained model for three classes (among which only PERSON was used).

¹ http://crftagger.sourceforge.net

3. A First Analysis on the Annotation of Concepts

Linking entities to external ontological resources is an important aspect in the context of our work. The corpus we prepare will serve to experiment with different methods for populating ontologies by unsupervised semantic relation extraction. These methods may rely on different kinds of information: sequential patterns, distributional vectors, but also the semantic properties of the entities, as extracted from the ontology. Therefore, entities representing concepts were annotated based on available resources.

3.1. Goals in terms of precision/annotation density

Ontologies differ with respect to their domain and to the concepts and types of relations they encode. Two aspects come into play when evaluating an ontology as a knowledge resource: the precision and recall of the resource itself (i.e. the quality and quantity of the information it contains), and the compatibility between the resource and the corpus to which it is applied. Our focus was to explore the adequacy of different available resources for concept annotation in a scientific corpus. Brewster et al. (2004) propose data-driven methodologies to evaluate the "fit" between an ontology and a domain corpus. Unlike their study, we prioritized simple coverage over structural fit. The following three criteria were considered in accordance with our purposes.

The coverage of a resource can be interpreted in two different ways. Coverage with respect to the domain vocabulary shows the proportion of the concepts of a domain that are included in the ontology: it can be interpreted as a measure of recall. However, as we do not have a "standard" domain model, we resort to the hypothesis that the ACL Anthology corpus is representative of the domain.

This brings us to the second criterion: coverage with respect to the corpus. Brewster et al. (2004) suggest using lexical keyword extraction and query expansion to extract relevant domain vocabulary. This vocabulary is mapped to the ontology, and the ratio of words found in the ontology gives the measure of coverage for the corpus vocabulary. We propose another interpretation of coverage, as annotation density: the proportion of running words that are annotated by the resource. This aspect is of particular importance in the context of pattern mining: we need to assess whether relevant patterns between concepts will be well represented in the corpus. Knowing that not every pair of entity instances participates in an explicit relation, we need several recognized entities in the same sentence and within a reasonable distance to make sure that meaningful repetitions can be spotted. Annotation density will be measured as the proportion of annotated entities per 100 words (the average length of an abstract).

The third, conflicting criterion is the precision of the annotations produced by a resource. It can be measured as the proportion of relevant annotations over the total number of annotations. This aspect is correlated with the specificity of the resource: a good quality specialized ontology coupled with a corpus of the same domain makes it possible to annotate the important concepts of the domain while restricting ambiguities and irrelevant annotations. It is important to note that besides the quality of the resource and its adequacy with the corpus, precision also depends on the accuracy of the annotation process itself.

While general ontologies such as WordNet (Fellbaum, 1998) may ensure a good annotation density, domain-specific resources are less likely to produce irrelevant annotations. On the other hand, domain ontologies are costly to construct as they require substantial manual labor and domain expertise. Consequently, specialized resources are usually less available and more restricted in size: this is why there is currently significant research aiming to enrich ontologies with concept and relation candidates from corpora (Petasis et al., 2011). Therefore, we experimented with combining domain-specific and generic resources to achieve a satisfying balance between annotation density and precision.

3.2. Ontological resources and coverage issues

First, existing ontological/lexical resources were considered for entity annotation. To our knowledge there is no available hand-crafted ontology specific to the NLP domain. The Saffron Knowledge Extraction Framework² proposes specific terminological resources for different domains, automatically extracted from corpora (Bordea, 2013; Bordea et al., 2013).

Saffron's underlying methodology extracts terms completely automatically in two steps, without relying on comparable corpora. First, high-level terms (domain models) are selected and filtered by their weight, with a threshold set empirically for the given corpus. Domain models contain frequent words; they constitute "the preferred level of naming, that is the taxonomical level at which categories are most cognitively efficient" (Bordea et al., 2013). We used the available domain models for computer science and for computational linguistics (120 and 200 simple words respectively, with 56 overlapping concepts). Second, intermediate-level terms are extracted according to their distributional similarity to high-level terms. As opposed to methods based on a comparison between different corpora, this approach favors terms that are less specific but more frequent. The intermediate-level terms (topic hierarchies) for computational linguistics, containing 500 units (mostly multi-word expressions), were included in our resources. Both high-level and intermediate terms are relatively frequent: Saffron's terms "are specific to a domain but broad enough to be usable for summarisation or classification" (Bordea, 2013).

Despite the quality reported for Saffron's resources (Bordea, 2013), the annotation density they provide is definitely too limited for our purposes, especially with respect to pattern mining. An annotation test run on 1,100,000 words yielded an average of 13 annotated concepts/100 words. To increase density, extensive general ontologies with a presumably good coverage of the NLP domain were considered, such as WordNet, BabelNet (Navigli and Ponzetto, 2012) or DBpedia (Mendes et al., 2012). BabelNet being the most extensive (with merged synsets from WordNet, Wikipedia and Wiktionary), it was selected for the annotation experiments. This resource has the advantage of providing significantly increased density; on the other hand, it also increases ambiguity and introduces less relevant annotations. Keeping every BabelNet-annotated entity as a concept candidate was clearly not an option, as test runs showed that more than 80% of words would qualify as candidates, most of them being irrelevant to the domain. The next section presents the filtering approaches we tried to reduce the number of irrelevant entities.

3.3. Combining Resources with Data-driven Methods

We examined several solutions to filter out relevant annotations from BabelNet while keeping a good annotation density. The first logical step would be to exploit the structure of the resource: in our case, WordNet synsets and Wikipedia categories. By identifying domain-relevant synsets and categories, we could limit the entities annotated from BabelNet to those specific to our domain. However, we found that DBpedia classes³, retrieved from Wikipedia categories, are too generic for our purposes. The YAGO ontology (Suchanek et al., 2007) of English Wikipedia classes was considered unfit for our purposes, as it is specifically designed for named entities and information extraction. A general problem we encountered with taxonomies of Wikipedia classes stems from the large number of classes and the fact that Wikipedia pages are not linked to classes in a systematic way.

Another solution to limit the number of annotations from BabelNet is to filter concept candidates before looking them up in BabelNet. The idea is to combine terminology extraction / multi-word expression (MWE) extraction with ontology lookup. Both vocabulary extraction methods are expected to be useful for filtering: MWEs are presumably more specific than simple words due to the composition, and terms are extracted according to domain specificity.

First, we applied the phrase tool⁴ included in word2vec (Mikolov et al., 2013) to extract multi-word units from the corpus. The abstract part of the corpus was used for training, and was annotated with the identified MWEs. Only the recognised MWEs were looked up in BabelNet. This process yielded a density of 3.7 concepts/100 words. This coverage was estimated insufficient.

In the following experiment, the term extraction tool TermSuite (Daille et al., 2013) was applied to the corpus. The extraction process takes as input a set of documents (one abstract per document) and returns a list of term candidates together with a specificity value. Several additional filtering parameters can be applied. We filtered out any candidate whose part of speech is not a common noun (or a multi-word unit ending with a common noun). Words that were unknown to the CRF tagger were also excluded, as well as nouns with less than 5 occurrences. Finally, the specificity threshold was set empirically. After setting these parameters, the resulting list was manually validated.

The concept candidates coming from terminology extraction were compared with BabelNet concept candidates, and produced an overlap of 3,805 candidates. The corresponding synsets from BabelNet resources (Wikipedia, WordNet or Wiktionary) were thus selected for entity annotation. This combination resulted in an annotation density of 23 concepts/100 words, which is advantageous for our purposes.

Resource                  density /100 w   precision
Saffron - all             13               0.98
BabelNet filtered - all   23               0.97

Table 1: Quantity and precision of entities annotated

² http://saffron.insight-centre.org/
³ http://wiki.dbpedia.org/services-resources/ontology
⁴ https://code.google.com/p/word2vec/

3.4. Evaluation and Error Analysis

We manually evaluated annotation precision on a sample containing 100 sentences, with 358 annotated entities and 932 annotations (an entity is thus linked to 2.6 resources on average). The sample shows the contexts and the annotations together with their sources. Concept candidates were classified as correct or incorrect, with the following error subcategories:
• "non relevant", e.g.: written in the early XX century by [...]

Table 2 shows the annotation precision of specific resources (with minor types). Every annotation produced by each specific resource was considered, hence the higher density. POS tagging was still discarded. These data confirm that filtering helped to maintain consistency in the quality of the different annotation resources.
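The measures used in this section reduce to simple ratios. As a sketch (the function names are ours, not from the paper): annotation density is the number of annotated entities per 100 running words, precision is the share of relevant annotations, and the average number of resource links per entity is annotations divided by distinct entities.

```python
# Sketch of the evaluation measures described above (names are ours):
# annotation density = annotated entities per 100 running words,
# precision = relevant annotations / all annotations,
# links per entity = annotations / distinct annotated entities.

def density_per_100_words(n_entities: int, n_words: int) -> float:
    return 100.0 * n_entities / n_words

def precision(n_relevant: int, n_total: int) -> float:
    return n_relevant / n_total

def links_per_entity(n_annotations: int, n_entities: int) -> float:
    return n_annotations / n_entities

# The manually evaluated sample: 358 annotated entities, 932 annotations.
print(round(links_per_entity(932, 358), 1))  # -> 2.6
```

The 932/358 ratio reproduces the "2.6 resources per entity" figure reported for the evaluation sample.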
Generic Relations: antonyms [...]
Domain-specific relations: affects (ARG1: specific property of data; ARG2: results) [...]

Table 4: Semantic Relation Typology

[...] entity mention pairs. First, semantic relation types were identified in a sample of 100 abstracts extracted from the corpus.

Relation           Frequency in corpus
usedfor            27%
composed_of        16%
propose            11%
yields             6%
study              6%
taskapplied        5%
uses_information   4%
affects            4%

Table 5: Most frequent semantic relations

On the textual level, a semantic relation will be conceived as a text span linking two annotated instances of concepts within the same sentence. Only explicit relations were taken into account, i.e. examples where it is realistic to expect a sequential data mining algorithm to recognize and annotate the sequence as an instance of a relation. If the two entities are in a semantic relation with each other but this relation is not expressed in the text, the instance was not annotated.

The annotation covers the text span between the two entities, and specifies the type of the relation as well as the two arguments. Sequences can contain gaps: not every word in the context is expected to be relevant for the relation. In the example below, only the highlighted text span
is actually informative for identifying the semantic relation: [...]

However, the two relation instances are overlapping: [...]

Figure 1: Manual annotation with the GATE interface
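The span-between-entities idea above can be sketched as follows. The tokenization and entity boundaries are idealized toy inputs, and a real sequence miner would additionally tolerate gaps inside the extracted span.

```python
# Toy sketch: collect the token sequence between two annotated concept
# mentions as a candidate relation pattern (real pattern mining would
# also allow gaps inside this span).

def inter_entity_span(tokens, ent1_end, ent2_start):
    """Return the tokens strictly between two entity mentions,
    given end-exclusive token indices."""
    return tokens[ent1_end:ent2_start]

tokens = "a parser is used for syntactic analysis".split()
# entity 1 = tokens[0:2] ("a parser"); entity 2 = tokens[5:7]
print(inter_entity_span(tokens, 2, 5))  # -> ['is', 'used', 'for']
```

Collecting such spans over many entity pairs gives the raw sequences on which the frequent-pattern statistics of Table 5 could be computed.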
[...] annotation confirmed that ontology specificity correlates with annotation precision but comes with limited coverage. The experiments confirmed that data-driven term extraction methods can be efficiently combined with large-coverage external resources, and make it possible to expand coverage while maintaining the same level of precision that is achieved using domain-specific resources. As a result, the entire corpus was automatically annotated for domain concepts.

We also presented a typology of semantic relations in the science/engineering domain and are carrying out a manual annotation experiment. The first findings on applying the typology to the corpus are encouraging in that valuable information can be identified, and the limitations revealed by the error analysis can be further addressed.

This corpus is currently being exploited for experiments in unsupervised relation extraction. Sequence mining methods are used to identify relevant text patterns between concepts. The distribution of concept pairs over sequential patterns allows the instances to be clustered according to their semantic relation. We also expect to be able to discover new, domain-specific relation types via unsupervised clustering.

The next scheduled step is the manual annotation of semantic relations in 500 abstracts. This work should provide information on the plausibility and usability of our relation typology by measuring inter-annotator agreement. The resulting annotated corpus will be shared with the research community, allowing relation extraction algorithms to be compared.

6. Acknowledgments

This work is part of the program "Investissements d'Avenir" overseen by the French National Research Agency, ANR-10-LABX-0083 (Labex EFL). The authors are grateful to Elisa Omodei for sharing the preprocessed corpus of abstracts from the ACL Anthology, and to Georgeta Bordea for putting the ACL domain models at our disposal.

7. Bibliographical References

Béchet, N., Cellier, P., Charnois, T., and Crémilleux, B. (2012). Discovering linguistic patterns using sequence mining. In Computational Linguistics and Intelligent Text Processing - 13th International Conference, CICLing, pages 154–165.

Bordea, G. (2013). Domain Adaptive Extraction of Topical Hierarchies for Expertise Mining. PhD thesis, National University of Ireland, Galway.

Bordea, G., Buitelaar, P., and Polajnar, T. (2013). Domain-independent term extraction through domain modelling. In 10th International Conference on Terminology and Artificial Intelligence (TIA 2013).

Brewster, C., Alani, H., and Dasmahapatra, A. (2004). Data driven ontology evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Chavalarias, D. and Cointet, J.-P. (2013). Phylomemetic patterns in science evolution - the rise and fall of scientific fields. PLOS ONE, 8(2).

Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the ACL 2002 Conference.

Daille, B., Jacquin, C., Monceaux, L., Morin, E., and Rocheteau, J. (2013). TTC TermSuite: Une chaîne de traitement pour la fouille terminologique multilingue. In Proceedings of the Traitement Automatique des Langues Naturelles Conference (TALN).

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge (MA).

Finkel, J. R., Grenager, T., and Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the ACL Conference.

Herrera, M., Roberts, D. C., and Gulbahce, N. (2010). Mapping the evolution of scientific fields. PLOS ONE, 5.

Larsen, P. O. and von Ins, M. (2010). The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index.

Mendes, P., Jakob, M., and Bizer, C. (2012). DBpedia for NLP: A multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Navigli, R. and Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Omodei, E., Cointet, J.-P., and Poibeau, T. (2014). Mapping the natural language processing domain: Experiments using the ACL Anthology. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Osborne, F. and Motta, E. (2012). Mining semantic relations between research areas. In International Semantic Web Conference, Boston (MA).

Osborne, F. and Motta, E. (2014). Rexplore: unveiling the dynamics of scholarly data. Digital Libraries, 8(12).

Petasis, G., Karkaletsis, V., Paliouras, G., Krithara, A., and Zavitsanos, E. (2011). Ontology population and enrichment: State of the art. In Knowledge-driven multimedia information extraction and ontology evolution, pages 134–166. Springer-Verlag.

Presutti, V., Consoli, S., Nuzzolese, A. G., Recupero, D. R., Gangemi, A., Bannour, I., and Zargayouna, H. (2014). Uncovering the semantics of Wikipedia pagelinks. In Knowledge Engineering and Knowledge Management, pages 413–428. Springer.

Radev, D., Muthukrishnan, P., and Qazvinian, V. (2009). The ACL Anthology Network corpus. In Proceedings of the 2009 ACL Workshop on Text and Citation Analysis for Scholarly Digital Libraries.

Sagot, B. and Boullier, P. (2008). SxPipe 2: architecture pour le traitement pré-syntaxique de corpus bruts. Traitement Automatique des Langues, 49(2).

Sateli, B. and Witte, R. (2015). What's in this paper? Combining rhetorical entities with linked open data for semantic literature querying. In Proceedings of the 24th International Conference on World Wide Web.

Shotton, D. (2010). CiTO, the Citation Typing Ontology. Journal of Biomedical Semantics, 1(S-1).

Skupin, A. (2004). The world of geography: Visualizing a knowledge domain with cartographic means. Proceedings of the National Academy of Sciences, 101.

Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO - a core of semantic knowledge. In Proceedings of the 16th International World Wide Web Conference.

Wang, W., Besançon, R., Ferret, O., and Grau, B. (2013). Extraction et regroupement de relations entre entités pour l'extraction d'information non supervisée. Traitement Automatique des Langues, 54.