Linking Language Resources and NLP papers

Gil Francopoulo, LIMSI, CNRS, Université Paris-Saclay + Tagmatica (France)
Joseph Mariani, LIMSI, CNRS, Université Paris-Saclay (France)
Patrick Paroubek, LIMSI, CNRS, Université Paris-Saclay (France)

Abstract
The Language Resources and Evaluation Map (LRE Map) is an accessible database on Language Resources based on records collected during the submission of several major Speech and Natural Language Processing (NLP) conferences, including the Language Resources and Evaluation Conferences (LREC). NLP4NLP is a very large corpus of scientific papers in the field of Speech and Natural Language Processing covering a large number of conferences and journals in that field. In this article, we establish the link between those two elements in order to study the mention of the LRE Map resource names within the NLP4NLP corpus.

Keywords: Resource Citation, Named Entity Detection, Informetrics, Scientometrics, Text Mining, LRE Map.

1. Introduction
Our work is based on the hypothesis that names, in this case language resource names, correlate with the study, use and improvement of the referred objects, in this case language resources. We believe that automatic (and objective) detection is a step towards improving the reliability of language resources, as mentioned in [Branco 2013].

We already have an idea of how resources are used in the recent venues of conferences such as Coling and LREC, as the LRE Map is built from the resources declared by the authors at these conferences [Calzolari et al 2012]. But what about the other conferences and the other years? This is the subject of the present study.

2. Situation with respect to other studies
The approach is to apply NLP tools to texts about NLP itself, taking advantage of the fact that we have a good knowledge of the domain ourselves. Our work follows the various studies presented and initiated in the workshop "Rediscovering 50 Years of Discoveries in Natural Language Processing", held on the occasion of ACL's 50th anniversary in 2012 [Radev et al 2013], where a group of researchers studied the content of the corpus recorded in the ACL Anthology [Bird et al 2008]. Various studies based on the same corpus followed, for instance [Bordea et al 2014] on trend analysis, and resulted in systems such as Saffron (http://saffron.deri.ie) or the Michigan Univ. web site (http://clair.eecs.umich.edu/aan/index.php). Other studies were conducted by ourselves, specifically on speech-related archives [Mariani et al 2013] and on the LREC archives [Mariani et al 2014a], but the target there was to detect the terminology used within the articles; the focus was not to detect resource names. More focused on the current workshop topic is the study conducted by the Linguistic Data Consortium (LDC) team, whose goal was, and still is, to build a language resource (LR) database documenting the use of the LDC resources [Ahtaridis et al 2012]. At the time of publication (i.e. 2012), the LDC team had found 8,000 references, and the problems encountered are documented in [Mariani et al 2014b].

3. Our approach
The general principle is to confront the names of the LRE Map with the newly collected NLP4NLP corpus. The process is as follows (a minimal sketch is given after the list):
- Consider the archives of (most of) the NLP field,
- Take an entity name detector which is able to work with a given list of proper names,
- Use the LRE Map as the given list of proper names,
- Run the application and study the results.
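As a rough illustration of these four steps, the following minimal sketch runs a naive exact-string detector over a directory of pre-extracted plain-text documents. It is an assumption-laden stand-in, not the actual implementation: the real study relies on TagParser (see Sections 5 and 7), and the file names used here are hypothetical.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class LreMapMatchSketch {
        public static void main(String[] args) throws IOException {
            // "Use the LRE Map as the given list of proper names":
            // one headword per line (hypothetical file name).
            List<String> names = Files.readAllLines(Paths.get("lre_map_names.txt"));
            Map<String, Integer> mentions = new TreeMap<>();
            // "Consider the archives": here, pre-extracted .txt files.
            try (DirectoryStream<Path> docs =
                     Files.newDirectoryStream(Paths.get("corpus"), "*.txt")) {
                for (Path doc : docs) {
                    String text = new String(Files.readAllBytes(doc));
                    // Naive stand-in for the entity name detector; TagParser
                    // additionally handles typographic variants and aliases.
                    for (String name : names) {
                        if (text.contains(name)) {
                            mentions.merge(name, 1, Integer::sum);
                        }
                    }
                }
            }
            // "Study the results": print a mention count per resource name.
            mentions.forEach((name, count) -> System.out.println(name + "\t" + count));
        }
    }

A real detector must of course match on token boundaries and link variants to a single entry, which is precisely why the study uses an entity linking component rather than raw string search.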
4. Archives of a large part of the NLP field
The corpus is a large collection of documents from our own research field, i.e. NLP, covering both the written and speech sub-domains and extended to a limited number of corpora for which Information Retrieval and NLP activities intersect. This corpus was collected at IMMI-CNRS and LIMSI-CNRS (France) and is named NLP4NLP (see www.nlp4nlp.org). It currently contains 65,003 documents coming from various conferences and journals with either public or restricted access. This is a large part of the existing published articles in our field, apart from the workshop proceedings and the published books. Despite the fact that they often reflect innovative trends, we did not include workshops, as they may be based on various reviewing processes and as access to their content may sometimes be difficult. The time period spans from 1965 to 2015. Broadly speaking, and aside from the small corpora, one third comes from the ACL Anthology (http://aclweb.org/anthology), one third from the ISCA Archive (www.isca-speech.org/iscaweb/index.php/archive/online-archive) and one third from IEEE (https://www.ieee.org/index.html).

The corpus follows the organization of the ACL Anthology, with two parts in parallel. For each document, on one side the metadata is recorded with the author names and the title; on the other side, the PDF document is recorded on disk in its original form. Each document is labeled with a unique identifier; for instance "lrec2000_1" is reified on the hard disk as two files: "lrec2000_1.bib" and "lrec2000_1.pdf". When recorded as an image, the PDF content is extracted by means of Tesseract OCR. The automatic test leading to the call (or not) of the OCR is implemented by means of some PDFBox API calls. For all the other documents, other PDFBox API calls are applied in order to extract the textual content (a sketch of such a routine is given at the end of this section). See [Francopoulo et al 2015] for more details about the extraction process, as well as the solutions to some tricky problems such as the management of joint conferences.

The majority (90%) of the documents come from conferences, the rest coming from journals. The overall number of words is 270M. Initially, the texts are in four languages: English, French, German and Russian. The number of texts in German and Russian is less than 0.5%; they are detected automatically and are ignored. The texts in French are slightly more numerous (3%), so they are kept with the same status as the English ones. This is not a problem, because our tool is able to process both English and French. The number of different authors is 48,894. The details are presented in Table 1.
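The paper does not reproduce the extraction code itself; the following sketch shows the kind of routine described above, assuming PDFBox 2.x and the tess4j wrapper for Tesseract. The text-length threshold is our assumption, standing in for the unspecified PDFBox-based test that decides whether OCR is needed.

    import java.awt.image.BufferedImage;
    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.PDFRenderer;
    import org.apache.pdfbox.text.PDFTextStripper;
    import net.sourceforge.tess4j.Tesseract;

    public class PdfTextExtractor {
        // Assumed heuristic: a PDF whose embedded text layer is nearly empty
        // is treated as a scanned image and sent to OCR.
        private static final int MIN_CHARS = 100;

        public static String extract(File pdf) throws Exception {
            try (PDDocument doc = PDDocument.load(pdf)) {
                String text = new PDFTextStripper().getText(doc);
                if (text.trim().length() >= MIN_CHARS) {
                    return text; // usable embedded text layer
                }
                // Fallback: render each page and OCR it with Tesseract.
                Tesseract ocr = new Tesseract();
                ocr.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // adjust locally
                PDFRenderer renderer = new PDFRenderer(doc);
                StringBuilder out = new StringBuilder();
                for (int page = 0; page < doc.getNumberOfPages(); page++) {
                    BufferedImage image = renderer.renderImageWithDPI(page, 300);
                    out.append(ocr.doOCR(image)).append('\n');
                }
                return out.toString();
            }
        }
    }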
5. Named Entity Detection
The aim is to detect a given list of names of resources, provided that the detection is robust enough to recognize and link as the same entry typographic variants such as "British National Corpus" vs "British National corpus", as well as more elaborate aliases like "BNC". In other terms, the aim is not only to recognize given raw character strings but also to link names together, a process often labeled "entity linking" in the literature [Guo et al 2011][Moro et al 2014]. We use the industrial Java-based parser TagParser [Francopoulo 2007] which, after a deep robust parsing for English and French, performs named entity detection and then entity linking. The system is hybrid, combining a statistical chunker, a large language-specific lexicon and a multilingual knowledge base with a hand-written set of rules for the final selection of the named entities and their entity linking.

6. The LRE Map
The LRE Map is a freely accessible large database on resources dedicated to Natural Language Processing (NLP). The original feature of the LRE Map is that the records are collected during the submission of different major NLP conferences. These records were collected directly from the authors. The number of entries was originally 4,396. Each entry has been defined with a headword like "British National Corpus", and some entries are associated with alternate names like "BNC". We further cleaned the data by regrouping the duplicate entries, by omitting the version number which was associated with the resource name in some entries, and by ignoring the entries which were not labeled with a proper name but through a textual definition, as well as those which had no name. Once cleaned, the number of entries is now 1,301, all of them with a different proper name. All the LRE Map entries are classified according to a very detailed set of resource types. We reduced the number of types to 5 broad categories: NLPCorpus, NLPGrammar, NLPLexicon, NLPSpecification and NLPTool, with the convention that when a resource is both a specification and a tool, the "specification" type is retained. An example is ROUGE, which is both a set of metrics and a software package implementing those metrics, and for which we chose the "specification" type.

7. Connection of the LRE Map with TagParser
TagParser is natively associated with a large multilingual knowledge base made from Wikidata and Wikipedia, whose name is Global Atlas [Francopoulo et al 2013]. Of course, at the beginning, this knowledge base did not contain all the names of the LRE Map: only 30 resource names were known, like "Wikipedia" or "WordNet". During the preparation of the experiment, a data fusion was applied between the two lists to incorporate the LRE Map into the knowledge base.

8. Running session and post-processing
The entity name detection is applied to the whole corpus on a middle-range machine, i.e. one Xeon E3-1270V2 with 32 GB of memory. A post-processing step is applied in order to keep only the linked entities of the types NLPCorpus, NLPGrammar, NLPLexicon, NLPSpecification and NLPTool. Then the results are gathered to compute a readable synthesis as an HTML file, which is too big to be presented here, but the interested reader may consult the file "lremap.html" on www.nlp4nlp.org. The whole computation takes 95 minutes.
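To make the filtering step concrete, here is a minimal sketch. The five type labels come from Section 6; the LinkedEntity record and the per-document counting convention are assumptions, since the paper does not describe TagParser's output structure.

    import java.util.*;
    import java.util.stream.*;

    public class TypeFilter {
        // The five broad categories retained in Section 6.
        private static final Set<String> KEPT_TYPES = Set.of(
            "NLPCorpus", "NLPGrammar", "NLPLexicon", "NLPSpecification", "NLPTool");

        // Hypothetical shape of one linked entity produced by the detector.
        record LinkedEntity(String headword, String type, String documentId) {}

        // Keep only entities of the five resource types, then count the number
        // of distinct documents mentioning each resource.
        static Map<String, Long> synthesize(List<LinkedEntity> entities) {
            return entities.stream()
                .filter(e -> KEPT_TYPES.contains(e.type()))
                .collect(Collectors.groupingBy(LinkedEntity::headword,
                         Collectors.mapping(LinkedEntity::documentId,
                         Collectors.collectingAndThen(Collectors.toSet(),
                                                      s -> (long) s.size()))));
        }
    }

From such a map, the HTML synthesis mentioned above is a straightforward rendering step.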