This is a repository copy of JATE2.0 : Java Automatic Term Extraction with Apache Solr.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/96573/

Version: Published Version

Proceedings Paper: Zhang, Z., Gao, J. and Ciravegna, F. (2016) JATE 2.0: Java Automatic Term Extraction with Apache Solr. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J. and Piperidis, S., (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). LREC 2016, Tenth International Conference on Language Resources and Evaluation, 23-28 May 2016, Portorož, Slovenia. European Language Resources Association (ELRA). ISBN 9782951740891

© 2016 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial Licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. You may not use the material for commercial purposes.

Reuse This article is distributed under the terms of the Creative Commons Attribution-NonCommercial (CC BY-NC) licence. This licence allows you to remix, tweak, and build upon this work non-commercially, and any new works must also acknowledge the authors and be non-commercial. You don’t have to license any derivative works on the same terms. More information and the full terms of the licence here: https://creativecommons.org/licenses/

Takedown If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing [email protected] including the URL of the record and the reason for the withdrawal request.

[email protected] https://eprints.whiterose.ac.uk/

JATE 2.0: Java Automatic Term Extraction with Apache Solr

Ziqi Zhang, Jie Gao, Fabio Ciravegna
Regent Court, 211 Portobello, Sheffield, UK, S1 4DP
ziqi.zhang@sheffield.ac.uk, j.gao@sheffield.ac.uk, f.ciravegna@sheffield.ac.uk

Abstract

Automatic Term Extraction (ATE) or Recognition (ATR) is a fundamental processing step preceding many complex knowledge engineering tasks. However, few methods have been implemented as public tools and in particular, available as open-source freeware. Further, little effort is made to develop an adaptable and scalable framework that enables customization, development, and comparison of algorithms under a uniform environment. This paper introduces JATE 2.0, a complete remake of the free Java Automatic Term Extraction Toolkit (Zhang et al., 2008) delivering new features including: (1) highly modular, adaptable and scalable ATE thanks to integration with Apache Solr, the open source free-text indexing and search platform; (2) an extended collection of state-of-the-art algorithms. We carry out experiments on two well-known benchmarking datasets and compare the algorithms along the dimensions of effectiveness (precision) and efficiency (speed and memory consumption). To the best of our knowledge, this is by far the only free ATE library offering a flexible architecture and the most comprehensive collection of algorithms.

Keywords: term extraction, term recognition, NLP, text mining, Solr, search, indexing

1. Introduction

Automatic Term Extraction (or Recognition) is an important Natural Language Processing (NLP) task that deals with the extraction of terminologies from domain-specific textual corpora. ATE is widely used by both industries and researchers in many complex tasks, such as Information Retrieval (IR), machine translation, ontology engineering and text summarization (Bowker, 2003; Brewster et al., 2007; Maynard et al., 2007). Over the years, there has been constant effort on researching new algorithms, adapting to different domains and/or languages, and creating benchmarking datasets (Ananiadou, 1994; Church and Gale, 1995; Ahmad et al., 1999; Frantzi et al., 2000; Park et al., 2002; Matsuo and Ishizuka, 2003; Sclano and Velardi, 2007; Rose et al., 2010; Spasić et al., 2013; Zadeh and Handschuh, 2014).

Given the fundamental role of ATE in many tasks, a real benefit to the community would be releasing the tools to facilitate the development of downstream applications, also encouraging code reuse, comparative studies, and fostering further research. Unfortunately, effort in this direction has been extremely limited. First, only very few tools have been published with readily available source code (Spasić et al., 2013), and some are proprietary (labs.translated.net, 2015; Yahoo!, 2015) or no longer maintained (Sclano and Velardi, 2007; original link: http://lcl2.di.uniroma1.it/). Second, different tools have been developed under different scenarios and evaluated in different domains using proprietary language resources, making comparison difficult. Third, it is unclear whether and how well these tools can adapt to different domain tasks and scale up to large data.

To the best of our knowledge, the only effort towards addressing these issues is the Java Automatic Term Extraction Toolkit (JATE; https://code.google.com/p/jatetoolkit, Google Code has been shut down since Jan 2016), where we originally implemented five state-of-the-art algorithms under a uniform framework (Zhang et al., 2008). This work makes one step further by completely re-designing and re-implementing JATE to fulfill three goals: adaptability, scalability, and an extended collection of algorithms. The new library, named JATE 2.0 (https://github.com/ziqizhang/jate), is built on the Apache Solr (http://lucene.apache.org/solr/) free-text indexing and search platform and can be used either as a separate module, or as a Solr plugin during document processing to enrich the indexed documents with terms. JATE 2.0 delivers domain-specific adaptability and scalability by seamlessly integrating with the Solr text analysis capability in the form of 'analyzers' (https://cwiki.apache.org/confluence/display/solr/Analyzers), allowing users to directly exploit the highly modular and scalable Solr text analysis to facilitate term extraction for large data from various content formats. Last but not least, JATE 2.0 expands its collection of state-of-the-art algorithms to double that in the original JATE release. By this we mean the mathematical formulation of term candidate scores; we do not claim identical replication of state-of-the-art methods, as they often differ in pre- and post-processing, such as term candidate generation, term variant handling and the use of dictionaries. We also evaluate these algorithms under uniform settings on two publicly available benchmarking datasets. Released as open source software, we believe that JATE 2.0 offers significant value to both researchers and practitioners.

The remainder of this paper is structured as follows. Section 2 discusses related work. Section 3 describes JATE 2.0 in detail. Section 5 describes experiments and discusses results. Section 6 concludes this paper.

2. Related Work

We describe related work from three perspectives: ATE methods in general, ATE tools and software, and text mining plugins based on Solr.

ATE methods typically start with linguistic processors as the first stage to extract candidate terms, followed by statistics-based term ranking algorithms to perform candidate filtering. Linguistic processors are domain specific, and designed to capture term formation and collocation patterns. They often take two forms: 'closed filters' (Arora et al., 2014) focus on precision and are usually restricted to nouns or noun sequences; 'open filters' (Frantzi et al., 2000) are more permissive and often allow adjectives, adverbs, etc. For both, widely used techniques include Part-of-Speech (PoS) tag sequence matching, N-gram extraction, Noun Phrase (NP) chunking, and dictionary lookup. The selection of linguistic processors is often domain-specific and important for the trade-off between precision and recall. The ranking algorithms are often based on: 'unithood', indicating the collocation strength of the units that comprise a single term (Matsuo and Ishizuka, 2003); and 'termhood', indicating the association strength of a term to domain concepts (Ahmad et al., 1999). Most methods combine both termhood and unithood measures (Ananiadou, 1994; Frantzi et al., 2000; Park et al., 2002; Sclano and Velardi, 2007; Spasić et al., 2013). Two unique features of JATE 2.0 are (1) customization of linguistic processors of both forms, hence making it adaptable to many domains and languages, and (2) a wide selection of implemented ranking algorithms, which are not available in any other tools.

Existing ATE tools and software are extremely limited for several reasons. First, many are proprietary software (labs.translated.net, 2015), quota-limited (e.g., OpenCalais, http://new.opencalais.com/), or restricted to academic usage only (e.g., TerMine, http://www.nactem.ac.uk/software/termine/). Second, all tools, including the original JATE (Zhang et al., 2008), are built in a monolithic architecture which allows very limited customization and scalability (Spasić et al., 2013). Third, most offer very limited (typically a single choice) configurability, e.g., in terms of term candidate extraction and/or ranking algorithms (e.g., TermRaider (Maynard et al., 2008), TOPIA (https://pypi.python.org/pypi/topia.termextract), and FlexiTerm (Spasić et al., 2013)). JATE 2.0 is a free, open-source library. It is highly modular, allowing customization and configuration of many components. Meanwhile, to the best of our knowledge, no available ATE tools support automatic indexing.

Solr plugins for text mining. Solr is an open source enterprise search platform written in Java. It is highly modular, customizable, and scalable, hence widely used as an industry standard. A core component of Solr is its powerful, extensible text processing library covering a wide range of functionality and languages. Many text mining and NLP tools have integrated with Solr to benefit from these features as well as to contribute additional support for document indexing (e.g., OpenNLP plugins (https://wiki.apache.org/solr/OpenNLP), SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger), and UIMA (https://uima.apache.org)). However, no plugin is available for ATE.

3. JATE 2.0 Architecture

Figure 1 shows the general architecture of JATE 2.0 and its workflow, which consists of four phases: (1) data pre-processing; (2) term candidate extraction and indexing; (3) candidate scoring and filtering; and (4) final term indexing and export.

Data pre-processing (Section 3.1) parses input documents to raw text content and performs text normalization in order to reduce 'noise' in irregular data. The pre-processed text content then passes through the candidate extraction component (Section 3.2), which extracts and normalizes term candidates from each document. Candidate extraction and data pre-processing are embedded as part of the Solr document indexing process, largely realized by Solr's 'analyzers'. An analyzer is composed of a series of processors ('tokenizers' and 'filters') that form a pipeline, or analysis 'chain', which examines the text content of each document, generates a token stream and records statistical information. Solr already implements a very large text processing library for this purpose; JATE 2.0 further extends it. Depending on individual needs, users may assemble customized analyzers for term candidate generation. Defining analyzers is achieved by configuration in the Solr schema, such as that shown in Figure 2.

Next, the candidates are processed by the subsequent filtering component (Section 3.3), where different ATE algorithms can be configured to score and rank the candidates and to make the final selection. The implementation of the algorithms is largely parallelized. The resulting terms can either be exported for further validation, or written back to the index to annotate documents.

JATE 2.0 can be run in two modes: (1) as a separate module to extract terms from a corpus using an embedded Solr instance (Section 4.1), or (2) as a Solr plugin that performs ATE as part of the document indexing process to annotate documents with domain-specific terminologies (Section 4.2). Both modes undergo the same workflow described above.

JATE 2.0 offers adaptability and scalability thanks to its seamless integration with Solr and its parallelized implementation. Access to Solr's very large collection of multi-lingual text processing libraries, which can be configured in a plug-and-play way, enables JATE 2.0 to be adapted to different languages, document formats, and domains. The rich collection of algorithms also gives users options to better tailor the tool to specific use cases. Parallelization enables JATE 2.0 to scale up to large datasets. Parallelization for term candidate generation can be achieved via Solr's distributed indexing capability under SolrCloud (https://cwiki.apache.org/confluence/display/solr/SolrCloud); users simply follow Solr's guide on configuring distributed indexing. Parallelization for scoring and ranking algorithms comes as standard in JATE 2.0 and is also configurable. Detailed guidelines on defining analyzers for ATE and configurations can be found on the JATE 2.0 homepage.
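To make the parallel scoring idea concrete, a minimal plain-JDK sketch is given below. The class and method names are ours and purely illustrative; JATE 2.0 parallelizes its scoring internally and its real API differs.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: scores each term candidate in parallel with JDK parallel
    // streams. scoreOf() is a hypothetical placeholder standing in for any of the
    // ranking algorithms (TTF, TTF-IDF, CValue, ...); it is not JATE 2.0 code.
    public class ParallelScoringSketch {

        public static Map<String, Double> scoreAll(List<String> candidates,
                                                   Map<String, Integer> corpusFrequency) {
            Map<String, Double> scores = new ConcurrentHashMap<>();
            candidates.parallelStream()
                      .forEach(c -> scores.put(c, scoreOf(c, corpusFrequency)));
            return scores;
        }

        // Placeholder scorer: here simply the total corpus frequency (TTF-like).
        private static double scoreOf(String candidate, Map<String, Integer> corpusFrequency) {
            return corpusFrequency.getOrDefault(candidate, 0);
        }
    }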

Figure 1: General Architecture of JATE 2.0

3.1. Data pre-processing

With the standard JATE 2.0 configuration, data pre-processing begins with the Solr Content Extraction Library (also known as SolrCell; https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika) to extract textual content from files. It integrates the popular Apache Tika framework (https://tika.apache.org), which supports detecting and extracting metadata and text from a large range of file formats such as plain text, HTML, PDF, and Microsoft Office. To use this, users simply configure an instance of ExtractingRequestHandler (https://wiki.apache.org/solr/ExtractingRequestHandler) for their data in Solr.

To support multi-lingual documents, users may also use Solr's capability for detecting the languages of texts in documents and mapping text to language-specific analyzers for indexing, using the langid UpdateRequestProcessor (https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing).

Next, a recommended practice for pre-processing irregular textual data (commonly found in corporate datasets) is applying a Solr char filter component (https://cwiki.apache.org/confluence/display/solr/CharFilterFactories) to perform character-level text normalization. For example, HTMLStripCharFilterFactory replaces HTML entities with their corresponding characters. This can reduce errors in downstream processors (e.g., the PoS tagger) in the analyzer. It can be configured as part of the analyzer in the Solr schema, usually placed before tokenization, such as line 3 of the schema.xml shown in Figure 2.

3.2. Candidate Extraction

Term candidate extraction is realized as part of the Solr indexing process. A dedicated 'field' (see the Solr glossary: https://cwiki.apache.org/confluence/display/solr/Solr+Glossary) is defined within the Solr indexing schema to hold candidates for each document (the candidate field). The field is bound to a 'fieldType', which uses a customized analyzer to break content into term candidates and normalize them to conflate the variants that are dispersed throughout the corpus. The Solr indexing process also generates basic statistical information about candidates, such as a candidate's frequency within each document and the number of documents it appears in. (For practical reasons, we rely on a separate field that indexes the same content as token N-grams and stores term vectors, which provides the lookup of statistics about term candidates; details can be found on the JATE website and in the Solr documentation for the inverted index.)

The analyzer is highly customizable and adaptable. All existing Solr text analysis libraries can be used, such as tokenization, case folding, stop word removal, token N-gram extraction, stemming, etc. It is also easy to implement new components to replace or complement existing ones in a plug-and-play way. Currently, Solr 5.3 supports nearly 20 implementations of tokenization and over 100 text normalization and filtering methods (see the Solr API documentation), covering a wide range of languages and use cases. Thus by selecting different components and assembling them in different orders, one can create different analyzers to achieve different extraction and normalization results.

To support ATE, JATE 2.0 extends Solr's text analysis libraries by supporting three types of linguistic filters for term candidate extraction: (1) a PoS pattern based chunker that extracts candidates based on user-specified patterns (e.g., line 9 of the schema.xml in Figure 2); (2) a token N-gram extractor that extends the built-in one; and (3) a noun phrase chunker. All of these are implemented with the capability to normalize noisy terms, such as removing leading and trailing stop words and non-alphanumeric tokens. In addition, JATE 2.0 also implements an English lemmatizer, which is a recommended replacement for stemmers that can sometimes be too aggressive for ATE. These offer great flexibility, enabling almost any customized term candidate extraction. Figure 2 shows a standard configuration of a text analyzer for PoS pattern based candidate extraction for English.

Figure 2: Example of PoS pattern based candidate extraction
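The noisy-term normalization described in Section 3.2 (stripping leading/trailing stop words and purely non-alphanumeric tokens) is implemented inside JATE 2.0's Solr filters; the stand-alone sketch below only illustrates the behaviour. The class name and stop-word list are ours, not the library's.

    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.Set;

    // Illustrative normalizer for noisy term candidates: removes leading and
    // trailing stop words and tokens with no alphanumeric content.
    public class CandidateNormalizerSketch {

        private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and", "or");

        public static String normalize(String candidate) {
            LinkedList<String> tokens =
                    new LinkedList<>(Arrays.asList(candidate.trim().split("\\s+")));
            // drop leading stop words / junk tokens
            while (!tokens.isEmpty() && isJunk(tokens.getFirst())) tokens.removeFirst();
            // drop trailing stop words / junk tokens
            while (!tokens.isEmpty() && isJunk(tokens.getLast())) tokens.removeLast();
            return String.join(" ", tokens);
        }

        private static boolean isJunk(String token) {
            String t = token.toLowerCase();
            return STOP_WORDS.contains(t) || !t.matches(".*[a-z0-9].*");
        }

        public static void main(String[] args) {
            System.out.println(normalize("the cat"));          // -> "cat"
            System.out.println(normalize("Tower of London"));  // -> "Tower of London" (middle stop word kept)
        }
    }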

3.3. Filtering

Extracted term candidates are further processed to make a final decision that separates terms from non-terms.

3.3.1. Pre-filtering

It is common practice to apply a filtering process to term candidates to reduce both noise (e.g., due to irregular textual data or erroneous PoS tagging) and computation before the subsequent step of term scoring and ranking. The following strategies have been implemented in JATE 2.0 and can be configured according to user needs to balance precision and recall.

Minimal and maximal character length restricts candidates to a fixed range of character lengths. Minimal and maximal token number requires a valid candidate to contain a certain number of tokens (e.g., to limit multi-word expressions). Minimal stop word removal removes leading and/or trailing stop words in a multi-word expression (e.g., 'the cat' becomes 'cat', but 'Tower of London' remains intact). All of these are implemented as part of the three term candidate extraction methods described before and can be configured in the analyzer. Frequency threshold allows candidates to be filtered by a minimum total frequency in the corpus. This is often used in typical ATE methods, and is configurable via the individual scoring and ranking algorithms that follow.

3.3.2. Scoring and ranking

In this step, term candidates that pass the pre-filter are scored and ranked. Their basic statistical information is gathered from the Solr index to create the more complex features required by different algorithms. JATE 2.0 has implemented 10 algorithms for scoring candidates, listed in Table 1.

TTF is the total frequency of a candidate in the corpus. Its usage in ATE was first documented in (Justeson and Katz, 1995). ATTF divides TTF by the number of documents a candidate appears in. TTF-IDF adapts the classic document-specific TF-IDF used in IR to work at corpus level, by replacing TF with TTF.

RIDF was initially proposed by (Church and Gale, 1995) as an enhancement to IDF to identify keywords in a document collection. It measures the deviation of the actual IDF score of a word from its 'expected' IDF score, which is predicted based on a Poisson distribution. The hypothesis is that the Poisson model fits such keywords poorly, and thus a prediction of IDF based on Poisson can deviate from the actual IDF observed in a corpus. Empirically, it is shown that all words have real IDF scores that deviate from the expected value under a Poisson distribution; however, keywords tend to have larger deviations than non-keywords. We adapt it to work with term candidates that can be either single words or multi-word expressions.

CValue observes that real terms in technical domains are often long, multi-word expressions and are usually not nested in other terms (i.e., as part of longer terms). Frequency-based methods are not effective for such terms because (1) nested term candidates will have at least the same and often a higher frequency, and (2) the fact that a longer string appears n times is a lot more important than a shorter string appearing n times. Thus CValue computes a score that is based on the frequency of a candidate and its length, then adjusted by the frequency of longer candidates that contain it.

Similarly, RAKE is designed to favour longer multi-word expressions. It first computes a score for individual words based on two components: one that favours words occurring often and in longer term candidates, and one that favours words occurring frequently regardless of the words they co-occur with. It then adds up the scores of the composing words of a candidate.

χ2 promotes term candidates that co-occur very often with 'frequent' candidates in the corpus. First, candidates are ranked by frequency in the corpus and a subset (typically the top n%) is selected, to be called the 'frequent terms'. Next, candidates are scored based on the degree to which their co-occurrence with these frequent terms is biased. These biases can be due to semantic, lexical, or other relations between two terms. Thus, a candidate showing strong co-occurrence biases with frequent terms may have an important meaning. To evaluate the statistical significance of the biases, the χ2 test is used. For each candidate, the co-occurrence frequency with the frequent terms is regarded as a sample value. The null hypothesis that we expect to reject is that 'the occurrence of a candidate is independent of the occurrence of frequent terms'.
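For reference, the commonly cited formulations of three of these measures are reproduced below in LaTeX notation. They follow the original papers as we read them and may differ in detail (e.g., smoothing, normalisation) from the exact JATE 2.0 implementation. Here f(t) is the total corpus frequency of candidate t, df(t) the number of documents containing t, N the number of documents, |t| the length of t in words, and T_t the set of longer candidates containing t.

    % Textbook-style formulations; not necessarily identical to JATE 2.0's code.
    \mathrm{TTF\mbox{-}IDF}(t) = f(t)\cdot\log\frac{N}{df(t)}

    \mathrm{RIDF}(t) = -\log_2\frac{df(t)}{N} + \log_2\bigl(1 - e^{-f(t)/N}\bigr)

    \mathrm{CValue}(t) =
      \begin{cases}
        \log_2|t|\cdot f(t), & \text{if } t \text{ is not nested,}\\[4pt]
        \log_2|t|\cdot\Bigl(f(t) - \tfrac{1}{|T_t|}\sum_{s\in T_t} f(s)\Bigr), & \text{otherwise.}
      \end{cases}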

Both RAKE and χ2 were initially developed for extracting document-specific keywords. We adapt them for ATE from document collections by replacing document-level frequency with corpus-level frequency wherever needed.

Weirdness is a contrastive approach (Drouin, 2003) which is particularly interesting when trying to identify low-frequency terms. The method compares the normalized frequency of a term candidate in a domain-specific corpus with that in a reference corpus, such as the general-purpose British National Corpus (http://www.natcorp.ox.ac.uk). The idea is that candidates appearing more often in the target corpus are more specific to that corpus and therefore more likely to be real terms. To cope with out-of-vocabulary candidates, we modify this by taking the sum of the Weirdness scores of the composing words of a candidate.

Both GlossEx and TermEx extend Weirdness. GlossEx linearly combines 'domain specificity', which normalizes the Weirdness score by the length (number of words) of a term candidate, with 'term cohesion', which measures the degree to which the composing words tend to occur together as a candidate rather than appearing individually. This is computed by comparing the frequency of a candidate against that of its composing words. TermEx, in a very similar form, further extends GlossEx by linearly combining a third component that promotes candidates with an even probability distribution across the documents in the corpus (in an analogy, the candidate 'gains consensus' among the documents).

Essentially, both GlossEx and TermEx combine termhood with unithood, and both exploit a reference corpus. For these we introduce a scalar to balance the contribution of unithood, which tends to dominate the final score when the domain and reference corpora have largely disproportionate sizes (e.g., orders of magnitude difference). The scalar adjusts the normalized frequency of words in the target and reference corpora to be in the same order of magnitude; details can be found in the implementation of the two algorithms in JATE 2.0.
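The contrastive Weirdness measure that GlossEx and TermEx build on can be stated compactly as below: the textbook formulation plus the composing-word sum described above (the library may additionally smooth zero reference frequencies). Here f_D and f_R are word frequencies in the domain corpus D and the reference corpus R, and |D| and |R| are their total word counts.

    \mathrm{Weirdness}(w) = \frac{f_D(w)/|D|}{f_R(w)/|R|}, \qquad
    \mathrm{Weirdness}(t = w_1 \dots w_n) = \sum_{i=1}^{n}\mathrm{Weirdness}(w_i)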
3.3.3. Threshold cutoff

Next, a cutoff decision is made to separate real terms from non-terms based on their scores. Three options are supported: (1) hard threshold, where term candidates with scores no less than the threshold are chosen; (2) top K, where the top ranked K candidates are chosen; and (3) top K%, where the top ranked K percent of candidates are chosen.

Table 1: ATE algorithms implemented in JATE 2.0
Name      | Full name                     | Ref.
TTF       | Term Total Frequency          | (Justeson and Katz, 1995), FiveFilters.org
ATTF      | Average TTF                   | -
TTF-IDF   | TTF and Inverse Doc Frequency | -
RIDF      | Residual IDF                  | (Church and Gale, 1995)
CValue    | C-Value                       | (Ananiadou, 1994)
χ2        | Chi-Square                    | (Matsuo and Ishizuka, 2003)
RAKE      | Rapid Keyword Extraction      | (Rose et al., 2010)
Weirdness | Weirdness                     | (Ahmad et al., 1999)
GlossEx   | Glossary Extraction           | (Park et al., 2002)
TermEx    | Term Extraction               | (Sclano and Velardi, 2007)

4. Using JATE 2.0 in Two Modes

4.1. Embedded mode

The embedded mode is recommended when users need a list of domain-specific terms from a corpus to be used in subsequent knowledge engineering tasks. Users configure a Solr instance and, in particular, a text analysis chain that defines how term candidates are extracted and normalized. The Solr instance is instantiated as an embedded module (https://wiki.apache.org/solr/EmbeddedSolr) that interacts with other components. Users then explicitly start an indexing process on a corpus to trigger term candidate extraction, and then use specific ATE utility classes (see the app package of JATE 2.0), each wrapping an individual ATE algorithm, to perform candidate scoring, ranking, filtering and export.

4.2. Plugin mode

The plugin mode is recommended when users need to index new documents, or enrich an existing index, with extracted terms, which can, e.g., support faceted search. ATE is performed as a Solr plugin. A SolrRequestHandler (https://wiki.apache.org/solr/SolrRequestHandler) is implemented so that term extraction can be triggered by an HTTP request. Users configure their Solr instance in the same way as above, then start the instance as a back-end server. To trigger ATE, users send an HTTP request to the SolrRequestHandler, passing parameters specifying the input data and the ATE algorithms. Candidate extraction is optional if the process is performed with document indexing. ATE then begins and, when finished, updates documents in the index with the extracted terms.

Figure 3: Term Extraction by HTTP request (Plugin mode)
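As a usage illustration of the plugin mode, the request can be issued with any HTTP client. The sketch below uses the JDK's HttpClient against a hypothetical handler path and hypothetical parameter names; the real endpoint and parameters are defined by the user's Solr configuration and the JATE 2.0 documentation.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Illustration only: triggers term extraction on a running Solr instance with the
    // JATE 2.0 request handler installed. The handler path "/jate" and the query
    // parameters below are placeholders, not the documented API.
    public class PluginModeClientSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("http://localhost:8983/solr/mycore/jate?algorithm=CValue&extraction=true"))
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }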

5. Experiment

We evaluate JATE 2.0 on two datasets: the GENIA dataset (Kim et al., 2003), a semantically annotated corpus for bio-textmining previously used by (Zhang et al., 2008); and the ACL RD-TEC dataset (Zadeh and Handschuh, 2014), containing publications in the domain of computational linguistics and a list of manually annotated domain-specific terms.

GENIA contains 1,999 Medline abstracts, selected using a PubMed query for the terms 'human', 'blood cells', and 'transcription factors'. The corpus is annotated with various levels of linguistic and semantic information. We extract any text annotated as 'cons' (concept) to compile a gold standard list of terms. ACL RD-TEC contains over 10,900 scientific publications. In (Zadeh and Handschuh, 2014), the corpus is automatically segmented and PoS tagged. Candidate terms are extracted by applying a list of patterns based on PoS sequences. These are then ranked by several ATE algorithms, and the top set of over 82,000 candidates is manually annotated as valid or invalid. We use the valid candidates as gold standard terms. We notice that three different versions of the corpus are available (http://atmykitchen.info/datasets/acl_rd_tec/, accessed 10 March 2016): one only contains the sentences in which at least one of the annotated candidate terms is present (under 'annotation'), one contains complete raw text files in XML format (under 'cleansed text'), and one contains plain text files from the ACL ARC (under 'external resource'). We use the XML version and only extract content from '' and '<paragraph>' elements.

We ran all experiments on a server with 16 CPUs. All algorithms follow the same candidate extraction process, unless otherwise noted. The detailed configuration is as follows:

• three different analyzers have been tested, using PoS sequence pattern based (PoS-based), noun phrase chunking based (NP-based), and N-gram based term candidate extraction, respectively. The pipelines are slightly different for each analyzer and each dataset; detailed configurations can be found in the example section of the JATE website;
• for PoS-based term candidate extraction we use: for GENIA, a list of PoS sequence patterns defined in (Ananiadou, 1994); for ACL RD-TEC, a list of PoS sequence patterns defined in (Zadeh and Handschuh, 2014);
• min. character length of 2; max. of 40;
• min. tokens of 1; max. of 5;
• leading and trailing stop word and non-alphanumeric character removal (for N-gram based extraction, removal of any non-alphanumeric characters);
• stop word removal;
• min. frequency threshold of 2;
• for χ2, the 'frequent terms' are selected as the top 10%; moreover, only candidates that appear in at least two sentences are considered;
• for Weirdness, GlossEx and TermEx, the BNC corpus is used as the reference.

Table 2: Comparison of candidate extraction on GENIA (run in a single thread)
Method    | Term candidates (after/before min. frequency threshold) | Recall | CPU time (ms)
PoS-based | 10,582/38,850                                            | 10%    | 68,914
NP-based  | 35,799/44,479                                            | 23%    | 118,753
N-gram    | 48,945/440,974                                           | 16%    | 33,721

Table 3: Comparison of candidate extraction on ACL RD-TEC (run in a single thread)
Method    | Term candidates (after/before min. frequency threshold) | Recall | CPU time (min)
PoS-based | 524,662/1,569,877                                        | 74%    | 133.84
NP-based  | 585,012/2,014,916                                        | 66%    | 367
N-gram    | 887,316/7,776,457                                        | 41%    | 189.20

The total number of term candidates extracted for each dataset under each analyzer setting, after/before applying the minimum frequency threshold, together with the associated overall recall and CPU time, is shown in Tables 2 and 3. The low recall for GENIA may be due to several reasons. First, a substantial part of the gold standard terms are 'irregular', as they contain numbers, punctuation and symbols; the patterns in our experiment are mainly based on adjectives and nouns, and will not match irregular terms. Second, we use a general-purpose PoS tagger which does not perform well on biomedical text. In addition, GENIA consists of very short abstracts and, as a result, many legitimate terms may be removed by the frequency threshold and lexical pruning. However, this can easily be rectified by relaxing the pre-filters.

We then show the precision of the top K terms ranked by each algorithm, as commonly used in ATE. Figure 4 shows results for the GENIA dataset and Figure 5 shows results for the ACL RD-TEC dataset. First, all algorithms achieve very high precision on the GENIA dataset. This is partly because the gold standard terms are very densely distributed: our analysis shows that 49% of words are annotated as part of a term. Second, TTF-IDF and CValue appear to be the most reliable methods, as they obtain consistently good results on both datasets. Third, Weirdness, GlossEx and TermEx are very sensitive to the choice of reference corpus. For example, on the ACL RD-TEC dataset, phrases containing 'treebank' are ranked very highly; however, many of them are not valid terms. Finally, RAKE is the worst performing algorithm on both datasets, possibly because it is very specifically tailored to document-level (rather than corpus-level) keyword extraction.

Figure 4: Comparison of Top K precisions on GENIA (precision at rank K for TTF, ATTF, TTF-IDF, RIDF, ChiSquare, CValue, Weirdness, GlossEx, TermEx and RAKE, under PoS based, NP-Chunking based and N-gram based extraction)

Figure 5: Comparison of Top K precisions on ACL RD-TEC (same algorithms and extraction settings)
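The precision at rank K plotted in Figures 4 and 5 is, we assume, the standard definition:

    P@K = \frac{|R_K \cap G|}{K}

where R_K is the set of top-K candidates ranked by an algorithm and G is the gold standard term list.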
Next we compare the efficiency of each algorithm for scoring and ranking term candidates, by measuring the maximum memory footprint and the CPU time for computation. We found consistent patterns among the algorithms regardless of which term candidate extractor is used (which only affects absolute figures, as it changes the number of candidates generated; see Tables 2 and 3). Using the PoS-based term candidate extractor as an example, we show the statistics in Table 4 and Table 5 for the two datasets. As shown, χ2 is the most memory intensive due to the in-memory storage of co-occurrence data; this largely depends on the number of term candidates and frequent terms. In terms of speed, χ2, CValue and RAKE are among the slowest. While χ2 spends significant time on co-occurrence related computation, CValue and RAKE spend substantial time on computing the containment relations among candidates.

Table 4: Running time and memory usage on GENIA
Name      | CPU time (sec.) | Max vmem (GB)
TTF       | 19              | 0.81
ATTF      | 20              | 0.83
TTF-IDF   | 20              | 0.83
RIDF      | 20              | 0.78
χ2        | 30              | 1.12
CValue    | 47              | 1.41
Weirdness | 22              | 1.07
GlossEx   | 25              | 1.01
TermEx    | 27              | 1.06
RAKE      | 26              | 1.42

Table 5: Running time and memory usage on ACL RD-TEC
Name      | CPU time (min:sec) | Max vmem (GB)
TTF       | 5:46               | 4.2
ATTF      | 5:56               | 4.18
TTF-IDF   | 5:58               | 4.24
RIDF      | 5:59               | 4.38
χ2        | 21:47              | 16.05
CValue    | 20:59              | 5.01
Weirdness | 7:11               | 4.89
GlossEx   | 6:40               | 4.88
TermEx    | 9:52               | 6.12
RAKE      | 22:59              | 5.45

6. Conclusion

This paper describes JATE 2.0, a highly modular, adaptable and scalable ATE library with 10 implemented algorithms, developed within the Apache Solr framework. It advances existing ATE tools mainly by enabling a significant degree of customization and adaptation thanks to the flexibility of the Solr framework, and by making available a large collection of state-of-the-art algorithms. It is expected that the tool will bring both academia and industry under a uniform development and benchmarking framework that will encourage collaborative effort in this area of research, and foster further contributions in terms of novel algorithms, benchmarks, support for new languages, new text processing capabilities, and so on. Future work will look into the implementation of additional ATE algorithms, particularly machine learning based methods.

7. Acknowledgement

Part of this research has been sponsored by the EU funded project WeSenseIt under grant agreement number 308429, and the SPEEAK-PC collaboration agreement 101947 of Innovate UK.

8. Bibliographical References

Ahmad, K., Gillam, L., and Tostevin, L. (1999). University of Surrey participation in TREC 8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In Proceedings of the 8th Text REtrieval Conference.

Ananiadou, S. (1994). A methodology for automatic term recognition. In Proceedings of the 15th Conference on Computational Linguistics, pages 1034–1038, Stroudsburg, PA, USA. Association for Computational Linguistics.

Arora, C., Sabetzadeh, M., Briand, L., and Zimmer, F. (2014). Improving requirements glossary construction via clustering: approach and industrial case studies. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, page 18. ACM.

Bowker, L. (2003). Terminology tools for translators. Benjamins Translation Library, 35:49–66.

Brewster, C., Iria, J., Zhang, Z., Ciravegna, F., Guthrie, L., and Wilks, Y. (2007). Dynamic iterative ontology learning. In Proceedings of the 6th International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria, September.

Church, K. W. and Gale, W. A. (1995). Inverse document frequency (IDF): A measure of deviations from Poisson. In Proceedings of the ACL 3rd Workshop on Very Large Corpora, pages 121–130, Stroudsburg, PA, USA. Association for Computational Linguistics.

Drouin, P. (2003). Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1):99–115.

Frantzi, K. T., Ananiadou, S., and Mima, H. (2000). Automatic recognition of multi-word terms: the C-value/NC-value method. Natural Language Processing for Digital Libraries, 3(2):115–130.

Justeson, J. and Katz, S. M. (1995). Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9–27.

Kim, J.-D., Ohta, T., Tateisi, Y., and Tsujii, J. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. In ISMB (Supplement of Bioinformatics), pages 180–182.

labs.translated.net. (2015). Terminology extraction. [Online; accessed 9-October-2015].

Matsuo, Y. and Ishizuka, M. (2003). Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(1):157–169.

Maynard, D., Saggion, H., Yankova, M., Bontcheva, K., and Peters, W. (2007). Natural language technology for information integration in business intelligence. In Business Information Systems, pages 366–380. Springer.

Maynard, D., Li, Y., and Peters, W. (2008). NLP techniques for term extraction and ontology population. In Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, pages 107–127, Amsterdam, The Netherlands. IOS Press.

Park, Y., Byrd, R. J., and Boguraev, B. K. (2002). Automatic glossary extraction: Beyond terminology identification. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rose, S., Engel, D., Cramer, N., and Cowley, W. (2010). Automatic keyword extraction from individual documents. John Wiley and Sons.

Sclano, F. and Velardi, P. (2007). TermExtractor: a web application to learn the shared terminology of emergent web communities. In Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications.

Spasić, I., Greenwood, M., Preece, A., Francis, N., and Elwyn, G. (2013). FlexiTerm: a flexible term recognition method. Journal of Biomedical Semantics, 4(27).

Yahoo! (2015). Yahoo! content analysis API. https://developer.yahoo.com/contentanalysis/ [Online; accessed 9-October-2015].

Zadeh, B. and Handschuh, S. (2014). The ACL RD-TEC: A dataset for benchmarking terminology extraction and classification in computational linguistics. In Proceedings of the 4th International Workshop on Computational Terminology (Computerm), pages 52–63, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Zhang, Z., Iria, J., Brewster, C., and Ciravegna, F. (2008). A comparative evaluation of term recognition algorithms. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May.
".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 7) { adcall('pf' + endPage); } } }, { passive: true }); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html><script data-cfasync="false" src="/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js"></script>