
Appendix: Available Tools and Resources

This appendix describes tools, datasets, and resources that can be used to implement and exploit the techniques described in this book.

A.1 Libraries and APIs

A.1.1 Libraries for Natural Language Processing

In Chap. 2, we described a classical pipeline for Natural Language Processing, which allows relevant features to be extracted from unstructured text. We started with the first step of the pipeline (Sect. 2.1.1), whose main goal is to identify and extract relevant keywords and phrases from the text, and we concluded with the syntactic analysis (Sect. 2.1.2), which aims at inferring information about the structure of the text and the role of each word in the text. In this section, we list a set of libraries which can be used to perform the aforementioned analysis:

• OpenNLP: Machine learning-based toolkit for the processing of natural language text, which supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection, and coreference resolution. More information at:
https://opennlp.apache.org/
http://opennlp.sourceforge.net/models-1.5/

• TextPro: Suite of modular NLP tools for the analysis of written texts in both Italian and English. The suite has been designed to integrate and reuse state-of-the-art NLP components developed by researchers at Fondazione Bruno Kessler.¹

¹ https://www.fbk.eu.


The current version of the tool suite provides functions ranging from tokenization to parsing and named entity recognition. The different modules included in TextPro have been evaluated in the context of several evaluation campaigns and international shared tasks, such as EVALITA² (PoS tagging, named entity recognition, and parsing for Italian) and Semeval 2010³ (keyphrase extraction from scientific articles in English). The architecture of TextPro is organized as a pipeline of processors, where each stage accepts data from an initial input (or from the output of a previous stage), executes a specific task, and outputs the resulting data (or sends it to the next stage). More information at:
http://textpro.fbk.eu/

• Stanford CoreNLP: A set of human language technology tools which can give the base forms of words and their part-of-speech tags, recognize whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, and so on. More information at:
http://nlp.stanford.edu/software/corenlp.shtml
http://corenlp.run/

• GATE: A full-lifecycle open-source solution for text processing. GATE is in active use for all types of computational task involving human language. It has a mature and extensive community of developers, users, educators, students, and scientists, and it is widely adopted by corporations, small and medium enterprises, research labs, and universities worldwide. More information at:
https://gate.ac.uk/

• UIMA: Apache UIMA (Unstructured Information Management applications) makes it possible to analyze large volumes of unstructured information in order to discover relevant knowledge. UIMA might ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at. It can perform a wide range of operations, such as language identification, sentence boundary detection, and entity detection (person/place names, etc.). More information at:
https://uima.apache.org/index.html

• SpaCy: Free, open-source library for advanced Natural Language Processing in Python. It can be used to build information extraction or natural language understanding systems, or to preprocess text for deep learning. More information at:
https://spacy.io/

² http://www.evalita.it.
³ http://semeval2.fbk.eu/semeval2.php?location=.

• Natural Language Toolkit (NLTK): Platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. More information at:
https://www.nltk.org/

Besides the complete libraries to perform the most common NLP tasks, we can use the following tools to perform some of the operations of the NLP pipeline:

• Snowball: A small string processing language designed for creating stemming algorithms for use in Information Retrieval. More information at:
http://snowballstem.org/

• Porter stemmer: One of the most famous rule-based stemming algorithms. For the English language, you can refer to:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
http://snowballstem.org/algorithms/porter/stemmer.html
For the Italian language, you can refer to:
http://snowball.tartarus.org/algorithms/italian/stemmer.html
http://snowballstem.org/algorithms/italian/stemmer.html

• Morph-it!: Morphological resource (dictionary) for the Italian language. It is a lexicon of inflected forms with their lemma and morphological features. Hence, each word is assigned its lemma and various morphological features. Among the most important features, we have degree (positive, comparative, or superlative) for adjectives, and inflectional gender (feminine or masculine) and number (singular or plural) for both nouns and adjectives. More information at:
http://docs.sslmit.unibo.it/doku.php?id=resources:morph-it

• LemmaGen: A standardized open-source multilingual platform for lemmatization in 12 European languages. It is able to learn lemmatization rules for new languages when provided with existing (word form, lemma) pair examples. More information at:
http://lemmatise.ijs.si

• Stanford Log-linear Part-Of-Speech Tagger: Included in the Stanford CoreNLP suite, it reads text in some language and assigns parts of speech to each open-category word/token (noun, verb, adjective, …). Similarly to other POS taggers, it uses fine-grained POS tags, such as "noun-plural". This software is a Java implementation of the log-linear part-of-speech taggers described in [14, 15]. Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language. The English taggers adopt the Penn tag set. The tagger is licensed under the GNU General Public License (v2 or later).

Open-source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. More information at:
http://nlp.stanford.edu/software/tagger.shtml

• Stanford Parser: Included in the Stanford CoreNLP suite, it is a probabilistic context-free grammar parser for English, which can be adapted to other languages such as Italian, Bulgarian, and Portuguese. It also includes a German parser based on the Negra corpus, a Chinese parser based on the Chinese Treebank, as well as Arabic parsers based on the Penn Arabic Treebank. The output of the parser is dependency structures (Universal Dependencies v1) as well as phrase structure trees. The types of parsers included in the package are a shift-reduce constituency parser and a neural network dependency parser; in addition, a tool for scoring generic dependency parses is provided. They are released under a dual license: open-source licensing is under the full GPL, which allows many free uses, while commercial licensing is available for distributors of proprietary software. More information at:
https://nlp.stanford.edu/software/lex-parser.shtml

• MaltParser: System for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model. While a traditional parser generator constructs a parser given a grammar, a data-driven parser generator constructs a parser given a treebank. MaltParser is an implementation of inductive dependency parsing, where the syntactic analysis of a sentence amounts to the derivation of a dependency structure, and where inductive machine learning is used to guide the parser at nondeterministic choice points. The parsing methodology is based on (i) deterministic parsing algorithms for building labeled dependency graphs; (ii) history-based models for predicting the next parser action at nondeterministic choice points; and (iii) discriminative learning to map histories to parser actions. Parsers developed using MaltParser have many parameters that need to be optimized and have achieved state-of-the-art accuracy for a number of languages. More information at:
http://www.maltparser.org/

• SENNA: Software distributed under a noncommercial license, which performs NLP tasks such as part-of-speech tagging, chunking, named entity recognition, semantic role labeling, and syntactic parsing. SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance [1]. More information at:
https://ronan.collobert.com/senna/

• SyntaxNet: Library developed by Google for data-driven dependency parsing based on neural nets. It can train and run syntactic dependency parsing models. One model that it provides, called Parsey McParseface, offers a particularly good speed/accuracy trade-off.

Multilingual data are provided by the Universal Dependency Parsing (UDP) project.⁴ It is available as open-source software. More information at:
https://github.com/tensorflow/models/blob/master/research/syntaxnet/g3doc/syntaxnet-tutorial.md

• ACOPOST: Set of freely available POS taggers written in C, aiming for extreme portability and code correctness/safety. ACOPOST currently consists of four taggers which are based on different frameworks, and it provides a uniform environment for testing. More information at:
http://acopost.sourceforge.net/

• TreeTagger: Tool for annotating text with part-of-speech and lemma information for German, English, French, Italian, Danish, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian, Czech, Coptic, and Old French texts; it is adaptable to other languages if a lexicon and a manually tagged training corpus are available. More information at:
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

After the execution of the NLP pipeline, we can represent documents using the Vector Space Model (VSM). As described in Sect. 2.2, the VSM makes it possible to represent every document as a vector of term weights, and documents can then be compared using cosine similarity (a minimal sketch of this representation is given at the end of this section). To this purpose, the following libraries can be adopted:

• Apache Lucene: High-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. According to the principles of the VSM, Lucene is able to perform ranked searching (i.e., the best results are returned first); it provides many powerful query types, such as phrase queries, wildcard queries, proximity queries, and range queries, and it supports searching on semi-structured documents organized in different fields (e.g., title, author, contents). It allows different ranking models to be plugged in, including the Vector Space Model and Okapi BM25 [13]. More information at:
https://lucene.apache.org/core/

• Apache Solr: Highly reliable, scalable, and fault-tolerant open-source search platform built on Apache Lucene, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. More information at:
https://lucene.apache.org/solr/

• Elasticsearch: Scalable and near real-time search engine developed in Java. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is the most popular enterprise search engine, followed by Apache Solr.

⁴ http://universaldependencies.org/.

Elasticsearch uses Lucene and tries to make all its features available through the JSON and Java APIs. More information at:
https://www.elastic.co/
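As anticipated above, the following minimal sketch illustrates the classical pipeline and the VSM-based comparison using NLTK, assuming the library and its "punkt", "stopwords", and "averaged_perceptron_tagger" data packages have been installed. The helper functions and the toy documents are ours, introduced purely for illustration; the comparison uses raw term frequencies, whereas production systems such as Lucene rely on more refined weighting schemes (e.g., TF-IDF or BM25).

import math
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def preprocess(text):
    # Lexical analysis: tokenize, keep alphabetic tokens, drop stopwords, stem.
    tokens = nltk.word_tokenize(text.lower())
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

def cosine(tokens1, tokens2):
    # Cosine similarity between two term-frequency vectors (plain VSM).
    v1, v2 = Counter(tokens1), Counter(tokens2)
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
           math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

doc1 = "The tagger assigns a part-of-speech tag to every token in the text."
doc2 = "Part-of-speech tagging assigns tags to the tokens of a text."

# Syntactic-level annotation: POS tags for the first document.
print(nltk.pos_tag(nltk.word_tokenize(doc1)))
# VSM-style comparison of the two documents.
print(cosine(preprocess(doc1), preprocess(doc2)))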

A.1.2 Libraries for Encoding Endogenous Semantics

In this section, we list a set of libraries which can be used to learn semantics-aware content representations, using the endogenous techniques described in Chap. 3, such as Word2Vec, LSA, and Random Indexing. To this purpose, the following libraries can be adopted:

• S-Space: Collection of algorithms for building Semantic Spaces, as well as a highly scalable library for designing new algorithms. Distributional algorithms process text corpora and represent the semantics of words as high-dimensional feature vectors. More information at:
https://github.com/fozziethebeat/S-Space

• SemanticVectors: Package for creating semantic WordSpace models from free natural language text. Such models are designed to represent words and documents in terms of underlying concepts. They can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching. More information at:
https://github.com/semanticvectors/semanticvectors/wiki

• Gensim: Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It provides efficient implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), or Word2Vec deep learning. More information at:
http://radimrehurek.com/gensim/

• Word2Vec: Tool which provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. The tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words. The resulting word vector file can be used as features in many NLP and machine learning applications (a small training sketch is given at the end of this section). More information at:
https://code.google.com/archive/p/word2vec/
https://github.com/wlin12/wang2vec

• GloVe: Global Vectors for Word Representation is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word–word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space [11]. More information at:
https://nlp.stanford.edu/projects/glove/

• FastText: Open-source, free, lightweight library that allows users to learn text representations and text classifiers. FastText builds on Word2Vec by learning vector representations for each word and the n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to training, it enables word embeddings to encode sub-word information. FastText vectors have been shown to be more accurate than Word2Vec vectors by a number of different measures. More information at:
https://fasttext.cc/

• Wikipedia2Vec: Tool for obtaining embeddings (i.e., vector representations) of words and entities from Wikipedia (e.g., concepts that have corresponding pages in Wikipedia). The tool learns embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. More information at:
https://github.com/wikipedia2vec/wikipedia2vec

• Wikipedia-based Explicit Semantic Analysis [6]: A library implementing the ESA technique as described by Gabrilovich and Markovitch. More information at:
https://github.com/pvoosten/explicit-semantic-analysis

• LexVec: Implementation of the LexVec model (similar to Word2Vec and GloVe) that achieves state-of-the-art results in multiple NLP tasks. More information at:
https://github.com/alexandres/lexvec

• Emoji2Vec: Pre-trained embeddings for all Unicode emojis, learned from their descriptions in the Unicode emoji standard⁵ [4]. The method maps emoji symbols into the same space as the 300-dimensional Google News Word2Vec embeddings, described in Sect. A.2. Thus, the resulting emoji2vec embeddings can be used in addition to 300-dimensional Word2Vec embeddings in any application. More information at:
https://github.com/uclmr/emoji2vec

• ELMo: Deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., to model polysemy).

⁵ http://www.unicode.org/emoji/charts/full-emoji-list.html.

These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis. ELMo representations are contextual, i.e., the representation for each word depends on the entire context in which it is used, and deep, i.e., the word representations combine all layers of a deep pre-trained neural network [12]. More information at:
https://allennlp.org/elmo

• BERT: Bidirectional Encoder Representations from Transformers is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right contexts in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications [3]. More information at:
https://github.com/google-research/bert

Even though the following libraries are not specifically designed to encode endogenous semantics, they provide some useful services:

• ConvertVec: Tool for converting Word2Vec vectors between binary and plaintext formats. You can use it to convert pre-trained vectors to plaintext. More information at:
https://github.com/marekrei/convertvec

• t-SNE: t-Distributed Stochastic Neighbor Embedding, a tool for visualizing word embeddings in 2D. It implements a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. More information at:
http://lvdmaaten.github.io/tsne/
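As anticipated above, the following is a minimal sketch of how word embeddings can be learned with Gensim's Word2Vec implementation. It assumes Gensim 4.x (older versions use size instead of vector_size) and uses a toy list of pre-tokenized sentences in place of a real training corpus; meaningful embeddings of course require far larger corpora, such as the Wikipedia dumps listed in Sect. A.2.

from gensim.models import Word2Vec

# Toy corpus: a real corpus would contain millions of tokenized sentences.
corpus = [
    ["semantic", "vectors", "represent", "word", "meaning"],
    ["word", "embeddings", "capture", "distributional", "semantics"],
    ["documents", "are", "compared", "in", "a", "vector", "space"],
]

# Skip-gram model (sg=1) with 100-dimensional vectors.
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50)

print(model.wv["semantic"][:5])                # first components of a word vector
print(model.wv.similarity("word", "vectors"))  # cosine similarity between two terms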

A.1.3 Libraries for Encoding Exogenous Semantics

In this section, we list a set of libraries which can be used to learn semantics-aware content representations, using the exogenous techniques described in Chap. 4, which exploit the data encoded in structured and external knowledge sources, such as Wikipedia, BabelNet, or the Linked Open Data cloud, using approaches based on WSD or entity linking. To this purpose, the following libraries can be adopted:

• TAGME [5]: Powerful tool that is able to identify on-the-fly meaningful short phrases (called spots) in an unstructured text and link them to a pertinent Wikipedia page in a fast and effective way. This annotation process has implications which go far beyond the enrichment of the text with explanatory links, because it concerns the contextualization and, in some way, the understanding of the text. The main advantage of TAGME is the ability to annotate texts which are short and poorly composed, such as snippets coming from search engine result pages, tweets, news, and so on. More information at:
https://tagme.d4science.org/tagme/

• Wikify! [2]: Framework for text wikification, that is to say, for automatically cross-referencing documents with Wikipedia. The tool is able to identify important concepts in a text by using keyword extraction and to link them to the corresponding Wikipedia pages by exploiting WSD techniques. The system is trained on Wikipedia articles and learns to disambiguate and detect links in the same way as Wikipedia editors. More information at:
https://bitbucket.org/techtonik/wikify/

• Dexter: Open-source framework for entity linking that implements some popular algorithms and provides all the tools needed to develop any entity linking technique. More information at:
http://dexter.isti.cnr.it/

• Babelfy [10]: Novel integrated approach to entity linking and WSD. Given a lexicalized semantic network, e.g., BabelNet, the approach is based on three steps: (i) the automatic creation of semantic signatures, i.e., related concepts and named entities, for each vertex of the semantic network; (ii) the extraction of all the linkable fragments from a given text, listing all their possible meanings according to the semantic network; and (iii) the linking, based on a high-coherence densest subgraph algorithm. More information at:
http://babelfy.org

• DBpedia Spotlight [8]: It connects unstructured text to the Linked Open Data cloud by using DBpedia as a hub. The output is a set of Wikipedia articles related to a text, retrieved by following the URIs of the DBpedia instances. The annotation process works in four stages. First, the text is analyzed in order to select the phrases that may indicate a mention of a DBpedia resource; in this step, spots that are only composed of verbs, adjectives, adverbs, and prepositions are disregarded. Subsequently, a set of candidate DBpedia resources is built by mapping the spotted phrase to resources that are candidate disambiguations for that phrase. The disambiguation process then uses the context around the spotted phrase to decide on the best choice among the candidates. More information at:
https://www.dbpedia-spotlight.org/

• Open Calais: It exploits NLP and machine learning to find entities within documents. The main difference with respect to other entity recognizers is that Open Calais also returns facts and events hidden within the text.

Open Calais consists of three main components: (i) a named entity recognizer that identifies people, companies, and organizations; (ii) a fact recognizer that links the text with position tags, alliance, and person-political; and (iii) an event recognizer whose role is to identify sport, management, change events, labor actions, etc. Open Calais supports English, French, and Spanish, and its assets are currently linked to DBpedia, Wikipedia, Freebase, and GeoNames. More information at:
http://www.opencalais.com/

• Watson Natural Language Understanding: Full suite of advanced text analytics features to extract keywords, concepts (not necessarily directly referenced in the text), entities (people, places, events, and other types), categories (using a five-level classification hierarchy), sentiment (toward specific target phrases and of the document as a whole), emotions (conveyed by specific target phrases or by the document as a whole), relations (recognizing when two entities are related and identifying the type of relation), semantic roles (sentences are parsed into subject-action-object form, and the entities and keywords that are subjects or objects of an action are identified), and more, using natural language understanding. It currently supports 13 different languages. More information at:
https://www.ibm.com/cloud/watson-natural-language-understanding

• ARQ (Apache Jena): ARQ is a query engine for Jena that supports the SPARQL RDF query language. As previously explained, SPARQL is the query language that can be used to directly access and gather information from the LOD cloud. More information at:
https://jena.apache.org/documentation/query/
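To give a flavor of how the LOD cloud can be queried with SPARQL, the following sketch runs a simple query against the public DBpedia endpoint. ARQ itself is a Java engine; here the Python SPARQLWrapper library is used instead, purely for illustration, and the chosen resource, property, and endpoint availability are assumptions of the example rather than requirements of any of the tools above.

from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia SPARQL endpoint (availability and response times may vary).
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

# Retrieve the English abstract of the DBpedia resource for "Entity linking".
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Entity_linking> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:200])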

A.2 Datasets and Resources

A.2.1 Resources to Feed Endogenous Approaches

Endogenous approaches rely on NLP techniques, and they basically need huge amounts of textual content. In the following, we describe some sources of textual content that can be exploited to feed these methods.

• Wikipedia Dump: Dumps of Wikipedia are freely available and can be exploited to learn representations of words and documents through endogenous approaches. Dumps are available here:
https://dumps.wikimedia.org/backup-index.html

• Amazon Reviews Data: A huge set of Amazon reviews coming from heterogeneous domains of interest is available online.

These reviews can be exploited to learn word representations through endogenous approaches, as well as to directly process textual content through NLP techniques in order to extract relevant characteristics from the reviews that can be used for several tasks (e.g., to feed a content-based recommender system). Reviews are available here:
http://jmcauley.ucsd.edu/data/amazon/

Moreover, in order to skip the process of learning a representation from raw textual content, a common practice is to use pre-trained vector representations of words, also known as embeddings. To this purpose, the following resources can be adopted:

• Pre-trained word and phrase vectors from Google News: Model containing pre-trained vectors trained on part of the Google News dataset, consisting of about 100 billion words. The model, built using Word2Vec, contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [9]. A sketch showing how to load these vectors is given at the end of this section. The archive is available here:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

• Pre-trained entity vectors with Freebase naming: More than 1.4 million pre-trained entity vectors with naming from Freebase, which are particularly helpful for projects related to knowledge mining. Entity vectors are trained using Word2Vec on 100B words from Google News articles. The archive is available here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing

• Wikipedia 2014 + Gigaword 5: Pre-trained word vectors built using 6B tokens extracted from the Wikipedia 2014 dump and the English Gigaword Fifth Edition,⁶ a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium. The model, built using GloVe, contains 50/100/200/300-dimensional vectors for 400K words. For more information, please refer to:
https://nlp.stanford.edu/projects/glove/
The archive is available here:
http://nlp.stanford.edu/data/glove.6B.zip

• Common Crawl 42B: Pre-trained word vectors built using 42B tokens extracted from Common Crawl data.⁷ The model, built using GloVe, contains 300-dimensional vectors for 1.9M words. The archive is available here:
http://nlp.stanford.edu/data/glove.42B.300d.zip

• Common Crawl 840B: Pre-trained word vectors built using 840B tokens extracted from Common Crawl. The model, built using GloVe, contains 300-dimensional vectors for 2.2M words. The archive is available here:
http://nlp.stanford.edu/data/glove.840B.300d.zip

⁶ https://catalog.ldc.upenn.edu/LDC2011T07.
⁷ http://commoncrawl.org/.

• Twitter 2B: Pre-trained word vectors built using 2B tweets and 27B tokens. The model, built using GloVe, contains 25/50/100/200-dimensional vectors for 1.2M words. The archive is available here:
http://nlp.stanford.edu/data/glove.twitter.27B.zip

• Pre-trained word vectors induced from PubMed and PMC texts: Word vectors obtained using Word2Vec and provided in the Word2Vec binary format. They were induced from a large corpus of biomedical text combining PubMed⁸ and PMC texts. The archive is available here:
http://evexdb.org/pmresources/vec-space-models/
A set of word vectors induced on a combination of biomedical texts coming from PubMed and PMC and general-domain texts extracted from a recent English Wikipedia dump is available here:
http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin

• BioWordVec & BioSentVec: Pre-trained biomedical word and sentence embeddings built using 30M documents, 222M sentences, and 4.8B tokens from PubMed and the clinical notes from the MIMIC-III Clinical Database. The model, built using FastText, contains 200-dimensional word vectors. More information at:
https://github.com/ncbi-nlp/BioSentVec

• Lexical vector sets: Vectors trained using different methods (counting, Word2Vec, and dependency relations) on 112M words from the British National Corpus (BNC). The vectors are built using three different techniques: (i) counting word co-occurrences in a fixed context window; (ii) using Word2Vec with a skip-gram model; and (iii) using dependency relations from a parser as features. More information at:
http://www.marekrei.com/projects/vectorsets/

• Pre-trained word vectors of non-English languages: Pre-trained word vector models extracted from Wikipedia using Word2Vec and FastText for the following languages: Bengali, Catalan, Chinese, Danish, Dutch, Esperanto, Finnish, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Javanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, and Vietnamese. More information at:
https://github.com/Kyubyong/wordvectors

• Polyglot: Word embeddings for more than 100 languages built using their corresponding Wikipedia dumps. More information at:
https://sites.google.com/site/rmyeid/projects/polyglot

⁸ https://www.ncbi.nlm.nih.gov/pubmed/.

• Pre-trained word vectors for 157 languages: Pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using FastText. These models were trained using CBOW with position weights and contain 300-dimensional vectors [7]. More information at:
https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md

• WaCky (The Web-As-Corpus Kool Yinitiative): Corpora built by downloading text from the Web. There are different corpora for English, French, German, and Italian. More information at:
http://wacky.sslmit.unibo.it/doku.php?id=corpora

• Italian Word Embeddings: Word embeddings generated with two popular word representation models, Word2Vec and GloVe, trained on the Italian Wikipedia. More information at:
http://hlt.isti.cnr.it/wordembeddings/
Another resource is described in [16] and is available at:
https://goo.gl/YagBKT
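As anticipated above, the following is a minimal sketch of how pre-trained embeddings such as the Google News vectors can be loaded and used from Python with Gensim's KeyedVectors. The file name refers to the unpacked Google News archive listed above and is assumed to be available locally; GloVe text files need a small conversion to the Word2Vec format first (e.g., with gensim.scripts.glove2word2vec).

from gensim.models import KeyedVectors

# Load the binary Word2Vec file (use binary=False for plain-text vectors).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Nearest neighbors and pairwise similarity in the embedding space.
print(vectors.most_similar("semantics", topn=5))
print(vectors.similarity("movie", "film"))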

A.2.2 Resources to Feed Exogenous Approaches

Some LOD-aware versions of datasets to evaluate recommender systems can be easily found on the Web. Moreover, complete dumps or portions of the knowledge bases and knowledge graphs we have previously discussed are typically available and can be freely downloaded to locally manage the encoded information. The pointers to such resources follow:

• LOD-aware Datasets: A semantics-aware version of several state-of-the-art datasets in the area of recommender systems is available online. These datasets include a mapping of the items to their URIs in DBpedia. Dumps are available here:
https://github.com/sisinflab/LODrecsys-datasets

• DBpedia: The DBpedia Foundation periodically updates and makes available several versions of the information available in DBpedia. The dumps are typically split according to several criteria, such as the language of the content or the nature of the information encoded in the subsets of the dataset. Dumps are available here:
https://wiki.dbpedia.org/develop/datasets/downloads-2016-10

• Wikidata: The database of Wikidata is available online and can be downloaded in several formats, such as JSON, XML, and so on. The dump is available here:
https://www.wikidata.org/wiki/Wikidata:Database_download
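Once a dump (or a fragment of it) has been downloaded, it can be explored programmatically. The following sketch uses the Python rdflib library to parse a local RDF file and run a SPARQL query over it; the file name is a placeholder for any Turtle or N-Triples file extracted from the dumps above, and rdflib is chosen here for illustration only (full dumps are far too large for in-memory parsing and are better loaded into a triple store).

import rdflib

# Parse a small, locally extracted fragment of a DBpedia dump.
g = rdflib.Graph()
g.parse("dbpedia_sample.ttl", format="turtle")  # use format="nt" for N-Triples

# Count how often each property is used in the fragment.
query = """
    SELECT ?p (COUNT(*) AS ?n)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?n)
"""
for prop, count in g.query(query):
    print(prop, count)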

References

1. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa PP (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
2. Csomai A, Mihalcea R (2008) Linking documents to encyclopedic knowledge. IEEE Intell Syst 23(5):34–41. https://doi.org/10.1109/MIS.2008.86
3. Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. https://arxiv.org/abs/1810.04805
4. Eisner B, Rocktäschel T, Augenstein I, Bosnjak M, Riedel S (2016) emoji2vec: Learning emoji representations from their description. In: Ku L, Hsu JY, Li C (eds) Proceedings of the fourth international workshop on natural language processing for social media, SocialNLP@EMNLP 2016. Association for Computational Linguistics, pp 48–54
5. Ferragina P, Scaiella U (2012) Fast and accurate annotation of short texts with Wikipedia pages. IEEE Softw 29(1):70–75
6. Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI, pp 1606–1611
7. Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning word vectors for 157 languages. In: Proceedings of the international conference on language resources and evaluation (LREC 2018)
8. Mendes PN, Jakob M, García-Silva A, Bizer C (2011) DBpedia Spotlight: shedding light on the web of documents. In: Ghidini C, Ngomo AN, Lindstaedt SN, Pellegrini T (eds) Proceedings of the 7th international conference on semantic systems, I-SEMANTICS 2011. ACM, pp 1–8
9. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
10. Moro A, Raganato A, Navigli R (2014) Entity linking meets word sense disambiguation: a unified approach. Trans Assoc Comput Linguist 2:231–244
11. Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In: Empirical methods in natural language processing (EMNLP), pp 1532–1543
12. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of NAACL
13. Sparck-Jones K, Walker S, Robertson SE (2000) A probabilistic model of information retrieval: development and comparative experiments, Part 1 and Part 2. Inf Process Manag 36(6):779–840
14. Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: HLT-NAACL. The Association for Computational Linguistics, pp 252–259
15. Toutanova K, Manning CD (2000) Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: EMNLP. Association for Computational Linguistics, pp 63–70
16. Tripodi R, Li Pira S (2017) Analysis of Italian word embeddings. arXiv preprint arXiv:1707.08783