
Appendix: Available Tools and Resources

This appendix describes tools, datasets, and resources that can be used to implement and exploit the techniques described in this book.

A.1 Libraries and APIs

A.1.1 Libraries for Natural Language Processing

In Chap. 2, we described a classical pipeline for Natural Language Processing, which allows relevant features to be extracted from unstructured text. We started with the first step of the pipeline (Sect. 2.1.1), whose main goal is to identify and extract relevant keywords and phrases from the text, and we concluded with the syntactic analysis (Sect. 2.1.2), which aims at inferring information about the structure of the text and the role of each word in the text. In this section, we list a set of libraries which can be used to perform the aforementioned analysis:

• OpenNLP: Machine learning-based toolkit for the processing of natural language text, which supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection, and coreference resolution. More information at:
https://opennlp.apache.org/
http://opennlp.sourceforge.net/models-1.5/

• TextPro: Suite of modular NLP tools for the analysis of written texts in both Italian and English. The suite has been designed to integrate and reuse state-of-the-art NLP components developed by researchers at Fondazione Bruno Kessler.¹

¹ https://www.fbk.eu.


The current version of the tool suite provides functions ranging from tokenization to parsing and named entity recognition. The different modules included in TextPro have been evaluated in the context of several evaluation campaigns and international shared tasks, such as EVALITA² (PoS tagging, named entity recognition, and parsing for Italian) and Semeval 2010³ (keyphrase extraction from scientific articles in English). The architecture of TextPro is organized as a pipeline of processors, where each stage accepts data from an initial input (or from the output of a previous stage), executes a specific task, and outputs the resulting data (or sends it to the next stage). More information at:
http://textpro.fbk.eu/

• Stanford CoreNLP: A set of human language technology tools which can give the base forms of words and their part-of-speech tags, recognize whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, and so on. More information at:
http://nlp.stanford.edu/software/corenlp.shtml
http://corenlp.run/

• GATE: A full-lifecycle open-source solution for text processing. GATE is in active use for all types of computational task involving human language. It has a mature and extensive community of developers, users, educators, students, and scientists, and it is widely adopted by corporations, small and medium enterprises, research labs, and universities worldwide. More information at:
https://gate.ac.uk/

• UIMA: Apache UIMA (Unstructured Information Management applications) makes it possible to analyze large volumes of unstructured information in order to discover relevant knowledge. UIMA might ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at. It can perform a wide range of operations, such as language identification, sentence boundary detection, and entity detection (person/place names, etc.). More information at:
https://uima.apache.org/index.html

• SpaCy: Free, open-source library for advanced Natural Language Processing in Python. It can be used to build information extraction or natural language understanding systems, or to preprocess text for deep learning. More information at:
https://spacy.io/

² http://www.evalita.it.
³ http://semeval2.fbk.eu/semeval2.php?location=.

• Natural Language Toolkit (NLTK): Platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. More information at:
https://www.nltk.org/

Besides the complete libraries to perform the most common NLP tasks, we can use the following tools to perform some of the operations of the NLP pipeline:

• Snowball: A small string processing language designed for creating stemming algorithms for use in Information Retrieval. More information at:
http://snowballstem.org/

• Porter stemmer: One of the most famous rule-based stemming algorithms. For the English language, you can refer to:
http://snowball.tartarus.org/algorithms/porter/stemmer.html
http://snowballstem.org/algorithms/porter/stemmer.html
For the Italian language, you can refer to:
http://snowball.tartarus.org/algorithms/italian/stemmer.html
http://snowballstem.org/algorithms/italian/stemmer.html

• Morph-it!: Morphological resource (dictionary) for the Italian language. It is a lexicon of inflected forms with their lemma and morphological features. Hence, each word is assigned its lemma and various morphological features. Among the most important features, we have degree (positive, comparative, or superlative) for adjectives, and inflectional gender (feminine or masculine) and number (singular or plural) for both nouns and adjectives. More information at:
http://docs.sslmit.unibo.it/doku.php?id=resources:morph-it

• LemmaGen: A standardized open-source multilingual platform for lemmatization in 12 European languages. It is able to learn lemmatization rules for new languages when provided with existing (word form, lemma) pair examples. More information at:
http://lemmatise.ijs.si

• Stanford Log-linear Part-Of-Speech Tagger: Included in the Stanford CoreNLP suite, it reads text in some language and assigns parts of speech to each open-category word/token (noun, verb, adjective, …). Similarly to other POS taggers, it uses fine-grained POS tags, such as "noun-plural". This software is a Java implementation of the log-linear part-of-speech taggers described in [14, 15]. Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, a French tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language. The English taggers adopt the Penn tag set. The tagger is licensed under the GNU General Public License (v2 or later).

Open-source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. More information at:
http://nlp.stanford.edu/software/tagger.shtml

• Stanford Parser: Included in the Stanford CoreNLP suite, it is a probabilistic context-free grammar parser for English, which can be adapted to other languages such as Italian, Bulgarian, and Portuguese. It also includes a German parser based on the Negra corpus, a Chinese parser based on the Chinese Treebank, as well as Arabic parsers based on the Penn Arabic Treebank. The output of the parser is dependency structures (Universal Dependencies v1) as well as phrase structure trees. The types of parsers included in the package are a shift-reduce constituency parser and a neural network dependency parser; in addition, a tool for scoring generic dependency parses is provided. They are released under a dual license: open-source licensing is under the full GPL, which allows many free uses, while commercial licensing is available for distributors of proprietary software. More information at:
https://nlp.stanford.edu/software/lex-parser.shtml

• MaltParser: System for data-driven dependency parsing, which can be used to induce a parsing model from treebank data and to parse new data using an induced model. While a traditional parser generator constructs a parser given a grammar, a data-driven parser generator constructs a parser given a treebank. MaltParser is an implementation of inductive dependency parsing, where the syntactic analysis of a sentence amounts to the derivation of a dependency structure, and where inductive machine learning is used to guide the parser at nondeterministic choice points. The parsing methodology is based on (i) deterministic parsing algorithms for building labeled dependency graphs; (ii) history-based models for predicting the next parser action at nondeterministic choice points; and (iii) discriminative learning to map histories to parser actions. Parsers developed using MaltParser have many parameters that need to be optimized and have achieved state-of-the-art accuracy for a number of languages. More information at:
http://www.maltparser.org/

• SENNA: Software distributed under a noncommercial license, which performs NLP tasks such as part-of-speech tagging, chunking, named entity recognition, semantic role labeling, and syntactic parsing. SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance [1]. More information at:
https://ronan.collobert.com/senna/

• SyntaxNet: Library developed by Google for data-driven dependency parsing based on neural nets. It can train and run syntactic dependency parsing models. One model that it provides, called Parsey McParseface, offers a particularly good speed/accuracy trade-off.

Multilingual data are provided by the Universal Dependency Parsing (UDP) project.⁴ It is available as open-source software. More information at:
https://github.com/tensorflow/models/blob/master/research/syntaxnet/g3doc/syntaxnet-tutorial.md

• ACOPOST: Set of freely available POS taggers written in C, aiming for extreme portability and code correctness/safety. ACOPOST currently consists of four taggers which are based on different frameworks, and it provides a uniform environment for testing. More information at:
http://acopost.sourceforge.net/

• TreeTagger: Tool for annotating text with part-of-speech and lemma information for German, English, French, Italian, Danish, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian, Czech, Coptic, and Old French texts; it is adaptable to other languages if a lexicon and a manually tagged training corpus are available. More information at:
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

After the execution of the NLP pipeline, we can represent documents using the Vector Space Model (VSM). As described in Sect. 2.2, the VSM makes it possible to represent every document as a vector of term weights, and documents can then be compared using cosine similarity (a minimal sketch of this representation is given at the end of this section). To this purpose, the following libraries can be adopted:

• Apache Lucene: High-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. According to the principles of the VSM, Lucene is able to perform ranked searching (i.e., the best results are returned first); it provides many powerful query types, such as phrase queries, wildcard queries, proximity queries, and range queries, and it supports searching on semi-structured documents organized in different fields (e.g., title, author, contents). It allows different ranking models to be plugged in, including the Vector Space Model and Okapi BM25 [13]. More information at:
https://lucene.apache.org/core/

• Apache Solr: Highly reliable, scalable, and fault-tolerant open-source search platform built on Apache Lucene, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. More information at:
https://lucene.apache.org/solr/

• Elasticsearch: Scalable and near real-time search engine developed in Java. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is the most popular enterprise search engine, followed by Apache Solr.

⁴ http://universaldependencies.org/.

Elasticsearch uses Lucene and tries to make all its features available through the JSON and Java APIs. More information at:
https://www.elastic.co/
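As anticipated above, the following minimal sketch illustrates the classical pipeline and the VSM-based comparison using NLTK, assuming the library and its "punkt", "stopwords", and "averaged_perceptron_tagger" data packages have been installed. The helper functions and the toy documents are ours, introduced purely for illustration; the comparison uses raw term frequencies, whereas production systems such as Lucene rely on more refined weighting schemes (e.g., TF-IDF or BM25).

import math
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def preprocess(text):
    # Lexical analysis: tokenize, keep alphabetic tokens, drop stopwords, stem.
    tokens = nltk.word_tokenize(text.lower())
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

def cosine(tokens1, tokens2):
    # Cosine similarity between two term-frequency vectors (plain VSM).
    v1, v2 = Counter(tokens1), Counter(tokens2)
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
           math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

doc1 = "The tagger assigns a part-of-speech tag to every token in the text."
doc2 = "Part-of-speech tagging assigns tags to the tokens of a text."

# Syntactic-level annotation: POS tags for the first document.
print(nltk.pos_tag(nltk.word_tokenize(doc1)))
# VSM-style comparison of the two documents.
print(cosine(preprocess(doc1), preprocess(doc2)))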

A.1.2 Libraries for Encoding Endogenous Semantics

In this section, we list a set of libraries which can be used to learn semantics-aware content representations, using the endogenous techniques described in Chap. 3, such as Word2Vec, LSA, and Random Indexing. To this purpose, the following libraries can be adopted:

• S-Space: Collection of algorithms for building Semantic Spaces, as well as a highly scalable library for designing new algorithms. Distributional algorithms process text corpora and represent the semantics of words as high-dimensional feature vectors. More information at:
https://github.com/fozziethebeat/S-Space

• SemanticVectors: Package for creating semantic WordSpace models from free natural language text. Such models are designed to represent words and documents in terms of underlying concepts. They can be used for many semantic (concept-aware) matching tasks such as automatic thesaurus generation, knowledge representation, and concept matching. More information at:
https://github.com/semanticvectors/semanticvectors/wiki

• Gensim: Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It provides efficient implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), or Word2Vec deep learning. More information at:
http://radimrehurek.com/gensim/

• Word2Vec: Tool which provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. The tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representations of words. The resulting word vector file can be used as features in many NLP and machine learning applications (a small training sketch is given at the end of this section). More information at:
https://code.google.com/archive/p/word2vec/
https://github.com/wlin12/wang2vec

• GloVe: Global Vectors for Word Representation is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word–word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space [11]. More information at:
https://nlp.stanford.edu/projects/glove/

• FastText: Open-source, free, lightweight library that allows users to learn text representations and text classifiers. FastText builds on Word2Vec by learning vector representations for each word and the n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to training, it enables word embeddings to encode sub-word information. FastText vectors have been shown to be more accurate than Word2Vec vectors by a number of different measures. More information at:
https://fasttext.cc/

• Wikipedia2Vec: Tool for obtaining embeddings (i.e., vector representations) of words and entities from Wikipedia (e.g., concepts that have corresponding pages in Wikipedia). The tool learns embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. More information at:
https://github.com/wikipedia2vec/wikipedia2vec

• Wikipedia-based Explicit Semantic Analysis [6]: A library implementing the ESA technique as described by Gabrilovich and Markovitch. More information at:
https://github.com/pvoosten/explicit-semantic-analysis

• LexVec: Implementation of the LexVec model (similar to Word2Vec and GloVe) that achieves state-of-the-art results in multiple NLP tasks. More information at:
https://github.com/alexandres/lexvec

• Emoji2Vec: Pre-trained embeddings for all Unicode emojis, learned from their descriptions in the Unicode emoji standard⁵ [4]. The method maps emoji symbols into the same space as the 300-dimensional Google News Word2Vec embeddings, described in Sect. A.2. Thus, the resulting emoji2vec embeddings can be used in addition to 300-dimensional Word2Vec embeddings in any application. More information at:
https://github.com/uclmr/emoji2vec

• ELMo: Deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., to model polysemy).

⁵ http://www.unicode.org/emoji/charts/full-emoji-list.html.

These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment, and sentiment analysis. ELMo representations are contextual, i.e., the representation for each word depends on the entire context in which it is used, and deep, i.e., the word representations combine all layers of a deep pre-trained neural network [12]. More information at:
https://allennlp.org/elmo

• BERT: Bidirectional Encoder Representations from Transformers is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right contexts in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications [3]. More information at:
https://github.com/google-research/bert

Even though the following libraries are not specifically designed to encode endogenous semantics, they provide some useful services:

• ConvertVec: Tool for converting Word2Vec vectors between binary and plaintext formats. You can use it to convert pre-trained vectors to plaintext. More information at:
https://github.com/marekrei/convertvec

• t-SNE: t-Distributed Stochastic Neighbor Embedding, a tool for visualizing word embeddings in 2D. It implements a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. More information at:
http://lvdmaaten.github.io/tsne/
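As anticipated above, the following is a minimal sketch of how word embeddings can be learned with Gensim's Word2Vec implementation. It assumes Gensim 4.x (older versions use size instead of vector_size) and uses a toy list of pre-tokenized sentences in place of a real training corpus; meaningful embeddings of course require far larger corpora, such as the Wikipedia dumps listed in Sect. A.2.

from gensim.models import Word2Vec

# Toy corpus: a real corpus would contain millions of tokenized sentences.
corpus = [
    ["semantic", "vectors", "represent", "word", "meaning"],
    ["word", "embeddings", "capture", "distributional", "semantics"],
    ["documents", "are", "compared", "in", "a", "vector", "space"],
]

# Skip-gram model (sg=1) with 100-dimensional vectors.
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50)

print(model.wv["semantic"][:5])                # first components of a word vector
print(model.wv.similarity("word", "vectors"))  # cosine similarity between two terms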

A.1.3 Libraries for Encoding Exogenous Semantics

In this section, we list a set of libraries which can be used to learn semantics-aware content representations, using the exogenous techniques described in Chap. 4, which exploit the data encoded in structured and external knowledge sources, such as Wikipedia, BabelNet, or the Linked Open Data cloud, using approaches based on WSD or entity linking. To this purpose, the following libraries can be adopted:

• TAGME [5]: Powerful tool that is able to identify on-the-fly meaningful short phrases (called spots) in an unstructured text and link them to a pertinent Wikipedia page in a fast and effective way. This annotation process has implications which go far beyond the enrichment of the text with explanatory links, because it concerns the contextualization and, in some way, the understanding of the text. The main advantage of TAGME is the ability to annotate texts which are short and poorly composed, such as snippets coming from search engine result pages, tweets, news, and so on. More information at:
https://tagme.d4science.org/tagme/

• Wikify! [2]: Framework for text wikification, that is to say, for automatically cross-referencing documents with Wikipedia. The tool is able to identify important concepts in a text by using keyword extraction and to link them to the corresponding Wikipedia pages by exploiting WSD techniques. The system is trained on Wikipedia articles and learns to disambiguate and detect links in the same way as Wikipedia editors. More information at:
https://bitbucket.org/techtonik/wikify/

• Dexter: Open-source framework for entity linking that implements some popular algorithms and provides all the tools needed to develop any entity linking technique. More information at:
http://dexter.isti.cnr.it/

• Babelfy [10]: Novel integrated approach to entity linking and WSD. Given a lexicalized semantic network, e.g., BabelNet, the approach is based on three steps: (i) the automatic creation of semantic signatures, i.e., related concepts and named entities, for each vertex of the semantic network; (ii) the extraction of all the linkable fragments from a given text, listing all their possible meanings according to the semantic network; and (iii) the linking, based on a high-coherence densest subgraph algorithm. More information at:
http://babelfy.org

• DBpedia Spotlight [8]: It connects unstructured text to the Linked Open Data cloud by using DBpedia as a hub. The output is a set of Wikipedia articles related to a text, retrieved by following the URIs of the DBpedia instances. The annotation process works in four stages. First, the text is analyzed in order to select the phrases that may indicate a mention of a DBpedia resource; in this step, spots that are only composed of verbs, adjectives, adverbs, and prepositions are disregarded. Subsequently, a set of candidate DBpedia resources is built by mapping the spotted phrase to resources that are candidate disambiguations for that phrase. The disambiguation process then uses the context around the spotted phrase to decide on the best choice among the candidates. More information at:
https://www.dbpedia-spotlight.org/

• Open Calais: It exploits NLP and machine learning to find entities within documents. The main difference with respect to other entity recognizers is that Open Calais also returns facts and events hidden within the text.

Open Calais consists of three main components: (i) a named entity recognizer that identifies people, companies, and organizations; (ii) a fact recognizer that links the text with position tags, alliance, and person-political; and (iii) an event recognizer whose role is to identify sport, management, change events, labor actions, etc. Open Calais supports English, French, and Spanish, and its assets are currently linked to DBpedia, Wikipedia, Freebase, and GeoNames. More information at:
http://www.opencalais.com/

• Watson Natural Language Understanding: Full suite of advanced text analytics features to extract keywords, concepts (not necessarily directly referenced in the text), entities (people, places, events, and other types), categories (using a five-level classification hierarchy), sentiment (toward specific target phrases and of the document as a whole), emotions (conveyed by specific target phrases or by the document as a whole), relations (recognizing when two entities are related and identifying the type of relation), semantic roles (sentences are parsed into subject-action-object form, and the entities and keywords that are subjects or objects of an action are identified), and more, using natural language understanding. It currently supports 13 different languages. More information at:
https://www.ibm.com/cloud/watson-natural-language-understanding

• ARQ (Apache Jena): ARQ is a query engine for Jena that supports the SPARQL RDF query language. As previously explained, SPARQL is the query language that can be used to directly access and gather information from the LOD cloud. More information at:
https://jena.apache.org/documentation/query/
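To give a flavor of how the LOD cloud can be queried with SPARQL, the following sketch runs a simple query against the public DBpedia endpoint. ARQ itself is a Java engine; here the Python SPARQLWrapper library is used instead, purely for illustration, and the chosen resource, property, and endpoint availability are assumptions of the example rather than requirements of any of the tools above.

from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia SPARQL endpoint (availability and response times may vary).
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

# Retrieve the English abstract of the DBpedia resource for "Entity linking".
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {
      <http://dbpedia.org/resource/Entity_linking> dbo:abstract ?abstract .
      FILTER (lang(?abstract) = "en")
    }
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:200])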

A.2 Datasets and Resources

A.2.1 Resources to Feed Endogenous Approaches

Endogenous approaches rely on NLP techniques, and they basically need huge amounts of textual content. In the following, we describe some sources of textual content that can be exploited to feed these methods.

• Wikipedia Dump: Dumps of Wikipedia are freely available and can be exploited to learn representations of words and documents through endogenous approaches. Dumps are available here:
https://dumps.wikimedia.org/backup-index.html

• Amazon Reviews Data: A huge set of Amazon reviews coming from heterogeneous domains of interest is available online.

These reviews can be exploited to learn word representations through endogenous approaches, as well as to directly process textual content through NLP techniques in order to extract relevant characteristics from the reviews that can be used for several tasks (e.g., to feed a content-based recommender system). Reviews are available here:
http://jmcauley.ucsd.edu/data/amazon/

Moreover, in order to skip the process of learning a representation from raw textual content, a common practice is to use pre-trained vector representations of words, also known as embeddings. To this purpose, the following resources can be adopted:

• Pre-trained word and phrase vectors from Google News: Model containing pre-trained vectors trained on part of the Google News dataset, consisting of about 100 billion words. The model, built using Word2Vec, contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [9]. A sketch showing how to load these vectors is given at the end of this section. The archive is available here:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

• Pre-trained entity vectors with Freebase naming: More than 1.4 million pre-trained entity vectors with naming from Freebase, which are particularly helpful for projects related to knowledge mining. Entity vectors are trained using Word2Vec on 100B words from Google News articles. The archive is available here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing

• Wikipedia 2014 + Gigaword 5: Pre-trained word vectors built using 6B tokens extracted from the Wikipedia 2014 dump and the English Gigaword Fifth Edition,⁶ a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium. The model, built using GloVe, contains 50/100/200/300-dimensional vectors for 400K words. For more information, please refer to:
https://nlp.stanford.edu/projects/glove/
The archive is available here:
http://nlp.stanford.edu/data/glove.6B.zip

• Common Crawl 42B: Pre-trained word vectors built using 42B tokens extracted from Common Crawl data.⁷ The model, built using GloVe, contains 300-dimensional vectors for 1.9M words. The archive is available here:
http://nlp.stanford.edu/data/glove.42B.300d.zip

• Common Crawl 840B: Pre-trained word vectors built using 840B tokens extracted from Common Crawl. The model, built using GloVe, contains 300-dimensional vectors for 2.2M words. The archive is available here:
http://nlp.stanford.edu/data/glove.840B.300d.zip

⁶ https://catalog.ldc.upenn.edu/LDC2011T07.
⁷ http://commoncrawl.org/.

• Twitter 2B: Pre-trained word vectors built using 2B tweets and 27B tokens. The model, built using GloVe, contains 25/50/100/200-dimensional vectors for 1.2M words. The archive is available here:
http://nlp.stanford.edu/data/glove.twitter.27B.zip

• Pre-trained word vectors induced from PubMed and PMC texts: Word vectors obtained using Word2Vec and provided in the Word2Vec binary format. They were induced from a large corpus of biomedical text combining PubMed⁸ and PMC texts. The archive is available here:
http://evexdb.org/pmresources/vec-space-models/
A set of word vectors induced on a combination of biomedical texts coming from PubMed and PMC and general-domain texts extracted from a recent English Wikipedia dump is available here:
http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin

• BioWordVec & BioSentVec: Pre-trained biomedical word and sentence embeddings built using 30M documents, 222M sentences, and 4.8B tokens from PubMed and the clinical notes from the MIMIC-III Clinical Database. The model, built using FastText, contains 200-dimensional word vectors. More information at:
https://github.com/ncbi-nlp/BioSentVec

• Lexical vector sets: Vectors trained using different methods (counting, Word2Vec, and dependency relations) on 112M words from the British National Corpus (BNC). The vectors are built using three different techniques: (i) counting word co-occurrences in a fixed context window; (ii) using Word2Vec with a skip-gram model; and (iii) using dependency relations from a parser as features. More information at:
http://www.marekrei.com/projects/vectorsets/

• Pre-trained word vectors of non-English languages: Pre-trained word vector models extracted from Wikipedia using Word2Vec and FastText for the following languages: Bengali, Catalan, Chinese, Danish, Dutch, Esperanto, Finnish, French, German, Hindi, Hungarian, Indonesian, Italian, Japanese, Javanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, and Vietnamese. More information at:
https://github.com/Kyubyong/wordvectors

• Polyglot: Word embeddings for more than 100 languages built using their corresponding Wikipedia dumps. More information at:
https://sites.google.com/site/rmyeid/projects/polyglot

⁸ https://www.ncbi.nlm.nih.gov/pubmed/.

• Pre-trained word vectors for 157 languages: Pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using FastText. These models were trained using CBOW with position weights and contain 300-dimensional vectors [7]. More information at:
https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md

• WaCky (The Web-As-Corpus Kool Yinitiative): Corpora built by downloading text from the Web. There are different corpora for English, French, German, and Italian. More information at:
http://wacky.sslmit.unibo.it/doku.php?id=corpora

• Italian Word Embeddings: Word embeddings generated with two popular word representation models, Word2Vec and GloVe, trained on the Italian Wikipedia. More information at:
http://hlt.isti.cnr.it/wordembeddings/
Another resource is described in [16] and is available at:
https://goo.gl/YagBKT
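As anticipated above, the following is a minimal sketch of how pre-trained embeddings such as the Google News vectors can be loaded and used from Python with Gensim's KeyedVectors. The file name refers to the unpacked Google News archive listed above and is assumed to be available locally; GloVe text files need a small conversion to the Word2Vec format first (e.g., with gensim.scripts.glove2word2vec).

from gensim.models import KeyedVectors

# Load the binary Word2Vec file (use binary=False for plain-text vectors).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Nearest neighbors and pairwise similarity in the embedding space.
print(vectors.most_similar("semantics", topn=5))
print(vectors.similarity("movie", "film"))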

A.2.2 Resources to Feed Exogenous Approaches

Some LOD-aware versions of datasets to evaluate recommender systems can be easily found on the Web. Moreover, complete dumps or portions of the knowledge bases and knowledge graphs we have previously discussed are typically available and can be freely downloaded to locally manage the encoded information. The pointers to such resources follow:

• LOD-aware Datasets: A semantics-aware version of several state-of-the-art datasets in the area of recommender systems is available online. These datasets include a mapping of the items to their URIs in DBpedia. Dumps are available here:
https://github.com/sisinflab/LODrecsys-datasets

• DBpedia: The DBpedia Foundation periodically updates and makes available several versions of the information available in DBpedia. The dumps are typically split according to several criteria, such as the language of the content or the nature of the information encoded in the subsets of the dataset. Dumps are available here:
https://wiki.dbpedia.org/develop/datasets/downloads-2016-10

• Wikidata: The database of Wikidata is available online and can be downloaded in several formats, such as JSON, XML, and so on. The dump is available here:
https://www.wikidata.org/wiki/Wikidata:Database_download
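Once a dump (or a fragment of it) has been downloaded, it can be explored programmatically. The following sketch uses the Python rdflib library to parse a local RDF file and run a SPARQL query over it; the file name is a placeholder for any Turtle or N-Triples file extracted from the dumps above, and rdflib is chosen here for illustration only (full dumps are far too large for in-memory parsing and are better loaded into a triple store).

import rdflib

# Parse a small, locally extracted fragment of a DBpedia dump.
g = rdflib.Graph()
g.parse("dbpedia_sample.ttl", format="turtle")  # use format="nt" for N-Triples

# Count how often each property is used in the fragment.
query = """
    SELECT ?p (COUNT(*) AS ?n)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?n)
"""
for prop, count in g.query(query):
    print(prop, count)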

References

1. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa PP (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
2. Csomai A, Mihalcea R (2008) Linking documents to encyclopedic knowledge. IEEE Intell Syst 23(5):34–41. https://doi.org/10.1109/MIS.2008.86
3. Devlin J, Chang M, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. https://arxiv.org/abs/1810.04805
4. Eisner B, Rocktäschel T, Augenstein I, Bosnjak M, Riedel S (2016) emoji2vec: Learning emoji representations from their description. In: Ku L, Hsu JY, Li C (eds) Proceedings of the fourth international workshop on natural language processing for social media, SocialNLP@EMNLP 2016. Association for Computational Linguistics, pp 48–54
5. Ferragina P, Scaiella U (2012) Fast and accurate annotation of short texts with Wikipedia pages. IEEE Softw 29(1):70–75
6. Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI, pp 1606–1611
7. Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning word vectors for 157 languages. In: Proceedings of the international conference on language resources and evaluation (LREC 2018)
8. Mendes PN, Jakob M, García-Silva A, Bizer C (2011) DBpedia Spotlight: shedding light on the web of documents. In: Ghidini C, Ngomo AN, Lindstaedt SN, Pellegrini T (eds) Proceedings of the 7th international conference on semantic systems, I-SEMANTICS 2011. ACM, pp 1–8
9. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
10. Moro A, Raganato A, Navigli R (2014) Entity linking meets word sense disambiguation: a unified approach. Trans Assoc Comput Linguist 2:231–244
11. Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In: Empirical methods in natural language processing (EMNLP), pp 1532–1543
12. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of NAACL
13. Sparck-Jones K, Walker S, Robertson SE (2000) A probabilistic model of information retrieval: development and comparative experiments, Part 1 and Part 2. Inf Process Manag 36(6):779–840
14. Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: HLT-NAACL. The Association for Computational Linguistics, pp 252–259
15. Toutanova K, Manning CD (2000) Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In: EMNLP. Association for Computational Linguistics, pp 63–70
16. Tripodi R, Li Pira S (2017) Analysis of Italian word embeddings. arXiv preprint arXiv:1707.08783