
An Open System for

Gianpaolo Coro, Giancarlo Panichi, Pasquale Pagano
ISTI-CNR, via Moruzzi 1, Pisa, Italy
[email protected], [email protected], [email protected]

Abstract

Text mining (TM) techniques can extract high-quality information from big data through complex system architectures. However, these techniques are usually difficult to discover, install, and combine. Further, modern approaches to Science (e.g. Open Science) introduce new requirements to guarantee reproducibility, repeatability, and re-usability of methods and results as well as their longevity and sustainability. In this paper, we present a distributed system (NLPHub) that publishes and combines several state-of-the-art text mining services for named entities, events, and keywords recognition. NLPHub makes the integrated methods compliant with Open Science requirements and manages heterogeneous access policies to the methods. In the paper, we assess the benefits and the performance of NLPHub on the I-CAB corpus.1

1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons Attribution 4.0 International (CC BY 4.0).

1 Introduction

Today, text mining operates within the challenges introduced by big data and new Science paradigms, which require managing large volumes, high production rates, heterogeneous complexity, and unreliable content, while ensuring data and methods longevity through re-use in complex models and process chains. Among the new paradigms, Open Science (OS) focusses on the implementation in computer systems of the three "R"s of the scientific method: Reproducibility, Repeatability, and Re-usability (Hey et al., 2009; EU Commission, 2016). The systems envisaged by OS are based on Web service networks that support big data processing and the open publication of results. Although text mining techniques exist that can tackle big data experiments (Gandomi and Haider, 2015; Amado et al., 2018), few examples that incorporate OS concepts can be found (Linthicum, 2017). For example, common text mining "cloud" services do not allow easy repeatability of the experiments by different users and are usually domain-specific and thus poorly re-usable (Bontcheva and Derczynski, 2016; Adedugbe et al., 2018). Available multi-domain systems do not use communication standards (Bontcheva and Derczynski, 2016; Wei et al., 2016), and the few OS-oriented initiatives that use text mining focus specifically on document preservation and cataloguing (OpenMinTeD, 2019; OpenAire, 2019).

In this paper, we present a multi-domain text mining system (NLPHub) that is compliant with OS and combines multiple and heterogeneous processes. NLPHub is based on an e-Infrastructure (e-I), i.e. a network of hardware and software resources that allows remote users and services to collaborate while supporting data-intensive Science (Pollock and Williams, 2010; Andronico et al., 2011). Currently, NLPHub integrates 30 state-of-the-art text mining services and methods to recognize fragments of a text (annotations) associated with named or physical objects (named entities), spatiotemporal events, and keywords. These integrated processes cover overall 5 languages (English, Italian, German, French, and Spanish), requested by the European projects this software is involved in (Parthenos, 2019; SoBigData, 2019; Ariadne, 2019). These processes come from different providers that have different access policies, and the e-I is used both to manage this heterogeneity and to possibly speed up the processing through cloud computing. NLPHub uses the Web Processing Service standard (WPS, (Schut and Whiteside, 2007)) to describe all integrated processes, and the Prov-O XML ontological standard (Lebo et al., 2013) to track the complete set of input, output, and parameters used for the computations (provenance). Overall, these features enable OS-compliance, and we show that the orchestration mechanism implemented by NLPHub adds effectiveness and efficiency to the connected methods. The name "NLPHub" refers to the forthcoming extensions of this platform to other text mining methods (e.g. sentiment analysis and opinion mining) and natural language processing tasks (e.g. text-to-speech and speech processing).

2 Methods and tools

2.1 E-Infrastructure and Cloud Computing Platform

NLPHub uses the open-source D4Science e-I (Candela et al., 2013; Assante et al., 2019), which currently supports applications in many domains through the integration of a distributed storage system, a cloud computing platform, online collaborative tools, and catalogues of metadata and geospatial data. D4Science supports the creation of Virtual Research Environments (VREs) (Assante et al., 2016), i.e. Web-based environments fostering collaboration between users and managing heterogeneous data and services access policies. D4Science grants each user access to a private online file system (the Workspace) that uses a high-availability distributed storage system behind the scenes, and enables folder creation and sharing between VRE users. Through VREs and accounting and security services, D4Science is able to manage heterogeneous access policies by granting free access to open services in public VREs, and controlled/private access to non-open services in private or moderated VREs. D4Science includes a cloud computing platform named DataMiner (Coro et al., 2015; Coro et al., 2017) that currently hosts ∼400 processes and makes all integrated processes available under the WPS standard (Figure 1). WPS is supported by third-party software and allows standardising a process' input, its parameterisation, and its output. DataMiner executes the processes in a cloud computing cluster of 15 machines with Ubuntu 16.04.4 LTS x86_64 operating system, 16 virtual cores, 32 GB of RAM, and 100 GB of disk space. These machines are hosted by the National Research Council of Italy (CNR) and the Italian Research and Education Network (GARR). Each process can parallelise an execution either across the machines (using a Map-Reduce approach) or on the cores of one single machine (Coro et al., 2017). After each computation, DataMiner saves - on the user's Workspace - all the information about the input and output data and the experiment's parameters (computational provenance) using the Prov-O XML standard. In each D4Science VRE, DataMiner offers an online tool to integrate algorithms, which supports many programming languages (Coro et al., 2016).

Figure 1: Overall architectural schema of the NLPHub.

All these features make D4Science useful to develop OS-compliant applications, because WPS and provenance tracking allow repeating and reproducing a computation executed by another user. Also, the possibility to provide a process in multiple VREs focussing on different domains fosters its re-usability (Coro et al., 2017). In this paper, we will use the term "algorithm" to indicate processes running on DataMiner, and "method" to indicate the original processes or services integrated with DataMiner.

2.2 Annotations

NLPHub integrates a number of named entity recognizers (NERs) but also information extraction processes that recognize events, keywords, tokens, and sentences. Overall, we will use the term "annotation" to indicate all the information that NLPHub can extract from a text. The complete list of supported annotations, languages, and processes is reported in the supplementary material, together with the list of all mentioned Web services' endpoints. The ontological classes used for NER annotations come from the Stanford CoreNLP software. Included non-standard annotations are "Misc" (miscellaneous concepts that cannot be associated with any of the other classes, e.g. "Bachelor of Science"), "Event" (nouns, verbs, or phrases referring to a phenomenon occurring at a certain time and/or space), and "Keyword" (a word or a phrase that is of great importance to understand the text content).

2.3 Integrated Text Mining Methods

NLPHub uses a common JSON format to represent the annotations of every integrated method. This format describes the input text, the NER processes, and the annotations for each NER:

    {
      "text": "input text",
      "NER_1": {
        "annotations": {
          "annotation_1": [
            {"indices": [i1, i2]},
            {"indices": [i3, i4]},
            ...

We integrated services and methods with DataMiner through "wrapping algorithms" that transform the original outputs into this format. We implemented a general workflow in each algorithm to execute the corresponding integrated method, which adopts the following steps: (i) receive an input text file and a list of entities to recognize (among those supported by the language), (ii) pre-process the text by deleting useless characters, (iii) encode the text with UTF-8 encoding, (iv) send the text via HTTP-POST to the corresponding service or execute the method on the local machine directly, if possible, and (v) return the annotation as an NLPHub-compliant JSON document. In the following, we list all the methods currently integrated with NLPHub, with reference to Figure 1 for an architectural view.

CoreNLP. The Stanford CoreNLP software (Manning et al., 2014) is an open-source text processing toolkit that supports several languages (Stanford University, 2019). NLPHub integrates a CoreNLP instance running within D4Science with English, German, French, and Spanish language packages enabled. Also, the Tint (The Italian NLP Tool) extension for Italian (Aprosio and Moretti, 2016) was installed as a separate service. Overall, two distinct replicated and balanced virtual machines host these services on machines with 10 GB of RAM and 6 cores.

GATE Cloud. GATE Cloud is a cloud service that offers on-payment text analysis methods as-a-service (GATE Cloud, 2019a; Tablan et al., 2011). NLPHub integrates the GATE Cloud ANNIE NER for English, German, and French within a controlled VRE that accounts for users' request load. This VRE ensures a fair usage of the services, whose access has been freely granted to D4Science in exchange for enabling OS-oriented features (SoBigData European Project, 2016).

OpenNLP. The Apache OpenNLP library is an open-source text processing toolkit mostly based on machine learning models (Kottmann et al., 2011). An OpenNLP-based English NER is available as-a-service on GATE Cloud (GATE Cloud, 2019b) and is included among the free-to-use services granted to D4Science.

ItaliaNLP. ItaliaNLP is a free-to-use service - developed by the "Istituto di Linguistica Computazionale" (ILC-CNR) - hosting a NER method for Italian that combines rule-based and machine learning algorithms (ILC-CNR, 2019; Dell'Orletta et al., 2014).

NewsReader. NewsReader is an advanced event recognizer for 4 languages, developed by the NewsReader European project (Vossen et al., 2016). NewsReader is a formal inferencing system that identifies events by detecting their participants and time-space constraints. Two balanced virtual machines were installed in D4Science for the English and Italian NewsReader versions.

TagMe. TagMe is a service for identifying short phrases (anchors) in a text that can be linked to pertinent Wikipedia pages (Ferragina and Scaiella, 2010). TagMe supports 3 languages (English, Italian, and German) and D4Science already hosts its official instances. Since anchors are sequences of words having a recognized meaning within their context, NLPHub interprets them as keywords that can help contextualising and understanding the text.

Keywords NER. Keywords NER is an open-source statistical method that produces tag clouds of verbs and nouns (Coro, 2019a), which was also used by the H-Care award-winning human digital assistant (SpeechTEK 2010, 2019). Tag clouds are extracted through a statistical analysis of part-of-speech (POS) tags (extracted with TreeTagger (Schmid, 1995)), and the method can be applied to all the 23 TreeTagger-supported languages. Keywords NER is executed directly on the DataMiner machines, and the noun tags are interpreted as keywords for the NLPHub scopes because - by construction - their sequence is useful to understand the topics treated by a text.

Language Identifier. NLPHub also provides a language identification process (Coro, 2019b), should language information not be specified as input. This process was developed in order to be fast and easily extendible to new languages. The algorithm is based on an empirical behaviour of TreeTagger (common to many POS taggers): when TreeTagger is initialised on a certain language but processes a text written in another language, it tends to detect many more nouns and unstemmed words than verbs and other lexical categories. Thus, the detected language is the one having the most balanced ratio of recognized and stemmed words with respect to other lexical categories. This algorithm is applicable to many languages supported by TreeTagger and can run on the DataMiner machines directly. An estimated accuracy of 95% on 100 sample text files covering the 5 NLPHub languages was convincing enough to use this algorithm as an auxiliary tool for the NLPHub users.

2.4 NLPHub

On top of the methods and services described so far, we implemented an alignment-merging algorithm (AMERGE) that orchestrates the computations and assembles their outputs. AMERGE receives a user-provided input text, along with the indication of the text language (optionally), and a set of annotations to be extracted (selected among those supported for that language). Then, it concurrently invokes - via WPS - the text processing algorithms that support the input request, and eventually collects the JSON documents coming from them. Finally, it aligns and merges the information to produce one overall sequence represented in JSON format. The issue of merging the heterogeneous connected services' outputs is solved through the use of the DataMiner wrapping algorithms. Another solved issue is the merging of the different intervals identified by several algorithms focusing on the same entities. These intervals may either overlap or be mutually inclusive, and the alignment algorithm manages all cases through algebraic evaluations, as reported in the following pseudo-code:

    AMERGE Algorithm
    For each annotation E:
        Collect all annotations detected by the algorithms
          (intervals with text start and end positions);
        Sort the intervals by their start position;
        For each segment si:
            If sj is properly included in si, process the next sj;
            If si does not intersect sj, break the loop;
            If si intersects sj, create a new segment sui as the
              union of the two segments → substitute sui for si
              and restart the loop on sj;
            Save si in the overall list of merged intervals S;
        Associate S to E;
    Return all (E, S) pair sets.

Since the AMERGE algorithm is a DataMiner algorithm, it is published as-a-service with a RESTful WPS interface. It represents one single access point to the services integrated with NLPHub. In order to invoke this service, a client should specify an authorization code in the HTTP request that identifies both the invoking user and the VRE (CNR, 2016). The available annotations and methods depend on the VRE. An additional service (NLPHub-Info) allows retrieving the list of supported entities for a VRE, given a user's authorization code. NLPHub is also endowed with a free-to-use Web interface (nlp.d4science.org/hub/), based on a public VRE, operating on top of the AMERGE process, which allows interacting with the system and retrieving the annotations in a graphical format.

3 Results

We assessed the NLPHub performance by using the I-CAB corpus as a reference (Magnini et al., 2006), which contains annotations of the following named-entity categories from 527 Italian newspapers: Person, Location, Organization, and Geopolitical entity.

Algorithm         | Person F/P/R (Agreement) | Geopolitical F/P/R (Agreement) | Location F/P/R (Agreement) | Organization F/P/R (Agreement)
ItaliaNLP         | 79%/74%/84% (Excellent)  | 77%/74%/80% (Good)             | 59%/52%/69% (Good)         | 58%/52%/66% (Good)
CoreNLP-Tint      | 85%/78%/93% (Excellent)  | NA                             | 30%/18%/84% (Marginal)     | 65%/53%/83% (Good)
AMERGE            | 84%/74%/96% (Excellent)  | 77%/74%/80% (Good)             | 31%/19%/88% (Marginal)     | 63%/49%/87% (Good)
Keywords NER      | 20%/12%/56% (Marginal)   | 14%/8%/66% (Marginal)          | 6%/3%/58% (Marginal)       | 22%/13%/66% (Marginal)
TagMe             | 23%/18%/30% (Marginal)   | 33%/22%/67% (Marginal)         | 9%/5%/42% (Marginal)       | 25%/19%/38% (Marginal)
AMERGE - Keywords | 20%/12%/69% (Marginal)   | 18%/10%/91% (Marginal)         | 6%/3%/74% (Marginal)       | 22%/13%/79% (Marginal)

Table 1: Performance assessment of the NLPHub algorithms with respect to the I-CAB corpus annotations (F = F-measure, P = Precision, R = Recall).
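The precision, recall, and F-measure columns of Table 1 follow the standard span-comparison scheme. The sketch below illustrates that scheme with exact-match scoring only; the helper name and data layout are hypothetical, and the actual I-CAB evaluation may credit partial matches differently.

```python
def span_scores(predicted, reference):
    """Exact-span precision, recall, and F-measure.

    `predicted` and `reference` are sets of (start, end, entity)
    triples. Illustrative only, not the paper's evaluation code.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)          # spans found in both sets
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)   # harmonic mean of P and R
    return precision, recall, f_measure
```

For example, one correct span out of two predictions against a two-span reference yields P = R = F = 0.5.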

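For each entity type, the alignment step of AMERGE (Section 2.4) reduces to merging overlapping or nested intervals into their union. A minimal Python sketch under that reading follows; the function names and data layout are hypothetical, not the authors' DataMiner implementation.

```python
def merge_intervals(intervals):
    """Fuse overlapping or nested [start, end] spans into their union,
    mirroring the AMERGE pseudo-code (illustrative sketch only)."""
    merged = []
    for start, end in sorted(intervals):             # sort by start position
        if merged and start <= merged[-1][1]:        # intersects or is contained
            merged[-1][1] = max(merged[-1][1], end)  # substitute the union segment
        else:
            merged.append([start, end])              # disjoint: open a new segment
    return [tuple(span) for span in merged]


def amerge(annotations_per_algorithm):
    """Combine {algorithm: {entity: [(i1, i2), ...]}} maps into one
    {entity: merged spans} result, one merged list per annotation type."""
    collected = {}
    for spans_by_entity in annotations_per_algorithm.values():
        for entity, spans in spans_by_entity.items():
            collected.setdefault(entity, []).extend(spans)
    return {entity: merge_intervals(spans)
            for entity, spans in collected.items()}
```

With two recognizers that report the overlapping Person spans (0, 5) and (3, 8), the merged result contains the single union span (0, 8), which is why AMERGE's recall can exceed that of each connected algorithm.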
NLPHub was executed to annotate these same entities plus Keywords (Table 1). The involved algorithms were CoreNLP-Tint, ItaliaNLP, Keywords NER, and TagMe. According to the F-measure, CoreNLP-Tint was the best at recognizing Persons and Organizations, whereas ItaliaNLP - the only one supporting Geopolitical entities - had the highest performance on Locations and a moderately-high performance on Geopolitical entities. Overall, the connected methods showed high performance on specific entities, but there was not one method outperforming the others on all entities. AMERGE had a lower but good F-measure and a generally high recall in all cases, which indicates that the connected algorithms include complementary and valuable intervals. The AMERGE-Keywords algorithm had a generally high recall (especially on Geopolitical entities), which means that the extracted keywords also include words from the annotated entities. The associated F-measures indicate that there is overlap with several entities. In turn, this indicates that AMERGE-Keywords could be a valuable source of information in the case of uncertainty about the entities that can be extracted from a text.

As a further evaluation, we used Cohen's Kappa (Cohen, 1960) to explore the agreement between the algorithms and the I-CAB annotations. This measure required estimating the overall number of classifiable tokens, thus it is more realistic to refer to Fleiss' Kappa macro classifications rather than to the exact values (Fleiss, 1971). According to Fleiss' labels, all NERs generally have good agreement with I-CAB except for Locations, which are often reported as Geopolitical entities in I-CAB. This evaluation also highlights that AMERGE has good general agreement with the manual annotations, and thus can be a valid choice when there is no prior knowledge about the algorithm to use for extracting a certain entity.

4 Conclusions

We have described NLPHub, a distributed system connecting and combining 30 text processing methods for 5 languages that adds Open Science-oriented features to these methods. The advantages of using NLPHub are several, starting from the fact that it provides one single access endpoint to several methods and spares installation and configuration time. Further, it proposes the AMERGE process as a valid option when the best performing algorithm for a certain entity extraction is not known a priori. Also, the AMERGE-Keywords annotations can be used when the entities to extract are not known. Indeed, these features would require more investigation, especially through multiple-language experiments, in order to define their full potential and limitations. Finally, NLPHub adds to the original methods features like WPS and Web interfaces, provenance management, results sharing, and access/usage policies control, which make the methods more compliant with the Open Science requirements.

The potential users of NLPHub are scholars who want to use NERs but also want to avoid software- and hardware-related issues, or automatic agents that need to automatically extract and re-use knowledge from large quantities of texts. For example, NLPHub can be used in automatic ontology population and - since it also supports Events extraction - automatic narratives generation (Petasis et al., 2011; Metilli et al., 2019). Future extensions of NLPHub will involve other text mining methods (e.g. sentiment analysis, opinion mining, and morphological parsing), and additional NLP tasks like text-to-speech and speech processing as-a-service.

Supplementary Material

Supplementary material is available on D4Science at this permanent hyper-link.

References

[Adedugbe et al.2018] Oluwasegun Adedugbe, Elhadj Benkhelifa, and Russell Campion. 2018. A cloud-driven framework for a holistic approach to semantic annotation. In 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 128–134. IEEE.

[Amado et al.2018] Alexandra Amado, Paulo Cortez, Paulo Rita, and Sérgio Moro. 2018. Research trends on big data in marketing: A text mining and topic modeling based literature analysis. European Research on Management and Business Economics, 24(1):1–7.

[Andronico et al.2011] Giuseppe Andronico, Valeria Ardizzone, Roberto Barbera, Bruce Becker, Riccardo Bruno, Antonio Calanducci, Diego Carvalho, Leandro Ciuffo, Marco Fargetta, Emidio Giorgio, et al. 2011. e-Infrastructures for e-Science: a global view. Journal of Grid Computing, 9(2):155–184.

[Aprosio and Moretti2016] Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. arXiv preprint arXiv:1609.06204.

[Coro et al.2015] Gianpaolo Coro, Leonardo Candela, Pasquale Pagano, Angela Italiano, and Loredana Liccardo. 2015. Parallelizing the execution of native data mining algorithms for computational biology. Concurrency and Computation: Practice and Experience, 27(17):4630–4644.

[Coro et al.2016] Gianpaolo Coro, Giancarlo Panichi, and Pasquale Pagano. 2016. A Web application to publish R scripts as-a-service on a cloud computing platform. Bollettino di Geofisica Teorica ed Applicata, 57:51–53.

[Coro et al.2017] Gianpaolo Coro, Giancarlo Panichi, Paolo Scarponi, and Pasquale Pagano. 2017. Cloud computing in a distributed e-infrastructure using the web processing service standard. Concurrency and Computation: Practice and Experience, 29(18):e4219.

[Coro2019a] Gianpaolo Coro. 2019a. The Keywords Tag Cloud Algorithm. https://svn.research-infrastructures.eu/public/d4science/gcube/trunk/data-analysis/LatentSemanticAnalysis/.

[Ariadne2019] Ariadne. 2019. The AriadnePlus European Project. https://ariadne-infrastructure.eu/.

[Assante et al.2016] Massimiliano Assante, Leonardo Candela, Donatella Castelli, Gianpaolo Coro, Lucio Lelii, and Pasquale Pagano. 2016. Virtual research environments as-a-service by gCube. PeerJ Preprints, 4:e2511v1.

[Assante et al.2019] Massimiliano Assante, Leonardo Candela, Donatella Castelli, Roberto Cirillo, Gianpaolo Coro, Luca Frosini, Lucio Lelii, Francesco Mangiacrapa, Valentina Marioli, Pasquale Pagano, et al. 2019. The gCube system: Delivering virtual research environments as-a-service. Future Generation Computer Systems, 95:445–453.

[Bontcheva and Derczynski2016] Kalina Bontcheva and Leon Derczynski. 2016. Extracting information from social media with GATE. In Working with Text, pages 133–158.

[Candela et al.2013] Leonardo Candela, Donatella Castelli, Gianpaolo Coro, Pasquale Pagano, and Fabio Sinibaldi. 2013. Species distribution modeling in the cloud. Concurrency and Computation: Practice and Experience.

[Coro2019b] Gianpaolo Coro. 2019b. The Language Identifier Algorithm. hyper-link.

[Dell'Orletta et al.2014] Felice Dell'Orletta, Giulia Venturi, Andrea Cimino, and Simonetta Montemagni. 2014. T2K^2: a system for automatically extracting and organizing knowledge from texts. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).

[EU Commission2016] EU Commission. 2016. Open science (open access). https://ec.europa.eu/programmes/horizon2020/en/h2020-section/open-science-open-access.

[Ferragina and Scaiella2010] Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625–1628. ACM.

[Fleiss1971] Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

[CNR2016] CNR. 2016. gCube WPS thin clients. https://wiki.gcube-system.org/gcube/How_to_Interact_with_the_DataMiner_by_client.

[Gandomi and Haider2015] Amir Gandomi and Murtaza Haider. 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2):137–144.

[Cohen1960] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

[GATE Cloud2019a] GATE Cloud. 2019a. GATE Cloud: Text Analytics in the Cloud. https://cloud.gate.ac.uk/.

[GATE Cloud2019b] GATE Cloud. 2019b. OpenNLP English Pipeline. https://cloud.gate.ac.uk/shopfront/displayItem/opennlp-english-pipeline.

[Hey et al.2009] Tony Hey, Stewart Tansley, Kristin M. Tolle, et al. 2009. The fourth paradigm: data-intensive scientific discovery, volume 1. Microsoft Research, Redmond, WA.

[Pollock and Williams2010] Neil Pollock and Robin Williams. 2010. E-infrastructures: How do we know and understand them? Strategic ethnography and the biography of artefacts. Computer Supported Cooperative Work (CSCW), 19(6):521–556.

[ILC-CNR2019] ILC-CNR. 2019. The ItaliaNLP REST Service. http://api.italianlp.it/docs/.

[Kottmann et al.2011] J. Kottmann, B. Margulies, G. Ingersoll, I. Drost, J. Kosin, J. Baldridge, T. Goetz, T. Morton, W. Silva, A. Autayeu, et al. 2011. Apache OpenNLP. www.opennlp.apache.org.

[Lebo et al.2013] Timothy Lebo, Satya Sahoo, Deborah McGuinness, Khalid Belhajjame, James Cheney, David Corsar, Daniel Garijo, Stian Soiland-Reyes, Stephan Zednik, and Jun Zhao. 2013. PROV-O: The PROV ontology. W3C Recommendation, 30.

[Schmid1995] Helmut Schmid. 1995. TreeTagger - a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 43:28.

[Schut and Whiteside2007] Peter Schut and A. Whiteside. 2007. OpenGIS Web Processing Service. OGC project document. http://www.opengeospatial.org/standards/wps.

[SoBigData European Project2016] SoBigData European Project. 2016. Deliverable D2.7 - IP principles and business models. http://project.sobigdata.eu/material.

[Linthicum2017] David S. Linthicum. 2017. Cloud computing changes data integration forever: What's needed right now. IEEE Cloud Computing, 4(3):50–53.

[Magnini et al.2006] Bernardo Magnini, Emanuele Pianta, Christian Girardi, Matteo Negri, Lorenza Romano, Manuela Speranza, Valentina Bartalesi, and Rachele Sprugnoli. 2006. I-CAB: the Italian Content Annotation Bank. In LREC, pages 963–968. Citeseer.

[Manning et al.2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

[Metilli et al.2019] Daniele Metilli, Valentina Bartalesi, and Carlo Meghini. 2019. Steps towards a system to extract. In Proceedings of the Text2Story 2019 Workshop. Springer.

[OpenAire2019] OpenAire. 2019. European project supporting Open Access. https://www.openaire.eu/.

[OpenMinTeD2019] OpenMinTeD. 2019. Open Mining INfrastructure for TExt and Data. https://cordis.europa.eu/project/rcn/194923/factsheet/en.

[Parthenos2019] Parthenos. 2019. The Parthenos European Project. http://www.parthenos-project.eu/.

[SoBigData2019] SoBigData. 2019. The SoBigData European Project. http://sobigdata.eu/index.

[SpeechTEK 20102019] SpeechTEK 2010. 2019. SpeechTEK 2010 - H-Care Avatar wins People's Choice Award. http://web.archive.org/web/20160919100019/http://www.speechtek.com/europe2010/avatar/.

[Stanford University2019] Stanford University. 2019. Stanford CoreNLP - Human Languages Supported. https://stanfordnlp.github.io/CoreNLP/.

[Tablan et al.2011] Valentin Tablan, Ian Roberts, Hamish Cunningham, and Kalina Bontcheva. 2011. GATECloud.net: Cloud Infrastructure for Large-Scale, Open-Source Text Processing. In UK e-Science All Hands Meeting.

[Vossen et al.2016] Piek Vossen, Rodrigo Agerri, Itziar Aldabe, Agata Cybulska, Marieke van Erp, Antske Fokkens, Egoitz Laparra, Anne-Lyse Minard, Alessio Palmero Aprosio, German Rigau, et al. 2016. NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news. Knowledge-Based Systems, 110:60–85.

[Wei et al.2016] Chih-Hsuan Wei, Robert Leaman, and Zhiyong Lu. 2016. Beyond accuracy: creating interoperable and scalable text-mining web services. Bioinformatics, 32(12):1907–1910.

[Petasis et al.2011] Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Anastasia Krithara, and Elias Zavitsanos. 2011. Ontology population and enrichment: State of the art. In Knowledge-driven multimedia information extraction and ontology evolution, pages 134–166. Springer-Verlag.