Syntactic analyses and named entity recognition for PubMed and PubMed Central — up-to-the-minute


Kai Hakala [1,2]∗, Suwisa Kaewphan [1,2,3]∗, Tapio Salakoski [1,3] and Filip Ginter [1]

1. Dept. of Information Technology, University of Turku, Finland
2. The University of Turku Graduate School (UTUGS), University of Turku, Finland
3. Turku Centre for Computer Science (TUCS), Finland

∗ These authors contributed equally.

Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 102–107, Berlin, Germany, August 12, 2016. © 2016 Association for Computational Linguistics

Abstract

Although advanced text mining methods specifically adapted to the biomedical domain are continuously being developed, their applications on large scale have been scarce. One of the main reasons for this is the lack of computational resources and workforce required for processing large text corpora.

In this paper we present a publicly available resource distributing preprocessed biomedical literature including sentence splitting, tokenization, part-of-speech tagging, syntactic parses and named entity recognition. The aim of this work is to support the future development of large-scale text mining resources by eliminating the time-consuming but necessary preprocessing steps.

This resource covers the whole of PubMed and PubMed Central Open Access section, currently containing 26M abstracts and 1.4M full articles, constituting over 388M analyzed sentences. The resource is based on a fully automated pipeline, guaranteeing that the distributed data is always up-to-date. The resource is available at https://turkunlp.github.io/pubmed_parses/.

1 Introduction

Due to the rapid growth of biomedical literature, the maintenance of manually curated databases, usually updated following new discoveries published in articles, has become unfeasible. This has led to a significant interest in developing automated text mining methods specifically for the biomedical domain.

Various community efforts, mainly in the form of shared tasks, have resulted in steady improvement in biomedical text mining methods (Kim et al., 2009; Segura Bedmar et al., 2013). For instance, the GENIA shared tasks focusing on extracting biological events, such as gene regulations, have consistently gathered wide interest and have led to the development of several text mining tools (Miwa et al., 2012; Björne and Salakoski, 2013). These methods have also been successfully applied on a large scale and several biomedical text mining databases are publicly available (Van Landeghem et al., 2013a; Franceschini et al., 2013; Müller et al., 2004). Although these resources exist, their number does not reflect the vast amount of fundamental research invested in the underlying methods, mainly due to the non-trivial amount of manual labor and computational resources required to process large quantities of textual data. Another issue arising from the challenging text preprocessing is the lack of maintenance of the existing databases, which in effect nullifies the purpose of text mining as these resources tend to be almost as much out-of-date as their manually curated counterparts. According to MEDLINE statistics¹, 806,326 new articles were indexed during 2015, and thus a text mining resource will miss on average 67 thousand articles each month it has not been updated.

In this paper we present a resource aiming to support the development and maintenance of large-scale biomedical text mining. The resource includes all PubMed abstracts as well as full articles from the open access section of PubMed Central (PMCOA), with the fundamental language technology building blocks, such as part-of-speech (POS) tagging and syntactic parses, readily available. In addition, recognition of several biologically relevant named entities, such as proteins and chemicals, is included. Hence we hope that this resource eliminates the need for the tedious preprocessing involved in utilizing the PubMed data and allows swifter development of new information extraction databases.

The resource is constructed with an automated pipeline which provides weekly updates with the latest articles indexed in PubMed and PubMed Central, ensuring the timeliness of the distributed data. All the data is downloadable in an easily handleable XML format, also used by the widely adopted event extraction system TEES (Björne and Salakoski, 2015). A detailed description of this format is available on the website.

¹ https://www.nlm.nih.gov/bsd/bsd_key.html

2 Data

We use all publicly available literature from PubMed and PubMed Central Open Access subset, which cover most of the relevant literature and are commonly used as the prime source of data in biomedical text mining knowledge bases.

PubMed provides titles and abstracts in XML format in a collection of baseline release and subsequent updates. The former is available at the end of each year whereas the latter is updated daily. As this project was started during 2015, we have first processed the baseline release from the end of 2014 and this data has then been extended with the new publications from the end of 2015 baseline release. The rest of the data up to date has been collected from the daily updates.

The full articles in PMC Open Access subset (PMCOA) are retrieved via the PMC FTP service. Multiple types of data format are provided in PMCOA, including NXML and TXT formats which are suitable for text processing. We use the provided NXML format as it is compatible with our processing pipeline. This service does not provide distinct incremental updates, but a list of all indexed articles updated weekly.

3 Processing Pipeline

In this section, we discuss our processing pipeline as shown in Figure 1. Firstly, both PubMed and PMCOA documents are downloaded from NCBI FTP services. For the periodical updates of our resource this is done weekly, the same interval at which the official PMCOA dataset is updated. From the PubMed incremental updates we only include newly added documents and ignore other updates. As the PMCOA does not provide incremental updates, we use the index file and compare it to the previous file list to select new articles for processing.

Even though the PubMed and PMCOA documents are provided in slightly different XML formats, they can be processed in similar fashion. As a result, the rest of the pipeline discussed in this section is applied to both document types.

Both PubMed XML articles and PMCOA NXML full texts are preprocessed using publicly available tools² (Pyysalo et al., 2013). These tools convert XML documents to plain text and change character encoding from UTF-8 to ASCII as many of the legacy language processing tools are incapable of handling non-ASCII characters. Additionally, all excess metadata is removed, leaving titles, abstracts and full-text contents for further processing. These documents are subsequently split into sentences using GENIA sentence splitter (Sætre et al., 2007) as most linguistic analyses are done on the sentence level. GENIA sentence splitter is trained on biomedical text (GENIA corpus) and has state-of-the-art performance on this domain.

The whole data is parsed with the BLLIP constituent parser (Charniak and Johnson, 2005), using a model adapted for the biomedical domain (McClosky, 2010), as provided in the TEES processing pipeline. The distributed tokenization and POS tagging are also produced with the parser pipeline. We chose to use this tool as the performance of the TEES software has been previously evaluated on a large scale together with this parsing pipeline (Van Landeghem et al., 2013b) and it should be a reliable choice for biomedical relation extraction. Since dependency parsing has become the prevalent approach in modeling syntactic relations, we also provide conversions to the collapsed Stanford dependency scheme (De Marneffe et al., 2006).

The pipeline is run in parallel on a cluster computer with the input data divided into smaller batches. The size of these batches is altered along the pipeline to adapt to the varying computational requirements of the different tools.

² https://github.com/spyysalo/nxml2txt

3.1 Named Entity Recognition

Named entity recognition (NER) is one of the fundamental tasks in BioNLP as most of the crucial biological information is expressed as relations among entities such as genes and proteins. To support further development on this dataset, we provide named entity tagging for five entity types, […]

[…] SPECIES corpus being an exception. For this corpus we do our own data division with random sampling on document level, for each taxonomy category separately.

Entity type   Our system                      State-of-the-art system         References
              Precision / Recall / F-score    Precision / Recall / F-score
Cell line     89.88 / 84.36 / 87.03           91.67 / 85.47 / 88.46           (Kaewphan et al., 2016)
Chemical      85.27 / 82.92 / 84.08           89.09 / 85.75 / 87.39           (Leaman et al., 2015)
Disease*      86.32 / 80.83 / 83.49           82.80 / 81.90 / 80.90           (Leaman et al., 2013)
GGP**         74.27 / 72.99 / 73.62           90.22 / 84.82 / 87.17           (Campos et al., 2013)
Organism      77.15 / 80.15 / 78.63           83.90 / 72.60 / 77.80           (Pafilis et al., 2013)

Table 1: Evaluation of the named entity recognition for each entity type on the test sets, measured with strict entity level metrics. Reported results for corresponding state-of-the-art approaches are shown for comparison.
* The evaluation of the best performing system for disease mentions is the combination of named entity recognition and normalization.
** The official BioCreative II evaluation for our GGP model results in 84.67, 84.54 and 84.60 for precision, recall and F-score respectively. These numbers are comparable to the listed state-of-the-art method.
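
As a quick consistency check on Table 1, the F-scores in the "Our system" column can be recomputed from the listed precision and recall values. The sketch below assumes that the reported F-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the text does not state this explicitly. The names `f_score`, `OUR_SYSTEM` and `max_deviation` are illustrative and not part of the described resource.

```python
# Illustrative sketch: recompute F-scores for the "Our system" column of Table 1.
# Assumption (not stated in the text): F-score = balanced F1, the harmonic
# mean of precision and recall.


def f_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) exactly as listed in Table 1.
OUR_SYSTEM = {
    "Cell line": (89.88, 84.36, 87.03),
    "Chemical": (85.27, 82.92, 84.08),
    "Disease": (86.32, 80.83, 83.49),
    "GGP": (74.27, 72.99, 73.62),
    "Organism": (77.15, 80.15, 78.63),
}


def max_deviation(rows: dict) -> float:
    """Largest absolute gap between recomputed and reported F-scores."""
    return max(abs(f_score(p, r) - reported) for p, r, reported in rows.values())


if __name__ == "__main__":
    for entity, (p, r, reported) in OUR_SYSTEM.items():
        print(f"{entity}: recomputed {f_score(p, r):.2f}, reported {reported:.2f}")
```

Under this assumption the recomputed values agree with the table to within about 0.01; the small remaining differences are presumably due to rounding of the listed precision and recall.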
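
The update step described in Section 3, where the current PMCOA index file is compared to the previous file list and the selected articles are divided into batches, could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each file list is simply a sequence of article paths, and the function names, example paths and batch size are hypothetical.

```python
# Illustrative sketch of selecting new PMCOA articles and batching them.
# Assumptions: a file list is a sequence of article paths (one per indexed
# article); the example paths and the batch size are made up for illustration.
from typing import Iterable, Iterator, List


def select_new_articles(previous: Iterable[str], current: Iterable[str]) -> List[str]:
    """Return entries of the current list that were absent from the previous one."""
    seen = set(previous)
    # Preserve the order of the current index file.
    return [path for path in current if path not in seen]


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    previous_list = ["a/PMC0000001.nxml", "a/PMC0000002.nxml"]
    current_list = previous_list + [
        "b/PMC0000003.nxml",
        "b/PMC0000004.nxml",
        "b/PMC0000005.nxml",
    ]
    new_articles = select_new_articles(previous_list, current_list)
    for batch in batches(new_articles, size=2):
        print(batch)
```

Keeping the batch size as a parameter mirrors the remark that batch sizes are altered along the pipeline to match the computational requirements of each tool.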