Information Retrieval and Text Mining Technologies for Chemistry
Review, pubs.acs.org/CR

Martin Krallinger,†,○ Obdulia Rabal,‡,○ Anália Lourenço,§,∥,⊥ Julen Oyarzabal,*,‡ and Alfonso Valencia*,#,∇,■

† Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
‡ Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
§ ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
∥ Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
⊥ CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
# Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
∇ Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
■ Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain

ABSTRACT: Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information.
In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

Received: December 30, 2016
DOI: 10.1021/acs.chemrev.6b00851
Chem. Rev. XXXX, XXX, XXX−XXX
© XXXX American Chemical Society

CONTENTS

1. Introduction
1.1. Chemical Information: What
1.2. Chemical Information: Where
1.2.1. Journal Literature and Conference Papers, Reports, Dissertations, Books, and Others
1.2.2. Patents
1.2.3. Regulatory Reports and Competitive Intelligence Tools
2. Chemical Information Retrieval
2.1. Unstructured Data Repositories and Characteristics
2.2. Preprocessing and Chemical Text Tokenization
2.2.1. Document Transformation
2.2.2. Character Encoding
2.2.3. Document Segmentation
2.2.4. Sentence Splitting
2.2.5. Tokenizers and Chemical Tokenizers
2.3. Document Indexing and Term Weighting
2.3.1. Term Weighting
2.4. Information Retrieval of Chemical Data
2.4.1. Boolean Search Queries
2.4.2. Using Metadata for Searching and Indexed Entries
2.4.3. Query Expansion
2.4.4. Keyword Searching and Subject Searching
2.4.5. Vector Space Retrieval Model and Extended Boolean Model
2.4.6. Proximity Searches
2.4.7. Wildcard Queries
2.4.8. Autocompletion
2.4.9. Spelling Corrections
2.4.10. Phrase Searches, Exact Phrase Searching, or Quoted Search
2.5. Supervised Document Categorization
2.5.1. Text Classification Overview
2.5.2. Text Classification Algorithms
2.5.3. Text Classification Challenges
2.5.4. Documents Clustering
2.6. BioCreative Chemical Information Retrieval Assessment
2.6.1. Evaluation Metrics
2.6.2. Community Challenges for IR
3. Chemical Entity Recognition and Extraction
3.1. Entity Definition
3.2. Historical Work on Named Entities
3.3. Methods for Chemical Entity Recognition
3.3.1. General Factors Influencing NER and CER
3.3.2. Chemical and Drug Names
3.3.3. Challenges for CER
3.3.4. General Flowchart of CER
3.4. Dictionary Lookup of Chemical Names
3.4.1. Definition and Background
3.4.2. Lexical Resources for Chemicals
3.4.3. Types of Matching Algorithms
3.4.4. Problems with Dictionary-Based Methods
3.5. Pattern and Rule-Based Chemical Entity Detection
3.6. Supervised Machine Learning Chemical Recognition
3.6.1. Definition, Types, and Background
3.6.2. Machine Learning Models
3.6.3. Data Representation for Machine Learning
3.6.4. Feature Types, Representation, and Selection
3.6.5. Phases: Training and Test
3.6.6. Shortcomings with ML-Based CER
3.7. Hybrid Entity Recognition Workflows
3.8. Annotation Standards and Chemical Corpora
3.8.1. Definition, Types, and Background
3.8.2. Biology Corpora with Chemical Entities
3.8.3. Chemical Text Corpora
3.8.4. CHEMDNER Corpus and CHEMDNER Patents Corpus
3.9. BioCreative Chemical Entity Recognition Evaluation
3.9.1. Background
3.9.2. CHEMDNER
4. Linking Documents to Structures
4.1. Name-to-Structure Conversion
4.2. Chemical Entity Grounding
4.2.1. Definition, Types, and Background
4.2.2. Grounding of Other Entities
4.2.3. Grounding of Chemical Entities
4.3. Optical Compound Recognition
4.4. Chemical Representation
4.4.1. Line Notations
4.4.2. Connection Tables
4.4.3. InChI and InChIKey
4.5. Chemical Normalization or Standardization
5. Chemical Knowledgebases
5.1. Management of Chemical Data
5.2. Chemical Cartridges
5.3. Structure-Based Chemical Searches
6. Integration of Chemical and Biological Data
6.1. Biomedical Text Mining
6.1.1. General and Background
6.1.2. Evaluations
6.2. Detection of Chemical and Biomedical Entity Relations
6.2.1. General and Background
6.2.2. Chemical-Protein Entity Relation Extraction
6.2.3. Chemical Entity-Disease Relation Extraction
7. Future Trends
7.1. Technological and Methodological Challenges
Author Information
Corresponding Authors
ORCID
Author Contributions
Notes
Biographies
Acknowledgments
Abbreviations
References

1. INTRODUCTION

Because of the transition from printed scientific papers to electronic publications, the growing number of published documents accessible through the Internet, and their importance for both commercial exploitation and academic research, data mining technologies and search engines have deeply affected most areas of scientific research, communication, and discovery. The Internet has greatly influenced the publication environment,1 not only in the access to and distribution of published literature, but also in the actual review and writing process. In this respect, chemistry is a pioneering domain of online information representation: machine-readable encoding systems for chemical compounds date back to the line notation systems of the late 1940s.

Currently, efficient access to chemical information (section 1.1) contained in scientific articles, patents, legacy reports, or the Web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. From a researcher's viewpoint, chemists aim to locate documents that describe particular aspects of a compound of interest (e.g., synthesis, physicochemical properties, biological activity, industrial application, crystalline status, safety, and toxicology) or a particular chemical reaction among the growing collection of published papers and patents. In fact, chemists are among the researchers who spend the most time reading articles and, after medical researchers, read the most articles per person.2 For example, there are approximately 10 000 journals publishing "chemistry" articles.3 Every year, over 20 000 new compounds are published in medicinal and biological chemistry journals.4 Together with journals, patents constitute a valuable information source for chemical compounds and reactions:5 the Chemical Abstracts Service (CAS) states that 77% of new chemical compounds added to the CAS Registry are disclosed first in patent applications,6 and the percentage of new compounds added to the CAS REGISTRY database from patents rose from 14% in 1976 to 46% in 2010, with 576 new compounds being added to the ...

... to items of structured data repositories (i.e., defined database fields such as already indexed chemicals) or document metadata (data about the document itself). Metadata attributes of a document (or of data in general) usually provide some descriptive information associated with the document, such as author names, publication dates, keywords, or journal information, which can also be exploited for retrieval purposes. Metadata attributes are usually much more structured than the actual document