Big-Data Science in Porous Materials: Materials Genomics and Machine Learning


Kevin Maik Jablonka, Daniele Ongari, Seyed Mohamad Moosavi, and Berend Smit*

Chem. Rev. 2020, 120, 8066−8129 (Special Issue: Porous Framework Chemistry)
DOI: 10.1021/acs.chemrev.0c00004 | ISSN 0009-2665
Received: January 3, 2020 | Published: June 10, 2020
Lawrence Berkeley National Laboratory Recent Work. Permalink: https://escholarship.org/uc/item/3ss713pj
This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

ABSTRACT: By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal−organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges: we simply have too many materials to be processed using conventional, brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to expand rapidly in the coming years.

CONTENTS

1. Introduction 8067
2. Machine Learning Landscape 8067
  2.1. Machine Learning Pipeline 8068
    2.1.1. Machine Learning Workflow 8068
    2.1.2. Machine Learning Algorithms 8069
  2.2. Theory-Guided Data Science 8070
  2.3. Scientific Method in Machine Learning: Strong Inference and Multiple Models 8071
3. Selecting the Data: Dealing with Little, Imbalanced, and Nonrepresentative Data 8071
  3.1. Limitations of Hypothetical Databases 8072
  3.2. Sampling to Improve Predictive Performance 8072
    3.2.1. Diverse Set Selection 8072
  3.3. Active Learning 8073
  3.4. Dealing with Little Data 8074
  3.5. Dealing with Imbalanced Data Labels 8075
4. What to Learn from: Translating Structures into Feature Vectors 8075
  4.1. Descriptors 8077
  4.2. Overview of the Descriptor Landscape 8078
    4.2.1. Local Descriptors 8079
    4.2.2. Global Descriptors 8080
  4.3. Feature Learning 8083
    4.3.1. Feature Engineering 8083
    4.3.2. Feature Selection 8083
    4.3.3. Data Transformations 8085
5. How to Learn: Choosing a Learning Algorithm 8085
  5.1. Lots of (Unstructured) Data (Tall Data) 8087
    5.1.1. Neural Networks 8087
  5.2. Limited Amount of (Structured) Data (Wide Data) 8090
    5.2.1. Linear and Logistic Regression 8090
    5.2.2. Kernel Methods 8090
    5.2.3. Bayesian Learning 8091
    5.2.4. Instance-Based Learning 8092
    5.2.5. Ensemble Methods 8092
6. How to Learn Well: Regularization, Hyperparameter Tuning, and Tricks 8093
  6.1. Hyperparameter Tuning 8093
  6.2. Regularization 8094
    6.2.1. Explicit Regularization: Adding a Term or Layer 8094
    6.2.2. Implicit Regularization: More Subtle Ways to Stop the Model from Remembering 8094
7. How to Measure Performance and Compare Models 8095
  7.1. Holdout Splits and Cross-Validation: Sampling without Replacement 8095
  7.2. Bootstrap: Sampling with Replacement 8097
  7.3. Choosing the Appropriate Regression Metric 8097
  7.4. Classification 8097
    7.4.1. Probabilities That Can Be Interpreted as Confidence 8097
    7.4.2. Choosing the Appropriate Classification Metric 8097
  7.5. Estimating Extrapolation Ability 8098
  7.6. Domain of Applicability 8098
  7.7. Confidence Intervals and Error Estimates 8099
    7.7.1. Ensemble Approach 8099
    7.7.2. Distance-Based 8099
    7.7.3. Conformal Prediction 8099
  7.8. Comparing Models 8100
    7.8.1. Ablation Studies 8100
  7.9. Randomization Tests: Is the Model Learning Something Meaningful? 8100
8. How to Interpret the Results: Avoiding the Clever Hans 8100
  8.1. Consider Using Explainable Models 8101
  8.2. Post-Hoc Techniques to Shine Light Into Black Boxes 8101
  8.3. Auditing Models: What Are Indirect Influences? 8103
9. Applications of Supervised Machine Learning 8104
  9.1. Gas Storage and Separation 8104
    9.1.1. Starting on Small Data Sets 8104
    9.1.2. Moving to Big Data 8105
    9.1.3. Bridging the Gap between Process Engineering and Materials Science 8107
    9.1.4. Interpreting the Models 8107
  9.2. Stability 8108
  9.3. Reactivity and Chemical Properties 8108
  9.4. Electronic Properties 8108
  9.5. ML for Molecular Simulations 8109
  9.6. Synthesis 8109
    9.6.1. Synthesizability 8111
  9.7. Generative Models 8111
10. Outlook and Concluding Remarks 8111
  10.1. Automatizing the Machine Learning Workflow 8111
  10.2. Reproducibility in Machine Learning 8111
    10.2.1. Comparability and Reporting Standards 8113
  10.3. Transfer Learning and Multifidelity Optimization 8113
  10.4. Multitask Prediction 8114
  10.5. Future of Big-Data Science in Porous Materials 8114
Author Information 8114
  Corresponding Author 8114
  Authors 8114
  Notes 8114
  Biographies 8114
Acknowledgments 8115
Abbreviations 8115
References 8116

1. INTRODUCTION

One of the fascinating aspects of metal−organic frameworks (MOFs) is that by combining linkers and metal nodes we can synthesize millions of different materials.1 Over the past decade, over 10,000 porous2,3 and 80,000 nonporous MOFs have been synthesized.4 In addition, one also has covalent organic frameworks (COFs), porous polymer networks (PPNs), zeolites, and related porous materials. Because of their potential in many applications, ranging from gas separation and storage to sensing and catalysis, these materials have attracted a lot of attention. From a scientific point of view, these materials are interesting because their chemical tunability allows us to tailor-make materials with exactly the right properties. As one can only synthesize a tiny fraction of all possible materials, these experimental efforts are often combined with computational approaches, often referred to as materials genomics,5 to generate libraries of predicted or hypothetical MOFs, COFs, and other related porous materials. These libraries are subsequently computationally screened to identify the most promising materials for a given application.

That we now have on the order of ten thousand synthesized porous crystals and over a hundred thousand predicted materials does create new challenges; we simply have too many structures and too much data. Issues related to having so many structures range from the simple question of how to manage so much data to the more profound one of how to use the data to discover new science. Therefore, a logical next step in materials genomics is to apply the tools of big-data science and to exploit “the unreasonable effectiveness of data”.6 In this review, we discuss how machine learning (ML) has been applied to porous materials and review some aspects of the underlying techniques in each step. Before discussing the specific applications of ML to porous materials, we give an overview of the ML landscape to introduce some terminology; a short overview of the technical terms used throughout this review is given in Table 1.

In this review, we focus on applications of ML in materials science and chemistry with a particular focus on porous materials. For a more general discussion of ML, we refer the reader to some excellent reviews.7,8

2. MACHINE LEARNING LANDSCAPE

Nowadays it is difficult, if not impossible, to avoid ML in science. Because of recent developments in technology, we now routinely store and analyze large amounts of data. The underlying idea of big-data science is that if one has large amounts of data, one might be able to discover statistically significant patterns that are correlated with some specific properties or events. Arthur Samuel was among the first to use the term “machine learning”, for the algorithms he developed in 1959 to teach a computer to play the game of checkers.9 His ML algorithm let the computer look ahead a few moves. Initially, each possible move had the same weight and hence the same probability of being executed. By collecting more and more data from actual games, the computer could learn which move for a given board configuration would develop into a winning strategy. One of the reasons why Arthur Samuel looked at checkers was that, in a practical sense, the game of checkers is not deterministic: there is no known algorithm that leads to winning the game, and the complete evaluation of all ~10^40 possible moves is beyond the capacity of any computer.

There are some similarities between the game of checkers and the science of discovering new materials. Making a new material is in practice equally nondeterministic. The number of possible ways we can combine atoms is simply too large to evaluate all possible materials. For a long time, materials discovery has been based on empirical knowledge. Significant
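Samuel's actual program used lookahead over board positions and tuned the weights of hand-crafted board features; the snippet below is only a toy sketch of the core idea described above (start with equal move weights, then reinforce the moves that occur in winning self-play games). A trivial counting game (add 1 or 2; whoever reaches exactly 10 wins) stands in for checkers, and the game, update rule, and constants are illustrative assumptions, not Samuel's algorithm.

```python
import random

# Toy game: players alternately add 1 or 2 to a counter; whoever
# reaches exactly TARGET wins.
TARGET = 10
MOVES = (1, 2)

# Initially every move has the same weight for every counter value,
# and hence the same probability of being executed.
weights = {(s, m): 1.0 for s in range(TARGET) for m in MOVES}

def pick(state, rng):
    """Choose a legal move with probability proportional to its weight."""
    legal = [m for m in MOVES if state + m <= TARGET]
    return rng.choices(legal, weights=[weights[(state, m)] for m in legal])[0]

def play(rng):
    """Self-play one game; return the (state, move) pairs of the winner."""
    state, player, history = 0, 0, {0: [], 1: []}
    while True:
        m = pick(state, rng)
        history[player].append((state, m))
        state += m
        if state == TARGET:
            return history[player]
        player = 1 - player

# "Collecting more and more data from actual games": reinforce the
# moves that occurred in the winner's history.
rng = random.Random(0)
for _ in range(5000):
    for state_move in play(rng):
        weights[state_move] += 0.1

# The immediately winning move from state 8 (+2 reaches 10) now far
# outweighs the immediately losing one (+1 lets the opponent win).
print(weights[(8, 2)], weights[(8, 1)])
```

The point of the toy is the same as in the checkers story: no move is ever told whether it is "good"; preferences emerge solely from the statistics of game outcomes.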