Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review


Irene Li, Jessica Pan, Jeremy Goldwasser, Neha Verma, Wai Pan Wong, Muhammed Yavuz Nuzumlalı, Benjamin Rosand, Yixin Li, Matthew Zhang, David Chang, R. Andrew Taylor, Harlan M. Krumholz and Dragomir Radev
{irene.li,jessica.pan,jeremy.goldwasser,neha.verma,waipan.wong}@yale.edu
{yavuz.nuzumlali,benjamin.rosand,yixin.li,matthew.zhang,david.chang}@yale.edu
{richard.taylor,harlan.krumholz,dragomir.radev}@yale.edu
Yale University, New Haven, CT 06511

arXiv:2107.02975v1 [cs.CL] 7 Jul 2021

Abstract
Electronic health records (EHRs), digital collections of patient healthcare events and observations, are ubiquitous in medicine and critical to healthcare delivery, operations, and research. Despite this central role, EHRs are notoriously difficult to process automatically. Well over half of the information stored within EHRs is in the form of unstructured text (e.g. provider notes, operation reports) and remains largely untapped for secondary use. Recently, however, newer neural network and deep learning approaches to Natural Language Processing (NLP) have made considerable advances, outperforming traditional statistical and rule-based systems on a variety of tasks. In this survey paper, we summarize current neural NLP methods for EHR applications. We focus on a broad scope of tasks, namely, classification and prediction, word embeddings, extraction, generation, and other topics such as question answering, phenotyping, knowledge graphs, medical dialogue, multilinguality, interpretability, etc.

CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Natural language processing; Machine learning algorithms.

Keywords: natural language processing, neural networks, EHR, unstructured data

1 Introduction
Electronic health records (EHRs), digital collections of patient healthcare events and observations, are now ubiquitous in medicine and critical to healthcare delivery, operations, and research [45, 51, 73, 92]. Data within EHRs are often classified by collection and representation format into two classes: structured or unstructured [43]. Structured EHR data consist of heterogeneous sources like diagnoses, medications, and laboratory values in fixed numerical or categorical fields. Unstructured data, in contrast, refer to free-form text written by healthcare providers, such as clinical notes and discharge summaries. Unstructured data represent about 80% of total EHR data, but unfortunately remain very difficult to process for secondary use [255]. In this survey paper, we focus our discussion mainly on unstructured text data in EHRs and the newer neural, deep-learning-based methods used to leverage this type of data.

In recent years, artificial neural networks have dramatically impacted fields such as speech recognition, computer vision (CV), and natural language processing (NLP), within medicine and elsewhere [31, 167]. These networks, which most often are composed of multiple layers in a modeling strategy known as deep learning, have come to outperform traditional rule-based and statistical methods on many tasks. Recognizing the inherent performance advantages and the potential to improve health outcomes by unlocking unstructured EHR text data, neural NLP methods are attracting great interest among researchers in artificial intelligence for healthcare.

1.1 Challenges and difficulties
In this survey, we focus on neural approaches to analyzing clinical texts in EHRs. While these data can carry abundant useful information in healthcare, significant challenges and difficulties remain, including:

• Privacy. Due to the extremely sensitive nature of the information contained in EHRs and the existence of regulatory laws such as the (US) Health Insurance Portability and Accountability Act (HIPAA), maintenance of privacy within analytic pipelines is imperative [62, 115]. Therefore, often before any downstream tasks can be performed or any data can be shared with others, additional privacy-preserving steps must be taken. Removing identifying information from a large corpus of EHRs is an expensive process. It is difficult to automate and requires annotators with domain expertise.

• Lack of annotations. Many existing machine learning and deep learning models are supervised and thus require labeled data for training. Annotating EHR data can be challenging because of the cognitive complexity of the task and because of the variability in the quality of data. Neural networks require large amounts of text on which to train, and unfortunately, useful EHR data are often in short supply. For some tasks, only qualified annotators can be recruited to complete annotations. Even when annotators are available, the quality of the annotations is sometimes hard to ensure. There may be disagreement among the annotators, which can make the evaluations more difficult and controversial.

• Interpretability. Deep neural networks are able to achieve superior results compared with other methods. However, in many fields, they are often treated as black boxes [230, 273]. Typically, a neural network model has a large number of trainable parameters, which causes severe difficulties for model interpretability. Moreover, unlike linear models, which are usually more straightforward and explainable, neural networks consist of non-linear layers and complex architectures, further hampering interpretation. Recently, there have been a few attempts to produce explainable deep neural networks to increase model transparency [29, 30, 164].

1.2 Related Work
Prior surveys have focused on a variety of deep learning topics within health informatics and bioinformatics, including EHRs. An early survey by Miotto et al. [157] summarized deep learning methods for healthcare and their applications in clinical imaging, EHRs, genomics, and mobile domains. The authors reviewed a number of tasks including predicting diseases and modeling phenotypes. A systematic survey by Xiao et al. [255] targeted five categories: disease detection/classification, sequential prediction, concept embedding, data augmentation, and EHR data privacy. Kwak and Hui [116] reviewed research applying artificial intelligence to health informatics. Specifically in the EHR field, they included the following applications: outcome prediction, computational phenotyping, knowledge extraction, representation learning, de-identification, and medical intervention recommendations. Assale et al. [9] presented a literature review of how free-text content from electronic patient records can benefit from recent Natural Language Processing techniques, selecting four application domains: data quality, information extraction, sentiment analysis and predictive models, and automated patient cohort selection. Another recent survey [251] summarized deep learning methods in clinical NLP. The authors reviewed methods such as deep learning architectures, embeddings, and medical knowledge on four groups of clinical NLP tasks: text classification, named entity recognition, relation extraction, and others. Joshi et al. [99] discussed textual epidemic intelligence, or the detection of disease outbreaks via medical and informal (i.e. the Web) text.

[Figure 1. Various tasks covered in this survey — Classification: medical text classification, segmentation, medical coding, medical outcome prediction, applications in public health. Embeddings: medical concept embeddings, visiting embeddings, patient embeddings, BERT-based embeddings. Extraction: named entity recognition, entity linking, relation and event extraction, medication information extraction, de-identification. Generation: EHR generation, summarization, medical language translation. Other topics: question answering, phenotyping, knowledge graphs, medical dialogues, word sense disambiguation, multilinguality, interpretability.]

1.3 Motivation
While the aforementioned reviews target different applications or techniques and vary in the scope of selected topics, in this survey we aim at a more comprehensive coverage of applications and tasks, and we include several works with very recent BERT-based models. Specifically, we summarize a broad range of existing literature on deep learning methods with clinical text data. Most of the papers on clinical NLP discussed achieve performance at or near the state of the art for their tasks.

We now provide some NLP preliminaries, and then we explore deep learning methods on various tasks: classification, embeddings, extraction, generation, and other topics including question answering, phenotyping, knowledge graphs, and more. Figure 1 shows the scope of the tasks. Finally, we also summarize relevant datasets and existing tools.

2 Preliminaries
In this section, we present an overview of important concepts in deep learning and natural language processing. In particular, we cover commonly used deep learning architectures in NLP, word embeddings, recent Transformer models, and concepts from transfer learning.
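To ground these preliminaries, the following minimal sketch (ours, not from the survey) shows how a pretrained clinical Transformer turns a note fragment into dense contextual vectors; it assumes the transformers and torch packages and the publicly released emilyalsentzer/Bio_ClinicalBERT checkpoint.

    # Minimal sketch: contextual embeddings for a clinical snippet with a
    # pretrained Transformer. Assumes `pip install transformers torch` and
    # the public emilyalsentzer/Bio_ClinicalBERT checkpoint.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    note = "Pt w/ h/o CHF presents with worsening dyspnea on exertion."
    inputs = tokenizer(note, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, num_tokens, 768)

    # Mean-pool token vectors into one fixed-length note representation.
    note_vector = hidden.mean(dim=1)
    print(note_vector.shape)                          # torch.Size([1, 768])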
Recommended publications
  • Information Retrieval Application of Text Mining Also on Bio-Medical Fields
    JOURNAL OF CRITICAL REVIEWS, ISSN 2394-5125, VOL 7, ISSUE 8, 2020
    INFORMATION RETRIEVAL APPLICATION OF TEXT MINING ALSO ON BIO-MEDICAL FIELDS
    Charan Singh Tejavath (Research Scholar) and Dr. Tryambak Hirwarkar (Research Guide), Dept. of Computer Science & Engineering, Sri Satya Sai University of Technology & Medical Sciences, Sehore, Bhopal-Indore Road, Madhya Pradesh, India.
    Received: 11.03.2020. Revised: 21.04.2020. Accepted: 27.05.2020.
    ABSTRACT: Electronic clinical records contain data on all parts of health care. Healthcare data frameworks gather a lot of textual and numeric data about patients, visits, medicines, doctor notes, and so forth. These electronic archives encapsulate data that could lead to an improvement in healthcare quality, advancement of clinical and research activities, reduction in clinical mistakes, and reduction in healthcare costs. However, the reports that contain the health record are wide-ranging in complexity, length, and use of specialized jargon, making knowledge discovery complex. The recent availability of commercial text mining tools provides a unique opportunity to extract critical data from textual information files. This paper describes biomedical text-mining technology.
    KEYWORDS: Text-mining technology, Bio-medical fields, Textual information files.
    © 2020 by Advance Scientific Research. This is an open-access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). DOI: http://dx.doi.org/10.31838/jcr.07.08.357
    I. INTRODUCTION: The corpus of biomedical data is growing quickly. New and helpful results appear each day in research publications, from journal articles to book chapters to workshop and conference proceedings.
  • Barriers to Retrieving Patient Information from Electronic Health Record Data: Failure Analysis from the TREC Medical Records Track
    Barriers to Retrieving Patient Information from Electronic Health Record Data: Failure Analysis from the TREC Medical Records Track
    Tracy Edinger, ND, Aaron M. Cohen, MD, MS, Steven Bedrick, PhD, Kyle Ambert, BA, William Hersh, MD. Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
    Abstract. Objective: Secondary use of electronic health record (EHR) data relies on the ability to retrieve accurate and complete information about desired patient populations. The Text Retrieval Conference (TREC) 2011 Medical Records Track was a challenge evaluation allowing comparison of systems and algorithms to retrieve patients eligible for clinical studies from a corpus of de-identified medical records, grouped by patient visit. Participants retrieved cohorts of patients relevant to 35 different clinical topics, and visits were judged for relevance to each topic. This study identified the most common barriers to identifying specific clinical populations in the test collection.
    Methods: Using the runs from track participants and judged visits, we analyzed the five non-relevant visits most often retrieved and the five relevant visits most often overlooked. Categories were developed iteratively to group the reasons for incorrect retrieval for each of the 35 topics.
    Results: Reasons fell into nine categories for non-relevant visits and five categories for relevant visits. Non-relevant visits were most often retrieved because they contained a non-relevant reference to the topic terms. Relevant visits were most often overlooked because they used a synonym for a topic term.
    Conclusions: This failure analysis provides insight into areas for future improvement in EHR-based retrieval, with techniques such as more widespread and complete use of standardized terminology in retrieval and data entry systems.
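    The synonym failure mode reported above is commonly countered with query expansion. The sketch below is illustrative only, not from the study; the synonym table and visit notes are invented.

        # Minimal query-expansion sketch: map topic terms to synonyms before
        # matching de-identified visit documents. Toy data, not the TREC collection.
        import re

        # Tiny synonym table; a real system would draw on a terminology such as UMLS.
        SYNONYMS = {
            "myocardial infarction": ["heart attack", "mi"],
            "hypertension": ["high blood pressure", "htn"],
        }

        def expand_query(terms):
            """Return the original query terms plus any known synonyms."""
            expanded = set()
            for term in terms:
                expanded.add(term)
                expanded.update(SYNONYMS.get(term, []))
            return expanded

        def matches(visit_text, terms):
            """True if any expanded term appears as a whole word or phrase."""
            text = visit_text.lower()
            return any(re.search(rf"\b{re.escape(t)}\b", text) for t in expand_query(terms))

        visits = [
            "Patient admitted with heart attack, treated with PCI.",
            "Follow-up for hypertension, medications adjusted.",
        ]
        print([matches(v, ["myocardial infarction"]) for v in visits])  # [True, False]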
  • Biomedical Text Mining: a Survey of Recent Progress
    Chapter 14. BIOMEDICAL TEXT MINING: A SURVEY OF RECENT PROGRESS
    Matthew S. Simpson and Dina Demner-Fushman, Lister Hill National Center for Biomedical Communications, United States National Library of Medicine, National Institutes of Health
    Abstract: The biomedical community makes extensive use of text mining technology. In the past several years, enormous progress has been made in developing tools and methods, and the community has been witness to some exciting developments. Although the state of the community is regularly reviewed, the sheer volume of work related to biomedical text mining and the rapid pace at which progress continues to be made make this a worthwhile, if not necessary, endeavor. This chapter provides a brief overview of the current state of text mining in the biomedical domain. Emphasis is placed on the resources and tools available to biomedical researchers and practitioners, as well as the major text mining tasks of interest to the community. These tasks include the recognition of explicit facts from biomedical literature, the discovery of previously unknown or implicit facts, document summarization, and question answering. For each topic, its basic challenges and methods are outlined, and recent and influential work is reviewed.
    Keywords: Biomedical information extraction, named entity recognition, relations, events, summarization, question answering, literature-based discovery
    In: C.C. Aggarwal and C.X. Zhai (eds.), Mining Text Data, Springer Science+Business Media, LLC 2012. DOI 10.1007/978-1-4614-3223-4_14
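    As a concrete illustration of the fact-recognition task mentioned above, the sketch below runs biomedical named entity recognition with spaCy; it assumes the scispaCy package and its en_core_sci_sm model are installed, and is not code from the chapter.

        # Biomedical NER sketch with spaCy + scispaCy. Assumes:
        #   pip install scispacy
        #   plus the en_core_sci_sm model from https://allenai.github.io/scispacy/
        # Illustrative only.
        import spacy

        nlp = spacy.load("en_core_sci_sm")
        text = ("Treatment with metformin reduced HbA1c levels in patients "
                "with type 2 diabetes mellitus.")
        # Print each recognized biomedical mention with its character offsets.
        for ent in nlp(text).ents:
            print(ent.text, ent.start_char, ent.end_char)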
  • Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
    Lawrence Berkeley National Laboratory
    Title: Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
    Authors: Kevin Maik Jablonka, Daniele Ongari, Seyed Mohamad Moosavi, and Berend Smit
    Journal: Chemical Reviews, 120(16), 8066–8129 (2020). ISSN 0009-2665. DOI: 10.1021/acs.chemrev.0c00004. Permalink: https://escholarship.org/uc/item/3ss713pj
    Open access under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.
    ABSTRACT: By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal–organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional, brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies.
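    The training-set/feature-space/evaluation workflow described in this abstract is the standard supervised-learning loop; a minimal scikit-learn sketch with synthetic stand-in features (not real MOF data) follows.

        # Sketch of the supervised workflow: featurize materials, fit on a
        # training set, evaluate on held-out data. Features and the target
        # property are synthetic stand-ins, not actual MOF descriptors.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.metrics import r2_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.random((500, 6))  # e.g. pore size, surface area, density, ...
        y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 500)  # toy property

        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X_train, y_train)
        print("held-out R^2:", r2_score(y_test, model.predict(X_test)))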
  • Unstructured Data Is a Risky Business
    R&D Solutions for OIL & GAS EXPLORATION AND PRODUCTION
    Unstructured Data is a Risky Business
    Summary: Much of the information being stored by oil and gas companies—technical reports, scientific articles, well reports, etc.—is unstructured. This results in critical information being lost and E&P teams putting themselves at risk because they "don't know what they know". Companies that manage their unstructured data are better positioned for making better decisions, reducing risk and boosting bottom lines.
    Most users either don't know what data they have or simply cannot find it. And the oil and gas industry is no different. It is estimated that 80% of existing data is unstructured, meaning that the majority of the data that we are generating and storing is unusable. And while this is understandable, considering managing unstructured data takes time, money, effort and expertise, it results in companies wasting money by making ill-informed investment decisions. If oil companies want to mitigate risk and improve success and recovery rates, they need to be better at managing their unstructured data. In this paper, Phoebe McMellon, Elsevier's Director of Oil and Gas Strategy & Segment Marketing, discusses how big …
    … unstructured information. Unfortunately, it is this unstructured information, often in the form of internal and external PDFs, PowerPoint presentations, technical reports, and scientific articles and publications, that contains clues and answers regarding why and how certain interpretations and decisions are made. A case in point of the financial risks of not managing unstructured data comes from a customer that drilled a "dry hole," costing the company a total of $20 million—only to realize several years later that they had drilled a "dry hole" ten years prior, two miles away.
    (Sidebar: Every day we create approximately 3 × 10^18 bytes.)
  • Application of Text Mining to Biomedical Knowledge Extraction: Analyzing Clinical Narratives and Medical Literature
    Amy Neustein, S. Sagar Imambi, Mário Rodrigues, António Teixeira and Liliana Ferreira
    Chapter 1: Application of text mining to biomedical knowledge extraction: analyzing clinical narratives and medical literature
    Abstract: One of the tools that can aid researchers and clinicians in coping with the surfeit of biomedical information is text mining. In this chapter, we explore how text mining is used to perform biomedical knowledge extraction. By describing its main phases, we show how text mining can be used to obtain relevant information from vast online databases of health science literature and patients' electronic health records. In so doing, we describe the workings of the four phases of biomedical knowledge extraction using text mining (text gathering, text preprocessing, text analysis, and presentation) entailed in retrieval of the sought information with a high accuracy rate. The chapter also includes an in-depth analysis of the differences between clinical text found in electronic health records and biomedical text found in online journals, books, and conference papers, as well as a presentation of various text mining tools that have been developed in both university and commercial settings.
    1.1 Introduction: The corpus of biomedical information is growing very rapidly. New and useful results appear every day in research publications, from journal articles to book chapters to workshop and conference proceedings. Many of these publications are available online through journal citation databases such as Medline – a subset of the PubMed interface that enables access to Medline publications – which is among the largest and most well-known online databases for indexing professional literature. Such databases and their associated search engines contain important research work in the biological and medical domain, including recent findings pertaining to diseases, symptoms, and medications.
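    A toy end-to-end illustration of the four phases named above, using only the Python standard library (our sketch, not the chapter's code):

        # Toy sketch of the four phases: gathering, preprocessing, analysis,
        # presentation. Real systems would query Medline/PubMed and use
        # proper NLP tooling; documents here are hardcoded stand-ins.
        import re
        from collections import Counter

        # 1. Text gathering: stand-ins for abstracts fetched from a database.
        docs = [
            "Aspirin reduces the risk of myocardial infarction.",
            "Myocardial infarction risk increases with hypertension.",
        ]

        # 2. Text preprocessing: lowercase, tokenize, drop stopwords.
        STOPWORDS = {"the", "of", "with", "and", "a"}
        def preprocess(text):
            return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

        # 3. Text analysis: corpus-wide term frequencies as crude concept candidates.
        counts = Counter(token for doc in docs for token in preprocess(doc))

        # 4. Presentation: report the most frequent terms.
        for term, freq in counts.most_common(5):
            print(term, freq)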
  • Big Data Mining Tools for Unstructured Data: A Review
    (IJITR) INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY AND RESEARCH, Volume No. 3, Issue No. 2, February–March 2015, 2012–2017.
    Big Data Mining Tools for Unstructured Data: A Review
    Yogesh S. Kalambe (M.Tech Student), D. Pratiba (Assistant Professor) and Dr. Pritam Shah (Associate Professor), Department of Computer Science and Engineering, R V College of Engineering, Bangalore, Karnataka, India.
    Abstract — Big data is a buzzword used for large data collections that include structured data, semi-structured data and unstructured data. The size of big data is so large that it is nearly impossible to collect, process and store the data using traditional database management systems and software techniques. Therefore, big data requires different approaches and tools for analysis. The process of collecting, storing and analyzing large amounts of data to find unknown patterns is called big data analytics. The information and patterns found by the analysis process are used by large enterprises and companies to gain deeper knowledge and to make better decisions faster, gaining an advantage over the competition. So, better techniques and tools must be developed to analyze and process big data. Big data mining is used to extract useful information from large datasets, which are mostly unstructured data.
  • Extracting Unstructured Data from Template Generated Web Documents
    Extracting Unstructured Data from Template Generated Web Documents
    Ling Ma and Nazli Goharian, Information Retrieval Laboratory, Computer Science Department, Illinois Institute of Technology; Abdur Chowdhury, America Online Inc.
    ABSTRACT: We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the retrieval precision for the queries that generate irrelevant results. We believe that by reducing the number of irrelevant results, users are encouraged to go back to a given site to search. Our experimental results on several different web sites and on the whole cnnfn collection demonstrate the feasibility of our approach.
    … section of a page contributes to the same problem. A problem that is not well studied is the negative effect of such noise data on the results of user queries. Removing this data improves the effectiveness of search by reducing the irrelevant results. Furthermore, we argue that the irrelevant results, even covering a small fraction of retrieved results, have the restaurant effect, namely users are ten times less likely to return or use the search service after a bad experience. This is of more importance considering the fact that an average of 26.8% of each page is formatting data and advertisement. Table 1 shows the ratio of the formatting data to the real data in the eight web sites we examined.
    Table 1: Template Ratio in Web Collections
        Web page collection      Template ratio
        Cnnfn.com                34.1%
        Netflix.com              17.8%
        NBA.com                  18.1%
        Amazon.com               19.3%
        Ebay.com                 22.5%
        Disneylandsource.com     54.1%
    Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Search Process
    General Terms: Design, Experimentation
    Keywords: Automatic template removal, text extraction, information retrieval, retrieval accuracy
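    A minimal frequency-based variant of template removal (our sketch under simplified assumptions, not the authors' algorithm): lines repeated across most pages of a site are treated as template and dropped, keeping only page-specific body text.

        # Frequency-based template removal sketch. Toy pages; a threshold
        # fraction of pages sharing a line marks that line as template.
        from collections import Counter

        def strip_template(pages, threshold=0.6):
            """Drop lines occurring in at least `threshold` fraction of pages."""
            line_counts = Counter(line for page in pages
                                  for line in set(page.splitlines()))
            cutoff = threshold * len(pages)
            template = {line for line, n in line_counts.items() if n >= cutoff}
            return ["\n".join(l for l in page.splitlines() if l not in template)
                    for page in pages]

        pages = [
            "SiteName | Home | Markets\nStocks rallied on strong earnings.\nCopyright 2004",
            "SiteName | Home | Markets\nOil prices fell after the report.\nCopyright 2004",
            "SiteName | Home | Markets\nThe Fed left rates unchanged.\nCopyright 2004",
        ]
        for body in strip_template(pages):
            print(body)   # only the page-specific sentence survives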
  • A Survey of Embeddings in Clinical Natural Language Processing
    This paper was published in the Journal of Biomedical Informatics. An updated version is available at https://doi.org/10.1016/j.jbi.2019.103323
    SECNLP: A Survey of Embeddings in Clinical Natural Language Processing
    Kalyan KS and S. Sangeetha, Text Analytics and Natural Language Processing Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India.
    ABSTRACT: Traditional representations like bag-of-words are high-dimensional and sparse, and ignore word order as well as syntactic and semantic information. Distributed vector representations, or embeddings, map variable-length text to dense fixed-length vectors and capture prior knowledge that can be transferred to downstream tasks. Even though embeddings have become the de facto standard for representations in deep-learning-based NLP tasks in both the general and clinical domains, there is no survey paper which presents a detailed review of embeddings in clinical natural language processing. In this survey paper, we discuss various medical corpora and their characteristics and medical codes, and present a brief overview as well as a comparison of popular embedding models. We classify clinical embeddings into nine types and discuss each embedding type in detail. We discuss various evaluation methods, followed by possible solutions to various challenges in clinical embeddings. Finally, we conclude with some of the future directions which will advance research in clinical embeddings.
    Keywords: embeddings, distributed representations, medical, natural language processing, survey
    1. INTRODUCTION: Distributed vector representation, or embedding, is one of the most recent and prominent additions to modern natural language processing. Embedding has gained a lot of attention and has become part of the NLP researcher's toolkit.
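    For illustration, a minimal sketch of learning distributed word representations with gensim's Word2Vec on toy clinical-style sentences (real clinical embeddings are trained on large corpora; this is not the survey's code):

        # Word2Vec sketch: tokenized toy sentences in, dense vectors out.
        # Assumes `pip install gensim` (gensim 4.x API).
        from gensim.models import Word2Vec

        sentences = [
            ["patient", "denies", "chest", "pain"],
            ["patient", "reports", "chest", "pain", "and", "dyspnea"],
            ["start", "aspirin", "for", "chest", "pain"],
            ["aspirin", "held", "due", "to", "bleeding"],
        ]
        model = Word2Vec(sentences, vector_size=50, window=3,
                         min_count=1, epochs=200, seed=1)

        print(model.wv["aspirin"].shape)              # dense 50-dim vector
        print(model.wv.most_similar("pain", topn=3))  # neighbors in vector space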
  • Affective Computing and Deep Learning to Perform Sentiment Analysis
    Affective computing and deep learning to perform sentiment analysis
    NJ Maree (ORCID 0000-0003-0031-9188), 24991759
    Dissertation accepted in partial fulfilment of the requirements for the degree Master of Science in Computer Science at the North-West University. Supervisor: Prof L Drevin. Co-supervisors: Prof JV du Toit and Prof HA Kruger. Graduation October 2020.
    Preface: But as for you, be strong and do not give up, for your work will be rewarded. ~ 2 Chronicles 15:7. All the honour and glory to our Heavenly Father, without whom none of this would be possible. Thank you for giving me the strength and courage to complete this study. Secondly, thank you to my supervisor, Prof Lynette Drevin, for all the support to ensure that I remain on track with my studies. Also, thank you for always trying to help me find the golden thread to tie my work together. To my co-supervisors, Prof Tiny du Toit and Prof Hennie Kruger, I appreciate your pushing me to not only focus on completing my dissertation, but also to participate in local conferences. Furthermore, thank you for giving me critical and constructive feedback throughout this study. A special thanks to my family who supported and prayed for me throughout my dissertation blues. Thank you for motivating me to pursue this path. Lastly, I would like to extend my appreciation to Dr Isabel Swart for the language editing of the dissertation, and the Telkom Centre of Excellence for providing me with the funds to attend the conferences.
    Abstract: Companies often rely on feedback from consumers to make strategic decisions.
  • Top Natural Language Processing Applications in Business
    Top Natural Language Processing Applications in Business: Unlocking Value from Unstructured Data
    For years, enterprises have been making good use of their structured data (tables, spreadsheets, etc.). However, the larger part of enterprise data, nearly 80 percent, is unstructured and has been much less accessible. From emails, text documents, research and legal reports to voice recordings, videos, social media posts and more, unstructured data is a huge body of information. But, compared to structured data, it has been much more challenging to leverage.
    1. Internet of Things (IoT): applying technologies, such as real-time analytics, machine learning (ML), and smart sensors, to manage and analyze machine-generated structured data
    2. Computer Vision: using digital imaging technologies, ML, and pattern recognition to interpret image and video content
    3. Document Understanding: combining NLP and ML to gain insights into human-generated, natural-language unstructured data
    Traditional search has done a great job helping users discover and derive insights from some of this data. But enterprises need to go beyond search to maximize the use of unstructured data as a resource for enhanced analytics and decision making. This is where natural language processing (NLP), a field of artificial intelligence (AI) used to handle the processing and analysis of large volumes of unstructured data, can be a real game changer. As it powers document understanding applications that unlock value from unstructured data, NLP has become an essential enabler of the AI evolution in today's enterprises. This white paper discusses the emergence of NLP as a key insight discovery technique, followed by examples of impactful NLP applications that Accenture has helped clients implement.
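    As one hedged illustration of a document-understanding step, the sketch below routes a text to invented business categories with a zero-shot classifier; it assumes the transformers package and the public facebook/bart-large-mnli model, and is not an Accenture implementation.

        # Zero-shot document routing sketch. Labels and text are invented.
        # Assumes `pip install transformers torch`.
        from transformers import pipeline

        classifier = pipeline("zero-shot-classification",
                              model="facebook/bart-large-mnli")
        doc = "The customer requests a refund for the damaged shipment received Monday."
        labels = ["complaint", "invoice", "legal notice", "job application"]
        result = classifier(doc, candidate_labels=labels)
        # Highest-scoring label first.
        print(result["labels"][0], round(result["scores"][0], 3))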
  • Combining Unstructured, Fully Structured and Semi-Structured Information in Semantic Wikis
    Combining Unstructured, Fully Structured and Semi-Structured Information in Semantic Wikis
    Rolf Sint, Sebastian Schaffert and Stephanie Stroka, Salzburg Research, Jakob Haringer Str. 5/3, 5020 Salzburg, Austria; Roland Ferstl, Siemens AG (Siemens IT Solutions and Services), Werner von Siemens-Platz 1, 5020 Salzburg, Austria
    Abstract. The growing impact of Semantic Wikis underscores the importance of finding a strategy to store textual articles, semantic metadata and management data. Due to their different characteristics, each data type requires a specialized storage system, as inappropriate storage reduces performance, robustness, flexibility and scalability. Hence, it is important to identify a sophisticated strategy for storing and synchronizing different types of data structures in a way that provides the best mix of the previously mentioned properties. In this paper we compare fully structured, semi-structured and unstructured data and present their typical applications. Moreover, we discuss how all data structures can be combined and stored for one application, and consider three synchronization design alternatives to keep the distributed data storages consistent. Furthermore, we present the semantic wiki KiWi, which uses an RDF triplestore in combination with a relational database as the basis for the persistence of data, and discuss its concrete implementation and design decisions.
    1 Introduction: Although the promise of effective knowledge management has had the industry abuzz for well over a decade, the reality of available systems fails to meet the expectations. The EU-funded project KiWi – Knowledge in a Wiki – sets out to combine the wiki method of collaborative content creation with the technologies of the Semantic Web to bring knowledge management to the next level.
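    A minimal sketch of the hybrid persistence idea (our illustration, not KiWi's implementation): semantic metadata in an RDF triplestore via rdflib, article text in a relational table via sqlite3, linked by a shared page id. Schema and names are invented.

        # Hybrid storage sketch: RDF triples for metadata, SQL for text.
        # Assumes `pip install rdflib`; sqlite3 is in the standard library.
        import sqlite3
        from rdflib import Graph, Literal, Namespace

        EX = Namespace("http://example.org/wiki/")

        # Relational side: unstructured article text.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE pages (id TEXT PRIMARY KEY, body TEXT)")
        db.execute("INSERT INTO pages VALUES (?, ?)",
                   ("kiwi", "KiWi combines wiki editing with semantic annotations..."))

        # Semantic side: structured metadata as RDF triples.
        g = Graph()
        g.add((EX["kiwi"], EX["author"], Literal("Rolf Sint")))
        g.add((EX["kiwi"], EX["tag"], Literal("Semantic Web")))

        # Combined lookup: metadata from the triplestore, text from SQL.
        for _, _, author in g.triples((EX["kiwi"], EX["author"], None)):
            body = db.execute("SELECT body FROM pages WHERE id=?",
                              ("kiwi",)).fetchone()[0]
            print(f"author={author} body={body[:40]}...")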