Domain Independent Natural Language Processing—A Case Study for Hospital Readmission with COPD
2014 IEEE 14th International Conference on Bioinformatics and Bioengineering
978-1-4799-7502-0/14 $31.00 © 2014 IEEE. DOI 10.1109/BIBE.2014.57

Ankur Agarwal, Ph.D.; Sirish Malpura; Ravi S Behara, Ph.D.; Vivek Tyagi
Florida Atlantic University, Boca Raton, Florida

Abstract: Natural language processing is a field of computer science that focuses on interactions between computers and human (natural) languages. Human languages, unlike computer languages, are ambiguous, which makes their analysis and processing difficult. Most data today is in unstructured form (for example accident reports, patient discharge summaries, and criminal records), which makes it hard for computers to interpret for further use and analysis. This unstructured text needs to be converted into structured form by clearly defining the sentence boundaries, word boundaries, and context-dependent character boundaries for further analysis. This paper proposes a component-based, domain-independent text analysis system for processing natural language, known as the Domain-independent Natural Language Processing System (DINLP). The paper further discusses the system's capability and its application in the area of bioinformatics through a case study.

Keywords: Chronic Obstructive Pulmonary Disease; Natural Language Processing; Term Extraction

I. INTRODUCTION

Natural language processing approaches fall roughly into four categories: symbolic, statistical, connectionist, and hybrid.

Symbolic Approach: Symbolic approaches are based on explicit representation of facts about language through well-understood knowledge representation schemes and associated algorithms. They perform deep analysis of linguistic phenomena.

Statistical Approach: Statistical approaches employ various mathematical techniques and often use large text input to develop approximate generalized models of linguistic phenomena, based on actual examples of these phenomena provided by the text input, without adding significant linguistic or world knowledge. Machine-learning-based NLP solutions use this approach.

Connectionist Approach: The connectionist approach also develops generalized models from examples of linguistic phenomena. However, connectionist models combine statistical learning with various theories of representation; thus the connectionist representations allow transformation, inference, and manipulation of logic formulae. [1]

The connectionist NLP approach is newer than the symbolic and statistical approaches. Connectionist NLP work first appeared in the 1960s. For a long time, symbolic approaches dominated the field; however, in the 1980s, statistical approaches regained popularity as a result of the availability of critical computational resources and the need to deal with broad, real-world contexts. Connectionist approaches also recovered from earlier criticism by demonstrating the utility of neural networks in NLP. [2]

Natural language processing can be used in various applications, such as translation between languages, dialogue systems (such as customer care), and, most importantly, information extraction (IE). The main goal of information extraction is to transform unstructured text into structured (database) representations that can be searched and browsed in flexible ways.

Natural language processing includes the following tasks. Some of these tasks can serve as real-world applications, and others are sub-tasks of other tasks.

Information Extraction: IE is the process of converting unstructured text into structured or semi-structured form. The unstructured text can be translated to standard databases, which can be queried by users. [3]

Information Retrieval: IR is the process of determining which information resources are relevant to a query, from a collection of resources such as text files, images, and videos. Information retrieval is considered the most proper form of information-seeking behavior. [4]

Relationship Extraction: The process of detecting semantic relationships between a set of articles. The co-occurrence of terms or their synonyms can be treated as an indicator of a relationship between two artifacts. [5]

Co-reference Resolution: Co-reference resolution plays a very important role in information extraction. It is the process of marking up two expressions that refer to the same entity. [6] For example: Tim_i is a very nice boy, he_i is very hard working. In this sentence, the shared index i marks that Tim and he refer to the same person. Co-references can be divided into several types: anaphora, cataphora, split antecedents, and co-referring noun phrases.

Named Entity Recognition: Named entity recognition is the process of marking up different parts (atoms) of a sentence with their respective entity types, e.g. person name, quantity, and size. It has been regarded as an efficient strategy to capture relevant entities for answering different queries. [7]

Part-of-speech Tagging: Part-of-speech tagging is the process of tagging each word with its respective part of speech. Part-of-speech tagging is often ambiguous because words take different forms. The problem of part-of-speech disambiguation can only be solved after solving other problems associated with natural language understanding. [8]

Syntactic Parsing: The process of analyzing a sentence by determining the structure of its constituent parts. A parse tree is formed in the process of syntactic parsing. Syntactic parsing plays a very important role in semantic role labeling. [9]

Sentiment Analysis: Also called opinion mining. Sentiment analysis is the process of extracting subjective information from a piece of text. The information extracted can be the tone of the author or the inferences that can be made from the text. [10]

Other tasks include word sense disambiguation, word segmentation, topic segmentation, sentence breaking, morphological segmentation, discourse analysis, and stemming.

II. LITERATURE REVIEW

Natural language processing has a variety of real-world applications. It is used extensively in the biomedical field. The National Library of Medicine's MetaMap program is used extensively for text mining and creates a standard for indexing of biomedical terms. [11] Much other research relates to biomedical text mining, for example on cause of death [12], health score [13], and smoking status [14][15][16].

NLP is also being applied in fields such as automated customer care service. Other applications of NLP include generating SQL queries from plain text based on synonymous words [17], a flight schedule query system [18], improving communications in e-democracy [19], intrusion detection [20], text encryption [21], ontology-based natural language processing for in-store shopping [22], and software requirements specification analysis [23].

In DINLP, Apache UIMA and cTAKES are used and modified for the needs of a domain-independent natural language processing and term extraction system. The background for this paper can be divided broadly into three categories: Apache UIMA, Apache cTAKES, and text analysis. Fig. 1 shows the relationship between text analysis, cTAKES, and UIMA.

cTAKES (Clinical Text Analysis and Knowledge Extraction System) is a text analysis system developed by Mayo Clinic and now maintained by Apache. It uses Apache's UIMA (Unstructured Information Management Architecture) for converting unstructured text to structured form. DINLP uses cTAKES for all text analysis and knowledge extraction.

Fig. 1. Relationship between components of DINLP

Text Analysis: Text-based materials are a very important source of valuable information and knowledge. Various pieces of text can be treated as information sources, such as discharge summaries for health care and accident reports for road safety. [24] All these sources provide experiment results or summaries as free text, which is easily readable by humans but complex for computers to understand. [25] The important components of a general text analysis system are information retrieval [26], natural language processing [27], named entity recognition [28], co-reference [29], relationship extraction [30], and sentiment analysis. Text analysis is used in biomedical text mining, social media monitoring, national security, and enterprise business intelligence.

[...] techniques aiming at information extraction from the clinical narrative. The system is being used to process and extract information from free-text clinical notes for PAD (peripheral artery disease). It consists of loosely coupled components, and each component has unique capabilities and responsibilities.

III. METHODOLOGY

The system liberally borrows some of its components from the Apache UIMA framework and Apache cTAKES. This section gives a detailed overview of what has been taken from cTAKES and UIMA, with information about all components that were modified to suit domain-independent needs. The Apache UIMA framework is the core of this system.
It is built on top of all the components such as their Annotators, type system and Analysis Engines. Below is the overall process flow of this system which is made