Biomedical Natural Language Processing and Text Mining
To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s), reviewer(s), Elsevier and typesetter SPS. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher and is confidential until formal publication.
Chapter 6
Biomedical Natural Language Processing and Text Mining

Kevin Bretonnel Cohen
Computational Bioscience Program, University of Colorado School of Medicine, Department of Linguistics, University of Colorado at Boulder, USA

Methods in Biomedical Informatics. http://dx.doi.org/10.1016/B978-0-12-401678-1.00006-3 © 2014 Elsevier Inc. All rights reserved.

CHAPTER OUTLINE
6.1 Natural Language Processing and Text Mining Defined
6.2 Natural Language
6.3 Approaches to Natural Language Processing and Text Mining
6.4 Some Special Considerations of Natural Language Processing in the Biomedical Domain
6.5 Building Blocks of Natural Language Processing Applications
    6.5.1 Document Segmentation
    6.5.2 Sentence Segmentation
    6.5.3 Tokenization
    6.5.4 Part of Speech Tagging
    6.5.5 Parsing
6.6 Additional Components of Natural Language Processing Applications
    6.6.1 Word Sense Disambiguation
    6.6.2 Negation Detection
    6.6.3 Temporality
    6.6.4 Context Determination
    6.6.5 Hedging
6.7 Evaluation in Natural Language Processing
6.8 Practical Applications: Text Mining Tasks
    6.8.1 Information Retrieval
    6.8.2 Named Entity Recognition
    6.8.3 Named Entity Normalization
    6.8.4 Information Extraction
    6.8.5 Summarization
    6.8.6 Cohort Retrieval and Phenotype Definition
6.9 Software Engineering in Natural Language Processing
    6.9.1 Architectures
    6.9.2 Scaling
    6.9.3 Quality Assurance and Testing
6.10 Conclusion
Acknowledgments
References

6.1 NATURAL LANGUAGE PROCESSING AND TEXT MINING DEFINED

Natural language processing is the study of computer programs that take natural, or human, language as input. Natural language processing applications may approach tasks ranging from low-level processing, such as assigning parts of speech to words, to high-level tasks, such as answering questions. Text mining is the use of natural language processing for practical tasks, often related to finding information in prose of various kinds. In practice, natural language processing and text mining exist on a continuum, and there is no hard and fast line between the two.

6.2 NATURAL LANGUAGE

Natural language is human language, as opposed to computer languages. The difference between them lies in the presence of ambiguity. No well-designed computer language is ambiguous. In contrast, all known natural languages exhibit the property of ambiguity. Ambiguity occurs when an input can have more than one interpretation. Ambiguity exists at all levels of human language. (For illustrative purposes, we will focus on written language, since that is the domain of most biomedical natural language processing, with the exception of speech recognition applications.) Consider, for example, the word discharge.
This word is ambiguous in that it may be a noun or a verb. In this example, we see both cases:

    With a decreased LOS, inpatient rehabilitation services will have to improve FIM efficiency or discharge patients with lower discharge FIM scores. (PMID 12917856)

In the case of discharge patients, it is a verb. In the case of discharge FIM scores, it is a noun. Suppose that an application (called a part of speech tagger) has determined that it is a noun. It remains ambiguous as to its meaning. In these examples, we see two common meanings in biomedical text:

    Patient was prescribed codeine upon discharge
    The discharge was yellow and purulent

In the first case, discharge has the meaning “release from the hospital,” while in the second case, it refers to a body fluid. Once an application has determined which is the correct meaning, it remains ambiguous as to its relationship to the words with which it appears; that is, there is syntactic ambiguity. Should discharge FIM scores be interpreted as discharge modifying FIM scores, or as discharge FIM modifying scores? The correct analyses (noun versus verb, release from hospital versus body fluid, and discharge [FIM scores] versus [discharge FIM] scores) are so obvious to the human reader that we are typically not even consciously aware of the ambiguities.
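
To make the syntactic ambiguity concrete, the following minimal sketch enumerates every binary bracketing of a short phrase such as discharge FIM scores. It is an illustration only, not a method described in this chapter: the function names bracketings and show are hypothetical, and a real parser would also have to rank the candidate analyses rather than merely list them.

```python
from typing import List, Sequence, Tuple, Union

# A tree is either a single word or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def bracketings(words: Sequence[str]) -> List[Tree]:
    """Return every binary bracketing of a non-empty sequence of words."""
    if len(words) == 1:
        return [words[0]]
    trees: List[Tree] = []
    # Try every position at which the phrase can be split into two parts,
    # then combine each analysis of the left part with each of the right.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                trees.append((left, right))
    return trees


def show(tree: Tree) -> str:
    """Render a tree using the square-bracket notation of the text."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return f"[{show(left)} {show(right)}]"


phrase = ["discharge", "FIM", "scores"]
readings = [show(tree) for tree in bracketings(phrase)]
for reading in readings:
    print(reading)
# Prints:
# [discharge [FIM scores]]
# [[discharge FIM] scores]
```

A three-word phrase has only two bracketings, but the number of candidates grows quickly with length (a four-word phrase already has five), and each word may additionally carry several parts of speech and several senses.
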
However, ambiguity is so pervasive in natural language that it is difficult to produce a sentence that is not ambiguous in some way, and the ambiguity that humans resolve so easily is one of the central problems that natural language processing must deal with.

The statistical properties of language would be much different if the same ideas were always presented the same way. However, one of the predominant features of human language is that it allows for an enormous amount of variability. Just as we saw that ambiguity is a fact at all levels of language, variability occurs at all levels of language as well.

On the most basic level, corresponding to the sounds of spoken language, we have variability in spelling. This is most noticeable when comparing British and American medical English, where we see frequent variants like haematoma in British English versus hematoma in American English.

At the level of morphemes, or word parts, we see ambiguity in how words should be analyzed. Given a word such as unlockable (PMID 23003012), should we interpret it as [un[lockable]] (not capable of being locked), or as [[unlock]able] (capable of being unlocked)?

At the level of words, the phenomenon of synonymy allows for using different, but (roughly) equivalent, words. For example, we might see cardiac surgery referred to as heart surgery. One common type of synonymy comes from using abbreviations or acronyms, such as CAT for computed axial tomography. Of course, synonymy multiplies the chances for ambiguity: it is commonly the case that biomedical synonyms can have multiple definitions.

Variability exists at the syntactic level, as well.