Detecting Privacy-Sensitive Events in Medical Text
Total Page:16
File Type:pdf, Size:1020Kb
Detecting Privacy-Sensitive Events in Medical Text Prateek Jindal Carl A. Gunter Dan Roth Yahoo! Inc. Department of Computer Department of Computer 2021 S. First St Science, UIUC Science, UIUC Suite 110 201 N. Goodwin Ave 201 N. Goodwin Ave Champaign, IL, USA Urbana, IL, USA Urbana, IL, USA [email protected] [email protected] [email protected] ABSTRACT view as extra-sensitive information when this information is In this paper, we present a novel semi-supervised technique not needed for a specific instance. For instance, a patient for finding privacy-sensitive events in clinical text. Unlike with a twisted ankle would not like her records of childhood traditional semi-supervised methods, we do not require large abuse to be shared automatically for this type of treatment. amounts of unannotated data. Instead, our approach relies There are several types of sensitive data that are found in on information contained in the hierarchical structure of a the clinical narratives. We categorize the sensitive data into large medical encyclopedia. 5 major types below: 1. Mental health and abuse in the family Categories and Subject Descriptors 2. Substance Abuse I.2.6 [Artificial Intelligence]: Learning|Concept Learn- ing; I.2.7 [Artificial Intelligence]: Natural Language Pro- 3. HIV Data cessing|Text Analysis; J.3 [Life and Medical Sciences]: 4. Presence of genetic information in EHRs Medical Information Systems; K.5.2 [Legal Aspects of 5. Sexually transmitted infections and reproductive health Computing]: Government Issues|Regulation In some hospital systems, structured information in the General Terms form of ICD9 codes is used to restrict the sharing of patients' Algorithms, Experimentation, Performance, Security clinical history. For instance, if a record contains ICD9 code 042 (HIV) and a patient has indicated that his HIV status should not be shared, then this can be done by preventing Keywords all or some of his record (including the HIV diagnosis) from Natural Language Processing, Health Informatics, Computer being shared. This strategy has a number of limitations, one Security, Active Learning, Semi-Supervision, Set Expansion, of the most serious of which is the fact that it does not work Concept Identification, Domain Knowledge, SNOMED CT for unstructured data like clinical narratives. In this paper, we present a semi-supervised technique for 1. INTRODUCTION finding phrases in clinical narratives which express a sensi- tive condition about the patient. The advantage of our tech- There is a growing interest in ways to exploit the increased nique is that we don't require any annotated data for train- use of Electronic Health Records (EHRs) to improve health- ing etc. Moreover, we don't even require any unannotated care and reduce its costs. One key goal is to enable records data. This is in contrast to traditional semi-supervised ap- of patients to be shared quickly to assure that the right in- proaches [9, 6] which rely on large amounts of unannotated formation is in the right place when it is needed (such as a data for building distributional contexts. Instead of relying list of allergies and medications for a patient who has arrived on distributional contexts of the relevant concepts, we use in an emergency room) and that redundancy is reduced (so the tree positions of the concepts in a medical encyclopedia that tests run at one institution do not need to be re-run at named SNOMED CT1 to find the similarity between them. another). This sharing raises privacy concerns, even when Problem of finding sensitive information in clinical notes it can be taken as given that a patient wants to allow his is complicated by the fact that the concepts may be negated records to be shared to aid his own care. One area of con- in the text or the concepts may not apply to the patient at cern is the risk of sharing what some patients and regulators all. We briefly discuss these issues as well. Permission to make digital or hard copies of part or all of this work for personal or 2. DESCRIPTION OF OUR METHOD classroom use is granted without fee provided that copies are not made or distributed The input to our algorithm consists of a few examples for profit or commercial advantage, and that copies bear this notice and the full ci- (seeds) of the concepts which the user is interested to find. tation on the first page. Copyrights for third-party components of this work must be For example, if one wants to find the instances of substance honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s). abuse, one may give the following concepts in the input: BCB’14, September 20–23, 2014, Newport Beach, CA, USA. \cocaine, barbiturates and alcohol". Let us represent this ACM 978-1-4503-2894-4/14/09. 1 http://dx.doi.org/10.1145/2649387.2662451. http://www.ihtsdo.org/snomed-ct/ ACM-BCB 2014 617 Level Concepts Level 0 Cocaine, Cocaine measurement Drug measurement, Tropane alkaloid, Level 1 Psychostimulant, Ester type local anesthetic Azabicyclo compound, Local anesthetic, Measurement of substance, Ester, Level 2 Psychotherapeutic agent, Stimulant, Heterocyclic compound, Tropane alkaloid CNS drug, Psychoactive substance, Aza compound, Anesthetic, Level 3 Drug pseudoallergen by function, Alkaloid, Measurement, Organic compound General drug type, Substance categorized functionally, Level 4 Drug pseudoallergen, Evaluation procedure, Techniques, Chemical categorized structurally Table 1: This table shows the descriptor for concept \cocaine". th seed set by S. Let si denote the i element of this seed set. trally acting muscle relaxant, Morphine Derivative, Opiate, Now, the various steps of our algorithm are described in the Hallucinogen, Alcoholic Beverage, Central Depressant. following subsections. 2.3 Step 3: Computing the Similarity of any 2.1 Step 1: Building Individual Descriptors Concept to Seed-Set Using SNOMED CT, we first build a detailed descriptor In this subsection, we describe how to compute the simi- of every concept specified in the seed set S. A concept can larity of any given SNOMED CT concept to the seed set, S, appear at multiple places in SNOMED CT. We define the provided by the user. Let us denote the given SNOMED CT descriptor of a concept to be simply the parents of the con- concept by the variable x. Then the similarity, sim(x; S), of cept upto 5 higher levels. We explain it below with the help the concept x to the seed set S is defined by the following of an example. Let us consider the concept \cocaine". The equation: descriptor of this concept is shown in Table 1. At level 0, two SNOMED CT concepts corresponding to \cocaine" are 4 ! shown. Concepts at any level i + 1 are basically the parents [ \ sim(x; S) = Li(x) MedRep(S) (2) of concepts at level i. It is normal for some of the concepts i=1 to repeat at later levels. These descriptors were made by In other words, similarity of a concept to the seed set is the a simple breadth-first search on the SNOMED CT graph number of unique SNOMED CT concepts in the descriptor starting from the concept under consideration. of the concept that also appear in the seed-set model learned 2.2 Step 2: Obtaining Model for Seed-Set Us- in step 2 above. ing Active Learning 2.4 Step 4: Concept Identification After computing the individual descriptors of all the con- First of all, we map the noun phrases occurring in the cepts in the seed set S, we merge those descriptors into a clinical text to corresponding SNOMED CT concepts us- single descriptor. Let us assume that for concept x, parents ing MetaMap software2. Then, by using the procedure de- at level i are denoted by the set Li(x). Then the levels of scribed in Step 3 above, we find the similarities of all the con- the overall descriptor are defined by the following equation: cepts in any given document to the given seed set. Finally, we rank the concepts in decreasing order of similarity and jSj then apply a threshold (which was computed empirically). [ Li(S) = Li(sj ) 8i (1) This gives us the desired list of most relevant concepts in j=1 the given document which belong to the same category as the seeds input by the user to begin with. After some preprocessing (removal of overly general con- cepts), the descriptor is shown to the user. Then the user is supposed to identify one or more most appropriate SNOMED 3. ASSERTION STATUS CT concepts from this overall descriptor. User response In the above section, we described a technique for finding is recorded into a list. This list constitutes the learned sensitive events in the clinical text. However, it should be model for the concept identification. Let us call this list appreciated that the mere presence of a sensitive concept as MedRep(S). No further input from user is now required. in clinical text does not necessarily raise privacy concerns To give an example, in one of our sample runs, MedRep(S) about patient's medical information. contained the following SNOMED CT concepts for the seed We give two examples to illustrate this fact. For the first set described at the beginning of this section: Psychoactive example, consider a clinical narrative that says, \The patient substance, Anxiolytic, sedative AND/OR hypnotic, Sympa- thomimetic agent, Centrally acting hypotensive agent, Cen- 2http://metamap.nlm.nih.gov/ ACM-BCB 2014 618 does not inject heroin". Even though the drug heroin is men- this simple seed set, we were able to identify in our corpus tioned in this sentence, there is no sensitive information be- several other products which are often used for substance- cause of the negation.