Articles

WatsonPaths: Scenario-Based Question Answering and Inference over Unstructured Information

Adam Lally, Sugato Bagchi, Michael A. Barborak, David W. Buchanan, Jennifer Chu-Carroll, David A. Ferrucci, Michael R. Glass, Aditya Kalyanpur, Erik T. Mueller, J. William Murdock, Siddharth Patwardhan, John M. Prager

We present WatsonPaths, a novel system that can answer scenario-based questions. These include medical questions that present a patient summary and ask for the most likely diagnosis or most appropriate treatment. WatsonPaths builds on the IBM Watson question-answering system. WatsonPaths breaks down the input scenario into individual pieces of information, asks relevant subquestions of Watson to conclude new information, and represents these results in a graphical model. Probabilistic inference is performed over the graph to conclude the answer. On a set of medical test preparation questions, WatsonPaths shows a significant improvement in accuracy over multiple baselines.

IBM Watson is a question-answering system that takes natural language questions as input and produces precise answers along with accurate confidences as output (Ferrucci et al. 2010). In 2011, in a modified version of the quiz show Jeopardy!, Watson defeated two of the best human players.

Jeopardy! questions are usually factoid questions — the answer and supporting evidence are usually stated explicitly in some document in the corpus. While in practice we may retrieve multiple redundant documents, in principle the answer could be expressed succinctly in one. The main challenges for a factoid question-answering system are retrieving the correct document, and then extracting the correct answer from the document. At the core of Watson's question answering is a suite of algorithms that match passages containing candidate answers to the original question. These algorithms have been described in a series of articles (Chu-Carroll et al. 2012; Ferrucci 2012; Gondek et al. 2012; Lally et al. 2012; McCord, Murdock, and Boguraev 2012; Murdock et al. 2012a; 2012b). In some important applications, however, questions do not have this "factoid" character. Consider the

Copyright © 2017, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602. AI Magazine, Summer 2017.

"A 32-year-old woman with type 1 diabetes mellitus has had progressive renal failure... Her hemoglobin concentration is 9 g/dL... A blood smear shows normochromic, normocytic cells. What is the problem?"

[Figure content: a graph linking the patient's findings — hemoglobin concentration of 9 g/dL (low), a blood smear showing normocytic cells, and renal failure — through intermediate conclusions (patient has anemia; patient has normocytic anemia; patient is at risk for erythropoietin deficiency) to the final conclusion that the most likely cause of the low hemoglobin concentration is erythropoietin deficiency. Edges carry evidence passages such as "Low hemoglobin conc. indicates anemia," "Erythropoietin is produced in the kidneys," "Normocytic anemia is a type of anemia with normal red blood cells," and "Erythropoietin deficiency is a cause of normocytic anemia."]

Figure 1. A Simple Diagnosis Graph for a Patient with Erythropoietin Deficiency.
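The graph in figure 1 can be captured in a very small data structure. The sketch below is a hypothetical encoding (the node labels and helper function are invented for illustration, not part of WatsonPaths), showing how each inference edge can carry its supporting evidence passage.

```python
# A minimal, hypothetical encoding of the diagnosis graph in figure 1.
# Each edge links a premise to a conclusion and carries the evidence
# passage that supports the inference.

edges = [
    ("hemoglobin conc. is 9 g/dL [low]", "patient has anemia",
     "Low hemoglobin conc. indicates anemia."),
    ("blood smear shows normocytic cells", "patient has normocytic anemia",
     "Normocytic anemia is a type of anemia with normal red blood cells."),
    ("patient has anemia", "patient has normocytic anemia",
     "Normocytic anemia is a type of anemia with normal red blood cells."),
    ("patient has renal failure", "at risk for erythropoietin deficiency",
     "Erythropoietin is produced in the kidneys."),
    ("at risk for erythropoietin deficiency",
     "erythropoietin deficiency causes the low hemoglobin conc.",
     "Erythropoietin deficiency is a cause of normocytic anemia."),
    ("patient has normocytic anemia",
     "erythropoietin deficiency causes the low hemoglobin conc.",
     "Erythropoietin deficiency is a cause of normocytic anemia."),
]

def supporters(conclusion):
    """Premises and evidence passages that directly support a conclusion."""
    return [(head, ev) for head, tail, ev in edges if tail == conclusion]

for head, ev in supporters("erythropoietin deficiency causes the low hemoglobin conc."):
    print(f"{head}  --  {ev}")
```

Tracing backward from the final conclusion in this way recovers exactly the chains of evidence that the domain experts drew by hand.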

following questions, one from medicine and one from taxation:

A 32-year-old woman with type 1 diabetes mellitus has had progressive renal failure. Her hemoglobin concentration is 9 g/dL. A blood smear shows normochromic, normocytic cells. What is the problem?

I inherited real estate from a relative who died 5 years ago via a trust that was created before his death. The property was sold this year after dissolution of the trust, and the money was put in a Roth IRA. Which tax form(s) do I need to file?

We will call these types of questions scenario-based questions. In these types of questions, it is not generally the case that the answer and supporting evidence can be contained in one document. Rather, for many scenario-based questions, information from multiple documents and other sources must generally be retrieved and then integrated to answer the questions properly. Furthermore, we must often apply general knowledge to a specific case, as in a medical scenario about a patient.

Before beginning work on automated scenario-based question answering, we investigated how humans solve such questions. We asked domain experts to describe their approach to solving a set of scenario-based questions in the medical domain. An example is shown in figure 1. Many drew a graph of initial signs and symptoms leading to their most likely possible causes and connecting them to a final conclusion. This motivated us to look into graph-based methods as a way of answering scenario-based questions automatically.

In this article, we describe WatsonPaths, a system that builds on Watson to answer scenario-based questions. The core idea is to break the question down into parts, over which we can ask and answer factoid subquestions using Watson, then integrate these answers into a graphical model that can be used to

answer the larger scenario-based question. We show that WatsonPaths not only outperforms a baseline system that uses simple information retrieval, but also outperforms its own subcomponent, Watson, in answering a set of scenario-based questions from the medical domain.

WatsonPaths Medical Use Case

Although WatsonPaths is intended as a domain-general technology for scenario-based question answering, we decided to start by focusing our attention on the medical domain. We focused on the problem of patient scenario analysis, where the goal is typically a diagnosis or a treatment recommendation.

To explore this kind of problem solving, we obtained a set of medical test preparation questions. These are multiple-choice medical questions based on an unstructured or semistructured natural language description of a patient. Although WatsonPaths is not restricted to multiple-choice questions, we saw multiple-choice questions as a good starting point for development. Many of these questions involve diagnosis, either as the entire question, as in the previous medical example, or as an intermediate step, as in the following example:

A 63-year-old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago. At first it was only on the left hand, but now it compromises the whole arm. At physical exam, the patient has an unexpressive face and difficulty in walking, and a continuous movement of the tip of the first digit over the tip of the second digit of the left hand is seen at rest. What part of his nervous system is most likely affected?

For this question, it is useful to diagnose that the patient has Parkinson's disease before determining which part of his nervous system is most likely affected. These multistep inferences are a natural fit for the graphs that WatsonPaths constructs. In this example, the diagnosis is the missing link on the way to the final answer.

Scenario-Based Question Answering

In scenario-based question answering, the system receives a scenario description that ends with a punch line question. For instance, the punch line question in the Parkinson's example is "What part of his nervous system is most likely affected?" Instead of treating the entire scenario as one monolithic question as would Watson, WatsonPaths explores multiple facts in the scenario in parallel and reasons with the results of its exploration as a whole to arrive at the most likely conclusion regarding the punch line question. The architecture of WatsonPaths is shown in figure 2. In this section, we briefly outline each step, while the bulk of the rest of the article goes into more detail on important steps.

Scenario Analysis

The first step in the pipeline is scenario analysis, where we identify factors in the input scenario that may be of importance. In the medical domain, the factors may include demographics ("32-year-old woman"), preexisting conditions ("type 1 diabetes mellitus"), signs and symptoms ("progressive renal failure"), and test results ("hemoglobin concentration is 9 g/dL," "normochromic cells," "normocytic cells"). The extracted factors become nodes in a graph structure called the assertion graph, on which the remaining steps of the process will operate.

Node Prioritization

The next step is node prioritization, where we decide which nodes in the graph are most important for solving the problem. In a small scenario like this example, we may be able to explore everything, but in general this will not be the case. Factors that affect the priority of a node may include the system's confidence in the node assertion or the system's estimation of how fruitful it would be to expand a node. For example, normal test results and demographic information are generally less useful for starting a diagnosis than symptoms and abnormal test results.

Relation Generation

The relation-generation step builds the assertion graph. We do this primarily by asking Watson questions about the factors. In medicine we want to know the causes of the findings and abnormal test results that are consistent with the patient's demographic information and normal test results. Given the scenario in the introduction, we could ask, "What does type 1 diabetes mellitus cause?" We use a medical ontology to guide the process of formulating subquestions to ask Watson. Relevant factors may also be combined to form a single, more targeted question. Because in this step we want to emphasize recall, we take several of Watson's highly ranked answers. The exact number of answers taken, or the confidence threshold, are parameters that must be tuned. Given a set of answers, we add them to the graph as nodes, with edges from nodes that were used in questions to nodes that were answers. The edge is labeled with the predicate used to formulate the question (like causes or indicates), and the strength of the edge is initially set to Watson's confidence in the answer.

Although Watson is the primary way we add edges to the graph, WatsonPaths allows for any number of relation generator components to post edges to the graph. For instance, we apply term matchers to pairs of nodes, and post a relation between nodes that match.

Belief Computation

Once the assertion graph has been expanded in this way, we recompute the confidences of nodes in the graph based on new information. We do this using


[Figure content: a pipeline from the Input Scenario through Scenario Analysis, Node Prioritization (over the Assertion Graph), Relation (Edge) Generation (which may ask questions of Watson), and Estimate Confidence in Nodes (the "Belief Engine"), repeated until "completion" (which may be defined in different ways), followed by Hypothesis Identification, Hypothesis Confidence Refinement (a learned model), and Final Confidences in Hypotheses.]

Figure 2. Scenario-Based Question-Answering Architecture.
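The control flow of figure 2 can be made concrete with a short skeleton. The sketch below is purely illustrative: the helper functions are trivial stand-ins for the real WatsonPaths modules (their names and the canned Watson answer are invented), intended only to show how data flows around the loop.

```python
# Hypothetical skeleton of the figure 2 loop. Each helper is a toy
# stand-in for a WatsonPaths component, not the real implementation.

def analyze_scenario(scenario):
    # Real system: an NLP pipeline extracting typed clinical factors.
    return {"factors": scenario.split("; "), "edges": [], "confidence": {}}

def prioritize_nodes(graph):
    # Real system: rank nodes by confidence and expected usefulness.
    return list(graph["factors"])

def generate_relations(graph, frontier):
    # Real system: formulate subquestions to Watson; here, a canned lookup.
    canned = {"resting tremor": ("causes", "Parkinson disease", 0.8)}
    for node in frontier:
        if node in canned:
            predicate, answer, conf = canned[node]
            graph["edges"].append((node, predicate, answer))
            graph["confidence"][answer] = conf

def estimate_confidences(graph):
    # Real system: probabilistic inference over the whole graph.
    return graph["confidence"]

def answer_scenario(scenario, iterations=1):
    graph = analyze_scenario(scenario)
    for _ in range(iterations):
        frontier = prioritize_nodes(graph)
        generate_relations(graph, frontier)
        estimate_confidences(graph)
    return graph

g = answer_scenario("resting tremor; unexpressive face")
print(g["edges"])  # [('resting tremor', 'causes', 'Parkinson disease')]
```

Running more iterations would let answers from one round (here, "Parkinson disease") become the subjects of the next round's subquestions.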

probabilistic inference systems that take a holistic view of the assertion graph and try to reconcile the results of multiple paths of exploration.

Hypothesis Identification

As figure 2 shows, the process can go through multiple iterations, during which the nodes that were the answers to the previous round of questions can be used to ask the next round of questions, producing more nodes and edges in the graph.

After each iteration we may do hypothesis identification, where some nodes in the graph are identified as potential final answers to the punch line question (for example, the most likely diagnoses of a patient's problem). In some situations hypotheses may be provided up front — a physician may have a list of competing diagnoses and want to explore the evidence for each. But in general the system needs to identify these. Hypothesis nodes may be treated differently in later iterations. For instance, we may attempt to do backward chaining from the hypotheses, asking Watson what things, if they were true of the patient, would support or refute a hypothesis. The process may terminate after a fixed number of iterations or based on some other criterion like confidence in the hypotheses.

While hypothesis identification is part of WatsonPaths, it is not described in detail in this article. In the system that generates the results we present in this article, no hypothesis identification is necessary because the multiple-choice answers are provided. That system always does one iteration of expansion, both forward from the identified factors and backward from the hypotheses, before stopping.

Hypothesis Confidence Refinement

As described so far, WatsonPaths's confidence in each hypothesis depends on the strengths of the edges leading to it, and since our primary relation (edge) generator is Watson, the hypothesis confidence depends heavily on the confidence of Watson's answers. Having good answer confidence depends on having a representative set of question/answer pairs with which to train Watson. The following question arises: What can we do if we do not have a representative set of question/answer pairs, but we do have training examples for entire scenarios (for example, correct diagnoses associated with patient scenarios)? To leverage the available scenario-level ground truth, we have built machine-learning techniques to learn a refinement of Watson's confidence estimation that produces better results when applied to the entire scenario. We describe our techniques in the Learning over Assertion Graphs section.

Assertion Graphs

The core data structure used by WatsonPaths is the assertion graph. Figure 3 explains this data structure, along with the visualization that we commonly use for it. Assertion graphs are defined as follows.

A statement is something that can be true or false (though its state may not be known). Often we deal with unstructured statements, which are natural language expressions like "A 63-year-old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago." WatsonPaths also allows for statements that are structured expressions, namely, a predicate and arguments. Not all natural language expressions can have a truth value. For instance, the string "patient" cannot be true or false; thus it does not fit into the semantics of an assertion graph. WatsonPaths is charitable in interpreting strings as if they had a truth value. For instance, the default semantics of the string "low hemoglobin" is the same as "patient has low hemoglobin."

A relation is a named association between statements. Technically, relations are themselves statements, and have a truth value. A relation has a head, a tail, and a predicate; for instance, in medicine we may say that "Parkinson's causes resting tremor" or "Parkinson's matches Parkinsonism." Typically we are concerned with relations that may provide evidence for the truth of one statement given another. Although some relations may have special meanings in the probabilistic inference systems, a common semantics for a relation is indicative in the following way: "A indicates B" means that the truth of A provides an independent reason to believe that B is true.

An assertion is a claim that some agent makes about the truth of a statement (including a relation). The assertion records the name of the agent and a confidence value. Assertions may also record provenance information that explains how the agent came to its conclusion. For the Watson question-answering agent, this includes natural language passages that provide evidence for the answer.

In the assertion graph, each node represents exactly one statement, and each edge represents exactly one relation. Nodes and edges may have multiple assertions attached to them, one for each agent that has asserted that node or edge to be true.

We often visualize assertion graphs by using a node's border width to represent the confidence of the node, an edge's width to represent the confidence of the edge, and an edge's gray level as the amount of "belief flow" along that edge. Belief flow is described later, but essentially it is how much the value of the head influences the value of the tail. This depends mostly on the confidences of the assertions on the edge.

Scenario Analysis

The goal of scenario analysis is to identify information in the natural language narrative of the problem scenario that is potentially relevant to solving the problem. When human experts read the problem narrative, they are trained to extract concepts that match a set of semantic types relevant for solving the problem. In the medical domain, doctors and nurses identify semantic types like chief complaints, past medical history, demographics, family and social history, physical examination findings, labs, and current medications (Bowen 2006). Experts also generalize from specific observations in a particular problem instance to more general terms used in the domain corpus. An important aspect of this information extraction is to identify the semantic qualifiers associated with the clinical observations (Chang, Bordage, and Connell 1998).


[Figure content: an example assertion graph for the Parkinson's scenario. The scenario ("A 63-year-old patient is sent to the neurologist with ... resting tremor ... What part of his nervous system is most likely affected?") states an input factor, "patient exhibits resting tremor," which indicates an inferred factor, "patient has Parkinson's Disease," which in turn indicates the hypothesis "patient's Substantia Nigra is affected." A node represents a statement; types of statements are input factors, inferred factors, and hypotheses or answers, and border strength visually represents "belief" that the factor is true in context. An edge represents a relation between the connected statements; agents make assertions about the truth of these relations with confidences. Edge width represents that confidence, and gray level represents the amount of belief flow.]

Figure 3. Visualization of an Assertion Graph. By convention, input factors are placed at the top and hypotheses at the bottom with levels of inference factors in between.
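The definitions of statements, relations, and assertions above map naturally onto a small data model. The sketch below is an illustrative reconstruction, not the actual WatsonPaths implementation; the class names, the example confidences, and the maximum-confidence aggregation are all assumptions made for the sake of a runnable example.

```python
from dataclasses import dataclass, field

# Illustrative data model for an assertion graph: statements are nodes,
# relations are edges, and each may carry assertions from several agents.

@dataclass(frozen=True)
class Statement:
    text: str                      # unstructured statement, e.g. "patient has anemia"

@dataclass(frozen=True)
class Relation:
    head: Statement
    predicate: str                 # e.g. "indicates", "causes", "matches"
    tail: Statement

@dataclass
class Assertion:
    agent: str                     # who made the claim, e.g. "Watson"
    confidence: float
    provenance: str = ""           # e.g. an evidence passage

@dataclass
class AssertionGraph:
    nodes: dict = field(default_factory=dict)   # Statement -> [Assertion, ...]
    edges: dict = field(default_factory=dict)   # Relation  -> [Assertion, ...]

    def assert_node(self, stmt, assertion):
        self.nodes.setdefault(stmt, []).append(assertion)

    def assert_edge(self, rel, assertion):
        self.edges.setdefault(rel, []).append(assertion)

    def node_confidence(self, stmt):
        # Toy aggregation: take the strongest agent's confidence.
        return max((a.confidence for a in self.nodes.get(stmt, [])), default=0.0)

g = AssertionGraph()
tremor = Statement("patient exhibits resting tremor")
pd = Statement("patient has Parkinson's disease")
g.assert_node(tremor, Assertion("scenario analysis", 0.95))
g.assert_node(pd, Assertion("Watson", 0.7, "passage about Parkinsonian tremor"))
g.assert_edge(Relation(tremor, "indicates", pd), Assertion("Watson", 0.7))
print(g.node_confidence(pd))  # 0.7
```

Keeping a list of assertions per node and per edge is what lets multiple agents (scenario analysis, Watson, term matchers) weigh in on the same statement without overwriting one another.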

These qualifiers could be temporal (for example, "pain started two days ago"), spatial ("pain in the epigastric region"), or other associations ("pain after eating fatty foods"). Implicit in this task is the human's ability to extract concepts and their associated qualifiers from the natural language narrative. For example, the above qualifiers might have to be extracted from the sentence "The patient reports pain, which started two days ago, in the epigastric region especially after eating fatty foods."

The computer system needs to perform a similar analysis of the narrative. We use the term factor to denote the potentially relevant observations along

with their associated semantic qualifiers. Reliably identifying and typing these factors, however, is a difficult task, because medical terms are far more complex than the kind of named entities typically studied in natural language processing. Our scenario analytics pipeline attempts to address this problem with the following major processing steps.

Step one: The analysis starts with syntactic parsing of the natural language. This creates a dependency tree of syntactically linked terms in a sentence and helps to associate terms that are distant from each other in the sentence.

Step two: The terms are mapped to a dictionary to identify concepts and their semantic types. For the medical domain, our dictionary is derived from the Unified Medical Language System (UMLS) Metathesaurus (National Library of Medicine 2009), Wikipedia redirects, and medical abbreviation resources. The concepts identified by the dictionary are then typed using the UMLS semantic network, which consists of a taxonomy of biological and clinical semantic types like Anatomy, SignOrSymptom, DiseaseOrSyndrome, and TherapeuticOrPreventativeProcedure. In addition to mapping the sequence of tokens in a sentence to the dictionary, the dependency parse is also used to map syntactically linked terms. For example, "… stiffness and swelling in the arm and leg" can be mapped to the four separate concepts contained in that phrase.

Step three: The syntactic and semantic information identified above is used by a set of predefined rules to identify important relations. Negation is commonly used in clinical narratives and needs to be accurately identified. Rules based on parse features identify the negation trigger term and its scope in a sentence. Factors found within the negated scope can then be associated with a negated qualifier. Another example of rule-based annotation is lab value analysis. This associates a quantitative measurement to the substance measured and then looks up reference lab value ranges to make a clinical assessment. For example, "hemoglobin concentration is 9 g/dL" is processed by rules to extract the value, unit, and substance and then assessed to be "low hemoglobin" by looking up a reference. Next, the clinical assessment is mapped by the dictionary to the corresponding clinical concept.

At this point, we should have all the information to identify factors and their semantic qualifiers. We have to contend, however, with language ambiguities, errors in parsing, a noisy and noncomprehensive dictionary, and a limited set of rules. If we were to rely solely on a rule-based system, then the resulting factor identification would suffer from a compounding of errors in these components. To address this issue, we employ machine-learning methods to learn clinical factors and their semantic qualifiers in the problem narrative. We obtained the ground truth by asking medical students to annotate clinical factor spans and their semantic types. They also annotated semantic qualifier spans and linked them to factors as attributive relations.

The machine-learning system comprises two sequential steps.

In step one, a conditional random field (CRF) model (Lafferty, McCallum, and Pereira 2001) learns the spans of text that should be marked as one of the following factor types: finding, disease, test, treatment, demographics, negation, or a semantic qualifier. Features used for training the CRF model are lexical (lemmas, morphological information, part-of-speech tags), semantic (UMLS semantic types and groups, demographic and lab value annotations), and parse-based (features associated with dependency links from a given token). A token window size of five (two tokens before and after) is used to associate features for a given token. A BIO tagging scheme is used by the CRF to identify entities in terms of their token spans and types.

In step two, a maximum entropy model then learns the relations between the entities identified by the CRF model. For each pair of entities in a sentence, this model uses lexical features (within and between entities), entity types, and other semantic features associated with both entities, and parse features in the dependency path linking them. The relations learned by this model are negation and attributeOf relations linking negation triggers and semantic qualifiers (respectively) to factors.

The combined entity and relation identification models have a precision of 71 percent and a recall of 65 percent on a blind evaluation set of patient scenarios found in medical test preparation questions. We are currently exploring joint inference models and identification of relations that span multiple sentences using coreference resolution.

Relation Generation

The scenario analysis component described in the previous section extracts pertinent factors related to the patient from the scenario description. At this stage, the assertion graph consists of the full scenario, individual scenario sentences, and the extracted factors. An indicates relation is posted from a source node (for example, a scenario sentence node) to a target node whose assertion was derived from the assertion in the source node (for example, a factor extracted from that sentence). In addition, a set of hypotheses, if given, is posted as the goal nodes in the assertion graph.

The task of the relation generation component is to (1) expand the graph by inferring new facts from known facts in the graph and (2) identify relationships between nodes in the graph (like matches and contraindicates) to help with reasoning and confidence estimation. We begin by discussing how we infer new facts for graph expansion.

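The bootstrap structure just described — a scenario node, its sentence nodes, and the extracted factors, joined by indicates edges from each source to what was derived from it — can be sketched in a few lines. The function and node labels below are invented for illustration, under the assumption that nodes can be represented as plain strings.

```python
# Hypothetical sketch of the assertion graph as it stands after scenario
# analysis: the full scenario, its sentences, and the extracted factors,
# joined by "indicates" edges from source to derived assertion.

def bootstrap_graph(scenario, sentences, factors_by_sentence, hypotheses=()):
    nodes = {scenario} | set(sentences) | set(hypotheses)
    edges = [(scenario, "indicates", s) for s in sentences]
    for sentence, factors in factors_by_sentence.items():
        for factor in factors:
            nodes.add(factor)
            edges.append((sentence, "indicates", factor))
    return nodes, edges

nodes, edges = bootstrap_graph(
    scenario="full scenario text",
    sentences=["sentence 1", "sentence 2"],
    factors_by_sentence={"sentence 1": ["resting tremor"],
                         "sentence 2": ["unexpressive face", "difficulty in walking"]},
    hypotheses=["substantia nigra", "cerebellum"],
)
print(len(nodes), len(edges))  # 8 5
```

The hypotheses start out as isolated goal nodes; the expansion steps described next are what connect them to the rest of the graph.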

Expanding the Graph with Watson

In medical problem solving, experts reason with chief complaints, findings, medical history, demographic information, and so on, to identify the underlying causes for the patient's problems. Depending on the situation, they may then proceed to propose a test whose results will allow them to distinguish between multiple possible problem causes, or identify the best treatment for the identified cause, and so on.

Motivated by the medical problem-solving paradigm, WatsonPaths first attempts to make a diagnosis based on factors extracted from the scenario. The graph is expanded to include new assertions about the patient by asking questions of a version of the Watson question-answering system adapted for the medical domain (Ferrucci et al. 2013). WatsonPaths takes a two-pronged approach to medical problem solving by expanding the graph forward from the scenario in an attempt to make a diagnosis, and then linking high-confidence diagnoses with the hypotheses. The latter step is typically done by identifying an important relation expressed in the punch line question (for example, "What is the most appropriate treatment for this patient?" or "What body part is most likely affected?"). This approach is a logical extension of the open-domain work of Prager, Chu-Carroll, and Czuba (2004), where in order to build a profile of an entity, questions were asked of properties of the entity and constraints between the answers were enforced to establish internal consistency.

The graph expansion process of WatsonPaths begins with automatically formulating questions related to high-confidence assertions, which in our graphs represent statements WatsonPaths believes to be true to a certain degree of confidence about the patient. These statements may be factors, as extracted and typed by our scenario analysis algorithm, or combinations of those factors.

To determine what kinds of questions to ask, WatsonPaths can use a domain model that tells us what relations form paths between the semantic type of a high-confidence node and the semantic type of a hypothesis like a diagnosis or treatment. For the medical domain, we created a model that we called the Emerald, which is shown in figure 4. (Notice the resemblance to an emerald.) The Emerald is a small model of entity types and relations that are crucial for diagnosis and for formulating next steps.

We select from the Emerald all relations that link the semantic type of a high-confidence source node to a semantic type of interest. The relations and the high-confidence nodes then form the basis of instantiating the target nodes, thereby expanding the assertion graph. To instantiate the target nodes, we issue WatsonPaths subquestions to Watson. All answers returned by Watson that score above a predetermined threshold are posted as target nodes in the inference graph. A relation edge is posted from the source node to each new target node, where the confidence of the relation is Watson's confidence in the answer in the target node.

In addition to asking questions from scenario factors, WatsonPaths may also expand backwards from hypotheses. The premise for this approach is to explore how a hypothesis fits in with the rest of the inference graph. If one hypothesis is found to have a strong relationship with an existing node in the assertion graph, then our probabilistic inference mechanisms allow belief to flow from known factors to that hypothesis, thus increasing the system's confidence in that hypothesis.

Figure 5 illustrates the WatsonPaths graph expansion process. The top two rows of nodes and the edges between them show a subset of the WatsonPaths assertion graph after scenario analysis, with the second row of nodes representing some clinical factors extracted from the scenario sentences. The graph expansion process identifies the most confident assertions in the graph, which include the four clinical factor nodes extracted from the scenario. These four nodes are all typed as findings, so they are aggregated into a single finding node for the purpose of graph expansion. For a finding node, the Emerald proposes a single findingOf relation that links it to a disease. This results in the formulation of the subquestion "What disease causes resting tremor that began 2 years ago, compromises the whole arm, unexpressive face, and difficulty in walking?" whose answers include Parkinson's disease, Huntington's disease, cerebellar disease, and so on. These answer nodes are added to the graph and some of them are shown in the third row of nodes in figure 5.

In the reverse direction, WatsonPaths explores relationships between hypotheses and nodes in the existing graph based on the punch line question in the scenario, which in this case is "What part of his nervous system is most likely affected?" Assuming each hypothesis to be true, the system formulates subquestions to link it to the assertion graph. Consider the hypothesis substantia nigra. WatsonPaths can ask "In what disease is substantia nigra most likely affected?" A subset of the answers to this question, including Parkinson's disease and diffuse Lewy body disease, are shown in the fourth row of nodes in figure 5.

Matching Graph Nodes

When a new node is added to the WatsonPaths assertion graph, we compare the assertion in the new node to those in existing nodes to ensure that equivalence relations between nodes are properly identified. This is done by comparing the statements in those assertions: for unstructured statements, whether the statements are lexically equivalent, and for structured statements, whether the predicates and their arguments are the same. A more complex operation is to identify when nodes contain assertions that may be equivalent to the new assertion.
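One way such equivalence checking can be aggregated across several simple matchers is sketched below. The matcher resources here (a synonym table and a redirect table) are toy stand-ins for resources like WordNet, Wikipedia redirects, and UMLS concept variants, and the maximum-of-votes aggregation is an assumption for illustration.

```python
# Sketch of aggregating several term matchers to score node equivalence.
# The tables are toy stand-ins for WordNet synsets, Wikipedia redirects,
# and UMLS variants; confidences are invented for illustration.

SYNONYMS  = {("parkinson's disease", "parkinson disease")}
REDIRECTS = {("parkinsonism", "parkinson's disease")}

def exact_matcher(a, b):
    return 1.0 if a == b else 0.0

def synonym_matcher(a, b):
    return 1.0 if (a, b) in SYNONYMS or (b, a) in SYNONYMS else 0.0

def redirect_matcher(a, b):
    return 0.8 if (a, b) in REDIRECTS or (b, a) in REDIRECTS else 0.0

MATCHERS = [exact_matcher, synonym_matcher, redirect_matcher]

def match_confidence(a, b):
    """Aggregate matcher votes; here, simply the strongest signal."""
    a, b = a.lower(), b.lower()
    return max(m(a, b) for m in MATCHERS)

print(match_confidence("Parkinson disease", "Parkinson's disease"))  # 1.0
```

A matches edge posted between two nodes whenever this score exceeds a threshold is what lets belief flow between variant spellings of the same concept.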


[Figure content: a diamond-shaped model of entity types — disease, microbe, anatomy, substance, treatment, test, clinical attribute, bio-function, finding, and demographics/social/genetic attributes — connected by relations such as causeOf, riskFactorOf, findingOf, locationOf, mechanismOf, manifestationOf, hasIngredient, measures, affects, diagnoses, and treats/prevents.]

Figure 4. The Emerald.
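Selecting Emerald relations and turning them into subquestions can be sketched as a table lookup plus templates. The relation triples below are a partial, illustrative rendering of figure 4, and the question templates are invented for illustration; the real system formulates questions from its medical ontology.

```python
# Illustrative subset of the Emerald: (source type, relation, target type).
# Given a high-confidence node's semantic type, select the relations that
# lead toward a type of interest and turn each into a subquestion.
# Triples and templates are assumptions made for this sketch.

EMERALD = [
    ("finding", "findingOf", "disease"),
    ("disease", "causeOf", "disease"),
    ("anatomy", "locationOf", "disease"),
    ("treatment", "treats", "disease"),
]

TEMPLATES = {
    "findingOf": "What disease causes {text}?",
    "locationOf": "In what disease is {text} most likely affected?",
}

def subquestions(node_text, node_type, target_type="disease"):
    questions = []
    for src, rel, tgt in EMERALD:
        if src == node_type and tgt == target_type and rel in TEMPLATES:
            questions.append(TEMPLATES[rel].format(text=node_text))
    return questions

print(subquestions("resting tremor that began 2 years ago", "finding"))
# ['What disease causes resting tremor that began 2 years ago?']
```

The same lookup runs in the backward direction: an anatomy-typed hypothesis such as "substantia nigra" selects the locationOf relation and yields the reverse subquestion shown in the text.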

We employ an aggregate of term matchers (Murdock et al. 2012a) to match pairs of assertions. Each term matcher posts a confidence value on the degree of match between two assertions based on its own resource for determining equivalence. For example, a WordNet-based term matcher considers terms in the same synset to be equivalent, and a Wikipedia-redirect-based term matcher considers terms with a redirect link between them in Wikipedia to be a match. The dotted line between Parkinson disease and Parkinson's disease in figure 5 is posted by the UMLS-based term matcher, which considers variants for the same concept to be equivalent.

Confidence and Belief

Once the assertion graph is constructed, and some questions and answers are posted, there remains the problem of confidence estimation. We develop multiple models of inference to address this step.

Belief Engine

One approach to the problem of inferring the correct hypothesis from the assertion graph is probabilistic inference over a graphical model (Pearl 1988). We refer to the component that does this as the belief engine. Although the primary goal of the belief engine is to infer confidences in hypotheses, it also has the secondary goal to infer belief in unknown nodes that are not hypotheses. These intermediate nodes may be important intermediate steps toward an answer; by assigning high confidences to them in the main loop, we know to assign them high priority for subquestion asking. Therefore, the belief engine needs to assign a confidence to each node, not just hypotheses.

To execute the belief engine, we first make a working copy of the assertion graph that we call the inference graph. A separate graph is used so that we can make changes without losing information that might

Summer 2017 67 Articles

Figure 5. WatsonPaths Graph Expansion Process. (The scenario sentences "A 63-year old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago and...", "At first it was only the left hand, but now it compromises the whole arm", and "At physical exam, the patient has an unexpressive face and difficulty in walking" are decomposed into factors such as resting tremor that began 2 years ago, compromises the whole arm, unexpressive face, and difficulty in walking. These connect to hypothesis nodes, including progressive supranuclear palsy, Parkinson disease, Huntington's disease, cerebellar diseases, Parkinson's disease, and diffuse Lewy body disease, and to anatomical nodes, including substantia nigra, caudate nuclei, cerebellum, lenticular nucleus, and pons.)

For instance, we might choose to merge nodes or reorient edges. Once the inference graph has been built, we run a probabilistic inference engine over the graph to generate new confidences. Each node represents an assertion, so it can be in one of two states: true or false ("on" or "off"). Thus a graph with k nodes can be in 2^k possible states. The inference graph specifies the likelihoods of each of these states. The belief engine uses these likelihoods to calculate the marginal probability, for each node, of it being in the true state. This marginal probability is treated as a confidence. Finally, we read confidences and other data from the inference graph back into the assertion graph.

There are some challenges in applying probabilistic inference to an assertion graph. Most tools in the inference literature were designed to solve a different problem, which we will call the classical inference problem. In this problem, we are given a training set and a test set that can be seen as samples from a common joint distribution. The task is to construct a model that captures the training set (for instance, by maximizing the likelihood of the training set), and then apply the model to predict unknown values in the test set. Arguably the greatest problem in the classical inference task is that the structure of the graphic model is underdetermined; a large space of possible structures needs to be explored. Once a structure is found, adjusting the strengths is relatively easier, because we know that samples from the training set are sampled from a consistent joint distribution.

In WatsonPaths, we face a different set of problems. The challenge is not to construct a model from training data, but to use a very noisy, already constructed model to do inference. Training data in the classical sense is absent or very sparse; all we have are correct answers to some scenario-level questions. An advantage is that a graph structure is given. A disadvantage is that the graph is noisy. Furthermore, it is not known that the confidences on the edges necessarily correspond to the optimal edge strengths. (In the next section, we address the problem of learning edge strengths.) Thus we have the problem of selecting a semantics — a way to convert the assertion graph into a graph over which we can do optimal probabilistic inference to meet our goals.

After much experimentation, we adopted the indicative semantics as the primary semantics used by the belief engine: If there is a directed relation from node A to node B with strength x, then A provides an independent reason to believe that B is true with probability x. Some edges are classified as contraindicative; for these edges, A provides an independent reason to believe that B is false with probability x. The independence means that multiple parents can easily be combined using a noisy-OR function. For instance, if the node resting tremor indicates Parkinson disease with strength 0.8, and the node difficulty in walking indicates Parkinson disease with power 0.4, then the probability of Parkinson disease will be (1 – (1 – 0.8)(1 – 0.4)) = 0.88. If so, then the edge with strength 0.9 to Parkinson's disease will fire with probability 0.88 · 0.9 = 0.792. In this way, probabilities can often multiply down simple chains. Inference must be more sophisticated to handle the graphs we see in practice, but the intuition is the same.

As an example of a problem that requires additional sophistication, the assertion graphs in WatsonPaths contain many directed and undirected cycles, which are not allowed in many inference algorithms. We have developed a novel inference algorithm to deal with the directed cycles, which shows an improvement over existing methods. Another example is an "exactly one" constraint that can be optionally added to multiple-choice questions. This constraint assigns a higher likelihood to assignments in which exactly one multiple-choice answer is true. Because of these kinds of constraints, we cannot simply calculate the probabilities in a feed-forward manner. To perform inference we use Metropolis-Hastings sampling over a factor graph representation of the inference graph. This has the advantage of being a very general approach — the inference engine can easily be adapted to a new semantics — and also allows an arbitrary level of precision given enough processing time.

Formally, the indicative semantics instantiates a version of a noisy-logical Bayesian network (Yuille and Lu 2007). The strength of each edge can be interpreted as an indicative power, a concept related to causal power (Cheng 1997), with the difference that we are semantically agnostic as to the true direction of the causal relation.

In experiments, the indicative semantics performs at least as well as other semantics. We outline a few simple alternatives here. One of the first semantics we tried was undirected pairwise Markov random fields. In this semantics, we take the maximum confidence connecting two nodes to be their compatibility: how likely they are to take the same truth value. This performed poorly in practice. We hypothesize that this is because important information is contained in the direction of the edges that Watson returns: asking about A and getting B as an answer is different from asking about B and getting A as an answer. An undirected model loses this information.

The indicative semantics is a default, basic semantics. The ability to reason over arbitrary relations makes the indicative semantics robust, but it is easy to construct examples in which the indicative semantics is not strictly correct. For instance, "fever is a finding of Lyme disease" may be correctly true with high confidence, but this does not mean that fever provides an independent reason to believe, with high probability, that Lyme disease is present. Fever is caused by many things, each of which could explain it. We are currently working on adding a causal semantics in which we use a noisy-logical Bayesian network, but belief flows from causes to effects, rather than from factors to hypotheses. Edges are oriented according to the types of the nodes: diseases cause findings but not vice versa. Currently this does not lead to detectable improvement in accuracy. We expect that we need to improve the precision of the rest of the system before it will show impact.

Learning over Assertion Graphs

The belief engine described in the previous section uses confidences directly as strengths. These confidences include the output of multiple relation generators (for instance, matching algorithms) but most importantly the confidences of Watson's subquestion answering. For instance, if we ask a question about resting tremor and get Parkinson's disease as an answer with 73 percent confidence, then the belief engine will treat this as a directed relation from resting tremor to Parkinson's disease with an indicative power of 0.73. When multiple relations exist in one direction, the belief engine takes the maximum of all the confidences in that direction. Watson combines subquestion features using a model that was trained on medical trivia questions (see the evaluation section for details). This raises the question of whether we can also use the training set of scenario-based questions to learn how to map features to edge strengths more optimally. This section describes our attempts to do so.
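The indicative semantics and its noisy-OR combination can be sketched directly from the running example above. This is an illustrative fragment, not the belief engine itself; the edge strengths are the ones used in the text.

```python
# Noisy-OR combination under the indicative semantics. A parent that is
# true with probability p and indicates the child with strength s fails
# to "fire" with probability 1 - p*s; the child is true unless every
# parent fails.

def noisy_or(parents):
    fail = 1.0
    for p, s in parents:  # (parent probability, edge strength)
        fail *= 1.0 - p * s
    return 1.0 - fail

# Observed scenario factors have probability 1.0.
p_parkinson = noisy_or([(1.0, 0.8),   # resting tremor -> Parkinson disease
                        (1.0, 0.4)])  # difficulty in walking -> Parkinson disease

# A single edge of strength 0.9 to the variant node: belief multiplies
# down the chain, 0.88 * 0.9 = 0.792.
p_variant = noisy_or([(p_parkinson, 0.9)])
```

This reproduces the 0.88 and 0.792 of the example. When several relations point the same way between the same pair of nodes, the belief engine described above first takes the maximum of their confidences before applying this combination.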


There are two main ways that we use the scenario-based training set. First, we create a set of closed-form inference methods, most of which are approximations of methods described in the previous section, but are more straightforward to optimize. We use these methods to create equations that express candidate confidences in terms of the model, and then optimize the model to maximize certain objectives such as accuracy. Second, we combine all the inference methods, including the belief engine, into an ensemble and use the scenario-based training set to optimize the weights of the ensemble.

Closed-Form Inference Methods

This section describes a series of closed-form inference methods. Note that none of these methods by themselves have been shown to give higher accuracy than the belief engine method described earlier. However, they are more amenable to optimization, and the ensemble of methods may perform better.

The noisy-OR model is most similar to the indicative semantics used in the belief engine, with some differences. While the belief engine allows a graph with directed cycles, the noisy-OR model requires a directed acyclic graph (DAG). As mentioned above, the assertion graph is not, in general, free of cycles. Additionally, the assertion graph contains matching relations, which are undirected. To form a DAG, the nodes in the assertion graph are first clustered by these matching relations, and then cycles are broken by applying heuristics to reorient edges to point from factors to hypotheses. Confidence is computed in a feed-forward manner. The confidence in factors extracted by scenario analysis is 1.0. For all other nodes the confidence is defined recursively in terms of the confidences of the parents and the confidence of the edges produced by the question-answering system. This generates an equation for each candidate, expressing its confidence in terms of the parameters.

The edge type variant of the noisy-OR model considers the type of the edge when propagating confidence from parents to children. The strength of the edge according to the question-answering model is multiplied by a per-edge-type learned weight, then a sigmoid function is applied. In this model, different types of subquestions may have different influence on confidences, even when the question-answering model produces similar features for them.

The matching model estimates the confidence in a hypothesis according to how well each factor in the scenario, plus the answers to forward questions asked about it, match against either the hypothesis or the answers to the backward questions asked from it. We estimate this degree of match using the term matchers described earlier in the Matching Graph Nodes section.

The feature addition model uses the same DAG as the noisy-OR model, but confidence in the intermediate nodes is computed by adding the feature values for the questions that lead to it and then applying the logistic model to the resulting vector. An effect is that the confidence for a node does not increase monotonically with the number of parents. Instead, if features that are negatively associated with correctness are present in one edge, it can lower the confidence of the node below the confidence given by another edge.

The causal model attempts to capture causal semantics by expressing the confidence for each candidate as the product, over every clinical factor, of the probability that either the diagnosis could explain the factor (as estimated from Watson question-answering features), or the factor "leaked" — it is an unexplained observation or is not actually relevant.

In the closed-form inference systems described, there is no constraint that the answer confidences sum to one. We implement a final stage where features based on the raw confidence from the inference model are transformed into a proper probability distribution over the candidate answers.

Direct Learning

The methods described in the previous section permit expressing the confidence in the correct answer as a closed-form expression. Summing the log of the confidence in the correct hypothesis across the training set T, we construct a learning problem with log-likelihood in the correct final answer as our objective function. The result is a function that is nonconvex, and in some cases (due to max) not differentiable in the parameters.

To limit overfitting and encourage a sparse, interpretable parameter weighting, we use L1 regularization: the absolute value of all learned weights is subtracted from the objective function.

To learn the parameters for the inference models we apply a "black-box" optimization method: greedy-stochastic local search. Learning explores the parameter space, tending to search in regions of high value while never becoming stuck in a local maximum.

We also experimented with the Nelder-Mead simplex method (Nelder and Mead 1965) and the multidirectional search method of Torczon (1989) but found weaker performance from these methods.

Ensemble Learning

We have multiple inference methods, each approaching the problem of combining the subquestion confidences from a different intuition and formalizing it in a different way. To combine all these different approaches we train an ensemble. This is a final, convex, confidence estimation over the multiple-choice answers using the predictions of the inference models as features.
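The feed-forward computation used by the closed-form noisy-OR model above can be sketched as follows. The graph and strengths are hypothetical toy values; in the real model each strength is a function of question-answering features and learned parameters.

```python
# Feed-forward noisy-OR confidence over a DAG: factors extracted by
# scenario analysis are fixed at 1.0, and every other node combines its
# parents' confidences and edge strengths with noisy-OR.
# Toy graph and strengths; not the learned WatsonPaths model.

def propagate(factors, parents):
    # parents: node -> list of (parent node, edge strength); must be acyclic.
    memo = {}

    def confidence(node):
        if node in factors:
            return 1.0
        if node not in memo:
            fail = 1.0
            for parent, strength in parents.get(node, []):
                fail *= 1.0 - confidence(parent) * strength
            memo[node] = 1.0 - fail
        return memo[node]

    return {node: confidence(node) for node in parents}

graph = {
    "Parkinson disease": [("resting tremor", 0.8),
                          ("difficulty in walking", 0.4)],
    "Huntington's disease": [("difficulty in walking", 0.3)],
}
conf = propagate({"resting tremor", "difficulty in walking"}, graph)
```

Because each candidate's confidence is a closed-form function of the edge strengths, the same recursion yields the equations that direct learning can then optimize.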


The ensemble learning uses the same training set that the individual closed-form inference models use. To avoid giving excess weight to inference models that have overfit the training set, we use a common technique from stacking ensembles (Breiman 1996). The training set is split into five folds, each leaving out 20 percent of the training data, as though for cross validation. Each closed-form inference model is trained on each fold. When the ensemble gathers an inference model's confidence as a feature for an instance, the inference model uses the learned parameters from the fold that excludes that instance. In this way, each inference model's performance is testlike, and the ensemble model does not overly trust overfit models.

The ensemble is a binary logistic regression per answer hypothesis using three features from each inference model. The features used are the probability of the hypothesis, the logit of the probability, and the rank of the answer among the multiple-choice answers. Using the logit of the probability ensures that selecting a single inference model is in the ensemble's hypothesis space, achieved by simply setting the weight for that model's logit feature to one and all other weights to zero.

Each closed-form inference model is also trained on the full training set. These versions are applied at test time to generate the features for the ensemble.

Evaluation

For the automatic evaluation of WatsonPaths, we used a set of medical test preparation questions from Exam Master and McGraw-Hill, which are analogous to the examples we have used throughout this article. These questions consist of a paragraph-sized natural language scenario description of a patient case, optionally accompanied by a semistructured tabular structure. The paragraph description typically ends with a punch line question and a set of multiple-choice answers (average 5.2 answer choices per question). We excluded from consideration questions that require image analysis or whose answers are not text segments.

The punch line questions may simply be seeking the most likely disease that caused the patient's symptoms (for example, "What is the most likely diagnosis in this patient?"), in which case the question is classified as a diagnosis question. The diagnosis question set reported in this evaluation was identified by independent annotators. Nondiagnosis punch line questions may seek appropriate treatments, the organism causing the disease, and so on (for example, "What is the most appropriate treatment?" and "Which organism is the most likely cause of his meningitis?" respectively). We observed that most questions that did not directly ask for a diagnosis nonetheless required a diagnosis as an intermediate step. For this reason, we decided that focusing initially on diagnosis questions was a good strategy.

We split our data set of 2190 questions into a training set of 1000 questions, a development set of 690 questions, and a blind test set of 500 questions. The development set was used to iteratively drive the development of the scenario analysis, relation generation, and belief engine components, and for parameter tuning. The training set was used to build models used by the learning component.

As noted earlier, our learning process requires subquestion training data to consolidate groups of question-answering features into smaller, more manageable sets of features. We do not have robust and comprehensive ground truth for a sufficiently large set of our automatically generated subquestions. Instead, we use a preexisting set of simple factoid medical questions as subquestion training data: the Doctor's Dilemma (DD) question set.¹ DD is an established benchmark used to assess performance in factoid medical question answering. We use 1039 DD questions (with a known answer key) as our subquestion training data. Although the Doctor's Dilemma questions do have some basic similarity to the subquestions we ask in assertion graphs, there are some important differences: (1) In an assertion graph subquestion, there is usually one known entity and one relation that is being asked about. For DD, the question may constrain the answer by multiple entities and relations. (2) An assertion graph subquestion like "What causes hypertension?" has many correct answers, whereas DD questions have a single best answer. (3) There may be a mismatch between how confidence for DD is trained and how subquestion confidence is used in an inference method. The DD confidence model is trained to maximize log-likelihood on a correct/incorrect binary classification task. In contrast, many probabilistic inference methods use confidence as something like strength of indication or relevance.

For all these reasons, DD data might seem poorly suited to training a complete model for judging edge strength for subquestion edges in WatsonPaths. In practice, however, we have found that DD data is useful as subquestion training data because it is easier to obtain than subquestion ground truth, and so far shows improved performance over the limited subquestion ground truth we have constructed.² In our hybrid learning approach, we use 1039 DD questions for consolidating question-answering features and then use the smaller, consolidated set of features as inputs to the inference models that are trained on the 1000 medical test preparation questions.

Metrics and Baseline

As a baseline, we attempted to answer questions from our test set using a simple information-retrieval strategy. It used as much as possible the same corpus and starting point used by WatsonPaths. In this baseline, we took each sentence in the question and generated an Indri query by removing stop words. We then ran this query over our medical corpus, returning a set of 100 passages. (We also tried different numbers of passages on our development set; 100 passages appeared to show the best results.) The score for each candidate was simply the number of times that the candidate text appeared in any of these passages. Confidence in each answer was generated by normalizing the scores. For instance, if answer A appeared 4 times in the passages, and answer B appeared 1 time, the confidence in answer A would be 80 percent.

We also evaluated the performance of the Watson question-answering system adapted for the medical domain (Ferrucci et al. 2013). We ran this factoid-based pipeline on our scenario-based questions in order to evaluate the value added by our scenario-based approach. Watson takes the entire scenario as input and evaluates each multiple-choice answer based on its likelihood of being the correct answer to the punch line question. This one-shot approach to answering medical scenario questions contrasts with the WatsonPaths approach of decomposing the scenario, asking questions of atomic factors, and performing probabilistic inference over the resulting graphic model. Note that Watson is the same system that WatsonPaths uses as a subcomponent. It has been developed and improved along with WatsonPaths.

We tuned various parameters in the WatsonPaths system on the development set to balance speed and performance. The system performs one iteration each of forward and backward relation generation. The minimum confidence threshold for expanding a node is 0.25, and the maximum number of nodes expanded per iteration is 40. In the relation generation component, the Watson medical question-answering system returns all answers with a confidence above 0.01.

We evaluate system performance both on the full test set as well as on the diagnosis subset only. The reason for evaluating the diagnosis subset separately is that most questions that do not directly seek a diagnosis in the punch line depend on a correct diagnosis along the way. Thus progress on the diagnosis subset may be a step toward better performance on multistep questions. We use the full 1000 questions in the training set to learn the models for both the baseline system and the WatsonPaths system. As noted earlier, Doctor's Dilemma training data is used to consolidate question-answering features in the WatsonPaths system. In the Watson system that was not part of WatsonPaths, we did not use Doctor's Dilemma training data for any purpose.

Results

Table 1 shows the results of our evaluation on a set of 500 blind questions, of which a subset of 156 questions were identified as diagnosis questions by annotators.

                               Full      Diagnosis
  Accuracy
    Baseline                   30.6%     41.0%
    Watson                     42.0%     53.8%
    WatsonPaths                48.0%     64.1%
  Confidence Weighted Score
    Baseline                   42.9%     52.1%
    Watson                     59.8%     75.3%
    WatsonPaths                67.5%     81.8%

Table 1. WatsonPaths Performance Results.

We report results on our blind evaluation data, using two metrics. Accuracy simply measures the percentage of questions for which a system ranks the correct answer in top position. Confidence weighted score is a metric that takes into account both the accuracy of the system and its confidence in producing the top answer (Voorhees 2003). We sort all pairs in an evaluation set in decreasing order of the system's confidence in the top answer and compute the confidence weighted score as

    CWS = (1/n) Σ_{i=1}^{n} (number correct in first i ranks) / i

where n is the number of questions in the evaluation set. This metric rewards systems for more accurately assigning high confidences to correct answers, an important consideration for real-world question-answering and medical diagnosis systems.

Because most questions have five or six multiple-choice answers, chance performance on our test set was approximately 19.8 percent.

Results show that in terms of accuracy, WatsonPaths outperforms both the baseline system and Watson on both the full set and the diagnosis subset. We used a significance level of p < 0.05. In terms of confidence weighted score, WatsonPaths significantly outperforms the baseline system on both sets, and significantly outperforms Watson on the full set. For the diagnosis subset, the difference between Watson and WatsonPaths on confidence weighted score was not statistically significant, despite a 6+ percent score increase. This is likely due to the small size of the diagnosis subset, which contains only 156 questions.

Overall, these results suggest that WatsonPaths adds significant value to scenario-based question answering, over and above a simple information-retrieval baseline, and also over and above a factoid-type question-answering approach. This is true even when comparing WatsonPaths to the same Watson system that was developed as a subcomponent for WatsonPaths. These results suggest that the graphic model, subquestion strategy, and probabilistic inference engines, such as the ones used in WatsonPaths, can add significant value to scenario-based question answering.
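The confidence weighted score defined above can be computed in a few lines. A minimal sketch; the three-question examples are invented for illustration.

```python
# Confidence weighted score (Voorhees 2003): sort questions by the
# system's confidence in its top answer, then average, over ranks i,
# the fraction of correct top answers within the first i ranks.

def confidence_weighted_score(results):
    # results: (confidence in top answer, top answer is correct) per question.
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    n = len(ranked)
    correct_so_far = 0
    score = 0.0
    for i, (_, correct) in enumerate(ranked, start=1):
        correct_so_far += 1 if correct else 0
        score += correct_so_far / i
    return score / n

# Correct answers ranked confidently first score well ...
high = confidence_weighted_score([(0.9, True), (0.7, True), (0.2, False)])
# ... the same two correct answers behind a confident error score worse.
low = confidence_weighted_score([(0.9, False), (0.7, True), (0.2, True)])
```

Accuracy counts only how many top-ranked answers are correct; CWS additionally rewards assigning the highest confidences to the correct ones, which is why a system can gain CWS without gaining accuracy.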

Discussion

WatsonPaths has some key features that drive the performance improvement over Watson. The first and most important is that WatsonPaths has the ability to engage in inference. Watson does well on short diagnostic questions where one phrase is strongly associated with the correct diagnosis. WatsonPaths does better than Watson when there is another factor that rules out a diagnosis that would otherwise be likely, or a second symptom that is not explained by that diagnosis. Holistically performing inference over the information contained in the question is often necessary to answer such questions well.

Development of the inference capabilities of WatsonPaths has been driven by empirical results: We continued to add sophistication to inference as long as we detected a statistically significant improvement in accuracy on the development set. The framework supports many more kinds of inference. Some types of inference (for instance, causal inference) already have implementations, but have not yet shown a statistically significant impact. For some of these types of inference, we suspect that improvements in the underlying Watson question answering will be necessary before this impact will emerge. For some other kinds of inference, such as reasoning about events in time, we are not convinced that such inference will be necessary to do well on the United States Medical Licensing Examination (USMLE) test set. Finally, some kinds of inference are not well supported by WatsonPaths. For instance, as we mentioned, statements in WatsonPaths must be either true or false. Thus explicit reasoning about entities and events, and the relations between them, would require a major extension to WatsonPaths. Overall, because the primary impact of WatsonPaths appears to be its more advanced inference, and further advancing inference depends on the quality of Watson's results, we believe that the biggest gains in the performance of WatsonPaths will come from improvements in Watson.

Another factor contributing to the impact of WatsonPaths is that WatsonPaths seems less likely than Watson to get overwhelmed by irrelevant details in long questions. While Watson tries to weight the relevance of various phrases in the question, its baseline assumption is that all the text is potentially important. Thus irrelevant text can water down the score of a candidate hypothesis that would otherwise get a high score. In WatsonPaths, we ask many subquestions using different ways of breaking down the scenario. For instance, we ask questions about sentences, factors, and combinations of factors. This increases the chances that some set of words will produce a strong inference chain that connects to a hypothesis. In contrast, irrelevant text will be unlikely to produce inference chains to the hypotheses. This property is important as many real-world applications are not as concise as trivia questions. For instance, medical records often contain large amounts of detail, much of which is irrelevant to a particular question.

Related Work

Clinical decision support systems (CDSSs) have had a long history of development, starting from the early days of artificial intelligence. These systems use a variety of knowledge representations, reasoning processes, system architectures, scopes of medical domain, and types of decision (Musen, Middleton, and Greenes 2014). Although several studies have reported on the success of CDSS implementations in improving clinical outcomes (Kawamoto et al. 2005; Roshanov et al. 2013), widespread adoption and routine use is still lacking (Osheroff et al. 2007).

The pioneering Leeds abdominal pain system (De Dombal et al. 1972) used structured knowledge in the form of conditional probabilities for diseases and their symptoms. Its success at using Bayesian reasoning was comparable to that of experienced clinicians at the Leeds hospital where it was developed. But it did not adapt successfully to other hospitals or regions, indicating the brittleness of some systems when they are separated from their original developers. A recent systematic review of 162 CDSS implementations shows that success at clinical trials is significantly associated with systems that were evaluated by their own developers (Roshanov et al. 2013).

MYCIN (Shortliffe 1976) was another early system that used structured representation in the form of production rules. Its scope was limited to the treatment of infectious diseases and, as with other systems with structured knowledge bases, it required expert humans to develop and maintain these production rules. This manual process can prove to be infeasible in many medical specialties where active research produces new diagnosis and treatment guidelines and phases out older ones. Many CDSS implementations mitigate this limitation by focusing their manual decision logic development effort on clinical guidelines for specific diseases or treatments, for example, hypertension management (Goldstein et al. 2001). But such systems lack the ability to handle patient comorbidities and concurrent treatment plans (Sittig et al. 2008).

Another notable system that used structured knowledge was Internist-1. The knowledge base contained disease-to-finding mappings represented as conditional probabilities (of disease given finding, and of finding given disease) mapped to a 1–5 scale. Despite initial success as a diagnostic tool, its design as an expert consultant was not considered to meet the information needs of most physicians.

Summer 2017 73 Articles underlying knowledge base helped its in the form of alerts. But this integra - beyond the medical domain. For med - evolution into an electronic reference tion is limited to the structured data ical applications, it might have been that can provide physicians with cus - contained in a patient’s electronic easier to design Watson with certain tomized information (Miller et al. medical record. When a CDSS requires medical aspects hard coded into the 1986). A similar system, DXplain (Bar - information like findings, assessments, flow of execution. Instead we designed nett et al. 1987), continues to be com - or plans in clinical notes written by a the overall flow as well as each compo - mercially successful and extensively health-care provider, existing systems nent to be general across domains. used. Rather than focus on a definitive are unable to extract them. As a result, Note that the Emerald could be diagnosis, it provides the physician search-based CDSSs remain a separate replaced by a structure from a different with a list of differential diagnoses consultative tool. The scenario analy - domain, and the basic semantics we along with descriptive information sis capability of WatsonPaths provides have explored: matching, indicative and bibliographic references. the means to analyze these unstruc - and causal, have no requirement that Other systems in commercial use tured clinical notes and serves as a the graph structure come from medi - have adopted the unstructured med - means for integration into the health cine. Even the causal aspect of the ical text reference approach directly, information system. belief engine could apply to any using search technology to provide domain that involves diagnostic infer - decision support. Isabel provides diag - Conclusions ence (for example, automotive repair). 
Conclusions and Further Work

WatsonPaths is a system for scenario-based question answering that has a graphic model at its core. We have developed WatsonPaths on a set of multiple-choice questions from the medical domain. On this test set, WatsonPaths shows a significant improvement over our baselines, even outperforming its own subquestion-answering system, Watson. Although the test preparation question set has been important for the early development of the system, we have designed WatsonPaths to function well beyond it. In future work, we plan to extend WatsonPaths in several ways.

The present set of questions are all multiple-choice questions. This means that hypotheses have already been identified, and it is also known that exactly one of the hypotheses is the correct answer. Although these constraints have made the early development of scenario-based question answering more straightforward, the overall WatsonPaths architecture does not rely on them. For instance, we can easily remove the confidence reestimation phase for the closed-form inference systems and the "exactly one" constraint from the belief engine. Also, it will be straightforward to add a simple hypothesis identification step to the main loop. One way to do this is to find nodes whose type corresponds to the type being asked about in the punch line question. We already find such correspondences in the base Watson system (Chu-Carroll et al. 2012).
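The hypothesis identification step just described, finding nodes whose type matches the type asked about in the punch line question, might look roughly like the following. The class, the type labels, and the example graph are all hypothetical; the real system's type inventory comes from its medical resources rather than these hand-picked strings.

```python
# Hypothetical sketch of a simple hypothesis identification step: scan the
# assertion graph for nodes whose semantic type matches the answer type of the
# punch line question. Class names, type labels, and example nodes are invented
# for illustration; they are not the actual WatsonPaths data model.
from dataclasses import dataclass

@dataclass
class Node:
    text: str
    types: frozenset  # semantic types attached to the node

def identify_hypotheses(nodes, asked_type):
    """Return the nodes that could answer a punch line question of asked_type."""
    return [n for n in nodes if asked_type in n.types]

graph = [
    Node("ptosis", frozenset({"Finding"})),
    Node("myasthenia gravis", frozenset({"Disease"})),
    Node("thymoma", frozenset({"Disease"})),
    Node("pyridostigmine", frozenset({"Drug"})),
]

# "What is the most likely diagnosis?" asks for a Disease.
hypotheses = identify_hypotheses(graph, "Disease")
print([n.text for n in hypotheses])  # prints ['myasthenia gravis', 'thymoma']
```

Dropping the multiple-choice assumption would then amount to running a step like this inside the main loop instead of taking the answer options as given.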
We also plan to extend WatsonPaths beyond the medical domain. For medical applications, it might have been easier to design Watson with certain medical aspects hard coded into the flow of execution. Instead we designed the overall flow, as well as each component, to be general across domains. Note that the Emerald could be replaced by a structure from a different domain, and the basic semantics we have explored (matching, indicative, and causal) have no requirement that the graph structure come from medicine. Even the causal aspect of the belief engine could apply to any domain that involves diagnostic inference (for example, automotive repair). Most importantly, the way that subquestions are answered is completely general. By asking the right subquestions and using the right corpus, we can apply WatsonPaths to any scenario-based question-answering problem. We hope to develop a toolbox of expansion strategies, relation generators, and inference mechanisms that can be reused as we apply WatsonPaths to new domains.

The most important area for further work is a collaborative user application. Question answering is not always just about returning the correct answer; often we must also explain to the user why the answer is correct. This is particularly important in domains like medicine, where users are justified in challenging and validating answers because of the life-and-death nature of decisions. Question-answering systems that rely heavily on machine learning are often criticized for being too opaque to allow clear explanations. We have designed WatsonPaths with explanatory power in mind. For instance, we modeled the graphic model after the hand-drawn graphs that domain experts used to explain their answers. At the same time, we are able to use machine learning to improve our accuracy and other metrics, without losing the ability to explain our answers.

In addition to explaining our answers, a collaborative application gives us the opportunity to have humans assist WatsonPaths in tasks that are still difficult for machines. For instance, factor identification and supporting passage evaluation may benefit
from human input. In a fully automatic system, the user receives an answer using little or no time or cognitive effort. In a collaborative system, the user spends some time and effort, and potentially gets a better answer. We suspect that, in many applications of scenario-based question answering, this will be an attractive trade-off for the user, because of the complexity of the scenario and the importance of the answer. Our objective is to minimize the time and effort required of users and to maximize the benefit they receive. The combination of the user and WatsonPaths should be able to handle more difficult problems more quickly than either could alone.

Acknowledgements

We would like to acknowledge the following colleagues at IBM who helped to create WatsonPaths: Ken Barker, Eric Brown, James Fan, Bhavani Iyer, Benjamin Segal, Parthasarathy Suryanarayanan, and Chris Welty. We would also like to acknowledge the Cleveland Clinic Lerner College of Medicine of Case Western Reserve University, with whom we are working to show positive impact of WatsonPaths on educational outcomes. This work was conducted at IBM Research and IBM Watson Group, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598.

Notes

1. See the American College of Physicians 2014 Doctor's Dilemma competition, www.acponline.org/residents fellows/competitions/doctors dilemma.

2. Building a comprehensive answer key for such questions is very time consuming, and an incomplete answer key can be less effective. Although this approach has not yet succeeded, it may still succeed if we invest much more in building a bigger, better answer key for actual WatsonPaths subquestions.

References

Barnett, G. O.; Cimino, J. J.; Hupp, J. A.; and Hoffer, E. P. 1987. DXplain: An Evolving Diagnostic Decision-Support System. JAMA 258(1): 67–74. doi.org/10.1001/jama.1987.03400010071030

Bowen, J. L. 2006. Educational Strategies to Promote Clinical Diagnostic Reasoning. New England Journal of Medicine 355(21): 2217–2225. doi.org/10.1056/NEJMra054782

Breiman, L. 1996. Stacked Regressions. Machine Learning 24(1): 49–64. doi.org/10.1007/BF00117832

Chang, R. W.; Bordage, G.; and Connell, K. J. 1998. Cognition, Confidence, and Clinical Skills: The Importance of Early Problem Representation During Case Presentations. Academic Medicine 73(10): S109–111. doi.org/10.1097/00001888-199810000-00062

Cheng, P. W. 1997. From Covariation to Causation: A Causal Power Theory. Psychological Review 104(2): 367. doi.org/10.1037/0033-295X.104.2.367

Chu-Carroll, J.; Fan, J.; Boguraev, B. K.; Carmel, D.; Sheinwald, D.; and Welty, C. 2012. Finding Needles in the Haystack: Search and Candidate Generation. IBM Journal of Research and Development 56(3/4): 6:1–6:12.

De Dombal, F. T.; Leaper, D. J.; Staniland, J. R.; McCann, A. P.; and Horrocks, J. C. 1972. Computer-Aided Diagnosis of Acute Abdominal Pain. British Medical Journal 2(5804): 9. doi.org/10.1136/bmj.2.5804.9

Ferrucci, D. 2012. Introduction to "This Is Watson." IBM Journal of Research and Development 56(3/4): 1:1–1:15.

Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A. A.; Lally, A.; Murdock, J. W.; Nyberg, E.; Prager, J.; Schlaefer, N.; and Welty, C. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine 31(3): 59–79.

Ferrucci, D.; Levas, A.; Bagchi, S.; Gondek, D.; and Mueller, E. T. 2013. Watson: Beyond Jeopardy! Artificial Intelligence 199–200(June–July): 93–105. doi.org/10.1016/j.artint.2012.06.009

Goldstein, M. K.; Hoffman, B. B.; Coleman, R. W.; Tu, S. W.; Shankar, R. D.; O'Connor, M.; Martins, S.; Advani, A.; and Musen, M. A. 2001. Patient Safety in Guideline-Based Decision Support for Hypertension Management: ATHENA DSS. In Proceedings of the 2001 AMIA Annual Symposium, 214–218. Bethesda, MD: American Medical Informatics Association.

Gondek, D. C.; Lally, A.; Kalyanpur, A.; Murdock, J. W.; Duboue, P. A.; Zhang, L.; Pan, Y.; Qiu, Z. M.; and Welty, C. 2012. A Framework for Merging and Ranking of Answers in DeepQA. IBM Journal of Research and Development 56(3/4): 14:1–14:12.

Kawamoto, K.; Houlihan, C. A.; Balas, E. A.; and Lobach, D. F. 2005. Improving Clinical Practice Using Clinical Decision Support Systems: A Systematic Review of Trials to Identify Features Critical to Success. British Medical Journal 330(7494): 765–72. doi.org/10.1136/bmj.38398.500764.8F

Lafferty, J.; McCallum, A.; and Pereira, F. C. N. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001), 282–289. San Francisco: Morgan Kaufmann Publishers.

Lally, A.; Prager, J. M.; McCord, M. C.; Boguraev, B. K.; Patwardhan, S.; Fan, J.; Fodor, P.; and Chu-Carroll, J. 2012. Question Analysis: How Watson Reads a Clue. IBM Journal of Research and Development 56(3/4): 2:1–2:14.

McCord, M. C.; Murdock, J. W.; and Boguraev, B. K. 2012. Deep Parsing in Watson. IBM Journal of Research and Development 56(3/4): 3:1–3:15.

Miller, R. A.; McNeil, M. A.; Challinor, S. M.; Masarie, Jr., F. E.; and Myers, J. D. 1986. The Internist-1/Quick Medical Reference Project: Status Report. Western Journal of Medicine 145(6): 816.

Murdock, J. W.; Fan, J.; Lally, A.; Shima, H.; and Boguraev, B. K. 2012a. Textual Evidence Gathering and Analysis. IBM Journal of Research and Development 56(3/4): 8:1–8:14.

Murdock, J. W.; Kalyanpur, A.; Welty, C.; Fan, J.; Ferrucci, D.; Gondek, D. C.; Zhang, L.; and Kanayama, H. 2012b. Typing Candidate Answers Using Type Coercion. IBM Journal of Research and Development 56(3/4): 7:1–7:13.

Musen, M. A.; Middleton, B.; and Greenes, R. A. 2014. Clinical Decision-Support Systems. In Biomedical Informatics: Computer Applications in Health Care and Biomedicine, ed. E. H. Shortliffe and J. J. Cimino, 643–674. Berlin: Springer. doi.org/10.1007/978-1-4471-4474-8_22

National Library of Medicine (US). 2009. UMLS Reference Manual. Bethesda, MD: National Library of Medicine (US). www.ncbi.nlm.nih.gov/books/NBK9676/.

Nelder, J. A., and Mead, R. 1965. A Simplex Method for Function Minimization. Computer Journal 7(4): 308–313. doi.org/10.1093/comjnl/7.4.308

Osheroff, J. A.; Teich, J. M.; Middleton, B.; Steen, E. B.; Wright, A.; and Detmer, D. E. 2007. A Roadmap for National Action on Clinical Decision Support. Journal of the American Medical Informatics Association 14(2): 141–145. doi.org/10.1197/jamia.M2334

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco: Morgan Kaufmann.

Prager, J. M.; Chu-Carroll, J.; and Czuba, K. 2004. Question Answering Using Constraint Satisfaction: QA-by-Dossier-with-Constraints. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 575–582. Stroudsburg, PA: Association for Computational Linguistics, Inc. doi.org/10.3115/1218955.1219028

Roshanov, P. S.; Fernandes, N.; Wilczynski, J. M.; Hemens, B. J.; You, J. J.; Handler, S. M.; Nieuwlaat, R.; Souza, N. M.; Beyene, J.; Spall, H. G. C. V.; Garg, A. X.; and Haynes, R. B. 2013. Features of Effective Computerised Clinical Decision Support Systems: Meta-Regression of 162 Randomised Trials. British Medical Journal 346(f657). doi.org/10.1136/bmj.f657

Shortliffe, E. H. 1976. MYCIN: Computer-Based Medical Consultations. New York: Elsevier.

Sittig, D. F.; Wright, A.; Osheroff, J. A.; Middleton, B.; Teich, J. M.; Ash, J. S.; Campbell, E.; and Bates, D. W. 2008. Grand Challenges in Clinical Decision Support. Journal of Biomedical Informatics 41(2): 387–392. doi.org/10.1016/j.jbi.2007.09.003

Torczon, V. J. 1989. Multi-Directional Search: A Direct Search Algorithm for Parallel Machines. Ph.D. Dissertation, Department of Mathematical Sciences, Rice University, Houston, TX.

Voorhees, E. M. 2003. Overview of TREC 2002. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003). Gaithersburg, MD: National Institute of Standards and Technology.

Yuille, A. L., and Lu, H. 2007. The Noisy-Logical Distribution and Its Application to Causal Inference. In Advances in Neural Information Processing Systems 20, ed. J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis. Red Hook, NY: Curran Associates, Inc.

Adam Lally is the primary developer of the WatsonPaths system architecture as well as the original Watson question-answering system architecture. His background includes developing natural language-processing algorithms and frameworks for a variety of applications.

Sugato Bagchi is a research staff member at IBM Research and IBM Watson Group. He works on decision support systems based on unstructured information analytics. Currently, he is part of a team that is working with leading medical institutions on developing analytics that can help physicians make clinical decisions using content from medical textbooks, evidence-based guidelines, and patient consultation notes.

Michael A. Barborak has been a technical leader of the WatsonPaths project, which has as its mission to advance cognitive computing through the application of Watson question answering to complex decision support tasks. He has an entrepreneurial background in software development and technical services. His education is in electrical engineering, with degrees from the University of Texas at Austin (BSEE 1989 and MSEE 1992).

David W. Buchanan is interested in joint inference problems in large AI systems, especially using Bayesian approaches. He built the core inference system for WatsonPaths and other systems. He has published in causal inference, graphic models, and hierarchical Bayesian models. He earned his Ph.D. in cognitive science from Brown University in 2011.

Jennifer Chu-Carroll serves on the editorial boards of the Journal of Dialogue Systems and the ACM Transactions on Intelligent Systems and Technology. She previously served on the executive board of the North American Chapter of the Association for Computational Linguistics and as general chair of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Her research interests include question answering, semantic search, and natural language discourse and dialogue.

David A. Ferrucci is an award-winning AI researcher and IBM Fellow. He worked at IBM Research from 1995 to 2012, where he built and led a team of researchers to create the open-source UIMA framework and Watson, the landmark open-domain question-answering system. In 2012, after initiating the work on WatsonPaths and its application to medicine, he joined Bridgewater, where he is exploring ways to combine machine learning and semantic technologies to accelerate the discovery, application, and continuous refinement of interpretable knowledge systems.

Michael R. Glass is a software engineer at IBM Research. His research focuses on deep learning in natural language processing and probabilistic joint inference. Glass completed his doctorate in computer science at the University of Texas at Austin in 2012.

Aditya A. Kalyanpur's primary research interests include knowledge representation and reasoning, natural language processing, machine learning, and question answering. He has served on W3C working groups and as program chair of several international workshops, and is currently a coeditor of the International Journal of Web Information Systems. Kalyanpur earned his Ph.D. from the University of Maryland, College Park in 2006.

Erik T. Mueller's books include Commonsense Reasoning, Daydreaming in Humans and Machines, and Natural Language Processing with ThoughtTreasure. He received an S.B. in computer science and engineering from the Massachusetts Institute of Technology and an M.S. and Ph.D. in computer science from the University of California, Los Angeles.

J. William Murdock is a research staff member in IBM Watson Group. Before joining IBM, he worked at the United States Naval Research Laboratory. His research interests include question answering, natural language semantics, analogical reasoning, knowledge-based planning, machine learning, and computational reflection. In 2001, he earned his Ph.D. in computer science from the Georgia Institute of Technology.

Siddharth Patwardhan pursues research in natural language processing, with interests spanning a variety of topics, including question answering and computational representation of lexical semantics. He has investigated these topics as a researcher at IBM, the University of Utah, the Mayo Clinic, and the University of Minnesota. He was conferred a Ph.D. in computer science by the University of Utah in 2009, and is the coauthor of more than 30 peer-reviewed technical publications.

John M. Prager is a research staff member at IBM Research. His background includes natural language-based interfaces and semantic search, and his current interest is in incorporating user and domain models to inform question answering, in particular in the medical domain. He is a member of the TREC program committee.