An Introduction to Bayesian Artificial Intelligence in Medicine
Giovanni Briganti¹ and Marco Scutari²

¹ Unit of Epidemiology, Biostatistics and Clinical Research, Université Libre de Bruxelles
² IDSIA Dalle Molle Institute for Artificial Intelligence

August 7, 2020

Abstract

Artificial Intelligence has become a topic of interest in the medical sciences in the last two decades, but most studies of its applications in clinical settings have been criticised for having unreliable designs and poor replicability: there is an increasing need for literacy among medical professionals in this increasingly popular domain of research. This work provides a short introduction to the specific topic of Bayesian Artificial Intelligence: we introduce Bayesian reasoning, Bayesian networks and causal discovery, as well as their (potential) applications in clinical practice.

Keywords: statistics, causality, directed acyclic graphs

1 Introduction

The origins of Artificial Intelligence (AI) can be traced back to the work of Alan Turing in 1950 [1]. He proposed a test (known today as the Turing Test) in which a certain level of intelligence is required from a machine to fool a human into thinking they are carrying on a conversation with another human (the Imitation Game). Although such a level of intelligence has not yet been attained, a simple conversation with a virtual assistant like Siri serves as a clear example of how rapidly AI research has evolved in the past decades.

AI has become a topic of interest in the medical sciences in the last two decades, with more and more applications being approved by health authorities worldwide. However, most physicians and medical scientists have no formal training in the disciplines from which the world of smart medical monitoring, diagnosis and follow-up has sprung. This, along with historical epistemological differences between the practice of medicine and the engineering disciplines, is one of the main obstacles to widespread collaboration between physicians and engineers in the development of AI software. It may also be argued that this is, in turn, one of the root causes of the ongoing replicability crisis in medical AI research [2], characterised by unreliable study designs and poor replication [3].

Although there is an increasingly rich literature on how AI can be used in clinical practice, few works aim to interest physicians in the fundamental concepts and terminology of AI. The gradual shift towards quantitative methods in the last century has made them familiar with such concepts from probability and statistics as correlation, regression and confidence intervals; it is worthwhile to expand on these concepts and link them with modern AI.

The most common AI approaches in the medical literature are neural networks (in the early 2000s) and clustering (in the last decade) [4]. Bayesian reasoning and methods for AI are less well known, although they can be used for medical decision making, to study human physiology [5], to identify interactions between symptoms [6], and to investigate symptom recovery [7]. Bayesian models can facilitate reasoning in probabilistic terms when dealing with uncertainty, which is omnipresent in the medical sciences since we cannot construct a complete mechanistic model of diseases and of the physiological mechanisms they impact.
One important characteristic of Bayesian models, and in particular of Bayesian networks, is that the variables used as inputs for the model, and their relationships, are direct representations of real-world entities and of their interplay, not purely mathematical constructs as in neural networks [8]. This is why such an approach is interesting to the medical field: Bayesian networks are the method of choice for explainable reasoning in AI. This work aims to introduce physicians to Bayesian AI through a clinical perspective on probabilistic reasoning and Bayesian networks. More detailed accounts of the topic may be found in popular textbooks [9, 10].

2 Probabilistic reasoning

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that may be related to that event. It is expressed with the well-known mathematical notation
\[
  \Pr(A \mid B) = \frac{\Pr(B \mid A) \Pr(A)}{\Pr(B)},
\]
the probability $\Pr(A \mid B)$ of an event $A$ given prior knowledge of a condition $B$ (that is, given that $B$ has occurred). $\Pr(A \mid B)$ is a conditional probability, while $\Pr(A)$ and $\Pr(B)$ are known as marginal probabilities, that is, the probabilities of observing the events $A$ and $B$ individually.

In the context of medical diagnosis, the goal is to determine the probability $\Pr(D_i \mid C_p)$ that a particular disease or disorder $D_i$ is present given the clinical presentation $C_p$ of the patient [11], the prior probability $P^0_i$ of the disease in the patient's reference group, and the prior probabilities $P^0_j$ of other diseases $D_j$; that is,
\[
  \Pr(D_i \mid C_p) = \frac{P^0_i \Pr(C_p \mid D_i)}{P^0_i \Pr(C_p \mid D_i) + \sum_j P^0_j \Pr(C_p \mid D_j)},
\]
where $\Pr(C_p \mid D_j)$ is the probability of observing the same clinical presentation given the other diseases.

Bayes' theorem makes it possible to work with the distributions of dependent (or conditionally dependent) variables. However, in order to reduce the number of variables we need to observe simultaneously in probabilistic systems, it is also important to determine whether two variables $A$ and $B$ are independent ($A \perp\!\!\!\perp B$),
\[
  \Pr(A \mid B) = \Pr(A),
\]
or conditionally independent ($A \perp\!\!\!\perp B \mid C$) given the value of a third variable $C$,
\[
  \Pr(A \cap B \mid C) = \Pr(A \mid C) \Pr(B \mid C).
\]

Extracting conditional probability tables is one of the building blocks upon which we can build Bayesian probabilistic reasoning. It allows us to take a set of variables (say, $A$, $B$ and $C$ again) and to compute the conditional probabilities of some of them (say, $A \mid C$),
\[
  \Pr(A = y \mid C = x) = \frac{\sum_{B} \Pr(C = x, B, A = y)}{\sum_{B, A} \Pr(C = x, B, A)},
\]
where the sums run over all possible values of $B$ (and, in the denominator, of $A$ as well). Conversely, we can also take variables or sets of variables that are (conditionally) independent from each other and combine them to obtain their joint probability. This joint probability will be structured as a larger conditional probability table that is the product of the smaller tables associated with the original variables. For instance,
\[
  \Pr(A = y, B = z \mid C = x) = \Pr(A = y \mid B = z) \Pr(B = z \mid C = x),
\]
assuming $A \perp\!\!\!\perp C \mid B$. The ability to explicitly merge and split sets of variables, separating the variables of interest we need for diagnostic purposes from redundant ones, is one of the reasons why Bayesian reasoning is easy to use while remaining mathematically rigorous.
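As a worked example of the diagnostic formula above, the short Python sketch below computes the posterior probabilities $\Pr(D_i \mid C_p)$ for three candidate diseases. The disease names, prior probabilities $P^0_i$ and likelihoods $\Pr(C_p \mid D_i)$ are invented purely for illustration and carry no clinical meaning.

# Hypothetical prior probabilities P0_i of each candidate disease in the
# patient's reference group (illustrative values, not clinical estimates).
priors = {"influenza": 0.10, "pneumonia": 0.05, "common cold": 0.85}

# Hypothetical likelihoods Pr(Cp | Di): the probability of observing the
# patient's clinical presentation under each candidate disease.
likelihoods = {"influenza": 0.70, "pneumonia": 0.60, "common cold": 0.20}

# Denominator of Bayes' theorem: the total probability of the presentation,
# obtained by summing P0_i * Pr(Cp | Di) over all candidate diseases.
evidence = sum(priors[d] * likelihoods[d] for d in priors)

# Posterior probability Pr(Di | Cp) for each candidate disease.
posteriors = {d: priors[d] * likelihoods[d] / evidence for d in priors}

for disease, p in sorted(posteriors.items(), key=lambda item: -item[1]):
    print(f"Pr({disease} | presentation) = {p:.3f}")

With these made-up numbers, the common cold remains the most probable diagnosis (posterior of about 0.63) despite fitting the presentation least well, because its high prevalence outweighs the weak likelihood: this is precisely the trade-off between prior and likelihood that Bayes' theorem formalises.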
3 Bayesian thinking in machine learning

Machine learning (ML) is the sub-field of AI that studies the algorithms and statistical tools that allow computer systems to perform specific and well-defined tasks without explicit instructions [12]. Implementing machine learning requires four components.

First, we need a working model of the world that describes the tasks and their context in a way that is understandable by a computer. In practice, this means choosing a class of Bayesian models defined over the variables of interest, and over relevant exogenous variables, and implementing it in software. Generative models such as Bayesian networks describe how variables interact with each other, and therefore represent the joint probability
\[
  \Pr(X_1, \ldots, X_N);
\]
while discriminative models such as random forests and neural networks only focus on how a group of variables predicts a target variable, by estimating
\[
  \Pr(X_1 \mid X_2, \ldots, X_N).
\]
Clearly, depending on the application, a generative model may be a more appropriate choice than a discriminative model, or vice versa. If the goal is to characterise the phenomenon we are modelling from a systems perspective, generative models should be preferred. If we are only interested in predicting some clinical event, as in the case of diagnostic devices, discriminative models provide better predictive accuracy at the cost of being less expressive.

Second, we need a way to measure the performance of the model, usually its ability to predict new events.

Third, we must encode the knowledge of the world from training data, experts or both into the model: this step is called learning. The computer system learns the best model within the prescribed class by finding the one that maximises the chosen performance metric, drawing either from observational or experimental data or from the expert knowledge available from practitioners.

Fourth, the computer uses the model as a proxy of reality and performs inference as new inputs come in, deciding if and how to perform the assigned task.

Successfully implementing machine learning applications is, however, far from easy: it requires large amounts of data, and it is difficult to decide how to structure the model from a probabilistic and mathematical point of view. Principled software engineering also plays an important role [13]. For these reasons, machine learning models should comprise a limited number of variables so that they are easy to construct and interpret. As a side effect, compact models are unlikely to require copious amounts of computational power to learn, unlike state-of-the-art neural networks [13].

Clinical settings limit our knowledge of the patients to what we can learn from them: hence probability should be used to determine whether two variables are associated given other variables. Formally, if the occurrence of an event in one variable affects the probability of an event occurring in another variable, we say the two are associated with, or probabilistically dependent on, each other. Association is symmetric in probability; in itself, it does not distinguish between cause and effect. Bayesian network models, however, go beyond probability theory and represent causal effects as arcs in a graph, allowing for principled causal reasoning.

Figure 1: An undirected network of five nodes connected through edges.

4 Bayesian Networks

We introduce in this section the fundamental notions of Bayesian networks.
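Before doing so formally, the sketch below gives a concrete feel for what such a model looks like in software. It builds a toy discrete Bayesian network in Python using the open-source pgmpy library; the network structure (a disease causing two symptoms), the variable names and all probability values are invented for illustration, not taken from this paper or from any clinical source.

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# A toy generative model: Disease -> Fever and Disease -> Cough.
# The joint probability factorises along the arcs of the graph as
# Pr(Disease, Fever, Cough) = Pr(Disease) Pr(Fever | Disease) Pr(Cough | Disease).
model = BayesianNetwork([("Disease", "Fever"), ("Disease", "Cough")])

# Conditional probability tables (states: 0 = absent, 1 = present;
# every column sums to one; all values are illustrative).
cpd_disease = TabularCPD("Disease", 2, [[0.90], [0.10]])
cpd_fever = TabularCPD("Fever", 2,
                       [[0.80, 0.30],   # Pr(Fever = 0 | Disease = 0, 1)
                        [0.20, 0.70]],  # Pr(Fever = 1 | Disease = 0, 1)
                       evidence=["Disease"], evidence_card=[2])
cpd_cough = TabularCPD("Cough", 2,
                       [[0.70, 0.20],
                        [0.30, 0.80]],
                       evidence=["Disease"], evidence_card=[2])
model.add_cpds(cpd_disease, cpd_fever, cpd_cough)
assert model.check_model()  # verify the CPTs are consistent with the graph

# Diagnostic reasoning: Pr(Disease | Fever = present, Cough = present).
inference = VariableElimination(model)
print(inference.query(["Disease"], evidence={"Fever": 1, "Cough": 1}))

The final query performs exactly the operations described in Section 2: it multiplies the small conditional probability tables back into a joint distribution, conditions on the observed symptoms, and marginalises out everything that is not of interest.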