
Published online: 2019-04-25

© 2019 IMIA and Georg Thieme Verlag KG

Artificial Intelligence in Clinical Decision Support: Challenges for Evaluating AI and Practical Implications

A Position Paper from the IMIA Technology Assessment & Quality Development in Health Informatics Working Group and the EFMI Working Group for Assessment of Health Information Systems

Farah Magrabi1, Elske Ammenwerth2, Jytte Brender McNair3, Nicolet F. De Keizer4, Hannele Hyppönen5, Pirkko Nykänen6, Michael Rigby7, Philip J. Scott8, Tuulikki Vehko5, Zoie Shui-Yee Wong9, Andrew Georgiou1

1 Macquarie University, Australian Institute of Health Innovation, Sydney, Australia
2 UMIT, University for Health Sciences, Medical Informatics and Technology, Institute of Medical Informatics, Hall in Tyrol, Austria
3 Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
4 Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health research institute, The Netherlands
5 National Institute for Health and Welfare, Information Department, Helsinki, Finland
6 Tampere University, Faculty of Information Technology and Communication Sciences, Tampere, Finland
7 Keele University, School of Social Science and Public Policy, Keele, United Kingdom
8 University of Portsmouth, Centre for Healthcare Modelling and Informatics, Portsmouth, United Kingdom
9 St. Luke's International University, Tokyo, Japan

Summary

Objectives: This paper draws attention to: i) key considerations for evaluating artificial intelligence (AI) enabled clinical decision support; and ii) challenges and practical implications of AI design, development, selection, use, and ongoing surveillance.

Method: A narrative review of existing research and evaluation approaches, along with expert perspectives drawn from the International Medical Informatics Association (IMIA) Working Group on Technology Assessment and Quality Development in Health Informatics and the European Federation for Medical Informatics (EFMI) Working Group for Assessment of Health Information Systems.

Results: There is a rich history and tradition of evaluating AI in healthcare. While evaluators can learn from past efforts, and build on best practice evaluation frameworks and methodologies, questions remain about how to evaluate the safety and effectiveness of AI that dynamically harnesses vast amounts of genomic, biomarker, phenotype, electronic record, and care delivery data from across health systems. This paper first provides a historical perspective about the evaluation of AI in healthcare. It then examines key challenges of evaluating AI-enabled clinical decision support during design, development, selection, use, and ongoing surveillance. Practical aspects of evaluating AI in healthcare, including approaches to evaluation and indicators to monitor AI, are also discussed.

Conclusion: Commitment to rigorous initial and ongoing evaluation will be critical to ensuring the safe and effective integration of AI in complex sociotechnical settings. Specific enhancements that are required for the new generation of AI-enabled clinical decision support will emerge through practical application.

Keywords
Artificial intelligence; machine learning; clinical decision support; evaluation studies; program evaluation

Yearb Med Inform 2019:128-34
http://dx.doi.org/10.1055/s-0039-1677903

1 Introduction

Artificial intelligence (AI) promises to transform clinical decision-making processes as it has the potential to harness the vast amounts of genomic, biomarker, and phenotype data being generated across the health system, including from health records and delivery systems, to improve the safety and quality of care decisions [1]. Today AI has been incorporated successfully into decision support systems (DSSs) for diagnosis in data-intensive specialties like radiology, pathology, and ophthalmology [2]. Future systems are expected to be increasingly more autonomous, going beyond making recommendations about possible clinical actions to autonomously performing certain tasks such as triaging patients and screening referrals [3, 4].


In this paper, we focus on the present role of AI in supporting clinical decisions. Evaluation of these machine-learning-based systems has tended to focus on examining the performance of algorithms in laboratory settings. There are a few observational studies which have tested systems in clinical settings providing a safe environment while patients continue to receive standard care [5]. However, little is known about the effects of AI on care delivery and patient outcomes, which need to be carefully considered to ensure that it is appropriately applied. Like any technology, AI will also come with its unintended effects that may disrupt care delivery and pose risks to patients [6]. It is clear that the many benefits of AI cannot be realized safely unless AI is responsibly and effectively integrated into clinical decision-making and care delivery processes, and risks are effectively identified and mitigated.

While we have witnessed the beginnings of mainstream adoption of AI in only the last five years or so, academic researchers within the informatics community have studied AI for almost half a century [7-9]. However, an enhanced risk is now presented by a shift in the environment from draw-down of innovation based on positive evidence to political and commercial pressures for speedy adoption based on promise, with the call for evidence being presented as self-interested commitment to the status quo [10]. Clearly this disregard of evidence puts patients at risk.

The objective of this paper is to draw attention to key considerations for evaluating AI in clinical decision support, and to examine challenges and practical implications of AI design, development, selection, use, and ongoing surveillance. The paper begins with a historical perspective about the evaluation of AI in healthcare. It then examines key challenges of evaluating AI-enabled clinical decision support. The final section deals with the practical aspects of evaluating AI, including approaches to evaluation and indicators to monitor AI. It will show that evaluation and ongoing surveillance are central to safe and effective integration of AI in complex sociotechnical settings.

2 Evaluation of AI in Healthcare – a Historical Perspective

AI in healthcare is not new, and neither is AI evaluation; the dedicated journal Artificial Intelligence in Medicine dates back to the early 1990s. In earlier years, AI as a term was used for a collection of established technologies that were called decision support technologies, knowledge-based systems, or expert systems. The health informatics community has a rich history and tradition of studying and evaluating the application of AI to solve problems in care delivery. Early applications were focussed on performing diagnosis and making therapy recommendations in medical settings. The early versions of AI mostly used symbolic approaches based on rules and knowledge, whereas present day AI uses statistical methods along with symbolic approaches to disease representation. In the late 1950s, McCarthy and colleagues developed the language LISP, which was particularly well suited to symbolic approaches [11]. The Prolog language for logic programming, to manage reasoning and decision-making processes with logical structures, was developed in the 1960s [12]. In the 1970s, there was a big wave of the so-called first generation of AI in medicine [13], such as Shortliffe's work with MYCIN [14], Kulikowski's individualized clinical decision models [15], and de Dombal's computer-aided diagnosis of acute abdominal pain [16], which incorporated newer statistical reasoning methods such as probabilistic reasoning and neural networks.

Within the period covered above, there were extensive discussions about the role of AI in medicine, ranging from expert systems replacing doctors, to AI supporting doctors in their clinical decision-making processes, and AI embedded in medical devices. The latter has been successfully applied for several decades, e.g. in ECG machines and ICU equipment such as pumps, and insulin pens.

The need for rigorous evaluation of the quality and impact of AI was recognised in the 1980s. While the first publications focussed mainly on methodologies for evaluating performance in a laboratory environment, later papers addressed field-testing in clinical settings to examine effects on the structure, process, and outcome of care delivery [17-21]. The 1998 Helsinki workshop of our IMIA Working Group (then called the WG on Technology Assessment and Quality Development, and Organizational and Social Issues) focussed on the importance of addressing broader organisational issues and the need to focus on constructive evaluation within the whole lifecycle of information technology (IT) development [22-24]. Two books from 2006, which serve as textbooks for health IT evaluation methodologies, specifically address techniques for AI as well as potential biases that may compromise an evaluation [25, 26]. Biases introduced during the development of algorithms due to differences between training and real-world populations are particularly important for machine learning-based systems and are discussed in Section 3.

Later, following implementations of AI in clinical practice, studies shifted to the clinical impacts of AI [18] and to the related methodological challenges of finding adequate clinical endpoints [19]. Building upon evidence about the benefit of AI in medicine, research then focussed on reviewing the impact of AI on patient outcomes in inpatient settings [20], in psychiatry [21], and in medication safety [22].

The challenges recognised in the early days of applying AI were, among others: 1) legal issues (e.g., who is responsible when the system makes an error?) and ethical issues (algorithms dealing with discriminative investigations that are highly unethical, see for example [27]); 2) the context, including informal information that is not documented in the medical record but is nevertheless part of a doctor's mental image of the patient [28]; 3) the transferability of algorithms from one setting to another, both with respect to patient groups and clinical setting (i.e. local technologies/methodologies, local or regional professional culture [29]); 4) brittleness (the inability of AI to reason at the boundaries of, or outside, its own application range, resulting in "sense in – garbage out" [30]); and 5) the dynamic nature of professional knowledge development in health care that needs to be captured in dynamic AI.

Today, there is an increasing and renewed interest in using AI, where a new generation


of clinical decision support is facilitated by the availability of powerful computing tools to manage big data and to analyse and generate new knowledge. Big data offers a challenge to connect molecular and cellular biology to the clinical world and to consider individual variations, as well as large volume responses and outcomes to similar presenting problems, rather than using population averages [31] or the restricted population of clinical trials. While such advances in technology may provide solutions to the problem of generalisation, they cannot solve all the other issues mentioned above. AI now incorporates a diverse range of computational reasoning methods to support clinicians in making decisions. Evaluation of these kinds of applications is even more critical as they start to penetrate healthcare on a product purchase or turn-key basis, where they may have an impact on the individual treatment of patients and care delivery, as well as on the liability of the individual healthcare professional. There are also risks related to these systems: they may be ill-functioning or may even have negative impacts on the outcome of care in a specialised unit or in different populations or health systems [32]. We look at some of these challenges in the next section.

3 Challenges in Evaluating AI for Clinical Decision Support

As with all interventions in healthcare, we need to understand and evaluate the effects of AI on care delivery and patient outcomes. Evaluation is needed at each distinct stage in the IT lifecycle including design, development, selection, use, and ongoing surveillance.

Design and development of AI: Historically, evaluation of AI was limited to the design and development phase, as implementation and use of AI systems in routine clinical practice was rare. During design and development, evaluation concentrates upon the performance of the algorithms in terms of discrimination, accuracy, and precision. Depending on the use case, one performance measure might be more important than the other. For example, an algorithm used for triage needs high discrimination, while an algorithm that predicts mortality or complication risks in shared decision-making needs to be highly accurate and precise for all types of patients. Even at this stage of performance evaluation, we need to recognise that the algorithm may be "mathematically optimal but ethically problematic" [33]. A fundamental challenge is that AIs built using machine learning do not necessarily generalise well beyond the data upon which they are trained. Even in restricted tasks like image interpretation, AI can make erroneous diagnoses because of differences in the training and real-world populations, including new 'edge' cases, as well as differences in image capture workflows [4]. Another challenge is that algorithms need to reflect up-to-date knowledge given the dynamic changes of professional knowledge. Based on which knowledge do the algorithms work? What level of evidence is sufficient? How is regular adaptation to new professional knowledge organized?

Designers also need to consider if the computational finding is actionable in an ethical way or if it might disadvantage people from a particular socio-demographic background. For example, an algorithm to prioritise which patient should receive an organ donated for transplant might include expected longevity as a predictor variable. This might seem sensible on face value, but would ignore the implicit correlation with social determinants of health and lifespan [33]. Furthermore, the inferential logic should be put in context – is the reasoning behind the algorithm meaningful in the real world or only a statistical artefact? This question has been raised about the so-called "week-end effect" of apparent increases in hospital mortality [34]. Many studies have relied upon administrative data to investigate the association between time or day of admission and hospital mortality, which has resulted in inadequate adjustment for case mix differences between weekdays and weekends, and consequently has produced inconsistent findings and conclusions.

Evaluation challenges also arise from the nature of AI and the manner in which it is developed. Some computational reasoning methods in AI, such as neural networks, are considered black boxes to end-users. Auditing has been proposed as a pragmatic approach to evaluating opaque algorithms that were devised autonomously. This follows an analogy to human judgement; typically we measure outcomes, not problem-solving style or cognitive process [35]. However, given the fundamental healthcare ethic of "first do no harm", we suggest more effort is needed in the design phases to explain the principles of a computational model to allow transparent assessment. This would help to keep clinicians and patients engaged and avoid conflict between practitioners and commercial algorithm developers (e.g. [36]). Algorithm developers, including those who operate on a proprietary basis, need to consider how to open the black box (even if partially) and work within a framework for shareable biomedical knowledge so that clinicians can judge the merits of AI models [37].

Selection and use of AI: Widespread availability of clinical data, easy-to-use AI development environments (e.g. Tensorflow, DeepLearning4J, and Keras [38]), and online communities (e.g. healthcare.ai) have resulted in rapidly growing numbers of algorithms that have become available to clinicians. When multiple algorithms are available and one must be selected, it is important to evaluate any risks of data quality issues, and poor fit of the foundational data to a new situation, such as different population and morbidity patterns. What does the provenance of an algorithm tell us about its generalizability? For example, a computation that finds an association between blood pressure and complications in ICU patients where blood pressure is measured continuously might not be applicable to other hospitalised patients where blood pressure is measured perhaps only once a day; or between a coronary unit where pressure may be controlled in most patients and a general surgical ward with a different and non-controlled hypertensive population. Data generation is often a "social phenomenon" [39]: data may flow from unreliable processes [40] or human workarounds to mandatory entry fields [41].

We are also interested in the decision-making performance of humans with and without AI assistance. Once an algorithm is developed, clinical validation of its utility is needed. The algorithm may be correct, but is it operationally meaningful and useful? Does it fit clinical workflow? Does it still represent up-to-date clinical knowledge? Will it change clinical decisions? What level of confidence can be given? A significant concern here is that when humans are assisted by DSSs, they tend to over-rely on and delegate full responsibility to the DSS rather than continuing to be vigilant. This is known as automation bias and can have dangerous consequences when the DSS is wrong or fails, or the presenting problem is subtly unique. Previous work has demonstrated the effects of automation bias with a prescribing DSS that was based on simple rules of logic [42]. In this study, the presence of the DSS reduced the verification of prescription safety and increased prescribing errors when the DSS was incorrect [43]. With AI, these effects of automation bias are likely to be exacerbated because of the black box nature of many machine learning approaches, which are not conducive to verification. Methods are thus needed to mitigate automation bias, for example by improving the interpretability of machine learning models, explaining how an AI came to a conclusion, and training clinicians to maintain vigilance.

Consideration also needs to be given to the various stakeholders involved in processes to select AI. Evaluation findings need to be communicated in a manner that is understandable and meaningful to clinicians, but also to administrators and policymakers. One possible approach to support better selection of AI systems is labelling. Neither patient nor clinician can be fully comfortable in the use of AI 'black box' decision support unless they know the system is appropriately created and relevant for their situation. Based on the transferability and quality schemas, label structures need to be developed which identify the training population type, including demography and treatment setting, the nature of the decisions supported, the clinical environment, and any problems identified. It should be possible to devise a standard population characteristics data set that is verifiable, yet at the same time does not reveal technical design details and thus protects commercial confidence. Indeed, population information may become mandatory, as health systems are increasingly expecting AI developers to be transparent about the limitations and ethical examination of the population data used to develop algorithms, how performance was validated, and how clinicians can integrate them into care delivery to achieve effective treatment and value for money [44]. Reporting of any suspected adverse outcomes to trusted third parties would also be valuable.

Ongoing surveillance of AI: AI will have systemic effects in complex sociotechnical settings, including aspects such as usage, interpretability, up-to-date capabilities, safety, unintended effects, and ethical issues. Algorithms may affect resource allocation and prioritisation, so the risk of reinforcing bias in care delivery must be considered. Of course, the impact of AI on patient outcomes and experience and on clinical experience, as well as both organisational and social impacts, all need to be monitored. Over time, the context, treatment possibilities, and patient population might change. Therefore, once implemented, ongoing surveillance is needed to monitor and recalibrate AI algorithms [45]. This surveillance is also needed for dynamic algorithms that continuously update themselves based on practice data and published clinical evidence. Another aspect is detecting and evaluating "hidden" errors – subtle flaws in inferred data or a misconfiguration that may not be as obvious as the financial "flash crash" of 2010 [46] and may take months to be revealed (as with hand-crafted algorithms [47]). Thus, evaluation is likely to shift from a one-off activity to a continuous process to ensure that the use of AI, including systems incorporating dynamic algorithms, is meeting expectations.

4 Practical Aspects of Evaluating AI-enabled Clinical Decision Support

This section examines some of the practical aspects of evaluating AI-enabled clinical decision support. It begins by considering broad paradigms and frameworks to approach such evaluations. We then review existing guidelines for conducting and reporting evaluation studies, and conclude with exemplar indicators to monitor AI. The aim is to highlight existing approaches that can be readily applied and to identify areas where further work is required to meet the specific needs of current AI.

Approaching AI evaluation: As we discussed in the previous section, evaluating AI as a one-off activity will not be sufficient; continuous evaluation or surveillance might therefore become necessary to monitor the emergent behaviour of AI in complex sociotechnical settings, and the response of its users. Indeed, it may become an ethical imperative as the complexity of interventions and their effects on sociotechnical interactions become impossible to predict in advance. One paradigm that might be particularly relevant for approaching the evaluation of AI on an ongoing basis is the Learning Healthcare System [48]. This paradigm usefully incorporates the notion of continuous system improvement using locally generated evidence to inform practice changes. Such learning can occur at different levels, including institutional, national, or international. While aspects like algorithm performance may be addressed at institutional level, safety governance might be considered at a national level, and the ethical implications of using AI for clinical decision support could be tackled at an international level.

Evaluation guidelines and models: Existing resources such as the guidelines for Good Evaluation Practice in Health Informatics (GEP-HI) provide a solid foundation for planning and executing evaluation projects [49]. The suitability of GEP-HI for AI, and its completeness, will become apparent through application and use. Similarly, the Statement on Reporting of Evaluation Studies in Health Informatics (STARE-HI) provides useful principles for publication of evaluation studies [50, 51]. These guidelines support design, execution, and reporting of an evaluation study irrespective of the type of the system under study. The evaluation question and the purpose of the study guide the selection of evaluation methods and their application.
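The shift from one-off evaluation to continuous surveillance can be made concrete by recomputing performance measures, such as the discrimination and calibration discussed in Section 3, over each batch of routinely collected predictions and outcomes. The following Python sketch is purely illustrative: the function names, the AUROC floor of 0.75, and the observed/expected band are our own assumptions, not thresholds drawn from the guidelines cited here.

```python
def auroc(labels, scores):
    """Discrimination: probability that a randomly chosen positive case
    is ranked above a randomly chosen negative case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("batch needs both positive and negative cases")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def observed_expected_ratio(labels, scores):
    """Calibration-in-the-large: observed event rate divided by mean
    predicted risk; 1.0 means the model is calibrated on average."""
    return (sum(labels) / len(labels)) / (sum(scores) / len(scores))


def monitor_batch(labels, scores, auroc_floor=0.75, oe_band=(0.8, 1.25)):
    """Check one surveillance batch against (hypothetical) thresholds
    and return any triggered alerts."""
    alerts = []
    a = auroc(labels, scores)
    oe = observed_expected_ratio(labels, scores)
    if a < auroc_floor:
        alerts.append(f"discrimination below floor: AUROC={a:.2f}")
    if not oe_band[0] <= oe <= oe_band[1]:
        alerts.append(f"miscalibration: observed/expected={oe:.2f}")
    return alerts
```

In a real deployment such checks would run over linked prediction-outcome data at an agreed cadence, with alerts feeding local safety governance rather than acting automatically.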

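One concrete screen for the comparability concerns raised in Section 3 (whether the population an algorithm was trained on matches the population where it will be deployed) is to compare feature distributions between the two settings. A population-stability-index-style check is sketched below; the bin count and the conventional 0.2 "material shift" threshold are generic rules of thumb, not recommendations from this paper.

```python
from math import log


def population_stability_index(training, current, bins=10):
    """Compare the distribution of one feature (e.g. age) between the
    algorithm's training population and current local input data.
    By a common rule of thumb, values above roughly 0.2 indicate a
    shift worth investigating before relying on the algorithm."""
    lo = min(min(training), min(current))
    hi = max(max(training), max(current))
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Floor empty bins so the log-ratio stays defined.
        return [max(c / len(values), 1e-4) for c in counts]

    t, c = bin_fractions(training), bin_fractions(current)
    return sum((ci - ti) * log(ci / ti) for ti, ci in zip(t, c))
```

An analogous comparison over outcome rates, case mix, or per-field missingness would flag many of the data quality and transferability risks discussed earlier before an algorithm is adopted.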

Very important evaluation criteria for AI applications include validity of the system, i.e. correctness in reasoning, usefulness for clinical practice, and effects and impacts on clinical work, care processes, and patient outcomes. With AI applications that are based on big data sets, especially genomic, biomarker, and phenotype data from across the health system, it is important to ensure sufficient coverage, specificity, and validity of the data. This will be an additional step in the evaluation process with AI systems, i.e. to assess and collect evidence on data quality, and is the key point to prevent unintended consequences and to avoid risks of harmful results.

Understanding of the comparability of the design and development situation with the potential implementation situation is crucial. Aspects such as the algorithm training data set (clinical situation, physiology, and demography) are critical, as are treatment options and user population (including professional and e-health skills). A lot can be learned from implementation science; for instance, Schloemer and Schröder-Bäck have created the useful generic PIET model, in which Population, Intervention, (clinical) Environment, and Transfer methods are decomposed [52]. Adaptation of the sub-categories of each to fit the AI situation would yield a valuable assessment tool.

Box 1  Defining indicators for monitoring AI.

1 Define context for measurement
  1.1 Define AI application
      • Identify AI components and functions within the operational system
      • Define outputs (which may be inputs to another component) and risks
  1.2 Identify service context and boundaries (on micro, meso, and macro levels).
  1.3 Identify key stakeholders.
2 Define monitoring goals (e.g. safety, usability, transparency, up-to-dateness, patient safety, work process changes).
3 Define methods for indicator selection and categorisation, e.g. by reviewing expert knowledge, peer-reviewed literature, or existing indicator work.
4 Define data for repeated collection, analysis, and feedback from different user groups
      • Produce objective reports.

Indicators to monitor AI: Implementation of AI in an organisation can have an impact on several aspects of care quality, depending on the clinical area where it is implemented. Existing methodologies for defining indicators for continuous monitoring of health IT help not only to pinpoint AI-specific care quality indicators, but also to define AI-specific indicators for AI-quality monitoring and surveillance. One example is the methodology for the Nordic eHealth Indicators, which has four major phases: definition of the context of measurement, monitoring goals (e.g. care quality, efficiency, technology quality, etc.), methods for indicator selection, and the collection, analysis, and feedback of data from relevant user groups (Box 1) [53].

Governing AI in healthcare may involve trade-offs between risks and rewards. Questions on ethics, safety, and human values at stake depend on the type and purpose of use of the technologies, emphasizing the importance of the first two phases of the methodology. Peer-reviewed literature on AI for specific situations remains one solid source for practical indicator selection and grouping.

Availability (state of the art) of custom AI applications for specific purposes, in different contexts of use, for different user groups provides the first level of monitoring, important for policy makers for local and national steering. If these data are collected systematically from a whole country, they can also be used as benchmarking data by policy makers to steer development and by healthcare providers to find best practices and learn about different AI solutions.

The information system (IS) success model also offers an option for grouping indicators to examine AI quality, focussing on system quality, information quality, service quality, use, user satisfaction, and outcomes [54-56]. Each of these is examined below.

System quality: Once implemented, an AI system requires ongoing surveillance based on a set of measures to recalibrate AI algorithms. Apart from surveillance targeted directly at AI development, IT system quality monitoring includes indicators like (IT) accessibility, reliability, flexibility, integration, response time, and ease of learning [55]. Measures and indicators for these are targeted at monitoring and development of the entire socio-technical system where AI is implemented. Function quality is an important part of system quality, particularly when AI components feed other interim processes. AI applications with direct impact on patient health can be approached via the EU directive for medical devices, including the requirements for CE marking, with tight clinical evidence requirements for Class II and III devices according to Regulation 2017/745 [57]. System quality is directly associated with patient safety, and existence of CE marking could be one indicator for AI quality and a proxy for patient safety.

Information quality may be the most important surveillance and monitoring domain for AI algorithms, alongside system quality. It covers quality of the data used as input for the AI as well as quality of the output information (e.g. accuracy, reliability, relevance, completeness) [55]. In addition to these, the sensitivity and specificity of the results, the type and severity of errors, observed versus expected error rates, causes of errors, how algorithms respond and what indicators to use if input data characteristics start to vary, and the availability of data, code, and workflows need to be focussed on, since the algorithms are


trained with certain data, and their soundness needs to be confirmed in the context of use in relation to the quality of the actual clinical data to be used [58].

Service quality in the IS success model refers to help desk-type support available for users, as well as long-term feedback from users both for immediate updates and for development of the entire system.

System use refers to utilization (e.g. use/non-use, frequency of use) of the system output [55]. AI in healthcare can be used for decision support as well as for automating actions of care (e.g., medication administration). Existing OECD and EU eHealth monitoring surveys focus on the availability and use of HIS [59-61]. Functionalities for diagnostic or treatment support (for professionals or patients) are not covered in the current surveys, which need revision to encompass AI-enabled functionalities.

User satisfaction/acceptance refers to user perceptions about system output [55]. Validated satisfaction scales miss indicators for trust and for the competence of the personnel to deal with the decisions made by AI (e.g. to identify false results, and to record overrides and their contemporaneous justification). The competence of professionals to use commonly agreed documentation standards, structures, and classifications may also be essential, since it impacts the quality of documentation (the input for AI). From the viewpoint of patients, barriers to use, including trust and difficult instructions, are likely to become increasingly important with growing numbers of AI implementations and wider settings. User competence to evaluate decisions offered by AI will gain crucial importance in future.

Outcomes refer to the individual and organizational impacts [55]. AI promises to enhance healthcare effectiveness and efficiency, but requires due diligence to validate actions, optimise their use, and minimize risks. Outcome indicators need to reflect the purpose of the algorithm, and can include appropriateness of AI-assisted decisions, efficiency of the care process, as well as patient acceptance and adherence. Patient outcomes are increasingly important to monitor, along with patient safety incidents. Monitoring risks/adverse effects/trade-offs is an essential part of monitoring AI outcomes, including equity and inclusion, data privacy and security, algorithmic bias, and representativeness of data [62].

A broader outlook is also essential, rather than solely focussing on the quality of algorithms or the quality of their outcomes. The presence of governance structures (e.g. transparency of processes and policies to ensure reproducibility for large scale computational models, regulation related to safe and secure use of data, liability and accountability questions, data ownership), education and training, as well as national development and validation procedures, can be used as indicators of system maturity at a national level.

5 Conclusions

Technological developments are outpacing our ability to predict the effects of AI on the practice of medicine, the care received by patients, and the impact on their lives. In the immediate future, we can expect AI to support clinical decisions with humans in the decision loop. The dynamics of decision-making processes are likely to be altered, with clinicians needing to weigh AI-generated advice against other evidence and patient preferences [1]. To ensure the safe and effective integration of AI in care delivery, a robust commitment to evaluation is critical. Evaluators can draw important lessons from past efforts and should build upon current best practice frameworks, evaluation guidelines, and methods to guide evidence-based design, development, selection, use, and ongoing surveillance of AI in clinical decision support. Labels will be important for defining source-training data so as to identify optimal transferable use patterns. Specific enhancements required for evaluating dynamic algorithms incorporating vast amounts of genomic, biomarker, and phenotype data will emerge through practical application.

References
1. Coiera E. The fate of medicine in the time of AI. Lancet 2018 Dec 1;392(10162):2331-2.
2. Yu KH, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf 2019;28:238-41.
3. Yu K-H, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng 2018;2(10):719-31.
4. Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf 2019;28:231-7.
5. Kanagasingam Y, Xiao D, Vignarajan J, Preetham A, Tay-Kearney M-L, Mehrotra A. Evaluation of Artificial Intelligence–Based Grading of Diabetic Retinopathy in Primary Care. JAMA Netw Open 2018;1(5):e182665.
6. Kim MO, Coiera E, Magrabi F. Problems with health information technology and their effects on care delivery and patient outcomes: a systematic review. J Am Med Inform Assoc 2017;24(2):246-50.
7. Shortliffe EH. The adolescence of AI in medicine: will the field come of age in the '90s? Artif Intell Med 1993;5(2):93-106.
8. Coiera EW. Artificial intelligence in medicine: the challenges ahead. J Am Med Inform Assoc 1996;3(6):363-6.
9. Patel VL, Shortliffe EH, Stefanelli M, Szolovits P, Berthold MR, Bellazzi R, Abu-Hanna A. The coming of age of artificial intelligence in medicine. Artif Intell Med 2009;46(1):5-17.
10. Benber B, Lay K. Health secretary Matt Hancock endorses untested medical app. The Times, 17 September 2018. Available at: https://www.thetimes.co.uk/article/matt-hancock-endorses-untested-health-app-3xq0qcl0x.
11. McCarthy J. Recursive Functions of Symbolic Expressions and Their Computation by Machine. Commun ACM 1960;3(4):184-95.
12. Colmerauer A, Roussel P. The birth of Prolog. ACM SIGPLAN Notices 1993;28(3):37.
13. Kulikowski CA. An Opening Chapter of the First Generation of Artificial Intelligence in Medicine: The First Rutgers AIM Workshop. Yearb Med Inform 2015;10(1):227-33.
14. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput Biomed Res 1975;8(4):303-20.
15. Sonnenberg FA, Hagerty CG, Kulikowski CA. An architecture for knowledge-based construction of decision models. Med Decis Making 1994;14(1):27-39.
16. de Dombal FT, Leaper DJ, Staniland JR, McCann AP, Horrocks JC. Computer-aided diagnosis of acute abdominal pain. Br Med J 1972;2(5804):9-13.
17. Nykänen P, Chowdhury S, Wigertz O. Evaluation of decision support systems in medicine. Comput Methods Programs Biomed 1991;34(2/3):229-38.
18. Wyatt JC, Spiegelhalter D. Evaluating medical expert systems: What to test and how? Int J Med Inform 1990;15:205-17.
19. Clarke K, O'Moore R, Smeets R, Talmon J, Brender J, McNair P, et al. A Methodology for Evaluation of Knowledge-Based Systems in Medicine.

IMIA Yearbook of Medical Informatics 2019 134 Magrabi et al.

In: van Bemmel J, McCray AT, editors. Yearbook of Medical Informatics 1995. p. 513-27.
20. Yu VL, Buchanan BG, Shortliffe EH, Wraith SM, Davis R, Scott AC, et al. Evaluating the performance of a computer-based consultant. Comput Programs Biomed 1979;9(1):95-102.
21. van Gennip EM, Talmon JL, Bakker AR. ATIM, accompanying measure on the assessment of information technology in medicine. Comput Methods Programs Biomed 1994;45(1-2):5-8.
22. Brender J. Methodology for constructive assessment of IT-based systems in an organisational context. Int J Med Inform 1999;56:67-86.
23. Nykänen P, Enning J, Talmon J. Inventory of validation approaches in selected health telematics projects. Int J Med Inform 1999;56:87-96.
24. van Gennip E, Lorenzi NM. Results of discussions at the IMIA WG 13 and 15 working conference. Int J Med Inform 1999;56:177-80.
25. Brender J. Handbook of evaluation methods for health informatics. Burlington, MA: Elsevier Academic Press; 2006.
26. Friedman CP, Wyatt JC. Evaluation Methods in Medical Informatics. 2nd ed. New York: Springer; 2006.
27. Karthaus V, Thygesen H, Egmont-Petersen M, Talmon J, Brender J, McNair P. User-requirements driven learning. Comput Methods Programs Biomed 1995;48(1-2):39-44.
28. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 2018;361:k1479.
29. Nolan J, McNair P, Brender J. Factors influencing the transferability of medical decision support systems. Int J Biomed Comput 1991;27(1):7-26.
30. Feigenbaum EA, editor. Autoknowledge: from file server to knowledge servers. In: MEDINFO 1986. Amsterdam: Elsevier Science Publishers BV; 1986.
31. Nykänen P, Zvarova J. Big data challenges for personalised medicine. Editorial. International Journal of Biomedicine and Healthcare 2015;3(1):1.
32. Ammenwerth E, Shaw N. Bad health informatics can kill - is evaluation the answer? Methods Inf Med 2005;44:1-3.
33. DeDeo S. Wrong side of the tracks: Big Data and Protected Categories. arXiv preprint arXiv:1412.4643; 2014.
34. Bray BD, Steventon A. What have we learnt after 15 years of research into the 'weekend effect'? BMJ Qual Saf 2017;26(8):607-10.
35. Sandvig C, Hamilton K, Karahalios K, Langbort C. Auditing algorithms: Research methods for detecting discrimination on internet platforms. In: Data and discrimination: converting critical concerns into productive inquiry; 2014. p. 1-23.
36. Crouch H. RCGP chair says GPs are not 'technophobic dinosaurs'. Digital Health; 2018. Available from: https://www.digitalhealth.net/2018/10/rcgp-chair-technophobic-dinosaurs/.
37. Lehmann HP, Downs SM. Desiderata for sharable computable biomedical knowledge for learning health systems. Learn Health Syst 2018:e10065.
38. Bhattacharya S, Czejdo B, Agrawal R, Erdemir E, Gokaraju B, editors. Open Source Platforms and Frameworks for Artificial Intelligence and Machine Learning. SoutheastCon 2018; 19-22 April 2018.
39. Osoba OA, Welser IV W. An intelligence in our image: The risks of bias and errors in artificial intelligence. Rand Corporation; 2017.
40. Burnett S, Franklin BD, Moorthy K, Cooke MW, Vincent C. How reliable are clinical systems in the UK NHS? A study of seven NHS organisations. BMJ Qual Saf 2012;21(6):466-72.
41. Ser G, Robertson A, Sheikh A. A qualitative exploration of workarounds related to the implementation of national electronic health records in early adopter mental health hospitals. PLoS One 2014;9(1):e77669.
42. Lyell D, Magrabi F, Raban MZ, Pont LG, Baysari MT, Day RO, et al. Automation bias in electronic prescribing. BMC Med Inform Decis Mak 2017;17(1):28.
43. Lyell D, Magrabi F, Coiera E. Reduced Verification of Medication Alerts Increases Prescribing Errors. Appl Clin Inform 2019;10(1):66-76.
44. NHS code of conduct for data-driven health and care technology, 19 February 2019. Available from: https://www.gov.uk/government/publications/code-of-conduct-for-data-driven-health-and-care-technology/initial-code-of-conduct-for-data-driven-health-and-care-technology.
45. Minne L, Eslami S, de Keizer N, de Jonge E, de Rooij SE, Abu-Hanna A. Effect of changes over time in the performance of a customized SAPS-II model on the quality of care assessment. Intensive Care Med 2012;38(1):40-6.
46. Treleaven P, Galas M, Lalchand V. Algorithmic trading review. Commun ACM 2013;56(11):76-85.
47. Crouch H. East and North Herts could face £7m bill to fix Lorenzo issue. Digital Health; 2018. Available from: https://www.digitalhealth.net/2018/10/east-and-north-herts-lorenzo-it-issue/.
48. Friedman C, Rigby M. Conceptualising and creating a global learning health system. Int J Med Inform 2013;82(4):e63-71.
49. Nykänen P, Brender J, Talmon J, de Keizer NF, Rigby M, Beuscart-Zephir M, et al. Guideline for good evaluation practice in health informatics (GEP-HI). Int J Med Inform 2011;80(12):815-27.
50. Talmon J, Ammenwerth E, Brender J, de Keizer N, Nykanen P, Rigby M. STARE-HI - Statement on reporting of evaluation studies in health informatics. Yearb Med Inform 2009:23-31.
51. Brender J, Talmon J, de Keizer N, Nykanen P, Rigby M, Ammenwerth E. STARE-HI - Statement on Reporting of Evaluation Studies in Health Informatics: explanation and elaboration. Appl Clin Inform 2013;4(3):331-58.
52. Schloemer T, Schröder-Bäck P. Criteria for evaluating transferability of health interventions: a systematic review and thematic synthesis. Implement Sci 2018;13(1):88.
53. Hyppönen H, Faxvaag A, Gilstad H, Hardardottir GA, Jerlvall L, Kangas M, et al. Nordic eHealth Indicators: Organisation of research, first results and the plan for the future [Internet]. Copenhagen: Nordic Council of Ministers; 2013. Available from: http://urn.kb.se/resolve?urn=urn:nbn:se:norden:org:diva-675.
54. Canada Health Infoway. Benefits Evaluation Indicators Technical Report, version 2.0; 2012.
55. DeLone WH, McLean ER. Information systems success: The quest for the dependent variable. Inf Syst Res 1992;3(1):60-95.
56. DeLone WH, McLean ER. The DeLone and McLean Model of Information Systems Success: A Ten-Year Update. J Manage Inf Syst 2003;19(4):9-30.
57. Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, amending Directive 2001/83/EC, Regulation (EC) No 178/2002 and Regulation (EC) No 1223/2009 and repealing Council Directives 90/385/EEC and 93/42/EEC. Available from: https://publications.europa.eu/en/home.
58. Artificial Intelligence for Health and Health Care. JSR-17-Task-002, JASON, The MITRE Corporation; 2017. Available from: https://fas.org/category/artificial-intelligence/.
59. Draft OECD Guide for Measuring ICTs in the Health Sector. Paris: OECD; COM/DELSA/DSTI(2013)3/FINAL; 2013. Available from: http://www.oecd.org/health/health-systems/Draft-oecd-guide-to-measuring-icts-in-the-health-sector.
60. Adler-Milstein J, Ronchi E, Cohen GR, Winn LA, Jha AK. Benchmarking health IT among OECD countries: better data for better policy. J Am Med Inform Assoc 2014;21(1):111-6.
61. Codagnone C, Lupiañez-Villanueva F. Benchmarking Deployment of eHealth among General Practitioners: Final report; 2013.
62. Lannquist Y. Ethical & Policy Risks of Artificial Intelligence in Healthcare. The Future Society; 2018.

Correspondence to:
A/Prof. Farah Magrabi
Centre for Health Informatics
Australian Institute of Health Innovation
Macquarie University
Level 6, 75 Talavera Road
Macquarie University NSW 2109
Australia
Tel: +61 2 9850 2429
E-mail: [email protected]
