A Knowledge-Based Assessment of Compliance to the Longitudinal Application of Clinical Guidelines

Thesis submitted in partial fulfillment

of the requirements for the degree of

by

Avner Hatsek

Submitted to the Senate of Ben-Gurion University of the Negev

November 2014

Beer-Sheva

A Knowledge-Based Assessment of Compliance to the Longitudinal Application of Clinical Guidelines

Thesis submitted in partial fulfillment

of the requirements for the degree of

“DOCTOR OF PHILOSOPHY”

by

Avner Hatsek

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor______

Approved by the Dean of the Kreitman School of Advanced Graduate Studies______

November 2014

Beer-Sheva

This work was carried out under the supervision of Prof. Yuval Shahar

In the Department: Engineering

Faculty: Engineering

Acknowledgments

I would like to thank those who supported and encouraged me in completing this research. First, my academic supervisor, Professor Yuval Shahar, who guided me for several years, during which he shared much of his infinite knowledge with me and granted me the opportunity to learn and to develop my skills in many new areas.

I would like to thank Dr. Irit Hochberg, Dr. Deeb Daoud, and Prof. Aya Biderman, who participated in this research and devoted their time and knowledge to assist me in evaluating the system I developed, and to better understand the implications of such systems for real clinical settings.

I thank my colleagues at the Ben-Gurion University Medical Informatics Research Center, Erez Shalom, Denis Klimov, Ayelet Goldstein, Elior Ariel, Mata Lion, and all the others, with whom I was lucky to share many experiences and to develop new ideas, and who never hesitated to assist whenever help was needed.

I would like to thank the Department of Information Systems Engineering and the Faculty of Engineering at Ben-Gurion University, where I studied, taught, and researched for many years, for assisting me financially and, mainly, for providing me with the environment in which to acquire so much knowledge and so many skills.

I would also like to thank the Israeli Ministry of Science and the "GERTNER" National Institute for research in health-care policy, which provided part of the funding for my research.

This work is dedicated to my wife Merav, whom I truly love, and with whom I am expecting our first son very soon…


Abstract

Clinical guidelines are developed by professional medical associations as a tool to standardize medical care. These guidelines are published by the associations in order to assist clinicians in basing their medical decisions on state-of-the-art, research-based evidence. Although these guidelines are, in general, accessible to clinicians, it is almost impossible for busy clinicians to constantly follow every new guideline that is published and to comply with all of the current recommendations.

Several methods have been presented in the past for the development of automated systems for clinical-guideline-based plan recognition, critiquing, and quality assessment. However, additional work is still needed for such systems to be widely accepted in healthcare.

In this study, I designed, implemented, and evaluated the DiscovErr system, a comprehensive system for guideline-based critiquing and quality assessment that includes an integrated set of modules with a graphical knowledge specification interface; a clinical guideline library; a full-fledged guideline-based quality-assessment engine that assesses care over long time periods according to a well-defined, formal representation of the guideline; and an interface for the visualization of the compliance analysis results. The DiscovErr system uses a formal representation of the procedural workflow knowledge inherent in clinical guidelines and of the declarative domain-specific data-interpretation knowledge they explicitly or implicitly use, to perform quality assessment of the medical care by analyzing the longitudinal clinical data.

Furthermore, by using a flexible reasoning mechanism based on fuzzy temporal logic, the compliance analysis algorithm addresses the inherent ambiguity of clinical guidelines, the uncertainty of clinical data, and the fact that care providers typically do not strictly follow clinical guidelines but rather adhere to these guidelines in a partial manner that reflects the essence of the guidelines' recommendations.

To evaluate the DiscovErr system, I formally represented a complex, state-of-the-art guideline for the management of type II diabetes patients. I then performed several experiments by applying the system to a significant amount of real patient data, comparing the critique generated by the system to the comments made by three expert internists (two of whom are diabetes-therapy experts, and the third a family medicine expert) who reviewed the original clinical records as well as the system’s comments.

The completeness of the DiscovErr system's comments, defined as the portion of the experts' comments that were reproduced by the system, was assessed by comparing the system's comments to those of the three expert clinicians. The completeness of the DiscovErr system's comments increased from 66% for comments made by only one expert, to 83% for comments that were mentioned by exactly two of the experts, and reached 98% for comments made by all three experts. The completeness of the system was 91% for comments made by at least a majority of the three experts, which I considered to be the system's overall completeness level.

The correctness of the comments made by the system was assessed by the two diabetes experts, who rated each comment as correct, partially correct, or incorrect. The level of agreement between the two diabetes experts on the correctness of the system's comments was measured using Cohen's Kappa, and was found to be high and statistically significant. A comment was considered correct if it was assessed as correct by one expert and as at least partially correct by the other expert. The correctness of the system's comments, as assessed by the two diabetes experts, was also 91%. Correctness was higher for comments concerning issues that were judged as being of greater importance, as opposed to comments concerning issues judged as being of lesser importance.

I have also assessed the experts' correctness and completeness in an indirect fashion, through the comments they made and through their assessments of the system's comments. The correctness of the experts' comments was defined as having one's comment agreed to by at least one additional expert. I defined an additional measure to assess the global "indirect correctness" score, in which the DiscovErr system was essentially considered as a fourth expert, and which was used for the overall assessment of the system versus the three experts. Completeness of the experts' comments was defined relative to the overall set of comments made by the DiscovErr system that were assessed as correct by both of the diabetes experts. Overall, when compared to the three human experts, using these measures to assess system and human correctness, the DiscovErr system could be placed between the family medicine expert and the two diabetes experts; with respect to the system and human measures of completeness, it displayed a higher level of completeness than any of the three human experts.

I conclude that systems such as DiscovErr can be effectively used to assess the quality of longitudinal guideline-based care of large numbers of patients.

Table of Contents

1 INTRODUCTION ...... 5

2 BACKGROUND ...... 13
2.1 Planning, Plan Understanding, and Plan Recognition ...... 15
2.2 Formal Models for Guideline Representation ...... 16
2.3 Methods for Clinical Guidelines Recognition and Critiquing ...... 20
2.4 Fuzzy Sets and Fuzzy Logic ...... 37

3 THE DISCOVERR SYSTEM ...... 41
3.1 Desiderata for the System’s Design ...... 42
3.2 Overall Architecture ...... 44
3.3 The Knowledge Framework ...... 46
3.3.1 Clinical Guidelines Representation ...... 47
3.3.2 Clinical Steps Representation ...... 53
3.3.3 Domain Knowledge Representation ...... 55
3.3.4 The Knowledge Specification Tool ...... 59
3.4 The Patient Data Access Module ...... 62
3.5 The Analysis Framework ...... 64
3.5.1 The Fuzzy Temporal Reasoner ...... 65
3.5.2 The Compliance Analysis Engine ...... 72
3.5.3 Visualization of the Compliance Analysis Results ...... 94
3.6 The Taxonomy of Computable Compliance Comments ...... 98

4 EVALUATION ...... 103
4.1 Research Questions ...... 104
4.2 Experimental Design ...... 107
4.3 Performing the Experiment ...... 108

5 RESULTS ...... 123
5.1 Completeness of the System's Comments ...... 124
5.2 Correctness of the System's Comments ...... 132
5.3 A Comparison of Correctness and Completeness among the Experts and System ...... 141
5.4 Results Regarding Runtimes and Memory Consumption ...... 146


6 SUMMARY AND DISCUSSION ...... 149
6.1 Summary of the Methods and of the Results ...... 150
6.2 Discussion ...... 153
6.3 Implications to the Field of Medicine ...... 155
6.4 Limitations of the Work ...... 158
6.5 Conclusions ...... 159

7 REFERENCES ...... 161


List of Figures

Figure 1. The main modules of the DiscovErr system...... 9

Figure 2. The overall architecture of the DiscovErr system ...... 44

Figure 3. The integrated knowledge model of the DiscovErr system...... 47

Figure 4. Guideline-Plan class diagram...... 49

Figure 5. Plan-Body class diagram...... 51

Figure 6. Sub-Plan class diagram...... 52

Figure 7. Clinical-Step concepts class diagram...... 54

Figure 8. KBTA concepts class diagram...... 58

Figure 9. The Clinical Guidelines Library knowledge specification interface...... 59

Figure 10. The Domain Knowledge Library knowledge specification interface...... 60

Figure 11. The Clinical steps Library knowledge specification interface...... 61

Figure 12. The Taxonomy Library Viewer interface...... 61

Figure 13. The data flow in the Patient Data Access module...... 62

Figure 14. The components of the Analysis Framework...... 64

Figure 15. Blood Pressure measurements demonstrating the Fuzzy Temporal Reasoner...... 66

Figure 16. Extrapolation of time-stamped measurements by the Fuzzy Temporal Reasoner. .... 67

Figure 17. Temporal partitioning by the Fuzzy Temporal Reasoner...... 68

Figure 18. Illustration of the fuzzification-function...... 69

Figure 19. Evaluation of logical constraints by the Fuzzy Temporal Reasoner...... 70

Figure 20. Evaluation of logic operators by the Fuzzy Temporal Reasoner...... 71

Figure 21. AND-OR tree representation of the Preeclampsia diagnosis concept...... 72

Figure 22. The main steps of the DiscovErr’s compliance analysis algorithm...... 73

Figure 23. Initialize the TimeLine step of the compliance analysis algorithm...... 74

Figure 24. The top-down Analysis step of the compliance analysis algorithm...... 75

Figure 25. The bottom-up Analysis step of the compliance analysis algorithm...... 79

Figure 26. Compliance analysis in the context of a Not-Applicable guideline...... 80

Figure 27. Compliance Assessment in the context of an Applicable guideline...... 81

Figure 28. Compliance Assessment of a step in the Plan-Body of Applicable guidelines...... 82

Figure 29. Compliance analysis in the context of a Stopped guideline...... 84

Figure 30. Compliance analysis in the context of a Completed guideline...... 84

Figure 31. Compliance analysis in the context of a Just-Completed guideline...... 85

Figure 32. The Missing-Actions Analysis of the compliance analysis algorithm...... 88

Figure 33. Missing-Actions Analysis for drug-increase clinical steps...... 89

Figure 34. The Results Viewer graphical interface...... 95

Figure 35. Visualization of the details of a compliance-related comment ...... 96

Figure 36. Visual explanation of compliance-related comments...... 97

Figure 37. Taxonomy of computed-explanations of abstracted medication-related actions ...... 98

Figure 38. Taxonomy of guideline control computed-explanations...... 99

Figure 39. Taxonomy of computed-explanations related to outcome intentions...... 99

Figure 40. Taxonomy of computed-explanations assigned to each raw data item...... 100

Figure 41. Taxonomy of computed-explanations related to missing actions...... 101

Figure 42. The interface for visualization of the raw patient data...... 114

Figure 43. The interface for adding an expert comment about the patient's management...... 115

Figure 44. The interface for evaluating the comments of the system...... 116

Figure 45. The area of the screen in which the experts evaluated each system comment...... 117

Figure 46. A system screenshot illustrating the environment for the meta-critiquing-analysis of the experts’ comments...... 119

Figure 47. A zoom-in illustrating an example of compliance comments of the three experts regarding the same patient...... 119

Figure 48. The interface used to perform the meta-critiquing-analysis of the system comments...... 120

Figure 49. A profile of the completeness and correctness of the experts and the system...... 145

Figure 50. A snapshot of the memory consumption graph ...... 146

Figure 51. A snapshot of the CPU utilization graph...... 147


List of Tables

Table 1. Comparing the goals and algorithms of the critiquing systems ...... 33

Table 2. Comparing the underlying knowledge of the critiquing systems ...... 35

Table 3. Comparing the evaluation of the critiquing systems ...... 36

Table 4. The set of comment types with which the experts could describe their comments ...... 113

Table 5. General information about the manual compliance analysis experiment...... 124

Table 6. Summary of the compliance evaluation time...... 125

Table 7. Experts' comments regarding compliance to the guideline...... 125

Table 8. Distribution of the experts' comments with respect to clinical action type...... 125

Table 9. The distribution of the experts' comments with respect to compliance issue’s type. .. 126

Table 10. Completeness of all agents relative to the comments made by Diabetes Expert 1. .. 126

Table 11. Completeness of all agents relative to the comments made by Diabetes Expert 2. .. 127

Table 12. Completeness of all agents relative to the comments made by the Family Medicine Expert...... 127

Table 13. Summary of the completeness of the comments made by all agents, relative to the union of the comments made by the three experts...... 128

Table 14. The level of support, by the three experts, to the unique compliance issues...... 128

Table 15. Completeness of the system’s comments relative to the unique compliance issues, by level of support of the three experts...... 129

Table 16. Completeness of the system’s comments relative to the unique compliance issues, by level of support of the two diabetes experts...... 130

Table 17. Distribution of the scenarios in which unique compliance issues were not detected by the system...... 131

Table 18. General information about the correctness experiment...... 132

Table 19. Summary of the correctness evaluation time...... 132

Table 20. System's comments regarding the compliance to the guideline...... 133

Table 21. The diabetes expert’s assessments of the system’s comments’ correctness...... 134

Table 22. The level of agreement between the two diabetes experts regarding the truth value of the correctness of the system's comments...... 134


Table 23. Correctness of the system’s comments according to both of the diabetes experts. ... 135

Table 24. Correctness of the system's comments regarding tests and patient monitoring issues...... 136

Table 25. Correctness of the system's comments regarding medication therapy issues...... 136

Table 26. The diabetes expert’s assessments of the system’s comments’ importance...... 137

Table 27. Importance of the system's comments...... 138

Table 28. Importance of the correct comments compared to importance of the incorrect comments...... 138

Table 29. Distribution of the reoccurring scenarios among the incorrect comments...... 139

Table 30. Indirect correctness of the experts’ comments, partitioned by level of support of the comments by the other experts...... 142

Table 31. Indirect correctness of the experts’ comments, partitioned by level of support of the comments by the other agents, including the system...... 143

Table 32. Indirect Completeness of the experts in the manual compliance evaluation...... 143

Table 33. Distribution of the comments that were not mentioned by an expert, although they were judged as jointly correct by the two diabetes experts...... 144

Table 34. Indirect Completeness of the experts in the manual compliance evaluation, regarding compliance problems only...... 144

Table 35. Summary of completeness and correctness of the system and the experts...... 145

Table 36. Results regarding runtimes...... 146

1

Introduction

The necessity of common standards for medical care is becoming increasingly clear to the medical community. Evidence-based recommendations are published worldwide in the form of clinical guidelines and protocols. These guidelines are usually published in a text format, and are intended to be used by clinicians to provide state-of-the-art care. Evidence also indicates that implementation of these guidelines may improve medical care and reduce its costs [Grimshaw and Russel 1993; Quaglini et al. 2004; Patkar et al. 2006; Ruben et al. 2009].

Multiple types of computational tools have been developed in order to increase compliance with guidelines in medical settings and to assist clinicians in applying the latest medical knowledge in real time. These tools include guideline search and visualization engines, frameworks for the specification of guidelines in formal formats, and tools for the application of guideline knowledge for clinical decision support. A comprehensive review of most of these systems can be found in [Peleg et al. 2003; De Clerq et al. 2004; Peleg 2013].

In recent years, healthcare providers have invested increasing efforts in applying methods to assess the quality of the medical care that they provide to their patients. These efforts include the definition and publication of objective, quantifiable quality measures that are used as a guide for proper treatment and for evaluating the quality of healthcare provided by clinical organizations and in private practice. Examples of such quality measures are the Clinical Quality Measures (CQMs) published by Medicare and Medicaid services, and the Indicators for Quality Improvement (IQIs) published by the NHS. Efforts are also being made to develop automated systems for reporting compliance with these measures, including the development of quality data models by organizations such as the National Quality Forum (NQF), the connection of local databases to the published standards [Dykes et al. 2010], and the development of tools to improve the quality of data in order to support automated analysis of these measures [Lanzola et al. 2014]. In addition, an increasing number of medical centers and organizations have established internal quality and risk assessment units that perform quality assessment of the medical treatment, typically on a random subset of the patient population, or in dire circumstances in which mistakes have already been made. These risk assessment units usually examine the medical records manually, or by using relatively simple computational methods, and compare the medical records to a deterministic set of quality measures that are created specifically for that purpose.

In this study, I focused on an additional approach, which aims to enhance compliance with clinical guidelines through automated, retrospective critiquing and quality-assessment analysis.

In related research (described in more detail in Chapter 2), several approaches have been presented for the development of automated systems for guideline-based plan recognition, critiquing, and quality assessment. One of the first systems for medical critiquing was the HyperCritic system [van der Lei, Musen 1990], which examines the electronic medical record and generates critiquing statements by applying a set of previously defined critiquing tasks. The HyperCritic system was implemented in the hypertension domain, and was later extended [Kuilboer et al. 2003] in a new implementation in the domain of asthma and COPD, called AsthmaCritic. In a later publication, [Shahar, Musen 1995] presented fundamental aspects that must be addressed before such systems can be adopted, such as the need for plan recognition by an automatic planner, the need for a mechanism of plan revision, and the importance of a sophisticated temporal-reasoning engine to enable the analysis of the time-based data that is collected on each patient. An additional system for medical critiquing was presented by [Gertner 1997], who developed the Trauma-TIQ system, an extension of the Trauma-AID [Clarke et al. 1993] system. The Trauma-TIQ system was aimed at assisting in decision support for trauma management by critiquing the clinician's actions only when significant gaps from the guideline were detected by the system. [Advani et al. 2002] outlined an approach for quality assessment by scoring adherence to guideline intentions, and presented the Quality Indicator Language (QUIL), which was based on the Asbru language [Shahar et al. 1998] and was specially designed to represent knowledge for the quality-assessment task. A few years later, [Sips et al. 2006] presented an algorithm for an intention-based matching process and its evaluation in cases of hyperbilirubinemia. Another system, IGR, was presented by [Boldo 2007] for plan recognition of clinical guideline plans, by analyzing the data in the electronic medical records. To perform the plan recognition, the system used an abstract format to represent the clinical guidelines, called guideline characteristic vectors. The plan-recognition method of the IGR system included the use of fuzzy-logic techniques to support partial matching between patient data and the guidelines. [Panzarasa et al. 2007] presented the RoMA module for analysis of clinicians' motivations for non-compliance with guideline recommendations presented to them by a care-flow system. [Groot et al. 2008] proposed an additional computational method to perform the critiquing, by representing the actual treatment actions and the clinical guidelines using temporal logic and a state-transition system, and employing model checking to investigate whether the actual treatment is consistent with the guidelines.

The approaches presented above have pointed to important aspects to consider when developing systems for medical critiquing and quality assessment. However, additional research is still required, and integrated systems that address all of these important aspects need to be developed before such approaches can be accepted by healthcare providers on a massive scale and integrated into the process of medical care.

In this research I designed, implemented, and evaluated a new comprehensive system, called DiscovErr, for medical critiquing and quality assessment by analyzing compliance with clinical guidelines. The DiscovErr system is composed of three main modules (see Figure 1). The first module is the Knowledge Framework, which supports representation of knowledge according to a formal model, and includes a graphical knowledge specification tool that is used by knowledge engineers and expert physicians to specify and maintain the formal medical knowledge. The knowledge representation model is generic regarding the medical domain and enables specification of both the clinical guidelines’ procedural knowledge and the medical domain's declarative knowledge; it uses the Asbru language and the KBTA ontology, respectively, for these two types of knowledge. The second module is the Data Access module, which is used to retrieve data from the patients’ electronic medical records, convert it to the required format, and store it in a temporal database that is accessed during the compliance analysis. The third and main module of DiscovErr is the Analysis Framework, which performs the critiquing and quality assessment by analyzing the patients’ medical records based on the formally represented clinical guidelines. The Compliance Analysis Engine is designed to address the important fact that clinical guidelines include significant ambiguity, and that clinicians may choose different actions to comply with the recommendations. To do so, the compliance analysis algorithm uses fuzzy-logic techniques to evaluate the changes in the patients’ clinical state, and considers the intentions of each plan in the clinical guidelines in addition to its straightforward recommendations.


Figure 1. The main modules of the DiscovErr system.
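To make the modular architecture concrete, the following minimal Python sketch illustrates how the three modules described above might interface. All class and method names here are illustrative assumptions chosen for exposition; they are not taken from the actual DiscovErr implementation.

```python
# Illustrative sketch only: hypothetical interfaces for the three DiscovErr modules
# described above (Knowledge Framework, Data Access, Analysis Framework).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KnowledgeFramework:
    """Holds formally represented guideline (Asbru) and domain (KBTA) knowledge."""
    guideline_plans: Dict[str, dict] = field(default_factory=dict)
    domain_concepts: Dict[str, dict] = field(default_factory=dict)


@dataclass
class PatientDataAccess:
    """Retrieves EMR data and exposes it as time-stamped records."""
    records: List[dict] = field(default_factory=list)

    def load(self, patient_id: str) -> List[dict]:
        # In a real system this would query a temporal database.
        return [r for r in self.records if r["patient_id"] == patient_id]


class AnalysisFramework:
    """Applies the compliance-analysis engine to one patient's longitudinal data."""

    def __init__(self, knowledge: KnowledgeFramework, data: PatientDataAccess):
        self.knowledge = knowledge
        self.data = data

    def assess(self, patient_id: str, plan_id: str) -> List[str]:
        plan = self.knowledge.guideline_plans[plan_id]
        records = self.data.load(patient_id)
        comments: List[str] = []
        # ...the compliance analysis would compare `records` against `plan` here...
        return comments
```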

To evaluate the system, and to measure the correctness and completeness of the compliance analysis algorithm, I conducted a trial with the system in the diabetes domain. A state-of-the-art diabetes-management guideline was represented, the system was applied to real patient data, and the results of the compliance analysis were evaluated by two expert physicians in the field of diabetes and an additional expert in the field of family medicine. The experts examined the medical records, added comments regarding the compliance of the medical treatment manifested by these records relative to the state-of-the-art guideline, and then annotated the correctness and importance of the comments provided by the system for the same medical records.

My main research questions were (See Section 4.1 for a detailed explanation of the questions and of the measures used to assess the answers to these questions):

Question 1: (Completeness) Does the system produce all or most of the important comments relevant to the task of assessing compliance to a guideline?

Question 2: (Correctness) Is the system correct in its comments regarding the compliance to the guideline?

Question 3: (Importance) Are the comments provided by the system significant and important for understanding the quality of treatment?

Question 4: (Performance) Does the system perform well regarding run times and memory consumption?

While answering these primary research questions, I have also examined (and even answered to some extent) two interesting secondary issues that focus on the human aspects of the experiment:


Issue 1: (Internal expert similarity) What is the similarity between the comments of several expert physicians, when assessing the compliance [of a care provider and/or patient] to a certain guideline, given the same set of patient records?

Issue 2: (External expert agreement) What is the level of agreement between several experts regarding the quality of the system's assessments?

The results of the evaluation of the DiscovErr system demonstrated the feasibility of specifying the required knowledge, which is a mandatory preliminary step to enable quality assessment. More importantly, the results indicated high levels of completeness and correctness of the DiscovErr framework regarding the set of critiques generated by the system when examining the quality of the medical care that was provided to the patients.

In particular (see Section 5.1 for details), the completeness of the DiscovErr system was 91% for comments made by at least a majority of the three experts. The correctness of the systems' comments, as assessed by the two diabetes experts, was also 91%. Overall, when compared to a majority of the three human experts, the DiscovErr system could be placed between the family medicine expert and the diabetes experts; in fact, it displayed a higher level of completeness than any of the three human experts.

The system's comments, when evaluated by the medical experts, were found to be significant and important for assessing the quality of the medical treatment and its compliance with clinical guidelines. It was encouraging to note that correctness was higher for comments concerning issues that were judged as being of greater importance, as opposed to comments concerning issues judged as being of lesser importance.

Regarding performance, the DiscovErr system performed well enough, with respect to computation times, to justify its use in realistic clinical settings.

With respect to the secondary issues examined, 46% of the unique (with respect to content) comments regarding compliance issues were made by only one expert; 26.5% were made by two; and the rest, 27.5%, were made by all three experts. It is encouraging to note that the DiscovErr system produced 98% of the comments made by all three experts (versus 83% of the comments made by only two experts, and 66% of the comments made by only one expert), providing yet another implicit validation of the system’s focus on important issues. The agreement between the diabetes experts regarding the correctness of the system’s comments, assessed through a weighted Kappa measure, was a good and significant 0.61 (p < 0.05).

I present relevant background in Chapter 2, with a review of work in the fields of planning, plan recognition, guideline representation, methods for clinical-guideline recognition and critiquing, and fuzzy logic. I present the DiscovErr system in Chapter 3, with details about its desiderata and architecture and a thorough description of its various modules. I present the evaluation in Chapter 4, explaining the research questions and the design of the experiment. I present the results in Chapter 5, for each research question. I conclude with a summary in Chapter 6, discussing the implications of this work for the field of medicine, its limitations, and the main conclusions.

2

Background


In this chapter, I describe earlier studies that I found most relevant to my research. I categorized these studies into three main fields, covered by the three sections of this chapter.

The first section of the background focuses on the field of planning, plan understanding, and plan recognition, with early studies from the late 1970s. These studies presented pioneering systems that used methods to decide on a required set of actions to solve a certain task, as well as systems for understanding the plans or intentions of an external agent by observing its actual (past) actions. These studies laid the foundations for later studies that focused on the medical domain, aiming to assist care providers in improving the quality of their medical tasks.

The second section of the background focuses on the field of formal representation of clinical guidelines, covering studies that presented new formal models and languages for the representation of the medical knowledge within clinical guidelines. Many languages and representation models have been presented, each with its own advantages. In developing the DiscovErr system, however, I used the Asbru language for the representation of clinical guidelines and the KBTA ontology for the representation of declarative medical knowledge. These models were chosen due to their expressiveness in representing the temporal aspects and the clinical intentions of the guidelines, in addition to representing the core medical knowledge (e.g., the clinical plan body, pre-conditions, and decision steps) that most of the suggested languages support.

The third section of the background focuses on studies that have presented methods and systems for clinical-guideline recognition and critiquing. These studies, which share similar aims and ideas with my work, point to important aspects that need to be considered when approaching the challenging task of medical critiquing and quality assessment.


2.1 Planning, Plan Understanding, and Plan Recognition

Early studies in the field of AI focused on the task of planning, aiming to develop artificial agents that are able to decide on and organize a sequence of actions required to achieve certain goals. One of the important publications was "Planning and Acting" [McDermott 1978], which presented a new theory of problem solving that included concepts such as plans and actions, and presented the need for sophisticated languages to describe the plans and to decide on their execution. It was shown that the knowledge about the plans should include rules and constraints about each plan and action that allow the rule-based problem solver to choose and schedule plans. These rules were described as pre-conditions and post-conditions, and examples of their application were presented in multiple domains. The types of languages suggested to support the task of planning laid the basis for the models for formal representation of clinical guidelines that were developed years later, such as Asbru, a formal language for clinical-guideline representation that I used in this research; these models are described in more detail in the next section.
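To illustrate the idea of pre- and post-conditions attached to actions, here is a toy Python sketch; the clinical action and condition names are my own examples and are not part of the cited formalism.

```python
# Toy illustration of an action annotated with pre- and post-conditions,
# in the spirit of the early planning languages discussed above.
measure_blood_pressure = {
    "action": "measure_blood_pressure",
    "preconditions": {"patient_present": True},
    "postconditions": {"bp_value_recorded": True},
}


def applicable(action, world_state):
    """An action is applicable when all of its preconditions hold in the current state."""
    return all(world_state.get(k) == v for k, v in action["preconditions"].items())


def apply_action(action, world_state):
    """Applying an action asserts its postconditions on the world state."""
    new_state = dict(world_state)
    new_state.update(action["postconditions"])
    return new_state


state = {"patient_present": True}
if applicable(measure_blood_pressure, state):
    state = apply_action(measure_blood_pressure, state)
print(state)  # {'patient_present': True, 'bp_value_recorded': True}
```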

Another task, closely related to planning, is that of understanding and explaining the actions of external agents (or actors) from observations of their actions. The PAM (Plan Applier Mechanism) system was one of the first systems designed to understand stories by analyzing the characters' intentions [Wilensky 1977, 1978]. The inputs to PAM were textual stories, and its task was to analyze them and provide a structured explanation that included information about the planner and his goals. This work emphasized the need to identify cases in which the planner needed to choose between multiple goals at a certain point in time, in order to be able to understand and explain his actions. The concept of "Goal Subsumption" was presented as a way of planning for (or understanding) many goals at the same time. Goal Subsumption handles three different types of situations, each of which has its own rules for recognition and understanding. A few years later, the idea of Meta-Planning was presented [Wilensky 1981], providing extensions to the planning language that included "meta-goals" that improved the plan explanations. The meta-goals assisted the explanation algorithm in various cases; for example, the meta-goal "save resources" can be used to explain why two plans with overlapping goals should be replaced with a single one that achieves both goals, or why a specific plan was chosen from multiple possible actions (e.g., because it has the lowest cost).

Additional work on a more general solution for plan recognition was presented later [Kautz 1987, 1991], which included a formal theory of plan recognition. This work accounted for problems of ambiguity, abstraction, and complex temporal interactions that had been ignored by previous research. The theory was based on a first-order logic language, event hierarchies, and a representation of time using individual intervals, with Allen's interval algebra [Allen 1983] representing the relations between them. Each plan had a description of its components, preconditions, effects, equality constraints, and duration constraints. The theory was implemented and demonstrated in three different domains: cooking plans, operating-system file-management plans, and medical diagnostics, where plan recognition was converted to diagnosis recognition.
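As a small illustration, the following sketch encodes two of Allen's thirteen interval relations ("before" and "during"), the kind of temporal vocabulary on which Kautz's theory builds; the intervals and their clinical interpretation are invented for this example.

```python
# Two of Allen's thirteen interval relations, shown on invented example intervals.
from collections import namedtuple

Interval = namedtuple("Interval", ["start", "end"])


def before(a: Interval, b: Interval) -> bool:
    """Interval a ends strictly before interval b starts."""
    return a.end < b.start


def during(a: Interval, b: Interval) -> bool:
    """Interval a is strictly contained within interval b."""
    return b.start < a.start and a.end < b.end


drug_administration = Interval(start=10, end=20)
hospital_stay = Interval(start=5, end=30)
assert during(drug_administration, hospital_stay)
assert not before(drug_administration, hospital_stay)
```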

A different model for plan recognition was presented in [Konolige and Pollack 1989] as belief and intention ascription, with an inherent reasoning process that was encoded using a direct argumentation system. The idea behind this method was to enable making explicit statements about why one candidate ascription should be preferred over another, and to avoid the overly strong assumption that each observed action is correct. The suggested model included an "initial world" of the beliefs and intentions that can be observed from the actor's actions, added to previously held mutual beliefs. Then, the argumentation system is used to discover the intended plans by applying arguments to discover plan fragments that can be ascribed to the actor. This work emphasized the need for an ability to cope with resource limitations (i.e., missing data and knowledge) during the reasoning process.

2.2 Formal Models for Guideline Representation

To allow any form of computational support for the application of clinical guidelines, whether online at the point of care or retrospectively, the free-text guidelines need to be formalized into machine-interpretable formats that can be applied by clinical decision-support systems. Several guideline-specification ontologies were developed to represent guidelines in a formal, machine-interpretable format. A comprehensive comparison of most of those approaches can be found in [Peleg et al. 2003; De Clerq et al. 2004]. The existing major approaches for formal guideline representation and application vary in the goals they are designed to achieve, in their representation model, and in their provided knowledge-specification tools.

Some systems used production rules (i.e., "if-then rules") to describe the medical knowledge. As these rules are limited in their expressiveness, the systems that used them were extremely hard to maintain and expand, since the addition of one rule to a large set of existing rules could affect the whole system's behavior. Later, models such as the Arden Syntax [Hripcsak et al. 1994] were developed to encode medical knowledge in a knowledge base of medical logic modules (MLMs), which are a hybrid representation between a production rule and a procedural formalism.
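The following toy rule (plain Python, not Arden Syntax) illustrates the "if-then" style described above; the thresholds are illustrative. The maintenance difficulty arises because each such rule is written in isolation, so adding a new rule can silently interact with the conditions of many existing rules.

```python
# A toy "if-then" production rule of the kind discussed above (illustrative only).
def elevated_bp_rule(patient: dict):
    """Fires a recommendation when blood pressure exceeds illustrative thresholds."""
    if patient["systolic_bp"] > 140 or patient["diastolic_bp"] > 90:
        return "Consider initiating or adjusting antihypertensive therapy."
    return None


print(elevated_bp_rule({"systolic_bp": 152, "diastolic_bp": 88}))
```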

When it became clear that a more expressive formalism was required in order to develop decision-support systems for the application of long-term clinical guidelines, systems such as EON [Musen et al. 1996] were developed. EON's guideline model (called the Dharma model) defines guideline knowledge structures such as eligibility criteria, abstraction definitions, the guideline algorithm, decision models, and recommended actions. Several systems used the EON model for different domains, such as hypertension (Athena) [Goldstein et al. 2001], which was later assessed in a post-fielding surveillance study [Chan et al. 2005], and AIDS (T-Helper) [Musen et al. 1992].

Other approaches to representing comprehensive clinical guidelines offer different capabilities for representing the knowledge:

The Prodigy [Johnson et al. 2000, 2001] approach enables multiple entry points into a guideline, using the concept of common clinical scenarios.

GLIF was developed through the shared efforts of several groups [Ohno-Machado et al. 1998; Peleg et al. 2001; Boxwala et al. 2004] to support the sharing of clinical guidelines among different medical centers, and is used to represent clinical actions and decisions.

SAGE [Tu et al. 2004; Tu et al. 2007]. The SAGE project created an infrastructure for implementing computable clinical practice guidelines in enterprise settings. The SAGE recommendation-set formalism uses activity graphs and decision maps, which are recommendation sets that allow the specification of computational algorithms or medical care plans as processes.


PROforma [Fox et al. 1998; Sutton and Fox 2003] combines logic programming and object-oriented modeling, and supports four task types: actions, compound plans, decisions, and enquiries.

GUIDE [Quaglini et al. 2001; Ciccarese et al. 2004] supports the integration of modeled guidelines into organizational workflows, and uses decision analytical models such as decision trees and influence diagrams. GUIDE supports simulation of guideline implementation in terms of Petri nets, and considers issues such as patient data, the implementing facility’s organizational structure, and resource allocation.

GLARE [Terenziani et al. 2001; Terenziani et al. 2004] is a domain-independent system for the acquisition, representation, and execution of clinical guidelines. The main philosophy of GLARE is to achieve a balance between complexity and expressiveness, and to have a limited, but focused and clear set of primitives.

GASTON [De Clercq et al. 2001; De Clercq and Hasman 2004]. GASTON is a methodology and framework that facilitates the development and implementation of clinical guidelines and guideline-based decision-support systems. The overall goal of this approach is to improve the acceptance of computer-interpretable guidelines and decision-support systems in daily care by facilitating all phases of the guideline-development process.

SDA* model [Riano 2007]. The SDA* model is a formal language for the representation of procedural medical knowledge that emphasizes the need for a simple representation model that is understandable for health care professionals, without compromising on the ability to express complex medical procedures. The SDA* model is based on the formalism of flowcharts, but extends it with several elements in order to allow complete representation of clinical procedures. The ability to include multiple starting points for the clinical process and the ability to specify time constraints are two examples of extensions of the flowchart formalisms.

Asbru [Shahar et al. 1998; Miksch 1999; Seyfang et al. 2002] is an expressive language for the formal representation of clinical guidelines, which emphasizes the representation of the process and outcome intentions of the guidelines in order to better support the quality-assessment task. Clinical guidelines are represented in Asbru using a hierarchical structure of plans and sub-plans, in which each plan is represented using a set of knowledge elements, called knowledge roles. The knowledge roles allow specifying both meta-data and formal definitions of each guideline plan; examples of knowledge roles used to store the meta-data include the plan’s Title, Create-Date, and Author; examples of knowledge roles used to represent the formal definitions include the Conditions (filter, setup, complete, abort, suspend, and restart conditions), the Plan-Body, and the Outcome-Intentions. Asbru provides an expressive schema to support the representation of the temporal aspects of the clinical procedures that comprise the guidelines. Because Asbru was designed to support the quality-assessment task, it was chosen as an underlying model by the majority of studies in the field of medical critiquing and quality assessment (see the review in Section 2.3), as it provides the most expressive model for representing the intentions and the temporal aspects of the guideline. For the same reasons, the DiscovErr system developed in this research uses Asbru to represent the procedural aspects of the clinical guidelines, integrated with the Knowledge-Based Temporal Abstraction (KBTA) ontology, which represents the declarative aspects of the guideline. Details about the specific implementation are provided in Section 3.3, in the description of the knowledge model of the DiscovErr system.
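The following sketch is a simplified, non-normative Python rendering of the Asbru knowledge roles mentioned above; actual Asbru guidelines are specified in a dedicated syntax, and the field values shown here are invented examples.

```python
# Illustrative rendering of Asbru-style knowledge roles (not actual Asbru syntax).
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Conditions:
    filter: Optional[str] = None      # eligibility, e.g. "diagnosis = type II diabetes"
    setup: Optional[str] = None
    complete: Optional[str] = None
    abort: Optional[str] = None
    suspend: Optional[str] = None
    restart: Optional[str] = None


@dataclass
class GuidelinePlan:
    title: str
    author: str = ""
    create_date: str = ""
    conditions: Conditions = field(default_factory=Conditions)
    plan_body: List["GuidelinePlan"] = field(default_factory=list)  # hierarchy of sub-plans
    outcome_intentions: List[str] = field(default_factory=list)


glycemic_control = GuidelinePlan(
    title="Glycemic control",
    conditions=Conditions(filter="diagnosis = type II diabetes"),
    outcome_intentions=["maintain HbA1c below 7% over the treatment period"],
)
```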

KBTA [Shahar 1997]. Knowledge-Based Temporal Abstraction (KBTA) is a powerful method for abstracting higher-level concepts from time-stamped data, used in multiple systems that analyze clinical data. Unlike the other models presented above, the knowledge-representation ontology of KBTA is not a clinical-guideline representation language, as it focuses on the representation of declarative medical concepts. The KBTA ontology is used to formally represent medical concepts of various types, including Primitive-Parameters, Event-Parameters, State-Abstractions, Gradients, and Patterns. Each of these concept types is represented in KBTA in a formal manner that supports automated reasoning by a temporal-abstraction engine. Because the KBTA ontology defines a very expressive model for the representation of medical concepts, it is used in the knowledge model of the DiscovErr system for the representation of the declarative medical domain knowledge. Details about the specific implementation are provided in Section 3.3.3, in the description of the knowledge model of the DiscovErr system.
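As a simplified illustration of a KBTA-style state abstraction, the following sketch maps raw, time-stamped HbA1c values into symbolic states over time; the thresholds and state names are invented for this example and are not taken from the DiscovErr knowledge base.

```python
# Illustrative KBTA-style state abstraction: raw HbA1c values -> symbolic states.
def hba1c_state(value: float) -> str:
    if value < 7.0:
        return "controlled"
    if value < 9.0:
        return "moderately elevated"
    return "poorly controlled"


measurements = [("2013-01-10", 6.8), ("2013-04-12", 7.9), ("2013-07-15", 9.4)]
abstractions = [(date, hba1c_state(v)) for date, v in measurements]
print(abstractions)
# [('2013-01-10', 'controlled'), ('2013-04-12', 'moderately elevated'),
#  ('2013-07-15', 'poorly controlled')]
```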


2.3 Methods for Clinical Guidelines Recognition and Critiquing

In this section, I describe the studies that are most closely related to my research, and from which many aspects were learned and implemented in the DiscovErr system.

A Model for Critiquing Based on Automated Medical Records

In the study of [van der Lei and Musen 1990], the authors designed the HyperCritic system, one of the earliest systems developed for automated medical critiquing, aimed at advising general practitioners in the treatment of patients with hypertension. The HyperCritic system had access to the data stored in a primary-care information system (a system called ELIAS), and was used to provide multiple critiquing statements by applying a medical knowledge-based method that solves a set of critiquing tasks. One of the main ideas presented by the HyperCritic system was to rely on the data in the electronic medical record to provide the critiquing statements, thus avoiding a consultation-style interaction between the system and the user, which is a cumbersome task for most busy medical practitioners. In the critiquing approach, the system’s operation is invoked only after the physician submits the decisions they intend to make, and the system critiques these planned actions and provides recommendations for possible modifications of the treatment if necessary.

The critiquing process that was implemented in the HyperCritic system used formally represented knowledge that supported the application of a set of critiquing tasks. The knowledge model comprised two distinct types of knowledge, critiquing knowledge and medical knowledge. The critiquing knowledge was used by the system to determine when to apply the critiquing and how the critiquing should be applied, whereas the medical knowledge described the medical facts that are used during the application of the critiquing tasks.

In the first step of the critiquing process, the system scanned the medical record to detect pertinent events that invoke the application of the critiquing tasks. These events included several types of drug administration-related actions, such as starting, increasing, decreasing, and stopping drugs. When the system detects the pertinent events, it invokes the relevant task to critique the medical treatment.


The critiquing knowledge includes the representation of four types of critiquing tasks: (1) Preparation Tasks describe the critiquing processes performed when the system detects the initiation of a drug treatment, and include the detection of missing workup requirements and notification of missing baseline measurements required for the future monitoring of possible side effects of the drug, which were retrieved from the medical-facts knowledge; (2) Selection Tasks describe the critiquing processes performed for critiquing the physician's action selection, and include detection of contraindications on drug initiation, initiation of drugs without appropriate records of the indications, validation of the drug doses, validation of dose increments, detection of drug increments performed too early, detection of unrecompensed combinations of medications, and detection of inappropriate decreases of drug dosage, e.g., when the blood pressure was still elevated; (3) Monitoring Tasks describe the detection of absent monitoring actions (i.e., clinical tests and measurements) that are required, according to the medical-facts knowledge base, during the administration of the drugs; and (4) Responding Tasks describe the critiquing processes performed for detecting known possible side effects of a drug that occur during the treatment, alerting the physician to possible side effects when changing drugs during the treatment, and detecting when the time between patient visits exceeds the recommended range.
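The event-driven control structure described above can be sketched as follows; this is a schematic illustration of the idea, with invented event types and task names, and not HyperCritic's actual implementation.

```python
# Schematic sketch of event-driven critiquing: pertinent drug events in the record
# trigger the corresponding critiquing tasks (event and task names are illustrative).
CRITIQUING_TASKS = {
    "start_drug": ["preparation", "selection"],
    "increase_drug": ["selection"],
    "decrease_drug": ["selection"],
    "stop_drug": ["responding"],
}


def critique(medical_record):
    comments = []
    for event in medical_record:  # scan the record for pertinent events
        for task in CRITIQUING_TASKS.get(event["type"], []):
            comments.append(f"Apply {task} task to {event['drug']} at {event['time']}")
    return comments


record = [{"type": "start_drug", "drug": "metformin", "time": "2013-02-01"}]
print(critique(record))
```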

The HyperCritic system was evaluated in a clinical trial [van der Lei et al. 1991] in which the critiquing comments of the system on the clinical decisions in 240 visits of 20 randomly selected patients were compared to the comments of a panel of eight physicians who examined the same records. Using an “index of merit” measure that combines the sensitivity and specificity of each individual reviewer, the system was found to perform better than the physicians who examined the same patient visits.

In a later study [Kuilboer et al. 2003], the HyperCritic critiquing model was re-implemented in the asthma and COPD domain by a new system called AsthmaCritic. The new implementation included several improvements: increasing the use of the time-stamped data, allowing physicians to control the system’s output, and implementing additional functions for better integration of the system with the operational electronic medical records, such as an ability to run in the background and to attach the system’s critiques to the patient’s medical record. The design of the AsthmaCritic system emphasized several needs: the need to integrate the critiques generated by the system into the operational electronic medical records used daily by the physicians, in order to provide them with a consistent work environment; and the need to regard the physicians as professionals and to provide them with critiquing comments only when detecting problems with high clinical impact.

HyperCritic was a pioneering system developed for medical critiquing; thus, the system developed in my research implements the same general idea. There is a certain similarity between the analysis processes performed by HyperCritic, AsthmaCritic, and my system, in that all of these systems scan the raw patient data and, when certain events are detected, apply the relevant analysis mechanisms to provide comments regarding the provided treatment. However, in my study I used a more advanced model for the knowledge representation, assigned more weight to the temporal aspects of the guidelines, and, to address the ambiguity and uncertainty of the knowledge and data, considered the intentions of each plan in the guideline in addition to its direct recommendations.

Plan Recognition and Revision in Support of Guideline-Based Care

In the paper of [Shahar, Musen 1995], the authors present and emphasize the unique problems and requirements involved in developing systems that support the application of clinical guidelines and protocols, and demonstrate the need for plan recognition by an automatic planner, the need for a mechanism of plan revision, and the importance of temporal reasoning about goals and actions in a time-oriented domain. The views presented in this paper address several problems in the development and adoption of clinical decision-support systems in the medical domain, such as the ambiguity and incompleteness of the clinical guidelines, the flexibility of physicians in their application, and the tendency of physicians to ignore systems that provide too many straightforward recommendations.

The authors stated that clinical guidelines involve clinical intentions that represent the goals of the therapeutic actions, and that these intentions should be represented along with all other knowledge elements, such as eligibility conditions and specific actions. These intentions can serve the automated planners in understanding the higher-level goals that the physicians are attempting to follow, by supporting computational algorithms that are not constrained to explain the given treatment only by examining the lower-level actions recorded in the patients’ data.


Care providers are able to revise or override the recommendations of known guidelines, for example, when expected or unexpected complications occur. For example, when a side effect of a medication is not tolerable by a patient, the physician can change the treatment to include a lower dose of the medication along with deeper lifestyle modifications. In order to support the care providers with automated planners, or to correctly assess their compliance with the treatment, the planner must have the ability to revise the guidelines in the same manner. For that, the planner should have access to a library of plans, each of which contains knowledge about its effects and intentions. Given that, the planner can understand the intention of the physicians when they modify its suggestions, and avoid making erroneous interpretations.

The suggested revision mechanisms include different revision actions, such as suspending an event proposition until some condition holds, clipping (i.e., finishing) some event when some other event or condition holds, starting a plan when a certain condition holds, replacing one event with another of a similar type (i.e., with the same clinical intention), and adding new events that may contradict the effect of some offending event.

To be able to support intelligent interpretation of the data involved in the medical treatment, the automatic planner should be able to interpret the time-oriented data. The knowledge used by the planner should include a representation of the temporal aspects of the concepts, and the planner should include mechanisms for applying these temporal aspects when interpreting the patient's data.

In my research, I tried to address the aspects that Shahar and Musen presented in their paper. The system I developed addresses, to a certain extent, the flexibility afforded to physicians, who can revise the medical actions when applying clinical guidelines. This flexibility to revise the clinical actions is addressed by the system in three main ways: (1) the evaluation of the compliance with the plan’s outcome is performed in parallel with, and separately from, the evaluation of the compliance with the process; thus, even if the system does not recognize the revised actions as related to the guideline, it can still recognize the compliance with the outcome; (2) by using fuzzy-logic techniques to evaluate the patients, the system can understand cases in which physicians decided to start or to stop certain actions when the patient's state was close to a state mentioned by the guideline that supports the physician’s actions, even if that state did not completely fulfill the logical constraint mentioned in the guideline; and (3) by utilizing the semantic structures of the controlled medical vocabularies that exist in the knowledge base, the system can understand when a certain treatment was revised with alternatives, for example, different medications from the same group. In addition, as emphasized by the authors, the system I developed focuses on interpreting time-oriented data, and uses both the Asbru guideline-specification language and the KBTA ontology for domain-knowledge representation, two knowledge-representation models that include strong support for representing the temporal aspects of the knowledge.
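Point (2) above can be illustrated with a small sketch of a fuzzified threshold constraint: instead of a crisp true/false answer to "the blood pressure is elevated", the constraint yields a graded truth value, so an action taken when the patient is close to the threshold can still be recognized as partially supported. The thresholds and margin below are invented examples, not the actual DiscovErr fuzzification functions.

```python
# Illustrative fuzzification of a threshold constraint (values are examples only).
def fuzzy_above(value: float, threshold: float, margin: float) -> float:
    """Return 0.0 well below the threshold, 1.0 well above it, and a linear ramp between."""
    if value <= threshold - margin:
        return 0.0
    if value >= threshold + margin:
        return 1.0
    return (value - (threshold - margin)) / (2 * margin)


# "Systolic blood pressure is elevated (>= 140)" with a 10 mmHg fuzzy margin:
print(fuzzy_above(136, threshold=140, margin=10))  # 0.3 -> partially supports the action
print(fuzzy_above(152, threshold=140, margin=10))  # 1.0 -> fully supports the action
```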

Plan Recognition and Evaluation for On-line Critiquing

In the study of [Gertner 1997], the author developed Trauma-TIQ, a critiquing system based on the Trauma-AID decision-support system. Trauma-TIQ receives the actions that the physician is just about to carry out, and produces a critique in response to these intended actions. Trauma-TIQ comprises two main components: a plan recognizer that uses the context of the case to disambiguate plans, and a plan evaluator that identifies possible clinical errors and calculates their significance in order to determine an appropriate response. The system evaluates the physician’s proposed plan and attempts to intervene before a problem occurs. The critiquing approach avoids the problem in which physicians are reluctant to use systems for consultation, by intervening only when the system finds significant gaps between the physician's proposal and the system's proposal. This type of system, which also includes the HyperCritic and AsthmaCritic systems described earlier, can be seen as assisting the user rather than presenting a contrary solution, and is less intrusive because it produces comments only when significant problems are detected [Miller 1986].

The underlying knowledge-representation model used by Trauma-TIQ is based on the knowledge of Trauma-AID 2.0 [Clarke et al. 1993]. This knowledge was represented by treatment Goals (e.g., Treat Thoracic Esophageal Injury), Procedures (e.g., Perform Upper Esophagus Repair), and Actions (e.g., Thoracotomy). Each goal can be applicable to multiple procedures, and each action, represented within an ordered sequence of actions, can belong to multiple procedures.

The plan recognition algorithm tries to explain each action by enumerating a set of possible explanations, each consisting of a path that exists in the knowledge base from the action, through its corresponding procedures to the procedures’ goals. The possible explanations are classified by their relevance to the current situation, and by the


strength of evidence that the physician was pursuing the goal, determined by additional performed actions that support the same goal. The explanation with the most relevant goal and the strongest support from the actions is then chosen.

The plan evaluation algorithm tries to detect possible errors. The algorithm uses an outcome-based taxonomy of errors, which includes details about potential types of errors, such as omission of actions, commission of actions, wrong procedure choice, and wrong scheduling. To determine how each error should be handled by the critiquing system interface, the algorithm also determines the significance of each error (Tolerable, Non-Critical, Critical). This is done according to knowledge about the clinical significance of various types of errors, which was elicited from a group of four surgeons. The output of the plan evaluation process is a set of free text comments generated by the system to the users.

This research presents a method for critiquing treatment actions by an algorithm that combines both plan recognition and evaluation. The critiquing approach is used by the system to improve the way an expert system interacts with its users, and improves the acceptance of the decision support that was previously provided by the Trauma-AID system.

The system I developed in my research shares many ideas with the Trauma-TIQ system, from performing retrospective assessment in a manner that eliminates the need for online interaction with the system, through combining the two computational steps of plan detection and plan-quality evaluation, to the provision of comments regarding gaps between the guidelines and the actual treatment (the planned treatment, in the case of Trauma-TIQ). The taxonomy of comments that was presented in the Trauma-TIQ study assisted me in designing the compliance-analysis algorithm. However, in my study I developed a system with a new compliance-assessment algorithm that uses a temporal-reasoning engine to analyze data over long time periods, and applies fuzzy-logic techniques to address the uncertainty of the patients’ medical data and the ambiguity of the clinical guidelines.

Medical Quality Assessment by Scoring Adherence to Guideline Intentions

In their paper [Advani et al. 2002], the authors described an approach for evaluating and consistently scoring clinician adherence to medical guidelines using the intentions


of guideline authors. This paper describes MedCritiq, a system for retrospective quality assessment, presented together with the Quality Indicator Language (QUIL). This language is used by the system to formally specify quality constraints on physician behavior and patient outcomes, derived from medical guidelines. QUIL combines several knowledge roles of the Asbru language, and allows specifying guidelines for the task of quality assessment. The language uses temporal constraints that are built up from complex patterns of temporal events and non-temporal parameter propositions defined by similar QUIL axioms. The intentions in QUIL are organized in a hierarchy, in which higher-level intentions are composed of more detailed intentions, with logical operators that describe the method of computation of the adherence algorithm. Each intention element is assigned a weight that defines how adherence to this intention is prioritized over other intentions in the guideline. QUIL defines two types of intentions: behavioral intentions, which express the medical actions and are used for plan revision, and state intentions, which express the goals to be achieved in the patient's state.

The guideline-adherence scoring algorithm normalizes the weights of each intention so that any deviation is scored consistently. To compute the score of a high-level intention, the algorithm calculates the level of matching of the data in the patient record to the atomic sub-elements of the intention, and then propagates these scores up the hierarchy using the logical operators that comprise the high-level intention. The algorithm uses a penalty mechanism to reduce the score of an intention in cases of plan revisions.
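To make the propagation step concrete, the following minimal sketch (in Python) computes a weight-normalized adherence score for a high-level intention from the matching scores of its atomic sub-elements. The intention names, weights, and data structures are hypothetical illustrations of the general idea, not the actual QUIL or MedCritiq implementation.

```python
# A minimal, hypothetical sketch of weight-normalized adherence scoring
# for a hierarchy of intentions; not the actual QUIL/MedCritiq code.

def score_intention(intention, patient_match):
    """Return an adherence score in [0, 1] for an intention node.

    intention: dict with either a 'name' (atomic sub-element) or an
               'operator' ('AND'/'OR') and weighted 'children'.
    patient_match: dict mapping atomic intention names to a matching
                   level in [0, 1] computed from the patient record.
    """
    if 'name' in intention:                      # atomic sub-element
        return patient_match.get(intention['name'], 0.0)

    children = intention['children']
    total_weight = sum(w for w, _ in children) or 1.0
    # Normalize the weights so deviations are scored consistently.
    weighted = [(w / total_weight, score_intention(c, patient_match))
                for w, c in children]
    if intention['operator'] == 'AND':
        return sum(w * s for w, s in weighted)   # weighted conjunction
    else:                                        # 'OR': best alternative
        return max(s for _, s in weighted)


# Hypothetical example: a state intention composed of two sub-intentions.
glycemic_control = {
    'operator': 'AND',
    'children': [(2.0, {'name': 'hba1c_below_target'}),
                 (1.0, {'name': 'fasting_glucose_below_target'})],
}
print(score_intention(glycemic_control,
                      {'hba1c_below_target': 0.8,
                       'fasting_glucose_below_target': 1.0}))  # ~0.87
```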

In my study, I implemented the idea suggested in their paper of scoring the adherence to the guideline intentions and, as suggested, separated the evaluation of compliance to the process from the evaluation of compliance to the guidelines’ intentions.

Applying Intention-Based Guidelines for Critiquing

The work of [Sips et al. 2006] presented an algorithm for an intention-based matching process and its evaluation on 12 cases of hyperbilirubinemia. The algorithm was designed to use the intentions of clinical guidelines, represented in the Asbru language, for matching between the actual medical treatments and the clinical guideline. The actual treatment is represented as a sequence of clinical actions and is compared by the


algorithm to all possible execution sequences of the clinical guideline, which are generated automatically by the algorithm from the formal representation of the guideline. The algorithm uses a distance function that calculates the number of actions in the patient record that are matched by the execution sequence, minus the number of unmatched actions in the execution sequence that occur before the last matched action in the execution sequence. The algorithm ignores the order of the plans in each sequence, to support situations in which the medical actions are not entered into the patient record exactly in the order they were performed.
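A minimal sketch of such a matched-minus-unmatched scoring function, under my reading of the paper's description, is shown below (Python); representing actions as plain strings and the example action names are simplifying assumptions for illustration only.

```python
# Hypothetical sketch of the matched-minus-unmatched distance described above;
# actions are simplified to strings, ignoring their order within the record.

def sequence_score(record_actions, execution_sequence):
    """Score a guideline execution sequence against the actions in a record."""
    record = set(record_actions)
    matched = [a for a in execution_sequence if a in record]
    if not matched:
        return 0
    # Index of the last matched action within the execution sequence.
    last_matched_idx = max(i for i, a in enumerate(execution_sequence)
                           if a in record)
    unmatched_before_last = sum(1 for a in execution_sequence[:last_matched_idx]
                                if a not in record)
    return len(matched) - unmatched_before_last


# The best-matching "legal sequence" of the guideline is the one with
# the highest score for the observed actions (example data is made up).
sequences = [['bilirubin_test', 'phototherapy', 'exchange_transfusion'],
             ['phototherapy']]
observed = ['bilirubin_test', 'phototherapy']
print(max(sequences, key=lambda s: sequence_score(observed, s)))
```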

For its evaluation, the algorithm was tested on an overall number of 12 patients, divided into two groups from different medical locations (5 vs. 7 patients), treated by a pediatrician familiar with the guideline and a pediatrician not familiar with the guideline, respectively. The results suggested, without statistical significance, that the patients treated by a physician familiar with the guideline had a higher match to the guideline sequence.

The study by Sips et al. showed the feasibility of using the intentions of a clinical guideline for matching patient data to “legal sequences” of actions in the guideline, by using a relatively straightforward function to measure the distance between pairs of clinical-action sequences (the actual actions in the observed data versus the guideline-based recommended actions). The score for matching a single action could only be 1 or 0 (i.e., exists or does not exist), and the algorithm did not use the expressive temporal annotations of the Asbru language. The intentions of the guideline were mainly used to revise the atomic actions, to support cases in which a clinician modifies the guideline with alternative actions (e.g., prescribing a different drug from the same group).

The study was also described later by the same authors [Sips et al. 2008] with a deeper discussion of its relation to the Beliefs, Desires, and Intentions (BDI) approach [Rao and Georgeff 1995].

Similarly to the algorithm suggested in their study, the compliance-analysis algorithm developed in my research uses the Asbru language to represent the clinical guidelines, and implements similar logic that uses the process intentions to detect plans that were (legally) revised. However, the algorithm that I developed uses Asbru’s representation of the guidelines’ temporal aspects in a more extensive manner, and uses a knowledge-based temporal-reasoning engine that applies fuzzy logic to address cases with a partial match between the patient state and the guideline conditions.


Knowledge-Based Recognition of Clinical-Guideline Application in Time-Oriented Medical Records

In her research, [Boldo 2007] presented a framework called IGR (Intelligent Guideline Recognition), which determines by which guideline the patient was most likely to have been treated, and determines the degree of adherence to the guideline specification. This framework was designed to support both runtime quality assurance and retrospective quality assessment. The plan-recognition problem was approached in this study as a classification problem. To represent the clinical guidelines, the author developed an abstract knowledge representation format for clinical guidelines as vectors of attributes (characteristic vectors), which was formed according to concepts of the Asbru language. The framework was evaluated using a computational tool that was developed for that purpose, applied to a set of 773 records of hypertensive patients. The formal representation of several guidelines for the management of hypertension was achieved with the assistance of a clinical expert. The expert also took part in the evaluation, in which a manual quality assessment of multiple patient records, conducted by that expert, was compared to the automatic assessment conducted by the framework. A significant correlation was found between the expert’s judgment and the framework's score.

The IGR system’s architecture included four main modules: (1) a knowledge library, storing the guidelines represented as characteristic vectors that were created in the guideline-specification process; (2) a computational module that compares the patient records to the characteristic vectors; (3) a classification module that determines which of the known guidelines is detected; and (4) a reporting system that provides the system's output.

The guideline characteristic vector includes a sequence of abstraction functions with temporal relations. Each abstraction function is composed of weights and membership functions. The abstraction functions describe the relations between the data items in the patient records, both value relations and temporal relations. The membership function uses the weights of multiple sub-levels of higher-level concepts to calculate an overall score for the match (or distance) between the patient data and these concepts. The default membership functions were linear, triangular, and trapezoid functions. The membership functions, which were created and modified in collaboration with the expert, included


constraints on point-value attributes, constraints on interval-value attributes over single or continuous parameters, and constraints involving several attributes.

The matching algorithm consists of two phases: the first performs feature abstraction from a given electronic record, and the second computes a score for the matching to the specified guidelines. The algorithm uses a penalty mechanism to consider cases where a deviation from a guideline is detected, but the treatment still followed the intentions of the guideline (e.g., using an alternative treatment). Penalties were specified according to the judgment of the clinical expert.

The JNC6 and JNC7 hypertension-management guidelines were used for the system's evaluation. These guidelines were specified in the IGR formal model and applied to 773 patient records from 16 medical centers. 487 patient records were relevant for initial drug therapy, which was selected as the focus of the evaluation. 38 records were selected randomly and were assigned a matching level to both the JNC6 and JNC7 guidelines by the clinical expert. The results of the expert were compared to those of the system. The correlation between the scores of the expert and the system was found to be significant for both the JNC6 and JNC7 guidelines. The evaluation also tested whether a difference exists between the two guidelines. The results showed a significant difference between the levels of compliance to each of these two guidelines, manifested both in the assessments of the clinical expert and in the assessments of the IGR system.

The methodology presented by the IGR framework uses a formal representation of clinical guidelines, based on the ideas of the Asbru language, to support plan recognition and assessment of the level of matching between patient records and the treatment recommended by clinical guidelines. The research presented a new idea of using fuzzy-logic techniques for interpreting the patient record and performing the data abstraction. The evaluation results support the idea of developing systems for automatic quality assessment, which can be performed in a fast and accurate manner on large sets of multiple patients.

In my research, I adopted the idea, first implemented in the IGR framework, of using fuzzy-logic techniques to evaluate the state of the patient according to the guideline knowledge. However, in contrast to IGR, the system I developed uses a full representation of the guidelines according to the Asbru ontology, which is extended with a representation of declarative concepts according to the knowledge-based temporal-


abstraction ontology. In addition, by using this knowledge, the compliance-analysis algorithm developed in my research addresses the long-term application of the guidelines over time, and extends the plan-recognition functionality by providing multiple types of comments regarding the compliance to the guidelines.

Using Model Checking for Critiquing Based on Clinical Guidelines

In their paper [Groot et al. 2008], the authors proposed a computational method for critiquing, which employs model checking to investigate whether the actual treatment is consistent with the guideline. The model-checking technique was applied by using temporal-logic formulas to represent the actual patient state and treatment, and a state-transition system to describe the guidelines. This technique was used to detect several types of non-compliance, which were organized by the authors into two categories: the first, called the ‘T’ type, focuses on the difference between the actual treatment and the treatment suggested by the guideline; the second, called the ‘F’ type, focuses on the inconsistency of the medical findings with the actions in the actual treatment. The ‘T’ type of non-compliance includes actions that are not supported by the patient findings, conflicting actions that are contradicted by the patient findings, a non-compliant order of actions, and missing mandatory actions; the ‘F’ type includes missing relevant findings for an action, wrong findings for an action, non-compliant findings for the order of actions, and redundant irrelevant findings.

In their suggested method, the properties of the guidelines are specified using both computation tree logic (CTL) and linear temporal logic (LTL). CTL uses atomic propositions, propositional connectives, path quantifiers, and temporal operators for describing properties of computation trees; in contrast, LTL provides operators for describing events along a single computation path. The method used the Asbru guideline-specification language to model the guidelines, and the Asbru model was translated into a state-transition system, in which each path is considered a treatment that is consistent with the guideline’s recommendations.
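For intuition, a property of the kind such a method can check might be written as follows, stating that whenever a certain finding holds, a guideline-recommended action eventually follows; the proposition names are illustrative and are not taken from the cited paper.

```latex
\[
\textbf{CTL: } \mathbf{AG}\,\bigl(\mathit{finding\_present} \rightarrow \mathbf{AF}\,\mathit{recommended\_action}\bigr),
\qquad
\textbf{LTL: } \mathbf{G}\,\bigl(\mathit{finding\_present} \rightarrow \mathbf{F}\,\mathit{recommended\_action}\bigr).
\]
```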

The method used both direct and indirect critiquing. In the direct critiquing, the model checker applies the CTL and LTL formulas directly to the patient data; in the indirect critiquing, satisfaction sets are generated from the formal representation of the guideline, and the critiquing is performed by applying these satisfaction sets to the patient’s data.


To demonstrate the model-checking-based critiquing method, a breast-cancer guideline was represented as a state-transition system, and prototypical cases were created by medical experts who examined a breast-cancer patient registry. To overcome problems of data granularity, these prototypical cases were represented according to the terminology of the formal guideline. The direct critiquing was demonstrated in two case studies, of different sections of the guideline, by applying the model checker to these prototypical cases; the indirect critiquing was demonstrated by applying the satisfaction sets to fictive patient data. Both critiquing methods are capable of detecting each of the types of non-compliance; the case studies demonstrated the direct and indirect methods detecting different types of non-compliance.

In a comparison between the two computational methods, the indirect method was shown to have the advantage that part of the computational task, namely creating the satisfaction sets, can be performed “off-line” before applying the critiquing to the patients, a fact that can be used to reduce the computation time needed for examining the patient record.

Similarly to the methods suggested in their research, the system developed in my research uses Asbru to represent the formal guidelines, and performs the compliance analysis by applying the knowledge to the patient data. However, in my research the system uses the Asbru-based representation of the guideline directly, and does not translate the guidelines into a logic-based state-transition system. In my research, I utilized the temporal annotations available in the Asbru representation to better examine the compliance to the temporal aspects of the guideline, and to evaluate the actions’ scheduling in addition to their order. In addition, I used fuzzy-logic techniques that support the detection of cases with partial compliance to the guidelines, a functionality that extends the capabilities of the logic-based model checking.

Computerized Guidelines Implementation: Obtaining Feedback for Revision of Guidelines, Clinical Data Model and Data Flow

In their paper [Panzarasa et al. 2007], which is also described in [Quaglini 2008], the authors describe a module for the analysis of the reasons for physicians' non-compliance to a clinical guideline. The module is called RoMA (Reasoning on Medical Actions), and was developed on top of an existing care-flow management system (CfMS) that was


designed to support the physicians in a stroke unit in applying a stroke clinical guideline. The existing CfMS was implemented using the Oracle Workflow™ engine [Panzarasa et al. 2006], by structuring the guideline’s clinical flow as a workflow and applying it using this engine, on top of the stroke registry [Micieli et al. 2010].

The motivation for developing the RoMA module was to enable analysis of the reasons for non-compliance, in order to provide organizational feedback for the healthcare administrators. RoMA interacts with the physicians at the patient’s discharge, while they are preparing the summary of the patient’s stay in the hospital. The system lists the recommendations the patient was eligible for (according to the rules defined in the CfMS), and highlights the recommendations for which there was non-compliance. For each of the non-compliant recommendations, the physician could use the system’s interface to provide a motivation for the non-compliance.

The motivations for non-compliance were organized in a taxonomy with five general top-level categories: Organizational, Technical, Medical, Patient-Related Problems, and Erroneous Recommendation. The taxonomy was designed in order to communicate the feedback to the relevant people; for example, Technical motivations (e.g., a medical device that was not working) can be sent to the technical staff of the hospital, whereas Erroneous Recommendation can be sent to those who are in charge of the guideline’s implementation. The investigation of the motivations for non-compliance included an analysis of the motivations by their various types. The top reason for non-compliance (41%) was Erroneous Recommendations by the CfMS itself, due to incomplete data or an incomplete representation of the guideline. The other motivations were distributed almost evenly among Technical Problems, Medical Problems, and Patient-Related Problems. The findings of this analysis were then used to improve the implemented rules of the guideline and the integration between the CfMS and the componentized clinical-chart software used in the stroke unit.

Their research emphasizes the need for a structured collection of reasons for non-compliance when implementing a guideline-based decision-support system, which is extremely important for understanding the bottlenecks and improving guideline implementation. However, in comparison to the system developed in my research, RoMA does not focus on retrospective analysis of patient data to detect the non-compliances themselves.


Comparison between the Critiquing and Clinical Guideline Recognition Systems

The tables provided in this section summarize the previous work and compare the critiquing and plan recognition systems that were developed in the medical field.

Table 1 provides a description of the systems' goals and algorithms, and compares the systems by the following four criteria: (1) Does the system aim to provide specific comments regarding multiple types of compliance and non-compliance to the guidelines, or does it only provide a numeric matching score for the overall guideline? (2) Does the algorithm truly handle the temporal aspects of the guidelines and data? (3) Does the algorithm try to handle legitimate deviations from the guidelines, and does it differentiate between decisions and actions that are close to the original recommendations and those that are completely different? (4) Does the system include a module that graphically visualizes and explains the data that led to its decisions?

Table 1. Comparing the goals and algorithms of the critiquing and plan recognition systems

HyperCritic / AsthmaCritic. Approach: intention-based matching process between clinical actions and guidelines. Algorithm: automated application of four types of predefined “critiquing knowledge”, each of which defines several specific critiquing processes, for reporting of absence of preparations, presence of criteria, absence of monitoring, and presence of situations that require a response. Multiple types of compliances and non-compliances: Yes; handles temporal aspects: Yes; handles legitimate deviations from guidelines: No; graphical visualization and explanation of the comments: No.

Trauma-TIQ. Approach: on-line critiquing for assisting the physicians during the management of patients with severe injuries. Algorithm: interpretation of the physician’s orders in terms of their underlying goals (plan recognition), evaluation of the inferred plan structure by comparing it to Trauma-AID’s recommended plan (plan evaluation), and generation of English critiques addressing elements of the physician’s plans that were detected as having potential problems (critique generation). Multiple types of compliances and non-compliances: Yes; handles temporal aspects: No; handles legitimate deviations from guidelines: uses a "disutility function" to decide whether to prompt a comment; graphical visualization and explanation of the comments: No.

MedCritiq / QUIL. Approach: evaluating and consistently scoring clinician adherence to medical guidelines using the intentions of guideline authors. Algorithm: computation of a score representing the adherence to high-level intentions, by calculation of the level of matching of the data in the patient record to the atomic sub-elements of the intentions, and then propagation of these scores across the hierarchy using the logical operators that comprise the high-level intention; a penalty mechanism reduces the score of an intention in cases of plan revisions. Multiple types of compliances and non-compliances: No; handles temporal aspects: Yes; handles legitimate deviations from guidelines: No; graphical visualization and explanation of the comments: No.

Intention-Based Guidelines for Critiquing. Approach: intention-based algorithm for matching between clinical actions and guidelines. Algorithm: representation of the actual treatment as a sequence of clinical actions and comparison to all possible execution sequences of the clinical guideline, which are generated automatically from the formal representation of the guideline; use of a distance function that calculates the number of actions in the patient record that are matched by the execution sequence minus the number of unmatched actions in the execution sequence that occur before the last matched action; the algorithm ignores the order of the plan in each sequence to support situations where the medical actions are not entered into the patient record exactly in the order they were performed. Multiple types of compliances and non-compliances: No; handles temporal aspects: No; handles legitimate deviations from guidelines: No; graphical visualization and explanation of the comments: No.

IGR. Approach: determination of the guideline by which the patient was most likely to have been treated, and of the degree of adherence to the guideline specification. Algorithm: a matching algorithm that consists of two phases; the first performs feature abstraction from a given electronic record, and the second computes a score for the matching to the specified guidelines; a penalty mechanism considers cases where a deviation from a guideline is detected but the treatment still followed the intentions of the guidelines, with penalties specified according to the judgment of a clinical expert. Multiple types of compliances and non-compliances: No; handles temporal aspects: No; handles legitimate deviations from guidelines: Yes; graphical visualization and explanation of the comments: No.

Model Checking for Critiquing. Approach: a computational method for critiquing, which employs model checking to investigate whether the actual treatment is consistent with the guideline. Algorithm: use of both direct and indirect critiquing; in the direct critiquing, the model checker applies the CTL and LTL formulas directly to the patient data; in the indirect critiquing, satisfaction sets are generated from the formal representation of the guideline and applied to the patient’s data. Multiple types of compliances and non-compliances: Yes; handles temporal aspects: Partial (temporal logic); handles legitimate deviations from guidelines: No; graphical visualization and explanation of the comments: No.

DiscovErr. Approach: a system for guideline-based critiquing and quality assessment over long time periods, according to a well-defined, formal representation of the guidelines. Algorithm: an algorithm that combines two computational approaches, top-down and bottom-up; the algorithm consists of several sequential steps, in which each step can directly analyze the raw patient data or use the outputs calculated in previous steps of the algorithm. Multiple types of compliances and non-compliances: Yes; handles temporal aspects: Yes; handles legitimate deviations from guidelines: Yes; graphical visualization and explanation of the comments: Yes.


Table 2 provides a description of the underlying knowledge of the systems, and compares the systems by the following three criteria: (1) Is the knowledge-modeling language a standard language used by other systems and groups? (2) Does the system include a knowledge-specification tool for acquiring and maintaining the knowledge? (3) Does the knowledge model use standard medical vocabularies to represent the knowledge items, and does it exploit their structure?

Table 2. Comparing the underlying knowledge of the critiquing and plan recognition systems

HyperCritic / AsthmaCritic. Underlying knowledge: an object-oriented model with knowledge about drugs, drug classes, contraindications, side effects, and recommended side-effect-related monitoring actions. Standard language: No; includes a knowledge-specification tool: No; exploits standard vocabularies: No.

Trauma-TIQ. Underlying knowledge: based on the knowledge of Trauma-AID; this knowledge was represented as a "plan graph" with treatment Goals (e.g., Treat Thoracic Esophageal Injury), Procedures (e.g., Perform Upper Esophagus Repair), and Actions (e.g., Thoracotomy); each goal can be applicable to multiple procedures, and each action, represented in an ordered sequence of actions, can belong to multiple procedures. Standard language: No; includes a knowledge-specification tool: No; exploits standard vocabularies: No.

MedCritiq / QUIL. Underlying knowledge: the Quality Indicator Language (QUIL), used to formally specify quality constraints on physician behavior and patient outcomes derived from medical guidelines; QUIL combines several knowledge roles of the Asbru language and allows specifying guidelines for the task of quality assessment; the language uses temporal constraints that are built up from complex patterns of temporal events and non-temporal parameter propositions defined by similar QUIL axioms. Standard language: No; includes a knowledge-specification tool: No; exploits standard vocabularies: No.

Intention-Based Guidelines for Critiquing. Underlying knowledge: clinical guidelines represented in Asbru, using the Merck Manual reference to translate clinical actions to low-level intentions, i.e., mapping prescriptions to medication groups. Standard language: Yes; includes a knowledge-specification tool: No; exploits standard vocabularies: Yes.

IGR. Underlying knowledge: an abstract knowledge representation format for clinical guidelines as vectors of attributes (characteristic vectors), formed according to concepts of the Asbru language. Standard language: No; includes a knowledge-specification tool: No; exploits standard vocabularies: No.

Model Checking for Critiquing. Underlying knowledge: a modal-logic approach, where properties of guidelines are specified using both computation tree logic (CTL) and linear temporal logic (LTL), either on their own or using modular model checking. Standard language: No; includes a knowledge-specification tool: No; exploits standard vocabularies: No.

DiscovErr. Underlying knowledge: a knowledge model that comprises three integrated schemas: a schema for procedural knowledge of clinical guidelines, based on Asbru; a schema for declarative medical domain knowledge, according to KBTA; and a schema for the representation of atomic clinical steps referencing terms from controlled medical vocabularies. Standard language: Yes; includes a knowledge-specification tool: Yes; exploits standard vocabularies: Yes.


Table 3 provides a description of the evaluation of the systems, with details about the medical domains, the number of participating medical experts, and the number of patient cases.

Table 3. Comparing the evaluation of the critiquing and plan recognition systems

HyperCritic / AsthmaCritic. Evaluation: 240 visits of 20 randomly selected patients were compared to the comments of a panel of eight physicians who examined the same records; using an “index of merit” measure that combines the sensitivity and specificity of each individual reviewer, the system was found to perform better than the physicians who examined the same patient visits. Medical domain of evaluation: Hypertension / Asthma; medical experts in evaluation: 8; patients in evaluation: 20.

Trauma-TIQ. Evaluation: evaluation of the plan-recognition algorithm by applying it to 97 actual cases and analyzing the number of actions in which the system detected a treatment goal; the evaluation included complexity analysis and run-time measures (involvement of medical experts was not mentioned). Medical domain of evaluation: Trauma; medical experts in evaluation: 0; patients in evaluation: 97.

MedCritiq / QUIL. Evaluation: none. Medical domain of evaluation: none; medical experts in evaluation: 0; patients in evaluation: 0.

Intention-Based Guidelines for Critiquing. Evaluation: evaluation of a prototype implementation of the algorithm, based on the data of 12 patients, divided into two groups from different medical locations (5 vs. 7 patients), treated by a pediatrician familiar with the guideline and a pediatrician not familiar with the guideline; the results suggested, without significance, that the patients treated by a physician familiar with the guideline had a higher match to the guideline sequence (involvement of medical experts was not mentioned). Medical domain of evaluation: Jaundice; medical experts in evaluation: 0; patients in evaluation: 12.

IGR. Evaluation: 38 patient records that were selected randomly and assigned a matching score to both the JNC6 and JNC7 guidelines by the clinical expert; the results of the system were compared to the judgments of the expert. Medical domain of evaluation: Hypertension; medical experts in evaluation: 1; patients in evaluation: 38.

Model Checking for Critiquing. Evaluation: a case study in which two scenarios of a breast-cancer guideline were represented as a state-transition system, and prototypical cases were created by medical experts who examined a breast-cancer patient registry. Medical domain of evaluation: Breast Cancer; medical experts / patients in evaluation: ?.

DiscovErr. Evaluation: application of the system to real patient data of 10 patients, and comparison of the critique generated by the system to comments made by three expert internists (two of whom are diabetes-therapy experts, and the third a family medicine expert) who reviewed the original clinical records as well as the system's comments. Medical domain of evaluation: Diabetes; medical experts in evaluation: 3; patients in evaluation: 10.


2.4 Fuzzy Sets and Fuzzy Logic

In the following section, I provide a short background on the theory of fuzzy sets and fuzzy logic, and explain why these methods were chosen to be implemented as part of the compliance-analysis algorithm.

A fuzzy set is a class of objects in which each object is characterized by a continuous grade of membership in the set [Zadeh 1965]. In contrast to binary logic, where variables represent true or false values, fuzzy logic deals with situations in which variables can be assigned a partial truth that may range between completely true and completely false. The motivation behind this theory is to deal with situations of the real world, where a class does not have a precise definition of the criteria of membership in the class. Some classical real-world examples include classes such as ‘old man’, ‘tall people’, or ‘cold weather’, which may be interpreted differently by different people. In fuzzy set theory, an object can belong to multiple classes at the same time and is assigned a grade of membership in each of these classes. The grade is computed using a membership function. A membership function is a function from a parameter value x to the real interval [0,1], where 1 represents full membership of the value in the fuzzy set, and 0 represents no membership in the set.

Fuzzy set theory defines fuzzy operators that can be applied to fuzzy sets. The Boolean logic operators AND, OR, and NOT exist in fuzzy logic, and are usually defined as the minimum, maximum, and complement, respectively; when they are defined this way, they are called the Zadeh operators. A formal definition of the Zadeh operators for fuzzy variables x and y:

NOT(x) = (1 - truth(x))

x AND y = minimum(truth(x), truth(y))

x OR y = maximum(truth(x), truth(y))
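As a concrete, simplified illustration of a membership function and the Zadeh operators, consider the following Python sketch; it is my own illustration rather than part of the DiscovErr implementation, and the ‘young’ and ‘tall’ membership parameters are arbitrary.

```python
# Illustrative sketch of a trapezoidal membership function and the Zadeh
# operators; parameter values are arbitrary and for demonstration only.

def trapezoid(x, a, b, c, d):
    """Membership in [0, 1]: rises from a to b, full between b and c, falls to d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def f_not(x):
    return 1.0 - x

def f_and(x, y):
    return min(x, y)

def f_or(x, y):
    return max(x, y)

# Degree to which a 28-year-old, 1.78 m tall person is "young OR tall".
young = trapezoid(28, 0, 0, 25, 40)               # fully "young" up to 25, fades by 40
tall = trapezoid(1.78, 1.60, 1.85, 2.30, 2.40)
print(round(young, 2), round(tall, 2), round(f_or(young, tall), 2))  # 0.8 0.72 0.8
```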

Fuzzy logic was introduced by Zadeh in 1965, and has been applied since then in multiple fields, such as consumer products, medical devices, and artificial-intelligence systems, which benefit from the ability of this method to interpret information in cases where more than one interpretation is acceptable.

When comparing fuzzy-logic theory to probability theory, it is important to note that they address different forms of uncertainty: fuzzy logic defines the uncertainty as


the degree of an object’s membership in a set, whereas probability theory defines the uncertainty as the probability of the object being in the set.

In many practical cases, the operators provided by fuzzy logic give more suitable results regarding the degree of membership in knowledge definitions. For example, consider the definition of a set “Young OR Tall people”, and a person who is assigned a membership (or probability) of 0.9 in the young-people set and a membership (or probability) of 0.6 in the tall-people set. When using the OR operator of probability theory (assuming independence), the probability of being in the “Young OR Tall people” set would be 0.9 + 0.6 - 0.9 × 0.6 = 0.96, i.e., near certainty; when using the Zadeh operators of fuzzy logic, the person is assigned a different membership grade in the set, max(0.9, 0.6) = 0.9, which better reflects the degree to which this person belongs to the set (intuitively, it should be less than perfect).

Medical knowledge consists of numerous logical definitions that are used in the diagnosis and treatment of patients; clinical guidelines, in particular, include many such definitions. For example, a patient is diagnosed with diabetes when his or her Hemoglobin A1c (HA1C) values are greater than 6.5%; in addition, the HA1C is considered not under control when it exceeds 9%. In order to better understand the decisions of care providers and patients, it is important to be able to interpret cases in which parameter values are borderline with respect to the definitions in the guidelines. A physician can postpone the start of the pharmaceutical treatment for a patient with an HA1C value of 6.6%, or start this treatment for a patient with an HA1C value of 6.4%, because many other personal and environmental parameters are taken into consideration.
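To illustrate how a fuzzy membership function softens such thresholds, here is a small sketch of my own (not the actual DiscovErr reasoner); the width of the transition band around the 6.5% cut-off is an assumption chosen only for illustration.

```python
# Hypothetical fuzzy abstraction of the guideline threshold "HA1C > 6.5%",
# with a soft transition band of +/- 0.3% around the cut-off (arbitrary choice).

def membership_diabetic_range(ha1c, threshold=6.5, band=0.3):
    """Degree in [0, 1] to which an HA1C value satisfies 'greater than 6.5%'."""
    low, high = threshold - band, threshold + band
    if ha1c <= low:
        return 0.0
    if ha1c >= high:
        return 1.0
    return (ha1c - low) / (high - low)   # linear transition inside the band

for value in (6.2, 6.4, 6.6, 7.0):
    print(value, round(membership_diabetic_range(value), 2))
# 6.4 and 6.6 both receive partial, near-borderline memberships (0.33 and 0.67),
# instead of the crisp 0/1 assignment of the original threshold.
```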

An example from the data set used in the evaluation of this research can provide some intuition regarding the frequency of cases in which decisions need to be made according to data with borderline values. According to the diabetes guideline, a Low-Density Lipoprotein (LDL) level below 100 mg/dL is recommended for most patients (lower values are recommended for patients with a history of heart disease); in case the LDL is higher, statin therapy is recommended. When examining the data of LDL test results, taken from about 2,000 patients over a time period of around five years, 13.2% of the available results (1,753 of 13,250) had a value between 95 mg/dL and 105 mg/dL. Although this is only a single example, it can be seen that a rather significant number of


the LDL tests resulted in a value close to one threshold defined by the guideline, and these borderline data affect decisions regarding starting, increasing, or decreasing the doses of statin medications.

3

The DiscovErr System


This chapter describes the DiscovErr system, which I designed and implemented in this study for guideline-based critiquing and quality assessment.

3.1 Desiderata for the System’s Design

After studying the previous work in the field, and before starting the system’s design, I defined a set of design goals that are important to achieve in order for the system to provide comprehensive support for the complete process of guideline-based critiquing and quality assessment. In the following paragraphs, I list this set of design requirements, organized in a manner that reflects the steps of the medical quality-assessment process rather than the requirements’ importance.

1. Provide a solution for medical critiquing and quality assessment that is generic regarding the medical domain and clinical guidelines. The system should use a generic model for knowledge representation, and separate the medical knowledge from the analysis algorithm. This requirement is important to enable future implementations of the system in multiple medical domains, and it enhances the generalizability of the study. It is important to use an expressive model for knowledge representation to support the representation of multiple guidelines from various medical domains.

2. Use a knowledge model that is based on already existing and widely accepted knowledge representation methods. As described in chapter 2, extensive work was already done in this field, and multiple formal models for clinical guidelines representation were designed and evaluated. A system that is based on an existing knowledge model has the benefit of using a validated knowledge model that was previously evaluated regarding its expression of the knowledge within clinical guidelines. In addition, by using a standard knowledge model that is used in other systems, the system can benefit by sharing and reusing clinical guidelines that are already formally represented.

3. Provide a usable user interface to support the tasks of knowledge specification and maintenance. To allow using a knowledge-based system, it is important to provide a usable interface that allows users to specify and maintain the underlying knowledge. In the medical domain, to assure that the acquired knowledge is correct and complete, it is important to involve medical experts in the process of knowledge


specification. A knowledge-based system that includes inherent usable knowledge specification tools, supports its users in creating a consistent knowledge base in a shorter time.

4. Support the process of mapping the formal knowledge to the items in the electronic medical records. To shorten the time of system setup, and to support the process of integration with external electronic medical records, the system should use standard medical classification systems to describe the knowledge and exploit the structure of existing controlled vocabularies.

5. Provide a compliance analysis algorithm that addresses the inherent ambiguity of clinical guidelines, the uncertainty of clinical data, and the fact that patients and care providers do not strictly follow the guidelines but adhere to the essence of their recommendations. Dealing with uncertainty and ambiguity is an inherent aspect of medical treatment; in fact, medicine is often described as a science of uncertainty and probability. When applying methods for retrospective medical quality assessment, the level of uncertainty is even higher, because in addition to the uncertainty that exists during the medical treatment, it involves uncertainty regarding the past data, which in many cases are incomplete or even different from the data that were available when the clinical decisions and actions actually took place. Practicing clinicians, expert physicians, and medical administrators are used to dealing with cases of uncertainty of the data and ambiguity of the medical knowledge; thus, a medical critiquing and quality assessment system that addresses (or tries to address) these issues will have a better chance of being accepted by its potential users.

6. Provide a user interface to explore the guideline compliance analysis results. Analyzing the compliance to clinical guidelines, performed on time-stamped data recorded over a long period of time, may result in multiple outputs describing the level of compliance to each part of each guideline. In order to support users in exploring these multiple results, it is important to provide them with a user interface that is suitable for this task. The user interface should allow users to view the compliance analysis results in an organized manner, with details and an explanation for each guideline-compliance comment. Such a user interface was also required to support the system-evaluation step.


3.2 Overall Architecture

The overall architecture of the DiscovErr system is presented in Figure 2. The system includes three main modules that are integrated to support the guideline compliance-analysis task: the Knowledge Framework, with the knowledge library and the knowledge-specification tool; the Patient Data Access module, which retrieves the data from the electronic medical record; and the Analysis Framework, which performs the compliance analysis and provides a graphical interface for the users to view the compliance-analysis results.

Figure 2. The overall architecture of the DiscovErr system

The Knowledge Framework includes a library that facilitates a knowledge model integrating three types of formal knowledge: procedural knowledge of clinical guidelines, declarative medical-domain knowledge, and knowledge about clinical steps with references to controlled medical vocabularies. The Knowledge Framework also includes a graphical tool for knowledge specification, which is used by expert physicians and knowledge engineers to create and maintain the formal knowledge and store it in the knowledge library.

The Patient Data Access module includes the Data Mapper module for importing data from the electronic medical record and storing it in a temporal database according to the internal system’s schema. The Data Mapper performs three main tasks during the


data import process: (1) mapping the concept identifiers used in the external electronic medical record to the concept identifiers used in the formal knowledge; (2) converting the data when required due to different units of measurement; and (3) storing the time-oriented patient data in a temporal database, where each raw data item (e.g., Hemoglobin level) is mapped to a corresponding concept in the formal knowledge and includes the relevant time stamps. The data in the temporal database consist of two types of parameters: primitive parameters, which represent information about the patient's state (e.g., lab results, physical examinations), and event parameters, which represent the medical treatment (e.g., medication orders, procedures).
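A minimal sketch of the kind of mapping the Data Mapper performs is shown below (Python); the concept identifiers, the unit-conversion table, and the record layout are illustrative assumptions and not the actual DiscovErr schema.

```python
# Hypothetical illustration of the three Data Mapper tasks: mapping external
# concept identifiers, converting units, and storing time-stamped rows.
from datetime import datetime

# Assumed mappings from EMR codes to internal concept IDs, and unit converters.
CONCEPT_MAP = {"EMR-2345": "Plasma-Glucose", "EMR-9921": "Hemoglobin-A1c"}
UNIT_CONVERTERS = {("Plasma-Glucose", "mmol/L"): lambda v: v * 18.0}  # to mg/dL

def map_raw_item(emr_code, value, unit, timestamp):
    """Return a row for the temporal database: concept, value, and time stamp."""
    concept = CONCEPT_MAP[emr_code]                      # task (1)
    convert = UNIT_CONVERTERS.get((concept, unit))       # task (2)
    if convert is not None:
        value = convert(value)
    return {"concept": concept, "value": value,          # task (3)
            "valid_time": timestamp, "type": "primitive"}

row = map_raw_item("EMR-2345", 7.2, "mmol/L", datetime(2013, 5, 14, 9, 30))
print(row)   # {'concept': 'Plasma-Glucose', 'value': 129.6, ...}
```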

The Analysis Framework is responsible for the computational task of compliance analysis. The core of the Analysis Framework is the Compliance Analysis Engine, which applies the computational algorithms to analyze the patient data regarding the compliance to the clinical guidelines. The analysis engine accesses the knowledge through the Knowledge Adapter that provides sophisticated methods to query the knowledge library. The analysis engine uses the Fuzzy Temporal Reasoner to extract high level temporal abstractions regarding the patient’s state. In addition, the Analysis Framework includes the Results Viewer, a graphical interface that allows users to view and explore the compliance analysis results.

In the following sections, each of these modules is described in more detail.


3.3 The Knowledge Framework

In previous research, I developed, together with other members of our group, an overall framework for medical knowledge specification. This framework includes a central guideline library called DeGeL [Shahar et al. 2004; Hatsek et al. 2008], which is used for the storage and retrieval of clinical guidelines represented at multiple representation levels. The framework also includes a knowledge-specification tool called Gesher, which supports the collaboration of expert physicians and knowledge engineers in performing a gradual process of knowledge specification. One of the major benefits of this previous research was the integration between two types of knowledge, procedural knowledge and declarative medical-domain knowledge, which were represented at multiple levels, from the free-text source, through a structured-text representation, to a fully formal representation.

The DiscovErr system includes a new Knowledge Framework, which is based on the previous research, but comprises a more concise implementation of the previously existing modules, focusing on the library modules that are relevant to supporting the compliance-analysis task. In addition to the new implementation, the DiscovErr Knowledge Framework uses an extended schema for the knowledge representation, with an additional module that facilitates knowledge about the definition of atomic clinical steps that can be used in multiple paths of the guidelines. To support the process of mapping between the knowledge concepts and the items in the patient database, each clinical step can reference one or more terms from controlled medical vocabularies. As mentioned in previous works, such as [Abu-Hanna and Jansweijer 1994], separating the domain knowledge and the clinical-steps knowledge from the guidelines' procedural knowledge increases the reusability of the knowledge and assists in reducing the time required for knowledge specification and maintenance.

The Integrated Knowledge Model

The knowledge model of the system (Figure 3) comprises three integrated schemas: a schema for procedural knowledge of clinical guidelines, based on the Asbru guideline specification language; a schema for declarative medical domain knowledge, according to the KBTA ontology; and a schema for the representation of atomic clinical steps referencing terms from controlled medical vocabularies.


Figure 3. The integrated knowledge model of the DiscovErr system.

The integrated knowledge model, which combines three formal schemas for knowledge representation, inherits the expressiveness of the Asbru language regarding the representation of clinical guidelines, and the expressiveness of the KBTA ontology regarding the representation of declarative domain knowledge. The clinical steps library supports the organization of the guidelines’ knowledge, and by mapping the steps to controlled medical vocabularies, allows the compliance analysis algorithms to utilize additional knowledge from external medical vocabularies, for example, the hierarchical structure of the ATC drug classification system, which is used by the system to identify medications that belong to the same therapeutic or chemical groups. The clinical guidelines schema uses both the domain knowledge and the clinical steps library for representation of the procedural knowledge. For example, an entry condition of a guideline may reference a declarative concept such as “high hemoglobin A1c”, which is formally represented according to the KBTA schema.
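The ATC-based grouping mentioned above can be illustrated with a short sketch that compares code prefixes; the idea follows directly from the hierarchical structure of ATC codes, but the helper function below and the example codes are my own illustration rather than DiscovErr code.

```python
# Illustrative use of the hierarchical ATC drug-classification codes to decide
# whether two medications belong to the same therapeutic or chemical group.
# The example codes are shown for illustration only.

def same_atc_group(code_a, code_b, level=3):
    """True if the two ATC codes share the same prefix up to the given level.

    ATC levels 1-5 correspond to code prefixes of length 1, 3, 4, 5, and 7
    (anatomical main group ... chemical substance).
    """
    prefix_len = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}[level]
    return code_a[:prefix_len] == code_b[:prefix_len]

# Two beta-blocking agents (same pharmacological subgroup, prefix 'C07A'):
print(same_atc_group("C07AB02", "C07AB07", level=3))   # True
# A beta blocker versus a statin: different groups.
print(same_atc_group("C07AB02", "C10AA01", level=3))   # False
```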

In the following sections, I provide a detailed description of each part of the integrated knowledge model.

3.3.1 Clinical Guidelines Representation

In the DiscovErr system, the formal representation of the procedural clinical guidelines’ knowledge is based on the Asbru guideline specification language. The Knowledge Framework includes an object-oriented implementation of the Asbru language. In the


following paragraphs I present the class diagrams and describe this object oriented implementation, with details about specific elements that were added as an extension to the original Asbru schema to support features of the compliance analysis algorithms.

Figure 4 displays a class diagram describing the heart of the procedural schema, focusing on the Guideline-Plan class. Each guideline plan is identified in the library using a Plan-ID and contains several meta-data attributes. The Title is used for presenting the plan in the knowledge-specification tool and in the results of the compliance analysis. The Description is a free-text attribute populated during knowledge specification. Each plan can be assigned a Level-Of-Evidence attribute if it is available in the guideline's source. The Source attribute is used to specify the source of the guideline. An additional important attribute is the plan's Status, which can be assigned 'Draft' during the specification process, 'Completed' when the specification is complete, and 'Operative' to annotate that the plan should be used in the guideline compliance-analysis process.

Each Guideline-Plan can be assigned Asbru’s conditions (Filter, Setup, Complete, Abort, Suspend, and Restart), which in the case of compliance analysis, are used to determine the desired state of a plan at any specific point in time; for example, when the system detects an action that might belong to an operative guideline plan, it examines whether at the same point of time, the state of the plan should be active, completed, or aborted. The conditions are represented as pointers to constraints involving declarative temporal abstraction concepts (e.g., cholesterol-state = ‘high’), which are part of the declarative knowledge; thus, this forms the first level of integration between the procedural and declarative schemas.

Each guideline plan can be assigned Asbru's outcome-intentions. Similarly to the conditions, each intention is represented as a pointer to a constraint involving the declarative knowledge, but in the case of intentions, the schema was extended with the Time-To-Fulfill attribute, which is used by the system to consider the fact that a reasonable time period is needed before intentions can be achieved. For example, while analyzing the record of a patient who was just diagnosed with diabetes, the system should not comment about an outcome intention that was not achieved in the first several months after the diagnosis.


Each Guideline-Plan is assigned Asbru's plan-body, which is used to represent the structure of the guideline's plan. The Plan-Body can be one of the following types: Clinical-Step, Composite, If-Then-Else, and Switch-Case. The Plan-Body is described in more detail in the following pages.

Additional attributes that were added in this implementation extend the original Asbru schema. The Exact-Match-Only attribute is used by the compliance algorithm to avoid searching for alternative actions that replace a given plan, for example, to avoid trying to detect alternative medications when insulin administration is missing in a diabetic patient. The Is-Single-Time-Event attribute is used to determine whether a plan can happen only a single time in a patient's lifetime. For example, the event of a chronic-disease diagnosis can happen only a single time; therefore, actions that should be performed adjacent to the diagnosis can occur only once.

Figure 4. Guideline-Plan class diagram.
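Since the class diagram itself cannot be reproduced here, the following is a rough Python-dataclass sketch of the Guideline-Plan attributes described above; the field types and defaults are my assumptions, and the actual implementation may differ.

```python
# Rough sketch of the Guideline-Plan class described in the text (Figure 4);
# field types and defaults are assumptions, not the actual implementation.
from dataclasses import dataclass, field
from typing import Optional, Dict, List

@dataclass
class GuidelinePlan:
    plan_id: str
    title: str
    description: str = ""
    level_of_evidence: Optional[str] = None
    source: Optional[str] = None
    status: str = "Draft"                   # 'Draft' | 'Completed' | 'Operative'
    # Asbru conditions: pointers (here, names) to declarative constraints,
    # e.g. {"Filter": "cholesterol-state = 'high'"}.
    conditions: Dict[str, str] = field(default_factory=dict)
    # Outcome intentions, each with a Time-To-Fulfill period (schema extension).
    outcome_intentions: List[Dict[str, str]] = field(default_factory=list)
    plan_body: Optional[object] = None       # Clinical-Step / Composite / ...
    exact_match_only: bool = False           # extension to the Asbru schema
    is_single_time_event: bool = False       # extension to the Asbru schema
```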

Figure 5 displays a class diagram focusing on the representation of the Plan-Body. As mentioned above, each plan is assigned a single type from a set of plan-body types.

Plan-Body-Clinical-Step is the simplest plan-body, used to represent atomic clinical actions. Plans of this type hold a pointer to a specific step in the clinical steps library,


which is later described in more detail. In general, the clinical steps represent atomic actions, such as therapeutic, diagnostic, or administrative actions.

Plan-Body-Composite is used to represent composite plans that comprise one or more sub-plans. In addition to the collection of Sub-Plans, a class that is described in more detail in the next diagram, the composite plans include the attribute Wait-For-Non- Mandatory, which is used to indicate whether the plan can be completed before all (non-mandatory) sub-plans are completed. The representation of the temporal order in which the sub-plans should be executed appears in the sub-plan objects; thus, the original Asbru attribute regarding the plans order (Sequential, Parallel, Any-Order) is omitted in this implementation.

Plan-Body-If-Then-Else is used to represent the guideline's control plans; plans of this type are annotated with a condition that controls the guideline application. In cases where the condition holds, the application should continue with the sub-plan pointed to by the Plan-If-True annotation, and when the condition does not hold, the application should continue with the Plan-If-False annotation. Similarly to the representation of each of the guideline-plan's conditions, the if-then-else Condition takes part in the integration between the procedural and declarative schemas, and is implemented as a pointer to a declarative concept (e.g., blood-pressure-state = 'very high'), which is represented according to the declarative schema.

Plan-Body-Switch-Case is an additional structure that is used to represent control plans, and consists of one or more Switch-Case-Items, each assigned a condition that if it holds, the application of the guideline should continue with a specific sub-plan.


Figure 5. Plan-Body class diagram.

The Sub-Plan class (see Figure 6) is included in the representation of all of the composite plan-body classes. It represents a reference to another guideline plan, and allows annotating whether the sub-plan is mandatory (the Mandatory-Level attribute can be assigned Mandatory or Non-Mandatory). Each sub-plan instance includes a Sub-Plan-Reference-Point-Type, which represents the reference point from which this sub-plan should be applied. In most cases, the reference will use the value Parent-Start, except in cases such as sequential sub-plans, where the reference point should be Previous-Step-End (i.e., the end of the previous step). In addition to the point of reference, the sub-plans are assigned a time annotation according to the original Asbru specification, which consists of six sub-annotations: earliest-start, latest-start, min-duration, max-duration, earliest-end, and latest-end.


Figure 6. Sub-Plan class diagram.

An additional attribute of the Plan-Body (see Figure 4) is the Periodic-Constraint, which is used to represent guideline plans with a periodic execution manner. When specifying a periodic plan, the representation should include details about the periodicity attributes. The Cardinality constraint allows specifying constraints regarding the number of repetitions of the plan, with rules such as "repeat between 4 and 6 times" or "repeat at least 3 times". The Minimum-Gap and Maximum-Gap annotations allow specifying constraints regarding the time gap between the repetitions, for example, repeat every 4 to 6 months. Although the original Asbru schema includes additional annotations regarding plan periodicity, such as envelope-min-duration and envelope-max-duration, which can be used to determine the overall time period for all repetitions of the step, the current implementation of the compliance-analysis algorithm uses the minimal set of periodicity annotations mentioned above.
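A compact sketch of the Sub-Plan reference, the Asbru time annotation, and the periodicity attributes described above is shown below (Python dataclasses); the field layout is my own approximation of the schema, and the follow-up example is hypothetical.

```python
# Approximate sketch of the Sub-Plan reference, the Asbru time annotation, and
# the Periodic-Constraint described in the text; field layout is an assumption.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TimeAnnotation:            # the six Asbru sub-annotations (e.g., in days)
    earliest_start: Optional[int] = None
    latest_start: Optional[int] = None
    min_duration: Optional[int] = None
    max_duration: Optional[int] = None
    earliest_end: Optional[int] = None
    latest_end: Optional[int] = None

@dataclass
class PeriodicConstraint:
    min_repetitions: Optional[int] = None    # Cardinality, e.g. "at least 3 times"
    max_repetitions: Optional[int] = None
    minimum_gap_days: Optional[int] = None    # gap between repetitions
    maximum_gap_days: Optional[int] = None

@dataclass
class SubPlan:
    plan_id: str                              # reference to another Guideline-Plan
    mandatory_level: str = "Mandatory"        # or "Non-Mandatory"
    reference_point: str = "Parent-Start"     # or "Previous-Step-End"
    time_annotation: Optional[TimeAnnotation] = None

# Hypothetical example: a follow-up test repeated every 3 to 6 months.
follow_up = PeriodicConstraint(minimum_gap_days=90, maximum_gap_days=180)
```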


3.3.2 Clinical Steps Representation

The DiscovErr’s Knowledge Framework includes a library of clinical step concepts. The idea behind this design is to create an internal repository for storing knowledge about atomic clinical actions that can be used in multiple paths of a guideline and even in more than one guideline. Reusing the same definition of a clinical step in multiple parts of the guideline assists in reducing the time needed for knowledge specification and maintenance.

The class diagram in Figure 7 presents the classes used for the representation of the clinical steps in the library.

The Clinical-Steps are assigned a Concept-ID used for library identification, a Title attribute used for presentation, and a Clinical-Step-Type attribute that describes the step type. The step types supported by the system are Test-Step, to represent lab examinations and measurements; Procedure-Step, to represent therapeutic procedures (excluding drug-related therapy); Specialist-Referral-Step, to represent referrals to specialist treatment; and Drug-Administration-Step, to represent medication-related actions. All step types share the same representation, except that the Drug-Administration step type includes an additional attribute to describe the type of administration action, which can receive one of the values Prescribe, Increase, Maintain, Decrease, or Stop.

The set of clinical-step types presented above was defined based on our previous experience in the formal representation of clinical guidelines in multiple clinical domains, and was sufficient to represent the guideline used in the system’s evaluation. However, other types of clinical steps can be added, such as a documentation step or an education step. An interesting attempt to determine a complete set of clinical-step types covering all guideline recommendations can be found in [Essaihi et al. 2003], who created an “Action Palette” for guideline implementers. In addition, these steps could be organized in a hierarchical structure, although this is not required for practical use until the list of types is extended.

The attribute Related-Terms is used to map the step to terms from controlled medical vocabularies, and can be assigned zero or more references to medical terms. Each reference to a medical term is specified with the identifier of the source vocabulary (the attribute Source), the original code of the term in that vocabulary (the attribute Code), a title describing the term (the attribute Term), for example, beta blocking agents, and, if available, the UMLS identifiers of the medical term (CUI and AUI). The use of standard terms from controlled medical vocabularies assists in the preliminary task of setting up the system when it is configured to retrieve data from a new electronic medical record. In addition, the compliance analysis algorithm can benefit from the hierarchical structures provided by some of these vocabularies. More information about this usage is available in section 3.5.2, in which the compliance analysis algorithm is presented in detail.

It is important to note that the current implementation of the knowledge specification tool supports specifying related medical terms from the following four standard vocabularies: ICD-9, CPT, LOINC and ATC.

Figure 7. Clinical-Step concepts class diagram.
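As an illustration, the following is a small Python sketch of a clinical-step entry mapped to a standard vocabulary term; the field names follow the schema above, and the concrete identifiers are only illustrative assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RelatedTerm:
    source: str                 # vocabulary identifier, e.g., "ATC", "LOINC", "ICD-9", "CPT"
    code: str                   # the term's code in the source vocabulary
    term: str                   # human-readable title of the term
    cui: Optional[str] = None   # UMLS concept identifier, if available
    aui: Optional[str] = None   # UMLS atom identifier, if available

@dataclass
class ClinicalStep:
    concept_id: str
    title: str
    step_type: str              # Test-Step, Procedure-Step, Specialist-Referral-Step, Drug-Administration-Step
    administration_action: Optional[str] = None   # only for Drug-Administration-Step
    related_terms: List[RelatedTerm] = field(default_factory=list)

# Example: a statin-prescription step mapped to the ATC class of statins (C10AA)
statin_step = ClinicalStep(
    concept_id="step-prescribe-statin",
    title="Prescribe statin therapy",
    step_type="Drug-Administration-Step",
    administration_action="Prescribe",
    related_terms=[RelatedTerm(source="ATC", code="C10AA", term="HMG CoA reductase inhibitors")],
)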


3.3.3 Domain Knowledge Representation

The declarative knowledge represents the medical domain knowledge, specified using a formal representation of medical concepts. The term “declarative knowledge” is used to emphasize the distinction between this knowledge about medical concepts and the “procedural knowledge” that details the clinical recommendations within the clinical guidelines. It is important to mention that published textual clinical guidelines include both types of knowledge; for example, the criteria for diagnosing a diabetes patient are a clinical definition of a declarative medical concept (“Diabetes Diagnosis”), whereas the clinical actions recommended in this case, including the types of actions and the temporal constraints (i.e., when to start and which actions), are considered “procedural knowledge”.

This declarative knowledge is represented using both structured-text and a formal representation. The formal knowledge is represented according to the KBTA ontology, which supports the representation of different types of medical concepts according to a formal structure that provides information about the concepts’ properties and about the relations between different concepts. The strength of the KBTA ontology is in the ability to represent the temporal aspects of the knowledge in a manner that supports the computational task of temporal-abstraction, in which higher level abstractions are extracted from raw temporal data.

In the following section I provide a partial description of the KBTA ontology, focusing on the major parts of the ontology that are relevant for the current implementation.

The KBTA ontology defines a set of concepts of different types that are used to formally represent the declarative medical concepts. The class diagram in Figure 8 displays these concepts with details about the properties of each concept type.

The abstract class Concept is used to define the properties that are common to all concept types. All concepts are assigned a Concept-ID for library identification, a Title used for presentation, and an attribute called Structured-Text used for a textual representation of the concept’s definition. In addition, each KBTA concept is specified with information about Temporal-Persistence, which allows the temporal abstraction engine to apply temporal extrapolation to create temporal intervals from time-stamped data, and to decide whether to apply temporal interpolation between close intervals with the same value.

Primitive concepts are used to represent the raw data concepts, such as laboratory tests and physical examinations. To support the interpretation of the data from the electronic medical record, the primitive concepts are specified with a Data-Type attribute that can be designated Numeric, Boolean, or Symbolic. The Units attribute is used mainly for numeric concepts, for the specification of the unit of measurement.

Similarly to primitive parameters, Event concepts are used to represent raw data concepts, but for concepts with multiple attributes. For example, a data item representing a medication administration may contain several attributes, such as the identifier of the medication, the dose, and the route of administration. The Attributes member is used to provide meta-data on each of the attributes, in a format similar to that used by the primitive concepts to describe the properties of a raw data item with a single attribute (e.g., a single lab result).

Gradient concepts represent a gradient abstraction between sequential ordinal parameters, an abstraction that can result in one of three possible values: decrease, same, and increase. The attribute Significance-Interval is used to specify the threshold at which the gradient value becomes different than same. For example, an increase gradient value between two consecutive hemoglobin A1c measurements, the first with a value of 7% and the second with a value of 7.5%, will be abstracted only if the significance-interval threshold is lower than 0.5%. The Abstracted-From attribute is used to specify the relation between the gradient concept and the primitive it is abstracted from.

Trend concepts are very similar to gradients, and also describe an abstraction that can be resolved to decrease, same, or increase, but they are resolved with a slightly different temporal-abstraction mechanism. Trend concepts include additional information regarding a time gap in which the change in value should occur in order to abstract a result different than same. For example, when using a Gradient to describe a weight-gain concept, the knowledge may include a Significance-Interval of 5 kg; when using a Trend concept, a time gap should also be specified, for instance, 5 kg in one month.

Rate concepts also relate to changes in ordinal parameter values that occur over time, and allow the specification of a custom set of Rate-Output values. For example, to describe a concept such as weight-gain-rate, a change of 2 kg in one month can be considered slow, a change of 5 kg in one month can be considered fast, and a change of 10 kg in one month can be considered very fast.

State concepts are used to represent high-level abstractions for the classification of numeric or qualitative values of one or more parameters. The Abstracted-From attribute is a collection of the constraints defining the value classification. An example of a simple state is the BMI-state concept (Body Mass Index classification). The state is used to extract a discrete value out of the numeric BMI value; for example, a person with a BMI between 25 and 30 is classified with an Overweight state. The output of a state concept is an ordinal scale of discrete values, and the MinMax attribute is used to determine the direction in which the resolution should occur in cases where more than one class matches the parameter values.
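As a rough illustration of how such a state abstraction could be evaluated, the following Python sketch classifies a numeric BMI value; the ranges are the conventional BMI categories and the function name is an assumption.

def bmi_state(bmi: float) -> str:
    # Classify a numeric BMI value into a discrete, ordinal state value
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"

print(bmi_state(27.2))   # Overweight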

Pattern concepts are used to represent medical concepts with complex temporal relations. The patterns are abstracted from one or more sub-components that reference additional KBTA concepts. The schema of the patterns supports the definition of temporal relations between these sub-components. An increase in the dose of a medication after an increase in some lab result value is a common example of a clinical pattern. The Local-Constraint attribute is used for the specification of constraints on the values of each of the sub-components; ‘LDL cholesterol > 100’ and ‘statins-dose-gradient = Increase’ are examples of such local constraints. The Pairwise-Constraints are used for the specification of the temporal relations between the components; for example, the start of an increase in the statins dose should begin just after the start of the high-LDL interval. The Periodic-Constraints are used to specify periodic patterns, and have an internal schema similar to the one described for periodic guideline plans, supporting the specification of constraints on the number of repetitions and on the temporal gaps between them.


Figure 8. KBTA concepts class diagram.

Extension of the KBTA Schema

The DiscovErr implementation of the KBTA schema includes a cross-cutting extension of every element that contains a constraint on a parameter’s value in the form of a logic relation between the parameter value and a threshold. In the original KBTA schema, each constraint is specified using a triplet of arguments: a parameter identifier, a relation operator, and a threshold value; an example of such a triplet is {blood-glucose, >, 125 (mg/dL)}. In the new implementation, to support the use of fuzzy logic techniques in the evaluation of the logic relations, an additional argument was added, called a deviation-interval. For example, the constraint above can be extended with a deviation-interval of 5 mg/dL. The use of this fuzzy logic mechanism is described in more detail in section 3.5.1, in which the reasoning process applied by the Fuzzy Temporal Reasoner is presented.
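A minimal sketch of the extended constraint structure, in Python; the field names are assumptions that follow the quadruple described above.

from dataclasses import dataclass

@dataclass
class FuzzyConstraint:
    parameter_id: str                 # e.g., "blood-glucose"
    operator: str                     # one of ">", ">=", "<", "<=", "="
    threshold: float                  # e.g., 125 (mg/dL)
    deviation_interval: float = 0.0   # 0 reduces the constraint to the original crisp triplet

# The example from the text: {blood-glucose, >, 125 mg/dL}, extended with a deviation-interval of 5 mg/dL
glucose_constraint = FuzzyConstraint("blood-glucose", ">", 125.0, 5.0)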


3.3.4 The Knowledge Specification Tool

To support the process of knowledge specification and maintenance, the DiscovErr system includes a knowledge specification tool that is used by the knowledge editors and medical experts to formally represent the underlying knowledge.

The knowledge specification tool comprises three main user interfaces: the Clinical Guideline Library supports the specification of the procedural knowledge, represented according to the Asbru language; the Domain Knowledge Library supports the specification of the declarative knowledge, i.e., the definition of medical concepts, represented according to the KBTA ontology; and the Clinical Steps Library supports the specification of the library of atomic clinical steps that are used (and reused) by multiple paths in the guidelines. Each of these interfaces provides graphical widgets that are used to create, modify, and delete knowledge elements, and the knowledge is stored on the file system as XML files, structured according to the formal schemas of the knowledge model.

The interface for editing the Clinical Guideline Library is presented in Figure 9. On the left side of the screen, the user can browse and edit the hierarchical structure of the guidelines’ plans, while the right side of the screen is used to view and edit the properties of the plan that is currently selected in the hierarchical view. To edit the plan’s properties, the user selects among the various tabs on the right side of the interface.

Figure 9. The Clinical Guidelines Library knowledge specification interface.


The interface for editing the Domain Knowledge Library is presented in Figure 10. On the left side of the screen, the user can browse, create, edit, and delete medical concepts represented according to the KBTA ontology; on the right side of the screen, the user can view and edit the properties of the concepts. The example in the figure illustrates the specification of a State-Abstraction concept defining the diagnosis of diabetes, which is abstracted from hemoglobin A1c and glucose measurements, with a constraint that the hemoglobin A1c be greater than or equal to 6.5%, with a deviation-interval of 0.5%.

Figure 10. The Domain Knowledge Library knowledge specification interface.

The interface for editing the Clinical Steps Library is presented in Figure 11. On the left side of the screen, the user can browse, create, edit, and delete clinical steps represented according to the clinical-steps schema. On the right side, the user can modify the properties of the steps, and map the steps to standard terms from the available controlled medical vocabularies.

Figure 12 presents the Taxonomy Library Viewer, which is used to select terms from the medical vocabularies. On the left side of the screen, the user can browse the hierarchical structure of the taxonomies, or use the search tool available on the right side of the screen. The example in the figure illustrates a search in the medical vocabularies for the term “Atorvastatin”, which is located in the ATC drug classification under “Cardiovascular System”.


Figure 11. The Clinical steps Library knowledge specification interface.

Figure 12. The Taxonomy Library Viewer interface.


3.4 The Patient Data Access Module

The Patient Data Access module supports the access to data from external patient electronic medical records. The module comprises the Data Mapper, a component for data mapping and conversion, and an internal temporal database that is accessed by the system when performing the compliance analysis.

The data transformation task is a common, though non-trivial, task that must be addressed by any system that provides high-level analysis of existing data. As in many application domains, decision-support systems in the clinical domain are usually developed separately from the operative systems that are used to store and access operative data. Thus, in order to apply the decision-support functionality, a process of data transformation is required.

Figure 13 presents the main components of the Patient Data Access module. The source of the patient data is an external electronic medical record, where the operative clinical data is recorded. The Data Mapper, which is provided with access to view the data in the electronic medical record, retrieves the patient data in the source format, converts it to the temporal format, and stores it in the temporal database used by the system.

Figure 13. The data flow in the Patient Data Access module.

The Electronic Medical Record (EMR) is external to the system and can be implemented in any format; thus, a preliminary task is to design the schema through which the Data Mapper will access the data, a task that usually involves technical data professionals with administrative access to the EMR. Optimally, the data will be organized according to a relational temporal schema, similar to the one used by DiscovErr (described later in this section). However, in many cases there will be a need to customize the data access layer to retrieve all relevant types of data elements, such as lab results or medication orders, from multiple tables and views in the EMR, and even to aggregate multiple data repositories; in this case, adjustments will be implemented in the Data Mapper to support accessing data from any required source type, such as relational, XML, or JSON.

The Data Mapper uses pre-defined mapping tables to perform the mapping between the items in the patient database and the items that are represented in the knowledge base. The mapping tables include information about the source concept identifiers and units of measurement, and the corresponding concept identifiers and units of measurement represented in the formal knowledge base.

The task of preparing the mapping tables is performed as a preliminary step, to enable the application of the compliance analysis computation. In most cases, the preparation of the mapping tables is performed after the knowledge specification process is completed, when the list of all required knowledge items is available and can be compared to the parameters in the patients’ database. The complexity of the mapping process depends on the representation of the concepts in the patients’ database: in cases where the database parameters are represented using codes from standard vocabularies, the mapping can be done automatically, whereas in cases where the parameters are represented using internal codes of the organization from which the data originated, the mapping requires manual effort. Some computational tools assist in matching the database parameter names to the knowledge concept names, but their output still needs to be validated by relevant professionals.

The following example clarifies the mapping process. To set up the application of a hypertension management guideline, after completing the formal specification of the knowledge, the knowledge library is queried to extract a complete list of knowledge items referenced by the guideline. In the case of a guideline for hypertension management, the results of this query will include parameters such as blood pressure measurements, hypertension medications, and relevant background diagnoses. In the next step, the electronic medical records are examined, usually by extracting the full set of distinct data parameters. Then, the relevant parameters in this set that match the items specified in the guideline are detected, and their identifiers are mapped to the identifiers of the corresponding concepts in the formal knowledge.
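A minimal sketch of what such a mapping table might look like, in Python; the source identifiers, units, and conversion factor are illustrative assumptions, not the actual mapping used in the evaluation.

# source (EMR) concept id -> (knowledge-base concept id, source unit, target unit, conversion factor)
MAPPING_TABLE = {
    "LAB_1234":  ("hemoglobin-a1c", "%",      "%",     1.0),
    "LAB_5678":  ("blood-glucose",  "mmol/L", "mg/dL", 18.0),   # unit conversion needed
    "MED_00042": ("statin-order",   None,     None,    None),   # medication orders carry no unit
}

def map_item(source_concept_id, value):
    # Translate a raw EMR record into the knowledge-base vocabulary and units
    target_id, _, target_unit, factor = MAPPING_TABLE[source_concept_id]
    converted = value * factor if (factor is not None and value is not None) else value
    return target_id, converted, target_unit

print(map_item("LAB_5678", 7.0))   # ('blood-glucose', 126.0, 'mg/dL')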


The Data Mapper imports the data into a temporal database, which is accessed during the compliance analysis process to retrieve the patient data. The temporal database is a relational database composed of two main tables. The first is the Patients table, which stores the non-temporal data items, with demographic information such as age, gender, and date of birth. The second is the Patient-Data table, which stores the historical data of the patients. Each record in this table comprises the following attributes: patient-identifier, concept-identifier, valid-start-time, valid-end-time, value, and units of measurement. The generic structure of the table supports the storage of multiple types of data parameters, such as symptoms, lab results, diagnoses, medication orders, and clinical procedures.
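The following Python/SQLite sketch shows one possible realization of this two-table schema; the column names mirror the attributes listed above, while the concrete SQL types and the sample row are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patients (
    patient_id    TEXT PRIMARY KEY,
    date_of_birth TEXT,
    gender        TEXT
);
CREATE TABLE Patient_Data (
    patient_id       TEXT,
    concept_id       TEXT,   -- identifier of the clinical concept (lab test, drug, diagnosis, ...)
    valid_start_time TEXT,   -- start of the interval during which the value holds
    valid_end_time   TEXT,   -- end of the interval (equal to the start time for point events)
    value            TEXT,
    units            TEXT
);
""")

# An HbA1c measurement stored as a point (degenerate interval)
conn.execute("INSERT INTO Patient_Data VALUES (?, ?, ?, ?, ?, ?)",
             ("p001", "hemoglobin-a1c", "2013-05-02", "2013-05-02", "7.4", "%"))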

3.5 The Analysis Framework

The heart of the DiscovErr system is the Analysis Framework, which performs the guideline compliance analysis. The Analysis Framework comprises several components (see Figure 14) that are integrated to carry out the analysis computation. The Analysis Controller orchestrates the operations of the other components. When the system loads, the controller initiates an instance of the Knowledge Adapter, which retrieves the knowledge from the library and constructs an in-memory cache with the formal representation of all operative guidelines.

Figure 14. The components of the Analysis Framework.


When the framework is invoked to analyze the compliance for a given patient, the controller operates the Patient Data Adapter to access the temporal database, retrieve all of that patient’s data, and represent it in an in-memory object with a schema similar to that of the relational temporal database described in the previous section. The patient data is passed to the Compliance Analysis Engine, in which the overall compliance analysis algorithm is implemented. During the analysis, the engine uses the Knowledge Adapter to access the knowledge in a manner that is optimized to support the compliance algorithm, and uses the Fuzzy Temporal Reasoner to extract temporal abstractions regarding the patient’s state. The Results Viewer is a graphical interface that allows users to view and explore the results of the analysis.

3.5.1 The Fuzzy Temporal Reasoner

The Fuzzy Temporal Reasoner is a knowledge-based temporal-abstraction engine (a KBTA engine) that is used to extract high-level interpretations from raw temporal data. It uses formal declarative knowledge about medical concepts, represented according to the KBTA ontology. The Fuzzy Temporal Reasoner applies techniques of fuzzy logic [Zadeh 1965, 1968] as part of the reasoning process, enabling it to provide a membership score for each abstraction it generates. The use of fuzzy logic techniques distinguishes the Fuzzy Temporal Reasoner from other existing KBTA engines, which are able to extract only deterministic results based on classical logic.

In addition to applying temporal logic for the evaluation of the various types of KBTA concepts (e.g., Gradient, State), the Fuzzy Temporal Reasoner uses techniques of fuzzy logic both in the evaluation of logical relations (e.g., x < y), and in the evaluation of logical operators (i.e., AND, OR operators) that are part of compound logical expressions.

The Fuzzy Temporal Reasoner is an internal component of the Analysis Framework and is used by the guideline Compliance Analysis Engine to evaluate the state of the patient with regard to the conditions specified in the guidelines. For example, when evaluating compliance with a guideline that should be aborted in the case of reduced kidney function, the Compliance Analysis Engine uses the Fuzzy Temporal Reasoner to retrieve the time intervals during which the reduced-kidney-function state was true according to the patient’s data (i.e., reduced-kidney-function-state = ‘true’). The results of the Fuzzy Temporal Reasoner are a set of time intervals, each assigned a membership score represented as a continuous number between 0 and 1. In time periods when the raw data of the patient completely satisfies the constraints specified in the definition of the reduced-kidney-function concept, the membership score is assigned the value 1; in time periods when these constraints are fully contradicted, the membership score is assigned the value 0; and in time periods when these constraints are partially satisfied, because the raw data is close to satisfying the relation, the membership score is assigned a rational number between 0 and 1.

The reasoning process applied by the Fuzzy Temporal Reasoner is a multi-step process. In the following paragraphs, I describe each step of the reasoning process and demonstrate it using an example of the evaluation of a Hypertension (high blood pressure) concept. For this example, I define Hypertension as a State concept, abstracted from two Primitive parameters, systolic blood pressure (SBP) and diastolic blood pressure (DBP), each a numeric parameter defined with a local persistence of one hour (i.e., each measurement is valid for one hour). The abstraction includes the logical OR operator applied to two constraints on the values of the Primitive parameters. The definition of the Hypertension concept:

Hypertension = {SBP > 140 mmHg OR DBP > 90 mmHg}

Examples of the raw measurements of the patient’s systolic and diastolic blood pressures are illustrated in Figure 15.

Figure 15. Blood Pressure measurements used for the demonstration of the Fuzzy Temporal Reasoner.


Extrapolation of Temporal Intervals

The first step of the reasoning process is the extrapolation of temporal intervals using the Temporal-Persistence knowledge specified for the raw parameters. In addition, each pair of intervals of the same parameter that share the same value and share a mutual time period (i.e., overlapping intervals) is merged into a single interval of this parameter, assigned the same value.

In the example (see Figure 16), each blood pressure measurement is extrapolated to create an interval of one hour; the three DBP measurements are extrapolated to three consecutive intervals with the same value (86). In the merging step, these three intervals are merged into a single longer interval with the same value.

Figure 16. Extrapolation of the time-stamped measurements by the Fuzzy Temporal Reasoner.
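The following Python sketch illustrates the extrapolation-and-merging step under simplified assumptions (timestamps in hours, a single one-hour persistence value as in the example); it is not the engine's actual code.

from typing import List, Tuple

def extrapolate_and_merge(measurements: List[Tuple[float, float]],
                          persistence: float = 1.0) -> List[Tuple[float, float, float]]:
    # Extend each time-stamped measurement by its local persistence, then merge
    # overlapping intervals of the same parameter that carry the same value.
    intervals = sorted((t, t + persistence, v) for t, v in measurements)
    merged = []
    for start, end, value in intervals:
        if merged and value == merged[-1][2] and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]), value)
        else:
            merged.append((start, end, value))
    return merged

# Three DBP measurements of 86, taken half an hour apart, merge into a single 2-hour interval
print(extrapolate_and_merge([(0.0, 86), (0.5, 86), (1.0, 86)]))   # [(0.0, 2.0, 86)]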

Temporal Partitioning

The second step of the temporal reasoning process involves partitioning of the temporal data. The idea behind this operation is to create a segmented (partitioned) temporal representation, with the minimal set of partitions, in which each relevant parameter has zero or one value. This is done to support the next steps of the reasoning process, in which the evaluation logic is applied to each of these partitions.

In the example (see Figure 17), the data is partitioned into five partitions. In the first partition, none of the parameters has a value; in the second partition, the DBP value is 86 and the SBP value is 125; in the third partition, the DBP value is 86 and the SBP value is 139; in the fourth partition, the DBP value is still 86 and the SBP has no value; and in the last partition, none of the parameters has a value.

Figure 17. Temporal partitioning by the Fuzzy Temporal Reasoner.
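A sketch of the partitioning operation in Python, applied to interval series such as those produced by the previous step; the representation of the series is an assumption made for readability.

from typing import Dict, List, Tuple

def partition(series: Dict[str, List[Tuple[float, float, float]]]):
    # Split the time axis at every interval boundary, so that within each resulting
    # partition every parameter has either a single value or no value at all.
    cuts = sorted({t for intervals in series.values() for s, e, _ in intervals for t in (s, e)})
    partitions = []
    for start, end in zip(cuts, cuts[1:]):
        values = {}
        for name, intervals in series.items():
            for s, e, v in intervals:
                if s <= start and end <= e:      # the interval covers the whole partition
                    values[name] = v
        partitions.append((start, end, values))
    return partitions

series = {"DBP": [(0.0, 3.0, 86)], "SBP": [(0.5, 1.5, 125), (1.5, 2.5, 139)]}
for p in partition(series):
    print(p)
# (0.0, 0.5, {'DBP': 86})
# (0.5, 1.5, {'DBP': 86, 'SBP': 125})
# (1.5, 2.5, {'DBP': 86, 'SBP': 139})
# (2.5, 3.0, {'DBP': 86})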

Fuzzy Evaluation of the Logical Relations

The third step of the reasoning process involves the evaluation of the logical relations. The evaluation is done for each parameter value in each of the partitions.

In classical logic, the evaluation of a logical relation results in a true or a false value, referred to as the truth value of the relation. In fuzzy logic, the result of a relation evaluation is called the membership score, and is represented as a continuous (rational) number, usually between 0 and 1.

In section 3.3.3 I described an extension to the KBTA schema, and mentioned that the representation of logical relations was extended with an attribute called a deviation-interval. This attribute is used in the relation evaluation to enable the assignment of a partial membership score in cases where the constraint is not fully satisfied. In such cases, a special fuzzification-function is applied during the reasoning process.

The fuzzification-function receives the following arguments: the current-value of the parameter, the threshold, the deviation-interval, and the relation-operator specified in the constraint definition. The deviation-interval represents the maximal deviation from the threshold that can still be evaluated with a membership score higher than zero; thus, in cases where the relation does not hold and the absolute distance between the current-value and the threshold is greater than the deviation-interval, the membership score is evaluated as zero, and in cases where the relation holds, the membership score is 1. In all other cases, the membership score is calculated according to the following formula:

membership-score = 1 − (|current-value − threshold| / deviation-interval)

Figure 18 illustrates the application of the fuzzification-function for the evaluation of the constraint SBP > 140 mmHg, with a deviation-interval of 10 mmHg.

Figure 18. Illustration of the fuzzification-function. Evaluation of the constraint SBP>140 mmHg, with a deviation-interval of 10 mmHg. On a measurement of SBP=139, the membership score is evaluated as 0.9; on a measurement of SBP=135, the membership score is evaluated as 0.5; on any measurement of SBP≤130, the membership score is evaluated as 0; on any measurement of SBP≥140, the membership score is evaluated as 1.
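A Python sketch of such a fuzzification-function follows; its behavior matches the description above and the values illustrated in Figure 18, although the system's exact formulation may differ.

import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def fuzzify(value: float, op: str, threshold: float, deviation: float) -> float:
    # Membership score of the relation `value <op> threshold`, with a tolerance of `deviation`
    if OPS[op](value, threshold):
        return 1.0                                    # relation fully satisfied
    distance = abs(value - threshold)
    if deviation == 0 or distance >= deviation:
        return 0.0                                    # too far from the threshold
    return 1.0 - distance / deviation                 # partially satisfied

# The values illustrated in Figure 18, for the constraint SBP > 140 with a deviation-interval of 10
print(fuzzify(139, ">", 140, 10))   # 0.9
print(fuzzify(135, ">", 140, 10))   # 0.5
print(fuzzify(130, ">", 140, 10))   # 0.0
print(fuzzify(145, ">", 140, 10))   # 1.0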

Figure 19 illustrates the complete logical constraint evaluation on each parameter value in each partition that was generated in the previous step of the example. The interval with SBP=125 was evaluated with a membership score = 0, the interval with SBP=139 was evaluated with a membership score = 0.9, and each interval with DBP=86 was evaluated with a membership score = 0.6.


Figure 19. Evaluation of logical constraints by the Fuzzy Temporal Reasoner.

Evaluation of Logic Operators

The last step of the reasoning process is the evaluation of the logic operators within compound logic expressions. For this, an additional fuzzy logic technique is used.

The operators AND, OR, and NOT of classical logic exist in fuzzy logic with a different implementation. A fuzzy logic implementation of these logic operators, called the Zadeh operators, uses the minimum function for the evaluation of the AND operator, the maximum function for the evaluation of the OR operator, and the function 1 − truth-value for the evaluation of the NOT operator.

In the Fuzzy Temporal Reasoner implementation, I used the Zadeh operators for the evaluation of the AND and OR operators. The minimum function is used to evaluate the AND operator only in cases in which a membership score exists for all of the operands of the expression. The maximum function is used to evaluate the OR operator in cases in which a membership score is available for at least one of the operands of the expression.

For the NOT operator I implemented an operation that inverts the relation by replacing the relation’s operator with the opposite operator. For example, the expression NOT(x ≥ y) is evaluated as (x < y). The following example illustrates this new implementation. Consider the expression NOT(SBP > 140 mmHg), with a deviation-interval of 10 mmHg. The new implementation of the operator resolves the truth value as: 1 ⇐ SBP ≤ 140 ⇐ NOT(SBP > 140).

To evaluate the false-value of compound expressions, I used an implementation of De Morgan’s law: the evaluation of the expression NOT(x OR y) is implemented as (NOT(x) AND NOT(y)). For example, the false-value of the hypertension expression on measurements of SBP=139 and DBP=92 is evaluated as:

0.8 ⇐ min(1, 0.8) ⇐ SBP < 140 AND DBP < 90 ⇐ SBP ≥ 140 OR DBP ≥ 90

* Assuming a deviation-interval of 10 mmHg for the constraint DBP ≥ 90 mmHg; thus, the membership score of DBP=92 for the inverted constraint DBP < 90 is 0.8.

Figure 20 illustrates the evaluation of the logic operators on the demonstration example. The expression in the example is a compound expression using the OR operator (SBP > 140 OR DBP > 90); thus, for each of the partitions, the maximum function is applied to the membership scores that were calculated in the previous steps.

Figure 20. Evaluation of logic operators by the Fuzzy Temporal Reasoner.

The evaluation of the logic operators as described above enables the Fuzzy Temporal Reasoner to evaluate any compound expression. The knowledge specified in the KBTA schema allows the representation of complex compound logic expressions, represented in the form of AND-OR trees. For example, a definition of the Preeclampsia diagnosis is “high blood pressure with proteinuria in a pregnant woman after 20 weeks of gestation”. Such definitions are specified in the knowledge as compound expressions, illustrated in Figure 21 using an AND-OR tree. The Fuzzy Temporal Reasoner uses a recursive implementation of the fuzzy logic evaluation function that supports evaluating any complex AND-OR tree.

Figure 21. AND-OR tree representation of the Preeclampsia diagnosis concept.
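To make the recursion concrete, the following Python sketch evaluates an AND-OR tree with the Zadeh minimum/maximum operators and with NOT handled by inverting the relation operator, as described above; the node structure is an assumption, and the final line reproduces the De Morgan example from the text.

import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
INVERSE = {">": "<=", ">=": "<", "<": ">=", "<=": ">"}

def fuzzify(value, op, threshold, deviation):
    if OPS[op](value, threshold):
        return 1.0
    distance = abs(value - threshold)
    return 0.0 if deviation == 0 or distance >= deviation else 1.0 - distance / deviation

def evaluate(node, values):
    # Recursively evaluate an AND-OR tree on one partition's values (score in [0, 1])
    if "children" in node:
        scores = [evaluate(child, values) for child in node["children"]]
        return min(scores) if node["op"] == "AND" else max(scores)
    op = INVERSE[node["op"]] if node.get("negated") else node["op"]   # NOT via relation inversion
    return fuzzify(values[node["param"]], op, node["threshold"], node["deviation"])

# NOT(SBP >= 140 OR DBP >= 90), rewritten by De Morgan as an AND of inverted relations
not_hypertension = {"op": "AND", "children": [
    {"param": "SBP", "op": ">=", "threshold": 140, "deviation": 10, "negated": True},
    {"param": "DBP", "op": ">=", "threshold": 90,  "deviation": 10, "negated": True},
]}
print(evaluate(not_hypertension, {"SBP": 139, "DBP": 92}))   # 0.8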

3.5.2 The Compliance Analysis Engine

The Compliance Analysis Engine is the heart of DiscovErr’s Analysis Framework, as it includes the implementation of the overall compliance analysis algorithm. In the following sections I describe the compliance analysis algorithm in detail; the algorithm is formalized using flow-chart diagrams accompanied by textual descriptions.

General Description of the Compliance Analysis Algorithm

The goal of the compliance analysis algorithm is to evaluate the data of each single patient with regard to its compliance with the full set of operative formal clinical guidelines that are represented in the library. The method that implements the algorithm, called Analyze-Patient, accepts the patient’s complete data as an argument, represented using an in-memory object that holds both the demographic and the temporal data of the patient; in addition, the method holds a reference to the Knowledge Adapter, which is used in multiple scenarios during the compliance analysis.

The algorithm combines two computational approaches: top-down and bottom-up. It consists of several sequential steps, each of which can directly analyze the raw patient data or use the outputs calculated in previous steps of the algorithm. The outputs of the steps are stored in a central data structure called a TimeLine, which supports fast storage and retrieval of time-stamped data items. The main steps of the algorithm are presented in Figure 22; each step is described in detail in a separate section.

Figure 22. The main steps of the DiscovErr’s compliance analysis algorithm.

The TimeLine data structure is used to store and retrieve time-stamped data items, called TimePoints, that represent multiple events in the patient’s history. Each TimePoint includes an annotation regarding its type, a timestamp, and references to additional relevant information according to its type. The TimePoints are divided into two main types: Data-Item-Points represent the raw data items that exist in the original medical record, and Explanation-Points represent new, higher-level information that is created during the compliance analysis. The description of the algorithm in the following sections clarifies the context in which the various types of computed-explanation points are added to the TimeLine. In addition, the various types of Explanation-Points are organized in section 3.6, which presents the taxonomy of computed explanations and compliance-related comments.
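A minimal Python sketch of the TimeLine and TimePoint structures described above; the field names are assumptions that mirror the text.

from bisect import insort
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass(order=True)
class TimePoint:
    timestamp: datetime
    kind: str = field(compare=False)                      # "Data-Item-Point" or "Explanation-Point"
    concept_id: Optional[str] = field(compare=False, default=None)
    details: dict = field(compare=False, default_factory=dict)

class TimeLine:
    # Keeps TimePoints sorted by timestamp, for fast chronological storage and retrieval
    def __init__(self):
        self._points: List[TimePoint] = []

    def add(self, point: TimePoint) -> None:
        insort(self._points, point)

    def before(self, timestamp: datetime) -> List[TimePoint]:
        return [p for p in self._points if p.timestamp <= timestamp]

timeline = TimeLine()
timeline.add(TimePoint(datetime(2013, 5, 2), "Data-Item-Point", "hemoglobin-a1c", {"value": 7.4}))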


Compliance Analysis Step 1: Initialize the TimeLine

The first step of the algorithm (see Figure 23) includes the initialization of a TimeLine object and its population with data points that represent the raw data items. This is done by scanning all temporal items (i.e., the data items represented as temporal intervals) in chronological order, and adding a data-item point for each temporal item in the patient’s data.

For medication-related data items, the system performs an additional computation and generates an abstract drug-administration point annotated with a drug-action-type (Start, Increase, Maintain, Decrease, or Stop), by comparing the current dosing value to the previous dosing data of the same medication. In addition, while scanning all the medication events, the system examines the time gap between sequential data items of the same medication; in case it is higher than a certain threshold, a new abstract drug-stop data point is generated and added to the TimeLine. This threshold, called drug-stop-detection-time, is currently a configurable parameter of the system that is set according to the source of the patient’s data. For example, in the evaluation of the system on outpatient diabetes data, the drug-stop-detection-time was set to 35 days, to reflect a reasonable time gap in the medication purchase data; if this gap is exceeded, a drug-stop point is added to the TimeLine.

Figure 23. Initialize the TimeLine step of the compliance analysis algorithm.
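A Python sketch of how the drug-action-type points and the drug-stop point could be derived from chronologically sorted medication records; the 35-day threshold follows the example above, and the function and variable names are assumptions.

from datetime import date, timedelta

DRUG_STOP_DETECTION_TIME = timedelta(days=35)   # configurable per data source

def drug_action_points(records):
    # records: chronologically sorted (time, dose) items of a single medication
    points, previous = [], None
    for when, dose in records:
        if previous is None:
            points.append((when, "Start"))
        else:
            prev_when, prev_dose = previous
            if when - prev_when > DRUG_STOP_DETECTION_TIME:
                points.append((prev_when + DRUG_STOP_DETECTION_TIME, "Stop"))
                points.append((when, "Start"))
            elif dose > prev_dose:
                points.append((when, "Increase"))
            elif dose < prev_dose:
                points.append((when, "Decrease"))
            else:
                points.append((when, "Maintain"))
        previous = (when, dose)
    return points

purchases = [(date(2013, 1, 1), 10), (date(2013, 2, 1), 20), (date(2013, 6, 1), 20)]
print(drug_action_points(purchases))
# Start (Jan 1), Increase (Feb 1), Stop (inferred 35 days after Feb 1), Start (Jun 1)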


Compliance Analysis Step 2: Top-Down Analysis

The top-down analysis is a computational process that analyzes the data from the perspective of the guidelines, i.e., a knowledge-driven process. In general, it is performed to detect the high-level temporal abstractions related to the entry conditions, stop conditions, and outcome-intentions of every operative guideline that is available in the knowledge library.

Figure 24 presents a flow chart illustrating the steps of top-down analysis; a description of the process is provided in the following paragraphs.

Figure 24. The top-down Analysis step of the compliance analysis algorithm.

When starting the top-down analysis, the system retrieves the complete set of operative guideline plans from the Knowledge Adapter. In this phase, the Knowledge Adapter breaks composite guidelines into a virtual set of guideline-plans, each describing a single clinical path of the original composite guideline. When a composite plan is divided into a set of multiple smaller virtual guideline-plans, the virtual plans are assigned the guideline-conditions (i.e., Asbru’s conditions: Filter, Setup, Complete, Abort, Suspend, and Restart) of the original “parent” guideline. I call this process “Condition Propagation”, as the conditions of a composite plan are propagated to each of its sub-plans (which inherit the conditions from their parent plan). The condition propagation involves the following logic (a sketch of this logic follows the list below):

• When propagating an entry condition of a guideline (i.e., Filter, Setup, or Restart) from a parent to a sub-plan, the condition of the parent is added in conjunction (i.e., an AND relation) to the existing entry condition of the sub-plan. For example, a composite guideline for the management of Multiple Sclerosis, with a filter condition of “Multiple Sclerosis Diagnosis”, may include a sub-plan that relates to the special treatment of pregnant women, which includes a filter condition of “Pregnant Woman”. In such a case, the sub-plan inherits the condition of the parent, and is assigned the following composite filter condition: “Multiple Sclerosis Diagnosis AND Pregnant Woman”.

• When propagating a stop condition of a guideline (i.e., Complete, Abort, or Suspend) from a parent to a sub-plan, the condition of the parent is added in disjunction (i.e., an OR relation) to the existing stop condition of the sub-plan. For example, if the Multiple Sclerosis guideline includes an Abort condition of “Patient Hospitalization”, and the sub-plan for the treatment of pregnant women includes an Abort condition of “Severe Pain”, the sub-plan inherits the condition of the parent, and is assigned the following composite abort condition: “Patient Hospitalization OR Severe Pain”.
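A Python sketch of this condition-propagation logic; for readability, conditions are kept as plain strings, which is an assumption rather than the system's actual expression representation.

ENTRY_CONDITIONS = ("Filter", "Setup", "Restart")
STOP_CONDITIONS = ("Complete", "Abort", "Suspend")

def propagate_conditions(parent, sub_plan):
    # Return the sub-plan's effective conditions after inheriting from its parent:
    # entry conditions are combined with AND, stop conditions with OR.
    combined = dict(sub_plan)
    for name, connective in [(n, "AND") for n in ENTRY_CONDITIONS] + [(n, "OR") for n in STOP_CONDITIONS]:
        if parent.get(name) and sub_plan.get(name):
            combined[name] = f"({parent[name]}) {connective} ({sub_plan[name]})"
        elif parent.get(name):
            combined[name] = parent[name]
    return combined

ms_guideline = {"Filter": "Multiple Sclerosis Diagnosis", "Abort": "Patient Hospitalization"}
pregnancy_sub_plan = {"Filter": "Pregnant Woman", "Abort": "Severe Pain"}
print(propagate_conditions(ms_guideline, pregnancy_sub_plan))
# {'Filter': '(Multiple Sclerosis Diagnosis) AND (Pregnant Woman)',
#  'Abort': '(Patient Hospitalization) OR (Severe Pain)'}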

The set of operative guideline plans is saved in a cache memory of the Knowledge Adapter to reduce the time of future retrieval, and the Compliance Analysis Engine continues with the top-down analysis.

For each operative guideline plan, the declarative concepts of the entry conditions, stop conditions, and outcome-intentions are extracted and sent to the Fuzzy Temporal Reasoner that evaluates them on the patient data. For each of these declarative concepts, the Fuzzy Temporal Reasoner returns a set of zero or more temporal abstractions, in the form of time stamped intervals assigned with a membership score.

The system filters the intervals returned when evaluating the entry conditions, and accepts only intervals with a membership score that passes a certain threshold, called the guideline-start-threshold, which is currently configured to 0.8. This is done to ignore intervals in which the condition holds with a membership score that is too low. For each interval that is accepted by the filter, the system generates data points that represent the start and the end of the interval, and adds them to the TimeLine. In addition, the system adds earliest-start and latest-start points to mark the time points by which the plan should have been started. For guidelines marked with Is-Single-Time-Event = true, the data points described above are added only for the first interval in which the start condition holds.

The system similarly filters the intervals returned when evaluating the stop conditions, and accepts only intervals with a membership score that passes a certain threshold, called the guideline-stop-threshold, which is also configured to 0.8. For each interval that is accepted by the filter, the system generates data points that represent the start and end of the interval; in the case of stop conditions, these mark the time points at which the application of the guideline should have been discontinued.

For each intention of the guideline, the system adds a data point to mark the time when the intention should have started to be monitored, using the Time-To-Fulfill attribute that is specified for the intention in the guideline’s knowledge (described earlier in section 3.3.1). Then, the declarative knowledge representing the outcome intention itself is evaluated for that patient using the Fuzzy Temporal Reasoner, and for each of the intervals returned, the system generates data points that represent the start and the end of the interval, and assigns them the membership score that represents the level of compliance with the intention. For example, the intentions of a diabetes guideline can include a specification of the target (goal) levels of hemoglobin A1c. The formal representation of this outcome-intention will include a reference to a declarative knowledge constraint representing these levels (e.g., hemoglobin A1c < 8%), and a value for the Time-To-Fulfill attribute (e.g., 1 month after the treatment is applied). The system will add a TimePoint to the TimeLine 1 month after the start time of the first interval in which the entry condition was met, and an interval representing the membership score for these target levels, whenever hemoglobin A1c data was available.

Compliance Analysis Step 3: Bottom-Up Analysis

The bottom-up analysis is a computational process that analyzes the guideline compliance from the perspective of the patient’s data, i.e., a data-driven process. In general, the bottom-up analysis examines each data item in the patient’s medical record and provides it with a set of possible knowledge-based (i.e., clinical guideline-based) explanations.

In the bottom-up analysis, the system scans each item in the patient data, and tries to provide as many computed-explanations for each data item as possible. The computed-explanations are structured semantic comments that evaluate the data item in the context of any guideline-plan that is related to this item through its knowledge-concept type; this, of course, is done according to what is known to the system, i.e., the full set of formal operative guidelines. For example, consider a high hemoglobin A1c measurement that is related to a diabetes guideline by two relations: it can be a pre-diabetic test taken to decide whether the patient should be diagnosed with diabetes, or it can be an instance of an ongoing periodic test step, to monitor a patient who was already diagnosed. The data item is examined in the context of these two guideline relations, and a possible computed-explanation is provided for each. If the examined A1c test is the first occurrence of an A1c measurement, it will be assigned a computed-explanation marking it as a possible start point of the guideline; in this case it will not be assigned a computed-explanation regarding the periodic test step of monitoring an already diagnosed patient, because the system detects that the guideline could not yet have been started before the first A1c measurement point. Each of the computed-explanations added for the data points is represented in a data structure that holds additional details; for example, computed-explanations in the context of guideline step points are assigned a score for the assessment of the timing of the action, with a corresponding description of the temporal relation, which can be step-too-early, step-on-time, or step-is-late.

When the bottom-up analysis is finished, each data point that represents a clinical parameter in the TimeLine is assigned a collection of one or more possible computed-explanations. In the next step of the compliance assessment, described in the next section, the computed-explanations are summarized to select the most reasonable one.

Figure 25 presents a hierarchical flow chart illustrating the main flow of the bottom-up analysis, in which the processes with the bolded borders are described in more detail in additional flow charts.


Figure 25. The bottom-up Analysis step of the compliance analysis algorithm.

Description of the bottom-up analysis: The system scans the data items of the patient in chronological order, and for each item it uses the Knowledge Adapter to retrieve all possible relations between the item and the guideline plans that are available in the knowledge library. These relations represent references to the data parameters from the guideline knowledge; for example, a cholesterol measurement parameter is referenced by a cholesterol guideline (i.e., the cholesterol measurement is used in the clinical process). To preserve the flow of the bottom-up analysis description, a detailed description of the Knowledge Adapter’s method that retrieves these relations between data items and guidelines is provided below.

In cases where the system does not detect any relation between the data item and any guideline available in the knowledge library, an Unexplained-Data comment is added to the TimeLine to represent this unexplained data item.

In the next step, for data items that were found to have relations to available operative guidelines, each of the relations is examined to determine the Applicability-Status of its guideline. The applicability status can generally be described as an answer to the question “Was the guideline-plan applicable at the time referred to by the data item?”, and is determined by a special method of the Compliance Analysis Engine that receives the time referred to by the data item and the formal guideline as arguments, and scans the previous events recorded in the TimeLine to determine the applicability status. The method uses the data points that were added to the TimeLine during the compliance analysis so far, which include the guideline’s start and stop points added in the top-down analysis, and the computed-explanations of all clinical steps that occurred prior to the currently examined time (a reminder: the data items are scanned chronologically). The result of this method, which can be one of the following applicability statuses: Not-Applicable, Applicable, Stopped, Completed, or Just-Completed, is used to determine the next step in the flow of the bottom-up analysis. In addition, a numeric score is attached to the applicability status, representing the level of applicability. For Applicable guidelines, the score is assigned a number between 0.8 and 1 (0.8 is the guideline-start-threshold described above), according to the membership score of the guideline’s start point assigned by the Fuzzy Temporal Reasoner when evaluating whether the entry condition holds. For all other applicability statuses the score is 0.
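A very rough Python sketch of the applicability-status determination; the actual method scans the full TimeLine and is more involved, so the inputs and the ordering of the checks here are simplifying assumptions.

from enum import Enum

class Applicability(Enum):
    NOT_APPLICABLE = "Not-Applicable"
    APPLICABLE = "Applicable"
    STOPPED = "Stopped"
    COMPLETED = "Completed"
    JUST_COMPLETED = "Just-Completed"

def applicability_status(start_points, stop_points, complete_points, item_time):
    # Each *_points argument is a list of (time, membership_score) pairs taken from the TimeLine
    starts = [(t, m) for t, m in start_points if t <= item_time]
    if not starts:
        return Applicability.NOT_APPLICABLE, 0.0
    if any(t == item_time for t, _ in complete_points):
        return Applicability.JUST_COMPLETED, 0.0
    if any(t < item_time for t, _ in complete_points):
        return Applicability.COMPLETED, 0.0
    if any(t <= item_time for t, _ in stop_points):
        return Applicability.STOPPED, 0.0
    # Applicable: the score is the membership of the most recent accepted start point (>= 0.8)
    return Applicability.APPLICABLE, max(starts)[1]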

The flow chart in Figure 26 illustrates the process of assessment of Not-Applicable guidelines, i.e., guideline-plans whose entry conditions (Asbru’s filter and setup conditions) did not hold before the time of the data item.

Figure 26. Compliance analysis in the context of a Not-Applicable guideline.


In case the relation to the Not-Applicable guideline is Abstracts-Start-Condition, meaning that the data item’s parameter abstracts the filter or setup conditions of the guideline, a computed-explanation is added noting that the data may affect the decision to start the application of the guideline. In case the relation is Abstracts-Stop-Condition, a computed-explanation is added noting that the data affects a decision to stop the guideline, in this case a guideline that was not yet applicable. In case the relation is Part-Of-Plan-Body, the system checks whether another plan of the guideline is currently applicable. If another plan of the guideline is applicable at that time, a computed-explanation of type Wrong-Path-Selection is added, and if not, a computed-explanation of type Step-Not-Supported is added, meaning that there is no previous data that supports the current application of the guideline.

The flow chart in Figure 27 illustrates the process of assessment of Applicable guidelines, i.e., guideline-plans whose entry conditions hold at the time of the data item. Figure 28 extends this illustration of the assessment process for the case in which the item-guideline relation is Part-Of-Plan-Body.

Figure 27. Compliance Assessment in the context of an Applicable guideline.

In case the relation to the Applicable guideline is Abstracts-Start-Condition, a computed-explanation is added mentioning that the guideline was already applicable, so the new data, which also supports starting the plan’s application, may be a redundant test. In case the relation is Abstracts-Stop-Condition, a computed-explanation is added noting that the data affects a decision to stop the guideline.


When the relation is Part-Of-Plan-Body of an applicable guideline, the process is more complicated, and is described in the flow chart illustrated in Figure 28. In this case, the system already knows that the step is expected (because the guideline is applicable), and it now needs to determine whether it was performed in a timely manner, i.e., whether it was a too-early, on-time, or too-late action. To do this, the system examines more details about the data item-guideline relation, which was computed earlier by the Knowledge Adapter when it retrieved all guideline relations for the current data item.

Figure 28. Compliance Assessment of a step in the Plan-Body of Applicable guidelines.

In case the detailed relation is Plan-Body-Start-Plan, meaning it is a plan that starts the plan body of the applicable guideline, the system examines the timing by comparing the time stamp of the item to the earliest-start and latest-start time points, which were added to the TimeLine in the top-down analysis.

In case the detailed relation is Plan-Body-Later-Plan, meaning the guideline has additional plans that should have been performed before the plan currently being examined, the system examines the action timing according to the time gap from the previous steps, which were added to the TimeLine in the bottom-up analysis (a reminder: the data items are scanned in chronological order). In this scenario, cases may exist in which the preliminary actions were not performed; if this is the case, the system adds a computed-explanation regarding the missing preliminary action(s).

In the last case, in which the detailed relation is Plan-Body-Periodic-Plan, the system examines the TimeLine to determine whether this is the first instance of the plan’s action, by checking whether it is the first occurrence of the same parameter since the guideline application should have been started. If the current item is the first instance of the periodic step, the timing of the action is examined using the earliest-start and latest-start of the guideline; if it is not the first instance, the timing is examined by examining the time gap to the closest previous instance.

In all the cases described above in which the timing of the action is examined, the system adds a corresponding computed-explanation of one of the following types: Step-Too-Early, Step-On-Time, or Step-Too-Late. In addition to the type of computed-explanation, the system also assigns a numeric score to represent the quality of the timing, with a value between 0 and 1. The score is computed using a fuzzification-function (see section 3.5.1, in the description of the Fuzzy Temporal Reasoner), using a default deviation-interval that is set to half of the time gap between the earliest-start and latest-start of the guideline-plan. For example, if the earliest start point of a plan is one week and the latest start point is three weeks, the deviation-interval used is one week, which is half of the gap between the earliest and latest start points. This means that if a plan starts more than one week after the latest start, it is assigned a score of 0, whereas if, for example, it is late by exactly half a week past the latest start (half of the deviation-interval), it is assigned a score of 0.5. Although it is possible to extend the representation of the guideline to include a specific definition of the deviation-interval for computing the timing of each guideline step, the knowledge specification process is shortened by using this mechanism of default values.
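A Python sketch of this timing score, with times in days and the default deviation-interval set to half of the start window, as in the example above; the function name and the return format are assumptions.

def timing_score(action_time, earliest_start, latest_start):
    # Score a step's timing against its allowed start window (all times in days)
    deviation = (latest_start - earliest_start) / 2.0    # default deviation-interval
    if earliest_start <= action_time <= latest_start:
        return "Step-On-Time", 1.0
    if action_time > latest_start:
        label, distance = "Step-Too-Late", action_time - latest_start
    else:
        label, distance = "Step-Too-Early", earliest_start - action_time
    score = max(0.0, 1.0 - distance / deviation) if deviation > 0 else 0.0
    return label, score

# Earliest start = 1 week, latest start = 3 weeks, so the deviation-interval is 1 week
print(timing_score(24.5, 7, 21))   # ('Step-Too-Late', 0.5): half a week past the latest start
print(timing_score(29.0, 7, 21))   # ('Step-Too-Late', 0.0): more than a week past the latest start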

The flow chart in Figure 29 illustrates the process for the assessment of Stopped guidelines, i.e., guideline-plans whose application should have been stopped according to their complete and abort conditions.


Figure 29. Compliance analysis in the context of a Stopped guideline.

In this scenario, the relation between the data item and the guideline is examined, and the process continues accordingly.

In case the relation to the Stopped guideline is Abstracts-Start-Condition, a computed-explanation is added noting that the data may affect the decision to start the application of the guideline. In case the relation is Abstracts-Stop-Condition, no computed-explanation is added, because the guideline application should already have been stopped. In the last case, in which the relation is Part-Of-Plan-Body, a computed-explanation of type Stopped-Guideline-Step is added, meaning that the decision to perform the action is against the stop conditions of the guideline.

The flow chart in Figure 30 illustrates the process for the assessment of Completed guidelines, i.e., guideline-plans whose application has been completed according to the previous data. The relation between the data item and the guideline is examined, and the process continues accordingly.

Figure 30. Compliance analysis in the context of a Completed guideline.

In a similar manner to the previous process for Stopped guidelines, in the case where the relation to the Completed guideline is Abstracts-Start-Condition, a computed-explanation is added noting that the data may affect the decision to start the application of the guideline. In case the relation is Abstracts-Stop-Condition, no computed-explanation is added, because the guideline application has already been completed. In the last case, in which the relation is Part-Of-Plan-Body, a computed-explanation of type Redundant-Step-Repeated is added, meaning that the action that completes the guideline was already executed, and the action represented by the current (newer) data item is a redundant action.

The flow chart in Figure 31 illustrates the process of assessment of Just-Completed guidelines, i.e., guideline-plans whose application was completed at exactly the same time as the examined data item. The relation between the data item and the guideline is examined, and the process continues accordingly.

Figure 31. Compliance analysis in the context of a Just-Completed guideline.

In a similar manner to the previous processes for Stopped and Completed guidelines, in case the relation to the Just-Completed guideline is Abstracts-Start-Condition, a computed-explanation is added noting that the data may affect the decision to start the application of the guideline. In case the relation is Abstracts-Stop-Condition, no computed-explanation is added, because the guideline’s application was just completed. In the last case, in which the relation is Part-Of-Plan-Body, a computed-explanation of type Duplicate-Step is added, meaning that an action that completes the guideline was just executed, at the same time as the current data item, which is redundant with regard to the process intention, i.e., multiple actions were performed in parallel to fulfill the same process intention. In such cases, the system resets the computed-explanations of these parallel step items, which were already examined by the analysis process but are only now detected as duplicate actions, and assigns them as Duplicate-Steps for the same process intention as well.


Detecting Relations between Data Items and Guidelines

The method for retrieving the relations between the data items and the guidelines performs an exhaustive search on the full structure of every guideline-plan in the library, to detect elements in the guidelines that refer to the concept of the data item. When a reference from the guideline to the data item's concept is detected, it is assigned an attribute that represents the type of relation, and added to a list of possible relations. The type of relation is derived from the type of knowledge element (also called knowledge-role) in which the concept is referenced. The possible set of guideline-concept relation types consists of the following types: Abstracts-Start-Condition, Abstracts-Stop-Condition, and Part-Of-Plan-Body.

For example, the data item of a hemoglobin A1c measurement may be referred to by a composite condition, "Diabetes Diagnosis", which is the filter condition of a diabetes guideline. The Diabetes Diagnosis condition is represented in the knowledge base using a KBTA State-Abstraction that is abstracted from fasting-blood-glucose and hemoglobin A1c measurements. In that case, there is a possibility that the examined hemoglobin A1c data item is a test that was taken to diagnose the patient; therefore, a new relation of the type Abstracts-Start-Condition is added to the possible relations. The A1c measurement concept is also referenced from the plan-body of the guideline plan, in the context of a clinical step to be performed periodically; therefore, a second relation is added and assigned the type Part-Of-Plan-Body.

During the process of detecting references between the guidelines and the concepts of the data items, there is a need to compare the concept identifiers of the data items to the concept identifiers referenced by the knowledge roles of the guidelines; for example, comparing the concept identifier of a hemoglobin A1c test that abstracts the filter condition of the guideline to the concept identifier of the data item. The comparison of the concept identifiers can result in three levels of matching between the concepts: Exact-Match, when the concept identifiers are identical; Partial-Match, when the concepts are not identical but are closely related according to an available taxonomy of a controlled medical vocabulary; and No-Match, when the concept identifiers do not match exactly or partially. In the current implementation, the Knowledge Adapter exploits the hierarchical structure of the ATC drug classification system to compare medication item identifiers. When a possible reference from the guideline to a drug item is examined, and there is no exact match between the item identifiers, the system checks whether the source of the data item concept is the ATC vocabulary. If the concept of the medication item is from the ATC vocabulary, the system examines the clinical-step item of the guideline to check whether it is mapped to a medical term from the ATC vocabulary. Then, if the concept does have a mapping to an ATC drug identifier, the system performs a taxonomical search to find the relation between the items. This taxonomical search can result in the following detections: Exact-Match is detected when the data item is a descendant of the concept in the guideline; for example, the identifier of the Atorvastatin medication will result in an Exact-Match to the identifier of the HMG-CoA-Reductase class of medications (also called Statins), because according to ATC it is classified as a type of Statin. Same-Chemical-Group is detected when the items share the same chemical-group parent, Same-Pharma-Group is detected when the items share the same pharmaceutical group, and Same-Therapeutic is detected when the items share the same therapeutic group. This mechanism of exploiting the taxonomical structure of the ATC drug classification system strengthens the system by allowing it to examine any medication record connected to an ATC identifier and relate it to the existing guideline, without a need to perform a manual mapping between the concepts. It is worth mentioning that exploiting the hierarchical structure of knowledge taxonomies is a common approach used by many medical informatics systems, for various types of applications.
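ATC codes encode their hierarchy positionally (anatomical main group, therapeutic subgroup, pharmacological subgroup, chemical subgroup, chemical substance), so the taxonomical comparison can be sketched by comparing code prefixes. The sketch below is an illustration under that assumption and is not the actual Knowledge Adapter code; for example, Atorvastatin (C10AA05) shares the prefix of the statin class C10AA.

def atc_relation(data_item_code: str, guideline_code: str) -> str:
    """Compare two ATC codes by exploiting the positional ATC hierarchy.

    Returns one of: Exact-Match, Same-Chemical-Group, Same-Pharma-Group,
    Same-Therapeutic, No-Match.
    """
    data_item_code = data_item_code.upper()
    guideline_code = guideline_code.upper()
    # the data item (a specific drug) is the guideline concept itself,
    # or a descendant of it (e.g., C10AA05 under the statin class C10AA)
    if data_item_code == guideline_code or data_item_code.startswith(guideline_code):
        return "Exact-Match"
    if data_item_code[:5] == guideline_code[:5]:   # same chemical subgroup (e.g., C10AA)
        return "Same-Chemical-Group"
    if data_item_code[:4] == guideline_code[:4]:   # same pharmacological subgroup (e.g., C10A)
        return "Same-Pharma-Group"
    if data_item_code[:3] == guideline_code[:3]:   # same therapeutic subgroup (e.g., C10)
        return "Same-Therapeutic"
    return "No-Match"

# e.g., atc_relation("C10AA05", "C10AA") returns "Exact-Match"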

Due to the complexity of the method described above, its implementation includes a result-caching mechanism. Each time the method is invoked, an in-memory cache is checked to determine whether the relations between the item and the guideline were already computed. This way, by using the caching mechanism, the exhaustive search operation is performed only once for each distinct pair of a concept that exists in the patient database and a guideline that exists in the knowledge library.
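This is essentially memoization keyed by the (concept, guideline) pair; a minimal sketch of the pattern, with assumed function names, is shown below.

from functools import lru_cache

def exhaustive_relation_search(concept_id, guideline_id):
    # placeholder for the exhaustive search over the full guideline-plan structure
    return []

@lru_cache(maxsize=None)
def get_relations(concept_id: str, guideline_id: str) -> tuple:
    """Exhaustive search performed at most once per distinct (concept, guideline) pair;
    subsequent calls with the same pair return the cached result."""
    return tuple(exhaustive_relation_search(concept_id, guideline_id))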

Compliance Analysis Step 4: Missing Actions Analysis

The Missing-Actions-Analysis is an important step of the compliance analysis, because missing actions are among the most common problems of compliance to clinical guidelines.

In this step, the system scans the TimeLine again, this time to detect missing actions. An additional scan of the TimeLine is needed because the previous steps of the compliance analysis cannot detect missing actions: the top-down analysis examines the guidelines' conditions and outcome-intentions and does not relate to clinical actions (which are represented in the plan-body of the formal guidelines), and the bottom-up analysis scans the existing data items in the patient's record, and therefore cannot detect actions that are missing.

The hierarchical flow-chart in Figure 32 illustrates the process of the Missing-Actions-Analysis. The process with the bold borders is illustrated in an additional flow-chart in Figure 33.

Figure 32. The Missing-Actions Analysis of the compliance analysis algorithm.

The Missing-Actions-Analysis starts with the initialization of a dictionary in memory, which is used for monitoring the applicability status of each guideline that is detected during the scan of the TimeLine. Then, the system scans the TimeLine in chronological order, and reacts according to the type of each TimePoint that is detected.

When detecting a Plan-Start-Point or a Plan-Stop-Point that was added in the top-down analysis, the system assigns the applicability status of the point's guideline as Applicable or Stopped, respectively. When detecting a Plan-Step-Point, the system examines whether the step is starting or completing the guideline, and assigns the applicability status of the point's guideline as Started or Completed, respectively. When the system detects a Plan-Latest-Start point, which represents the latest allowed start of a specific instance of the guideline, it examines whether a step exists that fulfils the guideline, by scanning the TimeLine forward to look for later steps that had a Step-Too-Late explanation assigned to them in the bottom-up analysis. If it does not detect any late steps that fulfill the guideline, it adds a computed-explanation of type Missing-Action to the TimeLine. In the special case of Drug-Increase clinical steps, before adding the missing-action explanation, the system performs further investigation to determine whether there is a reason for avoiding the drug-increase action.
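A minimal sketch of this chronological scan, with an applicability dictionary keyed by guideline, is given below; the TimePoint kinds follow the description above, while the timeline interface and helper names are assumptions made for illustration. The Drug-Increase special case (assess_missing_drug_increase) is sketched after the following paragraphs.

def missing_actions_analysis(timeline):
    """Scan the TimeLine chronologically and add Missing-Action explanations."""
    applicability = {}  # guideline -> Applicable / Stopped / Started / Completed

    for point in timeline.points_in_chronological_order():
        if point.kind == "Plan-Start-Point":
            applicability[point.guideline] = "Applicable"
        elif point.kind == "Plan-Stop-Point":
            applicability[point.guideline] = "Stopped"
        elif point.kind == "Plan-Step-Point":
            applicability[point.guideline] = ("Started" if point.starts_guideline
                                              else "Completed")
        elif point.kind == "Plan-Latest-Start":
            # look forward for a later step of this guideline marked as Step-Too-Late
            late_step_found = any(
                step.has_explanation("Step-Too-Late")
                for step in timeline.later_steps_of(point.guideline, after=point.time)
            )
            if not late_step_found:
                if point.step_type == "Drug-Increase":
                    assess_missing_drug_increase(timeline, point)  # further investigation
                else:
                    timeline.add_explanation(point, "Missing-Action")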

When assessing a missing Drug-Increase clinical-step item, the system tries to explain the missing action by examining two common scenarios in which the medication-increase action should be canceled even if it is required according to the patient's data. The first scenario is reaching the maximal dose of a medication, a situation in which a further dose increase is not recommended. The second scenario is low compliance to the medication at the time of the decision, meaning the patient did not actually take the medication; in such a case, it is more reasonable to improve compliance than to increase the dose of the medication.

The process of investigating missing Drug-Increase actions is illustrated in a flow-chart in Figure 33.

Figure 33. Missing-Actions Analysis for drug-increase clinical steps.


The missing-action analysis of Drug-Increase clinical steps starts by checking whether the drug is currently at its maximal dose. Currently, this check of the maximal dose is implemented in a naïve fashion, by assigning maximal-dose values to specific medications (i.e., medications that were part of the guidelines used in the system's evaluation). In future improvements, the system can be extended with existing commercial databases that contain knowledge about medication doses, and use specific information about the patient, such as gender, age and weight, to decide on the maximal doses. If the latest dose of the medication equals the maximal dose, the system adds a computed-explanation of type Drug-Aborted-Maximal-Dose, which, in contrast to the negative guideline-compliance comment regarding a missing action, is a positive comment regarding a correct decision. If the drug is not at its maximal dose, the system examines whether there was a compliance problem before the time of the missing action; this is done by searching the TimeLine for drug-stop data points that have no reasonable explanation. If there is no compliance problem, a Missing-Action explanation is added. If there is a compliance problem, the system examines whether the compliance was increased, because good patient compliance to a medication is necessary in order to decide whether to increase its dose. If the compliance was increased, the system adds a computed-explanation of type Drug-Compliance-Increase-Revision, which is another positive guideline-compliance comment, added instead of the negative Missing-Action comment that is added if the compliance to the medication did not increase. In other words, the system expects good compliance to a medication before a decision to increase its dose, and thereby avoids prompting wrong comments about missing drug-increase actions.
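The decision tree above can be summarized in a few branches. The sketch below completes the assess_missing_drug_increase function referenced in the previous sketch; the maximal-dose table holds illustrative values only, and the timeline queries are assumed interfaces rather than the actual implementation.

# naive maximal-dose table, as in the current implementation (illustrative values only)
MAX_DOSE_MG = {"metformin": 2550, "atorvastatin": 80}

def assess_missing_drug_increase(timeline, point):
    """Decide whether a missing Drug-Increase step is a real compliance problem."""
    drug = point.drug_name
    if timeline.latest_dose_of(drug) >= MAX_DOSE_MG.get(drug, float("inf")):
        # correct decision: the drug cannot be increased further
        timeline.add_explanation(point, "Drug-Aborted-Maximal-Dose")
        return
    if not timeline.unexplained_drug_stop_before(drug, point.time):
        # no patient-compliance problem, so the increase is indeed missing
        timeline.add_explanation(point, "Missing-Action")
    elif timeline.compliance_increased(drug, point.time):
        # the physician reasonably improved compliance before increasing the dose
        timeline.add_explanation(point, "Drug-Compliance-Increase-Revision")
    else:
        # compliance problem that was not addressed: the increase is still missing
        timeline.add_explanation(point, "Missing-Action")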

Compliance Analysis Step 5: Results Summarization

In this final step of the compliance analysis, the results of the analysis process are filtered, sorted, and aggregated, and then represented in a structured manner that is later used for the presentation and storage of the results.

Extraction of the Relevant Results

During the application of the previous steps of the compliance analysis algorithm, the system generates a large number of computed-explanations, which are represented as TimePoint objects stored in the TimeLine. Some of these computed-explanations represent useful comments regarding compliance with clinical guidelines (e.g., a late drug administration action), whereas other computed-explanations are less useful for potential users and are relevant mostly for supporting the analysis process itself (e.g., an internal notion of the system regarding a detected drug-increase action). A full description of the taxonomy of the computed-explanations is presented later, in Section 3.6. In this phase of the Results Summarization, the system extracts the useful computed-explanations and filters out the less useful ones. The useful computed-explanations that are extracted in this phase represent the system's Comments regarding guideline compliance.

Intention-Related Comments are extracted from the outputs of the top-down analysis, in which each outcome-intention of a guideline-plan was evaluated using the Fuzzy-Reasoner. The outcome-intention-related explanations are represented as scored temporal-intervals, and for each outcome-intention the following sets of temporal-intervals are extracted: a set of temporal-intervals with an assessment regarding the achievement of the intention (e.g., the hemoglobin A1c goals were almost on target during a certain period of time, therefore membership score = 0.85); a set of temporal-intervals in which the intention should have been monitored (e.g., for a patient who was diagnosed with diabetes in January, the outcome-intentions should have been monitored from April); and a set of temporal-intervals in which the system detects insufficient data to determine the achievement of the outcome intention, i.e., intervals in which an intention was not monitored.

Data Item Comments are extracted from the outputs of the bottom-up analysis, in which, as explained, each data item in the patient record is evaluated and assigned zero or more computed-explanations. In general, it is reasonable to provide multiple explanations when analyzing data in a retrospective manner, such as the analysis performed by the DiscovErr system. However, for practical reasons, there is a need to organize these multiple computed-explanations in a manner that emphasizes the most reasonable computed-explanation provided for each data item. In this phase of the Results Summarization, the Compliance Analysis Engine uses a specific method, described in more detail in the following section, that sorts the computed-explanations of each data item according to a score representing their reasonability. The computed-explanation with the highest reasonability is selected as the guideline-compliance comment and is assigned to the examined data item.


Missing-Action Explanations are extracted from the output of the Missing-Actions-Analysis, in which the system scans the TimeLine and adds computed-explanations regarding any missing action of the operative guidelines. Extracting these computed-explanations is important, as missing actions are a common guideline-compliance problem.

Selecting the Most Reasonable Explanation

At this phase of the Results Summarization, the system sorts the multiple computed-explanations that were assigned to each data item, in a manner that emphasizes the most reasonable computed-explanation(s). Each computed-explanation of a data item that was generated in the bottom-up analysis is assigned the following three scores: (1) a score between 0 and 100 that represents the level of applicability of the guideline at the valid time of the data item; (2) a score between 0 and 100 that describes the strength of the relation between the parameter of the data item and the clinical guideline, computed by the Knowledge Adapter when retrieving the relations to the guidelines; (3) a score between 0 and 100 for the timing of the clinical step that is represented by the data item, available only for clinical steps that were expected according to an applicable guideline.

The system uses these scores to compare the alternative computed-explanations of a data item in the following manner:

If the guideline's applicability score and the score of the data item's relation to the guideline are NOT equal, the system will prefer the computed-explanation with the highest average of the guideline-applicability score, the data item's relation-to-guideline score, and the score for the timing of the clinical step. The intuition behind this logic is to prefer computed-explanations regarding applicable guidelines that are found more related to the steps described in the guideline and express better guideline compliance. Note that for guidelines that are not applicable, the timing score is always assigned 0, a fact that "punishes" these computed-explanations even further when they are compared to a computed-explanation regarding an applicable guideline.

If the guideline's applicability score and the score of the data item's relation to the guideline are equal, then a different logic is applied. In these cases, the system examines the guideline's applicability score and the complexity of the guideline's start conditions to determine the method of comparison.


The complexity of the start conditions represents the potential for them to hold: the more complex the conditions, the lower the potential for them to hold. The complexity of the start conditions is computed from the number of atomic logic relations in the filter and setup conditions, and from the type of logic operator between them. In a compound logic expression with a logic operator of type OR, a lower complexity score is assigned to expressions with a larger number of atomic relations. For example, the expression "Systolic Blood Pressure > 140 OR Diastolic Blood Pressure > 90" is considered less complex than the expression "Systolic Blood Pressure > 140", because it involves two relations and thus the potential for it to hold is greater. In a compound logic expression with a logic operator of type AND, a higher complexity score is assigned to expressions with a larger number of atomic relations. For example, the expression "Systolic Blood Pressure > 140 AND Diastolic Blood Pressure > 90" is considered more complex than the expression "Systolic Blood Pressure > 140", because it involves two relations that need to hold; thus, the potential for the overall expression to hold is lower.
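The thesis does not give an explicit formula, but one simple scheme consistent with the description above is to count atomic relations, treating each additional AND conjunct as adding complexity and each additional OR alternative as reducing it. The sketch below is such an assumed scheme, not the system's actual formula.

def start_condition_complexity(expr) -> float:
    """Assumed complexity measure for a start-condition expression tree.

    expr is either the string of an atomic relation (e.g., "SBP > 140"),
    or a tuple ("AND"/"OR", [sub-expressions]).
    """
    if isinstance(expr, str):          # an atomic relation
        return 1.0
    operator, operands = expr
    child_scores = [start_condition_complexity(op) for op in operands]
    if operator == "AND":
        # every additional conjunct must also hold, so complexity adds up
        return sum(child_scores)
    if operator == "OR":
        # every additional alternative makes the expression easier to satisfy
        return min(child_scores) / len(child_scores)
    raise ValueError("unknown operator: " + operator)

# "SBP > 140 OR DBP > 90" is less complex than "SBP > 140" alone:
#   start_condition_complexity(("OR", ["SBP > 140", "DBP > 90"])) == 0.5 < 1.0
# "SBP > 140 AND DBP > 90" is more complex:
#   start_condition_complexity(("AND", ["SBP > 140", "DBP > 90"])) == 2.0 > 1.0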

Cases in which the score for the guideline's applicability is high and exceeds the guideline-start-threshold are cases in which the system needs to compare two computed-explanations regarding different applicable guidelines. In these cases, the system will prefer computed-explanations regarding guidelines with the most complex start conditions (according to the definition of complexity presented above). The intuition behind this is that when the system detects a step that is expected by more than one applicable guideline, it prefers the computed-explanation regarding the guideline that is most specific to the examined scenario, as expressed by the complexity of its start conditions. For example, when a certain action can be explained by a guideline with the start condition "adult patient with diabetes" or by a guideline with the start condition "adult patient with diabetes and history of heart failure", the second explanation will be chosen.

Cases in which the score for the guideline's applicability is low are cases in which the system needs to compare different guidelines that are not applicable. In these cases, the system will prefer computed-explanations regarding guidelines with the simplest start conditions (according to the definition of complexity presented above). The intuition behind this is that when the system detects a step that relates to more than one non-applicable guideline, it prefers to judge the treatment according to the more general guideline. For example, in this case, when a certain action can be explained by a guideline with the start condition "adult patient with diabetes" or by a guideline with the start condition "adult patient with diabetes and history of heart failure", the first explanation will be chosen; as both guidelines are not applicable, if the physician had considered one of them to be applicable, it is more reasonable that it was the simpler one, with fewer constraints in its start conditions.

The method described above was designed to select the most relevant explanation in cases in which multiple explanations are assigned to a specific clinical action. Although this method considers cases in which multiple guidelines are applicable at a certain point in time, it does not provide a mechanism for handling contradictory recommendations that might exist when multiple guidelines are represented in the library. If such cases occur, the system prefers the explanation that involves the guideline that was evaluated with better compliance.
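Putting the pieces together, a sketch of the selection of the most reasonable explanation for a single data item is shown below. The field names and the 0-100 score ranges follow the description above, while the class layout, the pairwise-comparison helper, and the concrete guideline-start-threshold value are assumptions made for illustration.

from dataclasses import dataclass

GUIDELINE_START_THRESHOLD = 50  # assumed value; the thesis does not state the threshold

@dataclass
class CandidateExplanation:
    explanation_type: str
    applicability: float       # 0-100, applicability of the guideline at the item's valid time
    relation_strength: float   # 0-100, strength of the item's relation to the guideline
    timing: float              # 0-100, timing of the step (0 for non-applicable guidelines)
    start_condition_complexity: float

def prefer(a, b):
    """Pairwise comparison of two candidate explanations; returns the preferred one."""
    if (a.applicability, a.relation_strength) != (b.applicability, b.relation_strength):
        avg = lambda c: (c.applicability + c.relation_strength + c.timing) / 3.0
        return a if avg(a) >= avg(b) else b
    if a.applicability >= GUIDELINE_START_THRESHOLD:
        # both applicable: prefer the most specific (complex) start conditions
        return a if a.start_condition_complexity >= b.start_condition_complexity else b
    # both not applicable: prefer the simplest (most general) start conditions
    return a if a.start_condition_complexity <= b.start_condition_complexity else b

def most_reasonable(candidates):
    """Select the computed-explanation that becomes the compliance comment for a data item."""
    best = candidates[0]
    for candidate in candidates[1:]:
        best = prefer(best, candidate)
    return best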

3.5.3 Visualization of the Compliance Analysis Results

The compliance analysis process results in multiple compliance-related comments. The number of comments depends on the time period that was selected for the analysis, the number of items in the patient record, and the number of operative guidelines available in the knowledge base. In order to allow the users to explore these multiple comments, the DiscovErr system includes a graphical interface for the visualization of the results, called the Results Viewer.

The Results Viewer (see Figure 34) is an interactive user interface that organizes the results for the user and displays the relevant details when the user selects a comment to view. The Results Viewer interface divides the screen into two main areas. On the left part of the screen, the user is presented with the full set of compliance comments, which can be presented in a flat chronological manner or grouped by guidelines. When the user selects a specific comment, the details about the comment are presented in the bottom part (see Figure 35), and the patient data that is relevant to that comment is presented on the right part of the screen. The data presented on the right part of the screen is called Explaining Data, and provides a visual explanation of the compliance comment by visualizing the data of the clinical parameters that are involved, according to the guideline, in the clinical decision or action. In order to identify the clinical parameters that are involved in the guideline, the Results Viewer uses a special method of the Knowledge Adapter, which retrieves all the primitive parameters that abstract each condition of the guideline (Filter, Setup, Complete, etc.).

Figure 34. The Results Viewer graphical interface. The user can browse the compliance-related comments presented on the left side of the screen. When selecting a specific comment, the comment’s details are presented on the left bottom area, and explaining data for the relevant clinical parameters is presented on the right area.

Figure 35 presents the area of the screen in which the details of the compliance-related comments are presented. For each comment, the system displays the guideline plan's title, a textual description of the comment (highlighted), a score between 0 and 100 for the relation of the data item to the guideline, a score between 0 and 100 for the applicability of the guideline at the time, and, when relevant, a score between 0 and 100 for the timing of the clinical action. The example in the figure includes details about a compliance comment regarding a late LDL evaluation, which should have been performed adjacent to the diabetes diagnosis but was performed 484 days later.


Figure 35. Visualization of the details of a compliance-related comment.

Figure 36 presents the area of the screen in which the explaining data are presented, in order to provide the user with a visual explanation of the compliance-related comment. In order to determine the data relevant to the comment, the Results Viewer uses the Knowledge Adapter to retrieve all the primitive parameters that are involved in the guideline's conditions, i.e., all the primitive parameters that abstract the conditions of the guideline. In cases of comments that relate to a specific data item (i.e., data-item computed-explanations), the data for the parameter is presented in a time-series graph in which the specific data item is highlighted, presented together with the previous data for that same parameter (if it is not the first item for that parameter). The data for the parameters abstracting the guideline conditions is presented in a time-series graph up to the point in time of the current comment.

The example in Figure 36 visualizes a comment regarding an LDL measurement that was performed late according to a diabetes guideline that recommends monitoring the patient's LDL adjacent to the diagnosis of diabetes. As this is a data-item-related comment, the data for the LDL measurement is presented in the top graph, under the Data for Current Event label. The comment in the example relates to the first LDL measurement in the patient's record; therefore, the graph in the example displays a single highlighted LDL measurement. However, previous data points would have been presented in the graph had previous measurements been available for the LDL parameter.

The data for the hemoglobin A1c and Glucose parameters are presented in the graphs under the "Data that might be relevant" label, because these parameters abstract the diabetes-diagnosis concept, which is referred to by the filter condition of the diabetes guideline. When viewing the time-stamped data for these parameters, up to the point in time of the late LDL measurement, a professional user can understand that the LDL measurement was performed more than a year after the first high hemoglobin A1c measurement that determined the diabetes diagnosis. In this manner, the users can understand the reason for the system to provide the late-action comment.

Figure 36. Visual explanation of compliance-related comments.


3.6 The Taxonomy of Computable Compliance Comments

During the compliance analysis process of DiscovErr, multiple computed-explanations of multiple types are generated by the system; as mentioned earlier, some represent useful comments regarding compliance to clinical guidelines, whereas others are less useful for potential users and are relevant mostly for supporting the analysis process itself. In order to better explain the possible types of computed-explanations, I have created a Taxonomy of Computable Compliance Comments, in which the computed-explanations are organized in a hierarchical structure. The figures presented in this section illustrate the various parts of the taxonomy. Note that computed-explanations presented with dashed borders represent abstract computed-explanations that were added to organize the taxonomy; computed-explanations presented with bold borders are those used as guideline-compliance comments.

Figure 37 illustrates the taxonomy of computed-explanations regarding drug-related actions, which are generated during the first step of the compliance analysis, the TimeLine initialization. These computed-explanations are generated by analyzing the time-based sequences of administration orders of each medication.

Figure 37. Taxonomy of computed-explanations of abstracted medication (drug)-related actions

The drug-start, drug-increase, drug-maintain, drug-decrease and drug-stop computed-explanations are computed by comparing the current administration order to the previous order of the same drug, and are not provided by the system as direct compliance comments. The drug-compliance-problem and drug-compliance-improved computed-explanations are computed by analyzing drug-purchase records, and are provided by the system as negative or positive guideline-compliance-related comments.


Figure 38 illustrates the taxonomy of guideline control computed-explanations, which are generated during the top-down analysis.

Figure 38. Taxonomy of guideline control computed-explanations.

The guideline control computed-explanations are used only in the compliance analysis process and are not provided by the system as direct compliance comments.

Figure 39 illustrates the taxonomy of outcome-intention-related computed-explanations, which are also generated during the top-down analysis. The intention-latest-monitoring computed-explanation is only used internally in the compliance analysis process, whereas the intention assessments are provided as guideline-compliance comments.

Figure 39. Taxonomy of computed-explanations related to outcome intentions.

Each pair of intention-assessment-start and intention-assessment-end computed-explanations is assigned a score representing the compliance to the outcome intention, and is aggregated with the other intention-assessment pairs to represent the overall score of the compliance to the guideline goals over a time period.


Figure 40 illustrates the taxonomy of computed-explanations that are generated during the bottom-up analysis and assigned to each raw data item. The clinical-step-explanations are provided by the system as compliance comments.

Figure 40. Taxonomy of computed-explanations assigned to each raw data item.

The unexplained-data computed-explanation is assigned to data items that are not found to be related to any of the operative guidelines. Although these computed-explanations are not used in the compliance analysis and are not provided by the system as compliance-related comments, they can be used to track common parameters that appear in many places in the patients' records, and may point to important new guidelines that can be added to the knowledge base for future compliance analysis.

The affect-plan-state computed-explanations are used only by the compliance analysis process internally.


The clinical-step-explanations are divided into two categories. The expected-steps are assigned to clinical steps that were found to be expected by a guideline, and represent a comment regarding the compliance to the scheduling of the action; the scheduling comments are assigned as too-early, on-time or too-late, and include a score between 0 and 1 for the timing of the action. The unexpected-steps computed-explanations represent comments regarding actions that were not expected by any of the guidelines, and, as explained in the description of the bottom-up analysis, represent various types of redundant, non-compliant actions.

Figure 41 illustrates the taxonomy of computed-explanations that are generated during the Missing-Actions-Analysis, all of which can be provided by the system as compliance-related comments.

Figure 41. Taxonomy of computed-explanations related to missing actions.

The missing-action comments can relate to any diagnostic or therapeutic clinical step that is expected by some guideline. As explained earlier, missing actions of the type drug-increase are further examined, because increasing the dose of a medication can be aborted or suspended by physicians when the medication is at its maximal dose, or in cases of low patient compliance, in which the compliance should be improved before increasing the dose of the medication.

4

Evaluation


In this chapter, I describe the evaluation of the DiscovErr system, an evaluation designed to assess the feasibility of implementing DiscovErr in a real clinical quality-assessment setting. The experiment was performed in the diabetes domain, in which I specified knowledge from a state-of-the-art guideline in a formal representation, and applied the system to real retrospective patient data, to assess to what extent the treatment complied with the standard guideline. The data were presented to expert physicians, who were asked to manually evaluate the compliance of the treatment of a set of patients to the guideline by examining the electronic medical records. Then, the system's comments regarding compliance to the guideline were presented to the experts, and they assessed the correctness of these comments.

The general objective of the evaluation was to enable assessing the correctness (i.e., precision) and completeness (i.e., recall or coverage) of the comments provided by the system relative to the gold standard comments of the clinicians, when automatically analyzing electronic medical records for guideline-based compliance of the therapy manifested by these records.

Note that the terms completeness and correctness used here refer to different measures than the formal definitions of first-order logic, and describe a continuous grade for the level of coverage and level of correctness of the system’s comments regarding guidelines compliance issues.

4.1 Research Questions

The general objective of the research was to evaluate the feasibility of applying the system in realistic clinical settings. This led to the definition of several more specific research questions that aim to assess the quality and significance of the results of the compliance analysis. For each research question presented in this section (i.e., each specific objective), I describe here the general idea of how it was measured; more details are available in Section 4.2, which describes the experimental design, and in Chapter 5, which presents the experiments' results.


Question 1: Completeness: Does the system produce all or most of the important comments relevant to the task of assessing compliance to a guideline?

To use such a system in real clinical settings, it is necessary to evaluate the quality of its results. One dimension of the results' quality is completeness, or level of coverage, as it is important to know whether all or most of the deviations from the guidelines are detected by the system. I was interested in assessing the level of completeness and in analyzing it further by drilling down into the various types of treatment actions (e.g., medications versus lab tests and other monitoring actions).

To answer this question, I conducted an experiment in which the system and three medical experts examined the medical records of the same set of patients, and provided comments regarding the compliance to the same clinical guideline for the management of diabetes mellitus. The measure for completeness was defined as the proportion of the compliance-related comments mentioned by the majority of the medical experts that were also mentioned by the system.

Question 2: Correctness: Is the system correct in its comments regarding the compliance to the guideline?

Another dimension of the results' quality is the level of correctness of the system's output, sometimes referred to as precision. It is important to know the proportion of correct compliance comments provided by the system out of all of the system's comments. I was interested in assessing the correctness level and in analyzing it further by drilling down by the various types of treatment actions and by the significance or importance of the compliance to the guideline regarding each action.

To answer this question, I conducted an experiment in which two diabetes experts evaluated the correctness of the compliance-related comments provided by the system when analyzing the medical records of a set of patients. The measure for correctness was defined as the proportion of the system's comments that were evaluated as correct by the two experts.
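In other words, completeness plays the role of recall against a majority-vote gold standard, and correctness plays the role of precision. A small sketch of the two measures, under assumed data structures (comments reduced to comparable items), is shown below.

def completeness(system_comments, expert_comment_sets):
    """Proportion of the majority-expert comments that the system also produced."""
    majority = {c for c in set().union(*expert_comment_sets)
                if sum(c in s for s in expert_comment_sets) > len(expert_comment_sets) / 2}
    return len(majority & set(system_comments)) / len(majority) if majority else 1.0

def correctness(system_comments, judged_correct):
    """Proportion of the system's comments that the experts judged as correct."""
    return sum(c in judged_correct for c in system_comments) / len(system_comments)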


Question 3: Importance: Are the comments provided by the system significant and important for understanding the quality of treatment?

In addition to measuring the correctness and completeness of the compliance analysis results, I was also interested in measuring the level of significance of the comments provided by the system regarding the compliance. This aspect is related to the strength-of-recommendation and level-of-evidence annotations that are assigned to the recommendations in published clinical guidelines, annotations that describe the level of impact of the treatment action on the state of the patient. Note that these annotations are represented in the meta-data of the formal guidelines and are presented in the system's interface, but are not used as part of the compliance analysis algorithm, which may prompt compliance-related comments regardless of the strength-of-recommendation and level-of-evidence of the recommendations.

To answer this question, I conducted an experiment in which two diabetes experts evaluated the importance of the compliance-related comments provided by the system when analyzing the medical records of a set of patients. The measure for importance was defined as the proportion of the system's comments that were evaluated as important by the two experts.

Question 4: Performance: Does the system perform well regarding run times and memory consumption?

In order to better understand the feasibility of applying the system in real medical domains, it was important to me to evaluate the performance of the system regarding run times and memory consumption, in addition to evaluating the quality of the results. From my knowledge of the artificial intelligence field, there are many systems and theoretical solutions that achieve good results on their computational tasks but require prohibitive resources when applied to real-world problems. In this research, I intend to suggest an approach and tools to support improving the compliance to clinical guidelines; thus, it was important for me to learn whether the solution is scalable in terms of performance, to allow its implementation in a real clinical environment.

To answer this question, I applied the system on the data of a large set of patients, and monitored its behavior regarding CPU utilization and memory consumption.


Secondary Research Issues

In addition to the primary research questions presented above, I was also interested in examining two secondary issues that focus on the human aspects of the experiment, and to answer them to some extent.

Issue 1: Internal expert similarity: What is the similarity between the comments of several expert physicians, when assessing the compliance [of a care provider and/or patient] to a certain guideline, given the same set of patient records?

Issue 2: External expert agreement: What is the level of agreement between several experts regarding the quality of the system's assessments?

I found these issues interesting, as one of the major motivations for clinical-guideline implementation is its contribution to the reduction of the variance among treatments provided by different physicians. It was important for me to learn whether the experts mostly agree with each other when performing the task of compliance assessment, which is a different task from providing the actual treatment. Due to the fact that in the compliance analysis task the experts refer to the same guideline, I assumed that the experts would have a reasonable level of consensus.

In addition, evaluating the level of agreement between the experts is also essential in order to enable the evaluation of the system itself, as the experts' evaluation of the system has little meaning if there is no agreement between them.

To answer these questions, I used Cohen's Kappa statistic, applied to the experts' evaluations regarding the correctness and importance of the system's comments. In addition, I defined indirect measures to evaluate the completeness and correctness of the experts themselves. The specific details of these measures are described in the results chapter.
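For reference, Cohen's Kappa corrects the observed agreement between two raters for the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of the computation over two experts' parallel ratings (e.g., correct / partially correct / not correct) is shown below; the example values are illustrative only.

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters who rated the same items on a nominal scale."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# e.g., cohens_kappa(["correct", "correct", "not correct"],
#                    ["correct", "partially correct", "not correct"]) == 0.5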

4.2 Experimental Design

To answer these research questions, I designed a study that included several experimental steps, which enabled evaluation of several aspects of the overall framework.

A preliminary step before designing the study was to find a medical domain in which I could get access to electronic data of real patients, and recruit physicians who are experts in the same domain. I also wanted to select a domain that is suitable for clinical-guideline application, with well-established evidence, and in which the diagnostic and therapeutic process is performed over a sufficiently long period of time (i.e., a period of time of sufficient duration for performing meaningful longitudinal patient monitoring and therapeutic decisions that might adhere to a continuous-care guideline). After examining several medical domains, I decided to perform the experiment in the diabetes domain, in which I could obtain real patient data, and in which I managed to collaborate with two endocrinology experts from a large academic medical center who were willing to participate in the formal evaluation of the DiscovErr system, and with an expert in general internal medicine who, of course, frequently encounters patients who have diabetes.

It is important to mention that although I wished to evaluate the system in more than one medical domain, I had limited resources and decided, instead of performing several smaller experiments in several medical domains (collaborating with one expert in each domain), to focus more deeply on a larger experiment within a single domain. Evaluating the system in additional domains is certainly a part of the plans for future research.

The dataset I obtained included information about 2,038 patients diagnosed with diabetes, and comprised 378,273 time-oriented data records, including details about test results and medication orders and purchases that are relevant for assessment of compliance to the diabetes-management guideline. The records in this data set covered up to five years of treatment for each of the patients, together with general demographic information about the patient, such as gender and age.

In the following sections, I describe the structure of the study and the various experimental steps.

4.3 Performing the Experiment

Step 1: Creating a formal representation of an established diabetes-management guideline

The first step in the evaluation focused on the formal representation of a clinical guideline within the selected clinical domain, using the knowledge specification interface of DiscovErr. The guideline that was selected is the Standards of Medical Care in Diabetes 2014, published by the American Diabetes Association. This is a comprehensive guideline that is based on an extensive literature review and is updated every year. The guideline addresses multiple aspects of diabetes, from screening through diagnosis to multiple aspects of the long-term treatment.

While reviewing the textual guideline, I noticed that several of the sections are not relevant for the experiment, because they relate to data that I did not have in the dataset. For example, the section "medical nutrition therapy" was not relevant for specification, as I did not have any data regarding that aspect. Other examples include the sections about diabetes self-management education and physical activity.

The following sections were found relevant to the experiment, considering the data that I had: "Criteria for the diagnosis of diabetes"; "Risk analysis for pre-diabetic patients"; "Prevention and delay of type 2 diabetes"; "Definition of the glycemic goals"; "recommendations for hemoglobin A1c monitoring and evaluation"; "management of non-insulin drug therapy for type 2 diabetes"; "management of insulin therapy for type 2 diabetes"; and "dyslipidemia/lipid goals and management recommendations".

The specification of the guideline's knowledge according to the formal model involves three different types of knowledge, each of which is represented according to the relevant part of the formal model. (1) Declarative knowledge includes details about concepts that characterize the state of the patient; this type of knowledge was represented using the KBTA ontology (see Section 3.3.3). An example of a declarative concept is one that binds together the criteria for the diagnosis of diabetes. (2) Procedural recommendations are represented using the Asbru ontology (see Section 3.3.1); for example, the recommendation regarding hemoglobin A1c monitoring is a procedural recommendation, and was represented as a conditional plan (i.e., an if-then-else plan in Asbru), with the current patient's glycemic-level state as a condition that affects the recommended time of the next hemoglobin A1c evaluation. (3) Clinical steps are common atomic clinical actions, such as "increase insulin therapy", which are represented in the clinical-step library, where they are linked to one or more terms within the relevant standard vocabulary that can fill the role of the relevant action. Clinical steps can be used and reused in different parts of the guideline. An example of a clinical step is "initiate insulin therapy", a step that is linked to a term that is an element of a whole class of medications within the ATC drug classification system (i.e., different types of insulin medications), and that can be referred to from within multiple paths of the guideline (see Section 3.3.2).

The overall guideline specification was performed in two steps. In the first step, I created a prototype version of the formal guideline by detecting the relevant sections in the guideline, deciding on the optimal representation according to the formal model, and using the system's knowledge acquisition interface to formally specify the knowledge. In the second step, one of the expert physicians was involved, and assisted in validating the knowledge represented in the first prototype and in extending it with additional knowledge that was not explicit in the original guideline. An example of such knowledge that did not exist in the original guideline is the definition of the concept "reduced kidney function", which is important for the management of Metformin drug therapy. The definition of this concept was not mentioned in the guideline and had to be added by the expert by referring to additional sources. Another example of a decision that was made by the expert physician during the specification process is a general decision regarding the specification of the temporal aspect of recommendations about patient monitoring. In the first prototype, when specifying a recommendation about the temporal aspect of periodic monitoring (e.g., test hemoglobin A1c every 3 months), I used the temporal annotations of the Asbru language, "earliest-start" and "latest-start", to define a time range for the next measurement. After consulting with the expert physician, the "earliest-start" was removed in some sections of the guideline, to prevent the system from providing comments regarding tests that were performed too early. This was decided because in some cases in clinical practice, such as hemoglobin A1c and LDL monitoring, there is no limit to the number of tests (frequency of testing) of the same clinical variable that the physician can order. This relaxed constraint is, of course, true only for certain tests, which are simple to perform and are not too expensive.
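To illustrate the effect of this decision, the sketch below shows a simplified, Python-level rendering of a periodic-monitoring recommendation with Asbru-like temporal annotations; the attribute names, the anchoring expression, and the structure are illustrative assumptions only and do not reproduce actual Asbru syntax.

# Illustrative rendering of "test hemoglobin A1c every 3 months" with
# Asbru-like temporal annotations (not actual Asbru syntax).
hba1c_monitoring_prototype = {
    "clinical_step": "measure-hemoglobin-A1c",
    "earliest_start": "previous measurement + 3 months",  # removed after expert review
    "latest_start": "previous measurement + 3 months",
}

# Final representation for tests with no lower bound on testing frequency:
hba1c_monitoring_final = {
    "clinical_step": "measure-hemoglobin-A1c",
    "latest_start": "previous measurement + 3 months",  # only "too late" is commented on
}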

The process of formal representation of the guideline was concluded after applying the system to a small set of patients and briefly examining the results to validate the representation.

The overall knowledge specification process was completed in two weeks: the first week was dedicated to the first step of the process, creating the prototype representation, and the second week was dedicated to collaborating with the medical expert to validate and improve the guideline's representation.


Step 2: Preparation of the data and mapping it to concepts in the formal knowledge base

The data preparation is a preliminary step that enables application of the system to the medical records. The data I received was extracted from the operational EMR and separated into several files in an Excel format. I imported the data into the database, according to the schema I use, which includes the following two relations: the Patient relation, which is used to store the information about all the patients, including details about demographic parameters, such as age and gender; and the PatientData relation, which is used to store the temporal data, with information about test results and medication purchases. As mentioned, the database included 2,038 patients diagnosed with diabetes, and was composed of 378,273 temporal records, each representing, for example, a single laboratory measurement result.
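A minimal sketch of such a two-relation schema is shown below, using SQLite from Python; the column names are assumptions for illustration, since the text lists only the general content of each relation.

import sqlite3

conn = sqlite3.connect("discoverr_evaluation.db")
conn.executescript("""
    -- one row per patient, with demographic parameters
    CREATE TABLE IF NOT EXISTS Patient (
        patient_id INTEGER PRIMARY KEY,
        gender     TEXT,
        birth_year INTEGER
    );
    -- one row per time-oriented record (test result or medication purchase)
    CREATE TABLE IF NOT EXISTS PatientData (
        patient_id INTEGER REFERENCES Patient(patient_id),
        concept_id TEXT,      -- e.g., an ATC code or a mapped lab-test concept
        valid_time TEXT,      -- timestamp of the measurement or purchase
        value      TEXT
    );
""")
conn.commit()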

To allow the system to retrieve the relevant items in the patient's data at run time, there is a need to map between the concepts in the guideline's formal knowledge base and the terms in the patient database. This can be completed only after finalizing the specification of the guideline in a formal representation, when the complete set of relevant concepts is determined.

Regarding tests and measurements, the original data set did not include codes from standardized vocabularies for medical-concept identification. Therefore, I had to manually examine each term in the database, map it to the relevant declarative concept in the system's knowledge base, and, if necessary, perform a conversion to the units of measurement specified in the knowledge base. Examples of relevant test and measurement concepts include blood glucose, hemoglobin A1c, blood cholesterol, and creatinine lab tests.

Regarding drug-therapy concepts, the data included codes according to the WHO's ATC classification standard, a fact that simplified the mapping process. Using the interface of DiscovErr, I only had to select the relevant ATC terms and attach them to the concepts appearing within the appropriate steps in the formal clinical-step library. For example, the concept used within the step "initiate-insulin-therapy" was mapped to several potentially relevant ATC items, such as "A10AB: Insulins and analogues for injection, fast-acting" or "A10AC: Insulins and analogues for injection, intermediate-acting". Mapping the concepts appearing within the clinical steps to these higher-level classes of the ATC hierarchy allows the system to automatically detect the drug identifiers found in the electronic medical record and to relate them to the relevant items in the knowledge base.

Step 3: Training the expert physicians

A training session was conducted with each of the expert physicians. The session covered the following topics:

• A general overview of the field and of the specific research and its goals.

• Presentation of the diabetes guideline, which was provided to the expert in a printed copy and in an electronic format; in both formats, the relevant sections were visually marked. It is important to note that all of the experts were quite familiar with the selected guideline, so this topic was covered within a short time.

• Description of the patients' dataset, its source, structure, available clinical variables, and data format.

• Demonstration of the system by performing a compliance analysis of two or three demonstration patients. I found that the demonstration allowed the experts to better understand the idea of the system, and to understand the nature of the comments provided by the system regarding compliance to the guideline.

• A short training session regarding the user interface that was used by the expert physicians both to insert their own evaluations of the actions found within the patient's record, and to evaluate the system's assessment of these actions (see the interface's description below).

Step 4: Manual compliance analysis of the patients' management, by the expert physicians, performed on a randomly selected set of patients

Following the training step, the experts were provided with a convenient (visual) interactive interface for browsing the complete set of longitudinal data, of multiple types, of a randomly selected subset of patients from the database that was introduced to them earlier, for the full duration of time for which the patient was followed (up to five years per patient) (see Figure 42). They were then asked to perform the two evaluation tasks (i.e., assessment of the quality of care, and assessment of the quality of the DiscovErr system's comments).

Specifically, I first asked each of the expert physicians to review the data of the same 10 randomly selected patients, which comprised, altogether, 1,584 time-oriented records, and to manually add comments regarding the compliance of the patients' treatment to the diabetes guideline. At this point, the experts could not yet see the system's comments regarding the compliance of these patients' treatment to the guideline. They were provided with a user interface for the visualization of the patient's raw temporal data, and they also added their comments using this interface.

For each of their comments, the experts were asked to provide the date of the clinical event, the importance/significance of the clinical issue, the type of the comment, and a textual description of the clinical issue. For the type of the comment, they could select from a given set of comment types, or insert their own type of comment. The given set of values is presented in Table 4.

Table 4. The set of comment types from which the experts could select to describe their comments

Comment Type (description):
On Time (expected action that was performed in a timely manner)
Early Action (expected action that was performed too early)
Late (expected action that was performed too late)
Missing (missing expected action)
No Support (action that should not have been started at the time)
Redundant (another action with the same intention was performed before)
Duplicate (another action with the same intention was performed at the same time)
Guideline Contradicted (opposing action that contradicts the recommendation of the guideline)
Patient Compliance (low compliance to medications, manifested by missing medication-purchase records)
Insights (any insight regarding the patient state or other aspects of the treatment)


The experts were not limited regarding the time they could invest in the evaluation of each patient, except for their own time constraints; they could browse several years of data for each patient, and add as many comments as they wanted.

Figure 42 presents the visual interface used by the experts to browse the longitudinal patients' data. Figure 43 shows the interface used by the expert physicians for adding their manual comments about the patient's management.

Figure 42. The interface for visualization of the raw patient data. On the left side of the screen, the expert can select the clinical concept to view from the concept tree, which includes only the concepts appearing in the specific patient's datasets; the data for the selected concept is then presented graphically in a panel on the right. Multiple panels can be displayed concurrently, and it is possible to zoom into and out of each panel to a certain extent. A tooltip (with highlighted borders in this figure) enables the expert to examine the precise value of each data point.


Figure 43. The interface for adding an expert comment about the patient's management. The expert can add a new comment regarding some event and provide details regarding the date of the event, a textual comment, the importance/significance of the issue, and the comment type, using the comment taxonomy, as explained in the text.

Step 5: Assessment, by the expert physicians, of the comments provided by the system

In this step, I asked the two diabetes experts to evaluate the comments of the system regarding the compliance to the diabetes guideline. The experts used the output interface of the system, as previously described (see Section 3.5.3), to explore the guideline-compliance comments. The comments were presented together with the relevant patient data in a visual manner. Using the meta-commentary part of the interface, the experts assessed the system's comments regarding correctness and importance (see Figure 44).


Figure 44. The interface for evaluating the comments of the system. The expert can view the list of comments on the top left side of the screen, and when selecting a specific comment, the relevant information is displayed on the bottom left side. On the right side, the data relevant to the selected comment is presented.

For each system comment, the expert marked whether the comment is correct according to the guideline, when considering the data that was available to the care provider at the specific point in time. In addition, the expert marked whether the comment is important, to express an opinion regarding the level of clinical significance of the issue referred to by the comment (regardless of whether the comment itself is judged as correct or incorrect). For the correctness evaluation, the expert could select from three options: correct, partially correct, and not correct; for the importance evaluation, the expert could select from two options: important and less important. (The binary scale emerged from the discussions with the experts; in fact, in the pilot experiment there was an additional level, very-important, but it presented a problem in maintaining consistency when assessing the system's comments; for example, different instances of a comment regarding late hemoglobin A1c lab tests were annotated both as important and as very-important by the same expert.) In addition to annotating the correctness and importance of the system's comments, the experts could enter an optional free-text comment when they wanted to add additional information.


Figure 45 presents the area of the screen that was used by the experts to add their evaluation of each of the system's comments.

Figure 45. The area of the screen in which the experts evaluated each specific system comment. In the top frame, the details of the comment are presented with a generated explanatory text (highlighted); in the upper part of the bottom frame, the expert added an evaluation regarding the correctness and importance of the comment (correct and important in the figure). The colored buttons in the lower part of the bottom frame were added to save time for the experts; by representing common combinations of assessments, they allowed the experts to answer the two questions (otherwise answered using the radio buttons) in a single click. Note that although it is not entirely clear from the interface that these buttons are only an alternative to answering the two questions, after a short experience with the system the experts found them useful in saving many clicks during the evaluation of the hundreds of the system's comments.


Step 6: Meta-critiquing-analysis of the experts' results and preparation of the data for quantitative analysis

The final step of the evaluation included additional preparation that was required to support a deeper quantitative analysis of the results. In this step, the knowledge engineer performed a meta-critiquing-analysis by scanning each of the results of the previous evaluation steps, both of the system and of the experts, and extended them with additional metadata.

Meta-critiquing-analysis of the Experts' Comments

In the first step of this meta-critiquing-analysis, I examined the comments that were added by the experts in the step of manual evaluation of the management of the patients' treatment. I examined each expert's comment and compared it to the comments of the other two experts and to those of the system. Each comment was then annotated with three additional attributes that indicated whether it was detected by (1) the system, (2) the first of the two other experts, and (3) the second of the two other experts. For example, when examining a comment of the first diabetes expert, for each patient I scanned the full set of comments of the system, of the second diabetes expert, and of the family physician expert, and annotated whether it was included in each of these comment sets. Figure 46 presents a system screenshot that illustrates the environment I used to examine the experts' comments. By opening four instances of the system, I scanned hundreds of comments provided by each expert, and annotated whether each comment was detected by the system and by the other experts.
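To make the resulting metadata concrete, the following is a minimal sketch in Python (the class and field names are hypothetical and do not reflect the system's actual schema) of the kind of annotation record produced for each expert comment:

```python
from dataclasses import dataclass

@dataclass
class ExpertCommentAnnotation:
    """Metadata added to one expert comment during the meta-critiquing-analysis.
    Field names are illustrative only."""
    patient_id: str
    comment_text: str
    detected_by_system: bool         # a parallel comment exists in the system's comment set
    detected_by_second_expert: bool  # a parallel comment exists for the first of the other experts
    detected_by_third_expert: bool   # a parallel comment exists for the second of the other experts

example = ExpertCommentAnnotation(
    patient_id="patient-03",
    comment_text="Insulin therapy was initiated too late",
    detected_by_system=True,
    detected_by_second_expert=True,
    detected_by_third_expert=False,
)
```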

It is important to mention that in this part of the meta-critiquing-analysis, I had to deal with cases in which the same issue was phrased differently by each expert. This was solved by manually reviewing the comments and considering the semantic meaning of each comment. For example, the same scenario of a (too) late clinical action can be described as a "late action" by one expert, or as a "missing action" in the earlier period by another expert. Figure 47 presents a zoom-in illustration, demonstrating the comments of the three experts regarding a compliance problem of late insulin therapy given to a certain patient. The example in this figure demonstrates the problem of the experts' different phrasing of the same treatment compliance problem.


Figure 46. A system screenshot illustrating the environment for the meta-critiquing-analysis of the experts' comments. I opened three instances of the system to compare the comments of the three experts. The comparison was performed by examining the comments of each expert and searching for parallel comments in the sets of the other experts' comments. Although not seen in this figure, I also compared the comments of each expert to those of the system by opening another window to view the system's comments, using the interface for viewing system comments.

Figure 47. A zoom-in illustrating an example of compliance comments of the three experts regarding the same patient. In this example, all three experts commented on the same issue, regarding late initiation of insulin therapy. I used the checkboxes on the bottom part of the screen to annotate whether this comment was detected by all other experts and by the system. In this example, Expert 1 and Expert 2 phrased the problem as late initiation of insulin treatment started in August 2004, whereas Expert 3 phrased it as a problem of missing insulin in January 2004. The system's comment is not visible in this figure, but it also referred to the problem as late insulin initiation, in a similar manner to that of Expert 1 and Expert 2. Despite this difference in phrasing, this treatment problem was marked as detected by all experts and by the system in the final meta-critiquing-analysis.

An additional issue that had to be addressed in the meta-critiquing-analysis of the experts' comments is the fact that some of the comments provided by the experts referred to knowledge belonging to sections of the source guideline that were not included in the formal representation, or even to other guidelines. Comments of this type were annotated as "out-of-scope" and were excluded from the formal analysis of system completeness. More details about these out-of-scope comments are provided in the Results chapter.


Meta-critiquing-analysis of the experts' evaluations of the system's comments

In the second step of the meta-critiquing-analysis, I re-examined the comments provided by the system. As explained earlier, the system's comments were already evaluated in the previous step regarding correctness and importance by each of the diabetes experts. However, to support additional levels of result analysis, I re-examined each comment to determine whether it existed in the set of patient-management comments provided by each diabetes expert when they made their own compliance comments, before they were exposed to the comments of the system. Figure 48 illustrates the interface I used for this meta-critiquing-analysis, which is based on the one used by the experts to evaluate the system's comments, but with an additional checkbox to indicate whether the comment was also made by the same expert when reviewing the management of the patient on his/her own.

Figure 48. The interface I used to perform the meta-critiquing-analysis of the system's comments. Each comment that was evaluated in the previous step by the expert regarding its correctness and importance was now examined to determine whether it existed in the set of comments provided by the same expert when manually assessing the compliance of the treatment, before seeing the system's comments. In this example, the comment was evaluated as correct and important by the expert, but was not detected by the same expert before seeing the system's comments; thus, the checkbox "Made by the Expert" at the bottom of the screen is unchecked.

In this manner, each comment of the system was extended with additional meta-attributes that annotated whether it was, in fact, mentioned by each of the two diabetes experts.


Step 7: Application of the system to the electronic data and assessment of the performance regarding runtime and memory consumption

In this step of the experiment, the system was applied to the data of all patients in the dataset, in order to measure runtimes, memory consumption, and CPU utilization.

Specific time measures were implemented to evaluate the following aspects: (1) the time needed for the system to load the data of all patients stored in the database and represent them according to the in-memory data structures used by the system, and (2) the time needed for the system to analyze the compliance with the diabetes guideline for all patients. In addition, the system's memory consumption and CPU utilization were monitored using the standard tools available in the Windows Task Manager. The results of this performance experiment are detailed in Chapter 5.

5

Results


5.1 Completeness of the System's Comments

The results in this section relate to the first part of the experiment, namely, the manual compliance analysis of the patients' management, performed by the three expert physicians on a randomly selected set of patients. The results provide insights into several interesting aspects, including the time needed for the experts to complete the compliance-analysis task, the distribution of the types of compliance issues found by the experts in the records of the patients' management, a comparison of the experts based on their compliance comments, the completeness of the system's comments relative to one or more of the experts' comments, and more.

Table 5 displays general information on this first part of the experiment, which included two diabetes experts and one family medicine expert. Each expert evaluated the longitudinal record of each patient within a group of 10 patients who were randomly selected before the evaluation. The average time period of patient data was 5.25 years. The overall data of the 10 patients consisted of 1,584 time-oriented single medical transactions (e.g., a single laboratory test result or a single administration of a medication), i.e., an average of 158 time-oriented medical transactions per patient, which had to be examined by the experts.

Table 5. General information about the manual compliance analysis experiment.

Diabetes Experts                              2
Family Medicine Expert                        1
Patient Records                               10
Average Time Period of Patient Data           5.23 years
Minimum Time Period of Patient Data           4.25 years
Maximum Time Period of Patient Data           5.5 years
Total Time-Oriented Medical Transactions      1,584

Table 6 shows the summary of the evaluation time. The number of sessions, the overall time, and the average time per patient are presented in the table for each of the experts. The mean time for an expert to examine a single patient was 27 minutes.


Table 6. Summary of the compliance evaluation time.

                         Number of Sessions   Overall Time (hours)   Average Time per Patient (minutes)
Diabetes Expert 1        2                    4.5                    27
Diabetes Expert 2        1                    4                      24
Family Medicine Expert   3                    5                      30
All                      6                    13.5                   27

Table 7 shows the number of comments made by the experts in this part of the experiment. From the total of 381 comments mentioned by all experts, 31 were labeled as insights (i.e., comments that are not directly related to guideline compliance), 21 were out of the scope of the guideline's sections that were used in the experiment, and 329 were compliance comments within the scope of the guideline. These 329 comments are further analyzed in the rest of this section.

Table 7. Experts' comments regarding compliance to the guideline.

                         All Comments   Insights   Comments Out of Scope   Comments in Scope
Diabetes Expert 1        100            2          12                      86
Diabetes Expert 2        125            5          2                       118
Family Medicine Expert   156            24         7                       125
All                      381            31         21                      329

Table 8 shows the distribution of the compliance comments with respect to the type of clinical action to which they relate; 66% of the comments concerned drug actions and 34% concerned test and monitoring actions.

Table 8. Distribution of the experts' comments with respect to the type of clinical action to which the comment refers.

Clinical Action Type    Comments   %
Medication              216        66%
Tests and Monitoring    113        34%
All                     329        100%

Table 9 presents the distribution of the experts’ comments regarding the type of compliance comments.

Note that "Action is on time" is the only positive comment type, denoting the fact that, in the expert's opinion, the action is a correct one and was performed in a timely manner.


Table 9. The distribution of the experts' comments with respect to the type of compliance issue.

Compliance Issue Type                                                           Comments   %
Late Action                                                                     118        36%
Action is On Time (an action expected by the guideline is performed on time)    61         19%
Missing Actions                                                                 59         18%
Patient Compliance                                                              56         17%
No Support (action should not be started at this time)                          32         10%
Redundant (another action with the same intention was performed previously)     1          0.3%
Early Action (action performed too early)                                       1          0.3%
Guideline Contradicted                                                          1          0.3%
All                                                                             329        100%

The completeness of the DiscovErr system

Tables 10, 11 and 12 display the number of comments mentioned by each of the experts, and the portions of these comments that were also mentioned by another agent, namely, by the system or by another expert. The portion of an expert’s comments that were mentioned by another agent expresses the completeness of the comments of the other agent relative to the “anchor” expert. Following these results, I also show the completeness of each agent relative to a consensus of the experts.

The system made 92% of the comments made by Diabetes Expert 1 (Table 10), 85% of the comments made by Diabetes Expert 2 (Table 11), and 82% of the comments made by the Family Medicine Expert (Table 12).

Table 10. Completeness of all agents relative to the comments made by Diabetes Expert 1.

                                      Comments   Completeness of the comments made by another agent
Total comments                        86
Mentioned by the System               79         92%
Mentioned by Expert 2                 61         71%
Mentioned by Family Medicine Expert   64         74%


Table 11. Completeness of all agents relative to the comments made by Diabetes Expert 2.

                                      Comments   Completeness of the comments made by another agent
Total comments                        118
Mentioned by the System               100        85%
Mentioned by Diabetes Expert 1        57         48%
Mentioned by Family Medicine Expert   73         62%

Table 12. Completeness of all agents relative to the comments made by the Family Medicine Expert.

                                      Comments   Completeness of the comments made by another agent
Total comments                        125
Mentioned by the System               103        82%
Mentioned by Diabetes Expert 1        63         50%
Mentioned by Diabetes Expert 2        78         62%

In Summary: The completeness of the system relative to the comments made by each of the anchor experts was always higher than the completeness of each of the other experts relative to the comments made by the anchor expert. This relation was found to be statistically significant when performing multiple proportion tests (Binomial distribution tests) comparing the completeness of the system to the completeness of each expert.

Table 13 displays a summary of the completeness of all of the agents relative to the union of all of the comments made by all of the experts (note that this union preserves duplicates, i.e., repeated comments of different experts mentioning the same compliance issue); the results are also presented separately for medication-related comments and for test- and monitoring-related comments. The overall completeness of the system is 86%; the completeness for medication-related comments is also 86%, and the completeness for test- and monitoring-related comments is 92%. Although the completeness is higher for test- and monitoring-related comments than for medication-related comments, a proportion test showed that this difference is not significant (z = 1.68, p = .092).
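The proportion tests reported in this chapter can be illustrated by a minimal sketch in Python; the following assumes a standard pooled, two-sided two-proportion z-test (the specific test variant is my assumption) and, using the counts from Table 13 below, it reproduces the reported z = 1.68 and p = .092.

```python
from math import sqrt, erfc

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-proportion z-test (two-sided); a sketch of the kind of
    proportion test reported in this chapter."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value under the normal approximation
    return z, p_value

# Test- and monitoring-related vs. medication-related completeness (Table 13)
z, p = two_proportion_ztest(104, 113, 185, 216)
print(f"z = {z:.2f}, p = {p:.3f}")  # approximately z = 1.68, p = 0.092
```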


Table 13. Summary of the completeness of the comments made by all agents, relative to the union of the comments made by the three experts.

                                                Medication-Related Comments   Test-Related Comments   All Comments
Total Expert Comments (including repetitions)   216                           113                     329
Mentioned by the System                         185 (86%)                     104 (92%)               282 (86%)
Mentioned by Diabetes Expert 1                  149 (69%)                     57 (50%)                206 (63%)
Mentioned by Diabetes Expert 2                  170 (79%)                     84 (74%)                257 (78%)
Mentioned by Family Medicine Expert             179 (83%)                     86 (76%)                262 (80%)

To further analyze the completeness of the system, the comments were divided into three groups according to the level of support by the experts: comments mentioned by only one expert, by exactly two experts, or by all three experts. Then, by meticulously examining the text of all comments, the number of unique compliance issues was counted for each group, where a unique issue is a specific clinical-management topic that may be mentioned, possibly in different words or as part of another comment, by one or more experts and by the system. Notice that in the previous analysis, I counted the number of separate instances of the experts' comments, without checking whether some of them referred to the same compliance issue.
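The grouping of unique compliance issues by their level of expert support, and the resulting per-level completeness of the system, can be sketched as follows (a minimal illustration in Python; the data structure and the two example entries are hypothetical and do not reproduce the actual annotations):

```python
from collections import defaultdict

# Hypothetical meta-analysis annotations: for each unique compliance issue,
# which experts mentioned it and whether the system also detected it.
unique_issues = [
    {"experts": {"Diabetes Expert 1", "Diabetes Expert 2", "Family Medicine Expert"}, "system": True},
    {"experts": {"Diabetes Expert 1"}, "system": False},
    # ... one entry per unique compliance issue
]

per_level = defaultdict(lambda: {"detected": 0, "total": 0})
for issue in unique_issues:
    level = len(issue["experts"])            # number of supporting experts (1, 2, or 3)
    per_level[level]["total"] += 1
    per_level[level]["detected"] += int(issue["system"])

for level in sorted(per_level):
    counts = per_level[level]
    completeness = counts["detected"] / counts["total"]
    print(f"Mentioned by {level} expert(s): completeness = {completeness:.0%}")
```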

Table 14 displays the distribution of the comments by the level of the experts’ support; 28% of the unique compliance issues were mentioned by all three experts (note that 50 unique compliance issues were mentioned by three experts, leading to a total of 150 instances of comments in that particular case); 27% of the unique compliance issues were mentioned by two experts; and 46% of the unique compliance issues were mentioned by only one expert. In total, 55% of the unique compliance issues were mentioned by two or more experts out of the three, i.e., by a majority of the experts.

Table 14. The level of support, by the three experts, to the unique compliance issues.

                             All Comments   All Unique Compliance Issues
Mentioned by one Expert      83             83 (46%)
Mentioned by two Experts     96             48 (26.5%)
Mentioned by three Experts   150            50 (27.5%)
All Comments                 329            181 (100%)


Table 15 displays the completeness results analyzed for each level of support. The completeness is expressed by the portion of unique compliance issues detected by the system from the overall number of unique compliance issues in each group. In the case of comments mentioned by only one expert, the completeness of the system was 66%; in the case of comments mentioned by two experts, the completeness was 83%, and in the case of comments mentioned by all three experts, the completeness was 98%. Applying in each case a proportion test, the completeness by the system of the unique compliance issues mentioned by all three experts was found to be significantly higher than the completeness by the system of the unique compliance issues that were mentioned by only two experts (z = 2.51, p = .012), which, in turn, was found to be significantly higher than the completeness of the system with respect to unique compliance issues that were mentioned by only one expert (z = 2.11, p = .035).

Table 15. Completeness of the system’s comments relative to the unique compliance issues, by level of support of the three experts.

                             Unique Compliance Issues   Compliance Issues Detected by the System   Completeness (exact number of supporting experts)   Completeness (at that support level or higher)
Mentioned by one Expert      83                         55                                         66%                                                  80%
Mentioned by two Experts     48                         40                                         83%                                                  91%
Mentioned by three Experts   50                         49                                         98%                                                  98%
All                          181                        144

The trend of an increasingly higher completeness of the system's comments for unique compliance issues that are supported by a larger number of experts was repeated when I analyzed the unique compliance issues mentioned by only the two diabetes experts (i.e., ignoring the comments of the family medicine expert). In the case of unique compliance issues mentioned by both of the diabetes experts, the completeness of the system was 95%, which was found to be significantly higher (z = 2.8, p = .005) than the 78% completeness of the system in the case of unique issues mentioned by only one diabetes expert (Table 16).

(In Section 5.3, I perform an indirect completeness and correctness analysis, which compares all of the agents, including the experts themselves, as well as the system.)


Table 16. Completeness of the system’s comments relative to the unique compliance issues, by level of support of the two diabetes experts.

                                  Unique Compliance Issues   Compliance Issues Detected by the System   Completeness
Mentioned by 1 Diabetes Expert    86                         67                                         78%
Mentioned by 2 Diabetes Experts   59                         56                                         95%
All                               145                        123                                        85%

In Summary: The system’s completeness for comments that had the support of a majority of the experts (i.e., two or three out of three) was 91%.

The completeness of the system’s comments was significantly higher for comments that had higher levels of support by the experts, whether the expert panel included all of the three experts or only the two diabetes experts. In particular, the system generated 98% of the comments in the cases in which all three experts agreed on the comment, and 83% of the comments on which two of the three experts agreed (while generating only 66% of the comments that were made by only a single expert).

An Error Analysis of the System’s Incompleteness

To better understand the scenarios in which the system missed important guideline compliance comments, I examined the nine unique compliance issues that were mentioned by at least two experts but were not mentioned by the system. These undetected compliance issues were examined to identify the scenarios in which the system did not provide a comment. To better grasp the frequency of each type of system error, and how they might have arisen, I grouped the undetected compliance issues into several critique categories, or scenarios, and counted the number of compliance issues occurring in each scenario group.

Table 17 displays the scenarios in which the system did not mention any of the unique compliance issues, although they were mentioned by at least two experts in their comments, listing the frequency of each scenario. An explanation of each scenario is provided in the paragraphs below.


Table 17. Distribution of the scenarios in which unique compliance issues were not detected by the system.

Scenario in which the system did not detect the unique compliance issue   Undetected Unique Compliance Issues   %
Low patient compliance                                                    5                                     56%
No agreement between the diabetes experts                                 2                                     22%
Missing knowledge about medication dose decrease guidelines               1                                     11%
Missing replacement therapy                                               1                                     11%
All                                                                       9                                     100%

Low patient compliance: In this scenario, the experts' comments were about a redundant or missing increase of a medication dose in situations in which the patient's compliance with the medication was low, as reflected by the drug-purchase data. In some cases, the experts mentioned that the dose of the medication should have been increased, while the system mentioned that the patient's compliance should have been increased. In other cases, in which the medication dose was increased, the experts commented that the increase should have been avoided because of the previously low compliance, while the system labeled the actual increase as a correct action, because of the patient's good compliance during a short period before the dose increase.

No agreement between the diabetes experts: In this scenario, the undetected comments were mentioned by one of the diabetes experts and by the family medicine expert, while the second diabetes expert did not mention the comment, but instead had made a comment similar to that of the system.

Missing knowledge about medication dose decrease: In this scenario (which occurred in only one case), the experts' comments were about a correct or missing medication-dose decrease. The system did not comment about these issues at all, because the guideline that was used did not include specific recommendations about when to decrease medication dosing. Such meta-knowledge reflects a particular utility function, namely, a preference for dose minimization, which is common expert knowledge but is not part of the formal guideline.

A missing replacement therapy: In this scenario (which occurred in only one case), the experts commented about a missing alternative treatment when they detected an unexplained medication stop. The comment of the system referred only to the unexplained stop, and not to the missing alternative therapy. When the alternative therapy was eventually instituted, the system erroneously considered it to be timely, since the replacement medication was viewed as being administered in the context of a new instance of the guideline, while the experts labeled it as being too late, since it was necessary to perform that action when the medication was stopped.

5.2 Correctness of the System's Comments

The results in this section relate to the second part of the experiment, in which the two diabetes experts evaluated the correctness and importance of the compliance comments mentioned by the system. This evaluation was performed on the data of the same patients who were examined by the experts in the first part of the experiment. The results provide insight into several aspects, including general information about the experiment, the distribution of the types of system comments, results regarding the correctness and the importance of the system's comments, and the relationship between these two measures.

Table 18 displays general information regarding this part of the experiment. The experiment included two diabetes experts who examined the comments of the system regarding the diabetes guideline compliance manifested in the medical records of the same 10 patients who were manually evaluated by the experts in the previous step of the experiment.

Table 18. General information about the correctness experiment.

Diabetes Experts    2
Overall Patients    10

Table 19 displays information about the time used by the experts to complete the comment evaluation task. The mean time for the experts to evaluate the comments provided by the system on a single patient was 9.5 minutes.

Table 19. Summary of the correctness evaluation time.

                      Number of Sessions   Overall Time (hours)   Average Time per Patient (minutes)
Diabetes Expert 1     2                    2                      12
Diabetes Expert 2     1                    1.5                    9
All                   3                    3.5                    9.5


Table 20 displays the number of comments provided by the system when applied to the data of all 10 patients. The system provided a total of 279 comments, 59% regarding issues related to the compliance to tests and monitoring recommendations, and 41% regarding issues related to the compliance to medication therapy recommendations. Table 20 also displays the number of the system’s comments that were actually evaluated by the experts. A total of 172 comments were evaluated (62% of the comments by the system): 100% of the medication-therapy-related comments and 35% of the tests-and-monitoring-related comments. The reason for this discrepancy is that only the data of three of the patients were fully evaluated regarding the correctness of the tests and monitoring comments made by the system, due to constraints on the experts’ time.

Table 20. System's comments regarding the compliance to the guideline.

                                                 Comments   %      Evaluated by the Experts   % of System Comments Evaluated
System comments - Tests and monitoring related   165        59%    58                         35%
System comments - Medication therapy related     114        41%    114                        100%
Total system comments                            279        100%   172                        62%

To validate the consistency of the experts’ assessments of the system’s comments, I started by calculating the level of agreement about the validity of these comments between the experts themselves, before proceeding further in the analysis.

Table 21 displays the numbers of comments in each possible combination of the correctness assessment of the two diabetes experts. I measured the inter-expert agreement using a weighted version of Cohen’s Kappa coefficient, assigning linear disagreement weights according to the distance between the ordered values of the correctness ordinal scale (correct, partially-correct, not-correct). Cases in which both experts agreed were assigned a weight of 0; cases in which the experts did not agree, but with a distance of only one level between the assessments (i.e., correct/partially- correct, partially-correct/not correct), were assigned a weight of 1; and cases in which the experts did not agree, with a distance of two levels between the assessments (correct/not-correct), were assigned a weight of 2.


The weighted Kappa was 0.61, a value that represents a good agreement [Altman 1991], and is significantly higher compared to a chance value of 0 (p < .05).

Table 21. The diabetes experts' assessments of the correctness of the system's comments.

                        Diabetes Expert 2:
Diabetes Expert 1       Correct      Partially Correct   Not Correct   Total
Correct                 139          5                   0             144 (84%)
Partially Correct       12           6                   2             20 (11%)
Not Correct             1            1                   6             8 (5%)
Total                   152 (88%)    12 (7%)             8 (5%)        172
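For illustration, a minimal sketch in Python (using NumPy; not the code used in the actual analysis) of the linearly weighted Kappa calculation, applied to the confusion matrix of Table 21, reproduces the reported value of approximately 0.61:

```python
import numpy as np

# Confusion matrix of the two diabetes experts' correctness assessments (Table 21);
# rows: Diabetes Expert 1, columns: Diabetes Expert 2,
# ordered as correct / partially correct / not correct.
counts = np.array([[139, 5, 0],
                   [ 12, 6, 2],
                   [  1, 1, 6]], dtype=float)

def weighted_kappa(counts):
    """Cohen's Kappa with linear disagreement weights (0, 1, 2), as described above."""
    n = counts.sum()
    k = counts.shape[0]
    weights = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))  # |i - j|
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n  # chance-expected counts
    observed_disagreement = (weights * counts).sum() / n
    expected_disagreement = (weights * expected).sum() / n
    return 1 - observed_disagreement / expected_disagreement

print(round(weighted_kappa(counts), 2))  # approximately 0.61
```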

Table 22 displays the level of agreement between the diabetes experts regarding the truth value of the correctness of the system’s comment when assessing the system’s comments (whether agreeing that the system is correct or agreeing that the system’s comment is incorrect). The experts fully agreed on the correctness, partial correctness, or incorrectness of 151 of the 172 evaluated system’s comments (87.8%), partially agreed on the truth value of an additional 20 of the 172 comments (11.6%) (i.e., the comment was evaluated as partially correct by one expert and as correct or not-correct by the other expert), and did not agree at all on only one of the system’s comments (0.6%) (i.e., one expert assessed the system’s comment as correct, and the other expert assessed the comment as not-correct).

Table 22. The level of agreement between the two diabetes experts regarding the truth value of the correctness of the system's comments.

                    System Comments   %
Full Agreement      151               87.8%
Partial Agreement   20                11.6%
No Agreement        1                 0.6%
Total               172               100%

Since it was clear that the experts significantly agree on the correctness or incorrectness of the system's comments, I now looked at what they actually said about these comments. According to the judgment of diabetes expert 1, 84% of the comments were correct, 11% were partially correct, and 5% were not correct; according to the judgment of diabetes expert 2, 88% of the comments were correct, 7% were partially correct, and 5% were not correct. The next step was to integrate the results of the two diabetes experts.

Table 23 displays the correctness when integrating the results of both diabetes experts. To integrate the evaluations of the two experts, the possible combinations of the correctness results were organized into six combination groups. The groups were ordered from top to bottom, from the most correct combination to the most incorrect combination. For each combination, the number of comments is presented together with its proportion of the total number of comments. The cumulative percentage represents the proportion of comments that are at least as correct as the current combination out of the total number of comments; 81% of the evaluated comments were judged as correct by both experts, and 91% were judged as correct by both experts, or as correct by one expert and partially correct by the other. Only 3% of the evaluated comments were judged as not correct by both experts.

Thus, 91% of the system’s comments were fully supported by at least one of the experts, while the other expert did not disagree with the system (i.e., partially or completely agreed with its comments). I consider that 91% portion as representing, for practical purposes, the level of correctness of the system’s comments.

Table 23. Correctness of the system’s comments according to both of the diabetes experts.

                                                                                      Comments   %      Cumulative %
Comments judged as correct by both experts                                            139        81%    81%
Comments judged as correct by one expert and as partially correct by the other        17         10%    91%
Comments judged as partially correct by both experts                                  6          3%     94%
Comments judged as not correct by one expert and as correct by the other              1          1%     95%
Comments judged as not correct by one expert and as partially correct by the other    3          2%     97%
Comments judged as not correct by both experts                                        6          3%     100%
All                                                                                   172        100%

To further understand the correctness results, I calculated the level of correctness of the comments separately for medication-related issues and for test- and monitoring-related issues. Table 24 and Table 25 present this more detailed analysis.

Table 24 displays the correctness when integrating the assessments of both of the diabetes experts, but only for test- and monitoring-related comments. In this case, 98% of the evaluated comments were judged as correct by both experts, or at least judged as correct by one expert and as partially correct by the other; only 2% were judged as not correct by both experts.

Table 24. Correctness of the system's comments regarding tests and patient monitoring issues.

                                                                                      Comments   %      Cumulative %
Comments judged as correct by both experts                                            57         98%    98%
Comments judged as correct by one expert and as partially correct by the other        0          0%     98%
Comments judged as partially correct by both experts                                  0          0%     98%
Comments judged as not correct by one expert and as correct by the other              0          0%     98%
Comments judged as not correct by one expert and as partially correct by the other    0          0%     98%
Comments judged as not correct by both experts                                        1          2%     100%
All                                                                                   58         100%

Table 25 displays the correctness when integrating the assessments of both diabetes experts, but this time only for drug therapy-related comments. In this case, 87% of the evaluated comments were judged as correct by both experts, or at least judged as correct by one expert and as partially correct by the other; 4% were judged as not correct by both experts.

Table 25. Correctness of the system's comments regarding medication therapy issues.

                                                                                      Comments   %      Cumulative %
Comments judged as correct by both experts                                            82         72%    72%
Comments judged as correct by one expert and as partially correct by the other        17         15%    87%
Comments judged as partially correct by both experts                                  6          5%     92%
Comments judged as not correct by one expert and as correct by the other              1          1%     93%
Comments judged as not correct by one expert and as partially correct by the other    3          3%     96%
Comments judged as not correct by both experts                                        5          4%     100%
All                                                                                   114        100%

When performing a proportion test, the correctness of the test and monitoring-related comments (57/58, or 98%) was found significantly higher (z = 2.44, p = .015) than the correctness of the drug therapy-related comments (99/114, or 87%).

In Summary: The system comments’ overall correctness, i.e., the portion of the comments that were judged as completely correct by one expert and as at least partially correct by the other expert, was 91%.


The system’s correctness when making test- and monitoring-related comments (98%) was significantly higher than its correctness when making medication-related comments (87%).

Importance of the System's Comments

Table 26 displays the results regarding the importance of the issues referred to by the guideline compliance comments provided by the system; 153 comments (89%) were judged as referring to important issues by both experts, 14 comments (8%) were judged as referring to important issues by only one of the experts, and 5 comments (3%) were judged as less important by both experts.

I measured the inter-expert agreement regarding the importance assessment, using the standard (0/1 weights) version of Cohen’s Kappa coefficient, and the Kappa coefficient was 0.37. This value, although significantly higher than a chance value of 0 (p < .05), technically represents only a fair agreement [Altman, 1991], but that is probably an artifact, due to the highly skewed distribution of importance values.

Please note that the experts agreed on the importance (or lesser importance) of 92% of the comments (158 out of a total of 172 comments), a number that reflects a high level of agreement, in contrast to the relatively low Kappa. In order to estimate the number of experts required for achieving a higher Kappa of 0.7, I used the Spearman-Brown prediction formula, as suggested by [van Ast, 2004], and found that it would require multiplying the number of experts by 3.97, i.e., including eight experts.
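For reference, in its standard form the Spearman-Brown prediction formula gives the factor m by which the number of raters must be multiplied to raise an observed reliability of 0.37 to a target of 0.7:

\[
m \;=\; \frac{\kappa_{target}\,(1-\kappa)}{\kappa\,(1-\kappa_{target})}
\;=\; \frac{0.7\,(1-0.37)}{0.37\,(1-0.7)} \;\approx\; 3.97 ,
\]

i.e., approximately 2 × 3.97 ≈ 8 experts, as stated above.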

Table 26. The diabetes experts' assessments of the importance of the system's comments.

                      Diabetes Expert 2:
Diabetes Expert 1     Important    Less Important   Total
Important             153          8                161 (94%)
Less Important        6            5                11 (6%)
Total                 159 (92%)    13 (8%)          172

The overall voting score for the importance of the issues referred to by all of the system's comments was 93%, i.e., the portion of the number of judgments of a comment as important (320 votes) out of the number of all importance judgments (344 votes); see Table 27.


Table 27. Importance of the system's comments.

                                                            Comments   Importance Votes per Comment
System comments marked as important by both experts         153        2
System comments marked as important by only one expert      14         1
System comments marked as less important by both experts    5          0
Importance                                                             93%

I wanted to check the level of correctness for each degree of importance.

I also wanted to check whether the degree of the importance of the comments judged as correct (measured by number of “important” votes) was different from the degree of the importance of the comments judged as incorrect.

Table 28 displays the results regarding the importance of the comments, but this time the results are presented separately for correct comments and for comments that are not correct. Comments that were judged as correct by both experts, or as correct by one expert and partially correct by the other, were considered correct comments; all other comments were considered not correct.

Table 28. Importance of the correct comments compared to importance of the incorrect comments.

                                                            Comments Judged as Correct   Comments Judged as Incorrect   No. of Comments
System comments marked as important by both experts         145 (95%)                    8 (5%)                         153
System comments marked as important by only one expert      11 (79%)                     3 (21%)                        14
System comments marked as less important by both experts    0 (0%)                       5 (100%)                       5
Importance                                                  96%                          59%

The correctness of the comments referring to issues considered as important by both experts (95%) was significantly higher, in a proportion test, than the correctness of the comments referring to issues considered as important by only one expert (79%) (z = 2.34, p = .019).

Considering the level of correctness, the proportion of “important” votes for the correct comments (301/312, or 96%) was found to be significantly higher than the proportion of “important” votes for the incorrect comments (19/32, or 59%) (z = 7.85, p < .001).


An Error Analysis of the System’s Incorrectness

To better understand the scenarios in which the system made incorrect comments, I examined the system’s 16 incorrect comments to identify the scenarios in which erroneous comments were generated. To understand the frequency of each scenario, I grouped the incorrect comments by their scenarios, and counted the number of comments in each scenario group. Table 29 displays these scenarios in which the system made an incorrect comment, with the rate of reoccurrence of each scenario. An explanation of each scenario is provided in the paragraphs below.

Table 29. Distribution of the recurring scenarios among the incorrect comments.

Scenario in which the system made an incorrect comment   Comments   %
Missing data at the start of the time window             8          50%
Low patient compliance                                   3          19%
Borderline decision                                      2          13%
Duplicate action                                         2          13%
Missing replacement therapy                              1          6%
All                                                      16         100%

Missing data at the start of the time window: In this scenario, which was the most common among the incorrect-comment scenarios, the experts did not agree with comments that the system provided regarding a compliance issue that occurred at the start of the time window of the patient data. As mentioned in the experiment-design section, the data used in the experiment included a few years of medical records for each patient. In most cases, the time window of the data started just before the diabetes diagnosis; but in several cases, the data seemed to be missing earlier time-oriented transactions that could support and explain later transactions that appeared at the beginning of the time window. For example, a patient record started with the administration of a medication for high cholesterol, but the first high cholesterol laboratory value appeared only later in the chronological timeline. In such cases, the system provided a comment of the type "Action-Not-Supported" (e.g., the cholesterol medication administration is not supported by a preceding high cholesterol lab result). When the experts examined such system comments, they did not agree with the system and attributed the action to probably missing data.


Low patient compliance: This scenario was explained earlier in the section about the error analysis of the system's incompleteness (i.e., missing comments regarding patient-compliance issues); however, this time it is mentioned as a scenario in which the system made several incorrect comments. It is not surprising that this scenario was the source of these two different problems, namely missing comments and incorrect comments. Whenever the system detected a period of time with low patient compliance to medications, in parallel to an uncontrolled related parameter (e.g., high LDL values in parallel to low compliance with statins), it provided a comment regarding a missing compliance increase instead of a comment about a missing increase in the dose of the medication, even if the low compliance lasted only for a short period just before the lab result. Thus, if the experts did not agree with the missing-compliance-increase comment (an incorrect comment), they expected a comment about a (missing) dose increase of the medication.

Borderline decision scenario: In this scenario, the experts did not agree with two system comments, although the comments were justified according to the guideline, in cases where the values were close to the thresholds in the guideline's constraints. For example, when the system commented about a missing statin-therapy increase for a patient with an LDL of 105 mg/dL, the expert did not agree with the comment, although according to the guideline these LDL levels are above the target level of 100 mg/dL. Even though the system uses the fuzzy reasoning mechanism, it provided the comment because the result was completely above the guideline threshold. Examining these cases suggested an improvement to the analysis algorithm: avoiding such comments by applying the Fuzzy Temporal Reasoner with a negated query, and ignoring missing-action issues in cases that result in a high membership value. For example, the query "LDL < 100 mg/dL" returns a high membership value for an LDL lab result of 101 mg/dL, and in such cases the system could avoid providing a comment about missing statins even though the LDL is above the desired threshold.
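To illustrate the suggested refinement, the following is a minimal sketch in Python (not the system's actual implementation; the 10 mg/dL tolerance band and the 0.8 suppression cutoff are hypothetical, illustrative parameters) of a fuzzy membership function for a query such as "LDL < 100 mg/dL" and of the proposed suppression of borderline missing-action comments:

```python
def membership_below(value, threshold, tolerance):
    """Fuzzy membership in 'value < threshold': 1 at or below the threshold,
    decreasing linearly to 0 over [threshold, threshold + tolerance]."""
    if value <= threshold:
        return 1.0
    if value >= threshold + tolerance:
        return 0.0
    return (threshold + tolerance - value) / tolerance

# Negated-query check suggested in the text: if an LDL value still has a high
# membership in "LDL < 100 mg/dL", suppress the borderline missing-action comment.
ldl = 101
if membership_below(ldl, threshold=100, tolerance=10) >= 0.8:  # membership is 0.9 here
    comment = None  # suppress the comment for a borderline value
else:
    comment = "Missing statin therapy increase"
```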

Duplicate action scenario: In this scenario, the experts accepted a medication-therapy initiation that was performed in parallel to the administration of an additional medication with the same therapeutic effect, which was initiated at the same time. This might be explained by the fact that the guideline's primary therapeutic option was rejected for some reason by the care provider. The system's comment in these cases was that one of the medications is a duplicate form of therapy, unless the guideline specifically stated that more than one medication at a time is recommended.

A missing replacement therapy: Although this scenario had only one case and was explained earlier in the section about the error analysis of the system's incompleteness, the explanation is repeated here because the same scenario also led the system to provide an incorrect comment. In this case, the system correctly detected an unexplained medication stop but did not comment about a missing alternative therapy (the missing comment from the completeness analysis). When the alternative therapy was eventually instituted, the system erroneously considered it to be timely, since the replacement medication was viewed as being administered in the context of a new instance of the guideline, while the experts labeled it as being too late, since it was necessary to perform that action when the medication was stopped.

5.3 A Comparison of Correctness and Completeness among the Experts and Between the Experts and the System

As an additional aspect of the correctness and completeness analysis, I was interested in comparing the results of the different experts, and in comparing their results to those of the system. Although the experiment did not include a step in which the experts directly evaluated each other's comments, I could use their evaluations from the two experimental steps to indirectly calculate their levels of completeness and correctness. For this purpose, I defined two additional objective measures to evaluate the quality of the experts' evaluations: Indirect Correctness and Indirect Completeness.

Indirect Correctness of the Experts

The Indirect Correctness of an expert was calculated using the comments that were mentioned by the expert in the first step of manual evaluation of the treatment’s compliance, where each expert added his comments regarding the compliance to the guideline manifested in the medical records of each of the 10 patients. The indirect correctness is measured by the portion of the expert’s comments that were mentioned by at least one additional expert. Table 30 shows the results of this analysis.

86% of the comments mentioned by diabetes expert 1 were mentioned by at least one additional expert, 70% of the comments mentioned by diabetes expert 2 were mentioned by at least one additional expert, and 71% of the comments mentioned by the family medicine expert were mentioned by at least one additional expert.

Table 30. Indirect correctness of the experts’ comments, partitioned by level of support of the comments by the other experts.

                                                         Diabetes Expert 1   Diabetes Expert 2   Family Medicine Expert
All Comments                                             86                  118                 125
Comments not mentioned by any other expert               12                  35                  36
Comments mentioned by one other expert                   23                  36                  37
Comments mentioned by two other experts                  51                  47                  52
Rate of comments mentioned by at least 1 other expert    86%                 70%                 71%
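As a worked instance of this measure, using the Diabetes Expert 1 column of Table 30:

\[
\text{Indirect Correctness}_{\text{Diabetes Expert 1}} \;=\; \frac{23 + 51}{86} \;=\; \frac{74}{86} \;\approx\; 86\%
\]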

It is important to note that the level of completeness of an expert has an effect on the results of the indirect correctness of the other experts, since an expert who makes only a relatively small number of the relevant comments artificially reduces the correctness of the comments made by the other experts. For this reason, I was interested in adding the system as an additional compliance-evaluation agent. In the previous steps of the analysis, the completeness and correctness of the system were found to be relatively high; therefore, I assumed it was reasonable to use the comments of the system as part of the evaluation of the experts' indirect correctness.

Table 31 displays the results of the indirect correctness analysis, this time taking into consideration the comments of the system as well.

99% of the comments mentioned by diabetes expert 1 were mentioned by at least one other agent (an expert or the system), 91% of the comments mentioned by diabetes expert 2 were mentioned by at least one other agent, and 88% of the comments mentioned by the family medicine expert were mentioned by at least one other agent. Notice that the indirect correctness results of all experts were higher when the system was added as an additional agent, due to the high completeness of the system.


Table 31. Indirect correctness of the experts’ comments, partitioned by level of support of the comments by the other agents, including the system.

                                                        Diabetes Expert 1   Diabetes Expert 2   Family Medicine Expert
All Comments                                            86                  118                 125
Comments not mentioned by any other agent               2                   11                  15
Comments mentioned by 1 other agent                     14                  30                  27
Comments mentioned by 2 other agents                    20                  31                  32
Comments mentioned by 3 other agents                    50                  46                  51
Rate of comments mentioned by at least 1 other agent    99%                 91%                 88%

Indirect Completeness of the Experts

The Indirect Completeness of an expert was calculated using the evaluations from the second part of the experiment, in which the two diabetes experts separately evaluated the correctness of the compliance comments provided by the system, and thus, indirectly, assessed each other's comments. In this part of the evaluation, the experts judged 156 system comments as "jointly correct" (i.e., judged as correct by both experts, or as correct by one and as partially correct by the other). The indirect completeness of an expert is measured by the portion of the "jointly-correct" comments that were mentioned by the expert in his manual compliance evaluation of the same patient.

Table 32 displays the results of the indirect completeness analysis. 75% of the “jointly-correct” comments were mentioned by diabetes expert 1, 60% of the “jointly- correct” comments were mentioned by diabetes expert 2, and 55% of the “jointly- correct” comments were mentioned by the family medicine expert.

Table 32. Indirect Completeness of the experts in the manual compliance evaluation.

                                                                            Diabetes Expert 1   Diabetes Expert 2   Family Medicine Expert   Average
Judged as jointly correct and mentioned by the expert in his comments       117                 93                  86                       296
Judged as jointly correct but not mentioned by the expert in his comments   39                  63                  70                       172
Total "jointly correct" comments                                            156                 156                 156                      468
Indirect Completeness relative to the "jointly correct" comments            75%                 60%                 55%                      63%
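As a worked instance, using the Diabetes Expert 1 column of Table 32:

\[
\text{Indirect Completeness}_{\text{Diabetes Expert 1}} \;=\; \frac{117}{156} \;=\; 75\%
\]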


Table 33 displays the distribution of the type of jointly correct comments that were not mentioned by the experts. It can be seen in this table that 37% of the comments are positive comments regarding good compliance (Start Step on Time, Drug Compliance Increase On Time, Increase aborted due to maximal dose, Drug Switch Correct), and the remaining 63% are negative comments regarding compliance problems.

Table 33. Distribution of the comments that were not mentioned by an expert, although they were judged as jointly correct by the two diabetes experts.

Comment Type                           Positive/Negative   Comments   %
Start Step On Time                     Positive            35         20%
Start Step Too Late                    Negative            32         19%
Drug Compliance Increase Too Late      Negative            25         15%
Step is Missing                        Negative            19         11%
Drug Compliance Increase On Time       Positive            18         10%
Non Active Plan Step                   Negative            12         7%
Step Aborted Maximal Dose              Positive            10         6%
Plan Already Completed                 Negative            7          4%
Unexplained Drug Stop                  Negative            5          3%
Step is Missing Compliance Increase    Negative            4          2%
Wrong Path Selection                   Negative            4          2%
Drug Switch Correct                    Positive            1          1%

Because 37% of the jointly correct comments that were not mentioned by the experts are positive comments regarding correct compliance to the guideline, I was interested in checking whether the indirect completeness of the experts is higher when considering only the negative comments regarding compliance problems. Table 34 displays the results of the indirect completeness calculated only on these negative comments. The average experts' completeness was 62%, which is slightly lower than the completeness calculated on all comments, positive and negative, which was 63%.

Table 34. Indirect Completeness of the experts in the manual compliance evaluation, regarding compliance problems only.

                                                                    Diabetes Expert 1   Diabetes Expert 2   Family Medicine Expert   Average
Judged as correct and mentioned by the expert in his comments       64                  60                  57                       181
Judged as correct but not mentioned by the expert in his comments   33                  37                  40                       110
Total "jointly correct" comments                                    97                  97                  97                       291
Indirect Completeness relative to the "jointly correct" comments    66%                 62%                 59%                      62%


Comparison between the Experts and the System

To conclude the completeness and correctness analysis, I performed a comparison between the results of all experts and the results of the system.

Table 35 and Figure 49 summarize the completeness and correctness results for the system and all experts. For the system, I used the results presented in the completeness and correctness sections of this analysis: a completeness of 91% (the system's completeness relative to the comments supported by a majority of the experts) and a correctness of 91%. For the experts, I used the indirect completeness and indirect correctness results presented in the previous sections. Diabetes Expert 1 had the highest correctness score (99%), and the system had the highest completeness score (91%), with a correctness score of 91%, which is similar to the correctness score of Diabetes Expert 2. When using the Harmonic Mean to integrate the correctness and completeness, the system achieved the highest score, 0.91. The Harmonic Mean is defined as: Harmonic Mean = (2 × Completeness × Correctness) / (Completeness + Correctness).

Table 35. Summary of completeness and correctness of the system and the experts.

                         Completeness (%)   Correctness (%)   Harmonic Mean
System                   91                 91                0.91
Diabetes Expert 1        75                 99                0.85
Diabetes Expert 2        60                 91                0.72
Family Medicine Expert   55                 88                0.68
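For example, the harmonic mean for Diabetes Expert 1 is computed as:

\[
\frac{2 \times 0.75 \times 0.99}{0.75 + 0.99} \;\approx\; 0.85
\]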

Figure 49. A profile of the completeness and correctness of the experts and the system.


5.4 Results Regarding Runtimes and Memory Consumption

To test the performance of the system, I applied it to the full set of 2,038 patients, which included 378,273 time-oriented data records, and monitored the runtime and memory measures. The experiment was performed on a Windows 7-based personal computer, with an Intel i5 quad-core CPU and 8 GB of internal memory.

Table 36 shows the results regarding runtimes: the average time to load the data of all patients from the database and to store them in memory according to the system's internal data structures was 13.45 seconds, and the average time to analyze the data of all patients was 686.11 seconds, i.e., 0.34 seconds per patient.

Table 36. Results regarding runtimes.

                                      Run 1     Run 2     Run 3     Average
Patients                              2,038     2,038     2,038     2,038
Time-oriented medical records         378,273   378,273   378,273   378,273
Load data from database (seconds)     11.85     11.88     16.61     13.45
Analyze all patients (seconds)        714.52    665.10    678.71    686.11
Time per patient (seconds)            0.36      0.33      0.34      0.34

The memory used by the system's process was monitored during these runs, and the results reflected a steady consumption of internal memory that did not exceed 270 MB at any point in time during the performance experiment. Figure 50 shows a snapshot of the memory consumption graph for a period of 60 seconds during the run.

Figure 50. A snapshot of the memory consumption graph, illustrating the steady memory consumption for a period of 60 seconds during the performance-testing runs.

In addition to measuring runtimes and memory consumption, the CPU utilization was also monitored during the performance-testing runs. A typical state of the CPU utilization graphs is presented in Figure 51. As can be seen in these graphs, the processing was performed utilizing 100% of one of the processor cores, with an overall utilization of around 25% of the CPU.

Figure 51. A snapshot of the CPU utilization graphs for a period of 60 seconds during the performance-testing runs. The total CPU utilization was steady at around 25%. The analysis process was performed on a single CPU core (CPU 0 in the graphs), which was fully utilized during most of the run.

6

Summary and Discussion

In this final chapter, I summarize my overall research, including its objectives, specific research questions, and experimental results; discuss the implications of the results for the task of quality assessment of guideline-based care in particular, and for the medical field in general; describe the limitations of the study, with appropriate suggestions for future work; and present the study's conclusions.

6.1 Summary of the Methods and of the Results

Clinical guidelines contain evidence-based recommendations that aim to improve the quality of medical care by providing guidance through the diagnosis, management, and treatment of patients with specific medical conditions. However, guidelines are not always followed, or are followed incompletely.

Thus, it is potentially highly useful to provide an automated analysis of the compliance to established clinical guidelines, in a manner sensitive to the context of each patient, using continuous, partial-match measures between the guideline and the actual management, and not just discrete, binary (0/1) scores. The analysis should also be complemented by effective explanations of the critique. The critique and its explanation can then serve as an important tool for both short-term and long-term enhancement of the compliance of care providers. Such a service can assist both clinicians and the health-care administrators.

A Summary of the Methods

In this research, I have developed and evaluated the DiscovErr system for medical critiquing and quality assessment, which performs a knowledge-based analysis of the compliance to clinical guidelines. The DiscovErr system uses several well-known models for medical knowledge representation: the Asbru language for the representation of the procedural aspects of the guideline; the KBTA ontology for representing the declarative aspects of the knowledge, i.e., formal representation of the medical concepts used for the evaluation of patient state according to their multivariate, time-oriented medical record; and standard medical vocabularies (e.g., ATC, ICD-9) for denoting the medical concepts in the knowledge base, to support the mapping between the knowledge base and data identifiers in each local (e.g., specific to each medical center) electronic medical record.


The DiscovErr system includes a graphical tool for specification and maintenance of the guideline-based knowledge, and a knowledge library for fast storage and retrieval of the formal procedural and declarative knowledge. The knowledge-specification tool allows users to specify new knowledge according to the formal models, and to maintain and modify knowledge in the knowledge library. The knowledge library includes three sub-libraries: The plan library, which includes the formal Asbru-based representations of the procedural guideline-based clinical knowledge; the declarative concepts library, which includes the formal definitions of the medical concepts according to the KBTA ontology; and the clinical steps library, which includes knowledge about atomic clinical steps, which are mapped to concepts in the standard medical vocabularies and which can be re-used in multiple sections or paths within the guidelines.

The main module of the DiscovErr system is the Compliance Analysis Engine, which analyzes the longitudinal data in the patients’ medical records, using the formal knowledge in the library, and provides comments regarding the compliance of the clinical actions manifested by these data (i.e., their relative match) with the guidelines by which each patient should be managed. The core of the Compliance Analysis Engine is a multi-step algorithm that analyzes the electronic records by combining a top-down (i.e., knowledge-driven, from the knowledge to the data) analysis approach with a bottom-up (i.e., data-driven, from the data to the knowledge) approach.
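The following toy Python sketch illustrates, under strong simplifying assumptions, the general idea of combining the two directions of analysis; the Action class, the set-based matching, and the analyze_compliance function are hypothetical illustrations and do not reproduce the actual multi-step DiscovErr algorithm, which operates on the full Asbru plan hierarchy and on temporal abstractions of the data.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    code: str   # e.g., a lab-test code or an ATC drug code
    time: int   # simplified time stamp (e.g., day index)

def analyze_compliance(expected: set[Action], recorded: set[Action]) -> list[tuple[str, Action]]:
    """Combine a knowledge-driven pass (from the guideline to the data) with a
    data-driven pass (from the data to the guideline)."""
    comments = []
    for action in expected:     # top-down: is every expected action documented in the record?
        if action not in recorded:
            comments.append(("missing_action", action))
    for action in recorded:     # bottom-up: is every recorded action explained by the guideline?
        if action not in expected:
            comments.append(("unexplained_action", action))
    return comments

# Toy example with illustrative codes
expected = {Action("HbA1c_test", 1), Action("LDL_test", 2)}
recorded = {Action("HbA1c_test", 1), Action("unordered_drug", 2)}
print(analyze_compliance(expected, recorded))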

Clinicians do not appreciate binary, black-and-white judgments. A cut-off value of 140 mmHg in the systolic blood pressure means that a 141 mmHg value is somewhat high, perhaps a bit “highish”, but not significantly so, and is certainly quite different from a value of 157 mmHg, although both are above the cut-off value. To explicitly consider the uncertainty inherent in clinical medicine, and the degrees of freedom clinicians have when making clinical decisions, the analysis algorithm uses the Fuzzy Temporal Reasoner, a temporal reasoning engine that implements fuzzy-logic methods to interpret the patient’s state in a continuous, non-binary manner, enabling the system to assign a continuous membership score (in any of the provided ranges, e.g., normal blood pressure) between 0 and 1. During the analysis, the system uses these less strict membership scores to examine multiple (alternative) explanations; this helps the system to flag a particular value (such as a systolic blood pressure of 141) only when it crosses a particular predefined fuzzy membership value, and to avoid making redundant comments about deviations from the guidelines when a reasonable alternative explanation with a sufficiently high membership value exists. This capability for flexible, non-binary reasoning might explain the system’s success, with respect to the high level of agreement of the expert clinicians with its comments, and the fact that its correctness scores were much higher in the case of comments regarding issues judged as important, as opposed to comments regarding issues that were judged as less important.
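As an illustration only, the following Python sketch shows how a continuous membership function and a flagging threshold of the kind just described might behave for the systolic blood-pressure example; the 140/160 mmHg breakpoints and the 0.5 flagging threshold are assumptions made for this example, not the knowledge base's actual definitions.

def high_sbp_membership(sbp_mmhg: float, low: float = 140.0, high: float = 160.0) -> float:
    """Continuous membership in the fuzzy set 'high systolic blood pressure'.
    Values at or below `low` get 0, values at or above `high` get 1,
    with a linear ramp in between (illustrative breakpoints only)."""
    if sbp_mmhg <= low:
        return 0.0
    if sbp_mmhg >= high:
        return 1.0
    return (sbp_mmhg - low) / (high - low)

FLAG_THRESHOLD = 0.5   # hypothetical cut-off for emitting a comment

for value in (141, 148, 157):
    membership = high_sbp_membership(value)
    print(f"SBP={value}: membership={membership:.2f}, flagged={membership >= FLAG_THRESHOLD}")
# 141 is only marginally 'high' (membership 0.05) and would not be flagged; 157 (0.85) would.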

To evaluate the DiscovErr system, I conducted an experiment that included the specification of a complex diabetes guideline, and applied the system to real patient data. The experiment included two diabetes experts and one family medicine expert, who examined the data in two steps. In the first step, the experts manually evaluated the raw data of 10 patients, whose longitudinal multivariate data had been recorded for up to five years, and added comments regarding the compliance of each patient's management with the diabetes guideline, for every meaningful segment of the patient's clinical history, throughout the full course of the management of that patient. In the second step, the experts assessed the correctness and importance of the comments provided by the system regarding the compliance of the management of the same patients with the guideline. The results of the experiment were analyzed to evaluate the completeness and correctness of the system, and, in a certain sense, those of the experts. The results indicated high completeness and correctness of the system; they are discussed in the next section.

A Summary of the Results of the Evaluation

The results of the evaluation provided answers to all four of the major research questions posed in the Introduction and in the Evaluation chapters:

The DiscovErr system was found to produce most (91%) of the important compliance-related comments when applied to real patients; its completeness score was in fact higher than that of all of the medical experts participating in the experiment.

Most of the system’s comments were found correct when assessed by the medical experts, and the system achieved a correctness score (91%) which was similar to the score of one of the diabetes experts; it was in fact higher than the score of the family medicine expert, although lower than the score of the second diabetes expert.

The system's compliance-related comments were found to be significant and important when directly evaluated by the diabetes experts with respect to this aspect.


The system was found to perform well regarding runtimes and memory consumption, in a manner that supports its possible implementation in real clinical quality assessment settings.

With respect to the secondary issues examined during the evaluation, 46% of the unique (with respect to content) comments regarding compliance issues were made by only one expert; 26.5% were made by two; and the rest, 27.5%, were made by all three experts. The agreement between the diabetes experts regarding the correctness of the system's comments, assessed through a weighted Kappa measure, was 0.61 (p < 0.05), a level that is considered good and significant.
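For readers who wish to reproduce this kind of agreement analysis, the following minimal Python sketch computes a linearly weighted Cohen's Kappa between two raters using scikit-learn; the toy ratings below are illustrative and are not the study's data, and the study's exact weighting scheme may differ from the linear weighting used here.

# Weighted Cohen's Kappa between two raters on an ordinal scale.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal ratings of the same comments by two experts:
# 2 = correct, 1 = partially correct, 0 = incorrect
expert_a = [2, 2, 1, 2, 0, 2, 1, 2, 2, 1]
expert_b = [2, 1, 1, 2, 0, 2, 2, 2, 2, 1]

kappa = cohen_kappa_score(expert_a, expert_b, weights="linear")
print(f"Weighted kappa: {kappa:.2f}")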

6.2 Discussion

It is interesting to note, beyond the significant weighted-Kappa value of agreement amongst the experts regarding the correctness of the system's comments, that the diabetes experts fully agreed on the correctness, partial correctness, or incorrectness of 151 of the 172 evaluated system comments (87.8%), and partially agreed on the truth value of an additional 20 of the 172 comments (11.6%), a total of 99.4% full or partial agreement. This high level of inter-observer agreement was rather surprising, but I found it quite encouraging with respect to supporting the evaluation: it would have been more difficult to assess the correctness of the comments made by the DiscovErr system had the agreement been low. However, one must keep in mind that the high agreement between the two diabetes experts was about the critiquing comments of an external agent, i.e., the DiscovErr system, regarding a given therapy by the patient's care provider, and not about the recommended optimal therapy. Apparently, it is easier to agree with a computer system on whether an unknown clinician complied with a specific, well-defined guideline than to agree on what should be done given just the patient's data. In the latter case, little agreement between two diabetes experts was found with respect to the open question, "What would you recommend to this patient?" [Shahar 1994].

The fact that the DiscovErr system achieved higher completeness than the medical experts can be explained by the advantage of the computer in performing such tasks, which involve scanning a relatively large amount of temporal data in order to analyze the compliance to the guideline. As previously mentioned in the results, it took the experts 27 minutes, on average, to evaluate the medical record of a single patient, where the medical records contained an average of 158 time-stamped data items, describing about five years of medical treatment. In this kind of task, the computer has a clear advantage over human experts, who may miss certain compliance problems. It is important to mention that the experts stated that the user interface provided to them, which visualizes the temporal data of multiple parameters in parallel graphs (see Figure 42), was very useful, and that had they been required to perform this task using the systems they currently use in real clinical settings, it would have been much harder, even almost impossible.

The system achieved higher completeness and correctness scores for test-and-monitoring-related comments than for medication-related comments. This is not surprising, and can be explained by the complexity of the medication-related guideline recommendations, which is higher than that of the test-and-monitoring-related recommendations. When deciding about medication-administration actions, many aspects need to be considered, such as the patient's state, previous medications and dosages, patient compliance, and side effects. Thus, in some complex scenarios, the experts have an advantage over the system, which may lack some of the experts' general knowledge. Nevertheless, the system achieved high completeness and correctness scores for both the medication-related and the test-and-monitoring-related actions, but performed better when analyzing the compliance of test-and-monitoring-related actions.

It is encouraging to note that the DiscovErr system produced 98% of the comments made by all three experts (versus 83% of the comments made by only two experts, and 66% of the comments made by only one expert). In our opinion, this result provides, in addition to the explicit assessment of the system’s comments by the experts, yet another implicit validation of the system’s focus on important issues.


6.3 Implications for the Field of Medicine

The evaluation of the DiscovErr system in the diabetes domain resulted in a critiquing proficiency equivalent to that of an expert clinician, at a level somewhere between an experienced family medicine expert and a diabetes expert. Although the system was evaluated only in this single medical domain, I would expect it to perform with similar proficiency in many other time-oriented clinical domains, such as the management of other types of chronic patients, the monitoring of pregnancies, and even the management of patients in an intensive-care unit. This assumption, although not evaluated in the current study, is based on the fact that these domains share similar characteristics of data and knowledge, and on the fact that the underlying knowledge-representation model that the DiscovErr system uses is based on the Asbru procedural specification language and on the KBTA declarative-knowledge ontology. The expressiveness of these representation formats has been assessed by multiple studies in the past, in various clinical domains. For example, the Asbru language has been used by other systems for critiquing and for quality assessment [Advani et al. 2002; Sips et al. 2006; Boldo 2007; Groot et al. 2008], and its capability for the formal representation of clinical guidelines was assessed in several projects, as described in the background section. Furthermore, as there is no domain-specific element in the compliance-analysis algorithm I have presented, it is reasonable to assume that it will work well in other clinical domains. Nevertheless, this assertion needs to be verified in future research.

The results of this thesis suggest the possibility of using systems such as DiscovErr to perform large-scale, clinical-guideline-based quality assessment of electronic medical records. This assessment would be performed at a level that is at the very least sufficient to flag certain records as problematic and to include an appropriate explanation, thus enabling quality-assessment experts, care providers, and clinical researchers to browse them more rigorously. In the following sections, I present several ways in which systems such as DiscovErr could be used to increase the support for guideline-based care.

Providing Critique at the Point of Care

The current study showed the validity of a retrospective automated critiquing process. The compliance analysis, however, can also be performed in real time, at the point of care, by assessing the quality of the care provider's decisions, as opposed to actions, in order to immediately assist clinicians in increasing their compliance when deviations from the guidelines are detected. Such an "over the shoulder" critiquing mode of guideline-based support aims to provide on-line decision support with minimal interaction with the clinician, thus enhancing the acceptance of decision-support systems in real clinical settings. For that reason, the earliest works in the field focused on critiquing, including systems such as HyperCritic [van der Lei and Musen 1990], Trauma-TIQ [Gertner 1997], and AsthmaCritic [Kuilboer et al. 2003].

It is important to mention that the term "point of care" can be extended from the specific moment in which the clinician treats the patient to the more general period of patient management, thus extending the scope of a critiquing system into that of a management system. The comments regarding compliance to the guideline can be effectively provided to the clinician, or even to the patient, during the overall treatment period, or even at the time at which certain recommended actions would be expected but were not recorded. For example, a comment regarding an unexpected increase in medication dosage can be sent to the clinician at the end of the patient visit, with a beneficial effect; and a comment regarding a missing laboratory test can be sent to the patient (and clinician) at the time at which this action was expected, but no record of scheduling such an action was found in the patient's file.

Retrospective Compliance Analysis and Quality Assessment

The compliance analysis can be performed retrospectively as a tool for clinical managers to measure the level of compliance of the treatment provided in their clinical units, and assist them in identifying specific sections in the guidelines in which the compliance should be improved.

Quality is becoming a major issue in medical care; organizations monitor the quality of care using various quality measures, such as the Clinical Quality Measures (CQMs) published by the Centers for Medicare and Medicaid Services, and the Indicators for Quality Improvement (IQIs) published by the NHS. Although these measures include a formal, objective definition, most of them are rather simple in comparison to the clinical complexity of clinical guidelines. An example of a CQM measure in the domain of the current study is "Low Density Lipoprotein (LDL-C) Control in Diabetes Mellitus", defined as "Percentage of patients aged 18 through 75 years with diabetes mellitus who had most recent LDL-C level in control (less than 100 mg/dL)". This measure, and much more complex ones, which also consider longitudinal therapy and not just specific points in the patient's life (such as what happened on a particular clinic visit), can easily be represented using DiscovErr's knowledge model and then automatically applied to medical records, utilizing its algorithm for compliance analysis.

Furthermore, due to its use of the fuzzy temporal-logic mechanism to assess constraints such as the requirement that the LDL-C value be lower than 100 mg/dL, the DiscovErr system can correctly assess a group of 50 of a care provider's patients whose LDL-C values, at the point at which they are examined, are all just a bit higher than 100 mg/dL, and assign their care provider a high score for the overall level of therapy with respect to that measure. Note that a simplistic algorithm using a rigid cut-off value would assign that care provider a quality measure of zero.
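The following Python sketch contrasts, on hypothetical data, a rigid cut-off score with a fuzzy membership score for the LDL-C control measure; the 30 mg/dL slack and the toy patient values are assumptions made for the example, not the knowledge base's actual definitions.

def ldl_control_membership(ldl_mg_dl: float, target: float = 100.0, slack: float = 30.0) -> float:
    """Fuzzy degree to which an LDL-C value satisfies 'in control (< 100 mg/dL)'.
    Values at or below the target score 1; the score ramps down linearly to 0
    at target + slack (the 30 mg/dL slack is an illustrative assumption)."""
    if ldl_mg_dl <= target:
        return 1.0
    if ldl_mg_dl >= target + slack:
        return 0.0
    return 1.0 - (ldl_mg_dl - target) / slack

# 50 patients whose most recent LDL-C is just above the cut-off (103-107 mg/dL):
patients = [103 + (i % 5) for i in range(50)]

rigid_score = sum(1 for v in patients if v < 100) / len(patients)                # 0.00
fuzzy_score = sum(ldl_control_membership(v) for v in patients) / len(patients)   # about 0.83

print(f"Rigid cut-off score: {rigid_score:.2f}, fuzzy score: {fuzzy_score:.2f}")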

In addition, if more sophisticated systems such as the DiscovErr system are used for quality assessment, more sophisticated quality measures could be defined to address other aspects of the correct application of the relevant medical knowledge; for example, the temporal aspects of the data (e.g., note that one would wish to examine the pattern formed by all of the LDL-C measurements, and not only the most recent one).

Note that the use of more complex quality-assessment measures, such as time-oriented ones, through the use of systems such as DiscovErr, can prevent a "rigging of the system" by clinicians who adhere to superficial, static QA measures, such as the requirement for administering a questionnaire once a year, so as to obtain financial or other benefits. Adherence to the complex, time-oriented, guideline-based intentions and actions might be easier to achieve by simply following the guideline than by artificially manipulating some actions.

Guideline-Related Research

An additional way to benefit from systems such as DiscovErr is to apply them to study whether compliance to guidelines indeed improves short- and long-term patient outcomes. Sophisticated systems that analyze compliance by tracking both the intentions and the complex plans of the guidelines can explore and greatly extend results such as those of Micieli et al. [2002], who showed that greater compliance with cerebral vascular accident (CVA; brain stroke) management guidelines reduces morbidity and mortality, and results such as those of Quaglini et al. [2004], who showed that enhanced compliance with CVA guidelines also reduces economic expenditure, for example by reducing the number of hospitalization days in the first year after the stroke. Indeed, early attempts to identify various patterns of insulin therapy over time in diabetes patients, to support further research, were made by Kahn and Abrams [1990].

Researchers can use compliance-analysis systems to group patients with regard to their level of compliance to the guidelines, and then, by examining their short- and long-term outcomes, compare, validate, and modify the clinical guidelines.

6.4 Limitations of the Work

Although the DiscovErr system is generic and can be used for analysis of compliance to multiple clinical guidelines in multiple medical domains, the evaluation in this study was performed in the single medical domain of diabetes, due to limited resources and time. By extending the evaluation to additional medical domains, the results regarding completeness and correctness could be more broadly generalized, and the ability to apply the system in additional medical domains could be further examined and hopefully validated. In addition, adding more medical experts, had such experts been available, and extending their evaluation to include the medical records of additional patients, could have increased the statistical significance of the results.

An additional aspect that was not included in the current implementation, and which could extend the capabilities of the system, is the ability to determine the actor who is responsible for each detected compliance problem, in particular, the patient or the physician. Such an ability may assist in deciding to whom the comments should be delivered, improve the way they are presented, and increase their acceptance. In addition, determining the responsibility for the compliance problems would enable the performance of comparative studies, such as the work of Vashitz et al. [2011], who compared the physicians' adherence to dyslipidemia guidelines to the adherence of the patients, using very specific measures to analyze the adherence, and even to assign responsibility, without representing the actual guideline. The idea for adding such an ability emerged while collaborating with the medical expert during the system's evaluation, and specific methods for adding that capability were suggested. For example, comments regarding the patient's low compliance to medications, which can be detected through the analysis of medication-purchase data, can obviously be related to the patient, whereas comments regarding drug administration, such as an unexplained increase or decrease in the dosage, can be related to the physician. More sophisticated methods for responsibility assignment can be investigated in future research.

An additional ability that can be added to the system, relevant mostly when it is applied in an online critiquing mode, is support for the collection of the reasons for non-compliance. Such an ability, as shown in the development of the RoMA module [Panzarasa 2007; Quaglini 2008], is important for providing feedback regarding the reasons why guideline recommendations are not applied, and for detecting problems in the compliance-analysis system.

6.5 Conclusions

I presented a comprehensive framework, implemented by the DiscovErr system, intended to critique the management of chronic patients using a formal representation of a set of potentially relevant clinical guidelines, a new algorithm that exploits temporal fuzzy logic, and an integration of data-driven and goal-driven reasoning approaches.

Overall, the completeness of the DiscovErr system, assessed over a set of type II diabetes patients’ records, with the help of three clinical experts, was 91% for comments made by at least a majority of the three experts. The completeness increased from 66% for comments made by only an individual expert, to 83% for comments that were mentioned by a majority of the experts, and reached 98% for comments made by all experts.

The correctness of the comments, as assessed by the two diabetes experts, was also 91%. The correctness was higher for comments concerning issues that were judged as being of higher importance, as opposed to comments concerning issues judged as being of lesser importance.


Overall, when compared to a majority of the three experts with respect to correctness, the DiscovErr system could be placed between the family medicine expert and the two diabetes experts; with respect to completeness, it displayed a higher level of completeness than any of the three experts.

I conclude that systems such as DiscovErr can be effectively used to provide medical critique and to effectively assess the quality of time-oriented guideline-based care of large numbers of patients.


7 References

[1] Abu-Hanna A, Jansweijer W (1994). Modeling domain knowledge using explicit conceptualization. IEEE Expert; 9(5):53-64.
[2] Advani A, Shahar Y, Musen M (2002). Medical quality assessment by scoring adherence to guideline intentions. Journal of the American Medical Informatics Association; 9:s92-s97.
[3] Allen JF (1983). Maintaining knowledge about temporal intervals. Communications of the ACM; 26(11):832-843.
[4] Altman DG (1991). Practical statistics for medical research. London: Chapman and Hall.
[5] Boldo I (2007). Knowledge-Based Recognition of Clinical-Guideline Application in Time-Oriented Medical Records. Ph.D. Thesis, Department of Industrial Engineering and Management, Technion, Israel.
[6] Boxwala A, Peleg M, Tu S, Ogunyemi O, Zeng Q, Wang D, et al. (2004). GLIF3: a representation format for sharable computer-interpretable clinical practice guidelines. J Biomed Inform; 37(3):147-61.
[7] Chan AS, Martins SB, Coleman RW, Bosworth HB, et al. (2005). Post-fielding Surveillance of a Guideline-based Decision Support System. Advances in Patient Safety: From Research to Implementation (Volume 1: Research Findings).
[8] Ciccarese P, Caffi E, Boiocchi L, Quaglini S, Stefanelli M. (2004). A guideline management system. Medinfo; 28-32.
[9] Clarke JR, Rymon R, Niv M, Webber BL, Hayward CZ, Santora TS, Wagner DK, Ruffin A. (1993). The importance of planning in the provision of medical care. Medical Decision Making; 13(4):383.
[10] De Clercq P, Blom J, Korsten HHM, Hasman A. (2004). Approaches for creating computer-interpretable guidelines that facilitate decision support. Artificial Intelligence in Medicine; 31(1):1-27.
[11] De Clercq PA, Hasman A. (2004). Experiences with the Development, Implementation and Evaluation of Automated Decision Support Systems. Medinfo; 1033-1037.
[12] De Clercq PA, Blom JA, Hasman A, Korsten HHM. (2001). Design and implementation of a framework to support the development of clinical guidelines. Int J Med Inf; 64(2-3):285-318.
[13] Dykes P, Caligtan C, Novack A, Thomas D, Winfield L, Zuccotti G, Rocha R. (2010). Development of Automated Quality Reporting: Aligning Local Efforts with National Standards. AMIA Annu Symp Proc; 187-191.
[14] Essaihi A, Michel G, Shiffman RN. (2003). Comprehensive Categorization of Guideline Recommendations: Creating an Action Palette for Implementers. AMIA Annu Symp Proc; 220-224.
[15] Fox J, Johns N, Rahmanzadeh A. (1998). Disseminating medical knowledge: the PROforma approach. Artificial Intelligence in Medicine; 14(1):157-181.
[16] Gertner A. (1997). Plan recognition and evaluation for on-line critiquing. User Modeling and User-Adapted Interaction; 7(2):107-140.
[17] Goldstein MK, Hoffman BB, et al. (2001). Patient Safety in Guideline-Based Decision Support for Hypertension Management: ATHENA DSS. AMIA Annual Symposium, Washington, DC.
[18] Grimshaw JM, Russell IT. (1993). Effect of clinical guidelines on medical practice: A systematic review of rigorous evaluations. Lancet; 342:1317-1322.
[19] Groot P, Hommersom A, et al. (2008). Using model checking for critiquing based on clinical guidelines. Artificial Intelligence in Medicine; 46(1):19-36.
[20] Hatsek A, Shahar Y, et al. (2008). DeGeL: A Clinical-Guidelines Library and Automated Guideline-Support Tools. In: Ten Teije A, Miksch S, Lucas P (eds), Computer-based Medical Guidelines and Protocols: A Primer and Current Trends. Studies in Health Technology and Informatics, vol. 139, IOS Press.
[21] Hripcsak G, Ludemann P, et al. (1994). Rationale for the Arden Syntax. Computers and Biomedical Research; 27:291-324.


[22] Johnson P, Tu S, Jones N. (2001). Achieving reuse of computable guideline systems. Medinfo; 10(Pt 1):99-103.
[23] Johnson P, Tu S, Booth N, Sugden B, Purves I. (2000). Using scenarios in chronic disease management guidelines for primary care. Proceedings of the AMIA Annual Symposium; 389-393.
[24] Kahn MG, Abrams CA, et al. (1990). Automated interpretation of diabetes patient data: Detecting temporal changes in insulin therapy. In: Miller RA (ed), Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care. Los Alamitos: IEEE Computer Society Press; 569-573.
[25] Kautz H. (1991). A Formal Theory of Plan Recognition and its Implementation. In: Reasoning About Plans. Morgan Kaufmann Publishers; 69-126 (chapter 2).
[26] Kautz H. (1987). A Formal Theory of Plan Recognition. PhD Thesis, Department of Computer Science, University of Rochester.
[27] Konolige K, Pollack M. (1989). Ascribing plans to agents - preliminary report. Proceedings of the International Joint Conference on Artificial Intelligence, Detroit; 924-930.
[28] Kuilboer MM, van Wijk M, Mosseveld M, van der Lei J. (2003). AsthmaCritic: Issues in Designing a Noninquisitive Critiquing System for Daily Practice. Journal of the American Medical Informatics Association; 10:419-424.
[29] Lanzola G, Parimbelli E, Micieli G, Cavallini A, Quaglini S. (2014). Data quality and completeness in a web stroke registry as the basis for data and process mining. J Healthc Eng; 5(2):163-84.
[30] McDermott D. (1978). Planning and Acting. Cognitive Science; 2(2):71-109.
[31] Micieli G, Cavallini A, Quaglini S, Fontana G, Dué M. (2010). The Lombardia Stroke Unit Registry: 1-year experience of a web-based hospital stroke registry. Neurol Sci; 555-64.
[32] Micieli G, Cavallini A, Quaglini S. (2002). Guideline Compliance Improves Stroke Outcome: A Preliminary Study in 4 Districts in the Italian Region of Lombardia. Stroke; 33:1341-1347.
[33] Miksch S. (1999). Plan Management in the Medical Domain. AI Communications; 12(4):209-235.
[34] Miller PL. (1986). Expert Critiquing Systems: Practice-Based Medical Consultation by Computer. New York: Springer-Verlag.
[35] Musen MA, Tu SW, Das AK, Shahar Y. (1996). EON: A component-based approach to automation of protocol-directed therapy. Journal of the American Medical Informatics Association; 3(6):367-388.
[36] Musen MA, Carlson RW, et al. (1992). T-HELPER: Automated Support for Community-Based Clinical Research. Proceedings of the Sixteenth Annual Symposium on Computer Applications in Medical Care, Washington, DC.
[37] Ohno-Machado L, Gennari JH, et al. (1998). The guideline interchange format: a model for representing guidelines. Journal of the American Medical Informatics Association; 5:357-372.
[38] Panzarasa S, Quaglini S, Cavallini A, Marcheselli S, Stefanelli M, Micieli G. (2007). Computerised Guidelines Implementation: Obtaining Feedback for Revision of Guidelines, Clinical Data Model and Data Flow. Artificial Intelligence in Medicine; LNCS 4594:461-466.
[39] Panzarasa S, Quaglini S, Cavallini A, Micieli G, Pernice C, Pessina M, Stefanelli M. (2006). Workflow Technology to Enrich a Computerized Clinical Chart with Decision Support Facilities. AMIA Annu Symp Proc; 619-23.
[40] Patkar V, Hurt C, Steele R, Love S, Purushotham A, Williams M, Thomson R, Fox J. (2006). Evidence-based guidelines and decision support services: a discussion and evaluation in triple assessment of suspected breast cancer. Br J Cancer; 95(11):1490-1496.
[41] Peleg M. (2013). Comparing Computer-interpretable clinical guidelines: a methodological review. J Biomed Inform; 46(4):744-763.
[42] Peleg M, Tu S, Bury J, et al. (2003). Comparing Computer-Interpretable Guideline Models: A Case-Study Approach. Journal of the American Medical Informatics Association; 10(1):52-68.
[43] Peleg M, Boxwala A, et al. (2001). Sharable Representation of Clinical Guidelines in GLIF: Relationship to the Arden Syntax. Journal of Biomedical Informatics; 34:170-181.
[44] Quaglini S. (2008). Compliance with clinical practice guidelines. Stud Health Technol Inform; 139:160-79.
[45] Quaglini S, Cavallini A, Gerzeli S, Micieli G. (2004). Economic Benefit from Clinical Practice Guideline Compliance in Stroke Patient Management. Health Policy; 69:305-315.
[46] Quaglini S, Stefanelli M, Lanzola G, Caporusso V, Panzarasa S. (2001). Flexible guideline-based patient careflow systems. Artificial Intelligence in Medicine; 22:65-80.
[47] Rao AS, Georgeff MP. (1995). BDI-agents: From Theory to Practice. In: Proceedings of the First International Conference on Multiagent Systems (ICMAS'95), San Francisco.
[48] Riano D. (2007). The SDA model: A set theory approach. In: 20th IEEE International Symposium on Computer-Based Medical Systems; 563-568.
[49] Ruben A, Laura P, Marie DW, Darrell JG, Neil RP. (2009). Clinical Information Technologies and Inpatient Outcomes: A Multiple Hospital Study. Arch Intern Med; 169(2):108-114.
[50] Seyfang A, Miksch S, Marcos M. (2002). Combining Diagnosis and Treatment using Asbru. International Journal of Medical Informatics; 68(1-3):49-57.
[51] Shahar Y, Young O, Shalom E, et al. (2004). A hybrid, multiple-ontology framework for specification and retrieval of clinical guidelines. Journal of Biomedical Informatics; 37(5):325-344.
[52] Shahar Y, Miksch S, Johnson P. (1998). The Asgaard project: A task-specific framework for the application and critiquing of time-oriented clinical guidelines. Artificial Intelligence in Medicine; 14(1-2):29-51.
[53] Shahar Y. (1997). A framework for knowledge-based temporal abstraction. Artificial Intelligence; 90(1-2):79-133.
[54] Shahar Y, Musen MA. (1995). Plan recognition and revision in support of guideline-based care. In: Proceedings of the AAAI Symposium on Representing Mental States and Mechanisms; 118-126.
[55] Shahar Y. (1994). A knowledge-based method for temporal abstraction of clinical data. Ph.D. Dissertation, Program in Medical Information Sciences, Stanford University School of Medicine, Stanford, CA; 247-248.
[56] Sips R, Braun L, Roos N. (2006). Applying intention-based guidelines for critiquing. In: ten Teije A, Miksch S, Lucas P (editors), ECAI 2006 WS - AI techniques in healthcare: evidence-based guidelines and protocols; 83-88.
[57] Sips R, Braun L, Roos N. (2008). Enabling Medical Expert Critiquing Using a BDI Approach.
[58] Sutton DR, Fox J. (2003). The Syntax and Semantics of the PROforma guideline modelling language. J Am Med Inform Assoc; 10(5):433-443.
[59] Terenziani P, Montani S, Bottrighi A, Torchio M, Molino G, Correndo G. (2004). The GLARE approach to clinical guidelines: main features. Stud Health Technol Inform; 101:162-166.
[60] Terenziani P, Molino G, Torchio M. (2001). A modular approach for representing and executing clinical guidelines. Artif Intell Med; 23(3):249-76.
[61] Tu SW, Campbell JR, Glasgow J, et al. (2007). The SAGE Guideline Model: Achievements and Overview. J Am Med Inform Assoc; 14(5):589-598.
[62] Tu SW, Musen MA, Shankar R, et al. (2004). Modeling guidelines for integration into clinical workflow. Medinfo; 174-178.


[63] van Ast JF, Talmon JL, Renier WO, Hasman A. (2004). An approach to knowledge base construction based on expert opinions. Methods Inf Med; 43(4):427-432.
[64] van der Lei J, Musen M. (1990). A Model for Critiquing Based on Automated Medical Records. Computers and Biomedical Research; 24:344-378.
[65] van der Lei J, van Bemmel JH, van der Does E, Man in 't Veld AJ, Musen MA. (1991). Comparison of computer-aided and human review of general practitioners' management of hypertension. The Lancet; 338:1504-1508.
[66] Vashitz G, Meyer J, Parmet Y, et al. (2011). Physician adherence to the dyslipidemia guidelines is as challenging an issue as patient adherence. Family Practice; 28:524-531.
[67] Wilensky R. (1981). Meta-planning: representing and using knowledge about planning in problem solving and natural language understanding. Cognitive Science; 5:197-233.
[68] Wilensky R. (1978). Why John married Mary: understanding stories involving recurring goals. Cognitive Science; 2:235-266.
[69] Wilensky R. (1977). PAM - A Program That Infers Intentions. In: IJCAI'77.
[70] Zadeh LA. (1968). Fuzzy algorithms. Information and Control; 12(2):94-102.
[71] Zadeh LA. (1965). Fuzzy sets. Information and Control; 8(3):338-353.


The correctness of the DiscovErr system's comments was assessed by the two diabetes experts, who rated each comment as correct, partially correct, or incorrect, and also rated it as important or less important. The level of agreement between the experts was assessed using Cohen's Kappa measure and was found to be significantly high. A system comment was considered correct if it was rated as correct by one of the experts and at least as partially correct by the other expert. The correctness of the system's comments, as assessed by the two diabetes experts, was also 91%. The correctness was even higher for the comments marked as important than for the comments marked as less important. In addition, I indirectly assessed the completeness and correctness of the experts' comments, by examining the comments they made during the stage of assessing the patients' compliance and during the stage of evaluating the system's comments. The correctness of an expert's comments was defined as the relative portion of their comments that were also made by at least one additional expert. An additional measure, of "overall indirect correctness", was defined by treating the system as an additional expert; that is, an expert comment that was also made by the system was considered correct even if it was not made by another expert. This measure was used to compare the different experts and the system. The completeness of an expert's comments was defined relative to all of the system's comments that were assessed as correct by the two diabetes experts. Overall, when compared to the three experts using the above measures of the correctness of the experts and of the system, the DiscovErr system displayed a correctness level identical to that of one of the diabetes experts, which was higher than that of the family medicine expert and lower than that of the second diabetes expert; with respect to the completeness of the experts and of the system, the system displayed a higher completeness level than that of each of the experts.

In conclusion, I conclude from this research that systems such as DiscovErr can be used effectively to automatically assess the level of compliance with the longitudinal application of clinical guidelines for a large number of patients.


Abstract

Clinical guidelines are written by professional medical associations as a tool for standardizing medical care. These guidelines are published in order to assist physicians in basing their medical decisions on up-to-date research. Although these guidelines are generally accessible, it is almost impossible for physicians, who are very busy by the nature of their work, to constantly follow every guideline that is published and to comply with every new recommendation. Several methods have been presented in the past for developing automated systems for the recognition of guideline-based care plans, and for quality assessment through examination of the compliance with their recommendations. However, no integrated system has so far been presented that includes a graphical interface for acquiring the medical knowledge; a digital library of clinical guidelines represented in a formal format; a computational engine for guideline-based quality assessment that analyzes the compliance to their longitudinal application, based on their formal representation; and a graphical interface for presenting the results of the compliance analysis.

In this research, I developed and evaluated a new system, called DiscovErr, for performing quality assessment and analysis of compliance to the longitudinal application of clinical guidelines. The system is based on a formal representation of knowledge regarding the care processes recommended by the guidelines, and of knowledge regarding the precise definitions of the data interpretation required by them, directly or indirectly, in order to analyze the level of compliance to the guidelines in the data stored over time in the electronic medical record.

Furthermore, by using a flexible computational mechanism based on fuzzy temporal logic, the compliance-assessment engine takes into account the vagueness that sometimes exists in the guidelines, the uncertainty that exists in the data, and the fact that in many cases the care providers, and even the patients, do not follow the detailed instructions of the guidelines, but rather apply them in a manner that satisfies their main recommendations. To evaluate the system, I represented in a formal format an up-to-date guideline for the management of patients diagnosed with type 2 diabetes. I then performed several experiments by applying the system to a significant amount of real patient data, and comparing the system's comments regarding the compliance to the guideline with the comments of three medical experts (two diabetes experts and one family medicine expert) who examined the patients' medical records as well as the system's detailed comments.

The completeness of the DiscovErr system's comments was assessed by comparing its comments to those of the experts, and was defined as the relative portion of the experts' comments that were also made by the system. The completeness of the system's comments rose from 66% for comments made by only one expert, to 83% for comments made by exactly two experts, and reached 98% for comments made by all of the experts. The completeness was 91% for comments made by at least a majority of the three experts (two experts or more), a level that I consider to be the overall completeness of the system's comments.

This work was carried out under the supervision of Prof. Yuval Shahar

In the Department of Information Systems Engineering

Faculty of Engineering

A Knowledge-Based Assessment of Compliance to the Longitudinal Application of Clinical Guidelines

Thesis submitted in partial fulfillment of the requirements for the degree of "Doctor of Philosophy"

by

Avner Hatsek

Submitted to the Senate of Ben-Gurion University of the Negev

Approved by the advisor ______

Approved by the Dean of the Kreitman School of Advanced Graduate Studies ______

Kislev 5775, November 2014

Beer-Sheva

A Knowledge-Based Assessment of Compliance to the Longitudinal Application of Clinical Guidelines

Thesis submitted in partial fulfillment of the requirements for the degree of "Doctor of Philosophy"

by

Avner Hatsek

Submitted to the Senate of Ben-Gurion University of the Negev

Kislev 5775, November 2014

Beer-Sheva