An Empirical Comparison of Methods for Eliciting and Modeling Expert Knowledge



Hoffman, R. R. (2002, September). “An Empirical Comparison of Methods for Eliciting and Modeling Expert Knowledge.” In Proceedings of the 46th Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society.

AN EMPIRICAL COMPARISON OF METHODS FOR ELICITING AND MODELING EXPERT KNOWLEDGE

Robert R. Hoffman, Ph.D., John W. Coffey, Ed.D., Mary Jo Carnot, M.A., and Joseph D. Novak, Ph.D.
Institute for Human and Machine Cognition, University of West Florida

The goal of this project was to apply a variety of methods of Cognitive Task Analysis (CTA) and Cognitive Field Research (CFR) to support a process going all the way from knowledge elicitation to system prototyping, and also use this as an opportunity to empirically compare and evaluate the methods. The research relied upon the participation of expert, journeyman, and apprentice weather forecasters at the Naval Training Meteorology and Oceanography Facility at Pensacola Naval Air Station. Methods included protocol analysis, a number of types of structured interviews, workspace and work patterns analysis, the Critical Decision Method, the Knowledge Audit, Concept Mapping, and the Cognitive Modeling Procedure. The methods were compared in terms of (1) their yield of information that was useful in modeling expert knowledge, (2) their yield in terms of identification of leverage points (where the application of new technology might bring about positive change), and (3) their efficiency. Efficiency was gauged in terms of total effort (time to prepare to run a procedure, plus time to run the procedure, plus time to analyze the data) relative to the yield (number of leverage points identified, number of propositions suitable for use in a model of domain knowledge). CTA/CFR methods supported the identification of dozens of leverage points and also yielded behaviorally-validated models of the reasoning of expert forecasters. Knowledge modeling using Concept-Mapping resulted in thousands of propositions covering domain knowledge. The Critical Decision Method yielded a number of richly-populated case studies with associated Decision Requirements Tables. Results speak to the relative efficiency of various methods of CTA/CFR, and also the strengths of each of the methods. 
In addition to extending our empirical base on the comparison of knowledge elicitation methods, a deliverable from the project was a knowledge model that illustrates the integration of training support and performance aiding in a single system.

INTRODUCTION

The empirical comparison of knowledge elicitation (KE) methods is nearly 20 years old, dated from Duda and Shortliffe (1983), who recognized what came to be called the "knowledge acquisition bottleneck"—that it took longer for computer scientists to interview experts and build a knowledge base than it did to actually write the software for the expert system. The first systematic comparisons of knowledge elicitation methods (i.e., Burton, Shadbolt, Hedgecock, & Rugg, 1987; Hoffman, 1987), and the first wave of psychological research on expertise (e.g., Chi, Feltovich, & Glaser, 1981; Chi, Glaser, & Farr, 1988; Glaser, et al., 1985; Hoffman, 1992; Shanteau, 1992; Zsambok & Klein, 1997), resulted in some guidance concerning knowledge elicitation methodology (see Cooke, 1994; Hoffman, Shadbolt, Burton, & Klein, 1995). In the subsequent years, new methods were developed, including the Critical Decision Method (see Hoffman, Crandall, & Shadbolt, 1998) and the Cognitive Modeling Procedure (Hoffman, Coffey, & Carnot, 2000). In addition, a number of research projects have attempted to extend our empirical base on knowledge elicitation methodology, including Thorsden's (1991) comparison of Concept Mapping with the Critical Decision Method, and Evans, Jentsch, Hitt, Bowers, and Salas' (2001) comparison of Concept Mapping with methods for rating and ranking domain concepts.

A factor that has made interpretation difficult is that some studies have used college-age participants (and, of course, assessments of the sorts of knowledge that they would possess, e.g., sports, fashion). The transfer of the findings to knowledge elicitation for genuine experts in significant domains is questionable. A second and major difficulty in the comparison of KE methods is the selection of dependent variables. Hoffman (1987) compared methods in terms of relative efficiency—the number of useful propositions obtained per total task minute, where total task time is the time taken to prepare to run the KE procedure, plus the time taken to run the procedure, plus the time taken to analyze the data and cull out the useful propositions; and where the adjective "useful" was applied to any proposition that was not already contained in the first-pass knowledge base that had been constructed on the basis of a documentation analysis. (A somewhat similar metric, number of elicited procedural rules, was utilized in the work of Burton et al., 1987.) Hoffman's initial purpose for creating an efficiency metric involved the need of computer scientists to assess the usefulness of the results in terms of building knowledge bases for expert systems. While a somewhat reasonable metric from the standpoint of first-generation expert systems, it would not work for all of the purposes of either computer science or experimental psychology.

For their dependent variable, Evans et al. (2001) generated correlations of the similarity ratings among domain concepts. This correlation approach makes it possible to lock down the relative similarity of domain concepts and scale the convergence among alternative methods (e.g., ranking versus Concept-Mapping), but raw pairwise similarity of domain concepts glosses over the meaning and content that are necessary for the construction of models.

Another factor that clouds the interpretation of results from some studies that have used the Concept-Mapping procedure is that it is often apparent that the Concept Maps that are created (either by domain practitioners or by practitioners in a collaboration with the researchers) are lacking in the qualities that define Concept Maps. These criteria, and their foundations in the theory of meaningful learning, have been discussed by Novak and his colleagues (e.g., Ausubel, Novak, & Hanesian, 1978; Novak, 1998). Criteria include semi-hierarchical morphology, propositional coherence, labeled links, the use of cross-links, and the avoidance of certain pitfalls that characterize Concept Maps made by unpracticed individuals (including the creation of "fans," "stacks," sentence-like "spill-overs," and other features).

A final factor that makes interpretation difficult is the fact that some studies involve apples-oranges comparisons. For instance, to those who are familiar with the techniques, it would make little sense to compare a concept sorting task to Concept-Mapping in terms of their ability to yield models of expert reasoning—in fact, neither method is suited to that purpose. One goal of the present research was to create a comparison that involved a reasonable mix of alternative methods, but also to put all of the methods on a more level playing field. Hoffman's efficiency metric was re-defined as the yield of useful propositions, useful in that they could be used in a variety of ways (and not just in creating a knowledge base for an expert system). One could seek to create models of expert knowledge, or create models of expert reasoning. In addition, a second metric was used to carve out the applications aspect of KE research—the yield of leverage points. A leverage point was defined as any aspect of the domain or work practice where an infusion of new tools (simple or complex) might result in an improvement in the work. Leverage points were initially identified by the researchers but were then affirmed by the domain practitioners themselves. Also, there was ample opportunity for convergence in that leverage points could be identified in the results from more than one KE method.1

METHODS

Participants

Participants (n = 22) were senior expert civilian forecasters, junior Aerographers (i.e., Apprentices who were qualified as Observers) and senior Aerographers (i.e., Advanced Journeymen and Journeymen who were qualified as Forecasters) at the Meteorology and Oceanography Training Facility at Pensacola Naval Air Station.

Methods

The following methods of CTA/CFR were utilized:
1. Bootstrapping (documentation analysis, analysis of SOP documents, the Recent Case Walkthrough method),
2. Proficiency Scaling (Participant Career Interviews; comparison of experience versus forecast hit rates as a measure of actual performance),
3. Client (i.e., pilots and pilot trainers) Interviews,
4. Workspace Analysis (repeated photographic surveys, detailed workspace mapping),
5. Workpatterns Analysis (live and videotaped Technical Training Briefings, Watchfloor observations),
6. The Knowledge Audit,
7. Decision Requirements Analysis,
8. The Critical Decision Method,
9. The Cognitive Modeling Procedure (see Hoffman, et al., 2000),
10. Protocol Analysis,
11. Concept Mapping using the CMap Tools software.

RESULTS AND DISCUSSION

The conduct of some methods was relatively easy and quick. For example, the Knowledge Audit procedure took a total of 70 minutes. Others were quite time-consuming. For instance, we conducted over 60 hours of Concept Mapping sessions.

Full protocol analysis of a single knowledge modeling session took a total of 18 hours to collect and analyze the data. Results for protocol analysis confirm a finding from previous studies (Burton, et al., 1990; Hoffman, et al., 1995), that full protocol analysis (i.e., transcription and functional coding of audiotaped protocol statements, with independent coders) is so time consuming and effortful as to have a relatively low effective yield. Knowledge models and reasoning models can be developed, refined, and validated much more efficiently (i.e., by orders of magnitude), using such procedures as Concept Mapping and the Cognitive Modeling Procedure.

The CDM

The CDM worked effectively as a method for generating rich case studies. However, the present results provide a useful qualification to previous reports on the CDM (e.g., Hoffman, et al., 1998). A lesson learned in the present project was that in this domain and organizational context, the conduct of each CDM session had to span more than one day. On the first day the researcher would conduct the first 3 steps in the CDM, then retreat to the lab to input the results into the method's boilerplate forms. The researcher returned to the workplace on a subsequent day to complete the procedure. Weather forecasting cases are rich (in part because weather phenomena can span days and usually involve dozens of data types and scores of data fields). More importantly, expert forecasters' memories of cases are often remarkably rich. Indeed, there is a tradition in meteorology to convey important lessons by means of case reports (e.g., Buckley & Leslie, 2000; any issue of The Monthly Weather Review). The impact of this domain feature was that the conduct of the CDM was time-consuming and effortful. Previous studies had suggested that the CDM procedure takes about 2 hours, but those measurements only looked at session time. The present study involved a more inclusive measure of effort, total task time, and in the present research context, the conduct of the CDM took about 10 hours per case.

Concept Mapping

We are led to qualify a conclusion of Thorsden (1991), who also used the CDM in conjunction with Concept Mapping. Thorsden argued that the strength of the CDM lies in eliciting "tacit knowledge" whereas Concept Mapping has its strength in supporting the domain practitioner in laying out a model of their tasks. Putting aside legitimate (and overdue) debate about the meaning of the phrase "tacit knowledge," we see the greatest strength of the CDM to be the generation of rich case studies, including information about cues, hypothetical reasoning, strategies, etc. (i.e., decision requirements), all of which can be useful in the modeling of the reasoning procedures or strategies. The strength of Concept Mapping lies in generating models of domain knowledge. Concept Mapping (either paper-and-pencil or through the use of the CMap Tools software) can be used to concoct diagrams that look like flow diagrams or decision trees. Our experience is that it is easy for novices to see Concept Maps as being flow-diagrams or models of procedural knowledge. However, good Concept Maps can just as easily describe the domain in a way that is task and device independent. (And therefore the Concept Mapping procedure can provide a window into the nature of the "true work," as in Vicente, 1999.)

To put a fine point on it, our calculations of yield (number of mappable propositions generated per total task minute) place Concept Mapping right on the mark in terms of rate of gain of information for knowledge modeling. Previous guidance (Hoffman, 1987) was that the "effective" knowledge elicitation techniques yield two or more informative propositions per total task minute. (Again by comparison, full protocol analysis was calculated to yield less than one informative proposition per total task minute.) In the present research, it took about 1.5 to 2 hours to create, refine, and verify each Concept-Map. (The Concept Maps contained an average of 47 propositions. Verification took about seven propositions per minute, for about seven minutes per Concept-Map.) The rate of gain for Concept Mapping was just about two mappable propositions per session minute. If one takes into account the fact that for the Concept Mapping procedure, session time actually is total task time (i.e., there is no preparation time and the result from a session is the final product), it can be safely concluded that Concept Mapping is at least as efficient at generating models of domain knowledge as any other method of knowledge elicitation. Indeed, it is quite probably much more efficient.

Leverage Points

In terms of effectiveness at the identification of leverage points, 35 in all were identified. Leverage points ranged all the way from simple interventions (e.g., a tickle board to remind the forecasters of when certain tasks need to be conducted) to the very complex (e.g., an AI-enabled fusion box to support the forecaster's creation of a visual representation of their mental models of atmospheric dynamics). All of the leverage points were affirmed as being leverage points by one or more of the participating experts.2 Furthermore, all of the leverage points were confirmed by their identification in more than one method. The leverage points were placed into broad categories (e.g., decision-aids for the forecaster, methods of presenting weather data to pilots, methods of archiving organizational knowledge, etc.). No one of the CTA/CFR methods resulted in leverage points that were confined to any one category. We found it interesting that, overall, the observational methods (e.g., Watchfloor observations) had a greater yield of identified leverage points. On the other hand, acquiring those leverage points took more time. For example, we observed 15 weather briefings that were presented either to pilots or to the other forecasters, resulting in 15 identified leverage points. But the yield was 15/954 minutes = 0.016 leverage points per observed minute.

APPLICATION TO SYSTEM DESIGN

After identifying the preservation of local weather forecasting expertise as an organizationally-relevant leverage point for a prototyping effort, the models of reasoning that were created using the Cognitive Modeling Procedure, the models of knowledge that were created using the Concept Mapping Procedure, and the case studies that were created using the CDM were all integrated into a Concept Map-based Knowledge Model. This model contained 24 Concept-Maps, which themselves contained a total of 1,129 propositions and 420 individual multimedia resources. This "System To Organize Representations in Meteorology-Local Knowledge" (STORM-LK) is not an expert system but instead uses the Concept-Maps, a model of the expert's knowledge, to be the interface to support the trainee or practicing forecaster as they navigate through the work domain. A screen shot of a Concept-Map is presented in Figure 1, below. The screen shot in Figure 2 shows a Concept-Map overlaid with examples of some of the kinds of resources that are directly accessible from the clickable icons that are appended to many of the concept-nodes. These include satellite images, charts, and digitized videos that allow the apprentice to "stand on the expert's shoulders" by viewing mini-tutorials.

Also appended to concept-nodes are Concept Map icons that take one to the Concept Map indicated by the concept-node to which the icon is attached. The Top Map serves as a "Map of Maps" in that it contains concept-nodes that designate all of the other Concept-Maps (e.g., cold fronts, thunderstorms, etc.). At the top node in every other Concept Map is an icon that takes one back to the Top Map and to all of the immediately associated Concept-Maps. For example, the Top Map contains a concept-node for Hurricanes, and appended to that are links to both of the Concept-Maps that are about hurricanes (i.e., hurricane dynamics and hurricane developmental phases). Through the use of these clickable icons, one can meaningfully navigate from anywhere in the knowledge model to anywhere else, in two clicks at most. Disorientation in webspace becomes a non-issue.

STORM-LK contains all of the information in the "Local Forecasting Handbook," and since the Concept Maps are web-enabled, they allow real-time access to actual weather data (radar, satellite, computer forecasts, charts, etc.)—within a context that provides the explanatory glue for the weather understanding process. STORM-LK is intended also for use in distance learning and collaboration, acceleration of the acquisition of expertise, and knowledge preservation at the organizational level. Evaluations and extensions of STORM-LK are currently underway.

CONCLUSION

Our understanding of the strengths and weaknesses of alternative CTA/CFR methods is becoming more refined, as is our understanding that knowledge elicitation is one part of a larger process of co-creative system design and evaluation (see Hoffman & Woods, 2000; Hollnagel & Woods, 1983; Potter, Roth, Woods, and Elm, 2000; Rasmussen, 1992; Vicente, 1999), a larger process that embraces both the science and aesthetics of the design of complex cognitive systems. However, there remains a need for more work along these lines, especially including studies in domains of expertise having characteristics that differ from those of the domains that have been studied to date. Additional KE methods can be examined as well.

Footnotes

1. To be sure, other researchers might have identified leverage points other than the ones we identified.
2. We can note also that leverage point affirmation also took the form of concrete action on the basis of our recommendations. For instance, the physical layout of the watchfloor was changed.

References

Ausubel, D. P., Novak, J. D., & Hanesian, H. (1978). Educational psychology: A cognitive view (2nd ed.). New York: Holt, Rinehart and Winston.
Buckley, B. W., & Leslie, L. M. (2000). The Australian Boxing Day storm of 1998: Synoptic description and numerical simulations. Weather & Forecasting, 16, 543-558.
Burton, A. M., Shadbolt, N. R., Hedgecock, A. P., & Rugg, G. (1987). A formal evaluation of knowledge elicitation techniques for expert systems: Domain 1. In D. S. Moralee (Ed.), Research and development in expert systems, Vol. 4 (pp. 35-46). Cambridge: Cambridge University Press.
Chi, M. T. H., Feltovich, P. J., & Glaser, R. (1981). Categorization and representation of physics problems by experts and novices. Cognitive Science, 5, 121-152.
Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.) (1988). The nature of expertise. Mahwah, NJ: Erlbaum.
Cooke, N. M. (1994). Varieties of knowledge elicitation techniques. International Journal of Human-Computer Studies, 41, 801-849.
Duda, R. O., & Shortliffe, E. H. (1983). Expert systems research. Science, 220, 261-268.
Evans, A. W., Jentsch, F., Hitt, J. M., Bowers, C., & Salas, E. (2001). Mental model assessments: Is there convergence among different methods? In Proceedings of the Human Factors and Ergonomics Society 45th Annual Meeting (pp. 293-296). Santa Monica, CA: Human Factors and Ergonomics Society.
Glaser, R., Lesgold, A., Lajoie, S., Eastman, R., Greenberg, L., Logan, D., Magone, M., Weiner, A., Wolf, R., & Yengo, L. (1985). Cognitive task analysis to enhance technical skills training and assessment. Report, Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA.
Hoffman, R. R. (1987, Summer). The problem of extracting the knowledge of experts from the perspective of experimental psychology. The AI Magazine, 8, 53-67.
Hoffman, R. R. (Ed.). (1992). The psychology of expertise: Cognitive research and empirical AI. New York: Springer Verlag.
Hoffman, R. R., Coffey, J. W., & Carnot, M. J. (2000, November). Is there a "fast track" into the black box?: The Cognitive Models Procedure. Poster presented at the 41st annual meeting of the Psychonomics Society, New Orleans, LA.
Hoffman, R. R., Crandall, B., & Shadbolt, N. (1998). A case study in cognitive task analysis methodology: The Critical Decision Method for the elicitation of expert knowledge. Human Factors, 40, 254-276.
Hoffman, R. R., Shadbolt, N., Burton, A. M., & Klein, G. A. (1995). Eliciting knowledge from experts: A methodological analysis. Organizational Behavior and Human Decision Processes, 62, 129-158.
Hoffman, R. R., & Woods, D. D. (2000). Studying cognitive systems in context. Human Factors, 42, 1-7.
Hollnagel, E., & Woods, D. D. (1983). Cognitive Systems Engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18, 583-600.
Novak, J. D. (1998). Learning, creating, and using knowledge. Mahwah, NJ: Erlbaum.
Potter, S. S., Roth, E. M., Woods, D. D., & Elm, W. C. (2000). Bootstrapping multiple converging cognitive task analysis techniques for system design. In J. M. Schraagen & S. F. Chipman (Eds.), Cognitive task analysis (pp. 317-340). Mahwah, NJ: Erlbaum.
Rasmussen, J. (1992). Use of field studies for design of workstations for integrated manufacturing systems. In M. Helander & N. Nagamachi (Eds.), Design for manufacturability: A systems approach to concurrent engineering and ergonomics (pp. 317-338). London: Taylor and Francis.
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational Behavior and Human Decision Processes, 53, 252-266.
Thorsden, M. L. (1991). A comparison of two tools for cognitive task analysis: Concept Mapping and the Critical Decision Method. In Proceedings of the Human Factors Society 35th Annual Meeting (pp. 283-285). Santa Monica, CA: Human Factors Society.
Vicente, K. (1999). Cognitive work analysis: Toward safe, productive, and healthy computer-based work. Mahwah, NJ: Erlbaum.
Zsambok, C. E., & Klein, G. (Eds.) (1997). Naturalistic decision making. Mahwah, NJ: Erlbaum.

Figure 1. A screen shot from STORM-LK showing a Concept-Map

Figure 2. A screen shot from STORM-LK showing example resources
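As an illustration only (not part of the original paper's materials), the efficiency metric used in this study, useful propositions per total task minute with total task time counted as preparation plus session plus analysis time, might be computed as in the following sketch. The function name and the 94-propositions-in-47-minutes session are hypothetical; the 15 leverage points per 954 observed minutes figure is from the paper.

```python
def efficiency(useful_props, prep_min, session_min, analysis_min):
    """Useful propositions elicited per total task minute, where
    total task time = preparation + session + analysis time."""
    total_task_min = prep_min + session_min + analysis_min
    return useful_props / total_task_min

# Concept Mapping: session time is total task time (no separate
# preparation or analysis phase), and the reported rate of gain was
# about two mappable propositions per session minute. The specific
# 94-proposition / 47-minute session below is illustrative.
cmap_rate = efficiency(94, prep_min=0, session_min=47, analysis_min=0)
print(cmap_rate)  # 2.0

# Leverage-point yield for the observed briefings, using the figures
# reported in the paper: 15 leverage points over 954 observed minutes.
briefing_yield = 15 / 954
print(round(briefing_yield, 3))  # 0.016
```

Note that the metric deliberately penalizes methods with heavy preparation or analysis overhead (such as full protocol analysis) even when their per-session yield looks reasonable.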
