Quantifying Uncertainty in Expert Judgment: Initial Results
Dennis R. Goldenson Robert W. Stoddard II
March 2013
TECHNICAL REPORT CMU/SEI-2013-TR-001 ESC-TR-2013-001
Software Engineering Measurement and Analysis
http://www.sei.cmu.edu
This report was prepared for the
SEI Administrative Agent
AFLCMC/PZE
20 Schilling Circle, Bldg 1305, 3rd floor
Hanscom AFB, MA 01731-2125
NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.
This material has been approved for public release and unlimited distribution except as restricted below.
Internal use:* Permission to reproduce this material and to prepare derivative works from this material for internal use is granted, provided the copyright and “No Warranty” statements are included with all reproductions and derivative works.
External use:* This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other external and/or commercial use. Requests for permission should be directed to the Software Engineering Institute at [email protected].
* These restrictions do not apply to U.S. government entities.
Carnegie Mellon® is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University.
DM-0000137
Table of Contents
Acknowledgments v
Abstract vii
1 Introduction 1
  1.1 The Research Problem 1
  1.2 Research Method 1
2 Experimental Design 3
  2.1 Participant Guidance and Training 3
  2.2 Selection of Study Participants 3
  2.3 Data Collection and Management 4
3 Experimental Results 6
  3.1 Calibrating Judgment 6
    3.1.1 Characterizing Uncertainty 6
    3.1.2 Our Focus on Precision and Accuracy 9
    3.1.3 Summarizing Relationships Among the Three Measures 11
  3.2 Impact of Generic Training on Domain-Specific Judgment 17
  3.3 Participant Feedback 19
4 Conclusion 28
  4.1 Summary of the Research 28
  4.2 Next Steps 28
Appendix A: The Domain-Specific Test Batteries 31
Appendix B: The Domain-Specific Reference Points for Test Battery 3 42
Appendix C: Participant Feedback Questionnaire 49
References/Bibliography 51
List of Figures
Figure 1: Interpreting Box Plots 7
Figure 2: Accuracy-Within-Bounds by Test Battery 8
Figure 3: Balancing Accuracy and Precision 9
Figure 4: Relative Accuracy by Test Battery 10
Figure 5: Precision by Test Battery 11
Figure 6: Relative Accuracy by Accuracy-Within-Bounds: Generic Test Batteries 12
Figure 7: Relative Accuracy by Accuracy-Within-Bounds: Domain-Specific Test Batteries 13
Figure 8: Relative Accuracy by Precision: Generic Test Batteries 14
Figure 9: Relative Accuracy by Precision: Domain-Specific Test Batteries 15
Figure 10: Accuracy-Within-Bounds by Precision: Generic Test Batteries 16
Figure 11: Accuracy-Within-Bounds by Precision: Domain-Specific Test Batteries 16
Figure 12: Accuracy-Within-Bounds by Calibration Training and Domain-Specific Test Battery 17
Figure 13: Precision by Calibration Training and Domain-Specific Test Battery 18
Figure 14: Relative Accuracy by Calibration Training and Domain-Specific Test Battery 19
Figure 15: Familiarity with Software Systems 20
Figure 16: Participants’ Surprise About How They Did 20
Figure 17: Difficulty Answering Questions 21
Figure 18: How Much Guidance Participants Would Have Liked 22
Figure 19: How Much Practice Participants Would Have Liked 22
Figure 20: Methods Participants Used to Match Intervals with State of Knowledge 23
Figure 21: Informativeness of the Contextual Information in the Test Questions 24
Figure 22: Informativeness of the Reference Points Tables 24
Figure 23: Helpfulness of the Contextual Information and Reference Points 25
Figure 24: Value of Generic Training Exercises for Answering Domain-Specific Questions 27
Figure 25: Over and Under Confidence as Measured by a Brier Score 30
List of Tables
Table 1: Example Generic Test Questions 4
Table 2: Example Questions from Domain-Specific Test Battery 1 4
Table 3: Example Questions from Domain-Specific Test Battery 2 5
Table 4: Number of Study Participants 6
Acknowledgments
First of all we offer our heartfelt thanks to the individuals who participated in the training exercises and provided the data from which we drew our results. We clearly would have nothing to report without them. Our colleagues in the SEI’s cost estimation research group were instrumental in myriad ways in the design and implementation of this study as well as the larger project on Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE) of which this work is a part. They include in particular Bob Ferguson, Jim McCurley, Dave Zubrow, and Mike Zuccher from the SEI, Eduardo Miranda of the Carnegie Mellon School of Computer Science, and Ricardo Valerdi of the University of Arizona.
Eduardo Miranda also was instrumental in recruiting study participants from the Carnegie Mellon School of Computer Science. Ricardo Valerdi coordinated access to defense and industry participants in a master class held in Melbourne, Australia. R. Krishnan, Susan Timko, and Andy Wasser kindly helped us recruit study participants from Carnegie Mellon’s Heinz College. Teri Reiche, the director of Carnegie Mellon’s institutional review board, was extremely helpful to us in obtaining appropriate authorization for our research involving human subjects. Jill Diorio and Jerry Pottmeyer of the SEI’s ethics compliance team were similarly helpful in gaining access to sites for conducting the training on the Carnegie Mellon campus as well as in their usual responsibilities for work of this kind.
Doug Hubbard graciously gave us permission to reuse some of his large catalog of generic test questions. Mike Cullen and his engineering leadership team at Tektronix provided valuable insights for implementing and refining our methods during an early site visit there. The SEI’s Julie Cohen, John Foreman, and Tim Morrow participated in another early calibration training session where they also recognized the importance of such exercises to quantify expert judgment for senior DoD personnel responsible for risk analysis and program management. In addition to John Foreman, others from the SEI and the DoD who have been particularly helpful at various stages in this research include Rob Flowe, John Robert, Mike Philips, David Scherb, and Maryann Watson.
As always we owe a great debt of gratitude to Erin Harper, who is a valued colleague on the research as well as an outstanding technical writer. Finally, we very much appreciate the leadership of Anita Carleton, Michael May, Bill Scherlis, and Doug Schmidt in the SEI and DoD management chain for their recognition of the importance of this research and approval of funding for it.
Abstract
The work described in this report, part of a larger SEI research effort on Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE), aims to develop and validate methods for calibrating expert judgment. Reliable expert judgment is crucial across the program acquisition lifecycle for cost estimation, and perhaps most critically for tasks related to risk analysis and program management. This research is based on three field studies that compare and validate training techniques aimed at improving the participants’ skills to enable more realistic judgments commensurate with their knowledge.
Most of the study participants completed three batteries of software engineering domain-specific test questions. Some participants also completed four batteries of questions about a variety of general knowledge topics for purposes of comparison. Results from both sets of questions showed improvement in the participants' recognition of their true uncertainty. The domain-specific training was accompanied by notable improvements in the relative accuracy of the participants' answers when more contextual information was provided with the questions, along with “reference points” about similar software systems. Moreover, the additional contextual information in the domain-specific training helped the participants improve the accuracy of their judgments while also reducing their uncertainty in making those judgments.
1 Introduction
1.1 The Research Problem
Reliance on expert judgment is indispensable for unprecedented Major Defense Acquisition Programs (MDAPs). Many decisions, both technical and managerial, rely on expert judgment in the absence of sufficient historical data throughout the system lifecycle. Yet research and experience show that experts often are overconfident and overoptimistic in their judgments under uncertain conditions [4, 7, 9, 11, 14]. A major part of our larger research agenda on Quantifying Uncertainty in Early Lifecycle Cost Estimation (QUELCE), the work described here aims to develop and validate methods for calibrating expert judgment in early DoD cost estimation. Reliable expert judgment is crucial for many tasks across the program acquisition lifecycle, including cost estimation and perhaps most critically for tasks related to risk analysis and program management.
There is a large literature on overconfidence and optimism in expert judgment. While the literature on the effectiveness of training to calibrate the judgment of individual experts is smaller, results show that calibration training can lead to marked improvements in the trainees’ judgment skills [12]. However, the literature focuses mostly on reducing overconfidence. More work is necessary to better understand how to make judgment more realistic and accurate. Moreover, much of the literature relies on generic questions about a wide variety of general knowledge even though the expertise needed is usually domain-specific [12].
The QUELCE method relies on expert judgment at several steps, including (a) the identification of “change drivers” that can affect the costs of a given project over its lifecycle, (b) the identification of states within a change driver, (c) the probability of a change driver departing from a nominal status, (d) the strength of the cause-effect relationship between one change driver and another, and (e) identifying any significant interactions between change drivers that may jointly affect a third change driver. The outputs of the QUELCE method serve as inputs to existing cost estimation models [6, 19]. Expert judgment must be consistently dependable and repeatable to be credible within cost estimation. Therefore, a method is needed to ensure that expert judgment is satisfactorily calibrated before experts participate in the QUELCE method.
1.2 Research Method
Our research includes a series of field studies to compare and validate training techniques aimed at improving expert judgment skills. Our current focus is on training to improve individual judgment skills to enable participants to make more realistic judgments commensurate with the state of their knowledge [13].
We followed a phased approach to reduce risk before doing experiments with DoD or contractor personnel. As seen in Section 3, participants in the first three studies were Carnegie Mellon University software engineering graduate students, members of the SEI technical staff, and participants in a cost estimation master class in Australia. We used this initial phase to refine our understanding and use of domain-specific questions and “reference points” before undertaking more expensive and logistically difficult experiments with defense experts.
As shown in more detail in Section 2.3 and Appendix A, our domain-specific questions provide much more contextual background information than do the typical short “trivial pursuit” questions that have been used in other research on calibrating judgment or risk literacy. We chose to ask questions about existing software systems in the first phase of our research. The reference points provide comparable information about systems from similar application domains. While there is increasing recognition in the research that expertise is domain specific [4, 12], to our knowledge such questions and reference points are unprecedented in the research literature on calibrating expert judgment.
Our training materials emphasize the importance of recognizing uncertainty. However, the ultimate goal also includes improving the accuracy of estimates for use in QUELCE and other decision making under uncertain conditions. Discussion and information sharing about various heuristics was meant to help the participants establish reasonable bounds of uncertainty around their answers to the test questions. The increased contextual information in the domain-specific questions and reference points was meant to narrow those bounds around the correct answers as the participants considered other pertinent factors.
In addition to recording their answers to the test questions, the participants completed a short feedback questionnaire at the end of the domain-specific training (see Appendix C). We used paper forms to collect the data in the first two field studies, but we replaced the paper with custom software support in the third study.
As shown in Appendices A and C, the software keeps the study participants from making a number of errors typically made when completing paper forms. A major time saver for both the study participants and us, the software also keeps study staff from misinterpreting hard-to-decipher handwriting.1
1 As shown in Appendix B, the reference points still are limited to paper. Resources permitting, we may modify the current software user interface for use in distance learning to allow participants to see only a single question at a time with no back referencing to compare with their earlier answers. We also intend to improve the interface for querying the reference points more flexibly and efficiently.
2 Experimental Design
2.1 Participant Guidance and Training
The individual calibration training began with a brief introduction that included guidance about how to make more realistic judgments tempered with a degree of confidence to reflect the participants’ actual knowledge. Experts often are expected to know the “right” answer. We stress that it is vital to recognize what remains uncertain under as yet unknown circumstances.
The introductions were followed by a series of three or four calibration exercises. Each exercise started with a battery of factual questions. The questions asked the trainees to provide upper and lower bounds within which they were 90 percent certain the correct answer was included. Each test battery was followed immediately by a brief review of the correct answers. A short discussion followed, in which the students were given further guidance about ways to explicitly consider interdependencies among related factors that might affect the basis of their best judgments under uncertain circumstances.
The guidance included heuristics about the following:
• ways to increase the odds of being right
• thinking of other factors beyond the questions themselves that might affect the pros and cons of being right
• adjusting answers based on previous feedback
• avoiding “anchoring” on an initial “best” answer before thinking about why you may be wrong
• thinking first about why you might be wrong and then reducing your uncertainty based on your knowledge of related things
We limited the size of each training session to a maximum of 15 participants to make the training more manageable to conduct and valuable for the participants. The sessions were small enough to encourage wide-ranging discussion and active learning among the participants. Small sessions also allowed us to incrementally increase the total number of cases and diversity of the total sample. Each training session took 2½ to 3 hours. As shown in Section 3 we included both generic and domain-specific questions to test hypotheses about the comparative effectiveness of their use in the training.
2.2 Selection of Study Participants
Participation in the study was entirely voluntary in compliance with approval for research on human subjects by the Carnegie Mellon Institutional Review Board. Since most of the training exercises were held outside of normal class time, we relied on flyers, email, and the good offices of faculty colleagues and deans to encourage participation. Light meals and snacks were provided as appropriate for the time of day. Participants received a report of their own performance and the overall results. Anything that could identify the participants personally was of course held in strict confidence and stored separately in a secure manner with access limited to the research team.
2.3 Data Collection and Management
Four generic calibration test batteries were used that included 20 short questions each. Examples from the first battery are in Table 1.
Table 1: Example Generic Test Questions
How many feet tall is the Hoover dam?
What percentage of aluminum is recycled in the US?
In 1913, the US military owned how many airplanes?
The first European printing press was invented in what year?
In what year was Harvard founded?
What is the wingspan (in feet) of a Boeing 747 jumbo jet?
It required more time to consider the domain-specific questions since they and the associated reference points included more contextual information than the generic questions. Hence the domain-specific test batteries were limited to 10 questions in each of three batteries. Examples from the first battery are in Table 2. The contextual information for each software system is followed by the question itself. The full question set is shown in Appendix A.2
Table 2: Example Questions from Domain-Specific Test Battery 1

Context: Epiphany is the web browser for the GNOME desktop. GNOME (GNU Network Object Model Environment) runs on Unix-like operating systems, most notably Linux. Powered by the WebKit engine, Epiphany aims to provide an uncomplicated user interface that enables users to focus on Web content instead of the browser application.
Question: How much total effort in person years has been spent on this project?

Context: Apache JAMES Project: A complete and portable enterprise mail engine based on open protocols; also a mail application platform that allows processing emails, e.g., to generate automatic replies, update databases, filter spam, or build message archives.
Question: What is the project’s current codebase size in LOC?

Context: LibreOffice: A multi-platform, integrated office suite based on copyleft licenses and compatible with most document formats and standards: Includes spreadsheet, word processor, chart, business productivity, presentation, database, linix, C++ and other applications.
Question: How much total effort in person years has been spent on this project?

Context: OpenGroupware.org is a set of applications for contact, appointment, project, and content management. It is comparable to Exchange and SharePoint portal servers. It is accessible using Web interfaces and various native clients, including Outlook. Its servers run on almost any GNU/Linux system, can synchronize with Palm PDAs, and are completely scriptable using XML-RPC.
Question: What is the current codebase size in LOC?
The first domain-specific test battery included only a limited amount of contextual information about the software system that each question asked about. These questions were meant to get the study participants thinking about what they needed to consider in making realistic judgments under uncertain conditions, while recognizing the need for more contextual information to make well-informed judgments.
2 We crafted the software engineering domain-specific questions and reference points from information available from Ohloh (www.ohloh.net/). Ohloh is a directory that provides links to many project source code repositories and provides “factoid” metrics for thousands of open source projects.
We provided the study participants with additional information about the questions in the second and third domain-specific test batteries. We also introduced at that time the use of the reference points to provide comparable information about similar software systems. Examples from the second battery are shown in Table 3. The contextual information for these batteries and reference points was limited to other factors, knowledge of which might help the study participants answer the questions. The questions asked about the projects in the second and third domain-specific batteries also included the same ones asked in the first battery: (1) “What is the project’s current codebase size in LOC?” and (2) “How much total effort in person years has been spent on this project?” The full question sets for the second and third domain-specific test batteries are shown in Appendix A. The reference points can be found in Appendix B. The format differs, but the reference points contain the same kinds of contextual information used in domain-specific test batteries 2 and 3.3
Table 3: Example Questions from Domain-Specific Test Battery 2

Context: Mercurial is a fast, lightweight Source Control Management system designed for efficient handling of very large distributed projects.
------
Over the past twelve months, 130 developers contributed new code. This is one of the largest open-source teams in the world, and is in the top 2% of all project teams in our database. Over the entire history of the project, 458 developers have contributed. The first lines of source code were added in 2005.
------
LOC = 152,551 | 14% comment to code ratio | 39 person years of effort
Question: What percentage of the code is written in the product’s major language (Perl)?

Context: Google Chrome: The open-source project behind Google Chrome (Chromium) builds on components from other open source software projects, including WebKit and Mozilla: It is aimed at improving stability, speed and security with a simple and efficient user interface.
------
Established codebase: The first lines of source code were added in 2008. The project has seen a substantial increase in activity over the last twelve months.
------
C++ = 39%; C = 33%; XML = 8%; HTML = 6%; Other = 14%
LOC = 5,535,674 | 1683 person years of effort
Question: What is the ratio (%) of comments to LOC in the current codebase?

Context: Mozilla Calendar project develops Mozilla Sunbird (a stand-alone calendar application) and Lightning, a calendaring extension for Mozilla Thunderbird. Their goal is to bring Mozilla-style ease-of-use to your calendar, without tying you to a particular storage solution.
------
Over the past twelve months, 157 developers contributed new code to Mozilla Calendar. This is one of the largest open-source teams in the world, and is in the top 2% of all project teams in our database. Over the entire history of the project, 495 developers have contributed. The first lines of source code were added in .
------
C++ = 32%; JavaScript = 29%; XML = 15%; C = 7%; CSS = 7%; Java = 5%; Other = 5%
LOC = 927,266 | 32% comment to code ratio | 253 person years of effort
Question: In what year were the first lines of source code added?
3 The correct answers to the questions that the study participants had not yet answered remained hidden from view in the reference points as well as the questions.
3 Experimental Results
As shown in Table 4, a total of 36 individuals participated in the study during FY 2012. The first of three separate groups consisted of Carnegie Mellon University software engineering graduate students along with a few members of the SEI technical staff. The second group consisted of members of a master class led by Ricardo Valerdi and Dave Zubrow in conjunction with the Improving Systems and Software Engineering Conference (ISSEC) held in Melbourne, Australia in August 2012. The third group consisted of Carnegie Mellon University graduate students from the Heinz College along with two additional software engineering graduate students. All of the graduate students had previous industrial experience. We kept the three groups small to encourage active learning and class discussion.
Table 4: Number of Study Participants

Venue | Total Test Cases | Domain-specific | Generic
1: Carnegie Mellon Graduate Students and Software Engineering Institute Technical Staff Members | 21 | 14 | 14
2: Australian Master Class Participants | 8 | 8 | 0
3: Carnegie Mellon Graduate Students | 7 | 7 | 0
Totals | 36 | 29 | 14
A total of 29 individuals from all three groups completed three batteries of software engineering domain-specific test questions. A total of 14 participants from the first group also completed four batteries of generic knowledge questions that often are used for training meant to calibrate recognition of uncertainty.
As noted in Section 2 our domain-specific questions included much more contextual information than the generic knowledge questions. We also provided reference points to give the study participants additional information about software systems similar to the ones in the questions.
The results for both sets of questions showed improvement across the test batteries in the participants' recognition of their true uncertainty, consistent with studies in other domains. The domain-specific training was accompanied by notable improvements in the relative accuracy of the participants’ answers when we introduced the additional contextual information to the questions and the reference points about similar software systems.
3.1 Calibrating Judgment
3.1.1 Characterizing Uncertainty
A simple summation of the number of times that the correct answer to a calibration test question falls within the upper and lower bounds specified by the study participants is commonly used to measure calibration of expert judgment. Such a measure characterizes the idea of recognizing people’s uncertainty reasonably well. In fact, faculty with whom we have collaborated in these studies who teach software engineering graduate courses in cost estimation have used their students’ uncertainty in answering questions similar to our domain-specific software engineering questions (but without the additional contextual information) as a teaching moment to get the students thinking about what else they need to know to inform their technical skills and manage their time and attendant risks. Our results are consistent with prior research using the same measure that we call “accuracy-within-bounds.” That is true particularly for our domain-specific questions.
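To make the accuracy-within-bounds measure concrete, the following minimal sketch computes it for one participant's answers on a single test battery. The function name and the (low, high, correct) triples are illustrative only and are not drawn from the study data.

def accuracy_within_bounds(answers):
    """Percentage of questions whose correct answer falls inside the
    participant's stated [low, high] 90-percent confidence interval."""
    hits = sum(1 for low, high, correct in answers if low <= correct <= high)
    return 100.0 * hits / len(answers)

# Arbitrary example data for a hypothetical three-question battery
one_battery = [
    (100, 900, 450),    # correct answer inside the bounds -> counted
    (10, 40, 75),       # correct answer outside the bounds -> not counted
    (2000, 4000, 3500), # inside -> counted
]
print(accuracy_within_bounds(one_battery))  # about 66.7 for this toy battery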
We used box plots to summarize the distributions of the study participants’ scores on this measure and two other derived measures. Box plots as originally envisaged by Tukey [21] make no assumption of statistical normality. They are simply based on the distribution of the data by percentiles. As shown in Figure 1, the box runs from the first through the third quartile (25th and 75th percentiles) of the entire data distribution. The distance between the two ends of the box is called the interquartile range; it contains half of the observations (study participants in this report). The whiskers, which may exist both above and below the box, extend to the outermost data points within another 1½ times the interquartile range. Asterisks above or below the whiskers are classified as outliers (i.e., cases that are unusually large or small).
Figure 1: Interpreting Box Plots
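For readers who prefer a computational description, the sketch below derives the same box plot elements (quartiles, interquartile range, whisker ends, and outliers) from a list of scores. It assumes NumPy is available; the percentile interpolation may differ slightly from Tukey's original hinge rules.

import numpy as np

def box_plot_summary(scores):
    """Quartiles, whisker ends, and outliers per the Tukey-style rules
    described above; no assumption of statistical normality."""
    q1, median, q3 = np.percentile(scores, [25, 50, 75])
    iqr = q3 - q1                                   # interquartile range
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside),   # outermost points within 1.5 x IQR
        "whisker_high": max(inside),
        "outliers": [s for s in scores if s < low_fence or s > high_fence],
    }

# Arbitrary accuracy-within-bounds scores (percent) for one test battery;
# the value 160 is deliberately extreme to show how an outlier is flagged.
print(box_plot_summary([10, 20, 30, 40, 40, 50, 60, 70, 160]))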
As shown in Figure 2, the median proportion of study participants whose upper and lower bounds included the correct answers to the generic questions rose from 40 percent in the first test battery to 80 percent in the fourth battery (p < .0002).4 Notably, the participants’ median proportions rose from 10 percent to 70 percent over the course of only three domain-specific test batteries (p < .00001).
4 All of the significance tests for the box plots throughout this report are based on the Mann-Whitney U-test. The U-test can be used to determine the probability that the medians of two distributions are significantly different. However, with larger samples, it also can detect important differences in the shape and spread of the distributions. A succinct description of the U-test can be found in The SAGE Encyclopedia of Social Science Research Methods [1].
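As an illustration of the test described in the footnote above, the snippet below compares two hypothetical distributions of accuracy-within-bounds scores using SciPy's implementation of the Mann-Whitney U-test. The score values are invented for the example and are not the study data.

from scipy.stats import mannwhitneyu

# Hypothetical accuracy-within-bounds scores (percent) for the first and
# third domain-specific batteries; values are illustrative only.
battery_1_scores = [0, 10, 10, 20, 20, 30, 40]
battery_3_scores = [50, 60, 70, 70, 80, 80, 90]

# Two-sided nonparametric test of whether the two distributions differ
u_statistic, p_value = mannwhitneyu(battery_1_scores, battery_3_scores,
                                    alternative="two-sided")
print(u_statistic, p_value)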
The slight dip in generic test battery 3 is statistically insignificant. As shown in Figure 4 on page 10, however, the inconsistency for a different measure of accuracy in test battery 3 is much larger. That is because the measure of accuracy-within-bounds confounds accuracy with precision.
[Figure: box plots titled “Percent Accurate Within Bounds”; y-axis “% Within Low and High Bounds” (0–100); x-axis shows generic test batteries 1–4 and domain-specific test batteries 1–3.]
Figure 2: Accuracy-Within-Bounds by Test Battery
While the measure of accuracy-within-bounds is a reasonable way to characterize the participants’ recognition of uncertainty, it is an imperfect measure of accuracy in making judgments under uncertain conditions. Participants in calibration training exercises sometimes improve their chances of being accurate by unrealistically expanding the width of their confidence bounds to recognize their true uncertainty. However, it is not enough to know how often study participants or true domain experts are able to establish confidence bounds that contain the correct answers on calibration tests. It is equally important to achieve sufficient confidence in those answers.
Recognizing one’s uncertainty is a major lesson of calibration training. Students are encouraged to think about what else they already know that can inform their judgments beyond what is stated explicitly in the test questions. Other unstated factors can be equally or more important. However, narrowing the distance between one’s lower and upper confidence limits is also a major goal of calibration training. That is why we included reference points in our domain-specific training exercises.
Such an approach is even more important in real-world practice situations involving requirements analysis, portfolio management, performance modeling, risk analysis, program management, and
many other competences in addition to cost estimation. That is why we are building a working prototype repository of DoD domain-specific information in FY 2013.5
3.1.2 Our Focus on Precision and Accuracy
Both precision and relative accuracy are extremely important. Experts need to narrow their bounds of uncertainty and consider how close their best estimates can come to what proves to be accurate over the passage of time. Yet this is not emphasized widely in the existing literature on calibration of expert judgment. As illustrated in Figure 3, for example, having a more precise expert judgment that is closer to being correct is preferable even if the bounds around the expert’s best judgment do not include the correct answer.
Figure 3: Balancing Accuracy and Precision
We used two independently defined derived measures to characterize the study participants’ relative accuracy and precision.6 Our measure of relative accuracy simply takes the absolute value of the distance between the participant’s best judgment and the accurate score for each test question. It then divides that value by the accurate score for the question to normalize for differences between the units of measure among the questions in the full test battery. The derived measure for each participant is the median score over all of his or her answers in that test battery.
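The relative accuracy measure just described can be expressed in a few lines of code. The sketch below follows that definition (absolute error normalized by the correct answer, then the median across a battery); the function and variable names are illustrative rather than part of any study tooling.

import statistics

def relative_accuracy(best_judgments, correct_answers):
    """Median of |best judgment - correct| / correct across a test battery,
    as described above; smaller values indicate more accurate judgments."""
    per_question = [abs(best - correct) / correct
                    for best, correct in zip(best_judgments, correct_answers)]
    return statistics.median(per_question)

# Arbitrary example: one participant's best judgments vs. the correct answers
print(relative_accuracy([80_000, 12, 30], [100_000, 20, 25]))  # 0.2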