This project has received funding from the European Union’s Preparatory Action for Defence Research - PADR programme under grant agreement No 800893 [PYTHIA]

D2.1 – Methods for measuring the correctness of a prediction

WP number and title WP2 – The cognitive factor in foresight

Lead Beneficiary HAWK

Contributor(s) Z&P, ENG, BDI, ICSA

Deliverable type Report

Planned delivery date 31/05/2018

Last Update 25/09/2018

Dissemination level PU

PYTHIA Project
PADR-STF-01-2017 – Preparatory Action on Defence Research
Grant Agreement n°: 800893
Start date of project: 1 February 2018
Duration: 18 months


Disclaimer

This document contains material, which is the copyright of certain PYTHIA contractors, and may not be reproduced or copied without permission. All PYTHIA consortium partners have agreed to the full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information. The PYTHIA Consortium consists of the following partners:

1. Engineering Ingegneria Informatica S.p.A. (ENG), Italy
2. Zanasi & Partners (Z&P), Italy
3. Expert System France (ESF), France
4. Hawk Associates Ltd (HAWK), UK
5. Military University of Technology (WAT), Poland
6. Bulgarian Defence Institute (BDI), Bulgaria
7. Fondazione ICSA (ICSA), Italy
8. Carol I National Defense University (NDU), Romania


Document History

VERSION | DATE | STATUS | AUTHORS, REVIEWER | DESCRIPTION
V0.1 | 01/05/2018 | Draft | Hawk | First draft
V0.2 | 25/05/2018 | Draft | Hawk, Z&P | Second draft
V0.3 | 31/05/2018 | Complete Draft | Hawk, Z&P, ICSA | Version ready for peer review
V0.4 | 05/06/2018 | Final Draft | Hawk, Z&P, ENG, BDI, ICSA | Final draft pre-submission
V0.5 | 12/06/2018 | Final Updated Version | Hawk, Z&P, ENG, BDI, ICSA | Final updated version
V0.6 | 12/06/2018 | Final | Hawk, Z&P, ENG, BDI, ICSA | Final submitted version
V0.7 | 19/09/2018 | Final | Hawk, Z&P, ENG, BDI, ICSA | Final version post-EDA comments


Definitions, Acronyms and Abbreviations

ACRONYMS / ABBREVIATIONS – DESCRIPTION

CRPS Continuous Ranked Probability Score

RMSE Root Mean Square Error

PEs Percentage Errors

CSI Critical Success Index

ETS Equitable Threat Score

FAR False Alarm Ratio

GMRAE Geometric Mean of the RAE

MAE Mean Absolute Error

MAD Mean Absolute Deviation

MAPE Mean Absolute Percentage Error

MAPD Mean Absolute Percentage Deviation

RPSS Ranked Probabilistic Skill Score

ROC Receiver Operating Characteristics

UAPE Unbiased Absolute Percentage Error


Table of Contents

Disclaimer
Table of Contents
Executive Summary
1 Introduction
2 Terminology
3 Context
4 Types of Forecast
4.1 Quantitative versus qualitative methodologies
4.2 Exploratory versus normative methodologies
4.3 Forecast versus foresight methodologies
5 Literature Review
5.1 Informal Methods for Measuring Forecast Accuracy
5.1.1 Objective Performance Feedback
5.1.2 Peer Review and Critical Review Process
5.1.3 Training Approaches
5.1.4 Inferred Probabilities and Blind Retrospective Assessments
5.2 Formal Methods for Measuring Forecast Accuracy
5.2.1 Proper Scoring Rules
5.2.2 Brier scoring rule
5.2.3 Logarithmic scoring rule
5.2.4 Spherical scoring rule
5.2.5 Continuous Ranked Probability Score (CRPS)
5.2.6 Loss function
5.2.7 Root Mean Square Error (RMSE)
5.2.8 Measures based on percentage errors (PEs)
5.2.9 Binary Logistic Models
5.2.10 Critical Success Index (CSI)
5.2.11 Equitable Threat Score (ETS)
5.2.12 False Alarm Ratio (FAR)
5.2.13 Forecast Skill Score (SS)
5.2.14 Geometric Mean of the RAE (GMRAE)
5.2.15 MAE (Mean Absolute Error) / Mean Absolute Deviation (MAD)
5.2.16 MAPE (Mean Absolute Percentage Error) / MAPD (Mean Absolute Percentage Deviation)
5.2.17 Ranked Probabilistic Skill Score (RPSS)
5.2.18 Receiver Operating Characteristics (ROC)
5.2.19 The ‘Goodness Score’, discernment factor, base algorithm
5.2.20 Unbiased Absolute Percentage Error or UAPE
6 Conclusions
7 References


Executive Summary

This document reviews the ‘family’ of approaches to measuring forecast accuracy in order to make recommendations as to the most appropriate choices for PYTHIA. The key points are summarised below:

• The importance of measuring the accuracy of a forecast is that it helps determine whether one forecasting method or system is better than another. In other words, if we do not measure the correctness of our predictions we cannot expect to improve their results

• This is true both for individual forecasters and for comparing results from different forecasting groups. The evaluation of past forecast performance becomes a key factor in improving future forecasts

• When measuring forecast accuracy, we must define the type of forecast, its purpose or aim, and set specific criteria for the evaluation metrics. Standard terms define the data sets, the period over which the data are taken and the forecast horizon

• These evaluation or assessment metrics (the measures used to gauge accuracy) can be broadly categorised into formal or informal methods. The formal approach uses statistical analysis and the informal approach is empirical and uses a critical review process (e.g. expert appraisals and analytical judgement)

• Both approaches carry flaws: the formal approach can be misleading, depending on data size and type of forecast; the informal approach is inherently full of cognitive biases and individualistic interpretation

• Having access to more information does not lead to greater accuracy of forecasts. On the contrary, and perhaps counter-intuitively, more information leads to greater overconfidence in judgements and in their accuracy

• People (including analysts) have an over-inflated opinion of their ability to correctly assess the accuracy of their forecasts. When we make self-assessments of performance we have a tendency to see ourselves as ‘better than average’, which is problematic for self-assessment. We are also incapable of escaping the hindsight memory bias: a tendency to incorrectly recall having made accurate forecasts. Together with confirmation bias and the fallacy of expert opinion, this makes accurately assessing the correctness of a prediction inherently difficult

• The formal approach is a rigorous, scientific methodology. Forecasting accuracy can be measured rigorously but there is some reticence about doing this (Lehner, 2010)


• The informal approach is based primarily on a critical review and lessons-learned approach. Whilst this has its merits and is often the approach of choice for analysts, it may be limited in providing inputs for improvements in forecasting (Lehner, 2010)

• Forecasting is not an exact science and there are many layers at which we must interpret data and results. These can make understanding the accuracy of forecasting complicated.

• In simple terms the accuracy of a forecast can be denoted as the actual value minus the forecast value over a given time period. This is called the forecast error. We can also use forecast bias as a measure of accuracy and some of the more common metrics obtain a numerical figure for forecast error. The greater the error the less accurate the forecast and vice-versa. Discernment is also a factor to take into consideration when assessing the accuracy of any prediction.

• Adoption of any technique should be made simple, and a common understanding as to how to talk about probability should be a priority

• Much scepticism exists in relation to measuring the accuracy of forecasts, especially from those in the Intelligence Community. Evidence from other sectors shows that the accuracy of forecasts can be improved through better approaches and methods for measuring it


1 Introduction

Today’s world is increasingly unpredictable, but we have an increasing amount of information available to help us navigate these uncertainties. One’s ability to predict what might happen in the future is an important and fundamental skill that humans possess. It defines us as a species and we have forever dreamt of an ability to see into that future, the famous crystal ball, in order to reduce uncertainty and prepare ourselves for what lies ahead.

Without the mystique of that crystal ball, we are left, rather bluntly, with data, information and knowledge. Our own minds make judgements about what is most likely to take place in the future. These are predictions about future events. They are based on what has gone on in the past, what is going on in the present and – through memory, information processing, and judgement based on analysis – what might occur in the future. We strive to make better decisions based on these assumptions or predictions and as forecasters we assign probabilities as to the likelihood of their occurrence.

But if we want to measure the correctness of these predictions, how can we do that? This is the work that will be investigated here as part of the research within WP2 for D2.1. How do we appraise or judge whether something that was said in the past is correct in the future? What methods exist for measuring this? And what metrics are used for the measurements? Why should we do it at all? How can it change our approach to making better predictions?

Measuring the accuracy of a prediction is not an easy task. Not only do we need sufficient data to do so, but we also need sufficient willingness from the forecaster to actually go back and check where they may have gone wrong. There are many types of forecast that exist and not all methods for measuring forecast accuracy can be applied. In other words, depending on what we are predicting, this may need to be measured in different ways. The simplest and most intuitive division of the type of method for measuring the correctness of a prediction is by formal (statistical) and informal (judgemental) approaches. Can we apply a mathematical formula to check the error of the forecast? Or do we appraise the forecast using analysis and reasoning? The latter of these can still be formalised (for example, post-forecast reviews can be grouped together into a score) but the distinction remains clear. The former requires raw data (and specifically numbers) so we can express errors using a formula-based approach.

In any case, when we talk about the correctness of a prediction we are talking about how accurate our forecast is (or was). It is good forecasting practice to assign a probability to a forecast. This highlights the all-important question: how likely is it that something is going to happen. It is with this likelihood that we can make judgements about the potential outcomes of particular events and begin to reduce the uncertainty in our decision making process, which ultimately is the goal.


2 Terminology

It is important we include a note here on terminology for the purposes of clear communication. As defined by Lehner, we ‘use the word forecast to signify any statement about a future occurrence, whether a discrete event (e.g. “The peace process will likely break down.”) or a quantity (e.g. “GDP will probably increase more than 3% next year.”). Judgment-based forecasts are forecasts made by experts; they are sometimes called “estimates” or “judgment calls.” Probabilistic forecasts are generally defined as forecast statements expressed with a degree of certainty (e.g. “There is a good chance the recession will end in the next quarter”), where the degree of certainty is often stated quantitatively (e.g. “There is a 70% chance that the recession will end in the next quarter.”). Furthermore, we use the word accuracy to refer to both whether a forecasted event occurred and whether the forecast was expressed with an appropriate degree of certainty’ (Lehner, 2010).

So, a forecast is a prediction about the probability of a future event based on what we know now. This is a facet of doing foresight work. Foresight specifically looks at mapping or predicting different futures where the context of a prediction may change.

Prediction is a word that we use somewhat interchangeably with forecast. All forecasts are predictions but not all predictions are forecasts. The word prediction is used here in a more general sense.

Probability is the extent to which something is likely to happen. We assign probability to judge the likelihood of a future event occurring. This is different from a probabilistic model, which is meant to give a distribution of possible outcomes, rather than a deterministic model, which is meant to give a single solution.

Discernment is also a term used within the context of this research. It means the ability to judge well, but as we will see there are factors related to discernment to take into account when looking at forecast accuracy.


3 Context

It is important that we give some context to this research, in order to place it within the overall picture of WP2. We will define here what we are trying to achieve and the areas we are not covering.

Different forecasting methodologies, together with errors made during the forecasting process (incorrect processing of data, incorrect assumptions about the reliability of the available information, judgement distorted by misconceptions and cognitive biases, etc.), mean that even when forecasters use the same information, their predictions will differ.

WP2 looks at the inputs (human factors) for a model for improved forecasting and foresight, notably the behaviour of a good forecaster, what cognitive factors are involved and how this can be rendered understandable in a practical situation (for the good of the project). Various steps cover the research work, and this document is concerned with the first:

• To review methods commonly used for assessing whether a prediction is correct or not and devise a new one specific for the PYTHIA methodology;

• To review the most notable failures in the history of technology forecasting, attempting to understand the reasons behind those failures, with a particular focus on the analysis of cognitive factor-related aspects that impact the quality of predictions;

• To study the strategies adopted by successful forecasters and identify their main characteristics;

• To elaborate a set of recommendations for improving the accuracy of technology foresight.

For D2.1 the task was ‘to conduct a literature review of the different techniques used to measure the correctness or accuracy of a prediction and provide a method that can be used and tested within the PYTHIA foresight methodology.’ By doing this, we aim for a detailed understanding of the current methods (for measuring the accuracy of a prediction) and how these could be used and applied for PYTHIA based on a tailored approach.

We are less concerned here with what makes a good forecaster, rather with what makes a good forecast. The human factors will be investigated later; here we concentrate primarily on the descriptions of methods for measuring the correctness of a prediction. We assume that a good forecast is an accurate forecast. So it is very important to be able to assess how accurate a forecast is, otherwise we cannot tell whether it is useful. This element of research into measuring forecast accuracy is therefore crucial for PYTHIA.

Our original premise was that expert forecasters are well versed in the habit of assigning probability distributions to their predictions. Based on these percentages or figures, we assumed that it was difficult for observers to assess, after the event, whether a certain prediction was correct or not and to what extent. A good example to illustrate this point is the following: what if an event was foreseen as “60% likely to happen” and did not happen? Was the prediction wrong or not?


This research covers these aspects by reviewing existing literature on the various methods for measuring the correctness of a prediction. Our aim is then to provide a (combined) best-fit method for determining the correctness of a prediction, against which the accuracy of the foresight methodology proposed in PYTHIA could be tested. So, this research will look at how we can better judge what a good forecast is (using the techniques for measuring accuracy) by reviewing the methods commonly used. Recommendations will be proposed as to the method to be used in PYTHIA.

The activities carried out in this work package aim at understanding the role played by the cognitive factor in technology forecasting, both as a source of error and, conversely, as something which contributes to making correct predictions. On the basis of this analysis, recommendations will be produced for forecasters on how to improve their forecasting strategies and, at the same time, to make sure that human factors do not negatively impact the quality of their predictions.


4 Types of Forecast

Whilst it is not the intention of this document to present the different methods of forecasting, it is inevitable that we must discuss this, as forecasting accuracy is dependent on the forecasting method. In other words, different forecasting approaches can lead to different levels of forecasting accuracy. The question for PYTHIA is what type of forecast we need to use to make predictions about future events. This will be settled later in the project and has been addressed in Deliverable 3.1. For now, we will content ourselves with a literature review of the available methods for measuring the correctness of a prediction. Before this, it is important to understand the types of forecast that are used.

We are not looking here at what qualities make a good forecaster; we are looking at how we can make a judgement about how accurate a forecast is. Why? Ultimately, this work is important as it will allow us to evaluate different forecast methods by assessing their accuracy. And conversely, the type of forecast used may determine its accuracy. This is an important feature and why a section on different types of forecast is necessary.

Forecasts can benefit us when we begin to understand the different types of forecasting methods, ‘recognize what a particular forecasting method type can and cannot do, and know what forecast type is best suited to a particular need’ (Nordmeyer, 2018). Business operations understand and appreciate this, and the classifications used by Nordmeyer are highly relevant for understanding which different methods of measuring the accuracy of a prediction can be used within a particular context. Readers are strongly advised to also review part 1.3 (General Features of Methodologies) of Deliverable 3.1, and for convenience we have included those classifications here:

4.1 Quantitative versus qualitative methodologies

Quantitative methodologies represent reality in a numeric form. They usually use mathematical models to study the development of variables that depict the foreseen future. Quantitative approaches often involve experts assigning numerical values to these variables or creating such values on the basis of the numbers of people agreeing with particular statements, usually in the form of simple surveys or questionnaires. Many statistics are generated by means of the collected data and many statistical tools are employed to determine the relationships that can be found between variables.

Advantages of using quantitative methodologies are:

• Possibility to manipulate the information in consistent and reproducible ways, allowing greater precision;
• Possibility to compare data from different sources;
• Results can be represented in the form of tables, graphs and charts, which can communicate very efficiently with people under severe time-shortage and information-overload.

On the other hand, disadvantages are:

• Not all factors can be quantified numerically, such as many important social and political variables;
• Gathering quality data can be difficult or excessively expensive or time-consuming;
• Audiences can be uncomfortable with reading statistical information;
• Models could become very complex and excessively formalised, decreasing the level of involvement of the participants.


Qualitative methodologies are employed where the developments are hard to study using numerical indicators or where numerical data are not available. They are usually methods that stimulate creative thinking and collaboration between the subjects involved. Qualitative methodologies are radically different from the quantitative ones, but they can be used to balance their flaws or to support them. For example, qualitative methodologies can be used to understand the meaning of the numbers produced by quantitative methodologies. Nonetheless, qualitative methodologies still remain less documented than quantitative ones and it can be hard to define their good practice and use cases.

4.2 Exploratory versus normative methodologies

Exploratory methodologies make a projection from the present into the future, in order to foresee events and the evolution of trends. They usually predict the future on the basis of extrapolating past trends. Among these kinds of tools are trend, impact and cross-impact analyses and the Delphi approach.

Normative methodologies work back from the future to the present, trying to understand what trends and events will lead there. They move from the future to the past, considering technologies and resources as constraints to overcome in order to reach the fixed goal. Some examples of normative methodologies are backcasting, relevance trees and some uses of Delphi, known as "goals Delphi".

Both kinds of methodology are valuable and there is no golden rule for choosing one or the other. Usually, normative approaches are more effective when a highly desired future exists and has to be reached. In these cases, normative approaches can be powerful tools to progress towards the shared goal in the best way possible. In every other case, exploratory methodologies should be preferred, especially when there is no shared consensus about what the most desirable future is.

4.3 Forecast versus foresight methodologies

Forecast methodologies predict the most probable future, regardless of whether this future is desired or not, focusing on a single possible scenario. These tools are also known as “predictive methodologies”. Among them, there are SWOT analysis, Backcasting and MACTOR. Foresight methodologies predict different, possible, alternative futures. They are usually exploratory but this is not necessarily the case if alternative ways of reaching a desirable future are considered (normative methodologies). In this case, the term “open methodology” is used too. Some examples of foresight methodologies are Gaming, Expert Panels and Scenario Building.


5 Literature Review

The following sections split the review into two parts: informal and formal methods for measuring the accuracy of a forecast. More work should be done throughout the project on the informal approaches, as these may provide, together with the formal methods, a form of intuitive convergence of techniques. Indeed, a simple training approach that introduced the formal methods would be a useful first step. The research has been conducted using open sources of information and desk-based research. All references are given for each part.

5.1 Informal Methods for Measuring Forecast Accuracy

In 2014, David R. Mandel and Alan Barnes completed a very interesting study into the accuracy of 1,514 strategic intelligence forecasts (Mandel & Barnes, 2014). It is interesting to note here, within the section on informal methods for measuring forecast accuracy, some of the main reasons accuracy assessments are not routinely undertaken.

These relate to several specific points, which in turn are affected by two key perennial organisational and individual problems: lack of planning and subsequent lack of devoted resources to the task; lack of knowledge or understanding regarding their importance. If an organisation does not enforce comprehensive planning phases to analysis or forecast tasks, with the specific mention of ‘forecast accuracy assessments / measurements’, then, especially if the individual is not aware of their importance, these tasks will not be performed.

Despite good reasons to proactively track the accuracy of intelligence forecasts, intelligence organizations seldom keep an objective scorecard of forecasting accuracy (Betts, 2007). There are many reasons why they do not do so. First, analysts seldom use numeric probabilities, which lend themselves to quantitative analyses of accuracy (Horowitz & Tetlock, 2012). Many analysts, including Kent’s ‘poets’ (Kent, 1964), are resistant to the prospect of doing so (Weiss, 2008). Second, intelligence organizations do not routinely track the outcomes of forecasted events, which are needed for objective scorekeeping. Third, only recently have behavioural scientists offered clear guidance to the intelligence community on how to measure the quality of human judgment in intelligence analysis (Fischhoff & Chauvin, 2011; Derbentseva et al., 2011; National Research Council). Finally, there may be some apprehension within the community regarding what a scorecard might reveal.

It is important to note that whilst the quality of human judgment can be measured, how it is measured and why (for what purpose or end or specific improvement), may depend on specific individual or organisational needs. We can envisage an organisation that sets out a requirement for using numeric probabilities and for routinely tracking the outcomes of forecasted events, but this organisation must clearly communicate and explain why this is needed and how it will be achieved within the specific working context of each analyst. Ensuring this would avoid situations where a lack of knowledge or ignorance of the problem occurs (an analyst may not be aware of the advantages of performing such forecast accuracy measurements) or where lack of planning and resources are the main reasons for accuracy assessments not being routinely undertaken (management would need, in this situation, to provide the guidelines and routines to be performed, ensuring compliance through best-practice rules and/or check-lists).

Following on from this, we can list here the main informal methods for measuring forecast accuracy. These are all grouped within the same category and a description is given for each.


5.1.1 Objective Performance Feedback

Objective performance feedback can help reveal forecasters’ judgement characteristics (e.g. over-/under-confidence) longitudinally (Mandel & Barnes, 2014). Performance feedback is ubiquitous in Organizational Behaviour Management (OBM), yet its essential components are still debated. It has been assumed that performance feedback must be accurate, but this assumption has not been well established. Two experiments carried out to investigate feedback accuracy showed that accurate feedback significantly improved performance over the control and inaccurate-feedback groups (Choi et al., 2018).

5.1.2 Peer Review and Critical Review Process

There are several types of peer review. Taking the examples provided by major technical publishers (Elsevier): whilst reviewers play a vital role in academic publishing, their contributions are often hidden. This process can be transposed to the situation of an analyst or person responsible for a forecast or prediction statement. It does not give a metric of accuracy, but rather a way of reviewing the material as part of a feedback loop, which is essential for improving the quality of the intelligence product.

The description offered by Elsevier continues with single blind review, a common practice that can be adopted to review the outputs of forecasts as if they were outputs of written text / publications. The names of the reviewers are hidden from the author / producer of the forecast. This is the traditional method of reviewing and is by far the most common type. Reviewer anonymity allows for impartial decisions – the reviewers will not be influenced by the authors – although reviewers may use their anonymity as justification for being unnecessarily critical or harsh when commenting on the authors’ work.

In double-blind review both the reviewer and the author are anonymous. Author anonymity prevents any reviewer bias, for example based on an author's country of origin or previous controversial work.

Open review, where the reviewer and author are known to each other, could also be used. Some believe this is the best way to prevent malicious comments, stop plagiarism, prevent reviewers from following their own agenda, and encourage open, honest reviewing. Others see open review as a less honest process, in which politeness or fear of retribution may cause a reviewer to withhold or tone down criticism. A more transparent peer review may be part of a wider organisational Critical Review Process.

5.1.3 Training Approaches

Knowing the importance of training as part of improving internal processes – and, as we saw earlier, we identify forecasting as a process which can be improved – there are several training techniques that are specific to understanding the importance of evaluating forecasting accuracy. Again, as an informal method, the client would need to define the training parameters and outcomes, and set objectives and agendas (for example a training session specific to cognitive biases and how they impact forecasts). These relate to overall awareness and improvement programmes that should be run internally. Part of the objective of these training sessions would be to identify the parameters by which one could begin to make individual and group assessments about the accuracy of forecasts. Training data and experimental data could be provided, as well as notable case studies. Structured Analytical Techniques could also be used to discuss and define the measurement parameters.


5.1.4 Inferred Probabilities and Blind Retrospective Assessments

The ‘blind’ part suggests that some information is kept masked from the assessor/participant to eliminate any possible bias. The assessment is retrospective, meaning that it looks back at past events. The approach taken in the study of the accuracy of intelligence products (Lehner, 2010) is founded on two basic ideas - Inferred Probabilities and Blind Retrospective Assessments - in order to establish ground truth. We see this as having some genuine relevance to the informal methods. As part of the study, we are asked to ‘consider the following forecast statement from the declassified key judgments in the 2007 National Intelligence Estimate (NIE) on Prospects for Iraq Stability: “... the involvement of these outside actors is not likely to be a major driver of violence ...” The forecast event is “The involvement of outside actors will not be a major driver of violence in Iraq in the January 2007 to July 2009 timeframe.” In the first step, five different reviewers read the NIE and on the basis of what was written inferred probabilities of 80%, 85%, 75%, 85%, and 70% for the forecast event. As occurred in this case, if multiple reviewers infer similar probabilities, then the average of those inferred probabilities is a fair representation of how intelligence consumers would consistently interpret the product. On the other hand, if multiple reviewers infer very divergent probabilities then that divergence is measurable evidence that the forecast statement was largely meaningless.’ This is an area that will require further study, and during PYTHIA we hope to experiment with trial data to test the applicability of these methods to our project.

5.2 Formal Methods for Measuring Forecast Accuracy

5.2.1 Proper Scoring Rules

Scoring rules are used to measure the accuracy of predictions that assign probabilities to a set of mutually exclusive discrete outcomes (e.g. a weather forecast). Scoring rules assign a numerical score based on the predictive distribution (i.e. a series of forecast events) and on the events or values that actually materialise. If the forecaster quotes the predictive distribution P and the event x materialises, the scoring rule is S(P,x). The function S(P, ·) (i.e. the scoring rule before the actual event happens) takes values in the extended real line [-∞, +∞] (Gneiting & Raftery, 2007). Specific scoring rules may take values in a subset of this range.

The orientation of a scoring rule can be positive or negative. S(P,x) is positively oriented if, for two different forecasts P’ and P’’, S(P’,x) > S(P’’,x) means that P’ is a more accurate probabilistic forecast than P’’. In other words, positively oriented scores have to be high, while negatively oriented ones have to be low, for the forecast to be good.

There exists a variety of metrics to evaluate forecasting activities and, typically, the metrics most used by forecasters are those that are not vulnerable to manipulation. Indeed, if a forecaster can improve his scores by maliciously modifying the forecasts in light of the chosen metric, then it is impossible to know when the forecaster is being truthful and when he is capitalising on the metric (Merkle & Steyvers, 2013). For this reason, proper scoring rules are usually used in forecast evaluation, that is, scoring rules for which the highest expected reward is obtained by reporting the believed probability distribution. Intuitively, it is desirable to make “cheating” difficult: if a forecaster genuinely believes in P, he should have no incentive to report any deviation from P in order to achieve a better score. The use of a proper scoring rule encourages the forecaster to be honest in order to maximise the score (Constantinou & Fenton, 2012). In the following subchapters, the main proper scoring rules are described.


5.2.2 Brier scoring rule

The Brier score was historically the first scoring rule (Brier, 1950) and it was created as a means of verifying weather forecasts. It is one of the metrics that have negative orientation, meaning that the lower the Brier score is for a set of predictions, the more accurate the predictions are. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but is inappropriate for ordinal variables which can take on three or more values.

When the output of the forecasting activity is binary (e.g. “snow” or “no snow”), the Brier score can be written as:

BS = (1/N) * Σ_{t=1..N} (f_t - o_t)^2

where f_t is the probability that was forecast, o_t is the actual outcome of the event at instance t (1 if the event occurred, 0 otherwise) and N is the number of forecasting instances. This binary form is only proper for binary events.

If a multi-category forecast is to be evaluated (e.g. “cold”, “warm”, “hot”), then the following definition should be used:

BS = (1/N) * Σ_{t=1..N} Σ_{i=1..R} (f_ti - o_ti)^2

where R is the number of possible classes in which the event can fall and N is the overall number of instances over all classes. This last formulation was the original one proposed by Brier and it is a proper scoring rule both for binary and multi-category forecasts.

The Brier score’s flaws concern very rare (or very frequent) events. A forecaster who obtains a 0.4 Brier score in predicting the weather for a European capital in spring could be considered definitely “better” than someone who obtains a 0.5 score in predicting the weather across the Sahara desert. A forecaster who correctly foresees an extremely rare event should be rewarded much more than one who foresees an easier event (Tetlock, 2016).
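To make the two formulations above concrete, the following minimal Python sketch (not part of the deliverable; the function names and example values are ours) computes the binary and multi-category Brier scores.

```python
def brier_binary(forecasts, outcomes):
    """Binary Brier score: mean squared difference between forecast
    probabilities f_t and observed outcomes o_t (1 = event occurred, 0 = not)."""
    n = len(forecasts)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

def brier_multicategory(forecasts, outcomes):
    """Original multi-category Brier score: `forecasts` and `outcomes` are lists
    of probability vectors; each outcome vector is 1 for the class that occurred
    and 0 elsewhere."""
    n = len(forecasts)
    return sum(
        sum((f_i - o_i) ** 2 for f_i, o_i in zip(f, o))
        for f, o in zip(forecasts, outcomes)
    ) / n

# Three binary "rain" forecasts and what actually happened.
print(brier_binary([0.7, 0.2, 0.9], [1, 0, 1]))  # ~0.047: low score, accurate
print(brier_binary([0.7, 0.2, 0.9], [0, 1, 0]))  # ~0.647: high score, inaccurate
```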

5.2.3 Logarithmic scoring rule

One of the best-known metrics for estimating forecasting accuracy is the log-probability (Good, 1952). This score has positive orientation (i.e. the forecaster’s goal is to maximise it to achieve a good prediction).

This metric is local: when an outcome i realises, only the predicted value pi is used to compute the score, not the other outcomes. The base of the logarithm does not significantly change the metric, since logarithms in different bases differ by a constant factor (Roughgarden, 2016). Besides, the scoring rule is never positive, and sometimes a shifted version is used (in those cases, the score must be minimised). The logarithmic score can be written as:

Slog (P, i) = log pi

Given an event, the forecaster will assign a probability pi to its actual realisation and a probability 1 - pi to the opposite event. If a forecaster predicts rain with 70% probability and it actually rains, Slog = log(0.7) = -0.155 while, if it does not rain, Slog = log(1 - 0.7) = -0.523. The first score is higher than the second because the first forecast is better than the second.


The logarithmic scoring rule is the most sensitive one: a prediction of 0% on an event that actually occurs results in an infinite penalty. To avoid this, predictions should never be as extreme as probabilities of 0 or 1. For this reason, many forecasters choose the Brier score instead.
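As a small hedged illustration of the base-10 example above (the helper name and values are ours, not from the deliverable):

```python
import math

def log_score(p_event, occurred):
    """Logarithmic score (base 10, as in the rain example above): the log of the
    probability that was assigned to the outcome that actually happened."""
    p = p_event if occurred else 1.0 - p_event
    # Never positive; a probability of exactly 0 would raise an error here,
    # reflecting the infinite penalty discussed above.
    return math.log10(p)

print(log_score(0.7, True))   # ~ -0.155 (it rained)
print(log_score(0.7, False))  # ~ -0.523 (it did not rain)
```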

5.2.4 Spherical scoring rule

Given a set of c events and the vector r of their respective probabilities r1, …, rc, the spherical scoring rule of the ith event can be written as:

S_spherical(r, i) = r_i / |r| = r_i / sqrt(r_1^2 + … + r_c^2)

The spherical scoring rule has positive orientation and takes values between 0 and 1.
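A minimal sketch of the spherical rule as defined above (the function name and probabilities are illustrative assumptions):

```python
import math

def spherical_score(probs, observed_index):
    """Spherical score: probability assigned to the realised outcome divided by
    the Euclidean norm of the whole probability vector (value in [0, 1])."""
    norm = math.sqrt(sum(p * p for p in probs))
    return probs[observed_index] / norm

# Three mutually exclusive outcomes; the second one materialises.
print(spherical_score([0.2, 0.7, 0.1], 1))  # ~0.95, close to the ideal 1.0
```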

5.2.5 Continuous Ranked Probability Score (CRPS)

The continuous ranked probability score (CRPS) is a much-used measure of performance for probabilistic forecasts of a scalar observation. Given a real-valued random variable X, its cumulative distribution function (CDF) is defined as FX(y) = P(X ≤ y), that is the probability that the random variable X takes a value less than or equal to y. The CRPS can be defined as a quadratic measure of the difference between the foreseen cumulative distribution function (CDF) and the empirical CDF of the observation (M. Zamo, 2017).

Given the CDF F_X(y) of a real-valued random variable X and the observed value z, the CRPS between z and F_X(y) can be defined as:

CRPS(F_X, z) = ∫_{-∞}^{+∞} (F_X(y) - Step(y - z))^2 dy

where Step is a step function equal to:
• 1, if the argument is greater than or equal to 0;
• 0, otherwise.
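One possible numerical sketch of this definition approximates the integral on a finite grid for an assumed Gaussian forecast CDF (the forecast distribution, grid bounds and function names are assumptions for illustration only):

```python
import math

def normal_cdf(y, mu, sigma):
    """CDF of a normal forecast distribution, used here as an example F_X."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def crps_numeric(cdf, z, lo=-10.0, hi=10.0, n=20000):
    """Approximate CRPS(F_X, z) = integral of (F_X(y) - Step(y - z))^2 dy."""
    dy = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * dy
        step = 1.0 if y >= z else 0.0
        total += (cdf(y) - step) ** 2 * dy
    return total

forecast_cdf = lambda y: normal_cdf(y, mu=0.0, sigma=1.0)
print(crps_numeric(forecast_cdf, z=0.0))  # ~0.234 for a standard normal forecast
```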

5.2.6 Loss function

A generic loss function maps the values of one or more variables onto a real number, intuitively representing some "cost". Loss functions are usually used in optimisation problems in which the score has to be minimised. When a forecast f(t, h) of a variable Y(t+h) is made at time t for the future time t+h, a loss (or cost) arises if the forecast turns out to be different from the actual value (Lee, 2007). The loss function of the forecast error is:

e(t+h) = Y(t+h) - f(t, h)

Note that, for a binary event, Y(t+h) can only be 1 or 0, depending on whether the event took place or not. By calculating the loss function at every time position, it is possible to map the distance between the forecasts and the actual events.
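The error series e(t+h) can be tracked with a few lines of Python; the squared-error cost used here is just one possible (assumed) loss choice:

```python
def forecast_errors(actuals, forecasts):
    """Forecast errors e(t+h) = Y(t+h) - f(t, h) at each time step."""
    return [y - f for y, f in zip(actuals, forecasts)]

def squared_loss(errors):
    """One common loss choice: the squared error at each step."""
    return [e * e for e in errors]

actuals   = [1, 0, 1, 1]          # whether the event took place at each horizon
forecasts = [0.8, 0.3, 0.4, 0.9]  # probabilities issued at time t
errors = forecast_errors(actuals, forecasts)
print(errors)              # roughly [0.2, -0.3, 0.6, 0.1]
print(squared_loss(errors))
```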


5.2.7 Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in climatology, forecasting, and regression analysis to verify experimental results. (Barnston, 1992) The formula is:

RMSE = sqrt( mean( (f - o)^2 ) )

where:
• f = forecasts (expected values or unknown results);
• o = observed values (known results).

The same formula can be written with the following, slightly different, notation (Barnston, 1992):

RMSE_fo = [ (1/N) Σ_{i=1..N} (f_i - o_i)^2 ]^(1/2)

One can use whichever formula one is most comfortable with, as they perform the same operation. Alternatively, one can find the RMSE by:

1. squaring the residuals;
2. finding the average of the squared residuals;
3. taking the square root of the result.

That said, this can be a lot of calculation, depending on how large the data set is. A shortcut to finding the root mean square error is:

RMSE = sqrt(1 - r^2) * SD_y

where r is the correlation coefficient between forecasts and observations and SD_y is the standard deviation of Y. When standardized observations and forecasts are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).
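The three-step recipe above translates directly into code; this sketch (our own, with example values) computes the RMSE from raw forecasts and observations:

```python
import math

def rmse(forecasts, observations):
    """RMSE: square the residuals, average them, take the square root."""
    residuals = [f - o for f, o in zip(forecasts, observations)]
    mean_sq = sum(r * r for r in residuals) / len(residuals)
    return math.sqrt(mean_sq)

print(rmse([2.5, 0.0, 2.1, 7.8], [3.0, -0.5, 2.0, 7.5]))  # ~0.39
```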

5.2.8 Measures based on percentage errors (PEs)

PEs can be aggregated across periods and across series, but PE-based measures have the following general limitations (Davydenko, 2010):


• Observations with zero actual values cannot be processed (see the sketch below);
• Dividing by low actuals results in extreme percentage values that do not allow for a useful interpretation (since such errors are not necessarily harmful or damaging);
• The evaluation of intermittent demand forecasts therefore becomes intractable, due to a large proportion of zero and close-to-zero actual values;
• All PE-based measures can be misleading when the improvement in accuracy correlates with the actual value on the original scale.
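The sketch below (illustrative only; values are assumptions) shows the first two limitations: a zero actual cannot be processed at all, and a near-zero actual produces an extreme, uninformative percentage error.

```python
def percentage_errors(actuals, forecasts):
    """Percentage errors 100 * (actual - forecast) / actual, where defined."""
    pes = []
    for a, f in zip(actuals, forecasts):
        if a == 0:
            pes.append(None)  # observation with zero actual: cannot be processed
        else:
            pes.append(100.0 * (a - f) / a)
    return pes

print(percentage_errors([100, 0, 0.5], [90, 5, 3]))
# [10.0, None, -500.0] -> the near-zero actual yields an extreme value
```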

5.2.9 Binary Logistic Models

In this type of model, you estimate the probability of an event occurring. It is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). It allows one to say that the presence of a risk factor increases the odds of a given outcome by a specific factor. The model itself simply models probability of output in terms of input, and does not perform statistical classification (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cut-off value and classifying inputs with probability greater than the cut-off as one class, below the cut-off as the other. The coefficients are generally not computed by a closed-form expression, unlike linear least squares. (NCL)
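A minimal sketch of such a model using scikit-learn (the synthetic data, the 0.5 cut-off and the variable names are assumptions; the deliverable does not prescribe a particular implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic example: two predictor variables and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]     # estimated probability of the event
labels = (probs > 0.5).astype(int)       # optional classification via a cut-off
print(probs[:5], labels[:5])
```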

5.2.10 Critical Success Index (CSI)

Also known as the Threat Score (TS), this can be expressed as follows:

CSI = TS = Hits / (Hits + Misses + False alarms)

In practice, it can be used to answer the following type of question: how well did the forecast "yes" events correspond to the observed "yes" events? CSI ‘measures the fraction of observed and/or forecast events that were correctly predicted. It can be thought of as the accuracy when correct negatives have been removed from consideration. That is, CSI is only concerned with forecasts that are important (i.e. assuming that the correct rejections are not important).’ It is sensitive to hits and penalises both misses and false alarms, but does not distinguish the source of forecast error (WWRP/WGNE).

5.2.11 Equitable Threat Score (ETS)

The ETS, also known as the Gilbert Skill Score, measures the fraction of observed and/or forecast events that were correctly predicted, adjusted for hits associated with random chance (for example, it is easier to correctly forecast rain occurrence in a wet climate than in a dry climate). The ETS is often used in the verification of rainfall in NWP models because its "equitability" allows scores to be compared more fairly across different regimes. It is sensitive to hits and, because it penalises both misses and false alarms in the same way, it does not distinguish the source of forecast error (WWRP/WGNE).

5.2.12 False Alarm Ratio (FAR)

FAR is the number of false alarms divided by the total number of warnings or alarms in a given study or situation:

FAR = False alarms / (Hits + False alarms)

(WWRP/WGNE)
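The three categorical scores just described (CSI, ETS and FAR) all derive from the same 2x2 contingency table of yes/no forecasts against yes/no observations; a minimal sketch with example counts follows (the counts and the helper name are assumptions):

```python
def categorical_scores(hits, misses, false_alarms, correct_negatives):
    """CSI, ETS and FAR from a 2x2 forecast/observation contingency table."""
    total = hits + misses + false_alarms + correct_negatives
    csi = hits / (hits + misses + false_alarms)
    # hits expected by random chance, used to adjust the Equitable Threat Score
    hits_random = (hits + misses) * (hits + false_alarms) / total
    ets = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    far = false_alarms / (hits + false_alarms)
    return csi, ets, far

print(categorical_scores(hits=82, misses=23, false_alarms=38, correct_negatives=222))
```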


5.2.13 Forecast Skill Score (SS)

A statistical evaluation of the accuracy of forecasts or the effectiveness of detection techniques. Several simple formulations are commonly used in meteorology. The skill score (SS) is useful for evaluating predictions of temperatures, pressures, or the numerical values of other parameters. It compares a forecaster's root-mean-squared or mean-absolute prediction errors (Ef), over a period of time, with those of a reference technique (Erefr), such as forecasts based entirely on climatology or persistence, which involve no analysis of synoptic weather conditions:

SS = 1 - (Ef / Erefr)

If SS > 0, the forecaster or technique is deemed to possess some skill compared to the reference technique. For binary, yes/no kinds of forecasts or detection techniques, the probability of detection (POD), false alarm rate (FAR), and critical success index (CSI) may be useful evaluators. For example, if A is the number of forecasts that rain would occur when it subsequently did occur (forecast = yes, observation = yes), B is the number of forecasts of no rain when rain occurred (no, yes), and C is the number of forecasts of rain when rain did not occur (yes, no), then:

POD = A / (A + B)
FAR = C / (A + C)
CSI = A / (A + B + C)

For perfect forecasting or detection, POD = CSI = 1.0 and FAR = 0.0. POD and FAR scores should be presented as a pair. (AMS, 2018)
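The evaluators defined in this section can be sketched as follows (A, B, C and the error values are example assumptions):

```python
def pod_far_csi(a, b, c):
    """Yes/no evaluators with A = hits, B = misses, C = false alarms."""
    pod = a / (a + b)   # probability of detection
    far = c / (a + c)   # false alarm rate, as defined in this section
    csi = a / (a + b + c)
    return pod, far, csi

def skill_score(forecast_error, reference_error):
    """SS = 1 - Ef / Erefr; positive values indicate skill over the reference."""
    return 1.0 - forecast_error / reference_error

print(pod_far_csi(a=40, b=10, c=20))                         # (0.8, 0.333..., 0.571...)
print(skill_score(forecast_error=1.8, reference_error=2.4))  # 0.25 -> some skill
```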

5.2.14 Geometric Mean of the RAE (GMRAE)

RAE is the Relative Absolute Error. It can be averaged by taking a geometric mean to get the GMRAE. The GMRAE is used for calibrating models. (Armstrong)
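A minimal sketch, assuming the usual definition of the RAE as the absolute error of the forecast divided by the absolute error of a reference (e.g. naive) forecast; the benchmark and values below are illustrative assumptions:

```python
import math

def gmrae(actuals, forecasts, reference_forecasts):
    """Geometric mean of the Relative Absolute Errors against a benchmark."""
    raes = [abs(a - f) / abs(a - r)
            for a, f, r in zip(actuals, forecasts, reference_forecasts)]
    return math.exp(sum(math.log(x) for x in raes) / len(raes))

actuals   = [100, 110, 120]
forecasts = [ 98, 112, 116]
naive     = [ 95, 100, 110]   # e.g. previous-period values as the benchmark
print(gmrae(actuals, forecasts, naive))  # < 1 means better than the benchmark
```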

5.2.15 MAE (Mean Absolute Error) / Mean Absolute Deviation (MAD)

The average of the absolute differences between forecasts and observations.

5.2.16 MAPE (Mean Absolute Percentage Error) / MAPD (Mean Absolute Percentage Deviation)

A measure of prediction accuracy of a forecasting method in statistics, for example in trend estimation, usually expressed as a percentage (NOAA). MdAPE and MdRAE are further (median-based) variations that provide simple accuracy metrics over data sets.
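Both measures (and their median variants) are one-liners; this sketch with example series is ours, not from the source:

```python
import statistics

def mae(actuals, forecasts):
    """Mean Absolute Error / Mean Absolute Deviation."""
    return statistics.mean(abs(a - f) for a, f in zip(actuals, forecasts))

def mape(actuals, forecasts):
    """Mean Absolute Percentage Error (undefined for zero actuals)."""
    return 100.0 * statistics.mean(abs((a - f) / a) for a, f in zip(actuals, forecasts))

actuals, forecasts = [100, 120, 80], [110, 115, 90]
print(mae(actuals, forecasts))   # ~8.33
print(mape(actuals, forecasts))  # ~8.9 %
```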


5.2.17 Ranked Probabilistic Skill Score (RPSS)

The RPSS is a widely used measure to describe the quality of categorical probabilistic forecasts (Weigel et al.).

5.2.18 Receiver Operating Characteristics (ROC)

The receiver operating characteristic (ROC) curve is a two-dimensional measure of classification performance. In its simplest form it is a parametric plot of the hit rate (or probability of detection) versus the false alarm rate, as a decision threshold is varied across the full range of a continuous forecast quantity. (Marzban)
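A minimal sketch of how such a curve can be traced by sweeping a decision threshold over a continuous forecast quantity (the scores, outcomes and thresholds below are example assumptions):

```python
def roc_points(scores, observed, thresholds):
    """(false alarm rate, hit rate) pairs as the decision threshold is varied."""
    positives = sum(observed)
    negatives = len(observed) - positives
    points = []
    for t in thresholds:
        hits = sum(1 for s, o in zip(scores, observed) if s >= t and o == 1)
        false_alarms = sum(1 for s, o in zip(scores, observed) if s >= t and o == 0)
        points.append((false_alarms / negatives, hits / positives))
    return points

scores   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # continuous forecast quantity
observed = [1,   1,   0,   1,   0,   0]
print(roc_points(scores, observed, thresholds=[0.2, 0.5, 0.85]))
```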

5.2.19 The ‘Goodness Score’, discernment factor, base algorithm

The goodness score is a metric that allows one to look at how 'good' a forecast is by introducing the element of discernment in a quantitative way. It is based on Brier's scoring rule. Every forecaster’s baseline score starts with the inherent uncertainty baked into the system. Indeed, in the case of a forecaster who guesses the climatic average, both [discernment] and [error] will be 0, and thus the total score is exactly the uncertainty. Forecasters can improve their score (i.e. lower it) by increasing discernment, i.e. removing some amount of that uncertainty. Forecasters worsen their score by being inaccurate, which shows up as a higher [error] metric. In general, a forecaster with increasing discernment will probably introduce a little more error as well. The better forecasters increase discernment by more than the added error, thus decreasing the overall score. In a given context, uncertainty is usually fairly constant, because it is a long-term average quantity. The other two scores vary, and the extremes can be interpreted as follows:

• Low error & low discernment = Useless. You are only accurate because you are just guessing the average.
• High error & low discernment = Failure. You are not segmenting the population, and yet you are still worse than guessing the climatic average.
• Low error & high discernment = Ideal. You are making strong, unique predictions, and you are correct.
• High error & high discernment = Try again. You are making strong predictions, so at least you are trying, but you are not guessing correctly.

We can derive a simple formula from this:

[goodness] = [uncertainty] - [discernment] + [error] (Cohen, 2016)
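One way to obtain the three components in the formula above is the classical decomposition of the Brier score into uncertainty, resolution and reliability; treating "discernment" as resolution and "error" as reliability is our interpretation of the source, not a formula it states, and the binning scheme and example data are assumptions.

```python
from collections import defaultdict

def goodness_components(forecasts, outcomes, n_bins=10):
    """Brier-score decomposition: returns (uncertainty, discernment, error),
    interpreting discernment as resolution and error as reliability."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[min(int(f * n_bins), n_bins - 1)].append((f, o))
    reliability = resolution = 0.0
    for items in bins.values():
        n_k = len(items)
        f_k = sum(f for f, _ in items) / n_k   # mean forecast in the bin
        o_k = sum(o for _, o in items) / n_k   # observed frequency in the bin
        reliability += n_k * (f_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return uncertainty, resolution / n, reliability / n

unc, disc, err = goodness_components([0.9, 0.8, 0.2, 0.1, 0.7, 0.3], [1, 1, 0, 0, 1, 0])
print(unc - disc + err)   # "goodness": lower is better, as with the Brier score
```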

5.2.20 Unbiased Absolute Percentage Error or UAPE

The absolute error is divided by the average of the forecast and actual values. This has also been referred to as the Unbiased Absolute Percentage Error (UAPE) and as the symmetric MAPE (sMAPE). (Armstrong)
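A minimal sketch of this symmetric variant (the example series is an assumption):

```python
import statistics

def uape(actuals, forecasts):
    """Unbiased / symmetric APE: |error| divided by the mean of forecast and actual."""
    return 100.0 * statistics.mean(
        abs(a - f) / ((a + f) / 2.0) for a, f in zip(actuals, forecasts))

print(uape([100, 120, 80], [110, 115, 90]))  # ~8.5 %
```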


6 Conclusions

• The aim is to achieve the best decision-making model that would allow PYTHIA to identify whether one forecasting technique is better than another by using several accuracy metrics at the same time. The convergence or merging of these accuracy measures is an important feature in establishing a better model

• We therefore see the combination of various accuracy metrics, from informal and formal approaches, as key to PYTHIA

• More testing should be done to see which combinations work most effectively in which situation

• A best-fit ‘Accuracy Appraisal Award Amongst Analysts’ could be introduced. This would allow a group of analysts to discuss which accuracy measures worked well with which forecast systems

• Adoption of any technique should be made simple, and a common understanding of the benefits of these techniques should be made clear. The potential for training on the specifics of the importance and relevance of forecasting accuracy (and how to measure it) is seen as key for overall awareness and process improvement

• There are far more formal than informal methods for measuring the accuracy of forecasts. More research should be done within the project to identify informal methods, and these should be tested against trial data

• The formal methods borrow from diverse scientific fields where large amounts of data are available. A convergence of these techniques as well as the informal techniques into a ‘fit for purpose’ approach for the EDA would be very beneficial

• In order to achieve this, we recommend a dedicated workshop with the EDA to develop the convergence techniques into one overall method (for measuring the accuracy of forecasts)

• We recommend, as the most promising formal techniques: MAPE, Brier Score, Forecast Skill Score, Equitable Threat Score, Critical Success Index and Binary Logistic Model, combined with Objective Performance Feedback as the most promising informal technique

• We recommend specifically, in the first instance, the Brier Score as the formal technique to be used. This is a known method that we can refine - based on a better understanding of the specific working context of the EDA - and combine with the selected informal method


• The Brier Score has been recommended and ‘externally’ validated during the Good Judgement Project (GJP). Our aim would be to develop a correct usage of this metric within the EDA context. The Brier Score would, for example, be useful in determining accuracy over the lifetime of a question (calculating a Brier Score for every day an active forecast was taken, then taking the average of those daily Brier Scores). Also, an evaluation that takes into account the rarity of the event can be obtained using the Brier Score. This is a key aspect of having more reliable accuracy and prediction measurements

• This must be done within an overall framework of Communication > Training > Best Practices > Tools > Objective Performance Feedback > Binary / non-Binary measurements

• The PYTHIA consortium should work with the EDA to develop a ‘forecasting accuracy scorecard’ together with an analyst check-list for measuring accuracy, ensuring the techniques are adaptive

• Whilst it has not been the focus of this research, it is important to mention here that the communication of probability and uncertainty is extremely important. How can we measure accuracy if we do not have a common understanding about what constitutes a prediction or the percentages assigned to a verbal statement? For the PYTHIA project, we will make further recommendations as to how best to communicate uncertainty. Initially, one recommendation is to adopt a similar method to that used by the UK MoD’s ‘Uncertainty Yardstick’ (see Figure 1) and to explore in more detail the implications of language and the associated probability ranges. This is not a method for measuring forecasting accuracy, but part of the overall foresight methodology. The EDA should mandate the use of a standardised lexicon of terms – similar to the Uncertainty Yardstick – expressing probability and uncertainty. This approach assumes familiarity with the basic concepts of probability and uncertainty (i.e. what it means to say something like ‘it is 25% likely that Country X has an active nuclear programme’). It also assumes that the analyst has arrived at a probabilistic judgement using a robust method. This method of communicating uncertainty should form part of the overall technology foresight methodology, and as such is not elaborated here.


Figure 1: The “Uncertainty Yardstick”


7 References

AMS (2018). Glossary of Meteorology: Skill. American Meteorological Society. Retrieved from http://glossary.ametsoc.org/wiki/Skill

Weigel, A. P., et al. (n.d.). The Discrete Brier and Ranked Probability Skill Scores. AMETSOC.

Armstrong, J. S. (n.d.). The Forecasting Dictionary. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.4833&rep=rep1&type=pdf

Barnston, A. (1992). Correspondence among the Correlation [root mean square error] and Heidke Verification Measures; Refinement of the Heidke Score. Climate Analysis Center.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.

Cohen, J. (2016, June 28). How to measure the accuracy of forecasts. Retrieved from https://blog.asmartbear.com/forecast.html

Constantinou, A., & Fenton, N. (2012). Solving the Problem of Inadequate Scoring Rules for Assessing Probabilistic Football Forecast Models.

National Research Council. Field Evaluation in the Intelligence and Counterintelligence Context: Workshop Summary (p. 114). Washington, DC: National Academies Press.

Mandel, D. R., & Barnes, A. (2014). Accuracy of forecasts in strategic intelligence. Proceedings of the National Academy of Sciences.

Davydenko, A. (2010). Measuring the Accuracy of Judgmental Adjustments to SKU-level Demand Forecasts. Lancaster: Lancaster University.

Derbentseva, N., et al. (2011). Issues in Intelligence Production: Summary of Interviews with Canadian Intelligence Managers. Toronto: Defence Research and Development Canada.

Elsevier. (n.d.). What is Peer Review? Retrieved from https://www.elsevier.com/reviewers/what-is-peer-review

Choi, E., et al. (2018). Effects of Positive and Negative Feedback Sequence on Work Performance and Emotional Responses. Journal of Organizational Behavior Management, 97-115.

Fischhoff, B., & Chauvin, C. (Eds.) (2011). Intelligence Analysis: Behavioral and Social Scientific Foundations. Washington, DC: National Academies Press.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association.

Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society.

Horowitz, M. C., & Tetlock, P. E. (2012). Trending upward: How the intelligence community can better see into the future. Foreign Policy. Retrieved from http://www.foreignpolicy.com/articles/2012/09/06/trending_upward

Lee, T. H. (2007). Loss Functions in Time Series Forecasting.

Lehner, P., et al. (2010). Measuring the Forecast Accuracy of Intelligence Products.

Zamo, M., & Naveau, P. (2017). Estimation of the Continuous Ranked Probability Score with Limited Information and Applications to Ensemble Weather Forecasts. Mathematical Geosciences, 209-234.

Marzban, C. (n.d.). The ROC Curve and the Area under It as Performance Measures. AMETSOC.

Merkle, E. C., & Steyvers, M. (2013). Choosing a strictly proper scoring rule.


NCL. (n.d.). Binary logistic regression. Data Analysis. Retrieved from https://www.ncl.ac.uk/itservice/dataanalysis/advancedmodelling/regressionanalysis/binarylogisticregression/

NOAA. (n.d.). Glossary_Verification_Metrics.pdf.

Nordmeyer, B. (2018, January 10). Types of Forecasting Methods. Retrieved May 31, 2018, from https://bizfluent.com/info-8195437-types-forecasting-methods.html

Betts, R. K. (2007). Enemies of Intelligence: Knowledge and Power in American National Security. New York: Columbia University Press.

Roughgarden, T. (2016). Scoring Rules and Peer Prediction.

Kent, S. (1964). Words of estimative probability. Retrieved from https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/sherman-kent-and-the-board-of-national-estimates-collected-essays/6words.html

Tetlock, P. E. (2016). Superforecasting: The Art and Science of Prediction.

Weiss, C. (2008). Communicating uncertainty in intelligence and other professions. International Journal of Intelligence and CounterIntelligence.

WWRP/WGNE Joint Working Group on Forecast Verification Research. (n.d.). Retrieved from http://www.cawcr.gov.au/projects/verification/verif_web_page.html
