
RESEARCHERS AND PRACTITIONERS VOLLEY ABOUT MAKING RESEARCH USEFUL

Dear Practitioners:

The research community is discovering things you might find useful. Please help us organize this knowledge so that it's actually useful to you. Understand that this isn't absolute truth, but rather the best we can do at the moment. You must be thoughtful about using this knowledge, but it's a lot better than guessing.

Sincerely, The Researchers

Dear Researchers,

We have a lot of questions, and we suspect you have answers. Unfortunately, the answers are scattered among thousands of papers, and we can’t tell fact from fiction. Worse, there are entire topics that no one is studying because they aren’t “scientific enough.” We have fallen back on getting insights from Hacker News, Stevey’s Drunken Blog Rants, and Jeff, who just transferred from Accounting. We’re pretty sure Jeff doesn’t know anything, but he’s the loudest person in our stand-up, and we don’t have any evidence to dispute him. We’ll take whatever evidence you have.

Sincerely, The Practitioners

imagining a dialogue between researchers and practitioners (see the sidebar).

This pattern is common: engineers often rely on their experience and a priori beliefs1 or turn to coworkers for advice. This is better than guessing or giving up. But what if incompletely validated research outcomes could be distilled into reliable sources, intermediate between validated results and folk wisdom?

To impact practice, SE research results must lead to pragmatic, actionable advice. This involves synthesizing recommendations from results with different assumptions and levels of rigor, assigning appropriate levels of confidence to the recommendations. Here, we examine how these tradeoffs between rigor and pragmatism have been handled in medicine, where risk is often acceptable in the face of urgency. We propose an approach to describing SE research results with varied quality of evidence and synthesizing those results into codified knowledge for practitioners. This approach can both improve practice and increase the pace of research, especially in exploratory topics.

Engineering Research Expectations over Time

When the 1968 NATO Conference introduced "software engineering" to our vocabulary,2 research often focused on designing and building programs. There were guidelines for writing programs; the concept of reasoning mathematically about a program had just been introduced. The emphasis was on demonstrated capability—what we might now call feasibility—rather than rigorous validation.

This is visible in a sampling of major results of the period. For example, Carnegie Mellon University identified a set of canonical papers published between 1968 and 2002.3 Several are formal analyses or empirical studies, and a few are case studies. However, the majority are carefully reasoned essays that propose new approaches based on the authors' experience and insight.

The field has historically built on results with varying degrees of certainty. Indeed, Fred Brooks proposed a "certainty-shell" structure

SEPTEMBER/OCTOBER 2018 | IEEE SOFTWARE 51 FOCUS: SOFTWARE ENGINEERING’S 50TH ANNIVERSARY

for knowledge, to balance "the tension between narrow truths proved convincingly by statistically sound experiments, and broad 'truths,' generally applicable, but supported only by possible unrepresentative observations."4 Brooks' structure recognizes three nested classes of results: scientifically validated findings, observations, and rules of thumb—with different evaluation criteria for each. By properly identifying each result, we can take advantage of incomplete or partial knowledge. For example, Butler Lampson's "Hints for Computer System Design" is an excellent set of well-thought-out rules of thumb.5

Expectations for rigor in SE research have evolved since the NATO conference. Around the turn of the 21st century, the SE research community became concerned about the lack of quantitative, experimental research. A preponderance of papers accepted for the 2002 International Conference on Software Engineering (ICSE 02) defended their results with examples, followed by papers that supported results with formal or controlled experimental techniques and reports on experience in practice.6 Beginning in 2004, the evidence-based software engineering (EBSE) community began calling for synthesizing research results through systematic literature reviews (SLRs).

The field has matured in its awareness of the variety of research methods available. In 2014 and 2015, ICSE asked authors to classify submissions as analytical, empirical, methodological, perspective, or technological, providing criteria for each category. Compared to ICSE 02, ICSE 16 had substantially more empirical reports and much more rigorous validation.7 The strength of validation was the most important factor affecting acceptance, and there were clear alignments between the types of result and the validation techniques.

This evolution is consistent with the way ideas typically evolve in our field—building from key insights and early exploration to products, over several decades. Different research methods are appropriate at different stages of this evolution, as increasing confidence in the work justifies larger-scale, controlled evaluation. However, well-controlled experiments almost inevitably narrow the scope of their results and their immediate practical relevance.

Additionally, certain types of research (like design), and most early-stage research, remain more narrative; setting inappropriate evaluation expectations can deter progress. Even results falling short of current standards are "better than nothing" for practitioners. To help reconcile this tension between the research community's standards of rigor and the pragmatic needs of practitioners, we observe how evidence-based medicine (EBM) connects research results to the needs of clinical medical practice, and vice versa.

Evidence-Based Medicine

EBM is the conscientious, explicit, and judicious use of the current best evidence from medical clinicians to make timely decisions about the care of individual patients. EBM arranges medical evidence into a hierarchy (see Figure 1): case reports and series are toward the bottom, progressing to individual randomized clinical trials and then meta-analysis and systematic reviews.8 In this, EBM emphasizes the synthesis of (possibly weaker) individual results into stronger conclusions. Significantly, EBM then supports practical decision making that traces the level of evidence and confidence in a decision to its source by assigning strengths to a recommendation based on the level of the evidence that supports it:

    What are we to do when the irresistible force of the need to offer clinical advice meets with the immovable object of flawed evidence? All we can do is our best: give the advice, but alert the advisees to the flaws in the evidence on which it is based.9

Table 1 gives the rules for assigning recommendation grades.9 Today, many diagnoses combine the context of the particular patient case with levels of uncertainty from the analysis model to determine diagnosis certainty and confidence in the recommended treatment. The decision can map to combined levels of certainty, including less certain results from different analysis methods. On the basis of EBM principles, physicians can now consult best medical-practice guidance via a smartphone from a patient's bedside.

EBM favors neither the research nor the practice. It instead represents a systematic approach to clinical problem solving that allows the integration of the best available research evidence with clinical expertise and patient values. EBM can teach SE the value of researchers and practitioners collaborating. Researchers make in-development techniques available for testing, and practitioners combine such evidence using their best judgment.



Table 2. The hierarchy of evidence for software engineering research.

Secondary or filtered studies
  Level 0: Systematic reviews with recommendations for practice; meta-analyses
Primary studies—systematic evidence
  Level 1: Formal or analytic results with rigorous derivation and proof
  Level 2: Quantitative empirical studies with careful experimental design and good statistical control
Primary studies—observational evidence
  Level 3: Observational results supported by sound qualitative methods, including well-designed case studies
  Level 4: Surveys with good sampling and good design; field studies; data mining
  Level 5: Experience from multiple projects, with analysis and cross-project comparison; a tool, a prototype, a notation, a dataset, or another artifact (that has been certified as usable by others)
  Level 6: Experience from a single project: an objective review of a specific project; lessons learned; a solution to a specific problem, tested and validated in the context of that problem; an in-depth experience report; a notation, a small dataset, or an unvalidated artifact
No design
  Level 7: Anecdotes on practice; a rule of thumb; an evaluation with small or toy examples; a novel idea backed by strong argumentation; a position paper or an op-ed based principally on expert opinion

Several earlier classifications of evidence in SE map onto the levels of Table 2:

• Hakan Erdogmus. Systematic evidence, anecdotal evidence, and feasibility checks map to levels 1 and 2, level 3, and levels 4 and 5, respectively.11
• Forrest Shull and his colleagues. Empirical methods map to levels 2 through 6.14
• Chris Scaffidi and Mary Shaw. Low-ceremony evidence maps to levels 6 and 7.15

EBM has previously inspired SE research practice. Barbara Kitchenham and her colleagues introduced EBSE, producing recommendations for practitioners and researchers.12 EBSE recognizes that impact in both medicine and SE arises from the collection of multiple sources surrounding an idea, ideally synthesized through secondary studies. Key outcomes of this work are a set of recommendations for conducting SLRs and a call for increased empirical and controlled experiments in SE research.

Most SLRs in SE to date have addressed research questions rather than actionable advice. David Budgen and his colleagues found that only 37 out of 178 SLRs published between 2010 and 2015 provided recommendations or conclusions of relevance to education or practice.16 A follow-up study found that the SLRs with recommendations on practice were derived predominantly from primary studies conducted in industry. While the focus on what has been done is helpful for researchers, the lack of focus on what has been learned is unhelpful for practitioners.

In one paper that provided actionable insights, the authors synthesized nine papers from the software-testing literature.17 They gave rules of thumb for practice, such as "when you need higher assurance, ... a data-flow technique … can be more effective than random testing."17 What was missing was labeling the strength of the recommendations. The authors instead provided caveats about the initial studies, such as small evaluation programs (level 7) and the use of seeded faults (level 6).

SLRs in SE have also not systematically addressed recommendation strength, which requires assessing whether the evidence is consistent and field-tested. We advocate adopting the additional step of EBM, expecting SLRs to make explicit recommendations on practice, clearly labeled with a strength of recommendation reflecting the level of rigor of the underlying evidence. Whereas EBSE emphasizes the quality of execution of individual studies as a basis for assigning confidence to SLR outcomes, we emphasize the level of evidence of the individual studies. Doing this recognizes the role of lower-confidence evidence in informing practice, both in medicine (where animal studies can be
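To make the idea concrete, the hierarchy of Table 2 and a grading rule in the spirit of EBM's recommendation grades can be sketched in a few lines of Python. The grading thresholds below are hypothetical illustrations, not the actual Oxford CEBM rules of Table 1 (which this article does not reproduce):

```python
from dataclasses import dataclass
from enum import IntEnum

class EvidenceLevel(IntEnum):
    """Levels of evidence for SE research, after Table 2 (0 = strongest)."""
    SYSTEMATIC_REVIEW = 0       # secondary/filtered studies with recommendations
    FORMAL_ANALYSIS = 1         # formal or analytic results with proof
    CONTROLLED_EXPERIMENT = 2   # quantitative studies with statistical control
    QUALITATIVE_STUDY = 3       # observational results, well-designed case studies
    SURVEY_OR_FIELD_STUDY = 4   # good sampling and design; data mining
    MULTI_PROJECT_EXPERIENCE = 5
    SINGLE_PROJECT_EXPERIENCE = 6
    ANECDOTE_OR_OPINION = 7     # rules of thumb, toy examples, position papers

@dataclass
class Study:
    citation: str
    level: EvidenceLevel

def recommendation_grade(evidence: list[Study]) -> str:
    """Hypothetical grading rule: the grade reflects the strongest
    (lowest-numbered) evidence level among the supporting studies."""
    best = min(s.level for s in evidence)
    if best <= EvidenceLevel.CONTROLLED_EXPERIMENT:
        return "A"  # systematic or experimental evidence
    if best <= EvidenceLevel.SURVEY_OR_FIELD_STUDY:
        return "B"  # observational evidence
    if best <= EvidenceLevel.SINGLE_PROJECT_EXPERIENCE:
        return "C"  # experience reports
    return "D"      # expert opinion only

evidence = [
    Study("comparative case study on a benchmark code base",
          EvidenceLevel.QUALITATIVE_STUDY),
    Study("experience report from a single company",
          EvidenceLevel.SINGLE_PROJECT_EXPERIENCE),
]
print(recommendation_grade(evidence))  # prints "B": best evidence is level 3
```

The point of such machinery is traceability: the grade of a recommendation can always be traced back to the level of its strongest supporting study, exactly as EBM traces a clinical recommendation to its evidence.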

informative) and in SE (where engineers value information from experience reports and overviews as well as empirical results).13 We recognize the need for a set of rules similar to Table 1 for assigning strength to the recommendations in secondary studies, such as SLRs.

The strength-of-recommendation taxonomy from EBM offers a starting point. For example, in EBM, consistent evidence from the field has higher recommendation strength than expert opinions. Other factors that may influence decision making with respect to this hierarchy include the age of a result, application or evaluation context (e.g., using practitioners or students as subjects), and tooling availability. The research and practitioner communities should refine the hierarchy of evidence before designing a specific rule for strengths of recommendations in SE.

What Next?

We envision a system that allows researchers and practitioners to reliably synthesize research results into actionable, real-world guidance. Imagine an alternative universe for Emily:

    Emily quickly identifies the latest reputable SLR on static analysis. The study provides specific recommendations for evaluating false-positive rates of commercial tools and integrating and customizing those tools. It also identifies a comparative case study (level 3) on a benchmark code base, showing which tools catch the errors of interest with few false positives. In addition, the study references several level 6 studies reporting experiences at other companies. By noon, Emily has selected a tool to try, modeled on the other studies.

We are proposing not a new subfield of SE but rather a new way to label, organize, and synthesize results across SE to more concretely benefit practice. This vision is achievable, given sufficient community participation and cooperation. It requires the following:

• Consensus on a formal framework for levels of evidence, together with a mapping between the evidence and the strength of the resulting recommendation. SE researchers and practitioners should collaborate to refine the classification of research methods in Table 2. The refinement should establish guidelines for consistent application of those methods, supporting replication and meta-analysis. Rules for describing the level of confidence in a result should be formulated at all levels of this classification. They should reflect both the intrinsic power and quality of execution of each type of research, including the recommendation strength.
• Explicit identification of methods and results. This should allow the interested software engineer to easily identify where on the "pyramid" a contribution falls.
• Incentives for and recognition of reviews that synthesize, interpret, or provide meta-analysis of bodies of prior work. Such meta-analyses must have the goal of synthesizing actionable practical guidance. They should be labeled with the confidence and range of applicability.
• Education of software engineers in how to use the framework. SE students should be taught how to find appropriate studies and interpret their recommendations.

This sketch of a levels-of-evidence framework leaves open key questions for the SE community.

First, how can such a framework help researchers value research with different levels of evidence appropriately, and select research methods appropriate to their research questions? How should publication venues decide which types of research to include? How should they be labeled and differentiated? Venues such as ICSE have previously required authors to label the type of their submissions, although this practice has not been consistent. We expect that top venues can and should continue to accept papers making use of "less rigorous" methodologies, assuming they are suitably identified. An important element of our argument is that such studies can importantly contribute to the body of knowledge, especially for emerging techniques. Opening the field to a wider variety of research results can increase the pace and novelty of the research we perform. It can also encourage the exploration of new directions even when controlled empirical data is difficult to collect.

In addition, how and when should meta-analyses be conducted and presented to software engineers? What incentives would persuade researchers and practitioners to perform these syntheses, and where should they be published and discussed? EBM relies on a central repository of SLRs, but this was established before Internet search largely replaced indexes and repositories as the preferred means of finding information. Going forward, it seems appropriate to publish SLRs in venues that match their subject matter and to make virtual collections as appropriate for comparison or comment.
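The kind of lookup Emily performs needs little more than structured evidence-level labels on published results. A minimal sketch, with invented titles and labels purely for illustration, of filtering a collection of studies by topic and evidence level (per Table 2, where lower numbers mean stronger evidence):

```python
# Hypothetical records for published results; the titles, topics, and
# level assignments are invented for illustration only.
studies = [
    {"title": "SLR: static-analysis tools in practice", "level": 0,
     "topic": "static analysis"},
    {"title": "Comparative case study on a benchmark code base", "level": 3,
     "topic": "static analysis"},
    {"title": "Experience report: adopting a tool at one company", "level": 6,
     "topic": "static analysis"},
    {"title": "Position paper on testing economics", "level": 7,
     "topic": "testing"},
]

def find(topic: str, max_level: int) -> list[str]:
    """Return titles on a topic whose evidence is at least as strong as
    max_level (lower numbers are stronger in this hierarchy)."""
    return [s["title"] for s in studies
            if s["topic"] == topic and s["level"] <= max_level]

# A practitioner wanting observational evidence or better on static analysis:
print(find("static analysis", max_level=3))
```

A real system would, of course, attach these labels at publication time and expose them through search, but even this toy query shows how an explicit level field turns "search the literature" into "search the literature no weaker than level N."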


What about this article? It's clearly opinion, and we tried to make it well-reasoned. It does draw heavily on practice, albeit from a different field. So this article is at level 7. That justifies further discussion and exploration. The next step should be community refinement of the hierarchy of evidence and the protocol for establishing the level of confidence.

Acknowledgments
We appreciate constructive comments from Fred Brooks and David Budgen. This work was supported by the Alan J. Perlis Chair of Computer Science at Carnegie Mellon University and by US National Science Foundation Software and Hardware Foundations awards 1645136 and 1714699. Ipek Ozkaya's contribution is based on work funded and supported by the US Department of Defense under contract FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. DM18-0598

References
1. P. Devanbu, T. Zimmermann, and C. Bird, "Belief & Evidence in Empirical Software Engineering," Proc. 38th Int'l Conf. Software Eng. (ICSE 16), 2016, pp. 108–119.
2. P. Naur and B. Randell, eds., Software Engineering: Report on a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, 7th to 11th October 1968, NATO, 1968; http://homepages.cs.ncl.ac.uk/brian.randell/NATO/index.html.
3. M. Shaw et al., Seminal Papers in Software Engineering: The Carnegie Mellon Canonical Collection, tech. report CMU-ISR-15-107, Inst. for Software Research, Carnegie Mellon Univ., Sept. 2015.
4. F.P. Brooks, "Grasping Reality through Illusion—Interactive Graphics Serving Science," Proc. 1988 SIGCHI Conf. Human Factors in Computing Systems (CHI 88), 1988, pp. 1–11.
5. B.W. Lampson, "Hints for Computer System Design," Proc. 9th ACM Symp. Operating Systems Principles (SOSP 83), 1983, pp. 33–48.
6. M. Shaw, "Writing Good Software Engineering Research Papers," Proc. 25th Int'l Conf. Software Eng. (ICSE 03), 2003, pp. 726–736.
7. C. Theisen et al., "Software Engineering Research at the International Conference on Software Engineering in 2016," SIGSOFT Software Eng. Notes, vol. 42, no. 4, 2018, pp. 1–7.
8. Evidence-Based Medicine Working Group, "Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine," J. Am. Medical Assoc., vol. 268, no. 17, 1992, pp. 2420–2425.
9. "Oxford Centre for Evidence-based Medicine—Levels of Evidence (March 2009)," Centre for Evidence-Based Medicine, Oxford Univ., Mar. 2009; https://www.cebm.net/2009/06/oxford-centre-evidence-based-medicine-levels-evidence-march-2009.
10. J. Forrest, Evidence-Based Decision Making: Introduction and Formulating Good Clinical Questions, Continuing Dental Education Course 311, dentalcare.com, 2014; https://www.dentalcare.com/en-us/professional-education/ce-courses/ce311.
11. H. Erdogmus, "How Important Is Evidence, Really?," IEEE Software, vol. 27, no. 3, 2010, pp. 2–5.
12. B.A. Kitchenham, D. Budgen, and P. Brereton, Evidence-Based Software Engineering and Systematic Reviews, Chapman & Hall, 2015.
13. M. Montesi and P. Lago, "Software Engineering Article Types: An Analysis of the Literature," J. Systems and Software, vol. 81, no. 10, 2008, pp. 1694–1714.
14. F. Shull, J. Singer, and D.I.K. Sjøberg, eds., Guide to Advanced Empirical Software Engineering, Springer, 2007.
15. C. Scaffidi and M. Shaw, "Toward a Calculus of Confidence," Proc. 1st Int'l Workshop Economics of Software and Computation (ESC 07), 2007, pp. 7–9.
16. D. Budgen et al., "Reporting Systematic Reviews: Some Lessons from a Tertiary Study," Information and Software Technology, vol. 95, 2018, pp. 62–74.
17. N. Juristo et al., "A Look at 25 Years of Data," IEEE Software, vol. 26, no. 1, 2009, pp. 15–17.
