INVITED CONTENT

Editor: Giuliano Antoniol, Polytechnique Montréal, [email protected]
Editor: Steve Counsell, Brunel University, [email protected]
Editor: Phillip Laplante, Pennsylvania State University, [email protected]

Software Engineering for Machine-Learning Applications

The Road Ahead

Foutse Khomh, Bram Adams, Jinghui Cheng, Marios Fokaefs, and Giuliano Antoniol

THE NEED AND desire for more automation and intelligence have led to breakthroughs in machine learning (ML) and artificial intelligence (AI), yet we still experience failures and shortcomings in the resulting software systems. The main reason is the shift in the development paradigm induced by ML and AI. Traditionally, software systems are constructed deductively, by writing down the rules that govern the system behaviors as program code. However, with ML techniques, these rules are inferred from training data (from which the requirements are generated inductively). This paradigm shift makes reasoning about the behavior of software systems with ML components difficult, resulting in software systems that are intrinsically challenging to test and verify.

Given the critical and increasing role of ML- and AI-based systems in our society, it’s imperative for both the software engineering (SE) and ML communities to research and develop innovative approaches to address these challenges. In fact, the learned behavior of an ML-based system might be incorrect, even if the learning algorithm is implemented correctly, a situation in which traditional testing techniques are ineffective. A critical problem is how to effectively develop, test, and evolve such systems, given that they don’t have (complete) specifications or even source code corresponding to some of their critical behaviors.

Motivated by these challenges, we organized the First Symposium on Software Engineering for Machine Learning Applications (SEMLA) at Polytechnique Montréal on 12 and 13 June 2018, with the kind support of Polytechnique Montréal’s Department of Computer Engineering and Software Engineering, the Institute for Data Valorization (IVADO), SAP, and Red Hat. The event attracted around 160 participants from all over the world, including students, academics, and industrial practitioners.

SEMLA’s main objective was to create a space in which SE and ML experts could come together to discuss challenges, new insights, and practical ideas regarding the engineering of ML- and AI-based systems. The program included talks and panels presented by renowned academic researchers and industrial practitioners, including keynote speakers David Parnas, Lionel Briand, and Yoshua Bengio. The full program is at http://semla.polymtl.ca. Here, we summarize some key challenges these experts identified.

System Accuracy

The first topic concerned the accuracy of systems built using ML and AI models, and the responsibilities of engineers building them. For example, one keynote speaker mentioned three categories of AI research:

• building programs that imitate human behavior to better understand human thinking (used in psychology research),
• building programs that play games well (challenging and fun), and
• demonstrating that practical computerized products can use the same methods that humans use (risky and often naive).

He stressed that researchers should be very concerned about AI systems in the third category because they can’t guarantee 100 percent accuracy or correct answers in all cases. He also raised concerns that people are using the Turing test to falsely claim intelligence in systems. He commented, “Turing did not claim that his test was a test for artificial intelligence!”

In response, a leading AI expert stated that AI’s goal is not to achieve 100 percent accuracy because

• humans are also far from 100 percent accuracy in their daily tasks, and
• AI technology’s strength comes from the ability to abstract up from different factors of variation between environments, to obtain models that can generalize and transfer to situations that weren’t encountered before.

He further explained that AI technologies’ main challenge is the curse of dimensionality, that is, the need for sufficient labeled data to cover all important factors (features) of a given problem. AI, in fact, needs more training data than humans do! Whereas the key properties of techniques such as deep learning (for example, compositionality, encoding into a simpler domain, and conditional computation) aim to reduce dimensionality’s impact, applications of AI still risk being limited to domains in which labeled data is cheap. Although labeled data is somewhat abundant in some SE domains (such as defect prediction), other domains (such as requirements elicitation) are more challenging. Overall, AI’s full impact on SE is still unclear.

Because of AI and ML systems’ intrinsic imperfection, one panelist argued, only harmless AI technology or applications should be released to the public, since the responsibility of every engineer is to protect the public. He also mentioned that the public should be informed accurately of the AI technology it’s being exposed to. For example, instead of touting a “100 percent self-driving car,” automotive companies should advertise their products as “AI-assisted cars,” with a clear list of the ways in which AI is assisting.

Another panelist emphasized that AI isn’t a panacea. He illustrated how simple techniques could give the illusion of AI, or how the blind application of AI wouldn’t improve the workflow of workers. For example, in principle, an intelligent robot could easily replace a human worker to hand another worker the right tool for a given job, but not if the worker afterward throws the tool back on a pile. (The robot will have a hard time retrieving the right tool from an unordered pile.) However, using an intelligent robot to return tools in an ordered fashion (which is a different problem) could allow other robots later on to be deployed to hand over tools to workers. If a traditional computer science algorithm can solve a problem, we should just use that.

System Testing

The second hot topic our experts discussed was the difficulty of testing ML and AI systems. Our panelists debated whether we should tackle the testing of those systems the same way we do the testing of traditional systems, since an AI system’s behavior might be incorrect even if the learning algorithms are implemented correctly. One keynote speaker explained how in complex cyber-physical systems (CPSs), when no clear specifications of the intended systems exist (that is, humans have a lot of knowledge but can’t formalize it), only AI can approximate the system’s intended behavior by learning models from the available data. This is a clear improvement over the manual design of models and controllers. However, it pushes most of the risk toward the trained models’ quality. So, how can we perform adequate quality assurance (QA) of AI models, given that the number of environments in which the models will be deployed is unlimited and that the human operator will require a detailed explanation of any failures?

Fortunately, we can use AI technology to reduce the search space of the environments to be tested, nudging QA techniques to those environments most likely to have failures or violate important safety constraints. Such an approach could even work in the system-of-systems context of CPSs, where each sensor and actuator must be validated not only in isolation but also in close integration with each other.

However, this QA doesn’t guard against hardware failure. So, hardware systems should incorporate fault-tolerance mechanisms to cope with such failures. One audience participant also observed that hardware could incorporate fault-tolerance mechanisms to mitigate the effect of AI model errors, improving AI systems’ robustness.

Another major challenge is that humans, once they’ve started trusting AI in their daily tasks, could begin adapting their behavior to the AI assistance. For example, a study in Munich showed how assisted braking initially reduced the number of accidents, until drivers relied too much on the assisted braking and drove more aggressively. So, when is an AI-enabled product ready for release to the public? Although four million miles of test drives can’t prevent a serious accident in the next mile, how much information in the test drives can be used to debug and fix the corresponding fault?

An additional important question discussed at SEMLA is humans’ role in an AI-driven world. Are humans obsolete once AI technologies become mainstream? The panelists unanimously disagreed because humans are essential for putting the decisions of AI into context. Although the outcome (and potential failures) of the AI impacts the humans’ recommendations, those recommendations are also a human filter for AI failures. Further sociological research is necessary to study how AI technology affects human behavior.

Industrial Applications

SEMLA’s second day was devoted to industrial applications of AI. The industrial speakers discussed the current state of AI in industry and the challenges they face when applying AI models. They also discussed some open problems and what they consider to be their biggest needs and top priorities.

For example, a presenter from Google Brain pinpointed several programming-language issues involved in ML libraries, models, and frameworks. Different approaches in the current libraries have different advantages and disadvantages. Creating an efficient syntax for automatic differentiation that can deliver ease of implementation, performance, usability, and flexibility is important but difficult. Testing and debugging these implementations are also salient challenges. Industrial practitioners further mentioned that collaboration among experts from different fields is important for developing ML applications.

Healing the Rift

From these two days of intensive discussions, two key questions emerged:

• How should software development teams integrate the AI model lifecycle (training, testing, deploying, evolving, and so on) into their software process?
• What new roles, artifacts, and activities come into play, and how do they tie into existing agile or DevOps processes?

Answering these questions requires combined knowledge in SE and ML. Unfortunately, a rift exists between these two communities, which we tried to understand through SEMLA. One reason for this rift is that stakeholders in the AI community focus on algorithms and their performance characteristics, whereas stakeholders in the SE community focus on implementing and deploying those algorithms.

So far, no real venue has integrated both fields, yet intersections exist between them, one of which is testing. The notion of coming up with ways to break a system is integral to ML, and the scale of test sets (thousands to millions of instances) is huge compared to the number of test cases in software systems. On the other hand, missing test cases or test cases that are known to fail (but are rare) are a larger issue for ML than for regular software systems.

We believe that the SE and ML communities should work together to solve the critical challenges of assuring the quality of AI and software systems in general. We have a lot to benefit from each other!

ABOUT THE AUTHORS

FOUTSE KHOMH is an associate professor at Polytechnique Montréal, where he leads the SWAT (Software Analytics and Technology) Lab. Contact him at [email protected].

BRAM ADAMS is an associate professor at Polytechnique Montréal, where he leads the MCIS (Maintenance, Construction and Intelligence of Software) Lab. Contact him at [email protected].

JINGHUI CHENG is an assistant professor in Polytechnique Montréal’s Department of Computer Engineering and Software Engineering. Contact him at [email protected].

MARIOS FOKAEFS is an assistant professor in Polytechnique Montréal’s Department of Computer Engineering and Software Engineering. Contact him at [email protected].

GIULIANO ANTONIOL is a professor of software engineering in Polytechnique Montréal’s Department of Computer Engineering and Software Engineering. Contact him at giuliano[email protected].

This article originally appeared in IEEE Software, vol. 35, no. 5, September/October 2018 (0740-7459/18/$33.00 © 2018 IEEE). Reprinted in ComputingEdge, February 2019 (2469-7087/19/$33.00 © 2019 IEEE). Published by the IEEE Computer Society.
