
COLLOQUIUM INTRODUCTION

The science of deep learning

Richard Baraniuka, David Donohob,1, and Matan Gavishc

deep learning | neural networks

Scientists today have completely different ideas of what machines can learn to do than we had only 10 y ago. In image processing, speech and video processing, machine vision, natural language processing, and classic two-player games, in particular, the state-of-the-art has been rapidly pushed forward over the last decade, as a series of machine-learning performance records were achieved for publicly organized challenge problems. In many of these challenges, the records now meet or exceed human performance level.

A contest in 2010 proved that the Go-playing computer software of the day could not beat a strong human Go player. Today, in 2020, no one believes that human Go players—including human world champion Lee Sedol—can beat AlphaGo, a system constructed over the last decade. These new performance records, and the way they were achieved, obliterate the expectations of 10 y ago. At that time, human-level performance seemed a long way off and, for many, it seemed that no technologies then available would be able to deliver such performance.

Systems like AlphaGo benefited in this last decade from a completely unanticipated simultaneous expansion on several fronts. On the one hand, we saw the unprecedented availability of on-demand scalable computing power in the form of cloud computing, and on the other hand, a massive industrial investment in assembling human engineering teams from a globalized talent pool by some of the largest global technology players. These resources were steadily deployed over that decade to allow rapid expansions in challenge problem performance.

The 2010s produced a true technology explosion, a one-time-only transition: The sudden public availability of massive image and text data. Billions of people posted trillions of images and documents on social media, as the phrase “Big Data” entered media awareness. Image processing and natural language processing were forever changed by this new data resource as they tapped the new image and text resources using the revolutionary increases in computing power and newly globalized human talent pool.

The field of image processing felt the impact of the new data first, as the ImageNet dataset scraped from the web by Fei-Fei Li and her collaborators powered a series of annual ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) prediction challenge competitions. These competitions gave a platform for the emergence and successive refinement of today’s deep-learning paradigm in machine learning.

Deep neural networks had been developing steadily since at least the 1980s; however, their heuristic construction by trial-and-error resisted attempts at analysis. For quite some time in the 1990s and 2000s, artificial neural networks were regarded with suspicion by scientists insisting on formal theoretical justification. In this decade they began to dominate prediction challenges like ImageNet. The explosion of image data on the internet and computing resources from the cloud enabled new, highly ambitious deep network models to win prediction challenges by substantial margins over more “formally analyzable” methods, such as kernel methods.

In fact, the performance advantage of deep networks over more “theoretically understandable” methods accelerated as the decade continued. The initial successes famously involved separating pictures of cats from dogs, but soon enough successes came in full-blown computer vision problems, like face recognition and tracking pedestrians in moving images.

aDepartment of Electrical and Computer Engineering, Rice University, Houston, TX 77005; bDepartment of Statistics, Stanford University, Stanford, CA 94305; and cSchool of Computer Science and Engineering, Hebrew University of Jerusalem, 919051 Jerusalem, Israel

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “The Science of Deep Learning,” held March 13–14, 2019, at the National Academy of Sciences in Washington, DC. NAS colloquia began in 1991 and have been published in PNAS since 1995. From February 2001 through May 2019 colloquia were supported by a generous gift from The Dame Jillian and Dr. Arthur M. Sackler Foundation for the Arts, Sciences, & Humanities, in memory of Dame Sackler’s husband, Arthur M. Sackler. The complete program and video recordings of most presentations are available on the NAS website at http://www.nasonline.org/science-of-deep-learning.

Author contributions: R.B., D.D., and M.G. wrote the paper.

The authors declare no competing interest.

Published under the PNAS license.

1To whom correspondence may be addressed. Email: [email protected].

First published November 23, 2020.

A few years following their initial successes in image processing, deep networks began to penetrate natural language processing, eventually producing in the hands of the largest industrial research teams systems able to translate any of 105 languages to any other, even language pairs for which almost no prior translation examples were available.

Today, it is no longer shocking to hear of deep networks with tens of billions of parameters trained using databases with tens of billions of examples. On the other hand, it may have been increasingly unsettling for scientists to witness human performance being dominated by empirically derived systems whose best-understood properties were simply their ability to prevail in gameplay and in prediction challenges like ImageNet.

In March of 2019, the National Academy of Sciences convened a Sackler Colloquium on “The Science of Deep Learning” in the Academy building in Washington, DC. The goal of the organizers was to advance scientific understanding of today’s empirically derived deep-learning systems, and at the same time to advance the use of such systems for traditional scientific research.

To those ends, important figures from academia and industry made presentations over 2 d; audience members—who included many graduate students and postdocs from institutions nationwide, as well as research sponsors from the NSF, NIH, and Department of Defense (DoD), and US government scientists from Washington, DC-area laboratories—found much to discuss in the hallways between presentations. Two presentations that were strikingly successful with the audience were those of Amnon Shashua and Rodney Brooks.

Amnon Shashua, from Hebrew University and Intel Mobility systems, discussed computer vision research strategies to enable self-driving cars. He told the audience that error rates of vision systems for moving vehicles need to stay below one missed detection per trillion units of visual experience, and discussed modeling and test strategies that can someday produce verified systems with such low error rates.

Rodney Brooks of Massachusetts Institute of Technology (MIT) explained how, in his view, it would be hundreds of years before machine-learning systems exhibit fully general intelligence. In support, he pointed to the currently prodigious appetite of today’s successful deep-learning systems for massive volumes of good data, and contrasted this with humans’ ability to understand and generalize from very little data.

In the weeks prior to the colloquium, the White House released a national strategy document titled “Artificial Intelligence for the American People” (1), which called for new United States investment in artificial intelligence (AI). Because the colloquium took place in the Academy’s building on the Washington Mall, the colloquium overnight became a perfect venue to discuss the new initiative. Representatives of funding agencies (NSF, NIH, and DoD), including some who were deeply involved in formulating the strategy, described their recent and coming research portfolios, and told the audience how deep-learning research fit into coming national research initiatives.

As part of the Sackler colloquia series, the event is accompanied by a special issue of PNAS, the one you are now reading, authored by some of the speakers and participants of the colloquium. The many interesting papers gathered together in this volume reflect the vitality and depth of the scientific work being carried out in this new and rapidly developing field.

The special issue begins with two general overview papers. Terrence J. Sejnowski of the Salk Institute discusses “The unreasonable effectiveness of deep learning in artificial intelligence” (2). Sejnowski’s title stands in a tradition of similar titles that starts from Eugene Wigner’s famous essay on “The unreasonable effectiveness of mathematics in the physical sciences” (3) and continued in this decade with “The unreasonable effectiveness of data” (4) by Alon Halevy, Peter Norvig, and Fernando Pereira of Google. In this tradition, authors generally point to a technology (e.g., Mathematics, Big Data, Deep Learning) that enjoys undoubted success in certain endeavors, but which we don’t completely understand, and which, from a higher-level perspective, might seem surprising. Sejnowski (2) examines the paradox that, for a range of important machine-learning problems, deep learning works far better than conventional statistical learning theories would predict. Sejnowski suggests that, while today’s deep-learning systems have been inspired by the cerebral cortex of the brain, reaching artificial general intelligence will require inspiration from other important brain regions, such as those responsible for planning and survival.
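To give a rough sense of the paradox Sejnowski describes (the notation here is ours, a simplified illustration rather than anything taken from ref. 2): classical uniform-convergence arguments bound the gap between test and training error by a term that grows with model capacity and shrinks with the number of training examples,

\[
\mathbb{E}[\text{test error}] \;\lesssim\; \text{training error} \;+\; \sqrt{\frac{\mathrm{capacity}(\mathcal{F})}{n}},
\]

where $\mathrm{capacity}(\mathcal{F})$ is a complexity measure such as the VC dimension of the function class $\mathcal{F}$ and $n$ is the training-set size. For networks whose parameter count, and hence naive capacity, vastly exceeds $n$, such bounds are vacuous, yet the trained networks generalize well in practice.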
Tomaso Poggio, Andrzej Banburski, and Qianli Liao of MIT follow up nicely with “Theoretical issues in deep networks” (5), which considers recent theoretical results on approximation power, complexity control, and generalization properties of deep neural networks. Empirically, deep neural networks behave very differently under these three aspects from other machine-learning models. For approximation, the authors state formal results proving that certain convolutional nets can avoid the “curse of dimensionality” when approximating certain smooth functions. For complexity control and regularization, the authors consider the gradient flow of appropriately normalized networks under the exponential loss as a dynamical system. The authors point to implicit regularization properties of unconstrained gradient descent, to possibly explain the complexity control observed in overparameterized deep nets.

The idea that “deep learning keeps surprising us” was further developed by Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy of Stanford University (6). They consider deep neural nets, trained via self-supervision, which predict a masked word in a given context without labeled training data. The authors challenge the dominant perspective in linguistics, which posits that statistical machine-learning predictive language models do not develop interesting emergent knowledge of linguistic structures. Striking empirical evidence is presented for syntactic, morphological, and semantic linguistic structures that emerge in deep neural networks during their self-supervised training. That such rich information emerges through self-supervision has tantalizing implications for human language acquisition.
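For readers unfamiliar with this form of self-supervision, a schematic version of the training objective (our notation, not the authors’; details such as the masking rate vary across models) is as follows: given an unlabeled sentence $x = (x_1, \ldots, x_T)$, randomly choose a set of positions $M$ to mask and train the network parameters $\theta$ to maximize

\[
\sum_{i \in M} \log p_\theta\!\left(x_i \mid x_{\setminus M}\right),
\]

that is, to predict each hidden word from its surrounding context. The text itself supplies the targets, so no human annotation is required.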
Kyle Cranmer of New York University, with coauthors Johann Brehmer and Gilles Louppe, discusses another emergent surprise in their article “The frontier of simulation-based inference” (7). The article describes important scientific inference problems in particle physics that were until now viewed as intractable. Pointing to today’s “Machine Learning Revolution,” the authors identify new possibilities for attacking such inference problems, by fusing massive scientific simulations, machine-learning ideas such as active learning, and probabilistic modeling. In effect, machine learning can help us by training on measurements from scientific simulations to give us empirical models in place of classic analytical probabilistic models, which often are unavailable. They point to a range of scientific inference problems and conclude with these words: “...several domains of science should expect ...a significant improvement in inference quality ...this transition may have a profound impact on science” (7).
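A rough way to state the underlying difficulty (again in our own notation, as an illustration rather than a summary of ref. 7): a mechanistic simulator with parameters $\theta$ defines its likelihood only implicitly, as an integral over the simulator’s internal latent variables $z$,

\[
p(x \mid \theta) \;=\; \int p(x, z \mid \theta)\, dz,
\]

which is typically impossible to evaluate for realistic simulators. Simulation-based inference sidesteps this by running the simulator to generate pairs $(\theta_i, x_i)$ and training a neural surrogate on them, for example an estimator of the posterior $p(\theta \mid x)$ or of a likelihood ratio, which can then be applied to observed data.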

Our special issue also provides engaging articles on specific research questions. Peter L. Bartlett of the University of California, Berkeley, and coauthors Philip M. Long, Gábor Lugosi, and Alexander Tsigler discuss “Benign overfitting in linear regression” (8). Many recent deep-learning models contain more parameters to be determined than there are data points to fit them. We say such models are overfit. Traditionally, this would have been considered inimical to good empirical science. As the authors say: “The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: Deep neural networks seem to predict well, even with a perfect fit to noisy training data” (8). The authors conduct a penetrating formal analysis of the situation in the simplified setting of linear regression.
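To fix ideas, here is the simplest version of that setting, in our own notation (a sketch, not the authors’ precise assumptions): observe $y = X\theta^{*} + \varepsilon$ with design matrix $X \in \mathbb{R}^{n \times p}$, noise $\varepsilon$, and many more parameters than observations, $p > n$. Then infinitely many $\hat\theta$ interpolate the noisy data exactly, $X\hat\theta = y$; gradient descent started from zero converges to the interpolator of minimum Euclidean norm,

\[
\hat\theta \;=\; X^{\top}\bigl(X X^{\top}\bigr)^{-1} y \qquad (\text{assuming } X X^{\top} \text{ is invertible}),
\]

and “benign overfitting” refers to conditions on the spectrum of the data covariance under which this interpolating estimator nonetheless predicts nearly as well as the best possible.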
Antonio Torralba of MIT, with coauthors David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, and Bolei Zhou, addresses an important concern: Deep neural networks contain billions of artificial neurons, but what are they doing? Their article “Understanding the role of individual units in a deep neural network” (9) starts as follows: “Can the individual hidden units of a deep network teach us how the network solves a complex task? Intriguingly, within state-of-the-art deep networks, it has been observed that many single units match human-interpretable concepts that were not explicitly taught to the network: Units have been found to detect objects, parts, textures, tense, gender, context, and sentiment.” The authors describe quantitative tools to make such identifications. Building a second “annotation network,” they develop a “dissection” framework identifying the concepts that drive the networks’ neurons to respond. The technique applies to image-classification and image-generation networks and provides new insights into adversarial attacks and semantic image editing.

Doina Precup, of McGill University and DeepMind, and her coauthors André Barreto, Shaobo Hou, Diana Borsa, and David Silver discuss reinforcement learning, the variety of machine learning that gave us AlphaGo’s world-beating gameplay systems. Reinforcement learning is famously data-hungry. Precup and colleagues suggest a way out. Their article “Fast reinforcement learning with generalized policy updates” (10) begins with: “The combination of reinforcement learning with deep learning is a promising approach to tackle important sequential decision-making problems that are currently intractable.” To surmount the obstacles to such a combination with deep learning, the authors (10) propose “...generalization of two fundamental operations in reinforcement learning: Policy improvement and policy evaluation. The generalized version of these operations allow one to leverage the solution of some tasks to speed up the solution of others.” Barreto et al. (10) find that “Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.”

The special issue ends with two articles addressing emergent concerns about effects of machine learning on daily life. Anders C. Hansen of Cambridge University and coauthors Vegard Antun, Francesco Renna, Clarice Poon, and Ben Adcock identify a looming technical threat. Their article “On instabilities of deep learning in image reconstruction and the potential costs of AI” (11) calls attention to the important phenomenon of instability of deep neural networks in computer vision. Instabilities in image classification, along with the potential safety and security issues they raise regarding the use of deep-learning vision systems in mission-critical systems, have been discussed extensively in the literature. The authors expose an analogous instability phenomenon in deep-learning–based image reconstruction, where a deep neural network is trained to solve an imaging inverse problem. They are concerned about potential safety issues in applications, such as medical imaging. Antun et al. propose a stability test to diagnose stability problems and describe a software implementation of the test for inspecting such systems.

Jon Kleinberg of Cornell University and coauthors Jens Ludwig, Sendhil Mullainathan, and Cass R. Sunstein (12) close the special issue by addressing a fundamental new concern with the possible side-effects of deploying machine learning in daily life: Might they, by relying on data encoding human judgements, systematize discrimination and bias? They summarize their argument as follows: “... existing legal, regulatory, and related systems for detecting discrimination were originally built for a world of human decision makers, unaided by algorithms. Without changes to these systems, the introduction of algorithms will not help with the challenge of detecting discrimination and could potentially make the whole problem worse.” The authors finish on an optimistic note: “Algorithms by their nature require a far greater level of specificity than is usually involved with human decision making, which in some sense is the ultimate ‘black box.’ With the right legal and regulatory systems in place, algorithms can serve as something akin to a Geiger counter that makes it easier to detect—and hence prevent—discrimination” (12).

These articles expose many surprises, paradoxes, and challenges. They remind us that there are many academic research opportunities emerging from this rapidly developing field. Mentioning only a few: Deep learning might be deployed more broadly in science itself, thereby accelerating the progress of existing fields; theorists might develop better understanding of the conundrums and paradoxes posed by this decade’s deep-learning revolution; and scientists might understand better how industry-driven innovations in machine learning are affecting societal-level systems. Such opportunities will be challenging to pursue, not least because they demand new resources and talent. We hope that this special issue stimulates vigorous new scientific efforts pursuing such opportunities, leading perhaps to further discussions on deep learning in the pages of future editions of PNAS.

Acknowledgments

The authors thank Dame Jillian Sackler for her many years of sponsorship of National Academy of Sciences Sackler Colloquia. They also thank many Washington, DC-area residents, including representatives of the DoD, NIH, and NSF, for participating. Many graduate students came to Washington, DC from around the United States to participate actively. The National Academy of Sciences and PNAS staff have also been very helpful at every stage.

1 The White House, Artificial intelligence for the American people. https://www.whitehouse.gov/briefings-statements/artificial-intelligence-american-people/. Accessed 3 November 2020.
2 T. J. Sejnowski, The unreasonable effectiveness of deep learning in artificial intelligence. Proc. Natl. Acad. Sci. U.S.A. 117, 30033–30038 (2020).
3 E. P. Wigner, The unreasonable effectiveness of mathematics in the physical sciences. Commun. Pure Appl. Math. 13, 1–14 (1960).
4 A. Halevy, P. Norvig, F. Pereira, The unreasonable effectiveness of data. IEEE Intell. Syst. 24, 8–12 (2009).
5 T. Poggio, A. Banburski, Q. Liao, Theoretical issues in deep networks. Proc. Natl. Acad. Sci. U.S.A. 117, 30039–30045 (2020).
6 C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, O. Levy, Emergent linguistic structure in artificial neural networks trained by self-supervision. Proc. Natl. Acad. Sci. U.S.A. 117, 30046–30054 (2020).

7 K. Cranmer, J. Brehmer, G. Louppe, The frontier of simulation-based inference. Proc. Natl. Acad. Sci. U.S.A. 117, 30055–30062 (2020).
8 P. L. Bartlett, P. M. Long, G. Lugosi, A. Tsigler, Benign overfitting in linear regression. Proc. Natl. Acad. Sci. U.S.A. 117, 30063–30070 (2020).
9 D. Bau et al., Understanding the role of individual units in a deep neural network. Proc. Natl. Acad. Sci. U.S.A. 117, 30071–30078 (2020).
10 A. Barreto, S. Hou, D. Borsa, D. Silver, D. Precup, Fast reinforcement learning with generalized policy updates. Proc. Natl. Acad. Sci. U.S.A. 117, 30079–30087 (2020).
11 V. Antun, F. Renna, C. Poon, B. Adcock, A. C. Hansen, On instabilities of deep learning in image reconstruction and the potential costs of AI. Proc. Natl. Acad. Sci. U.S.A. 117, 30088–30095 (2020).
12 J. Kleinberg, J. Ludwig, S. Mullainathan, C. R. Sunstein, Algorithms as discrimination detectors. Proc. Natl. Acad. Sci. U.S.A. 117, 30096–30100 (2020).
