Machine Learning Beyond Physics, from a Physicist's Perspective
Jonathan Walsh

How do we do physics?

model building: Parameters → Model → Data

(kinda) natural, (well) motivated, extant/future

calculations: Observable → Model → Result

defined, known, unknown

simulations: Phase Space → Model → Data

tractable, approximated, realistic

A core tenet of particle physics

We can understand the model

My own experience in physics (QCD side)

[timeline of research topics, increasingly technical, moving phenomenological → theoretical → QCD foundations:
grad school - LHC Olympics, jet algorithms, SCET for jets, jet substructure
postdoc - non-global logarithms, high-precision resummation, fixed order and Monte Carlo calculations]

Evolution of my research program largely governed by the process:
1. finding interesting problems in QCD
2. understanding the underlying theory
3. applying it to more interesting problems
4. rinse and repeat

Modeling beyond physics

Beyond physics, the model may be too complex or just unknown.

A first-principles approach does not apply in these cases.

We need tools capable of model inference that can learn and utilize relevant information in the data.

2 common approaches: statistical and deterministic modeling
• statistical: derive the form of the model
• deterministic: input the form of the model

Reframing the core tenet


We can understand the model → We can understand the data

Statistical vs. Deterministic Modeling

[plot: performance vs. parameter for deterministic and statistical models]

deterministic models:
• high accuracy in a region of phase space (optimal within the domain of applicability)
• bad failure modes outside the domain of applicability

statistical models:

• reasonable accuracy across phase space

• graceful failure modes

which should we prefer?

Minimax Optimization

optimize for the best performance of the worst case

In most applications, we care about performance in the worst case:

• you want your bike/car/train/plane not to crash

• you want to avoid serious illness

• you buy insurance for costly rare events

• you prioritize products working over their features

User experience is most sensitive to performance in the worst cases

People tend to remember the worst parts of an experience and base a valuation more heavily on that:

• the worst dishes at a meal
• reliability of a car
• annoyances in computer UX (e.g. Mac vs. Windows)
• everything about flying

A core tenet of machine learning

Learn expressive models

Understanding Datasets

supervised learning: discriminative model
generative model: sampling from the data
unsupervised learning: inferred structure

often, we want to transform the data into features as a first step

Feature Extraction

feature extraction: start with a linear model

PCA: rotate to a basis which maximizes the variance along principal directions

v = W(x - \mu)
with sample x, feature means \mu, and orthogonal transform W
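As a concrete sketch (NumPy; the function and variable names here are my own, not from the talk), the linear feature extraction v = W(x - μ) with W built from the principal directions might look like:

```python
import numpy as np

def pca_features(X, k):
    """Project samples onto the top-k principal directions.

    X: (n_samples, n_features) array of samples x.
    Returns the features v = W(x - mu) for each sample, the
    orthogonal transform W (rows are principal directions),
    and the feature means mu.
    """
    mu = X.mean(axis=0)                    # feature means
    C = np.cov(X - mu, rowvar=False)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :k].T          # top-k directions as rows of W
    return (X - mu) @ W.T, W, mu
```

By construction, the variance of the projected features is largest along the first principal direction.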

can be ineffective for nonlinear manifolds

tools for nonlinear manifolds: manifold learning (e.g. isomap, LLE), autoencoders

Neural Networks and Autoencoders

autoencoder: a model M: X → X that can reconstruct its inputs, with a constraint on the intermediate features (the encoded representation)

encoder and decoder: (stacks of) neural network layers

fully connected layer: x → f(Wx + b)
with input x, weight matrix W, bias vector b, and activation function f

Autoencoders

simplest autoencoder: encoder and decoder are single layers with tied weights
encoder: y = f(Wx + b)
full network: x → W^T y + b' = W^T f(Wx + b) + b'

the network is trained to minimize the reconstruction error
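A minimal NumPy sketch of this tied-weight autoencoder (my own construction, not code from the talk), using a linear activation f and stochastic gradient descent on the squared reconstruction error:

```python
import numpy as np

def train_tied_autoencoder(X, k, lr=0.005, epochs=300, seed=0):
    """Single-layer autoencoder with tied weights and linear activation.

    encoder:  y    = W x + b
    decoder:  xhat = W.T y + c
    Trained by SGD on the squared reconstruction error |xhat - x|^2
    (the constant factor 2 in the gradients is absorbed into lr).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((k, d))
    b, c = np.zeros(k), np.zeros(d)
    for _ in range(epochs):
        for x in X:
            y = W @ x + b              # encode
            e = W.T @ y + c - x        # reconstruction error vector
            gW = np.outer(y, e) + np.outer(W @ e, x)   # W appears twice
            gb = W @ e
            W -= lr * gW
            b -= lr * gb
            c -= lr * e
    return W, b, c
```

With the linear activation the learned k-dimensional subspace coincides with the PCA subspace, illustrating the point made on the next slide.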

with a linear activation, the optimal solution is PCA (where the bias removes the sample mean)

autoencoders are powerful tools for nonlinear manifold learning

Building Expressive Models

models constructed from many simple transformation layers
challenges: effective learning algorithms and architectures, intelligent uses of data

How do we do machine learning?

discriminative models: Data → Model → Labels

extant → predicted, via a learned model

generative models: Data → Model → Data

extant → sampled

discriminative models learn the conditional probability of the labels, p(y|x): a map M: x → y
generative models learn the joint probability distribution p(x, y)

Discriminative Models

categorical discrimination ("cat"), regression, semantic labeling, style transfer

[slide shows the first page of "Deep Visual-Semantic Alignments for Generating Image Descriptions" by Andrej Karpathy and Li Fei-Fei (Stanford) as an example, with the aside "AIs will eventually replace us all"]

Generative Models

learn a probability distribution: how to sample from data

[plot: learned Jacobian vs. the Jacobian observed via sampling]

given points, learn the underlying distribution

physicists do this all the time

Hopfield Networks

early recurrent neural network (1982)

pairwise connections between nodes

\{s_i\}: binary states (-1, +1)
\{W_{ij}\}: connection strengths (couplings)

update rule: s_i = \begin{cases} +1 & \text{if } \sum_j W_{ij} s_j \ge \theta_i \\ -1 & \text{otherwise} \end{cases}
with thresholds \theta_i

energy function: E = -\frac{1}{2} \sum_{i,j} W_{ij} s_i s_j + \sum_i \theta_i s_i

under repeated updates, the network converges to a local minimum in the energy function

the Hopfield network energy function is similar to an Ising model:
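A minimal NumPy sketch of the update rule and energy function above (my own illustration; the Hebbian storage rule for the weights is an assumption, as the talk does not specify one):

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian weights W storing the given +/-1 patterns (zero diagonal)."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / len(P)
    np.fill_diagonal(W, 0.0)
    return W

def update_sweep(s, W, theta=0.0):
    """One asynchronous sweep: s_i -> +1 if sum_j W_ij s_j >= theta_i, else -1."""
    s = s.copy()
    for i in range(len(s)):
        s[i] = 1.0 if W[i] @ s >= theta else -1.0
    return s

def energy(s, W, theta=0.0):
    """E = -1/2 sum_ij W_ij s_i s_j + sum_i theta_i s_i"""
    return -0.5 * s @ W @ s + np.sum(theta * s)
```

A stored pattern acts as an attractor: starting from a corrupted copy, repeated sweeps lower the energy until the pattern is recovered.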

E = -\sum_{ij} J_{ij} s_i s_j - \mu \sum_i h_i s_i

Hopfield: all sites 'adjacent'

note that the probability of a given state, P(s) = \frac{e^{-E(s)}}{Z}, is dependent on the partition function:

Z = \sum_s e^{-E(s)}

why Ising models? simple models that embody the Hebbian concept: neurons that fire together, wire together
can be used to store "memories" in the network: attractor states that are local minima in the energy function

Stochastic Networks

while Hopfield networks are deterministic, Ising models are probabilistic - like generative models

\Delta E_i = E(i\ \text{off}) - E(i\ \text{on}): energy difference between the on/off states of unit i

a state is activated with probability given by the logistic function:

\Delta E_i = \ln p_{i\,\text{on}} - \ln p_{i\,\text{off}}

p_{i\,\text{on}} = \frac{1}{1 + \exp(-\Delta E_i)} = \sigma(\Delta E_i)

this type of network is a Boltzmann machine

Boltzmann Machines

p_{i\,\text{on}} = \frac{1}{1 + \exp(-\Delta E_i)} = \sigma(\Delta E_i)

update step: sample sites and set their states according to the Boltzmann distribution; repeat until thermal equilibrium is obtained
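The update step above can be sketched as a Gibbs sweep (NumPy; my own illustration, using 0/1 on/off units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(s, W, theta, rng):
    """One asynchronous Gibbs sweep over 0/1 units.

    For each unit i, Delta E_i = sum_j W_ij s_j - theta_i is the energy
    gap between its on and off states; the unit is set on with
    probability p_i,on = sigmoid(Delta E_i).
    """
    s = s.copy()
    for i in range(len(s)):
        dE = W[i] @ s - theta[i]
        s[i] = 1.0 if rng.random() < sigmoid(dE) else 0.0
    return s
```

Repeating the sweep until the state distribution stops changing approximates sampling from the Boltzmann distribution at thermal equilibrium.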

Structurally, we build a Boltzmann machine from visible (external) and hidden (internal) units

Q: how do we set the weights? (learning/training)

Training Generative Models

Training: the joint distribution of the model, p_model, should be adjusted towards the true data distribution, p_data

\frac{\partial \ln p}{\partial W_{ij}} = \langle p_{ij} \rangle_{\text{data}} - \langle p_{ij} \rangle_{\text{model}}
with p_{ij} the probability of i and j being on

Restricted Boltzmann Machines

training Boltzmann machines is challenging:

• because all states in the network are connected, sampling is extremely time-intensive (single weight updates must propagate through the entire network)
• current training algorithms become ineffective beyond small networks

one solution: restricted Boltzmann machines

only connections between visible and hidden states

RBM energy function:

E = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} W_{ij} v_i h_j

gradient: \frac{\partial \ln p}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}

because the visible states are conditionally independent given the hidden states (and vice versa), each group can be collectively sampled
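A sketch of how this gradient is used in practice (NumPy; my own illustration, using contrastive divergence with one Gibbs step, CD-1, the standard approximation in which the intractable model average is replaced by a one-step reconstruction):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    W: (n_visible, n_hidden) weights; a, b: visible/hidden biases.
    The intractable <v_i h_j>_model average is approximated by a
    single alternating Gibbs step starting from the data batch v0.
    """
    # positive phase: hidden units driven by the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one reconstruction step
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # gradient estimate: <v h>_data - <v h>_model (batch averages)
    n = len(v0)
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a = a + lr * (v0 - v1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```

Because the two phases only need conditionally independent layer-wise sampling, each update is cheap compared with sampling a fully connected Boltzmann machine.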

RBMs can be effectively trained and used to build expressive models (deep belief networks)

Challenges in Generative Models

• sampling is challenging: e.g. obtaining transitions between local minima (mixing)
• no reliable software packages for training and using RBMs and general Markov process models
• demonstrating a diversity of applications (competing with NNs)

my interests:
• improving sampling methods for generative models using physical systems
• developing useful software tools for experimenting with generative models
• exploring the connection between models and ML - are there good architectures arising from other types of models?

Machine learning is making an enormous impact

And physicists will play a major role in it

We even share the same problems

Summary

Machine Learning is just great:
• diverse applications
• exploding interest
• amazing opportunities
• deep roots in statistical physics
• many open questions

Physicists have the tools to make fundamental contributions to machine learning, both inside and beyond physics