
Systematic Uncertainties and Domain Adaptation

Michael Kagan SLAC

Hammers and Nails, July 2017

The Goal [ATLAS-CONF-2017-043]

The Higgs Boson!

• Want to reject the "No Higgs" hypothesis
• Hopefully measure the size of the signal
• Need accurate predictions for the signal and background
• And we need to know how wrong these predictions could be…

Sources of Systematic Uncertainty [ATLAS-CONF-2017-041]

• Experimental – How well we understand the detector and our ability to identify / measure properties of particles given that they are present
• Theoretical – How well we understand the underlying theoretical model of the physical processes

• Modeling uncertainties – How well can I constrain backgrounds in alternate data measurements / control regions

Error Propagation

• Random variables x={x1, x2, … , xn} distributed according to p(x)

– Assume we only know E[x] = µ and Cov[xi, xj] = Vij

• For some f(x), first order approximation:

  f(x) \approx f(\mu) + \sum_{i=1}^{N} \left.\frac{\partial f}{\partial x_i}\right|_{x=\mu} (x_i - \mu_i)

• Then E[f(x)] ≈ f(µ) and

  \sigma_f^2 = E[(f(x) - f(\mu))^2] \approx \sum_{i,j=1}^{N} \left.\frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\right|_{x=\mu} V_{ij}

• Often, we only have samples of x

• But we know how x varies under some uncertainty – x → x ± Δx – Where Δx is a 1 standard deviation change in x

• Then often approximate the error on f(x) with – ±Δf(x) = f(x ± Δx ) – f(x)

• So we vary x and see how f(x) changes – Often histogram the distributions, showing ±Δf(x) as a band on the histogram bin counts
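To make the variation recipe concrete, here is a minimal numpy sketch (not from the slides; the function f and the size of the shift Δx are invented for illustration): vary x by ±Δx, push both variations through f, and use the shifted histograms as a band around the nominal bin counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: samples of x and an assumed +/- 1 sigma systematic shift of x
x = rng.normal(loc=100.0, scale=10.0, size=10_000)   # nominal sample
dx = 2.0                                             # assumed 1-sigma shift on x

def f(x):
    """Some derived quantity built from x (toy choice)."""
    return np.sqrt(x) + 0.01 * x**2

# Vary x up and down and see how f(x) changes
f_nom  = f(x)
f_up   = f(x + dx)
f_down = f(x - dx)

# Histogram the three distributions with common binning;
# the up/down bin counts define the systematic band around the nominal counts
bins = np.linspace(f_nom.min(), f_nom.max(), 40)
n_nom, _  = np.histogram(f_nom, bins=bins)
n_up, _   = np.histogram(f_up, bins=bins)
n_down, _ = np.histogram(f_down, bins=bins)

band_lo = np.minimum(n_up, n_down)
band_hi = np.maximum(n_up, n_down)
print("nominal counts:", n_nom[:5])
print("systematic band:", list(zip(band_lo[:5], band_hi[:5])))
```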

Hypothesis Testing

\Lambda(\mathcal{D}; \theta_0, \theta_1) = \prod_{x \in \mathcal{D}} \frac{p(x \mid \theta_0)}{p(x \mid \theta_1)}

• Likelihood p(x|θ) parameterized by θ

• Likelihood ratio used as test statistic for

hypothesis testing between null hypothesis (θ0) and alternative hypothesis (θ1)

• Parameterization θ=(µ, ν) includes parameters of interest (µ) and nuisance parameters (ν)

[For more details, see Glen Cowan's talk]

Nuisance Parameters and the Likelihood

L(\mu) = \prod_{i=1}^{N} \frac{(\mu s_i + b_i)^{n_i}}{n_i!} \, e^{-(\mu s_i + b_i)}

• Comparing binned counts between observed data and prediction

– ni is data counts in bin i

– si is predicted signal (diamond) counts in bin i

– bi is predicted background (rocks) counts in bin i
– µ is the parameter of interest, the "signal strength"

[For more details, see Glen Cowan's talk]

Nuisance Parameters and the Likelihood

L(\mu, \nu) = \prod_{i=1}^{N} \frac{(\mu s_i + \nu b_i)^{n_i}}{n_i!} \, e^{-(\mu s_i + \nu b_i)} \; N(\nu; 1, \sigma)

L_{\mathrm{profile}}(\mu) = \max_{\nu} L(\mu, \nu)

• With nuisance parameters – ν is the nuisance parameter, where the uncertainty parameterized by ν follows some prior (here a normal distribution)

– Frequently use the profile likelihood, maximizing over the nuisance parameters for each value of µ

[For more details, see Glen Cowan's talk]
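As an illustration of profiling (a toy sketch, not the ATLAS analysis; all numbers are invented), the following computes the binned Poisson likelihood L(µ, ν) with a Gaussian-constrained nuisance ν and maximizes over ν for each value of µ:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

# Toy binned analysis (numbers are illustrative only)
s = np.array([5.0, 10.0, 4.0])      # predicted signal per bin
b = np.array([50.0, 45.0, 30.0])    # predicted background per bin
n = np.array([57, 58, 33])          # observed counts
sigma_nu = 0.10                     # 10% uncertainty on the background normalization

def nll(mu, nu):
    """Negative log of L(mu, nu) = prod_i Pois(n_i; mu*s_i + nu*b_i) * N(nu; 1, sigma_nu)."""
    lam = mu * s + nu * b
    return -(poisson.logpmf(n, lam).sum() + norm.logpdf(nu, loc=1.0, scale=sigma_nu))

def profiled_nll(mu):
    """Minimize over the nuisance parameter nu for a fixed mu."""
    res = minimize_scalar(lambda nu: nll(mu, nu), bounds=(0.5, 1.5), method="bounded")
    return res.fun

# Scan mu and profile out nu at each point
for mu in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"mu = {mu:.1f}  profiled -log L = {profiled_nll(mu):.3f}")
```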

Effect of Systematics

arXiv:1408.6865

• Systematics can have a big effect; we want to mitigate them as much as possible

[For more details, see Glen Cowan's talk]

Simulation and Analysis at the LHC

How do we get these predictions?

+ {higher order terms}

• Standard Model tells us how to compute interaction rates and their distributions – But we can only compute an approximation

• Several different "generators" simulate the primary interactions and evolve them into multi-body states

• The dimensionality of the problem becomes enormous, can’t compute the full pdf

• Simulating these processes is done through Monte Carlo sampling
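Real event generators are far more involved, but the underlying Monte Carlo idea can be shown with a toy accept-reject sampler (the "density" below is invented and simply stands in for a differential cross section):

```python
import numpy as np

rng = np.random.default_rng(1)

def target_density(x):
    """Toy unnormalized 'matrix element': a falling spectrum with a resonance bump."""
    return np.exp(-x / 20.0) + 0.5 * np.exp(-0.5 * ((x - 30.0) / 2.0) ** 2)

def accept_reject(n_events, x_max=100.0):
    """Sample x in [0, x_max] from target_density via accept-reject."""
    samples = []
    f_max = target_density(np.linspace(0.0, x_max, 10_000)).max()
    while len(samples) < n_events:
        x = rng.uniform(0.0, x_max)
        if rng.uniform(0.0, f_max) < target_density(x):
            samples.append(x)
    return np.array(samples)

events = accept_reject(5_000)
print("generated", len(events), "events, mean x =", events.mean().round(2))
```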

Generator Uncertainties

arXiv:1609.00607

• Different generators can sometimes give rather different predictions

Uncertainty in the Theory Model

• The model of the true distributions has approximations in it

– x_t = g(z_t ; y_t = c, ν_t)
– g(…) is the generator of events, which can be conditioned on the class

– z_t are latent random numbers (random seeds)
– ν_t are parameters

• Don't know the exact model, but can quantify the uncertainty of the approximation
  – E.g. parameters have been constrained by other experiments
  – Theoretical physicists give bounds on parameters
  – Variations of the parameters ν_t can be done to get alternative true distributions
  – p(x) → p(x, ν_t) = p(x | ν_t) p(ν_t)

[Figure: simulated true distributions p_t(x) of signal and background vs. x = mass]

Very Big Detectors: The ATLAS Experiment

~10^8 detector channels

Size: 46 m long, 25 m high, 25 m wide; Weight: 7000 tons
Data: ~300 MB / sec, ~3000 TB / year

From Theory to Experiment [From K. Cranmer]

• Particles interact with custom detectors (e.g. muons, electrons, bottom quarks)
  – At the LHC, detectors can have ~10^8 sensors
• Develop algorithms to identify particles from detector signals
• Develop a small set (~20-40) of engineered features that describe the interesting properties of the event…

• Train a classifier to separate signal and background

Getting it right… we need

• An accurate model of the underlying physics, including the properties of the proton, the primary interaction, and the "showering" process
• An accurate and detailed model of the detector
• Accurate models of particle interactions with materials
• Must tune parameters in both the underlying theory model and the particle interactions with matter
• And we need to correct any mismodeling as best we can → Calibrations / Residual corrections

Simulating the experiment

• Experiment simulation includes geometry, (mis)alignment, material properties, etc…
• Detector / particle-material interactions simulated with GEANT

ATL-PHYS-PUB-2015-050

ATL-PHYS-PUB-2016-015

Uncertainty in our simulated electron measurements due to our uncertainty in the amount of material in the detector

Calibration / Modeling Corrections

• If we want to calibrate the modeling of a particle, find a pure sample of that particle and derive corrections to the simulation to match data

• Example – Z boson decays to 2 electrons that have a combined mass of ≈90 GeV

– p(particle=electron | mee≈90) ≈ 99%

– So we can select (nearly) pure electron sample by looking for electron pairs consistent with Z boson

– Use known properties of the Z boson to calibrate

Calibration / Modeling Corrections

• For samples x from the simulation, a posteriori adjust the value

– pi → s*pi*(1+z) – s=constant scaling parameter to match the mean of simulation and data

– z~N(z; 0, σz) resolution smearing, match variance in simulation and data

• Measure s±Δs and σz±Δz with finite precision → family of distributions p(pi, s, σz)

– s ~ N(s; s0, Δs), z ~ N(z; 0, σz), σz ~ N(σz; σz0, Δz)
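A small numpy sketch of this scale-and-smear correction and its ±1σ variations (the momenta and parameter values are toy numbers, not measured ATLAS corrections):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy simulated electron momenta (GeV) and measured correction parameters
p_sim = rng.normal(loc=45.0, scale=1.0, size=100_000)
s0, delta_s       = 1.02, 0.005     # scale factor and its uncertainty (illustrative)
sigma_z0, delta_z = 0.015, 0.004    # smearing width and its uncertainty (illustrative)

def correct(p, s, sigma_z):
    """Apply p_i -> s * p_i * (1 + z), with z ~ N(0, sigma_z)."""
    z = rng.normal(0.0, sigma_z, size=p.shape)
    return s * p * (1.0 + z)

# Nominal correction plus +/- 1 sigma variations of s and sigma_z
p_nominal = correct(p_sim, s0, sigma_z0)
variations = {
    "s_up":       correct(p_sim, s0 + delta_s, sigma_z0),
    "s_down":     correct(p_sim, s0 - delta_s, sigma_z0),
    "smear_up":   correct(p_sim, s0, sigma_z0 + delta_z),
    "smear_down": correct(p_sim, s0, max(sigma_z0 - delta_z, 0.0)),
}

print(f"nominal: mean = {p_nominal.mean():.3f}, std = {p_nominal.std():.3f}")
for name, p in variations.items():
    print(f"{name:10s}: mean = {p.mean():.3f}, std = {p.std():.3f}")
```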

m_{ee}^2 = (E_1 + E_2)^2 - (\vec{p}_1 + \vec{p}_2)^2

[Figure: counts vs. m_ee for the true distribution (no detector), the simulated distribution with detector, the corrected simulation with s and σz variations, and the observed data]

Calibration / Modeling Corrections

• Correct simulated mean and variance to match data

Scale and smear momentum pi → s*pi*(1+z) to get matching mass distributions

m_{ee}^2 = (E_1 + E_2)^2 - (\vec{p}_1 + \vec{p}_2)^2

ATL-PHYS-PUB-2016-015

Moving Forward

• Our known unknowns, our systematic uncertainties, will diminish our sensitivity to new physics

• We want to reduce them where we can – Improve our models with new inputs from new measurements

• And diminish their impact as much as possible – Can we do this when training a classifier to separate signal from background?

Systematics in a Classifier: Standard Approach

[S. Hagebock]

• Train classifier, evaluate on nominal sample • Run systematically varied samples through classifier and see how it changes score distribution – Only evaluate systematics after training complete
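A sketch of this standard workflow using scikit-learn (toy features and an invented ±1σ shift): the classifier is trained on the nominal sample only, and the systematically shifted sample is only pushed through it afterwards to see how the score distribution moves.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 20_000

# Toy nominal training sample: background (y=0) and signal (y=1)
x_bkg = rng.normal(0.0, 1.0, size=(n, 2))
x_sig = rng.normal(1.0, 1.0, size=(n, 2))
X = np.vstack([x_bkg, x_sig])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Train on the nominal sample only
clf = GradientBoostingClassifier(max_depth=3, n_estimators=100)
clf.fit(X, y)

# After training, run a systematically shifted background sample through the classifier
x_bkg_shifted = x_bkg + np.array([0.2, 0.0])   # illustrative +1 sigma shift of one feature
score_nom   = clf.predict_proba(x_bkg)[:, 1]
score_shift = clf.predict_proba(x_bkg_shifted)[:, 1]

bins = np.linspace(0, 1, 21)
n_nom, _   = np.histogram(score_nom, bins=bins)
n_shift, _ = np.histogram(score_shift, bins=bins)
print("relative change per score bin:", np.round((n_shift - n_nom) / np.maximum(n_nom, 1), 2))
```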

• No notion of systematics in training

Domain Adaptation

Typical Binary Classification Problem

• Training
  – Data: {x_i, y_i}_{i=1}^{N}, with x_i in R^d, y_i in {0, 1}, and (x_i, y_i) ~ p(x, y)
  – Learn classifier h(x): model p(y|x)
• Testing / Evaluation
  – Data: {x_i}, with x_i ~ p(x)
  – Label {x_i} with the classifier

Domain Adaptation [N. Courty, https://mathematical-coffees.github.io/mc01-ot/]

[Figure: source domain (with labels) vs. target domain (no labels)]

Domain Adaptation [N. Courty, https://mathematical-coffees.github.io/mc01-ot/]

Training on source data only won't generalize well!

Domain Adaptation

• Training
  – Data: S = {x_i^s, y_i^s}_{i=1}^{N}, with (x_i^s, y_i^s) ~ p_s(x, y)
• Testing / Evaluation
  – Data: U_t = {x_i^t}, with x_i^t ~ p_t(x)

• Distribution of source and target domains not the same,

ps(x, y) ≠ pt(x, y)

• Setting
  – Perform the same classification in both domains
  – Distributions of target and source are closely related

Domain Adaptation

• How can we learn a classifier from Source data and have a classifier that works well on Target data?
• Depends on what we know about the problem!
  – We assume the tasks are related!
  – Do we know anything about what has changed between domains?
  – We can use the Target data to help in the training process!

Theoretical bounds [Ben-David et al., 2010]

• Error of a classifier in the target domain is upper bounded by the sum of three terms:
  – Error of the classifier in the source domain
  – Divergence (or "distance") between the feature distributions ps(x) and pt(x)
  – How closely related the classification tasks are

• Several methods to address domain adaptation problems
  – Differ depending on the details of the problem

Approaches

• Instance weighting
  – Often in the Covariate Shift setting
• Self-labeling approaches
  – Iteratively predict target labels from source and learn
  – Not going to discuss today
• Feature representation approaches
  – Learn a feature representation f(x) that has the same distribution for target and source data
  – Adversarial domain adaptation takes this approach!

Covariate Shift

• General situation:

ps(x, y) ≠ pt(x, y)

• In this setting assume: – Class probabilities conditional on features the same:

pt(y|x) = ps(y|x) = p(y|x)

– Marginal Distributions different:

ps(x)≠pt(x)

[M. Sugiyama, MLSS 2012]

• Just estimate p(y|x) directly from source?
  – Model misspecification: if the classifier model f(x; θ) cannot model the true decision boundary for any θ, classifier performance will differ between source and target

• Training: minimize the average loss l(x_i, y_i, θ) when learning with source data

  \frac{1}{N_s} \sum_{i=1}^{N_s} l(x_i^s, y_i^s, \theta) \;\xrightarrow{N_s \to \infty}\; \int l(x, y, \theta)\, p_s(x, y)\, dx\, dy

• But we want to minimize the average loss l(x, y, θ) on target data

  \int l(x, y, \theta)\, p_t(x, y)\, dx\, dy

• Can use the density ratio

  \frac{p_t(x, y)}{p_s(x, y)} = \frac{p(y \mid x)\, p_t(x)}{p(y \mid x)\, p_s(x)} = \frac{p_t(x)}{p_s(x)}

• Then:

  \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{p_t(x_i^s)}{p_s(x_i^s)}\, l(x_i^s, y_i^s, \theta) \;\xrightarrow{N_s \to \infty}\; \int \frac{p_t(x)}{p_s(x)}\, l(x, y, \theta)\, p_s(x, y)\, dx\, dy = \int l(x, y, \theta)\, p_t(x, y)\, dx\, dy

• How do we estimate p_t(x)/p_s(x)?
• Several methods, e.g. train a classifier h(x) to tell the difference between target and source data; the optimal classifier gives:

  h^*(x) = \frac{p_t(x)}{p_s(x) + p_t(x)} = \frac{1}{1 + p_s(x)/p_t(x)}

• Difficulties with this approach
  – We don't know the ratio exactly, we can only estimate it from samples
  – Weighting causes the variance of the training data to increase

  \min_{\theta} \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{p_t(x_i^s)}{p_s(x_i^s)}\, l(x_i^s, y_i^s, \theta)
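A sketch of the classifier-based density-ratio estimate and the resulting weighted training (toy one-dimensional data; logistic regression plays the role of h(x)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 10_000

# Toy covariate shift: same p(y|x), different marginals p_s(x) vs p_t(x)
x_source = rng.normal(0.0, 1.0, size=(n, 1))
x_target = rng.normal(0.5, 1.2, size=(n, 1))
y_source = (x_source[:, 0] + rng.normal(0, 0.5, n) > 0).astype(int)

# 1) Train h(x) to separate target (label 1) from source (label 0)
h = LogisticRegression()
h.fit(np.vstack([x_source, x_target]), np.concatenate([np.zeros(n), np.ones(n)]))

# 2) Density ratio from the optimal classifier: p_t/p_s = h / (1 - h)
#    (equal sample sizes here, so no prior correction is needed)
h_s = h.predict_proba(x_source)[:, 1]
weights = h_s / (1.0 - h_s)

# 3) Minimize the weighted source loss as a proxy for the target loss
clf = LogisticRegression()
clf.fit(x_source, y_source, sample_weight=weights)

print("importance weights: mean = %.2f, max = %.1f" % (weights.mean(), weights.max()))
print("weighted classifier coef:", clf.coef_.ravel())
```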

Feature Representation

• Setting: ps(x,y) ≠ pt(x,y)

• Don’t know if ps(y|x)≠pt(y|x)

[Figure: source distribution ps(x) vs. target distribution pt(x)]

• Do our best, at least try to make feature distributions match


• Find representation of features through mapping g(x) = z

[Figure: mapping z = g(x) applied to the source ps(x) and target pt(x) distributions]

• Want similar p(z) distributions of source and target features after mapping:

  ps(z) = pt(z), where z = g(x)

• A mapping that reduces the difference between the source and target distributions reduces the divergence term of the error bound [Ben-David et al., 2006]… Still need to find a classifier that performs well on the source task with this representation

Feature Representation

[Figure: p(z) for source and target, with a decision boundary in z]

• Build a classifier h(z) that classifies the source well under the new representation – If the target task is not far from source task, then this classifier should do well on target task

• Many ways to do this; focus on Adversarial Networks

Adversarial Learning

• Generative Adversarial Network (GAN)
  – Generator (G) is a neural network that produces images from random numbers (z)

– Discriminator (D) tries to classify images as real or fake data

– When training • Generator tries to fool discriminator, and is penalized when discriminator succeeds

[Diagram: random numbers → Generator → fake image; fake and real images → Discriminator → real vs. fake] arXiv:1406.2661

Adversarial Learning

• Domain Adversarial Neural Network (DANN)
  – Classifier for the signal vs. background task

– Discriminator uses hidden representation to tell the difference between Source and Target data

– When training: the classifier tries to fool the discriminator, and is penalized when the discriminator succeeds
  – Often a single neural network

arXiv:1505.07818

[Diagram: source and target data → Encoder → Classifier (signal vs. background) and Discriminator (source vs. target domain)]

Domain Adversarial Neural Networks
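A compact PyTorch sketch of the DANN idea using a gradient-reversal layer (the architecture, data, and λ value are toy choices, not the configuration of arXiv:1505.07818): the encoder is trained to classify labeled source events while the reversed gradients from the domain discriminator push the source and target feature distributions together.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
classifier = nn.Linear(16, 1)       # signal vs background head
discriminator = nn.Linear(16, 1)    # source vs target domain head
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters())
                       + list(discriminator.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Toy batches: labeled source events and unlabeled target events (10 features each)
x_source, y_source = torch.randn(256, 10), torch.randint(0, 2, (256, 1)).float()
x_target = torch.randn(256, 10) + 0.3

for step in range(100):
    z_s, z_t = encoder(x_source), encoder(x_target)

    # Classification loss on labeled source data only
    loss_cls = bce(classifier(z_s), y_source)

    # Domain loss: the discriminator sees reversed gradients, so the encoder
    # is pushed to make source and target features indistinguishable
    z_all = torch.cat([GradReverse.apply(z_s, 1.0), GradReverse.apply(z_t, 1.0)])
    d_label = torch.cat([torch.zeros(256, 1), torch.ones(256, 1)])
    loss_dom = bce(discriminator(z_all), d_label)

    opt.zero_grad()
    (loss_cls + loss_dom).backward()
    opt.step()
```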

Systematic Uncertainties and Domain Adaptation

Problem Statement

• Training
  – Simulated data: S = {x_i^s, y_i^s, z_i^s}_{i=1}^{N}, with (x_i^s, y_i^s, z_i^s) ~ p(x, y, z)
• Testing / Evaluation
  – Data: {x_i^t}, with x_i^t ~ p_t(x)

• Family of data generating processes p(X, Y, Z)
  – X is the input data
  – Y is the target (e.g. class target for the classifier)
  – Z is the value of the nuisance parameter

• We want to learn the mapping f(x; θ_f) to make predictions of y


• Believe there is some z that would make simulated distribution match observed data

– Often have the situation that we believe ps(y) = pt(y), i.e. priors (fraction of backgrounds) are not changing, only their feature distributions… but won’t use that information here • So we want to be robust to changes in z – We want f(…) to be a “pivotal quantity”, such that

p(f = s | z) = p(f = s | z′)

Adversarial Networks [G. Louppe, M. K., K. Cranmer, arXiv:1611.01046]

• Classifier built to solve the problem at hand

• Adversary is built to predict the value of Z given the classifier output
  – If the adversary can predict Z, then there is information in f(…) about the nuisance parameter, i.e. f(…) and Z are correlated in some way

• Loss encodes performance of both classifier and adversary – Classifier is penalized when adversary does well at predicting Z

• Hyper-parameter λ controls the trade-off between the performance of f(…) and its independence w.r.t. the nuisance parameter
  – Large λ enforces f(…) to be pivotal
  – Small λ allows f(…) to be more optimal
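A minimal PyTorch sketch of the combined objective and the alternating updates (a simplified variant of the training procedure in arXiv:1611.01046; here Z is categorical, so the adversary is a small classifier rather than a mixture density network):

```python
import torch
import torch.nn as nn

# f: classifier for y given x; r: adversary that predicts the nuisance Z from f's output
f = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
r = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2))   # Z in {0, 1}
opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
opt_r = torch.optim.Adam(r.parameters(), lr=1e-3)
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
lam = 10.0   # large lambda -> push f towards being pivotal

# Toy training data: features x, labels y, nuisance values z (all illustrative)
x = torch.randn(512, 5)
y = torch.randint(0, 2, (512, 1)).float()
z = torch.randint(0, 2, (512,))

for step in range(200):
    # 1) Update the adversary: learn to predict Z from the classifier output
    with torch.no_grad():
        s = f(x)
    loss_r = ce(r(s), z)
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()

    # 2) Update the classifier: good classification, but penalized when the adversary succeeds
    s = f(x)
    loss_f = bce(s, y) - lam * ce(r(s), z)
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```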

Adversary Architecture

• The adversary is designed to model the posterior probability of Z given f(…): p(z | f(x) = s)
• If Z is categorical, the posterior can be modeled with a standard classifier – logistic / softmax output with cross-entropy loss

• If Z is continuous, posterior can be modeled with a mixture density network

• No constraints on the prior p(z)

Mixture Density Network

• Typically for regression, f(…) predicts only the value of Y, and is compared with the squared error loss L = Σ(f − y)²

• Instead, have f(xi) output two values, µ(xi) and σ(xi) – Compare with prediction using Gaussian G(yi | µ(xi), σ(xi) )

• Since we don't know the actual distribution of p(Z | f(x) = s), model it using a sum of Gaussians as a more flexible family of distributions than a single Gaussian
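A sketch of a mixture density network head in PyTorch (sizes and the number of components are arbitrary choices): the network maps the classifier output s to mixture weights, means, and widths, and the loss is the negative log-likelihood of z under the resulting Gaussian mixture.

```python
import math
import torch
import torch.nn as nn

K = 3  # number of mixture components

class MDNHead(nn.Module):
    """Maps the classifier output s to the parameters of a K-component Gaussian mixture for z."""
    def __init__(self, n_in=1, n_hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.logits = nn.Linear(n_hidden, K)      # mixture weights (via softmax)
        self.mu = nn.Linear(n_hidden, K)          # component means
        self.log_sigma = nn.Linear(n_hidden, K)   # component widths (log for positivity)

    def forward(self, s):
        h = self.body(s)
        return self.logits(h), self.mu(h), self.log_sigma(h)

def mdn_nll(logits, mu, log_sigma, z):
    """Negative log-likelihood of z under the predicted Gaussian mixture."""
    z = z.unsqueeze(-1)  # (N,) -> (N, 1), broadcast against the K components
    log_w = torch.log_softmax(logits, dim=-1)
    log_gauss = (-0.5 * ((z - mu) / log_sigma.exp()) ** 2
                 - log_sigma - 0.5 * math.log(2 * math.pi))
    return -torch.logsumexp(log_w + log_gauss, dim=-1).mean()

# Toy usage: s plays the role of the classifier output, z the continuous nuisance value
mdn = MDNHead()
s, z = torch.rand(256, 1), torch.randn(256)
logits, mu, log_sigma = mdn(s)
print("MDN negative log-likelihood:", mdn_nll(logits, mu, log_sigma, z).item())
```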

Learning to Pivot: Toy Example [G. Louppe, M. K., K. Cranmer, arXiv:1611.01046]

• 2D example
• Without adversary (top): large variations in network output with the nuisance parameter

• With adversary (bottom): performance is independent!

Physics Example

• Discriminate between boosted W jets (Y=1) and QCD (Y=0)
  – Using data from Baldi et al, arXiv:1603.09349

• Nuisance parameter is the amount of pileup – Z=0 for no pileup – Z=1 for <µ>=50

• Will tune λ to optimize the significance of finding W jets [Baldi et al, arXiv:1603.09349]
  – Normalize samples so that we have 1000 QCD jets and 100 W jets

Learning to Pivot: Physics Example [G. Louppe, M. K., K. Cranmer, arXiv:1611.01046]

[Plot annotations: "Optimal tradeoff of performance vs. robustness" and "Non-adversarial training"]

• λ=0, Z=0
  – Standard training with no systematics during training, evaluate systematics after training

• λ=0 – Training samples include events with systematic variations, but no adversary used

• λ=10
  – Trading accuracy for robustness results in a net gain in terms of statistical significance [this is AMS, an estimate of significance; see Glen's talk]

Decorrelating Variables [arXiv:1703.03507]

• Same adversarial setup can decorrelate a classifier from a chosen variable (rather than a nuisance parameter)

• In this example, decorrelate classifier from jet mass, so as not to sculpt jet mass distribution with classifier cut

[See Daniel Whiteson's talk]

Weak Supervision

• Training
  – Simulated data: S = {x_i^s, y_i^s}_{i=1}^{N}, with (x_i^s, y_i^s) ~ p_s(x, y)
  – Unlabeled data: U_t = {x_i^t}, with x_i^t ~ p_t(x)
• Testing / Evaluation
  – Data: {x_i^t}, with x_i^t ~ p_t(x)

• In this setting, we believe we know that the simulation can accurately predict the class labels conditioned on an auxiliary variable w

– ps(y|w)=pt(y|w) where Cov(w, x)=0

• Can use known class fractions in bins of w as targets
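A sketch of one way to use class fractions as targets (toy data and fractions; a simplified variant of the weak supervision idea in arXiv:1702.00414): the loss compares the mean classifier output in each bin of w to the known signal fraction in that bin, with no per-event labels.

```python
import torch
import torch.nn as nn

# Classifier f(x) trained only from known class fractions in bins of an auxiliary variable w
f = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

# Toy data: three bins of w, each with a known signal fraction but no per-event labels
known_fractions = [0.2, 0.5, 0.8]
bins = []
for frac in known_fractions:
    n_sig = int(1000 * frac)
    x_sig = torch.randn(n_sig, 4) + 1.0          # signal-like events
    x_bkg = torch.randn(1000 - n_sig, 4)         # background-like events
    bins.append(torch.cat([x_sig, x_bkg]))

for epoch in range(200):
    loss = 0.0
    for x_bin, frac in zip(bins, known_fractions):
        # Compare the mean prediction in this w bin to the known class fraction
        loss = loss + (f(x_bin).mean() - frac) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

print("mean outputs per bin:", [round(f(b).mean().item(), 2) for b in bins])
```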

Weakly Supervised Training on Data

arXiv:1702.00414

• Train directly on data, and many analysis uncertainties can be avoided

Conclusion

• Systematics are our known unknowns that cause variations in our distributions, and are parameterized by nuisance parameters
• Domain adaptation techniques provide insights into how we might mitigate the effects of systematics
• Developed adversarial techniques for training a classifier to avoid being sensitive to systematics
  – Can also be used to induce independence of a classifier w.r.t. some variable

• But should we even use this within the setting of hypothesis testing with a profile likelihood test statistic?
  – If we know well how a systematic works, is it better to simply profile the systematic in a statistical analysis?
  – For systematics that we don't understand, e.g. generator uncertainties, this may be very useful!

Backup

Particle Detection at ATLAS

Particle Interactions with Material

• Particles from the primary interaction interact with the detector
  – This is how we can measure their energy and trajectory
  – We have to simulate this process… at the atomic level
• Bethe formula: mean energy loss per unit of material traversed
  – Losing energy in material is a stochastic process!
  – Use Monte Carlo techniques to compute the energy loss of a particle as it passes through a slice of material

Background Modeling with Data

• Sometimes, we may not trust the theoretical model of the background
• Find another sample with only background-like events and measure the background-like distribution directly

– Is new distribution exactly like our real background distribution? • If not, can we correct it? Can we quantify how it is different?

– Typically compare several background-like samples to get an idea of how the distribution may vary

[Figure: background p(x) vs. x = size, with the mean and standard uncertainty estimated from "rocks from other caves"]

Building a Background Model From Data

• Find signal-free samples with different features ("Our Cave")
  – Cov(v1, v2) = 0

• Change one feature (v2) to build the model, change the other feature (v1) to test

[Figure: p(x) for the background model built from high- and low-v2 samples at high v1, applied to low-v1 samples; uncertainty from the low-v1 measurements; high/low ratio shown vs. x]

Training Algorithm: Alternating SGD

Calibration / Modeling Corrections

• Given some electron vs. non-electron classifier, check the classifier's true positive rate with electrons coming from the Z boson