Systematic Uncertainties and Domain Adaptation
Michael Kagan, SLAC
Hammers and Nails, July 2017

The Goal
ATLAS-CONF-2017-043

The Higgs Boson!
• Want to reject the "No Higgs" hypothesis, and hopefully measure the size of the signal
• Need accurate predictions for the signal and background
• And we need to know how wrong these predictions could be…

Sources of Systematic Uncertainty
ATLAS-CONF-2017-041
• Experimental – How well we understand the detector, and our ability to identify / measure properties of particles given that they are present
• Theoretical – How well we understand the underlying theoretical model of the physical processes
• Modeling uncertainties – How well we can constrain backgrounds in alternate data measurements / control regions

Error Propagation
• Random variables x = {x1, x2, …, xn} distributed according to p(x)
  – Assume we only know E[x] = µ and Cov[xi, xj] = Vij
• For some function f(x), the first-order approximation is

    f(x) \approx f(\mu) + \sum_{i=1}^{N} \left.\frac{\partial f}{\partial x_i}\right|_{x=\mu} (x_i - \mu_i)

• Then E[f(x)] \approx f(\mu), and

    \sigma_f^2 = E[(f(x) - f(\mu))^2] \approx \sum_{i,j=1}^{N} \left.\frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\right|_{x=\mu} V_{ij}

• Often, we only have samples of x
• But we know how x varies under some uncertainty
  – x → x ± Δx, where Δx is a 1 standard deviation change in x
• Then we often approximate the error on f(x) with
  – ±Δf(x) = f(x ± Δx) − f(x)
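This "vary and re-evaluate" approximation can be sketched in a few lines. The sample, the 1-sigma shift Δx, and the binning are all toy choices, not values from the talk; f fills a histogram, so the up/down variations give a band on the bin counts.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)   # nominal sample (toy)
dx = 0.1                                # assumed 1-sigma shift on x

def f(sample):
    """Observable here: histogram bin counts of the sample."""
    counts, _ = np.histogram(sample, bins=20, range=(-4.0, 4.0))
    return counts

nominal = f(x)
delta_up = f(x + dx) - nominal          # +Δf(x) per bin
delta_down = f(x - dx) - nominal        # −Δf(x) per bin
```

The pair (delta_up, delta_down) is exactly the band one would draw around the nominal histogram.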
• So we vary x and see how f(x) changes
  – Often histogram the distributions, showing ±Δf(x) as a band on the histogram bin counts

Hypothesis Testing
    \lambda(\mathcal{D}; \theta_0, \theta_1) = \prod_{x \in \mathcal{D}} \frac{p(x \mid \theta_0)}{p(x \mid \theta_1)}

• Likelihood p(x|θ) parameterized by θ
• Likelihood ratio used as the test statistic for hypothesis testing between the null hypothesis (θ0) and the alternative hypothesis (θ1)
• Parameterization θ = (µ, ν) includes parameters of interest (µ) and nuisance parameters (ν)
[For more details, see Glen Cowan's talk]

Nuisance Parameters and the Likelihood

    L(\mu) = \prod_{i=1}^{N} \frac{(\mu s_i + b_i)^{n_i}}{n_i!} \, e^{-(\mu s_i + b_i)}

• Comparing binned counts between observed data and prediction
  – ni is the data count in bin i
  – si is the predicted signal count in bin i
  – bi is the predicted background count in bin i
  – µ is the parameter of interest, the "signal strength"
• With nuisance parameters:

    L(\mu, \nu) = \prod_{i=1}^{N} \frac{(\mu s_i + \nu b_i)^{n_i}}{n_i!} \, e^{-(\mu s_i + \nu b_i)} \, N(\nu; 1, \sigma)

    L_{\text{profile}}(\mu) = \max_{\nu} L(\mu, \nu)

  – ν is the nuisance parameter; the uncertainty parameterized by ν follows some prior (here a normal distribution)
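A minimal numeric sketch of this profiled, constrained likelihood, working with the negative log-likelihood and a brute-force grid scan. The bin counts, the nuisance constraint width, and the scan ranges are all made-up toy numbers, not values from the talk.

```python
import numpy as np

s = np.array([5.0, 10.0, 4.0])     # predicted signal per bin (toy)
b = np.array([50.0, 40.0, 30.0])   # predicted background per bin (toy)
n = np.array([57.0, 52.0, 33.0])   # "observed" counts (toy)
sigma = 0.1                        # width of the N(nu; 1, sigma) constraint

def nll(mu, nu):
    """Negative log-likelihood, dropping mu/nu-independent constants."""
    lam = mu * s + nu * b
    poisson_part = np.sum(lam - n * np.log(lam))
    constraint = 0.5 * ((nu - 1.0) / sigma) ** 2   # Gaussian prior on nu
    return poisson_part + constraint

def profile_nll(mu, nus=np.linspace(0.7, 1.3, 601)):
    # maximizing L over nu is minimizing the negative log-likelihood
    return min(nll(mu, nu) for nu in nus)

# Scan mu; the minimum of the profiled NLL is the best-fit signal strength
mus = np.linspace(0.0, 3.0, 301)
mu_hat = mus[np.argmin([profile_nll(m) for m in mus])]
```

In practice the maximization over ν is done with a proper optimizer rather than a grid, but the grid makes the profiling step explicit.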
  – Frequently use the profile likelihood, maximizing over the nuisance parameter for each value of µ

Effect of Systematics
arXiv:1408.6865
• Systematics can have a big effect, so we want to mitigate them as much as possible

Simulation and Analysis at the LHC

How do we get these predictions?
[Feynman diagrams] + {higher order terms}

• The Standard Model tells us how to compute interaction rates and their distributions
  – But we can only compute an approximation
• Several different "generators" simulate the primary interactions and evolve them into multi-body states
• The dimensionality of the problem becomes enormous; we can't compute the full pdf
• Simulating these processes is done through Monte Carlo sampling

Generator Uncertainties
arXiv:1609.00607
• Different generators can sometimes give rather different predictions

Uncertainty in the Theory Model
• Our model of the true distributions has approximations in it
  – xt = g(zt; yt = c, νt)
  – g(…) is the generator of events, which can be conditioned on the class
  – zt are latent random numbers (random seeds)
  – νt are parameters
• We don't know the exact model, but we can quantify the uncertainty of the approximation
  – E.g. parameters have been constrained by other experiments
  – Theoretical physicists give bounds on parameters
  – Variations of the parameters νt can be made to get alternative true distributions
  – p(x) → p(x, νt) = p(x | νt) p(νt)

[Figure: simulated true distributions pt(x) of signal and background vs. x = mass]

Very Big Detectors: The ATLAS Experiment
~10^8 detector channels
Size: 46 m long, 25 m high, 25 m wide
Weight: 7000 tons
Data: ~300 MB / sec, ~3000 TB / year

From Theory to Experiment
[From K. Cranmer]
• Particles interact with custom detectors
  – At the LHC, detectors can have 10^8 sensors
• Develop algorithms to identify particles (electrons, muons, bottom quarks, …) from detector signals
• Develop a small set (~20-40) of engineered features that describe the interesting properties of the event…
• Train a classifier to separate signal and background

Getting it right… we need
• An accurate model of the underlying physics, including the properties of the proton, the primary interaction, and the "showering" process
• An accurate and detailed model of the detector
• Accurate models of particle interactions with materials
• Must tune parameters in both the underlying theory model and the particle interactions with matter
• And we need to correct any mismodeling as best we can → Calibrations / Residual corrections

Simulating the experiment
• Experiment simulation includes geometry, (mis)alignment, material properties, etc…
• Detector / particle-material interactions simulated with GEANT
ATL-PHYS-PUB-2015-050

ATL-PHYS-PUB-2016-015
• Uncertainty in our simulated electron energy measurements due to our uncertainty in the amount of material in the detector

Calibration / Modeling Corrections
• If we want to calibrate the modeling of a particle, find a pure sample of that particle and derive corrections to the simulation to match data
• Example – Z boson decays to 2 electrons that have a combined mass of ≈ 90 GeV
  – p(particle = electron | mee ≈ 90) ≈ 99%
  – So we can select a (nearly) pure electron sample by looking for electron pairs consistent with a Z boson
  – Use known properties of the Z boson to calibrate

• For samples of momenta pi from the simulation, a posteriori adjust the values
  – pi → s · pi · (1 + z)
  – s is a constant scaling parameter to match the mean of simulation and data
  – z ~ N(z; 0, σz) is a resolution smearing to match the variance of simulation and data
• We measure s ± Δs and σz ± Δz with finite precision → a family of distributions p(pi; s, σz)
  – s ~ N(s; s0, Δs),  z ~ N(z; 0, σz),  σz ~ N(σz; σz0, Δz)
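The scale-and-smear correction above can be sketched directly. The values of s and σz (and the toy "simulated" momentum spectrum) are hypothetical stand-ins; in practice they are fit so that the corrected simulation matches the data in mean and width.

```python
import numpy as np

rng = np.random.default_rng(1)
p_sim = rng.normal(45.0, 1.0, size=100_000)   # simulated momenta [GeV] (toy)

s, sigma_z = 1.02, 0.015                      # assumed fitted scale and smearing
z = rng.normal(0.0, sigma_z, size=p_sim.size) # per-event resolution smearing
p_corr = s * p_sim * (1.0 + z)                # corrected momenta

# Systematic variations: repeat the correction with s ± Δs and σz ± Δz
# to obtain the family of corrected distributions p(pi; s, σz).
```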
    m_{ee}^2 = (E_1 + E_2)^2 - (\vec{p}_1 + \vec{p}_2)^2

[Figure: mee distributions – true distribution with no detector; simulated distribution with detector; corrected simulation with s and σz variations; observed data counts]
• Correct the simulated mean and variance to match data
• Scale and smear the momenta, pi → s · pi · (1 + z), to get matching mass distributions

ATL-PHYS-PUB-2016-015
Moving Forward
• Our known unknowns, our systematic uncertainties, will diminish our sensitivity to new physics
• We want to reduce them where we can
  – Improve our models with new inputs from new measurements
• And diminish their impact as much as possible
  – Can we do this when training a classifier to separate signal from background?

Systematics in a Classifier: Standard Approach
[S. Hagebock]
• Train the classifier, evaluate it on the nominal sample
• Run systematically varied samples through the classifier and see how the score distribution changes
  – Systematics are only evaluated after training is complete
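The standard approach can be sketched with toy Gaussian data: train on the nominal sample only, then push a systematically shifted copy through the frozen classifier and compare score distributions. The data, the logistic-regression model, and the size of the shift are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_sig = rng.normal(+1.0, 1.0, size=(2000, 2))   # toy signal
X_bkg = rng.normal(-1.0, 1.0, size=(2000, 2))   # toy background
X = np.vstack([X_sig, X_bkg])
y = np.array([1] * 2000 + [0] * 2000)

clf = LogisticRegression().fit(X, y)            # trained on nominal only

shift = 0.2                                     # hypothetical 1-sigma systematic
scores_nom = clf.predict_proba(X_bkg)[:, 1]
scores_up = clf.predict_proba(X_bkg + shift)[:, 1]
# Histogram both score distributions; the bin-by-bin difference is the
# systematic band on the classifier output.
```

Note that nothing about the systematic entered the training; the classifier only sees the variation after the fact.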
• No notion of systematics in the training

Domain Adaptation

Typical Binary Classification Problem
• Training
  – Data: {xi, yi}, i = 1, …, N, with xi in R^d, yi in {0, 1}, and (xi, yi) ~ p(x, y)
  – Learn a classifier h(x): model p(y|x)
• Testing / Evaluation
  – Data: {xi}, with xi ~ p(x); use h(x) to label them

Domain Adaptation
[N. Courty, https://mathematical-coffees.github.io/mc01-ot/]
[Figure: source domain (+ labels) vs. target domain (no labels!)]

Training on source data only won't generalize well!

Domain Adaptation
• Training
  – Data: S = {(xi^s, yi^s)}, i = 1, …, N, with (xi^s, yi^s) ~ ps(x, y)
• Testing / Evaluation
  – Data: Ut = {xi^t}, with xi^t ~ pt(x)
• Distributions of the source and target domains are not the same: ps(x, y) ≠ pt(x, y)
• Setting
  – Perform the same classification task in both domains
  – Distributions of target and source are closely related
• How can we learn a classifier from Source data that works well on Target data?
  – Depends on what we know about the problem!
  – We assume the tasks are related!
  – Do we know anything about what has changed between domains?
  – We can use the Target data to help in the training process!
• Theoretical bounds [Ben-David et al., 2010] – the error of a classifier in the target domain is upper bounded by the sum of three terms:
  – Error of the classifier in the source domain
  – Divergence (or "distance") between the feature distributions ps(x) and pt(x)
  – How closely related the classification tasks are
• Several methods address domain adaptation problems, differing depending on the details of the problem
Approaches
• Instance weighting
  – Often in the Covariate Shift setting
• Self-labeling approaches
  – Iteratively predict target labels from source and learn
  – Not going to discuss today
• Feature representation approaches
  – Learn a feature representation f(x) that has the same distribution for target and source data
  – Adversarial domain adaptation takes this approach!

Covariate Shift
• General situation:
    ps(x, y) ≠ pt(x, y)

• In this setting, assume:
  – Class probabilities conditional on the features are the same: pt(y|x) = ps(y|x) = p(y|x)
  – Marginal distributions are different: ps(x) ≠ pt(x)

[M. Sugiyama, MLSS 2012]

• Just estimate p(y|x) directly from the source?
  – Model misspecification: if the classifier model f(x; θ) cannot model the true decision boundary for any θ, classifier performance will differ between source and target
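The misspecification point can be made concrete with a toy regression: the true relation is quadratic but the model is linear, so the best linear fit under ps(x) is no longer good under a shifted pt(x), even though p(y|x) itself is unchanged. All data and ranges here are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
f_true = lambda x: x**2                      # shared relation (plays p(y|x))

x_s = rng.uniform(0.0, 1.0, size=2000)       # source inputs
x_t = rng.uniform(1.0, 2.0, size=2000)       # target inputs (shifted marginal)

# Misspecified model: linear, fit on source only
lin = LinearRegression().fit(x_s.reshape(-1, 1), f_true(x_s))

def mse(x):
    return np.mean((lin.predict(x.reshape(-1, 1)) - f_true(x)) ** 2)
# mse(x_t) is far larger than mse(x_s): the source-optimal linear fit
# is not the target-optimal one under covariate shift.
```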
• Training: minimize the average loss l(x, y, θ) when learning with source data

    \frac{1}{N_s} \sum_{i=1}^{N_s} l(x_i^s, y_i^s, \theta) \xrightarrow{N_s \to \infty} \int l(x, y, \theta) \, p_s(x, y) \, dx \, dy

• But we want to minimize the average loss on target data:

    \int l(x, y, \theta) \, p_t(x, y) \, dx \, dy

• Can use the density ratio

    \frac{p_t(x, y)}{p_s(x, y)} = \frac{p(y \mid x) \, p_t(x)}{p(y \mid x) \, p_s(x)} = \frac{p_t(x)}{p_s(x)}

• Then:

    \frac{1}{N_s} \sum_{i=1}^{N_s} \frac{p_t(x_i^s)}{p_s(x_i^s)} \, l(x_i^s, y_i^s, \theta) \xrightarrow{N_s \to \infty} \int \frac{p_t(x)}{p_s(x)} \, l(x, y, \theta) \, p_s(x, y) \, dx \, dy = \int l(x, y, \theta) \, p_t(x, y) \, dx \, dy
• How do we estimate pt(x) / ps(x)?
• Several methods exist; e.g., train a classifier h(x) to tell the difference between target and source data. The optimal classifier gives

    h^*(x) = \frac{p_t(x)}{p_s(x) + p_t(x)} = \frac{1}{1 + \frac{p_s(x)}{p_t(x)}}

  so that pt(x) / ps(x) = h*(x) / (1 − h*(x))
• Difficulties with this approach
  – We don't know the ratio exactly; we can only estimate it from samples
  – Weighting causes the variance of the training data to increase
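The classifier-based ratio estimate can be sketched end-to-end on toy 1-D Gaussians: train h(x) to separate target (label 1) from source (label 0), then weight each source event by h(x) / (1 − h(x)). The data, the logistic model, and the normalization choice are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
x_s = rng.normal(0.0, 1.0, size=5000)    # source samples
x_t = rng.normal(0.5, 1.0, size=5000)    # target samples (shifted marginal)

# Train h(x) to distinguish target (1) from source (0)
X = np.concatenate([x_s, x_t]).reshape(-1, 1)
d = np.array([0] * len(x_s) + [1] * len(x_t))
h = LogisticRegression().fit(X, d)

# Estimated density ratio pt(x)/ps(x) per source event
p = h.predict_proba(x_s.reshape(-1, 1))[:, 1]
w = p / (1.0 - p)
w *= len(w) / w.sum()                    # keep the effective sample size

# Weighted source statistics now approximate target statistics, e.g. the
# weighted mean of x_s should move toward the target mean of 0.5.
```

These weights are then plugged into the weighted loss of the previous slide; note that large weights in sparsely populated regions are exactly where the variance increase bites.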