Systematic Uncertainties and Domain Adaptation
Michael Kagan, SLAC
Hammers and Nails, July 2017

The Goal [ATLAS-CONF-2017-043]
• The Higgs boson!
• We want to reject the "no Higgs" (background-only) hypothesis in favour of the signal-plus-background hypothesis, and hopefully measure the size of the signal.
• This requires accurate predictions for the signal and the background.
• And we need to know how wrong these predictions could be…

Sources of Systematic Uncertainty [ATLAS-CONF-2017-041]
• Experimental – how well we understand the detector, and our ability to identify and measure the properties of particles given that they are present.
• Theoretical – how well we understand the underlying theoretical model of the physical processes.
• Modeling uncertainties – how well we can constrain backgrounds in auxiliary data measurements / control regions.

Error Propagation
• Random variables x = {x1, x2, …, xn} are distributed according to p(x); assume we only know E[x] = µ and Cov[xi, xj] = Vij.
• For some function f(x), to first order

  f(x) \approx f(\mu) + \sum_{i=1}^{N} \frac{\partial f}{\partial x_i}\bigg|_{x=\mu} (x_i - \mu_i)

• Then E[f(x)] ≈ f(µ) and

  \sigma_f^2 = E\big[(f(x) - f(\mu))^2\big] \approx \sum_{i,j=1}^{N} \frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\bigg|_{x=\mu} V_{ij}

• Often we only have samples of x, but we know how x varies under some uncertainty: x → x ± Δx, where Δx is a one-standard-deviation change in x.
• The error on f(x) is then often approximated as ±Δf(x) = f(x ± Δx) − f(x).
• So we vary x and see how f(x) changes, often histogramming the distributions and showing ±Δf(x) as a band on the histogram bin counts (see the numerical sketch below).

Hypothesis Testing

  \lambda(\mathcal{D}; \theta_0, \theta_1) = \prod_{x \in \mathcal{D}} \frac{p(x \mid \theta_0)}{p(x \mid \theta_1)}

• The likelihood p(x|θ) is parameterized by θ.
• The likelihood ratio is used as the test statistic for hypothesis testing between the null hypothesis (θ0) and the alternative hypothesis (θ1).
• The parameterization θ = (µ, ν) includes parameters of interest (µ) and nuisance parameters (ν).
[For more details, see Glen Cowan's talk]

Nuisance Parameters and the Likelihood

  L(\mu) = \prod_{i=1}^{N} \frac{(\mu s_i + b_i)^{n_i}}{n_i!}\, e^{-(\mu s_i + b_i)}

• We compare binned counts between observed data and the prediction:
  – ni is the observed data count in bin i
  – si is the predicted signal ("diamonds") count in bin i
  – bi is the predicted background ("rocks") count in bin i
  – µ is the parameter of interest, the "signal strength"
• With a nuisance parameter:

  L(\mu, \nu) = \prod_{i=1}^{N} \frac{(\mu s_i + \nu b_i)^{n_i}}{n_i!}\, e^{-(\mu s_i + \nu b_i)}\, N(\nu; 1, \sigma)

  L_{\mathrm{profile}}(\mu) = \max_{\nu} L(\mu, \nu)

  – ν is the nuisance parameter; the uncertainty it parameterizes follows some prior (here a normal distribution).
  – We frequently use the profile likelihood, maximizing over the nuisance parameters for each value of µ (see the toy sketch below).
[For more details, see Glen Cowan's talk]

Effect of Systematics [arXiv:1408.6865]
• Systematics can have a big effect; we want to mitigate them as much as possible.
[For more details, see Glen Cowan's talk]
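As a concrete illustration of the error-propagation recipes above, here is a minimal numpy sketch (not from the talk; the function, names, and toy numbers are illustrative): the first-order propagation using a numerical gradient and a covariance matrix, and the simpler "vary by ±Δx and re-evaluate" band.

```python
import numpy as np

def linear_propagation(f, mu, cov, eps=1e-6):
    """First-order (Taylor) error propagation:
    sigma_f^2 ~ sum_ij (df/dx_i)(df/dx_j) V_ij, with the gradient
    evaluated numerically at x = mu."""
    mu = np.asarray(mu, dtype=float)
    grad = np.zeros_like(mu)
    for i in range(len(mu)):
        step = np.zeros_like(mu)
        step[i] = eps
        grad[i] = (f(mu + step) - f(mu - step)) / (2 * eps)
    return float(grad @ np.asarray(cov) @ grad)

def variation_band(f, x, dx):
    """The 'vary and re-evaluate' recipe: shift x by +/- one standard
    deviation and quote the change in f(x) as the uncertainty band."""
    x, dx = np.asarray(x, dtype=float), np.asarray(dx, dtype=float)
    nominal = f(x)
    return nominal, f(x + dx) - nominal, f(x - dx) - nominal

# Toy example: f combines two measurements, assumed uncorrelated.
f = lambda x: x[0] * x[1] ** 0.5
mu = [50.0, 40.0]
dx = [1.5, 2.0]
cov = np.diag(np.square(dx))

print("sigma_f (linear):", np.sqrt(linear_propagation(f, mu, cov)))
print("band (vary +/-): ", variation_band(f, mu, dx))
```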
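The binned likelihood and its profile over a nuisance parameter can likewise be sketched in a few lines. This is a toy illustration assuming numpy/scipy, with invented bin contents and a single background-scale nuisance parameter constrained by a Gaussian; it is not the analysis code used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson, norm

# Toy binned model: per-bin expected signal s_i and background b_i,
# observed counts n_i, and one nuisance parameter nu scaling the
# background, constrained by N(nu; 1, sigma_nu).
s = np.array([5.0, 12.0, 7.0])
b = np.array([50.0, 45.0, 40.0])
n = np.array([57, 60, 46])
sigma_nu = 0.10

def nll(mu, nu):
    """Negative log likelihood: Poisson term per bin plus the Gaussian constraint."""
    expected = mu * s + nu * b
    return -(poisson.logpmf(n, expected).sum() + norm.logpdf(nu, 1.0, sigma_nu))

def profile_nll(mu):
    """Profile likelihood: minimize the NLL over the nuisance parameter
    for each value of the parameter of interest mu."""
    res = minimize_scalar(lambda nu: nll(mu, nu), bounds=(0.5, 1.5), method="bounded")
    return res.fun

for mu in (0.0, 0.5, 1.0, 1.5):
    print(f"mu = {mu:.1f}  profiled -log L = {profile_nll(mu):.2f}")
```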
Simulation and Analysis at the LHC
• How do we get these predictions?
• The Standard Model (plus higher-order terms) tells us how to compute interaction rates and their distributions – but we can only compute an approximation.
• Several different "generators" simulate the primary interactions and evolve them into multi-body final states.
• The dimensionality of the problem becomes enormous; we cannot compute the full pdf.
• Simulating these processes is therefore done through Monte Carlo sampling.

Generator Uncertainties [arXiv:1609.00607]
• Different generators can sometimes give rather different predictions.

Uncertainty in the Theory Model
• Our model of the true distributions has approximations in it:
  – x_t = g(z_t ; y_t = c, ν_t)
  – g(·) is the generator of events, which can be conditioned on the class
  – z_t are latent random numbers (random seeds)
  – ν_t are parameters
• We don't know the exact model, but we can quantify the uncertainty of the approximation:
  – e.g. the parameters have been constrained by other experiments, or theoretical physicists give bounds on them.
  – Variations of the parameters ν_t can be made to obtain alternative true distributions: p(x) → p(x, ν_t) = p(x | ν_t) p(ν_t).
  [Figure: simulated "true" signal and background distributions p_t(x) as a function of x = mass]

Very Big Detectors: The ATLAS Experiment
• ~10^8 detector channels
• Size: 46 m long, 25 m high, 25 m wide; weight: 7000 tons
• Data: ~300 MB/sec, ~3000 TB/year

From Theory to Experiment [from K. Cranmer]
• Particles interact with custom detectors – at the LHC, detectors can have 10^8 sensors.
• We develop algorithms to identify particles from the detector signals.
• We develop a small set (~20–40) of engineered features that describe the interesting properties of the event.
• We then train a classifier on these features to separate signal and background.
  [Figure: detector signatures labelled electron, muon, and bottom-quark jets]

Getting it right… we need
• An accurate model of the underlying physics, including the properties of the proton, the primary interaction, and the "showering" process.
• An accurate and detailed model of the detector.
• Accurate models of particle interactions with materials.
• We must tune parameters in both the underlying theory model and the particle–matter interactions.
• And we need to correct any mismodeling as best we can → calibrations / residual corrections.

Simulating the Experiment [ATL-PHYS-PUB-2015-050, ATL-PHYS-PUB-2016-015]
• The experiment simulation includes geometry, (mis)alignment, material properties, etc.
• Detector and particle–material interactions are simulated with GEANT.
• Example: the uncertainty in our simulated electron energy measurements due to our uncertainty in the amount of material in the detector.

Calibration / Modeling Corrections
• If we want to calibrate the modeling of a particle, we find a pure sample of that particle and derive corrections to the simulation so that it matches data.
• Example:
  – A Z boson decays to two electrons with a combined (invariant) mass of ≈ 90 GeV.
  – p(particle = electron | m_ee ≈ 90 GeV) ≈ 99%.
  – So we can select a (nearly) pure electron sample by looking for electron pairs consistent with a Z boson, as sketched below.
  – We then use the known properties of the Z boson to calibrate.
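A minimal sketch of that selection, assuming numpy; the mass window, toy four-momenta, and function names are illustrative choices, not values from the talk.

```python
import numpy as np

def invariant_mass(E1, p1, E2, p2):
    """m_ee^2 = (E1 + E2)^2 - |p1 + p2|^2 for two electron candidates,
    with E the energies and p the 3-momentum vectors (same units, GeV here)."""
    p_sum = np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float)
    m2 = (E1 + E2) ** 2 - np.dot(p_sum, p_sum)
    return np.sqrt(max(m2, 0.0))

def is_z_candidate(E1, p1, E2, p2, window=(80.0, 100.0)):
    """Keep electron pairs whose invariant mass lies in a window around the
    Z boson mass (~91 GeV); such pairs form a nearly pure electron sample."""
    m = invariant_mass(E1, p1, E2, p2)
    return window[0] < m < window[1]

# Toy electron pair (GeV), roughly back-to-back.
E1, p1 = 45.0, [44.9, 2.0, 1.0]
E2, p2 = 46.0, [-45.8, -2.5, 0.5]
print("m_ee =", round(invariant_mass(E1, p1, E2, p2), 1), "GeV,",
      "Z candidate:", is_z_candidate(E1, p1, E2, p2))
```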
Calibration / Modeling Corrections (continued)
• For simulated samples, we adjust the measured values a posteriori:
  – p_i → s · p_i · (1 + z)
  – s is a constant scaling parameter chosen to match the mean of simulation and data.
  – z ~ N(z; 0, σ_z) is a resolution smearing term chosen to match the variance of simulation and data.
• We measure s ± Δs and σ_z ± Δ_z with finite precision → a family of distributions p(p_i, s, σ_z):
  – s ~ N(s; s_0, Δs),  z ~ N(z; 0, σ_z),  σ_z ~ N(σ_z; σ_{z0}, Δ_z)
• The electron-pair invariant mass used for the calibration is

  m_{ee}^2 = (E_1 + E_2)^2 - (\vec{p}_1 + \vec{p}_2)^2

  [Figure: true distribution (no detector); simulated distribution with detector; corrected simulation with s and σ_z variations; observed data counts, all versus m_ee]
• In short: we correct the simulated mean and variance to match data – scale and smear the momenta p_i → s · p_i · (1 + z) so that the mass distributions match. [ATL-PHYS-PUB-2016-015]

Moving Forward
• Our known unknowns, our systematic uncertainties, will diminish our sensitivity to new physics.
• We want to reduce them where we can – improve our models with new inputs from new measurements.
• And we want to diminish their impact as much as possible – can we do this when training a classifier to separate signal from background?

Systematics in a Classifier: the Standard Approach [S. Hagebock]
• Train the classifier and evaluate it on the nominal sample.
• Run systematically varied samples through the classifier and see how the score distribution changes – systematics are only evaluated after training is complete.
• There is no notion of systematics in the training itself.

Domain Adaptation

Typical Binary Classification Problem
• Training data: {x_i, y_i}_{i=1}^{N}, with x_i in R^d, y_i in {0, 1}, and (x_i, y_i) ~ p(x, y).
• Testing / evaluation data: {x_i}, with x_i ~ p(x).
• We learn a classifier h(x): model p(y|x) and label the {x_i}.

Domain Adaptation [N. Courty, https://mathematical-coffees.github.io/mc01-ot/]
• Source domain: data with labels. Target domain: no labels!
• Training on source data only won't generalize well.

Domain Adaptation: Setting
• Training data: S = {x_i^s, y_i^s}_{i=1}^{N}, with (x_i^s, y_i^s) ~ p_s(x, y).
• Testing / evaluation data: U_t = {x_i^t}, with x_i^t ~ p_t(x).
• The distributions of the source and target domains are not the same: p_s(x, y) ≠ p_t(x, y).
• Setting: we perform the same classification task in both domains, and the target and source distributions are closely related.

Domain Adaptation
• How can we learn a classifier from source data that also works well on target data?
• It depends on what we know about the problem:
  – We assume the tasks are related!
  – Do we know anything about what has changed between domains?
  – We can use the target data to help in the training process!
• Theoretical bounds [Ben-David et al., 2010]: the error of a classifier in the target domain is upper bounded by the sum of three terms:
  – the error of the classifier in the source domain,
  – a divergence (or "distance") between the feature distributions p_s(x) and p_t(x),
  – a term measuring how closely related the two classification tasks are.

Domain Adaptation: Approaches
• There are several methods to address domain adaptation problems; they differ depending on the details of the problem.
• Instance weighting – often used in the covariate shift setting.
• Self-labeling approaches – iteratively predict target labels from the source model and retrain; not discussed today.
• Feature representation approaches – learn a feature representation f(x) that has the same distribution for target and source data. Adversarial domain adaptation takes this approach (a sketch follows below).
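As a sketch of the feature-representation approach just mentioned, here is a minimal adversarial domain-adaptation training step in the spirit of gradient-reversal methods (e.g. Ganin & Lempitsky's DANN), assuming PyTorch; the network sizes, loss weighting, and all names are illustrative assumptions, not the talk's implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor learns to *confuse* the domain
    classifier while the domain classifier itself is trained normally."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

d_in, d_feat = 20, 16
features = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_feat), nn.ReLU())
label_head = nn.Linear(d_feat, 1)    # signal vs background
domain_head = nn.Linear(d_feat, 1)   # source vs target
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(list(features.parameters()) +
                       list(label_head.parameters()) +
                       list(domain_head.parameters()), lr=1e-3)

def train_step(x_src, y_src, x_tgt, lam=1.0):
    """One adversarial step: classification loss on labelled source events
    plus a domain-confusion loss on source + unlabelled target events."""
    f_src, f_tgt = features(x_src), features(x_tgt)

    # Supervised loss: only the source domain has labels.
    loss_label = bce(label_head(f_src).squeeze(1), y_src)

    # Domain loss on gradient-reversed features: 0 = source, 1 = target.
    f_all = GradReverse.apply(torch.cat([f_src, f_tgt]), lam)
    d_true = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))])
    loss_domain = bce(domain_head(f_all).squeeze(1), d_true)

    loss = loss_label + loss_domain
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss_label.item(), loss_domain.item()

# Toy usage with random stand-in data (shapes only).
x_src, y_src = torch.randn(64, d_in), torch.randint(0, 2, (64,)).float()
x_tgt = torch.randn(64, d_in) + 0.5   # shifted, unlabelled "target" events
print(train_step(x_src, y_src, x_tgt))
```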
Covariate Shift
• The general domain-adaptation situation is p_s(x, y) ≠ p_t(x, y).
• In the covariate shift setting we assume:
  – The class probabilities conditional on the features are the same: p_t(y|x) = p_s(y|x) = p(y|x).
  – The marginal distributions are different: p_s(x) ≠ p_t(x) (a weighting sketch follows below).

Covariate Shift [M.
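Where the covariate shift assumption holds, instance weighting reduces to estimating the density ratio w(x) = p_t(x)/p_s(x) and using it to reweight the labelled source events. Below is a minimal sketch using the common discriminative density-ratio trick (train a source-vs-target classifier and convert its output to a ratio), assuming numpy and scikit-learn; the toy data and names are illustrative, not from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(x_source, x_target):
    """Estimate importance weights w(x) ~ p_t(x) / p_s(x) for source events.
    A classifier separates source (label 0) from target (label 1) events;
    its probability ratio gives the density ratio up to a constant."""
    x_all = np.vstack([x_source, x_target])
    d_all = np.concatenate([np.zeros(len(x_source)), np.ones(len(x_target))])
    clf = LogisticRegression(max_iter=1000).fit(x_all, d_all)
    p_target = clf.predict_proba(x_source)[:, 1]
    w = p_target / np.clip(1.0 - p_target, 1e-6, None)
    return w * len(x_source) / len(x_target)   # correct for sample-size imbalance

# Toy usage: the target marginal p_t(x) is shifted relative to the source p_s(x),
# while p(y|x) is assumed unchanged (the covariate shift assumption above).
rng = np.random.default_rng(0)
x_source = rng.normal(0.0, 1.0, size=(5000, 2))
x_target = rng.normal(0.4, 1.0, size=(5000, 2))
w = covariate_shift_weights(x_source, x_target)

# These weights would then be passed as per-event weights (e.g. sample_weight)
# when training the signal-vs-background classifier on the source data.
print("mean weight:", w.mean().round(3),
      " 5th/95th pct:", np.percentile(w, [5, 95]).round(2))
```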