
Some definitions

● experiment: e.g. throwing a die twice
● realization: outcome of one specific experiment
● event space / sample space: all possible outcomes
● random variable: function of the outcome of an experiment (e.g. sum of the dice)
  – event space and realization are defined accordingly for random variables
  – a random variable can be either discrete or continuous

Modern Methods of Data Analysis - WS 07/08 Stephanie Hansmann-Menzemer

Example

● experiment: throwing two dice
● event space of the experiment: {11,12,13,14,15,16,21,22,23,...,64,65,66}
● random variable x = sum of the dice
  – event space of x: {2,3,4,5,6,7,8,9,10,11,12}
● realization of the experiment: e.g. 25; the corresponding realization of x is 7

cumulative distribution u(x): probability to observe the value x or a smaller value
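The two-dice example can be sketched in a few lines of Python. This is only an illustration: it assumes fair dice (all 36 outcomes equally likely), which the slide does not state explicitly, and the variable names are chosen here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of the experiment: all ordered pairs (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# Random variable x = sum of the two dice, evaluated on every outcome.
counts = Counter(a + b for a, b in sample_space)

# Probability of each value of x, assuming fair dice.
prob = {x: Fraction(n, len(sample_space)) for x, n in sorted(counts.items())}

# Cumulative distribution u(x): probability to observe x or a smaller value.
cumulative = {}
running = Fraction(0)
for x, p in prob.items():
    running += p
    cumulative[x] = running

print(sorted(prob))    # event space of x: [2, 3, ..., 12]
print(prob[7])         # 1/6
print(cumulative[7])   # 7/12
```

Exact fractions are used so that the cumulative distribution ends at exactly 1.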

Probability density function (p.d.f.)

● Repeat an experiment whose outcome is characterized by a single continuous variable x
● Definition: the probability to measure a value in the interval [x, x+dx] is given by the probability density function f(x) (pdf):
  $P(x' \in [x, x+dx]) = f(x)\,dx$

P is a “measure” of how often a value of x occurs in a given sample.
The pdf is non-negative and normalized: $f(x) \geq 0$ and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$

The pdf f(x) is NOT a probability: it has dimension 1/[x], i.e. the inverse of the units of x!

Cumulative distribution function (c.d.f.)

● Cumulative distribution function F(x), also known as the probability distribution function or simply the distribution function
● F(x') is interpreted as the probability to find a value x <= x'
● F(x) is a continuous, non-decreasing function
● F(-∞) = 0 and F(+∞) = 1
● F(x) is directly related to the probability density function f(x) by:
  $F(x) = \int_{-\infty}^{x} f(x')\,dx'$
● f(x) is given (for well-behaved distributions) by:
  $f(x) = \frac{dF(x)}{dx}$
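The relation between f(x) and F(x) can be checked numerically. The sketch below uses an exponential distribution purely as an illustration (it is not taken from the slides); the helper `integrate` and all names are chosen here.

```python
import math

def f(x, lam=1.0):
    """Exponential pdf (illustrative choice): lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def F(x, lam=1.0):
    """Analytic cdf of the same distribution: 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(func, a, b, n=10_000):
    """Trapezoidal rule for the integral of func from a to b."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h

x0 = 1.3
# F(x0) as the integral of f up to x0 (f vanishes below 0, so start at 0).
cdf_numeric = integrate(f, 0.0, x0)
# f(x0) as the derivative of F, via a central finite difference.
h = 1e-5
pdf_numeric = (F(x0 + h) - F(x0 - h)) / (2 * h)
# Normalization: the integral over (practically) the full range is close to 1.
norm_numeric = integrate(f, 0.0, 50.0)

print(cdf_numeric, F(x0))
print(pdf_numeric, f(x0))
print(norm_numeric)
```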

Average

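● arithmetic mean of a data set:
  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

● geometric mean of a data set:
  $\bar{x}_{geo} = \left(\prod_{i=1}^{N} x_i\right)^{1/N}$

● harmonic mean of a data set:
  $\bar{x}_{harm} = N \Big/ \sum_{i=1}^{N} \frac{1}{x_i}$

The 3 Pythagorean means satisfy: harmonic mean <= geometric mean <= arithmetic mean (for positive values)

● weighted mean of a data set:
  $\bar{x}_{w} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$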
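As a minimal sketch, the four means can be written directly from their definitions. The function names and the sample data are chosen here for illustration; the geometric and harmonic means assume strictly positive values.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # N-th root of the product, computed via logarithms; requires all x > 0.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    # N divided by the sum of reciprocals; requires all x > 0.
    return len(xs) / sum(1.0 / x for x in xs)

def weighted_mean(xs, ws):
    # Sum of w_i * x_i divided by the sum of the weights.
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

data = [1.0, 2.0, 4.0]  # hypothetical positive data
a = arithmetic_mean(data)
g = geometric_mean(data)
h = harmonic_mean(data)
print(h, g, a)  # ordering: harmonic <= geometric <= arithmetic

# With equal weights the weighted mean reduces to the arithmetic mean.
w_equal = weighted_mean(data, [1.0, 1.0, 1.0])
```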

Example:

● The average number of children per family in Germany is 2.3
● The average life expectancy is 74 for men and 78 for women
● The average number of semesters for physics studies in Heidelberg is 11.2
● ...

Example: Geometric Mean

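● Needed to average multiplicative quantities:
  – interest rates per year over the last 5 years
    ● 2002: 2.5 %
    ● 2003: 2.5 %
    ● 2004: 3.0 %
    ● 2005: 3.5 %
    ● 2006: 3.5 %
● after five years the capital has grown by a factor:
  $1.025 \cdot 1.025 \cdot 1.030 \cdot 1.035 \cdot 1.035 \approx 1.159$
● comparable average interest:
  $(1.025 \cdot 1.025 \cdot 1.030 \cdot 1.035 \cdot 1.035)^{1/5} - 1 \approx 2.999\,\%$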
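The interest example can be reproduced with a short calculation. This sketch only uses the five rates listed above; variable names are chosen here.

```python
import math

# Yearly interest rates 2002-2006 from the example.
rates = [0.025, 0.025, 0.030, 0.035, 0.035]

# Total growth factor after five years: product of the yearly factors.
growth = math.prod(1.0 + r for r in rates)

# Comparable constant yearly rate: geometric mean of the factors, minus 1.
avg_rate = growth ** (1.0 / len(rates)) - 1.0

# For comparison: the arithmetic mean of the rates.
arith_rate = sum(rates) / len(rates)

print(f"growth factor: {growth:.4f}")
print(f"geometric-mean rate: {100 * avg_rate:.3f} %")
print(f"arithmetic-mean rate: {100 * arith_rate:.3f} %")
```

Applying `avg_rate` for five years reproduces the same growth factor, which is what makes it the comparable average; the arithmetic mean of the rates is slightly larger.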

Example: Harmonic Mean

● Travel half of the distance at 40 km/h and half of the distance at 60 km/h. What is the average speed?
  Harmonic mean: $2 / (1/40 + 1/60) = 48$ km/h

● Travel half of the time at 40 km/h and half of the time at 60 km/h. What is the average speed?
  Arithmetic mean: $(40 + 60)/2 = 50$ km/h
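Both travel questions reduce to "total distance divided by total time". The following sketch computes that directly; the function names and the total distance and time used in the calls are arbitrary choices made here.

```python
def average_speed_equal_distance(v1, v2, distance=120.0):
    """Half of the distance at v1, half at v2: total distance / total time."""
    time = (distance / 2) / v1 + (distance / 2) / v2
    return distance / time

def average_speed_equal_time(v1, v2, time=2.0):
    """Half of the time at v1, half at v2: total distance / total time."""
    distance = v1 * (time / 2) + v2 * (time / 2)
    return distance / time

v_dist = average_speed_equal_distance(40.0, 60.0)  # harmonic mean of 40 and 60
v_time = average_speed_equal_time(40.0, 60.0)      # arithmetic mean of 40 and 60
print(v_dist, v_time)
```

The result does not depend on the chosen total distance or total time; equal distances give the harmonic mean (48 km/h), equal times the arithmetic mean (50 km/h).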

Example: Weighted Mean

● 5 measurements $\{x_i\}$ with different uncertainties $\{\sigma_i\}$

● the arithmetic mean is a special case of the weighted mean (same weight for each measurement)
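A weighted mean of measurements with different uncertainties might look as follows. Note the assumptions: the weights are taken as inverse variances, w_i = 1/σ_i², which is the common choice but is not spelled out in the text above, and the five measurements are made-up numbers.

```python
def weighted_mean(values, sigmas):
    """Weighted mean with weights w_i = 1 / sigma_i**2 (assumed convention)."""
    weights = [1.0 / s**2 for s in sigmas]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical measurements and uncertainties, for illustration only.
values = [10.2, 9.8, 10.5, 9.9, 10.1]
sigmas = [0.1, 0.2, 0.5, 0.1, 0.3]

wm = weighted_mean(values, sigmas)
am = sum(values) / len(values)
print(wm, am)

# With identical uncertainties the weighted mean equals the arithmetic mean.
wm_equal = weighted_mean(values, [0.2] * len(values))
```

Precise measurements (small σ) dominate the weighted mean, while the arithmetic mean treats all five alike.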

... even more averages

● Mode: most probable value (highest bin in the distribution); the mode is not necessarily unique, distributions can be unimodal, bimodal, ...

● Median: smallest value which is ≥ 50% of the events (the median is more robust against outliers than the arithmetic mean)

For unimodal, symmetric distributions: median = mode = arithmetic mean (all equal to the center of symmetry)

Otherwise, for unimodal distributions, an empirical rule holds approximately: mean − mode ≈ 3 × (mean − median)
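Mean, median and mode of a small sample can be compared with Python's standard `statistics` module. The data set is made up for illustration; `median_low` is used because it returns the smallest value that is ≥ at least 50% of the entries, matching the definition above.

```python
import statistics

# Hypothetical sample with one large outlier.
data = [1, 2, 2, 3, 4, 5, 100]

mean = statistics.mean(data)
median = statistics.median_low(data)   # smallest value >= 50% of the entries
modes = statistics.multimode(data)     # list of all most frequent values

print(mean, median, modes)

# Removing the outlier changes the mean drastically, the median only slightly.
cleaned = data[:-1]
mean_cleaned = statistics.mean(cleaned)
median_cleaned = statistics.median_low(cleaned)
print(mean_cleaned, median_cleaned)
```

`multimode` returns a list because the mode need not be unique.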

Examples:

● Give median, mode, arithmetic mean of:

Energy loss in Material

● Assume tracking stations with 12 layers
● The energy loss per unit of traversed material (dE/dx) of a particle traversing a layer follows a Landau distribution

[Figure: Landau distribution of dE/dx with the mode marked]

Energy loss in Material

● The mode of the dE/dx distribution as a function of particle momentum is used to separate different particle species

[Figure: dE/dx as a function of momentum [GeV/c]]

How can the mode be estimated from 12 measurements?

Truncated Mean

● Discard the 10% or 20% of the measurements with the largest values (this symmetrizes the distribution)
● Take the arithmetic mean of the remaining ones

● This is all about estimating the true mode/median/mean of a distribution from a given data set. More about estimators later.
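A truncated mean as described above could be sketched like this. The function name, the rounding rule (round the number of discarded values down) and the 12 dE/dx values are assumptions made here for illustration, not values from the lecture.

```python
def truncated_mean(values, discard_fraction=0.2):
    """Arithmetic mean after discarding the largest values.

    The number of discarded values is discard_fraction * N, rounded down.
    """
    n_discard = int(discard_fraction * len(values))
    if n_discard >= len(values):
        raise ValueError("discard_fraction removes all measurements")
    kept = sorted(values)[: len(values) - n_discard]
    return sum(kept) / len(kept)

# Hypothetical dE/dx measurements from 12 layers (arbitrary units),
# with two large values mimicking the tail of a Landau distribution.
dedx = [1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 1.1, 0.95, 1.05, 1.15, 3.5, 6.0]

plain = sum(dedx) / len(dedx)
truncated = truncated_mean(dedx, discard_fraction=0.2)
print(plain, truncated)
```

With 12 measurements and a 20% cut, the two largest values are dropped, so the result is no longer pulled upwards by the tail.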

Citation ...

“The rate of people living in poverty in the USA has risen dramatically. Half of all people have an income below the average.”

from “So lügt man mit Statistik”

Measure the Spread

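● How to characterise width/spread?
● First thought: the mean deviation from the mean:
  $\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}) = 0$
  (always zero, hence useless)

One could consider the average absolute deviation:
  $\frac{1}{N}\sum_{i=1}^{N} |x_i - \bar{x}|$

However, it is hard to handle mathematically.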
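A quick numerical check of the two candidates, using a made-up data set:

```python
# Hypothetical data set with mean 5.0.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(data)
mean = sum(data) / n

# Mean deviation from the mean: positive and negative deviations cancel.
mean_dev = sum(x - mean for x in data) / n

# Average absolute deviation: a usable measure of the spread.
mean_abs_dev = sum(abs(x - mean) for x in data) / n

print(mean, mean_dev, mean_abs_dev)
```

The first quantity vanishes for any data set, which is why it carries no information about the width.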

Variance

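● A much better quantity:
  $V(x) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$

the mean square deviation, called the sample variance

● For any function $f(x)$:
  $V(f) = \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - \bar{f})^2$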

Variance

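● For data analysis, it is preferable to loop over the data only once:
  $V(x) = \overline{x^2} - \bar{x}^2$

(mean of the squares minus the square of the mean)

For large numbers, it is safer to first shift the distribution by an estimated mean, here called $x_0$:
  $V(x) = \frac{1}{N}\sum_{i=1}^{N} (x_i - x_0)^2 - \left(\frac{1}{N}\sum_{i=1}^{N} (x_i - x_0)\right)^2$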
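The single-loop variance and the effect of the shift can be sketched as follows. Function names and the data are chosen here; the shift is simply taken to be the first data point, which is one possible rough estimate of the mean.

```python
def variance_two_pass(xs):
    """Reference: mean square deviation, looping over the data twice."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def variance_one_pass(xs, shift=0.0):
    """Mean of squares minus square of mean, in a single loop.

    `shift` is a rough estimate of the mean that is subtracted first.
    """
    n = 0
    s = 0.0   # running sum of (x - shift)
    s2 = 0.0  # running sum of (x - shift)**2
    for x in xs:
        d = x - shift
        n += 1
        s += d
        s2 += d * d
    return s2 / n - (s / n) ** 2

# Hypothetical data: small spread on top of a large offset.
data = [1e9 + v for v in (4.0, 7.0, 13.0, 16.0)]

reference = variance_two_pass(data)
shifted = variance_one_pass(data, shift=data[0])
unshifted = variance_one_pass(data)  # prone to cancellation errors
print(reference, shifted, unshifted)
```

Without the shift, two numbers of order 10^18 are subtracted to obtain a result of order 10, so most significant digits are lost in floating-point arithmetic; with the shift, the sums stay small and the result agrees with the two-loop reference.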

Standard deviation (R.M.S.), FWHM

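● standard deviation σ, or RMS (root mean square deviation): $\sigma = \sqrt{V}$, the square root of the variance

[“standard” is a joke, there are several standards in the literature ...]

● FWHM: full width at half maximum
  – more robust against outliers and fluctuations, but harder to determine at low statistics
  – for a Gaussian distribution: FWHM = $2\sqrt{2\ln 2}\,\sigma \approx 2.35\,\sigma$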
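The Gaussian relation FWHM ≈ 2.35σ follows from solving exp(-x²/2σ²) = 1/2, which gives x = ±σ√(2 ln 2). A short check, with names chosen here:

```python
import math

def gaussian(x, mu=0.0, sigma=1.0):
    """Normalized Gaussian pdf."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

sigma = 1.7  # arbitrary width for the check
# Exact conversion factor between FWHM and sigma for a Gaussian.
factor = 2.0 * math.sqrt(2.0 * math.log(2.0))
fwhm = factor * sigma

# At a distance FWHM/2 from the peak, the pdf has dropped to half its maximum.
peak = gaussian(0.0, sigma=sigma)
at_half_width = gaussian(fwhm / 2.0, sigma=sigma)

print(factor)
print(at_half_width / peak)
```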
