statistical_analysis-WORKBOOK

In [1]: ## CSS coloring for the dataframe tables

from IPython.core.display import HTML

css = open('../style/style-table.css').read() + open('../style/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

Out[1]:

Statistical Analysis

Index

Exploratory Data Analysis
Data Types
Estimates of Location: mean (exercises), trimmed mean, weighted mean, median (exercise)
Estimates of Variability: standard deviation and related estimates (exercise), estimates based on percentiles, estimates of skewness and tailedness (exercise)
Estimates of Distribution: boxplot (exercise); frequency tables, density plots and histograms (exercises)
Correlation
Data and Sampling distribution
Common distributions: Binomial distribution (exercises), Poisson distribution, Exponential distribution, Uniform distribution, Normal distribution, Chi2 distribution
Sampling Distribution of a Statistic: Sampling Distribution of Means (exercise)

Exploratory Data Analysis [[1](#ref1)]

Statistics developed mostly in the past century. Probability theory, the mathematical foundation for statistics, was developed in the 17th to 19th centuries based on the work of Thomas Bayes, Pierre-Simon Laplace, and Carl Gauss.

source: https://bit.ly/1LEZVtR

Statistics is an applied science concerned with the analysis and modeling of data. Francis Galton, Karl Pearson, and Ronald Fisher laid the pillars of modern statistics by introducing key ideas like experimental design and maximum likelihood estimation.

These and many other statistical concepts formed the foundations of modern data science.

The first step in any data science project is exploring the data. Exploratory Data Analysis (EDA) is a comparatively new area of statistics established by John W. Tukey, who called for a reformation of statistics in his paper The Future of Data Analysis. He proposed a new scientific discipline called data analysis that includes statistical inference as one of its components, besides engineering and computer science techniques. [[2](#ref2)]

source: https://bit.ly/2SuYDdm


The rapid development of computing power and data analysis software, together with the availability of more and bigger data, has allowed exploratory data analysis to evolve well beyond its original scope.

Donoho, in his paper 50 Years of Data Science, traces the roots of data science back to Tukey's work.

Data Types[[1](#ref1), [3](#ref3)]

DIFFERENT TYPES OF DATA SOURCES

Data manifests itself in many different shapes. Each shape may hold much value for your targeted goal; in some shapes this value is easier to extract than in others.

Different shapes of data require different storage solutions and should therefore be dealt with in different ways. The shapes of data will be distinguished as follows:

1. Unstructured Data

Unstructured data is the raw form of data. It can be any type of file, for example: texts, pictures, sounds, sensor measurements, or videos.

This data is often stored in a repository of files. Extracting value out of this shape of data is often the hardest, since you first need to extract structured features from the data that describe or abstract it.

Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly — often many times faster than structured databases are growing.[[4](#ref4)]


source: https://bit.ly/2GmXK1F

2. Structured Data

Structured data is tabular data (rows and columns) that is very well defined, meaning that we know which columns there are and what kind of data they contain.

source:

https://bit.ly/2rdzwi9

The two basic types of structured data are: [[1](#ref1)]

1. Numeric, which comes in two forms:

A. Continuous: data that can take any real value in an interval ($x \in \mathbb{R}$), such as speed or time duration.


synonyms: interval, float, numeric

B. Discrete: data that take only integer values ($x \in \mathbb{Z}$), such as the count of event occurrences.

synonyms: integer, count

2. Categorical: data that take values from a fixed set (such as car, bike, or truck). Two special cases of categorical data are:

synonyms: enums, enumerated, factors, nominal, polychotomous

A. Binary data: where the data can take one of two possible values, such as 0/1, true/false, cat/dog, ...

synonyms: dichotomous, logical, indicator, boolean

B. Ordinal data: where the order of the values matters, for example: very cold, cold, warm, hot.

synonyms: ordered factor

Knowing the data type is important in the process of exploring your data, since it determines the required type of visual display, data analysis, and even statistical/training model.

The typical frame of data used in data science is the rectangular data object, like spreadsheets and pandas DataFrames. It is essentially a two-dimensional table/matrix, where:

the rows represent the observation samples

also called: cases, examples, instances, records

the columns represent the features:

also called: attributes, inputs, predictors, variables

The dataset that you need to analyse does not always come in rectangular form (for example, unstructured data like text), but it has to be processed into rectangular form.

There are other important data structures used in data science besides rectangular data: the nonrectangular data structures. These can come in the forms of:

1. Time series: a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time; thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, or the measurements of a sensor over time. [[5](#ref5)]

source:

https://bit.ly/2SX7ghq

2. Spatial data structures: data structures used in mapping and location analytics, optimized for storing and querying data that represents objects defined in a geometric space. Most spatial databases allow the representation of simple geometric objects such as points, lines, and polygons. [[6](#ref6)] The structure of this data is more complex than rectangular data. Possible representations of this data structure are: [[1](#ref1)]

Object representation: the focus of the data is an object (e.g. a house) and its spatial coordinates.
Field view representation: focuses on small units of space and the value of a relevant metric (for example, pixel brightness).
source:

https://bit.ly/2YfljVn

3. Graph (or network) data structures: used to represent physical, social, and abstract relationships, for example a graph of a social network. Graph structures are useful for certain types of problems, such as network optimization and recommender systems.

source: https://bit.ly/2KeXXW1

Estimates of Location [[1](#ref1)]

Also called "central tendency" measures, these estimates are the basic step in exploring numeric data: estimating where most of the data is located.

The basic estimators for location are the mean and the median.

Mean

also called average or mathematical expectation, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values.

For a dataset with $n$ values, $x = \{x_1, x_2, \ldots, x_n\}$, its mean is defined as: $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

In statistics, when representing the count of a population/sample, the capitalized character (for example, N) refers to the population, while the lower case (for example, n) refers to a sample from a population.

The mean is a sensitive estimate with respect to outliers, but the effect of an outlier decreases as the number of samples (n) increases.
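As a minimal sketch (hypothetical toy numbers, not the tips data used below), the pull of a single outlier on the mean shrinks as n grows:

```python
import numpy as np

small = np.array([8, 9, 10, 11, 12, 100])          # one outlier among 6 points
large = np.concatenate([np.full(100, 10), [100]])  # the same outlier among 101 points

print(small.mean())  # 25.0  -> dragged far from the bulk (~10)
print(large.mean())  # ~10.9 -> barely moved
```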

Exercise


1. From the sub_set dataset below: A. plot the points; B. compute the mean of tip and plot that mean as a horizontal line.

In [118]: import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
from pandas import DataFrame as df

In [119]: dataset = sns.load_dataset("tips")
sub_set = dataset.loc[:,['total_bill', 'tip']]
sub_set.head()

### YOUR CODE HERE # # ###

Out[119]: total_bill tip

0 16.99 1.01

1 10.34 1.66

2 21.01 3.50

3 23.68 3.31

4 24.59 3.61

2. Create sub_set_with_outlier by taking the first 15 elements of sub_set , then: A. plot the points; B. compute the mean of tip and plot that mean as a horizontal line.

In [121]: ### YOUR CODE HERE # # ###

Exercise

Repeat the previous example but take sub_set_with_outlier as the complete sub_set (i.e. increasing n), and notice the effect that a larger dataset has on the amount of influence an outlier can have on the mean value.

In [123]: ### YOUR CODE HERE
#
#
###

In statistics there are more robust variations of the mean against outliers, such as the trimmed mean and the weighted mean.

Trimmed mean

A fixed number of sorted values at each end of the dataset is removed from the calculation of the average.

For a dataset with $n$ sorted values, $x = \{x_1, x_2, \ldots, x_n\}$, the trimmed mean with the $p$ smallest and largest values omitted is: $$\text{Trimmed Mean} = \bar{x}_{\text{trimmed } p} = \frac{\sum_{i=p+1}^{n-p} x_i}{n - 2p}$$

For example, the trimmed 5% mean omits the leftmost 5% and rightmost 5% of the values:

In [125]: ##################################### CODE HERE ######################## #############

mean without outlier = 9.500
mean with outlier = 56.667
trimmed 5% mean = 10.000
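A minimal sketch of how numbers like those above could be produced with scipy's trim_mean (the data here is hypothetical; the hidden cell may differ):

```python
import numpy as np
from scipy import stats

values = np.array([9.0, 9.5, 9.5, 10.0, 10.0, 10.5, 500.0])  # hypothetical data, one outlier

print(f"mean with outlier = {values.mean():.3f}")
print(f"trimmed mean      = {stats.trim_mean(values, 0.15):.3f}")  # cut 15% from each tail
```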

Weighted mean

The weighted mean is calculated by multiplying each data value $x_i$ by a weight $w_i$ and dividing their sum by the sum of the weights.

For a dataset with $n$ values, $x = \{x_1, x_2, \ldots, x_n\}$, and their associated weights $w = \{w_1, w_2, \ldots, w_n\}$, the weighted mean is calculated as: $$\text{Weighted Mean} = \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
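A minimal sketch of this formula using numpy's np.average, with hypothetical sensor readings (the last sensor is assumed less accurate and downweighted):

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 50.0])  # readings; the last sensor is unreliable
w = np.array([1.0, 1.0, 1.0, 0.1])      # downweight the unreliable sensor

print(np.mean(x))                # plain mean, pulled up by the last reading
print(np.average(x, weights=w))  # weighted mean = sum(w*x) / sum(w)
```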

There are two main reasons why/when to use a weighted mean:

The weighted mean is more robust to certain outliers: some values are intrinsically more variable than others, and highly variable observations are given a lower weight.

For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might downweight the data from that sensor.

The data collected does not equally represent the different groups that we are interested in measuring.

For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented.

Exercise

For the sub_set_with_outlier.tip dataset used in the last exercise:

1. calculate the trimmed 10% mean

In [126]: ### YOUR CODE HERE # # ###

2. Generate a normal PDF (probability density function) with $\mu = \bar{x}_{\text{trimmed }10\%}$ and $\sigma = \sigma_{\text{trimmed }10\%}$:

Use this PDF as weights for sub_set_with_outlier and calculate the weighted mean.

In [128]: ### YOUR CODE HERE # # ###

3. Plot sub_set_with_outlier and the following means as horizontal lines:

mean, $\bar{x}_{\text{trimmed }10\%}$, $\bar{x}_w$

In [130]: ### YOUR CODE HERE # # ###


Note that trimming the top and bottom 10% is a common choice for a robust measure of location.

Median

The median is the middle number of a sorted list of data. If there is an even number of data values, the median will be the mean of the two observations in the middle of the sorted data list.

source: https://en.wikipedia.org/wiki/Median

For any probability density function (PDF) on the real line $\mathbb{R}$, the median represents the point in the middle of the distribution, where it splits the distribution curve into two equal areas.

Let $m$ be the median of a random variable $X \in \mathbb{R}$ with a given probability distribution function; the median is by definition a real number that satisfies: $$P(X \geq m) = P(X \leq m) = \frac{1}{2}$$


source: https://en.wikipedia.org/wiki/Median
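A minimal sketch of both cases in the definition (toy numbers):

```python
import numpy as np

odd  = np.array([3, 1, 7])     # sorted: 1 3 7 -> the middle value
even = np.array([3, 1, 7, 5])  # sorted: 1 3 5 7 -> mean of the two middle values

print(np.median(odd))   # 3.0
print(np.median(even))  # 4.0
```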

An outlier is any value that is very distant from the other values in a dataset. Outliers (extreme cases) may be the result of bad data/measurements, and could skew the results.

Anomaly detection refers to algorithms that separate the outliers from the normal data. The greater the mass of data that defines normal behavior, as opposed to the anomalies, the better an anomaly detection algorithm can determine the outliers.

The median and trimmed mean are robust measures for estimation against outliers. The trimmed mean uses more data to calculate the estimate for location compared to the median.

Usually, for a random variable X:

mean(X) > trimmed mean(X) > median(X)

Other robust metrics for location: statistics offers plenty of other location estimators that are robust and efficient (i.e. have a better ability to distinguish small location differences between datasets). These methods are usually useful for small datasets, but they are not likely to provide added benefit for large or even moderately sized datasets.

Remember that even the (by definition) sensitive mean gains better resistance to outliers as the sample size (n) increases.

Weighted Median

Exercise

Import the dataset state.csv from the assets folder, and compute the following:

1. Rename the murder-rate column to Murder_Rate
2. Plot the Population against the Murder_Rate
3. Mean of the Population
4. Trimmed 10% mean of the Population
5. Median of the Population
6. Weighted mean of Murder_Rate weighted by Population

Use the weightedstats library

7. Weighted median of Murder_Rate weighted by Population

In [132]: import pandas as pd
import numpy as np
from scipy import stats # for trimming
import weightedstats as ws # for weighted mean and weighted median
import matplotlib.pyplot as plt
%matplotlib inline

### YOUR CODE HERE
#
#
###

Estimates of Variability [[1](#ref1)]

Location is just one dimension in summarizing a feature. A second dimension, variability, measures how much the data is spread around the estimate of location (usually the mean), i.e. whether the data values are tightly clustered or spread out. At the heart of statistics lie variability and the need to:

measure it
reduce it
distinguish random variability (usually noise or outliers) from real variability
identify the various sources of real variability
and finally make decisions based on this information.

Here the following estimates will be considered:

1. Standard deviation and related estimates
2. Estimates based on percentiles
3. Estimates of asymmetry

Standard deviation and related estimates

Mean Absolute Deviation

The mean absolute deviation is the mean of the absolute differences between each point and the measure of location (usually the mean).


$$\text{Mean absolute deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

As example: The Mean Absolute Deviation of the Population in the state.csv dataset:

In [134]: ##################################### CODE HERE ######################## #############

The mean absolute deviation of Population = 4450933.36
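The Population figure above comes from a hidden cell; the formula itself is a one-liner. A minimal sketch with toy numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])        # toy data, mean = 5
mad_mean = np.mean(np.abs(x - x.mean()))  # mean of |x_i - mean|
print(mad_mean)                           # (3 + 1 + 1 + 3) / 4 = 2.0
```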

Median Absolute Deviation (MAD)

In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. For a univariate dataset $X_1, X_2, \ldots, X_n$, the MAD is defined as the median of the absolute deviations from the data's median.

Exercise

For the Population of state.csv dataset:

1. compute the Median Absolute Deviation (MAD).

In [135]: ### YOUR CODE # # ###

2. compute the median absolute deviation from the mean.

In [137]: ### YOUR CODE # # ###

Variance and Standard Deviation


The best-known estimates for variability are the variance ($s^2$ for a sample, $\sigma^2$ for a population) and the standard deviation ($s$ for a sample, $\sigma$ for a population),

and both are based on the squared distances of spread away from the mean.

$$\text{Variance} = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad \text{Standard Deviation} = s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Note that, in computing the variance and standard deviation, we used $n - 1$ for the denominator instead of $n$, leading to the concept of degrees of freedom. In general you aim to have a large $n$, which makes the choice of $n - 1$ degrees of freedom have less effect on the result. The reason is:

When we attempt to estimate the spread of population data from a collected sample, using the denominator $n$ will underestimate the true value of the variance in the population (this is referred to as a biased estimate).

When dividing by $n - 1$ instead of $n$, the variance becomes an unbiased estimate.
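In numpy the denominator is controlled by the ddof argument (ddof=0 divides by n, ddof=1 by n − 1); pandas defaults to ddof=1. A minimal sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.var(x))          # 2.0   -> divides by n (biased for a sample)
print(np.var(x, ddof=1))  # 2.5   -> divides by n-1 (unbiased sample variance)
print(np.std(x, ddof=1))  # ~1.58 -> sample standard deviation
```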

The standard deviation preserves the scale of the original data, and is preferred in statistics over the Mean Absolute Deviation.

In general, working with squared values is much more convenient than absolute values, especially in statistical models.

Variance and standard deviation are not robust estimates of variability in the presence of outliers, since they measure the spread around the mean. One robust variation of the standard deviation is the trimmed standard deviation.

For example, the standard deviation (unbiased sample estimate) and the 10% trimmed version of it, for the Murder_Rate, are:

In [139]: ##################################### CODE HERE ######################## #############

The std of Murder_Rate = 1.90
The trimmed 10% std of Murder_Rate = 1.33

The variance, the standard deviation, the mean absolute deviation, and the median absolute deviation from the median (MAD) are not equivalent estimates (and cannot be compared directly), even in the case where the data comes from a normal distribution.

σ > Mean Absolute Deviation > MAD

In the case of data with a normal distribution, the MAD is sometimes multiplied by a constant factor (1.4826) to put the MAD on the same scale as the standard deviation.

Estimates based on percentiles

A different approach to estimating dispersion is based on looking at the spread of the sorted data. Statistics based on sorted (ranked) data are referred to as order statistics.

Range and Interquartile Range

The range is the difference between the largest and the smallest number in the dataset.

The minimum and maximum values of a dataset are useful information to know, but the range is extremely sensitive to outliers.

Compared to the range, a more useful measure is a range based on different percentiles. The $P$th percentile is a value such that at least $P\%$ of the values take on this value or less, and at least $(100 - P)\%$ take on this value or more. For example, to find the 80th percentile: we sort the data, then take the values from the smallest up to 80% of the way through.

Percentiles are the same as quantiles, but quantiles are indexed with fractions (for example, the 0.8 quantile is equivalent to the 80th percentile).

For example, for the ordered list {3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20}:

median (50th percentile): 9
25th percentile: 7
75th percentile: 15
100th percentile: 20

Interquartile Range (IQR) is a common measure of variability, found by taking the difference between the 25th and the 75th percentiles. As example, for the previous ordered list, the IQR = 15 − 7 = 8

For very large dataset, calculating the exact percentiles can be computationally expensive (since it will require sorting all the dataset). Machine learning and statistics algorithms use an approximation to percentiles, such as [7].

For example: 1. To compute the IQR for the Population in the state dataset in Python:

In [140]: ##################################### CODE HERE ######################## #############

IQR of Population is 4847308.00

2. To compute the basic percentiles

In [141]: ##################################### CODE HERE ######################## #############

Minimum value of the Population is 563626.00 25th percentile of the Population is 1833004.25 50th percentile of the Population is 4436369.50 75th percentile of the Population is 6680312.25 Maximum value of the Population is 37253956.00
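A minimal sketch of how these numbers could be computed, assuming the state dataset has been read into a DataFrame named state_df (as elsewhere in this workbook) from a path like '../assets/state.csv':

```python
import numpy as np
import pandas as pd

state_df = pd.read_csv('../assets/state.csv')  # assumed path, matching the exercise above

q25, q50, q75 = np.percentile(state_df['Population'], [25, 50, 75])
print(f"IQR of Population is {q75 - q25:.2f}")
print(f"25th / 50th / 75th percentiles: {q25:.2f} / {q50:.2f} / {q75:.2f}")
```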

Estimates of skewness and tailedness

Skewness

Skewness, also called the third central moment, is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be:

negative: the left tail is longer, and the mass of the distribution is concentrated on the right side.
positive: the right tail is longer, and the mass of the distribution is concentrated on the left side.

source: https://en.wikipedia.org/wiki/Skewness

Pearson's moment coefficient of skewness estimates the skewness of a random variable X as the standardized third moment:

$$\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$$

Example:


In [6]: import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [143]: ##################################### CODE HERE ######################## #############
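A minimal sketch of the kind of comparison the hidden cell might draw, using scipy's skew on synthetic samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric  = rng.normal(size=10_000)       # skewness close to 0
right_tail = rng.exponential(size=10_000)  # long right tail -> positive skew

print(stats.skew(symmetric))   # ~0
print(stats.skew(right_tail))  # ~2 (theoretical skewness of the exponential)
```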

Kurtosis

Kurtosis, also called Fourth Central Moment, is a measure of the "tailedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution.

The Kurtosis is defined as:

(Fisher) kurtosis: $$\gamma_2 = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] - 3$$

Kurtosis can be a good measure to differentiate between two distributions that have the same mean and the same σ but different tailedness, as in the example below:

import seaborn as sns

In [7]: ##################################### CODE HERE ######################## #############

For the t-distribution: + mean = 0.00 + variance = 111.11 + skewness = 0.00 + kurtosis = 0.38

For the norm-distribution: + mean = 0.00 + variance = 100.00 + skewness = 0.00 + kurtosis = 0.00
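One way moments like those above could be produced (a sketch assuming a t-distribution with 20 degrees of freedom and scale 10, which matches the printed variance and kurtosis, against a normal with σ = 10):

```python
from scipy.stats import t, norm

for name, dist in [("t", t(df=20, scale=10)), ("norm", norm(scale=10))]:
    mean, var, skew, kurt = dist.stats(moments='mvsk')  # kurtosis is Fisher (excess)
    print(f"{name}-distribution: mean={float(mean):.2f} variance={float(var):.2f} "
          f"skewness={float(skew):.2f} kurtosis={float(kurt):.2f}")
```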

Exercise

From the Seaborn datasets, import the iris dataset [[9](#ref9)], then:

1. In two subplots, with a different color for each species , plot: A. for sepal: length against width; B. for petal: length against width.

In [145]: import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
#from mpl_toolkits.mplot3d import Axes3D # for 3d plots

%matplotlib inline

In [146]: iris_df = sns.load_dataset("iris")

In [147]: ### YOUR CODE HERE
# Get the unique "species"
#
# Plot
#
#
###

2. For each of the different species, compute for each sepal/petal length and width: A. Mean B. Median C. Standard Deviation D. Skewness E. Kurtosis F. Interquartile Range

In [149]: ### YOUR CODE HERE # # # ###

Estimates of Distribution

Besides exploring the location and the variability of the data, it is also useful to explore how the data is distributed overall. One method to achieve this goal is to explore the data through plots, such as:

Boxplot
Frequency tables and histograms
Density plot

Boxplot [[11](#ref11)]

Also called a box-and-whisker plot, it is a plot introduced by Tukey to visualize the distribution of data.

A boxplot is a standardized way of displaying the distribution of data based on a five number summary:

“minimum”: Q1 − 1.5 ∗ IQR
first quartile (Q1): 25th percentile
median (Q2): 50th percentile
third quartile (Q3): 75th percentile
“maximum”: Q3 + 1.5 ∗ IQR

source:

https://bit.ly/2YHEIh7
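A minimal sketch of a boxplot on synthetic data (hypothetical, just to show the call):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=10, scale=2, size=200)  # synthetic sample

sns.boxplot(y=data)  # box spans Q1..Q3, line at the median, whiskers at the 1.5*IQR fences
plt.show()
```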

Boxplots visually inform us about:

outliers and their values
symmetry/skewness of the data
how tightly your data is grouped (IQR)

Example: From the state.csv dataset, if we explore the percentiles of the Population :

In [156]: state_df.head()

Out[156]: State Population Murder_Rate Abbreviation

0 Alabama 4779736 5.7 AL

1 Alaska 710231 5.6 AK

2 Arizona 6392017 4.7 AZ

3 Arkansas 2915918 5.6 AR

4 California 37253956 4.4 CA

In [157]: ##################################### CODE HERE #####################################

Out[157]:

Exercise

From Seaborn import the dataset tips [[9](#ref9)]:

1. create boxplot for the total_bill grouped by the day

In [158]: import seaborn as sns
tips_df = sns.load_dataset("tips")

### YOUR CODE HERE # # ###

2. Repeat the previous plot with nested grouping based on the smoker condition

In [160]: ### YOUR CODE HERE # # # ###


Frequency table, density plots and histograms

Frequency tables list the count of numeric data values that fall into a set of equidistant intervals (bins).

This can be performed in Pandas using the methods cut and value_counts() .

To show what these methods do, see the following example:

In [162]: a = pd.Series(np.arange(1,11)*10)
print(a)

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int32

Lets divide this list into four bins of the same size:

In [163]: a_bins = pd.cut(a, 4)
a_bins

Out[163]: 0     (9.91, 32.5]
1     (9.91, 32.5]
2     (9.91, 32.5]
3     (32.5, 55.0]
4     (32.5, 55.0]
5     (55.0, 77.5]
6     (55.0, 77.5]
7    (77.5, 100.0]
8    (77.5, 100.0]
9    (77.5, 100.0]
dtype: category
Categories (4, interval[float64]): [(9.91, 32.5] < (32.5, 55.0] < (55.0, 77.5] < (77.5, 100.0]]

Now, lets get the count of every bin:

In [164]: a_bins_count = a_bins.value_counts()
a_bins_count

Out[164]: (77.5, 100.0]    3
(9.91, 32.5]     3
(55.0, 77.5]     2
(32.5, 55.0]     2
dtype: int64


You can also visualize this count (see also histogram below):

In [165]: ##################################### CODE HERE ######################## #############

Out[165]:

Exercise

For the imported iris dataset:

1. Summarize in table form, for each column: the observations count, mean, standard deviation, minimum, 0.25 quantile, median, 0.75 quantile, and maximum.

In [166]: iris_df = sns.load_dataset("iris")
iris_df.head()

Out[166]: sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa


4 5.0 3.6 1.4 0.2 setosa

In [167]: ### YOUR CODE HERE # # ###

2. Get the count of each of the categories in the species column

In [169]: ### YOUR CODE HERE # # ###

3. For each of the first 4 columns, divide them into 10 equidistant bins, compute the count of values that fall in each of those bins, and display the result in 4 bar subplots:

In [171]: ### YOUR CODE HERE # # ###

Notes

Notes about frequency tables: It is important to include the empty bins; the fact that there are no values in those bins is useful information. It is also useful to experiment with different bin sizes: if the bins are too large, important features of the distribution can be obscured; if too small, the result is too granular and the overall shape of the distribution is hidden.

Histograms are a visualization for the Frequency table.

Density Plots are a smoothed version of the histogram, and are often based on a kernel density estimate (KDE) [[10](#ref10)].

As example:

In [175]: # Simple density plot
##################################### CODE HERE #####################################

Out[175]:

In [178]: ##################################### CODE HERE ######################## #############

Out[178]:

In [179]: ##################################### CODE HERE ######################## #############

Out[179]:
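A sketch of what hidden density-plot cells like these might contain (synthetic bimodal data, not the author's exact code):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 0.5, 500)])  # bimodal sample

plt.hist(x, bins=30, density=True, alpha=0.4)  # histogram on a density scale
sns.kdeplot(x)                                 # smoothed kernel density estimate on top
plt.show()
```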


Violin Plot [[12](#ref12)]

Violin plots combine the visual information of both box plots and density plots.

It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

This can be an effective and attractive way to show multiple distributions of data at once, but keep in mind that the estimation procedure is influenced by the sample size, and violins for relatively small samples might look misleadingly smooth.

Violin plots have many of the same summary statistics as box plots:

the white dot represents the median
the thick gray bar in the center represents the interquartile range
the thin gray line represents the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the interquartile range.

source:

https://bit.ly/2YQCeZG

On each side of the gray line is a kernel density estimation to show the distribution shape of the data. Wider sections of the violin plot represent a higher probability that members of the population will take on the given value; the skinnier sections represent a lower probability.


source:

https://bit.ly/2YQCeZG

As example, using the iris dataset provided by Seaborn:

In [2]: import pandas as pd
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

%matplotlib inline

In [3]: iris_df = sns.load_dataset("iris")
iris_df.head()

Out[3]: sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

In [4]: # Summarize the dataset
##################################### CODE HERE #####################################

Out[4]: sepal_length sepal_width petal_length petal_width

count 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.057333 3.758000 1.199333

std 0.828066 0.435866 1.765298 0.762238

min 4.300000 2.000000 1.000000 0.100000

25% 5.100000 2.800000 1.600000 0.300000

50% 5.800000 3.000000 4.350000 1.300000

75% 6.400000 3.300000 5.100000 1.800000

max 7.900000 4.400000 6.900000 2.500000

In [8]: # Creating violin plots for every feature (every column except the label column (species))

##################################### CODE HERE ######################## #############
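A sketch of what the hidden cell above might do (assumed, not the author's exact code):

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris_df = sns.load_dataset("iris")
features = iris_df.columns[:-1]            # every column except the 'species' label

fig, axes = plt.subplots(1, len(features), figsize=(16, 4))
for ax, col in zip(axes, features):
    sns.violinplot(y=iris_df[col], ax=ax)  # one violin per feature
    ax.set_title(col)
plt.show()
```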

Exercise

Divide the iris dataset into subgroups based on the species categories, then plot (in subplots) the violin plot for each of the four features represented by the columns of the iris dataset (except the species label).

In [ ]: ### YOUR CODE HERE
#
#
###

Correlation

Exploratory data analysis involves examining the relations among predictors, and between predictors and a target variable.

Linear relationships are the most common, but variables can also have:

a non-linear relationship
a monotonic relationship
no relationship

Visual example of a linear relationship and no relationship

source: https://bit.ly/1q6WTYJ

Visual example of a non-linear relationship

If a relationship between two variables is not linear, the rate of increase or decrease can change as one variable changes, causing a "curved pattern" in the data. This curved trend might be better modeled by a nonlinear function, such as a quadratic or cubic function, or be transformed to make it linear.

source: https://bit.ly/2ZSTiiU

Visual example of a monotonic relationship


In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate. In a linear relationship, the variables move in the same direction at a constant rate. The figure below shows both variables increasing concurrently, but not at the same rate. This relationship is monotonic, but not linear.

source: https://bit.ly/2ZSTiiU

Linear correlation

(Linear) correlation exploits the linear relationship between two random variables; for example:

Variables X and Y are said to be positively correlated if high values of X go linearly with high values of Y, and low values of X go linearly with low values of Y.

source: https://bit.ly/31zX5T5

correlation measures:

**1. Pearson's Correlation Coefficient**

The Pearson correlation coefficient is a measure of linear association between two random variables. Although there are other types of correlation (such as the Spearman rank-order correlation coefficient), the Pearson correlation coefficient is the most common, and is defined by:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,\sigma_x \sigma_y} = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Notice that:

we divide by $n - 1$ (degrees of freedom) instead of $n$ to obtain an unbiased estimator.
$-1 \leq r \leq +1$
+1: perfect positive correlation
0: no correlation
−1: perfect negative correlation

As example:

In [51]: import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white', font_scale=1.2)
%matplotlib inline

In [63]: ##################################### CODE HERE ######################## #############
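A minimal sketch of the kind of computation behind a cell like this, using scipy's pearsonr (which returns the coefficient and the p_val discussed in the note below); the data is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)  # strong positive linear relationship

r, p_val = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p_val = {p_val:.3g}")   # r close to +1
```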

Note that: p_val is the two tailed P-value that roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.

The correlation coefficient is sensitive to outliers. As an alternative to the classical correlation coefficient, a trimmed correlation coefficient can be used. For example:


In [64]: ##################################### CODE HERE ######################## #############

NOTE: Spearman and Kendall correlations are more robust versions of correlation.

Data and Sampling distribution [[13](#ref13)]

Random Variable : is a function that associates a real number with each element in the sample space.

For Example: The sample space giving a detailed description of each possible outcome when three electronic components are tested may be written:

S = {NNN, NND, NDN, DNN, NDD, DND, DDN, DDD},

where N denotes non-defective, and D denotes defective.

If the random variable X represents the number of defective items when three electronic components are tested, then each outcome in the sample space S can be assigned a numerical value of 0, 1, 2, or 3.

If X = 2, this means two defective components among the outcomes of the test, which corresponds to the subset:

E = {DDN, DND, NDD}

Discrete and Continuous Sample Space:

Discrete sample space: if a sample space contains a finite number of possibilities, or an unending sequence with as many elements as there are whole numbers, it is called a discrete sample space.
Continuous sample space: if a sample space contains an infinite number of possibilities, equal to the number of points on a line segment, it is called a continuous sample space.

Discrete and Continuous Random Variable

A random variable is called a discrete random variable if its set of possible outcomes is countable.

Discrete random variables represent count data, such as the number of defectives in a sample of k items.

A random variable is called a continuous random variable if its set of possible outcomes is an entire interval of numbers, i.e. it is not countable.

Continuous random variables represent measured data, such as heights, weights, temperatures, distances, or life periods.

Common Distributions [[1](#ref1), [13](#ref13), [14](#ref14)]

Here we provide a review of some useful probability distributions and some of their applications to modeling data.

Distribution: the set of all possible values of a random variable, together with how likely each is to occur. Example: flipping a fair coin for heads and tails is a distribution with:

two possible outcomes
discrete outcomes (the categories heads and tails, no real numbers)
evenly weighted outcomes (heads are just as likely as tails)

The distributions that will be covered are:

1. Discrete distributions
Binomial distribution
Poisson distribution

2. Continuous distributions
Uniform distribution
Normal distribution
Exponential distribution
Gamma distribution
Chi2 distribution
Student's t-distribution
Multivariate Normal distribution
Multivariate t-distribution

Binomial Distribution

Also called the Bernoulli distribution, it is the frequency distribution of the number of successes (x) in a given number (n) of independent binomial trials, with a specified probability (p) of success in each trial (and probability q = 1 − p of failure in each trial): $$b(x; n, p) = \binom{n}{x} p^x q^{n-x}, \qquad x = 0, 1, 2, \ldots, n$$

Binomial trials are trials that have binary (true/false) outcomes.

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$

Choose x from n trials, with no repetition, where order doesn't matter.

As Example:

If the probability of a click converting to a sale is 0.02, the probability of observing 0 sales in 200 clicks, can be represented as binomial probability: b(x; n, p) = b(0; 200, 0.02)

In [1]: import numpy as np
from scipy import stats
from scipy.stats import binom
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [13]: ##################################### CODE HERE ######################## #############

Pr(X = 0) = b(0;200,0.02) = 0.018
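A minimal sketch of how the hidden cell could compute this with scipy:

```python
from scipy.stats import binom

# probability of 0 sales in 200 clicks, with conversion probability 0.02 per click
print(f"Pr(X = 0) = b(0;200,0.02) = {binom.pmf(0, 200, 0.02):.3f}")
```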

Moments of the binomial distribution $b(x; n, p)$:
Mean $= np$
Variance $= np(1-p)$


Exercise

A. Plot three binomial distributions by generating a set of 10000 random variables from these distributions:

n=100, p=0.5 n=200, p=0.5 n=100, p=0.7

In [ ]: ### YOUR CODE HERE # # ###

B. Get the following moments for the previously defined distributions:

Mean Variance Skewness Kurtosis

In [ ]: ### YOUR CODE HERE # # ###

Frequently, we are interested in problems where it is necessary to find the cumulative distribution function (CDF), such as $\Pr(X \leq r)$ or $P(a \leq X \leq b)$. Binomial sums: $$\Pr(X \leq r) = B(r; n, p) = \sum_{x=0}^{r} b(x; n, p)$$
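A minimal sketch of binomial sums with scipy's cdf (hypothetical numbers: 10 fair-coin flips):

```python
from scipy.stats import binom

n, p = 10, 0.5
print(binom.cdf(4, n, p))                       # Pr(X <= 4) = B(4; 10, 0.5)
print(binom.cdf(7, n, p) - binom.cdf(2, n, p))  # Pr(3 <= X <= 7) = B(7) - B(2)
```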

Exercise [[13](#ref13)]

The probability that a patient recovers from a rare blood disease is 0.4. If 15 people are known to have contracted this disease, what is the probability that:

1. at least 10 survive 2. from 3 to 8 survive 3. exactly 5 survive


In [12]: ### YOUR CODE/SOLUTION HERE # # ###

Areas of application: [[13](#ref13)] This distribution applies to any industrial situation where an outcome of a process is dichotomous (binary) and the results of the process are independent, with the probability of success being constant from trial to trial. Often, quality control measures and sampling schemes for processes are based on the binomial distribution.

Poisson Distribution

From prior data we can estimate the average number of events per unit of time or space, but we might also want to know how different this might be from one unit (of time or space) to another. The Poisson distribution tells us the distribution of events per unit of time or space when we sample many such units.

Experiments yielding numerical values of a random variable X, the number of outcomes occurring during a given time interval or in a specified region, are called Poisson experiments.

For Example: X representing

the number of telephone calls received per hour by an office. the number of days school is closed due to snow during the winter.

A Poisson experiment is derived from the Poisson process. Properties of the Poisson process:

The number of outcomes occurring in one time interval or specified region of space is independent of the number that occur in any other disjoint time interval or region.

In this sense we say that the Poisson process has no memory.

The probability that a single outcome will occur during a very short time interval or in a small region is proportional to the length of the time interval or the size of the region, and does not depend on the number of outcomes occurring outside this time interval or region. The probability that more than one outcome will occur in such a short time interval or fall in such a small region is negligible.

The key parameter in a Poisson distribution is λ (lambda). A random variable X that has a Poisson distribution represents the number of events occurring in a fixed time interval with rate parameter λ.

If a random variable X (representing the number of outcomes occurring in a given time/space interval t) is a _Poisson Random Variable_ with parameter λ, λ > 0, then it has the probability mass function (pmf) given by: $$f(x; \lambda t) = \Pr(X = x) = \frac{e^{-\lambda t}(\lambda t)^x}{x!}, \qquad x = 0, 1, 2, \ldots$$

Moments of Poisson distribution ( f(x; λ) ):

Mean $= \lambda$
Variance $= \lambda$

The Poisson probability sums (CDF): $$\Pr(X \leq r) = P(r; \lambda t) = \sum_{x=0}^{r} f(x; \lambda t)$$

Example: During a laboratory experiment, the average number of radioactive particles passing through a counter in 1 millisecond is 4. What is the probability that 6 particles enter the counter in a given millisecond?

x = 6, λt = 4

f(x; λt) = Pr(X = 6)

In [2]: from scipy.stats import poisson as pois

In [6]: ##################################### CODE HERE ######################## #############

f(6;4) = 0.1042
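A minimal sketch of how the hidden cell could compute this:

```python
from scipy.stats import poisson as pois

# average rate of 4 particles per millisecond; probability of exactly 6 in one millisecond
print(f"f(6;4) = {pois.pmf(6, 4):.4f}")
```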

Exercise

A. Plot three poisson distributions by generating a set of 10000 random variables from these distributions:

λt=1 λt=4 λt=10


In [ ]: ### YOUR CODE HERE # # ###

B. Get the following moments for the previously defined distributions:

Mean Variance Skewness Kurtosis

In [ ]: ### YOUR CODE HERE # # ###

Approximation of Binomial Distribution by a Poisson Distribution

Poisson distribution is often used to approximate the binomial. When n is large and p is small (so np is moderate), the number of successes occurring can be approximated by a Poisson random variable with parameter λ = np:

Theorem: Let X be a binomial random variable with probability distribution $b(x; n, p)$. When $n \rightarrow \infty$, $p \rightarrow 0$, and $\mu = np$ remains constant, $$b(x; n, p) \xrightarrow{n \rightarrow \infty} f(x; \mu)$$
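A minimal sketch of the approximation (hypothetical numbers: n = 1000, p = 0.003, so λ = np = 3):

```python
from scipy.stats import binom, poisson

n, p = 1000, 0.003  # large n, small p -> lambda = n*p = 3
for x in range(6):
    print(x, round(binom.pmf(x, n, p), 5), round(poisson.pmf(x, n * p), 5))  # nearly equal
```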

Exponential distribution

The exponential distribution describes the time between events in a Poisson point process (i.e. a process in which events occur continuously and independently at a constant average rate).

The exponential distribution can be used to model the amount of time until a specific event occurs or to model the time between independent events, and is an example of continuous probability distribution.

Some examples where the exponential distribution is appropriate are:

the time until a computer locks up
the time between arrivals of telephone calls
the time until a part fails


If a random variable X (representing the amount of time until an event occurs) is an exponential random variable with parameter λ, λ > 0, then it has the probability density function (pdf) given by: $$f(x; \lambda) = \lambda e^{-\lambda x}, \qquad x \geq 0,\ \lambda > 0$$

Moments of exponential distribution ( f(x; λ) ):

Mean $= \frac{1}{\lambda}$
Variance $= \frac{1}{\lambda^2}$

The cumulative distribution function (CDF) of an exponential random variable is given by: $$F(x) = 1 - e^{-\lambda x}, \qquad x \geq 0$$

The exponential distribution is the only continuous distribution that has the memoryless property. This property describes the fact that the remaining lifetime of an object (whose lifetime follows an exponential distribution) does not depend on the amount of time it has already lived.

In [1]: import numpy as np from scipy import stats from scipy.stats import expon import seaborn as sns import matplotlib.pyplot as plt

%matplotlib inline

In [25]: ##################################### CODE HERE ######################## #############
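Note that scipy parameterizes the exponential with scale = 1/λ. A minimal sketch (hypothetical rate λ = 0.5):

```python
from scipy.stats import expon

lam = 0.5                       # hypothetical rate of events per unit time
dist = expon(scale=1 / lam)     # scipy uses scale = 1/lambda

print(dist.mean(), dist.var())  # 1/lambda = 2.0 and 1/lambda**2 = 4.0
print(dist.cdf(3))              # Pr(X <= 3) = 1 - exp(-lam * 3)
```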

Exercise

The time between arrivals of vehicles at an intersection follows an exponential distribution with a mean of 12 seconds. What is the probability that the time between arrivals is 10 seconds or less?

In [ ]: ### YOUR SOLUTION HERE # # ###

Uniform distribution

This distribution is characterized by a density function that is flat, so the probability is uniform over a closed interval [A, B].

The density function of the continuous uniform random variable X on the interval [A, B] is: $$f(x; A, B) = \begin{cases} \frac{1}{B-A} & \text{if } A \leq x \leq B \\ 0 & \text{elsewhere} \end{cases}$$

The density function forms a rectangle with base $B - A$ and constant height $\frac{1}{B-A}$.

The mean and variance of the uniform distribution are:

mean $= \frac{A+B}{2}$
variance $= \frac{(B-A)^2}{12}$

In [26]: from scipy.stats import uniform

In [42]: ##################################### CODE HERE ######################## #############

+ mean = 2.00 + variance = 0.33 + skewness = 0.00 + kurtosis = -1.20


The cumulative distribution function (CDF) of the continuous uniform random variable X on the interval [A, B] is: $$F(x) = \begin{cases} 0 & \text{if } x < A \\ \frac{x-A}{B-A} & \text{if } x \in [A, B) \\ 1 & \text{if } x \geq B \end{cases}$$
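A sketch consistent with the moments printed above (assuming the hidden cell used A = 1, B = 3; scipy parameterizes the interval as loc = A, scale = B − A):

```python
from scipy.stats import uniform

A, B = 1, 3
dist = uniform(loc=A, scale=B - A)  # assumed interval, matching mean=2.0 and variance=0.33

mean, var, skew, kurt = dist.stats(moments='mvsk')
print(float(mean), float(var), float(skew), float(kurt))  # 2.0, 0.333..., 0.0, -1.2
print(dist.cdf(2.5))                                      # (x - A)/(B - A) = 0.75
```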

Normal distribution [[13](#ref13)]

The most important continuous probability distribution in the entire field of statistics is the normal distribution (also called the Gaussian distribution), since it approximates many phenomena that occur in nature, industry, and research. For example, physical measurements in areas such as rainfall studies and measurements of manufactured parts are often explained with a normal distribution. In addition, errors in scientific measurements are extremely well approximated by a normal distribution.

The probability density function (PDF) of the normal random variable X, with mean μ and variance σ², is: $$n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \qquad -\infty < x < \infty$$

The mean and variance of n(x; μ, σ) are μ and σ2, respectively. Hence, the standard deviation is σ.

In [1]: import numpy as np import pandas as pd from scipy import stats import seaborn as sns import matplotlib.pyplot as plt

%matplotlib inline

In [25]: ##################################### CODE HERE #####################################
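A sketch of the two curves the hidden cell appears to compare (μ = 0, σ = 1 and μ = 3, σ = 3, inferred from the moments printed below):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.linspace(-10, 15, 500)
plt.plot(x, norm.pdf(x, loc=0, scale=1), label='n(x; 0, 1)')
plt.plot(x, norm.pdf(x, loc=3, scale=3), label='n(x; 3, 3)')
plt.legend()
plt.show()
```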

Properties of the normal curve:

The mode (which is the point on the horizontal axis where the curve is at its maximum) occurs at x = μ

In [40]: ##################################### CODE HERE ######################## #############

mode of norm_dist_1 = -0.0 mode of norm_dist_2 = 3.0

The curve is symmetric about a vertical axis through the mean μ

In [41]: ##################################### CODE HERE ######################## #############

+ mean = 0.00 + variance = 1.00 + skewness = 0.00 + kurtosis = 0.00
------
+ mean = 3.00 + variance = 9.00 + skewness = 0.00 + kurtosis = 0.00

The curve has its points of inflection at x = μ ± σ. Approximately 68% of the area under the normal curve will be between x ∈ [(μ − σ), (μ + σ)]

The standard normal distribution is a specific case of the normal distribution where μ = 0 and σ = 1, given by the probability density function (PDF): $$z(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$$


Chi2 distribution [[13](#ref13)]

The chi-squared (χ²) distribution has a single parameter (v) called the degrees of freedom.

The continuous random variable X has a chi-squared distribution with v degrees of freedom if its density function is given by: $$f(x; v) = \begin{cases} \frac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2-1} e^{-x/2} & \text{if } x > 0,\ v > 0 \\ 0 & \text{elsewhere} \end{cases}$$

Gamma function: $\Gamma(t) = (t-1)!$ for positive integer t.

Mean and variance of f(x; v):

Mean $= v$
Variance $= 2v$

In [8]: ##################################### CODE HERE ######################## #############
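A sketch of what a cell like this might plot, with the mean/variance relationship in the labels:

```python
import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

x = np.linspace(0, 20, 500)
for v in (2, 5, 10):  # degrees of freedom
    plt.plot(x, chi2.pdf(x, df=v), label=f'v = {v} (mean {v}, variance {2 * v})')
plt.legend()
plt.show()
```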

[[14](#ref14)] The chi-square distribution is used to derive the distribution of the sample variance and is important for goodness-of-fit tests in statistical analysis.

Sampling Distribution of a Statistic[[1](#ref1)]


Much of classical statistics is concerned with making conclusions (in statistical terms: inferences) from (small) _samples_ to reflect the "true" parameters of a (large) _population_.

source: https://bit.ly/3215Khl

Sample statistic: a metric calculated for a sample of data drawn from a larger population, such as the mean of the drawn sample ($\bar{x}$).

Data Distribution : The frequency distribution (probability distribution) of individual _values_ in a data set.

Sampling distribution: the frequency distribution (probability distribution) of a sample statistic over many samples or resamples.

A sample is drawn with the goal of:

measuring something (with sample statistic) or modeling something (with statistical- or machine-learning model)

Since our estimate or model is based on a sample, it might differ:

from the "true" (of population) parameter/model from one sample to another

We are therefore interested in:

How the sample statistic varies from one sample to another (sampling variability)
How confidently this sample statistic represents the "true" population parameter

This will lead us to the topic of Hypothesis Tests and Confidence Intervals


Sampling Distribution of Means[[13](#ref13)]

Suppose that we have a population (with N elements) with the following parameters:

mean of the population: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
variance of the population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$

Suppose we repeatedly draw samples of n observations from a (very large) population with unknown distribution, and we calculate the mean and variance of each sample ($\bar{x}$, $s^2$). Looking at the samples, we will notice that:

mean of the sample means: $\mu_{\bar{x}} \approx \mu$
variance of the sample means: $\sigma^2_{\bar{x}} \approx \frac{\sigma^2}{n}$, provided that each sample size (n) is large.

The sampling distribution of $\bar{X}$ will approximately have a normal distribution shape. This result led to the Central Limit Theorem.

Central Limit Theorem [[1](#ref1), [13](#ref13)]

This theorem states that the means drawn from multiple samples will resemble a normal curve, even if the source population is not normally distributed, provided that:

the sample size (n) is large enough (in practice: n ≥ 30)
the departure of the population distribution from normality is not too great (not highly skewed).

If $\bar{X}$ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then the limiting form of the distribution of $$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$ as $n \rightarrow \infty$ is the standard normal distribution $n(z; 0, 1)$.

Generally, if n < 30, the approximation is good only if the population is not too different from the normal distribution.

If the population is known to be normally distributed, the sampling distribution of ˉX will follow exactly a normal distribution, no matter how small the size of the samples.

The central limit theorem allows normal-approximation formulas (like the t-distribution) to be used in calculating sampling distributions for inference, that is:

Confidence intervals Hypothesis tests
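A minimal demonstration of the theorem (a hypothetical, heavily right-skewed exponential population; sample means with n = 30 already look roughly normal):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
population = rng.exponential(scale=1.0, size=100_000)  # skewed population, mean = 1

sample_means = [rng.choice(population, size=30).mean() for _ in range(1000)]
plt.hist(sample_means, bins=40)  # roughly bell-shaped around the population mean
plt.show()
```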


Exercise

Import the loan_data.csv data set, then:

In [35]: import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [36]: data_set = pd.read_csv("../assets/loan_data.csv")
data_set.head()

Out[36]:
  Unnamed: 0  status       loan_amnt  term       annual_inc  dti    payment_inc_ratio  revol_bal  revol_util  purpose      ...
0          1  Charged Off       2500  60 months       30000   1.00            2.39320       1687         9.4  ...
1          2  Charged Off       5600  60 months       40000   5.55            4.57170       5210        32.6  small...
2          3  Charged Off       5375  60 months       15000  18.08            9.71600       9279        36.5  ...
3          4  Charged Off       9000  36 months       30000  10.08           12.21520      10452        91.7  debt_con...
4          5  Charged Off      10000  36 months      100000   7.06            3.90888      11997        55.5  ...

5 rows × 21 columns

A. Consider the Loan Amount ( loan_amnt ) as your population. Compute the population size, mean, and variance:

In [ ]: ### YOUR CODE HERE # # ###

B. Create a function that builds a sample distribution: the result of a certain sample statistic computed for each one of the repeated samples.

In [ ]: def repeated_sample(data, function, sample_size, repeat):
    """
    data        : the population that you want to sample (list)
    function    : the sample statistic that you want to compute for each sample
    sample_size : the size of one sample from the population
    repeat      : length of the sample distribution (how many times you repeat your sampling)
    """
    result = []

    ### YOUR CODE HERE
    #
    #
    ###
    return result

C. Compute the mean of samples of different sizes (1, 5, and 30 observations per sample), then create three sample distributions by repeating each of the above sampling processes 1000 times.

In [ ]: repeat = 1000
# Create sample means with sample sizes (1, 5, 30), each repeated 1000 times

### YOUR CODE HERE # # ###

D. Plot histograms of the population and the three sample distributions in separate subplots. Show the mean and variance of each in its title.

In [ ]: ### YOUR CODE HERE # # # ###

References

[1] [BOOK] Practical Statistics for Data Scientists

[2] [BOOK] Exploratory Data Analysis

[3] [BLOG] Different Types of Data Sources

[4] [BLOG] Big Data, structured and unstructured data


[5] [WIKIPEDIA] Time Series

[6] [WIKIPEDIA] Spatial database

[7] [PAPER] A Fast Algorithm for Approximate Quantiles in High Speed Data Streams

[8] [WIKIPEDIA] Skewness

[9] [GITHUB] Seaborn datasets

[10] [BOOK] Python Data Science Handbook

[11] [BLOG] Understanding Boxplots

[12] [BLOG] Violin Plots 101

[13] [BOOK] Probability and Statistics for Engineers and Scientists, Myers

[14] [BOOK] Computational Statistics Handbook with Matlab, W.Martinez

[15] [ARTICLE] Discrete Statistical Distributions

file:///C/...ta_science_in_python/01.intrduction_to_python_for_data_science/04.statistical_analysis/statistical_analysis-WORKBOOK.html[18.09.2019 09:04:37]