Statistical Analysis
statistical_analysis-WORKBOOK
file:///C/...ta_science_in_python/01.intrduction_to_python_for_data_science/04.statistical_analysis/statistical_analysis-WORKBOOK.html[18.09.2019 09:04:37]

In [1]:
## CSS coloring for the dataframe tables
from IPython.display import HTML

# Read both stylesheets and inject them into the notebook as one <style> block
css = open('../style/style-table.css').read() + open('../style/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

Out[1]:

Statistical Analysis

Index

Exploratory Data Analysis
  Data Types
  Estimates of Location
    mean
      exercise
      exercise
      exercise
    median
      exercise
  Estimates of Variability
    Standard deviation and related estimates
      exercise
    Estimates based on percentiles
    Estimates of skewness and tailedness
      exercise
  Estimates of Distribution
    Boxplot
      exercise
    Frequency tables, density plots and histograms
      exercise
      exercise
  Correlation
Data and Sampling distribution
  Common distributions
    Binomial distribution
      Exercise
      Exercise
    Poisson distribution
      Exercise
    Exponential distribution
      Exercise
    Uniform distribution
    Normal distribution
    Chi2 distribution
  Sampling Distribution of a Statistic
    Sampling Distribution of Means
      Exercise

Exploratory Data Analysis [[1](#ref1)]

Statistics has developed mostly in the past century. Probability theory was developed in the 17th to 19th centuries, based on the work of Thomas Bayes, Pierre-Simon Laplace and Carl Friedrich Gauss.

source: https://bit.ly/1LEZVtR

Statistics is an applied science concerned with the analysis and modeling of data. Francis Galton, Karl Pearson and Ronald Fisher laid the pillars of modern statistics by introducing key ideas such as experimental design and maximum likelihood estimation. These and many other statistical concepts formed the foundations of modern data science.

The first step in any data science project is exploring the data. Exploratory Data Analysis (EDA) is a comparatively new area of statistics established by John W. Tukey, who called for a reformation of statistics in his 1962 paper The Future of Data Analysis.
He proposed a new scientific discipline called data analysis that includes statistical inference as one of its components, alongside engineering and computer science techniques. [[2](#ref2)]

source: https://bit.ly/2SuYDdm

With the rapid development of computing power and data analysis software, and the availability of more and bigger data, exploratory data analysis has evolved considerably. David Donoho, in his paper 50 Years of Data Science, traces the roots of data science back to Tukey's work.

Data Types [[1](#ref1), [3](#ref3)]

DIFFERENT TYPES OF DATA SOURCES

Data manifests itself in many different shapes. Each shape of data may hold much value for your targeted goal, but in some shapes this value is easier to extract than in others. Different shapes of data require different storage solutions and should therefore be dealt with in different ways. The shapes of data are distinguished as follows:

1. Unstructured Data

Unstructured data is the raw form of data. It can be any type of file, for example text, pictures, sounds, sensor measurements or videos. This data is often stored in a repository of files. Extracting value out of this shape of data is often the hardest, since you first need to extract structured features that describe or abstract the data.

Experts estimate that 80 to 90 percent of the data in any organization is unstructured. The amount of unstructured data in enterprises is also growing significantly, often many times faster than structured databases are growing. [[4](#ref4)]

source: https://bit.ly/2GmXK1F

2. Structured Data

Structured data is tabular data (rows and columns) that is very well defined, meaning that we know which columns there are and what kind of data they contain.

source: https://bit.ly/2rdzwi9

The two basic types of structured data are: [[1](#ref1)]

1. Numeric, which comes in two forms:
   A. Continuous: data that can take any real value in an interval ($x \in \mathbb{R}$), such as speed or time duration.
      synonyms: interval, float, numeric
   B. Discrete: data that takes only integer values ($x \in \mathbb{Z}$), such as the count of occurrences of an event.
      synonyms: integer, count
2. Categorical: data that takes only a fixed set of values (such as car, bike or truck).
   synonyms: enums, enumerated, factors, nominal, polychotomous
   Two special cases of categorical data:
   A. Binary data: the data can take one of two possible values, such as 0/1, true/false, cat/dog, ...
      synonyms: dichotomous, logical, indicator, boolean
   B. Ordinal data: the order of the values matters, for example very cold, cold, warm, hot.
      synonyms: ordered factor

Knowing the data type is important when exploring your data, because it determines the appropriate type of visual display, data analysis, and even statistical or machine learning model.

The typical frame of data used in data science is the rectangular data object, such as a spreadsheet or a pandas dataframe. It is essentially a two-dimensional table/matrix, where:

the rows represent the observation samples
  also called: cases, examples, instances, records
the columns represent the features
  also called: attributes, inputs, predictors, variables

The dataset that you need to analyse does not always come in rectangular form (unstructured data such as text, for example), so it has to be processed into rectangular form first.
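As a rough illustration of how the data types above can appear in a rectangular pandas dataframe, the sketch below builds a small table with one column per type. The `vehicles` dataframe, its column names and its values are all hypothetical and chosen only for this example; the mapping from data type to pandas dtype shown here is one reasonable choice, not the only one.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
vehicles = pd.DataFrame({
    "speed_kmh": [42.5, 17.0, 63.2, 88.9],                # continuous
    "passengers": [1, 1, 3, 2],                           # discrete
    "vehicle": ["car", "bike", "truck", "car"],           # categorical
    "is_electric": [False, True, False, True],            # binary
    "temperature": ["cold", "warm", "hot", "very cold"],  # ordinal
})

# Nominal categorical: a fixed set of values with no inherent order.
vehicles["vehicle"] = vehicles["vehicle"].astype("category")

# Ordinal categorical: the order of the levels is meaningful.
levels = ["very cold", "cold", "warm", "hot"]
vehicles["temperature"] = pd.Categorical(
    vehicles["temperature"], categories=levels, ordered=True
)

print(vehicles.dtypes)                 # one dtype per feature (column)
print(vehicles.shape)                  # (number of records, number of features)
print(vehicles["temperature"].min())   # lowest level according to the declared order
```

Declaring the ordinal column as an ordered categorical lets comparisons, sorting and `min`/`max` follow the order of `levels` rather than alphabetical order.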
Besides rectangular data, other important data structures are used in data science: the nonrectangular data structures. These can come in the following forms:

1. Time series: a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time; thus it is a sequence of discrete-time data. Examples of time series are the heights of ocean tides, counts of sunspots, or the measurements of a sensor over time. [[5](#ref5)]

source: https://bit.ly/2SX7ghq

2. Spatial data structures: data structures used in mapping and location analytics. They are optimized for storing and querying data that represents objects defined in a geometric space. Most spatial databases allow the representation of simple geometric objects such as points, lines and polygons. [[6](#ref6)] The structure of this data is more complex than that of rectangular data. Possible representations of this data structure: [[1](#ref1)]

Object representation: the focus of the data is an object (e.g. a house) and its spatial coordinates.
Field view representation: focuses on small units of space and the value of a relevant metric (for example, pixel brightness).

source: https://bit.ly/2YfljVn

3. Graph (or network) data structures: used to represent physical, social, and abstract relationships, for example a graph of a social network. Graph structures are useful for certain types of problems, such as network optimization and recommender systems.
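To make two of these nonrectangular structures more concrete, here is a minimal sketch, assuming made-up sensor readings and a made-up friendship network. The names `readings`, `friends` and `degree` are hypothetical; a plain dictionary is used for the graph only to keep the example small, and dedicated graph libraries offer far richer functionality.

```python
import pandas as pd

# Time series: hypothetical sensor readings indexed by equally spaced (daily) timestamps.
timestamps = pd.date_range("2019-09-18", periods=4, freq="D")
readings = pd.Series([1.2, 1.5, 1.1, 1.7], index=timestamps)

# The gap between consecutive timestamps is constant for an equally spaced series.
steps = readings.index.to_series().diff().dropna()
print(steps.nunique())  # number of distinct step sizes

# Graph: a hypothetical social network stored as an adjacency list,
# mapping each person (node) to the people they are connected to (edges).
friends = {
    "ana": ["ben", "cy"],
    "ben": ["ana"],
    "cy": ["ana", "dee"],
    "dee": ["cy"],
}

# The degree of a node is the number of edges attached to it.
degree = {person: len(neighbors) for person, neighbors in friends.items()}
print(degree)
```

Neither structure fits naturally into a single table of independent rows: the time series depends on the order of its index, and the graph depends on the links between its nodes.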
source: https://bit.ly/2KeXXW1

Estimates of Location [[1](#ref1)]

Also called a measure of "central tendency", this is the basic step in exploring numeric data: estimating where most of the data is located. The basic estimators of location are the mean and the median.

Mean

The mean, also called the average or mathematical expectation, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values. For a dataset with n values, $x = \{x_1, x_2, \ldots, x_n\}$, the mean is defined as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

In statistics, when denoting the number of observations, an uppercase letter (for example N) refers to the population, while a lowercase letter (for example n) refers to a sample from that population.

The mean is sensitive to outliers, but the effect of a single outlier decreases as the number of samples (n) increases.

Exercise

1. From the sub_set dataset below:
   A. Plot the points.
   B. Compute the mean of the tips and plot that mean as a horizontal line.

In [118]:
import numpy as np
import seaborn as sns
sns.set()  # apply the default seaborn plot style
import matplotlib.pyplot as plt
from pandas import DataFrame as df

In [119]:
# Load the example "tips" dataset and keep only the two columns of interest
dataset = sns.load_dataset("tips")
sub_set = dataset.loc[:, ['total_bill', 'tip']]
sub_set.head()
### YOUR CODE HERE
#
#
###

Out[119]:
   total_bill   tip
0       16.99  1.01
1       10.34  1.66
2       21.01  3.50
3       23.68  3.31
4       24.59  3.61

2. Create sub_set_with_outlier by taking the first 15 elements of sub_set, then:
   A. Plot the points.
   B. Compute the mean of the tips and plot that mean as a horizontal line.

In [121]:
### YOUR CODE HERE
#
#
###

Exercise

Repeat the previous example but with taking the sub_set_with_outlier