Data Analysis

Data Analysis
Numpy, Matplotlib and Pandas
by Bernd Klein
bodenseo

© 2021 Bernd Klein
All rights reserved. No portion of this book may be reproduced or used in any manner without written permission from the copyright owner.
For more information, contact address: [email protected] www.python-course.eu

Python Course
Data Analysis With Python
by Bernd Klein

Numpy Tutorial
Numpy Tutorial: Creating Arrays
Data Type Objects, dtype
Numerical Operations on Numpy Arrays
Numpy Arrays: Concatenating, Flattening and Adding Dimensions
Python, Random Numbers and Probability
Weighted Probabilities
Synthetical Test Data With Python
Numpy: Boolean Indexing
Matrix Multiplication, Dot and Cross Product
Reading and Writing Data Files
Overview of Matplotlib
Format Plots
Matplotlib Tutorial
Shading Regions with fill_between()
Matplotlib Tutorial: Spines and Ticks
Matplotlib Tutorial, Adding Legends and Annotations
Matplotlib Tutorial: Subplots
Exercise
Exercise
Matplotlib Tutorial: Gridspec
GridSpec using SubplotSpec
Matplotlib Tutorial: Histograms and Bar Plots
Matplotlib Tutorial: Contour Plots
Introduction into Pandas
Data Structures
Accessing and Changing values of DataFrames
Pandas: groupby
Reading and Writing Data
Dealing with NaN
Binning in Python and Pandas
Expenses and Income Example
Net Income Method Example

NUMERICAL PROGRAMMING WITH PYTHON

NUMERICAL PROGRAMMING DEFINITION

The term "numerical computing", also known as scientific computing, can be misleading. One can think of it as "having to do with numbers", as opposed to algorithms that deal with text, for example. If you think of Google and the way it provides links to websites for your search queries, you may regard the underlying algorithm as a text-based one. Yet the core of the Google search engine is numerical: to run the PageRank algorithm, Google carries out what has been called the world's largest matrix computation.

Numerical computing is the area of computer science and mathematics that deals with algorithms for the numerical approximation of problems from mathematical analysis, in other words, algorithms that solve problems involving continuous variables. Numerical analysis is used to solve problems in science and engineering.

DATA SCIENCE AND DATA ANALYSIS

This tutorial can be used as an online course on numerical Python as it is needed by data scientists and data analysts. Data science is an interdisciplinary subject that draws on, for example, statistics and computer science, especially programming and problem-solving skills.
Data science includes everything that is necessary to create and prepare data, to manipulate, filter and cleanse data, and to analyse data. Data can be both structured and unstructured. We could also say that data science includes all the techniques needed to extract information and insight from data. Data science is an umbrella term that incorporates data analysis, statistics, machine learning and other related scientific fields in order to understand and analyse data.

Another term that occurs quite often in this context is "Big Data". Big Data is certainly one of the most frequently used buzzwords in software-related marketing. Marketing managers have found that using the term can boost the sales of their products, regardless of whether they are really dealing with big data or not. The term is often used in fuzzy ways. Big data is data that is so large and complex that it is hard for data-processing application software to deal with it. The problems include capturing and collecting data, storing it, searching it, visualizing it, querying it, and so on.

The following concepts are associated with big data:
• volume: the sheer amount of data, whether it is giga-, tera-, peta- or exabytes
• velocity: the speed of arrival and processing of data
• veracity: uncertainty or imprecision of data
• variety: the many sources and types of data, both structured and unstructured

The big question is how useful Python is for these purposes. If we used only Python without any special modules, the language would perform poorly on the tasks mentioned above. We will describe the necessary tools in the following chapter.
CONNECTIONS BETWEEN PYTHON, NUMPY, MATPLOTLIB, SCIPY AND PANDAS

Python is a general-purpose language, and as such it can be and is widely used by system administrators for operating system administration, by web developers as a tool to create dynamic websites, and by linguists for natural language processing tasks. Being a truly general-purpose language, Python can of course be used to solve numerical problems as well, even without any special numerical modules. So far so good, but the crux of the matter is execution speed. Pure Python without any numerical modules could not be used for the numerical tasks that Matlab, R and other languages are designed for. When it comes to computational problem solving, it is of the greatest importance to consider the performance of algorithms, in terms of both speed and memory usage.

If we use Python in combination with its modules NumPy, SciPy, Matplotlib and Pandas, it belongs among the top numerical programming languages. In this combination it can be as efficient as Matlab or R, if not more so.

NumPy is a module that provides the basic data structures, implementing multi-dimensional arrays and matrices. In addition, the module supplies the functionality needed to create and manipulate these data structures. SciPy is built on top of NumPy, i.e. it uses the data structures provided by NumPy. It extends the capabilities of NumPy with further useful functions for minimization, regression, Fourier transforms and many others.

Matplotlib is a plotting library for the Python programming language and for numerically oriented modules like NumPy and SciPy.

The youngest child in this family of modules is Pandas. Pandas uses all of the previously mentioned modules. It is built on top of them to provide a module for the Python language that is also capable of data manipulation and analysis. Pandas focuses on offering data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data". Pandas is well suited for working with tabular data
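
The relationship between pure Python, NumPy and Pandas described in this section can be illustrated with a small sketch. It is only an illustration under stated assumptions: the temperature values, the weekday labels and all variable names are invented for this example and do not come from the tutorial. The sketch performs the same conversion once with a pure Python loop and once as a single NumPy array expression, and then wraps the NumPy arrays in a Pandas DataFrame to show that the table is backed by NumPy data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented purely for this illustration.
temperatures_celsius = [21.5, 19.0, 23.25, 18.0]

# Pure Python: convert element by element in an interpreted loop.
fahrenheit_loop = [c * 9 / 5 + 32 for c in temperatures_celsius]

# NumPy: the same arithmetic applied to the whole array at once;
# the element-wise loop runs inside NumPy's compiled code.
celsius_array = np.array(temperatures_celsius)
fahrenheit_array = celsius_array * 9 / 5 + 32

# Both approaches should produce the same numbers.
same_result = np.allclose(fahrenheit_loop, fahrenheit_array)

# Pandas: a labelled table whose columns are backed by NumPy arrays.
table = pd.DataFrame(
    {"celsius": celsius_array, "fahrenheit": fahrenheit_array},
    index=["Mon", "Tue", "Wed", "Thu"],  # hypothetical row labels
)

# Boolean filtering on a column, in the same style as NumPy indexing.
warm_days = table[table["celsius"] > 20.0]

print(same_result)
print(warm_days)
print(type(table["celsius"].to_numpy()))
```

For four values the speed difference is negligible. The benefit of the array expression typically shows up with large arrays, where the per-element overhead of the interpreted loop dominates.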
