
LATIS Research Workshop Series
Data Munging with pandas
Pernu Menheer, Research Programmer, LATIS
David Olsen, Research Systems Engineer, LATIS

Materials: z.umn.edu/pandas
Course Eval: z.umn.edu/pandas-eval
Notebooks: nb.latis.umn.edu
Liberal Arts Technologies and Innovation Services (LATIS)

Workshop Roadmap

What you will learn
• Overview of the pandas package
• How to instantiate pandas data structures
• How to select and filter the content within a pandas DataFrame
• How to modify the content of a pandas DataFrame
• How to perform exploratory and simple plotting
• How to clean and massage data using methods built into the pandas DataFrame

Future directions
• Using pandas to model
• Improving pandas plotting
• Using pandas with databases
• Optimizing pandas for speed

What is pandas?

• pandas = Panel Data
• Python library for data analysis and statistical computing on structured data sets
• Built on the NumPy ndarray to provide fast computation
• Vectorization built in
• Provides hierarchical axis indexing, which gives out-of-the-box data alignment
• Great for:
  • Heterogeneously-typed tabular data
  • Time series data
  • The split-apply-combine paradigm
  • Relational operations
  • Robust I/O
API reference: https://pandas.pydata.org/pandas-docs/stable/api.html

The Series

The Series, the primary building block of pandas data structures, is a one-dimensional sequence composed of an index and values.

Indices

An index is an immutable ndarray that implements an ordered, sliceable set. Indices are required for pandas objects, but a RangeIndex is created automatically if one is not otherwise given or set.
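A minimal sketch of building a Series (the values and labels here are made up for illustration):

    import pandas as pd

    # Values plus an explicit index of labels
    s = pd.Series([1.2, 3.4, 5.6], index=["a", "b", "c"])

    # With no index given, pandas creates a RangeIndex (0, 1, 2, ...)
    s_default = pd.Series([10, 20, 30])

    print(s.index)    # Index(['a', 'b', 'c'], dtype='object')
    print(s.values)   # array([1.2, 3.4, 5.6])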

The DataFrame

A pandas DataFrame is a two-dimensional matrix composed of an index, columns, and values.

Inspecting a DataFrame: the .head and .tail functions, the .info function, the .shape attribute, and the .describe function.
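A short sketch of these inspection helpers on a small made-up DataFrame:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8, 22.0],
                       "wind": [3.2, 4.1, 2.8, 5.0]})

    df.head(2)      # first 2 rows (default is 5)
    df.tail(2)      # last 2 rows
    df.info()       # index and column dtypes plus memory usage
    df.shape        # (4, 2) -- rows, columns
    df.describe()   # count, mean, std, min, quartiles, max per numeric column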

The DataFrame Class

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Parameter Description

data An ndarray, dict, Series, or another DataFrame containing the values for the DataFrame.

index An Index or array to be used in the DataFrame. Defaults to a RangeIndex.

columns Column labels for the DataFrame. Defaults to a RangeIndex if not provided.

dtype The data type to which to cast the data; otherwise the DataFrame infers it.

copy A Boolean indicating whether to copy the data from the data input (not applicable to all data types).

A DataFrame from a dictionary; a DataFrame from a CSV file
Dropping columns; (re)naming columns; setting the index; sorting the index
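A hedged sketch of these creation and clean-up steps; the file path, column labels, and values are placeholders rather than the workshop data:

    import pandas as pd

    # From a dictionary: keys become the column labels
    df = pd.DataFrame({"date": ["2018-01-01", "2018-01-02"],
                       "gdp": [19831.0, 19855.0],
                       "unused": [0, 0]})

    # From a CSV file (the path here is hypothetical)
    # df = pd.read_csv("../data/example.csv")

    df = df.drop(columns=["unused"])          # drop an unneeded column
    df = df.rename(columns={"gdp": "GDP"})    # relabel a column
    df = df.set_index("date").sort_index()    # set and sort the index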

Saving a DataFrame

The robust I/O features of pandas DataFrames make converting to different formats an (almost) trivial exercise.

to_clipboard([excel, sep])  Write a text representation of the object to the system clipboard (can be pasted into Excel, for example).
to_csv([path_or_buf, sep, na_rep, ...])  Write the DataFrame to a comma-separated values (CSV) file.
to_dense()  Return a dense representation of the NDFrame (as opposed to sparse).
to_dict([orient, into])  Convert the DataFrame to a dictionary.
to_excel(excel_writer[, sheet_name, na_rep, ...])  Write the DataFrame to an Excel sheet.
to_feather(fname)  Write out the binary Feather format for DataFrames.
to_gbq(destination_table, project_id[, ...])  Write the DataFrame to a Google BigQuery table.
to_hdf(path_or_buf, key, **kwargs)  Write the contained data to an HDF5 file using HDFStore.
to_html([buf, columns, col_space, header, ...])  Render the DataFrame as an HTML table.
to_json([path_or_buf, orient, date_format, ...])  Convert the object to a JSON string.
to_latex([buf, columns, col_space, header, ...])  Render the object to a LaTeX tabular environment table.
to_msgpack([path_or_buf, encoding])  Serialize the object to msgpack at the given file path.
to_panel()  Transform long (stacked) DataFrame format into wide (3D Panel) format.
to_parquet(fname[, engine, compression])  Write the DataFrame to the binary Parquet format.
to_period([freq, axis, copy])  Convert the DataFrame from a DatetimeIndex to a PeriodIndex with the desired frequency.
to_pickle(path[, compression, protocol])  Pickle (serialize) the object to the given file path.
to_records([index, convert_datetime64])  Convert the DataFrame to a record array.
to_sparse([fill_value, kind])  Convert to a SparseDataFrame.
to_sql(name, con[, flavor, schema, ...])  Write the records stored in the DataFrame to a SQL database.
to_stata(fname[, convert_dates, ...])  Write a Stata binary dta file.
to_string([buf, columns, col_space, header, ...])  Render the DataFrame to console-friendly tabular output.
to_timestamp([freq, how, axis, copy])  Cast to a DatetimeIndex of timestamps at the beginning of the period.
to_xarray()  Return an xarray object from the pandas object.
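For example, a cleaned DataFrame can be written out and read back in (the file names here are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])

    df.to_csv("cleaned.csv")       # plain text; the index is written by default
    df.to_pickle("cleaned.pkl")    # binary; preserves dtypes and index exactly
    back = pd.read_csv("cleaned.csv", index_col=0)   # round-trip the CSV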

Try It!

1. Import fred_resampled_weekly.csv from ../data/fred into a pandas DataFrame and view its content.

2. Work the imported DataFrame to set the index, drop unneeded columns, and update the column labels. Save the result as fred_resampled_cleaned.csv for later use.

The indexing operator

The most basic way to select subsets of the data is with the [ ] operator. For a DataFrame, indexing with a single column label returns that column as a Series.
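For instance, with a small made-up DataFrame:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]},
                      index=["mon", "tue", "wed"])

    df["temp"]              # a single label returns that column as a Series
    df[["temp", "wind"]]    # a list of labels returns a DataFrame
    df[0:2]                 # a slice selects rows, not columns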

The .loc indexer

The .loc indexer allows multidimensional selection based on the labels of the index.
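A brief sketch, using the same kind of made-up DataFrame as above:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]},
                      index=["mon", "tue", "wed"])

    df.loc["tue"]                      # one row, selected by index label
    df.loc["mon":"tue", "wind"]        # label slices include both endpoints
    df.loc[["mon", "wed"], ["temp"]]   # lists of row labels and column labels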

The .iloc indexer

Similar to the .loc indexer, but it uses integer positions rather than labels to subset the data.
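A short sketch on a made-up DataFrame:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]},
                      index=["mon", "tue", "wed"])

    df.iloc[0]             # the first row, by integer position
    df.iloc[0:2, 1]        # rows 0 and 1 (the end is excluded), second column
    df.iloc[[0, 2], :]     # rows at positions 0 and 2, all columns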

Selection by attribute (dot notation)

pandas also exposes the column labels of a DataFrame as attributes, provided a label is a valid Python identifier and does not conflict with an existing method name.
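For example, with made-up columns:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]})

    df.temp    # equivalent to df["temp"] for this label
    # A label such as "wind speed" (or one shadowing a method name, e.g. "count")
    # can only be reached with the [] or .loc style of selection.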

MultiIndex slicing

The indexers for MultiIndex slicing work in much the same way as their one-dimensional counterparts.
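A small sketch with a made-up two-level index (the figures are illustrative only):

    import pandas as pd

    idx = pd.MultiIndex.from_product([["CA", "US"], [2016, 2017]],
                                     names=["country", "year"])
    df = pd.DataFrame({"gdp": [1.5, 1.6, 18.7, 19.5]}, index=idx)   # made-up values

    df.loc["US"]                        # every year under one outer label
    df.loc[("US", 2017), "gdp"]         # a single cell, by full label tuple
    df.loc[pd.IndexSlice[:, 2017], :]   # every country, year 2017 only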

Boolean indexing

We can also select rows with a Boolean vector produced by a predicate expression.
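For instance, on a made-up DataFrame:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]})

    df[df["wind"] > 3]                                    # rows where the predicate holds
    df.loc[(df["wind"] > 3) & (df["temp"] < 21), "temp"]  # combine predicates with & / | and parentheses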

Using a callable

Indeed, we can index with any arbitrary function, so long as it accepts a single argument (the calling object) and returns valid output for indexing.
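A brief sketch on made-up data:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]})

    df.loc[lambda d: d["temp"] > 20]   # the callable receives the DataFrame itself
    df[lambda d: ["wind"]]             # it can also return column labels for []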

Chaining

Chaining allows calling methods one after another to 'pipe' the output of one method as the input to the next.
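A sketch on made-up data; each method returns a new object, so the calls string together:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]})

    result = (df[df["wind"] > 3]
                .rename(columns={"temp": "temperature"})
                .sort_values("temperature")
                .head(5))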

The .sample function

We can also return a (weighted) random sample of the rows of a DataFrame.
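For instance, on a made-up DataFrame:

    import pandas as pd

    df = pd.DataFrame({"temp": [20.1, 21.4, 19.8], "wind": [3.2, 4.1, 2.8]})

    df.sample(n=2)                         # two rows drawn at random
    df.sample(frac=0.5, random_state=0)    # a fraction of the rows, reproducibly
    df.sample(n=2, weights="wind")         # draws weighted by the 'wind' column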

Try It!

1. Open the Jupyter Notebook titled selecting_data.ipynb. Start by importing the met tower readings from the turbine dataset and reviewing the data and the DataFrame metadata. Then complete the selection statements that are part of Exercise 1.

Combining data: automatic alignment, the .concat() function, a DataFrame from multiple CSV files, and the .merge() function.
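A sketch of aligning and joining made-up pieces (the glob pattern and paths are hypothetical):

    import pandas as pd

    jan = pd.DataFrame({"value": [1, 2]}, index=["2018-01-01", "2018-01-02"])
    feb = pd.DataFrame({"value": [3, 4]}, index=["2018-02-01", "2018-02-02"])
    stacked = pd.concat([jan, feb])      # stack rows; indexes and columns are aligned

    left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
    right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})
    joined = pd.merge(left, right, on="key", how="inner")   # SQL-style join on 'key'

    # Reading many CSV files into one DataFrame (paths are hypothetical):
    # import glob
    # frames = [pd.read_csv(f) for f in glob.glob("../wb_data/*.csv")]
    # combined = pd.concat(frames)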

Try It!

1. Append all of the series from the World Bank dataset (located in ../wb_data/) into a list of DataFrames using the glob module. Then use the pandas concat function to collapse the list of DataFrames into a single DataFrame. Finally, set the DataFrame index to country_code and series_code.

2. Import met_tower.csv and scada.csv from the turbine dataset (located in ../turbine/) into two DataFrames. Then use the pandas merge function to combine the meteorological data with the blade data on the Timestamp field. Finally, set the index to Timestamp.

Creating new columns, the .groupby function, and the .sort_values function.
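A sketch of these three operations with made-up values:

    import pandas as pd

    df = pd.DataFrame({"sex": ["F", "M", "F", "M"],
                       "hours": [2.5, 1.0, 3.0, 2.0]})

    df["minutes"] = df["hours"] * 60                   # a new column from existing ones
    by_sex = df.groupby("sex")["hours"].mean()         # split-apply-combine in one line
    ranked = df.sort_values("hours", ascending=False)  # order rows by a column's values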

Descriptive statistics

Method Description

corr Computes pairwise correlation of columns (excludes NaN).

count Computes the number of observations (excluding NaN).

describe A metafunction to return summary of a dataset's distribution

mean Computes the mean value

max Returns the maximum value

median Computes the median value

min Returns the minimum value

mode Computes the mode(s)

pct_change Computes the percent change over a given number of periods

quantile Returns values at a given quantile

rank Computes numerical ranks

round Rounds values to number of decimal places

sum Computes sum of values

std Computes the standard deviation.

The .nlargest function and the .transpose function.
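A sketch of a few of these on made-up figures:

    import pandas as pd

    df = pd.DataFrame({"gdp": [19.5, 12.2, 4.9, 3.7],    # made-up figures
                       "pop": [325, 1386, 127, 83]},
                      index=["US", "CN", "JP", "DE"])

    df.describe()            # distribution summary for every numeric column
    df["gdp"].mean()         # a single descriptive statistic
    df.nlargest(2, "gdp")    # the two rows with the largest 'gdp' values
    df.transpose()           # swap rows and columns (also available as df.T)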

The .plot function

df.plot(x=None, y=None, kind='line', figsize=None, title=None, rot=None, colormap=None, subplots=False, ..., **kwargs)

Parameter Description

x label or position

y label or position

kind Type of plot, e.g., line, bar, hist, box, kde, scatter, …

figsize Tuple of plot width, height

title Title of the plot (or array if subplots)

rot rotation of ticks

colormap Colormap to select colors from

kwargs Additional keyword options passed through to the plotting backend.

Line plots, scatter plots, boxplots, histograms, and saving plots.
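A sketch, assuming matplotlib is installed as the plotting backend (the data are made up):

    import pandas as pd

    df = pd.DataFrame({"year": [2014, 2015, 2016, 2017],
                       "gdp": [17.5, 18.2, 18.7, 19.5]})    # made-up values

    ax = df.plot(x="year", y="gdp", kind="line",
                 figsize=(8, 4), title="GDP over time", rot=45)
    ax.get_figure().savefig("gdp.png")    # persist the rendered figure to disk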

Try It!

1. Import the World Bank data set compiled in a previous exercise. Explore the GDP series (NY.GDP.MKTP.KD) and answer questions about the top 10 countries (by GDP) and about the US share of global GDP.

2. Import the ATUS 2005 data set and use groupby to answer questions about time use grouped by birth sex.

3. Import the turbine data set and explore the relationship between wind speed and real power generated.

Duplicate data and dropping duplicate values

Missing data: the .dropna function, the .fillna function, and the .interpolate function.

Conversion of data types

There are a multitude of methods that can be used to convert types, and pandas helps us in two ways. First, there are functions (e.g., to_datetime()) that are part of the pandas module itself. Second, the .astype() method is useful for casting a pandas object to a specific type.

The .apply function

The apply function aptly applies a given function to the values along an axis of the pandas object.

The .map function

The map function is similar to apply, though map works element-wise on a Series and accepts not only functions but also dictionaries or other Series.

The .resample function

The .melt function

Often we obtain data in a tabular format that is easy for us to read but not great for analysis. The pandas melt function can be used to transform wide data into long format.

The .pivot_table function

The reverse often holds true as well: we get key-value pairs and want to create columns based on those values. The pandas pivot_table function lets us go from long format back to wide. A combined sketch of these cleaning and reshaping steps follows.
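A combined, hedged sketch of these cleaning and reshaping steps on a small made-up table (column names and values are placeholders, not the workshop data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"date": ["2018-01-01", "2018-01-08", "2018-01-15"],
                       "temp": [20.1, np.nan, 19.8],
                       "wind": ["3.2", "4.1", "2.8"]})

    df["date"] = pd.to_datetime(df["date"])     # module-level conversion function
    df["wind"] = df["wind"].astype(float)       # cast a column to a specific dtype
    df = df.drop_duplicates()                   # drop exact duplicate rows

    df["temp"].fillna(df["temp"].mean())        # fill gaps with a value...
    df["temp"].interpolate()                    # ...or estimate them from neighbours
    df.dropna()                                 # ...or drop incomplete rows entirely

    df["wind_kmh"] = df["wind"].apply(lambda v: v * 3.6)                  # apply a function value by value
    df["breeze"] = (df["wind"] > 3).map({True: "breezy", False: "calm"})  # map also accepts dicts

    # Wide -> long with melt, long -> wide again with pivot_table
    long_form = df.melt(id_vars="date", value_vars=["temp", "wind"],
                        var_name="variable", value_name="value")
    wide_form = long_form.pivot_table(index="date", columns="variable", values="value")

    # Resampling needs a DatetimeIndex; downsample the weekly rows to monthly means
    monthly = df.set_index("date")[["temp", "wind"]].resample("M").mean()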

Try It!

1. Import the World Bank dataset and melt the data for all series to create a DataFrame containing series_name, series_code, country_name, country_code, year, and value.

2. Import the FRED data, index based on the Date field, resample monthly, and create a plot of GDP.

Challenge

1. Take what you have learned today and try applying that knowledge to a synthetic set of participant data that is full of problems. After cleaning up the data, do some EDA and see what you can come up with.

Note: Do not feel compelled to limit yourself to the scripted exercise. Remember, the purpose of today was to give you tools to do the 'E' part of EDA.

LATIS Research Workshops – Spring 2018

Atlas.ti – Michael Beckstrand – 09 Feb 2018, 09:30 – 12:00, BruH 131A
Reproducible Research in Qualtrics – Alicia Hofelich Mohr and Andy Sell – 16 Feb 2018, 09:30 – 12:00, BruH 131A
Introduction to Linux II – Judy Kallestad, Tom Lindsay, and David Olsen – 23 Feb 2018, 09:30 – 12:00, BruH 131A
Qualitative Analysis Mixed Methods – Michael Beckstrand – 02 Mar 2018, 09:30 – 12:00, BruH 131A
Introduction to Python – Pernu Menheer and David Olsen – 09 Mar 2018, 09:30 – 12:00, BruH 131A
Introduction to R – Alicia Hofelich Mohr and David Olsen – 23 Mar 2018, 09:30 – 12:00, BruH 131A
Data Management in Transition – Alicia Hofelich Mohr – 30 Mar 2018, 09:30 – 12:00, BruH 131A
Parallel Computing – David Olsen – 06 Apr 2018, 09:30 – 12:00, BruH 131A
NVivo – Michael Beckstrand – 13 Apr 2018, 09:30 – 12:00, Appleby 128
Advanced – Alicia Hofelich Mohr and David Olsen – 20 Apr 2018, 09:30 – 12:00, BruH 131A
Intro to SQL and Research Databases – David Olsen and Robert Wozniak – 27 Apr 2018, 09:30 – 12:00, BruH 131A

Contact: Pernu Menheer, David Olsen
Workshop links:
Materials: http://z.umn.edu/pandas
Evaluation: http://z.umn.edu/pandas-eval
LATIS Research Division