Cheat Sheet: the Pandas Dataframe Object

Cheat Sheet: the Pandas Dataframe Object

Cheat Sheet: The pandas DataFrame Object Preliminaries Get your data into a DataFrame Always start by importing these Python modules Instantiate an empty DataFrame import numpy as np df = DataFrame() import matplotlib.pyplot as plt import pandas as pd Load a DataFrame from a CSV file from pandas import DataFrame, Series df = pd.read_csv('file.csv') # often works Note: these are the recommended import aliases df = pd.read_csv('file.csv', header=0, Note: you can put these into a PYTHONSTARTUP file index_col=0, quotechar='"', sep=':', na_values = ['na', '-', '.', '']) Note: refer to pandas docs for all arguments Cheat sheet conventions Get data from inline CSV text to a DataFrame from io import StringIO Code examples data = """, Animal, Cuteness, Desirable # Code examples are found in yellow boxes row-1, dog, 8.7, True row-2, cat, 9.5, True In the code examples, typically I use: row-3, bat, 2.6, False""" s to represent a pandas Series object; df = pd.read_csv(StringIO(data), header=0, df to represent a pandas DataFrame object; index_col=0, skipinitialspace=True) idx to represent a pandas Index object. Note: skipinitialspace=True allows for a pretty layout Also: t – tuple, l – list, b – Boolean, i – integer, a – numpy array, st – string, d – dictionary, etc. Load DataFrames from a Microsoft Excel file # Each Excel sheet in a Python dictionary workbook = pd.ExcelFile('file.xlsx') d = {} # start with an empty dictionary The conceptual model for sheet_name in workbook.sheet_names: df = workbook.parse(sheet_name) d[sheet_name] = df DataFrame object: is a two-dimensional table of data with column and row indexes (something like a spread Note: the parse() method takes many arguments like sheet). The columns are made up of Series objects. read_csv() above. Refer to the pandas documentation. Column index (df.columns) Load a DataFrame from a MySQL database import pymysql from sqlalchemy import create_engine engine = create_engine('mysql+pymysql://' +'USER:PASSWORD@HOST/DATABASE') df = pd.read_sql_table('table', engine) Data in Series then combine into a DataFrame Row index (df.index) # Example 1 ... Series of data Series of data Series of data Series of data Series of data Series of data Series of data s1 = Series(range(6)) s2 = s1 * s1 A DataFrame has two Indexes: s2.index = s2.index + 2 # misalign indexes Typically, the column index (df.columns) is a list of df = pd.concat([s1, s2], axis=1) strings (variable names) or (less commonly) integers Typically, the row index (df.index) might be: # Example 2 ... o Integers - for case or row numbers; s3 = Series({'Tom':1, 'Dick':4, 'Har':9}) o Strings – for case names; or s4 = Series({'Tom':3, 'Dick':2, 'Mar':5}) o DatetimeIndex or PeriodIndex – for time series df = pd.concat({'A':s3, 'B':s4 }, axis=1) Note: 1st method has in integer column labels Series object: an ordered, one-dimensional array of Note: 2nd method does not guarantee col order data with an index. All the data in a Series is of the same data type. Series arithmetic is vectorised after first Get a DataFrame from a Python dictionary aligning the Series index for each of the operands. # default --- assume data is in columns df = DataFrame({ s1 = Series(range(0,4)) # -> 0, 1, 2, 3 'col0' : [1.0, 2.0, 3.0, 4.0], s2 = Series(range(1,5)) # -> 1, 2, 3, 4 'col1' : [100, 200, 300, 400] s3 = s1 + s2 # -> 1, 3, 5, 7 }) Version 30 April 2017 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 1 Get a DataFrame from data in a Python dictionary Working with the whole DataFrame # --- use helper method for data in rows df = DataFrame.from_dict({ # data by row # rows as python dictionaries Peek at the DataFrame contents/structure 'row0' : {'col0':0, 'col1':'A'}, df.info() # index & data types 'row1' : {'col0':1, 'col1':'B'} dfh = df.head(i) # get first i rows }, orient='index') dft = df.tail(i) # get last i rows dfs = df.describe() # summary stats cols df = DataFrame.from_dict({ # data by row top_left_corner_df = df.iloc[:4, :4] # rows as python lists 'row0' : [1, 1+1j, 'A'], DataFrame non-indexing attributes 'row1' : [2, 2+2j, 'B'] df = df.T # transpose rows and cols }, orient='index') l = df.axes # list row and col indexes (r_idx, c_idx) = df.axes # from above Create play/fake data (useful for testing) s = df.dtypes # Series column data types # --- simple - default integer indexes b = df.empty # True for empty DataFrame df = DataFrame(np.random.rand(50,5)) i = df.ndim # number of axes (it is 2) t = df.shape # (row-count, column-count) # --- with a time-stamp row index: i = df.size # row-count * column-count df = DataFrame(np.random.rand(500,5)) a = df.values # get a numpy array for df df.index = pd.date_range('1/1/2005', periods=len(df), freq='M') DataFrame utility methods df = df.copy() # copy a DataFrame # --- with alphabetic row and col indexes df = df.rank() # rank each col (default) # and a "groupable" variable df = df.sort_values(by=col) import string df = df.sort_values(by=[col1, col2]) import random df = df.sort_index() rows = 52 df = df.astype(dtype) # type conversion cols = 5 assert(1 <= rows <= 52) # min/max row count df = DataFrame(np.random.randn(rows, cols), DataFrame iteration methods columns=['c'+str(i) for i in range(cols)], df.iteritems() # (col-index, Series) pairs index=list((string.ascii_uppercase + df.iterrows() # (row-index, Series) pairs string.ascii_lowercase)[0:rows])) # example ... iterating over columns ... df['groupable'] = [random.choice('abcde') for (name, series) in df.iteritems(): for _ in range(rows)] print('\nCol name: ' + str(name)) print('1st value: ' + str(series.iat[0])) Maths on the whole DataFrame (not a complete list) Saving a DataFrame df = df.abs() # absolute values df = df.add(o) # add df, Series or value Saving a DataFrame to a CSV file s = df.count() # non NA/null values df = df.cummax() # (cols default axis) df.to_csv('name.csv', encoding='utf-8') df = df.cummin() # (cols default axis) df = df.cumsum() # (cols default axis) Saving DataFrames to an Excel Workbook df = df.diff() # 1st diff (col def axis) from pandas import ExcelWriter df = df.div(o) # div by df, Series, value writer = ExcelWriter('filename.xlsx') df = df.dot(o) # matrix dot product df1.to_excel(writer,'Sheet1') s = df.max() # max of axis (col def) df2.to_excel(writer,'Sheet2') s = df.mean() # mean (col default axis) writer.save() s = df.median() # median (col default) s = df.min() # min of axis (col def) Saving a DataFrame to MySQL df = df.mul(o) # mul by df Series val import pymysql s = df.sum() # sum axis (cols default) from sqlalchemy import create_engine df = df.where(df > 0.5, other=np.nan) e = create_engine('mysql+pymysql://' + Note: methods returning a series default to work on cols 'USER:PASSWORD@HOST/DATABASE') df.to_sql('TABLE',e, if_exists='replace') Select/filter rows/cols based on index label values Note: if_exists 'fail', 'replace', 'append' df = df.filter(items=['a', 'b']) # by col df = df.filter(items=[5], axis=0) # by row Saving to Python objects df = df.filter(like='x') # keep x in col d = df.to_dict() # to dictionary df = df.filter(regex='x') # regex in col str = df.to_string() # to string df = df.select(lambda x: not x%5) # 5th rows m = df.as_matrix() # to numpy matrix Note: select takes a Boolean function, for cols: axis=1 Note: filter defaults to cols; select defaults to rows Version 30 April 2017 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 2 Set column values set based on criteria Working with Columns df['b'] = df['a'].where(df['a']>0, other=0) df['d'] = df['a'].where(df.b!=0, other=df.c) Get column index and labels Note: where other can be a Series or a scalar idx = df.columns # get col index label = df.columns[0] # first col label Data type conversions l = df.columns.tolist() # list of col labels st = df['col'].astype(str)# Series dtype a = df.columns.values # array of col labels a = df['col'].values # numpy array l = df['col'].tolist() # python list Change column labels Note: useful dtypes for Series conversion: int, float, str df = df.rename(columns={'old':'new','a':'1'}) Trap: index lost in conversion from Series to array or list df.columns = ['new1', 'new2', 'new3'] # etc. Common column-wide methods/attributes Selecting columns value = df['col'].dtype # type of data s = df['colName'] # select col to Series value = df['col'].size # col dimensions df = df[['colName']] # select col to df value = df['col'].count() # non-NA count df = df[['a','b']] # select 2-plus cols value = df['col'].sum() df = df[['c','a','b']] # change col order value = df['col'].prod() s = df[df.columns[0]] # select by number value = df['col'].min() df = df[df.columns[[0, 3, 4]]] # by numbers value = df['col'].max() df = [df.columns[:-1]] # all but last col value = df['col'].mean() # also median() s = df.pop('c') # get & drop from df value = df['col'].cov(df['col2']) s = df['col'].describe() s = df['col'].value_counts() Selecting columns with Python attributes s = df.a # same as s = df['a'] Find index label for min/max values in column # cannot create new columns by attribute df.existing_column = df.a / df.b label = df['col1'].idxmin() df['new_column'] = df.a / df.b label = df['col1'].idxmax() Trap: column names must be valid identifiers. Common column element-wise methods Adding new columns to a DataFrame s = df['col'].isnull() df['new_col'] = range(len(df)) s = df['col'].notnull() # not isnull() df['new_col'] = np.repeat(np.nan,len(df)) s = df['col'].astype(float) df['random'] = np.random.rand(len(df)) s = df['col'].abs() df['index_as_col'] = df.index s = df['col'].round(decimals=0) df1[['b','c']] = df2[['e','f']] s = df['col'].diff(periods=1) df3 = df1.append(other=df2) s = df['col'].shift(periods=1) s = df['col'].to_datetime() Trap: When adding a new column, only items from the s = df['col'].fillna(0) # replace NaN w 0 new series that have a corresponding index in the DataFrame will be added.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    12 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us