Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Introduction to Python Pandas for Data
Introduction Analytics to Python
Python programming
NumPy Srijith Rajamohan
Matplotlib Advanced Research Computing, Virginia Tech Introduction to Pandas Case study Tuesday 19th July, 2016 Conclusion
1 / 115 Course Contents
Introduction to Python Pandas for Data Analytics This week: Srijith Rajamohan Introduction to Python • Introduction Python Programming to Python • Python NumPy programming • Plotting with Matplotlib NumPy • Introduction to Python Pandas Matplotlib • Introduction Case study to Pandas • Case study Conclusion • Conclusion
2 / 115 Section 1
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
3 / 115 Python Features
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Why Python ? Interpreted Introduction • to Python Intuitive and minimalistic code Python • programming Expressive language NumPy • Dynamically typed Matplotlib • Introduction Automatic memory management to Pandas •
Case study
Conclusion
4 / 115 Python Features
Introduction to Python Pandas for Data Advantages Analytics Srijith Ease of programming Rajamohan • Minimizes the time to develop and maintain code Introduction • to Python Modular and object-oriented • Python Large community of users programming • NumPy A large standard and user-contributed library Matplotlib •
Introduction Disadvantages to Pandas Interpreted and therefore slower than compiled languages Case study • Conclusion Decentralized with packages •
5 / 115 Code Performance vs Development Time
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan
Introduction to Python
Python programming
NumPy
Matplotlib
Introduction to Pandas
Case study
Conclusion
6 / 115 Versions of Python
Introduction to Python Pandas for Data Analytics
Srijith Two versions of Python in use - Python 2 and Python 3 Rajamohan • Python 3 not backward-compatible with Python 2 • Introduction A lot of packages are available for Python 2 to Python • Python Check version using the following command programming • NumPy
Matplotlib Example
Introduction to Pandas $ python --version Case study
Conclusion
7 / 115 Section 2
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
8 / 115 Variables
Introduction to Python Pandas for Data Analytics Variable names can contain alphanumerical characters and • Srijith some special characters Rajamohan It is common to have variable names start with a Introduction • to Python lower-case letter and class names start with a capital letter Python Some keywords are reserved such as ‘and’, ‘assert’, programming • NumPy ‘break’, ‘lambda’. A list of keywords are located at
Matplotlib https://docs.python.org/2.5/ref/keywords.html Introduction Python is dynamically typed, the type of the variable is to Pandas • derived from the value it is assigned. Case study
Conclusion A variable is assigned using the ‘=’ operator •
9 / 115 Variable types
Introduction to Python Pandas for Data Analytics Variable types • Srijith • Rajamohan Integer (int) • Float (float) Introduction • Boolean (bool) to Python • Complex (complex) Python programming • String (str)
NumPy • ...
Matplotlib • User Defined! (classes) Introduction Documentation to Pandas • • https://docs.python.org/2/library/types.html Case study • https://docs.python.org/2/library/datatypes.html Conclusion
10 / 115 Variable types
Introduction to Python Pandas for Data Analytics Srijith Use the type function to determine variable type Rajamohan •
Introduction Example to Python
Python programming >>> log_file = open("/home/srijithr/ NumPy logfile","r") Matplotlib >>> type(log_file) Introduction file to Pandas
Case study
Conclusion
11 / 115 Variable types
Introduction to Python Pandas for Data Analytics Variables can be cast to a different type Srijith • Rajamohan Example Introduction to Python >>> share_of_rent = 295.50 / 2.0 Python programming >>> type(share_of_rent) NumPy float Matplotlib >>> rounded_share = int(share_of_rent) Introduction to Pandas >>> type(rounded_share)
Case study int
Conclusion
12 / 115 Operators
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Arithmetic operators +, -, *, /, // (integer division for Introduction • to Python floating point numbers), ’**’ power Python Boolean operators and, or and not programming • NumPy Comparison operators >, <, >= (greater or equal), <= • Matplotlib (less or equal), == equality Introduction to Pandas
Case study
Conclusion
13 / 115 Strings (str)
Introduction to Python Example Pandas for Data Analytics >>> dir(str) Srijith Rajamohan [...,’capitalize’,’center’,’count’,’ decode’,’encode’,’endswith’,’ Introduction to Python expandtabs’,’find’,’format’,’index’, Python ’isalnum’,’isalpha’,’isdigit’,’ programming islower’,’isspace’,’istitle’,’ NumPy isupper’,’join’,’ljust’,’lower’,’ Matplotlib
Introduction lstrip’,’partition’,’replace’,’rfind to Pandas ’,’rindex’,’rjust’,’rpartition’,’ Case study rsplit’,’rstrip’,’split’,’splitlines Conclusion ’,’startswith’,’strip’,’swapcase’,’ title’,’translate’,’upper’,’zfill’]
14 / 115 Strings
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> greeting ="Hello world!"
Introduction >>> len(greeting) to Python 12 Python programming >>> greeting
NumPy ’Hello world’
Matplotlib >>> greeting[0]# indexing starts at0
Introduction ’H’ to Pandas >>> greeting.replace("world","test") Case study Hello test ! Conclusion
15 / 115 Printing strings
Introduction to Python Example Pandas for Data Analytics # concatenates strings witha space Srijith >>> print("Go","Hokies") Rajamohan Go Hokies Introduction to Python # concatenated without space
Python >>> print("Go"+"Tech"+"Go") programming GoTechGo NumPy #C-style string formatting Matplotlib >>> print("Bar Tab=%f" %35.28) Introduction to Pandas Bar Tab = 35.280000 Case study # Creatinga formatted string Conclusion >>> total ="My Share= %.2f. Tip=%d"% (11.76, 2.352) >>> print(total) My Share = 11.76. Tip = 2 16 / 115 Lists
Introduction to Python Pandas for Data Analytics Array of elements of arbitrary type Srijith Rajamohan Example
Introduction to Python >>> numbers = [1,2,3] Python >>> type(numbers) programming
NumPy list
Matplotlib >>> arbitrary_array = [1,numbers,"hello"]
Introduction >>> type(arbitrary_array) to Pandas list Case study
Conclusion
17 / 115 Lists
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan # createa new empty list Introduction >>> characters = [] to Python # add elements using‘append’ Python programming >>> characters.append("A") NumPy >>> characters.append("d") Matplotlib >>> characters.append("d") Introduction to Pandas >>> print(characters)
Case study [’A’,’d’,’d’]
Conclusion
18 / 115 Lists
Introduction to Python Pandas for Data Analytics Lists are mutable - their values can be changed. Srijith Rajamohan Example
Introduction to Python >>> characters = ["A","d","d"] Python # Changing second and third element programming
NumPy >>> characters[1] ="p"
Matplotlib >>> characters[2] ="p"
Introduction >>> print(characters) to Pandas [’A’,’p’,’p’] Case study
Conclusion
19 / 115 Lists
Introduction to Python Pandas for Example Data Analytics
Srijith >>> characters = ["A","d","d"] Rajamohan # Inserting before"A","d","d"
Introduction >>> characters.insert(0,"i") to Python >>> characters.insert(1,"n") Python programming >>> characters.insert(2,"s") NumPy >>> characters.insert(3,"e") Matplotlib >>> characters.insert(4,"r") Introduction >>> characters.insert(5,"t") to Pandas
Case study >>>print(characters)
Conclusion [’i’,’n’,’s’,’e’,’r’,’t’,’A’,’d’,’ d’]
20 / 115 Lists
Introduction to Python Pandas for Example Data Analytics
Srijith >>> characters = [’i’,’n’,’s’,’e’,’r’, Rajamohan ’t’,’A’,’d’,’d’]
Introduction # Remove first occurrence of"A" from list to Python >>> characters.remove("A") Python programming >>> print(characters) NumPy [’i’,’n’,’s’,’e’,’r’,’t’,’d’,’d’] Matplotlib # Remove an element ata specific location Introduction >>> del characters[7] to Pandas
Case study >>> del characters[6]
Conclusion >>> print(characters) [’i’,’n’,’s’,’e’,’r’,’t’]
21 / 115 Tuples
Introduction to Python Tuples are like lists except they are immutable. Difference is in Pandas for performance Data Analytics
Srijith Example Rajamohan >>> point = (10, 20)# Note() for tuples Introduction to Python instead of[] Python >>> type(point) programming
NumPy tuple
Matplotlib >>> point = 10,20
Introduction >>> type(point) to Pandas tuple Case study >>> point[2] = 40# This will fail! Conclusion TypeError :’tuple’ object does not support item assignment
22 / 115 Dictionary
Introduction to Python Dictionaries are lists of key-value pairs Pandas for Data Analytics Example Srijith Rajamohan >>> prices = {"Eggs" : 2.30, Introduction ..."Sausage" : 4.15, to Python ..."Spam" : 1.59,} Python programming >>> type(prices) NumPy dict Matplotlib >>> print (prices) Introduction to Pandas {’Eggs’: 2.3,’Sausage’: 4.15,’Spam’:
Case study 1.59}
Conclusion >>> prices ["Spam"] 1.59
23 / 115 Conditional statements: if, elif, else
Introduction to Python Pandas for Example Data Analytics
Srijith >>> I_am_tired = False Rajamohan >>> I_am_hungry = True
Introduction >>> if I_am_tired is True:# Note the to Python colon fora code block Python programming ... print("You have to teach!") NumPy ... elif I_am_hungry is True: Matplotlib ... print("No food for you!") Introduction ... else: to Pandas
Case study ... print"Go on...!"
Conclusion ... No food for you!
24 / 115 Loops - For
Introduction to Python Example Pandas for Data Analytics >>> fori in [1,2,3]:#i is an arbitrary Srijith variable for use within the loop Rajamohan section Introduction to Python ... print(i)
Python 1 programming 2 NumPy 3 Matplotlib >>> for word in["scientific","computing" Introduction to Pandas ,"with","python"]: Case study ... print(word) Conclusion scientific computing with python 25 / 115 Loops - While
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>>i = 0
Introduction >>>whilei<5: to Python ... print(i) Python programming ... i = i + 1
NumPy 0
Matplotlib 1
Introduction 2 to Pandas 3 Case study 4 Conclusion
26 / 115 Functions
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> def print_word_length(word):
Introduction ...""" to Python ... Printa word and how many Python programming characters it has
NumPy ..."""
Matplotlib ... print(word +" has"+ str(len(
Introduction word )) +" characters.") to Pandas >>>print_word_length("Diversity") Case study Diversity has 9 characters. Conclusion
27 / 115 Functions - arguments
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Passing immutable arguments like integers, strings or Introduction • to Python tuples acts like call-by-value
Python • They cannot be modified! programming Passing mutable arguments like lists behaves like NumPy • Matplotlib call-by-reference
Introduction to Pandas
Case study
Conclusion
28 / 115 Functions - arguments
Introduction to Python Pandas for Data Analytics Call-by-value Srijith Rajamohan Example
Introduction to Python >>> def make_me_rich(balance): Python balance = 1000000 programming
NumPy account_balance = 500
Matplotlib >>> make_me_rich(account_balance)
Introduction >>> print(account_balance) to Pandas 500 Case study
Conclusion
29 / 115 Functions - arguments
Introduction to Python Call-by-reference Pandas for Data Analytics Example Srijith Rajamohan >>> def talk_to_advisor(tasks): Introduction tasks.insert(0,"Publish") to Python tasks.insert(1,"Publish") Python programming tasks.insert(2,"Publish") NumPy >>> todos = ["Graduate","Geta job","...", Matplotlib "Profit!"] Introduction to Pandas >>> talk_to_advisor(todos)
Case study >>> print(todos)
Conclusion ["Publish","Publish","Publish","Graduate" ,"Geta job","...","Profit!"]
30 / 115 Functions - arguments
Introduction However, you cannot assign a new object to the argument to Python • Pandas for • A new memory location is created for this list Data Analytics • This becomes a local variable
Srijith Rajamohan Example
Introduction to Python >>> def switcheroo(favorite_teams):
Python ... print (favorite_teams) programming ... favorite_teams = ["Redskins"] NumPy ... print (favorite_teams) Matplotlib >>> my_favorite_teams = ["Hokies"," Introduction to Pandas Nittany Lions"] Case study >>> switcheroo(my_favorite_teams) Conclusion ["Hokies","Nittany Lions"] ["Redskins"] >>> print (my_favorite_teams)
["Hokies","Nittany Lions"] 31 / 115 Functions - Multiple Return Values
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan >>> def powers(number): Introduction to Python ... return number ** 2, number ** 3 Python >>> squared, cubed = powers(3) programming >>> print(squared) NumPy 9 Matplotlib
Introduction >>> print(cubed) to Pandas 27 Case study
Conclusion
32 / 115 Functions - Default Values
Introduction to Python Pandas for Data Example Analytics Srijith >>> def likes_food(person, food="Broccoli" Rajamohan , likes=True): Introduction to Python ... if likes:
Python ... print(str(person) +" likes" programming + food ) NumPy ... else: Matplotlib ... print(str(person) +" does not Introduction to Pandas like" + food) Case study >>> likes_food("Srijith", likes=False) Conclusion Srijith does not like Broccoli
33 / 115 Section 3
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
34 / 115 NumPy
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Used in almost all numerical computations in Python Introduction Used for high-performance vector and matrix computations to Python • Python Provides fast precompiled functions for numerical routines programming • Written in C and Fortran NumPy • Matplotlib Vectorized computations • Introduction to Pandas
Case study
Conclusion
35 / 115 Why NumPy?
Introduction to Python Example Pandas for Data Analytics >>> from numpy import* Srijith >>> import time Rajamohan >>> def trad_version(): Introduction to Python t1 = time.time()
Python X = range(10000000) programming Y = range(10000000) NumPy Z = [] Matplotlib fori in range(len(X)): Introduction to Pandas Z.append(X[i] + Y[i]) Case study return time.time() - t1 Conclusion >>> trad_version() 1.9738149642944336
36 / 115 Why NumPy?
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> def numpy_version():
Introduction t1 = time.time() to Python X = arange(10000000) Python programming Y = arange(10000000)
NumPy Z = X + Y
Matplotlib return time.time() - t1
Introduction to Pandas >>> numpy_version() Case study 0.059307098388671875 Conclusion
37 / 115 Arrays
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> from numpy import*
Introduction # the argument to the array function isa to Python Python list Python programming >>> v = array([1,2,3,4])
NumPy # the argument to the array function isa
Matplotlib nested Python list
Introduction >>> M = array([[1, 2], [3, 4]]) to Pandas >>> type(v), type(M) Case study (numpy.ndarray, numpy.ndarray) Conclusion
38 / 115 Arrays
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> v.shape, M.shape
Introduction ((4,), (2, 2)) to Python >>> M. size Python programming 4
NumPy >>> M. dtype
Matplotlib dtype (’int64’)
Introduction # Explicitly define the type of the array to Pandas >>> M = array([[1, 2], [3, 4]], dtype= Case study complex) Conclusion
39 / 115 Arrays - Using array-generating functions
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan >>> x = arange(0, 10, 1)# arguments: Introduction start, stop, step to Python array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) Python programming >>> linspace(0,10,11)# arguments: start, NumPy end and number of points( start and Matplotlib end points are included) Introduction to Pandas array([ 0., 1., 2., 3., 4., 5.,
Case study 6., 7., 8., 9., 10.])
Conclusion
40 / 115 Diagonal and Zero matrix
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan >>> diag([1,2,3]) Introduction array([[1, 0, 0], to Python [0, 2, 0], Python programming [0, 0, 3]]) NumPy >>> zeros((3,3)) Matplotlib array([[ 0., 0., 0.], Introduction to Pandas [ 0., 0., 0.],
Case study [ 0., 0., 0.]])
Conclusion
41 / 115 Array Access
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan >>> M = random.rand(3,3) Introduction >>> M to Python array ([ Python programming [ 0.37389376, 0.64335721, 0.12435669], NumPy [ 0.01444674, 0.13963834, 0.36263224], Matplotlib [ 0.00661902, 0.14865659, 0.75066302]]) Introduction to Pandas >>> M[1 ,1]
Case study 0.13963834214755588
Conclusion
42 / 115 Array Access
Introduction to Python Example Pandas for Data Analytics # Access the first row Srijith >>> M [1] Rajamohan array ( Introduction to Python [ 0.01444674, 0.13963834, 0.36263224])
Python # The first row can be also be accessed programming using this notation NumPy >>> M[1 ,:] Matplotlib array ( Introduction to Pandas [ 0.01444674, 0.13963834, 0.36263224]) Case study # Access the first column Conclusion >>> M[: ,1] array ( [ 0.64335721, 0.13963834, 0.14865659])
43 / 115 Array Access
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan # You can also assign values to an entire Introduction row or column to Python >>> M[1,:] = 0 Python programming >>> M NumPy array ([ Matplotlib [ 0.37389376, 0.64335721, 0.12435669], Introduction to Pandas [0. ,0. ,0. ],
Case study [ 0.00661902, 0.14865659, 0.75066302]])
Conclusion
44 / 115 Array Slicing
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan # Extract slices of an array
Introduction >>> M [1:3] to Python array ([ Python programming [0. ,0. ,0. ],
NumPy [ 0.00661902, 0.14865659, 0.75066302]])
Matplotlib >>> M[1:3,1:2]
Introduction array ([ to Pandas [ 0. ], Case study [ 0.14865659]]) Conclusion
45 / 115 Array Slicing - Negative Indexing
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan # Negative indices start counting from the Introduction end of the array to Python >>> M[ -2] Python programming array ( NumPy [ 0., 0., 0.]) Matplotlib >>> M[ -1] Introduction to Pandas array (
Case study [ 0.00661902, 0.14865659, 0.75066302])
Conclusion
46 / 115 Array Access - Strided Access
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Example
Introduction to Python # Strided access Python programming >>> M[::2,::2]
NumPy array([[ 0.37389376, 0.12435669],
Matplotlib [ 0.00661902, 0.75066302]])
Introduction to Pandas
Case study
Conclusion
47 / 115 Array Operations - Scalar
Introduction to Python These operation are applied to all the elements in the array Pandas for Data Analytics Example Srijith Rajamohan >>> M*2 Introduction array ([ to Python [ 0.74778752, 1.28671443, 0.24871338], Python programming [0. ,0. ,0. ], NumPy [ 0.01323804, 0.29731317, 1.50132603]]) Matplotlib >>> M + 2 Introduction to Pandas array ([
Case study [ 2.37389376, 2.64335721, 2.12435669],
Conclusion [2. ,2. ,2. ], [ 2.00661902, 2.14865659, 2.75066302]])
48 / 115 Matrix multiplication
Introduction to Python Pandas for Data Example Analytics Srijith >>> M * M# Element-wise multiplication Rajamohan array ([ Introduction to Python [1.397965e-01,4.139085e-01,1.546458e-02],
Python [0.000000e+00,0.000000e+00,0.00000e+00], programming [4.381141e-05,2.209878e-02,5.634949e-01]]) NumPy >>> dot(M,M)# Matrix multiplication Matplotlib array ([ Introduction to Pandas [ 0.14061966, 0.25903369, 0.13984616], Case study [0. ,0. ,0. ], Conclusion [ 0.00744346, 0.1158494 , 0.56431808]])
49 / 115 Iterating over Array Elements
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan In general, avoid iteration over elements • Introduction Iterating is slow compared to a vector operation to Python • If you must, use the for loop Python • programming In order to enable vectorization, ensure that user-written • NumPy functions can work with vector inputs. Matplotlib • Use the vectorize function Introduction • Use the any or all function with arrays to Pandas
Case study
Conclusion
50 / 115 Vectorize
Introduction to Python Example Pandas for Data Analytics >>> def Theta(x): Srijith ...""" Rajamohan ... Scalar implemenation of the Introduction to Python Heaviside step function.
Python ...""" programming ... if x >= 0: NumPy ... return1 Matplotlib ... else: Introduction to Pandas ... return0 Case study ... Conclusion >>> Theta(1.0) 1 >>> Theta(-1.0) 0 51 / 115 Vectorize
Introduction to Python Pandas for Data Without vectorize we would not be able to pass v to the Analytics function Srijith Rajamohan Example
Introduction to Python >>> v Python programming array([1, 2, 3, 4])
NumPy >>> Tvec = vectorize(Theta)
Matplotlib >>> Tvec (v)
Introduction array([1, 1, 1, 1]) to Pandas >>> Tvec(1.0) Case study array (1) Conclusion
52 / 115 Arrays in conditions
Introduction to Python Pandas for Data Analytics Use the any or all functions associated with arrays Srijith Rajamohan Example
Introduction to Python >>> v Python array([1, 2, 3, 4]) programming
NumPy >>> (v > 3).any()
Matplotlib True
Introduction >>> (v > 3).all() to Pandas False Case study
Conclusion
53 / 115 Section 4
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
54 / 115 Matplotlib
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Used for generating 2D and 3D scientific plots Introduction • to Python Support for LaTeX Python • programming Fine-grained control over every aspect • NumPy Many output file formats including PNG, PDF, SVG, EPS Matplotlib • Introduction to Pandas
Case study
Conclusion
55 / 115 Matplotlib - Customize matplotlibrc
Introduction to Python Pandas for Data Analytics Configuration file ‘matplotlibrc’ used to customize almost Srijith • Rajamohan every aspect of plotting Introduction On Linux, it looks in .config/matplotlib/matplotlibrc to Python • Python On other platforms, it looks in .matplotlib/matplotlibrc programming • Use ‘matplotlib.matplotlib fname()’ to determine NumPy • Matplotlib from where the current matplotlibrc is loaded Introduction Customization options can be found at to Pandas • http://matplotlib.org/users/customizing.html Case study
Conclusion
56 / 115 Matplotlib
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Matplotlib is the entire library • Introduction Pyplot - a module within Matplotlib that provides access to Python • to the underlying plotting library Python programming Pylab - a convenience module that combines the NumPy • functionality of Pyplot with Numpy Matplotlib
Introduction Pylab interface convenient for interactive plotting to Pandas •
Case study
Conclusion
57 / 115 Pylab
Introduction to Python Example Pandas for Data Analytics >>> import pylab as pl Srijith Rajamohan >>> pl.ioff() >>> pl.isinteractive() Introduction to Python False Python >>> x = [1,3,7] programming >>> pl.plot(x)# if interactive mode is NumPy off use show() after the plot command Matplotlib
Introduction [
58 / 115 Pylab
Introduction to Python Pandas for Data Simple Pylab plot Analytics 7
Srijith Rajamohan 6
Introduction to Python 5
Python programming 4
NumPy
Matplotlib 3
Introduction to Pandas 2
Case study
1 Conclusion 0.0 0.5 1.0 1.5 2.0
59 / 115 Pylab
Introduction to Python Example Pandas for Data Analytics >>> X = np.linspace(-np.pi, np.pi, 256, Srijith endpoint=True) Rajamohan >>> C, S = np.cos(X), np.sin(X) Introduction to Python # Plot cosine witha blue continuous line
Python of width1(pixels) programming >>> pl.plot(X, C, color="blue", linewidth NumPy =1.0, linestyle="-") Matplotlib >>> pl.xlabel("X") ; pl.ylabel("Y") Introduction to Pandas >>> pl.title("Sine and Cosine waves") Case study # Plot sine witha green continuous line Conclusion of width1(pixels) >>> pl.plot(X, S, color="green", linewidth =1.0, linestyle="-") >>> pl.show() 60 / 115 ...... Pylab
Introduction to Python Pandas for Data Sine and Cosine waves Analytics 1.0 Srijith Rajamohan
0.5 Introduction to Python
Python programming Y 0.0 NumPy
Matplotlib 0.5 Introduction to Pandas
Case study 1.0 Conclusion 4 3 2 1 0 1 2 3 4 X
61 / 115 Pylab - subplots
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> pl.figure(figsize=(8, 6), dpi=80)
Introduction >>> pl.subplot(1, 2, 1) to Python >>> C, S = np.cos(X), np.sin(X) Python programming >>> pl.plot(X, C, color="blue", linewidth
NumPy =1.0, linestyle="-")
Matplotlib >>> pl.subplot(1, 2, 2)
Introduction >>> pl.plot(X, S, color="green", linewidth to Pandas =1.0, linestyle="-") Case study >>> pl.show() Conclusion
62 / 115 Pylab - subplots
Introduction to Python Pandas for Data Analytics 1.0 1.0 Srijith Rajamohan
0.5 0.5 Introduction to Python
Python programming 0.0 0.0 NumPy
Matplotlib 0.5 0.5 Introduction to Pandas
Case study 1.0 1.0 Conclusion 4 3 2 1 0 1 2 3 4 4 3 2 1 0 1 2 3 4
63 / 115 Pyplot
Introduction to Python Example Pandas for Data Analytics >>>import matplotlib.pyplot as plt Srijith >>>plt.isinteractive() Rajamohan False Introduction to Python >>>x = np.linspace(0, 3*np.pi, 500)
Python >>>plt.plot(x, np.sin(x**2)) programming [
64 / 115 Pyplot
Introduction to Python Pandas for Data Pyplot plot Analytics 1.0
Srijith Rajamohan
0.5 Introduction to Python
Python programming 0.0
NumPy
Matplotlib
0.5 Introduction − to Pandas
Case study
1.0 − Conclusion 0 2 4 6 8 10
65 / 115 Pyplot - legend
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>> import matplotlib.pyplot as plt
Introduction >>> line_up, = plt.plot([1,2,3], label=’ to Python Line2’) Python programming >>> line_down, = plt.plot([3,2,1], label=’
NumPy Line1’)
Matplotlib >>> plt.legend(handles=[line_up, line_down
Introduction ]) to Pandas
66 / 115 Pyplot - legend
Introduction to Python Pandas for Data Analytics 3.0 Line 2 Srijith Rajamohan Line 1
2.5 Introduction to Python
Python programming 2.0 NumPy
Matplotlib 1.5 Introduction to Pandas
Case study 1.0 Conclusion 0.0 0.5 1.0 1.5 2.0
67 / 115 Pyplot - 3D plots
Introduction to Python Pandas for Data Surface plots Analytics
Srijith Rajamohan
Introduction to Python
Python programming
NumPy
Matplotlib
Introduction to Pandas
Case study Visit http://matplotlib.org/gallery.html for a gallery of Conclusion plots produced by Matplotlib
68 / 115 Section 5
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
69 / 115 What is Pandas?
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Pandas is an open source, BSD-licensed library Introduction • to Python High-performance, easy-to-use data structures and data Python • programming analysis tools NumPy Built for the Python programming language. Matplotlib •
Introduction to Pandas
Case study
Conclusion
70 / 115 Pandas - import modules
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Example
Introduction to Python >>>from pandas import DataFrame, read_csv
Python # General syntax to importa library but programming no functions: NumPy >>>import pandas as pd#this is howI Matplotlib usually import pandas Introduction to Pandas
Case study
Conclusion
71 / 115 Pandas - Create a dataframe
Introduction to Python Example Pandas for Data Analytics
Srijith Rajamohan >>>d = {’one’ : pd.Series([1., 2., 3.], index =[’a’,’b’,’c’]), Introduction to Python ’two’ : pd.Series([1., 2., 3., 4.], index Python =[’a’,’b’,’c’,’d’])} programming >>>df = pd.DataFrame(d) NumPy >>>df Matplotlib
Introduction one two to Pandas a 1.0 1.0 Case study b 2.0 2.0 Conclusion c 3.0 3.0 d NaN 4.0
72 / 115 Pandas - Create a dataframe
Introduction to Python Pandas for Example Data Analytics
Srijith Rajamohan >>> names = [’Bob’,’Jessica’,’Mary’,’John’,
Introduction ’Mel’] to Python >>>births = [968, 155, 77, 578, 973] Python programming #To merge these two lists together we will NumPy use the zip function. Matplotlib Introduction >>>BabyDataSet = list(zip(names,births)) to Pandas
Case study >>>BabyDataSet
Conclusion [(’Bob’, 968), (’Jessica’, 155), (’Mary’, 77) , (’John’, 578), (’Mel’, 973)]
73 / 115 Pandas - Create a data frame and write to a csv file
Introduction to Python Pandas for Data Analytics Use the pandas module to create a dataset. Srijith Rajamohan Example Introduction to Python
Python programming >>>df = pd.DataFrame(data = BabyDataSet, NumPy columns =[’Names’,’Births’]) Matplotlib >>>df.to_csv(’births1880.csv’,index=False, Introduction to Pandas header=False)
Case study
Conclusion
74 / 115 Pandas - Read data from a file
Introduction to Python Pandas for Data Analytics Import data from the csv file Srijith Rajamohan Example
Introduction to Python >>>df = pd.read_csv(filename) Python #Don’t treat the first row asa header programming
NumPy >>>df = pd.read_csv(Location, header=None)
Matplotlib # Provide specific names for the columns
Introduction >>>df = pd.read_csv(Location, names=[’ to Pandas Names’,’Births’]) Case study
Conclusion
75 / 115 Pandas - Get data types
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan # Check data type of the columns Introduction >>>df.dtypes to Python Names object Python programming Births int64 NumPy dtype : object Matplotlib # Check data type of Births column Introduction to Pandas >>>df.Births.dtype
Case study dtype (’int64’)
Conclusion
76 / 115 Pandas - Take a look at the data
Introduction to Python Pandas for Example Data Analytics
Srijith >>>df.head(2) Rajamohan Names Births
Introduction 0 Bob 968 to Python 1 Jessica 155 Python programming >>>df.tail(2) NumPy Names Births Matplotlib 3 John 578 Introduction 4 Mel 973 to Pandas
Case study >>>df.columns
Conclusion Index ([u’Names’,u’Births’], dtype=’object ’)
77 / 115 Pandas - Take a look at the data
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>>df.values
Introduction array ([[’Bob’, 968], to Python [’Jessica’, 155], Python programming [’Mary’, 77],
NumPy [’John’, 578],
Matplotlib [’Mel’, 973]], dtype=object)
Introduction to Pandas >>>df.index Case study Int64Index([0, 1, 2, 3, 4], dtype=’int64’) Conclusion
78 / 115 Pandas - Working on the data
Introduction to Python Pandas for Data Analytics Srijith Example Rajamohan
Introduction >>>df[’Births’].plot() to Python # Maximum value in the data set Python programming >>>MaxValue = df[’Births’].max() NumPy # Name associated with the maximum value Matplotlib >>>MaxName = df[’Names’][df[’Births’] == Introduction df[’Births’].max()].values to Pandas
Case study
Conclusion
79 / 115 Pandas - Describe the data
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>>df[’Names’].unique()
Introduction array ([’Mary’,’Jessica’,’Bob’,’John’,’ to Python Mel’], dtype=object) Python programming >>>print(df[’Names’].describe())
NumPy count 1000
Matplotlib unique 5
Introduction top Bob to Pandas freq 206 Case study Name: Names, dtype: object Conclusion
80 / 115 Pandas - Add a column
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan >>>d = [0,1,2,3,4,5,6,7,8,9]
Introduction to Python # Create dataframe Python programming >>>df = pd.DataFrame(d)
NumPy #Name the column
Matplotlib >>>df.columns = [’Rev’]
Introduction #Add another one and set the value in that to Pandas column Case study >>>df[’NewCol’] = 5 Conclusion
81 / 115 Pandas - Accessing and indexing the data
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan
Introduction #Perform operations on columns to Python >>>df[’NewCol’] = df[’NewCol’] + 1 Python programming #Deletea column
NumPy >>>del df[’NewCol’]
Matplotlib #Edit the index name
Introduction >>>i = [’a’,’b’,’c’,’d’,’e’,’f’,’g’,’h’,’i to Pandas ’,’j’] Case study >>>df.index = i Conclusion
82 / 115 Pandas - Accessing and indexing the data
Introduction to Python Pandas for Example Data Analytics
Srijith #Find based on index value Rajamohan >>>df. loc [’a’]
Introduction >>>df. loc [’a’:’d’] to Python #Do integer position based indexing Python programming >>>df.iloc[0:3] NumPy #Access using the column name Matplotlib >>>df[’Rev’] Introduction #Access multiple columns to Pandas
Case study >>>df [[’Rev’,’test’]]
Conclusion #Subset the data >>>df.ix[:3,[’Rev’,’test’]]
83 / 115 Pandas - Accessing and indexing the data
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Example
Introduction to Python #Find based on index value
Python >>>df.at[’a’,’Rev’] programming 0 NumPy >>>df.iat[0,0] Matplotlib 0 Introduction to Pandas
Case study
Conclusion
84 / 115 Pandas - Accessing and indexing for loc
Introduction to Python Pandas for Data Analytics
Srijith A single label, e.g. 5 or ’a’, (note that 5 is interpreted as a Rajamohan • label of the index. This use is not an integer position Introduction along the index) to Python
Python A list or array of labels [’a’, ’b’, ’c’] programming • A slice object with labels ’a’:’f’, (note that contrary to NumPy • Matplotlib usual python slices, both the start and the stop are
Introduction included!) to Pandas A boolean array Case study • Conclusion
85 / 115 Pandas - Accessing and indexing for iloc
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan
Introduction An integer e.g. 5 to Python • Python A list or array of integers [4, 3, 0] programming • A slice object with ints 1:7 NumPy • Matplotlib
Introduction to Pandas
Case study
Conclusion
86 / 115 Pandas - Accessing and indexing summarized
Introduction to Python Pandas for Data Analytics Example
Srijith Rajamohan loc: only work on index
Introduction iloc: work on position to Python ix: this is the most general and Python programming supports index and position based
NumPy retrieval
Matplotlib at: get scalar values, it’sa very fast
Introduction loc to Pandas iat: get scalar values, it’s a very fast Case study iloc Conclusion
87 / 115 Pandas - Missing data
Introduction to Python How do you deal with data that is missing or contains NaNs Pandas for Data Analytics Example Srijith Rajamohan >>>df = pd.DataFrame(np.random.randn(5, 3) Introduction , index =[’a’,’c’,’e’,’f’,’h’], to Python columns =[’one’,’two’,’three’]) Python programming >>>df. loc [’a’,’two’] = np.nan NumPy one two three Matplotlib a -1.192838 NaN -0.337037 Introduction to Pandas c 0.110718 -0.016733 -0.137009
Case study e 0.153456 0.266369 -0.064127
Conclusion f 1.709607 -0.424790 -0.792061 h -1.076740 -0.872088 -0.436127
88 / 115 Pandas - Missing data
Introduction to Python Pandas for Data Analytics How do you deal with data that is missing or contains NaNs?
Srijith Rajamohan Example
Introduction >>>df.isnull() to Python
Python one two three programming a False True False NumPy c False False False Matplotlib e False False False Introduction to Pandas f False False False Case study h False False False Conclusion
89 / 115 Pandas - Missing data
Introduction to Python Pandas for Data Analytics You can fill this data in a number of ways.
Srijith Rajamohan Example
Introduction >>>df.fillna(0) to Python
Python one two three programming a -1.192838 0.000000 -0.337037 NumPy c 0.110718 -0.016733 -0.137009 Matplotlib e 0.153456 0.266369 -0.064127 Introduction to Pandas f 1.709607 -0.424790 -0.792061 Case study h -1.076740 -0.872088 -0.436127 Conclusion
90 / 115 Pandas - Query the data
Introduction to Python Pandas for Also, use the query method where you can embed boolean Data Analytics expressions on columns within quotes
Srijith Rajamohan Example
Introduction to Python >>>df.query(’one>0’)
Python one two three programming c 0.110718 -0.016733 -0.137009 NumPy e 0.153456 0.266369 -0.064127 Matplotlib f 1.709607 -0.424790 -0.792061 Introduction to Pandas >>>df.query(’one>0& two>0’) Case study one two three Conclusion e 0.153456 0.266369 -0.064127
91 / 115 Pandas - Apply a function
Introduction to Python Pandas for Data Analytics
Srijith You can apply any function to the columns in a dataframe Rajamohan Example Introduction to Python
Python >>>df.apply(lambda x: x.max() - x.min()) programming one 2.902445 NumPy two 1.138457 Matplotlib three 0.727934 Introduction to Pandas
Case study
Conclusion
92 / 115 Pandas - Applymap a function
Introduction to Python Pandas for Data You can apply any function to the element wise data in a Analytics dataframe Srijith Rajamohan Example
Introduction to Python >>>df.applymap(np.sqrt) Python programming one two three
NumPy a NaN NaN NaN
Matplotlib c 0.332742 NaN NaN
Introduction e 0.391735 0.516109 NaN to Pandas f 1.307520 NaN NaN Case study h NaN NaN NaN Conclusion
93 / 115 Pandas - Query data
Introduction to Python Pandas for Data Determine if certain values exist in the dataframe Analytics
Srijith Example Rajamohan
Introduction >>>s = pd.Series(np.arange(5), index=np. to Python arange(5)[::-1], dtype=’int64’) Python programming >>>s.isin([2,4,6])
NumPy 4 False
Matplotlib 3 False
Introduction 2 True to Pandas 1 False Case study 0 True Conclusion
94 / 115 Pandas - Query data
Introduction to Python Pandas for Data Use the where method Analytics
Srijith Example Rajamohan
Introduction >>>s = pd.Series(np.arange(5), index=np. to Python arange(5)[::-1], dtype=’int64’) Python programming >>>s.where(s>3)
NumPy 4 NaN
Matplotlib 3 NaN
Introduction 2 NaN to Pandas 1 NaN Case study 0 4 Conclusion
95 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics
Srijith Creating a grouping organizes the data and returns a groupby Rajamohan object Introduction to Python Example Python programming grouped = obj.groupby(key) NumPy grouped = obj.groupby(key, axis=1) Matplotlib grouped = obj.groupby([key1, key2]) Introduction to Pandas
Case study
Conclusion
96 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics Example Srijith Rajamohan
Introduction df = pd.DataFrame({’A’:[’foo’,’bar’,’ to Python foo’,’bar’, Python programming ’foo’,’bar’,’foo’,’foo’], NumPy ’B’:[’one’,’one’,’two’,’three’, Matplotlib ’two’,’two’,’one’,’three’], Introduction to Pandas ’C’ : np.random.randn(8),
Case study ’D’ : np.random.randn(8)})
Conclusion
97 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Example Analytics
Srijith Rajamohan ABCD Introduction to Python 0 foo one 0.469112 -0.861849
Python 1 bar one -0.282863 -2.104569 programming 2 foo two -1.509059 -0.494929 NumPy 3 bar three -1.135632 1.071804 Matplotlib 4 foo two 1.212112 0.721555 Introduction to Pandas 5 bar two -0.173215 -0.706771 Case study 6 foo one 0.119209 -1.039575 Conclusion 7 foo three -1.044236 0.271860
98 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics Group by either A or B columns or both Srijith Rajamohan Example Introduction to Python >>>grouped = df.groupby(’A’) Python programming >>>grouped = df.groupby([’A’,’B’]) NumPy # Sorts by default, disable this for Matplotlib potential speedup Introduction to Pandas >>>grouped = df.groupby(’A’,sort=False)
Case study
Conclusion
99 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Get statistics for the groups
Introduction Example to Python
Python programming >>>grouped.size()
NumPy >>>grouped.describe() Matplotlib >>>grouped.count() Introduction to Pandas
Case study
Conclusion
100 / 115 Pandas - Grouping the data
Introduction to Python Print the grouping Pandas for Data Analytics Example Srijith Rajamohan >>>list(grouped) Introduction ABCD to Python 1 bar one -1.303028 -0.932565 Python programming 3 bar three 0.135601 0.268914 NumPy 5 bar two -0.320369 0.059366) Matplotlib 0 foo one 1.066805 -1.252834 Introduction to Pandas 2 foo two -0.180407 1.686709
Case study 4 foo two 0.228522 -0.457232
Conclusion 6 foo one -0.553085 0.512941 7 foo three -0.346510 0.434751)]
101 / 115 Pandas - Grouping the data
Introduction to Python Get the first and last elements of each grouping. Also, apply Pandas for Data the ’sum’ function to each column Analytics Example Srijith Rajamohan >>>grouped.first() Introduction to Python ABCD Python bar one -1.303028 -0.932565 programming foo one 1.066805 -1.252834 NumPy # Similar results can be obtained withg. Matplotlib
Introduction last() to Pandas >>>grouped.sum() Case study ACD Conclusion bar -1.487796 -0.604285 foo 0.215324 0.924336
102 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics
Srijith Group aggregation Rajamohan Example Introduction to Python
Python >>>grouped.aggregate(np.sum) programming ACD NumPy bar -1.487796 -0.604285 Matplotlib foo 0.215324 0.924336 Introduction to Pandas
Case study
Conclusion
103 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics Apply multiple functions to a grouped column Srijith Rajamohan Example
Introduction to Python >>>grouped[’C’].agg([np.sum, np.mean]) Python programming
NumPy A sum mean
Matplotlib
Introduction bar -1.487796 -0.495932 to Pandas foo 0.215324 0.043065 Case study
Conclusion
104 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics
Srijith Visually inspecting the grouping Rajamohan Example Introduction to Python
Python >>>w = grouped[’C’].agg([np.sum, np.mean]) programming . plot () NumPy >>>import matplotlib.pyplot as plt Matplotlib >>>plt.show() Introduction to Pandas
Case study
Conclusion
105 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Apply a transformation to the grouping
Introduction Example to Python
Python programming >>>f = lambda x: x*2
NumPy >>>transformed = grouped.transform(f) Matplotlib >>>print transformed Introduction to Pandas
Case study
Conclusion
106 / 115 Pandas - Grouping the data
Introduction to Python Pandas for Apply a filter to select a group based on some criterion. Data Analytics Example Srijith Rajamohan >>>grouped.filter(lambda x: sum(x[’C’]) > Introduction to Python 0)
Python programming ABCD NumPy 0 foo one 1.066805 -1.252834 Matplotlib 2 foo two -0.180407 1.686709 Introduction to Pandas 4 foo two 0.228522 -0.457232 Case study 6 foo one -0.553085 0.512941 Conclusion 7 foo three -0.346510 0.434751
107 / 115 Section 6
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
108 / 115 Cost of College
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan
Introduction We are going to analyze the cost of college data scorecard to Python • Python provided by the federal government programming https://collegescorecard.ed.gov/data/ NumPy • Matplotlib
Introduction to Pandas
Case study
Conclusion
109 / 115 Cost of College
Introduction to Python Pandas for Data Analytics Srijith Find the top 10 median 10 year debt Rajamohan • Find the top 10 median earnings Introduction • to Python Find the top 10 schools with the best sat scores Python • programming Find the top 10 best return of investment • NumPy Find average median earnings per state Matplotlib • Compute the correlation between the SAT scores and Introduction • to Pandas median income Case study
Conclusion
110 / 115 Cost of College
Introduction to Python Pandas for Data Analytics
Srijith Columns of interest Rajamohan UNITID Introduction • to Python INSTNM • Python STABBR programming • NumPy CITY • Matplotlib GRAD DEBT MDN SUPP Introduction • to Pandas SAT AVG Case study •
Conclusion
111 / 115 Cost of College - Generate metrics and create interactive visualizations using Bokeh
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan Generate metrics and create interactive visualizations Introduction • to Python using Bokeh Python Create an interactive chloropleth visualization programming • Sample given here at NumPy • Matplotlib http://sjster.bitbucket.org/sub2/index.html Introduction to Pandas
Case study
Conclusion
112 / 115 Interactive Chloropleth for querying and visualization
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan
Introduction to Python
Python programming
NumPy
Matplotlib
Introduction to Pandas
Case study
Conclusion
113 / 115 Section 7
Introduction to Python Pandas for Data 1 Introduction to Python Analytics
Srijith Rajamohan 2 Python programming
Introduction to Python 3 NumPy
Python programming 4 Matplotlib NumPy Matplotlib 5 Introduction to Pandas Introduction to Pandas 6 Case study Case study
Conclusion 7 Conclusion
114 / 115 Questions
Introduction to Python Pandas for Data Analytics
Srijith Rajamohan
Introduction to Python Thank you for attending ! Python programming
NumPy
Matplotlib
Introduction to Pandas
Case study
Conclusion
115 / 115