Introduction to Python Pandas for Data Analytics

Srijith Rajamohan

Advanced Research Computing, Virginia Tech

Tuesday 19th July, 2016

Course Contents

This week:
• Introduction to Python
• Python programming
• NumPy
• Plotting with Matplotlib
• Introduction to Python Pandas
• Case study
• Conclusion

Section 1: Introduction to Python

Python Features

Why Python?
• Interpreted
• Intuitive and minimalistic code
• Expressive language
• Dynamically typed
• Automatic memory management

Python Features

Advantages
• Ease of programming
• Minimizes the time to develop and maintain code
• Modular and object-oriented
• Large community of users
• A large standard and user-contributed library

Disadvantages
• Interpreted and therefore slower than compiled languages
• Decentralized, with functionality spread across many packages

Code Performance vs Development Time

[Figure: code performance versus development time]

Versions of Python

• Two versions of Python are in use: Python 2 and Python 3
• Python 3 is not backward-compatible with Python 2
• A lot of packages are available for Python 2
• Check the version using the following command

Example

$ python --version

Section 2: Python programming

Variables

• Variable names can contain alphanumerical characters and underscores
• It is common to have variable names start with a lower-case letter and class names start with a capital letter
• Some keywords are reserved, such as ‘and’, ‘assert’, ‘break’, ‘lambda’. A list of keywords is located at https://docs.python.org/2.5/ref/keywords.html
• Python is dynamically typed: the type of a variable is derived from the value it is assigned
• A variable is assigned using the ‘=’ operator
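The reserved-word list and the dynamic-typing rule can both be checked from the interpreter. This is a minimal sketch, assuming Python 3 and the standard-library keyword module; the variable names are illustrative and do not come from the slides.

```python
import keyword

# keyword.kwlist holds the reserved words of the running interpreter.
reserved = keyword.kwlist
print("lambda" in reserved)        # True
print(keyword.iskeyword("total"))  # False: 'total' is a valid variable name

# Dynamic typing: the type follows the value currently bound to the name.
value = 3
type_before = type(value)
value = "three"
type_after = type(value)
print(type_before.__name__, type_after.__name__)  # int str
```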

Variable types

• Variable types
  • Integer (int)
  • Float (float)
  • Boolean (bool)
  • Complex (complex)
  • String (str)
  • ...
  • User Defined! (classes)
• Documentation
  • https://docs.python.org/2/library/types.html
  • https://docs.python.org/2/library/datatypes.html

Variable types

• Use the type function to determine variable type

Example

>>> log_file = open("/home/srijithr/logfile", "r")
>>> type(log_file)
file

Variable types

• Variables can be cast to a different type

Example

>>> share_of_rent = 295.50 / 2.0
>>> type(share_of_rent)
float
>>> rounded_share = int(share_of_rent)
>>> type(rounded_share)
int
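Despite the name rounded_share in the example above, casting a float with int() truncates rather than rounds. The sketch below reuses the slide's value; the names truncated, nearest, as_text and back_to_float are illustrative assumptions.

```python
share_of_rent = 295.50 / 2.0           # 147.75

truncated = int(share_of_rent)         # int() drops the fractional part
nearest = int(round(share_of_rent))    # round first to get the nearest integer
as_text = str(share_of_rent)           # cast to a string
back_to_float = float(as_text)         # and back to a float

print(truncated, nearest, as_text)     # 147 148 147.75
```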

Operators

• Arithmetic operators: +, -, *, /, // (integer division for floating point numbers), ** (power)
• Boolean operators: and, or and not
• Comparison operators: >, <, >= (greater or equal), <= (less or equal), == (equality)
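A short sketch of the three operator families listed above. Floats are used for the divisions so the results are the same under Python 2 and Python 3; the variable names are illustrative.

```python
# Arithmetic operators
print(7.0 / 2.0)    # 3.5
print(7.0 // 2.0)   # 3.0 (floor division, result stays a float)
print(2 ** 10)      # 1024

# Comparison and Boolean operators combine into conditions
x = 5
in_range = (x >= 0) and (x < 10)
is_ten = (x == 10)
print(in_range)     # True
print(not is_ten)   # True
```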

Strings (str)

Example

>>> dir(str)
[..., 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith',
 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit',
 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower',
 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition',
 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
 'title', 'translate', 'upper', 'zfill']

Strings

Example

>>> greeting = "Hello world!"
>>> len(greeting)
12
>>> greeting
'Hello world!'
>>> greeting[0]  # indexing starts at 0
'H'
>>> greeting.replace("world", "test")
'Hello test!'

Printing strings

Example

# concatenates strings with a space
>>> print("Go", "Hokies")
Go Hokies
# concatenated without space
>>> print("Go" + "Tech" + "Go")
GoTechGo
# C-style string formatting
>>> print("Bar Tab = %f" % 35.28)
Bar Tab = 35.280000
# Creating a formatted string
>>> total = "My Share = %.2f. Tip = %d" % (11.76, 2.352)
>>> print(total)
My Share = 11.76. Tip = 2

Lists

Array of elements of arbitrary type

Example

>>> numbers = [1, 2, 3]
>>> type(numbers)
list
>>> arbitrary_array = [1, numbers, "hello"]
>>> type(arbitrary_array)
list

Lists

Example

# create a new empty list
>>> characters = []
# add elements using 'append'
>>> characters.append("A")
>>> characters.append("d")
>>> characters.append("d")
>>> print(characters)
['A', 'd', 'd']

Lists

Lists are mutable: their values can be changed.

Example

>>> characters = ["A", "d", "d"]
# Changing second and third element
>>> characters[1] = "p"
>>> characters[2] = "p"
>>> print(characters)
['A', 'p', 'p']

Lists

Example

>>> characters = ["A", "d", "d"]
# Inserting before "A", "d", "d"
>>> characters.insert(0, "i")
>>> characters.insert(1, "n")
>>> characters.insert(2, "s")
>>> characters.insert(3, "e")
>>> characters.insert(4, "r")
>>> characters.insert(5, "t")
>>> print(characters)
['i', 'n', 's', 'e', 'r', 't', 'A', 'd', 'd']

Lists

Example

>>> characters = ['i', 'n', 's', 'e', 'r', 't', 'A', 'd', 'd']
# Remove first occurrence of "A" from list
>>> characters.remove("A")
>>> print(characters)
['i', 'n', 's', 'e', 'r', 't', 'd', 'd']
# Remove an element at a specific location
>>> del characters[7]
>>> del characters[6]
>>> print(characters)
['i', 'n', 's', 'e', 'r', 't']

Tuples

Tuples are like lists except they are immutable. The difference is in performance.

Example

>>> point = (10, 20)  # Note () for tuples instead of []
>>> type(point)
tuple
>>> point = 10, 20
>>> type(point)
tuple
>>> point[1] = 40  # This will fail!
TypeError: 'tuple' object does not support item assignment
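Immutability can be handled explicitly in a script. This sketch catches the TypeError and builds a new tuple instead; the names assignment_failed, moved, x and y are illustrative assumptions.

```python
point = (10, 20)

# Item assignment is rejected for tuples...
try:
    point[0] = 40
    assignment_failed = False
except TypeError:
    assignment_failed = True

# ...so a changed value means building a new tuple.
moved = (40, point[1])

# Tuples unpack into separate names.
x, y = moved
print(assignment_failed)  # True
print(moved)              # (40, 20)
```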

Dictionary

Dictionaries are collections of key-value pairs

Example

>>> prices = {"Eggs": 2.30,
...           "Sausage": 4.15,
...           "Spam": 1.59,}
>>> type(prices)
dict
>>> print(prices)
{'Eggs': 2.3, 'Sausage': 4.15, 'Spam': 1.59}
>>> prices["Spam"]
1.59
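Beyond lookup, the common dictionary operations are adding, overwriting, testing for a key, and iterating. A minimal sketch using the slide's prices; the "Bacon" and "Toast" entries and their values are made up for illustration.

```python
prices = {"Eggs": 2.30, "Sausage": 4.15, "Spam": 1.59}

prices["Bacon"] = 3.75                  # add a new key (hypothetical item)
prices["Spam"] = 1.79                   # overwrite an existing value
has_toast = "Toast" in prices           # membership tests look at the keys
toast_price = prices.get("Toast", 0.0)  # default value instead of a KeyError

# Iterate over key-value pairs in alphabetical key order
for item, price in sorted(prices.items()):
    print("%-8s %.2f" % (item, price))
```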

Conditional statements: if, elif, else

Example

>>> I_am_tired = False
>>> I_am_hungry = True
>>> if I_am_tired is True:  # Note the colon for a code block
...     print("You have to teach!")
... elif I_am_hungry is True:
...     print("No food for you!")
... else:
...     print("Go on...!")
...
No food for you!

Loops - For

Example

>>> for i in [1, 2, 3]:  # i is an arbitrary variable for use within the loop section
...     print(i)
1
2
3
>>> for word in ["scientific", "computing", "with", "python"]:
...     print(word)
scientific
computing
with
python

Loops - While

Example

>>> i = 0
>>> while i < 5:
...     print(i)
...     i = i + 1
0
1
2
3
4

Functions

Example

>>> def print_word_length(word):
...     """
...     Print a word and how many characters it has
...     """
...     print(word + " has " + str(len(word)) + " characters.")
>>> print_word_length("Diversity")
Diversity has 9 characters.

Functions - arguments

• Passing immutable arguments like integers, strings or tuples acts like call-by-value
  • They cannot be modified!
• Passing mutable arguments like lists behaves like call-by-reference

Functions - arguments

Call-by-value

Example

>>> def make_me_rich(balance):
...     balance = 1000000
>>> account_balance = 500
>>> make_me_rich(account_balance)
>>> print(account_balance)
500

Functions - arguments

Call-by-reference

Example

>>> def talk_to_advisor(tasks):
...     tasks.insert(0, "Publish")
...     tasks.insert(1, "Publish")
...     tasks.insert(2, "Publish")
>>> todos = ["Graduate", "Get a job", "...", "Profit!"]
>>> talk_to_advisor(todos)
>>> print(todos)
['Publish', 'Publish', 'Publish', 'Graduate', 'Get a job', '...', 'Profit!']

Functions - arguments

• However, assigning a new object to the argument does not change the caller's object
  • A new memory location is created for this list
  • This becomes a local variable

Example

>>> def switcheroo(favorite_teams):
...     print(favorite_teams)
...     favorite_teams = ["Redskins"]
...     print(favorite_teams)
>>> my_favorite_teams = ["Hokies", "Nittany Lions"]
>>> switcheroo(my_favorite_teams)
['Hokies', 'Nittany Lions']
['Redskins']
>>> print(my_favorite_teams)
['Hokies', 'Nittany Lions']

Functions - Multiple Return Values

Example

>>> def powers(number):
...     return number ** 2, number ** 3
>>> squared, cubed = powers(3)
>>> print(squared)
9
>>> print(cubed)
27

Functions - Default Values

Example

>>> def likes_food(person, food="Broccoli", likes=True):
...     if likes:
...         print(str(person) + " likes " + food)
...     else:
...         print(str(person) + " does not like " + food)
>>> likes_food("Srijith", likes=False)
Srijith does not like Broccoli

Section 3: NumPy

NumPy

• Used in almost all numerical computations in Python
• Used for high-performance vector and matrix computations
• Provides fast precompiled functions for numerical routines
• Written in C and Fortran
• Vectorized computations

Why NumPy?

Example

>>> from numpy import *
>>> import time
>>> def trad_version():
...     t1 = time.time()
...     X = range(10000000)
...     Y = range(10000000)
...     Z = []
...     for i in range(len(X)):
...         Z.append(X[i] + Y[i])
...     return time.time() - t1
>>> trad_version()
1.9738149642944336

Why NumPy?

Example

>>> def numpy_version():
...     t1 = time.time()
...     X = arange(10000000)
...     Y = arange(10000000)
...     Z = X + Y
...     return time.time() - t1
>>> numpy_version()
0.059307098388671875
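The same comparison can be sketched with the standard-library timeit module, which repeats each call and so gives steadier numbers than a single time.time() difference. This is an illustration and not the author's benchmark: the array size is reduced to 100,000 so it runs quickly, the function names are made up, and no timings are quoted because they depend on the machine.

```python
import timeit

import numpy as np

n = 100000  # much smaller than the slides' 10,000,000 so the sketch runs quickly


def loop_sum():
    """Element-by-element addition with plain Python lists."""
    x = list(range(n))
    y = list(range(n))
    return [x[i] + y[i] for i in range(n)]


def numpy_sum():
    """The same addition as a single vectorized NumPy operation."""
    x = np.arange(n)
    y = np.arange(n)
    return x + y


# Total time for 5 calls of each version
loop_seconds = timeit.timeit(loop_sum, number=5)
numpy_seconds = timeit.timeit(numpy_sum, number=5)
print("loop:  %.4f s" % loop_seconds)
print("numpy: %.4f s" % numpy_seconds)

# Both versions compute the same values
same_result = np.array_equal(np.array(loop_sum()), numpy_sum())
```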

Arrays

Example

>>> from numpy import *
# the argument to the array function is a Python list
>>> v = array([1, 2, 3, 4])
# the argument to the array function is a nested Python list
>>> M = array([[1, 2], [3, 4]])
>>> type(v), type(M)
(numpy.ndarray, numpy.ndarray)

Arrays

Example

>>> v.shape, M.shape
((4,), (2, 2))
>>> M.size
4
>>> M.dtype
dtype('int64')
# Explicitly define the type of the array
>>> M = array([[1, 2], [3, 4]], dtype=complex)

Arrays - Using array-generating functions

Example

>>> x = arange(0, 10, 1)  # arguments: start, stop, step
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> linspace(0, 10, 11)  # arguments: start, end and number of points (start and end points are included)
array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

Diagonal and Zero matrix

Example

>>> diag([1, 2, 3])
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])
>>> zeros((3, 3))
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

Array Access

Example

>>> M = random.rand(3, 3)
>>> M
array([[ 0.37389376,  0.64335721,  0.12435669],
       [ 0.01444674,  0.13963834,  0.36263224],
       [ 0.00661902,  0.14865659,  0.75066302]])
>>> M[1, 1]
0.13963834214755588

Array Access

Example

# Access the second row (index 1; indexing starts at 0)
>>> M[1]
array([ 0.01444674,  0.13963834,  0.36263224])
# The same row can also be accessed using this notation
>>> M[1, :]
array([ 0.01444674,  0.13963834,  0.36263224])
# Access the second column (index 1)
>>> M[:, 1]
array([ 0.64335721,  0.13963834,  0.14865659])

Array Access

Example

# You can also assign values to an entire row or column
>>> M[1, :] = 0
>>> M
array([[ 0.37389376,  0.64335721,  0.12435669],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.00661902,  0.14865659,  0.75066302]])

Array Slicing

Example

# Extract slices of an array
>>> M[1:3]
array([[ 0.        ,  0.        ,  0.        ],
       [ 0.00661902,  0.14865659,  0.75066302]])
>>> M[1:3, 1:2]
array([[ 0.        ],
       [ 0.14865659]])

Array Slicing - Negative Indexing

Example

# Negative indices start counting from the end of the array
>>> M[-2]
array([ 0.,  0.,  0.])
>>> M[-1]
array([ 0.00661902,  0.14865659,  0.75066302])

Array Access - Strided Access

Example

# Strided access
>>> M[::2, ::2]
array([[ 0.37389376,  0.12435669],
       [ 0.00661902,  0.75066302]])

Array Operations - Scalar

These operations are applied to all the elements in the array

Example

>>> M * 2
array([[ 0.74778752,  1.28671443,  0.24871338],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.01323804,  0.29731317,  1.50132603]])
>>> M + 2
array([[ 2.37389376,  2.64335721,  2.12435669],
       [ 2.        ,  2.        ,  2.        ],
       [ 2.00661902,  2.14865659,  2.75066302]])

Matrix multiplication

Example

>>> M * M  # Element-wise multiplication
array([[ 1.397965e-01,  4.139085e-01,  1.546458e-02],
       [ 0.000000e+00,  0.000000e+00,  0.000000e+00],
       [ 4.381141e-05,  2.209878e-02,  5.634949e-01]])
>>> dot(M, M)  # Matrix multiplication
array([[ 0.14061966,  0.25903369,  0.13984616],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.00744346,  0.1158494 ,  0.56431808]])

Iterating over Array Elements

• In general, avoid iteration over elements
• Iterating is slow compared to a vector operation
• If you must, use the for loop
• In order to enable vectorization, ensure that user-written functions can work with vector inputs.
  • Use the vectorize function
  • Use the any or all function with arrays
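To make the contrast between iterating and a vector operation concrete, here is a minimal sketch of a step function written both ways. It uses numpy.where, which is a different technique from the vectorize function named above, and the array values and variable names are illustrative assumptions.

```python
import numpy as np

v = np.array([-2.0, -0.5, 0.0, 1.5])

# Element-by-element iteration: a Python-level loop over the array
looped = np.array([1 if x >= 0 else 0 for x in v])

# Vector operation: the comparison and the selection run inside NumPy
vectorized = np.where(v >= 0, 1, 0)

print(looped)      # [0 0 1 1]
print(vectorized)  # [0 0 1 1]
```

The NumPy documentation notes that vectorize is provided mainly for convenience and is essentially a for loop, so an expression built from array operations is usually the faster choice when one exists.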

Vectorize

Example

>>> def Theta(x):
...     """
...     Scalar implementation of the Heaviside step function.
...     """
...     if x >= 0:
...         return 1
...     else:
...         return 0
...
>>> Theta(1.0)
1
>>> Theta(-1.0)
0

Vectorize

Without vectorize we would not be able to pass v to the function

Example

>>> v
array([1, 2, 3, 4])
>>> Tvec = vectorize(Theta)
>>> Tvec(v)
array([1, 1, 1, 1])
>>> Tvec(1.0)
array(1)

Arrays in conditions

Use the any or all functions associated with arrays

Example

>>> v
array([1, 2, 3, 4])
>>> (v > 3).any()
True
>>> (v > 3).all()
False
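The reason any or all is needed is that an array comparison yields one Boolean per element, which an if statement cannot reduce to a single truth value by itself. A minimal sketch, with illustrative variable names:

```python
import numpy as np

v = np.array([1, 2, 3, 4])

# Using the element-wise comparison directly as a condition raises ValueError
try:
    if v > 3:
        pass
    message = ""
except ValueError as err:
    message = str(err)

# Reducing with any() or all() gives a single True/False
has_large = bool((v > 3).any())
all_large = bool((v > 3).all())

if has_large and not all_large:
    print("some, but not all, elements are greater than 3")
```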

Section 4: Matplotlib

Matplotlib

• Used for generating 2D and 3D scientific plots
• Support for LaTeX
• Fine-grained control over every aspect
• Many output file formats including PNG, PDF, SVG, EPS

Matplotlib - Customize matplotlibrc

• Configuration file ‘matplotlibrc’ is used to customize almost every aspect of plotting
• On Linux, it looks in .config/matplotlib/matplotlibrc
• On other platforms, it looks in .matplotlib/matplotlibrc
• Use ‘matplotlib.matplotlib_fname()’ to determine from where the current matplotlibrc is loaded
• Customization options can be found at http://matplotlib.org/users/customizing.html
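The settings in matplotlibrc can also be inspected and overridden for the current session through matplotlib.rcParams. A minimal sketch; the two keys and the values assigned to them are picked only for illustration.

```python
import matplotlib

# Path of the matplotlibrc file that was actually loaded
rc_path = matplotlib.matplotlib_fname()
print(rc_path)

# Session-only overrides; these do not modify the file on disk
matplotlib.rcParams["lines.linewidth"] = 2.0
matplotlib.rcParams["font.size"] = 12

print(matplotlib.rcParams["lines.linewidth"])  # 2.0
```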

Matplotlib

• Matplotlib is the entire library
• Pyplot - a module within Matplotlib that provides access to the underlying plotting library
• Pylab - a convenience module that combines the functionality of Pyplot with Numpy
• Pylab interface convenient for interactive plotting
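In a script, the two pieces that Pylab combines can be imported explicitly. The sketch below assumes the non-interactive "Agg" backend so that it runs without a display, and it writes the figure to an in-memory buffer rather than a file; both choices, and all variable names, are assumptions made for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# Pyplot supplies the plotting calls, NumPy supplies the numerics
fig, ax = plt.subplots()
line, = ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()

# Render to PNG in memory
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```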

Pylab

Example

>>> import pylab as pl
>>> pl.ioff()
>>> pl.isinteractive()
False
>>> x = [1, 3, 7]
>>> pl.plot(x)  # if interactive mode is off use show() after the plot command
[<matplotlib.lines.Line2D object at 0x...>]
>>> pl.savefig('fig_test.pdf', dpi=600, format='pdf')
>>> pl.show()

Pylab

[Figure: "Simple Pylab plot", the values 1, 3, 7 plotted against their indices 0 to 2]

Pylab

Example

>>> import numpy as np
>>> import pylab as pl
>>> X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
>>> C, S = np.cos(X), np.sin(X)
# Plot cosine with a blue continuous line of width 1 (pixels)
>>> pl.plot(X, C, color="blue", linewidth=1.0, linestyle="-")
>>> pl.xlabel("X"); pl.ylabel("Y")
>>> pl.title("Sine and Cosine waves")
# Plot sine with a green continuous line of width 1 (pixels)
>>> pl.plot(X, S, color="green", linewidth=1.0, linestyle="-")
>>> pl.show()

Pylab

[Figure: "Sine and Cosine waves", with X on the horizontal axis (about -4 to 4) and Y on the vertical axis (-1.0 to 1.0)]

Pylab - subplots

Example

>>> pl.figure(figsize=(8, 6), dpi=80)
>>> pl.subplot(1, 2, 1)
>>> C, S = np.cos(X), np.sin(X)
>>> pl.plot(X, C, color="blue", linewidth=1.0, linestyle="-")
>>> pl.subplot(1, 2, 2)
>>> pl.plot(X, S, color="green", linewidth=1.0, linestyle="-")
>>> pl.show()

Pylab - subplots

[Figure: two side-by-side subplots, cosine on the left and sine on the right, each with a horizontal range of about -4 to 4 and a vertical range of -1.0 to 1.0]

Pyplot

Example

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> plt.isinteractive()
False
>>> x = np.linspace(0, 3*np.pi, 500)
>>> plt.plot(x, np.sin(x**2))
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.title('Pyplot plot')
>>> plt.savefig('fig_test_pyplot.pdf', dpi=600, format='pdf')
>>> plt.show()

Pyplot

[Figure: "Pyplot plot", sin(x**2) with a horizontal range of 0 to 10 and a vertical range of -1.0 to 1.0]

Pyplot - legend

Example

>>> import matplotlib.pyplot as plt
>>> line_up, = plt.plot([1, 2, 3], label='Line 2')
>>> line_down, = plt.plot([3, 2, 1], label='Line 1')
>>> plt.legend(handles=[line_up, line_down])
>>> plt.show()

Pyplot - legend

[Figure: two crossing lines with a legend listing "Line 2" and "Line 1"; horizontal range 0.0 to 2.0, vertical range 1.0 to 3.0]

Pyplot - 3D plots

Surface plots

[Figure: example surface plot]

Visit http://matplotlib.org/gallery.html for a gallery of plots produced by Matplotlib

Section 5: Introduction to Pandas

What is Pandas?

• Pandas is an open source, BSD-licensed library
• High-performance, easy-to-use data structures and data analysis tools
• Built for the Python programming language.

Pandas - import modules

Example

>>> from pandas import DataFrame, read_csv
# General syntax to import a library but no functions:
>>> import pandas as pd  # this is how I usually import pandas

Pandas - Create a dataframe

Example

>>> d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
...      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

Pandas - Create a dataframe

Example

>>> names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
>>> births = [968, 155, 77, 578, 973]
# To merge these two lists together we will use the zip function.
>>> BabyDataSet = list(zip(names, births))
>>> BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

Pandas - Create a data frame and write to a csv file

Use the pandas module to create a dataset.

Example

>>> df = pd.DataFrame(data=BabyDataSet, columns=['Names', 'Births'])
>>> df.to_csv('births1880.csv', index=False, header=False)

Pandas - Read data from a file

Import data from the csv file

Example

>>> df = pd.read_csv(filename)
# Don't treat the first row as a header
>>> df = pd.read_csv(Location, header=None)
# Provide specific names for the columns
>>> df = pd.read_csv(Location, names=['Names', 'Births'])
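Writing with header=False and reading back with names=... go together: the file carries no header row, so the column names have to be supplied on the way in. The sketch below shows the round trip using an in-memory io.StringIO buffer in place of the births1880.csv file, which is an assumption made so the example is self-contained; the variable names are illustrative.

```python
import io

import pandas as pd

baby_data = [("Bob", 968), ("Jessica", 155), ("Mary", 77),
             ("John", 578), ("Mel", 973)]
df = pd.DataFrame(data=baby_data, columns=["Names", "Births"])

# Write without the index and without a header row
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

# Read back: there is no header row, so provide the column names
buffer.seek(0)
restored = pd.read_csv(buffer, names=["Names", "Births"])
print(restored.head(2))
```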

Pandas - Get data types

Example

# Check data type of the columns
>>> df.dtypes
Names     object
Births     int64
dtype: object
# Check data type of Births column
>>> df.Births.dtype
dtype('int64')

Pandas - Take a look at the data

Example

>>> df.head(2)
     Names  Births
0      Bob     968
1  Jessica     155
>>> df.tail(2)
  Names  Births
3  John     578
4   Mel     973
>>> df.columns
Index([u'Names', u'Births'], dtype='object')

Pandas - Take a look at the data

Example

>>> df.values
array([['Bob', 968],
       ['Jessica', 155],
       ['Mary', 77],
       ['John', 578],
       ['Mel', 973]], dtype=object)
>>> df.index
Int64Index([0, 1, 2, 3, 4], dtype='int64')

Pandas - Working on the data

Example

>>> df['Births'].plot()
# Maximum value in the data set
>>> MaxValue = df['Births'].max()
# Name associated with the maximum value
>>> MaxName = df['Names'][df['Births'] == df['Births'].max()].values
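As a hedged alternative to the boolean mask, idxmax returns the index label of the first maximum, which can then be passed to loc. The small frame below is rebuilt from the values shown on the earlier slides so the sketch runs on its own.

```python
import pandas as pd

df = pd.DataFrame({"Names": ["Bob", "Jessica", "Mary", "John", "Mel"],
                   "Births": [968, 155, 77, 578, 973]})

# Boolean-mask approach from the slide: an array of all names at the maximum.
max_names = df["Names"][df["Births"] == df["Births"].max()].values

# Alternative: idxmax gives the index label of the first maximum only.
max_name = df.loc[df["Births"].idxmax(), "Names"]

print(max_names, max_name)
```

The mask keeps every row that ties for the maximum, while idxmax reports only the first one.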

Pandas - Describe the data

Example

>>> df['Names'].unique()
array(['Mary', 'Jessica', 'Bob', 'John', 'Mel'], dtype=object)
>>> print(df['Names'].describe())
count     1000
unique       5
top        Bob
freq       206
Name: Names, dtype: object

Pandas - Add a column

Example

>>> d = [0,1,2,3,4,5,6,7,8,9]
# Create dataframe
>>> df = pd.DataFrame(d)
# Name the column
>>> df.columns = ['Rev']
# Add another one and set the value in that column
>>> df['NewCol'] = 5

Pandas - Accessing and indexing the data

Example

# Perform operations on columns
>>> df['NewCol'] = df['NewCol'] + 1
# Delete a column
>>> del df['NewCol']
# Edit the index name
>>> i = ['a','b','c','d','e','f','g','h','i','j']
>>> df.index = i

Pandas - Accessing and indexing the data

Example

# Find based on index value
>>> df.loc['a']
>>> df.loc['a':'d']
# Do integer position based indexing
>>> df.iloc[0:3]
# Access using the column name
>>> df['Rev']
# Access multiple columns
>>> df[['Rev','test']]
# Subset the data
>>> df.ix[:3,['Rev','test']]
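The ix indexer was available when these slides were written but was later deprecated and removed in pandas 1.0. A minimal sketch of the same subsetting with loc and iloc follows; it assumes a frame with a 'Rev' column plus a hypothetical 'test' column, which the example references but never creates.

```python
import pandas as pd

# 'test' is a made-up second column so the multi-column selection has data.
df = pd.DataFrame({"Rev": list(range(10)), "test": list(range(10, 20))},
                  index=list("abcdefghij"))

# Label-based subset: rows 'a' through 'd' (both included), two columns.
by_label = df.loc["a":"d", ["Rev", "test"]]

# Position-based subset: first three rows, then columns selected by name.
by_position = df.iloc[:3][["Rev", "test"]]

print(by_label)
print(by_position)
```

Choosing loc or iloc explicitly avoids the ambiguity ix had when an index contained integers.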

Pandas - Accessing and indexing the data

Example

# Find based on index value
>>> df.at['a','Rev']
0
>>> df.iat[0,0]
0

Pandas - Accessing and indexing for loc

• A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index. This use is not an integer position along the index)
• A list or array of labels ['a', 'b', 'c']
• A slice object with labels 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!)
• A boolean array
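The four input types accepted by loc can be illustrated with a small sketch. The Series and its integer labels are made up for the illustration; they were chosen so that label 5 is visibly different from position 5.

```python
import pandas as pd

# Integer labels 5..8, so labels and positions do not coincide.
s = pd.Series([10, 20, 30, 40], index=[5, 6, 7, 8])

by_label = s.loc[5]          # 5 is a label here, not a position
by_list = s.loc[[5, 8]]      # list of labels
label_slice = s.loc[5:7]     # label slice: both 5 and 7 are included
masked = s.loc[s > 15]       # boolean array

print(by_label)
print(label_slice.tolist())
```

Note that the label slice returns three elements, whereas the ordinary Python slice 5:7 would return two.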

Pandas - Accessing and indexing for iloc

• An integer e.g. 5
• A list or array of integers [4, 3, 0]
• A slice object with ints 1:7
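A short sketch of the three iloc input types, using a made-up Series with string labels so it is clear that only positions are consulted:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=list("abcdefgh"))

single = s.iloc[5]           # element at position 5 (the sixth element)
picked = s.iloc[[4, 3, 0]]   # positions, returned in the requested order
sliced = s.iloc[1:7]         # positions 1 to 6; the stop is excluded

print(single)
print(picked.tolist())
print(sliced.tolist())
```

Unlike label slices with loc, the integer slice follows the usual Python convention and leaves out the stop position.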

Pandas - Accessing and indexing summarized

Example

loc: works only on index labels
iloc: works on integer positions
ix: the most general; supports both index and position based retrieval
at: gets scalar values; a very fast loc
iat: gets scalar values; a very fast iloc

Pandas - Missing data

How do you deal with data that is missing or contains NaNs?

Example

>>> df = pd.DataFrame(np.random.randn(5, 3), index=['a','c','e','f','h'], columns=['one','two','three'])
>>> df.loc['a','two'] = np.nan
>>> df
        one       two     three
a -1.192838       NaN -0.337037
c  0.110718 -0.016733 -0.137009
e  0.153456  0.266369 -0.064127
f  1.709607 -0.424790 -0.792061
h -1.076740 -0.872088 -0.436127

Pandas - Missing data

How do you deal with data that is missing or contains NaNs?

Example

>>> df.isnull()
     one    two  three
a  False   True  False
c  False  False  False
e  False  False  False
f  False  False  False
h  False  False  False

Pandas - Missing data

You can fill this data in a number of ways.

Example

>>> df.fillna(0)
        one       two     three
a -1.192838  0.000000 -0.337037
c  0.110718 -0.016733 -0.137009
e  0.153456  0.266369 -0.064127
f  1.709607 -0.424790 -0.792061
h -1.076740 -0.872088 -0.436127
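Filling with a constant is only one option. The sketch below shows a few others on a small made-up frame (the values are illustrative, not the random data from the slide): the per-column mean, a backward fill, and dropping incomplete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0],
                   "two": [np.nan, 4.0, 6.0]},
                  index=["a", "c", "e"])

filled_zero = df.fillna(0)           # constant value
filled_mean = df.fillna(df.mean())   # per-column mean
filled_back = df.bfill()             # next valid value in the same column
dropped = df.dropna()                # discard rows containing any NaN

print(filled_mean)
print(dropped)
```

Which strategy is appropriate depends on the data; none of these calls modifies df in place.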

Pandas - Query the data

Also, use the query method where you can embed boolean expressions on columns within quotes

Example

>>> df.query('one > 0')
        one       two     three
c  0.110718 -0.016733 -0.137009
e  0.153456  0.266369 -0.064127
f  1.709607 -0.424790 -0.792061
>>> df.query('one > 0 & two > 0')
        one       two     three
e  0.153456  0.266369 -0.064127
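For comparison, the same selection can be written with an explicit boolean mask. This is a minimal sketch on a made-up frame; the column names follow the slide, the values do not.

```python
import pandas as pd

df = pd.DataFrame({"one": [-1.0, 0.5, 2.0],
                   "two": [0.3, -0.2, 1.5]},
                  index=["a", "c", "e"])

# Expression string, evaluated against the column names.
via_query = df.query("one > 0 & two > 0")

# Equivalent boolean mask; the parentheses are required here.
via_mask = df[(df["one"] > 0) & (df["two"] > 0)]

print(via_query)
```

Inside query the comparisons bind tighter than &, so the parentheses needed in the mask version can be left out.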

Pandas - Apply a function

You can apply any function to the columns in a dataframe

Example

>>> df.apply(lambda x: x.max() - x.min())
one      2.902445
two      1.138457
three    0.727934
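By default apply passes each column to the function; with axis=1 it passes each row instead. A small sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 4.0, 2.0],
                   "two": [3.0, 1.0, 8.0]})

# Default axis=0: the function receives each column as a Series.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the function receives each row as a Series.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range)
print(row_range)
```

The column-wise result is indexed by column name, the row-wise result by the original row index.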

Pandas - Applymap a function

You can apply any function to the element wise data in a dataframe

Example

>>> df.applymap(np.sqrt)
        one       two  three
a       NaN       NaN    NaN
c  0.332742       NaN    NaN
e  0.391735  0.516109    NaN
f  1.307520       NaN    NaN
h       NaN       NaN    NaN
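Recent pandas releases deprecate applymap in favour of DataFrame.map, so the call above may emit a warning depending on the installed version. The sketch below shows two element-wise alternatives that should behave the same across versions; the frame is made up and uses perfect squares so the results are easy to check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [4.0, 9.0], "two": [16.0, 25.0]})

# Column-wise apply plus Series.map gives element-wise behaviour.
elementwise = df.apply(lambda col: col.map(np.sqrt))

# NumPy ufuncs can also be called on the whole frame directly.
vectorized = np.sqrt(df)

print(elementwise)
```

For NumPy functions the vectorized call is usually the faster of the two; the map-based form is useful for arbitrary Python functions.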

Pandas - Query data

Determine if certain values exist in the dataframe

Example

>>> s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
>>> s.isin([2,4,6])
4    False
3    False
2     True
1    False
0     True

Pandas - Query data

Use the where method

Example

>>> s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
>>> s.where(s>3)
4   NaN
3   NaN
2   NaN
1   NaN
0     4
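The where method keeps the shape of the original Series, which distinguishes it from plain boolean indexing. A short sketch using the same Series as the slide; the replacement value -1 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")

kept = s.where(s > 3)           # non-matching entries become NaN
replaced = s.where(s > 3, -1)   # non-matching entries become -1
subset = s[s > 3]               # boolean indexing drops them instead

print(replaced.tolist())
print(len(kept), len(subset))
```

Passing a second argument avoids the NaN values and therefore keeps the integer dtype.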

Pandas - Grouping the data

Creating a grouping organizes the data and returns a groupby object

Example

grouped = obj.groupby(key)
grouped = obj.groupby(key, axis=1)
grouped = obj.groupby([key1, key2])
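To make the groupby object concrete, the sketch below groups a small made-up frame by one key and by two keys. Iterating over the object yields (key, sub-frame) pairs; nothing is computed until the groups are consumed.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "two"],
                   "C": [1, 2, 3, 4]})

# Single key: one group per distinct value of A.
grouped = df.groupby("A")
sizes = {key: len(group) for key, group in grouped}

# Two keys: one group per distinct (A, B) combination.
grouped_two = df.groupby(["A", "B"])
n_groups = grouped_two.ngroups

print(sizes, n_groups)
```

The axis=1 form from the slide is left out here because newer pandas versions deprecate it.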

Pandas - Grouping the data

Example

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

Pandas - Grouping the data

Example

     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

Pandas - Grouping the data

Group by either A or B columns or both

Example

>>> grouped = df.groupby('A')
>>> grouped = df.groupby(['A','B'])
# Sorts by default, disable this for potential speedup
>>> grouped = df.groupby('A',sort=False)

Pandas - Grouping the data

Get statistics for the groups

Example

>>> grouped.size()
>>> grouped.describe()
>>> grouped.count()
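The difference between size and count shows up when a group contains missing values. A minimal sketch on made-up data with one NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, np.nan, 3.0, 5.0]})
grouped = df.groupby("A")

sizes = grouped.size()             # rows per group, NaN included
counts = grouped.count()           # non-null values per column and group
summary = grouped["C"].describe()  # count, mean, std, min, quartiles, max

print(sizes)
print(counts)
print(summary)
```

Here both groups have two rows, but the count for 'bar' is one because its NaN is skipped.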

Pandas - Grouping the data

Print the grouping

Example

>>> list(grouped)
[('bar',      A      B         C         D
1  bar    one -1.303028 -0.932565
3  bar  three  0.135601  0.268914
5  bar    two -0.320369  0.059366),
 ('foo',      A      B         C         D
0  foo    one  1.066805 -1.252834
2  foo    two -0.180407  1.686709
4  foo    two  0.228522 -0.457232
6  foo    one -0.553085  0.512941
7  foo  three -0.346510  0.434751)]

Pandas - Grouping the data

Get the first and last elements of each grouping. Also, apply the 'sum' function to each column

Example

>>> grouped.first()
       B         C         D
A
bar  one -1.303028 -0.932565
foo  one  1.066805 -1.252834
# Similar results can be obtained with grouped.last()
>>> grouped.sum()
            C         D
A
bar -1.487796 -0.604285
foo  0.215324  0.924336

Pandas - Grouping the data

Group aggregation

Example

>>> grouped.aggregate(np.sum)
            C         D
A
bar -1.487796 -0.604285
foo  0.215324  0.924336

Pandas - Grouping the data

Apply multiple functions to a grouped column

Example

>>> grouped['C'].agg([np.sum, np.mean])
          sum      mean
A
bar -1.487796 -0.495932
foo  0.215324  0.043065
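The aggregation functions can also be given by name as strings, which newer pandas versions tend to prefer over passing NumPy functions. A minimal sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0]})

# One output column per requested aggregation.
stats = df.groupby("A")["C"].agg(["sum", "mean"])

print(stats)
```

The result is a frame indexed by the group key, with the aggregation names as column labels.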

Pandas - Grouping the data

Visually inspecting the grouping

Example

>>> w = grouped['C'].agg([np.sum, np.mean]).plot()
>>> import matplotlib.pyplot as plt
>>> plt.show()

Pandas - Grouping the data

Apply a transformation to the grouping

Example

>>> f = lambda x: x*2
>>> transformed = grouped.transform(f)
>>> print(transformed)
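A transformation returns a result with the same index as the input, so it can depend on group-level quantities while still producing one value per row. The sketch below, on made-up data, doubles a column as on the slide and also subtracts each group's mean, which is a common use that is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 6.0]})
grouped = df.groupby("A")

# Same operation as on the slide, restricted to the numeric column.
doubled = grouped["C"].transform(lambda x: x * 2)

# Group-aware transformation: subtract the mean of each group.
centered = grouped["C"].transform(lambda x: x - x.mean())

print(doubled.tolist())
print(centered.tolist())
```

Both results line up row by row with df, unlike an aggregation, which returns one row per group.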

Pandas - Grouping the data

Apply a filter to select a group based on some criterion.

Example

>>> grouped.filter(lambda x: sum(x['C']) > 0)
     A      B         C         D
0  foo    one  1.066805 -1.252834
2  foo    two -0.180407  1.686709
4  foo    two  0.228522 -0.457232
6  foo    one -0.553085  0.512941
7  foo  three -0.346510  0.434751
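The filter function is called once per group and must return a single boolean; groups for which it returns False are dropped in their entirety. A small sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, -2.0, 3.0, 0.5]})

# Keep only the groups whose C values sum to a positive number.
kept = df.groupby("A").filter(lambda g: g["C"].sum() > 0)

print(kept)
```

The 'bar' rows are removed even though one of them is positive, because the criterion applies to the group as a whole.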

Section 6

6 Case study

Cost of College

• We are going to analyze the cost of college data scorecard provided by the federal government
• https://collegescorecard.ed.gov/data/

Cost of College

• Find the top 10 median 10 year debt
• Find the top 10 median earnings
• Find the top 10 schools with the best SAT scores
• Find the top 10 best return on investment
• Find average median earnings per state
• Compute the correlation between the SAT scores and median income

Cost of College

Columns of interest
• UNITID
• INSTNM
• STABBR
• CITY
• GRAD_DEBT_MDN_SUPP
• SAT_AVG
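A few of the case study questions can be sketched with the pandas operations covered earlier. The rows below are made-up placeholders, not real College Scorecard values, and MEDIAN_EARNINGS is a hypothetical column name standing in for whichever earnings column the data set provides; the actual file would be loaded with pd.read_csv.

```python
import pandas as pd

# Placeholder rows for illustration only; not real scorecard data.
df = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C", "School D"],
    "STABBR": ["VA", "VA", "NC", "NC"],
    "SAT_AVG": [1100, 1300, 1000, 1200],
    "GRAD_DEBT_MDN_SUPP": [20000, 15000, 25000, 18000],
    "MEDIAN_EARNINGS": [40000, 60000, 35000, 50000],  # hypothetical name
})

# Top-N by a column (N=2 here; the case study asks for 10).
top_debt = df.nlargest(2, "GRAD_DEBT_MDN_SUPP")[["INSTNM", "GRAD_DEBT_MDN_SUPP"]]
top_sat = df.nlargest(2, "SAT_AVG")[["INSTNM", "SAT_AVG"]]

# Average earnings per state.
earnings_by_state = df.groupby("STABBR")["MEDIAN_EARNINGS"].mean()

# Correlation between SAT scores and earnings.
sat_income_corr = df["SAT_AVG"].corr(df["MEDIAN_EARNINGS"])

print(top_debt)
print(earnings_by_state)
print(sat_income_corr)
```

On the real data the columns may need to be converted to numeric types before nlargest and corr can be used.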

Cost of College - Generate metrics and create interactive visualizations using Bokeh

• Generate metrics and create interactive visualizations using Bokeh
• Create an interactive choropleth visualization
• Sample given here at http://sjster.bitbucket.org/sub2/index.html

Interactive Choropleth for querying and visualization

Section 7

7 Conclusion

Questions

Thank you for attending!