Data Mining with Python (Working Draft)
Finn Årup Nielsen
May 8, 2015

Contents

Contents
List of Figures
List of Tables
1 Introduction
  1.1 Other introductions to Python?
  1.2 Why Python for data mining?
  1.3 Why not Python for data mining?
  1.4 Components of the Python language and software
  1.5 Developing and running Python
    1.5.1 Python, pypy, IPython
    1.5.2 IPython Notebook
    1.5.3 Python 2 vs. Python 3
    1.5.4 Editing
    1.5.5 Python in the cloud
    1.5.6 Running Python in the browser
2 Python
  2.1 Basics
  2.2 Datatypes
    2.2.1 Booleans (bool)
    2.2.2 Numbers (int, float and Decimal)
    2.2.3 Strings (str)
    2.2.4 Dictionaries (dict)
    2.2.5 Dates and times
    2.2.6 Enumeration
  2.3 Functions and arguments
    2.3.1 Anonymous functions with lambdas
    2.3.2 Optional function arguments
  2.4 Object-oriented programming
    2.4.1 Objects as functions
  2.5 Modules and import
    2.5.1 Submodules
    2.5.2 Globbing import
    2.5.3 Coping with Python 2/3 incompatibility
  2.6 Persistency
    2.6.1 Pickle and JSON
    2.6.2 SQL
    2.6.3 NoSQL
  2.7 Documentation
  2.8 Testing
    2.8.1 Testing for type
    2.8.2 Zero-one-some testing
    2.8.3 Test layout and test discovery
    2.8.4 Test coverage
    2.8.5 Testing in different environments
  2.9 Profiling
  2.10 Coding style
    2.10.1 Where is private and public?
  2.11 Command-line interface scripting
    2.11.1 Distinguishing between module and script
    2.11.2 Argument parsing
    2.11.3 Exit status
  2.12 Debugging
    2.12.1 Logging
  2.13 Advices
3 Python for data mining
  3.1 Numpy
  3.2 Plotting
    3.2.1 3D plotting
    3.2.2 Real-time plotting
    3.2.3 Plotting for the Web
    3.2.4 Vispy
  3.3 Pandas
    3.3.1 Pandas data types
    3.3.2 Pandas indexing
    3.3.3 Pandas joining, merging and concatenations
    3.3.4 Simple statistics
  3.4 SciPy
    3.4.1 scipy.linalg
    3.4.2 scipy.fftpack
  3.5 Statsmodels
  3.6 Sympy
  3.7 Machine learning
    3.7.1 Scikit-learn
  3.8 Text mining
    3.8.1 Regular expressions
    3.8.2 Extracting from webpages
    3.8.3 NLTK
    3.8.4 Tokenization and part-of-speech tagging
    3.8.5 Language detection
    3.8.6 Sentiment analysis
  3.9 Network mining
  3.10 Miscellaneous issues
    3.10.1 Lazy computation
  3.11 Testing data mining code
4 Case: Pure Python matrix library
  4.1 Code listing
5 Case: Pima data set
  5.1 Problem description and objectives
  5.2 Descriptive statistics and plotting
  5.3 Statistical tests
  5.4 Predicting diabetes type
6 Case: Data mining a database
  6.1 Problem description and objectives
  6.2 Reading the data
  6.3 Graphical overview on the connections between the tables
  6.4 Statistics on the number of tracks sold
7 Case: Twitter information diffusion
  7.1 Problem description and objectives
  7.2 Building a news classifier
8 Case: Big data
  8.1 Problem description and objectives
  8.2 Stream processing of JSON
Bibliography
Index

Preface

Python has grown to become one of the central languages in data mining, offering both a general programming language and libraries specifically targeted at numerical computations. This book is continuously being written and grew out of a course given at the Technical University of Denmark.

List of Figures

1.1 The Python hierarchy
2.1 Overview of methods and attributes in the common Python 2 built-in data types plotted as a formal concept analysis lattice graph. Only a small subset of methods and attributes is shown.
3.1 Comorbidity for ICD-10 disease code (appendicitis)
5.1 Seaborn correlation plot on the Pima data set
6.1 Database tables graph

List of Tables

2.1 Basic built-in and Numpy and Pandas datatypes
2.2 Class methods and attributes
2.3 Testing concepts
3.1 Function for generation of Numpy data structures
3.2 Some of the subpackages of SciPy
3.3 Python machine learning packages
3.4 Scikit-learn methods
3.5 sklearn classifiers
3.6 Metacharacters and character classes
3.7 NLTK submodules
5.1 Variables in the Pima data set

Chapter 1 Introduction

1.1 Other introductions to Python?

Although we cover a bit of introductory Python programming in chapter 2, you should not regard this book as a Python introduction: several free introductory resources exist. First and foremost is the official Python Tutorial at http://docs.python.org/tutorial/. Beginning programmers with little or no programming experience may want to look into the book Think Python, available from http://www.greenteapress.com/thinkpython/ or as a printed book [1], while more experienced programmers can start with Dive Into Python, available from http://www.diveintopython.net/. Kevin Sheppard's presently 381-page Introduction to Python for Econometrics, Statistics and Data Analysis covers both Python basics and Python-based data analysis with Numpy, SciPy, Matplotlib and Pandas, and it is not just relevant for econometrics [2]. Developers already well-versed in standard Python development but lacking experience with Python for data mining can begin with chapter 3. Readers in need of an introduction to machine learning may take a look at Marsland's Machine learning: An algorithmic perspective [3], which uses Python for its examples.

1.2 Why Python for data mining?

Researchers have noted a number of reasons for using Python in the data science area (data mining, scientific computing) [4, 5, 6]:

1. Programmers regard Python as a clear and simple language with high readability. Even non-programmers may not find it too difficult. The simplicity