Data Mining with Python (Working Draft)

Finn Årup Nielsen
May 8, 2015

Contents

Contents
List of Figures
List of Tables
1 Introduction
    1.1 Other introductions to Python?
    1.2 Why Python for data mining?
    1.3 Why not Python for data mining?
    1.4 Components of the Python language and software
    1.5 Developing and running Python
        1.5.1 Python, pypy, IPython
        1.5.2 IPython Notebook
        1.5.3 Python 2 vs. Python 3
        1.5.4 Editing
        1.5.5 Python in the cloud
        1.5.6 Running Python in the browser
2 Python
    2.1 Basics
    2.2 Datatypes
        2.2.1 Booleans (bool)
        2.2.2 Numbers (int, float and Decimal)
        2.2.3 Strings (str)
        2.2.4 Dictionaries (dict)
        2.2.5 Dates and times
        2.2.6 Enumeration
    2.3 Functions and arguments
        2.3.1 Anonymous functions with lambdas
        2.3.2 Optional function arguments
    2.4 Object-oriented programming
        2.4.1 Objects as functions
    2.5 Modules and import
        2.5.1 Submodules
        2.5.2 Globbing import
        2.5.3 Coping with Python 2/3 incompatibility
    2.6 Persistency
        2.6.1 Pickle and JSON
        2.6.2 SQL
        2.6.3 NoSQL
    2.7 Documentation
    2.8 Testing
        2.8.1 Testing for type
        2.8.2 Zero-one-some testing
        2.8.3 Test layout and test discovery
        2.8.4 Test coverage
        2.8.5 Testing in different environments
    2.9 Profiling
    2.10 Coding style
        2.10.1 Where is private and public?
    2.11 Command-line interface scripting
        2.11.1 Distinguishing between module and script
        2.11.2 Argument parsing
        2.11.3 Exit status
    2.12 Debugging
        2.12.1 Logging
    2.13 Advices
3 Python for data mining
    3.1 Numpy
    3.2 Plotting
        3.2.1 3D plotting
        3.2.2 Real-time plotting
        3.2.3 Plotting for the Web
        3.2.4 Vispy
    3.3 Pandas
        3.3.1 Pandas data types
        3.3.2 Pandas indexing
        3.3.3 Pandas joining, merging and concatenations
        3.3.4 Simple statistics
    3.4 SciPy
        3.4.1 scipy.linalg
        3.4.2 scipy.fftpack
    3.5 Statsmodels
    3.6 Sympy
    3.7 Machine learning
        3.7.1 Scikit-learn
    3.8 Text mining
        3.8.1 Regular expressions
        3.8.2 Extracting from webpages
        3.8.3 NLTK
        3.8.4 Tokenization and part-of-speech tagging
        3.8.5 Language detection
        3.8.6 Sentiment analysis
    3.9 Network mining
    3.10 Miscellaneous issues
        3.10.1 Lazy computation
    3.11 Testing data mining code
4 Case: Pure Python matrix library
    4.1 Code listing
5 Case: Pima data set
    5.1 Problem description and objectives
    5.2 Descriptive statistics and plotting
    5.3 Statistical tests
    5.4 Predicting diabetes type
6 Case: Data mining a database
    6.1 Problem description and objectives
    6.2 Reading the data
    6.3 Graphical overview on the connections between the tables
    6.4 Statistics on the number of tracks sold
7 Case: Twitter information diffusion
    7.1 Problem description and objectives
    7.2 Building a news classifier
8 Case: Big data
    8.1 Problem description and objectives
    8.2 Stream processing of JSON
Bibliography
Index

Preface

Python has grown to become one of the central languages in data mining, offering both a general programming language and libraries specifically targeted at numerical computations.

This book is continuously being written and grew out of a course given at the Technical University of Denmark.

List of Figures

1.1 The Python hierarchy
2.1 Overview of methods and attributes in the common Python 2 built-in data types plotted as a formal concept analysis lattice graph. Only a small subset of methods and attributes is shown.
3.1 Comorbidity for ICD-10 disease code (appendicitis)
5.1 Seaborn correlation plot on the Pima data set
6.1 Database tables graph

List of Tables

2.1 Basic built-in and Numpy and Pandas datatypes
2.2 Class methods and attributes
2.3 Testing concepts
3.1 Function for generation of Numpy data structures
3.2 Some of the subpackages of SciPy
3.3 Python machine learning packages
3.4 Scikit-learn methods
3.5 sklearn classifiers
3.6 Metacharacters and character classes
3.7 NLT submodules
5.1 Variables in the Pima data set

Chapter 1

Introduction

1.1 Other introductions to Python?

Although we cover a bit of introductory Python programming in chapter 2, you should not regard this book as a Python introduction: several free introductory resources exist. First and foremost is the official Python Tutorial at http://docs.python.org/tutorial/. Beginning programmers with little or no programming experience may want to look into the book Think Python, available from http://www.greenteapress.com/thinkpython/ or as a book [1], while more experienced programmers can start with Dive Into Python, available from http://www.diveintopython.net/. Kevin Sheppard's presently 381-page Introduction to Python for Econometrics, Statistics and Data Analysis covers both Python basics and Python-based data analysis with Numpy, SciPy, Matplotlib and Pandas, and it is not just relevant for econometrics [2].

Developers already well-versed in standard Python development but lacking experience with Python for data mining can begin with chapter 3. Readers in need of an introduction to machine learning may take a look at Marsland's Machine learning: An algorithmic perspective [3], which uses Python for its examples.

1.2 Why Python for data mining?

Researchers have noted a number of reasons for using Python in the data science area (data mining, scientific computing) [4, 5, 6]:

1. Programmers regard Python as a clear and simple language with high readability. Even non-programmers may not find it too difficult. The simplicity
