Introduction to ML with scikit-learn

Professor Patrick McDaniel
Jonathan Price
Fall 2015

Features
• Attributes in a data set
• “Individual measurable property of phenomenon being observed”
• Choosing/discovering features is a crucial part of ML
• Ex:
  ‣ Character Recognition: histograms of pixels
  ‣ Speech Recognition: sound length, power, frequency
  ‣ Malware Detection: function use count, byte counts
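
To make the "histograms of pixels" example concrete, here is a minimal sketch of turning an image into a feature vector. The 4x4 image, the 0..16 intensity range, and the helper name pixel_histogram are assumptions chosen for illustration, not part of the original slides.

```python
import numpy as np


def pixel_histogram(image, bins=4, max_value=16):
    """Count how many pixels fall into each intensity bin (hypothetical helper)."""
    counts, _ = np.histogram(image, bins=bins, range=(0, max_value))
    return counts


# A tiny made-up 4x4 grayscale "image" with intensities in 0..16.
image = np.array([
    [0, 0, 8, 16],
    [0, 4, 12, 16],
    [0, 4, 12, 16],
    [0, 0, 8, 16],
])

# The feature vector has one entry per intensity bin.
features = pixel_histogram(image)
print(features)
```

Each image, whatever its content, is reduced to the same fixed number of values, which is what lets a learning algorithm compare samples.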

Supervised Learning
• Inferring a function from labeled training data
• The features are selected by the developer
• As such, it requires the developer to know something about the dataset to infer good features
• Based on pairs of input objects and output values
• Ex:
  ‣ Regression – Predict values
  ‣ Classification – Predict groupings
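
The two example tasks can be sketched side by side. The toy data below is made up, and the choice of LinearRegression and KNeighborsClassifier is only an assumption for illustration; any scikit-learn regressor or classifier follows the same fit/predict pattern.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy labeled data (hypothetical): one feature per sample.
X = [[1.0], [2.0], [3.0], [4.0]]

# Regression: the outputs are continuous values (here y = 2x).
y_values = [2.0, 4.0, 6.0, 8.0]
reg = LinearRegression().fit(X, y_values)
predicted_value = reg.predict([[5.0]])[0]

# Classification: the outputs are discrete group labels.
y_labels = ["small", "small", "large", "large"]
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y_labels)
predicted_label = knn.predict([[3.6]])[0]

print(predicted_value, predicted_label)
```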

Unsupervised Learning
• Find hidden structure or patterns in unlabeled data
• Requires no prior knowledge of the nature of data
• Not limited by biases inherent in feature selection
• Ex:
  ‣ K-means
  ‣ Clustering
  ‣ Neural networks
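
As a rough sketch of the K-means example, the snippet below groups unlabeled points without ever seeing a target. The six 2-D points and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled toy data (hypothetical): two visibly separate blobs.
X = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])

# No targets are passed to fit; the structure comes from X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping of points is meaningful.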

Scikit-learn
• The easy way to do data mining and data analysis
• It's all Python scripts (yay)
• Built on NumPy, SciPy, and matplotlib
• Okay, let's get it:
  ‣ pip install numpy scipy scikit-learn

Let's Do One
• Classification of digits problem
• Classify images of drawn numbers

Before We Start
• What can we use about the image of a character to solve this problem?

Dataset
• A dataset object in scikit-learn is a dictionary-like object that holds all the data (and some metadata).
• The actual data is stored as an (n_samples, n_features) array
• Let's get the digits dataset:
  >>> from sklearn import datasets
  >>> digits = datasets.load_digits()
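
A quick way to see the (n_samples, n_features) layout is to inspect the loaded object. This sketch only uses attributes of the object returned by load_digits.

```python
from sklearn import datasets

digits = datasets.load_digits()

# digits.data is the (n_samples, n_features) array described above.
n_samples, n_features = digits.data.shape
print(n_samples, n_features)

# digits.target holds one label (the digit 0-9) per sample.
print(digits.target[:10])

# The dictionary-like object also exposes its available fields.
print(sorted(digits.keys()))
```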

Dataset
• “digit database by collecting 250 samples from 44 writers. The samples written by 30 writers are used for training, cross-validation and writer dependent testing, and the digits written by the other 14 are used for writer independent testing”
• 500 x 500 pixel characters, compressed to form this (and then a feature vector of length=64):
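
The length-64 feature vector can be checked directly against the image it came from. This is a small sketch assuming the digits object from load_digits, whose images attribute stores each sample as a 2-D grid.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each sample is also available as a small 2-D image.
first_image = digits.images[0]
print(first_image.shape)

# Flattening the image row by row gives the feature vector in digits.data.
flattened = first_image.reshape(-1)
same = np.array_equal(flattened, digits.data[0])
print(flattened.shape, same)
```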

Let's Do Some Estimating
• We’re going to use support vector classification (SVC). We’ll explain later.
• This code sets up the classifier clf:
  >>> from sklearn import svm
  >>> clf = svm.SVC(gamma=0.001, C=100.)
• We will also treat this as a black box and come back to the gamma/C values later
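
To give a feel for why the gamma value matters, here is a hedged sketch that compares two settings with cross-validation. In scikit-learn, gamma is the kernel coefficient of the default RBF kernel and C is the regularization parameter. The helper mean_cv_accuracy, the comparison value gamma=1.0, and the 3-fold setup are assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
X, y = digits.data, digits.target


def mean_cv_accuracy(gamma, C=100.0):
    """Mean 3-fold cross-validated accuracy for one (gamma, C) pair (hypothetical helper)."""
    clf = svm.SVC(gamma=gamma, C=C)
    return cross_val_score(clf, X, y, cv=3).mean()


# The value from the slide versus a much larger one.
score_small_gamma = mean_cv_accuracy(gamma=0.001)
score_large_gamma = mean_cv_accuracy(gamma=1.0)
print(score_small_gamma, score_large_gamma)
```

A very large gamma makes each training sample influence only its immediate neighborhood, so the classifier tends to memorize the training set and generalize poorly.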

Fit And Predict
• To fit the classifier (on every sample except the last):
  >>> clf.fit(digits.data[:-1], digits.target[:-1])
• Now, we predict!
  >>> clf.predict(digits.data[-1:])
  array([8])
  ‣ predict expects a 2D array of samples, hence the slice [-1:]
• Which is apparently this from before:
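
Predicting a single digit says little about how good the classifier is. A minimal sketch of a fuller check, assuming a 75/25 train/test split (the split ratio and random_state are arbitrary choices, not from the slides):

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples that the classifier never sees in fit.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)

# Fraction of held-out digits that were classified correctly.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)
```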

It's (Sort of) That Easy!
• We glossed over a couple of details, but this shows how easy scikit-learn makes the actual implementation
• Let's talk about some of the concepts we skipped over earlier

SVCs
• We are NOT going into implementation details.
• Used for classification, regression, and detecting outliers
• Advantages:
  ‣ Works in high-dimensional spaces
  ‣ Memory efficient
  ‣ Versatile
• Disadvantages:
  ‣ Prone to overfitting when # of features is much greater than # of samples
  ‣ Don’t directly provide probability estimates
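
On the last disadvantage: what an SVC provides directly is a decision score per class. As a sketch, assuming the installed scikit-learn version supports the probability=True option of svm.SVC, probability estimates can be requested at the cost of extra internal cross-validation during fit.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Plain SVC: decision scores are available, but they are not probabilities.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X[:-1], y[:-1])
scores = clf.decision_function(X[-1:])

# Assumption: probability=True is available; it makes fit noticeably slower.
prob_clf = svm.SVC(gamma=0.001, C=100., probability=True, random_state=0)
prob_clf.fit(X[:-1], y[:-1])
probabilities = prob_clf.predict_proba(X[-1:])

print(scores.shape, probabilities.shape)
```

The estimated probabilities come from a separate calibration step, so their top class may occasionally disagree with predict.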

SVC: Graphically

Next Week
• Next, we will go over a security usage of data analysis: a malware classification Kaggle challenge from Microsoft
• See the course site for supplemental readings and setup instructions
