Introduction to ML with scikit-learn

Professor Patrick McDaniel, Jonathan Price
Fall 2015

Features
• Attributes in a data set
• “Individual measurable property of phenomenon being observed”
• Choosing/discovering features is a crucial part of ML
• Ex:
  ‣ Character recognition: histograms of pixels
  ‣ Speech recognition: sound length, power, frequency
  ‣ Malware detection: function use counts, byte counts

Supervised Learning
• Inferring a function from labeled training data
• The features are selected by the developer
• As such, the developer needs to know something about the dataset to choose good features
• Based on pairs of input objects and output values
• Ex:
  ‣ Regression: predict values
  ‣ Classification: predict groupings

Unsupervised Learning
• Find hidden structure or patterns in unlabeled data
• Requires no prior knowledge of the nature of the data
• Not limited by biases inherent in feature selection
• Ex:
  ‣ K-means
  ‣ Clustering
  ‣ Neural networks (some architectures, e.g. autoencoders)

Scikit-learn
• Simple, efficient tools for data mining and data analysis
• It's all Python scripts (yay)
• Built on NumPy, SciPy, and matplotlib
• Okay, let's get it:
  ‣ pip install numpy scipy scikit-learn

Let's Do One
• Classification of digits problem
• Classify images of drawn numbers

Before We Start
• Which properties of a character image could we use as features to solve this problem?

Dataset
• A dataset object in scikit-learn is a dictionary-like object that holds all the data (and some metadata).
• The actual data is stored as an (n_samples, n_features) array
• Let's get the digits dataset:

>>> from sklearn import datasets
>>> digits = datasets.load_digits()

Dataset (continued)
• “digit database by collecting 250 samples from 44 writers. The samples written by 30 writers are used for training, cross-validation and writer dependent testing, and the digits written by the other 14 are used for writer independent testing”
• 500 x 500 pixel characters, compressed to form this (and then a feature vector of length=64):
  [figure]

Let's Do Some Estimating
• We're going to use support vector classification (SVC). We'll explain later.
• This code sets up the classifier clf:

>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)

• For now, treat it as a black box; we'll come back to the gamma and C values later

Fit And Predict
• To fit the classifier on every sample except the last:

>>> clf.fit(digits.data[:-1], digits.target[:-1])

• Now, we predict! Note that predict expects a 2-D array, so the last sample is passed as the slice [-1:] rather than the index [-1]:

>>> clf.predict(digits.data[-1:])
array([8])

• Which is apparently this from before:
  [figure: image of the last digit in the dataset]

It's (Sort of) That Easy!
• We glossed over a couple of details, but this shows how easy scikit-learn makes the actual implementation
• Let's talk about some of the concepts we skipped over earlier

SVCs
• We are NOT going into implementation details.
• Support vector machines in general are used for classification, regression, and outlier detection
• Advantages:
  ‣ Work in high-dimensional spaces
  ‣ Memory efficient (the decision function uses only a subset of the training points, the support vectors)
  ‣ Versatile (different kernel functions can be chosen)
• Disadvantages:
  ‣ Prone to overfitting when # of features is much greater than # of samples; the choice of kernel and regularization becomes crucial
  ‣ Don't directly provide probability estimates

SVC: Graphically
  [figure]

Next Week
• Next, we will go over a security use of data analysis: a malware classification Kaggle challenge from Microsoft
• See the course site for supplemental readings and setup instructions
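
The load, fit, and predict steps from the slides can be collected into one script. The following is a minimal sketch, not the course's own code: it reuses the gamma=0.001 and C=100 settings from the slides, and adds one step the slides do not contain, a held-out evaluation on the second half of the data. That step and the names half, holdout_clf, and holdout_accuracy are assumptions made for illustration.

```python
from sklearn import datasets, svm

# Load the bundled digits dataset; each row of .data is one flattened image.
digits = datasets.load_digits()
n_samples, n_features = digits.data.shape

# Same classifier settings as in the slides.
clf = svm.SVC(gamma=0.001, C=100.0)

# Fit on every sample except the last one.
clf.fit(digits.data[:-1], digits.target[:-1])

# predict() expects a 2-D array, so slice with [-1:] to keep the sample axis.
last_prediction = clf.predict(digits.data[-1:])

# Assumption (not in the slides): train on the first half of the data and
# score on the second half to get a rough estimate of accuracy on unseen digits.
half = n_samples // 2
holdout_clf = svm.SVC(gamma=0.001, C=100.0)
holdout_clf.fit(digits.data[:half], digits.target[:half])
holdout_accuracy = holdout_clf.score(digits.data[half:], digits.target[half:])

print("data shape:", digits.data.shape)
print("predicted label for last sample:", last_prediction[0])
print("true label for last sample:", digits.target[-1])
print("held-out accuracy:", round(holdout_accuracy, 3))
```

Predicting a single sample that sits right next to the training data says little about how well the classifier generalizes, which is why the sketch also scores a separate classifier on data it was not fit on.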
