Introduction to Machine Learning: Examples of Unsupervised And
Total Page:16
File Type:pdf, Size:1020Kb
In [1]: from __future__ import division, print_function, absolute_import, unicode_literals Introduction to Machine Learning: Examples of Unsupervised and Supervised Machine- Learning Algorithms Version 0.1 Broadly speaking, machine-learning methods constitute a diverse collection of data-driven algorithms designed to classify/characterize/analyze sources in multi-dimensional spaces. The topics and studies that fall under the umbrella of machine learning is growing, and there is no good catch-all definition. The number (and variation) of algorithms is vast, and beyond the scope of these exercises. While we will discuss a few specific algorithms today, more importantly, we will explore the scope of the two general methods: unsupervised learning and supervised learning and introduce the powerful (and dangerous?) Python package scikit- learn (http://scikit-learn.org/stable/). By AA Miller (Jet Propulsion Laboratory, California Institute of Technology.) (c) 2016 California Institute of Technology. Government sponsorship acknowledged. In [2]: import numpy as np from astropy.table import Table import matplotlib.pyplot as plt %matplotlib inline Problem 1) Introduction to scikit-learn At the most basic level, scikit-learn makes machine learning extremely easy within Python. By way of example, here is a short piece of code that builds a complex, non-linear model to classify sources in the Iris data set that we learned about yesterday: from sklearn import datasets from sklearn.ensemble import RandomForestClassifier iris = datasets.load_iris() RFclf = RandomForestClassifier().fit(iris.data, iris.target) Those 4 lines of code have constructed a model that is superior to any system of hard cuts that we could have encoded while looking at the multidimensional space. This can be fast as well: execute the dummy code in the cell below to see how "easy" machine-learning is with scikit-learn. In [6]: # execute dummy code here Generally speaking, the procedure for scikit-learn is uniform across all machine-learning algorithms. Models are accessed via the various modules (ensemble, SVM, neighbors, etc), with user-defined tuning parameters. The features (or data) for the models are stored in a 2D array, X, with rows representing individual sources and columns representing the corresponding feature values. [In a minority of cases, X, represents a similarity or distance matrix where each entry represents the distance to every other source in the data set.] In cases where there is a known classification or scalar value (typically supervised methods), this information is stored in a 1D array y. Unsupervised models are fit by calling .fit(X) and supervised models are fit by calling .fit(X, y). In both cases, predictions for new observations, Xnew, can be obtained by calling .predict(Xnew). Those are the basics and beyond that, the details are algorithm specific, but the documentation for essentially everything within scikit-learn is excellent, so read the docs. To further develop our intuition, we will now explore the Iris dataset a little further. Problem 1a What is the pythonic type of iris? In [11]: type(iris) Out[11]: sklearn.datasets.base.Bunch You likely haven't encountered a scikit-learn Bunch before. It's functionality is essentially the same as a dictionary. Problem 1b What are the keys of iris? In [12]: iris.keys() Out[12]: ['target_names', 'data', 'target', 'DESCR', 'feature_names'] Most importantly, iris contains data and target values. These are all you need for scikit-learn, though the feature and target names and description are useful. Problem 1c What is the shape and content of the iris data? In [13]: print(np.shape(iris.data)) print(iris.data) (150, 4) [[ 5.1 3.5 1.4 0.2] [ 4.9 3. 1.4 0.2] [ 4.7 3.2 1.3 0.2] [ 4.6 3.1 1.5 0.2] [ 5. 3.6 1.4 0.2] [ 5.4 3.9 1.7 0.4] [ 4.6 3.4 1.4 0.3] [ 5. 3.4 1.5 0.2] [ 4.4 2.9 1.4 0.2] [ 4.9 3.1 1.5 0.1] [ 5.4 3.7 1.5 0.2] [ 4.8 3.4 1.6 0.2] [ 4.8 3. 1.4 0.1] [ 4.3 3. 1.1 0.1] [ 5.8 4. 1.2 0.2] [ 5.7 4.4 1.5 0.4] [ 5.4 3.9 1.3 0.4] [ 5.1 3.5 1.4 0.3] [ 5.7 3.8 1.7 0.3] [ 5.1 3.8 1.5 0.3] [ 5.4 3.4 1.7 0.2] [ 5.1 3.7 1.5 0.4] [ 4.6 3.6 1. 0.2] [ 5.1 3.3 1.7 0.5] [ 4.8 3.4 1.9 0.2] [ 5. 3. 1.6 0.2] [ 5. 3.4 1.6 0.4] [ 5.2 3.5 1.5 0.2] [ 5.2 3.4 1.4 0.2] [ 4.7 3.2 1.6 0.2] [ 4.8 3.1 1.6 0.2] [ 5.4 3.4 1.5 0.4] [ 5.2 4.1 1.5 0.1] [ 5.5 4.2 1.4 0.2] [ 4.9 3.1 1.5 0.1] [ 5. 3.2 1.2 0.2] [ 5.5 3.5 1.3 0.2] [ 4.9 3.1 1.5 0.1] [ 4.4 3. 1.3 0.2] [ 5.1 3.4 1.5 0.2] [ 5. 3.5 1.3 0.3] [ 4.5 2.3 1.3 0.3] [ 4.4 3.2 1.3 0.2] [ 5. 3.5 1.6 0.6] [ 5.1 3.8 1.9 0.4] [ 4.8 3. 1.4 0.3] [ 5.1 3.8 1.6 0.2] [ 4.6 3.2 1.4 0.2] [ 5.3 3.7 1.5 0.2] [ 5. 3.3 1.4 0.2] [ 7. 3.2 4.7 1.4] [ 6.4 3.2 4.5 1.5] [ 6.9 3.1 4.9 1.5] [ 5.5 2.3 4. 1.3] [ 6.5 2.8 4.6 1.5] [ 5.7 2.8 4.5 1.3] [ 6.3 3.3 4.7 1.6] [ 4.9 2.4 3.3 1. ] [ 6.6 2.9 4.6 1.3] [ 5.2 2.7 3.9 1.4] [ 5. 2. 3.5 1. ] [ 5.9 3. 4.2 1.5] [ 6. 2.2 4. 1. ] [ 6.1 2.9 4.7 1.4] [ 5.6 2.9 3.6 1.3] [ 6.7 3.1 4.4 1.4] [ 5.6 3. 4.5 1.5] [ 5.8 2.7 4.1 1. ] [ 6.2 2.2 4.5 1.5] [ 5.6 2.5 3.9 1.1] [ 5.9 3.2 4.8 1.8] [ 6.1 2.8 4. 1.3] [ 6.3 2.5 4.9 1.5] [ 6.1 2.8 4.7 1.2] [ 6.4 2.9 4.3 1.3] [ 6.6 3. 4.4 1.4] [ 6.8 2.8 4.8 1.4] [ 6.7 3. 5. 1.7] [ 6. 2.9 4.5 1.5] [ 5.7 2.6 3.5 1. ] [ 5.5 2.4 3.8 1.1] [ 5.5 2.4 3.7 1. ] [ 5.8 2.7 3.9 1.2] [ 6. 2.7 5.1 1.6] [ 5.4 3. 4.5 1.5] [ 6. 3.4 4.5 1.6] [ 6.7 3.1 4.7 1.5] [ 6.3 2.3 4.4 1.3] [ 5.6 3. 4.1 1.3] [ 5.5 2.5 4. 1.3] [ 5.5 2.6 4.4 1.2] [ 6.1 3. 4.6 1.4] [ 5.8 2.6 4. 1.2] [ 5. 2.3 3.3 1. ] [ 5.6 2.7 4.2 1.3] [ 5.7 3. 4.2 1.2] [ 5.7 2.9 4.2 1.3] [ 6.2 2.9 4.3 1.3] [ 5.1 2.5 3. 1.1] [ 5.7 2.8 4.1 1.3] [ 6.3 3.3 6. 2.5] [ 5.8 2.7 5.1 1.9] [ 7.1 3. 5.9 2.1] [ 6.3 2.9 5.6 1.8] [ 6.5 3. 5.8 2.2] [ 7.6 3. 6.6 2.1] [ 4.9 2.5 4.5 1.7] [ 7.3 2.9 6.3 1.8] [ 6.7 2.5 5.8 1.8] [ 7.2 3.6 6.1 2.5] [ 6.5 3.2 5.1 2. ] [ 6.4 2.7 5.3 1.9] [ 6.8 3. 5.5 2.1] [ 5.7 2.5 5. 2. ] [ 5.8 2.8 5.1 2.4] [ 6.4 3.2 5.3 2.3] [ 6.5 3. 5.5 1.8] [ 7.7 3.8 6.7 2.2] [ 7.7 2.6 6.9 2.3] [ 6. 2.2 5. 1.5] [ 6.9 3.2 5.7 2.3] [ 5.6 2.8 4.9 2. ] [ 7.7 2.8 6.7 2. ] [ 6.3 2.7 4.9 1.8] [ 6.7 3.3 5.7 2.1] [ 7.2 3.2 6. 1.8] [ 6.2 2.8 4.8 1.8] [ 6.1 3. 4.9 1.8] [ 6.4 2.8 5.6 2.1] [ 7.2 3. 5.8 1.6] [ 7.4 2.8 6.1 1.9] [ 7.9 3.8 6.4 2. ] [ 6.4 2.8 5.6 2.2] [ 6.3 2.8 5.1 1.5] [ 6.1 2.6 5.6 1.4] [ 7.7 3. 6.1 2.3] [ 6.3 3.4 5.6 2.4] [ 6.4 3.1 5.5 1.8] [ 6.