Scikit-Learn: Classifiers - Binary

• Binary classification

  from sklearn.linear_model import SGDClassifier

SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15,
fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0,
epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal',
eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1,
n_iter_no_change=5, class_weight=None, warm_start=False, average=False,
n_iter=None)

– This implements linear classifiers (i.e., SVM, logistic regression, a.o.)
– For best results using the default learning rate schedule, the data should have zero mean and unit variance
– Expects floating point values for the features
– Parameters:
  ∗ loss (string)
    · The loss function used to calculate error (e.g., squared loss)
    · Determines the model used:
      'hinge': linear SVM
      'log': logistic regression
      'modified_huber'
      'squared_hinge'
      'squared_loss'
      'epsilon_insensitive'
      'squared_epsilon_insensitive'
  ∗ penalty (string)
    · Type of regularization
  ∗ alpha (float)
    · Regularization term multiplier
    · Also affects the learning rate when learning_rate is set to 'optimal'
  ∗ l1_ratio (float)
    · Elastic Net mixing parameter
  ∗ fit_intercept (bool)
    · Whether the intercept should be estimated or not
    · If False, the data is assumed to be already centered

  ∗ max_iter (int)
    · Max number of passes over the training data
  ∗ tol (float)
    · Stopping criterion

    · If not None, the iterations will stop when (loss > previous_loss − tol)
  ∗ shuffle (bool)
    · Whether or not the training data should be shuffled after each epoch
  ∗ epsilon (float)
    · ε value in the loss functions for 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'
  ∗ eta0 (double)
    · Initial learning rate
  ∗ learning_rate (string)
    · Learning rate schedule:
      'constant': η = eta0
      'optimal': η = 1.0/(α(t + t0)), where t0 is chosen by a heuristic
      'invscaling': η = eta0/pow(t, power_t)
      'adaptive': η = eta0 as long as the training loss keeps decreasing
  ∗ power_t (double)
    · Exponent for the 'invscaling' learning rate schedule
  ∗ early_stopping (bool)
    · If True, automatically sets aside a fraction of training data as validation and terminates training when the validation score is not improving by at least tol for n_iter_no_change consecutive epochs
  ∗ validation_fraction (float)
    · Proportion of training data to set aside as validation set for early stopping

  ∗ n_iter_no_change (int)
    · Number of iterations with no improvement to wait before early stopping
  ∗ average (bool or int)
    · When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute
    · If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average
  ∗ n_iter (int)
    · Number of passes over the training data (deprecated)
– Attributes:
  ∗ coef_ (array)
    · Weights assigned to the features
  ∗ intercept_ (array)
    · Constants in decision function
  ∗ n_iter_ (int)
    · Actual number of iterations run
– Methods (beyond fit() and predict()):
  ∗ decision_function(X)
    · Predict confidence scores for samples
  ∗ partial_fit(X, y, classes=None, sample_weight=None)
    · Perform one epoch of stochastic gradient descent on the given samples
  ∗ score(X, y, sample_weight=None)
    · Returns the mean accuracy on the given test data and labels
  ∗ predict_proba(X)
    · Probability estimates
    · Only available for log loss and modified Huber loss
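– A minimal usage sketch of the estimator described above; the toy dataset, the scaling step, and the hyperparameter values are illustrative assumptions, not from the notes:

  from sklearn.datasets import make_classification
  from sklearn.linear_model import SGDClassifier
  from sklearn.preprocessing import StandardScaler

  # Toy binary problem; per the note above, SGD works best on
  # standardized features (zero mean, unit variance)
  X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
  X = StandardScaler().fit_transform(X)

  # loss='hinge' gives a linear SVM; loss='log' would give logistic regression
  sgd_clf = SGDClassifier(loss='hinge', penalty='l2', max_iter=1000, tol=1e-3,
                          random_state=42)
  sgd_clf.fit(X, y)

  print(sgd_clf.predict(X[:5]))            # hard class predictions
  print(sgd_clf.decision_function(X[:5]))  # signed scores from the linear model
  print(sgd_clf.score(X, y))               # mean accuracy on (X, y)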

Scikit-Learn: Classifiers - Multiclass and Multilabel

1. Note: All classifiers in scikit-learn do multiclass classification out-of-the-box
   • Use module sklearn.multiclass if you want to experiment with different multiclass strategies
2. Multiclass classification:
   • Classification task with more than two classes
   • Assumes that each sample is assigned to one and only one label
3. Multilabel classification:
   • Each sample is assigned a set of target labels
   • For when labels are not mutually exclusive
4. Multioutput regression:
   • Each sample is assigned a set of target values
5. Multioutput-multiclass classification and multi-task classification:
   • A single estimator has to handle several joint classification tasks
6. Classes:
   (a) One-vs-One multiclass classification

       from sklearn.multiclass import OneVsOneClassifier

OneVsOneClassifier(estimator, n_jobs=None)

• Parameters:
  – Self-evident
• Methods:
  – See above
• Attributes:
  – estimators_
  – classes_
• See also OneVsRestClassifier, a multiclass/multilabel classifier
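• A minimal usage sketch; the iris data and the SGDClassifier base estimator are illustrative choices, not from the notes:

  from sklearn.datasets import load_iris
  from sklearn.linear_model import SGDClassifier
  from sklearn.multiclass import OneVsOneClassifier

  # Wrap a binary classifier to get one-vs-one multiclass behavior
  X, y = load_iris(return_X_y=True)
  ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
  ovo_clf.fit(X, y)

  print(len(ovo_clf.estimators_))  # n(n-1)/2 pairwise classifiers: 3 for 3 classes
  print(ovo_clf.predict(X[:5]))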

(b) Multilabel classification

    from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto',
leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None,
**kwargs)

• Parameters:
  – n_neighbors (int)
    ∗ Number of neighbors to use by default for kneighbors queries
  – weights (string, callable)
    ∗ Weight function used in prediction:
      'uniform': all points in each neighborhood are weighted equally
      'distance': weight points by the inverse of their distance, so closer neighbors have a greater influence than farther ones
      callable: a user-defined function
  – algorithm (string)
    ∗ Algorithm used to compute the nearest neighbors:
      'ball_tree': ball tree
      'kd_tree': k-d tree
      'brute': brute-force search
      'auto': decides the most appropriate algorithm based on the values passed to the fit method
  – leaf_size (int)
    ∗ Leaf size passed to the two tree algorithms
  – p (int)
    ∗ Power parameter for the Minkowski metric
  – metric (string, callable)
    ∗ The distance metric to use for the tree
  – metric_params (dictionary)
    ∗ Additional keyword arguments for the metric function

• Methods:
  – kneighbors(X=None, n_neighbors=None, return_distance=True)
    ∗ Finds the K-neighbors of a point
    ∗ Returns indices of and distances to the neighbors of each point
  – kneighbors_graph(X=None, n_neighbors=None, mode='connectivity')
    ∗ Computes the (weighted) graph of k-Neighbors for points in X
    ∗ Returns a sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
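• A small multilabel sketch, per the note above that KNeighborsClassifier handles a 2-D label array; the toy features and label pairs are invented for illustration:

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  # Each sample carries two non-exclusive boolean labels
  X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
  y_multilabel = np.array([[1, 0], [1, 0], [1, 1],
                           [0, 1], [0, 1], [1, 1]])

  knn_clf = KNeighborsClassifier(n_neighbors=3)
  knn_clf.fit(X, y_multilabel)

  print(knn_clf.predict([[1.5]]))          # a predicted label pair, e.g. [[1 0]]
  dist, idx = knn_clf.kneighbors([[1.5]])  # distances to and indices of the 3 nearest
  print(dist, idx)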

Scikit-Learn: Binary Classification

• References: Geron C3
• MNIST corpus
  – Handwritten samples of digits
  – Already divided into training and test sets
  – Each image is 28 × 28 pixels
    ∗ Results in 784 features (one for each pixel)
    ∗ Values range from 0 (white) to 255 (black)
• Recommended to shuffle the training set before using, especially as numbers are listed in order in MNIST
• SGDClassifier: See Learning Models in Scikit notes
• For more control over cross-validation than Scikit models provide, do it manually (see StratifiedKFold in Cross-Validation in Scikit notes):

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# Assumes X_train, y_train_5 (the "is this digit a 5?" labels)
# and sgd_clf are already defined
skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)           # fresh, unfitted copy per fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))       # fold accuracy

– The above is equivalent to cross_val_score() (see the one-liner after this list)
• Note that the accuracy is potentially a false high
  – MNIST is skewed: only about 10% of the samples are a given digit, so when binary-testing "is digit x a y or not?", a classifier that always answers "not" is already about 90% accurate
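• The equivalent single call, assuming the same sgd_clf, X_train, and y_train_5 as above:

  from sklearn.model_selection import cross_val_score

  # Same evaluation as the manual loop, in one call
  print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))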

Scikit-Learn: Binary Classification - Tuning

• Rather than evaluate on accuracy, use the confusion matrix
  – A confusion matrix is a special type of contingency table that illustrates how well a classifier performs
  – So called because it helps to determine whether the classifier is confusing two classes
  – For example:

                prediction
                 A    B
  category  A    8    2
            B    6    4

OR

                prediction
                 A    B    C
  category  A    5    3    0
            B    2    3    1
            C    0    2   11

  – A table of confusion (also called a confusion matrix) illustrates how well a classifier performs by showing true positives, false positives, true negatives, and false negatives
    ∗ In the first example above:

                prediction
                 A       not A
  category  A    8 TP    2 FN
        not A    6 FP    4 TN

    ∗ And for the second example above:

                prediction
                 A       not A
  category  A    5 TP    3 FN
        not A    2 FP   17 TN

  – To get the predictions, use cross_val_predict() (see Cross-Validation in Scikit notes)
  – Then use the function confusion_matrix():

    from sklearn.metrics import confusion_matrix

    ∗ confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
    ∗ Parameters:
      · y_true: Correct target values
      · y_pred: Estimated target values
      · labels: List of labels to index the matrix
        If omitted, values that appear in y_true or y_pred are used (in sorted order)
        Labels can be used to select a subset
      · sample_weight: Sample weights
    ∗ Returns an array
    ∗ Examples (from API):

      >>> from sklearn.metrics import confusion_matrix
      >>> y_true = [2, 0, 2, 2, 0, 1]
      >>> y_pred = [0, 0, 2, 2, 0, 2]
      >>> confusion_matrix(y_true, y_pred)
      array([[2, 0, 0],
             [0, 0, 1],
             [1, 0, 2]])

>>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"] >>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"] >>> confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"]) array([[2, 0, 0], [0, 0, 1], [1, 0, 2]])

      >>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1],
      ...                                   [1, 1, 1, 0]).ravel()
      >>> (tn, fp, fn, tp)
      (0, 2, 1, 1)

      · Note: The results are represented as in the above examples: the predicted values are the columns, the actual values the rows
        So in the first example, array[0, 0] indicates that 2 zeroes were correctly labeled as zeroes, array[1, 2] that one one was mislabeled as a two, and array[2, 0] that one two was mislabeled as a zero
• Another set of metrics to look at are the precision and recall of the classifier
  – Precision:
    ∗ Measures the accuracy of the positive predictions

      precision = true positives / (true positives + false positives)

  – Recall (sensitivity or true positive rate):
    ∗ Measures the fraction of the samples of a class that are correctly classified

      recall = true positives / (true positives + false negatives)

  – precision_score():
    sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    ∗ Parameters:
      · y_true, y_pred, labels: Already discussed
      · average (string)
        Required for multiclass/multilabel targets
        Values:
          None: Scores for each class are returned
          'binary': Only report results for the class specified by pos_label
            Applicable only if targets (y_{true, pred}) are binary
          'micro': Calculate metrics globally by counting the total true positives, false negatives, and false positives
          'macro': Calculate metrics for each label, and find their unweighted mean
            This does not take label imbalance into account
          'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label)
            This alters 'macro' to account for label imbalance

          'samples': Calculate metrics for each instance, and find their average
            Only meaningful for multilabel classification, where this differs from accuracy_score
      · Returns the precision of the positive class in binary classification, or the weighted average of the precision of each class for the multiclass task
  – recall_score():
    sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    ∗ Parameters: As above
    ∗ Best value is one, worst is zero

  – The F1 score combines precision and recall into a single score
    ∗ The F1 score is the harmonic mean of precision and recall:

      F1 = 2 / (1/precision + 1/recall)
         = 2 × (precision × recall) / (precision + recall)
         = TP / (TP + (FN + FP)/2)

    ∗ The harmonic mean gives more weight to low values
    ∗ Therefore, both precision and recall must be high to get a high F1 score
    ∗ f1_score function:
      · sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    ∗ Sometimes you want precision, other times recall
      · Increasing one decreases the other and vice-versa
      · This is called the precision/recall tradeoff
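  – A small worked example of the three metrics; the labels are invented for illustration:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # TP=3, FP=1, FN=1, TN=3

    print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
    print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
    print(f1_score(y_true, y_pred))         # harmonic mean = 0.75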


• Precision/recall tradeoff
  – Consider the following situation (Geron p90; the figure there shows digit images ordered by classifier score, with several candidate threshold positions marked by arrows)

    ∗ The SGDClassifier makes decisions based on a decision function: if the function's score is greater than some threshold, the sample is considered positive
    ∗ Consider the central arrow as the current threshold
      · Four 5's are classified correctly, and one six incorrectly, so precision = 0.8
      · Since two 5's are misclassified, recall = 0.67
    ∗ By moving the threshold to the right, precision increases to 1.0, but recall falls to 0.5
    ∗ By moving the threshold to the left, precision falls to 0.75, but recall increases to 1.0
  – This can be implemented in scikit manually, using a classifier's decision_function() method instead of its predict() method
    ∗ This returns a score for each sample
    ∗ You can then make your own predictions based on the returned scores
  – Deciding on a threshold
    ∗ Apply cross_val_predict(), but use 'decision_function' for the method parameter
    ∗ This returns the decision scores rather than the actual predictions
    ∗ Then apply the function precision_recall_curve()

  – sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)
    ∗ Computes precision-recall pairs for all possible thresholds
    ∗ Restricted to binary classification
    ∗ Parameters:
      · y_true (array)
        Targets of binary classification in range {−1, 1} or {0, 1}
      · probas_pred (array)
        Estimated probabilities or decision function output

      · pos_label (int, string)
        Label of the positive class
      · sample_weight (array-like)
        Sample weights
    ∗ Returns:
      · precision (array)
        Precision values such that element i is the precision of predictions with score ≥ thresholds[i]; the last element is 1
      · recall (array)
        Decreasing recall values such that element i is the recall of predictions with score ≥ thresholds[i]; the last element is 0
      · thresholds (array)
        Increasing thresholds on the decision function used to compute precision and recall
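  – A sketch of the threshold-selection workflow just described, assuming the MNIST names (sgd_clf, X_train, y_train_5) from the earlier snippets; the 90% precision target is an arbitrary illustration:

    import numpy as np
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import cross_val_predict

    # Get decision scores instead of hard predictions
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # Lowest threshold reaching at least 90% precision
    # (precisions has one more element than thresholds, hence [:-1])
    threshold_90 = thresholds[np.argmax(precisions[:-1] >= 0.90)]

    # Predict manually with the chosen threshold instead of calling predict()
    y_pred_90 = (y_scores >= threshold_90)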

    ∗ Typical plot (Geron p91): precision and recall as functions of the decision threshold

    ∗ Select the threshold with the best precision/recall tradeoff
  – You can also plot precision directly against recall and select the threshold right before the curve nose-dives
    ∗ Typical plot (Geron p92): precision against recall


• The ROC curve
  – The receiver operating characteristic curve plots the true positive rate against the false positive rate, or sensitivity against (1 − specificity), for binary classifications
    ∗ TPR = recall, or sensitivity

      TPR = TP/P = TP/(TP + FN)

    ∗ TNR is the true negative rate, or specificity, the ratio of negative instances correctly classified as negative

      TNR = TN/N = TN/(TN + FP) = 1 − FPR

    ∗ FPR is the ratio of negative instances incorrectly classified as positive

      FPR = FP/N = FP/(FP + TN) = 1 − TNR

  – Use function roc_curve():
    sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
    ∗ Parameters:
      · drop_intermediate
        Whether to drop some suboptimal thresholds which would not appear on a plotted ROC curve
    ∗ Returns:
      · fpr (array)
        Increasing false positive rates such that element i is the false positive rate of predictions with score ≥ thresholds[i]
      · tpr (array)
        Increasing true positive rates such that element i is the true positive rate of predictions with score ≥ thresholds[i]
      · thresholds (array)
        Decreasing thresholds on the decision function used to compute fpr and tpr
        thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1

  – Typical plot of an ROC curve:

    ∗ The sensitivity (TPR) axis is vertical; the false positive rate (1 − specificity) is horizontal
    ∗ As recall increases, so do the false positives
    ∗ The dotted line represents the curve of a purely random classifier
    ∗ A good classifier wants to be in the top left corner of the plot
  – ROC curves can be used to compare classifiers by comparing the areas under the curves (AUCs)
    ∗ Perfect classifiers have AUC = 1; purely random ones have AUC = 0.5
    ∗ Use function roc_auc_score() for this:
      sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
      · Parameters:
        average (string)
          None: Scores for each class are returned
          'micro': Calculate metrics globally by considering each element of the label indicator matrix as a label
          'macro': Calculate metrics for each label, and find their unweighted mean; this does not take label imbalance into account
          'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label)
          'samples': Calculate metrics for each instance, and find their average

        sample_weight (array-like)
        max_fpr (0 < float ≤ 1)
          The standardized partial AUC over the range [0, max_fpr] is returned
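    ∗ A minimal example of both functions; the labels and scores are toy values adapted from the scikit-learn API docs:

      from sklearn.metrics import roc_curve, roc_auc_score

      y_true = [0, 0, 1, 1]
      y_scores = [0.1, 0.4, 0.35, 0.8]   # e.g., decision_function output

      fpr, tpr, thresholds = roc_curve(y_true, y_scores)
      print(fpr, tpr, thresholds)             # points tracing the ROC curve
      print(roc_auc_score(y_true, y_scores))  # area under the curve: 0.75 here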

Scikit-Learn: Multiclass Classification

• Multiclass (multinomial) classifiers distinguish among more than two classes
• Some learning models can perform both binary and multiclass classification, while others are dedicated binary classifiers
• Dedicated binary classifiers can be used for multiclass classification:
  – In the one-versus-all (also called one-versus-the-rest) strategy, train one model per possible output value
    ∗ For n possible classifications, this will generate n classifiers
    ∗ Test each sample on each classifier
    ∗ Output the value that receives the highest score
  – In the one-versus-one strategy, train the model on all possible pairs of output values
    ∗ For n possible classifications, this requires n(n − 1)/2 classifiers
    ∗ Test each sample on every classifier
    ∗ Output the value that wins the greatest number of pairings
    ∗ The savings come from the fact that each classifier only needs to be trained on the samples for the two values in the pairing (unlike the OvA approach, which requires training on the entire training set)
  – In general, OvA is preferred
• Scikit-Learn detects when you are using a binary classification model for multiclass classification and automatically applies OvA (except for SVM, where it uses OvO)
• Scikit provides OvO and OvR classifiers:
  – OvO: sklearn.multiclass.OneVsOneClassifier(estimator, ...)
    ∗ Methods: As usual, plus
      · decision_function(X): Returns the distance of each sample from the decision boundary for each class
    ∗ Attributes:
      · estimators_: Estimators used for predictions (list of n(n − 1)/2)
      · classes_: Array of labels

  – OvR: sklearn.multiclass.OneVsRestClassifier(estimator, ...)
    ∗ Methods: As above
    ∗ Attributes: As above, plus
      · label_binarizer_: Object used to transform multiclass labels to binary labels and vice-versa
      · multilabel_: Whether this is a multilabel classifier
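  – A minimal sketch forcing the OvR strategy explicitly; the iris data and the SGDClassifier base estimator are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_iris(return_X_y=True)
    ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
    ovr_clf.fit(X, y)

    print(len(ovr_clf.estimators_))  # one binary classifier per class: 3
    print(ovr_clf.multilabel_)       # False: plain multiclass, not multilabel
    print(ovr_clf.predict(X[:5]))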

Scikit-Learn: Multilabel Classification

• To be included with instance-based learning
