Scikit-Learn: Classifiers - Binary

• Binary classification

  from sklearn.linear_model import SGDClassifier

SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, l1_ratio=0.15,
fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0,
epsilon=0.1, n_jobs=None, random_state=None, learning_rate='optimal',
eta0=0.0, power_t=0.5, early_stopping=False, validation_fraction=0.1,
n_iter_no_change=5, class_weight=None, warm_start=False, average=False,
n_iter=None)

– This implements linear classifiers (i.e., SVM, logistic regression, a.o.)
– For best results using the default learning rate schedule, the data should have zero mean and unit variance
– Expects floating point values for the features
– Parameters:
  ∗ loss (string)
    · The loss function used to calculate error (e.g., squared loss)
    · Determines the model used:
      'hinge': linear SVM
      'log': logistic regression
      'modified_huber'
      'squared_hinge'
      'squared_loss'
      'epsilon_insensitive'
      'squared_epsilon_insensitive'
  ∗ penalty (string)
    · Type of regularization
  ∗ alpha (float)
    · Regularization term multiplier
    · Also affects the learning rate when learning_rate is set to 'optimal'
  ∗ l1_ratio (float)
    · Elastic Net mixing parameter
  ∗ fit_intercept (bool)
    · Whether the intercept should be estimated or not
    · If False, the data is assumed to be already centered

  ∗ max_iter (int)
    · Max number of passes over the training data
  ∗ tol (float)
    · Stopping criterion

    · If not None, the iterations will stop when (loss > previous_loss − tol)
  ∗ shuffle (bool)
    · Whether or not the training data should be shuffled after each epoch
  ∗ epsilon (float)
    · ε value in the loss functions for 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'
  ∗ eta0 (double)
    · Initial learning rate
  ∗ learning_rate (string)
    · Learning rate schedule:
      'constant': η = eta0
      'optimal': η = 1.0/(α(t + t0)), where t0 is chosen by a heuristic
      'invscaling': η = eta0/pow(t, power_t)
      'adaptive': η = eta0 as long as the training loss keeps decreasing
  ∗ power_t (double)
    · Exponent for the 'invscaling' learning rate schedule
  ∗ early_stopping (bool)
    · If True, automatically sets aside a fraction of training data as validation and terminates training when the validation score is not improving by at least tol for n_iter_no_change consecutive epochs
  ∗ validation_fraction (float)
    · Proportion of training data to set aside as validation set for early stopping

  ∗ n_iter_no_change (int)
    · Number of iterations with no improvement to wait before early stopping
  ∗ average (bool or int)
    · When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute
    · If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average
  ∗ n_iter (int)
    · Number of passes over the training data (deprecated)
– Attributes:
  ∗ coef_ (array)
    · Weights assigned to the features
  ∗ intercept_ (array)
    · Constants in decision function
  ∗ n_iter_ (int)
    · Actual number of iterations run
– Methods (beyond fit() and predict()):
  ∗ decision_function(X)
    · Predict confidence scores for samples
  ∗ partial_fit(X, y, classes=None, sample_weight=None)
    · Perform one epoch of stochastic gradient descent on the given samples
  ∗ score(X, y, sample_weight=None)
    · Returns the mean accuracy on the given test data and labels
  ∗ predict_proba(X)
    · Probability estimates
    · Only available for log loss and modified Huber loss
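– A minimal usage sketch of the estimator described above; the toy dataset, the scaling step, and the hyperparameter values are illustrative assumptions, not from the notes:

  from sklearn.datasets import make_classification
  from sklearn.linear_model import SGDClassifier
  from sklearn.preprocessing import StandardScaler

  # Toy binary problem; per the note above, SGD works best on
  # standardized features (zero mean, unit variance)
  X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
  X = StandardScaler().fit_transform(X)

  # loss='hinge' gives a linear SVM; loss='log' would give logistic regression
  sgd_clf = SGDClassifier(loss='hinge', penalty='l2', max_iter=1000, tol=1e-3,
                          random_state=42)
  sgd_clf.fit(X, y)

  print(sgd_clf.predict(X[:5]))            # hard class predictions
  print(sgd_clf.decision_function(X[:5]))  # signed scores from the linear model
  print(sgd_clf.score(X, y))               # mean accuracy on (X, y)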

Scikit-Learn: Classifiers - Multiclass and Multilabel

1. Note: All classifiers in scikit-learn do multiclass classification out-of-the-box
   • Use module sklearn.multiclass if you want to experiment with different multiclass strategies
2. Multiclass classification:
   • Classification task with more than two classes
   • Assumes that each sample is assigned to one and only one label
3. Multilabel classification:
   • Each sample is assigned a set of target labels
   • For when labels are not mutually exclusive
4. Multioutput regression:
   • Each sample is assigned a set of target values
5. Multioutput-multiclass classification and multi-task classification:
   • A single estimator has to handle several joint classification tasks
6. Classes:
   (a) One-vs-One multiclass classification

       from sklearn.multiclass import OneVsOneClassifier

OneVsOneClassifier(estimator, n_jobs=None)

• Parameters:
  – Self-evident
• Methods:
  – See above
• Attributes:
  – estimators_
  – classes_
• See also OneVsRestClassifier, a multiclass/multilabel classifier
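• A minimal usage sketch; the iris data and the SGDClassifier base estimator are illustrative choices, not from the notes:

  from sklearn.datasets import load_iris
  from sklearn.linear_model import SGDClassifier
  from sklearn.multiclass import OneVsOneClassifier

  # Wrap a binary classifier to get one-vs-one multiclass behavior
  X, y = load_iris(return_X_y=True)
  ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
  ovo_clf.fit(X, y)

  print(len(ovo_clf.estimators_))  # n(n-1)/2 pairwise classifiers: 3 for 3 classes
  print(ovo_clf.predict(X[:5]))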

(b) Multilabel classification

    from sklearn.neighbors import KNeighborsClassifier

KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto',
leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None,
**kwargs)

• Parameters:
  – n_neighbors (int)
    ∗ Number of neighbors to use by default for kneighbors queries
  – weights (string, callable)
    ∗ Weight function used in prediction:
      'uniform': all points in each neighborhood are weighted equally
      'distance': weight points by the inverse of their distance, so closer neighbors have a greater influence than farther ones
      callable: a user-defined function
  – algorithm (string)
    ∗ Algorithm used to compute the nearest neighbors:
      'ball_tree': ball tree
      'kd_tree': k-d tree
      'brute': brute-force search
      'auto': decides the most appropriate algorithm based on the values passed to the fit method
  – leaf_size (int)
    ∗ Leaf size passed to the two tree algorithms
  – p (int)
    ∗ Power parameter for the Minkowski metric
  – metric (string, callable)
    ∗ The distance metric to use for the tree
  – metric_params (dictionary)
    ∗ Additional keyword arguments for the metric function

• Methods:
  – kneighbors(X=None, n_neighbors=None, return_distance=True)
    ∗ Finds the K-neighbors of a point
    ∗ Returns indices of and distances to the neighbors of each point
  – kneighbors_graph(X=None, n_neighbors=None, mode='connectivity')
    ∗ Computes the (weighted) graph of k-Neighbors for points in X
    ∗ Returns a sparse matrix in CSR format, shape = [n_samples, n_samples_fit]
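• A small multilabel sketch, per the note above that KNeighborsClassifier handles a 2-D label array; the toy features and label pairs are invented for illustration:

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  # Each sample carries two non-exclusive boolean labels
  X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
  y_multilabel = np.array([[1, 0], [1, 0], [1, 1],
                           [0, 1], [0, 1], [1, 1]])

  knn_clf = KNeighborsClassifier(n_neighbors=3)
  knn_clf.fit(X, y_multilabel)

  print(knn_clf.predict([[1.5]]))          # a predicted label pair, e.g. [[1 0]]
  dist, idx = knn_clf.kneighbors([[1.5]])  # distances to and indices of the 3 nearest
  print(dist, idx)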

Scikit-Learn: Binary Classification

• References: Geron C3
• MNIST corpus
  – Handwritten samples of digits
  – Already divided into training and test sets
  – Each image is 28 × 28 pixels
    ∗ Results in 784 features (one for each pixel)
    ∗ Values range from 0 (white) to 255 (black)
• Recommended to shuffle the training set before using, especially as numbers are listed in order in MNIST
• SGDClassifier: See Learning Models in Scikit notes
• For more control over cross-validation than Scikit models provide, do it manually (see StratifiedKFold in Cross-Validation in Scikit notes):

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

# Assumes X_train, y_train_5 (the "is this digit a 5?" labels)
# and sgd_clf are already defined
skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)           # fresh, unfitted copy per fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))       # fold accuracy

– The above is equivalent to cross_val_score() (see the one-liner after this list)
• Note that the accuracy is potentially a false high
  – MNIST is skewed: only about 10% of the samples are a given digit, so when binary-testing "is digit x a y or not?", a classifier that always answers "not" is already about 90% accurate
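• The equivalent single call, assuming the same sgd_clf, X_train, and y_train_5 as above:

  from sklearn.model_selection import cross_val_score

  # Same evaluation as the manual loop, in one call
  print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))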

Scikit-Learn: Binary Classification - Tuning

• Rather than evaluate on accuracy, use the confusion matrix
  – A confusion matrix is a special type of contingency table that illustrates how well a classifier performs
  – So called because it helps to determine whether the classifier is confusing two classes
  – For example:

                prediction
                 A    B
  category  A    8    2
            B    6    4

OR

                prediction
                 A    B    C
  category  A    5    3    0
            B    2    3    1
            C    0    2   11

  – A table of confusion (also called a confusion matrix) illustrates how well a classifier performs by showing true positives, false positives, true negatives, and false negatives
    ∗ In the first example above:

                prediction
                 A       not A
  category  A    8 TP    2 FN
        not A    6 FP    4 TN

    ∗ And for the second example above:

                prediction
                 A       not A
  category  A    5 TP    3 FN
        not A    2 FP   17 TN

  – To get the predictions, use cross_val_predict() (see Cross-Validation in Scikit notes)
  – Then use the function confusion_matrix():

    from sklearn.metrics import confusion_matrix

    ∗ confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
    ∗ Parameters:
      · y_true: Correct target values
      · y_pred: Estimated target values
      · labels: List of labels to index the matrix
        If omitted, values that appear in y_true or y_pred are used (in sorted order)
        Labels can be used to select a subset
      · sample_weight: Sample weights
    ∗ Returns an array
    ∗ Examples (from API):

      >>> from sklearn.metrics import confusion_matrix
      >>> y_true = [2, 0, 2, 2, 0, 1]
      >>> y_pred = [0, 0, 2, 2, 0, 2]
      >>> confusion_matrix(y_true, y_pred)
      array([[2, 0, 0],
             [0, 0, 1],
             [1, 0, 2]])

>>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"] >>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"] >>> confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"]) array([[2, 0, 0], [0, 0, 1], [1, 0, 2]])

      >>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1],
      ...                                   [1, 1, 1, 0]).ravel()
      >>> (tn, fp, fn, tp)
      (0, 2, 1, 1)

      · Note: The results are represented as in the above examples: the predicted values are the columns, the actual values the rows
        So in the first example, array[0, 0] indicates that 2 zeroes were correctly labeled as zeroes, array[1, 2] that one one was mislabeled as a two, and array[2, 0] that one two was mislabeled as a zero
• Another set of metrics to look at are the precision and recall of the classifier
  – Precision:
    ∗ Measures the accuracy of the positive predictions

      precision = true positives / (true positives + false positives)

  – Recall (sensitivity or true positive rate):
    ∗ Measures the fraction of the samples of a class that are correctly classified

      recall = true positives / (true positives + false negatives)

  – precision_score():
    sklearn.metrics.precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    ∗ Parameters:
      · y_true, y_pred, labels: Already discussed
      · average (string)
        Required for multiclass/multilabel targets
        Values:
          None: Scores for each class are returned
          'binary': Only report results for the class specified by pos_label
            Applicable only if targets (y_{true, pred}) are binary
          'micro': Calculate metrics globally by counting the total true positives, false negatives, and false positives
          'macro': Calculate metrics for each label, and find their unweighted mean
            This does not take label imbalance into account
          'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label)
            This alters 'macro' to account for label imbalance

          'samples': Calculate metrics for each instance, and find their average
            Only meaningful for multilabel classification, where this differs from accuracy_score
      · Returns the precision of the positive class in binary classification, or the weighted average of the precision of each class for the multiclass task
  – recall_score():
    sklearn.metrics.recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    ∗ Parameters: As above
    ∗ Best value is one, worst is zero

  – The F1 score combines precision and recall into a single score
    ∗ The F1 score is the harmonic mean of precision and recall:

      F1 = 2 / (1/precision + 1/recall)
         = 2 × (precision × recall) / (precision + recall)
         = TP / (TP + (FN + FP)/2)

    ∗ The harmonic mean gives more weight to low values
    ∗ Therefore, both precision and recall must be high to get a high F1 score
    ∗ f1_score function:
      · sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)
    ∗ Sometimes you want precision, other times recall
      · Increasing one decreases the other and vice-versa
      · This is called the precision/recall tradeoff
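  – A small worked example of the three metrics; the labels are invented for illustration:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # TP=3, FP=1, FN=1, TN=3

    print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
    print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
    print(f1_score(y_true, y_pred))         # harmonic mean = 0.75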


• Precision/recall tradeoff
  – Consider the following situation (Geron p90; the figure there shows digit images ordered by classifier score, with several candidate threshold positions marked by arrows)

    ∗ The SGDClassifier makes decisions based on a decision function: if the function's score is greater than some threshold, the sample is considered positive
    ∗ Consider the central arrow as the current threshold
      · Four 5's are classified correctly, and one six incorrectly, so precision = 0.8
      · Since two 5's are misclassified, recall = 0.67
    ∗ By moving the threshold to the right, precision increases to 1.0, but recall falls to 0.5
    ∗ By moving the threshold to the left, precision falls to 0.75, but recall increases to 1.0
  – This can be implemented in scikit manually, using a classifier's decision_function() method instead of its predict() method
    ∗ This returns a score for each sample
    ∗ You can then make your own predictions based on the returned scores
  – Deciding on a threshold
    ∗ Apply cross_val_predict(), but use 'decision_function' for the method parameter
    ∗ This returns the decision scores rather than the actual predictions
    ∗ Then apply the function precision_recall_curve()

  – sklearn.metrics.precision_recall_curve(y_true, probas_pred, pos_label=None, sample_weight=None)
    ∗ Computes precision-recall pairs for all possible thresholds
    ∗ Restricted to binary classification
    ∗ Parameters:
      · y_true (array)
        Targets of binary classification in range {−1, 1} or {0, 1}
      · probas_pred (array)
        Estimated probabilities or decision function output

      · pos_label (int, string)
        Label of the positive class
      · sample_weight (array-like)
        Sample weights
    ∗ Returns:
      · precision (array)
        Precision values such that element i is the precision of predictions with score ≥ thresholds[i]; the last element is 1
      · recall (array)
        Decreasing recall values such that element i is the recall of predictions with score ≥ thresholds[i]; the last element is 0
      · thresholds (array)
        Increasing thresholds on the decision function used to compute precision and recall
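  – A sketch of the threshold-selection workflow just described, assuming the MNIST names (sgd_clf, X_train, y_train_5) from the earlier snippets; the 90% precision target is an arbitrary illustration:

    import numpy as np
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import cross_val_predict

    # Get decision scores instead of hard predictions
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # Lowest threshold reaching at least 90% precision
    # (precisions has one more element than thresholds, hence [:-1])
    threshold_90 = thresholds[np.argmax(precisions[:-1] >= 0.90)]

    # Predict manually with the chosen threshold instead of calling predict()
    y_pred_90 = (y_scores >= threshold_90)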

    ∗ Typical plot (Geron p91): precision and recall as functions of the decision threshold

    ∗ Select the threshold with the best precision/recall tradeoff
  – You can also plot precision directly against recall and select the threshold right before the curve nose-dives
    ∗ Typical plot (Geron p92): precision against recall


• The ROC curve
  – The receiver operating characteristic curve plots the true positive rate against the false positive rate, or sensitivity against (1 − specificity), for binary classifications
    ∗ TPR = recall, or sensitivity

      TPR = TP/P = TP/(TP + FN)

    ∗ TNR is the true negative rate, or specificity, the ratio of negative instances correctly classified as negative

      TNR = TN/N = TN/(TN + FP) = 1 − FPR

    ∗ FPR is the ratio of negative instances incorrectly classified as positive

      FPR = FP/N = FP/(FP + TN) = 1 − TNR

  – Use function roc_curve():
    sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
    ∗ Parameters:
      · drop_intermediate
        Whether to drop some suboptimal thresholds which would not appear on a plotted ROC curve
    ∗ Returns:
      · fpr (array)
        Increasing false positive rates such that element i is the false positive rate of predictions with score ≥ thresholds[i]
      · tpr (array)
        Increasing true positive rates such that element i is the true positive rate of predictions with score ≥ thresholds[i]
      · thresholds (array)
        Decreasing thresholds on the decision function used to compute fpr and tpr
        thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1

  – Typical plot of an ROC curve:

    ∗ The sensitivity (TPR) axis is vertical; the false positive rate (1 − specificity) is horizontal
    ∗ As recall increases, so do the false positives
    ∗ The dotted line represents the curve of a purely random classifier
    ∗ A good classifier wants to be in the top left corner of the plot
  – ROC curves can be used to compare classifiers by comparing the areas under the curves (AUCs)
    ∗ Perfect classifiers have AUC = 1; purely random ones have AUC = 0.5
    ∗ Use function roc_auc_score() for this:
      sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
      · Parameters:
        average (string)
          None: Scores for each class are returned
          'micro': Calculate metrics globally by considering each element of the label indicator matrix as a label
          'macro': Calculate metrics for each label, and find their unweighted mean; this does not take label imbalance into account
          'weighted': Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label)
          'samples': Calculate metrics for each instance, and find their average

        sample_weight (array-like)
        max_fpr (0 < float ≤ 1)
          The standardized partial AUC over the range [0, max_fpr] is returned
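    ∗ A minimal example of both functions; the labels and scores are toy values adapted from the scikit-learn API docs:

      from sklearn.metrics import roc_curve, roc_auc_score

      y_true = [0, 0, 1, 1]
      y_scores = [0.1, 0.4, 0.35, 0.8]   # e.g., decision_function output

      fpr, tpr, thresholds = roc_curve(y_true, y_scores)
      print(fpr, tpr, thresholds)             # points tracing the ROC curve
      print(roc_auc_score(y_true, y_scores))  # area under the curve: 0.75 here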

Scikit-Learn: Multiclass Classification

• Multiclass (multinomial) classifiers distinguish among more than two classes
• Some learning models can perform both binary and multiclass classification, while others are dedicated binary classifiers
• Dedicated binary classifiers can be used for multiclass classification:
  – In the one-versus-all (also called one-versus-the-rest) strategy, train one model per possible output value
    ∗ For n possible classifications, this will generate n classifiers
    ∗ Test each sample on each classifier
    ∗ Output the value that receives the highest score
  – In the one-versus-one strategy, train the model on all possible pairs of output values
    ∗ For n possible classifications, this requires n(n − 1)/2 classifiers
    ∗ Test each sample on every classifier
    ∗ Output the value that wins the greatest number of pairings
    ∗ The savings come from the fact that each classifier only needs to be trained on the samples for the two values in the pairing (unlike the OvA approach, which requires training on the entire training set)
  – In general, OvA is preferred
• Scikit-Learn detects when you are using a binary classification model for multiclass classification and automatically applies OvA (except for SVM, where it uses OvO)
• Scikit provides OvO and OvR classifiers:
  – OvO: sklearn.multiclass.OneVsOneClassifier(estimator, ...)
    ∗ Methods: As usual, plus
      · decision_function(X): Returns the distance of each sample from the decision boundary for each class
    ∗ Attributes:
      · estimators_: Estimators used for predictions (list of n(n − 1)/2)
      · classes_: Array of labels

  – OvR: sklearn.multiclass.OneVsRestClassifier(estimator, ...)
    ∗ Methods: As above
    ∗ Attributes: As above, plus
      · label_binarizer_: Object used to transform multiclass labels to binary labels and vice-versa
      · multilabel_: Whether this is a multilabel classifier
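  – A minimal sketch forcing the OvR strategy explicitly; the iris data and the SGDClassifier base estimator are illustrative choices:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_iris(return_X_y=True)
    ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
    ovr_clf.fit(X, y)

    print(len(ovr_clf.estimators_))  # one binary classifier per class: 3
    print(ovr_clf.multilabel_)       # False: plain multiclass, not multilabel
    print(ovr_clf.predict(X[:5]))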

Scikit-Learn: Multilabel Classification

• To be included with instance-based learning
