
Preprocessing and Dimensionality Reduction

Jérémy Fix

CentraleSupélec, jeremy.fi[email protected]

2019


Where to get data You need datasets

You can use open datasets, for example to experiment with a new ML algorithm:
• UCI ML Repository: http://archive.ics.uci.edu/ml/
• Kaggle competitions, e.g. https://www.kaggle.com/c/diabetic-retinopathy-detection
• specific well-known datasets for specific ML problems


Where to get data Some available datasets

Face expression classification 48x48 pixel grayscale images of faces 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral 28K Train; 3K for public test, another 3K for final test.

Kaggle, ICML 2013


Where to get data Some available datasets

Object localization/detection. PascalVOC2012: 20 classes, 20000 train images, 20000 validation, 11000 test. Avg. image size: 469×387 pixels, RGB.

Classes : person/bird, cat, cow, dog, horse, sheep/aeroplane, bicycle, boat, bus, car, motorbike, train/bottle, chair, dining table, potted plant, sofa, tv-monitor

http://host.robots.ox.ac.uk/pascal/VOC/


Where to get data Some available datasets

Object localization/detection ImageNet, ILSVRC2014: 1000 classes, 1.2M Train images, 50K Valid, 100K Test Avg image size : 482x415 pixels, RGB

ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al. (2015)


Where to get data Some available datasets

Object localization/detection

Open Images Dataset: https://github.com/openimages/dataset
• 9M automatically labelled images, 4M human validated
• ≈ 80M bounding boxes, 6000 classes
• both meta labels (e.g. vehicle) and fine-grained labels (e.g. honda nsx)


Where to get data Some available datasets

Object segmentation COCO 2017: 200K images, 80 classes, 500K masks

http://cocodataset.org/


Where to get data Some available datasets

Recommendation systems: MovieLens, Netflix Prize, Anime Recommendations Database. MovieLens 20M:
• 27K movies rated by 138K users
• 5-star ratings with 1/2 increments (0.0, 0.5, ...)
• 20M ratings
• metadata (e.g. genre)
• links to IMDb to enrich the metadata
https://grouplens.org/datasets/movielens/


Where to get data Some available datasets

Automatic speech recognition: TIMIT, VoxForge, ... TIMIT:
• 630 speakers, eight American English dialects
• time-aligned orthographic, phonetic and word transcriptions
• 16 kHz speech waveform file for each utterance
https://catalog.ldc.upenn.edu/ldc93s1


Where to get data Some available datasets
Sentiment analysis: Large Movie Review Dataset (IMDB)
• 25K reviews for training, 25K reviews for testing
• movie reviews (sentences), with a rating in [1, 10]
• aim: are the reviews on a given product positive or negative?
Maas (2011), Learning Word Vectors for Sentiment Analysis

Automatic translation: dataset from the European Parliament (Europarl)
• single-language datasets (language modelling)
• parallel corpora (translation), e.g. French-English (2M sentences), Czech-English (650K sentences), ...

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005

Make your own dataset You need datasets

You have a specific problem: you may need to collect data on your own.
• crawl the web? (e.g. Twitter API, ...)
• if supervised learning: assign labels (Mechanical Turk, domain experts (e.g. classifying tumors))
• ensure you collected sufficient features


We need vectors, appropriately scaled, without missing values

Preprocessing


We need vectors, appropriately scaled, without missing values Preprocessing data

Data are not necessarily vectorial

• Ordinal or categorical: poor/fair/excellent; Male/Female
• Text documents: bag of words / word embeddings

Even if vectorial

• Missing data: check how missing values are indicated (-9, ' ', ...) → imputation of missing values


We need vectors, appropriately scaled, without missing values Your data might not be vectorial data

Ordinal and categorical features: ordinal values have an order.

Ordinal feature value: poor / fair / excellent
Numerical feature value: -1 / 0 / 1

Categorical values do not have an order (use a one-hot encoding):

Categorical value: American / Spanish / German / French
Numerical value: [1, 0, 0, 0] / [0, 1, 0, 0] / [0, 0, 1, 0] / [0, 0, 0, 1]
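As a rough illustration, here is how these two encodings might be done with scikit-learn; the feature values mirror the tables above, while the exact numeric codes produced by OrdinalEncoder (0/1/2 instead of -1/0/1) only need to preserve the order:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: the order of the categories matters and is given explicitly.
quality = np.array([["poor"], ["fair"], ["excellent"], ["fair"]])
ordinal = OrdinalEncoder(categories=[["poor", "fair", "excellent"]])
print(ordinal.fit_transform(quality).ravel())          # [0. 1. 2. 1.]

# Categorical feature: no order, so one-hot encode instead.
nationality = np.array([["American"], ["Spanish"], ["German"], ["French"]])
onehot = OneHotEncoder()
print(onehot.fit_transform(nationality).toarray())     # one column per category
```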


We need vectors, appropriately scaled, without missing values Your data might not be vectorial data

Vectorial representation of text documents Bag Of Words define a vocabulary , = n • V |V| for each document, build a vector x so that x is the • i frequency of the word i V e.g. = I , in, love, metz, machinelearning, study V { } I love and love metz too. x = [1, 0, 2, 1, 1, 0] → I love studying machine learning in Metz. x = [1, 1, 1, 1, 1, 1] Does not take the order into account N→ gram, but this leads to sparser representations → −


We need vectors, appropriately scaled, without missing values Your data might not be vectorial data
Vectorial representation of text documents: word/sentence embeddings (e.g. word2vec, GloVe, fastText). Continuous Bag of Words (CBOW): predict a word given its context.

• input and output coded as one-hot vectors
• predict a word given its context
• hidden layer: the word representation
Captures some semantic information. For sentences: tweet2vec, sentence2vec, word vector averaging. See also: Bayesian approaches (e.g. Latent Dirichlet Allocation).

Pennington (2014), GloVe: Global Vectors for Word Representation; Mikolov (2013), Efficient Estimation of Word Representations in Vector Space; https://fasttext.cc/

We need vectors, appropriately scaled, without missing values Some features might be missing Missing features

• completely drop the samples with missing attributes, or the features that have missing values,
• or try to impute, i.e. set a value in place of the missing attributes.
For missing value imputation, there are plenty of methods (a sketch follows below):
• global: assign the mean, median or most frequent value of the attribute
• local: based on the k nearest neighbors, decide which value to impute
The bias you may introduce by imputing a value may depend on the causes of the missing values, see [Silva(2014)].

Silva (2014). A brief review of the main approaches for treatment of missing data.
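A sketch of the global and local imputation strategies listed above, using scikit-learn's imputers; the toy matrix and the choice of k = 2 neighbours are purely illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 5.0]])

# Global strategy: replace a missing value by the column mean (or median / most_frequent).
print(SimpleImputer(strategy="mean").fit_transform(X))

# Local strategy: replace it by the average of the k nearest neighbours' values.
print(KNNImputer(n_neighbors=2).fit_transform(X))

# If missing values are encoded by a sentinel such as -9, declare it explicitly:
imp = SimpleImputer(missing_values=-9, strategy="median")
```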

We need vectors, appropriately scaled, without missing values Some vectorial data might not be appropriately scaled

Feature scaling

• dimensions with the largest variations will dominate Euclidean distances (e.g. nearest neighbors)
• when gradient descent is involved, feature scaling makes convergence faster (the loss landscape is closer to circularly symmetric)
• when regularization is involved, we would like to use a single regularization coefficient, independent of the scale of the features


We need vectors, appropriately scaled, without missing values Some vectorial data might not be appropriately scaled
Feature scaling: given $x_i \in \mathbb{R}^d$, you can normalize by:
• min/max scaling:
$$\forall i, \forall j \in [0, d-1],\quad x'_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}}$$
• z-score normalization:
$$\forall i, \forall j \in [0, d-1],\quad x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{N}\sum_k x_{k,j}, \qquad \sigma_j = \sqrt{\frac{1}{N}\sum_k (x_{k,j} - \mu_j)^2}$$

Your statistics must be computed from the training set and applied as-is to the test data (see the sketch below).
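A sketch of both normalizations with scikit-learn, fitting the statistics on the training split only and reusing them on the test split (the dataset and split are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# min/max scaling: the per-column min and max are estimated on the training set
minmax = MinMaxScaler().fit(X_train)
X_train_mm, X_test_mm = minmax.transform(X_train), minmax.transform(X_test)

# z-score normalization: same principle with the per-column mean and std
zscore = StandardScaler().fit(X_train)
X_train_z, X_test_z = zscore.transform(X_train), zscore.transform(X_test)
```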

Dimensionality reduction


Dimensionality reduction : what/why/how ?
What: optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ with $d \ll n$. It remains to define what "optimally transform" means.

Why

• visualization of the data
• interpretability of the predictor
• speed up of the algorithms whose complexity depends on n
• data may occupy a manifold of lower dimensionality than n
• curse of dimensionality: data quickly become sparse, models may overfit


Dimensionality reduction: why ? / Visualization
How are your data distributed? How intertwined are your classes? Do we have discriminative features?

[Figure: t-SNE embedding of MNIST, van der Maaten et al.]

Dimensionality reduction: why ?

Interpretability of the predictor e.g. Why does this predictor say the tumor is malignant ?

Real risk = 0.92 ± 0.06 and Real risk = 0.92 ± 0.05.
UCI ML Breast Cancer Wisconsin (Diagnostic) dataset; real risk estimated by 10-fold CV.


Dimensionality reduction: why ?

Speed up of the algorithms: decreasing the dimensionality decreases training/inference times. For example:
• linear regression: $\hat{y} = \theta^T x + b$
• logistic regression (classification): $P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$
Both training and inference are in $O(n)$, $x \in \mathbb{R}^n$.


Dimensionality reduction: why ?

The data may occupy a lower dimensional manifold

Swiss roll → you do not necessarily lose information by reducing the number of dimensions



Dimensionality reduction: why ?

The data may occupy a lower dimensional manifold. You want to classify facial expressions of a single person, under controlled illumination:
• suppose a huge image resolution, e.g. 1024 × 1024 RGB pixels, $x \in \mathbb{R}^{1024 \times 1024 \times 3}$
• what is the dimensionality of the data manifold? ≈ 50
→ you do not necessarily lose information by reducing the number of dimensions


Dimensionality reduction: why ?

You may even have better predictors: curse of dimensionality. The data become (exponentially) quickly sparse with respect to the number of dimensions.

Image from [Goodfellow, Bengio, Courville (2016): Deep Learning]. See also [Hastie et al. (2017), The Elements of Statistical Learning].


Dimensionality reduction : what/why/how ?

What: optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ with $d \ll n$. It remains to define what "optimally transform" means.

How

• select a subset of the original features: feature selection
• compute new features from the original ones: feature extraction


Feature selection

Feature selection

Select a subset of the original features/attributes/dimensions: $x_i \in \mathbb{R}^n \rightarrow z \in \mathbb{R}^d$


Feature selection Feature selection

Overview

• Embedded: the ML algorithm is designed to select a subset of the features, e.g. linear regression with an L1 penalty
• Filters: dimensions are selected based on a heuristic
• Wrappers: dimensions are selected based on an estimation of the real risk

⇒ Notebook "Feature selection.ipynb"


Feature selection Feature selection: embedded

Embedded: the loss to minimize embeds a penalty promoting sparsity.
Least Absolute Shrinkage and Selection Operator (LASSO): given a regression problem $(x_i, y_i)$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, optimize w.r.t. $\theta$:

$$\frac{1}{N} \sum_{i=0}^{N-1} (y_i - \theta^T x_i)^2 + \lambda \|\theta\|_1 \qquad (1)$$

Linear regression with an L1 penalty; the L1 penalty promotes sparse predictors (a scikit-learn sketch follows below).

Tibshirani (1996). Regression shrinkage and Selection via the Lasso
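A possible scikit-learn sketch of LASSO used as an embedded selector; the dataset and the regularization strength alpha are illustrative, and alpha would normally be tuned by cross-validation (note that sklearn's Lasso uses a 1/(2N) factor rather than the 1/N of Eq. (1)):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Binary labels used as a regression target, purely for illustration.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)    # the L1 penalty is sensitive to feature scales

lasso = Lasso(alpha=0.05).fit(X, y)      # linear regression + L1 penalty
selected = np.flatnonzero(lasso.coef_)   # features with a nonzero coefficient survive
print(f"{len(selected)}/{X.shape[1]} features kept:", selected)
```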


Feature selection Feature selection: embedded LASSO example

$N = 30$ points, $y_i = 0.5 + 0.4 \sin(2\pi x_i) + \mathcal{N}(0, 0.01)$; 30 RBF features + a constant term:
$$\phi(x) = \Big[1,\ e^{-\frac{(x_0 - x)^2}{2\sigma^2}},\ \ldots,\ e^{-\frac{(x_{N-1} - x)^2}{2\sigma^2}}\Big]$$

[Figure: fitted curves (samples, true, lreg, lreg_l1) and the fitted parameters; with the L1 penalty, roughly 20%-33% of the features are selected]

Feature selection Feature selection: embedded Decision tree example

Decision Tree with gini impurity, max depth=2, 10-fold CV (0.92) UCI ML Breast Cancer Wisconsin dataset. 569 samples, binary classif, 30 continuous features.


Feature selection Feature selection: univariate filters
Principle: measure the correlation/dependency between each input feature, considered independently, and the target; e.g. chi-2, ANOVA test of independence, Pearson correlation, ...
Example (continuous feature → discrete target): ANOVA F-values on the breast cancer dataset.

[Figure: class-conditional distributions P(x14 | y) (lowest F-value) and P(x27 | y) (highest F-value)]
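A sketch of such a univariate ANOVA filter with scikit-learn, scoring each feature of the breast cancer dataset by its F-value and keeping the top k (k = 5 is arbitrary):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)   # ANOVA F-test per feature
print("lowest / highest F-value feature:",
      np.argmin(selector.scores_), np.argmax(selector.scores_))
X_reduced = selector.transform(X)        # keeps only the 5 top-scoring columns
```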

Feature selection Feature selection: multivariate filters and wrappers
Overview: denote $\chi$ a subset of the dimensions/attributes/features.
• suppose we are provided a measure $J(\chi)$ of how good this subset is
• we optimize $J(\chi)$ over the possible subsets $\chi$
If $x \in \mathbb{R}^n$, we have $2^n$ possible subsets: $\chi \in \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\}$

http://featureselection.asu.edu/ : algorithms and datasets, Python package scikit-feature. John et al. (1994), Irrelevant Features and the Subset Selection Problem.

Feature selection Feature selection: optimizing J(χ) Tree search

Number of subsets at each level of the search tree:
• $\emptyset$ : 1
• $\{x_0\}, \{x_1\}, \cdots, \{x_{d-1}\}$ : $d$
• $\{x_0, x_1\}, \{x_0, x_2\}, \cdots, \{x_{d-2}, x_{d-1}\}$ : $d(d-1)/2$
• subsets of size $k$ : $\frac{d!}{k!(d-k)!}$
• the full set $X_d$ : 1
Heuristics: Sequential Forward Search (start from $\emptyset$ and add features) / Sequential Backward Search (start from the full set and remove features).

If you allow undoing steps: "Sequential Floating Forward Search" / "Sequential Floating Backward Search".

Variants and extensions: Somol et al. (2010), Efficient Feature Subset Selection and Subset Size Optimization.
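A sketch of a greedy sequential forward search using scikit-learn's SequentialFeatureSelector; here the quality of a subset is a cross-validated score, so this is really a wrapper-style instantiation of the search, and the estimator and n_features_to_select are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Greedily add, one at a time, the feature that most improves the CV score.
sfs = SequentialFeatureSelector(clf, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("Selected features:", sfs.get_support(indices=True))
```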

Feature selection Feature selection: quantifying the quality of a subset of features
We need to quantify the quality of a subset of features, $J(\chi)$. Filters use a heuristic to be maximized.
Filter heuristic, e.g. Correlation-based Feature Selection (CFS). Strategy: keep features correlated with the label, yet uncorrelated with each other. Given a training set $\{(x_i, y_i)\}$:

$$J_{CFS}(\chi) = \frac{k\,\bar r(\chi, y)}{\sqrt{k(k-1)\,\bar r(\chi, \chi)}}$$
$$\bar r(\chi, y) = \frac{1}{k} \sum_{j \in \chi} r(x_{.,j}, y), \qquad \bar r(\chi, \chi) = \frac{1}{k(k-1)} \sum_{(j_1, j_2) \in \chi,\, j_1 \neq j_2} r(x_{.,j_1}, x_{.,j_2})$$

with $k = |\chi|$ and $r$ a measure of correlation.

Feature selection Feature selection: quantifying the quality of a subset of features

We need to quantify the quality of a subset of features J(χ) Wrappers use an estimation of the real risk to be minimized. Wrappers

1. Train a predictor on the subset $\chi$
2. $J(\chi)$ = estimation of the real risk (e.g. by cross-validation)

More theoretically grounded, but more computationally expensive.
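A minimal sketch of the wrapper criterion itself: $J(\chi)$ estimated by 10-fold cross-validation for a candidate subset $\chi$ (the classifier and the two example subsets are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def J(chi):
    """Wrapper criterion: 10-fold CV accuracy of a predictor trained on the subset chi."""
    return cross_val_score(DecisionTreeClassifier(max_depth=2),
                           X[:, sorted(chi)], y, cv=10).mean()

print(J({0, 7}), J({22, 27}))   # compare two candidate subsets
```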


Feature extraction

Feature extraction

Given N samples $x_i \in \mathbb{R}^d$, we compute $r \ll d$ new features from the original $d$ features.


Feature extraction Principal Component Analysis [Pearson(1901)]
Statement: find an affine transformation of the data minimizing the reconstruction error.
Intuition and formalisation

[Figure: 2D point cloud and a candidate line defined by (w0, w1), with red segments showing the orthogonal residuals]

In 1D, we seek a line $(w_0, w_1)$ minimizing the sum of the squared lengths of the red segments. It is not unique!

Feature extraction Principal Component Analysis [Pearson(1901)]

Statement : Find an affine transformation of the data minimizing the reconstruction error Formally :

$$\min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| x_i - \Big( w_0 + \sum_{j=1}^{r} \big(w_j^T (x_i - w_0)\big)\, w_j \Big) \right\|_2^2 \qquad (2)$$

subject to $w_i^T w_j = \delta_{i,j}$.

→ matrix form?


Feature extraction Principal Component Analysis [Pearson(1901)]

Matrix formulation of PCA

Introduce $W = (w_1 | \ldots | w_r) \in \mathcal{M}_{d \times r}(\mathbb{R})$:

$$(2) \Leftrightarrow \min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| (I_d - WW^T)(x_i - w_0) \right\|_2^2$$

subject to $W^T W = I_r$.


Feature extraction Principal Component Analysis [Pearson(1901)]

Simplification of the matrix formulation

• If $M$ is idempotent, so is $(I - M)$
• $(I_d - WW^T)$ is symmetric and idempotent

$$(2) \Leftrightarrow \min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

subject to $W^T W = I_r$.


Feature extraction Principal Component Analysis [Pearson(1901)]
Remember: for $u : \mathbb{R}^n \mapsto \mathbb{R}^m$, $v : \mathbb{R}^n \mapsto \mathbb{R}^m$, $A \in \mathcal{M}_{m,m}(\mathbb{R})$:
$$\frac{d\, u^T A v}{dx} = \frac{du}{dx} A v + \frac{dv}{dx} A^T u$$

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = -2\, (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$


Feature extraction Principal Component Analysis [Pearson(1901)]

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = -2\, (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = 0 \Leftrightarrow \exists h \in \mathrm{span}\{w_1, \ldots, w_r\},\ w_0 = h + \frac{1}{N}\sum_i x_i$$

$(I_d - WW^T)h$ is the residual of the orthogonal projection of $h$ on the column vectors of $W$: if $h \in \mathrm{span}\{w_1, \ldots, w_r\}$, the residual is 0; if $h \in \mathrm{span}\{w_1, \ldots, w_r\}^\perp$, the residual is $h$.

Feature extraction Principal Component Analysis [Pearson(1901)]

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

$$\arg\min_{w_0} J \Rightarrow w_0 = h + \frac{1}{N}\sum_i x_i, \quad h \in \mathrm{span}\{w_1, \ldots, w_r\},\ \text{e.g. } h = 0$$

The offset w0

The offset $w_0$ is the mean of the data points, up to a translation in the space spanned by the principal component vectors. Step 1: center the data.


Feature extraction Principal Component Analysis [Pearson(1901)]
Denote $\tilde{x}_i = x_i - \bar{x}$, with $\bar{x} = \frac{1}{N}\sum_i x_i$.
Deriving the first principal component

• $J = \sum_{i=0}^{N-1} \tilde{x}_i^T (I_d - WW^T)\, \tilde{x}_i$
• $\arg\min_{w_1, \ldots, w_r} J = \arg\max_{w_1, \ldots, w_r} \sum_{i=0}^{N-1} \tilde{x}_i^T W W^T \tilde{x}_i$
• $\tilde{X} = (\tilde{x}_0 | \ldots | \tilde{x}_{N-1})$
• $\arg\min_{w_1, \ldots, w_r} J = \arg\max_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X}\tilde{X}^T w_j$

Our optimization problem turns out to be :

$$\arg\max_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X}\tilde{X}^T w_j \quad \text{subject to } W^T W = I_r$$

We have a constrained optimization problem: use a Lagrangian.

Feature extraction Principal Component Analysis [Pearson(1901)] Deriving the first principal component : Lagrangian

$$\arg\max_{w_1} w_1^T \tilde{X}\tilde{X}^T w_1 \quad \text{subject to } w_1^T w_1 = 1$$

• $\mathcal{L}(w_1, \lambda_1) = w_1^T \tilde{X}\tilde{X}^T w_1 + \lambda_1 (1 - w_1^T w_1)$
• $\frac{\partial \mathcal{L}}{\partial w_1} = 0 \Rightarrow \tilde{X}\tilde{X}^T w_1 = \lambda_1 w_1$, so $w_1$ is an eigenvector; but which $\lambda_1$?
• $w_1^T \tilde{X}\tilde{X}^T w_1 = \lambda_1$, so $\lambda_1$ is the largest eigenvalue of $\tilde{X}\tilde{X}^T$

First principal component vector: the first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the "sample matrix" $\tilde{X}\tilde{X}^T$.


Feature extraction Principal Component Analysis [Pearson(1901)]
Deriving the second principal component: greedy. Suppose we have $w_1$, a normalized eigenvector of $\tilde{X}\tilde{X}^T$ associated with its largest eigenvalue. Denote $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$ the eigenvalues. We want to optimize:

$$\arg\max_{w_2} w_1^T \tilde{X}\tilde{X}^T w_1 + w_2^T \tilde{X}\tilde{X}^T w_2 = \arg\max_{w_2} \lambda_1 + w_2^T \tilde{X}\tilde{X}^T w_2 = \arg\max_{w_2} w_2^T \tilde{X}\tilde{X}^T w_2$$

with $w_i^T w_j = \delta_{i,j}$. And $w_2^T \tilde{X}\tilde{X}^T w_2 = w_2^T (\tilde{X}\tilde{X}^T - \lambda_1 w_1 w_1^T)\, w_2$, so $w_2$ is a normalized eigenvector associated with the largest eigenvalue of $(\tilde{X}\tilde{X}^T - \lambda_1 w_1 w_1^T)$, i.e. $\lambda_2$.

And so on. But does the greedy algorithm find the optimum?

Feature extraction Principal Component Analysis [Pearson(1901)]
Deriving the other principal components: greedy. Does it make sense to use a greedy algorithm? (proof in the lecture notes)
Theorem

For any symmetric positive semi-definite matrix $M \in \mathcal{M}_{d \times d}(\mathbb{R})$, denote $\{\lambda_i\}_{i=1..d}$ its eigenvalues, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. For any set of $r \in [1, d]$ orthogonal unit vectors $\{v_1, \ldots, v_r\}$, we have:

$$\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j \qquad (3)$$

This upper bound is reached by eigenvectors associated with the largest eigenvalues of $M$.


Feature extraction Principal Component Analysis [Pearson(1901)]

PCA recipe: given $\{x_0, \ldots, x_{N-1}\} \subset \mathbb{R}^d$, to compute the r principal component vectors (a NumPy sketch follows below):

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute r normalized eigenvectors associated with the r largest eigenvalues of $\tilde{X}\tilde{X}^T$
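A direct NumPy sketch of this recipe via the eigendecomposition of $\tilde{X}\tilde{X}^T$; in practice you would rather use the SVD route of the following slides:

```python
import numpy as np

def pca_eig(X, r):
    """PCA by eigendecomposition. X has shape (N, d), one sample per row."""
    x_bar = X.mean(axis=0)
    Xt = (X - x_bar).T                           # d x N matrix of centered samples (columns)
    eigval, eigvec = np.linalg.eigh(Xt @ Xt.T)   # symmetric matrix -> eigh
    order = np.argsort(eigval)[::-1][:r]         # indices of the r largest eigenvalues
    W = eigvec[:, order]                         # d x r principal vectors
    return W, x_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W, x_bar = pca_eig(X, r=2)
z = (X - x_bar) @ W                              # N x r principal components
```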


Feature extraction Principal Component Analysis [Pearson(1901)]

PCA is a projection method: given $x \in \mathbb{R}^d$, its principal components are its coordinates in the selected eigenspace:

$$x \rightarrow \big((x - \bar{x})^T w_1, (x - \bar{x})^T w_2, \ldots, (x - \bar{x})^T w_r\big)$$

If $x \in \{x_0, \ldots, x_{N-1}\}$, you had better use the SVD, which directly gives you the principal components.


Feature extraction Principal Component Analysis [Pearson(1901)]

Singular Value Decomposition

For any matrix $M \in \mathcal{M}_{d,N}(\mathbb{R})$, there exist an orthogonal matrix $U \in \mathcal{M}_{d,d}(\mathbb{R})$, a diagonal matrix $D \in \mathcal{M}_{d,N}(\mathbb{R})$, and an orthogonal matrix $V \in \mathcal{M}_{N,N}(\mathbb{R})$ such that:

$$M = U D V^T$$

Orthogonal matrix: $U^T = U^{-1}$.


Feature extraction Principal Component Analysis [Pearson(1901)]

PCA with SVD: given $\tilde{X} = U D V^T$:

$$\tilde{X}\tilde{X}^T = U D D^T U^{-1}$$

This is the diagonalization of $\tilde{X}\tilde{X}^T$. The projection vectors are the column vectors of $U$: $\{w_1, \ldots, w_r\} = \{u_1, \ldots, u_r\}$. The principal components of the training set are the first r rows of:

$$U^T \tilde{X} = U^T U D V^T = D V^T$$
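The same computation via NumPy's SVD, as a sketch; the principal components of the training set come out directly as the first r rows of $D V^T$:

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(100, 5))    # N samples of dimension d
Xt = (X - X.mean(axis=0)).T                           # centered data, shape d x N

U, s, Vt = np.linalg.svd(Xt, full_matrices=False)     # Xt = U diag(s) Vt
r = 2
W = U[:, :r]                                          # principal vectors w_1 .. w_r
Z = s[:r, None] * Vt[:r, :]                           # r x N principal components (rows of D V^T)

# Sanity check: identical to projecting the centered data on W
assert np.allclose(Z, W.T @ Xt)
```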


Feature extraction Principal Component Analysis [Pearson(1901)]

What is $\tilde{X}\tilde{X}^T$?

$$\tilde{X}\tilde{X}^T = \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \sum_i \Big(x_i - \frac{1}{N}\sum_j x_j\Big)\Big(x_i - \frac{1}{N}\sum_j x_j\Big)^T = (N-1)\,\Sigma$$

with $\Sigma$ the sample covariance matrix. $\Sigma$ is symmetric positive semi-definite, i.e. its eigenvalues are all non-negative.


Feature extraction Principal Component Analysis [Pearson(1901)]

Equivalent formulations: there are two equivalent formulations of PCA:
• find an affine transformation minimizing the reconstruction error
• find an affine transformation maximizing the variance of the projections


Feature extraction Principal Component Analysis
Maximizing the variance of the projections: suppose your data are centered, i.e. $\frac{1}{N}\sum_i x_i = 0$. Denote $z_i \in \mathbb{R}^r$ the projection of $x_i$ over $w_1, \ldots, w_r$; we have $\bar{z} = 0$. The sample covariance matrix $\Sigma \in \mathcal{M}_{r,r}(\mathbb{R})$ is:

$$\Sigma = \frac{1}{N-1} \sum_i z_i z_i^T = \frac{1}{N-1} W^T \Big(\sum_i x_i x_i^T\Big) W$$

We want to maximize $\sum_{j=1}^{r} \Sigma_{j,j}$, and:

$$\sum_{j=1}^{r} \Sigma_{j,j} = \frac{1}{N-1} \sum_j \sum_i (w_j^T x_i)(x_i^T w_j) = \frac{1}{N-1} \sum_j w_j^T X X^T w_j$$

This is the same optimization problem as before.

Feature extraction Principal Component Analysis [Pearson(1901)]

What is the fraction of variance we keep ? For any matrix M, orthogonal matrix P :

$$\mathrm{Tr}\big(P^{-1} M P\big) = \mathrm{Tr}(M)$$

Therefore, $\mathrm{Tr}(\tilde{X}\tilde{X}^T) = \sum_{i=0}^{N-1} \lambda_i$. The variance of our data points is $\mathrm{Tr}\big(\frac{1}{N-1}\tilde{X}\tilde{X}^T\big) = \frac{1}{N-1}\sum_{i=0}^{N-1} \lambda_i$. If we keep r principal components, we keep a fraction of the variance equal to:

$$\frac{\sum_{i=0}^{r-1} \lambda_i}{\sum_{i=0}^{N-1} \lambda_i}$$
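With scikit-learn this fraction is available directly as explained_variance_ratio_; a short sketch on the 8×8 digits dataset (used here as a small stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)     # 8x8 digit images, flattened to 64 features
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                    # 2D projection, e.g. for visualization
print("Fraction of variance kept:", pca.explained_variance_ratio_.sum())
```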


Feature extraction Principal Component Analysis [Pearson(1901)]
PCA on MNIST (28 × 28 images)

[Figure: projection of MNIST on the first 2 principal vectors (17.05% of the total variance), colored by digit class 0-9, and the 10 first principal vectors displayed as images]

Feature extraction Sample covariance and Gram matrices
Definitions: the sample covariance matrix is:

$$\Sigma = \frac{1}{N-1} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} X X^T$$

The Gram matrix is:

$$G = X^T X = \begin{pmatrix} x_0^T x_0 & x_0^T x_1 & \cdots & x_0^T x_{N-1} \\ \vdots & \vdots & & \vdots \\ x_{N-1}^T x_0 & x_{N-1}^T x_1 & \cdots & x_{N-1}^T x_{N-1} \end{pmatrix}$$

The Gram matrix is built up from dot products.

The eigenvectors/eigenvalues of G and Σ are related!

Feature extraction Sample covariance and Gram matrices

Lemma: $\forall A \in \mathbb{R}^{n \times m}$, $\ker(A) = \ker(A^T A)$

Theorem (rank-nullity): $\forall A \in \mathbb{R}^{n \times m}$, $\mathrm{rk}(A) + \dim(\ker(A)) = m$.

Theorem: $\forall A \in \mathbb{R}^{n \times m}$, $\mathrm{rk}(A^T A) = \mathrm{rk}(A A^T) \leq \min(n, m)$


Feature extraction

Lemma (eigenvalues of the covariance and Gram matrices): the nonzero eigenvalues of the scaled covariance matrix $(N-1)\Sigma = XX^T$ and of the Gram matrix $G = X^T X$ are the same:

$$\{\lambda \in \mathbb{R}^*,\ \exists v \neq 0,\ (N-1)\Sigma v = \lambda v\} = \{\lambda \in \mathbb{R}^*,\ \exists v \neq 0,\ G v = \lambda v\}$$

During the proof, we show that:
• if $(\lambda, v)$ is an eigenpair of $XX^T$, then $(\lambda, X^T v)$ is an eigenpair of $X^T X$
• if $(\lambda, w)$ is an eigenpair of $X^T X$, then $(\lambda, X w)$ is an eigenpair of $XX^T$
There are several applications of this property:
• the eigenface algorithm, used when $N \ll d$
• the nonlinear PCA called Kernel PCA [Schoelkopf, 1999]


Feature extraction What to do when $N \ll d$
$G \in \mathcal{M}_{N,N}(\mathbb{R})$, $\Sigma \in \mathcal{M}_{d,d}(\mathbb{R})$.
Eigenface: if $N \ll d$, it is much more efficient to "diagonalize" $G$ than $\Sigma$. In that case, the recipe is (a NumPy sketch follows below):

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute the r normalized eigenvectors $w_j \in \mathbb{R}^N$ of $G$, with eigenvalues $\lambda_j$
4. Project your data on the r normalized eigenvectors of $\Sigma$ given by: $\frac{\tilde{X} w_j}{\|\tilde{X} w_j\|_2} = \frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j$
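A NumPy sketch of this Gram-matrix trick, checking that the top eigenvalues of $G$ match those of $\tilde{X}\tilde{X}^T$ when $N \ll d$ (the sizes N = 20, d = 500, r = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 20, 500, 3
X = rng.normal(size=(d, N))                  # one sample per column, d features
X -= X.mean(axis=1, keepdims=True)           # center the samples

G = X.T @ X                                  # N x N Gram matrix (cheap when N << d)
lam, W = np.linalg.eigh(G)                   # eigenvalues in ascending order
lam, W = lam[::-1][:r], W[:, ::-1][:, :r]    # keep the r largest
U = (X @ W) / np.sqrt(lam)                   # corresponding unit eigenvectors of X X^T (d x r)

# Sanity check: the top-r eigenvalues of the (much larger) d x d matrix X X^T are the same
assert np.allclose(lam, np.linalg.eigvalsh(X @ X.T)[::-1][:r])
```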

Feature extraction Toward a Kernel PCA We can reformulate the PCA using only dot products. PCA with only dot products

Computing the Gram matrix involves only dot products between the $x_i$. Projecting a vector $x$ on the vector $\frac{1}{\sqrt{\lambda_j}}\tilde{X} w_j$ reads:

$$\Big(\frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j\Big)^T x = \frac{1}{\sqrt{\lambda_j}}\, w_j^T \begin{pmatrix} \langle x_0, x \rangle \\ \vdots \\ \langle x_{N-1}, x \rangle \end{pmatrix}$$

A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM). The only remaining difficulty is that we must ensure the vectors in the feature space are centered.


Feature extraction Non linear PCA Kernel PCA [Scholkopf(1999)]
Consider a kernel $k : \mathbb{R}^N \times \mathbb{R}^N \mapsto \mathbb{R}$ with $\langle \phi(x), \phi(x') \rangle = k(x, x')$, e.g.
• RBF kernel: $k(x, x') = \exp\big(-\frac{\|x - x'\|_2^2}{2\sigma^2}\big)$
We perform a PCA in the feature space, the image of $\phi$: compute the Gram matrix and its eigenvectors/eigenvalues $w_j, \lambda_j$. For projecting a vector $x$, compute:

$$\frac{1}{\sqrt{\lambda_j}}\, w_j^T \begin{pmatrix} k(x_0, x) \\ \vdots \\ k(x_{N-1}, x) \end{pmatrix}$$

What about centering the φ(xi )?


Feature extraction Non linear PCA

Kernel PCA: centering in the feature space. It can be shown that the Gram matrix $\tilde{G}$ defined by:

$$\tilde{G} = \Big(I_N - \frac{1}{N}\mathbf{1}\Big)\, G\, \Big(I_N - \frac{1}{N}\mathbf{1}\Big)$$

(with $\mathbf{1}$ the $N \times N$ all-ones matrix) is the matrix of the dot products of the feature vectors centered in the feature space. The above transformation is called the double centering transformation.
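A sketch with scikit-learn's KernelPCA, which performs this double centering internally; the swiss-roll data and the RBF bandwidth gamma (with gamma = 1/(2σ²) in the notation above) are illustrative:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA, PCA

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

linear = PCA(n_components=2).fit_transform(X)                       # linear projection
nonlin = KernelPCA(n_components=2, kernel="rbf",
                   gamma=0.05).fit_transform(X)                     # nonlinear projection
print(linear.shape, nonlin.shape)   # both (1000, 2), but the embeddings differ
```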


Feature extraction

Feature extraction : Manifold learning

Goal: for each $x_i \in \mathbb{R}^d$, associate $y_i \in \mathbb{R}^r$ so that the pairwise distances in $\mathbb{R}^d$ are as similar as possible to the pairwise distances in $\mathbb{R}^r$. Perfect for visualizing datasets in low dimensions. Examples: LLE, MDS, Isomap, SNE, t-SNE, ...


Feature extraction Manifold learning

Overview: $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^r$, $r \ll d$, e.g. $r = 2$

1. Quantify the similarity between pairs of points in $\mathbb{R}^d$
2. Quantify the similarity between pairs of points in $\mathbb{R}^r$
3. Quantify the discrepancy between these similarities
4. Optimize with respect to the $y_i$


Feature extraction Manifold learning

t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]: focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$.
• Similarity in $\mathbb{R}^d$:

$$\forall i, j,\ p_{i,j} = \frac{p_{i|j} + p_{j|i}}{2N}, \qquad \forall i, j,\ p_{i|j} = \frac{\exp\big(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\big)}{\sum_{k \neq i} \exp\big(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\big)}$$


Feature extraction Manifold learning t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$.
• Similarity in $\mathbb{R}^r$:

$$\forall i, j,\ q_{i,j} = \frac{(1 + \|y_i - y_j\|_2^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|_2^2)^{-1}}$$

• Match $q_{i,j}$ to $p_{i,j}$ by minimizing the Kullback-Leibler divergence:

$$C = \sum_{i,j} p_{i,j} \log\Big(\frac{p_{i,j}}{q_{i,j}}\Big)$$

Complexity: $O(N^2)$. Optimized with Barnes-Hut, bringing the complexity down to $O(N \log N)$.
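A sketch with scikit-learn's TSNE on the 8×8 digits (a small stand-in for MNIST); perplexity controls the per-point bandwidths σ_i and the Barnes-Hut approximation is the default method:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
Z = TSNE(n_components=2, perplexity=30, init="pca",
         method="barnes_hut", random_state=0).fit_transform(X)
print(Z.shape)   # (1797, 2) embedding, typically scatter-plotted and colored by y
```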

Feature extraction Manifold learning t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]

[Figure: t-SNE embedding example]
