Sharing is Caring in the Land of The Long Tail
Samy Bengio

Real life setting
“Real problems rarely come packaged as 1M images uniformly belonging to a set of 1000 classes…”

The long tail
• Well-known phenomenon where a small number of generic objects/entities/words appear very often and most others appear more rarely.
• Also known as a Zipf, power-law, or Pareto distribution.
• The web is littered with this kind of distribution:
  • the frequency of each unique query on search engines,
  • the occurrences of each unique word in text documents,
  • etc.

Example of a long tail
[Figure: frequency of words in Wikipedia on a log scale, ranging from about 1e9 for the most frequent word (“the”) down to about 10 for rare words such as “h-plane”.]

Representation sharing
• How do we design a classifier or a ranker when data follows a long tail distribution?
• If we train one model per class, it is hard for poor classes to be well trained.
• How come we humans are able to recognize objects we have seen only once, or even never?
• Most likely answer: representation sharing: all class models share/learn a joint representation.
• Poor classes can then benefit from knowledge learned from semantically similar but richer classes.
• Extreme case: the zero-shot setting!

Outline
In this talk, I will cover the following ideas:
• Wsabie: a joint embedding space of images and labels
• The many facets of text embeddings
• The zero-shot setting through embeddings
• Incorporating Knowledge Graph constraints
• Using a language model
I will NOT cover the following important issues:
• Prediction-time issues for extreme classification
• Memory issues

Wsabie
Learn to embed images and labels (Obama, Eiffel Tower, Shark, Dolphin, Lion, ...) in a joint 100-dimensional embedding space, optimized for the top-ranked items.
Wsabie: J. Weston et al., ECML 2010, IJCAI 2011

Wsabie: summary
• sim(i, x) = ⟨W_i, V_x⟩, where W maps a one-hot label i and V maps the real-valued features of image x into the joint space.
• Triplet loss: for an image x of a dolphin, require sim(x, dolphin) > sim(x, obama) + 1.
• Trained by stochastic gradient descent and smart sampling of negative examples (a minimal sketch of one training step follows the Wsabie slides below).

Wsabie: experiments - results

  Method                  | ImageNet 2010       | Web
                          | prec@1    prec@10   | prec@1    prec@10
  approx. kNN             | 1.55%     0.41%     | 0.30%     0.34%
  One-vs-Rest             | 2.27%     1.02%     | 0.52%     0.29%
  Wsabie                  | 4.03%     1.48%     | 1.03%     0.44%
  Ensemble of 10 Wsabies  | 10.03%    3.02%     |

ImageNet 2010: 16,000 labels and 4M images. Web: 109,000 labels and 16M images.

Wsabie: embeddings

  Label         | Nearest neighbors
  barack obama  | barak obama, obama, barack, barrack obama, bow wow
  david beckham | beckham, david beckam, alessandro del piero, del piero
  santa         | santa claus, papa noel, pere noel, santa clause, joyeux noel
  dolphin       | delphin, dauphin, whale, delfin, delfini, baleine, blue whale
  cows          | cattle, shire, dairy cows, kuh, horse, cow, shire horse, kone
  rose          | rosen, hibiscus, rose flower, rosa, roze, pink rose, red rose
  eiffel tower  | eiffel, tour eiffel, la tour eiffel, big ben, paris, blue mosque
  ipod          | i pod, ipod nano, apple ipod, ipod apple, new ipod
  f18           | f 18, eurofighter, f14, fighter jet, tomcat, mig 21, f 16

Wsabie: annotations
• Dolphin image: delfini, orca, dolphin, mar, delfin, dauphin, whale, cancun, killer whale, sea world
• Shark image: blue whale, whale shark, great white shark, underwater, white shark, shark, manta ray, dolphin, requin, blue shark, diving
• Obama image: barrack obama, barak obama, barack hussein obama, barack obama, james marsden, jay z, obama, nelly, falco, barack
• Eiffel Tower image: eiffel, paris by night, la tour eiffel, tour eiffel, eiffel tower, las vegas strip, eifel, tokyo tower, eifel tower
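The triplet loss and negative sampling above fit in a few lines of code. Below is a minimal NumPy sketch of one Wsabie-style training step, not the paper's exact recipe: the dimensions, learning rate, and initialization are illustrative assumptions, and the rank-dependent weighting used by the full WARP loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_labels, k = 512, 1000, 100                  # feature dim, #labels, embedding dim (assumed)
V = rng.normal(scale=0.01, size=(k, d))          # projects image features into the joint space
W = rng.normal(scale=0.01, size=(n_labels, k))   # one embedding row per label

def wsabie_step(x, pos, lr=0.01, margin=1.0, max_tries=50):
    """One SGD step on the triplet hinge: sample negative labels until one
    violates the margin against the correct label `pos`, then update."""
    x_emb = V @ x                                # image position in the joint space
    for _ in range(max_tries):
        neg = int(rng.integers(n_labels))
        if neg == pos:
            continue
        loss = margin - W[pos] @ x_emb + W[neg] @ x_emb
        if loss > 0:                             # margin violated: take a gradient step
            g = W[pos] - W[neg]                  # capture before updating W
            W[pos] += lr * x_emb
            W[neg] -= lr * x_emb
            V += lr * np.outer(g, x)             # pushes V_x toward W[pos], away from W[neg]
            return float(loss)
    return 0.0                                   # no violating negative found: no update

# e.g. one step on a random "image" whose correct label index is 42:
wsabie_step(rng.normal(size=d), pos=42)
```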
“Why not an embedding of text only?”

Skip-Gram (Word2Vec)
Learn dense embedding vectors from an unannotated text corpus, e.g. Wikipedia.
[Figure: Skip-Gram architecture, with embedding matrix E and output matrix W, trained on text such as “an exceptionally large male tiger shark can grow up to”: the model learns to predict likely nearby words (tiger shark, bull shark, tuna) over unrelated ones (wing chair, Obama).]
http://code.google.com/p/word2vec
Tomas Mikolov, Kai Chen, Greg Corrado, Jeff Dean (ICLR 2013)

Skip-Gram on Wikipedia
[Figure: t-SNE visualization of 155K ImageNet label terms embedded by a Skip-Gram model trained on Wikipedia. Semantically related labels cluster together: shark species (tiger shark, bull shark, blacktip shark, oceanic whitetip shark, sandbar shark, dusky shark, blue shark, requiem shark, great white shark, lemon shark) in one region; car terms (car, cars, muscle car, sports car, compact car, autocar, automobile, pickup truck, racing car, passenger car, dealership) in another; with broader regions for reptiles, birds, insects, food, musical instruments, clothing, dogs, aquatic life, animals, and transportation.]

Embeddings are powerful
• E(Rome) - E(Italy) + E(Germany) ≈ E(Berlin)
• E(hotter) - E(hot) + E(big) ≈ E(bigger)
(A sketch of this embedding arithmetic appears after the ConSE slides below.)

Let's go back to images!
• Deep convolutional models (Input → Layer 1 → ... → Layer 7) work well for images.
• But what about the long tail of classes?
• What about using our semantic embeddings for that?

ConSE: Convex Combination of Semantic Embeddings [Norouzi et al., ICLR 2014]
• An image classifier produces p(Lion|x), p(Apple|x), p(Orange|x), p(Tiger|x), p(Bear|x), ...
• Let s(y) be the embedding position of label y, from Skip-Gram for instance. Map the image to
    f(x) = Σ_i p(y_i|x) s(y_i)
         = p(Lion|x) s(Lion) + p(Apple|x) s(Apple) + p(Orange|x) s(Orange) + p(Tiger|x) s(Tiger) + p(Bear|x) s(Bear)
• Do a nearest-neighbor search around f(x) to find the corresponding label.

ConSE(T): Convex Combination of Semantic Embeddings
In practice, consider the average of only a few labels (sketched after the experiments below):
    top(T) = { i : p(y_i|x) is among the top T probabilities }
    f(x) = (1/Z) Σ_{i ∈ top(T)} p(y_i|x) s(y_i)

ConSE(T): experiments on ImageNet
• Model trained with 1.2M ILSVRC 2012 images from 1,000 classes.
• Evaluated on images from these same classes, as well as on zero-shot classes 2 hops and 3 hops away from the training classes in the label hierarchy.
• Results are measured as hit@k.
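Before moving on to knowledge graphs, two small sketches of the ideas above. First, the embedding arithmetic from the “Embeddings are powerful” slide: a hypothetical `analogy` helper that assumes a trained embedding matrix `E` (one row per word) and its `vocab` list.

```python
import numpy as np

def analogy(E, vocab, a, b, c):
    """Nearest word to E(b) - E(a) + E(c) by cosine similarity,
    excluding the three query words themselves.
    e.g. analogy(E, vocab, "Italy", "Rome", "Germany") should return "Berlin"."""
    idx = {w: i for i, w in enumerate(vocab)}
    q = E[idx[b]] - E[idx[a]] + E[idx[c]]
    sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    for i in np.argsort(-sims):               # best match first
        if vocab[i] not in (a, b, c):
            return vocab[i]
```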
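Second, the ConSE(T) mapping itself: a probability-weighted average of the training labels' embeddings, followed by a nearest-neighbor lookup among candidate labels that may include unseen (zero-shot) classes. Shapes and names here are illustrative assumptions.

```python
import numpy as np

def conse_embedding(probs, train_label_emb, T=5):
    """ConSE(T): probs is the classifier's softmax output over the training
    labels; train_label_emb holds their skip-gram vectors s(y_i)."""
    top = np.argsort(probs)[-T:]              # indices of the T most probable labels
    w = probs[top]
    f = w @ train_label_emb[top]              # convex combination of embeddings
    return f / w.sum()                        # normalize by Z = sum of top-T probabilities

def nearest_label(f, candidate_emb):
    """Cosine nearest-neighbor search around f(x); candidates may include
    labels with zero training images (the zero-shot case)."""
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    return int(np.argmax(c @ (f / np.linalg.norm(f))))

# e.g. with made-up shapes: 1,000 training labels, 20,000 candidates, 100-dim embeddings
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000))
train_emb = rng.normal(size=(1000, 100))
cand_emb = rng.normal(size=(20000, 100))
print(nearest_label(conse_embedding(probs, train_emb), cand_emb))
```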
Knowledge Graph

Multiclass classifiers
• A GoogLeNet model with either a softmax or per-label logistic outputs.

Object labels have rich relations
• Exclusion: an object cannot be both a Dog and a Cat.
• Hierarchy: Corgi and Puppy are both Dogs.
• Overlap: e.g. Corgi and Puppy can overlap.

Visual Model + Knowledge Graph
• The visual model outputs scores such as Dog 0.9, Corgi 0.8, Puppy 0.9, Cat 0.1; joint inference combines these scores with the knowledge graph.
• The hierarchy and exclusion relations form a Hierarchy and Exclusion (HEX) graph. [Deng et al., ECCV 2014]

HEX classification model
Given input scores x ∈ R^n and a binary label vector y ∈ {0,1}^n:
    Pr(y|x) = (1/Z(x)) ∏_i φ_i(x_i, y_i) ∏_{i,j} ψ_{i,j}(y_i, y_j)
    φ_i(x_i, y_i) = sigmoid(x_i) if y_i = 1, and 1 - sigmoid(x_i) if y_i = 0
    ψ_{i,j}(y_i, y_j) = 0 if (y_i, y_j) violates a constraint, and 1 otherwise
• Unary terms: same as logistic regression.
• Pairwise terms: set illegal configurations to zero, so all illegal configurations have probability zero.
(A toy sketch of this model appears at the end of the deck.)

Exp: learning with weak labels
• ILSVRC 2012: “relabel” or “weaken” a portion of the fine-grained leaf labels to basic-level labels (e.g. Corgi and Husky both become Dog in the training set, while the test set keeps the leaf labels).
• Evaluate on fine-grained recognition, in top-1 (and top-5) accuracy.
• Consistently outperforms baselines.

What about textual descriptions?
• We have considered the long tail of objects.
• What about more complex descriptions, involving multi-word descriptions, or captions?
• We can use language models to help.

Neural Image Caption Generator [Vinyals et al., CVPR 2015]
Vision: a deep CNN. Language: a caption-generating RNN.
Example outputs:
• “Two pizzas sitting on top of a stove top oven.” / “A pizza sitting on top of a pan on top of a stove.”
• “A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.”

NIC: objective
• Let I be an image (pixels) and S the corresponding sentence (sequence of words).
• Log-likelihood of producing the right sentence given the image:
    log p(S|I) = Σ_{t=0}^{N} log p(S_t | I, S_0, ..., S_{t-1})
• We maximize this likelihood over the training set:
    θ* = argmax_θ Σ_{(I,S)} log p(S|I; θ)

NIC: model
[Figure: the image is passed through a convolutional neural net; its representation feeds a recurrent neural net that outputs P(word 1), P(word 2), ..., P(<end>), each previous word being embedded and fed back at the next step.]

Examples
[Figure: a grid of images with generated captions, rated from “Describes without errors” through “Describes with minor errors” and “Somewhat related to the image” to “Unrelated to the image”. Sample captions: “A person riding a motorcycle on a dirt road.” / “Two dogs play in the grass.” / “A skateboarder does a trick on a ramp.” / “A dog is jumping to catch a frisbee.” / “A group of young people playing a game of frisbee.” / “Two hockey players are fighting over the puck.” / “A little girl in a pink hat is blowing bubbles.” / “A refrigerator filled with lots of food and drinks.” / “A herd of elephants walking across a dry grass field.” / “A close up of a cat laying on a couch.” / “A red motorcycle parked on the side of the road.” / “A yellow school bus parked in a parking lot.”]

It doesn't always work…
• Human: “A blue and black dress” … No! I see white and gold!
• Our model: “A close up of a vase with flowers.”

Scheduled Sampling [NIPS 2015]
• During training, instead of always feeding the RNN the ground-truth previous word, feed it a word sampled from its own predictions, with a probability that increases over training according to a decay schedule (a sketch of the schedules appears at the end of the deck).
[Figure: probability of using the ground-truth word as a function of training step (0 to 1000), for three schedules: exponential decay, inverse sigmoid decay, and linear decay.]

Conclusions
• The long tail problem appears in most of the interesting tasks.
• Sharing approaches can help “poor” classes generalize thanks to “rich” classes.
• At the extreme, semantic embeddings can represent classes with zero training examples.
• Sharing approaches are interesting not only for text and images, but also for complete sentences.
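Two closing sketches, referenced from the slides above. First, the HEX model on the toy Dog/Corgi/Puppy/Cat graph: with so few labels we can simply enumerate all binary label vectors, zero out the illegal ones, and normalize. The paper derives efficient exact inference instead; the scores and names here are made up.

```python
import itertools
import numpy as np

labels = ["dog", "corgi", "puppy", "cat"]
hierarchy = [(1, 0), (2, 0)]   # (child, parent): corgi => dog, puppy => dog
exclusion = [(0, 3)]           # dog and cat cannot both be present

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def legal(y):
    """Pairwise psi: 1 for legal configurations, 0 otherwise."""
    ok_h = all(not (y[c] == 1 and y[p] == 0) for c, p in hierarchy)
    ok_e = all(not (y[i] == 1 and y[j] == 1) for i, j in exclusion)
    return ok_h and ok_e

def hex_posterior(scores):
    """Brute-force Pr(y|x): unary sigmoid potentials over legal y only."""
    unnorm = {}
    for y in itertools.product([0, 1], repeat=len(labels)):
        if not legal(y):
            continue                       # illegal configurations get probability zero
        phi = np.prod([sigmoid(s) if yi else 1.0 - sigmoid(s)
                       for s, yi in zip(scores, y)])
        unnorm[y] = phi
    Z = sum(unnorm.values())
    return {y: p / Z for y, p in unnorm.items()}

# e.g. a confident "dog, probably corgi or puppy, not cat" image:
for y, p in sorted(hex_posterior([2.2, 1.4, 2.2, -2.2]).items(),
                   key=lambda kv: -kv[1])[:3]:
    print(dict(zip(labels, y)), round(p, 3))
```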
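And second, the scheduled-sampling decay schedules from the figure above, together with the per-step choice of RNN input during training. The decay constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

def exponential_decay(i, k=0.999):
    return k ** i                            # eps_i = k^i

def inverse_sigmoid_decay(i, k=150.0):
    return k / (k + np.exp(i / k))           # eps_i = k / (k + e^{i/k})

def linear_decay(i, eps_min=0.0, slope=1e-3):
    return max(eps_min, 1.0 - slope * i)     # eps_i = max(eps_min, 1 - slope*i)

def next_rnn_input(step, gold_prev, sampled_prev, rng,
                   schedule=inverse_sigmoid_decay):
    """During training, feed the ground-truth previous word with probability
    eps_step, and a word sampled from the model's own predictions otherwise."""
    return gold_prev if rng.random() < schedule(step) else sampled_prev

# e.g. late in training the model mostly sees its own samples:
rng = np.random.default_rng(0)
print(next_rnn_input(900, "dog", "cat", rng))
```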