Structured Prediction with Output Embeddings for Semantic Image Annotation

Ariadna Quattoni¹, Arnau Ramisa², Pranava Swaroop Madhyastha³, Edgar Simo-Serra⁴, Francesc Moreno-Noguer²
¹ Xerox Research Center Europe, [email protected]
² Institut de Robòtica i Informàtica Industrial (CSIC-UPC), {aramisa,fmoreno}@iri.upc.edu
³ TALP Research Center, Universitat Politècnica de Catalunya, [email protected]
⁴ Waseda University, [email protected]

Abstract

We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge. We propose handling this sparsity by incorporating feature representations of both the inputs (images) and outputs (argument classes) into a factorized log-linear model.

1 Introduction

Many important problems can be framed as structured prediction tasks, where the goal is to learn functions that map inputs to structured outputs such as sequences, trees or general graphs. A wide range of applications involve learning over large state spaces; e.g., if the output is a labeled graph, each node of the graph may take values over a potentially large set of labels. Data sparsity then becomes a challenge, as there will be many classes with very few training examples.

Within this context, we are interested in the task of predicting semantic tuples for images. That is, given an input image, we seek to predict what are the events or actions (referred to here as predicates), who and what are the participants (referred to here as actors) of the actions, and where is the action taking place (referred to here as locatives). For example, an image might be annotated with the semantic tuples ⟨run, dog, park⟩ and ⟨play, dog, grass⟩. We call each field of a tuple an argument.

To handle the data sparsity challenge imposed by the large state space, we will leverage an approach that has proven to be useful in multiclass and multilabel prediction tasks (Weston et al., 2010; Akata et al., 2013). The idea is to represent a value for an argument a using a feature vector representation φ ∈ IR^n. We will integrate this argument representation into the structured prediction model.

In summary, our main contribution is to propose an approach that incorporates feature representations of the outputs into a structured prediction model, and to apply it to the problem of annotating images with semantic tuples. We present an experimental study using different output feature representations and analyze how they affect performance for different argument types.

2 Semantic Tuple Image Annotation

Task: We will address the task of predicting semantic tuples for images. Following Farhadi et al. (2010), we will focus on a simple semantic representation that considers three basic arguments: predicate, actors and locatives. For example, in the tuple ⟨play, dog, grass⟩, "play" is the predicate, "dog" is the actor and "grass" is the locative.

Given this representation, we can formally define our problem as that of learning a function θ : X × P × A × L → IR that scores the compatibility between images and semantic tuples. Here X is the space of images; P, A and L are discrete sets of predicate, actor and locative arguments, respectively; and ⟨p, a, l⟩ is a specific tuple instance. The overall learning process is illustrated in Fig. 1.
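To make the formal definition concrete, here is a minimal sketch (our own illustrative Python with hypothetical names, not the authors' code) that treats θ as a black-box scoring function and recovers the highest-scoring tuple by exhaustive search. The factorized model of Section 3 is what makes this search tractable when P, A and L are large.

```python
# Illustrative sketch of the task interface; names are assumptions, not from the paper.
from itertools import product
from typing import Callable, NamedTuple, Sequence

class SemanticTuple(NamedTuple):
    predicate: str
    actor: str
    locative: str

def best_tuple(x, predicates: Sequence[str], actors: Sequence[str],
               locatives: Sequence[str],
               theta: Callable[..., float]) -> SemanticTuple:
    """Score every <p, a, l> with theta(x, p, a, l) and return the best one.

    This is the brute-force O(|P||A||L|) reading of the definition; the chain-structured
    model of Section 3 allows much cheaper decoding.
    """
    return max((SemanticTuple(p, a, l)
                for p, a, l in product(predicates, actors, locatives)),
               key=lambda t: theta(x, t.predicate, t.actor, t.locative))
```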



[Figure 1 graphic: training resources and pipeline. ImageNet images annotated with keywords (cheap, readily available) train the convolutional network that produces the image features ⟨φA(x), φP(x), φL(x)⟩; Flickr8K images come with descriptive sentences (relatively cheap); a small, expert-annotated subset with both sentences and semantic tuples trains the semantic tuple extractor; the embedded CRF, which implicitly induces embeddings of image features and arguments, maps images to semantic tuples.]

Figure 1: Overview of our approach. First, images x ∈ A are represented using image features φs(x), and semantic tuples are obtained by applying our semantic tuple extractor (learned from the subset C) to their corresponding captions. The resulting enlarged training set is used to train our embedded CRF model that maps images to semantic tuples.

Dataset: For our experiments we used a subset of the Flickr8k dataset, proposed in Hodosh et al. (2013). This dataset (subset B in Fig. 1) consists of 8,000 images from Flickr of people and animals (mostly dogs) performing some action, with five crowd-sourced descriptive captions for each one. We first manually annotated 1,544 captions, corresponding to 311 images (approximately one third of the development set; subset C in Fig. 1), producing more than 2,000 semantic tuples of predicate, actor and locative. For the experiments we partitioned the images and annotations into training, validation and test sets of 150, 50 and 100 images, respectively.

Data augmentation: To enlarge the manually annotated dataset, we trained a model able to predict semantic tuples from captions using standard shallow and deep linguistic features (e.g., POS tags, dependency parsing, semantic role labeling). We extract the predicates by looking at the words tagged as verbs by the POS tagger. Then, the extraction of arguments for each predicate is resolved as a classification problem.

More specifically, for each detected predicate in a sentence we regard each noun as a positive or negative training example of a given relation, depending on whether the candidate noun is or is not an argument of the predicate. We use these examples to train an SVM classifier that predicts whether a candidate noun is an argument of a given predicate, based on several linguistic features computed over the syntactic path of the dependency tree that connects them. We run the learned tuple predictor model on all the captions of the Flickr8k dataset to obtain a larger dataset of 8,000 images paired with semantic tuples.

3 Bilinear Models with Output Features

In this section we explain how we incorporate output feature representations into a factorized linear model. For simplicity, we will consider factorized sequence models over sequences of fixed length; however, all the ideas presented here can easily be generalized to other structured prediction settings.

Let y = [y_1, ..., y_T] be a set of labels and S = [S_1, ..., S_T] be the set of possible label values, where y_i ∈ S_i. We are interested in learning a model that computes P(y|x), i.e., the conditional probability of a sequence y given some input x. We will consider factorized log-linear models that take the form:

P(y|x) = exp θ(x, y) / Σ_{y'} exp θ(x, y')    (1)

The scoring function θ(x, y) is modeled as a sum of unary and binary bilinear potentials and is defined as:

θ(x, y) = Σ_{t=1}^{T} v_{y_t}^T W_t φ(x, t) + Σ_{t=1}^{T-1} v_{y_t}^T Z_t v_{y_{t+1}}    (2)

where v_{y_t} ∈ IR^{n_t} is an n_t-dimensional feature representation of the label argument y_t ∈ S_t, and φ(x, t) ∈ IR^{d_t} is a d_t-dimensional feature representation of the t-th input factor of x.

The first set of terms in the above equation are usually referred to as unary potentials and measure the compatibility between a single state at t and the feature representation of input factor t. The second set of terms are the binary potentials and measure the compatibility between pairs of states at adjacent factors. The scoring function θ(x, y) is fully parameterized by the unary parameter matrices W_t ∈ IR^{n_t × d_t} and the binary parameter matrices Z_t ∈ IR^{n_t × n_t}.
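The NumPy sketch below spells out Eqs. (1) and (2) for a short chain, computing the partition function by brute force. It is only meant to clarify the shapes involved; the variable names are our own assumptions (V[t] stacks the label embeddings v_y for position t, W[t] and Z[t] are the unary and binary parameter matrices, and Phi[t] stands for φ(x, t)).

```python
# Illustrative sketch of Eqs. (1)-(2); variable names and data layout are assumptions.
import itertools
import numpy as np

def score(y, V, W, Z, Phi):
    """theta(x, y) = sum_t v_{y_t}^T W_t phi(x, t) + sum_t v_{y_t}^T Z_t v_{y_{t+1}}."""
    T = len(y)
    unary = sum(V[t][y[t]] @ W[t] @ Phi[t] for t in range(T))
    binary = sum(V[t][y[t]] @ Z[t] @ V[t + 1][y[t + 1]] for t in range(T - 1))
    return unary + binary

def conditional_probability(y, V, W, Z, Phi, label_set_sizes):
    """P(y | x) from Eq. (1), normalizing over all label sequences by brute force."""
    all_scores = np.array([score(yp, V, W, Z, Phi)
                           for yp in itertools.product(*[range(n) for n in label_set_sizes])])
    log_partition = np.logaddexp.reduce(all_scores)
    return float(np.exp(score(y, V, W, Z, Phi) - log_partition))
```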

The main idea is to define a feature space where semantically similar labels will be close. Like in the multilabel scenario (Weston et al., 2010; Akata et al., 2013), having full feature representations for arguments will allow us to share information across different classes and generalize better. With a good output feature representation, our model should be able to make sensible predictions about pairs of arguments that it has not observed at training. This is easy to see: consider a case where we have a pair of arguments represented with feature vectors a1 and a2, and suppose that we have not observed the factor (a1, a2) in our training data but we have observed the factor (b1, b2). Then, if b1 is close in the feature space to argument a1 and b2 is close to a2, our model will predict that a1 and a2 are compatible. That is, it will assign probability to the factor (a1, a2), which seems a natural generalization from the observed training data.

Now we show that the ranks of W and Z have useful interpretations. Let W = UΣV be the singular value decomposition of W. We can then write the unary potentials v_y^T W φ(x, t) as v_y^T U Σ [V φ(x, t)]. Thus we can regard the bilinear form as a function computing a weighted inner product between a real embedding v_y^T U representing state y and a real embedding [V φ(x, t)] representing input factor t. The rank of W gives us the intrinsic dimensionality of the embedding. Thus, if we want to induce shared low-dimensional embeddings across different states, it seems reasonable to impose a low-rank penalty on W. Similarly, let Z = UΣV now be the singular value decomposition of Z. We can write the binary potentials v_y^T Z v_{y'} as v_y^T U Σ V v_{y'}, and thus the binary potentials compute a weighted inner product between a real embedding of state y and a real embedding of state y'. As before, the rank of Z gives us the intrinsic dimensionality of the embedding and, to induce a low-dimensional embedding for the binary potentials, we will impose a low-rank penalty on Z.

After having described the type of scoring functions we are interested in, we now turn our attention to the learning problem. That is, given a training set D = {⟨x, y⟩} of pairs of inputs x and output sequences y, we need to learn the parameters {W} and {Z}. For this purpose we will do standard maximum-likelihood estimation and find the parameters that minimize the conditional negative log-likelihood of the data in D. That is, we will find the {W} and {Z} that minimize the following loss function:

L(D, {W}, {Z}) = − Σ_{⟨x,y⟩∈D} log P(y|x; {W}, {Z})

Recall that we are interested in learning low-rank unary and binary potentials. To achieve this we take a common approach, which is to use the nuclear norms |W|_* and |Z|_* as a convex approximation of the rank function, so that the final optimization problem becomes:

min_{{W},{Z}}  L(D, {W}, {Z}) + Σ_t ( α |W_t|_* + β |Z_t|_* )    (3)

where L(D, {W}, {Z}) = Σ_{d∈D} loss(d, {W}, {Z}) is the negative log-likelihood function and α and β are two constants that control the trade-off between minimizing the loss and the implicit dimensionality of the embeddings. We use a simple optimization scheme known as Forward-Backward Splitting, or FOBOS (Duchi and Singer, 2009).
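As a rough illustration of how the trace-norm terms in Eq. (3) can be handled, the sketch below shows one FOBOS update for a single parameter matrix: a gradient step on the negative log-likelihood followed by the proximal operator of the nuclear norm, which soft-thresholds the singular values. The function names are ours and the gradient is assumed to be computed elsewhere; this is not the authors' implementation.

```python
# Minimal sketch of a FOBOS step with a nuclear-norm penalty (illustrative only).
import numpy as np

def nuclear_prox(W, tau):
    """prox_{tau * ||.||_*}(W): soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fobos_step(W, grad_W, step_size, alpha):
    """Forward gradient step on the loss, then the backward (proximal) step on alpha*||W||_*."""
    return nuclear_prox(W - step_size * grad_W, step_size * alpha)
```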

For our task we will consider a simple factorized scoring function θ(x, ⟨p, a, l⟩) that has one factor associated with the locative-predicate pair and one factor associated with the predicate-actor pair. Since this corresponds to a chain structure, argmax_{⟨p,a,l⟩} θ(x, ⟨p, a, l⟩) can be efficiently computed using Viterbi decoding in time O(N²), where N = max(|P|, |A|, |L|). Similarly, we can also find the top k predictions in O(kN²). Thus, for this application the scoring function of the bilinear CRF will take the form:

θ(x, ⟨p, a, l⟩) = λ_loc(x)^T W_loc φ_loc(l) + λ_pre(x)^T W_pre φ_pre(p) + λ_act(x)^T W_act φ_act(a)
                + φ_loc(l)^T W_pre^loc φ_pre(p) + φ_pre(p)^T W_act^pre φ_act(a)    (4)

where λ_loc(x), λ_pre(x) and λ_act(x) denote the image feature representations used for the locative, predicate and actor factors. The unary potentials measure the compatibility between an image and a semantic argument, the first binary potential measures the compatibility between a locative and a predicate, and the second binary potential measures the compatibility between a predicate and an actor. The scoring function is fully parameterized by the unary parameter matrices W_loc ∈ IR^{d_l × n_l}, W_pre ∈ IR^{d_p × n_p} and W_act ∈ IR^{d_a × n_a}, and the binary parameter matrices W_pre^loc ∈ IR^{n_l × n_p} and W_act^pre ∈ IR^{n_p × n_a}. Here n_l, n_p and n_a are the dimensionalities of the locative, predicate and actor feature representations, respectively, and d_l, d_p and d_a are the dimensionalities of the image representations. Notice that if we let the argument representation φ_t(r) ∈ IR^{|S_t|} be an indicator vector for label argument t, we obtain the usual parametrization of a standard factorized linear model, while having dense feature representations for arguments instead of indicator vectors will allow us to share information across different classes.
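The following sketch (our own illustrative code; array names are assumptions) makes the O(N²) decoding concrete for the locative-predicate-actor chain, assuming the unary and binary scores of Eq. (4) have been precomputed for every argument value.

```python
# Illustrative O(N^2) decoding for the <locative, predicate, actor> chain of Eq. (4).
import numpy as np

def decode_best_tuple(u_loc, u_pre, u_act, b_loc_pre, b_pre_act):
    """u_loc[l], u_pre[p], u_act[a]: unary scores; b_loc_pre[l, p], b_pre_act[p, a]: binary scores.

    Returns the indices (l, p, a) of the best tuple and its score. Because the model is a chain,
    we can pick, for every predicate, its best locative and best actor independently.
    """
    loc_part = u_loc[:, None] + b_loc_pre          # shape (|L|, |P|)
    act_part = b_pre_act + u_act[None, :]          # shape (|P|, |A|)
    best_l = loc_part.argmax(axis=0)               # best locative for each predicate
    best_a = act_part.argmax(axis=1)               # best actor for each predicate
    totals = u_pre + loc_part.max(axis=0) + act_part.max(axis=1)
    p = int(totals.argmax())
    return int(best_l[p]), p, int(best_a[p]), float(totals[p])
```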
4 Representing Semantic Arguments

We will conduct experiments with two different feature representations: 1) a fully unsupervised Skip-gram based Continuous Word Representation (SCWR) (Mikolov et al., 2013), and 2) a feature representation computed from the ⟨caption, semantic tuples⟩ pairs, which we call the Semantic Equivalence Representation (SER).

We decided to exploit the dataset of captions paired with semantic tuples to induce a useful feature representation for arguments. The idea is quite simple: we wish to leverage the fact that any pair of semantic tuples associated with the same image is likely to describe the same event. Thus, they are in essence different ways of lexicalizing the same underlying concept. Let us look at a concrete example. Imagine that we have an image annotated with the tuples ⟨play, dog, water⟩ and ⟨play, dog, river⟩. Since both tuples describe the same image, it is quite likely that both "river" and "water" refer to the same real-world entity, i.e., "river" and "water" are 'semantically equivalent' for this image. Using this idea we can build a representation φ_loc(i) ∈ IR^{|L|} where the j-th dimension corresponds to the number of times argument j has been semantically equivalent to argument i. More precisely, we compute the probability that argument j can be exchanged with argument i as [i, j]_sr / Σ_{j'} [i, j']_sr, where [i, j]_sr is the number of times that i and j have appeared as annotations of the same image with the same other arguments. For example, for the actor arguments, [i, j]_sr represents the number of times that actor i and actor j have appeared with the same locative and predicate as descriptions of the same image.
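The sketch below shows one way to accumulate the [i, j]_sr counts described above for the locative arguments and normalize them into exchange probabilities. It is a hypothetical helper written for illustration; in practice φ_loc(i) would be the normalized count row for argument i over the full locative vocabulary.

```python
# Illustrative computation of the Semantic Equivalence Representation (SER) counts.
from collections import Counter
from itertools import permutations

def ser_locatives(images):
    """images: iterable of lists of (predicate, actor, locative) tuples, one list per image.

    counts[i][j] counts how often locative j was 'exchangeable' with locative i, i.e. the two
    appeared in tuples of the same image that share predicate and actor. The returned rows are
    normalized so that entry (i, j) approximates the exchange probability of j given i.
    """
    counts = {}
    for tuples in images:
        for (p1, a1, l1), (p2, a2, l2) in permutations(tuples, 2):
            if p1 == p2 and a1 == a2 and l1 != l2:
                counts.setdefault(l1, Counter())[l2] += 1
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}
```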

555 3 3 3 3 3 3 3

33 3 3 TRAINING(SENTENCES( A3guy3is3doing3a3skateboard3trick3in3front3of3a3crowd3 A3man3is3skateboarding3in3front3of3a3group3of3people.3 A3skateboarder3performs3a3trick3in3front3of3a3large3crowd3.3333 A3skateboarder3leaping3from3a3pool3in3front3of3a3crowd.333 Skateboarder3does3tricks3in3front3of3crowd3while3photographer33 watches3333 3 3 3

Incorrect(loca+ve(&(ac+on( Incorrect(actor( Incorrect(actor( Incorrect(actor(&(ac+on( Incorrect(actor(&(loca+ve(

3

3 33 3 3

Figure 2: Samples of predicted tuples. Top-left: examples of visually correct predictions. Bottom: typical errors in one or several arguments. Top-right: a sample image and its top predicted tuples. The tuples in blue were observed neither in the SP-Dataset nor in the automatically enlarged dataset. Note that all of them are descriptive of what is occurring in the scene.

5 Related Work

In recent years, several works have tackled the problem of generating rich textual descriptions of images. One of the pioneers is Kulkarni et al. (2011), where a CRF model combines the output of several vision systems to produce the input for a language generation method. In Farhadi et al. (2010), the authors find the similarity between sentences and images in a "meaning" space, represented by semantic tuples which are very similar to our triplets. Other works focus on a simplified problem: ranking human-generated captions for images. Hodosh et al. (2013) propose to use Kernel Canonical Correlation Analysis to project images and their captions into a joint representation space, in which images and captions can be related and ranked to perform illustration and annotation tasks. Socher et al. (2014) also address the ranking of images given a sentence and vice versa, using a common subspace learned via Recursive Neural Networks. Other recent works exploit deep networks to address the problem (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015). Using label embeddings combined with bilinear forms has been previously proposed in the context of multiclass and multilabel image classification (Weston et al., 2010; Akata et al., 2013).

6 Experiments

For image features we use the 4,096-dimensional second-to-last layer of the BVLC implementation of the 'AlexNet' ImageNet model, a Convolutional Neural Network (CNN) described in Jia et al. (2014). To test our method we used the 100 test images that were annotated with ground-truth semantic tuples. To measure performance we first predict the top tuple for each image and then measure accuracy for each argument type (i.e., the number of correct predictions among the top-1 triplets). The regularization parameters of each model were set using the validation set.

We compare the performance of the following models: 1) Baseline Separate Predictors (S-Pred): a baseline made of independent predictors for each argument type. More specifically, we train one-vs-all SVMs (we also tried multi-class SVMs, but they did not improve performance) to independently predict locatives, predicates and actors. For each argument type and candidate label we have a score computed by the corresponding SVM; given an image, we generate the top tuples that maximize the sum of scores across argument types. 2) Baseline KCCA: this model implements the Kernel Canonical Correlation Analysis approach of Hodosh et al. (2013). We first note that this approach can rank a list of candidate captions but cannot directly generate tuples. To generate tuples for test images, we first find the caption in the training set that has the highest ranking score for that image and then extract the corresponding semantic tuples from that caption. 3) Indicator Features (IND): a standard factorized log-linear model that does not use any feature representation for the outputs. 4) A model that uses the skip-gram continuous word representation of the outputs (SCWR). 5) A model that uses the semantic equivalence representation of the outputs (SER). 6) A combined model that makes predictions using the best feature representation for each argument type (COMBO).

        S-Pred   KCCA   IND   SCWR   SER   COMBO
LOC       15      23     32    28     33     -
PRED      11      20     24    33     25     -
ACT       30      25     52    51     50     -
MEAN     18.6    22.6    36   37.3    36    39.3

Table 1: Comparison of output feature representations.

Table 1 shows the results. We observe that our proposed method performs significantly better than the baselines. The second observation is that the best-performing output feature representation differs across argument types: for the locatives the best representation is SER, for the predicates it is SCWR, and for the actors using an output feature representation actually hurts performance. The biggest improvement is on the predicate arguments, where the skip-gram word representation improves average precision by almost 10% over the baseline. Overall, the model that uses the best representation for each argument type performs better than the indicator baseline.

Regarding the rank of the parameter matrices, we observed that the learned models work well even if we drop the rank to 10% of its maximum value. This shows that the learned models are efficient in the sense that they can work well with low-dimensional projections of the features.

7 Conclusion

In this paper we have presented a framework for exploiting input and output embeddings in the context of structured prediction. We have applied this framework to the problem of predicting compositional semantic descriptions of images. Our results show the advantages of using output embeddings and of inducing low-dimensional embeddings for handling large state spaces in structured prediction problems. The framework we propose is general enough to consider additional sources of information.

8 Acknowledgments

This work was partly funded by the Spanish MINECO project RobInstruct TIN2014-58178-R and by the ERA-net CHISTERA projects VISEN PCIN-2013-047 and I-DRESS PCIN-2015-147. The authors are grateful to the Nvidia donation program for its support with GPU cards.

References

[Akata et al.2013] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2013. Label-embedding for attribute-based classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Duchi and Singer2009] John Duchi and Yoram Singer. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 10:2899–2934.

[Farhadi et al.2010] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proc. European Conference on Computer Vision (ECCV).

[Hodosh et al.2013] Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47:853–899.

[Jia et al.2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

[Karpathy and Fei-Fei2015] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Kulkarni et al.2011] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. International Conference on Learning Representations (ICLR).

[Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (TACL), 2:207–218.

[Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Weston et al.2010] Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. In Proc. European Conference on Computer Vision (ECCV).
