Structured Prediction with Output Embeddings for Semantic Image Annotation

Ariadna Quattoni¹, Arnau Ramisa², Pranava Swaroop Madhyastha³, Edgar Simo-Serra⁴, Francesc Moreno-Noguer²
¹ Xerox Research Center Europe, [email protected]
² Institut de Robòtica i Informàtica Industrial (CSIC-UPC), {aramisa,fmoreno}@iri.upc.edu
³ TALP Research Center, Universitat Politècnica de Catalunya, [email protected]
⁴ Waseda University, [email protected]

Abstract

We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge. We propose handling this sparsity by incorporating feature representations of both the inputs (images) and outputs (argument classes) into a factorized log-linear model.

1 Introduction

Many important problems can be framed as structured prediction tasks, where the goal is to learn functions that map inputs to structured outputs such as sequences, trees or general graphs. A wide range of applications involve learning over large state spaces; e.g., if the output is a labeled graph, each node of the graph may take values over a potentially large set of labels. Data sparsity then becomes a challenge, as there will be many classes with very few training examples.

Within this context, we are interested in the task of predicting semantic tuples for images. That is, given an input image, we seek to predict what are the events or actions (referred to here as predicates), who and what are the participants (referred to here as actors) of the actions, and where is the action taking place (referred to here as locatives). For example, an image might be annotated with the semantic tuples ⟨run, dog, park⟩ and ⟨play, dog, grass⟩. We call each field of a tuple an argument.

To handle the data sparsity challenge imposed by the large state space, we will leverage an approach that has proven to be useful in multiclass and multilabel prediction tasks (Weston et al., 2010; Akata et al., 2013). The idea is to represent a value for an argument a using a feature vector representation φ ∈ IR^n. We will integrate this argument representation into the structured prediction model.

In summary, our main contribution is to propose an approach that incorporates feature representations of the outputs into a structured prediction model, and to apply it to the problem of annotating images with semantic tuples. We present an experimental study using different output feature representations and analyze how they affect performance for different argument types.

2 Semantic Tuple Image Annotation

Task: We will address the task of predicting semantic tuples for images. Following Farhadi et al. (2010), we will focus on a simple semantic representation that considers three basic arguments: predicate, actors and locatives. For example, in the tuple ⟨play, dog, grass⟩, "play" is the predicate, "dog" is the actor and "grass" is the locative.

Given this representation, we can formally define our problem as that of learning a function θ : X × P × A × L → IR that scores the compatibility between images and semantic tuples. Here X is the space of images; P, A and L are discrete sets of predicate, actor and locative arguments, respectively; and ⟨p, a, l⟩ is a specific tuple instance. The overall learning process is illustrated in Fig. 1.
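To make the formal definition concrete, here is a minimal sketch (our own illustrative Python with hypothetical names, not the authors' code) that treats θ as a black-box scoring function and recovers the highest-scoring tuple by exhaustive search. The factorized model of Section 3 is what makes this search tractable when P, A and L are large.

```python
# Illustrative sketch of the task interface; names are assumptions, not from the paper.
from itertools import product
from typing import Callable, NamedTuple, Sequence

class SemanticTuple(NamedTuple):
    predicate: str
    actor: str
    locative: str

def best_tuple(x, predicates: Sequence[str], actors: Sequence[str],
               locatives: Sequence[str],
               theta: Callable[..., float]) -> SemanticTuple:
    """Score every <p, a, l> with theta(x, p, a, l) and return the best one.

    This is the brute-force O(|P||A||L|) reading of the definition; the chain-structured
    model of Section 3 allows much cheaper decoding.
    """
    return max((SemanticTuple(p, a, l)
                for p, a, l in product(predicates, actors, locatives)),
               key=lambda t: theta(x, t.predicate, t.actor, t.locative))
```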



[Figure 1 graphic: training resources and pipeline. ImageNet images annotated with keywords (cheap, readily available) train the convolutional network that produces the image features ⟨φA(x), φP(x), φL(x)⟩; Flickr8K images come with descriptive sentences (relatively cheap); a small, expert-annotated subset with both sentences and semantic tuples trains the semantic tuple extractor; the embedded CRF, which implicitly induces embeddings of image features and arguments, maps images to semantic tuples.]

Figure 1: Overview of our approach. First, images x ∈ A are represented using image features φs(x), and semantic tuples are obtained by applying our semantic tuple extractor (learned from the subset C) to their corresponding captions. The resulting enlarged training set is used to train our embedded CRF model that maps images to semantic tuples.

Dataset: For our experiments we used a subset of the Flickr8k dataset, proposed in Hodosh et al. (2013). This dataset (subset B in Fig. 1) consists of 8,000 images from Flickr of people and animals (mostly dogs) performing some action, with five crowd-sourced descriptive captions for each one. We first manually annotated 1,544 captions, corresponding to 311 images (approximately one third of the development set; subset C in Fig. 1), producing more than 2,000 semantic tuples of predicate, actor and locative. For the experiments we partitioned the images and annotations into training, validation and test sets of 150, 50 and 100 images, respectively.

Data augmentation: To enlarge the manually annotated dataset, we trained a model able to predict semantic tuples from captions using standard shallow and deep linguistic features (e.g., POS tags, dependency parsing, semantic role labeling). We extract the predicates by looking at the words tagged as verbs by the POS tagger. Then, the extraction of arguments for each predicate is resolved as a classification problem.

More specifically, for each detected predicate in a sentence we regard each noun as a positive or negative training example of a given relation, depending on whether the candidate noun is or is not an argument of the predicate. We use these examples to train an SVM classifier that predicts whether a candidate noun is an argument of a given predicate, based on several linguistic features computed over the syntactic path of the dependency tree that connects them. We run the learned tuple predictor model on all the captions of the Flickr8k dataset to obtain a larger dataset of 8,000 images paired with semantic tuples.

3 Bilinear Models with Output Features

In this section we explain how we incorporate output feature representations into a factorized linear model. For simplicity, we will consider factorized sequence models over sequences of fixed length; however, all the ideas presented here can easily be generalized to other structured prediction settings.

Let y = [y_1, ..., y_T] be a set of labels and S = [S_1, ..., S_T] be the set of possible label values, where y_i ∈ S_i. We are interested in learning a model that computes P(y|x), i.e., the conditional probability of a sequence y given some input x. We will consider factorized log-linear models that take the form:

P(y|x) = exp θ(x, y) / Σ_{y'} exp θ(x, y')    (1)

The scoring function θ(x, y) is modeled as a sum of unary and binary bilinear potentials and is defined as:

θ(x, y) = Σ_{t=1}^{T} v_{y_t}^T W_t φ(x, t) + Σ_{t=1}^{T-1} v_{y_t}^T Z_t v_{y_{t+1}}    (2)

where v_{y_t} ∈ IR^{n_t} is an n_t-dimensional feature representation of the label argument y_t ∈ S_t, and φ(x, t) ∈ IR^{d_t} is a d_t-dimensional feature representation of the t-th input factor of x.

The first set of terms in the above equation are usually referred to as unary potentials and measure the compatibility between a single state at t and the feature representation of input factor t. The second set of terms are the binary potentials and measure the compatibility between pairs of states at adjacent factors. The scoring function θ(x, y) is fully parameterized by the unary parameter matrices W_t ∈ IR^{n_t × d_t} and the binary parameter matrices Z_t ∈ IR^{n_t × n_t}.
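The NumPy sketch below spells out Eqs. (1) and (2) for a short chain, computing the partition function by brute force. It is only meant to clarify the shapes involved; the variable names are our own assumptions (V[t] stacks the label embeddings v_y for position t, W[t] and Z[t] are the unary and binary parameter matrices, and Phi[t] stands for φ(x, t)).

```python
# Illustrative sketch of Eqs. (1)-(2); variable names and data layout are assumptions.
import itertools
import numpy as np

def score(y, V, W, Z, Phi):
    """theta(x, y) = sum_t v_{y_t}^T W_t phi(x, t) + sum_t v_{y_t}^T Z_t v_{y_{t+1}}."""
    T = len(y)
    unary = sum(V[t][y[t]] @ W[t] @ Phi[t] for t in range(T))
    binary = sum(V[t][y[t]] @ Z[t] @ V[t + 1][y[t + 1]] for t in range(T - 1))
    return unary + binary

def conditional_probability(y, V, W, Z, Phi, label_set_sizes):
    """P(y | x) from Eq. (1), normalizing over all label sequences by brute force."""
    all_scores = np.array([score(yp, V, W, Z, Phi)
                           for yp in itertools.product(*[range(n) for n in label_set_sizes])])
    log_partition = np.logaddexp.reduce(all_scores)
    return float(np.exp(score(y, V, W, Z, Phi) - log_partition))
```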

The main idea is to define a feature space where semantically similar labels will be close. Like in the multilabel scenario (Weston et al., 2010; Akata et al., 2013), having full feature representations for arguments will allow us to share information across different classes and generalize better. With a good output feature representation, our model should be able to make sensible predictions about pairs of arguments that it has not observed at training. This is easy to see: consider a case where we have a pair of arguments represented with feature vectors a1 and a2, and suppose that we have not observed the factor (a1, a2) in our training data but we have observed the factor (b1, b2). Then, if b1 is close in the feature space to argument a1 and b2 is close to a2, our model will predict that a1 and a2 are compatible. That is, it will assign probability to the factor (a1, a2), which seems a natural generalization from the observed training data.

Now we show that the ranks of W and Z have useful interpretations. Let W = UΣV be the singular value decomposition of W. We can then write the unary potentials v_y^T W φ(x, t) as v_y^T U Σ [V φ(x, t)]. Thus we can regard the bilinear form as a function computing a weighted inner product between a real embedding v_y^T U representing state y and a real embedding [V φ(x, t)] representing input factor t. The rank of W gives us the intrinsic dimensionality of the embedding. Thus, if we want to induce shared low-dimensional embeddings across different states, it seems reasonable to impose a low-rank penalty on W. Similarly, let Z = UΣV now be the singular value decomposition of Z. We can write the binary potentials v_y^T Z v_{y'} as v_y^T U Σ V v_{y'}, and thus the binary potentials compute a weighted inner product between a real embedding of state y and a real embedding of state y'. As before, the rank of Z gives us the intrinsic dimensionality of the embedding and, to induce a low-dimensional embedding for the binary potentials, we will impose a low-rank penalty on Z.

After having described the type of scoring functions we are interested in, we now turn our attention to the learning problem. That is, given a training set D = {⟨x, y⟩} of pairs of inputs x and output sequences y, we need to learn the parameters {W} and {Z}. For this purpose we will do standard maximum-likelihood estimation and find the parameters that minimize the conditional negative log-likelihood of the data in D. That is, we will find the {W} and {Z} that minimize the following loss function:

L(D, {W}, {Z}) = − Σ_{⟨x,y⟩∈D} log P(y|x; {W}, {Z})

Recall that we are interested in learning low-rank unary and binary potentials. To achieve this we take a common approach, which is to use the nuclear norms |W|_* and |Z|_* as a convex approximation of the rank function, so that the final optimization problem becomes:

min_{{W},{Z}}  L(D, {W}, {Z}) + Σ_t ( α |W_t|_* + β |Z_t|_* )    (3)

where L(D, {W}, {Z}) = Σ_{d∈D} loss(d, {W}, {Z}) is the negative log-likelihood function and α and β are two constants that control the trade-off between minimizing the loss and the implicit dimensionality of the embeddings. We use a simple optimization scheme known as Forward-Backward Splitting, or FOBOS (Duchi and Singer, 2009).
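As a rough illustration of how the trace-norm terms in Eq. (3) can be handled, the sketch below shows one FOBOS update for a single parameter matrix: a gradient step on the negative log-likelihood followed by the proximal operator of the nuclear norm, which soft-thresholds the singular values. The function names are ours and the gradient is assumed to be computed elsewhere; this is not the authors' implementation.

```python
# Minimal sketch of a FOBOS step with a nuclear-norm penalty (illustrative only).
import numpy as np

def nuclear_prox(W, tau):
    """prox_{tau * ||.||_*}(W): soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fobos_step(W, grad_W, step_size, alpha):
    """Forward gradient step on the loss, then the backward (proximal) step on alpha*||W||_*."""
    return nuclear_prox(W - step_size * grad_W, step_size * alpha)
```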

For our task we will consider a simple factorized scoring function θ(x, ⟨p, a, l⟩) that has one factor associated with the locative-predicate pair and one factor associated with the predicate-actor pair. Since this corresponds to a chain structure, argmax_{⟨p,a,l⟩} θ(x, ⟨p, a, l⟩) can be efficiently computed using Viterbi decoding in time O(N²), where N = max(|P|, |A|, |L|). Similarly, we can also find the top k predictions in O(kN²). Thus, for this application the scoring function of the bilinear CRF will take the form:

θ(x, ⟨p, a, l⟩) = λ_loc(x)^T W_loc φ_loc(l) + λ_pre(x)^T W_pre φ_pre(p) + λ_act(x)^T W_act φ_act(a)
                + φ_loc(l)^T W_pre^loc φ_pre(p) + φ_pre(p)^T W_act^pre φ_act(a)    (4)

where λ_loc(x), λ_pre(x) and λ_act(x) denote the image feature representations used for the locative, predicate and actor factors. The unary potentials measure the compatibility between an image and a semantic argument, the first binary potential measures the compatibility between a locative and a predicate, and the second binary potential measures the compatibility between a predicate and an actor. The scoring function is fully parameterized by the unary parameter matrices W_loc ∈ IR^{d_l × n_l}, W_pre ∈ IR^{d_p × n_p} and W_act ∈ IR^{d_a × n_a}, and the binary parameter matrices W_pre^loc ∈ IR^{n_l × n_p} and W_act^pre ∈ IR^{n_p × n_a}. Here n_l, n_p and n_a are the dimensionalities of the locative, predicate and actor feature representations, respectively, and d_l, d_p and d_a are the dimensionalities of the image representations. Notice that if we let the argument representation φ_t(r) ∈ IR^{|S_t|} be an indicator vector for label argument t, we obtain the usual parametrization of a standard factorized linear model, while having dense feature representations for arguments instead of indicator vectors will allow us to share information across different classes.
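The following sketch (our own illustrative code; array names are assumptions) makes the O(N²) decoding concrete for the locative-predicate-actor chain, assuming the unary and binary scores of Eq. (4) have been precomputed for every argument value.

```python
# Illustrative O(N^2) decoding for the <locative, predicate, actor> chain of Eq. (4).
import numpy as np

def decode_best_tuple(u_loc, u_pre, u_act, b_loc_pre, b_pre_act):
    """u_loc[l], u_pre[p], u_act[a]: unary scores; b_loc_pre[l, p], b_pre_act[p, a]: binary scores.

    Returns the indices (l, p, a) of the best tuple and its score. Because the model is a chain,
    we can pick, for every predicate, its best locative and best actor independently.
    """
    loc_part = u_loc[:, None] + b_loc_pre          # shape (|L|, |P|)
    act_part = b_pre_act + u_act[None, :]          # shape (|P|, |A|)
    best_l = loc_part.argmax(axis=0)               # best locative for each predicate
    best_a = act_part.argmax(axis=1)               # best actor for each predicate
    totals = u_pre + loc_part.max(axis=0) + act_part.max(axis=1)
    p = int(totals.argmax())
    return int(best_l[p]), p, int(best_a[p]), float(totals[p])
```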
4 Representing Semantic Arguments

We will conduct experiments with two different feature representations: 1) a fully unsupervised Skip-gram based Continuous Word Representation (SCWR) (Mikolov et al., 2013), and 2) a feature representation computed from the ⟨caption, semantic tuples⟩ pairs, which we call the Semantic Equivalence Representation (SER).

We decided to exploit the dataset of captions paired with semantic tuples to induce a useful feature representation for arguments. The idea is quite simple: we wish to leverage the fact that any pair of semantic tuples associated with the same image is likely to describe the same event. Thus, they are in essence different ways of lexicalizing the same underlying concept. Let us look at a concrete example. Imagine that we have an image annotated with the tuples ⟨play, dog, water⟩ and ⟨play, dog, river⟩. Since both tuples describe the same image, it is quite likely that both "river" and "water" refer to the same real-world entity, i.e., "river" and "water" are 'semantically equivalent' for this image. Using this idea we can build a representation φ_loc(i) ∈ IR^{|L|} where the j-th dimension corresponds to the number of times argument j has been semantically equivalent to argument i. More precisely, we compute the probability that argument j can be exchanged with argument i as [i, j]_sr / Σ_{j'} [i, j']_sr, where [i, j]_sr is the number of times that i and j have appeared as annotations of the same image with the same other arguments. For example, for the actor arguments, [i, j]_sr represents the number of times that actor i and actor j have appeared with the same locative and predicate as descriptions of the same image.
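The sketch below shows one way to accumulate the [i, j]_sr counts described above for the locative arguments and normalize them into exchange probabilities. It is a hypothetical helper written for illustration; in practice φ_loc(i) would be the normalized count row for argument i over the full locative vocabulary.

```python
# Illustrative computation of the Semantic Equivalence Representation (SER) counts.
from collections import Counter
from itertools import permutations

def ser_locatives(images):
    """images: iterable of lists of (predicate, actor, locative) tuples, one list per image.

    counts[i][j] counts how often locative j was 'exchangeable' with locative i, i.e. the two
    appeared in tuples of the same image that share predicate and actor. The returned rows are
    normalized so that entry (i, j) approximates the exchange probability of j given i.
    """
    counts = {}
    for tuples in images:
        for (p1, a1, l1), (p2, a2, l2) in permutations(tuples, 2):
            if p1 == p2 and a1 == a2 and l1 != l2:
                counts.setdefault(l1, Counter())[l2] += 1
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}
```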

555 3 3 3 3 3 3 3

33 3 3 TRAINING(SENTENCES( A3guy3is3doing3a3skateboard3trick3in3front3of3a3crowd3 A3man3is3skateboarding3in3front3of3a3group3of3people.3 A3skateboarder3performs3a3trick3in3front3of3a3large3crowd3.3333 A3skateboarder3leaping3from3a3pool3in3front3of3a3crowd.333 Skateboarder3does3tricks3in3front3of3crowd3while3photographer33 watches3333 3 3 3

Incorrect(loca+ve(&(ac+on( Incorrect(actor( Incorrect(actor( Incorrect(actor(&(ac+on( Incorrect(actor(&(loca+ve(

3

3 33 3 3

Figure 2: Samples of predicted tuples. Top-left: examples of visually correct predictions. Bottom: typical errors in one or several arguments. Top-right: a sample image and its top predicted tuples. The tuples in blue were observed neither in the SP-Dataset nor in the automatically enlarged dataset. Note that all of them are descriptive of what is occurring in the scene.

5 Related Work

In recent years, several works have tackled the problem of generating rich textual descriptions of images. One of the pioneers is Kulkarni et al. (2011), where a CRF model combines the output of several vision systems to produce the input for a language generation method. In Farhadi et al. (2010), the authors find the similarity between sentences and images in a "meaning" space, represented by semantic tuples which are very similar to our triplets. Other works focus on a simplified problem: ranking human-generated captions for images. Hodosh et al. (2013) propose to use Kernel Canonical Correlation Analysis to project images and their captions into a joint representation space, in which images and captions can be related and ranked to perform illustration and annotation tasks. Socher et al. (2014) also address the ranking of images given a sentence and vice versa, using a common subspace learned via Recursive Neural Networks. Other recent works exploit deep networks to address the problem (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015). Using label embeddings combined with bilinear forms has been previously proposed in the context of multiclass and multilabel image classification (Weston et al., 2010; Akata et al., 2013).

6 Experiments

For image features we use the 4,096-dimensional second-to-last layer of the BVLC implementation of the 'AlexNet' ImageNet model, a Convolutional Neural Network (CNN) described in Jia et al. (2014). To test our method we used the 100 test images that were annotated with ground-truth semantic tuples. To measure performance we first predict the top tuple for each image and then measure accuracy for each argument type (i.e., the number of correct predictions among the top-1 triplets). The regularization parameters of each model were set using the validation set.

We compare the performance of the following models: 1) Baseline Separate Predictors (S-Pred): a baseline made of independent predictors for each argument type. More specifically, we train one-vs-all SVMs (we also tried multi-class SVMs, but they did not improve performance) to independently predict locatives, predicates and actors. For each argument type and candidate label we have a score computed by the corresponding SVM; given an image, we generate the top tuples that maximize the sum of scores across argument types. 2) Baseline KCCA: this model implements the Kernel Canonical Correlation Analysis approach of Hodosh et al. (2013). We first note that this approach can rank a list of candidate captions but cannot directly generate tuples. To generate tuples for test images, we first find the caption in the training set that has the highest ranking score for that image and then extract the corresponding semantic tuples from that caption. 3) Indicator Features (IND): a standard factorized log-linear model that does not use any feature representation for the outputs. 4) A model that uses the skip-gram continuous word representation of the outputs (SCWR). 5) A model that uses the semantic equivalence representation of the outputs (SER). 6) A combined model that makes predictions using the best feature representation for each argument type (COMBO).

        S-Pred   KCCA   IND   SCWR   SER   COMBO
LOC       15      23     32    28     33     -
PRED      11      20     24    33     25     -
ACT       30      25     52    51     50     -
MEAN     18.6    22.6    36   37.3    36    39.3

Table 1: Comparison of output feature representations.

Table 1 shows the results. We observe that our proposed method performs significantly better than the baselines. The second observation is that the best-performing output feature representation differs across argument types: for the locatives the best representation is SER, for the predicates it is SCWR, and for the actors using an output feature representation actually hurts performance. The biggest improvement is on the predicate arguments, where the skip-gram word representation improves average precision by almost 10% over the baseline. Overall, the model that uses the best representation for each argument type performs better than the indicator baseline.

Regarding the rank of the parameter matrices, we observed that the learned models work well even if we drop the rank to 10% of its maximum value. This shows that the learned models are efficient in the sense that they can work well with low-dimensional projections of the features.

7 Conclusion

In this paper we have presented a framework for exploiting input and output embeddings in the context of structured prediction. We have applied this framework to the problem of predicting compositional semantic descriptions of images. Our results show the advantages of using output embeddings and of inducing low-dimensional embeddings for handling large state spaces in structured prediction problems. The framework we propose is general enough to consider additional sources of information.

8 Acknowledgments

This work was partly funded by the Spanish MINECO project RobInstruct TIN2014-58178-R and by the ERA-net CHISTERA projects VISEN PCIN-2013-047 and I-DRESS PCIN-2015-147. The authors are grateful to the Nvidia donation program for its support with GPU cards.

References

[Akata et al.2013] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2013. Label-embedding for attribute-based classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Duchi and Singer2009] John Duchi and Yoram Singer. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 10:2899–2934.

[Farhadi et al.2010] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proc. European Conference on Computer Vision (ECCV).

[Hodosh et al.2013] Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47:853–899.

[Jia et al.2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

[Karpathy and Fei-Fei2015] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Kulkarni et al.2011] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. International Conference on Learning Representations (ICLR).

[Socher et al.2014] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (TACL), 2:207–218.

[Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Weston et al.2010] Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: Learning to rank with joint word-image embeddings. In Proc. European Conference on Computer Vision (ECCV).
