The Thirty-Second International Florida Artificial Intelligence Research Society Conference (FLAIRS-32)

Beyond Word Embeddings: Dense Representations for Multi-Modal Data

Luis Armona,1,2 José P. González-Brenes,2∗ Ralph Edezhath2∗ 1Department of Economics, Stanford University, 2Chegg, Inc. [email protected], {redezhath, jgonzalez}@chegg.com

∗Equal contribution.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Methods that calculate dense vector representations for text have proven to be very successful for knowledge representation. We study how to estimate dense representations for multi-modal data (e.g., text, continuous, categorical). We propose Feat2Vec as a novel model that supports supervised learning when explicit labels are available, and self-supervised learning when there are no labels. Feat2Vec calculates embeddings for data with multiple feature types, enforcing that all embeddings exist in a common space. We believe that we are the first to propose a method for learning self-supervised embeddings that leverage the structure of multiple feature types. Our experiments suggest that Feat2Vec outperforms previously published methods, and that it may be useful for avoiding the cold-start problem.

1 Introduction

Informally, a dense representation, or embedding, of a vector ~x ∈ R^n is another vector ~β ∈ R^r that has much lower dimensionality (r ≪ n) than the original representation. In general, we consider two kinds of models that produce embeddings: (i) supervised methods, like matrix factorization, calculate embeddings that are highly tuned to a prediction task. For example, in the Netflix challenge, movie identifiers are embedded to predict user ratings. On the other hand, (ii) self-supervised methods (sometimes referred to as unsupervised methods) are not tuned for the prediction task they are ultimately used for. For example, algorithms such as Word2Vec (Mikolov et al. 2013b) are self-supervised. These algorithms are typically evaluated by analogy solving, or sentiment analysis (Le and Mikolov 2014), even though their loss functions are not tuned for either of these tasks.

Throughout this paper, we refer to any data that has different types of information as multi-modal. We propose Feat2Vec as a novel method that embeds multi-modal data, such as text, images, numerical or categorical data, in a common vector space—for both supervised and self-supervised scenarios. We believe that this is the first self-supervised algorithm that is able to calculate embeddings from data with multiple feature types. Consider the non-trivial work that was required to extend a model like Word2Vec to support additional features. The authors of the seminal Doc2Vec (Le and Mikolov 2014) paper needed to design both a new neural network and a new sampling strategy to add a single feature (document ID). Feat2Vec is a general method that allows calculating embeddings for any number of features.

2 Feat2Vec

In this section we describe how Feat2Vec learns embeddings of feature groups.

2.1 Model

Feat2Vec predicts a target output ỹ from each observation ~x, which is multi-modal data that can be interpreted as a list of feature groups ~g_i:

$$\vec{x} = \langle \vec{g}_1, \vec{g}_2, \ldots, \vec{g}_n \rangle = \langle \vec{x}_{\vec{\kappa}_1}, \vec{x}_{\vec{\kappa}_2}, \ldots, \vec{x}_{\vec{\kappa}_n} \rangle \qquad (1)$$

For notational convenience, we also refer to the i-th feature group as ~x_{κ_i}. In other words, each observation is constructed from a partition of raw features specified by ~κ. Examples of how to group features include "text" from individual words, or an "image" from individual pixels. We define an entity to be a particular value or realization of a feature group.

We define our model for a target y as follows:

$$\tilde{y}(\vec{x}, \vec{\phi}, \vec{\theta}) = \omega\Bigg( \sum_{i=1}^{n} \sum_{j=i}^{n} \underbrace{\phi_i(\vec{x}_{\vec{\kappa}_i}; \vec{\theta}_i)}_{\text{group } i \text{ embedding}} \cdot \underbrace{\phi_j(\vec{x}_{\vec{\kappa}_j}; \vec{\theta}_j)}_{\text{group } j \text{ embedding}} \Bigg) \qquad (2)$$

The number of dimensions d_i of each feature group may vary, but all of them are embedded to the same space via their feature extraction function φ_i: R^{d_i} → R^r. These functions learn how to transform each input (feature group) into an r-dimensional embedding. We only allow a feature extraction function φ_i to act on the features of ~g_i; feature groups interact with each other only via the output of φ_i. We denote the parameters of the extraction function φ_i as ~θ_i. ω is an activation (link) function. Intuitively, the dot product (·) returns a scalar that measures the (dis)similarity between the embeddings of the feature groups.

A simple implementation of φ_i is a linear fully-connected layer, where the output of the r-th entry is:

$$\big(\phi_i(\vec{x}_i; \vec{\theta}_i)\big)_r = \sum_{a=1}^{d_i} \theta_{i,r,a} \, x_{i,a} \qquad (3)$$

where θ_{i,r} are learned weights.
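To make the structure of Equations 2 and 3 concrete, the following sketch (our own illustration, not the authors' released code; module and variable names are hypothetical) implements the Feat2Vec score with linear extraction functions in PyTorch.

```python
import torch
import torch.nn as nn

class Feat2VecScore(nn.Module):
    """Sketch of Equation 2 with linear extraction functions (Equation 3).

    group_dims: list with the input dimensionality d_i of each feature group.
    r: dimensionality of the shared embedding space.
    """
    def __init__(self, group_dims, r):
        super().__init__()
        # One linear extraction function phi_i per feature group (Equation 3).
        self.extractors = nn.ModuleList(
            [nn.Linear(d_i, r, bias=False) for d_i in group_dims]
        )

    def forward(self, groups):
        # groups: list of tensors, one per feature group, shape (batch, d_i) each.
        embeddings = [phi(x) for phi, x in zip(self.extractors, groups)]
        score = 0.0
        n = len(embeddings)
        # Sum of pairwise dot products between group embeddings (Equation 2).
        for i in range(n):
            for j in range(i, n):
                score = score + (embeddings[i] * embeddings[j]).sum(dim=-1)
        return score  # the link function omega is applied by the caller / loss
```

In practice, each φ_i could be swapped for any module that outputs an r-dimensional vector (e.g., a text CNN), as discussed next.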

Other functional forms of φ_i we have explored in our experiments include:

• A convolutional neural network that transforms a text sequence into an r-dimensional embedding. This function is able to preserve information in the order of a text sequence, unlike other methods for transforming text to numerical values, such as bag-of-words.

• A deep, fully connected neural network that projects a scalar, such as an average rating, into an r-dimensional embedding space. This extraction function, by treating the input as numerical instead of categorical, requires inputs close in value (e.g., 4.01 and 4.02 star ratings) to have similar embeddings.

Any neural function that outputs an r-dimensional vector could be plugged into our framework as an extraction function φ_i. Although in this work we have only used datasets with natural language, categorical, and numerical features, it would be straightforward to incorporate other types of data—e.g., images or audio—with the appropriate feature extraction functions, enabling the embedding of multi-modal features in the same space.

Figure 1 compares existing factorization methods with our novel model. Figure 1a shows a Factorization Machine (Rendle 2010) that embeds each feature. Figure 1b shows Feat2Vec with two feature groups: the first group only has a single feature, which is projected to an embedding (just like matrix factorization); the second group has multiple features, which are together projected to a single embedding. Figure 1c shows an approach of using neural networks within factorization machines that has been proposed multiple times (Dziugaite and Roy 2015; Guo et al. 2017). It replaces the dot product of factors with a learned neural function, which has been shown to improve predictive accuracy. The caveat of this architecture is that it is no longer possible to interpret the embeddings as latent factors related to the target task.

[Figure 1: Network architectures for factorization models: (a) Factorization Machine, (b) Feat2Vec, (c) "Neural". The white clouds represent deep layers, for example a convolutional network for text features, while the dots (·) denote dot products.]

Our work in Feat2Vec significantly expands previously published models. For example, Principal Component Analysis (PCA) is a common method to learn embeddings of individual dimensions of data. Although structured regularization techniques exist (Jenatton, Obozinski, and Bach 2010), it is not obvious how to combine different dimensions in PCA when we are interested in treating a subset of dimensions as a single group. In traditional factorization methods, item identifiers are embedded and thus need to be observed during training (i.e., the cold-start problem). Feat2Vec can learn an embedding from an alternative characterization of an item, such as a textual description. Feat2Vec extends the Factorization Machine (Rendle 2010) in that we allow calculating embeddings for feature groups and we use feature extraction functions. StarSpace (Wu et al. 2018) introduces feature groups (called entities in their paper), but constrains them to be "bags of features," is supervised, and is limited to two feature types. In contrast, Feat2Vec allows continuous features, and does not require the user to assign one feature as a specific target. Additionally, we propose a self-supervised learning algorithm.

2.2 Supervised Learning from Data

We can learn the parameters ~θ of a Feat2Vec model by minimizing a supervised loss L_sup:

$$\arg\min_{\vec{\theta}} \sum_{\vec{x} \in X} L_{\text{sup}}\big(y(\vec{x}), \tilde{y}(\vec{x}; \vec{\phi}, \vec{\theta})\big) \qquad (4)$$

Here, X is the training dataset, y(~x) is the true target value observed in the data for an observation ~x ∈ X, and ỹ(~x) is its estimated value (using Equation 2); ~θ represents the parameters learned during training (i.e., the parameters associated with the extraction functions φ_i). For labeling and classification tasks, we optimize the logistic loss.

We can optimize Equation 4 directly using gradient descent algorithms. Multi-class classification models can be computationally costly to learn when the number of classes is very high. In these cases, a common work-around is to reformulate the learning problem into a binary classifier with implicit sampling (Dyer 2014): simply redefine the observation by appending the target label as one of the feature groups of the observation (e.g., rating), while keeping the other feature group(s) the same (e.g., review text). The new implicit target label is a binary value that indicates whether the observation exists in the training dataset.

In implicit sampling, instead of using all of the possible negative labels, one samples a fixed number (k) from the set of possible negative labels for each positively labelled record. Implicit sampling is also useful for multi-label classification with a large number of labels, or when negative examples are not easily available. We leverage implicit sampling for our self-supervised learning algorithm.
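A minimal training loop for the supervised objective in Equation 4 might look as follows; it is our own sketch, reusing the illustrative Feat2VecScore module from above, and assumes a binary target with the link function ω absorbed into the logistic loss.

```python
import torch
import torch.nn as nn

def train_supervised(model, data_loader, epochs=10, lr=1e-3):
    """Optimize Equation 4 with the logistic loss by gradient descent.

    model: a Feat2VecScore-style module returning the raw score of Equation 2.
    data_loader: yields (groups, y) pairs, where groups is the list of
        per-feature-group tensors and y is a binary target in {0, 1}.
    """
    # BCEWithLogitsLoss applies the sigmoid link omega and the logistic loss.
    loss_fn = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for groups, y in data_loader:
            optimizer.zero_grad()
            score = model(groups)          # y-tilde before the link function
            loss = loss_fn(score, y.float())
            loss.backward()
            optimizer.step()
    return model
```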

2.3 Self-Supervised Learning from Data

We now discuss how Feat2Vec can be used to learn embeddings in a self-supervised setting with no explicit target for prediction.

The training dataset for a self-supervised Feat2Vec model consists of the observed data. In natural language, these would be documents written by humans, along with document metadata. Since self-supervised Feat2Vec requires positive and negative examples during training, we supply unobserved data as negative examples. This sampling procedure is analogous to the Word2Vec algorithm, which samples a negative observation from a noise distribution Q_w2v that is proportional to the empirical frequency of a word in the training data. Unlike Word2Vec, we do not constrain feature types to be words. Instead, by leveraging the feature groups of the data defined by ~κ (Equation 2), the model can reason on more abstract entities in the data. For example, in our experiments on a movie dataset, we define a "genre" feature group, where we group non-mutually exclusive indicators for movie genres, including comedy, action, and drama films.

We start with a training dataset X of records with n feature groups. We assign a positive label to each observation in X. For each observed (positive) datapoint ~x+, we generate k observations with negative labels using the 2-step algorithm documented in Algorithm 1. We illustrate the intuition of this algorithm with an example. Consider a dataset with three feature groups—entities for (i) a passage of text (bag of words), (ii) a restaurant identifier (categorical), and (iii) a continuous sentiment value. The positive examples are the observations seen in the training set. Negative examples are generated by modifying positive examples, specifically by substituting one of the entities of the observation. For example, by randomly substituting another passage of text for the observation ⟨"Food was great.", Fancy Restaurant, 100⟩, we may obtain the negative example ⟨"Ewww, gross!!!", Fancy Restaurant, 100⟩.

More formally, our negative sampling algorithm generates an observation with a negative label from an observation in the training dataset as follows. First, it randomly selects which feature group i to resample from a noise distribution Q1. It then creates a negative observation that is identical to its positive counterpart, except that the i-th feature group value is replaced by a value sampled from a second noise distribution Q2. In our application, we use the same class of noise distribution (flattened multinomial) for both levels of sampling, but this need not be the case. We now describe the noise distributions that we use.

Algorithm 1 Implicit sampling algorithm for self-supervised Feat2Vec. It returns k sampled negative observations for each positive observation in X.
1: function NegativeSampling(X, k, α1, α2)
2:   X− ← ∅
3:   for ~x+ ∈ X do
4:     for j ∈ {1, . . . , k} do
5:       Draw which feature group i ∼ Q1(· | {params(φ_j)}, α1)
6:       Draw a random entity x_i ∼ Q2(· | X_{κ_i}, α2)
7:       Set ~x− identical to the positive sample ~x+, except substitute the i-th feature group value with x_i
8:       X− ← X− ∪ {~x−}
9:     end for
10:   end for
11:   return X−
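The following Python sketch (our illustration, with hypothetical helper names) mirrors the two-step substitution in Algorithm 1; the noise distributions q1 over feature groups and q2 over entity values are assumed to be given as probability vectors.

```python
import random
import copy

def sample_negatives(positives, k, q1, q2):
    """Two-step negative sampling in the spirit of Algorithm 1.

    positives: list of observations, each a dict {feature_group: value}.
    k: number of negatives generated per positive observation.
    q1: dict mapping feature group name -> sampling probability (Equation 5).
    q2: dict mapping feature group name -> (values, probabilities) (Equation 6).
    """
    groups = list(q1.keys())
    group_probs = [q1[g] for g in groups]
    negatives = []
    for x_pos in positives:
        for _ in range(k):
            # Step 1: choose which feature group to corrupt.
            g = random.choices(groups, weights=group_probs, k=1)[0]
            # Step 2: draw a replacement entity from the flattened empirical distribution.
            values, value_probs = q2[g]
            new_value = random.choices(values, weights=value_probs, k=1)[0]
            x_neg = copy.deepcopy(x_pos)
            x_neg[g] = new_value
            negatives.append(x_neg)
    return negatives
```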

Sampling Feature Groups. To sample a feature group, we use a multinomial distribution whose probabilities are proportional to a feature group's complexity. By complexity, we mean the number of parameters we have to learn in the feature group's extraction function φ_i. This sampling procedure places more weight on features that are associated with more parameters, and thus require more training iterations to properly learn. The sampling probabilities of each feature group are:

$$P_{Q_1}\big(i \mid \{\text{params}(\phi_j)\}_{j=1}^{|\vec{\kappa}|}, \alpha_1\big) = \frac{\text{params}(\phi_i)^{\alpha_1}}{\sum_{j=1}^{|\vec{\kappa}|} \text{params}(\phi_j)^{\alpha_1}} \qquad (5)$$

For categorical variables using a linear layer, the complexity is simply proportional to the number of categories in the feature group. However, if we have multiple intermediate fully-connected layers for some extraction functions (e.g., convolutional layers), these parameters should also be counted towards the feature group's complexity. The hyper-parameter α1 ∈ [0, 1] flattens the distribution: when α1 = 0, the feature groups are sampled uniformly, and when α1 = 1, they are sampled exactly proportional to their complexity.

Sampling Feature Group Values. To sample an entity within a feature group ~g_i, we use a similar strategy to Word2Vec, and use the empirical distribution of values:

$$P_{Q_2}\big(x_{\kappa_i} \mid X_{\kappa_i}, \alpha_2\big) = \frac{\text{count}(x_{\kappa_i})^{\alpha_2}}{\sum_{x_j \in X_{\kappa_i}} \text{count}(x_j)^{\alpha_2}} \qquad (6)$$

Here, count(x_{κ_i}) is the number of times an entity x_{κ_i} appeared in the training dataset X, and α2 ∈ [0, 1] is simply a flattening hyperparameter. An entity is sampled according to its empirical distribution—whether it is a single word, a continuous number, or a passage of text.

This method will sometimes, by chance, generate negatively labeled samples that exist in our sample of observed records. The literature offers two solutions: in Word2Vec, duplicate negative samples are simply ignored and their probability is not accounted for (Dyer 2014). Instead, we account for the probability of a randomly drawn negative label being identical to positively labeled data using Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen 2010).

The Loss Function for Self-Supervised Learning. To optimize embeddings, we need to make sure that our loss function is well behaved. NCE often requires a partition function Z_~x for each unique record ~x so that the distribution integrates to 1, which can be computationally costly and greatly increase the complexity of our model. Instead, we appeal to the work of Mnih and Teh (2012), who show that in models with many parameters, setting the partition function to 1 in advance does not change the performance of the model. The intuition is that if the underlying model has enough free parameters, it will effectively learn the probabilities itself, since systemic under- or over-prediction of probabilities will result in penalties on the loss function.

For self-supervised learning, we adjust the structural statistical model of Equation 2 to account for the probability of negative examples, as in (Dyer 2014). Written explicitly, this is:

$$\tilde{p}(y = 1 \mid \vec{x}, \vec{\phi}, \vec{\theta}) = \frac{\exp\big(\tilde{y}(\vec{x}; \vec{\phi}, \vec{\theta})\big)}{\exp\big(\tilde{y}(\vec{x}; \vec{\phi}, \vec{\theta})\big) + k \times P_Q(\vec{x} \mid \alpha_1, \alpha_2)} \;\triangleq\; \hat{s}(\vec{x}; \vec{\theta}) \qquad (7)$$

For notational convenience, we refer to this structural probability as ŝ(~x; ~θ). Here, ỹ(·) denotes the model in Equation 2 with ω set to the identity function. This can be interpreted as the "score" of an observation with respect to Feat2Vec. P_Q(~x) denotes the probability of ~x under the noise distribution Q; it accounts for the possibility of sampling an observation with a negative label from Algorithm 1 that appears identical to one in the training data (with a positive label). The probability of a negatively labeled record ~x, conditional on a positive datapoint ~x+, is simply given by:

$$P_Q(\vec{x} \mid \alpha_1, \alpha_2, X, \vec{x}^+) = P_{Q_2}\big(\vec{x}_{\kappa_i} \mid X_{\kappa_i}, \alpha_2\big) \times P_{Q_1}\big(i \mid \{\text{params}(\phi_i)\}_{i=1}^{n}, \alpha_1\big) \qquad (8)$$
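As an illustration of Equations 5-8, the helpers below (our own sketch; names are hypothetical) compute a flattened multinomial and the NCE-corrected probability ŝ of Equation 7 for a corrupted record.

```python
import numpy as np

def flattened_multinomial(weights, alpha):
    """Flatten a weight vector into sampling probabilities (Equations 5 and 6)."""
    w = np.asarray(weights, dtype=float) ** alpha
    return w / w.sum()

def nce_probability(score, group_prob, value_prob, k):
    """NCE-corrected probability of a positive label (Equation 7).

    score: y-tilde from Equation 2 with omega set to the identity.
    group_prob: P_Q1 of the corrupted feature group (Equation 5).
    value_prob: P_Q2 of the substituted entity value (Equation 6).
    k: number of negative samples drawn per positive observation.
    """
    noise_prob = group_prob * value_prob       # P_Q of the record (Equation 8)
    unnormalized = np.exp(score)
    return unnormalized / (unnormalized + k * noise_prob)
```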

We seek to optimize the parameters ~θ of the feature extraction functions ~φ:

$$\arg\min_{\vec{\theta}} \sum_{\vec{x}^+ \in X} L_{\text{self}}\big(\vec{x}^+; \vec{\theta}\big) \qquad (9)$$

For this, we define our self-supervised loss L_self as:

$$L_{\text{self}} = \sum_{\vec{x}^+ \in X} \Bigg( -\log \hat{s}(\vec{x}^+; \vec{\theta}) \;+\; \sum_{\vec{x}^- \sim Q(\cdot \mid \vec{x}^+)}^{k} -\log\big(1 - \hat{s}(\vec{x}^-; \vec{\theta})\big) \Bigg) \qquad (10)$$

Note that the observations with positive labels come directly from the training data, and the observations with negative labels are sampled using Algorithm 1. With n feature groups, the loss function of self-supervised Feat2Vec can be shown to be a weighted average of the losses from n distinct supervised Feat2Vec models,1 one for each feature group. In other words, it optimizes n multi-label classifiers, where each classifier is optimized for a different feature group. The intuition is as follows. In Algorithm 1, we first sample a target feature group i according to P_Q1, and then add the loss from a supervised Feat2Vec model for i to the total self-supervised loss. Thus, P_Q1 determines the weights the self-supervised loss function assigns to each feature group.

1 The proof can be found in supplementary material at http://web.stanford.edu/~larmona/feat2vec/supplemental_theorem.pdf
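A compact sketch of the objective in Equations 9-10 (again our own illustration, expressed in terms of the ŝ values from the helper above): for every positive record we add its negative log-probability, plus the negative log-complement for each of its k sampled negatives.

```python
import numpy as np

def self_supervised_loss(s_positive, s_negative):
    """Self-supervised NCE loss of Equation 10.

    s_positive: array of NCE-corrected probabilities s-hat (Equation 7)
        for the positive records in the training set.
    s_negative: array of shape (num_positives, k) with s-hat for the k
        negatives sampled per positive record by Algorithm 1.
    """
    # -log s(x+) for each positive, -log(1 - s(x-)) for each sampled negative.
    return -np.log(s_positive).sum() - np.log(1.0 - s_negative).sum()
```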

3 Empirical Results

For all our experiments, we define a development set and a single test set which is 10% of the dataset; a part of the development set is used for validating hyper-parameters.

3.1 Supervised Embeddings

To evaluate our supervised learning model, we compare against two baselines: (i) a method called DeepCoNN (Zheng, Noroozi, and Yu 2017), which is a deep network specifically designed for incorporating text into recommendation problems—reportedly, it is the state of the art for predicting customer ratings when textual reviews are available; and (ii) Matrix Factorization (Koren, Bell, and Volinsky 2009), a commonly used baseline.

We use our own implementation of Matrix Factorization, but we rely on DeepCoNN's published results (at the time of writing, their implementation was not available). Fortunately, DeepCoNN also compares against Matrix Factorization, which makes interpretation of our results easier. To make results comparable, the Feat2Vec experiments use the same feature extraction reported by DeepCoNN (a convolutional neural network). Thus, we can rule out that differences of performance are due to choices in feature extraction functions. Instead of tuning the hyper-parameters, we follow previously published guidelines (Zhang and Wallace 2017).

We evaluate on the Yelp dataset2, which consists of 4.7 million reviews of restaurants. For each user-item pair, DeepCoNN concatenates the text from all reviews for that item and all reviews by that user. The concatenated text is fed into a feature extraction function followed by a factorization machine. For Feat2Vec, we build 3 feature groups: item (restaurant) identifiers, user identifiers, and review text.

Table 1 summarizes our results. We use mean squared error (MSE) as the evaluation metric. Feat2Vec provides a large performance increase. Additionally, we claim that our approach is more general, because it can trivially be extended to incorporate more features. It is also more efficient: DeepCoNN relies on concatenating text, and when the average number of reviews per user is n̄_u and the average number of reviews per item is n̄_i, this results in text duplicated on average n̄_i × n̄_u times per training epoch. In contrast, for Feat2Vec each review is seen only once per epoch. Thus, Feat2Vec can be 1-2 orders of magnitude more efficient for datasets where n̄_i × n̄_u is large. Note also that Matrix Factorization cannot predict ratings for restaurants or users that have not been seen during training, whereas our model and DeepCoNN avoid the cold-start problem since they use the review text.

Table 1: Supervised Yelp rating prediction

                             MSE      Improvement over MF
Feat2Vec                     0.480    69.2%
DeepCoNN                     1.441    19.6%
MF (Matrix Factorization)    1.561    -

2 https://www.yelp.com/dataset/challenge

3.2 Self-Supervised Embeddings

The prediction of similar or related words is a commonly used method for evaluating self-supervised word embeddings (Bojanowski et al. 2017; Mikolov et al. 2013a). We generalize this task for feature types beyond text as follows. Given a dataset with n feature groups on which we have trained a self-supervised Feat2Vec model, we choose a single group i as the "target" and a group j as the predictor. Given a held-out instance ~x, we rank all possible values of the target group in X_{κ_i} by cosine similarity to the source group's embedding φ_j(~x_{κ_j}) for that instance. We then retrieve the ranking of the actual target entity in the test dataset relative to the cosine similarity scores for all other possible entities in ~κ_i. Rankings are evaluated according to their mean percentile rank (MPR): MPR = (1/N) Σ_{i=1}^{N} R_i / max_i R_i, where R_i is the rank of the entity for observation i under our evaluation procedure, and max_i R_i is the worst ranking that can be received. A score of 0 would indicate perfect performance (i.e., top rank for every test sample), so lower is better under this metric.

Under this evaluation, we compare the performance of the Feat2Vec algorithm to Word2Vec's CBOW algorithm for learning embeddings. During training, CBOW uses a "context window" of nearby words to predict a target word, so there is a natural way to apply embeddings learned from CBOW to our ranking task (aggregate the context embeddings in the left-out instance's source group to predict the target group). To create a corpus for training Word2Vec, each record in the data is treated as a separate document. Specifically, we tokenize all feature values in a record and treat them as words (e.g., a movie record flagged as belonging to the horror genre has the token genre_horror in its document). We set the Word2Vec context window wide enough so that training is invariant to the token order. We use the hyper-parameters α2 = 0.75 and k = 5 from (Mikolov et al. 2013b) and set the embedding dimension to r = 50. These hyper-parameters are used in each of our embedding models during evaluation. The novel feature group weighting parameter in Feat2Vec was set to α1 = 0.25 in order to highlight the differences with Word2Vec, since α1 = 1 corresponds to all feature groups being sampled equally. This parameter α1 was not tuned further.

We evaluate self-supervised Feat2Vec on 3 datasets:

• Movies The Internet Movie Database (IMDB) is a publicly available dataset3 of information related to films, television programs, and video games. We limit our experiments to data on its 465,136 movies. Table 2 summarizes our feature groups. For groups with multiple categories, we use a "bag of categories" approach. For example, we sum embeddings of individual actors such as Tom Cruise to produce a single "actor" feature group embedding associated with a movie.

• Education We use a dataset from a leading technology company that provides educational services. In this proprietary dataset, we have 57 million observations and 9 categorical feature types, which include textbook identifier, user identifier, and school identifier, along with other proprietary features. Each observation is an interaction a user had with a textbook.

• Yelp We use the Yelp dataset from our supervised experiments to evaluate the efficacy of self-supervised embeddings in ratings prediction. We train embeddings for the following feature groups: user ID, restaurant ID, review text, number of times a review is flagged as funny, and the 0-5 star rating of the review.

Table 2: IMDB Feature Groups

Feature Type Name      Type               # of features
Runtime (minutes)      Real-valued        1
IMDB rating (0-10)     Real-valued        1
# of rating votes      Real-valued        1
Is adult film?         Boolean            2
Movie release year     Categorical        271
Movie title            Text               165,471
Directors              Bag of categories  174,382
Genres                 Bag of categories  28
Writers                Bag of categories  244,241
Principal actors       Bag of categories  1,104,280

Results After training IMDB embeddings, we use the cast members in a held-out set of movies to predict the actual director of the film. We rank directors by the cosine similarity of their embedding to the title's cast member feature group embedding. For the educational dataset, we directly retrieve the most similar textbooks to the user embedding. Table 3 presents our evaluation results. Feat2Vec outperforms CBOW in the MPR metric. Additionally, while CBOW predicts the actual director 1.26% of the time, Feat2Vec does so 2.43% of the time, a 92% improvement in Top-1 Precision over CBOW.

Table 3: Mean percentile rank

Dataset       Feat2Vec   CBOW
IMDB          19.36%     24.15%
Educational   25.2%      29.2%
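The ranking evaluation above can be summarized with a short sketch (ours, with illustrative names): given an embedding matrix for all candidate target entities and the source-group embedding of each held-out instance, we rank candidates by cosine similarity and report the mean percentile rank.

```python
import numpy as np

def mean_percentile_rank(source_embeddings, target_embeddings, true_indices):
    """Rank candidate target entities by cosine similarity and compute MPR.

    source_embeddings: array (N, r), one source-group embedding per held-out instance.
    target_embeddings: array (M, r), embeddings of all possible target entities.
    true_indices: array (N,), index of the actual target entity for each instance.
    """
    # Normalize so that dot products equal cosine similarities.
    src = source_embeddings / np.linalg.norm(source_embeddings, axis=1, keepdims=True)
    tgt = target_embeddings / np.linalg.norm(target_embeddings, axis=1, keepdims=True)
    similarities = src @ tgt.T                      # (N, M) cosine similarity scores
    # Rank of the true entity: number of candidates scoring strictly higher.
    true_scores = similarities[np.arange(len(true_indices)), true_indices]
    ranks = (similarities > true_scores[:, None]).sum(axis=1)
    worst_rank = similarities.shape[1] - 1
    return ranks.mean() / worst_rank                # 0 is best; lower is better
```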

3 http://www.imdb.com/interfaces/
4 As before, the metric is cosine similarity.

Self-Supervised Feat2Vec Performance with Continuous Inputs We now focus on how well our estimated self-supervised Feat2Vec model performs when predicting a real-valued feature. We expect this task to highlight Feat2Vec's advantage over token-based embedding learning algorithms, such as Word2Vec. For IMDB ratings, we use a 3-layer DNN extraction function that requires embeddings of numerically similar ratings to be close, whereas Word2Vec treats two numerically different ratings as completely different entities. We evaluate the prediction of IMDB ratings by choosing the rating embedding most similar4 to the embedding of the movie's director, and compute the MSE of the predicted rating on the test dataset. Feat2Vec scores an MSE of 6.6, while Word2Vec (CBOW) scores 9.3.

We also use the Yelp dataset, predicting the rating embedding most similar to the review text embeddings produced by Feat2Vec, Word2Vec (CBOW), and Doc2Vec (DM). Doc2Vec is only used for the Yelp dataset because there is no obvious "base document" in the IMDB dataset for Doc2Vec to learn on, while in the Yelp dataset the review text is a natural candidate. For Word2Vec and Doc2Vec, the review text embedding is the average of word embeddings, analogous to the context vector used for learning in these algorithms. Figure 2 reports the confusion matrices for each model from this experiment. Word2Vec is poor in its predictive power, predicting 5 stars for 97% of the test sample. Though Feat2Vec and Doc2Vec yield comparable MSE (2.94 vs. 2.92, respectively), Feat2Vec outperforms Doc2Vec by a substantial margin in classification error rate (55% vs. 73%) and mean absolute error (1.13 vs. 1.31). In general, Doc2Vec is unable to identify low-rating reviews: only 0.4% of Doc2Vec predictions are ≤ 2 stars, despite these comprising 20% of the data. In contrast, Feat2Vec is more diverse in its predictions, and better able to identify extreme reviews.

Figure 2: Confusion matrices of self-supervised Yelp ratings predictions (rows: true rating; columns: predicted rating 1-5; values: fraction of true ratings).

True   Feat2Vec                        Doc2Vec                         Word2Vec
1      0.95 0.05 0.00 0.00 0.00        0.00 0.00 0.46 0.08 0.46        0.00 0.00 0.00 0.01 0.99
2      0.78 0.20 0.00 0.00 0.01        0.00 0.00 0.23 0.11 0.66        0.00 0.01 0.00 0.01 0.98
3      0.58 0.31 0.02 0.04 0.05        0.00 0.00 0.15 0.11 0.75        0.00 0.01 0.00 0.01 0.98
4      0.35 0.15 0.02 0.13 0.34        0.00 0.00 0.15 0.11 0.74        0.00 0.01 0.00 0.02 0.97
5      0.18 0.05 0.00 0.05 0.71        0.00 0.01 0.31 0.13 0.56        0.00 0.01 0.00 0.02 0.97

4 Conclusion

Embeddings have proven useful in a wide variety of contexts, but they are typically built from datasets with a single feature type, as in the case of Word2Vec, or tuned for a single prediction task, as in the case of the Factorization Machine. We believe Feat2Vec is an important step towards general-purpose embedding methods. It decouples feature extraction from prediction for datasets with multiple feature types, it can be self-supervised, and its embeddings are easily interpretable.

In the supervised setting, Feat2Vec outperforms an algorithm specifically designed for text—even when using the same feature extraction function. In the self-supervised setting, Feat2Vec exploits the structure of a dataset to learn embeddings in a more sensible way than existing methods. This yields performance improvements in our ranking and prediction tasks. To the best of our knowledge, self-supervised Feat2Vec is the first method to calculate continuous representations of data with multi-modal feature types and no explicit labels.

Future work could study how to reduce the amount of human knowledge our approach requires, for example by automatically grouping features into entities, or by automatically choosing a feature extraction function. These ideas can extend to the codebase that we make available5. Though further experimentation is necessary, we believe that our results are an encouraging step forward towards general-purpose embedding models.

5 The Feat2Vec model code is available here: https://github.com/CheggEng/Feat2Vec. The experiments using non-proprietary data are available here: http://web.stanford.edu/~larmona/feat2vec/.

References

Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.

Dyer, C. 2014. Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251.

Dziugaite, G. K., and Roy, D. M. 2015. Neural network matrix factorization. CoRR abs/1511.06443.

Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 1725–1731.

Gutmann, M., and Hyvärinen, A. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 297–304.

Jenatton, R.; Obozinski, G.; and Bach, F. 2010. Structured sparse principal component analysis. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 366–373.

Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer 42(8):30–37.

Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1188–1196.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.

Mnih, A., and Teh, Y. W. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the International Conference on Machine Learning.

Rendle, S. 2010. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, 995–1000. Washington, DC, USA: IEEE Computer Society.

Wu, L. Y.; Fisch, A.; Chopra, S.; Adams, K.; Bordes, A.; and Weston, J. 2018. StarSpace: Embed all the things! In Thirty-Second AAAI Conference on Artificial Intelligence.

Zhang, Y., and Wallace, B. 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 253–263. Asian Federation of Natural Language Processing.

Zheng, L.; Noroozi, V.; and Yu, P. S. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM '17, 425–434. New York, NY, USA: ACM.
