Pairwise FastText Classifier for Entity Disambiguation

Cheng Yu (a,b), Bing Chu (b), Rohit Ram (b), James Aichinger (b), Lizhen Qu (b,c), Hanna Suominen (b,c)
(a) Project Cleopatra, Canberra, Australia
(b) The Australian National University
(c) Data61, Australia
[email protected]
{u5470909, u5568718, u5016706, Hanna.Suominen}@anu.edu.au
[email protected]

Abstract

For the Australasian Language Technology Association (ALTA) 2016 Shared Task, we devised the Pairwise FastText Classifier (PFC), an efficient embedding-based text classifier, and used it for entity disambiguation. Compared with a few baseline algorithms, PFC achieved a higher F1 score at 0.72 (under the team name BCJR). To generalise the model, we also created a method to bootstrap the training set deterministically, without human labelling and at no financial cost. By releasing PFC and the dataset augmentation software to the public¹, we hope to invite more collaboration.

1 Introduction

The goal of the ALTA 2016 Shared Task was to disambiguate two person or organisation entities (Chisholm et al., 2016). The real-world motivations for the Task include gathering information about potential clients and law enforcement.

We designed the Pairwise FastText Classifier (PFC) to disambiguate the entities (Chisholm et al., 2016). The major source of inspiration for PFC came from the FastText algorithm², which achieves quick and accurate text classification (Joulin et al., 2016). We also devised a method to augment our training examples deterministically, and released all source code to the public.

The rest of the paper starts with PFC and a mixture model based on PFC, proceeds to present our solution for augmenting the labelled dataset deterministically, and then evaluates PFC's performance against a few baseline methods, including SVC³ with hand-crafted text features. Finally, we discuss ways to improve disambiguation performance using PFC.

2 Pairwise FastText Classifier (PFC)

Our Pairwise FastText Classifier is inspired by FastText, so this section starts with a brief description of FastText and then presents PFC.

2.1 FastText

FastText maps each vocabulary item to a real-valued vector, with unknown words sharing a special vocabulary ID. A document is represented as the average of all these vectors. FastText then trains a maximum entropy multi-class classifier on the document vectors and the output labels. FastText has been shown to train quickly and to achieve prediction performance comparable to a Recurrent Neural Network embedding model for text classification (Joulin et al., 2016).
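The document representation described above can be illustrated with a short sketch; the vocabulary, embedding dimension, and tokens below are toy placeholders, not the paper's actual settings:

```python
# A minimal sketch of the FastText-style document vector (Section 2.1).
import numpy as np

vocab = {"<unk>": 0, "alice": 1, "smith": 2, "lawyer": 3}  # toy vocabulary
embedding = np.random.randn(len(vocab), 50)                # one 50-dim vector per ID

def doc_vector(tokens):
    # Unknown words share the special <unk> vocabulary ID.
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    # A document is the average of its word vectors.
    return embedding[ids].mean(axis=0)

v = doc_vector(["alice", "smith", "is", "a", "lawyer"])  # "is", "a" map to <unk>
print(v.shape)  # (50,)
```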

¹ All source code can be downloaded from https://github.com/projectcleopatra/PFC
² The original paper of FastText used the typography "fastText".
³ SVC: Support Vector Classification.

2.2 PFC

PFC is similar to FastText, except that PFC takes two inputs, each in the form of a list of vocabulary IDs, because disambiguation requires two URL inputs. Each input is passed through the same embedding matrix. If each entity is represented by a d-dimensional vector, we can concatenate the vectors and represent the two entities by a 2d-dimensional vector. We then train a maximum entropy classifier on the concatenated vector. The diagram of the model is shown in Figure 1.

Figure 1: PFC model. W1 and W2 are trainable weights.
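To make the architecture concrete, the following is a minimal Keras sketch of the model in Figure 1; the vocabulary size, embedding dimension d, and padded sequence length are illustrative assumptions rather than the paper's settings:

```python
# Minimal Keras sketch of PFC (Figure 1). Both inputs share one embedding
# matrix; each document is averaged into a d-dimensional vector; the two
# vectors are concatenated into a 2d-dimensional vector and fed to a
# maximum entropy (logistic) output layer.
from tensorflow.keras import layers, models

VOCAB_SIZE = 3700  # assumption, matching the row count quoted in Section 3
EMBED_DIM = 50     # hypothetical d
SEQ_LEN = 100      # hypothetical padded document length

doc_a = layers.Input(shape=(SEQ_LEN,), dtype="int32")
doc_b = layers.Input(shape=(SEQ_LEN,), dtype="int32")

embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM)  # shared between both inputs
avg = layers.GlobalAveragePooling1D()            # naive averaging (padding included)

pair = layers.Concatenate()([avg(embed(doc_a)), avg(embed(doc_b))])  # 2d vector
out = layers.Dense(1, activation="sigmoid")(pair)  # binary max-entropy classifier

model = models.Model([doc_a, doc_b], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```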

2.3 The PFC Mixture Model

The previous section introduced the word-embedding-based PFC. To improve disambiguation performance, we built a mixture model based on several PFC sub-models: besides the word-embedding-based PFC, we also trained character-embedding-based PFCs, namely one uni-character PFC and one bi-character PFC. In the following subsections, we first briefly explain character-embedding-based PFC, and then present the mixture model.

2.3.1 Character-Embedding-Based PFCs

Character-embedding-based PFC models typically have fewer parameters than the word-embedding-based PFC, thus reducing the probability of overfitting.

The uni-character embedding maps each character in the URL and search engine snippet into a 13-dimensional vector, takes the average over each input document, concatenates the two document vectors, and trains a maximum entropy classifier on top of the concatenated vectors. The bi-character embedding model slides a moving window of two characters over the input and maps every character bigram into a 16-dimensional vector.

Our implementation of the character-embedding-based PFC models includes only lowercase English letters and the space character. After converting all letters to lowercase, other characters are simply skipped.

2.3.2 Mixing PFC Sub-models

The mixture model has two phases. In phase one, we train each sub-model independently. In phase two, we train a simple binary classifier on the probability outputs of the individual PFCs. The diagram of the PFC mixture model is shown in Figure 2.

Figure 2: The PFC Mixture Model.
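Phase two can be sketched as follows; the sub-model probabilities here are simulated stand-ins for the outputs of the trained word-, uni-character-, and bi-character PFCs:

```python
# Sketch of phase two of the PFC mixture model (Section 2.3.2): a simple
# binary classifier trained on the probability outputs of the
# independently trained sub-models (simulated below).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated phase-one outputs: match probabilities from the word,
# uni-character, and bi-character PFCs on 200 training pairs.
p_word, p_uni, p_bi = rng.random((3, 200))
labels = rng.integers(0, 2, 200)

# Phase two: stack the probabilities and fit the mixing classifier.
X_mix = np.column_stack([p_word, p_uni, p_bi])
mixer = LogisticRegression().fit(X_mix, labels)
print(mixer.predict_proba(X_mix[:2]))
```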

3 Augmenting More Training Examples Deterministically

Embedding models tend to have a large number of parameters. Our word-embedding matrix has over 3,700 rows, so it is natural to look for ways to augment the training set to prevent overfitting. We created a method to harvest additional training examples deterministically, without the need for human labelling, so the data can be acquired at no additional cost.

3.1 Acquiring Training Examples for the Negative Class⁴

To acquire URL pairs that refer to different people, we wrote a scraping bot that visits LinkedIn and grabs hyperlinks in a section called "People that are similar to the person", where LinkedIn recommends professionals similar to the profile currently being browsed. LinkedIn restricts the number of profiles that can be browsed in a given month unless the user is a Premium user, so we upgraded our LinkedIn account for scraping purposes. We used the LinkedIn URLs provided in the training samples and grabbed similar LinkedIn profiles, which yielded about 850 profiles, with some of the LinkedIn URLs no longer up to date.

⁴ In the Shared Task, if a pair of URL entities refers to different persons or organisations, the pair belongs to the negative class; if a pair refers to the same person or organisation, it belongs to the positive class.

3.2 Acquiring Training Examples for the Positive Class

To acquire training examples of different social media profiles that belong to the same person, we used examples from about.me. About.me is a platform where people can create a personal page showing their professional portfolios and links to various social media sites. We wrote a scraping bot that visits about.me/discover, where the site showcases its users, clicks open each user's page, acquires their social media links, and randomly selects two as a training example. For example, for someone with five social media profiles, such as Facebook, Twitter, LinkedIn, Pinterest, and Google+, the bot can generate C(5, 2) = 10 training examples.
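The pair-generation step can be sketched with itertools; the profile links below are illustrative placeholders:

```python
# Sketch of positive-example generation (Section 3.2): every pair of
# social media links on one about.me profile refers to the same person,
# so C(n, 2) pairs can be harvested per profile.
from itertools import combinations

profile_links = [
    "https://facebook.com/jane.doe",
    "https://twitter.com/janedoe",
    "https://linkedin.com/in/janedoe",
    "https://pinterest.com/janedoe",
    "https://plus.google.com/+JaneDoe",
]

positive_pairs = list(combinations(profile_links, 2))
print(len(positive_pairs))  # C(5, 2) = 10 training examples
```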

4 Experimental Setup

Using the training data provided by the Organiser and the data acquired with the method described in Section 3, we evaluated the performance of our PFC and PFC Mixture models against a few baseline models.

4.1 Datasets

The Organiser prepared 200 labelled pairs of training samples and 200 unlabelled test samples (Hachey, 2016). All baseline methods and PFC methods are trained using the original 200 URL pairs. The only exception is "PFC with augmented dataset", which uses the method in the previous section to acquire 807 negative-class URL pairs and 891 positive-class URL pairs.

4.2 Pre-Processing

Text content for PFC comes from the search engine snippet file provided by the Organiser and from text scraped from the URLs given in the training examples. Unknown words in the test set are represented by a special symbol.

4.3 Baselines

We chose several baseline models because there is no gold-standard baseline for URL entity disambiguation. The baseline models are explained as follows.

Word-Embedding with Pre-Trained Vectors: The Google training corpus comes from news articles (Mikolov et al., 2013). For each URL entity, we calculated the mean vector of the search result snippet text using pre-trained vectors from Google; unknown words were ignored. We then concatenated the two vectors and trained a maximum entropy classifier on top.

SVC with Hand-Selected Text Features: Our Support Vector Classifier is built on top of hand-selected text features. For each pair of URLs, we manually selected the text features explained in Appendix A.

LSTM Word-Embedding: We passed each document's tokens sequentially, via word embeddings, into an LSTM layer with 50 LSTM units (Brownlee, 2016; Goodfellow et al., 2016), concatenated the two output vectors, and trained a maximum entropy classifier on top. To reduce overfitting, we added dropout layers with the dropout parameter set to 0.2 (Zaremba, Sutskever, & Vinyals, 2014).

Neural Tensor Network: Inspired by Socher et al. (2013), we built a relationship classifier that passes a pair of documents, represented in vector form, into a tensor, following the architecture in that paper. Document vectors are calculated from pre-trained Google word-embedding vectors.
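The first baseline can be sketched as follows; the model file path is an assumption, and gensim is one common way to load the pre-trained Google News vectors:

```python
# Sketch of the pre-trained word-embedding baseline (Section 4.3): each
# snippet is the mean of its pre-trained vectors (unknown words ignored),
# the two snippet vectors are concatenated, and a maximum entropy
# classifier is trained on top.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # assumed local file

def snippet_vector(tokens):
    vecs = [w2v[t] for t in tokens if t in w2v]  # ignore unknown words
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def pair_features(snippet_a, snippet_b):
    return np.concatenate([snippet_vector(snippet_a), snippet_vector(snippet_b)])

# Hypothetical usage, given tokenised snippet pairs and labels:
# X = np.stack([pair_features(a, b) for a, b in snippet_pairs])
# clf = LogisticRegression().fit(X, labels)
```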

5 Results and Discussion

The experimental results are summarised in Table 1.

Method                                F1 Public   F1 Private   F1 Total
PFC-based
  PFC with Word-Embedding             0.75        0.64         0.69
  PFC Mixture Model                   0.74        0.71         0.72
  PFC with augmented dataset          0.65        0.69         0.67
Baseline
  Neural tensor network               0.67        0.60         0.64
  SVC using hand-selected features    0.75        0.69         0.72
  LSTM word-embedding                 0.51        0.53         0.52

Table 1: Result comparison. F1 Total is the simple average of F1 Public (calculated from the first half of the test data) and F1 Private (calculated from the second half).

5.1 Issues with the Augmented Dataset

Adding more training data seems to hurt the F1 score for the Shared Task. However, if we allow the newly acquired training examples to be part of the validation set, the validation accuracy can reach 0.92. Due to time constraints, we were only able to acquire about 1,700 training examples, with approximately equal numbers in each class.

Whether adding more training data can improve disambiguation performance remains to be tested experimentally.

5.2 Improving PFC

The performance of PFC might improve if we use a similarity scoring function s(a, b) = aᵀMb, where M is a diagonal matrix. The binary classifier then becomes y = f(s(a, b)), while the original PFC classifier is y = f(W[a, b]). Both M and W are learnable weights.
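A few lines of numpy make the difference concrete; the dimension is hypothetical, and f is assumed to be a sigmoid for illustration:

```python
# Sketch of the similarity-based variant in Section 5.2: with a diagonal
# M, s(a, b) = aᵀMb reduces to a weighted elementwise product, giving d
# parameters instead of the 2d weights of W[a, b].
import numpy as np

d = 50                       # hypothetical embedding dimension
m = np.random.randn(d)       # diagonal of M (learnable in practice)

def similarity(a, b):
    return np.dot(a * m, b)  # aᵀ M b with M = diag(m)

a, b = np.random.randn(2, d)
y = 1.0 / (1.0 + np.exp(-similarity(a, b)))  # f as a sigmoid (assumption)
print(y)
```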
5.3 Comparing PFC with the Baseline SVC

In our experiments, the PFC mixture model achieves the best performance, comparable to SVC with hand-selected features. The uni-character model by itself tends to underfit, because the training data cannot be separated by that model alone. PFC is robust because it allows text features to be learnt automatically.

6 Conclusion

We introduced the Pairwise FastText Classifier to disambiguate URL entities. It uses an embedding-based vector representation for text, can be trained quickly, and performs better than most of the alternative baseline models in our experiments. PFC has the potential to generalise to a wide range of disambiguation tasks. To generalise the application of the model, we created a method to deterministically harvest more training examples, which does not require manual labelling. By releasing all of these to the public, we hope for continual advancement in the field of disambiguation, which could be applied to identity verification, anti-terrorism, and online general knowledge-base creation.

Appendix A

Appendix A lists the manually selected text features for the SVC baseline model.

A.1 URL Features

1. Country code difference: if one URL has "au" and the other has "uk", the value is 1; otherwise 0.
2. Edit distance between the two URLs: the Levenshtein distance between the string tokens of the two URLs (Jurafsky & Martin, 2007).

The following URL features are specific to one URL.

3. isEducation(url_a): if the first URL contains a domain such as ".ac.uk" or ".edu", the value is 1; otherwise 0.
4. isEntertainment(url_a): if the URL includes imdb, allmusic, artnet, mtv.com, or band, it returns 1; otherwise 0.
5. isProfessional(url_a): if the URL contains linkedin.com or researchgate.com, it returns 1; otherwise 0.
6. isNonProfitOrGov(url_a): if the URL contains ".org" or ".gov", it returns 1; otherwise 0.
7. isSportsStar(url_a): if the URL contains "espn", "ufc.com", or "sports", the feature is 1; otherwise 0.
8-12. Features for url_b, analogous to Features 3-7.

A.2 Title Features

13. Edit distance of the first part of the title for the two URLs: because titles differ in length, only the first part of each title is kept for calculating the Levenshtein distance. This feature is chosen because the first part of a title usually contains the first and last name of the person or the name of the company.
14. Cosine distance of the embedded matrices: the vector representation of the text is the same as in FastText, except that the embedding matrix is pre-trained by Google. Any token not trained by Google is ignored (Weston, Chopra, & Bordes, 2015).

A.3 Snippet Features

These features are made from the fields "ASnippet" and "BSnippet" of the search result file provided by the Organiser.

15. Word Mover's Distance between the nouns and named entities in "ASnippet" and "BSnippet", using pre-trained Google word-embedding vectors (Pele & Werman, 2008; Pele & Werman, 2009).
16. Word Mover's Distance between the nouns and named entities in "ASnippet" and "BSnippet", using pre-trained Stanford GloVe vectors (Pennington et al., 2014).
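A few of the boolean URL features in A.1 can be sketched directly; this is an illustrative implementation whose keyword lists simply follow the descriptions above:

```python
# Illustrative implementations of three hand-selected URL features
# from Appendix A.1 (function names mirror the feature names).
def isEducation(url):
    return int(".ac.uk" in url or ".edu" in url)

def isProfessional(url):
    return int("linkedin.com" in url or "researchgate.com" in url)

def isNonProfitOrGov(url):
    return int(".org" in url or ".gov" in url)

print(isEducation("https://www.example.edu/people/jdoe"))  # 1
print(isProfessional("https://www.linkedin.com/in/jdoe"))  # 1
```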

References

Brownlee, J. (2016, July 26). Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras. Machine Learning Mastery. http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

Chisholm, A., Hachey, B., & Molla, D. (2016). Overview of the 2016 ALTA Shared Task: Cross-KB Coreference. In Proceedings of the Australasian Language Technology Association Workshop 2016.

Chisholm, A., Radford, W., & Hachey, B. (2016). Discovering Entity Knowledge Bases on the Web. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, pages 373-420.

Hachey, B. (2016). How was the data obtained? - ALTA 2016. Kaggle. https://inclass.kaggle.com/c/alta-2016-challenge/forums/t/23480/how-was-the-data-obtained

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.

Jurafsky, D., & Martin, J. H. (2007). Speech and Language Processing, 3rd edition [Draft], Chapter 2.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Pele, O., & Werman, M. (2008). A Linear Time Histogram Metric for Improved SIFT Matching. In Computer Vision - ECCV 2008.

Pele, O., & Werman, M. (2009). Fast and Robust Earth Mover's Distances. In 2009 IEEE 12th International Conference on Computer Vision.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing, pages 1532-1543.

Socher, R., Chen, D., Manning, C., & Ng, A. (2013). Reasoning With Neural Tensor Networks for Knowledge Base Completion. In Advances in Neural Information Processing Systems.

Weston, J., Chopra, S., & Bordes, A. (2015). Memory Networks. arXiv preprint arXiv:1410.3916.

Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv preprint arXiv:1409.2329.
