Pairwise FastText Classifier for Entity Disambiguation

Cheng Yu (a,b), Bing Chu (b), Rohit Ram (b), James Aichinger (b), Lizhen Qu (b,c), Hanna Suominen (b,c)
a Project Cleopatra, Canberra, Australia
b The Australian National University
c DATA61, Australia
[email protected]
{u5470909, u5568718, u5016706, Hanna.Suominen}@anu.edu.au
[email protected]

Abstract

For the Australasian Language Technology Association (ALTA) 2016 Shared Task, we devised the Pairwise FastText Classifier (PFC), an efficient embedding-based text classifier, and used it for entity disambiguation. Compared with a few baseline algorithms, PFC achieved a higher F1 score at 0.72 (under the team name BCJR). To generalise the model, we also created a method to bootstrap the training set deterministically, without human labelling and at no financial cost. By releasing PFC and the dataset-augmentation software to the public (all source code can be downloaded from https://github.com/projectcleopatra/PFC), we hope to invite more collaboration.

1 Introduction

The goal of the ALTA 2016 Shared Task was to disambiguate two person or organisation entities (Chisholm et al., 2016). The real-world motivation for the Task includes gathering information about potential clients, and law enforcement.

We designed the Pairwise FastText Classifier (PFC) to disambiguate the entities (Chisholm et al., 2016). The major source of inspiration for PFC came from the FastText algorithm (the original paper used the typography "fastText"), which achieved quick and accurate text classification (Joulin et al., 2016). We also devised a method to augment our training examples deterministically, and released all source code to the public.

The rest of the paper starts with PFC and a mixture model based on PFC, proceeds to present our solution for augmenting the labelled dataset deterministically, then evaluates PFC's performance against a few baseline methods, including support vector classification (SVC) with hand-crafted text features, and finally discusses ways to improve disambiguation performance using PFC.

2 Pairwise FastText Classifier (PFC)

Our Pairwise FastText Classifier is inspired by FastText. This section therefore starts with a brief description of FastText and proceeds to present PFC.

2.1 FastText

FastText maps each vocabulary item to a real-valued vector, with unknown words sharing a special vocabulary ID. A document is represented as the average of all these vectors. FastText then trains a maximum entropy multi-class classifier on the document vectors and the output labels. FastText has been shown to train quickly and achieve prediction performance comparable to a Recurrent Neural Network embedding model for text classification (Joulin et al., 2016).

2.2 PFC

PFC is similar to FastText except that PFC takes two inputs, each in the form of a list of vocabulary IDs, because disambiguation requires two URL inputs. Each input is passed through the same embedding matrix. If each entity is represented by a d-dimensional vector, we can concatenate the two vectors and represent the pair of entities by a 2d-dimensional vector. We then train a maximum entropy classifier on the concatenated vector. A diagram of the model is shown in Figure 1.

Figure 1: PFC model. W1 and W2 are trainable weights.

2.3 The PFC Mixture Model

The previous section introduced word-embedding-based PFC. To improve disambiguation performance, we built a mixture model based on various PFC sub-models: besides word-embedding-based PFC, we also trained character-embedding-based PFCs, one uni-character PFC and one bi-character PFC. In the following subsections, we first briefly explain the character-embedding-based PFCs and then present the mixture model.

2.3.1 Character-Embedding-Based PFCs

Character-embedding-based PFC models typically have fewer parameters than word-embedding-based PFC, which reduces the probability of overfitting.

The uni-character embedding maps each character in the URL and search engine snippet to a 13-dimensional vector, averages the vectors of each input document, concatenates the two document vectors, and trains a maximum entropy classifier on top of the concatenated vector. The bi-character embedding model slides a moving window of two characters over the text and maps every such character pair to a 16-dimensional vector.

Our implementation of the character-embedding-based PFC models includes only lowercase English letters and the space character. After converting all letters to lowercase, other characters are simply skipped.

2.3.2 Mixing PFC Sub-models

The mixture model has two phases. In phase one, we train each sub-model independently. In phase two, we train a simple binary classifier on the probability outputs of the individual PFCs. A diagram of the PFC mixture model is shown in Figure 2.

Figure 2: The PFC Mixture Model.

3 Augmenting More Training Examples Deterministically

Embedding models tend to have a large number of parameters. Our word-embedding matrix has over 3,700 rows, so it is natural to look for ways to augment the training set to prevent overfitting. We created a method to harvest additional training examples deterministically, without the need for human labelling, at no additional cost.

3.1 Acquiring Training Examples for the Negative Class

(In the Shared Task, if a pair of URL entities refers to different persons or organisations, the pair belongs to the negative class; if a pair refers to the same person or organisation, the pair belongs to the positive class.)

To acquire URL pairs that refer to different people, we wrote a scraping bot that visits LinkedIn and grabs hyperlinks in a section called "People that are similar to the person", where LinkedIn recommends professionals with profiles similar to the one currently being browsed. LinkedIn restricts the number of profiles that can be browsed in a given month unless the user is a Premium user, so we upgraded our LinkedIn account for scraping purposes. We used the LinkedIn URLs provided in the training samples and grabbed similar LinkedIn profiles, which yielded about 850 profiles, although some of the LinkedIn URLs were no longer up to date.

3.2 Acquiring Training Examples for the Positive Class

To acquire training examples of different social media profiles that belong to the same person, we used examples from about.me. About.me is a platform where people can create a personal page showing their professional portfolio and links to various social media sites. We wrote a scraping bot that visits about.me/discover, where the site showcases its users, clicks open each user, acquires their social media links, and randomly selects two as a training example. For example, for someone with five social media profiles, including Facebook, Twitter, LinkedIn, Pinterest, and Google+, the bot can generate C(5, 2) = 10 training examples.

4 Experimental Setup

Using the training data provided by the Organiser and the data acquired with the method described in Section 3, we evaluated the performance of our PFC and PFC Mixture models against a few baseline models.

4.1 Datasets

The Organiser prepared 200 labelled pairs of training samples and 200 unlabelled test samples (Hachey, 2016). All baseline methods and PFC methods are trained on the original 200 URL pairs. The only exception is "PFC with augmented dataset", which uses the method from the previous section to acquire 807 negative-class URL pairs and 891 positive-class URL pairs.

4.2 Pre-Processing

Text content for PFC comes from the search engine snippet file provided by the Organiser and from text scraped from the URLs in the training examples. Unknown words in the test set are represented by a special symbol.

4.3 Baselines

We chose several baseline models because there is no gold-standard baseline model for URL entity disambiguation. The baseline models are as follows.

Word-Embedding with Pre-Trained Vectors: The training corpus comes from Google News Articles (Mikolov et al., 2013). For each URL entity, we calculated the mean vector of the search result snippet text using the pre-trained vectors.

SVC with hand-selected features: We manually selected the text features; an explanation of these features is available in Appendix A.

LSTM Word-Embedding: We passed each document token sequentially through word embeddings into an LSTM layer with 50 LSTM units (Brownlee, 2016; Goodfellow et al., 2016), concatenated the two output vectors, and trained a maximum entropy classifier on top. To reduce overfitting, we added dropout layers with the dropout parameter set to 0.2 (Zaremba et al., 2014).

Neural Tensor Network: Inspired by Socher et al. (2013), we built a relationship classifier that passes a pair of documents, represented in vector form, into a tensor, following the architecture in that paper. Document vectors are calculated from pre-trained Google word-embedding vectors.

5 Results and Discussion

The experimental results from this setup are summarised in Table 1.

Method                                   F1 Public   F1 Private   F1 Total
PFC-based
  PFC with Word-Embedding                0.75        0.64         0.69
  PFC Mixture Model                      0.74        0.71         0.72
  PFC with augmented dataset             0.65        0.69         0.67
Baseline
  Neural tensor network                  0.67        0.60         0.64
  SVC using hand-selected features       0.75        0.69         0.72
  LSTM word-embedding                    0.51        0.53         0.52

Table 1: Result comparison.

Cheng Yu, Bing Chu, Rohit Ram, James Aichinger, Lizhen Qu and Hanna Suominen. 2016. Pairwise FastText Classifier for Entity Disambiguation. In Proceedings of the Australasian Language Technology Association Workshop, pages 175-179.
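As a concrete illustration of the PFC forward pass in Section 2.2 — two token-ID lists through one shared embedding matrix, per-document averaging, concatenation into a 2d-dimensional vector, then a maximum entropy (logistic) output — the following is a minimal sketch. The embedding width, weight names, and random initialisation are illustrative assumptions, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 3700, 50                     # 3700 rows as in the paper; D is an assumption
E = rng.normal(0.0, 0.1, (VOCAB, D))    # shared embedding matrix (both inputs use it)
W = rng.normal(0.0, 0.1, 2 * D)         # maximum entropy (logistic) weights
b = 0.0

def doc_vector(token_ids):
    """FastText-style document vector: the average of the token embeddings."""
    return E[token_ids].mean(axis=0)

def pfc_score(ids_a, ids_b):
    """P(same entity): concatenate the two d-dim averages, apply a logistic unit."""
    x = np.concatenate([doc_vector(ids_a), doc_vector(ids_b)])  # 2d-dimensional
    return float(1.0 / (1.0 + np.exp(-(W @ x + b))))
```

Training would fit E, W, and b jointly by minimising log-loss over the labelled URL pairs.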
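The character handling of Section 2.3.1 — lowercase English letters plus space, with everything else skipped after lowercasing, and uni-/bi-character windows over the result — might look like the sketch below; the concrete ID scheme is our assumption:

```python
import string

ALPHABET = string.ascii_lowercase + " "           # the 27 permitted characters
CHAR_ID = {c: i for i, c in enumerate(ALPHABET)}

def clean(text):
    """Lowercase, then keep only a-z and space, skipping all other characters."""
    return "".join(c for c in text.lower() if c in CHAR_ID)

def uni_char_ids(text):
    """One ID per remaining character (uni-character PFC input)."""
    return [CHAR_ID[c] for c in clean(text)]

def bi_char_ids(text):
    """Moving window of two characters; each pair gets one of 27*27 IDs."""
    t = clean(text)
    return [CHAR_ID[a] * len(ALPHABET) + CHAR_ID[b] for a, b in zip(t, t[1:])]
```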
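Phase two of the mixture model (Section 2.3.2) fits a simple binary classifier on the probability outputs of the phase-one sub-models. A minimal sketch using plain logistic regression, with made-up phase-one probabilities standing in for the word-, uni-character-, and bi-character-PFC outputs:

```python
import numpy as np

# Each row: [p_word, p_unichar, p_bichar] from the three trained PFCs (made up here).
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.3, 0.4],
              [0.8, 0.6, 0.9],
              [0.1, 0.4, 0.2]])
y = np.array([1, 0, 1, 0])           # 1 = same entity, 0 = different entities

w, b = np.zeros(3), 0.0
for _ in range(500):                 # plain gradient descent on the log-loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.5 * X.T @ grad / len(y)
    b -= 0.5 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
```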
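The pair count in Section 3.2 — five social media profiles yielding (5 choose 2) = 10 training examples — is simply the number of unordered pairs of links:

```python
import itertools

# The five profile sites named in the paper's example.
profiles = ["facebook", "twitter", "linkedin", "pinterest", "googleplus"]
pairs = list(itertools.combinations(profiles, 2))
print(len(pairs))  # 10
```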