Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

Meta-Learning for Query Conceptualization at Web Scale

Fred X. Han Di Niu Haolan Chen University of Alberta University of Alberta Tencent Edmonton, Canada Edmonton, Canada ShenZhen, China [email protected] [email protected] [email protected] Weidong Guo Shengli Yan Bowei Long Tencent Tencent Tencent BeiJing, China ShenZhen, China ShenZhen, China [email protected] [email protected] [email protected] ABSTRACT CCS CONCEPTS Concepts naturally constitute an abstraction for fine-grained enti- • Information systems → Query representation; Query log ties and knowledge in the open domain. They enable search engines analysis; Query suggestion. and recommendation systems to enhance user experience by discov- ering high-level abstraction of a search query and the user intent KEYWORDS behind it. In this paper, we study the problem of query conceptual- Information retrieval; query analysis; conceptualization; meta-learning ization, which is to find the most appropriate matching concepts for any given search query from a large pool of pre-defined con- ACM Reference Format: cepts. We propose a coarse-to-fine approach to first reduce the Fred X. Han, Di Niu, Haolan Chen, Weidong Guo, Shengli Yan, and Bowei Long. 2020. Meta-Learning for Query Conceptualization at Web Scale. In search space for each query through a shortlisting scheme and then Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and identify the matching concepts using pre-trained language models, (KDD ’20), August 23–27, 2020, Virtual Event, CA, USA. ACM, which are meta-tuned to our query-concept matching task. Our New York, NY, USA, 10 pages. https://doi.org/10.1145/3394486.3403357 shortlisting scheme involves using a GRU-based Relevant Words Generator (RWG) to first expand and complete the context of the 1 INTRODUCTION given query and then shortlisting the candidate concepts through a scoring mechanism based on word overlaps. To accurately identify Teaching machines to understand search queries and interpret the the most appropriate matching concepts for a query, even when user intents behind them is essential to building a more intelligent the concepts may have zero verbatim overlaps with the query, we web. To fully “understand” a search query means that a system is not meta-fine-tune a BERT pairwise text-matching model under the only capable of acquiring relevant knowledge about the terms in the Reptile meta-learning algorithm, which achieves zero-shot transfer query but is able to extrapolate beyond these terms to discover high- learning on the conceptualization problem. Our two-stage frame- level user intents, which often reveal additional hidden interests of work can be trained with data completely derived from a search the user. Query understanding is thus of profound significance to click graph, without requiring any human labelling efforts. For many downstream applications such as content recommendation, evaluation, we have constructed a large click graph based on more intent classification, and user-profiling. than 7 million instances of the click history recorded in Tencent QQ In this paper, we study the problem of search query conceptual- browser and performed the query conceptualization task based on a ization, which is to find one or multiple matching concepts fora large ontology with 159, 148 unique concepts. Results from a range given query from a large pool of pre-defined concepts. A concept of evaluation methods, including an offline evaluation procedure is a short text sequence that could be a phrase or a sentence. It on the click graph, human evaluation, online A/B testing and case associates various text entities under the isA relation. For example, studies, have demonstrated the superiority of our approach over a Toyota RAV4 isA real world entity under the concept fuel-efficient number of competitive pre-trained language models and fine-tuned SUV. The major benefit of query conceptualization is that concepts neural network baselines. create a tractable abstraction for the fine-grained knowledge in queries from the open domain. We showcase how conceptualization may help query understand- ing through an example in Fig. 1, where the task is to recommend queries after a user finishes reviewing a page of search results. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed We present the top-3 related queries for the input query Toyota for profit or commercial advantage and that copies bear this notice and the full citation RAV4 recommended by Google, Yahoo and Bing. Most of the rec- on the first page. Copyrights for components of this work owned by others than ACM ommended queries are more detailed versions of the original query. must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a While these can certainly be helpful to the user, in this paper, we are fee. Request permissions from [email protected]. interested in a different recommendation task, which is to discover KDD ’20, August 23–27, 2020, Virtual Event, CA, USA queries that have fewer or even no common words with the input © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7998-4/20/08...$15.00 query but may characterize the query on a higher conceptual level https://doi.org/10.1145/3394486.3403357 (e.g., fuel-efficient SUVs). The ability of conceptualization, although

3064 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

native to human intelligence, is what most search engines currently presents the experiment setup and results, while we discuss their lack and can benefit from. implications and conduct further analysis in Sec. 5. Finally, we We define that a concept is a match to a search query only ifthere review related work in Sec. 6 and conclude the work in Sec. 7. exists an isA relation between them, i.e., the knowledge conveyed in the query isA instance of the concept. Prior research on structured 2 FRAMEWORK knowledge base and taxonomy construction, such as DBPedia [3], We begin by formalizing the learning objective. Suppose that we YAGO [35], Probase [43], ConcepT [22], have provided abundant are given a fixed set C consisting of m unique concepts, i.e., C = sources to create the pool of concepts. {c1, c2,...cm }. For a search query q, assume that there exists a non- We propose an efficient solution to the query conceptualization q q empty set Ct ⊂ C, where each concept in Ct is a match to q. Our problem based on deep neural networks and meta-learning, which q goal is to learn a model f to predict Ct for every query with the can effectively learn generalizable knowledge from a large amount following objective, of unlabelled click logs (i.e., the click graph) and transfer it to the Õ Õ query-concept matching problem. Our approach follows a coarse-to- max log p(ci |q,C; f ), (1) f q ∈Q q fine strategy in a two-stage framework. Specifically, we introduce a ci ∈Ct reliable shortlisting scheme to reduce the search space for concepts, where Q represents the set of search queries available for training. where a GRU-based Relevant Words Generator (RWG) takes in a The dependence on C indicates that the cost of learning is pro- search query and produces the most relevant concept words, which portional to m. Since we lack a large amount of labelled training helps to reveal the its surrounding context. query-concept pairs, supervised classification is not feasible when To narrow the shortlist down to the most appropriate match- m is large. A shortlisting mechanism is thus needed to reduce the ing concepts, in the second stage, we meta-train a pairwise neural space for each query. text-matching model, where each candidate concept is matched Let us define a shortlisting model д, which takes in q and the against the input query, and the result indicates if there is indeed q complete concept set C, and then outputs a new set C ⊂ C. The an isA relation among the pair. The main challenge here is that it s purpose of д is to constrain the search space for each query from is impossible to manually label an unbiased training set due to the q C to Cs . The original objective from Eq. 1 then becomes more large number of candidate concepts at the web scale. Realizing that q tractable since |C | ≪ |C|, and it is expressed as many concepts are present in and mined from the click graph (i.e., s Õ Õ q most concepts are a part of a searched query or document title), we max log p(ci |q,Cs ; f ). (2) f q ∈Q q believe the problem of query-concept matching shares the same un- ci ∈Ct derlying distribution with other similar natural language matching To transform Eq. 2 into a text-matching problem, we define the tasks sampled from the click graph, which naturally links to the idea matching degree between a concept and a query as r, where r is a of meta-learning [9, 25]. Therefore, we derive four text-matching binary label of either 0 or 1, and the objective becomes tasks by mining the click graph and utilize Reptile [25], which is Õ Õ a type of Model-Agnostic Meta-Learning (MAML) algorithm, to max log p(rq,ci |q, ci ; f ), (3) f q ∈Q q meta-tune a BERT [7] model on these tasks. ci ∈C+,− To train and evaluate our framework, we have collected over 7 q million click logs from Tencent QQ Browser mobile app. Each log where C+,− represents the set of positive (matching) and negative contains a search query and the document title a user clicked after (non-matching) concepts for q. Fig. 2 depicts the overall training he/she issued that query. We then construct a large click graph and testing procedure. We provide a more detailed description of and derive the training data needed by each stage. For the concept each stage in the following subsections. pool, we adopt a version of the ontology presented in [22], which contains 159, 148 user-centered concepts mined from the web click 2.1 Shortlisting by Relevant Words logs. Simple ideas for shortlisting include comparing the mean word We have performed extensive comparisons with several base- embedding of a query with that of each concept or comparing the lines based on pre-trained word embeddings, contextualized word embeddings (encoded with some pre-trained language models) of representations, and conventionally fine-tuned BERT models. As a the query and each concept directly. However, pre-trained word labelled test set is costly to acquire, we have designed an evaluation or query embeddings suffer from the Lexical Chase problem [28]– procedure using the click graph, which enables us to report aver- the surrounding context of a search query cannot be accurately aged F1 scores for the matching results and indicates the closeness established by considering only the words in the query. Therefore, of a query and its matching concept(s) on the click graph. We also retrieving a shortlist of concepts according to only the information hired human judges to examine and score the results returned by in query words verbatim is not sufficient and tends to neglect truly each method under a blind policy. Finally, we performed online relevant concepts. To address this challenge, inspired by [13], we A/B tests on QQ Browser to verify the effectiveness of the proposed propose to use a Relevant Words Generator (RWG) as the very first approach in a production environment for a query recommendation step in our shortlisting stage to expand the set of terms conveyed task. by a query. The remainder of the paper is organized as follows. We present We denote a concept c as a sequence of words (tokens), i.e., { c c } { q q } the detailed methodology in Sec. 2 and describe how to mine c = w1,w2,... . Similarly, a queryq is denoted byq = w1 ,w2 ,... . the click graph to collect necessary training data in Sec. 3. Sec. 4 Let VC be a vocabulary of all the unique words that appear in the

3065 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

Google Bing Yahoo

Toyota RAV4 Canada 2019 RAV4 hybrid reviews Canada Toyota RAV4 2019 Toyota RAV4 hybrid 2019 Toyota RAV4 reviews Canada Toyota Canada Toyota RAV4 2019 Canada 2019 RAV4 pricing Canada Toyota RAV4 2014 ......

Search Query (a) Conventional Approaches

Toyota RAV4 (b) With Conceptualizaion Concepts Recommended Queries

Concept Fuel-efficient SUV Best Fuel-efficient SUV 2019 Matching Compact SUV Best Compact SUV for daily-driving Fuel-efficient Vehicle Top Fuel-efficient Cars of 2019 ......

Figure 1: A comparison of the related search queries recommended a) by popular search engines and b) with concept matching. Note that even though the results of b) have fewer word overlaps with the original query, they are also likely to be attractive to the user as they match the user intent on a conceptual level.

Post-processing of RW Relevant Words Generator Queries Relevant Words Toyota RAV4 vehicle, fuel-efficient, SUV, price, ... Concept words Probability distribution over Python tutorials programming, online, learning, ... concept words ...... Part-Of-Speech (POS) Train Feedforward + Bi-GRU Collect Softmax Queries Queries Labels

Resident Evil 7 Resident card renewal 0 Query ......

Queries Titles Labels Click Graph RAV4 fuel-ecnonmy The all-new RAV4 SUV 1 Collect ...... BERT Classifier

0/1 Binary label Titles Titles Labels Meta-fine-tune Global news today 24-hour weather report 0 ...... BERT

Queries/Titles Relevant Words Labels Input pair RAV4 fuel-ecnonmy fuel-efficiency 1 ...... (a) Training procedure

(b) Testing procedure Matched Concepts Test Query BERT pairwise - Weight-loss diet Diet for losing body fat classifier - How to lose weight ...

Relevant Relevant Words Related Concepts Words Shortlisting by - weight-loss Generator word-overlaps - Weight-loss diet - diet - How to lose weight - weight - Diet for arthritis - overweight - Diet for gout ...... Concepts

Figure 2: The training a) and testing b) procedures for our query-concept matching framework. concepts. In reality, we normally have m > |VC |, since the words in relevant words of q, our objective then becomes VC can be combined to generate a large number of concepts. In other Õ Õ words, the size of VC does not increase linearly with the number of max log p(wc |q; θ), (4) m θ q concepts . Therefore, as opposed to directly shortlisting candidate q ∈Q wc ∈R concepts from C, for a given input query, we first generate (select) the most relevant words in the vocabulary V , which is precisely C where the generated relevant words wc ∈ VC . The RWG model what the RWG model achieves. The selected relevant words can [13] includes Bi-directional Gated Recurrent Unit (GRU) [6] that then be used to fast retrieve a shortlist of candidate concepts. encodes a query word-by-word to learn a contextual representation, Given an input query q and VC , we learn a model θ such that followed by a fully-connected layer, a Dropout layer [34], and a the log probabilities of choosing the words, which are relevant to Softmax operation. The output is a probability distribution over all q, are maximized. Suppose that the set Rq contains all the target words in VC .

3066 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

ˆq Let us define Rk as the set of predicted relevant words for a query Instead of performing the second-order differentiation in Eq. 7, q, which is constructed by taking the top-k results from the output we utilize Reptile [25], a first-order meta-learning algorithm to ˆq learn ϕ. Reptile converts the outer loss function into a simple step probability distribution of the RWG model. We then define Ck as ∇ L( Dtrain) the set of all concepts where each concept has at least one word in the direction of ϕ ϕ, i after K batches of training on q ˆq Ti . Similar to conventional supervised training, Reptile iteratively that is in Rˆ . Finally, we define Cs as the output set of candidate k updates a model, where every iteration is made up of a task-learning ˆq ⊆ ˆq ˆq concepts, where Cs Ck . Cs is obtained as follows: phase followed by a meta-learning phase. In the task-learning phase, we first make a copy of the current model, and then sample anew Cˆq = {c | c ∈ Cˆq, score(c) ≥ γ }, s k (5) task and train the copied model by performing K steps of gradient- ∈ ˆq descent on this task, where K > 1. In the meta-learning phase, we where for each concept c Ck , we compute an overlap score as q update the original model by the difference between its current the percentage of unique words in c that are also in Rˆ , the set of k weights and the weights of the copied model after task-learning, predicted relevant words for query q, i.e., which is expressed by 1 Õ q score(c) = λ(w , Rˆ ), (6) ϕ ← ϕ + α(ϕ˜ − ϕ), (9) |{c}| c k w ∈{c } c where ϕ˜ is the copied model after the task-learning phase and { } ( , ˆq ) α is the meta-learning rate. Under the definition of Reptile, the where c denotes the set of unique words in c and λ wc Rk is an q indicator function that returns 1 if wc is in Rˆ and 0 otherwise. objective for query-concept matching becomes the objective for k the task-learning phase, 2.2 Meta-Learned Matching Model Õ ˜ max log p(rui ,uj |ui,uj ; ϕ), (10) ϕ˜ Without a large amount of labelled training data, we could not ui ,uj ∈U directly learn the query-concept matching model f under Eq. 3. where u , u and U denote the two input candidates and the training Fortunately, since a significant portion of concepts is originated and i j data for the current task, respectively. r is still a label of 0 or 1. extracted from click histories [22], we assume that there exist distri- ui ,uj The meta-learning objective for our problem setting is butional similarities between the task of query-concept matching and other pairwise text-matching tasks that can be derived from a max log p(ϕ|Dmeta), (11) large click graph. ϕ For this reason, we take a meta-learning approach in the second Dmeta {Dtrain Dtrain } where = 1 , 2 ,... , i.e., it encapsulates all the stage to train f . A key assumption behind optimization-based meta- tasks used for training ϕ. learning algorithms [9, 25] is that in a machine learning problem, To establish a good starting point for learning ϕ, we utilize the there exists a distribution of tasks p(T). According to the idea of pre-trained BERT model [7], which is shown to capture generaliz- Model-Agnostic Meta-Learning (MAML) [9], it is possible for a able semantic knowledge that could easily be transferred to other model ϕ to learn to adapt to p(T) as opposed to a sampled task Ti NLP tasks through fine-tuning. BERT contains 12 layers of Trans- by minimizing the following loss function: former [39] blocks, where each block has a hidden dimension of Õ train test 768. We convert a pre-trained, general-purpose BERT model into a min L(ϕ − α∇ϕ L(ϕ, Di ), Di ), (7) ϕ deep matching model by first concatenating the input candidates Ti ∼p(T) ui , uj from Eq. 10 into one sequence, separated by a [SEP] token. Dtrain Dtest Then, we append a FeedForward layer after BERT to project its where i and i are the train and test set for the i-th task, and α is the meta-learning rate. The intuition behind Eq. 7 is that output into a 2-dimensional space. We meta-train BERT under the we meta-learn a model ϕ, by learning how training ϕ on a task Ti Reptile-learning procedures defined by Eq. 10 and 11. affects its generalizability on a held-out test set for the same task. We repeat this process on several tasks sampled from p(T) to learn 3 GENERATING DATA FROM CLICK GRAPHS ϕ. A successful ϕ would have low test error on every task, i.e., ϕ All necessary training data are automatically mined from a click adapts well to the underlying distribution p(T). graph without any human labelling efforts. A click graph is a bi- The main advantage of model ϕ is that for an unseen task Tj partite graph that records the click histories of many users in a sampled from p(T), we could directly transfer ϕ to Tj and expect search engine and is easy to retrieve. Each vertex in a click graph decent performance. If a small amount of labelled data is available corresponds to either a search query or a document title. An edge for Tj , ϕ also provides the best possible starting point for fine-tuning between a query vertex and a title vertex indicates that the user a new model ϕˆ, which is known as the adaptation [9], where who issued the query clicked through the corresponding title. We believe a click graph is an excellent source of collective intelligence, ˆ ( |Df ine−tune ) ϕ = arg max log p ϕ j , ϕ . (8) where the click behavior naturally reflects the relations between ϕ various entities in the open domain. Therefore, we generate the With a large number of concepts, it becomes impossible to manually training data by following a k-hop breadth-first strategy on a click label an unbiased dataset, even for fine-tuning purposes. Therefore, graph. We define 1 hop as going from a query vertex to a neigh- we do not perform any adaptation here and use ϕ for the zero-shot boring query vertex through a title vertex, or vice versa. We then matching of query-concept pairs. denote the k-hop neighbors set for a vertex v in a click graph G

3067 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

Sv,k [ ] Table 1: Statistical information on datasets for training the by [ ] , where . indicates the type(s) of vertices to be included. . RWG model. Sv,k Sv,k For example, [Q] includes only neighboring queries while [Q,T ] includes both queries and document titles. Regardless of whether v Train Dev Test Sv,k is a query or title, [.] satisfies the following conditions: Size 7.9M 980K 880K Avg # of words in inputs 6.95 6.94 6.94 •Sv,k is non-empty. [.] Avg # of relevant words 4.94 4.94 4.93 • There exists a shortest path in G between any two vertices v,k , in S . Input vocabulary size 435 642 [.] Output vocabulary size 18, 717 • The number of edges passed by any shortest path is no more than 2k. QT2W task, we randomly choose 3 words from all possible relevant Each query-title edge in G is also weighted, where the weight words to create 3 negative pairs for every positive pair. reflects the number of click-throughs (by any user) from the query to the title. 4 EXPERIMENTATION 3.1 Generating Relevant Words Data 4.1 Datasets q Before deriving any data from the click graph, we randomly sample Let Rt denote the set of target relevant words for a query q. For each query vertex q, we first retrieve its 3-hop neighboring queries set 398, 447 queries from it as a held-out test set for query-concept q, q, S 3 . Next, we iterate through every query qˆ in S 3 and perform matching. To avoid information leakage, we then prune these [Q] [Q] queries and their corresponding links from the click graph. Finally, Part-Of-Speech (POS) tagging using the StanfordCoreNLP tagger q we extract training data by following the procedures in Sec. 3. [23]. For every word w in qˆ, we add it to R only if it satisfies the t Table 1 and 2 provide useful statistical information on the datasets. following conditions: To ensure an even contribution from every task when meta-learning (1) w ∈ VC , i.e., w is a concept word. the text-matching model, we constrain the train/dev set of each (2) w has a tag of Named Entity (NR), Noun (NN), or Verb (VV). task to have the same size by randomly pruning larger datasets. q q Once Rt is formed in the above way, we add the pair (q, Rt ) as a q 4.2 Baseline Models new sample into the RWG dataset, if Rt is non-empty. We repeat the above process for every query vertex q in the click graph to We compare our framework against the following baseline models. derive the complete dataset, from which we then split into train, MoWE. This baseline follows the conventional Mean-of-Word- dev and test sets. Embeddings setup. We utilize the Tencent AI Lab pre-trained Chi- nese [32]. We set a cosine similarity threshold 3.2 Generating Data for Meta-learning of 0.75 as the decision boundary. In other words, for every input Meta-learning requires a set of tasks that are sampled from the query, we find the most similar concepts and only keep a concept same underlying task distribution. Based on Eq. 10, we derive and if its cosine similarity with the input query is larger than or equal sample the following tasks from the click graph: to 0.75. We select this threshold because on the test set, it results in a final coverage similar to our proposed framework. • Query-to-query (Q2Q), where the goal is to classify whether ELMo-pre-train. As a contextualized representation, ELMo [26] two queries are related. For a query q from G, we create pos- incorporates the context of the sequence when generating word itive matching instances by randomly pairing q with 3 of embeddings. We use the pre-trained Chinese ELMo model with a Sq,2 its neighbors from [Q]. For the next two tasks, positive in- hidden size of 1024, which is provided by [5, 8]. The procedure to stances are created in the same fashion, though with different select the matching concepts is the same as the MoWE baseline. inputs and 2-hop neighbor sets. BERT-pre-train. We are interested in how much of the gener- • Query-to-title (Q2T), where the goal is to classify whether alizable knowledge learn by the pre-trained BERT language model a query is related to a document title. could directly transfer to the problem of query-concept matching. • Title-to-title (T2T), where we decide whether two docu- Therefore, we set up a pre-trained BERT-base model 1 in the same ment titles are related. fashion as the ELMo-pre-train baseline. • Query/title-to-word (QT2W). The first input is a sequence, RW. The Relevant Words (RW) baseline is only our proposed either a query or title, and the second input is a word. We first stage. We use the shortlisted concepts as the matching results. create positive instances by pairing an input vertex v to This baseline serves as an ablation comparison to help verify the Sv,3 all the relevant words in [Q,T ], following the same proce- effectiveness of the second stage models. dure in Sec. 3.1. Yet, the relevant words here are no longer In addition to the above baselines, we report the performance of constrained by Condition (1) from Sec. 3.1. the following two-stage frameworks. MoWE-BERT-ft We sample negative instances by randomly selecting other inputs . We fine-tune a pre-trained BERT-base model of the same type, i.e., for each positive pair (q, q+) from the Q2Q as a matching model on only the Q2Q task. We then pair it with a − task, we randomly select a q from the set of all unique queries 1We use the pre-trained Chinese BERT model from https://github.com/huggingface/ Q. The same procedure applies to the Q2T and T2T tasks. For the pytorch-transformers

3068 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

Table 2: Statistical information on datasets for meta-fine-tuning BERT.

Q2Q Q2T T2T QT2W Train Dev Train Dev Train Dev Train Dev Size 4.85M 539K 4.85M 539K 4.85M 539K 4.85M 539K Avg # of words in inputs 4.65 4.65 7.78 7.78 10.87 10.87 4.06 4.06 Vocabulary size 406K 162K 755K 245K 586K 235K 222K 63K

Table 3: Top-N recall scores on the RWG test set. Table 4: F1 scores with different hop limitsH ( ).

Coverage Top-10 Top-50 Top-100 Top-500 H = 6 H = 8 H = 10 H = 12 H = 14 ELMo 1.0 0.139 0.216 0.258 0.383 MoWE 0.702 0.657 0.627 0.605 0.589 TAL 0.957 0.411 0.523 0.567 0.668 ELMo-pre-train 0.750 0.699 0.626 0.561 0.521 RWG 0.884 0.726 0.843 0.882 0.951 BERT-pre-train 0.827 0.755 0.691 0.637 0.608 RW 0.494 0.497 0.494 0.491 0.491 MoWE-BERT-ft 0.595 0.600 0.608 0.611 0.631 MoWE shortlisting scheme, which is constructed the same way as RW-BERT-ft 0.682 0.709 0.725 0.738 0.743 the MoWE baseline except that we do not set a similarity threshold. RW-BERT-meta 0.873 0.866 0.854 0.847 0.842 RW-BERT-ft. We adopt the same fine-tuned BERT model from MoWE-BERT-ft and instead pair it with our RW first stage. Table 5: Human evaluation results on 500 randomly selected We name our proposed two-stage framework RW-BERT-meta. test instances. We refer interested reader to the supplementary information section for more details about the overall training setup. Total # of 1s assigned # of 2s assigned score 4.3 Offline Evaluation MoWE 436 81 598 4.3.1 Evaluating the RWG model. We want to showcase that our ELMo-pre-train 199 58 315 RWG model is learning the word relevance features conveyed in a BERT-pre-train 86 32 150 click graph. We also wish to demonstrate that such features cannot RW 532 62 656 be accurately captured by pre-trained representations. Therefore, MoWE-BERT-ft 285 29 343 we report the top-N recall scores on the test set of the RWG model. RW-BERT-ft 435 31 497 For our model, the top-N recall score as the percentage of truth rel- evant words that appear in the top-N predicted words. We compare RW-BERT-meta 621 117 855 our approach to the Tencent AI Lab (TAL) pre-trained embeddings and the pre-trained ELMo representations. For these baselines, the 4.4 Human Evaluation query representation is the mean of its word embeddings, and we Since the offline evaluation does not explicitly reflect which model find the top-N most similar words to it. We also report the coverage is better at predicting isA relations, we further conduct human to ensure that the recall scores are reliable. Table 3 reports the evaluations. Specifically, we randomly select 500 test instances and results of this experiment. Some coverages are not 1.0 because we take the top-3 matched concepts of every model 2. We then hire 2 reject inputs that contain Out-Of-Vocabulary (OOV) words. human judges through crowdsourcing and ask each one to evaluate 4.3.2 Offline evaluation using click graphs. In the absence of a half of the instances according to the following criteria, labelled test set, we evaluate the results of query-concept matching • For a test instance, the judge reviews the top-3 concepts using our click graph. Considering the fact that many concepts are matched by each model without given any knowledge about also popular queries, we thoroughly examine our click graph and the models and assigns a score for each concept. find that it contains 2997 concepts as query vertices. Under our • A concept receives a score of 2 if there exists an isA relation previous assumption, the hop distance between two vertices is a between it and the input query. If a concept only matches rough estimate for their relatedness. Therefore, we propose that part of the query, then the judge assigns a score of 1 for this if a concept is a good match to a query, and they are in the same concept. Otherwise, for every non-related concept, a score click graph, then they should be within a certain number of hops of 0 is assigned. from each other. We assume the truth label between a concept and • On a test instance, the maximum attainable score for a model a query is 1 if both are in our click graph, and they could reach each is 6, and the minimum score is 0. other within H hops. Note that the click graph used for evaluation We report the results of human evaluation in Table 5. In Sec. 5, here is the original graph containing all the test queries. With the We discuss the implications behind these results and conduct case derived truth labels, we could then compute the macro-averaged F1 studies to get a more intuitive visualization of the matching results. score on the query-concept matching results. The definition of this metric is introduced in the supplementary section. We experiment 2We randomly take 3 concepts if there are more than 3 top concepts with the same with several H values and report the F1 scores in Table 4. similarity/score

3069 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

4.5 Online A/B Testing We conduct online A/B testing on the Tencent QQ browser to eval- uate the usefulness of our framework in a production environment. For a user’s search query, we discover its matching concepts and present them at the end of the article which the user clicked on, i.e., we directly recommend the matched concepts to the user. Fig. 3 reports the results of the A/B test. The control group here is the MoWE baseline. We sample 50 hours of click behavior from real users, where each point in the plot represents the average num- ber of clicks on the recommended concepts in an hour. We also Figure 4: F1-score and coverage results produced with differ- report the interquartile range box-plot with the 25th to the 75th ent combinations of k and γ values. percentiles.

5.1 Hyper-parameter Sensitivity We run the offline evaluation procedure with H = 10 to observe how the performance changes under different hyper-parameter settings. We test several values of k, which controls how many top relevant words to take from the output of the RWG model, and γ , which determines the percentage threshold of word-overlaps between the candidate concepts and the predicted relevant words. We report the changes in F1 scores and coverage, i.e., the percent of test instances that we could find at least one matching concept. From Fig. 4, we first observe that a larger value of γ produces higher F1 scores and lower coverage. This makes sense because a lower γ means we allow a candidate concept to contain more words that are not from the predicted relevant words, which increases the Figure 3: Average number of clicks per hour on the recom- chance of mismatches. Another trend we notice is that decreasing mended concepts. The experiment groups is our framework k values often leads to an increase in F1 scores. However, this does and the control group is the baseline. not mean that we should prefer a smaller k because smaller k values also produce a much lower coverage. In a real-world application, we want our framework to handle as many queries as possible while maintaining decent performance metrics. We set k = 15 to strike a balance between F1 score and coverage. 5 RESULT ANALYSIS & DISCUSSION Table 3 suggests that the RWG model easily outperforms the other 5.2 Case Study two pre-trained word representations. This indicates that the word- In Table 6, we show a representative example of the top-10 relevant relatedness information derived from a click graph is different words generated by our RWG model and compare with the top- from the word similarities learned by pre-trained embeddings. For 10 most similar words found by the Tencent AI Lab (TAL) word the results of query-concept matching, we observe in Table 4 that embeddings. In general, the RWG model discovers more words our two-stage setup outperforms all baselines by a large margin related to the input query on a higher conceptual-level, especially regardless of the hop limit H. Specifically, by adding a second stage the highlighted ones. For instance, our RWG model knows that text-matching model to the RW baseline, the performance increases “RongWei rx5” is a car, and the query is asking for tutorials on how significantly. to project displays between a cell phone and a car. Therefore, it We believe that while the imaginative nature of the RWG model discovers conceptually relevant words like “car”, “inter-connect” enables us to discover additional relevant words, it could “wander- and “tutorial”, while in the baseline approach, taking the average of off” too far and match completely irrelevant concepts. Therefore, word embeddings for the query cannot capture its overall meaning another stage of quality assurance is crucial, and our meta-learned and the majority of related words are only similar to “screen”. BERT text-matching model is a better candidate for this role com- In Table 7, we show the top-3 matched concepts found by our pared to the simple fine-tuned version (RW-BERT-ft). According framework and competitive baselines for a test query. Overall, our to the human evaluation results in Table 5, our framework scores framework is superior at both finding potential candidates and highest in every category. In particular, it receives more scores of picking out the best matching concepts. Here, Omen and Legion are 2 than the baselines, which proves its superiority in discovering gaming laptop models from HP and Lenovo, respectively. Clearly, matching isA relations between queries and concepts. We observe the MoWE baseline does not have this knowledge and base its the same pattern in the A/B testing results, where the concepts matching solely on the word “Legion”. In contrast, our proposed matched by our framework consistently attract more clicks than RW stage captures this relation. Then, the second stage reliably the baseline. filters out non-matching concepts such as “Laptop keyboard”.

3070 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

Table 6: Comparison of the top-10 words found by the RWG Query-concept matching is a relatively unexplored topic due model and the TAL pre-trained word embeddings. to the need of a pre-defined concept set. [42] proposes a Bayesian

Input query Model Top-10 words inference mechanism for query-concept matching. Specifically, the matching degree between a query and a concept is derived from 荣威, 手机, 映射, 汽车, 屏幕, 论坛, 问题, 互联, 教程, 大屏. the conditional probabilities between their related instances and 荣威rx5映射 RWG RongWei, cellphone, project, car, screen, attributes. [33] refines the Probase [43] ontology by adding more 手机屏幕 forum, question, inter-connect, tutorial, relations extracted from web documents, then presents a random RongWei rx5 project big-screen. walk strategy to discover the matching concepts. [22] proposes a cellphone screen 墨水屏 显示屏 屏幕 运存 曲面屏 , , , , , koobee, probabilistic approach according to the co-occurrence of context amoled, 全面屏, 折叠屏, 投屏. ink screen, display screen, screen, RAM, words as well as a similarity-based method using TF-IDF vectors. TAL curved screen, koobee, amoled, bezel-less display, folding display, screen projection. 6.2 Meta-Learning Meta-learning studies the problem of learning how to learn [24, 38]. Early works focus on the design of meta-trainers, i.e., a model that Table 7: Comparison of the top-3 concepts matched by our learns how to train another model such that it performs better proposed framework and competitive baseline models. on a given task [4, 30]. [1] transfers this idea to neural networks Input query Model Matched concepts and proposes an optimizer-optimizee setup, where each compo- 游戏笔记本电脑, 联想笔记本电脑. nent is learned with an iterative gradient-descent procedure. [20] RW-BERT-meta 暗夜精灵和拯救者 Gaming laptop, Lenovo laptop. follows a guided policy search strategy and automatically learns Omen and Legion 游戏笔记本电脑, 笔记本电脑键盘, the optimization procedure for updating a model. Meta-learning 联想笔记本电脑. RW is also studied as a promising solution to few-shot classification Gaming laptop, Laptop keyboard, Lenovo laptop. problems, where a model learns to recognize new classes given a 拯救者小说. limited amount of training data for each class. [27] proposes an MoWE Legion novel. LSTM meta-learner to learn an optimization procedure for few- 游戏笔记本电脑, 笔记本电脑键盘, shot image classification. [21] develops an SGD-like meta-learning 联想笔记本电脑. RW-BERT-ft process and experiment on few-shot regression and reinforcement Gaming laptop, Laptop keyboard, learning problems. MAML [9] is another popular approach that Lenovo laptop. does not impose a constraint on the architecture of the learner. Finally, Reptile [25] simplifies the learning process of MAML by conducting first-order gradient updates on the meta-learner. 6 RELATED WORK Meta-leaning could also be achieved with non-parametric meth- 6.1 Search Query Understanding ods. [19] meta-learns a Siamese network to classify whether two images are from the same class. Matching networks [40] refines this We review related works related to search query understanding idea by imitating the meta-testing procedure during meta-training, from the perspectives of query expansion, query reformulation, where the learned embedding of a test input is compared to embed- query generation and query-concept matching. Query expansion ding vectors of inputs from known classes, and the most matching tackles the Lexical Chase problem. [10] performs expansions through class is selected by the method of weighted k-nearest-neighbors. a relation graph. [11, 12, 28, 29] approach the problem from the Prototypical Networks [31] learn an entirely new metric space for perspective of statistical . comparing the similarities between inputs. [36] proposes that sepa- Query reformulation re-writes a search query such that it is rating embedding learning and relation matching further improves easier to process, while maintaining the original meaning. Early the accuracy of meta-learned few-shot classification models. methods either delete unnecessary terms in a query [18] or substi- tute ambiguous terms [37]. [45] proposes an active-learning-based method. SimRank and SimRank++ [2, 16] compute similarities be- 7 CONCLUSION tween queries using the links in a click graph. More recently, meth- In this paper, we propose a meta-learned neural framework for ods based on machine translation [28] or recurrent neural networks matching a search query to its high-level concepts. We begin by [15] further improves the quality of the reformulated queries. discovering potentially matching candidates with a novel shortlist- When generating search queries directly using Sequence-to- ing scheme, where a recurrent Relevant Words Generator performs Sequence models, the process of query understanding is implicitly context-completion for a query. To ensure that each concept is captured by the encoder part of the model. [41] considers the com- indeed a match to the query, we meta-fine-tune a BERT pairwise pression of E-commerce product titles and the generation of search text-matching model with the Reptile meta-learning algorithm, queries in a multi-task learning setup. [15, 44] perform generative which performs binary classification on a query-concept pair to query re-writing, while [14] represents documents as graphs and determine the final matching degree. We train our framework with reverse-engineers the most appropriate queries from them. Finally, data derived from a large click graph that contains over 7 million [13] proposes a two-stage generative framework for query recom- click histories, without any manual labelling. Through offline eval- mendation, where the Relevant Words Generator model is proven uations, human assessments, online tests and case studies, we have to be extremely useful in mitigating the Lexical Chase problem. verified the effectiveness of our proposed framework, especially its

3071 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

ability to infer and discover higher-level concepts for search queries, [21] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. 2017. Meta-SGD: Learning as compared to a range of competitive baselines. We conclude that to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835 (2017). [22] Bang Liu, Weidong Guo, Di Niu, Chaoyue Wang, Shunnan Xu, Jinghong Lin, a combination of deep neural language models and meta-learning Kunfeng Lai, and Yu Xu. 2019. A User-Centered Concept Mining System for may open up avenues for natural language conceptualization and Query and Document Understanding at Tencent. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM. inference, an ability that is native to human intelligence and is [23] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. critical to enhancing human-computer interaction through search Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language engines. Processing Toolkit. In Association for Computational (ACL) System Demonstrations. 55–60. [24] Devang K Naik and RJ Mammone. 1992. Meta-neural networks that learn by learning. In [Proceedings 1992] IJCNN International Joint Conference on Neural REFERENCES Networks, Vol. 1. IEEE, 437–442. [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David [25] Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta- Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learning algorithms. arXiv preprint arXiv:1803.02999 (2018). learn by gradient descent by gradient descent. In Advances in neural information [26] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher processing systems. 3981–3989. Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word [2] Ioannis Antonellis, Hector Garcia Molina, and Chi Chao Chang. 2008. Simrank++: representations. arXiv preprint arXiv:1802.05365 (2018). query rewriting through link analysis of the click graph. Proceedings of the VLDB [27] Sachin Ravi and Hugo Larochelle. 2016. Optimization as a model for few-shot Endowment 1, 1 (2008), 408–421. learning. (2016). [3] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, [28] Stefan Riezler and Yi Liu. 2010. Query rewriting using monolingual statistical and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The machine translation. Computational Linguistics 36, 3 (2010), 569–582. semantic web. Springer, 722–735. [29] Stefan Riezler, Yi Liu, and Alexander Vasserman. 2008. Translating queries into [4] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. 1992. On the snippets for improved query expansion. In Proceedings of the 22nd International optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial Conference on Computational Linguistics-Volume 1. Association for Computational and Biological Neural Networks. Univ. of Texas, 6–8. Linguistics, 737–744. [5] Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards [30] Jürgen Schmidhuber. 1992. Learning to control fast-weight memories: An alter- Better UD : Deep Contextualized Word Embeddings, Ensemble, and Tree- native to dynamic recurrent networks. Neural Computation 4, 1 (1992), 131–139. bank Concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual [31] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks Parsing from Raw Text to . Association for Computational for few-shot learning. In Advances in Neural Information Processing Systems. Linguistics, Brussels, Belgium, 55–64. 4077–4087. [6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. [32] Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. 2018. In Proceedings of the Empirical evaluation of gated recurrent neural networks on sequence modeling. 2018 Conference of the North American Chapter of the Association for Computational arXiv preprint arXiv:1412.3555 (2014). Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: for Computational Linguistics, 175–180. Pre-training of deep bidirectional transformers for language understanding. arXiv [33] Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu preprint arXiv:1810.04805 (2018). Chen. 2011. Short text conceptualization using a probabilistic knowledgebase. In [8] Murhaf Fares, Andrey Kutuzov, Stephan Oepen, and Erik Velldal. 2017. Word Twenty-Second International Joint Conference on . vectors, reuse, and replicability: Towards a community repository of large-text re- [34] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan sources. In Proceedings of the 21st Nordic Conference on Computational Linguistics. Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Association for Computational Linguistics, Gothenburg, Sweden, 271–276. Overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958. [9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta- [35] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of learning for fast adaptation of deep networks. In Proceedings of the 34th Interna- semantic knowledge. In Proceedings of the 16th international conference on World tional Conference on Machine Learning-Volume 70. JMLR. org, 1126–1135. Wide Web. ACM, 697–706. [10] Bruno M Fonseca, Paulo Golgher, Bruno Pôssas, Berthier Ribeiro-Neto, and Nivio [36] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Ziviani. 2005. Concept-based interactive query expansion. In Proceedings of the Hospedales. 2018. Learning to compare: Relation network for few-shot learning. 14th ACM international conference on Information and knowledge management. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. ACM, 696–703. 1199–1208. [11] Jianfeng Gao, Xiaodong He, Shasha Xie, and Alnur Ali. 2012. Learning lexicon [37] Egidio Terra and Charles LA Clarke. 2004. Scoring missing terms in information models from search logs for query expansion. In Proceedings of the 2012 Joint Con- retrieval tasks. In Proceedings of the thirteenth ACM international conference on ference on Empirical Methods in Natural Language Processing and Computational Information and knowledge management. ACM, 50–58. Natural Language Learning. 666–676. [38] Sebastian Thrun and Lorien Pratt. 2012. Learning to learn. Springer Science & [12] Jianfeng Gao and Jian-Yun Nie. 2012. Towards concept-based translation models Business Media. using search logs for query expansion. In Proceedings of the 21st ACM international [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, conference on Information and knowledge management. ACM, 1. Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all [13] Fred X Han, Di Niu, Haolan Chen, Kunfeng Lai, Yancheng He, and Yu Xu. 2019. you need. In Advances in neural information processing systems. 5998–6008. A deep generative approach to search extrapolation and recommendation. In [40] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Matching networks for one shot learning. In Advances in neural information Discovery & Data Mining. ACM. processing systems. 3630–3638. [14] Fred X. Han, Di Niu, Weidong Guo, Kunfeng Lai, Yancheng He, and Yu Xu. 2019. [41] Jingang Wang, Junfeng Tian, Long Qiu, Sheng Li, Jun Lang, Luo Si, and Man Lan. Inferring Search Queries from Web Documents via a Graph-Augmented Sequence 2018. A Multi-task Learning Approach for Improving Product Title Compression to Attention Network. In Proceedings of The Web Conference 2019. with User Search Log Data. arXiv preprint arXiv:1801.01725 (2018). [15] Yunlong He, Jiliang Tang, Hua Ouyang, Changsung Kang, Dawei Yin, and Yi [42] Zhongyuan Wang, Kejun Zhao, Haixun Wang, Xiaofeng Meng, and Ji-Rong Wen. Chang. 2016. Learning to rewrite queries. In Proceedings of the 25th ACM In- 2015. Query understanding through knowledge-based conceptualization. In ternational on Conference on Information and Knowledge Management. ACM, Twenty-Fourth International Joint Conference on Artificial Intelligence. 1443–1452. [43] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A [16] Glen Jeh and Jennifer Widom. 2002. SimRank: a measure of structural-context probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM similarity. In Proceedings of the eighth ACM SIGKDD international conference on SIGMOD International Conference on Management of Data. ACM, 481–492. Knowledge discovery and data mining. ACM, 538–543. [44] Zi Yin, Keng-hao Chang, and Ruofei Zhang. 2017. Deepprobe: Information di- [17] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity rected sequence understanding and design via recurrent neural networks. search with GPUs. arXiv preprint arXiv:1702.08734 (2017). In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge [18] Rosie Jones and Daniel C Fain. 2003. Query word deletion prediction. In Pro- Discovery and Data Mining. ACM, 2131–2139. ceedings of the 26th annual international ACM SIGIR conference on Research and [45] Wei Vivian Zhang, Xiaofei He, Benjamin Rey, and Rosie Jones. 2007. Query rewrit- development in informaion retrieval. ACM, 435–436. ing using active learning for sponsored search. In Proceedings of the 30th annual [19] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural international ACM SIGIR conference on Research and development in information networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. retrieval. ACM, 853–854. [20] Ke Li and Jitendra Malik. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885 (2016).

3072 Applied Data Science Track Paper KDD '20, August 23–27, 2020, Virtual Event, USA

8 SUPPLEMENTARY INFORMATION RTX-2080Ti GPU. Each epoch takes approximately 90 minutes to 8.1 Training Setup finish. In the task-learning phase of Reptile, we minimize the Cross- To pre-processing the click graph, we first prune any edges that Entropy loss on the predicted labels. We set K = 10. We choose a do not appear more than once to avoid noise caused by mis-clicks. fixed task-learn-rate of 0.0001 for the optimizer, and a fixed meta- We then keep the top-3 weighted outgoing edges of each query learn-rate of 0.1. On two RTX-2080Ti GPUs, we carry out the meta- vertex, because we found that over 90% of vertices have less than 3 fine-tuning with a batch size of 32, and we terminate the process if outgoing edges. the loss on the development set does not change by more than 0.01 We utilize the FAISS [17] tool for fast similarity searches. For between two epochs. In the end, our BERT-meta model converges all the two-stage setups, we limited the first stage to only pass the in 5 epochs with an average classification accuracy of 95.5% on the top-10 most similar concepts to the second stage. For all the RW dev sets, and each epoch takes approximately one day to complete. setups, we take the top-15 predicted relevant words and set γ = 1.0. The decision threshold for all classifiers is 0.5. 8.2 Evaluation Setup We implement both stages of our model using Pytorch 0.4 and We derive the macro-averaged F1 score from Mean Average Preci- train with the Adam optimizer. We train the RWG model by minimiz- sion (MAP) and Mean Average Recall (MAR) as, ing the Binary Cross-Entropy loss on the target relevant words. The N input and output vocabulary sizes are set to the values presented in 1 Õ TP MAP = i , (12) Table 1. We initialize the embedding layer with the Tencent AI Lab N TPi + FPi 200d pre-trained word embeddings. We choose the top-100 recall i=1 N rate, i.e., the percentage of truth words that appear in the top-100 1 Õ TP MAR = i , (13) predictions, as the metric for hyper-parameter tuning. We select a N TPi + FN i hidden size of 2048 for the Bi-GRU and a dropout probability of 0.5. i=1 MAP · MAR For the optimizer, We set an initial learn-rate of 0.001 and follow a F1 = 2 · , (14) simple learn-rate decay strategy: If the train/dev loss of the current MAP + MAR where N is the number of results with positive predictions. TP, FP, epoch is higher than the previous epoch, decay the learn-rate by FN denotes the number of true-positives, false-positives and false 0.5, where the minimum possible learn-rate is 0.0001. The RWG negatives. model converges in 20 epochs with a batch size of 256 on an Nvidia

3073