
Global Entity Disambiguation with Pretrained Contextualized Embeddings of Words and Entities

Ikuya Yamada1,4  Koki Washio2,4  Hiroyuki Shindo3,4  Yuji Matsumoto4

1Studio Ousia, Tokyo, Japan 2The University of Tokyo, Tokyo, Japan 3Nara Institute of Science and Technology, Nara, Japan 4RIKEN AIP, Tokyo, Japan

Abstract

We propose a new global entity disambiguation (ED) model based on contextualized embeddings of words and entities. Our model is based on a bidirectional transformer encoder (i.e., BERT) and produces contextualized embeddings for words and entities in the input text. The model is trained using a new masked entity prediction task that aims to train the model by predicting randomly masked entities in entity-annotated texts obtained from Wikipedia. We further extend the model by solving ED as a sequential decision task to capture global contextual information. We evaluate our model using six standard ED datasets and achieve new state-of-the-art results on all but one dataset.

1 Introduction

Entity disambiguation (ED) refers to the task of assigning entity mentions in a text to corresponding entries in a knowledge base (KB). This task is challenging because of the ambiguity between entity names (e.g., "World Cup") and the entities they refer to (e.g., FIFA World Cup or Rugby World Cup). Recent ED models typically rely on two types of contextual information: local information based on words that co-occur with the mention, and global information based on document-level coherence of the disambiguation decisions. A key to improving the performance of ED is to combine both local and global information, as observed in most recent ED models.

In this study, we propose a novel ED model based on contextualized embeddings of words and entities. The proposed model is based on BERT (Devlin et al., 2019). Our model takes the words and entities in the input document, and produces a contextualized embedding for each word and entity. Inspired by the masked language model (MLM) adopted in BERT, we propose masked entity prediction (MEP), a novel task that aims to train the model by predicting randomly masked entities based on words and non-masked entities. We train the model using texts and their entity annotations retrieved from Wikipedia.

Furthermore, we introduce a simple extension to the inference step of the model to capture global contextual information. Specifically, similar to the approach used in past work (Fang et al., 2019; Yang et al., 2019), we address ED as a sequential decision task that disambiguates mentions one by one, and uses words and already disambiguated entities to disambiguate new mentions.

We evaluate the proposed model using six standard ED datasets and achieve new state-of-the-art results on all but one dataset. Furthermore, we will publicize our code and trained embeddings.

2 Background and Related Work

Neural network-based approaches have recently achieved strong results on ED (Ganea and Hofmann, 2017; Yamada et al., 2017; Le and Titov, 2018; Cao et al., 2018; Le and Titov, 2019; Yang et al., 2019). These approaches are typically based on embeddings of words and entities trained using a large KB (e.g., Wikipedia). Such embeddings enable us to design ED models that capture the contextual information required to address ED. These embeddings are typically based on conventional word embedding models (e.g., skip-gram (Mikolov et al., 2013)) that assign a fixed embedding to each word and entity (Yamada et al., 2016; Cao et al., 2017; Ganea and Hofmann, 2017).

Shahbazi et al. (2019) and Broscheit (2019) proposed ED models based on contextualized word embeddings, namely, ELMo (Peters et al., 2018) and BERT, respectively. These models predict the referent entity of a mention using the contextualized embeddings of the constituent or surrounding words of the mention. However, unlike our proposed model, these models address the task based only on local contextual information.

3 Contextualized Embeddings of Words and Entities for ED

[Figure 1: Architecture of the proposed contextualized embeddings of words and entities. The example input consists of the words "[CLS] madonna lives in new york [SEP]" and the entities Madonna and New_York_City; the input representation of each token is the sum of its token embedding (A for words, B for entities), its token type embedding (Cword or Centity), and its position embedding (D for words, E for entities, averaged over the positions of multi-word entity names), and a bidirectional transformer produces an output embedding for every word and entity.]

Figure 1 illustrates the architecture of our contextualized embeddings of words and entities. Our model adopts a multi-layer bidirectional transformer encoder (Vaswani et al., 2017).

Given a document, the model first constructs a sequence of tokens consisting of the words in the document and the entities appearing in the document. Then, the model represents the sequence as a sequence of input embeddings, one for each token, and generates a contextualized output embedding for each token. Both the input and output embeddings have H dimensions. Hereafter, we denote the number of words and that of entities in the vocabulary of our model by Vw and Ve, respectively.

3.1 Input Representation

Similar to the approach adopted in BERT (Devlin et al., 2019), the input representation of a given token (word or entity) is constructed by summing the following three embeddings of H dimensions:

• Token embedding is the embedding of the corresponding token. The matrices of the word and entity token embeddings are represented as A ∈ R^(Vw×H) and B ∈ R^(Ve×H), respectively.

• Token type embedding represents the type of the token, namely, word type (denoted by Cword¹) or entity type (denoted by Centity).

• Position embedding represents the position of the token in the word sequence. A word and an entity appearing at the i-th position in the sequence are represented as D_i and E_i, respectively. If an entity name contains multiple words, its position embedding is computed by averaging the embeddings of the corresponding positions.

Following BERT (Devlin et al., 2019), we insert the special tokens [CLS] and [SEP] into the word sequence as the first and last words, respectively.

¹ We initialized Cword using BERT's segment embedding for sentence A.
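The following PyTorch-style sketch illustrates this input construction; the module and argument names (e.g., InputEmbeddings, entity_position_ids) are illustrative rather than taken from the released implementation, and details such as dropout and weight initialization are omitted.

```python
import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    """Sketch of Section 3.1: sum of token, token type, and position embeddings."""

    def __init__(self, num_words, num_entities, hidden_size, max_positions):
        super().__init__()
        self.word_embedding = nn.Embedding(num_words, hidden_size)        # A (Vw x H)
        self.entity_embedding = nn.Embedding(num_entities, hidden_size)   # B (Ve x H)
        self.type_embedding = nn.Embedding(2, hidden_size)                # Cword (0) / Centity (1)
        self.word_position_embedding = nn.Embedding(max_positions, hidden_size)    # D_i
        self.entity_position_embedding = nn.Embedding(max_positions, hidden_size)  # E_i

    def forward(self, word_ids, entity_ids, entity_position_ids):
        # Words: token embedding + word-type embedding + position embedding.
        word_positions = torch.arange(word_ids.size(0), device=word_ids.device)
        words = (self.word_embedding(word_ids)
                 + self.type_embedding(torch.zeros_like(word_ids))
                 + self.word_position_embedding(word_positions))

        # Entities: the position embedding of a multi-word entity name is the average
        # of the embeddings of its word positions (unused slots are padded with -1).
        mask = (entity_position_ids != -1).unsqueeze(-1).float()
        positions = self.entity_position_embedding(entity_position_ids.clamp(min=0)) * mask
        positions = positions.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        entities = (self.entity_embedding(entity_ids)
                    + self.type_embedding(torch.ones_like(entity_ids))
                    + positions)

        # The concatenated sequence is fed to the bidirectional transformer encoder.
        return torch.cat([words, entities], dim=0)
```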
3.2 Masked Entity Prediction

To train the model, we propose masked entity prediction (MEP), a novel task based on MLM. In particular, some percentage of the input entities are masked at random; then, the model learns to predict the masked entities based on words and non-masked entities. We represent masked entities using the special [MASK] entity token.

We adopt a model equivalent to the one used to predict words in MLM. Specifically, we predict the original entity corresponding to a masked entity by applying the softmax function over all entities in our vocabulary:

    ŷ_MEP = softmax(Bm + b_o),    (1)

where b_o ∈ R^Ve is the output bias, and m ∈ R^H is derived as

    m = layer_norm(gelu(W_f h + b_f)),    (2)

where h ∈ R^H is the output embedding corresponding to the masked entity, W_f ∈ R^(H×H) is the weight matrix, b_f ∈ R^H is the bias, gelu(·) is the gelu activation function (Hendrycks and Gimpel, 2016), and layer_norm(·) is the layer normalization function (Lei Ba et al., 2016).

3.3 Training

We used the same transformer architecture adopted in the BERT_LARGE model (Devlin et al., 2019). We initialized the parameters of our model that were common with BERT (i.e., the parameters in the transformer encoder and the embeddings for words) using the uncased version of the pretrained BERT_LARGE model. Other parameters, namely, the parameters in the MEP and the embeddings for entities, were initialized randomly.

The model was trained via iterations over Wikipedia pages in a random order for seven epochs. We treated the hyperlinks as entity annotations, and masked 30% of all entities at random. The input text was tokenized into words using BERT's tokenizer with its vocabulary consisting of Vw = 30,000 words. Similar to Ganea and Hofmann (2017), we built an entity vocabulary consisting of Ve = 128,040 entities, which were contained in the entity candidates in the datasets used in our experiments. We optimized the model by maximizing the log likelihood of MEP's predictions using Adam (Kingma and Ba, 2014). Further details are provided in Appendix A.
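A minimal PyTorch-style sketch of the MEP head defined by Eq. (1) and Eq. (2) is shown below; the class and attribute names are ours, and the entity embedding matrix B is assumed to be shared with the input entity token embeddings, as the shared symbol in Eq. (1) and Section 3.1 suggests.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedEntityPredictionHead(nn.Module):
    """Sketch of Eq. (1) and Eq. (2): predict the original entity behind a [MASK] entity token."""

    def __init__(self, hidden_size, entity_embedding):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)  # W_f and b_f in Eq. (2)
        self.layer_norm = nn.LayerNorm(hidden_size)
        self.entity_embedding = entity_embedding          # B (Ve x H), shared with the input embeddings
        self.output_bias = nn.Parameter(torch.zeros(entity_embedding.num_embeddings))  # b_o

    def forward(self, h):
        # Eq. (2): m = layer_norm(gelu(W_f h + b_f))
        m = self.layer_norm(F.gelu(self.dense(h)))
        # Eq. (1): distribution over all Ve entities in the vocabulary
        logits = F.linear(m, self.entity_embedding.weight, self.output_bias)
        return F.softmax(logits, dim=-1)
```

During training, the log of the probability assigned to the gold entity at each masked position would be maximized, matching the log-likelihood objective described in Section 3.3.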
4 Our ED Model

We describe our ED model in this section.

4.1 Local ED Model

Given an input document with N mentions and their K entity candidates, our local ED model first creates an input sequence consisting of the words in the document and N masked entity tokens corresponding to the mentions in the document. Then, the model computes the embedding m′ ∈ R^H for each mention using Eq. (2), and predicts the entity for each mention using the softmax function over its K entity candidates:

    ŷ_ED = softmax(B* m′ + b_o*),    (3)

where B* ∈ R^(K×H) and b_o* ∈ R^K consist of the entity token embeddings and the output bias values corresponding to the entity candidates, respectively. Note that B* and b_o* are subsets of B and b_o, respectively. This model is denoted as local in the remainder of the paper.

4.2 Global ED Model

Our global model addresses ED by resolving mentions sequentially for N steps. The model is described in Algorithm 1. First, the model initializes the entity of each mention using the [MASK] token. Then, at each step, the model predicts an entity for each mention, selects the mention with the highest probability produced by the softmax function in Eq. (3) among all unresolved mentions, and resolves the selected mention by assigning the predicted entity to it. This model is denoted as confidence-order in the remainder of the paper. Furthermore, we test a baseline model that selects mentions in their order of appearance in the document, and denote it by natural-order.

Algorithm 1: Algorithm of our global ED model.
    Input: Words and mentions m_1, ..., m_N in the input document
    Initialize: e_i ← [MASK], i = 1 ... N
    repeat N times
        For all mentions, obtain entity predictions ê_1, ..., ê_N using Eq. (2) and Eq. (3), with the words and the entities e_1, ..., e_N as inputs
        Select the mention m_j that has the most confident prediction among all unresolved mentions
        e_j ← ê_j
    end
    return {e_1, ..., e_N}
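The decoding loop of Algorithm 1 can be sketched as follows, assuming a predict_fn that runs the model once over the document and returns, for every mention, the softmax of Eq. (3) over its candidates given the current (partially resolved) entity assignments; the function and variable names are ours, and the toy predictor exists only to make the example self-contained.

```python
import numpy as np


def confidence_order_ed(candidates, predict_fn):
    """Sketch of Algorithm 1 (confidence-order global ED).

    candidates: list of length N; candidates[i] is the candidate entity list of mention i.
    predict_fn: callable taking the current entity assignments (None plays the role of
        the [MASK] entity) and returning one probability array per mention (Eq. (3)).
    """
    n = len(candidates)
    entities = [None] * n
    for _ in range(n):
        probs = predict_fn(entities)
        # Resolve the unresolved mention with the most confident prediction.
        unresolved = [i for i in range(n) if entities[i] is None]
        j = max(unresolved, key=lambda i: probs[i].max())
        entities[j] = candidates[j][int(probs[j].argmax())]
    return entities


if __name__ == "__main__":
    # Toy predictor with random scores so the sketch runs end to end.
    rng = np.random.default_rng(0)
    cands = [["New_York_City", "New_York_Knicks"], ["Madonna", "Madonna_(album)"]]
    dummy_predict = lambda entities: [rng.dirichlet(np.ones(len(c))) for c in cands]
    print(confidence_order_ed(cands, dummy_predict))
```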
5 Experiments

We test the proposed ED models using six standard ED datasets: AIDA-CoNLL (CoNLL)² (Hoffart et al., 2011), MSNBC (MSB), AQUAINT (AQ), ACE2004 (ACE), WNED-CWEB (CW), and WNED-WIKI (WW) (Guo and Barbosa, 2018). We consider only the mentions that refer to valid entities in Wikipedia. For all datasets, we use the KB+YAGO entity candidates and their associated p̂(e|m) (Ganea and Hofmann, 2017), and use the top 30 candidates based on p̂(e|m). We split a document if it is longer than 512 words, which is the maximum word length of the BERT model. We report the in-KB accuracy for the CoNLL dataset, and the micro F1 score (averaged per mention) for the other datasets.

Furthermore, we optionally fine-tune the model by maximizing the log likelihood of the ED predictions (ŷ_ED) using the training set of the CoNLL dataset. We mask 90% of the mentions and fix the entity token embeddings (B and B*) and the output bias (b_o and b_o*). The model is trained for two epochs using Adam. Additional details are provided in Appendix B.

² We used the test-b set of the CoNLL dataset.

5.1 Results and Analysis

Table 1 presents the results on the CoNLL dataset. Our global models successfully outperformed all the recent strong models, including models based on ELMo (Shahbazi et al., 2019) and BERT (Broscheit, 2019). Furthermore, our confidence-order model trained only on our Wikipedia-based annotations outperformed two recent models trained on the in-domain training set of the CoNLL dataset, namely, Yamada et al. (2016) and Ganea and Hofmann (2017).

Name                          Train   Accuracy
Yamada et al. (2016)            ✓     91.5
Ganea and Hofmann (2017)        ✓     92.22±0.14
Yang et al. (2018)              ✓     93.0
Le and Titov (2018)             ✓     93.07±0.27
Cao et al. (2018)                     80
Fang et al. (2019)              ✓     94.3
Shahbazi et al. (2019)          ✓     93.46±0.14
Le and Titov (2019)                   89.66±0.16
Broscheit (2019)                ✓     87.9
Yang et al. (2019) (DCA-SL)     ✓     94.64±0.2
Yang et al. (2019) (DCA-RL)     ✓     93.73±0.2
Our (confidence-order)          ✓     95.04±0.24
Our (natural-order)             ✓     94.76±0.26
Our (local)                     ✓     94.49±0.22
Our (confidence-order)                92.42
Our (natural-order)                   91.68
Our (local)                           90.80

Table 1: In-KB accuracy on the CoNLL dataset. The 95% confidence intervals over five runs are also reported if available. Train: whether the model is trained using the training set of the CoNLL dataset.

Table 2 presents the results on the datasets other than the CoNLL dataset. Our models trained only on our Wikipedia-based annotations outperformed recent strong models on the MSB, AQ, ACE, and WW datasets. Additionally, we also tested the performance of our models fine-tuned on the CoNLL dataset, and found that fine-tuning generally degraded the performance on these five datasets.

Name                          Train  MSB   AQ    ACE   CW    WW    Avg.
Ganea and Hofmann (2017)        ✓    93.7  88.5  88.5  77.9  77.5  85.2
Yang et al. (2018)              ✓    92.6  89.9  88.5  81.8  79.2  86.4
Le and Titov (2018)             ✓    93.9  88.3  89.9  77.5  78.0  85.5
Cao et al. (2018)                    -     87    88    -     86    -
Fang et al. (2019)              ✓    92.8  87.5  91.2  78.5  82.8  86.6
Shahbazi et al. (2019)          ✓    92.3  90.1  88.7  78.4  79.8  85.9
Le and Titov (2019)                  92.2  90.7  88.1  78.2  81.7  86.2
Yang et al. (2019) (DCA-SL)     ✓    94.6  87.4  89.4  73.5  78.2  84.6
Yang et al. (2019) (DCA-RL)     ✓    93.8  88.3  90.1  75.6  78.8  85.3
Our (confidence-order)               96.3  93.5  91.9  78.9  89.1  89.9
Our (natural-order)                  96.1  92.9  91.9  78.4  89.2  89.6
Our (local)                          96.1  91.9  91.9  78.4  88.8  89.3
Our (confidence-order)          ✓    94.1  91.5  90.7  78.3  87.6  88.4
Our (natural-order)             ✓    94.1  90.9  90.7  78.3  87.4  88.3
Our (local)                     ✓    93.9  90.8  90.7  78.2  87.2  88.2

Table 2: Micro F1 scores on the five ED datasets. Train: whether the model is trained using the training set of the CoNLL dataset.

Furthermore, our local model performed equally to or worse than our global models on all datasets. This clearly demonstrates the effectiveness of using global contextual information even if the local contextual information is modeled based on expressive contextualized embeddings. Moreover, the natural-order model performed worse than the confidence-order model on most datasets.

Additionally, our models performed relatively worse on the CW dataset. We consider that our model failed to capture important contextual information because the documents in this dataset are significantly longer than those in the other datasets, i.e., approximately 1,700 words per document on average, which is more than three times the maximum word length of our model (i.e., 512 words). We also consider that Yang et al. (2018) achieved excellent performance on this specific dataset because their model is based on various hand-engineered features capturing document-level contextual information.

To investigate how the global contextual information helped our model improve its performance, we manually analyzed the differences between the predictions of the local, natural-order, and confidence-order models. The CoNLL dataset was used to fine-tune and test the models.

The local model often failed to resolve mentions of common names referring to specific entities (e.g., "New York" referring to the basketball team New York Knicks). Global models were generally better at resolving such mentions because of the presence of strong global contextual information (e.g., mentions referring to basketball teams).

Furthermore, we found that the CoNLL dataset contains mentions that require highly detailed context to resolve. For example, the mention "Matthew Burke" can refer to two different former Australian rugby players. Although the local and natural-order models incorrectly resolved this mention to the player who has the larger number of occurrences in our Wikipedia-based annotations, the confidence-order model successfully resolved it by disambiguating its contextual mentions, including his colleague players, in advance. We provide the detailed inference for the corresponding document in Appendix C.
Next, we examined whether our model learned effective embeddings for rare entities using the CoNLL dataset. Following Ganea and Hofmann (2017), we used the mentions whose entity candidates contain their gold entities, and measured the performance by dividing the mentions based on the frequency of their entities in the Wikipedia annotations used to train the embeddings. As presented in Table 3, our models achieved enhanced performance in predicting rare entities.

# annotations   confidence-order   natural-order   local   G&H2017
0                     1.0               1.0          1.0      0.8
1–10                 95.55             95.55        95.55    91.93
11–50                96.98             96.70        96.43    92.44
≥51                  96.64             96.38        95.80    94.21

Table 3: In-KB accuracy on the CoNLL dataset split by the frequency in Wikipedia entity annotations. Our models were fine-tuned using the CoNLL dataset. G&H2017: the results of Ganea and Hofmann (2017).

6 Conclusions

We proposed a global ED model based on contextualized embeddings trained using Wikipedia. Our experimental results demonstrate the effectiveness of our model across a wide range of ED datasets.

References

Samuel Broscheit. 2019. Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 677–685.

Yixin Cao, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2018. Neural Collective Entity Linking. In Proceedings of the 27th International Conference on Computational Linguistics, pages 675–686.

Yixin Cao, Lifu Huang, Heng Ji, Xu Chen, and Juanzi Li. 2017. Bridge Text and Knowledge by Learning Multi-Prototype Entity Mention Embedding. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1623–1633.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Zheng Fang, Yanan Cao, Qian Li, Dongjie Zhang, Zhenyu Zhang, and Yanbing Liu. 2019. Joint Entity Linking with Deep Reinforcement Learning. In The World Wide Web Conference, WWW '19, pages 438–447.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep Joint Entity Disambiguation with Local Neural Attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629.

Zhaochen Guo and Denilson Barbosa. 2018. Robust Named Entity Disambiguation with Random Walks. Semantic Web, 9(4):459–479.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415v3.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust Disambiguation of Named Entities in Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980v9.

Phong Le and Ivan Titov. 2018. Improving Entity Linking by Modeling Latent Relations between Mentions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1595–1604.

Phong Le and Ivan Titov. 2019. Boosting Entity Linking Performance by Leveraging Unlabeled Documents. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1935–1945.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv preprint arXiv:1607.06450v1.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 2013 International Conference on Learning Representations, pages 1–12.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Hamed Shahbazi, Xiaoli Z. Fern, Reza Ghaeini, Rasha Obeidat, and Prasad Tadepalli. 2019. Entity-aware ELMo: Learning Contextual Entity Representation for Entity Disambiguation. arXiv preprint arXiv:1908.05762v2.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2017. Learning Distributed Representations of Texts and Entities from Knowledge Base. Transactions of the Association for Computational Linguistics, 5:397–411.

Xiyuan Yang, Xiaotao Gu, Sheng Lin, Siliang Tang, Yueting Zhuang, Fei Wu, Zhigang Chen, Guoping Hu, and Xiang Ren. 2019. Learning Dynamic Context Augmentation for Global Entity Linking. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 271–281.

Yi Yang, Ozan Irsoy, and Kazi Shefaet Rahman. 2018. Collective Entity Disambiguation with Structured Gradient Tree Boosting. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 777–786.
A Details of Training of Contextualized Embeddings

As the input corpus for training our contextualized embeddings, we used the December 2018 version of Wikipedia, comprising approximately 3.5 billion words and 11 million entity annotations. We generated input sequences by splitting the content of each page into sequences comprising ≤ 512 words and their entity annotations (i.e., hyperlinks).
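The following sketch illustrates this preprocessing step; the data layout (a list of words plus (start, end, entity) spans over word offsets) is an assumption made for the example and is not necessarily the actual data format used in the preprocessing pipeline.

```python
def split_page(words, annotations, max_len=512):
    """Split one Wikipedia page into training sequences of at most max_len words.

    words: list of word tokens of the page.
    annotations: list of (start, end, entity) spans over word offsets (hyperlinks).
    Returns a list of (word_chunk, chunk_annotations) pairs with re-based offsets.
    """
    sequences = []
    for offset in range(0, len(words), max_len):
        chunk = words[offset:offset + max_len]
        # Keep only the annotations fully contained in this chunk, shifted to chunk offsets.
        chunk_annotations = [
            (start - offset, end - offset, entity)
            for start, end, entity in annotations
            if start >= offset and end <= offset + len(chunk)
        ]
        sequences.append((chunk, chunk_annotations))
    return sequences
```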
To stabilize the training, we updated only those parameters that were randomly initialized (i.e., fixed the parameters initialized using BERT) at the first epoch, and updated all the parameters in the remaining six epochs. We implemented the model using PyTorch, and the training took approximately ten days using eight Tesla V100 GPUs.
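This two-stage schedule could be implemented along the following lines; the helper is a sketch under the assumption that the BERT-initialized and the randomly initialized parameters can be enumerated separately, and the names used here are hypothetical.

```python
from typing import Callable, Iterable

import torch.nn as nn


def train_with_warm_start(model: nn.Module,
                          bert_initialized: Iterable[nn.Parameter],
                          newly_initialized: Iterable[nn.Parameter],
                          train_one_epoch: Callable[[nn.Module], None],
                          total_epochs: int = 7) -> None:
    """Freeze the BERT-initialized weights for the first epoch, then train everything."""
    bert_initialized = list(bert_initialized)
    newly_initialized = list(newly_initialized)

    # Epoch 1: update only the randomly initialized parameters
    # (the entity embeddings and the MEP head).
    for p in bert_initialized:
        p.requires_grad = False
    for p in newly_initialized:
        p.requires_grad = True
    train_one_epoch(model)

    # Epochs 2-7: update all parameters.
    for p in bert_initialized:
        p.requires_grad = True
    for _ in range(total_epochs - 1):
        train_one_epoch(model)
```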

The hyper-parameters used in the training are detailed in Table 4.

Name                              Value
number of hidden layers           24
hidden size                       1024
attention heads                   16
attention head size               64
activation function               gelu
maximum word length               512
batch size                        2048
learning rate (1st epoch)         5e-4
learning rate decay (1st epoch)   none
warmup steps (1st epoch)          1000
learning rate                     5e-5
learning rate decay               linear
warmup steps                      1000
dropout                           0.1
weight decay                      0.01
gradient clipping                 1.0
adam β1                           0.9
adam β2                           0.999
adam ε                            1e-6

Table 4: Hyper-parameters used for training our contextualized embeddings.

B Details of Fine-tuning on CoNLL Dataset

The hyper-parameters used in the fine-tuning on the CoNLL dataset are detailed in Table 5. We selected these hyper-parameters from the search space described in Devlin et al. (2019) based on the accuracy on the development set of the CoNLL dataset.

Name                   Value
maximum word length    512
number of epochs       2
batch size             16
learning rate          2e-5
learning rate decay    linear
warmup proportion      0.1
dropout                0.1
weight decay           0.01
gradient clipping      1.0
adam β1                0.9
adam β2                0.999
adam ε                 1e-6

Table 5: Hyper-parameters during fine-tuning on the CoNLL dataset.

C Example of Inference by Confidence-order Model

Figure 2 shows an example of the inference performed by our confidence-order model fine-tuned on the CoNLL dataset. The document was obtained from the test set of the CoNLL dataset. As shown in the figure, the model started with unambiguous player names to recognize the topic of the document, and subsequently resolved the mentions that were challenging to resolve.

Notably, the model correctly resolved the mention "Nigel Walker" to the corresponding former rugby player instead of a football player, and the mention "Matthew Burke" to the correct former Australian rugby player born in 1973 instead of the former Australian rugby player born in 1964, by resolving other contextual mentions, including their colleague players, in advance. These two mentions are denoted in red in the figure. Note that our local model failed to resolve both mentions, and our natural-order model failed to resolve "Matthew Burke."

Document: "Campo has a massive following in this country and has had the public with him ever since he first played here in 1984," said Andrew, also likely to be making his final 20: Twickenham appearance. On tour, 17: have won all four tests against 46: , 47: , 48: Ireland and 45: , and scored 414 points at an average of almost 35 points a game. League duties restricted the 28: Barbarians' selectorial options but they still boast 13 internationals including 44: England full-back 16: and recalled wing 22: , plus 12: All Black forwards 25: and 14: Norm Hewitt. Teams: 27: Barbarians - 15 - 7: Tim Stimpson (31: England); 14 - 50: Nigel Walker (36: Wales), 13 - 1: (32: Wales), 12 - 10: (39: Scotland), 11 - 4: Tony Underwood (34: England); 10 - 17: (33: England), 9 - 2: (35: Wales); 8 - 15: (37: Wales), 7 - 8: (38: England), 6 - 19: Dale McIntosh (41: Pontypridd), 5 - 24: Ian Jones (51: ), 4 - 11: Craig Quinnell (40: Wales), 3 - 5: Darren Garforth (42: Leicester), 2 - 18: Norm Hewitt (52: New Zealand), 1 - 3: (49: Ireland). 43: Australia - 15 - 53: Matthew Burke; 14 - 9: , 13 - 26: , 12 - 20: (captain), 11 - 23: ; 10 - 29: Pat Howard, 9 - Sam Payne; 8 - Michael Brial, 7 - 30: David Wilson, 6 - 13: , 5 - 21: David Giffin, 4 - Tim Gavin, 3 - , 2 - Marco Caputo, 1 - 6: Dan Crowley.

Order of Inference by Confidence-order Model: Allan Bateman → Rob Howley → Nick Popplewell → Tony Underwood → Darren Garforth → Dan Crowley → Tim Stimpson → Neil Back → Joe Roff → Gregor Townsend → Craig Quinnell → All Black → Owen Finegan → Norm Hewitt → Scott Quinnell → Tim Stimpson → Australia → Norm Hewitt → Dale McIntosh → Tim Horan → David Giffin → Tony Underwood → David Campese → Ian Jones → Ian Jones → Daniel Herbert → Barbarians → Barbarians → Pat Howard → David Wilson → England → Wales → England → England → Wales → Wales → Wales → England → Scotland → Wales → Pontypridd → Leicester → Australia → England → Wales → Italy → Scotland → Ireland → Ireland → Nigel Walker → New Zealand → New Zealand → Matthew Burke

Figure 2: An illustrative example showing the inference performed by our fine-tuned confidence-order model on a document in the CoNLL dataset. Mentions are shown as underlined. Numbers in bold face represent the selection order of the confidence-order model.