Lecture 11: Question Answering
Danqi Chen, Princeton University

Lecture plan
1. What is question answering? (10 mins) Your default final project!
2. Reading comprehension (50 mins): how to answer questions over a single passage of text
3. Open-domain (textual) question answering (20 mins): how to answer questions over a large collection of documents
1. What is question answering?
Question (Q) Answer (A)
The goal of question answering is to build systems that automatically answer questions posed by humans in natural language.
The earliest QA systems date back to the 1960s! (Simmons et al., 1964)
Question answering: a taxonomy
• What information source does a system build on? A text passage, all Web documents, knowledge bases, tables, images, …
• Question type: factoid vs. non-factoid, open-domain vs. closed-domain, simple vs. compositional, …
• Answer type: a short segment of text, a paragraph, a list, yes/no, …
Lots of practical applications
IBM Watson beat Jeopardy! champions
Image credit: J & M, edition 3
(1) Question processing, (2) Candidate answer generation, (3) Candidate answer scoring, and (4) Confidence merging and ranking.
Question answering in the deep learning era
Image credit: (Lee et al., 2019)
Almost all of the state-of-the-art question answering systems are built on top of end-to-end training and pre-trained language models (e.g., BERT)!
Beyond textual QA problems
Today, we will mostly focus on how to answer questions based on unstructured text.
Knowledge based QA
Image credit: Percy Liang
Visual QA
(Antol et al., 2015): Visual Question Answering
2. Reading comprehension
Reading comprehension = comprehend a passage of text and answer questions about its content (P, Q) ⟶ A
Tesla was the fourth of five children. He had an older brother named Dane and three sisters, Milka, Angelina and Marica. Dane was killed in a horse-riding accident when Nikola was five. In 1861, Tesla attended the "Lower" or "Primary" School in Smiljan where he studied German, arithmetic, and religion. In 1862, the Tesla family moved to Gospić, Austrian Empire, where Tesla's father worked as a pastor. Nikola completed "Lower" or "Primary" School, followed by the "Lower Real Gymnasium" or "Normal School."
Q: What language did Tesla study while in school? A: German
Kannada language is the official language of Karnataka and spoken as a native language by about 66.54% of the people as of 2011. Other linguistic minorities in the state were Urdu (10.83%), Telugu language (5.84%), Tamil language (3.45%), Marathi language (3.38%), Hindi (3.3%), Tulu language (2.61%), Konkani language (1.29%), Malayalam (1.27%) and Kodava Takk (0.18%). In 2007 the state had a birth rate of 2.2%, a death rate of 0.7%, an infant mortality rate of 5.5% and a maternal mortality rate of 0.2%. The total fertility rate was 2.2.
Q: Which linguistic minority is larger, Hindi or Malayalam? A: Hindi
Why do we care about this problem?
• Useful for many practical applications
• Reading comprehension is an important testbed for evaluating how well computer systems understand human language
  • Wendy Lehnert (1977): "Since questions can be devised to query any aspect of text comprehension, the ability to answer questions is the strongest possible demonstration of understanding."
• Many other NLP tasks can be reduced to a reading comprehension problem:
Information extraction, e.g., the relation (Barack Obama, educated_at, ?), and semantic role labeling can both be framed as QA:
Question: Where did Barack Obama graduate from?
Passage: Obama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago.
(Levy et al., 2017); (He et al., 2015)

Stanford question answering dataset (SQuAD)
• 100k annotated (passage, question, answer) triples. Large-scale supervised datasets are a key ingredient for training effective neural models for reading comprehension!
• Passages are selected from English Wikipedia, usually 100-150 words long.
• Questions are crowd-sourced.
• Each answer is a short segment of text (a span) in the passage. This is a limitation: not all questions can be answered this way!
• SQuAD remains the most popular reading comprehension dataset; it is "almost solved" today, and the state of the art exceeds estimated human performance.
(Rajpurkar et al., 2016): SQuAD: 100,000+ Questions for Machine Comprehension
• Evaluation: exact match (0 or 1) and F1 (partial credit).
• For the development and test sets, 3 gold answers are collected, because there can be multiple plausible answers.
• We compare the predicted answer against each gold answer (the articles "a", "an", "the" and punctuation are removed) and take the max score. Finally, we average over all examples, for both exact match and F1.
• Estimated human performance: EM = 82.3, F1 = 91.2
Q: What did Tesla do in December 1878?
A: {left Graz, left Graz, left Graz and severed all relations with his family}
Prediction: left Graz and served

Exact match: max{0, 0, 0} = 0
F1: max{0.67, 0.67, 0.61} = 0.67
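The evaluation above can be sketched in code. This is a simplified version of SQuAD-style scoring (the normalization here is an approximation of the official script):

```python
import re
import string
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer."""
    pred_toks, gold_toks = normalize(prediction).split(), normalize(gold).split()
    num_same = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def evaluate(prediction, golds):
    """Take the max over the gold answers, for both EM and F1."""
    em = max(float(normalize(prediction) == normalize(g)) for g in golds)
    f1 = max(f1_score(prediction, g) for g in golds)
    return em, f1

golds = ["left Graz", "left Graz",
         "left Graz and severed all relations with his family"]
em, f1 = evaluate("left Graz and served", golds)
print(em, round(f1, 2))  # 0.0 0.67
```

Against the short gold answer "left Graz", the prediction has precision 2/4 and recall 2/2, giving F1 = 0.67, which is also the max over the three golds.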
Neural models for reading comprehension
How can we build a model to solve SQuAD? (We will use passage, paragraph, and context interchangeably, and likewise question and query.)
• Problem formulation
  • Input: C = (c_1, c_2, …, c_N), Q = (q_1, q_2, …, q_M), c_i, q_i ∈ V (N ≈ 100, M ≈ 15)
  • Output: 1 ≤ start ≤ end ≤ N (the answer is a span in the passage)
• A family of LSTM-based models with attention (2016-2018): Attentive Reader (Hermann et al., 2015), Stanford Attentive Reader (Chen et al., 2016), Match-LSTM (Wang et al., 2017), BiDAF (Seo et al., 2017), Dynamic Coattention Network (Xiong et al., 2017), DrQA (Chen et al., 2017), R-Net (Wang et al., 2017), ReasoNet (Shen et al., 2017), …
• Fine-tuning BERT-like models for reading comprehension (2019+)
LSTM-based vs. BERT models
Image credit: (Seo et al, 2017) Image credit: J & M, edition 3
Recap: seq2seq model with attention
• Instead of a source and a target sentence, we have two sequences: the passage and the question (their lengths are imbalanced)
• We need to model which words in the passage are most relevant to the question (and which question words). Attention is the key ingredient here, similar to modeling which words in the source sentence are most relevant to the current target word…
• We don’t need an autoregressive decoder to generate the target sentence word-by-word. Instead, we just need to train two classifiers to predict the start and end positions of the answer!
BiDAF: the Bidirectional Attention Flow model
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

BiDAF: Encoding
• Use a concatenation of word embedding (GloVe) and character embedding (CNNs over character embeddings) for each word in context and query.
e(c_i) = f([GloVe(c_i); charEmb(c_i)]),  e(q_i) = f([GloVe(q_i); charEmb(q_i)])
(f: highway networks, omitted here)

• Then, use two bidirectional LSTMs separately to produce contextual embeddings for both context and query:

→c_i = LSTM(→c_{i-1}, e(c_i)) ∈ R^H,  →q_i = LSTM(→q_{i-1}, e(q_i)) ∈ R^H
←c_i = LSTM(←c_{i+1}, e(c_i)) ∈ R^H,  ←q_i = LSTM(←q_{i+1}, e(q_i)) ∈ R^H
c_i = [→c_i; ←c_i] ∈ R^{2H},  q_i = [→q_i; ←q_i] ∈ R^{2H}

BiDAF: Attention
• Context-to-query attention: For each context word, choose the most relevant words from the query words.
(Slides adapted from Minjoon Seo)
• Query-to-context attention: choose the context words that are most relevant to one of query words.
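Putting the two attention directions together, here is a minimal NumPy sketch (random vectors stand in for the BiLSTM outputs; the dimensions and variable names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, H2 = 6, 3, 8                  # context length, query length, 2H
C = rng.standard_normal((N, H2))    # stand-ins for contextual embeddings c_1..c_N
Q = rng.standard_normal((M, H2))    # stand-ins for contextual embeddings q_1..q_M
w_sim = rng.standard_normal(3 * H2) # trainable similarity weights

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Similarity matrix: S[i, j] = w_sim^T [c_i; q_j; c_i * q_j]
S = np.array([[w_sim @ np.concatenate([C[i], Q[j], C[i] * Q[j]])
               for j in range(M)] for i in range(N)])

# Context-to-query: a_i = sum_j alpha_{i,j} q_j
alpha = softmax(S, axis=1)          # (N, M), rows sum to 1
A = alpha @ Q                       # (N, 2H)

# Query-to-context: a single attended vector b, shared by all positions
beta = softmax(S.max(axis=1))       # (N,)
b = beta @ C                        # (2H,)

# Final output g_i = [c_i; a_i; c_i * a_i; c_i * b]
G = np.concatenate([C, A, C * A, C * b], axis=1)
print(G.shape)  # (6, 32), i.e., (N, 8H)
```

Note that query-to-context attention produces one vector b for the whole passage, which is then tiled across all positions when forming g_i.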
• First, compute a similarity score for every pair (c_i, q_j):
    S_{i,j} = w_sim^T [c_i; q_j; c_i ⊙ q_j] ∈ R,  w_sim ∈ R^{6H}
• Context-to-query attention (which question words are more relevant to c_i):
    α_{i,j} = softmax_j(S_{i,j}) ∈ R^M,  a_i = Σ_{j=1}^{M} α_{i,j} q_j ∈ R^{2H}
• Query-to-context attention (which context words are relevant to some question words):
    β_i = softmax_i(max_{j=1}^{M} S_{i,j}) ∈ R^N,  b = Σ_{i=1}^{N} β_i c_i ∈ R^{2H}
• The final output is g_i = [c_i; a_i; c_i ⊙ a_i; c_i ⊙ b] ∈ R^{8H}

BiDAF: Modeling and output layers
• Modeling layer: pass g_i to another two layers of bidirectional LSTMs:
    m_i = BiLSTM(g_i) ∈ R^{2H}
  • The attention layer models interactions between query and context
  • The modeling layer models interactions within the context words
• Output layer: two classifiers predicting the start and end positions:
    p_start = softmax_i(w_start^T [g_i; m_i]),  p_end = softmax_i(w_end^T [g_i; m'_i])
    where m'_i = BiLSTM(m_i) ∈ R^{2H} and w_start, w_end ∈ R^{10H}
• The final training loss is
    L = −log p_start(s*) − log p_end(e*)

BiDAF: Performance on SQuAD
This model achieved 77.3 F1 on SQuAD v1.1.
• Without context-to-query attention ⟹ 67.7 F1
• Without query-to-context attention ⟹ 73.7 F1
• Without character embeddings ⟹ 75.4 F1
(Seo et al., 2017): Bidirectional Attention Flow for Machine Comprehension

Attention visualization
BERT for reading comprehension
• BERT is a deep bidirectional Transformer encoder pre-trained on large amounts of text (Wikipedia + BooksCorpus)
• BERT is pre-trained with two training objectives:
  • Masked language modeling (MLM)
  • Next sentence prediction (NSP)
• BERT-base has 12 layers and 110M parameters; BERT-large has 24 layers and 330M parameters
BERT for reading comprehension
• Question = Segment A
• Passage = Segment B
• Answer = predicting two endpoints in Segment B
L = −log p_start(s*) − log p_end(e*)
p_start(i) = softmax_i(w_start^T h_i),  p_end(i) = softmax_i(w_end^T h_i)

where H = [h_1, h_2, …, h_N] are the hidden vectors of the paragraph, returned by BERT

Image credit: https://mccormickml.com/
• All the BERT parameters (e.g., 110M) as well as the newly introduced parameters w_start, w_end (e.g., 768 × 2 = 1536) are optimized together for the loss L.
• It works amazingly well. Stronger pre-trained language models lead to even better performance, and SQuAD has become a standard dataset for testing pre-trained models.
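The span-prediction objective above can be sketched with NumPy (random vectors stand in for BERT's hidden states, and the dimensions are toy-sized; this is an illustration of the objective, not a training script):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 16                       # paragraph length, hidden size (768 for BERT-base)
H = rng.standard_normal((N, d))     # h_1..h_N: stand-ins for BERT's paragraph vectors
w_start = rng.standard_normal(d)    # newly introduced parameters
w_end = rng.standard_normal(d)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# p_start(i) = softmax_i(w_start^T h_i), and likewise for p_end
p_start = softmax(H @ w_start)
p_end = softmax(H @ w_end)

# Training loss for a gold span (s*, e*)
s_true, e_true = 3, 5
loss = -np.log(p_start[s_true]) - np.log(p_end[e_true])

# At inference, pick the highest-scoring valid span (start <= end)
start, end = max(((i, j) for i in range(N) for j in range(i, N)),
                 key=lambda ij: p_start[ij[0]] * p_end[ij[1]])
```

The start and end classifiers are independent at training time; the start ≤ end constraint is only enforced when decoding the answer span.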
                    F1     EM
Human performance   91.2*  82.3*
BiDAF               77.3   67.7
BERT-base           88.5   80.8
BERT-large          90.9   84.1
XLNet               94.5   89.0
RoBERTa             94.6   88.9
ALBERT              94.8   89.3
(dev set, except for human performance)

Comparisons between BiDAF and BERT models
• The BERT model has many more parameters (110M or 330M); BiDAF has ~2.5M parameters.
• BiDAF is built on top of several bidirectional LSTMs, while BERT is built on top of Transformers (no recurrent architecture, and easier to parallelize).
• BERT is pre-trained, while BiDAF only builds on GloVe (all the remaining parameters need to be learned from the supervised datasets).
Pre-training is clearly a game changer, but it is expensive…
Comparisons between BiDAF and BERT models
Are they really fundamentally different? Probably not.
• BiDAF and other models aim to model the interactions between question and passage.
• BERT uses self-attention over the concatenation of question and passage = attention(P, P) + attention(P, Q) + attention(Q, P) + attention(Q, Q)
• (Clark and Gardner, 2018) show that adding a self-attention layer over the passage, attention(P, P), to BiDAF also improves performance.
Can we design better pre-training objectives?
The answer is yes!
Two ideas:
1) Masking contiguous spans of words instead of 15% random words
2) Using the two endpoints of a span to predict all the masked words in between = compressing the information of a span into its two endpoints
(Joshi & Chen et al., 2020): SpanBERT: Improving Pre-training by Representing and Predicting Spans

SpanBERT performance
[Bar chart of F1 scores comparing Google BERT, our BERT, and SpanBERT on SQuAD v1.1, SQuAD v2.0, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions]
Is reading comprehension solved?
• We have already surpassed human performance on SQuAD. Does that mean reading comprehension is solved? Of course not!
• Current systems still perform poorly on adversarial examples and on examples from out-of-domain distributions
(Jia and Liang, 2017): Adversarial Examples for Evaluating Reading Comprehension Systems
Systems trained on one dataset can’t generalize to other datasets:
(Sen and Saffari, 2020): What do Models Learn from Question Answering Datasets?
BERT-large model trained on SQuAD
(Ribeiro et al., 2020): Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
3. Open-domain question answering
Question (Q) Answer (A)
• Different from reading comprehension, we don't assume a given passage.
• Instead, we only have access to a large collection of documents (e.g., Wikipedia). We don't know where the answer is located; the goal is to return the answer to any open-domain question.
• A much more challenging, but much more practical, problem!
This is in contrast to closed-domain systems, which deal with questions in a specific domain (medicine, technical support, …).
Retriever-reader framework
[Diagram: a Document Retriever selects relevant articles from Wikipedia, and a Document Reader extracts the answer span]
https://github.com/facebookresearch/DrQA
Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions

Retriever-reader framework
• Input: a large collection of documents 𝒟 = {D1, D2, …, DN} and a question Q
• Output: an answer string A
• Retriever: f(𝒟, Q) ⟶ P1, …, PK (K is pre-defined, e.g., 100)
• Reader: g(Q, {P1, …, PK}) ⟶ A, a reading comprehension problem!
In DrQA:
• Retriever = a standard TF-IDF information-retrieval sparse model (a fixed module)
• Reader = the neural reading comprehension model that we just learned
• Trained on SQuAD and other distantly-supervised QA datasets
  • Distantly-supervised examples: (Q, A) ⟶ (P, Q, A)
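A toy illustration of a sparse TF-IDF retriever in this spirit (the documents and scoring here are made up for illustration; DrQA itself uses hashed bigram TF-IDF features over Wikipedia):

```python
import math
from collections import Counter

docs = {                              # a toy document collection (illustrative)
    "tesla": "tesla studied german arithmetic and religion in smiljan",
    "kannada": "kannada is the official language of karnataka",
    "watson": "ibm watson won jeopardy against human champions",
}

def tfidf_vectors(texts):
    """TF-IDF weight for each word in each document."""
    n = len(texts)
    df = Counter(w for t in texts.values() for w in set(t.split()))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = {name: {w: c * idf[w] for w, c in Counter(t.split()).items()}
            for name, t in texts.items()}
    return vecs, idf

def retrieve(query, k=1):
    """Score each document by the inner product with the query's TF-IDF vector."""
    vecs, idf = tfidf_vectors(docs)
    q = {w: idf.get(w, 0.0) for w in query.lower().split()}
    scores = {name: sum(q.get(w, 0.0) * wt for w, wt in v.items())
              for name, v in vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(retrieve("did tesla study german"))  # ['tesla']
```

Because the retriever only matches surface words, "study" fails to match "studied" here; this brittleness is part of the motivation for trainable dense retrievers.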
Chen et al., 2017. Reading Wikipedia to Answer Open-domain Questions

We can train the retriever too
• Joint training of retriever and reader
• Each text passage can be encoded as a vector using BERT, and the retriever score can be measured as the dot product between the question representation and the passage representation.
• However, this is not easy to train, as there is a huge number of passages (e.g., 21M in English Wikipedia)
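The dot-product retrieval step can be sketched as follows (random vectors stand in for the BERT passage and question encoders; in practice, systems use a library such as FAISS for fast maximum inner product search over millions of vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
num_passages, d = 1000, 64          # 21M passages and d = 768 in the real system

# Offline: encode every passage once with the passage encoder
# (stand-in here: random vectors)
passage_index = rng.standard_normal((num_passages, d)).astype(np.float32)

def retrieve(q_vec, k=5):
    """Score = dot product between question and passage vectors; return top-k ids."""
    scores = passage_index @ q_vec
    return np.argsort(-scores)[:k]

q = rng.standard_normal(d).astype(np.float32)  # stand-in for the question encoder
top = retrieve(q)
print(top.shape)  # (5,)
```

The key design choice is that passages are encoded offline, so only the question needs to be encoded at query time, followed by a nearest-neighbor search over the precomputed index.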
Lee et al., 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering

We can train the retriever too
• Dense passage retrieval (DPR) - We can also just train the retriever using question-answer pairs!
• Trainable retriever (using BERT) largely outperforms traditional IR retrieval models
Karpukhin et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering
http://qa.cs.washington.edu:2020/
Karpukhin et al., 2020. Dense Passage Retrieval for Open-Domain Question Answering

Dense retrieval + generative models
Recent work shows that it is beneficial to generate answers instead of extracting them.
Fusion-in-decoder (FID) = DPR + T5
Izacard and Grave, 2020. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Large language models can do open-domain QA well
… without an explicit retriever stage
Roberts et al., 2020. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

Maybe the reader model is not necessary either!
It is possible to encode all the phrases (60 billion phrases in Wikipedia) using dense vectors and only do nearest neighbor search without a BERT model at inference time!
Seo et al., 2019. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index
Lee et al., 2020. Learning Dense Representations of Phrases at Scale

DensePhrases: Demo
Thanks! [email protected]