TensorFlow 2.0 Question Answering
Anton Nikolaev, Nick Glaser
ICS 661 - Final Report

1. Introduction

Natural language processing (NLP) is one of the domains where the emergence of deep learning (DL) has had the largest impact, improving performance across almost the entire spectrum of NLP. One of the kinds of problems currently being tackled by DL researchers is reading comprehension / question answering (QA). For our project, we joined a Kaggle competition based on a novel QA dataset provided by Google Research titled Natural Questions (NQ). Our goal was to evaluate the performance of some of the current state-of-the-art NLP architectures on this dataset.

2. Problem and Dataset

The goal of the QA task is essentially two-fold: the algorithm is provided with a text passage and a corresponding question from an arbitrary domain, with the goal of first understanding the question and subsequently providing the appropriate answer (given that one exists) extracted from the text passage at hand. For the problem formulation in the Kaggle challenge, the answer can come in four formats: as a long answer, usually about a paragraph in length; as a short answer, no longer than a sentence and as short as a few words; as a yes/no answer, which is treated as a sub-type of the short answer; or, lastly, there could be no answer in the text at all, which the algorithm would also have to identify correctly.

The dataset itself consists of Google queries taken from actual users and, where applicable, sections from Wikipedia articles containing the answer. In contrast to some other QA datasets, NQ also provides answer candidates for each question as well as a context-level indicator. The candidates contain the indices of the respective start and end tokens for each answer. The context indicator is a binary value that signals whether a given candidate answer is also contained within another candidate (nested) or whether it is the only candidate containing the specific passage (top-level). This additional information can help improve model accuracy after the training stage, but is not traditionally used during training itself.

Overall, the dataset contains about 300,000 training examples as well as just under 8,000 test examples that are ultimately used to evaluate our model's performance on Kaggle. An example question from the training data is given in Appendix A. For more details on the dataset, visit:
https://github.com/google-research-datasets/natural-questions

2.1 Metrics

For the competition, a correct answer must have the correct answer type selected and, where applicable, must also provide exactly the correct answer span. If the correct answer type is selected but the answer span is incorrect, the prediction is still considered a misclassification.

The relevant metric is the micro-F1 across all answer types. F1 is the harmonic mean of the model's precision and recall. Since the model has multiple answer types, we take the micro average of the precision and recall over the answer types. So, if we had two answer types and wanted to compute the micro-F1, we would calculate the following:

micro avg. precision = (tp1 + tp2) / (tp1 + tp2 + fp1 + fp2)

micro avg. recall = (tp1 + tp2) / (tp1 + tp2 + fn1 + fn2)

And, thus, the micro-F1 is simply:

micro-F1 = 2 · (m.a. precision · m.a. recall) / (m.a. precision + m.a. recall)
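To make the metric concrete, here is a minimal sketch of how the micro-averaged precision, recall, and F1 could be computed from per-answer-type counts. The tp/fp/fn counts below are made-up placeholder values for illustration, not results from our model.

```python
# Illustrative micro-F1 computation over two answer types.
# The tp/fp/fn counts are made-up placeholder values.
counts = {
    "long":  {"tp": 40, "fp": 15, "fn": 20},
    "short": {"tp": 25, "fp": 10, "fn": 30},
}

tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(f"precision={micro_precision:.3f} recall={micro_recall:.3f} F1={micro_f1:.3f}")
```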
3. Background

The Transformer architecture (Figure 1) has two main distinctive features, its bidirectionality and its multi-headed attention mechanism, and is in concept similar to other encoder-decoder networks [1]. While it is beyond the scope of this report to fully explore the architecture, it is worth highlighting how these features contribute to the impact the architecture has had over the past two years. Both bidirectionality and multi-head attention allow the model to establish more complicated dependencies than other commonly used models such as RNNs, making it better suited to modeling language data. RNNs, specifically LSTMs [3], the most commonly used variant of recurrent networks, are strongly limited by the way in which they process data sequentially, which does not always make sense for language, where relationships between components are often much more complicated. These features, however, come with the consequence that there is no way to support arbitrarily long input sequences; instead, the input sequence length has to be constant, which requires padding or truncating the data. Nevertheless, it is empirically evident that this trade-off is overall beneficial to performance.

Figure 1: Transformer Architecture [1]

While we plan on trying other architectures in the future, thus far our results are limited to BERT [2]. In essence, BERT, like many of the other models, is a very deep network with many stacked Transformer layers and many attention heads per layer. What makes BERT and other large language models so powerful, aside from their incredibly large number of trainable parameters (110M for the small BERT model), is the very scalable initial training procedure. Language models depend heavily on pretraining on large corpora.

In the case of BERT, this pretraining is achieved with two tasks, the first of which is reminiscent of how neural bag-of-words embedding models are trained. The input consists of a paragraph of text with 15% of the words masked, and the model is required to predict those words from their context. The second task is known as next-sentence prediction (NSP). In NSP, the model is given a context paragraph and then a potential next sentence. It has to perform binary classification to indicate whether that sentence truly follows the given context or was simply picked at random from the corpus, with both cases occurring about half of the time. This training task is perhaps more closely related to the QA problem we are interested in. Due to the self-supervised nature of these tasks, it is easy to scale the training to very large corpora, which in BERT's case are the Book Corpus (800M words) and all of English Wikipedia (2,500M words).

4. Our Approach

Currently, most QA datasets have their performance leaderboards dominated by Transformer-based models (for a collection of datasets and leaderboards, visit paperswithcode.com/task/question-answering). We set out to use some of the most prevalent of these architectures and adapt them to the specific formulation of this QA task.

4.1 Model

We downloaded one of the publicly available pretrained BERT models and then adapted and fine-tuned it for the QA task. Because we have to predict both the answer type and the start and end indices of long and short answers, the model needs multiple outputs. Overall, we ended up with three outputs. The first two outputs contain logits corresponding to each token of the input paragraph; these can be understood as values indicating how likely a given token is to be the start or end token of the correct answer to the question. This is explained in more detail in the post-processing discussion below. The last output is a five-unit vector that indicates what type of answer we expect for the input question (short, long, yes, no, no-answer).
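The following Keras-style sketch only illustrates this three-output structure; it is not our exact training code. The stand-in encoder, layer names, and vocabulary size are assumptions made for the example.

```python
import tensorflow as tf

SEQ_LEN = 512          # fixed input length after padding/truncation (Section 4.2)
HIDDEN = 768           # hidden size of the small (base) BERT model
VOCAB_SIZE = 30522     # assumed WordPiece vocabulary size
NUM_ANSWER_TYPES = 5   # short, long, yes, no, no-answer

token_ids = tf.keras.layers.Input(shape=(SEQ_LEN,), dtype=tf.int32, name="token_ids")

# Stand-in for the pretrained BERT encoder: the real model produces contextual
# embeddings of shape (batch, SEQ_LEN, HIDDEN); a plain embedding layer keeps
# this sketch self-contained and runnable.
sequence_output = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN)(token_ids)

# Per-token logits for the start and end positions of the answer span.
start_logits = tf.keras.layers.Flatten(name="start_logits")(
    tf.keras.layers.Dense(1)(sequence_output))
end_logits = tf.keras.layers.Flatten(name="end_logits")(
    tf.keras.layers.Dense(1)(sequence_output))

# Answer-type logits computed from the first ([CLS]) token representation.
cls_vector = tf.keras.layers.Lambda(lambda x: x[:, 0, :], name="cls_vector")(sequence_output)
answer_type_logits = tf.keras.layers.Dense(
    NUM_ANSWER_TYPES, name="answer_type_logits")(cls_vector)

model = tf.keras.Model(inputs=token_ids,
                       outputs=[start_logits, end_logits, answer_type_logits])
model.summary()
```

During fine-tuning, each of these outputs would receive its own classification loss; the term for the answer-type head is the additional loss term mentioned in the data processing section below.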
4.2 Data Processing

Preprocessing has to be done exactly as detailed in the BERT paper [2], since the input format needs to be consistent between the pre-training and fine-tuning/prediction phases. The most notable preprocessing component is the tokenization of the paragraph, which is done by matching tokens against a predefined vocabulary with a greedy longest-match-first approach. After tokenization, the sequence is padded or truncated to a length of 512 tokens.

During fine-tuning, processing the model outputs is still largely consistent with what is described in [2]; however, in order to train the answer-type output layer, an additional loss term has to be added to account for the additional outputs.

Handling the model outputs requires some more work, because the competition was set up differently from common QA tasks, as mentioned above. First, the logits are analyzed. The model is trained to increase the start and end logit scores over the short answer span, so we take the 20 highest-scoring start and end logits respectively and create all possible pairs. From the pool of pairs, we select the viable candidates, meaning those whose start index comes before their end index, and sum their logit scores. The highest-scoring candidate overall is then selected as the short answer. Next, the long answer is selected out of the pool of predefined candidates based on two criteria: it must contain the highest-scoring short answer, and it must be a top-level answer, which prevents us from getting multiple matches (a sketch of this selection procedure is given after the summary). Ultimately, which answer type is chosen depends on both the combined start-end logit score and the outputs in the answer-type vector. The scores are combined and whether they pass an ...

... ensemble their prediction to hopefully create a more robust overall model. With that being said, our current micro-F1 score is 0.57, which, at the time of writing this report, ranks 36th out of approximately 850 contestants. We are confident that the addition of more models; proper answer selection, including proper yes/no answer selection; and ensembling will help further improve that ranking.

6. Summary

For sophisticated NLP tasks like question answering, one of the biggest takeaways from this project is the realization that pre- and post-processing are some of the largest parts of the puzzle.
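To make the candidate-selection step from Section 4.2 concrete, here is a minimal sketch of the pairing and filtering logic. It assumes the start/end logits are available as 1-D arrays and that the long-answer candidates carry start_token, end_token, and top_level fields as in the NQ data; the function names and anything not described in the report are illustrative assumptions.

```python
import numpy as np

TOP_K = 20  # number of highest-scoring start/end logits considered (Section 4.2)

def best_short_answer(start_logits, end_logits, top_k=TOP_K):
    """Return (start, end, score) for the best-scoring viable start/end pair."""
    start_idx = np.argsort(start_logits)[-top_k:]   # top-k start positions
    end_idx = np.argsort(end_logits)[-top_k:]       # top-k end positions
    best = None
    for s in start_idx:
        for e in end_idx:
            if s >= e:                               # keep only pairs with start before end
                continue
            score = float(start_logits[s] + end_logits[e])   # summed logit score
            if best is None or score > best[2]:
                best = (int(s), int(e), score)
    return best

def select_long_answer(candidates, short_span):
    """Pick a top-level candidate that contains the chosen short-answer span.

    candidates: list of dicts with 'start_token', 'end_token', 'top_level' keys.
    short_span: (start, end) token indices of the selected short answer.
    """
    s, e = short_span
    for cand in candidates:
        if cand["top_level"] and cand["start_token"] <= s and e <= cand["end_token"]:
            return cand
    return None
```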