DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Automating Question Generation Given the Correct Answer

HAOLIANG CAO

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Authors

Haoliang Cao School of Electrical Engineering and Computer Science KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden KTH Royal Institute of Technology

Examiner

Viggo Kann School of Electrical Engineering and Computer Science KTH Royal Institute of Technology

Supervisor

Johan Boye School of Electrical Engineering and Computer Science KTH Royal Institute of Technology

Swedish Title

Automatisering av frågegenerering givet det rätta svaret

Abstract

In this thesis, we propose an end-to-end deep learning model for a question generation task. Given a Wikipedia article written in English and a segment of text appearing in the article, the model can generate a simple question whose answer is the given text segment. The model is based on an encoder-decoder architecture. Our experiments show that a model with a fine-tuned BERT encoder and a self-attention decoder gives the best performance. We also propose an evaluation metric for the question generation task, which evaluates both syntactic correctness and relevance of the generated questions. According to our analysis on sampled data, the new metric is found to give better evaluation compared to other popular metrics for sequence-to-sequence tasks.

Keywords

Natural Language Processing, NLP, Natural Language Generation, NLG, Question Generation

Sammanfattning

I den här avhandlingen presenteras en djup neural nätverksmodell för en frågeställningsuppgift. Givet en Wikipediaartikel skriven på engelska och ett textsegment i artikeln kan modellen generera en enkel fråga vars svar är det givna textsegmentet. Modellen är baserad på en kodar-avkodararkitektur (encoder-decoder architecture). Våra experiment visar att en modell med en finjusterad BERT-kodare och en självuppmärksamhetsavkodare (self-attention decoder) ger bästa prestanda. Vi föreslår också en utvärderingsmetrik för frågeställningsuppgiften, som utvärderar både syntaktisk korrekthet och relevans för de genererade frågorna. Enligt vår analys av samplade data visar det sig att den nya metriken ger bättre utvärdering jämfört med andra populära metriker för utvärdering.

Nyckelord

Naturligtspråkbehandling, Naturligtspråkgenerering, Frågegenerering

Acknowledgements

I would like to express my sincere thanks and gratitude to my supervisor, Johan Boye, who gave me the opportunity to make a contribution to the question generation task. Without his patient and careful guidance, it would be impossible for me to complete this thesis. I would also like to thank my examiner, Viggo Kann, who gave me valuable suggestions to improve the quality of my final report.

Contents

1 Introduction
  1.1 Objective
  1.2 Thesis outline

2 Background
  2.1 Introduction of Natural Language Processing
    2.1.1 Features and representations
    2.1.2 Models
  2.2 Natural Language Understanding (NLU)
    2.2.1 BIDAF model
    2.2.2 Other QA models
    2.2.3 Dataset
  2.3 Natural language generation (NLG)
  2.4 Transfer Learning and Pre-trained Language Model

3 Methods
  3.1 Baseline Model
  3.2 ReverseQA Model

4 Implementation
  4.1 Dataset
  4.2 Training
    4.2.1 Baseline model
    4.2.2 ReverseQA model
  4.3 Testing

5 Analysis of the Evaluation Method
  5.1 Metrics for Sequence to Sequence task
  5.2 Readability and Relevance (RaR) Metric
  5.3 Metric Evaluation
    5.3.1 Readability loss
    5.3.2 Overlapping score
    5.3.3 Conclusion

6 Results
  6.1 Quantitative Evaluation
    6.1.1 Baseline model
    6.1.2 ReverseQA model
  6.2 Qualitative Evaluation

7 Discussion
  7.1 About the Metric
  7.2 About the ReverseQA Model
  7.3 Future Work
  7.4 Ethical and Sustainability

Bibliography

A Context for sampled data in Table 5.4
B Context for the typical examples

Chapter 1

Introduction

Natural Language Processing (NLP) is a sub-field of artificial intelligence. Research in this field involves natural language, that is, the language used by humans in their daily life, so it is closely related to the research of linguistics, although there are differences. NLP is not a general study of natural language, but the development of computer systems, especially software systems, that can effectively achieve natural language communication. The realization of natural language communication between human and computer means that the computer can not only understand the meaning of a natural language text but also express certain intentions and thoughts in the form of natural language text. The former is called Natural Language Understanding (NLU), and the latter is called Natural Language Generation (NLG).

The Question Answering (QA) task is an essential part of Natural Language Processing (NLP), or more specifically, of the Natural Language Understanding (NLU) field. By mimicking the reading comprehension test, we assume a machine has a certain level of understanding if it can answer questions about a specific corpus after "reading" it. The past years have witnessed the fast development of models with excellent performance on some well-known QA datasets, and some even outperform human performance [1, 2]. In this degree project, instead of further developing models for the QA task, we would like to reverse the process and generate questions given the answers and related text. Since the design of the QA task aims to test the reading comprehension ability of the machine, the questions designed should also, to some extent, reflect the understanding of the corpus. Besides the NLU part, this project also involves Natural Language Generation (NLG), since the question cannot simply be extracted from the text. The NLG task has always been challenging [3] because we need to both maintain the intention and make the sentence syntactically correct. NLG models have been through the rule-based phase [4] and the RNN-based phase [5], in which models already gave decent performance on some NLG tasks, such as Neural Machine Translation (NMT) [6]. Nowadays, contextual language models such as BERT [7], GPT2 [8], and XLNet [9] have again leveled up the performance on text generation tasks, which provides us with abundant tools to utilize.

If high-quality questions can be successfully generated, possible applications include:

• Help to automatically generate simple questions for reading comprehension tests.

• Help generate more data for QA datasets.

• Help to train the QA model in a semi-supervised manner.

1.1 Objective

The research question is: if we want to ask a question about a piece of information in an article, and the ideal answer to that question is the information itself, how can we automate the question generation? Therefore, we formally define our objective as follows: given a corpus and a segment of text appearing in the corpus, the aim is to train the computer to generate meaningful and reasonable questions about that segment of text. At least one of the answers to the generated question should be the text segment provided as input.

Table 1.1 shows one example from the SQuAD dataset, which is one of the well-known datasets for the QA task. In the QA task, people try to implement models that can output the correct answer "zeta function" given the question "What function is related to prime numbers?" and the context in Table 1.1. However, for the question generation task, given the answer "zeta function" in Table 1.1 and its location in the context, the aim is to train the computer to generate a question that has a similar meaning to the original question "What function is related to prime numbers?". Instead of requiring an exact match, the generated question could be "What is the name of the function that closely relates to prime numbers?" or "Name the function related to prime numbers?". As long as the generated question is readable by a human and the answer to the generated question given the context is "zeta function" or simply "zeta," it will be considered a successful generation.

Table 1.1: One example from the SQuAD dataset [10]

context   The zeta function is closely related to prime numbers. For example, the aforementioned fact that there are infinitely many primes can also be seen using the zeta function: if there were only finitely many primes then (1) would have a finite value. However, the harmonic series 1 + 1/2 + 1/3 + 1/4 + ... diverges (i.e., exceeds any given number), so there must be infinitely many primes. Another example of the richness of the zeta function and a glimpse of modern algebraic number theory is the following identity (Basel problem), due to Euler,
question  What function is related to prime numbers?
answer    zeta function

1.2 Thesis outline

In chapter 2, the basic concepts of NLP are introduced, including the objective of each sub-field and the available methods. We will present the details of some models whose architectures are also applied in this thesis. In chapter 3, we will first propose two end-to-end deep learning models for the question generation task. Both models have the standard encoder-decoder structure and have decoders with multi-head attention layers. The model with an encoder structure similar to QANet [11] will be used as the baseline. In the other model, named ReverseQA, the pre-trained BERT encoding will be used as the encoder. All the concepts mentioned above are explained in chapter 2. In chapter 4, we will present the details of how we process the data and how we implement the models mentioned above. We will then propose a new evaluation metric for the question generation task in chapter 5. The performance of the new metric on the question generation task will be compared with other well-known evaluation metrics for sequence-to-sequence tasks. In chapter 6, the performance of the baseline model and the ReverseQA model, with and without fine-tuning the encoder, will be compared. We also perform experiments to study the effect of different hyper-parameters on the ReverseQA model. Finally, the generated questions from the best model are presented and discussed.

Chapter 2

Background

In this chapter, the background of the question generation task is divided into three sections. The first section covers the machine learning paradigms used in the Natural Language Processing (NLP) field. The second section consists of the methods applied in the Natural Language Understanding (NLU) field; we will present both the history and recent developments of how the information in a document is extracted and re-weighted with respect to queries. In the last section, we will focus on the Natural Language Generation (NLG) task and introduce popular models in this field.

2.1 Introduction of Natural Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence that deals with the digitized information in the form of natural language. The general objective is that we want the computer to analyze the text information and build a mapping from the raw text to the outcome we want.

2.1.1 Features and representations

For an NLP model to perform the mapping from the given text input to the desired outcomes, we need to express the meaning of the text as something that can be passed into functions and mathematically processed. The features used in NLP tasks are defined as the elementary pieces of evidence that link aspects of the text and the target. There are various ways of constructing features. We could use part-of-speech tagging [12] or named-entity recognition [13] to analyze each part of the text and design rules to select the combinations that most relate to the target. These hand-built features can have excellent performance for some specific tasks [14]. However, building these features requires human experts and is often time-consuming.

Words, as the base units that express the meaning of language, are perfect candidates for automating the feature construction. However, we need a proper method to convert each word to a numerical representation. A straightforward way to make a word mathematically processable is to assign each word an integer number. However, since integer values have a natural ordering, this method introduces bias. To solve this problem, the one-hot encoding method was proposed. The idea behind one-hot encoding is to represent each word using a vector in which only one entry has value 1 and all the other entries have value 0. This method has proved to give good performance on various tasks. However, since it treats each word as an independent categorical symbol, the total length of one one-hot vector is equal to the size of our vocabulary, which could be enormous. Also, there is no way to measure the similarity between words using the one-hot representation. To solve this problem, the TF-IDF matrix was introduced by combining the term frequency [15] and inverse document frequency [16]. The basic idea is that each word is assigned a score according to its frequency in one document and across all documents. We assume two words have similar meanings if the cosine distance of their TF-IDF vectors is small. By applying a dimensionality reduction method such as SVD, as in [17], we can obtain word vectors of much lower dimension that still preserve the similarity.
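As a concrete illustration of the TF-IDF idea above, the following is a minimal sketch that is not part of the thesis; the toy corpus and the use of scikit-learn are assumptions made only for the example. Word vectors are taken as columns of the document-term TF-IDF matrix and compared with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; each string plays the role of one document.
docs = [
    "the zeta function is related to prime numbers",
    "there are infinitely many primes",
    "the harmonic series diverges",
]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(docs)   # (n_docs, n_terms) TF-IDF matrix

# A word vector is one column of the document-term matrix: its entries are
# the word's TF-IDF scores across all documents.
word_vectors = doc_term.T.toarray()
vocab = vectorizer.vocabulary_              # word -> column index

v_zeta = word_vectors[vocab["zeta"]].reshape(1, -1)
v_function = word_vectors[vocab["function"]].reshape(1, -1)
v_harmonic = word_vectors[vocab["harmonic"]].reshape(1, -1)

print(cosine_similarity(v_zeta, v_function))   # words sharing documents -> high similarity
print(cosine_similarity(v_zeta, v_harmonic))   # no shared documents -> similarity 0
```

Words that tend to occur in the same documents end up with similar column vectors, which is exactly the similarity notion used above.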

However, if we add or delete a document from the corpus, we need to reconstruct the TF-IDF matrix. With the help of deep learning, Mikolov et al. proposed a learning-based method in [18]. The idea is that, given a word vector, we use a neural network to predict the words around it within a fixed window size. By training on a large corpus, we can obtain vector representations of all the words in the vocabulary. The GloVe word representation [19] combines the idea of the TF-IDF matrix and the prediction-based method. Instead of using the weighted scores of the words, Jeffrey et al. construct a co-occurrence matrix by counting the number of times two words co-occur in a fixed-size window. The matrix is then used as the object of the prediction. This new way of constructing word vectors dramatically improved the performance of models in the NLP field. However, since the representation of one word is fixed and often pre-trained on a large corpus, it does not capture the contextual information of the word's current environment.

Matthew et al. proposed to add contextual knowledge to the word vector by using a bi-directional LSTM, a deep learning model [20]. Devlin et al. further developed this idea and built the well-known BERT model [7] by using part of the components of the Transformer model [21] as the contextual embedding. Nowadays, various pre-trained language models such as BERT [7], GPT2 [8], and XLNet [9] have provided us with reliable word representation tools for all kinds of studies in the NLP field.

2.1.2 Models

The models are the mappings from the given text input to the target. Traditional NLP models are often feature-based. The hand-designed features reflect the understanding of the task by human experts. A model such as the Maximum Entropy Model [22] takes feature values as input and outputs the prediction. By using a machine learning algorithm, the parameters inside the model can be learned.

With the development of deep learning and neural networks, deep learning models have become more and more popular for solving NLP tasks. Well-known models such as the Recurrent Neural Network (RNN) [23], Long Short-Term Memory (LSTM) [24], and the Gated Recurrent Unit (GRU) [25] have been dominant in the field for years. Since the input word representations of the sentence are connected through the recurrent units, the contextual information can pass through the model. At the last unit of the RNN-based model, the input text is encoded and output, as shown in Figure 2.1.

Figure 2.1: RNN structure

The RNN-based model has made great achievements when dealing with sequential data. However, one drawback of these models is that they cannot preserve long-term dependencies. The LSTM and GRU models were invented to deal with this problem, but they only mitigate it. A. Vaswani et al. proposed the Transformer model in [21], which is currently thought to be the best solution to this problem. In the Transformer model, the recurrent idea is discarded and the self-attention mechanism is applied in both the encoder and the decoder. Generally speaking, the attention mechanism is a feature selection method. By assigning weights to all the feature representations, the feature representations with higher assigned weights will have a more significant influence on the final output. As shown in Figure 2.2, the bottom rectangle represents the embedding matrix for the input sequence.

Figure 2.2: Example for generating one representation for one entry

Each column in the matrix is one vector representation of a token in the input sequence. Each entry of the input (the vector in one column of the matrix) is sent to three feed-forward layers, which output three new representations called the query, key, and value. This process can be parallelized with matrix manipulation and outputs three matrices for the queries (Q), keys (K), and values (V). For the self-attention mechanism, take the red vector representation in Figure 2.2 as an example: its new vector representation in the query matrix is marked with orange color. The orange vector is used to perform a dot product with each entry of the key matrix. The result is then sent to the softmax layer, which outputs the weights (the attention) over the entire sequence for the red vector. Then the weighted sum of the value representations is used as a new vector representation of the input red vector. The same process is applied to all the entries of the input (the vectors in each column of the bottom embedding matrix). Therefore, the final output of one self-attention process is a matrix whose column dimension is the sequence length of the input and whose row dimension is the hidden size of the feed-forward layer. A more concise computation flow is shown in Figure 2.3 [21].

Figure 2.3: Scaled Dot-Product Attention (Source: [21])

All the computations mentioned above can be parallelized by matrix manipulation, which is formulated as

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V, \qquad (2.1)
\]

where $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix, respectively. The scaling factor $d_k$ denotes the dimension of each query and is used to prevent the resulting attention from having a large magnitude. After the new representation matrix is generated, the final representation of one self-attention block is produced after the add-and-norm layer. The "add" operation applies the same idea as the well-known residual block in the computer vision field [26]: it adds the input sequence representations to the new representations from the self-attention operation to prevent information loss in a very deep network. The "norm" refers to the layer-norm technique, which normalizes the inputs across the features [27]; in the Transformer case, it normalizes across the tokens of each sequence. To capture multiple aspects of the sequence, the authors of [21] also designed the multi-head mechanism for the attention, as shown in Figure 2.4. For each head, there is a different set of (query, key, value) projections. The structure in Figure 2.4, along with the add-and-norm layer, forms the basic block of the Transformer; the whole model is built by stacking these blocks.

Figure 2.4: Multi-Head Attention (Source: [21])

Unlike the RNN, which uses different gates to maintain the contextual information, the self-attention mechanism in the Transformer model makes it possible for each word to receive information from anywhere in the sequence, so that a new representation vector of this word can be created. Also, because the model discards the recurrence, the computation can be parallelized. However, since the matrix multiplication does not consider the position of each input representation, a positional encoding should be added before the first multi-head attention block. The authors of [21] used sine and cosine functions to create wave-like positional encodings. In [7], however, the positional encodings for the BERT model are simply learned along with the other embeddings.
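As a concrete illustration of Eq. 2.1, a minimal PyTorch sketch (not code from the thesis; the toy shapes and projection layers are chosen only for the example) of scaled dot-product self-attention could look as follows:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k), V: (seq_len, d_v). Implements Eq. 2.1."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (seq_len, seq_len)
    attn = F.softmax(scores, dim=-1)                     # weights over all positions
    return attn @ V                                      # weighted sum of the values

# Toy self-attention: queries, keys and values all come from the same input X.
seq_len, d_model = 5, 16
X = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))
out = scaled_dot_product_attention(W_q(X), W_k(X), W_v(X))
print(out.shape)   # torch.Size([5, 16])
```

Multi-head attention simply runs several such attention operations with different projection matrices and concatenates the results.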

2.2 Natural Language Understanding (NLU)

The Natural Language Understanding and Natural Language Generation (section 2.3) tasks are two sub-tasks of the NLP field. NLU tries to teach the computer to understand the information hidden in the given text. One way to test the level of understanding is to perform the question-answering task. For the closed-domain, extractive QA problem, there exist many models with great performance.

2.2.1 BIDAF model

Minjoon et al. [28] proposed the Bi-Directional Attention Flow (BiDAF) model, as shown in Figure 2.5. In this model, Minjoon et al. integrated both word-level and character-level embeddings and output their weighted sum as the representation.

Figure 2.5: BiDirectional Attention Flow Model (Source: [28])

Then a Bi-LSTM network is used to produce the contextual representation of both the context and the question. The outputted context matrix is represented by $C \in \mathbb{R}^{n \times t}$, where $n$ denotes the hidden size of each representation and $t$ denotes the context length. Similarly, the outputted query matrix is represented by $Q \in \mathbb{R}^{n \times j}$, where $n$ denotes the hidden size of each representation and $j$ denotes the query length. A similarity matrix $S \in \mathbb{R}^{j \times t}$ is calculated by pair-wisely computing the similarity score, using the similarity function in Eq. 2.2, for every column of the context matrix and the question matrix.

\[
\alpha(a, b) = w^T [a; b; a \odot b], \qquad (2.2)
\]

where $a$ and $b$ are two input representation vectors from a column of matrix $C$ and a column of matrix $Q$, respectively, $w$ is a trainable parameter vector, $\odot$ represents element-wise multiplication, and $[;]$ stands for row-wise vector concatenation. The similarity matrix $S \in \mathbb{R}^{j \times t}$ is then used for creating the Context-to-Query Attention and the Query-to-Context Attention, as shown on the right-hand side of Figure 2.5. When computing the Context-to-Query Attention, the softmax function is applied along the rows of matrix $S$, which outputs the weights of each word representation in the query with respect to the different word representations in the context. For each word representation in the context, the weighted sum of the word representations in the query is calculated, which gives the Context-to-Query Attention matrix $C2Q \in \mathbb{R}^{n \times t}$. When computing the Query-to-Context Attention, the max function is first applied across the rows of the similarity matrix.

If the resulting value of the max function for a certain row is large, it means the word represented by this row in the context matrix $C$ highly correlates with at least one word in the query. On the contrary, if the resulting value is small, it means the word represented by this row is not important with respect to the query. Therefore, when the softmax function is applied to the resulting values, we get a weight for each word in the context according to its correlation with the query. The weighted sum is then calculated as one general representation of the context when considering the query. The resulting vector is duplicated $t$ times, which gives the Query-to-Context Attention matrix $Q2C \in \mathbb{R}^{n \times t}$. The $C2Q$ and $Q2C$ matrices, which have the same dimensions, are vertically concatenated and form the final encoding matrix $E \in \mathbb{R}^{2n \times t}$. The final layer eventually outputs the beginning and end positions of the answer span.
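For illustration, a rough PyTorch sketch of the two attention directions described above (not the thesis implementation; the axis conventions follow the matrices $C \in \mathbb{R}^{n \times t}$, $Q \in \mathbb{R}^{n \times j}$, and $S \in \mathbb{R}^{j \times t}$ defined in this section, and the explicit loops favour clarity over speed):

```python
import torch
import torch.nn.functional as F

def bidaf_attention(C, Q, w):
    """C: (n, t) context, Q: (n, j) query, w: (3n,) trainable weight of Eq. 2.2."""
    n, t = C.shape
    _, j = Q.shape
    # Similarity matrix S (j, t) with alpha(a, b) = w^T [a; b; a*b]
    S = torch.empty(j, t)
    for q in range(j):
        for c in range(t):
            a, b = C[:, c], Q[:, q]
            S[q, c] = torch.dot(w, torch.cat([a, b, a * b]))
    # Context-to-Query: for each context word, weights over the query words
    att_c2q = F.softmax(S, dim=0)            # softmax over the query axis
    C2Q = Q @ att_c2q                        # (n, t)
    # Query-to-Context: max over query words, softmax over context words
    m = S.max(dim=0).values                  # (t,)
    b_weights = F.softmax(m, dim=0)
    h = C @ b_weights                        # (n,) general context representation
    Q2C = h.unsqueeze(1).expand(n, t)        # duplicated t times
    return torch.cat([C2Q, Q2C], dim=0)      # (2n, t) final encoding

n, t, j = 8, 6, 4
E = bidaf_attention(torch.randn(n, t), torch.randn(n, j), torch.randn(3 * n))
print(E.shape)   # torch.Size([16, 6])
```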

2.2.2 Other QA models

The BiDAF++ model was proposed in the same paper as the ELMo embedding [20]; it is the ELMo-enhanced BiDAF model and incorporates self-attention. Since the BiDAF model still adopts an RNN-based model for encoding, the computation is slow. Adams et al. solved this problem in their QANet model by replacing the encoding layer with convolutions and self-attention [11]. Besides, instead of using the Query-to-Context Attention as in the BiDAF model, they used the coattention mechanism from the DCN architecture [29] for the attention between document and query.

2.2.3 Dataset

In this subsection, we compare three well-known question-answering datasets: SQuAD 2.0 [10], QuAC [30], and CoQA [31]. The SQuAD 2.0 dataset was developed based on the SQuAD dataset [32]. The main difference between the 2.0 version and the previous version is that the answers in SQuAD 2.0 include an unanswerable choice. It does not assume that every answer can be found in the given text, which prevents the model from simply selecting the span most related to the question rather than performing reading comprehension. Unlike SQuAD 2.0, the QuAC and CoQA datasets are both designed in the form of conversations. The questions in QuAC and CoQA can be a single interrogative word like "why" or "who" because of the multi-turn conversation environment, while the answers can be free-form and can simply be "yes" or "no." The difference between QuAC and CoQA is that, when creating the question-answer pairs, the asker in QuAC cannot see the evidence.

2.3 Natural language generation (NLG)

Unlike the question answering task, the question generation task needs to output a sentence by performing language generation. The NLG task has always been hard because we need to both maintain the intention and make the sentence syntactically correct. One task where NLG has been successfully applied is Neural Machine Translation (NMT). Most models for the NMT task involve the encoder-decoder architecture, which was first proposed by Bob Allen in [33].

Figure 2.6: Illustration of the sequence to sequence model ("ABC" is the input and "WXYZ" is the output)

One way to think of this architecture is that the decoder generates each word from a language model that is conditioned not only on the previous words but also on the encoded information. Figure 2.6 shows the basic sequence to sequence model proposed in [34]. The advantage of this model is that it enables end-to-end learning and smooth text generation. The attention mechanism can also be applied to the seq-to-seq model. As shown in Figure 2.7, instead of only inputting the encoding information to the first recurrent unit, Minh-Thang Luong et al. [35] proposed a model that adds this extra information to each of the recurrent units in the decoder.

Unlike in the training phase, where we can apply the teacher-forcing method (which we describe in chapter 3), words can only be generated one after another during the inference phase. The greedy search method is the simplest way to perform inference: we always choose the word with the highest occurrence probability as the next word. However, since the performance of the model is measured by the quality of the generated sequence, the objective should be to choose the sequence with the highest occurrence probability. Since the complete sequence cannot be seen during the generating process, the word with the highest occurrence probability given the generated words does not always belong to the sequence with the highest probability. Therefore, the beam search method is proposed. In the beam search method, we keep recording, at each step, the top-k candidate words with the k highest occurrence probabilities.

Figure 2.7: Global attentional model (Source: [35])

For all the resulting sequences, we re-rank them by their occurrence probability, calculated by multiplying together the individual probabilities of each word in the sequence. The beam search method yields better results than the greedy search method but is also more computationally demanding.
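The following is a minimal sketch of the beam search procedure just described (not the thesis code; step_fn is a stand-in for the decoder's next-token distribution, and all names are illustrative):

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """step_fn(prefix) -> list of (token, probability) for the next token.
    Keeps the beam_size partial sequences with the highest log-probability."""
    beams = [([start_token], 0.0)]             # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in step_fn(seq):
                candidates.append((seq + [tok], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, logp))
        if not beams:
            break
    finished.extend(beams)
    # Re-rank the complete hypotheses by their overall (log-)probability.
    return max(finished, key=lambda c: c[1])[0]
```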

2.4 Transfer Learning and Pre-trained Language Model

Recently, pre-trained language models such as BERT [7], GPT2 [8], and XLNet [9] have made great progress on all kinds of NLP tasks. The Bidirectional Encoder Representations from Transformers (BERT) model [7] uses the self-attention blocks introduced in the Transformer model (shown in section 2.1.2) as its basic module. However, the number of self-attention layers is much larger than in the Transformer model, and so is the size of the hidden units. The model is trained on an enormous corpus, which is the reason why it can generalize so well. The training task of BERT can be divided into two stages. In the first stage, some words in the corpus are randomly masked and the task is to predict those words given the rest of the corpus. In the second stage, the model is trained by performing Next Sentence Prediction (NSP): two sentences are input to the BERT model with [SEP] between them as the separation token, and the task for the model is to identify whether the latter sentence is the next sentence in the context of the former sentence. As described above, the BERT model is trained in a semi-supervised manner.

However, by changing the final layer of the BERT model, many models dealing with supervised tasks can be fine-tuned on BERT and usually yield better performance. Models like BERT, which can transfer the knowledge learned in the pre-training phase to downstream tasks such as classification, machine reading comprehension, and named entity recognition, are now generally referred to as pre-trained language models. Unlike a traditional language model with a strict statistical definition, models like BERT are called pre-trained language models simply because of their inherent ability to transfer knowledge across tasks, so that different models for different NLP tasks can all use BERT as the backbone, which makes the process much more convenient.

As suggested in BERT [7], we can build the input feature of our task by prepending [CLS] to the answer and placing [SEP] after the answer, in front of the context. The BERT model takes this feature as input and outputs the encoding matrix $E \in \mathbb{R}^{n \times t}$, where $n$ denotes the hidden dimension size of the BERT model and $t$ denotes the sequence length of the context.
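For illustration only, assuming the Hugging Face tokenizer that the implementation chapter uses, such an input could be built as follows (the example strings are made up):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

answer = "zeta function"
context = "The zeta function is closely related to prime numbers."

# Encoding a text pair yields: [CLS] answer tokens [SEP] context tokens [SEP]
encoded = tokenizer(answer, context, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))
print(encoded["token_type_ids"][0])   # 0 for the answer segment, 1 for the context
```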

Chapter 3

Methods

Since we are dealing with a sequence generation task, we apply the classic encoder-decoder structure for the model. In the baseline model, we use the same encoding architecture as in QANet [11], along with a self-attention decoder similar to the decoder structure in the Transformer [21]. We build our ReverseQA model by substituting the encoder of the baseline model with the pre-trained BERT encoding.

Note that our task is to generate a question given the answer and the context. Therefore, a critical task for the model is to combine the given information of the answer and the context so that we get a new context representation, which will help the model generate the question. This is precisely what the encoder part of the model should do. After getting the new context representation, the decoder is responsible for generating the question. At each generation phase, one word of the question is generated. Similar to how a human would write a sentence given some information, by using the attention mechanism in the decoder part of the model, the context representation is re-weighted according to the already generated words. The next word is then generated according to the re-weighted context representation.

3.1 Baseline Model

As shown in Figure 3.1, the context, along with the answer, is first encoded by the embedding layer to get the character- and word-level representations. The encoder block further encodes the representation by considering the contextual information. At the CQAttention layer, the encoded answer representation is used as the query to select the relevant information from the context representation.


After that, we have the final representation of the context, called the context encodings. With the teacher-forcing training method, at each generation phase we assume the previously generated sequence is the same as the reference question. Therefore, the words before the current generating position in the reference question are input to the network and first encoded by the embedding layer to get their representation. Then, in the decoder block, the contextual information of the generated sequence and the context encodings are considered. The information is integrated and sent to the final softmax layer, which outputs a probability distribution over the entire vocabulary. The resulting distribution is compared with the real next word in the reference question, and the loss is calculated according to Eq. 3.1. In detail, the baseline model consists of the following four components:

Figure 3.1: Baseline model architecture

Embedding Layer. The embedding layer we use includes both word- and character-level embeddings. The pre-trained GloVe [19] word vectors with a hidden dimension of 100 are used for the word embedding, while the character embedding is trained along with the training of the baseline model. As shown in Figure 3.2, each character is first mapped to a vector of length 8. Then, the character embedding vectors of one word are horizontally concatenated and padded to the maximum length of 26 to form a matrix. A one-dimensional convolution kernel with a kernel width of 5 is used to extract features. After the max-pooling layer, only one characteristic feature is extracted for each convolution kernel. For each word, the different convolution kernels form the channels. We build the character-level representation of one word by concatenating the results of each channel.

Figure 3.2: Illustration of building one feature for the character embedding

The channel size is set to 100 to match the dimension of the word embedding. The final embedding is built by concatenating the word-level and character-level embeddings. Its dimension is $\mathbb{R}^{n \times l}$, where $n$ denotes the total embedding size, which is 200 in this case, and $l$ denotes the sequence length. The result is still one vector representing each word. There is a highway network layer after the embedding. It assigns weights to the final representation, telling the network whether the word-level or character-level embedding is more informative [36].
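A minimal sketch of this character-level embedding (not the thesis code; the hyperparameters follow the text, while the class name and the character-vocabulary size are placeholders):

```python
import torch
import torch.nn as nn

class CharCNNEmbedding(nn.Module):
    """Character-level word embedding: per-character vectors -> Conv1d -> max-pool."""
    def __init__(self, n_chars=100, char_dim=8, n_channels=100, kernel_width=5):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_channels, kernel_size=kernel_width)

    def forward(self, char_ids):
        # char_ids: (batch_words, max_word_len) integer character indices
        x = self.char_emb(char_ids)          # (batch_words, max_word_len, char_dim)
        x = x.transpose(1, 2)                # (batch_words, char_dim, max_word_len)
        x = torch.relu(self.conv(x))         # (batch_words, n_channels, L_out)
        x, _ = x.max(dim=2)                  # max over time -> one feature per channel
        return x                             # (batch_words, n_channels)

# One 100-dim character-level vector per word, later concatenated with the
# 100-dim GloVe vector to give the 200-dim embedding described above.
emb = CharCNNEmbedding()
words = torch.randint(0, 100, (32, 26))      # a batch of 32 words padded to length 26
print(emb(words).shape)                      # torch.Size([32, 100])
```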

Encoder blocks. In this block, we perform feature extraction with both convolution and multi-head self-attention [21]. The architecture of one encoder block is shown in Figure 3.3. The convolution part of the block is composed of multiple one-dimensional convolution layers. In the self-attention stage, as introduced in section 2.1.2, each feature representation is passed through three feedforward networks. These networks output the key, value, and query vectors for each representation. We dot-product the keys and queries and apply the softmax function to compute the weights, called the attention, of all entries. The layer-normalized weighted sum of all the values according to the attention is used as the input of the next encoder block (as described in section 2.1.2 and shown in Figure 2.2). Two encoder blocks with similar structures are stacked sequentially in this part to output the final contextual feature embedding ($\mathbb{R}^{n \times l}$).

Attention Layer. Since the context itself contains the answer information, we again apply the self-attention mechanism. Instead of using the entire sequence as the query, we only use the embedded features included in the answer span. To simplify the process, we construct the similarity matrix proposed in [37] (also shown in section 2.2.1). We use the result of the similarity matrix to calculate the encoding $\bar{E} \in \mathbb{R}^{2n \times t}$, where $t$ denotes the context sequence length, with both the Context-to-Query Attention and the Query-to-Context Attention, as demonstrated in section 2.2.1. This encoding is then concatenated with the previous embedding to form the final encoding representation $E \in \mathbb{R}^{3n \times t}$.

Figure 3.3: Architecture of the encoder block

Decoder Block. We use the multi-head attention method in the decoder. The architecture is similar to the decoder in [21]. Decoder blocks with one self-attention layer and one multi-head attention layer are sequentially stacked. In each multi-head attention layer, the self-attended embedding vector of the current token position is used as the query, and the context encodings output from the last layer of the encoder are used as keys and values to compute the weight of each encoding vector. The reason why we use the self-attended embedding vector of the current token position as the query is that it integrates all the information of the generated sequence after the self-attention layer. The computation process is introduced in section 2.1.2. As shown in Figure 3.4, the weighted sum of the context encodings is used as the input of the next decoder block. At the final generation phase, the output of the last decoder block is fed into the feedforward layer, which outputs the probability of choosing each token in the vocabulary as the next token.

Figure 3.4: Illustration of single-head attention for one generation phase

The generated question is then compared with the original question to produce the loss:

\[
L = -\sum_{t=0}^{L-1} \log p_t(T_t), \qquad (3.1)
\]

where $L$ denotes the length of the sequence. When following the teacher-forcing learning method, a mask is needed for each generation phase. The mask not only masks out the encoding vectors representing the padding tokens but also ensures that the decoder cannot see the tokens after the current generating position in the self-attention layer. For example, the mask for a target with five tokens is shown in Figure 3.5. At the first generation phase (first row in Figure 3.5), the token has a mask of [1, 0, 0, 0, 0], which means the only token that can be seen is the token representing the start of the sentence. Similarly, at the second generation phase (second row in Figure 3.5), the token has a mask of [1, 1, 0, 0, 0], which means the tokens that can be seen are the first two tokens of the reference question. In this way, even though we are inputting the reference sequence to the decoder to follow the teacher-forcing method, the decoder cannot cheat by referring to the tokens after the current generating phase.

Figure 3.5: Example of mask (for a target with 5 tokens)
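For illustration only (a sketch, not the thesis code; the vocabulary size and random tensors are placeholders), the mask of Figure 3.5 and the loss of Eq. 3.1 can be produced roughly as follows:

```python
import torch
import torch.nn.functional as F

def teacher_forcing_mask(target_len):
    """Lower-triangular mask: position t may attend only to positions <= t,
    matching the 5-token example in Figure 3.5."""
    return torch.tril(torch.ones(target_len, target_len))

print(teacher_forcing_mask(5))

# Loss of Eq. 3.1: negative log-likelihood of each reference token T_t under the
# predicted distribution p_t. `logits` stands in for the decoder output and
# `targets` for the reference question token ids.
logits = torch.randn(5, 30000)               # (target_len, vocab_size)
targets = torch.randint(0, 30000, (5,))
loss = F.cross_entropy(logits, targets, reduction="sum")   # = -sum_t log p_t(T_t)
```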

3.2 ReverseQA Model

As shown in Figure 3.6, the only difference between the ReverseQA model and the baseline model is that we change the encoder to a pre-trained BERT encoder. The standard method of encoding information for the question answering task using the BERT model is to take the horizontally concatenated question and context vectors as the input. Similarly, we create the input for the BERT encoder by horizontally concatenating the answer and context vectors.

Figure 3.6: ReverseQA model architecture

We notice that if we input the concatenated vector directly to the model, we lose the positional information of the answer. The position information is important, as it tells the model where in the context the answer is located, which can prevent the model from attending to another place where the same token appears. We deal with this problem by changing the segment embedding in the BERT input representation, as shown in Figure 3.7.

In the segment embeddings of the BERT model, $E_A$ and $E_B$ are used to distinguish the tokens of sequence A and sequence B. We introduce a third segment encoding $E_C$ to indicate the answer location in the context. In other words, the segment embeddings for the answer inside the context are changed from $E_B$ to $E_C$. After this manipulation, each token in the answer has $E_A$ as its segment embedding, each token in the context that is not part of the answer has $E_B$ as its segment embedding, and each token in the context that is part of the answer has $E_C$ as its segment embedding.

Figure 3.7: BERT input representation (Source: [7])

However, in the experiments we found that this trick only works when the BERT encoder part is fine-tuned, which makes sense because the representation $E_C$ in the segment embeddings is never seen in any pre-trained encoding. The dimension size of the hidden state of the bert-base-uncased model is 768. Therefore, to conform to the encoded representation from the encoder, we need to adjust the sizes of the multi-head attention and self-attention layers in the decoder blocks accordingly.
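One way to realize the third segment embedding with the Hugging Face API could look like the following sketch (not the thesis code; initialising $E_C$ from the pre-trained $E_B$ row is an assumption made only for the example):

```python
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

# Extend the segment (token type) embedding table from 2 rows (E_A, E_B) to 3,
# so that a new segment id 2 can mark the answer span inside the context (E_C).
old = bert.embeddings.token_type_embeddings          # nn.Embedding(2, hidden)
new = nn.Embedding(3, bert.config.hidden_size)
with torch.no_grad():
    new.weight[:2] = old.weight                       # keep pre-trained E_A, E_B
    new.weight[2] = old.weight[1]                     # assumption: start E_C from E_B
bert.embeddings.token_type_embeddings = new
bert.config.type_vocab_size = 3

# token_type_ids are then built as: 0 for answer tokens, 1 for context tokens
# outside the answer span, and 2 for context tokens inside the answer span.
```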

Chapter 4

Implementation

In this chapter, we introduce the details of how we implement all the experiments. We start by introducing the data we used in section 4.1, followed by an introduction of the platform we run the experiments on and the choice of each hyperparameter. We also introduce the beam search trick utilized in the testing phase in section 4.3.

4.1 Dataset

We chose to use the public QA dataset SQuAD 2.0 [10] for both our training and evaluation. It contains more than 100,000 questions for the top 10,000 English Wikipedia articles assessed by Wikipedia's internal PageRanks.

Table 4.1: Categories for the answers in SQuAD 1.1 dataset (source: [32])

Answer type          Percentage   Example
Date                 8.9%         19 October 1512
Other Numeric        10.9%        12
Person               12.9%        Thomas Coke
Location             4.4%         Germany
Other Entity         15.3%        ABC Sports
Common Noun Phrase   31.8%        property damage
Adjective Phrase     3.9%         second-largest
Verb Phrase          5.5%         returned to Earth
Clause               3.7%         to avoid trivialization
Other                2.7%         quietly

The main difference between the 2.0 version and the previous 1.1 version [32] is that the answers in SQuAD 2.0 include an unanswerable choice. It does not assume that every answer can be found in the given text, which prevents the model from simply selecting the span most related to the question rather than performing reading comprehension. Following SQuAD 1.1, the answer categories for the answerable questions in SQuAD 2.0 are shown in Table 4.1. Since we are performing the reversed process of reading comprehension, the articles and answers are pair-wisely used as the input of the model, while the original questions are used as the labels.

Each paragraph of an article in the SQuAD 2.0 data set is stored in a list. As shown in Figure 4.1, each element of the list contains the context and the questions. The questions are also stored in a list. Each question has its question id, question content, corresponding answers, and a boolean value indicating whether the question can be answered. For each answer, the starting index of the answer in the context string and the full text of the answer are given.

Figure 4.1: Demonstration of the SQuAD 2.0 dataset
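For illustration, the nested structure described above can be traversed as follows (a sketch, not the thesis preprocessing code; the file name is a placeholder):

```python
import json

with open("train-v2.0.json") as f:
    squad = json.load(f)

examples = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            if qa["is_impossible"]:          # skip the unanswerable questions
                continue
            question = qa["question"]
            answer = qa["answers"][0]
            # model input: context, answer text, answer start index; label: question
            examples.append((context, answer["text"], answer["answer_start"], question))
```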

For the baseline model, the NLTK tokenizer¹ is used. As shown in Figure 4.2, only 1.48% of the contexts have a tokenized length larger than 300. To avoid having to add lots of padding tokens to a batch because of a few long contexts, we discard any context with a tokenized length larger than 300. Also, only 0.046% of the answers and 0.068% of the questions have a tokenized length larger than 30. Therefore, the maximum tokenized answer length and question length are both set to 30.

For the ReverseQA model, the WordPiece tokenizer is used to conform to the BERT embedding. It is included in the Hugging Face² library. Since the BERT model can only handle inputs with a length less than or equal to 512, and the input consists of the concatenated answer and context, we discarded the few data points where this limit is exceeded.

¹ NLTK: http://www.nltk.org/
² Hugging Face: https://huggingface.co/

Figure 4.2: Distribution of the tokenized context length for the training and dev dataset

4.2 Training

We implement our model in Python using PyTorch. Our code can be found in our repository on GitHub³. The implementation of QANet⁴ is used as a reference when building the encoder part of our baseline model. The pre-trained BERT encoding is provided by the Hugging Face library². The training is carried out on the Google Cloud Platform (GCP) with an instance having 4 vCPUs and 15 GB of RAM. The model is trained on an NVIDIA Tesla T4 GPU with 16 GB of RAM.

4.2.1 Baseline model

For a total of 125,670 training data points, 3,719 test data points, and 929 dev data points, we set the training batch size to 60 and the dev batch size to 100. We use dropout for regularization with the dropout rate set to 0.2. It is deployed mostly at the last layer of each block, but also between layers where the number of trainable parameters is large. The hidden dimension and the number of convolution channels of the model are set to 96. Since 96 is divisible by 3 and 4, we can set the number of heads in the multi-head attention to 3 or 4. The Adam optimizer with learning rate $2 \times 10^{-4}$ works well for the model. The hyperparameters of the optimizer are set by convention as in [38] ($\beta_1 = 0.8$, $\beta_2 = 0.999$).

³ ReverseQA: https://github.com/Haoliang-rp/ReverseQA
⁴ QANet-pytorch: https://github.com/setoidz/QANet-pytorch

4.2.2 ReverseQA model

Due to the limitation of the GPU RAM, for the BERT fine-tuning case the training and dev batch sizes can only be set to 20. For the model without fine-tuning, the same optimizer as in the baseline model is applied. However, for the fine-tuning case, we use the AdamW optimizer [39]. Also, since the parameters in the BERT encoding part require a much smaller learning rate, we separate those parameters and set their initial learning rate to $1 \times 10^{-6}$.
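As an illustration of this optimizer setup (a sketch with placeholder module names, not the actual ReverseQA code), the two parameter groups with different learning rates can be defined as follows:

```python
import torch
import torch.nn as nn

# Placeholder model: in the real ReverseQA model the encoder is the BERT
# encoder and the decoder is the self-attention decoder.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(768, 768)
        self.decoder = nn.Linear(768, 768)

model = ToyModel()
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-6},  # BERT part: small LR
    {"params": model.decoder.parameters(), "lr": 2e-4},  # decoder: larger LR
])
```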

4.3 Testing

In the testing phase, we implemented the beam search method for inference. As shown in Table 4.2, "what" is considered to have a higher probability of occurrence than "where" when using the beam search method. The resulting questions of the beam search are re-ranked according to their overall probability. After the re-ranking, the question "what is the bronx ' s borough ' s name ?" is chosen for the following evaluation. We evaluate the model with the RaR metric introduced in section 5.2.

Table 4.2: Generated questions using different inference methods

original question                   What is the Bronx's borough name?
question inferred by greedy search  where is the bronx ' s borough ?
question inferred by beam search    what is the bronx ' s borough ' s name ?
                                    what is the bronx ' s borough called ?
                                    what is the bronx ' s ?
                                    what is the bronx ' s borough ' s borough ?
                                    what is the bronx ' s borough ' s name of ?
                                    what is the bronx ' s borough ' s name of ? ?
                                    what is the bronx ' s borough ' s name of ? "

The readability loss is based on the probability results calculated by bert-as-language-model⁵ from xu-song. The QA model used in the overlapping score is again from the Hugging Face library². The pre-trained model in the library fine-tuned on SQuAD is called "bert-large-uncased-whole-word-masking-finetuned-squad".
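For illustration, the QA model named above can be queried through the Hugging Face pipeline API roughly as follows (a sketch, not the thesis code; the context string is truncated and purely illustrative):

```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)
result = qa(
    question="what is the bronx ' s borough ' s name ?",
    context="The Bronx is a borough of New York City ...",
)
print(result)   # e.g. {'answer': ..., 'start': ..., 'end': ..., 'score': ...}
```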

⁵ bert-as-language-model: https://github.com/xu-song/bert-as-language-model

Chapter 5

Analysis of the Evaluation Method

In chapter 5, we focus on the evaluation method for our project. We first introduce several widely used evaluation metrics and discuss their advantages and disadvantages in section 5.1. Then we propose a new metric in section 5.2. The proposed metric is evaluated and compared with the BLEU score in section 5.3.

5.1 Metrics for Sequence to Sequence task

BLEU. BLEU (Bilingual Evaluation Understudy) [40] is one of the most widely used metrics for sequence-to-sequence tasks. It compares the n-gram text segments between the generated sequences and the reference sequences. The pair-wise comparison scores are averaged to produce a final score evaluating the overall quality of the text generation. The BLEU score is an automatic metric that is quick to compute and easy to implement; therefore, it is widely used in sequence-to-sequence tasks, especially in the machine translation (MT) task. However, the BLEU score does not consider the syntactic correctness of the sentence. In addition, BLEU is only good when the generated text is short and the sentence structure is similar to that of the target text.

METEOR. METEOR (Metric for Evaluation of Translation with Explicit ORdering) [41] tries to address the problem that the BLEU score does not consider recall between the translations and the reference texts. Also, to avoid using high-order n-grams as a complementary way to increase the score's ability to measure grammatical well-formedness, METEOR introduces an explicit grammaticality factor.


WER. WER (Word Error Rate) is based on the edit distance. In WER, the Levenshtein distance is applied to the words of the generated sequence and the reference text, which can be formulated as

\[
\mathrm{WER} = \frac{S + D + I}{N}, \qquad (5.1)
\]

where $S$ stands for the number of word substitutions, while $D$ and $I$ stand for the numbers of word deletions and insertions, respectively. $N$ is the number of words in the reference sequence. The WER score is not used as frequently as the above two metrics. The reason it is mentioned here is that in a similar question generation task [42], WER is used as the evaluation metric. Also, it is shown in [43] that WER correlates with perplexity by a power law.
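A minimal sketch of the WER computation in Eq. 5.1 (illustrative code, not taken from [42] or [43]):

```python
def word_error_rate(reference, hypothesis):
    """WER via word-level Levenshtein distance (substitutions + deletions + insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference and j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("what function is related to prime numbers",
                      "what is related to prime numbers"))   # 1 deletion / 7 words
```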

5.2 Readability and Relevance (RaR) Metric

Like the metrics introduced in section 5.1, most of the existing metrics for sequence-to-sequence tasks evaluate explicit word-to-word matching between the generated sequence and the reference sequence. These metrics work well for the machine translation task, since words expressing the same meaning in different languages usually have a mapping relation. However, for the chat-bot task, the evaluation scores of these metrics are found to have a poor correlation with human evaluation scores [44], and the same holds for our question generation task.

Table 5.1 shows two sampled questions generated using the model described in section 3.2. BLEU-n stands for the BLEU score with n-gram overlap. According to the context, and given the answer of sample 1, the generated question "What bridge connects the freeway" is a reasonable question to ask. However, when using the BLEU score based on n-gram precision with n larger than 2, we get a zero score for the generated question. In sample 2, the generated question is scored 0.34 for BLEU-2. However, the object is wrong, which will lead to a completely different result if we answer the generated question. From the samples, we can see that, unlike in the machine translation task, the reference question is often not the only option for asking a question. Also, the reference questions have varying sentence structures and linguistic complexity, which makes it difficult for the model to learn the pattern. Even if a question with similar word choice and sentence structure is generated (and therefore has a high BLEU score), a simple divergence of the object or an article will make it completely different in meaning from the reference question.

Here, we propose a metric that uses a currently existing QA model to directly capture the relevance of the generated question given the context and answer.

Table 5.1: Two Sampled Results

sample 1
  reference question:  Name a bridge in the city?
  generated question:  What bridge is connected to the freeway?
  BLEU-2 score:        0.0

sample 2
  reference question:  In what year did the Times Literary Supplement begin publishing online?
  generated question:  In what year was the Times published?
  BLEU-2 score:        0.50

We assume a generated question to be relevant to the given context and answer if the answer to the generated question, given the context, is the input answer. To automate the evaluation, we use a currently existing QA model to construct what we call the overlapping score. Since the performance of the top models on the SQuAD leaderboard has exceeded human performance, we assume the answers from these QA models on the SQuAD dataset are reliable. After inputting the generated question and the context to the QA model, we get the indices of the start and end positions of the answer. To compute the overlapping score, we first obtain the intersection and the union of the index lists of the answer span from the QA model and of the real answer span. The score is formulated by calculating the Jaccard index, which is

\[
\mathrm{score} = \frac{L_{\mathrm{intersection}}}{L_{\mathrm{union}}}, \qquad (5.2)
\]

where $L_{\mathrm{intersection}}$ denotes the length of the intersection list and $L_{\mathrm{union}}$ denotes the length of the union list. As shown in Figure 5.1, the index list of the real answer span is [57, 58, 59] and the index list of the answer span output by the QA model is [57, 58], so the overlapping score for this example is $2/3 \approx 0.67$.
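A minimal sketch of the overlapping score of Eq. 5.2 (illustrative, not the thesis implementation):

```python
def overlapping_score(real_span, predicted_span):
    """Jaccard index (Eq. 5.2) between two lists of token indices."""
    real, pred = set(real_span), set(predicted_span)
    if not real and not pred:
        return 1.0
    return len(real & pred) / len(real | pred)

# The example from Figure 5.1: real answer indices [57, 58, 59],
# QA-model answer indices [57, 58] -> 2/3.
print(overlapping_score([57, 58, 59], [57, 58]))   # 0.666...
```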

Figure 5.1: Overlapping score example

The overlapping score mentioned above only captures the relevance between the generated question and the original context and answer. However, it cannot measure whether the generated sequence is a readable sentence. If the model only picks up the important pieces of information from the context and assembles them together, there is a chance that the generated sequence scores high on the overlapping score. Therefore, we introduce another scoring function, based on a language model, to evaluate the readability of the generated sequence. A language model is a probability distribution over sequences of words. We can derive the likelihood of a word given the preceding words in the sequence through an autoregressive language model. In our RaR metric, we use the non-autoregressive BERT language model to score the sequence, so that the tokens appear in the vocabulary of both the language model and our ReverseQA model. For a sequence,

\[
p(\mathrm{sequence}) = p(w_1, w_2, \ldots, w_N), \qquad (5.3)
\]

where $w_i$ represents each word in the sequence. For an autoregressive model, according to the chain rule,

\[
p(\mathrm{sequence}) = p(w_1) \prod_{i=2}^{N} p(w_i \mid w_1, \ldots, w_{i-1}). \qquad (5.4)
\]

However, since BERT is a bidirectional model, we approximate the likelihood of a sequence by

\[
\bar{p}(\mathrm{sequence}) \approx \prod_{i=1}^{N} p(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_N). \qquad (5.5)
\]

We define the final readability loss of a sequence as

\[
S = -\frac{\sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_N)}{N}. \qquad (5.6)
\]

As shown in Eq. 5.6, the lower the score, the more likely the sequence is. The readability loss is very similar to how the entropy is calculated, as in Eq. 5.7:

\[
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1, w_2, \ldots, w_n). \qquad (5.7)
\]

However, since the joint probability of the sequence in the readability loss is approximated by Eq. 5.5, the calculation results of the entropy and the readability loss are not exactly the same but positively correlated.
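For illustration, the readability loss of Eq. 5.6 could be approximated with a masked BERT language model roughly as follows (a sketch assuming a recent version of the Hugging Face transformers library; the scores in this thesis come from the bert-as-language-model tool cited in chapter 4, and this sketch scores wordpiece tokens rather than words):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def readability_loss(sentence):
    """Average negative log-likelihood of each token, where each token is
    predicted by BERT with that position masked out (an approximation of Eq. 5.6)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    positions = list(range(1, len(ids) - 1))        # skip [CLS] and [SEP]
    total = 0.0
    for pos in positions:
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        total -= log_probs[ids[pos]].item()
    return total / len(positions)

print(readability_loss("what is the name of the function related to prime numbers ?"))
```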

5.3 Metric Evaluation

We need to first point out that the best way to test the performance of an evaluation metric is to calculate the correlation between the score given by the metric and the score given by humans. However, due to the resource and time limits of this thesis work, it is too costly to hire people to score the results. Here, we perform some preliminary sanity checks.

5.3.1 Readability loss

First, we test the consistency of the readability loss. We consider the score consistent if, when using it to evaluate sampled sentences from the same data set, there is no high deviation among the scores of the samples. We sample and score 10,000 questions 12 times from the 130,000 original questions in the SQuAD dataset. From Table 5.2, we can calculate that the maximum deviation of the readability loss of each sample from the average score of the 12 samples is 3.96%, which indicates the consistency of the readability loss on the same sequence set. The average readability loss of 3.28 will be used as a baseline score for comparison in section 6.1.

Figure 5.2 demonstrates the distribution of the readability loss for one of the 12 samples mentioned above.

Table 5.2: Readability loss of 10,000 question sequence

sample number     1     2     3     4     5     6
readability loss  3.14  3.31  3.17  3.24  3.29  3.35
sample number     7     8     9     10    11    12
readability loss  3.29  3.35  3.23  3.41  3.24  3.29

There are only a few extreme values when we sample from a certain sequence set; thus, the average value of the readability loss is good enough to represent the general readability of the sequence set.

Figure 5.2: distribution of the readability loss of 10,000 question sequence

To test whether the readability loss can distinguish the not sensible sentences, we take three sets of the sampled questions from the previous experiment. For each question in these three sets, we randomly select several pairs of words and swap their position. As is shown in Table 5.3, the more words swapped the higher the readability loss, which shows that the randomness of the sequence can be captured by the readability loss.

Table 5.3: Average readability loss of three sets of 10,000 question sequence with randomly swapped words

swapped word pairs 1 2 3 4 average readability loss 6.19 7.38 9.26 10.46

5.3.2 Overlapping score Table 5.4 shows some examples of the generated questions along with the BLEU-2 score and overlapping score. According to our analysis, the overlap- ping score is better than the BLEU score at evaluating the relevance between the generated sequence and given answer and context. For example, in exam- ple 1 of Table 5.4, both the original question and generated question are ask- ing numbers about the helicopters and they have the same sentence structure. However, these questions have different meanings, since the determinatives in them are different. Also, for example 7 in Table 5.4, "when" in the original question is substituted with "in what year" in the generated question. Similar examples in the experiment are "who", "what" to "what was the name of ..." 32 CHAPTER 5. ANALYSIS OF THE EVALUATION METHOD

Table 5.4: Sampled context with the generated question, BLEU-2 score, and overlapping score (context of the example can be found in Appendix A)

answer / real queation / BLEU-2 overlapping context answer generated score score evaluation by QA generated question (by comparing questions) (by comparing answers) 1 60 How many helicopters came from the PLAAF? 0.33 0.0 wrong 90 how many helicopters were there ? autosomal DNA (atDNA), mitochondrial DNA (mtDNA), 2 What are the three types of origin testing? 0.50 1.0 correct and Y-chromosomal DNA (Y-DNA) autosomal DNA (atDNA), mitochondrial DNA (mtDNA), what are the three dna ? and Y-chromosomal DNA (Y-DNA) 3 565 How many princely states were there in India in 1947? 0.39 1.0 wrong 565 how many british kingdoms were in 1947 ? 4 600 How many coolie labourers were on the island by 1818? 0.30 1.0 correct 600 how many laborers were imported in the island ? Who worked towards obtaining 5 the Yongle Emperor 0.0 1.0 correct a extension of relations with Tibet? who was the first ming ##ification the Yongle Emperor actively seek an extension of relations ? 6 Oklahoma City Thunder Which team became the main tenant of the arena in 2008? 0.19 0.67 wrong Oklahoma City which city was the headquarters of the headquarters ? 7 1998 When was Sky Digital launched? 0.46 1.0 correct 1998 in what year was sky digital satellite launched ? 8 Astra 2A What satellite was used when Sky digital was launched? 0.0 1.0 correct Astra 2A what was the name of the satellite ? How many television and radio channels could 9 hundreds 0.14 1.0 wrong the new digital service carry? hundreds how many tv service did internet have ? 0.0 10 10th What century did the Normans first gain their separate identity? 0.51 wrong (wrong place) 10th and 11th centuries in what century did the norman ##s of the norman ##s ? 11 22 How many times did plague occur in Venice? 0.46 0.0 wrong 0 between how many times did the epidemic occur in 144 ##1 ? and "how many" to "what was the number of ...", etc. These substitutions are all valid but will lower the BLEU score. On the other hand, the overlapping score evaluates the relevance directly through the comparison of the given an- swer and the answer generated by the existing QA model. So the sentence structure is ignored while meaning is captured.

However, the overlapping score has its own limitations. As is shown in ex- ample 9 of Table 5.4, the appearance of keyword "tv" is enough for the QA model to output the original answer, which gives the question 1.0 overlapping score. However, the generated question is asking different stuff compared to the original question.

5.3.3 Conclusion To draw a conclusion, the existing metrics with explicit word-to-word match- ing evaluation is not good in capturing both the readability and relevance of the free form language generation task. The RaR metric we proposed is not per- fect, but it is a better automatic evaluating metric for the question generation task which uses the SQuAD dataset. Chapter 6

Results

In this chapter, we present two aspects of the results obtained from models described in chapter 3. At first, we present the performance of the models by comparing their scores in RaR metric described in section 5.2. In section 6.2, we sample representative context-answer pairs in the dataset along with the original questions and compare them with the questions generated by our model.

6.1 Quantitative Evaluation

6.1.1 Baseline model Unfortunately, the loss of the baseline model does not decrease no matter how we chose the hyperparameters. However, we managed to train the model by copying the parameters of the decoder from the trained ReverseQA model. Figure 6.1 shows the loss record on both training set and dev set. For the base- line model, the measurement interval is set to 300, which means one measure- ment is made once every 300 batches. As we can see from Figure 6.1, the overfitting happens after 50 recording intervals. The readability loss for the baseline model is 4.29 and the overlapping score is 0.08.

6.1.2 ReverseQA model We use the setting of our first successfully trained ReverseQA model as the baseline in this section. In this model, the Bert encoder is not fine-tuned. We set the number of the decoder blocks to 3 and the number of head in multi-head attention to 4. As is demonstrated in Figure 6.2, the overfitting happens after

33 34 CHAPTER 6. RESULTS

(a) loss record on training set for base- (b) loss record on dev loss for baseline line model model

Figure 6.1: The loss record fot the baseline model (measurement interval = 300) about 25 measurement interval with the measurement interval setting to 300. The readability loss for the model is 3.62 and the overlapping score is 0.27.

Number of Decoding blocks. We studied the effect of the different number

(a) loss record on training set for first (b) loss record on dev loss for first Re- ReverseQA model verseQA model

Figure 6.2: The loss record for the first ReverseQA model(measurement inter- val = 300) of decoding blocks on the ReverseQA model. In the experiments, the number of heads in the multi-head attention layer for all models is set to 4. As is shown in Table 6.1, the model with 2 decoding blocks in decoder has the lowest read- ability loss, but also the lowest overlapping score, which means the model has the ability to produce syntactic correct but not relevant sequences. According to Table 6.1, The model with 6 decoding blocks gives the best performance. CHAPTER 6. RESULTS 35

Table 6.1: Models with different number of decoding blocks evaluating with RaR metric (lower readability loss is better)

Number of Decoding Blocks 2 3 4 5 6 original questions readability loss Ó 3.51 3.62 4.03 3.98 3.52 3.28 overlapping score Ò 0.12 0.27 0.25 0.29 0.28 1.0

However, its score is close to the model with 3 decoding blocks. Considering the model complexity, 3 is considered to be the best choice for the number of decoding blocks in decoders.

Number of heads in multi-head attention layer. We also perform experi- ments to study the effect of the different number of heads in the multi-head attention layer. As is shown in previous experiments, we set the number of decoding layers in the decoder to 3. Table 6.2 shows the model with the head

Table 6.2: Models with different number of heads in the multi-head attention layer evaluating with RaR metric (lower readability loss is better)

Number of heads 2 3 4 6 original questions readability loss Ó 4.01 3.54 3.62 5.83 3.28 overlapping score Ò 0.25 0.22 0.27 0.29 1.0 number of 3 has the lowest readability loss (remember that lower readability loss is better) and the model with the head number of 4 has the highest over- lapping score. Taking both readability loss and overlapping score into consid- eration, the model with the head number of 4 gives the best performance.

Fine-tuning. In the last experiment, we study whether fine-tuning the Bert encoder could boost the model performance. For this experiment to work, the learning rate has to be extremely low for the parameters in the encoder. After several trials, 1 ˆ 10´6 is chosen to be the initial learning rate for the Bert parameters. Since the fine-tuning model is slow to train, we did not perform the repetitive experiments about the head and layer numbers on the fine-tuning model. We set the number of heads to 4 and the number of decoding blocks to 3, which happens to be the same as in the first ReverseQA model we studied. Its loss record can be seen in Figure 6.2. As is shown in Figure 6.3, the over- fitting happens after 250 recording interval. Although the recording interval is set to 100 instead of 300 in this experiment, the overfitting happens signif- icantly later compared to the previous experiments. Also, the lowest point of 36 CHAPTER 6. RESULTS

(a) loss record on the training set (b) loss record on the dev set

Figure 6.3: The loss record for the ReverseQA model with Bert fine-tuning (measurement interval = 100) the dev loss curve in Figure 6.3 is lower compared to it is in Figure 6.2. The model scores 0.36 on the overlapping score and 3.58 on the readability loss, which means it has the best performance among all the models studied.

6.2 Qualitative Evaluation

(a) original questions (b) inferred questions

Figure 6.4: Count of the first word in the questions

We count the frequency of the first word in both the generated questions and the original questions, the result is shown in Figure 6.4. The words appear less than 5 times are count in the ’others_’ category. As is demonstrated, the origi- nal questions have more variety of the first word, which means the model tends to generate the questions with simpler sentence structure. Also, the number of CHAPTER 6. RESULTS 37

times the word ’in’ is at the beginning of the question is doubled among the generated questions compared to the original questions, which is because of the phenomenon described in section 5.3.2.

In the following content of this section, we will present some typical results generated from our best model, which shows what kinds of questions the model is capable of generating. Each example consists of the part of the context that is relevant to the answer with the aimed answers shown in bold text, the origi- nal question, the generated question and a short comment. The full-text of the context can be found in Appendix B.

• Example 1

Context: Thomas Piketty in his book Capital in the Twenty-First Cen- tury argues that the fun-damental force for divergence is the usually greater return of capital (r) thaneconomic growth (g), and that larger fortunes generate higher returns [pp. 384Table 12.2, U.S. university en- dowment size vs. real annual rate of return]

original question: Who wrote the book "Capital in the Twenty-First Cen- tury"?

Generated question: who is the author of the book in the 21st century ?

Analysis: In this question, the model failed to identify the text "in the Twenty-First Century" as part of the name of the book. The main rea- son for this is that we lowercase everything in the training data to cut down the vocabulary size. However, having identified the given answer "Thomas Piketty" as a name, the model safely begins the question with "who is". We also notice that the book name information is not included in the generated question. It is very common in the text examples that the names of the book, movie, and location are omitted if no similar kinds of things are mentioned in the context.

• Example 2

Context: Economist Joseph Stiglitz argues that rather than explaining concentrations of wealth and income, ¨ ¨ ¨ 38 CHAPTER 6. RESULTS

original question: What career does Joseph Stiglitz have?

Generated question: what is joseph ’ s job ?

Analysis: This example exhibits the synonym conversion ability of the Bert embedding. Although it is fine to use both "career" and "job", us- ing "job" makes the generated sequence concise, which is safer for the model to perform the following inference.

• Example 3

Context: Effects of inequality researchers have found include higher rates of health and social problems, and lower rates of social goods, a lower level of economic utility in society from resources devoted on high-end consumption, and even a lower level of economic growth when human capital is neglected for high-end consumption.

original question: Higher rates of health and social problems are just two of examples of effects from what?

Generated question: what have the effect of health of health and so- cial rates of health ?

Analysis: The generated question is not making any sense in this ex- ample. However, the overlapping score is 1.0 for this example and the BLUE-2 score for this example also reaches 0.36. It shows us that the important information for generating the question is correctly identified, but it is very hard for the model to assemble them to a reasonable ques- tion since the sentence structure of the original question is relatively complicated.

• Example 4

Context: The (Chinese: ; : Yuán Cháo), officially the Great Yuan (Chinese: ; pinyin: Dà Yuán; Mongolian: Yehe Yuan Ulus[a]), was the empire or ruling dynasty of China established by , leader of the Mongolian clan. CHAPTER 6. RESULTS 39

original question: What is the Chinese name for the Yuan dynasty?

Generated question: what is the name of the yuan dynasty ?

Analysis: This is an example of missing information. Here, "Chinese" is a piece of very important information if we want the answer to the generated question to be exactly "Yuán Cháo". By asking the generated question "what is the name of the yuan dynasty ?", the answer output by QA model is "greate yuan". We could argue that given the original answer "Yuán Cháo" and the context, the generated question is not ex- actly wrong and "Yuán Cháo" is one of the answers if multiple answers are allowed. The overlapping score here is 0.0 meaning the answer is not very specific or the relevance between the generated question and answer is relatively low compared to the original question.

• Example 5

After strengthening his government in northern China, Kublai pursued an expansionist policy in line with the tradition of Mongol and Chinese imperialism. He renewed a massive drive against the to the south.

original question: Where did Kublai build his administration’s strength?

Generated question: in what region did ku ##bla ##i rule rule ?

Analysis: Again, "in what" is used in the generated question instead of "where". The word "rule" is repeated twice in the generated question, which is also a common problem of the current model. These are lots of cases where the generated questions are one or two words away from the completely correct sentences, most of which are because of the repeti- tive words. The repetitive words have no effect on the overlapping score and very little impact on the readability loss, which makes it hard for the RaR metric to reflect the problem. The beam search method could mitigate this problem but cannot completely solve it. Chapter 7

Discussion

In this chapter, we will briefly summarize and discuss our model and the pro- posed metric.

7.1 About the Metric

The motivation of proposing the new metric is because the ReverseQA model could generate questions with different word choice and sentence structure as the reference question while maintaining the same meaning. The RaR met- ric considering both readability and relevance is proved to be a better choice than currently existing metrics like the BLEU score. However, the RaR metric has its own limitations. Firstly, since readability loss and overlapping score are two separate measures, there are cases when the generated question only scores high on one of the scores. So the RaR metric can only be used as the comparison method to show the general performance of the models. For ex- ample, if the overlapping score on the text set is 0.2, it does not imply 20% of the generated questions are good because some of the generated sequences with a high overlapping score may not be syntactic correct. Secondly, since the calculation of both the readability loss and overlapping score requires the inference using the neural network, the computation is slow. On average, it takes half-second to calculate the overlapping score for one data point, which means the calculation of the overlapping score on the entire test set will last a day. Thirdly, there are biases when we analyze the metric. Since we are not comparing the evaluation quality with human scores, the analysis is largely depends on our understanding and interpretation. Also, since the number of examples we can analyze is limited, there could be aspects we missed out.

40 CHAPTER 7. DISCUSSION 41

7.2 About the ReverseQA Model

With the help of the pre-trained Bert encoder, the ReverseQA model can gener- ate high-quality questions with simple sentence structure. In the successfully generated questions, the entity types are identified and simple relations are detected. The model is also learned to use synonyms to simplify the sentence structure of the generated questions.

7.3 Future Work

Although the result is encouraging, the performance of the current best Re- verseQA model is still far from human performance. One possible way to improve the performance is to integrate the information of the entity type for each token in the context and answer using the Named-entity recognition (NER) technique. We think with a proper way to add this information, it could solve the problem for cases when the entity types appeared in the question is misidentified.

Also, in the ReverseQA model, the way loss is calculated diverges from the evaluation metric. One way to diminish this divergence is to use the Gener- ative Adversarial Networks (GAN) architecture[45]. We could use two net- works as the discriminator to capture the readability and relevance. However, what should be noticed is that the gradient update for the GAN model in the language generation task is hard since the outputs (word choices) in language generation task are discrete. Also, the loss cannot be back-propagated with the beam search. One possible solution for this problem is to treat the language generation process as the reinforcement learning problem and produce the re- ward by discriminator via Monte Carlo approach, as what has been done in sequence GAN[46]. This method is in our proposal for this thesis. However, we found it too computational costly according to the running time of the cur- rent model. The reason is that the roll-out policy in the GAN idea requires inference during the training. It means that to construct a loss for one gen- eration step, we need to sample several complete sentences starting from the current word position using the entire model. By referring to the run time of calculating the overlapping score, this method is discarded due to the time and resource limit of the thesis project. However, we still think this idea is valid and interesting. Moreover, The encouraging part of this idea is that it turns the question generation task to a semi-supervised learning problem, as we no 42 CHAPTER 7. DISCUSSION

longer need the true question-answer pairs in our training data. With a proper mechanism to automate the selection of the segments of text as the answer, the model can be trained on any text without human annotations.

7.4 Ethical and Sustainability

With further development of this thesis work, we hope that better models could be created, which could automatically generate meaningful questions given the text and piece of information in the text as the aimed answer. The main con- tribution of our model is that it automates the question generation. Usually, in the scenario when the generation of simple questions is needed, for example, requiring the information from the user, the work could be repetitive. By using our model, instead of carefully designing every question, people can simply create several templates and point out information they want and let the model automatically generate specific questions, which increase the job quality and will full fill Goal 8 among UN sustainability goals. The model can also be used to create questions for the reading comprehension exercises, which will cut down the educational cost and help reduce the inequalities.

Although our work has many positive contributions, there are still some ethical concerns. Firstly, as the cost of the question generation system becomes lower, it is easier for companies to collect user data, which increases the concern of data misuse. Secondly, since the question generation is automated, it could be biased, and there is a potential risk that some of the generated questions can be offensive and discriminative. Bibliography

[1] Zhuosheng Zhang, Junjie Yang, and Hai Zhao. “Retrospective reader for machine reading comprehension”. In: arXiv preprint arXiv:2001.09694 (2020). [2] Zhenzhong Lan et al. “Albert: A lite bert for self-supervised learning of language representations”. In: arXiv preprint arXiv:1909.11942 (2019). [3] Albert Gatt and Emiel Krahmer. “Survey of the state of the art in nat- ural language generation: Core tasks, applications and evaluation”. In: Journal of Artificial Intelligence Research 61 (2018), pp. 65–170. [4] Alice Oh and Alexander Rudnicky. “Stochastic language generation for spoken dialogue systems”. In: ANLP-NAACL 2000 Workshop: Conver- sational Systems. 2000. [5] Kyunghyun Cho et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation”. In: arXiv preprint arXiv:1406.1078 (2014). [6] Jie Zhou et al. “Deep recurrent models with fast-forward connections for neural machine translation”. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 371–383. [7] Jacob Devlin et al. “Bert: Pre-training of deep bidirectional transform- ers for language understanding”. In: arXiv preprint arXiv:1810.04805 (2018). [8] Alec Radford et al. “Language models are unsupervised multitask learn- ers”. In: OpenAI Blog 1.8 (2019), p. 9. [9] Zhilin Yang et al. “Xlnet: Generalized autoregressive pretraining for language understanding”. In: Advances in neural information process- ing systems. 2019, pp. 5754–5764.

43 44 BIBLIOGRAPHY

[10] Pranav Rajpurkar, Robin Jia, and Percy Liang. “Know what you don’t know: Unanswerable questions for SQuAD”. In: arXiv preprint arXiv:1806.03822 (2018). [11] Adams Wei Yu et al. “Qanet: Combining local convolution with global self-attention for reading comprehension”. In: arXiv preprint arXiv:1804.09541 (2018). [12] Thorsten Brants. “TnT: a statistical part-of-speech tagger”. In: Proceed- ings of the sixth conference on Applied natural language processing. Association for Computational Linguistics. 2000, pp. 224–231. [13] Thierry Poibeau and Leila Kosseim. “Proper name extraction from non- journalistic texts”. In: Computational Linguistics in the Netherlands 2000. Brill Rodopi, 2001, pp. 144–157. [14] Amit Bagga and Breck Baldwin. “Entity-Based Cross-Document Core f erencing Using the Vector Space Model”. In: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. 1998, pp. 79–85. [15] Karen Sparck Jones. “A statistical interpretation of term specificity and its application in retrieval”. In: Journal of documentation (1972). [16] Stephen Robertson. “Understanding inverse document frequency: on theoretical arguments for IDF”. In: Journal of documentation (2004). [17] Ammar Ismael Kadhim et al. “Improving TF-IDF with singular value decomposition (SVD) for feature extraction on Twitter”. In: Proc. 3rd. International Engineering Conference. on Developments in Civil & Com- puter Engineering Applications. 2017. [18] Tomas Mikolov et al. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013). [19] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “Glove: Global vectors for word representation”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543. [20] Matthew E Peters et al. “Deep contextualized word representations”. In: arXiv preprint arXiv:1802.05365 (2018). [21] Ashish Vaswani et al. “Attention is all you need”. In: Advances in neural information processing systems. 2017, pp. 5998–6008. [22] Christopher Manning. “Maxent models and discriminative estimation”. In: CS 224N lecture notes, Spring (2005). BIBLIOGRAPHY 45

[23] Tomáš Mikolov et al. “Recurrent neural network based language model”. In: Eleventh annual conference of the international speech communica- tion association. 2010. [24] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. “Learning to forget: Continual prediction with LSTM”. In: (1999). [25] Kyunghyun Cho et al. “On the properties of neural machine transla- tion: Encoder-decoder approaches”. In: arXiv preprint arXiv:1409.1259 (2014). [26] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778. [27] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer nor- malization”. In: arXiv preprint arXiv:1607.06450 (2016). [28] Minjoon Seo et al. “Bidirectional attention flow for machine compre- hension”. In: arXiv preprint arXiv:1611.01603 (2016). [29] Caiming Xiong, Victor Zhong, and Richard Socher. “Dynamic coatten- tion networks for question answering”. In: arXiv preprint arXiv:1611.01604 (2016). [30] Eunsol Choi et al. “Quac: Question answering in context”. In: arXiv preprint arXiv:1808.07036 (2018). [31] Siva Reddy, Danqi Chen, and Christopher D Manning. “Coqa: A con- versational question answering challenge”. In: Transactions of the As- sociation for Computational Linguistics 7 (2019), pp. 249–266. [32] Pranav Rajpurkar et al. “Squad: 100,000+ questions for machine com- prehension of text”. In: arXiv preprint arXiv:1606.05250 (2016). [33] Robert B Allen. “Several studies on natural language and back-propagation”. In: Proceedings of the IEEE First International Conference on Neural Networks. Vol. 2. S 335. IEEE Piscataway, NJ. 1987, p. 341. [34] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: Advances in neural information pro- cessing systems. 2014, pp. 3104–3112. [35] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. “Effec- tive approaches to attention-based neural machine translation”. In: arXiv preprint arXiv:1508.04025 (2015). 46 BIBLIOGRAPHY

[36] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. “High- way networks”. In: arXiv preprint arXiv:1505.00387 (2015). [37] Ramon Tuason, Daniel Grazian, and Genki Kondo. “BiDAF Model for Question Answering”. In: Table III EVALUATION ON MRC MODELS (TEST SET). Search Zhidao All (). [38] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014). [39] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regulariza- tion”. In: arXiv preprint arXiv:1711.05101 (2017). [40] Kishore Papineni et al. “BLEU: a method for automatic evaluation of machine translation”. In: Proceedings of the 40th annual meeting on as- sociation for computational linguistics. Association for Computational Linguistics. 2002, pp. 311–318. [41] Satanjeev Banerjee and Alon Lavie. “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments”. In: Proceedings of the acl workshop on intrinsic and extrinsic evalu- ation measures for machine translation and/or summarization. 2005, pp. 65–72. [42] Kettip Kriangchaivech and Artit Wangperawong. “Question Generation by Transformers”. In: arXiv preprint arXiv:1909.05017 (2019). [43] Dietrich Klakow and Jochen Peters. “Testing the correlation of word error rate and perplexity”. In: Speech Communication 38.1-2 (2002), pp. 19–28. [44] Ryan Lowe et al. “Towards an automatic turing test: Learning to evalu- ate dialogue responses”. In: arXiv preprint arXiv:1708.07149 (2017). [45] Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in neural information processing systems. 2014, pp. 2672–2680. [46] Lantao Yu et al. “Seqgan: Sequence generative adversarial nets with policy gradient”. In: Thirty-First AAAI Conference on Artificial Intelli- gence. 2017. Appendix A

Context for sampled data in Ta- ble 5.4

1 By May 15, Premier Wen Jiabao ordered the deployment of an additional 90 helicopters, of which 60 were to be provided by the PLAAF, and 30 were to be provided by the civil aviation industry, bringing the total of number of aircraft deployed in relief operations by the air force, army, and civil aviation to over 150, resulting in the largest non-combat airlifting operation in People’s Liberation Army history.

2 Efforts to identify the origins of Ashkenazi Jews through DNA analysis be- gan in the 1990s. Currently, there are three types of genetic origin testing, au- tosomal DNA (atDNA), mitochondrial DNA (mtDNA), and Y-chromosomal DNA (Y-DNA). Autosomal DNA is a mixture from an individual’s entire ancestry, Y-DNA shows a male’s lineage only along his strict-paternal line, mtDNA shows any person’s lineage only along the strict-maternal line. Genome- wide association studies have also been employed to yield findings relevant to genetic origins.

3 In the aftermath, all power was transferred from the East India Company to the British Crown, which began to administer most of India as a number of provinces. The Crown controlled the Company’s lands directly and had considerable indirect influence over the rest of India, which consisted of the Princely states ruled by local royal families. There were officially 565 princely states in 1947, but only 21 had actual state governments, and only three were large (Mysore, Hyderabad and Kashmir). They were absorbed into the inde- pendent nation in 1947–48.

47 48 APPENDIX A. CONTEXT FOR SAMPLED DATA IN TABLE ??

4 The importation of slaves was made illegal in 1792. Governor Robert Patton (1802–1807) recommended that the company import Chinese labour to sup- plement the rural workforce. The coolie labourers arrived in 1810, and their numbers reached 600 by 1818. Many were allowed to stay, and their descen- dents became integrated into the population. An 1814 census recorded 3,507 people on the island.

5 However, the early Ming government enacted a law, later rescinded, which forbade Han Chinese to learn the tenets of Tibetan Buddhism. There is lit- tle detailed evidence of Chinese—especially lay Chinese—studying Tibetan Buddhism until the Republican era (1912–1949). Despite these missions on behalf of the , Morris Rossabi writes that the Yongle Em- peror (r. 1402–1424) "was the first Ming ruler actively to seek an extension of relations with Tibet."

6 Chesapeake Energy Arena in downtown is the principal multipurpose arena in the city which hosts concerts, NHL exhibition games, and many of the city’s pro sports teams. In 2008, the Oklahoma City Thunder became the major tenant. Located nearby in Bricktown, the Chickasaw Bricktown Ball- park is the home to the city’s baseball team, the Dodgers. "The Brick", as it is locally known, is considered one of the finest minor league parks in the na- tion.[citation needed]

7, 8, 9 When Sky Digital was launched in 1998 the new service used the As- tra 2A satellite which was located at the 28.5E orbital position, unlike the analogue service which was broadcast from 19.2E. This was subsequently fol- lowed by more Astra satellites as well as Eutelsat’s Eurobird 1 (now Eutelsat 33C) at 28.5E), enabled the company to launch a new all-digital service, Sky, with the potential to carry hundreds of television and radio channels. The old position was shared with broadcasters from several European countries, while the new position at 28.5E came to be used almost exclusively for channels that broadcast to the United Kingdom.

10 The Normans (Norman: Nourmands; French: Normands; Latin: Nor- manni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Nor- way who, under their leader Rollo, agreed to swear fealty to King Charles III of APPENDIX A. CONTEXT FOR SAMPLED DATA IN TABLE ?? 49

West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cul- tural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.

11 In 1466, perhaps 40,000 people died of the plague in Paris. During the 16th and 17th centuries, the plague was present in Paris around 30 per cent of the time. The Black Death ravaged Europe for three years before it contin- ued on into Russia, where the disease was present somewhere in the country 25 times between 1350 to 1490. Plague epidemics ravaged London in 1563, 1593, 1603, 1625, 1636, and 1665, reducing its population by 10 to 30% dur- ing those years. Over % of Amsterdam’s population died in 1623–25, and again in 1635–36, 1655, and 1664. Plague occurred in Venice 22 times be- tween 1361 and 1528. The plague of 1576–77 killed 50,000 in Venice, almost a third of the population. Late outbreaks in central Europe included the Ital- ian Plague of 1629–1631, which is associated with troop movements during the Thirty Years’ War, and the Great Plague of Vienna in 1679. Over 60% of Norway’s population died in 1348–50. The last plague outbreak ravaged Oslo in 1654. Appendix B

Context for the typical examples

Example 1. Wealth concentration is a theoretical[according to whom?] pro- cess by which, under certain conditions, newly created wealth concentrates in the possession of already-wealthy individuals or entities. According to this theory, those who already hold wealth have the means to invest in new sources of creating wealth or to otherwise leverage the accumulation of wealth, thus are the beneficiaries of the new wealth. Over time, wealth condensation can sig- nificantly contribute to the persistence of inequality within society. Thomas Piketty in his book Capital in the Twenty-First Century argues that the fun- damental force for divergence is the usually greater return of capital (r) than economic growth (g), and that larger fortunes generate higher returns [pp. 384 Table 12.2, U.S. university endowment size vs. real annual rate of return]

Example 2. Economist Joseph Stiglitz argues that rather than explaining con- centrations of wealth and income, market forces should serve as a brake on such concentration, which may better be explained by the non-market force known as "rent-seeking". While the market will bid up compensation for rare and desired skills to reward wealth creation, greater productivity, etc., it will also prevent successful entrepreneurs from earning excess profits by fostering competition to cut prices, profits and large compensation. A better explainer of growing inequality, according to Stiglitz, is the use of political power gen- erated by wealth by certain groups to shape government policies financially beneficial to them. This process, known to economists as rent-seeking, brings income not from creation of wealth but from "grabbing a larger share of the wealth that would otherwise have been produced without their effort"

Example 3. Effects of inequality researchers have found include higher rates

50 APPENDIX B. CONTEXT FOR THE TYPICAL EXAMPLES 51

of health and social problems, and lower rates of social goods, a lower level of economic utility in society from resources devoted on high-end consumption, and even a lower level of economic growth when human capital is neglected for high-end consumption. For the top 21 industrialised countries, counting each person equally, life expectancy is lower in more unequal countries (r = -.907). A similar relationship exists among US states (r = -.620).

Example 4. The Yuan dynasty (Chinese: ; pinyin: Yuán Cháo), officially the Great Yuan (Chinese: ; pinyin: Dà Yuán; Mongolian: Yehe Yuan Ulus[a]), was the empire or ruling dynasty of China established by Kublai Khan, leader of the Mongolian Borjigin clan. Although the had ruled territories including today’s North China for decades, it was not until 1271 that Kublai Khan officially proclaimed the dynasty in the traditional Chinese style. His realm was, by this point, isolated from the other khanates and controlled most of present-day China and its surrounding areas, including modern Mongolia and Korea. It was the first foreign dynasty to rule all of China and lasted until 1368, after which its Genghisid rulers returned to their Mongolian homeland and continued to rule the Northern Yuan dynasty. Some of the Mongolian Emperors of the Yuan mastered the Chinese language, while others only used their native language (i.e. Mongolian) and the ’Phags-pa script.

Example 5. After strengthening his government in northern China, Kublai pursued an expansionist policy in line with the tradition of Mongol and Chi- nese imperialism. He renewed a massive drive against the Song dynasty to the south. Kublai besieged Xiangyang between 1268 and 1273, the last obsta- cle in his way to capture the rich Yangzi River basin. An unsuccessful naval expedition was undertaken against Japan in 1274. Kublai captured the Song capital of in 1276, the wealthiest city of China. Song loyalists es- caped from the capital and enthroned a young child as Emperor Bing of Song. The Mongols defeated the loyalists at the battle of in 1279. The last Song emperor drowned, bringing an end to the Song dynasty. The conquest of the Song reunited northern and southern China for the first time in three hundred years. TRITA -EECS-EX-2020:813

www.kth.se