TASK SPECIFICATION

The goal of semantic parsing is to build semantic representations for natural language data, so that we can model the meaning of the text. If linguistic meaning is represented by directed graphs of concepts, and these have to be produced from trees representing the syntactic structure of the sentence, then the whole task can be defined as a single complex graph transformation. For popular semantic tasks such as measuring semantic similarity or machine reading comprehension, graph-based representations of natural language semantics are rarely used, especially in state-of-the-art systems. These systems mostly use word embeddings to represent word meaning, encoding the meaning of words as real-valued vectors of at most a few hundred dimensions. A new experimental approach is the direct learning of graph transformations, in which, using graph networks, these semantic representations could also be transformed in graph form. Frameworks supporting such tasks include the Graph Nets Library (see GitHub) and the Deep Graph Library (see GitHub). The central theme of the student's work is formed by experiments in this direction. The student's tasks include the following:

• Become familiar with a framework for training graph networks.

• Perform experiments on deep learning of graph transformations.

• Become familiar with at least one task requiring semantic representations on which the method could be evaluated directly, for example Surface Realization (see the Surface Realization paper) or Extractive Summarization.

• Perform direct experiments on the task requiring semantic representations.

• Assess the limitations of the system through the evaluation of the results.

Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics
Department of Automation and Applied Informatics

Deep learning of graph transformations

Master's Thesis

Written by: Gémes Kinga Andrea
Consultant: Kovács Ádám

December 19, 2019

Contents

Kivonat
Abstract
Introduction
1 Background
  1.1 Summarization
    1.1.1 Abstractive summarization
    1.1.2 Extractive summarization
  1.2 Evaluation
    1.2.1 BLEU score
    1.2.2 ROUGE score
  1.3 Previous work
    1.3.1 TextRank
    1.3.2 Deep Learning models
2 Data processing
  2.1 CNN and Daily Mail data
    2.1.1 Greedy algorithm
    2.1.2 Gensim Benchmark
    2.1.3 Example from the dataset
  2.2 Universal Dependencies
  2.3 Graph format
  2.4 Building the graphs
  2.5 Forming the target graphs
3 Graph neural network
  3.1 Neural Networks
    3.1.1 Perceptron
    3.1.2 Feed forward neural network
    3.1.3 Embedding layers
    3.1.4 Recurrent Neural Networks
    3.1.5 Attention
  3.2 The graph_nets library
    3.2.1 Graph Network block
    3.2.2 Graph Independent block
    3.2.3 Self Attention block
4 Models
  4.1 Encode Process Decode model
    4.1.1 Encoder
    4.1.2 Core
    4.1.3 Decoder
  4.2 SimpleGraphAttention model
    4.2.1 Encoder
    4.2.2 Network
    4.2.3 Decoder
  4.3 GraphAttention model
    4.3.1 Network and Graph Attentional Layer
5 Experiments and results
  5.1 Evaluation methods
  5.2 Experiments and results
    5.2.1 Encode Process Decode
    5.2.2 SimpleGraphAttention Network
    5.2.3 GraphAttention Network
  5.3 Comparison between models
    5.3.1 Results by summary length
6 Conclusion and future work
Acknowledgement
Bibliography
Appendices
  A.1 Modules and packages
    A.1.1 Functions and classes in each relevant module
  A.2 Class diagram

STUDENT DECLARATION

I, the undersigned Gémes Kinga Andrea, graduating student, hereby declare that I prepared this thesis on my own, without any unauthorized help, and that I used only the sources given (literature, tools, etc.). Every part that I took from another source, either verbatim or rephrased with the same meaning, is clearly marked with a reference to the source. I consent to the basic data of this work (author(s), title, abstracts in English and Hungarian, year of preparation, name(s) of the consultant(s)) being published by BME VIK in a publicly accessible electronic form, and to the full text of the work being made available through the university's internal network (or to authenticated users). I declare that the submitted work and its electronic version are identical. For theses classified with the Dean's permission, the text of the thesis becomes accessible only after 3 years.

Budapest, December 19, 2019

Gémes Kinga Andrea
student

Kivonat

In my thesis I investigate whether the natural language processing task known as extractive summarization can be interpreted as a graph transformation, and whether a graph neural network can be built to solve this task using DeepMind's graph_nets library. Extractive summarization is the task of producing a summary for a text using only sentences that appear in the text. In the course of my work I implemented a mapping that generates the corresponding universal dependency (UD) graphs from articles using the stanfordnlp library. Since training on graphs requires that the numbers of edges and nodes match for each input-output graph pair, I constructed the graphs built for the summaries accordingly. The graph_nets library contains an Encode-Process-Decode model, which I could use as a starting point in my work. I examined the applicability of this model to the task in several ways, and with the experience gained I built two other graph neural networks tailored to natural language processing tasks. Training graph neural networks proved to be a challenging task, since their structure differs from the usual neural network architectures; the related article and the demo examples helped my understanding. I measured my results against the free-form summaries belonging to the dataset, determining their ROUGE score, and I also compared this with the maximum ROUGE score achievable by extraction and with the TextRank-based summarizer of the gensim library.

Abstract

In my master's thesis I examine the possibility of treating extractive summarization, a natural language processing task, as a graph transformation, and whether a graph neural network can be built for this purpose using DeepMind's graph_nets library. Extractive summarization is the task of generating a summary for a text using only words, expressions and/or sentences from the original text. I used a standard method to transform the articles into their respective universal dependency (UD) graphs using the stanfordnlp library. Since the structure of the input and output graphs is required to be the same for the graph neural networks to train on them, I had to modify the graphs built from the summaries accordingly. The graph_nets library already contains an Encode-Process-Decode model that proved to be a great starting point while exploring the task and its possibilities. I experimented with the usability of this model on my task, and with the observations I gathered I built two further graph neural networks aimed specifically at natural language processing tasks. Training these networks was a challenging task, because their structure differs considerably from regular neural network architectures. The related article [4] and the demo examples helped me understand it better. I compared the achieved results with the human-written summaries provided in the dataset, determining the results' ROUGE score. This score was then compared with the maximum ROUGE score achievable with extractive summarization and with the summary generated by gensim's TextRank algorithm.

Introduction

Summarization tasks are relevant in the field of natural language processing (NLP). Complex end-to-end neural network models like the Hierarchical Structured Self-Attentive Model (HSSAS) [1] can construct great summaries, and the state of the art as of writing this thesis is Text Summarization with Pretrained Encoders [14]. I write about the background of this field in Chapter 1. Graph neural networks, on the other hand, are not yet widely researched, and their usage for NLP purposes has not been explored deeply, at least to my knowledge. Graph Convolutional Networks (GCN) have been successfully applied to a variety of tasks recently, but not as widely as other methods. That being said, representing syntax with graphs or trees is common practice in the field, so finding a tool capable of generating graph representations from texts was not a hard task. I used the stanfordnlp library, which uses deep learning to determine part-of-speech (POS) tags, word lemmas and universal dependency (UD) relations between words. I used these graphs to build one merged graph that can represent a set of sentences, such as summaries and articles, in graph format. I had to modify the summary graphs further to have the same structure as the article graph. Chapter 2 is about this process and the graph representation used in the graph_nets module. In Chapter 3 I document the most important parts of the graph_nets library as well as the relevant deep learning concepts. Chapter 4 focuses on the structure of the models used in my experiments during the semester. There are multiple model descriptions in the chapter, all of which I experimented with, and their results are documented in Chapter 5. The conclusion and my plans for future work are described in Chapter 6. The code for the project is publicly available on GitHub.

Chapter 1

Background

1.1 Summarization

Summarization is the task of shortening a text with the least amount of information loss. Doing this automatically and with nearly human-level accuracy is still a challenge, but its usefulness is undeniable, especially in the news industry, where there is a constant supply of new articles and writing a summary for each of them by hand would be a waste of resources if they could be summarized automatically. We can distinguish between two approaches to this problem, abstractive and extractive summarization, both interesting and challenging. These are described in the following subsections.

1.1.1 Abstractive summarization

Summaries written by humans are usually free-form, not word-by-word or sentence-by-sentence cut-outs from the article. The goal of abstractive summarization is to construct summaries in a similar fashion, being creative with word and expression choice, so that the resulting summary is as sound and readable as a summary written by a person. This poses multiple difficulties, because the system not only has to figure out what is important in an article, but also has to construct grammatically correct sentences that are semantically sound and actually form a good summary of the article without adding false information. This is a hard task compared to the extractive approach, but recently there have been some breakthroughs with end-to-end seq2seq neural network models. The first breakthrough came in 2017 with the article titled Get To The Point: Summarization with Pointer-Generator Networks [26] and has since been developed further in Mixture Content Selection for Diverse Sequence Generation [7] and Unified Language Model Pre-training for Natural Language Understanding and Generation [8]. However, it is important to note that the outputs of these networks are far from human-written quality; they can contain factual falsehoods and incorrect grammar, so they are not yet applied to real-life problems.

1.1.2 Extractive summarization

Extractive summarization is an easier approach to the problem of summarization, because it simply chooses the expressions or sentences to keep from the original text. One could think of this as highlighting the most important information in a text. The major advantage of extractive summarization is that there is no need to paraphrase anything from the original text, but the downside is that the maximum achievable quality is limited by the sentences of the text. I decided to choose this approach for my thesis, but it is important to mention that on its own the graph neural network model could be used for abstractive summarization as well, although it would be much more complicated, because it would involve splitting the output graph into smaller sentence-sized graphs and then constructing textual data from these smaller UD graphs. The former would be a challenge on its own, and the latter is still an unsolved problem in the field of NLP1.

1.2 Evaluation

In general we want our generated summary to be as similar to the reference summaries as possible. For this we look at the number and length of overlaps between them. The goal is to find a metric that judges the generated extracted summaries similarly to how a human would judge them. That being said, automatic metrics are limited by the reference summaries and will score a sound, good summary lower if it does not match the references well.

1.2.1 BLEU score

The BLEU [20] score was first introduced as an automatic evaluation tool for machine translation in 2002, but it has since been used for a multitude of problems in the field of Natural Language Processing. BLEU is a precision-oriented measure, calculated by the following equation:

$$\mathrm{BLEU} = \frac{\sum_{S \in \{\text{CandidateSummaries}\}} \sum_{n\text{-gram} \in S} \mathrm{Count}_{\mathrm{clip}}(n\text{-gram})}{\sum_{S' \in \{\text{CandidateSummaries}\}} \sum_{n\text{-gram}' \in S'} \mathrm{Count}(n\text{-gram}')}$$

This metric is rarely used for summary evaluation; it is more useful for machine translation problems.

1.2.2 ROUGE score

The ROUGE score metric was developed in 2004 by Chin-Yew Lin [13]. ROUGE is short for Recall-Oriented Understudy for Gisting Evaluation. The metric is used to determine the quality of a generated summary by comparing it to a set of human-written summaries. It yields results similar to human intuition, especially ROUGE-2 and ROUGE-SU4, which correlate the best, as shown in Re-evaluating Automatic Summarization

1https://www.aclweb.org/portal/content/announcement-surface-realization-shared-task-2019

with BLEU and 192 Shades of ROUGE [9]. ROUGE-1 and ROUGE-L are the metrics most used in scientific publications. ROUGE-N is the ROUGE measure over n-grams. An n-gram is an expression from the text consisting of n words. In contrast to the BLEU score, which is precision oriented, the ROUGE-N scores are the recall of the candidate summaries:

$$\mathrm{ROUGE}\text{-}N = \frac{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{n\text{-gram} \in S} \mathrm{Count}_{\mathrm{match}}(n\text{-gram})}{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{n\text{-gram} \in S} \mathrm{Count}(n\text{-gram})}$$

where $\mathrm{Count}_{\mathrm{match}}(n\text{-gram})$ is the maximum number of n-grams co-occurring in the candidate and reference summaries. If n is 1, the ROUGE score is simply the number of matching words divided by the number of words in the reference summary. ROUGE-L is an F-measure using the longest common subsequence of the predicted and reference summaries.

$$\mathrm{Recall}_{lcs} = \frac{LCS(\text{reference summary}, \text{candidate summary})}{|\text{reference summary}|}$$

$$\mathrm{Precision}_{lcs} = \frac{LCS(\text{reference summary}, \text{candidate summary})}{|\text{candidate summary}|}$$

$$\mathrm{ROUGE}\text{-}L = F1_{lcs} = \frac{(1 + \beta^2)\,\mathrm{Recall}_{lcs}\,\mathrm{Precision}_{lcs}}{\mathrm{Recall}_{lcs} + \beta^2\,\mathrm{Precision}_{lcs}}$$

where $LCS$ is a function returning the length of the longest common subsequence and $\beta = \mathrm{Precision}_{lcs} / \mathrm{Recall}_{lcs}$. ROUGE-W uses a weighted longest common subsequence instead of the simple LCS, so that it prefers consecutive matches and the weighting function scores them higher. ROUGE-S is a skip-bigram co-occurrence statistic. Skip-bigrams are any pair of words of the article, allowing for gaps between them. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate summary and a set of reference summaries.

$$\mathrm{Recall}_{skip2} = \frac{SKIP2(X, Y)}{\mathrm{combine}(m, 2)}$$

$$\mathrm{Precision}_{skip2} = \frac{SKIP2(X, Y)}{\mathrm{combine}(k, 2)}$$

$$\mathrm{ROUGE}\text{-}S = F1_{skip2} = \frac{(1 + \beta^2)\,\mathrm{Recall}_{skip2}\,\mathrm{Precision}_{skip2}}{\mathrm{Recall}_{skip2} + \beta^2\,\mathrm{Precision}_{skip2}}$$

Although ROUGE-S allows skips in the matches, it is still sensitive to word

order. ROUGE-SU is an extension of the ROUGE-S metric that adds a unigram feature to account for single-word matches. It is important to note that comparing systems based on their respective ROUGE scores is only meaningful if the compared summaries are of the same or similar length; otherwise the longer summary will naturally achieve a higher score.
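To make the recall-oriented nature of ROUGE concrete, the following minimal sketch computes ROUGE-1 recall for a single candidate/reference pair; the whitespace tokenization and clipping are simplified compared to the official ROUGE toolkit, so it is only illustrative.

from collections import Counter

def rouge_1_recall(candidate, reference):
    # Count unigrams in both summaries (simple whitespace tokenization).
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # A reference unigram is matched at most as many times as it occurs in the candidate.
    matched = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    return matched / sum(ref_counts.values())

print(rouge_1_recall("the cat sat on the mat", "the cat lay on the mat"))  # 5/6 ≈ 0.83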

1.3 Previous work

Summarization is a long-standing task, and there have been multiple attempts at solving this problem.

1.3.1 TextRank

The TextRank algorithm is an unsupervised method for automatic text summarization, and it can also be used for keyword extraction. It was first described in a paper titled TextRank: Bringing Order into Texts [15]. TextRank is a graph-based method that (for summarization purposes) builds graph representations of the articles using similarity measures between their sentences. The algorithm then scores each of the vertices with the following formula:

$$\mathrm{score}(V_i) = (1 - d) + d \sum_{j \in \mathrm{InComingTo}(V_i)} \frac{\mathrm{score}(V_j)}{|\mathrm{OutGoingFrom}(V_j)|}$$

where d is a damping factor between 0 and 1. The article [3] also referenced by the gensim documentation describes multiple alternative methods for TextRank. Gensim uses one of these alternatives for summary extraction; I will use it as a benchmark for my model evaluation.
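A minimal sketch of producing such a TextRank-based benchmark summary with gensim; the summarization module is available in the gensim 3.x series (it was removed in gensim 4.0), so the import below assumes such a version, and article.txt is a hypothetical input file.

# Requires gensim < 4.0, where the TextRank-based summarizer is available.
from gensim.summarization import summarize

article_text = open("article.txt").read()
# Keep roughly 20% of the text, ranked by the TextRank variant gensim implements.
print(summarize(article_text, ratio=0.2))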

1.3.2 Deep Learning models

The highest achieving models on this task are end-to-end deep learning models, mostly ones using some kind of pretrained language model. The best model currently is Text Summarization with Pretrained Encoders [14]. This model has both extractive and abstractive versions.

Metric (F1)    Score
ROUGE-1        43.85
ROUGE-2        20.34
ROUGE-L        39.90

Table 1.1: Their best results on the CNN-Daily Mail dataset according to the paper

The second best based only on the ROUGE-1 score is the paper titled Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [23] which works with the popular transformer approach.

Metric (F1)    Score
ROUGE-1        43.52
ROUGE-2        21.55
ROUGE-L        40.69

Table 1.2: Their best results on the CNN-Daily Mail dataset by their own account

A previously mentioned model, the Hierarchical Structured Self-Attentive Model (HSSAS) [1], utilizes word-level and sentence-level attention (see Chapter 3) with word encoding and sentence encoding.

Metric (F1)    Score
ROUGE-1        42.30
ROUGE-2        17.80
ROUGE-L        37.60

Table 1.3: Their results on the CNN-Daily Mail dataset

Chapter 2

Data processing

2.1 CNN and Daily Mail data

The most widely used dataset on the field of summarization is the CNN - Daily Mail dataset. This dataset contains over 300,000 articles and their respective summaries from Daily Mail articles collected from June 2010 to April 2015 and CNN news collected from April 2007 until the end of April 2015. [10] The only downside is that the summaries are not strictly extracted from the article hence there are expressions in the summaries that may not appear in the original article. There are multiple preprocessed versions available on this dataset. The dataset I used consists of the articles, their corresponding highlights and also the set of sentences from each article, that together give the highest achievable ROUGE score using the naive greedy method described in the following subsection. I will reference these summaries as the extracted summary of an article.

2.1.1 Greedy algorithm

The (naive) greedy method works on a very simple principle: always choose the sentence that maximizes the ROUGE score of the summary so far. It is important to note that it is not a summarization method; it only gives us the highest achievable extractive ROUGE score on a given dataset.

Result: the n best sentences from the text
best_sentences = "";
i = 0;
while i < number of required sentences do
    rouge_scores = map();
    foreach sentence in text, sentence not in best_sentences do
        rouge_scores[sentence] = rouge_score(concat(best_sentences, sentence), reference);
    end
    best_sentence = max_key_by_value(rouge_scores);
    best_sentences = concat(best_sentences, best_sentence);
    i = i + 1;
end
Algorithm 1: Greedy summarization algorithm
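A minimal Python sketch of the same greedy selection; rouge_score stands in for whatever ROUGE implementation is used (assumed here to return a single scalar for a candidate/reference pair), so the names are illustrative rather than the exact ones used in the project.

def greedy_extract(sentences, reference_summary, n, rouge_score):
    """Pick n sentences whose concatenation greedily maximizes the ROUGE score."""
    selected = []
    for _ in range(n):
        best_sentence, best_score = None, float("-inf")
        for sentence in sentences:
            if sentence in selected:
                continue
            candidate = " ".join(selected + [sentence])
            score = rouge_score(candidate, reference_summary)
            if score > best_score:
                best_sentence, best_score = sentence, score
        selected.append(best_sentence)
    return selected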

Evaluation method    Score
ROUGE-1              73.91
ROUGE-2              39.29
ROUGE-L              51.01
ROUGE-SU*            49.35

Table 2.1: ROUGE scores of the top 4 sentences chosen by the naive greedy algorithm over the test set of the CNN-Daily Mail data

As Table 2.1 shows, this method works surprisingly well. There have also been some advancements of this method regarding its speed [25].

2.1.2 Gensim Benchmark

As I mentioned before, I used the output of gensim’s TextRank based summarization method as my benchmark for the graph neural network model evaluation.

Evaluation method    Score
ROUGE-1              56.35
ROUGE-2              20.80
ROUGE-L              34.47
ROUGE-SU*            29.27

Table 2.2: ROUGE scores of the top 4 sentences chosen by the gensim library’s TextRank variation based summarization algorithm over the test set of the CNN-Daily Mail data

2.1.3 Example from the dataset

Article: Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men's 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover. The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. The relay triumph followed individual successes in the 100 and 200 meters in the Russian capital. "I'm proud of myself and I'll continue to work to dominate for as long as possible," Bolt said, having previously expressed his intention to carry on until the 2016 Rio Olympics. Victory was never seriously in doubt once he got the baton safely in hand from Ashmeade, while Gatlin and the United States third leg runner Rakieem Salaam had problems. Gatlin strayed out of his lane as he struggled to get full control of their baton and was never able to get on terms with Bolt. Earlier, Jamaica's women underlined their dominance in the sprint events by winning the 4x100m relay gold, anchored by Shelly-Ann Fraser-Pryce, who like Bolt was completing a triple. Their quartet recorded

a championship record of 41.29 seconds, well clear of France, who crossed the line in second place in 42.73 seconds. Defending champions, the United States, were initially back in the bronze medal position after losing time on the second handover between Alexandria Anderson and English Gardner, but promoted to silver when France were subsequently disqualified for an illegal handover. The British quartet, who were initially fourth, were promoted to the bronze which eluded their men's team. Fraser-Pryce, like Bolt aged 26, became the first woman to achieve three golds in the 100-200 and the relay. In other final action on the last day of the championships, France's Teddy Tamgho became the third man to leap over 18m in the triple jump, exceeding the mark by four centimeters to take gold. Germany's Christina Obergfoll finally took gold at global level in the women's javelin after five previous silvers, while Kenya's Asbel Kiprop easily won a tactical men's 1500m final. Kiprop's compatriot Eunice Jepkoech Sum was a surprise winner of the women's 800m. Bolt's final dash for golden glory brought the eight-day championship to a rousing finale, but while the hosts topped the medal table from the United States there was criticism of the poor attendances in the Luzhniki Stadium. There was further concern when their pole vault gold medalist Yelena Isinbayeva made controversial remarks in support of Russia's new laws, which make "the propagandizing of non-traditional sexual relations among minors" a criminal offense. She later attempted to clarify her comments, but there were renewed calls by gay rights groups for a boycott of the 2014 Winter Games in Sochi, the next major sports event in Russia.

Summary: Usain Bolt wins third gold of world championship. Anchors Jamaica to 4x100m relay victory. Eighth gold at the championships for Bolt. Jamaica double up in women's 4x100m relay.

Extracted summary: Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men's 4x100m relay. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. Earlier, Jamaica's women underlined their dominance in the sprint events by winning the 4x100m relay gold, anchored by Shelly-Ann Fraser-Pryce, who like Bolt was completing a triple. Their quartet recorded a championship record of 41.29 seconds, well clear of France, who crossed the line in second place in 42.73 seconds.

As you can see the extracted summary contains the four best sentences from the article.

2.2 Universal Dependencies

Before we discuss the data processing steps any further, I need to clarify what Universal Dependencies are and why they are useful.

Figure 2.1: UD graph of "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas." Image from the Stanford site

Universal dependency trees represent the grammatical relationships between the words of a sentence; they give us the grammatical dependencies between words. They can be easily interpreted with some knowledge of the grammatical structure of languages, and they are widely used because of this. UD graphs are highly standardized and provide a language-independent way to describe grammatical structure. One example of a dependency-parsed sentence is shown in Figure 2.1¹; you can see the connections between words and their types. Some dependency parsers also predict part-of-speech (POS) tags. For this example sentence the POS tags are the following: Bills: noun, on: preposition, ports: noun, and: conjunction, immigration: noun, were: verb, submitted: verb, by: preposition, Senator: noun, Brownback: noun, Republican: noun, of: preposition, Kansas: noun

Dependency trees have been used since the early 20th century, based on the work of Lucien Tesnière. In his book titled Éléments de syntaxe structurale (Elements of Structural Syntax) [27] he described the modern grammatical dependency graph structure. We construct these trees by dependency parsing. There is a multitude of ways to do that; the library I used in my solution (stanfordnlp)² relies on a deep learning based approach. Most UD parsing methods use treebanks built from parsed corpora, which are used for annotation. Treebanks were standardized in 2013 by Ryan McDonald's team [3]. Treebanks are available in multiple languages on the universal dependency site.³ There

1 https://nlp.stanford.edu/software/stanford-dependencies.shtml
2 https://github.com/stanfordnlp/stanfordnlp
3 https://universaldependencies.org/

was a shared task [6] to build a parser capable of dependency parsing in multiple languages.
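A minimal sketch of this parsing step with stanfordnlp, using the API of its 0.2.x releases (attribute names may differ in newer versions, where the library has been renamed to Stanza).

import stanfordnlp

# Download the English models once, then build the processing pipeline.
stanfordnlp.download("en")
nlp = stanfordnlp.Pipeline(processors="tokenize,pos,lemma,depparse")

doc = nlp("Bills on ports and immigration were submitted by Senator Brownback.")
for sentence in doc.sentences:
    for word in sentence.words:
        # governor is the 1-based index of the head word (0 means the root).
        print(word.text, word.lemma, word.upos, word.governor, word.dependency_relation)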

2.3 Graph format

The graph_nets library was developed and published on GitHub⁴ by DeepMind in 2018. The team detailed their solutions in the publication titled Relational inductive biases, deep learning, and graph networks [4]. The graph network was defined to learn graph transformations over relational data, such as data from physical systems, sorting problems or shortest path problems.⁵ At first it was not clear whether such networks can be trained efficiently on real-world data,⁶ but experiments quickly showed that they can, given appropriate relational data. I defined extractive summarization as a graph transformation, where the input graph is the merged article graph and the output is the same article graph with the nodes and edges that are in the summary graph marked with one and the ones not in the summary graph marked with zero. I used the graph_nets library to train graph neural networks to learn this graph transformation. The graph_nets library uses a specific format for graphs, the so-called GraphsTuple. The user can transform multiple graphs, given in networkx or dictionary format, into a GraphsTuple object. In my project I used dictionaries because they seemed to be the more straightforward and manageable approach. The graph dictionary contains the following elements:

• nodes: list containing the feature vectors of each node. In my final solution each node’s feature vector is the lemma and the part-of-speech tag of the given word

• edges: list containing the feature vectors of each edge. The final feature vector contains the type of the given word relation (the UD dependency label).

• globals: global parameter list. I found no relevant use for this parameter in my project.

• senders: list containing the indices of the sender nodes in the order of the edges.

• receivers: list containing the indices of the receiver nodes in the order of the edges.

Figure 2.2 visualizes this kind of graph-to-dictionary transformation; I highlighted the nodes and edges and the corresponding feature vectors and indices with the same color for easy traceability.

4 https://github.com/deepmind/graph_nets
5 https://github.com/deepmind/graph_nets/tree/master/graph_nets/demos
6 https://github.com/deepmind/graph_nets/issues/36

Figure 2.2: Graph transformation to dictionary format.

Although so far I have described the transformation between one graph dictionary and one GraphsTuple, one instance of GraphsTuple usually contains a batch of multiple graphs. To be able to handle multiple graphs, the GraphsTuple object has two more fields, calculated during the transformation from the list of graph dictionaries to the GraphsTuple instance:

• n_node: the number of nodes in each graph.

• n_edge: the number of edges in each graph.

Figure 2.3: Graph transformation to GraphsTuple format. The dictionary list is not visualized.

In Figure 2.3, which depicts this, I again highlighted the nodes, edges and graph backgrounds and the corresponding feature vectors and indices with the same color.
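A minimal sketch of this conversion, assuming numeric feature vectors have already been produced for the nodes and edges (the concrete values below are placeholders).

import numpy as np
from graph_nets import utils_np

# One graph in dictionary form: 3 nodes, 2 directed edges.
graph_dict = {
    "nodes": np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]], dtype=np.float32),
    "edges": np.array([[0.5], [0.7]], dtype=np.float32),
    "globals": np.array([0.0], dtype=np.float32),
    "senders": np.array([0, 1]),
    "receivers": np.array([1, 2]),
}

# A GraphsTuple can hold a whole batch; n_node and n_edge are filled in automatically.
graphs_tuple = utils_np.data_dicts_to_graphs_tuple([graph_dict])
print(graphs_tuple.n_node, graphs_tuple.n_edge)  # [3] [2]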

2.4 Building the graphs

As I mentioned in the Introduction, I used the stanfordnlp library to generate the universal dependency parse trees from the texts. Using the lemma and part-of-speech (POS) tag of the words, I constructed the nodes of the graph the following way: if a word's lemma appears multiple times with the same POS tag in the text, I treated the occurrences as one node, and their dependencies were merged as edges. The feature vector of a node is the vector of its lemma and POS tag. The feature vector of an edge is the UD relation between the words.
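A minimal sketch of this merging step, assuming the sentences have already been parsed into (lemma, POS, head lemma, head POS, relation) tuples; the helper names are illustrative, not the exact ones in my code.

def merge_sentence_graphs(parsed_sentences):
    """Merge per-sentence UD trees into one graph keyed by (lemma, POS)."""
    node_index = {}          # (lemma, pos) -> node id
    nodes, edges, senders, receivers = [], [], [], []

    def node_id(lemma, pos):
        key = (lemma, pos)
        if key not in node_index:
            node_index[key] = len(nodes)
            nodes.append(key)
        return node_index[key]

    for sentence in parsed_sentences:
        for lemma, pos, head_lemma, head_pos, relation in sentence:
            child = node_id(lemma, pos)
            head = node_id(head_lemma, head_pos)
            senders.append(head)       # edge points from head to dependent
            receivers.append(child)
            edges.append(relation)
    return nodes, edges, senders, receivers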

Figure 2.4: Graph merge.

As you can see in Figure 2.4, the resulting graph is not a correct UD graph, because the (Word2, POS1) node has multiple incoming edges.

Figure 2.5: Graph merge with multiple matching nodes.

In Figure 2.5 you can see how multiple node matches affect the graph merging. With this method I was able to represent an article with one large, merged graph, like the one in Figure 2.6. As you can see, it is not easily interpretable for a human at first glance, but you can tell that the nodes with high connectivity appeared frequently. Most of these are stop-words, which I left in the graphs for the sake of connectivity.

Figure 2.6: The graph built from the example article.

2.5 Forming the target graphs

Like I mentioned before, I needed the target graphs to have the same format as the input graph, with the summary elements marked. Initially my approach was to try to learn the summaries written by people. For this I needed to generate summary graphs, but their structure was not adequate for training any graph neural network. So my solution was the following: I iterated through the edges of the article graph, and if the sender node, the receiver node and the edge type matched any edge in the summary, I marked them by adding a 1.0 at the end of their feature vector; otherwise I appended a 0.0 to the vector. Similarly, I iterated through the nodes as well and checked whether the same feature vector appears in the summary graph, appending 1.0 in case of a successful search and 0.0 otherwise. This way I was able to achieve the same structure, but with the summary nodes and edges marked. The problem with this approach was that some of the expressions used in the summary did not match up with anything from the original text, so the result could not contain the same level of information as the summary graph on its own. However, the dataset contained the extracted summaries with the best possible ROUGE score, and I decided to use those as the expected output of my neural network. The construction is similar to the previous one, but instead of generating another graph and then trying to find the corresponding parts, I only iterated through the UD graphs of the extracted sentences and labeled the nodes and edges in the article graph accordingly. The graphs resulting from this approach contained more information and proved to be better for training. In Figure 2.7 and Figure 2.8 you can see the results for the first article visualized in two different ways.

Figure 2.7: The colored article graph of the example article. The nodes with label 1 are highlighted in blue.

Figure 2.8: The summary graph of the example article. This is the subgraph of the article graph with the node labels of 1.
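A minimal sketch of the labeling step described above, assuming summary_nodes and summary_edges are the sets of node and edge keys occurring in the UD graphs of the extracted sentences, and node_keys/edge_keys list the keys of the article graph in order (the names are illustrative).

import numpy as np

def label_graph(graph_dict, summary_nodes, summary_edges, node_keys, edge_keys):
    """Append a 1.0/0.0 label to every node and edge feature vector."""
    node_labels = np.array(
        [[1.0] if key in summary_nodes else [0.0] for key in node_keys], dtype=np.float32)
    edge_labels = np.array(
        [[1.0] if key in summary_edges else [0.0] for key in edge_keys], dtype=np.float32)
    graph_dict["nodes"] = np.concatenate([graph_dict["nodes"], node_labels], axis=1)
    graph_dict["edges"] = np.concatenate([graph_dict["edges"], edge_labels], axis=1)
    return graph_dict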

Chapter 3

Graph neural network

3.1 Neural Networks

Artificial neural networks are used for various machine learning and deep learning purposes. In this section I introduce the basic concepts of artificial neural networks used for supervised learning. Supervised learning is the process of training a neural network with a given set of inputs and desired outputs, with the goal that the system learns to predict the target outputs with the least error possible. Most of the contents of this section are from the paper written by Ádám Kovács and me for the Scientific Students' Associations conference in 2018 [12].

3.1.1 Perceptron

The perceptron is the building block of the most basic neural networks. It functions as follows: the x input vector is element-wise multiplied by the w weight vector, and the products are summed together with the b bias parameter.

$$y = b + \sum_{i=1}^{n} x_i w_i$$

Figure 3.1: A perceptron.

3.1.2 Feed forward neural network

The feed-forward or dense neural network is built from perceptrons organized into layers, each layer fully connected to the next. Each connection has a weight parameter assigned to it. The nodes are perceptrons with one slight difference: their output is not necessarily linear. So-called activation functions are applied after the summation to bring non-linearity into the model. Such functions are, for example, the sigmoid, hyperbolic tangent and ReLU functions.

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

$$\mathrm{relu}(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}$$

So the result of one neuron in a layer is calculated as follows:

$$\hat{y} = \mathrm{activationFunction}\left(b + \sum_{i=1}^{m} x_i w_i\right)$$
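A minimal NumPy sketch computing this for a whole layer of neurons at once:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_layer(x, W, b, activation=relu):
    # x: input vector, W: weight matrix (neurons x inputs), b: bias vector.
    return activation(W @ x + b)

x = np.array([0.5, -1.0, 2.0])
W = np.random.randn(4, 3)   # 4 neurons, 3 inputs each
b = np.zeros(4)
print(dense_layer(x, W, b))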

The weight and bias parameters are modified during a process called backpropagation, which utilizes the chain rule to propagate the gradient of the loss backwards. The weights and biases are updated with their gradient multiplied by a learning rate. The loss is a difference function between the calculated value and the expected outcome. We do not just calculate the numeric difference in this case; for classification we use the cross-entropy function.

Figure 3.2: A feed forward neural network.

$$L = -\sum_{i=1}^{n} y_i \ln(\hat{y}_i)$$

The equation above can be used in a multi-class system, where n is the number of classes, $\hat{y}$ is the calculated output and $y$ is the expected output. The parameter update happens as follows:

$$\nabla_x L = \sum_k \frac{\partial L}{\partial w_k} \frac{\partial w_k}{\partial x}$$

$$w \leftarrow w - \eta \nabla_x L$$

where η is the learning rate. The learning rate can also change during the training process; this is handled by the optimizer. The main goal of optimization is to overcome the problems of setting the learning rate by hand (too high a value can cause us to jump over any good solution, too low a value results in very slow learning and settling in a local minimum). One such optimization method is called Adam. It is a cross between two other methods: adaptive learning rates and momentum. I used it for training the models described in the next chapter. The basic idea is to slowly decay the learning rate based on the previous gradients. The backpropagation method can result in under-fitting, when the results do not get close to the expected ones, and over-fitting, when the results on the training set look promising,

but on the test set they are not getting better, and start to worsen after a while. To prevent over-fitting we can use early stopping, which stops the training if the results on the test set are getting worse. I used early stopping in my solution as well.

3.1.3 Embedding layers

The function of embedding layers is to turn discrete values into fixed-length vectors. They are used mostly in natural language processing, serving as a word2vec-style translation. Word2vec is a mechanism to map a word into a vector space [16]. Word embeddings are used in basically all state-of-the-art systems related to natural language processing applications. Mikolov [17] showed that word embeddings support vector operations, like addition or subtraction, and these operations often result in a meaningful representation. If we have the example words "King", "Man" and "Woman", then vector("King") - vector("Man") + vector("Woman") will most likely result in a vector that is close to vector("Queen") in the embedding space. See Figure 3.3.

Figure 3.3: Word vector operations.

When using embedding layers we want to find a vector for each word that models the word's meaning. We achieve this by looking at the context the word usually appears in. If two words like apple and orange usually appear in the same context, then the vectors assigned to these words should have a low cosine distance between them. If you are building an NLP model, the embedding layer should be the first layer, since its purpose is to make the transition from word to vector, and the word in this case is the input. The input dimension of this layer is the size of the vocabulary and the output dimension is the size of the dense vector. Usually the vocabulary size greatly exceeds the embedding dimension, since the output vector's size is fixed and typically ranges from 300 to 1000, while the vocabulary, depending on the dataset, can be much larger than that. See Figure 3.4. Embedding layers work mostly like a lookup table that can be trained. Often we use pretrained models, like the GloVe embeddings [21], that have been trained on enormous datasets. We can also train them on our specific problem, or use the pretrained embeddings and our own embeddings simultaneously. There has been a huge step forward in this field with the appearance of pretrained embeddings built from language models. The first such embedding is ELMo [22], a deep contextualized word representation capable of handling character-level features as well as the context of a given word.

Figure 3.4: An embedding layer.

3.1.4 Recurrent Neural Networks

RNNs [24] can be used in supervised and also in unsupervised learning. They are used when the data is sequential, like text, audio, etc. In a simple feed-forward neural network, the information only moves in one direction: from the input layer to the output layer. Recurrent neural networks, on the other hand, take into account their immediate past: the output of the network at the previous time step. This internal "memory"-like functionality allows the network to remember what it has calculated before. This is illustrated in Figure 3.5. At every time step the network gets two sets of inputs: the actual input at that time step and the hidden state of the network from the previous input. In one iteration it calculates its output using the hidden state computed at that time step. It can be imagined as the same feed-forward network repeated one after the other. The hidden state mentioned above is the "memory" of the network, calculated from the previous hidden state and the input. Backpropagation is also slightly different in this case; it is called backpropagation through time: you need to "unroll" the network (see Figure 3.6) and apply backpropagation starting from the right time steps. Each time step's backpropagation can be understood as backpropagation on a separate feed-forward neural network. Vanishing or exploding gradients can be a problem with this kind of RNN. There is a multitude of solutions for exploding gradients, one of which is called gradient clipping. This technique is a very simple yet powerful way of dealing with exploding gradients: all it does is limit the size of the gradient if its norm is higher than a set threshold.

Figure 3.5: A recurrent neural network.

Figure 3.6: Unrolled recurrent neural network.

3.1.4.1 Long-Short Term Memory

Long short-term memory networks [11] are an extension of the previously discussed recurrent neural network. The main difference is that they also have an internal long-term memory. These types of networks are usually less prone to the vanishing gradient problem. Like the simple RNN, LSTMs also have hidden states, but they are calculated slightly differently.

The LSTM's hidden states are calculated using three gates:

• input gate: determines whether to let new input in

• forget gate: determines whether to forget an input because it’s not relevant anymore

• output gate: determines whether to let the input impact the output at the current time step

These gates are analog and their values range from 0 to 1 thanks to the sigmoid function. A simplified depiction can be seen in Figure 3.7.

Figure 3.7: A long short-term memory network's gates.

The sigmoid gates keep the structure differentiable, meaning that we can train it with the backpropagation through time method described above. Long short-term memory networks are used in natural language processing, but also in generative models, for example video or image description generation, text generation and so on.

3.1.5 Attention

The attention mechanism was first described by Dzmitry Bahdanau in 2015 [2] and was used for machine translation. Since then it has become a widely used tool in natural language processing. The idea behind this mechanism is that when the neural network predicts the output, it only uses parts of the given input instead of the full input: the parts where the most relevant information is concentrated. The mechanism only pays attention to these parts, and the network has to learn what to pay attention to. Usually in "sequence-to-sequence" tasks like machine translation there are two main parts of the model: an encoder and a decoder. The encoder and the decoder are usually some type of RNN, mostly LSTMs. The encoder is responsible for creating a so-called context vector from the input sequence. This context vector has a fixed length and serves as the representation of the sequence inside the model. The decoder then decodes this context vector into a sequence again; in the case of machine translation this sequence is in a different language. A depiction can be seen in Figure 3.8. The attention mechanism was used in the decoder part of this model; the encoder functions the same way. The paper explicitly stated that this attention mechanism relieves the encoder from having to encode every sequence into a fixed-length context vector. In this case we have a context vector for every word of the expected output. These context vectors are the weighted sums of the encoder's states (annotations).

Figure 3.8: A sequence-to-sequence model with encoder and decoder.

$$c_i = \sum_{j=1}^{t} \alpha_{ij} h_j$$

where the α parameters are calculated as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{t} \exp(e_{ik})}$$

where $e_{ij}$ is the corresponding energy:

$$e_{ij} = a(s_{i-1}, h_j)$$

Figure 3.9: A sequence-to-sequence model with encoder-decoder and attention.

This is an alignment model that scores how well the input around j and the output around i match. This alignment model is a feed forward neural network that is trained simultaneously with the other components of the system. The decoder uses the previous state’s output and its assigned context-vector when calculating its own target.

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

The attention-based model is shown in Figure 3.9.

3.2 The graph_nets library

As I mentioned in the previous chapter, graph neural networks are used for graph transformation. Recently an article was published about the types and possible uses of graph neural networks [30]. Graph neural networks are used in the field of physics [5], computer vision [18] and also NLP [19], but they are still not common.¹ My main source for understanding the mechanisms of the graph_nets library was the article titled Relational inductive biases, deep learning, and graph networks [4] and the demos available on GitHub. The article explains the different approaches to learning graph transformations, most importantly the graph independent and the graph dependent models. The most notable difference between the two is that while the graph dependent structure uses the data of the nodes and the edges simultaneously, the graph independent model trains on the nodes and edges separately from each other. Both models can be used under different circumstances, but they can also be used together in a training process.

3.2.1 Graph Network block

The graph_nets library contains three different block types: EdgeBlock, NodeBlock and GlobalBlock, each responsible for propagating the corresponding tensor through the right function, so we can construct different models for edges, nodes and global parameters. According to the definitions in the code:

• The NodeBlock updates the features of each node in a batch of graphs based on the previous node features, the aggregated features of the adjacent edges, and the global features of the corresponding graph.

• The EdgeBlock updates the features of each edge in a batch of graphs based on the previous edge features, the features of the adjacent nodes, and the global features of the corresponding graph.

• The GlobalBlock updates the global features of each graph in a batch based on the previous global features, the aggregated features of the edges of the graph, and the aggregated features of the nodes of the graph.

On top of using these graph-structure-dependent building blocks, the GraphNetwork model (or graph dependent model, for the sake of clarity) utilizes the updated edge features to calculate the node features, and these results are used to update the global features.

3.2.2 Graph Independent block

As the name suggests, in this block every aspect of the graph is independent from the others: the defined update functions only take into account their own input tensor, which can be the node tensor, the edge tensor or the global tensor.
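A minimal sketch of instantiating the two module types with Sonnet MLPs as the update functions; the layer sizes are arbitrary placeholders, following the pattern used in the graph_nets demos.

import sonnet as snt
from graph_nets import modules

# Graph dependent model: edge, node and global updates also see neighboring features.
graph_network = modules.GraphNetwork(
    edge_model_fn=lambda: snt.nets.MLP([32, 32]),
    node_model_fn=lambda: snt.nets.MLP([32, 32]),
    global_model_fn=lambda: snt.nets.MLP([32, 32]))

# Graph independent model: each update function only sees its own tensor.
graph_independent = modules.GraphIndependent(
    edge_model_fn=lambda: snt.nets.MLP([32]),
    node_model_fn=lambda: snt.nets.MLP([32]),
    global_model_fn=lambda: snt.nets.MLP([32]))

# Both modules take a GraphsTuple and return a GraphsTuple with updated features:
# output = graph_network(graph_independent(input_graphs))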

1https://github.com/thunlp/GNNPapers

3.2.3 Self Attention block

The graph_nets library contains a multi-head self-attention module, which is mostly based on the article titled Attention Is All You Need [28], the paper that introduced multi-head attention. This mechanism updates the node features based on the node values, node keys and node queries, also taking into account the structure of the graph.

Chapter 4

Models

In this chapter I introduce the models I experimented with. The Encode Process Decode model is defined in the graph_nets demos; the SimpleGraphAttention and the GraphAttention models are defined by me.

4.1 Encode Process Decode model

This model has been tested and used in the demos provided on GitHub1.

Figure 4.1: Representation of the model.

As shown in Figure 4.1 above, the model consists of three different parts.

4.1.1 Encoder

The encoder model is a graph independent model that uses multi layer perceptrons and layer normalization on the global attributes, the edge feature vectors and the node feature vectors. There are no shared variables between the multi layer perceptrons.

4.1.2 Core

This is the only graph dependent part of the model. It also uses multi-layer perceptrons and layer normalization like the previous one, but the graph dependency enables it to use the

1https://github.com/deepmind/graph_nets/blob/master/graph_nets/demos/models.py

predicted edge features to influence the node features, and these in turn impact the global feature. This section of the network is repeated, resulting in a hidden state that is fed forward again into the Core section. Each step has its own output at its time step.

4.1.3 Decoder

The decoder model has the same structure as the encoder. It does not use any kind of activation on the last layer, which enables it to generalize better, but also makes it harder to use for a specific purpose, e.g. classifying whether or not the summary graph contains the given node or edge.
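A minimal sketch of how the demo model is instantiated and applied, following the pattern of the graph_nets demo code (the import path assumes the demos ship with the installed package; output sizes of 2 correspond to the two classes, and the number of processing steps controls how many times the Core is applied).

from graph_nets.demos import models

# Two output features per node/edge: logits for "in the summary" vs. "not in the summary".
model = models.EncodeProcessDecode(edge_output_size=2, node_output_size=2)

num_processing_steps = 10
# input_graphs is a GraphsTuple of TF tensors built from the article graphs (Chapter 2);
# the model returns one output GraphsTuple per processing step.
# output_graphs = model(input_graphs, num_processing_steps)
# node_logits = output_graphs[-1].nodes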

4.2 SimpleGraphAttention model

I based the architecture of this model on the Encode Process Decode model. I modified it to be better suited to NLP problems, as it transforms the input words into a higher-dimensional vector space, as you can see in Figure 4.2.

Figure 4.2: Representation of the model.

4.2.1 Encoder

The encoder part of the network contains two Embedding layers: one for the lemmas of the nodes and one for the edges' type (see Figure 4.3). The necessity of the second one is questionable, since it does not carry as much information and could also be represented with a simple one-hot encoded vector, its dimensionality being much lower than that of the nodes. However, I decided to leave it in the model and make it a changeable parameter of the network. For the global features I only used a simple linear layer without activation functions, since they do not need to be encoded in any way. All of these blocks work independently from each other, updating their respective parts, because they are defined as the edge, node and global functions of a GraphIndependent block.

Figure 4.3: Representation of the encoder part of the model.

4.2.2 Network

The Network, or core part of the model consists of activated linear layers with differing sizes as depicted on Figure 4.4.

Figure 4.4: Representation of the network part of the model.

Since this section of the model is graph dependent, after each feature update, the next block gets the updated feature vectors instead of the original ones.

4.2.3 Decoder

The decoder part of the model is also graph independent, just like the encoder, and it simply serves as the softmax output of the whole network. Its structure is presented in Figure 4.5.

Figure 4.5: Representation of the decoder part of the model.

4.3 GraphAttention model

The starting point of this model was the SimpleGraphAttention model, which I developed further using recent advancements regarding attention in graphs. There were two versions of this model: one that used LSTM cells in the node block of the Network part, and one that used only linear layers. Since the main structure is the same, I only included the graphical presentation of the latter version in Figure 4.6 and Figure 4.7. This version proved to be more usable due to the large memory usage of the LSTM cells.

Figure 4.6: Representation of the model.

The Encoder and Decoder segments of the network are identical to their counterparts in the SimpleGraphAttention model.

4.3.1 Network and Graph Attentional Layer

The network part of the model underwent some changes compared to the SimpleGraphAttention model's Network. The most notable change is the new, independent part of the network, called GAT. You can see a simplified view of the network in Figure 4.7.

Figure 4.7: Representation of the network part of the model.

The Graph Attentional Layer was described in a 2018 paper titled Graph Attention Networks [29]. It was designed to work with Graph Convolutional Network structures, but I was able to implement it for the Graph Neural Network architecture as well. This layer updates the node features the following way: suppose the node features are $\vec{h}_i$; we define a weight matrix $W \in \mathbb{R}^{F' \times F}$ (mapping the $F$ input features to $F'$ output features) and a feed-forward attention layer. The normalized attention coefficients, defined for neighboring nodes $i$ and $j$, are:

$$a_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathrm{attention}(W\vec{h}_i \,\|\, W\vec{h}_j)\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\mathrm{attention}(W\vec{h}_i \,\|\, W\vec{h}_k)\right)\right)}$$

where $\mathcal{N}_i$ denotes the neighbors of node $i$.

$$\vec{h}'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} a_{ij} W \vec{h}_j\right)$$

Since we are working with multi-head attention, the update equation is modified as follows:

$$\vec{h}'_i = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} a^k_{ij} W^k \vec{h}_j\right)$$

where K is the number of heads in the multi-head attention.
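A minimal NumPy sketch of a single attention head computing these coefficients and the updated node features; it uses a dense adjacency matrix for one small graph, simplifies the attention scoring layer to a dot product with a learned vector a, and takes σ to be the sigmoid.

import numpy as np

def gat_head(H, A, W, a, leaky_slope=0.2):
    """One GAT head. H: (N, F) node features, A: (N, N) adjacency with self-loops,
    W: (F, F') projection matrix, a: (2*F',) attention vector."""
    Wh = H @ W                                   # (N, F') projected node features
    N = Wh.shape[0]
    # e[i, j] = LeakyReLU(a . [Wh_i || Wh_j])
    e = np.array([[a @ np.concatenate([Wh[i], Wh[j]]) for j in range(N)]
                  for i in range(N)])
    e = np.where(e > 0, e, leaky_slope * e)      # LeakyReLU
    e = np.where(A > 0, e, -np.inf)              # only attend to neighbors
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # softmax over each neighborhood
    return 1.0 / (1.0 + np.exp(-(alpha @ Wh)))   # sigma chosen as sigmoid here

For K heads, the per-head outputs are averaged before the nonlinearity, as in the equation above.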

Chapter 5

Experiments and results

In this chapter I present the results achieved on the data. The two classes (whether to include a word or edge in the summary) appear disproportionately in the dataset: there are 2.5 times more nodes left out of the summaries than nodes taken in, and 4.8 times more edges left out than contained in the summary.

5.1 Evaluation methods

I used three methods to evaluate the results. The most basic method is the accuracy measure, which calculates the ratio of the correct answers as follows:

$$\mathrm{accuracy} = \frac{\text{correct answers}}{\text{all answers}}$$

Since we are talking about graphs we need to measure the accuracy on the nodes and (if we are also training on them) the edges.

$$\mathrm{accuracy}_{\text{nodes and edges}} = \frac{\text{correct nodes} + \text{correct edges}}{\text{all nodes} + \text{all edges}}$$

In our case this metric does not carry real information about the performance of the model, because of the high disparity between the classes. The other common evaluation metric for classification is the F-score. This is better for evaluating classification results, because it is not as greatly affected by the distribution of the classes. For this we need to calculate the true positive, true negative, false positive and false negative measures for every class; see Figure 5.1.

$TP_{node=1}$ = number of nodes correctly predicted to have label 1

$TN_{node=1}$ = number of nodes correctly predicted to have a label other than 1

$FP_{node=1}$ = number of nodes incorrectly predicted to have label 1

$FN_{node=1}$ = number of nodes incorrectly predicted to have a label other than 1

With these measures we can calculate the precision and recall. The precision tells us the

ratio of correct guesses of the class out of all the instances where the model predicted that class. The recall is the ratio of correct guesses out of all the instances that actually belong to the class. The harmonic mean of the precision and recall is the F-score.

Figure 5.1: Visual representation of the true positive, true negative, false positive and false negative measures.

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

$$F1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

I constructed extractive summaries from the resulting graphs and the original articles in multiple ways, described in Section 5.2, and calculated the various ROUGE scores for each method's output.

5.2 Experiments and results

I have experimented with both the Encode Process Decode and the Graph Attention Network models with varying success.

5.2.1 Encode Process Decode

At first I tried to get the Encode-Process-Decode model to predict whether to put something in the summary graph or not using only the word ids as features, but I realized it would probably be more effective to also use the POS tags. I also experimented with different loss functions, but for now I decided to use softmax cross-entropy with output feature sizes of 2. Training on more than 7000 articles for 5 epochs (early stopping kept it from running any longer), I got the following results.

Softmax cross-entropy loss on nodes and edges, averaged = 0.447

The classification report on the test set is shown below; the positive class consists of the nodes that were in the summary:

precision on nodes    62.8
recall on nodes       21.3
F1 on nodes           31.9

Table 5.1: Classification report on the nodes. The edges achieved an F1 score of 0.

As you can see, the network hardly ever classifies a node as "1" (meaning that it should be in the summary graph), but when it does, it is correct 62.8% of the time. This might be because the relative frequency of the "1" nodes is lower than that of the "0" nodes, so the network predicts "0" nodes to be more likely. This model has not been evaluated further.

5.2.2 SimpleGraphAttention Network

I trained the network on the whole training set (80% of the whole dataset) for 3 epochs. It stopped after the third epoch because of the early stopping mechanism. I evaluated the trained network on the test set and the results are the following:

Softmax cross entropy loss on nodes = 0.809

The classification report on the test set follows; the positive class consists of the nodes that belong to the summary:

precision on nodes   53.4
recall on nodes      54.2
F1 on nodes          53.8

Table 5.2: Classification report on the nodes.

5.2.2.1 Summary reconstruction and evaluation

I tried three different approaches to summary extraction, each of which was evaluated with the ROUGE scoring mechanism. For this I used the Perl-based pyrouge package.
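pyrouge wraps the original Perl ROUGE script and expects the system and reference summaries as text files on disk. The snippet below is a minimal usage sketch; the directory names and filename patterns are placeholders, not the project's actual layout.

\begin{verbatim}
from pyrouge import Rouge155

r = Rouge155()
# Placeholder directories: one plain-text summary file per article in each.
r.system_dir = 'output/system_summaries'
r.model_dir = 'output/reference_summaries'
r.system_filename_pattern = r'article.(\d+).txt'
r.model_filename_pattern = 'article.#ID#.txt'

output = r.convert_and_evaluate()
scores = r.output_to_dict(output)
print(scores['rouge_1_f_score'], scores['rouge_2_f_score'],
      scores['rouge_l_f_score'])
\end{verbatim}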

5.2.2.1.1 Reconstruction with the node average

This reconstruction method scores each sentence of the original article using the equation below, where $\mathrm{score}_{word}$ is the predicted probability of the word (node) being in the summary graph.

\[
\mathrm{score}_{sentence} = \frac{\sum_{word \in sentence,\; word \notin stopwords} \mathrm{score}_{word}}{\left|\{word \in sentence : word \notin stopwords\}\right|}
\]

Based on the calculated score of each sentence we can order the sentences by relevance and keep the four most relevant ones from the article as the summary. This method does not utilize the information in the graph structure, nor does it prefer longer sentences. As a result, the average number of words in a summary is 75.97. For a fair comparison I generated shorter TextRank-based gensim summaries and cut down the number of sentences in the extracted summaries: I compared the number of words in each reconstructed summary and chose the number of sentences so that the word counts would be as close as possible in each pair.
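A minimal sketch of this reconstruction is shown below. It assumes the network output has already been collected into a word-to-probability mapping and that the article is available as a list of tokenized sentences; both are simplifications of the project's graph format, and the function names are illustrative.

\begin{verbatim}
def sentence_score_by_node_average(words, node_scores, stopwords):
    """Average predicted summary probability of the non-stopword words."""
    content = [w for w in words if w.lower() not in stopwords]
    if not content:
        return 0.0
    return sum(node_scores.get(w, 0.0) for w in content) / len(content)


def extract_summary(sentences, node_scores, stopwords, n_sentences=4):
    """Keep the n highest-scoring sentences, returned in article order."""
    ranked = sorted(range(len(sentences)), reverse=True,
                    key=lambda i: sentence_score_by_node_average(
                        sentences[i], node_scores, stopwords))
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
\end{verbatim}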

Evaluation method    System   Gensim   Maximum
average word count   76       77       74
ROUGE-1              39.06    41.77    59.11
ROUGE-2              12.56    14.08    29.95
ROUGE-L              24.03    25.99    40.66
ROUGE-SU*            15.60    17.34    33.73

Table 5.3: ROUGE scores on the test set with reconstruction method using only the node scores

5.2.2.1.2 Reconstruction with just the graph structure

In contrast to the previous version, this reconstruction method does not use the output of the graph neural network; it is included here as a contrast and as a step toward the next summarization method.

\[
\mathrm{score}_{sentence} = \sum_{edge \in sentence\ graph} \Big( [1 \text{ if sender word} \notin stopwords] + [1 \text{ if receiver word} \notin stopwords] \Big)
\]

This score is essentially the summed degree of the non-stopword nodes in a sentence. The ordering of the sentences is based on this score, and the summary contains the four best sentences. The average length of a summary is 138.42 words.
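A sketch of this score, assuming each sentence graph is given as a list of (sender word, receiver word) pairs, one per edge (an illustrative representation, not the project's exact data structure):

\begin{verbatim}
def sentence_score_by_degree(sentence_edges, stopwords):
    """Sum an indicator for every non-stopword edge endpoint; this equals
    the summed degree of the non-stopword nodes in the sentence graph."""
    score = 0
    for sender, receiver in sentence_edges:
        score += sender.lower() not in stopwords
        score += receiver.lower() not in stopwords
    return score
\end{verbatim}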

Evaluation method    System   Gensim   Maximum
average word count   138      136      104
ROUGE-1              54.17    56.35    73.91
ROUGE-2              18.82    20.80    39.29
ROUGE-L              32.68    34.47    51.01
ROUGE-SU*            26.69    29.27    49.15

Table 5.4: ROUGE scores on the test set with reconstruction method using just the graph structure

41 5.2.2.1.3 Reconstruction with the graph structure and the node scores

This method is a blend of the previous two: it utilizes the structure of the graph but also uses the output of the graph neural network.

\[
\mathrm{score}_{sentence} = \sum_{edge \in sentence\ graph} \Big( [\mathrm{score}_{sender\ word} \text{ if sender word} \notin stopwords] + [\mathrm{score}_{receiver\ word} \text{ if receiver word} \notin stopwords] \Big)
\]

These summaries also contain the best four sentences. The average number of words in these summaries is 127.11.
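Under the same illustrative assumptions as before, the blended score simply replaces the constant 1 with the predicted node probability of each non-stopword endpoint:

\begin{verbatim}
def sentence_score_blended(sentence_edges, node_scores, stopwords):
    """Degree-style score where each non-stopword endpoint contributes its
    predicted probability of belonging to the summary graph."""
    score = 0.0
    for sender, receiver in sentence_edges:
        if sender.lower() not in stopwords:
            score += node_scores.get(sender, 0.0)
        if receiver.lower() not in stopwords:
            score += node_scores.get(receiver, 0.0)
    return score
\end{verbatim}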

Evaluation method    System   Gensim   Maximum
average word count   127      136      104
ROUGE-1              56.80    56.35    73.91
ROUGE-2              21.97    20.80    39.29
ROUGE-L              34.95    34.47    51.01
ROUGE-SU*            29.45    29.27    49.15

Table 5.5: ROUGE scores on the test set with reconstruction method using both graph structure and the node scores

5.2.2.2 Example

To better visualize the results, I compare the summary graph from Chapter 2 (Figure 2.8) to the calculated graph (Figure 5.2). The generated summaries are also given below.

Figure 5.2: The calculated summary graph of the example article with just the nodes labeled 1.

Reconstruction with the node average
Fraser-Pryce, like Bolt aged 26, became the first woman to achieve three golds in the 100-200 and the relay. Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men’s 4x100m relay. The British quartet, who were initially fourth, were promoted to the bronze which eluded their men’s team. Bolt’s final dash for golden glory brought the eight-day championship to a rousing finale, but while the hosts topped the medal table from the United States there was criticism of the poor attendances in the Luzhniki Stadium.

Reconstruction with just the graph structure
The 26-year-old Bolt has now collected eight gold medals at world championships, equaling the record held by American trio Carl Lewis, Michael Johnson and Allyson Felix, not to mention the small matter of six Olympic titles. Earlier, Jamaica’s women underlined their dominance in the sprint events by winning the 4x100m relay gold, anchored by Shelly-Ann Fraser-Pryce, who like Bolt was completing a triple. The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. Defending champions, the United States, were initially back in the bronze medal position after losing time on the second handover between Alexandria Anderson and English Gardner, but promoted to silver when France were subsequently disqualified for an illegal handover.

Reconstruction with the graph structure and the node scores
The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. Defending champions, the United States, were initially back in the bronze medal position after losing time on the second handover between Alexandria Anderson and English Gardner, but promoted to silver when France were subsequently disqualified for an illegal handover. Germany’s Christina Obergfoll finally took gold at global level in the women’s javelin after five previous silvers, while Kenya’s Asbel Kiprop easily won a tactical men’s 1500m final. Earlier, Jamaica’s women underlined their dominance in the sprint events by winning the 4x100m relay gold, anchored by Shelly-Ann Fraser-Pryce, who like Bolt was completing a triple.

5.2.3 GraphAttention Network

I trained this model on ten thousand examples for 3 epochs. In each epoch the network trained on different data, so in total I used thirty thousand input and target graph pairs. Training stopped after the third epoch because of the early stopping mechanism. Due to time constraints I was unable to train the network on the whole training set. I evaluated the trained network and the results are the following:

Softmax cross entropy loss on nodes = 0.7996

The classification report on the test set follows; the positive class consists of the nodes that belong to the summary:

precision on nodes   43.1
recall on nodes      63.8
F1 on nodes          51.4

Table 5.6: Classification report on the nodes.

5.2.3.1 Summary reconstruction and evaluation

I tried the same reconstruction methods on the results of the GraphAttention model. Keep in mind that this model was trained for less time and on less data.

5.2.3.1.1 Reconstruction with the node average

The average number of words in a four sentence long summary is 75.65.

Evaluation method    System   Gensim   Maximum
average word count   76       77       74
ROUGE-1              36.27    41.77    59.11
ROUGE-2              11.25    14.08    29.95
ROUGE-L              22.55    25.99    40.66
ROUGE-SU*            14.22    17.34    33.73

Table 5.7: ROUGE scores on the test set with reconstruction method using only the node scores

5.2.3.1.2 Reconstruction with just the graph structure

The result is the same as before, since we only consider the graph structures and they haven’t changed.

5.2.3.1.3 Reconstruction with the graph structure and the node scores

The average word count in a four sentence long summary is 130.21.

Evaluation method    System   Gensim   Maximum
average word count   130      136      104
ROUGE-1              51.14    56.35    73.91
ROUGE-2              18.02    20.80    39.29
ROUGE-L              31.41    34.47    51.01
ROUGE-SU*            25.19    29.27    49.15

Table 5.8: ROUGE scores on the test set with reconstruction method using both graph structure and the node scores

5.2.3.2 Example

Similarly to the previous case, I compared the summary graph from Chapter 2 (Figure 2.8) to the calculated graph (Figure 5.3). The generated summaries are also given below.

Figure 5.3: The calculated summary graph of the example article with just the nodes labeled 1.

Reconstruction with the node average
Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men’s 4x100m relay. Fraser-Pryce, like Bolt aged 26, became the first woman to achieve three golds in the 100-200 and the relay. The British quartet, who were initially fourth, were promoted to the bronze which eluded their men’s team. The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover.

Reconstruction with the graph structure and the node scores
The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds. Defending champions, the United States, were initially back in the bronze medal position after losing time on the second handover between Alexandria Anderson and English Gardner, but promoted to silver when France were subsequently disqualified for an illegal handover. Earlier, Jamaica’s women underlined their dominance in the sprint events by winning the 4x100m relay gold, anchored by Shelly-Ann Fraser-Pryce, who like Bolt was completing a triple. Germany’s Christina Obergfoll finally took gold at global level in the women’s javelin after five previous silvers, while Kenya’s Asbel Kiprop easily won a tactical men’s 1500m final.

5.3 Comparison between models

Table 5.9 contains the comparable ROUGE scores of the SimpleGraphAttention and GraphAttention models' summaries reconstructed with the node average method, together with the output of the TextRank-based benchmark method and the maximum achievable ROUGE scores at the corresponding summary lengths. These data are taken directly from Table 5.3 and Table 5.7.

Metric       Simple node avg   Attended node avg   TextRank   Maximum
avg words    76                76                  77         74
ROUGE-1      39.06             36.27               41.77      59.11
ROUGE-2      12.56             11.25               14.08      29.95
ROUGE-L      24.03             22.55               25.99      40.66
ROUGE-SU*    15.60             14.22               17.34      33.73

Table 5.9: Short summary comparisons.
The Simple node avg is the graph generated with the SimpleGraphAttention model and the summary constructed by the node averaging method.
The Attended node avg is the graph generated with the GraphAttention model and the summary constructed by the node averaging method.
The TextRank is the summary generated with gensim’s TextRank method.
The Maximum is the highest achievable ROUGE score at a similar length.

Table 5.10 contains the comparable ROUGE scores of the SimpleGraphAttention and GraphAttention models' summaries reconstructed with the node scores and the graph structure, together with the output of the TextRank-based benchmark method and the maximum achievable ROUGE scores, all constructed from the top 4 sentences. These data are taken directly from Table 5.4, Table 5.5 and Table 5.8.

Metric       Just structure   Simple   Attended   TextRank   Maximum
avg words    138              127      130        136        104
ROUGE-1      54.17            56.80    51.14      56.35      73.91
ROUGE-2      18.82            21.97    18.02      20.80      39.29
ROUGE-L      32.68            34.95    31.42      34.47      51.01
ROUGE-SU*    26.69            29.45    25.19      29.27      49.15

Table 5.10: Long summary comparisons.
The Just structure is the summary constructed by the node’s degree.
The Simple is the graph generated with the SimpleGraphAttention model and the summary constructed by the method that takes into account the graph structure and the node score as well.
The Attended is the graph generated with the GraphAttention model and the summary constructed by the method that takes into account the graph structure and the node score as well.
The TextRank is the summary generated with gensim’s TextRank method.
The Maximum is the highest achievable ROUGE score at a similar length.

5.3.1 Results by summary length

Figure 5.4: The average number of words by the number of sentences in the summary.

Model/Length    1    2    3    4    5    6    7    8    9    10
Maximum        26   53   79  104  129  152  174  196  217  238
TextRank       39   73  105  136  165  193  220  246  271  294
G. struct      40   75  108  138  167  195  221  246  269  292
S. graph       36   68   98  127  155  182  207  231  255  277
S. node        17   36   56   76   96  117  137  158  178  199
A. graph       37   70  101  130  158  185  211  235  258  281
A. node        17   36   56   76   96  116  137  157  178  198

Table 5.11: The average word counts by the number of sentences in the summary.
The Maximum is the summary created by the greedy algorithm.
The TextRank is the summary generated with gensim’s TextRank method.
The G. struct is the summary constructed by the node’s degree.
The S. graph is the graph generated with the SimpleGraphAttention model and the summary constructed by the method that takes into account the graph structure and the node score as well.
The S. node is the graph generated with the SimpleGraphAttention model and the summary constructed by the node averaging method.
The A. graph is the graph generated with the GraphAttention model and the summary constructed by the method that takes into account the graph structure and the node score as well.
The A. node is the graph generated with the GraphAttention model and the summary constructed by the node averaging method.

Figure 5.5: The ROUGE-1 score by the number of sentences in the summary.

M/L          1      2      3      4      5      6      7      8      9      10
Maximum   22.75  41.72  55.34  66.24  70.85  73.42  75.07  76.22  77.03  77.63
TextRank  23.58  36.83  45.4   51.51  56.15  59.82  62.8   65.26  67.37  69.14
G. struct 22.93  37.05  46.9   54.19  59.85  64.31  67.92  70.85  73.3   75.35
S. graph  24.88  39.77  49.74  56.79  62.08  66.18  69.47  72.1   74.29  76.11
S. node   11.95  22.38  31.33  39.06  45.69  51.34  56.1   60.12  63.55  66.53
A. graph  22.28  35.66  44.66  51.14  56.08  59.91  62.99  65.49  67.53  69.27
A. node   11.54  21.34  29.47  36.27  41.93  46.76  50.92  54.47  57.55  60.18

Table 5.12: The ROUGE-1 score by the number of sentences in the summary. The abbreviations are the same as in Table 5.11.

Figure 5.6: The ROUGE-2 score by the number of sentences in the summary.

M/L          1      2      3      4      5      6      7      8      9      10
Maximum    8.5   18.25  26.33  33.65  35.52  36.51  37.16  37.62  37.97  38.25
TextRank   6.47  11.07  14.75  17.88  20.58  22.94  25.01  26.83  28.46  29.9
G. struct  5.84  10.79  15.1   18.84  22.15  25.04  27.56  29.77  31.72  33.45
S. graph   7.77  13.54  18.18  21.97  25.13  27.8   30.1   32.04  33.73  35.23
S. node    2.95   6.04   9.28  12.56  15.78  18.79  21.54  24.04  26.31  28.38
A. graph   6.17  10.9   14.76  18.02  20.79  23.19  25.25  27.01  28.52  29.85
A. node    2.93   5.8    8.58  11.25  13.77  16.15  18.38  20.41  22.3   23.98

Table 5.13: The ROUGE-2 score by the number of sentences in the summary. The abbreviations are the same as in Table 5.11.

Figure 5.7: The ROUGE-L score by the number of sentences in the summary.

M/L          1      2      3      4      5      6      7      8      9      10
Maximum   16.26  27.88  37.43  45.58  48.09  49.6   50.7   51.55  52.27  52.87
TextRank  15.65  22.89  27.88  31.66  34.75  37.3   39.44  41.26  42.84  44.21
G. struct 15.42  23.05  28.47  32.69  36.14  38.98  41.39  43.45  45.21  46.73
S. graph  17.09  25.16  30.74  34.95  38.31  41.05  43.36  45.26  46.89  48.29
S. node    8.69  14.73  19.7   24.03  27.87  31.26  34.23  36.83  39.12  41.21
A. graph  15.21  22.48  27.54  31.42  34.5   37.06  39.22  41.02  42.55  43.87
A. node    8.46  14.2   18.72  22.55  25.84  28.74  31.31  33.58  35.6   37.39

Table 5.14: The ROUGE-L score by the number of sentences in the summary. The abbreviations are the same as in Table 5.11.

Figure 5.8: The ROUGE-SU* score by the number of sentences in the summary.

M/L          1      2      3      4      5      6      7      8      9      10
Maximum    5.71  17.06  29.77  42.81  48.57  52.12  54.58  56.39  57.74  58.79
TextRank   6.14  13.6   20.21  25.84  30.63  34.74  38.28  41.35  44.04  46.37
G. struct  5.75  13.2   20.32  26.71  32.36  37.25  41.51  45.19  48.4   51.2
S. graph   6.68  15.09  22.83  29.45  35.05  39.81  43.88  47.32  50.28  52.85
S. node    2.23   6.01  10.62  15.6   20.63  25.5   30.01  34.13  37.89  41.33
A. graph   5.65  12.79  19.42  25.19  30.16  34.39  38.03  41.14  43.79  46.11
A. node    2.2    5.75   9.89  14.22  18.42  22.45  26.24  29.72  32.93  35.81

Table 5.15: The ROUGE-SU* score by the number of sentences in the summary. The abbreviations are the same as in Table 5.11.

Chapter 6

Conclusion and future work

In my opinion, we have only started to explore the possibilities of this technique in natural language processing research, and there are still a lot of open questions. The results show that the SimpleGraphAttention model trained on the extracted summaries can achieve ROUGE scores above the confidence range of the benchmark, gensim's TextRank-based algorithm.
My plans for the future include further experimenting with the optimization, modifying the graph attention, and trying out new summary reconstruction methods and new structures. I would also like to compare the system more thoroughly to the state-of-the-art results, which might require reproducing those models. Modifications of the graph attention layer may include taking the edge types into account in some way, because the current layer only accounts for node connectivity. I might reintroduce a modified edge loss for the training process, but experiments are needed to determine whether it would be beneficial for the network, since previous iterations suggested otherwise.
The experiments show that the merged dependency graphs can be used for summarization on their own, and they produce similar or even better results than the benchmark. We can achieve even higher ROUGE scores using graph neural networks. In conclusion, I think there is room for further development and this research area is highly promising. I would like to delve into graph neural networks even more now that I have come to know them better.

Acknowledgement

I would like to express my gratitude towards my consultant, Kovács Ádám, and Recski Gábor, who helped me tremendously throughout this project. I would also like to thank my boyfriend, Szabó Roland, for his love, help and support. Without his kind words and supportive attitude I would not have been able to achieve this. My family was absolutely supportive and loving during this time; I owe them great gratitude.


Appendices

A.1 Modules and packages

Figure A.1: The most important modules and packages in the project

A.1.1 Functions and classes in each relevant module

• main.py
  – predict function
  – preprocess function
  – test function
  – train function
  – visualize function

• preprocessor.py
  – dependency_parse function
  – process_line function
  – main function

• cnn_parser.py
  – highlight_to_sentences function
  – graph_builder function
  – main function

• cnn_extractive_parser.py
  – feature_appender function
  – article_graph_builder function
  – main function

• helper_functions.py
  – is_valid_graph function
  – print_graphs_tuple function
  – visualize_graph function
  – visualize_original_graph function
  – visualize_graph_with_colors function

• network.py
  – generate_placeholder function
  – train_model function
  – run_session function
  – train_generator function
  – predict function
  – predict_one_graph function
  – test function

• compute_measures.py
  – compute_accuracy function
  – compute_accuracy_on_nodes function
  – compute_accuracy_ratio function
  – compute_one_tp_tn_fp_fn function
  – compute_tp_tn_fp_fn function
  – add_tp_tn_fp_fn function
  – compute_precision_recall_f1 function

• graph_losses.py
  – regression_loss function
  – binary_categorical_loss function
  – softmax_loss function
  – softmax_loss_on_nodes function

• graph_file_handling.py
  – process_line function
  – load_graphs function
  – generate_graph function
  – get_first_batch_graph_dict function
  – save_predicted_graphs function

• models.model_with_attention.py
  – Encoder class
  – SimpleGraphAttention class
  – GraphAttention class

• modules.sonnet_nets.py
  – ActivatedLSTM class
  – ActivatedLinear class
  – NodeEmbedding class
  – EdgeEmbedding class
  – SimplifiedSelfAttention class
  – GraphAttentionLayer class

A.2 Class diagram

Figure A.2: The classes in the project. The image was generated using Pyreverse.
