Reliability of online news media during periods of stress

Yoram Timmerman

Supervisor: Prof. dr. ir. Antoon Bronselaer
Counsellor: Hannah Van den Bossche

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Telecommunications and Information Processing
Chair: Prof. dr. ir. Herwig Bruneel
Faculty of Engineering and Architecture
Academic year 2017-2018

Preface

Exactly one year ago, when deciding to write a thesis on the reliability of online news, I knew for sure that I was heading towards an interesting year. The subject perfectly combined my two biggest passions: computer science engineering and news. I also knew, however, that writing a thesis would require a lot of work. One year of very intensive work later, I can honestly say it was worth it. I am proud of and satisfied with the end result that I can finally present.

First, I would like to thank my supervisor, prof. dr. ir. Antoon Bronselaer. He was very closely involved in the research process leading to this final work and was always available to provide the necessary feedback. Without his help, it would have been impossible to complete this thesis. I would also like to thank hir. Hannah Van den Bossche, who helped me a great deal. She even performed part of the manual error annotations in this thesis, to verify whether her results coincided with mine.

Furthermore, I would like to thank my mother and my boyfriend, who provided me with the optimal circumstances to write this thesis. Without their support, this thesis would not have been the same. A final word of thanks goes to my cat, who was present during the writing of almost every page that this work contains. Although she probably does not remember anything of what I explained to her over the previous months, she was always there to listen to my ideas.

Yoram Timmerman, Ghent, May 2018

Permission for usage

“The author(s) gives (give) permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.”

Yoram Timmerman, Ghent, May 2018

Reliability of online news media during periods of stress

Yoram Timmerman

Supervisor: Prof. dr. ir. Antoon Bronselaer
Counsellor: hir. Hannah Van den Bossche

Master’s dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Telecommunications and Information Processing
Chair: Prof. dr. ir. Herwig Bruneel
Faculty of Engineering and Architecture
Ghent University
Academic year 2017-2018

Summary

This Master dissertation aims to perform an extensive study of online news reliability in Flanders. More specifically, the two largest Flemish online newspapers are investigated. During the analysis, use is made of both manual and automated techniques. Next to a general overview of the reliability, attention is also given to the influence of breaking news events and their accompanying periods of stress on this reliability.

We analyze three different aspects of the reliability of online news: accuracy, consistency and relevance. In a first part, the accuracy of Flemish online newspapers and the influence of breaking news events on this accuracy are investigated. This is approached by manually screening a data set of articles for errors. The results of these manual annotations are analyzed and conclusions regarding the accuracy of online news in Flanders are drawn.

In a second part, an algorithm is developed that searches data sets of articles about the same subject to find inconsistencies that exist between the articles. This algorithm processes structured graph representations of initially unstructured text articles. This technique is then tested on data sets of articles written during specific periods of stress to be able to quantify the problem of consistency of articles during these periods of stress.

Thirdly, a specific aspect of the relevance of articles during periods of stress, namely their freshness, is analyzed. An automatic analysis method based on similarity measures is presented that can be used to this end. By testing this method on two period-of-stress data sets, the problem of lack of freshness is investigated.

Keywords: online news reliability, graph databases, text processing, similarity measures

Reliability of online news media during periods of stress

Yoram Timmerman

Supervisor(s): Antoon Bronselaer, Hannah Van den Bossche

Abstract—Studies investigating the accuracy of printed news media are widespread. However, as far as we know, no studies exist that investigate the broad concept of reliability of online news media in Flanders and the influence of breaking news events on this. In this paper, the reliability of Flemish online news media is investigated by analyzing their accuracy, consistency and relevance. Next to an investigation of the accuracy of Flemish online news media under the influence of different breaking news events, two algorithms are presented. One allows journalists to find numerical inconsistencies within a data set of articles about the same subject. Another algorithm can be used to detect how much new information an article contains.

Keywords—online news reliability, period of stress, graph databases, text processing, similarity measures

I. INTRODUCTION

Following what happens around the world by reading newspapers or watching news shows on television is an important part of many people's daily lives. However, the way people follow the news is changing very rapidly (Picone, 2016). Instead of reading printed newspapers, more and more people start to use online newspapers as their primary source of information. The Digital News Report for Belgium in 2016 (Picone, 2016) indicates that around 50% of the people in Flanders still read a printed news article at least once a week. However, for online news (including social media links to articles), an overwhelming 83% of the interrogated sample indicates to read at least one such article a week. This percentage is ever increasing. Typical examples of such online newspapers in Flanders include HLN.be and nieuwsblad.be. It can thus be assumed that, in a world of fast digitalization, this trend will not stop in the next couple of years.

As more and more people make use of these online news services, it is important that these services are of sufficient quality. Quality of news is something that is difficult to measure, as it is to a large extent a subjective issue: many common errors found in news articles are a possible subject of discussion. However, in the past different studies were already conducted to measure the quality of printed news articles (e.g. Maier (2002)). These studies, conducted both in the United States and in Europe, indicate that the number of errors that can be found in a collection of printed news articles is quite high. Maier (2002) concludes that 59% of the investigated local articles contain at least one error. Studies performed on local news data sets in Italy and Switzerland obtained similar results. This illustrates that lack of accuracy in printed news is a border-crossing problem. As such, it is possibly problematic in the Flemish press too.

As media consumption is shifting to online alternatives, a study regarding online news media could be perceived as important as well. Online news significantly differs from printed news in a couple of aspects. Most important to note here is that online news media are part of the 24-hour news cycle (Bucy, Gantz, & Wang, 2007). While printed news media have a typical fixed deadline (e.g. the evening before publication), online news media publish their articles as fast as possible, 24/7. This possibly creates a very high pressure on the editorial offices of such online newspapers. Especially when a breaking news event has happened, it can be assumed that the pressure of publishing all incoming information as fast as possible becomes very high. Possibly, this could be reflected in the quality and reliability of the online news articles that are finally published. As such, a study investigating the reliability of online news media in Flanders is important. To the best of our knowledge, no studies exist that investigate the reliability of online news media in Flanders. Moreover, no studies were found that investigate the influence of the presence of breaking news events on this reliability.

Reliability is a very broad term. It can be summarized as the extent to which people reading online news can trust that what they read is a truthful, unbiased, correctly represented and correctly written article. In the context of this study, three different aspects of the reliability of online news were investigated: accuracy, consistency and relevance. The accuracy of an article is related to how many errors are present in the article. The consistency measures whether information present in different articles is compatible: if two articles contain information that is contradictory, the articles are said to be inconsistent. Finally, the relevance of an article refers to how important the information in the article is to the understanding of the subject it handles.

In section II, an investigation of the accuracy of Flemish online newspapers under the influence of different breaking news events is performed. In section III, a structured representation of online news articles is presented. Moreover, an algorithm is illustrated that exploits this structured representation to find numerical inconsistencies between articles about the same subject. In section IV, a specific aspect of the relevance of articles about a breaking news event, i.e. their freshness, is studied. An automatic analysis method is presented to this end. Finally, section V summarizes the most important conclusions drawn within the study.

II. SUPERVISED ERROR INVESTIGATION

II-A Goal

A first aspect of the reliability of online news media that is studied is their accuracy. Moreover, the influence of the presence of a breaking news event on this accuracy is investigated. To this end, the concept of a period of stress is introduced:

A period of stress is a period of at least four days in which at least 25% of the online news articles is dedicated to the same breaking news event.

A period of stress thus encompasses the four days after a breaking news event happened, in which a large amount of the articles handle this breaking news event. Next to obtaining a general overview of the accuracy of online news media in Flanders, an investigation is also performed in which the accuracy during given periods of stress is compared with the accuracy during periods without the influence of a breaking news event.

II-B Method

II-B.1 Data gathering

Two specific breaking news events that induced a period of stress were selected: the terrorist attacks in Paris that happened on the 13th of November 2015, and the Germanwings plane crash that happened on the 24th of March 2015. More specifically, articles written in the four days after the happening of these two breaking news events were selected from the two most popular online news brands in Flanders: Het Laatste Nieuws and Het Nieuwsblad. Moreover, to be able to compare the accuracy of online news in these periods of stress with the accuracy of online news in non-stress periods, a sample of articles written in the non-stress periods before the happening of the two breaking news events was selected. In total, four different data sets were thus collected for each newspaper. These data sets all consist of articles written about verifiable, factual themes during the specific period. With verifiable, factual articles, articles about for example politics, economy, terror, . . . are meant. The period of stress data sets thus contain, next to articles about the breaking news events, also articles about other factual, verifiable events.

The period of stress data sets of articles written by Het Laatste Nieuws contained 184 and 332 articles for the Germanwings plane crash and the terrorist attacks in Paris respectively. The non-stress period data sets of Het Laatste Nieuws contained 90 articles for both the Germanwings plane crash and the terrorist attacks in Paris. The period of stress data sets of articles written by Het Nieuwsblad contained 136 and 310 articles for the Germanwings plane crash and the terrorist attacks in Paris respectively. The non-stress period data sets of Het Nieuwsblad contained 80 articles for both the Germanwings plane crash and the terrorist attacks in Paris.

II-B.2 Manual annotation

Once all articles were collected, each article was read and scanned for errors. All errors found were written down. Six different categories of errors were distinguished, based on empirical observations during the annotation and existing literature (Maier, 2002): overestimation of numbers, other wrong numbers, factual errors that are not numerical, incorrect personal nouns, spelling mistakes, and incorrectly formed sentences and grammar mistakes.

II-C Results

After annotating all articles, several statistical tests were performed on the different data sets. The most remarkable result of these tests is given in the following.

One of the statistical tests that were performed investigates the fraction of articles containing at least one linguistic error. Errors that are considered to be linguistic are incorrect personal nouns, spelling mistakes and incorrectly formed sentences and grammar mistakes. For the eight different data sets under consideration, the fraction of linguistic error-containing articles is given in Table I. From these numbers, it is immediately clear that the probability of writing a linguistic error-containing article is each time higher for the data sets of articles written during a period of stress than for the data sets of articles written during a non-stress period.

TABLE I
Proportion of articles containing at least one language mistake for each data set.

                                               Het Laatste Nieuws   Het Nieuwsblad
  Stress period Paris attacks                  P = 0.325            P = 0.358
  Non-stress period before Paris attacks       P = 0.211            P = 0.150
  Stress period Germanwings crash              P = 0.310            P = 0.346
  Non-stress period before Germanwings crash   P = 0.256            P = 0.275

To verify whether these differences are statistically significant, a chi-square test of homogeneity was performed. This test verifies whether the differences in the obtained proportions between periods of stress and non-stress periods are statistically significant or not. Thus, by performing these tests, it is verified whether the probability of writing a linguistic error-containing article is higher during the periods of stress after the Germanwings plane crash and the terrorist attacks in Paris, compared to the non-stress periods right before these breaking news events.

The null hypothesis for the test is that the probabilities of writing a linguistic error-containing article during a period of stress and during a non-stress period are equal. All requirements that are necessary to be able to perform a chi-square test of homogeneity were satisfied by all data sets (Chi-square test of homogeneity, 2018).

First, the test was performed for the data sets of articles written by Het Laatste Nieuws both before and after the terrorist attacks in Paris. During the non-stress period before the Paris attacks, 21.1% of the articles contain at least one linguistic error. During the stress period introduced by the Paris attacks, this percentage increases to 32.5%. The test of two proportions used was the chi-square test of homogeneity. The difference between the two independent binomial proportions was statistically significant (p = 0.036 < 0.05). Therefore, we can reject the null hypothesis and accept the alternative hypothesis.

Similarly, the differences in probability were also investigated for the other data sets. The resulting p-values of the different statistical tests are given in Table II.

TABLE II
P-values for the chi-square test of homogeneity performed on the different data sets, annotated only with language mistakes.

                       Het Laatste Nieuws   Het Nieuwsblad
  Paris attacks        p = 0.036            p = 0.000
  Germanwings crash    p = 0.354            p = 0.283
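
To make the test concrete, the following minimal Python sketch (assuming SciPy is available) runs the same two-proportion chi-square test of homogeneity. The counts are reconstructed from the reported proportions for Het Laatste Nieuws around the Paris attacks (about 32.5% of 332 stress-period articles and 21.1% of 90 non-stress articles), so they are approximate rather than the exact annotation counts.

```python
# Sketch of the two-proportion chi-square test of homogeneity.
# Counts are reconstructed from the reported proportions, so approximate.
from scipy.stats import chi2_contingency

with_error_stress, total_stress = 108, 332      # ~0.325 of the stress-period articles
with_error_calm, total_calm = 19, 90            # ~0.211 of the non-stress articles

observed = [
    [with_error_stress, total_stress - with_error_stress],
    [with_error_calm, total_calm - with_error_calm],
]
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

With these reconstructed counts, the uncorrected test returns a p-value of roughly 0.036, in line with the value reported above.
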
II-C.1 Discussion

The percentages of linguistic error-containing articles that were derived from the manual annotations are quite high. They range from 15.0% to 35.8% depending on the specific data set. This means that a significant part of the articles written during both periods of stress and non-stress periods is not error-free. Moreover, in these percentages, factual errors are not even considered. Although no general conclusions for every article and every online newspaper can be drawn, this indicates that the lack of accuracy in online news articles certainly exists in Flanders too.

Secondly, the probabilities obtained during periods of stress were compared to the same probabilities during non-stress periods. The results are statistically significant (i.e. p < 0.05) in the case of the terrorist attacks in Paris, but not in the case of the Germanwings plane crash. This indicates that in the case of the terrorist attacks in Paris, the fraction of linguistic error-containing articles is indeed higher than during the corresponding non-stress period. In the case of the Germanwings plane crash, similar conclusions could not be drawn.

A possible explanation for this is that, although the Germanwings plane crash induced a period of stress, the number of articles published during that period of stress is still significantly lower than the number of articles published during, for example, the period of stress after the terrorist attacks in Paris. This could indicate that the pressure was a lot higher during the latter period of stress. Possibly, different degrees of periods of stress thus exist. In this case, the larger and the more important the breaking news event is, the more it directly influences the accuracy of the news articles published in its period of stress. However, further research is needed to validate these assumptions.

III. AUTOMATIC INCONSISTENCY FINDING

III-A Goal

A second important aspect of the reliability of online news is its consistency. In this section, an automated inconsistency finding algorithm is presented. A data set of articles handling the same subject is provided as input to the algorithm. The goal of the algorithm is to detect any numerical inconsistencies that exist between numerical information present in any of the articles in the data set. Moreover, the presented algorithm is tested on four different data sets containing articles written during a period of stress and handling the breaking news event that triggered the period of stress.

III-B Graph representation

To be able to query text documents in an efficient way, they should in general be transformed into a more structured representation. To this end, the online news articles are transformed into graph representations. These graph representations are then stored in a Neo4j database (Neo4j, 2018). This structured graph representation is then used in a following step to find numerical inconsistencies within a data set of articles. Each sentence of an article is represented as a connected graph, where words occurring next to each other in the sentence are connected in the graph. Different sentences are not connected with each other. Each vertex in the graph also stores the article to which it belongs, the date of the article and its sentence number as a property.

To transform an online news article into a graph, four different steps are performed, partially inspired by the process described by Bronselaer and Pasi (2013):

1. Tokenization. In a first step, the original text document is transformed into a list containing all tokens that are present in the document. The simplest tokenization criterion is used: splitting tokens is done when a white space or punctuation mark is encountered.

2. Part-of-Speech tagging. In a second step, all tokens that were identified in step 1 are tagged with their word type within the sentence (e.g. verbs, personal nouns, adjectives, . . . ). To accomplish this task, use is made of TreeTagger, a well-known Part-of-Speech tagger (Schmid, 2013).

3. Reclassification. Each of the tokens is assigned to one of the following categories: Noun, Number, Adjective, Place, Entity, Edge and Ignore (a minimal sketch of this step is given at the end of this subsection). This classification is based on the word type of the tokens and a couple of lists with general information that are given as input to the algorithm (i.e. lists containing worldwide cities, countries, Dutch numerical expressions, . . . ). The Place class includes all cities, countries, and adjectives derived from these that are present in the input lists. The Entity class includes all words starting with a capital letter that do not belong to the Place class (e.g. metro lines, concert halls, rivers, . . . ). The possibility is given to the user to input other Places that are not present in the default geographical lists. The user can also indicate that Places are synonyms of each other (e.g. "Brussels Airport" and "Zaventem") to further improve the obtained results.

4. Graph generation. The classification of the reclassification step now determines the place of each token in the final graph representation. All words that are in the Ignore category are not stored in the graph representation. This category includes determiners, adverbs, pronouns and words appearing in a stopword list. These are words that are assumed to have no significant contribution to the meaning of a sentence. Nouns, Adjectives and Places are all represented as a vertex in the graph. Each vertex then contains a Word property that has the specific token as a value. The type of the vertex is the class of the token. If multiple subsequent tokens in the original text belong to the same class, they are concatenated. Only one vertex is then created with that class as the vertex type and with the Word property equal to the concatenation of the tokens. Edge tokens are added as a property to the edge connecting two vertices, representing the tokens occurring in the original text before and after the token associated with the edge. Tokens that belong to the Number category are added as a property to the Noun token that is closest to that specific number.

Each Entity token is added to an Entity list that is stored as a property of the numbered Nouns within the sentence. This list thus contains all Entities within the sentence. Furthermore, each numbered noun also contains a Place property. The value of this property is the closest Place token to the numbered noun. If the closest Place token is more than one sentence away, this property has a null value.

An example of a graph representation of a sentence is given in Figure 1.

Fig. 1. Graph representation for the sentence "Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood."
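
The reclassification step is the part of this pipeline that is easiest to misread, so a minimal, self-contained Python sketch of it is given below. The input lists and the simplified tag names ("prep", "adj", "num", . . . ) are hypothetical stand-ins: the actual implementation relies on TreeTagger's Dutch tagset and on much larger geographical, numerical and stopword lists.

```python
# Hypothetical, heavily simplified input lists; the real implementation reads in
# much larger lists of places, Dutch number words and stopwords.
PLACES = {"parijs", "frankrijk", "brussel", "zaventem"}
NUMBER_WORDS = {"honderd", "duizend", "miljoen"}
STOPWORDS = {"de", "het", "een", "die"}

def reclassify(token, pos_tag):
    """Assign a token to one of the categories used during graph generation,
    based on its (simplified) word type and the input lists."""
    word = token.lower()
    if word in STOPWORDS or pos_tag in {"det", "adv", "pron"}:
        return "Ignore"
    if word.isdigit() or word in NUMBER_WORDS or pos_tag == "num":
        return "Number"
    if word in PLACES:
        return "Place"
    if pos_tag == "adj":
        return "Adjective"
    if pos_tag in {"prep", "verb", "conj"}:
        return "Edge"                 # stored on edges, not as vertices
    if token[:1].isupper():           # capitalised but not a known place
        return "Entity"
    return "Noun"

# Tokens of the Figure 1 sentence with hypothetical, simplified PoS tags.
tagged = [("Tijdens", "prep"), ("zware", "adj"), ("terroristische", "adj"),
          ("aanslagen", "noun"), ("in", "prep"), ("Parijs", "noun"),
          ("werden", "verb"), ("130", "num"), ("mensen", "noun"),
          ("gedood", "verb")]
print([(t, reclassify(t, p)) for t, p in tagged])
```

Running this sketch on the tokens of the Figure 1 sentence assigns, for instance, "Parijs" to the Place class, "130" to the Number class and "aanslagen" and "mensen" to the Noun class, which is exactly the information the graph generation step needs.
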

III-C Numerical inconsistency finding

The structured graph representation that was presented in the previous subsection is now used to find numerical inconsistencies between different articles that handle the same subject. This is achieved by performing a couple of Cypher Query Language queries and adopting a couple of decision rules (Intro to Cypher, 2018). The Neo4j database is queried for a couple of nouns that have a Number property associated with them. The numbered nouns which are queried depend on the specific subject of the data set. Example nouns in the case of terrorist attacks include "terroristen", "doden", "slachtoffers", "bommen", "daders", . . . .

The nouns with associated numbers that are compared should be chosen very carefully. Only numbers that represent the same real-world quantity should be compared. For example, numbers of deaths should not be compared with numbers of wounded people. A first trivial criterion is thus that only numbers associated with the same noun or with synonyms of each other will be compared.

However, the context of the noun is also important. For example, the number of deaths at "Le Bataclan" should not be compared with the number of deaths at a different location when searching for inconsistencies in a data set of articles about the terrorist attacks in Paris. Because of this, the words surrounding the nouns within the graph database are also investigated and compared. The similarity of these surrounding words (i.e. the tokens in the graph database that are in the same connected graph) is computed with the help of a variant of the Jaccard similarity, namely the Weighted Jaccard similarity (Jaccard, 1901). The formal definition of the Weighted Jaccard similarity is given in Equation (1):

$$\mathrm{sim}(s_1, s_2, n) = \frac{\sum_{t \in s_1 \cap s_2} 1/\min\big(d_1(t,n),\, d_2(t,n)\big)}{\sum_{t \in s_1 \cup s_2} 1/\min\big(d(t,n,s_1),\, d(t,n,s_2)\big)} \qquad (1)$$

In the above formula, n denotes the common noun that both sentences s_1 and s_2 contain, and d_i(t,n) denotes the distance between token t and the noun n within sentence s_i. Moreover, the function d is defined in Equation (2):

$$d(t,n,s_i) = \begin{cases} d_i(t,n) & \text{if } t \in s_i \\ \infty & \text{otherwise} \end{cases} \qquad (2)$$

This similarity measure is based on the original Jaccard similarity (Jaccard, 1901). However, the formula was adapted such that the closer a token is to the common noun, the higher its influence on the similarity between the two sentences.
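
A direct, self-contained Python translation of Equations (1) and (2) could look as follows. The token distance d_i is taken here to be the absolute difference in token positions, and the common noun n itself is excluded from both token sets to avoid a zero distance; both choices are assumptions made for this sketch rather than details fixed by the text above.

```python
from math import inf

def token_distance(tokens, t, n):
    """Smallest positional distance between token t and the common noun n in a
    tokenized sentence; inf if t (or n) does not occur in the sentence."""
    pos_t = [i for i, tok in enumerate(tokens) if tok == t]
    pos_n = [i for i, tok in enumerate(tokens) if tok == n]
    if not pos_t or not pos_n:
        return inf
    return min(abs(i - j) for i in pos_t for j in pos_n)

def weighted_jaccard(s1, s2, n):
    """Weighted Jaccard similarity of Equation (1): tokens closer to the common
    noun n weigh more.  s1 and s2 are lists of tokens."""
    set1, set2 = set(s1) - {n}, set(s2) - {n}   # assumption: n itself is excluded
    numerator = sum(1.0 / min(token_distance(s1, t, n), token_distance(s2, t, n))
                    for t in set1 & set2)
    denominator = sum(1.0 / min(token_distance(s1, t, n), token_distance(s2, t, n))
                      for t in set1 | set2)
    return numerator / denominator if denominator > 0 else 0.0

# Toy example around the common noun "doden"
s1 = ["bij", "de", "aanslagen", "in", "parijs", "vielen", "130", "doden"]
s2 = ["130", "doden", "bij", "aanslagen", "parijs"]
print(weighted_jaccard(s1, s2, "doden"))
```
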
Within the set of surrounding words of the numbered nouns, the Place tokens are very important, as they offer an indication of the geographical location of the event to which the numbered noun refers. Finally, Entity tokens can also be indicative of the exact event that the numerical information refers to. For example, the presence of "Le Bataclan" in the vicinity of a number indicates that the number probably has something to do with what happened at Le Bataclan.

Based on the above observations, a final rule is developed empirically that decides whether or not two numbers that are not equal could possibly form an inconsistency. This rule consists of the following three criteria, which were obtained empirically, by trial and error of intermediate decision rules on existing data sets (a sketch of the resulting decision rule is given at the end of this subsection):

1. The two nouns of which the numbers are compared are identical or are synonyms.

2. The nouns have the same associated place, or a similar place. With a similar place, derived tokens are meant. For example, "Frankrijk" and "Frans" are assumed to refer to the same place.

3. Either the Weighted Jaccard similarity between the two sentences is equal to or larger than 0.2, or the Jaccard similarity between the two sets of entities belonging to both numbered nouns is equal to or larger than 0.3.

Once the possible inconsistencies are returned by the algorithm, all occurrences of these inconsistencies within the data set can be retrieved by querying the database for all pairs of the exact same nouns with the exact same numbers.
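
The three criteria can be combined into one small decision function. The sketch below assumes that the Weighted Jaccard similarity of the two sentences and the Jaccard similarity of the two entity sets have already been computed, and it only checks exact place equality; handling derived place forms such as "Frankrijk"/"Frans" is left to the caller, so treat the helper names and parameters as illustrative.

```python
def possible_inconsistency(noun1, noun2, place1, place2,
                           weighted_jaccard_sim, entity_jaccard_sim,
                           synonyms=frozenset(), same_place=lambda a, b: a == b):
    """Sketch of the empirical rule for flagging two unequal numbers as a
    possible inconsistency.  `synonyms` holds frozensets of nouns that are
    treated as synonyms; the similarity values are assumed precomputed."""
    # 1. Identical nouns or synonyms.
    if noun1 != noun2 and not any({noun1, noun2} <= group for group in synonyms):
        return False
    # 2. Same (or, in the real rule, derived) place.
    if not same_place(place1, place2):
        return False
    # 3. Sentence context or entity context is sufficiently similar.
    return weighted_jaccard_sim >= 0.2 or entity_jaccard_sim >= 0.3

# Example: "doden" vs its synonym "slachtoffers", both located in "parijs"
print(possible_inconsistency("doden", "slachtoffers", "parijs", "parijs",
                             weighted_jaccard_sim=0.25, entity_jaccard_sim=0.1,
                             synonyms={frozenset({"doden", "slachtoffers"})}))
```
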
III-D Results

The numerical inconsistency finding algorithm is tested on four different data sets of articles written during two periods of stress by two different newspapers. The first period of stress that is investigated is the one initiated by the terrorist attacks in Paris on the 13th of November 2015. The other period of stress is the one induced by the terrorist attacks in Brussels on the 22nd of March 2016. The two online newspapers under investigation are Het Laatste Nieuws and Het Nieuwsblad.

The results are expressed in Table III in terms of the precision of the algorithm (i.e. the number of True Positives divided by the number of Predicted Positives). Each comparison of two numbers that indeed refer to the same real-world number is considered to be a correct comparison. The numbers that are given include all occurrences of a correct comparison within a data set. Moreover, a few synonyms of Places were given as input to the algorithm next to the default Place lists.

TABLE III
Precisions of comparisons in the output of the automatic inconsistency finding algorithm.

                       Brussels attacks     Paris attacks
  Het Laatste Nieuws   25/59 = 0.424        23/26 = 0.886
  Het Nieuwsblad       74/105 = 0.705       220/285 = 0.772

III-E Discussion

As can be seen, the precisions that are achieved on the data sets about the terrorist attacks in Paris are very high. Those that are achieved on the data sets about the terrorist attacks in Brussels are lower. The main reason for this is that the Place "Brussel" is used to refer to the two different attacks, while one happened in Zaventem and one in Brussels. Because "Zaventem" and "Brussel" are used interchangeably, the "same place criterion" is not as effective as in the case of the terrorist attacks in Paris. However, in general, the obtained precisions are fairly good. The globally achieved precision (i.e. over all data sets investigated) is 0.72. This illustrates the usefulness of the developed algorithm.

A last observation that can be made is that the number of inconsistencies found in the data sets of articles written by Het Nieuwsblad is a lot higher than the number of inconsistencies found in those written by Het Laatste Nieuws. Although not every possible inconsistency that is returned is in fact a real-world inconsistency (for example because numbers can change over time), this probably indicates that more inconsistencies are present in the investigated articles written by Het Nieuwsblad than in those written by Het Laatste Nieuws.

IV. FRESHNESS OF ONLINE NEWS ARTICLES DURING PERIODS OF STRESS

IV-A Goal

A last aspect of the reliability of online news media that is investigated is their relevance. As the relevance of an article is highly subjective and dependent on the reader, this is an aspect that is very difficult to measure quantitatively. However, an aspect that can be assumed to be somewhat related to the relevance of an article is the freshness of an article. As the definition of a period of stress indicates that during this period a lot of articles are written about the same breaking news event, one could ask the question whether all articles that are published contain enough new information compared to the information that was already known from previously published articles. The extent to which an article contains new information is called the freshness of that article in the context of this study.

First, an automatic method is presented that quantitatively measures the amount of information in an article that was already published before. Secondly, this method is tested on the same four data sets as those that were used in the consistency investigation.

IV-B Method

The data sets are investigated by analyzing their constituting articles in chronological order. The content of each article is compared to the content of all earlier published articles in the data set. To quantify the amount of "duplicated" information in the article, an article is decomposed into its constituting sentences. Each sentence in the investigated article is then compared with all sentences published in earlier articles. The largest similarity with any of these earlier published sentences is then stored. If this maximal similarity is larger than a certain threshold, the current sentence is assumed to contain already known information. Otherwise, the sentence is assumed to contain enough new information. The amount of already published information in an article is then defined as the fraction of sentences that is assumed to contain already known information.

The similarity between two sentences is computed as follows. For each token from the shortest sentence, the most similar token in the other sentence is sought. The similarity of the two sentences is then defined as the average of these maximal similarities. The similarity of two tokens is computed with the help of a variant of the Longest Common Substring similarity (Islam & Inkpen, 2008). The formula of the token similarity is given in Equation (3):

$$\mathrm{wordSim}(w_1,w_2) = \tfrac{1}{2}\,\mathrm{NMCLCS}_1(w_1,w_2) + \tfrac{1}{2}\,\mathrm{NMCLCS}_n(w_1,w_2) \qquad (3)$$

In the above equation, NMCLCS_1 and NMCLCS_n are given by Equations (4) and (5) respectively:

$$\mathrm{NMCLCS}_1(w_1,w_2) = \frac{\mathrm{length}\big(\mathrm{LCS}_1(w_1,w_2)\big)^2}{\mathrm{length}(w_1)\cdot\mathrm{length}(w_2)} \qquad (4)$$

$$\mathrm{NMCLCS}_n(w_1,w_2) = \frac{\mathrm{length}\big(\mathrm{LCS}_n(w_1,w_2)\big)^2}{\mathrm{length}(w_1)\cdot\mathrm{length}(w_2)} \qquad (5)$$

Here, LCS_1 denotes the Longest Common Substring between two tokens, starting from position 1. LCS_n denotes the Longest Common Substring between two tokens, starting from any position. By using this similarity, tokens that are similar but not identical (such as plurals, verbs versus associated nouns, . . . ) also contribute to the overall similarity of the sentences, contrary to other similarity measures such as the Jaccard similarity.

Based on empirical observations, two sentences are assumed to contain the same information if the LCS-based sentence similarity (thus, the average of the maximum LCS-based token similarities) is at least 0.6. In this way, not only almost identical sentences, but also sentences that contain to a large extent the same information are taken into account. Pairs of sentences for which the similarity is larger than the threshold are called linguistically similar sentences in the context of this study. Although False Positives exist for this method, it is verified empirically that most of the identified sentence pairs indeed contain similar information.
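
The following self-contained Python sketch implements Equations (3)–(5) and the resulting sentence similarity. LCS_1 is interpreted here as the longest common prefix of the two tokens and sentences are assumed to be given as token lists; both are assumptions of this sketch, and the actual implementation may differ in such details.

```python
def lcs_any(w1, w2):
    """Length of the longest common substring starting at any position (LCS_n)."""
    best = 0
    for i in range(len(w1)):
        for j in range(len(w2)):
            k = 0
            while i + k < len(w1) and j + k < len(w2) and w1[i + k] == w2[j + k]:
                k += 1
            best = max(best, k)
    return best

def lcs_prefix(w1, w2):
    """Length of the longest common substring starting at position 1 (LCS_1),
    interpreted here as the common prefix of both tokens."""
    k = 0
    while k < min(len(w1), len(w2)) and w1[k] == w2[k]:
        k += 1
    return k

def nmclcs(length, w1, w2):
    return length ** 2 / (len(w1) * len(w2))        # Equations (4) and (5)

def word_sim(w1, w2):
    """Equation (3): average of the two normalized LCS measures."""
    return 0.5 * nmclcs(lcs_prefix(w1, w2), w1, w2) + 0.5 * nmclcs(lcs_any(w1, w2), w1, w2)

def sentence_sim(s1, s2):
    """Average, over the shorter sentence, of the best token similarity found
    in the other sentence."""
    short, long_ = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    return sum(max(word_sim(w, v) for v in long_) for w in short) / len(short)

THRESHOLD = 0.6   # sentence pairs above this are treated as linguistically similar

s_old = ["130", "doden", "bij", "aanslagen", "in", "parijs"]
s_new = ["130", "mensen", "gedood", "bij", "aanslag", "in", "parijs"]
print(sentence_sim(s_old, s_new), sentence_sim(s_old, s_new) >= THRESHOLD)
```
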
Testing the average fraction of sentences that contain no new informa- and validation of the algorithm on different data sets indicate tion. This average is each time computed over all articles that fairly high obtained precisions ranging from 45% up to 90% are already analyzed. The graphs for the terrorist attacks in Paris depending on the specific use case. Thirdly, the freshness of on- and those in Brussels are given in Figures 2 and 3 respectively. line news articles published during periods of stress was investi- gated. To this end, an automatic investigation method based on the Longest Common Substring similarity was presented. Test- ing and validation on existing period of stress data sets resulted in the conclusion that the lack of freshness is not a major issue for the investigated periods of stress. REFERENCES Bronselaer, A., & Pasi, G. (2013). An approach to graph-based analysis of textual documents. In 8th European Society for Fuzzy Logic and Technology, Proceedings (pp. 634– 641). Milano: Atlantis Press. Bucy, E. P., Gantz, W., & Wang, Z. (2007). Media Technology Fig. 2. Average percentage of linguistically similar sentences in for articles about the terrorist attacks in Paris. and the 24-Hour News Cycle. In Communication tech- nology and social change: Theory and implications. New Jersey: Lawrence Erlbaum Associates. Chi-square test of homogeneity. (2018). Retrieved from https://statistics.laerd.com/premium/ spss/ttp/test-of-two-proportions-in -spss.php. Intro to Cypher. (2018). Retrieved from https://neo4j .com/developer/cypher-query-language/. Islam, A., & Inkpen, D. (2008). Semantic text similarity us- ing corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data Fig. 3. Average percentage of linguistically similar sentences for articles about (TKDD), 2(2), pp. 10. the terrorist attacks in Brussels and Zaventem. Jaccard, P. (1901). Etude de la distribution florale dans une por- IV-C.1 Discussion tion des Alpes et du Jura. Bulletin de la Societe Vaudoise Although the fraction of linguistically similar sentences in- des Sciences Naturelles, 37(142), pp. 547–579. creases slightly with the number of articles published, it can Maier, S. R. (2002). Getting it right? Not in 59 percent of be concluded that duplication of information during a period of stories. Newspaper Research Journal, 23(1), 10–24. stress is not a major issue. At most, 8% of the articles were lin- Neo4J. (2018). Retrieved from https://neo4j.com/. guistically equal to an already published sentence. This is fairly low, considering that more than 250 articles are analyzed near Picone, I. (2016). Digital news report: Belgium. Re- the end of the investigation. Although the obtained fractions are trieved from http://www.digitalnewsreport not perfect, as some identified pairs of sentences in fact do not .org/survey/2016/belgium-2016. contain the same information while other unidentified pairs do, Schmid, H. (2013). Probabilistic Part-Of-Speech tagging using it can be assumed that the fractions do not deviate too much decision trees. In New Methods in Language Processing from the real numbers. As such, it can be concluded that in the (pp. 154–164). London: Routledge. Contents

1 Introduction
  1.1 Accuracy
  1.2 Consistency
  1.3 Relevance

2 News accuracy
  2.1 News consumption in Flanders
  2.2 Errors in news articles
  2.3 Online news and corrections

3 Supervised error investigation
  3.1 Errors during stress periods
  3.2 Data gathering
    3.2.1 Article selection
    3.2.2 Data set properties
  3.3 Research method
  3.4 Results
    3.4.1 Data set motivation
    3.4.2 Categorization of errors in online news
    3.4.3 Distribution of number of errors in online news articles
    3.4.4 Probability of writing an error-containing article
    3.4.5 Probability of writing an error-containing article over time
    3.4.6 Probability of writing an error-containing word
    3.4.7 Analysis of the absolute number of errors in an article
  3.5 Conclusion

4 Automated inconsistency finding in online news articles
  4.1 Graph databases
    4.1.1 Motivation
    4.1.2 Graphs
    4.1.3 Neo4j: data model
    4.1.4 Cypher Query Language
  4.2 Structured textual data
    4.2.1 Tokenization
    4.2.2 Part-of-Speech tagging
    4.2.3 Reclassification
    4.2.4 Graph generation
  4.3 Automated numerical inconsistency finding
    4.3.1 Motivation
    4.3.2 Basic approach
    4.3.3 Synonyms
    4.3.4 Similarity measures
    4.3.5 Places
    4.3.6 Entities
    4.3.7 Comparison criterion
    4.3.8 Evaluation
    4.3.9 Smallest subset removal
  4.4 Conclusion

5 Automatic freshness estimation of online news articles
  5.1 Problem statement
  5.2 General approach
    5.2.1 Jaccard similarity
    5.2.2 Longest Common Subsequence similarity
  5.3 Further work
  5.4 Conclusion

6 Conclusion

References

A Manual error annotations

List of Figures

3.1 Histogram of the number of articles in the data sets of Het Laatste Nieuws and Het Nieuwsblad.
3.2 Histogram of the number of articles for each number of errors, both for Het Laatste Nieuws and Het Nieuwsblad.
3.3 Comparison of the frequencies for each number of errors in an article with the expected frequencies based on a geometrically distributed random variable with mean 0.512.
3.4 Fraction of error-containing articles in the days in the period of stress, for both Het Laatste Nieuws and Het Nieuwsblad.
3.5 Histogram of the number of articles in the stress period data sets for different length categories.

4.1 Example graph with 6 vertices and 7 edges.
4.2 Possible graph representation for the sentence "Alice, die computerwetenschappen studeert, houdt sinds 2014 van Bob, die 22 jaar oud is."
4.3 Graph representation for the sentence "Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood."
4.4 Adapted graph representation for the sentence "Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood."
4.5 Fraction of removed inconsistencies in function of the number of articles removed from the file set of the terrorist attacks in Paris.
4.6 Fraction of removed inconsistencies in function of the number of articles removed from the file set of the terrorist attacks in Brussels and Zaventem.

5.1 Average fraction of copied sentences in an article about the terrorist attacks in Paris, where articles are analyzed in chronological order. Similarity based on a Jaccard similarity with threshold 0.8.
5.2 Average fraction of copied sentences in an article about the terrorist attacks in Brussels and Zaventem, where articles are analyzed in chronological order. Similarity based on a Jaccard similarity with threshold 0.8.
5.3 Average fraction of linguistically similar sentences in an article about the terrorist attacks in Paris, where articles are analyzed in chronological order. Similarity based on an LCS-based similarity with threshold 0.6.
5.4 Average fraction of linguistically similar sentences in an article about the terrorist attacks in Brussels and Zaventem, where articles are analyzed in chronological order. Similarity based on an LCS-based similarity with threshold 0.6.

List of Tables

3.1 Number of articles in data set for each breaking news event and for each considered newspaper.
3.2 Number of articles in data set right before each breaking news event for each considered newspaper.
3.3 For each data set of articles, the mean value of the geometric distribution with which the frequencies are compared, the chi-square value of the test and the p-value with which the null hypothesis is rejected are given.
3.4 Proportion of articles containing at least one mistake (of any category) for each data set.
3.5 P-values for chi-square test of homogeneity performed on the different data sets, annotated with all possible errors.
3.6 Proportion of articles containing at least one language mistake for each data set.
3.7 P-values for chi-square test of homogeneity performed on different data sets, annotated only with language mistakes.
3.8 Proportion of articles containing at least one factual mistake for each data set.
3.9 Proportion of wrong words for each data set.
3.10 P-values for chi-square test of homogeneity performed on different data sets, annotated only with language mistakes.
3.11 U-, z- and p-values of the Mann-Whitney U tests performed for the different data sets, all errors included.

4.1 Precision of comparisons in output automatic inconsistency finding algorithm.
4.2 Precisions of comparisons in output automatic inconsistency finding algorithm.
4.3 Precisions of comparisons in output automatic inconsistency finding algorithm.
4.4 Cardinalities of minimal subsets and relative fractions for each of the different data sets.

List of Abbreviations

CQL Cypher Query Language

LCS Longest Common Substring

NLP Natural Language Processing

PoS Part-of-Speech

SQL Structured Query Language

Chapter 1

Introduction

In this Master dissertation, the reliability of online news media is investigated during periods right after a breaking news event has happened. Reliability is in this work considered to be a very broad concept, including the accuracy of news, the consistency of news and the relevance of news.

This Master dissertation specifically focuses on online news media. The reason for this is that online news brands in Flanders gain importance, but they differ significantly from traditional printed newspapers in a couple of aspects (Picone, 2016). Most importantly, online news is part of a 24-hour news cycle (Bucy, Gantz, & Wang, 2007). Journalists who work for a printed newspaper have a fixed deadline to finish their story, typically the evening before the newspaper is published. In general, the journalist can thus plan his or her work so as to finish it before the deadline. Once the article is published, it is finished and cannot be changed anymore. The introduction of online news, however, significantly changes the life of the journalist. Online newspapers typically publish news articles every hour of the day, every day of the week. Moreover, even if an article is published on the Web, it can still be altered as new information comes in.

This 24-hour news cycle is a concept that has existed longer in the case of TV (Bucy et al., 2007). Research indicates that this 24-hour news cycle has a negative influence on the quality of the actual news (Lewis & Cushion, 2009). Although in this thesis the quality of online, written news is investigated (and thus not television news), this 24-hour news cycle possibly also influences the quality of online news. Moreover, earlier research indicated that this 24-hour news cycle possibly even speeds up under the influence of breaking news events (Lewis & Cushion, 2009). These are large events that happen all around the world, lead to a very broad news coverage and reach a very broad public. Thus, it can be assumed that during the days following a breaking news event, the influence of the 24-hour news cycle on the quality is even higher than during normal periods. Such a period right after a breaking news event happens will be defined as a period of stress in this Master dissertation. As such, the reliability of online news media will be specifically investigated during such periods of stress. Sometimes, the obtained results will be compared with the reliability during "normal", non-stress periods in which there is no influence of any breaking news event.

1.1 Accuracy

A first aspect of reliability of online news that is investigated in this work is the accuracy of online news. In the context of data quality, accuracy refers to the closeness between a data object modelled in a database and the real-world object it aims to represent (Batini & Scannapieco, 2006). This concept can be redefined in the context of the accuracy of news: an accurate news article is one that does not contain mistakes. Stated differently, an accurate news article is thus one for which all information that it contains represents the real-world facts perfectly. Research has been done on the accuracy of news articles, including (Porlezza, Maier, & Russ-Mohl, 2012; Maier, 2005, 2002). Research about the relationship between the accuracy of news and the credibility of the associated news media also exists. This research indicates that accuracy and the absence of errors directly influence the credibility of (online) news media (Gaziano & McGrath, 1986). However, to the best of our knowledge, little to no research exists that investigates the accuracy of online news in Flanders. Moreover, no research was found that specifically investigates the accuracy of online news articles during periods of stress.

Based on the above observations, the research part about the accuracy of online news media during periods of stress is structured around the following research questions:

Q1: “What is the extent of the problem of accuracy of news articles in Flemish online news media?” Q2: “How does the presence of a breaking news event influence the accuracy of Flemish online news articles?”

This research is performed by annotating four large sets of articles written by two online newspapers during two different periods of stress. The results of these annotations are then compared with the results of the annotations of four large sets of articles written right before the accompanying periods of stress.

1.2 Consistency

A second aspect of reliability of online news that is distinguished in this Master dissertation is its consistency. Typically, several articles are written about the same subject over a certain time period. In the context of data quality, consistency refers to the fact that data that is altered in a database should always be valid according to a couple of predefined rules (Batini & Scannapieco, 2006). This concept can again be redefined to obtain a useful definition for the use case of online news: two articles are said to be consistent if all information that is present in both articles is consistent, i.e. both articles do not contain any factual contradictions.

In the context of this Master dissertation, specific focus goes to numerical inconsistencies. The reason for this is that numerical information is the most deterministic information typically present in a news article, and thus the kind of information that can most feasibly be classified as either consistent or inconsistent. To this end, the goal of this research part is to develop an algorithm that automatically finds numerical inconsistencies between articles handling the same subject. As such, this part is structured around the following research questions:

Q3: "How can online news articles be structured to be processed automatically for finding numerical inconsistencies between them?" Q4: "How many possible inconsistencies can be detected between Flemish online news articles with the help of an automatic detection algorithm?"

1.3 Relevance

A third aspect that is considered to be part of the reliability of online news is relevance (Metzger, Flanagin, Eyal, Lemus, & McCann, 2003). People want the articles they read to be relevant, in the sense that the articles contain new, interesting information that is really important to the understanding of the subject. As such, the article should contain enough information that is new to as many readers as possible and that is considered interesting by as many people as possible.

Determining what is or is not relevant for a newspaper to publish is very difficult, if not impossible, for a researcher. Each newspaper has its own mission and has freedom of speech. Moreover, relevance is subjective: one person could find a certain article very interesting, while another one thinks it is not relevant at all. Because of this, the focus in this Master dissertation goes to a very specific aspect of relevance. No matter what information is present in the article, the information should be new. Stated differently, if an article does not contain any new information compared to earlier published articles, it can be considered to be not relevant. This aspect of relevance is called freshness in the context of this thesis. The freshness of Flemish online news articles is thus further studied, based on the following research questions:

Q5: “How can the freshness of an online news article be determined automatically?” Q6: “How fresh are Flemish online news articles published during a period of stress?”

Chapter 2

News accuracy

In this chapter, a broad, general overview of earlier research on the accuracy of news articles is given. First, the current state of news consumption in Flanders is sketched. Secondly, the existing research regarding the prevalence of errors and the types of errors in news articles is summarized. Finally, these types of errors are considered in the context of online news articles, which differ from printed press in a number of ways.

2.1 News consumption in Flanders

News plays a crucial role in modern society. Knowing what is happening around the world is important to many people for several reasons. It could affect their personal lives, for example when a family member dies in a natural disaster. Moreover, news allows people to form their own opinion about events and to discuss these events with colleagues, friends, . . . .

While news consumption is ever increasing, the way people consume this news is drastically changing. This is also the case in Belgium and in Flanders (Picone, 2016). In 2016, all printed newspapers sold fewer copies than the year before. Even Het Laatste Nieuws (Het Laatste Nieuws, 2018), the newspaper with the largest number of copies in Flanders, saw its number of sold copies decline by 1.9% in 2016. Across the whole Flemish population, it still reaches 28% of the citizens on a weekly basis. However, only 11% of the people in Flanders declare that Het Laatste Nieuws is their primary source of information, even though it is the largest Flemish newspaper. For other Flemish newspapers, these percentages are even lower.

Furthermore, it is illustrated that in total, the printed media (including magazines) still reach 50% of the population on a weekly basis (Picone, 2016). The study also gives insight into the popularity of traditional media, such as television, radio or newspapers. This popularity is expressed in terms of the percentage of the population that the brand reaches weekly, and the percentage for which the brand is the main source of information. The most popular classic media brands in Flanders are VRT Nieuws, VTM Nieuws and Het Laatste Nieuws. The latter is the most popular printed newspaper in Flanders. The decrease in print circulation is accompanied by a significant increase in online news consumption. HLN.be, which is the online news version of Het Laatste Nieuws, reaches 60% of the population on a weekly basis. This is significantly more than the printed alternative. In general, online news combined with news posted and shared on social media networks such as Facebook and Twitter reaches 83% of the population in Flanders on a weekly basis. The most popular digital online news brands in Flanders are Het Laatste Nieuws online, Het Nieuwsblad online and VTM Nieuws online.

2.2 Errors in news articles

In an ideal world, news articles would not contain any mistakes. In the real world, however, several studies have already indicated that published news articles very frequently contain errors, both linguistic errors and factual errors (Porlezza et al., 2012; Maier, 2005). These errors directly influence the confidence people have in news media (Metzger et al., 2003). In 2015, 22% of the Flemish population had little to very little confidence in the Flemish media (Bral, 2016). Although this percentage has decreased over the last couple of years, still 1 person out of 5 does not trust the Flemish media. It is clear that the number of errors present in the published articles is crucial to this end (Karlsson, Clerwall, & Nord, 2017).

The accuracy of news consists of many different aspects. Most of these are not measurable in a consistent, objective way. To clarify this statement, consider the case of a popular computer science magazine that publishes an article about the basic concepts of machine learning. Most people will find this article very valuable, as they do not know that much about machine learning. However, a professor in computer science at Ghent University could think the article is worthless, as many oversimplifications are present in the article. The first group of readers will probably say the article is of relatively high "quality", while the professor may think the article is of very low quality. This illustrates the fact that news quality, and even news correctness, is not easily measured, and is to a large extent dependent on the interpretation of the people that read the news.

However, in the past, efforts were made by researchers to estimate the extent of the news accuracy problem. The errors that occur in a typical news article are very diverse. Errors can be both objective and subjective, and can concern the content of the article or the language that is used. Next to the extent of the general problem, it is thus also important to know which errors occur commonly in an article. Finally, an important aspect to take into account is the opinion of the readers themselves, by looking at which errors they consider to be more severe than others.

The first studies that were conducted stem from the era before online news existed. However, the results can still be considered useful. An interesting study was conducted by Maier (2005). In this study, local news articles in fourteen different newspapers in the United States of America were investigated. Maier sent a survey to the primary sources of a collection of local news articles. After collecting the responses, he found that 61% of the articles contained at least one error, with a mean of 1.36 errors per article.

Moreover, Maier found that a large part of these errors are in fact objective errors. The three most frequently encountered objective errors according to the survey were misquotation, inaccurate headlines, and wrong numbers and misspelling. Subjective errors were also found often. Essential information missing, quotes distorted, and story sensationalized were the subjective errors that were encountered most. Furthermore, the surveyed news sources were also asked to indicate the severity of each error on a Likert-like scale. It was concluded that subjective errors were considered to be the most severe ones. However, the categories misquotation, inaccurate headlines and misspelling of name or address were also considered more severe than other errors.

The study cited above only investigated local newspapers in the USA. However, a similar study was performed in Italy and Switzerland (Porlezza et al., 2012). The results were surprisingly similar to those of the study performed in the USA: the categories of errors that were noted the most were exactly the same. Moreover, the types of errors that were considered to be the most severe were also quite similar. While no similar study investigating Flemish newspapers is available, these earlier studies abroad show that the lack of news accuracy is a problem that can be assumed to be present in Flanders too.

Additional evidence for this can be found by looking at the trust Flemish citizens have in the media. 59% of the people trust the news they read, see or hear (Picone, 2016). While this is fairly high in comparison with other countries and regions (such as Wallonia), a considerable part of the population still has little confidence in what the media are telling them.

2.3 Online news and corrections

As was already mentioned in the introduction, online news editorial offices do not have fixed deadlines, which leads to a 24-hour news cycle. Because of this, it takes quite a while until a published news story is finalized, certainly when the article reports breaking news: the first version of the article that is published is typically changed several times, to add content or to change its content. From the moment the journalist gathers information, the information is added to the article. Regarding this fluidity, Saltzis investigated online news articles written by the six most important news websites in the United Kingdom (Saltzis, 2012). 44 breaking news events were selected and followed throughout a period of time.

Saltzis concluded that this updating of news lasted for a couple of hours on average. Over 60% of the updates were performed within the first two hours after publication. On average, a breaking news article was updated 5.7 times. Moreover, most updates were conducted to add newly available information. However, 26% of the updates were needed to correct errors and up to 17% of the updates were needed to remove information. This illustrates that errors are published in breaking news articles, but also that journalists try to correct mistakes in the hours following the initial publication.

In this regard, Karlsson et al. (2017) offer very useful insights. The study, conducted in Sweden, elaborates on the influence of errors on the trust people have in the media. Karlsson et al. stated some propositions about confidence in the media and errors in the media, and the respondents had to indicate to what degree they agreed with each proposition. A first conclusion that was drawn is that people do not accept errors in articles, even if these are caused by a journalist's eagerness to publish news very rapidly. This issue is even more problematic in the context of online news because of the accelerated news cycle online (Saltzis, 2012).

Furthermore, Karlsson et al. state that online news also offers possibilities regarding news accuracy. Because of their fluidity, online news articles can easily be corrected. Both when large and small errors are corrected, the reader appreciates the correction, in the sense that more people tolerate the mistake afterwards. However, readers only accept such a correction if they are informed about it: 63% of the respondents do not find it tolerable to remove erroneous content without informing the reader.

Chapter 3

Supervised error investigation

In this chapter, a first important aspect of the reliability of Flemish online news during periods of stress is studied: its accuracy. Stated differently, in this chapter the prevalence of errors in online news articles is studied. While subsequent chapters will focus on automated techniques to estimate other aspects of online news reliability, the research in this chapter is performed manually. Thus, the articles under investigation are all read and annotated with the errors that they contain.

The chapter starts with the definition of a period of stress, which will be an important concept throughout the chapter. In the following section, the data gathering process is explained and some insights into the obtained articles are given. Furthermore, an overview of the research method is given and the exact meaning of an error is explained. Finally, the results of several statistical tests that give insight into the relationship between certain periods of stress and the prevalence of errors in online news articles are given, and the consequential conclusions are stated.

3.1 Errors during stress periods

As already explained in the previous chapter, online news publication significantly differs from the older printed news publication. In fact, online news publication can be compared to the 24-hour news channels on television in the United States of America. News is being reported from the moment the first information comes in, and new information is added to the news article (or to the news broadcast) directly. As already mentioned in Chapter 2, this is called the fluidity of online news media.

As a consequence, it is suspected that online news articles concerning breaking news events contain mistakes very frequently. In this and the following sections, the goal is to check whether this assumption is correct. To this end, a definition of a period of stress must first be given:

Definition 3.1.1 A period of stress is a period of at least four days in which at least 25% of the online news articles are dedicated to the same breaking news event.

The definition given above thus targets periods of at least four days right after some event happened that was extensively discussed in the media. During such a period of stress, the most popular news brands dedicate a significantly large proportion of their work (at least 25% each day) to that specific event.
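As an illustration of Definition 3.1.1, the check below sketches how a candidate period could be tested against the definition. It assumes a hypothetical list of daily counts (articles about the event, total articles published); the function and data layout are illustrative and were not part of the actual selection procedure.

# Minimal sketch of the check behind Definition 3.1.1.
# daily_counts is a hypothetical list of (articles_about_event, total_articles),
# one tuple per day of the candidate period.
def is_period_of_stress(daily_counts, min_days=4, min_fraction=0.25):
    return (len(daily_counts) >= min_days and
            all(event / total >= min_fraction for event, total in daily_counts))

# Example: a four-day period in which 30%, 40%, 28% and 26% of the articles
# cover the same breaking news event qualifies as a period of stress.
print(is_period_of_stress([(30, 100), (40, 100), (28, 100), (26, 100)]))  # True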

First of all, it was chosen to only look at the two most popular online news brands in Flanders, i.e. Het Laatste Nieuws and Het Nieuwsblad. The reason for this is twofold. First, as was already mentioned in the introduction, a large part of the Flemish population makes use of these online news sites, be it on a daily or a weekly basis. As such, the main news source of a significant part of the Flemish population is analyzed if these two online newspapers are investigated. Secondly, Het Laatste Nieuws and Het Nieuwsblad are similar news sites: they both aim at delivering news to a broad public. As such, it is possible to define a common threshold of articles dedicated to the breaking news event to categorize a period as a period of stress (i.e. 25%). Different kinds of newspapers, such as financial newspapers, aim at a different public and handle different core themes. It can therefore be expected that the proportions of articles dedicated to a specific breaking news event for these news sites deviate significantly from those for the popular news sites. Because of this, only representative online newspapers of the latter category were taken into account.

3.2 Data gathering

3.2.1 Article selection

Now that the concept of a period of stress is clear, the main research idea of this chapter can be explained. The main goal is to check whether the presence of a breaking news event during a certain period influences the number of errors that are found in online news articles during that period. As already mentioned above, the two online newspapers that are considered are Het Laatste Nieuws and Het Nieuwsblad. Articles that are gathered are split up in two large groups: articles that were published during a period of stress and articles that were published during a non-stress period. The latter category thus contains articles that were not written during a period of stress. Two breaking news events that happened in the last couple of years, more specifically in 2015, were considered:

• The Germanwings plane crash, caused by a pilot committing suicide on the 24th of March 2015.
• The terrorist attacks in Paris, France on the 13th of November 2015.

Both events share the same property, namely that they caused a period of stress for the online news media in the days that followed. For each of these events, at least 25% of the articles that were published by Het Laatste Nieuws on their website deal with the specific event. For the terrorist attacks in Paris, this percentage is even much higher, namely around 80%. Although no specific numbers can be given for Het Nieuwsblad (as no complete archive of articles is available), the number of articles retrieved from the website of this newspaper indicates that these events were an important topic for this newspaper as well. For the purpose of this Master dissertation, a large sample of articles that were written during the period of stress was selected. These articles were scraped from the two most popular online newspapers and stored in an XML file for later analysis.
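The scraping pipeline and the exact XML schema are not prescribed here; the following sketch merely illustrates how scraped articles could be stored in an XML file with Python's standard library. All element and attribute names are hypothetical.

# Illustrative sketch of storing scraped articles in an XML file.
# The element and attribute names are hypothetical, not the actual schema.
import xml.etree.ElementTree as ET

articles = [
    {"newspaper": "Het Laatste Nieuws", "date": "2015-11-14",
     "title": "Example title", "text": "Example article text."},
]

root = ET.Element("articles")
for article in articles:
    node = ET.SubElement(root, "article",
                         newspaper=article["newspaper"], date=article["date"])
    ET.SubElement(node, "title").text = article["title"]
    ET.SubElement(node, "text").text = article["text"]

ET.ElementTree(root).write("articles.xml", encoding="utf-8", xml_declaration=True)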

Only articles that are newsworthy and verifiable to a certain extent are selected, although the newsworthiness of an article is subjective and difficult to measure. By a newsworthy and verifiable article, an article covering topics such as terrorist attacks, economics, political news, robberies, deaths, ... is meant. The reason for this is that it should be possible to check most of the facts that are stated in the article. In contrast, articles handling rumours or opinions and articles about popular culture are much more difficult to verify factually, and are sometimes not factual at all. For this reason, these kinds of articles are not included in the sample that is investigated. A second criterion that is included in the selection procedure is the length of the articles. Only articles that contain at least 5 sentences are considered. Otherwise, a lot of articles that only consist of a video accompanied by a couple of sentences would also have to be taken into account, while the focus in this Master dissertation lies on the analysis of textual information.
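The length criterion can be approximated automatically. The sketch below uses a very rough sentence splitter based on sentence-final punctuation; it is only an illustration of the criterion, not the procedure that was actually used during the manual selection.

# Rough sketch of the length criterion (at least 5 sentences).
# The punctuation-based splitter is a simplification and may split
# abbreviations or numbers incorrectly.
import re

def has_min_sentences(text, minimum=5):
    sentences = [s for s in re.split(r"[.!?]+\s+", text.strip()) if s]
    return len(sentences) >= minimum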

The data sets of selected articles that were written in the period of stress after the crash of the Germanwings plane consist of 184 and 136 verifiable articles from Het Laatste Nieuws and Het Nieuwsblad respectively. From the period of stress that was induced by the terrorist attacks in Paris, 332 articles and 310 articles were retrieved from Het Laatste Nieuws and Het Nieuwsblad respectively.

A large part of the articles in the data sets deals with the specific subject that induced the period of stress in which they were written. Thus, the absolute numbers of articles concerning these events are rather high. However, other factual, verifiable articles that were written during these periods are also present in the different data sets. The articles in the sample data sets were chosen randomly, only taking into account the aforementioned two criteria. The number of articles in the sample data sets that specifically handle the breaking news events is chosen such that their ratio to the total number of articles in the sample data set is more or less equal to the global ratio of articles dedicated to the breaking news events during the periods of stress (i.e. the ratio for all published articles). An overview of the number of usable articles for each of the selected events can be found in Table 3.1.

Table 3.1: Number of articles in data set for each breaking news event and for each considered newspaper.

                     Germanwings crash   Paris attacks
Het Laatste Nieuws   184                 332
Het Nieuwsblad       136                 310

Another aspect that this table illustrates is that the number of articles written during the period of stress after the terrorist attacks in Paris is enormous. This indicates that although both the Germanwings plane crash and the terrorist attacks in Paris are categorized as inducing a period of stress, it could be that the terrorist attacks still belong to this category with a higher membership degree. It will thus be useful to see whether this difference in the number of published articles is also reflected in the results of the performed analyses.

To check the assumption that stress periods are indeed more prone to faults being introduced in online news articles, a data set to compare the stress period data sets with should be available. To this end, extra sets of online news articles were scraped from both Het Laatste Nieuws and Het Nieuwsblad, this time consisting of articles that were published in the days and weeks before the respective periods of stress started. Thus, for each newspaper and period of stress mentioned above, a non-stress period data set was gathered consisting of articles written before the respective breaking news events happened. Most articles were written right before the breaking news event (i.e. in the two weeks before the breaking news event). Some articles were written a bit earlier (e.g. one or two months before the breaking news event). The reason that some of these earlier articles were selected is that they handle the same subject as an article that was written right before the breaking news event and that is thus also included in the non-stress period data set. In this way, it is easier to find inconsistencies and thus errors in the non-stress period data set. However, it was made sure that none of the selected articles was influenced by any kind of breaking news event. Finally, it should also be noted that the articles selected for the non-stress periods are also verifiable, in order to be able to fully analyze their contents.

The cardinality of each of the final non-stress period data sets that were gathered is given in Table 3.2.

3.2.2 Data set properties

In this subsection, a few aspects of the data sets are visualized. In this way, the specific properties of each data set are understood better.

Table 3.2: Number of articles in data set right before each breaking news event for each considered newspaper.

                     Before Germanwings crash   Before Paris attacks
Het Laatste Nieuws   90                          90
Het Nieuwsblad       80                          80

First of all, a histogram is plotted with the number of articles written each day in the period of stress by both Het Laatste Nieuws and Het Nieuwsblad. These histograms are plotted for the terrorist attacks in Paris and for the Germanwings plane crash in Figures 3.1a and 3.1b respectively.

Figure 3.1: Histogram of the number of articles in the data sets of Het Laatste Nieuws and Het Nieuwsblad. (a) Terrorist attacks in Paris; (b) Germanwings plane crash.

These figures show that the number of articles in the data sets is more or less equally spread over the four days that constitute the respective periods of stress. Thus, the sample sets of articles selected in this Master dissertation are indeed representative of the whole set of articles published in the periods of stress. Secondly, it is also noted that the data sets of articles written by Het Laatste Nieuws are typically larger than the data sets of Het Nieuwsblad, as is also the case for the complete sets of published articles.

3.3 Research method

Once the stress and non-stress period data sets for both the Paris attacks and the Germanwings crash are gathered, all articles are read. Each article is screened manually to find mistakes. As already explained in the previous chapter, subjective errors also exist. However, these are not taken into account in this Master dissertation. The goal is to find all errors in the data set about which no discussion is possible. To this end, a few rules are used while reading the articles:

• All articles are written in Dutch. The spelling of words is checked with the help of the online dictionary of Taalunie (Woordenlijst, 2015). Taalunie is a policy organization regarding the use of Dutch in both Belgium and The Netherlands, and can thus be assumed to be reliable.

• The grammar and construction of sentences are only categorized as erroneous if this is absolutely clear from the context itself. An example could be a sentence that contains no verb (e.g. “Het mooi weer vandaag.” instead of “Het is mooi weer vandaag.”) or a sentence that contains the same word multiple times (e.g. “Het is is mooi weer vandaag.”). The same is true for the placing of punctuation marks. If there is any doubt about the correctness of a certain sentence, the sentence is assumed to be correct.

• The spelling of all proper nouns is checked. Multiple sources, such as Wikipedia, personal accounts, previous news articles, ... are checked to this end. Only when all checked sources spell the proper noun differently from the article is the spelling assumed to be incorrect. If different spellings are used by different sources, the spelling in the news article is assumed to be correct, as the correct spelling cannot be deduced. Even if only one online source writes the proper noun in a different way than all other sources do, the spelling of the proper noun in the article is not doubted.

• All facts that are stated and for which no direct news source is available in the article (i.e. not part of a quotation) are checked. When not enough information can be found or no trustworthy sources such as Wikipedia are available, the information in the article is assumed to be correct. Only when all sources found indicate that the information in the article is not true is the information in the article considered to be erroneous.

• Any information within a quote or within an opinion is considered to be correct. Thus, a fact can only be false if there is no direct source in the article or if it is not clear where the information was retrieved from.

3.4 Results

3.4.1 Data set motivation

In this section, different statistical tests are performed to investigate the accuracy of online news articles during both periods of stress and non-stress periods. These tests are performed on the different data sets of factual, newsworthy articles that were gathered previously. These data sets were gathered around two different events, i.e. the terrorist attacks in Paris and the Germanwings plane crash. As only two dedicated periods of stress (and accompanying non-stress periods) were selected, no general conclusions about a randomly chosen period of stress and its corresponding non-stress period can be drawn.

The main reason why only two events were selected is that periods of stress (i.e. at least 25% of the articles handling the same subject for at least four days) are quite rare (i.e. at most a few per year). This means that between two subsequent periods of stress, typically months (possibly more than a year) have passed. It can be assumed that in the meantime, major changes have been made in the way online news articles are produced (e.g. different journalists, better quality control, focus on different types of information, ...). As such, aggregating online news articles written in very distinct time periods into the same data set could possibly influence the results of the different statistical tests.

Moreover, even for two breaking news events that induce a period of stress, the fraction of articles published about these events during such a period can be quite different. For example, the fraction of articles written about the terrorist attacks in Paris during the corresponding period of stress is much higher than the same fraction for the Germanwings plane crash. As already mentioned before, a period of stress is possibly not a fixed concept, but rather something a period can be to a certain degree. More evidence for this view will be given later in this chapter. As such, it is argued that it is not a good idea to create data sets with a mix of articles from different periods of stress (or non-stress periods).

Instead, in this section, the two breaking news events for which data were gathered (i.e. the terrorist attacks in Paris and the Germanwings plane crash) are investigated thoroughly. The data sets consist of factual, verifiable online news articles written during the period of interest. These data sets form a valid sample of the typical set of articles that is produced during a certain period. For example, with the help of statistical tests that are performed on the data sets with articles before and after the terrorist attacks in Paris, specific conclusions can be drawn about this specific period of stress and non-stress period. As these data sets do not form a representative sample for the general case of a random period of stress and its accompanying non-stress period, no general conclusions can be drawn. However, it is believed that the study of both the plane crash and the terrorist attacks can be indicative for the general case.

3.4.2 Categorization of errors in online news

Based on the annotations that were performed on the eight different data sets, six error categories are distinguished. These categories cover all errors that were found.

1. Overestimation of numbers (number of deaths, amount of money, . . . )

2. Incorrect proper nouns (persons, cities, countries, albums, ...)

3. Spelling mistakes according to Dutch spelling rules

4. Incorrectly formed sentences, grammar errors, . . .

5. Wrong numbers that do not belong to category 1 (e.g. wrong time or date)

6. Factual errors that are not numbers (e.g. “Tim Cook is the CEO of Microsoft”)

Categories 1, 5 and 6 contain factual errors. While categories 1 and 5 contain numerical errors, category 6 contains all factual errors that are not numerical. Categories 2, 3 and 4 handle linguistic mistakes, consisting of both spelling mistakes and grammar mistakes. A detailed overview of all errors that were found is given in Appendix A.
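For reference, the six categories could be represented as follows when recording the annotations; this encoding is purely illustrative and was not part of the manual annotation procedure itself.

# Illustrative encoding of the six error categories used in the annotations.
from enum import Enum

class ErrorCategory(Enum):
    OVERESTIMATED_NUMBER = 1   # overestimation of numbers (deaths, money, ...)
    INCORRECT_PROPER_NOUN = 2  # persons, cities, countries, albums, ...
    SPELLING_MISTAKE = 3       # according to Dutch spelling rules
    GRAMMAR_ERROR = 4          # incorrectly formed sentences, grammar, ...
    OTHER_WRONG_NUMBER = 5     # e.g. wrong time or date
    NON_NUMERICAL_FACT = 6     # factual errors that are not numbers

FACTUAL = {ErrorCategory.OVERESTIMATED_NUMBER, ErrorCategory.OTHER_WRONG_NUMBER,
           ErrorCategory.NON_NUMERICAL_FACT}
LINGUISTIC = {ErrorCategory.INCORRECT_PROPER_NOUN, ErrorCategory.SPELLING_MISTAKE,
              ErrorCategory.GRAMMAR_ERROR}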

3.4.3 Distribution of number of errors in online news articles

In the previous sections, it was explained how each of the articles in the different data sets was annotated manually. Because of this, it is known how many errors each article contains. These numbers can be used to plot histograms for each of the eight data sets. These histograms show the number of articles having a certain number of errors. Figures 3.2a and 3.2b illustrate this for the sets of articles written by Het Laatste Nieuws and Het Nieuwsblad during the stress period after the terrorist attacks in Paris and the Germanwings plane crash respectively. From these plots, it is immediately clear that the number of errors per article in the data set follows a kind of “discrete exponential distribution”. The distribution that typically models this kind of behaviour is the geometric distribution, which is given by Equation (3.1). The distribution is characterized by the parameter q, which indicates the probability that an article in the data set contains no errors.

Figure 3.2: Histogram of the number of articles for each number of errors, both for Het Laatste Nieuws and Het Nieuwsblad. (a) After the terrorist attacks in Paris; (b) after the Germanwings plane crash.

P(X = k) = (1 − q)^k · q,   k = 0, 1, ..., ∞   (3.1)

The formula for the expected value of a geometrically distributed variable with parameter q is given in Equation (3.2).

E[X] = (1 − q) / q   (3.2)
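Combining Equations (3.1) and (3.2), the parameter q can be estimated from the sample mean as q = 1 / (1 + E[X]). The small sketch below illustrates this; the mean of 0.512 used in the example is the value reported later for Het Laatste Nieuws during the Paris stress period and indeed yields q ≈ 0.661.

# Estimating q from the sample mean via Equation (3.2): E[X] = (1 - q) / q.
def geometric_q_from_mean(mean_errors_per_article):
    return 1.0 / (1.0 + mean_errors_per_article)

# Geometric probability mass function from Equation (3.1).
def geometric_pmf(k, q):
    return (1.0 - q) ** k * q

q = geometric_q_from_mean(0.512)      # mean for Het Laatste Nieuws, Paris stress period
print(round(q, 3))                    # 0.661
print(round(geometric_pmf(0, q), 3))  # probability of an error-free article (= q)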

To verify whether the number of errors per article in a specific data set is geometrically distributed with a certain parameter q, a “chi-square goodness-of-fit test” is conducted (Chi-square goodness-of-fit test, 2018). For example, a first chi-square goodness-of-fit test can be used to test if the number of errors per article published in the period of stress that was initiated by the terrorist attacks in Paris is geometrically distributed. This test could then be performed based on the sample data sets of the periods of stress after the terrorist attacks in Paris, once for Het Laatste Nieuws and once for Het Nieuwsblad.

The chi-square goodness-of-fit test is a statistical test that compares the observed frequencies in the specific data set with the frequencies that would be expected if the sample data followed a geometric distribution with given parameter q. The null hypothesis of this statistical test is that the numbers of errors in the articles are drawn from a population with a geometric distribution. Performing the test results in a p-value. If this p-value is lower than 0.05, the null hypothesis can be rejected and thus the number of errors in an article cannot be assumed to be geometrically distributed. If the p-value is equal to or larger than 0.05 however, the statistical test fails to reject the null hypothesis and thus the number of errors in an article can be assumed to be geometrically distributed with a certain parameter q.

The exact value of the parameter q of the geometric distribution with which the frequencies of the different data sets are compared is computed from the average number of errors per article in each of the different test data sets. By using this average value in combination with Equation (3.2), a plausible value of q can be determined. To be able to perform such a test on a specific data set, the data set should meet three basic requirements (Chi-square goodness-of-fit test, 2018):

1. The data set should consist of one categorical variable, i.e. a variable that can take on one out of a limited amount of possible values, assigning each sample to a specific category/value.

Each of the data sets clearly consists of one categorical variable: each article (sample) is assigned to a specific number, i.e. the number of errors that it contains. It is the assumption about the distribution of this variable that is tested statistically.

2. Independence of observations should hold, which means that no relationship exists between the different categories of the variable.

This is clearly also the case in this specific context, as each article can only contain a single number of errors (e.g. it is not possible that an article contains both one error and two errors).

3. There must be an expected frequency of at least 5 in each category of the categorical variable.

In the context of a geometrically distributed variable however, the expected frequencies decrease very quickly. Thus, the number of articles that is expected to contain a large number of errors is low. Because of this, for each chi-square goodness-of-fit test that will be performed, the last category of the variable will consist of all articles with a number of errors equal to or higher than a given number (and thus not only equal to that number). More formally, if the expected frequencies of the categories “X = k”, “X = k+1” and “X = k+2” are all lower than 5 but the sum of their expected frequencies is larger than 5, then one single category “X >= k” is created instead. As such, it can be assured that each category has an expected frequency of at least 5.
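The procedure described above (estimating q from the sample mean, lumping the tail categories until every expected frequency is at least 5, and comparing observed with expected frequencies) could be implemented along the following lines. This is only a sketch: it assumes scipy is available, error_counts is a hypothetical list with the number of errors found in each article, and ddof=1 accounts for the estimated parameter q, so that a test with categories 0, 1, 2 and “X >= 3” has two degrees of freedom, as in the results reported below.

# Sketch of the chi-square goodness-of-fit test against a geometric
# distribution, with the tail lumped so every expected frequency is >= 5.
from collections import Counter
from scipy.stats import chisquare

def geometric_gof_test(error_counts, min_expected=5.0):
    n = len(error_counts)
    mean = sum(error_counts) / n
    q = 1.0 / (1.0 + mean)                 # from E[X] = (1 - q) / q

    # Choose K such that the lumped tail category "X >= K" still has an
    # expected frequency of at least min_expected (namely n * (1 - q)^K).
    K = max(max(error_counts), 1)
    while K > 1 and n * (1.0 - q) ** K < min_expected:
        K -= 1

    counts = Counter(error_counts)
    observed = [counts[k] for k in range(K)]
    observed.append(sum(c for k, c in counts.items() if k >= K))
    expected = [n * (1.0 - q) ** k * q for k in range(K)]
    expected.append(n * (1.0 - q) ** K)    # observed and expected both sum to n

    # ddof=1 accounts for the estimated parameter q: df = (K + 1) - 1 - 1.
    return chisquare(observed, expected, ddof=1)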

As all three requirements are met for all gathered data sets, the different chi-square goodness-of-fit tests can be performed. Firstly, the test is performed on the two data sets of verifiable articles written in the period of stress after the terrorist attacks in Paris. In this way, the goal is to prove statistically that the number of errors in an article written in this specific period of stress is geometrically distributed, both for Het Laatste Nieuws and Het Nieuwsblad.

Similar tests are then performed for the data sets consisting of verifiable articles written during the period of stress after the Germanwings plane crash. These tests then verify whether the number of errors in an article written during this specific period of stress is geometrically distributed, both for Het Laatste Nieuws and Het Nieuwsblad.

The same tests can then also be performed on the data sets of articles written during the non-stress periods, to verify whether the number of errors in an article written during these periods is geometrically distributed, both for Het Laatste Nieuws and Het Nieuwsblad. In total, eight chi-square goodness-of-fit tests are thus performed.

Next, the results of the chi-square goodness-of-fit test performed on the stress-period data set of articles written by Het Laatste Nieuws after the terrorist attacks in Paris are discussed briefly. After computing the average number of errors per article in this data set, the value for the parameter q of the geometric distribution was computed to be 0.661. The minimum expected frequency was exactly 5. This expected frequency was reached by merging all categories for which the number of errors in an article is equal to or higher than 3.

The chi-square goodness-of-fit test indicated that the distribution of the number of faults in the articles in the data set was not statistically significantly different from the proportions given by a geometric distribution with the same mean (χ2(2) = 1.551, p = 0.671). We thus fail to reject the null hypothesis and cannot accept the alternative hypothesis, namely that the number of faults in an article is not geometrically distributed.

Thus, it can be concluded that the distribution of the number of faults in an article written by Het Laatste Nieuws during the period of stress after the terrorist attacks in Paris cannot be distinguished from a geometric distribution with parameter q = 0.661. As an illustration, a comparative histogram of both the frequencies and the expected frequencies of each category is given in Figure 3.3.

Similar plots can be generated for each of the other tested data sets. The results of the performed tests are summarized in Table 3.3. All of these tests require the minimum expected frequency of each evaluated category to be at least 5. To this end, each time the last categories were brought together into one category of the type X >= k. For each data set, the mean of the geometric distribution with which the data set was compared is given. Furthermore, the chi-square value and the p-value of the chi-square goodness-of-fit test are given. As can be seen, for all statistical tests that were performed, p > 0.05. Thus, for each data set no statistically significant difference exists between a geometric distribution with the given parameter q and the distribution of the number of errors per article for that data set.

Figure 3.3: Comparison of the frequencies for each number of errors in an article with the expected frequencies based on a geometrically distributed random variable with mean 0.512.

3.4.4 Probability of writing an error-containing article

As can be seen from Table 3.3, the expected value of the number of errors in an article is clearly higher during the periods of stress than during the non-stress periods. Table 3.4 gives an overview of the probability of writing an article with at least one error in it. In these numbers, all categories of errors that were identified previously are included. First of all, it is clear that these probabilities are fairly high. For example, in the data sets related to the period of stress after the Paris attacks, 34.6% of the articles written by Het Laatste Nieuws and even 40.0% of the articles written by Het Nieuwsblad contain at least one error. Moreover, it can easily be seen that for all data sets, the articles written during a stress period are more likely to contain at least one mistake (of any category) than the articles written during the corresponding non-stress period. The goal is now to verify whether the differences between these data sets of articles can be extrapolated to the whole set of articles written during these specific stress periods and non-stress periods. To this end, a test of two proportions, more specifically a chi-square test for homogeneity, was used (Chi-square test of homogeneity, 2018).

Table 3.3: For each data set of articles, the mean value of the geometric distribution with which the frequencies are compared, the chi-square value of the test and the p-value of the test are given.

                                                      Het Laatste Nieuws   Het Nieuwsblad
Stress period after terrorist attacks in Paris        E = 0.512            E = 0.635
                                                      χ2(2) = 1.551        χ2(2) = 3.450
                                                      p = 0.671            p = 0.327
Non-stress period before terrorist attacks in Paris   E = 0.333            E = 0.250
                                                      χ2(2) = 0.359        χ2(2) = 0.078
                                                      p = 0.836            p = 0.780
Stress period after Germanwings plane crash           E = 0.533            E = 0.618
                                                      χ2(2) = 1.541        χ2(2) = 0.857
                                                      p = 0.673            p = 0.836
Non-stress period before Germanwings plane crash      E = 0.478            E = 0.413
                                                      χ2(2) = 0.891        χ2(2) = 0.023
                                                      p = 0.640            p = 0.989

Table 3.4: Proportion of articles containing at least one mistake (of any category) for each data set.

                                              Het Laatste Nieuws   Het Nieuwsblad
Stress period Paris attacks                   P = 0.346            P = 0.400
Non-stress period before Paris attacks        P = 0.267            P = 0.188
Stress period Germanwings crash               P = 0.359            P = 0.390
Non-stress period before Germanwings crash    P = 0.289            P = 0.300

With this test, it is possible to check whether the differences between the proportions of error-containing articles in the stress period data sets and the non-stress period data sets are statistically significant. In this way, it can be examined if for example the probability of writing an error-containing article in the period of stress that is a consequence of the terrorist attacks in Paris is statistically significantly higher than the corresponding probability during the non-stress period right before the terrorist attacks took place.

The null hypothesis for the chi-square test of homogeneity is that the probabilities are equal for both data sets. Formally, this leads to the hypothesis in Equation (3.3).

H0 : p_non-stress = p_stress   (3.3)

The data sets that are used to perform the statistical tests are the eight different data sets that were gathered in the beginning of this chapter. They each consist of a sample of verifiable, newsworthy articles that were written during either a period of stress after the terrorist attacks in Paris, a period of stress after the Germanwings plane crash or during a non-stress period.

Performing the chi-square test of homogeneity results in a p-value. If this p-value is equal to or larger than 0.05, we fail to reject the null hypothesis. This means that the probability of writing an error-containing article in a specific period of stress is not statistically significantly different from the same probability in its corresponding non-stress period. If the p-value is lower than 0.05 however, the null hypothesis is rejected. This means that there is a statistically significant difference between the two probabilities in the two specific periods.

To be able to perform this test however, a couple of basic requirements should be met (Chi-square test of homogeneity, 2018):

1. The data set should consist of one independent variable and one dependent variable that are both measured at the dichotomous level. The independent variable is the variable indicating whether an article belongs to a period of stress or not. The dependent variable is then the variable indicating whether the article contains at least one error or not. The dichotomous variables in this context are both boolean, i.e. the variables can only take on the values 0 or 1.

2. Independence of observations should hold, which means that no relationship exists between the different categories of the variable.

This is clearly also the case in this specific context, as the only pairs of data sets that are compared are pairs that consist of one period of stress data set and the accompanying data set containing articles written in the non-stress period before that period of stress. As such, the intersection of the data sets that are compared is always empty.

3. The chi-square test of homogeneity requires the two data sets to be sampled following one out of two different requirements. One of the requirements is that each article should have a common characteristic with all other articles in that data set.

This requirement is also met in this context. The common characteristic for each data set is the period in which the articles it contains are written: either during a period of stress or during a non-stress period.

Table 3.5: P-values for chi-square test of homogeneity performed on the different data sets, annotated with all possible errors.

                     Het Laatste Nieuws   Het Nieuwsblad
Paris attacks        p = 0.154            p = 0.000
Germanwings crash    p = 0.250            p = 0.184

4. Each data set should contain at least 5 observations. As hundreds of articles were obtained per data set, this is not a problem at all.

As such, all basic requirements needed to perform the tests are met. In the following, each time the probability of writing an error-containing article in a data set of articles written during a period of stress is compared against the same probability during the non-stress period right before the breaking news event happened. In this way, conclusions about these specific periods are drawn.

As an example, the (stress period and non-stress period) data sets of the Paris attacks published by Het Laatste Nieuws are taken. During the non-stress period before the Paris attacks, 26.7% of the articles contained at least one mistake. During the stress period introduced by the Paris attacks, the percentage of error-containing articles increases to 34.6%. The test of two proportions used was the chi-square test of homogeneity. The difference between the two independent binomial proportions was not statistically significant (p = 0.154 > .05). Therefore, we fail to reject the null hypothesis and cannot accept the alternative hypothesis. As such, it can be concluded that for Het Laatste Nieuws there is no statistically significant difference between the probability of writing an error-containing article during the period of stress induced by the terrorist attacks in Paris and the same probability during the non-stress period right before the terrorist attacks.
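The example above can be approximated with a standard 2x2 chi-square test. The counts in the sketch below are reconstructed from the reported sample sizes and proportions (about 24 of the 90 non-stress articles and about 115 of the 332 stress-period articles contain an error), so they are an approximation rather than the original computation; with these counts the uncorrected Pearson chi-square yields a p-value of roughly 0.15, in line with the reported 0.154.

# Sketch of the chi-square test of homogeneity (test of two proportions)
# for Het Laatste Nieuws around the Paris attacks. The counts are
# reconstructed from the reported proportions, not taken from the raw data.
from scipy.stats import chi2_contingency

#          with error(s)  without errors
table = [[115, 332 - 115],   # stress period (~34.6% of 332 articles)
         [24,   90 - 24]]    # non-stress period (~26.7% of 90 articles)

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(p, 2))  # approximately 0.15: not significant at the 5% level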

The same test of two proportions is used to compare the probabilities for the other non-stress/stress data set pairs. The results are summarized in Table 3.5. From this table, it can be seen that in the case of Het Nieuwsblad reporting before and after the terrorist attacks in Paris, the null hypothesis is rejected. This means that in this case there is indeed a statistically significant difference between the probability of writing an error-containing article during the period of stress after the terrorist attacks in Paris and the same probability during the prior non-stress period. The other three tests fail to reject the null hypothesis however, and for these periods it cannot be concluded that there is a difference between these probabilities.

Probability of writing a linguistic error-containing article

In the tests that were performed previously, all types of errors were included in the probabilities. However, the analysis can also be split up based on the type of error. One possibility is to look separately at spelling and language mistakes (mistakes belonging to categories 2, 3 and 4 as defined above) and at factual errors (belonging to categories 1, 5 or 6). The probabilities for language mistakes are given in Table 3.6.

This table already clearly illustrates that it is far more common to find a language mistake in an article published during a period of stress than in an article published during a non-stress period. To find out if these differences in proportions are statistically significant, again a chi-square test of homogeneity is performed. However, this time only the linguistic errors are taken into account. All requirements needed to perform the statistical test are still met. The null hypothesis for each of the tests is that the proportions of the number of articles with at least one linguistic error in it during the non-stress period and the stress period, are equal. The resulting p-values are mentioned in Table 3.7.

It is observed that for the data sets covering the non-stress period and stress period of the Paris attacks, the test is statistically significant (p < 0.05). This means that in the case of the terrorist attacks in Paris, it is proven statistically that the probability of writing a linguistic error-containing article is higher during the period of stress afterwards than during the prior non-stress period, both for Het Laatste Nieuws and Het Nieuwsblad. The differences are however not statistically significant for the data sets related to the Germanwings plane crash, although there is a clear difference between the observed proportions for these data sets too. A possible explanation for this is that the length of the articles was not taken into account: each article is classified as being error-containing or not, independent of its length.

Table 3.6: Proportion of articles containing at least one language mistake for each data set.

                                              Het Laatste Nieuws   Het Nieuwsblad
Stress period Paris attacks                   P = 0.325            P = 0.358
Non-stress period before Paris attacks        P = 0.211            P = 0.150
Stress period Germanwings crash               P = 0.310            P = 0.346
Non-stress period before Germanwings crash    P = 0.256            P = 0.275

Table 3.7: P-values for the chi-square test of homogeneity performed on the different data sets, annotated only with language mistakes.

                     Het Laatste Nieuws   Het Nieuwsblad
Paris attacks        p = 0.036            p = 0.000
Germanwings crash    p = 0.354            p = 0.283

Probability of writing a factual error-containing article

In Table 3.8, the proportions are also given for the factual errors. Thus, in this table, the number given for each data set indicates the probability that an article contains at least one factual error. As can be deduced from the table, for all data sets except the ones related to the terrorist attacks in Paris written by Het Laatste Nieuws, there is a small increase in probability during the periods of stress. However, it was observed during the annotation of the articles that verifying the correctness of article content is a very difficult task to perform. In Section 3.3, the different rules that were taken into account were stated. These clearly indicate that a fact stated in an article can only be considered wrong after various considerations. This is in contrast to the language mistake probabilities in Table 3.6, as the spelling and grammar of a language follow clear and well-defined rules in most cases. Such mistakes are thus far easier to detect.

Moreover, the fact that it is so difficult to verify the correctness of article content also indicates that the probabilities of writing a factual error-containing article, given in Table 3.8, are probably an underestimation of the real probabilities. Because of this, it was chosen not to perform a chi-square test of homogeneity for this use case.

Table 3.8: Proportion of articles containing at least one factual mistake for each data set.

                                              Het Laatste Nieuws   Het Nieuwsblad
Stress period Paris attacks                   P = 0.042            P = 0.081
Non-stress period before Paris attacks        P = 0.056            P = 0.038
Stress period Germanwings crash               P = 0.071            P = 0.051
Non-stress period before Germanwings crash    P = 0.033            P = 0.038

3.4.5 Probability of writing an error-containing article over time

Figures 3.4a and 3.4b plot the proportions of error-containing articles over time for the period of stress data sets of the Paris attacks and the Germanwings crash respectively. As can be seen from these figures, no clear decreasing trend in probability is visible. A possible reason for this is that the stream of information about the breaking news event typically does not stop the day after the breaking news event happens. As lots of new information comes in, spread over the period of stress, these information streams also influence the number of articles that are published and thus the pressure on the editorial office of an online newspaper.

Figure 3.4: Fraction of error-containing articles in the days of the period of stress, for both Het Laatste Nieuws and Het Nieuwsblad. (a) After the terrorist attacks in Paris; (b) after the Germanwings plane crash.

A possible example of this can be seen in Figure 3.4b. When the news of the crash was just coming in (24th of March 2015), the proportion of articles with at least one mistake was very high. However, in the following 24 to 48 hours, almost no new information about the crash was released. Because of this, the stress level decreased and the proportions on the 25th of March and part of the 26th of March are lower. It was only during the 26th of March that information regarding the copilot (who crashed the plane on purpose) was made available. The moment this information became available, the stress level increased again, which is reflected in the proportions of error-containing articles: these proportions stagnate or even increase a little because of the newly created pressure on the editorial offices. This behaviour again illustrates the influence of a typical period of stress and its characteristics on the accuracy of online news.

3.4.6 Probability of writing an error-containing word

A major limitation of the tests performed above is that the length of the articles is not taken into account. Because of this, an article containing 100 words and one error is penalized just as heavily as an article containing 500 words and two errors.

It could thus be useful to have a look at the length distributions of the articles in the different data sets. These distributions are plotted in Figures 3.5a and 3.5b for the data sets related to the periods of stress after the terrorist attacks in Paris and the Germanwings plane crash, both for Het Laatste Nieuws and Het Nieuwsblad.

Figure 3.5: Histogram of the number of articles in the stress period data sets for different length categories. (a) Terrorist attacks in Paris; (b) Germanwings plane crash.

One thing that can be observed from these different figures is that the length of an online news article varies a lot. Although no figures are given for the non-stress periods, the same conclusions can be drawn regarding these data sets. This indicates that the approach of the previous sections, namely considering an article as the atomic unit of news, is probably too simple for a thorough analysis. A new approach should be tried where the length of the article is taken into account.

A possible solution, which is carried out in this subsection, is to look at the proportions of error-containing words instead of the proportions of error-containing articles. More specifically, the number of words in each of the different data sets is counted. Next to this, it is assumed that each error is related to one specific word. This is a reasonable assumption: missing words, wrong numbers, spelling mistakes, ... all lead to one specific word being wrong. Dividing the number of wrong words by the total number of words in the data set then gives the proportion of wrong words in each of the data sets. These proportions are given in Table 3.9.

Table 3.9: Proportion of wrong words for each data set.

                                              Het Laatste Nieuws   Het Nieuwsblad
Stress period Paris attacks                   P = 0.00242          P = 0.00279
Non-stress period before Paris attacks        P = 0.00164          P = 0.00127
Stress period Germanwings crash               P = 0.00259          P = 0.00238
Non-stress period before Germanwings crash    P = 0.00199          P = 0.00207

This table illustrates the fact that the proportions of wrong words are clearly higher in the period of stress data sets than in the non-stress period data sets. To verify if these differences in proportion are statistically significant, again four chi-square tests of homogeneity are performed. All requirements that were listed in the previous section in the context of the analysis of the probability of writing an error-containing article are still valid in the case of the analysis of the probabilities of writing an error-containing word. As such, the chi-square test of homogeneity can be performed without any limitations.

Table 3.10: P-values for the chi-square test of homogeneity performed on the different data sets, based on the proportions of wrong words.

                     Het Laatste Nieuws   Het Nieuwsblad
Paris attacks        p = 0.049            p = 0.001
Germanwings crash    p = 0.142            p = 0.496

As an example, the (stress period and non-stress period) data sets related to the Paris attacks published by Het Laatste Nieuws are taken. During the non-stress period before the Paris attacks, 0.164% of the words were wrong. During the stress period introduced by the Paris attacks, the percentage of wrong words increases to 0.242%. The test of two proportions used was the chi-square test of homogeneity. The difference between the two independent binomial proportions was statistically significant (p = 0.049 < .05). Therefore, we can reject the null hypothesis and accept the alternative hypothesis.

This type of test was performed for all different data sets. The resulting p-values are summarized in Table 3.10. From this table, it can be seen that in the case of the Paris attacks, the differences are statistically significant. The probability of writing a wrong word is thus higher during this stress period than during the accompanying non-stress period, both for Het Laatste Nieuws and Het Nieuwsblad. In the case of the Germanwings crash, the results are not statistically significant.

The same conclusion can be drawn as for the investigation of the probability of writing a linguistic error-containing article: the differences between the period of stress and non-stress period probabilities are statistically significant in the case of the terrorist attacks in Paris, but not in the case of the Germanwings plane crash. A possible explanation for this is that although both the Paris attacks and the Germanwings crash induced a period of stress, there is still a major difference between the number of articles published in the associated periods of stress. As such, it could be that the definition of a period of stress is not precise enough. In this Master dissertation, a time span is simply categorized as being a period of stress or not. Based on the results of this subsection however, it can be assumed that this binary concept could be improved by replacing it with a degree to which a certain period is a stress period. In this way, a clear distinction could be made between the Paris attacks and the Germanwings plane crash, which in turn could possibly explain the obtained results. However, this study was beyond the scope of this Master dissertation.

3.4.7 Analysis of the absolute number of errors in an article

Next to looking at the fraction of articles that contain at least one error, it is also important to look at the actual number of errors that is present in the different data sets. A typical way in statistics to compare two groups is to use the Mann-Whitney U test (Mann-Whitney U test, 2018). This test compares two different data sets by comparing their medians. In this context, the median of the number of errors in an article in a period of stress will thus be compared with that in its corresponding non-stress period. Before applying the test however, a couple of basic requirements should be met (Mann-Whitney U test, 2018):

1. The data set that is given to the statistical test should contain a dependent variable that is measured at the continuous or ordinal level. In this context, this ordinal variable is the number of errors in each of the articles.

2. The data set should also contain an independent variable that can only take on two categorical values (i.e. a dichotomous variable). In this context, this variable is the boolean variable that indicates whether the article belongs to the period of stress (i.e. value 1) or to the non-stress period (i.e. value 0).

3. Independence of observations should hold, just as in the previous statistical tests. As all articles belong to either a period of stress or a non-stress period, no dependencies between the two categories are possible.

4. The way the Mann-Whitney U test is used in the context of this Master dissertation is to compare the medians of two different groups. To be able to interpret the results this way, an extra requirement should hold: the distributions of the scores for both groups of the independent variable should have the same shape. Specifically for this use case, the distribution of the number of errors in an article in a period of stress should have a similar shape as the distribution of the number of errors in an article in a non-stress period. As was already statistically verified, the number of errors in each data set follows a geometric distribution. This means that it can indeed be verified visually that each of the period of stress/non-stress period data set pairs satisfies this requirement.

The null hypothesis of the Mann-Whitney U test is the following: the distributions of the two groups are equal. As a result, the statistical test outputs a p-value. If this p-value is lower than 0.05, the null hypothesis is rejected. This means that there is a statistically significant difference between the similarly shaped distributions of both data sets. If the p-value is equal to or higher than 0.05 however, we fail to reject the null hypothesis. This means that there is no statistically significant difference between the distributions of both data sets.

As an example, a Mann-Whitney U test was run to determine if there was a statistically significant difference in the median number of faults per article written by Het Laatste Nieuws between the non-stress period and the corresponding stress period of the Paris attacks. The distributions of the number of faults were similar for both groups, as assessed by visual inspection. The median number of faults per article for the period of stress (0.00) and the non-stress period (1.00) was not statistically significantly different, U = 13625, z = -1.549, p = 0.121.
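For reference, the same comparison could be carried out with scipy as sketched below; errors_stress and errors_non_stress would be the lists with the number of errors per article in the two data sets (the variable names and the tiny example data are hypothetical). Note that scipy reports the U statistic and the p-value, but not the z-value quoted above.

# Sketch of the Mann-Whitney U test comparing the number of errors per
# article between a period of stress and its non-stress counterpart.
from scipy.stats import mannwhitneyu

errors_stress = [0, 1, 0, 2, 0, 0, 1]        # hypothetical annotation results
errors_non_stress = [0, 0, 1, 0, 0, 0, 0]    # hypothetical annotation results

u_statistic, p_value = mannwhitneyu(errors_stress, errors_non_stress,
                                    alternative="two-sided")
print(u_statistic, p_value)  # reject the null hypothesis only if p_value < 0.05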

This kind of test was repeated for all available data sets. The corresponding U, z and p-values for the different tests are given in Table 3.11. The outcome is similar to the one obtained when investigating the difference in proportions of error-containing articles (summarized in Table 3.5). Only for the articles written by Het Nieuwsblad before and after the Paris attacks is the difference between the medians statistically significant. Only for this case can it thus be concluded that the median of the number of errors in an article is higher during the period of stress than during the corresponding non-stress period.

Table 3.11: U-, z- and p-values of the Mann-Whitney U tests performed for the different data sets, all errors included.

                     Het Laatste Nieuws   Het Nieuwsblad
Paris attacks        U = 13625            U = 9631
                     z = -1.549           z = -3.632
                     p = 0.121            p = 0.000
Germanwings crash    U = 7767             U = 4891
                     z = -0.998           z = -1.552
                     p = 0.318            p = 0.121

3.5 Conclusion

In this chapter, a large number of articles was manually annotated with six different categories of errors. The articles were selected in four specific periods: just before and just after the terrorist attacks in Paris, and just before and just after the Germanwings plane crash. The two events have in common that they introduced a period of stress, meaning that in the four days after the events, a large number of articles was dedicated to them. In the days before the events, there was no period of stress, as no major event happened.

First of all, the error annotations made clear that a lack of accuracy is certainly a problem in Flemish online news articles too. Depending on the investigated data set, the percentage of error-containing articles varied between 18.8% and 40.0%. Based on the obtained annotations, several statistical tests were performed. These verified the distributions of the number of errors in an online news article for each data set, compared the probabilities of writing an error-containing article and word during periods of stress and during non-stress periods, and compared the medians of the numbers of errors per article between both kinds of periods. Generally speaking, the results of the statistical tests were clearer in the case of the terrorist attacks in Paris. The probability of writing a linguistic error-containing article was shown to be higher during the period of stress induced by the terrorist attacks in Paris compared to the non-stress period right before it. The same conclusion was drawn regarding the probability of writing an error-containing word during these periods. However, this could not be concluded in the case of the Germanwings plane crash. A possible explanation for this observation was given in the form of a degree of being a period of stress.

An important general remark regarding the results stated above is that they are only valid for the specific events that were thoroughly investigated. As the sample data sets are not representative of a random period of stress and its accompanying non-stress period, no generalization to the global case is possible. The reasons for this approach were given in the beginning of this chapter. However, it can safely be assumed that the results presented in this chapter are indicative of a general period of stress/non-stress period pair.

Finally, it should be kept in mind that the presented results were obtained by analyzing the articles as they were found during the fall of 2017, almost three years after their publication. As already explained in the previous chapter, online news articles are frequently updated in the hours following their original publication. It can thus be assumed that not all errors that were once present in the investigated articles were taken into consideration in this Master dissertation. In fact, the percentages could still be higher than those reported in this work.

Chapter 4

Automated inconsistency finding in online news articles

In this chapter a second aspect of online news articles is investigated, namely their consistency. In contrast to Chapter 3, the research in this chapter focuses on automated ways of verifying the consistency of online news. More specifically, an algorithm is developed to find numerical inconsistencies in articles about the same subject.

First, an overview will be given of a few basic concepts regarding graphs and graph databases. These concepts will then be used in a second section to develop a structured representation of online news articles in a graph database. This structured representation is then used in the third section to develop an automatic inconsistency detection algorithm. Finally, after evaluating this algorithm on a couple of data sets containing articles about the terrorist attacks in Paris and those in Brussels, a minimal subset of articles from the respective data sets is identified. This minimal subset has the property that removing its articles from the global set of articles removes all inconsistencies that the algorithm finds within the data sets.

It should be noted that all example sentences that will be used throughout the chapter are written in Dutch. The reason for this is of course that the articles under consideration are written in Dutch too.

4.1 Graph databases

4.1.1 Motivation

For many decades, relational database management systems have been the primary database management systems in use (Codd, 1970). Relational databases have a few very clear advantages which make them attractive for a wide variety of applications. The data model consists of tables, which in turn consist of records. These tables represent the mathematical notion of a relation. Moreover, relationships between these tables and records are expressed through the definition of primary and foreign keys. Next to the widely applicable data model, the standardized data definition and manipulation language SQL strengthened the popularity of relational database systems (Date, 1997). These two factors led to a few very successful relational database systems, such as MySQL (MySQL, 2018) and PostgreSQL (PostgreSQL, 2018), that are still in use and still represent a large part of the databases installed globally.

As described above, relational databases store data in a very structured way. Each record belongs to a predefined table. Moreover, this table consists of a couple of predefined attributes (columns) that should be present for each record residing in that table. Today, however, data is not always that structured. Sensors produce irregular and non-structured time series, the amount of videos and images uploaded to the internet is ever increasing, and social media such as Twitter and Facebook produce all kinds of unstructured information.

One specific example of data that exhibits almost no structure is textual data. Textual documents arise from different sources, such as digital books, online news sites and blogs. Take the example of online news articles. It is very difficult or even impossible to put forward a relational data model that can represent each of the different news articles. It is practically impossible to represent the content of a news article with the help of a couple of columns in a relational database. Representing and storing this unstructured data is thus a real challenge. This challenge is often generalized as the Variety dimension of the Big Data era. This dimension was originally mentioned in IBM's 4 V's of Big Data: Volume, Variety, Velocity and Veracity (Infographic: The Four V's of Big Data, 2018). However, it would be very useful to be able to represent textual data in a more structured way. In the very specific context of this Master dissertation, this could allow users to query news articles in a more structured way.

The need for new, innovative database models thus arose due to the explosion of unstructured, incomplete data that is created at very high speed all over the world and that the typical relational data model is not capable of representing. Thus, a different data model that offers more flexibility is needed to store (for example) this textual information in an adequate way.

Many alternative database models were therefore developed. One of them is the graph data model, which relies on the mathematical foundations of graph theory, of which the basics needed for the rest of this work are given in the next subsection. According to the DB-Engines ranking, the commercial implementations of graph databases are gaining importance each year (DB-Engines Ranking of Graph DBMS, 2018). Furthermore, this ranking also indicates that of all available commercial graph databases, Neo4j is currently the most important and best-known solution (Neo4J, 2018). This is also the commercial database system that will be considered further on in this thesis.

4.1.2 Graphs

A graph is a mathematical model that is used to model entities and relationships between these entities. A graph G = (V, E) consists of both a vertex set V and an edge set E. Each edge e ∈ E is defined by a pair of vertices. The edge e can then be represented as e = (u, v), where u ∈ V and v ∈ V. The vertices u and v are then said to be neighbours of each other. In the context of this thesis, the graphs are assumed to be directed graphs. This means that each edge has a starting vertex and an ending vertex. For example, for the edge (u, v) the vertex u is the vertex where the edge (or arc) starts and vertex v is the vertex where the edge (or arc) ends. This implies that the edges (u, v) and (v, u) are not the same edge.

An example graph is shown in Figure 4.1. The graph is defined by the following vertex set V and edge set E.

V = {A, B, C, D, E, F}
E = {(A, B), (B, C), (A, D), (C, E), (D, C), (F, B), (F, E)}

An alternating sequence of vertices and edges, v1, e1, v2, ..., vn-1, en-1, vn, starting and ending at a vertex and in which each edge is adjacent to its two endpoints, is called a walk. Moreover, each edge in the sequence should start from the vertex right before it in the sequence and end in the vertex right after it in the sequence.

A second concept that is defined is a trail. A trail is a walk in which no repeated edges occur. Based on these two definitions, the important concept of a path can be introduced. A path is a trail in which all vertices are also distinct. An example path that starts at vertex A and ends at vertex E in the graph given in Figure 4.1 is A, (A, D), D, (D, C), C, (C, E), E. In contrast, the sequence A, (A, B), B, (F, B), F, (F, E), E is not a path. The reason for this is that the edge (F, B) is traversed in the wrong direction.

Figure 4.1: Example graph with 6 vertices and 7 edges.

If for each pair of vertices in the graph there exists a path between these two vertices, the graph is said to be a connected graph. If for at least one pair of vertices there exists no path between them, the graph is said to be a disconnected graph. The graph in Figure 4.1 is clearly a connected graph.

A graph can be extended with a weighting scheme. This scheme assigns a real number, i.e. a weight, to each edge in the graph. This weight typically represents a cost. For example, these weights could indicate the geographical distance between two locations. In the example graph of Figure 4.1, no weights are indicated however. In such a case, it is assumed that all edges have a uniform weight, for example a weight of 1.

Based on the graph definition and the weights of the different edges, a shortest path between two vertices can be defined. A shortest path between two vertices u and v is a path between u and v for which the sum of the weights of the edges belonging to the path is the lowest possible sum needed to connect u and v. Many algorithms exist in the literature to compute the shortest path between specific vertices or between all pairs of vertices in the graph. Well-known algorithms are Dijkstra's algorithm and the A* algorithm. Later in this chapter, the concept of a shortest path will return when describing the algorithm for automatic inconsistency detection. For a more elaborate overview of existing shortest path algorithms, the reader is referred to (Delling, Sanders, Schultes, & Wagner, 2009). As an example, two paths are present between vertex A and vertex E in Figure 4.1:

Path 1: A, (A, D), D, (D, C), C, (C, E), E.
Path 2: A, (A, B), B, (B, C), C, (C, E), E.

Both paths have a total cost of 3, because they both consist of three edges of weight 1. As no other paths exist from vertex A to vertex E, the two paths above are both shortest paths between the two nodes.
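To make the shortest path concept concrete, the following minimal sketch (not part of the tooling of this thesis) computes a shortest path between two vertices of the example graph of Figure 4.1 with a plain breadth-first search, which suffices here because all edges have a uniform weight of 1.

from collections import deque

edges = [("A", "B"), ("B", "C"), ("A", "D"), ("C", "E"),
         ("D", "C"), ("F", "B"), ("F", "E")]

# Build an adjacency list for the directed example graph.
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, []).append(v)

def shortest_path(start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path("A", "E"))  # e.g. ['A', 'B', 'C', 'E'], a path of cost 3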

4.1.3 Neo4j: data model

Neo4j is currently the most widely used and best-known graph database management system. In this subsection, the general data model of a graph database, together with the specifics of the Neo4j implementation, is given.

The graph data model is based on the mathematical concept of a graph, as explained in the previous section. Thus, the graph data model consists of vertices and edges. The vertices represent entities of the real world. The edges represent relationships between these different entities. These edges form the main difference between the graph data model and data models of other typical database systems. They make graph databases especially suited to model any kind of relation between different entities. Finally, vertices can be further described by adding properties. These are key-value fields that give additional information about the real-world entity.

A particularly important aspect in the context of this thesis is that graph databases are especially well-suited to model textual data. This is due to the highly connected nature of text. A word in a sentence typically stands in a relation to other words occurring close to it in the sentence. As an example, the following sentence is given:

Sentence 1: “Alice houdt van Bob.”

In this sentence, there is a clear relationship between Alice and Bob. Namely, Alice loves (“houdt van”) Bob. Alice and Bob are the entities in this sentence. They could thus be represented as a vertex in the database. As they are both persons, they could be stored as entities with label “Persoon”. The label of the edge that connects both entities is then “houdt van”. Moreover, if more information would be present about Alice or Bob, this information could also be stored in the database in a structured way. Consider the following sentence:

Sentence 2: “Alice, die computerwetenschappen studeert, houdt sinds 2014 van Bob, die 22 jaar oud is.”

If this sentence were stored in a graph database, a property "studies" could be added to the vertex representing Alice, with the value "computerwetenschappen". Similarly, a property "leeftijd" could be added to the vertex representing Bob, with the value "22". Moreover, properties can also be added to the edges. These properties then indicate specific information about the relationship that the edge models. For example, a property "sinds" could be added to the relationship "houdt van", of which the value would be "2014".

Within a Neo4j database, different entity types and different relationship types can be stored. These are, as already mentioned, indicated with the help of a label. This label can for example be useful to limit the search range of the entities that are queried. For example, consider the following sentence:

Sentence 3: “Alice eet spinazie.”

Spinach is also an entity and will thus also be stored in the database as a vertex. However, if a user now wants to query the database to find the people that study computer science, it is not useful to also search for "the studies of spinach". By adding the label "Voedsel" to the vertex representing spinach, it is clear that this vertex should not be considered by the query resolution algorithm. Thus, labels can significantly narrow down the set of entities that are searched. Moreover, each edge must have exactly one label.

As an example, a graph representation of sentence 2 as it could be stored in a structured way in a graph database is given in Figure 4.2.

Figure 4.2: Possible graph representation for the sentence “Alice, die computerwetenschappen studeert, houdt sinds 2014 van Bob, die 22 jaar oud is.”.
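As an illustration, the graph of Figure 4.2 could be created in Neo4j as sketched below. The sketch assumes the official neo4j Python driver, a local Neo4j instance reachable at bolt://localhost:7687 and the shown credentials; the property name "naam" is purely illustrative and not part of the original example.

from neo4j import GraphDatabase

create_sentence = """
CREATE (a:Persoon {naam: "Alice", studies: "computerwetenschappen"})
CREATE (b:Persoon {naam: "Bob", leeftijd: 22})
CREATE (a)-[:`houdt van` {sinds: 2014}]->(b)
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    # Relationship types containing spaces must be escaped with backticks in Cypher.
    session.run(create_sentence)
driver.close()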

4.1.4 Cypher Query language

The query language that is used to query a Neo4j database is the Cypher Query Language (Intro to Cypher, 2018). The Cypher Query Language is a relatively simple language with which very complex queries can be expressed. Without going into too much detail, it consists of a couple of clauses, similar to those present in the traditional SQL language that is used to query relational databases. The most common ones are MATCH and WHERE. MATCH can be used to match specific vertices and/or edges, based on their labels and properties. The WHERE clause can then be used to specify additional constraints. Finally, the RETURN clause states what should eventually be returned by the query to the user. Other, more specialized clauses exist as well and are frequently used.

As an example query, suppose that a user wants to retrieve the names of all persons in a database that have studied computer science. A possible CQL query for this is:

MATCH (n:Person {studies: "computer science"}) RETURN n.name as name

Another example query that uses relationships is one that retrieves the names of all persons that love someone and that are at least 30 years old. This can be stated in CQL as:

MATCH (n:Person)-[:Loves]->(m:Person) WHERE n.age >= 30 RETURN n.name as name

4.2 Structured textual data

In the previous section, the need for and the concept of graph database management systems were explained. From the motivation for such a graph database management system, the need for a structured representation of textual data such as online news articles arose. In the following paragraphs, it is explained how such a graph database can be used to represent text in a more structured way, such that querying and automated analysis are possible. More specifically, it will be illustrated how such a structured representation in a graph database can be obtained. This whole process is split up into four different steps, partially based on the study performed by Bronselaer and Pasi (2013).

4.2.1 Tokenization

As a first step, the original text document should be split up into tokens. Stated differently, the text document should be transformed into a list of all the words occurring in the text. The simplest criterion to distinguish different tokens in a text is splitting on white space or on a punctuation mark. This is also the criterion that will be used throughout this thesis. The criterion is not perfect: for example, semantic concepts that span multiple tokens are split up (e.g. "San Francisco"). To avoid unnecessary complexity, however, this basic tokenization criterion is used.

The only exception that is made to this very simple tokenization criterion is when input lists are used. In the following sections, it will be explained that, for example, lists of cities, countries and numerical expressions in Dutch are used. Of course, these lists contain a few semantic concepts that span multiple words (e.g. "San Francisco", "ten minste"). For the words present in these lists an exception to the simple tokenization criterion is made, as these words will be interpreted as being one token.
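The tokenization criterion described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in this work: the regular expression and the small exception list of multi-word concepts are assumptions.

import re

MULTI_WORD_TOKENS = ["San Francisco", "ten minste"]

def tokenize(text):
    # Temporarily join multi-word concepts so they survive the split.
    for concept in MULTI_WORD_TOKENS:
        text = text.replace(concept, concept.replace(" ", "_"))
    # Split on white space and punctuation marks.
    tokens = re.split(r"[\s.,;:!?()\"']+", text)
    # Restore the protected concepts and drop empty strings.
    return [t.replace("_", " ") for t in tokens if t]

print(tokenize("Ik bezoek ten minste twee keer per jaar San Francisco."))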

4.2.2 Part-Of-Speech tagging

Representing a document in a structured way should be done without losing too much of the information present in the document. Otherwise, the structured representation would no longer be a good representation of the original document. However, it should also be clear that typically not all words present in the text will be stored in the structured representation, as the goal of the structuring process is also to aggregate the information as much as possible.

For that reason, it should be clear which words in a sentence are important to represent the original information contained within the sentence, and which words are not. An important indicator for this is the specific word type of that word. Thus, it should be known for each word in the text what its word type is. This is exactly what a Part-Of-Speech tagging algorithm tries to solve.

Part-Of-Speech tagging is the process in which, for each word in a text, the word type of the word is determined, based on both its definition and its context. Possible word types are nouns, verbs, adverbs and so on. This problem cannot be solved simply by providing for each language a large list of words together with their word types. Most often, the word type of a specific token in a text also depends on the context, i.e. its function in the sentence and the words surrounding it. As an example, the following sentences are given.

Sentence 1: “Ik koop een boek.”

Sentence 2: “Ik boek een reis.”

Each of these two sentences contains the word “boek”. However, in the first sentence, it is used as a noun with the meaning of “a collection of papers to read”. In the second sentence, it is used as a verb, and the meaning is “arranging something”. Thus, simply stating what the word type of “boek” is in Dutch, is not possible. This immediately illustrates the non-triviality of PoS tagging.

Two major solution categories for the PoS tagging problem exist: rule-based methods and probabilistic methods. An example of the first category is given in the study by Leech et al. (1983). In this thesis, however, use is made of a well-known probabilistic method, called TreeTagger (Schmid, 2013). As the name already suggests, this is a probabilistic technique based on binary decision trees. For each token occurring in the text, it outputs the word type of that specific occurrence together with the stem of the token.
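For illustration, TreeTagger can be invoked from Python roughly as sketched below. This assumes the third-party treetaggerwrapper package and a local TreeTagger installation with the Dutch parameter files; it is not necessarily how the tagger was invoked in this work.

import treetaggerwrapper

# TAGLANG="nl" selects the Dutch parameter files.
tagger = treetaggerwrapper.TreeTagger(TAGLANG="nl")
raw_output = tagger.tag_text("Ik boek een reis.")
for tag in treetaggerwrapper.make_tags(raw_output):
    # Each tag carries the token, its word type and its stem (lemma).
    print(tag.word, tag.pos, tag.lemma)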

4.2.3 Reclassification

Once the PoS tagging is done, the tokens are further classified based on their word types. By partitioning all word types into a few classes, it becomes much easier to generate the graph in the following step. In this thesis, seven different categories are defined: Noun, Number, Adjective, Place, Entity, Edge and Ignore. These categories are based on intuition about which word types are relevant to understand the semantics of a specific sentence. Next to this, the different categories are chosen to be a good fit for the query applications that will make use of the database further on.

A more detailed explanation of each of the categories is given now:

1. Noun: As the word already suggests, this category covers all tokens that were tagged by the TreeTagger algorithm as being a noun.

2. Number: This category again speaks for itself. Each token in the tagged document that is considered to be a number will be categorized in this class. Moreover, TreeTagger does not know certain Dutch numerical expressions, such as "ten minste" or "honderdtal". To make sure that these are also categorized as a number, a couple of if/else rules are added to the database construction algorithm. Next to these rules, a file with some common numerical expressions in Dutch is used. If a token occurs in this file, it is also considered to be a number.

3. Adjective: Again, the name of the category is intuitive. This category will contain all words of which TreeTagger indicates them to be an adjective.

4. Place: This category contains all possible geographical locations. This categorization is based on a couple of files containing all countries in the world and the largest cities all over the world. Specifically for the countries in Western Europe, smaller cities are also added to the list. This is because the articles that are automatically analyzed in this thesis are written by Flemish newspapers, so a large proportion of the published articles handles events in Western Europe.

5. Entity: The Entity category contains all specific names that do not point to a geographical location. Examples are personal names, names of concert halls or bands, television stations and so on. It is impossible to enumerate all of the entities in the world in a file. For that reason, a simple classification rule was developed. Every token that contains at least one capital letter, that is not the first token of a sentence and that does not point to a geographical location, is considered to be an entity. While this rule is far from perfect, it is assumed to be a reasonably well-performing simplification.

6. Edge: Based on the assumption made by Bronselaer and Pasi for English, three word types are considered to express a relationship in Dutch as well: prepositions, conjunctions and verbs (Bronselaer & Pasi, 2013). Examples of a verb relationship were already given when explaining the graph data model, but conjunctions (e.g. en - "Tom en Jan") and prepositions (e.g. in - "Ik ben in Gent.") express a relationship between different words as well.

7. Ignore: All tokens that do not belong to any of the above mentioned classes are assumed not to be important for representing the textual data in a structured way. Word types that are not considered further on include determiners, pronouns and adverbs. Tokens that are classified in this category are simply not stored in the eventual graph database.

Moreover, the graph generation algorithm also uses a Dutch stopword list. A stopword is a (typically short) word that occurs so often in Dutch texts that its presence tells almost nothing about the semantics of the sentence in which it occurs. As such, it is not useful to include such stopwords in the structured representation of a sentence. Because of this, these tokens are also added to the Ignore class.

As a final remark, it should be noted that the PoS tagging by the TreeTagger algorithm is not perfect. For that reason, it is possible that tokens are classified in the wrong category. Moreover, a couple of simplifying assumptions are made throughout the algorithm. However, based on the results of the automated analysis that will be given further on in this work, it can be assumed that these assumptions are justified.
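The reclassification step can be sketched as a small set of rules of the kind below. The PoS tag prefixes, the example word lists and the helper name classify are illustrative assumptions; the real mapping depends on the TreeTagger tag set for Dutch and on the input files described above.

# Tiny illustrative excerpts of the input files described above.
PLACES = {"Parijs", "Gent", "Frankrijk"}
NUMBER_EXPRESSIONS = {"ten minste", "honderdtal"}
STOPWORDS = {"de", "het", "een"}

def classify(token, pos_tag, is_first_in_sentence):
    if token.lower() in STOPWORDS:
        return "Ignore"
    if token in PLACES:
        return "Place"
    if pos_tag.startswith("num") or token.isdigit() or token.lower() in NUMBER_EXPRESSIONS:
        return "Number"
    # Capitalized tokens that are not sentence-initial and not a place.
    if any(c.isupper() for c in token) and not is_first_in_sentence:
        return "Entity"
    if pos_tag.startswith("noun"):
        return "Noun"
    if pos_tag.startswith("adj"):
        return "Adjective"
    if pos_tag.startswith(("verb", "prep", "conj")):
        return "Edge"
    return "Ignore"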

4.2.4 Graph generation

Once all tokens in the text have obtained their correct classification, the graph representation can effectively be constructed. As already explained in Section 4.1.3, a graph database in Neo4j consists primarily of three major components: vertices, edges and properties. How these are created based on the classification of the tokens is explained in the following.

Vertices

The first type of component of the graph data model that will be discussed is the vertex. The categories of tokens that are stored as a vertex in the graph database are the following: Nouns, Adjectives, Places and Entities. Each of the created vertices is also given a label that indicates the specific category to which the associated token(s) belong(s). In this way, the database can be efficiently searched and queried, for example to narrow a search to Nouns only.

Each of the vertices has an associated property word, of which the value is a string of subsequent concatenated tokens that were all classified into the same category. Thus, if for example multiple Nouns occur after each other in a filtered sentence, without any token of another category in between, they are concatenated (with white spaces) into one large token belonging to the Noun vertex. As another example, consider the following sentence:

“Een van de terroristen die zelfmoord heeft gepleegd in Parijs was Brahim Abdeslam.”

In this sentence, both the strings "Brahim" and "Abdeslam" will be classified in the Entity category. As such, instead of two vertices, only one vertex will be created with the associated property word, of which the value is then "Brahim Abdeslam". In general, this concatenation rule leads to a cleaner database. A possible problem is that two tokens that are only separated by tokens of the Ignore category (which are filtered out) are merged into one vertex, although they are not related semantically. However, in general the advantage of a more structured database outweighs this disadvantage.

Edges

In a graph database, the edges represent relationships between the vertices that they connect. In the context of this Master dissertation, the word types that express a relationship between other words are considered to be either verbs, conjunctions or prepositions. These are thus exactly the word types that are classified in the Edge category. Thus, for each of the tokens belonging to the Edge category, an edge is created with a property "word". The value of this property will then be a string containing all words of these word types that are in between the words of the vertices that the edge connects. As an example, the following sentence is given:

“John verblijft tijdens de zomer in Italië.”

If TreeTagger tagged the sentence correctly, "verblijft", "tijdens" and "in" will all eventually be classified in the Edge class. As no vertices occur between "verblijft" and "tijdens", one edge will be created of which the word property will be "verblijft tijdens". This edge will connect the two vertices "John" and "zomer", where "John" is a vertex with the label "Entity" and "zomer" is a vertex with the label "Noun". Finally, if subsequent tokens should be transformed into vertices with different labels (e.g. one is an adjective and the other is a noun), then there is no Edge token in between to connect the two vertices. In this case, an artificial edge with label "EDGE" is added to the database to connect both nodes. In this way, the sentence is still represented as one connected graph.

Properties

A last category of tokens that has not yet been stored in the database is the Number category. The numbers are stored in the graph database neither as a vertex nor as an edge. Instead, each number value is added as a Number property to the closest Noun in the sentence. The reason for this is that almost all numbers that occur in a sentence are indeed linked to an accompanying noun. As will be explained in the following section, this "number" property turns out to be very useful in the case of automated numerical inconsistency finding.

Next to the number property, it should also be noted that some general properties are also kept with each token that is stored in the graph database: article, date, sentence. These respectively represent the article in which the token occurs, on which date this article was published and finally in which sentence the token occurs. In this way, it is clear in which article and where exactly the token is written in case multiple articles are stored in the database (which is probably most common).

Example

To make the whole process described above a bit clearer, an extra example is given in which the following sentence is transformed step by step into the structured graph representation given in Figure 4.3.

“Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood.”

After tokenization, PoS tagging and reclassification, the sentence will be transformed into the list of tokens with accompanying categories given below in Listing 4.1. As can be seen, the token “zijn” is filtered out, as this is a possessive pronoun.

Listing 4.1: PoS tagging of the sentence “Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood.”

classDict = {"Tijdens": "Edge", "zware terroristische": "Adjective",
             "aanslagen": "Noun", "in": "Edge", "Parijs": "Place",
             "uitgevoerd werden door": "Edge", "Salah Abdeslam": "Entity",
             "en": "Edge", "handlangers": "Noun", "werden": "Edge",
             "130": "Number", "mensen": "Noun", "gedood": "Edge"}

Figure 4.3: Graph representation for the sentence “Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood.”

4.3 Automated numerical inconsistency finding

4.3.1 Motivation

In the previous section, a general procedure was given to create a more structured graph database representation out of unstructured text. This graph representation could be useful for many text analysis applications. In the context of this thesis, it is useful to look into possible applications for the automated analysis of online news articles. In Chapter 3, it was already shown that a significant proportion of online news articles is published containing at least one error. Another important conclusion that was drawn in Chapter 3 is that it is very difficult to find factual errors in an online news article. The reason for this is that it is often not clear at all what is really “true”. Even manually, it is very often not easy to find correct, reliable information about a certain event. As it was already very difficult to decide manually whether a certain proposed fact was true or false, it can be expected that it would be even harder to extract mistakes or errors automatically from an online news article.

Automatic error extraction from an online news article can thus be assumed to be impossible. However, as the introduction of this chapter already indicated, the main goal of this chapter is to investigate the consistency aspect of the reliability of online news. As such, instead of analyzing articles one by one, one could compare different articles to look for factual, numerical differences between them. Examples in the context of Chapter 3 are frequent. The following two sentences are taken from two different articles written about the Paris attacks by Het Laatste Nieuws:

1. Cazeneuve drukte ook nog zijn dankbaarheid uit tegenover de Belgische overheid voor haar steunbetuigingen na de aanslagen, waarbij volgens een laatste balans 132 doden vielen. (“Europa moet sneller en efficiënter optreden tegen het terrorisme”, 2015)

2. Mensen brachten ook maandag bloemen en kaarsen mee om de 129 slachtoffers en 352 gewonden te herdenken. (“Europa staat stil bij aanslagen Parijs”, 2015)

The first sentence was written in an article published on the 15th of November 2015 and states that there were 132 deaths after the terrorist attacks in Paris. The second sentence however was part of an article published one day later, and says that there are only 129 deaths. This is a clear contradiction, as the number of deaths cannot decrease over time. Automatically finding number pairs that could possibly constitute an inconsistency is the goal of this chapter. In the previous sections, a (more) structured representation of originally unstructured text was presented. In the following, it is illustrated how this structure can turn out useful for the automatic detection of numerical inconsistencies.

4.3.2 Basic approach

In Chapter 3, a data gathering process was conducted of which the result is a set of hundreds of articles about two different events that induced a period of stress: the terrorist attacks in Paris and the Germanwings plane crash. However, the statistical tests that were performed indicated that only in the case of the terrorist attacks in Paris a statistically significant relation was found between the presence of the breaking news event and the decrease in accuracy in the period of stress afterwards. Because of this, it is chosen in this section to investigate the data sets of articles handling the terrorist attacks in Paris that were written during the period of stress after these attacks. Moreover, another event was selected about which an enormous number of articles was published in the following period of stress: the terrorist attacks in Brussels and Zaventem on the 22nd of March, 2016. As the fractions of articles that were dedicated to both breaking news events during their respective periods of stress are comparable, it is assumed that this breaking news event will also provide an interesting data set to query. To be able to perform the analysis for the case of the terrorist attacks in Brussels and Zaventem, two additional data sets of articles written during the period of stress are gathered from the online websites of Het Laatste Nieuws and Het Nieuwsblad.

These articles can now be used as input files for the algorithm detailed below. More specifically, the goal is to automatically compare articles from within a file cluster (i.e., two articles with the same subject and from the same newspaper) and find numerical inconsistencies between these different articles. It is important to note that, contrary to Chapter 3, the data sets analyzed in this chapter consist only of articles about the respective breaking news event. The articles written during the periods of stress that handle different subjects are thus not part of the investigated data sets in this chapter.

To find inconsistencies within the same cluster, in a first step all articles are processed and transformed following the procedure that was pointed out in Section 4.2. Once this is done, a large graph database containing all structured sentences and articles from one cluster is obtained. This database can subsequently be queried to find inconsistencies. This is where the Number property that is associated with some noun vertices comes in handy. More specifically, the graph database is queried for specific nouns that have an associated Number property. The nouns that are queried should be part of a list that is specific to the subject of that cluster. In that way, numerical information is found that is relevant to the subject of interest. For example, the list given in Listing 4.2 illustrates the nouns that are queried in case of a terrorist attack, such as the ones in Paris in 2015:

Listing 4.2: Typical nouns possibly associated with a number in articles about terrorist attacks.

attackWords = ["terrorist", "terroristen", "slachtoffer", "slachtoffers", "doden",
               "gewonden", "belg", "belgen", "nationaliteiten", "aanslag", "aanslagen",
               "bom", "bommen", "dader", "daders", "kind", "kinderen",
               "zelfmoordaanslag", "zelfmoordaanslagen", "zwaargewonde", "zwaargewonden",
               "burger", "burgers", "agent", "agenten"]

The query that is then performed to retrieve all nouns that have a word property that is in this list is the following:

MATCH (n:Noun)
WHERE EXISTS(n.number) AND n.word IN {attackWords}
RETURN DISTINCT(n.word) as word
ORDER BY word

Based on the result of the query, a list of lists can be made. Each inner list then contains all occurrences of one specific noun from the attackWords list, together with the number that is associated with that occurrence. For example, the first list contains all vertices from the graph database that have a number and have word property terrorist, the second list contains all vertices with a number and with word property terroristen, and so on.
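A minimal sketch of this grouping step is given below, assuming the official neo4j Python driver, illustrative connection parameters and a shortened word list; the exact Cypher syntax for parameters and property existence checks depends on the Neo4j version in use.

from collections import defaultdict
from neo4j import GraphDatabase

attack_words = ["doden", "gewonden", "slachtoffers", "daders"]  # shortened list

query = """
MATCH (n:Noun)
WHERE EXISTS(n.number) AND n.word IN $attackWords
RETURN n.word AS word, n.number AS number, n.article AS article, n.sentence AS sentence
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
occurrences = defaultdict(list)
with driver.session() as session:
    for record in session.run(query, attackWords=attack_words):
        occurrences[record["word"]].append(
            (record["number"], record["article"], record["sentence"]))
driver.close()

# occurrences["doden"] now lists every numbered occurrence of "doden".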

In a following step, all numbers associated with a specific noun can be compared to find inconsistencies. However, the words that are queried occur a lot in the different articles. Taking a Cartesian product and comparing every numbered pair is thus not a good idea: this would lead to an enormous amount of numbers to be compared. Because of this, a first comparison criterion should be that only numbers belonging to the same noun are compared. However, this should not be the only comparison criterion. First of all, a lot of combinations of numbered nouns can still be made, which would still lead to a high complexity of the algorithm. Secondly, even within articles that handle the same subject, identical nouns can refer to different events. For example, in the case of the terrorist attacks in Paris, attacks were performed at different places in Paris. Comparing the number of people that died in Le Bataclan to the number of people that died at the Stade de France is thus not correct. However, if the only comparison criterion were the equality of the associated nouns, these numbers would be compared, as they belong to the same noun “doden”.

4.3.3 Synonyms

Next to the words in the attackWords list above, several synonyms or semantically similar sequences can be present in the text. For example, “politieagenten” is a synonym of “agenten”. Numbers associated with the noun “politieagenten” should thus be candidates for comparison with numbers associated with “agenten”. To this end, all occurrences of “politieagenten” are stored as “agenten” in the database. The same is true for all other words in the attackWords list. For example, each occurrence of “mensen zijn gestorven” will be stored as “doden” in the database. This synonym strategy increases the number of justified comparisons, and thus the recall in general.

4.3.4 Similarity measures

As explained above, only pairs of the same noun or of synonyms that are reasonable to compare should effectively be compared. To measure this quantitatively, similarity measures are introduced. A similarity function is a real-valued function that quantifies the similarity between two different objects. In the context of inconsistency finding, more specifically, the similarity between two filtered sentences is measured. In this way, only pairs of nouns that occur in sufficiently similar sentences are compared.

At the base of the similarity measure that is used to compare filtered sentences lies the Jaccard similarity (Jaccard, 1901). The Jaccard similarity of two sets is given by Equation (4.1).

sim(s1, s2) = |s1 ∩ s2| / |s1 ∪ s2|   (4.1)

Here, s1 is the set of all unique tokens that occur in the first sentence to be compared. Similarly, s2 is the set of all unique tokens that occur in the second sentence to be compared. The Jaccard similarity is thus given by the number of tokens that are common to both sentences, divided by the number of unique tokens that occur in either of the two sentences. As such, it is immediately clear that 0 ≤ sim(s1, s2) ≤ 1.
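A minimal sketch of this plain Jaccard similarity on two token sets (the example tokens are invented) is the following:

def jaccard(tokens1, tokens2):
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

a = ["aanslagen", "Parijs", "doden", "Bataclan"]
b = ["aanslagen", "Parijs", "doden", "stadion"]
print(jaccard(a, b))  # 3 common tokens out of 5 unique tokens -> 0.6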

When calculating the Jaccard similarity, the weight of all tokens that are present in the sentences is equal. However, it should be kept in mind that the final goal of the comparison is to know whether two sentences are similar enough, such that it is useful to compare the numbers that are associated with a noun occurring in both sentences. To that end, it could intuitively be argued that tokens that reside very close to the noun in a sentence have more influence on the decision that should be made than tokens that occur far away from the noun in the sentence. Stated differently, the smaller the distance from a token in a sentence to the numbered noun, the higher its impact on the final decision (compare or not) should be. This is especially the case for longer sentences.

Based on the Jaccard similarity, a new similarity measure should thus be developed such that tokens that occur closer to the common noun have a larger weight in the calculation than tokens that are at larger distance. Thus, in the context of this thesis a weighted Jaccard similarity was developed, of which the exact formula is given in Equation (4.2).

sim(s1, s2, n) = ( Σ over t ∈ s1 ∩ s2 of 1 / min(d1(t, n), d2(t, n)) ) / ( Σ over t ∈ s1 ∪ s2 of 1 / min(d(t, n, s1), d(t, n, s2)) )   (4.2)

In the above formula, n denotes the common noun that both sentences s1 and s2 contain. d1(t, n) denotes the shortest path distance between t and n in the graph representing sentence 1. d2(t, n) denotes the shortest path distance between t and n in the graph representing sentence 2. Moreover, the function d is defined in Equation (4.3).

d(t, n, si) = di(t, n) if t ∈ si, and ∞ otherwise.   (4.3)

To calculate the shortest path in a graph representation of a sentence, each edge in the graph that should be traversed to reach t from n counts for a weight equal to 1. This shortest path distance can be computed very easily, as the Cypher Query Language offers a shortestPath function in its API.

In summary, instead of using a uniform sum by simply computing the cardinality of the intersection and the union, a weighted sum is used in both numerator and denominator. In this way, the weighted Jaccard similarity makes sure that tokens that are close to the common noun get a larger influence on the similarity computation. If a token t occurs multiple times, possibly in both sentences, only the closest of these different occurrences will influence the computation of the similarity. This is because of the min function that occurs in the formula.
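A minimal sketch of Equation (4.2) is given below. The inputs are two hypothetical dictionaries mapping each token of a filtered sentence to its shortest path distance (in edges) to the common noun, as such distances would be obtained from the shortestPath query; the common noun itself is assumed not to be part of these dictionaries.

import math

def weighted_jaccard(dist1, dist2):
    """dist1, dist2: {token: shortest path distance to the common noun}."""
    common = set(dist1) & set(dist2)
    union = set(dist1) | set(dist2)
    numerator = sum(1.0 / min(dist1[t], dist2[t]) for t in common)
    denominator = sum(
        1.0 / min(dist1.get(t, math.inf), dist2.get(t, math.inf)) for t in union)
    return numerator / denominator if denominator else 0.0

d1 = {"aanslagen": 1, "Parijs": 2, "Bataclan": 3}
d2 = {"aanslagen": 1, "Parijs": 2, "stadion": 2}
print(weighted_jaccard(d1, d2))  # (1 + 1/2) / (1 + 1/2 + 1/3 + 1/2) ≈ 0.64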

The proposed similarity measure is useful because, based on the graph data model, its computation is very easy. As it is known in which article and in which sentence in that article the token occurs, and what its associated number is, the following query can be executed:

MATCH (n {word: {commonNoun}, article: {article}, sentence: {sentence}, number: {number}}),
      p = shortestPath((n)-[*]-(m))
WHERE n <> m
RETURN m.word as word, labels(m) as label, rels(p) as edges
ORDER BY length(p)

This query gives for each token from that sentence the exact word, the type of word and the edges that connect the noun with the token. Based on this, all tokens that are present in the filtered sentence are obtained, together with their distances. As such, the weighted Jaccard similarity can be computed. It should be noted however that the EDGE words that are associated with artificial edges should be filtered out before calculating the similarity of the two sentences.

4.3.5 Places

Another aspect that should be taken into consideration is the fact that numbers are almost always associated with a certain place. Examples are very common: “130 deaths in Paris”, “1 million people flee from Syria”, and so on. Thus, taking into account which place is associated with a given noun and a given number could be useful when deciding whether two numbers should be compared.

This aspect was taken into account as follows. An extra property, Place, is stored with each noun that has a Number property. The value of this property is then the place that is closest in the sentence to that specific noun. Later on, when it is discussed how it is finally decided which numbers are compared, it will become clear what the exact role of this place property is.

The attentive reader will notice that sensu stricto the storage of a place property with the nouns is not needed. By simply using the query given in Section 4.3.4, all places that are within the sentence together with their shortest path distance could be easily recovered.

However, this would mean that only places that are within the sentence of the noun are relevant. If no place vertex were present in the sentence, no place property would be stored with the noun, as subsequent sentences are not connected in the graph database. This is of course not the desired behaviour, as it is possible that a number refers to a place that is present in one of the surrounding sentences. Because of this, as long as the closest place is in the previous, the current or the next sentence, the place is considered to influence the number. To obtain this behaviour, the only solution is to add an extra place property to the numbered nouns.

4.3.6 Entities

A final aspect that came up during the development of the algorithm is that Entities are very important words to consider when deciding whether two numbers should be compared. To make this statement more intuitive, two example sentences are given below:

Sentence 1: “Tijdens de terroristische aanslagen in Parijs vielen 89 doden in Le Bataclan.”

Sentence 2: “Tijdens de terroristische aanslagen in Parijs vielen 5 doden in de Rue de la Fontaine.”

Using the criteria that were given above, the two numbers would probably be compared: the noun (“doden”) that is associated with the numbers is exactly the same, the sentences have a lot of similar words, especially close to the common noun, and the closest place in both sentences is “Parijs”. However, it is immediately clear that the two numbers should not be compared, as they refer to the number of deaths at different terrorist attacks. This small example clearly illustrates that not only the place is important, but that entities such as names, streets and concert halls that are not a geographical place are important to consider as well.

To this end, the graph generation procedure is changed slightly. The entities are no longer stored as vertices. Instead, all entities that are present in a sentence are put into a list that forms an "Entities" property of the nouns that contain a number. The two entity lists of the nouns under consideration can then themselves be compared with a similarity measure. In the context of this thesis, the simple Jaccard similarity is used, of which the formula was given in Section 4.3.4.

As such, two similarity computations are now performed. The first one measures the similarity between the two entity lists of the nouns that should (not) be compared. The second one is the one that was introduced in Section 4.3.4 and measures the similarity between the filtered sentences. However, the concept of the latter comparison changes a little, as entities are no longer included in these filtered sentences. Otherwise, they would influence both similarities.

To illustrate the changes in the graph generation process, the sentence for which the graph representation was generated in Figure 4.3 is structured again, but now following the slightly changed procedure. The resulting graph representation is shown in Figure 4.4.

Figure 4.4: Adapted graph representation for the sentence “Tijdens zware terroristische aanslagen in Parijs, die uitgevoerd werden door Salah Abdeslam en zijn handlangers, werden 130 mensen gedood.”

4.3.7 Comparison criterion

In the previous sections, the different aspects that could possibly have an impact on which pairs of numbers should be compared were explained. Based on these aspects, a decision rule should now be created that classifies the pairs as correctly as possible into either the comparable class or the non-comparable class. Different decision rules were considered in the context of this Master dissertation. The quality of the final results (i.e., the possible inconsistencies that were found) is then evaluated by calculating the precision and recall. The results are extensively discussed in the next subsection. Finally, the decision rule indicated in Listing 4.3 turned out to be the best rule:

Listing 4.3: Decision rule for comparing numerical information

if sameNoun and ((entitySim >= 0.3) or (sentenceSim >= 0.2)) and samePlace:
    comparable = True
else:
    comparable = False

Here, sameNoun is a boolean indicating whether the two nouns that are compared are exactly the same. Remember that synonyms of nouns that are queried are also mapped onto the same noun; for these synonyms, the sameNoun variable will thus also evaluate to True. sentenceSim denotes the weighted Jaccard similarity of the two sentences, as explained in Section 4.3.4. Similarly, the entitySim variable denotes the Jaccard similarity of the two entity lists that are associated with the respective nouns. Finally, samePlace is a boolean variable that indicates whether the two numbers are connected to the same place or not.

Of course, two places are considered to be the same if they consist of the exact same string. For example, if the places of the first noun and the second noun are both “Parijs”, the numbers are considered to refer to the same place. However, the concept of the same place is broader than only the same string. The algorithm also uses an input text file that contains each country together with its capital. Based on this input file, strings like country + “hoofdstad” are considered equal to the capital itself. For example, “Franse hoofdstad” is considered the same place as “Parijs”. Furthermore, an input file consisting of geographical adjectives is also present, together with the country or city to which they are related. For example, “Frans” and “Frankrijk” are considered the same location. However, “Frans” and “Parijs” or “Frankrijk” and “Parijs” are not considered to be the same place.
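The place equivalence rules described above can be sketched as follows. The dictionaries are tiny illustrative excerpts of the country/capital and adjective input files, and the helper names normalize_place and same_place are assumptions for illustration only.

CAPITALS = {"Frankrijk": "Parijs", "België": "Brussel"}
ADJECTIVES = {"Frans": "Frankrijk", "Franse": "Frankrijk", "Belgische": "België"}

def normalize_place(place):
    place = place.strip()
    # "Franse hoofdstad" -> country "Frankrijk" -> capital "Parijs"
    if place.endswith("hoofdstad"):
        prefix = place.split()[0]
        country = ADJECTIVES.get(prefix, prefix)
        if country in CAPITALS:
            return CAPITALS[country]
    # Geographical adjectives are mapped onto their country or city.
    return ADJECTIVES.get(place, place)

def same_place(place1, place2):
    return normalize_place(place1) == normalize_place(place2)

print(same_place("Franse hoofdstad", "Parijs"))  # True
print(same_place("Frans", "Frankrijk"))          # True
print(same_place("Frans", "Parijs"))             # False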

Two numbers are thus compared if they belong to the exact same noun and if they are related to the same place (based on the rules explained above). Moreover, one of the following two conditions should be met:

• The Jaccard similarity of the two entity lists is at least 0.3.
• The weighted Jaccard similarity of the two sentences is at least 0.2.

These thresholds were chosen based on the empirical results that were obtained while testing on the file clusters present.

4.3.8 Evaluation

Precision and recall

In the following section, the results that are obtained by the algorithm are evaluated. First, however, it is important to consider what exactly should be evaluated.

A possible evaluation criterion could be the following. Each of the numbers associated with the same noun in a comparison is mapped onto an interval. For example, “20” is mapped onto the interval [20, 20]. Similarly, the string “minstens 20” is mapped onto the interval [20, ∞]. Inconsistency detection could then simply consist of comparing the two intervals associated with the same noun. If there is overlap between the two intervals, there is no inconsistency. If there is no overlap, the pair of numbers is assumed to be inconsistent. For example, the strings “minstens 20” and “20” are not inconsistent, while the strings “15” and “minstens 20” are inconsistent. However, two problems arise when classifying such intervals as being inconsistent:

• Overlap between two intervals does not mean that the numbers associated with the nouns are close enough (or not inconsistent). Consider the two strings “minstens 100 doden” and “minstens 200 doden”. The respective intervals [100, ∞] and [200, ∞] overlap. However, there is still a huge semantic gap between the two strings associated with “doden”. An idea that could come to mind is to define a threshold: if the difference between the two start points is larger than that threshold, the numbers are considered to be inconsistent. However, this is not feasible in practice: such a threshold would have to be defined for each different noun.

• Whether two numbers are really inconsistent often also depends on the date of publication of the articles under consideration. For example, assume that an article published on the 14th of November 2015 at 10:00 says that 100 people were killed. A second article, published on the 14th of November 2015 at 20:00, says that 110 people were killed. The two intervals [100, 100] and [110, 110] clearly do not overlap and would thus be assumed to be inconsistent. However, it is possible that another 10 people died during the day. As there is no way to detect and verify this number of additional deaths automatically, it is thus arguable whether the intervals are really inconsistent.

The two problems explained above show that it is possibly controversial to evaluate the algorithm with the help of the interval approach. Because of this, the most correct way to evaluate the inconsistency finding algorithm is based on the number of correct comparisons of number pairs. A pair of numbers is compared correctly if the two numbers that are compared are really referring to the same real-world number, for example the number of deaths in Le Bataclan. Thus, a correct comparison does not imply directly that the compared numbers are indeed a real-world inconsistency.

The evaluation of the algorithm should be done based on two measures, namely precision and recall. Both metrics take on values between 0 and 1. An algorithm for which both precision and recall are equal to 1 is the perfect algorithm. The respective formulas are given in Equations (4.4) and (4.5).

precision = TruePositives / (TruePositives + FalsePositives)   (4.4)

recall = TruePositives / (TruePositives + FalseNegatives)   (4.5)

In these formulas, the True Positives are those pairs of which the numbers are correctly compared. The False Positives are the pairs that are compared but should not be, because they refer to different real-world numbers. Finally, the False Negatives are the pairs of numbers that are not compared, but that should be compared.

It is thus clear that the precision can be easily calculated based on the output of the algorithm, by dividing the number of correctly compared number pairs by the number of all number pairs that are given back by the algorithm.

However, the recall cannot be calculated, because the number of False Negatives would have to be known. In fact, this comes down to knowing all possible number pairs referring to the same real-world number that are present in a file cluster. As this is not feasible, the exact recall cannot be calculated. However, the number of True Positives is indicative of the recall: as the number of True Positives increases, it is certain that the recall also increases.

Initial results

The algorithm described above was tested on four different data sets, two consisting of articles written by Het Laatste Nieuws and two consisting of articles written by Het Nieuwsblad. It was chosen to analyze two different subjects, one that was already analyzed manually and one that has not been analyzed before. As such, the first subject that is chosen is the terrorist attacks in Paris, on the 13th of November, 2015. The second subject is the terrorist attacks in Brussels and Zaventem on the 22nd of March, 2016. These subjects led to a large number of articles written by both online newspapers, and thus it can be assumed that enough numerical information will be present to find possible inconsistencies.

The precision that is obtained for each of the four different file clusters after running the algorithm is given in Table 4.1.

Table 4.1: Precision of comparisons in the output of the automatic inconsistency finding algorithm.

                     Brussels attacks    Paris attacks
Het Laatste Nieuws   12/22 = 0.545       14/17 = 0.824
Het Nieuwsblad       17/29 = 0.586       61/75 = 0.813

In total, 104 correct comparisons are found (thus, 104 True Positives). Dividing the sum of all True Positives by the sum of all returned comparisons gives a total precision of 104/143 = 0.727.

Improvements

As explained when describing the inconsistency finding algorithm, pairs of numbers are only compared if they are linked to the same place (a country, capital or geographically related adjective). These geographical locations are based on a large text file that is given as input to the algorithm.

However, the data sets analyzed above immediately illustrate that this large text file is sometimes not sufficient. For example, the tokens “Maalbeek” and “Le Bataclan” could also be interpreted as places instead of entities. This could be beneficial, as the algorithm would then be even better at distinguishing numbers belonging to different places.

Therefore, the user is allowed to specify a number of place synonyms beforehand. For example, “metro”, “Maalbeek” and “Brussel” could be considered place synonyms and are then mapped onto the same Place term, e.g. “Maalbeek”. By letting the user provide such place synonyms, the algorithm is allowed to compare numbers that occur in very similar sentences but that were previously linked to different places. For example, consider the following two sentences:

Sentence 1: “Er vielen 20 doden in Maalbeek en 14 doden in Zaventem.” (There were 20 deaths in Maalbeek and 14 deaths in Zaventem.)

Sentence 2: “Er vielen 20 doden in de metro.” (There were 20 deaths in the metro.)

As Maalbeek is a subway station, it is not included in the large text file that the algorithm gets as input. The 20 deaths in the first sentence will therefore be associated with Zaventem, as this is the closest place to the number that is present in the input file. However, this is not correct: only the 14 deaths are associated with Zaventem.

If the user indicates before running the algorithm that “Maalbeek” and “metro” refer to the same concept for this use case, the twenty deaths in the first sentence will likely be compared with the twenty deaths in the second sentence, which is a correct comparison. It can thus be expected that this approach generally leads to a larger number of True Positives and thus further increases the recall. The influence on the precision is more difficult to estimate, as it depends on the quality of the place synonyms given to the algorithm.

As input place synonyms, the dictionaries given in Listing 4.4 are fed to the algorithm for the terrorist attacks in Paris and Brussels respectively.

Listing 4.4: Input dictionaries for the algorithm with place synonyms for the terrorist attacks in Paris and those in Brussels. Each synonym token is mapped onto its canonical Place term.

synonymsParis = {"stadion": "Stade de France", "concertzaal": "Bataclan"}
synonymsBrussels = {"Brussels Airport": "Zaventem", "luchthaven": "Zaventem",
                    "metro": "Maalbeek", "metrostation": "Maalbeek"}
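As a minimal sketch of how such a dictionary could be applied, the helper below maps a place token found near a number onto its canonical Place term before two numbers are compared; the function name resolve_place is hypothetical and only illustrates the idea.

# Minimal sketch: map a place token onto its canonical Place term (if any).
synonymsBrussels = {"Brussels Airport": "Zaventem", "luchthaven": "Zaventem",
                    "metro": "Maalbeek", "metrostation": "Maalbeek"}

def resolve_place(token, synonyms):
    # Return the canonical Place term for a token, or the token itself.
    return synonyms.get(token, token)

# Numbers are then only compared when their resolved places coincide.
print(resolve_place("metro", synonymsBrussels))     # -> "Maalbeek"
print(resolve_place("Maalbeek", synonymsBrussels))  # -> "Maalbeek"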

Running the algorithm with the added Place terms leads to the precisions given in Table 4.2:

Table 4.2: Precisions of the comparisons in the output of the automatic inconsistency finding algorithm.

                     Brussels attacks   Paris attacks
Het Laatste Nieuws    8/18 = 0.444      16/18 = 0.889
Het Nieuwsblad       34/54 = 0.630      62/77 = 0.805

It is clear that this leads to slightly different results for three of the four data sets, and to a very different result for the data set about the terrorist attacks in Brussels written by Het Nieuwsblad.

The weighted average precision of the algorithm decreases very slightly, from 0.727 to 0.719: for some data sets the precision increases, while for others it decreases. However, the total number of True Positives increases from 104 to 120, and it increases for each data set separately, except for the articles written by Het Laatste Nieuws about the attacks in Brussels. This clearly indicates that the global recall of the algorithm is higher, although an exact computation is not possible. The expected effects of the modified algorithm are thus obtained.

Final extension to algorithm

In a final step, a possible application for a user (e.g. a journalist) could be to find as many occurrences as possible of the possible inconsistencies found above. For example, suppose the algorithm outputs the following comparison:

Listing 4.5: Possible output of the inconsistency finding algorithm

{"word": "doden", "number": "zeker dertig", "article": "article1", "place": "zaventem", "sentence": 3, "date": "22/03/16", "entities": None}

{"word": "doden", "number": "minstens 34", "article": "article2", "place": "zaventem", "sentence": 1, "date": "23/03/16", "entities": None}

The database can then be queried for all possible inconsistencies where one article associates the number “zeker dertig” with the noun “doden” and another article associates the number “minstens 34” with the same noun. This simple extension can thus be useful for finding all occurrences of the possible inconsistencies that were found.
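As a minimal sketch of this querying step, the helper below scans a collection of stored number mentions for every pair of articles that reproduces a previously found possible inconsistency; the record layout follows Listing 4.5, while the mention list and the helper name are hypothetical.

# Minimal sketch: find every occurrence of a previously found possible
# inconsistency (same noun, the two conflicting numbers) across all articles.
def find_occurrences(mentions, word, number_a, number_b):
    hits_a = [m for m in mentions if m["word"] == word and m["number"] == number_a]
    hits_b = [m for m in mentions if m["word"] == word and m["number"] == number_b]
    # Every combination of the two conflicting mentions in different articles
    # is an occurrence of the possible inconsistency.
    return [(a, b) for a in hits_a for b in hits_b if a["article"] != b["article"]]

# Example (hypothetical):
# occurrences = find_occurrences(all_mentions, "doden", "zeker dertig", "minstens 34")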

Of course, it can be assumed that the number of proposed comparisons of the algorithm will be higher than without this extension: for each possible inconsistency found in the previous step, all occurrences in all articles are given now. Just as for the previous modification of the algorithm, the influence of the modification on the precision is more difficult to estimate and is dependent on the specific data set.

The results of the algorithm with the proposed extension are summarized in Table 4.3.

Table 4.3: Precisions of the comparisons in the output of the automatic inconsistency finding algorithm.

                      Brussels attacks   Paris attacks
Het Laatste Nieuws     25/59 = 0.424      23/26 = 0.886
Het Nieuwsblad        74/105 = 0.705     220/285 = 0.772

As can be seen from this table, the influence of the extension on the precision depends on the specific data set. Globally, the precision remains more or less the same, at 0.72. On the other hand, the number of True Positives rises sharply from 120 to 342, an increase by almost a factor of three. This means that the recall of the algorithm improves significantly. It is thus clear that many of the possible inconsistencies occur multiple times in the different files. Moreover, the general performance of the algorithm, in terms of the harmonic mean of precision and recall, is much better.

To conclude, the precisions for the data sets about the terrorist attacks in Paris are clearly very high. This means that the largest part of the results that the algorithm outputs is correct. For the data sets about the terrorist attacks in Brussels, the precisions are lower. A possible explanation is that place names are used interchangeably in the online news articles reporting on the attacks. For example, when an article writes about “the attacks in Brussels”, this sometimes refers to all attacks that took place that day (both in Zaventem and Maalbeek), while sometimes the journalist only means the attack that took place in Maalbeek (as Zaventem is not part of the city of Brussels). Because of this confusion, both numbers are often mapped onto the same Place (i.e. “Brussel”), which leads to a lower precision. However, the numbers are still fairly high for automated Natural Language Processing (NLP).

Another conclusion that can be drawn is that the number of inconsistencies found is typically much higher in data sets of articles of Het Nieuwsblad than in data sets of articles of Het Laatste Nieuws. This is true both for the articles about the terrorist attacks in Paris and for those about Brussels. A possible explanation, in the case of the attacks in Brussels, is that the data set of Het Laatste Nieuws contains 1830 tokens tagged as Number in total, while the data set of Het Nieuwsblad contains 3262 such tokens. However, this difference is no longer present in the case of the terrorist attacks in Paris (1274 and 1292 tokens respectively). Without generalizing, it can thus be concluded that the investigated data sets of Het Nieuwsblad probably contain more numerical inconsistencies than those of Het Laatste Nieuws.

4.3.9 Smallest subset removal

As a final consideration, the prevalence of the possible inconsistencies over the different articles is studied. A possible approach, used in the context of this Master dissertation, is based on the minimum change principle or the Fellegi-Holt paradigm (Fellegi & Holt, 1976). This paradigm is typically used for error localization and automatic editing of data. It looks for the smallest subset of variables of a record that should be changed in order to obtain a consistent record.

In the context of automatic numerical inconsistency finding, the paradigm can be used as follows. The goal in this subsection is to find the smallest subset of articles that should be removed such that the algorithm presented above no longer finds any possible inconsistency. Although this does not ensure that all inconsistencies are removed (as the recall is not equal to 1.0), the number of articles that have to be removed can be assumed to be indicative of the extent of the problem of numerical inconsistencies within a data set.

To find this minimal subset of articles, the algorithm given in Listing 4.6 is used.

Listing 4.6: Algorithm for finding the minimal subset of articles that covers all found inconsistencies.

minimal_subset = []
while len(inconsistencies) > 0:
    # Count for each article in how many possible inconsistencies it is involved.
    articleOccs = {}
    for inc in inconsistencies:
        articleOccs[inc["article1"]] = articleOccs.get(inc["article1"], 0) + 1
        articleOccs[inc["article2"]] = articleOccs.get(inc["article2"], 0) + 1

    # Greedily remove the article that is involved in the most inconsistencies.
    most_frequent_article = max(articleOccs, key=articleOccs.get)
    minimal_subset.append(most_frequent_article)

    # Keep only the inconsistencies that are not related to the removed article.
    inconsistencies = [inc for inc in inconsistencies
                       if inc["article1"] != most_frequent_article
                       and inc["article2"] != most_frequent_article]

The algorithm starts with an array containing all possible inconsistencies that were found by the algorithm described in the previous subsections. It counts for each article in how many possible inconsistencies it is involved. The article involved in the most inconsistencies is then added to the subset, and all inconsistencies related to that article are removed from the array. The procedure then repeats itself: the inconsistencies per article are counted again and the article with the most occurrences is removed. This is repeated until no inconsistencies remain.
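To make the greedy procedure concrete, consider the following toy input; the article identifiers are hypothetical and only serve to illustrate the expected behaviour of Listing 4.6.

# Toy input for the procedure of Listing 4.6 (hypothetical article identifiers).
inconsistencies = [
    {"article1": "a1", "article2": "a2"},
    {"article1": "a1", "article2": "a3"},
    {"article1": "a4", "article2": "a5"},
]
# First iteration: "a1" is involved in two inconsistencies and is removed,
# which also covers the first two entries. Second iteration: "a4" (or "a5")
# is removed, covering the last entry, so the minimal subset has size 2.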

The number of articles that had to be removed from each data set to make it inconsistency-free is given in Table 4.4. Moreover, the table indicates which fraction of the total number of articles in each data set this represents. These numbers and percentages illustrate that the problem of inconsistencies in a data set is limited to a fairly small subset of articles. Although a few inconsistencies will probably still be present in the data set, it can be assumed that a significant amount of inconsistencies is removed by removing this minimal subset.

Table 4.4: Cardinalities of the minimal subsets and the relative fractions for each of the different data sets.

                     Brussels attacks   Paris attacks
Het Laatste Nieuws   11 (1.9%)          9 (2.9%)
Het Nieuwsblad       12 (2.1%)          20 (7.7%)

Moreover, within the identified minimal subsets, the most frequent articles contribute significantly to the total number of possible inconsistencies that is found in the respective data sets. This is illustrated in Figures 4.5 and 4.6 for the terrorist attacks in Paris and the terrorist attacks in Brussels and Zaventem respectively.

Figure 4.5: Fraction of removed inconsistencies as a function of the number of articles removed from the file set of the terrorist attacks in Paris. (a) Het Laatste Nieuws; (b) Het Nieuwsblad.

4.4 Conclusion

In this chapter, first an introduction to graph databases and the specific Neo4j database was given. This knowledge was then used to create a structured representation of the unstructured textual data that online news articles consist of. This structured representation is created in four steps: tokenization, Part-of-Speech tagging, reclassification and finally graph generation. In this way, textual data is represented as a graph in which vertices and edges represent the words of sentences.

Figure 4.6: Fraction of removed inconsistencies as a function of the number of articles removed from the file set of the terrorist attacks in Brussels and Zaventem. (a) Het Laatste Nieuws; (b) Het Nieuwsblad.

This structured representation was then modified slightly for the specific use case of automated numerical inconsistency finding. Here, use was made of a Jaccard similarity and a weighted Jaccard similarity. These were used to measure the similarities of different sets gathered from the original sentences that were compared. Finally, the numbers that could possibly be inconsistent were selected based on an empirically created selection criterion. The results of the algorithm and its extensions indicate that the achieved precision is quite high for the different data sets. Moreover, although the exact recall cannot be computed, a few hundred True Positives are returned by the algorithm, which is a fairly high number.

Chapter 5

Automatic freshness estimation of online news articles

In this chapter, a third aspect of the reliability of online news, namely its relevance to the reader, is investigated. Only one specific aspect is studied: the freshness of online news articles during a period of stress. The chapter starts by giving the problem statement, after which the general approach is explained. This approach is applied twice: once using the Jaccard similarity and once using a variant of a Longest Common Substring-based similarity. Finally, a conclusion and suggestions for further research are given.

5.1 Problem statement

In Chapter 3, an online news article gathering process was conducted to obtain large data sets of articles, published both during periods of stress and during non-stress periods. A period was classified as a period of stress if, for at least four days, at least 25% of the online news articles of the two largest online newspapers were dedicated to the same news event. Two different news events were originally considered: the terrorist attacks in Paris and the Germanwings plane crash. Later on, in Chapter 4, an additional data set of articles about the terrorist attacks in Brussels and Zaventem was gathered.

However, Chapter 3 already illustrated that the number of articles about the specific breaking news event during its period of stress can still differ drastically depending on the event. More specifically, for the periods of stress studied in this work, the terrorist attacks in Paris and Brussels generated approximately five times more news articles than the Germanwings plane crash. The number of articles published about these terrorist attacks is enormous: in just four days after the news events happened, Het Laatste Nieuws and Het Nieuwsblad published in total 1179 articles about the terrorist attacks in Zaventem and 570 about those in Paris. Over 80% of the articles written during these days within the periods of stress dealt with the specific stress event.

The large numbers mentioned above raise the question whether this enormous number of articles is justified. More specifically, one could ask to what extent each article published during such a period of stress is completely fresh. Stated differently, how much of the information present in a new article was already contained in one of the previously published articles? This is the central question of the research conducted in this chapter. As the terrorist attacks in Paris and those in Brussels and Zaventem clearly led to an enormous number of published articles, these breaking news events were chosen for the investigation.

5.2 General approach

To be able to quantify the amount of information that was already present in a previously published article about the subject, the online news articles are processed in chronological order. Each online news article is compared with all online news articles about the subject that were published before it. Thus, the first article ever published about the breaking news event is considered completely new, the second article is only compared with the first article, the third article with the first two, and so on. Based on these comparisons, the fraction of information that is "duplicated" in an article can be computed.

To compare two articles, their respective sentences are compared. Thus, each sentence of the newly published article is compared with all sentences that were already published before. This comparison is based on a specific similarity measure. Different types of similarity measures will be investigated in the following. In this way, for each sentence in the new article, the most similar sentence that was published before can be determined. This most similar sentence has an associated similarity value. If this maximum similarity is larger than a given threshold, the newly published sentence is assumed to be duplicated. The specific meaning of this duplication is dependent on the combination of the similarity measure that is used together with the chosen threshold.
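The sketch below illustrates this procedure under the assumptions just described; the similarity function is left abstract and the names are hypothetical.

# Minimal sketch of the freshness estimation: articles are processed in
# chronological order and every sentence of a new article is compared with
# all sentences published earlier. `similarity` and the data layout are assumptions.
def duplicated_fractions(articles, similarity, threshold):
    # articles: chronologically ordered list of articles, each a list of sentences.
    fractions, earlier_sentences = [], []
    for article in articles:
        if earlier_sentences:
            duplicated = sum(
                1 for s in article
                if max(similarity(s, t) for t in earlier_sentences) >= threshold)
            fractions.append(duplicated / len(article))
        else:
            fractions.append(0.0)  # the first article is considered completely new
        earlier_sentences.extend(article)
    return fractions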

5.2.1 Jaccard similarity

A first, simple similarity measure that can be used is the Jaccard similarity (Jaccard, 1901). This similarity measure was already used in Chapter 4 to compare word sequences stored in a graph database. The exact formula of the Jaccard similarity is given in Equation (4.1). A sentence is considered to be duplicated if there exists a sentence in the earlier published articles for which the similarity is equal to or higher than 0.8.

This threshold is empirically determined: a Jaccard similarity of at least 0.8 indicates that two sentences are nearly clones of each other. The reason for this is that a slight difference between two sentences immediately leads to a significant decrease of the similarity. As an illustration, two sentences of 10 words each, where each sentence contains exactly one word that the other does not, have a Jaccard similarity of 9/11 = 0.82. This similarity is thus only just above the chosen threshold of 0.80. The considered similarity measure together with the chosen threshold can thus be used to find almost exact copies of earlier published sentences.
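As a minimal sketch, a Jaccard similarity on word sets could look as follows; the whitespace tokenization is an assumption made only for the illustration.

# Minimal sketch: Jaccard similarity between two sentences treated as word sets.
def jaccard(sentence1, sentence2):
    a, b = set(sentence1.lower().split()), set(sentence2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Two 10-word sentences that differ in exactly one word share 9 words and have
# a union of 11 words, giving a similarity of 9/11 = 0.82.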

The obtained results for the terrorist attacks in Paris and Brussels are depicted in Figures 5.1 and 5.2 respectively. These figures show the average percentage of sentences in an article that are (almost) copies of an earlier published sentence, based on the similarity measure/threshold pair mentioned above. From these graphs, it can be concluded that the number of copied sentences is relatively low: the percentages range from 0% to 3.5% at most. Moreover, it can be seen that as more articles are published, the percentages do not increase significantly. This indicates that even when already hundreds of articles are published, new articles still contain a lot of new information: almost all sentences are not similar enough (according to the Jaccard similarity) to an earlier published sentence in another article.

Figure 5.1: Average fraction of copied sentences in an article about the terrorist attacks in Paris, where articles are analyzed in chronological order. Similarity based on a Jaccard similarity with threshold 0.8.

5.2.2 Longest Common Substring similarity

The approach mentioned in the previous subsection, namely using the Jaccard similarity together with a high threshold, is good at finding almost identical sentences. However, the Jaccard similarity compares sentences only very superficially: only if exactly the same string is present in both sentences does it contribute to an increase of the Jaccard similarity. Strings that are almost but not completely identical are considered to be completely different by the Jaccard similarity. As an example, consider two linguistically similar words, such as “politieagent” and “agent” or “speler” and “spelen”. As the strings are not completely identical, they are considered to be different words. They are thus both part of the union of the two sentence word sets, but not of their intersection. They therefore lower the Jaccard similarity, although it is intuitive that they are both syntactically and semantically similar.

Figure 5.2: Average fraction of copied sentences in an article about the terrorist attacks in Brussels and Zaventem, where articles are analyzed in chronological order. Similarity based on a Jaccard similarity with threshold 0.8.

A possible solution could be to use a stemming algorithm for Dutch, for example the Porter stemmer (Kraaij & Pohlmann, 1994). However, although this stemming algorithm performs quite well considering its complexity, it can be expected that many linguistically similar words would still not be considered equal after stemming. It is thus intuitive to look for another similarity measure that can better handle the partial similarity between words within two sentences.

To this end, an alternative approach is considered in this subsection. Instead of interpreting sentences as sets of words, each word in the newly published sentence is compared separately with each word in the earlier published sentence. Thus, instead of computing the similarity of two sentences directly, this similarity is now calculated based on the similarity between the words of which they are constituted. More specifically, for each word in the newly published sentence, the most similar word in the other sentence is sought. Finally, the similarity between the sentences is computed as the average of these maximum word similarities. This is formalized in Equation (5.1).

sim(s1, s2) = (1 / |s1|) · Σ_{w1 ∈ s1} max{ wordSim(w1, w2) | w2 ∈ s2 }    (5.1)

The last choice that should be made is which similarity measure is used as the wordSim operator in the above formula. This similarity measure should be capable of handling partial linguistic similarity between different words, as explained above. For this reason, a string similarity measure based on the Longest Common Substring problem is used (Gusfield, 1997). The LCS of two strings is defined as the longest string that is a substring of both strings. As an example, the LCS of the strings “ABCD” and “ABD” is “AB”. The Longest Common Substring problem should not be confused with the Longest Common Subsequence problem (Bergroth, Hakonen, & Raita, 2000). For that well-known computing problem, the resulting sequence does not need to consist of characters that are adjacent in the original string. For example, the Longest Common Subsequence of the strings “ABCD” and “ABD” is “ABD” instead of “AB”.
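To make the distinction concrete, the sketch below computes both quantities for the example strings using straightforward dynamic programming; it is only an illustration and not the implementation used in this work.

# Illustration of the difference between the two problems.
def longest_common_substring(a, b):
    best = ""
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > len(best):
                best = a[i:i + k]
    return best

def longest_common_subsequence_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] \
                else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(longest_common_substring("ABCD", "ABD"))           # -> "AB"
print(longest_common_subsequence_length("ABCD", "ABD"))  # -> 3, i.e. "ABD"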

While a similarity based on the Longest Common Subsequence could probably also be a good measure to compare two strings, determining the Longest Common Subsequence of a string of length n and one of length m takes Θ(nm) time using dynamic programming (Ullman, Aho, & Hirschberg, 1976), whereas the Longest Common Substring can be found in Θ(n + m) time using generalized suffix trees. As plenty of words need to be compared, a more computation-friendly word similarity measure was chosen in the form of the Longest Common Substring. Moreover, it can be expected that for matching similar words, substrings matter more than subsequences, as linguistically similar words are typically derived from the same stem or from each other (and thus share a large substring).

Based on the LCS concept, two normalized word similarities are used that were originally defined by Islam and Inkpen (2008). The first one is the Normalized Maximal Consecutive Longest Common Subsequence starting from the first character (NMCLCS1). This metric is defined by dividing the square of the length of the largest common substring that starts at the first position of both strings by the product of the lengths of both strings. The formalized version of this definition is given in Equation (5.2).

NMCLCS1(w1, w2) = length(LCS1(w1, w2))² / (length(w1) · length(w2))    (5.2)

As an example, the NMCLCS1 of “ABD” and “ACBD” is 1/12, as LCS1 = “A”. This measure is thus based on the length of the longest common prefix of the two strings. The second string similarity that is used is similar, but now looks for the LCS starting from any position n in either of the two strings. The formula is given in Equation (5.3).

NMCLCSn(w1, w2) = length(LCSn(w1, w2))² / (length(w1) · length(w2))    (5.3)

For the same two strings mentioned in the example above (“ABD” and “ACBD”), the NMCLCSn equals 4/12 = 1/3, as LCSn = “BD”. The final string similarity is then given as the average of the two string similarities, as given in Equation (5.4):

wordSim(w1, w2) = (1/2) · NMCLCS1(w1, w2) + (1/2) · NMCLCSn(w1, w2)    (5.4)

The computation of these two measures is much cheaper than the computation of the Longest Common Subsequence. Islam and Inkpen propose to compute LCS1 by repeatedly removing the final character from the shorter word until the remaining prefix also occurs at the start of the longer word. At most n iterations are thus needed, with n the length of the shorter string. Pseudocode for this algorithm is given in Listing 5.1.

Listing 5.1: Algorithm in pseudocode for computing LCS1.

def lcs1(shortestWord, longestWord):
    # Drop the last character of the shorter word until the remaining prefix
    # also occurs at the start of the longer word.
    while len(shortestWord) > 0:
        if longestWord.startswith(shortestWord):
            return shortestWord
        shortestWord = shortestWord[:-1]
    return ""

To compute LCSn, Islam and Inkpen propose to compute all n-grams present in the shortest string. An n-gram of a string is a substring of n consecutive characters in that string. Computing all n-grams of the shortest string thus means computing all 1-grams, all 2-grams, and so on, up to the length of this shortest string. After all these n-grams are found, each time the largest n-gram is taken and it is verified whether this n-gram is a substring of the longest string or not. If so, this n-gram is the LCS of the two strings. Otherwise, the n-gram is deleted from the set of n-grams and the process repeats itself. This goes on until the set of n-grams is empty or the LCS is found. Pseudocode for this algorithm is given in Listing 5.2.

Listing 5.2: Algorithm in pseudocode for computing LCSn.

def lcsn(shortestWord, longestWord):
    # Collect all n-grams (substrings of n consecutive characters) of the
    # shorter word, for every possible length n.
    n_grams = set()
    for n in range(1, len(shortestWord) + 1):
        for i in range(len(shortestWord) - n + 1):
            n_grams.add(shortestWord[i:i + n])

    # Repeatedly test the largest remaining n-gram against the longer word.
    while len(n_grams) > 0:
        max_n_gram = max(n_grams, key=len)
        if max_n_gram in longestWord:
            return max_n_gram
        n_grams.remove(max_n_gram)
    return ""
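Building on the two listings above, a minimal sketch of the word similarity of Equations (5.2) to (5.4) and the sentence similarity of Equation (5.1) could look as follows; the lcs1 and lcsn functions are the ones given above and the whitespace tokenization is an assumption.

# Minimal sketch of Equations (5.2)-(5.4) and (5.1), assuming lcs1/lcsn above.
def word_sim(w1, w2):
    short, long_ = (w1, w2) if len(w1) <= len(w2) else (w2, w1)
    denom = len(w1) * len(w2)
    nmclcs1 = len(lcs1(short, long_)) ** 2 / denom
    nmclcsn = len(lcsn(short, long_)) ** 2 / denom
    return 0.5 * nmclcs1 + 0.5 * nmclcsn

def sentence_sim(s1, s2):
    # Average, over the words of the new sentence, of the best word similarity
    # that can be found in the earlier published sentence.
    words1, words2 = s1.split(), s2.split()
    return sum(max(word_sim(w1, w2) for w2 in words2) for w1 in words1) / len(words1)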

In the context of this Master dissertation, two sentences are considered to be duplicated if the LCS-based similarity is at least 0.6. This threshold was chosen empirically: most of the sentence pairs with a similarity above this threshold indeed exhibit the same content. Of course, some sentence pairs with an LCS-based similarity of at least 0.6 are in fact not semantically similar; the percentages in the graphs thus form an upper bound for the real values. However, increasing the threshold is not an option, as many sentences that are indeed semantically similar have an LCS-based similarity lower than that. A threshold of 0.6 was thus chosen as a compromise.

The graphs obtained for the combination of an LCS-based similarity with a threshold of 0.6 are given in Figures 5.3 and 5.4 for the terrorist attacks in Paris and those in Brussels respectively. These graphs illustrate that the percentage of sentences in a newly published article that are linguistically similar to earlier published articles is higher than the percentage of copied sentences determined with the Jaccard similarity. This is an intuitive result. However, the percentages are again quite low: at the end of the period of stress, an average of 5% to 8% is obtained depending on the file set.

Contrary to the analysis with the Jaccard similarity, in general there is a small increase in the percentage as more articles are analyzed. Generally speaking, however, the numbers do not increase significantly near the end of the file set. It can thus be concluded that even for hundreds of articles, most of the information present is linguistically significantly different from earlier published information. From this perspective, the number of articles written is thus again justified.

Figure 5.3: Average fraction of linguistically similar sentences in an article about the terrorist attacks in Paris, where articles are analyzed in chronological order. Similarity based on an LCS-based similarity with threshold 0.6.

Figure 5.4: Average fraction of linguistically similar sentences in an article about the terrorist attacks in Brussels and Zaventem, where articles are analyzed in chronological order. Similarity based on an LCS-based similarity with threshold 0.6.

5.3 Further work

So far, two different analyses were performed. First, sentences were compared based only on the exact words they contain, using the Jaccard similarity. To address the problem that in this approach two words are either completely equal or completely different, with no intermediate degree of string similarity, a second approach based on the Longest Common Substring was proposed. In this way, linguistically similar sentences could also be discovered.

However, the initial goal of this chapter was to detect how much information in a newly written article was already present in the earlier published articles about the subject. As explained in the previous subsection, most of the linguistically similar sentences identified indeed convey the exact same content. However, sentences do not need to be linguistically similar to contain the same content.

As an example, “agenten” and “agent” are two words that are linguistically similar to a large degree and will thus contribute to an increase of the similarity of the sentences in which they appear. In contrast, “agent” and “politie” are linguistically almost completely different: their LCS is only a single character. Intuitively however, it is obvious that “agent” and “politie” are semantically related to a very high degree. To be able to find all semantically similar sentences in an article, a change from a linguistic similarity measure to a combination of a linguistic similarity measure and a semantic similarity measure is needed. A semantic similarity measure is a similarity measure for which the similarity between words is based on their meaning, instead of only their physical representation (i.e. the string). A possible approach to compute such semantic similarities is to use large corpora of documents that are statistically analyzed to obtain a similarity value for all word pairs present in the collections of documents, for example by using a vector space model. Examples in the literature are widespread (Landauer & Dumais, 1997; Kolb, 2009).

Specifically for Dutch, such semantic similarity measures already exist. An example is described by Mandera et al. (2017). This corpus-based semantic distance is based on the assumption that words that are closely related typically co-occur very often in texts (Harris, 1954). However, the existing semantic similarities that were developed for Dutch are corpus-based and thus computationally expensive. In combination with the large number of words that need to be compared in the context of this chapter, such an analysis is not feasible. Further research could thus include the development of a computationally cheaper approach to assess the freshness of articles that also takes the semantic similarities of words into account.
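As an illustration of the kind of semantic similarity measure meant here, the sketch below computes a cosine similarity between two word vectors; the vectors themselves would have to come from a pre-trained corpus-based model, which is assumed to be available and is not part of this work.

import math

# Illustration only: cosine similarity between word vectors taken from a
# (hypothetical) pre-trained corpus-based vector space model.
def cosine_similarity(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# word_vectors = {"agent": [...], "politie": [...]}  # assumed pre-trained model
# cosine_similarity(word_vectors["agent"], word_vectors["politie"])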

5.4 Conclusion

In this chapter, it was investigated how the freshness of an article can be determined automatically. This was done by comparing new articles with all articles already published before. A pair of articles was compared by computing the similarities between the sentences of which the articles are composed. To this end, two different similarity measures were used. First of all, the Jaccard similarity was used; together with a threshold of 0.8, this metric served to find almost exact copies of published sentences. Secondly, an alternative sentence similarity, based on the Longest Common Substring of the constituent words of the sentences, was proposed. This measure was appropriate for finding sentences that are linguistically similar in addition to almost exact copies.

For both research scenarios, it can be concluded that based on the amount of new information, the publication of hundreds of articles for the terrorist attacks in Paris and in Brussels can be justified. While the percentage of copied sentences stagnates as more articles are published, a small increase is noticed for the LCS analysis near the end of the period of stress. However, at most 8% of the published sentences can be considered to be syntactical duplicates. This is a fairly low fraction. Further research could focus on the development of a new technique to include semantic similarities in the analysis of the freshness of articles.

Chapter 6

Conclusion

The goal of this Master dissertation was to quantitatively assess the reliability of online news in Flanders during periods of stress by using both manual and automatic techniques. To this end, a period of stress was first defined as a period of four days after the occurrence of a breaking news event in which at least 25% of the articles published by popular online newspapers is dedicated to that specific event. The investigation of the reliability of online news media was separated into three parts: an investigation of the accuracy, the consistency, and the relevance, more specifically the freshness, of online news articles.

First of all, a data gathering process was set up. Two specific periods of stress were selected for analysis: the period of stress started by the terrorist attacks that took place in Paris on the 13th of November, 2015 and the period of stress started by the crash of a Germanwings plane on the 24th of March, 2015, which was deliberately caused by the copilot. Articles written in the four days after the breaking news events and articles written before the periods of stress were selected such that for each event a data set of period of stress articles and a data set of non-stress period articles was available. The analysis was performed for both Het Laatste Nieuws and Het Nieuwsblad, the two most popular online news brands in Flanders.

In a first stage, the accuracy of online news media during the two periods of stress was extensively investigated. This part of the study was performed manually, by annotating each article present in one of the eight different data sets with all errors that it contains. This analysis illustrated that a fairly high fraction of articles contains at least one error, both during non-stress periods and periods of stress. Depending on the data set, 1/5 up to 2/5 of the articles contain at least one mistake. Moreover, for each period of stress data set, this number was higher than the number for the preceding non-stress period data set. These numbers clearly illustrate the extent of the accuracy problem in online news: they show that a significant fraction of published articles contains errors, both during periods of stress and during non-stress periods.

Once the error annotation and the initial interpretation of the results were done, several statistical tests were performed. These statistical tests mainly investigated the differences between the number of errors in period of stress data sets and the same number in the corresponding non-stress periods. The specific quantities investigated were the probability of writing an error-containing article, the probability of writing an error-containing word and the mean number of errors in an article. The most remarkable result of these statistical tests was that the probability of writing an error-containing word is lower during the non-stress period before the terrorist attacks in Paris than during the following period of stress. The same was found to be true for the probability of writing an article containing a linguistic error.

Not all statistical tests that were performed indicated a statistically significant result. However, it can be noted directly from the error annotations of the different data sets that in general the percentages of error-containing articles and error-containing words and the mean number of errors per article are higher during periods of stress than during the corresponding non-stress periods. Although no general conclusions were drawn because only two specific breaking news events were investigated, this clearly indicates that there is at least some relationship between the presence of a breaking news event and the presence of errors in online news articles published in the days after the occurrence of the breaking news event.

In a second stage, the consistency of online news media in periods of stress was investigated. To this end, a structured representation of an online news article was developed in the form of a graph data model, and these structured articles were stored in a Neo4j database. Each article was assumed to be composed of individual sentences, which were represented by vertices and edges that stand for specific words. The words were associated with a specific vertex type or relationship type depending on their word type. Moreover, numbers that were related to a noun were added as a property of that specific noun. In this way, the structured representation of the articles could be used to find numerical inconsistencies between different articles about the same subject. This was done by developing a few rules that ultimately decided whether two numbers should be compared or not. In the context of this Master dissertation, the developed automatic technique was tested on two breaking news events, namely the terrorist attacks in Paris and the terrorist attacks in Brussels and Zaventem; the latter attacks happened on the 22nd of March, 2016. In the case of the terrorist attacks in Paris, the precision of the returned inconsistencies ranged from 80% to 90%. In the case of the terrorist attacks in Brussels, this precision ranged from 44% to 63%. Although the numbers are clearly lower for the latter case, the general precisions of the algorithm are fairly high. In absolute numbers, 58 possible inconsistencies were found for Het Laatste Nieuws, spread over the two events, and 294 for Het Nieuwsblad. Although this does not mean that each of the returned inconsistencies is really erroneous, it indicates that many different numbers about a real-world entity or event circulate in articles written by the same newspaper, especially in the case of Het Nieuwsblad. Finally, it was also illustrated that typically these possible inconsistencies can be traced back to a very small subset of articles in the data set.

In a final stage, the relevance of online news articles written during a period of stress was also investigated. More specifically, the freshness of these articles was studied. This was done by comparing each article published during a period of stress with each of the previously published articles about that specific breaking news event. Two comparison techniques were used to this end: comparing the sentences with a simple Jaccard similarity and comparing the sentences by computing a variation on the Longest Common Substring similarity of the words they contain. In both cases, the results indicated that the amount of new information in newly published articles is large enough: typically, no more than 10% of the information present in the articles can be assumed to be duplicated from earlier articles. The freshness of articles during periods of stress thus does not seem to be a big problem. However, semantic information and similarity were not taken into account in this analysis because of the large computing times. Further research could thus include this aspect too.

References

Batini, C., & Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques. Berlin Heidelberg: Springer-Verlag.

Bergroth, L., Hakonen, H., & Raita, T. (2000). A survey of longest common subsequence algorithms. In Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000 (pp. 39–48). La Coruna: IEEE.

Bral, L. (2016). VRIND 2016. Retrieved from http://ebl.vlaanderen.be/publications/documents/87486.

Bronselaer, A., & Pasi, G. (2013). An approach to graph-based analysis of textual documents. In 8th European Society for Fuzzy Logic and Technology, Proceedings (pp. 634–641). Milano: Atlantis Press.

Bucy, E. P., Gantz, W., & Wang, Z. (2007). Media Technology and the 24-Hour News Cycle. In Communication technology and social change: Theory and implications. New Jersey: Lawrence Erlbaum Associates.

Chi-square goodness-of-fit test. (2018). Retrieved from https://statistics.laerd.com/premium/spss/gof/goodness-of-fit-in-spss.php.

Chi-square test of homogeneity. (2018). Retrieved from https://statistics.laerd.com/premium/spss/ttp/test-of-two-proportions-in-spss.php.

Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM , 13 (6), pp. 377–387.

Date, C. J. (1997). A guide to the SQL standard: a user’s guide to the standard database language SQL. Boston: Addison-Wesley.

DB-Engines Ranking of Graph DBMS. (2018). Retrieved from https://db-engines.com/en/ranking/graph+dbms.

Delling, D., Sanders, P., Schultes, D., & Wagner, D. (2009). Engineering route planning algorithms. In Algorithmics of large and complex networks (pp. 117–139). Berlin Heidelberg: Springer.

“Europa moet sneller en efficiënter optreden tegen het terrorisme”. (2015). Retrieved from https://www.hln.be/nieuws/-europa-moet-sneller-en-efficienter-optreden-tegen-het-terrorisme~ad08e767/.

“Europa staat stil bij aanslagen Parijs”. (2015). Retrieved from https://www.nieuwsblad.be/cnt/dmf20151114_01970582.

Gusfield, D. (1997). Algorithms on strings, trees and sequences: computer science and computational biology. New York: Cambridge University Press.

Harris, Z. S. (1954). Distributional Structure. WORD, 10 (2-3), pp. 146-162.

Het Laatste Nieuws. (2018). Retrieved from https://www.hln.be.

Fellegi, I. P., & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71 (353), pp. 17–35.

Infographic: The Four V’s of Big Data. (2018). Retrieved from http://www.ibmbigdatahub.com/infographic/four-vs-big-data.

Intro to Cypher. (2018). Retrieved from https://neo4j.com/developer/cypher-query-language/.

Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2 (2), pp. 1–25.

Jaccard, P. (1901). Etude de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Societe Vaudoise des Sciences Naturelles, 37 (142), pp. 547–579.

Karlsson, M., Clerwall, C., & Nord, L. (2017). Do not stand corrected: Transparency and users’ attitudes to inaccurate news and corrections in online journalism. Journalism & Mass Communication Quarterly, 94 (1), pp. 148–167.

Kolb, P. (2009). Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009) (pp. 81–88). Potsdam: Northern European Association for Language Technology (NEALT).

Kraaij, W., & Pohlmann, R. (1994). Porter’s stemming algorithm for Dutch. In Informatiewetenschap 1994: Wetenschappelijke bijdragen aan de derde STINFON Conferentie (pp. 167–180). Leiden: Stichting Informatiewetenschap Nederland.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104 (2), pp. 211-240.

Leech, G., Garside, R., & Atwell, E. S. (1983). The automatic grammatical tagging of the LOB corpus. ICAME Journal: International Computer Archive of Modern and Medieval English Journal, 7 , pp. 13–33.

Lewis, J., & Cushion, S. (2009). The thirst to be first: An analysis of breaking news stories and their impact on the quality of 24-hour news coverage in the UK. Journalism Practice, 3 (3), pp. 304–318.

Maier, S. R. (2002). Getting it right? Not in 59 percent of stories. Newspaper Research Journal, 23 (1), 10–24.

Maier, S. R. (2005). Accuracy matters: A cross-market assessment of newspaper error and credibility. Journalism & Mass Communication Quarterly, 82 (3), pp. 533–551.

Mandera, P., Keuleers, E., & Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92 , pp. 57–78.

Mann-Whitney U test. (2018). Retrieved from https://statistics.laerd.com/premium/spss/mwut/ mann-whitney-test-in-spss.php.

Metzger, M. J., Flanagin, A. J., Eyal, K., Lemus, D. R., & McCann, R. M. (2003). Credibility for the 21st century: Integrating perspectives on source, message, and media credibility in the contemporary media environment. Annals of the International Communication Association, 27 (1), pp. 293–335.

MySQL. (2018). Retrieved from https://www.mysql.com/.

Neo4J. (2018). Retrieved from https://neo4j.com/.

Picone, I. (2016). Digital news report: Belgium. Retrieved from http://www.digitalnewsreport.org/survey/2016/belgium-2016.

Porlezza, C., Maier, S. R., & Russ-Mohl, S. (2012). News accuracy in Switzerland and Italy: a transatlantic comparison with the US press. Journalism Practice, 6 (4), pp. 530–546.

PostgreSQL. (2018). Retrieved from https://www.postgresql.org/.

Saltzis, K. (2012). Breaking news online: How news stories are updated and maintained around-the-clock. Journalism Practice, 6 (5-6), pp. 702–710.

Schmid, H. (2013). Probabilistic Part-Of-Speech tagging using decision trees. In New Methods in Language Processing (pp. 154–164). London: Routledge.

Ullman, J. D., Aho, A. V., & Hirschberg, D. S. (1976). Bounds on the complexity of the longest common subsequence problem. Journal of the ACM (JACM), 23 (1), pp. 1–12.

Woordenlijst. (2015). Retrieved from http://www.woordenlijst.org.

Appendix A

Manual error annotations

In this appendix, all manual error annotations from the supervised error investigation are listed. As explained in Chapter 2, analyses were performed on articles written during the periods of stress induced by the terrorist attacks in Paris, which happened on the 13th of November 2015, and the Germanwings plane crash, which happened on the 24th of March 2015. In addition to the periods of stress, a data set of articles published in the non-stress period right before each stress period was investigated for both events. The online newspapers that were investigated are Het Laatste Nieuws and Het Nieuwsblad. Each data set contains a sample of verifiable, factual articles written during the specific period by the specific online newspaper. For each investigated data set, the following information is given:

1. The number of words in the data set.

2. The total number of errors that were found.

3. The fraction of articles that contain an error, the fraction of articles that contain a linguistic error and the fraction of articles that contain a factual error.

4. The number of errors for each different category.

5. The distribution of the articles over the number of errors they contain.

6. A description of each error that was found during the annotation.

As the articles were all written in Dutch, the documents attached to this appendix containing the annotations are written in Dutch as well.

Fouten terroristische aanslagen in Parijs – Stress periode
Artikels geschreven door Het Laatste Nieuws

Algemeen overzicht
Aantal woorden in data set: 70197
Totaal aantal fouten gevonden: 170

Fractie artikels dat minstens 1 fout bevat: 115/332 of 34.6%
Fractie artikels dat minstens 1 taalfout bevat: 108/332 of 32.5%
Fractie artikels dat minstens 1 feitelijke fout bevat: 14/332 of 4.2%

Voorkomende fouten:
• Overschatting van cijfers (1) komt 13 keer voor of 7.6% van alle gevonden fouten
• Foutieve naamgeving (personen, steden, albums, groeperingen, ...) (2) komt 48 keer voor of 28.2% van alle gevonden fouten
• Spellingsfout, fouten met leestekens, ... (3) komt 56 keer voor of 32.9% van alle gevonden fouten
• Foutieve zinsconstructie, ontbrekende woorden, woorden te veel, ... (4) komt 45 keer voor of 26.5% van alle gevonden fouten
• Foutieve cijfers die niet onder categorie (1) vallen (5) komt 4 keer voor of 2.4% van alle gevonden fouten
• Feitelijke, maar geen cijferlijke fout (6) komt 5 keer voor of 2.9% van alle gevonden fouten

Aantal artikels per aantal fouten: 217 artikels bevatten 0 fouten, 82 artikels bevatten 1 fout, 21 artikels bevatten 2 fouten, 8 artikels bevatten 3 fouten, 3 artikels bevatten 4 fouten en 1 artikel bevat 10 fouten.

Overzicht gevonden fouten

'Staatsvijand nummer 1' Salah Abdeslam werkte voor MIVB (16/11) ● "Daarnaast is ook bekend dat Salah, net als zijn broer Ibrahim, in Syrië is heeft verbleven." (4)

23 arrestaties, 168 nachtelijke invallen in Frankrijk vannacht (16/11) ● " vergeten (3)

Aangepaste albumcover gaat viraal (14/11) ● "Peace Love Death Metal" waarbij komma's vergeten werden. (2) ● spatie te veel bij begin zin (4)

Aangrijpend weerzien met vader en echtgenoot die in Parijs vlak bij aanslag was (17/11) ● "Cosa Nostra" moet "Casa Nostra" zijn (2) ● "AirBnB" moet "Airbnb" zijn (2) + (2) ● "Zijn oudste kind, de 16-jarige Rosey, belden hem ook op" (4)

Aangrijpende foto toont Bataclan vlak voor de hel er losbrak (16/11)

"Aanslagen waren aanval op Europese waarden: ieder van ons had slachtoffer kunnen zijn" (17/11) ● "Harlem Desir" moet "Harlem Désir" zijn (2)

American Airlines schort vluchten op naar Parijs (14/11) ● "na de terroristische aanslagen waarbij in Franse hoofdstad zeker 120 doden vielen" (4) ● "Charles-de-Gaulle" moet "Charles de Gaulle" zijn (2)

Assad: "Frans beleid droeg bij tot uitbreiding terrorisme" (14/11) ● Bij aanslagen in Beiroet vielen 43 doden, geen 44. (5)

Belg Bilal Hadfi blies zichzelf op aan Stade de France (16/11) ● "Abdelsam" moet "Abdeslam" zijn (2)

Belgische jihadisten kondigden al in februari aanslagen in Frankrijk aan (14/11) ● "Lofti Aoumeur" moet "Lotfi Aoumeur" zijn (2)

Belgische link naar aanslagen in Parijs (14/11) ● "Intussen is niet alleen het Franse gerecht, maar ook de Belgische inlichtingendiensten volop bezig met een onderzoek naar de daders van de aanslagen." (4) ● "lande" moet "land" zijn (3) ● komma te veel (3)

Britse komiek laat zich even helemaal gaan over terroristen (16/11) ● "seireuze" moet "serieuze" zijn (3)

Broer van Salah Abdeslam: "Geef je aan" (17/11) ● "Mohamed Abdeslam, broer van de gezochte terroristten Salah en Brahim - die afgelopen vrijdag bij een zelfmoordaanslag in Parijs omkwam". Hoe kan iemand gezocht worden als hij dood is? (4) + (3)

Café van de broers Abdeslam begin deze maand gesloten voor criminele feiten (16/11) ● Café was slechts van 1 van de 3 broers (6) ● "Op 4 september werd hij bij de politie uitgenodigd om zich te verantwoorden bij." (4) ● "processen verbaal" moet "processen-verbaal" zijn (3)

Charles Michel: "Daders liegen als ze zeggen dat ze doden in naam van Allah" (15/11) ● "Saoedie-Arabië" (2)

Comité I gaat Belgische inlichtingendiensten onderzoeken (16/11) ● "Het Vast Comité van Toezicht op de inlichtingen- en veiligheidsdiensten (Comité I) opent op eigen initiatief een onderzoek naar de werking van onze inlichtingendiensten in aanloop naar de terreuraanslagen van Parijs." (4)

Deze mensen lieten het leven bij de aanslagen in Parijs (15/11) ● Patricia San Martin was 61, niet 55. (5) ● "Elsa DelPlace" moet "Elsa Delplace" zijn (2) ● Thomas Ayad was 32, niet 34 (5) ● "Fabrice DuBois" moet "Fabrice Dubois" zijn (2)

Disneyland Parijs blijft vandaag gesloten (14/11) ● "Onze gedachten en gebeden gaan uit naar iedereen die door deze afschuwelijke gebeurtenissen zijn getroffen." (4)

Dreigingsniveau 3: wat betekent dat? (17/11) ● Punt vergeten achter zin (3)

Drie Brusselse broers betrokken bij aanslagen: 1 dood, 1 opgepakt, 1 gevlucht (15/11) ● "Police Nationale" moet "Police nationale" zijn (3)

Eagles of Death Metal in veiligheid, maar erg aangedaan na drama in Bataclan (14/11) ● "Amerikaanse garagerockband band Eagles of Death Metal" (4) ● "We zijn nog steeds bezig om de veiligheid en de verblijfplaats van al onze bandleden en de crew." (4) ● "de eerste reactie op officiële Facebookpagina" moet zijn "de eerste reactie op DE officiële Facebookpagina" (4) ● "het is nog maar de vraag of de bandleden twee dagen na de gruwelijke ervaringen in Parijs opnieuw het podium op wil." moet zijn "het is nog maar de vraag of de bandleden twee dagen na de gruwelijke ervaringen in Parijs opnieuw het podium op WILLEN." (4)

Europa moet sneller en efficiënter optreden tegen het terrorisme (15/11) ● Sprake van 132 doden, terwijl er maar 130 in totaal zijn gevallen. (1)

Eurostar, Thomas Cook, JetAir: tickets omboeken of annuleren mogelijk (14/11) ● Jetair, niet JetAir (2) + (2)

Extra veiligheidsmaatregelen voor België-Spanje (14/11) ● "zworden" moet "worden" zijn (3)

Falen van België is veiligheidsrisico voor heel Europa (17/11) ● Sprake van 132 doden, terwijl er maar 130 in totaal zijn gevallen. (1) ● "beweert geen enkele Europese inlichtingendienst noch de CIA op voorhand iets hadden opgevangen" (4) ● Laatste alinea zou tussen aanhalingstekens moeten staan, want zijn subjectieve stellingen (6)

Familie van terreur-broers Abdeslam: "We zijn verbaasd" (16/11) ● Artikel zegt dat 3 broers rol in aanslagen hebben gespeeld, terwijl Mohamed niet in verband kon worden gebracht. (6)

Fan keek zo uit naar optreden Eagles of Death Metal (14/11) ● Volgens artikel meer dan 90 doden in Bataclan, het waren er slechts 89. (1) 82 Fransen delen massaal dit aangrijpende fragment uit film 'Casablanca' (16/11) ● 'Viva la France' moet 'Vive la France' zijn (3)

Grootste terroristische aanslag in Frankrijk in jaren: zeker 127 doden, bloedbad in concertzaal (14/11) ● Er waren slechts 3 terroristen in de Bataclan, geen 4. (1) ● Er stierven geen 8 terroristen, maar slechts 7. (1) ● Daarvan 6 met een bommengordel, geen 7. (1) ● Er was helemaal geen schietpartij aan het Louvre, zoals het artikel beweert. (6) ● Er is sprake van 100 doden in artikel in de Bataclan, maar het waren er maar 89. (1) ● Twee ontploffingen gebeurden vlak aan stadion, andere aan een McDonald's in de buurt. Artikel spreekt echter van 2 hamburgerkramen en een café. (6) ● Meerdere plaatsen waar spatie na punt of achter komma is vergeten (3) + (3) + (3) ● "De toeschouwers zijn na het laatste fluitsignaal het veld opgelopen: zij mochten het stadion niet verlaten tot de schietpartijen voorbij waren: de poorten gingen dicht." (4)

Had Frankrijk terreur in Parijs kunnen voorkomen? (14/11) ● Bij aanslag in Beiroet vielen 43 doden, geen 44. (5) ● lidwoord "het" vergeten (4)

Honderden mensen wonen stille wake bij aan Gentse stadhuis (15/11) ● "Sharm-al-sheik" moet "Sharm-el-sheikh" zijn (2) ● "bisschip" moet "bisschop" zijn (3)

Huiveringwekkende getuigenissen uit concertzaal: "Ik vrees dat ze ons 1 voor 1 willen doden (14/11) ● "Getuige Marc Coupris vertelt aan een journalist van de Britse krant The Guardian hoe hij bevrijd uit de handen van de gijzelaars bevrijd werd: "Het was een bloedbad." (4) ● "toestande" ipv "toestand" (3) ● lidwoord "een" voor scène vergeten (4)

IJzingwekkende getuigenis van Sébastien, redder van de zwangere vrouw in Bataclan (17/11) ● "De zwangerschap nog maar tien weken ver en de vrouw wilde wachten tot drie maanden om het goede nieuws te brengen." (4)

Jambon: "Niveau 3 blijft tot Salah Abdeslam is gevat" (17/11) ● Staatsveiligheid moet met hoofdletter (3) ● "Abdelslam" moet "Abdeslam" zijn (2)

Journalist doet live verslag en hoort plots schoten in Bataclan (15/11) ● "Jean-Français Bélanger" moet "Jean-François Bélanger" zijn (2)

Journalist ontsnapt op nippertje aan horror in concertzaal (14/11) ● "Als je een strijd voert tegen het westen dan kan ik begrijpen dat je, je pijlen richt op een voetbalwedstrijd" (3) ● "freelancejournalist" moet "freelance journalist" zijn (3)

Klimaattop Parijs gaat door met nog meer verscherpte maatregelen (14/11) ● "COP-21" moet "COP21" zijn (2)

Klopjacht op de man voor wie interland werd afgelast: "We vrezen dat hij strijdend ten onder wil gaan" (17/11) ● "aanwijzigingen" ipv "aanwijzingen" (3)

Koningsdag in teken van Frankrijk (15/11) ● Koningsdag moet met hoofdletter. (3) ● “in teken van” moet “in het teken van” zijn (4)

Leden Eagles of Death Metal veilig (14/11) ● Eagles of Death Metal (EOMD) (2) + (2) + (2)

Leerkrachte getuigt: "Te voorzichtig met Bilal geweest" (17/11) ● "Sara Staccino" moet "Sara Stacino" zijn (2)

Mannen die Salah Abdeslam hielpen: "We haalden gewoon een vriend op" (17/11) ● "Mohammed Amri" (2) ● "ontkennen bij monde van hun advocaten dat ze elke betrokkenheid bij de aanslagen." (4)

"Meer middelen voor inlichtingendiensten" (15/11) ● "Zo kwamen er grenscontroleS" (3)

Model poseert uitdagend als terrorist tijdens slachting in Parijs (15/11) ● Hoofdletter aan begin van zin vergeten. (3) ● quotes vergeten (3) ● "om daarna nog eentje beeldje te posten" (4)

Mogelijk meesterbrein van de aanslagen dreigde in februari al met terreur (16/11) ● "bedacht" moet "verdacht" zijn (3)

"Niemand van ons wil die smeerlappen helpen: trap niet in de val van het zwakke IS" (16/11) ● "TWitter" moet "Twitter" zijn (3) ● " te veel (3)

Niet mijn islam: Wie een onschuldig mens doodt, doodt de hele mensheid" (16/11) ● "Naima Ajouaou" moet "Naima Ajouaau" zijn (2)

Nu al vijf arrestaties in Aken - Salah Abdeslam niet bij gearresteerden (17/11) ● "Werner Schneider", niet "Scheider" (2)

Obama: "Dit is een aanval op de mensheid" (14/11) ● "We zullen alles doen wat nodig is om samen te werken met het Fransen" (4)

Oorlogsfotograaf Teun Voeten: "Het is een heel bekrompen gemeenschap in Molenbeek" (17/11) ● "De Afspraak" moet met hoofdletter. (2)

Opnieuw huiszoekingen in Molenbeek (15/11) ● Er waren slechts 3 terroristen in de Bataclan, geen 4. (1)

Pianist speelt 'Imagine' aan Bataclan-theater (14/11) ● Er waren slechts 3 terroristen in de Bataclan, geen 4. (1)

Premier Cameron wil akkoord parlement om IS te bombarderen in Syrië (17/11) ● Moet "werken" zijn ipv "merken" (3)

Premier Michel: "We moeten haatpredikers stoppen en problematische moskeeën sluiten" (16/11) ● "We moeten tegen haatpredikers bestrijden" (4) + (4)

"Radicalisering in Molenbeek werd afgedaan als kritiek op multiculturaliteit" (15/11) ● "ghetto" moet "getto" zijn (3) ● Het is "Tarik Ibn Ali", niet "Tarik Inb Ali" (2)

Ronselaar Parijse jihadist is Belgisch-Marokkaanse haatprediker Tarik Ibn Ali (17/11) ● Het is "Tarik Ibn Ali", niet "Tarik Inb Ali" (2) + (2) + (2)

Samenscholingsverbod houdt Parijs niet tegen om samen te rouwen (15/11) ● "We willen duidelijk maken dat we deze kaarsen niet onszelf hebben aangestoken" (4)

Schots-Franse versie 'Imagine' met gitaar gaat viraal (16/11) ● "spreciaal" moet "speciaal" zijn (3)

Slovaakse premier wil "alle moslims surveilleren" (16/11) ● "Het land land heeft in verhouding met het aantal inwoners minder vluchtelingen" (4)

Sylvestre stond te bellen en ontsnapte zo aan de dood (14/11) ● Artikel spreekt van minstens 6 doden aan Stade de France, terwijl het er maar 4 in totaal waren. (1)

Syrische rebellengroepen veroordelen aanslagen (15/11) ● "Rakka" moet "Raqqa" zijn (2)

Trump: "Aanslagen Parijs waren anders verlopen met bewapende burgers" (15/11) ● "Gerard Araud" moet "Gérard Araud" zijn (2)

"Twee daders en voortvluchtige waren bekend bij antiterreurdienst OCAD" (16/11) ● Het is "Salah Abdeslam", niet "Saleh" (2)

Tweede auto terroristen gevonden: één terreurcel op de vlucht (15/11) ● "" moet met hoofdletter. (2) ● "De speurders troffen in het voertuig verschillende kalasjnikov-aanvalsgeweren" (4)

"Twintig militairen houden oogje in het zeil tijdens België-Spanje" (16/11) ● "jaar" moet "naar" zijn (3)

U2 legt bloemen aan concertzaal Bataclan (15/11) ● Geen 4 terroristen, maar slechts 3 in de Bataclan. (1) ● "De slachtoffers waren een muziekfans." (4)

Uithaal naar Jambon: "Plots was er geen geld meer eerstelijnszorg in Molenbeek" (16/11) ● "Dat antwoordt de voorzitter van de vzw en gewezen directeur van het Centrum voor Gelijke Kansen en Racismebestrijding Johan Leman op zijn Facebook-pagina minister van Binnenlandse Zaken Jan Jambon die zich afvroeg" (4)

Verdacht pakket bij IKEA in Anderlecht blijkt ongevaarlijk (16/11) ● "IKAE" moet "IKEA" zijn (2)

Vlaamse steden herdenken slachtoffers met minuut stilte (16/11) ● "Hendrik Conscienseplein" moet "Hendrik Conscienceplein" zijn (2)

Voorlopig geen extra bewaking op Brussels Airport (14/11)

● "rezigers" moet "reizigers" zijn (3)

Waarom reisde Salah Abdeslam in september via Duitsland naar Oostenrijk? (17/11) ● "De drie zeiden dat ze een "een week vakantie in Oostenrijk"" (4)

Wereldpers noemt België een "terreurnest": bloedbad voor Parijs, blamage voor België (15/11)

● "ghettovorming" moet "gettovorming" zijn (3)

Wie zit er achter de aanslagen in Parijs? (14/11)

● "binnekomen" moet "binnenkomen" zijn (3)

WTC krijgt Franse driekleur (14/11)

● "corovado" moet "corcovado" zijn (2)

Zo verliep de zwaarste terroristische aanslag ooit in Frankrijk (14/11) ● Slechts 7 terroristen stierven, geen 8. (1) ● "Place de la Republique" moet "Place de la République" zijn (2) ● Er komen slechts 3 terroristen om in de Bataclan (1)

Bang zijn, is - evolutionair gezien - nuttig (16/11) ● "kunnen er inhakken" moet "kunnen er op inhakken" zijn (4)

Buitengewone EU-top voor Europees antwoord op terreur: "Frankrijk verwacht daden" (15/11) ● "niveau's" moet "niveaus" zijn (3) + (3)

Charles Michel: "Daders liegen als ze zeggen dat ze doden in naam van Allah" (15/11) ● "de toegang tot België te verwijderen." (4)

Complottheorieën doen de ronde nadat tweet aanslagen op Parijs voorspelde (17/11) ● "tweede tweede" moet "tweede tweet" zijn (4)

Concerten in Brusselse zalen gaan gewoon door (17/11) ● "Ocad" moet "OCAD" zijn (2)

De Kesel: "Niet toegeven aan de angst" (15/11) ● " op einde quote vergeten (3)

Disneyland nog tot dinsdag gesloten, Parijse scholen starten morgen wel weer (15/11) ● "Ministère de l'éducation National" moet "Ministère de l'Education nationale" zijn (2)

Dreigingsniveau in België wordt opgetrokken van 2 naar 3 voor grote evenementen (15/11) ● komma vergeten in eerste zin (3) ● " vergeten op einde laatste zin (3)

Drie Belgen bij dodelijke slachtoffers (14/11) ● "10 u" moet "10u" zijn (3)

Eagles of Death Metal nu echt in Britse hitlijst (16/11) ● "Uit de midweek Official Chart Update Top 100 is de band te vinden op de 96e plek." (4)

Eiffeltoren kleurt drie dagen blauw-wit-rood (16/11) ● "De Eiffeltoren was afgelopen dagen niet toegankelijk" (4)

Federale politie verspreidt nieuw opsporingsbericht voor Salah Abdeslam (17/11) ● "tereuraanslagen" moet "terreuraanslagen" zijn (3)

Frankrijk verscherpt zijn grenscontroles: wat houdt dat in? (14/11) ● "sinterklaas" moet "Sinterklaas" zijn (2)

Frankrijk zou beter Molenbeek bombarderen (17/11) ● "Eric Zemour" moet "Eric Zemmour" zijn (2)

G20-landen willen samenwerking tussen inlichtingendiensten verder uitbouwen (16/11) ● "De top van de G20-landen vindT nog tot maandag plaats in de Turkse badplaats Antalya." (3)

Geens: "Ik smeek al een jaar om meer samenwerking tussen de Europese veiligheidsdiensten" (16/11) ● "'draaischijf van internationale terrorisme", waarbij laatste ' vergeten is (4) + (3)

Gitzwarte voorpagina's Franse kranten: "Deze keer is het oorlog" (14/11) ● "terrroristen" moet "terroristen" zijn (3)

Groen vraagt onderzoek Comité I en P na aanslagen (16/11) ● "inlichingen" moet "inlichtingen" zijn (3)

IS eist aanslag op en doet oproep tot meer aanslagen in Frankrijk (14/11) ● "Islamatische Staat" moet "Islamitische Staat" zijn (2)

Man zaait paniek met alarmpistool: "Ik ga Arabieren doden in Molenbeek" (16/11) ● "woordvoerdster" moet "woordvoerster" zijn (3)

"Namen verdachten deel aan aanslagen?" (15/11) ● "Dat heeft Françoise Schepmans, burgemeester van Molenbeek GEZEGD." (4)

Nederlander overleeft bloedbad: "Ben tussen doden naar nooduitgang gekropen" (14/11) ● "Door hadden" moet "doorhadden" zijn (3)

Paniek bij herdenking op plein in Parijs (16/11) ● "Uiteindelijk bleken ontploffende voetzoekers de paniek te veroorzaakt te hebben." (4)

Pegida Vlaanderen: "Sluit de grenzen en stop terrorisme" (16/11) ● Het is "Islamitische Staat", niet "Islamistische Staat" (2)

Politie maakt jacht op daders (14/11) ● "inlichtingendienstenzijn" moet "inlichtingendiensten zijn" zijn (3)

Stephen Colbert haalt uit naar IS: "Het is gewoon een bende pussies" (17/11) ● Spatie te veel tussen 2 woorden (3)

Terreurjaar in Frankrijk (14/11) ● "Afwachtig" moet "Afwachting" zijn (3)

Toerismebureau: "Blijf niet weg uit Parijs" (17/11) ● "Maar de reden dat Parijs de nummer 1 vakantiebestemming van de wereld is, komt omdat mensen het leven in Parijs liefhebben" (4)

Tranen van agent zeggen meer dan duizend woorden (16/11) ● "Hardverscheurdende" moet "hartverscheurende" zijn (3)

Vlaams Parlement sluit hoofdingang tijdelijk af na aanslagen (17/11) ● "De hoofdingang van HET Vlaams Parlement wordt tijdelijk gesloten." (4)

"We moeten ook hier op onze hoede blijven voor slapende cellen met ex-Syriëstrijders" (14/11) ● ' op einde van zin, waar die niet hoort (3)

"We waren in Parijs met 7. Wat als we morgen met 70 zijn?" (17/11) ● "imans" moet "imams" zijn (3)

Werden de aanslagen via Twitter gecoördineerd? (15/11) ● Twitteraccount Op_is90 wordt twee keer anders geschreven (2)

"Zij hebben geweren, wij hebben kaarsjes en bloemen", zegt jongetje voor tv-camera in Parijs (17/11) ● "bekending" moet "bedenking" zijn (3)

G20 keurt plan tegen fiscale ontwijking door multinationals definitief goed (16/11)

• “de communiqué” moet “het communiqué” zijn (4)

Inwoners Raqqa mogen huis niet uit na zware aanvallen Franse luchtmacht (16/11) • “Syrisch observatorium voor de mensenrechten” moet met hoofdletter (2) + (2)

CDenV wil levensduurverlenging Doel 1 en 2 herbekijken (17/11) • “De Levensverlenging De wet op…” (3) • “Problemen De…” (3)

Nog eens 159 ontslagen bij Philips in Turnhout (17/11) • “ en . omgewisseld in laatste zin. (3)

Nog vijf mensen vermist bij ontspoorde TGV (15/11) • “Eckwerscheim” moet “Eckwersheim” zijn (2)

Hoog bezoek voor Miss België-finalistes: “Laat Egypte niet links liggen” (17/11) • Woordje “in” vergeten (4) • “Sharm El Sheik” moet “Sharm-El-Sheikh” zijn (2)

Errors for the terrorist attacks in Paris – Non-stress period
Articles written by Het Laatste Nieuws

General overview
Number of words in data set: 18253
Total number of errors found: 30

Fraction of articles containing at least 1 error: 24/90 or 26.7%
Fraction of articles containing at least 1 language error: 19/90 or 21.1%
Fraction of articles containing at least 1 factual error: 5/90 or 5.6%

Errors encountered:
• Overestimation of numbers (1): occurs 1 time or 3.3%
• Incorrect names (persons, cities, albums, groups, ...) (2): occurs 8 times or 26.7%
• Spelling errors, punctuation errors, ... (3): occurs 8 times or 26.7%
• Incorrect sentence construction, missing words, superfluous words, ... (4): occurs 9 times or 30.0%
• Incorrect numbers not falling under category (1) (5): occurs 1 time or 3.3%
• Factual but non-numerical errors (6): occurs 3 times or 10.0%

Number of articles per number of errors: 66 articles contain 0 errors, 19 articles contain 1 error, 4 articles contain 2 errors and 1 article contains 3 errors.
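
Each "General overview" block in this appendix is obtained by simple counting over the manual annotations listed below it. The minimal Python sketch below illustrates that calculation only; the annotations list, its toy contents and the grouping of categories 2, 3 and 4 as language errors and 1, 5 and 6 as factual errors are assumptions made for the example and are not part of the annotated data.

from collections import Counter

# Hypothetical toy data: one list of error-category codes (1-6) per annotated article.
annotations = [
    [3, 3, 2],  # an article with two language errors (cat. 3) and one naming error (cat. 2)
    [],         # an article without errors
    [1],        # an article with one numerical overestimation (cat. 1)
]

LANGUAGE_CATS = {2, 3, 4}  # assumed mapping of categories to "language errors"
FACTUAL_CATS = {1, 5, 6}   # assumed mapping of categories to "factual errors"

n_articles = len(annotations)
all_errors = [cat for article in annotations for cat in article]
per_category = Counter(all_errors)

print(f"Total number of errors found: {len(all_errors)}")
print(f"Articles with at least 1 error: {sum(1 for a in annotations if a)}/{n_articles}")
print(f"Articles with at least 1 language error: "
      f"{sum(1 for a in annotations if set(a) & LANGUAGE_CATS)}/{n_articles}")
print(f"Articles with at least 1 factual error: "
      f"{sum(1 for a in annotations if set(a) & FACTUAL_CATS)}/{n_articles}")
for cat, count in sorted(per_category.items()):
    print(f"Category ({cat}): occurs {count} times or {100 * count / len(all_errors):.1f}%")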

Overview of errors found

Boko Haram doodt twintig mensen bij nieuwe aanslag (22/10) ● "de jihadistisch terreurbeweging Boko Haram" moet "de jihadistische terreurbeweging Boko Haram" zijn (3) ● Achter allerlaatste zin is punt vergeten (4)

“Aanslag Ankara werk van Islamitische Staat” (11/10) ● "wandaag" moet "wandaad" zijn (3)

Turkije tast in duister over aanslagen: 122 doden volgens pro-Koerdische partij (11/10) ● Originele dodentol stond op 95, niet op 97 (1)

Vier agenten komen om bij aanslag in Turkije (03/09) ● "Het conflict die" moet "Het conflict dat" zijn (4)

Turkse troepen trekken Irak binnen na aanval PKK (08/09) ● "bonnen" moet "bronnen" zijn (3)

Vredesmars wordt bloedbad: twee zelfmoordterroristen doden burgers in Ankara (10/10) ● "De Turkse premier acht twee zelfmoordenaars verantwoordelijk zijn voor de aanslagen" (4)

● "Er was eerst ook sprake van circa 200 gewonden, van 28 op de afdeling intensieve zorgen van een ziekenhuis lagen" (4) ● Dat de aanslag uitgerekend een betoging voor vrede viseert, toont volgens het Witte Huis de “verdorvenheid” van de daders aan. De Duitse kanselier Angela Merkel spreekt van “bijzonder laffe daden”, die “gericht zijn op burgerrechten, vrede en democratie”. => wordt 2 maal herhaald (4)

Tientallen doden door aanslagen in Irak (06/10) ● "het Khalis" moet "Khalis" zijn (4)

24 doden bij driedubbele zelfmoordaanslag in Bagdad (03/10) ● "Driedubbel" moet "drievoudig" zijn (6)

Doden bij schietpartij voor hoofdkantoor politie Sydney (02/10) ● Als het in Sydney 16u30 is, is het bij ons 7u30, geen 8u30. (5)

Zeker 11 doden bij zelfmoordaanslag tegen presidentieel paleis Somalië (03/09) ● Het is "Nicholas Kay", niet "Nikolas Kay" (2) ● "Boko Haram pleegt niet aanslagen in de regio, maar voert ook regelmatig terreurcampagnes jegens de Kameroense ordediensten." (4)

Zelfmoordaanslag op moskee in Nigeria: “Er zijn geen overlevenden” (16/10) ● maar een lokale militie die jihadistisch organisaties bestrijdt (3)

Gemolesteerde personeelschef Air France krijgt promotie (14/10)

• “Ségolène Royal” moet “Segolène Royal” zijn (2)

Protest in Mexico een jaar na ontvoering van 43 studenten (26/09)

• “vele” moet “velen” zijn (3)

Meer gepensioneerden, minder nieuwe uitkeringen (12/11)

• “administratrice-generaal” moet “administrateur-generaal” zijn (3)

Italiaan Grandi wordt nieuwe Hoge Commissaris voor de Vluchtelingen (12/11)

• “tweede wereldoorlog” moet “Tweede Wereldoorlog” zijn (3)

Woedende menigte bestormt Afghaans presidentieel paleis (11/11)

• Spatie vergeten aan begin zin (3)

Toerisme in verval na crash: Egypte verliest maandelijks 281 miljoen dollar (11/11)

• “Sissi” moet “Sisi” zijn (2) • “Sharm-El-Sheik” moet “Sharm-El-Sheikh” zijn (2)

Lufthansa schrapt 930 vluchten, staking mag doorgaan (11/11) • Meerdere zinnen staan meerdere keren in artikel (4) • “deze nacht vannacht” (4)

Valse agente (24) troggelt oma van vermist kind geld af (10/11)

• Het gaat niet om de oma, maar om de moeder van het kind (6)

Voormalig bondskanselier Helmut Schmidt overleden (10/11)

• “Valery” moet “Valéry” zijn (2)

Van hemel naar hel: Moeder bevalt en paar uur later wordt oudste zoon doodgestoken op straat (10/11)

• “Evenig Standard” moet “Evening Standard” zijn (2)

Nederlander krijgt 20 jaar cel in Thailand (10/11) • “John” moet “Johan” zijn (2)

Voor het eerst Iraanse vrouw aan hoofd van ambassade (09/11)

• “Mahmoed” moet “Mahmoud” zijn (2)

Franse bus met 39 kinderen aan boord volledig uitgebrand (09/11)

• Kinderen zaten helemaal niet meer aan boord op moment van brand. (6)

Errors for the terrorist attacks in Paris – Stress period
Articles written by Het Nieuwsblad

General overview
Number of words in data set: 70534
Total number of errors found: 197

Fraction of articles containing at least 1 error: 124/310 or 40.0%
Fraction of articles containing at least 1 language error: 111/310 or 35.8%
Fraction of articles containing at least 1 factual error: 25/310 or 8.1%

Errors encountered:
• Overestimation of numbers (1): occurs 14 times or 7.1%
• Incorrect names (persons, cities, albums, groups, ...) (2): occurs 36 times or 18.3%
• Spelling errors, punctuation errors, ... (3): occurs 104 times or 52.8%
• Incorrect sentence construction, missing words, superfluous words, ... (4): occurs 25 times or 12.7%
• Incorrect numbers not falling under category (1) (5): occurs 13 times or 6.6%
• Factual but non-numerical errors (6): occurs 5 times or 2.5%

Number of articles per number of errors:
186 articles contain 0 errors, 82 articles contain 1 error, 21 articles contain 2 errors, 15 articles contain 3 errors, 3 articles contain 4 errors, 2 articles contain 5 errors and 1 article contains 6 errors.

Overview of errors found

Groepsleden Eagles of Death Metal ongedeerd, “crewlid gedood” (14/11)

● “… ipv “…” (3) ● Sprake van honderdtal doden, zonder bronvermelding (1) ● "Julio Doro" moet "Julian Doro" zijn (2) + (2)

Foo Fighters annuleert tournee na gruwel in Parijs (14/11)

● Sprake van “meer dan 150 man die werden neergeschoten” (1) ● "De ergste aanslag van vrijdagavond plaats tijdens het optreden van de rockband Eagles of Death Metal in concertzaal Le Bataclan." (4)

Dit zijn de gevolgen voor België na de aanslagen in Parijs (14/11)

● spatie vergeten bij begin nieuwe zin (3) ● “Eagles of Black Metal” ipv “Eagles of Death Metal” (2) + (2) ● “Security” ipv “security” (3)

“Terroristen in Parijs hadden mogelijk logistieke steun vanuit België” (14/11)

● “De Belgische autoriteiten kenden de huurder zijn via zijn broer” (4)

U2 herdenkt slachtoffers van aanslagen in Parijs (14/11)

● “het” ipv “met” (3)

Deze drie vragen moet Franse politie snel beantwoorden (14/11)

● “80 mensen neergeschoten in Bataclan” ipv 89. (5)

OVERZICHT. Dit weten we al van de horroravond in Parijs (14/11)

● “8 terroristen gedood” ipv 7. (1) ● “3 mensen overleden aan Stade de France” ipv slechts 1 dode. (1) ● “Schietpartij aan Le Catillon was rond 22 uur”, terwijl deze rond 21u25 plaatsvond. (5) ● “Meer dan 100 doden in Bataclan”, terwijl het er maar 89 waren. (1) ● “4 terroristen in de Bataclan”, terwijl er maar 3 waren. (1)

Ooggetuigen: Het enige wat ons beschermde, was een raam (14/11)

● “getuigenaangeslagen” ipv “getuigen aangeslagen” (3) ● " vergeten op einde quote (3) ● Het is "Petit Cambodge", niet "Petit Cambodje" (2)

REACTIES. Obama: "Dit is een aanslag tegen de mensheid" (14/11)

● “oor” ipv “voor” (3) ● "De Amerikaanse president ook een boodschap klaar voor de daders." (4) ● “een n een” ipv “een” (4)

“Stilaan komen de Parijzenaars weer op straat” (14/11)

● “grootst aantal mensen moest vluchten voor zijn leven” ipv “grootst aantal mensen moest vluchten voor hun leven” (4)

Verscherpte controles op E17 ter hoogte van grensovergang te Rekkem (14/11)

● Komma vergeten in zin. (3) ● “Belgie” ipv “België” (2) ● spatie te veel aan begin van de zin (4x)(3) + (3) + (3) + (3)

Bommenmaker van ISIS was in ons land (14/11)

● “waaruit onder moet blijken” ipv “waaruit onder meer moet blijken” (4)

Nederlandse cabaretier doet opvallende oproep (14/11)

● “Zijn oproep werden honderdduizenden keren gedeeld.” Ipv “Zijn oproep werd honderdduizenden keren gedeeld.” (4)

GETUIGENISSEN. “Tussen de doden naar de uitgang gekropen” (14/11)

● “The Eagles of Death Metal” ipv “Eagles of Death Metal” (2) ● “de politiefilm die aan het kijken was op tv” ipv “de politiefilm die ik aan het kijken was op tv” (4) ● “Nu gaat het nog met me"” ipv “Nu gaat het nog met me" (3) ● “Zo’n 100 slachtoffers” terwijl er maar 89 waren (1)

“Belgen achter aankondiging aanslagen in Frankrijk in februari” (14/11)

● “Lofti Aoumeur” ipv “Lotfi Aoumeur” (2)

“Ik wil geen teruggekeerde Syriëstrijders meer begeleiden” (14/11)

● “op” ipv “of” (2x) (3) + (3) ● Spatie te veel voor komma. (3)

Gent opent maandag rouwregister voor aanslagen Parijs (14/11)

● Punt achter zin vergeten. (3) ● "dij" moet "die" zijn (3)

Extra veiligheidsmaatregelen voor EK-barrageduel Zweden-Denemarken (14/11)

● “Anders Sigurdson” ipv “Anders Sigurdsson” (2)

Rouwbanden en minuut stilte bij voetbalinterlands (14/11)

● “2 explosies”, terwijl het er 3 waren. (5)

“Het zou me allerminst verbazen als er ook Belgische daders zijn” (14/11)

● “Lofti Aoumeur” ipv “Lotfi Aoumeur” (2) ● " vergeten op einde uitdrukking (3) ● spatie te veel na einde zin (3)

UEFA is “diep gechoqueerd en bedroefd” (14/11)

● “2 explosies”, terwijl het er 3 waren. (5)

Zondag stille wake aan Stadhuis (14/11)

● “Sharm-al-sheik” ipv “Sharm-el-Sheikh” (2)

Stille wake voor slachtoffers aanslagen Parijs aan stadhuis (14/11)

● “slachtofffers” ipv “slachtoffers” (3) ● "overtstijgen" moet "overstijgen" zijn (3)

Koning Filip en koningin Mathilde diep onder de indruk door aanslagen in Parijs (14/11)

● “presidentFrançois” ipv “president François” (3)

Broer en vader van Bataclan-terrorist opgepakt (15/11)

● Punt vergeten tussen 2 zinnen. (3)

Zeker twee dode terroristen kwamen uit Brussel (15/11)

● Spatie na einde zin vergeten. (2x) (3) + (3) ● In artikel staat dat 1 uit Brussel kwam en 1 uit Sint-Jans-Molenbeek (2 uit Brussel?) (6)

Parket bevestigt: Zeven arrestaties in België na aanslagen Parijs (15/11)

● komma vergeten voor "noch" (3) ● In Bataclan waren slechts 3 terroristen, geen 4 (1) ● Consequent spreken over “Fransman”, terwijl nationaliteit nog door niemand kon bevestigd worden. (6)

Niet alles is de schuld van Molenbeek (15/11)

● Spatie voor de komma die er niet moet staan (3) ● “Ghetto” ipv “Getto” (3)

OVERZICHT. Dit gebeurde er de afgelopen twee dagen (15/11)

● “132 slachtoffers”, terwijl er maar 130 gevallen zijn. (1) ● Explosies aan Stade de France waren om 21.16, 21.19 en 21.53, niet 21.20, 21.30 en 21.53 (5) ● "Bij de aanslagen in Parijs zijn 129 doden en 352 gewonden" (4)

Van minuut tot minuut: reconstructie van een waanzinnige avond in Parijs (15/11)

● Uren explosies Stade de France kloppen weer niet (5) ● Sprake van 19 doden bij rue Alibert, terwijl het er maar 15 waren. Verderop zeggen ze 13. (1)

Stille wake tegen terreur (15/11)

● “Bregen” in plaats van “brengen” (3) ● "vluchtelinge" in plaats van "vluchtelingen" (3)

Fransen rouwen op Place de la Republique (15/11)

● “République” ipv “republique” (3x) (3) + (3) + (3)

En toen brak de hel los: dit is het moment waarop terroristen concertzaal binnenvielen (15/11)

● Patrick Zachmann is geen fotograaf van Paris Match, maar verkocht zijn filmpjes eraan. (6)

Overlever Bataclan beschrijft in beklijvende Facebookpost hoe ze zich kon redden (15/11)

● “troosten” ipv “troostten” (3)

Vredeswake aan kiosk maandagavond (15/11)

● “maandag avond” ipv “maandagavond” (3)

“EK in Frankrijk annuleren is geen optie” (15/11)

● “9 gaststeden” ipv “10 gaststeden” (5) ● "da" moet "dat" zijn (3)

Honderden mensen knuffelen elkaar tijdens stille wake aan Gents stadhuis (15/11)

● “Sharm-al-sheik” ipv “Sharm-el-Sheikh” (2)

Lijn-bussen stoppen maandag één minuut, ook scholen stil (15/11)

● “alle medewerkers van de vervoermaatschappij maandag om 12 uur een minuut het werk stil” (4) ● "Charles Michel heeft heeft..." (4)

Vlaggen halfstok aan Tiense gebouwen (16/11)

● “Oproe” ipv oproep (3) ● “12uur” ipv 12 uur (3) ● “CD&BV” ipv CD&V (2)

Waarom terroristen Playstation 4 als communicatiemiddel gebruiken (16/11)

● “Dat heeft Minister van Veiligheid en Binnenlandse Zaken Jan Jambon (NV-A)” (2)

Ook Anonymous verklaart oorlog aan ISIS (16/11)

● “In de video is opnieuw een man te zien met typische Anonymous-masker.” (3)

Twee Belgische terroristen bliezen enkel zichzelf op (16/11)

(27) => hij was 28 (5) ● Bilal Hafdi (20) => Hadfi (2x) (2) + (2)

Ook Belgische eersteklassers houden minuut stilte (16/11)

● “de slachtoffers vand e terreurnacht” ipv “de slachtoffers van de terreurnacht” (3)

Een minuut stilte voor Parijs (16/11)

● “In Frankrijk zijn om 12 uur een minuut in stilte gehouden” (4)

Broer van twee vermoedelijke daders aanslagen vrijgelaten (16/11)

● Punt vergeten achter zin. (3)

“Er zijn te weinig mensen om de 100 teruggekeerde Syriëstrijders goed te volgen” (16/11)

● komma te veel achter woord (3) ● Punt vergeten achter zin. (3) ● “pro-actief” moet proactief zijn. (3)

Amerikaanse staten willen geen Syrische vluchtelingen meer opvangen (16/11)

● “opnenmen” ipv “opnemen” (3)

Vlaams Belang eist ontslag Jambon en Geens (16/11)

● Punt achter zin vergeten. (3)

Scholen in Bornem houden indrukwekkende minuut stilte (16/11)

● “Ook in de verschillende scholen van de Scholengroep Rivierenland leven mee” (4)

Broer van zelfmoordterrorist spreekt pers toe (16/11)

● “zefl” ipv “zelf” (3)

Ook minuut stilte in Halse scholen (16/11)

● 2 punten op einde van zin ipv 1. (3) ● "Ook in een aantal Halse scholen werd een op het middaguur een minuut stilte gehouden." (4)

“Belgische veilgheidsdiensten werken ondermaats” (16/11)

● 4 punten op einde zin ipv 3. (3) ● "veilgheidsdiensten" moet "veiligheidsdiensten" zijn (3)

“Zullen twintig militairen inzetten voor België-Spanje” (16/11)

● “jaar” ipv “naar”. (3)

UEFA resoluut: “Euro 2016 blijft in Frankrijk” (16/11)

● 3 zelfmoordterroristen aan stadion ipv 2. (5) ● “Francois Hollande” ipv “François Hollande” (3) ● “success” ipv “succes” (3) ● “steward” ipv “stewards” (3) ● spatie vergeten bij start nieuwe zin (3)

Ook Aalst toont solidariteit (16/11)

● “blauw wit en rood” ipv “blauw, wit en rood” (3)

Vredeswake op kiosk bij “Hoegaarden schenkt Warmte” (16/11)

● “Hoegaarden schenkt Warme” ipv “Hoegaarden schenkt Warmte” (3) ● “DaarVoor” ipv “Daarvoor”(3) ● “Vandaag” ipv “Vandaar” (3)

Inwoners Molenbeek houden woensdag solidariteitsactie (16/11)

● “..’ ipv “…” (3)

Welke middelen worden ingezet om vermeende terrorist te klissen? (17/11)

● “een speurders” (3)

Federale politie verspreidt nieuw opsporingsbericht voor Salah Abdeslam (17/11)

● “tereuraanslagen” (3)

Verhoogde waakzaamheid bij Belgische kerncentrales (17/11)

● woordvoerster Els De Clerck van de centrale in Doel (2)

Frankrijk mobiliseert 115.000 agenten en soldaten (17/11)

● Artikel spreekt van 170 huiszoekingen, terwijl er maar 168 waren. (1)

Gijzelaar: “Terroristen vertelden me waarom ze aanslag pleegden” (17/11)

● Inval gebeurde om 21u40, niet 21u35. (5)

VIDEO. Zo viel de politie de Bataclan binnen (17/11)

● Inval was om 21u40, niet om 21u30 (5) ● “Eagles of Death Meatal” (2) ● “te voorschijn” moet "tevoorschijn" zijn (3)

Verdachte Franse wagen in stadscentrum (17/11)

● “rezigers” ipv reizigers (3) ● spatie aan begin van zin vergeten (3)

Zeven gearresteerden in Aken niet gelinkt aan terreur Parijs (17/11)

● “onmiddelijk” ipv “onmiddellijk” (3) ● “De politie pakte dinsdagnamiddag nog vier bijkomende mensen opgepakt” (4) ● "dinsdagzeven" moet "dinsdag zeven" zijn (3)

“Schoten tijdens anti-terrorismeactie in Frankrijk, station Lyon geëvacueerd” (17/11)

● “Salah Abdaslam” (2) ● Artikel spreekt van 170 huiszoekingen, terwijl er maar 168 waren. (1)

“Syrisch paspoort van vluchteling is mogelijk manoeuvre van ISIS” (17/11)

● “of het gaat om een vluchteling gestuurd door ISIS, of het een manoeuvre is van IS IS” (2) ● spatie teveel op einde zin (3)

Teken van hoop en eendracht: Franse en Engelse fans zingen samen 'La Marseillaise' (17/11)

● “Roy Hodgons” ipv Hodgon (2)

Vriend: "Ik heb Abdeslam Salah afgezet aan Koning Boudewijnstadion" (17/11)

● “Comptoire Voltaire” ipv “Comptoir Voltaire” (2) ● " vergeten op einde quote (3)

Hele wereld wijst naar Molenbeek (17/11)

● “Politicoging” ipv “Politico ging” (3)

Jambon: "Niveau 3 blijft tot Salah Abdeslam is gevat" (17/11)

● "Abdelslam" moet "Abdeslam" zijn (2) ● “Samu Social” ipv “Samusocial” (2) ● "de Abdeslam" moet gewoon "Abdeslam" zijn (4)

"ISIS plant dodelijke cyberaanvallen" (17/11) ● spatie te veel op einde zin (3)

"Mensen die nu nog vertrekken, wil je nooit nog terugzien" (17/11) ● " vergeten op einde van zin (3)

“Veiligheidsdiensten eerst in alle rust laten werken, dan pas evaluatie” (16/11) ● spatie te veel voor komma (3)

Assad: “Frans beleid droeg bij tot uitbreiding terrorisme” (14/11) ● In Beiroet vielen 43 doden, geen 44 (5)

België-Spanje in allerijl afgelast door terreurdreiging (16/11) ● "Abdelsam" moet "Abdeslam" zijn (2) ● "De bond liet in een persbericht weten "heel erg te betreuren dat een dergelijke vriendschappelijke match tussen twee gemotiveerde ploegen zo laat geannuleerd moet worden en begrijpen dat heel wat supporters ontgoocheld zullen zijn."" (4)

Buitengewone EU-raad Binnenlandse Zaken op vrijdag (15/11) ● "niveau's" moet "niveaus" zijn (3)

Crewlid Nick (36) stierf in de armen van zijn ex-vriendin (15/11) ● Artikel zegt dat band net aan concert was begonnen, maar ze waren al minstens een uur bezig (6)

Crisiscentrum: "Doe gewoon wat je anders doet" (17/11) ● "directe" moet "direct" zijn (3) ● "de factuur sturen voor de ontmijningsdiensten als die ter plaatse is moeten komen." (4)

Discotheek schrapt 'Flight to Paris' na aanslagen in Parijs (16/11) ● "gewijd aan een bekende stad of land." (4)

Drie verdachten stonden op lijst Belgisch antiterreurorgaan (16/11) ● het is "Salah", niet "Saleh" (2)

Duitse voetbalploeg beleefde “horrornacht” van Parijs in kleedkamer (16/11) ● "ondermeer" moet "onder meer" zijn (3)

Eerst van dichtbij geconfronteerd met ramp Germanwings, nu betrokken bij aanslagen Parijs (14/11) ● "alpen" moet met een hoofdletter (3)

Extra veiligheidsmaatregelen tijdens Zesdaagse van Gent (16/11) ● "binnenlandse zaken" moet tweemaal met hoofdletter (3)

Fragment uit Casablanca plots razend populair na aanslagen (16/11) ● "Twee Wereldoorlog" moet "Tweede Wereldoorlog" zijn (3)

Gevluchte terrorist niet gevonden in Molenbeek (16/11) ● "Dovo" moet "DOVO" zijn (2)

Hasselts gemeenteraadslid ontsnapt aan de dood in Parijs (15/11) ● "Hasselts gemeenteraad" moet "Hasseltse gemeenteraad" zijn (3)

Hollande: ‘Aanslagen waren oorlogsdaad van ISIS’ (14/11) ● "maet" moet "met" zijn (3)

Internationaal gezochte Abdeslam Salah werkte ooit bij MIVB (16/11) ● Het is "Brahim Abdeslam", niet "Ibrahim" (2)

Jambon: "Ik ga Molenbeek opkuisen" (14/11) ● "besture" moet "besturen" zijn (3)

LIVE. Drie broers uit Molenbeek: 1 omgekomen, 1 vrijgelaten na verhoor, 1 internationaal gezocht (15/11) ● "og" moet "nog" zijn (3)

Luckas Vander Taelen over Molenbeek: "Dit zat er al jaren aan te komen" (17/11) ● "Het is fout te denken dat het maar om een kleine minderheid." (4)

Marc Penxten wil rouwregister na drama Parijs (16/11) ● "gemeente" moet "gemeenten" zijn (3)

Meer dan 5 miljoen internetgebruikers stelden hun familie en vrienden gerust via Facebook (15/11) ● "socialenetwerksite" moet "sociale netwerksite" zijn (3)

Model veroorzaakt opschudding tijdens aanslagen in Parijs (15/11) ● punt achter zin vergeten (3)

Moslims veroordelen aanslagen in Parijs met #NotInMyName (16/11) ● "net" moet "niet" zijn (3)

Na Luiks koppel, derde Belgische slachtoffer van terreur in Parijs (15/11) ● spatie voor begin nieuwe zin vergeten (3)

Naalden gevonden in hotelkamers van terroristen (17/11) ● spatie tussen woord en komma moet weg (3)

Prince zet Europese tournee stop, Belgisch concert geannuleerd (16/11) ● "Dat was het geval van" moet "Dat was het geval bij" zijn (4)

Raketlanceerder aangetroffen tijdens massale huiszoekingen (16/11) ● Spatie te veel op einde zin (3)

Recyclagekunstenaar uit verdriet met symbolisch kunstwerk (17/11) ● "Thuisbijven" moet "thuisblijven" zijn (3) ● spatie teveel tussen 2 woorden (3)

Rouwregister voor terreurslachtoffers Parijs in Sociaal Huis (17/11) ● Spatie tussen woord en komma (3)

Syrische rebellengroepen veroordelen aanslagen (15/11) ● "Rakka" moet "Raqqa" zijn (2)

Terreurdreigingsniveau voor grote evenementen in ons land verhoogd naar niveau drie (15/11) ● komma vergeten in eerste zin (3)

Twee grootschalige politiecontroles aan de gang in Nord-Pas-de-Calais (17/11) ● spatie vergeten bij start nieuwe zin (3)

Vlag half stok (16/11) ● "half stok" moet "halfstok" zijn (2x) (3) + (3)

Vrije Lagere School gedenkt slachtoffers aanslag Parijs (16/11) ● "“Mooi! Daar krijg je rillingen van,” en “bedankt aan alle juffen en meesters die vandaag en de komende dagen extra veel vragen krijgen van onze kinderen. Eerlijk antwoorden op deze vragen is geen gemakkelijke taak,” waren enkele van de talrijke reacties" (3)

Waarnemend FIFA-voorzitter Hayatou steekt Franse voetbalwereld hart onder de riem (14/11) ● "vandaaruit" moet "van daaruit" (3)

41-jarige gewapende Fransman opgepakt na incident in luchthaven Gatwick (14/11) • “Het zou gaan om een handgranaat”, terwijl politie niks meer zegt dan een “vuurwapen” (6)

Moskee in Spanje in brand gestoken (14/11) • Stadje heeft geen 40000, maar slechts 36000 inwoners. (1)

Nog een drama in Frankrijk: tien doden en 49 gewonden bij mislukte testrit TGV (14/11) • Artikel spreekt van 49 doden, maar er waren slechts 53 passagiers op de trein waarvan 10 stierven. (1) • “en gedeeltelijk in een kanaal” is geen volwaardige zin. (4)

“Turkse soldaten schieten vier ISIS-militanten dood” (15/11) • In Ankara vielen 109 doden, geen 100. (5)

Negen arrestaties na aanslag in Beiroet (15/11) • “Mashnuk” moet “Machnouk” zijn (2)

70 bedrijfsleiders naar Chili en Brazilië (16/11) • “Vallparisio” moet “Valparisio” zijn (2)

Amerikaanse gevechtsvliegtuigen vernietigen 116 tankwagens van ISIS (16/11) • “Deir Ezzor” moet “Deir Ez-zor” zijn (2)

G20 keurt actieplan tegen belastingontwijking multinationals goed (16/11) • “ans” moet “and” zijn (3)

Macedonië gestart met bouwen hek aan grens met Griekenland (16/11) • “Gorgi” moet “Gjorge” zijn (2) • “Europees” moet “Europees land” zijn (4)

Marriot en Sheraton komen in zelfde handen (16/11) • “Marriot” moet “Marriott” zijn (2) • “Le Meridien” moet “Le Méridien” zijn (2)

150 wapens in beslag genomen bij 13-jarige na dood van vriendje (17/11)

• “wapen” moet “wapens” zijn (4)

Heropstart Doel 3 en Tihange 2 ten vroegste op 15 december (17/11) • “besuit” moet “besluit” zijn (3)

Minstens negen migranten voor Kos verdronken (17/11)

• “Piräeus” moet “Piraeus” zijn (2)

Palestijn die op Israëlische soldaten schiet zelf doodgeschoten (17/11)

• “voegde toe” moet “voegden toe” zijn (4)

Tsjechische president neemt deel aan anti-islambetoging (17/11)

• “islamofoeben” moet “islamofoben” zijn (3)

Vader doodt zijn zoon met een mes in Membach (17/11)

• “konings” moet met hoofdletter (3)

Errors for the terrorist attacks in Paris – Non-stress period
Articles written by Het Nieuwsblad

General overview
Number of words in data set: 15725
Total number of errors found: 20

Fraction of articles containing at least 1 error: 15/80 or 18.8%
Fraction of articles containing at least 1 language error: 12/80 or 15.0%
Fraction of articles containing at least 1 factual error: 3/80 or 3.8%

Errors encountered:
• Overestimation of numbers (1): occurs 0 times or 0.0%
• Incorrect names (persons, cities, albums, groups, ...) (2): occurs 10 times or 50.0%
• Spelling errors, punctuation errors, ... (3): occurs 5 times or 25.0%
• Incorrect sentence construction, missing words, superfluous words, ... (4): occurs 2 times or 10.0%
• Incorrect numbers not falling under category (1) (5): occurs 2 times or 10.0%
• Factual but non-numerical errors (6): occurs 1 time or 5.0%

Number of articles per number of errors:
65 articles contain 0 errors, 12 articles contain 1 error, 2 articles contain 2 errors and 1 article contains 4 errors.

Overview of errors found

Minstens zes doden na zelfmoordaanslag Nigeria (23/10)

• Titel is niet aangepast naar nieuwe balans van minstens 28 doden. (5)

Drië Israëli’s en twee Palestijnen omgekomen bij nieuwe aanvallen in Israël (13/10)

• "drië" moet "drie" zijn (3) • Het is "Sami Abu Zuhri", niet "Sami Abu Suhri" (2) • Het is "Raänana", niet "Raanana" (2) • Het is "Kirjat Ata", niet "Kiriat Ata" (2)

Palestijnen vallen buspassagiers aan in Jeruzalem (13/10) • Het is "Raänana", niet "Ranaana" (2) + (2)

Politieman en dader doodgeschoten voor politiekantoor in Sydney (02/10)

• Lokale Belgische tijd van aanslag was 7u30, niet 8u30 (6)

Aanslagen in Bagdad eisen meer dan 20 doden (17/09)

• In titel staat "minstens 20", in intro "minstens 23", en vervolgens als je alle slachtoffers optelt kom je op "minstens 26". (5)

Twee medewerkers internationale Rode Kruis gedood in Jemen (02/09)

• "internationale Rode Kruis" moet "Internationale Rode Kruis" zijn (2) • "Sadaa" moet "Saada" zijn (2)

70 miljoen telefoongesprekken van gevangenen gehackt (13/11) • Komma vergeten (3)

Jihadi John "zo goed als zeker" gedood bij luchtaanval (13/11) • “Hij verscheen ‘Jihadi John’ nog in een video van ISIS met de Japanse gijzelaars Haruna Yukawa en Kenji Goto, kort voor ze werden gedood.” (4)

Kinepolis schrapt vrijdagvertoning van ‘Black’, ook weekendprogramma aangepast (13/11)

• “Het weekendprogrammatie” moet “de weekendprogrammatie” zijn (4)

Actie tegen ‘jihadistennetwerk’ in Italië en Noorwegen (12/11) • “Fraj” moet “Faraj” zijn (2)

Achtjarige slaat baby dood omdat hij gehuil niet kan verdragen (12/11) • Spatie te veel achter naam (3)

Beelden opgedoken van relletjes bij vertoning van 'Black' in Kinepolis (12/11) • “Troos” moet “Van Troos” zijn (2)

Bijna 1 op de 10 Vlaamse ambtenaren in 2014 ‘onbeschikbaar’(12/11)

• “privé-sector” moet “privésector” zijn (3)

Gevel ingestort in centrum Leuven (12/11)

• “vermoedelijke” moet “vermoedelijk” zijn (3)

Kapitein krijgt levenslang voor ramp met Zuid-Koreaanse veerboot (12/11)

• “Kwangju” moet “Gwangju” zijn (2)

Errors for the Germanwings plane crash – Stress period
Articles written by Het Laatste Nieuws

General overview
Number of words in data set: 37772
Total number of errors found: 98

Fraction of articles containing at least 1 error: 66/184 or 35.9%
Fraction of articles containing at least 1 language error: 57/184 or 31.0%
Fraction of articles containing at least 1 factual error: 13/184 or 7.1%

Errors encountered:
• Overestimation of numbers (1): occurs 0 times or 0.0%
• Incorrect names (persons, cities, albums, groups, ...) (2): occurs 36 times or 36.7%
• Spelling errors, punctuation errors, ... (3): occurs 31 times or 31.6%
• Incorrect sentence construction, missing words, superfluous words, ... (4): occurs 16 times or 16.3%
• Incorrect numbers not falling under category (1) (5): occurs 10 times or 10.2%
• Factual but non-numerical errors (6): occurs 5 times or 5.1%

Number of articles per number of errors:
118 articles contain 0 errors, 44 articles contain 1 error, 17 articles contain 2 errors, 3 articles contain 3 errors, 4 articles contain 4 errors and 1 article contains 7 errors.

Overview of errors found

“45 Spanjaarden aan boord” (24/03) • “Pierre-Martin Charpenel” moet “Pierre Martin-Charpenele” zijn (2)

“Dit scenario moet hij al honderden keren in zijn hoofd hebben afgespeeld” (26/03) • Andreas Lubitz was 27, geen 28 (5)

“Er is niks meer te zien dan puin en lichamen. Alles is verpulverd” (24/03) • “hij is de plaats overvlogen” moet “hij heeft de plaats overvlogen” zijn (4)

“Gekend probleem bij Airbus-toestellen” (24/03) • “GermanWings” moet “Germanwings” zijn (2x) (2) + (2)

“Gisteren waren we met veel, nu alleen” (25/03) • Spatie en “ omgewisseld (3) • Spatie te veel voor “Gymnasium” (3)

“Identifictie start pas vandaag” (27/03) • “Identifictie” moet “identificatie” zijn (3)

“Lubitz kende de plaats van de crash” (27/03) • “Montabour” moet “Montabaur” zijn (2)

“Passagiers wisten pas op het laatste moment wat er gebeurde” (27/03) • “de slachtoffer” moet “de slachtoffers” zijn (3)

“Raam van cockpit brak, waardoor piloten bewustzijn verloren” (25/03) • Bron van artikel is forum van piloten die niks met onderzoek te maken hebben. Achteraf blijkt ook dat wat in artikel staat helemaal niet klopt. (6)

16 scholieren en twee baby’s aan boord ramptoestel Germanwings (24/03) • Komma te veel in zin (3) • “Sixtuskirsche” moet “Sixtus Kirche” zijn (2) • “Llinars des Vallès“ moet “Llinars del Vallès“ zijn (2)

Aanwijzing gevonden in flat van copiloot (27/03) • Lubitz was 27, geen 28 (5)

Andreas Lubitz, de copiloot die 149 mensen dood injoeg (26/03) • “Daarna volgen een opleiding in de vliegsimulator in Bremen en wordt er geoefend met een Cessna Citation.” (4)

Beelden tonen versplintering Airbus (27/03) • Spatie en punt omgewisseld (3)

Bemanning zond geen noodsignaal uit (24/03) • “Daarna verdween het contact met het toestel verbroken” (4)

Copiloot handelde bewust (26/03) • Punt vergeten op einde van zin (3)

Copiloot kreeg recent nog erkenning van luchtvaartautoriteiten (26/03) • Lubitz was 27 jaar, geen 28(5)

De Airbus A320, het werkpaard van de luchtvaart (24/03) • “luchtvaarspecialist” moet “luchtvaartspecialist” zijn (3)

Duitse Airbus A320 crasht in Franse Alpen (24/03) • “GermanWings” moet “Germanwings” zijn (2) + (2)

Duitse scholieren waren uitgeloot voor de reis (25/03) • punt vergeten op einde van zin (3)

Eén piloot buitengesloten uit cockpit: “Je hoorde hem op de deur inbeuken” (26/03) • “er staat vast” moet “het staat vast” zijn (4) • “informatie van Le Monde” zegt dat het copiloot was die cockpit verliet, en niet andersom (zoals het achteraf bleek te zijn) (6)

Expert: “Wat er gebeurd is in cockpit is zeer interessant” (25/03) • “moest” moet “mocht” zijn (4)

FBI biedt hulp aan bij onderzoek naar de crash (26/03) • “Volgens woordvoerder Josh Earnest van het Witte Huis blijven de VS ervan uitgaan dat er bij de vliegtuigcrash geen terrorisme vandoen is.” (4)

Franse piloten stappen naar rechter uit onvrede met lekken (27/03) • “ze noemen het getuigen van” moet “ze vinden het getuigen van” (4)

Hier is vlucht 4U9525 neergestort (24/03) • “GermanWings” moet “Germanwings” zijn (2x) (2) + (2) • “Dignes-Les-Bains” moet “Digne-Les-Bains” zijn (2)

Hoe komt het dat Lubitz toch vloog? (27/03) • Punt vergeten op einde zin (3)

Hoe kon het foutlopen met vlucht 4U9525? (24/03) • “Alpes-de-Hoate-Provence” moet “Alpes-de-Haute-Provence” zijn (2)

Hoe werkt zo’n cockpitdeur? (26/03) • Punt vergeten op einde van zin. (3)

Kippenvel: piloot Germanwings spreekt passagiers toe (27/03) • “fantasisch” moet “fantastisch” zijn (3)

Lubitz stond onder toezicht voor psychologische problemen (27/03) • “voot” moet “voor” zijn (3)

Lufthansa-CEO: “Ik begrijp niet hoe dit kon gebeuren” (25/03) • “luchtvaartmaatcshappij” moet “luchtvaartmaatschappij” zijn (3) • “telkens” moet “enkele” zijn (4)

Merkel: “Crash grondig onderzoeken” (24/03) • “GermanWings” moet “Germanwings” zijn (2) • “Dignes-Les-Bains” moet “Digne-Les-Bains” zijn (2)

Minstens 1 Belg bij slachtoffers 4U9525 (24/03) • “Dignes-Les-Bains” moet “Digne-Les-Bains” zijn (2) • ” vergeten op einde van quote (3)

Ongelukkige Germanwings-reclame uit Londense metro verwijderd (27/03) • “ vergeten aan begin quote (3)

Onbekende uit huis Lubitz gehaald (27/03) • “wordt” moet “werd” zijn (4) • “Montabau” moet “Montabaur” zijn (2)

Ook Nederlandse vrouw aan boord van rampvlucht (24/03) • “ te veel gebruikt in quote (3)

Operazangers Oleg Bryjak en Maria Radner bij slachtoffers (25/03) • “bassbariton” moet “basbariton” zijn (3) • Radner was 33, geen 34 (5)

Procureur: “Copiloot deed vliegtuig doelbewust crashen” (26/03) • “communiceerde de piloten” moet “communiceerden de piloten” (3)

Ramptoestel had gisteren technische panne, personeel Lufthansa weigert te vliegen (24/03) • “veiliheid” moet “veiligheid” zijn (3) • Frans gevechtsvliegtuig heeft passagiersvliegtuig wel degelijk teruggevonden, in tegenstelling tot wat artikel beweert. (6)

Technologie om deze drama’s te vermijden bestaat. Waarom wordt ze niet gebruikt? (27/03) • “vakbanden” moet “vakbonden” zijn (3)

Vanaf nu verplicht met twee in cockpit (27/03) • Lidwoord vergeten (4)

Verscheurd ziektebriefje in woning copiloot (27/03) • “verscheurd ziektebriefje zou gevonden” moet “verscheurd ziektebriefje zou gevonden zijn” zijn (4)

Vorig jaar stortte Airbus van Lufthansa bijna neer na steile duikvlucht (24/03) • “Lufthanse” moet “Lufthansa” zijn (2) • “passagier” moet “passagiers” zijn (3) • “à rato van” moet “a rato van” zijn (3)

Wat is Germanwings, de “toekomst van Lufthansa”? (24/03) • “luchtvaarspecialist” moet “luchtvaartspecialist” zijn (3)

Toestand Vlaamse autosnelwegen verbetert voor zevende jaar op rij (26/03) • “verkeren” moet “verkeert” zijn (3)

De Lijn verliest steeds meer reizigers (25/03) • Er staat geen min voor de reizigersaantallen, maar wel voor de verschillen in reizigersaantallen (6)

Toeristen maakten huiveringwekkende beelden van aanslag museum in Tunis(25/03) • Artikel komt na opsomming aan 23 doden, terwijl er in werkelijkheid 24 zijn gevallen. (5)

Heropening Bardomuseum in Tunis uitgesteld om veiligheidsredenen (25/03)

• Volgens artikel (week na aanslag) 21 doden, terwijl er 24 doden in totaal vielen. (5)

Vrouw probeert zichzelf in koffer Turkije binnen te smokkelen (24/03)

• “eerder” moet “eerdere” zijn (3)

Twee Syriëstrijders vrij door fouten gevangenispersoneel (24/03) • “Aurelie-Anne” moet “Aurélie- Anne” zijn (2)

Politie onder vuur genomen op Brusselse Ring (24/03) • “verbrandde” moet “verbrande” zijn (3)

Amanda Knox definitief vrijgesproken voor moord op Britse studente (27/03) • “Girgha” moet “Ghirga” zijn (2) • “Raffalele” moet “Raffaele” zijn (2)

Wat wordt het verdict voor Amanda Knox? (27/03)

• “Rafaelle” moet “Raffaele” zijn (2) • “Het Proces Na het onderzoek” (3) • 16 jaar moet 26 jaar zijn (5) • 40 messteken moeten er 47 zijn (5)

Uitspraak in zaak Amanda Knox uitgesteld tot vrijdag (25/03)

• “dat” moet “dan” zijn (4)

Opnieuw bombardementen in Jemen (27/03)

• “Hoethi” moet “Houthi” zijn (2) + (2) + (2) + (2) • “Mohammed” moet “Mohammad” zijn (2) + (2) • “Al-Arabija” moet “Al-Arabiya” zijn (2)

Jemenitische president “onder Saoedische bescherming” naar top in Egypte (26/03)

• “Al-Arabija” moet “Al-Arabiya” zijn (2) • “verlaten” moet “had verlaten” zijn (4)

Saoedi-Arabië valt houthi-rebellen in Jemen aan (26/03)

• “houthi” moet “Houthi” zijn (2)

Marokko rolt lokale terreurcel van IS op (24/03)

• “ vergeten (3)

Athene wil 1,2 miljard euro terug van Europees noodfonds (24/03)

• “Euroepse” moet “Europese” zijn (3) • “eurogroep” is met hoofdletter (2)

Pensioenhervorming goedgekeurd, bespreking indexsprong nu van start (25/03)

• Het is de oppositie die de wet laakt, niet de meerderheid (6)

Tunesië blijft ook in paasvakantie populaire vakantiebestemming (25/03)

• “Bruyere” moet “Bruyère” zijn (2) • “Van den Bosch” moet “van den Bosch” zijn (2)

Vakbonden: “Vlaamse regering mag factuur niet opnieuw doorschuiven” (26/03)

• “moet” moet “moeten” zijn (4)

Nederlandse economie meer gegroeid dan gedacht (26/03)

• “Van Mullingen” moet “van Mullingen” zijn (2)

Vlaams Belang in Damascus om Assad-regime te steunen (24/03)

• “Van Dermeersch” moet “Van dermeersch” zijn (2) • Lidwoord “de” vergeten (4)

Oud-parlementslid Joris Van Hauthem (Vlaams Belang) overleden (25/03)

• Hij was fractievoorzitter in de senaat tot 2010, niet tot 2009. (5)

VS houden 9.800 soldaten tot eind dit jaar in Afghanistan (24/03)

• “Afghanistan” moet “Afghanisten” zijn. (2) • “gebaseerd” is letterlijk vertaald uit het Engels. (4)

Errors for the Germanwings plane crash – Non-stress period
Articles written by Het Laatste Nieuws

General overview
Number of words in data set: 21662
Total number of errors found: 43

Fraction of articles containing at least 1 error: 26/90 or 28.9%
Fraction of articles containing at least 1 language error: 23/90 or 25.6%
Fraction of articles containing at least 1 factual error: 3/90 or 3.3%

Errors encountered:
• Overestimation of numbers (1): occurs 0 times or 0.0%
• Incorrect names (persons, cities, albums, groups, ...) (2): occurs 20 times or 46.5%
• Spelling errors, punctuation errors, ... (3): occurs 15 times or 34.9%
• Incorrect sentence construction, missing words, superfluous words, ... (4): occurs 5 times or 11.6%
• Incorrect numbers not falling under category (1) (5): occurs 1 time or 2.3%
• Factual but non-numerical errors (6): occurs 2 times or 4.7%

Number of articles per number of errors:
64 articles contain 0 errors, 16 articles contain 1 error, 6 articles contain 2 errors, 1 article contains 3 errors and 3 articles contain 4 errors.

Overview of errors found

Zeker dertig lichamen toestel AirAsia geborgen (02/01)

• “Air Asia” moet “AirAsia” zijn (2) • Exact dezelfde zin komt 2 keer voor in artikel. (4)

Porosjenko: “Nemtsov gedood om onthullingen over Oekraïne” (28/02)

• “Gary Kasparov” moet “Garry Kasparov” zijn (2) • “auto”s” moet “auto’s” zijn (3)

Nieuwe informatie op video over moord Boris Nemtsov (27/03)

• “ vergeten op einde van quote (3)

Video opgedoken van botsing helikopters waarbij Franse sportkampioenen omkwamen (10/03)

• Alexis Vastine was 28, geen 29. (5)

Langste vakantie van mijn leven” werd nieuwe ‘Laure Manaudou’ fataal (10/03)

• “Micheline Ostermeuer” moet “Micheline Ostermeyer” zijn (2) • “olimpiade” moet “olympiade” zijn (3) • “Reactie op Muffats” moet “Reactie op Muffats dood” zijn (4) • “Ushuaïa” moet “Ushuaia” zijn (2)

Islamitische Staat eist aanslagen Jemen op, nu al zeker 140 doden (20/03)

• “Dat heeft een Jemenitische ministerie bekend gemaakt” (4)

Jonge Britse jihadgangers terug in Engeland (16/03)

• Er staat eerst dat ze terug zijn in Londen, maar vervolgens wordt gezegd dat hun uitzetting naar Engeland nog moet onderhandeld worden. (6)

Ghelamco bouwt ook nieuw nationaal voetbalstadion (19/03) • “heizelplateau” moet “Heizelplateau” zijn (2)

Je zal er maar naast zitten: vrouw steekt sigaret op en ‘terroriseert’ heel vliegtuig (17/03) • “Obama doodt mensen dinsdags.” Wat doet die dinsdags daar? (4) • “Exxon Mobil” moet “ExxonMobil” zijn (2)

Aanvullend pensioen wordt gekortwiekt (17/03) • Misleidende titel. Niet zeker welke maatregel minister gaat nemen, maar in titel staat al dat pensioen zeker wordt verminderd. (6)

Laagste uitkeringen en pensioenen stijgen met 2 procent (13/03) • “mimimum” moet “minimum” zijn (3)

Anonymous publiceert 9.200 Twitter-accounts van IS (16/03)

• “oonder” moet “onder” zijn (3)

“Genkse Syriëstrijder (20) gesneuveld in Irak” (16/03)

• “Younnes” moet “Younes” zijn (3x) (2) + (2) + (2) • “ teveel (3)

“Jihadi John wilde me de keel oversnijden maar ik overleefde zijn dreigement” (15/03)

• “Espinoza” moet “Espinosa” zijn (2) • “Gorbanov” moet “Gorbunov” zijn (2)

25 jaar later weer vrij: de tienerlover van ‘To Die For’ lerares die haar man doodde (15/03)

• “Terwijl Randall hield hem onder bedwang hield met een mes vuurde Flynn een kogel in zijn hoofd.” (4) • ‘ vergeten voor quote (3) • “Gregory” moet “Greggory” zijn (2)

Tropisch eiland Vanuatu zwaar getroffen door cycloon (13/03) • “Vanuata” moet “Vanuatu” zijn (2x) (2) + (2)

Check hier hoe zuinig de woningen in uw gemeente zijn (12/03)

• Komma vergeten (3)

‘Facebook voor jihadisten’ dag na lancering weer offline (12/03)

• “jhadisten” moet “jihadisten” zijn (3)

Overloper onthult waarom slachtoffers IS zo kalm blijven bij hun executie (11/03)

• “Hauruna” moet “Haruna” zijn (2)

IS-video toont executie van ‘Israëlische spion’ door kind (10/03)

• “Mussallam” moet “Musallam” zijn (2)

Deze foto van Obama gaat de geschiedenisboeken in (08/03)

• “historisch” moet “historische” zijn (3)

Belgische familie gevlucht uit Syrië (06/03)

• “Vandelvede” moet “Van De Velde” zijn (3x) (2) + (2) + (2) • “autoreiten” moet “autoriteiten” zijn (3)

Bankkaarten verraden verdachte zuuraanval (07/03)

• “Daarhield” moet “Daar hield” zijn (3)

Federale kern zit samen over brugpensioen, Peeters wil vandaag landen (06/03)

• “berreiken” moet “bereiken” zijn (3)

Moeder uit Molenbeek ontvoert kleuters naar Syrië (06/03)

• 2x spatie te veel in zin (3) + (3)

1000 kilometer gereden om uit het leven te stappen (05/03)

• “Goefferdinge” moet “Goeferdinge” zijn (2)

Errors for the Germanwings plane crash – Stress period
Articles written by Het Nieuwsblad

General overview
Number of words in data set: 35329
Total number of errors found: 84

Fraction of articles containing at least 1 error: 53/136 or 39.0%
Fraction of articles containing at least 1 language error: 47/136 or 34.6%
Fraction of articles containing at least 1 factual error: 7/136 or 5.1%

Errors encountered:
• Overestimation of numbers (1): occurs 1 time or 1.2%
• Incorrect names (persons, cities, albums, groups, ...) (2): occurs 36 times or 42.9%
• Spelling errors, punctuation errors, ... (3): occurs 27 times or 32.1%
• Incorrect sentence construction, missing words, superfluous words, ... (4): occurs 14 times or 16.7%
• Incorrect numbers not falling under category (1) (5): occurs 3 times or 3.6%
• Factual but non-numerical errors (6): occurs 3 times or 3.6%

Number of articles per number of errors:
83 articles contain 0 errors, 34 articles contain 1 error, 10 articles contain 2 errors, 6 articles contain 3 errors and 3 articles contain 4 errors.

Overview of errors found

'Copiloot was tot de dag voor de crash in psychiatrische behandeling' (27/03)

• “Carsten Sphor” moet “Carsten Spohr” zijn. (2)

‘Christian nam het vliegtuig zoals anderen de metro’ (26/03)

• ‘ te weinig om zin af te sluiten (2x) (3) + (3) • Punt vergeten op einde van zin (3)

‘Dat het vliegtuig 25 jaar oud is, speelt geen rol’ (24/03)

• Filip Van Rossum moet Filip Van Rossem zijn (2) + (2) + (2) + (2)

‘In Airbus A-320 controleert computer elke stuurbeweging’ (24/03)

• “GermanWings” moet “Germanwings” zijn (2) + (2) • “RyanAir” moet “Ryanair” zijn (2) + (2)

‘Mensen met depressie zijn niet gevaarlijk, tenzij af en toe voor zichzelf’ (27/03)

• Hoofdletter aan begin van zin vergeten (3) • “Dusseldorf” moet “Düsseldorf” zijn (2) • “Bij een klassieke depressie kan het best zijn dat men op een bepaald moment dat zijn/haar leven niet meer de moeite waard is.” (4) • “medisch screening” moet “medische screening” zijn (3)

‘Piloten en passagiers waren waarschijnlijk in slaap gevallen’ (25/03) • Er waren in Athene slechts 121 doden, geen 212. (1)

‘Vliegangst? Niet nodig, vliegtuig blijft enorm veilig’ (25/03) • “Universit Gent” moet “Universiteit Gent” zijn (2) • “helpen die cursussen hen de angst af” moet “helpen die cursussen hen van de angst af” (4)

Al vaker problemen met vliegtuigen van Lufthansa (24/03) • “GermanWings” moet “Germanwings” zijn (2x) (2) + (2)

Bergingsactie gaat verder (27/03) • Punt vergeten op einde van zin (3)

Broer van Belgisch slachtoffer was piloot, maar verbrak alle contact (27/03) • Vlucht was 4U9525, niet A320 (2) • Het is Christian, niet Christal (2)

Bruikbare opnames van stemmen en geluiden gehaald uit zwarte doos (25/03) • “Routineboodshappen” moet “Routineboodschappen” zijn (3)

De slachtoffers van de rampvlucht van Germanwings (25/03) • “vliegtui” moet “vliegtuig” zijn (3) • 2x spatie te veel tussen woorden (3) + (3)

Duitse scholieren waren uitgeloot voor de reis (25/03) • “Fanse” moet “Franse” zijn (3) • Spatie te veel op einde van zin (3)

Duitse school rouwt om 18 slachtoffers vliegtuigcrash (24/03) • “Sylvia Loerhman” moet “Sylvia Löhrmann” zijn (2)

Franse president: ‘Geen overlevenden’ (24/03) • “Er zouden voornamelijk Duitse slachtoffers zijn” ipv “Er zouden voornamelijk voor Duitse slachtoffers zijn” (4)

Inwoners regio bieden massaal slaapplaats aan (25/03) • ‘ vergeten aan begin van quote (3)

Lufthansa: ‘Vliegtuig heeft 8 minuten lang duikvlucht gemaakt’ (24/03) • “Winkelmanna” moet “Winkelmann” zijn (2)

Lufthansa gaat uit van ongeval, ‘al de rest is ‘speculatie’ (24/03) • ‘ te weinig op einde van quote (3)

Niet te vatten (27/03) • Zin uit intro wordt zomaar halverwege gestopt (3)

Nog geen duidelijkheid over Belgen aan boord (24/03) • “GermanWings” moet “Germanwings” zijn (3x) (2) + (2) + (2)

Pakkende minuut stilte voor Duitsland-Australië (26/03) • Spatie te veel na openen haakjes (3)

Politie onderzoekt videobeelden vliegtuigcrash (24/03) • “De Duitse autoriteiten zijn” moet “De Duitse autoriteiten zien” zijn (3)

Politie voor ouderlijke woning van copiloot (26/03) • “De politie heeft zich in de straten die naar het ouderlijk huis van de copiloot opgesteld” (4) • “Natuurlijk zijn wij geschokt en geraakt nu blijkt dat er ook in onze gemeente één van de slachtoffers uit onze gemeente afkomstig is.” (4)

PORTRET. Andreas Lubitz: 'Een perfect normale jongeman’ (26/03) • “een gesprekken” moet “een gesprek” zijn (3) • Komma en ‘ zijn omgewisseld (3) • “Carsten Sphor” moet “Carsten Spohr” zijn (2)

PROFIEL. Germanwings: nog onbekend maar vol ambitie (24/03) • “Het in Frankrijk neergestorte toestel Germanwings” moet “Het in Frankrijk neergestorte toestel van Germanwings” zijn (4)

REACTIES. ‘België staat klaar om alle hulp te verlenen’ (24/03) • “onderzoeksamen” moet “onderzoek samen” zijn (3)

RECONSTRUCTIE. De laatste momenten van vlucht 4U9525 (26/03) • “nforceren” moet “forceren” zijn (3) • “de Andreas Lubitz” moet “Andreas Lubitz” zijn (2)

Valls: ‘Geef ons alle informatie die jullie over Lubitz hebben’ (27/03)

• “de vierde dag op vrij” should be “de vierde dag op rij” (3)
• “de” should be “die” (3)

Zij hadden geluk en stapten niet op noodlottige vlucht (25/03)

• “German Wings” should be “Germanwings” (2)

Houthi-rebellen rukken verder op in Zuid-Jemen (24/03)

• Saleh is no longer president, but is still referred to as such (6)

MR aanvaardt verklaringen van Bart De Wever niet (24/03)

• Missing period at the end of a sentence (3)

Ook Canada bereid om IS aan te vallen in Syrië (24/03)

• “De extremisten leunen ideologisch dichtbij de IS aan, zijn hen echter vijandig gezind.” (4)

Vlucht van Germanwings naar Barcenlona afgeschaft (24/03)

• “Barcenlona” should be “Barcelona” (2)

Poolse priester 7 jaar naar cel voor pedofilie (25/03)

• He was charged with a total of 10 cases of paedophilia, not 8. (5)

Meer vrouwen en allochtonen aan de slag bij Vlaamse overheid (25/03)

• “zet” should be “zegt” (3)

Francken hoopt op akkoord met Marokko (25/03)

• “Franken” should be “Francken” (2)
• Missing article “het” before “overgrote deel” (4)

Filip Dewinter: 'Samenwerken met Syrisch regime, of men dat nu wil of niet' (25/03)

• “Van Dermeersch” should be “Van dermeersch” (2)

Boko Haram ontvoert opnieuw honderden vrouwen en kinderen (25/03)

• “Tjadische” should be “Tsjadische” (2)

4 VRAGEN. D-Day voor Amanda Knox uitgesteld (25/03)

• “Know” should be “Knox” (2)
• “zouden” should be “zou” (4)

‘Weinig reisannulaties na aanslag in Tunesië’ (25/03)

• “Van den Bosch” should be “van den Bosch” (2)
• “Bruyere” should be “Bruyère” (2)

Voortaan sneller inchecken op Brussels Airport (26/03)

• The article states that Brussels Airport is an airline, which is not the case. (6)

Toestand Vlaamse autosnelwegen verbetert voor zevende jaar op rij (26/03)

• “verkeren” should be “verkeert” (4)

Regering houdt vast aan taxiplan minister Smet (26/03)

• “Diliès” should be “Dilliès” (2)

Ministers trekken 440.000 euro uit voor daklozenhulp (26/03)

• “Samu social” should be “Samusocial” (2)

Meer jobs, maar werkloosheid daalt niet (26/03)

• “2010” should be “2020” (5)

Juf De Blokkendoos volledig vrijuit voor vermeend kindermisbruik (26/03)

• “De Bokkendoos” should be “De Blokkendoos” (2)

Iran eist stopzetting Saudische aanval op Jemen (26/03)

• Missing article “het” (4)

Geweld stuwt asielaanvragen in rijke landen met 45 procent hoger (26/03)

• “aantal mensen dat asiel vorig jaar asiel heeft aangevraagd” (4)

Dit oude gevaarlijke chemische wapen wil ISIS gebruiken in Europa (26/03)

• The article repeatedly refers to “Britse experts”, but nowhere is it actually stated who they are. (6)

Dertien burgers gedood bij Saoudische aanvallen op Jemen (26/03)

• “Saoudische” should be “Saoedische” (2)
• “houthi” should take a capital letter (2)
• “Al Arabija” should be “Al Arabiya” (2)

Amerikaanse militair verdacht van terroristische activiteiten (26/03)

• Missing space at the start of a sentence (3)
• The article states that the maximum sentence is 15 years, but they were later given 21 and 30 years respectively (5)
• “Maar wat ze niet wisten, was dat ze een van hun kompanen een undercoveragent van de FBI was.” (4)

Pensioenen zelfstandigen stijgen met 2 procent (27/03)

• Missing hyphen after “moederschaps” (3)

Turkse man 13 jaar de cel in wegens beledigen van de vlag (27/03)

• A new paragraph starts abruptly in the middle of a sentence (4)

Errors Germanwings plane crash – Non-stress period
Articles written by Het Nieuwsblad

General overview

Number of words in data set: 15956
Total number of errors found: 33

Fraction of articles containing at least 1 error: 24/80 or 30.0%
Fraction of articles containing at least 1 language error: 22/80 or 27.5%
Fraction of articles containing at least 1 factual error: 3/80 or 3.8%

Occurring errors:
• Overestimation of figures (1) occurs 0 times or 0.0%
• Incorrect naming (persons, cities, albums, groups, ...) (2) occurs 13 times or 39.4%
• Spelling errors, punctuation errors, … (3) occurs 8 times or 24.2%
• Incorrect sentence construction, missing words, superfluous words, … (4) occurs 9 times or 27.3%
• Incorrect figures not falling under category (1) (5) occurs 2 times or 6.1%
• Factual, but non-numerical error (6) occurs 1 time or 3.0%

Number of articles per number of errors: 56 articles contain 0 errors, 17 articles contain 1 error, 6 articles contain 2 errors and 1 article contains 4 errors.
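
These summary figures are simple aggregations of the per-article annotations listed in this appendix. The sketch below (Python) illustrates how such an overview could be computed; the input format, the variable names and the split of the six categories into language versus factual errors are illustrative assumptions, not the tooling actually used for this dissertation.

    from collections import Counter

    # Hypothetical input: one list of error-category labels per article;
    # an empty list means no errors were annotated for that article.
    annotations = [
        [2, 2],        # e.g. two naming errors in a single article
        [3],           # one spelling/punctuation error
        [],            # no errors
        [5, 4, 3, 3],
    ]

    # Assumed split of the six categories into language vs. factual errors;
    # substitute the definition used in the dissertation if it differs.
    LANGUAGE_CATS = {2, 3, 4}
    FACTUAL_CATS = {1, 5, 6}

    n_articles = len(annotations)
    all_errors = [cat for article in annotations for cat in article]
    print(f"Total number of errors found: {len(all_errors)}")

    def fraction(predicate):
        # Fraction of articles for which the predicate holds.
        hits = sum(1 for article in annotations if predicate(article))
        return f"{hits}/{n_articles} or {100 * hits / n_articles:.1f}%"

    print("At least 1 error:", fraction(lambda a: len(a) > 0))
    print("At least 1 language error:", fraction(lambda a: any(c in LANGUAGE_CATS for c in a)))
    print("At least 1 factual error:", fraction(lambda a: any(c in FACTUAL_CATS for c in a)))

    # Frequency of each category and the articles-per-error-count distribution.
    for cat, count in sorted(Counter(all_errors).items()):
        print(f"Category ({cat}) occurs {count} times or {100 * count / len(all_errors):.1f}%")
    for n_errors, n_arts in sorted(Counter(len(a) for a in annotations).items()):
        print(f"{n_arts} articles contain {n_errors} error(s)")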

Overview of errors found

Bemanning Syrische legerhelikopter gevangen genomen, één dode (23/03)

• “Idleb” should be “Idlib” (2)
• “al-nosra” should be “al-Nusra” (2)

Dode bij schietpartij op parking van supermarkt (23/03)

• The intro mentions 7 injured, whereas further on in the article there are 8 (5)

Haai doodt Duitse toerist in Egypte (23/03)

• “Sharm El Sheik” should be “Sharm-El-Sheikh” (2)

Investeringen in technologische industrie blijven onder gemiddelde (23/03)

• “Daarmee blijven de investeringen blijven net als vorig jaar onder het gemiddelde niveau van de voorbije 20 jaar” (4)

Roularta ziet tij keren (23/03)

• “Patrick Draghi” should be “Patrick Drahi” (2)

Zoon afgezette Oekraïense president verdronken in Bajkalmeer (23/03)

• “Bajkalmeer” should be “Baikalmeer” (2x) (2) + (2)

Zuid- en Noord-Koreaanse bestrijden samen brand in grensgebied (23/03)

• “Zuid- en Noord-Koreaanse bestrijden samen brand in grensgebied” (4)

‘Psychisch zieke’ man die agenten aanviel in luchthaven New Orleans overleden (22/03)

• “sherif” should be “sheriff” (3)

Britse studenten helpen ISIS in ziekenhuizen (22/03)

• “afgestuurd” should be “afgestudeerd” (3)

‘“Noiraud” Reynders moet opstappen’ (21/03)

• Superfluous space at the end of a sentence (3)

Brand in tankstation langs Nederlandse snelweg A1 (21/03)

• “Appeldooorn” should be “Appeldoorn” (2)
• “aan tal” should be “aantal” (3)

Drie dagen nationale quarantaine in Sierra Leone (21/03)

• “zo lang” should be “zolang” (3)

Dubbel zoveel onderzoeksdossiers naar sektes (21/03)

• “telt momenteel 2.300 dossiers die geopend werden na een vraag over een filosofische of religieuze groeperingen” (4)
• “SudPresse” should be “sudpresse” (2)

Jonge overvallers geklist na wilde achtervolging (21/03)

• “In het dorpje Sint-pleegden ze een carjacking.” (4)

Australische oud-premier overleden op 84-jarige leeftijd (20/03)

• “George Whitlam” should be “Gough Whitlam” (2)

Bedreiger De Wever riskeert internering (20/03)

• Missing period at the end of a sentence (3)
• “bedreigingen eind 2013 tegen N-VA’ers burgemeester Bart De Wever en schepen Ludo Van Campenhout” (4)

Elke dag worden minstens 90 fietsen gestolen (20/03)

• Missing comma in a sentence (3)
• “ebay” should be “eBay” (2)
• On average 90 bicycles are stolen per day; this does not mean that at least 90 bicycles are stolen every day. (5)
• “Naar schatting maar één diefstal op drie wordt aangegeven bij de politie.” (4)

Griekse regering presenteert komende dagen lijst met hervormingen (20/03)

• “Francois Hollande” should be “François Hollande” (2)

Kapitaal van Charlie Hebdo verdeelt redactie (20/03)

• “Riis” should be “Riss” (2)

Miljoen dollar schadevergoeding na 40 jaar onschuldig in de cel (20/03)

• “Jackson werd samen met Kwame Ajamu (die vroeger door het leven ging als Ronnie Bridgeman) en diens broer Wiley en werden in 1975 veroordeeld” (4)
• “De politie vertelde Vernon onder meer hoe de mannen een bijtend goedje Franks’ gezicht hadden gegooid” (4)

Minder dan kwart procent rente op spaarboekje (20/03)

• Superfluous “)” in a sentence (3)

Minstens 55 doden bij drie aanvallen tegen moskeeën in Jemen (20/03)

• Not the Hawti militias but the Houthi militias were in control of that region. (6)

Twee kinderen en hun mama dood teruggevonden in huis (20/03)

• “De vrouw zich zich van het leven hebben beroofd.” (4)

Vier kandidaat-voorzitters Ecolo willen geen fusie met Groen (20/03)

• “functionneren” should be “functioneren” (2)
