Historical Language Records Reveal a Surge of Cognitive Distortions in Recent Decades

Johan Bollen (a,1), Marijn ten Thij (a,b), Fritz Breithaupt (c), Alexander T. J. Barron (a), Lauren A. Rutter (d), Lorenzo Lorenzo-Luaces (d), and Marten Scheffer (e)

(a) Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408; (b) Delft Institute of Applied Mathematics, Delft University of Technology, 2628 CD Delft, The Netherlands; (c) Department of Germanic Studies, Indiana University, Bloomington, IN 47405; (d) Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN 47405; and (e) Department of Environmental Sciences, Wageningen University, 6708 PB Wageningen, The Netherlands

Edited by Susan T. Fiske, Princeton University, Princeton, NJ, and approved June 15, 2021 (received for review February 1, 2021)

Individuals with depression are prone to maladaptive patterns of thinking, known as cognitive distortions, whereby they think about themselves, the world, and the future in overly negative and inaccurate ways. These distortions are associated with marked changes in an individual’s mood, behavior, and language. We hypothesize that societies can undergo similar changes in their collective psychology that are reflected in historical records of language use. Here, we investigate the prevalence of textual markers of cognitive distortions in over 14 million books for the past 125 y and observe a surge of their prevalence since the 1980s, to levels exceeding those of the Great Depression and both World Wars. This pattern does not seem to be driven by changes in word meaning, publishing and writing standards, or the Google Books sample. Our results suggest a recent societal shift toward language associated with cognitive distortions and internalizing disorders.

cognitive distortions | internalizing disorders | historical language analysis

Significance

Can entire societies become more or less depressed over time? Here, we look for the historical traces of cognitive distortions, thinking patterns that are strongly associated with internalizing disorders such as depression and anxiety, in millions of books published over the course of the last two centuries in English, Spanish, and German. We find a pronounced “hockey stick” pattern: Over the past two decades the textual analogs of cognitive distortions surged well above historical levels, including those of World War I and II, after declining or stabilizing for most of the 20th century. Our results point to the possibility that recent socioeconomic changes, new technology, and social media are associated with a surge of cognitive distortions.

Author contributions: J.B., M.tT., F.B., A.T.J.B., L.A.R., L.L-L., and M.S. designed research; J.B., M.tT., F.B., A.T.J.B., L.A.R., and L.L-L. performed research; J.B. and M.tT. analyzed data; and J.B., F.B., A.T.J.B., L.A.R., L.L-L., and M.S. wrote the paper.

Competing interest statement: L.L-L. received an honorarium for consulting from Happify, Inc. in September 2020. Happify had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

This article is a PNAS Direct Submission.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

1 To whom correspondence may be addressed. Email: [email protected]

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2102061118/-/DCSupplemental.

Published July 23, 2021.

PNAS 2021 Vol. 118 No. 30 e2102061118
https://doi.org/10.1073/pnas.2102061118

Depression is a leading contributor to the burden of disability worldwide (1, 2). Some evidence indicates that disability attributed to depression has been rising over the past decades, particularly among youth (3–5). Can societies collectively become more or less depressed over time, as their populations are exposed to stressors such as economic collapse, food insecurity, inequality, disease, war, and political upheaval (6, 7)? This question is difficult to answer for long time scales because formal diagnostic criteria were introduced only over 40 y ago and these criteria have undergone changes over time (8).

Depression is associated with distinct and recognizable maladaptive thinking patterns, referred to as cognitive distortions, wherein individuals think about themselves, the world, and the future in inaccurate and overly negative ways (9–12). For example, a cognitive distortion seen in depression occurs when individuals label themselves in negative, absolutist terms (e.g., “I am a loser”). They may talk about future events in dichotomous, extreme terms (e.g., “My meeting will be a complete disaster”) or make unfounded assumptions about someone else’s state of mind (e.g., “Everybody will think that I am a failure”). Typologies of cognitive distortions generally differentiate between a number of partially overlapping types, such as “catastrophizing,” “dichotomous reasoning,” “disqualifying the positive,” “emotional reasoning,” “fortune telling,” “labeling and mislabeling,” “magnification and minimization,” “mental filtering,” “mindreading,” “overgeneralizing,” “personalizing,” and “should statements.”

The theory underlying cognitive-behavioral therapy (CBT), the gold standard for the treatment of depression and other internalizing disorders (13), holds that cognitive distortions are associated with internalizing disorders; they reflect negative affectivity and avoidant behavioral patterns in the context of environmental stress (14, 15). Language is closely intertwined with this dynamic. In fact, recent research shows that individuals with internalizing disorders express significantly higher levels of cognitive distortions in their language (16, 17), to the point that their prevalence may be used as an index of vulnerability to depression (18, 19).

Here, we leverage the connection between depression and language to investigate whether societies as a whole, similar to individuals with depression, can undergo changes in their collective language that are associated with cognitive distortions. We analyze the prevalence of a large set of markers of cognitive distortions over the past 125 y. Specifically, we examine the longitudinal prevalence of a collection of hundreds of short sequences of one to five words (n-grams), labeled cognitive distortion schemata (CDS), in more than 14 million books (Google Books) published in English, Spanish, and German. The CDS n-grams were designed by a team of CBT experts, computational linguists, and bilingual native speakers, and externally validated by a panel of CBT experts, to capture the expression of the 12 core types of cognitive distortions (9). The CDS n-grams were designed as short, unambiguous, and stand-alone statements that express a particular cognitive distortion type, using highly frequent terms (Fig. 1 and SI Appendix, Tables S1–S3). For example, the 3-gram “I am a” captures a labeling and mislabeling distortion, regardless of its context or the precise labeling involved (“lady,” “honorable person,” “loser,” etc.). These same n-grams were in earlier research shown to be significantly more

COMPUTER SCIENCES PSYCHOLOGICAL AND COGNITIVE SCIENCES

Fig. 1. Examples of CDS n-grams shown inside gray boxes, surrounded by plausible context words that may vary without affecting whether the n-gram marks the expression of a cognitive distortion of the given type (e.g., mindreading, emotional reasoning, or labeling and mislabeling). CDS were designed by a team of CBT experts, linguists, and native language speakers to capture the expression of a particular cognitive distortion type, regardless of its specific lexical context. For English (US), Spanish, and German the team of experts defined respectively 241, 435, and 296 n-grams to mark 12 commonly distinguished types of cognitive distortions. Note that our prevalence measurements count only the CDS n-gram occurrence regardless of context (“everyone thinks,” “still feels,” and “I am a”). A complete list of all CDS n-grams by distortion type is provided in SI Appendix, Tables S1–S3.

prevalent in the language of individuals with depression vs. a random sample (17).

To account for changes in publication volume, for each CDS n-gram we define its prevalence in a given year as the number of times it occurred that year in the Google Books data divided by the total volume published (estimated from end-of-sentence punctuation numbers). All resulting time series are converted to z scores, to provide the same scale of comparison between different CDS n-grams, and compared to a null model of randomly chosen n-grams for the same years and set of books (Materials and Methods).

Results

We perform this analysis for three unique geographic and linguistic spheres: 1) the United States of America (US English), 2) the German-speaking countries (German), and 3) all Spanish-speaking countries (Spanish). English (US), Spanish, and German were chosen as the focal points of our analysis because they share a common alphabet and a common history, and vary in terms of whether they are spoken as a first language either in a particular geographic region (US books only and German-speaking countries) or across several continents (Spanish) as a control. We limit our analysis to the range 1895 to 2019 since it provides 125 y of persistently high publishing volume for all three languages and few grammatical, orthographic, or spelling changes that would affect our analysis. Although books published in a specific language in a particular geographic area are not necessarily a representative reflection of society as a whole, persistent language trends over decades and centuries, observed from tens of millions of books, have in previous research been shown to signal cultural, linguistic, and psychological changes (18, 20–28).

Trends for English (US), Spanish, and German. We first examine the history of the median prevalence (z scores) of the entire set of English cognitive distortion schemata (n = 241) in English (US) books (N = 9,018,119, United States only), from 1895 to 2019 (Fig. 2A). Since these data pertain only to books published in the United States, we mark notable events in US history or notable changes in the time series: the end of the century in 1899; the start of World War I; the financial collapse of 1929; the start of World War II; a peak of CDS prevalence in 1968; and distinct trend changes in 1978, 1999, and 2007.

The overall trend of CDS prevalence for most of the 20th century pointed distinctly downward toward a historic minimum in 1978, with only a few noticeable peaks: one surrounding the turn of the century in 1899 (possibly related to the Spanish–American war), a slight peak from 1940 to 1945 (around the time of World War II), and a sharp peak in 1968 (possibly related to social and political unrest). From 1978, we observe an accelerating increase in CDS prevalence. This acceleration seems to be separated into three periods: an accelerating increase from 1978 to 1999 (where CDS prevalence first exceeds levels observed in the 1910s), an even more rapid increase after 1999 to roughly 2007, followed by an acceleration after 2007, and a possible stabilization in 2010. The so-called “bursting of the dot.com bubble” seems to coincide with an acceleration of the increase of CDS prevalence after 1999, whereas the acceleration since 2007 seems to coincide with the widespread uptake of social media and the start of the Great Recession. Present CDS prevalence levels exceed those observed since the 1900s by almost two standard deviations (excepting the 1899 peak).

We include Spanish in our analysis as a control vs. English (US) and German since it is not confined to a particular geographic region; i.e., these data encompass all books published in Spanish, which includes Spain (Europe) and most of Latin America (N = 1,658,438 books). The prevalence of Spanish CDS markers (N = 435 n-grams) remains quite stable throughout the 20th century, with a moderate increase around the start of World War I, a very short moderate spike in 1929, and an upward


Fig. 2. (A–C) Median z scores of time series of CDS n-gram prevalence from 1895 to 2019 (125 y) in US English (A), Spanish (B), and German (C) with year markers added for major historical events. All time series reveal stable or declining levels for most of the 20th century followed by a sharp surge of cognitive distortions in the past three decades. US English shows declining levels from 1899 to 1978, with minor peaks around 1914 and 1940 (World War I and World War II) and notably 1968. This decline is followed by a surge of CDS prevalence starting in 1978 that continues to 2019. For Spanish we find stable levels from 1895 to the early 1980s, at which point a trend begins toward CDS prevalence levels above any of those previously observed. German shows stable CDS prevalence levels, with the exception of strong peaks around and after World War I and World War II, until 2007, at which point a sudden surge occurs.
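
The normalization behind these median z-score curves can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the toy counts, and the use of a population standard deviation are all hypothetical, and the yearly `volume` stands in for the total volume that the paper estimates from end-of-sentence punctuation counts.

```python
from statistics import mean, median, pstdev

def prevalence(counts, volume):
    """Yearly prevalence: occurrences divided by that year's total volume."""
    return {year: counts[year] / volume[year] for year in counts}

def z_scores(series):
    """Standardize one n-gram's yearly series over the whole period.

    Assumes the series is not constant (a zero standard deviation
    would make the division undefined).
    """
    values = list(series.values())
    mu, sigma = mean(values), pstdev(values)
    return {year: (v - mu) / sigma for year, v in series.items()}

def median_z(series_by_ngram):
    """Median z score across all n-grams, computed separately per year."""
    years = sorted(next(iter(series_by_ngram.values())))
    return {y: median(s[y] for s in series_by_ngram.values()) for y in years}

# Toy, made-up numbers for illustration only
volume = {2000: 1000, 2001: 2000, 2002: 4000}
counts = {
    "i am a": {2000: 10, 2001: 30, 2002: 80},
    "will never": {2000: 5, 2001: 10, 2002: 40},
    "everyone thinks": {2000: 2, 2001: 4, 2002: 4},
}
z = {ngram: z_scores(prevalence(c, volume)) for ngram, c in counts.items()}
trend = median_z(z)
```

Standardizing each n-gram before taking the median keeps very frequent n-grams from dominating the aggregate curve, which is the stated purpose of the z-score conversion.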

Fig. 3. CDS prevalence for English, Spanish, and German superimposed with a null-model estimate of random n-gram prevalence. Colored bands indicate 95% confidence intervals of yearly z-score values estimated with 10,000-fold bootstrap of the set of individual CDS time series. Gray band indicates 95% confidence interval of a null model of 10,000 sets of 241 randomly chosen n-grams with the same length distribution as the English (US) CDS set (Materials and Methods, Null Model).

departure from a 30-y downward trend in 1953, after which levels seem to level off until 1984 (Fig. 2B). Starting in 1984, however, we observe the same hockey-stick pattern as we saw for English (US): a sharp acceleration of an upward trend starting in 1984 that seems to accelerate in 2008, leading to present CDS prevalence levels that exceed the historical baseline by more than one standard deviation.

The pattern of prevalence for German (Fig. 2C) books (N = 3,843,962) provides face validity for the ability of our CDS markers (N = 296 n-grams) to capture significant moments of stress in a population, since they match major historical and geopolitical events that are specific to German-speaking countries (predominantly Germany and Austria). Contrary to English and Spanish CDS, prevalence levels start relatively low since the 1900s, but increase sharply around the start of World War I, reaching a peak in 1920 and 1923, coinciding with the aftermath of World War I and a devastating recession in Germany in 1923. Throughout the existence of the Weimar Republic, we observe decreasing levels of CDS markers.

However, this trend is interrupted in 1932, at which point cognitive distortion levels increase sharply. This period includes major social upheaval, economic struggles, the end of the Weimar Republic, the emergence of the Nazi regime, and the start of World War II. CDS prevalence levels increase rapidly during World War II, reaching their peak in 1946, the year after Germany was defeated. CDS prevalence declines sharply afterward and reaches a stable plateau throughout the 1950s to 2007, with only a minor peak in 1962 and no indications of accelerating CDS prevalence levels during the 1970s or 1980s as we observe for English (US) and Spanish. Notably, in 2007, at the start of the worldwide Great Recession, we see a nearly immediate increase of CDS prevalence levels to nearly two standard deviations above the historical mean.

Null Model of Randomly Chosen N-Grams. We compare the pattern of change of the prevalence of CDS terms in all three languages, English (US), Spanish, and German (95% confidence intervals; Materials and Methods, Bootstrapping), against a null model of 10,000 sets of 241 randomly chosen n-grams (Fig. 3). These sets of random n-grams were sampled from all n-grams in the respective English, Spanish, and German Google n-gram corpus such that they have the same number of 1- to 5-grams as the respective CDS set and the same bias toward recently published books due to increased publication volume over time (Materials and Methods, Null Model).

We provide annotations that mark significant historical events in the graph that have affected the three populations, such as the financial crisis of 1929 (“Wall St. crash”), the two World Wars, and the great recession starting in 2007. For English (US), Spanish, and German, CDS prevalence levels significantly exceed those of this null model in recent decades, but for Germany also during World War I and World War II. Note that in the case of (US) English, levels fall below those of the null model from the 1920s to the 1990s (SI Appendix, Fig. S6).

Trends for Distinct Cognitive Distortion Types. We plot the time series of yearly mean CDS prevalence separated by 12 commonly recognized cognitive distortion types (14) for English (US), Spanish, and German (Fig. 4). For all three languages we see the characteristic hockey-stick signature across most cognitive distortion types: stable or declining prevalence levels followed by a surge during the period 1980 to 2010 to levels well above the historical mean. The one exception is should statements, which, due to their grammatical structure, may be difficult to translate to n-grams uniquely associated with the specific cognitive distortion.

For English we frequently see a “tilted hockey stick” pattern where certain types of CDS n-grams declined over the 20th century followed by a rapid surge of prevalence since 1978. This is the case for labeling and mislabeling, magnification and minimization, mindreading, and perhaps most pronounced for dichotomous reasoning, suggesting that these distortion types are likely responsible for the decline of CDS prevalence observed throughout the 20th century (Fig. 2A). We also find slight peaks for catastrophizing, emotional reasoning, fortune telling, overgeneralizing, and mindreading at the time of the US involvement in World War II. For German, however, we see peaks surrounding World War I and World War II for dichotomous reasoning, labeling and mislabeling, mental filtering, mindreading, overgeneralizing, personalizing, fortune telling, and should statements, possibly indicating the widespread effects of the two World Wars on German language use.

Robustness and Limitations. As our observations could be caused by a number of effects and biases, we conducted a number of mitigation controls and sensitivity analyses to test alternative explanations for the observed patterns.

Language effects. We caution that changes in the meaning or rhetorical usage of CDS n-grams over time could potentially bias our results. As an example of semantic shift, the 1-gram “ever” may acquire a different meaning and hence lose its meaning as a marker of a cognitive distortion of the dichotomizing


Fig. 4. (A–L) CDS n-gram prevalence from 1895 to 2019 (median z score smoothed by 10-y rolling mean), for English, Spanish, and German, grouped by cognitive distortion type, namely (A) catastrophizing, (B) dichotomous reasoning, (C) disqualifying the positive, (D) emotional reasoning, (E) fortune telling, (F) labeling and mislabeling, (G) magnification and minimization, (H) mental filtering, (I) mindreading, (J) overgeneralizing, (K) personalizing, and (L) should statements. Nearly all time series reveal a universal hockey-stick pattern of recently surging CDS n-gram prevalence levels across cognitive distortion types. The value C indicates the log (base 10) of the total frequency of CDS n-grams in the specific cognitive distortion category as an indication of the order of magnitude of its contribution to our observations.
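
The two quantities named in this caption, the 10-y rolling mean and the value C, can be illustrated as follows. This is a hedged sketch: the caption does not say whether the rolling window is trailing or centered, so a trailing window is assumed here, and both function names and the toy inputs are hypothetical.

```python
import math

def rolling_mean(values, window=10):
    """Trailing rolling mean; positions with fewer than `window` points yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet for a full window
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out

def order_of_magnitude(total_frequency):
    """C as described in the caption: log base 10 of a category's total frequency."""
    return math.log10(total_frequency)

# Toy inputs for illustration only
smoothed = rolling_mean([float(v) for v in range(1, 13)], window=10)
c_value = order_of_magnitude(1_000_000)
```

With a trailing window, the first nine years of a series have no smoothed value; a centered window would instead drop points at both ends.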

type. We perform several controls to account for changes in language over time. First, the CDS n-grams consist predominantly of words that have been among the most frequent since 1895 [mean word percentile among all 1-grams M(Pr) = 99.885, SD = 0.346; SI Appendix, Fig. S1]. The CDS n-grams have equally been among the most frequent since 1895 [mean CDS n-gram percentile among all 2- to 5-grams M(Pr) = 0.946, SD = 0.010; SI Appendix, Fig. S2]. Hamilton, Leskovec, and Jurafsky (25) quantify semantic shifts over historical time using word embeddings, showing that frequent words experience the lowest rate of change, scaling with an inverse power law of word frequency. Hence the rate of semantic shift of the words in our CDS n-grams and the CDS n-grams themselves could be among the lowest as well. Second, a trend toward shorter sentences (29) may provide alternative explanations of our observations, but although sentence length did decrease from 1890 to the 1920s in English, it has remained stable since (30). Furthermore, our analysis accounts for changes in sentence length by normalizing n-gram prevalence by the frequency of end-of-sentence punctuation for that year (Materials and Methods, Time Series: Prevalence and Normalization). Finally, we previously showed that the prevalence of the CDS n-grams in the language of individuals with depression is not affected by the emotional valence of the n-grams or the presence of personal pronouns (17); hence, a language trend toward more emotional language or use of personal pronouns is not likely to affect our results.

Sampling effects. There are several issues that could arise from our Google Books sample. First, Pechenick et al. (31) show indications of a possible increase in technical writing and nonfiction in the Google Books sample over the past decades. Since our CDS n-grams contain personal pronouns, common verbs, and adjectives that may refer to personal matters, if the amount of technical writing and nonfiction increased, one could hypothesize this could explain a decrease in CDS prevalence. However, we observe the opposite, a significant increase of CDS prevalence.

Second, the choice of CDS n-grams could lead to a “recency bias” in our results, explaining their rise in prevalence in recent decades. We control for this effect with a null model that samples random n-grams more frequently from recent books, due to rapidly increasing publication volume since 1895, thereby inducing a bias toward more recent language. We observe increases of CDS n-gram prevalence well above levels predicted by this null model (Fig. 3). Hence, a recency bias alone is unlikely to explain the observed surge in CDS prevalence in recent decades relative to this null model.

Finally, all n-grams in the English (US), Spanish, and German CDS sets occurred in every year from 1895 to 2019, indicating they were in continuous use throughout this period. They were highly frequent from 1895 to 2019, in fact on average more frequent than 94.6% (SD = 0.0103) of all n-grams in the Google Books data (SI Appendix, Figs. S1 and S2). We furthermore bootstrap our prevalence estimates to gauge the sensitivity of our findings to random changes in the set of CDS n-grams over time (Materials and Methods, Bootstrapping). The narrow 95% CI bands (Fig. 3) throughout the period under consideration indicate the stability of our observations over time.

CDS limitations. We caution that although the Google Books data have been widely used to assess cultural and linguistic shifts, and they are one of the largest records of historical literature, it remains uncertain whether CDS prevalence truly reflects changes in societal language and societal wellbeing. Many books included in the Google Books sample were published at times or locations marked by reduced freedom of expression, widespread propaganda, social stigma, and cultural as well as socioeconomic inequities that may reduce access to the literature, potentially reducing its ability to reflect societal changes. Although CDS n-gram prevalence was shown to be higher in individuals with depression (17) and our composition of CDS n-grams closely follows the framework of cognitive distortions established by Beck (9), they do not constitute an individual diagnostic criterion with respect to authors, readers, and the general public. It is also not clear whether the mental health status of authors provides a true reflection of societal changes, nor whether cultural changes may have taken place that altered the association between mental health, cognitive distortions, and their expression in language.

Discussion

While the differences between the languages are interesting, perhaps the most important point is that the expression of cognitive distortions increases for all three languages in the recent three decades, leading to a distinct hockey-stick pattern indicating a surge of the CDS prevalence levels, which serve as lexical markers of cognitive distortions.

We can only speculate on the possible underlying causes of the observed surge of CDS prevalence for these three languages, since our results do not establish any causal mechanisms. The

4 of 7 | PNAS Bollen et al. https://doi.org/10.1073/pnas.2102061118 Historical language records reveal a surge of cognitive distortions in recent decades Downloaded by guest on September 27, 2021 Downloaded by guest on September 27, 2021 itrcllnug eod eelasreo ontv itrin nrcn decades recent in distortions cognitive of surge a reveal records language Historical al. et Bollen igitcdnmc “utrmc” 2) hl acknowledging while and (21), cultural (“culturomics”) important of opportu- dynamics investigation unique linguistic quantitative a the provide for our may nity centuries with back align going languages to seems decades recent depression in of observations. (3–5) prevalence rising anxiety The large eco- and changes. cultural, lan- that social pervasive and possibility three by nomic, the stressed in increasingly to large are points distortions populations a and this cognitive distortions of disorders, cognitive of between internalizing expression association markers the the Given lexical guages. of of results levels our set high to drivers, inspire economic respect historically may or with indicate social, hope speculation comments cultural, underlying of we above the Regardless that the research. and speculations distor- follow-up cognitive disorders, constitute markers, internalizing therefore lexical and between tions, relationship the to and reasoning, reasoning, emotional dichotomous catastrophizing. overgeneralizing, mislabeling), us-vs.-them cor- (47), particular and may mindreading in polarization (labeling have a (46), such at distortions may thinking of cognitive 44) language (39–42) to (43, The widespread media respond polarization (45). political social The level and and global 1970s. societal internet, Web, greater late the Wide driven as the decades- World such this in technologies the of communication started effects of recession that the adoption great compounded trend The have (38). 
strong increases of CDS prevalence in German n-grams during World Wars I and II are validating with respect to the ability of our CDS n-grams to signal societal dynamics in times of turmoil and run counter to the hypothesis that our results are caused by a recency bias in our choice of CDS n-grams and the Google Books sample. In fact, the surge of CDS prevalence during and right after World War II may be the product of a detrimental combination of war experience and National Socialist propaganda. While there was not a separate language of National Socialism (32), the discourse of National Socialism invaded many registers of everyday language use, including political speech, and thus normalized the agenda (33, 34). In particular, the language of National Socialism is shaped by a discourse of national identity that emphasizes an us-them divide (35), which relates to several markers of CDS n-grams, e.g., of dichotomous reasoning, mindreading, labeling and mislabeling, and fortune telling. The tumultuous period of 1959 to 1962 in Germany cemented the division of East and West when the Berlin Wall was built in 1961. Interestingly, the fall of the Wall in 1989 did not result in higher amounts of psychological distress (36) and also does not register on n-gram prevalence. The fluctuations of German CDS data show a period of stability from 1962 up to the rapid increase after 2007.

Other differences between the dynamics of cognitive distortions in the three language corpora we analyzed might also point to relevant drivers. For instance, in Spanish and English we see a rising trend starting around 1980, whereas in German there is no such rise, only a sharp jump to a higher level in 2007. It is possible that the reunification of Germany in 1990 and the integration of German-speaking countries in the European Union (and the introduction of the Euro currency in 1999) provided increased resilience to trends recorded in Spanish and English prior to 2007.

It is suggestive that the timing of the US surge in CDS prevalence coincides with the late 1970s when wages stopped tracking increasing work productivity. This trend was associated with rises in income inequality to recent levels not seen since the 1930s (37). This phenomenon has been observed for most developed economies, including Germany, Spain, and Latin America, contemporaneous with the rapid growth of automation and of demand for highly skilled labor. [...]

We caution that we make no causal claims with respect [...]

The availability of large-scale historical records of published [...] limitations with respect to verifying hypotheses and testing the causal mechanisms that underlie any observations from these data. Future work may contribute to a better understanding of how changes in the collective psychology of societies can be observed over time in their language and how these changes are manifested in response to a variety of cultural and socioeconomic challenges, for example from quantitative indicators of semantic shifts in word meaning (25).

Materials and Methods

Google Books N-Gram Data. We used the third version 2019 release of the Google Books n-gram data, which the Google Books team makes freely available online (https://storage.googleapis.com/books/ngrams/books/datasetsv3.html). The data span from the 16th century to 2019, with increasing coverage of later years as the yearly publication volume grew rapidly.

Cognitive Distortion N-Grams. A panel of CBT experts engaged in a collaborative design effort to compile a set of 241 English n-grams (sequences of n = 1, 2, 3, 4, and 5 words) that were deemed to indicate the expression of a cognitive distortion of a particular type. These CDS n-grams were designed to consist of simple, frequent, and nontechnical expressions, e.g., “I am a,” “he thinks,” “will be,” etc., designed to be stand-alone expressions of cognitive distortions regardless of their specific context. For example, “I am a [loser]” and “I am a [defeated gentlemen]” both constitute an expression of labeling and mislabeling captured by the “I am a” 3-gram.

The set of 241 English n-grams was subsequently translated from English into Spanish and German. This translation mainly focused on retaining the cognitive distortion expression of an n-gram in the target language. All translations were collaboratively compared, back translated, and validated by members of our team of CBT experts and native language speakers to ensure consensus. Since CDS n-grams are short expressions of one, two, three, four, or five frequent words, many have nearly literal translations, e.g., “I am a” translates to “Yo soy un” and “yo soy una.” We provide the complete lists of all English, Spanish, and German CDS n-grams in SI Appendix, Tables S1–S3.

The number of CDS n-grams is higher in Spanish and German than in English since the former need to capture grammatical variations such as cases, inflections, conjugations, and gender. Some n-grams were translated to regular expressions (RE) to succinctly capture all possible lexical and grammatical variations of the same CDS in Spanish and German. For example, the English “I am a” was translated to REs to match both the male and female gender of the indefinite article in Spanish, “Yo soy un(a),” and German, “ich bin ein(e,em,er,en,es),” which matches all possible grammatical variations (including misspellings or errors). Note that in German the operative verb or term is frequently at the end of the sentence. For example, in the case of German the n-gram “I never” was translated to the regular expression “Ich (.+) nie,” matching any 3-, 4-, and 5-gram that starts with “Ich” (I) and ends in “nie” (never), such as “Ich hatte nie” (I never had), “Ich habe nie” (I never have), and “Ich war nie” (I never was).

All n-grams and REs were matched in a case-insensitive manner against the Google Books data to capture the widest possible lexical variation across time, including all capitalizations. One given CDS, depending on capitalization and grammatical variations, could match multiple expressions. For example, the German regular expression “Ich bin ein(e,es,em,...)” could match any of “ich bin ein,” “Ich bin eine,” “ich bin eines”, etc., including some variations that are rare or even grammatically incorrect (e.g., other letters capitalized). All such matches for each individual n-gram were summed into a single compound time series for the specific CDS such that the most frequent forms have the highest relative weight in the subsequent analysis.

Each match retrieved the complete time series of n-gram frequencies for the specific n-gram and matching RE provided. The time series values correspond to the number of times the n-grams occurred in the Google Books data in a particular year, from the earliest to the latest year. Note that the earliest books in the Google Books dataset were published in the 15th century, a period marked by low publication volumes and variances in spelling and grammar. As mentioned, our analysis was limited to the period 1895 to 2019 to capture the end of the 19th century, most of the 20th century, and the past two decades, a period of relatively high publishing volume, stable orthographic standards, and relatively low variance of our CDS prevalence data (SI Appendix, Fig. S4). Each n-gram in the CDS set of each language occurred in every year from 1895 to 2019, indicating continuous coverage for all individual CDS n-grams.

PNAS | https://doi.org/10.1073/pnas.2102061118
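The case-insensitive RE matching and the summing of all matching variants into one compound series, as described under Cognitive Distortion N-Grams, can be illustrated with a minimal sketch. It is not the authors' code: the toy `records` list, the `PATTERNS` table, and the helper `compound_series` are hypothetical names introduced here, the patterns are one possible rendering of the REs quoted in the text, and the real Google Books files have a different layout.

```python
import re
from collections import Counter

# Hypothetical toy records (n-gram, year, count); an assumption for illustration only.
records = [
    ("ich bin ein", 1950, 10),
    ("Ich bin eine", 1950, 4),
    ("ICH BIN EINES", 1950, 1),   # unusual capitalization still counts
    ("ich bin kein", 1950, 7),    # different n-gram, must not match
    ("Ich habe nie", 1950, 3),
    ("ich war wirklich nie", 1951, 2),
    ("ich nie", 1951, 5),         # nothing between "ich" and "nie", so no match
]

# One possible rendering of the REs quoted in the text, compiled case-insensitively.
PATTERNS = {
    "ich bin ein": re.compile(r"ich bin ein(?:e|em|er|en|es)?", re.IGNORECASE),
    "ich ... nie": re.compile(r"ich .+ nie", re.IGNORECASE),
}

def compound_series(pattern, rows):
    """Sum the yearly counts of every n-gram that the RE matches in full."""
    totals = Counter()
    for ngram, year, count in rows:
        if pattern.fullmatch(ngram):
            totals[year] += count
    return dict(totals)

ein_series = compound_series(PATTERNS["ich bin ein"], records)
nie_series = compound_series(PATTERNS["ich ... nie"], records)
print(ein_series, nie_series)
```

Because the counts of all matching forms are added together, the most frequent surface forms dominate the compound series, which is the weighting the text describes.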

COMPUTER SCIENCES PSYCHOLOGICAL AND COGNITIVE SCIENCES

Time Series: Prevalence and Normalization. The volume of books published has increased significantly over the past two centuries, punctuated by declines at times of economic collapse and war (SI Appendix, Fig. S3). The frequency of occurrence of any specific n-gram will therefore fluctuate accordingly, since it is recorded from a changing sample of books published. We therefore determine yearly n-gram prevalence by normalizing the observed yearly frequency of each n-gram by the total yearly volume published. We estimate the latter by summing the yearly frequency of periods, exclamation points, and question marks (“.,” “!,” and “?”), three punctuation symbols that are used in English, Spanish, and German to mark the end of a sentence. Their frequency indicates publication volume as the number of sentences published (SI Appendix, Fig. S4), accounting for possible changes in writing style toward shorter or longer sentences. Although periods, exclamation points, and question marks may express different meanings over time, here we record their frequency only to mark the end of sentences. The ratio of periods vs. all end-of-sentence punctuation remained stable from 1895 to 2019 (M = 0.9633, SD = 0.0103), with periods approximately 26 times more frequent than exclamation points and question marks.

More formally, we determine our prevalence time series as follows. We define a set $C$ of $k$ n-grams $c_i \in C = \{c_1, c_2, \ldots, c_k\}$, where the number of words in each n-gram is its length $n \in \{1, 2, 3, 4, 5\}$. We denote the yearly time series of the prevalence of any n-gram $c_i$ as the set $X_j(c_i)$, where $j$ refers to a year in the ordered set $(1855, 1856, \ldots, 2020)$. Each value $x_j \in \mathbb{N}^+$ of $X_j(c_i)$ represents the n-gram’s prevalence, i.e., the total number of occurrences of n-gram $c_i$ in the books published in year $j$, which we denote $x_j(c_i)$.

Since the raw prevalence $x_j(c_i)$ will fluctuate with the total volume of text published in a given year $j$, denoted $V_j$, we normalize the prevalence $x_j(c_i)$ with an estimate of $V_j$. We use the total number of yearly occurrences of end-of-sentence punctuation to estimate the volume of text published, $N_j = X_j(.) + X_j(!) + X_j(?)$; hence, the volume-normalized time series of n-gram $c_i$ is given by $\bar{X}_j(c_i) = X_j(c_i)/N_j$ for every year $j$. Note that $N_j$ roughly corresponds to the total number of sentences published, because nearly all sentences are terminated by end-of-sentence punctuation.

However, the magnitude of yearly normalized prevalence values will differ significantly between n-grams of different lengths; e.g., “never” is likely much more prevalent than “they will not believe” for the same volume published, since the former is a 1-gram and the latter a more specific 4-gram. Nevertheless, both may follow a similarly shaped pattern of historical change. Furthermore, the volume of books published increases rapidly over time; hence the variance of the time series of n-gram occurrence may also change over time. To allow comparisons of the patterns of changing prevalence over time between time series of different magnitudes and variance, we subtract the 1895 to 2019 mean from all prevalence time series and divide them by the observed standard deviation over the same period, thus converting all prevalence time series to z scores with respect to their 1895 to 2019 mean $\mu(\bar{X}_j)$ and standard deviation $\sigma(\bar{X}_j)$ as follows:

$Z_j(c_i) = (\bar{X}_j(c_i) - \mu(\bar{X}_j(c_i)))/\sigma(\bar{X}_j(c_i))$.

These time series express the fluctuations of the prevalence of all n-grams on a common scale, namely standard deviations from their historical mean, without altering the pattern of their decline or increase over time. Normalized as such, we can compare the pattern of changing prevalence for any CDS n-gram time series on a common scale. For example, the individual time series of the “I am a” n-gram that marks a labeling and mislabeling cognitive distortion can have very different magnitudes in Spanish and English, but similar patterns of change over time. This is revealed when we plot them at the same z-score scale (SI Appendix, Fig. S5).

Null Model. We define a null model to compare the observed yearly CDS n-gram prevalence fluctuations against. This null model consists of 10,000 sets of $k$ randomly selected n-grams sampled uniformly across the set of n-grams in the Google Books data. To match the continuity of coverage in the CDS n-grams since 1895, each random n-gram was required to occur in at least 100 of 125 y in our analysis period. Each of the resulting 10,000 random sets of n-grams, denoted $\hat{C}_i \in \hat{C} = \{\hat{C}_1, \hat{C}_2, \ldots, \hat{C}_{10,000}\}$, was chosen to replicate the number of n-grams of a given length $n$ as in $C$ (e.g., there are 86 3-grams in $C$ of $k = 241$ total in English [SI Appendix, Table S1], so every $\hat{C}_i$ would also have 86 randomly selected 3-grams of 241 total). We retrieve the Google Books time series for each individual random set $\hat{C}_i$ and normalize them to prevalence values as described above. This results in 10,000 time series that yield a yearly distribution of z scores. We use the 2.5th and 97.5th percentiles of this yearly distribution as a 95% confidence interval showing the diachronic fluctuations of random n-grams in our data, to which we can compare the empirical fluctuations of our selected set of CDS n-grams (Fig. 3 and SI Appendix, Fig. S6).

This null model controls for inherent methodological biases in the normalization of our time series and the Google Books data sample, as well as a possible recency bias in the CDS n-grams since the null model’s n-grams are randomly chosen from the Google Books data, which have grown rapidly in volume since 1895 (SI Appendix, Fig. S3). Therefore, any recent increase in CDS prevalence (e.g., the observed hockey-stick pattern) is compared against a null model that draws n-grams preferentially from recently published books. In addition, the requirement for each null-model n-gram to have at least 100 y of coverage since 1895 also favors more recent n-grams because more data will be available in recent years.

Bootstrapping. Each CDS n-gram corresponds to a yearly z-score time series from 1855 to 2019, yielding a distribution of z scores for each year for the set of CDS n-grams (SI Appendix, Fig. S7). To determine the robustness of our results under random variations of our set of CDS n-grams (e.g., by making different CDS n-gram choices) we calculate the 95% confidence intervals for this distribution by a 10,000-fold random resampling with replacement of the respective set of CDS n-grams (i.e., US English, Spanish, or German). Each such random resample results in a new mean time series for the given random set $C_r$ of n-grams, i.e., $\bar{Z}_{1885,2020}(C_r)$. The resulting yearly distribution of z scores indicates how much our yearly results can vary as a result of random changes to our CDS n-gram set and thereby tests the robustness of our time series results under 10,000 random variations of the set of CDS n-grams. All time series are normalized and converted to z scores before resampling; hence we obtain a distribution of yearly z scores from which the relevant percentiles, including the median, are determined.

Data Availability. Previously published data were used for this work (https://storage.googleapis.com/books/ngrams/books/datasetsv3.html).

ACKNOWLEDGMENTS. J.B. is grateful for support from the Urban Mental Health Institute of the University of Amsterdam, Wageningen University and Research, the NSF (NSF Social, Behavioral and Economic Sciences [SBE] 1636636), and the support of the Indiana University Vice Provost for COVID-19 Research. We thank Jonathan Haidt of New York University who provided the inspiration for this investigation by speculating on a recent societal shift toward a style of thinking that may be associated with internalizing disorders.
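The volume normalization, z-score conversion, and bootstrap confidence intervals described in Materials and Methods can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy data are hypothetical, the population standard deviation is assumed for the "observed standard deviation", and a small number of resamples is used in place of the 10,000 reported.

```python
import random
import statistics

def normalize(counts, punct):
    """Prevalence X_j / N_j, with N_j the yearly sum of '.', '!' and '?' counts."""
    return {year: counts[year] / sum(punct[year].values()) for year in counts}

def zscore(series, start, end):
    """z scores relative to the mean and SD over the reference years [start, end].
    Assumption: population SD (pstdev); the paper only says 'observed' SD."""
    ref = [v for y, v in series.items() if start <= y <= end]
    mu = statistics.mean(ref)
    sigma = statistics.pstdev(ref)
    return {y: (v - mu) / sigma for y, v in series.items()}

def bootstrap_ci(z_by_ngram, n_resamples=1000, seed=0):
    """95% percentile interval of the yearly mean z score when the set of
    n-grams is resampled with replacement."""
    rng = random.Random(seed)
    names = list(z_by_ngram)
    years = sorted(z_by_ngram[names[0]])
    means = {y: [] for y in years}
    for _ in range(n_resamples):
        sample = rng.choices(names, k=len(names))  # resample n-grams, not years
        for y in years:
            means[y].append(statistics.fmean(z_by_ngram[n][y] for n in sample))
    ci = {}
    for y in years:
        # 39 cut points in 2.5% steps: first is the 2.5th, last the 97.5th percentile
        cuts = statistics.quantiles(means[y], n=40, method="inclusive")
        ci[y] = (cuts[0], cuts[-1])
    return ci

# Hypothetical toy data for a single n-gram over three years.
counts = {1895: 2, 1896: 4, 1897: 6}
punct = {y: {".": 90, "!": 5, "?": 5} for y in counts}
z = zscore(normalize(counts, punct), 1895, 1897)

# Hypothetical z-score series for two n-grams, used to illustrate the bootstrap.
z_by_ngram = {"a": {1895: -1.0, 1896: 1.0}, "b": {1895: 1.0, 1896: -1.0}}
ci = bootstrap_ci(z_by_ngram, n_resamples=200)
```

Resampling whole n-grams rather than individual yearly values keeps each n-gram's time series intact, which matches the stated goal of testing robustness to different choices of the CDS n-gram set.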

1. P. E. Greenberg, A. A. Fournier, T. Sisitsky, C. T. Pike, R. C. Kessler, The economic burden of adults with major depressive disorder in the United States (2005 and 2010). J. Clin. Psychiatr. 76, 155–162 (2015).
2. World Health Organization, Depression and Other Common Mental Disorders: Global Health Estimates (WHO, Geneva, Switzerland, 2017).
3. A. Case, A. Deaton, Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proc. Natl. Acad. Sci. U.S.A. 112, 15078–15083 (2015).
4. R. Mojtabai, M. Olfson, B. Han, National trends in the prevalence and treatment of depression in adolescents and young adults. Pediatrics 138, e20161878 (2016).
5. K. M. Keyes, D. Gary, P. M. O'Malley, A. Hamilton, J. Schulenberg, Recent increases in depressive symptoms among US adolescents: Trends from 1991 to 2018. Soc. Psychiatr. Psychiatr. Epidemiol. 54, 987–996 (2019).
6. J. A. Goldstone et al., A global model for forecasting political instability. Am. J. Polit. Sci. 54, 190–208 (2010).
7. J. DeVylder, L. Fedina, B. Link, Impact of police violence on mental health: A theoretical framework. Am. J. Publ. Health 110, 1704–1710 (2020).
8. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (Am. Psychiatric Assoc., ed. 5, 2013).
9. A. T. Beck, Thinking and depression: I. Idiosyncratic content and cognitive distortions. Arch. Gen. Psychiatr. 9, 324–333 (1963).
10. A. T. Beck, Thinking and depression: II. Theory and therapy. Arch. Gen. Psychiatr. 10, 561–571 (1964).
11. D. Burns, The Feeling Good Handbook (Harper-Collins Publishers, 1989).
12. J. S. Beck, A. T. Beck, Cognitive Therapy: Basics and Beyond (Guilford Press, New York, NY, 1995).
13. L. Lorenzo-Luaces, The evidence for cognitive behavioral therapy. JAMA 319, 831–832 (2018).
14. A. T. Beck, E. A. Haigh, Advances in cognitive theory and therapy: The generic cognitive model. Annu. Rev. Clin. Psychol. 10, 1–24 (2014).
15. D. A. Clark, A. T. Beck, Cognitive theory and therapy of anxiety and depression: Convergence with neurobiological findings. Trends Cognit. Sci. 14, 418–424 (2010).
16. W. Bucci, N. Freedman, The language of depression. Bull. Menninger Clin. 45, 334 (1981).

17. K. C. Bathina, M. Thij, L. Lorenzo-Luaces, L. A. Rutter, J. Bollen, Depressed individuals express more distorted thinking on social media. arXiv:2002.02800 (7 February 2020).
18. M. Al-Mosaiwi, T. Johnstone, In an absolute state: Elevated use of absolutist words is a marker specific to anxiety, depression, and suicidal ideation. Clin. Psychol. Sci. 6, 529–542 (2018).
19. J. C. Eichstaedt et al., Facebook language predicts depression in medical records. Proc. Natl. Acad. Sci. U.S.A. 115, 11203–11208 (2018).
20. G. A. Miller, The Science of Words (Scientific American Library Series, W. H. Freeman & Co., 1996).
21. J. B. Michel et al., Quantitative analysis of culture using millions of digitized books. Science 331, 176–182 (2011).
22. P. M. Greenfield, The changing psychology of culture from 1800 through 2000. Psychol. Sci. 24, 1722–1731 (2013).
23. M. Davies, Making Google Books n-grams useful for a wide range of research on language change. Int. J. Corpus Linguist. 19, 401–416 (2014).
24. G. Coppersmith, M. Dredze, C. Harman, “Quantifying mental health signals in Twitter” in Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, CLPsych., P. Resnik, R. Resnik, M. Mitchell, Eds. (Association for Computational Linguistics [ACL], Stroudsburg, PA, 2014), pp. 51–60.
25. W. L. Hamilton, J. Leskovec, D. Jurafsky, “Diachronic word embeddings reveal statistical laws of semantic change” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), K. Erk, N. A. Smith, Eds. (Association for Computational Linguistics, Berlin, Germany, 2016), pp. 1489–1501.
26. R. Amato, L. Lacasa, A. Díaz-Guilera, A. Baronchelli, The dynamics of norm change in the cultural evolution of language. Proc. Natl. Acad. Sci. U.S.A. 115, 8260–8265 (2018).
27. P. Lorenz-Spreen, B. M. Mønsted, P. Hövel, S. Lehmann, Accelerating dynamics of collective attention. Nat. Commun. 10, 1759 (2019).
28. T. T. Hills, E. Proto, D. Sgroi, C. I. Seresinhe, Historical analysis of national subjective wellbeing using millions of digitized books. Nat. Hum. Behav. 3, 1271–1275 (2019).
29. R. Iliev, J. Hoover, M. Dehghani, R. Axelrod, Linguistic positivity in historical texts reflects dynamic environmental and psychological factors. Proc. Natl. Acad. Sci. U.S.A. 113, E7871–E7879 (2016).
30. K. Rudnicka, “Variation of sentence length across time and genre” in Studies in Corpus Linguistics, R. J. Whitt, Ed. (John Benjamins Publishing Company, 2018), vol. 85, pp. 220–240.
31. E. A. Pechenick, C. M. Danforth, P. S. Dodds, Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PloS One 10, 1–24 (2015).
32. P. Von Polenz, Deutsche Sprachgeschichte vom Spätmittelalter bis zur Gegenwart (Walter de Gruyter, 1999), vol. 3.
33. U. Maas, “Als der Geist der Gemeinschaft eine Sprache fand”: Sprache im Nationalsozialismus. Versuch einer Historischen Argumentationsanalyse (Springer-Verlag, 2013).
34. V. Klemperer, The Language of the Third Reich: LTI - Lingua Tertii Imperii: A Philologist’s Notebook (Bloomsbury Academic, 2013).
35. G. Horan, Er zog sich die neue Sprache des Dritten Reiches ueber wie ein Kleidungsstueck: Communities of practice and performativity in national socialist discourse. Linguist. Online 30, 57–80 (2007).
36. M. Achberger, M. Linden, O. Benkert, Psychological distress and psychiatric disorders in primary health care patients in East and West Germany 1 year after the fall of the Berlin Wall. Soc. Psychiatr. Psychiatr. Epidemiol. 34, 195–201 (1999).
37. Economic Policy Institute, The State of Working America (Cornell University Press, 2012).
38. UN Department of Economic and Social Affairs, World Social Report 2020: Inequality in a Rapidly Changing World (United Nations, 2020).
39. Y. Kelly, A. Zilanawala, C. Booker, A. Sacker, Social media use and adolescent mental health: Findings from the UK millennium cohort study. EClinicalMedicine 6, 59–68 (2018).
40. B. Keles, N. McCrae, A. Grealish, A systematic review: The influence of social media on depression, anxiety and psychological distress in adolescents. Int. J. Adolesc. Youth 25, 79–93 (2020).
41. M. G. Hunt, R. Marx, C. Lipson, J. Young, No more fomo: Limiting social media decreases loneliness and depression. J. Soc. Clin. Psychol. 37, 751–768 (2018).
42. J. M. Twenge, J. Haidt, T. E. Joiner, W. K. Campbell, Underestimating digital media harm. Nat. Hum. Behav. 4, 346–348 (2020).
43. C. A. Bail et al., Exposure to opposing views on social media can increase political polarization. Proc. Natl. Acad. Sci. U.S.A. 115, 9216–9221 (2018).
44. N. Rodriguez, J. Bollen, Y. Y. Ahn, Collective dynamics of belief evolution under cognitive coherence and social conformity. PloS One 11, 1–15 (2016).
45. T. Carothers, A. O'Donohue, Democracies Divided: The Global Challenge of Political Polarization (Brookings Institution Press, 2019).
46. G. Lukianoff, J. Haidt, The Coddling of the American Mind (Penguin Books, 2015).
47. K. Barasz, T. Kim, I. Evangelidis, I know why you voted for Trump: (Over)inferring motives based on choice. Cognition 188, 85–97 (2019).
