This is the postprint (the non-typeset version reflecting changes made during the peer review process) of a contribution to: Wigham, C. R., & Stemle, E. W. (eds.) (2019). Building Computer-Mediated Communication Corpora for Sociolinguistic Analysis. Presses Universitaires Blaise Pascal. The content must not be used in a commercial context.

Online Language Ecology: Twitter in Europe

Steven COATS
University of Oulu, Finland

Social media platforms are increasingly used, but knowledge of the online linguistic behavior of populations remains limited. In Europe, the extent to which official languages, immigrant and local languages, and English are used online (i.e. the online language ecology) has not yet been documented in depth. In this study, the online language ecology of Europe on the Twitter platform is characterized from a geographical perspective and considered in relation to the language situations of European countries. A large corpus of Twitter messages with geographical metadata was created by accessing the Twitter APIs. After language detection and filtering, linguistic profiles were created for European countries/territories and language groups and quantified by using an entropy measure. Language-use distributions were compared to results from survey data, and a language network was created based on the likelihood of bi- or multilingual users sharing a language. The results show that high rates of English use can be attested for most European countries and that a positive relationship exists between the size of a linguistic community and the signal of that language on Twitter.

1. Introduction and Background

Recent years have seen an increase in the relative prominence and representation of computer-mediated communication (CMC) modalities such as texting, instant messaging, or posting on social media, and platforms such as Twitter have become global multilingual sites (Leetaru et al. 2013; Mocanu et al. 2013). At the same time, population movements and changes in education and media consumption have contributed towards an increasing bi- and multilingualization of local environments, particularly with English, with trends that are particularly evident in online communicative domains. In Europe, languages with official statuses continue to receive reinforcement in education and media and are not in immediate risk of displacement. Bilingualism with English, however, has become the norm for many Europeans, particularly among young people, and overall linguistic diversity has been increased by migration within Europe and to Europe from other regions.

Despite this, large-scale studies of online language in Europe have often been based on survey data, not documented usage (e.g. European Commission 2012), and empirical research into CMC language choice has typically offered only an overview of the relative frequency of use of particular languages. It was noted in 2007 that in the absence of aggregate empirical work documenting actual online language use, frequently cited studies are often the products of marketing companies whose methods are not transparent (Paolillo 2007). While the situation has changed in the 11 years since, with studies of language choice in populations and language-use profiles at country level having been produced for social media platforms such as Twitter (e.g., Leetaru et al. 2013; Mocanu et al. 2013; Magdy et al. 2014; Graham et al. 2014), many studies of bilingualism or linguistic diversity have focused on individual rather than population-level use, or have not focused specifically on Europe. In some cases, methodological issues pertaining to data collection and filtering have weakened the proposed interpretations of observed patterns. Studies in which online language choice is documented remain few: Lee, reviewing existing research into online language diversity and
multilingualism, notes that in many contexts, a "reliable, quantitative measurement of online linguistic diversity" is lacking (2016: 129).

In light of these factors, this study provides a characterization of the linguistic diversity of European users on the Twitter platform, in the hope that a macro-level approach can "tell us something about where languages stand and where they are going in comparison with other languages of the world" (Haugen 1972: 337). In Section 2, related work on CMC and Twitter language and online language diversity is reviewed. In Section 3, the methods used for the collection of data from Twitter's APIs, data filtering, identification of the geographical localization of users, and detection of tweet language are described, and the entropy measure used to quantify linguistic diversity is introduced. Section 4 presents the results of the analysis in terms of mean linguistic diversity from the perspective of European country or territory entities and languages. The Twitter data is compared to Eurobarometer language survey data (European Commission 2012). Section 5 presents the creation of a language network in which the tendency of bi- and multilinguals to share a language is quantified. In Section 6, the results are discussed and interpreted in light of previous CMC research and survey data. The conclusion notes some methodological issues and future prospects for work on online language diversity.

2. Multilingualism, CMC and Twitter: Related Work

2.1. Language [...]

Extensive research has investigated knowledge of languages, their status in various media or communicative contexts, or the attitudes of speakers towards their use. In the European context, for example, the Eurobarometer surveys conducted by the European Commission (2006; 2012) provide data on self-reported language use and language attitudes. Due to its status as a European lingua franca, particular attention has been paid to English: Increasing knowledge of the language has cemented its "hypercentral" position within the broader language ecology of Europe (De Swaan 2001; Soler-Carbonell 2016), and it has been suggested that in some European societies, English may be replacing local languages in certain functional domains (Görlach 2002; see contributions in Linn 2016 for a discussion).

Twitter-related CMC research has also been heavily focused on English, although not necessarily on English in Europe. Since the introduction of the platform in 2006, research has investigated aspects of the use of English on Twitter including the discourse functions of hashtags (Wikström 2014; Squires 2015), lexical innovation in American English (Eisenstein et al. 2014), African-American Vernacular English on Twitter (Jørgensen et al. 2015), grammatical variation in English-language Twitter from Finland and the Nordic countries (Coats 2016a, 2016b), or the interaction between demographic parameters and language features in American English (Bamman et al. 2014), among other topics.

Several research projects have investigated language choice and multilingualism on CMC platforms. Hong et al. (2010) found different posting behavior by different language groups on Twitter. Maps and language use profiles for individual languages or for individual countries using Twitter data have been produced (e.g. Leetaru et al. 2013; Mocanu et al. 2013; Magdy et al. 2014). It has been shown that geo-located tweets can shed light on international mobility (Moise et al. 2016), and that Twitter data can be used to investigate code-switching (Nguyen et al. 2015). Studies with a specific geographical focus have also been undertaken: For example, Frey et al. (2015) collected data from Facebook users in order to create the DiDi corpus, a 650,000-token multilingual corpus from South Tyrol (reported upon in Section 1 of this book), and other projects have investigated linguistic diversity on social media at city-level or province-level
(e.g. Arnaboldi et al. 2017, Kim et al. 2014). Ronen et al. (2014) quantified multilingualism among book translators, Wikipedia editors, and Twitter users, and found that English plays an important central role. Hale (2014) investigated global multilingual networks on Twitter, including the network associations of retweets and user mentions, and found that while most interaction networks are language-based and English is the most important single mediating language, other languages collectively represent a larger bridging force. Eleta and Golbeck (2014) examined the tweets of 92 multilingual Twitter users and showed that their language choice on Twitter reflects the predominant language of their social networks. While it has been found that users of less represented languages are more likely to switch languages and that English has become the central mediating language, the dynamics of multilingualism in large social media data sets from Europe have not yet been subject to sustained research attention.

3. Data and Methods

3.1. Data Collection and Processing

627,553,806 tweets with place attributes were collected from the Twitter Streaming API from November 2016 until June 2017 using the Tweepy library in Python (Roesslein 2015). Retweets were excluded by the collection script. Of these, 153.2 million contained country codes corresponding to European countries or territories [20], representing approximately 2.9 million unique users. These "seed" users were progressively filtered to remove bots and apps and to identify users local to a particular European country or territory.

[20] The countries and territories considered in the analysis were: Andorra, Albania, Armenia, Austria, Åland Islands, Azerbaijan, Bosnia and Herzegovina, Belgium, Bulgaria, Belarus, Switzerland, Cyprus, Czechia, Germany, Denmark, Estonia, Spain, Finland, Faroe Islands, France, United Kingdom, Georgia, Guernsey, Gibraltar, Greece, Croatia, Hungary, Ireland, Isle of Man, Iceland, Italy, Liechtenstein, Lithuania, Luxembourg, Latvia, Monaco, Moldova (Republic of), Montenegro, Macedonia (FYRO), Malta, Netherlands, Norway, Poland, Portugal, Romania, Serbia, Russian Federation, Sweden, Slovenia, Slovakia, San Marino, Turkey, Ukraine, Vatican City, and Kosovo.

3.1.1. Filtering for Non-human Sources

Many messages on the Twitter platform are created and sent automatically by apps or bots (Haustein et al. 2015). Apps can automatically tweet about the locations a user visits (e.g. Foursquare), the music a user downloads or listens to (e.g. Spotify), local weather conditions (e.g. Sandaysoft Cumulus), or the parameters of a user's exercise activity (e.g. Endomondo). The platform or app used to compose a tweet is indicated in a tweet's metadata in the "source" field. A manual analysis of a selection of tweets showed that widely-used Twitter apps such as Twitter for iPhone or the Twitter web client (i.e. www.twitter.com) are less likely to broadcast automatically-generated text compared to some infrequently-used apps. For this reason, the data was filtered to retain only those tweets broadcast by the following apps: Twitter for iPhone, Twitter for Android, Twitter Web Client, Twitter for iOS, Tweetbot for iOS, Instagram, and Twitter for Windows Phone. Tweets with these sources collectively comprised over 90% of all those by European users.
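The source-based filter described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the script used in the study: tweets are assumed to be dictionaries as delivered by the Twitter API, with the app name wrapped in an HTML anchor in the "source" field; the allow-list is an illustrative subset of the sources named in the text; and the helper names `source_name` and `keep_tweet` are hypothetical.

```python
import re

# Illustrative allow-list (assumption): a subset of the widely-used sources
# named in the text; the exact strings used in the study may differ.
ALLOWED_SOURCES = {
    "Twitter for iPhone",
    "Twitter for Android",
    "Twitter Web Client",
}


def source_name(source_field):
    """Return the app name from a tweet's "source" field.

    Assumes the field holds an HTML anchor such as
    '<a href="..." rel="nofollow">Twitter for iPhone</a>';
    a plain app name without markup is returned unchanged.
    """
    return re.sub(r"<[^>]+>", "", source_field).strip()


def keep_tweet(tweet, allowed=ALLOWED_SOURCES):
    """True if the tweet was sent from one of the allowed sources."""
    return source_name(tweet.get("source", "")) in allowed


# Two made-up tweets: one from a widely-used client, one from an exercise app.
tweets = [
    {"text": "Good morning!",
     "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'},
    {"text": "I just finished running 5 km",
     "source": '<a href="http://www.endomondo.com" rel="nofollow">Endomondo</a>'},
]
human_like = [t for t in tweets if keep_tweet(t)]
print(len(human_like))  # 1
```

An allow-list is used here rather than a block-list because the text describes retaining tweets from a fixed set of sources; any app not explicitly listed is dropped.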

3.1.2. User location

To mitigate the strength of the signal from short-term travelers or other non-locals, "seed" users with fewer than 100 tweets or with less than 90% of tweets with a single country code value were filtered out [21]. The timelines (up to 3,250 tweets) of the resulting set of users were then downloaded from the Twitter REST API using a Python script.

[21] For this data, correlation between the center of the place bounding box and the precise GPS coordinates from the coordinates object, if both were present, was found to be quite high (> 0.98). For this reason, the value of the place field was considered an accurate indication of true user location when posting a tweet.

3.2. Language Detection

In order to accurately describe linguistic diversity at the aggregate level of country/territory or language group, the language used in individual tweets must be characterized accurately. However, detecting the language of individual tweets can pose difficulties due to short message length, non-standard orthography, and language mixture. In addition, depending on the detection algorithm, URL addresses, usernames, and hashtags can lead to inaccurate language detection, as can Unicode character sequences in code points not present in the model data (such as emoji characters), as they rarely correspond to character sequences in natural languages.

As of early 2018, Twitter's own language detection algorithm does not identify some languages that are important in European countries or territories. For this reason, tweets were subject to an additional language identification step using the compact language detector 2 (cld2, Sites 2013). Tweets for which Twitter's native algorithm agreed with cld2 were retained in the data (see Lui and Baldwin 2014). For some languages not detected by Twitter, such as Albanian, Azeri, Faroese, or Nynorsk, the cld2 algorithm was used to assign languages to tweets with a high probabilistic accuracy value.

3.3. Quantification of Linguistic Diversity

Different measures have been proposed to quantify linguistic diversity within populations. Nettle (1999), in the context of an anthropological analysis of linguistic diversity, suggested a measure that incorporates a geographical component by regressing the logarithm of the number of languages within a territory with its area. The standardized residuals of the regression can then be used to quantify linguistic diversity. The disadvantage of such an approach is that it fails to account for population sizes: A territory in which 10 million speakers share one language, and 10 speakers share a second language would have the same diversity value as a territory in which each language has 5 million speakers.

In this study, linguistic diversity is calculated using entropy, a measure introduced by Shannon and Weaver (1949) in order to quantify the information content in character sequences that has been widely used in fields such as population ecology and economics. Shannon entropy (Equation 1) was later shown by Rényi to be related to a more general measure of entropy (Rényi 1961, Hill 1973). In population ecology, species richness (number of distinct types in a population, corresponds to R in Equation 1) is therefore distinguished from species diversity.
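To make the entropy measure concrete, the following Python sketch computes base-2 Shannon entropy from the proportions of a user's tweets in each language and then averages the per-user values by country or territory. It is an illustration under stated assumptions rather than the code used in the study: each user is assumed to be represented by a country code and a list with one language label per tweet, and the function names `shannon_entropy` and `mean_entropy_by_country` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(languages):
    """Base-2 Shannon entropy of the language labels of one user's tweets."""
    counts = Counter(languages)
    total = sum(counts.values())
    # p * log2(1/p) equals -p * log2(p); this form avoids returning -0.0
    # for a user who tweets in a single language.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mean_entropy_by_country(users):
    """Mean per-user entropy for each country.

    `users` is an iterable of (country_code, [language label per tweet]).
    """
    by_country = {}
    for country, languages in users:
        by_country.setdefault(country, []).append(shannon_entropy(languages))
    return {c: sum(values) / len(values) for c, values in by_country.items()}


# Made-up users: one monolingual, one splitting tweets evenly over two languages.
users = [
    ("FI", ["fi", "fi", "fi", "fi"]),
    ("FI", ["fi", "en", "fi", "en"]),
]
print(shannon_entropy(users[0][1]))    # 0.0
print(shannon_entropy(users[1][1]))    # 1.0
print(mean_entropy_by_country(users))  # {'FI': 0.5}
```

A user who writes in one language only has an entropy of 0, and a user whose tweets are spread evenly over R languages reaches the maximum of log2(R), which is 1 for two languages and 2 for four.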

Equation 1: Shannon Entropy

$$H' = -\sum_{i=1}^{R} p_i \cdot \log_2 p_i \qquad (1)$$

In Equation (1), p_i represents the proportion of a population with the type i, where R is the total number of unique types. In this data, the entropy values for each individual Twitter user in the dataset were calculated according to the proportions of all their tweets in each of the R languages in the data set. The value of H' can range from 0 (homogenous data; only one language used) to log2(R) (for a speaker with an equal number of tweets in all of the R languages). To obtain country- and territory-level values, the mean of all users located in that country or territory was calculated.

4. Results

4.1. Language Diversity of European Countries and Territories

Table 1 shows the 55 European countries and territories considered in the data and summarizes the number of users from that country or territory, the total number of tweets, the mean number of tweets per user, the standard deviation of the number of tweets per user, the three most-used languages in the Twitter data for that territory (with the ISO 639-1 code; see Appendix, Table A) and proportion of tweets in that language, as well as the mean entropy value.

Table 1: Data Summary Statistics

Country | Number of Users | Number of Tweets | Mean Tweets/User | St. dev. Tweets/User | Top 3 languages | Mean Entropy
Åland Islands | 3 | 3,829 | 1,276.33 | 603.03 | sv 0.57, en 0.27, fi 0.11 | 1.26
Albania | 26 | 28,146 | 1,082.54 | 844.72 | en 0.55, sq 0.21, ar 0.07 | 1.29
Andorra | 24 | 33,582 | 1,399.25 | 606.94 | es 0.48, ca 0.27, pt 0.13 | 1.48
Armenia | 55 | 65,531 | 1,191.47 | 929.37 | en 0.53, hy 0.22, ru 0.13 | 1.20
Austria | 430 | 693,112 | 1,611.89 | 782.96 | de 0.50, en 0.29, tr 0.04 | 1.03
Azerbaijan | 189 | 284,697 | 1,506.33 | 810.91 | az 0.49, tr 0.30, en 0.08 | 1.62
Belarus | 1,581 | 2,255,095 | 1,426.37 | 659.88 | ru 0.92, en 0.03, uk 0.01 | 0.71
Belgium | 1,708 | 2,271,138 | 1,329.71 | 732.11 | nl 0.42, en 0.27, fr 0.19 | 1.34
Bosnia and Herzegovina | 111 | 107,866 | 971.77 | 498.43 | hr 0.67, en 0.17, sl 0.02 | 1.69
Bulgaria | 166 | 222,557 | 1,340.70 | 736.40 | bg 0.44, en 0.31, ru 0.1 | 1.35
Croatia | 172 | 229,222 | 1,332.69 | 632.53 | hr 0.58, en 0.29, es 0.02 | 1.54
Cyprus | 204 | 264,009 | 1,294.16 | 757.44 | en 0.37, tr 0.35, el 0.16 | 1.03


Liechtenstein Latvia Kosovo Italy Isleof Man Ireland Iceland Hungary Guernsey Greece Gibraltar Germany Georgia France Finland Islands Faroe Estonia Denmark Czechia Romania Portugal Poland Norway Netherlands Montenegro Monaco Moldova Malta (FYR) Macedonia Luxembourg Lithuania

17,570 5,150 1,939 4,073 8,200 3,280 4,189 179 576 102 250 149 211 927 827 438 375 70 40 58 43 42 25 25 50 98 10 1 5 2 3

10,015,203 20,471,336 5,671,408 1,290,447 6,781,950 1,471,721 5,437,972 3,242,639 6,820,822 204,210 928,238 130,058 376,543 292,090 354,905 711,288 573,645 99,014 39,570 82,583 49,939 65,536 39,207 35,087 46,765 79,992 13,422 4,830 2,255 5,582 353

1,729.09 1,960.34 1,682.01 1,392.07 1,403.48 1,618.99 1,165.13 1,779.59 1,860.67 1,414.49 1,623.95 1,529.72 1,140.84 1,055.92 1,672.33 1,611.52 1,674.64 1,342.20 1,423.84 1,275.08 1,161.37 1,560.38 1,506.17 1,221.37 1,568.28 51 1127.5 935.30 816.24 989.25 353.00 966.00

691.85 668.24 868.86 739.30 756.81 790.30 627.70 726.79 784.39 743.53 752.14 825.04 840.90 538.69 807.63 692.80 680.80 307.02 777.50 717.74 740.86 617.04 737.15 780.55 757.69 541.81 667.79 653.18 908.9 3.54 0

en 0.94 en is 0.69 0.50 hu 0.97 en el0.68 0.87 en 0.60 de 0.53 en fr0.83 fi0.67 0.98 en 0.32 ru 0.46 da cs 0.64 en 0.53 en 0.78 pt 0.80 pl 0.39 no 0.70 nl 0.66 hr fr0.73 0.64 ru 0.68 en mk0.38 0.40 en 0.44 en 0.67 tr lv0.54 0.40 en it 0.91 en

0.77

pt 0.01 pt 0.24 en 0.34 en fr0.01 0.24 en es 0.10 0.24 en 0.19 ru 0.10 en 0.22 en fo<0.01 0.29 en 0.41 en 0.2 en ro 0.24 ro 0.11 en 0.12 en 0.34 en 0.21 en 0.11 en 0.14 nl 0.25 en ar0.07 sr 0.19 fr0.30 0.24 lt 0.26 de 0.29 ru 0.16 sq 0.14 en ja 0.06

ar0.01 es 0.01 es et<0.01 0.01 tr <0.01 tl 0.03 tr es 0.06 es 0.01 et0.02 es<0.01 et0.22 0.02 no 0.05 ru es 0.07 es 0.04 0.02 ru 0.07 nn af0.01 0.03 pl 0.06 en 0.05 ro fr0.04 0.14 en es 0.11 0.19 ru 0.03 en 0.11 en 0.11 hr es 0.03 0.01 tl 0.04

0.55 0.95 1.07 0.43 0.85 0.83 1.02 1.20 1.04 1.02 0.46 1.05 1.05 1.06 1.25 1.34 0.90 1.47 1.10 2.09 1.00 1.03 0.91 2.19 1.17 1.06 1.47 0.89 2.26 1.14 0.69

Russian Federation | 14,650 | 18,957,971 | 1,294.06 | 728.86 | ru 0.90, en 0.06, uk 0.01 | 0.78
San Marino | 2 | 2,073 | 1,036.50 | 818.12 | it 0.67, en 0.31, es 0.01 | 0.49
Serbia | 711 | 507,180 | 713.33 | 449.14 | hr 0.55, en 0.18, sr 0.07 | 1.89
Slovakia | 63 | 81,192 | 1,288.76 | 772.41 | en 0.61, sk 0.16, es 0.06 | 1.09
Slovenia | 143 | 200,946 | 1,405.22 | 630.41 | sl 0.61, en 0.22, es 0.03 | 1.59
Spain | 22,501 | 27,349,264 | 1,215.47 | 639.49 | es 0.82, en 0.07, ca 0.04 | 1.04
Sweden | 1,930 | 3,186,738 | 1,651.16 | 700.87 | sv 0.68, en 0.22, nn 0.01 | 1.14
Switzerland | 591 | 827,728 | 1,400.55 | 763.55 | en 0.33, de 0.23, fr 0.20 | 1.16
Turkey | 20,650 | 28,200,872 | 1,365.66 | 775.93 | tr 0.92, en 0.04, id 0.01 | 0.75
Ukraine | 2,267 | 2,953,766 | 1,302.94 | 735.09 | ru 0.70, uk 0.18, en 0.08 | 0.96
United Kingdom | 56,422 | 96,006,991 | 1,701.59 | 699.48 | en 0.96, ar 0.01, es <0.01 | 0.53
Vatican City | 1 | 2,215 | 2,215.00 | 0 | es 0.98, la 0.01, gl 0.01 | 0.27
Totals | 173,507 | 250,074,330 | 1,441.29 | 734.25 | en 0.43, es 0.11, tr 0.10 | 0.83

The number of users in the sample from each country- or territory-level administrative unit corresponds to the number of active Twitter users with at least 100 tweets and ≥ 90% of tweets from that country/territory during the collection of the "seed" data (November 2016 – June 2017). Countries with larger populations, such as Russia, Turkey, or France, are more represented, while smaller countries with small populations, such as Slovakia or Andorra, have fewer users in the sample. The sample sizes also reflect the relative popularity of the Twitter platform, which as a United States-based service may be of greater interest to users who write primarily in English, hence the United Kingdom having far more users in the sample than any other country or territory, and Germany having only 4,189 users in the sample, despite its large population. The mean number of tweets per user for the entire sample is 1,441.29, with a standard deviation of 734.25.

The most-represented languages per country/territory largely correspond to the principal national languages as well as English, although there are some exceptions. The countries and territories with the highest proportion of tweets in the national languages are the Anglophone countries and territories: Guernsey, the UK, and Ireland have 97%, 96%, and 94% of tweets in English, respectively, while the Isle of Man and Gibraltar both have approximately 90%. Countries with large populations tend to have a higher proportion of tweets in the principal national language: For example, Turkey has 92% of tweets in the national language, Russia 90%, France 83%, and Spain 82%. Countries and territories with smaller populations have comparatively fewer tweets in the national language: 68% of tweets from Sweden are in Swedish, while 50% of tweets from Hungary are in Hungarian, for example.

Mean entropy values reflect the tendency of users in a particular country or territory to use more than one language. The lowest entropy value in the sample, 0.27, is for the Vatican, where the random sampling method has captured tweets from a user who writes primarily in Spanish. The mean value of 0.83 for the entire sample suggests that for a significant proportion of users, bi- or multilingualism is the norm. Figure 1 shows the mean entropy values per country or territory as a violin plot in which the range of values (violin length), the
kernel density (violin width), the mean value (red dot) and the standard error (red bar) are provided. Countries are sorted from highest mean entropy to lowest.

Figure 1: Mean Entropy by Country/Territory

Linguistic diversity as quantified using Shannon entropy is highest in Macedonia, where in addition to Macedonian, Serbian, and Albanian, other south Slavic languages as well as English are used on Twitter. The neighboring countries of Montenegro, Serbia, Bosnia and Herzegovina, Croatia, and Slovenia, likewise former constituents of the multilingual state entity of Yugoslavia, also exhibit high levels of linguistic diversity. Countries and territories with multiple autochthonous language communities in which bi- or multilingual education is common show somewhat higher mean entropy values: This is the case, for example, in Belgium, Luxembourg, Switzerland, and Italy, among others. Countries with larger populations and a single principal national language tend to exhibit lower entropy values: Such is the case, for example, for Germany, Poland, Russia, and Turkey.

While the language representation per country or territory and the mean linguistic diversity values tell us something about the language behavior on Twitter of users from those places, they can also reflect the nature of the random sampling procedure employed in the study, as well as the inherent difficulty of automatically distinguishing between closely related languages (see Frey, Stemle, and Doğruöz, this volume), particularly for short texts such as Twitter messages. Thus, in the small Estonian sample, most users author tweets in Russian, although only 23% of Estonians report conversational ability in Russian (European Commission 2012; see Appendix, Table C). For users from Montenegro, the country with the second-highest mean entropy value, a closer examination of the automatically-identified tweet languages shows that many users have authored tweets in other South Slavic languages in addition to Croatian, such as Serbian or Slovenian. The Macedonian users typically have a significant number of
This is the postprint (the non-typeset version reflecting changes made during the peer review process) of a contribution to: Wigham, C. R., & Stemle, E. W. (eds.) (2019). Building Computer-Mediated Communication Corpora for sociolinguistic Analysis. Presses Universitaires Blaise Pascal. The content must not be used in a commercial context. lowest the with languages The Urdu. national or their in write primarily users whose Russian, and Indonesian, Turkish, Greek, English, are values Tagalog, as such Europe, in in groups used Languages population. higher exhibit sample the Balkans, the in of languages many as such multilingualism, of dynamic histories extensive with societies underlying same the represents language withatleast inthe data unique users. 10 o use language a of users all which to extent the quantify to used be also can entropy Shannon 4.2. are content ontheplatformnational intheir often to communicate Twitter first language. or widely other and English better other and English to switch to likely on than more language of first whose use People greater language. to corresponding entropy, mean higher exhibit communities Less Europeans: of ability language the to applies a languagetoAzeri. closely related Azerbaijan from tweets the of Many differences. Danish standard situation: to similar close a reflects Norway orthographic for of value terms high in The Macedonian to norms. close quite is that language a Bulgarian, in tweets ther languages. Figure 2 shows the mean Shannon entropy values for all users of each each of users all for values entropy Shannon mean the shows 2 Figure languages. ther

Despite The trend for languages co languages for trend The Language diversity by Principal Language used it as wisdom conventional reflects data Twitter the in trend general the this, nrp vle, s o agae ue piaiy y ioiy r imm or minority by primarily used languages do as values, entropy Figure 2 : Mean Entropy by Principal Language (minimum 10users)byLanguage Entropy : Mean Principal(minimum - Indeed, . spoken languages (e.g. Russian, Turkish, or Fr or Turkish, Russian, (e.g. languages spoken rresponds to the pattern pattern the to rresponds

for oe hr msae tee il e o orthographical no be will there messages short some os o hv a ag nme o speakers of number large a have not does - represented languages online, whereas speakers of speakers whereas online, languages represented 54

- localized users are reported to be in Turkish, Turkish, in be to reported are users localized

- populous countries with smaller language smaller with countries populous seen with seen

Norwegian Bokmål is very very is Bokmål Norwegian

countries or territories, and territories, or countries ench, among others), others), among ench,

are more more are i grant

e
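The per-user entropy measure used in Sections 4.1 and 4.2 can be illustrated with a short sketch. It is not the author's code: the function names, the input layout (one list of detected language codes per user), and the choice of a base-2 logarithm are assumptions made here for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(language_tags, base=2.0):
    """Shannon entropy of one user's tweet languages.

    language_tags: list of language codes, one per tweet (hypothetical layout).
    The logarithm base is an assumption; it only rescales the values.
    """
    counts = Counter(language_tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p, base)
    return entropy


def mean_entropy_by_group(users):
    """Average the per-user entropy within each group.

    users: list of (group, language_tags) pairs, where group is a country
    code or a principal language (hypothetical layout).
    """
    sums, sizes = {}, {}
    for group, tags in users:
        sums[group] = sums.get(group, 0.0) + shannon_entropy(tags)
        sizes[group] = sizes.get(group, 0) + 1
    return {group: sums[group] / sizes[group] for group in sums}


sample = [("FI", ["en", "fi"]), ("FI", ["fi", "fi"]), ("EE", ["ru", "ru", "ru"])]
means = mean_entropy_by_group(sample)
```

A user who tweets in a single language has entropy 0, and a user who splits tweets evenly between two languages has entropy 1 under the base-2 assumption, so higher group means correspond to more mixed language use.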

4.3. Comparison with Survey Data

The extent to which language use on Twitter is comparable to reported language use obtained from survey data was gauged by utilizing data from Special Eurobarometer 386, "Europeans and their languages", a Europe-wide survey of language use and attitudes (European Commission 2012). For each of the 27 countries in both the Eurobarometer survey and the Twitter data set, the responses to questions 48a ("Thinking about the languages that you speak, which language is your mother tongue?") and 48b ("And which other languages do you speak well enough in order to be able to have a conversation?") were compared to the proportion of users meeting a usage threshold of 100 tweets in that language. The data are provided in Tables C and D of the Appendix.

The rank order of languages in terms of proficiency was compared in the two data sets using a Wilcoxon signed-ranks test for paired samples. The results, with country, test statistic (W value) and p-value, are summarized in Table 2. For the countries in bold typeface, a significant difference in rank order was found between the Eurobarometer responses and the Twitter data at p = 0.05.

Table 2: Wilcoxon signed-rank test of order of languages by proportion proficient, Eurobarometer 386 and Twitter data sample
[Table body flattened in extraction; the row order is not recoverable, so the values below are listed per column block in extraction order and are not aligned with the country names.]
Block 1 countries: Austria, Belgium, Bulgaria, Cyprus, Czechia, Denmark, Estonia, Germany, Spain
Block 1 W: 606.5, 788, 892, 682, 705, 837, 737, 803, 738
Block 1 p-value: 0.261, 0.014, 0.982, 0.827, 0.326, 0.527, 0.186, 0.559, 0.057
Block 2 countries: Finland, France, Hungary, Ireland, Italy, Lithuania, Luxembourg, Malta, U.K.
Block 2 W: 795.5, 1181, 726, 813, 901, 989, 752, 709, 699
Block 2 p-value: <0.001, 0.614, 0.123, 0.227, 0.438, 0.792, 0.873, 0.007, 0.001
Block 3 countries: Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Sweden
Block 3 W: 776.5, 839, 786, 812, 911, 780, 757
Block 3 p-value: 0.056, 0.246, 0.266, 0.126, 0.281, 0.424, 0.010

The rank order of languages from the Eurobarometer survey results largely corresponds to that of the empirical data in the Twitter sample. Countries for which the paired mean ranks differ significantly (at p = 0.05) are Estonia, the United Kingdom, Ireland, Lithuania, and Portugal. Comparing the five languages with the most representation in these two data sets may shed light on the sources of the detected differences.

The detected difference for Cyprus (p = 0.057), which is very close to significance at p = 0.05, is a result of sample considerations: Respondents in the Eurobarometer data are from the (primarily Greek-speaking) Republic of Cyprus, whereas the Cyprus users in the Twitter data are also from (primarily Turkish-speaking) Northern Cyprus. Hence, the five most-used languages in the Eurobarometer sample are Greek, English, French, Russian, and Italian, while the five most-used languages in the Twitter sample are English, Turkish, Greek, Arabic, and Russian. For Estonia, the Eurobarometer top five languages are Estonian, Russian, English, Finnish, and German, while in the Twitter data they are English, Russian, Estonian, Turkish, and French.

The small sample size for Estonia included mainly users who write in Russian, whereas most Estonia-based respondents to the Eurobarometer survey reported Estonian as their first language. For the United Kingdom, the top five Eurobarometer responses are English, French, German, Spanish, and Polish. In the Twitter data, they are English, Arabic, Spanish, Portuguese, and French. In Ireland, Eurobarometer data shows that English, Irish, and French are the best-known languages, whereas in the Twitter data English, Portuguese, Spanish, Arabic, Irish, and French are the most-represented. For Lithuania, the order is Lithuanian, Russian, Polish, English, and German versus English, Lithuanian, Russian, Turkish, and French. Finally, for Portugal, the most-reported conversational languages in the Eurobarometer survey are Portuguese, English, French, Spanish, and Estonian, while in the Twitter data they are Portuguese, French, Spanish, English, and Italian.

In general, it can be noted that Arabic is more widely used by Europe-based Twitter authors than it is reported as a conversational language according to the Eurobarometer survey. This may be due to increased migration of Arabic-speaking persons to Europe since 2014 as a consequence of the Syrian civil war.

5. Multilingual Network

The growth of online communication and the availability of CMC data have led to a proliferation of studies in which network analysis methods have been applied (an overview is provided in Paolillo 2015). This section describes the creation of a multilingual network based on users who share two languages.

5.1. Quantification of Bilingualism Strength

A user in the data was determined to be bilingual for languages i and j if he or she had authored at least 100 tweets in both languages. Of the initial 174,170 users in the data, 39,446 were determined to be bilingual (or multilingual) according to this criterion (23%). The connection strength between languages was quantified using the phi (φ) coefficient, calculated from a contingency table of the number of bilinguals (Table 3) according to equation (2):

Table 3: Contingency Table for Number of Bilinguals

                 language j    ~language j
language i       O_11          O_12          = R_1
~language i      O_21          O_22          = R_2
                 = C_1         = C_2         = N

Equation 2: Phi Value

\varphi_{i,j} = \frac{O_{11} O_{22} - O_{12} O_{21}}{\sqrt{R_1 R_2 C_1 C_2}} \qquad (2)

The phi coefficient is equivalent to Pearson's product-moment correlation coefficient for two binary variables, and ranges in value from -1 to 1. A value of zero indicates that language users are as likely to be bilingual in the two languages as they are in any other languages in the sample, when taking into account the given number of users of each language.
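As an illustration of Table 3 and equation (2), the following sketch builds the contingency counts for one language pair and computes the phi value. The data layout (one set of language codes per user) and the function names are assumptions made for this example; the chapter does not specify an implementation.

```python
import math


def contingency(user_languages, lang_i, lang_j):
    """Return (O11, O12, O21, O22) for languages i and j.

    user_languages: iterable of sets, each holding the languages in which
    one user meets the usage threshold (hypothetical layout).
    """
    o11 = o12 = o21 = o22 = 0
    for langs in user_languages:
        has_i, has_j = lang_i in langs, lang_j in langs
        if has_i and has_j:
            o11 += 1      # uses both languages
        elif has_i:
            o12 += 1      # uses i but not j
        elif has_j:
            o21 += 1      # uses j but not i
        else:
            o22 += 1      # uses neither
    return o11, o12, o21, o22


def phi_coefficient(o11, o12, o21, o22):
    """Phi value from the four cells, following equation (2)."""
    r1, r2 = o11 + o12, o21 + o22   # row totals
    c1, c2 = o11 + o21, o12 + o22   # column totals
    denominator = math.sqrt(r1 * r2 * c1 * c2)
    if denominator == 0:
        return 0.0  # undefined when a margin is empty; 0.0 is a convention chosen here
    return (o11 * o22 - o12 * o21) / denominator


users = [{"de", "en"}, {"de", "en"}, {"en", "fr"}, {"fr", "es"}]
cells = contingency(users, "de", "en")
phi = phi_coefficient(*cells)
```

In this toy sample both German users also use English, so the phi value is positive; perfectly overlapping user sets give 1 and perfectly disjoint ones give -1.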

Positive values indicate the language pairs are more strongly connected than would be expected based on the prevalence of the languages in the multilingual dataset, while negative values indicate a weaker connection than would be expected. A t-statistic was calculated to test the significance of the correlation between languages according to Equation 3, where D = \max(R_1, C_1).

Equation 3: T-test Statistic

t_{i,j} = \frac{\varphi_{i,j} \sqrt{D - 2}}{\sqrt{1 - \varphi_{i,j}^{2}}} \qquad (3)

Links between languages with at least 50 bilingual users that were statistically significant at p < 0.05 were retained in the network.

5.2. Bilingualism Network

The network was realized as an interactive graph using the R packages igraph and visNetwork (Csardi and Nepusz 2006; Almende and Thieurmel 2016). A static screenshot is shown in Figure 3. The interactive network can be accessed at http://cc.oulu.fi/~scoats/EurNetwork.html. Nodes represent the number of bilingual speakers of a language and edges the links between languages. Green edges correspond to a positive φ value, red edges to a negative φ value. Edge width indicates the exposure value, or the proportion of users of language i who also use language j. While φ values are bi-directional, exposure values are not. The φ value between German and English, for example, is 0.06, indicating that bilingualism for the two languages is slightly more likely than would be expected based on the number of users of each language in the entire data. The exposure value of people who use German to English is 0.9637, indicating that for German users who are (at least) bilingual, over 96% of them also write in English. English exposure to German, on the other hand, is 0.073: Just over 7% of bilingual persons with tweets in English also author tweets in German.

In general it can be remarked that due to the status of English as the principal lingua franca, exposure values to English are high in the data, typically well over 0.50. Exceptions include some minority languages whose bilingual speakers also use the majority language of the country, such as Basque, Catalan, or Galician (with Castilian Spanish), or a language that for historical reasons has not been widely used, such as Ukrainian (users mainly bilingual with Russian).
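The significance statistic of Equation 3 and the directional exposure value can be sketched as follows. The function names are hypothetical, and the exposure formula O_11 / R_1 is one reading of "the proportion of users of language i who also use language j"; the chapter computes it over bilingual users, which this sketch assumes is already reflected in the counts passed in.

```python
import math


def t_statistic(phi, r1, c1):
    """t value for a phi coefficient, with D = max(R1, C1) as in Equation 3."""
    d = max(r1, c1)
    if abs(phi) >= 1.0:
        # Perfect association: the denominator is zero, so the statistic diverges.
        return math.copysign(math.inf, phi)
    return phi * math.sqrt(d - 2) / math.sqrt(1.0 - phi * phi)


def exposure(o11, r1):
    """Share of users of language i (R1) who also use language j (O11)."""
    return o11 / r1 if r1 else 0.0


t_value = t_statistic(0.5, 102, 60)     # D = 102
exposure_i_to_j = exposure(96, 100)     # 96 of 100 users of i also use j
exposure_j_to_i = exposure(96, 1200)    # same overlap, larger language j
```

The two exposure calls show why the value is directional: the same 96 shared users are a large share of a small language and a small share of a large one. Converting the t value into a p-value requires the t distribution, which is left out here because it is not part of the Python standard library.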

Figure 3: Static image of interactive bilingualism network

For Europe as a whole, a network of 41 nodes (languages) and 142 edges (links between languages) describes the statistically significant bilingual links. English clearly plays the most important role: it is connected to almost all of the other languages. In terms of overall language representation, English is the most prevalent language, with approximately 42% of the data in English. Only two languages in the network are not directly connected to English: Macedonian (connected to Slavic languages) and Basque (connected to Spanish).

Some typologically and historically related languages constitute nodes of secondary importance. For example, Russian shows connections to other Slavic languages such as Czech, Ukrainian, Serbian, Macedonian, and Bulgarian. Spanish is connected to other Romance languages of the Iberian Peninsula, such as Catalan and Galician, as well as Basque. Norwegian Bokmål, Nynorsk, and Danish are connected, as are Finnish and Estonian. These patterns may reflect actual bilingual use, but to a certain extent may also be an artifact of the automatic identification of closely related languages.

Quantifying the importance of a language node with eigenvector centrality, the most important nodes in the European Twitter language network are found to be English, Spanish, French, Italian, Russian, Dutch, Turkish, Portuguese, German, and Swedish (Table 4).

Table 4: Eigenvector Centrality for 10 Languages

Language      Eigenvector Centrality
English       1.0000
Spanish       0.2059
French        0.1973
Italian       0.1021
Russian       0.0995
Dutch         0.0957
Turkish       0.0823
Portuguese    0.0760
German        0.0711
Swedish       0.0258

This result largely coincides with the mean values for active use of a language on Twitter in the data (Appendix, Table D).

6. Discussion and Conclusion

The analysis of language use on Twitter shows that significant linguistic diversity within Europe is recapitulated on the social media platform. By quantifying linguistic diversity using Shannon entropy, a comparison of the relative extent of linguistic diversity within specific countries and territories or groups of users at country/territory level or language level becomes possible.

In general, speakers from societies that have traditionally been multilingual (such as the countries and territories of the Balkans, Belgium, Switzerland, or some Nordic countries) show higher average rates of linguistic diversity than users from societies in which a tendency towards linguistic homogenization has been manifest in educational policies over time, such as the historical U.K. and associated territories or France. Users from countries with large populations in which one language is dominant, such as Russia, Turkey, or Spain, also exhibit low average levels of linguistic diversity.

Users from traditionally multilingual societies as well as those from countries with relatively smaller populations are more likely to have higher linguistic diversity values for two main reasons. First, users from traditionally multilingual societies may have daily exposure to second or third languages in non-CMC contexts, in addition to experience learning the language in a school context, and thus be more likely to know additional languages compared to users from more monolingual societies. Secondly, considering the inherently global nature of social media platforms, users whose primary language is not widely used may choose to switch to English or another widely used language in order to increase the potential communicative range of their messages.

Users of non-European languages in the data show a range of average linguistic diversity values. Authors of Twitter messages in Indonesian, Tagalog, Urdu, or Mandarin, for example, exhibit higher than average rates of linguistic diversity, while users of Arabic or Persian are less likely to author messages in other languages. The former group may represent immigrants to Europe (and their progeny) who are relatively well-established and hence able to communicate in a European language and/or English as well as their first language. The latter group, on the other hand, may be comprised of recent immigrants who have moved to Europe since 2015 as a result of the Syrian civil war and continuing violence in Afghanistan. These users may not yet have a good command of a European language.

It should be noted, however, that the nature of the Twitter data makes differentiation between speaker groups in terms of identity parameters difficult. Little is known about the age, educational level, gender, immigration status, or L1 language of the users in the data. Some of the tweet metadata fields might contain information about some of these parameters. Future work could use keyword filtering and machine-learning techniques to infer some of these parameters in the context of a study of the factors that influence online language choice.

The patterns of language choice variation evident in Twitter postings by Europe-based users suggest a complex underlying language situation. In the multilingualism network for European users, English plays a central role, while the strength of the bilingual links between languages varies and may reflect some cultural and demographic facts. While movement towards the global lingua franca of English is undeniable, the use of local languages and third languages may reflect historical factors, patterns of migration, and trends in language learning.

Nonetheless, the data from Twitter largely confirm the results of the Eurobarometer survey (European Commission 2012) in terms of active language use. Future work on online linguistic diversity should aim to continue to refine our knowledge of patterns of language use based on empirical data as well as survey responses. Of special interest will be the extent to which language shift towards English can be confirmed (or refuted) in online contexts, as well as the online status and future of minority languages with few speakers. Advances in automated methods for inferring demographic parameters of users pertaining to personal identity will undoubtedly be a focus of this research.

BIBLIOGRAPHIC REFERENCES

ALMENDE B.V. & THIEURMEL, B. (2016), visNetwork: Network Visualization using 'vis.js' Library. R package version 1.0.2. URL: https://CRAN.R-project.org/package=visNetwork

ARNABOLDI, M., BRAMBILLA, M., CASSOTTANA, B., CIUCCARELLI, P., & VANTINI, S. (2017), Urbanscope: A lens to observe language mix in cities, American behavioral scientist 61(7), 774–793.

BAMMAN, D., EISENSTEIN, J., & SCHNOEBELEN, T. (2014), Gender identity and lexical variation in social media, Journal of sociolinguistics 18(2), 135–160.

COATS, S. (2016a), Grammatical feature frequencies of English on Twitter in Finland, in Squires, L. (Ed.), English in computer-mediated communication: Variation, representation, and change. Berlin, De Gruyter, 179–210.

COATS, S. (2016b), Grammatical frequencies and gender in Nordic Twitter Englishes, Proceedings of the 4th conference on CMC and social media corpora for the humanities. Ljubljana, pp. 12–16. URL: http://nl.ijs.si/janes/wp-content/uploads/2016/09/CMC-conference-proceedings-2016.pdf

CSARDI, G. & NEPUSZ, T. (2006), The igraph software package for complex network research, InterJournal Complex Systems, 1695. URL: http://igraph.org

DE SWAAN, A. (2001), Words of the world: The global language system. Cambridge, UK, Blackwell.

EISENSTEIN, J., O'CONNOR, B., SMITH, N. A., & XING, E. (2014), Diffusion of lexical change in social media, PLoS ONE 9(11).

ELETA, I., & GOLBECK, J. (2014), Multilingual use of Twitter: Social networks at the language frontier, Computers in human behavior 41, 424–432.

EUROPEAN COMMISSION (2006), Les européens et leurs langues: Eurobaromètre Spécial 243. http://ec.europa.eu/public_opinion/archives/ebs/ebs_243_fr.pdf

EUROPEAN COMMISSION (2011), User language preference online: Flash Eurobarometer 313. http://ec.europa.eu/public_opinion/flash/fl_313_en.pdf

EUROPEAN COMMISSION (2012), Europeans and their languages: Special Eurobarometer 386. http://ec.europa.eu/public_opinion/archives/ebs/ebs_386_sum_en.pdf

FREY, J. C., GLAZNIEKS, A., & STEMLE, E. W. (2015), The DiDi corpus of South Tyrolean CMC data. Proceedings of the 2nd workshop on natural language processing for computer-mediated communication/social media, 1–6.

GÖRLACH, M. (2002), Still more Englishes, Amsterdam, John Benjamins.

GRAHAM, M., HALE, S. A., & GAFFNEY, D. (2014), Where in the world are you? Geolocation and language identification in Twitter, The professional geographer 66(4), 568–578.

HALE, S. (2014), Global connectivity and multilinguals in the Twitter network, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, ACM, 833–842.

HAUGEN, E. (1972), The ecology of language, in Haugen, E., The ecology of language: Essays by Einar Haugen, Stanford, CA, Stanford University Press, 325–339.

HAUSTEIN, S., BOWMAN, T. D., HOLMBERG, K., TSOU, A., SUGIMOTO, C., & LARIVIÈRE, V. (2015), Tweets as impact indicators: Examining the implications of automated 'bot' accounts on Twitter, Journal of the association for information science and technology 67(1), 232–238.

HILL, M.O. (1973), Diversity and evenness: A unifying notation and its consequences, Ecology 54, 427–473.

HONG, L., CONVERTINO, G., & CHI, E. H. (2010), Language matters in Twitter: A large scale study, International AAAI conference on weblogs and social media. Menlo Park, CA, AAAI, 518–521.

JØRGENSEN, A. K., HOVY, D., & SØGAARD, A. (2015), Challenges of studying and processing dialects in social media, Proceedings of the ACL 2015 workshop on noisy user-generated text, Stroudsburg, PA, Association for computational linguistics, 9–18.

KIM, S., WEBER, I., WEI, L., & OH, A. (2014), Sociolinguistic analysis of Twitter in multilingual societies. Proceedings of the 25th ACM conference on Hypertext and social media (HT '14), New York, ACM, 243–248.

LEE, C. (2016), Multilingual resources and practices in digital communication, in Georgakopoulou, A., Spilioti, T. (Eds.), The Routledge handbook of language and digital communication, London and New York, Routledge, 118–132.

LEETARU, K. H., WANG, S., CAO, G., PADMANABHAN, A., & SHOOK, E. (2013), Mapping the global Twitter heartbeat: The geography of Twitter, First Monday 18(5/6).

LINN, A. (Ed.) (2016), Investigating English in Europe: Contexts and agendas. Berlin and Boston, De Gruyter Mouton.

LUI, M. & BALDWIN, T. (2014), Accurate language identification of Twitter messages. Proceedings of the 5th workshop on language analysis for social media (LASM), EACL, Stroudsburg, PA, Association for Computational Linguistics, 17–25.

MAGDY, A., GHANEM, T., MUSLEH, M., & MOKBEL, M. (2014), Exploiting geo-tagged tweets to understand localized language diversity, Proceedings of workshop on managing and mining enriched geo-spatial data, GeoRich'14 v. 2, New York, ACM, 1–6.

MOCANU, D., BARONCHELLI, A., PERRA, N., GONÇALVES, B., ZHANG, Q., & VESPIGNANI, A. (2013), The Twitter of babel: Mapping world languages through microblogging platforms, PLoS ONE 8(4).

MOISE, I., GAERE, E., MERZ, R., KOCH, S., & POURNARAS, E. (2016), Tracking language mobility in the Twitter landscape, Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference, Piscataway, NJ, IEEE, 663–670.

NETTLE, D. (1999), Linguistic diversity. Oxford, Oxford University Press.

NGUYEN, D., TRIESCHNIGG, D., & CORNIPS, L. (2015), Audience and the use of minority languages on Twitter, Proceedings of the ninth international AAAI conference on web and social media, Menlo Park, CA, Association for the Advancement of Artificial Intelligence, 666–669.

PAOLILLO, J. (2007), How much multilingualism? Language diversity on the internet, in Danet, B., Herring, S. C. (Eds.), The multilingual internet: Language, culture, and communication online, Oxford, Oxford University Press, 408–430.

PAOLILLO, J. (2015), Network analysis, in Georgakopoulou, A., Spilioti, T. (Eds.), The Routledge handbook of language and digital communication, London and New York, Routledge, 36–54.

RÉNYI, A. (1961), On measures of information and entropy, Proceedings of the fourth Berkeley Symposium on Mathematics, Statistics and Probability 1960, 547–561.

ROESSLEIN, J. (2015), Tweepy (Python software module). https://github.com/tweepy/tweepy

RONEN, S., GONÇALVES, B., HU, K., VESPIGNANI, A., PINKER, S. & HIDALGO, C. (2014), Links that speak: The global language network and its association with global fame, PNAS 111(52), E5616–E5622.

SHANNON, C.E. & WEAVER, W. (1949), The mathematical theory of communication, Urbana-Champaign, University of Illinois Press.

SITES, D. (2013), Compact language detector 2. https://github.com/CLD2Owners/cld2

SOLER-CARBONELL, J. (2016), English in the language ecology of Europe, in Linn, A. (Ed.), Investigating English in Europe: Contexts and agendas, Berlin and Boston, De Gruyter Mouton, 53–58.

SQUIRES, L. (2015), Twitter: Design, discourse, and implications of public text, in Georgakopoulou, A., Spilioti, T. (Eds.), The Routledge handbook of language and digital communication, London and New York, Routledge, 239–256.

WIKSTRÖM, P. (2014), #srynotfunny: Communicative functions of hashtags on Twitter, SKY journal of linguistics 27, 127–152.

Appendix

Table A: Language Codes (ISO 639-1) and Names

af  Afrikaans           am  Amharic             ar  Arabic              az  Azerbaijani
be  Belarusian          bg  Bulgarian           bn  Bengali             bo  Tibetan Standard
br  Breton              ca  Catalan             ckb Central Kurdish     cs  Czech
cy  Welsh               da  Danish              de  German              dv  Divehi
el  Greek (modern)      en  English             eo  Esperanto           es  Spanish
et  Estonian            eu  Basque              fa  Persian (Farsi)     fi  Finnish
fo  Faroese             fr  French              ga  Irish               gl  Galician
gu  Gujarati            he  Hebrew (modern)     hi  Hindi               hr  Croatian
ht  Haitian Creole      hu  Hungarian           hy  Armenian            id  Indonesian
is  Icelandic           it  Italian             ja  Japanese            ka  Georgian
kk  Kazakh              kl  Kalaallisut         km  Khmer               kn  Kannada
ko  Korean              ku  Kurdish             ky  Kyrgyz              la  Latin
lb  Luxembourgish       lo  Lao                 lt  Lithuanian          lv  Latvian
mg  Malagasy            mk  Macedonian          ml  Malayalam           mn  Mongolian
mr  Marathi (Marāṭhī)   mt  Maltese             my  Burmese             ne  Nepali
nl  Dutch               nn  Norwegian Nynorsk   no  Norwegian           oc  Occitan
or  Oriya               pa  Eastern Punjabi     pl  Polish              ps  Pashto
pt  Portuguese          qu  Quechua             ro  Romanian            ru  Russian
rw  Kinyarwanda         sd  Sindhi              si  Sinhalese           sk  Slovak
sl  Slovene             sq  Albanian            sr  Serbian             sv  Swedish
ta  Tamil               te  Telugu              th  Thai                tl  Tagalog
tr  Turkish             ug  Uyghur              uk  Ukrainian           ur  Urdu
vi  Vietnamese          vo  Volapük             xh  Xhosa               zh  Chinese

Table B: Country Codes (ISO 3166-1 alpha-2) and Names

AT  Austria               FI  Finland               MT  Malta
BE  Belgium               FR  France                NL  Netherlands
BG  Bulgaria              GB  United Kingdom        PL  Poland
CY  Republic of Cyprus    HU  Hungary               PT  Portugal
CZ  Czechia               IE  Republic of Ireland   RO  Romania
DE  Germany               IT  Italy                 SE  Sweden
DK  Denmark               LT  Lithuania             SI  Slovenia
EE  Estonia               LU  Luxembourg            SK  Slovakia
ES  Spain                 LV  Latvia

Table C: Proportion of Respondents Reporting Active Language Proficiency by Country and Language (Eurobarometer 386 Question 43)

    AT    BE    BG    CY    CZ    DE    DK    EE    ES    FI    FR    GB    HU    IE    IT    LT    LU    LV    MT    NL    PL    PT    RO    SE    SI    SK
ar  0.002 0.026 0     0.029 0     0.002 0.003 0.001 0.005 0.003 0.017 0.007 0.002 0.005 0.006 0     0.002 0     0.003 0.002 0.001 0.001 0     0.007 0     0.001
bg  0.002 0.002 0.822 0.017 0     0     0.001 0.001 0.002 0     0     0.001 0.001 0     0.001 0     0     0.002 0.003 0.001 0     0     0     0     0     0.001
ca  0.003 0     0     0     0     0     0     0     0.094 0.002 0     0.001 0     0.001 0.001 0     0.001 0     0     0.001 0.003 0.001 0     0     0     0
cs  0.002 0.001 0     0.005 0.815 0.002 0     0.001 0     0     0     0.003 0     0     0     0.001 0.001 0.001 0     0.001 0.005 0.015 0     0.001 0     0.092
cy  0     0     0     0     0     0     0     0.001 0     0     0     0.007 0     0.002 0     0.001 0     0     0     0     0     0     0     0     0     0
da  0.001 0.003 0     0     0     0.003 0.686 0.002 0     0     0     0.002 0.001 0     0     0     0.002 0     0     0     0     0.005 0     0.013 0     0
de  0.738 0.042 0.016 0.02  0.037 0.717 0.089 0.028 0.005 0.032 0.023 0.019 0.043 0.016 0.011 0.025 0.117 0.025 0.005 0.129 0.043 0.004 0.012 0.053 0.211 0.044
en  0.143 0.098 0.056 0.654 0.061 0.13  0.157 0.093 0.049 0.13  0.093 0.804 0.045 0.872 0.082 0.068 0.098 0.082 0.18  0.156 0.083 0.068 0.068 0.17  0.291 0.05
es  0.01  0.012 0.004 0.013 0.002 0.009 0.009 0.001 0.72  0.006 0.035 0.016 0.002 0.011 0.025 0.002 0.013 0.002 0.002 0.011 0.002 0.023 0.01  0.016 0.016 0.001
et  0     0     0     0     0     0     0     0.582 0     0.007 0     0.002 0     0     0     0.002 0     0.001 0     0     0     0.016 0     0     0     0
eu  0     0     0     0     0     0     0     0     0.009 0     0.001 0     0.003 0     0.009 0     0     0     0     0     0.003 0.001 0     0     0     0.001
fi  0     0     0     0     0     0     0.001 0.041 0     0.689 0     0.002 0     0     0     0.001 0.002 0.001 0     0     0     0     0     0.01  0     0
fr  0.023 0.341 0.005 0.064 0.004 0.031 0.017 0.002 0.031 0.009 0.786 0.051 0.006 0.04  0.036 0.006 0.223 0.002 0.02  0.052 0.007 0.038 0.035 0.017 0.016 0.003
gl  0.001 0     0     0     0     0     0     0     0.045 0     0     0.001 0     0     0     0     0     0     0     0     0     0.001 0     0.001 0     0
hi  0.002 0     0     0     0     0     0     0     0     0     0     0.007 0     0.001 0     0     0     0     0     0.001 0     0.003 0     0     0     0
hr  0.014 0.002 0.001 0     0     0.001 0.001 0.002 0     0     0.001 0.002 0.002 0.001 0.001 0     0.001 0     0     0.001 0     0.004 0     0.001 0.377 0.001
hu  0.006 0.002 0     0.003 0     0.001 0     0     0.001 0     0     0.001 0.874 0     0.001 0     0     0     0     0.001 0     0.001 0.071 0.004 0.002 0.081
it  0.019 0.019 0.002 0.029 0.001 0.009 0.002 0     0.006 0.001 0.014 0.007 0.003 0.003 0.817 0.001 0.041 0.002 0.103 0.002 0.004 0.004 0.016 0.003 0.064 0.002
ja  0     0     0     0.002 0     0     0     0     0     0.001 0     0.003 0.001 0     0     0     0     0     0.001 0     0     0     0     0     0     0
ko  0     0     0     0     0     0     0     0     0     0     0     0.002 0     0     0     0     0     0.001 0     0     0.001 0     0     0     0     0
lb  0     0     0     0     0     0     0     0     0     0     0     0.001 0     0     0     0     0.348 0     0.001 0     0     0     0     0     0     0
lt  0     0     0     0     0     0     0     0.001 0     0     0     0.004 0     0.004 0     0.668 0     0.007 0     0     0.001 0     0     0     0     0
lv  0     0     0     0     0     0     0     0.003 0     0     0     0.002 0     0.004 0     0.004 0     0.55  0     0     0.002 0     0     0.001 0     0
mt  0     0     0     0     0     0     0     0.001 0     0     0     0     0     0     0     0     0     0     0.678 0     0     0     0.002 0     0     0
nl  0.003 0.422 0     0.002 0     0.004 0.002 0     0     0     0.001 0.004 0.001 0.003 0.001 0     0.016 0.001 0     0.632 0     0.003 0     0.003 0     0
pl  0.003 0.002 0.001 0.003 0.009 0.018 0.002 0     0     0.002 0.005 0.015 0     0.024 0     0.042 0.002 0.008 0     0.001 0.79  0     0.001 0.003 0     0.009
pt  0.001 0.005 0     0     0     0.001 0.001 0     0.006 0     0.014 0.008 0     0.001 0.001 0     0.122 0     0     0.003 0.001 0.805 0.001 0     0     0
ro  0.001 0.002 0.003 0.071 0     0.004 0     0.002 0.023 0     0.003 0.003 0.005 0.002 0.003 0     0.001 0     0     0     0     0.004 0.775 0     0     0
ru  0.007 0.004 0.052 0.059 0.028 0.045 0     0.233 0.001 0.009 0.001 0.006 0.009 0.003 0.001 0.178 0.003 0.313 0.001 0     0.053 0.001 0.006 0     0.017 0.032
sk  0.004 0.001 0     0.002 0.044 0     0.001 0     0.001 0     0     0     0.002 0.003 0     0     0     0     0.001 0     0.001 0.001 0.001 0     0.004 0.68
sv  0.001 0     0     0.003 0     0.001 0.025 0.002 0.001 0.105 0     0.001 0     0     0     0.001 0.005 0.001 0     0.001 0     0     0     0.695 0     0
tr  0.013 0.015 0.036 0.025 0     0.019 0.002 0.001 0     0.001 0.003 0.001 0     0     0     0     0     0     0     0.004 0     0     0.001 0     0     0
ur  0     0     0     0     0     0     0.002 0     0     0     0     0.013 0     0     0     0     0     0     0     0     0     0     0     0     0     0
zh  0.001 0.002 0     0     0     0.002 0     0.001 0.001 0     0.001 0.003 0     0.001 0.001 0     0.002 0.001 0.001 0.001 0     0.002 0     0     0     0

Table D: Active Language Use by Country and Language – Proportion of Users with at least 100 Tweets in the given Language

    AT    BE    BG    CY    CZ    DE    DK    EE    ES    FI    FR    GB    HU    IE    IT    LT    LU    LV    MT    NL    PL    PT    RO    SE    SI    SK
ar  0.03  0.005 0.012 0.024 0.008 0.014 0.016 0     0.001 0.001 0.003 0.008 0.014 0.011 0.003 0     0.023 0     0.088 0.005 0.006 0     0.016 0.008 0.007 0
bg  0     0     0.485 0     0     0.001 0     0     0     0.002 0     0     0     0     0     0     0     0     0     0     0.001 0     0     0.001 0     0
ca  0     0.001 0     0     0     0.001 0     0     0.079 0.001 0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
cs  0     0     0.006 0     0.714 0.002 0.002 0     0     0.001 0     0     0.009 0     0.001 0     0     0     0     0     0     0     0     0     0.062 0.047
cy  0     0     0     0.005 0.003 0.001 0.002 0     0     0     0     0.003 0     0.001 0     0     0     0     0     0     0     0     0     0.001 0     0
da  0.002 0.001 0     0     0     0.001 0.582 0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.007 0     0
de  0.6   0.005 0     0     0     0.725 0.007 0.014 0.001 0.004 0.001 0.001 0.005 0.002 0.003 0     0.093 0.004 0     0.008 0.002 0     0.005 0.003 0     0.016
en  0.639 0.707 0.509 0.534 0.475 0.621 0.738 0.557 0.222 0.547 0.32  0.984 0.654 0.976 0.414 0.762 0.651 0.319 0.825 0.665 0.371 0.387 0.651 0.593 0.6   0.672
es  0.032 0.021 0.047 0.005 0.021 0.027 0.016 0.014 0.942 0.005 0.01  0.004 0.042 0.006 0.05  0     0.116 0.004 0     0.011 0.018 0.123 0.075 0.008 0.041 0.125
et  0.002 0.001 0.006 0     0     0.002 0     0.314 0     0.091 0     0     0     0     0.001 0     0     0.004 0     0     0     0     0     0.001 0     0
eu  0     0     0     0     0.003 0     0     0     0.005 0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
fi  0     0     0     0.005 0     0.001 0     0.014 0.001 0.831 0     0     0.005 0     0     0     0     0     0.018 0     0     0     0     0.002 0     0
fr  0.009 0.247 0.006 0     0.008 0.013 0.007 0.029 0.005 0.001 0.949 0.003 0.009 0.009 0.008 0.024 0.419 0     0.035 0.005 0.005 0.006 0.022 0.005 0.007 0
gl  0     0.001 0     0     0     0     0     0     0.018 0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
hi  0.002 0     0.006 0     0     0.003 0.005 0     0     0.001 0     0.001 0.005 0     0     0     0     0     0     0     0     0     0     0.002 0     0
hr  0.007 0.001 0     0     0     0.002 0.002 0     0     0     0     0     0     0.002 0     0     0     0     0     0.001 0     0     0     0.001 0.028 0
hu  0.005 0.001 0     0     0     0     0     0     0     0     0     0     0.598 0     0     0     0     0     0     0     0.001 0     0.005 0     0     0
it  0.002 0.005 0.006 0     0.003 0.005 0.005 0.014 0.003 0     0.003 0.001 0.005 0.001 0.896 0     0.023 0     0.07  0.002 0.002 0.002 0.038 0.001 0.007 0.031
ja  0.009 0     0     0     0.005 0.009 0     0     0.001 0.008 0.002 0.001 0.005 0.002 0.002 0     0.023 0     0.035 0.002 0.001 0.001 0     0.003 0     0
ko  0.002 0     0     0     0     0.002 0.002 0     0     0     0     0     0     0     0     0     0     0.008 0     0     0.002 0     0     0     0     0
lb  0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.07  0     0     0     0     0     0     0     0     0
lt  0.002 0     0     0     0     0     0.005 0     0     0     0     0     0     0     0     0.31  0     0     0     0     0     0     0     0.001 0     0
lv  0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.024 0     0.546 0     0     0     0     0     0.001 0     0
mt  0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.035 0     0     0     0     0     0     0
nl  0.005 0.593 0.006 0     0.008 0.004 0.002 0     0     0.001 0.001 0.001 0     0     0     0     0.023 0     0     0.874 0.002 0     0     0.001 0     0
pl  0.014 0.002 0     0     0.008 0.006 0.002 0     0     0.001 0.001 0.001 0     0.003 0.001 0.024 0     0     0     0.002 0.88  0     0     0.004 0.021 0
pt  0.007 0.005 0.012 0.01  0.005 0.011 0.009 0     0.014 0.005 0.005 0.003 0.009 0.012 0.009 0     0.023 0     0.018 0.005 0.003 0.958 0.011 0.005 0.007 0.016
ro  0.002 0.002 0     0     0.003 0     0     0     0     0     0     0     0.005 0     0.001 0     0     0.008 0     0     0     0     0.301 0.001 0     0
ru  0.016 0.002 0.432 0.015 0.045 0.01  0.002 0.371 0.001 0.016 0.001 0     0.014 0.001 0.002 0.238 0.023 0.367 0.018 0.003 0.025 0.001 0.011 0.002 0     0.047
sk  0.002 0     0     0     0.056 0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0.001 0     0     0     0     0.25
sv  0.002 0.001 0     0     0     0     0.005 0     0     0.022 0     0     0     0     0     0     0     0.004 0.018 0     0.001 0     0     0.816 0     0
tr  0.058 0.031 0.041 0.393 0     0.047 0.02  0.029 0     0.002 0.004 0.001 0.023 0.001 0.002 0.048 0     0.004 0.018 0.014 0.004 0     0.022 0.01  0     0.016
ur  0.002 0     0     0     0     0.002 0.002 0     0     0.001 0     0.001 0     0     0     0     0     0     0     0     0     0     0     0.002 0     0
zh  0     0     0     0     0     0.001 0     0     0     0.001 0     0     0     0.001 0     0     0     0     0     0     0.001 0     0     0.001 0     0
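As an illustration of how the per-country proportions in Table D might be condensed into a single diversity figure, the Python sketch below rescales a set of language proportions so that they sum to one and computes their Shannon entropy in bits. This is an assumption made for illustration and is not necessarily the computation used in the chapter: the columns of Table D do not sum to one (a user can be active in several languages), the rescaling step and the names `shannon_entropy` and `table_d_subset` are hypothetical, and only the largest Table D values for Luxembourg (LU) and the United Kingdom (GB) are included rather than the full columns.

```python
import math


def shannon_entropy(proportions):
    """Shannon entropy in bits of non-negative weights, rescaled to sum to 1.

    The rescaling is an illustrative assumption, not the chapter's stated method.
    """
    weights = list(proportions)
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one proportion must be positive")
    # Zero weights contribute nothing to the entropy and are skipped.
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


# Largest Table D values only (illustrative subset, not the full columns).
table_d_subset = {
    "LU": {"en": 0.651, "fr": 0.419, "es": 0.116, "de": 0.093, "lb": 0.07},
    "GB": {"en": 0.984, "ar": 0.008, "es": 0.004},
}

entropy_by_country = {
    country: shannon_entropy(langs.values())
    for country, langs in table_d_subset.items()
}

for country, entropy in entropy_by_country.items():
    print(f"{country}: {entropy:.3f} bits")
```

Under these assumptions, a country whose active users are spread over several languages receives a higher value than one dominated by a single language, which is the kind of contrast an entropy-based linguistic profile is meant to capture.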