<<

Pub Crawling at Scale: Tapping Untappd to Explore Social Drinking

Martin J. Chorley∗ Luca Rossi Gareth Tyson Matthew J. Williams Cardiff University Aston University Queen Mary University of London University of Birmingham [email protected] [email protected] [email protected] [email protected]

Abstract breweries opening each year (Naylor 2014). Thus, study- ing such a dynamic ecosystem could prove extremely fruit- There has been a recent surge of research looking at the re- ful, offering potential insight into the behaviour of people porting of food consumption on social media. The topic of alcohol consumption, however, remains poorly investigated. from around the world.This is because it is well known that Social media has the potential to shed light on a topic that, tra- many countries form social communities around the drink- ditionally, is difficult to collect fine-grained information on. ing of alcohol, which are quite distinct from those formed One social app stands out in this regard: Untappd is an app around eating (Clarke et al. 2000). There are also clear health that allows users to ‘check-in’ their consumption of beers. It considerations that such data contribute to. Excessive alco- operates in a similar fashion to other location-based appli- hol consumption is one of the most significant preventable cations, but is specifically tailored to the collection of infor- causes of death today. It caused 79,000 deaths and 2.3 mil- mation on beer consumption. In this paper, we explore beer lion years of potential life lost each year in 2001–2005 in consumption through the lens of social media. We crawled the United States alone (Kanny et al. 2011). Typically, such Untappd in real time over a period of 112 days, across 40 things can only be studied using labour-intensive (e.g., in- cities in the United States and Europe. Using this data, we shed light on the drinking habits of over 369k users. We fo- terviews and surveys) or coarse-grained (e.g., alcohol sales cus on per-user and per-city characterisation, highlighting key statistics) collection methods. Gathering reliable, up-to-date behavioural trends. and fine-grained data on such matters could be extremely helpful for exploring the role and impact of alcohol in soci- ety. 1 Introduction With these considerations in mind, one recent social app We are witnessing a convergence in our online and offline stands out: Untappd centres on the consumption of drinks personas. Activities that were once considered solely the (primarily beer), allowing users to ‘check-in’ beers that they domain of the real-world, have begun to encroach onto on- are drinking in a given location. These checkins are shared line territory. An example of this is the consumption of food with friends, allowing users to interact. Although the princi- and drink. It is now common for users to share details of ples underpinning the app may sound unusual, it has proven their meals online, as part of the so-called #foodporn revo- extremely successful. In 2014 its userbase surpassed 1 mil- lution (Mejova et al. 2015). As such, there has been a flurry lion, with 60 million beers being checked-in over just 3 three of recent research looking at consumption habits through the years1. This popularity shows little sign of abating. Hence, lens of social media (Abbar, Mejova, and Weber 2014). Untappd has the potential to shed extremely fine-grained in- While there has been a bulk of food-related research us- sight into the social drinking habits of a huge population ing social media data, a topic that has received less attention of users. Furthermore, unlike work modelling consumption is that of drinking alcohol. This is partly due to the percep- habits using free-text (e.g., via ), there is no need tion that alcoholic drinks have far less diversity than food. to perform interpretation or translation of data. Instead, all Consequently, past papers have looked at the topic from check-ins are represented using a formal schema. a rather narrow perspective, e.g., identifying alcohol abuse In this paper, we offer the first characterisation of Un- by identifying tweets containing the word ‘hangover’ (Cu- tappd, exploring user drinking habits around the world. To lotta 2013). We argue that this, however, misses a signif- this end, we have crawled their publicly available data over icant opportunity. For example, the craft beer community a 112 day period, to gather all checkins for users in 40 cities. has grown dramatically in recent years. In 2013, craft beer We begin by characterising the dataset to reveal the app’s sales saw a year-on-year increase of 79%, with 74 million scale and popularity (§3). Following this, we separate our pints being sold in the UK alone (CGAStrategy 2013). There analysis into three broad themes. First, we shed light on in- are significant supply-side expansions too, with 200 new ∗Authors listed alphabetically 1Untappd Official Blog, ‘Untappd is 1,000,000 strong Copyright c 2016, Association for the Advancement of Artificial and growing’, 2014. http://blog.untappd.com/post/ Intelligence (www.aaai.org). All rights reserved. 73638076039 dividual user behaviour, highlighting drinking habits, as well 2015). With respect to these works, Untappd provides far as how users interact with Untappd (§4). Second, we explore more fine-grained information, allowing individual locations which beers are most popular, and how these vary across and drinks to be captured, as well as other metadata, e.g., in- cultural boundaries (§5). Then, third, we inspect the social tervals between drinks and ABV. side of Untappd, exploring Untappd’s social graph and ho- A key contribution of our work is the temporal and spa- mophily (§6). Our key findings can be summarised as fol- tial analysis of user drinking habits. Silva et al. (Silva et al. lows: 2014) explored similar spatial and temporal food/drink pref- erences via Foursquare. Interestingly, they found that cul- • Untappd is a remarkably large-scale service. Over a 4 tural factors were often stronger than spatial (e.g., French month period, we witness in excess of 5 million check- food has more in common with Brazil than British food). ins across 40 monitored cities. A common issue with studies such as these is that they use • Most users are responsible in their drinking (75% checkin venue choices as a proxy for specific food and drink pref- a maximum of 4 drinks per day), yet we observe a core erences. This contrasts with Untappd, which provides both group of very heavy drinkers. We find that 7% consume venue choices and individual drinks explicitly. We also note in excess of 10 drinks at least once, with 7.5% of users that work in this area has typically been limited to food, checking in excess of 50 drinks during the period. rather than drink, with particular focus on health implica- • We identify distinct diurnal drinking patterns, revealing tions. We expand this work by studying alcohol consump- differing trends across weekdays, weekends and typical tion, and go beyond health-based implications to explore the working hours. Cultural trends can also be extracted with social nature of the app. Another important difference be- different temporal patterns in cities (e.g., drinking peaks tween food and drink is the frequency and rate of usage, in Paris around 9PM compared to 7PM in London). with often many drinks being consumed per night. There have also been various studies of more general- • We observe clear preferences for particular types of beers purpose location-based social networks (LBSNs). Most no- across different cities. Through this, it becomes possible table is Foursquare, which also happens to provide the venue to cluster users into geographic regions based on their database for Untappd. Cramer et al. explored the reasons checkins. why people used LBSNs via in-depth interviews (Cramer, • A nascent Untappd social network exists, with a median Rost, and Holmquist 2011). They found many motivations friendship group size of 9. Despite its sparsity, we find for using LBSNs. Obvious examples include socially driven distinct homophily in the social network, with a tendency desires (e.g., knowing what friends are doing), although for friends to consume similar types of drinks. more unexpected reasons were discovered as well, such as the endorsement of venues and coordinating meet-ups. It We conclude the paper by highlighting key implications is logical to think that Untappd might hold many synergies from our work (§7 and §8). These include a range a possible here; however, we conjecture that the specific nature of the apps that could be underpinned by our data and findings, application likely leads to rather novel motivations as well particularly in relation to health monitoring (e.g., enabling (e.g., tracking popular beers). interventions through the social network). In this study, we build on past work to offer a window into the world of online ‘social’ drinking. On one hand, we pro- 2 Related Work vide the first empirical study of a new social app, which, as There has been a flurry of recent work into the relation- of yet, lacks even a rudimentary analysis. On the other hand, ship between food consumption and social media. Abbar et we extract and analyse this data to shed light on a number of al. (Abbar, Mejova, and Weber 2014) investigated the food social patterns, which appear unique to drinking. The rest of that people report eating on Twitter. They monitored 210k the paper explores this topic in depth. users to discover what food they reported consuming. They found a strong correlation between the food being consumed 3 Dataset and Characterisation and the health of the host country, as well as the attributes of We have crawled the publicly accessible Untappd API in real the user (e.g., education). In relation to Untappd, they found time to capture data relating to beer consumption in 40 loca- that alcohol tended to be mentioned in urban areas, as well as tions around the USA and Europe between 14th August and having a weak correlation with obesity. Mejova et al. built on 4th December 2015 (112 days). The Untappd API endpoint these past studies to investigate food consumption patterns /thepub/local allows retrieval of all public checkins in in the United States via and Foursquare (Mejova a given locale. Polling this endpoint at regular intervals al- et al. 2015). They made numerous observations, including lows continuous monitoring of all Untappd checkins within that those attending local venues were less likely to be obese. a given geographic area. Note that we only collect data for Other work has used Twitter to estimate alcohol sales us- users who have allowed their checkins to be public. Each ing keyword matching (e.g., ‘drunk’ or ‘hangover’) (Culotta checkin contains: the username, the location, the beer, the 2013), and text-analysis to extract health information about beer rating given by the user, as well as the number of users (Paul and Dredze 2011). Tamersoy et al., on the other ‘toasts’ the checkin received from friends (‘toasts’ are simi- hand, looked at linguistic features on Reddit to characterise lar to ‘likes’). It should be noted that only check- long-term abstinence of users in the StopSmoking and Stop- ins that are associated with a venue are included within this Drinking communities (Tamersoy, De Choudhury, and Chau data. 40 cities were selected for monitoring, 34 within the Cicerone_Syllabus_v2.pdf smaller a a have making Boston users and of DC number the Washington large can to exceptions instance, disproportionately notable For proportional However, city. seen. is a be in checkins users of of number number the speaking, Ameri- populations. to smaller compared much with (10th) cities lowly can period; ranks however, measurement still, the London, this far, over by checkins is, The 199,781 userbase (3.8m). collecting greatest Angeles the Los with as city such 5th, European cities ranked larger is of 663k ahead of well population 3 a are with under Denver, examples to too; extreme More visible compared 1st). (ranked residents Chicago million in (ranked million 20 York to almost New adhere has example, particularly 2nd) For How- not assumption. size. does city’s intuitive Untappd the this that is this note drive we Naturally, might ever, populations. that user find property We of key range paper. a wide the a throughout have abbre- use cities plus that will names, city we the which contain viations, graphs The cities. 40 all miss- and rare. impressive very is are Untappd beers by ing covered beers of range Guide recognised Beer internationally Cicerone The categories. Advocate back styles Beer Untappd to from mapping cu- taxonomy, manually own our therefore rated We styles. additional Untappd introduced hierarchy, Advocate has hierarchy Beer styles the Advocate from Beer diverging the Since which of Untappd, extension by an produced is taxonomy itself a type on The ABV. based the is and beer This beer of of crawl. type the the brewery, during the encountered includes beer every for metadata city radius. the miles of 25 a point popula- within central of check-ins the all basis selected collected the we and on case, each (chosen In Europe tion). within 6 and USA

iue2as hw h ubro sr e iy Broadly city. per users of number the shows also 2 Figure across collected checkins of number the presents 1 Figure full the collected also we data, checkin our augment To Number of Checkins 3 2 400000 200000 300000 500000 100000 https://cicerone.org/files/Certified_ http://www.beeradvocate.com/beer/style/ iue1 ubro nqecekn e city. per checkins unique of Number 1: Figure 0 chicago - Ch newyork - NY philadelphia - Ph washingtondc - Wa denver - De portland - Po boston - Bo sandiego - SD seattle - Se

3 london - Lo

a sdt eov miute.The ambiguities. resolve to used was sanfrancisco - SF losangeles - LA dallas - Da amsterdam - Am detroit - Dt baltimore - Bl charlotte - Ca phoenix - Pe

City columbus - Cl austin - Au houston - Ho milwaukee - Mi indianapolis - In lasvegas - LV sanjose - SJ nashville - Na sacramento - Sa copenhagen - Co louisville - Lu sanantonio - SA jacksonville - Ja barcelona - Ba oklahoma - Ok dublin - Du tucson - Tu albuquerque - Al paris - Pa memphis - Me fresno - Fr

2 elpaso - El . 04 odtrietebs twt epc ofu candidate four to respect with We fit best user. the per determine to checkins 2014) of number the the use of distribution users. cumulative (CCDF) Untappd complementary function by the usage shows of 3 frequency Figure the inspect we First, Frequency Usage 4.1 beers of range and frequency the application, consumed. to the relation with in interact we particularly users Here, individual Untappd. how of on usage focus the characterising by begin We the than rather users, Untappd large. of at exploration population mobile a scope adopt as to to study careful individuals are eager our we of Consequently, also Untappd. are population as such and a apps smartphone on a relies we own type, sample who this our of studies that all note with as Finally, consumed. cohol include not does that only also time data the ( The exactly quantity fluid either. at them beers consume their be they cannot checkin we users course, that Of radius). checkins. sure measurement some miss our may (within we Thus, location their associate a users when with data collect checkins only can is- we Another that is consumption). sue excessive users after beers that in guarantee checking no is ways there notably, in Most greater dataset. far is large popularity a Europe. its in across than although American inroads cities, significant of clear made number is has It This beers. Untappd unique Europe. that 139,759 in and 470,732 users with 369,905 checkins: covered America, 5,305,543 in witnessed we were num- total, 4,834,811 average In above checkins. an of performing user- ber user smaller each a with has but London base, contrast, In checkins. of number

eoecniun,i swrhntn h iiain fthe of limitations the noting worth is it continuing, Before Number of Users 40000 45000 20000 25000 30000 35000 10000 15000 5000 esr rn ons ahrta xc mut fal- of amounts exact than rather counts, drink measure hci h er htte rn ( drink they that beers the checkin 0 powerlaw hrceiigUtpdUsers Untappd Characterising 4 chicago - Ch newyork - NY philadelphia - Ph

iue2 ubro sr e city. per users of Number 2: Figure washingtondc - Wa denver - De e.g., portland - Po boston - Bo sandiego - SD seattle - Se itv.50m) osqety ecan we consequently, ml); 500 vs. pint 1 london - Lo akg Asot uloe n Plenz and Bullmore, (Alstott, package sanfrancisco - SF losangeles - LA dallas - Da amsterdam - Am detroit - Dt baltimore - Bl charlotte - Ca phoenix - Pe

City columbus - Cl austin - Au houston - Ho milwaukee - Mi indianapolis - In lasvegas - LV

e.g., sanjose - SJ nashville - Na sacramento - Sa copenhagen - Co

srmystop may user a louisville - Lu sanantonio - SA jacksonville - Ja barcelona - Ba oklahoma - Ok dublin - Du tucson - Tu albuquerque - Al paris - Pa memphis - Me

al- fresno - Fr elpaso - El 0.35 100

0.30 10-1

0.25 10-2 0.20 10-3

CCDF 0.15 10-4 0.10 10-5 0.05

-6 Checkins per User Day 10 0 1 2 3 10 10 10 10 0.00 Number of Checkins paris - Pa fresno - Fr elpaso - El dallas - Da austin - Au tucson - Tu dublin - Du detroit - Dt london - Lo seattle - Se sanjose - SJ boston - Bo denver - De chicago - Ch phoenix - Pe portland - Po houston - Ho louisville - Lu Figure 3: Users checkins distribution. The dashed red line lasvegas - LV newyork - NY baltimore - Bl nashville - Na charlotte - Ca columbus - Cl sandiego - SD memphis - Me barcelona - Ba oklahoma - Ok milwaukee - Mi losangeles - LA −α −λx jacksonville - Ja indianapolis - In sanantonio - SA sacramento - Sa albuquerque - Al philadelphia - Ph amsterdam - Am copenhagen - Co shows the truncated power-law fit f(x) = x e , with sanfrancisco - SF washingtondc - Wa α ≈ 1.29 and λ ≈ 0.01. City

Figure 4: Average number of unique checkins per user per distributions, i.e., exponential, power law, truncated power city on a daily basis (ordered by number of checkins across law, and lognormal. In particular, we compare the truncated whole measurement period). power law with alternative hypotheses via a likelihood ratio test, finding that in all cases the best fitting distribution is a 100 truncated power law (p < 0.01). Note that a checkin indi- -1 cates a drink has been consumed. A large number of users 10 are quite occasional, with 65.2% having under 10 checkins. 10-2

However, there are a small subset of extremely dedicated -3 10 users. For example, we see that 2.44% of users have over -4 100 checkins in a 112 day period. Most users spread their CCDF 10 drinks across a long period, however, we also observe many 10-5 examples of intensive drinking. To check this, we compute -6 the maximum number of drinks each user consumes on a 10 -7 daily basis. Most users drink quite responsibly, with 75.44% 10 0 1 2 3 checking in a maximum of 4 drinks per day. However, we 10 10 10 10 Number of Unique Beers also find a notable subset of heavy users; 6.68% of users consume in excess of 10 drinks a day at least once during the measurement period. Figure 5: Unique beers distribution. The dashed red line We can also see how this usage frequency varies across shows the truncated power-law fit, with α ≈ 1.29 and cities. Figure 4 presents a box plot for the number of check- λ ≈ 0.01. ins per user on a daily basis. The mean number of daily per- user checkins is typically between 0.05 and 0.15. There is no clear rank that might relate to the numbers of users in a and New York. On a per-user basis, their ranks drop to 14th city. There are, however, some notable outliers worth men- and 19th, respectively. Instead, small cities such as Den- tioning. Interestingly, these tend to be either smaller Ameri- ver, Portland and San Diego report far higher rates. In Port- can cities or European cities. The city with the highest per- land, for example, 0.78% of users checkin a beer once ev- user checkin rate is Portland (mean of 0.13 checkins per ery two days on average. This is in contrast to Chicago, for user per day), although London, Amsterdam, Copenhagen instance, where the equivalent percentage is 0.39%. Again, and Barcelona all report similar per-user per-day checkin this is likely because users in small cities have had to be rates to the larger US cities (between 0.08 and 0.1 check- more proactive in discovering Untappd, whilst those in large ins per user per day). This shows that, although America cities have probably been exposed to it (e.g., via friends, in has a large userbase, Europeans are more active. This may bars) and installed it without necessarily being particularly be explained by the higher alcohol consumption reported in interested in its long-term usage. Europe (Peter Anderson and Galea 2012). Another possibil- ity is that users in Europe have had to be more proactive 4.2 User Drinking Preferences to discover Untappd and, therefore, have a propensity to be Next, we analyse the range of beers that individuals checkin. more-committed to its usage. Figure 5 shows the CCDF of the number of unique beers per That said, users in smaller American cities are also dis- user. We find that it follows a truncated power-law distri- proportionately active in using Untappd. The two highest bution: A small fraction of users checkin a high number of ranked cities by absolute number of checkins are Chicago unique beers, with the remainder far less active. A user, on 100 12000

10-1 10000

10-2 8000

-3 6000 10 CCDF 4000 10-4 Number of Users 2000 10-5 0 -6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10 0 1 2 10 10 10 Beer Diversity Index Number of Styles Figure 7: Shannon Entropy of the per user checkins fre- Figure 6: Beer styles distribution. The dashed red line shows quency for all users with at least 10 checkins. the truncated power-law fit, with α ≈ 1.23 and λ ≈ 0.03.

9000 Monday 8000 Tuesday average, drinks 14.34 unique beers, with 54% of the users 7000 Wednesday Thursday 6000 trying at most 5 different beers; 6.5% trying more than 50; Friday and 1.9% trying more than 100 unique beers. 5000 Saturday In order to get a better insight into the users drinking pref- 4000 Sunday erences, we characterise the affinity that individuals have 3000 to certain types of beer, e.g., lagers, British ales, American 2000 etc. ales, Belgian beers Figure 6 presents a CCDF of the 1000 number of types of beers that users checkin. This is based Average Number of Checkins on the beer taxonomy presented in §3. It can be seen that 0 1am 1pm 7am 5am 3am 2am 6am 8am 9am 7pm 3pm 5pm 2pm 4am 6pm 8pm 9pm 4pm 10am 11am 12am the bulk of users only drink a small range of different beer 12pm 10pm 11pm types. On the one hand, this is because many users simply Time do not checkin many times. However, on the other hand, we find that some users are extremely exploratory. The aver- Figure 8: Average number of checkins per hour. age number of beer types a users consumes is just 8.9, yet we find that some users (12.06%) consume over 20 different styles in our measurement period, and 6.1% consume more 4.3 Temporal Drinking Patterns than 30 different styles. For reference, the total number of A particularly novel aspect of Untappd is the ability to styles in the taxonomy is 222. If we restrict this analysis to checkin individual drinks ‘live’. This gives us insight into those users who have made at least 25 checkins during the the drinking patterns of different cities in terms of the monitoring period (i.e., those users above a ‘casual’ level time that people drink. Figure 8 presents the average num- of usage) we find the average beer types consumed rises to ber of checkins per hour across our entire dataset (nor- 30.13. Of these more active users, 98% have drunk at least malised by timezone). It can be seen that clear patterns are 10 different styles of beer, 71% have drunk at least 20 dif- followed. Most checkins occur between 4PM and 10PM, ferent styles, and 37% have drunk at least 30 different styles peaking around 7–8PM. Checkins during the working day of beer during the monitoring period. This illustrates the di- are surprisingly frequent, starting most noticeably at 12PM versity of styles sampled by the active userbase of Untappd. (18.28% occur between 12–6PM). Finally, to address the varying number of checkins per It can also be seen that the rate at which people drink in- user, we compute the Shannon Entropy of the per user creases during the week: People checkin noticeably more on checkin frequency. Higher entropy means a more uniform Thursday than on Monday (average of 12.7% checkins vs. distribution across beer types, i.e., a user who explores many 8.34%). Of course, Friday and Saturday evening witness the different types. A low score indicates a user who limits highest rates. It is particularly interesting to see the differ- themselves to a small range of types. Figure 7 presents the ence between these two key days. Whereas Saturday has far distribution of scores across all users that have at least 10 more checkins than Friday, these are spread across a much checkins. It can be seen that a normal distribution is fol- longer period of time (with a large number of users drink- lowed. Whereas a few users are highly experimental, the ing between 12PM and 10PM on Saturday). Users also start majority show a strong propensity towards 4 – 8 preferred to drink around this time on Sunday, although to a lesser types. In summary, it is clear that most users do have clear extent. In contrast, on Friday a large number of drinks are affinities to beer types, with only a small minority of explor- consumed, but these are consumed in a far shorter time pe- ers willing to experiment widely. riod. In fact, there is a higher peak on Friday than on Satur- sr hcigi h aebes o xml,i Chicago in example, For two beers. of same chance the greater in a as checking intuitive, indicate users is would This population city. the larger unique in a population of user different, fraction the very the and previ- are between beers trends the relationship the inverse in that the an seen as by with be location ordered can It per is section. checkins plot had ous of that The of beers there. number for of in absolute were value checked checkins a been of example, already 90% For that by there. indicates normalise checkins 0.1 we city, of each number For unique city. the of each number in range the observed presents greatest beers 10 the Figure have available. cities beers which of inspecting by heaven? beer start We is Where how 5.1 on focus regions. now geographic across we varies section, Un- popularity beer previous within important the are Unlike beers tappd. which investigate we Next, days. all on late for New at 9PM in in around drinking than peaking their that evening, less centre they seen far Instead, of is be London. drinking or both York also daytime to can Their tendencies cities. It different these very 3PM. exhibits heavily at users drink peaking Parisian London day, in the users during particularly where is Saturday, propensity a on just This evident York. to compared New be in can 2,668 increase This to 206% increase. compared 267% a checkins Lon- Friday, against 996 a Monday, of on compared a average when on an collects example, time don For of activities. period weekday more longer their consume London a in over drinkers Satur- cities, and and other Friday to binge-oriented comparing night more when day Paris; is or seen London York be New in can than culture it drinking First, Paris. the and that London York, in shown New comparison, 9: for Figure cities three select We extracted. however, can, be differences the cultural across Subtle weekend. trends and similar week display cities all speaking, Broadly 6PM. On before occur drinking. checkins daytime all of of tenth rate a Saturday greatest Sunday drinks. the work’ have ‘after Sunday for peo- and early that work suggesting finish weekdays, often drink other ple to on than start drinks Friday also on People more week. earlier 55% the during checkin than people hour weekend per the Overall, checkins peak). all of the 19.23% at vs. 20.57% of (average day

et eisethwteepten ayars cities. across vary patterns these how inspect we Next, Average Number of Checkins 400 600 200 300 500 700 100 0 5am 6am idn h er htMatter that Beers the Finding 5 7am Sunday Saturday Friday Thursday Wednesday Tuesday Monday 8am 9am 10am 11am 12pm iue9 vrg ubro hcisprhu n( in hour per checkins of number Average 9: Figure 1pm 2pm 3pm Time 4pm 5pm 6pm 7pm 8pm 9pm 10pm 11pm 12am 1am 2am 3am 4am Average Number of Checkins 200 250 300 350 100 150 50 0 5am 6am 7am Sunday Saturday Friday Thursday Wednesday Tuesday Monday 8am 9am 10am 11am 12pm 1pm 2pm 3pm Time 4pm 5pm uoeas rn mrcnbes(7) hyso greater show they in (27%), beers users American many drink Although also have ales. Europe users and American lagers cities. that tendency American European the towards by and led largely American is between This to- divergence 11(a) Figure the in preferences seen, noticeable is Most be strong beers. of can exhibiting types clusters certain wards locales Clear re- individual cities. the American with presents 11(b) just presents Figure for 11(a) whilst sults cities, Figure all 2000). for Cox results the and Multidimen- (Cox using Scaling space sional two-dimensional a their onto project and locations 1991) We (Lin city. us- divergence signature the Jensen-Shannon city’s ing in each between consumed distances being prob- the style compute sig- the then beer contains A given element them. a each of of which one ability in cities, vector, each a respective for is signature their nature beer into a checkins compute all and separate We area. of number the of trend the there. following users unique closely of city number terms, new the absolute per a with In beers case, for Paso. the is are El opposite for checkins the however, 20% of over 4% to compared around beer, only York New and of number by (ordered city period). measurement by the whole normalised in across city checkins checkins per of beers number unique the of Number 10: Figure 6pm i ecnas oka h tlso er vial neach in available beers of styles the at look also can We Normalised Number of Unique Beers 7pm e ok ( York, New ) 0.00 0.10 0.20 0.05 0.15 0.25 8pm 9pm chicago - Ch 10pm newyork - NY 11pm philadelphia - Ph 12am washingtondc - Wa 1am denver - De portland - Po 2am boston - Bo 3am

sandiego - SD ii 4am

seattle - Se ( and London ) Average Number of Checkins 20 30 25 35 10 london - Lo 15 0 5 sanfrancisco - SF 5am losangeles - LA dallas - Da 6am amsterdam - Am 7am Sunday Saturday Friday Thursday Wednesday Tuesday Monday detroit - Dt 8am baltimore - Bl 9am charlotte - Ca phoenix - Pe 10am

City columbus - Cl 11am austin - Au 12pm houston - Ho iii 1pm

milwaukee - Mi Paris. ) 2pm indianapolis - In 3pm lasvegas - LV Time sanjose - SJ 4pm nashville - Na 5pm sacramento - Sa 6pm copenhagen - Co louisville - Lu 7pm sanantonio - SA 8pm jacksonville - Ja 9pm barcelona - Ba 10pm oklahoma - Ok 11pm dublin - Du tucson - Tu 12am albuquerque - Al 1am paris - Pa 2am memphis - Me 3am fresno - Fr elpaso - El 4am 72 0.10 Am Fr 0 0.04 Tu Au 78 El 15 Ho SF Ok Ok De SD 84 0.05 Me 0.02 Da Sa DaMiSA Pa 30 Pe LASJ Dt Ho Lu LV SA Na Bl Po 90 ClJa PhWaChAu 0.00 InCa NYBo Ba 45 0.00 Ja NY Ch Tu Pe DeLA El WaPh Se Al SJ 96 Fr SeSDSaSF Co LV Bl Ca Po 60 Mi Cl 0.05 0.02 In Bo 102 Longitude 75 Lu Dt Longitude 108 0.10 90 0.04 Al Na 114 2nd Leading Dimension 105 2nd Leading Dimension 0.06 0.15 Lo Du Me 120 120 0.10 0.05 0.00 0.05 0.10 0.15 0.06 0.04 0.02 0.00 0.02 0.04 1st Leading Dimension 1st Leading Dimension (a) All cities (b) USA only

Figure 11: Similarity of beer signatures across cities. The signatures are computed using the probability that each beer style is consumed in a city.

72 availability, but also differing costs to buy local beer versus 0.04 Po SF SD 78 imported beer. However, it is evident form this analysis that Se Fr 0.02 SJ regional factors (beyond availability) do heavily influence Ch LA 84 Bo Ca Au the beers consumed. 0.00 Ph NY 90 Wa Ho Cl Pe 96 Bl Tu 0.02 Na SA 5.2 Ale or lager? Dt 102

Mi Longitude 0.04 LV 108 We can inspect the different preferences across cities. Fig- ure 13 presents a heat map showing the location quotient (Is- 114 2nd Leading Dimension 0.06 Me serman 1977) (LQ) of checkins in each city for various style 120 categories. The location quotient for a particular city and cat- 0.04 0.02 0.00 0.02 0.04 1st Leading Dimension egory indicates the number of checkins to that category in the city, relative to the global proportion for that beer style. Red indicates a strong preference (more than the global av- Figure 12: Similarity of beer signatures across American erage) for a particular genre of beer; blue indicates the op- cities. To mitigate the impact of beer availability, the data is posite. Various cultural preferences can be observed. For in- subsetted to only leave checkins for beers that are available stance, unsurprisingly, it can be seen that London exhibits in all cities. a strong propensity towards ‘English Ales’ (probability of an ‘English Ale’ is 4.0 times the global likelihood), whilst these remain broadly unpopular across other cities. This, of diversity, consuming various English, Belgian and German course, is largely a product of availability, which will be less beers more regularly. Thus, it is interesting to see that Un- in non-English areas. However, it is worth noting that even tappd effectively captures these cultural boundaries. some American cities (e.g., Phoenix and Dallas) show a lik- We can also observe divergence within America itself, ing for English ale. Other cities show less obvious trends, shown in Figure 11(b). This divergence emerges between spreading their consumption across various types of beers. East and West coast cities. A large part of this is, clearly, the American ales do not stand out as particularly popular in the availability of beers in the area. To explore this, we compute American cities. This is because all American cities tend the set of unique beers consumed in each American city and towards American ales, leading none to an above average then calculate their intersection. We find that 64% of beer dominance. The European beer with the greatest export pop- styles are available in all American cities. This suggests that ularity is Germany, outperforming both English and Belgian availability alone is not the only factor in the differing pref- ales in America. erences seen. To understand the role of availability, we sub- set the American cities to leave those with in excess of 50k 5.3 Beer ratings checkins. We then subset the checkins to leave only those A particularly helpful feature of Untappd is the ability to rate beers that are available in all the cities. Through this, we beer from 0 — 5. This can be used to inform your friends of mitigate the impact that availability has on our earlier anal- good/bad beers, but also to keep a personal diary of beers ysis. Figure 12 shows the results. It can be seen that the that have been enjoyed. Here, we briefly explore the dis- clusters still remain, suggesting that different cities do have tribution of these ratings across the entire dataset, shown inherent cultural preferences towards particular beer types. in Figure 15. On the whole, users seem relatively positive The reasons for this could be diverse, and includes not only about the beers they drink, with a mode average rating of Figure 13: Beer styles location quotient. The location quotient for a city and category indicates the proportion of checkins to that category in that city with respect to the global proportion for that beer style. Red indicates a preference towards a style, whereas blue indicates a preference against. Styles with fewer 60,000 checkins and cities with fewer than 26,000 checkins have been aggregated into ‘other’ categories.

4. Very few beers score below 2 with the overwhelming including those emerging from human societies, display a majority 64% falling between 3 and 4. A particularly cu- power-law degree distribution with exponent α between 2 rious attribute of our findings is the difference between Eu- and 3 (Ahn et al. 2007; Kwak et al. 2010). Indeed, we find ropean and American ratings. Whereas the average rating in that the degree distribution of the Untappd social network American cities is ≈ 3.75, this is instead ≈ 3.50 in Europe. follows a power-law with α ≈ 1.85. This reveals a sparse Although this may not seem significant, the consistency of social graph, lacking the large social groups seen in other these scores are remarkable with American users nearly al- social networks. 52.74% of users have under 10 friends. The ways marking one ordinal point above Europeans. The exact median friendship size is just 9 (mean 21.64), indicating that reason for this is unclear, however, we posit that differences social interactions on Untappd are localised to a relatively in culture lead to American users being more generous. small group of people. Further, we find a network density of Finally, we find that there is no correlation between the just 0.0000069693, further highlighting the sparsity of the average score and the number of checkins of a beer. We Untappd social graph. measure Kendall’s tau coefficient and we find that it is ap- Another common observation of the structure of social proximately equal to 0.0192, with p-value < 10−8. This is networks is homophily — the principle that individuals tend surprising, as one would expect popular beers to also receive to be similar to their friends (McPherson, Smith-Lovin, and high ratings. Instead, it seems that users tend to experiment Cook 2001). Next, we briefly study homophily in the Un- with new beers often rather than repeatedly consuming the tappd network, asking the simple question: to what extent same highly ranked beers. This gives powerful insight into are individuals’ beer drinking preferences similar to their the usage patterns of Untappd. friends? To explore homophily, we use the approach of Aiello et al. (Aiello et al. 2012), whereby we measure the 6 Untapping the Social Network average similarity between friends and compare this em- pirical measurement to appropriate random null models. A Untappd is equipped with a social network, allowing users to null model preserves the same social structure (ensuring the ‘follow’ each others’ activities. Untappd’s friend mechanism same degree distribution and community structure), but ran- is symmetric; i.e., both users involved in the friendship must domises other features in the network. accept it. A logical question is to what extent do users within We first construct and compare beer style feature vectors a social network influence each other. To reconstruct the Un- of two users using cosine similarity (Salton 1989). A feature tappd social graph, we generated a list of all users we ob- vector represents the number of checkins by a user to each served with at least one check in before 30 September 2015. beer style in the taxonomy described in §3. User similarity is We then retrieved each user’s publicly accessible friends list therefore based on the propensity to drink the same types of from the API. This crawl constitutes 254,868 users across beers. Before analysis, we filter users to remove those that all monitored cities. We then combined these 254,868 friend have fewer than six checkins or have visited fewer than six lists (i.e., ego-centric networks) to build a single undirected different venues (leaving 103,403 users and 456,426 friend- network. ship links). We first inspect the degree distribution of the social graph, Figure 14 depicts the distribution of friend similarities. shown in Figure 16. It is has been shown that real networks, The intra-city null model shuffles users within the same city, 7 0.9 Empirical Data 6 0.8 Europe Co-visit Null Model 0.7 USA 5 Intra-city Null Model 0.6 4 0.5

3 0.4 Frequency 0.3

Frequency [%] 2 0.2 1 0.1 0.0 0 0 1 2 3 4 5 0.5 0.6 0.7 0.8 0.9 1.0 Beer Rating Similarity Figure 15: Beer ratings distribution in European and Amer- Figure 14: Homophily of Untappd users. Distribution of co- ican cities. sine similarities between friends according to the empirical data and two null models. 100

-1 while the co-visit null model randomly re-assigns a profile 10 with a user who has visited the same venue (to avoid issues 10-2 with beer availability). In both null models, the structure of the friendship network is not changed. The null models only 10-3 CCDF act to randomly re-assign user profiles. It can be seen that the -4 empirical similarities tend to be higher than the null models, 10 confirming the presence of homophily in friends’ drinking 10-5 preferences. Histograms of null models are approximated -6 by numerical experiments. Applying Kolmogorov-Smirnov 10 0 1 2 3 4 5 tests, we find that both null models statistically differ from 10 10 10 10 10 10 Degree the empirical data (p < 0.01). The average similarity for the empirical data, co-visit null Figure 16: Degree distribution, social network. model, and intra-city null model are 0.853, 0.829, and 0.782, respectively. Unsurprisingly, the co-visit null model tends to result in higher similarities than the intra-city null model, more traditional methodologies (e.g., surveys) is a promis- indicating that commonality in the venues visited by friends ing line of work here. Second, a number of further apps constrains the range of beers they are exposed to, pushing could be built atop of Untappd. These could have a vari- their profiles to become more similar. However, the statis- ety of foci, however, intuitive possibilities include recom- tical difference between the empirical data and the co-visit mendation apps, targeted advertisement, social connectivity model indicates that, even after accounting for these corre- services (e.g., facilitating meet-ups) and heath-related ser- lations, homophily is present. vices. For instance, health features could be integrated into the app itself, with warnings provided to users when exceed- 7 Implications ing certain amounts. This might be particularly helpful for users who consume drinks containing a wide range of alco- There are a number of implications from our work. We be- hol strengths. lieve our analysis could offer a powerful insight for sociol- ogists, dieticians, and psychologists, specialising in alcohol consumption. Here, we briefly discuss two key applications 8 Conclusion and Future Work that could be built. First, our work could form the under- In this paper, we have explored the consumption of beer pinning of future (automated) documentation of drinking ac- through the lens of social media. Using nearly 4 months of tivities. Various organisations frequently perform large-scale data, collected from the Untappd social app, we have charac- studies into drinking habits (e.g., the World Health Organi- terised the activities of users from both Europe and America. sation (Peter Anderson and Galea 2012)). These are usually We have found clear trends on both a per-user and per-region aimed at assisting governmental policy construction. How- basis. In both cases, we find a strong affinity for certain types ever, they are slow to be performed and often coarse-grained of beers. This is particularly prominent on a per-region ba- in terms of the data collected. Untappd offers the potential sis, with different cities having very evident beer signatures. to near-automate this process, with huge bodies of data read- The clarity of these signatures was surprising, with the abil- ily available across the globe. Of course, a key challenge is ity to automatically cluster users into geographic areas. Cul- normalising and interpreting such data so that it can be gen- ture and availability are clearly paramount here, although eralised across an entire population. Combing Untappd with it is difficult to ascertain exactly which plays the greater role. Similar findings apply to temporal drinking patterns as CGAStrategy. 2013. Craft beer sales jump 79%. well. These factors confirm that beer consumption can offer http://www.peach-report.com/Trends/ a fine-grained insight into the cultural properties of a region; 2039066/craft_beer_sales_jump_79.html. whereas past work has also identified these ingrained tradi- Clarke, I.; Kell, I.; Schmidt, R.; and Vignali, C. 2000. Think- tions, we are the first to do this on a large-scale. These tradi- ing the thoughts they do: Symbolism and meaning in the tions relate closely to Untappd’s growing social network. It consumer experience of the british pub. British Food Jour- was particularly interesting to see the presence of homophily nal 102(9):692–710. in regards to friends’ preferred beer styles. Again, this raises Cox, T., and Cox, M. 2000. Multidimensional scaling. CRC interesting questions regarding causality. We posit that Un- Press. tappd friends may often share experiences (e.g., drinking the same drinks together, or recommending drinks to each Cramer, H.; Rost, M.; and Holmquist, L. E. 2011. Perform- other). However, within our current dataset this is impossible ing a check-in: emerging practices, norms and’conflicts’ in to determine. With this in mind, we are keen to emphasise location-sharing using foursquare. In Proceedings of the that our insights do not necessarily generalise to whole pop- 13th International Conference on Human Computer Inter- ulations but, instead, shed light on the activities of Untappd action with Mobile Devices and Services, 57–66. ACM. users. Despite this, we find that the scale and real-time na- Culotta, A. 2013. Lightweight methods to estimate influenza ture of Untappd data is unrivalled in this particular research rates and alcohol sales volume from twitter messages. Lan- domain. guage resources and evaluation 47(1):217–238. There are a number of interesting areas of future work Isserman, A. M. 1977. The location quotient approach to we plan to focus on. Most notably, we wish to expand our estimating regional economic impacts. Journal of the Amer- work on the Untappd social network. For example, it would ican Institute of Planners 43(1):33–41. be interesting to see how often Untappd friends drink to- Kanny, D.; Liu, Y.; Brewer, R. D.; for Disease Control, C.; gether and how they influence each other. We posit that (CDC), P.; et al. 2011. Binge drinkingunited states, 2009. drinks might ‘spread’ through the social network via both MMWR Surveill Summ 60(Suppl):101–4. online and offline recommendations (i.e., two people drink- ing together in a pub). Quantifying this could be extremely Kwak, H.; Lee, C.; Park, H.; and Moon, S. 2010. What is powerful for recommendation engines and targeted adver- twitter, a social network or a news media? In Proceedings of tisement. Another key line of future work will be to further the 19th international conference on World wide web, 591– investigate how cultural aspects impact drinking. So far, we 600. ACM. have explored primarily American users, however, we in- Lin, J. 1991. Divergence measures based on the shan- tend to increase this to include users on all continents. We non entropy. Information Theory, IEEE Transactions on also plan to develop some of the apps discussed in §7. Fi- 37(1):145–151. nally, we conclude by saying that inspecting alcohol con- McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. sumption through social media is a topic that has yet to re- Birds of a Feather: Homophily in Social Networks. Annual ceive sufficient attention. We therefore hope that Untappd Review of Sociology 27:415–444. could provide a catalyst for researchers to explore this topic Mejova, Y.; Haddadi, H.; Noulas, A.; and Weber, I. 2015. more deeply. # foodporn: Obesity patterns in culinary interactions. In Proceedings of the 5th International Conference on Digital Acknowledgements Health 2015, 51–58. ACM. We thank Lorrie Bonser and Thomas Galloway for valuable Naylor, T. 2014. The craft beer revolution: how hops got insights into the beer industry. We also acknowledge Dr. Ja- hip. The Guardian. son Khan for his fruitful beer-driven discussions. Paul, M. J., and Dredze, M. 2011. You are what you tweet: References Analyzing twitter for public health. In ICWSM, 265–272. Peter Anderson, L. M., and Galea, G. 2012. Alcohol in the Abbar, S.; Mejova, Y.; and Weber, I. 2014. You tweet what european union. consumption, harm and policy approaches. you eat: Studying food consumption through twitter. arXiv World Health Organisation. preprint arXiv:1412.4361. Salton, G. 1989. Automatic Text Processing: The Transfor- Ahn, Y.-Y.; Han, S.; Kwak, H.; Moon, S.; and Jeong, H. mation, Analysis, and Retrieval of Information by Computer. 2007. Analysis of topological characteristics of huge online Boston, MA, USA: Addison-Wesley Longman Publishing social networking services. In Proceedings of the 16th inter- Co., Inc. national conference on World Wide Web, 835–844. ACM. Silva, T. H.; de Melo, P. O.; Almeida, J.; Musolesi, M.; and Aiello, L. M.; Barrat, A.; Schifanella, R.; Cattuto, C.; Loureiro, A. 2014. You are what you eat (and drink): Iden- Markines, B.; and Menczer, F. 2012. Friendship Prediction tifying cultural boundaries by analyzing food & drink habits and Homophily in Social Media. ACM Trans. Web 6(2):9:1– in foursquare. arXiv preprint arXiv:1404.1009. 9:33. Tamersoy, A.; De Choudhury, M.; and Chau, D. H. 2015. Alstott, J.; Bullmore, E.; and Plenz, D. 2014. powerlaw: Characterizing smoking and drinking abstinence from so- a python package for analysis of heavy-tailed distributions. cial media. In Proceedings of the 26th ACM Conference on PloS one 9(1):e85777. Hypertext & Social Media, 139–148. ACM.