Longitudinal Measurements of Link Usage on Twitter / Longitudinella mätningar av länkanvändning på Twitter
Linköping University | Department of Computer and Information Science
Bachelor’s thesis, 16 ECTS | Link Usage
2019 | LIU-IDA/LITH-EX-G--19/036--SE

Longitudinal measurements of link usage on Twitter
Longitudinella mätningar av länkanvändning på Twitter

Oscar Järpehult and Martin Lindblom

Supervisor: Niklas Carlsson
Examiner: Markus Bendtsen

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes.
Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Oscar Järpehult and Martin Lindblom

Students in the five-year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates in the demonstration of a working product and a written report documenting the results of the practical development process, including requirements elicitation. During the final stage of the semester, students form small groups and specialise in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialisation work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.

Abstract

When Twitter launched with its distinctive limit of 140 characters per post, link shorteners came into use. They were the only way to fit long URLs into tweets until Twitter provided its own integrated link shortener. This study investigates how links are used on Twitter.
The study includes both careful data collection using multiple APIs and analysis of the collected data, providing new insight into this topic. It was found that a small set of internet domains accounts for a large part of the links found in posted tweets. This set of most frequently occurring domains did not necessarily match the top domains on common internet top lists. When looking at link shorteners in posted tweets, we found that “bit.ly” was the most common one. Our method of collecting data also made it possible to look up the number of clicks that “bit.ly” links had received. We compared the click data to the number of retweets received by the tweets containing these links, which led to some interesting discoveries regarding the ratio between the two.

Acknowledgments

We would like to give special thanks to our supervisor, Niklas Carlsson, who has been very helpful in answering questions, helping us find solutions to the problems we encountered, and putting us on the right track regarding which research questions would be beneficial for us to study.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Questions
  1.4 Contributions
  1.5 Delimitations
2 Background
  2.1 Twitter
  2.2 URL Shortening
  2.3 Domain Top Lists
  2.4 Related Work
3 Method
  3.1 Data Collection
  3.2 Data Analysis
  3.3 Limitations
4 Results
  4.1 Domain Statistics
  4.2 User Statistics
  4.3 Bitly Link Interaction
  4.4 Verified vs Non-verified Users
  4.5 Miscellaneous Statistics
5 Discussion
  5.1 Results
  5.2 Method
  5.3 The Work in a Wider Context
6 Conclusion
  6.1 Future Work
Bibliography
A Appendix

List of Figures

3.1 Overview that shows the data flow and process communication of the tweet collection.
3.2 Overview that shows the data flow and process communication of the additional data collector.
3.3 Overview of how the timing for the different phases looked.
4.1 Distribution of domains for different classes in top 1M lists.
4.2 Distribution of domain rank.
4.3 5x5 scatter plot that shows the frequencies and ranks of domains in the top 25 of the categories All links, Shortened links, Bitly links, Alexa ranking and Majestic ranking.
4.4 Distribution of the age of users' accounts at the time of posting their tweet.
4.5 Distribution of the number of tweets favourited by users at the time of posting their tweet.
4.6 Distribution of the number of tweets posted by users at the time of posting their tweet.
4.7 Ratio between tweets favourited and tweeted by users at the time of posting their tweet.
4.8 Distribution of the number of followers for users at the time of posting their tweet.
4.9 Distribution of the number of friends for users at the time of posting their tweet.
4.10 Followers-to-friends ratio for users at the time of posting their tweet.
4.11 Distribution of the delays for Bitly data to be retrieved after retweet data for tweets.
4.12 Two scatter plots of Bitly clicks-to-retweets ratio.
4.13 Clicks-to-followers ratio for Bitly links.
4.14 Heat-map of retweets vs followers tweeted.
4.15 Heat-map of retweets vs number of tweets tweeted.
4.16 Heat-map of followers vs number of tweets tweeted.
A.1 Followers-to-friends ratio for users at the time of posting their tweet, side-by-side view for all categories.

List of Tables

3.1 Clicks from different time spans and different sources for a Bitly link. The symbol in each cell indicates which information can be extracted from the same API call.
4.1 Number of tweets that are included in different categories.
4.2 Top 20 most frequent domains for two different categories.
4.3 Top 20 most frequent domains for three different categories.
4.4 Number of unique users for each category and how many of those users are verified.
4.5 Breakdown of the number of geo-coded tweets for each category and the percentage those tweets make up of the total number of tweets for that category.
4.6 Breakdown of top 5 languages for tweets of the four different categories.
4.7 Breakdown of top 5 languages for users that posted tweets of the four different categories.
A.1 All collected shorteners sorted by domain and how many we were able to get the full domain from.

1 Introduction

As the World Wide Web expands, more and more people can keep in touch and share their lives online via social media sites. With a single click, one can reach millions of users worldwide and share everything from plain text to media such as pictures or videos. One of the largest social media sites is Twitter, with 326 million monthly active users [12]. This large user base makes Twitter a good source for analyzing patterns among users and for understanding what type of content is being shared online.

1.1 Motivation

The motivation for this project stems from the importance of knowing how users behave on social media. Such data can be informative to the public, but even more so, it can serve as a source of knowledge for future academic research to build upon. For a wider audience, insight into how links are used can help educate about users' habits and how these might differ depending on, for example, culture, location or age.