UNIVERSITY OF CALGARY

Traffic Analysis of Two Scientific Web Sites

by

Yang Liu

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN COMPUTER SCIENCE

CALGARY, ALBERTA

DECEMBER, 2015

© Yang Liu 2015

Abstract

This thesis presents a workload characterization study of two scientific Web sites at the University of Calgary, based on a four-month period of observation (from January 1, 2015 to April 30, 2015). The Aurora site is a scientific site for auroral researchers, providing auroral images collected from remote cameras deployed in northern Canada. The ISM site is a scientific site providing lecture materials to about 400 undergraduate students in the ASTR 209 course.

Three main observations emerge from our workload characterization study. First, scientific Web sites can generate extremely large volumes of Internet traffic, even when the user community is seemingly small. Second, robot traffic and real-world events can have surprisingly large impacts on network traffic. Third, a large fraction of the observed network traffic is highly redundant, and can be reduced significantly with more efficient networking solutions.

Acknowledgements

I would like to express my sincere appreciation and gratitude to my supervisor, Dr. Carey Williamson, for his invaluable support and insightful suggestions on my graduate research. His enthusiasm motivated my passion for accomplishing this study. His patience and meticulousness helped me overcome my writing weaknesses and finally finish this thesis.

I would like to acknowledge Michel Laterman, Martin Arlitt, and the U of C IT staff for setting up the logging system and capturing the data. Michel provided extensive guidance on processing the large log data. A very special thanks goes out to Martin and Michel for their constructive suggestions and support in helping me polish this thesis.

I would like to thank my external committee member, Dr. Eric Donovan, for being on my committee and for his valuable advice from the perspective of Physics and Astronomy. I would like to thank Emma Spanswick and Darren Chaddock for their technical expertise about the setup and operation of the Aurora site.

I am also indebted to my fellow lab-mates in the Networks Group, Yao, Ruiting, Brad, Mohsen, Keynan, Vineet, Mahshid, Haoming, Linquan, Xunrui, Sijia, Shunyi, Yuhui, and Wei, for the fun we have had and the help you offered during my graduate study. I wish you all the best in your studies and careers.

I would like to thank all my friends for supporting me spiritually, and the U of C for offering me this great opportunity to learn and to explore.

Finally, I would like to thank my family for their support through my entire life, especially my parents Baocheng and Ling for respecting my decisions and encouraging me with their best wishes.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols
1 INTRODUCTION
1.1 Open Scientific Web Sites
1.2 Background Context
1.3 Motivation
1.4 Objectives
1.5 Contributions
1.6 Thesis Overview
2 BACKGROUND and RELATED WORK
2.1 TCP/IP Model
2.1.1 Physical Layer
2.1.2 Link Layer
2.1.3 Network Layer
2.1.4 Transport Layer
2.1.5 Application Layer
2.2 HTTP and the Web
2.2.1 Persistent Connections
2.2.2 HTTP Messages
2.2.3 HTTP Secure
2.3 Network Traffic Measurement
2.4 Web Robots
2.5 Video Streaming
2.6 Scientific Web Sites
2.7 Summary
3 METHODOLOGY
3.1 Endace DAG Card Deployment
3.2 Bro Logging System
3.3 Data Pretreatment
3.4 Summary
4 AURORA SITE ANALYSIS
4.1 HTTP Analysis
4.1.1 HTTP Requests
4.1.2 Data Volume
4.1.3 IP Analysis
4.1.4 HTTP Methods
4.1.5 HTTP Referer
4.1.6 URL Analysis
4.1.7 File Type
4.1.8 HTTP Response Size Distribution
4.2 Robot Traffic
4.2.1 Prominent Machine-Generated Traffic
4.2.2 AuroraMAX
4.3 Geomagnetic Storm
4.4 Summary
5 ISM SITE ANALYSIS
5.1 HTTP Analysis
5.1.1 HTTP Requests
5.1.2 Data Volume
5.1.3 IP Analysis
5.1.4 URL Analysis
5.1.5 HTTP Methods
5.1.6 HTTP Status Codes
5.1.7 HTTP Response Size Distribution
5.1.8 User Agents
5.2 Video Viewing Pattern and Traffic
5.2.1 Video Requests Traffic
5.2.2 Browser Behaviors for Video Playing
5.3 Course-Related Events
5.4 Summary
6 DISCUSSION
6.1 Comparative Analysis of Two Scientific Web Sites
6.2 Workload Characteristics Revisited
6.2.1 Success Rate
6.2.2 File Types
6.2.3 Mean Transfer Size
6.2.4 Distinct Requests
6.2.5 One Time Referencing
6.2.6 Size Distribution
6.2.7 Concentration of References
6.2.8 Wide Area Usage
6.2.9 Inter-Reference Times
6.3 Network Efficiency Analysis
6.3.1 File Transfer Methods
6.3.2 JavaScript Cache-Busting Solution
6.4 Summary
7 CONCLUSIONS
7.1 Thesis Summary
7.2 Scientific Web Site Characterization
7.2.1 The Aurora Site
7.2.2 The ISM Site
7.3 Conclusions
7.4 Future Work
References

List of Tables

3.1 A Sample of a Subset of the Bro HTTP Log
4.1 Statistical Characteristics of the Aurora Site (Jan 1/15 to Apr 29/15)
4.2 Top 10 Most Frequently Observed IP Addresses for Aurora Site
4.3 Top 10 Most Frequently Requested URLs for Aurora Site
4.4 Top 10 Most Frequently Requested File Types for Aurora Site
4.5 Prominent UCB and Alaska IPs in Aurora Web Site Traffic
5.1 Statistical Characteristics of the ISM Site (Jan 1/15 to Apr 29/15)
5.2 Top 10 Most Frequently Observed IP Addresses for ISM Site
5.3 Top 10 Most Frequently Requested URLs for ISM Site
5.4 HTTP Method Summary for ISM Site
5.5 HTTP Status Code Summary for ISM Site
5.6 Top 10 Most Popular User Agents for the ISM Site
5.7 Top 5 OS Versions
5.8 Top 5 Browser Versions
5.9 Browser Support for the Four Video Playing Implementations
6.1 Statistical Characteristics of Two Scientific Web Sites (Jan 1/15 to Apr 29/15)
6.2 HTTP Method and HTTP Status Code Percentage
6.3 Comparison of Workload Characteristics
6.4 Experimental Results for File Transfer/Synch Methods

List of Figures and Illustrations

2.1 The Five-Layer Protocol Stack and the Seven-Layer OSI Model
2.2 Illustration of HTTP Requests and Responses
3.1 Campus Network Structure with Traffic Monitor System
4.1 HTTP Request Count Per Day for Aurora Site
4.2 HTTP Requests Per Hour (Jan 1-3 and Jan 5, 2015)
4.3 Data Volume (GB) Per Day for Aurora Site
4.4 Number of Unique IP Addresses Daily from 2015-01-01 to 2015-04-29, Aurora Site
4.5 IP Geolocation Distribution, Top 10 Countries Sorted by Unique IPs
4.6 IP Geolocation Distribution, Top 10 Countries Sorted by Request Numbers
4.7 Frequency-Rank Profile for IP Addresses, Aurora Site
4.8 HTTP Methods in Aurora Traffic
4.9 Frequency-Rank Profile for URLs, Aurora Site
4.10 AuroraMAX Images from Yellowknife, 2015/03/10
4.11 HTTP Response Size Values for “/summary plots/slr-rt/yknf/recent 480p.jpg” File, from 2015-03-09 to 2015-03-15
4.12 HTTP Response Size Values for “/summary plots/slr-rt/yknf/recent 480p.jpg” File, on 2015-03-12
4.13 HTTP Response Size Distribution Histogram for “/summary plots/slr-rt/yknf/recent 480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis log-scale)
4.14 HTTP Response Size Distribution Cumulative Histogram for “/summary plots/slr-rt/yknf/recent 480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis proportion)
4.15 HTTP Response Size Distribution Histogram for “/summary plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis log-scale)
4.16 HTTP Response Size Distribution Cumulative Histogram for “/summary plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis proportion)
4.17 HTTP Requests and Data Volume Per Day for UCB, UA IPs
4.18 “robots.txt” Request Count Per Day for UCB1
4.19 “robots.txt” Request Count Per Hour on Four Selected Days
4.20 HTTP Request Count Per Day for AuroraMAX
4.21 Data Volume (GB) Per Day for AuroraMAX
4.22 IP Addresses Frequency-Rank Profile for AuroraMAX
5.1 HTTP Request Count Per Day for ISM Site
5.2 HTTP Requests Per Hour (Feb 23, Feb 24, Mar 23, Mar 24, Apr 20, and Apr 21, 2015)
5.3 Data Volume (GB) Per Day for ISM Site
5.4 IP Geolocation Distribution for Countries
5.5 IP Geolocation Distribution for Canada
5.6 IP Geolocation Distribution for USA
5.7 IP Geolocation Distribution for Alberta
5.8 Number of Daily Unique IP Addresses Visiting ISM Site, from Canada and Calgary (2015-01-01 to 2015-04-30)
5.9 Number of Daily Unique IP Addresses Visiting ISM Site, from USA and California (2015-01-01 to 2015-04-30)
5.10 Frequency-Rank Profile for IP Addresses, ISM Site
5.11 Frequency-Rank Profile for URLs, ISM Site
5.12 HTTP Methods in ISM Traffic
5.13 HTTP Status Code in ISM Traffic
5.14 HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, from 2015-02-18 to 2015-02-24
5.15 HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, on 2015-02-24
5.16 HTTP Response Size Distribution Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-5 GB, 50 bins, y-axis log-scale)
5.17 HTTP Response Size Distribution Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-10 GB, 50 bins, y-axis log-scale)
5.18 HTTP Response Size Values (Byte) Per Request Top 5 Count for “Lec8 - Feb 5, 2015.mov”
5.19 HTTP Response Size Values (Byte) Per Request Top 5 Count for “Lec3 - Jan 20, 2015.mov”
5.20 Histograms of Response Size Values (smaller than 1 MB) for “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov” Files
5.21 User Agent Names Distribution in the ISM Site
5.22 User Agent Browsers Distribution in the ISM Site
5.23 Operating System Distribution in the ISM Site
5.24 HTTP Requests Count Per Day for Video (requests)
5.25 Data Volume (GB) Per Day for Video (requests)
5.26 HTTP Transaction Durations Distribution Histogram (x-axis 0-60K s, 50 bins, y-axis log-scale)
5.27 HTTP Response Size Distribution Histogram (x-axis 0-12 GB, 50 bins, y-axis log-scale)
5.28 HTTP Transaction Duration (≤ 10s) CDF for Video Requests
5.29 HTTP Response Size (≤ 5MB) CDF for Video Requests
5.30 HTTP Requests Count Per Day for ASTR209, ASPH213, and ASPH503
5.31 Data Volume (GB) Per Day for ASTR209, ASPH213, and ASPH503
5.32 HTTP Requests and Data Volume Per Day for the Six Categories
6.1 HTTP Traffic Overview for the Aurora and ISM Sites
6.2 Frequency-Rank Profiles for the Aurora and ISM Sites
6.3 Illustration of File Transfer Methods Experiment
6.4 HTTP Response Size Results for the Two Methods

List of Acronyms

AJAX Asynchronous JavaScript and XML

ARPANET Advanced Research Projects Agency Network

CORS Cross-Origin Resource Sharing

CS Computer Science

CSA Canadian Space Agency

DASH Dynamic Adaptive Streaming over HTTP

DHCP Dynamic Host Configuration Protocol

DNS Domain Name System

DOM Document Object Model

FTP File Transfer Protocol

GIF Graphics Interchange Format

GUI Graphical User Interface

HTML HyperText Markup Language

HTTP HyperText Transfer Protocol

HTTPS HyperText Transfer Protocol Secure

IP Internet Protocol

ISM Inter-Stellar Medium

JPEG Joint Photographic Experts Group

LAN Local Area Network

MIME Multipurpose Internet Mail Extensions

MIT Massachusetts Institute of Technology

MPEG Moving Picture Experts Group

OCW OpenCourseWare

OS Operating System

OSI Open Systems Interconnection

PDF Portable Document Format

PPP Point-to-Point Protocol

P2P Peer-to-Peer

QoS Quality of Service

RIP Routing Information Protocol

RSH Remote Shell

RSS Rich Site Summary

RTEMP Real-Time Environmental Monitoring Platform

RTMP Real Time Messaging Protocol

RTSP Real Time Streaming Protocol

SNAP Stanford Network Analysis Project

SSH Secure Shell

SSL Secure Sockets Layer

TCP Transmission Control Protocol

THEMIS Time History of Events and Macroscopic Interactions during Substorms

TLS Transport Layer Security

UA University of Alaska

UCB University of California at Berkeley

UDP User Datagram Protocol

U of C University of Calgary

URI Uniform Resource Identifier

URL Uniform Resource Locator

WWW World Wide Web

Chapter 1

INTRODUCTION

The Internet has a growing influence on many aspects of our daily lives. As network speeds increase and new technologies emerge, the way people use the Internet gradually changes. For example, researchers and educators often share their results and teaching materials via the Internet. In turn, this practice increases external interactions with scientific Web sites.

This thesis presents measurements of the network traffic of two scientific Web sites at the University of Calgary. In particular, we study the usage patterns, identify inefficiencies in current information exchange methods, and suggest potential improvements.

1.1 Open Scientific Web Sites

With the rapid development of network technology and high-performance personal computers, scientific research and education organizations often share resources over the Internet, typically via the World Wide Web [27]. The scientific materials provide people around the world with opportunities to obtain scientific knowledge conveniently, efficiently, and impartially. In some cases, however, the sharing of large and popular materials can generate a significant volume of network traffic.

Furthermore, an emerging trend among research funding agencies and publicly-funded universities is toward open access publishing and open data repositories. These publicly-accessible data repositories enable not only the sharing of scientific data among researchers worldwide, but also a wide variety of “citizen science” projects and outreach activities.

In addition, many universities currently offer a variety of on-line educational resources to the public, including video-recorded lectures. For example, Stanford University provides large network dataset collections to the public via the Stanford Network Analysis Project (SNAP) [57], giving computer scientists, sociologists, and psychologists opportunities to test their methodologies as well as their conjectures. As another example, the worldwide OpenCourseWare (OCW) [20] site offers a series of free on-line courses recorded by the most celebrated universities, including the Massachusetts Institute of Technology (MIT) and Yale University.

The open scientific Web sites provide a rich set of resources for “citizen science”. There is a list of papers posted each year using the datasets in SNAP¹. The open datasets in re3data² also enable research in a variety of fields³. For OpenCourseWare, a report from MIT⁴ shows that MIT OCW was visited 2,385,654 times by 1,367,228 unique visitors in April 2015. However, these open scientific Web sites also have an effect on network traffic.

¹ https://snap.stanford.edu/papers.html
² Registry of Research Data Repositories, http://www.re3data.org/
³ http://www.re3data.org/about/
⁴ http://ocw.mit.edu/about/site-statistics/monthly-reports/MITOCW_DB_2015_04.pdf

The University of Calgary also provides open scientific resources. For instance, it hosts multiple Web sites that share scientific measurement data from remote sensors for atmospheric and environmental monitoring, as well as free on-line courses offered by university educators. These scientific Web sites have generated voluminous network traffic, which is the basis for our study.

1.2 Background Context

Given the pervasive applications of the World Wide Web (WWW), network resource usage is always a relevant problem. The Internet evolved from the ARPANET (Advanced Research Projects Agency Network) project funded by the US government in the 1970’s, and has become globally popular and powerful today. From modest beginnings in local area networks with a few workstations, the Internet has grown into a world-wide network system, with over 1 billion hosts⁵ and 3 billion users⁶.

⁵ Global Internet usage, https://en.wikipedia.org/wiki/Global_Internet_usage#Internet_hosts
⁶ World Internet Users and 2015 Population Stats, http://www.internetworldstats.com/stats.htm

On the modern Internet, advances in network speed and high-performance servers provide users with a high-quality Internet surfing experience. However, the tension between network traffic consumption and Quality of Service (QoS) remains an important issue. To economize on network bandwidth, numerous methods have been proposed, such as new protocols and caching architectures. Before design changes are made, however, network traffic measurement is a useful way to obtain a clear understanding of network bottlenecks, as a prerequisite for network optimization.

Network traffic measurement is an effective way to understand network activities. By analyzing how Web resources are retrieved, it provides an understanding of the data transfer traffic. This information is useful for identifying issues in network resource allocation, distribution, and bandwidth configuration. To optimize network resource allocation, numerous mathematical models, experimental methods, and auto-adjustment systems have been proposed and tested in academia and industry. Web site workload analysis is a well-known network traffic measurement technique for summarizing the characteristics of Web sites.

The workload pattern is usually determined by many factors, such as the demographics of the users, the type of resources on the site, and the services provided by the site. For example, sites like the Washington Post provide news to people around the world, particularly in the United States⁷. The Asahi Shimbun Web site is a Japanese press provider that also serves news to the public. However, the workload pattern of the Washington Post is quite different from Asahi’s⁸, based on the statistics from Alexa⁹. Therefore, sufficient background information and a general analysis of a site’s workload are very important.

⁷ How popular is washingtonpost.com?, http://www.alexa.com/siteinfo/washingtonpost.com
⁸ How popular is asahi.com?, http://www.alexa.com/siteinfo/asahi.com
⁹ http://www.alexa.com/comparison/washingtonpost.com#?sites=asahi.com

The network traffic workload at the University of Calgary (U of C) is mostly contributed by the university students, faculty, and staff. It is unsurprising that most inbound traffic involves popular sites like Google, Facebook, and YouTube. However, the summary results show that some scientific Web sites hosted internally by the university are extremely popular externally, and generated a huge volume of data traffic during the period under observation.

As stated earlier, sharing research and educational materials via the Internet is an effective approach used by many research funding agencies and publicly-funded universities. Research on the workload analysis of scientific Web sites has not received as much attention as that of the most popular sites. Researchers might assume that most scientific sites consume little network bandwidth, since they have a small influence on specific groups of users. Nevertheless, our analysis shows that some scientific sites at the university generated surprisingly large volumes of network traffic.

1.3 Motivation

The University of Calgary hosts many research Web sites and integrated education sites. After assessing all the inbound and outbound network traffic, we found that two scientific Web sites hosted by U of C generate a lot of traffic. Both of them rank among the top data volume generators during our four-month observation from January 1, 2015 to April 30, 2015. The bandwidth they consumed is on the same scale as that of the most popular sites like Google and Facebook, though well behind the streaming video sites YouTube and NetFlix. One of the sites is the Auroral Imaging Group (Aurora) site¹⁰, and the other is the Star Formation and Molecular Astrophysics (ISM) site¹¹. Both sites are hosted by the Department of Physics and Astronomy at the University of Calgary.

¹⁰ Auroral Imaging Group, http://aurora.phys.ucalgary.ca/
¹¹ Star Formation & Molecular Astrophysics at the U of C, http://ism.ucalgary.ca/

The Aurora site studies the Aurora Borealis (Northern Lights), a natural phenomenon caused by cosmic rays, solar wind, and magnetospheric plasma interacting with the upper atmosphere [13]. These auroral phenomena are primarily seen in high-latitude regions like northern Canada and the Arctic (and Antarctic) regions. Since the aurora are mainly observed at night in remote areas, researchers have deployed digital cameras across northern Canada as a ground-based observatory to automatically record auroral phenomena, with the data transferred to U of C servers via network connections. The Aurora site is a scientific Web site providing aurora data collected from these specially-designed cameras. We find that the traffic generated by the Aurora site is surprisingly large. Every day, about 1.5 million HTTP requests are sent to the Aurora server, retrieving 90 GB of data. This unusual discovery motivates us to analyze the workload characteristics of the Aurora site.

The ISM site is another interesting scientific site, which studies the Inter-Stellar Medium (i.e., the gas and dust between the stars) in astrophysics. The site is created and maintained by a U of C professor. Apart from a brief introduction to the Inter-Stellar Medium and some corresponding research, the ISM site mainly provides study materials for three courses taught by the professor, including one Astronomy course (ASTR 209) and two Astrophysics courses (ASPH 213, ASPH 503). Among these courses, ASTR 209 includes a series of recorded course videos. Similar to the Aurora site, the ISM site also generated voluminous traffic during our four-month observation, given its relatively small user community (400 U of C students registered in the course in winter 2015). Around 70 GB of data are retrieved from the ISM server per day. By analyzing the workload characteristics of the ISM site, we intend to understand the bandwidth usage and how the Web resources are being used.

The purpose of conducting the network traffic measurements is to improve network usage. The Aurora site and the ISM site are both constructed and maintained by technical staff with minimal computer science (CS) or networking background. As such, they may not deploy the sites effectively from the CS perspective when sharing information over the Internet. Considering the traffic volume engendered by these scientific Web sites, we are motivated to measure their network traffic, identify performance issues (if any), and propose potential remedies for the problems.

A second motivation for this thesis is a better understanding of modern scientific Web site traffic, compared to previously known workload patterns. As indicated earlier, network technologies have improved dramatically over time. Therefore, the workload characteristics may also have changed, along with user behaviors.

1.4 Objectives

The objectives of this thesis are as follows:

1) Measure network traffic at the University of Calgary to determine the characteristics of modern scientific Web sites.

2) Compare workloads of modern scientific Web sites with those of previously studied Web sites to identify similarities and differences.

3) Identify inefficiencies (if any) based on the traffic measurement results, and suggest improvements.

1.5 Contributions

This thesis has four primary contributions, listed as follows:

1) We collect and measure the network traffic of two distinct scientific Web sites at the University of Calgary, namely the Aurora site and the ISM site.

2) We identify the dominance of automated robot traffic in the Aurora site measurement, and compare its characteristics to the human-generated traffic.

3) We discover several inefficiencies in the data transfer methods of the Aurora site. We suggest potential improvements and present experimental results to evaluate their effectiveness.

4) We compare the Web usage characteristics of modern scientific Web sites with those from the prior literature.

Although the scope of this thesis focuses on network traffic measurement of two campus scientific Web sites, we expect this analysis and our suggestions for improvement to raise awareness of Web robot traffic and network inefficiency issues. Also, we believe our results will provide a foundation for future explorations of scientific data sharing systems.

1.6 Thesis Overview

This thesis is organized as follows:

1) Chapter 2 presents basic network knowledge including the TCP/IP and HTTP protocols, and introduces related work on network traffic measurement and workload characterization. It also discusses prior studies on Web crawling, video streaming, and scientific Web sites.

2) Chapter 3 introduces the data collection methodology and the Bro logging system.

3) Chapter 4 analyzes the network traffic of the Aurora site, discusses its workload characteristics, and identifies the robot traffic and inefficiency issues.

4) Chapter 5 analyzes the network traffic of the ISM site, discusses its workload characteristics, and studies the video streaming traffic and user viewing patterns.

5) Chapter 6 compares the workload characteristics of scientific Web sites and discusses potential solutions for the inefficiency issues.

6) Chapter 7 summarizes the results, presents conclusions, and suggests future work.

Chapter 2

BACKGROUND and RELATED WORK

In this chapter, we introduce fundamental background knowledge regarding the technologies underlying our research. An overview of this chapter is as follows:

1) Section 2.1 and Section 2.2 provide background information on computer networks, including the classical five-layer network architecture, TCP/IP protocols, and HTTP in the application layer.

2) Section 2.3 reviews the literature on network traffic measurement research.

3) Section 2.4 briefly introduces Web robots.

4) Section 2.5 discusses current video streaming techniques on the Internet.

5) Section 2.6 presents literature about scientific Web sites, focusing on network traffic analysis and workload characterization.

2.1 TCP/IP Model

The modern Internet had its early origins in regional academic networks. It evolved from the ARPANET project and has greatly changed over the decades. However, the fact that the Internet consists of infrastructure and protocols remains the same.

Within the overall Internet system, network hardware and software implementing the protocols are organized in layers, called a protocol stack. Figure 2.1 shows the five-layer protocol stack and the seven-layer OSI (Open Systems Interconnection) model [55]. Except for the Presentation and Session layers, which are specific to the OSI model, both stacks have Application, Transport, Network, Link, and Physical layers. Differences exist between these two protocol stack models in their services and protocols. Since most concepts are similar, we choose to introduce the five-layer protocol stack in this chapter; more information about the OSI model is available elsewhere [17, 55].

Figure 2.1: The Five-Layer Protocol Stack and the Seven-Layer OSI Model

The five-layer protocol stack is also referred to as the TCP/IP protocol suite, after two of its celebrated protocols: the Transmission Control Protocol (TCP) and the Internet Protocol (IP). We use the common term “TCP/IP model” to refer to the five-layer protocol stack in this thesis.

Each protocol belongs to only one layer. For example, TCP belongs to the Transport layer, IP is in the Network layer, and the HyperText Transfer Protocol (HTTP) is in the Application layer. In each layer, actions are performed to provide services to that layer or to the adjacent layer above, by utilizing the services within that layer or from the adjacent layer below. Protocol layers are implemented in software, in hardware, or in a combination of the two. We take a bottom-up approach to introducing the layers.

2.1.1 Physical Layer

The physical layer is at the bottom of the TCP/IP model. It provides the means of transmitting raw bits to the connected destination node, and determines the parameters of the communication channel [55]. It also provides the interface to protocols used by hardware transmission media. When a network connection is established, the way bits are moved is determined by the actual transmission medium and the corresponding protocols.

2.1.2 Link Layer

The Link layer is designed to move link-layer frames between two different nodes in the route [55]. It provides this service to the Network layer for routing a datagram via a series of routers, by invoking the bit-moving service provided by the Physical layer. The Link layer services highly depend on the link-layer protocols available on the link. Protocols such as Ethernet, WiFi, and Point-to-Point Protocol (PPP) belong to the Link layer. When a datagram from the Network layer traverses the links, it is passed down to the Link layer, and then transferred to the destination. During the process, it may use different link-layer protocols at different links. Finally, the datagram is passed up to the Network layer in the destination node.

2.1.3 Network Layer

The Network layer moves data packets (datagrams) between different hosts [55]. It provides services to the Transport layer, and uses the services provided by the Link layer. Whenever a Transport layer segment and a destination address are passed to the Network layer, the Internet Protocol (IP) is invoked to send the segment to the specified destination.

IP protocols are the primary protocols in the Network layer. Internet Protocol Version 4 (IPv4) is the dominant protocol [16]. The main function of IPv4 is to route datagrams from a source host to a destination host, based on a 32-bit address (IP address). IPv4 only provides best-effort delivery, without guarantees that the datagrams are delivered. Each host in the network layer has a unique IP address. Our detailed analysis of the network traffic in this thesis is based on the IP information extracted from this layer.

2.1.4 Transport Layer

The Transport layer is responsible for moving transport-layer packets (segments) from one end host to another. It achieves the data transfer by establishing a logical data channel between the two end hosts.

The two primary protocols in the Transport layer are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) [55]. TCP provides connection-oriented service with reliability. For example, TCP guarantees reliable, ordered, and error-checked delivery when transferring application-layer messages [17]. It also applies a sliding window flow control protocol to match speeds between sender and receiver, and congestion control mechanisms to avoid congestive collapse (i.e., extremely poor network performance). UDP provides connectionless service with no reliability, except for (weak) checksums for data integrity. TCP is widely used by many popular Internet applications, such as HTTP, the File Transfer Protocol (FTP), and Secure Shell (SSH). UDP is utilized by Internet applications caring more about responsiveness than reliability, such as the Domain Name System (DNS), the Routing Information Protocol (RIP), and some audio or video streaming applications.

Our work studies HTTP traffic, which commonly utilizes TCP since it presumes a reliable transport-layer protocol [8] (note that unreliable UDP can also be used by HTTP). Therefore, the traffic we study in this thesis is almost always generated by TCP connections.

2.1.5 Application Layer

The Application layer is the topmost layer in the TCP/IP model. It contains many important protocols, such as HTTP and FTP [55]. These protocols are used to exchange application-layer messages between hosts. The Application layer protocols utilize the logical data transfer channels established by underlying transport-layer protocols to deliver messages. Our traffic analysis focuses on logs of HTTP activities in the Application layer.

2.2 HTTP and the Web

As introduced above, the TCP/IP model provides the underlying mechanisms to support Internet applications. Upon this network foundation, the World Wide Web (WWW) [33] emerged as a means to exchange data content easily on the Internet. The Web has had a transformative role in enriching the interactions on the Internet. Its popularity has helped foster the merging of separate data networks, leading to the formation of the global data network that we know today.

Figure 2.2: Illustration of HTTP Requests and Responses

The HyperText Transfer Protocol (HTTP) in the Application layer is the foundation of the WWW. It has two primary versions: HTTP/1.0 [6] and HTTP/1.1 [8]. HTTP/1.1 is a revision of HTTP/1.0, with persistent connections (and other features) added. Currently, design and implementation work is being done on a new version, HTTP/2.

HTTP involves two programs: a client-side program such as a Web browser, and a server-side program such as a Web server. HTTP defines how messages are exchanged between the client and server. The client retrieves the objects (e.g., HTML files, image files, video files, and JavaScript files) in a Web page from the server side, through a recognizable Uniform Resource Locator (URL). For example, http://www.abc.com/def/ghi.pdf is a valid URL for fetching the “ghi.pdf” PDF file from the “/def/” directory on the Web server host “www.abc.com”, using HTTP.

HTTP invokes TCP to establish connections within the Transport layer. The client and server exchange data by accessing the socket interface of the TCP connection. The reliability of the TCP connection guarantees that messages are exchanged successfully between the client and the server.

In some deployments, an HTTP proxy server is used as an intermediary between the client and the origin Web server. A proxy server can reduce the response time for client requests, as well as save bandwidth between the client and server, by temporarily storing and serving recently requested objects in a Web cache. In this role, a proxy acts as the Web server for the client when it has a copy of the requested object. Conversely, it acts as a client to request an object from the origin Web server when the object is not stored locally. Figure 2.2 shows a generic illustration of HTTP requests and responses involving a proxy server.

2.2.1 Persistent Connections

Persistent HTTP connections and non-persistent HTTP connections are two different ways for clients to interact with Web servers. The main difference is whether the HTTP connection reuses the existing TCP connection [55]. The non-persistent HTTP approach establishes a new TCP connection for each request-response transaction. However, persistent HTTP connections allow multiple messages to be exchanged via the same TCP channel, in series.

The primary advantage of persistent connections is the reduction of the request latency.

Since the total time of an HTTP request-response transaction consists of the TCP connection initialization time, request delivery time, and response delivery time, persistent connections save the time used to re-establish the TCP connections (three-way handshake).

HTTP/1.1 makes persistent connections the default behavior for all HTTP connections, while HTTP/1.0 uses non-persistent connections. Technically, HTTP/1.0 can support persistent connections by adding “Connection: Keep-Alive” to the message header, and this is compatible with HTTP/1.1 servers [7]. However, there are many restrictions when implementing this. For example, clients cannot establish Keep-Alive connections with HTTP/1.0 proxy servers. HTTP/1.1 clients and servers can be configured to use non-persistent connections for each request-response transaction, if resource usage is a concern. There are many other configuration options, such as adjusting the maximum session time and the maximum number of concurrent persistent connections.

HTTP/1.1 also supports the HTTP pipelining technique, which allows multiple HTTP requests to be sent over a single TCP connection before receiving the corresponding responses.

HTTP/1.0 doesn’t support this feature.

2.2.2 HTTP Messages

HTTP messages are the data sent by Web clients and servers over HTTP connections. An HTTP exchange consists of a request message and a response message, both of which have a defined format.

We use an example in Listing 2.1 to introduce the HTTP messages.

Listing 2.1: HTTP Request and Response Message Example

Request message:
GET / HTTP/1.1
User-Agent: curl/7.37.1
Host: www.ucalgary.ca
Accept: */*

Response message:
HTTP/1.1 200 OK
Date: Thu, 23 Jul 2015 22:27:04 GMT
Server: Apache/2.2.15 (Red Hat)
Last-Modified: Thu, 23 Jul 2015 19:06:16 GMT
ETag: "496582-a7c7-51b8f94f5dd9a"
Accept-Ranges: bytes
Content-Length: 42951
Connection: close
Content-Type: text/html; charset=UTF-8

data is attached here

HTTP Request Message

The first line in the HTTP request message is called the request line. It contains the HTTP method field, the URL field, and the HTTP version field. In this example, the client would like to use the GET method to fetch the root directory via HTTP/1.1 (the server can choose whether to abide by these; e.g., it could respond with HTTP/1.0). The vast majority of HTTP requests use the GET method.

The following lines are request header lines. They inform the server of basic background information about the client, as well as some parameters of the request. In the example, the client tells the server that the HTTP message is generated with user agent “curl/7.37.1” [15], which is a command-line tool used for transferring files over the Internet. Usually, the user agent field consists of the name and version information of the browser and operating system that the client is using.

The Host field indicates where the requested object is located. Although the TCP connection between the specific server and client is already established before transferring HTTP messages, it is necessary to keep the Host field since: 1. there may exist a proxy server as an intermediary agent talking to the client; and 2. a single Web server may host numerous different Web sites.

In the request message, the client can specify detailed parameters when requesting the object. For example, the “Accept-Language” field indicates the client’s language preference, the “Accept-Encoding” field indicates the acceptable encodings, the “Connection” field indicates the client’s intention to keep this connection alive or not, and the “If-Modified-Since” field means the client only wants a copy if the file has been changed after a certain time point (though the server may choose to send the full response anyway). There is no need for the request message to include all the header line fields; a subset of the fields is acceptable.
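The sketch below, again with Python’s urllib and a hypothetical URL, shows how a client attaches such header fields to a request; all header values here are illustrative.

from urllib.request import Request, urlopen

# Attach optional request header fields; the values are illustrative only.
req = Request("http://www.example.com/page.html", headers={
    "Accept-Language": "en-CA,en;q=0.8",  # preferred response language
    "Accept-Encoding": "gzip",            # acceptable content encodings
    "Connection": "keep-alive",           # intention to keep the connection alive
})
response = urlopen(req)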

HTTP Response Message

Similar to the request message, an HTTP response message starts with a status line, includes header lines in the middle, and may end with an entity body. The status line contains the HTTP version the server is using for this transaction, the HTTP response status code, and its corresponding status message. In the example, the server tells the client that it is using HTTP/1.1 for this transaction and that the requested object is successfully returned.

In the response header lines, the “Date” field indicates when the response was generated and sent, the “Server” field indicates the server name and version, the “Last-Modified” and “ETag” fields are auxiliary information that the client (or proxy) can use in the future to check whether the requested object has been modified (if not, the client has no need to download the object again), “Accept-Ranges” indicates whether the server accepts range (partial transfer) requests for an object, “Content-Length” and “Content-Type” indicate the length and MIME (Multipurpose Internet Mail Extensions) type of a requested object, and “Connection” informs the client whether the server will keep this TCP connection active or not. Like the request message, the response message usually contains only a subset of the fields. A list of the available HTTP request and response headers is shown in [19].

The entity body is the content of the requested object returned by the server. If the client is a Web browser and the requested object is a part of the Web page (e.g., HTML, image files, CSS files, etc.), then the browser can directly render the object onto the Web page via its layout engine and present it to the user. Since the way of rendering the page varies for different browsers, the actual Web page may not look the same when viewed on different browsers.

HTTP Request Methods

The method in the HTTP request defines the action to be performed on the object identified by the request URL [11]. In HTTP/1.0, there are only three methods: GET, POST, and HEAD. Five new methods were added in HTTP/1.1: OPTIONS, PUT, DELETE, TRACE, and CONNECT. A client can use any of these methods to interact with the server, and a server can be configured to support or reject each of them.

The GET method is the most prevalent on the Internet. It is used to retrieve the content of the object at a specific request URL. The conditional GET is a GET request with conditional statements in the request message header lines. The conditional statements include the If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, and If-Range header fields [11]. Once the conditional GET request reaches the server, the server can choose whether to return the updated object, depending on the conditions. The advantage of using conditional GET is to reduce the network usage of superfluous data transfers. The partial GET is a GET request with a Range header field in the request message. It only requests part of the entire object, which may reduce network usage by avoiding re-transferring data that is already at the client. The partial GET is very useful when dealing with large objects, such as videos.
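The following minimal Python sketch illustrates both variants against a hypothetical server; note that urllib surfaces a 304 reply as an HTTPError, while a Range request normally yields a 206 Partial Content response.

from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Conditional GET: only fetch the object if it changed after this time point.
req = Request("http://www.example.com/latest.jpg",
              headers={"If-Modified-Since": "Thu, 23 Jul 2015 19:06:16 GMT"})
try:
    body = urlopen(req).read()   # 200 OK: modified object returned
except HTTPError as e:
    if e.code == 304:            # 304 Not Modified: no entity body sent
        body = None

# Partial GET: request only the first 1024 bytes of a large object.
req = Request("http://www.example.com/lecture.mov",
              headers={"Range": "bytes=0-1023"})
resp = urlopen(req)              # expect a "206 Partial Content" response
chunk = resp.read()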

The HEAD method is identical to the GET method except that the response does not contain the entity body. It is often used for retrieving meta-information about the requested object, such as how large it is, or how old. Based on the meta-information, the client may choose whether to use GET to retrieve the object. Similar to the conditional GET, appropriate use of the HEAD method may significantly reduce network usage.
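A minimal sketch of this pattern, assuming a hypothetical host and object: the client issues a HEAD request, inspects the Content-Length meta-information, and only then decides whether to GET the object.

import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("HEAD", "/lecture.mov")
resp = conn.getresponse()
resp.read()                                    # HEAD responses carry no body
size = int(resp.getheader("Content-Length", "0"))

if size < 100 * 1024 * 1024:                   # fetch only if under 100 MB
    conn.request("GET", "/lecture.mov")
    data = conn.getresponse().read()
conn.close()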

The remaining HTTP methods are rarely seen in our network traffic measurements. Therefore, we briefly cover only selected methods here.

The POST method is designed for sending the enclosed entity to a specific resource at the server, and requiring the resource to handle the request. The Web server determines the actions to be performed on the transferred entity.

The PUT method allows the client to enclose an additional entity in the request, which can update the resource on the server if the resource exists, or create it if needed.

The DELETE method requests that the specified object on the server be deleted (this will only be done if the user has the appropriate permission).

The OPTIONS method asks the server to return the available HTTP methods for an object.

HTTP Response Status Codes

The HTTP response status code is a 3-digit integer that summarizes the server’s action based on the request. There is a short textual description following the status code in the response status line.

We introduce several selected status codes as follows:

200 OK is the standard response status code for successful HTTP requests. It is the most common status code for most Web sites.

206 Partial Content indicates that the server successfully returned the partial content for a request with a Range header field.

304 Not Modified indicates that the object on the server has not been modified, based on the conditional GET request sent from the client. Therefore, responses with 304 don’t contain entity bodies.

403 Forbidden informs the client that the request is understood by the server; however, the server refuses to fulfill it.

404 Not Found indicates that the requested object is not found on the server.

503 Service Unavailable indicates that the server is currently unavailable (caused by temporary overload or maintenance).

Referer

The HTTP referer¹ is a request header field. It informs the server where the request originated. For example, consider a client viewing a Web page on server A that has a hyperlink to an image on server B. When the client sends a request to server B to retrieve the image object, the referer field in the request tells server B that this request was referred by server A.

The referer field provides servers with information about where visitors come from, which is often useful in network traffic analysis. However, for security and privacy concerns, clients sometimes choose to obfuscate the referer field in their HTTP requests. Furthermore, if the user types in a URL or visits sites from bookmarks in browsers, the referer field will be blank.

¹ A misspelling of “referrer” originally, https://en.wikipedia.org/wiki/HTTP_referer

2.2.3 HTTP Secure

HTTP Secure (HTTPS) is a protocol designed for encrypted HTTP data transfers over the Internet. In HTTP, clients and servers communicate “in the clear” directly over TCP connections. In HTTPS, connections are built on top of cryptographic protocols like Transport Layer Security (TLS) and Secure Sockets Layer (SSL), which usually work at the session and presentation layers in the seven-layer OSI model, or at the application layer in the TCP/IP model [25]. The cryptographic protocols encrypt messages before sending, and decrypt messages after receiving, thus improving the security and privacy of transferred messages.

The sites selected in this thesis are all HTTP sites, and we only measure their HTTP traffic.
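For completeness, a minimal Python sketch of an HTTPS transaction: the only client-visible difference from plain HTTP is that the connection is wrapped in TLS before any HTTP messages are exchanged. The host is illustrative; the sites measured in this thesis use plain HTTP.

import http.client

# HTTPSConnection performs the TLS handshake before HTTP messages flow.
conn = http.client.HTTPSConnection("www.example.com")
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.reason)
conn.close()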

2.3 Network Traffic Measurement

Network traffic measurement provides a way to understand and manage the usage of the Internet today. There is a wealth of literature on network traffic measurement and Web workload characterization dating from the early 1990’s to the present.

In 1992, Vern Paxson used measurements to evaluate analytic models of TCP connections [64]. It was one of the earliest network traffic characterization works during the growth of the Internet. Based on traces collected from 7 different sites, Paxson analyzed the network characteristics of TELNET, NNTP, SMTP, and FTP connections, and compared several analytic models as well as empirical models. He found that analytic models are as good as the empirical models in general, and that the connection characteristics differ across sites, and across different periods at the same site.

In 1997, Thompson et al. [73] observed the traffic volume, flow volume, flow duration, and packet sizes in a wide area network. They found interesting results: for example, the measured traffic shows diurnal trends, it decreases on weekends, and TCP dominates IP traffic.

In 1995, Crovella et al. [40] discovered self-similarity in World Wide Web traffic. They also found that the Web transmission times and the silent times (inactive times of a client) follow heavy-tailed distributions. They attribute these to the heavy-tailed distribution of Web file sizes, as well as the influence of network users.

Sedayao et al. [70] analyzed WWW traffic patterns. Their work covers the WWW traffic characteristics from a fundamental perspective, with concerns about inefficiency issues, and proposed solutions. They mentioned that the most popular file type in WWW traffic at the time was the Graphics Interchange Format (GIF), followed by Moving Picture Experts Group (MPEG) files (in terms of references of bytes).

Cunha et al. [41] characterized Web client activity. They collected data by deploying modified Web browsers in terminal rooms on the Boston University campus. During their four-month observation, they found power-law distributions for document sizes and document popularity. They showed that this information is useful for designing caching strategies.

In 1996, Arlitt and Williamson [30, 31] analyzed the Web traffic of six different servers. They identified ten common characteristics in the Web server workloads. For example, they found that successful requests are the most common, 90% of the transferred documents are HTML and image files, and file sizes and transfer sizes both follow heavy-tailed distributions. Based on these results, they also proposed effective suggestions for improving caching systems. We will revisit their work in Chapter 6 by comparing the workload characteristics of modern scientific Web sites to their results.

In 2000, Mahanti and Williamson [61] analyzed the workload of three different Web proxy servers. They found results similar to [31] (e.g., HTML and image files account for 95% of all the requests), as well as distinct results, such as that Web document popularity does not strictly match Zipf’s Law (it does in [31]). This work confirmed Arlitt’s work to some extent, and highlights how workload characteristics vary across Web sites and time periods.

In 1999, Breslau et al. [35] were among the first to focus on the Zipf-like distribution for file popularities. They studied the Zipf-like distribution in Web page requests, and the reference probability of the documents. They also presented a simple model for understanding cache performance.

As many new trends have emerged on the Internet, more recent network traffic studies have provided insights into general network measurements as well as individual sites.

In 2010, Callahan et al. studied Web workload characteristics from a longitudinal view [37]. Their work includes HTTP transaction characterization, user behavior, and server distributions. They found that most HTTP transactions are GET requests and that Zipf-like distributions are present in per-object request statistics, and they identified effects from browser caches and content distribution networks.

In 2011, Ihm et al. conducted measurements on modern Web traffic over five years of observations of a content distribution network vendor [53]. Their work covers high-level characteristics such as the overall connection speed and maximum concurrent connections, as well as page-level characteristics analyzed with their new page detection algorithm. They found increasing use of Flash video, AJAX (Asynchronous JavaScript and XML), and client-side interactions after pages initially load.

There are numerous papers about the traffic of Web 2.0 sites, in which users can interact and collaborate with each other instead of merely viewing the content provided by the Web site. Butkiewicz et al. [36] studied the complexity of today’s Web pages. Schneider et al. [68] presented a study of AJAX traffic by analyzing popular Web 2.0 sites, such as Google Maps, and social network Web sites. Lin et al. [59] also studied the on-line map application traffic on Web 2.0 sites.

Web 2.0 has evolved to encompass a large group of sites, including video Web sites like YouTube, NetFlix, and Vimeo, and on-line social networks like Facebook, Flickr, and Twitter. Cha et al. [38, 39] studied the traffic of several user-generated content video Web sites. Gill et al. [48], Zink et al. [77], and Ameigeiras et al. [29] all studied YouTube. Several papers [32, 50, 51, 62, 69] studied on-line social networks from many perspectives, including network usage, user behaviors, user content generation patterns, and user relationship connections.

Other research has involved network traffic measurement of e-commerce Web sites [66, 75], Web robot activities [45, 76], Peer-to-Peer (P2P) systems, and mobile networks.

2.4 Web Robots

A Web robot is a software program that automatically launches a series of HTTP transactions [49]. The main application of the Web robot is to crawl and extract useful information from Web sites, by moving from site to site and analyzing the browsed data. Therefore, this kind of Web robot is also called a “Web crawler”. For example, Googlebot² is a Web crawling robot operated by the Google search engine. The main function of Googlebot is to discover new and updated Web pages for the Google index.

Due to the similarities of Web page structures, Web robots can quickly fetch information from the Internet by performing repetitive and redundant tasks. Usually, users can utilize software like Wget³ to retrieve filtered content from the Web, or create more flexible scripts with programming libraries like the urllib⁴ module in Python.

Technically, all Web robots have the same core approach. There is a list of URLs known as root pages for robots to start with initially. Robots (as clients) can generate HTTP requests to get the content of those pages from Web servers. Then robots extract the links and useful information from the page content, and identify potential URLs to crawl in the next step. Robots repeat the previous procedures until a termination condition is satisfied. For example, a robot may terminate when all the links in a specific page are visited, or when it reaches the maximum link depth from the root page.

² Googlebot, https://support.google.com/webmasters/answer/182072?hl=en
³ GNU Wget, https://www.gnu.org/software/wget/
⁴ urllib, Python module, https://docs.python.org/2/library/urllib.html
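A minimal Python sketch of this core crawling loop, with a hypothetical root URL: it maintains a frontier of URLs, fetches each page, extracts links, and terminates at a maximum link depth. A real robot would add politeness delays, robust HTML parsing, and the robots.txt check shown in the next sketch.

import re
from urllib.request import urlopen
from urllib.parse import urljoin

def crawl(root, max_depth=2):
    frontier = [(root, 0)]            # (URL, link depth from the root page)
    visited = set()
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        # Naive link extraction; real robots use a proper HTML parser.
        for link in re.findall(r'href="([^"]+)"', html):
            frontier.append((urljoin(url, link), depth + 1))
    return visited

# visited = crawl("http://www.example.com/")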

Web robots may not be welcomed by some Web servers. There is a Robots Exclusion Standard [22] widely used by Web servers to communicate with robots. A Web server can provide a file named “robots.txt” in the root directory, indicating which parts of the site are not allowed to be crawled. If a Web robot follows the standard, it first generates a GET request to fetch the “robots.txt” file, then modifies its further operations according to the rules. Most robot software provides configuration options for users to determine whether to follow “robots.txt” or not. For example, Wget can disable courteous operations by adding “-e robots=off” to the command. Furthermore, some robots can mislead servers by faking the HTTP request header content, especially the user agent field.
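A courteous robot can honour the standard with Python’s built-in urllib.robotparser module, as in this minimal sketch; the host and robot name are hypothetical.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # issues the GET for robots.txt

# Check a candidate URL against the rules before crawling it.
allowed = rp.can_fetch("MyResearchBot", "http://www.example.com/data/")
print("crawl permitted" if allowed else "crawl disallowed by robots.txt")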

There are numerous papers about the network traffic of Web robots. Before undertaking a full analysis of the Web traffic, a preliminary analysis for detecting and identifying robot traffic is recommended. Several papers [34, 44, 71] studied Web robot detection techniques, including statistical analysis as well as data mining approaches. Another two papers [54, 72] analyzed thousands or millions of different Web sites, and provided surveys about the usage of the robots exclusion standard, namely “robots.txt”. They found that 46.02% of newspaper Web sites and 45.93% of USA university Web sites adopted the robots exclusion standard. Two studies [42, 43] discussed Web robot behaviors based on the analysis of Web server access logs.

2.5 Video Streaming

As mentioned in Chapter 1, one of the sites we study provides video lectures. We present a brief introduction to Internet video streaming techniques in this section.

Video streaming has changed a lot over the years, with the upgrading of network speeds and the growing demands from network users. Today, YouTube, the largest video sharing Web site, allows users to view user-generated videos smoothly with qualities ranging from 240p (426 × 240, progressive scan) to 1080p (1,920 × 1,080, progressive scan). NetFlix and Hulu provide hundreds of thousands of movies or TV shows to users with high quality video streams. Many sports or e-sports sites can even provide live broadcast video streams to users over the Internet. This technology enriches our daily lives.

Typically, there are three different ways to watch a video on-line. Progressive download is a way to transfer video files from a server to a client via HTTP connections [3]. When a user plays a video embedded in a Web page, the browser starts to download a copy of the video file from the Web server, and the user can only access the video content that has already been downloaded. There is usually a progress bar indicating how much video content has been loaded or played. Users can replay or fast-forward within the loaded portion; however, they cannot view the parts not yet downloaded by the browser. Although progressive download is easy to implement, the drawbacks of this technique are obvious:

1) Users have to wait until the video is loaded (from the beginning to where users want to watch) even if they are interested in only a small part of the whole video.

2) The technique wastes bandwidth if a user downloads a large part of the video and then exits without watching the whole video.

3) If the video bit rate is high, and the capacity between the server and the client is lower than the bit rate of the video, then the user occasionally has to stop and wait for the buffer to fill. The user has no means to change the quality of the video.

Traditional streaming [4] is another method for video streaming, which utilizes special streaming servers to deliver videos. Special protocols like the Real Time Streaming Protocol (RTSP) and the Real Time Messaging Protocol (RTMP) are adopted to divide the original video file into small chunks, and then transfer those chunks via UDP or TCP connections. In this approach, the video quality is still immutable, and this streaming service cannot be implemented on normal Web servers.

Adaptive streaming [5] is designed to solve the video quality issues above. By deploying several video files encoded at different qualities for the same video content, adaptive streaming can provide video streaming service at different qualities to users according to their network conditions. The video files are also chunked into small fragments for ease of delivery. There are four primary implementations of adaptive streaming: Dynamic Adaptive Streaming over HTTP (MPEG-DASH), Adobe Dynamic Streaming for Flash, Apple HTTP Adaptive Streaming, and Microsoft Smooth Streaming [12]. MPEG-DASH is the only international standard, and is widely supported by most HTTP servers. The advantage of adaptive streaming is that it enables users to watch videos smoothly, instead of waiting to buffer high-quality videos or tolerating low-quality videos. The disadvantage is the computation and storage resources used for compressing the videos at various bit rates and resolutions beforehand. Since user experience is more important than this additional cost, adaptive streaming is prevalently adopted by most of today's video sites, like YouTube, NetFlix, and Vimeo.
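The client-side core of adaptive streaming can be illustrated with a short Python sketch: before fetching the next chunk, pick the highest encoded bit rate that the measured throughput can sustain (the bit-rate ladder and safety margin below are illustrative assumptions, not values from any particular player):

    # Hypothetical bit-rate ladder (bits per second): one entry per
    # pre-encoded rendition of the same video, e.g., 240p up to 1080p.
    LADDER_BPS = [400_000, 1_000_000, 2_500_000, 5_000_000]

    def choose_bitrate(throughput_bps, margin=0.8):
        """Pick the highest rendition the measured throughput can sustain.

        margin < 1 leaves headroom so playback does not stall when the
        throughput fluctuates; both values here are illustrative only.
        """
        budget = throughput_bps * margin
        usable = [rate for rate in LADDER_BPS if rate <= budget]
        return usable[-1] if usable else LADDER_BPS[0]

    # A client measuring ~4 Mbps would fetch the 2.5 Mbps chunks:
    # choose_bitrate(4_000_000) -> 2500000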

2.6 Scientific Web Sites

There is not much literature about network traffic measurement and workload characterization of scientific Web sites.

Eldin et al. [46] studied the top 500 popular pages in Wikimedia, which is a Web site promoting free educational content. They analyzed time-series request counts and discovered the collateral load phenomenon, in which the links embedded in a popular page (e.g., Michael Jackson's page when he died) generate more traffic than the page itself. They also suggested that simple prediction algorithms are able to predict workload in Wikimedia.

Urdaneta et al. [74] studied a sample of the network traffic for Wikipedia. Their research includes analysis of user requests, read and save operations, flash crowds, and non-existent page requests. They also suggested a decentralized and collaborative setting to host Wikipedia for improving the network performance.

Morais et al. [63] studied the user behavior in a citizen science project, though their focus was on the user interaction aspects, rather than Internet traffic workloads. Li et al. [58] studied the workload characterization of CiteSeer, which is a digital library for computer science literature.

The closest example in the prior literature is Faber et al. [47]. They studied the Web traffic of four different data sets. They compared the workload characteristics found in [30] and in their data sets. However, their analyses are somewhat limited due to missing HTTP header fields in their logs, and a relatively short observation period of 1-2 months. Furthermore, their data was collected over a decade ago, and scientific Web workloads may have changed since then.

2.7 Summary

This chapter introduced basic background knowledge on computer networks. The application protocol HTTP and its underlying TCP/IP protocols were discussed. Then, we presented a series of related studies involving network traffic measurement, Web robots, video streaming techniques, and scientific Web site analysis.

This thesis studies the workload characteristics of two scientific Web sites, by analyzing the HTTP transaction logs, especially the HTTP header fields mentioned in this chapter.

The study focuses on the user behaviors and network usage of the two scientific sites.

We present our methodology in the next chapter.

Chapter 3

METHODOLOGY

In this chapter, we explain how the network traffic in our study was monitored. To be specific, we introduce the overall logging system, including the deployment of the hardware infrastructure, and the log generation framework. For analysis purposes, we then describe the pre-processing methods applied to the logs. The majority of the infrastructure deployment and data collection work was done by Michel Laterman and Martin Arlitt, with support from University of Calgary Information Technologies (UCIT) staff.

3.1 Endace DAG Card Deployment

The incoming and outgoing network traffic passes through the edge routers of the University of Calgary network. By mirroring these traffic flows, it is feasible to observe all of the packet-level traffic between the campus and the Internet from a monitor server.

Figure 3.1: Campus Network Structure with Traffic Monitor System

Figure 3.1 shows the structure of the campus network with our traffic monitor included. All incoming and outgoing traffic is mirrored to the monitor, and then summaries of the actual traffic are transferred to a storage server every night.

The monitor is a Dell server equipped with two Intel Xeon E5-2690 CPUs (32 logical cores @ 2.9 GHz), 64 GB RAM, and 5.5 TB hard disk storage, running the CentOS 6.6 x64 operating system. Since the hard disk is not large enough to store summary logs of the network traffic for a long period of time (around 50 GB of compressed log files are generated every day), the summary logs are transferred to a storage server early every morning (during the off-peak times).

The monitor uses an Endace DAG 8.1SX card for traffic capture and filtering. The Endace DAG card is designed for 10 Gbps Ethernet, and uses a series of programmable hardware-based functions to improve packet processing performance. A full list of the Endace DAG 8.1SX's specifications is available elsewhere [2]. Typical overall daily usage of the U of C network during the collection period was 2 Gbps of inbound TCP/IP traffic, and 1 Gbps outbound.

The primary function of the Endace DAG data capture card is to split the incoming stream for the Bro logging system. The stream from the edge router is split into two streams, providing 24 sub-streams to the Bro system.

3.2 Bro Logging System

The Bro network security monitor [65] is an open-source network analysis framework. It provides a generalized platform for network performance measurement and security monitoring. The Bro logging system is able to monitor all network activities from a high-level viewpoint and provides detailed transaction information. Specifically, Bro produces logs covering all transport-layer connections appearing on the network backbone, and many application-layer transcripts, such as HTTP transaction headers, DNS requests and replies, SSL certificates, etc.

Bro is configured to process the traffic streams generated by the Endace DAG card in the monitor. With the incoming stream split by the Endace DAG card, Bro's event engine first transforms the sub-streams into higher-level events, which describe network activities in objective terms. For example, a traffic stream captured by the DAG card is determined by Bro to be an HTTP request and converted into a Bro event containing the request information, such as the HTTP version and IP addresses. Then Bro uses its script interpreter to convert the event into logs, and to notify the Bro user of abnormal activities (e.g., malicious attacks) if corresponding policies are in place. This study focuses on Web traffic analysis, not on detecting and preventing intrusions from the external network.

Table 3.1: A Sample of a Subset of the Bro HTTP Log

  Fields  ts    id.orig_h  id.resp_h  method  host    uri     referer  user_agent
  Types   time  IP addr    IP addr    string  string  string  string   string
  1       ts1   a.b.c.d    e.f.g.h    GET     uc.ca   /1.jpg  abc.ca   Mozilla/5.0
  2       ts2   i.j.k.l    m.n.o.p    GET     uc.ca   /2.png  def.com  Chrome/35.0

Once the logging system is activated, Bro collects and generates logs hourly. The two scientific sites studied in our work are both HTTP servers. Therefore, we concentrate on the HTTP traffic measurements. The HTTP transaction logs contain detailed information about the requests and responses, including request start and end times, response start and end times, host name, request method, referer, user agent, response status code, etc.

Table 3.1 shows a sample of the HTTP log generated by Bro (including only a subset of the fields). The “types” are specific data formats defined by the Bro system, which can also be used in Bro scripts. Note that there are 37 fields in our original HTTP logs; we present selected fields with fabricated data for simplicity. Our analyses primarily rely on the following fields:

• ts (time) is the request start time-stamp, in Linux epoch time format.

• id.orig_h (addr) is the request IP address, in 32-bit (four-byte) format.

• id.resp_h (addr) is the response IP address.

• method (string) is the HTTP method in the request.

• host (string) is the name in the Host request header.

• uri (string) is the requested resource name in that specific host.

• referer (string) is the value in the referer request header.

• user_agent (string) indicates the user agent used by the client.

• request_body_len (count) is the size of the request.

• response_body_len (count) is the size of the response.

• status code (count) is the response status code.

• status msg (string) is the response status message.

• resp_mime_types (vector[string]) indicates the MIME type of the response.

• req_start, req_end, res_start, and res_end (time) are the request/response start/end time-stamps.

As introduced above, the monitor periodically transfers the log files to the storage server.

We study the Bro logs collected from January 1, 2015 to April 30, 2015 in this thesis.

However, there were several disruptions in the logs during our observation period, primarily due to events such as power failures, network disconnections, and Bro system crashes. The outage periods were:

• January 30, 2015: 11:00 - 12:00, 1 hr

• February 15, 2015: 18:00 - 19:00, 1 hr

• April 10, 2015: 10:00 - 24:00, 14 hr

• April 11, 2015: 0:00 - 23:00, 23 hr

• April 30, 2015: all day, 24 hr

For the analyses in the following chapters, these five outages are visible on some graphs. Nevertheless, outages were relatively rare during the observation period. With about four months of data collected by the logging system, we have a good representation of campus network usage, and are able to make informed observations about the traffic characteristics.

3.3 Data Pretreatment

The Bro system generates about 50 GB of compressed log files every day, including HTTP transaction logs, FTP transaction logs, TCP/UDP connection logs, DNS logs, etc. The HTTP logs are separated into hourly files of about 1 GB each. Analyzing even a single day's HTTP transaction logs, around 20 GB in total, is slow.

After several trials, we refined our analysis approach to use a mix of awk1 and Python2 scripts. awk is a standard Linux utility for extracting particular columns of information from a file. Since it is powerful and efficient, we used it to output selected records from the HTTP logs. Python is a well-known programming language. We chose it because it has many free libraries, such as the graph plotting module “Matplotlib”. Furthermore, Python is convenient for handling string variables when working with awk scripts.

While using awk to extract records is reasonably efficient, it is still quite slow to analyze one or several months of data. Therefore, we extract and store the HTTP records for the analyzed sites in temporary files, to speed up the data processing. With this pretreatment step in place, we can normally obtain the subsequent analysis results in a matter of hours.
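As an illustration of this pretreatment step, the following Python sketch filters one hour of Bro HTTP logs down to the records for the studied sites (the host names, file names, and hard-coded column index are hypothetical; in practice the column positions come from the log's “#fields” header):

    import gzip

    # Hypothetical host names for the two studied sites.
    SITES = {"aurora.example.ucalgary.ca", "ism.example.ucalgary.ca"}
    HOST_COL = 8  # assumed column index of the 'host' field; in practice,
                  # read it from the '#fields' header line instead

    def extract(log_path, out_path):
        """Append the HTTP log records for the studied sites to a temporary file.

        Bro's HTTP logs are tab-separated; metadata lines start with '#'.
        """
        with gzip.open(log_path, "rt") as src, open(out_path, "a") as dst:
            for line in src:
                if line.startswith("#"):
                    continue  # skip Bro header/metadata lines
                fields = line.rstrip("\n").split("\t")
                if len(fields) > HOST_COL and fields[HOST_COL] in SITES:
                    dst.write(line)

    # extract("http.2015-01-01-00.log.gz", "site_records.log")  # hypothetical names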

We analyzed the data on servers (4 Intel Xeon X5450 3.00 GHz CPUs, 32 GB RAM) in the Department of Computer Science.

1 The GNU Awk User's Guide, http://www.gnu.org/software/gawk/manual/gawk.html
2 Python (programming language), https://www.python.org/

3.4 Summary

This chapter introduced the methodology of this thesis, including the hardware deployment, the Bro logging framework, and the pretreatments applied to the data.

In the following two chapters, we analyze the HTTP traffic of two scientific Web sites, namely the Aurora site and the ISM site. The traffic of both sites was monitored and collected by the Bro logging system during our four-month observation from January 1, 2015 to April 30, 2015.

Chapter 4

AURORA SITE ANALYSIS

In this chapter, we analyze the Aurora site workload. To begin, we analyze HTTP characteristics, including the number of requests, data volume, HTTP methods, HTTP referers, IP activity, IP geolocation information, and URL popularity. In the process, we identify the existence of automatic crawling scripts (robots) responsible for a large part of the traffic. Next, we provide more detailed analysis of some individual IPs and referer sites based on popularity information. Finally, we identify the file transfer inefficiencies in the traffic generated by the robots. Additional active measurement experiments and results about file transfer inefficiency are available in Chapter 6.

4.1 HTTP Analysis

4.1.1 HTTP Requests

Figure 4.1 shows the daily count of HTTP requests over the four-month period under study. There are approximately 1.5 million requests per day, and 182 million in total (see Table 4.1). The Aurora site had fairly steady request traffic throughout the observation period (except for brief monitor outages on April 11 and April 30), but with a noticeable surge reaching 6 million requests per day in mid-March 2015, due to geomagnetic storm activity affecting the aurora (see Section 4.3).

Table 4.1: Statistical Characteristics of the Aurora Site (Jan 1/15 to Apr 29/15)

  Site    Total Reqs   Avg Reqs/day  Total GB  Avg GB/day  Uniq URLs  Uniq IPs
  Aurora  182,068,131  1,529,984     10,354    87.01       2,894,294  240,236

Figure 4.1: HTTP Request Count Per Day for Aurora Site

Figure 4.2 shows the hourly counts of HTTP requests on four selected days of our trace (i.e., January 1-3 and January 5). We choose these four days since they are in the first week of our trace, and they have clear hourly workload patterns. The consistent structure of the traffic, with over 40 thousand requests per hour, suggests that automated robots are generating most of the traffic. This is particularly likely given that January 1 is a statutory holiday (New Year's Day). Further analysis in Section 4.2 shows that about 50% of the request traffic is attributable to University of California at Berkeley robots crawling the site.

In fact, Figure 4.2 indicates that the robot is crawling the site multiple times per day, with a four-hour period in early January. The pattern started to change slightly on Monday, January 5, 2015.

4.1.2 Data Volume

Figure 4.3 shows that the typical daily data volume for the Aurora site is about 90 GB/day, except for mid-March, when the traffic quadrupled. The Aurora site server provides a variety of data files to the public, including videos, images, and zip files. Since the sizes of video and image files are relatively larger than the rest, the large jump in data volume is attributable to a surge in popularity for these large image or video files.

Figure 4.2: HTTP Requests Per Hour (Jan 1-3 and Jan 5, 2015)

4.1.3 IP Analysis

There were 240,236 distinct IP addresses that visited the Aurora Web site during our trace. Figure 4.4 shows the number of distinct IP addresses viewing the Aurora site per day. The daily count of unique IPs is about 4,000, except for mid-March, when the IP count grew eightfold. It is interesting that the surges in HTTP requests and unique IPs differ, with the former only quadrupling from its usual level.

We performed IP geolocation for all the IP addresses, using the IP location services from IPAddressLabs1 and MaxMind2. The IP addresses come from 192 distinct countries in total. Figure 4.5 shows the IP geolocation distribution of the top 10 countries based on the number of unique IP addresses. Most of the IPs (39.50%) are from Canada, with the United States second at 15.67%. Figure 4.6 shows the IP geolocation distribution of the top 10 countries sorted by request count. Most of the requests (73.22%) come from the United States, with Canada second at 17.47%. Furthermore, for the IP-city distribution, Berkeley, California accounted for 50.28% of the requests, generated by 43 IPs, while Fairbanks, Alaska was second at 14.78% of the requests, with 223 IPs. Since the THEMIS project (a larger collaborative project including the Aurora group) [10] is based in North America, these results are not surprising. There are, however, many other countries accessing the images (e.g., Japan 2.09%, UK 6.30%).

1 IP-GeoLoc IP Address Geolocation Online Service, http://www.ipaddresslabs.com/
2 GeoIP, MaxMind, https://www.maxmind.com/

Figure 4.3: Data Volume (GB) Per Day for Aurora Site

Table 4.2: Top 10 Most Frequently Observed IP Addresses for Aurora Site

  IP               Reqs        Pct     Organization              Location
  128.32.18.45     89,977,861  49.19%  University of California  Berkeley, USA
  137.229.18.201   22,951,449  12.55%  University of Alaska      Fairbanks, USA
  137.229.18.252   3,403,550   1.86%   University of Alaska      Fairbanks, USA
  50.65.108.252    2,394,630   1.31%   Shaw Communications       Edmonton, Canada
  128.32.18.192    1,919,161   1.05%   University of California  Berkeley, USA
  162.157.255.241  1,027,080   0.56%   TELUS Communications      Calgary, Canada
  162.157.31.100   817,197     0.45%   TELUS Communications      Edmonton, Canada
  211.133.151.210  795,318     0.43%   JIN Office Service        Japan
  110.92.52.141    670,887     0.37%   Good Communications       Kagoshima, Japan
  99.66.177.107    564,430     0.31%   AT&T U-verse              Dallas, USA

Table 4.2 shows the geolocation information for the top 10 most frequently observed IP addresses, ranked by number of HTTP requests. Three observations are evident from these results. First, most of the Top 10 are members of the THEMIS project, as expected (e.g., University of California at Berkeley, University of Alaska). Second, some of these organizations have multiple IPs in the Top 10, indicating either multiple auroral researchers, the use of DHCP (Dynamic Host Configuration Protocol), or the use of automated robots. Third, the topmost IP address, which is from UCB, generates about half of the requests. Its total request count actually exceeds the sum of all the other IP addresses, both on a daily basis and overall.

Figure 4.4: Number of Unique IP Addresses Daily from 2015-01-01 to 2015-04-29, Aurora Site
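Lookups like these are easy to script. A minimal sketch using the geoip2 Python package with a local MaxMind database (the database file name and the example lookup are hypothetical; the study itself used the two commercial services cited above):

    import geoip2.database
    from geoip2.errors import AddressNotFoundError

    # Hypothetical path to a MaxMind city database file.
    reader = geoip2.database.Reader("GeoLite2-City.mmdb")

    def locate(ip):
        """Return (country, city) for an IP address, or None if unknown."""
        try:
            record = reader.city(ip)
            return record.country.name, record.city.name
        except AddressNotFoundError:
            return None

    # locate("128.32.18.45")  # expected to resolve to Berkeley, USA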

Zipf’s law [28] is observed in many types of data. It is widely used in Internet traffic analysis such as [35,67]. By sorting the IPs according to the number of requests by each IP, we get the rank and frequency (number of requests) for each IP. Then we plot the (rank, frequency) pairs on a 2-dimensional coordinate system with log scale on both axes. Data manifesting Zipf’s law should result in a straight line in a log-log plot. Figure 4.7 shows the frequency-rank profile for the IP addresses observed at the Aurora site. There is visual evidence of power-law structure.
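A minimal Python/Matplotlib sketch of this procedure, assuming the per-request client IPs have already been extracted from the logs:

    from collections import Counter
    import matplotlib.pyplot as plt

    def plot_freq_rank(client_ips):
        """Frequency-rank profile on log-log axes.

        client_ips: one IP string per request (e.g., the id.orig_h column).
        """
        freqs = sorted(Counter(client_ips).values(), reverse=True)
        ranks = range(1, len(freqs) + 1)
        plt.loglog(ranks, freqs, marker=".", linestyle="none")
        plt.xlabel("Rank")
        plt.ylabel("Frequency")
        plt.savefig("freq_rank.png")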

Figure 4.5: IP Geolocation Distribution, Top 10 Countries Sorted by Unique IPs

4.1.4 HTTP Methods

Figure 4.8 shows the HTTP methods seen over the trace duration. For the Aurora site, 88.4% of the HTTP requests use the GET method, while 11.6% are HEAD requests. Among the HEAD requests, over 99.7% are generated by Wget. Other HTTP methods are negligible, with fewer than 100 requests over the four-month period.

The number of HEAD requests is fairly consistent over time, suggesting that they are generated by robots. Comparatively, the number of GET requests consists of two parts, namely human activities and robot activities. The surge of GET requests in mid-March suggests human activities that outpace the robot traffic at that point.

4.1.5 HTTP Referer

The HTTP referer field (when present) is another source of useful information about the traffic. This field in the HTTP request header indicates the Web page from which the Aurora site was visited.

Figure 4.6: IP Geolocation Distribution, Top 10 Countries Sorted by Request Numbers

We analyze the top 100 referers in terms of requests and data volume. The top referer for both is the Canadian Space Agency (CSA) AuroraMAX portal3, which appeared in 45,763,205 (25%) requests, and triggered a data transfer volume of 4,423 GB (43%) in total.

Most of the referrals come from pages showcasing images or videos from the Aurora Web site.

For example, 25 of the top 100 referrers come from the CSA site, and 9 from virmalised.ee4, which is an Estonian Web site broadcasting live auroral imagery from cameras around the world. These live feed pages generate large volumes of network traffic. Interestingly, many of the referring Web pages use JavaScript to automatically refresh the images shown on the page every few seconds, which contributes to the machine-generated5 traffic.

4.1.6 URL Analysis

Table 4.3 shows the Top 10 most frequently requested URLs for the Aurora site. Most of these URLs are images or videos labeled with recent or latest in the “/summary_plots/” directory. These images are updated automatically by the ground-based cameras every few seconds during the night, while the videos are generated and posted on the Real-Time Environmental Monitoring Platform (RTEMP)6 the next day.

3 Canadian Space Agency, AuroraMAX, http://www.asc-csa.gc.ca/eng/astronomy/auroramax/
4 http://virmalised.ee/en/
5 Note that the browser will refresh the image automatically, whether there is a human viewing the images or not.

Figure 4.7: Frequency-Rank Profile for IP Addresses, Aurora Site

Figure 4.8: HTTP Methods in Aurora Traffic

Table 4.3: Top 10 Most Frequently Requested URLs for Aurora Site

  URL                                          Reqs        Pct    GB     Pct
  /summary_plots/slr-rt/yknf/recent_480p.jpg   32,360,809  17.8%  4,116  39.8%
  /summary_plots/rainbow-rt/yknf/latest.jpg    25,269,475  13.9%  1,349  13.0%
  /summary_plots/slr-rt/yknf/recent_1080p.jpg  3,344,105   1.8%   970    9.4%
  /summary_plots/slr-rt/yknf/recent_SD.jpg     3,147,139   1.7%   170    1.6%
  /summary_plots/slr-rt/yknf/recent_720p.jpg   2,781,414   1.5%   678    6.5%
  /summary_plots/rainbow-rt/sask/latest.jpg    2,177,948   1.2%   26     0.3%
  /summary_plots/rainbow-rt/fsmi/latest.jpg    2,067,294   1.1%   19     0.2%
  /summary_plots/rainbow-rt/rabb/latest.jpg    2,060,695   1.1%   14     0.1%
  /summary_plots/rainbow-rt/gill/latest.jpg    1,958,148   1.1%   17     0.2%
  /summary_plots/rainbow-rt/fsim/latest.jpg    1,796,832   1.0%   22     0.2%

From Table 4.3, we see that the topmost URL “480p” accounts for 18% of the requests and 40% of the data volume. There are a few static HTML files in the Top 100, which contribute very little data volume. Note that the number of unique URLs is actually much larger than 2 million (see Table 4.1), reaching 75,847,177. When some Web sites (e.g., CSA AuroraMAX) fetch data (mostly images and videos) from the Aurora site for live broadcasting, they append a timestamp to the URL as a query string to obtain fresh content (since the URL is used as the key to cache files). For example, the request URL “/abc/latest.jpg” is modified to “/abc/latest.jpg?1426417182905” by the JavaScript code. This “cache busting” technique causes the excessive number of unique URLs.
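During analysis, this effect is easy to undo: stripping the query string collapses the cache-busted variants back to one logical URL. A minimal Python sketch, using the example URL above:

    from urllib.parse import urlsplit

    def strip_cache_buster(url):
        """Collapse cache-busted request URLs to their base resource."""
        return urlsplit(url).path

    # strip_cache_buster("/abc/latest.jpg?1426417182905")  # -> "/abc/latest.jpg"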

Figure 4.9 shows a frequency-rank analysis applied to the URLs requested on the Aurora site. It has several distinct plateaus in the frequency-rank profile. We attribute this to machine-generated request traffic, which we explore in more detail in Section 4.2.

4.1.7 File Type

There are 39 different file types in our trace for the Aurora site. Table 4.4 shows the top 10 file types ranked by HTTP request count. JPEG (Joint Photographic Experts Group) images account for most requests and data volume. It is unsurprising that the dominant traffic contributors are videos and images. Static HTML and JavaScript files are popular in terms of requests, but contribute minimally to data volume.

6 Real-Time Environment Monitoring Platform, http://rtemp.ca/

Figure 4.9: Frequency-Rank Profile for URLs, Aurora Site

4.1.8 HTTP Response Size Distribution

From the URL and file type analysis, we know that image files in the Aurora server are extremely popular. Therefore, we select the top two popular URLs to analyze the HTTP response size distribution. The size values we obtain are mainly affected by two factors:

1) Since those images are updated by the ground-based cameras, the size of the images changes along with the variation of the content. Note that the images posted to the Web pages are compressed to JPG format. The size of the original pictures would be larger. Figure 4.10 shows a series of images taken by the Yellowknife, NWT camera on March 10. Although the changes are not noticeable over small time spans, the image changes regularly and thus the size of the image changes. Even when the camera is not working in the day time, there is a countdown message updated in the images.

2) Instead of directly extracting file size values from the Aurora server, we trace the HTTP responses. The size of an HTTP response depends on the HTTP request (e.g., conditional GET, HEAD, partial GET), and many other factors (e.g., whether the transfer was interrupted). Consequently, even requests for the same file on the Aurora server may yield different response size values.

Table 4.4: Top 10 Most Frequently Requested File Types for Aurora Site

  File Type                Reqs        Pct     Rank  Volume (GB)  Pct     Rank
  Image/JPEG               80,158,252  52.23%  1     6,570        72.50%  1
  Text/HTML                56,475,028  36.80%  2     122          1.35%   5
  Application/X-Gzip       5,765,942   3.76%   3     1,312        14.48%  2
  Text/Plain               2,686,677   1.75%   4     5            0.06%   14
  Image/PNG                975,366     0.64%   5     68           0.75%   6
  Application/JavaScript   671,627     0.44%   6     1            0.01%   15
  Video/MPEG               173,094     0.11%   7     163          1.81%   4
  Video/MP4                151,291     0.10%   8     707          7.81%   3
  Image/GIF                18,509      0.01%   9     13           0.15%   9
  Image/X-Portable-Anymap  10,501      0.01%   10    35           0.39%   7

Figure 4.10: AuroraMAX Images from Yellowknife, 2015/03/10; panels (a)-(d) taken at 01:04 am, 01:23 am, 03:03 am, and 06:27 am

Figure 4.11: HTTP Response Size Values for “/summary_plots/slr-rt/yknf/recent_480p.jpg” File, from 2015-03-09 to 2015-03-15

Figure 4.12: HTTP Response Size Values for “/summary_plots/slr-rt/yknf/recent_480p.jpg” File, on 2015-03-12

Figure 4.13: HTTP Response Size Distribution Histogram for “/summary_plots/slr-rt/yknf/recent_480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis log-scale)

Figure 4.14: HTTP Response Size Distribution Cumulative Histogram for “/summary_plots/slr-rt/yknf/recent_480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis proportion)

Figure 4.15: HTTP Response Size Distribution Histogram for “/summary_plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis log-scale)

Figure 4.16: HTTP Response Size Distribution Cumulative Histogram for “/summary_plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis proportion)

The purpose of analyzing the response size distributions is to understand the actual bandwidth costs when those popular files are requested. Figure 4.11 shows a time-series of the HTTP response size for all the HTTP requests for “/summary_plots/slr-rt/yknf/recent_480p.jpg” from March 9 to March 15. Figure 4.12 shows the HTTP response sizes for all the HTTP requests for “/summary_plots/slr-rt/yknf/recent_480p.jpg” on March 12. The flat shapes in both figures are the responses generated during the camera's idle hours, when the size of the image rarely changes.

The overall response size distributions of the top two popular URLs (see Table 4.3) are presented in Figure 4.13 and Figure 4.15. Note that the y-axis is drawn in log-scale, and there are 50 bins in total.
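These histograms are straightforward to produce once the response_body_len values for a URL have been extracted; a minimal Matplotlib sketch, assuming the sizes are already converted to MB:

    import matplotlib.pyplot as plt

    def plot_size_histogram(sizes_mb, xmax=0.2):
        """Response size histogram: 50 bins, log-scale y-axis,
        mirroring the layout of Figures 4.13 and 4.15."""
        plt.hist(sizes_mb, bins=50, range=(0.0, xmax), log=True)
        plt.xlabel("Response Size (MB)")
        plt.ylabel("Frequency")
        plt.savefig("size_hist.png")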

The histograms of the top two URLs are visually similar. We attribute this to two reasons:

1) Although these two images are taken by two different lenses, the two lenses are deployed in the same location (Yellowknife) recording the same auroral phenomena. Since the content of the two images is highly correlated, so is the file size.

2) The live feed images are usually displayed together within a single Web page. Therefore, whenever the Aurora server responds to clients, the two images are viewed together. In addition, the images are updated synchronously on the Web page, which leads to similar response size distributions (from the previous reason, we know that the two images are strongly correlated). Furthermore, the user groups who view the two images are almost the same, so the requests sent from their Web browsers for the two images behave similarly.

In addition, there are many small responses (smaller than the actual size of the images) in both histograms. We attribute this to two reasons:

1) For “recent_480p.jpg” over the four-month period of observation, there are 4,271 “206 Partial Content”, 365,283 “304 Not Modified”, and 971 “404 Not Found” responses. For “latest.jpg”, there are 3,372 “206 Partial Content” and 466,460 “304 Not Modified” responses in four months. These response size values should be smaller than the actual size of the images.

2) Since the JavaScript “cache busting” implementation forces the browser to re-fetch images every few seconds, the browser may discard a pending request and the incomplete response data if a new request is generated. This phenomenon is frequently observed when the network connection speed is slow.

The CDFs for the response size distributions are shown in Figure 4.14 and Figure 4.16.

4.2 Robot Traffic

In this section, we study the workloads of the top IPs from the University of California at Berkeley (UCB) and the University of Alaska (UA). In addition, we analyze the traffic introduced by AuroraMAX, the top referrer site mentioned in Section 4.1.5.

4.2.1 Prominent Machine-Generated Traffic

Since we don't have a priori knowledge about which IP addresses are robots, we rely on two heuristics to identify them:

1) We classify an IP address as a robot if it requests the file robots.txt. There were 613 such IP addresses in our dataset.

2) We classify an IP address as a robot if it generates many HTTP requests in a relatively short time, or has a deterministic structure in its request patterns.
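A sketch of how these two heuristics could be applied to the logs (in Python; the one-hour window and request-rate threshold are arbitrary illustrations, not the thresholds used in the thesis):

    import bisect

    def is_robot(ip, robots_txt_ips, req_times, max_reqs_per_hour=10_000):
        """Apply the two robot heuristics to one client IP.

        robots_txt_ips: set of IPs that ever fetched /robots.txt
        req_times: request time-stamps (seconds) for this IP
        The window and threshold are hypothetical placeholders; the
        'deterministic request structure' test is omitted for brevity.
        """
        if ip in robots_txt_ips:                    # heuristic 1
            return True
        req_times = sorted(req_times)
        for i, t in enumerate(req_times):           # heuristic 2: burst rate
            j = bisect.bisect_left(req_times, t + 3600, i)
            if j - i > max_reqs_per_hour:
                return True
        return False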

The top few IPs from UCB and UA in Table 4.2 are definitely robots, based on this loose definition. UCB and UA are two leading participants in the THEMIS project mentioned earlier. We use the terms “UCB1” and “UCB2” to refer to the two most prominent IPs from UCB. Similarly, we use “UA1” and “UA2” to refer to the two most prominent IPs from UA.

Table 4.5: Prominent UCB and Alaska IPs in Aurora Web Site Traffic

  Name  IP              Total Reqs  Reqs/day  Total GB  Avg GB/day
  UCB1  128.32.18.45    89,977,861  756,116   211       1.78
  UCB2  128.32.18.192   1,919,161   16,127    789       6.64
  UA1   137.229.18.201  22,951,449  192,869   1,680     14.12
  UA2   137.229.18.252  3,403,550   28,601    573       4.82

Furthermore, we identify the traffic from the referrer site AuroraMAX from the Canadian Space Agency as robot traffic, since:

1) The AuroraMAX page makes the viewer's browser re-fetch images and videos from the Aurora site repeatedly.

2) It generates a huge volume of traffic for the Aurora site (see Section 4.1.5).

HTTP Requests and Data Volume

Figure 4.17 shows the HTTP requests and data volume information for each of the four UCB and UA IP addresses in Table 4.5.

1) UCB1 generates 89,977,861 requests in total, and 756,116 requests per day on average (see Table 4.5). This is about half of the total Aurora request traffic. However, the daily data volume that UCB1 generated is comparatively small. Upon further analysis, we found that all the requests generated by UCB1 have the user agent Wget/1.11.4 Red Hat modified; Wget is free software for retrieving Web site content using the HTTP, HTTPS, and FTP protocols. With its Wget scripts, UCB1 generates many HTTP requests without generating much data volume, since it only uses the GET method to fetch HTML pages and updated data files, and it checks the time-stamps of data files with the HEAD method.

2) UCB2 was active primarily from mid-January to early February. It only generated 1,919,161 requests in total. Nevertheless, it contributed a large proportion of the data volume in late January (see the surge in Figure 4.17(d)). Different from UCB1, it uses another version of Wget, namely Wget/1.12 (linux-gnu), as the user agent.

Figure 4.17: HTTP Requests and Data Volume Per Day for UCB, UA IPs; panels (a)-(h) show the daily requests and data volume for UCB1, UCB2, UA1, and UA2

3) UA1 generated approximately 0.2 million requests and 14 GB of data volume per day. It has more influence on the data volume compared to the other three IPs.

4) UA2 was active in March, when the geo-magnetic storm happened (see Section 4.3).

Workload Pattern

For these four IP addresses, it is interesting to study how the automatic scripts work by analyzing the pattern of the URLs requested.

With further URL analysis, we found that the UCB1 IP uses Wget to recursively download all the data files in four specific directories: fluxgate/stream0, imager/stream1, imager/stream2, and imager/stream3. Furthermore, UCB1 checks all the data files within both the previous month and the current month in those four directories every day. Note that the data in those directories are organized into folders by month and date. Basically, the Aurora server merely stores each day's new data into the corresponding directory.

The file robots.txt is configured by the site administrator. Located in the Web site root directory, it contains instructions in a specific format, indicating what robots are not permitted to access. By default, Wget follows proper robot etiquette [22]. It requests robots.txt before downloading files, hence providing us a way to figure out the workload pattern.

Figure 4.18 shows the daily count of UCB1's requests for the robots.txt file. There are around 30 robots.txt requests per day from UCB1. In other words, the Wget script was running 30 times each day. Figure 4.19 displays the hourly robots.txt request counts from UCB1 on four selected days. The periodic pattern is visually apparent. The cyclic period on January 16 is 8 hours, and it changes to 4 hours on the other days.

Further analysis of the URL requests shows the following:

1) There are two independent robots running different Wget scripts from the same UCB1 IP. The Wget scripts use recursive download mode to save all files in the given directory to the local hard disk (Wget downloads HTML files but deletes them after extracting any embedded URLs).

Figure 4.18: “robots.txt” Request Count Per Day for UCB1

Figure 4.19: “robots.txt” Request Count Per Hour on Four Selected Days (2015-01-16, 2015-02-01, 2015-02-05, and 2015-04-05)

2) The “imager” robot updates local data from the imager/stream1, imager/stream2, and imager/stream3 directories. It usually takes 2-3 hours to complete the scan of a month's data, and 4-6 hours in total to complete both the previous month and the current month. The “fluxgate” robot updates local data from the fluxgate/stream0 directory. It usually takes 10 minutes to finish 2 months of data.

3) Both robots take a short break after one complete scan over 2 months of data. The length of the break between scans varies from around 1 hour to 4 hours.

4) The file robots.txt is requested whenever a stream directory scan launches.

5) The robots use time-stamping mode, which makes Wget send HEAD requests to check the time-stamps of files on the server side, and only generate GET requests to fetch a file if it has a newer time-stamp (a hypothetical invocation is sketched below).
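Putting these observations together, the UCB1 scripts likely resemble the following Wget invocation; the host name and directory are hypothetical placeholders, reconstructed from the observed behavior rather than taken from the actual scripts:

    # -r: recursive download mode; -N: time-stamping mode (HEAD checks,
    # GET only for files whose server-side time-stamp is newer).
    # Host name and path are hypothetical placeholders.
    wget -r -N http://aurora-site.example.ca/data/themis/imager/stream1/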

Since Wget applies breadth-first search to recursively retrieve the directory structure of the site, it needs to download static HTML pages and extract URLs repeatedly7. Therefore, it is not surprising that UCB1 generates so many requests with very limited data volume. What is surprising is that UCB1 runs the Wget scripts to download files from these directories several times each day, even though the content rarely changes.

Furthermore, it does so using many non-persistent connections, and a lot of HEAD requests, rather than conditional GETs. This approach is not very Internet-efficient, because of the excessive number of TCP connections used and the network round-trip times incurred. We revisit this issue later in Chapter 6.

The UCB2 IP applies Wget/1.12 (linux-gnu) to retrieve files in older directories. Different from UCB1, the UCB2 script generates the URLs itself and invokes Wget to fetch the files. Several aspects of this script are different from UCB1:

7 For more information, refer to the “recursive download mode” in the Wget manual, http://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html

1) The referer information is missing, which indicates that UCB2 visits each URL directly.

2) Some URL requests are spelled incorrectly (e.g., “//data/themis” should be “/data/themis”, though the server resolves the former URL to the latter, responding with 200 OK), which shouldn't happen if Wget extracts the links automatically.

3) Some requested resources receive the “404 Not Found” error response from the Aurora server.

There is no periodic structure for UCB2. In addition, it downloads all data files in the given directory rather than updating local files like UCB1 does. Consequently, the UCB2 robot generated significant data volume in its short active period.

The UA robots are browsers repeatedly viewing the RTEMP live feed pages. Similar to the CSA AuroraMAX page, RTEMP provides Aurora live feeds by re-fetching images and videos from the Aurora site server. The process is repeated every three seconds by the client's browser, implemented with JavaScript Document Object Model (DOM) operations. The RTEMP live feed pages force clients to continuously send GET requests to the Aurora server as long as they are open in the browser (even when there is no human viewing the images; thus we classify it as robot traffic). The two UA robots did the same task. They re-fetched the live feed images for months, except that the user agent is Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0 for UA1, and Mozilla/5.0 (Windows NT 5.1; rv:36.0) Gecko/20100101 Firefox/36.0 for UA2.

There are actually two live feed pages opened by UA1, with different aurora pictures on each page. This produces the step-like structure in Figure 4.17(e). Since the content consists of images and videos, the data volume is much larger than that of the UCB1 robot.

4.2.2 AuroraMAX

The AuroraMAX page on the Canadian Space Agency Web site provides aurora live feeds that include image hyper-links to the Aurora site (note that the visitor's browser fetches images from the Aurora server instead of the CSA server). About 25% of the HTTP requests (45,763,205) and 43% of the data volume (4,423 GB) are generated by this top referrer site.

Figure 4.20: HTTP Request Count Per Day for AuroraMAX

Figure 4.20 and Figure 4.21 show that the daily requests and data volume are steady over the observation period, except for the surge in mid-March.

There were 104,529 unique IP addresses visiting the Aurora site via the AuroraMAX portal over the four-month period. Further analysis shows that the requests from the AuroraMAX portal are not attributable to a small set of highly active IP addresses. The topmost IP accounts for less than 1.5% of the HTTP requests from the AuroraMAX portal (compared to UCB1 accounting for half of the traffic for the whole Aurora site). Figure 4.22 shows the frequency-rank profile for these IP addresses.

Considering AuroraMAX’s popularity as a referrer site, the naive way it fetches the images from the Aurora site is Internet-inefficient. We propose a solution for this inefficiency issue in Chapter 6.

In summary, the robot traffic (UCB, UA, and AuroraMAX) accounts for 90.1% of the total requests and 74.1% of the total data volume. Therefore, solving these inefficiency issues may significantly reduce the load of the Aurora site.

Figure 4.21: Data Volume (GB) Per Day for AuroraMAX

Figure 4.22: IP Addresses Frequency-Rank Profile for AuroraMAX

4.3 Geomagnetic Storm

An interesting discovery in our dataset was the non-stationary traffic observed for the Aurora site in mid-March 2015. The HTTP request traffic and the data volume both quadrupled from their normal levels for the March 17-20 period (see Figure 4.1 and Figure 4.3).

The root cause of this traffic surge was solar flare activity that triggered one of the largest geomagnetic storms in over a decade [56]. Auroral researchers knew about this immediately, and eagerly downloaded many of the new images. The ensuing media coverage of the geomagnetic storm triggered many other site visits, either directly or via the AuroraMAX portal. Figure 4.20 shows that much of the traffic surge arrived via the AuroraMAX referrer site.

Further analysis indicates that the increased traffic is primarily human-initiated, since:

1) The number of distinct IPs visiting the site surged (eightfold) during the geomagnetic storm period (see Figure 4.4).

2) The number of GET requests quadrupled in the surge in Figure 4.8, with no change for HEAD requests. This contrast indicates that the surge was not contributed by Wget robots.

3) There was in fact a ten-fold increase in the AuroraMAX portal traffic (requests and data volume) during this period.

It is interesting to witness how real-world events affect the traffic of scientific Web sites. The traffic information shows that flash crowds are not limited to “popular Web sites”. Such surges are important to consider when provisioning server-side capacity.

4.4 Summary

This chapter provided a detailed analysis of the network traffic for the Aurora site. First, our analysis covered fundamental HTTP characteristics. Specifically, we analyzed the daily and hourly values for HTTP requests and data volume. We extracted the top 100 popular IP addresses, and discovered the existence of machine-generated traffic.

Based on these discoveries, we analyzed the robot traffic. We primarily studied the traffic of four distinct IP addresses from the University of California at Berkeley and the University of Alaska at Fairbanks. The results showed that the way they perform data transfers is very inefficient. In addition, we analyzed the top referrer site AuroraMAX and discovered its inefficient way of fetching live images from the Aurora site. Further discussion of the data transfer inefficiency problem is presented in Chapter 6.

Finally, we showed how real-world events affect the traffic of the Aurora site, by illustrating the changes during the geomagnetic storm.

We analyze the traffic of the ISM site in the next chapter.

Chapter 5

ISM SITE ANALYSIS

The ISM (Inter-Stellar Medium) Web site at the U of C provides astrophysics teaching materials. We present our analysis of the ISM site in this chapter. First, we show the HTTP characteristics for the ISM site. Considering that the network traffic was primarily human-generated, we focus on IP geolocation distribution, user agent classification, and URL popularity analysis. Since large-volume video files are a unique feature of the ISM site, we study how network traffic relates to user behavior patterns when viewing course videos from the ISM site. Finally, we analyze how course schedules affect the network traffic of the ISM site.

5.1 HTTP Analysis

We analyze the traffic logs for a four-month period from January 1, 2015 to April 29, 2015, covering the whole Winter 2015 semester at the U of C. In this semester, lectures began on January 12, and ended on April 15, with final exams running from April 18 to 29. There was a reading week with no lectures from February 15 to 22.

Due to the limitation of our tracing framework, we can only observe the ISM Web traffic generated when users are off-campus. The on-campus traffic doesn’t pass through the campus edge routers and therefore is not seen by our monitor. We may analyze the server-side logs of the ISM site in our future work.

5.1.1 HTTP Requests

A summary of the ISM site traffic is shown in Table 5.1. There are around 1.5 million requests in total, and 13,000 requests per day for the ISM site. While robots and referrer sites contribute most of the traffic to the Aurora site, this is not true for the ISM site. Consequently, it is not surprising that the average request traffic for ISM is about two orders of magnitude lower than that for the Aurora site (the data volumes of the two sites are similar).

Table 5.1: Statistical Characteristics of the ISM Site (Jan 1/15 to Apr 29/15)

  Site  Total Reqs  Avg Reqs/day  Total GB  Avg GB/day  Uniq URLs  Uniq IPs
  ISM   1,583,339   13,305        8,483     71.29       10,563     9,720

Figure 5.1: HTTP Request Count Per Day for ISM Site

The daily ISM site traffic is illustrated in Figure 5.1. Note there are 3 obvious surges in the request traffic over the four months. The surge in late-February aligns with the first midterm in the course (February 24), while the subsequent surges align with the second midterm (March 24) and the final exam (April 21). These surges are “expected” compared to the “unexpected” surges in the Aurora site.

We select six surge days and display their hourly HTTP request traffic in Figure 5.2. The requests usually decreased between midnight and dawn, conforming to human schedules. However, February 24 is a counterexample, for which the requests were influenced by the course midterm exam.

Figure 5.2: HTTP Requests Per Hour (Feb 23, Feb 24, Mar 23, Mar 24, Apr 20, and Apr 21, 2015)

5.1.2 Data Volume

Figure 5.3: Data Volume (GB) Per Day for ISM Site

Table 5.1 shows that the average daily data volume of the ISM site (71 GB per day) is comparable to the Aurora site (87 GB per day), even though the numbers of requests for the two sites are very different. We attribute this to two reasons:

1) The ISM site contains course-related materials rather than research resources. The professor provides large objects (e.g., course videos, PDFs) to his students. Those files are much larger than the JPEG and HTML files provided by the Aurora site.

2) Although the number of requests to the ISM site is low compared to the Aurora site, most of the requests target large files instead of small HTML or JavaScript files.

Figure 5.3 shows the daily data volume information for the ISM site over the four-month period. It is interesting to observe the similar “sawtooth” structures in each month. To be specific, in Figure 5.3, the data volume increased on February 19, 21, and 23, and decreased on February 20 and 22, which makes the broken lines form a “sawtooth” structure. The same “sawtooth” structure appeared in March and April as well. Since these surges align with the exams, we attribute the “sawtooth” structures to the students' studying pattern.

Another interesting discovery is the “out of sync” phenomenon between the dates of the maximum surge in requests versus data volume in late February. Specifically, the maximum surge in requests (Figure 5.1) was on February 24, while the biggest surge in data volume (Figure 5.3) was on February 23 (this issue only occurred in February's surge; the surges in March and April align). By comparing the URLs requested on February 23 and February 24, we find that although the number of video requests on February 24 is larger than on February 23, the average data volume per video request on February 24 is smaller than on February 23. This may indicate that most video viewers (students) tended to skip frames when watching the course videos on February 24, whereas they preferred to watch video clips with longer average durations on February 23. This midterm reviewing pattern makes the number of requests peak on February 24, and the data volume peak on February 23.

5.1.3 IP Analysis

During the four months of observation, 9,720 unique IP addresses visited the ISM site, and around 300 IPs requested files from the ISM server each day. Since the ISM site is mainly designed for students at the University of Calgary, the magnitude of daily users is much smaller than for the Aurora site, at about one-tenth of the Aurora level. The amplitude of the surges in the second half of each month is also comparatively smaller than for the request traffic in Figure 5.1 and the data volume in Figure 5.3, because the primary users are students, who are frequent repeat visitors to the ISM site.

The geolocation analysis for all the IPs visiting the ISM site shows that visitors were from

101 different countries, though about half of those countries (55) generated fewer than 100 requests in four months. Figure 5.4 shows a pie graph of the top 5 countries. Again, it is not surprising that most of the traffic is generated by Canadian (88.24%) and American (7.91%) users. For all the requests from Canada, Alberta surpasses all other provinces with 1.2 million requests (97.64%) in Figure 5.5, while the US distribution is more dispersed in Figure 5.6.

Actually, many of the USA requests are generated by Internet companies, like Google and

Table 5.2: Top 10 Most Frequently Observed IP Addresses for ISM Site

    IP               Reqs     Pct.   Organization                Location
    209.89.92.190    125,296  7.91%  TELUS Communications Inc.   Calgary, Canada
    70.72.185.197     86,648  5.47%  Shaw Communications Inc.    Calgary, Canada
    96.51.68.175      64,912  4.10%  Shaw Communications Inc.    Calgary, Canada
    198.166.61.187    64,581  4.08%  TELUS Communications Inc.   Calgary, Canada
    209.89.235.216    61,135  3.86%  TELUS Communications Inc.   Calgary, Canada
    68.146.124.225    43,501  2.75%  Shaw Communications Inc.    Calgary, Canada
    206.75.57.71      40,749  2.57%  TELUS Communications Inc.   Calgary, Canada
    162.157.164.121   39,405  2.49%  TELUS Communications Inc.   Calgary, Canada
    68.146.221.78     26,053  1.65%  Shaw Communications Inc.    Calgary, Canada
    68.110.70.13      20,802  1.31%  Cox Communications Inc.     Scottsdale, USA

Figure 5.4: IP Geolocation Distribution for Countries (Canada 88.24%, 1,397,096 requests; United States 7.91%, 125,269; United Kingdom 0.75%, 11,847; France 0.48%, 7,612; China 0.38%, 6,092; others 2.24%, 35,423)

Figure 5.5: IP Geolocation Distribution for Canada (Alberta 97.64%, 1,221,943 requests; British Columbia 1.25%, 15,639; Ontario 0.61%, 7,660; Quebec 0.31%, 3,838; Saskatchewan 0.16%, 2,032; others 0.03%, 361)

Figure 5.6: IP Geolocation Distribution for USA (California 32.85%, 39,627 requests; Arizona 18.53%, 22,356; Washington 12.10%, 14,595; New Jersey 10.94%, 13,191; Massachusetts 4.63%, 5,588; others 20.94%, 25,263)

Figure 5.7: IP Geolocation Distribution for Alberta (Calgary 93.07%, 1,129,175 requests; Red Deer 1.64%, 19,844; Medicine Hat 1.21%, 14,622; Cochrane 0.92%, 11,188; Edmonton 0.79%, 9,628; others 2.38%, 28,821)

Figure 5.7 shows that 1.1 million requests come from Calgary, dominating all other cities in Alberta. Furthermore, among all the requests generated in Canada, about half (704,074 requests, or 50.4%) use the Internet service provided by “Shaw Communications Inc.”, and 44.8% belong to “TELUS Communications Inc.”.

Figure 5.8 shows how many unique IPs visited the ISM site from Canada and Calgary per day. Since Canada is the primary contributor to the ISM site traffic, and Calgary is the primary contributor within Canada, the structure of the red area aligns with the blue and green areas. In addition, we analyzed the daily unique IPs from the USA and its top state, California, in Figure 5.9. Nearly half of the IPs from the USA are in California, and the surges for the USA align with the surges for California.

The IP frequency-rank profile of the ISM site is shown in Figure 5.10. Visual evidence of a power-law structure is apparent.
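(A frequency-rank profile of this kind can be reproduced by counting requests per IP and sorting the counts in decreasing order; the sketch below assumes a hypothetical input file with one IP per request.)

    # Sketch: frequency-rank (Zipf) profile of client IPs; on log-log axes,
    # an approximately straight line suggests power-law structure.
    from collections import Counter
    import matplotlib.pyplot as plt

    counts = Counter(line.strip() for line in open("ism_request_ips.txt"))
    freqs = sorted(counts.values(), reverse=True)    # frequency at rank 1..N

    plt.loglog(range(1, len(freqs) + 1), freqs)
    plt.xlabel("Rank")
    plt.ylabel("Frequency")
    plt.savefig("ip_rank_profile.png")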

5.1.4 URL Analysis

There are 10,563 different URLs on the ISM site requested over the four-month period. Table 5.3 shows the top 10 most popular URLs for the ISM site. Note that we show only the file names and parts of the directory names in Table 5.3, for convenience.

Figure 5.8: Number of Daily Unique IP Addresses Visiting ISM Site, from Canada and Calgary (2015-01-01 to 2015-04-30) [time series of daily IP counts, 0-600, for Calgary, Canada (excl. Calgary), and Outside Canada]

Since the course instructor changed the format of all the course videos in the middle of the semester, both “.mov” and “.mp4” extensions appear among the top 10 URLs.

It is unsurprising that course materials are popular among the URLs. Furthermore, the large videos and PDF files generate tremendous data volume from a limited number of requests.

Figure 5.11 shows the URL frequency-rank profile of the ISM site. Unlike the step shape for Aurora in Figure 4.9, the URL frequency-rank profile for the ISM site shows visual evidence of a power-law structure, typical of human-generated requests.

Table 5.3: Top 10 Most Frequently Requested URLs for ISM Site

    URL                                               Total Reqs  Total GB
    ASTR209 - Lec8 - Feb 5, 2015.mov                     153,410    267.04
    ASTR209 - Lec3 - Jan 20, 2015.mov                     87,051    787.02
    ASTR209 - Intro. & Lecture#1 - Jan 13,2015.mov        75,380    735.64
    ASTR209 - Lec4 - Jan 22, 2015.mov                     68,609    584.47
    AST209 Podcast/rss.xml                                56,293      0.71
    2015/1/28 Course Notes files/Part2 e&m.pdf            55,952     58.07
    ASTR209 - Lec2 - Jan 15, 2015.mov                     39,687    998.60
    ASTR209 - Lec10, Feb 12, 2015.mov                     31,308    310.65
    2015/3/11 Course Notes files/Part2 e&m.pdf            30,068     23.54
    ASTR209 - Lec15 - Mar 12, 2015.mp4                    28,690    284.02

Figure 5.9: Number of Daily Unique IP Addresses Visiting ISM Site, from USA and California (2015-01-01 to 2015-04-30) [time series of daily IP counts, 0-600, for California, US (excl. California), and Outside US]

Table 5.4: HTTP Method Summary for ISM Site

    HTTP Method  Rank  Reqs       Avg Reqs/day  Pct.
    GET          1     1,575,574  13,130        99.51%
    HEAD         2     7,749      65            0.49%
    OPTIONS      3     11         0.09          0.00%
    POST         4     5          0.04          0.00%

5.1.5 HTTP Methods

Table 5.4 shows a summary of the HTTP method information for the ISM site. As expected, GET requests dominate the other HTTP methods. Since there is no Wget robot crawling the ISM site, HEAD requests account for only a small fraction of the total traffic. Furthermore, 7,285 of the HEAD requests (94.01%) were generated by Apple’s iTunes application to check the existence of some resources, or whether the ISM site’s RSS (Rich Site Summary) [23] file had been updated.
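(Such feed polling can be approximated with a conditional HEAD request that checks a validator such as Last-Modified before downloading the feed body; the URL and cached value in the sketch below are hypothetical.)

    # Sketch: poll an RSS feed the way a podcast client might, using HEAD
    # to check the Last-Modified validator before issuing a full GET.
    import requests

    FEED = "http://ism.example.ucalgary.ca/AST209_Podcast/rss.xml"  # hypothetical
    last_seen = "Tue, 24 Feb 2015 08:00:00 GMT"      # value cached previously

    head = requests.head(FEED)
    if head.headers.get("Last-Modified") != last_seen:
        feed = requests.get(FEED)
        print("feed changed:", len(feed.content), "bytes fetched")
    else:
        print("feed unchanged; no GET needed")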

Figure 5.12 displays the daily number of GET and HEAD requests. The HEAD requests are rarely seen.

Figure 5.10: Frequency-Rank Profile for IP Addresses, ISM Site [log-log axes; Frequency (10^0 to 10^6) versus Rank (10^0 to 10^4)]

Figure 5.11: Frequency-Rank Profile for URLs, ISM Site [log-log axes; Frequency (10^0 to 10^6) versus Rank (10^0 to 10^5)]

Figure 5.12: HTTP Methods in ISM Traffic [time series of daily request counts, 0-150K, for GET and HEAD]

Table 5.5: HTTP Status Code Summary for ISM Site

    Status Code  Type                             Rank  Reqs     Avg Reqs/day  Pct.
    206          Partial Content                  1     927,733  7,731         58.59%
    200          OK                               2     507,358  4,227         32.04%
    304          Not Modified                     3     79,064   658           4.99%
    404          Not Found                        4     47,372   394           2.99%
    301          Moved Permanently                5     52       0.43          0.00%
    416          Requested Range Not Satisfiable  6     33       0.28          0.00%
    400          Bad Request                      7     1        0             0.00%

5.1.6 HTTP Status Codes

The HTTP status code is part of the HTTP response header, indicating how the server responded to an HTTP request. For example, the server responds with a “200 OK” status code when it successfully returns the resource requested by a client’s GET request.

Table 5.5 summarizes the HTTP status codes for the ISM site. Status code “206” (Partial Content) is the most common, accounting for around 60% of the requests, while “200” is second at 32%. This result is quite different from the workload characterization of most general Web sites, where “200 OK” responses dominate. This situation is primarily caused by students frequently requesting pieces of large files (e.g., videos and PDFs), and by Internet user agent behaviors.
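(The “206 Partial Content” responses arise from HTTP Range requests; the sketch below shows a ranged GET of the kind that produces them, with a hypothetical URL and a 128 KiB range matching the sizes observed later in Section 5.1.7.)

    # Sketch: a ranged GET that elicits a "206 Partial Content" response.
    # The URL is hypothetical; bytes=0-131071 asks for the first 128 KiB.
    import requests

    url = "http://ism.example.ucalgary.ca/ASTR209-Lec8.mov"
    resp = requests.get(url, headers={"Range": "bytes=0-131071"})

    print(resp.status_code)                   # 206 if byte ranges are supported
    print(resp.headers.get("Content-Range"))  # e.g., "bytes 0-131071/1610612736"
    print(len(resp.content))                  # 131072 bytes transferred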

Figure 5.13: HTTP Status Code in ISM Traffic [time series of daily counts, log-scale 10^0 to 10^6, for status codes 206, 200, 304, and 404]

We plot the daily counts for the top 4 status codes in Figure 5.13. It is interesting to observe the interleaving pattern of status codes “200” (red dashed line) and “206” (black solid line) whenever an exam was imminent. We attribute this to students’ reviewing strategies, and discuss it further in Section 5.3.

5.1.7 HTTP Response Size Distribution

Figure 5.14 shows the HTTP response sizes for all the HTTP requests for “Lec8 - Feb 5, 2015.mov” from February 18 to February 24. The x-axis represents the time series, covering about 130,000 requests generated in one week (the midterm surge is included).

Figure 5.15 shows the HTTP response sizes for all the HTTP requests for “Lec8 - Feb 5, 2015.mov” on February 24 (the first midterm date). The video was retrieved frequently from early morning to noon, right up to the midterm.

We performed a series of HTTP response size distribution analyses for the ISM site. These analyses include all responses (e.g., “206 Partial Content” and “200 OK”). The response size values are the transferred data volumes, rather than the “Content-Length” header values.

Figure 5.14: HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, from 2015-02-18 to 2015-02-24

Figure 5.15: HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, on 2015-02-24

Figure 5.16: HTTP Response Size Distribution Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-5 GB, 50 bins, y-axis log-scale)

Figure 5.17: HTTP Response Size Distribution Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-10 GB, 50 bins, y-axis log-scale)

Figure 5.18: HTTP Response Size Values (Byte) Per Request, Top 5 Counts for “Lec8 - Feb 5, 2015.mov” (131,072 B 51.57%, 79,106 requests; 262,144 B 21.64%, 33,205; 65,536 B 12.21%, 18,727; others 11.72%, 17,977; 0 B 1.93%, 2,963; 327,680 B 0.93%, 1,432)

Figure 5.19: HTTP Response Size Values (Byte) Per Request, Top 5 Counts for “Lec3 - Jan 20, 2015.mov” (65,536 B 47.54%, 41,385 requests; others 46.96%, 40,875; 1,245,184 B 1.58%, 1,379; 1,179,648 B 1.51%, 1,311; 0 B 1.36%, 1,181; 1,310,720 B 1.06%, 920)

Figure 5.16 and Figure 5.17 show the response size histograms for the top 2 most popular URLs, “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov”. Note that the y-axis is drawn in log-scale, and there are 50 bins in total. Although the actual file size of “Lec8 - Feb 5, 2015.mov” is 1.5 GB, and that of “Lec3 - Jan 20, 2015.mov” is 2.1 GB, most of the response size values in both figures fall into the small-value bins.

Because histograms provide limited detail, we draw pie graphs of the 5 most frequent response size values (in bytes) for the 2 URLs in Figure 5.18 and Figure 5.19. Clearly, small responses with data volumes below 1 MB predominate at the ISM server, even though the requested videos are around 2 GB in size. Furthermore, the majority of these small values concentrate on a few specific sizes, such as 131,072 bytes (51.57%, with 79,106 requests) and 65,536 bytes (47.54%, with 41,385 requests); these are power-of-two chunk sizes (128 KiB and 64 KiB, respectively). These phenomena are caused by Internet user agent behaviors when fetching large files from a server that supports partial GET requests.

Figure 5.20 shows zoomed-in histograms and cumulative histograms of the 2 URLs for response data volumes smaller than 1 MB. The peaks correspond to the popular response size values from the pie graphs (e.g., 131,072 bytes). Note that the y-axis is drawn in log-scale, and there are 50 bins in total. The step-like shapes of the cumulative histograms also indicate that many responses share the same size values.
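(Histograms and cumulative histograms of this kind can be computed directly from the per-response sizes; a NumPy sketch follows, with a hypothetical input file.)

    # Sketch: 50-bin histogram and cumulative histogram of response sizes
    # below 1 MB, mirroring the zoomed-in plots. Input file is hypothetical.
    import numpy as np

    sizes = np.loadtxt("lec8_response_sizes.txt")   # one size (bytes) per line
    small = sizes[sizes < 2**20]                    # keep responses under 1 MB

    hist, edges = np.histogram(small, bins=50, range=(0, 2**20))
    cumulative = np.cumsum(hist) / hist.sum()       # fraction of responses

    for right, frac in zip(edges[1:], cumulative):
        print(f"<= {right:9.0f} B : {frac:.3f}")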

The ISM server supports the “Accept-Ranges: bytes” feature, which allows clients to request arbitrary byte ranges of a file stored on the server. Therefore, clients can request partial content from the ISM server. User agents may exhibit diverse behaviors when fetching large files from the ISM server, even though the videos are served as static files (namely, adaptive streaming techniques are not applied). We revisit this issue in Section 5.2.

5.1.8 User Agents

Unlike the traffic of the Aurora site, the vast majority of users viewing the ISM site are humans. Therefore, it is meaningful to analyze the user agent information. In this section, we use the on-line user agent database provided by “User Agent String.Com” (http://www.useragentstring.com/) to identify viewers’ operating system and user agent information.

Figure 5.20: Histograms of Response Size Values (smaller than 1 MB) for “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov” Files. (a) Response Size Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-1 MB, 50 bins, y-axis log-scale); (b) Response Size Cumulative Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-1 MB, 50 bins); (c) Response Size Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-1 MB, 50 bins, y-axis log-scale); (d) Response Size Cumulative Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-1 MB, 50 bins)

Table 5.6: Top 10 Most Popular User Agents for the ISM Site

    User Agent Name                                                             Reqs
    AppleCoreMedia/1.0.0.11D201 (iPhone; U; CPU OS 7_1_1 like Mac OS X; en_us)  142,232
    AppleCoreMedia/1.0.0.12B466 (iPad; U; CPU OS 8_1_3 like Mac OS X; en_us)    124,788
    AppleCoreMedia/1.0.0.12B435 (iPhone; U; CPU OS 8_1_1 like Mac OS X; en_gb)  72,731
    AppleCoreMedia/1.0.0.11A501 (iPad; U; CPU OS 7_0_2 like Mac OS X; en_us)    64,608
    AppleCoreMedia/1.0.0.12A405 (iPad; U; CPU OS 8_0_2 like Mac OS X; en_us)    60,223
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0    50,255
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0    41,548
    Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0    41,536
    AppleCoreMedia/1.0.0.10K549 (Macintosh; U; Intel Mac OS X 10_8; en_us)      38,045
    Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0    28,225

The top 10 most popular user agents for the ISM site are shown in Table 5.6. Six of the user agents in the table are from Apple products, while the rest are Firefox browsers running on the Windows operating system.

Among all captured user agents, about half of the requests (49.15%) belong to the “browser” category, 44.31% are identified as “AppleCoreMedia”, and 2.01% as “crawler”. Specifically, Figure 5.21 shows the distribution of the top 8 user agent names; all other identified user agents are grouped under the “others” label.
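(The classification above relies on the useragentstring.com database; the sketch below is a deliberately simplified substring-based approximation of the same idea, with illustrative rules and a hypothetical input file.)

    # Sketch: coarse user-agent classification approximating the categories
    # above; these substring rules are illustrative simplifications.
    from collections import Counter

    def classify(ua: str) -> str:
        if "AppleCoreMedia" in ua:
            return "AppleCoreMedia"
        if any(b in ua for b in ("Googlebot", "bingbot", "Baiduspider", "Sogou")):
            return "crawler"
        if any(b in ua for b in ("Firefox", "Chrome", "Safari", "MSIE", "Trident")):
            return "browser"
        return "others"

    shares = Counter(classify(ua.rstrip("\n")) for ua in open("ism_user_agents.txt"))
    total = sum(shares.values())
    for label, n in shares.most_common():
        print(f"{label:16s} {100 * n / total:5.2f}% ({n})")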

Since about half of the requests were generated by Internet browsers, we analyzed the user agents within the browser category in Figure 5.22. Firefox, Chrome, and Safari are the top 3 most popular browsers. Internet Explorer is in 4th position, with only a 6.67% share.

For all the user agents labeled as “crawler”, we found that “Googlebot” from Google accounts for about half of the crawler traffic (15,734 requests, 49.46%), and “Bingbot” from Microsoft ranks second with 8,031 (25.25%) HTTP requests. There are also crawlers operated by Chinese search engines, such as Baidu (http://www.baidu.com/) and Sogou (http://www.sogou.com/).


Figure 5.21: User Agent Names Distribution in the ISM Site (AppleCoreMedia 44.31%, 701,507 requests; Firefox 18.63%, 295,001; Chrome 14.78%, 234,035; Safari 10.86%, 171,994; Internet Explorer 3.28%, 51,897; unknown 3.03%, 48,006; Android Webkit Browser 1.48%, 23,404; iTunes 1.38%, 21,929; others 2.25%, 35,563)

Figure 5.22: User Agent Browsers Distribution in the ISM Site (Firefox 37.91%, 295,001 requests; Chrome 30.07%, 234,035; Safari 22.10%, 171,994; Internet Explorer 6.67%, 51,897; Android Webkit Browser 3.01%, 23,404; others 0.24%, 1,856)

Figure 5.23: Operating System Distribution in the ISM Site (Macintosh 59.01%, 934,367 requests; Windows 32.96%, 521,945; unknown 4.41%, 69,809; Android 1.88%, 29,763; Darwin 0.96%, 15,223; Linux 0.68%, 10,805; others 0.09%, 1,424)

Figure 5.23 summarizes the distribution of operating systems for all the user agents. Macintosh (the user agent database classifies all Apple products into this category) is the most popular operating system; it includes iPhone OS (iOS) [18], running on the iPhone, iPad, and iPod touch, and OS X [21], running on Apple computers. The second most prevalent is Microsoft’s Windows, generating 32.96% of the requests. Interestingly, Android is not as popular among students. We further analyze each operating system category:

1) Within the Apple category, iPhone OS accounts for 70.88% of the requests (662,286 in total), and OS X represents 29.12% (272,081 requests).

2) For Windows users, 44.94% (234,572) of the requests were generated by Windows 7, 40.87% (213,300) by Windows NT, 9.24% (48,238) by Windows 8, 2.82% (14,700) by Windows Vista, and 1.29% (6,743) by Windows XP. These results are similar to the Windows Web browsing shares in [26].

We list the top 5 popular versions for some selected operating systems in Table 5.7. Note that Apple users tend to upgrade their OS to the newer versions more frequently, compared to Windows users.

Table 5.7: Top 5 OS Versions

    (a) Android (1.88% of total Reqs)
    OS Version  Reqs    Pct.
    4.4.2       14,781  49.66%
    4.4.4       5,667   19.04%
    5.0.1       2,299   7.72%
    5.0.2       935     3.14%
    4.2.1       790     2.65%

    (b) iPhone OS (41.83% of total Reqs)
    OS Version  Reqs     Pct.
    8.1.3       170,011  25.67%
    7.1.1       143,252  21.63%
    8.1.1       89,253   13.48%
    8.0.2       67,816   10.24%
    7.0.2       64,786   9.78%

    (c) OS X (17.18% of total Reqs)
    OS Version  Reqs    Pct.
    10.6.8      47,491  17.45%
    10.10.2     47,090  17.31%
    10.9.5      31,012  11.40%
    10.10.1     30,689  11.28%
    10.8.3      19,978  7.34%

    (d) Windows (32.96% of total Reqs)
    OS Version  Reqs     Pct.
    Win 7       234,572  44.94%
    Win NT      213,300  40.87%
    Win 8       48,238   9.24%
    Win Vista   14,700   2.82%
    Win XP      6,743    1.29%

Table 5.8: Top 5 Browser Versions

    (a) Firefox (18.63% of total Reqs)
    Browser Version  Reqs     Pct.
    35.0             118,495  40.17%
    36.0             86,130   29.20%
    37.0             54,030   18.32%
    34.0             13,391   4.54%
    33.0             6,259    2.12%

    (b) Chrome (14.78% of total Reqs)
    Browser Version  Reqs    Pct.
    40.0.2214.115    37,381  15.97%
    40.0.2214.111    28,115  12.01%
    42.0.2311.90     22,460  9.60%
    41.0.2272.118    21,752  9.29%
    41.0.2272.101    19,674  8.41%

    (c) Safari (10.86% of total Reqs)
    Browser Version  Reqs    Pct.
    8.0              49,467  28.76%
    8.0.3            20,880  12.14%
    7.0              15,995  9.30%
    8.0.2            15,944  9.27%
    8.0.4            11,225  6.53%

    (d) Internet Explorer (3.28% of total Reqs)
    Browser Version  Reqs    Pct.
    11.0             32,116  61.88%
    10.0             9,824   18.93%
    7.0              4,785   9.22%
    8.0              2,601   5.01%
    9.0              1,403   2.70%

Figure 5.24: HTTP Requests Count Per Day for Video Requests [time series of daily request counts, 0-150K, for Video and Other]

In addition, we list the top 5 versions for selected browsers in Table 5.8. It is clear that clients using Internet Explorer and Safari are more inclined to update the browser to the latest version.

5.2 Video Viewing Pattern and Traffic

From the previous analysis, we have a general understanding of the ISM site traffic. The large course videos make its traffic pattern different from that of other sites. Therefore, we study the video viewing pattern and the corresponding traffic in this section.

5.2.1 Video Requests Traffic

Figure 5.24 shows the daily video requests. Clearly, most requests are video requests. However, the number of video requests stays at a comparatively low level during the first half of each month. Furthermore, non-video requests even exceed video requests during the last two surges, which is caused by students’ exam reviewing strategies: before the first midterm, students relied more on the lecture videos for studying, but they turned to other materials (e.g., lecture notes) when studying for the second midterm and the final exam.

Figure 5.25: Data Volume (GB) Per Day for Video Requests [time series of daily data volume, 0-500 GB, for Video and Other]

Figure 5.25 compares the video-related data volume to the non-video traffic. The result shows that most of the data volume is contributed by video requests throughout the four-month observation, aligning with the analysis in the previous sections. Even when the number of requests retrieving other resources exceeds the number of video requests in Figure 5.24, the data volume of video requests still dominates.

We analyze the HTTP transaction durations (Figure 5.26) and response size values (Figure 5.27) of all the video requests during the four months of observation. The duration value is calculated as the time between sending a request and receiving the response. For all video requests, we find that 98.1% of HTTP transaction duration values are shorter than 10 seconds, and 94.6% of response size values are smaller than 5 MB. In other words, short HTTP transaction durations and small response sizes dominate the video HTTP transactions, owing to the prevalence of HTTP partial content request-responses. At the other extreme, some HTTP transactions last for several hours. This is surprising, since one lecture video usually lasts only an hour. These rare long transactions may be caused by slow network speeds, connection failures, or paused video players. The same situation also appears in the response size distribution.

Figure 5.26: HTTP Transaction Durations Distribution Histogram (x-axis 0-60K s, 50 bins, y-axis log-scale)

Figure 5.27: HTTP Response Size Distribution Histogram (x-axis 0-12 GB, 50 bins, y-axis log-scale)

Figure 5.28: HTTP Transaction Duration (≤ 10s) CDF for Video Requests

Figure 5.29: HTTP Response Size (≤ 5MB) CDF for Video Requests

Figure 5.28 and Figure 5.29 show the CDFs of the HTTP transaction durations (≤ 10s) and the HTTP response sizes (≤ 5MB) for video requests. The x-axes represent duration values (in seconds) and response size values (in MB), respectively. The duration curve is comparatively smoother than the response size curve, which has a few vertical jumps. Further analysis indicates that the dominant response sizes are 65,536 bytes (35.4% of requests), 131,072 bytes (12.1%), and 262,144 bytes (5.2%), i.e., 64 KiB, 128 KiB, and 256 KiB chunks. This phenomenon is caused by user agents fetching large (video) files from a server that supports partial GET requests.

The ISM server supports the “Accept-Ranges: bytes” feature, which allows clients to request arbitrary byte ranges of a file stored on the server. Therefore, clients can request partial content from the ISM server, and user agents may exhibit diverse behaviors when fetching large video files from it. Among all video requests, the user agent “AppleCoreMedia” is responsible for 701,499 of the video requests (97.9%), dominating the other popular Internet browsers. AppleCoreMedia is a framework in Apple’s products used to process on-line videos, cooperating with other applications such as Safari and iTunes. From the access logs, we find that the user agent field in an HTTP request sometimes changes to “AppleCoreMedia” even when using Safari. Furthermore, we discover that the partial requests generated by Safari or “AppleCoreMedia” are quite unpredictable. For example, the range values are not always monotonic or contiguous; they occasionally skip or overlap. Yao et al. [60] also found this behavior on other iOS devices and analyzed its inefficiency. They concluded that about 10%-70% of the traffic is redundant when accessing Internet streaming services on iOS devices.
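(One way to quantify such redundancy is to compare the total bytes requested against the union of the requested byte ranges; the sketch below does this for an illustrative set of overlapping ranges.)

    # Sketch: estimate redundant transfer caused by overlapping Range requests.
    # Redundancy = fraction of requested bytes already covered by other ranges.
    def redundancy(ranges):
        """ranges: inclusive (first, last) byte offsets, as in
        'Range: bytes=first-last' request headers."""
        total = sum(last - first + 1 for first, last in ranges)
        merged = sorted(ranges)
        covered = 0
        cur_first, cur_last = merged[0]
        for first, last in merged[1:]:
            if first > cur_last + 1:                 # disjoint: close interval
                covered += cur_last - cur_first + 1
                cur_first, cur_last = first, last
            else:                                    # overlapping or adjacent
                cur_last = max(cur_last, last)
        covered += cur_last - cur_first + 1
        return (total - covered) / total

    # Illustrative overlapping 128 KiB ranges like those seen in the logs:
    print(redundancy([(0, 131071), (65536, 196607), (131072, 262143)]))  # ~0.33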

5.2.2 Browser Behaviors for Video Playing

As introduced in Section 2, there are three different techniques for streaming video over the Internet. We analyzed the HTML source code of the ISM site and found that it uses the progressive download technique. This implementation of video streaming in the ISM site is not only inconvenient for users, but also inefficient in its network usage. We explore this further by performing a comparison experiment.

We deploy an Apache HTTP server on a PC, with “Accept-Ranges: bytes” enabled by default. The configuration of our server is essentially the same as that of the ISM server. We limit client bandwidth to 10.24 Mbit/s to simulate a realistic QoS environment (cf. https://en.wikipedia.org/wiki/List_of_countries_by_Internet_connection_speeds), using the Apache “mod_ratelimit” module (http://httpd.apache.org/docs/2.4/mod/mod_ratelimit.html); a configuration sketch appears after the case list below. The Web server and clients all run on the same PC (localhost), so network issues are eliminated. One lecture video (“ASTR209 - Lec4 - Jan 22, 2015.mp4”) is downloaded from the ISM site as a sample and deployed on our server. We experiment with four server-side video delivery implementations, tested with the latest versions of Firefox, Chrome, Safari, and Internet Explorer (see Figure 5.22):

Case 1) The video file is served as a static file on the server. This is the simplest way of delivering video files.

Case 2) The video file is embedded as an HTML “<object>” element (http://www.w3schools.com/tags/tag_object.asp), with its “type” attribute set to Video/QuickTime. This is implemented exactly the same way as in the ISM site.

Case 3) The video is displayed by the HTML5 “<video>” element. This is the standard way to embed a video in a Web page, but was not feasible before HTML5.

Case 4) The video is displayed by an MPEG-DASH implementation with Dash.js support. This approach needs to process the video and generate the Media Presentation Description file beforehand. Dash.js requires Media Source Extensions support in the browsers.
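(For reference, a mod_ratelimit configuration sketch of the kind used for the bandwidth limit above; the path is illustrative. The rate-limit environment variable is specified in KiB/s, so 1250 KiB/s corresponds to 10.24 Mbit/s.)

    # Sketch: Apache mod_ratelimit configuration (path illustrative).
    # rate-limit is in KiB/s: 1250 KiB/s * 1024 * 8 = 10.24 Mbit/s.
    LoadModule ratelimit_module modules/mod_ratelimit.so

    <Location "/videos">
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 1250
    </Location>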

The results are shown in Table 5.9. The browser names and versions are listed in the leftmost column. We use “Static File”, “Object Element”, “HTML5 Video”, and “MPEG-DASH” to represent the four implementations. The column “Play” shows whether the video is able to be played in that condition, and “Forward” shows whether the video can be
