University of Calgary PRISM: University of Calgary's Digital Repository
Graduate Studies The Vault: Electronic Theses and Dissertations
2015-12-15 Traffic Analysis of Two Scientific Web Sites
Liu, Yang
Liu, Y. (2015). Traffic Analysis of Two Scientific Web Sites (Unpublished master's thesis). University of Calgary, Calgary, AB. doi:10.11575/PRISM/28501 http://hdl.handle.net/11023/2678

University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission. Downloaded from PRISM: https://prism.ucalgary.ca

UNIVERSITY OF CALGARY
Traffic Analysis of Two Scientific Web Sites
by
Yang Liu
A THESIS
SUBMITTED TO THE FACULTY OF GRADUATE STUDIES
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF SCIENCE
GRADUATE PROGRAM IN COMPUTER SCIENCE
CALGARY, ALBERTA
DECEMBER, 2015
© Yang Liu 2015

Abstract
This thesis presents a workload characterization study of two scientific Web sites at the
University of Calgary based on a four-month period of observation (from January 1, 2015 to April 30, 2015). The Aurora site is a scientific site for auroral researchers, providing auroral images collected from remote cameras deployed in northern Canada. The ISM site is a scientific site providing lecture materials to about 400 undergraduate students in the
ASTR 209 course.
Three main observations emerge from our workload characterization study. First, scientific Web sites can generate extremely large volumes of Internet traffic, even when the user community is seemingly small. Second, robot traffic and real-world events can have surprisingly large impacts on network traffic. Third, a large fraction of the observed network traffic is highly redundant, and can be reduced significantly with more efficient networking solutions.
Acknowledgements
I would like to express my sincere appreciation and gratitude to my supervisor, Dr. Carey
Williamson, for his invaluable support and insightful suggestions throughout my graduate research.
His enthusiasm motivated my passion for accomplishing this study. His patience and meticulousness helped me overcome my writing weaknesses and finally finish this thesis.
I would like to acknowledge Michel Laterman, Martin Arlitt, and U of C IT staff for setting up the logging system and capturing the data. Michel provided a great deal of guidance on processing the large log data. A very special thanks goes out to Martin and Michel for their constructive suggestions and support in helping me polish this thesis.
I would like to thank my external committee member, Dr. Eric Donovan, for serving on my committee and for his valuable advice from the perspective of Physics and Astronomy. I would like to thank Emma Spanswick and Darren Chaddock for their technical expertise about the setup and operation of the Aurora site.
I am also indebted to my fellow lab-mates in the Networks Group, Yao, Ruiting, Brad,
Mohsen, Keynan, Vineet, Mahshid, Haoming, Linquan, Xunrui, Sijia, Shunyi, Yuhui, and
Wei, for the fun we have had and the help you offered during my graduate study. I wish you all the best in your studies and careers.
I would like to thank all my friends for supporting me spiritually, and the U of C for offering me this great opportunity to learn and to explore.
Finally, I would like to thank my family for their support through my entire life, especially my parents Baocheng and Ling for respecting my decisions and encouraging me with their best wishes.
Table of Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols
1 INTRODUCTION
1.1 Open Scientific Web Sites
1.2 Background Context
1.3 Motivation
1.4 Objectives
1.5 Contributions
1.6 Thesis Overview
2 BACKGROUND and RELATED WORK
2.1 TCP/IP Model
2.1.1 Physical Layer
2.1.2 Link Layer
2.1.3 Network Layer
2.1.4 Transport Layer
2.1.5 Application Layer
2.2 HTTP and the Web
2.2.1 Persistent Connections
2.2.2 HTTP Messages
2.2.3 HTTP Secure
2.3 Network Traffic Measurement
2.4 Web Robots
2.5 Video Streaming
2.6 Scientific Web Sites
2.7 Summary
3 METHODOLOGY
3.1 Endace DAG Card Deployment
3.2 Bro Logging System
3.3 Data Pretreatment
3.4 Summary
4 AURORA SITE ANALYSIS
4.1 HTTP Analysis
4.1.1 HTTP Requests
4.1.2 Data Volume
4.1.3 IP Analysis
4.1.4 HTTP Methods
4.1.5 HTTP Referer
4.1.6 URL Analysis
4.1.7 File Type
4.1.8 HTTP Response Size Distribution
4.2 Robot Traffic
4.2.1 Prominent Machine-Generated Traffic
4.2.2 AuroraMAX
4.3 Geomagnetic Storm
4.4 Summary
5 ISM SITE ANALYSIS
5.1 HTTP Analysis
5.1.1 HTTP Requests
5.1.2 Data Volume
5.1.3 IP Analysis
5.1.4 URL Analysis
5.1.5 HTTP Methods
5.1.6 HTTP Status Codes
5.1.7 HTTP Response Size Distribution
5.1.8 User Agents
5.2 Video Viewing Pattern and Traffic
5.2.1 Video Requests Traffic
5.2.2 Browser Behaviors for Video Playing
5.3 Course-Related Events
5.4 Summary
6 DISCUSSION
6.1 Comparative Analysis of Two Scientific Web Sites
6.2 Workload Characteristics Revisited
6.2.1 Success Rate
6.2.2 File Types
6.2.3 Mean Transfer Size
6.2.4 Distinct Requests
6.2.5 One Time Referencing
6.2.6 Size Distribution
6.2.7 Concentration of References
6.2.8 Wide Area Usage
6.2.9 Inter-Reference Times
6.3 Network Efficiency Analysis
6.3.1 File Transfer Methods
6.3.2 JavaScript Cache-Busting Solution
6.4 Summary
7 CONCLUSIONS
7.1 Thesis Summary
7.2 Scientific Web Site Characterization
7.2.1 The Aurora Site
7.2.2 The ISM Site
7.3 Conclusions
7.4 Future Work
References
List of Tables
3.1 A Sample of a Subset of the Bro HTTP Log
4.1 Statistical Characteristics of the Aurora Site (Jan 1/15 to Apr 29/15)
4.2 Top 10 Most Frequently Observed IP Addresses for Aurora Site
4.3 Top 10 Most Frequently Requested URLs for Aurora Site
4.4 Top 10 Most Frequently Requested File Types for Aurora Site
4.5 Prominent UCB and Alaska IPs in Aurora Web Site Traffic
5.1 Statistical Characteristics of the ISM Site (Jan 1/15 to Apr 29/15)
5.2 Top 10 Most Frequently Observed IP Addresses for ISM Site
5.3 Top 10 Most Frequently Requested URLs for ISM Site
5.4 HTTP Method Summary for ISM Site
5.5 HTTP Status Code Summary for ISM Site
5.6 Top 10 Most Popular User Agents for the ISM Site
5.7 Top 5 OS Versions
5.8 Top 5 Browser Versions
5.9 Browser Support for the Four Video Playing Implementations
6.1 Statistical Characteristics of Two Scientific Web Sites (Jan 1/15 to Apr 29/15)
6.2 HTTP Method and HTTP Status Code Percentage
6.3 Comparison of Workload Characteristics
6.4 Experimental Results for File Transfer/Synch Methods
List of Figures and Illustrations
2.1 The Five-Layer Protocol Stack and the Seven-Layer OSI Model
2.2 Illustration of HTTP Requests and Responses
3.1 Campus Network Structure with Traffic Monitor System
4.1 HTTP Request Count Per Day for Aurora Site
4.2 HTTP Requests Per Hour (Jan 1-3 and Jan 5, 2015)
4.3 Data Volume (GB) Per Day for Aurora Site
4.4 Number of Unique IP Addresses Daily from 2015-01-01 to 2015-04-29, Aurora Site
4.5 IP Geolocation Distribution, Top 10 Countries Sorted by Unique IPs
4.6 IP Geolocation Distribution, Top 10 Countries Sorted by Request Numbers
4.7 Frequency-Rank Profile for IP Addresses, Aurora Site
4.8 HTTP Methods in Aurora Traffic
4.9 Frequency-Rank Profile for URLs, Aurora Site
4.10 AuroraMAX Images from Yellowknife, 2015/03/10
4.11 HTTP Response Size Values for "/summary plots/slr-rt/yknf/recent 480p.jpg" File, from 2015-03-09 to 2015-03-15
4.12 HTTP Response Size Values for "/summary plots/slr-rt/yknf/recent 480p.jpg" File, on 2015-03-12
4.13 HTTP Response Size Distribution Histogram for "/summary plots/slr-rt/yknf/recent 480p.jpg" (x-axis 0-0.2 MB, 50 bins, y-axis log-scale)
4.14 HTTP Response Size Distribution Cumulative Histogram for "/summary plots/slr-rt/yknf/recent 480p.jpg" (x-axis 0-0.2 MB, 50 bins, y-axis proportion)
4.15 HTTP Response Size Distribution Histogram for "/summary plots/rainbow-rt/yknf/latest.jpg" (x-axis 0-0.08 MB, 50 bins, y-axis log-scale)
4.16 HTTP Response Size Distribution Cumulative Histogram for "/summary plots/rainbow-rt/yknf/latest.jpg" (x-axis 0-0.08 MB, 50 bins, y-axis proportion)
4.17 HTTP Requests and Data Volume Per Day for UCB, UA IPs
4.18 "robots.txt" Request Count Per Day for UCB1
4.19 "robots.txt" Request Count Per Hour on Four Selected Days
4.20 HTTP Request Count Per Day for AuroraMAX
4.21 Data Volume (GB) Per Day for AuroraMAX
4.22 IP Addresses Frequency-Rank Profile for AuroraMAX
5.1 HTTP Request Count Per Day for ISM Site
5.2 HTTP Requests Per Hour (Feb 23, Feb 24, Mar 23, Mar 24, Apr 20, and Apr 21, 2015)
5.3 Data Volume (GB) Per Day for ISM Site
5.4 IP Geolocation Distribution for Countries
5.5 IP Geolocation Distribution for Canada
5.6 IP Geolocation Distribution for USA
5.7 IP Geolocation Distribution for Alberta
5.8 Number of Daily Unique IP Addresses Visiting ISM Site, from Canada and Calgary (2015-01-01 to 2015-04-30)
5.9 Number of Daily Unique IP Addresses Visiting ISM Site, from USA and California (2015-01-01 to 2015-04-30)
5.10 Frequency-Rank Profile for IP Addresses, ISM Site
5.11 Frequency-Rank Profile for URLs, ISM Site
5.12 HTTP Methods in ISM Traffic
5.13 HTTP Status Code in ISM Traffic
5.14 HTTP Response Size Values for "Lec8 - Feb 5, 2015.mov" File, from 2015-02-18 to 2015-02-24
5.15 HTTP Response Size Values for "Lec8 - Feb 5, 2015.mov" File, on 2015-02-24
5.16 HTTP Response Size Distribution Histogram for "Lec8 - Feb 5, 2015.mov" (x-axis 0-5 GB, 50 bins, y-axis log-scale)
5.17 HTTP Response Size Distribution Histogram for "Lec3 - Jan 20, 2015.mov" (x-axis 0-10 GB, 50 bins, y-axis log-scale)
5.18 HTTP Response Size Values (Byte) Per Request Top 5 Count for "Lec8 - Feb 5, 2015.mov"
5.19 HTTP Response Size Values (Byte) Per Request Top 5 Count for "Lec3 - Jan 20, 2015.mov"
5.20 Histograms of Response Size Values (smaller than 1 MB) for "Lec8 - Feb 5, 2015.mov" and "Lec3 - Jan 20, 2015.mov" Files
5.21 User Agent Names Distribution in the ISM Site
5.22 User Agent Browsers Distribution in the ISM Site
5.23 Operating System Distribution in the ISM Site
5.24 HTTP Requests Count Per Day for Video (requests)
5.25 Data Volume (GB) Per Day for Video (requests)
5.26 HTTP Transaction Durations Distribution Histogram (x-axis 0-60K s, 50 bins, y-axis log-scale)
5.27 HTTP Response Size Distribution Histogram (x-axis 0-12 GB, 50 bins, y-axis log-scale)
5.28 HTTP Transaction Duration (≤ 10s) CDF for Video Requests
5.29 HTTP Response Size (≤ 5MB) CDF for Video Requests
5.30 HTTP Requests Count Per Day for ASTR209, ASPH213, and ASPH503
5.31 Data Volume (GB) Per Day for ASTR209, ASPH213, and ASPH503
5.32 HTTP Requests and Data Volume Per Day for the Six Categories
6.1 HTTP Traffic Overview for the Aurora and ISM Sites
6.2 Frequency-Rank Profiles for the Aurora and ISM Sites
6.3 Illustration of File Transfer Methods Experiment
6.4 HTTP Response Size Results for the Two Methods
List of Acronyms
AJAX Asynchronous JavaScript and XML
ARPANET Advanced Research Projects Agency Network
CORS Cross-Origin Resource Sharing
CS Computer Science
CSA Canadian Space Agency
DASH Dynamic Adaptive Streaming over HTTP
DHCP Dynamic Host Configuration Protocol
DNS Domain Name System
DOM Document Object Model
FTP File Transfer Protocol
GIF Graphics Interchange Format
GUI Graphical User Interface
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
HTTPS HyperText Transfer Protocol Secure
IP Internet Protocol
ISM Inter-Stellar Medium
JPEG Joint Photographic Experts Group
LAN Local Area Network
MIME Multipurpose Internet Mail Extensions
MIT Massachusetts Institute of Technology
MPEG Moving Picture Experts Group
OCW OpenCourseWare
OS Operating System
OSI Open Systems Interconnection
PDF Portable Document Format
PPP Point-to-Point Protocol
P2P Peer-to-Peer
QoS Quality of Service
RIP Routing Information Protocol
RSH Remote Shell
RSS Rich Site Summary
RTEMP Real-Time Environmental Monitoring Platform
RTMP Real Time Messaging Protocol
RTSP Real Time Streaming Protocol
SNAP Stanford Network Analysis Project
SSH Secure Shell
SSL Secure Sockets Layer
TCP Transmission Control Protocol
THEMIS Time History of Events and Macroscopic Interactions during Substorms
TLS Transport Layer Security
UA University of Alaska
UCB University of California at Berkeley
UDP User Datagram Protocol
U of C University of Calgary
URI Uniform Resource Identifier
URL Uniform Resource Locator
WWW World Wide Web
Chapter 1
INTRODUCTION
The Internet has a growing influence on many aspects of our daily lives. As network speeds increase and new technologies emerge, the ways in which people use the Internet gradually change. For example, researchers and educators often share their results and teaching materials via the Internet. This, in turn, increases external interactions with scientific Web sites.
This thesis presents measurements of the network traffic of two scientific Web sites at the University of Calgary. In particular, we study the usage patterns, identify inefficiencies in current information exchange methods, and suggest potential improvements.
1.1 Open Scientific Web Sites
With the rapid development of network technology and high-performance personal computers, scientific research and education organizations often share resources over the Internet, typically via the World Wide Web [27]. These scientific materials provide people around the world with opportunities to obtain scientific knowledge conveniently, efficiently, and impartially. In some cases, however, the sharing of large and popular materials can generate a significant volume of network traffic.
Furthermore, an emerging trend among research funding agencies and publicly-funded universities is toward open access publishing and open data repositories. These publicly-accessible data repositories enable not only the sharing of scientific data among researchers worldwide, but also a wide variety of "citizen science" projects and outreach activities.
In addition, many universities currently offer a variety of on-line educational resources to the public, including video-recorded lectures. For example, Stanford University provides large network dataset collections to the public via the Stanford Network Analysis Project (SNAP) [57], giving computer scientists, sociologists, and psychologists opportunities to test their methodologies as well as their conjectures. As another example, the worldwide OpenCourseWare (OCW) [20] site offers a series of free on-line courses recorded by celebrated universities, including the Massachusetts Institute of Technology (MIT) and Yale University.
Open scientific Web sites provide a rich set of resources for "citizen science". A list of papers using the SNAP datasets is posted each year¹. The open datasets in re3data² also enable research in a variety of fields³. For OpenCourseWare, a report from MIT⁴ shows that MIT OCW was visited 2,385,654 times by 1,367,228 unique visitors in April 2015. However, these open scientific Web sites also have an effect on network traffic.
The University of Calgary also provides open scientific resources. For instance, it hosts multiple Web sites that share scientific measurement data from remote sensors for atmospheric and environmental monitoring, as well as free on-line courses offered by university educators. These scientific Web sites have generated voluminous network traffic, which is the basis for our study.
1.2 Background Context
Given the pervasive applications of the World Wide Web (WWW), network resource usage is always a relevant problem. The Internet evolved from the ARPANET (Advanced Research Projects Agency Network) project funded by the US government in the 1970s, and has become globally popular and powerful today. From modest beginnings in local area networks with a few workstations, the Internet has grown into a world-wide network system, with over
¹ https://snap.stanford.edu/papers.html
² Registry of Research Data Repositories, http://www.re3data.org/
³ http://www.re3data.org/about/
⁴ http://ocw.mit.edu/about/site-statistics/monthly-reports/MITOCW_DB_2015_04.pdf
1 billion hosts⁵ and 3 billion users⁶.
On the modern Internet, advances in network speed and high-performance servers provide users with a high-quality Internet experience. However, the tension between network traffic consumption and Quality of Service (QoS) remains an important issue. To economize on network bandwidth, numerous methods have been proposed, such as new protocols and caching architectures. Before design changes are made, however, network traffic measurement is a useful way to gain a clear understanding of network bottlenecks, as a prerequisite for network optimization.
Network traffic measurement is an effective way to understand network activities. By analyzing how Web resources are retrieved, it provides an understanding of the data transfer traffic. This information is useful for identifying issues in network resource allocation, distribution, and bandwidth configuration. To optimize network resource allocation, numerous mathematical models, experimental methods, and auto-adjustment systems have been proposed and tested in academia and industry. Web site workload analysis is a well-known network traffic measurement technique for summarizing the characteristics of Web sites.
The workload pattern is usually determined by many factors, such as the demographics of the users, the type of resources on the site, and the services provided by the site. For example, sites like the Washington Post provide news to people around the world, particularly in the United States⁷. The Asahi Shimbun Web site is a Japanese press outlet that also serves news to the public. However, the workload pattern of the Washington Post is quite different from that of Asahi⁸, based on the statistics from Alexa⁹. Therefore, sufficient background information and general analysis of a site's workload are very important.
The network traffic workload at the University of Calgary (U of C) is mostly contributed
⁵ Global Internet usage, https://en.wikipedia.org/wiki/Global_Internet_usage#Internet_hosts
⁶ World Internet Users and 2015 Population Stats, http://www.internetworldstats.com/stats.htm
⁷ How popular is washingtonpost.com?, http://www.alexa.com/siteinfo/washingtonpost.com
⁸ How popular is asahi.com?, http://www.alexa.com/siteinfo/asahi.com
⁹ http://www.alexa.com/comparison/washingtonpost.com#?sites=asahi.com
by the university students, faculty, and staff. It is unsurprising that most inbound traffic involves popular sites like Google, Facebook, and YouTube. However, the summary results show that some scientific Web sites hosted internally by the university are extremely popular externally, and generated a huge volume of data traffic during the period under observation.
As stated earlier, sharing research and educational materials via the Internet is an effective approach used by many research funding agencies and publicly-funded universities. Research on the workload analysis of scientific Web sites has not received as much attention as that of the most popular sites. Researchers might assume that most scientific sites consume little network bandwidth, since they serve small and specialized groups of users. Nevertheless, our analysis shows that some scientific sites at the university generated surprisingly large volumes of network traffic.
1.3 Motivation
The University of Calgary hosts many research Web sites and integrated education sites.
After assessing all the inbound and outbound network traffic, we found that two scientific Web sites hosted by the U of C generate a lot of traffic. Both of them rank among the top data volume generators during our four-month observation from January 1, 2015 to April 30, 2015. The bandwidth they consumed is on the same scale as that of the most popular sites like Google and Facebook, though well behind streaming video sites YouTube and Netflix. One of the sites is the Auroral Imaging Group (Aurora) site¹⁰, and the other is the Star Formation and Molecular Astrophysics (ISM) site¹¹. Both sites are hosted by the Department of Physics and Astronomy at the University of Calgary.
The Aurora site studies the Aurora Borealis (Northern Lights), a natural phenomenon caused by cosmic rays, solar wind, and magnetospheric plasma interacting with the upper atmosphere [13]. These auroral phenomena are primarily seen in high-latitude
¹⁰ Auroral Imaging Group, http://aurora.phys.ucalgary.ca/
¹¹ Star Formation & Molecular Astrophysics at the U of C, http://ism.ucalgary.ca/
regions like northern Canada and the Arctic (and Antarctic) regions. Since aurora are mainly observed at night in remote areas, researchers have deployed digital cameras across northern Canada as a ground-based observatory to automatically record auroral phenomena, with the data transferred to U of C servers via network connections. The Aurora site is a scientific Web site providing aurora data collected from these specially-designed cameras. We find that the traffic generated by the Aurora site is surprisingly large. Every day, about 1.5 million HTTP requests are sent to the Aurora server, retrieving about 90 GB of data. This unusual discovery motivates us to analyze the workload characteristics of the Aurora site.
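To put these daily totals in perspective, a quick back-of-the-envelope calculation (using the approximate round figures above: 1.5 million requests and 90 GB per day) gives the average request rate and mean transfer size:

```python
# Rough averages for the Aurora site, computed from the approximate
# daily totals quoted above (round figures, not exact counts).
requests_per_day = 1_500_000
bytes_per_day = 90 * 10**9          # 90 GB, using decimal (SI) gigabytes
seconds_per_day = 24 * 60 * 60

req_per_second = requests_per_day / seconds_per_day
mean_transfer_bytes = bytes_per_day / requests_per_day

print(f"{req_per_second:.1f} requests/s on average")        # ~17.4 requests/s
print(f"{mean_transfer_bytes / 1000:.0f} KB per request")   # 60 KB per request
```

Even averaged over a full day, this is a sustained load of roughly 17 requests per second, with a mean transfer size of about 60 KB per request.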
The ISM site is another interesting scientific site, which studies the Inter-Stellar Medium (i.e., the gas and dust between the stars) in astrophysics. The site is created and maintained by a U of C professor. Apart from a brief introduction to the Inter-Stellar Medium and some corresponding research, the ISM site mainly provides study materials for three courses taught by the professor, including one Astronomy course (ASTR 209) and two Astrophysics courses (ASPH 213, ASPH 503). Among these courses, ASTR 209 includes a series of recorded lecture videos. Similar to the Aurora site, the ISM site also generated voluminous traffic during our four-month observation, despite its relatively small user community (400 U of C students registered in the course in winter 2015). Around 70 GB of data are retrieved from the ISM server per day. By analyzing the workload characteristics of the ISM site, we intend to understand the bandwidth usage, and how the Web resources are being used.
The purpose of conducting these network traffic measurements is to improve network usage. The Aurora site and the ISM site are both constructed and maintained by technical staff with minimal computer science (CS) or networking background. As such, they may not deploy the sites effectively from a CS perspective when sharing information over the Internet. Considering the traffic volume generated by these scientific Web sites, we are motivated to measure their network traffic, identify performance issues (if any), and propose potential remedies for the problems.
A second motivation for this thesis is a better understanding of modern scientific Web site traffic, compared to previously known workload patterns. As indicated earlier, network technologies have improved dramatically over time. Therefore, the workload characteristics may also have changed, along with user behaviors.
1.4 Objectives
The objectives of this thesis are as follows:
1) Measure network traffic at the University of Calgary to determine the characteristics of modern scientific Web sites.
2) Compare workloads of modern scientific Web sites with those of previously studied
Web sites to identify similarities and differences.
3) Identify inefficiencies (if any) based on the traffic measurement results, and suggest improvements.
1.5 Contributions
This thesis has four primary contributions, listed as follows:
1) We collect and measure the network traffic of two distinct scientific Web sites at the
University of Calgary, namely the Aurora site and the ISM site.
2) We identify the dominance of automated robot traffic in the Aurora site measurement, and compare its characteristics to the human-generated traffic.
3) We discover several inefficiencies in the data transfer methods of the Aurora site. We suggest potential improvements, and provide experimental results to evaluate their effectiveness.
4) We compare the Web usage characteristics of modern scientific Web sites with those from the prior literature.
Although the scope of this thesis focuses on network traffic measurement of two campus scientific Web sites, we expect this analysis and our suggestions for improvement to raise awareness of Web robot traffic and network inefficiency issues. We also believe our results will provide a foundation for future explorations of scientific data sharing systems.
1.6 Thesis Overview
This thesis is organized as follows:
1) Chapter 2 presents basic network knowledge including TCP/IP and HTTP protocols, and introduces related work on network traffic measurement and workload characterization.
It also discusses prior studies on Web crawling, video streaming, and scientific Web sites.
2) Chapter 3 introduces the data collection methodology and the Bro logging system.
3) Chapter 4 analyzes the network traffic of the Aurora site, discusses its workload characteristics, and identifies the robot traffic and inefficiency issues.
4) Chapter 5 analyzes the network traffic of the ISM site, discusses its workload characteristics, and studies the video streaming traffic and user viewing patterns.
5) Chapter 6 compares the workload characteristics of scientific Web sites and discusses potential solutions for the inefficiency issues.
6) Chapter 7 summarizes the results, presents conclusions, and suggests future work.
Chapter 2
BACKGROUND and RELATED WORK
In this chapter, we introduce fundamental background knowledge regarding the technologies underlying our research. An overview of this chapter is as follows:
1) Section 2.1 and Section 2.2 provide background information on computer networks, including the classical five-layer network architecture, TCP/IP protocols, and HTTP in the application layer.
2) Section 2.3 reviews the literature on network traffic measurement research.
3) Section 2.4 briefly introduces Web robots.
4) Section 2.5 discusses current video streaming techniques on the Internet.
5) Section 2.6 presents literature about scientific Web sites, focusing on network traffic analysis and workload characterization.
2.1 TCP/IP Model
The modern Internet had its early origins in regional academic networks. It evolved from the ARPANET project and has changed greatly over the decades. One thing remains the same, however: the Internet consists of infrastructure and protocols.
Within the overall Internet system, the network hardware and software implementing the protocols are organized in layers, called a protocol stack. Figure 2.1 shows the five-layer protocol stack and the seven-layer OSI (Open Systems Interconnection) model [55]. Except for the Presentation and Session layers, which are specific to the OSI model, both stacks have Application, Transport, Network, Link, and Physical layers. Differences exist between these two models in their services and protocols. Since most concepts are similar, we introduce the five-layer protocol stack in this chapter; more information about the
OSI model is available elsewhere [17,55].

Figure 2.1: The Five-Layer Protocol Stack and the Seven-Layer OSI Model ((a) the five-layer protocol stack; (b) the seven-layer OSI model)
The five-layer protocol stack is also referred to as the TCP/IP protocol suite, for two of its celebrated protocols: Transmission Control Protocol (TCP) and Internet Protocol (IP).
We use the common term “TCP/IP model” to refer to the five-layer protocol stack in this thesis.
Each protocol belongs to only one layer. For example, TCP belongs to the Transport layer, IP is in the Network layer, and the HyperText Transfer Protocol (HTTP) is in the Application layer. Each layer provides services to the adjacent layer above, by utilizing services within that layer or from the adjacent layer below. Protocol layers are implemented in software, in hardware, or in a combination of the two. We take a bottom-up approach to introducing the layers.
2.1.1 Physical Layer
The Physical layer is at the bottom of the TCP/IP model. It provides the means of transmitting raw bits to the connected destination node, and determines the parameters of the communication channel [55]. It also provides the interface to protocols used by hardware transmission media. When a network connection is established, the way bits are moved is determined by the actual transmission medium and the corresponding protocols.
2.1.2 Link Layer
The Link layer is designed to move link-layer frames between two different nodes along the route [55]. It provides this service to the Network layer for routing a datagram via a series of routers, by invoking the bit-moving service provided by the Physical layer. The Link layer services depend heavily on the link-layer protocols available on the link. Protocols such as Ethernet, WiFi, and the Point-to-Point Protocol (PPP) belong to the Link layer. When a datagram from the Network layer traverses the links, it is passed down to the Link layer, and then transferred to the destination. During this process, different link-layer protocols may be used on different links. Finally, the datagram is passed up to the Network layer at the destination node.
2.1.3 Network Layer
The Network layer moves data packets (datagrams) between different hosts [55]. It provides services to the Transport layer, and uses the services provided by the Link layer. Whenever a Transport layer segment and a destination address are passed to the Network layer, the
Internet Protocol (IP) is invoked to send the segment to the specified destination.
The IP protocols are the primary protocols in the Network layer, and Internet Protocol Version 4 (IPv4) is the dominant one [16]. The main function of IPv4 is to route datagrams from a source host to a destination host, based on a 32-bit address (IP address). IPv4 only provides best-effort delivery, without guarantees that the datagrams are delivered. Each host in the network has a unique IP address. In this thesis, we perform a detailed analysis of the network traffic based on the IP information extracted from this layer.
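As a concrete illustration of the 32-bit address structure (the specific address below is arbitrary, chosen only for the example), a dotted-quad IPv4 address is simply a rendering of one 32-bit unsigned integer, which Python's standard library makes easy to verify:

```python
import ipaddress

# Parse a dotted-quad IPv4 address; the address itself is illustrative.
addr = ipaddress.IPv4Address("136.159.1.1")

# The dotted quad encodes one 32-bit unsigned integer:
# 136*2**24 + 159*2**16 + 1*2**8 + 1
as_int = int(addr)
print(as_int)                          # 2292121857

# The mapping is reversible, and the integer always fits in 32 bits.
assert 0 <= as_int < 2**32
print(ipaddress.IPv4Address(as_int))   # 136.159.1.1
```

This integer view is what routers operate on, and it is also convenient for the kind of per-IP aggregation used later in the thesis.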
2.1.4 Transport Layer
The Transport layer is responsible for moving transport-layer packets (segments) from one end host to another. It achieves the data transfer by establishing a logical data channel between the two end hosts.
The two primary protocols in the Transport layer are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) [55]. TCP provides connection-oriented service with reliability. For example, TCP guarantees reliable, ordered, and error-checked delivery when transferring application-layer messages [17]. It also applies a sliding window flow control protocol to match speeds between sender and receiver, and congestion control mechanisms to avoid congestive collapse (i.e., extremely poor network performance). UDP provides connectionless service with no reliability, except for (weak) checksums for data integrity. TCP is widely used by many popular Internet applications, such as HTTP, the File Transfer Protocol (FTP), and Secure Shell (SSH). UDP is utilized by Internet applications that care more about responsiveness than reliability, such as the Domain Name System (DNS), the Routing Information Protocol (RIP), and some audio and video streaming applications.
Our work studies HTTP traffic, which commonly uses TCP, since HTTP presumes a reliable transport-layer protocol [8] (note that unreliable UDP can also be used for HTTP). Therefore, the traffic we study in this thesis is almost always generated by TCP connections.
2.1.5 Application Layer
The Application layer is the topmost layer in the TCP/IP model. It contains many important protocols, such as HTTP and FTP [55]. These protocols are used to exchange application-layer messages between hosts. The Application layer protocols utilize the logical data transfer channels established by the underlying transport-layer protocols to deliver messages. Our traffic analysis focuses on logs of HTTP activity in the Application layer.
2.2 HTTP and the Web
As introduced above, the TCP/IP model provides the underlying mechanisms to support
Internet applications. Upon this network foundation, the World Wide Web (WWW) [33]
[Figure 2.2 diagram: a Web client sends an HTTP request to a proxy server, which forwards it to a Web server; the HTTP response travels back along the same path.]
Figure 2.2: Illustration of HTTP Requests and Responses
emerged as a means to exchange data content easily on the Internet. The Web has had a
transformative role in enriching the interactions on the Internet. Its popularity has helped
foster the merging of separate data networks, leading to the formation of the global data
network that we know today.
The HyperText Transfer Protocol (HTTP) in the Application layer is the foundation of
the WWW. It has two primary versions: HTTP/1.0 [6] and HTTP/1.1 [8]. HTTP/1.1 is a
revision of HTTP/1.0 that adds persistent connections, among other features.
Currently, there is design and implementation work being done on a new version, HTTP/2.
HTTP involves two programs: a client-side program, such as a Web browser, and a server-side program, such as a Web server. HTTP defines how messages are exchanged
between the client and server. The client retrieves the objects (e.g., HTML files, image files,
video files, and JavaScript files) in a Web page from the server side, through a recognizable
Uniform Resource Locator (URL). For example, http://www.abc.com/def/ghi.pdf is a valid URL for fetching the “ghi.pdf” PDF file from the “/def/” directory in the Web server host “www.abc.com”, using HTTP.
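The URL structure described above can be decomposed programmatically. The sketch below uses Python's standard `urllib.parse` on the example URL from the text (`www.abc.com` is a placeholder host, as in the original):

```python
from urllib.parse import urlparse

url = "http://www.abc.com/def/ghi.pdf"
parts = urlparse(url)
print(parts.scheme)   # "http": the protocol the client should use
print(parts.netloc)   # "www.abc.com": the Web server host
print(parts.path)     # "/def/ghi.pdf": directory and object on the server
```

The same decomposition is what a browser performs before opening a TCP connection to the host named in the URL.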
HTTP invokes TCP to establish connections within the Transport layer. The client and server exchange data by accessing the socket interface of the TCP connection. The reliability of the TCP connection guarantees that messages are exchanged successfully between the client and the server.
In some deployments, an HTTP proxy server is used as an intermediary between the
client and the origin Web server. A proxy server can reduce the response time for client requests, as well as save bandwidth between the client and server, by temporarily storing and serving recently requested objects in a Web cache. In this role, a proxy acts as the Web server for the client when it has a copy of the requested object. Conversely, it acts as a client to request an object from the origin Web server when the object is not stored locally.
Figure 2.2 shows a generic illustration of HTTP requests and responses involving a proxy server.
2.2.1 Persistent Connections
Persistent HTTP connections and non-persistent HTTP connections are two different ways for clients to interact with Web servers. The main difference is whether the HTTP connection reuses the existing TCP connection [55]. The non-persistent HTTP approach establishes a new TCP connection for each request-response transaction. However, persistent HTTP connections allow multiple messages to be exchanged via the same TCP channel, in series.
The primary advantage of persistent connections is the reduction of the request latency.
Since the total time of an HTTP request-response transaction consists of the TCP connection initialization time, the request delivery time, and the response delivery time, persistent connections save the time otherwise spent re-establishing TCP connections (the three-way handshake).
HTTP/1.1 makes persistent connections the default behavior for all HTTP connections, while HTTP/1.0 uses non-persistent connections. Technically, HTTP/1.0 can support persistent connections by adding “Connection: Keep-Alive” to the message header, and this is compatible with HTTP/1.1 servers [7]. However, there are several restrictions in practice. For example, clients cannot establish Keep-Alive connections with HTTP/1.0 proxy servers. HTTP/1.1 clients and servers can be configured to use non-persistent connections for each request-response transaction, if resource usage is a concern. There are many other configuration options, such as adjusting the maximum session time and the maximum number of concurrent persistent connections.
HTTP/1.1 also supports the HTTP pipelining technique, which allows multiple HTTP requests to be sent over a single TCP connection before receiving the corresponding responses.
HTTP/1.0 doesn’t support this feature.
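The latency benefit of persistent connections can be illustrated with a back-of-envelope model. The sketch below is illustrative arithmetic only; the round-trip and transfer times are made-up parameters, not measurements from our data:

```python
def fetch_time(num_objects, rtt_ms, transfer_ms, persistent):
    """Back-of-envelope latency model for fetching objects over HTTP.
    Non-persistent: every object pays a TCP handshake (~1 RTT), one
    request/response round trip, and its transfer time.
    Persistent: the handshake is paid once; each object then needs only
    its request/response round trip and transfer time."""
    if persistent:
        return rtt_ms + num_objects * (rtt_ms + transfer_ms)
    return num_objects * (2 * rtt_ms + transfer_ms)

# Hypothetical page with 10 objects, 100 ms RTT, 20 ms transfer each:
print(fetch_time(10, 100, 20, persistent=False))  # 2200 (ms)
print(fetch_time(10, 100, 20, persistent=True))   # 1300 (ms)
```

The model ignores slow start and pipelining, but it captures why HTTP/1.1 made persistence the default: the handshake cost is amortized across all objects on a page.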
2.2.2 HTTP Messages
HTTP messages are the data sent by Web clients and servers over HTTP connections. Each transaction consists of a request message and a response message, both of which have a defined format.
We use an example in Listing 2.1 to introduce the HTTP messages.
Listing 2.1: HTTP Request and Response Message Example

Request message:

    GET / HTTP/1.1
    User-Agent: curl/7.37.1
    Host: www.ucalgary.ca
    Accept: */*

Response message:

    HTTP/1.1 200 OK
    Date: Thu, 23 Jul 2015 22:27:04 GMT
    Server: Apache/2.2.15 (Red Hat)
    Last-Modified: Thu, 23 Jul 2015 19:06:16 GMT
    ETag: "496582-a7c7-51b8f94f5dd9a"
    Accept-Ranges: bytes
    Content-Length: 42951
    Connection: close
    Content-Type: text/html; charset=UTF-8

    (entity body data is attached here)
HTTP Request Message
The first line in the HTTP request message is called the request line. It contains the HTTP method field, the URL field, and the HTTP version field. In this example, the client would like to use the GET method to fetch the root directory via HTTP/1.1 (the server may or may not abide by these; e.g., the server could respond with HTTP/1.0). The vast majority of
HTTP requests use the GET method.
The following lines are request header lines. They inform the server of basic background information about the client, as well as some parameters of the request. In the example, the client tells the server that the HTTP message was generated with the user agent “curl/7.37.1” [15], which is a command-line tool for transferring files over the Internet. Usually, the user agent field contains the name and version information of the browser and operating system that the client is using.
The Host field indicates where the requested object is located. Although the TCP connection between the specific server and client is already established before HTTP messages are transferred, the host field is still necessary since: (1) there may be a proxy server acting as an intermediary between the client and the server; and (2) the Web server may host numerous different Web sites.
In the request message, the client can specify detailed parameters when requesting the object. For example, the “Accept-Language” field indicates the client’s preferred languages, the “Accept-Encoding” field indicates the acceptable encodings, the “Connection” field indicates whether the client intends to keep the connection alive, and the “If-Modified-Since” field means the client only wants a copy if the file has changed after a certain point in time (the server may still choose to send the full response). The request message need not include all the header fields; any subset of the fields is acceptable.
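As an illustration of the request format, the short Python sketch below parses the request message from Listing 2.1 into its request line and header fields. It is a simplified parser for this one example, not a general HTTP implementation:

```python
# Raw request message from Listing 2.1; HTTP lines end with CRLF, and a
# blank line separates the headers from the (empty) entity body.
raw = (
    "GET / HTTP/1.1\r\n"
    "User-Agent: curl/7.37.1\r\n"
    "Host: www.ucalgary.ca\r\n"
    "Accept: */*\r\n"
    "\r\n"
)
head, _, body = raw.partition("\r\n\r\n")
request_line, *header_lines = head.split("\r\n")
method, target, version = request_line.split(" ")
headers = dict(line.split(": ", 1) for line in header_lines)
print(method, target, version)   # GET / HTTP/1.1
print(headers["Host"])           # www.ucalgary.ca
```

Our log analysis relies on exactly these fields (method, URL, and headers such as User-Agent and Host) as recorded by the monitoring system.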
HTTP Response Message
Similar to the request message, an HTTP response message starts with a status line, includes header lines in the middle, and may end with an entity body. The status line contains the
HTTP version the server is using for this transaction, the HTTP response status code, and its corresponding status message. In the example, the server tells the client that it is using
HTTP/1.1 for this transaction and the requested object is successfully returned.
In the response header lines, the “Date” field indicates when the response is generated and sent, the “Server” field indicates the server name and version, the “Last-Modified” and
“ETag” fields are auxiliary information that the client (or proxy) can use in the future to
check whether the requested object has been modified (if not, the client has no need to
download the object again), “Accept-Ranges” indicates whether the server accepts range
(partial transfer) requests for an object, “Content-Length” and “Content-Type” indicate
the length and MIME (Multipurpose Internet Mail Extensions) type of a requested object,
and “Connection” informs the client whether the server will keep this TCP connection active
or not. Like the request message, usually the response message only contains a subset of the
fields. A list of the available HTTP request and response headers is shown in [19].
The entity body is the content of the requested object returned by the server. If the
client is a Web browser and the requested object is a part of the Web page (e.g., HTML,
image files, CSS files, etc.), then the browser can directly render the object onto the Web
page via its layout engine and present it to the user. Since the way of rendering the page
varies for different browsers, the actual Web page may not look the same when viewed on
different browsers.
HTTP Request Methods
The method in the HTTP request defines the action to be performed on the object identified
by the request URL [11]. In HTTP/1.0, there are only three methods: GET, POST, and
HEAD. Five new methods were added in HTTP/1.1: OPTIONS, PUT, DELETE, TRACE, and CONNECT. The client can use any of these methods to interact with the server, and the server can be configured to support or refuse any of them.
The GET method is the most prevalent on the Internet. It is used to retrieve the content of the object identified by the request URL. A conditional GET is a GET request with conditional statements in the request message header lines. The conditional statements include the If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, and If-Range header
fields [11]. Once the conditional GET request reaches the server, the server can choose whether to return the updated object, depending on the conditions. The advantage of using a conditional GET is reducing the network usage of superfluous data transfers. A partial GET is a GET request with a Range header field in the request message. It requests only part of the entire object, which may reduce network usage by avoiding re-transferring data that is already at the client. The partial GET is very useful when dealing with large objects, such as videos.
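A server's handling of a conditional GET can be sketched as follows. This is a simplified illustration, not real server code: actual servers parse HTTP-dates and follow the precedence rules of the HTTP specification, whereas here timestamps are plain integers and `conditional_get_status` is a hypothetical helper:

```python
def conditional_get_status(request_headers, etag, last_modified):
    """Hypothetical sketch of a server deciding between 200 and 304 for a
    conditional GET. etag is the server's current entity tag; last_modified
    and If-Modified-Since are simplified to comparable integers."""
    if request_headers.get("If-None-Match") == etag:
        return 304  # client's cached copy matches the current ETag
    ims = request_headers.get("If-Modified-Since")
    if ims is not None and last_modified <= ims:
        return 304  # object unchanged since the client's timestamp
    return 200      # send the (possibly updated) object

print(conditional_get_status({"If-None-Match": '"abc"'}, '"abc"', 50))  # 304
print(conditional_get_status({"If-Modified-Since": 40}, '"abc"', 50))   # 200
```

A 304 response carries no entity body, which is precisely how conditional GETs reduce redundant data transfer.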
The HEAD method is identical to the GET method, except that the response does not contain an entity body. It is often used to retrieve meta-information about the requested object, such as how large or how old it is. Based on the meta-information, the client may choose whether to retrieve the object with GET. As with the conditional GET, appropriate use of the HEAD method may significantly reduce network usage.
The rest of the HTTP methods are rarely seen in our network traffic measurements. Therefore, we briefly cover only selected methods here.
The POST method is designed for sending an enclosed entity to a specific resource on the server, which is required to handle the request. The Web server determines the actions to be performed on the transferred entity.
The PUT method allows the client to enclose an additional entity in the request, which updates the corresponding resource on the server if it exists, or creates it if needed.
The DELETE method requests that the specified object on the server be deleted (this is done only if the user has the appropriate permission).
The OPTIONS method asks the server to return the HTTP methods available for an object.
HTTP Response Status Codes
The HTTP response status code is a 3-digit integer that summarizes the server’s action on the request. A short textual description follows the status code in the response status line.
We introduce several selected status codes as follows:
200 OK is the standard response status code for successful HTTP requests. It is the most common status code on most Web sites.
206 Partial Content indicates that the server successfully returned the partial content
for a request with a Range header field.
304 Not Modified indicates that the object on the server has not been modified, based on the conditional GET request sent by the client. Therefore, 304 responses do not contain entity bodies.
403 Forbidden informs the client that the request was understood by the server, but the server refuses to fulfill it.
404 Not Found indicates that the requested object was not found on the server.
503 Service Unavailable indicates that the server is currently unavailable (e.g., due to temporary overload or maintenance).
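Python's standard library encodes these code/phrase pairs in `http.HTTPStatus`, which offers a quick way to look up the textual description for a status code when processing logs:

```python
from http import HTTPStatus

# Print the standard reason phrase for each status code discussed above.
for code in (200, 206, 304, 403, 404, 503):
    print(code, HTTPStatus(code).phrase)
```

Running this prints, for example, "206 Partial Content" and "304 Not Modified", matching the descriptions given in the text.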
Referer
The HTTP referer is a request header field. It informs the server where the request
originated. For example, consider a client viewing a Web page on server A that has a hyper-
link to an image on server B. When the client sends a request to server B for retrieving the
image object, the referer field in the request tells server B that this request was referred
by server A.
The referer field provides servers with information about where visitors come from, which is often useful in network traffic analysis. However, due to security and privacy concerns, clients sometimes obfuscate the referer field in their HTTP requests. Furthermore, if the user types in a URL or visits a site from a browser bookmark, the referer
field will be blank. (The header name “referer” was originally a misspelling of “referrer”; see https://en.wikipedia.org/wiki/HTTP_referer.)
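A log-analysis script might classify request origins from this field along the following lines (`referer_origin` is a hypothetical helper, and `serverA.example` is a made-up host):

```python
from urllib.parse import urlparse

def referer_origin(headers):
    """Report where a request came from. An absent (or blank) Referer
    suggests a typed URL, a bookmark, or a client that obfuscates it."""
    ref = headers.get("Referer", "")
    return urlparse(ref).netloc if ref else "direct"

print(referer_origin({"Referer": "http://serverA.example/page.html"}))  # serverA.example
print(referer_origin({}))                                               # direct
```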
2.2.3 HTTP Secure
HTTP Secure (HTTPS) is a protocol designed for encrypted HTTP data transfers over the Internet. In HTTP, clients and servers communicate “in the clear” directly over TCP connections. In HTTPS, connections are built on top of cryptographic protocols such as Transport Layer Security (TLS) and Secure Sockets Layer (SSL), which work at the session and presentation layers in the seven-layer OSI model, or at the application layer in the TCP/IP model [25]. The cryptographic protocols encrypt messages before sending and decrypt them after receiving, thus improving the security and privacy of the transferred messages.
The sites selected in this thesis are all HTTP sites, and we only measure their HTTP traffic.
2.3 Network Traffic Measurement
Network traffic measurement provides a way to understand and manage the usage of the
Internet today. There is a wealth of literature on network traffic measurement and Web workload characterization, dating from the early 1990s to the present.
In 1992, Vern Paxson used measurements to evaluate analytic models of TCP connections [64]. It was one of the earliest network traffic characterization studies during the growth of the Internet. Based on traces collected from seven different sites, Paxson analyzed the network characteristics of TELNET, NNTP, SMTP, and FTP connections, and compared several analytic models as well as empirical models. He found that the analytic models are generally as good as the empirical models, and that connection characteristics differ between sites, and across periods at the same site.
In 1997, Thompson et al. [73] observed the traffic volume, flow volume, flow duration, and packet sizes in a wide area network. They found interesting results; for example, the measured traffic has diurnal trends and decreases on weekends, and TCP dominates IP traffic.
In 1995, Crovella et al. [40] discovered self-similarity in World Wide Web traffic. They also found that Web transmission times and silent times (inactive times of a client) follow heavy-tailed distributions. They attribute these to the heavy-tailed distribution of Web file sizes, as well as to the influence of network users.
Sedayao et al. [70] analyzed WWW traffic patterns. Their work covers WWW traffic characteristics from a fundamental perspective, discusses inefficiency issues, and proposes solutions. They noted that the most popular file type in WWW traffic at the time was the Graphics Interchange Format (GIF), followed by Moving Picture Experts Group (MPEG) files (in terms of references and bytes).
Cunha et al. [41] characterized Web client activity. They collected data by deploying modified Web browsers in terminal rooms on the Boston University campus. During their four-month observation, they found power-law distributions in document size distribution and document popularity distribution. They showed that this information is useful for designing caching strategies.
In 1996, Arlitt and Williamson [30, 31] analyzed the Web traffic of six different servers.
They identified ten common characteristics in the Web server workloads. For example, they found that successful requests are the most common, that 90% of the transferred documents are HTML and image files, and that file sizes and transfer sizes both follow heavy-tailed distributions. Based on these results, they also proposed effective suggestions for improving caching systems. We will revisit their work in Section 6 by comparing the workload characteristics of modern scientific Web sites to their results.
In 2000, Mahanti and Williamson [61] analyzed the workloads of three different Web proxy servers. They found results similar to [31] (e.g., HTML and image files account for 95% of all requests), as well as distinct results, such as Web document popularity not strictly following Zipf’s Law (it does in [31]). This work confirmed Arlitt’s findings to some extent, and indicates the peculiarities of workload characteristics across different Web sites and periods.
In 1999, Breslau et al. [35] were among the first to focus on the Zipf-like distribution for file popularities. They studied the Zipf-like distribution in Web page requests, and the reference probability of the documents. They also presented a simple model for understanding cache performance.
As many new trends have emerged on the Internet, more recent network traffic studies have provided insights into general network measurement as well as individual sites.
In 2010, Callahan et al. studied Web workload characteristics from a longitudinal view [37]. Their work includes HTTP transaction characterization, user behavior, and server distributions. They found that most HTTP transactions are GET requests and that Zipf-like distributions are present in requests-per-object statistics, and they identified effects from browser caches and content distribution networks.
In 2011, Ihm et al. conducted measurements of modern Web traffic over five years of observations at a content distribution network vendor [53]. Their work covers high-level characteristics, such as overall connection speed and maximum concurrent connections, as well as page-level analysis with their new page detection algorithm. They found increasing use of Flash video, AJAX (Asynchronous JavaScript and XML), and client-side interactions after pages initially load.
There are numerous papers about the traffic of Web 2.0 sites, in which users can interact and collaborate with each other instead of merely viewing the content provided by the Web site. Butkiewicz et al. [36] studied the complexity of today’s Web pages. Schneider et al. [68] presented a study of AJAX traffic by analyzing popular Web 2.0 sites, such as Google Maps, and social network Web sites. Lin et al. [59] also studied the on-line map application traffic on Web 2.0 sites.
Web 2.0 has evolved to encompass a large group of sites, including video Web sites like YouTube, NetFlix, and Vimeo, and on-line social networks like Facebook, Flickr, and Twitter. Cha et al. [38, 39] studied the traffic of several user-generated content video Web sites.
Gill et al. [48], Zink et al. [77], and Ameigeiras et al. [29] all studied YouTube. Several papers [32, 50, 51, 62, 69] studied on-line social networks from many perspectives, including network usage, user behaviors, user content generation patterns, and user relationship connections.
Other research has involved network traffic measurement of e-commerce Web sites [66,75],
Web robot activities [45,76], Peer-to-Peer (P2P), and mobile networks.
2.4 Web Robots
A Web robot is a software program that automatically launches a series of HTTP transactions [49]. The main application of a Web robot is to crawl Web sites and extract useful information from them, by moving from site to site and analyzing the browsed data. Therefore, this kind of Web robot is also called a “Web crawler”. For example, Googlebot (https://support.google.com/webmasters/answer/182072?hl=en) is a Web crawling robot operated by the Google search engine. The main function of Googlebot is discovering new and updated Web pages for the Google index.
Due to the similarities of Web page structures, Web robots can quickly fetch information from the Internet by performing repetitive and redundant tasks. Users can utilize software like Wget (https://www.gnu.org/software/wget/) to retrieve filtered content from the Web, or create more flexible scripts with programming libraries like the urllib module in Python (https://docs.python.org/2/library/urllib.html).
Technically, all Web robots share the same core approach. A robot starts with a list of URLs known as root pages. The robot (as a client) generates HTTP requests to get the content of those pages from Web servers. It then extracts the links and useful information from the page content, and identifies candidate URLs to crawl in the next step. The robot repeats this procedure until a termination condition is satisfied. For example, a robot may terminate when all the links in a specific page have been visited, or when it reaches the maximum link depth from the root page.
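The core loop described above can be sketched as a breadth-first traversal. This is an illustrative skeleton only: `get_links` stands in for fetching a page and extracting its hyperlinks, which a real robot would do with HTTP GETs and an HTML parser:

```python
from collections import deque

def crawl(root, get_links, max_depth):
    """Sketch of a Web robot's core loop: breadth-first traversal from a
    root page, terminating at max_depth links away from the root."""
    visited = set()
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        for link in get_links(url):   # "fetch page, extract links"
            queue.append((link, depth + 1))
    return visited

# Toy link graph: /a links to /b and /c; /b links to /d.
toy_site = {"/a": ["/b", "/c"], "/b": ["/d"], "/c": [], "/d": []}
print(sorted(crawl("/a", toy_site.get, max_depth=1)))  # ['/a', '/b', '/c']
```

The `visited` set prevents re-crawling pages that link to each other, and the depth bound implements the termination condition mentioned in the text.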
Web robots may not be welcome at some Web servers. The Robots Exclusion Standard [22] is widely used by Web servers to communicate with robots. A Web server can provide a file named “robots.txt” in its root directory, indicating which parts of the site are not allowed to be crawled. If a Web robot follows the standard, it first generates a GET request to fetch the “robots.txt” file, and then modifies its further operations according to the rules. Most robot software provides configuration options that let users decide whether to follow “robots.txt” or not. For example, Wget can disable these courteous operations when “-e robots=off” is added to the command. Furthermore, some robots can mislead servers by faking the HTTP request header content, especially the user agent field.
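A well-behaved robot written in Python could honor these rules with the standard `urllib.robotparser` module. The snippet below feeds the parser an inline, made-up robots.txt rather than fetching one over the network:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() accepts the robots.txt content as a list of lines, so no
# network access is needed for this illustration.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("MyBot", "http://example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "http://example.com/public/index.html"))  # True
```

In normal use, the robot would call `rp.set_url(".../robots.txt")` and `rp.read()` first, then consult `can_fetch` before every request.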
There are numerous papers about the network traffic of Web robots. Before undertaking a full analysis of Web traffic, a preliminary analysis to detect and identify robot traffic is recommended. Several papers [34, 44, 71] studied Web robot detection techniques, including statistical analysis as well as data mining approaches. Two other papers [54, 72] analyzed thousands or millions of different Web sites, and provided surveys of the usage of the robots exclusion standard, namely “robots.txt”. They found that 46.02% of newspaper Web sites and 45.93% of USA university Web sites adopted the robots exclusion standard.
Two studies [42,43] discussed Web robot behaviors based on the analysis of Web server access logs.
2.5 Video Streaming
As mentioned in Chapter 1, one of the sites we study provides video lectures. We present a brief introduction about Internet video streaming techniques in this section.
Video streaming has changed a lot over the years, with the upgrading of network speeds and the growing demands from network users. Today, YouTube, the largest video sharing
Web site, allows users to view user-generated videos smoothly at qualities ranging from
240p (426 × 240, progressive scan) to 1080p (1,920 × 1,080, progressive scan). NetFlix and
Hulu provide hundreds of thousands of movies or TV shows to users with high quality video streams. Many sports or e-sports sites can even provide live broadcasting video streams to users over the Internet. This technology enriches our daily lives.
Typically, there are three different ways to watch a video on-line. Progressive download transfers the video file from a server to a client via HTTP connections [3]. When a user plays a video embedded in a Web page, the browser starts to download a copy of the video file from the Web server, and the user can only access the video content that has already been downloaded. There is usually a progress bar indicating how much video content has been loaded or played. Users can rewind or fast-forward within any loaded parts of the video; however, they cannot view the parts not yet downloaded by the browser. Although progressive download can be easily implemented, the drawbacks of this technique are obvious:
1) Users have to wait until the video is loaded (from the beginning to where users want to watch) even if they are interested in only a small part of the whole video.
2) The technique wastes bandwidth if a user downloads a large part of the video and then exits without watching the whole video.
3) If the video bit rate is high, and the capacity between the server and the client is lower than the bit rate of the video, then the user occasionally has to stop and wait for the buffer to fill. The user has no means to change the quality of the video.
Traditional streaming [4] is another method for video streaming, which utilizes special streaming servers to deliver videos. Special protocols like Real Time Streaming Protocol
(RTSP) and Real Time Messaging Protocol (RTMP) are adopted to divide the original video
file into small chunks, and then transfer those chunks via UDP or TCP connections. In this approach, the video quality is still immutable, and this streaming service cannot be implemented on normal Web servers.
Adaptive streaming [5] is designed to solve the video quality issues above. By deploying several video files encoded at different qualities for the same video content, adaptive streaming can serve video in different qualities to users according to their network conditions. The video files are also chunked into small fragments for ease of delivery. There are four primary implementations of adaptive streaming: Dynamic Adaptive Streaming over HTTP (MPEG-DASH), Adobe Dynamic Streaming for Flash, Apple HTTP Adaptive Streaming, and Microsoft Smooth Streaming [12]. MPEG-DASH is the only international standard widely supported by most HTTP servers. The advantage of adaptive streaming is that it enables users to watch videos smoothly, instead of waiting for high-quality videos to buffer or tolerating low-quality videos. The disadvantage is the computation and storage resources used to compress the videos at various bit rates and resolutions beforehand. Since user experience matters more than this additional cost, adaptive streaming is widely adopted by most of today’s video sites, like YouTube, NetFlix, and Vimeo.
2.6 Scientific Web Sites
There is not much literature about network traffic measurement and workload characterization of scientific Web sites.
Eldin et al. [46] studied the top 500 popular pages on Wikimedia, a Web site promoting free educational content. They analyzed time-series request counts and discovered the collateral load phenomenon, in which the links embedded in a popular page (e.g., Michael Jackson’s page when he died) generate more traffic than the page itself. They also suggested that simple prediction algorithms are able to predict workload on Wikimedia.
Urdaneta et al. [74] studied a sample of the network traffic of Wikipedia. Their research includes analysis of user requests, read and save operations, flash crowds, and non-existent page requests. They also suggested a decentralized and collaborative hosting setup for Wikipedia to improve network performance.
Morais et al. [63] studied the user behavior in a citizen science project, though their focus was on the user interaction aspects, rather than Internet traffic workloads. Li et al. [58] studied the workload characterization of CiteSeer, which is a digital library for computer science literature.
The closest example in the prior literature is Faber et al. [47]. They studied the Web traffic of four different data sets, comparing the workload characteristics found in [30] with those in their data sets. However, their analyses are somewhat limited, due to missing HTTP header fields in their logs and a relatively short observation period of 1-2 months. Furthermore, their data was collected over a decade ago, and scientific Web workloads may have changed since then.
2.7 Summary
This chapter introduced basic background knowledge on computer networks. The application protocol HTTP and its underlying TCP/IP protocols were discussed. Then, we presented a series of related studies involving network traffic measurement, Web robots, video streaming techniques, and scientific Web site analysis.
This thesis studies the workload characteristics of two scientific Web sites, by analyzing the HTTP transaction logs, especially the HTTP header fields mentioned in this chapter.
The study focuses on the user behaviors and network usage of the two scientific sites.
We present our methodology in the next chapter.
Chapter 3
METHODOLOGY
In this chapter, we explain how the network traffic in our study was monitored. Specifically, we introduce the overall logging system, including the deployment of the hardware infrastructure and the log generation framework. We then describe the pre-processing methods applied to the logs for analysis purposes. The majority of the infrastructure deployment and data collection work was done by Michel Laterman and Martin Arlitt, with support from University of Calgary Information Technologies (UCIT) staff.
3.1 Endace DAG Card Deployment
The incoming and outgoing network traffic passes through the edge routers of the University of Calgary network. By mirroring these traffic flows, it is feasible to observe all of the packet-level traffic between the campus and the Internet from a monitor server.
[Figure 3.1 diagram: campus desktops, laptops, and Web servers connect through routers and the campus backbone switch to an edge router facing the Internet; mirrored traffic flows to a monitor server, which sends logs to a storage server.]
Figure 3.1: Campus Network Structure with Traffic Monitor System
Figure 3.1 shows the structure of the campus network with our traffic monitor included.
All incoming and outgoing traffic is mirrored to the monitor, and then summaries of the actual traffic are transferred to a storage server every night.
The monitor is a Dell server equipped with two Intel Xeon E5-2690 CPUs (32 logical cores @ 2.9 GHz), 64 GB RAM, and 5.5 TB of hard disk storage, running the CentOS 6.6 x64 operating system. Since the hard disk is not large enough to store summary logs of the network traffic for a long period of time (around 50 GB of compressed log files are generated every day), the summary logs are transferred to a storage server early every morning (during off-peak times).
The monitor uses an Endace DAG 8.1SX card for traffic capture and filtering. The Endace
DAG card is designed for 10 Gbps Ethernet, and uses a series of programmable hardware-based functions to improve packet processing performance. A full list of the Endace DAG
8.1SX’s specifications is available elsewhere [2]. Typical overall usage of the U of C network during the collection period was 2 Gbps of inbound TCP/IP traffic, and 1 Gbps of outbound traffic.
The primary function of the Endace DAG data capture card is to split the incoming stream for the Bro logging system. The stream from the edge router is split into two streams, providing 24 sub-streams to the Bro system.
3.2 Bro Logging System
The Bro network security monitor [65] is an open-source network analysis framework. It provides a generalized platform for network performance measurement and security monitoring.
The Bro logging system is able to monitor all network activities from a high-level viewpoint and provides detailed transaction information. Specifically, Bro produces logs including all transport-layer connections appearing on the network backbone, and many application-layer transcripts, such as HTTP transaction headers, DNS requests and replies, SSL certificates, etc.
Bro is configured to process the traffic streams generated by the Endace DAG card in the monitor. With the incoming stream split by the Endace DAG card, Bro’s event engine
Table 3.1: A Sample of a Subset of the Bro HTTP Log

Fields:  ts    id.orig_h  id.resp_h  method  host    uri     referer  user_agent
Types:   time  IP addr    IP addr    string  string  string  string   string
1:       ts1   a.b.c.d    e.f.g.h    GET     uc.ca   /1.jpg  abc.ca   Mozilla/5.0
2:       ts2   i.j.k.l    m.n.o.p    GET     uc.ca   /2.png  def.com  Chrome/35.0
first transforms the sub-streams into higher-level events, which describe network activities
in objective terms. For example, the traffic stream captured by the DAG card is determined
to be an HTTP request by Bro and converted into a Bro event containing the request
information, such as HTTP version and IP addresses. Then Bro uses its script interpreter
to convert the event into logs, and notify the Bro user of abnormal activities (e.g., malicious
attacks) if corresponding policies are in place. This study focuses on Web traffic analysis,
rather than on detecting and preventing intrusions from the external network.
Once the logging system is activated, Bro collects and generates logs hourly. The two
scientific sites studied in our work are both HTTP servers. Therefore, we concentrate on
the HTTP traffic measurements. The HTTP transaction logs contain detailed information
about the requests and responses, including request start and end times, response start and
end times, host name, request method, referer, user agent, response status code, etc.
Table 3.1 shows a sample (including only a subset of the fields) of the HTTP log generated by Bro. The “types” are specific data formats defined by the Bro system, which can also be used in Bro scripts. Note that there are 37 fields in our original HTTP logs; we present selected fields with fabricated data for simplicity. Our analyses primarily rely on the following fields:
• ts (time) is the request start time-stamp, in Unix epoch time format.
• id.orig_h (addr) is the originating (request) IP address, in 32-bit (four-byte) format.
• id.resp_h (addr) is the responding IP address.
• method (string) is the HTTP method in the request.
• host (string) is the name in the Host request header.
• uri (string) is the requested resource name on that specific host.
• referer (string) is the value in the Referer request header.
• user_agent (string) indicates the user agent used by the client.
• request_body_len (count) is the size of the request.
• response_body_len (count) is the size of the response.
• status_code (count) is the response status code.
• status_msg (string) is the response status message.
• resp_mime_types (vector[string]) indicates the MIME type of the response.
• req_start, req_end, res_start, and res_end (time) are the request/response start/end time-stamps.
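These fields can be recovered programmatically from Bro's ASCII logs, which are tab-separated and carry a "#fields" metadata line naming the columns. The following Python sketch illustrates the idea; the gzip handling is our assumption about how the hourly files are stored, and the sketch is an illustration rather than the exact scripts used in this thesis.

```python
import gzip
from typing import Dict, Iterator, List

def parse_bro_http_log(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per HTTP transaction from a Bro ASCII log.

    Bro's ASCII logs are tab-separated; the '#fields' header line
    names the columns, and the other '#' lines hold metadata.
    """
    opener = gzip.open if path.endswith(".gz") else open
    fields: List[str] = []
    with opener(path, "rt") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#fields"):
                fields = line.split("\t")[1:]  # column names follow the tag
            elif line and not line.startswith("#"):
                yield dict(zip(fields, line.split("\t")))
```

Each yielded dictionary maps a field name (e.g., "uri", "status_code") to its string value for one transaction.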
As introduced above, the monitor periodically transfers the log files to the storage server.
We study the Bro logs collected from January 1, 2015 to April 30, 2015 in this thesis.
However, there were several disruptions in the logs during our observation period, primarily due to events such as power failure, network disconnection, and Bro system crashes. The outage periods were:
• January 30, 2015: 11:00 - 12:00, 1 hr
• February 15, 2015: 18:00 - 19:00, 1 hr
• April 10, 2015: 10:00 - 24:00, 14 hr
• April 11, 2015: 0:00 - 23:00, 23 hr
• April 30, 2015: all day, 24 hr
For the analyses in the following chapters, there are some graphs on which these five outages are visible. Nevertheless, these outages were relatively rare during the observation period. With about four months of data collected by the logging system, we have a good representation of the campus network usage, and are able to make informed observations about the traffic characteristics.
3.3 Data Pretreatment
The Bro system generates about 50 GB of compressed log files every day, including HTTP transaction logs, FTP transaction logs, TCP/UDP connection logs, DNS logs, etc. The HTTP logs are separated into hourly files of about 1 GB each. Analyzing a full day of HTTP transaction logs, around 20 GB in total, is slow.
After several trials, we refined our analysis approach to use a mix of awk1 and Python2 scripts. awk is a free text-processing tool, standard on Linux, that can extract particular columns of information from a file. Since it is both powerful and efficient, we used it to output selected records from the HTTP logs. Python is a well-known programming language; we chose it because of its many free libraries, such as the plotting module “Matplotlib”.
Furthermore, Python is convenient for handling strings when working with the output of our awk scripts.
While using awk to extract records is reasonably efficient, it is still quite slow to analyze the data of one or several months. Therefore, we extract and store the HTTP records for the analyzed sites in temporary files, to speed up the data processing. With this pretreatment step in place, we can normally obtain the subsequent analysis results in a matter of hours.
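The pretreatment step above can be sketched as follows. This is an illustration, not our exact scripts: the host column index and the site name aurora.example.ca are hypothetical placeholders.

```python
import gzip

def extract_site_records(log_path: str, out_path: str,
                         target_host: str, host_col: int = 4) -> int:
    """Append the records whose host column matches target_host to a
    smaller per-site file, and return how many records were kept.

    host_col is an assumed zero-based column index for the Host field.
    """
    kept = 0
    opener = gzip.open if log_path.endswith(".gz") else open
    with opener(log_path, "rt") as src, open(out_path, "a") as dst:
        for line in src:
            if line.startswith("#"):
                continue                      # skip Bro metadata lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) > host_col and cols[host_col] == target_host:
                dst.write(line)
                kept += 1
    return kept
```

An equivalent awk one-liner (again with an assumed column number and host name) would be awk -F'\t' '$5 == "aurora.example.ca"' http.log; awk prints matching lines by default.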
We analyzed the data on servers (4 Intel Xeon X5450 3.00 GHz CPUs, 32 GB RAM) in the Department of Computer Science.
1. The GNU Awk User’s Guide, http://www.gnu.org/software/gawk/manual/gawk.html
2. Python (programming language), https://www.python.org/
3.4 Summary
This chapter introduced the methodology of this thesis, including the hardware deployment, the Bro logging framework, and the pretreatments applied to the data.
In the following two chapters, we analyze the HTTP traffic of two scientific Web sites, namely the Aurora site and the ISM site. The traffic of both sites was monitored and collected by the Bro logging system during our four-month observation from January 1, 2015 to April
30, 2015.
Chapter 4
AURORA SITE ANALYSIS
In this chapter, we analyze the Aurora site workload. To begin, we analyze the HTTP characteristics including number of requests, data volume, HTTP method, HTTP referer,
IP activity, IP geolocation information, and URL popularity. In the process, we identify the existence of automatic crawling scripts (robots) responsible for a large part of the traffic.
Next, we provide more detailed analysis for some individual IP and referer sites based on popularity information. Finally, we identify the file transfer inefficiencies for the traffic generated by the robots. Additional active measurement experiments and results about file transfer inefficiency are available in Chapter 6.
4.1 HTTP Analysis
4.1.1 HTTP Requests
Figure 4.1 shows the daily count of HTTP requests over the four-month period under study.
There are approximately 1.5 million requests per day, and 182 million in total (see Table 4.1).
The Aurora site had fairly steady request traffic throughout the observation period (except for brief monitor outages on April 11 and April 30), but with a noticeable surge reaching
6 million requests per day in mid-March 2015, due to geo-magnetic storm activity affecting the aurora (see Section 4.3).
Figure 4.2 shows the hourly counts of HTTP requests on four selected days of our trace
(i.e., January 1-3 and January 5). We choose these four days since they are in the first week
Table 4.1: Statistical Characteristics of the Aurora Site (Jan 1/15 to Apr 29/15)
Site    Total Reqs   Avg Reqs/day  Total GB  Avg GB/day  Uniq URLs  Uniq IPs
Aurora  182,068,131  1,529,984     10,354    87.01       2,894,294  240,236
Figure 4.1: HTTP Request Count Per Day for Aurora Site

of our trace, and they have clear hourly workload patterns. The consistent structure of the traffic, with over 40 thousand requests per hour, suggests that automated robots are generating most of the traffic. This is particularly likely given that January 1 is a statutory holiday (New Year’s Day). Further analysis in Section 4.2 shows that about 50% of the request traffic is attributable to University of California at Berkeley robots crawling the site.
In fact, Figure 4.2 indicates that the robot is crawling the site multiple times per day, with a four-hour period in early-January. The pattern started to change slightly on Monday,
January 5, 2015.
4.1.2 Data Volume
Figure 4.3 shows that the typical daily data volume for the Aurora site is about 90 GB/day, except for mid-March, when the traffic quadrupled. The Aurora site server provides a variety of data files to the public, including videos, images, and zip files. Since video and image files are considerably larger than the rest, the large jump in data volume is attributable to a surge in popularity for these large image and video files.
Figure 4.2: HTTP Requests Per Hour (Jan 1-3 and Jan 5, 2015)
4.1.3 IP Analysis
There were 240,236 distinct IP addresses that visited the Aurora Web site during our trace.
Figure 4.4 shows the number of distinct IP addresses viewing the Aurora site per day. The daily count of unique IPs is about 4,000, except for mid-March, when the IP count grows eightfold. It is interesting that the surges in HTTP requests and unique IPs differ in magnitude, with the former only quadrupling from its usual level.
We performed IP geolocation for all the IP addresses, using the IP location services from
IPAddressLabs1 and MaxMind2. IP addresses come from 192 distinct countries in total.
Figure 4.5 shows the IP geolocation distribution of the top 10 countries based on number of IP addresses. Most of the IPs (39.50%) are from Canada, with the United States second at 15.67%. Figure 4.6 shows the IP geolocation distribution of the top 10 countries sorted by request count. Most of the requests (73.22%) come from the United States, with Canada
1. IP-GeoLoc IP Address Geolocation Online Service, http://www.ipaddresslabs.com/
2. GeoIP, MaxMind, https://www.maxmind.com/
Figure 4.3: Data Volume (GB) Per Day for Aurora Site
Table 4.2: Top 10 Most Frequently Observed IP Addresses for Aurora Site
IP               Reqs        Pct     Organization              Location
128.32.18.45     89,977,861  49.19%  University of California  Berkeley, USA
137.229.18.201   22,951,449  12.55%  University of Alaska      Fairbanks, USA
137.229.18.252   3,403,550   1.86%   University of Alaska      Fairbanks, USA
50.65.108.252    2,394,630   1.31%   Shaw Communications       Edmonton, Canada
128.32.18.192    1,919,161   1.05%   University of California  Berkeley, USA
162.157.255.241  1,027,080   0.56%   TELUS Communications      Calgary, Canada
162.157.31.100   817,197     0.45%   TELUS Communications      Edmonton, Canada
211.133.151.210  795,318     0.43%   JIN Office Service        Japan
110.92.52.141    670,887     0.37%   Good Communications       Kagoshima, Japan
99.66.177.107    564,430     0.31%   AT&T U-verse              Dallas, USA

second at 17.47%. Furthermore, for IP-city distribution, Berkeley California accounted for
50.28% of the requests generated by 43 IPs, while Fairbanks, Alaska was second at 14.78% of the requests with 223 IPs. Since the THEMIS project (a larger collaborative project including the Aurora group) [10] is based in North America, these results are not surprising.
There are, however, many other countries accessing the images (e.g., Japan 2.09%, UK
6.30%).
Table 4.2 shows the geolocation information for the top 10 most frequently observed
IP addresses, ranked by number of HTTP requests. Three observations are evident from
Figure 4.4: Number of Unique IP Addresses Daily from 2015-01-01 to 2015-04-29, Aurora Site

these results. First, most of the Top 10 are members of the THEMIS project, as expected
(e.g., University of California at Berkeley, University of Alaska). Second, some of these organizations have multiple IPs in the Top 10, indicating either multiple auroral researchers, or the use of DHCP (Dynamic Host Configuration Protocol), or the use of automated robots.
Third, the topmost IP address, which is from UCB, generates about half of the requests. Its total request count actually exceeds the sum of all the other IP addresses, both on a daily basis and overall.
Zipf’s law [28] is observed in many types of data, and is widely used in Internet traffic analysis (e.g., [35, 67]). By sorting the IPs according to the number of requests from each IP, we obtain the rank and frequency (number of requests) for each IP. We then plot the (rank, frequency) pairs on a two-dimensional coordinate system with a log scale on both axes. Data manifesting Zipf’s law results in a straight line on such a log-log plot. Figure 4.7 shows the frequency-rank profile for the IP addresses observed at the Aurora site. There is visual evidence of power-law structure.
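The frequency-rank computation, and a least-squares estimate of the power-law exponent in log-log space, can be sketched in Python as follows. This is a simplified illustration, not the exact analysis scripts used for Figure 4.7.

```python
from collections import Counter
import math

def frequency_rank(ips):
    """Return (rank, frequency) pairs: request counts per IP, sorted descending."""
    counts = sorted(Counter(ips).values(), reverse=True)
    return list(enumerate(counts, start=1))

def zipf_exponent(pairs):
    """Least-squares slope of log(frequency) vs log(rank).

    Under Zipf's law, frequency is proportional to rank^(-alpha),
    so the fitted slope is -alpha; we return alpha.
    """
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(f) for _, f in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return -(num / den)
```

For perfectly Zipfian data (frequency halving as rank doubles), zipf_exponent returns 1.0.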
[Pie chart legend] Canada - 39.50% - 94942; United States - 15.67% - 37669; United Kingdom - 6.30% - 15139; Germany - 4.57% - 10985; France - 2.38% - 5730; Finland - 2.35% - 5647; Australia - 2.27% - 5463; Japan - 2.09% - 5014; Sweden - 1.85% - 4442; Russian Federation - 1.85% - 4438; others - 21.17% - 50892

Figure 4.5: IP Geolocation Distribution, Top 10 Countries Sorted by Unique IPs
4.1.4 HTTP Methods
Figure 4.8 shows the HTTP methods seen over the trace duration. For the Aurora site,
88.4% of the HTTP requests use the GET method, while 11.6% are HEAD requests. Among
the HEAD requests, over 99.7% are generated by Wget. Other HTTP methods are negligible with fewer than 100 requests over the four-month period.
The number of HEAD requests is fairly consistent over time, suggesting that they are generated by robots. In contrast, the GET requests consist of two parts, namely human activity and robot activity. The surge of GET requests in mid-March suggests human activity outpacing the robot traffic at that point.
4.1.5 HTTP Referer
The HTTP referer field (when present) is another source of useful information about the traffic. This field in the HTTP request header indicates the Web page from which the Aurora site was visited.
We analyze the top 100 referers in terms of requests and data volume. The top referer
[Pie chart legend] United States - 73.22% - 133941542; Canada - 17.47% - 31952429; Japan - 2.51% - 4587971; Germany - 1.30% - 2378523; United Kingdom - 0.96% - 1755945; France - 0.74% - 1354597; Russian Federation - 0.42% - 759914; Australia - 0.36% - 662787; Estonia - 0.24% - 441392; Hong Kong - 0.23% - 413734; others - 2.56% - 4678838

Figure 4.6: IP Geolocation Distribution, Top 10 Countries Sorted by Request Numbers

for both is the Canadian Space Agency (CSA) AuroraMAX portal3, which appeared in
45,763,205 (25%) requests, and triggered a data transfer volume of 4,423 GB (43%) in total.
Most of the referrals come from pages showcasing images or videos from the Aurora Web site.
For example, 25 of the top 100 referrers come from the CSA site, and 9 from virmalised.ee4, which is an Estonian Web site broadcasting live auroral imagery from cameras around the world. These live feed pages generate large volumes of network traffic. Interestingly, many of the referring Web pages use JavaScript to automatically refresh the images shown on the page every few seconds, which contributes to the machine-generated5 traffic.
4.1.6 URL Analysis
Table 4.3 shows the Top 10 most frequently requested URLs for the Aurora site. Most of these URLs are images or videos labeled with recent or latest in the “/summary_plots/” directory. These images are updated automatically by the ground-based cameras every few seconds during the night, while the videos are generated and posted on Real-Time
3. Canadian Space Agency, AuroraMAX, http://www.asc-csa.gc.ca/eng/astronomy/auroramax/
4. http://virmalised.ee/en/
5. Note that the browser will refresh the image automatically, whether there is a human viewing the images or not.
Figure 4.7: Frequency-Rank Profile for IP Addresses, Aurora Site
Figure 4.8: HTTP Methods in Aurora Traffic
Table 4.3: Top 10 Most Frequently Requested URLs for Aurora Site

URL                                          Reqs        Pct    GB     Pct
/summary_plots/slr-rt/yknf/recent_480p.jpg   32,360,809  17.8%  4,116  39.8%
/summary_plots/rainbow-rt/yknf/latest.jpg    25,269,475  13.9%  1,349  13.0%
/summary_plots/slr-rt/yknf/recent_1080p.jpg  3,344,105   1.8%   970    9.4%
/summary_plots/slr-rt/yknf/recent_SD.jpg     3,147,139   1.7%   170    1.6%
/summary_plots/slr-rt/yknf/recent_720p.jpg   2,781,414   1.5%   678    6.5%
/summary_plots/rainbow-rt/sask/latest.jpg    2,177,948   1.2%   26     0.3%
/summary_plots/rainbow-rt/fsmi/latest.jpg    2,067,294   1.1%   19     0.2%
/summary_plots/rainbow-rt/rabb/latest.jpg    2,060,695   1.1%   14     0.1%
/summary_plots/rainbow-rt/gill/latest.jpg    1,958,148   1.1%   17     0.2%
/summary_plots/rainbow-rt/fsim/latest.jpg    1,796,832   1.0%   22     0.2%
Environmental Monitoring Platform (RTEMP)6 the next day.
From Table 4.3, we see that the topmost URL “480p” accounts for 18% of the requests and
40% of the data volume. There are a few static HTML files in the Top 100, which contribute very little data volume. Note that the number of unique URLs is actually much larger than 2 million (see Table 4.1), reaching 75,847,177. When some Web sites (e.g., CSA AuroraMAX) fetch data (mostly images and videos) from the Aurora site for live broadcasting, they append a timestamp to the URL as a query string to obtain fresh content (since the URL is used as the key to cache files). For example, the request URL “/abc/latest.jpg” is modified to
“/abc/latest.jpg?1426417182905” by the JavaScript code. This “cache busting” technique causes the excessive number of unique URLs.
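Counting distinct resources rather than distinct request strings requires normalizing away the cache-busting query string; a minimal sketch (using the illustrative “/abc/latest.jpg” example above):

```python
from urllib.parse import urlsplit

def canonical_url(uri: str) -> str:
    """Strip the query string so cache-busting URLs collapse to one resource."""
    return urlsplit(uri).path

# Both "/abc/latest.jpg?1426417182905" and "/abc/latest.jpg?1426417189041"
# canonicalize to "/abc/latest.jpg" and count as a single unique resource.
```

Applying this normalization before counting unique URLs collapses the tens of millions of timestamped request strings back down to the underlying resources.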
Figure 4.9 shows a frequency-rank analysis applied to the URLs requested on the Aurora site. It has several distinct plateaus in the frequency-rank profile. We attribute this to machine-generated request traffic, which we explore in more detail in Section 4.2.
4.1.7 File Type
There are 39 different file types in our trace for the Aurora site. Table 4.4 shows the top
10 file types ranked by HTTP request count. JPEG (Joint Photographic Experts Group)
6. Real-Time Environment Monitoring Platform, http://rtemp.ca/
Figure 4.9: Frequency-Rank Profile for URLs, Aurora Site

images account for most requests and data volume. It is unsurprising that the dominant traffic contributors are videos and images. Static HTML files and JavaScript files are popular in terms of requests, but have minimal contribution to data volume.
4.1.8 HTTP Response Size Distribution
From the URL and file type analysis, we know that image files in the Aurora server are extremely popular. Therefore, we select the top two popular URLs to analyze the HTTP response size distribution. The size values we obtain are mainly affected by two factors:
1) Since those images are updated by the ground-based cameras, the size of the images varies with the content. Note that the images posted to the Web pages are compressed to JPG format; the original pictures would be larger.
Figure 4.10 shows a series of images taken by the Yellowknife, NWT camera on March 10.
Although the changes are not noticeable over small time spans, the image changes regularly and thus the size of the image changes. Even when the camera is not operating during the daytime, a countdown message is updated in the images.
2) Instead of directly extracting file size values from the Aurora server, we trace the
Table 4.4: Top 10 Most Frequently Requested File Types for Aurora Site

File Type                Reqs        Pct     Rank  Volume (GB)  Pct     Rank
Image/JPEG               80,158,252  52.23%  1     6,570        72.50%  1
Text/HTML                56,475,028  36.80%  2     122          1.35%   5
Application/X-Gzip       5,765,942   3.76%   3     1,312        14.48%  2
Text/Plain               2,686,677   1.75%   4     5            0.06%   14
Image/PNG                975,366     0.64%   5     68           0.75%   6
Application/JavaScript   671,627     0.44%   6     1            0.01%   15
Video/MPEG               173,094     0.11%   7     163          1.81%   4
Video/MP4                151,291     0.10%   8     707          7.81%   3
Image/GIF                18,509      0.01%   9     13           0.15%   9
Image/X-Portable-Anymap  10,501      0.01%   10    35           0.39%   7
Figure 4.10: AuroraMAX Images from Yellowknife, 2015/03/10 (panels at 01:04 am, 01:23 am, 03:03 am, and 06:27 am)
Figure 4.11: HTTP Response Size Values for “/summary_plots/slr-rt/yknf/recent_480p.jpg” File, from 2015-03-09 to 2015-03-15

Figure 4.12: HTTP Response Size Values for “/summary_plots/slr-rt/yknf/recent_480p.jpg” File, on 2015-03-12
Figure 4.13: HTTP Response Size Distribution Histogram for “/summary_plots/slr-rt/yknf/recent_480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis log-scale)
Figure 4.14: HTTP Response Size Distribution Cumulative Histogram for “/summary_plots/slr-rt/yknf/recent_480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis proportion)
Figure 4.15: HTTP Response Size Distribution Histogram for “/summary_plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis log-scale)
Figure 4.16: HTTP Response Size Distribution Cumulative Histogram for “/summary_plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis proportion)
HTTP responses. The size of the HTTP responses depends on the HTTP requests (e.g., conditional GET, HEAD, partial GET), and on many other factors (e.g., network interruptions). Consequently, even requests for the same file on the Aurora server may yield different response size values.
The purpose of analyzing the response size distributions is to understand the actual bandwidth costs when those popular files are requested. Figure 4.11 shows a time-series of the
HTTP response size for all the HTTP requests for “/summary_plots/slr-rt/yknf/recent_480p.jpg” from March 9 to March 15. Figure 4.12 shows the HTTP response sizes for all the HTTP requests for “/summary_plots/slr-rt/yknf/recent_480p.jpg” on March 12. The flat shapes in both figures are the responses generated during the camera’s idle hours, when the size of the image rarely changes.
The overall response size distributions of the top two popular URLs (see Table 4.3) are presented in Figure 4.13 and Figure 4.15. Note that the y-axis is drawn in log-scale, and there are 50 bins in total.
The histograms of the top two URLs are visually similar. We attribute this to two reasons:
1) Although these two images are taken by two different lenses, the two lenses are deployed in the same location (Yellowknife) recording the same auroral phenomena. Since the content of the two images is highly correlated, so is the file size.
2) The live feed images are usually displayed together within a single Web page. Therefore whenever the Aurora server responds to the clients, the two images will be viewed together.
In addition, the images are updated synchronously on the Web page, which leads to the similar response size distributions (from the previous reason, we know that the two images are strongly correlated). Furthermore, the user groups who view the two images are almost the same. Thus the requests sent from their Web browsers for the two images behave similarly.
In addition, there are many small responses (smaller than the actual size of the
images) in both histograms. We attribute this to two reasons:
1) For “recent_480p.jpg” over the four-month period of observation, there are 4,271 “206
Partial Content”, 365,283 “304 Not Modified”, and 971 “404 Not Found” responses. For
“latest.jpg”, there are 3,372 “206 Partial Content” and 466,460 “304 Not Modified” responses
in four months. These response size values should be smaller than the actual size of the
images.
2) Since the JavaScript “cache busting” implementation forces the browser to re-fetch
images every few seconds, the browser may discard a pending request and the incomplete
response data if a new request is generated. This phenomenon is frequently observed when
the network connection speed is slow.
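To isolate the full-body transfers when building such histograms, responses with status codes like 206 and 304 can be filtered out. A simplified sketch follows; the bin width and the field handling are illustrative assumptions, not the exact procedure used for Figures 4.13-4.16.

```python
def size_histogram(records, uri, bin_width=4096):
    """Build a response-size histogram for one URL from parsed log records.

    records: iterable of dicts with 'uri', 'status_code', and
    'response_body_len' keys (string values, as in the Bro logs).
    Only '200 OK' responses are counted, dropping the small
    206/304/404 responses; cache-busting query strings are ignored.
    Returns a {bin_start_bytes: count} mapping.
    """
    hist = {}
    for r in records:
        if r["uri"].split("?")[0] != uri:
            continue                        # different resource
        if r["status_code"] != "200":
            continue                        # partial/not-modified/error
        size = int(r["response_body_len"])
        b = (size // bin_width) * bin_width
        hist[b] = hist.get(b, 0) + 1
    return hist
```

Comparing the histogram with and without the status-code filter shows how much of the small-response mass comes from conditional and partial transfers.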
The CDFs for the response size distributions are shown in Figure 4.14 and Figure 4.16.
4.2 Robot Traffic
In this section, we study the workloads of the top IPs from University of California at
Berkeley (UCB) and University of Alaska (UA). In addition, we analyze the traffic introduced
by AuroraMAX, the top referrer site mentioned in Section 4.1.5.
4.2.1 Prominent Machine-Generated Traffic
Since we don’t have a priori knowledge about which IP addresses are robots, we rely on two
heuristics to identify them:
1) We classify an IP address as a robot if it requests the file robots.txt. There were 613 such IP addresses in our dataset.
2) We classify an IP address as a robot if it generates many HTTP requests in a relatively short time, or has a deterministic structure in its request patterns.
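These two heuristics can be sketched as a simple classifier. The request-rate threshold below is an illustrative assumption (the thesis text does not specify one), and the deterministic-structure part of the second heuristic is omitted here.

```python
from collections import defaultdict

MAX_REQS_PER_HOUR = 1000   # assumed threshold for "many requests"

def classify_robots(records):
    """records: iterable of (ip, hour_bucket, uri) tuples.

    Returns the set of IPs flagged as robots by either heuristic:
    (1) the IP requested /robots.txt, or
    (2) the IP exceeded the request-rate threshold in some hour.
    """
    robots = set()
    per_hour = defaultdict(int)
    for ip, hour, uri in records:
        if uri.split("?")[0] == "/robots.txt":
            robots.add(ip)                 # heuristic 1
        per_hour[(ip, hour)] += 1
    for (ip, _), n in per_hour.items():
        if n > MAX_REQS_PER_HOUR:
            robots.add(ip)                 # heuristic 2 (rate only)
    return robots
```
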
The top few IPs from UCB and UA in Table 4.2 are definitely robots, based on this loose definition. UCB and UA are two leading participants in the THEMIS project mentioned
Table 4.5: Prominent UCB and Alaska IPs in Aurora Web Site Traffic

Name  IP              Total Reqs  Reqs/day  Total GB  Avg GB/day
UCB1  128.32.18.45    89,977,861  756,116   211       1.78
UCB2  128.32.18.192   1,919,161   16,127    789       6.64
UA1   137.229.18.201  22,951,449  192,869   1,680     14.12
UA2   137.229.18.252  3,403,550   28,601    573       4.82

earlier. We use the terms “UCB1” and “UCB2” to refer to the two most prominent IPs from
UCB. Similarly, we use “UA1” and “UA2” to refer to the two most prominent IPs from UA.
Furthermore, we identify the traffic from the referrer site AuroraMAX from the Canadian
Space Agency as robot traffic, since:
1) The AuroraMAX page makes the viewer’s browser re-fetch images and videos from the Aurora site repeatedly.
2) It generates a huge volume of traffic for the Aurora site (see Section 4.1.5).
HTTP Requests and Data Volume
Figure 4.17 shows the HTTP requests and data volume information for each of the four UCB and UA IP addresses in Table 4.5.
1) UCB1 generates 89,977,861 requests in total, and 756,116 requests per day on average
(see Table 4.5). This is about half of the total Aurora request traffic. However, the daily data volume that UCB1 generated is comparatively small. Upon further analysis, we found that all the requests generated by UCB1 have the user agent Wget/1.11.4 Red Hat modified. Wget is free software for retrieving Web site content using the HTTP, HTTPS, and FTP protocols. With its Wget scripts, UCB1 generates many HTTP requests without generating much data volume, since it only uses the GET method for fetching HTML pages and updated data
files, and checks the time-stamps of data files with the HEAD method.
2) UCB2 was active primarily from mid-January to early-February. It only generated
1,919,161 requests in total. Nevertheless, it contributed to a large proportion of the data volume in late-January (see the surge in Figure 4.17(d)). Different from UCB1, it uses
Figure 4.17: HTTP Requests and Data Volume Per Day for UCB, UA IPs (panels (a)-(h): HTTP requests and data volume per day for UCB1, UCB2, UA1, and UA2, each plotted against the remaining traffic)
another version of Wget, namely Wget/1.12 (linux-gnu), as the user agent.
3) UA1 generated approximately 0.2 million requests and 14 GB of data volume per day.
It has more influence on the data volume compared to the other three IPs.
4) UA2 was active in March, when the geo-magnetic storm happened (see Section 4.3).
Workload Pattern
For these four IP addresses, it is interesting to study how the automatic scripts work by analyzing the pattern of the URLs requested.
With further URL analysis, we found that the UCB1 IP uses Wget to recursively download all the data files in four specific directories: fluxgate/stream0, imager/stream1, imager/stream2, and imager/stream3. Furthermore, UCB1 checks all the data files within both the previous month and the current month in those four directories every day. Note that the data in those directories are organized into folders by month and date. Basically, the Aurora server merely stores each day’s new data into the corresponding directory.
The file robots.txt is configured by the site administrator. Located in the Web site root directory, it contains instructions in a specific format, indicating which resources robots are not permitted to access. By default, Wget follows proper robot etiquette [22]: it requests robots.txt before downloading files, hence providing us a way to figure out the workload pattern.
Figure 4.18 shows the daily count of robots.txt requests from UCB1. There are around 30 robots.txt requests per day from UCB1; in other words, the Wget script was run about 30 times each day. Figure 4.19 displays the hourly robots.txt request counts from UCB1 on four selected days. The periodic pattern is visually apparent. The cyclic period on January 16 is 8 hours, and it changes to 4 hours on the other days.
Further analysis of the URL requests shows the following:
1) There are two independent robots running different Wget scripts with the same UCB1 IP. The Wget script uses recursive download mode to save all files in the given directory to the local hard disk (it downloads HTML files but deletes them after extracting any embedded URLs).

[Figure: daily “robots.txt” request counts from UCB1, roughly 30 per day, January through April.]
Figure 4.18: “robots.txt” Request Count Per Day for UCB1

[Figure: panels (a)-(d) show hourly “robots.txt” request counts on 2015-01-16, 2015-02-01, 2015-02-05, and 2015-04-05.]
Figure 4.19: “robots.txt” Request Count Per Hour on Four Selected Days
2) The “imager” robot updates local data from the imager/stream1, imager/stream2, and imager/stream3 directories. It usually takes 2-3 hours to complete the scan of a month’s data, and 4-6 hours in total to cover both the previous month and the current month. The “fluxgate” robot updates local data from the fluxgate/stream0 directory. It usually takes 10 minutes to finish 2 months of data.
3) Both robots take a short break after one complete scan over 2 months of data. The length of the break between scans varies from around 1 hour to 4 hours.
4) The file robots.txt is requested whenever a stream directory scan launches.
5) The robots use time-stamping mode, which makes Wget send HEAD requests to check the time-stamps of files on the server side, and only generate a GET request to fetch a file if the server’s copy has a newer time-stamp.
Since Wget applies breadth-first search to recursively retrieve the directory structure of the site, it needs to download static HTML pages and extract URLs repeatedly7. Therefore,
it is not surprising that UCB1 generates so many requests with very limited data volume.
What is surprising is that UCB1 runs the Wget scripts to download files from these directories
several times each day, even though the content rarely changes.
Furthermore, it does so using many non-persistent connections, and a lot of HEAD
requests, rather than Conditional GETs. This approach is not very Internet-efficient, because
of the excessive number of TCP connections used and network round-trip times incurred.
We revisit this issue later in Chapter 6.
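A Conditional GET folds the freshness check and the download into a single request: the client sends an If-Modified-Since header, and the server replies 304 Not Modified (with no body) unless the file is newer. A minimal sketch with Python's standard library, using a hypothetical mirror URL:

```python
import urllib.request
from email.utils import formatdate

def conditional_get(url: str, last_fetched_epoch: float) -> urllib.request.Request:
    """Build a Conditional GET: one round trip replaces the HEAD-then-GET pair."""
    req = urllib.request.Request(url)  # GET by default
    # If-Modified-Since takes an HTTP-date; the server answers 304 if unchanged.
    req.add_header("If-Modified-Since", formatdate(last_fetched_epoch, usegmt=True))
    return req

# Hypothetical data file; a real mirror would remember its last fetch time.
req = conditional_get("http://aurora.example.org/data/themis/20150116.jpg", 0)
print(req.headers["If-modified-since"])  # Thu, 01 Jan 1970 00:00:00 GMT
```

Combined with persistent connections, this halves the per-file round trips compared to the observed HEAD-then-GET pattern.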
The UCB2 IP applies Wget/1.12 (linux-gnu) to retrieve files in older directories. Different from UCB1, the UCB2 script generates the URLs itself and invokes Wget to fetch the files. Several aspects of this script are different from UCB1:
7For more information, refer to the “recursive download mode” in the Wget manual, http://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html
1) The referer information is missing, which indicates that UCB2 visits each URL directly.
2) Some URL requests are spelled incorrectly (e.g., “//data/themis” should be “/data/themis”,
but the server can redirect the former URL to the latter one with 200 OK), which shouldn’t
happen if Wget extracts the links automatically.
3) Some requested resources receive the “404 Not Found” error response from the Aurora
server.
There is no periodic structure for UCB2. In addition, it downloads all data files in the
given directory rather than updating local files like UCB1. Consequently, the UCB2 robot
generates significant data volume in its short active period.
The UA robots are browsers repeatedly viewing the RTEMP live feed pages. Similar to
the CSA AuroraMAX page, RTEMP provides Aurora live feeds by re-fetching images and
videos from the Aurora site server. The process is completed every three seconds by the
client’s browser, implemented with JavaScript Document Object Model (DOM) operations.
The RTEMP live feed pages force clients to continuously send GET requests to the Aurora server as long as they are open in the browser (even when no human is viewing the images, which is why we classify this as robot traffic). The two UA robots performed the same task: they refetched the live feed images for months. The only difference is the user agent: Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0 for UA1, and Mozilla/5.0 (Windows NT 5.1; rv:36.0) Gecko/20100101 Firefox/36.0 for UA2.
There are actually two live feed pages opened by UA1, with different aurora pictures on each page. This produces the step-like structure in Figure 4.17(e). Since the content consists of images and videos, the data volume is much larger than that of the UCB1 robot.
4.2.2 AuroraMAX
The AuroraMAX page on the Canadian Space Agency Web site provides aurora live feeds that include image hyper-links to the Aurora site (note that the visitor’s browser fetches images from the Aurora server instead of the CSA server). About 25% of the HTTP requests (45,763,205) and 43% of the data volume (4,423 GB) are generated by this top referrer site.

[Figure: daily HTTP request counts for AuroraMAX vs. other traffic, January through April.]
Figure 4.20: HTTP Request Count Per Day for AuroraMAX
Figure 4.20 and Figure 4.21 show that the daily requests and data volume are steady over the observation period, except for the surge in mid-March.
There were 104,529 unique IP addresses visiting the Aurora site via the AuroraMAX portal over the four-month period. Further analysis shows that the requests from the AuroraMAX portal are not attributable to a small set of highly active IP addresses. The topmost IP accounts for less than 1.5% of the HTTP requests from the AuroraMAX portal (compared to UCB1, which accounts for half of the traffic for the whole Aurora site). Figure 4.22 shows the frequency-rank profile for IP addresses.
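A frequency-rank profile like the one in Figure 4.22 is computed by counting requests per IP and sorting the counts in descending order, so that rank 1 is the most active address. A small sketch with made-up IPs:

```python
from collections import Counter

# Illustrative client IPs extracted from an access log (made-up data).
ips = ["1.2.3.4"] * 5 + ["5.6.7.8"] * 3 + ["9.9.9.9"]

# Count requests per IP, then sort the counts in descending order to get ranks.
freqs = sorted(Counter(ips).values(), reverse=True)
profile = list(enumerate(freqs, start=1))  # (rank, frequency) pairs
print(profile)  # [(1, 5), (2, 3), (3, 1)]
```

On log-log axes, an approximately straight line in such a profile is the visual signature of a power law.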
Considering AuroraMAX’s popularity as a referrer site, the naive way it fetches the images from the Aurora site is Internet-inefficient. We propose a solution for this inefficiency issue in Chapter 6.
In summary, the robot traffic (UCB, UA, and AuroraMAX) accounts for 90.1% of the total requests and 74.1% of the total data volume. Therefore, solving these inefficiency issues may significantly reduce the load of the Aurora site.
[Figure: daily data volume (GB) for AuroraMAX vs. other traffic, January through April.]
Figure 4.21: Data Volume (GB) Per Day for AuroraMAX
Figure 4.22: IP Addresses Frequency-Rank Profile for AuroraMAX
4.3 Geomagnetic Storm
An interesting discovery in our dataset was the non-stationary traffic observed for the Aurora site in mid-March 2015. The HTTP request traffic and the data volume both quadrupled from their normal levels for the March 17-20 period (see Figure 4.1 and Figure 4.3).
The root cause for this traffic surge was solar flare activity that triggered one of the largest geomagnetic storms in over a decade [56]. Auroral researchers knew about this immediately, and eagerly downloaded many of the new images. The ensuing media coverage of the geomagnetic storm triggered many other site visits, either directly or via the AuroraMAX portal. Figure 4.20 shows that much of the surging traffic arrived via the AuroraMAX referrer site.
Further analysis indicates that the increased traffic is primarily human-initiated, since:
1) The number of distinct IPs visiting the site surged (eightfold) during the geomagnetic storm period (see Figure 4.4).
2) The number of GET requests quadrupled in the surge in Figure 4.8, with no change in HEAD requests. This contrast indicates that the surge was not caused by Wget robots.
3) There was in fact a ten-fold increase in the AuroraMAX portal traffic (requests and data volume) during this period.
It is interesting to witness how real-world events affect the traffic of scientific Web sites.
The traffic information shows that flash crowds are not limited to “popular Web sites”. Such surges are important to consider when provisioning server-side capacity configurations.
4.4 Summary
This chapter provided a detailed analysis of the network traffic for the Aurora site. First, our analysis covered fundamental HTTP characteristics. Specifically, we analyzed the daily and hourly values for HTTP requests and data volume. We extracted the top 100 most popular IP addresses, and discovered the existence of machine-generated traffic.
Based on those discoveries, we analyzed the robot traffic. We primarily studied the traffic of four distinct IP addresses from the University of California at Berkeley and the University of Alaska at Fairbanks. The results showed that the way they perform data transfers is very inefficient. In addition, we analyzed the top referrer site AuroraMAX and discovered its inefficient way of fetching live images from the Aurora site. A further discussion of the data transfer inefficiency problem is presented in Chapter 6.
Finally, we showed how real-world events affect the traffic of the Aurora site, by illustrating the changes during the geomagnetic storm.
We analyze the traffic of the ISM site in the next chapter.
Chapter 5
ISM SITE ANALYSIS
The ISM (Inter-Stellar Medium) Web site at the U of C provides astrophysics teaching materials. We present our analysis of the ISM site in this chapter. First, we show the
HTTP characteristics for the ISM site. Considering that the network traffic was primarily human-generated, we focus on IP geolocation distribution, user agent classification, and URL popularity analysis. Since large-volume video files are a unique feature of the ISM site, we study how network traffic relates to user behavior patterns when viewing course videos from the ISM site. Finally, we analyze how course schedules affect the network traffic of the ISM site.
5.1 HTTP Analysis
We analyze the traffic logs for a four-month period from January 1, 2015 to April 29, 2015, covering the whole Winter 2015 semester at U of C. In this semester, lectures began on
January 12, and ended on April 15, with final exams running from April 18 to 29. There was a reading week with no lectures from February 15 to 22.
Due to the limitations of our tracing framework, we can only observe the ISM Web traffic generated when users are off-campus. The on-campus traffic does not pass through the campus edge routers and therefore is not seen by our monitor. We may analyze the server-side logs of the ISM site in future work.
5.1.1 HTTP Requests
A summary of the ISM site traffic is shown in Table 5.1. There are around 1.5 million requests in total, and 13,000 requests per day for the ISM site. While robots and referrer sites contribute most of the traffic to the Aurora site, this is not true for the ISM site. Consequently, it is not surprising that the average request traffic for the ISM site is about two orders of magnitude lower than that for the Aurora site (the data volumes of the two sites are similar).

Table 5.1: Statistical Characteristics of the ISM Site (Jan 1/15 to Apr 29/15)

Site  Total Reqs  Avg Reqs/day  Total GB  Avg GB/day  Uniq URLs  Uniq IPs
ISM   1,583,339   13,305        8,483     71.29       10,563     9,720

[Figure: daily HTTP request counts for the ISM site, January through April.]
Figure 5.1: HTTP Request Count Per Day for ISM Site
The daily ISM site traffic is illustrated in Figure 5.1. Note there are 3 obvious surges in the request traffic over the four months. The surge in late-February aligns with the first midterm in the course (February 24), while the subsequent surges align with the second midterm (March 24) and the final exam (April 21). These surges are “expected” compared to the “unexpected” surges in the Aurora site.
We select six surge days and display their hourly HTTP request traffic in Figure 5.2. The requests usually decreased between midnight and dawn, consistent with human schedules. However, February 24 is a counterexample: the early-morning requests on that day were driven by the course midterm exam.
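Hourly counts like those in Figure 5.2 come from binning request timestamps. The sketch below assumes Common Log Format access-log lines; the sample entries are fabricated for illustration.

```python
from collections import Counter
from datetime import datetime

# Fabricated Common Log Format entries for illustration.
log_lines = [
    '1.2.3.4 - - [24/Feb/2015:09:15:02 -0700] "GET /Lec8.mov HTTP/1.1" 206 131072',
    '1.2.3.4 - - [24/Feb/2015:09:15:05 -0700] "GET /Lec8.mov HTTP/1.1" 206 131072',
    '5.6.7.8 - - [24/Feb/2015:10:01:44 -0700] "GET /rss.xml HTTP/1.1" 200 512',
]

def hour_of(line: str) -> int:
    # Pull "24/Feb/2015:09:15:02 -0700" out of the brackets, drop the zone.
    stamp = line.split("[", 1)[1].split("]", 1)[0].split()[0]
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S").hour

requests_per_hour = Counter(hour_of(line) for line in log_lines)
print(requests_per_hour)  # Counter({9: 2, 10: 1})
```

The same binning, applied per day instead of per hour, produces the daily series in Figure 5.1.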
[Figure: panels (a)-(f) show hourly HTTP request counts on the six selected days.]
Figure 5.2: HTTP Request Per Hour (Feb 23, Feb 24, Mar 23, Mar 24, Apr 20, and Apr 21, 2015)
5.1.2 Data Volume

[Figure: daily data volume (GB) for the ISM site, January through April.]
Figure 5.3: Data Volume (GB) Per Day for ISM Site
Table 5.1 shows that the average daily data volume of the ISM site (71 GB per day) is comparable to the Aurora site (87 GB per day), even though the request volumes of the two sites are very different. We attribute this to two reasons:
1) The ISM site contains course-related materials rather than research resources. The professor provides large objects (e.g., course videos, PDFs) to his students. These files are much larger than the JPEG and HTML files provided by the Aurora site.
2) Although the number of requests for the ISM site is low compared to the Aurora site, most of the requests target large files instead of small HTML or JavaScript files.
Figure 5.3 shows the daily data volume information for the ISM site over the four-month period. It is interesting to observe the similar “sawtooth” structures in each month. To be specific, in Figure 5.3, the data volume increased on February 19, 21, 23, and decreased on February 20, 22, which makes the broken lines form a “sawtooth” structure. The same
“sawtooth” structure appeared in March and April as well. Since these surges align with the exams, we attribute the “sawtooth” structures to the students’ studying pattern.
Another interesting discovery is the “out of sync” phenomenon in late February: the maximum surges in requests and in data volume fell on different dates. Specifically, the maximum surge of requests (in Figure 5.1) was on February 24, while the biggest surge in data volume (in Figure 5.3) was on February 23 (this offset only occurred in February; the surges in March and April align). By comparing the URLs requested on February 23 and February 24, we find that although the number of video requests on February 24 is larger than on February 23, the average data volume per video request on February 24 is smaller than on February 23. This may indicate that most video viewers (students) tended to skip frames when watching the course videos on February 24, while they watched video clips of longer average duration on February 23. This midterm reviewing pattern makes the number of requests peak on February 24, and the data volume peak on February 23.
5.1.3 IP Analysis
During the four months of observation, 9,720 unique IP addresses visited the ISM site. Every day, around 300 IPs requested files from the ISM server. Since the ISM site is mainly designed for students at the University of Calgary, the magnitude of daily users is naturally much smaller than for the Aurora site: about one-tenth, specifically. The amplitude of the surges in the second half of each month is comparatively smaller than in the request traffic in Figure 5.1 and the data volume in Figure 5.3, because the primary users are students, who are frequent repeat visitors to the ISM site.
The geolocation analysis for all the IPs visiting the ISM site shows that visitors were from
101 different countries, though about half of those countries (55) generated fewer than 100 requests in four months. Figure 5.4 shows a pie graph of the top 5 countries. Again, it is not surprising that most of the traffic is generated by Canadian (88.24%) and American (7.91%) users. For all the requests from Canada, Alberta surpasses all other provinces with 1.2 million requests (97.64%) in Figure 5.5, while the US distribution is more dispersed in Figure 5.6.
Actually, many of the USA requests are generated by Internet companies, like Google and Apple, for indexing Web content. Figure 5.7 shows that 1.1 million requests come from Calgary, dominating all other cities in Alberta. Furthermore, for all the requests generated in Canada, about half (704,074 requests, or 50.4%) use the Internet service provided by “Shaw Communications Inc.”, and 44.8% belong to “TELUS Communications Inc.”.

Table 5.2: Top 10 Most Frequently Observed IP Addresses for ISM Site

IP               Reqs     Pct.   Organization               Location
209.89.92.190    125,296  7.91%  TELUS Communications Inc.  Calgary, Canada
70.72.185.197    86,648   5.47%  Shaw Communications Inc.   Calgary, Canada
96.51.68.175     64,912   4.10%  Shaw Communications Inc.   Calgary, Canada
198.166.61.187   64,581   4.08%  TELUS Communications Inc.  Calgary, Canada
209.89.235.216   61,135   3.86%  TELUS Communications Inc.  Calgary, Canada
68.146.124.225   43,501   2.75%  Shaw Communications Inc.   Calgary, Canada
206.75.57.71     40,749   2.57%  TELUS Communications Inc.  Calgary, Canada
162.157.164.121  39,405   2.49%  TELUS Communications Inc.  Calgary, Canada
68.146.221.78    26,053   1.65%  Shaw Communications Inc.   Calgary, Canada
68.110.70.13     20,802   1.31%  Cox Communications Inc.    Scottsdale, USA

[Figure: pie chart. Canada 88.24% (1,397,096 requests); United States 7.91% (125,269); United Kingdom 0.75% (11,847); France 0.48% (7,612); China 0.38% (6,092); others 2.24% (35,423).]
Figure 5.4: IP Geolocation Distribution for Countries

[Figure: pie chart. Alberta 97.64% (1,221,943); British Columbia 1.25% (15,639); Ontario 0.61% (7,660); Quebec 0.31% (3,838); Saskatchewan 0.16% (2,032); others 0.03% (361).]
Figure 5.5: IP Geolocation Distribution for Canada

[Figure: pie chart. California 32.85% (39,627); Arizona 18.53% (22,356); Washington 12.10% (14,595); New Jersey 10.94% (13,191); Massachusetts 4.63% (5,588); others 20.94% (25,263).]
Figure 5.6: IP Geolocation Distribution for USA

[Figure: pie chart. Calgary 93.07% (1,129,175); Red Deer 1.64% (19,844); Medicine Hat 1.21% (14,622); Cochrane 0.92% (11,188); Edmonton 0.79% (9,628); others 2.38% (28,821).]
Figure 5.7: IP Geolocation Distribution for Alberta
Figure 5.8 shows how many unique IPs visited the ISM site from Canada and Calgary per day. Since Canada is the primary contributor for the ISM site, and Calgary is the primary contributor for Canada, the structure of the red area aligns with the blue area and green area. In addition, we analyzed the daily unique IPs from the USA and its top state California in Figure 5.9. Nearly half of the IPs from USA are in California. Furthermore, the surges for USA align with the surges for California.
The IP frequency-rank profile of the ISM site is shown in Figure 5.10. Visual evidence of a power-law structure is apparent.
5.1.4 URL Analysis
There are 10,563 different URLs on the ISM site requested in the four-month period. Table 5.3 shows the top 10 most popular URLs for the ISM site. Note that we only show the file names and parts of the directory names for convenience in Table 5.3. Since the course instructor had changed the format of all the course videos in the middle of that semester, we have both “.mov” and “.mp4” extensions in the top 10 URLs.

[Figure: daily unique IP counts, stacked as Outside Canada, Canada (excl. Calgary), and Calgary.]
Figure 5.8: Number of Daily Unique IP Addresses Visiting ISM Site, from Canada and Calgary (2015-01-01 to 2015-04-30)
It is unsurprising that course materials are popular among the URLs. Furthermore, the large-size videos and PDF files generate tremendous data volume with limited requests.
Figure 5.11 shows the URL frequency-rank profile of the ISM site. Unlike the step shape for Aurora in Figure 4.9, the URL frequency-rank profile for the ISM site shows visual evidence of the power-law structure, typical of human-generated requests.

Table 5.3: Top 10 Most Frequently Requested URLs for ISM Site

URL                                             Total Reqs  Total GB
ASTR209 - Lec8 - Feb 5, 2015.mov                153,410     267.04
ASTR209 - Lec3 - Jan 20, 2015.mov               87,051      787.02
ASTR209 - Intro. & Lecture#1 - Jan 13,2015.mov  75,380      735.64
ASTR209 - Lec4 - Jan 22, 2015.mov               68,609      584.47
AST209 Podcast/rss.xml                          56,293      0.71
2015/1/28 Course Notes files/Part2 e&m.pdf      55,952      58.07
ASTR209 - Lec2 - Jan 15, 2015.mov               39,687      998.60
ASTR209 - Lec10, Feb 12, 2015.mov               31,308      310.65
2015/3/11 Course Notes files/Part2 e&m.pdf      30,068      23.54
ASTR209 - Lec15 - Mar 12, 2015.mp4              28,690      284.02

[Figure: daily unique IP counts, stacked as Outside US, US (excl. California), and California.]
Figure 5.9: Number of Daily Unique IP Addresses Visiting ISM Site, from USA and California (2015-01-01 to 2015-04-30)

Table 5.4: HTTP Method Summary for ISM Site

HTTP Method  Rank  Reqs       Avg Reqs/day  Pct.
GET          1     1,575,574  13,130        99.51%
HEAD         2     7,749      65            0.49%
OPTIONS      3     11         0.09          0.00%
POST         4     5          0.04          0.00%
5.1.5 HTTP Methods
Table 5.4 shows a summary of the HTTP method information for the ISM site. As expected, the number of GET requests dominates other HTTP methods. Since there is no Wget robot for the ISM site, HEAD requests only account for a fairly small part of the total traffic.
Furthermore, 7,285 HEAD requests (94.01%) were generated by Apple’s iTunes application to check the existence of some resources or whether the ISM site RSS (Rich Site Summary) [23]
file was updated.
Figure 5.12 displays the daily number of GET and HEAD requests. The HEAD requests are rarely seen.
Figure 5.10: Frequency-Rank Profile for IP Addresses, ISM Site
Figure 5.11: Frequency-Rank Profile for URLs, ISM Site
[Figure: daily GET and HEAD request counts, January through April.]
Figure 5.12: HTTP Methods in ISM Traffic
Table 5.5: HTTP Status Code Summary for ISM Site
Status Code  Type                             Rank  Reqs     Avg Reqs/day  Pct.
206          Partial Content                  1     927,733  7,731         58.59%
200          OK                               2     507,358  4,227         32.04%
304          Not Modified                     3     79,064   658           4.99%
404          Not Found                        4     47,372   394           2.99%
301          Moved Permanently                5     52       0.43          0.00%
416          Requested Range Not Satisfiable  6     33       0.28          0.00%
400          Bad Request                      7     1        0             0.00%
5.1.6 HTTP Status Codes
The HTTP status code is part of the HTTP response header, indicating how the server responded to an HTTP request. For example, the server responds with a “200 OK” status code when it successfully fetches the resource in response to a client’s GET request.
Table 5.5 summarizes the HTTP status codes for the ISM site. Status code “206” (Partial Content) is the topmost one, accounting for around 60% of the requests, while “200” is second at 32%. This result is quite different from the workload characterization of most general Web sites, where “200 OK” responses dominate. This situation is primarily caused by students frequently requesting pieces of large files (e.g., videos and PDFs), and by Internet user agent behaviors.

[Figure: daily counts (log scale) for status codes 206, 304, 200, and 404, January through April.]
Figure 5.13: HTTP Status Code in ISM Traffic
We plot the daily counts for the top 4 status codes in Figure 5.13. It is interesting to observe the interleaving patterns of status codes “200” (red dashed line) and “206” (black solid line) whenever an exam was imminent. We attribute this to students’ reviewing strategies, and discuss it further in Section 5.3.
5.1.7 HTTP Response Size Distribution
Figure 5.14 shows the HTTP response sizes for all the HTTP requests for “Lec8 - Feb 5,
2015.mov” from February 18 to February 24. The x-axis represents the time series, covering about 130,000 requests generated in a week (the midterm surge is included).
Figure 5.15 shows the HTTP response sizes for all the HTTP requests for “Lec8 - Feb 5,
2015.mov” on February 24 (the first midterm date). The video is retrieved frequently from early morning to noon right before the midterm.
We did a series of HTTP response size distribution analyses for the ISM site. These analyses include all responses (e.g., “206 Partial Content” and “200 OK”). The response size values are the transferred data volumes, not the “Content-Length” header values.
Figure 5.14: HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, from 2015-02-18 to 2015-02-24
Figure 5.15: HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, on 2015-02-24
Figure 5.16: HTTP Response Size Distribution Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-5 GB, 50 bins, y-axis log-scale)
Figure 5.17: HTTP Response Size Distribution Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-10 GB, 50 bins, y-axis log-scale)
73 131072
131072 - 51.57% - 79106 262144 - 21.64% - 33205 65536 - 12.21% - 18727 51.6% others - 11.72% - 17977 0 - 1.93% - 2963 327680 - 0.93% - 1432
0.9% 327680 1.9% 0
21.6% 11.7%
others 262144 12.2%
65536
Figure 5.18: HTTP Response Size Values (Byte) Per Request Top 5 Count for “Lec8 - Feb 5, 2015.mov”
[Figure: pie chart. 65,536 B: 47.54% (41,385 requests); others: 46.96% (40,875); 1,245,184 B: 1.58% (1,379); 1,179,648 B: 1.51% (1,311); 0 B: 1.36% (1,181); 1,310,720 B: 1.06% (920).]
Figure 5.19: HTTP Response Size Values (Byte) Per Request Top 5 Count for “Lec3 - Jan 20, 2015.mov”
Figure 5.16 and Figure 5.17 show the response size histograms for the top 2 popular URLs, “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov”. Note that the y-axis is drawn in log-scale, and there are 50 bins in total. Although the actual file size of “Lec8 - Feb 5, 2015.mov” is 1.5 GB, and “Lec3 - Jan 20, 2015.mov” is 2.1 GB, most of the response size values in both figures fall in the small-value bins.
Due to the limited information provided by histograms, we draw pie graphs of the top 5 response size values (in bytes) for the 2 URLs in Figure 5.18 and Figure 5.19. Clearly, small responses with data volumes under 1 MB predominate from the ISM server, even though the requested videos are around 2 GB in size. Furthermore, the majority of these small size values concentrate on a few specific values, such as 131,072 bytes (51.57%, with 79,106 requests) and 65,536 bytes (47.54%, with 41,385 requests). These phenomena are caused by Internet user agent behaviors when fetching large files from a server that supports partial GET requests.
Figure 5.20 shows zoomed-in histograms and cumulative histograms of the 2 URLs for the response data volumes smaller than 1 MB. The peaks correspond to the popular response size values from the pie graphs (e.g., 131,072). Note that the y-axis is drawn in log-scale, and there are 50 bins in total. The step-like shapes of the cumulative histograms also indicate many responses sharing the same size values.
The ISM server supports the “Accept-Ranges: bytes” feature, which allows clients to request arbitrary byte ranges of a file stored on the server. Therefore, clients can request partial content from the ISM server. User agents may behave in diverse ways when fetching large files from the ISM server, even though the videos are served as static files (that is, adaptive streaming techniques are not applied). We revisit this issue in Section 5.2.
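The chunked fetching described above can be reproduced with an explicit Range header. The sketch below, using a hypothetical URL, builds the kind of partial GET that elicits a 206 Partial Content response carrying only the requested bytes:

```python
import urllib.request

def range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Request `length` bytes starting at `start` (HTTP byte ranges are inclusive)."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={start}-{start + length - 1}")
    return req

# Hypothetical URL; 131,072 bytes (128 KB) matches the most common chunk
# size observed for "Lec8 - Feb 5, 2015.mov".
req = range_request("http://ism.example.org/Lec8.mov", 0, 131072)
print(req.headers["Range"])  # bytes=0-131071
```

A media player issuing many such requests in sequence explains both the dominance of 206 responses and the spikes at 65,536 and 131,072 bytes in the response size distribution.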
5.1.8 User Agents
Unlike the traffic of the Aurora site, the vast majority of users viewing the ISM site are humans. Therefore, it is meaningful to analyze the user agent information. In this section, we use the on-line user agent database provided by “User Agent String.Com”1 to identify viewers’ operating system and user agent information.

[Figure: panels (a)-(d) show histograms and cumulative histograms of response sizes (0-1 MB, 50 bins) for the two files.]
Figure 5.20: Histograms of Response Size Values (smaller than 1 MB) for “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov” Files

Table 5.6: Top 10 Most Popular User Agents for the ISM Site

User Agent Name                                                             Reqs
AppleCoreMedia/1.0.0.11D201 (iPhone; U; CPU OS 7 1 1 like Mac OS X; en us)  142,232
AppleCoreMedia/1.0.0.12B466 (iPad; U; CPU OS 8 1 3 like Mac OS X; en us)    124,788
AppleCoreMedia/1.0.0.12B435 (iPhone; U; CPU OS 8 1 1 like Mac OS X; en gb)  72,731
AppleCoreMedia/1.0.0.11A501 (iPad; U; CPU OS 7 0 2 like Mac OS X; en us)    64,608
AppleCoreMedia/1.0.0.12A405 (iPad; U; CPU OS 8 0 2 like Mac OS X; en us)    60,223
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0    50,255
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0    41,548
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0    41,536
AppleCoreMedia/1.0.0.10K549 (Macintosh; U; Intel Mac OS X 10 8; en us)      38,045
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0    28,225

The top 10 most popular user agents for the ISM site are shown in Table 5.6. Six user agents in the table are from Apple’s products, while the others are Windows machines running the Firefox browser.
Among all captured user agents, about half of them (49.15%) belong to the “browser” category, 44.31% of them are identified as “AppleCoreMedia”, and 2.01% are identified as
“crawler”. Specifically, Figure 5.21 shows a distribution of the top 8 user agent names. All other identified user agents are classified into the “others” label.
Since most requests were generated by Internet browsers, we did an analysis of the user agents in the browser category in Figure 5.22. Firefox, Chrome, and Safari are the top 3 most popular browsers. Internet Explorer is in 4th position, with only 6.67% share.
For all the user agents labeled with “crawler”, we found that “Googlebot” from Google accounts for about half of the traffic (15,734 requests, 49.46%), and “Bingbot” from Microsoft ranks second with 8,031 (25.25%) HTTP requests. There are crawlers executed by some
Chinese search engines, like Baidu2 and Sogou3.
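Our classification relied on the User Agent String.Com database; a simplified substring-based classifier in the same spirit might look like the sketch below. The rules are a rough approximation for illustration, not the database's actual logic.

```python
def classify_user_agent(ua: str) -> str:
    """Rough substring-based classification (illustrative only)."""
    if "AppleCoreMedia" in ua:
        return "AppleCoreMedia"
    # Check crawlers before browsers: crawler UAs often embed "Mozilla/5.0" too.
    if any(bot in ua for bot in ("Googlebot", "bingbot", "Baiduspider", "Sogou")):
        return "crawler"
    if any(b in ua for b in ("Firefox", "Chrome", "Safari", "MSIE", "Trident")):
        return "browser"
    return "unknown"

print(classify_user_agent(
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0"))
# browser
```

Counting the labels over all requests yields distributions like those in Figure 5.21.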
1User Agent String.Com, http://www.useragentstring.com/
2Baidu, http://www.baidu.com/
3Sogou, http://www.sogou.com/
[Figure: pie chart. AppleCoreMedia 44.31% (701,507 requests); Firefox 18.63% (295,001); Chrome 14.78% (234,035); Safari 10.86% (171,994); Internet Explorer 3.28% (51,897); unknown 3.03% (48,006); Android Webkit Browser 1.48% (23,404); iTunes 1.38% (21,929); others 2.25% (35,563).]
Figure 5.21: User Agent Names Distribution in the ISM Site
[Figure: pie chart. Firefox 37.91% (295,001 requests); Chrome 30.07% (234,035); Safari 22.10% (171,994); Internet Explorer 6.67% (51,897); Android Webkit Browser 3.01% (23,404); others 0.24% (1,856).]
Figure 5.22: User Agent Browsers Distribution in the ISM Site
Figure 5.23: Operating System Distribution in the ISM Site (Macintosh 59.01%, 934,367 requests; Windows 32.96%, 521,945; unknown 4.41%, 69,809; Android 1.88%, 29,763; Darwin 0.96%, 15,223; Linux 0.68%, 10,805; others 0.09%, 1,424)
Figure 5.23 summarizes the distribution of operating systems for all the user agents. Macintosh (the user agent database classifies all Apple products into this category) is the most popular operating system category; it includes iPhone OS (iOS) [18], which runs on the iPhone, iPad, and iPod touch, and OS X [21], which runs on Apple computers. The second most prevalent is Microsoft Windows, generating 32.96% of the requests. Interestingly, Android is not as popular among the students. We further analyze each operating system category:
1) For Apple devices, iPhone OS accounts for 70.88% of the requests (662,286 in total), and OS X represents 29.12% (272,081 requests).
2) For Windows users, 44.9% (234,572) of the requests were generated by Windows 7, 40.87% (213,300) by Windows NT, 9.24% (48,238) by Windows 8, 2.82% (14,700) by Windows Vista, and 1.29% (6,743) by Windows XP. These results are similar to the Windows Web browsing shares in [26].
We list the top 5 popular versions for some selected operating systems in Table 5.7. Note that Apple users tend to upgrade their OS to the newer versions more frequently, compared to Windows users.
Table 5.7: Top 5 OS Versions

(a) Android (1.88% of total Reqs)
OS Version   Reqs      Pct.
4.4.2        14,781    49.66%
4.4.4        5,667     19.04%
5.0.1        2,299     7.72%
5.0.2        935       3.14%
4.2.1        790       2.65%

(b) iPhone OS (41.83% of total Reqs)
OS Version   Reqs      Pct.
8.1.3        170,011   25.67%
7.1.1        143,252   21.63%
8.1.1        89,253    13.48%
8.0.2        67,816    10.24%
7.0.2        64,786    9.78%

(c) OS X (17.18% of total Reqs)
OS Version   Reqs      Pct.
10.6.8       47,491    17.45%
10.10.2      47,090    17.31%
10.9.5       31,012    11.40%
10.10.1      30,689    11.28%
10.8.3       19,978    7.34%

(d) Windows (32.96% of total Reqs)
OS Version   Reqs      Pct.
Win 7        234,572   44.94%
Win NT       213,300   40.87%
Win 8        48,238    9.24%
Win Vista    14,700    2.82%
Win XP       6,743     1.29%
Table 5.8: Top 5 Browser Versions

(a) Firefox (18.63% of total Reqs)
Browser Version   Reqs      Pct.
35.0              118,495   40.17%
36.0              86,130    29.20%
37.0              54,030    18.32%
34.0              13,391    4.54%
33.0              6,259     2.12%

(b) Chrome (14.78% of total Reqs)
Browser Version   Reqs      Pct.
40.0.2214.115     37,381    15.97%
40.0.2214.111     28,115    12.01%
42.0.2311.90      22,460    9.60%
41.0.2272.118     21,752    9.29%
41.0.2272.101     19,674    8.41%

(c) Safari (10.86% of total Reqs)
Browser Version   Reqs      Pct.
8.0               49,467    28.76%
8.0.3             20,880    12.14%
7.0               15,995    9.30%
8.0.2             15,944    9.27%
8.0.4             11,225    6.53%

(d) Internet Explorer (3.28% of total Reqs)
Browser Version   Reqs      Pct.
11.0              32,116    61.88%
10.0              9,824     18.93%
7.0               4,785     9.22%
8.0               2,601     5.01%
9.0               1,403     2.70%
Figure 5.24: HTTP Requests Count Per Day for Video and Other Requests
In addition, we list the top 5 versions for selected browsers in Table 5.8. It is clear that clients using Internet Explorer and Safari are more inclined to update the browser to the latest version.
5.2 Video Viewing Pattern and Traffic
From the previous analysis, we have a general understanding of the ISM site traffic. The large-size course videos make its traffic pattern different from other sites. Therefore, we study the video viewing pattern and the corresponding traffic in this section.
5.2.1 Video Requests Traffic
Figure 5.24 shows the daily video requests. Clearly, most requests are video requests. However, the number of video requests is comparatively low during the first half of each month. Furthermore, non-video requests even exceed video requests in the last two surges, which is explained by students' exam reviewing strategies: before the first midterm, students relied more on the lecture videos for studying, but chose to use the other materials (e.g., lecture notes) when studying for the second midterm and the final exam.
Figure 5.25: Data Volume (GB) Per Day for Video and Other Requests
Figure 5.25 compares the video-related data volume to the non-video traffic. Most of the data volume was contributed by video requests during the four-month observation, aligning with the analysis in the previous sections. Even when requests for other resources outnumber video requests in Figure 5.24, the data volume of video requests still dominates.
We analyze the HTTP transaction durations (Figure 5.26) and response sizes (Figure 5.27) of all the video requests during the four months of observation. The duration is calculated as the time between sending a request and receiving the response. For all video requests, we find that 98.1% of the HTTP transaction durations are shorter than 10 seconds, and 94.6% of the response sizes are smaller than 5 MB. In other words, short HTTP transaction durations and small response sizes dominate the video HTTP transactions, owing to the prevalence of HTTP partial content request-responses. At the other extreme, some HTTP transactions last several hours, which is surprising since a lecture video usually lasts only an hour. These rare long transactions may be caused by slow network speeds, connection failures, or paused video players. A similar long tail appears in the response size distribution.
Figure 5.26: HTTP Transaction Durations Distribution Histogram (x-axis 0-60K s, 50 bins, y-axis log-scale)
Figure 5.27: HTTP Response Size Distribution Histogram (x-axis 0-12 GB, 50 bins, y-axis log-scale)
Figure 5.28: HTTP Transaction Duration (≤ 10s) CDF for Video Requests
Figure 5.29: HTTP Response Size (≤ 5MB) CDF for Video Requests
Figure 5.28 and Figure 5.29 show the CDFs of the HTTP transaction durations (≤ 10 s) and HTTP response sizes (≤ 5 MB) for video requests. The x-axes represent duration (in seconds) and response size (in MB), respectively. The duration curve is comparatively smoother than the response size curve, which has a few vertical jumps. Further analysis indicates that the dominant response sizes are 65,536 bytes (35.4% of requests), 131,072 bytes (12.1%), and 262,144 bytes (5.2%), i.e., exact multiples of 64 KiB. This phenomenon is caused by user agents fetching large (video) files from a server that supports partial GET requests.
The ISM server supports “Accept-Ranges: bytes”, which allows clients to request arbitrary byte ranges of a file stored on the server. Therefore, clients can request partial content from the ISM server, and user agents fetch large video files from it in diverse ways. Among all video requests, the user agent “AppleCoreMedia” is responsible for 701,499 requests (97.9%), dominating the popular Internet browsers. AppleCoreMedia is a framework in Apple's products used to process on-line videos, cooperating with applications such as Safari and iTunes. From the access logs, we find that the user agent field in an HTTP request sometimes changes to “AppleCoreMedia” even when Safari is being used. Furthermore, we discover that the partial requests generated by Safari or “AppleCoreMedia” are quite unpredictable. For example, the range values are not always monotonic or contiguous; they occasionally skip or overlap. Yao et al. [60] observed the same behavior on other iOS devices and analyzed its inefficiency, concluding that about 10%-70% of the traffic is redundant when accessing Internet streaming services on iOS devices.
5.2.2 Browser Behaviors for Video Playing
As introduced in Section 2, there are three different techniques for streaming video over the Internet. We analyzed the HTML source code of the ISM site and found that it uses the progressive download technique. This implementation of video streaming is not only inconvenient for users, but also inefficient in its network usage. We explore this further with a comparison experiment.
We deploy an Apache HTTP server on a PC, with “Accept-Ranges: bytes” enabled by default; the configuration of our server is essentially the same as that of the ISM server. We limit client bandwidth to 10.24 Mbit/s⁴ using the Apache “mod_ratelimit” module⁵, to simulate the QoS environment. The Web server and clients all run on the same PC (localhost), eliminating network variability. One lecture video (“ASTR209 - Lec4 - Jan 22, 2015.mp4”) is downloaded from the ISM site as a sample and deployed on our server. We experiment with four server-side video playing implementations, tested with the latest versions of Firefox, Chrome, Safari, and Internet Explorer (see Figure 5.22):
Case 1) The video file is served as a static file on the server. This is the simplest way to deliver video files.
Case 2) The video file is embedded as an HTML “object” element⁶, with its “type” attribute set to video/quicktime. This is implemented exactly the same way as in the ISM site.
Case 3) The video is displayed with the HTML5 “video” tag. This is the standard way to embed a video in a Web page, but was not feasible before HTML5.
Case 4) The video is displayed by an MPEG-DASH implementation with Dash.js support. This approach needs to process the video and generate the Media Presentation Description file beforehand. Dash.js requires Media Source Extensions support in the browsers.
The results are shown in Table 5.9. The browser names and versions are listed in the leftmost column. We use “Static File”, “Object Element”, “HTML5 Video”, and “MPEG-DASH” to represent the four implementations. The column “Play” shows whether the video can be played under that condition, and “Forward” shows whether the video can be forwarded to any point (versus the user having to watch from the start and wait for the video to be downloaded).

⁴ List of countries by Internet connection speeds, https://en.wikipedia.org/wiki/List_of_countries_by_Internet_connection_speeds
⁵ Apache Module mod_ratelimit, http://httpd.apache.org/docs/2.4/mod/mod_ratelimit.html
⁶ HTML

Table 5.9: Browser Support for the Four Video Playing Implementations

               Static File    Object Element   HTML5 Video    MPEG-DASH
Browser        Play  Forward  Play  Forward    Play  Forward  Play  Forward
Chrome (V44)   Yes   Yes      No    N/A        Yes   Yes      Yes   Yes
Safari (V8)    Yes   Yes      Yes   No         Yes   Yes      Yes   Yes
Firefox (V39)  Yes   Yes      Yes   No         Yes   Yes      No    N/A
IE (V11)       No    N/A      No    N/A        Yes   Yes      Yes   Yes
As Table 5.9 shows, the static file approach works in all the browsers except IE, since IE downloads the video file by default instead of invoking its internal video player. The static file approach is otherwise almost the same as the HTML5 video tag approach: in both cases, the browser's built-in video player plays the video. The HTML object element implementation used by the ISM site works in only two of the four browsers. Furthermore, in the object element approach the browser uses the QuickTime plug-in to decode the video file, which does not support fast forward. The HTML5 video tag implementation is the only approach that fully supports fast forward in all the browsers.
There are four current commercial implementations of adaptive streaming, of which MPEG-DASH is the only international standard, and it is widely supported. Safari has supported it since V8, IE since V11, and Firefox only has partial support. The advantage of DASH is that it provides the best video quality for the user's network speed; the disadvantage is the computation and storage cost of compressing the videos at various bit-rates beforehand.
Analysis of the Apache logs in our experiment shows the following:
1) Chrome first generates a GET request for the whole video file, then a GET request with “bytes=0-” to test whether partial GET is applicable. When the user clicks a point in the progress bar beyond what has been downloaded, Chrome aborts the previous GET request and generates a new partial request. IE behaves the same as Chrome.
2) Firefox and Safari also generate a GET request for the whole video file at first. They then both generate a series of partial GET requests with small responses when the user forwards the video. The number of requests and the range values do not seem to follow any pattern. Recall that the popular response size values in Figure 5.27 are primarily caused by this Firefox and Safari behavior.
3) Yao et al. [60] mentioned that a slow connection can aggravate the redundant requests issue. In our test, the browser behaviors remained the same regardless of whether the bandwidth limit was large or small.
In summary, the video streaming approach implemented by the ISM site is inconvenient for viewers and inefficient in its network usage. With the HTML5 video tag now widely supported [14], a backward-compatible HTML5 approach (retaining support for old browsers) is suggested to improve the video playing implementation of the ISM site.
5.3 Course-Related Events
There is no doubt that course-related events heavily influenced the traffic of the ISM site. As mentioned earlier, the surges in daily requests (Figure 5.1), data volume (Figure 5.3), and unique IPs (Figure 5.8) are mainly caused by the scheduled exams of the course ASTR 209 (or AST 209). Therefore, we explore the network usage related to the course events in this section.
By identifying the names of the requested URLs, we divide the course-related network traffic among three courses. Figure 5.30 shows the daily HTTP request traffic for the three courses, while Figure 5.31 shows the corresponding daily data volumes. We find that ASTR 209 accounts for 77.8% (1,231,339) of the requests and 99.4% (8,434 GB) of the data volume, while ASPH 213 (120,351 reqs, 46 GB) and ASPH 503 (2,613 reqs, 0.2 GB) generate minor traffic. The course ASPH 503 was not offered in Winter 2015,
Figure 5.30: HTTP Requests Count Per Day for ASTR209, ASPH213, and ASPH503
Figure 5.31: Data Volume (GB) Per Day for ASTR209, ASPH213, and ASPH503
Figure 5.32: HTTP Requests and Data Volume Per Day for the Six Categories (left panels: Video, Course Notes, Homework; right panels: Midterm, Outline, Final)

thus activities of this course are rarely seen in both figures. The courses ASTR 209 and ASPH 213 were both offered in Winter 2015. There are some HTTP requests retrieving ASPH 213 files; however, the ASTR 209 traffic dominates in both requests and data volume.
The course materials on the ISM server are organized with meaningful names. For example, “AST209/Entries/2015/1/28 Course Notes files/Part2 e&m.pdf” is a course note file for ASTR 209, and “AST209 Midterm1 info files/Formula sheet Midterm1.pdf” is midterm review material for ASTR 209. Therefore, by extracting information from the requested URLs, we classify the course-related requests into six categories: “Video”, “Course Note”, “Midterm” (midterm exam materials), “Outline” (course outline explanation materials), “Homework”, and “Final” (final exam materials).
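This URL-based classification can be sketched as a small rule-based matcher. The patterns below are illustrative guesses, not the exact rules used in our analysis:

```python
import re

# Illustrative, hypothetical classification rules; order matters, since the
# first matching pattern wins (e.g., "Midterm" before "Course Note").
CATEGORY_PATTERNS = [
    ("Video", re.compile(r"\.(mp4|mov|m4v)$", re.I)),
    ("Midterm", re.compile(r"midterm", re.I)),
    ("Final", re.compile(r"final", re.I)),
    ("Course Note", re.compile(r"course[ _]?notes?", re.I)),
    ("Outline", re.compile(r"outline", re.I)),
    ("Homework", re.compile(r"homework|assignment", re.I)),
]

def classify(url):
    """Map a requested URL to one of the six course-material categories."""
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.search(url):
            return name
    return "Other"

print(classify("AST209/Entries/2015/1/28 Course Notes files/Part2 e&m.pdf"))
print(classify("AST209 Midterm1 info files/Formula sheet Midterm1.pdf"))
```

Running the classifier over every logged URL and aggregating by day yields per-category time series like those in Figure 5.32.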
Figure 5.32 shows the daily HTTP requests and data volume for the six categories. Since the values for “Video” and “Course Note” are much larger than those of the other four categories, we use panels with different y-axis scales to clearly display the variation trend of each category. Our observations are as follows:
1) Outline and homework materials are popular near the beginning of the course. These are used by students to acquire a general understanding of the course.
2) Videos contribute the most requests and data volume for the ISM site. Students rely more heavily on the videos for the first midterm exam than they do for the second midterm or the final exam.
3) Course notes are the primary materials for students to study for the midterms and the final. Before the second midterm and the final, course notes generate more requests, comparable to the video requests.
4) The popularity of midterm exam materials increases dramatically before the midterms and the final, indicating that students’ reviewing period is relatively short. The surge of final exam materials before the final exam leads to a similar conclusion.
These network traffic trends align with the course events, which in turn indicates that real-world events heavily influence Web usage.
5.4 Summary
This chapter presented a detailed network usage analysis, as well as workload characteristics, for the ISM site. Our analysis started with the general HTTP traffic characteristics, including request counts, data volume, user IPs, user agents, HTTP methods, and response sizes.
Then we focused on the video streaming traffic and the video viewing patterns. A series of experiments were performed to measure the video streaming approach utilized by the ISM site, as well as the user agent behaviors. The results showed that the ISM implementation is inconvenient for users and inefficient from a networking perspective. We suggested the
HTML5 approach (or backward-compatible HTML5 approaches) for solving this issue.
This chapter ended with a detailed analysis of the course-related events and the corre- sponding network usage. Many inferences about students’ study patterns were provided.
Chapter 6
DISCUSSION
In this chapter, we first compare the workload characteristics of the Aurora site and the
ISM site. Then we revisit Arlitt and Williamson’s work [30, 31], to inspect whether the ten common characteristics they found in Web server workloads still apply to today’s scientific
Web sites. Finally, we perform a series of experiments on file transfer approaches to address the inefficiency issues discovered in the traffic of the Aurora site.
6.1 Comparative Analysis of Two Scientific Web Sites
We analyze the workload characteristics of the Aurora site and the ISM site to explore their similarities and differences. Table 6.1 shows overview statistics for the two sites. The request traffic to the Aurora site is around 100 times that of the ISM site, while the data volumes are of the same scale. Moreover, the numbers of unique URLs and IPs indicate that the Aurora site provides many more resources, to many more visitors.
Figure 6.1 presents a comparison of the daily traffic of the two sites. Since the request counts of the two sites are not of the same magnitude, Figure 6.1(a) is drawn in log-scale. It shows that the daily requests of the Aurora site always exceed those of the ISM site within our observation period, even during the ISM site surges caused by the exams. Nevertheless, Figure 6.1(b) shows that the daily data volume of the ISM site
Table 6.1: Statistical Characteristics of Two Scientific Web Sites (Jan 1/15 to Apr 29/15)

Site     Total Reqs    Avg Reqs/day   Total GB   Avg GB/day   Uniq URLs   Uniq IPs
Aurora   182,068,131   1,529,984      10,354     87.01        2,894,294   240,236
ISM      1,583,339     13,305         8,483      71.29        10,563      9,720
Figure 6.1: HTTP Traffic Overview for the Aurora and ISM Sites; (a) HTTP Requests Per Day (log-scale), (b) Data Volume in GB Per Day
Table 6.2: HTTP Method and HTTP Status Code Percentage

(a) HTTP Method
HTTP Method   Aurora   ISM
GET           88.4%    99.51%
HEAD          11.6%    0.49%

(b) HTTP Status Code
Status Code           Aurora   ISM
200 OK                95.59%   32.04%
206 Partial Content   0.14%    58.59%
304 Not Modified      1.99%    4.99%
404 Not Found         0.44%    2.99%

exceeds the Aurora site on several occasions, primarily during late-January, February, and mid-April. These results reflect the traffic characteristics of the two sites, as presented in earlier chapters.
The HTTP request-response statistics are shown in Table 6.2. The GET and HEAD methods dominate for both Web sites; Wget robots generated numerous HEAD requests, which makes the proportions of the two methods different for the Aurora site. The HTTP status codes for the two sites are quite different: most responses from the Aurora site are “200 OK”, while “206 Partial Content” dominates for the ISM site. This result is primarily caused by the prevalence of partial GETs for fetching the large video files.
Figure 6.2(a) shows the frequency-rank profile for the IP addresses observed at the Aurora and ISM sites. There is visual evidence of a Zipf-like power-law structure in the frequency distribution for each site, with similar slopes for each. Least-squares regression confirms
Figure 6.2: Frequency-Rank Profiles for the Aurora and ISM Sites; (a) Frequency-Rank Profile for IP Addresses, (b) Frequency-Rank Profile for URLs

a strong linear fit, with slope values of -1.88 (Aurora) and -1.92 (ISM), and R² values of 0.99 (Aurora) and 0.96 (ISM).
Figure 6.2(b) shows a frequency-rank analysis applied to the URLs requested on the Aurora and ISM sites. The ISM site shows visual evidence of a power-law structure, confirmed by linear regression with a slope of -1.68 and an R² of 0.92. However, the Aurora site does not fit a power-law structure well, with a slope of -1.23 and an R² of 0.72.
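The slopes and R² values above come from least-squares regression on log-transformed frequency-rank data. A minimal sketch of that computation (using synthetic Zipf-like counts rather than our log data):

```python
import math

def loglog_fit(counts):
    """Fit log10(frequency) = slope * log10(rank) + b by least squares,
    returning (slope, r_squared)."""
    counts = sorted(counts, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    b = my - slope * mx
    ss_res = sum((y - (slope * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, 1 - ss_res / ss_tot

# A perfect Zipf distribution (frequency proportional to 1/rank) should
# yield a slope near -1 with R² near 1.
counts = [round(100000 / rank) for rank in range(1, 1001)]
slope, r2 = loglog_fit(counts)
print(slope, r2)
```

A slope steeper than -1 with a high R², as we observe for both sites' IP profiles, indicates an even more concentrated popularity distribution than classic Zipf.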
The frequency-rank profiles for the ISM site indicate that its network traffic is largely generated by humans. Conversely, the URL frequency-rank profile for the Aurora site is heavily influenced by robot traffic: it has several distinct plateaus, mostly attributable to robots and indexing operations (recall that Wget may revisit a Web page several times while building the directory tree).
In summary, the Aurora site and the ISM site both have large effects on the campus network traffic. The network traffic of the ISM site has “expected” surges, while the Aurora site has “unexpected” surges, and its characteristics show strong evidence of robot-generated traffic.
6.2 Workload Characteristics Revisited
In the mid-1990s, Arlitt and Williamson [30, 31] analyzed the workloads of six data sets and summarized ten common characteristics of Web server workloads. Their work is a representative reference for Web workload characterization. In this section, we compare their findings to the workload characteristics of the two scientific Web sites.
The characteristics identified by Arlitt and Williamson [30, 31] cover a wide range of HTTP traffic characteristics. We select nine of them to study the differences between their work and the scientific Web sites. The “Remote Requests” characteristic is excluded, since all the HTTP requests we monitored are from remote visitors. The remaining nine characteristics are shown in Table 6.3.
6.2.1 Success Rate
This characteristic measures the proportion of successful responses in the network traffic. A request-response record is considered successful only if the requested object is found by the server and returned to the client successfully. Among the aforementioned HTTP response status codes, only “200 OK” and “206 Partial Content” meet this requirement. The success rates of the Aurora site and the ISM site are slightly higher than Arlitt and Williamson's result, reaching 95.7% and 90.6%, respectively.
6.2.2 File Types
This characteristic studies the types of objects requested. In their work, they found that 90-100% of requests retrieve HTML and image files from the server. For the Aurora site, our results show that 89.7% of the requests are for HTML and image files, accounting for 75.3% of the data volume. This aligns with their study, indicating that most of the traffic is for HTML and image files. For the ISM site, however, only 23.8% of the requests and 0.1% of the data volume are for HTML and image files.
Table 6.3: Comparison of Workload Characteristics
Characteristic Name | Description | Aurora Results | ISM Results
Success Rate | Success rate for lookups at server is about 88% | 95.7% | 90.6%
File Types | HTML and image files account for 90-100% of requests | 89.7% of requests, 75.3% of data volume | 23.8% of requests, 0.1% of data volume
Mean Transfer Size | Mean transfer size ≤ 21 kilobytes | 59.5 kilobytes | 5.5 megabytes
Distinct Requests | Less than 3% of the requests are for distinct files | 1.3% of requests, 16.0% of data volume | 0.4% of requests, 0% of data volume
One Time Referencing | Around 33% of the files and bytes are accessed only once | 82.9% of files, 98.4% of volume | 62.0% of files, 0.6% of volume
Size Distribution | File size distribution is Pareto with 0.40 < α < 0.63 | Does not fit Pareto | Does not fit Pareto
Concentration of References | 10% of the files accessed account for 90% of server requests and transferred bytes | 98.4% of requests, 83.6% of volume | 98.4% of requests, 99.98% of volume
Wide Area Usage | 10% of the visitors' domains account for ≥ 75% of usage | 10% of IPs for 99.2% of requests | 10% of IPs for 93.9% of requests
Inter-Reference Times | File inter-reference times are exponentially distributed and independent | Not exponentially distributed | Not exponentially distributed
6.2.3 Mean Transfer Size
The mean transfer size is the average data volume transferred per response. They studied six data sets and determined that the mean transfer size was less than 21 KB. Our results show that the mean transfer size of the Aurora site is 59.5 KB, more than double their value. This may be caused by the particular Web content of the Aurora site, such as the images and videos. The mean transfer size of the ISM site is 5.5 MB, much larger than their result. This meets our expectation, since the ISM site primarily serves large videos. Flix et al. [52] also studied the distribution of HTTP response sizes and noted that the contents of Web pages have a great influence on response size values.
6.2.4 Distinct Requests
This characteristic measures how many requests retrieve distinct files. A file in the server is considered as a distinct file based on its file name. The result is calculated by summing up all distinct file requests and data volume. In this study, we eliminate the query strings appended to the URLs (see Section 4.1.6), and treat each unique URL as a file. Our results align with Arlitt and Williamson’s studies.
6.2.5 One Time Referencing
This characteristic considers files that are accessed only once. Since our log system only records the response size values, which may change along with the updating of the files (e.g., the live images refreshed by remote cameras in the Aurora site) or partial GET requests, we treat each unique URL as a unique file and assume its size is the average response data volume. Our results (82.9% for Aurora site, 62.0% for ISM site) are quite different from their result (33%), which may be caused by robots crawling unpopular files and the traffic concentration on extremely popular files.
6.2.6 Size Distribution
Arlitt and Williamson found that the file size distribution matches well with the Pareto distribution (0.40 < α < 0.63). Our overall file size distributions are quite different (each unique URL is treated as a unique file, and the corresponding file size is its average response data volume). They are dominated by the JPEG image and video sizes, which results in a truncated distribution with no heavy tail.
6.2.7 Concentration of References
This characteristic studies how concentrated the requests are on a small set of popular files. Our results are consistent with their findings.
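This concentration measure is straightforward to compute from per-URL request counts. A minimal sketch (the counts below are hypothetical, not taken from our logs):

```python
def concentration(counts, top_fraction=0.10):
    """Fraction of total requests accounted for by the most popular
    `top_fraction` of files, given per-file request counts."""
    counts = sorted(counts, reverse=True)
    top_n = max(1, int(len(counts) * top_fraction))
    return sum(counts[:top_n]) / sum(counts)

# Hypothetical skewed popularity: one hot file, many cold ones.
counts = [900] + [1] * 99  # 100 files, 999 requests in total
print(concentration(counts))  # the top 10 files serve (900 + 9) / 999
```

The same function applied to per-file byte totals instead of request counts gives the volume-based concentration reported in Table 6.3.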
6.2.8 Wide Area Usage
They found that Web servers are accessed by thousands of domains, with 10% of the visitors accounting for ≥ 75% of usage. For the Aurora site, we find that the top 10% of IPs account for 99.2% of the requests; its network traffic is heavily influenced by a few robots. For the ISM site, the top 10% of IPs account for 93.9% of the requests; the traffic is primarily generated by a small group of highly active IPs.
6.2.9 Inter-Reference Times
The inter-reference time is the interval between two successive requests for the same file. Arlitt and Williamson found that file inter-reference times are exponentially distributed and independent. However, this result holds for neither the Aurora site nor the ISM site.
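One quick check for exponentiality, under the assumption of independent gaps, is the coefficient of variation (CV) of the inter-reference times: an exponential distribution has CV = 1, so a CV far from 1 argues against it. A minimal sketch (the timestamps below are illustrative):

```python
import statistics

def inter_reference_cv(timestamps):
    """Coefficient of variation of the inter-reference times for one file.
    A CV near 1 is consistent with an exponential distribution."""
    times = sorted(timestamps)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# Perfectly periodic references (e.g., a robot polling every minute):
# CV = 0, clearly not exponential.
print(inter_reference_cv([0, 60, 120, 180, 240]))  # prints 0.0
```

Robot polling tends to push the CV toward 0, while bursty human activity around exams pushes it well above 1; both effects appear in our traces.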
In summary, some of the characteristics still apply to the network traffic of the two scientific Web sites. Others do not, which may be explained by robot traffic, different user behaviors, and site-specific characteristics.
Figure 6.3: Illustration of the File Transfer Methods Experiment (a Web client fetches from a Node.js Web server acting as a proxy for the Aurora server; the first round downloads 37,655 files totaling 53 MB, and the second round updates 4 new files totaling 324 KB)
6.3 Network Efficiency Analysis
To close our study, we return to the network (in)efficiency issues. Usually, the criterion for judging an implementation approach weighs the resources consumed against the outcomes achieved; this criterion also applies to network traffic measurement research. In this section, we first study the behavior of the Wget software and compare it to other file transfer approaches in terms of the time and bandwidth consumed to achieve the same goal. Then we propose an efficient alternative to the JavaScript “cache-busting” approach.
6.3.1 File Transfer Methods
To evaluate file transfer methods, an active measurement experiment is performed both in a LAN test environment and in a SoftLayer¹ cloud-based WAN environment. A simple Node.js Ecstatic Web server [1], holding a subset of the Aurora site content (37,655 files, 53 MB in total), is deployed on the server side, and software tools on the client side are launched to fetch files from the server. The Node.js server serves static files, mimicking the functions provided by the Aurora server.
Figure 6.3 shows our experimental setup. The Node.js server acts as a proxy server, providing files fetched from the Aurora server to our Web client. The evaluated file transfer approaches are the following:
¹ SoftLayer cloud service, http://www.softlayer.com/
• Wget is a free command-line tool used for downloading files from the Internet.
It supports HTTP, HTTPS, and FTP protocols. Wget can be configured to
download a single file, or multiple files in a directory. Moreover, it can be used
to retrieve files based on given filter rules (e.g., files with specific extension
or name) or even recursively mirror directories in the server. Furthermore,
it supports useful options like downloading updated files only (by checking
time-stamps) and disabling courteous robot operations.
• rsync² is a well-known free command-line software tool providing fast file
transfer services in Unix/Linux systems. It doesn’t support FTP or HTTP
protocols; it only supports SSH (Secure Shell), RSH (Remote Shell), and
RSYNC. Both endpoints must be running rsync when transferring files between them. rsync also offers a daemon (background process) mode, which allows clients to synchronize files with the server at any time.
rsync is implemented with a highly efficient algorithm, which divides files into
chunks and only transfers the modified chunks by performing rolling checksums
and MD5 checksums [9]. It also supports compressed file transfers for saving
network bandwidth, as well as directory-level recursive transfers.
• scp [24] is short for secure copy, which is software to perform secure file trans-
fers between two endpoints. It is a widely-used terminal tool, which is pre-
installed in Linux and Mac OS. It utilizes the SSH protocol to build connec-
tions between the two endpoints, thus it doesn’t support HTTP and FTP
protocols. Furthermore, it does not support transferring only modified files; in other words, it downloads all the files again even if the local copy is up to date.
• curl [15] is a command-line tool providing data transferring functions. It
2GNU rsync, https://rsync.samba.org/
supports a wide variety of protocols, including FTP, HTTP, SCP, Telnet, etc.
However, it only supports single-file transfers, and does not support directory-level recursive file transfers.
• lftp3 is a command-line file transfer program supporting protocols such as
FTP, HTTP, HTTPS, etc. It supports directory-level recursive downloads
and downloading only updated files.
• HTTrack4 is a free Web resource retrieving software with command-line and
GUI (Graphical User Interface) versions. It supports FTP, HTTP, and HTTPS
protocols, as well as updated-file-only download and recursive download functions.
• zsync5 is software combining the rsync fast file transfer algorithm with an HTTP-based implementation. However, it currently does not support directory-level recursive file transfers.
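The rolling weak checksum at the heart of rsync's delta-transfer algorithm [9] can be illustrated with a short Python sketch. This is a simplified version of the idea only (the constants, block matching, and MD5 verification of the real rsync are omitted); the point is that sliding the window costs O(1) per byte instead of rescanning the whole block:

```python
M = 1 << 16  # modulus used by the weak checksum

def weak_checksum(block):
    """Compute the two-part weak checksum (a, b) over a block of bytes."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, old_byte, new_byte, n):
    """Slide the window one byte: drop old_byte, append new_byte.
    Updates (a, b) in O(1) instead of rescanning the whole block."""
    a = (a - old_byte + new_byte) % M
    b = (b - n * old_byte + a) % M
    return a, b

data = bytes(range(50)) * 4
n = 16
a, b = weak_checksum(data[:n])
for i in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[i - 1], data[i + n - 1], n)
    assert (a, b) == weak_checksum(data[i:i + n])  # rolling matches direct
```

This cheap checksum lets rsync scan a file at every byte offset for blocks the other endpoint already has, falling back to a strong (MD5) check only on weak matches.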
From the aforementioned approaches, we select the software tools that support both transferring only modified files and directory-level recursive file transfers, to perform our experiment. Table 6.4 provides a summary of our selected methods for file transfer and/or synchronization between multiple sites.
Table 6.4 shows the results from our experiments. In “Initial Copy”, all of the software tools are invoked to retrieve the files in the “data/themis/fluxgate/stream2/2015/” folder (January 1 to April 27) from our server. In “Subsequent Copy”, the data for April 28 (4 new data files) is added to the server and all of the tools are executed again. The number of files, data volume, and elapsed time are recorded for the two file transfer experiments.
From the results in Table 6.4, and the server-side logs, we find:
3GNU lftp, http://lftp.yar.ru/
4GNU HTTrack, https://www.httrack.com/
5Zsync, http://zsync.moria.org.uk/
Table 6.4: Experimental Results for File Transfer/Synch Methods

                        Initial Copy                Subsequent Copy
Software Tool           Files    Volume  Time       Files  Volume  Time
Wget (v. 1.11.4)        37,782   62 MB   5m 13s     132    9.9 MB  2m 49s
Wget (v. 1.13)          37,782   62 MB   5m 1s      132    9.9 MB  2m 41s
Wget (v. 1.16)          37,782   62 MB   4m 52s     132    9.9 MB  2m 29s
rsync (v. 3.1.1)        37,655   16 MB   49.8s      4      873 KB  14.6s
lftp (v. 4.6.0)         37,655   53 MB   27m 58s    4      316 KB  2m 9s
HTTrack (v. 3.48-19)    37,782   62 MB   2h 52m     132    19 MB   1h 18m
Remote File Transfer between Campus Network and Cloud Service
Wget (v. 1.11.4)        37,782   62 MB   1h 03m     132    9.9 MB  29m 32s
Wget (v. 1.16)          37,782   62 MB   1h 02m     132    9.9 MB  31m 22s
rsync (v. 3.1.1)        37,655   16 MB   23.3s      4      873 KB  5.4s
1) Wget supports FTP and HTTP(S), which is convenient and flexible. However, it takes a long time to retrieve all the files, downloading additional HTML files for extracting the
file directory information (this is slow when each HTML page has many links). For the
LAN test, there is a small difference in time consumption for the three different versions of
Wget. Wget 1.11.4 is a little slower than versions 1.13 and 1.16. Wget 1.11.4 does not support HTTP/1.1, so some of its requests result in closed connections. Note that Wget 1.13 was the first released version to support HTTP/1.1. Wget 1.11.4 supports “Keep-Alive” connections with HTTP/1.0, if the server side permits it. For the WAN test, the time consumption for Wget 1.11.4 and Wget 1.16 is about the same.
2) rsync has the best performance in both LAN and WAN tests, taking less than a minute to download/update all the files. Furthermore, it applies the zlib library7 to reduce data volume by about a factor of 4 in this case. The drawback is that rsync doesn’t support
HTTP. Nevertheless, the server can provide anonymous rsync service with a daemon running in the background. It is interesting that rsync uses less time in the WAN test than in the LAN test.
Note that rsync adopts a specially designed algorithm to check the differences between files on the server side and the client side, which is CPU-intensive. Therefore, the WAN test may
6GNU Wget news, history of user-visible changes, http://bzr.savannah.gnu.org/lh/wget/trunk/annotate/head:/NEWS
7Zlib, http://www.zlib.net/
take less time because the cloud server is more capable than the shared server in the LAN test, even though the network in the WAN test is slower.
3) HTTrack and lftp perform worse than Wget, requiring a long time to complete the task. lftp is executed with the FTP service, which saves the bandwidth of downloading additional HTML files.
According to these results, we provide three suggestions as follows:
1) rsync has better performance than Wget. Therefore, the performance of the Aurora server can be improved by providing rsync services specifically to robots like UCB, and general HTTP services to the public.
2) An up-to-date version of Wget (with persistent connections) is preferable for all Web robots, especially for wide-area network transfers with non-negligible round-trip times.
3) Since the data files in the Aurora site are usually updated once each day, there is no need to execute Wget several times a day to generate repetitive requests. Also, it would be more efficient if the Aurora server provided a meta-file indicating which files are updated each day, like the RSS (Rich Site Summary) feed approach.
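The meta-file idea above could be prototyped along the following lines. This is a sketch under assumptions: the manifest format (one relative file path per line) and its URL pattern (updated-YYYYMMDD.txt) are hypothetical, to be chosen by the Aurora site operators. The fetch function is injectable so the logic can be exercised without a live server:

```python
from urllib.request import urlopen

def fetch_updated(base_url, date_str, fetch=lambda url: urlopen(url).read()):
    """Download only the files listed in the day's update manifest.

    base_url: server root, e.g. "http://example.org/data" (illustrative)
    date_str: e.g. "20150428"; the manifest naming scheme is an assumption
    fetch:    function mapping a URL to its response body
    """
    manifest = fetch(f"{base_url}/updated-{date_str}.txt").decode()
    files = {}
    for path in manifest.splitlines():
        path = path.strip()
        if path:  # skip blank lines in the manifest
            files[path] = fetch(f"{base_url}/{path}")
    return files
```

With such a manifest, a robot makes one small request per day to learn what changed, instead of crawling HTML directory listings for tens of thousands of unchanged files.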
6.3.2 JavaScript Cache-Busting Solution
Many referer sites (e.g., AuroraMAX) use “cache-busting” to obtain the latest image from the Aurora server, instead of reading from cached files. We propose an AJAX HEAD implementation for solving this inefficiency problem, with three primary considerations:
1) The referer sites’ developers want users to be able to view new images as soon as possible. Therefore, simply increasing the refresh interval is not an appropriate approach for guaranteeing the same user experience.
2) The approach should not increase the load on the referer Web servers (e.g., a proxy server).
Our approach has the browser send HEAD requests every few seconds to check whether the image has changed. If it has, the browser issues a GET request with the “cache-busting” technique to fetch the new image. The image files are still transferred between the Aurora server and the browser.

Figure 6.4: HTTP Response Size Results for the Two Methods
3) The changes made to the current Aurora server and referer Web pages should be minimal. The AJAX HEAD method only requires the Aurora server to support cross-origin resource sharing (CORS)8 by adding one line to the Apache server configuration file, which is needed to bypass the same-origin policy.
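The polling logic of the AJAX HEAD method can be emulated in Python (the referer pages themselves would implement it in JavaScript; the function and callback names here are illustrative). The client issues a cheap HEAD request each interval, compares the Last-Modified header against the previously seen value, and only issues a cache-busting GET when the image has actually changed:

```python
import time
from urllib.request import Request, urlopen

def poll_latest_image(image_url, on_image, rounds=10, interval=2.0):
    """Emulate the AJAX HEAD scheme: a cheap HEAD request per round,
    and a full (cache-busted) GET only when the image changes."""
    last_seen = None
    fetched = 0
    for _ in range(rounds):
        head = urlopen(Request(image_url, method="HEAD"))
        stamp = head.headers.get("Last-Modified")
        if stamp != last_seen:
            # Image changed: fetch it, defeating caches with a unique query.
            body = urlopen(f"{image_url}?t={time.time()}").read()
            on_image(body)  # hand the new image to the page for display
            fetched += 1
            last_seen = stamp
        if interval:
            time.sleep(interval)
    return fetched
```

While the image is unchanged, each round costs only a small header exchange instead of a full image download, which is the savings Figure 6.4 measures.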
We performed an active measurement experiment comparing the bandwidth consumption of “cache-busting” with our AJAX HEAD implementation. To be specific, we wrote a Python script to download the latest image from the Aurora server every 30 seconds (the image changes every 30 seconds). We emulate the two methods by rendering two HTML pages in the Chrome (v. 45) browser (the results are similar for the latest versions of popular browsers, such as Safari and Firefox). In our experiment, only one client visits the “referer site”; in practical deployment, the bandwidth savings would scale with the number of Internet users.
Figure 6.4 shows the HTTP response sizes from the server-side logs during 5 minutes of observation. The x-axis shows the time series, while the y-axis shows the data volume
8Cross-origin resource sharing, https://en.wikipedia.org/wiki/Cross-origin_resource_sharing
transferred. The AJAX HEAD method dramatically reduces the bandwidth usage, compared to the “cache-busting” approach.
6.4 Summary
This chapter discussed workload characteristic comparisons and network efficiency issues.
First, we compared the workload characteristics of the Aurora site and the ISM site. We identified the influence of robot-generated traffic on the Aurora site.
Next, we compared the workload characteristics of the Aurora site to the characteristics found by Arlitt and Williamson [30,31]. We found that some characteristics still align with the traffic of the Aurora site, while others don’t.
Finally, we closed this chapter with a study of the efficiency of several different file transfer approaches and an alternative AJAX HEAD implementation. We provided several suggestions regarding the Aurora site based on the experimental results.
Chapter 7
CONCLUSIONS
This chapter summarizes our workload characterization study of the two scientific Web sites. To start with, we present an overall summary of this thesis. Then we review detailed characteristics of the two Web sites. Finally, conclusions and future work are presented.
7.1 Thesis Summary
Overall, this thesis studies the workload characteristics of two scientific Web sites hosted by the Department of Physics and Astronomy at the University of Calgary. By measuring the incoming and outgoing HTTP traffic collected over a four-month period (from January
1, 2015 to April 30, 2015), the Aurora site and the ISM site are selected and thoroughly analyzed. Our study primarily focuses on the user behaviors and network usage of the two scientific sites. We identify highly redundant traffic in the Aurora site, and unfriendly Web page design issues in the ISM site. Several experiments and suggestions are made for solving these network inefficiencies.
In detail, we summarize the chapters as follows:
1) Chapter 1 introduces the role of scientific Web sites in the modern Internet, and briefly states the motivation for our work.
2) Chapter 2 introduces basic background knowledge on computer networks. The appli- cation protocol HTTP and its underlying TCP/IP protocols are discussed. We also present a series of related studies involving network traffic measurement, Web robots, video streaming techniques, and scientific Web site analysis.
3) Chapter 3 describes how the network traffic data is monitored and collected. We introduce our methodology, including the hardware deployment, the Bro logging framework, and the preprocessing applied to the data.
4) Chapter 4 analyzes the network traffic of the Aurora site.
5) Chapter 5 analyzes the network traffic of the ISM site.
6) Chapter 6 compares the workload characteristics of the Aurora site and the ISM site, revisits a Web server workload study, compares its results to the scientific Web sites workload, and discusses inefficiency issues.
7.2 Scientific Web Site Characterization
We review the characteristics of the Aurora site (Chapter 4) and the ISM site (Chapter 5) in this section.
7.2.1 The Aurora Site
The Aurora site provides auroral research information to the public. The Web traffic to this site contributes over 1.5 million HTTP requests and 87 GB of data volume each day.
The IP address analysis shows that most visitors are from Canada (39.50%) and the United States (15.67%), while the traffic is dominated by the United States (73.22% of requests). We identified several robots among the most popular IPs, and further studied the robots from the University of California at Berkeley (UCB) and the University of Alaska (UA). The results show that the UCB robot uses Wget to periodically mirror content from the Aurora site, generating 0.75 million requests per day and 90 million requests in total. In addition, we found that 11.6% of requests are HEAD requests, 99.7% of which are generated by Wget.
This characteristic differs from normal Web sites, where HEAD requests are rarely seen.
We analyzed the referer field in the HTTP request headers, and found that many referrer sites showcase live images and videos from the Aurora site, especially the CSA AuroraMAX portal. The URL and file information analysis shows that HTML and image resources are dominant with 89.7% of requests and 75.3% of transferred data volume. We compared these
workload characteristics to the common server workload characteristics from a prior study.
We found that some characteristics do not hold for the Aurora site.
We studied the network traffic generated by the UCB robots and the UA robots. The approaches used by the UCB robots are inefficient, since they crawl the Aurora site repeatedly while data files rarely change. A series of experiments in Chapter 6 show that Wget is inefficient for file transfers, especially the old version of Wget used by the UCB robot. Many referrer sites like CSA AuroraMAX use JavaScript to fetch live images from the Aurora site every few seconds. This “cache busting” technique is inefficient, since the browser generates
GET requests to download images even when the image files are up-to-date.
Finally, we studied the correlations between the network traffic and the real-world events.
We found that the traffic of the Aurora site was heavily influenced by the geomagnetic storm in mid-March 2015, when the traffic quadrupled (“unexpected” surge) from the normal level.
7.2.2 The ISM Site
The ISM site is a scientific site providing lecture materials to undergraduate students. The
ASTR 209 course, taken by 400 students in Winter 2015, is the primary traffic contributor.
The ISM site generates 13,000 HTTP requests and 71 GB of data volume per day. Compared to the Aurora site, the ISM site delivers a similar data volume with far fewer requests. We found that the lecture videos posted on the ISM site generate the majority of the network traffic. Further analysis shows that the traffic is primarily human-generated, with 88.24% of visitors coming from Canada. The URL analysis shows that video files and lecture slides are the most frequently retrieved. For the HTTP header information, we found that 99.51% of requests use the GET method, 58.59% of response status codes are “206 Partial Content”, and 32.04% are “200 OK”. This result is caused by browser behavior when fetching large video and PDF files. We also analyzed the user agent information, and found that most visitors use Apple or Windows operating systems with relatively new versions of browsers.
We measured the video traffic and video viewing patterns separately. We found that most
traffic (partial content requests and responses) on the ISM site is for video files. Furthermore, we found that most response sizes and HTTP session durations are small. Experiments with browsers and video players show that user agents generate partial requests with small byte ranges, which increases the total number of requests tremendously. We also compared the
ISM’s video streaming implementation to the modern HTML5 approach and MPEG-DASH.
The results indicate that the ISM site implementation is unfriendly to network usage. In addition, we suggest that the HTML5 approach be adopted for improving video playing implementation on the ISM site.
By extracting information from the requested URLs, we analyzed user behaviors and network usage around course-related events. We found that course-related events like midterm and final exams have strong impacts (“expected” surges) on the network traffic. We also observed several patterns in students’ reviewing behavior. For example, we found that students relied on the videos more for the first midterm than for the second midterm and the final.
7.3 Conclusions
This thesis studies the workload characteristics of the Aurora site and the ISM site over a four-month period. We analyzed the HTTP traffic thoroughly, and performed a series of additional experiments for studying network usage efficiencies. Our conclusions are presented as follows:
1) Scientific Web sites can generate extremely large volumes of Internet traffic, even when the user community is seemingly small.
2) The Aurora site has a few workload characteristics that differ from the results in a prior study, while the remaining characteristics align with those results.
3) Significant robot traffic is observed in the Aurora site, which increases network resource usage.
4) A large fraction of the observed network traffic in the Aurora site is highly redundant,
and can be reduced significantly with more efficient networking solutions.
5) The ISM site has unique workload characteristics, handling partial content requests as a video streaming Web site.
6) The video display implementation in the ISM site is unfriendly to users, and inefficient for network usage. The modern HTML5 approach is an alternative for addressing those issues.
7) For both scientific Web sites, real-world events can have surprisingly large impacts on the network traffic. The surge in the Aurora site is “unexpected”, while in the ISM site the surges are “expected”.
8) For the Aurora site, more intelligent ways of sharing the openly accessible data are needed. For the ISM site, users are encouraged to keep their browsers up-to-date, to take advantage of the latest video technologies and more secure browsing.
7.4 Future Work
Although many analyses were done in this thesis, this work could be extended in several ways:
1) Our passive network traffic measurement study focused on Bro’s HTTP logs, since both sites provide content by HTTP, and there is only limited information collected in other logs. One could analyze all connection logs with an enhanced logging system. Furthermore, one could obtain log information directly from the server side (e.g., the Aurora server access logs) to: a) capture the campus internal network traffic, b) avoid missing data caused by
Bro outages, and c) gain more request-response header information for data comparison and correction. Such an approach would be worthwhile, since the characteristics in the prior study are based on server-side logs.
2) In the Aurora site, robot traffic is identified and analyzed using simple heuristics. We may use some statistics or data mining approaches to detect and extract all robot traffic, for
analyzing pure robot traffic as well as non-robot traffic.
3) A series of small-scale comparison experiments were performed on file transfer methods.
More large-scale simulations or practical tests can be conducted to better understand the network usage, particularly across a wide-area network scenario.
4) We observed that Safari issues many small partial requests when fetching video files.
We may further analyze the advantages and disadvantages of this approach, by comparing to other browsers.
5) For the video playing implementations, we compared the ISM site’s approach to an HTML5 implementation and MPEG-DASH. We may further study the network usage of other approaches (e.g., Apple HTTP Live Streaming), to find out whether they are better choices for the ISM site.
6) In our four-month observation period, we witnessed strong correlations between real- world events and network traffic. We may examine this discovery by analyzing more scientific
Web sites with a longer period of observation.
References
[1] Ecstatic, a simple static file server middleware for Node.js. https://github.com/
jfhbrook/node-ecstatic.
[2] Endace DAG Data Capture Cards. http://www.emulex.com/products/
network-visibility-products-and-services/endacedag-data-capture-cards/
specifications/.
[3] How does video streaming work? [Part 1: Progressive Download]. http://mingfeiy.
com/progressive-download-video-streaming.
[4] How does video streaming work? [Part 2: Traditional Streaming]. http://mingfeiy.
com/traditional-streaming-video-streaming.
[5] How does video streaming work? [Part 3: HTTP-based Adaptive Streaming]. http:
//mingfeiy.com/adaptive-streaming-video-streaming.
[6] IETF, RFC 1945, Hypertext Transfer Protocol – HTTP/1.0. http://tools.ietf.
org/html/rfc1945.
[7] IETF, RFC 2068, 19.7.1 Compatibility with HTTP/1.0 Persistent Connections. https:
//www.ietf.org/rfc/rfc2068.txt.
[8] IETF, RFC 2616, Hypertext Transfer Protocol – HTTP/1.1. http://www.ietf.org/
rfc/rfc2616.txt.
[9] The rsync algorithm. https://rsync.samba.org/tech_report/tech_report.html.
[10] Time History of Events and Macroscopic Interactions during Substorms (THEMIS).
http://cse.ssl.berkeley.edu/artemis/mission-overview.html.
[11] W3C, RFC 2616, 5.1.1 Method. http://www.w3.org/Protocols/rfc2616/
rfc2616-sec5.html#sec5.1.1.
[12] Wikipedia, Adaptive Bitrate Streaming. https://en.wikipedia.org/wiki/
Adaptive_bitrate_streaming.
[13] Wikipedia, Aurora. https://en.wikipedia.org/wiki/Aurora.
[14] Wikipedia, Browser Support Condition for HTML5 Video. https://en.wikipedia.
org/wiki/HTML5_video#Browser_support.
[15] Wikipedia, cURL. https://en.wikipedia.org/wiki/CURL.
[16] Wikipedia, Internet Protocol. https://en.wikipedia.org/wiki/Internet_
Protocol.
[17] Wikipedia, Internet Protocol Suite. https://en.wikipedia.org/wiki/Internet_
protocol_suite.
[18] Wikipedia, iOS. https://en.wikipedia.org/wiki/IOS.
[19] Wikipedia, List of HTTP Header Fields. https://en.wikipedia.org/wiki/List_of_
HTTP_header_fields.
[20] Wikipedia, OpenCourseWare. https://en.wikipedia.org/wiki/OpenCourseWare.
[21] Wikipedia, OSX. https://en.wikipedia.org/wiki/OS_X.
[22] Wikipedia, Robots Exclusion Standard. http://en.wikipedia.org/wiki/Robots_
exclusion_standard.
[23] Wikipedia, RSS. https://en.wikipedia.org/wiki/RSS.
[24] Wikipedia, Secure Copy. https://en.wikipedia.org/wiki/Secure_copy.
[25] Wikipedia, Transport Layer Security. https://en.wikipedia.org/wiki/Transport_
Layer_Security.
[26] Wikipedia, Windows Web browsing share as of March 2015. https://en.wikipedia.
org/wiki/Microsoft_Windows#Usage_share.
[27] Wikipedia, World Wide Web. https://en.wikipedia.org/wiki/World_Wide_Web.
[28] Wikipedia, Zipf’s Law. https://en.wikipedia.org/wiki/Zipf’s_law.
[29] P. Ameigeiras, J. Ramos-Munoz, J. Navarro-Ortiz, and J. Lopez-Soler. Analysis and
Modelling of YouTube Traffic. Transactions on Emerging Telecommunications Tech-
nologies, 23(4):360–377, June 2012.
[30] M. Arlitt and C. Williamson. Web Server Workload Characterization: The Search for
Invariants. In Proceedings of the ACM SIGMETRICS, pages 126–137, Philadelphia,
PA, USA, May 1996.
[31] M. Arlitt and C. Williamson. Internet Web Servers: Workload Characterization and
Performance Implications. IEEE/ACM Transactions on Networking (ToN), 5(5):631–
645, October 1997.
[32] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida. Characterizing User Behav-
ior in Online Social Networks. In Proceedings of the 9th ACM Internet Measurement
Conference, pages 49–62, Chicago, Illinois, USA, November 2009.
[33] T. Berners-Lee. The World-Wide Web. Computer Networks and ISDN Systems,
25(4):454–459, November 1992.
[34] C. Bomhardt, W. Gaul, and L. Schmidt-Thieme. Web robot detection-preprocessing web
logfiles for robot detection. Springer, 2005.
[35] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like
Distributions: Evidence and Implications. In Proceedings of IEEE INFOCOM, pages
126–134, March 1999.
[36] M. Butkiewicz, H. Madhyastha, and V. Sekar. Understanding Website Complexity:
Measurements, Metrics, and Implications. In Proceedings of the ACM Internet Mea-
surement Conference, pages 313–328, Berlin, Germany, November 2011.
[37] T. Callahan, M. Allman, and V. Paxson. A Longitudinal View of HTTP Traffic. In
Proceedings of the Passive and Active Network Measurement Conference, pages 222–231,
Zurich, Switzerland, April 2010.
[38] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I Tube, You Tube, Everybody
Tubes: Analyzing the World’s Largest User Generated Content Video System. In
Proceedings of the 7th ACM Internet Measurement Conference, pages 1–14, San Diego,
California, USA, October 2007.
[39] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. Analyzing the Video Pop-
ularity Characteristics of Large-scale User Generated Content Systems. IEEE/ACM
Transactions on Networking (TON), 17(5):1357–1370, October 2009.
[40] M. Crovella and A. Bestavros. Self-similarity in World Wide Web Traffic: Evidence
and Possible Causes. In Proceedings of the ACM SIGMETRICS, pages 160–169, May 1996.
[41] C. Cunha, A. Bestavros, and M. Crovella. Characteristics of WWW Client-based
Traces. Technical report, BUCS-1995-010, Computer Science Department, Boston Uni-
versity, July 1995.
[42] M. Dikaiakos, A. Stassopoulou, and L. Papageorgiou. An Investigation of
Web Crawler Behavior: Characterization and Metrics. Computer Communications,
28(8):880–897, May 2005.
[43] M. Dikaiakos, A. Stassopoulou, and L. Papageorgiou. Characterizing Crawler Behavior
from Web Server Access Logs. In Proceedings of the E-Commerce and Web Technologies
Conference, pages 369–378, Prague, Czech Republic, September 2003.
[44] D. Doran and S. Gokhale. Web Robot Detection Techniques: Overview and Limita-
tions. Data Mining and Knowledge Discovery, 22(1-2):183–210, January 2011.
[45] D. Doran, K. Morillo, and S. Gokhale. A Comparison of Web Robot and Human
Requests. In Proceedings of the IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining, pages 1374–1380, Niagara, Ontario, CAN, August
2013.
[46] A. Eldin, A. Rezaie, A. Mehta, S. Razroev, S. Sjostedt-de Luna, O. Seleznjev, J. Tordsson, and E. Elmroth. How Will Your Workload Look Like in 6 Years? Analyzing
Wikimedia’s Workload. In Proceedings of the IEEE International Conference on Cloud
Engineering, pages 349–354, Boston, Massachusetts, USA, March 2014.
[47] A. Faber, M. Gupta, and C. Viecco. Revisiting Web Server Workload Invariants in
the Context of Scientific Web Sites. In Proceedings of the ACM/IEEE Conference on
Supercomputing, pages 25–38, Tampa, Florida, USA, November 2006.
[48] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube Traffic Characterization: A View
from the Edge. In Proceedings of the ACM Internet Measurement Conference, pages
15–28, San Diego, California, USA, October 2007.
[49] D. Gourley, B. Totty, M. Sayer, A. Aggarwal, and S. Reddy. HTTP: The Definitive
Guide. Definitive Guides. O’Reilly Media, 2002.
[50] L. Guo, E. Tan, S. Chen, X. Zhang, and Y. Zhao. Analyzing Patterns of User Content
Generation in Online Social Networks. In Proceedings of the ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining, pages 369–378, Paris,
France, June 2009.
[51] L. Gyarmati and T. Trinh. Measuring User Behavior in Online Social Networks. IEEE,
Network, 24(5):26–31, September 2010.
[52] F. Hernández-Campos, K. Jeffay, and F. Smith. Tracking the Evolution of Web Traf-
fic: 1995-2003. In Proceedings of the 11th IEEE/ACM International Symposium on
MASCOTS, pages 16–25, Orlando, Florida, USA, October 2003.
[53] S. Ihm and V. Pai. Towards Understanding Modern Web Traffic. In Proceedings of the
ACM Internet Measurement Conference, pages 295–312, Berlin, Germany, November
2011.
[54] S. Kolay, P. D’Alberto, A. Dasdan, and A. Bhattacharjee. A Larger Scale Study of
Robots.txt. In Proceedings of the 17th International World Wide Web Conference,
pages 1171–1172, Beijing, China, April 2008.
[55] J. Kurose and K. Ross. Computer Networking: A Top-Down Approach. Pearson Edu-
cation, 2012.
[56] M. Kwong. Northern Lights to Dazzle in Skies Across Canada. http://www.cbc.
ca/news/technology/northern-lights-to-dazzle-in-skies-across-canada-1.
2998691.
[57] J. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection.
http://snap.stanford.edu/data, June 2014.
[58] H. Li, W. Lee, A. Sivasubramaniam, and C. Giles. Workload Analysis for Scientific
Literature Digital Libraries. International Journal on Digital Libraries, 9(2):139–149,
November 2008.
[59] S. Lin, Z. Gao, and K. Xu. Web 2.0 Traffic Measurement: Analysis on Online Map
Applications. In Proceedings of the ACM International Workshop on Network and Oper-
ating Systems Support for Digital Audio and Video, pages 7–12, Williamsburg, Virginia,
USA, June 2009.
[60] Y. Liu, Q. Wei, L. Guo, B. Shen, S. Chen, and Y. Lan. Investigating Redundant Internet
Video Streaming Traffic on iOS Devices: Causes and Solutions. IEEE Transactions on
Multimedia, 16(2):510–520, November 2014.
[61] A. Mahanti, C. Williamson, and D. Eager. Traffic Analysis of a Web Proxy Caching
Hierarchy. IEEE, Network, 14(3):16–23, May 2000.
[62] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee.
Measurement and Analysis of Online Social Networks. In Proceedings of the ACM
Internet Measurement Conference, pages 29–42, San Diego, California, USA, October
2007.
[63] A. Morais, J. Raddick, and R. C. dos Santos. Visualization and Characterization of
Users in a Citizen Science Project. In Proceedings of the SPIE, Next-Generation Analyst
Conference, pages 87580L–12, Baltimore, Maryland, USA, May 2013.
[64] V. Paxson. Empirically Derived Analytic Models of Wide-area TCP Connections.
IEEE/ACM Transactions on Networking (TON), 2(4):316–336, August 1994.
[65] V. Paxson. Bro: A System for Detecting Network Intruders in Real-time. Computer
networks, 31(23):2435–2463, December 1999.
[66] Z. Ren, X. Xu, J. Wan, W. Shi, and M. Zhou. Workload Characterization on a Produc-
tion Hadoop Cluster: A Case Study on Taobao. In IEEE International Symposium on
Workload Characterization (IISWC), pages 3–13, La Jolla, California, USA, November
2012.
[67] N. Sarrar, S. Uhlig, A. Feldmann, R. Sherwood, and X. Huang. Leveraging Zipf’s Law
for Traffic Offloading. ACM Computer Communication Review, 42(1):16–22, January
2012.
[68] F. Schneider, S. Agarwal, T. Alpcan, and A. Feldmann. The New Web: Character-
izing AJAX Traffic. In Proceedings of the Passive and Active Network Measurement
Conference, pages 31–40, Cleveland, Ohio, USA, April 2008.
[69] F. Schneider, A. Feldmann, B. Krishnamurthy, and W. Willinger. Understanding Online
Social Network Usage from a Network Perspective. In Proceedings of the ACM Internet
Measurement Conference, pages 35–48, Chicago, Illinois, USA, November 2009.
[70] J. Sedayao. World Wide Web Network Traffic Patterns. In Compcon ’95: Technologies for the Information Superhighway, Digest of Papers, pages 8–12, San Francisco, California, USA, March 1995.
[71] A. Stassopoulou and M. Dikaiakos. Web Robot Detection: A Probabilistic Reasoning
Approach. Computer Networks, 53(3):265–278, February 2009.
[72] Y. Sun, Z. Zhuang, and C. Giles. A Large-scale Study of Robots.txt. In Proceedings of
the 16th International World Wide Web Conference, pages 1123–1124, Banff, Alberta,
Canada, May 2007.
[73] K. Thompson, G. Miller, and R. Wilder. Wide-area Internet Traffic Patterns and
Characteristics. IEEE, Network, 11(6):10–23, November 1997.
[74] G. Urdaneta, G. Pierre, and M. Van Steen. Wikipedia Workload Analysis. Vrije Uni-
versiteit, Amsterdam, The Netherlands, Technical Report, IR-CS-041, September 2007.
[75] Q. Wang, D. Makaroff, H. Edwards, and R. Thompson. Workload Characterization
for an E-commerce Web Site. In Proceedings of the 2003 Conference of the Centre
for Advanced Studies on Collaborative Research, pages 313–327, Markham, Ontario,
Canada, October 2003.
[76] S. Ye, G. Lu, and X. Li. Workload-aware Web Crawling and Server Workload Detection.
In Proceedings of the 2nd Asia-Pacific Advanced Network Research Workshop, pages
263–269, Cairns, Australia, July 2004.
[77] M. Zink, K. Suh, Y. Gu, and J. Kurose. Watch Global, Cache Local: YouTube Network
Traffic at a Campus Network: Measurements and Implications. In Proceedings of the
SPIE, Multimedia Computing and Networking Conference, pages 681805–13, San Jose,
California, USA, January 2008.