UNIVERSITY OF CALGARY

Traffic Analysis of Two Scientific Web Sites

by

Yang Liu

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN COMPUTER SCIENCE

CALGARY, ALBERTA

DECEMBER, 2015

© Yang Liu 2015

Abstract

This thesis presents a workload characterization study of two scientific Web sites at the University of Calgary, based on a four-month period of observation (from January 1, 2015 to April 30, 2015). The Aurora site is a scientific site for auroral researchers, providing auroral images collected from remote cameras deployed in northern Canada. The ISM site is a scientific site providing lecture materials to about 400 undergraduate students in the ASTR 209 course.

Three main observations emerge from our workload characterization study. First, scientific Web sites can generate extremely large volumes of Internet traffic, even when the user community is seemingly small. Second, robot traffic and real-world events can have surprisingly large impacts on network traffic. Third, a large fraction of the observed network traffic is highly redundant, and can be reduced significantly with more efficient networking solutions.

Acknowledgements

I would like to express my sincere appreciation and gratitude to my supervisor, Dr. Carey Williamson, for his invaluable support and insightful suggestions on my graduate research. His enthusiasm motivated my passion for accomplishing this study. His patience and meticulousness helped me overcome my writing weaknesses and finally finish this thesis.

I would like to acknowledge Michel Laterman, Martin Arlitt, and the U of C IT staff for setting up the logging system and capturing the data. Michel provided extensive guidance on processing the large log data. A very special thanks goes out to Martin and Michel for their constructive suggestions and support in helping me polish this thesis.

I would like to thank my external committee member, Dr. Eric Donovan, for being on my committee and for his valuable advice from the perspective of Physics and Astronomy. I would like to thank Emma Spanswick and Darren Chaddock for their technical expertise about the setup and operation of the Aurora site.

I am also indebted to my fellow lab-mates in the Networks Group, Yao, Ruiting, Brad, Mohsen, Keynan, Vineet, Mahshid, Haoming, Linquan, Xunrui, Sijia, Shunyi, Yuhui, and Wei, for the fun we have had and the help you offered during my graduate study. I wish you all the best in your studies and careers.

I would like to thank all my friends for supporting me spiritually, and the U of C for offering me this great opportunity to learn and to explore.

Finally, I would like to thank my family for their support through my entire life, especially my parents Baocheng and Ling for respecting my decisions and encouraging me with their best wishes.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Symbols
1 INTRODUCTION
1.1 Open Scientific Web Sites
1.2 Background Context
1.3 Motivation
1.4 Objectives
1.5 Contributions
1.6 Thesis Overview
2 BACKGROUND and RELATED WORK
2.1 TCP/IP Model
2.1.1 Physical Layer
2.1.2 Link Layer
2.1.3 Network Layer
2.1.4 Transport Layer
2.1.5 Application Layer
2.2 HTTP and the Web
2.2.1 Persistent Connections
2.2.2 HTTP Messages
2.2.3 HTTP Secure
2.3 Network Traffic Measurement
2.4 Web Robots
2.5 Video Streaming
2.6 Scientific Web Sites
2.7 Summary
3 METHODOLOGY
3.1 Endace DAG Card Deployment
3.2 Bro Logging System
3.3 Data Pretreatment
3.4 Summary
4 AURORA SITE ANALYSIS
4.1 HTTP Analysis
4.1.1 HTTP Requests
4.1.2 Data Volume
4.1.3 IP Analysis
4.1.4 HTTP Methods
4.1.5 HTTP Referer
4.1.6 URL Analysis
4.1.7 File Type
4.1.8 HTTP Response Size Distribution
4.2 Robot Traffic
4.2.1 Prominent Machine-Generated Traffic
4.2.2 AuroraMAX
4.3 Geomagnetic Storm
4.4 Summary
5 ISM SITE ANALYSIS
5.1 HTTP Analysis
5.1.1 HTTP Requests
5.1.2 Data Volume
5.1.3 IP Analysis
5.1.4 URL Analysis
5.1.5 HTTP Methods
5.1.6 HTTP Status Codes
5.1.7 HTTP Response Size Distribution
5.1.8 User Agents
5.2 Video Viewing Pattern and Traffic
5.2.1 Video Requests Traffic
5.2.2 Browser Behaviors for Video Playing
5.3 Course-Related Events
5.4 Summary
6 DISCUSSION
6.1 Comparative Analysis of Two Scientific Web Sites
6.2 Workload Characteristics Revisited
6.2.1 Success Rate
6.2.2 File Types
6.2.3 Mean Transfer Size
6.2.4 Distinct Requests
6.2.5 One Time Referencing
6.2.6 Size Distribution
6.2.7 Concentration of References
6.2.8 Wide Area Usage
6.2.9 Inter-Reference Times
6.3 Network Efficiency Analysis
6.3.1 File Transfer Methods
6.3.2 JavaScript Cache-Busting Solution
6.4 Summary
7 CONCLUSIONS
7.1 Thesis Summary
7.2 Scientific Web Site Characterization
7.2.1 The Aurora Site
7.2.2 The ISM Site
7.3 Conclusions
7.4 Future Work
References

List of Tables

3.1 A Sample of a Subset of the Bro HTTP Log
4.1 Statistical Characteristics of the Aurora Site (Jan 1/15 to Apr 29/15)
4.2 Top 10 Most Frequently Observed IP Addresses for Aurora Site
4.3 Top 10 Most Frequently Requested URLs for Aurora Site
4.4 Top 10 Most Frequently Requested File Types for Aurora Site
4.5 Prominent UCB and Alaska IPs in Aurora Web Site Traffic
5.1 Statistical Characteristics of the ISM Site (Jan 1/15 to Apr 29/15)
5.2 Top 10 Most Frequently Observed IP Addresses for ISM Site
5.3 Top 10 Most Frequently Requested URLs for ISM Site
5.4 HTTP Method Summary for ISM Site
5.5 HTTP Status Code Summary for ISM Site
5.6 Top 10 Most Popular User Agents for the ISM Site
5.7 Top 5 OS Versions
5.8 Top 5 Browser Versions
5.9 Browser Support for the Four Video Playing Implementations
6.1 Statistical Characteristics of Two Scientific Web Sites (Jan 1/15 to Apr 29/15)
6.2 HTTP Method and HTTP Status Code Percentage
6.3 Comparison of Workload Characteristics
6.4 Experimental Results for File Transfer/Synch Methods

List of Figures and Illustrations

2.1 The Five-Layer Protocol Stack and the Seven-Layer OSI Model
2.2 Illustration of HTTP Requests and Responses
3.1 Campus Network Structure with Traffic Monitor System
4.1 HTTP Request Count Per Day for Aurora Site
4.2 HTTP Requests Per Hour (Jan 1-3 and Jan 5, 2015)
4.3 Data Volume (GB) Per Day for Aurora Site
4.4 Number of Unique IP Addresses Daily from 2015-01-01 to 2015-04-29, Aurora Site
4.5 IP Geolocation Distribution, Top 10 Countries Sorted by Unique IPs
4.6 IP Geolocation Distribution, Top 10 Countries Sorted by Request Numbers
4.7 Frequency-Rank Profile for IP Addresses, Aurora Site
4.8 HTTP Methods in Aurora Traffic
4.9 Frequency-Rank Profile for URLs, Aurora Site
4.10 AuroraMAX Images from Yellowknife, 2015/03/10
4.11 HTTP Response Size Values for “/summary plots/slr-rt/yknf/recent 480p.jpg” File, from 2015-03-09 to 2015-03-15
4.12 HTTP Response Size Values for “/summary plots/slr-rt/yknf/recent 480p.jpg” File, on 2015-03-12
4.13 HTTP Response Size Distribution Histogram for “/summary plots/slr-rt/yknf/recent 480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis log-scale)
4.14 HTTP Response Size Distribution Cumulative Histogram for “/summary plots/slr-rt/yknf/recent 480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis proportion)
4.15 HTTP Response Size Distribution Histogram for “/summary plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis log-scale)
4.16 HTTP Response Size Distribution Cumulative Histogram for “/summary plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis proportion)
4.17 HTTP Requests and Data Volume Per Day for UCB, UA IPs
4.18 “robots.txt” Request Count Per Day for UCB1
4.19 “robots.txt” Request Count Per Hour on Four Selected Days
4.20 HTTP Request Count Per Day for AuroraMAX
4.21 Data Volume (GB) Per Day for AuroraMAX
4.22 IP Addresses Frequency-Rank Profile for AuroraMAX
5.1 HTTP Request Count Per Day for ISM Site
5.2 HTTP Requests Per Hour (Feb 23, Feb 24, Mar 23, Mar 24, Apr 20, and Apr 21, 2015)
5.3 Data Volume (GB) Per Day for ISM Site
5.4 IP Geolocation Distribution for Countries
5.5 IP Geolocation Distribution for Canada
5.6 IP Geolocation Distribution for USA
5.7 IP Geolocation Distribution for Alberta
5.8 Number of Daily Unique IP Addresses Visiting ISM Site, from Canada and Calgary (2015-01-01 to 2015-04-30)
5.9 Number of Daily Unique IP Addresses Visiting ISM Site, from USA and California (2015-01-01 to 2015-04-30)
5.10 Frequency-Rank Profile for IP Addresses, ISM Site
5.11 Frequency-Rank Profile for URLs, ISM Site
5.12 HTTP Methods in ISM Traffic
5.13 HTTP Status Code in ISM Traffic
5.14 HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, from 2015-02-18 to 2015-02-24
5.15 HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, on 2015-02-24
5.16 HTTP Response Size Distribution Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-5 GB, 50 bins, y-axis log-scale)
5.17 HTTP Response Size Distribution Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-10 GB, 50 bins, y-axis log-scale)
5.18 HTTP Response Size Values (Byte) Per Request Top 5 Count for “Lec8 - Feb 5, 2015.mov”
5.19 HTTP Response Size Values (Byte) Per Request Top 5 Count for “Lec3 - Jan 20, 2015.mov”
5.20 Histograms of Response Size Values (smaller than 1 MB) for “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov” Files
5.21 User Agent Names Distribution in the ISM Site
5.22 User Agent Browsers Distribution in the ISM Site
5.23 Operating System Distribution in the ISM Site
5.24 HTTP Requests Count Per Day for Video (requests)
5.25 Data Volume (GB) Per Day for Video (requests)
5.26 HTTP Transaction Durations Distribution Histogram (x-axis 0-60K s, 50 bins, y-axis log-scale)
5.27 HTTP Response Size Distribution Histogram (x-axis 0-12 GB, 50 bins, y-axis log-scale)
5.28 HTTP Transaction Duration (≤ 10s) CDF for Video Requests
5.29 HTTP Response Size (≤ 5MB) CDF for Video Requests
5.30 HTTP Requests Count Per Day for ASTR209, ASPH213, and ASPH503
5.31 Data Volume (GB) Per Day for ASTR209, ASPH213, and ASPH503
5.32 HTTP Requests and Data Volume Per Day for the Six Categories
6.1 HTTP Traffic Overview for the Aurora and ISM Sites
6.2 Frequency-Rank Profiles for the Aurora and ISM Sites
6.3 Illustration of File Transfer Methods Experiment
6.4 HTTP Response Size Results for the Two Methods

List of Acronyms

AJAX Asynchronous JavaScript and XML

ARPANET Advanced Research Projects Agency Network

CORS Cross-Origin Resource Sharing

CS Computer Science

CSA Canadian Space Agency

DASH Dynamic Adaptive Streaming over HTTP

DHCP Dynamic Host Configuration Protocol

DNS Domain Name System

DOM Document Object Model

FTP File Transfer Protocol

GIF Graphics Interchange Format

GUI Graphical User Interface

HTML HyperText Markup Language

HTTP HyperText Transfer Protocol

HTTPS HyperText Transfer Protocol Secure

IP Internet Protocol

ISM Inter-Stellar Medium

JPEG Joint Photographic Experts Group

LAN Local Area Network

MIME Multipurpose Internet Mail Extensions

MIT Massachusetts Institute of Technology

MPEG Moving Picture Experts Group

OCW OpenCourseWare

OS Operating System

OSI Open Systems Interconnection

PDF Portable Document Format

PPP Point-to-Point Protocol

P2P Peer-to-Peer

QoS Quality of Service

RIP Routing Information Protocol

RSH Remote Shell

RSS Rich Site Summary

RTEMP Real-Time Environmental Monitoring Platform

RTMP Real Time Messaging Protocol

RTSP Real Time Streaming Protocol

SNAP Stanford Network Analysis Project

SSH Secure Shell

SSL Secure Sockets Layer

TCP Transmission Control Protocol

THEMIS Time History of Events and Macroscopic Interactions during Substorms

TLS Transport Layer Security

UA University of Alaska

UCB University of California at Berkeley

UDP User Datagram Protocol

U of C University of Calgary

URI Uniform Resource Identifier

URL Uniform Resource Locator

WWW World Wide Web

Chapter 1

INTRODUCTION

The Internet has a growing influence on many aspects of our daily lives. As network speeds increase and new technologies emerge, the way people use the Internet gradually changes. For example, researchers and educators often share their results and teaching materials via the Internet. In turn, this practice increases external interactions with scientific Web sites.

This thesis presents measurements of the network traffic of two scientific Web sites at the University of Calgary. In particular, we study the usage patterns, identify inefficiencies in current information exchange methods, and suggest potential improvements.

1.1 Open Scientific Web Sites

With the rapid development of network technology and high-performance personal computers, scientific research and education organizations often share resources over the Internet, typically via the World Wide Web [27]. The scientific materials provide people around the world with opportunities to obtain scientific knowledge conveniently, efficiently, and impartially. In some cases, however, the sharing of large and popular materials can generate a significant volume of network traffic.

Furthermore, an emerging trend among research funding agencies and publicly-funded universities is toward open access publishing and open data repositories. These publicly-accessible data repositories enable not only the sharing of scientific data among researchers worldwide, but also a wide variety of “citizen science” projects and outreach activities.

In addition, many universities currently offer a variety of on-line educational resources to the public, including video-recorded lectures. For example, Stanford University provides large network dataset collections to the public via the Stanford Network Analysis Project (SNAP) [57], giving computer scientists, sociologists, and psychologists opportunities to test their methodologies as well as their conjectures. As another example, the worldwide OpenCourseWare (OCW) [20] site offers a series of free on-line courses recorded by the most celebrated universities, including the Massachusetts Institute of Technology (MIT) and Yale University.

The open scientific Web sites provide a rich set of resources for “citizen science”. There is a list of papers posted each year using the datasets in SNAP¹. The open datasets in re3data² also enable research in a variety of fields³. For OpenCourseWare, a report from MIT⁴ shows that MIT OCW was visited 2,385,654 times by 1,367,228 unique visitors in April 2015. However, these open scientific Web sites also have an effect on network traffic.

¹ https://snap.stanford.edu/papers.html
² Registry of Research Data Repositories, http://www.re3data.org/
³ http://www.re3data.org/about/
⁴ http://ocw.mit.edu/about/site-statistics/monthly-reports/MITOCW_DB_2015_04.pdf

The University of Calgary also provides open scientific resources. For instance, it hosts multiple Web sites that share scientific measurement data from remote sensors for atmospheric and environmental monitoring, as well as free on-line courses offered by university educators. These scientific Web sites have generated voluminous network traffic, which is the basis for our study.

1.2 Background Context

Given the pervasive applications of the World Wide Web (WWW), network resource usage is always a relevant problem. The Internet evolved from the ARPANET (Advanced Research Projects Agency Network) project funded by the US government in the 1970’s, and has become globally popular and powerful today. From modest beginnings in local area networks with a few workstations, the Internet has grown into a world-wide network system, with over 1 billion hosts⁵ and 3 billion users⁶.

⁵ Global Internet usage, https://en.wikipedia.org/wiki/Global_Internet_usage#Internet_hosts
⁶ World Internet Users and 2015 Population Stats, http://www.internetworldstats.com/stats.htm

On the modern Internet, advances in network speed and high-performance servers provide users with a high-quality Internet surfing experience. However, the tension between network traffic consumption and Quality of Service (QoS) remains an important issue. To economize on network bandwidth, numerous methods have been proposed, such as new protocols and caching architectures. Before design changes are made, however, network traffic measurement is a useful way to obtain a clear understanding of network bottlenecks, as a prerequisite for network optimization.

Network traffic measurement is an effective way to understand network activities. By analyzing how Web resources are retrieved, it provides an understanding of the data transfer traffic. This information is useful for identifying issues in network resource allocation, distribution, and bandwidth configuration. To optimize network resource allocation, numerous mathematical models, experimental methods, and auto-adjustment systems have been proposed and tested in academia and industry. Web site workload analysis is a well-known network traffic measurement technique for summarizing the characteristics of Web sites.

The workload pattern is usually determined by many factors, such as the demographics of the users, the type of resources on the site, and the services provided by the site. For example, sites like the Washington Post provide news to people around the world, particularly in the United States⁷. The Asahi Shimbun Web site is a Japanese press provider that also serves news to the public. However, the workload pattern of the Washington Post is quite different from Asahi’s⁸, based on the statistics from Alexa⁹. Therefore, sufficient background information and a general analysis of a site’s workload are very important.

⁷ How popular is washingtonpost.com?, http://www.alexa.com/siteinfo/washingtonpost.com
⁸ How popular is asahi.com?, http://www.alexa.com/siteinfo/asahi.com
⁹ http://www.alexa.com/comparison/washingtonpost.com#?sites=asahi.com

The network traffic workload at the University of Calgary (U of C) is mostly contributed by the university students, faculty, and staff. It is unsurprising that most inbound traffic involves popular sites like Google, Facebook, and YouTube. However, the summary results show that some scientific Web sites hosted internally by the university are extremely popular externally, and generated a huge volume of data traffic during the period under observation.

As stated earlier, sharing research and educational materials via the Internet is an effective approach used by many research funding agencies and publicly-funded universities. Research on the workload analysis of scientific Web sites has not received as much attention as that of the most popular sites. Researchers might assume that most scientific sites consume little network bandwidth, since they have a small influence on specific groups of users. Nevertheless, our analysis shows that some scientific sites at the university generated surprisingly large volumes of network traffic.

1.3 Motivation

The University of Calgary hosts many research Web sites and integrated education sites. After assessing all the inbound and outbound network traffic, we found that two scientific Web sites hosted by U of C generate a lot of traffic. Both of them rank among the top data volume generators during our four-month observation from January 1, 2015 to April 30, 2015. The bandwidth they consumed is on the same scale as that of the most popular sites like Google and Facebook, though well behind the streaming video sites YouTube and NetFlix. One of the sites is the Auroral Imaging Group (Aurora) site¹⁰, and the other is the Star Formation and Molecular Astrophysics (ISM) site¹¹. Both sites are hosted by the Department of Physics and Astronomy at the University of Calgary.

¹⁰ Auroral Imaging Group, http://aurora.phys.ucalgary.ca/
¹¹ Star Formation & Molecular Astrophysics at the U of C, http://ism.ucalgary.ca/

The Aurora site studies the Aurora Borealis (Northern Lights), a natural phenomenon caused by cosmic rays, solar wind, and magnetospheric plasma interacting with the upper atmosphere [13]. These auroral phenomena are primarily seen in high-latitude regions like northern Canada and the Arctic (and Antarctic) regions. Since the aurora are mainly observed at night in remote areas, researchers have deployed digital cameras across northern Canada as a ground-based observatory to automatically record auroral phenomena, with the data transferred to U of C servers via network connections. The Aurora site is a scientific Web site providing aurora data collected from these specially-designed cameras. We find that the traffic generated by the Aurora site is surprisingly large. Every day, about 1.5 million HTTP requests are sent to the Aurora server, retrieving 90 GB of data. This unusual discovery motivates us to analyze the workload characteristics of the Aurora site.

The ISM site is another interesting scientific site, which studies the Inter-Stellar Medium (i.e., the gas and dust between the stars) in astrophysics. The site is created and maintained by a U of C professor. Apart from a brief introduction to the Inter-Stellar Medium and some corresponding research, the ISM site mainly provides study materials for three courses taught by the professor, including one Astronomy course (ASTR 209) and two Astrophysics courses (ASPH 213, ASPH 503). Among these courses, ASTR 209 includes a series of recorded course videos. Similar to the Aurora site, the ISM site also generated voluminous traffic during our four-month observation, given its relatively small user community (400 U of C students registered in the course in winter 2015). Around 70 GB of data are retrieved from the ISM server per day. By analyzing the workload characteristics of the ISM site, we intend to understand the bandwidth usage and how the Web resources are being used.

The purpose of conducting the network traffic measurements is to improve network usage. The Aurora site and the ISM site are both constructed and maintained by technical staff with minimal computer science (CS) or networking background. As such, they may not deploy the sites effectively from the CS perspective when sharing information over the Internet. Considering the traffic volume engendered by these scientific Web sites, we are motivated to measure their network traffic, identify performance issues (if any), and propose potential remedies for the problems.

A second motivation for this thesis is a better understanding of modern scientific Web site traffic, compared to previously known workload patterns. As indicated earlier, network technologies have improved dramatically over time. Therefore, the workload characteristics may also have changed, along with user behaviors.

1.4 Objectives

The objectives of this thesis are as follows:

1) Measure network traffic at the University of Calgary to determine the characteristics of modern scientific Web sites.

2) Compare workloads of modern scientific Web sites with those of previously studied Web sites to identify similarities and differences.

3) Identify inefficiencies (if any) based on the traffic measurement results, and suggest improvements.

1.5 Contributions

This thesis has four primary contributions, listed as follows:

1) We collect and measure the network traffic of two distinct scientific Web sites at the University of Calgary, namely the Aurora site and the ISM site.

2) We identify the dominance of automated robot traffic in the Aurora site measurement, and compare its characteristics to the human-generated traffic.

3) We discover several inefficiencies in the data transfer methods of the Aurora site. We suggest potential improvements and present experimental results to evaluate their effectiveness.

4) We compare the Web usage characteristics of modern scientific Web sites with those from the prior literature.

Although the scope of this thesis focuses on network traffic measurement of two campus scientific Web sites, we expect this analysis and our suggestions for improvement to raise awareness of Web robot traffic and network inefficiency issues. Also, we believe our results will provide a foundation for future explorations of scientific data sharing systems.

1.6 Thesis Overview

This thesis is organized as follows:

1) Chapter 2 presents basic network knowledge including the TCP/IP and HTTP protocols, and introduces related work on network traffic measurement and workload characterization. It also discusses prior studies on Web crawling, video streaming, and scientific Web sites.

2) Chapter 3 introduces the data collection methodology and the Bro logging system.

3) Chapter 4 analyzes the network traffic of the Aurora site, discusses its workload characteristics, and identifies the robot traffic and inefficiency issues.

4) Chapter 5 analyzes the network traffic of the ISM site, discusses its workload characteristics, and studies the video streaming traffic and user viewing patterns.

5) Chapter 6 compares the workload characteristics of scientific Web sites and discusses potential solutions for the inefficiency issues.

6) Chapter 7 summarizes the results, presents conclusions, and suggests future work.

Chapter 2

BACKGROUND and RELATED WORK

In this chapter, we introduce fundamental background knowledge regarding the technologies underlying our research. An overview of this chapter is as follows:

1) Section 2.1 and Section 2.2 provide background information on computer networks, including the classical five-layer network architecture, TCP/IP protocols, and HTTP in the application layer.

2) Section 2.3 reviews the literature on network traffic measurement research.

3) Section 2.4 briefly introduces Web robots.

4) Section 2.5 discusses current video streaming techniques on the Internet.

5) Section 2.6 presents literature about scientific Web sites, focusing on network traffic analysis and workload characterization.

2.1 TCP/IP Model

The modern Internet had its early origins in regional academic networks. It evolved from the ARPANET project and has greatly changed over the decades. However, the fact that the Internet consists of infrastructure and protocols remains the same.

Within the overall Internet system, network hardware and software implementing the protocols are organized in layers, called a protocol stack. Figure 2.1 shows the five-layer protocol stack and the seven-layer OSI (Open Systems Interconnection) model [55]. Except for the Presentation and Session layers, which are specific to the OSI model, both stacks have Application, Transport, Network, Link, and Physical layers. Differences exist between these two protocol stack models in their services and protocols. Since most concepts are similar, we choose to introduce the five-layer protocol stack in this chapter; more information about the OSI model is available elsewhere [17, 55].

Figure 2.1: The Five-Layer Protocol Stack and the Seven-Layer OSI Model

The five-layer protocol stack is also referred to as the TCP/IP protocol suite, after two of its celebrated protocols: the Transmission Control Protocol (TCP) and the Internet Protocol (IP). We use the common term “TCP/IP model” to refer to the five-layer protocol stack in this thesis.

Each protocol belongs to only one layer. For example, TCP belongs to the Transport layer, IP is in the Network layer, and the HyperText Transfer Protocol (HTTP) is in the Application layer. In each layer, actions are performed to provide services to that layer or to the adjacent layer above, by utilizing the services within that layer or from the adjacent layer below. Protocol layers are implemented in software, in hardware, or in a combination of the two. We take a bottom-up approach to introducing the layers.

2.1.1 Physical Layer

The physical layer is at the bottom of the TCP/IP model. It provides the means of transmitting raw bits to the connected destination node, and determines the parameters of the communication channel [55]. It also provides the interface to protocols used by hardware transmission media. When a network connection is established, the way bits are moved is determined by the actual transmission medium and the corresponding protocols.

2.1.2 Link Layer

The Link layer is designed to move link-layer frames between two different nodes in the route [55]. It provides this service to the Network layer for routing a datagram via a series of routers, by invoking the bit-moving service provided by the Physical layer. The Link layer services highly depend on the link-layer protocols available on the link. Protocols such as Ethernet, WiFi, and Point-to-Point Protocol (PPP) belong to the Link layer. When a datagram from the Network layer traverses the links, it is passed down to the Link layer, and then transferred to the destination. During the process, it may use different link-layer protocols at different links. Finally, the datagram is passed up to the Network layer in the destination node.

2.1.3 Network Layer

The Network layer moves data packets (datagrams) between different hosts [55]. It provides services to the Transport layer, and uses the services provided by the Link layer. Whenever a Transport layer segment and a destination address are passed to the Network layer, the Internet Protocol (IP) is invoked to send the segment to the specified destination.

IP protocols are the primary protocols in the Network layer. Internet Protocol Version 4 (IPv4) is the dominant protocol [16]. The main function of IPv4 is to route datagrams from a source host to a destination host, based on a 32-bit address (IP address). IPv4 only provides best-effort delivery, without guarantees that the datagrams are delivered. Each host in the network layer has a unique IP address. Our detailed analysis of the network traffic in this thesis is based on the IP information extracted from this layer.

2.1.4 Transport Layer

The Transport layer is responsible for moving transport-layer packets (segments) from one end host to another. It achieves the data transfer by establishing a logical data channel between the two end hosts.

The two primary protocols in the Transport layer are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) [55]. TCP provides connection-oriented service with reliability. For example, TCP guarantees reliable, ordered, and error-checked delivery when transferring application-layer messages [17]. It also applies a sliding window flow control protocol to match speeds between sender and receiver, and congestion control mechanisms to avoid congestive collapse (i.e., extremely poor network performance). UDP provides connectionless service with no reliability, except for (weak) checksums for data integrity. TCP is widely used by many popular Internet applications, such as HTTP, the File Transfer Protocol (FTP), and Secure Shell (SSH). UDP is utilized by Internet applications caring more about responsiveness than reliability, such as the Domain Name System (DNS), the Routing Information Protocol (RIP), and some audio or video streaming applications.

Our work studies HTTP traffic, which commonly utilizes TCP since it presumes a reliable transport-layer protocol [8] (note that unreliable UDP can also be used by HTTP). Therefore, the traffic we study in this thesis is almost always generated by TCP connections.

2.1.5 Application Layer

The Application layer is the topmost layer in the TCP/IP model. It contains many important protocols, such as HTTP and FTP [55]. These protocols are used to exchange application-layer messages between hosts. The Application layer protocols utilize the logical data transfer channels established by underlying transport-layer protocols to deliver messages. Our traffic analysis focuses on logs of HTTP activities in the Application layer.

2.2 HTTP and the Web

As introduced above, the TCP/IP model provides the underlying mechanisms to support Internet applications. Upon this network foundation, the World Wide Web (WWW) [33] emerged as a means to exchange data content easily on the Internet. The Web has had a transformative role in enriching the interactions on the Internet. Its popularity has helped foster the merging of separate data networks, leading to the formation of the global data network that we know today.

Figure 2.2: Illustration of HTTP Requests and Responses

The HyperText Transfer Protocol (HTTP) in the Application layer is the foundation of the WWW. It has two primary versions: HTTP/1.0 [6] and HTTP/1.1 [8]. HTTP/1.1 is a revision of HTTP/1.0, with persistent connections (and other features) added. Currently, design and implementation work is being done on a new version, HTTP/2.

HTTP involves two programs: a client-side program such as a Web browser, and a server-side program such as a Web server. HTTP defines how messages are exchanged between the client and server. The client retrieves the objects (e.g., HTML files, image files, video files, and JavaScript files) in a Web page from the server side, through a recognizable Uniform Resource Locator (URL). For example, http://www.abc.com/def/ghi.pdf is a valid URL for fetching the “ghi.pdf” PDF file from the “/def/” directory on the Web server host “www.abc.com”, using HTTP.

HTTP invokes TCP to establish connections within the Transport layer. The client and server exchange data by accessing the socket interface of the TCP connection. The reliability of the TCP connection guarantees that messages are exchanged successfully between the client and the server.

In some deployments, an HTTP proxy server is used as an intermediary between the client and the origin Web server. A proxy server can reduce the response time for client requests, as well as save bandwidth between the client and server, by temporarily storing and serving recently requested objects in a Web cache. In this role, a proxy acts as the Web server for the client when it has a copy of the requested object. Conversely, it acts as a client to request an object from the origin Web server when the object is not stored locally. Figure 2.2 shows a generic illustration of HTTP requests and responses involving a proxy server.

2.2.1 Persistent Connections

Persistent HTTP connections and non-persistent HTTP connections are two different ways for clients to interact with Web servers. The main difference is whether the HTTP connection reuses the existing TCP connection [55]. The non-persistent HTTP approach establishes a new TCP connection for each request-response transaction. However, persistent HTTP connections allow multiple messages to be exchanged via the same TCP channel, in series.

The primary advantage of persistent connections is the reduction of the request latency.

Since the total time of an HTTP request-response transaction consists of the TCP connection initialization time, request delivery time, and response delivery time, persistent connections save the time used to re-establish the TCP connections (three-way handshake).

HTTP/1.1 makes persistent connections the default behavior for all HTTP connections, while HTTP/1.0 uses non-persistent connections. Technically, HTTP/1.0 can support persistent connections by adding “Connection: Keep-Alive” to the message header, and this is compatible with HTTP/1.1 servers [7]. However, there are many restrictions when implementing this. For example, clients cannot establish Keep-Alive connections with HTTP/1.0 proxy servers. HTTP/1.1 clients and servers can be configured to use non-persistent connections for each request-response transaction, if resource usage is a concern. There are many other configuration options, such as adjusting the maximum session time and the maximum number of concurrent persistent connections.

HTTP/1.1 also supports the HTTP pipelining technique, which allows multiple HTTP requests to be sent over a single TCP connection before receiving the corresponding responses.

HTTP/1.0 doesn’t support this feature.

2.2.2 HTTP Messages

HTTP messages are the data sent by Web clients and servers over HTTP connections. An HTTP exchange consists of a request message and a response message, both of which have a defined format.

We use an example in Listing 2.1 to introduce the HTTP messages.

Listing 2.1: HTTP Request and Response Message Example

Request message:
GET / HTTP/1.1
User-Agent: curl/7.37.1
Host: www.ucalgary.ca
Accept: */*

Response message:
HTTP/1.1 200 OK
Date: Thu, 23 Jul 2015 22:27:04 GMT
Server: Apache/2.2.15 (Red Hat)
Last-Modified: Thu, 23 Jul 2015 19:06:16 GMT
ETag: "496582-a7c7-51b8f94f5dd9a"
Accept-Ranges: bytes
Content-Length: 42951
Connection: close
Content-Type: text/html; charset=UTF-8

data is attached here

HTTP Request Message

The first line in the HTTP request message is called the request line. It contains the HTTP method field, the URL field, and the HTTP version field. In this example, the client would like to use the GET method to fetch the root directory via HTTP/1.1 (the server can choose whether to abide by these; e.g., it could respond with HTTP/1.0). The vast majority of HTTP requests use the GET method.

The following lines are request header lines. They inform the server of basic background information about the client, as well as some parameters of the request. In the example, the client tells the server that the HTTP message is generated with user agent “curl/7.37.1” [15], which is a command-line tool used for transferring files over the Internet. Usually, the user agent field consists of the name and version information of the browser and operating system that the client is using.

The Host field indicates where the requested object is located. Although the TCP connection between the specific server and client is already established before transferring HTTP messages, it is necessary to keep the Host field since: 1. there may exist a proxy server as an intermediary agent talking to the client; and 2. a single Web server may host numerous different Web sites.

In the request message, the client can specify detailed parameters when requesting the object. For example, the “Accept-Language” field indicates the client’s language preference, the “Accept-Encoding” field indicates the acceptable encodings, the “Connection” field indicates the client’s intention to keep this connection alive or not, and the “If-Modified-Since” field means the client only wants a copy if the file has been changed after a certain time point (though the server may choose to send the full response anyway). There is no need for the request message to include all the header line fields; a subset of the fields is acceptable.
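The sketch below, again with Python’s urllib and a hypothetical URL, shows how a client attaches such header fields to a request; all header values here are illustrative.

from urllib.request import Request, urlopen

# Attach optional request header fields; the values are illustrative only.
req = Request("http://www.example.com/page.html", headers={
    "Accept-Language": "en-CA,en;q=0.8",  # preferred response language
    "Accept-Encoding": "gzip",            # acceptable content encodings
    "Connection": "keep-alive",           # intention to keep the connection alive
})
response = urlopen(req)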

HTTP Response Message

Similar to the request message, an HTTP response message starts with a status line, includes header lines in the middle, and may end with an entity body. The status line contains the HTTP version the server is using for this transaction, the HTTP response status code, and its corresponding status message. In the example, the server tells the client that it is using HTTP/1.1 for this transaction and that the requested object is successfully returned.

In the response header lines, the “Date” field indicates when the response was generated and sent, the “Server” field indicates the server name and version, the “Last-Modified” and “ETag” fields are auxiliary information that the client (or proxy) can use in the future to check whether the requested object has been modified (if not, the client has no need to download the object again), “Accept-Ranges” indicates whether the server accepts range (partial transfer) requests for an object, “Content-Length” and “Content-Type” indicate the length and MIME (Multipurpose Internet Mail Extensions) type of a requested object, and “Connection” informs the client whether the server will keep this TCP connection active or not. Like the request message, the response message usually contains only a subset of the fields. A list of the available HTTP request and response headers is shown in [19].

The entity body is the content of the requested object returned by the server. If the client is a Web browser and the requested object is a part of the Web page (e.g., HTML, image files, CSS files, etc.), then the browser can directly render the object onto the Web page via its layout engine and present it to the user. Since the way of rendering the page varies for different browsers, the actual Web page may not look the same when viewed on different browsers.

HTTP Request Methods

The method in the HTTP request defines the action to be performed on the object identified by the request URL [11]. In HTTP/1.0, there are only three methods: GET, POST, and HEAD. Five new methods were added in HTTP/1.1: OPTIONS, PUT, DELETE, TRACE, and CONNECT. A client can use any of these methods to interact with the server, and a server can be configured to support or reject each of them.

The GET method is the most prevalent on the Internet. It is used to retrieve the content of the object at a specific request URL. The conditional GET is a GET request with conditional statements in the request message header lines. The conditional statements include the If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, and If-Range header fields [11]. Once the conditional GET request reaches the server, the server can choose whether to return the updated object, depending on the conditions. The advantage of using conditional GET is to reduce the network usage of superfluous data transfers. The partial GET is a GET request with a Range header field in the request message. It only requests part of the entire object, which may reduce network usage by avoiding re-transferring data that is already at the client. The partial GET is very useful when dealing with large objects, such as videos.
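The following minimal Python sketch illustrates both variants against a hypothetical server; note that urllib surfaces a 304 reply as an HTTPError, while a Range request normally yields a 206 Partial Content response.

from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Conditional GET: only fetch the object if it changed after this time point.
req = Request("http://www.example.com/latest.jpg",
              headers={"If-Modified-Since": "Thu, 23 Jul 2015 19:06:16 GMT"})
try:
    body = urlopen(req).read()   # 200 OK: modified object returned
except HTTPError as e:
    if e.code == 304:            # 304 Not Modified: no entity body sent
        body = None

# Partial GET: request only the first 1024 bytes of a large object.
req = Request("http://www.example.com/lecture.mov",
              headers={"Range": "bytes=0-1023"})
resp = urlopen(req)              # expect a "206 Partial Content" response
chunk = resp.read()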

The HEAD method is identical to the GET method except that the response does not contain the entity body. It is often used for retrieving meta-information about the requested object, such as how large it is, or how old. Based on the meta-information, the client may choose whether to use GET to retrieve the object. Similar to the conditional GET, appropriate use of the HEAD method may significantly reduce network usage.
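A minimal sketch of this pattern, assuming a hypothetical host and object: the client issues a HEAD request, inspects the Content-Length meta-information, and only then decides whether to GET the object.

import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("HEAD", "/lecture.mov")
resp = conn.getresponse()
resp.read()                                    # HEAD responses carry no body
size = int(resp.getheader("Content-Length", "0"))

if size < 100 * 1024 * 1024:                   # fetch only if under 100 MB
    conn.request("GET", "/lecture.mov")
    data = conn.getresponse().read()
conn.close()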

The remaining HTTP methods are rarely seen in our network traffic measurements. Therefore, we briefly cover only selected methods here.

The POST method is designed for sending the enclosed entity to a specific resource at the server, and requiring the resource to handle the request. The Web server determines the actions to be performed on the transferred entity.

The PUT method allows the client to enclose an additional entity in the request, which can update the resource on the server if the resource exists, or create it if needed.

The DELETE method requests that the specified object on the server be deleted (this will only be done if the user has the appropriate permission).

The OPTIONS method asks the server to return the available HTTP methods for an object.

HTTP Response Status Codes

The HTTP response status code is a 3-digit integer that summarizes the server’s action based on the request. There is a short textual description following the status code in the response status line.

We introduce several selected status codes as follows:

200 OK is the standard response status code for successful HTTP requests. It is the most common status code for most Web sites.

206 Partial Content indicates that the server successfully returned the partial content for a request with a Range header field.

304 Not Modified indicates that the object on the server has not been modified, based on the conditional GET request sent from the client. Therefore, responses with 304 don’t contain entity bodies.

403 Forbidden informs the client that the request is understood by the server; however, the server refuses to fulfill it.

404 Not Found indicates that the requested object is not found on the server.

503 Service Unavailable indicates that the server is currently unavailable (caused by temporary overload or maintenance).

Referer

The HTTP referer¹ is a request header field. It informs the server where the request originated. For example, consider a client viewing a Web page on server A that has a hyperlink to an image on server B. When the client sends a request to server B to retrieve the image object, the referer field in the request tells server B that this request was referred by server A.

The referer field provides servers with information about where visitors come from, which is often useful in network traffic analysis. However, for security and privacy concerns, clients sometimes choose to obfuscate the referer field in their HTTP requests. Furthermore, if the user types in a URL or visits sites from bookmarks in browsers, the referer field will be blank.

¹ A misspelling of “referrer” originally, https://en.wikipedia.org/wiki/HTTP_referer

2.2.3 HTTP Secure

HTTP Secure (HTTPS) is a protocol designed for encrypted HTTP data transfers over the Internet. In HTTP, clients and servers communicate “in the clear” directly over TCP connections. In HTTPS, connections are built on top of cryptographic protocols like Transport Layer Security (TLS) and Secure Sockets Layer (SSL), which usually work at the session and presentation layers in the seven-layer OSI model, or at the application layer in the TCP/IP model [25]. The cryptographic protocols encrypt messages before sending, and decrypt messages after receiving, thus improving the security and privacy of transferred messages.

The sites selected in this thesis are all HTTP sites, and we only measure their HTTP traffic.
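For completeness, a minimal Python sketch of an HTTPS transaction: the only client-visible difference from plain HTTP is that the connection is wrapped in TLS before any HTTP messages are exchanged. The host is illustrative; the sites measured in this thesis use plain HTTP.

import http.client

# HTTPSConnection performs the TLS handshake before HTTP messages flow.
conn = http.client.HTTPSConnection("www.example.com")
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.reason)
conn.close()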

2.3 Network Traffic Measurement

Network traffic measurement provides a way to understand and manage the usage of the Internet today. There is a wealth of literature on network traffic measurement and Web workload characterization dating from the early 1990’s to the present.

In 1992, Vern Paxson used measurements to evaluate analytic models of TCP connections [64]. It was one of the earliest network traffic characterization works during the growth of the Internet. Based on traces collected from 7 different sites, Paxson analyzed the network characteristics of TELNET, NNTP, SMTP, and FTP connections, and compared several analytic models as well as empirical models. He found that analytic models are as good as the empirical models in general, and that the connection characteristics differ across sites, and across different periods at the same site.

In 1997, Thompson et al. [73] observed the traffic volume, flow volume, flow duration, and packet sizes in a wide area network. They found interesting results: for example, the measured traffic shows diurnal trends, it decreases on weekends, and TCP dominates IP traffic.

In 1995, Crovella et al. [40] discovered self-similarity in World Wide Web traffic. They also found that the Web transmission times and the silent times (inactive times of a client) follow heavy-tailed distributions. They attribute these to the heavy-tailed distribution of Web file sizes, as well as the influence of network users.

Sedayao et al. [70] analyzed WWW traffic patterns. Their work covers the WWW traffic characteristics from a fundamental perspective, with concerns about inefficiency issues, and proposed solutions. They mentioned that the most popular file type in WWW traffic at the time was the Graphics Interchange Format (GIF), followed by Moving Picture Experts Group (MPEG) files (in terms of references of bytes).

Cunha et al. [41] characterized Web client activity. They collected data by deploying modified Web browsers in terminal rooms on the Boston University campus. During their four-month observation, they found power-law distributions for document sizes and document popularity. They showed that this information is useful for designing caching strategies.

In 1996, Arlitt and Williamson [30, 31] analyzed the Web traffic of six different servers. They identified ten common characteristics in the Web server workloads. For example, they found that successful requests are the most common, 90% of the transferred documents are HTML and image files, and file sizes and transfer sizes both follow heavy-tailed distributions. Based on these results, they also proposed effective suggestions for improving caching systems. We will revisit their work in Chapter 6 by comparing the workload characteristics of modern scientific Web sites to their results.

In 2000, Mahanti and Williamson [61] analyzed the workload of three different Web proxy servers. They found results similar to [31] (e.g., HTML and image files account for 95% of all the requests), as well as distinct results, such as that Web document popularity does not strictly match Zipf’s Law (it does in [31]). This work confirmed Arlitt’s work to some extent, and highlights how workload characteristics vary across Web sites and time periods.

In 1999, Breslau et al. [35] were among the first to focus on the Zipf-like distribution for file popularities. They studied the Zipf-like distribution in Web page requests, and the reference probability of the documents. They also presented a simple model for understanding cache performance.

As many new trends have emerged on the Internet, more recent network traffic studies have provided insights into general network measurements as well as individual sites.

In 2010, Callahan et al. studied Web workload characteristics from a longitudinal view [37]. Their work includes HTTP transaction characterization, user behavior, and server distributions. They found that most HTTP transactions are GET requests and that Zipf-like distributions are present in per-object request statistics, and they identified effects from browser caches and content distribution networks.

In 2011, Ihm et al. conducted measurements on modern Web traffic over five years of observations of a content distribution network vendor [53]. Their work covers high-level characteristics such as the overall connection speed and maximum concurrent connections, as well as page-level characteristics analyzed with their new page detection algorithm. They found increasing use of Flash video, AJAX (Asynchronous JavaScript and XML), and client-side interactions after pages initially load.

There are numerous papers about the traffic of Web 2.0 sites, in which users can interact and collaborate with each other instead of merely viewing the content provided by the Web site. Butkiewicz et al. [36] studied the complexity of today’s Web pages. Schneider et al. [68] presented a study of AJAX traffic by analyzing popular Web 2.0 sites, such as Google Maps, and social network Web sites. Lin et al. [59] also studied the on-line map application traffic on Web 2.0 sites.

Web 2.0 has evolved to encompass a large group of sites, including video Web sites like YouTube, NetFlix, and Vimeo, and on-line social networks like Facebook, Flickr, and Twitter. Cha et al. [38, 39] studied the traffic of several user-generated content video Web sites. Gill et al. [48], Zink et al. [77], and Ameigeiras et al. [29] all studied YouTube. Several papers [32, 50, 51, 62, 69] studied on-line social networks from many perspectives, including network usage, user behaviors, user content generation patterns, and user relationship connections.

Other research has involved network traffic measurement of e-commerce Web sites [66, 75], Web robot activities [45, 76], Peer-to-Peer (P2P) systems, and mobile networks.

2.4 Web Robots

A Web robot is a software program that automatically launches a series of HTTP transactions [49]. The main application of the Web robot is to crawl and extract useful information from Web sites, by moving from site to site and analyzing the browsed data. Therefore, this kind of Web robot is also called a “Web crawler”. For example, Googlebot² is a Web crawling robot operated by the Google search engine. The main function of Googlebot is to discover new and updated Web pages for the Google index.

Due to the similarities of Web page structures, Web robots can quickly fetch information from the Internet by performing repetitive and redundant tasks. Usually, users can utilize software like Wget³ to retrieve filtered content from the Web, or create more flexible scripts with programming libraries like the urllib⁴ module in Python.

Technically, all Web robots have the same core approach. There is a list of URLs known as root pages for robots to start with initially. Robots (as clients) can generate HTTP requests to get the content of those pages from Web servers. Then robots extract the links and useful information from the page content, and identify potential URLs to crawl in the next step. Robots repeat the previous procedures until a termination condition is satisfied. For example, a robot may terminate when all the links in a specific page are visited, or when it reaches the maximum link depth from the root page.

² Googlebot, https://support.google.com/webmasters/answer/182072?hl=en
³ GNU Wget, https://www.gnu.org/software/wget/
⁴ urllib, Python module, https://docs.python.org/2/library/urllib.html
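A minimal Python sketch of this core crawling loop, with a hypothetical root URL: it maintains a frontier of URLs, fetches each page, extracts links, and terminates at a maximum link depth. A real robot would add politeness delays, robust HTML parsing, and the robots.txt check shown in the next sketch.

import re
from urllib.request import urlopen
from urllib.parse import urljoin

def crawl(root, max_depth=2):
    frontier = [(root, 0)]            # (URL, link depth from the root page)
    visited = set()
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue
        # Naive link extraction; real robots use a proper HTML parser.
        for link in re.findall(r'href="([^"]+)"', html):
            frontier.append((urljoin(url, link), depth + 1))
    return visited

# visited = crawl("http://www.example.com/")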

Web robots may not be welcomed by some Web servers. There is a Robots Exclusion Standard [22] widely used by Web servers to communicate with robots. A Web server can provide a file named “robots.txt” in the root directory, indicating which parts of the site are not allowed to be crawled. If a Web robot follows the standard, it first generates a GET request to fetch the “robots.txt” file, then modifies its further operations according to the rules. Most robot software provides configuration options for users to determine whether to follow “robots.txt” or not. For example, Wget can disable courteous operations by adding “-e robots=off” to the command. Furthermore, some robots can mislead servers by faking the HTTP request header content, especially the user agent field.
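A courteous robot can honour the standard with Python’s built-in urllib.robotparser module, as in this minimal sketch; the host and robot name are hypothetical.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                                   # issues the GET for robots.txt

# Check a candidate URL against the rules before crawling it.
allowed = rp.can_fetch("MyResearchBot", "http://www.example.com/data/")
print("crawl permitted" if allowed else "crawl disallowed by robots.txt")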

There are numerous papers about the network traffic of Web robots. Before undertaking a full analysis of the Web traffic, a preliminary analysis for detecting and identifying robot traffic is recommended. Several papers [34, 44, 71] studied Web robot detection techniques, including statistical analysis as well as data mining approaches. Another two papers [54, 72] analyzed thousands or millions of different Web sites, and provided surveys about the usage of the robots exclusion standard, namely “robots.txt”. They found that 46.02% of newspaper Web sites and 45.93% of USA university Web sites adopted the robots exclusion standard. Two studies [42, 43] discussed Web robot behaviors based on the analysis of Web server access logs.

2.5 Video Streaming

As mentioned in Chapter 1, one of the sites we study provides video lectures. We present a brief introduction to Internet video streaming techniques in this section.

Video streaming has changed a lot over the years, with the upgrading of network speeds and the growing demands from network users. Today, YouTube, the largest video sharing Web site, allows users to view user-generated videos smoothly with qualities ranging from 240p (426 × 240, progressive scan) to 1080p (1,920 × 1,080, progressive scan). NetFlix and Hulu provide hundreds of thousands of movies or TV shows to users with high quality video streams. Many sports or e-sports sites can even provide live broadcast video streams to users over the Internet. This technology enriches our daily lives.

Typically, there are three different ways to watch a video on-line. Progressive download is a way to transfer video files from a server to a client via HTTP connections [3]. When a user plays a video embedded in a Web page, the browser starts to download a copy of the video file from the Web server, and the user can only access the video content that has already been downloaded. There is usually a progress bar indicating how much video content has been loaded or played. Users can replay or fast-forward within the loaded portion; however, they cannot view the parts not yet downloaded by the browser. Although progressive download is easy to implement, the drawbacks of this technique are obvious:

1) Users have to wait until the video is loaded (from the beginning to where users want to watch) even if they are interested in only a small part of the whole video.

2) The technique wastes bandwidth if a user downloads a large part of the video and then exits without watching the whole video.

3) If the video bit rate is high, and the capacity between the server and the client is lower than the bit rate of the video, then the user occasionally has to stop and wait for the buffer to fill. The user has no means to change the quality of the video.

Traditional streaming [4] is another method for video streaming, which utilizes special streaming servers to deliver videos. Special protocols like the Real Time Streaming Protocol (RTSP) and the Real Time Messaging Protocol (RTMP) are adopted to divide the original video file into small chunks, and then transfer those chunks via UDP or TCP connections. In this approach, the video quality is still immutable, and this streaming service cannot be implemented on normal Web servers.

Adaptive streaming [5] is designed to solve the video quality issues above. By deploying several video files encoded at different qualities for the same video content, adaptive streaming can provide video streaming service at different qualities to users according to their network conditions. The video files are also chunked into small fragments for ease of delivery. There are four primary implementations of adaptive streaming: Dynamic Adaptive Streaming over HTTP (MPEG-DASH), Adobe Dynamic Streaming for Flash, Apple HTTP Adaptive Streaming, and Microsoft Smooth Streaming [12]. MPEG-DASH is the only international standard, and is widely supported by most HTTP servers. The advantage of adaptive streaming is that it enables users to watch videos smoothly, instead of waiting to buffer high-quality videos or tolerating low-quality videos. The disadvantage is the computation and storage resources used for compressing the videos at various bit rates and resolutions beforehand. Since user experience is more important than this additional cost, adaptive streaming is prevalently adopted by most of today's video sites, like YouTube, NetFlix, and Vimeo.
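The client-side core of adaptive streaming can be illustrated with a short Python sketch: before fetching the next chunk, pick the highest encoded bit rate that the measured throughput can sustain (the bit-rate ladder and safety margin below are illustrative assumptions, not values from any particular player):

    # Hypothetical bit-rate ladder (bits per second): one entry per
    # pre-encoded rendition of the same video, e.g., 240p up to 1080p.
    LADDER_BPS = [400_000, 1_000_000, 2_500_000, 5_000_000]

    def choose_bitrate(throughput_bps, margin=0.8):
        """Pick the highest rendition the measured throughput can sustain.

        margin < 1 leaves headroom so playback does not stall when the
        throughput fluctuates; both values here are illustrative only.
        """
        budget = throughput_bps * margin
        usable = [rate for rate in LADDER_BPS if rate <= budget]
        return usable[-1] if usable else LADDER_BPS[0]

    # A client measuring ~4 Mbps would fetch the 2.5 Mbps chunks:
    # choose_bitrate(4_000_000) -> 2500000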

2.6 Scientific Web Sites

There is not much literature about network traffic measurement and workload characterization of scientific Web sites.

Eldin et al. [46] studied the top 500 popular pages in Wikimedia, which is a Web site promoting free educational content. They analyzed time-series request counts and discovered the collateral load phenomenon, in which the links embedded in a popular page (e.g., Michael Jackson's page when he died) generate more traffic than the page itself. They also suggested that simple prediction algorithms are able to predict workload in Wikimedia.

Urdaneta et al. [74] studied a sample of the network traffic for Wikipedia. Their research includes analysis of user requests, read and save operations, flash crowds, and non-existent page requests. They also suggested a decentralized and collaborative setting to host Wikipedia for improving the network performance.

Morais et al. [63] studied the user behavior in a citizen science project, though their focus was on the user interaction aspects, rather than Internet traffic workloads. Li et al. [58] studied the workload characterization of CiteSeer, which is a digital library for computer science literature.

The closest example in the prior literature is Faber et al. [47]. They studied the Web traffic of four different data sets. They compared the workload characteristics found in [30] and in their data sets. However, their analyses are somewhat limited due to missing HTTP header fields in their logs, and a relatively short observation period of 1-2 months. Furthermore, their data was collected over a decade ago, and scientific Web workloads may have changed since then.

2.7 Summary

This chapter introduced basic background knowledge on computer networks. The application protocol HTTP and its underlying TCP/IP protocols were discussed. Then, we presented a series of related studies involving network traffic measurement, Web robots, video streaming techniques, and scientific Web site analysis.

This thesis studies the workload characteristics of two scientific Web sites, by analyzing the HTTP transaction logs, especially the HTTP header fields mentioned in this chapter.

The study focuses on the user behaviors and network usage of the two scientific sites.

We present our methodology in the next chapter.

Chapter 3

METHODOLOGY

In this chapter, we explain how the network traffic in our study was monitored. To be specific, we introduce the overall logging system, including the deployment of the hardware infrastructure, and the log generation framework. For analysis purposes, we then describe the pre-processing methods applied to the logs. The majority of the infrastructure deployment and data collection work was done by Michel Laterman and Martin Arlitt, with support from University of Calgary Information Technologies (UCIT) staff.

3.1 Endace DAG Card Deployment

The incoming and outgoing network traffic passes through the edge routers of the University of Calgary network. By mirroring these traffic flows, it is feasible to observe all of the packet-level traffic between the campus and the Internet from a monitor server.

Figure 3.1: Campus Network Structure with Traffic Monitor System

Figure 3.1 shows the structure of the campus network with our traffic monitor included. All incoming and outgoing traffic is mirrored to the monitor, and then summaries of the actual traffic are transferred to a storage server every night.

The monitor is a Dell server equipped with two Intel Xeon E5-2690 CPUs (32 logical cores @ 2.9 GHz), 64 GB RAM, and 5.5 TB hard disk storage, running the CentOS 6.6 x64 operating system. Since the hard disk is not large enough to store summary logs of the network traffic for a long period of time (around 50 GB of compressed log files are generated every day), the summary logs are transferred to a storage server early every morning (during the off-peak times).

The monitor uses an Endace DAG 8.1SX card for traffic capture and filtering. The Endace DAG card is designed for 10 Gbps Ethernet, and uses a series of programmable hardware-based functions to improve packet processing performance. A full list of the Endace DAG 8.1SX's specifications is available elsewhere [2]. Typical overall daily usage of the U of C network during the collection period was 2 Gbps of inbound TCP/IP traffic, and 1 Gbps outbound.

The primary function of the Endace DAG data capture card is to split the incoming stream for the Bro logging system. The stream from the edge router is split into two streams, providing 24 sub-streams to the Bro system.

3.2 Bro Logging System

The Bro network security monitor [65] is an open-source network analysis framework. It provides a generalized platform for network performance measurement and security monitoring. The Bro logging system is able to monitor all network activities from a high-level viewpoint and provides detailed transaction information. Specifically, Bro produces logs covering all transport-layer connections appearing on the network backbone, and many application-layer transcripts, such as HTTP transaction headers, DNS requests and replies, SSL certificates, etc.

Bro is configured to process the traffic streams generated by the Endace DAG card in the monitor. With the incoming stream split by the Endace DAG card, Bro's event engine first transforms the sub-streams into higher-level events, which describe network activities in objective terms. For example, a traffic stream captured by the DAG card is determined by Bro to be an HTTP request and converted into a Bro event containing the request information, such as the HTTP version and IP addresses. Then Bro uses its script interpreter to convert the event into logs, and to notify the Bro user of abnormal activities (e.g., malicious attacks) if corresponding policies are in place. This study focuses on Web traffic analysis, not on detecting and preventing intrusions from the external network.

Table 3.1: A Sample of a Subset of the Bro HTTP Log

  Fields  ts    id.orig_h  id.resp_h  method  host    uri     referer  user_agent
  Types   time  IP addr    IP addr    string  string  string  string   string
  1       ts1   a.b.c.d    e.f.g.h    GET     uc.ca   /1.jpg  abc.ca   Mozilla/5.0
  2       ts2   i.j.k.l    m.n.o.p    GET     uc.ca   /2.png  def.com  Chrome/35.0

Once the logging system is activated, Bro collects and generates logs hourly. The two scientific sites studied in our work are both HTTP servers. Therefore, we concentrate on the HTTP traffic measurements. The HTTP transaction logs contain detailed information about the requests and responses, including request start and end times, response start and end times, host name, request method, referer, user agent, response status code, etc.

Table 3.1 shows a sample of the HTTP log generated by Bro (including only a subset of the fields). The “types” are specific data formats defined by the Bro system, which can also be used in Bro scripts. Note that there are 37 fields in our original HTTP logs; we present selected fields with fabricated data for simplicity. Our analyses primarily rely on the following fields:

• ts (time) is the request start time-stamp, in Linux epoch time format.

• id.orig_h (addr) is the request IP address, in 32-bit (four-byte) format.

• id.resp_h (addr) is the response IP address.

• method (string) is the HTTP method in the request.

• host (string) is the name in the Host request header.

• uri (string) is the requested resource name in that specific host.

• referer (string) is the value in the referer request header.

• user_agent (string) indicates the user agent used by the client.

• request_body_len (count) is the size of the request.

• response_body_len (count) is the size of the response.

• status code (count) is the response status code.

• status msg (string) is the response status message.

• resp_mime_types (vector[string]) indicates the MIME type of the response.

• req_start, req_end, res_start, and res_end (time) are the request/response start/end time-stamps.

As introduced above, the monitor periodically transfers the log files to the storage server.

We study the Bro logs collected from January 1, 2015 to April 30, 2015 in this thesis.

However, there were several disruptions in the logs during our observation period, primarily due to events such as power failures, network disconnections, and Bro system crashes. The outage periods were:

• January 30, 2015: 11:00 - 12:00, 1 hr

• February 15, 2015: 18:00 - 19:00, 1 hr

• April 10, 2015: 10:00 - 24:00, 14 hr

• April 11, 2015: 0:00 - 23:00, 23 hr

• April 30, 2015: all day, 24 hr

For the analyses in the following chapters, these five outages are visible on some graphs. Nevertheless, outages were relatively rare during the observation period. With about four months of data collected by the logging system, we have a good representation of campus network usage, and are able to make informed observations about the traffic characteristics.

3.3 Data Pretreatment

The Bro system generates about 50 GB of compressed log files every day, including HTTP transaction logs, FTP transaction logs, TCP/UDP connection logs, DNS logs, etc. The HTTP logs are separated into hourly files of about 1 GB each. Analyzing even a single day's HTTP transaction logs, around 20 GB in total, is slow.

After several trials, we refined our analysis approach to use a mix of awk1 and Python2 scripts. awk is a standard Linux utility for extracting particular columns of information from a file. Since it is powerful and efficient, we used it to output selected records from the HTTP logs. Python is a well-known programming language. We chose it because it has many free libraries, such as the graph plotting module “Matplotlib”. Furthermore, Python is convenient for handling string variables when working with awk scripts.

While using awk to extract records is reasonably efficient, it is still quite slow to analyze one or several months of data. Therefore, we extract and store the HTTP records for the analyzed sites in temporary files, to speed up the data processing. With this pretreatment step in place, we can normally obtain the subsequent analysis results in a matter of hours.
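As an illustration of this pretreatment step, the following Python sketch filters one hour of Bro HTTP logs down to the records for the studied sites (the host names, file names, and hard-coded column index are hypothetical; in practice the column positions come from the log's “#fields” header):

    import gzip

    # Hypothetical host names for the two studied sites.
    SITES = {"aurora.example.ucalgary.ca", "ism.example.ucalgary.ca"}
    HOST_COL = 8  # assumed column index of the 'host' field; in practice,
                  # read it from the '#fields' header line instead

    def extract(log_path, out_path):
        """Append the HTTP log records for the studied sites to a temporary file.

        Bro's HTTP logs are tab-separated; metadata lines start with '#'.
        """
        with gzip.open(log_path, "rt") as src, open(out_path, "a") as dst:
            for line in src:
                if line.startswith("#"):
                    continue  # skip Bro header/metadata lines
                fields = line.rstrip("\n").split("\t")
                if len(fields) > HOST_COL and fields[HOST_COL] in SITES:
                    dst.write(line)

    # extract("http.2015-01-01-00.log.gz", "site_records.log")  # hypothetical names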

We analyzed the data on servers (4 Intel Xeon X5450 3.00 GHz CPUs, 32 GB RAM) in the Department of Computer Science.

1 The GNU Awk User's Guide, http://www.gnu.org/software/gawk/manual/gawk.html
2 Python (programming language), https://www.python.org/

3.4 Summary

This chapter introduced the methodology of this thesis, including the hardware deployment, the Bro logging framework, and the pretreatments applied to the data.

In the following two chapters, we analyze the HTTP traffic of two scientific Web sites, namely the Aurora site and the ISM site. The traffic of both sites was monitored and collected by the Bro logging system during our four-month observation from January 1, 2015 to April 30, 2015.

Chapter 4

AURORA SITE ANALYSIS

In this chapter, we analyze the Aurora site workload. To begin, we analyze HTTP characteristics, including the number of requests, data volume, HTTP methods, HTTP referers, IP activity, IP geolocation information, and URL popularity. In the process, we identify the existence of automatic crawling scripts (robots) responsible for a large part of the traffic. Next, we provide more detailed analysis of some individual IPs and referer sites based on popularity information. Finally, we identify the file transfer inefficiencies in the traffic generated by the robots. Additional active measurement experiments and results about file transfer inefficiency are available in Chapter 6.

4.1 HTTP Analysis

4.1.1 HTTP Requests

Figure 4.1 shows the daily count of HTTP requests over the four-month period under study. There are approximately 1.5 million requests per day, and 182 million in total (see Table 4.1). The Aurora site had fairly steady request traffic throughout the observation period (except for brief monitor outages on April 11 and April 30), but with a noticeable surge reaching 6 million requests per day in mid-March 2015, due to geomagnetic storm activity affecting the aurora (see Section 4.3).

Table 4.1: Statistical Characteristics of the Aurora Site (Jan 1/15 to Apr 29/15)

  Site    Total Reqs   Avg Reqs/day  Total GB  Avg GB/day  Uniq URLs  Uniq IPs
  Aurora  182,068,131  1,529,984     10,354    87.01       2,894,294  240,236

Figure 4.1: HTTP Request Count Per Day for Aurora Site

Figure 4.2 shows the hourly counts of HTTP requests on four selected days of our trace (i.e., January 1-3 and January 5). We choose these four days since they are in the first week of our trace, and they have clear hourly workload patterns. The consistent structure of the traffic, with over 40 thousand requests per hour, suggests that automated robots are generating most of the traffic. This is particularly likely given that January 1 is a statutory holiday (New Year's Day). Further analysis in Section 4.2 shows that about 50% of the request traffic is attributable to University of California at Berkeley robots crawling the site.

In fact, Figure 4.2 indicates that the robot is crawling the site multiple times per day, with a four-hour period in early January. The pattern started to change slightly on Monday, January 5, 2015.

4.1.2 Data Volume

Figure 4.3 shows that the typical daily data volume for the Aurora site is about 90 GB/day, except for mid-March, when the traffic quadrupled. The Aurora site server provides a variety of data files to the public, including videos, images, and zip files. Since the sizes of video and image files are relatively larger than the rest, the large jump in data volume is attributable to a surge in popularity for these large image or video files.

Figure 4.2: HTTP Requests Per Hour (Jan 1-3 and Jan 5, 2015)

4.1.3 IP Analysis

There were 240,236 distinct IP addresses that visited the Aurora Web site during our trace. Figure 4.4 shows the number of distinct IP addresses viewing the Aurora site per day. The daily count of unique IPs is about 4,000, except for mid-March, when the IP count grew eightfold. It is interesting that the surges in HTTP requests and unique IPs differ, with the former only quadrupling from its usual level.

We performed IP geolocation for all the IP addresses, using the IP location services from IPAddressLabs1 and MaxMind2. The IP addresses come from 192 distinct countries in total. Figure 4.5 shows the IP geolocation distribution of the top 10 countries based on the number of unique IP addresses. Most of the IPs (39.50%) are from Canada, with the United States second at 15.67%. Figure 4.6 shows the IP geolocation distribution of the top 10 countries sorted by request count. Most of the requests (73.22%) come from the United States, with Canada second at 17.47%. Furthermore, for the IP-city distribution, Berkeley, California accounted for 50.28% of the requests, generated by 43 IPs, while Fairbanks, Alaska was second at 14.78% of the requests, with 223 IPs. Since the THEMIS project (a larger collaborative project including the Aurora group) [10] is based in North America, these results are not surprising. There are, however, many other countries accessing the images (e.g., Japan 2.09%, UK 6.30%).

1 IP-GeoLoc IP Address Geolocation Online Service, http://www.ipaddresslabs.com/
2 GeoIP, MaxMind, https://www.maxmind.com/

Figure 4.3: Data Volume (GB) Per Day for Aurora Site

Table 4.2: Top 10 Most Frequently Observed IP Addresses for Aurora Site

  IP               Reqs        Pct     Organization              Location
  128.32.18.45     89,977,861  49.19%  University of California  Berkeley, USA
  137.229.18.201   22,951,449  12.55%  University of Alaska      Fairbanks, USA
  137.229.18.252   3,403,550   1.86%   University of Alaska      Fairbanks, USA
  50.65.108.252    2,394,630   1.31%   Shaw Communications       Edmonton, Canada
  128.32.18.192    1,919,161   1.05%   University of California  Berkeley, USA
  162.157.255.241  1,027,080   0.56%   TELUS Communications      Calgary, Canada
  162.157.31.100   817,197     0.45%   TELUS Communications      Edmonton, Canada
  211.133.151.210  795,318     0.43%   JIN Office Service        Japan
  110.92.52.141    670,887     0.37%   Good Communications       Kagoshima, Japan
  99.66.177.107    564,430     0.31%   AT&T U-verse              Dallas, USA

Table 4.2 shows the geolocation information for the top 10 most frequently observed IP addresses, ranked by number of HTTP requests. Three observations are evident from these results. First, most of the Top 10 are members of the THEMIS project, as expected (e.g., University of California at Berkeley, University of Alaska). Second, some of these organizations have multiple IPs in the Top 10, indicating either multiple auroral researchers, the use of DHCP (Dynamic Host Configuration Protocol), or the use of automated robots. Third, the topmost IP address, which is from UCB, generates about half of the requests. Its total request count actually exceeds the sum of all the other IP addresses, both on a daily basis and overall.

Figure 4.4: Number of Unique IP Addresses Daily from 2015-01-01 to 2015-04-29, Aurora Site
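Lookups like these are easy to script. A minimal sketch using the geoip2 Python package with a local MaxMind database (the database file name and the example lookup are hypothetical; the study itself used the two commercial services cited above):

    import geoip2.database
    from geoip2.errors import AddressNotFoundError

    # Hypothetical path to a MaxMind city database file.
    reader = geoip2.database.Reader("GeoLite2-City.mmdb")

    def locate(ip):
        """Return (country, city) for an IP address, or None if unknown."""
        try:
            record = reader.city(ip)
            return record.country.name, record.city.name
        except AddressNotFoundError:
            return None

    # locate("128.32.18.45")  # expected to resolve to Berkeley, USA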

Zipf’s law [28] is observed in many types of data. It is widely used in Internet traffic analysis such as [35,67]. By sorting the IPs according to the number of requests by each IP, we get the rank and frequency (number of requests) for each IP. Then we plot the (rank, frequency) pairs on a 2-dimensional coordinate system with log scale on both axes. Data manifesting Zipf’s law should result in a straight line in a log-log plot. Figure 4.7 shows the frequency-rank profile for the IP addresses observed at the Aurora site. There is visual evidence of power-law structure.
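A minimal Python/Matplotlib sketch of this procedure, assuming the per-request client IPs have already been extracted from the logs:

    from collections import Counter
    import matplotlib.pyplot as plt

    def plot_freq_rank(client_ips):
        """Frequency-rank profile on log-log axes.

        client_ips: one IP string per request (e.g., the id.orig_h column).
        """
        freqs = sorted(Counter(client_ips).values(), reverse=True)
        ranks = range(1, len(freqs) + 1)
        plt.loglog(ranks, freqs, marker=".", linestyle="none")
        plt.xlabel("Rank")
        plt.ylabel("Frequency")
        plt.savefig("freq_rank.png")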

Figure 4.5: IP Geolocation Distribution, Top 10 Countries Sorted by Unique IPs

4.1.4 HTTP Methods

Figure 4.8 shows the HTTP methods seen over the trace duration. For the Aurora site, 88.4% of the HTTP requests use the GET method, while 11.6% are HEAD requests. Among the HEAD requests, over 99.7% are generated by Wget. Other HTTP methods are negligible, with fewer than 100 requests over the four-month period.

The number of HEAD requests is fairly consistent over time, suggesting that they are generated by robots. Comparatively, the number of GET requests consists of two parts, namely human activities and robot activities. The surge of GET requests in mid-March suggests human activities that outpace the robot traffic at that point.

4.1.5 HTTP Referer

The HTTP referer field (when present) is another source of useful information about the traffic. This field in the HTTP request header indicates the Web page from which the Aurora site was visited.

Figure 4.6: IP Geolocation Distribution, Top 10 Countries Sorted by Request Numbers

We analyze the top 100 referers in terms of requests and data volume. The top referer for both is the Canadian Space Agency (CSA) AuroraMAX portal3, which appeared in 45,763,205 (25%) requests, and triggered a data transfer volume of 4,423 GB (43%) in total.

Most of the referrals come from pages showcasing images or videos from the Aurora Web site.

For example, 25 of the top 100 referrers come from the CSA site, and 9 from virmalised.ee4, which is an Estonian Web site broadcasting live auroral imagery from cameras around the world. These live feed pages generate large volumes of network traffic. Interestingly, many of the referring Web pages use JavaScript to automatically refresh the images shown on the page every few seconds, which contributes to the machine-generated5 traffic.

4.1.6 URL Analysis

Table 4.3 shows the Top 10 most frequently requested URLs for the Aurora site. Most of these URLs are images or videos labeled with recent or latest in the “/summary_plots/” directory. These images are updated automatically by the ground-based cameras every few seconds during the night, while the videos are generated and posted on the Real-Time Environmental Monitoring Platform (RTEMP)6 the next day.

3 Canadian Space Agency, AuroraMAX, http://www.asc-csa.gc.ca/eng/astronomy/auroramax/
4 http://virmalised.ee/en/
5 Note that the browser will refresh the image automatically, whether there is a human viewing the images or not.

Figure 4.7: Frequency-Rank Profile for IP Addresses, Aurora Site

Figure 4.8: HTTP Methods in Aurora Traffic

Table 4.3: Top 10 Most Frequently Requested URLs for Aurora Site

  URL                                          Reqs        Pct    GB     Pct
  /summary_plots/slr-rt/yknf/recent_480p.jpg   32,360,809  17.8%  4,116  39.8%
  /summary_plots/rainbow-rt/yknf/latest.jpg    25,269,475  13.9%  1,349  13.0%
  /summary_plots/slr-rt/yknf/recent_1080p.jpg  3,344,105   1.8%   970    9.4%
  /summary_plots/slr-rt/yknf/recent_SD.jpg     3,147,139   1.7%   170    1.6%
  /summary_plots/slr-rt/yknf/recent_720p.jpg   2,781,414   1.5%   678    6.5%
  /summary_plots/rainbow-rt/sask/latest.jpg    2,177,948   1.2%   26     0.3%
  /summary_plots/rainbow-rt/fsmi/latest.jpg    2,067,294   1.1%   19     0.2%
  /summary_plots/rainbow-rt/rabb/latest.jpg    2,060,695   1.1%   14     0.1%
  /summary_plots/rainbow-rt/gill/latest.jpg    1,958,148   1.1%   17     0.2%
  /summary_plots/rainbow-rt/fsim/latest.jpg    1,796,832   1.0%   22     0.2%

From Table 4.3, we see that the topmost URL “480p” accounts for 18% of the requests and 40% of the data volume. There are a few static HTML files in the Top 100, which contribute very little data volume. Note that the number of unique URLs is actually much larger than 2 million (see Table 4.1), reaching 75,847,177. When some Web sites (e.g., CSA AuroraMAX) fetch data (mostly images and videos) from the Aurora site for live broadcasting, they append a timestamp to the URL as a query string to obtain fresh content (since the URL is used as the key to cache files). For example, the request URL “/abc/latest.jpg” is modified to “/abc/latest.jpg?1426417182905” by the JavaScript code. This “cache busting” technique causes the excessive number of unique URLs.
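During analysis, this effect is easy to undo: stripping the query string collapses the cache-busted variants back to one logical URL. A minimal Python sketch, using the example URL above:

    from urllib.parse import urlsplit

    def strip_cache_buster(url):
        """Collapse cache-busted request URLs to their base resource."""
        return urlsplit(url).path

    # strip_cache_buster("/abc/latest.jpg?1426417182905")  # -> "/abc/latest.jpg"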

Figure 4.9 shows a frequency-rank analysis applied to the URLs requested on the Aurora site. It has several distinct plateaus in the frequency-rank profile. We attribute this to machine-generated request traffic, which we explore in more detail in Section 4.2.

4.1.7 File Type

There are 39 different file types in our trace for the Aurora site. Table 4.4 shows the top 10 file types ranked by HTTP request count. JPEG (Joint Photographic Experts Group) images account for most requests and data volume. It is unsurprising that the dominant traffic contributors are videos and images. Static HTML and JavaScript files are popular in terms of requests, but contribute minimally to data volume.

6 Real-Time Environment Monitoring Platform, http://rtemp.ca/

Figure 4.9: Frequency-Rank Profile for URLs, Aurora Site

4.1.8 HTTP Response Size Distribution

From the URL and file type analysis, we know that image files in the Aurora server are extremely popular. Therefore, we select the top two popular URLs to analyze the HTTP response size distribution. The size values we obtain are mainly affected by two factors:

1) Since those images are updated by the ground-based cameras, the size of the images changes along with the variation of the content. Note that the images posted to the Web pages are compressed to JPG format. The size of the original pictures would be larger. Figure 4.10 shows a series of images taken by the Yellowknife, NWT camera on March 10. Although the changes are not noticeable over small time spans, the image changes regularly and thus the size of the image changes. Even when the camera is not working in the day time, there is a countdown message updated in the images.

2) Instead of directly extracting file size values from the Aurora server, we trace the HTTP responses. The size of an HTTP response depends on the HTTP request (e.g., conditional GET, HEAD, partial GET), and many other factors (e.g., whether the transfer was interrupted). Consequently, even requests for the same file on the Aurora server may yield different response size values.

Table 4.4: Top 10 Most Frequently Requested File Types for Aurora Site

  File Type                Reqs        Pct     Rank  Volume (GB)  Pct     Rank
  Image/JPEG               80,158,252  52.23%  1     6,570        72.50%  1
  Text/HTML                56,475,028  36.80%  2     122          1.35%   5
  Application/X-Gzip       5,765,942   3.76%   3     1,312        14.48%  2
  Text/Plain               2,686,677   1.75%   4     5            0.06%   14
  Image/PNG                975,366     0.64%   5     68           0.75%   6
  Application/JavaScript   671,627     0.44%   6     1            0.01%   15
  Video/MPEG               173,094     0.11%   7     163          1.81%   4
  Video/MP4                151,291     0.10%   8     707          7.81%   3
  Image/GIF                18,509      0.01%   9     13           0.15%   9
  Image/X-Portable-Anymap  10,501      0.01%   10    35           0.39%   7

Figure 4.10: AuroraMAX Images from Yellowknife, 2015/03/10; panels (a)-(d) taken at 01:04 am, 01:23 am, 03:03 am, and 06:27 am

Figure 4.11: HTTP Response Size Values for “/summary_plots/slr-rt/yknf/recent_480p.jpg” File, from 2015-03-09 to 2015-03-15

Figure 4.12: HTTP Response Size Values for “/summary_plots/slr-rt/yknf/recent_480p.jpg” File, on 2015-03-12

Figure 4.13: HTTP Response Size Distribution Histogram for “/summary_plots/slr-rt/yknf/recent_480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis log-scale)

Figure 4.14: HTTP Response Size Distribution Cumulative Histogram for “/summary_plots/slr-rt/yknf/recent_480p.jpg” (x-axis 0-0.2 MB, 50 bins, y-axis proportion)

Figure 4.15: HTTP Response Size Distribution Histogram for “/summary_plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis log-scale)

Figure 4.16: HTTP Response Size Distribution Cumulative Histogram for “/summary_plots/rainbow-rt/yknf/latest.jpg” (x-axis 0-0.08 MB, 50 bins, y-axis proportion)

The purpose of analyzing the response size distributions is to understand the actual bandwidth costs when those popular files are requested. Figure 4.11 shows a time-series of the HTTP response size for all the HTTP requests for “/summary_plots/slr-rt/yknf/recent_480p.jpg” from March 9 to March 15. Figure 4.12 shows the HTTP response sizes for all the HTTP requests for “/summary_plots/slr-rt/yknf/recent_480p.jpg” on March 12. The flat shapes in both figures are the responses generated during the camera's idle hours, when the size of the image rarely changes.

The overall response size distributions of the top two popular URLs (see Table 4.3) are presented in Figure 4.13 and Figure 4.15. Note that the y-axis is drawn in log-scale, and there are 50 bins in total.
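These histograms are straightforward to produce once the response_body_len values for a URL have been extracted; a minimal Matplotlib sketch, assuming the sizes are already converted to MB:

    import matplotlib.pyplot as plt

    def plot_size_histogram(sizes_mb, xmax=0.2):
        """Response size histogram: 50 bins, log-scale y-axis,
        mirroring the layout of Figures 4.13 and 4.15."""
        plt.hist(sizes_mb, bins=50, range=(0.0, xmax), log=True)
        plt.xlabel("Response Size (MB)")
        plt.ylabel("Frequency")
        plt.savefig("size_hist.png")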

The histograms of the top two URLs are visually similar. We attribute this to two reasons:

1) Although these two images are taken by two different lenses, the two lenses are deployed in the same location (Yellowknife) recording the same auroral phenomena. Since the content of the two images is highly correlated, so is the file size.

2) The live feed images are usually displayed together within a single Web page. Therefore, whenever the Aurora server responds to clients, the two images are viewed together. In addition, the images are updated synchronously on the Web page, which leads to similar response size distributions (from the previous reason, we know that the two images are strongly correlated). Furthermore, the user groups who view the two images are almost the same, so the requests sent from their Web browsers for the two images behave similarly.

In addition, there are many small responses (smaller than the actual size of the images) in both histograms. We attribute this to two reasons:

1) For “recent_480p.jpg” over the four-month period of observation, there are 4,271 “206 Partial Content”, 365,283 “304 Not Modified”, and 971 “404 Not Found” responses. For “latest.jpg”, there are 3,372 “206 Partial Content” and 466,460 “304 Not Modified” responses in four months. These response size values should be smaller than the actual size of the images.

2) Since the JavaScript “cache busting” implementation forces the browser to re-fetch images every few seconds, the browser may discard a pending request and the incomplete response data if a new request is generated. This phenomenon is frequently observed when the network connection speed is slow.

The CDFs for the response size distributions are shown in Figure 4.14 and Figure 4.16.

4.2 Robot Traffic

In this section, we study the workloads of the top IPs from the University of California at Berkeley (UCB) and the University of Alaska (UA). In addition, we analyze the traffic introduced by AuroraMAX, the top referrer site mentioned in Section 4.1.5.

4.2.1 Prominent Machine-Generated Traffic

Since we don't have a priori knowledge about which IP addresses are robots, we rely on two heuristics to identify them:

1) We classify an IP address as a robot if it requests the file robots.txt. There were 613 such IP addresses in our dataset.

2) We classify an IP address as a robot if it generates many HTTP requests in a relatively short time, or has a deterministic structure in its request patterns.
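A sketch of how these two heuristics could be applied to the logs (in Python; the one-hour window and request-rate threshold are arbitrary illustrations, not the thresholds used in the thesis):

    import bisect

    def is_robot(ip, robots_txt_ips, req_times, max_reqs_per_hour=10_000):
        """Apply the two robot heuristics to one client IP.

        robots_txt_ips: set of IPs that ever fetched /robots.txt
        req_times: request time-stamps (seconds) for this IP
        The window and threshold are hypothetical placeholders; the
        'deterministic request structure' test is omitted for brevity.
        """
        if ip in robots_txt_ips:                    # heuristic 1
            return True
        req_times = sorted(req_times)
        for i, t in enumerate(req_times):           # heuristic 2: burst rate
            j = bisect.bisect_left(req_times, t + 3600, i)
            if j - i > max_reqs_per_hour:
                return True
        return False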

The top few IPs from UCB and UA in Table 4.2 are definitely robots, based on this loose definition. UCB and UA are two leading participants in the THEMIS project mentioned earlier. We use the terms “UCB1” and “UCB2” to refer to the two most prominent IPs from UCB. Similarly, we use “UA1” and “UA2” to refer to the two most prominent IPs from UA.

Table 4.5: Prominent UCB and Alaska IPs in Aurora Web Site Traffic

  Name  IP              Total Reqs  Reqs/day  Total GB  Avg GB/day
  UCB1  128.32.18.45    89,977,861  756,116   211       1.78
  UCB2  128.32.18.192   1,919,161   16,127    789       6.64
  UA1   137.229.18.201  22,951,449  192,869   1,680     14.12
  UA2   137.229.18.252  3,403,550   28,601    573       4.82

Furthermore, we identify the traffic from the referrer site AuroraMAX from the Canadian Space Agency as robot traffic, since:

1) The AuroraMAX page makes the viewer's browser re-fetch images and videos from the Aurora site repeatedly.

2) It generates a huge volume of traffic for the Aurora site (see Section 4.1.5).

HTTP Requests and Data Volume

Figure 4.17 shows the HTTP requests and data volume information for each of the four UCB and UA IP addresses in Table 4.5.

1) UCB1 generates 89,977,861 requests in total, and 756,116 requests per day on average (see Table 4.5). This is about half of the total Aurora request traffic. However, the daily data volume that UCB1 generated is comparatively small. Upon further analysis, we found that all the requests generated by UCB1 have the user agent Wget/1.11.4 Red Hat modified; Wget is free software for retrieving Web site content using the HTTP, HTTPS, and FTP protocols. With its Wget scripts, UCB1 generates many HTTP requests without generating much data volume, since it only uses the GET method to fetch HTML pages and updated data files, and it checks the time-stamps of data files with the HEAD method.

2) UCB2 was active primarily from mid-January to early February. It only generated 1,919,161 requests in total. Nevertheless, it contributed a large proportion of the data volume in late January (see the surge in Figure 4.17(d)). Different from UCB1, it uses another version of Wget, namely Wget/1.12 (linux-gnu), as the user agent.

Figure 4.17: HTTP Requests and Data Volume Per Day for UCB, UA IPs; panels (a)-(h) show the daily requests and data volume for UCB1, UCB2, UA1, and UA2

3) UA1 generated approximately 0.2 million requests and 14 GB of data volume per day. It has more influence on the data volume compared to the other three IPs.

4) UA2 was active in March, when the geo-magnetic storm happened (see Section 4.3).

Workload Pattern

For these four IP addresses, it is interesting to study how the automatic scripts work by analyzing the pattern of the URLs requested.

With further URL analysis, we found that the UCB1 IP uses Wget to recursively download all the data files in four specific directories: fluxgate/stream0, imager/stream1, imager/stream2, and imager/stream3. Furthermore, UCB1 checks all the data files within both the previous month and the current month in those four directories every day. Note that the data in those directories are organized into folders by month and date. Basically, the Aurora server merely stores each day's new data into the corresponding directory.

The file robots.txt is configured by the site administrator. Located in the Web site root directory, it contains instructions in a specific format, indicating what robots are not permitted to access. By default, Wget follows proper robot etiquette [22]. It requests robots.txt before downloading files, hence providing us a way to figure out the workload pattern.

Figure 4.18 shows the daily count of UCB1's requests for the robots.txt file. There are around 30 robots.txt requests per day from UCB1. In other words, the Wget script was running 30 times each day. Figure 4.19 displays the hourly robots.txt request counts from UCB1 on four selected days. The periodic pattern is visually apparent. The cyclic period on January 16 is 8 hours, and it changes to 4 hours on the other days.

Further analysis of the URL requests shows the following:

1) There are two independent robots running different Wget scripts from the same UCB1 IP. The Wget scripts use recursive download mode to save all files in the given directory to the local hard disk (Wget downloads HTML files but deletes them after extracting any embedded URLs).

Figure 4.18: “robots.txt” Request Count Per Day for UCB1

Figure 4.19: “robots.txt” Request Count Per Hour on Four Selected Days (2015-01-16, 2015-02-01, 2015-02-05, and 2015-04-05)

2) The “imager” robot updates local data from the imager/stream1, imager/stream2, and imager/stream3 directories. It usually takes 2-3 hours to complete the scan of a month's data, and 4-6 hours in total to complete both the previous month and the current month. The “fluxgate” robot updates local data from the fluxgate/stream0 directory. It usually takes 10 minutes to finish 2 months of data.

3) Both robots take a short break after one complete scan over 2 months of data. The length of the break between scans varies from around 1 hour to 4 hours.

4) The file robots.txt is requested whenever a stream directory scan launches.

5) The robots use time-stamping mode, which makes Wget send HEAD requests to check the time-stamps of files on the server side, and only generate GET requests to fetch a file if it has a newer time-stamp (a hypothetical invocation is sketched below).
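Putting these observations together, the UCB1 scripts likely resemble the following Wget invocation; the host name and directory are hypothetical placeholders, reconstructed from the observed behavior rather than taken from the actual scripts:

    # -r: recursive download mode; -N: time-stamping mode (HEAD checks,
    # GET only for files whose server-side time-stamp is newer).
    # Host name and path are hypothetical placeholders.
    wget -r -N http://aurora-site.example.ca/data/themis/imager/stream1/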

Since Wget applies breadth-first search to recursively retrieve the directory structure of the site, it needs to download static HTML pages and extract URLs repeatedly7. Therefore, it is not surprising that UCB1 generates so many requests with very limited data volume. What is surprising is that UCB1 runs the Wget scripts to download files from these directories several times each day, even though the content rarely changes.

Furthermore, it does so using many non-persistent connections, and a lot of HEAD requests, rather than conditional GETs. This approach is not very Internet-efficient, because of the excessive number of TCP connections used and the network round-trip times incurred. We revisit this issue later in Chapter 6.

The UCB2 IP applies Wget/1.12 (linux-gnu) to retrieve files in older directories. Different from UCB1, the UCB2 script generates the URLs itself and invokes Wget to fetch the files. Several aspects of this script are different from UCB1:

7 For more information, refer to the “recursive download mode” in the Wget manual, http://www.gnu.org/software/wget/manual/html_node/Recursive-Download.html

1) The referer information is missing, which indicates that UCB2 visits each URL directly.

2) Some URL requests are spelled incorrectly (e.g., “//data/themis” should be “/data/themis”, though the server resolves the former URL to the latter, responding with 200 OK), which shouldn't happen if Wget extracts the links automatically.

3) Some requested resources receive the “404 Not Found” error response from the Aurora server.

There is no periodic structure for UCB2. In addition, it downloads all data files in the given directory rather than updating local files like UCB1 does. Consequently, the UCB2 robot generated significant data volume in its short active period.

The UA robots are browsers repeatedly viewing the RTEMP live feed pages. Similar to the CSA AuroraMAX page, RTEMP provides Aurora live feeds by re-fetching images and videos from the Aurora site server. The process is repeated every three seconds by the client's browser, implemented with JavaScript Document Object Model (DOM) operations. The RTEMP live feed pages force clients to continuously send GET requests to the Aurora server as long as they are open in the browser (even when there is no human viewing the images; thus we classify it as robot traffic). The two UA robots did the same task. They re-fetched the live feed images for months, except that the user agent is Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0 for UA1, and Mozilla/5.0 (Windows NT 5.1; rv:36.0) Gecko/20100101 Firefox/36.0 for UA2.

There are actually two live feed pages opened by UA1, with different aurora pictures on each page. This produces the step-like structure in Figure 4.17(e). Since the content consists of images and videos, the data volume is much larger than that of the UCB1 robot.

4.2.2 AuroraMAX

The AuroraMAX page on the Canadian Space Agency Web site provides aurora live feeds that include image hyper-links to the Aurora site (note that the visitor's browser fetches images from the Aurora server instead of the CSA server). About 25% of the HTTP requests (45,763,205) and 43% of the data volume (4,423 GB) are generated by this top referrer site.

Figure 4.20: HTTP Request Count Per Day for AuroraMAX

Figure 4.20 and Figure 4.21 show that the daily requests and data volume are steady over the observation period, except for the surge in mid-March.

There were 104,529 unique IP addresses visiting the Aurora site via the AuroraMAX portal over the four-month period. Further analysis shows that the requests from the AuroraMAX portal are not attributable to a small set of highly active IP addresses. The topmost IP accounts for less than 1.5% of the HTTP requests from the AuroraMAX portal (compared to UCB1 accounting for half of the traffic for the whole Aurora site). Figure 4.22 shows the frequency-rank profile for these IP addresses.

Considering AuroraMAX’s popularity as a referrer site, the naive way it fetches the images from the Aurora site is Internet-inefficient. We propose a solution for this inefficiency issue in Chapter 6.

In summary, the robot traffic (UCB, UA, and AuroraMAX) accounts for 90.1% of the total requests and 74.1% of the total data volume. Therefore, solving these inefficiency issues may significantly reduce the load of the Aurora site.

Figure 4.21: Data Volume (GB) Per Day for AuroraMAX

Figure 4.22: IP Addresses Frequency-Rank Profile for AuroraMAX

4.3 Geomagnetic Storm

An interesting discovery in our dataset was the non-stationary traffic observed for the Aurora site in mid-March 2015. The HTTP request traffic and the data volume both quadrupled from their normal levels for the March 17-20 period (see Figure 4.1 and Figure 4.3).

The root cause of this traffic surge was solar flare activity that triggered one of the largest geomagnetic storms in over a decade [56]. Auroral researchers knew about this immediately, and eagerly downloaded many of the new images. The ensuing media coverage of the geomagnetic storm triggered many other site visits, either directly or via the AuroraMAX portal. Figure 4.20 shows that much of the traffic surge arrived via the AuroraMAX referrer site.

Further analysis indicates that the increased traffic is primarily human-initiated, since:

1) The number of distinct IPs visiting the site surged (eightfold) during the geomagnetic storm period (see Figure 4.4).

2) The number of GET requests quadrupled in the surge in Figure 4.8, with no change for HEAD requests. This contrast indicates that the surge was not contributed by Wget robots.

3) There was in fact a ten-fold increase in the AuroraMAX portal traffic (requests and data volume) during this period.

It is interesting to witness how real-world events affect the traffic of scientific Web sites. The traffic information shows that flash crowds are not limited to “popular Web sites”. Such surges are important to consider when provisioning server-side capacity.

4.4 Summary

This chapter provided a detailed analysis of the network traffic for the Aurora site. First, our analysis covered fundamental HTTP characteristics. Specifically, we analyzed the daily and hourly values for HTTP requests and data volume. We extracted the top 100 popular IP addresses, and discovered the existence of machine-generated traffic.

Based on these discoveries, we analyzed the robot traffic. We primarily studied the traffic of four distinct IP addresses from the University of California at Berkeley and the University of Alaska at Fairbanks. The results showed that the way they perform data transfers is very inefficient. In addition, we analyzed the top referrer site AuroraMAX and discovered its inefficient way of fetching live images from the Aurora site. Further discussion of the data transfer inefficiency problem is presented in Chapter 6.

Finally, we showed how real-world events affect the traffic of the Aurora site, by illustrating the changes during the geomagnetic storm.

We analyze the traffic of the ISM site in the next chapter.

Chapter 5

ISM SITE ANALYSIS

The ISM (Inter-Stellar Medium) Web site at the U of C provides astrophysics teaching materials. We present our analysis of the ISM site in this chapter. First, we show the HTTP characteristics for the ISM site. Considering that the network traffic was primarily human-generated, we focus on IP geolocation distribution, user agent classification, and URL popularity analysis. Since large-volume video files are a unique feature of the ISM site, we study how network traffic relates to user behavior patterns when viewing course videos from the ISM site. Finally, we analyze how course schedules affect the network traffic of the ISM site.

5.1 HTTP Analysis

We analyze the traffic logs for a four-month period from January 1, 2015 to April 29, 2015, covering the whole Winter 2015 semester at the U of C. In this semester, lectures began on January 12, and ended on April 15, with final exams running from April 18 to 29. There was a reading week with no lectures from February 15 to 22.

Due to the limitation of our tracing framework, we can only observe the ISM Web traffic generated when users are off-campus. The on-campus traffic doesn’t pass through the campus edge routers and therefore is not seen by our monitor. We may analyze the server-side logs of the ISM site in our future work.

5.1.1 HTTP Requests

A summary of the ISM site traffic is shown in Table 5.1. There are around 1.5 million requests in total, and 13,000 requests per day for the ISM site. While robots and referrer sites contribute most of the traffic to the Aurora site, this is not true for the ISM site. Consequently, it is not surprising that the average request traffic for ISM is about two orders of magnitude lower than that for the Aurora site (the data volumes of the two sites are similar).

Table 5.1: Statistical Characteristics of the ISM Site (Jan 1/15 to Apr 29/15)

  Site  Total Reqs  Avg Reqs/day  Total GB  Avg GB/day  Uniq URLs  Uniq IPs
  ISM   1,583,339   13,305        8,483     71.29       10,563     9,720

Figure 5.1: HTTP Request Count Per Day for ISM Site

The daily ISM site traffic is illustrated in Figure 5.1. Note there are 3 obvious surges in the request traffic over the four months. The surge in late-February aligns with the first midterm in the course (February 24), while the subsequent surges align with the second midterm (March 24) and the final exam (April 21). These surges are “expected” compared to the “unexpected” surges in the Aurora site.

We select six surge days and display their hourly HTTP request traffic in Figure 5.2. The requests usually decreased between midnight and dawn, conforming to human schedules. However, February 24 is a counterexample, for which the requests were influenced by the course midterm exam.

Figure 5.2: HTTP Requests Per Hour (Feb 23, Feb 24, Mar 23, Mar 24, Apr 20, and Apr 21, 2015)

5.1.2 Data Volume

Figure 5.3: Data Volume (GB) Per Day for ISM Site

Table 5.1 shows that the average daily data volume of the ISM site (71 GB per day) is comparable to the Aurora site (87 GB per day), even though the numbers of requests for the two sites are very different. We attribute this to two reasons:

1) The ISM site contains course-related materials rather than research resources. The professor provides large objects (e.g., course videos, PDFs) to his students. Those files are much larger than the JPEG and HTML files provided by the Aurora site.

2) Although the number of requests to the ISM site is low compared to the Aurora site, most of the requests target large files instead of small HTML or JavaScript files.

Figure 5.3 shows the daily data volume information for the ISM site over the four-month period. It is interesting to observe the similar “sawtooth” structures in each month. To be specific, in Figure 5.3, the data volume increased on February 19, 21, and 23, and decreased on February 20 and 22, which makes the broken lines form a “sawtooth” structure. The same “sawtooth” structure appeared in March and April as well. Since these surges align with the exams, we attribute the “sawtooth” structures to the students' studying pattern.

Another interesting discovery is the “out of sync” phenomenon between the dates of the maximum surge in requests versus data volume in late February. Specifically, the maximum surge in requests (Figure 5.1) was on February 24, while the biggest surge in data volume (Figure 5.3) was on February 23 (this issue only occurred in February's surge; the surges in March and April align). By comparing the URLs requested on February 23 and February 24, we find that although the number of video requests on February 24 is larger than on February 23, the average data volume per video request on February 24 is smaller than on February 23. This may indicate that most video viewers (students) tended to skip frames when watching the course videos on February 24, whereas they preferred to watch video clips with longer average durations on February 23. This midterm reviewing pattern makes the number of requests peak on February 24, and the data volume peak on February 23.

5.1.3 IP Analysis

During the four months of observation, 9,720 unique IP addresses visited the ISM site, and around 300 IPs requested files from the ISM server each day. Since the ISM site is mainly designed for students at the University of Calgary, the magnitude of daily users is much smaller than for the Aurora site, at about one-tenth of the Aurora level. The amplitude of the surges in the second half of each month is also comparatively smaller than for the request traffic in Figure 5.1 and the data volume in Figure 5.3, because the primary users are students, who are frequent repeat visitors to the ISM site.

The geolocation analysis for all the IPs visiting the ISM site shows that visitors were from

101 different countries, though about half of those countries (55) generated fewer than 100 requests in four months. Figure 5.4 shows a pie graph of the top 5 countries. Again, it is not surprising that most of the traffic is generated by Canadian (88.24%) and American (7.91%) users. For all the requests from Canada, Alberta surpasses all other provinces with 1.2 million requests (97.64%) in Figure 5.5, while the US distribution is more dispersed in Figure 5.6.

Actually, many of the USA requests are generated by Internet companies, like Google and

Table 5.2: Top 10 Most Frequently Observed IP Addresses for ISM Site

    IP               Reqs     Pct.   Organization                Location
    209.89.92.190    125,296  7.91%  TELUS Communications Inc.   Calgary, Canada
    70.72.185.197     86,648  5.47%  Shaw Communications Inc.    Calgary, Canada
    96.51.68.175      64,912  4.10%  Shaw Communications Inc.    Calgary, Canada
    198.166.61.187    64,581  4.08%  TELUS Communications Inc.   Calgary, Canada
    209.89.235.216    61,135  3.86%  TELUS Communications Inc.   Calgary, Canada
    68.146.124.225    43,501  2.75%  Shaw Communications Inc.    Calgary, Canada
    206.75.57.71      40,749  2.57%  TELUS Communications Inc.   Calgary, Canada
    162.157.164.121   39,405  2.49%  TELUS Communications Inc.   Calgary, Canada
    68.146.221.78     26,053  1.65%  Shaw Communications Inc.    Calgary, Canada
    68.110.70.13      20,802  1.31%  Cox Communications Inc.     Scottsdale, USA

Figure 5.4: IP Geolocation Distribution for Countries (Canada 88.24%, 1,397,096 requests; United States 7.91%, 125,269; United Kingdom 0.75%, 11,847; France 0.48%, 7,612; China 0.38%, 6,092; others 2.24%, 35,423)

Figure 5.5: IP Geolocation Distribution for Canada (Alberta 97.64%, 1,221,943 requests; British Columbia 1.25%, 15,639; Ontario 0.61%, 7,660; Quebec 0.31%, 3,838; Saskatchewan 0.16%, 2,032; others 0.03%, 361)

Figure 5.6: IP Geolocation Distribution for USA (California 32.85%, 39,627 requests; Arizona 18.53%, 22,356; Washington 12.10%, 14,595; New Jersey 10.94%, 13,191; Massachusetts 4.63%, 5,588; others 20.94%, 25,263)

Figure 5.7: IP Geolocation Distribution for Alberta (Calgary 93.07%, 1,129,175 requests; Red Deer 1.64%, 19,844; Medicine Hat 1.21%, 14,622; Cochrane 0.92%, 11,188; Edmonton 0.79%, 9,628; others 2.38%, 28,821)

Figure 5.7 shows that 1.1 million requests come from Calgary, dominating all other cities in Alberta. Furthermore, among all the requests generated in Canada, about half (704,074 requests, or 50.4%) use the Internet service provided by “Shaw Communications Inc.”, and 44.8% belong to “TELUS Communications Inc.”.

Figure 5.8 shows how many unique IPs visited the ISM site from Canada and Calgary per day. Since Canada is the primary contributor to the ISM site traffic, and Calgary is the primary contributor within Canada, the structure of the red area aligns with the blue and green areas. In addition, we analyzed the daily unique IPs from the USA and its top state, California, in Figure 5.9. Nearly half of the IPs from the USA are in California, and the surges for the USA align with the surges for California.

The IP frequency-rank profile of the ISM site is shown in Figure 5.10. Visual evidence of a power-law structure is apparent.
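(A frequency-rank profile of this kind can be reproduced by counting requests per IP and sorting the counts in decreasing order; the sketch below assumes a hypothetical input file with one IP per request.)

    # Sketch: frequency-rank (Zipf) profile of client IPs; on log-log axes,
    # an approximately straight line suggests power-law structure.
    from collections import Counter
    import matplotlib.pyplot as plt

    counts = Counter(line.strip() for line in open("ism_request_ips.txt"))
    freqs = sorted(counts.values(), reverse=True)    # frequency at rank 1..N

    plt.loglog(range(1, len(freqs) + 1), freqs)
    plt.xlabel("Rank")
    plt.ylabel("Frequency")
    plt.savefig("ip_rank_profile.png")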

5.1.4 URL Analysis

There are 10,563 different URLs on the ISM site requested over the four-month period. Table 5.3 shows the top 10 most popular URLs for the ISM site. Note that we show only the file names and parts of the directory names in Table 5.3, for convenience.

Figure 5.8: Number of Daily Unique IP Addresses Visiting ISM Site, from Canada and Calgary (2015-01-01 to 2015-04-30) [time series of daily IP counts, 0-600, for Calgary, Canada (excl. Calgary), and Outside Canada]

Since the course instructor changed the format of all the course videos in the middle of the semester, both “.mov” and “.mp4” extensions appear among the top 10 URLs.

It is unsurprising that course materials are popular among the URLs. Furthermore, the large videos and PDF files generate tremendous data volume from a limited number of requests.

Figure 5.11 shows the URL frequency-rank profile of the ISM site. Unlike the step shape for Aurora in Figure 4.9, the URL frequency-rank profile for the ISM site shows visual evidence of a power-law structure, typical of human-generated requests.

Table 5.3: Top 10 Most Frequently Requested URLs for ISM Site

    URL                                               Total Reqs  Total GB
    ASTR209 - Lec8 - Feb 5, 2015.mov                     153,410    267.04
    ASTR209 - Lec3 - Jan 20, 2015.mov                     87,051    787.02
    ASTR209 - Intro. & Lecture#1 - Jan 13,2015.mov        75,380    735.64
    ASTR209 - Lec4 - Jan 22, 2015.mov                     68,609    584.47
    AST209 Podcast/rss.xml                                56,293      0.71
    2015/1/28 Course Notes files/Part2 e&m.pdf            55,952     58.07
    ASTR209 - Lec2 - Jan 15, 2015.mov                     39,687    998.60
    ASTR209 - Lec10, Feb 12, 2015.mov                     31,308    310.65
    2015/3/11 Course Notes files/Part2 e&m.pdf            30,068     23.54
    ASTR209 - Lec15 - Mar 12, 2015.mp4                    28,690    284.02

Figure 5.9: Number of Daily Unique IP Addresses Visiting ISM Site, from USA and California (2015-01-01 to 2015-04-30) [time series of daily IP counts, 0-600, for California, US (excl. California), and Outside US]

Table 5.4: HTTP Method Summary for ISM Site

    HTTP Method  Rank  Reqs       Avg Reqs/day  Pct.
    GET          1     1,575,574  13,130        99.51%
    HEAD         2     7,749      65            0.49%
    OPTIONS      3     11         0.09          0.00%
    POST         4     5          0.04          0.00%

5.1.5 HTTP Methods

Table 5.4 shows a summary of the HTTP method information for the ISM site. As expected, GET requests dominate the other HTTP methods. Since there is no Wget robot crawling the ISM site, HEAD requests account for only a small fraction of the total traffic. Furthermore, 7,285 of the HEAD requests (94.01%) were generated by Apple’s iTunes application to check the existence of some resources, or whether the ISM site’s RSS (Rich Site Summary) [23] file had been updated.
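(Such feed polling can be approximated with a conditional HEAD request that checks a validator such as Last-Modified before downloading the feed body; the URL and cached value in the sketch below are hypothetical.)

    # Sketch: poll an RSS feed the way a podcast client might, using HEAD
    # to check the Last-Modified validator before issuing a full GET.
    import requests

    FEED = "http://ism.example.ucalgary.ca/AST209_Podcast/rss.xml"  # hypothetical
    last_seen = "Tue, 24 Feb 2015 08:00:00 GMT"      # value cached previously

    head = requests.head(FEED)
    if head.headers.get("Last-Modified") != last_seen:
        feed = requests.get(FEED)
        print("feed changed:", len(feed.content), "bytes fetched")
    else:
        print("feed unchanged; no GET needed")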

Figure 5.12 displays the daily number of GET and HEAD requests. The HEAD requests are rarely seen.

Figure 5.10: Frequency-Rank Profile for IP Addresses, ISM Site [log-log axes; Frequency (10^0 to 10^6) versus Rank (10^0 to 10^4)]

Figure 5.11: Frequency-Rank Profile for URLs, ISM Site [log-log axes; Frequency (10^0 to 10^6) versus Rank (10^0 to 10^5)]

Figure 5.12: HTTP Methods in ISM Traffic [time series of daily request counts, 0-150K, for GET and HEAD]

Table 5.5: HTTP Status Code Summary for ISM Site

    Status Code  Type                             Rank  Reqs     Avg Reqs/day  Pct.
    206          Partial Content                  1     927,733  7,731         58.59%
    200          OK                               2     507,358  4,227         32.04%
    304          Not Modified                     3     79,064   658           4.99%
    404          Not Found                        4     47,372   394           2.99%
    301          Moved Permanently                5     52       0.43          0.00%
    416          Requested Range Not Satisfiable  6     33       0.28          0.00%
    400          Bad Request                      7     1        0             0.00%

5.1.6 HTTP Status Codes

The HTTP status code is part of the HTTP response header, indicating how the server responded to an HTTP request. For example, the server responds with a “200 OK” status code when it successfully returns the resource requested by a client’s GET request.

Table 5.5 summarizes the HTTP status codes for the ISM site. Status code “206” (Partial Content) is the most common, accounting for around 60% of the requests, while “200” is second at 32%. This result is quite different from the workload characterization of most general Web sites, where “200 OK” responses dominate. This situation is primarily caused by students frequently requesting pieces of large files (e.g., videos and PDFs), and by Internet user agent behaviors.
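(The “206 Partial Content” responses arise from HTTP Range requests; the sketch below shows a ranged GET of the kind that produces them, with a hypothetical URL and a 128 KiB range matching the sizes observed later in Section 5.1.7.)

    # Sketch: a ranged GET that elicits a "206 Partial Content" response.
    # The URL is hypothetical; bytes=0-131071 asks for the first 128 KiB.
    import requests

    url = "http://ism.example.ucalgary.ca/ASTR209-Lec8.mov"
    resp = requests.get(url, headers={"Range": "bytes=0-131071"})

    print(resp.status_code)                   # 206 if byte ranges are supported
    print(resp.headers.get("Content-Range"))  # e.g., "bytes 0-131071/1610612736"
    print(len(resp.content))                  # 131072 bytes transferred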

Figure 5.13: HTTP Status Code in ISM Traffic [time series of daily counts, log-scale 10^0 to 10^6, for status codes 206, 200, 304, and 404]

We plot the daily counts for the top 4 status codes in Figure 5.13. It is interesting to observe the interleaving pattern of status codes “200” (red dashed line) and “206” (black solid line) whenever an exam was imminent. We attribute this to students’ reviewing strategies, and discuss it further in Section 5.3.

5.1.7 HTTP Response Size Distribution

Figure 5.14 shows the HTTP response sizes for all the HTTP requests for “Lec8 - Feb 5, 2015.mov” from February 18 to February 24. The x-axis represents the time series, covering about 130,000 requests generated in one week (the midterm surge is included).

Figure 5.15 shows the HTTP response sizes for all the HTTP requests for “Lec8 - Feb 5, 2015.mov” on February 24 (the first midterm date). The video was retrieved frequently from early morning to noon, right up to the midterm.

We performed a series of HTTP response size distribution analyses for the ISM site. These analyses include all responses (e.g., “206 Partial Content” and “200 OK”). The response size values are the transferred data volumes, rather than the “Content-Length” header values.

Figure 5.14: HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, from 2015-02-18 to 2015-02-24

Figure 5.15: HTTP Response Size Values for “Lec8 - Feb 5, 2015.mov” File, on 2015-02-24

Figure 5.16: HTTP Response Size Distribution Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-5 GB, 50 bins, y-axis log-scale)

Figure 5.17: HTTP Response Size Distribution Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-10 GB, 50 bins, y-axis log-scale)

Figure 5.18: HTTP Response Size Values (Byte) Per Request, Top 5 Counts for “Lec8 - Feb 5, 2015.mov” (131,072 B 51.57%, 79,106 requests; 262,144 B 21.64%, 33,205; 65,536 B 12.21%, 18,727; others 11.72%, 17,977; 0 B 1.93%, 2,963; 327,680 B 0.93%, 1,432)

Figure 5.19: HTTP Response Size Values (Byte) Per Request, Top 5 Counts for “Lec3 - Jan 20, 2015.mov” (65,536 B 47.54%, 41,385 requests; others 46.96%, 40,875; 1,245,184 B 1.58%, 1,379; 1,179,648 B 1.51%, 1,311; 0 B 1.36%, 1,181; 1,310,720 B 1.06%, 920)

Figure 5.16 and Figure 5.17 show the response size histograms for the top 2 most popular URLs, “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov”. Note that the y-axis is drawn in log-scale, and there are 50 bins in total. Although the actual file size of “Lec8 - Feb 5, 2015.mov” is 1.5 GB, and that of “Lec3 - Jan 20, 2015.mov” is 2.1 GB, most of the response size values in both figures fall into the small-value bins.

Because histograms provide limited detail, we draw pie graphs of the 5 most frequent response size values (in bytes) for the 2 URLs in Figure 5.18 and Figure 5.19. Clearly, small responses with data volumes below 1 MB predominate at the ISM server, even though the requested videos are around 2 GB in size. Furthermore, the majority of these small values concentrate on a few specific sizes, such as 131,072 bytes (51.57%, with 79,106 requests) and 65,536 bytes (47.54%, with 41,385 requests); these are power-of-two chunk sizes (128 KiB and 64 KiB, respectively). These phenomena are caused by Internet user agent behaviors when fetching large files from a server that supports partial GET requests.

Figure 5.20 shows zoomed-in histograms and cumulative histograms of the 2 URLs for response data volumes smaller than 1 MB. The peaks correspond to the popular response size values from the pie graphs (e.g., 131,072 bytes). Note that the y-axis is drawn in log-scale, and there are 50 bins in total. The step-like shapes of the cumulative histograms also indicate that many responses share the same size values.
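(Histograms and cumulative histograms of this kind can be computed directly from the per-response sizes; a NumPy sketch follows, with a hypothetical input file.)

    # Sketch: 50-bin histogram and cumulative histogram of response sizes
    # below 1 MB, mirroring the zoomed-in plots. Input file is hypothetical.
    import numpy as np

    sizes = np.loadtxt("lec8_response_sizes.txt")   # one size (bytes) per line
    small = sizes[sizes < 2**20]                    # keep responses under 1 MB

    hist, edges = np.histogram(small, bins=50, range=(0, 2**20))
    cumulative = np.cumsum(hist) / hist.sum()       # fraction of responses

    for right, frac in zip(edges[1:], cumulative):
        print(f"<= {right:9.0f} B : {frac:.3f}")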

The ISM server supports the “Accept-Ranges: bytes” feature, which allows clients to request arbitrary byte ranges of a file stored on the server. Therefore, clients can request partial content from the ISM server. User agents may exhibit diverse behaviors when fetching large files from the ISM server, even though the videos are served as static files (namely, adaptive streaming techniques are not applied). We revisit this issue in Section 5.2.

5.1.8 User Agents

Unlike the traffic of the Aurora site, the vast majority of users viewing the ISM site are humans. Therefore, it is meaningful to analyze the user agent information. In this section, we use the on-line user agent database provided by “User Agent String.Com” (http://www.useragentstring.com/) to identify viewers’ operating system and user agent information.

Figure 5.20: Histograms of Response Size Values (smaller than 1 MB) for “Lec8 - Feb 5, 2015.mov” and “Lec3 - Jan 20, 2015.mov” Files. (a) Response Size Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-1 MB, 50 bins, y-axis log-scale); (b) Response Size Cumulative Histogram for “Lec8 - Feb 5, 2015.mov” (x-axis 0-1 MB, 50 bins); (c) Response Size Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-1 MB, 50 bins, y-axis log-scale); (d) Response Size Cumulative Histogram for “Lec3 - Jan 20, 2015.mov” (x-axis 0-1 MB, 50 bins)

Table 5.6: Top 10 Most Popular User Agents for the ISM Site

    User Agent Name                                                             Reqs
    AppleCoreMedia/1.0.0.11D201 (iPhone; U; CPU OS 7_1_1 like Mac OS X; en_us)  142,232
    AppleCoreMedia/1.0.0.12B466 (iPad; U; CPU OS 8_1_3 like Mac OS X; en_us)    124,788
    AppleCoreMedia/1.0.0.12B435 (iPhone; U; CPU OS 8_1_1 like Mac OS X; en_gb)  72,731
    AppleCoreMedia/1.0.0.11A501 (iPad; U; CPU OS 7_0_2 like Mac OS X; en_us)    64,608
    AppleCoreMedia/1.0.0.12A405 (iPad; U; CPU OS 8_0_2 like Mac OS X; en_us)    60,223
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0    50,255
    Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0    41,548
    Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0    41,536
    AppleCoreMedia/1.0.0.10K549 (Macintosh; U; Intel Mac OS X 10_8; en_us)      38,045
    Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0    28,225

The top 10 most popular user agents for the ISM site are shown in Table 5.6. Six of the user agents in the table are from Apple products, while the rest are Firefox browsers running on the Windows operating system.

Among all captured user agents, about half of the requests (49.15%) belong to the “browser” category, 44.31% are identified as “AppleCoreMedia”, and 2.01% as “crawler”. Specifically, Figure 5.21 shows the distribution of the top 8 user agent names; all other identified user agents are grouped under the “others” label.
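(The classification above relies on the useragentstring.com database; the sketch below is a deliberately simplified substring-based approximation of the same idea, with illustrative rules and a hypothetical input file.)

    # Sketch: coarse user-agent classification approximating the categories
    # above; these substring rules are illustrative simplifications.
    from collections import Counter

    def classify(ua: str) -> str:
        if "AppleCoreMedia" in ua:
            return "AppleCoreMedia"
        if any(b in ua for b in ("Googlebot", "bingbot", "Baiduspider", "Sogou")):
            return "crawler"
        if any(b in ua for b in ("Firefox", "Chrome", "Safari", "MSIE", "Trident")):
            return "browser"
        return "others"

    shares = Counter(classify(ua.rstrip("\n")) for ua in open("ism_user_agents.txt"))
    total = sum(shares.values())
    for label, n in shares.most_common():
        print(f"{label:16s} {100 * n / total:5.2f}% ({n})")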

Since about half of the requests were generated by Internet browsers, we analyzed the user agents within the browser category in Figure 5.22. Firefox, Chrome, and Safari are the top 3 most popular browsers. Internet Explorer is in 4th position, with only a 6.67% share.

For all the user agents labeled as “crawler”, we found that “Googlebot” from Google accounts for about half of the crawler traffic (15,734 requests, 49.46%), and “Bingbot” from Microsoft ranks second with 8,031 (25.25%) HTTP requests. There are also crawlers operated by Chinese search engines, such as Baidu (http://www.baidu.com/) and Sogou (http://www.sogou.com/).


Figure 5.21: User Agent Names Distribution in the ISM Site (AppleCoreMedia 44.31%, 701,507 requests; Firefox 18.63%, 295,001; Chrome 14.78%, 234,035; Safari 10.86%, 171,994; Internet Explorer 3.28%, 51,897; unknown 3.03%, 48,006; Android Webkit Browser 1.48%, 23,404; iTunes 1.38%, 21,929; others 2.25%, 35,563)

Figure 5.22: User Agent Browsers Distribution in the ISM Site (Firefox 37.91%, 295,001 requests; Chrome 30.07%, 234,035; Safari 22.10%, 171,994; Internet Explorer 6.67%, 51,897; Android Webkit Browser 3.01%, 23,404; others 0.24%, 1,856)

Figure 5.23: Operating System Distribution in the ISM Site (Macintosh 59.01%, 934,367 requests; Windows 32.96%, 521,945; unknown 4.41%, 69,809; Android 1.88%, 29,763; Darwin 0.96%, 15,223; Linux 0.68%, 10,805; others 0.09%, 1,424)

Figure 5.23 summarizes the distribution of operating systems for all the user agents. Macintosh (the user agent database classifies all Apple products into this category) is the most popular operating system; it includes iPhone OS (iOS) [18], running on the iPhone, iPad, and iPod touch, and OS X [21], running on Apple computers. The second most prevalent is Microsoft’s Windows, generating 32.96% of the requests. Interestingly, Android is not as popular among students. We further analyze each operating system category:

1) Within the Apple category, iPhone OS accounts for 70.88% of the requests (662,286 in total), and OS X represents 29.12% (272,081 requests).

2) For Windows users, 44.94% (234,572) of the requests were generated by Windows 7, 40.87% (213,300) by Windows NT, 9.24% (48,238) by Windows 8, 2.82% (14,700) by Windows Vista, and 1.29% (6,743) by Windows XP. These results are similar to the Windows Web browsing shares in [26].

We list the top 5 popular versions for some selected operating systems in Table 5.7. Note that Apple users tend to upgrade their OS to the newer versions more frequently, compared to Windows users.

Table 5.7: Top 5 OS Versions

    (a) Android (1.88% of total Reqs)
    OS Version  Reqs    Pct.
    4.4.2       14,781  49.66%
    4.4.4       5,667   19.04%
    5.0.1       2,299   7.72%
    5.0.2       935     3.14%
    4.2.1       790     2.65%

    (b) iPhone OS (41.83% of total Reqs)
    OS Version  Reqs     Pct.
    8.1.3       170,011  25.67%
    7.1.1       143,252  21.63%
    8.1.1       89,253   13.48%
    8.0.2       67,816   10.24%
    7.0.2       64,786   9.78%

    (c) OS X (17.18% of total Reqs)
    OS Version  Reqs    Pct.
    10.6.8      47,491  17.45%
    10.10.2     47,090  17.31%
    10.9.5      31,012  11.40%
    10.10.1     30,689  11.28%
    10.8.3      19,978  7.34%

    (d) Windows (32.96% of total Reqs)
    OS Version  Reqs     Pct.
    Win 7       234,572  44.94%
    Win NT      213,300  40.87%
    Win 8       48,238   9.24%
    Win Vista   14,700   2.82%
    Win XP      6,743    1.29%

Table 5.8: Top 5 Browser Versions

    (a) Firefox (18.63% of total Reqs)
    Browser Version  Reqs     Pct.
    35.0             118,495  40.17%
    36.0             86,130   29.20%
    37.0             54,030   18.32%
    34.0             13,391   4.54%
    33.0             6,259    2.12%

    (b) Chrome (14.78% of total Reqs)
    Browser Version  Reqs    Pct.
    40.0.2214.115    37,381  15.97%
    40.0.2214.111    28,115  12.01%
    42.0.2311.90     22,460  9.60%
    41.0.2272.118    21,752  9.29%
    41.0.2272.101    19,674  8.41%

    (c) Safari (10.86% of total Reqs)
    Browser Version  Reqs    Pct.
    8.0              49,467  28.76%
    8.0.3            20,880  12.14%
    7.0              15,995  9.30%
    8.0.2            15,944  9.27%
    8.0.4            11,225  6.53%

    (d) Internet Explorer (3.28% of total Reqs)
    Browser Version  Reqs    Pct.
    11.0             32,116  61.88%
    10.0             9,824   18.93%
    7.0              4,785   9.22%
    8.0              2,601   5.01%
    9.0              1,403   2.70%

Figure 5.24: HTTP Requests Count Per Day for Video Requests [time series of daily request counts, 0-150K, for Video and Other]

In addition, we list the top 5 versions for selected browsers in Table 5.8. It is clear that clients using Internet Explorer and Safari are more inclined to update the browser to the latest version.

5.2 Video Viewing Pattern and Traffic

From the previous analysis, we have a general understanding of the ISM site traffic. The large course videos make its traffic pattern different from that of other sites. Therefore, we study the video viewing pattern and the corresponding traffic in this section.

5.2.1 Video Requests Traffic

Figure 5.24 shows the daily video requests. Clearly, most requests are video requests. However, the number of video requests stays at a comparatively low level during the first half of each month. Furthermore, non-video requests even exceed video requests during the last two surges, which is caused by students’ exam reviewing strategies: before the first midterm, students relied more on the lecture videos for studying, but they turned to other materials (e.g., lecture notes) when studying for the second midterm and the final exam.

Figure 5.25: Data Volume (GB) Per Day for Video Requests [time series of daily data volume, 0-500 GB, for Video and Other]

Figure 5.25 compares the video-related data volume to the non-video traffic. The result shows that most of the data volume is contributed by video requests throughout the four-month observation, aligning with the analysis in the previous sections. Even when the number of requests retrieving other resources exceeds the number of video requests in Figure 5.24, the data volume of video requests still dominates.

We analyze the HTTP transaction durations (Figure 5.26) and response size values (Figure 5.27) of all the video requests during the four months of observation. The duration value is calculated as the time between sending a request and receiving the response. For all video requests, we find that 98.1% of HTTP transaction duration values are shorter than 10 seconds, and 94.6% of response size values are smaller than 5 MB. In other words, short HTTP transaction durations and small response sizes dominate the video HTTP transactions, owing to the prevalence of HTTP partial content request-responses. At the other extreme, some HTTP transactions last for several hours. This is surprising, since one lecture video usually lasts only an hour. These rare long transactions may be caused by slow network speeds, connection failures, or paused video players. The same situation also appears in the response size distribution.

Figure 5.26: HTTP Transaction Durations Distribution Histogram (x-axis 0-60K s, 50 bins, y-axis log-scale)

Figure 5.27: HTTP Response Size Distribution Histogram (x-axis 0-12 GB, 50 bins, y-axis log-scale)

Figure 5.28: HTTP Transaction Duration (≤ 10s) CDF for Video Requests

Figure 5.29: HTTP Response Size (≤ 5MB) CDF for Video Requests

Figure 5.28 and Figure 5.29 show the CDFs of the HTTP transaction durations (≤ 10s) and the HTTP response sizes (≤ 5MB) for video requests. The x-axes represent duration values (in seconds) and response size values (in MB), respectively. The duration curve is comparatively smoother than the response size curve, which has a few vertical jumps. Further analysis indicates that the dominant response sizes are 65,536 bytes (35.4% of requests), 131,072 bytes (12.1%), and 262,144 bytes (5.2%), i.e., 64 KiB, 128 KiB, and 256 KiB chunks. This phenomenon is caused by user agents fetching large (video) files from a server that supports partial GET requests.

The ISM server supports the “Accept-Ranges: bytes” feature, which allows clients to request arbitrary byte ranges of a file stored on the server. Therefore, clients can request partial content from the ISM server, and user agents may exhibit diverse behaviors when fetching large video files from it. Among all video requests, the user agent “AppleCoreMedia” is responsible for 701,499 of the video requests (97.9%), dominating the other popular Internet browsers. AppleCoreMedia is a framework in Apple’s products used to process on-line videos, cooperating with other applications such as Safari and iTunes. From the access logs, we find that the user agent field in an HTTP request sometimes changes to “AppleCoreMedia” even when using Safari. Furthermore, we discover that the partial requests generated by Safari or “AppleCoreMedia” are quite unpredictable. For example, the range values are not always monotonic or contiguous; they occasionally skip or overlap. Yao et al. [60] also found this behavior on other iOS devices and analyzed its inefficiency. They concluded that about 10%-70% of the traffic is redundant when accessing Internet streaming services on iOS devices.
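(One way to quantify such redundancy is to compare the total bytes requested against the union of the requested byte ranges; the sketch below does this for an illustrative set of overlapping ranges.)

    # Sketch: estimate redundant transfer caused by overlapping Range requests.
    # Redundancy = fraction of requested bytes already covered by other ranges.
    def redundancy(ranges):
        """ranges: inclusive (first, last) byte offsets, as in
        'Range: bytes=first-last' request headers."""
        total = sum(last - first + 1 for first, last in ranges)
        merged = sorted(ranges)
        covered = 0
        cur_first, cur_last = merged[0]
        for first, last in merged[1:]:
            if first > cur_last + 1:                 # disjoint: close interval
                covered += cur_last - cur_first + 1
                cur_first, cur_last = first, last
            else:                                    # overlapping or adjacent
                cur_last = max(cur_last, last)
        covered += cur_last - cur_first + 1
        return (total - covered) / total

    # Illustrative overlapping 128 KiB ranges like those seen in the logs:
    print(redundancy([(0, 131071), (65536, 196607), (131072, 262143)]))  # ~0.33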

5.2.2 Browser Behaviors for Video Playing

As introduced in Section 2, there are three different techniques for streaming video over the Internet. We analyzed the HTML source code of the ISM site and found that it uses the progressive download technique. This implementation of video streaming in the ISM site is not only inconvenient for users, but also inefficient in its network usage. We explore this further by performing a comparison experiment.

We deploy an Apache HTTP server on a PC, with “Accept-Ranges: bytes” enabled by default. The configuration of our server is essentially the same as that of the ISM server. We limit client bandwidth to 10.24 Mbit/s to simulate a realistic QoS environment (cf. https://en.wikipedia.org/wiki/List_of_countries_by_Internet_connection_speeds), using the Apache “mod_ratelimit” module (http://httpd.apache.org/docs/2.4/mod/mod_ratelimit.html); a configuration sketch appears after the case list below. The Web server and clients all run on the same PC (localhost), so network issues are eliminated. One lecture video (“ASTR209 - Lec4 - Jan 22, 2015.mp4”) is downloaded from the ISM site as a sample and deployed on our server. We experiment with four server-side video delivery implementations, tested with the latest versions of Firefox, Chrome, Safari, and Internet Explorer (see Figure 5.22):

Case 1) The video file is served as a static file on the server. This is the simplest way of delivering video files.

Case 2) The video file is embedded as an HTML “<object>” element (http://www.w3schools.com/tags/tag_object.asp), with its “type” attribute set to Video/QuickTime. This is implemented exactly the same way as in the ISM site.

Case 3) The video is displayed by the HTML5 “<video>” element. This is the standard way to embed a video in a Web page, but was not feasible before HTML5.

Case 4) The video is displayed by an MPEG-DASH implementation with Dash.js support. This approach needs to process the video and generate the Media Presentation Description file beforehand. Dash.js requires Media Source Extensions support in the browsers.
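(For reference, a mod_ratelimit configuration sketch of the kind used for the bandwidth limit above; the path is illustrative. The rate-limit environment variable is specified in KiB/s, so 1250 KiB/s corresponds to 10.24 Mbit/s.)

    # Sketch: Apache mod_ratelimit configuration (path illustrative).
    # rate-limit is in KiB/s: 1250 KiB/s * 1024 * 8 = 10.24 Mbit/s.
    LoadModule ratelimit_module modules/mod_ratelimit.so

    <Location "/videos">
        SetOutputFilter RATE_LIMIT
        SetEnv rate-limit 1250
    </Location>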

The results are shown in Table 5.9. The browser names and versions are listed in the leftmost column. We use “Static File”, “Object Element”, “HTML5 Video”, and “MPEG-DASH” to represent the four implementations. The column “Play” shows whether the video is able to be played in that condition, and “Forward” shows whether the video can be
