Visualizing Mirrors: mirrors.ustc.edu.cn Server Log Analysis

李博杰 [email protected]

©USTC LUG

August 14, 2012

Outline

1. Requests & Traffic: By Time; By IP; By Other Measures
2. Files: Files Characteristics; How Files Are Requested
3. Sessions
4. Distributions Insight: CentOS; Fedora; Ubuntu
5. Technical Details
6. Query Optimization

Notes

The data is the access log of mirrors.ustc.edu.cn over 51 days. See the ‘Technical Details’ section for more information about the dataset.
Some graphs are in log scale for clarity. Please note whether the x axis, the y axis, or both are in log scale. The graph title is sometimes inaccurate.
Because a graph may contain many points, the data is sampled to reduce file size (the graphs are vector graphics), so there may be some ‘straight lines’. I have checked the data to make sure the graphs illustrate real trends.
Title length is limited, so a title may not explain its graph well; please keep an eye on the axes and the key of the graph.
Graphs are shown in the hope of conveying information without words.
For any questions or suggestions, please email me.

Requests & Traffic in a day

[Figure: Requests & Traffic within a day. x-axis: time of the day (00:00 to 24:00); left y-axis: requests (0 to 400000); right y-axis: traffic in bytes (0 to 5e+10). Series: request count and traffic, both Bezier smoothed.]

Requests & Traffic in a week

[Figure: Requests & Traffic in different weekdays. x-axis: weekday, Monday to Sunday; left y-axis: requests (0 to 7e+07); right y-axis: traffic in bytes (0 to 8e+12). Series: request count, traffic.]

Requests & Traffic across 50 days

[Figure: Requests & Traffic in 50 days. x-axis: date (05-20 to 07-15); left y-axis: requests (0 to 1.2e+07); right y-axis: traffic in bytes (0 to 1.2e+12). Series: request count, traffic.]

Statistics

                  Requests     Traffic
Total             328976877    36892 GB
Avg. per Day      6450527      723.4 GB
Max. per Day      8632963      1049.5 GB
Min. per Day      4868022      421.5 GB
Avg. per Hour     268771       30.14 GB
Max. per Hour     561925       79.75 GB
Min. per Hour     99506        2.97 GB
Avg. per Minute   4480         514.4 MB
Max. per Minute   14714        N/A
Min. per Minute   441          N/A
Avg. per Second   74.66        8779 KB
Max. per Second   2117         N/A
Min. per Second   1            N/A
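The averages in the table above follow from the two totals and the 51-day span mentioned in the notes. A minimal sketch of that arithmetic, assuming binary units (1 GB = 1024 MB, 1 MB = 1024 KB); the unit convention and the variable names are assumptions, not taken from the slides:

```python
# Totals copied from the statistics table; the 51-day span is from the notes.
TOTAL_REQUESTS = 328_976_877
TOTAL_TRAFFIC_GB = 36_892
DAYS = 51

hours = DAYS * 24
minutes = hours * 60
seconds = minutes * 60

avg_requests = {
    "day": TOTAL_REQUESTS / DAYS,
    "hour": TOTAL_REQUESTS / hours,
    "minute": TOTAL_REQUESTS / minutes,
    "second": TOTAL_REQUESTS / seconds,
}

# Assumption: binary units (1 GB = 1024 MB, 1 MB = 1024 KB).
avg_traffic = {
    "day_gb": TOTAL_TRAFFIC_GB / DAYS,
    "hour_gb": TOTAL_TRAFFIC_GB / hours,
    "minute_mb": TOTAL_TRAFFIC_GB * 1024 / minutes,
    "second_kb": TOTAL_TRAFFIC_GB * 1024 * 1024 / seconds,
}

for unit, value in avg_requests.items():
    print(f"requests per {unit}: {value:.2f}")
for unit, value in avg_traffic.items():
    print(f"traffic {unit}: {value:.2f}")
```

Under these assumptions the computed values agree with the table to the precision shown there, which suggests the averages were taken over the full 51-day window rather than only over hours or minutes that saw traffic.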

The log records only the completion time of each request, and large requests can span hours, so the maximum and minimum traffic per minute and per second are not applicable (N/A).

Cumulative Requests per Hour

[Figure: Sorted Requests per Hour. x-axis: percentage of hours, sorted by request count (0 to 100); y-axis: requests per hour (50000 to 600000).]

Cumulative Requests per Minute

[Figure: Sorted Requests per Minute. x-axis: percentage of minutes, sorted by request count (0 to 100); y-axis: requests per minute (0 to 16000).]

Cumulative Requests per Second

[Figure: Sorted Requests per Second. x-axis: percentage of seconds, sorted by request count (0 to 100); y-axis: requests per second (0 to 450).]

Cumulative Traffic over IPs: 20%-80% law

[Figure: Cumulative Traffic over unique IPs. x-axis: unique IPs, sorted by traffic descending (log scale, 1 to 1e+07); y-axis: cumulative traffic in bytes (0 to 4e+13).]

Cumulative Requests over IPs: 20%-80% law

[Figure: Cumulative Requests over unique IPs. x-axis: unique IPs, sorted by request count descending (log scale, 1 to 1e+07); y-axis: cumulative requests (0 to 3.5e+08).]

IPv4 vs. IPv6

        Requests              Traffic
IPv4    318575688 (96.84%)    34180 GB (92.65%)
IPv6    10401189 (3.16%)      2712 GB (7.35%)

It can be seen that IPv6 still has a long way to go.

Requests & Traffic TOP 40: xxx.0.0.0/8

[Figure: Request & Traffic among IPv4 first fields. x-axis: first octet of the IPv4 address, with all IPv6 traffic as one bar; y-axis: share of total (0 to 0.08). Series: requests, traffic.]

Bars, in extracted order: IPv6 222 202 113 218 58 114 61 183 180 121 124 116 219 210 119 122 59 221 125 123 60 115 117 118 220 211 203 112 111 14 182 110 27 1 120 175 101 159 223

Traffic TOP 40: IPv4 addrs

[Figure: Request & Traffic among popular IPv4 addrs. x-axis: the 40 IPv4 addresses with the most traffic; y-axis: share of total (0 to 0.005). Series: requests, traffic.]

Addresses, in extracted order: 219.133.0.1 218.242.250.212 203.114.244.88 114.212.189.93 114.113.226.53 180.169.73.90 180.96.19.25 66.197.225.53 218.3.125.243 61.234.123.57 159.226.126.177 203.198.202.225 124.74.45.130 220.181.145.27 114.213.255.162 202.108.130.138 202.119.45.31 222.66.23.57 124.127.250.34 208.53.156.36 221.216.135.54 116.228.240.198 113.108.76.195 114.80.133.7 210.13.71.73 180.149.134.10 124.126.245.14 218.94.63.55 220.248.0.145 112.65.134.2 222.56.17.109 124.74.78.2 124.207.104.18 58.211.218.74 1.202.225.132 116.226.65.12 220.248.0.154 222.94.140.45 210.73.5.33 116.247.98.50

Request count TOP 40: IPv4 addrs

[Figure: Request & Traffic among popular IPv4 addrs. x-axis: the 40 IPv4 addresses with the most requests; y-axis: share of total (0 to 0.03). Series: requests, traffic.]

Addresses, in extracted order: 63.245.214.78 209.132.181.102 113.111.38.40 129.143.116.10 211.86.56.227 223.5.20.10 49.123.105.219 159.226.20.217 60.208.111.199 119.97.142.81 202.38.95.60 203.244.218.6 221.219.75.222 202.104.151.152 121.49.96.70 59.77.33.100 182.89.199.227 183.45.54.83 218.13.224.81 59.37.44.133 59.44.42.194 222.134.53.246 210.21.243.170 61.130.247.168 116.228.202.66 123.185.172.126 219.134.89.202 114.113.29.21 183.31.242.39 58.19.126.37 27.17.19.75 203.114.244.88 220.178.52.108 124.42.77.160 113.111.40.89 218.94.63.55 180.153.97.82 222.171.60.177 210.34.196.99 222.92.29.130

USTC Mirrors Usage (IPv4 only)

IP range                          Requests           Traffic              Note
202.38.64.0-202.38.95.255         994661 (0.30%)     149.3 GB (0.40%)     CERNET
210.45.64.0-210.45.79.255         191141 (0.06%)     68.31 GB (0.19%)     CERNET
210.45.112.0-210.45.127.255       243976 (0.07%)     37.59 GB (0.10%)     CERNET
211.86.144.0-211.86.159.255       81035 (0.02%)      24.96 GB (0.07%)     CERNET
222.195.64.0-222.195.95.255       319435 (0.10%)     86.88 GB (0.24%)     CERNET
114.214.160.0-114.214.255.255     0                  0                    CERNET
210.72.22.0-210.72.22.255         3622 (0.00%)       11.86 MB (0.00%)     TechNet (?)
218.22.21.0-218.22.21.31          1 (0.00%)          0.01 MB (0.00%)      China Telecom
218.104.71.160-218.104.71.175     0                  0                    China Unicom
202.141.160.0-202.141.175.255     123455 (0.04%)     12.60 GB (0.03%)     China Telecom
202.141.176.0-202.141.191.255     187 (0.00%)        120.4 MB (0.00%)     China Mobile
Total                             1957513 (0.60%)    379.77 GB (1.03%)    USTC IPv4

Data source of USTC IP range: http://lib.ustc.edu.cn/ustcip.html
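Attributing a logged request to one of the ranges above amounts to an integer range check on the client address. A minimal sketch using Python's standard `ipaddress` module; the function and variable names are hypothetical, and this is not necessarily how the original analysis was done:

```python
import ipaddress

# (first address, last address, note) copied from the table above.
USTC_RANGES = [
    ("202.38.64.0", "202.38.95.255", "CERNET"),
    ("210.45.64.0", "210.45.79.255", "CERNET"),
    ("210.45.112.0", "210.45.127.255", "CERNET"),
    ("211.86.144.0", "211.86.159.255", "CERNET"),
    ("222.195.64.0", "222.195.95.255", "CERNET"),
    ("114.214.160.0", "114.214.255.255", "CERNET"),
    ("210.72.22.0", "210.72.22.255", "TechNet (?)"),
    ("218.22.21.0", "218.22.21.31", "China Telecom"),
    ("218.104.71.160", "218.104.71.175", "China Unicom"),
    ("202.141.160.0", "202.141.175.255", "China Telecom"),
    ("202.141.176.0", "202.141.191.255", "China Mobile"),
]


def _to_int(addr: str) -> int:
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(addr))


# Pre-parse the bounds once so each lookup is a few integer comparisons.
_PARSED = [(_to_int(lo), _to_int(hi), note) for lo, hi, note in USTC_RANGES]


def ustc_network(ip: str):
    """Return the note of the USTC range containing `ip`, or None."""
    value = _to_int(ip)
    for lo, hi, note in _PARSED:
        if lo <= value <= hi:
            return note
    return None


print(ustc_network("202.38.95.60"))   # an address inside the first range
print(ustc_network("8.8.8.8"))        # an address outside every range
```

With only eleven ranges a linear scan is sufficient; for a much longer list, sorting the ranges and using binary search would be the natural next step.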

Requests & Traffic of distributions

[Figure: Request & Traffic of Distributions. x-axis: distribution (top-level directory), apparently ordered by traffic; y-axis: share of total (0 to 0.35). Series: requests, traffic.]

Distributions, in extracted order: eclipse, fedora, ubuntu, centos, debian, tdf, cygwin, CTAN, archlinux, mozilla-current, opensuse, gentoo, kde-applicationdata, kde, backtrack, epel, gnu, CRAN, NULL, freebsd, ubuntu-releases, debian-security, scientificlinux, linux-kernel, debian-backports, kdemod, debian-cd, meego, slackware, deepin, sourceware.org, CPAN, linuxmint, linux-2.6.git, puppy, linux.git, debian-multimedia, 4, qomo, 3.7

Requests & Traffic of distributions

[Figure: Request & Traffic of Distributions. x-axis: distribution (top-level directory), apparently ordered by request count; y-axis: share of total (0 to 0.35). Series: requests, traffic.]

Distributions, in extracted order: centos, NULL, eclipse, fedora, ubuntu, ubuntu-releases, mozilla-current, tdf, debian, CTAN, backtrack, opensuse, gentoo, deepin-cd, kde-applicationdata, debian-cd, cygwin, Ubuntu, linuxmint, archlinux, kde, gnu, linuxmint-cd, CRAN, debian-multimedia, qomo, debian-security, epel, puppy, scientificlinux, debian-backports, freebsd, deepin, CPAN, kdemod, turnkeylinux, slackware, debian-uo, gentoo-portage, linux-kernel

Requests & Traffic among HTTP Status Codes

[Figure: Request & Traffic among HTTP status codes. x-axis: HTTP status code (200, 206, 301, 304, 400, 403, 404, 405, 408, 416, 499, 500, 502); y-axis: share of total (0 to 1). Series: requests, traffic.]

Requests & Traffic among User Agents

[Figure: Request & Traffic of User Agents. x-axis: user agent (only the largest 30 are shown); y-axis: share of total (0 to 0.35). Series: requests, traffic.]

User agents, in extracted order: Mozilla, urlgrabber, Debian, Wget, Jakarta, NULL, Fedora, Preupgrade, Ubuntu, pacman, libwww, texlive, Cygwin, jigdo, Opera, ZYpp, Axel, CentOS, MPM, FDM, lftp, NSIS, anaconda, curl, Eclipse, aria2, Python, BTWebClient, R

Traffic per Request order by Length

[Figure: Request Num sorted by Request Length. x-axis: request number, sorted by request length (up to 3.5e+08); y-axis: request length in bytes (log scale, 1 to 1e+10).]

Cumulative Traffic sorted by Request Length

[Figure: Cumulative Request Traffic sorted by Request Length. x-axis: request length in bytes (log scale, 100 to 1e+10); y-axis: cumulative traffic in bytes (0 to 4e+13).]

Request Num sorted by Request Length

[Figure: Request Num sorted by Request Length. x-axis: request length in bytes (log scale, 1 to 1e+09); y-axis: request number (0 to 1.2e+08).]

Outline

1. Requests & Traffic: By Time; By IP; By Other Measures
2. Files: Files Characteristics; How Files Are Requested
3. Sessions
4. Distributions Insight: CentOS; Fedora; Ubuntu; Eclipse
5. Technical Details
6. Query Optimization

File Number at different FileSizes (log scale)

[Figure: Sorted Filenum by Filesize. x-axis: file size in bytes (log scale, 1 to 1e+10); y-axis: number of files (0 to 25000).]

Total Size at different Filesizes (log scale)

[Figure: Total Size at different Filesizes. x-axis: file size in bytes (ticks from 100 to 1e+10, i.e. log scale, although the graph title says normal scale); y-axis: file size × number of files (0 to 2e+10).]

Cumulative Filesize per File Num (normal scale)

[Figure: Accumulated FileSize by FileNum sorted by Size (normal scale). x-axis: number of files, sorted by size (0 to 1.1e+07); y-axis: accumulated file size in bytes (0 to 1.4e+13).]

Cumulative Filesize per File Num (log scale)

[Figure: Accumulated FileSize by FileNum sorted by Size (log scale y). x-axis: number of files, sorted by size (0 to 1.1e+07); y-axis: accumulated file size in bytes (log scale, 1e+07 to 1e+14).]

Cumulative Filesize per FileSize (normal scale)

[Figure: Accumulated Filesize by Filesize. x-axis: file size in bytes (0 to 7e+09); y-axis: accumulated file size in bytes (0 to 1.4e+13).]

Cumulative Filesize per FileSize (log scale)

[Figure: Accumulated Filesize by Filesize. x-axis: file size in bytes (log scale, 1 to 1e+10); y-axis: accumulated file size in bytes (0 to 1.4e+13).]

File Extensions TOP 40 (order by Total Size)

[Figure: FileNum and Filesize of popular file extensions. x-axis: file extension, ordered by total size; y-axis: share of total (0 to 0.4). Series: total number, total size.]

Extensions, in extracted order: tbz rpm iso deb gz bz2 xz zip drpm img tgz mar jar png ogg exe txz dmg tar pet sfs run 7z udeb lzma pkg ogv pdf xpi tbz2 cb m4v msi 1 log 2 jpg 92 3 wz

File Extensions TOP 40 (order by File Num)

[Figure: FileNum and Filesize of popular file extensions. x-axis: file extension, ordered by number of files; y-axis: share of total (0 to 0.4). Series: total number, total size.]

Extensions, in extracted order: tbz rpm deb gz png drpm readme bz2 jar dsc hdr xz zip tgz meta c,v sig log tfm txt ebuild jpg asc xml changes udeb md5 h,v xpi pdf tex patch html sign in,v news ltx sha1 txz 0

How many files share the same filename (extension excluded)

[Figure: FileNum and FileSize of Filenames with Different Number of Shares. x-axis: number of files sharing the same filename (log scale, 1 to 1e+06); y-axis: share of total (log scale, 0.0001 to 1). Series: total number, total size.]

How many files share the same filename and extension

[Figure: FileNum and FileSize of Filenames with Different Number of Shares. x-axis: number of files sharing the same filename and extension (log scale, 1 to 1e+06); y-axis: share of total (log scale, 0.0001 to 1). Series: total number, total size.]

Num & Size of Files in each Distribution (sorted by Size)

[Figure: Num & Size of Files in each Distribution. x-axis: distribution, sorted by total size; y-axis: share of total (0 to 0.45). Series: number, size.]

Distributions, in extracted order: freebsd, fedora, debian-cd, scientificlinux, debian, ubuntu, opensuse, meego, gnome, gentoo, backtrack, centos, epel, eclipse, slackware, ubuntu-releases, kde, linux-kernel, kde-applicationdata, linuxmint-cd, archlinux, mozilla-current, bin, deepin-cd, turnkeylinux, qomo, kdemod, CTAN, progress-linux, puppy, gnu, debian-backports, knoppix-dvd, debian-security, tdf, src, cygwin, loongson2f, deepin, linuxmint

Num & Size of Files in each Distribution (sorted by Num)

[Figure: Num & Size of Files in each Distribution. x-axis: distribution, sorted by number of files; y-axis: share of total (0 to 0.45). Series: number, size.]

Distributions, in extracted order: freebsd, fedora, kde-applicationdata, debian, ubuntu, opensuse, modules, eclipse, CTAN, authors, epel, gnome, gentoo-portage, slackware, backtrack, meego, scientificlinux, bin, centos, gentoo, archlinux, kde, web, src, qomo, debian-backports, macports, Xorg, linux-kernel, cygwin, gnu, mozilla-current, progress-linux, debian-security, debian-cd, kdemod, linuxmint, debian-multimedia, puppy, tdf

Cumulative Requests over FileSize order by Size

[Figure: Cumulative Requests Num per File order by Filesize. x-axis: labeled FileSize (0 to 1.2e+07), files ordered by file size; y-axis: accumulated requests count (0 to 2e+08).]

Cumulative Traffic over non-cumu. FileSize

[Figure: Cumulative Traffic over non-cumulated Filesize. x-axis: labeled FileSize (0 to 1.1e+07); y-axis: accumulated traffic in bytes (0 to 2.5e+13).]

Cumulative Traffic over non-cumu. Filesize (log-scale)

[Figure: Cumulative Traffic over non-cumulated FileSize. x-axis: labeled FileSize (0 to 1.1e+07); y-axis: cumulated traffic in bytes (log scale, 1e+09 to 1e+14).]

Cache 1: Cumu. Traffic over FileSize order by Size

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by file size (0 to 1.4e+13); y-axis: cumulative traffic in bytes (0 to 2.5e+13).]

Cache 2: Cumu. Traffic over FileSize order by Size DESC

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by file size descending (0 to 1.4e+13); y-axis: cumulative traffic in bytes (0 to 2.5e+13).]

Cache 3: Cumu. Traffic over FileSize order by Req Num

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by request count descending (0 to 1.4e+13); y-axis: cumulative traffic in bytes (0 to 2.5e+13).]

Cache 4: Cumu. Traffic over FileSize order by Traffic

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by the traffic of each file descending (0 to 1.4e+13); y-axis: cumulative traffic in bytes (6e+12 to 2.6e+13).]

Cache 5: Cumu. Traffic order by Traffic/FileSize

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by traffic/filesize descending (0 to 1.4e+13); y-axis: cumulative traffic in bytes (0 to 2.5e+13).]

Comparison of the Previous five ‘Caching Policies’

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by different metrics (0 to 1.4e+13); y-axis: cumulative traffic in bytes (0 to 2.5e+13). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC, Traffic/FileSize DESC.]

Comparison of ‘Caching Policies’ (log-scale)

[Figure: Cumulative Traffic over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by different metrics (log scale, 100 to 1e+14); y-axis: cumulative traffic in bytes (0 to 2.5e+13). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC, Traffic/FileSize DESC.]

Comparison of ‘Caching Policies’

Among these static caching policies, caching the files with the most traffic or the most requests is acceptable.
Caching the files that historically carried the most traffic performs well: a 10 GB cache of 85 files covers 40% of the total traffic. As the cache size continues to increase, cache efficiency deteriorates; keep in mind that the x axis of the graph is in log scale.
Caching the files with the largest traffic/filesize ratio performs best (by definition): a 10 GB cache of 7162 files covers 58% of the total traffic.
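The comparison above can be reproduced on any per-file table of size and served traffic: sort the files by a metric, fill the cache in that order, and read off the share of traffic covered. A minimal sketch with made-up toy numbers; `FileStat`, `coverage`, and the sample files are hypothetical, and stopping at the first file that does not fit is an assumption chosen to mirror the cumulative curves:

```python
from typing import Callable, List, NamedTuple, Tuple


class FileStat(NamedTuple):
    name: str
    size: int      # bytes on disk
    traffic: int   # bytes served during the logged period


def coverage(files: List[FileStat], cache_bytes: int,
             key: Callable[[FileStat], float]) -> Tuple[int, float]:
    """Fill a static cache in descending `key` order.

    Returns (number of files cached, fraction of total traffic covered).
    Filling stops at the first file that does not fit, which follows the
    cumulative-filesize curve instead of skipping ahead to smaller files.
    """
    total = sum(f.traffic for f in files)
    used = covered = count = 0
    for f in sorted(files, key=key, reverse=True):
        if used + f.size > cache_bytes:
            break
        used += f.size
        covered += f.traffic
        count += 1
    fraction = covered / total if total else 0.0
    return count, fraction


# Toy data, invented for illustration only.
files = [
    FileStat("big.iso", 700, 2100),
    FileStat("index.html", 1, 50),
    FileStat("pkg.rpm", 10, 300),
    FileStat("rare.tar", 500, 10),
]

by_traffic = lambda f: f.traffic
by_ratio = lambda f: f.traffic / f.size

for label, key in (("traffic", by_traffic), ("traffic/size", by_ratio)):
    count, fraction = coverage(files, cache_bytes=20, key=key)
    print(f"{label}: {count} files, {fraction:.1%} of traffic")
```

In the toy data the traffic-ordered policy cannot cache anything in 20 bytes because its first candidate is the large ISO, while the ratio-ordered policy fits the two small, hot files. This is the same effect as the 85-file versus 7162-file caches quoted above.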

Details of Most Traffic Caching (log-scale)

[Figure: Cumulative Ratio over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by the traffic of each file descending (log scale, 1e+08 to 1e+14); y-axis: cumulative ratio (0 to 1). Series: cumulative traffic (%), cumulative requests (%), cumulative file number (%).]

Details of Traffic/FileSize Caching (log-scale)

[Figure: Cumulative Ratio over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by traffic/filesize descending (log scale, 1000 to 1e+14); y-axis: cumulative ratio (0 to 1). Series: cumulative traffic (%), cumulative requests (%), cumulative file number (%).]

An Alternate Metric: Request Hits

[Figure: Cumulative Requests over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by different metrics (0 to 1.4e+13); y-axis: cumulative requests (0 to 2e+08). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC.]

An Alternate Metric: Request Hits

[Figure: Cumulative Requests over Cumulative FileSize. x-axis: cumulative file size in bytes, files ordered by different metrics (log scale, 100 to 1e+14); y-axis: cumulative requests (0 to 2e+08). Series: FileSize, FileSize DESC, Requests Num DESC, Traffic DESC.]

Num & Size of Never-accessed Files (% of total)

[Figure: Num & Size of Never-Accessed Files in each Distribution. x-axis: distribution, apparently ordered by number of never-accessed files; y-axis: share of total (0 to 0.45). Series: number, size.]

Distributions, in extracted order: freebsd, fedora, kde-applicationdata, debian, ubuntu, modules, opensuse, eclipse, authors, epel, gnome, gentoo-portage, CTAN, slackware, meego, backtrack, scientificlinux, bin, gentoo, archlinux, kde, web, src, qomo, macports, centos, debian-backports, Xorg, linux-kernel, gnu, mozilla-current, progress-linux, debian-cd, debian-security, cygwin, kdemod, linuxmint, debian-multimedia, tdf, loongson2f

Num & Size of Never-accessed Files (% of total)

[Figure: Num & Size of Never-Accessed Files in each Distribution. x-axis: distribution, apparently ordered by size of never-accessed files; y-axis: share of total (0 to 0.45). Series: number, size.]

Distributions, in extracted order: freebsd, fedora, debian-cd, scientificlinux, debian, opensuse, meego, ubuntu, gnome, gentoo, epel, slackware, backtrack, linux-kernel, kde, eclipse, kde-applicationdata, ubuntu-releases, mozilla-current, bin, archlinux, centos, linuxmint-cd, turnkeylinux, qomo, kdemod, progress-linux, deepin-cd, knoppix-dvd, CTAN, debian-backports, gnu, src, tdf, puppy, loongson2f, debian-security, linuxmint, authors, deepin

Num & Size of Never-accessed Files (% of distribution)

[Figure: Num & Size of Never-Accessed Files in each Distribution. x-axis: distribution; y-axis: share of the distribution (0.55 to 1). Series: number, size.]

Distributions, in extracted order (one further label is unreadable in the source): clpa, html, modules, authors, bin, src, web, mozilla-current, doc, qomo, scripts, freebsd, meego, knoppix-dvd, ports, contrib, kde-applicationdata, linux-kernel, scientificlinux, knoppix, debian-cd, tdf, gnome, Xorg, slackware, turnkeylinux, backtrack, debian-volatile, eclipse, epel, dotdeb, progress-linux, macports, gentoo-portage, misc, indices, kde, gentoo, debian-multimedia

Num & Size of Never-accessed Files (% of distribution)

[Figure: Num & Size of Never-Accessed Files in each Distribution. x-axis: distribution; y-axis: share of the distribution (0.75 to 1). Series: number, size.]

Distributions, in extracted order (one further label is unreadable in the source): knoppix-dvd, mozilla-current, html, contrib, bin, src, modules, indices, doc, clpa, authors, ports, dotdeb, web, debian-volatile, meego, freebsd, slackware, scripts, misc, Xorg, gnome, epel, progress-linux, qomo, scientificlinux, knoppix, debian-cd, turnkeylinux, debian-multimedia, gentoo-portage, linuxmint, loongson2f, kdemod, linux-kernel, deepin, tdf, kde-applicationdata, kde

Num & Size of Ever-accessed Files (% of distribution)

[Figure: Num & Size of Ever-Accessed Files in each Distribution. x-axis: distribution; y-axis: share of the distribution (0 to 1). Series: number, size.]

Distributions, in extracted order: fink, CRAN, CPAN, help, cygwin, centos, puppy, ubuntu, deepin-cd, mirmon, debian, ubuntu-releases, gnu, linuxmint-cd, CTAN, debian-security, deepin, uksm-kernel, fedora, kdemod, opensuse, debian-backports, loongson2f, linuxmint, archlinux, debian-multimedia, gentoo, kde, indices, misc, gentoo-portage, macports, progress-linux, dotdeb, epel, eclipse, debian-volatile, backtrack, turnkeylinux, slackware

Num & Size of Ever-accessed Files (% of distribution)

[Figure: Num & Size of Ever-Accessed Files in each Distribution. x-axis: distribution; y-axis: share of the distribution (0 to 0.7). Series: number, size.]

Distributions, in extracted order: centos, cygwin, deepin-cd, puppy, CTAN, eclipse, ubuntu, mirmon, linuxmint-cd, macports, ubuntu-releases, gnu, debian-security, debian, fedora, archlinux, backtrack, opensuse, debian-backports, uksm-kernel, gentoo, kde, kde-applicationdata, tdf, deepin, linux-kernel, kdemod, loongson2f, linuxmint, gentoo-portage, debian-multimedia, turnkeylinux, debian-cd, knoppix, scientificlinux, qomo, progress-linux, epel, gnome, Xorg

Outline

1. Requests & Traffic: By Time; By IP; By Other Measures
2. Files: Files Characteristics; How Files Are Requested
3. Sessions
4. Distributions Insight: CentOS; Fedora; Ubuntu; Eclipse
5. Technical Details
6. Query Optimization

Discovering Sessions

Definition. Two requests are within the same Interval Session iff:
- they have the same IP address;
- their time difference does not exceed some limit.

Definition. A Gap Session is a longest sequence of requests where:
- all requests are from the same IP address;
- the time difference between every two adjacent requests does not exceed some limit.

Discovering Sessions (continued)

Since Mirrors is a resource-downloading site, many sessions download many files over several hours.
The time limit of an Interval Session is 60 minutes. The time limit of a Gap Session is 30 minutes.
For long downloads that extend over 30 minutes, the session will be broken into two.
Both algorithms suffer from false positives and false negatives.
Some IPs access the site so frequently that their Gap Session never ends (see the next slide), making the Gap Session data highly biased.
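The two definitions can be sketched as one routine that differs only in the reference point used for the time comparison. This is a hedged illustration, not the author's implementation: it assumes that an Interval Session measures the limit from the first request of the session, that the input is a list of (IP, completion timestamp in seconds) pairs, and that timestamps are sorted per IP before splitting; `split_sessions` and the sample data are hypothetical.

```python
from collections import defaultdict

GAP_LIMIT = 30 * 60        # seconds; Gap Session limit stated in the text
INTERVAL_LIMIT = 60 * 60   # seconds; Interval Session limit stated in the text


def split_sessions(requests, limit, from_start):
    """Group (ip, timestamp) pairs into sessions.

    from_start=False: Gap Session, compare with the previous request.
    from_start=True:  Interval Session, compare with the first request
                      of the current session (an assumed reading).
    Returns a list of (ip, [timestamps]) tuples.
    """
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    sessions = []
    for ip, stamps in by_ip.items():
        stamps.sort()  # assumption: order each IP's requests by time first
        current = [stamps[0]]
        for ts in stamps[1:]:
            reference = current[0] if from_start else current[-1]
            if ts - reference <= limit:
                current.append(ts)
            else:
                sessions.append((ip, current))
                current = [ts]
        sessions.append((ip, current))
    return sessions


# Toy log, invented for illustration: IP "a" sends a request every 25 minutes.
log = [("a", 0), ("a", 1500), ("a", 3000), ("a", 4500), ("b", 10)]

gap = split_sessions(log, GAP_LIMIT, from_start=False)
interval = split_sessions(log, INTERVAL_LIMIT, from_start=True)
print(len(gap), "gap sessions;", len(interval), "interval sessions")
```

In the toy log, IP "a" never pauses for more than 30 minutes, so its Gap Session keeps growing, whereas the Interval Session is cut once 60 minutes have passed since its first request. This is the never-ending Gap Session effect described above, in miniature.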

Average Session Duration over IP (log-scale)

[Figure: Average Duration over Unique IPs. x-axis: unique IPs, sorted by duration descending (log scale, 1 to 1e+07); y-axis: average duration in seconds (log scale, 0.001 to 1e+07). Series: Gap Session, Interval Session.]

Average Session Duration over IP (normal-scale)

[Figure: Average Duration over Unique IPs. x-axis: unique IPs, sorted by duration descending (linear scale, 0 to 1.6e+06); y-axis: average duration in seconds (log scale, 0.001 to 1e+07). Series: Gap Session, Interval Session.]

Gap Session Statistics in a day

[Figure: Session statistics at different Time in a day (Gap Sessions). x-axis: time in a day (00:00 to 24:00); y-axis: ratio of total (0 to 0.0016). Series: session count, request count, duration, traffic, all Bezier smoothed.]

Explanation

How strange the three graphs look!
Durations are always integers (seconds), so there are straight lines in the duration graphs.
Many sessions are actually ‘single’ requests, so their duration is zero; their number can be seen in the normal-scale graph.
About 40 sessions extend throughout the whole 51 days, and more sessions extend over fewer days. Their number is the length of the red horizontal line in the log-scale graph.
Because the log starts at May 22 06:25:37, requests in these long-lived Gap Sessions accumulate to a horrible peak at 06:27 in the Gap Session statistics (much higher than shown, since the original data is noisy and the graph is Bezier-smoothed).

Cumulative Over Long-live Gap Sessions

[Figure: Cumulative Ratio of Requests, Duration and Traffic over Sessions. x-axis: sessions ordered by duration descending (0 to 10000); y-axis: cumulative ratio of total (0 to 0.35). Series: request count, duration, traffic.]

Ratio to Average for Long-live Gap Sessions (log-scale)

[Figure: Requests, Duration and Traffic over Sessions. x-axis: sessions ordered by duration descending (log scale, 1 to 10000); y-axis: ratio to average (log scale, 0.0001 to 1e+07). Series: request count, duration, traffic.]

Statistics of Interval Sessions

            Total            Average      Max         Min     Std. Dev
Requests    328976877        7.5020       186043      1       173.1824
Traffic     36277 GB         867.46 KB    Overflow    0       15.227 MB
Duration    4.026 × 10^10    918.13 s     3599 s      -6 s    1502.9 s

43852053 sessions in total. Sessions with negative duration occur because log entries are occasionally not in non-decreasing time order. I’m not going to fix it, since these 230 wrong sessions have little influence.

Statistics of Gap Sessions

            Total           Average      Max         Min      Std. Dev
Requests    328976877       6.6200       8432330     1        1255.9488
Traffic     35291 GB        744.67 KB    Overflow    0        15.184 MB
Duration    7.666 × 10^9    154.27 s     4406403     -27 s    7060.6 s

49694582 sessions in total. These figures are given for comparison with Interval Sessions. The following statistics are based on Interval Sessions unless otherwise noted.

Cumulatives over Sessions order by Traffic DESC

[Figure: Cumulative stat over Sessions order by Traffic DESC. x-axis: number of sessions, ordered by traffic descending (0 to 4.5e+07); y-axis: cumulative share (0 to 1). Series: requests count, duration, traffic.]

Cumulatives over Sessions order by Traffic DESC

[Figure: Cumulative stat over Sessions order by Traffic DESC. x-axis: number of sessions, ordered by traffic descending (log scale, 100 to 1e+08); y-axis: cumulative share (0 to 1). Series: requests count, duration, traffic.]

Cumulatives over Sessions order by Requests DESC

[Figure: Cumulative stat over Sessions order by Requests DESC. x-axis: number of sessions, ordered by requests descending (log scale, 1 to 1e+08); y-axis: cumulative share (0 to 1). Series: requests count, duration, traffic.]

Cumulatives over Sessions order by Duration DESC

[Figure: Cumulative stat over Sessions order by duration DESC. x-axis: number of sessions, ordered by duration descending (log scale, 1 to 1e+08); y-axis: cumulative share (0 to 1). Series: requests count, duration, traffic.]

Cumulative Session Duration over Unique IPs

[Figure: Cumulative Interval Session Duration over Unique IPs. x-axis: unique IPs, sorted by duration descending (log scale, 1 to 1e+07); y-axis: cumulative duration in seconds (0 to 4e+10).]

Sessions in a day

[Figure: Session statistics at different Time in a day. x-axis: time in a day (00:00 to 24:00); y-axis: ratio of total (0 to 0.0016). Series: session count, request count, duration, traffic, all Bezier smoothed.]

Sessions across days

[Figure: Session statistics across days. x-axis: date (05-20 to 07-15); y-axis: ratio of total (0 to 0.03). Series: session count, request count, duration, traffic.]

Sessions in Distributions order by Traffic

[Figure: Sessions in each Distribution (order by Traffic). x-axis: distribution; y-axis: share of total (0 to 0.6). Series: number, requests, duration, traffic.]

Distributions, in plotted order: eclipse, fedora, centos, ubuntu, debian, tdf, cygwin, CTAN, archlinux, mozilla-current, opensuse, gentoo, kde-applicationdata, Ubuntu, kde, backtrack, ubuntu-releases, gnu, CRAN, NULL

Sessions in Distributions order by Requests

[Figure: Sessions in each Distribution (order by Requests). x-axis: distribution; y-axis: share of total (0 to 0.6). Series: number, requests, duration, traffic.]

Distributions, in plotted order: centos, NULL, eclipse, fedora, ubuntu, ubuntu-releases, mozilla-current, tdf, debian, CTAN, backtrack, opensuse, Ubuntu, deepin-cd, kde-applicationdata, linuxmint, debian-cd, cygwin, archlinux, kde

Sessions in Distributions order by Session Num

[Figure: Sessions in each Distribution (order by Session Num). x-axis: distribution; y-axis: share of total (0 to 0.6). Series: number, requests, duration, traffic.]

Distributions, in plotted order: NULL, centos, fedora, eclipse, ubuntu, mozilla-current, debian, ubuntu-releases, kde-applicationdata, opensuse, epel, tdf, archlinux, CTAN, cygwin, CRAN, backtrack, deepin-cd, gnu, kde

Sessions in Distributions order by Duration

[Figure: Sessions in each Distribution (order by Duration). x-axis: distribution; y-axis: share of total (0 to 0.6). Series: number, requests, duration, traffic.]

Distributions, in plotted order: centos, NULL, fedora, eclipse, ubuntu, mozilla-current, debian, ubuntu-releases, epel, tdf, archlinux, opensuse, CTAN, cygwin, kde-applicationdata, backtrack, gnu, deepin-cd, kde, gentoo

Sessions and User Agents

[Figure: Sessions in each User Agent (order by Session Num DESC). x-axis: user agent; y-axis: share of total (0 to 0.6). Series: number, requests, duration, traffic.]

User agents, in plotted order: NULL, urlgrabber, Mozilla, Jakarta, Debian, Ubuntu, ZYpp, Java, pacman, Wget, Python, Eclipse, curl, libwww, Homebrew, Opera, Fedora, Preupgrade, MirrorBrain, Sosospider

...... 李博杰 [email protected] Visualizing Mirrors . Outline

1. Requests & Traffic: By Time; By IP; By Other Measures
2. Files: Files Characteristics; How Files Are Requested
3. Sessions
4. Distributions Insight: CentOS; Fedora; Ubuntu; Eclipse
5. Technical Details
6. Query Optimization

CentOS Basic Stat

            Value        % of total   Rank
Requests    100986986    30.7%        1
Traffic     5252.5 GB    14.2%        4
Files       70043        0.7%         19
FileSize    172.8 GB     1.3%         12
Sessions    19452260     44.4%        1

CentOS Basic Stat - Average

                   Average     Ratio    Std.Dev       Ratio
Request Length     55846       0.464    882920        0.289
File Size          2665431     1.953    54320911      2.294
File Requests      1402.46     76.994   92327         11.029
File Traffic       74454404    32.12    1231957014    1.784
Session Duration   1051.19     1.145    1586.69       1.056
Session Requests   5.796       0.773    137.5         0.794
Session Traffic    336791      0.379    8931320       0.559

Std.Dev stands for standard deviation. 'Ratio' in the third column is the ratio of the distribution average to the global average; 'Ratio' in the fifth column is the ratio of the distribution Std.Dev to the global Std.Dev. Size and Traffic are in bytes; Duration is in seconds.
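The 'Ratio' columns can be sanity-checked with a few lines of arithmetic. The sketch below is not part of the original scripts. It uses only the Request Length rows reported for CentOS, Fedora, Ubuntu and Eclipse in this section, and it assumes that "GB" on the slides means 2^30 bytes (an assumption that happens to reproduce the CentOS average).

```python
GIB = 2 ** 30  # assumption: "GB" on the slides means 2^30 bytes

# CentOS: total traffic divided by total requests should give the
# average request length reported in the table (55846 bytes).
centos_avg_request = 5252.5 * GIB / 100986986

# (average request length in bytes, ratio to global average), from the slides
reported = {
    "centos": (55846, 0.464),
    "fedora": (221945, 1.843),
    "ubuntu": (316251, 2.626),
    "eclipse": (240769, 2.000),
}

# Average divided by ratio recovers the global average implied by each row.
implied_global = {name: avg / ratio for name, (avg, ratio) in reported.items()}

if __name__ == "__main__":
    print(round(centos_avg_request))
    for name, value in implied_global.items():
        print(name, round(value))
```

All four rows imply a global average request length of roughly 120 KB, which suggests the ratios were computed against one common global figure.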

Requests & Traffic among CentOS Versions

[Figure: Request & Traffic of Subdirectories in centos. Series: Requests, Traffic. y-axis: Percentage, 0 to 0.7. x-axis: top-level subdirectories, led by 5.8, 6.2, 5, 6.3, 6; the remaining labels (older releases and stray paths such as build, dostools, graphics, HEADER.images, %24releasever) are garbled in extraction.]

Requests & Traffic among CentOS 2-level Subdirs

[Figure: Request & Traffic of Subdirectories in centos (two levels). Series: Requests, Traffic. y-axis: Percentage, 0 to 0.3. x-axis, leading entries in order: 5.8/updates, 6.2/updates, 5.8/os, 6.2/os, 5/updates, 6.3/os, 5/os, 6/os, 6/updates, 5.8/extras, followed by 30 less-requested entries.]

Fedora Basic Stat

            Value        % of total   Rank
Requests    36329528     11.0%        4
Traffic     7509.2 GB    20.3%        2
Files       924620       8.8%         2
FileSize    1207.8 GB    9.4%         2
Sessions    3596727      8.2%         2

Fedora Basic Stat - Average

                   Average     Ratio    Std.Dev      Ratio
Request Length     221945      1.843    2277229      0.744
File Size          1407150     1.031    21376814     0.903
File Requests      28.3134     1.554    6563         0.784
File Traffic       3601714     1.554    190851795    0.276
Session Duration   1226        1.336    1598         1.063
Session Requests   10.8047     1.440    251.43       1.452
Session Traffic    2238126     2.520    21424912     1.342

Requests & Traffic among Fedora Subdirs

[Figure: Request & Traffic of Subdirectories in fedora. Series: Requests, Traffic. y-axis: Percentage, 0 to 0.8. x-axis: top-level subdirectories, led by linux, epel, rpmfusion and web; the remaining labels are garbled in extraction and include malformed request paths (e.g. huangou.php?id=1, phpcms).]

Requests & Traffic among Fedora 2-level Subdirs

[Figure: Request & Traffic of Subdirectories in fedora (two levels). Series: Requests, Traffic. y-axis: Percentage, 0 to 0.5. x-axis, leading entries in order: linux/updates, epel/5, linux/development, linux/releases, epel/6, rpmfusion/free, rpmfusion/nonfree, epel/5Server, epel/testing, followed by NULL and rarely requested paths.]

Ubuntu Basic Stat

            Value        % of total   Rank
Requests    19193451     5.8%         5
Traffic     5653.1 GB    15.3%        3
Files       608361       5.8%         5
FileSize    584.6 GB     4.6%         6
Sessions    160800       0.4%         4

Ubuntu Basic Stat - Average

                   Average     Ratio     Std.Dev      Ratio
Request Length     316251      2.626     2950601      0.964
File Size          1100038     0.806     8926254      0.377
File Requests      22.2714     1.222     860.6907     0.103
File Traffic       8152155     3.517     551019922    0.798
Session Duration   636.37      0.693     1053.94      0.701
Session Requests   110.93      14.788    294.68       1.702
Session Traffic    34260641    38.570    101247464    6.341

Requests & Traffic among Ubuntu Subdirs

[Figure: Request & Traffic of Subdirectories in ubuntu. Series: Requests, Traffic. y-axis: Percentage, 0 to 0.8. x-axis: top-level subdirectories, led by pool and dists; the remaining labels (e.g. intrepid-security, natty, lucid-updates, and URL-encoded fragments such as %20gutsy%20main) are garbled in extraction.]

Requests & Traffic among Ubuntu 2-level Subdirs

[Figure: Request & Traffic of Subdirectories in ubuntu (two levels). Series: Requests, Traffic. y-axis: Percentage, 0 to 0.6. x-axis, leading entries in order: pool/main, pool/universe, dists/precise, dists/lucid, dists/oneiric, dists/natty, pool/restricted, pool/multiverse, dists/hardy, dists/maverick, followed by the per-release updates, security, proposed and backports directories.]

Eclipse Basic Stat

            Value         % of total   Rank
Requests    51279473      15.6%        3
Traffic     11498.6 GB    31.2%        1
Files       263355        2.5%         8
FileSize    161.1 GB      1.3%         14
Sessions    524586        1.2%         3

Eclipse Basic Stat - Average

                   Average     Ratio     Std.Dev       Ratio
Request Length     240769      2.000     5813562       1.901
File Size          700002      0.513     8457323       0.357
File Requests      113.5340    6.233     17894.2       2.137
File Traffic       28156070    12.148    4197033259    6.079
Session Duration   673.17      0.733     1059.08       0.704
Session Requests   97.1089     12.944    782.80        4.520
Session Traffic    22130605    24.914    62746610      3.930

Requests & Traffic among Eclipse Subdirs

[Figure: Request & Traffic of Subdirectories in eclipse. Series: Requests, Traffic. y-axis: Percentage, 0 to 0.6. x-axis, leading entries in order: technology, eclipse, releases, tools, mylyn, webtools, arch, birt, mat, modeling, followed by about 30 less-requested projects.]

Requests & Traffic among Eclipse 2-level Subdirs

[Figure: Request & Traffic of Subdirectories in eclipse (two levels). Series: Requests, Traffic. y-axis: Percentage, 0 to 0.6. x-axis, leading entries in order: technology/epp, eclipse/downloads, releases/indigo, eclipse/updates, releases/juno, tools/cdt, mylyn/drops, webtools/downloads, NULL, releases/helios, followed by about 30 less-requested entries.]

Data Source

- Nginx access log of mirrors.ustc.edu.cn
  - From 2012-05-22 to 2012-07-12, 51 days
  - 4041 MB compressed, 62383 MB decompressed
  - Thanks to Guo JiaHua for providing the data
- File list of mirrors.ustc.edu.cn, crawled by a spider
  - FTP for CPAN and CRAN
  - HTTP for other directories
  - Symlinks to a parent directory need to be detected (e.g. /ubuntu/ubuntu/…)
- Scripts are written in bash and PHP. All scripts are available on GitHub: https://github.com/bojieli/mirrors-log
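One simple way to catch the /ubuntu/ubuntu/… case while crawling is to flag any path in which a component repeats its immediate parent. The sketch below is a hypothetical heuristic, not the check used in the original bash and PHP scripts, and the function name `has_parent_loop` is made up for illustration.

```python
def has_parent_loop(path):
    """Return True if some path component equals its immediate parent.

    Hypothetical heuristic: a crawler over HTTP cannot see symlinks, so a
    'ubuntu -> .' link shows up as /ubuntu/ubuntu/ubuntu/... and would
    otherwise be followed forever.
    """
    parts = [p for p in path.split("/") if p]
    return any(a == b for a, b in zip(parts, parts[1:]))


if __name__ == "__main__":
    for p in ("/ubuntu/ubuntu/dists/", "/ubuntu/dists/precise/"):
        print(p, has_parent_loop(p))
```

A directory that legitimately contains a child with the same name would be skipped by this rule, so a real crawler may prefer to compare directory listings instead.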

Saving Data in BRIGHTHOUSE

- Scanning through this much data is time-consuming, yet the data is not too large to fit in a relational database.
- I tried InfoBright and InfiniDB with 6.4 × 10^6 rows of artificial data: InfiniDB is faster in queries, while InfoBright takes much less disk space.
- InfoBright's compression is comparable to gzip: the table takes 4316 MB, compared to 4041 MB gzipped, even though I added some extra columns for faster statistics.
- I do not have much disk space, so I chose InfoBright (a MySQL backend).
- Most of the queries I used (mostly GROUP BY and WHERE) take less than 2 minutes.
- Create two FIFOs for each log: zcat logfile > php (preprocessing) > mysql-ib LOAD DATA INFILE

Preprocessing

Goal: make queries faster (I am afraid of full-table scans).

- ip => integers: ipv4, ipv4_0, ipv4_1, ipv4_2, ipv4_3, ipv6_0, ipv6_1, ipv6_2, ipv6_3
- time => integers: time, year, yearday, weekday, daymin, daytime, hour
- status (200, 403, 404…)
- length (filesize)
- url => substrings: url_0 … url_9, filename, extension
- referer (I do not want to analyze it, since Mirrors is not a site of user interaction)
- ua => ua_0 (Mozilla, Ubuntu…), ua (full)
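The column split above can be sketched as follows. This is an illustration in Python rather than the original PHP, and several details are assumptions: the log is taken to be in nginx's "combined" format, `daymin` is read as minutes since midnight, `url_0 … url_9` as path components, and `ua_0` as the first token of the User-Agent. The IPv6 and `daytime` columns are left out.

```python
import re
from datetime import datetime

# Assumption: nginx "combined" log format; the slides do not show the real one.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<length>\d+) '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def preprocess(line):
    """Split one access-log line into integer and substring columns."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # malformed line; the real scripts may treat it differently
    row = {"status": int(m.group("status")), "length": int(m.group("length"))}

    # ip => integers (IPv4 only in this sketch; IPv6 rows get ipv4 = None)
    octets = m.group("ip").split(".")
    if len(octets) == 4 and all(o.isdigit() for o in octets):
        nums = [int(o) for o in octets]
        row["ipv4"] = nums[0] << 24 | nums[1] << 16 | nums[2] << 8 | nums[3]
        for i, n in enumerate(nums):
            row["ipv4_%d" % i] = n
    else:
        row["ipv4"] = None

    # time => integers
    t = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    row.update(time=int(t.timestamp()), year=t.year,
               yearday=t.timetuple().tm_yday, weekday=t.weekday(),
               hour=t.hour, daymin=t.hour * 60 + t.minute)

    # url => substrings (query string dropped, at most ten components kept)
    parts = [p for p in m.group("url").split("?", 1)[0].split("/") if p]
    for i, p in enumerate(parts[:10]):
        row["url_%d" % i] = p
    row["filename"] = parts[-1] if parts else None
    has_ext = bool(parts) and "." in parts[-1]
    row["extension"] = parts[-1].rsplit(".", 1)[1] if has_ext else None

    # ua => first token plus the full string
    ua = m.group("ua")
    row["ua"] = ua
    row["ua_0"] = re.split(r"[/ ]", ua, maxsplit=1)[0] or None
    return row


SAMPLE = ('202.38.64.1 - - [01/Jun/2012:08:30:15 +0800] '
          '"GET /centos/6.2/os/x86_64/repodata/repomd.xml HTTP/1.1" '
          '200 3700 "-" "urlgrabber/3.9.1 yum/3.2.29"')

if __name__ == "__main__":
    print(preprocess(SAMPLE))
```

Storing each piece as its own column lets a column store answer a query such as "requests per first octet" by reading a single narrow column instead of parsing strings at query time.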

Preprocessing

- PHP is slow … only 3 MB/s.
- At first preg_match (regular expressions) took 85% of the time; after I optimized the regexp it takes only 25%.
- Xdebug shows that stream_get_line (fgets) and fputs take about 50% of the total time.
- InfoBright's data loading speed is 15 MB/s (crawl_http.log).
- I don't think PHP's preprocessing work is harder than the database's… Maybe PHP's interpreted nature makes it much slower than C. Can anyone provide a benchmark for Python etc.?

Files Table

- Many files on mirrors are never accessed, so we have to make a full list of the files on mirrors.
- The preprocessed url, filename, extension and filesize are recorded for each file.
- When processing logs and files, take care of escape characters and the VARCHAR maximum length. Do not restrict filenames to a simple regular expression, since there are always exceptions: UTF-8 strings in malicious requests, ",v" files in CVS …

GNUPLOT

- GNUPLOT is pretty flexible in plotting; I consulted the documentation and online demos many times.
- However, GNUPLOT is not flexible in processing data. It is strongly typed, so integers and strings need to be converted explicitly, and the integer type is limited to -2^31 to 2^31 - 1.
- When a query result needs postprocessing, I write a simple sed s/regexp/replace/ for simple replacement, or awk NR%n==0 for sampling.
- When it comes to accumulation or other complex processing, a PHP script goes in between.
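The two postprocessing steps mentioned above, sampling every n-th row and accumulating a column, are small enough to sketch. This is a Python equivalent for illustration only; the original pipeline used awk and PHP, and the function names are made up.

```python
from itertools import accumulate


def sample_every(rows, n):
    """Keep rows n, 2n, 3n, ... (1-based), like awk 'NR%n==0'."""
    return rows[n - 1::n]


def running_total(values):
    """Cumulative sum, e.g. for plotting a cumulative distribution."""
    return list(accumulate(values))


if __name__ == "__main__":
    print(sample_every(list(range(1, 11)), 3))
    print(running_total([1, 2, 3, 4]))
```

Accumulating before sampling keeps the sampled points on the true cumulative curve; sampling first would drop the contribution of the skipped rows.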

Discovering Sessions

- Scan the log sequentially. If a request fits into an existing session, update that session; otherwise create a new one, flushing the timed-out session first if one exists.
- Garbage collection of timed-out sessions:
    if (count(array) > gc_limit) {
        unset timeout sessions;
        gc_limit = count(array) * 1.5;
    }
- Inspired by the JavaScript GC algorithm in IE7: if the recycled memory is less than 15% of the total, the limit is doubled; if it is more than 85%, the limit is reset to its initial value.
- Maintain a query buffer of 10000 rows to reduce the number of queries.
- The PHP script takes 4 hours, 15 times slower than C (see the following section).
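The scan-and-flush loop with its adaptive garbage collection might look like the sketch below. It is an illustration, not the original PHP or C code. The session key (for example the client IP), the 1800-second timeout and the initial `gc_limit` are all assumptions, since the slides do not state them. The lower bound on `gc_limit` is also an addition, so that the limit cannot shrink to zero.

```python
def discover_sessions(requests, timeout=1800, gc_limit=1000):
    """Group requests into sessions.

    requests: iterable of (timestamp, key, length), sorted by timestamp.
    Returns a list of session dicts, flushed sessions first.
    """
    gc_floor = gc_limit  # addition in this sketch: never go below the start value
    active = {}          # key -> open session
    finished = []
    for ts, key, length in requests:
        s = active.get(key)
        if s is not None and ts - s["end"] <= timeout:
            # The request fits into the existing session: update it.
            s["end"] = ts
            s["requests"] += 1
            s["traffic"] += length
        else:
            if s is not None:
                finished.append(s)  # flush the timed-out session for this key
            active[key] = {"key": key, "start": ts, "end": ts,
                           "requests": 1, "traffic": length}
        if len(active) > gc_limit:
            # Garbage collection: drop every session that has timed out,
            # then let the limit follow the number of survivors.
            for k in [k for k, v in active.items() if ts - v["end"] > timeout]:
                finished.append(active.pop(k))
            gc_limit = max(gc_floor, int(len(active) * 1.5))
    finished.extend(active.values())
    return finished


if __name__ == "__main__":
    log = [(0, "a", 100), (10, "a", 50), (20, "b", 10), (5000, "a", 1)]
    for session in discover_sessions(log):
        print(session)
```

Because the limit grows with the number of live sessions, a full sweep of the table happens only after the table has grown by about half, which keeps the amortized cost per request low.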

Some Bugs Due to DIE Time Functions (PHP)

- PHP: I found some sessions with negative length. The only explanation was that the logs were not in increasing time order.
- Digging into the log table, I found a timestamp that matched records of both April 31 and May 1.
- In fact, PHP's mktime() takes its time and date fields as separate arguments, much like the fields of the struct tm passed to mktime() in C, but its month runs from 1 to 12, whereas the month in C runs from 0 to 11.
- However, PHP's strptime() is a thin wrapper around its C counterpart, so the month it returns is 0-based.
- I dare not use DATETIME in MySQL, because arithmetic on such timestamps is tricky; I would rather implement it in SQL or PHP.
- Thank goodness, there was no timezone problem this time.
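The collision can be reproduced in a few lines. The sketch below is one reading of the bug, not the original PHP: it assumes that a 0-based month from a strptime-style parser was passed to a mktime-style function that expects a 1-based month and silently normalizes an out-of-range day. `mktime_like`, `buggy_date` and `fixed_date` are hypothetical helpers.

```python
from datetime import date, timedelta


def mktime_like(year, month, day):
    """Mimic mktime-style normalization: month is 1-based, and a day past
    the end of the month rolls over into the following month."""
    return date(year, month, 1) + timedelta(days=day - 1)


def buggy_date(year, tm_mon, day):
    # Bug: tm_mon is 0-based (May == 4) but is used as a 1-based month.
    return mktime_like(year, tm_mon, day)


def fixed_date(year, tm_mon, day):
    return mktime_like(year, tm_mon + 1, day)


if __name__ == "__main__":
    # 31 May is read as "April 31", which normalizes to 1 May;
    # 1 June is read as "May 1". Both land on the same day.
    print(buggy_date(2012, 4, 31), buggy_date(2012, 5, 1))
    print(fixed_date(2012, 4, 31), fixed_date(2012, 5, 1))
```

Under this reading, requests from the last day of a 31-day month are mapped onto the same day as requests from the first day of the next month, so the log is no longer monotonic in time and negative session lengths appear.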

Some Bugs Due to DIE Time Functions (C)

- C: malloc() fell into a deadlock. It was so strange; I spent an evening in GDB and found no answer.
- After changing some code, I accidentally got a segmentation fault in time().
- In fact, time() takes a parameter of type time_t *, and it is dynamically linked. I called it without any argument, so time() treated the garbage on the stack as its parameter.
- If the garbage is NULL, nothing happens; otherwise it is treated as the address of a time_t, the current time is written there, and unpredictable things happen.

Query Optimization

    select count(*), ipv4, count(*) as c, sum(length) as s,
           concat(ipv4_0,'.',ipv4_1,'.',ipv4_2,'.',ipv4_3)
    from log
    where ipv4 is not null
    group by ipv4
    order by s
    limit 40;

Query_time: 798.179622

    2012-07-30 16:32:12 Cnd(0): VC:0(t0a0) IS NOT NULL (0)
    2012-07-30 16:33:02 Aggregating: 318575688 tuples left.
    2012-07-30 16:45:29 Aggregated (1415554 gr). Omitted packrows: 0 + 0 partially, out of 5020 total.
    2012-07-30 16:45:29 Heap Sort initialized for 1415554 rows, 8+61 bytes each.
    2012-07-30 16:45:30 Total data packs actually loaded (approx.): 35139

Infobright is column-based, so cross-column queries are slow. Try to generate the IP string from the 'ipv4' field.

Query Optimization (continued)

    SELECT ipv4, COUNT(*)/$COUNT, SUM(length)/$LENGTH AS c,
           CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.',
                  (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
    FROM log
    WHERE ipv4 IS NOT NULL
    GROUP BY ipv4
    ORDER BY c DESC
    LIMIT 40;

Query_time: 585.069205

InfoBright packs column data into packages and pre-computes MAX, MIN, AVG and GROUP data for each package. A WHERE clause requires these data to be recomputed:

    2012-07-30 14:33:06 Cnd(0): VC:0(t0a0) IS NOT NULL (0)
    2012-07-30 14:33:55 Aggregating: 318575688 tuples left.
    2012-07-30 14:42:47 Generating output.
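The CONCAT expression rebuilds the dotted string from the single integer column by masking and shifting one octet at a time. The same arithmetic in Python, as a small sketch with made-up helper names, shows that it round-trips (in both MySQL and Python, `<<` binds more tightly than `&`, so `ipv4 & 255<<24` means `ipv4 & (255<<24)`):

```python
def int_to_dotted(ipv4):
    """Same masks and shifts as the CONCAT expression in the query."""
    octets = (
        (ipv4 & 255 << 24) >> 24,
        (ipv4 & 255 << 16) >> 16,
        (ipv4 & 255 << 8) >> 8,
        ipv4 & 255,
    )
    return ".".join(str(o) for o in octets)


def dotted_to_int(text):
    """Inverse direction, as done once during preprocessing."""
    a, b, c, d = (int(x) for x in text.split("."))
    return a << 24 | b << 16 | c << 8 | d


if __name__ == "__main__":
    n = dotted_to_int("202.38.64.1")
    print(n, int_to_dotted(n))
```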

Query Optimization (continued)

    SELECT ipv4, COUNT(*)/$COUNT, SUM(length)/$LENGTH AS c,
           CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.',
                  (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
    FROM log
    GROUP BY ipv4
    ORDER BY c DESC
    LIMIT 40;

Query_time: 328.580979

The WHERE clause is removed, but the result is wrong (it includes a large NULL group, which stands for IPv6).

    2012-07-30 14:45:41 Unoptimized expression near '/'
    2012-07-30 14:45:41 Unoptimized expression near '/'
    2012-07-30 14:45:41 Unoptimized expression near 'concat'
    2012-07-30 14:45:41 Aggregating: 328976877 tuples left.

InfoBright calculates these columns for each tuple, so no wonder it is slow. Keep the core subquery small; moving these calculations outside may help.

Query Optimization (continued)

    SELECT ipv4, c/$COUNT, s/$LENGTH,
           CONCAT((ipv4 & 255<<24)>>24, '.', (ipv4 & 255<<16)>>16, '.',
                  (ipv4 & 255<<8)>>8, '.', (ipv4 & 255))
    FROM (SELECT ipv4, COUNT(*) AS c, SUM(length) AS s
          FROM log
          GROUP BY ipv4
          HAVING (ipv4 IS NOT NULL)
          ORDER BY s DESC
          LIMIT 40) AS t;

Query_time: 138.297943
Total data packs actually loaded (approx.): 10040

- Use a HAVING clause to filter out NULL. The internal execution order is: FROM TABLE, JOIN, OUTER JOIN, WHERE, SELECT clause, GROUP BY, HAVING, ORDER BY, LIMIT, projection & output.
- Turn on MySQL's query_cache: identical queries should not be re-executed when I change some GNUPLOT command and run the script again.
- Is there any way to make the query faster? I don't know.

Query Optimization (continued)

Another example of moving expressions 'outside' the inner query:

    SELECT ipv4_0, c/$COUNT, s/$LENGTH
    FROM (SELECT IFNULL(ipv4_0, 'IPv6') AS ipv4_0, COUNT(*) AS c, SUM(length) AS s
          FROM log
          GROUP BY ipv4_0
          ORDER BY s DESC
          LIMIT 40) AS t;

Query_time: 263.048662

    SELECT IFNULL(ipv4_0, 'IPv6'), c/$COUNT, s/$LENGTH
    FROM (SELECT ipv4_0, COUNT(*) AS c, SUM(length) AS s
          FROM log
          GROUP BY ipv4_0
          ORDER BY s DESC
          LIMIT 40) AS t;

Query_time: 84.589837

I could not believe my eyes when I first saw these figures.
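The two forms return the same rows; only the place where IFNULL is evaluated changes. The sketch below checks that equivalence on a toy table using SQLite from Python's standard library. It is an assumption-laden stand-in: SQLite is not Infobright, the six rows are made up, the division by $COUNT and $LENGTH is dropped, and nothing here says anything about timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (ipv4_0 INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    # Made-up rows; NULL in ipv4_0 stands for an IPv6 client.
    [(202, 100), (202, 300), (10, 50), (None, 500), (None, 20), (114, 70)],
)

# IFNULL evaluated inside the aggregating subquery (the slower form above).
INSIDE = """
SELECT ipv4_0, c, s FROM
  (SELECT IFNULL(ipv4_0, 'IPv6') AS ipv4_0, COUNT(*) AS c, SUM(length) AS s
   FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t
"""

# IFNULL applied only to the (at most 40) rows that leave the subquery.
OUTSIDE = """
SELECT IFNULL(ipv4_0, 'IPv6'), c, s FROM
  (SELECT ipv4_0, COUNT(*) AS c, SUM(length) AS s
   FROM log GROUP BY ipv4_0 ORDER BY s DESC LIMIT 40) AS t
"""


def run(sql):
    # Sort by traffic so the comparison does not depend on output order.
    return sorted(conn.execute(sql).fetchall(), key=lambda r: -r[2])


rows_inside = run(INSIDE)
rows_outside = run(OUTSIDE)

if __name__ == "__main__":
    print(rows_inside)
    print(rows_outside)
```

In the second form the expression is evaluated on the aggregated rows only, which matches the advice to keep the core subquery small.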

Query Optimization is Not Everything

    SELECT url_0, COUNT(*)/$COUNT, SUM(files.filesize)/$LENGTH AS c
    FROM files
    WHERE NOT EXISTS (SELECT * FROM log WHERE log.filename = files.filename)
    GROUP BY url_0
    ORDER BY c DESC
    LIMIT 40;

- The query ran for 12 hours and I had to kill it. Infobright treats the NOT EXISTS clause as a dependent subquery and might have to execute it for each row.
- I don't know how to use cursors and stored procedures in MySQL, and more importantly Infobright does not support dynamic updates of tables, so SELECT INTO is impossible.
- I wrote a C program (using the mysqlclient API) to count the occurrences of each url in log and write them into another table.
- Performance:
  - 10M rows in the files table, 329M rows in the log table
  - 17.4 minutes in total
  - 440000 rows per second in the Probe phase
  - 460 MB of memory usage (360 MB resident)

Simulating Hash JOIN

- After I finished programming, I found that my algorithm is actually called Hash JOIN. It is a fast JOIN algorithm for RDBMSs.
- "The task is to find, for each distinct value of the join attribute, the set of tuples in each relation which have that value." (Wikipedia)
- My task is to count the occurrences in the log table of each url in the files table.
  - Build step: traverse the files table and add the url field of each row to a hash list.
  - Probe step: traverse the log table and increment the counter of the corresponding hash slot.
  - Output step: traverse the files table again and output the corresponding counters. The output is FIFOed to LOAD DATA INFILE.
- Easy to shard horizontally. The steps are similar to Map, Shuffle and Reduce.

Simulating Hash JOIN (continued)

- Since there might be many files (currently only about 10M), storing the hash list entirely in memory is my top concern.
- Use one hash for the position and two additional hashes for checking, and do not store the original string.
  - There is a 10^-22 probability of an undetected hash collision, which is low enough for an analytic program.
  - Only 12 bytes are required for each slot (2 * sizeof(int) for the hash checks, sizeof(int) for the counter).
- Hash algorithm: DJBX33A (hash = ((hash<<5) + hash) + *key++), used by PHP, Apache and more. Use three seeds for the position and check hashes.
- Hash collision: linearly find the next empty slot. This is simple (I'm afraid of pointers) and memory-saving.
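The build, probe and output steps, together with the string-free slot layout, can be sketched as below. This is a Python illustration of the idea, not the original C program. The three seed values, the table size of twice the number of files, and the use of None to mark an empty slot are assumptions made for the sketch.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def djbx33a(key, seed):
    """DJBX33A over a bytes key: hash = hash * 33 + byte, from a given seed."""
    h = seed
    for byte in key:
        h = ((h << 5) + h + byte) & MASK
    return h


# Hypothetical seeds; the slides only say that three seeds are used.
SEED_POS, SEED_CHK1, SEED_CHK2 = 5381, 33, 65599


class CountTable:
    """Open-addressing table storing two check hashes and a counter per
    slot, but never the url string itself."""

    def __init__(self, n_slots):
        self.n = n_slots
        self.slots = [None] * n_slots  # used slot: [check1, check2, counter]

    def _locate(self, url):
        key = url.encode("utf-8")
        pos = djbx33a(key, SEED_POS) % self.n
        c1, c2 = djbx33a(key, SEED_CHK1), djbx33a(key, SEED_CHK2)
        # Linear probing: walk forward until the slot is empty or matches.
        while self.slots[pos] is not None and self.slots[pos][:2] != [c1, c2]:
            pos = (pos + 1) % self.n
        return pos, c1, c2

    def add(self, url):    # build step
        pos, c1, c2 = self._locate(url)
        if self.slots[pos] is None:
            self.slots[pos] = [c1, c2, 0]

    def hit(self, url):    # probe step; urls not in the table are ignored
        pos, _, _ = self._locate(url)
        if self.slots[pos] is not None:
            self.slots[pos][2] += 1

    def count(self, url):  # output step
        pos, _, _ = self._locate(url)
        return 0 if self.slots[pos] is None else self.slots[pos][2]


def hash_join_count(files, log):
    """Count how often each url of `files` occurs in `log`."""
    table = CountTable(2 * len(files) + 1)  # always leaves empty slots
    for url in files:
        table.add(url)
    for url in log:
        table.hit(url)
    return [(url, table.count(url)) for url in files]


if __name__ == "__main__":
    print(hash_join_count(["/a", "/b", "/c"], ["/a", "/a", "/c", "/zzz"]))
```

One caveat about the sketch: with plain DJBX33A, changing only the seed shifts the hash of equal-length keys by the same constant, so the three hashes here are correlated. The sketch keeps the construction for illustration and makes no attempt to reproduce the collision probability quoted above.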

The End

- I intended to finish this in 3 days, but because I am unfamiliar with these tools, it took me a week (or more, including preparation time).
- Access logs are the origin of discoveries about access patterns. Data is precious, while disk is cheap. Please do not delete the logs after 52 days.
- I cannot draw conclusions on 'trends', because the data does not cover an adequately long time.
- If you want other statistics, please email me and I will run the query (if I have time :).

Final Words

- Thanks to all maintainers and supporters of mirrors.ustc.edu.cn!
- All scripts and the source of these slides are available on GitHub: https://github.com/bojieli/mirrors-log
- I finally solved the Chinese font problem, but I am too lazy to translate the slides……