Big Data with rubygems.org Download Data

Aja Hammerly

1 Aja Hammerly http://github.com/thagomizer @the_thagomizer http://www.thagomizer.com [email protected]

2 3 Lawyer Cat Says: All code is copyright Google and

Ruby

Questions

Which gems are used often?

Is Minitest or RSpec more popular?

Do we need to support Ruby 1.9?

Is Rails 3, Rails 4, or Rails 5 more popular?

Guess?

Data

rubygems.org

GitHub

rubygems.org Data

Overview

rubygems

Column Name    Type
id             integer
name           varchar
created_at     datetime
updated_at     datetime
slug           varchar

gem_downloads

Column Name    Type
id             integer
rubygem_id     integer
version_id     integer
count          bigint

dependencies

linksets

versions

Column Name                  Type
id                           integer
rubygem_id                   integer
size                         integer
position                     integer
number                       varchar
indexed                      boolean
prerelease                   boolean
latest                       boolean
yanked_at                    datetime
built_at                     datetime
updated_at                   datetime
created_at                   datetime
authors                      text
description                  text
summary                      text
requirements                 text
platform                     varchar
full_name                    varchar
licenses                     varchar
required_ruby_version        varchar
required_rubygems_version    varchar
info_checksum                varchar
metadata                     hstore
sha256                       varchar

GitHub Data

files

Column Name       Type
repo_name         string
ref               string
path              string
mode              integer
id                string
symlink_target    string

contents

Column Name    Type
id             string
size           integer
content        string
binary         boolean
copies         integer

commits

languages

Column Name       Type
repo_name         string
language          record
language.name     string
language.bytes    integer

licenses

Now What?

BigQuery

What

Why

How

I ❤ BigQuery

SQL

Fast

Scales

Complex

Demo

Vocabulary

Dataset

Table

Import

Streaming

google-cloud

pg

require 'pg'
require 'google/cloud/bigquery'

ENV["GOOGLE_CLOUD_PROJECT"] = "rubygems-bigquery"
ENV["GOOGLE_CLOUD_KEYFILE"] = "#{key_path}"

bigquery = Google::Cloud.bigquery
bq_db = bigquery.dataset "rubygems"

postgres = PG.connect dbname: "rubygems"

bq_table ||= bq_db.create_table("gems") do |s|
  s.integer "id"
  s.string "name"
  s.timestamp "created_at"
  s.timestamp "updated_at"
end

columns = %w[id name created_at updated_at]

postgres.exec("SELECT * FROM rubygems") do |pg_table|
  pg_table.each do |row|
    hashed_row = Hash[columns.zip(row.values)]
    bq_table.insert(hashed_row)
  end
end

Zip & Hash[]

[key1, key2, key3, key4]
[val1, val2, val3, val4]

zip

[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

Hash::[]

Hash[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

{ key1 => val1, key2 => val2, key3 => val3, key4 => val4 }

Hash[keys.zip(values)]
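The zip and Hash[] steps can be tried in plain Ruby. This is a small sketch; the sample values are made up for illustration and are not taken from the rubygems.org data.

```ruby
# Column names, plus one row of made-up values in the same order.
keys   = %w[id name created_at updated_at]
values = [1, "rake", "2009-07-25", "2016-11-01"]

# zip pairs each key with the value at the same position.
pairs = keys.zip(values)
# pairs.first == ["id", 1]

# Hash[] turns the array of [key, value] pairs into a hash.
row = Hash[pairs]

# Array#to_h builds the same hash from the same pairs.
same_row = pairs.to_h
```

Either form gives a hash keyed by column name, which is the shape passed to the streaming insert.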

postgres.exec("SELECT * FROM rubygems") do |pg_table|
  pg_table.each do |row|
    hashed_row = Hash[columns.zip(row.values)]
    bq_table.insert(hashed_row)
  end
end
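The loop above makes one insert call per row. A possible variation, not from the talk, is to group rows and insert each group in one call; this assumes the table's insert method also accepts an array of row hashes. The sketch uses a stand-in object instead of a real BigQuery table so it runs without credentials, and the rows and batch size are hypothetical.

```ruby
columns = %w[id name created_at updated_at]

# Hypothetical rows: one array of values per row, in column order.
pg_rows = (1..5).map { |i| [i, "gem#{i}", "2016-01-01", "2016-06-01"] }

# Stand-in for the BigQuery table (an assumption for this sketch):
# it only records each batch passed to insert.
batches = []
bq_table = Object.new
bq_table.define_singleton_method(:insert) { |rows| batches << rows }

# Hypothetical batch size; pick one that suits the real service limits.
batch_size = 2

pg_rows.each_slice(batch_size) do |slice|
  # Build one hash per row, then send the whole slice in a single call.
  batch = slice.map { |values| Hash[columns.zip(values)] }
  bq_table.insert(batch)
end
```

With five rows and a batch size of two, the stand-in receives three batches holding two, two, and one rows.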

Batch

Formats

CSV

JSON

Avro

CSV

Import
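For the batch route, the rows first have to be written out as CSV. A minimal sketch using Ruby's csv library follows; the rows are made up, and whether a header row belongs in the file depends on how the import is configured.

```ruby
require "csv"

columns = %w[id name created_at updated_at]

# Made-up sample rows, in the same column order as above.
rows = [
  [1, "rake", "2009-07-25 18:01:00", "2016-11-01 09:30:00"],
  [2, "rack", "2009-07-25 18:02:00", "2016-10-15 12:00:00"]
]

# CSV.generate returns the CSV text as a string;
# CSV.open(path, "w") takes the same block and writes a file instead.
csv_text = CSV.generate do |csv|
  csv << columns               # header row
  rows.each { |r| csv << r }   # one line per row
end
```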

What Now?

Answer Questions

Which gem has the most downloads?

SELECT name, count
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
ORDER BY count DESC
LIMIT 5

name          count
              107,076,261
rack          100,955,906
multi_json    100,171,080
json           95,715,131
bundler        93,085,862

SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
ORDER BY total DESC
LIMIT 5

name          count
rake          214,152,212
rack          201,911,759
multi_json    200,342,260
json          191,430,173
bundler       186,172,479

How many downloads does Rails have?

SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
WHERE name = 'rails'
GROUP BY name

name     total
rails    137,635,731

Is Minitest or RSpec more popular?

SELECT name, sum(count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
GROUP BY name
HAVING name IN ('minitest', 'rspec')

name        total
minitest    101,151,246
rspec        77,293,803

Which version of Rails is the most popular?

SELECT name,
       REGEXP_EXTRACT(number,r'(\d*)\.') AS major,
       sum(rubygems.downloads.count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
JOIN rubygems.versions
  ON rubygems.versions.id = rubygems.downloads.version_id
WHERE name = 'rails'
GROUP BY major, name
ORDER BY major

REGEXP_EXTRACT(number,r'(\d*)\.') AS major
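To see what that pattern captures, the same regular expression can be tried in Ruby. This is only an illustration of the regex, not of BigQuery itself, and the version strings are made-up examples.

```ruby
# Made-up version strings in the style of the versions.number column.
versions = ["5.0.0.1", "4.2.7", "3.2.22.5", "0.9.5"]

# String#[] with a regex and a group index returns the first capture:
# the run of digits before the first dot.
majors = versions.map { |number| number[/(\d*)\./, 1] }
# majors == ["5", "4", "3", "0"]

# A string with no dot does not match, so Ruby returns nil.
no_dot = "5"[/(\d*)\./, 1]
```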

version    downloads
0           6,446,448
1             103,236
2           4,627,625
3          28,731,007
4          28,719,391
5             190,789

SELECT name,
       REGEXP_EXTRACT(number,r'(\d*)\.') AS major,
       sum(rubygems.downloads.count) as total
FROM [rubygems.downloads]
JOIN rubygems.gems
  ON rubygems.gems.id = rubygems.downloads.rubygem_id
LEFT JOIN rubygems.versions
  ON rubygems.versions.id = rubygems.downloads.version_id
WHERE name = 'rails'
GROUP BY major, name
ORDER BY major

version    downloads
null       68,817,235
0           6,446,448
1             103,236
2           4,627,625
3          28,731,007
4          28,719,391
5             190,789

Do we need to support Ruby 1.9?

Which version of Ruby do gems released in the past year require?

SELECT required_ruby_version, COUNT(*) AS total
FROM rubygems.versions
WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
GROUP BY required_ruby_version
ORDER BY total DESC

name        total
>= 0        75,821
>= 1.9.3     6,833
>= 2.0.0     3,829
>= 2.0       1,428
>= 2.3.0     1,334

SELECT REGEXP_EXTRACT(required_ruby_version, r'(.*?\d\.?)') AS version,
       COUNT(*) AS total
FROM rubygems.versions
WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
GROUP BY version
ORDER BY total DESC

name    total
>= 0    95,851
>= 1    13,080
>= 2    12,944
~> 2     2,040
> 2         49

What gems are most commonly used?

Which gems are used the most on GitHub?

SELECT REGEXP_EXTRACT(line, r"\s+([\w\-_]*) \(") as gem,
       count(*) as total
FROM (
  SELECT SPLIT(content, '\n') as line
  FROM github_ruby.gemfilelock
)
GROUP BY gem
HAVING gem IS NOT NULL
ORDER BY total DESC

name             total
activesupport    355,791
rack             197,631
railties         179,002
actionpack       167,075
json             132,750
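The query's line-by-line extraction can be mimicked in Ruby to see which lines of a Gemfile.lock the pattern picks up. This is a rough sketch with two made-up lock file fragments, not the GitHub dataset; note that nested dependency lines match too.

```ruby
# Two made-up Gemfile.lock fragments standing in for the content column.
lockfiles = [
  "GEM\n  specs:\n    activesupport (5.0.0.1)\n      i18n (~> 0.7)\n    rack (2.0.1)\n",
  "GEM\n  specs:\n    rack (1.6.4)\n    json (1.8.3)\n"
]

# Same pattern as the query: leading whitespace, a gem name, then " (".
gem_pattern = /\s+([\w\-_]*) \(/

counts = Hash.new(0)
lockfiles.each do |content|
  content.split("\n").each do |line|
    gem_name = line[gem_pattern, 1]        # nil when the line has no match
    counts[gem_name] += 1 if gem_name
  end
end

# Highest count first, ties broken by name.
ranked = counts.sort_by { |name, total| [-total, name] }
# ranked == [["rack", 2], ["activesupport", 1], ["i18n", 1], ["json", 1]]
```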

Conclusions

Thank You