
Big Data with rubygems.org Download Data
Aja Hammerly

Aja Hammerly
http://github.com/thagomizer
@the_thagomizer
http://www.thagomizer.com
[email protected]

Lawyer Cat Says: All code is copyright Google and @the_thagomizer

Ruby

Questions

Which gems are used often?
Is Minitest or Rspec more popular?
Do we need to support Ruby 1.9?
Is Rails 3, Rails 4, or Rails 5 more popular?

Guess?

Data

rubygems.org
GitHub

rubygems.org Data

Overview

rubygems

    Column Name   Type
    id            integer
    name          varchar
    created_at    datetime
    updated_at    datetime
    slug          varchar

gem_downloads

    Column Name   Type
    id            integer
    rubygem_id    integer
    version_id    integer
    count         bigint

dependencies

linksets

versions

    Column Name                 Type
    id                          integer
    rubygem_id                  integer
    size                        integer
    position                    integer
    number                      varchar
    indexed                     boolean
    prerelease                  boolean
    latest                      boolean
    yanked_at                   datetime
    built_at                    datetime
    updated_at                  datetime
    created_at                  datetime
    authors                     text
    description                 text
    summary                     text
    requirements                text
    platform                    varchar
    full_name                   varchar
    licenses                    varchar
    required_ruby_version       varchar
    required_rubygems_version   varchar
    info_checksum               varchar
    metadata                    hstore
    sha256                      varchar

GitHub Data

files

    Column Name      Type
    repo_name        string
    ref              string
    path             string
    mode             integer
    id               string
    symlink_target   string

contents

    Column Name   Type
    id            string
    size          integer
    content       string
    binary        boolean
    copies        integer

commits

languages

    Column Name      Type
    repo_name        string
    language         record
    language.name    string
    language.bytes   integer

licenses

Now What?
BigQuery

What
Why
How

I ❤ BigQuery

SQL
Fast
Scales
Complex

Demo

Vocabulary

Dataset
Table

Import

Streaming

google-cloud
pg

    require 'pg'
    require 'google/cloud/bigquery'

    ENV["GOOGLE_CLOUD_PROJECT"] = "rubygems-bigquery"
    ENV["GOOGLE_CLOUD_KEYFILE"] = "#{key_path}"

    bigquery = Google::Cloud.bigquery
    bq_db = bigquery.dataset "rubygems"

    postgres = PG.connect dbname: "rubygems"

    bq_table ||= bq_db.create_table("gems") do |s|
      s.integer "id"
      s.string "name"
      s.timestamp "created_at"
      s.timestamp "updated_at"
    end

    columns = %w[id name created_at updated_at]

    postgres.exec("SELECT * FROM rubygems") do |pg_table|
      pg_table.each do |row|
        hashed_row = Hash[columns.zip(row.values)]
        bq_table.insert(hashed_row)
      end
    end

Zip & Hash[]

    [key1, key2, key3, key4]
    [val1, val2, val3, val4]

zip

    [[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

Hash::[]

    Hash[[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]]

    { key1 => val1, key2 => val2, key3 => val3, key4 => val4 }

    Hash[keys.zip(values)]

Batch

Formats

CSV
JSON
Avro

CSV

Import

What Now?

Answer Questions

Which gem has the most downloads?

    SELECT name, count
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    ORDER BY count DESC
    LIMIT 5

    name         count
    rake         107,076,261
    rack         100,955,906
    multi_json   100,171,080
    json          95,715,131
    bundler       93,085,862

    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5

    name         total
    rake         214,152,212
    rack         201,911,759
    multi_json   200,342,260
    json         191,430,173
    bundler      186,172,479

How many downloads does Rails have?

    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    WHERE name = 'rails'
    GROUP BY name

    name    total
    rails   137,635,731

Is Minitest or Rspec more popular?

    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    GROUP BY name
    HAVING name IN ('minitest', 'rspec')

    name       total
    minitest   101,151,246
    rspec       77,293,803

Which version of Rails is the most popular?

    SELECT name,
           REGEXP_EXTRACT(number,r'(\d*)\.') AS major,
           sum(rubygems.downloads.count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    JOIN rubygems.versions
      ON rubygems.versions.id = rubygems.downloads.version_id
    WHERE name = 'rails'
    GROUP BY major, name
    ORDER BY major

    REGEXP_EXTRACT(number,r'(\d*)\.') AS major

    version   downloads
    0          6,446,448
    1            103,236
    2          4,627,625
    3         28,731,007
    4         28,719,391
    5            190,789

    SELECT name,
           REGEXP_EXTRACT(number,r'(\d*)\.') AS major,
           sum(rubygems.downloads.count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    LEFT JOIN rubygems.versions
      ON rubygems.versions.id = rubygems.downloads.version_id
    WHERE name = 'rails'
    GROUP BY major, name
    ORDER BY major

    version   downloads
    null      68,817,235
    0          6,446,448
    1            103,236
    2          4,627,625
    3         28,731,007
    4         28,719,391
    5            190,789

Do we need to support Ruby 1.9?

Which version of ruby do gems released in the past year require?

    SELECT required_ruby_version, COUNT(*) AS total
    FROM rubygems.versions
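
The major-version grouping used in the Rails queries above can be tried locally in plain Ruby. This is only a sketch: the `major_version` helper and the sample rows are hypothetical and not from the talk, and Ruby's regex engine stands in for BigQuery's `REGEXP_EXTRACT`. It shows how the pattern `(\d*)\.` captures the digits before the first dot, and why a row with no version number (as a LEFT JOIN can produce) ends up in a separate null group.

```ruby
# Same pattern as REGEXP_EXTRACT(number, r'(\d*)\.') in the slides:
# capture the run of digits that precedes the first dot.
MAJOR_VERSION = /(\d*)\./

# Hypothetical helper: returns the major version as a string,
# or nil when the version number is missing or has no dot.
def major_version(number)
  match = MAJOR_VERSION.match(number.to_s)
  match && match[1]
end

# Made-up sample rows of [version number, download count]; not real data.
rows = [["4.2.7", 10], ["4.1.0", 5], ["5.0.0.beta1", 2], [nil, 7]]

# Equivalent of GROUP BY major with sum(count).
totals = Hash.new(0)
rows.each { |number, count| totals[major_version(number)] += count }

totals.each { |major, total| puts "#{major.inspect} #{total}" }
```

The row whose version number is `nil` is summed under the `nil` key, which mirrors the `null` row in the LEFT JOIN result.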