Big Data with rubygems.org Download Data
Aja Hammerly

Aja Hammerly
http://github.com/thagomizer
@the_thagomizer
http://www.thagomizer.com

Lawyer Cat Says: All code is copyright Google and @the_thagomizer

Ruby

Questions

  Which gems are used often?
  Is Minitest or Rspec more popular?
  Do we need to support Ruby 1.9?
  Is Rails 3, Rails 4, or Rails 5 more popular?
  Guess?

Data

  rubygems.org
  GitHub

rubygems.org Data

Overview

rubygems

    Column Name   Type
    id            integer
    name          varchar
    created_at    datetime
    updated_at    datetime
    slug          varchar

gem_downloads

    Column Name   Type
    id            integer
    rubygem_id    integer
    version_id    integer
    count         bigint

dependencies

linksets

versions

    Column Name   Type       Column Name                 Type
    id            integer    authors                     text
    rubygem_id    integer    description                 text
    size          integer    summary                     text
    position      integer    requirements                text
    number        varchar    platform                    varchar
    indexed       boolean    full_name                   varchar
    prerelease    boolean    licenses                    varchar
    latest        boolean    required_ruby_version       varchar
    yanked_at     datetime   required_rubygems_version   varchar
    built_at      datetime   info_checksum               varchar
    updated_at    datetime   metadata                    hstore
    created_at    datetime   sha256                      varchar

GitHub Data

files

    Column Name      Type
    repo_name        string
    ref              string
    path             string
    mode             integer
    id               string
    symlink_target   string

contents

    Column Name   Type
    id            string
    size          integer
    content       string
    binary        boolean
    copies        integer

commits

languages

    Column Name      Type
    repo_name        string
    language         record
    language.name    string
    language.bytes   integer

licenses

Now What?

BigQuery

  What
  Why
  How

I ❤ BigQuery

  SQL
  Fast
  Scales
  Complex

Demo

Vocabulary

  Dataset
  Table

Import

Streaming

  google-cloud
  pg

    require 'pg'
    require 'google/cloud/bigquery'

    ENV["GOOGLE_CLOUD_PROJECT"] = "rubygems-bigquery"
    ENV["GOOGLE_CLOUD_KEYFILE"] = "#{key_path}"

    bigquery = Google::Cloud.bigquery
    bq_db = bigquery.dataset "rubygems"

    postgres = PG.connect dbname: "rubygems"

    bq_table ||= bq_db.create_table("gems") do |s|
      s.integer "id"
      s.string "name"
      s.timestamp "created_at"
      s.timestamp "updated_at"
    end

    columns = %w[id name created_at updated_at]

    postgres.exec("SELECT * FROM rubygems") do |pg_table|
      pg_table.each do |row|
        hashed_row = Hash[columns.zip(row.values)]
        bq_table.insert(hashed_row)
      end
    end

Zip & Hash[]

    [key1, key2, key3, key4]
    [val1, val2, val3, val4]

zip

    [[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

Hash::[]

    Hash[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

    { key1 => val1, key2 => val2, key3 => val3, key4 => val4 }

    Hash[keys.zip(values)]

Batch

Formats

  CSV
  JSON
  Avro

CSV

Import

What Now?

Answer Questions

Which gem has the most downloads?

    SELECT name, count
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    ORDER BY count DESC
    LIMIT 5

    name         count
    rake         107,076,261
    rack         100,955,906
    multi_json   100,171,080
    json          95,715,131
    bundler       93,085,862

    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5

    name         count
    rake         214,152,212
    rack         201,911,759
    multi_json   200,342,260
    json         191,430,173
    bundler      186,172,479

How many downloads does Rails have?

    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    WHERE name = 'rails'
    GROUP BY name

    name    total
    rails   137,635,731

Is Minitest or Rspec more popular?

    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    GROUP BY name
    HAVING name IN ('minitest', 'rspec')

    name       total
    minitest   101,151,246
    rspec       77,293,803

Which version of Rails is the most popular?

    SELECT name,
           REGEXP_EXTRACT(number,r'(\d*)\.') AS major,
           sum(rubygems.downloads.count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    JOIN rubygems.versions
      ON rubygems.versions.id = rubygems.downloads.version_id
    WHERE name = 'rails'
    GROUP BY major, name
    ORDER BY major

    REGEXP_EXTRACT(number,r'(\d*)\.') AS major

    version   downloads
    0          6,446,448
    1            103,236
    2          4,627,625
    3         28,731,007
    4         28,719,391
    5            190,789

    SELECT name,
           REGEXP_EXTRACT(number,r'(\d*)\.') AS major,
           sum(rubygems.downloads.count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    LEFT JOIN rubygems.versions
      ON rubygems.versions.id = rubygems.downloads.version_id
    WHERE name = 'rails'
    GROUP BY major, name
    ORDER BY major

    version   downloads
    null      68,817,235
    0          6,446,448
    1            103,236
    2          4,627,625
    3         28,731,007
    4         28,719,391
    5            190,789

Do we need to support Ruby 1.9?

Which version of Ruby do gems released in the past year require?

    SELECT required_ruby_version, COUNT(*) AS total
    FROM rubygems.versions
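
The Zip & Hash[] slides earlier in the deck can be tried in plain Ruby without Postgres or BigQuery. The sketch below is only an illustration: the helper name hash_row and the sample values are made up, standing in for a database row. It also shows one thing worth keeping in mind with SELECT *: Array#zip follows the length of the receiver, so extra values are dropped and missing ones become nil.

```ruby
# Column list taken from the slides; the values below are made-up samples.
COLUMNS = %w[id name created_at updated_at].freeze

# Hypothetical helper: pair each column name with the value at the same
# position, then build a Hash from the pairs (same idea as the slides).
def hash_row(columns, values)
  Hash[columns.zip(values)]
end

values = ["1", "rake", "2009-07-25", "2016-05-01"]
row = hash_row(COLUMNS, values)
puts row.inspect

# zip is driven by the receiver (the column list): a fifth value, such as
# a slug column returned by SELECT *, is silently dropped.
with_extra = hash_row(COLUMNS, values + ["rake"])

# If there are fewer values than columns, the gaps are filled with nil.
with_missing = hash_row(COLUMNS, values.first(2))

# On Ruby 2.1 and later, Array#to_h gives the same result as Hash[].
same_via_to_h = COLUMNS.zip(values).to_h
```

Because zip quietly truncates or pads, the column list has to match the order of the columns the query returns, or values end up under the wrong keys.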
