Questions first, then data Perspectives on data science from a social scientist
Drew Conway Dept. of Politics, NYU Monktoberfest - April 4, 2012
Wednesday, October 3, 12 How I got here...
Hacking Skills
[Diagram labels: 10%, 10%, 80%]
I hold the following truths to be self-evident...
1. Data come from many sources
2. Data come in many form(at)s

Obtain
A .zip file of PDFs ≠ data
‣ Data scientist must know where to get data and how to obtain it
‣ Combine data from multiple sources/formats
‣ Work with APIs
$ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json

Munge
Real data are messy
‣ Even curated data: duplicates, missing values, date formats
Work with big text files
$ head publicvotes-20101018_votes.dump
‣ Tools
• *NIX tools: sed, awk, grep
• Scripting languages: Perl, Python and R
$ cat ufo_awesome.tsv | grep probe | wc -l
131
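The `grep probe | wc -l` pipeline can also be written in a scripting language. A minimal Python sketch follows; the function name `count_matching_lines` and the three sample rows are made up for illustration and are not taken from the real `ufo_awesome.tsv`.

```python
import io


def count_matching_lines(lines, needle):
    """Count lines that contain `needle`, like `grep needle | wc -l`.

    Each line counts once, no matter how many times the needle appears in it.
    """
    return sum(1 for line in lines if needle in line)


# Hypothetical stand-in for a tab-separated file; the rows are invented.
sample = io.StringIO(
    "19951009\tIowa\tlight in sky\n"
    "19960101\tTexas\tprobe hovered over a field\n"
    "19970704\tOhio\tsaw a probe, then nothing\n"
)

print(count_matching_lines(sample, "probe"))  # prints 2
```

For a real file, pass an open file handle instead of `sample`, for example `open("ufo_awesome.tsv", encoding="utf-8", errors="replace")`. Iterating over the handle reads one line at a time, so the approach should also work on big text files.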
Hacking Skills
80% of effort is spent here, but this is perhaps the most straightforward to train
Heavily tool focused, lean on current CS/IS curriculums
‣ Comfort working at the command-line, with text editors
‣ A language for every season!
Conveying findings in creative and compelling ways
Math & Stats Knowledge
If: Better data beats better math
Then: What methods should be taught?

Explore
How do you find structure in new data?
‣ Scatter plots
‣ Density plots
Data exploration that scales
‣ Reduce dimensionality
‣ PCA, SVD, MDS

Model
Methods must match data
‣ Text
‣ Geospatial
‣ Web-scale
What is the ‘best’ model?
‣ Most predictive
‣ Most parsimonious
‣ Cross-validation
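One way to make "most predictive" concrete is k-fold cross-validation: fit on part of the data and score on the held-out part. The sketch below uses only the standard library. The two candidate models (mean-only and a straight line), the helper names, and the synthetic data are all assumptions chosen for illustration, not the talk's own example.

```python
import random


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def k_fold_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds (needs len(xs) >= k)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Synthetic data, invented for this sketch: a noisy linear trend.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

for name, fit in [("mean-only", fit_mean), ("linear", fit_line)]:
    print(name, round(k_fold_mse(fit, xs, ys), 3))
```

The model with the lower held-out error is the more predictive one on this data. Parsimony is a separate judgment: when two models score about the same, the simpler one is usually preferred.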
Math & Stats Knowledge
Universities good at methods training...
...but what methods fit into Data Science?

$&^! data scientists like...
‣ Describing the current state of the world
‣ Predicting future observations
‣ Classifying/ranking observations

1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations

$&^! social scientists like...
‣ Testable theoretical models
‣ Natural experiments
‣ Causality
Substantive Expertise
[Diagram labels: 10%, 10%, 80%]
Data Science, as a discipline, is fundamentally about human behavior

Inquire
Focus on questions / not tech
‣ What new questions can be asked from web-scale data?
‣ Tools are a means to an end
Social science has questions
‣ Markets
‣ Organization
‣ Decision making

Interpret
How do we know when the results we get make sense, if ever?
Case study!
...or, how my friends and I spend hours hacking through messy data to answer seemingly trivial questions!
What makes a programming language popular?
How can we measure a programming language’s popularity?
Compare # of question tags on StackOverflow to projects on Github

✓ Compare # of question tags on StackOverflow to projects on Github
X How can we measure a programming language’s popularity?
Can we get closer to answering this question?
[Figure: scatter plot of StackOverflow Rank (# question tags) against Github Rank (# projects); ρ = 0.73]
[Figure: language clusters labeled Most Popular, Second Tier, Least Popular, High Variance, and Incomparable]
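A ρ computed between two rankings is a rank correlation. The sketch below assumes Spearman's version, calculated as the Pearson correlation of the two rank vectors with tied values sharing an average rank; the slides do not say which variant was used. The five count pairs are invented placeholders, not the StackOverflow or Github data.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical counts for five unnamed languages (invented numbers).
github_projects = [900, 500, 300, 120, 40]
stackoverflow_tags = [1200, 300, 700, 90, 100]

print(round(spearman(github_projects, stackoverflow_tags), 2))  # prints 0.8
```

Working on ranks rather than raw counts keeps one very large language from dominating the result, which matters when counts span several orders of magnitude.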
actionscript javascript asp objective-c common lisp assembly perl gosu c php openedge abl c# python pure data c++ r rust coffeescript ruby scilab haskell scala supercollider java shell viml arduino groovy visual basic clojure lua coldfusion matlab d ocaml delphi powershell ada io emacs lisp prolog applescript max erlang puppet arc nemerle f# racket autohotkey objective-j fortran scheme boo opa go tcl coq rebol apex julia dart self augeas kotlin eiffel smalltalk bro logtalk elixir vala ceylon mirah dcpu-16 asm nimrod fact verilog dylan nu or vhdl ec ooc haxe xquery ecl parrot fancy standard ml fantom turing ioke
WHY?
What makes a programming language popular?
Thoughts from Twitter
Resources
‣ A Taxonomy of Data Science, Hilary Mason & Chris Wiggins
• http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
‣ Building Data Science Teams, DJ Patil
• http://radar.oreilly.com/2011/09/building-data-science-teams.html
‣ Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, William S. Cleveland
• http://cm.bell-labs.com/stat/doc/datascience.pdf
‣ Data Science, Moore’s Law, and Moneyball, Harlan Harris
• http://www.harlan.harris.name/2011/09/data-science-moores-law-and-moneyball/