Questions first, then data
Perspectives on data science from a social scientist

Drew Conway
Dept. of Politics, NYU
Monktoberfest - April 4, 2012

Wednesday, October 3, 12 How I got here...

Hacking Skills

I hold the following truths to be self-evident...
1. Data come from many sources
2. Data come in many form(at)s

[Chart labels: 10%, 10%, 80%]

Obtain

A .zip file of PDFs ≠ data
‣ Data scientist must know where to get data and how to obtain it
‣ Combine data from multiple sources/formats
‣ Work with APIs

$ curl http://search.twitter.com/search.json?q=@drewconway > drewconway.json

Munge

Real data are messy
‣ Even curated data: duplicates, missing values, date formats
Work with big text files
‣ $ head publicvotes-20101018_votes.dump
‣ Tools
  • *NIX tools: sed, awk, grep
  • Scripting languages: Python and R

$ cat ufo_awesome.tsv | grep probe | wc -l
131
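The three problems named under "Real data are messy" (duplicates, missing values, date formats) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the column names, and the list of date layouts are all hypothetical and do not come from the talk's data.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for a messy export; not data from the talk.
RAW = """id,date,value
1,2010-10-18,4.2
1,2010-10-18,4.2
2,10/19/2010,
3,19.10.2010,3.1
"""

# Date layouts we assume may appear; a real source would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def parse_date(text):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def munge(raw):
    """Drop exact duplicate rows, normalize dates, mark missing values as None."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw)):
        key = tuple(row.items())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append({
            "id": int(row["id"]),
            "date": parse_date(row["date"]),
            "value": float(row["value"]) if row["value"] else None,
        })
    return cleaned


rows = munge(RAW)
for row in rows:
    print(row)
```

Keeping missing values as None, rather than silently filling them, leaves the decision about imputation to the modeling step.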

Hacking Skills

80% of effort is spent here, but it is perhaps the most straightforward to train

Heavily tool focused, lean on current CS/IS curriculums
‣ Comfort working at the command-line, with text editors
‣ A language for every season!

Conveying findings in creative and compelling ways

Math & Stats Knowledge

If: Better data beats better math
Then: What methods should be taught?

Explore

How do you find structure in new data?
‣ Scatter plots
‣ Density plots
Data exploration that scales
‣ Reduce dimensionality
‣ PCA, SVD, MDS

Model

Methods must match data
‣ Text
‣ Geospatial
‣ Web-scale
What is the ‘best’ model?
‣ Most predictive
‣ Most parsimonious
‣ Cross-validation
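One way to read "most predictive" and "cross-validation" together is to compare candidate models by their held-out error. The sketch below is a minimal k-fold example in plain Python, not a method taken from the talk: the toy data, the two candidate models (a mean baseline and a least-squares line), and all function names are assumptions made for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def cv_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error across k folds."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(sum((ys[i] - model(xs[i])) ** 2 for i in fold) / len(fold))
    return sum(errors) / k


# Hypothetical toy data: a noisy linear trend, made up for illustration.
rng = random.Random(1)
xs = [float(i) for i in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
print("mean baseline:", mse_mean, "line:", mse_line)
```

Both models are scored only on rows they never saw during fitting, which is what makes the comparison a statement about prediction rather than fit.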

Math & Stats Knowledge

Universities good at methods training...
...but what methods fit into Data Science?

$&^! data scientists like...
‣ Describing the current state of the world
‣ Predicting future observations
‣ Classifying/ranking observations

$&^! social scientists like...
‣ Testable theoretical models
‣ Natural experiments
‣ Causality

1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations

Substantive Expertise

Data Science, as a discipline, is fundamentally about human behavior

[Chart labels: 10%, 10%, 80%]

Inquire

Focus on questions / not tech
‣ What new questions can be asked from web-scale data?
‣ Tools are a means to an end

Interpret

How do we know when the results we get make sense, if ever?
Social science has questions
‣ Markets
‣ Organization
‣ Decision making

Case study!

...or, how my friends and I spend hours hacking through messy data to answer seemingly trivial questions!

What makes a programming language popular?

How can we measure a programming language’s popularity?

Compare # of question tags on StackOverflow to projects on Github

✓ Compare # of question tags on StackOverflow to projects on Github

X How can we measure a programming language’s popularity?

Can we get closer to answering this question?

[Figure: scatter plot of StackOverflow rank (# question tags) against Github rank (# projects); ρ = 0.73]

[Figure: the same plot with regions labeled Most Popular, Incomparable, Second Tier, High Variance, Least Popular]
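The slide reports ρ = 0.73 without naming the coefficient. Since both axes are ranks, a Spearman rank correlation is a reasonable guess, and the sketch below shows how such a number could be computed from raw counts. The counts and the labels lang_a to lang_e are invented for illustration and are not the data behind the plot.

```python
def ranks(values):
    """Rank from largest (1) to smallest, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            result[order[pos]] = avg
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical counts, invented for illustration:
# (projects on Github, question tags on StackOverflow).
counts = {
    "lang_a": (500, 900),
    "lang_b": (400, 600),
    "lang_c": (300, 700),
    "lang_d": (200, 300),
    "lang_e": (100, 50),
}

github = [g for g, _ in counts.values()]
stackoverflow = [s for _, s in counts.values()]
rho = spearman(github, stackoverflow)
print(round(rho, 2))
```

Working on ranks rather than raw counts keeps one very large language from dominating the coefficient, which matters when counts span several orders of magnitude.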

actionscript javascript asp objective-c common lisp assembly perl gosu c php openedge abl c# python pure data c++ r rust coffeescript ruby scilab haskell scala supercollider java shell viml arduino groovy visual basic clojure lua coldfusion matlab d ocaml delphi powershell ada io emacs lisp prolog applescript max erlang puppet arc nemerle f# racket autohotkey objective-j fortran scheme boo opa go tcl coq rebol apex julia dart self augeas kotlin eiffel smalltalk bro logtalk elixir vala ceylon mirah dcpu-16 asm nimrod fact verilog dylan nu or vhdl ec ooc haxe xquery ecl parrot fancy standard ml fantom turing ioke

WHY?

What makes a programming language popular?

Thoughts from Twitter

Resources

‣ A Taxonomy of Data Science, Hilary Mason & Chris Wiggins
  • http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
‣ Building Data Science Teams, DJ Patil
  • http://radar.oreilly.com/2011/09/building-data-science-teams.html
‣ Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, William S. Cleveland
  • http://cm.bell-labs.com/stat/doc/datascience.pdf
‣ Data Science, Moore’s Law, and Moneyball, Harlan Harris
  • http://www.harlan.harris.name/2011/09/data-science-moores-law-and-moneyball/
