Big Data: New Tricks for Econometrics†
Journal of Economic Perspectives, Volume 28, Number 2, Spring 2014, Pages 3–28

Hal R. Varian

Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools.

First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships.

In this essay, I will describe a few of these tools for manipulating and analyzing big data.
I believe that these methods have a lot to offer and should be more widely known and used by economists. In fact, my standard advice to graduate students these days is go to the computer science department and take a class in machine learning. There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future.

■ Hal Varian is Chief Economist, Google Inc., Mountain View, California, and Emeritus Professor of Economics, University of California, Berkeley, California. His email address is [email protected].

† To access the Appendix and disclosure statements, visit http://dx.doi.org/10.1257/jep.28.2.3
doi=10.1257/jep.28.2.3

Tools to Manipulate Big Data

Economists have historically dealt with data that fits in a spreadsheet, but that is changing as new more-detailed data becomes available (see Einav and Levin 2013, for several examples and discussion). If you have more than a million or so rows in a spreadsheet, you probably want to store it in a relational database, such as MySQL.
Relational databases offer a flexible way to store, manipulate, and retrieve data using a Structured Query Language (SQL), which is easy to learn and very useful for dealing with medium-sized datasets.

However, if you have several gigabytes of data or several million observations, standard relational databases become unwieldy. Databases to manage data of this size are generically known as “NoSQL” databases. The term is used rather loosely, but is sometimes interpreted as meaning “not only SQL.” NoSQL databases are more primitive than SQL databases in terms of data manipulation capabilities but can handle larger amounts of data.

Due to the rise of computer-mediated transactions, many companies have found it necessary to develop systems to process billions of transactions per day. For example, according to Sullivan (2012), Google has seen 30 trillion URLs, crawls over 20 billion of those a day, and answers 100 billion search queries a month. Analyzing even one day’s worth of data of this size is virtually impossible with conventional databases.
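
As a rough illustration of the store, manipulate, and retrieve workflow that SQL offers for medium-sized data, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational database such as MySQL; the table name, column names, and records are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical transaction records (illustrative values, not real data).
rows = [
    ("2014-01-01", "books", 12.50),
    ("2014-01-01", "music", 7.25),
    ("2014-01-02", "books", 20.00),
    ("2014-01-02", "books", 5.50),
]


def total_by_category(records):
    """Store records in an in-memory SQLite table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        # Store: create a table and load the records.
        conn.execute(
            "CREATE TABLE transactions (day TEXT, category TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", records)
        # Manipulate and retrieve: total spending per category.
        cursor = conn.execute(
            "SELECT category, SUM(amount) FROM transactions "
            "GROUP BY category ORDER BY category"
        )
        return cursor.fetchall()
    finally:
        conn.close()


totals = total_by_category(rows)
print(totals)
```

The same query text would apply to a much larger table; what changes with scale is whether a single database server can still execute it in reasonable time.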
The challenge of dealing with datasets of this size led to the development of several tools to manage and analyze big data. A number of these tools are proprietary to Google, but have been described in academic publications in sufficient detail that open-source implementations have been developed. Table 1 contains both the Google name and the name of related open-source tools. Further details can be found in the Wikipedia entries associated with the tool names.

Though these tools can be run on a single computer for learning purposes, real applications use large clusters of computers such as those provided by Amazon, Google, Microsoft, and other cloud-computing providers. The ability to rent rather than buy data storage and processing has turned what was previously a fixed cost of computing into a variable cost and has lowered the barriers to entry for working with big data.
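
To give a feel for the map-and-reduce pattern used by MapReduce, one of the tools listed in Table 1, here is a toy single-machine sketch. It is not Google's or Hadoop's implementation: the function names are hypothetical, the search-query shards are made up, and a small thread pool stands in for the hundreds or thousands of machines a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_shard(shard):
    """Map step: compute a partial count of queries within one shard."""
    return Counter(shard)


def reduce_counts(partials):
    """Reduce step: merge the partial counts into one summary table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def map_reduce(shards, workers=2):
    """Apply the map step to each shard in parallel, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_shard, shards))
    return reduce_counts(partials)


# Hypothetical shards of a query log, each of which would live on a
# different machine in a real cluster.
shards = [["econ", "data", "econ"], ["data", "varian"], ["econ"]]
summary = map_reduce(shards)
print(summary)
```

Because each shard is processed independently, the map step scales by adding workers, and only the much smaller partial counts need to be brought together in the reduce step.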
Tools to Analyze Data

The outcome of the big-data processing described above is often a “small” table of data that may be directly human readable or can be loaded into an SQL database, a statistics package, or a spreadsheet. If the extracted data is still inconveniently large, it is often possible to select a subsample for statistical analysis. At Google, for example, I have found that random samples on the order of 0.1 percent work fine for analysis of business data.

Once a dataset has been extracted, it is often necessary to do some exploratory data analysis along with consistency and data-cleaning tasks. This is something

Table 1
Tools for Manipulating Big Data

Google name | Analog | Description
Google File System | Hadoop File System | This system supports files so large that they must be distributed across hundreds or even thousands of computers.
Bigtable | Cassandra | This is a table of data that lives in the Google File System. It too can stretch over many computers.
MapReduce | Hadoop | This is a system for accessing and manipulating data in large data structures such as Bigtables. MapReduce allows you to access the data in parallel, using hundreds or thousands of machines to extract the data you are interested in. The query is “mapped” to the machines and is then applied in parallel to different shards of the data. The partial calculations are then combined (“reduced”) to create the summary table you are interested in.
Sawzall | Pig | This is a language for creating MapReduce jobs.
Go | None | Go is a flexible open-source, general-purpose computer language that makes it easier to do parallel data processing.
Dremel, BigQuery | Hive, Drill, Impala | This is a tool that allows data queries to be written in a simplified form of Structured Query Language (SQL). With Dremel it is possible to run an SQL