
Lecture Notes
Hou, Xue, and Zhang (2020, Review of Financial Studies, "Replicating Anomalies")
Lu Zhang (Ohio State and NBER)
FIN 8250, Autumn 2021, Ohio State

Theme
Replicating anomalies: most anomalies fail to hold up to currently acceptable standards for empirical finance.

Results
- Replicate the published anomalies literature with 452 variables.
- 65% cannot clear the single-test hurdle of |t| ≥ 1.96, with the influence of microcaps mitigated via NYSE breakpoints and value-weighted returns (see the first sketch at the end of these notes).
- Most (52%) anomalies fail to replicate irrespective of microcaps (NYSE-Amex-NASDAQ breakpoints and equal-weighted returns) once multiple testing is adjusted for (see the second sketch at the end of these notes).
- Similar results in the original samples: 65.3% versus 56.2%.
- The biggest casualty is the trading frictions category, with 96% failing the single tests.
- The magnitudes of the replicated anomalies are much smaller than originally reported.

Outline
1 Motivating Replication
2 Replicating Procedures
3 452 Anomalies
4 Replication Results
5 Impact

Motivation: Academia
- Lo and MacKinlay (1990), Fama (1998), Conrad, Cooper, and Kaul (2003), Schwert (2003), McLean and Pontiff (2016).
- Harvey, Liu, and Zhu (2016): 27–53% of 296 anomalies are false after adjusting for multiple testing.
- Two publication biases: it is hard to publish a nonresult, and difficult to publish replication studies in finance and economics.
- Harvey (2017): p-hacking, i.e., selecting sample criteria and test specifications until insignificant results become significant.

Motivation: Investors
- Bloomberg: "Investors Always Think They're Getting Ripped Off. Here's Why They'..." (https://www.bloomberg.com/news/articles/2017-04-06/lies-damn-lies-and...)
- ETFGI: $680 billion worldwide in smart beta ETFs and ETPs (8/2018).
- Coy (2017): "Researchers have more knobs to twist in search of a prized 'anomaly...' They can vary the period, the set of securities under consideration, or even the statistical method. Negative findings go in a file drawer; positive ones get submitted to a journal (tenure!) or made into an ETF whose performance we rely on for retirement."

Motivation
Ioannidis (2005) [slide reproduces Ioannidis's "Why Most Published Research Findings Are False"]

Motivation
Evidence: Baker (2016, Nature)
[Figure: "Is there a reproducibility crisis?" A Nature survey of 1,576 researchers on how they view the 'crisis' rocking science: 52% yes, a significant crisis; 38% yes, a slight crisis; 7% don't know; 3% no, there is no crisis.]

Motivation
Evidence: Baker (2016, Nature)
[Figure: "Have you failed to reproduce an experiment?" Most scientists have experienced failure to reproduce results, someone else's and often their own, across chemistry, biology, physics and engineering, medicine, and earth and environment. "What factors contribute to irreproducible research?" Many top-rated factors relate to intense competition and time pressure: selective reporting, pressure to publish, low statistical power or poor analysis, not replicated enough in the original lab, insufficient oversight/mentoring, methods or code unavailable, poor experimental design, raw data not available from the original lab, fraud, and insufficient peer review.]

Motivation
Evidence: Begley and Ellis (2012, Nature), "Raise standards for preclinical cancer research"
Many landmark findings in preclinical oncology research are not reproducible, in part because of inadequate cell lines and animal models. C. Glenn Begley and Lee M. Ellis propose how methods, publications and incentives must change if patients are to benefit.
Motivation
Evidence: Open Science Collaboration (2015, Science), a replication of 100 psychological studies
[Slide reproduces a page of the article. Recoverable highlights: 97 of 100 original studies reported positive effects; only 35 (36.1%) of the replications were significant in the same direction, against roughly 89 expected given an average replication power of 0.92; 41.1% of replication confidence intervals contained the original effect size, and the confidence-interval analysis suggests a 47.4% replication success rate; the mean P value was 0.028 in the original studies versus 0.302 in the replications.]
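
Sketch 1: the single-test replication procedure
A minimal sketch of the procedure referenced in the Results slide: sort stocks into deciles on an anomaly variable using NYSE breakpoints, compute value-weighted decile returns, and test whether the high-minus-low average return clears |t| ≥ 1.96. It assumes a hypothetical CRSP-style monthly panel with columns date, exchcd, me (lagged market equity), ret, and signal; an illustration of the idea, not the paper's actual code.

    import numpy as np
    import pandas as pd
    from scipy import stats

    def high_minus_low(panel: pd.DataFrame):
        """Mean monthly high-minus-low decile return and its t-statistic."""
        def one_month(df):
            # NYSE breakpoints: decile edges estimated from NYSE stocks only
            # (exchcd == 1 is the usual CRSP code), then applied to all
            # NYSE-Amex-NASDAQ stocks in the cross section.
            edges = df.loc[df["exchcd"] == 1, "signal"].quantile(np.arange(0.1, 1.0, 0.1))
            deciles = np.searchsorted(edges.to_numpy(), df["signal"].to_numpy())  # 0..9
            # Value-weighted return within each decile, weights = lagged market equity.
            vw = df.groupby(deciles).apply(lambda g: np.average(g["ret"], weights=g["me"]))
            return vw.get(9, np.nan) - vw.get(0, np.nan)

        hml = panel.groupby("date").apply(one_month).dropna()
        tstat = stats.ttest_1samp(hml, 0.0).statistic
        return hml.mean(), tstat

    # mean_ret, t = high_minus_low(panel)
    # Under the single-test criterion the anomaly "replicates" if abs(t) >= 1.96.

NYSE breakpoints and value-weighting keep microcaps from dominating the extreme deciles, which is precisely the channel through which many published anomalies lose significance.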
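Sketch 2: adjusting for multiple testing
With 452 simultaneous tests, the |t| ≥ 1.96 cutoff produces many false discoveries, which is the concern behind the 52% failure rate and the Harvey, Liu, and Zhu (2016) estimates. As one standard remedy, in the spirit of the adjustments Harvey, Liu, and Zhu consider (the paper itself works with a raised t-hurdle), here is a minimal Benjamini-Hochberg false-discovery-rate sketch; the t-statistics are made-up inputs.

    import numpy as np
    from scipy import stats

    def bh_significant(tstats, q=0.05):
        """Benjamini-Hochberg: flag tests significant at false-discovery rate q."""
        t = np.asarray(tstats, dtype=float)
        p = 2.0 * stats.norm.sf(np.abs(t))       # two-sided p-values
        order = np.argsort(p)
        m = len(p)
        # Largest k with p_(k) <= (k/m) * q; the k smallest p-values are kept.
        below = p[order] <= (np.arange(1, m + 1) / m) * q
        k = below.nonzero()[0].max() + 1 if below.any() else 0
        keep = np.zeros(m, dtype=bool)
        keep[order[:k]] = True
        return keep

    tstats = [2.10, 1.50, 3.40, 2.00, 4.20]      # hypothetical anomaly t-stats
    print(bh_significant(tstats))                # [False False  True False  True]

Four of the five hypothetical t-statistics clear 1.96, but only two survive the FDR control, which is the flavor of result behind "most anomalies fail to replicate if adjusting for multiple testing."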