3 Conceptualizing, Measuring, and Studying Reproducibility
Pages 35-67

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 35...
... Dennis Boos (North Carolina State University), Andreas Buja (Wharton School of the University of Pennsylvania)
From page 36...
... William Whewell, Goodman noted, was one of the most important philosophers of science of the 19th century. He coined many common words such as scientist, physicist, ion, anode, cathode, and dielectric.
From page 37...
... Goodman noted that it is challenging to discuss reproducibility across the disciplines because there are both definitional and procedural differences among the various branches of science. He suggested that disciplines may cluster somewhat into groups with similar cultures as follows:
• Clinical and population-based sciences (e.g., epidemiology, clinical research, social science)
From page 38...
... , which state that replication of results requires that the analytical data set, the computer code and its methods, and all of the metadata and documentation necessary to run that code be available, and that standard methods for distribution be used. Goodman explained that, in this case, reproducible research was research where you could see the data, run the code, and go from there to see if you could replicate the results or make adjustments.
From page 39...
... In addition, the software environment necessary to execute that code is available.
Documentation: Adequate documentation of the computer code, software environment, and analytical data set is available to enable others to repeat the analyses and to conduct other similar ones.
From page 40...
... Goodman referenced several recent and current reproducibility efforts, including the Many Labs Replication Project, aimed at replicating important findings in social psychology. He noted that replication in this context means repeating the experiment to see if the findings are the same.
From page 41...
... Glenn Begley and John Ioannidis: This has been highlighted empirically in preclinical research by the inability to replicate the majority of findings presented in high-profile journals. The estimates for irreproducibility based on these empirical observations range from 75% to 90%.
From page 42...
... Goodman noted that there are many cases in epidemiology and clinical research in which an investigation leads to a significant finding, but no associated claims or interpretations are stated, other than to assert that the finding is interesting and should be examined further. He does not think this is necessarily a false finding, and it may be a proper conclusion.
From page 43...
... In terms of results, Goodman explained, the methods used to assess replication are not clear. He noted that some methods agreed upon within different disciplinary cultures include contrasting statistical significance and assessing statistical ...
Footnote 3: The Consolidated Standards of Reporting Trials website is http://www.consort-statement.org, accessed January 6, 2016.
From page 44...
... Goodman mentioned a highly discussed editorial (Trafimow and Marks, 2015) in Basic and Applied Social Psychology whose authors banned the use of indices related to null hypothesis significance testing procedures in that journal (including
From page 45...
... The analogous movement toward clinical trial registration worked only when the top journals declared they would not publish clinical trials that were not preregistered. While any one journal could stand up and be the first to similarly require data sharing, Goodman stressed that this is unlikely given the highly competitive publishing environment.
From page 46...
... Many data-sharing disciplines are already realizing this benefit.

Yoav Benjamini, Tel Aviv University

Yoav Benjamini began his presentation by discussing the underlying effort of reproducibility: to discover the truth.
From page 47...
... Similar findings in at least two studies are a minimal requirement, but the strength of the replicability argument increases as the number of studies with findings in agreement increases. When screening many potential findings, the r-value of a finding is the smallest false discovery rate at which it would be among those for which the alternative is true in at least two studies.
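As a rough illustration of the r-value idea in the simplest case of two studies, the Python sketch below computes, for each screened finding, the smallest false discovery rate at which it would be declared replicated in both studies. The p-values and the specific construction (a Benjamini-Hochberg adjustment of each finding's larger p-value) are illustrative assumptions, not a procedure quoted from the presentation.

import numpy as np

def r_values_two_studies(p1, p2):
    # For each finding, max(p1, p2) is a p-value for "the effect is present
    # in both studies"; a Benjamini-Hochberg (step-up) adjustment of these
    # maxima gives each finding's r-value: the smallest false discovery rate
    # at which it would be among the findings declared replicated.
    p_max = np.maximum(np.asarray(p1, float), np.asarray(p2, float))
    m = p_max.size
    order = np.argsort(p_max)
    adjusted = p_max[order] * m / np.arange(1, m + 1)
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]  # keep step-up values monotone
    r = np.empty(m)
    r[order] = np.minimum(adjusted, 1.0)
    return r

# Hypothetical p-values for four findings screened in two studies.
print(r_values_two_studies([0.001, 0.04, 0.20, 0.003], [0.002, 0.30, 0.01, 0.004]))

Findings with small p-values in both studies receive small r-values; a small p-value in only one study does not.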
From page 48...
... . If regression coefficients β are reported, the standard error is easy to produce unless model selection is used.
From page 49...
... Using the Bayes factor, which converts the prior odds into the posterior odds of the alternative over the null hypothesis, he illustrated that Bayes factors show repeatability problems similar to those of p-values when the values are near the threshold. To conclude, Boos summarized that under the null hypothesis, p-values have large variability.
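A small simulation can make the variability point concrete. The Python sketch below (the sample size, effect size, and seed are arbitrary choices for illustration) shows that p-values are uniformly distributed, hence maximally variable, when the null is true, and that even a real effect with roughly 50 percent power leaves a replication's p-value highly unpredictable.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 25, 10_000

# One-sample t-tests on pure-noise data: under the null, p-values are
# uniform on [0, 1], so their spread across repetitions is as large as possible.
null_p = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
                   for _ in range(reps)])

# The same test with a true mean shift of 0.4 standard deviations
# (roughly 50 percent power at the 0.05 level for n = 25).
effect_p = np.array([stats.ttest_1samp(rng.normal(0.4, 1.0, n), 0.0).pvalue
                     for _ in range(reps)])

print("null:   10th-90th percentile of p =", np.round(np.percentile(null_p, [10, 90]), 3))
print("effect: 10th-90th percentile of p =", np.round(np.percentile(effect_p, [10, 90]), 4))
print("share of replications with p < 0.05 given a true effect:",
      round(float(np.mean(effect_p < 0.05)), 2))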
From page 50...
... Statistical methods should take into account all data analytic activity, according to Buja. This includes the following:
• Revealing all exploratory data analysis, in particular visualizations;
• Revealing all model searching (e.g., lasso, forward/backward/all-subsets, Bayesian, cross-validated, Akaike information criterion, Bayesian information criterion, residual information criterion)
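A toy simulation of the kind of model searching Buja warned about may help. In the Python sketch below (the sample size and the number of candidate predictors are arbitrary assumptions), the outcome is pure noise, yet picking the single best of 20 candidate predictors and then reporting its naive p-value, as if that predictor had been chosen in advance, produces "significant" results most of the time.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, sims = 100, 20, 2000

false_positives = 0
for _ in range(sims):
    X = rng.normal(size=(n, k))   # candidate predictors, all pure noise
    y = rng.normal(size=n)        # outcome unrelated to every predictor
    # "Model search": keep whichever predictor correlates best with y.
    correlations = [stats.pearsonr(X[:, j], y)[0] for j in range(k)]
    best = int(np.argmax(np.abs(correlations)))
    # Naive p-value that ignores the search over 20 candidates.
    p_naive = stats.pearsonr(X[:, best], y)[1]
    false_positives += p_naive < 0.05

print("naive false-positive rate after selection:", false_positives / sims)
# With 20 roughly independent candidates this is close to 1 - 0.95**20,
# about 0.64, rather than the nominal 0.05.

The same selection effect is why, as noted earlier, standard errors become hard to trust once model selection has been used.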
From page 51...
... Johnson offered the following review of the Bayesian approach to hypothesis testing: Bayes' theorem provides the posterior odds between two hypotheses after seeing the data, and these posterior odds are equal to the Bayes factor times the prior odds between the two hypotheses.
From page 52...
... If one assumes that equal prior probability is assigned to the null hypothesis and the alternative hypothesis, then there is an equivalence between the p-value and the posterior probability that the null hypothesis is true. Under this assumption and using UMPBT to calculate posterior probabilities and prior odds, a p-value of 0.05 leads to a posterior probability of the null hypothesis of about 20 percent.
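The arithmetic behind a figure like that can be written out in a line or two; the Bayes factor of roughly 4 used below is only an illustrative round number, not one quoted from the presentation.

\[
\frac{P(H_1 \mid \text{data})}{P(H_0 \mid \text{data})}
  = \underbrace{\frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}}_{\text{Bayes factor } \mathrm{BF}_{10}}
    \times \frac{P(H_1)}{P(H_0)},
\qquad
P(H_0)=P(H_1)=\tfrac12,\ \mathrm{BF}_{10}\approx 4
\ \Rightarrow\
P(H_0 \mid \text{data}) = \frac{1}{1+\mathrm{BF}_{10}} \approx 0.2 .
\]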
From page 53...
... Johnson explained that Bayes factors between experiments multiply together naturally so they serve as an easy way to combine information across multiple experiments.
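As a small worked example with invented numbers: two independent experiments yielding Bayes factors of 3 and 5 in favor of the alternative combine multiplicatively.

\[
\mathrm{BF}_{10}^{(1+2)} = \mathrm{BF}_{10}^{(1)} \times \mathrm{BF}_{10}^{(2)} = 3 \times 5 = 15,
\qquad
P(H_0 \mid \text{both experiments}) = \frac{1}{1+15} \approx 0.06 \ \text{(equal prior odds)} .
\]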
From page 54...
... Johnson agreed that there are many sources of irreproducibility in scientific studies and statistics, and the use of elevated significance thresholds is just one of many factors. A participant summarized that the level of evidence needs to be much higher than the current standard, be it through Bayes factors or p-values.
From page 55...
... Assessment of Factors Affecting Reproducibility

Marc Suchard, University of California, Los Angeles

Marc Suchard began his presentation by discussing some recent cases of conflicting observational studies using nationwide electronic health records. The first example given was the case of assessing the exposure to oral bisphosphonates and the risk of esophageal cancer.
From page 56...
... The experiment now has about 500 such statements about "ground truth," which provide some information on the null distribution of any statistic under any method. Suchard noted that this is important to counter some of the confounders that cannot be controlled in observational studies.
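One way to read "information on the null distribution" is as an empirical null that can recalibrate inference. The Python sketch below is a deliberately simplified illustration of that idea: the negative-control estimates, their bias and spread, the new estimate, and its standard error are all invented numbers, and the calibration actually used in this line of work also models each estimate's own sampling error.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Invented log hazard-ratio estimates for 500 negative controls:
# exposure-outcome pairs believed to have no causal effect, yet showing a
# systematic bias of +0.15 and extra spread from residual confounding.
negative_controls = rng.normal(loc=0.15, scale=0.25, size=500)

# Empirical null fitted to the negative controls, rather than the textbook
# null of an unbiased estimate with its nominal standard error.
mu_hat = negative_controls.mean()
sigma_hat = negative_controls.std(ddof=1)

new_estimate, nominal_se = 0.40, 0.12   # estimate for the exposure of interest
p_nominal = 2 * stats.norm.sf(abs(new_estimate) / nominal_se)
p_calibrated = 2 * stats.norm.sf(abs(new_estimate - mu_hat) / sigma_hat)

print(f"nominal p = {p_nominal:.4f}, calibrated p = {p_calibrated:.4f}")

Against the empirical null, an estimate that looks highly significant under its nominal standard error is unremarkable, which is the sense in which negative controls help counter uncontrolled confounding.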
From page 57...
... , the public-private partnership Observational Health Data Sciences and Informatics group was established to construct reproducible tools for large-scale observational studies in health care. This group consists of more than 80 collaborators across 10 countries and has a shared, open data model (tracking more than 600 million people)
From page 58...
... . Soderberg explained that a good way to assess reproducibility rates across disciplines is through a large-scale reproducibility project, such as the Center for Open Science's two reproducibility projects in psychology and cancer biology. The idea behind both reproducibility projects is to better understand reproducibility rates in these two disciplines.
From page 59...
... . He proposed that instead of focusing on a measure of statistical inference, such as a p-value, adjusted p-value, normalized p-value, or a Bayes factor equivalent, it may be useful to see where an effect size falls in the distribution of effect sizes seen across a field in typical situations.
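Placing an effect in a field-wide distribution can be as simple as a percentile lookup; in this Python sketch the field's effect sizes and the newly observed effect are invented numbers.

import numpy as np

# Invented effect sizes (e.g., standardized mean differences) observed
# across a field in typical studies, plus a newly observed effect.
field_effects = np.array([0.05, 0.10, 0.12, 0.18, 0.22, 0.25, 0.30, 0.35, 0.41, 0.60])
observed = 0.35

percentile = 100 * np.mean(field_effects <= observed)
print(f"The observed effect sits at about the {percentile:.0f}th percentile for this field.")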
From page 60...
... Q&A

A participant congratulated Suchard on his important work in quantifying uncertainty associated with different designs in pharmacoepidemiology applications but wondered to what extent the methodology is extendable to other applications, including to studies that are not dependent on databases. Suchard explained that it is difficult to know how to transfer the methodology to studies that are not easy to replicate, but there may be some similarities in applying it to observational studies with respect to comparative effectiveness research.
From page 61...
... Some involve shared task workshops (such as the Conference on Natural Language Learning, Open Keyword Search Evaluation, Open Machine Translation Evaluation, Reconnaissance de Personnes dans les Émissions Audiovisuelles, Speaker Recognition Evaluation, Text REtrieval Conference, TREC Video Retrieval Evaluation, and Text Analysis Conference) where all participants utilize a given data set or evaluation metrics to produce a result and report on it.
From page 62...
... In the 1960s, he explained, there was strong resistance to such group approaches to natural language research, the implication being that basic scientific work was needed first. For example, Liberman cited Languages and Machines: Computers in Translation and Linguistics, which recommended that machine translation funding "should be spent hardheadedly toward important, realistic, and relatively short-range goals" (NRC, 1966)
From page 63...
... He also credited the common task method with creating a new culture where researchers exchanged methods and results on shared data with a common metric. This participation in the culture became so valuable that many research groups joined without funding.
From page 64...
... He noted that progress usually comes from many small improvements and that shared data play a crucial role because they can be reused in unexpected ways. Liberman concluded by stating that while science and engineering cultures vary, sharing data and problems lowers the cost of entry, creates intellectual communities, and speeds up replication and extension.
From page 65...
... Informatics tools can be used in a number of environments to deal with information flow, depending on what properties and reproducibility claims are being examined.

TABLE 3.2 Some Types of Reproducibility Issues and Use Cases
Common Labels | Reproducibility-Related Issue | Example Interventions
Misconduct, bit rot, author ... | Data were fabricated, corrupted, ... | Discipline/community data archives
From page 66...
... Data dredging: multiple comparisons; p-hacking | Author bias to creating significant results, resulting in a difference between the stated method/analysis and the actual (complete) method/analysis | Holdout data escrow
Sensitivity, robustness | Variance of support for claims across specification change | Sensitivity analysis
Reliability | Variance of support for claims across repeated measures, samples | Meta-analysis; Cochrane review; Data integration
Generalizability | Variance of support for claims across different frames | Cochrane review
Laws, truth | Variance of support for claims to ... | Grand challenge?
From page 67...
... • Which information flows and systems are most closely associated with these inferential claims?
• Which properties of information systems support generating these inferential claims?

