
Currently Skimming:

5 Replicability
Pages 71-104

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 71...
... Beginning with an examination of methods to assess replicability, in this chapter we discuss evidence that bears on the extent of non-replicability in scientific and engineering research and examine factors that affect replicability. Replication is one of the key ways scientists build confidence in the scientific merit of results.
From page 72...
... The nature of the problem under study, the prior likelihoods of possible results, the type of measurement instruments and research design selected, and the novelty of the area of study (and therefore the lack of established methods of inquiry) can also contribute to non-replicability. Because of the complicated relationship between replicability and its variety of sources, the validity of scientific results should be considered in the context of an entire body of evidence, rather than an individual study or an individual replication.
From page 73...
... 3 See Table 5-1 for an example of this in the reviews of a psychology replication study by the Open Science Collaboration (2015)
From page 74...
... Rather than focus on an arbitrary threshold such as statistical significance, it would be more revealing to consider the distributions of observations and to examine how similar these distributions are. This examination would include summary measures, such as proportions, means, standard deviations (or uncertainties)
From page 75...
... However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained "statistical significance," that is, when the p-values in both studies have fallen below a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are.
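To make the contrast concrete, the following sketch (our illustration, with hypothetical data; not an analysis from the report) compares an original study and a replication in both ways: first with the restrictive "both p-values below the threshold" criterion, and then by summarizing and comparing the two distributions of observations.

# Illustrative sketch only: hypothetical observations from an original study
# and a replication, compared in two ways.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical raw observations (e.g., treatment-minus-control differences).
original = rng.normal(loc=0.40, scale=1.0, size=50)
replication = rng.normal(loc=0.30, scale=1.0, size=80)

# Approach 1: the restrictive "both p-values below 0.05" criterion.
p_orig = stats.ttest_1samp(original, 0.0).pvalue
p_rep = stats.ttest_1samp(replication, 0.0).pvalue
both_significant = (p_orig < 0.05) and (p_rep < 0.05)

# Approach 2: compare the distributions themselves via summary measures.
for name, x in [("original", original), ("replication", replication)]:
    mean = x.mean()
    sem = stats.sem(x)
    lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
    print(f"{name}: mean={mean:.2f}, sd={x.std(ddof=1):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")

print(f"both significant? {both_significant} | p_orig={p_orig:.3f}, p_rep={p_rep:.3f}")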
From page 76...
... Assessments of Replicability
The most direct method to assess replicability is to perform a study following the original methods of a previous study and to compare the new results to the original ones. Some high-profile replication efforts in recent years include studies by Amgen, which showed low replication rates in biomedical research (Begley and Ellis, 2012)
From page 77...
... Replication studies such as those shown in Table 5-1 are not necessarily indicative of the actual rate of non-replicability across science for a number of reasons: the studies to be replicated were not randomly chosen, the replications had methodological shortcomings, many replication studies are not reported as such, and the reported replication studies found widely varying rates of non-replication (Gilbert et al., 2016)
From page 78...
... [Fragment of Table 5-1: a row whose results concern life outcomes compared with the corresponding original effects, followed by a Behavioral Science entry in which multiple laboratories (23 in total) conducted a meta-analysis of the studies.]
From page 79...
... effect sizes smaller than the original ones, and 25% yielded larger effect sizes than the original ones. [From Table 5-1] Experimental Psychology (Open Science Collaboration, 2015), direct replication: attempt to independently replicate selected results from 100 studies in psychology; 36% of the replication studies produced significant results, compared to 97% of the original studies.
From page 80...
... of solubility, viscosity, critical temperature, and vapor pressure. [From Table 5-1] Biology (Reproducibility Project: Cancer Biology), direct replication: large-scale replication project to replicate key results in 29 cancer papers published in Nature, Science, Cell, and other high-impact journals; the first five articles have been published; two replicated important parts of the original papers, one did not replicate, and two were uninterpretable. Psychology (Statistical Checks), indirect replication: Statcheck tool used to test statistical values within psychology; 49.6% of the articles with null hypothesis statistical test (NHST)
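The Statcheck-style check mentioned in this table row can be approximated by recomputing a p-value from a reported test statistic and its degrees of freedom and comparing it with the reported p-value. The sketch below is a simplified illustration with hypothetical reported numbers; it is not the statcheck package itself.

# Sketch of a statcheck-style consistency check: recompute p from a reported
# t statistic and degrees of freedom, then compare it with the reported
# (two-sided) p-value. Hypothetical numbers for illustration.
from scipy import stats

def check_t_report(t_value, df, reported_p, tol=0.01):
    """Return (recomputed_p, consistent?) for a reported two-sided t test."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)
    return recomputed_p, abs(recomputed_p - reported_p) <= tol

# Example: a hypothetical article reports t(28) = 2.10, p = .04
p, ok = check_t_report(t_value=2.10, df=28, reported_p=0.04)
print(f"recomputed p = {p:.3f}; consistent with report? {ok}")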
From page 81...
... was not lower than the original effect size. For studies reporting null results, we treated as successful replications those for which original effect sizes fell inside the bounds of the 95 percent CI." b From Soto (2019, p.
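One plausible way to encode confidence-interval criteria like those quoted above is sketched below; the exact rule used in the cited study may differ, so the functions and thresholds here are illustrative assumptions of our own.

# Hypothetical encoding of CI-based replication criteria: for a nonzero
# original effect, count the replication as successful if its effect is not
# lower than the original effect size (here, via the CI upper bound); for an
# original null result, count it as successful if the original effect size
# falls inside the replication's 95% CI.

def replicates_positive_result(original_effect, replication_ci):
    """Replication CI upper bound is not lower than the original effect size."""
    _, hi = replication_ci
    return hi >= original_effect

def replicates_null_result(original_effect, replication_ci):
    """Original effect size falls inside the replication's 95% CI."""
    lo, hi = replication_ci
    return lo <= original_effect <= hi

print(replicates_positive_result(0.30, (0.10, 0.45)))  # True
print(replicates_null_result(0.02, (-0.05, 0.08)))     # True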
From page 82...
... Surveys and studies have also assessed the prevalence of specific problematic research practices, such as a 2018 survey about questionable research practices in ecology and evolution
5 Nature uses the word "reproducibility" to refer to what we call "replicability."
From page 83...
... High-quality researcher surveys are expensive and pose significant challenges, including constructing exhaustive sampling frames, reaching adequate response rates, and minimizing other nonresponse biases that might differentially affect respondents at different career stages or in different professional environments or fields of study (Corley et al., 2011; Peters et al., 2008; Scheufele et al., 2009)
From page 84...
... Such a comprehensive effort would be daunting due to the vast amount of research published each year and the diversity of scientific and engineering fields. Among studies of replication that are available, there is no uniform approach across scientific fields to gauge replication between two studies.
From page 85...
... , and problems in study design, execution, or interpretation in either the original study or the replication attempt. In many instances, non-replication between two results could be due to a combination of multiple sources, but it is not generally possible to identify the source without careful examination of the two studies.
From page 86...
... The complexity and controllability of a system contribute to the underlying variance of the distribution of expected results and thus the likelihood of non-replication.7
7 Complexity and controllability in an experimental system affect its susceptibility to non-replicability independently from the way prior odds, power, or p-values associated with hypothesis testing affect the likelihood that an experimental result represents the true state of the world.
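The footnote's point about prior odds, power, and p-value thresholds can be illustrated with the standard positive-predictive-value calculation; the sketch below uses hypothetical numbers of our own choosing and is not drawn from the report.

# Probability that a "significant" finding reflects a true effect, given the
# prior probability that the tested hypothesis is true, the statistical power,
# and the significance threshold alpha. Hypothetical inputs for illustration.

def positive_predictive_value(prior, power, alpha):
    """P(effect is real | test is significant) under standard assumptions."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Low prior odds and modest power make false positives relatively likely.
print(positive_predictive_value(prior=0.10, power=0.80, alpha=0.05))  # ~0.64
print(positive_predictive_value(prior=0.50, power=0.80, alpha=0.05))  # ~0.94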
From page 87...
... The Lifespan of Worms
In 2013, three researchers set out to clarify inconsistent research results on compounds that could extend the lifespan of lab animals (Phillips et al., 2017)
From page 88...
... Figure 5-2 illustrates the combinations of complexity and controllability. Many scientific fields have studies that span these quadrants, as demonstrated by the following examples from engineering, physics, and psychology.
From page 89...
... Physics.  In physics, measurement of the electronic band gap of semiconducting and conducting materials using scanning tunneling microscopy is a highly controlled, simple system (Quadrant A)
From page 90...
... When the sources are knowable, or arise from experimental design choices, researchers need to identify and assess these sources of uncertainty insofar as they can be estimated. Researchers also need to report the steps that were intended to reduce uncertainties inherent in the study or that differ from the original study (e.g., data cleaning decisions that resulted in a different final dataset)
From page 91...
... We consider here a selected set of such avoidable sources of non-replication:
• publication bias
• misaligned incentives
• inappropriate statistical inference
• poor study design
• errors
• incomplete reporting of a study
We will discuss each source in turn.
Publication Bias
Both researchers and journals want to publish new, innovative, ground-breaking research.
From page 92...
... Figure 5-3 shows how publication bias can result in a skewed view of the body of evidence when only positive results that meet the statistical significance threshold are reported. When a new study fails to replicate the previously published results -- for example, if a study finds no relationship between variables when such a relationship had been shown in previously published studies -- it appears to be a case of non-replication.
From page 93...
... FIGURE 5-3 Funnel charts plotting the estimated coefficient against the standard error (a) if all hypothetical study experiments are reported and (b) if only positive, statistically significant results are reported; each dot represents a hypothetical study. SOURCE: Justin Wolfers, lessons from replicating death penalty research.
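A funnel chart of this kind can be reproduced qualitatively with a small simulation showing how a significance filter skews the visible evidence; the simulation below is our illustration with hypothetical studies and a true effect of zero, not the data behind Figure 5-3.

# Simulate many hypothetical studies of a true effect of zero, then compare
# the full set of estimates with the subset that would survive a
# "positive and statistically significant" publication filter.
import numpy as np

rng = np.random.default_rng(1)
n_studies = 500
standard_errors = rng.uniform(0.5, 4.0, size=n_studies)  # varying precision
estimates = rng.normal(loc=0.0, scale=standard_errors)   # true effect = 0

z = estimates / standard_errors
published = (estimates > 0) & (np.abs(z) > 1.96)          # significance filter

print(f"all studies:      mean estimate = {estimates.mean():+.2f}")
print(f"published subset: mean estimate = {estimates[published].mean():+.2f} "
      f"(n = {int(published.sum())})")
# The published subset is biased upward even though the true effect is zero.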
From page 94...
... has demonstrated that the proportion of statistically significant results across a set of psychology studies often far exceeds the estimated statistical power of those studies; this pattern of results that is "too good to be true" suggests that the results were not obtained following the rules of statistical inference (i.e., conducting a single statistical test that was chosen a priori), that not all studies attempted were reported (i.e., there is a "file drawer" of statistically nonsignificant studies that do not get published), or possibly that the results were p-hacked or cherry picked (see Chapter 2)
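The "too good to be true" pattern can be quantified with a simple binomial check of excess significance: given the studies' estimated average power, how unlikely is the observed count of significant results? The numbers below are hypothetical and for illustration only.

# Test of excess significance (sketch): compare the observed number of
# statistically significant results in a set of studies with the number
# expected from the studies' estimated average power. Hypothetical inputs.
from scipy import stats

n_studies = 20
observed_significant = 19
estimated_average_power = 0.45  # e.g., estimated from reported effects and sample sizes

expected_significant = n_studies * estimated_average_power
# Probability of seeing at least this many significant results if each study
# had only the estimated power to detect the effect.
p_excess = stats.binom.sf(observed_significant - 1, n_studies, estimated_average_power)

print(f"expected ~{expected_significant:.1f} significant results; observed {observed_significant}")
print(f"P(at least {observed_significant} significant | power=0.45) = {p_excess:.2e}")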
From page 95...
... . Assessing Unpublished Literature.  One approach to countering publication bias is to search for and include unpublished papers and results when conducting a systematic review of the literature.
From page 96...
... Exploratory and confirmatory research are essential parts of science, but they need to be understood and communicated as two separate types of inquiry, with two different interpretations. A well-conducted exploratory analysis can help illuminate possible hypotheses to be examined in subsequent confirmatory analyses.
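One common safeguard consistent with this separation is to generate hypotheses on one portion of the data and run a single pre-specified test on a held-out portion. The split-sample sketch below is a generic illustration under assumptions of our own (hypothetical data and a correlation hypothesis), not a procedure prescribed by the report.

# Sketch of separating exploratory from confirmatory analysis by splitting
# the data: hypotheses are generated freely on the exploration half and a
# single pre-specified test is run once on the held-out confirmation half.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)         # hypothetical dataset

explore_idx = rng.permutation(200)[:100]    # exploration half
confirm_idx = np.setdiff1d(np.arange(200), explore_idx)

# Exploratory phase: inspect the data freely, then commit to ONE hypothesis.
r_explore, _ = stats.pearsonr(x[explore_idx], y[explore_idx])
print(f"exploratory correlation: {r_explore:.2f} -> hypothesis: x and y are correlated")

# Confirmatory phase: a single pre-specified test on untouched data.
r_confirm, p_confirm = stats.pearsonr(x[confirm_idx], y[confirm_idx])
print(f"confirmatory correlation: {r_confirm:.2f}, p = {p_confirm:.3f}")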
From page 97...
... Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives (John et al., 2012; Munafo et al., 2017)
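The false-positive inflation from unplanned, post hoc testing can be seen in a small simulation: when many tests are run on data with no true effects, the chance that at least one reaches p < 0.05 greatly exceeds the nominal 5 percent. The simulation below is illustrative, not from the report.

# Simulate post hoc testing on data with no true effects: running many
# unplanned tests and reporting any that reach p < 0.05 yields false
# positives far more often than the nominal 5% rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_simulations, n_post_hoc_tests, n_per_group = 2000, 20, 30
at_least_one_hit = 0

for _ in range(n_simulations):
    # Two groups drawn from the same distribution: every "effect" is noise.
    a = rng.normal(size=(n_post_hoc_tests, n_per_group))
    b = rng.normal(size=(n_post_hoc_tests, n_per_group))
    pvals = stats.ttest_ind(a, b, axis=1).pvalue
    at_least_one_hit += int(np.any(pvals < 0.05))

# Roughly 1 - 0.95**20, i.e., about 0.64 rather than 0.05.
print(f"P(at least one 'significant' result among {n_post_hoc_tests} null tests) "
      f"= {at_least_one_hit / n_simulations:.2f}")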
From page 98...
... It is unknown to what extent inappropriate HARKing occurs in various disciplines, but some have attempted to quantify the consequences of HARKing. For example, a 2015 article compared hypothesized effect sizes against non-hypothesized effect sizes and found that effects were significantly larger when the relationships had been hypothesized, a finding consistent with the presence of HARKing (Bosco et al., 2015)
From page 99...
... In the case of computational errors, transparency in data and computation may make it more likely that the errors can be caught and corrected. For other errors, such as mistakes in measurement, errors might not be detected until and unless a failed replication that does not make the same mistake indicates that something was amiss in the original study.
From page 100...
... Even if a researcher reports all of the critical information about the conduct of a study, other seemingly inconsequential details that have an effect on the outcome could remain unreported. Just as reproducibility requires transparent sharing of data, code, and analysis, replicability requires transparent sharing of how an experiment was conducted and the choices that were made.
From page 101...
... It can be difficult in practice to differentiate between honest mistakes and deliberate misconduct because the underlying action may be the same while the intent is not. Reproducibility and replicability emerged as general concerns in science around the same time as research misconduct and detrimental research practices were receiving renewed attention.
From page 102...
... include failing to follow sponsor requirements or disciplinary standards for retaining data, authorship misrepresentation other than plagiarism, refusing to share data or methods, and misleading statistical analysis that falls short of falsification. In addition to the behaviors of individual researchers, detrimental research practices also include actions taken by organizations, such as failure on the part of research institutions to maintain adequate policies, procedures, or capacity to foster research integrity and assess research misconduct allegations, and abusive or irresponsible publication practices by journal editors and peer reviewers.
From page 103...
... From the available evidence, documented cases of researcher misconduct are relatively rare, as suggested by a rate of retractions in scientific papers of approximately 4 in 10,000 (Brainard, 2018)

