REVIEW OF THE REVISED NTP MONOGRAPH ON THE SYSTEMATIC REVIEW OF FLUORIDE EXPOSURE AND NEURODEVELOPMENTAL AND COGNITIVE HEALTH EFFECTS: A LETTER REPORT
In 2019, the National Toxicology Program (NTP) released the draft monograph Systematic Review of Fluoride Exposure and Neurodevelopmental and Cognitive Health Effects (NTP 2019a).1 The draft monograph summarized the findings of the systematic review and concluded that “fluoride is presumed to be a cognitive neurodevelopmental hazard to humans. This conclusion is based on a consistent pattern of findings in human studies across several different populations showing that higher fluoride exposure is associated with decreased IQ or other cognitive impairments in children” (NTP 2019a, p. 59). Given the controversies surrounding the risks and benefits associated with fluoride exposure and to ensure the integrity of its evaluation, NTP asked the National Academies of Sciences, Engineering, and Medicine (the National Academies) to review the draft monograph.
The National Academies committee that was convened to address the request identified deficiencies in the analysis of various aspects of some of the studies and in the analysis, summary, and presentation of the data in the draft monograph (NASEM 2020). The committee provided many suggestions for improvement and concluded that NTP had not adequately supported its conclusions. It noted that the committee's finding did not mean that NTP's conclusions were incorrect; rather, further analysis or reanalysis would be needed to support the conclusions. Taking the committee's suggestions into consideration, NTP revised the draft monograph.
STATEMENT OF TASK AND COMMITTEE APPROACH
NTP asked the National Academies to review the revised monograph (NTP 2020a) to ensure that it was responsive to the committee’s recommendations and, more important, adequately supported its conclusions. Attachment A provides the verbatim statement of task. The committee that reviewed the draft monograph was reconvened to review the revised monograph; Attachment B provides biographic information on the committee.
To complete its task, the committee held several virtual meetings, one of which included a public session at which NTP provided an overview of the changes that had been made in the draft monograph. The committee reviewed the revised monograph, including the newly added appendixes with details of lower risk-of-bias studies and the meta-analysis; NTP responses to the committee’s recommendations; the revised protocol; and public comments submitted to the committee. It is important to note that the committee did not conduct its own independent evaluation of the evidence, nor did it conduct a data audit; both were outside its scope. The committee reviewed the revised monograph and determined whether the evidence as presented in it supported NTP’s main conclusion that “fluoride is presumed to be a cognitive neurodevelopmental hazard to humans” (NTP 2020a, p. 80). Each section below provides the committee’s assessment of NTP responses to substantive issues previously raised (NASEM 2020) regarding methods, animal evidence, human evidence, and communication. Attachment C summarizes the substantive issues previously raised and NTP’s responses. The committee provides many recommendations for improving the revised monograph and has highlighted some particularly critical ones in boldface italics, but all are important to address.
1 Referred to hereafter as the draft monograph. The revised version released in 2020 is referred to as the revised monograph.
In its previous review, the committee raised several issues associated with the general methods of NTP’s systematic review process. The issues were concerning because they decreased the transparency of the process and the probability of reproducing the findings and did not align with some general best practices for systematic review. The committee finds that NTP has addressed many of the issues regarding methods in its revisions of the draft monograph but notes that some further improvements would be useful. A brief overview of suggested improvements is provided below; other methodologic issues raised in the previous review that are not discussed here have been adequately addressed in the revised monograph. The committee considers the remaining issues related to the systematic review methods to be minor with the exception of the comment below concerning NTP’s process for upgrading and downgrading the body of evidence (NTP 2020b, Table 5).
First, the role of the Office of Health Assessment and Translation (OHAT) handbook (NTP 2015, 2019b,c) has been explicitly added to the revised monograph. Two statements in the revised monograph—on pp. ii and 6 (footnote)—describe the OHAT handbook as a source of general systematic review methods that are selected and tailored to the project in the prespecified protocol. Although the statement clarifies the general role of the handbook, the committee finds that it does not address the committee’s previous recommendation to set the expectation for how closely the process described in the handbook will be followed in the protocol and in the eventual systematic review. For example, the handbook section “Key Questions and Analytical Framework” that guides development of the population, exposure, comparator, and outcomes (PECO) statement is not included in the fluoride protocol or the revised monograph. As the committee recommended in its previous review, NTP should treat each systematic review protocol as a stand-alone document that contains all the information necessary for understanding of the planning and conduct of the review, and these expectations should be explicitly stated in the protocol. The committee did not find that revisions of the protocol adequately addressed this recommendation.
Second, several recommendations in the committee’s previous review that might have increased the overall transparency of the monograph do not appear to have been addressed, such as reporting the excluded studies at the title and abstract step (also recommended in the OHAT handbook) and adding to the protocol clear definitions for each factor that contributes to increasing or decreasing confidence in the body of evidence and key considerations that warrant upgrading or downgrading the body of evidence (NTP 2020b, Table 5, p. 18). The committee found that such omissions decrease the reproducibility and transparency of the systematic review process and should be viewed as a deficiency that should be addressed.
Third, NTP has added text to the revised monograph regarding the use of the SWIFT-Active Screener tool to priority-rank studies for screening and to set stopping rules. However, the committee recommends that a more detailed explanation of some terminology be added to eliminate any confusion that might arise given the novelty of the use of such tools. For example, the term percent recall might lack consistent interpretation, and it would be helpful to define it to clarify the implications of stopping at a set recall, such as 98% estimated recall, and the implication of the potential number of missed studies at the set stopping point.
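As a rough illustration of why a definition matters, the sketch below uses the usual information-retrieval definition of recall (relevant studies found divided by all relevant studies in the corpus) to translate an estimated recall into an approximate number of missed studies. The function name and the counts are hypothetical, and the sketch assumes that the recall estimate is exact, which a statistical estimate produced by a screening tool would not be.

```python
def estimated_missed_studies(relevant_found: int, estimated_recall: float) -> float:
    """Approximate number of relevant studies left unscreened.

    Assumes recall = relevant_found / total_relevant, the usual
    information-retrieval definition (an assumption made for this
    sketch, not a definition taken from the monograph).
    """
    if relevant_found < 0:
        raise ValueError("relevant_found must be non-negative")
    if not 0.0 < estimated_recall <= 1.0:
        raise ValueError("estimated_recall must be in (0, 1]")
    # Total relevant studies implied by the recall estimate.
    total_relevant = relevant_found / estimated_recall
    return total_relevant - relevant_found


# Hypothetical example: 490 relevant studies found at 98% estimated recall.
missed = estimated_missed_studies(490, 0.98)
print(f"Estimated relevant studies missed: {missed:.1f}")
```

Under these assumptions, stopping at 98% estimated recall after finding 490 relevant studies would imply that about 10 relevant studies were never screened, which is the kind of implication the committee asks NTP to state explicitly.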
The committee appreciates that NTP agrees that there were problems with the risk-of-bias analyses of the animal studies, and it agrees with NTP’s decision that devoting further effort to refining the analyses is not worthwhile but has concerns regarding the reasons provided by NTP for not reanalyzing any of the animal data. NTP provided the following reasons in the revised monograph: “(1)…a more critical risk-of-bias assessment would result in fewer relevant animal studies judged to be of high quality; (2)…the highest quality experimental animal study reviewed for this monograph (McPherson et al. 2018) did not find effects of fluoride on learning, memory or motor activity in the critical ≤20 ppm in drinking water concentration range; and (3)…[there are] a large number of human epidemiology studies directly addressing neurobehavioral and cognitive effects of fluoride in children” (NTP 2020a, p. 58). Although the committee agrees with the first reason to the extent that a reanalysis would probably not find any low risk-of-bias studies, it is inappropriate for NTP to highlight one specific study (McPherson et al. 2018) as a rationale for not reassessing all the animal literature. Regarding the third reason, the committee disagrees that a large number of epidemiologic studies generally negates the value of animal studies in hazard determination. Instead, NTP should clarify that a large number of relevant epidemiologic studies can be used as a primary source of evidence to support a conclusion in its hazard identification scheme for integrating human and animal data to reach a final rating of the overall evidence.
In the revised monograph, NTP has added a disclaimer about the animal evidence but left the original discussion unchanged. The committee strongly recommends that NTP not publish the monograph with the original text that states that evidence of effects on activity or motor function invalidates observations of learning or memory deficits. If taken out of context, that text could be interpreted incorrectly or raise questions about the scientific validity of the monograph more generally. For example, Yang et al. (2018) was grouped with studies that were classified as high risk of bias because, in addition to finding learning deficits by using the Morris water maze, it found open-field effects. However, the Morris water maze data are highly unlikely to be affected by the minor open-field differences found in that study, not only because swimming is different from ambulation and rearing but because there were no differences among groups in learning the task over 5 days of testing. Differences emerged only on retesting 10 and 20 days later and then were not significant on days 30, 40, and 50. It is implausible that rats with any kind of activity effect would learn the Morris water maze equally well, show deficits on only some retest days, and then fail to show further deficits because of an open-field effect. That example shows that the monograph overgeneralizes concerns about activity without examining the learning data in sufficient depth to determine their validity. Instead, the monograph dismisses all data on the basis of a sweeping indictment that no learning differences can be used if activity differences are found. That view is not scientifically justifiable.
The committee strongly recommends that NTP revise the monograph text that states that a change in motor activity necessarily complicates interpretation of learning and memory tests and that the absence of an evaluation of motor activity is automatically problematic.2 First, the mere observation of a change in motor activity does not automatically undermine a learning and memory effect, nor does the absence of statements about the general health of the animals undercut validity, as the monograph asserts. Second, the absence of a motor-activity test does not necessarily invalidate a learning and memory effect if the test has an internal control for activity. The central issue is whether the learning and memory method alone or in combination with other indexes dissociates learning from performance in a way that allows a correct interpretation of animal learning and memory.
2 Text that needs to be edited includes p. 58, last paragraph, lines 4–7, and p. 59, last paragraph, lines 4–13.
The committee provided many suggestions in its previous report (NASEM 2020) to address deficiencies that it identified in the analysis of the human evidence provided in the draft monograph (NTP 2019a). The headings in this section represent the overarching concerns that the committee raised in its previous report, and the text provides the committee’s assessment of NTP’s responses to the concerns and the revisions made in the draft monograph.
Potential for Biased Selection of Studies
NTP has done excellent work in responding to concerns about a potentially biased selection of studies. The expansion of the literature search to include several Chinese databases strengthens NTP’s review and strengthens the overall process that it has used to support its conclusions. In a few respects, NTP could improve the process even further, and these are discussed below.
First, the databases that NTP chose for searching the Chinese literature were selected on the basis of their covering “studies previously identified from other sources” (NTP 2020b, p. 6). Although that approach might be appropriate, it would have been helpful for NTP to provide a few brief details about the quality or scope of the two new Chinese databases. For example, NTP chose such databases as PubMed and BIOSIS for a reason—for example, fairly extensive coverage of journals or some quality-control standards. Do the same reasons or qualities also apply to the CNKI and Wanfang databases? NTP should also address the concern that selecting databases on the basis of studies already identified might perpetuate, rather than ameliorate, biases resulting from the initial search.
Second, the monograph states that “newly-retrieved human references were reviewed to identify studies that might impact conclusions with priority given to identifying and translating null studies” (NTP 2020a, p. 10). It is somewhat understandable that NTP would want to focus on null studies because these studies would most likely affect NTP’s conclusions. However, that statement provides questionable justification, given NTP’s primary mission—an unbiased review of the literature, which means including all relevant studies whether positive or negative. NTP needs to consider all eligible studies identified in the new literature search.
Lack of Independence of Studies
NTP recognizes that the monograph evaluates and describes multiple publications from the same study. It also indicates some uncertainty about a few publications that cannot be attributed to a parent study, given insufficient published details. The revised monograph states that it addressed the independence issue, but the exact process used for selection of a single publication remains unclear, and in the meta-analysis, two reports on the same population are inappropriately included as described below. It would be useful for the monograph to identify clearly which publications were derived from which study to minimize concerns about potential selection bias; doing so would also help to define the publications selected for the meta-analysis. NTP might consider editing the monograph to differentiate studies from publications or papers. That revision can be achieved by restricting the term study to the original body of research conducted with a defined population during a specified time and using the terms publications and papers to refer to the published work drawn from a study.
Inconsistent Application of Risk-of-Bias Criteria
In response to the committee’s concern regarding the risk-of-bias assessment, NTP has added Appendix 4, which provides its rationale for classifying studies relative to their estimated risk of bias. The new appendix is helpful and adds transparency, but inconsistencies remain in the application of risk-of-bias criteria to individual studies, particularly in NTP’s evaluation of how various studies handled major confounders, co-exposures, and outcomes. An example concerns the handling of co-exposure to arsenic and lead. According to the protocol, a cross-sectional study is rated as having a probably low risk of bias on confounding if there is direct evidence that appropriate adjustments for arsenic and lead were made; the monograph requires the studies to address arsenic and lead, if applicable. Barberio et al. (2017) did not adjust for arsenic and lead, nor did the authors discuss co-exposures; however, the study was rated as having a probably low risk of bias. The committee also identified several studies whose classification changed in revisions of the draft monograph without any justification provided (Sudhir et al. 2009; Trivedi et al. 2012; Das and Mondal 2016).
Evaluation of Confounding Insufficient, Difficult to Understand, or Applied Inconsistently
The revised monograph articulates a formal approach for assessing confounding by defining what it considers to be key confounders (that is, children’s age, sex, and socioeconomic status) and other potential confounders. The addition of Appendix 4 makes it easier to follow how individual studies were assessed for risk of bias and confounding, but the committee still considers NTP’s evaluation of confounding insufficient and sometimes inconsistently applied. For example, Cui et al. (2020), which was rated as having a probably high risk of bias for confounding and was included with the lower risk-of-bias studies, presented a univariate comparison of IQ by high vs low fluoride exposure without any adjustment for confounders. According to the protocol, the study should have been rated as having a definitely high risk of bias for confounding and included with the higher risk-of-bias studies. An example of inconsistent application of criteria to classify confounding is the adjustment for smoking and lead exposure. Specifically, Broadbent et al. (2015) is rated as having a probably high risk of bias on confounding, but other studies, such as Trivedi et al. (2012), were not similarly ranked. Another example of inconsistent application of confounding assessment concerns Valdez-Jimenez et al. (2017); here, the issue was the unbalanced and unexplained demographic characteristics of the study population. In Appendix 4, NTP attempted to clarify the direction and magnitude of bias due to confounding, although supporting text is often unclear. For several studies, NTP added a paragraph on the potential direction of bias due to lack of adjustment for arsenic exposure but then provided an argument to justify its absence as a confounder (see, for example, Sudhir et al. 2009). As noted, the committee did not conduct a full audit but examined some illustrative papers and still found reasons for concern.
Possibility of Exposure Misclassification
The revised monograph addresses methodologic issues concerning potential exposure misclassification in light of the various types of exposure measures—for example, child and mother spot urines, serum, drinking water, urine, and residence—considered in the studies. Specifically, Appendix 4 addresses the potential direction and magnitude of bias due to exposure misclassification, if applicable. Thus, the committee’s prior concerns regarding exposure misclassification appear to have been adequately addressed.
Need for Further Consideration of Blinding
In its previous review, the committee recommended that NTP consider more carefully the effect of not intentionally blinding outcome assessors when evaluating the human studies. In its response, NTP indicated that when authors did not directly provide evidence of examiner blinding, it contacted the authors for information. It is unclear how the risk-of-bias information has been updated regarding blinding on the basis of any new information that was received. Specifically, Health Assessment and Workspace Collaborative records identify only whether and when authors were contacted but not what information was obtained or how it might have changed risk-of-bias ratings. NTP also stated that it “verified that the lower risk-of-bias studies did not provide direct evidence of imprecision or lack of blinding” (NTP 2020c). However, that approach assumes that authors will always reveal in their manuscripts a lack of blinding and other weaknesses in their study design. A more conservative approach would be to assume that there was no blinding of outcome assessors unless it was specified in the manuscript and that a designation of probably high risk of bias for this criterion (at a minimum) would be more appropriate when the blinding status was not explicitly stated. That approach would follow the one described in the protocol in which NTP states that “studies should be considered ‘probably high RoB’ unless specific direct or indirect evidence of blinding is provided” (NTP 2020b, p. 13).
Appendix 4 in the revised monograph outlines details of each lower risk-of-bias study and includes outcome-assessor blinding, if known, and any information gathered from direct contact with manuscript authors. In several cases in which assessor blinding was not known, risk of bias for confidence in the outcome assessment was considered low because of the cross-sectional design in which exposure and outcome were measured simultaneously or when all children resided in the same geographic area. The committee considers that an acceptable approach. However, in studies in which children were tested in schools or other facilities in areas where low and high fluoride concentrations of different localities were being compared (see, for example, Cui et al. 2018), there is an increased risk of bias because examiners might make assumptions about children in the different areas. A designation of probably high risk of bias (at a minimum) would be more appropriate in those cases given the approach described in the protocol noted above.
Flawed Measures of Neurodevelopmental and Cognitive Outcome
The committee raised a concern in its previous review about studies that were classified as having lower risk of bias when measurement of a neurodevelopmental or cognitive outcome was flawed. NTP’s response indicated that it did not change the draft monograph but verified that the lower risk-of-bias studies did not provide direct evidence of imprecision in their outcome measurement. However, the committee remains concerned about the application of the protocol definitions to rate studies. For example, Barberio et al. (2017) assessed outcomes that rely on parent or child self-report of diagnosis of learning disability or attention deficit hyperactivity disorder. According to the protocol, that study would be rated as having either a probably or a definitely high risk of bias because the method was not listed in Table 6 (NTP 2020b, p. 21), but NTP failed to address whether there is direct evidence that a self-reported diagnosis has been validated as a reliable outcome measure. That evidence would allow one to distinguish which category (probably or definitely high risk of bias) would be most appropriate. Because the outcome measure is critical for the interpretation of the findings, the committee recommends that NTP apply its criteria in a more consistent manner and specifically address whether there is direct evidence of the sensitivity and precision of self-reported neurodevelopmental outcomes.
Lack of Rigorous Statistical Review
The committee recognizes that NTP made substantial efforts to improve the statistical reviews of the lower risk-of-bias studies. Each study was reviewed by a senior statistician, and summaries of the analytic methods were added to the study descriptions in Appendix 4 in the section “Other potential threats.” However, the summaries provided for a few publications were only a single sentence—“Statistical analyses used were appropriate for the study” (Sudhir et al. 2009; Barberio et al. 2017; Bashash et al. 2017, 2018)—and two other summaries mentioned only log-transformations (Choi et al. 2015) or that tests of normality were performed (Zhang et al. 2015). For those publications, NTP should have provided more evidence to support its conclusion that the analyses were appropriate. It is also concerning that NTP assumed that the analyses in Soto-Barreras et al. (2019) were appropriate despite few details provided in the manuscript regarding their methods.
The committee also finds that NTP did not adequately address the issue of clustering. Most of the attention to clustering pertained to the examples provided in the committee’s previous review. Although it was important for NTP to review those examples, they were meant to highlight the issue and were not meant to serve as a comprehensive list of problematic studies. In fact, when reviewing Appendix 4 in the revised monograph, the committee found several other studies whose analyses failed to account for clustering. Of most concern are the studies that used fluoride concentration measured at the community level as the exposure; see, for example, Seraj et al. (2012), Till et al. (2020), Trivedi et al. (2012), and Wang et al. (2012). When everyone in a community is subject to the same exposure, the standard error of the difference in means between high-exposure and low-exposure groups is multiplied by the square root of a variance inflation factor (VIF) equal to [1 + (n - 1)r], where n is the number of persons in each community and r is the correlation in outcomes (such as IQ score) between members of the same community (Murray 1998; Donner and Klar 2000; Feng et al. 2001). The same phenomenon occurs in randomized controlled trials that assign treatment to groups of persons. Thus, unless within-community clustering is accounted for in the analysis (for example, through a random-effects model), standard-error estimates will be too small and confidence intervals (CIs) too narrow. Those errors could have a substantial effect on the meta-analysis, which requires valid estimates of within-study variability. The same issue applies to analyses that use community-level exposure to estimate slopes in a regression model. For individual-level exposures, such as urinary fluoride concentration, the VIF is probably smaller than one would see for community-level exposures because some communities might contain people in multiple exposure groups. However, it is still important to account for clustering in the analysis because one would expect most people in a community to be in the same exposure group. NTP should note specifically whether each study applied an analytic approach that addressed clustering when that was a feature of the design.
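The size of the clustering effect is easy to underestimate, so a small numerical sketch may help. It applies the VIF expression given above, 1 + (n - 1)r, and scales a standard error by its square root. The function names and the example values (100 children per community, a within-community correlation of 0.05, and a naive standard error of 1.0 IQ points) are hypothetical and are not taken from any study in the monograph.

```python
import math


def variance_inflation_factor(n_per_community: int, within_corr: float) -> float:
    """VIF = 1 + (n - 1) * r for equal-sized communities."""
    if n_per_community < 1:
        raise ValueError("n_per_community must be at least 1")
    return 1.0 + (n_per_community - 1) * within_corr


def cluster_adjusted_se(naive_se: float, n_per_community: int, within_corr: float) -> float:
    """Scale a standard error that ignored clustering by sqrt(VIF)."""
    return naive_se * math.sqrt(variance_inflation_factor(n_per_community, within_corr))


# Hypothetical values chosen only for illustration.
vif = variance_inflation_factor(100, 0.05)
adjusted = cluster_adjusted_se(1.0, 100, 0.05)
print(f"VIF = {vif:.2f}, adjusted SE = {adjusted:.2f}")
```

With these assumed inputs, even a modest within-community correlation more than doubles the standard error, which is why an unadjusted study can appear far more precise in a meta-analysis than it really is.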
In the case of Green et al. (2019), NTP learned from the investigators that accounting for city-level clustering via a random-effects model “showed similar results to the main model.” More details should be provided regarding the similarity of results because although overall conclusions might not have changed, the results of the meta-analysis could be affected by incorrect exposure-effect or standard-error estimates.
The statistical review conducted by NTP also failed to identify a study that did not properly account for the sampling design. Yu et al. (2018) used a hierarchical stratified sampling design but did not indicate that sampling weights were used in the analysis. Thus, both point estimates (means and regression coefficients) and standard errors were likely biased (Lohr 2019). NTP should examine the studies included in the meta-analysis in greater depth to determine whether each study properly accounted for its design because not doing so could invalidate the meta-analysis results.
Need to Juxtapose Results of Broadly Comparable Studies
In its previous review, the committee expressed concern about selective consideration and presentation of results from the various studies. That approach can convey inaccurate impressions regarding consistency unless the findings are derived from studies that are comparable or aligned with respect to study population, exposure measurement, and outcome ascertainment. Some text in the revised monograph continues to be impressionistic and haphazard in citing various findings from studies and does not provide a clear rationale for why some findings are reported and others are not. The committee notes that reporting findings that are most or least supportive of a finding does not necessarily indicate bias and that this issue might be more editorial than substantive in that the text is not the basis for drawing conclusions. However, it does constitute a concern with transparent communication.
The critical information regarding comparison of study results comes from the new meta-analysis, which seeks to extract and integrate comparable findings from selected studies as discussed further below. The overall approach appears to be sound in comparing mean IQ scores for the most and least highly exposed to fluoride even though the absolute fluoride concentrations are not comparable among studies. A similar approach appears to have been used in the analyses restricted to comparisons in the lower exposure ranges (less than 2 mg/L or less than 1.5 mg/L), but it was not documented clearly in the revised monograph. Because the meta-analysis is so critical to the conclusions that are drawn, NTP should provide the data that were used from each study to enable the reader to understand and evaluate what was done. The values that were used to determine the standardized mean differences (SMDs) could not be found in the revised monograph, nor was there a figure that showed the pattern of results from studies restricted to the lower exposure ranges. A more detailed assessment of the meta-analysis is provided in the next section.
Evaluation of the Meta-Analysis
The committee found the meta-analysis to be a valuable addition to the monograph and acknowledges the tremendous amount of work that was required. The meta-analysis applied standard, broadly accepted methods, and the data shown in Figure A5-1 and the related evaluations are especially informative (NTP 2020a, p. 235). As noted in the revised monograph, 44 of the 46 studies represented in that figure had effect estimates to the left of zero—results that indicate an association between higher fluoride exposures and lower IQ. Those results highlight the marked consistency in the current epidemiologic literature on fluoride and childhood IQ. The subgroup analyses also add considerable strength to the monograph. Despite those improvements, there are areas in which further clarification or revision is needed. Because the revised monograph provides the first opportunity to review and comment on the meta-analysis, the committee offers more detailed suggestions here than in the other sections of this letter report.
One area that needs attention is data transparency. Although the results of each study in the meta-analysis are presented in figures, it is difficult to understand where each of the data points comes from and what each data point represents. Many of the publications used in the meta-analysis provide a number of results or present results in several ways. For example, Bashash et al. (2017) provide results for both child and maternal urinary fluoride concentrations. It is difficult to determine which results were selected for the overall meta-analysis or for each subgroup analysis. In addition to the figures in the revised monograph, NTP should add a table that provides more information on each study result, including the actual result used from each study (SMDs, regression coefficients, and CIs), any data that NTP might have used to calculate the results (for example, means, standard deviations, and sample sizes), and other key information (for example, exposure concentrations of the high- and low-fluoride groups, the method used to assess exposure and outcome, which populations overlap, and information obtained from study authors). Table A-1 includes some of that information but does not include the actual results that NTP selected for the meta-analysis. Overall, adding a table that includes the critical information on each study result would allow readers to identify which result from each study was used and support a better understanding of why NTP selected the results that it did for inclusion in the meta-analysis.
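To make concrete what such a table would let a reader verify, the sketch below recomputes a standardized mean difference, its standard error, and an approximate 95% CI from group means, standard deviations, and sample sizes. It uses Cohen's d with a pooled standard deviation and a common large-sample variance approximation; that choice is an assumption for illustration, because the text reviewed here does not state which SMD estimator NTP used. The function name and the input values are hypothetical.

```python
import math


def standardized_mean_difference(mean_high, sd_high, n_high, mean_low, sd_low, n_low):
    """Return (d, se, (ci_low, ci_high)) for high- vs low-exposure groups.

    Cohen's d with a pooled SD; the SE uses the usual large-sample
    approximation. A negative d means lower scores in the high group.
    """
    pooled_var = ((n_high - 1) * sd_high ** 2 + (n_low - 1) * sd_low ** 2) / (
        n_high + n_low - 2
    )
    d = (mean_high - mean_low) / math.sqrt(pooled_var)
    se = math.sqrt(
        (n_high + n_low) / (n_high * n_low) + d ** 2 / (2 * (n_high + n_low))
    )
    return d, se, (d - 1.96 * se, d + 1.96 * se)


# Hypothetical study: mean IQ 95 (SD 15, n = 50) in the high-fluoride group
# versus mean IQ 100 (SD 15, n = 50) in the low-fluoride group.
d, se, ci = standardized_mean_difference(95, 15, 50, 100, 15, 50)
print(f"SMD = {d:.3f}, SE = {se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

If the monograph reported the inputs and the resulting SMD for every study, a reader could repeat this calculation and confirm which of several reported results was carried into each analysis.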
As part of its meta-analysis, NTP presents several subgroup and sensitivity analyses. The committee finds them very informative; several are directly responsive to some of the committee’s previous concerns. However, NTP should also include subgroup or sensitivity analyses that respond to the committee’s concerns about blinding, complex sampling designs, and statistical analyses that account for clustered study designs. Those analyses would include subgroup analyses that separate studies that did and did not blind the outcome assessors, a sensitivity analysis that omits studies with complex sampling designs that did not mention the use of sampling weights, and a sensitivity analysis that omits studies that used community-level exposures but did not account for clustering. Alternatively, NTP could perform a sensitivity analysis in which the standard errors of the studies that did not account for clustering are multiplied by the square root of an estimated VIF. Other subgroup analyses that should be considered are ones that compare prenatal and postnatal exposures. The additional subgroup or sensitivity analyses noted could help to alleviate some of the committee’s current concerns.
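A minimal sketch of the clustering sensitivity analysis follows. It pools study effects with a DerSimonian-Laird random-effects estimator, then repeats the pooling after scaling the standard errors of flagged studies by the square root of an assumed VIF, in line with the VIF definition given earlier in this report. The estimator is a common default and is an assumption here, not NTP's documented method; the function names, effect estimates, standard errors, flags, and the VIF value are all hypothetical.

```python
import math


def dersimonian_laird(effects, ses):
    """Pooled random-effects estimate and its SE (DerSimonian-Laird)."""
    weights = [1.0 / se ** 2 for se in ses]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return pooled, math.sqrt(1.0 / sum(re_weights))


def inflate_flagged_ses(ses, ignored_clustering, vif):
    """Multiply the SEs of studies that ignored clustering by sqrt(VIF)."""
    factor = math.sqrt(vif)
    return [se * factor if flag else se for se, flag in zip(ses, ignored_clustering)]


# Hypothetical SMDs and SEs for four studies; two ignored clustering.
effects = [-0.45, -0.30, -0.60, -0.20]
ses = [0.10, 0.12, 0.15, 0.11]
ignored_clustering = [True, False, True, False]

base = dersimonian_laird(effects, ses)
adjusted = dersimonian_laird(effects, inflate_flagged_ses(ses, ignored_clustering, vif=4.0))
print(f"Original: pooled = {base[0]:.3f}, SE = {base[1]:.3f}")
print(f"Adjusted: pooled = {adjusted[0]:.3f}, SE = {adjusted[1]:.3f}")
```

Comparing the two pooled estimates and their standard errors shows how much the overall result depends on studies whose reported precision may be overstated. A plausible VIF would have to be chosen or bracketed for each study because it depends on community size and within-community correlation.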
Another major concern of the committee in its first review was that NTP might have been including multiple results from a given study population. In its meta-analysis protocols (NTP 2020b, p. 83), NTP implies that only one result from each population was used. The section of the meta-analysis of “individual-level exposure data” (NTP 2020a, Appendix 5, p. 246) includes a good discussion of two overlapping sets of publications (Yu et al. 2018/Wang et al. 2020 and Green et al. 2019/Till et al. 2020) and the process used to select one result from each set. However, NTP appears to have included at least one set of overlapping publications, Xiang (2003) and Xiang (2011) (Figure A5-1), in the overall meta-analysis of mean effects. NTP should review all its analyses to ensure that overlapping publications are not included in any single meta-analysis. That exercise is especially important given that the issue of “double counting” was a substantive concern of the committee in its first review.
Another issue involves the overall organization of the meta-analysis protocols and results. Information on the meta-analysis protocols and information on the meta-analysis results are presented in several places. That approach forces the reader to go back and forth between sections and between documents to determine what was done or to obtain a clear picture of the meta-analysis findings. For example, some methods are described in the protocol, some in the revised monograph (NTP 2020a, pp. 48-51), and some in Appendix 5. In addition, NTP presents an exhaustive set of forest plots, funnel plots, Egger and Begg test results, and trim and fill plots and results. NTP can be applauded for developing so many data displays and being so transparent here. However, much of the information is not that helpful, and it is difficult to wade through it, given the sheer volume. Some of the information could be eliminated, summarized, or presented more succinctly or at least provided in a separate document, website, or appendix. Overall, some coalescing and reorganization of the meta-analysis protocols and results would make the meta-analysis easier to follow and easier to interpret.
NTP provides a reasonably thorough and appropriate evaluation of publication bias. In addition to what it has presented, it should mention the weaknesses of the tests used to evaluate that bias. One weakness is that the evaluation of the funnel plot involves mostly a subjective interpretation, which can be especially troublesome when the number of studies is small. Another weakness is the possibility that positive results from the funnel plot and the Egger and Begg tests might be caused by something other than publication bias. In addition, NTP uses the phrase “eliminating publication bias” when it refers to the results of the trim and fill analyses (see, for example, NTP 2020a, p. 49). However, because the tests for publication bias are not 100% specific, it is not known exactly what is being eliminated by the trim and fill process. The committee suggests that a better phrase might be “adjusting for possible publication bias.” In summary, acknowledging the weaknesses of the tests that were used to evaluate publication bias would make the report more transparent.
NTP notes that 44 of the 46 studies (96%) in its meta-analysis of childhood IQ have effect estimates to the left of zero. That finding should be emphasized more, and its meaning with respect to evaluating and quantifying heterogeneity should be mentioned. To assess heterogeneity, NTP primarily used the Cochran’s Q test. However, heterogeneity can also be assessed by providing a count or percentage of the number of studies to the right or left of the null value. Some would consider that a much simpler, more intuitive, and perhaps more useful way of assessing heterogeneity, especially in light of the marked differences between the studies in design, study populations, exposure and outcome assessment methods, and statistical analyses. Although that approach should not be used as the sole basis of conclusions, it can be a useful first step in exploring why heterogeneity might exist. For example, Figure A5-1 appears to show that Broadbent et al. (2015) and Bashash et al. (2017) are two major contributors to the heterogeneity seen in the overall meta-analysis, and they should be clearly identified in the monograph. NTP does note that there were two studies with effect estimates to the right of the null (NTP 2020a, p. 49, last full paragraph), but a key reference (Bashash et al. 2017) is missing. In addition to identifying the studies, NTP should explore whether there might be an obvious or likely reason for the results of those two studies to tend to differ from the results of the others. For example, the Bashash et al. (2017) result used in the meta-analysis of SMDs appears to be for the cross-sectional evaluation of children’s urinary fluoride concentrations. However, the study also presents prospective results that use maternal prenatal urinary fluoride concentrations, and, unlike the cross-sectional results, the prospective results indicate a fairly strong adverse relationship, one that is much more consistent with that in the other studies used in the meta-analysis. The rationale for choosing one result over the other should be provided because such decisions can affect the results of the meta-analysis.
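The two ways of looking at heterogeneity discussed above, Cochran’s Q and a simple count of estimates on each side of the null, can be computed side by side. The sketch below is illustrative only: the function name and the four study values are hypothetical, the weights are plain inverse-variance weights, and I² is included as a commonly reported companion to Q rather than as something the monograph is known to report in this form.

```python
def heterogeneity_summary(estimates, std_errors, null_value=0.0):
    """Cochran's Q, I-squared, and the count of estimates below the null."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    n_below = sum(1 for e in estimates if e < null_value)
    return {
        "Q": q,
        "df": df,
        "I2_percent": i2,
        "n_below_null": n_below,
        "share_below_null": n_below / len(estimates),
    }

# Hypothetical SMDs and standard errors, for illustration only.
summary = heterogeneity_summary([-0.5, -0.3, -0.4, 0.1], [0.1, 0.1, 0.1, 0.1])
print(summary)

# The direction count reported in the text: 44 of 46 studies left of zero.
share_left = 44 / 46
print(f"{share_left:.0%} of studies have estimates below zero")
```

In the hypothetical data, one estimate on the opposite side of the null accounts for most of Q, which is the kind of pattern that a direction count makes visible at a glance.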
Finally, NTP should review the process it used to exclude study results from its meta-analysis. For example, Table A-2 states that Green (2019) was excluded because of “missing mean or SD of outcome measure; used in individual level meta-analysis.” However, means and SDs are available (Green 2019, Table 1), and at least two other studies (Ding 2011 and Zhang 2015) are used in both the mean-effect and individual-level meta-analyses.
The committee identified several minor points concerning the meta-analysis, and these are provided below.
- NTP notes that pooled SMDs and pooled relative risks were considered significantly different when their 95% CIs did not overlap (NTP 2020b, p. 85). That approach can provide many false-negative results because significant differences can occur when CIs overlap. Statistical significance should instead be determined by hypothesis tests, such as those described in Altman and Bland (2003).
- Almost all the forest and funnel plots are difficult to see because they are too narrow. They should be expanded horizontally. An example of a forest plot that is much easier to read is Figure 2 in Choi et al. (2012).
- Labeling the Aim 2 meta-analysis as a “meta-analysis using individual-level exposure data” is somewhat misleading because it is not clear that all the studies used in it involved individual exposure data. Some might have used ecologic exposure data or the types of clustered exposure data discussed above. Aim 2 actually appears to be a meta-analysis of regression slopes, and labeling it as such would be more appropriate.
- NTP notes that the pooled SMD in its main meta-analysis after applying the trim and fill method is -0.42 (95% CI: -0.54, 0.30) (NTP 2020a, p. 49). NTP should confirm that the CI is correct and that the upper confidence limit is not -0.30.
- If possible, NTP should summarize its meta-analysis results for SMDs by putting the results in a format that is easier to interpret. For example, if the typical standard deviation for a commonly used IQ test is 15 IQ points, a pooled SMD of -0.50 would be expected to represent about a 7.5-point decrease in IQ. Expressing the major results as estimated IQ points, rather than as just SMDs, would make the results easier for people unfamiliar with SMDs to interpret.
- The rationale for excluding the PhD thesis by Thomas from the NTP review of meta-analysis should be provided.3
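Two of the minor points above lend themselves to a short worked example: testing the difference between two pooled estimates directly instead of checking whether their CIs overlap, and re-expressing an SMD in IQ points. The sketch below follows the general approach of a z-test on the difference between two independent estimates, with standard errors recovered from 95% CIs. The function names and the two subgroup estimates are hypothetical, and the 15-point standard deviation is the assumption stated in the text.

```python
import math

def se_from_ci(lower, upper, z_crit=1.96):
    """Recover a standard error from a 95% CI, assuming normality."""
    return (upper - lower) / (2.0 * z_crit)

def difference_test(est1, se1, est2, se2):
    """z-test for the difference between two independent estimates."""
    diff = est1 - est2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return diff, z, p

def smd_to_iq_points(smd, sd_iq=15.0):
    """Convert an SMD to IQ points, assuming the test's SD is sd_iq."""
    return smd * sd_iq

# Hypothetical subgroup results whose 95% CIs overlap (-0.50 to -0.40).
ci1 = (-0.80, -0.40)  # subgroup 1, pooled SMD -0.60
ci2 = (-0.50, -0.10)  # subgroup 2, pooled SMD -0.30
diff, z, p = difference_test(-0.60, se_from_ci(*ci1), -0.30, se_from_ci(*ci2))
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")

# The conversion used in the text: an SMD of -0.50 with an SD of 15 points.
print(smd_to_iq_points(-0.50), "IQ points")
```

In this hypothetical case the intervals overlap, yet the test on the difference yields a p-value below 0.05, which is the false-negative problem with the CI-overlap rule.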
Overall, NTP has done a good job of identifying and extracting the underlying epidemiologic information that it needs to evaluate the possible neurodevelopmental effects of fluoride. With a few exceptions, the major problem with the report is related not to missing or misinterpreted information but to how the underlying research and its evaluations are presented by NTP. As detailed in many of the preceding comments, NTP’s protocols and its evaluations of the research are sometimes difficult to follow. As NTP is aware, the issue of fluoride toxicity and safety is highly contentious. To be widely accepted, any analysis concerning the issue needs to be performed and presented with exceptional care and with exceptional clarity. Overall, the revised monograph seems to include a wealth of evidence and a number of evaluations that support its main conclusion, but the monograph falls short of providing a clear and convincing argument that supports its assessment, given the lack of details in several places and the lack of clarity on several substantive issues.
Much of the evidence presented in the report comes from studies that involve relatively high fluoride concentrations. Little or no conclusive information can be garnered from the revised monograph about the effects of fluoride at low exposure concentrations (less than 1.5 mg/L). NTP therefore should make it clear that the monograph cannot be used to draw any conclusions regarding low fluoride exposure concentrations, including those typically associated with drinking-water fluoridation. Drawing conclusions about the effects of low fluoride exposures (less than 1.5 mg/L) would require a full dose–response assessment, which would include at a minimum more detailed analyses of dose–response patterns, models, and model fit; full evaluations of the evidence for supporting or refuting threshold effects; assessment of the differences in exposure metrics and intake rates; more detailed analyses of statistical power and uncertainty; evaluation of differences in susceptibility; and detailed quantitative analyses of effects of bias and confounding of small effect sizes. Those analyses fall outside the scope of the NTP monograph, which focuses on hazard identification and not dose–response assessment. Given the substantial concern regarding health implications of various fluoride exposures, comments or inferences that are not based on rigorous analyses should be avoided.
As noted above, the committee focused on determining whether the evidence as presented in the revised monograph supported NTP’s main conclusion that “fluoride is presumed to be a cognitive neurodevelopmental hazard to humans” (NTP 2020a, p. 80). The revised monograph is much improved from the initial draft that the committee reviewed. The addition of the meta-analysis substantially increases the support for NTP’s main conclusion. However, the committee is still concerned about the presentation of the data, the methods, and the analyses in the revised monograph and finds that the monograph falls short of providing a clear and convincing argument that supports its assessment. The committee urges NTP to improve the clarity of the document. The monograph has great importance in the discussion about the effects of fluoride on neurodevelopmental and cognitive health and will likely influence exposure guidelines or regulations. Thus, it is extremely important for it to be able to withstand scientific scrutiny by those who have vastly different opinions on the risks and benefits associated with fluoride exposure. The committee strongly recommends that NTP improve the revised monograph by seriously considering the suggestions that are provided in this letter report to improve its clarity and transparency.
A Statement of Task
B Committee Membership
C Key Issues and NTP Response
E Acknowledgment of Reviewers