Suggested Citation:"5 Guidance for Selection and Use of Population Descriptors in Genomics Research." National Academies of Sciences, Engineering, and Medicine. 2023. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. Washington, DC: The National Academies Press. doi: 10.17226/26902.

5

Guidance for Selection and Use of Population Descriptors in Genomics Research

INTRODUCTION

The primary audience for this chapter is researchers who work with genetic data. The committee’s intent is to provide practical guidance for using descent-associated population descriptors in human genetics and genomics research. As emphasized throughout the report, the appropriate population descriptor depends on the scientific question being asked. In some cases, none of these descriptors may be needed. In other situations, when descent-associated population descriptors are advisable or needed for methodological reasons, this chapter gives guidance on which approaches to consider and why.

In formulating these recommendations, the committee recognizes that there exists a large amount of legacy data in which study participants have already been classified on the basis of population descriptors (Khan et al., 2022; Wallace et al., 2020). When using such data, researchers may be constrained in their options, but their choices need to be described in ensuing publications. Furthermore, the committee appreciates the dynamic nature of research and the changing landscape of descent-associated population descriptors; there is no single solution to this challenge of appropriate use of descriptors, and applying a uniform approach across different types of studies is not possible. Rather, responsive approaches are needed to accommodate the specific research question being asked, develop best practices for grouping individuals and naming those groups, and take community preferences into account.

This chapter builds on the foundation established by the previous four chapters. Therefore, the committee encourages a careful reading of Chapters 1 through 4 in order to understand the context of these recommendations. Notably, Chapter 3 provides a set of guiding principles for conducting human genetics research (and all research involving humans) that support the report’s recommendations and can help guide researchers when none of the specific best practices apply.

THE IMPORTANCE OF TRANSPARENCY AND SPECIFICITY WHEN SELECTING AND REPORTING POPULATION DESCRIPTORS

Transparency in methodology is a scientific norm for replication of research findings (NASEM, 2019), yet the challenge of transparency is not only in scientific description but also in communicating specifically how and why particular decisions were made. Although imperfect, categories and labels are needed to conduct and communicate science. Transparency, therefore, requires stating the rationale behind the classification scheme and group labels applied when using population descriptors. Beyond describing the exact nature of the study conducted and ensuring reproducibility, comparability and meta-analysis with other studies, transparency about methods, assumptions, and decision making promotes trustworthiness of the research (Claw et al., 2018; NASEM, 2019). Moreover, understanding the factors that inform decision making supports reproducibility.

When communicating their research methods, findings, and conclusions, researchers should be as transparent as possible about the specific procedures used to identify and name groups within their data sets. Transparency can take three major forms:

Clear identification of the concept of human difference underpinning the population descriptor(s) chosen for analysis, and the rationale for that choice;
Verbal descriptions of how samples were collected and labeled, as well as the rationale for the decisions made; and
Sharing analysis scripts and decision rules used to transform per-individual metadata (e.g., responses to surveys) to the labels used in an analysis.

The primary focus of this chapter is on the first two, namely the conceptual approaches and specific language that enable appropriate and accurate use of population descriptors in genomics research. Furthermore, the guidance that follows is intended to provide researchers with best practices and the rationale for decision making, in alignment with the guiding principles outlined in Chapter 3 and in an effort to support the goal of promoting trustworthy research.
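Although the chapter concentrates on the first two forms, the third form (sharing decision rules) can be illustrated briefly. The following is a minimal sketch, not a prescribed implementation: the `LabelRule` structure, the field name `residence_city`, and the example rules are all hypothetical and chosen only to show how each mapping from per-individual metadata to an analysis label can carry its own stated rationale.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LabelRule:
    """One documented decision rule (all field values here are hypothetical)."""
    descriptor: str    # concept of difference, e.g. "geography"
    source_field: str  # per-individual metadata field the rule reads
    response: str      # raw response value as recorded
    label: str         # group label used in the analysis
    rationale: str     # why this response maps to this label


RULES: List[LabelRule] = [
    LabelRule("geography", "residence_city", "Helsinki", "Uusimaa",
              "Analysis is stratified by region of current residence."),
    LabelRule("geography", "residence_city", "Espoo", "Uusimaa",
              "Same region as Helsinki at the chosen scale of resolution."),
]


def assign_label(record: Dict[str, str], descriptor: str,
                 rules: List[LabelRule]) -> Optional[str]:
    """Return the label for `descriptor`, or None when no rule applies.

    Returning None keeps missing or unmapped responses visible instead of
    silently folding them into another group.
    """
    for rule in rules:
        if (rule.descriptor == descriptor
                and record.get(rule.source_field) == rule.response):
            return rule.label
    return None
```

Publishing a rule table of this kind alongside the analysis scripts lets readers see which raw responses were combined under each label, and how individuals with missing data were treated.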

Page 117 Cite

Suggested Citation:"5 Guidance for Selection and Use of Population Descriptors in Genomics Research." National Academies of Sciences, Engineering, and Medicine. 2023. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. Washington, DC: The National Academies Press. doi: 10.17226/26902.

×

In delimiting their study participants, researchers inevitably make choices about which classification schemes or descriptors to use, which scale of resolution to consider, which specific group labels to apply, how to treat individuals with missing data, and so forth. Researchers may also be constrained to using group categories and labels adopted by others in order to allow for data aggregation or harmonization (Doiron et al., 2013; Khan et al., 2022; Wallace et al., 2020). A further challenge arises when such categories have been applied inconsistently, with a mixture of some individuals in a study labeled based on race, others based on ethnicity, and yet others based on geography. For instance, some researchers merge genomic data sets from different sources and assign individuals to clusters on the basis of genetic similarity to each other or to reference panels. Then they assign labels to individuals based on a characteristic that is frequent in the cluster or by using the labels from the reference panels. The number and size of the clusters that are detected in any given study depend on the sample composition. Moreover, the group labeling assigned to these clusters is often highly heterogeneous, borrowing terms from distinct classification schemes, at vastly different scales of resolution, such as African (a continental geographical location)/African American (an ethnicity), East Asian (a geographic location), and Finnish (a nationality). In that regard, it is worth noting that even when the labels are carried over from previous data collection, choices have to be made about what ancillary information to use and which subsets of individuals to combine and split in the new analysis.

CONCLUSION AND RECOMMENDATIONS

Conclusion 5-1. In employing population descriptors and assigning group labels in genetics studies, researchers tend to rely on existing and commonly used population classifications, often with unclear justification for their choices.

Recommendation 6. Researchers should tailor their use of population descriptors to the type and purpose of the study, in alignment with the guiding principles, and explain how and why they used those descriptors. Where appropriate for the study objectives, researchers should consider using multiple descriptors for each study participant to improve clarity.

Recommendation 7. For each descriptor selected, labels should be applied consistently to all participants. For example, if ethnicity is the descriptor, all participants should be assigned an ethnicity label, rather than labeling some by race, others by geography, and yet others by ethnicity or nationality. If researchers choose to use multiple descriptors, each descriptor should be applied consistently across all individuals in that study.

Recommendation 8. Researchers should disclose the process by which they selected and assigned group labels and the rationale for any grouping of samples. Where new labels are developed for legacy samples, researchers should provide descriptions of new labels relative to old labels.
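Recommendations 7 and 8 lend themselves to simple automated checks. The sketch below is one possible approach under stated assumptions: the data layout (a mapping from participant ID to per-descriptor labels), the function names, and the placeholder labels are all hypothetical. It flags descriptors that have not been applied to every participant and renders a crosswalk that describes new labels relative to legacy labels.

```python
from typing import Dict, List

# participant id -> {descriptor: group label}; all values are placeholders
Participants = Dict[str, Dict[str, str]]


def find_unlabeled(participants: Participants,
                   descriptors: List[str]) -> Dict[str, List[str]]:
    """Map each descriptor to the participants that lack a label for it."""
    gaps: Dict[str, List[str]] = {}
    for descriptor in descriptors:
        missing = sorted(pid for pid, labels in participants.items()
                         if not labels.get(descriptor))
        if missing:
            gaps[descriptor] = missing
    return gaps


# Hypothetical crosswalk reporting each new label relative to legacy labels.
CROSSWALK: Dict[str, List[str]] = {
    "new_label_1": ["legacy_label_a", "legacy_label_b"],
    "new_label_2": ["legacy_label_c"],
}


def describe_crosswalk(crosswalk: Dict[str, List[str]]) -> List[str]:
    """Render one human-readable line per new label for a methods section."""
    return [f"{new} combines legacy labels: {', '.join(old)}"
            for new, old in sorted(crosswalk.items())]
```

An empty result from `find_unlabeled` indicates that every chosen descriptor has been applied to all participants. A non-empty result identifies where labeling is inconsistent before any analysis is run.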

To equip researchers with the information needed to follow these recommendations, the committee developed the following decision-making tools and best practices. These tools will be particularly helpful to reviewers of genetics and genomics research proposals who aim to ensure consistent usage of terms and appropriate study designs.

TOOLS FOR SELECTING AND USING POPULATION DESCRIPTORS IN GENETICS AND GENOMICS RESEARCH

The table below and decision tree in Appendix D suggest which descent-associated population descriptors are most appropriate as analytical tools for each of the seven genetics study types outlined in this report. Note that each descriptor represents a particular concept of difference across populations. In other words, the recommendations in the decision tree and table focus on the conceptual building blocks that researchers should use in study design, data analysis, and reporting their results. While the conceptual structure of research naturally has implications for the language that scientists adopt, the tree and table are not intended to be a linguistic straitjacket or a checklist of acceptable words. Instead, the objective of the committee’s guidelines is to encourage genetics researchers to consider, define, and delineate very carefully the concepts of human difference with which they are working, and to choose wording that transparently reflects the analytical steps taken.

These considerations are particularly salient with respect to genetic ancestry, which is not directly observable and is instead inferred from measures of genetic similarity. Therefore, the committee recommends that researchers relying on such measures explicitly refer to genetic similarity when describing their results, rather than the shorthand of genetic ancestry. An exception is for human evolutionary genetics studies explicitly aiming to learn about genetic ancestries over time or space. The committee recommends that when researchers borrow labels from ethnic, racial, political, or geographic classification schemes, they be explicit about their choice of descriptor. For all these reasons, the columns in Table 5-1 distinguish the concept of genetic ancestry from other population descriptors like ethnicity or geography.

Key Terminology for the Selection Guide

Throughout the rest of this chapter, a nuanced understanding of key terms and concepts introduced earlier in this report is necessary (summarized in Box 5-1). In the table and decision tree below, population descriptors refer to conceptual classification schemes used to group people based on specific characteristics. The appropriate application of these concepts in particular study contexts is the primary focus of the recommendations to follow. Group labels are names given to groupings of individuals.

BOX 5-1
Key Terminology for This Chapter

Population descriptor: a concept or classification scheme that categorizes people into groups (or “populations”) according to a perceived characteristic or dimension of interest. A few examples include race, ethnicity, and geographic location, although this is a non-exhaustive list.

Group label: name given to a population that describes or classifies it according to the dimension along which it was identified. An example is French as the label for a group identified by its members’ possession of French nationality, where nationality is the population descriptor.

Ancestral recombination graph: for a set of individuals, the graph depicting the genetic ancestry lines (or paths) that trace back to their common genetic ancestors at every position in the genome.

Ancestry: a person’s origin or descent, lineage, “roots,” or heritage, including kinship. Examples of ancestry group labels include clan names or patronyms, but geographic, ethnicity, or racial labels are often used to denote groups whose members are presumed to share common ancestry.

Genetic ancestry: the paths through an individual’s family tree by which they have inherited DNA from specific ancestors. Genetic ancestry can be thought of in terms of lines extending upwards in a family tree from an individual through their genetic ancestors (see Figure 2-1). Shared genetic ancestry arises from having genetic ancestors in common (that is, overlapping lines of ancestry). For a set of individuals, a fundamental representation of genetic ancestry is a structure called an ancestral recombination graph. In practice, shared genetic ancestry is typically inferred by some measure(s) of genetic similarity.

Genetic ancestry group: a set of individuals who share more similar genetic ancestries. In practice, a genetic ancestry group is constituted based on some measure(s) of genetic similarity. Once a set is designated as a genetic ancestry group, its members are often assigned a geographic, ethnic, or other nongenetic label that is common among its members.

Genetic similarity: quantitative measure of the genetic resemblance between individuals that reflects the extent of shared genetic ancestry.

See Appendix B for further comments, definitions, and citations.

Many of the best practices recommended by the committee rely on understanding the distinctions between genealogical ancestry, genetic ancestry, genetic ancestry group, and genetic similarity. For a full background of these concepts, see the subsection “Ancestry” in Chapter 2. Briefly, genealogical ancestors refer to the collection of ancestors for an individual as found in a family tree, such as parents, grandparents, and so on (Mathieson and Scally, 2020; Rohde et al., 2004). Genetic ancestry refers to the paths through an individual’s family tree by which they have inherited DNA from specific ancestors (Mathieson and Scally, 2020); it is inferred from measures of genetic similarity rather than directly observed (Mathieson and Scally, 2020). Genetic ancestry groups are usually defined by demarcating sets of people based on various measures of genetic similarity. Then these groups are often given a label derived from nongenetic characteristics, such as ethnicity, geography, or race. This mapping of nongenetic descriptors introduces additional assumptions (see Chapters 2 and 4). Moreover, it may suggest homogeneity of genetic and environmental effects within social categories where none exists.

In a number of contexts, reliance on, and reference to, ancestry groupings may be unnecessary for the goals of the study. For example, when matching the background allele frequencies of cases to controls, there is a need to identify a set of individuals who are genetically similar, but not to rely on inferences about their genetic ancestry. Likewise, identifying individuals who are genetically similar to each other or to a reference panel is usually sufficient to delimit a subset of participants for genome-wide association studies (GWAS). Although the distinction between genetic ancestry and genetic similarity may be subtle, it is nonetheless important to enable moving beyond fundamental misconceptions about population descriptors, particularly race and typological thinking.
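To make the distinction concrete, the following minimal sketch selects reference individuals for a focal individual purely by a measure of genetic similarity, without assigning any ancestry label. It assumes genotypes coded as alternate-allele counts (0, 1, or 2) at a shared set of sites and uses a simple identity-by-state proportion. That measure, the function names, and the toy data are illustrative assumptions, and studies in practice often rely on other similarity measures, such as principal components or kinship estimates.

```python
from typing import Dict, List, Sequence


def ibs_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean proportion of alleles shared identical-by-state, in [0, 1].

    Genotypes are alternate-allele counts (0, 1, or 2) at the same sites.
    """
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and equal length")
    total_difference = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 - total_difference / (2 * len(a))


def most_similar(focal: Sequence[int],
                 panel: Dict[str, Sequence[int]],
                 k: int) -> List[str]:
    """Return the IDs of the k panel members most similar to `focal`.

    Ties are broken by ID so that the selection is reproducible.
    """
    ranked = sorted(
        panel,
        key=lambda pid: (-ibs_similarity(focal, panel[pid]), pid),
    )
    return ranked[:k]
```

The selected set is described only by how it was constructed (the similarity measure and the cutoff `k`), which is the kind of reporting the committee favors over a borrowed geographic, ethnic, or racial label.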

Conclusion 5-2. Assigning ancestry group labels based on such descriptors as geography, ethnicity, or race is often scientifically unnecessary and may contribute to typological thinking (Coop, 2022). In particular, genetic ancestry group is commonly conflated with continental geography, which in turn often stands in for—and thereby reifies—race (Lewis et al., 2022).

Orientation to the Selection Guide

We must aspire to research scholarship and assessments and treatments based on actual and not assumed genetic variation, and the social, historical, structural context in which the bodies and lives of the people that we’re interested exist. That means assessing the patterns of diversity that reflect the distribution of human genetic variation across the globe, not proxies thereof.

—Agustín Fuentes, testimony to the committee
in a public session on April 4, 2022

As shown in Table 5-1, best practices in the use of population descriptors vary by study type for any non-disease or disease trait. The committee considered seven major study types: (1) gene discovery for Mendelian traits; (2) prediction for Mendelian traits; (3) gene discovery for complex and polygenic traits; (4) prediction for complex and polygenic traits; (5) elucidation of molecular, cellular, or physiological mechanisms; (6) studies of health disparities with genomic data; and (7) studies of human evolutionary history. For descriptions and examples of each type, see the section “Classification of Genomics Study Types” in Chapter 1. Population descriptors refer to conceptual frameworks for describing descent-associated differences across groups of people. First, careful consideration should be given to whether descent-associated population descriptors are needed at all. If needed, and once researchers identify the appropriate population descriptor or descriptors for the context of their study, they should apply group labels consistent with each concept to all study participants. For a given study, more than one concept may be appropriate, and studies may benefit from using multiple descriptors. For example, a project may incorporate both geography and ethnicity simultaneously to distinguish, say, Kurds in Iraq from Kurds in Turkey.

In some contexts, descent-associated population descriptors are used not as indicators of shared genetic ancestry but as proxies for shared environmental exposures (see “The Importance of Environmental Factors in Genetics and Genomics Research” in Chapter 2). This practice should be avoided where possible in favor of measuring the environmental variable directly. Nevertheless, when direct measurements are not possible, Table 5-1 indicates which population descriptors might be most appropriate. While race (or racialized group) may capture some shared exposure to racism and therefore may be suitable for some health disparities studies, it is a poor proxy for other environmental exposures and carries the risk of contributing to typological thinking. Therefore, the committee recommends race not be used outside of a subset of health disparities studies. Even in that context, the combination of information from other classification schemes (e.g., ethnicity and geography) may be more accurate. Moreover, should descent-associated population descriptors be used as proxies for environments, that research-design decision should be explicitly noted and its rationale explained.

Finally, readers should bear in mind that the recommendations in Table 5-1 apply to the analytical use of population descriptors—that is, as variables or other tools in analysis. The committee recognizes, however, that researchers may wish—or be obligated—to use population descriptors for other research-related activities, notably for constructing and/or describing samples of individuals whose genetic material is to be analyzed. In the interests of equity, justice, or the diversification of human genetic data and knowledge about it, researchers may choose to use race and/or ethnicity in order to identify individuals to be included in their studies (Oni-Orisan et al., 2021), and Table 5-1 is not meant to govern such sampling decisions or procedures. Even outside the realm of analysis, however, the committee encourages scientists to carefully consider whether race and/or ethnicity are the most conceptually appropriate and useful descriptors for the information they wish to capture, or the best guides to seeking a heterogeneous sample.

Population Descriptor Selection Guide

Table 5-1 is a highly condensed summary of the best practices described in this chapter. It should not be inferred to indicate in absolute terms what to use and not use in every circumstance. To use Table 5-1 effectively, the reader is advised to review the subsection “Key Terminology for the Selection Guide” above and consult the text describing the best practices for each specific study type in conjunction with viewing the table. In addition, the reader should note that Table 5-1 provides only a broad overview and summary of the best practices; additional considerations for decision making are outlined in the decision tree (Figure D-1 in Appendix D) and in the body of this report.

The text that follows explains what is summarized in the table and illustrated in the decision tree in Appendix D. Although the text does not cover every possible variation of the genetics study types, the intent is for the discussion and examples to allow researchers to understand why certain population descriptors are recommended or discouraged depending on the type of study and the goals of the research.

TABLE 5-1 Recommended Approaches for the Use of Population Descriptors by Genomics Study Type

This table should be read and interpreted in conjunction with the report text. Consult the decision tree in Appendix D for more information and Chapter 5 text for best practices for each study type. See also the terminology box preceding the table and descriptions of each study type in Chapter 1 section “Classification of Genomics Study Types.” For any given study, the use of multiple descriptors may be preferable.
LEGEND
Preferred population descriptor(s)
Should not be used
In some cases; refer to Ch. 5 text and the decision tree in Appendix D
Descriptors could be used if appropriate proxies for environmental, not genetic, effects
GENOMICS STUDY TYPE	Race	Ethnicity/Indigeneity	Geography	Genetic Ancestry	Genetic Similarity	Notes
1: Gene Discovery - Mendelian Traits						Similarity suffices as a genetic measure; at fine-scale, other variables may be useful
2: Trait Prediction - Mendelian Traits						No population descriptors may be necessary for analysis
3: Gene Discovery - Complex Traits						Similarity suffices as a genetic measure
4: Trait Prediction - Complex Traits						Similarity suffices as a genetic measure
5: Cellular and Physiological Mechanisms						No population descriptors may be necessary for analysis
6: Health Disparities with Genomic Data						Not all health disparities studies rely on descent-associated population groupings, so none may be necessary for analysis
7: Human Evolutionary History						Reconstructing genetic ancestry may be of central interest

Study Type 1: Gene Discovery for Mendelian Traits

Sequence variants that underlie Mendelian diseases fall into two categories: de novo mutations and inherited ones. When the goal is identifying de novo mutations, as through family studies (e.g., Simons et al., 2013; Turner et al., 2017), no descent-associated population descriptors are necessarily needed to identify variants. In some contexts, it may nonetheless be helpful to provide population descriptors (e.g., current geographic location) for the families that were studied in order to enable identification of additional cases or study the geographical spread of new mutations (e.g., Wexler, 2004).

Conclusion 5-3. Where the goal is to describe families in which de novo mutations have been identified, the relevant information is likely much more finely scaled than broad categories like those labeled by continental ancestry (or other large-scale genetic ancestry) or by ethnicity.

Best Practice 1: To enable identification of additional cases, rather than using genetic ancestry or ethnicity, researchers should use categories based on kinship (e.g., recent genealogical ancestors), identity-by-descent information, or fine-scaled geographical or genetic similarity data.

Studies aimed at identifying Mendelian disease variants use not only pedigrees, but also collections of unrelated affected persons (loosely called cohorts). A small number of variants found in these affected persons will be de novo, but most will be inherited and of unknown functional significance and pathogenicity. To annotate such variants, researchers commonly rely on a reference database composed of data from individuals who are not diagnosed with the disease, in order to exclude those variants that are likely unrelated to the condition. In doing so, the goal is to exclude variants that are not rare (i.e., common variants) in groups of individuals or populations with genetic backgrounds similar to one another. Alternatively, researchers may rely on a global reference panel to evaluate whether the variant of interest is at high frequency anywhere in the world, and if so, exclude it as a likely causal mutation.
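The global-panel filter described above can be sketched in a few lines. This is an illustrative sketch only: the data layout, the function names, the 1 percent threshold, and the choice to treat variants absent from the reference as rare are all assumptions, not values or rules taken from the report. Reference sets are identified here by neutral names, on the assumption that they were delimited by genetic similarity.

```python
from typing import Dict, List

# variant id -> {reference set id: allele frequency}; sets are assumed to be
# delimited by genetic similarity, and all names and numbers are hypothetical.
ReferenceFrequencies = Dict[str, Dict[str, float]]


def max_reference_frequency(variant: str,
                            reference: ReferenceFrequencies) -> float:
    """Highest frequency of `variant` in any reference set (0.0 if unseen)."""
    return max(reference.get(variant, {}).values(), default=0.0)


def keep_rare(candidates: List[str],
              reference: ReferenceFrequencies,
              max_af: float = 0.01) -> List[str]:
    """Keep candidates that are not common in ANY reference set.

    A variant above `max_af` anywhere in the panel is excluded as an
    unlikely cause of a rare Mendelian condition.
    """
    return [v for v in candidates
            if max_reference_frequency(v, reference) <= max_af]
```

Because the filter takes the maximum over all reference sets, it does not require assigning the affected person to any group, which avoids relying on a racial, ethnic, or national category as a proxy for allele frequencies.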

Best Practice 2: Where researchers aim to match the genomes of focal individuals to people that are genetically similar, they should not rely on matching solely by racial (e.g., black), ethnic (e.g., Hispanic), geographic (e.g., West African), or national (e.g., Nigerian) categorizations, which are poor proxies for allele frequencies. Since in this context researchers are not interested in genetic ancestry per se, the committee further recommends that, when possible, they avoid ancestry category labels (e.g., “Admixed-American” in the 1000 Genomes), as such labeling brings in additional, unnecessary assumptions (see Chapters 2 and 4). The committee recommends instead that they rely on genetic similarity measures to delimit the reference set to which to compare focal individuals and to describe study participants (Coop, 2022).

Study Type 2: Prediction for Mendelian Traits

Examples of phenotypic prediction for Mendelian traits include prenatal or newborn screening (e.g., for Tay-Sachs disease or phenylketonuria [PKU], respectively) or clinical testing for highly penetrant germline mutations that increase disease risk (e.g., Tan et al., 2017). Once the genetic basis for a disease has been elucidated, group labels may no longer be needed as a stand-in for allele frequencies. Instead, people can readily be genotyped for the variants themselves. Furthermore, making screening for specific alleles available only in people with particular descriptors (e.g., Tay-Sachs screening only in children of Ashkenazi Jewish parents) will miss some disease-variant carriers (Dolitsky et al., 2020; Nazareth et al., 2015). In clinical scenarios, there may be exceptions where allele frequencies are necessary to estimate the genotype of missing parents. Here again, genetic similarity is more appropriate than reference to genetic ancestry.

For some traits considered Mendelian (e.g., Huntington’s disease), prognosis, for instance regarding the age of onset, depends on modifier alleles in the genome (GeM-HD Consortium, 2015). Where the modifier alleles are unknown, which they usually are today, information about genetic similarity may be helpful in providing some information about allele frequencies at other loci in the genome. When the modifier alleles are known, however, sequencing the individual will provide much more accurate individual information than will population descriptors.

Best Practice 3: Where the genetic basis of a trait is known, researchers should focus on characterizing the individual’s alleles rather than use population descriptors as an unreliable proxy for the genomic background.

Phenotypic trait prediction of Mendelian disorders may also depend on environmental factors, such as air pollution or secondhand smoke exposure for children with cystic fibrosis (O’Neal and Knowles, 2018). In such cases, researchers may be tempted to include such population descriptors as race, ethnicity, or geography to capture shared exposures (Martinez et al., 2022). However, where the aim is explicitly to measure environmental effects, the committee recommends that descent-associated population descriptors not be used in place of individual-level data, as the use of such descriptors runs the risk of erroneously suggesting that the effects of interest are genetic.

Best Practice 4: Given that any descent-associated population descriptor will be a poor proxy for environmental effects (Benmarhnia et al., 2021; Martinez et al., 2022), researchers should aim to directly collect information about as many potentially relevant environmental factors as possible.

Best Practice 5: When including population descriptors for phenotype prediction of Mendelian traits, researchers should be explicit about whether the aim is to study genetic or environmental effects or both, and whether these can be disentangled given the study design.

Study Types 3 and 4: Gene Discovery and Prediction for Complex and Polygenic Traits

Complex traits, such as height or the risk of a disease such as type 2 diabetes, depend on the effects of not only many loci across the genome but also the environment (Falconer and Mackay, 1996). In the past two decades, the main method to map the genetic basis of such traits has been genome-wide association studies of unrelated individuals (Hirschhorn and Daly, 2005). Such efforts have been motivated by two distinct goals: to identify loci that affect a particular phenotype, and to predict trait values in currently asymptomatic individuals (Visscher et al., 2017). Association studies are conducted in sets of individuals that have some degree of genetic similarity in order to better control for effects of alleles in the genomic background as well as potential environmental effects that correlate with the genomic background. How much of a problem environmental confounding presents depends on the trait (e.g., Okbay et al., 2022).

Study Type 3: Gene Discovery for Complex and Polygenic Traits

For researchers who are mapping variants that influence complex trait values, common practice is to describe their study participants as members of genetic ancestry groups that are labeled with geographic, ethnic, or racial terms. Such labels for inferred ancestry groups are defined at very different levels of resolution depending on the data at hand, such as the Southern European versus the white British subsample of the UK Biobank. Moreover, to increase their sample sizes when they lack access to the original data, researchers often combine summary statistics provided by different studies (Lesko et al., 2018). In that case, current practice usually consists of grouping them under a general label, such as European, without spelling out the
often-implicit assumptions about genetic and environmental effects in the combined samples or considering the genetic or environmental diversity within any such group (Coop, 2022). In this context, researchers are not interested in ancestry or race per se, and instead are aiming to identify study participants who are more genetically similar to one another, to better control for effects of the genomic background and correlated environmental effects.

Similar considerations arise when conducting mapping studies in recently admixed individuals (Shriner, 2013; Thornton and Bermejo, 2014). In this context, the mapping is conducted by considering local ancestry estimates, that is, inferences based on genetic similarity for different segments of the genome (Atkinson et al., 2021). The statistical model sometimes also includes genome-wide ancestry proportions, which capture genomic background effects beyond the scale of local ancestry estimates, and, sometimes, ethnic or racial labels as a proxy for environmental exposures.

Best Practice 6: When mapping variants that contribute to complex traits, the goal is to conduct the study in a set of individuals that are genetically more similar, rather than to infer ancestry per se. Therefore, researchers should characterize their study participants in terms of their genetic similarity to one another or to a reference panel, with a specified similarity measure (Coop, 2022).

As an example, researchers would describe samples as “carrying genotypes most genetically similar by measure X to the GBR panel of the 1000 Genomes data set, as compared to individuals sampled elsewhere in the world” (GBR being the acronym for British in England and Scotland) (Coop, 2022) or by using coordinates in a low-dimensional representation of the data, like principal component analysis (PCA) or uniform manifold approximation and projection (UMAP) (e.g., “individuals projecting to the region [-0.1,-0.05] in PC1 and [0.3,0.5] in PC2 of a PCA generated from the 1000 Genomes data set”). For recently admixed individuals, this description would then naturally lend itself to statements such as, “Seventy-three percent of the genome is most genetically similar to genotypes of individuals in the GBR panel, and 27 percent of the genome is most similar to genotypes of individuals in the YRI panel” (YRI being the acronym for Yoruba in Ibadan, Nigeria). (Or alternatively, “73% of the genome is most similar to genomes from region 1 and 27% from region 2 in a 1000 Genomes PCA.”) This approach avoids descriptions of recently admixed people as either African or European when they derive recent ancestry from diverse locations in both continents. Importantly, this descriptive change does not alter or compromise the underlying science. For comparability across studies, there
is still work to be done to assess which similarity method to use and how these may change with the composition of the reference panel and different choices of genotype measures.
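The two descriptive styles above (a window in PC space, and the share of the genome most similar to each reference panel) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the sample IDs, coordinates, segment lengths, and function names are all hypothetical, and the PC window simply reuses the example bounds quoted in the text.

```python
from collections import defaultdict

# Hypothetical PC1/PC2 coordinates of study samples projected onto a
# reference PCA (values are invented for illustration).
projected = {
    "s1": (-0.08, 0.41),
    "s2": (-0.02, 0.35),
    "s3": (-0.06, 0.55),
}

def in_pc_region(coords, pc1_range=(-0.1, -0.05), pc2_range=(0.3, 0.5)):
    """Return True if a sample falls inside the stated PC1/PC2 window."""
    pc1, pc2 = coords
    return (pc1_range[0] <= pc1 <= pc1_range[1]
            and pc2_range[0] <= pc2 <= pc2_range[1])

# Samples that would be described as "projecting to the region ..."
retained = [sid for sid, xy in projected.items() if in_pc_region(xy)]

def similarity_fractions(segments):
    """Fraction of genome length most similar to each reference panel.

    `segments` is a list of (length, panel) pairs, as might be derived
    from local similarity calls; lengths may be in any consistent unit.
    """
    totals = defaultdict(float)
    for length, panel in segments:
        totals[panel] += length
    genome = sum(totals.values())
    return {panel: total / genome for panel, total in totals.items()}

# Hypothetical segments for one recently admixed individual.
fractions = similarity_fractions([(50, "GBR"), (27, "YRI"), (23, "GBR")])
```

Reporting the window bounds and the panel names alongside the results keeps the description reproducible without attaching a continental label to the individuals.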

Study Type 4: Prediction for Complex and Polygenic Traits

GWAS results can be useful for trait prediction, even where the mechanism linking genotypic variation to phenotypic variation is not understood (Torkamani et al., 2018). In particular, the hope is to use polygenic scores (PGS) to identify individuals at high risk for specific diseases (e.g., Khera et al., 2018; Mavaddat et al., 2019). PGS are calculated by summing alleles carried by an individual, weighted by effect sizes of alleles that are estimated in association studies (often GWAS); PGS provide a predictor of a deviation from a mean value in a given study population (often adjusted for relevant covariates, such as sex and age) (Sirugo et al., 2019).
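The weighted sum described above can be written out as a short sketch. The variant IDs, effect sizes, and allele counts below are invented for illustration, and the function name is hypothetical; real pipelines also handle allele alignment, missing genotypes, and LD, all of which are omitted here.

```python
def polygenic_score(dosages, effect_sizes):
    """Sum of effect-allele counts (0, 1, or 2) weighted by effect sizes.

    Variants that have no estimated effect size are skipped.
    """
    return sum(
        count * effect_sizes[variant]
        for variant, count in dosages.items()
        if variant in effect_sizes
    )

# Hypothetical per-allele effect sizes estimated in an association study.
effects = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical effect-allele counts for one individual; "rs_d" has no
# estimated effect and therefore does not contribute.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0, "rs_d": 1}

score = polygenic_score(person, effects)
```

The raw number is only interpretable relative to the distribution of scores in the study population in which the effect sizes were estimated, which is why the description of that population matters.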

Polygenic scores (also called polygenic risk scores) are based on GWAS that pick up not only causal loci but also genetic variants correlated with causal variants, to an extent that depends on allelic association (called linkage disequilibrium or LD) patterns among sites (Choi et al., 2020). Since at present causal variants can rarely be pinpointed, the construction of a PGS requires weighting all these associations. In practice, therefore, phenotypic prediction of complex traits relies on LD patterns characterized in a set of individuals that are genetically similar, by some operational definition.

When the goal is to predict trait values—as distinct from identifying causal loci—it may not be as important to entirely control for environmental effects on the trait that are correlated with genetic differences. In some contexts, uncontrolled environmental stratification can actually enhance predictive power (Mostafavi et al., 2020). For related reasons, the practice of performing genetic prediction after stratifying by a population descriptor can increase predictive power because it implicitly captures both genetic similarity and shared environmental exposures. A danger, though, is that by including a contribution of nongenetic effects into what is widely understood to be a genomic predictor, this practice will end up over-emphasizing the role of genetics in trait etiology and reifying group differences.

Another important aspect of genomic trait prediction is generalizability beyond the GWAS study population. Generalizability is particularly important because, to date, the vast majority of GWAS have been conducted by sampling people in Europe or those who report recent European ancestry (Martin et al., 2019; Mills and Rahal, 2020). Given that LD patterns vary across the globe (Charles et al., 2014), as do the frequencies of causal loci, the prediction accuracy of PGS is expected to decrease with genetic divergence from the GWAS set, even if nothing else were to differ (Wang et al., 2020, 2022). That decrease is seen in practice: PGS have lower prediction
accuracy with increasing genetic distance from the GWAS set of individuals (Martin et al., 2019; Privé et al., 2022; Scutari et al., 2016; Wang et al., 2020). Factors other than LD and shifts in causal allele frequencies may also decrease prediction power, such as differences in the degree of environmental variance or gene–environment interactions; in other words, genetic effects may differ across environmental settings (Giannakopoulou et al., 2021; Mills and Rahal, 2020; Mostafavi et al., 2020; Wang et al., 2022).

Best Practice 7: When predicting complex traits, the goal is to study a set of individuals that vary in a trait but are relatively similar genetically, rather than to infer ancestry per se. Therefore (as with Best Practices 5 and 6), researchers should characterize their study participants in terms of their genetic similarity (to one another or with regard to a reference panel), with a specified similarity measure (Coop, 2022).

Considerations Common to Gene Discovery and Prediction for Complex and Polygenic Traits

The committee recognizes that after delimiting study participants based on genetic similarity to a reference panel, researchers may want to refer to the set of study participants with a label based on ethnicity (e.g., Yoruba), nationality (e.g., Nigerian), or geography (e.g., residing in Nigeria)—or a combination of various labels—either as shorthand in communicating the results, or to underscore a particular characteristic of the group that distinguishes their ethnicity, geography, or demographic history from that of the closest other individuals in the reference panel. In so doing, care should be taken to avoid applying broad labels (e.g., African ancestry) to panels represented by narrower sampling (e.g., YRI). This consideration underscores the need for more widely available, geographically and ethnically diverse reference panels. In every case, researchers should be transparent about their reasons for using such ancestry labels and for the choice of the particular label(s) in question. Importantly, in many cases, it may be unnecessary to refer to genetic ancestry at all, since terms such as the study population, alongside information about how and where individuals were sampled, may be sufficient and require fewer assumptions.

The committee further appreciates that when researchers use summary statistics from a previous GWAS and have no access to the individual-level genotype and phenotype data (e.g., when conducting a meta-analysis), it is not always feasible to assess genetic similarity to a reference panel. In this case, researchers should be explicit about the reasons for the nomenclature they have adopted or borrowed (e.g., if grouping many sets of individuals under a common label) and the procedure by which individuals have been retained or excluded from the sample; where possible, they should adopt
labels based on the use of genetic similarity. In this regard, those sharing data should also attempt to provide indirect measures of genetic similarity (e.g., summary statistics for coordinate positions in a reference principal component analysis) that might enable genetic similarity to be assessed more precisely than is possible with group labels.
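As one possible reading of "summary statistics for coordinate positions in a reference principal component analysis," a data provider could share the per-axis mean and standard deviation of a cohort's projected coordinates. The sketch below assumes the projection has already been done; the coordinates and the function name are hypothetical.

```python
from statistics import mean, stdev

def pc_summary(coords):
    """Per-axis mean and standard deviation of projected PC coordinates."""
    axes = list(zip(*coords))  # transpose: one tuple per principal component
    return [{"mean": mean(axis), "sd": stdev(axis)} for axis in axes]

# Hypothetical (PC1, PC2) positions of a cohort in a reference PCA.
cohort_coords = [(-0.09, 0.40), (-0.07, 0.44), (-0.08, 0.42)]

summary = pc_summary(cohort_coords)
```

Such a summary reveals nothing about individual genotypes, yet it lets downstream users judge how close a cohort sits to a reference panel more precisely than a group label would.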

Where no reference panel is available, researchers often use a group label based on an attribute that is common to the study participants, such as a subset of people who self-identify as “white British” in the UK Biobank. Researchers should be explicit about their reasons for choosing the attribute used to delineate and describe the study participants.

Best Practice 8: Researchers should describe samples in as many dimensions as possible, using population descriptors, individual-specific environmental data, and their ascertainment scheme (e.g., were participants recruited from a research hospital, in an urban or rural area, and so on).

Best Practice 9: When descent-associated descriptors such as ethnicity or geography are used, researchers should be explicit about what types of effects they intend to capture—genetic, nongenetic, or both—and whether the effects can be teased apart reliably given the study design.

Best Practice 10: Where the goal is to control for environmental effects that are correlated with genomic background effects, researchers should, if possible, replace or, at least, augment the use of population descriptors with more reliable and precise measures of individual environmental effects. Whenever labels remain, researchers should be explicit about their reasons for using them.

Cataloging the data collection in these ways will enable samples to be assessed for their genetic similarity to reference panels as well as for their similarity along nongenetic dimensions such as employment status or geographic location. A richer description of the data will also help to identify obstacles to generalizability beyond genetic similarity. A further benefit may be that forms of study ascertainment or enrollment bias could potentially be taken into account (e.g., Van Alten et al., 2022).
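A multidimensional description of this kind can be captured in a simple record. The following sketch is illustrative only: the class name, field names, and example values are assumptions rather than a schema proposed by the committee.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class SampleDescription:
    """One study participant described along several dimensions."""
    sample_id: str
    similarity_measure: str          # how genetic similarity was measured
    reference_panel: str             # panel the sample was compared with
    self_described_ethnicity: Optional[str] = None
    descriptor_source: Optional[str] = None    # e.g., self-reported
    sampling_location: Optional[str] = None
    recruitment_setting: Optional[str] = None  # ascertainment scheme
    environment: dict = field(default_factory=dict)
    descriptor_rationale: str = ""   # why each descriptor is used

    def missing_dimensions(self):
        """Names of fields left empty, to flag gaps in the description."""
        return [name for name, value in asdict(self).items()
                if value in (None, "", {})]

# Hypothetical record with only some dimensions filled in.
record = SampleDescription(
    sample_id="s1",
    similarity_measure="measure X",
    reference_panel="1000 Genomes GBR",
    sampling_location="urban research hospital",
    environment={"employment_status": "employed"},
)
gaps = record.missing_dimensions()
```

Listing the empty fields makes it explicit which dimensions a study could not describe, rather than leaving readers to infer them from a single group label.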

Finally, a major goal of all of these genetics and genomics studies, particularly GWAS, is to dissect the genetic and environmental architecture of these traits, and to identify the underlying mechanisms (pathophysiology). Further progress along this front will almost surely require the collection of new samples with new data, particularly longitudinal data, rather than simply retrofitting legacy studies.

Study Type 5: Elucidation of Molecular, Cellular, or Physiological Mechanisms

Many studies that include human genetic data ultimately aim to understand the molecular, cellular, tissue, and physiological underpinnings of traits, sometimes triggered by gene discovery. One example might be studies aimed at understanding the genetic and neuronal mechanisms by which a DNA repeat expansion causes Huntington’s disease (Jimenez-Sanchez et al., 2017). Another is the molecular mechanism underlying messenger RNA (mRNA) vaccines against SARS-CoV-2: their development relied on an understanding of antibody production and chemical modifications to mRNA that help evade the human innate immune response, much of which was learned in mice, human cells, and other model systems (Delorey et al., 2021; Sadarangani et al., 2021). In such cases, where underlying mechanisms are expected to be shared by all humans (and often by other species), there is no compelling reason to stratify study participants by descent-associated population descriptors at all.

As noted by Pavličev and Wagner (2022), “A shared mechanistic basis of a trait does not mean that exactly the same loci will be detectable by association with variation in this trait.” Conversely, the observation that variation in a trait differs in its allelic basis among humans does not imply that the underlying mechanisms are different. Thus, despite the universality of the underlying mechanisms of vaccination (Delorey et al., 2021; Sadarangani et al., 2021), humans vary in their specific response (Randolph et al., 2021), likely because of both genetic variants and environmental exposures. As an example, in all humans, myopia is caused by deformations in the shapes of the eye or cornea and can be corrected by eyeglasses (Chakraborty et al., 2020). Nonetheless, the genetic and environmental factors that lead to myopia likely differ across the world, owing to changes in allele frequencies, average effect sizes, and environmental exposures (Chakraborty et al., 2020; Li and Zhang, 2017). Researchers may be interested in understanding such perturbations to the underlying mechanisms and how they are distributed geographically, but often the primary goal is to leverage these perturbations (e.g., loss-of-function mutations) as a tool to better understand underlying mechanisms.

When specific candidate loci or salient environmental factors are unknown, a common approach has been to use population descriptors, and in particular ancestry group labels, as a proxy for differences in allele frequencies across the genome and potentially environmental exposures. A danger of this approach is the implication, implicit or explicit, that the underlying mechanisms themselves somehow differ by population descriptors, when in fact, the observed differences are caused by alleles at specific loci in the genome or varying environmental exposures (or interactions of the two). Once the nature of the perturbations has been identified, any observed
differences between groups defined by population descriptors will resolve as differences between individuals carrying distinct alleles and the environments to which they are exposed.

Conclusion 5-4. Given that underlying cellular and physiological mechanisms are expected to be universal among humans, the default practice in such studies should be to not use any descent-associated population descriptors.

Best Practice 11: When researchers are interested in studying perturbations to underlying mechanisms that arise from genetic variation, and the genetic variants are known, population descriptors should not be used as a substitute for individual information. If, instead, the genetic variants are unknown, and researchers are interested in delimiting a set of individuals with similar allele frequencies, they should rely on genetic similarity rather than such descriptors as ethnicity or geography.

Best Practice 12: Where the goal is to study the effect of unknown environmental exposures or possible gene–environment interactions, researchers should aim to replace or supplement population descriptors with direct information about potentially salient environmental factors. Regardless, researchers should be explicit about their intent in using population descriptors, including whether the aim is to study genetic or environmental effects or both, and whether these can be teased apart given the study design.

Study Type 6: Studies of Health Disparities with Genomic Data

Health disparities studies often compare groups of individuals identified by different descent-associated population descriptors (e.g., by OMB racial and ethnic categories). Some of these studies include genetic information, such as genome-wide genotyping data (Batai et al., 2021), data for variants at a single locus (e.g., apolipoprotein E gene—APOE4) (Torres and Kittles, 2007), or tumor genome sequencing (Daly and Olopade, 2015; Spratt et al., 2016). Other health disparities studies include only nongenetic data but may assign the unexplained variance to untested genetic differences (e.g., Kistka et al., 2007).

Conclusion 5-5. It is invalid to assign unexplained trait variance to any type of effect without direct evidence; notably, racial or ethnic phenotypic differences cannot be ascribed to genetic differences without evidence. The unexplained variance could be caused by environmental factors that are not considered or were imprecisely or inaccurately measured, or by inadequacy of the statistical model used.

Given the variety of goals and sources of input data for different health disparity studies, it is helpful to consider some of these categories separately. Below is a short but incomplete list of three types of health disparities study that include genetic and genomic data.

Health Disparities Study Type 1: The sole goal is to study the role of one or multiple genetic variants on observed or possible health disparities between groups.

Best Practice 13: In this type of study, what is needed is to consider the effects of the focal variant of interest among individuals with similar allele frequencies, so genetic similarity is the relevant descriptor to use, and racial and ethnic labels should not be used. The use of genetic similarity to a reference panel is both more accurate and more transparent than using descent-associated descriptors such as race or ethnicity (Coop, 2022).

In cases where ancestries are correlated with traits such as skin color (Parra et al., 2004), which may mediate the effect of racism on health (Kittles et al., 2007; Teteh et al., 2020), genetic ancestry may be considered if these traits are a key component of the research question.

Health Disparities Study Type 2: The goal is to study the effect of environmental exposures or examine possible gene–environment interplay.

Best Practice 14: Researchers should avoid racial or ethnic labels because they are poor proxies for differences in environmental exposures. Instead, the committee recommends that they replace or supplement descent-associated population descriptors with information about the relevant factors that mediate differences in environmental exposures, such as education, types of employment, housing quality, and access to health care, to name only a few.

There is one exception to Best Practice 14. When the goal is explicitly to study the effect of structural racism and discrimination, then racial and ethnic labels may be appropriate but need to be carefully described (e.g., self-identified or not) and justified. Instruments and variables that measure discrimination (e.g., Williams’s Everyday
Discrimination Scale¹) or mediators of discrimination directly may be more appropriate, although challenging to implement (Hardeman et al., 2022).

Health Disparities Study Type 3: Although the goal is to study the effect of environmental exposures or examine possible gene–environment interplay, information about environmental factors is limited.

Best Practice 15: If environmental information is unavailable and population descriptors such as race or ethnicity are used as proxies for it in such studies, for example in analyses of electronic health records, then their source should be described in detail (e.g., self-reported or assigned by provider) and along multiple dimensions (e.g., Hispanic, Mexican-American, rural, sampled in Texas health clinic, born in Texas). Moreover, the researcher should explicitly state why each descent-associated population descriptor is being used, by identifying specifically what types of effects their inclusion is intended to capture and the accuracy of this capture.

Study Type 7: Studies of Human Evolutionary History

Population genetics studies of human history and prehistory aim to use genetics to make inferences about the genetic evolution of humans and integrate such inferences with data from archeology, history, paleontology, and other disciplines (e.g., Nielsen et al., 2017). Many such studies analyze variation data using models and methods that employ the mathematical construct of discrete, unstructured populations (Gutenkunst et al., 2009; Patterson et al., 2012; Pritchard et al., 2000). It is common practice also to rely on samples collected by geographic or ethnic criteria (e.g., Scheinfeldt et al., 2019).

Some studies of human evolutionary history embed all samples within the same analytic structure, notably when inferring an ancestral recombination graph (e.g., Schaefer et al., 2021). In that case, descent-associated population descriptors may not be necessary, such as when the goal is to estimate the time to the most recent common genetic ancestor of modern humans at a locus (Mallick et al., 2016).

Nonetheless, population descriptors will often be needed to describe the sample collection scheme to other researchers and to capture characteristics of sampled individuals that help place them in a historical and geographic context. These descriptors might include the geographic provenance of the sample, or some indicator of the geographic or ethnic affiliation of an individual’s recent ancestors (e.g., via grandparental birthplace questionnaires, or by reference to ethnicity, such as “Houston residents who identify ethnically as Gujarati”). In many circumstances, a sample will be labeled by the ethnicity, current geographic location, or commonly spoken language of present-day people to which its genome bears the greatest genetic similarity (e.g., Yoruba, Andamanese, Basque).

___________________

¹ See https://scholar.harvard.edu/davidrwilliams/node/32397 (accessed January 20, 2023).

Conclusion 5-6. In genetics studies of human evolutionary history, social or geographic population descriptors are often used to describe genetic ancestry groups inferred based on genetic similarity (e.g., labels may be based on shared characteristics of participants such as language spoken, self-identified ethnicity, or location sampled) in order to shed light on population history.

In studying human evolution, researchers may also be interested in studying the genetic and phenotypic changes that occurred in response to localized selection pressures. To study such biological adaptations, which occur through systematic changes in allele frequencies over generations in groups of individuals, researchers will often delimit a set of people whose ancestry is thought to have been subject to similar selection pressures at some time point (e.g., to study the evolution of lactose tolerance in descendants of Nilotic-speaking pastoralists from East Africa in the past several thousand years). A challenge is that the appropriate scale will often be unknown a priori. For example, in studying human adaptation to Plasmodium vivax, continental groupings likely do not offer the necessary fine-scale resolution. Then one must address whether to try to enrich specifically for individuals whose recent genetic ancestors lived in environments where P. vivax was common, or focus on individuals currently living in such environments, despite the fact that their ancestors may not have been subject to the same pressures.

Best Practice 16: When gathering new data in genetics studies of human evolutionary history, after researchers engage local communities as described in Chapter 4, they should collect and include population descriptors along multiple dimensions, both to convey the myriad ways in which an individual could be described and to enable additional uses of these samples in the future. Notably, in addition to genetic data, researchers should also report each participant's sampling location and, when known, birthplace, parental birthplaces, language(s) spoken, and self-described ethnicity. However, researchers should be consistent with population descriptors used for all samples in a study (for example, it is
not good practice to use self-identified ethnic group for some samples but geographic origin for others).

The committee appreciates that many researchers will use existing data and therefore inherit population descriptors that may not be of their design. In that case, researchers should be transparent about the specific criteria according to which they included or excluded individuals. Moreover, when using legacy data, researchers should be mindful to apply consistent population descriptors across samples within the study. New labels may be appropriate to define, and when doing so, the new labels and their relationship to previous ones should be communicated.

While not a focus of this report, the committee notes additional challenges can arise in assigning population descriptors in studies of ancient DNA, which often integrate genetic data with archeological, or even historical data, to make inferences about modern population origins. Individuals in such studies are often given population labels based on cultural practices inferred from material objects identified from archeological data (e.g., the Corded Ware and Yamnaya cultures) (Eisenmann et al., 2018). Assigning cultural population names to ancient individuals clustered together using genetic data can be problematic. As Eisenmann and colleagues note:

Giving groups that have been identified through a completely different line of evidence—in this case material culture and genomics—the same or related names results in their conflation and the archaeological designations risk becoming reified in genetic terms (and vice versa) (Eisenmann et al., 2018).

Their recommendation is to either label genetically defined populations numerically (giving no cultural label) or use a mixed system where names are based on a combination of geographic and subsistence terms, and a relative time span, together with archeological culture when appropriate (Eisenmann et al., 2018). One example of such a label would be to describe individuals from present-day Spain and dating to the Early Neolithic period as “Spain_EN” (Eisenmann et al., 2018). Such practices are in general alignment with the principles outlined in this report, though the committee reiterates that consideration of the questions of interest when choosing what labels to use, if any, is paramount.

Decision Tree for the Use of Population Descriptors

To aid a researcher contemplating a specific genetics or genomics study, the committee believes that a decision tree to systematically decide which descent-associated population descriptors to consider using and which to
avoid is a helpful addition to the table. The decision tree can be found in Appendix D. The process begins by asking the following questions:

1. What is the purpose of your study?
2. Are you collecting new data, working with existing individual-level data, or using summary-level legacy data?
3. Does your research question pertain to environmental sources of differences?
4. If the answer to question 3 is yes, then do you plan to study environmental effects as a predictor or as a control variable?

CONSIDERATIONS FOR HARMONIZATION OF POPULATION DESCRIPTORS ACROSS STUDIES

In general, harmonization enhances comparability of data among different studies and enables the continued use of existing data to answer new research questions (Doiron et al., 2013; Khan et al., 2022; Wallace et al., 2020). Harmonization of population descriptors, specifically, would allow greater interoperability among data sets in human genomics research. Although the advantages of harmonization are clear, there are many challenges to the harmonization of population descriptors (see “Challenges of Harmonization and Legacy Data” in Chapter 1). Descriptors differ not only in scale or resolution but also in the concepts they represent. For example, is it possible to harmonize studies where one uses race or ethnicity, another uses geography, and yet another uses genetic similarity? Another consideration is how to harmonize descriptors while accounting for both the unique preferences and the needs that analytical groups within consortia may have (Lee et al., 2019). There is a fundamental tension between harmonization on one hand and flexibility or specificity on the other, and the solution is not straightforward. To cope with these challenges, informatics tools have been developed to harmonize data and metadata. Some examples include common data elements (see Box 5-2), machine learning algorithms, visualization tools, and data processing standards.

Over time, regularly employing the best practices and recommendations in this chapter will promote harmonization across studies. As illustrated through the best practices above, different descriptors may be warranted based on study design. Just as individual investigators should be mindful of the purpose of their study, harmonization efforts similarly need to consider research objectives, since the context shapes the appropriate use of population descriptors. The objective is less to offer a single definitive descriptor or set of labels than to offer systems and approaches for harmonization, that is, clear ways to denote which population descriptors are used and why, and how to merge data sets that may have used different descriptor schemes.

BOX 5-2
Common Data Elements for Researchers to Include as Metadata to Help Harmonize Across Studies

Systems for improving data sharing and harmonization are an important research need. When sharing data, researchers could explicitly share a set of accessory files that provide information to communicate their labeling schemes. An example of useful common data elements to include would be:

Per population descriptor:

- Overall rationale for the population descriptor (e.g., classification scheme) and associated group labels
- Set of possible population descriptor values (e.g., group labels) and recommended abbreviations of label values used in study
- Per individual in the study:
  - Label value
  - Provenance of label value: Self-report, ascribed externally and by whom, other
- When using existing data, if a new set of population descriptor values (e.g., group labels) is being used instead of those used in the original data set, provide a mapping of how old labels map onto new labels.

For example, for geographic population descriptors:

- Identification of specific geographic labeling scheme (e.g., based on sample location, birthplace)
- If relevant, set of geographic entities with associated shape files defining boundary of the entity or latitude/longitude specifying representative locations
- Per individual either:
  - Point based:
    - Latitude
    - Longitude
    - Estimated mean square error in units of kilometers
    - Provenance: Self-reported, ascribed externally, other
  - Geographic entity based:
    - Entity value
    - Provenance: Self-report, ascribed externally, other
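The common data elements listed in this box could be shared as a small machine-readable accessory file. The following is a minimal sketch of one way to do so, not a standard endorsed by the committee: the class names, field names, JSON layout, and all example values (sites, identifiers, labels) are hypothetical placeholders introduced here for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Provenance categories taken from the box; the exact spelling is an assumption.
PROVENANCE = {"self-reported", "ascribed externally", "other"}


@dataclass
class DescriptorScheme:
    """Per-descriptor metadata (hypothetical field names)."""

    name: str
    rationale: str
    allowed_values: Dict[str, str]  # group label -> recommended abbreviation
    legacy_mapping: Dict[str, str] = field(default_factory=dict)  # old label -> new label


@dataclass
class IndividualLabel:
    """Per-individual metadata (hypothetical field names)."""

    individual_id: str
    label_value: str
    provenance: str
    ascribed_by: str = ""  # who ascribed the label, if ascribed externally


def validate(scheme: DescriptorScheme, records: List[IndividualLabel]) -> List[str]:
    """Return human-readable problems; an empty list means no problems were found."""
    problems = []
    for rec in records:
        if rec.label_value not in scheme.allowed_values:
            problems.append(f"{rec.individual_id}: unknown label {rec.label_value!r}")
        if rec.provenance not in PROVENANCE:
            problems.append(f"{rec.individual_id}: unknown provenance {rec.provenance!r}")
        if rec.provenance == "ascribed externally" and not rec.ascribed_by:
            problems.append(f"{rec.individual_id}: missing who ascribed the label")
    for old, new in scheme.legacy_mapping.items():
        if new not in scheme.allowed_values:
            problems.append(f"legacy label {old!r} maps to unknown label {new!r}")
    return problems


# Placeholder values only; these are not labels from any real study.
scheme = DescriptorScheme(
    name="sampling_location",
    rationale="Groups individuals by the site where the sample was collected.",
    allowed_values={"Site North": "SN", "Site South": "SS"},
    legacy_mapping={"north_clinic": "Site North"},
)
records = [
    IndividualLabel("ind001", "Site North", "self-reported"),
    IndividualLabel("ind002", "Site South", "ascribed externally", ascribed_by="study staff"),
    IndividualLabel("ind003", "Site East", "ascribed externally"),
]

problems = validate(scheme, records)
accessory_file = json.dumps(
    {"scheme": asdict(scheme), "individuals": [asdict(r) for r in records]}, indent=2
)
print(problems)
```

Recording the rationale, the allowed values, and the old-to-new label mapping in the same file as the per-individual labels lets downstream users check a data set against its own stated scheme before merging it with others.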

Upholding the principle of transparency and adhering to Recommendations 6, 7, and 8 inherently support harmonization through the application of consistent definitions of population descriptors and transparent communication of methods. More specifically, for novel data collection, data should be collected per individual along multiple nongenetic dimensions and population descriptor types that may facilitate other studies. In addition, clear instructions should be provided on how downstream users can respect consent and any collaborative agreements with study participants regarding population descriptors. For existing individual-level data, the available metadata can be used while following existing agreements and consent structures to form the population descriptors (see the decision tree in Appendix D).

Harmonized population descriptors that are well understood would be highly valuable. In the context of genetics studies, genetic similarity to specific reference sets could have advantages for promoting harmonization. While a broader sampling of human genetic diversity is needed, current candidates for specific reference sets include, for example, data from the 1000 Genomes Project, the Human Genome Diversity Project, and the Simons Genome Diversity Project (1000 Genomes Project Consortium et al., 2015; Bergström et al., 2020; Cann et al., 2002; Mallick et al., 2016).

A specific challenge is harmonizing across studies so that readers of a research manuscript can understand a label quickly and in a technically precise way. For example, a possible methods description that adheres to the guidelines outlined in this report would be: “To minimize heterogeneity in genetic ancestry across our sample, we filtered our sample to only include individuals with a pairwise genotypic dissimilarity less than 10^-3 to the centroid of the Yoruba of Ibadan sample of the 1000 Genomes Project.” A possible critique is that such language may be perceived as bulky and difficult to apply throughout a study write-up, and researchers will be eager to find more concise wording. In that regard, one possible approach is to use a sample abbreviation and the suffix -like. So, in a setting where conciseness is prioritized, instead of the above phrasing, one might say “1KG-YRI-like individuals” (see Box 5-3).
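The filtering step in the example methods description can be sketched in code. This is a minimal illustration under stated assumptions rather than a recommended procedure: the report does not define the dissimilarity metric, so the mean squared difference in allele dosage used here is an assumption, the function names are hypothetical, and the genotype vectors are made-up toy values, not data from the 1000 Genomes Project.

```python
from typing import Dict, List, Sequence


def centroid(genotypes: Sequence[Sequence[float]]) -> List[float]:
    """Per-variant mean allele dosage across the reference individuals."""
    n = len(genotypes)
    return [sum(column) / n for column in zip(*genotypes)]


def dissimilarity(genotype: Sequence[float], center: Sequence[float]) -> float:
    """Mean squared difference in allele dosage (an assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(genotype, center)) / len(center)


def like_label(panel: str, sample: str) -> str:
    """Build the concise descriptor, e.g. '1KG-YRI-like'."""
    return f"{panel}-{sample}-like"


def filter_like(
    study: Dict[str, Sequence[float]],
    reference: Sequence[Sequence[float]],
    threshold: float,
) -> List[str]:
    """Return IDs of study individuals whose dissimilarity to the reference
    centroid is below the threshold."""
    center = centroid(reference)
    return [iid for iid, g in study.items() if dissimilarity(g, center) < threshold]


# Toy allele dosages (0, 1, or 2 copies) at four variants; illustrative only.
reference = [[0, 1, 2, 0], [0, 1, 2, 0]]
study = {
    "ind1": [0, 1, 2, 0],  # identical to the centroid
    "ind2": [2, 1, 0, 0],  # differs at two variants
    "ind3": [0, 1, 2, 1],  # differs at one variant
}

label = like_label("1KG", "YRI")
strict = filter_like(study, reference, threshold=1e-3)
loose = filter_like(study, reference, threshold=0.5)
print(label, strict, loose)
```

The two calls use different thresholds and return different sets of individuals, which is why the metric, reference sample, and threshold behind any “-like” label need to be stated in the methods.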

The overall approach of using an abbreviation and the -like suffix is compatible with other descriptors, such as geographic and ethnic descriptors. For example, if scientifically justified, one might conduct a study on “self-described ethnically Italian individuals sampled in Houston, Texas, who are 1KG-TSI-like” to refer to self-described ethnically Italian individuals sampled in Houston, Texas, who were further filtered based on genetic similarity to the 1000 Genomes Project sample of Tuscans of Italy (TSI) individuals.

The approach also offers researchers flexibility regarding the choice of reference panels and the scale at which they are analyzed. For example, while the committee generally recommends against continental-scale conceptualizations of human genetic variation, in the “chromosome painting” (e.g., local ancestry calling) approaches that are common in human genetics, continental-scale conceptualizations are prominent in many analysis pipelines. In such settings, it is still possible, and favorable, to use concise descriptors (e.g., “we partition the genome into tracts that are 1KG-EUR-like and 1KG-AFR-like”) in place of using continental ancestry labels as is common practice (e.g., “we partitioned the genome into European ‘superpopulation’² (EUR) ancestry tracts and African ‘superpopulation’ (AFR) ancestry tracts”). The changed language is concise and more information rich while avoiding the implication of clear continental boundaries in human genetic variation. For admixed individuals themselves, a harmonious approach using the language of genetic similarity would be to refer to the best approximating reference group; for example, “1KG-PEL-like” and “1KG-PUR-like” are two among many possible genetic similarity descriptors of Latino populations, with PEL = Peruvian in Lima, Peru, and PUR = Puerto Rican in Puerto Rico.

BOX 5-3
Concise Language for Genetic Similarity: The Abbreviation + -Like System

Because the language used to fully describe population descriptors in terms of genetic similarity may be cumbersome, it may be useful to adopt an approach that uses a sample abbreviation and the suffix -like.

For example, one might use the abbreviation “1KG-YRI-like individuals” for “individuals with a pairwise genotypic dissimilarity less than 10^-3 to the Yoruba of Ibadan sample of the 1000 Genomes Project.”

The use of -like as a suffix is a form of abbreviation for a procedure of defining similarity. Although there is an element of vagueness, it is concise, and readers who need to understand the exact procedure used for ascribing this designation should be able to find, in a well-written methods section, the precise procedures and thresholds used to define the term.

Abbreviations are often disfavored in science communication and writing, but in this setting, attempts to use more accessible wording such as European ancestry in place of precise language are so prone to misunderstanding and propagating misconceptions that use of such terms is often counterproductive. An abbreviation implicitly invites a reader to read deeper into the technical meaning of the abbreviation rather than proceed with preconceived notions. For example, a reader who does not immediately recognize “1KG-YRI” as indicating the 1000 Genomes Project Yoruba of Ibadan (YRI) reference panel needs to read deeper into the methods to understand what is meant.

While potentially difficult for novices to read, the use of abbreviations for precision and conciseness is in fact a key aspect of scientific language in many fields (e.g., chemistry and the abbreviations for the elements, though the committee notes the analogy is not exact as there are no fundamental elements with regard to genetic ancestry). Their use in many scientific fields is evidence that abbreviations are not an impediment to scientific communication and can foster a culture of concise reference to precisely defined entities. A potential caveat of this approach is that one study’s definition of XX-like may differ from another’s because the procedures and thresholds used to define similarity vary. Standardization for such genetic similarity procedures may be feasible, and would be fruitful to develop, especially as a fuller representation of human genetic variation is sampled by ongoing studies. Nonetheless, the abbreviation plus -like approach would have less vagueness than the current widespread use of such terms as European genetic ancestry and African genetic ancestry, where both the reference populations and the methods to ascribe an affiliation to European or African sources are unclear and make implicit assumptions about the time frame of interest.

___________________

² For example, the 1000 Genomes Project uses a classification of five superpopulations: Africans (AFR), Admixed Americans (AMR), East Asians (EAS), Europeans (EUR), and South Asians (SAS).

As investigators grapple with these complex challenges, harmonization efforts will continue to take many forms. Importantly, given the multiuse nature of modern data sets, any future harmonization efforts must be meaningful in how they aggregate populations or harmonize labels, while remaining flexible across uses and studies. Further alignment across the field to implement the recommendations and best practices described in this report can go a long way toward enhancing harmonization of the use of population descriptors (see Chapter 6).

REFERENCES

1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526:68-74.

Atkinson, E. G., A. X. Maihofer, M. Kanai, A. R. Martin, K. J. Karczewski, M. L. Santoro, J. C. Ulirsch, Y. Kamatani, Y. Okada, H. K. Finucane, K. C. Koenen, C. M. Nievergelt, M. J. Daly, and B. M. Neale. 2021. Tractor uses local ancestry to enable inclusion of admixed individuals in GWAS and to boost power. Nature Genetics 53(2):195-204.

Batai, K., S. Hooker, and R. A. Kittles. 2021. Leveraging genetic ancestry to study health disparities. American Journal of Physical Anthropology 175(2):363-375.

Benmarhnia, T., A. Hajat, and J. S. Kaufman. 2021. Inferential challenges when assessing racial/ethnic health disparities in environmental research. Environmental Health 20(1):7.

Bergström, A., S. A. McCarthy, R. Hui, M. A. Almarri, Q. Ayub, P. Danecek, Y. Chen, S. Felkel, P. Hallast, J. Kamm, H. Blanché, J.-F. Deleuze, H. Cann, S. Mallick, D. Reich, M. S. Sandhu, P. Skoglund, A. Scally, Y. Xue, R. Durbin, and C. Tyler-Smith. 2020. Insights into human genetic variation and population history from 929 diverse genomes. Science 367(6484):eaay5012.

Cann, H. M., C. de Toma, L. Cazes, M.-F. Legrand, V. Morel, L. Piouffre, J. Bodmer, W. F. Bodmer, B. Bonne-Tamir, A. Cambon-Thomsen, Z. Chen, J. Chu, C. Carcassi, L. Contu, R. Du, L. Excoffier, G. B. Ferrara, J. S. Friedlaender, H. Groot, D. Gurwitz, T. Jenkins, R. J. Herrera, X. Huang, J. Kidd, K. K. Kidd, A. Langaney, A. A. Lin, S. Q. Mehdi, P. Parham, A. Piazza, M. P. Pistillo, Y. Qian, Q. Shu, J. Xu, S. Zhu, J. L. Weber, H. T. Greely, M. W. Feldman, G. Thomas, J. Dausset, and L. L. Cavalli-Sforza. 2002. A human genome diversity cell line panel. Science 296(5566):261-262.

Chakraborty, R., S. A. Read, and S. J. Vincent. 2020. Understanding myopia: Pathogenesis and mechanisms. In Updates on myopia: A clinical perspective, edited by M. Ang and T. Y. Wong. Singapore: Springer Singapore. Pp. 65-94.

Charles, B. A., D. Shriner, and C. N. Rotimi. 2014. Accounting for linkage disequilibrium in association analysis of diverse populations. Genetic Epidemiology 38(3):265-273.

Choi, S. W., T. S.-H. Mak, and P. F. O’Reilly. 2020. Tutorial: A guide to performing polygenic risk score analyses. Nature Protocols 15(9):2759-2772.

Claw, K. G., M. Z. Anderson, R. L. Begay, K. S. Tsosie, K. Fox, N. A. Garrison, A. C. Bader, J. Bardill, D. A. Bolnick, J. Brooks, A. Cordova, R. S. Malhi, N. Nakatsuka, A. Neller, J. A. Raff, J. Singson, K. TallBear, T. Vargas, J. M. Yracheta, and Summer internship for INdigenous peoples in Genomics (SING) Consortium. 2018. A framework for enhancing ethical genomic research with indigenous communities. Nature Communications 9(1):2957.

Coop, G. 2023. Genetic similarity versus genetic ancestry groups as sample descriptors in human genetics. arXiv (preprint).

Daly, B., and O. I. Olopade. 2015. A perfect storm: How tumor biology, genomics, and health care delivery patterns collide to create a racial survival disparity in breast cancer and proposed interventions for change. CA: A Cancer Journal for Clinicians 65(3):221-238.

Delorey, T. M., C. G. K. Ziegler, G. Heimberg, R. Normand, Y. Yang, Å. Segerstolpe, D. Abbondanza, S. J. Fleming, A. Subramanian, D. T. Montoro, K. A. Jagadeesh, K. K. Dey, P. Sen, M. Slyper, Y. H. Pita-Juárez, D. Phillips, J. Biermann, Z. Bloom-Ackermann, N. Barkas, A. Ganna, J. Gomez, J. C. Melms, I. Katsyv, E. Normandin, P. Naderi, Y. V. Popov, S. S. Raju, S. Niezen, L. T. Y. Tsai, K. J. Siddle, M. Sud, V. M. Tran, S. K. Vellarikkal, Y. Wang, L. Amir-Zilberstein, D. S. Atri, J. Beechem, O. R. Brook, J. Chen, P. Divakar, P. Dorceus, J. M. Engreitz, A. Essene, D. M. Fitzgerald, R. Fropf, S. Gazal, J. Gould, J. Grzyb, T. Harvey, J. Hecht, T. Hether, J. Jané-Valbuena, M. Leney-Greene, H. Ma, C. McCabe, D. E. McLoughlin, E. M. Miller, C. Muus, M. Niemi, R. Padera, L. Pan, D. Pant, C. Pe’Er, J. Pfiffner-Borges, C. J. Pinto, J. Plaisted, J. Reeves, M. Ross, M. Rudy, E. H. Rueckert, M. Siciliano, A. Sturm, E. Todres, A. Waghray, S. Warren, S. Zhang, D. R. Zollinger, L. Cosimi, R. M. Gupta, N. Hacohen, H. Hibshoosh, W. Hide, A. L. Price, J. Rajagopal, P. R. Tata, S. Riedel, G. Szabo, T. L. Tickle, P. T. Ellinor, D. Hung, P. C. Sabeti, R. Novak, R. Rogers, D. E. Ingber, Z. G. Jiang, D. Juric, M. Babadi, S. L. Farhi, B. Izar, J. R. Stone, I. S. Vlachos, I. H. Solomon, O. Ashenberg, C. B. M. Porter, B. Li, A. K. Shalek, A.-C. Villani, O. Rozenblatt-Rosen, and A. Regev. 2021. COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature 595(7865):107-113.

Doiron, D., P. Burton, Y. Marcon, A. Gaye, B. H. Wolffenbuttel, M. Perola, R. P. Stolk, L. Foco, C. Minelli, M. Waldenberger, R. Holle, K. Kvaløy, H. L. Hillege, A.-M. Tassé, V. Ferretti, and I. Fortier. 2013. Data harmonization and federated analysis of population-based studies: The BioSHaRE project. Emerging Themes in Epidemiology 10(1):12.

Dolitsky, S., A. Mitra, S. Khan, E. Ashkinadze, and M. V. Sauer. 2020. Beyond the “Jewish panel”: The importance of offering expanded carrier screening to the Ashkenazi Jewish population. F&S Reports 1(3):294-298.

Eisenmann, S., E. Bánffy, P. van Dommelen, K. P. Hofmann, J. Maran, I. Lazaridis, A. Mittnik, M. McCormick, J. Krause, D. Reich, and P. W. Stockhammer. 2018. Reconciling material cultures in archaeology with genetic data: The nomenclature of clusters emerging from archaeogenomic analysis. Scientific Reports 8:13003.

Falconer, D. S., and T. F. C. Mackay. 1996. Introduction to quantitative genetics. 4th ed. Essex, England: Addison Wesley Longman Limited.

GeM-HD Consortium (Genetic Modifiers of Huntington’s Disease Consortium). 2015. Identification of genetic factors that modify clinical onset of Huntington’s disease. Cell 162(3):516-526.

Giannakopoulou, O., K. Lin, X. Meng, M.-H. Su, P.-H. Kuo, R. E. Peterson, S. Awasthi, A. Moscati, J. R. I. Coleman, N. Bass, I. Y. Millwood, Y. Chen, Z. Chen, H.-C. Chen, M.-L. Lu, M.-C. Huang, C.-H. Chen, E. A. Stahl, R. J. F. Loos, N. Mullins, R. J. Ursano, R. C. Kessler, M. B. Stein, S. Sen, L. J. Scott, M. Burmeister, Y. Fang, J. Tyrrell, Y. Jiang, C. Tian, A. M. McIntosh, S. Ripke, E. C. Dunn, K. S. Kendler, R. G. Walters, C. M. Lewis, K. Kuchenbaecker, N. R. Wray, S. Ripke, M. Mattheisen, M. Trzaskowski, E. M. Byrne, A. Abdellaoui, M. J. Adams, E. Agerbo, T. M. Air, T. F. M. Andlauer, S.-A. Bacanu, M. Bækvad-Hansen, A. T. F. Beekman, T. B. Bigdeli, E. B. Binder, J. Bryois, H. N. Buttenschøn, J. Bybjerg-Grauholm, N. Cai, E. Castelao, J. H. Christensen, T.-K. Clarke, J. R. I. Coleman, L. Colodro-Conde, H. Coon, B. Couvy-Duchesne, N. Craddock, G. E. Crawford, G. Davies, I. J. Deary, F. Degenhardt, E. M. Derks, N. Direk, C. V. Dolan, E. C. Dunn, T. C. Eley, V. Escott-Price, F. F. H. Kiadeh, H. K. Finucane, J. C. Foo, A. J. Forstner, J. Frank, H. A. Gaspar, M. Gill, F. S. Goes, S. D. Gordon, J. Grove, L. S. Hall, C. S. Hansen, T. F. Hansen, S. Herms, I. B. Hickie, P. Hoffmann, G. Homuth, C. Horn, J.-J. Hottenga, D. M. Howard, D. M. Hougaard, M. Ising, R. Jansen, I. Jones, L. A. Jones, E. Jorgenson, J. A. Knowles, I. S. Kohane, J. Kraft, W. W. Kretzschmar, Z. Kutalik, Y. Li, P. A. Lind, J. J. Luykx, D. J. Macintyre, D. F. Mackinnon, R. M. Maier, W. Maier, J. Marchini, H. Mbarek, P. McGrath, P. McGuffin, S. E. Medland, D. Mehta, C. M. Middeldorp, E. Mihailov, Y. Milaneschi, L. Milani, F. M. Mondimore, G. W. Montgomery, S. Mostafavi, N. Mullins, M. Nauck, B. Ng, M. G. Nivard, D. R. Nyholt, P. F. O’Reilly, H. Oskarsson, M. J. Owen, J. N. Painter, C. B. Pedersen, M. G. Pedersen, R. E. Peterson, E. Pettersson, W. J. Peyrot, G. Pistis, D. Posthuma, J. A. Quiroz, P. Qvist, J. P. Rice, B. P. Riley, M. Rivera, S. S. Mirza, R. Schoevers, E. C. Schulte, L. Shen, J. Shi, S. I. 
Shyn, E. Sigurdsson, G. C. B. Sinnamon, J. H. Smit, D. J. Smith, H. Stefansson, S. Steinberg, F. Streit, J. Strohmaier, K. E. Tansey, H. Teismann, A. Teumer, W. Thompson, P. A. Thompson, T. E. Thorgeirsson, M. Traylor, J. Treutlein, V. Trubetskoy, A. G. Uitterlinden, D. Umbricht, S. Van Der Auwera, A. M. Van Hemert, A. Viktorin, P. M. Visscher, Y. Wang, B. T. Webb, S. M. Weinsheimer, J. Wellmann, G. Willemsen, S. H. Witt, Y. Wu, H. S. Xi, J. Yang, F. Zhang, V. Arolt, B. T. Baune, K. Berger, D. I. Boomsma, S. Cichon, U. Dannlowski, E. De Geus, J. R. Depaulo, E. Domenici, K. Domschke, T. Esko, H. J. Grabe, S. P. Hamilton, C. Hayward, A. C. Heath, K. S. Kendler, S. Kloiber, G. Lewis, Q. S. Li, S. Lucae, P. A. Madden, P. K. Magnusson, N. G. Martin, A. M. McIntosh, A. Metspalu, O. Mors, P. B. Mortensen, B. Müller-Myhsok, M. Nordentoft, M. M. Nöthen, M. C. O’Donovan, S. A. Paciga, N. L. Pedersen, B. W. Penninx, R. H. Perlis, D. J. Porteous, J. B. Potash, M. Preisig, M. Rietschel, C. Schaefer, T. G. Schulze, J. W. Smoller, K. Stefansson, H. Tiemeier, R. Uher, H. Völzke, M. M. Weissman, T. Werge, C. M. Lewis, D. F. Levinson, G. Breen, A. D. Børglum, P. F. Sullivan, M. Agee, S. Aslibekyan, A. Auton, E. Babalola, R. K. Bell, J. Bielenberg, K. Bryc, E. Bullis, B. Cameron, D. Coker, G. Cuellar Partida, D. Dhamija, S. Das, S. L. Elson, T. Filshtein, K. Fletez-Brant, P. Fontanillas, W. Freyman, P. M. Gandhi, K. Heilbron, B. Hicks, D. A. Hinds, K. E. Huber, E. M. Jewett, Y. Jiang, A. Kleinman, K. Kukar, V. Lane, K.-H. Lin, M. Lowe, M. K. Luff, J. C. McCreight, M. H. McIntyre, K. F. McManus, S. J. Micheletti, M. E. Moreno, J. L. Mountain, S. V. Mozaffari, P. Nandakumar, E. S. Noblin, J. O’Connell, A. A. Petrakovitz, G. D. Poznik, M. Schumacher, A. J. Shastri, J. F. Shelton, J. Shi, S. Shringarpure, C. Tian, V. Tran, J. Y. Tung, X. Wang, W. Wang, C. H. Weldon, P. Wilton, D. Avery, D. Bennett, Z. Bian, R. Boxall, F. Bragg, K. H. Chan, L. Chang, Y. Chang, B. Chen, J. Chen, J. 
Chen, N. Chen, N. Chen, X. Chen, Y. Chen, Z. Chen, L. Cheng, J. Clarke, R. Clarke, R. Collins, C. Dong, H. Du, R. Du, Z. Fairhurst-Hunter, L. Fan, S. Feng, Z. Fu, W. Gan, R. Gao, Y. Gao, P. Ge, S. Gilbert, W. Gong, Q. Gu, Y. Guo, Z. Guo, Z. Guo, A. Hacker, X. Han, P. Hariri, P. He, T. He, M. Hill, M. Holmes, C. Hou, W. Hou, C. Hu, R. Hu, X. Hu, Y. Hu, H. Hua, Y. Hua, Y. Huang, P. K. Im, A. Iona, Q. Jiang, J. Jin, M. Kakkoura, Q. Kang, C. Kartsonaki, R. Kerosi, L. Kong, J. Lan, G. Lancaster, F. Li, H. Li, J. Li, L. Li, M. Li, S. Li, Y. Li, Y. Li, Z. Li, K. Lin, L. Lingli, C. Liu, D. Liu, D. Liu, F. Liu, H. Liu, J. Liu, J. Liu, Y. Liu, Y. Liu, H. Long, Y. Lu, G. Luo, J. Lv, S. Lv, L. Ma, E. Mao, J. McDonnell, F. Meng, J. Meng, I. Millwood, Q. Nie, F. Ning, D. Pan, R. Pan, Z. Pang, P. Pei, R. Peto, A. Pozarickij, Y. Qian, Y. Qin, C. Qu, X. Ren, P. Ryder, S. Sansome, D. Schmidt, P. Sherliker, R. Sohoni, B. Stevens, J. Su, H. Sun, Q. Sun, X. Sun, A. Tang, Z. Tang, R. Tao, X. Tian, I. Turnbull, R. Walters, M. Wan, C. Wang, C. Wang, H. Wang, J. Wang, L. Wang, P. Wang, T. Wang, S. Wang, S. Wang, X. Wang, L. Wei, M. Weng, N. Wright, M. Wu, X. Wu, S. Wu, K. Xie, Q. Xu, Q. Xu, X. Xu, S. Yan, L. Yang, X. Yang, J. Yang, P. Yao, L. Yin, B. Yu, C. Yu, M. Yu, Y. Zhai, H. Zhang, H. Zhang, J. Zhang, L. Zhang, N. Zhang, X. Zhang, X. Zhang, X. Zhang, X. Zhong, D. Z. Zhou, G. Zhou, J. Zhou, L. Zhou, W. Zhou, X. Zhou, Y. Zhou, and M. Zou. 2021. The genetic architecture of depression in individuals of East Asian ancestry. JAMA Psychiatry 78(11):1258-1269.

Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante. 2009. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics 5(10):e1000695.

Hardeman, R. R., P. A. Homan, T. Chantarat, B. A. Davis, and T. H. Brown. 2022. Improving the measurement of structural racism to achieve antiracist health policy: Study examines measurement of structural racism to achieve antiracist health policy. Health Affairs 41(2):179-186.

Hirschhorn, J. N., and M. J. Daly. 2005. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6(2):95-108.

Jimenez-Sanchez, M., F. Licitra, B. R. Underwood, and D. C. Rubinsztein. 2017. Huntington’s disease: Mechanisms of pathogenesis and therapeutic strategies. Cold Spring Harbor Perspectives in Medicine 7(7).

Khan, A. T., S. M. Gogarten, C. P. McHugh, A. M. Stilp, T. Sofer, M. L. Bowers, Q. Wong, L. A. Cupples, B. Hidalgo, A. D. Johnson, M.-L. N. McDonald, S. T. McGarvey, M. R. G. Taylor, S. M. Fullerton, M. P. Conomos, and S. C. Nelson. 2022. Recommendations on the use and reporting of race, ethnicity, and ancestry in genetic research: Experiences from the NHLBI TOPMed program. Cell Genomics 2(8):100155.

Khera, A. V., M. Chaffin, K. G. Aragam, M. E. Haas, C. Roselli, S. H. Choi, P. Natarajan, E. S. Lander, S. A. Lubitz, P. T. Ellinor, and S. Kathiresan. 2018. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics 50(9):1219-1224.

Kistka, Z. A.-F., L. Palomar, K. A. Lee, S. E. Boslaugh, M. F. Wangler, F. S. Cole, M. R. Debaun, and L. J. Muglia. 2007. Racial disparity in the frequency of recurrence of preterm birth. American Journal of Obstetrics and Gynecology 196(2):131.e1-131.e6.

Kittles, R. A., E. R. Santos, N. S. Oji-Njideka, and C. Bonilla. 2007. Race, skin color and genetic ancestry: Implications for biomedical research on health disparities. Californian Journal of Health Promotion 5(Special Issue):9-23.

Lee, S. S.-J., S. M. Fullerton, A. Saperstein, and J. K. Shim. 2019. Ethics of inclusion: Cultivate trust in precision medicine. Science 364(6444):941-942.

Lesko, C. R., L. P. Jacobson, K. N. Althoff, A. G. Abraham, S. J. Gange, R. D. Moore, S. Modur, and B. Lau. 2018. Collaborative, pooled and harmonized study designs for epidemiologic research: Challenges and opportunities. International Journal of Epidemiology 47(2):654-668.

Lewis, A. C. F., S. J. Molina, P. S. Appelbaum, B. Dauda, A. Di Rienzo, A. Fuentes, S. M. Fullerton, N. A. Garrison, N. Ghosh, E. M. Hammonds, D. S. Jones, E. E. Kenny, P. Kraft, S. S. Lee, M. Mauro, J. Novembre, A. Panofsky, M. Sohail, B. M. Neale, and D. S. Allen. 2022. Getting genetic ancestry right for science and society. Science 376(6590):250-252.

Li, J., and Q. Zhang. 2017. Insight into the molecular genetics of myopia. Molecular Vision 23:1048.

Mallick, S., H. Li, M. Lipson, I. Mathieson, M. Gymrek, F. Racimo, M. Zhao, N. Chennagiri, S. Nordenfelt, A. Tandon, P. Skoglund, I. Lazaridis, S. Sankararaman, Q. Fu, N. Rohland, G. Renaud, Y. Erlich, T. Willems, C. Gallo, J. P. Spence, Y. S. Song, G. Poletti, F. Balloux, G. van Driem, P. de Knijff, I. G. Romero, A. R. Jha, D. M. Behar, C. M. Bravi, C. Capelli, T. Hervig, A. Moreno-Estrada, O. L. Posukh, E. Balanovska, O. Balanovsky, S. Karachanak-Yankova, H. Sahakyan, D. Toncheva, L. Yepiskoposyan, C. Tyler-Smith, Y. Xue, M. S. Abdullah, A. Ruiz-Linares, C. M. Beall, A. Di Rienzo, C. Jeong, E. B. Starikovskaya, E. Metspalu, J. Parik, R. Villems, B. M. Henn, U. Hodoglugil, R. Mahley, A. Sajantila, G. Stamatoyannopoulos, J. T. Wee, R. Khusainova, E. Khusnutdinova, S. Litvinov, G. Ayodo, D. Comas, M. F. Hammer, T. Kivisild, W. Klitz, C. A. Winkler, D. Labuda, M. Bamshad, L. B. Jorde, S. A. Tishkoff, W. S. Watkins, M. Metspalu, S. Dryomov, R. Sukernik, L. Singh, K. Thangaraj, S. Pääbo, J. Kelso, N. Patterson, and D. Reich. 2016. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538(7624):201-206.

Martin, A. R., M. Kanai, Y. Kamatani, Y. Okada, B. M. Neale, and M. J. Daly. 2019. Clinical use of current polygenic risk scores may exacerbate health disparities. Nature Genetics 51(4):584-591.

Martinez, R. A. M., N. Andrabi, A. N. Goodwin, R. E. Wilbur, N. R. Smith, and P. N. Zivich. 2023. Conceptualization, operationalization, and utilization of race and ethnicity in major epidemiology journals 1995–2018: A systematic review. American Journal of Epidemiology 192(3):483-496.

Mathieson, I., and A. Scally. 2020. What is ancestry? PLoS Genetics 16(3):e1008624.

Mavaddat, N., K. Michailidou, J. Dennis, M. Lush, L. Fachal, A. Lee, J. P. Tyrer, T.-H. Chen, Q. Wang, M. K. Bolla, X. Yang, M. A. Adank, T. Ahearn, K. Aittomäki, J. Allen, I. L. Andrulis, H. Anton-Culver, N. N. Antonenkova, V. Arndt, K. J. Aronson, P. L. Auer, P. Auvinen, M. Barrdahl, L. E. Beane Freeman, M. W. Beckmann, S. Behrens, J. Benitez, M. Bermisheva, L. Bernstein, C. Blomqvist, N. V. Bogdanova, S. E. Bojesen, B. Bonanni, A.-L. Børresen-Dale, H. Brauch, M. Bremer, H. Brenner, A. Brentnall, I. W. Brock, A. Brooks-Wilson, S. Y. Brucker, T. Brüning, B. Burwinkel, D. Campa, B. D. Carter, J. E. Castelao, S. J. Chanock, R. Chlebowski, H. Christiansen, C. L. Clarke, J. M. Collée, E. Cordina-Duverger, S. Cornelissen, F. J. Couch, A. Cox, S. S. Cross, K. Czene, M. B. Daly, P. Devilee, T. Dörk, I. dos-Santos-Silva, M. Dumont, L. Durcan, M. Dwek, D. M. Eccles, A. B. Ekici, A. H. Eliassen, C. Ellberg, C. Engel, M. Eriksson, D. G. Evans, P. A. Fasching, J. Figueroa, O. Fletcher, H. Flyger, A. Försti, L. Fritschi, M. Gabrielson, M. Gago-Dominguez, S. M. Gapstur, J. A. García-Sáenz, M. M. Gaudet, V. Georgoulias, G. G. Giles, I. R. Gilyazova, G. Glendon, M. S. Goldberg, D. E. Goldgar, A. González-Neira, G. I. Grenaker Alnæs, M. Grip, J. Gronwald, A. Grundy, P. Guénel, L. Haeberle, E. Hahnen, C. A. Haiman, N. Håkansson, U. Hamann, S. E. Hankinson, E. F. Harkness, S. N. Hart, W. He, A. Hein, J. Heyworth, P. Hillemanns, A. Hollestelle, M. J. Hooning, R. N. Hoover, J. L. Hopper, A. Howell, G. Huang, K. Humphreys, D. J. Hunter, M. Jakimovska, A. Jakubowska, W. Janni, E. M. John, N. Johnson, M. E. Jones, A. Jukkola-Vuorinen, A. Jung, R. Kaaks, K. Kaczmarek, V. Kataja, R. Keeman, M. J. Kerin, E. Khusnutdinova, J. I. Kiiski, J. A. Knight, Y.-D. Ko, V.-M. Kosma, S. Koutros, V. N. Kristensen, U. Krüger, T. Kühl, D. Lambrechts, L. Le Marchand, E. Lee, F. Lejbkowicz, J. Lilyquist, A. Lindblom, S. Lindström, J. Lissowska, W.-Y. Lo, S. Loibl, J. Long, J. Lubiński, M. P. Lux, R. J. MacInnis, T. Maishman, E. Makalic, I. Maleva Kostovska, A. Mannermaa, S. Manoukian, S. Margolin, J. W. M. Martens, M. E. Martinez, D. Mavroudis, C. McLean, A. Meindl, U. Menon, P. Middha, N. Miller, F. Moreno, A. M. Mulligan, C. Mulot, V. M. Muñoz-Garzon, S. L. Neuhausen, H. Nevanlinna, P. Neven, W. G. Newman, S. F. Nielsen, B. G. Nordestgaard, A. Norman, K. Offit, J. E. Olson, H. Olsson, N. Orr, V. S. Pankratz, T.-W. Park-Simon, J. I. A. Perez, C. Pérez-Barrios, P. Peterlongo, J. Peto, M. Pinchev, D. Plaseska-Karanfilska, E. C. Polley, R. Prentice, N. Presneau, D. Prokofyeva, K. Purrington, K. Pylkäs, B. Rack, P. Radice, R. Rau-Murthy, G. Rennert, H. S. Rennert, V. Rhenius, M. Robson, A. Romero, K. J. Ruddy, M. Ruebner, E. Saloustros, D. P. Sandler, E. J. Sawyer, D. F. Schmidt, R. K. Schmutzler, A. Schneeweiss, M. J. Schoemaker, F. Schumacher, P. Schürmann, L. Schwentner, C. Scott, R. J. Scott, C. Seynaeve, M. Shah, M. E. Sherman, M. J. Shrubsole, X.-O. Shu, S. Slager, A. Smeets, C. Sohn, P. Soucy, M. C. Southey, J. J. Spinelli, C. Stegmaier, J. Stone, A. J. Swerdlow, R. M. Tamimi, W. J. Tapper, J. A. Taylor, M. B. Terry, K. Thöne, R. A. E. M. Tollenaar, I. Tomlinson, T. Truong, M. Tzardi, H.-U. Ulmer, M. Untch, C. M. Vachon, E. M. van Veen, J. Vijai, C. R. Weinberg, C. Wendt, A. S. Whittemore, H. Wildiers, W. Willett, R. Winqvist, A. Wolk, X. R. Yang, D. Yannoukakos, Y. Zhang, W. Zheng, A. Ziogas, A. M. Dunning, D. J. Thompson, G. Chenevix-Trench, J. Chang-Claude, M. K. Schmidt, P. Hall, R. L. Milne, P. D. P. Pharoah, A. C. Antoniou, N. Chatterjee, P. Kraft, M. García-Closas, J. Simard, and D. F. Easton. 2019. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. American Journal of Human Genetics 104(1):21-34.

Mills, M. C., and C. Rahal. 2020. The GWAS diversity monitor tracks diversity by disease in real time. Nature Genetics 52(3):242-243.

Mostafavi, H., A. Harpak, I. Agarwal, D. Conley, J. K. Pritchard, and M. Przeworski. 2020. Variable prediction accuracy of polygenic scores within an ancestry group. eLife 9:e48376.

NASEM (National Academies of Sciences, Engineering, and Medicine). 2019. Reproducibility and replicability in science. Washington, DC: The National Academies Press.

Nazareth, S. B., G. A. Lazarin, and J. D. Goldberg. 2015. Changing trends in carrier screening for genetic disease in the United States. Prenatal Diagnosis 35(10):931-935.

Nielsen, R., J. M. Akey, M. Jakobsson, J. K. Pritchard, S. Tishkoff, and E. Willerslev. 2017. Tracing the peopling of the world through genomics. Nature 541:302-310.

O’Neal, W. K., and M. R. Knowles. 2018. Cystic fibrosis disease modifiers: Complex genetics defines the phenotypic diversity in a monogenic disease. Annual Review of Genomics and Human Genetics 19:201-222.

Okbay, A., Y. Wu, N. Wang, H. Jayashankar, M. Bennett, S. M. Nehzati, J. Sidorenko, H. Kweon, G. Goldman, T. Gjorgjieva, Y. Jiang, B. Hicks, C. Tian, D. A. Hinds, R. Ahlskog, P. K. E. Magnusson, S. Oskarsson, C. Hayward, A. Campbell, D. J. Porteous, J. Freese, P. Herd, 23andMe Research Team, Social Science Genetic Association Consortium, C. Watson, J. Jala, D. Conley, P. D. Koellinger, M. Johannesson, D. Laibson, M. N. Meyer, J. J. Lee, A. Kong, L. Yengo, D. Cesarini, P. Turley, P. M. Visscher, J. P. Beauchamp, D. J. Benjamin, and A. I. Young. 2022. Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals. Nature Genetics 54(4):437-449.

Oni-Orisan, A., Y. Mavura, Y. Banda, T. A. Thornton, and R. Sebro. 2021. Embracing genetic diversity to improve Black health. New England Journal of Medicine 384(12):1163-1167.

Parra, E. J., R. A. Kittles, and M. D. Shriver. 2004. Implications of correlations between skin color and genetic ancestry for biomedical research. Nature Genetics 36(S11):S54-S60.

Patterson, N., P. Moorjani, Y. Luo, S. Mallick, N. Rohland, Y. Zhan, T. Genschoreck, T. Webster, and D. Reich. 2012. Ancient admixture in human history. Genetics 192(3):1065-1093.

Pavličev, M., and G. P. Wagner. 2022. The value of broad taxonomic comparisons in evolutionary medicine: Disease is not a trait but a state of a trait! MedComm 3(4):e174.

Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155(2):945-959.

Privé, F., H. Aschard, S. Carmi, L. Folkersen, C. Hoggart, P. F. O’Reilly, and B. J. Vilhjálmsson. 2022. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. American Journal of Human Genetics 109(1):12-23.

Randolph, H. E., J. K. Fiege, B. K. Thielen, C. K. Mickelson, M. Shiratori, J. Barroso-Batista, R. A. Langlois, and L. Barreiro. 2021. Genetic ancestry effects on the response to viral infection are pervasive but cell type specific. Science 374(6571):1127-1133.

Rohde, D. L. T., S. Olson, and J. T. Chang. 2004. Modelling the recent common ancestry of all living humans. Nature 431(7008):562-566.

Sadarangani, M., A. Marchant, and T. R. Kollmann. 2021. Immunological mechanisms of vaccine-induced protection against COVID-19 in humans. Nature Reviews Immunology 21(8):475-484.

Schaefer, N. K., B. Shapiro, and R. E. Green. 2021. An ancestral recombination graph of human, Neanderthal, and Denisovan genomes. Science Advances 7(29):eabc0776.

Scheinfeldt, L. B., S. Soi, C. Lambert, W.-Y. Ko, A. Coulibaly, A. Ranciaro, S. Thompson, J. Hirbo, W. Beggs, M. Ibrahim, T. Nyambo, S. Omar, D. Woldemeskel, G. Belay, A. Froment, J. Kim, and S. A. Tishkoff. 2019. Genomic evidence for shared common ancestry of East African hunting-gathering populations and insights into local adaptation. Proceedings of the National Academy of Sciences 116(10):4166-4175.

Scutari, M., I. Mackay, and D. Balding. 2016. Using genetic distance to infer the accuracy of genomic prediction. PLoS Genetics 12(9):e1006288.

Shriner, D. 2013. Overview of admixture mapping. Current Protocols in Human Genetics Chapter 1: Unit 1.23.

Simons, C., N. I. Wolf, N. McNeil, L. Caldovic, J. M. Devaney, A. Takanohashi, J. Crawford, K. Ru, S. M. Grimmond, D. Miller, D. Tonduti, J. L. Schmidt, R. S. Chudnow, R. van Coster, L. Lagae, J. Kisler, J. Sperner, M. S. van der Knaap, R. Schiffmann, R. J. Taft, and A. Vanderver. 2013. A de novo mutation in the beta-tubulin gene TUBB4A results in the leukoencephalopathy hypomyelination with atrophy of the basal ganglia and cerebellum. American Journal of Human Genetics 92(5):767-773.

Sirugo, G., S. M. Williams, and S. A. Tishkoff. 2019. The missing diversity in human genetic studies. Cell 177(1):26-31.

Spratt, D. E., T. Chan, L. Waldron, C. Speers, F. Y. Feng, O. O. Ogunwobi, and J. R. Osborne. 2016. Racial/ethnic disparities in genomic sequencing. JAMA Oncology 2(8):1070.

Tan, T. Y., O. J. Dillon, Z. Stark, D. Schofield, K. Alam, R. Shrestha, B. Chong, D. Phelan, G. R. Brett, E. Creed, A. Jarmolowicz, P. Yap, M. Walsh, L. Downie, D. J. Amor, R. Savarirayan, G. McGillivray, A. Yeung, H. Peters, S. J. Robertson, A. J. Robinson, I. Macciocca, S. Sadedin, K. Bell, A. Oshlack, P. Georgeson, N. Thorne, C. Gaff, and S. M. White. 2017. Diagnostic impact and cost-effectiveness of whole-exome sequencing for ambulant children with suspected monogenic conditions. JAMA Pediatrics 171(9):855-862.

Teteh, D. K., L. Dawkins-Moultin, S. Hooker, W. Hernandez, C. Bonilla, D. Galloway, V. La-Groon, E. R. Santos, M. Shriver, C. D. M. Royal, and R. A. Kittles. 2020. Genetic ancestry, skin color and social attainment: The four cities study. PLoS ONE 15(8):e0237041.

Thornton, T. A., and J. L. Bermejo. 2014. Local and global ancestry inference and applications to genetic association analysis for admixed populations. Genetic Epidemiology 38(S1):S5-S12.

Torkamani, A., N. E. Wineinger, and E. J. Topol. 2018. The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics 19(9):581-590.

Torres, J. B., and R. A. Kittles. 2007. The relationship between “race” and genetics and biomedical research. Current Hypertension Reports 9(3):196-201.

Turner, T. N., B. P. Coe, D. E. Dickel, K. Hoekzema, B. J. Nelson, M. C. Zody, Z. N. Kronenberg, F. Hormozdiari, A. Raja, L. A. Pennacchio, R. B. Darnell, and E. E. Eichler. 2017. Genomic patterns of de novo mutation in simplex autism. Cell 171(3):710-722.e12.

Van Alten, S., B. W. Domingue, T. Galama, and A. T. Marees. 2022. Reweighting the UK Biobank to reflect its underlying sampling population substantially reduces pervasive selection bias due to volunteering. medRxiv (preprint).

Visscher, P. M., N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy, M. A. Brown, and J. Yang. 2017. 10 years of GWAS discovery: Biology, function, and translation. American Journal of Human Genetics 101(1):5-22.

Wallace, S. E., E. Kirby, and B. M. Knoppers. 2020. How can we not waste legacy genomic research data? Frontiers in Genetics 11:446.

Wang, Y., J. Guo, G. Ni, J. Yang, P. M. Visscher, and L. Yengo. 2020. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nature Communications 11(1).

Wang, Y., K. Tsuo, M. Kanai, B. M. Neale, and A. R. Martin. 2022. Challenges and opportunities for developing more generalizable polygenic risk scores. Annual Review of Biomedical Data Science 5:293-320.

Wexler, N. S. 2004. Venezuelan kindreds reveal that genetic and environmental factors modulate Huntington’s disease age of onset. Proceedings of the National Academy of Sciences 101(10):3498-3503.