National Academies Press: OpenBook

Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field (2023)

Chapter: 5 Guidance for Selection and Use of Population Descriptors in Genomics Research

Suggested Citation: "5 Guidance for Selection and Use of Population Descriptors in Genomics Research." National Academies of Sciences, Engineering, and Medicine. 2023. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. Washington, DC: The National Academies Press. doi: 10.17226/26902.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

5 Guidance for Selection and Use of Population Descriptors in Genomics Research

PREPUBLICATION COPY - Uncorrected Proofs

INTRODUCTION

This chapter’s primary audience is researchers who work with genetic data. The committee’s intent is to provide practical guidance for using descent-associated population descriptors in human genetics and genomics research. As emphasized throughout the report, the appropriate population descriptor depends on the scientific question being asked. In some cases, none of these descriptors may be needed. In other situations, when descent-associated population descriptors are advisable or needed for methodological reasons, this chapter gives guidance on which approaches to consider and why.

In formulating these recommendations, the committee recognizes that there exists a large amount of legacy data in which study participants have already been classified on the basis of population descriptors (Khan et al., 2022; Wallace et al., 2020). When using such data, researchers may be constrained in their options, but their choices need to be described in ensuing publications. Furthermore, the committee appreciates the dynamic nature of research and the changing landscape of descent-associated population descriptors; there is no single solution to this challenge of appropriate use of descriptors, and applying a uniform approach across different types of studies is not possible. Rather, responsive approaches are needed to accommodate the specific research question being asked, develop best practices for grouping individuals and naming those groups, and take community preferences into account.

This chapter builds on the foundation established by the previous four chapters. Therefore, the committee encourages a careful reading of Chapters 1 through 4 in order to understand the context of these recommendations. Notably, Chapter 3 provides a set of guiding principles for conducting human genetics research (and all research involving humans) that support the report’s recommendations and can help guide researchers when none of the specific best practices apply.

THE IMPORTANCE OF TRANSPARENCY AND SPECIFICITY WHEN SELECTING AND REPORTING POPULATION DESCRIPTORS

Transparency in methodology is a scientific norm for replication of research findings (NASEM, 2019), yet the challenge of transparency is not only in scientific description but also in communicating specifically how and why particular decisions were made. Although imperfect, categories and labels are needed to conduct and communicate science. Transparency, therefore, requires stating the rationale behind the classification scheme and group labels applied when using population descriptors. Beyond describing the exact nature of the study conducted and ensuring reproducibility, comparability, and meta-analysis with other studies, transparency about methods, assumptions, and decision making promotes trustworthiness of the research (Claw et al., 2018; NASEM, 2019). Moreover, understanding the factors that inform decision making supports reproducibility.

When communicating their research methods, findings, and conclusions, researchers should be as transparent as possible about the specific procedures used to identify and name groups within their data sets. Transparency can take three major forms:

1. Clear identification of the concept of human difference underpinning the population descriptor(s) chosen for analysis, and the rationale for that choice;
2. Verbal descriptions of how samples were collected and labeled, as well as the rationale for the decisions made; and
3. Sharing analysis scripts and decision rules used to transform per-individual metadata (e.g., responses to surveys) to the labels used in an analysis.

The primary focus of this chapter is on the first two, namely the conceptual approaches and specific language that enable appropriate and accurate use of population descriptors in genomics research. Furthermore, the guidance that follows is intended to provide researchers with best practices and the rationale for decision making, in alignment with the guiding principles outlined in Chapter 3 and in an effort to support the goal of promoting trustworthy research.
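The third form of transparency listed above, sharing decision rules that turn per-individual metadata into analysis labels, can be illustrated with a short script. The following is a minimal sketch, not a method prescribed by the committee: the survey field name `ethnicity_responses`, the rule identifiers, and the placeholder labels "Group A" and "Group B" are all hypothetical. It shows one way to make each rule explicit and to record which rule produced each label so the mapping can be reported alongside the results.

```python
from collections import Counter

# Hypothetical survey field; the descriptor in this sketch is self-reported ethnicity.
FIELD = "ethnicity_responses"


def assign_label(record):
    """Return (group_label, rule_id) for one participant record.

    Each branch is one explicit, reportable decision rule.
    """
    responses = record.get(FIELD) or []
    if not responses:
        # Rule R1: no response recorded, so no label is inferred from other data.
        return "Not reported", "R1-missing"
    if len(responses) > 1:
        # Rule R2: multiple selections are kept as their own category.
        return "More than one ethnicity reported", "R2-multiple"
    # Rule R3: a single selection is used verbatim as the group label.
    return responses[0], "R3-single"


def label_all(records):
    """Label every record and count how often each rule fired."""
    labels = {r["id"]: assign_label(r) for r in records}
    audit = Counter(rule for _, rule in labels.values())
    return labels, audit


# Illustrative records only; these are not real data.
records = [
    {"id": "P001", FIELD: ["Group A"]},
    {"id": "P002", FIELD: ["Group A", "Group B"]},
    {"id": "P003", FIELD: []},
    {"id": "P004"},
]

labels, audit = label_all(records)
for pid, (label, rule) in labels.items():
    print(pid, label, rule)
print(dict(audit))
```

Returning the rule identifier with each label is a design choice made for this sketch: it lets a methods section state how many participants were labeled under each rule, including how missing data were handled.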

In delimiting their study participants, researchers inevitably make choices about which classification schemes or descriptors to use, which scale of resolution to consider, which specific group labels to apply, how to treat individuals with missing data, and so forth. Researchers may also be constrained to using group categories and labels adopted by others in order to allow for data aggregation or harmonization (Doiron et al., 2013; Khan et al., 2022; Wallace et al., 2020). A further challenge arises when such categories have been applied inconsistently, with a mixture of some individuals in a study labeled based on race, others based on ethnicity, and yet others based on geography. For instance, some researchers merge genomic data sets from different sources and assign individuals to clusters on the basis of genetic similarity to each other or to reference panels. Then they assign labels to individuals based on a characteristic that is frequent in the cluster or by using the labels from the reference panels. The number and size of the clusters that are detected in any given study depend on the sample composition. Moreover, the group labeling assigned to these clusters is often highly heterogeneous, borrowing terms from distinct classification schemes, at vastly different scales of resolution, such as African (a continental geographical location)/African American (an ethnicity), East Asian (a geographic location), and Finnish (a nationality). In that regard, it is worth noting that even when the labels are carried over from previous data collection, choices have to be made about what ancillary information to use and which subsets of individuals to combine and split in the new analysis.

CONCLUSION AND RECOMMENDATIONS

Conclusion 5-1. In employing population descriptors and assigning group labels in genetics studies, researchers tend to rely on existing and commonly used population classifications, often with unclear justification for their choices.

Recommendation 6. Researchers should tailor their use of population descriptors to the type and purpose of the study, in alignment with the guiding principles, and explain how and why they used those descriptors. Where appropriate for the study objectives, researchers should consider using multiple descriptors for each study participant to improve clarity.

Recommendation 7. For each descriptor selected, labels should be applied consistently to all participants. For example, if ethnicity is the descriptor, all participants should be assigned an ethnicity label, rather than labeling some by race, others by geography, and yet others by ethnicity or nationality. If researchers choose to use multiple descriptors, each descriptor should be applied consistently across all individuals in that study.

Recommendation 8. Researchers should disclose the process by which they selected and assigned group labels and the rationale for any grouping of samples. Where new labels are developed for legacy samples, researchers should provide descriptions of new labels relative to old labels.

To equip researchers with the information to follow these recommendations, the committee developed the following decision-making tools and best practices. These tools will be particularly helpful to reviewers of genetics and genomics research proposals who seek to ensure consistent usage of terms and appropriate study designs.

TOOLS FOR SELECTING AND USING POPULATION DESCRIPTORS IN GENETICS AND GENOMICS RESEARCH

The table below and decision tree in Appendix D suggest which descent-associated population descriptors are most appropriate as analytical tools for each of the seven genetics study types outlined in this report. Note that each descriptor represents a particular concept of difference across populations. In other words, the recommendations in the decision tree and table focus on the conceptual building blocks that researchers should use in study design, in data analysis, and in reporting their results. While the conceptual structure of research naturally has implications for the language that scientists adopt, the tree and table are not intended to be a linguistic straitjacket or a checklist of acceptable words. Instead, the objective of the committee’s guidelines is to encourage genetics researchers to consider, define, and delineate very carefully the concepts of human difference with which they are working, and to choose wording that transparently reflects the analytical steps taken.
These considerations are particularly salient with respect to genetic ancestry, which is not directly observable and is instead inferred from measures of genetic similarity. Therefore, the committee recommends that researchers relying on such measures explicitly refer to genetic similarity when describing their results, rather than the shorthand of genetic ancestry. An exception is for human evolutionary genetics studies explicitly aiming to learn about genetic ancestries over time or space. The committee recommends that when researchers borrow labels from ethnic, racial, political, or geographic classification schemes, they be explicit about their choice of descriptor. For all these reasons, the columns in Table 5-1 distinguish the concept of genetic ancestry from other population descriptors like ethnicity or geography.

Key Terminology for the Selection Guide

Throughout the rest of this chapter, a nuanced understanding of key terms and concepts introduced earlier in this report is necessary (summarized in Box 5-1). In the table and decision tree below, population descriptors refer to conceptual classification schemes used to group people based on specific characteristics. The appropriate application of these concepts in particular study contexts is the primary focus of the recommendations to follow. Group labels are names given to groupings of individuals.

BOX 5-1 Key Terminology for This Chapter

Population descriptor: a concept or classification scheme that categorizes people into groups (or “populations”) according to a perceived characteristic or dimension of interest. A few examples include race, ethnicity, and geographic location, although this is a non-exhaustive list.

Group label: name given to a population that describes or classifies it according to the dimension along which it was identified. An example is French as the label for a group identified by its members’ possession of French nationality, where nationality is the population descriptor.

Ancestral recombination graph: for a set of individuals, the graph depicting the genetic ancestry lines (or paths) that trace back to their common genetic ancestors at every position in the genome.

Ancestry: a person’s origin or descent, lineage, “roots,” or heritage, including kinship. Examples of ancestry group labels include clan names or patronyms, but geographic, ethnicity, or racial labels are often used to denote groups whose members are presumed to share common ancestry.

Genetic ancestry: the paths through an individual’s family tree by which they have inherited DNA from specific ancestors. Genetic ancestry can be thought of in terms of lines extending upwards in a family tree from an individual through their genetic ancestors (see Figure 2-1). Shared genetic ancestry arises from having genetic ancestors in common (that is, overlapping lines of ancestry). For a set of individuals, a fundamental representation of genetic ancestry is a structure called an ancestral recombination graph. In practice, shared genetic ancestry is typically inferred by some measure(s) of genetic similarity.

Genetic ancestry group: a set of individuals who share more similar genetic ancestries. In practice a genetic ancestry group is constituted based on some measure(s) of genetic similarity. Once a set is designated as a genetic ancestry group, its members are often assigned a geographic, ethnic, or other nongenetic label that is common among its members.

Genetic similarity: quantitative measure of the genetic resemblance between individuals that reflects the extent of shared genetic ancestry.

See Appendix B for further comments, definitions, and citations.

Many of the best practices recommended by the committee rely on understanding the distinctions between genealogical ancestry, genetic ancestry, genetic ancestry group, and genetic similarity. For a full background of these concepts, see the subsection “Ancestry” in Chapter 2. Briefly, genealogical ancestors refer to the collection of ancestors for an individual as found in a family tree, such as parents, grandparents, and so on (Mathieson and Scally, 2020; Rhode et al., 2004). Genetic ancestry refers to the paths through an individual’s family tree by which they have inherited DNA from specific ancestors (Mathieson and Scally, 2020); it is inferred from measures of genetic similarity rather than directly observed (Mathieson and Scally, 2020). Genetic ancestry groups are usually defined by demarcating sets of people based on various measures of genetic similarity. Then these groups are often given a label derived from nongenetic characteristics, such as ethnicity, geography, or race. This mapping of nongenetic descriptors introduces additional assumptions (see Chapters 2 and 4). Moreover, it may suggest homogeneity of genetic and environmental effects within social categories where none exists.

In a number of contexts, reliance on, and reference to, ancestry groupings may be unnecessary for the goals of the study. For example, when matching the background allele frequencies of cases to controls, there is a need to identify a set of individuals who are genetically similar, but not to rely on inferences about their genetic ancestry.
Likewise, identifying individuals who are genetically similar to each other or to a reference panel is usually sufficient to delimit a subset of participants for GWAS. Although the distinction between genetic ancestry and genetic similarity may be subtle, it is nonetheless important to enable moving beyond fundamental misconceptions about population descriptors, particularly race and typological thinking.

Conclusion 5-2. Assigning ancestry group labels based on such descriptors as geography, ethnicity, or race is often scientifically unnecessary and may contribute to typological thinking (Coop, 2022). In particular, genetic ancestry group is commonly conflated with continental geography, which in turn often stands in for, and thereby reifies, race (Lewis et al., 2022).
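The point that genetic similarity to a reference panel can delimit a subset of participants without any ancestry label can be made concrete with a small example. The sketch below is an illustration under stated assumptions, not the committee's method: it uses the proportion of alleles shared identical-by-state (IBS) as one simple similarity measure, genotypes coded as alternate-allele counts (0, 1, 2) with `None` for missing calls, and a hypothetical threshold of 0.8. The participant IDs and genotype values are invented for illustration.

```python
from statistics import mean


def ibs_similarity(g1, g2):
    """Proportion of alleles shared identical-by-state between two individuals.

    Genotypes are alternate-allele counts (0, 1, or 2) per site; None marks a
    missing call, and sites missing in either individual are skipped.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no sites with both genotypes called")
    # At each site the two individuals share 2 - |a - b| alleles out of 2.
    return sum(2 - abs(a - b) for a, b in pairs) / (2 * len(pairs))


def similar_to_panel(participants, panel, threshold):
    """Return IDs whose mean IBS similarity to the panel meets the threshold.

    The result is described only as "genetically similar to the panel";
    no geographic, ethnic, or racial label is attached.
    """
    selected = []
    for pid, genotype in participants.items():
        score = mean(ibs_similarity(genotype, ref) for ref in panel)
        if score >= threshold:
            selected.append(pid)
    return selected


# Illustrative toy data: four sites, two reference individuals.
panel = [[0, 1, 2, 0], [0, 1, 2, 1]]
participants = {
    "P1": [0, 1, 2, 0],
    "P2": [2, 1, 0, 2],
    "P3": [0, 1, 1, None],
}

print(similar_to_panel(participants, panel, threshold=0.8))
```

In a real study the similarity measure, the reference panel, and the threshold would each need to be reported and justified; analyses at scale typically rely on dedicated genetics software rather than a loop like this one.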

Orientation to the Selection Guide

We must aspire to research scholarship and assessments and treatments based on actual and not assumed genetic variation, and the social, historical, structural context in which the bodies and lives of the people that we’re interested exist. That means assessing the patterns of diversity that reflect the distribution of human genetic variation across the globe, not proxies thereof.
Agustín Fuentes, testimony to the committee in a public session on April 4, 2022

As shown in Table 5-1, best practices in the use of population descriptors vary by study type for any non-disease or disease trait. The committee considered seven major study types: (1) gene discovery for Mendelian traits; (2) prediction for Mendelian traits; (3) gene discovery for complex and polygenic traits; (4) prediction for complex and polygenic traits; (5) elucidation of molecular, cellular, or physiological mechanisms; (6) studies of health disparities with genomic data; and (7) studies of human evolutionary history. For descriptions and examples of each type, see the section “Classification of Genomics Study Types” in Chapter 1.

Population descriptors refer to conceptual frameworks for describing descent-associated differences across groups of people. First, careful consideration should be given to whether descent-associated population descriptors are needed at all. If needed, and once researchers identify the appropriate population descriptor or descriptors for the context of their study, they should apply group labels consistent with each concept to all study participants. For a given study, more than one concept may be appropriate, and studies may benefit from using multiple descriptors. For example, a project may incorporate both geography and ethnicity simultaneously to distinguish, say, Kurds in Iraq from Kurds in Turkey.
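The requirement that each chosen descriptor be applied to every participant, with labels drawn from a single scheme per descriptor, lends itself to an automated check. The following is a minimal sketch under assumptions: the descriptor names, the declared label sets, and the participant records are hypothetical, with the two-descriptor layout borrowed from the geography-plus-ethnicity example in the text.

```python
def check_consistency(participants, schemes):
    """Return a list of (participant_id, descriptor, problem) tuples.

    `schemes` maps each descriptor used in the study to the set of labels
    declared for it. A participant must carry one declared label for every
    descriptor; anything else is reported as a problem.
    """
    problems = []
    for pid, labels in participants.items():
        for descriptor, allowed in schemes.items():
            value = labels.get(descriptor)
            if value is None:
                problems.append((pid, descriptor, "missing label"))
            elif value not in allowed:
                problems.append(
                    (pid, descriptor, f"label {value!r} is not in the declared scheme")
                )
    return problems


# Hypothetical declared schemes: one label vocabulary per descriptor.
schemes = {
    "ethnicity": {"Kurdish", "Not reported"},
    "geography": {"Iraq", "Turkey"},
}

# Illustrative records; P003 mixes schemes by placing a geographic
# label in the ethnicity field and has no geography label at all.
participants = {
    "P001": {"ethnicity": "Kurdish", "geography": "Iraq"},
    "P002": {"ethnicity": "Kurdish", "geography": "Turkey"},
    "P003": {"ethnicity": "Iraq"},
}

for problem in check_consistency(participants, schemes):
    print(problem)
```

Declaring the label vocabulary up front, rather than inferring it from the data, is what allows the check to catch a label borrowed from a different classification scheme.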
In some contexts, descent-associated population descriptors are used not as indicators of shared genetic ancestry but as proxies for shared en- vironmental exposures (see “The Importance of Environmental Factors in Genetics and Genomics Research” in Chapter 2). This practice should be avoided where possible in favor of measuring the environmental variable directly. Nevertheless, when direct measurements are not possible, Table 5-1 indicates which population descriptors might be most appropriate. While race (or racialized group) may capture some shared exposure to racism and therefore may be suitable for some health disparities studies, it is a poor proxy for other environmental exposures, and carries the risk of contributing to typological thinking. Therefore, the committee recom- mends it not be used outside of a subset of health disparities studies. Even in that context, the combination of information from other classification schemes (e.g., ethnicity and geography) may be more accurate. Moreover, PREPUBLICATION COPY—Uncorrected Proofs

120 POPULATION DESCRIPTORS IN GENETICS AND GENOMICS RESEARCH should descent-associated population descriptors be used as proxies for environments, that research-design decision should be explicitly noted and its rationale explained. Finally, readers should bear in mind that the recommendations in Table 5-1 apply to the analytical use of population descriptors—that is, as vari- ables or other tools in analysis. The committee recognizes, however, that researchers may wish—or be obligated—to use population descriptors for other research-related activities, notably for constructing and/or describ- ing samples of individuals whose genetic material is to be analyzed. In the interests of equity, justice, or the diversification of human genetic data and knowledge about it, researchers may choose to use race and/or ethnicity in order to identify individuals to be included in their studies (Oni-Orisan et al., 2021), and Table 5-1 is not meant to govern such sampling decisions or procedures. Even outside the realm of analysis, however, the committee encourages scientists to carefully consider whether race and/or ethnicity are the most conceptually appropriate and useful descriptors for the informa- tion they wish to capture, or the best guides to seeking a heterogeneous sample. Population Descriptor Selection Guide Table 5-1 is a highly condensed summary of the best practices described in this chapter. It should not be inferred to indicate in absolute terms what to use and not use in every circumstance. To use Table 5-1 effectively, the reader is advised to review subsection “Key Terminology for the Selection Guide” above and consult the text describing the best practices for each specific study type in conjunction with viewing the table. 
In addition, the reader should note that Table 5-1 provides only a broad overview and summary of the best practices; additional considerations for decision making are outlined in the decision tree (Figure D-1 in Appendix D) and in the body of this report. The text that follows explains what is summarized in the table and illustrated in the decision tree in Appendix D. Although the text does not cover every possible variation of the genetics study types, the intent is for the discussion and examples to allow researchers to understand why certain population descriptors are recommended or discouraged depending on the type of study and the goals of the research.

TABLE 5-1 Recommended Approaches for the Use of Population Descriptors by Genomics Study Type

This table should be read and interpreted in conjunction with the report text. Consult the decision tree in Appendix D for more information and Chapter 5 text for best practices for each study type. See also the terminology box preceding the table and descriptions of each study type in Chapter 1 section “Classification of Genomics Study Types.” For any given study, the use of multiple descriptors may be preferable.

LEGEND
✓ Preferred population descriptor(s)
✗ Should not be used
? In some cases; refer to Ch. 5 text and the decision tree in Appendix D
E Descriptors could be used if appropriate proxies for environmental, not genetic, effects

Genomics Study Type | Race | Ethnicity/Indigeneity | Geography | Genetic Ancestry | Genetic Similarity | Notes
1: Gene Discovery - Mendelian Traits | ✗ | ? | ? | ? | ✓ | Similarity suffices as a genetic measure; at fine-scale, other variables may be useful
2: Trait Prediction - Mendelian Traits | ✗ | E | E | ? | ✓ | No population descriptors may be necessary for analysis
3: Gene Discovery - Complex Traits | ✗ | E | E | ? | ✓ | Similarity suffices as a genetic measure
4: Trait Prediction - Complex Traits | ✗ | E | E | ? | ✓ | Similarity suffices as a genetic measure
5: Cellular and Physiological Mechanisms | ✗ | E | E | ✗ | ? | No population descriptors may be necessary for analysis
6: Health Disparities with Genomic Data | E | E | E | ? | ✓ | Not all health disparities studies rely on descent-associated population groupings, so none may be necessary for analysis
7: Human Evolutionary History | ✗ | ? | ✓ | ✓ | ✓ | Reconstructing genetic ancestry may be of central interest

Study Type 1: Gene Discovery for Mendelian Traits

Sequence variants that underlie Mendelian diseases fall into two categories: de novo mutations and inherited ones. When the goal is identifying de novo mutations, as through family studies (e.g., Simons et al., 2013; Turner et al., 2017), no descent-associated population descriptors are necessarily needed to identify variants. In some contexts, it may nonetheless be helpful to provide population descriptors (e.g., current geographic location) for the families that were studied in order to enable identification of additional cases or to study the geographical spread of new mutations (e.g., Wexler, 2004).

Conclusion 5-3. Where the goal is to describe families in which de novo mutations have been identified, the relevant information is likely much more finely scaled than broad categories like those labeled by continental ancestry (or other large-scale genetic ancestry) or by ethnicity.

Best Practice 1: To enable identification of additional cases, rather than using genetic ancestry or ethnicity, researchers should use categories based on kinship (e.g., recent genealogical ancestors), identity-by-descent information, or fine-scaled geographical or genetic similarity data.

Studies aimed at identifying Mendelian disease variants use not only pedigrees, but also collections of unrelated affected persons (loosely called cohorts). A small number of variants found in these affected persons will be de novo, but most will be inherited and of unknown functional significance and pathogenicity. To annotate such variants, researchers commonly rely on a reference database composed of data from individuals who are not diagnosed with the disease, in order to exclude those variants that are likely unrelated to the condition.
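The annotation step just described can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function name, the variant identifiers, the reference-set labels, and the 0.1 percent frequency threshold are hypothetical and do not come from the report or from any particular database. The sketch drops a candidate variant if it is above the threshold in any reference set and treats variants absent from the reference as unobserved.

```python
from typing import Dict, List


def filter_candidate_variants(
    candidates: List[str],
    reference_freqs: Dict[str, Dict[str, float]],
    max_freq: float = 0.001,  # hypothetical threshold, chosen only for illustration
) -> List[str]:
    """Keep candidates whose allele frequency is at most max_freq in every reference set.

    reference_freqs maps a variant ID to {reference set label: allele frequency}.
    A variant missing from the reference is treated as unobserved (frequency 0).
    """
    kept = []
    for variant in candidates:
        freqs = reference_freqs.get(variant, {})
        highest = max(freqs.values(), default=0.0)
        if highest <= max_freq:
            kept.append(variant)
    return kept


# Hypothetical variants and frequencies in two reference sets of unaffected individuals.
reference = {
    "var_A": {"ref_set_1": 0.12, "ref_set_2": 0.0004},
    "var_B": {"ref_set_1": 0.0002, "ref_set_2": 0.0},
}
remaining = filter_candidate_variants(["var_A", "var_B", "var_C"], reference)
print(remaining)
```

In this toy input, `var_A` is common in one reference set and is excluded, while `var_B` and the unobserved `var_C` remain as candidates for follow-up. Note that the reference sets are identified only by neutral labels; how such sets are delimited is the subject of the guidance in this section.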
In doing so, the goal is to exclude variants that are not rare (e.g., common variants) in groups of individuals or populations with genetic backgrounds similar to one another. Alternatively, researchers may rely on a global reference panel to evaluate whether the variant of interest is at high frequency anywhere in the world and, if so, exclude it as a likely causal mutation.

Best Practice 2: Where researchers aim to match the genomes of focal individuals to people who are genetically similar, they should not rely on matching solely by racial (e.g., black), ethnic (e.g., Hispanic), geographic (e.g., West African), or national (e.g., Nigerian) categorizations, which are poor proxies for allele frequencies. Since in this context researchers are not interested in genetic ancestry per se, the committee further recommends that, when possible, they avoid ancestry category labels (e.g., “Admixed-American” in the 1000 Genomes), as such labeling brings in additional, unnecessary assumptions (see Chapters 2 and 4). The committee recommends instead that they rely on genetic similarity measures to delimit the reference set to which to compare focal individuals and to describe study participants (Coop, 2022).

Study Type 2: Prediction for Mendelian Traits

Examples of phenotypic prediction for Mendelian traits include prenatal or newborn screening (e.g., for Tay-Sachs or phenylketonuria [PKU], respectively) or clinical testing for highly penetrant germline mutations that increase disease risk (e.g., Tan et al., 2017). Once the genetic basis for a disease has been elucidated, group labels may no longer be needed as a stand-in for allele frequencies. Instead, people can readily be genotyped for the variants themselves. Furthermore, making screening for specific alleles available only in people with particular descriptors (e.g., Tay-Sachs only in children of Ashkenazi Jewish parents) will miss some disease-variant carriers (Dolitsky et al., 2020; Nazareth et al., 2015). In clinical scenarios, there may be exceptions where allele frequencies are necessary to estimate the genotype of missing parents. Here again, genetic similarity is more appropriate than reference to genetic ancestry.

For some traits considered Mendelian (e.g., Huntington’s disease), prognosis, for instance regarding the age of onset, depends on modifier alleles in the genome (GeM-HD Consortium, 2015). Where the modifier alleles are unknown, as they usually are today, information about genetic similarity may be helpful in providing some information about allele frequencies at other loci in the genome. When the modifier alleles are known, however, sequencing the individual will provide much more accurate individual information than will population descriptors.
Best Practice 3: Where the genetic basis of a trait is known, researchers should focus on characterizing the individual’s alleles rather than use population descriptors as an unreliable proxy for the genomic background.

Phenotypic trait prediction of Mendelian disorders may also depend on environmental factors, such as air pollution or secondhand smoke exposure for children with cystic fibrosis (O’Neal and Knowles, 2018). In such cases, researchers may be tempted to include such population descriptors as race, ethnicity, or geography to capture shared exposures (Martinez et al., 2022). However, where the aim is explicitly to measure environmental effects, the committee recommends that descent-associated population descriptors not be used in place of individual-level data, as the use of such descriptors runs the risk of erroneously suggesting that the effects of interest are genetic.

Best Practice 4: Given that any descent-associated population descriptor will be a poor proxy for environmental effects (Benmarhnia et al., 2021; Martinez et al., 2022), researchers should aim to directly collect information about as many potentially relevant environmental factors as possible.

Best Practice 5: When including population descriptors for phenotype prediction of Mendelian traits, researchers should be explicit about whether the aim is to study genetic or environmental effects or both, and whether these can be disentangled given the study design.

Study Types 3 and 4: Gene Discovery and Prediction for Complex and Polygenic Traits

Complex traits, such as height or the risk of a disease such as type 2 diabetes, depend on the effects of not only many loci across the genome but also the environment (Falconer and Mackay, 1996). In the past two decades, the main method to map the genetic basis of such traits has been genome-wide association studies (GWAS) of unrelated individuals (Hirschhorn and Daly, 2005). Such efforts have been motivated by two distinct goals: to identify loci that affect a particular phenotype, and to predict trait values in currently asymptomatic individuals (Visscher et al., 2017). Association studies are conducted in sets of individuals that have some degree of genetic similarity in order to better control for effects of alleles in the genomic background as well as potential environmental effects that correlate with the genomic background. How much of a problem environmental confounding presents depends on the trait (e.g., Okbay et al., 2022).
Study Type 3: Gene Discovery for Complex and Polygenic Traits

For researchers who are mapping variants that influence complex trait values, common practice is to describe their study participants as members of genetic ancestry groups that are labeled with geographic, ethnic, or racial terms. Such labels for inferred ancestry groups are defined at very different levels of resolution depending on the data at hand, such as the Southern European versus the white British subsample of the UK Biobank. Moreover, to increase their sample sizes when they lack access to the original data, researchers often combine summary statistics provided by different studies (Lesko et al., 2018). In that case, current practice usually consists of grouping them under a general label, such as European, without spelling out the often-implicit assumptions about genetic and environmental effects in the combined samples or considering the genetic or environmental diversity within any such group (Coop, 2022). In this context, researchers are not interested in ancestry or race per se, and instead are aiming to identify study participants who are more genetically similar to one another, to better control for effects of the genomic background and correlated environmental effects.

Similar considerations arise when conducting mapping studies in recently admixed individuals (Shriner, 2013; Thornton and Bermejo, 2014). In this context, the mapping is conducted by considering local ancestry estimates, that is, inferences based on genetic similarity for different segments of the genome (Atkinson et al., 2021). The statistical model sometimes also includes genome-wide ancestry proportions, estimates that capture genomic background effects beyond the scale of local ancestry estimates, and, sometimes, ethnic or racial labels as a proxy for environmental exposures.

Best Practice 6: When mapping variants that contribute to complex traits, the goal is to conduct the study in a set of individuals that are genetically more similar, rather than to infer ancestry per se. Therefore, researchers should characterize their study participants in terms of their genetic similarity to one another or to a reference panel, with a specified similarity measure (Coop, 2022).
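One way to make “genetic similarity to a reference panel, with a specified similarity measure” concrete is sketched below in Python. This is only an illustration under stated assumptions: the similarity measure (mean Euclidean distance in a two-dimensional principal component space), the panel labels, the coordinates, and all function names are hypothetical choices, not a method endorsed or specified by the report. The point of the sketch is that the resulting description names the measure and the panel rather than an ancestry category.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_distance(point: Sequence[float], panel: List[Sequence[float]]) -> float:
    """Mean Euclidean distance from one sample to every member of a reference panel."""
    return sum(math.dist(point, member) for member in panel) / len(panel)


def most_similar_panel(
    sample_pcs: Sequence[float],
    reference_panels: Dict[str, List[Sequence[float]]],
) -> Tuple[str, float]:
    """Return the label of the panel with the smallest mean distance, and that distance."""
    distances = {
        label: mean_distance(sample_pcs, members)
        for label, members in reference_panels.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


def describe_sample(sample_id: str, label: str, measure_name: str) -> str:
    """Build a description that states the measure and the panel explicitly."""
    return (
        f"{sample_id}: genotypes most genetically similar, by {measure_name}, "
        f"to the {label} reference panel"
    )


# Hypothetical PC1/PC2 coordinates; the panel labels are placeholders.
panels = {
    "panel_1": [(0.0, 0.0), (0.2, 0.0)],
    "panel_2": [(1.0, 1.0), (1.2, 1.0)],
}
label, distance = most_similar_panel((0.1, 0.0), panels)
print(describe_sample("sample_001", label, "mean Euclidean distance in PC1-PC2 space"))
```

Which measure to use, and how sensitive the result is to the composition of the reference panel, are open choices that a study would need to report alongside any such description.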
As an example, researchers would describe samples as “carrying genotypes most genetically similar by measure X to the GBR panel of the 1000 Genomes data set, as compared to individuals sampled elsewhere in the world” (GBR being the acronym for British in England and Scotland) (Coop, 2022) or by using coordinates in a low-dimensional representation of the data, like principal component analysis (PCA) or uniform manifold approximation and projection (UMAP) (e.g., “individuals projecting to the region [-0.1,-0.05] in PC1 and [0.3,0.5] in PC2 of a PCA generated from the 1000 Genomes data set”). For recently admixed individuals, this description would then naturally lend itself to statements such as, “Seventy-three percent of the genome is most genetically similar to genotypes of individuals in the GBR panel, and 27 percent of the genome is most similar to genotypes of individuals in the YRI panel” (YRI being the acronym for Yoruba in Ibadan, Nigeria). An alternative phrasing would be, “73% of the genome is most similar to genomes from region 1 and 27% from region 2 in a 1000 Genomes PCA.” This approach avoids descriptions of recently admixed people as either African or European when they derive recent ancestry from diverse locations in both continents. Importantly, this descriptive change does not alter or compromise the underlying science. For comparability across studies, there is still work to be done to assess which similarity method to use and how these may change with the composition of the reference panel and different choices of genotype measures.

Study Type 4: Prediction for Complex and Polygenic Traits

GWAS results can be useful for trait prediction, even where the mechanism linking genotypic variation to phenotypic variation is not understood (Torkamani et al., 2018). In particular, the hope is to use polygenic scores (PGS) to identify individuals at high risk for specific diseases (e.g., Khera et al., 2018; Mavaddat et al., 2019). PGS are calculated by summing alleles carried by an individual, weighted by effect sizes of alleles that are estimated in association studies (often GWAS); PGS provide a predictor of a deviation from a mean value in a given study population (often adjusted for relevant covariates, such as sex and age) (Sirugo et al., 2019).

Polygenic scores (also called polygenic risk scores) are based on GWAS that pick up not only causal loci but also genetic variants correlated with causal variants, to an extent that depends on allelic association (called linkage disequilibrium or LD) patterns among sites (Choi et al., 2020). Since at present causal variants can rarely be pinpointed, the construction of a PGS requires weighting all these associations. In practice, therefore, phenotypic prediction of complex traits relies on LD patterns characterized in a set of individuals that are genetically similar, by some operational definition.

When the goal is to predict trait values, as distinct from identifying causal loci, it may not be as important to entirely control for environmental effects on the trait that are correlated with genetic differences. In some contexts, uncontrolled environmental stratification can actually enhance predictive power (Mostafavi et al., 2020).
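Returning to the definition of a polygenic score given above, the weighted sum and the deviation from a study mean can be written out in a few lines of Python. The variant identifiers, effect sizes, and genotypes below are invented for illustration, the function names are hypothetical, and treating a missing genotype as zero effect alleles is a simplifying assumption that real pipelines handle more carefully (for example, by imputation).

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, int], effect_sizes: Dict[str, float]) -> float:
    """Sum of effect-allele counts (0, 1, or 2) weighted by estimated effect sizes.

    Simplifying assumption: a variant with no genotype call contributes zero.
    """
    return sum(weight * dosages.get(variant, 0) for variant, weight in effect_sizes.items())


def deviations_from_study_mean(scores: Dict[str, float]) -> Dict[str, float]:
    """Express each score relative to the mean score in the same study set."""
    study_mean = sum(scores.values()) / len(scores)
    return {person: score - study_mean for person, score in scores.items()}


# Hypothetical per-allele effect sizes, as might be estimated in an association study.
weights = {"v1": 0.5, "v2": -0.25, "v3": 0.1}

# Hypothetical genotypes (effect-allele counts) for three study participants.
genotypes = {
    "person_a": {"v1": 2, "v2": 1, "v3": 0},
    "person_b": {"v1": 0, "v2": 2, "v3": 1},
    "person_c": {"v1": 1, "v2": 0, "v3": 2},
}

scores = {person: polygenic_score(g, weights) for person, g in genotypes.items()}
relative = deviations_from_study_mean(scores)
for person in scores:
    print(person, round(scores[person], 3), round(relative[person], 3))
```

Because the weights come from a particular association study and the mean comes from a particular study set, a score computed this way is interpretable only relative to those sets, which is why the composition of both matters for the generalizability issues discussed in this section.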
For related reasons, the practice of performing genetic prediction after stratifying by a population descriptor can increase predictive power because it implicitly captures both genetic similarity and shared environmental exposures. A danger, though, is that by including a contribution of nongenetic effects in what is widely understood to be a genomic predictor, this practice will end up over-emphasizing the role of genetics in trait etiology and reifying group differences.

Another important aspect of genomic trait prediction is generalizability beyond the GWAS study population. Generalizability is particularly important because, to date, the vast majority of GWAS have been conducted by sampling people in Europe or those who report recent European ancestry (Martin et al., 2019; Mills and Rahal, 2020). Given that LD patterns vary across the globe (Charles et al., 2014), as do the frequencies of causal loci, the prediction accuracy of PGS is expected to decrease with genetic divergence from the GWAS set, even if nothing else were to differ (Wang et al., 2020, 2022). That decrease is seen in practice: PGS have lower prediction accuracy with increasing genetic distance from the GWAS set of individuals (Martin et al., 2019; Privé et al., 2022; Scutari et al., 2016; Wang et al., 2020). Factors other than LD and shifts in causal allele frequencies may also decrease prediction power, such as differences in the degree of environmental variance or gene–environment interactions; in other words, genetic effects may differ across environmental settings (Giannakopoulou et al., 2021; Mills and Rahal, 2020; Mostafavi et al., 2020; Wang et al., 2022).

Best Practice 7: When predicting complex traits, the goal is to study a set of individuals that vary in a trait but are relatively similar genetically, rather than to infer ancestry per se. Therefore (as with Best Practice 5 and 6), researchers should characterize their study participants in terms of their genetic similarity (to one another or with regard to a reference panel), with a specified similarity measure (Coop, 2022).

Considerations Common to Gene Discovery and Prediction for Complex and Polygenic Traits

The committee recognizes that after delimiting study participants based on genetic similarity to a reference panel, researchers may want to refer to the set of study participants with a label based on ethnicity (e.g., Yoruba), nationality (e.g., Nigerian), or geography (e.g., residing in Nigeria), or a combination of various labels, either as shorthand in communicating the results or to underscore a particular characteristic of the group that distinguishes their ethnicity, geography, or demographic history from that of the closest other individuals in the reference panel. In so doing, care should be taken to avoid applying broad labels (e.g., African ancestry) to panels represented by narrower sampling (e.g., YRI). This consideration underscores the need for more widely available, geographically and ethnically diverse reference panels.
In every case, researchers should be transparent about their reasons for using such ancestry labels and for the choice of the particular label(s) in question. Importantly, in many cases, it may be unnecessary to refer to genetic ancestry at all, since terms such as the study population, alongside information about how and where individuals were sampled, may be sufficient and require fewer assumptions.

The committee further appreciates that when researchers use summary statistics from a previous GWAS and have no access to the individual-level genotype and phenotype data (e.g., when conducting a meta-analysis), it is not always feasible to assess genetic similarity to a reference panel. In this case, researchers should be explicit about the reasons for the nomenclature they have adopted or borrowed (e.g., if grouping many sets of individuals under a common label) and the procedure by which individuals have been retained or excluded from the sample; where possible, they should adopt labels based on the use of genetic similarity. In this regard, those sharing data should also attempt to provide indirect measures of genetic similarity (e.g., summary statistics for coordinate positions in a reference principal component analysis) that might enable genetic similarity to be assessed more precisely than is possible with group labels.

Where no reference panel is available, researchers often use a group label based on an attribute that is common to the study participants, such as a subset of people who self-identify as “white British” in the UK Biobank. Researchers should be explicit about their reasons for choosing the attribute used to delineate and describe the study participants.

Best Practice 8: Researchers should describe samples in as many dimensions as possible, using population descriptors, individual-specific environmental data, and their ascertainment scheme (e.g., were participants recruited from a research hospital, in an urban or rural area, and so on).

Best Practice 9: When descent-associated descriptors such as ethnicity or geography are used, researchers should be explicit about what types of effects they intend to capture (genetic, nongenetic, or both) and whether the effects can be teased apart reliably given the study design.

Best Practice 10: Where the goal is to control for environmental effects that are correlated with genomic background effects, researchers should, if possible, replace or, at least, augment the use of population descriptors with more reliable and precise measures of individual environmental effects. Whenever labels remain, researchers should be explicit about their reasons for using them.

Cataloging the data collection in these ways will enable samples to be assessed for their genetic similarity to reference panels as well as for their similarity along nongenetic dimensions such as employment status or geographic location.
A richer description of the data will also help to identify obstacles to generalizability beyond genetic similarity. A further benefit may be that forms of study ascertainment or enrollment bias could potentially be taken into account (e.g., Van Alten et al., 2022).

Finally, a major goal of all of these genetics and genomics studies, particularly GWAS, is to dissect the genetic and environmental architecture of these traits, and to identify the underlying mechanisms (pathophysiology). Further progress along this front will almost surely require the collection of new samples with new data, particularly longitudinal data, rather than simply retrofitting legacy studies.
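The multi-dimensional sample description called for in Best Practices 8 through 10 could be captured in a structured record. The Python sketch below is one hypothetical layout, not a schema proposed by the committee: all class names, field names, and example values are assumptions made for illustration. It records the ascertainment scheme, a similarity statement, individual-level environmental measures, and, for each descriptor that remains, its source, the type of effect it is intended to capture, and the rationale for using it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ALLOWED_INTENTS = ("genetic", "nongenetic", "both")


@dataclass
class DescriptorUse:
    value: str            # the label as recorded
    source: str           # how the label was obtained, e.g. "self-identified"
    intended_effect: str  # one of ALLOWED_INTENTS
    rationale: str        # why this descriptor is used in the analysis


@dataclass
class SampleDescription:
    ascertainment: str         # how and where participants were recruited
    similarity_statement: str  # similarity to a reference panel, naming the measure
    environmental_measures: Dict[str, float] = field(default_factory=dict)
    descriptors: Dict[str, DescriptorUse] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """List the gaps that would leave the description incomplete."""
        issues = []
        if not self.ascertainment.strip():
            issues.append("ascertainment scheme is missing")
        for name, use in self.descriptors.items():
            if use.intended_effect not in ALLOWED_INTENTS:
                issues.append(f"{name}: intended effect not stated")
            if not use.rationale.strip():
                issues.append(f"{name}: no rationale given")
        return issues


# Hypothetical record; every value is a placeholder.
record = SampleDescription(
    ascertainment="recruited at an urban research hospital",
    similarity_statement="most similar by measure X to reference panel P",
    environmental_measures={"years_of_education": 14.0},
    descriptors={
        "geography": DescriptorUse(
            value="sampled in region R",
            source="recruitment site",
            intended_effect="nongenetic",
            rationale="",
        )
    },
)
issues = record.problems()
print(issues)
```

A check of this kind does not decide whether a descriptor is appropriate; it only flags where the stated intent or rationale has been left implicit.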

Study Type 5: Elucidation of Molecular, Cellular, or Physiological Mechanisms

Many studies that include human genetic data ultimately aim to understand the molecular, cellular, tissue, and physiological underpinnings of traits, sometimes triggered by gene discovery. One example might be studies aimed at understanding the genetic and neuronal mechanisms by which a DNA repeat expansion causes Huntington’s disease (Jimenez-Sanchez et al., 2017). Another is the molecular mechanism underlying messenger RNA (mRNA) vaccines against SARS-CoV-2: their development relied on an understanding of antibody production and chemical modifications to mRNA that help evade the human innate immune response, much of which was learned in mice, human cells, and other model systems (Delorey et al., 2021; Sadarangani et al., 2021). In such cases, where underlying mechanisms are expected to be shared by all humans (and often by other species), there is no compelling reason to stratify study participants by descent-associated population descriptors at all.

As noted by Pavličev and Wagner (2002), “A shared mechanistic basis of a trait does not mean that exactly the same loci will be detectable by association with variation in this trait.” Conversely, the observation that variation in a trait differs in its allelic basis among humans does not imply that the underlying mechanisms are different. Thus, despite the universality of the underlying mechanisms of vaccination (Delorey et al., 2021; Sadarangani et al., 2021), humans vary in their specific response (Randolph et al., 2021), likely because of both genetic variants and environmental exposures. As an example, in all humans, myopia is caused by deformations in the shapes of the eye or cornea and can be corrected by eyeglasses (Chakraborty et al., 2020).
Nonetheless, the genetic and environmental factors that lead to myopia likely differ across the world, owing to differences in allele frequencies, average effect sizes, and environmental exposures (Chakraborty et al., 2020; Li and Zhang, 2017). Researchers may be interested in understanding such perturbations to the underlying mechanisms and how they are distributed geographically, but often the primary goal is to leverage these perturbations (e.g., loss-of-function mutations) as a tool to better understand underlying mechanisms.

When specific candidate loci or salient environmental factors are unknown, a common approach has been to use population descriptors, and in particular ancestry group labels, as a proxy for differences in allele frequencies across the genome and potentially environmental exposures. A danger of this approach is the implication, implicit or explicit, that the underlying mechanisms themselves somehow differ by population descriptors, when in fact the observed differences are caused by alleles at specific loci in the genome or varying environmental exposures (or interactions of the two). Once the nature of the perturbations has been identified, any observed differences between groups defined by population descriptors will resolve as differences between individuals carrying distinct alleles and the environments to which they are exposed.

Conclusion 5-4. Given that underlying cellular and physiological mechanisms are expected to be universal among humans, the default practice in such studies should be to not use any descent-associated population descriptors.

Best Practice 11: When researchers are interested in studying perturbations to underlying mechanisms that arise from genetic variation, and the genetic variants are known, population descriptors should not be used as a substitute for individual information. If, instead, the genetic variants are unknown, and researchers are interested in delimiting a set of individuals with similar allele frequencies, they should rely on genetic similarity rather than such descriptors as ethnicity or geography.

Best Practice 12: Where the goal is to study the effect of unknown environmental exposures or possible gene–environment interactions, researchers should aim to replace or supplement population descriptors with direct information about potentially salient environmental factors. Regardless, researchers should be explicit about their intent in using population descriptors, including whether the aim is to study genetic or environmental effects or both, and whether these can be teased apart given the study design.

Study Type 6: Studies of Health Disparities with Genomic Data

Health disparities studies often compare groups of individuals identified by different descent-associated population descriptors (e.g., by OMB racial and ethnic categories).
Some of these studies include genetic information, such as genome-wide genotyping data (Batai et al., 2021), data for variants at a single locus (e.g., the apolipoprotein E gene, APOE4) (Torres and Kittles, 2007), or tumor genome sequencing (Daly and Olopade, 2015; Spratt et al., 2016). Other health disparities studies include only nongenetic data but may assign the unexplained variance to untested genetic differences (e.g., Kistka et al., 2007).

Conclusion 5-5. It is invalid to assign unexplained trait variance to any type of effect without direct evidence; notably, racial or ethnic phenotypic differences cannot be ascribed to genetic differences without evidence. The unexplained variance could be caused by environmental factors that are not considered or were imprecisely or inaccurately measured, or by inadequacy of the statistical model used.
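The logic of Conclusion 5-5 can be seen in a deliberately artificial Python example. The data are constructed, not observed: by assumption the trait depends only on an environmental exposure (trait = 10 + 2 × exposure), there is no genetic term at all, and the two group labels differ only in their exposure distributions. The group labels, variable names, and numbers are all hypothetical. A comparison that ignores the exposure shows a gap between groups; once the exposure is accounted for, the gap disappears.

```python
from statistics import mean

# Constructed toy data: two labeled groups that differ only in an exposure.
records = [
    {"group": "A", "exposure": 1.0},
    {"group": "A", "exposure": 2.0},
    {"group": "A", "exposure": 3.0},
    {"group": "B", "exposure": 3.0},
    {"group": "B", "exposure": 4.0},
    {"group": "B", "exposure": 5.0},
]
for r in records:
    # By construction the trait has no genetic component.
    r["trait"] = 10.0 + 2.0 * r["exposure"]


def group_gap(rows, key):
    """Mean of `key` in group B minus its mean in group A."""
    a = mean(r[key] for r in rows if r["group"] == "A")
    b = mean(r[key] for r in rows if r["group"] == "B")
    return b - a


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Gap seen when the exposure is left out of the analysis.
raw_gap = group_gap(records, "trait")

# Gap in what remains after regressing the trait on the measured exposure.
intercept, slope = fit_line(
    [r["exposure"] for r in records], [r["trait"] for r in records]
)
for r in records:
    r["residual"] = r["trait"] - (intercept + slope * r["exposure"])
adjusted_gap = group_gap(records, "residual")

print(f"gap ignoring exposure: {raw_gap:.2f}")
print(f"gap after adjusting for exposure: {abs(adjusted_gap):.2f}")
```

The example shows only that a between-group difference can arise with no genetic contribution whatsoever. In real data the converse inference is equally unavailable: a gap that persists after adjusting for the exposures that happened to be measured may reflect unmeasured or poorly measured exposures, or a misspecified model, and so is not by itself evidence of a genetic cause.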

Given the variety of goals and sources of input data for different health disparity studies, it is helpful to consider some of these categories separately. Below is a short but incomplete list of three types of health disparities studies that include genetic and genomic data.

• Health Disparities Study Type 1: The sole goal is to study the role of one or multiple genetic variants on observed or possible health disparities between groups.

Best Practice 13: In this type of study, what is needed is to consider the effects of the focal variant of interest among individuals with similar allele frequencies, so genetic similarity is the relevant descriptor to use, and racial and ethnic labels should not be used. The use of genetic similarity to a reference panel is both more accurate and more transparent than using descent-associated descriptors such as race or ethnicity (Coop, 2022). In cases where ancestries are correlated with traits such as skin color (Parra et al., 2004), which may mediate the effect of racism on health (Kittles et al., 2007; Teteh et al., 2020), genetic ancestry may be considered if these traits are a key component of the research question.

• Health Disparities Study Type 2: The goal is to study the effect of environmental exposures or examine possible gene–environment interplay.

Best Practice 14: Researchers should avoid racial or ethnic labels because they are poor proxies for differences in environmental exposures. Instead, the committee recommends that they replace or supplement descent-associated population descriptors with information about the relevant factors that mediate differences in environmental exposures, such as education, types of employment, housing quality, and access to health care, to name only a few.

There is one exception to Best Practice 14.
When the goal is explicitly to study the effect of structural racism and discrimination, then racial and ethnic labels may be appropriate but need to be carefully described (e.g., self-identified or not) and justified. Instruments and variables that directly measure discrimination (e.g., Williams’s Everyday Discrimination Scale¹) or its mediators may be more appropriate, although they are challenging to implement (Hardeman et al., 2022).

• Health Disparities Study Type 3: Although the goal is to study the effect of environmental exposures or examine possible gene–environment interplay, information about environmental factors is limited.

Best Practice 15: If environmental information is unavailable and population descriptors such as race or ethnicity are used as proxies for it in such studies, for example in analyses of electronic health records, then their source should be described in detail (e.g., self-reported or assigned by provider) and along multiple dimensions (e.g., Hispanic, Mexican-American, rural, sampled in Texas health clinic, born in Texas). Moreover, the researcher should explicitly state why each descent-associated population descriptor is being used, by identifying specifically what types of effects their inclusion is intended to capture and the accuracy of this capture.

Study Type 7: Studies of Human Evolutionary History

Population genetics studies of human history and prehistory aim to use genetics to make inferences about the genetic evolution of humans and integrate such inferences with data from archeology, history, paleontology, and other disciplines (e.g., Nielsen et al., 2017). Many such studies analyze variation data using models and methods that employ the mathematical construct of discrete, unstructured populations (Gutenkunst et al., 2009; Patterson et al., 2012; Pritchard et al., 2000). It is common practice also to rely on samples collected by geographic or ethnic criteria (e.g., Scheinfeldt et al., 2019).

Some studies of human evolutionary history embed all samples within the same analytic structure, notably when inferring an ancestral recombination graph (e.g., Schaefer et al., 2021).
In that case, descent-associated population descriptors may not be necessary, such as when the goal is to estimate the time to the most recent common genetic ancestor of modern humans at a locus (Mallick et al., 2016). Nonetheless, population descriptors will often be needed to describe the sample collection scheme to other researchers and to capture characteristics of sampled individuals that help place them in a historical and geographic context. These descriptors might include the geographic provenance of the sample, or some indicator of the geographic or ethnic affiliation of an individual’s recent ancestors (e.g., via grandparental birthplace questionnaires, or by reference to ethnicity, such as “Houston residents who identify ethnically as Gujarati”). In many circumstances, a sample will be labeled by the ethnicity, current geographic location, or commonly spoken language of present-day people to which its genome bears the greatest genetic similarity (e.g., Yoruba, Andamanese, Basque).

¹ See https://scholar.harvard.edu/davidrwilliams/node/32397 (accessed January 20, 2023).

Conclusion 5-6. In genetics studies of human evolutionary history, social or geographic population descriptors are often used to describe genetic ancestry groups inferred based on genetic similarity (e.g., labels may be based on shared characteristics of participants such as language spoken, self-identified ethnicity, or location sampled) in order to shed light on population history.

In studying human evolution, researchers may also be interested in studying the genetic and phenotypic changes that occurred in response to localized selection pressures. To study such biological adaptations, which occur through systematic changes in allele frequencies over generations in groups of individuals, researchers will often delimit a set of people whose ancestry is thought to have been subject to similar selection pressures at some time point (e.g., to study the evolution of lactose tolerance in descendants of Nilotic-speaking pastoralists from East Africa in the past several thousand years). A challenge is that the appropriate scale will often be unknown a priori. For example, in studying human adaptation to Plasmodium vivax, continental groupings likely do not offer the necessary fine-scale resolution. Should one then try to enrich specifically for individuals whose recent genetic ancestors lived in environments where P. vivax was common, or focus on individuals currently living in such environments, despite the fact that their ancestors may not have been subject to the same pressures?

Best Practice 16: When gathering new data in genetics studies of human evolutionary history, after researchers engage local communities as described in Chapter 4, they should collect and include population descriptors along multiple dimensions, both to convey the myriad ways in which an individual could be described and to enable additional uses of these samples in the future. Notably, in addition to genetic data, researchers should also report each participant’s sampling location and, when known, birthplace, parental birthplaces, language(s) spoken, and self-described ethnicity. However, researchers should be consistent with population descriptors used for all samples in a study (for example, it is not good practice to use self-identified ethnic group for some samples but geographic origin for others).

The committee appreciates that many researchers will use existing data and therefore inherit population descriptors that may not be of their design. In that case, researchers should be transparent about the specific criteria according to which they included or excluded individuals. Moreover, when using legacy data, researchers should be mindful to apply consistent population descriptors across samples within the study. It may be appropriate to define new labels; when doing so, the new labels and their relationship to previous ones should be communicated.

While not a focus of this report, the committee notes that additional challenges can arise in assigning population descriptors in studies of ancient DNA, which often integrate genetic data with archeological or even historical data to make inferences about modern population origins. Individuals in such studies are often given population labels based on cultural practices inferred from material objects identified from archeological data (e.g., the Corded Ware and Yamnaya cultures) (Eisenmann et al., 2018). Assigning cultural population names to ancient individuals clustered together using genetic data can be problematic. As Eisenmann and colleagues note:

    Giving groups that have been identified through a completely different line of evidence—in this case material culture and genomics—the same or related names results in their conflation and the archaeological designations risk becoming reified in genetic terms (and vice versa) (Eisenmann et al., 2018).
Their recommendation is to either label genetically defined populations numerically (giving no cultural label) or use a mixed system where names are based on a combination of geographic and subsistence terms, and a relative time span, together with archeological culture when appropriate (Eisenmann et al., 2018). One example of such a label would be to describe individuals from present-day Spain and dating to the Early Neolithic period as “Spain_EN” (Eisenmann et al., 2018). Such practices are in general alignment with the principles outlined in this report, though the committee reiterates that consideration of the questions of interest when choosing whether and what labels to use is paramount.

Decision Tree for the Use of Population Descriptors

To aid a researcher contemplating a specific genetics or genomics study, the committee believes that a decision tree to systematically decide which descent-associated population descriptors to consider using and which to avoid is a helpful addition to the table. The decision tree can be found in Appendix D. The process begins by asking the following questions:

1. What is the purpose of your study?
2. Are you collecting new data, working with existing individual-level data, or using summary-level legacy data?
3. Does your research question pertain to environmental sources of differences?
4. If the answer to question 3 is yes, then do you plan to study environmental effects as a predictor or as a control variable?

CONSIDERATIONS FOR HARMONIZATION OF POPULATION DESCRIPTORS ACROSS STUDIES

In general, harmonization enhances comparability of data among different studies and enables the continued use of existing data to answer new research questions (Doiron et al., 2013; Khan et al., 2022; Wallace et al., 2020). Harmonization of population descriptors, specifically, would allow greater interoperability among data sets in human genomics research. Although the advantages of harmonization are clear, there are many challenges to the harmonization of population descriptors (see “Challenges of Harmonization and Legacy Data” in Chapter 1). Descriptors differ not only in scale or resolution but also in the concepts they represent. For example, is it possible to harmonize studies where one uses race or ethnicity, another uses geography, and yet another genetic similarity? Another consideration is how to harmonize descriptors while accounting for the unique preferences and needs that analytical groups within consortia may have (Lee et al., 2019). There is a fundamental tension between harmonization on one hand and flexibility or specificity on the other, and the solution is not straightforward.

To cope with these challenges, informatics tools have been developed to harmonize data and metadata. Some examples include common data elements (see Box 5-2), machine learning algorithms, visualization tools, and data processing standards.
Over time, regularly employing the best practices and recommendations in this chapter will promote harmonization across studies. As illustrated through the best practices above, different descriptors may be warranted based on study design. Just as individual investigators should be mindful of the purpose of their study, harmonization efforts similarly need to consider research objectives since the context shapes the appropriate use of population descriptors. The objective is less to offer a single definitive descriptor or set of labels than to offer systems and approaches for harmonization: clear ways to denote which population descriptors are used and why, and how to merge data sets that may have used different descriptor schemes.
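Denoting which descriptors are used and why, and merging data sets labeled under different schemes, can both be supported by a small machine-readable accessory file. The Python sketch below shows one possible layout, loosely following the common data elements listed in Box 5-2. The class names, field names, and example values (DescriptorScheme, IndividualLabel, "site_01", and so on) are hypothetical illustrations, not a format prescribed by the committee.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DescriptorScheme:
    """Metadata recorded once per population descriptor (hypothetical layout)."""
    name: str
    rationale: str
    # Each permitted label value and its recommended abbreviation.
    allowed_values: Dict[str, str]
    # Mapping from labels in an existing data set to the labels used here.
    legacy_mapping: Dict[str, str] = field(default_factory=dict)


@dataclass
class IndividualLabel:
    """Metadata recorded per individual: the label value and its provenance."""
    individual_id: str
    value: str
    provenance: str  # e.g. "self-report" or "ascribed externally by clinic staff"


def validate(scheme: DescriptorScheme, labels: List[IndividualLabel]) -> List[str]:
    """Return a list of problems; an empty list means every label fits the scheme."""
    problems = []
    for label in labels:
        if label.value not in scheme.allowed_values:
            problems.append(f"{label.individual_id}: value {label.value!r} is not in the scheme")
        if not label.provenance.strip():
            problems.append(f"{label.individual_id}: provenance is missing")
    return problems


def relabel(scheme: DescriptorScheme, legacy_value: str) -> Optional[str]:
    """Translate a legacy label through the documented mapping (None if unmapped)."""
    return scheme.legacy_mapping.get(legacy_value)


# Illustrative values only; they do not come from any real study.
scheme = DescriptorScheme(
    name="sampling_location",
    rationale="Geographic scheme based on where each sample was collected.",
    allowed_values={"Houston, Texas": "HOU", "Lima, Peru": "LIM"},
    legacy_mapping={"site_01": "Houston, Texas", "site_02": "Lima, Peru"},
)
labels = [
    IndividualLabel("P001", "Houston, Texas", "self-report"),
    IndividualLabel("P002", "Lima", ""),  # deliberately incomplete record
]
problems = validate(scheme, labels)
```

Storing the legacy mapping alongside the scheme keeps the relationship between old and new labels visible to anyone who later merges the data with another study.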

BOX 5-2
Common Data Elements for Researchers to Include as Metadata to Help Harmonize Across Studies

Systems for improving data sharing and harmonization are an important research need. When sharing data, researchers could explicitly share a set of accessory files that provide information to communicate their labeling schemes. An example of useful common data elements to include would be:

Per population descriptor:
• Overall rationale for the population descriptor (e.g., classification scheme) and associated group labels
• Set of possible population descriptor values (e.g., group labels) and recommended abbreviations of label values used in study
• Per individual in the study:
    • Label value
    • Provenance of label value: Self-report, ascribed externally and by whom, other
• When using existing data, if a new set of population descriptor values (e.g., group labels) is being used instead of those used in the original data set, provide a mapping of how old labels map onto new labels.

For example, for geographic population descriptors:
• Identification of specific geographic labeling scheme (e.g., based on sample location, birthplace)
• If relevant, set of geographic entities with associated shape files defining boundary of the entity or latitude/longitude specifying representative locations
• Per individual either:
    • Point based:
        █ Latitude
        █ Longitude
        █ Estimated mean square error in units of kilometers
        █ Provenance: Self-reported, ascribed externally, other
    • Geographic entity based:
        █ Entity value
        █ Provenance: Self-report, ascribed externally, other

Upholding the principle of transparency and adhering to Recommendations 6, 7, and 8 inherently support harmonization through the application of consistent definitions of population descriptors and transparent communication of methods.
More specifically, for novel data collection, data should be collected per individual along multiple nongenetic dimensions and population descriptor types that may facilitate other studies. In addition, clear instructions should be provided on how downstream users can respect consent and any collaborative agreements with study participants regarding population descriptors. For existing individual-level data, the available metadata can be used while following existing agreements and consent structures to form the population descriptors (see the decision tree in Appendix D).

Harmonized population descriptors that are well understood would be highly valuable. In the context of genetics studies, genetic similarity to specific reference sets could have advantages for promoting harmonization. While a broader sampling of human genetic diversity is needed, current candidates for specific reference sets include, for example, data from the 1000 Genomes Project, the Human Genome Diversity Project, and the Simons Genome Diversity Project (Bergström et al., 2020; Cann et al., 2002; 1000 Genomes Project Consortium et al., 2015; Mallick et al., 2016).

A specific challenge is harmonizing across studies so readers of a research manuscript can understand a label quickly and in a technically precise way. For example, a possible methods description that adheres to the guidelines outlined in this report would be: “To minimize heterogeneity in genetic ancestry across our sample, we filtered our sample to only include individuals with a pairwise genotypic dissimilarity less than 10⁻³ to the centroid of the Yoruba of Ibadan sample of the 1000 Genomes Project.” A possible critique is that this language may be perceived as bulky and difficult to apply throughout a study write-up. Researchers will be eager to find more concise language. In that regard, one possible approach is to favor using a sample abbreviation and the suffix -like. So, in a setting where conciseness is prioritized, instead of the above phrasing, one might say “1KG-YRI-like individuals” (see Box 5-3).

The overall approach of using an abbreviation and the -like suffix is compatible with other descriptors, such as geographic and ethnic descriptors. So, one might, for example, if scientifically justified, conduct a study on “self-described ethnically Italian individuals sampled in Houston, Texas, who are 1KG-TSI-like” to refer to self-described ethnically Italian individuals sampled in Houston, Texas, who were further filtered based on genetic similarity to the 1000 Genomes Project sample of Tuscans of Italy (TSI) individuals.

The approach also offers researchers flexibility regarding the choice of reference panels and the scale at which they are analyzed. For example, while the committee generally recommends against continental-scale conceptualizations of human genetic variation, in the “chromosome painting” (e.g., local ancestry calling) approaches that are common in human genetics, continental-scale conceptualizations are prominent in many analysis pipelines. In such settings, it is still possible, and favorable, to use concise descriptors (e.g., “we partition the genome into tracts that are 1KG-EUR-like and 1KG-AFR-like”) in place of using continental ancestry labels as is common practice (e.g., “we partitioned the genome into European “superpopulation”² (EUR) ancestry tracts and African “superpopulation” (AFR) ancestry tracts”). The changed language is concise and more information rich while avoiding the implication of clear continental boundaries in human genetic variation.

BOX 5-3
Concise Language for Genetic Similarity: The Abbreviation + -Like System

Because the language used to fully describe population descriptors in terms of genetic similarity may be cumbersome, it may be useful to adopt an approach that uses a sample abbreviation and the suffix -like. For example, one might use the abbreviation “1KG-YRI-like individuals” for “individuals with a pairwise genotypic similarity greater than 10⁻³ to the Yoruba of Ibadan sample of the 1000 Genomes Project.” The use of -like as a suffix is a form of abbreviation for a procedure of defining similarity. Although there is an element of vagueness, it is concise, and readers who need to understand the exact procedure used for ascribing this designation should be able to find the precise procedures and thresholds in a well-written methods section.

Abbreviations are often disfavored in science communication and writing, but in this setting, attempts to use more accessible wording such as European ancestry in place of precise language are so prone to misunderstanding and propagating misconceptions that use of such terms is often counterproductive. An abbreviation implicitly invites a reader to read deeper into the technical meaning of the abbreviation rather than proceed with preconceived notions. For example, if one does not immediately recognize “1KG-YRI” as indicating the 1000 Genomes Project Yoruba of Ibadan (YRI) reference panel, they need to read deeper in the methods and understand what is meant.
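As a rough illustration of how an "abbreviation + -like" label could be ascribed, the Python sketch below computes the centroid of a reference sample and labels study individuals whose dissimilarity to that centroid falls below a threshold. The report does not define the dissimilarity measure, so the one used here (mean squared difference in allele dosage, scaled to lie between 0 and 1) is an assumption, as are the function names, the toy genotypes, and the "TOY" abbreviation. A real analysis would substitute a documented measure and a named reference panel such as 1KG-YRI, and would report both in the methods section.

```python
from statistics import fmean
from typing import Dict, List, Sequence

# Allele dosages (0, 1, or 2) at a shared, ordered set of variants.
Genotype = Sequence[float]


def centroid(reference: List[Genotype]) -> List[float]:
    """Mean dosage at each variant across the reference sample."""
    return [fmean(column) for column in zip(*reference)]


def dissimilarity(genotype: Genotype, center: Sequence[float]) -> float:
    """Assumed measure: mean squared dosage difference, scaled to [0, 1]."""
    return fmean(((g - c) / 2.0) ** 2 for g, c in zip(genotype, center))


def label_like(
    samples: Dict[str, Genotype],
    reference: List[Genotype],
    abbreviation: str,
    threshold: float,
) -> Dict[str, str]:
    """Label each sample whose dissimilarity to the reference centroid is below threshold."""
    center = centroid(reference)
    return {
        sample_id: f"{abbreviation}-like"
        for sample_id, genotype in samples.items()
        if dissimilarity(genotype, center) < threshold
    }


# Toy data standing in for a real reference panel and study sample.
reference = [[0, 1, 2, 0], [0, 1, 2, 0], [0, 1, 2, 0]]
samples = {"S1": [0, 1, 2, 0], "S2": [2, 1, 0, 0]}

# The 1e-3 threshold mirrors the example wording in the text; its meaning
# depends entirely on the dissimilarity measure chosen.
labels = label_like(samples, reference, abbreviation="TOY", threshold=1e-3)
```

Because the label encodes only the reference abbreviation, the measure and threshold must travel with the data (for example, in the accessory metadata) for the label to be reproducible.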
For admixed individuals themselves, a harmonious approach using the language of genetic similarity would be to refer to the best approximating reference group; for example, “1KG-PEL-like” and “1KG-PUR-like” are two among many possible genetic similarity descriptors of Latino populations (PEL = Peruvian in Lima, Peru; PUR = Puerto Rican in Puerto Rico).

² For example, the 1000 Genomes Project uses a classification of five superpopulations: Africans (AFR), Admixed Americans (AMR), East Asians (EAS), Europeans (EUR), and South Asians (SAS).

While potentially difficult to read by novices, the use of abbreviations for precision and conciseness is in fact a key aspect of scientific language in many fields (e.g., chemistry and the abbreviations for the elements, though the committee notes the analogy is not exact as there are no fundamental elements with regards to genetic ancestry). Their use in many scientific fields is evidence that abbreviations are not an impediment to scientific communication and can foster a culture of concise reference to precisely defined entities. A potential caveat of this approach is that one study’s definition of XX-like may be different from another group’s because the procedures and thresholds used to define similarity vary. Standardization for such genetic similarity procedures may be feasible, and would be fruitful to develop, especially as a fuller representation of human genetic variation is sampled by ongoing studies. Nonetheless, the abbreviation plus -like approach would have less vagueness than the current widespread use of such terms as European genetic ancestry and African genetic ancestry, where both the reference populations and the methods to ascribe an affiliation to European or African sources are unclear and make implicit assumptions about the time frame of interest.

As investigators grapple with these complex challenges, harmonization efforts will continue to take many forms. Importantly, given the multiuse nature of modern data sets, any future harmonization efforts must be meaningful in how they aggregate populations or harmonize labels, while remaining flexible to uses and studies. Further alignment across the field to implement the recommendations and best practices described in this report can go a long way toward enhancing harmonization of the use of population descriptors (see Chapter 6).

REFERENCES

1000 Genomes Project Consortium. 2015. A global reference for human genetic variation. Nature 526:68-74.
Atkinson, E. G., A. X. Maihofer, M. Kanai, A. R. Martin, K. J. Karczewski, M. L. Santoro, J. C. Ulirsch, Y. Kamatani, Y. Okada, H. K. Finucane, K. C. Koenen, C. M. Nievergelt, M. J. Daly, and B. M. Neale. 2021. Tractor uses local ancestry to enable inclusion of admixed individuals in GWAS and to boost power. Nature Genetics 53(2):195-204.
Batai, K., S. Hooker, and R. A. Kittles. 2021. Leveraging genetic ancestry to study health disparities. American Journal of Physical Anthropology 175(2):363-375.
Benmarhnia, T., A. Hajat, and J. S. Kaufman. 2021. Inferential challenges when assessing racial/ethnic health disparities in environmental research. Environmental Health 20(1):7.
Bergström, A., S. A. McCarthy, R. Hui, M. A. Almarri, Q. Ayub, P. Danecek, Y. Chen, S. Felkel, P. Hallast, J. Kamm, H. Blanché, J.-F. Deleuze, H. Cann, S. Mallick, D. Reich, M. S. Sandhu, P. Skoglund, A. Scally, Y. Xue, R. Durbin, and C. Tyler-Smith. 2020. Insights into human genetic variation and population history from 929 diverse genomes. Science 367(6484):eaay5012.
Cann, H. M., C. de Toma, L. Cazes, M.-F. Legrand, V. Morel, L. Piouffre, J. Bodmer, W. F. Bodmer, B. Bonne-Tamir, A. Cambon-Thomsen, Z. Chen, J. Chu, C. Carcassi, L. Contu, R. Du, L. Excoffier, G. B. Ferrara, J. S. Friedlaender, H. Groot, D. Gurwitz, T. Jenkins, R. J. Herrera, X. Huang, J. Kidd, K. K. Kidd, A. Langaney, A. A. Lin, S. Q. Mehdi, P. Parham, A. Piazza, M. P. Pistillo, Y. Qian, Q. Shu, J. Xu, S. Zhu, J. L. Weber, H. T. Greely, M. W. Feldman, G. Thomas, J. Dausset, and L. L. Cavalli-Sforza. 2002. A human genome diversity cell line panel. Science 296(5566):261-262.

Chakraborty, R., S. A. Read, and S. J. Vincent. 2020. Understanding myopia: Pathogenesis and mechanisms. In Updates on myopia: A clinical perspective, edited by M. Ang and T. Y. Wong. Singapore: Springer Singapore. Pp. 65-94.
Charles, B. A., D. Shriner, and C. N. Rotimi. 2014. Accounting for linkage disequilibrium in association analysis of diverse populations. Genetic Epidemiology 38(3):265-273.
Choi, S. W., T. S.-H. Mak, and P. F. O’Reilly. 2020. Tutorial: A guide to performing polygenic risk score analyses. Nature Protocols 15(9):2759-2772.
Claw, K. G., M. Z. Anderson, R. L. Begay, K. S. Tsosie, K. Fox, N. A. Garrison, A. C. Bader, J. Bardill, D. A. Bolnick, J. Brooks, A. Cordova, R. S. Malhi, N. Nakatsuka, A. Neller, J. A. Raff, J. Singson, K. TallBear, T. Vargas, J. M. Yracheta, and Summer internship for INdigenous peoples in Genomics Consortium. 2018. A framework for enhancing ethical genomic research with indigenous communities. Nature Communications 9(1):2957.
Coop, G. 2022. Genetic similarity and genetic ancestry groups. arXiv (preprint).
Daly, B., and O. I. Olopade. 2015. A perfect storm: How tumor biology, genomics, and health care delivery patterns collide to create a racial survival disparity in breast cancer and proposed interventions for change. CA: A Cancer Journal for Clinicians 65(3):221-238.
Delorey, T. M., C. G. K. Ziegler, G. Heimberg, R. Normand, Y. Yang, Å. Segerstolpe, D. Abbondanza, S. J. Fleming, A. Subramanian, D. T. Montoro, K. A. Jagadeesh, K. K. Dey, P. Sen, M. Slyper, Y. H. Pita-Juárez, D. Phillips, J. Biermann, Z. Bloom-Ackermann, N. Barkas, A. Ganna, J. Gomez, J. C. Melms, I. Katsyv, E. Normandin, P. Naderi, Y. V. Popov, S. S. Raju, S. Niezen, L. T. Y. Tsai, K. J. Siddle, M. Sud, V. M. Tran, S. K. Vellarikkal, Y. Wang, L. Amir-Zilberstein, D. S. Atri, J. Beechem, O. R. Brook, J. Chen, P. Divakar, P. Dorceus, J. M. Engreitz, A. Essene, D. M. Fitzgerald, R. Fropf, S. Gazal, J. Gould, J. Grzyb, T. Harvey, J. Hecht, T. Hether, J. Jané-Valbuena, M. Leney-Greene, H. Ma, C. McCabe, D. E. McLoughlin, E. M. Miller, C. Muus, M. Niemi, R. Padera, L. Pan, D. Pant, C. Pe’Er, J. Pfiffner-Borges, C. J. Pinto, J. Plaisted, J. Reeves, M. Ross, M. Rudy, E. H. Rueckert, M. Siciliano, A. Sturm, E. Todres, A. Waghray, S. Warren, S. Zhang, D. R. Zollinger, L. Cosimi, R. M. Gupta, N. Hacohen, H. Hibshoosh, W. Hide, A. L. Price, J. Rajagopal, P. R. Tata, S. Riedel, G. Szabo, T. L. Tickle, P. T. Ellinor, D. Hung, P. C. Sabeti, R. Novak, R. Rogers, D. E. Ingber, Z. G. Jiang, D. Juric, M. Babadi, S. L. Farhi, B. Izar, J. R. Stone, I. S. Vlachos, I. H. Solomon, O. Ashenberg, C. B. M. Porter, B. Li, A. K. Shalek, A.-C. Villani, O. Rozenblatt-Rosen, and A. Regev. 2021. COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature 595(7865):107-113.
Doiron, D., P. Burton, Y. Marcon, A. Gaye, B. H. Wolffenbuttel, M. Perola, R. P. Stolk, L. Foco, C. Minelli, M. Waldenberger, R. Holle, K. Kvaløy, H. L. Hillege, A.-M. Tassé, V. Ferretti, and I. Fortier. 2013. Data harmonization and federated analysis of population-based studies: The BioSHaRE project. Emerging Themes in Epidemiology 10(1):12.
Dolitsky, S., A. Mitra, S. Khan, E. Ashkinadze, and M. V. Sauer. 2020. Beyond the “Jewish panel”: The importance of offering expanded carrier screening to the Ashkenazi Jewish population. F&S Reports 1(3):294-298.
Eisenmann, S., E. Bánffy, P. van Dommelen, K. P. Hofmann, J. Maran, I. Lazaridis, A. Mittnik, M. McCormick, J. Krause, D. Reich, and P. W. Stockhammer. 2018. Reconciling material cultures in archaeology with genetic data: The nomenclature of clusters emerging from archaeogenomic analysis. Scientific Reports 8:13003.
Falconer, D. S., and T. F. C. Mackay. 1996. Introduction to quantitative genetics. 4th ed. Essex, England: Addison Wesley Longman Limited.
GeM-HD Consortium (Genetic Modifiers of Huntington’s Disease Consortium). 2015. Identification of genetic factors that modify clinical onset of Huntington’s disease. Cell 162(3):516-526.

GUIDANCE FOR SELECTION AND USE 141 Giannakopoulou, O., K. Lin, X. Meng, M.-H. Su, P.-H. Kuo, R. E. Peterson, S. Awasthi, A. Moscati, J. R. I. Coleman, N. Bass, I. Y. Millwood, Y. Chen, Z. Chen, H.-C. Chen, M.-L. Lu, M.-C. Huang, C.-H. Chen, E. A. Stahl, R. J. F. Loos, N. Mullins, R. J. Ursano, R. C. Kessler, M. B. Stein, S. Sen, L. J. Scott, M. Burmeister, Y. Fang, J. Tyrrell, Y. Jiang, C. Tian, A. M. McIntosh, S. Ripke, E. C. Dunn, K. S. Kendler, R. G. Walters, C. M. Lewis, K. Kuchenbaecker, N. R. Wray, S. Ripke, M. Mattheisen, M. Trzaskowski, E. M. Byrne, A. Abdellaoui, M. J. Adams, E. Agerbo, T. M. Air, T. F. M. Andlauer, S.-A. Bacanu, M. Bækvad-Hansen, A. T. F. Beekman, T. B. Bigdeli, E. B. Binder, J. Bryois, H. N. Buttenschøn, J. Bybjerg-Grauholm, N. Cai, E. Castelao, J. H. Christensen, T.-K. Clarke, J. R. I. Cole- man, L. Colodro-Conde, H. Coon, B. Couvy-Duchesne, N. Craddock, G. E. Crawford, G. Davies, I. J. Deary, F. Degenhardt, E. M. Derks, N. Direk, C. V. Dolan, E. C. Dunn, T. C. Eley, V. Escott-Price, F. F. H. Kiadeh, H. K. Finucane, J. C. Foo, A. J. Forstner, J. Frank, H. A. Gaspar, M. Gill, F. S. Goes, S. D. Gordon, J. Grove, L. S. Hall, C. S. Hansen, T. F. Hansen, S. Herms, I. B. Hickie, P. Hoffmann, G. Homuth, C. Horn, J.-J. Hottenga, D. M. Howard, D. M. Hougaard, M. Ising, R. Jansen, I. Jones, L. A. Jones, E. Jorgenson, J. A. Knowles, I. S. Kohane, J. Kraft, W. W. Kretzschmar, Z. Kutalik, Y. Li, P. A. Lind, J. J. Luykx, D. J. Macintyre, D. F. Mackinnon, R. M. Maier, W. Maier, J. Marchini, H. Mbarek, P. McGrath, P. McGuffin, S. E. Medland, D. Mehta, C. M. Middeldorp, E. Mihailov, Y. Milaneschi, L. Milani, F. M. Mondimore, G. W. Montgomery, S. Mostafavi, N. Mullins, M. Nauck, B. Ng, M. G. Nivard, D. R. Nyholt, P. F. O’Reilly, H. Oskarsson, M. J. Owen, J. N. Painter, C. B. Pedersen, M. G. Pedersen, R. E. Peterson, E. Pettersson, W. J. Peyrot, G. Pistis, D. Posthuma, J. A. Quiroz, P. Qvist, J. P. Rice, B. P. Riley, M. Rivera, S. S. Mirza, R. 
Schoevers, E. C. Schulte, L. Shen, J. Shi, S. I. Shyn, E. Sigurdsson, G. C. B. Sinnamon, J. H. Smit, D. J. Smith, H. Stefansson, S. Steinberg, F. Streit, J. Strohmaier, K. E. Tansey, H. Teismann, A. Teumer, W. Thompson, P. A. Thompson, T. E. Thorgeirsson, M. Traylor, J. Treutlein, V. Trubetskoy, A. G. Uitterlinden, D. Umbricht, S. Van Der Auwera, A. M. Van Hemert, A. Viktorin, P. M. Visscher, Y. Wang, B. T. Webb, S. M. Weinsheimer, J. Wellmann, G. Willemsen, S. H. Witt, Y. Wu, H. S. Xi, J. Yang, F. Zhang, V. Arolt, B. T. Baune, K. Berger, D. I. Boomsma, S. Cichon, U. Dannlowski, E. De Geus, J. R. Depaulo, E. Domenici, K. Domschke, T. Esko, H. J. Grabe, S. P. Hamilton, C. Hayward, A. C. Heath, K. S. Kendler, S. Kloiber, G. Lewis, Q. S. Li, S. Lucae, P. A. Madden, P. K. Magnusson, N. G. Martin, A. M. McIntosh, A. Metspalu, O. Mors, P. B. Mortensen, B. Müller-Myhsok, M. Nordentoft, M. M. Nöthen, M. C. O’Donovan, S. A. Paciga, N. L. Pedersen, B. W. Penninx, R. H. Perlis, D. J. Porteous, J. B. Potash, M. Preisig, M. Rietschel, C. Schaefer, T. G. Schulze, J. W. Smoller, K. Stefansson, H. Tiemeier, R. Uher, H. Völzke, M. M. Weissman, T. Werge, C. M. Lewis, D. F. Levinson, G. Breen, A. D. Børglum, P. F. Sullivan, M. Agee, S. Aslibekyan, A. Auton, E. Babalola, R. K. Bell, J. Bielenberg, K. Bryc, E. Bullis, B. Cameron, D. Coker, G. Cuellar Partida, D. Dhamija, S. Das, S. L. Elson, T. Filshtein, K. Fletez-Brant, P. Fontanillas, W. Freyman, P. M. Gandhi, K. Heilbron, B. Hicks, D. A. Hinds, K. E. Huber, E. M. Jewett, Y. Jiang, A. Kleinman, K. Kukar, V. Lane, K.-H. Lin, M. Lowe, M. K. Luff, J. C. McCreight, M. H. McIntyre, K. F. McManus, S. J. Micheletti, M. E. Moreno, J. L. Mountain, S. V. Mozaffari, P. Nandakumar, E. S. Noblin, J. O’Connell, A. A. Petrakovitz, G. D. Poznik, M. Schumacher, A. J. Shastri, J. F. Shelton, J. Shi, S. Shringarpure, C. Tian, V. Tran, J. Y. Tung, X. Wang, W. Wang, C. H. Weldon, P. Wilton, D. Avery, D. Bennett, Z. Bian, R. Boxall, F. Bragg, K. H. 
Chan, L. Chang, Y. Chang, B. Chen, J. Chen, J. Chen, N. Chen, N. Chen, X. Chen, Y. Chen, Z. Chen, L. Cheng, J. Clarke, R. Clarke, R. Collins, C. Dong, H. Du, R. Du, Z. Fairhurst-Hunter, L. Fan, S. Feng, Z. Fu, W. Gan, R. Gao, Y. Gao, P. Ge, S. Gilbert, W. Gong, Q. Gu, Y. Guo, Z. Guo, Z. Guo, A. Hacker, X. Han, P. Hariri, P. He, T. He, M. Hill, M. Holmes, C. Hou, W. Hou, C. Hu, R. Hu, X. Hu, Y. Hu, H. Hua, Y. Hua, Y. Huang, P. K. Im, A. Iona, Q. Jiang, J. Jin, M. Kakkoura, Q. Kang, C. Kartsonaki, R. Kerosi, L. Kong, J. Lan, G. Lancaster, F. Li, H. Li, J. Li, L. Li, M. Li, S. Li, Y. Li, Y. Li, Z. Li, K. Lin, L. Lingli, C. Liu, D. Liu, D. Liu, F. Liu, H. Liu, J. Liu, J. Liu, Y. PREPUBLICATION COPY—Uncorrected Proofs

142 POPULATION DESCRIPTORS IN GENETICS AND GENOMICS RESEARCH Liu, Y. Liu, H. Long, Y. Lu, G. Luo, J. Lv, S. Lv, L. Ma, E. Mao, J. McDonnell, F. Meng, J. Meng, I. Millwood, Q. Nie, F. Ning, D. Pan, R. Pan, Z. Pang, P. Pei, R. Peto, A. Pozarickij, Y. Qian, Y. Qin, C. Qu, X. Ren, P. Ryder, S. Sansome, D. Schmidt, P. Sherliker, R. Sohoni, B. Stevens, J. Su, H. Sun, Q. Sun, X. Sun, A. Tang, Z. Tang, R. Tao, X. Tian, I. Turnbull, R. Walters, M. Wan, C. Wang, C. Wang, H. Wang, J. Wang, L. Wang, P. Wang, T. Wang, S. Wang, S. Wang, X. Wang, L. Wei, M. Weng, N. Wright, M. Wu, X. Wu, S. Wu, K. Xie, Q. Xu, Q. Xu, X. Xu, S. Yan, L. Yang, X. Yang, J. Yang, P. Yao, L. Yin, B. Yu, C. Yu, M. Yu, Y. Zhai, H. Zhang, H. Zhang, J. Zhang, L. Zhang, N. Zhang, X. Zhang, X. Zhang, X. Zhang, X. Zhong, D. Z. Zhou, G. Zhou, J. Zhou, L. Zhou, W. Zhou, X. Zhou, Y. Zhou, and M. Zou. 2021. The genetic architecture of depression in individuals of east Asian ancestry. JAMA Psychiatry 78(11):1258-1269. Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante. 2009. Infer- ring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLOS Genetics 5(10):e1000695. Hardeman, R. R., P. A. Homan, T. Chantarat, B. A. Davis, and T. H. Brown. 2022. Improving the measurement of structural racism to achieve antiracist health policy: Study exam- ines measurement of structural racism to achieve antiracist health policy. Health Affairs 41(2):179-186. Hirschhorn, J. N., and M. J. Daly. 2005. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6(2):95-108. Jimenez-Sanchez, M., F. Licitra, B. R. Underwood, and D. C. Rubinsztein. 2017. Huntington’s disease: Mechanisms of pathogenesis and therapeutic strategies. Cold Spring Harbor Perspectives in Medicine 7(7). Khan, A. T., S. M. Gogarten, C. P. McHugh, A. M. Stilp, T. Sofer, M. L. Bowers, Q. Wong, L. A. Cupples, B. Hidalgo, A. D. Johnson, M.-L. N. McDonald, S. T. 
McGarvey, M. R. G. Taylor, S. M. Fullerton, M. P. Conomos, and S. C. Nelson. 2022. Recommendations on the use and reporting of race, ethnicity, and ancestry in genetic research: Experiences from the NHLBI TOPMed program. Cell Genomics 2(8):100155.
Khera, A. V., M. Chaffin, K. G. Aragam, M. E. Haas, C. Roselli, S. H. Choi, P. Natarajan, E. S. Lander, S. A. Lubitz, P. T. Ellinor, and S. Kathiresan. 2018. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics 50(9):1219-1224.
Kistka, Z. A.-F., L. Palomar, K. A. Lee, S. E. Boslaugh, M. F. Wangler, F. S. Cole, M. R. Debaun, and L. J. Muglia. 2007. Racial disparity in the frequency of recurrence of preterm birth. American Journal of Obstetrics and Gynecology 196(2):131.e1-131.e6.
Kittles, R. A., E. R. Santos, N. S. Oji-Njideka, and C. Bonilla. 2007. Race, skin color and genetic ancestry: Implications for biomedical research on health disparities. Californian Journal of Health Promotion 5(Health Disparities & Social Justice):9-23.
Lee, S. S.-J., S. M. Fullerton, A. Saperstein, and J. K. Shim. 2019. Ethics of inclusion: Cultivate trust in precision medicine. Science 364(6444):941-942.
Lesko, C. R., L. P. Jacobson, K. N. Althoff, A. G. Abraham, S. J. Gange, R. D. Moore, S. Modur, and B. Lau. 2018. Collaborative, pooled and harmonized study designs for epidemiologic research: Challenges and opportunities. International Journal of Epidemiology 47(2):654-668.
Lewis, A. C. F., S. J. Molina, P. S. Appelbaum, B. Dauda, A. Di Rienzo, A. Fuentes, S. M. Fullerton, N. A. Garrison, N. Ghosh, E. M. Hammonds, D. S. Jones, E. E. Kenny, P. Kraft, S. S. Lee, M. Mauro, J. Novembre, A. Panofsky, M. Sohail, B. M. Neale, and D. S. Allen. 2022. Getting genetic ancestry right for science and society. Science 376(6590):250-252.
Li, J., and Q. Zhang. 2017. Insight into the molecular genetics of myopia. Molecular Vision 23:1048.

GUIDANCE FOR SELECTION AND USE 143 Mallick, S., H. Li, M. Lipson, I. Mathieson, M. Gymrek, F. Racimo, M. Zhao, N. Chennagiri, S. Nordenfelt, A. Tandon, P. Skoglund, I. Lazaridis, S. Sankararaman, Q. Fu, N. Rohland, G. Renaud, Y. Erlich, T. Willems, C. Gallo, J. P. Spence, Y. S. Song, G. Poletti, F. Balloux, G. van Driem, P. de Knijff, I. G. Romero, A. R. Jha, D. M. Behar, C. M. Bravi, C. Capelli, T. Hervig, A. Moreno-Estrada, O. L. Posukh, E. Balanovska, O. Balanovsky, S. Karachanak- Yankova, H. Sahakyan, D. Toncheva, L. Yepiskoposyan, C. Tyler-Smith, Y. Xue, M. S. Abdullah, A. Ruiz-Linares, C. M. Beall, A. Di Rienzo, C. Jeong, E. B. Starikovskaya, E. Metspalu, J. Parik, R. Villems, B. M. Henn, U. Hodoglugil, R. Mahley, A. Sajantila, G. Stamatoyannopoulos, J. T. Wee, R. Khusainova, E. Khusnutdinova, S. Litvinov, G. Ayodo, D. Comas, M. F. Hammer, T. Kivisild, W. Klitz, C. A. Winkler, D. Labuda, M. Bamshad, L. B. Jorde, S. A. Tishkoff, W. S. Watkins, M. Metspalu, S. Dryomov, R. Suke- rnik, L. Singh, K. Thangaraj, S. Pääbo, J. Kelso, N. Patterson, and D. Reich. 2016. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538(7624):201-206.Martin, A. R., M. Kanai, Y. Kamatani, Y. Okada, B. M. Neale, and M. J. Daly. 2019. Clinical use of current polygenic risk scores may exacerbate health disparities: A systematic literature review. Pharmacogenomics 18(16):1541-1550. Nature genetics 51(4):584-591. Martinez, R. A. M., N. Andrabi, A. N. Goodwin, R. E. Wilbur, N. R. Smith, and P. N. Zivich. 2022. Conceptualization, operationalization, and utilization of race and ethnicity in ma- jor epidemiology journals 1995–2018 American Journal of Epidemiology. Mathieson, I., and A. Scally. 2020. What is ancestry? PLOS Genetics (3):e1008624 Mavaddat, N., K. Michailidou, J. Dennis, M. Lush, L. Fachal, A. Lee, J. P. Tyrer, T.-H. Chen, Q. Wang, M. K. Bolla, X. Yang, M. A. Adank, T. Ahearn, K. Aittomäki, J. Allen, I. L. Andrulis, H. Anton-Culver, N. N. 
Antonenkova, V. Arndt, K. J. Aronson, P. L. Auer, P. Auvinen, M. Barrdahl, L. E. Beane Freeman, M. W. Beckmann, S. Behrens, J. Benitez, M. Bermisheva, L. Bernstein, C. Blomqvist, N. V. Bogdanova, S. E. Bojesen, B. Bonanni, A.-L. Børresen-Dale, H. Brauch, M. Bremer, H. Brenner, A. Brentnall, I. W. Brock, A. Brooks- Wilson, S. Y. Brucker, T. Brüning, B. Burwinkel, D. Campa, B. D. Carter, J. E. Castelao, S. J. Chanock, R. Chlebowski, H. Christiansen, C. L. Clarke, J. M. Collée, E. Cordina- Duverger, S. Cornelissen, F. J. Couch, A. Cox, S. S. Cross, K. Czene, M. B. Daly, P. Devilee, T. Dörk, I. dos-Santos-Silva, M. Dumont, L. Durcan, M. Dwek, D. M. Eccles, A. B. Ekici, A. H. Eliassen, C. Ellberg, C. Engel, M. Eriksson, D. G. Evans, P. A. Fasching, J. Figueroa, O. Fletcher, H. Flyger, A. Försti, L. Fritschi, M. Gabrielson, M. Gago-Dominguez, S. M. Gapstur, J. A. García-Sáenz, M. M. Gaudet, V. Georgoulias, G. G. Giles, I. R. Gilyazova, G. Glendon, M. S. Goldberg, D. E. Goldgar, A. González-Neira, G. I. Grenaker Alnæs, M. Grip, J. Gronwald, A. Grundy, P. Guénel, L. Haeberle, E. Hahnen, C. A. Haiman, N. Håkansson, U. Hamann, S. E. Hankinson, E. F. Harkness, S. N. Hart, W. He, A. Hein, J. Heyworth, P. Hillemanns, A. Hollestelle, M. J. Hooning, R. N. Hoover, J. L. Hopper, A. Howell, G. Huang, K. Humphreys, D. J. Hunter, M. Jakimovska, A. Jakubowska, W. Janni, E. M. John, N. Johnson, M. E. Jones, A. Jukkola-Vuorinen, A. Jung, R. Kaaks, K. Kaczmarek, V. Kataja, R. Keeman, M. J. Kerin, E. Khusnutdinova, J. I. Kiiski, J. A. Knight, Y.-D. Ko, V.-M. Kosma, S. Koutros, V. N. Kristensen, U. Krüger, T. Kühl, D. Lambrechts, L. Le Marchand, E. Lee, F. Lejbkowicz, J. Lilyquist, A. Lindblom, S. Lindström, J. Lis- sowska, W.-Y. Lo, S. Loibl, J. Long, J. Lubiński, M. P. Lux, R. J. MacInnis, T. Maishman, E. Makalic, I. Maleva Kostovska, A. Mannermaa, S. Manoukian, S. Margolin, J. W. M. Martens, M. E. Martinez, D. Mavroudis, C. McLean, A. Meindl, U. Menon, P. Middha, N. Miller, F. Moreno, A. 
M. Mulligan, C. Mulot, V. M. Muñoz-Garzon, S. L. Neuhausen, H. Nevanlinna, P. Neven, W. G. Newman, S. F. Nielsen, B. G. Nordestgaard, A. Norman, K. Offit, J. E. Olson, H. Olsson, N. Orr, V. S. Pankratz, T.-W. Park-Simon, J. I. A. Perez, C. Pérez-Barrios, P. Peterlongo, J. Peto, M. Pinchev, D. Plaseska-Karanfilska, E. C. Polley, R. Prentice, N. Presneau, D. Prokofyeva, K. Purrington, K. Pylkäs, B. Rack, P. Radice, R. PREPUBLICATION COPY—Uncorrected Proofs

144 POPULATION DESCRIPTORS IN GENETICS AND GENOMICS RESEARCH Rau-Murthy, G. Rennert, H. S. Rennert, V. Rhenius, M. Robson, A. Romero, K. J. Ruddy, M. Ruebner, E. Saloustros, D. P. Sandler, E. J. Sawyer, D. F. Schmidt, R. K. Schmutzler, A. Schneeweiss, M. J. Schoemaker, F. Schumacher, P. Schürmann, L. Schwentner, C. Scott, R. J. Scott, C. Seynaeve, M. Shah, M. E. Sherman, M. J. Shrubsole, X.-O. Shu, S. Slager, A. Smeets, C. Sohn, P. Soucy, M. C. Southey, J. J. Spinelli, C. Stegmaier, J. Stone, A. J. Swerdlow, R. M. Tamimi, W. J. Tapper, J. A. Taylor, M. B. Terry, K. Thöne, R. A. E. M. Tollenaar, I. Tomlinson, T. Truong, M. Tzardi, H.-U. Ulmer, M. Untch, C. M. Vachon, E. M. van Veen, J. Vijai, C. R. Weinberg, C. Wendt, A. S. Whittemore, H. Wildiers, W. Willett, R. Winqvist, A. Wolk, X. R. Yang, D. Yannoukakos, Y. Zhang, W. Zheng, A. Ziogas, A. M. Dunning, D. J. Thompson, G. Chenevix-Trench, J. Chang-Claude, M. K. Schmidt, P. Hall, R. L. Milne, P. D. P. Pharoah, A. C. Antoniou, N. Chatterjee, P. Kraft, M. García-Closas, J. Simard, and D. F. Easton. 2019. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics 104(1):21-34. Mills, M. C., and C. Rahal. 2020. The GWAS diversity monitor tracks diversity by disease in real time. Nature Genetics 52(3):242-243. Mostafavi, H., A. Harpak, I. Agarwal, D. Conley, J. K. Pritchard, and M. Przeworski. 2020. Variable prediction accuracy of polygenic scores within an ancestry group. eLife 9:e48376. National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and repli- cability in science. Washington, DC: The National Academies Press. Nazareth, S. B., G. A. Lazarin, and J. D. Goldberg. 2015. Changing trends in carrier screening for genetic disease in the united states. Prenatal Diagnosis 35(10):931-935. Nielsen, R., J. M. Akey, M. Jakobsson, J. K. Pritchard, S. Tishkoff, and E. Willerslev. 2017. Tracing the peopling of the world through genomics. 
Nature 541:302-310. O’Neal, W. K., and M. R. Knowles. 2018. Cystic fibrosis disease modifiers: Complex genetics defines the phenotypic diversity in a monogenic disease. Annual Review of Genomics and Human Genetics 19:201-222. Okbay, A., Y. Wu, N. Wang, H. Jayashankar, M. Bennett, S. M. Nehzati, J. Sidorenko, H. Kweon, G. Goldman, T. Gjorgjieva, Y. Jiang, B. Hicks, C. Tian, D. A. Hinds, R. Ahlskog, P. K. E. Magnusson, S. Oskarsson, C. Hayward, A. Campbell, D. J. Porteous, J. Freese, P. Herd, andMe Research Team, C. Social Science Genetic Association, C. Watson, J. Jala, D. Conley, P. D. Koellinger, M. Johannesson, D. Laibson, M. N. Meyer, J. J. Lee, A. Kong, L. Yengo, D. Cesarini, P. Turley, P. M. Visscher, J. P. Beauchamp, D. J. Benjamin, and A. I. Young. 2022. Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals. Nature Genet- ics 54(4):437-449. Oni-Orisan, A., Y. Mavura, Y. Banda, T. A. Thornton, and R. Sebro. 2021. Embracing genetic diversity to improve black health. New England Journal of Medicine 384(12):1163-1167. Parra, E. J., R. A. Kittles, and M. D. Shriver. 2004. Implications of correlations between skin color and genetic ancestry for biomedical research. Nature Genetics 36(S11):S54-S60. Patterson, N., P. Moorjani, Y. Luo, S. Mallick, N. Rohland, Y. Zhan, T. Genschoreck, T. Webster, and D. Reich. 2012. Ancient admixture in human history. Genetics 192(3):1065-1093. Pavličev, M., and G. P. Wagner. 2022. The value of broad taxonomic comparisons in evolution- ary medicine: Disease is not a trait but a state of a trait! MedComm 3(4), e174. Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155(2):945-959. Privé, F., H. Aschard, S. Carmi, L. Folkersen, C. Hoggart, P. F. O’Reilly, and B. J. Vilhjálmsson. 2022. 
Portability of 245 polygenic scores when derived from the UK Biobank and ap- plied to 9 ancestry groups from the same cohort. American Journal of Human Genetics 109(1):12-23. Randolph, H. E., J. K. Fiege, B. K. Thielen, C. K. Mickelson, M. Shiratori, J. Barroso-Batista, R. A. Langlois, and L. Barreiro. 2021. Genetic ancestry effects on the response to viral infection are pervasive but cell type specific. Science 374(6571):1127-1133. PREPUBLICATION COPY—Uncorrected Proofs

GUIDANCE FOR SELECTION AND USE 145 Sadarangani, M., A. Marchant, and T. R. Kollmann. 2021. Immunological mechanisms of vaccine-induced protection against covid-19 in humans. Nature Reviews Immunology 21(8):475-484. Schaefer, N. K., B. Shapiro, and R. E. Green. 2021. An ancestral recombination graph of hu- man, Neanderthal, and Denisovan genomes. Science Advances 7(29):eabc0776. Scheinfeldt, L. B., S. Soi, C. Lambert, W.-Y. Ko, A. Coulibaly, A. Ranciaro, S. Thompson, J. Hirbo, W. Beggs, M. Ibrahim, T. Nyambo, S. Omar, D. Woldemeskel, G. Belay, A. Froment, J. Kim, and S. A. Tishkoff. 2019. Genomic evidence for shared common ancestry of east African hunting-gathering populations and insights into local adaptation. Proceedings of the National Academy of Sciences 116(10):4166-4175. Scutari, M., I. Mackay, and D. Balding. 2016. Using genetic distance to infer the accuracy of genomic prediction. PLOS Genetics 12(9):e1006288. Shriner, D. 2013. Overview of admixture mapping. Current Protocols in Human Genetics Chapter 1:Unit 1.23. Simons, C., N. I. Wolf, N. McNeil, L. Caldovic, J. M. Devaney, A. Takanohashi, J. Crawford, K. Ru, S. M. Grimmond, D. Miller, D. Tonduti, J. L. Schmidt, R. S. Chudnow, R. van Coster, L. Lagae, J. Kisler, J. Sperner, M. S. van der Knaap, R. Schiffmann, R. J. Taft, and A. Vanderver. 2013. A de novo mutation in the beta-tubulin gene TUBB4A results in the leukoencephalopathy hypomyelination with atrophy of the basal ganglia and cerebellum. American Journal of Human Genetics 92(5):767-773. Sirugo, G., S. M. Williams, and S. A. Tishkoff. 2019. The missing diversity in human genetic studies. Cell 177(1):26-31. Spratt, D. E., T. Chan, L. Waldron, C. Speers, F. Y. Feng, O. O. Ogunwobi, and J. R. Osborne. 2016. Racial/ethnic disparities in genomic sequencing. JAMA Oncology 2(8):1070. Tan, T. Y., O. J. Dillon, Z. Stark, D. Schofield, K. Alam, R. Shrestha, B. Chong, D. Phelan, G. R. Brett, E. Creed, A. Jarmolowicz, P. Yap, M. Walsh, L. Downie, D. J. Amor, R. 
Savarirayan, G. McGillivray, A. Yeung, H. Peters, S. J. Robertson, A. J. Robinson, I. Macciocca, S. Sadedin, K. Bell, A. Oshlack, P. Georgeson, N. Thorne, C. Gaff, and S. M. White. 2017. Diagnostic impact and cost-effectiveness of whole-exome sequencing for ambulant chil- dren with suspected monogenic conditions. JAMA Pediatrics 171(9):855-862. Teteh, D. K., L. Dawkins-Moultin, S. Hooker, W. Hernandez, C. Bonilla, D. Galloway, V. La- Groon, E. R. Santos, M. Shriver, C. D. M. Royal, and R. A. Kittles. 2020. Genetic ancestry, skin color and social attainment: The four cities study. PLoS ONE 15(8):e0237041. Thornton, T. A., and J. L. Bermejo. 2014. Local and global ancestry inference and applica- tions to genetic association analysis for admixed populations. Genetic Epidemiology 38(S1):S5-S12. Torkamani, A., N. E. Wineinger, and E. J. Topol. 2018. The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics 19(9):581-590. Torres, J. B., and R. A. Kittles. 2007. The relationship between “race” and genetics and bio- medical research. Current Hypertension Reports 9(3):196-201. Turner, T. N., B. P. Coe, D. E. Dickel, K. Hoekzema, B. J. Nelson, M. C. Zody, Z. N. Kronen- berg, F. Hormozdiari, A. Raja, L. A. Pennacchio, R. B. Darnell, and E. E. Eichler. 2017. Genomic patterns of de novo mutation in simplex autism. Cell 171(3):710-722.e712. Van Alten, S., B. W. Domingue, T. Galama, and A. T. Marees. 2022. Reweighting the UK Biobank to reflect its underlying sampling population substantially reduces pervasive selection bias due to volunteering. medRxiv (preprint). Visscher, P. M., N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy, M. A. Brown, and J. Yang. 2017. 10 years of gwas discovery: Biology, function, and translation. The American Journal of Human Genetics 101(1):5-22. Wallace, S. E., E. Kirby, and B. M. Knoppers. 2020. How can we not waste legacy genomic research data? Frontiers in Genetics 11:446. PREPUBLICATION COPY—Uncorrected Proofs

146 POPULATION DESCRIPTORS IN GENETICS AND GENOMICS RESEARCH Wang, Y., J. Guo, G. Ni, J. Yang, P. M. Visscher, and L. Yengo. 2020. Theoretical and empiri- cal quantification of the accuracy of polygenic scores in ancestry divergent populations. Nature Communications 11(1). Wang, Y., K. Tsuo, M. Kanai, B. M. Neale, and A. R. Martin. 2022. Challenges and opportuni- ties for developing more generalizable polygenic risk scores. Annual Review of Biomedi- cal Data Science 5:293-320. Wexler, N. S. 2004. Venezuelan kindreds reveal that genetic and environmental factors modu- late Huntington’s disease age of onset. Proceedings of the National Academy of Sciences 101(10):3498-3503. PREPUBLICATION COPY—Uncorrected Proofs

Genetic and genomic information has become far more accessible, and research using human genetic data has grown exponentially over the past decade. Genetics and genomics research is now being conducted by a wide range of investigators across disciplines, who often use population descriptors inconsistently and/or inappropriately to capture the complex patterns of continuous human genetic variation.

In response to a request from the National Institutes of Health, the National Academies assembled an interdisciplinary committee of expert volunteers to conduct a study to review and assess existing methodologies, benefits, and challenges in using race, ethnicity, ancestry, and other population descriptors in genomics research. The resulting report focuses on understanding the current use of population descriptors in genomics research, examining best practices for researchers, and identifying processes for adopting best practices within the biomedical and scientific communities.
