This chapter lays out the evidentiary basis that the committee used to assess the recommended measures that are presented in the subsequent chapters of this report. It presents a synthesis of methods and criteria that are commonly used to test and evaluate survey questions, emphasizing evidence suggesting that a measure works well within both sexual/gender minority and majority populations. The panel is not aware of unique criteria that specifically address testing and evaluation of demographic measures collected in clinical settings, but we found little reason to use different evaluation criteria to assess these measures across these settings. Nor is the committee aware of unique criteria for evaluating measures collected for administrative records. Official federal forms (e.g., passport applications) are reviewed and approved by the Office of Management and Budget’s Office of Information and Regulatory Affairs using criteria that include many of the factors used to evaluate survey measures, such as respondent burden. After discussing commonly used evaluation criteria, the chapter concludes with a set of criteria the committee used to assess and ultimately select the recommended measures in the rest of the report.
When developing standardized questions, the primary goal is to establish their construct validity—alignment between what the item(s) is (are) measuring and the underlying concept being measured. The purpose of establishing validity is to reduce measurement error and craft questions that meet three commonly applied standards:
- The content standard evaluates whether the question is asking about the right construct.
- The cognitive standard evaluates whether respondents understand the questions and are willing and able to answer them.
- The usability standard evaluates whether respondents (and potentially, interviewers) can complete the question easily and as it was intended.
There are a number of metrics that are commonly used to evaluate measure quality, including respondent comprehension; cognitive load (the amount of working memory required to respond) and retrieval (the ability to recall the requested information from memory); interviewer administration issues; response time; response bias; item response distributions; and item nonresponse, refusal, and do not know rates. Some measures may speak to more than one of the three standards. For example, measures of respondent comprehension can help establish content validity (content standard) and whether respondents understand what is being asked of them (cognitive standard). Similarly, item nonresponse rates can provide information regarding whether questions meet cognitive and usability standards.
Methods of Assessing Measure Validity
Pretesting is perhaps the most common method for evaluating question quality, and there are a variety of methods that are used to do so, including:
- expert review, in which content experts provide evaluations of question quality (see, e.g., Olson, 2010);
- focus groups, in which a group of participants are recruited as a panel to develop, evaluate, or provide feedback on a specific topic, most commonly to develop new measures (see, e.g., Krueger, 1994);
- cognitive interviews, in which a small set of respondents is interviewed to discuss in detail their thought processes as they interpreted and responded to potential items (see, e.g., Willis, 2005; Desimone and Le Floch, 2004);
- respondent debriefings, in which respondents are provided with additional information about the data collection process and asked to provide feedback on specific questions after they have completed the instrument (see, e.g., Campanelli, Martin, and Rothgeb, 1991);
- field pretests and behavior coding, which monitor small numbers of interviews with respondents to evaluate question performance (see, e.g., Ongena and Dijkstra, 2006; Presser et al., 2004); and
- randomized split-panel experiments, in which the effects of question wording or administration conditions (e.g., mode) are tested by randomly assigning respondents into panels that are administered different questions or conditions (Presser et al., 2004).
These evaluation methods are generally introduced at different stages of the question design process: focus groups most commonly occur early in question development, while field tests, behavior coding, and split-panel experiments occur during administration. Key indicators used to assess question quality differ across the various evaluation methods. For example, focus groups and cognitive interviews produce indicators of comprehension, readability, cognitive load and retrieval, and sensitivity. Behavior coding and field pretests yield indicators of interviewer administration problems, as well as respondent comprehension. Split-panel experiments produce indicators such as differential item nonresponse rates, refusal and don’t know rates, response bias, and response distributions, which are used to evaluate different question wordings.
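As a rough illustration of how split-panel results might be compared, the sketch below computes a chi-square statistic of homogeneity over the response distributions of two hypothetical panels administered different question wordings. All counts, category labels, and the function name are invented for demonstration; in practice a statistical package and survey-weighted tests would be used.

```python
# Hypothetical split-panel comparison: a chi-square test of homogeneity
# over two panels' response counts for the same item under different
# wordings. All numbers below are invented for illustration.

def chi_square_homogeneity(panel_a, panel_b):
    """Return the chi-square statistic and degrees of freedom for two
    panels' response counts over the same set of categories."""
    n_a, n_b = sum(panel_a.values()), sum(panel_b.values())
    total = n_a + n_b
    stat = 0.0
    for category in panel_a:
        combined = panel_a[category] + panel_b[category]
        for observed, n in ((panel_a[category], n_a), (panel_b[category], n_b)):
            expected = combined * n / total  # expected count under homogeneity
            stat += (observed - expected) ** 2 / expected
    return stat, len(panel_a) - 1

# Invented counts: wording A vs. wording B, 1,000 respondents each.
wording_a = {"straight": 920, "gay_or_lesbian": 25, "bisexual": 30, "dont_know": 25}
wording_b = {"straight": 900, "gay_or_lesbian": 28, "bisexual": 32, "dont_know": 40}

stat, df = chi_square_homogeneity(wording_a, wording_b)
print(f"chi-square = {stat:.2f} on {df} df")
```

A large statistic relative to the chi-square distribution with the given degrees of freedom would suggest that the two wordings elicit different response distributions—for example, a shift in don’t know rates.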
Response option ordering is also a consideration. Because sex, gender identity, and sexual orientation are nominal variables, there is no inherent ordering to their response categories. However, the order of response categories is important because it affects how respondents answer. Studies have documented two types of order effects: primacy and recency effects. Primacy effects occur when respondents tend to “satisfice” and select options early in a list (Krosnick, 1999). Alternatively, when respondents hear categories read aloud in interviewer-administered surveys, the opposite can happen, with respondents selecting options toward the end of the list (Krosnick and Alwin, 1987). In automated data collection, programming can be used to randomize the order of categories across interviews, which reduces these effects. In data collection that cannot be automated, categories are sometimes ordered alphabetically or by the predicted size of each category (from largest to smallest). Free-text categories—for example, “Other-specify: __________”—are typically presented at the end of a list.
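The randomization described above can be sketched as follows. This is a minimal illustration, not an instrument specification: the option labels and function name are invented, and the key point is only that substantive categories are shuffled per interview while any free-text option stays anchored at the end of the list.

```python
import random

# Illustrative sketch: shuffle substantive response categories per
# interview while keeping the free-text option anchored last.
# Option wordings here are hypothetical.

def randomized_order(categories, anchored_last=("Something else, please specify:",), seed=None):
    """Return categories in a randomized order, with any category named
    in anchored_last kept at the end of the list."""
    rng = random.Random(seed)  # per-interview seed for reproducibility
    shuffled = [c for c in categories if c not in anchored_last]
    rng.shuffle(shuffled)
    return shuffled + [c for c in categories if c in anchored_last]

options = ["Straight", "Gay or lesbian", "Bisexual", "Something else, please specify:"]
ordered = randomized_order(options, seed=1)
print(ordered)
```

In a fielded instrument, the seed would typically be tied to the interview identifier so that each respondent sees an independent ordering that can be reconstructed for analysis of order effects.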
To the extent possible, the panel considered these types of key indicators in developing our recommended measures. When discussing alternative question designs to recommend, the panel also considered evaluation approaches that reflect our guiding principles (specifically, precision, inclusiveness, autonomy, parsimony, and privacy [see Chapter 2]). For example, a measure needs to be representative, allowing respondents to see themselves and their identities, while balancing the need to collect accurate information without undue burden; needs to clearly specify which component(s)
of sex, gender, and sexual orientation are being measured; and must allow respondents to self-identify without requiring external authorization or attestation of identity. Finally, given the statement of task’s focus on data collection among the English-speaking adult population, the panel was keenly interested in quality indicators coming from current measures used in large-scale nationally representative data collections.
The panel recognizes the complex and nuanced identities that characterize some small minority groups, as well as how misreporting of these identities can increase overall measurement error when they are not well known or are misunderstood by the broader population. It is also important for researchers to use community-appropriate terminology and ensure that data collection is culturally grounded. As such, several dimensions of community responsiveness need to be balanced in developing questions that allow the identification of these populations: how the questions are understood by respondents, the use of appropriate language that reflects how dynamic terminology and niche jargon are understood by both minority and majority populations, and adherence to principles of doing no harm (McDonnell, Goldman, and Koumjian, 2020; Kelley et al., 2019; Moore, 2018; Harper and Schneider, 2003).
Adjustments to existing well-tested measures that appear on prominent national surveys, such as the National Health Interview Survey sexual orientation identity item, are often proposed by minority communities as a way of giving the community voice, better representation, or legitimation within the data collection process. One type of adjustment that is commonly requested (and sometimes implemented) is to increase the number of response options by adding additional unique identity terms to the question response options. While it is important for respondents to be able to find a suitable category for themselves in the response options, as the number of categories increases, so does statistical noise due to misidentification. When categories representing small minority populations are introduced, these categories may not be well understood outside of these minority communities, and respondents from the majority population may misinterpret the response option and select it, leading to (potential) overreporting within the new category and (potential) underreporting within the original category.
Misreporting as a small minority by the majority population—sometimes referred to as “false positive” reports—affects more than the data from those who misreported their identity. The effects of this misreporting can actually be more consequential for members of the smaller group. Even if only a small fraction of the majority population are “false positives,” their responses can lead to a biased understanding of the size, characteristics, and outcomes of people in the smaller group. This occurred in the 2010 census, when a tiny fraction of straight couples mismarked the sex question, which resulted in upwards of one-quarter of same-sex couples being misclassified (DeMaio, Bates, and O’Connell, 2013). These effects can be compounded if the item also leads to a significant number of “false negative” reports, which occur when members of the small minority group do not select that response category. This can occur when a person fits the definition of a category or experience but does not recognize the terminology provided, finds the response options offensive, or is otherwise uncomfortable reporting an identity that is marginalized or stigmatized. Although it is almost impossible to entirely eliminate false positives and false negatives, careful pretesting of items through cognitive interviews and experimental studies that compare results from different wordings helps to minimize these misclassifications and improve data validity.
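The arithmetic behind this asymmetry can be made concrete. The rates below are hypothetical and chosen only for illustration; they are not estimates from the census example cited above.

```python
# Hypothetical illustration of false-positive contamination: suppose 2%
# of a population truly belongs to a small minority category, and 0.3%
# of majority respondents mistakenly select that category.

population = 1_000_000
true_minority_share = 0.02
majority_false_positive_rate = 0.003

true_minority = population * true_minority_share
false_positives = population * (1 - true_minority_share) * majority_false_positive_rate

observed_category = true_minority + false_positives
contamination = false_positives / observed_category

print(f"True members of the category: {true_minority:,.0f}")
print(f"False positives from the majority: {false_positives:,.0f}")
print(f"Share of the observed category that is misclassified: {contamination:.1%}")
```

Under these invented rates, misclassification by just 0.3 percent of the majority accounts for roughly an eighth of everyone recorded in the minority category, which is why even very small majority error rates can badly distort estimates for small groups.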
A further complication can arise when data are tabulated and reported because categories with few people in them are often later collapsed into broader categories and are sometimes even dropped from analysis. When this occurs, although respondents may have initially had the opportunity to express their unique identity in data collection, the end result is that their voice is erased.
A similar outcome can occur when a question uses an open-response, or free-text, format, in which respondents write in their personal identity. This approach requires recoding each written response into a broad category that can be used in statistical analysis. The coder or analyst must decide how to categorize the information the respondent provided, and this recategorization carries implicit biases that may not be consistent with the individual’s understanding of their own identity (Guyan, 2022). Additionally, this process can be very time consuming and resource intensive, particularly when a large number of responses have to be coded. Nonetheless, write-ins allow respondents to record terms outside of a fixed list and allow analysts to monitor the use of terms over time to determine whether the inclusion of a new category is warranted going forward, which is particularly useful when terminology is in flux.
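A minimal sketch of this recoding step is shown below. The mapping, category labels, and function name are invented for illustration; the mapping dictionary is precisely where the analyst judgment (and potential bias) described above enters, and retaining the verbatim text allows those decisions to be revisited.

```python
# Hypothetical coding sketch: collapsing free-text write-ins into broad
# analysis categories while preserving the verbatim text for review.
# The mapping and labels below embed analyst decisions and are invented.

CODE_MAP = {
    "nonbinary": "nonbinary_or_genderfluid",
    "non-binary": "nonbinary_or_genderfluid",
    "genderfluid": "nonbinary_or_genderfluid",
    "gender-fluid": "nonbinary_or_genderfluid",
}

def code_write_in(verbatim):
    """Return (broad_category, verbatim); unmapped terms are flagged
    for manual review rather than silently dropped."""
    key = verbatim.strip().lower()
    return CODE_MAP.get(key, "uncoded_review_needed"), verbatim

responses = ["Nonbinary", "gender-fluid", "two-spirit"]
coded = [code_write_in(r) for r in responses]
print(coded)
```

Tracking the frequency of terms that land in the review bucket over successive data collections is one way to monitor whether a new standing category may be warranted.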
Determining when a write-in response is “sufficient” to warrant its own category requires several considerations. First, has previous research indicated that respondents identifying with the potential new category have different outcomes than those identifying with existing categories? For example, do people writing in “nonbinary” or “gender-fluid” have different outcomes than those currently being classified as transgender? Second, has the potential new category seen an increase in frequency over time and across different data collections? In other words, will the terminology have “staying power,” or is it a temporarily popular term? Third, is the frequency of the potential new category as large as or larger than that of an existing category? Furthermore, is the number of responses large enough to pass disclosure avoidance thresholds such that it can be published in public data tables? If these conditions are met, then pretesting the new category is essential to understand whether it might confuse some respondents—a niche community may know what the term means, but how will the general population interpret it? Finally, if resources permit, a randomized split-panel experiment can be conducted testing the old set of categories against the expanded set to examine who selects the new category and with what frequency (as well as to compare the new group’s demographics to those of the original write-in group).
Although we recognize that all kinds of data can inform public policy and community action, the statement of task stipulated that the panel’s recommendations be focused on the types of information collected in population-based surveys, large-scale administrative contexts, and other data collection activities that track entire populations or large general samples, not just those that target sexual and gender minorities. These contexts almost always use multiple-choice questions that capture the vast majority of respondent identities and minimize the need for further data coding and processing. Moreover, because many sexual and gender minority groups are small populations, mismeasurement can have an outsized impact on data quality, which means that pretesting of measures is particularly important. For this reason, the panel decided to base our evaluation on existing measures that had undergone testing and, when possible, had been used in general populations.
Measures that have primarily been used and tested in LGBTQI+ communities may better capture the diverse range of identities and experiences in those populations, but they may be less comprehensible to the general population. Similarly, the use of specific terminology may vary with age or other respondent characteristics, such as race, ethnicity, and geography. Data collection efforts that target these populations may wish to consider modifying the recommended questions and response options. We strongly urge that any adjustments to the recommended questions be properly tested to understand the potential impact on the resulting data.
Given the generally accepted criteria for assessing measures and the concerns about balancing community responsiveness with usability in a general population that are discussed above, we used the following criteria for selecting which measures for sexual orientation identity, gender identity, and intersex status to recommend:
- consistency with the data collection principles discussed above (e.g., precision, inclusiveness, autonomy);
- comprehensibility to the general population as well as the LGBTQI+ populations of interest;
- tested in both general population and LGBTQI+ populations;
- requires that respondents select only a single response option in order to simplify enumeration, tabulation, and analysis of the resulting data;
- provides consistent estimates when measured across data collection contexts; and
- tested or previously administered with adequate performance using multiple administration modes (i.e., web-based, interviewer-administered, computer-assisted, and telephone administration).
For the response options, we used the following criteria:
- comprehensibility to the general population;
- consistency with terminology that is currently used in both the general population and LGBTQI+ populations;
- ability to measure current trends;
- ability to measure, assess, and incorporate changes with less well-known terminology;
- balance in providing comprehensive options with minimizing complexity and respondent burden that arises from considering a longer list of response options;
- produces a sufficient number of respondents per category to minimize the need to collapse categories and reclassify respondents; and
- considers the effects of response item ordering, including relevant factors, such as:
- population prevalence,
- alphabetical listing,
- previous testing.
In Part II of this report, the panel’s recommended measures are weighed against these criteria using the evaluation methods described earlier. Both qualitative and quantitative evidence are cited to demonstrate degrees of comprehension, usability, item nonresponse rates, frequency of category responses, and other psychometric measures of construct validity. In some cases, such as for intersex status, we highlight promising measures and recommend testing for inclusion in future data collection efforts.
The panel was charged with recommending measures for use in each of three context settings: surveys and research, administrative, and clinical. Much of the research that has been done to evaluate these measures has been done in surveys and other research settings. For this reason, we have greater confidence in the performance of our recommended measures in this setting than in the other two.
For clinical settings, the panel reviewed the available information on measures, including data collection guidance from government agencies, such as the Centers for Disease Control and Prevention and the National Institutes of Health, as well as research and practice from public and private health care organizations. We did not find any reasons to modify our recommended measures for data collection in this setting.
Administrative settings cover a wide range of contexts and practices and, with the exception of vital statistics and legal identification documents, tend to be privately maintained. Very little information is available on the practices in use, and even less is publicly available on how those measures perform. As noted in Chapter 3, there may be specific contexts in administrative settings in which the collection of some of these data, such as sex assigned at birth, may be considered invasive; therefore, it may be necessary to modify the recommended measures for these contexts. Unfortunately, the panel did not have a sufficient evidentiary base for data collected in this setting to allow us to recommend possible alternative measures. Thus, we propose one set of measures that can be used across all three settings; however, users need to exercise caution when using these measures, particularly in administrative settings.