Suggested Citation:"1. Introduction to Research Investigations." National Academies of Sciences, Engineering, and Medicine. 2011. Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules. Washington, DC: The National Academies Press. doi: 10.17226/18160.


NCHRP Project 08-79 Final Report: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules

1. Introduction to Research Investigations

The main goal of the project was to arrive at an operationally practical data perturbation approach that satisfies both the transportation data user community’s analytical needs and the disclosure rules set by the Census Bureau Disclosure Review Board (DRB). The main disclosure avoidance practice that had been used on certain Census Transportation Planning Products (CTPP) tabulations to accomplish this objective was cell suppression. First, small cells were identified and suppressed; then other related table cells that would allow the primary cell’s value to be logically deduced from the table’s margins were also suppressed. The small cells were defined using the “Rule of 3,” which reduced the disclosure risk, although this would result in suppressed data in an estimated 80 percent or more of places in the nation, using a 10-level Means of Transportation (MOT) variable (Miller 2008) for three-year American Community Survey (ACS) tabulations.

With the underlying data for the CTPP moving from the Census Long Form data to the smaller ACS five-year combined sample, it was clear that, under the DRB disclosure rules, the data loss for finer geographic areas, such as planned Traffic Analysis Zones (TAZs), would be substantial. For this reason, efforts were focused on ways to generate a complete set of data containing perturbed values while retaining the usability of the data. Given the opportunity to address this issue, the project team, consisting of Westat, its subcontractor, Vanasse Hangen Brustlin, Inc. (VHB), and analysis consultant Dr. Michael Larsen, along with the Westat Senior Statistical Advisory Group, conducted the initial tasks as laid out in the working plan document.
As part of the kickoff meeting on January 26, 2010, the panel (Transportation Research Board panel members for NCHRP 08-79) provisionally approved the initial plans discussed in the working plan document and provided recommendations related to travel model validation that are discussed later in this document. The research was divided into the following four phases:

1. Research investigations (Research Tasks 1 and 2; Chapter 1);
2. Development (Research Tasks 3, 4, and 5; Chapter 2);
3. Validation (Research Task 6; Chapter 3); and
4. National test and transition of programs (Research Task 7; Chapter 4).

The main result of the research activities was a set of three main perturbation approaches to evaluate. The approaches were evaluated in the development phase to determine the best approach for moving forward to the validation phase of the research. Also described are the data utility and disclosure risk measures that were developed to assess the performance of the perturbation approaches. The methodology is described in Chapters 2, 3, and 4, since the approaches changed slightly during each phase.

After the development phase and validation phase results were generated, the DRB was provided the documentation that discussed the variables perturbed in the development phase and the percentage of values that were replaced. The variables to be perturbed will not be made known to the public; nondisclosure of the list is considered vital. All variables mentioned in this document are discussed as a subset of preliminary variables focused on for the research.

The results of the research investigations provided the research parameters, which included the following:

• Tables and variables. As discussions commenced with the DRB, transportation experts, Census operations staff, and the statistical advisory group, it was readily apparent that establishing the set of tables and variables for the purposes of this research was needed to facilitate concrete discussion, decisions, and efficient use of resources.
• Disclosure thresholds. In working with the DRB, the research team sought to obtain clarification of disclosure thresholds, in order to ensure meeting the standards set forth through DRB disclosure rules for this special set of tabulations.
• Transportation user needs. In ongoing discussions with VHB about the use of transportation/CTPP data in the design, development, and use of travel demand models, the research team sought to determine the variables most important in the development of travel demand models, to gain further understanding of the needs of the transportation community, and to work toward the involvement of transportation planners in the validation of the resulting perturbed data.
• Operational needs. In collaboration with the Census Bureau’s special tabulation group, the team identified the datasets that served as the basis for the evaluations in the spring and fall of 2010, and established relationships to develop mutual understanding of the requirements for assimilating a final product from this research.

An important step in moving toward the goal of this effort was a critical assessment of a set of promising data perturbation approaches in order to identify the most credible among them, so that only a small number of approaches needed to be programmed and evaluated.
The information gained about the tables, variables, DRB rules, transportation user needs, and operational needs helped establish a concrete foundation on which to base these discussions and decisions, which resulted from a sequence of meetings between members of the research group (Mark Freedman, Tom Krenzke, Jane Li, David Hubble, Michael Larsen) and members of the Senior Statistical Advisory Group (David Judkins, Graham Kalton, Mike Brick, Bob Fay, David Morganstein). Three main perturbation approaches were selected for the development phase.

Section 1.1 provides more details of the work conducted in the initial research phase, relating to the tables, variables, and disclosure thresholds. Section 1.2 discusses the involvement of transportation planners, while Section 1.3 discusses the involvement of the ACS operations staff. An overview of the critical assessment of data perturbation approaches considered for the CTPP is provided in Section 1.4.

1.1 CENSUS DRB RULES ON ACS FIVE-YEAR TABULATIONS AND DISCLOSURE RISK ELEMENTS IN CTPP TABLES

The research began by reviewing the Census Bureau DRB rules on ACS five-year tabulations and disclosure risk elements (i.e., factors that affect disclosure risk) in CTPP tables. The general structure of the CTPP tables is described first; then the risk elements are discussed.

1.1.1 General Structure of CTPP Tables

In general, tables are derived from microdata by tabulating counts of individuals in cells determined by the cross-classification of one or more variables. In sample surveys, the survey weights for individuals in a table cell are added to produce a weighted count in the cell. In applications such as business, establishment, or transportation studies, one can summarize information on a quantitative variable (total, mean, standard deviation, median, or percentiles) for individuals within table cells defined by other variables. Categorical variables with nominal or ordinal values and discrete quantitative variables with relatively few values can be directly used for classification of subjects.

The new CTPP product will be processed from the 2006–2010 ACS combined sample. There are three parts to the CTPP tables: residence-based (Part 1), workplace-based (Part 2), and residence-to-workplace flows (Part 3). In February 2010, a set of CTPP tables for this research was approved by the CTPP Advisory Board. The initial set of tables can be found in the Technical Memorandum for Tasks 1 and 2 (Westat 2010). In June 2010, AASHTO reduced the number of tables and proposed a new draft set of tables, which are presented in Appendix A of this report. Within the tables is a one-way flow table approved by the DRB for 18 categories of Means of Transportation (MOT).

Most of the Part 1, Part 2, and Part 3 tables contain estimates of total workers, although some tables include cell aggregates, means, and medians. There are also household-based tabulations (on household income, for example), as well as other universe differences between the tables. Among the most important variables for the transportation community is the MOT, especially when it comes to the flows. Parts 1, 2, and 3 each include a table on MOT that consists of 18 categories.
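The weighted tabulation described at the start of Section 1.1.1 can be illustrated with a short sketch. The records, field names (`mot`, `age`, `weight`), and the function `weighted_crosstab` are hypothetical and chosen only for illustration; they are not taken from the ACS or CTPP processing systems.

```python
from collections import defaultdict


def weighted_crosstab(records, row_var, col_var, weight_var="weight"):
    """Sum survey weights within each (row, column) cell of a two-way table."""
    cells = defaultdict(float)
    for rec in records:
        cells[(rec[row_var], rec[col_var])] += rec[weight_var]
    return dict(cells)


# Hypothetical microdata: one dictionary per sampled worker.
records = [
    {"mot": "drove alone", "age": "25-44", "weight": 12.0},
    {"mot": "drove alone", "age": "25-44", "weight": 9.5},
    {"mot": "transit", "age": "16-24", "weight": 20.0},
    {"mot": "bicycle", "age": "25-44", "weight": 15.0},
]

table = weighted_crosstab(records, "mot", "age")
for cell, estimate in sorted(table.items()):
    print(cell, estimate)
```

Each cell estimate is the sum of the weights of the sample cases falling in that cell, so a cell estimate of 15 workers may rest on a single unweighted record.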
For small areas (smaller than counties), the MOT variable is compressed into fewer categories and crossed pairwise with other variables to generate cell estimates of workers in Part 1 and Part 2. In Part 1, eight variables are crossed with MOT as follows (the number in parentheses is the number of categories in the variable, including the total):

• Crossed with MOT(11): Age of Worker(8), Travel Time(12), Household Income(26), Vehicles Available(6), Under 18(3), Minority Status(3), Number of Workers in Household(3); and
• Crossed with MOT(7): Time Leaving Home(10).

For Part 2, Time Arriving(17) is substituted in the above list for Time Leaving Home(10). For Part 3, for small areas (defined below), MOT is crossed pairwise with only four variables to obtain cell estimates of the number of workers, as follows:

• Crossed with MOT(7): Time Leaving Home(5), Household Income(5), Vehicles Available(4); and
• Crossed with MOT(4): Travel Time(12).

Other variables involved in the small area flows are Age of Worker(8), Industry(8)[1] (overall and excluding self-employed), Time Leaving Home(17), Minority Status(3), Travel Time(12), Household Income(9), and Poverty Status(4). The tables that include cell medians, aggregates, and means are shown in Table 1-1.

Small areas. There is a distinction in Appendix A between tables slated for large areas and tables for small areas. Small areas are areas smaller than a county, for example, block groups, tracts, and places. For flows, this is transparent in the TAZ variable in the Census datasets; that is, a TAZ may be defined as block groups, tracts, or places by the Metropolitan Planning Organization (MPO), or defined as the default (tracts) for areas where TAZs are not explicitly defined. Large areas are defined as counties or areas larger than a county, such as states.

[1] Industry(8) has been proposed by the research team and agreed to by AASHTO. Appendix A tables still mention Industry(15).
NOTE: R = residence based; W = workplace based; F = flows.

Table 1-1. CTPP Tables with Cell Aggregates, Means, and Medians

| Part | Table | Variable for Cell Aggregates, Means and Medians | Subgroup |
|---|---|---|---|
| 1(R) | TimeLeavingHome(5) | Workers per carpools (no median) | Workers 16 years and over using carpools |
| 1(R) | TimeLeavingHome(5) | Workers per car, truck or van (no median) | Workers 16 years and over who used car, truck or van |
| 1(R) | TimeLeavingHome(5) | VehiclesUsed (no median) | Workers 16 years and over using car, truck or van |
| 1(R) | MOT(18) | TravelTime | Workers 16 years and over who did not work at home |
| 1(R) | MOT(11)*TimeLeavingHome(17) | TravelTime | Workers 16 years and over who did not work at home |
| 1(R) | NumWorkers(6) | HHIncome | Households |
| 1(R) | VehAvail(6) | HHIncome | Households |
| 2(W) | TimeLeavingHome(5) | Total workers in carpools (no median) | Workers 16 years and over using carpools |
| 2(W) | TimeLeavingHome(5) | Workers per car, truck or van (no median) | Workers 16 years and over who used car, truck or van |
| 2(W) | TimeLeavingHome(5) | VehiclesUsed (no median) | Workers 16 years and over using car, truck or van |
| 2(W) | MOT(18) | TravelTime | Workers 16 years and over who did not work at home |
| 2(W) | MOT(11)*TimeArriving(17) | TravelTime | Workers 16 years and over who did not work at home |
| 3(F) | MOT(7) | TravelTime | Workers 16 years and over who did not work at home |
| 3(F) | MOT(7)*TimeLeavingHome(5) | TravelTime | Workers 16 years and over who did not work at home |

NOTE: R: Residence-based; W: Workplace-based; F: Flows.

1.1.2 Disclosure Rules

Through frequent discussions between the research team and Census Bureau DRB chair Laura Zayatz, and a pivotal meeting with the Census Bureau DRB in January 2010, an understanding of the DRB disclosure rules was established.
The rules and how they relate exactly to the CTPP tabulations were clarified. Table 1-2 provides a summary of the DRB disclosure rules as discussed and confirmed by the Census DRB.

Table 1-2. Disclosure Rules for CTPP Tables Based on the Five-Year ACS

| # | Part | Table type | Example Table | Initial Risk Rule | Post-Synthetic Rule |
|---|---|---|---|---|---|
| 1 | 1 (R) | -- | Total workers(1) | No threshold | No threshold |
| 2 | 1 (R) | 1-way, 2-ways, etc. (no MOT) | R* Vehicles available(6) | No threshold | No threshold |
| 3 | 1 (R) | MOT 1-way | R* MOT(18) | No threshold | No threshold |
| 4 | 1 (R) | MOT* X | R* MOT(11)* Age of worker(8) | MOT(11) marginals must have at least 3 unweighted records. | No threshold |
| 5 | 1 (R) | Means, aggregates | R* MOT(18) – Aggregate Travel Time | Means and aggregates must be based on at least 3 unweighted records for every cell. | No threshold |
| 6 | 1 (R) | Medians | R* MOT(18) – Median Travel Time | Medians are an interpolation from a frequency distribution of unrounded data (not subject to rounding), or a point quantile rounded to two significant digits with at least 5 cases on either side of the quantile point. | No threshold |
| 7 | 2 (W) | Same as above | Same as above | Same as Part 1 Residence tables | No threshold |
| 8 | 3 (F) | Total | Total workers(1) | No threshold | No threshold |
| 9 | 3 (F) | 1-way | F* Poverty(4) | Must have at least 3 unweighted records in flow (F). | No threshold |
| 10 | 3 (F) | MOT 1-way | F* MOT(18) | No threshold | No threshold |
| 11 | 3 (F) | MOT*X | F* MOT(7)* HH Income(5) | MOT(7) marginals must have at least 3 unweighted records. | No threshold |
| 12 | 3 (F) | Means, aggregates | F* MOT(7)* Time Leaving Home(5) – Mean Travel Time | Means and aggregates must be based on at least 3 unweighted records for every cell. | No threshold |
| 13 | 3 (F) | Medians | F* MOT(7)* Time Leaving Home(5) – Median Travel Time | Medians are an interpolation from a frequency distribution of unrounded data (not subject to rounding), or a point quantile rounded to two significant digits with at least 5 cases on either side of the quantile point. | No threshold |

NOTE: R: Residence-based; W: Workplace-based; F: Flows.
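Rows 6 and 13 of Table 1-2 refer to medians computed as an interpolation from a frequency distribution. The source does not say which interpolation formula is used, so the sketch below assumes simple linear interpolation within the bin that contains the median. The function name `interpolated_median`, the bin edges, and the counts are hypothetical.

```python
def interpolated_median(edges, counts):
    """Median of grouped data by linear interpolation within the median bin.

    edges  -- bin boundaries, length len(counts) + 1 (assumed layout)
    counts -- number of cases in each bin
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty frequency distribution")
    half = total / 2.0
    cumulative = 0.0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= half:
            lower, upper = edges[i], edges[i + 1]
            # Assume cases are spread evenly across the bin.
            return lower + (half - cumulative) / count * (upper - lower)
        cumulative += count
    raise AssertionError("unreachable when total > 0")


# Hypothetical travel-time distribution (minutes).
edges = [0, 10, 20, 30, 60]
counts = [4, 10, 4, 2]
median_minutes = interpolated_median(edges, counts)
print(median_minutes)
```

Because the result depends only on bin counts and bin boundaries, it does not reproduce any individual respondent's reported value.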
The motivation and rationale for the DRB disclosure rules are as follows:

• One particular threat of disclosure, as recognized by the Census DRB, arises in the CTPP tables when sample uniques (singletons) exist in the marginals of MOT. When a single sample unit appears in the marginals of several tables (for example, MOT * A, MOT * B, … MOT * P), the tables can be linked together to define a microdata record for the sample unit consisting of MOT, A, B, … P. The resulting microdata record then reveals a lot of information about a particular individual; that is, even though the CTPP are in tabular form, tables can be linked together to form a string of identifying characteristics (referred to as a “key”). In some cases, the key could be matched to external databases, such as the ACS Public Use Microdata Sample (PUMS) in the context of CTPP tables. Matching to microdata in an external source would further compromise the confidentiality of information.
• In addition, if there is a count of two in the marginals and one of the two sample cases can be identified (for example, by the respondent themselves), then that case can be used to piece together the record of the other sample case.

• Therefore, in Parts 1 and 2, for pairwise cross-tabs involving MOT, the Rule of 3 is applied to MOT marginals. In general, the Rule of 3, based on the concept of k-anonymity, specifies that at least three individuals must be represented (a count of at least three). If there is only one, then it is a unique person in the sample. If there are only two, then each person in the sample from that cell knows that there is only one other person in the sample in that cell. If there are three or more, one person cannot make a statement about someone being unique in the sample without additional information.
• In addition, there are a few tables that have cell aggregates and means. For these tables, the Rule of 3 is applied to every cell.
• For Part 3, the Rule of 3 is applied to any one-way table other than MOT.
  – For cross-tabs involving MOT, the Rule of 3 is applied to MOT marginals.
  – As in the Part 1 and 2 tables, the Rule of 3 is applied to each cell of a table that involves cell aggregates and means; that is, in CTPP tables, means and aggregates must be based on at least three unweighted records for every cell. As with counts, if two people contribute data to a total (or mean), then one of those people can determine the other person’s value by subtraction.
• Medians are computed whenever means are computed. Medians will likely be computed as an interpolation from a frequency distribution of unrounded data (not subject to rounding).

It was also recognized that the DRB disclosure rules may be used to identify high risk cells. Given that the perturbation approach, at a minimum, would target the underlying microdata contributing to those high risk cells, there would be no DRB threshold rules applied to the tables.
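The Rule of 3 check on MOT marginals and the table-linking threat described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the DRB's or the research team's implementation: the records, the variable names (`mot`, `age`, `income`), and the functions are hypothetical, and the pairwise tables are shown as unweighted counts for simplicity.

```python
from collections import Counter

RULE_OF_3 = 3  # minimum number of unweighted records (see Table 1-2)


def mot_marginals_below_threshold(records, threshold=RULE_OF_3):
    """Return MOT categories whose unweighted count is below the threshold."""
    counts = Counter(rec["mot"] for rec in records)
    return {mot: n for mot, n in counts.items() if n < threshold}


def pairwise_table(records, var):
    """Unweighted MOT x var table, keyed by (MOT category, var category)."""
    return Counter((rec["mot"], rec[var]) for rec in records)


def link_singleton(tables, mot):
    """Rebuild the key (MOT, A, B, ...) for a MOT category with one record."""
    key = {"mot": mot}
    for var, table in tables.items():
        values = [v for (m, v), n in table.items() if m == mot and n > 0]
        if len(values) != 1:
            raise ValueError(f"{mot!r} is not a singleton in MOT x {var}")
        key[var] = values[0]
    return key


# Hypothetical unweighted sample records for one small area.
records = [
    {"mot": "drove alone", "age": "25-44", "income": "mid"},
    {"mot": "drove alone", "age": "45-64", "income": "high"},
    {"mot": "drove alone", "age": "25-44", "income": "low"},
    {"mot": "transit", "age": "16-24", "income": "low"},
    {"mot": "transit", "age": "25-44", "income": "mid"},
    {"mot": "bicycle", "age": "45-64", "income": "high"},
]

risky = mot_marginals_below_threshold(records)
tables = {var: pairwise_table(records, var) for var in ("age", "income")}
key = link_singleton(tables, "bicycle")
print(risky)
print(key)
```

In this toy sample the bicycle category has a single record, so reading its row in each pairwise table recovers that worker's full key, while the transit category, with two records, falls under the count-of-two concern.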
1.1.3 Risk Elements

Neighbors, extended kin, friends, and workmates may have the motivation to obtain sensitive information about their acquaintances. If obtained, the disclosure could be of three types, as discussed in Federal Committee on Statistical Methodology (FCSM) (2005): identity, attribute, or inferential. Identity disclosure occurs if a data snooper can identify a person from highly identifying released data, such as residence, workplace, MOT, age, earnings, industry, length of U.S. residence, sex, occupation, and household income. Attribute disclosure occurs when sensitive information about a person, such as earnings, household income, and poverty status, is revealed. Inferential disclosure happens when data can be inferred with high confidence from statistical properties of the released data.

The DRB disclosure rules are an attempt to alleviate concerns about the identity and attribute types of disclosure, which may arise through various risk elements. The risk elements pertaining to the CTPP tables include the following:

Small geography. With 166,000 TAZs from Census 2000, the size of TAZs is roughly similar to block groups. The smaller the geography, the more a data snooper can reduce the universe of possibilities.

Small ACS sample sizes. As illustrated in Table 1-3, by design, the ACS five-year sample size is expected to be only about 44 percent of historical Census Long Form design sample sizes. Even with an expected 11 percent growth in the number of housing units between 2000 and 2008 (the middle year of the 2006–2010 five-year period to be used for the first ACS-based CTPP), the ACS sample size is still expected to be only about 49 percent of the Census 2000 Long Form sample size. Essentially, the smallest TAZs will have just 20–25 ACS sample workers in TAZs with a total population of about 600 people.
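The 44 and 49 percent figures cited above can be reproduced approximately from the rounded rates reported in Table 1-3. The sketch below is one reading of that arithmetic, assuming the growth adjustment is a simple multiplicative factor; the source does not state the exact calculation, and the variable names are illustrative.

```python
# Rounded rates as reported in Table 1-3 (percent of housing units).
long_form_rate = 16.7            # Census 2000 Long Form, 1 in 6
acs_5yr_after_nonresponse = 7.4  # ACS five-year rate after nonresponse
housing_unit_growth = 0.11       # expected growth, 2000 to 2008

# ACS five-year design as a fraction of the Long Form design.
design_ratio = acs_5yr_after_nonresponse / long_form_rate

# Assumption: more housing units scale the ACS sample proportionally.
size_ratio = design_ratio * (1 + housing_unit_growth)

print(f"design ratio: {design_ratio:.1%}")
print(f"with housing unit growth: {size_ratio:.1%}")
```

Because the inputs are already rounded, the ratios agree with the reported values only after rounding to whole percents.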

Table 1-3. Comparison of ACS and Census 2000 Long Form Sample Sizes

| Survey | Sampling Rate (Ratio) | Sampling Rate (Percent) | Sampling Rate After Nonresponse | ACS Design as Percent of Long Form Design | Estimated ACS Sample Size (2006-2010) as Percent of Census 2000 Long Form Sample Size Assuming Growth in Number of Housing Units |
|---|---|---|---|---|---|
| Census 2000 Long Form | 1 in 6 | 16.7 | 16.7 | | |
| ACS 1 yr | 1 in 45 | 2.2 | 1.5 | 9 | |
| ACS 3 yr | 1 in 15 | 6.7 | 4.4 | 27 | |
| ACS 5 yr | 1 in 9 | 11.1 | 7.4 | 44 | 49 |

Flow tables. With the TAZ size thresholds remaining the same and with a smaller underlying sample size, the set of flows for each TAZ is likely to result in a majority with sample uniques.

Outlier trip scenarios. Population uniques are likely for scenarios such as a long-distance bicycle or walking commuter traveling from known point A to known point B.

Identity disclosure and matchability to the ACS PUMS data records. In other words, a risk in the set of CTPP tables is the ability to link the tables to build a microdata record and then, using the CTPP variables, match to the ACS PUMS to obtain about 150 variables for the record. Census Bureau rules prohibit showing microdata for small geographies. Given a match, there is a high probability that it is a true record match.

Neighbors. Extended kin, friends, and workmates may have the motivation to leverage their knowledge of specific people’s attributes to obtain sensitive information about their acquaintances.

1.1.4 Addressing the Risk Elements

Investigate the Impact of TAZ Sizes on Risk

Certain disclosure risk elements are associated with the population size of TAZs. The largest fundamental shift in these risk elements relates to the transition from the Census Long Form serving as the CTPP data source to the American Community Survey (ACS) five-year data files.
As shown in Table 1-3, the number of ACS sample cases is only expected to be about 50 percent of what was realized from the Census 2000 Long Form. A clear consequence of this reduced sample size is an increase in the number of TAZs that would present a disclosure risk as defined by the DRB and discussed in Section 1.1.2. Consider the following investigation into the tradeoffs associated with “small” TAZs:

• Let A = a TAZ formed as the minimum size under the current rules.
• Let B = a regular size TAZ.
• Suppose A needs a lot of masking due to very many small cell sizes.
• Suppose the degree of masking is represented by A".
• Suppose B needs some masking due to some small cell sizes.
• Suppose the degree of masking is represented by B', but less masking than for A, as represented by one apostrophe instead of the two used for A.

• Let AB = a TAZ that is needed to exceed a new rule of twice the current minimum threshold.
• Suppose AB needs some masking due to some small cell sizes.
• Suppose the degree of masking is represented by A'B'.

Scenario 1 has two TAZs, represented by A" and B'. Scenario 2 has just one TAZ, represented by A'B'. The general tradeoff between Scenarios 1 and 2 is understood: Scenario 1 provides more unique (smaller) TAZs, but with more synthesized data, while Scenario 2 has fewer (larger) TAZs, but less perturbed data. How sensitive the tradeoff is for “small” TAZs (TAZs near the minimum population size threshold) was unknown, however. The concept of this research was to assess the tradeoff between Scenarios 1 and 2 by comparing the measured impact on data usability for both scenarios.

The DRB is responsible for reviewing the disclosure risks inherent in the CTPP. Disclosure risks are in part associated with the population size of the TAZs, which are the localities of interest used in transportation modeling and planning. The smallest TAZs in the ACS five-year sample would have a total population of about 600 people, and their ACS samples would have on average just 20–25 workers. This will not only provide unstable results, but will also lead to sparse TAZ-to-TAZ flows, resulting in a substantial proportion of flows with only one or two sample cases and thereby causing concern about table linking.

The DRB rules that are placed on the tables are based on the Rule of 3. In effect, the DRB defines the riskiest cases as those where there are fewer than three sample cases in the following:

• Categories of MOT when crossed with another variable. This is because MOT is a common thread in the tables, which leaves the tables susceptible to table linking.
• Cell means and aggregates.
This is because a cell mean based on one case reveals the original value of the response. Also, a cell mean or aggregate based on only two cases reveals the original value of both cases if the value of one case is known, such as when a respondent to the ACS classified in a particular cell is looking at the reported value.  Flow tables involving a table variable other than MOT. This is due to table linking risks explained above. With many sparse tables, clearly an alternative to the traditional cell suppression was needed. The alternative was to perturb the ACS data before generating the CTPP tables. This can be thought of as adding noise to the data in order to add uncertainty to the identification of individuals. The goal of the perturbation approach is to retain as much of the ACS original data as possible, while targeting the riskiest data values, which are generally associated with small TAZs and TAZ flows. With the perturbed data, the DRB has agreed to drop the threshold rules. The purpose of this section is to express the relationship between TAZ size and disclosure risk, which essentially is strongly related to the amount of perturbation applied. Every point estimate is associated with a measure of uncertainty, which has a sampling error component and a perturbation error component. Point estimates are produced to estimate totals or proportions in cell categories, and means or aggregate values within cells. For small TAZ estimates, the perturbation error and sampling error would both be very large in relation to the magnitude of the point estimate of interest. For larger areas, the perturbation error is smaller in relation to the sampling error. To illustrate the impact of TAZ size on the number of records that violate the DRB disclosure rules, the proportion of records in TAZs, from the ACS 2005 to 2009 combined sample, involved in at

NCHRP Project 08-79 Final Report: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules 1-9 least one table with a DRB rule violation, was computed. In order to study the impact of TAZ size, TAZs with fewer than 50 ACS sample cases were then collapsed with nearby TAZs (the resulting collapsed TAZs are referred to here as CTAZ50). Further, TAZs with fewer than 300 ACS sample cases were collapsed with nearby TAZs (referred to as CTAZ300). As can be seen for the variables listed in Table 1– 4, the proportion of records from housing units in TAZs below the DRB threshold can be substantially reduced through the collapsing of those TAZs into TAZs with at least 50 or 300 ACS sample cases. For example, for the poverty variables, around 30 percent of persons in the housing unit population are in TAZs that have a DRB rule violation in at least one of the tables involving poverty. This percentage dropped to 20 percent of persons in CTAZ50 below the DRB thresholds, and 10 percent for CTAZ300. Similar results are seen for the group quarters population (Table 1-5), though the TAZ percents are smaller. The TAZ percents are smaller most likely due to the group quarters sample being more clustered than the housing unit sample and possibly more concentrated in TAZs with larger populations. Table 1-4. Comparison of Percent of Records from Housing Units in TAZs and Collapsed TAZs that Contain a DRB Rules Violation in at Least One Table: ACS 2005-2009 Variable TAZ CTAZ50 CTAZ300 Time Leaving Home 50 35 20 Travel Time 50 35 20 Age 40 25 10 Minority status 40 25 10 Poverty status 30 20 10 Industry 40 25 10 * NOTE: Values are rounded to the nearest five percent value. Table 1-5. 
Comparison of Percent of Records from Group Quarters in TAZs and Collapsed TAZs that Contain a DRB Rules Violation in at Least One Table: ACS 2005-2009 Variable TAZ CTAZ50 CTAZ300 Time Leaving Home 25 25 15 Travel Time 30 25 15 Age 20 15 5 Minority status 20 10 5 Poverty status 5 5 5 Industry 15 10 5 * NOTE: Values are rounded to the nearest 5 percent value. The state-level scatter plots in Figures 1-1 through 1-3 further illustrate the risk, for TAZ, CTAZ50, and CTAZ300, respectively. The plots use the variable “travel time” in determining the percent of records in TAZs that contain a DRB rule violation in at least one table involving the travel time to work variable. Each dot in the scatter plot represents a state. The x-axis is the ratio of the number of TAZs to the number of block groups by state (provided in the last column from Appendix B). Therefore, the far right on the plot means smaller TAZ sizes. The y-axis shows the percentage of ACS records in TAZs with threshold violations. Therefore, values higher on the plots have more DRB violations. This plot shows a relationship between TAZ size and disclosure risk—the smaller the TAZ size, the greater the risk and therefore, the more perturbation needed in the estimates. As can be seen, in general, states with ratios of TAZ to block group counts less than 1.0 (i.e., the average TAZ size is larger than the average block group size within their state) have a lower percentage of their records subject to the data perturbation process. However, as seen in Figure 1-2, the percentage of records that is subject to the data perturbation process drops substantially when using CTAZ50 instead of TAZ, and further yet with CTAZ300 in Figure 1-3.
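The kind of tabulation summarized in Tables 1-4 and 1-5 can be illustrated with a toy sketch. It rests on assumptions that are not in the report: each sample record is reduced to a hypothetical (TAZ, table cell) pair, a violation is taken to be any nonempty cell with fewer than three sample cases (a simplified reading of the Rule of 3), and the `merge_map` that assigns a small TAZ to a neighbor is invented for illustration, since the report does not describe how nearby TAZs were chosen.

```python
from collections import Counter


def violating_tazs(records, min_cases=3):
    """Return the TAZs that have at least one nonempty cell with fewer
    than `min_cases` sample cases. `records` holds one (taz, cell) pair
    per sample case (a hypothetical, simplified record layout)."""
    cell_counts = Counter(records)
    return {taz for (taz, _cell), n in cell_counts.items() if n < min_cases}


def percent_in_violating_tazs(records, min_cases=3):
    """Percent of sample records located in a TAZ with any violation."""
    bad = violating_tazs(records, min_cases)
    n_bad = sum(1 for taz, _cell in records if taz in bad)
    return 100.0 * n_bad / len(records)


def collapse(records, merge_map):
    """Relabel small TAZs with the TAZ they are merged into.
    `merge_map` is an assumed lookup, e.g. {"A": "B"}."""
    return [(merge_map.get(taz, taz), cell) for taz, cell in records]


# Toy data: TAZ "A" is small and sparse, TAZ "B" is larger.
records = [("A", "x")] + [("A", "y")] * 2 + [("B", "x")] * 4 + [("B", "y")] * 3

print(percent_in_violating_tazs(records))                        # 30.0
print(percent_in_violating_tazs(collapse(records, {"A": "B"})))  # 0.0
```

In this toy case, merging the sparse TAZ into its neighbor removes every small cell, which mirrors the direction (though not the magnitude) of the drop from the TAZ column to the CTAZ50 and CTAZ300 columns.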

Figure 1-1. Scatter Plot by State of TAZ to Block Group Ratio and Percentage of ACS Sample in TAZs Below DRB Thresholds

Figure 1-2. Scatter Plot by State of TAZ to Block Group Ratio and Percentage of ACS Sample in CTAZ50 Below DRB Thresholds

Figure 1-3. Scatter Plot by State of TAZ to Block Group Ratio and Percentage of ACS Sample in CTAZ300 Below DRB Thresholds

In summary, the above tables and figures have demonstrated how sensitive the proportion of the ACS sample subject to the perturbation process is to TAZ size. The basic understanding is that if less of the ACS sample were subject to perturbation, then the impact on the CTPP estimates would most likely be reduced. Another way of looking at this is that small TAZs will generally have high sampling error and will also need more perturbation to reduce disclosure risk. The greater the need for perturbation, the greater the error in the estimates. So it is important to choose TAZ sizes carefully to balance the need for precision in modeling flows against error (perturbation and sampling error) in estimating flows.

Although there was no real mechanism to measure the utility lost by increasing TAZ sizes to 50 and 300 ACS sample cases, it has been shown that the percentage of ACS records subject to the perturbation process can be substantially reduced by forming larger TAZs (Tables 1-4 and 1-5) and that those states that on average form larger TAZs (as measured by having many fewer TAZs than block groups) have substantially fewer ACS sample records subject to the perturbation process (Figures 1-1 through 1-3). This was offered as guidance to the states as they approached the process of TAZ formation, especially those states collapsing the current TAZs as part of their CTPP data use process. States that form larger TAZs would be subject to less sampling and perturbation error, thereby reducing loss in data utility.
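The two error sources discussed above can be written in a simplified form. This is a sketch under assumptions that the report does not state: the perturbed estimate is written as the sampling-based estimate plus a perturbation term, and that term is assumed to have mean zero and to be uncorrelated with the sampling error. The symbols $\hat{\theta}$ (estimate from the unperturbed sample), $\tilde{\theta}$ (estimate from the perturbed data), $\delta$ (perturbation error), and $\theta$ (the quantity being estimated) are introduced here for illustration only.

```latex
\tilde{\theta} = \hat{\theta} + \delta,
\qquad
\operatorname{Var}(\tilde{\theta}) = \operatorname{Var}(\hat{\theta}) + \operatorname{Var}(\delta),
\qquad
\mathrm{CV}(\tilde{\theta}) = \frac{\sqrt{\operatorname{Var}(\hat{\theta}) + \operatorname{Var}(\delta)}}{\theta}
```

Read this way, forming larger TAZs would lower the first variance term through larger sample sizes and the second through less perturbation, at the cost of geographic detail.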
Identify Outlier Trip Scenarios

To alleviate concerns over outlier trip scenarios, a system was developed to detect outlier trip scenarios based on the flow, MOT, and travel time. Discussions with the DRB chair ensured that these outlier detection procedures are similar to those used at the Census Bureau and are acceptable. State-level travel time distributions were processed by MOT to detect possible outliers.

Classify Risk Levels for Each Variable

When it comes to identifiability and matchability to the ACS PUMS, identifiable characteristics are subject to replacement under the perturbation approach brought forward by this research. Conceptually, variables that are highly identifiable but have low usability are prime candidates for perturbation. In order to protect the ACS PUMS data from identity disclosure, it may only be necessary to perturb a subset of variables to break the link in the table-linking effort, assuming that the list of perturbed variables is kept secret. This helps to reduce the risk of attribute disclosure if a data snooper is in pursuit of coworker attributes. To this end, the research team worked with the DRB to determine the variables considered highly identifiable, and with transportation specialists to determine the variables considered highly usable for their purposes.

The high, medium, and low classifications for the identifiability of each variable are provided in Table 1-6. The identifiability levels shown in the table are illustrative, so as not to reveal too much information. These classifications do not necessarily determine a list of variables to be perturbed and a list that will not be touched. Such classifications help to understand DRB concerns. Transportation planning specialists provided the usability ratings in Table 1-6 (5 = most useful) as they relate to their use by transportation planners. Although the transportation group is most concerned with residence, workplace, and MOT, there may be cases for which these data must be modified to protect confidentiality.

Table 1-6. CTPP Variables and Their Usability and Identifiability Levels

| CTPP Variable | Usability Rating | Illustrative Identifiability Level |
|---|---|---|
| Age | 3 | high |
| Class of worker | 2 | mid |
| Earnings | 5 | high |
| Industry | 2 | high |
| Length of US residence | 1 | high |
| Minority | 3 | mid |
| Occupation | 1 | high |
| Sex | 1 | high |
| Time leaving home | 4 | mid |
| Travel time | 5 | low |
| Age of youngest child | 2 | mid |
| HH income | 5 | high |
| Poverty | 4 | mid |
| Vehicles available | 5 | low |
| # of workers in HH | 4 | low |
| Time arriving (Part 2 only) | 3 | low |

Identify and Target High Risk Data Values

The DRB was aware of the risk elements inherent in the ACS data, and the threshold rules were a reflection of this realization. Therefore, the primary definition of initial disclosure risk was based on the DRB disclosure rules (before applying the perturbation approach). The initial risk assessment involved processing the CTPP tabulations to identify cell violations using the DRB disclosure rules. In production, data values associated with such violations will be flagged as high risk. Data values flagged as high risk will be targeted for replacement by the data perturbation approach.

Identify Risk Reducing Elements

Several sources of data protection have been identified in the CTPP based on the ACS sample data. For a given microdata record formed through table linking, there is a chance of the data being protected due to the following:

• Sampling reduces the risk of disclosure as compared with a census of individuals. As shown in Table 1-3, the sampling rate for the five-year ACS is about 7.4 percent, after nonresponse is taken into account.
• Swapping is used to reduce the risk of disclosure in ACS data products. The swapping rate and list of swapped variables are withheld by the DRB.
• The chance of moving or changing job locations over a five-year period is non-negligible. For example, about 46 percent of the population age 5+ moved their residential address between 1995 and 2000 according to the 2000 Census, and the percentage that changed workplaces is thought to be about the same.
• Imputation due to item nonresponse is inherent in the data. The national imputation rate varies from near 0 percent for sex to about 13 percent for income, earnings, and poverty.
• In general, there is an underlying uncertainty or divergence of variables, such as response errors, that reduces the factual (re-identifiable) nature of the variable over time.

1.2 TRANSPORTATION PLANNING CONSIDERATIONS

Historically, CTPP has been used in the transportation community by travel demand forecasters as a comparative observed dataset for model validation (in some cases certain tables/variables have been used in model estimation and model calibration, but this is less frequent).
CTPP data are also used by transportation planners as a base to create separate, quick-response analysis tools independent of the traditional travel demand forecasting (“four-step”) process. The Aggregate Rail Ridership Forecasting (ARRF) model developed by the Federal Transit Administration (FTA) is a good example of such a tool. Finally, planners use CTPP for historical travel trend analysis, special studies, and to assist with local travel surveys. All of these uses will continue with the ACS-based CTPP data products, and transportation data users in turn expect data products that will permit continuation of existing uses. The key to these research efforts is balancing transportation user needs with the requirements for disclosure avoidance.

As noted above, the general desire of the transportation planning community is for smaller units of analysis (TAZs) for CTPP, although in practical terms, the average TAZ size varies greatly from state to state, as shown in Appendix B. The desire for smaller TAZs is becoming more of a need in many areas where agencies are moving to a more disaggregate level of travel modeling and analysis, but those areas still represent a small fraction of Metropolitan Planning Organizations (MPOs) nationally (although a larger fraction of population, since nearly all of these tools are being applied in complex, major urban environments). More discussion of the approaches to TAZ size and the surrounding issues can be found in Section 1.1.4.

Regardless of the geographic unit of analysis, the key CTPP variables needed at the microdata level for transportation planners are place of residence, place of work, and MOT. A quick assessment of the usability of other CTPP variables in travel demand forecasting may be found in Table 1-6. The use of microdata (PUMS) as a seed for population synthesizers as part of travel demand models is a special case that is growing into a larger transportation user need. How perturbed ACS data work in such a process was addressed as part of the research and testing, particularly for those variables where synthesis is employed at the microdata level. The effects of a “dual” synthesis, that is, running the population synthesizer on perturbed microdata to create a base year population and using that as the travel demand model input, were considered for the research.

Transportation planners need to understand the effect that new disclosure-avoidance techniques will have on their analytical tools and any limitations on the use of the resulting data, particularly when cross-tabulating (for example, crossing a raw variable with disclosure-proofed ones that have been perturbed through differing techniques). This is independent of potential error propagation from the use of perturbed data through the travel demand model chain, which will not be explicitly considered in the research. There may be a need to re-explain the use and validity of perturbed data, which, although largely accepted in the transportation planning community, will still face pockets of suspicion. All this education also makes transportation planners’ jobs easier when explaining their analytical process to the local elected officials to whom they are accountable.
Transportation planners devoted significant resources to understanding (and in some cases elucidating) the limitations of the 2000 CTPP for certain types of analysis. To the extent that some of the documented and tested issues associated with perturbed ACS data for CTPP can be brought to planners’ attention during this research, data user needs will be further satisfied.

As discussed in the January 2010 kickoff meeting notes from the Executive Session, given the range of procedures used by MPOs in model development and the range of ways that CTPP data are used in the modeling process, the Panel was unsure if the use of one model would provide any useful performance measure. Therefore the Panel provided some key recommendations during the kickoff meeting as follows:

ITEM V - Panel’s Recommendations to the Research Agency

The Panel would like some feedback from the research team on how travel model validation would be done. The Panel would like the research team to explain how the procedures applied to disclosure protection would affect model validation. The Panel hoped that this could be accomplished comparing the model-based home-based work (HBW) outputs (for models developed around the country during the same time period) with ACS raw and ACS post-disclosure based outputs. This approach would allow multiple runs to figure out the data variability, especially with respect to mode choice. The Panel hoped that the research team does not get into investigating the ripple effect of different ACS-based datasets on a model from trip generation all the way to assignment. The approach described by the Panel might not require proprietary software to be installed at the Census Bureau, but instead would comprise collecting MPO-based HBW model outputs to compare against ACS-raw, and ACS-post disclosure datasets. The research team should also pay attention to the effect on the use of the data as a source of controls for population and household-based synthesizer models.
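The comparison the Panel describes, model-based HBW outputs against ACS raw and ACS post-disclosure outputs with attention to mode choice, could take many forms. The following is a minimal sketch of one possible summary measure, the largest absolute difference in mode share. The function names, the choice of measure, and all counts are hypothetical and are not taken from the report.

```python
def mode_shares(counts):
    """Convert worker counts by means of transportation into shares."""
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}


def max_abs_share_diff(counts_a, counts_b):
    """Largest absolute difference in mode share between two sources."""
    a, b = mode_shares(counts_a), mode_shares(counts_b)
    return max(abs(a.get(m, 0.0) - b.get(m, 0.0)) for m in set(a) | set(b))


# Hypothetical HBW worker counts for one region (illustrative only).
model_output = {"drive": 80, "transit": 15, "other": 5}
acs_raw = {"drive": 78, "transit": 16, "other": 6}
acs_perturbed = {"drive": 77, "transit": 17, "other": 6}

for label, source in [("ACS raw", acs_raw), ("ACS perturbed", acs_perturbed)]:
    print(label, round(max_abs_share_diff(model_output, source), 3))
```

Comparing the two printed values indicates how much of the gap between the model and the ACS is attributable to perturbation rather than to the model itself, which is the question the Panel raised.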
The research team accepted the Panel’s recommended approach and agreed that comparing the ACS CTPP data against home-based work (HBW) model outputs is the best approach, since HBW remains the dominant trip purpose in all metropolitan areas and is the only purpose directly comparable against the questions posed by the ACS (which does not explicitly cover non-work travel). The approach also removes the need to install, run, and support proprietary travel modeling software at the Census Bureau as part of the research.

The confirmed development phase and potential validation phase test sites (discussed in the next section) attempted to use models developed and validated during the 2005–2009 period to correspond directly with the three-year and five-year ACS data used for testing. Because of different ACS-based data sets, error propagation through the model chain was not directly addressed. However, the relationship between the test results of individual model component outputs (generation, distribution, mode choice) and potential interactive effects of different ACS-based data (for example, the effect of perturbed time of departure/arrival for comparison against time-of-day models for those agencies that employ them) may be considered. The team collected HBW model outputs from the test sites in order to conduct comparison tests between the model output and ACS raw and disclosure-proofed ACS data.

The team paid close attention to the effect on the use of ACS data as a source for controls of population and household-based synthesizer models. To examine this issue, the team specifically identified Atlanta (development and validation phase) as a test site that creates synthetic populations for use as model inputs, and compared the ACS and perturbed CTPP data against the model-synthesized population across different levels of geography.

1.2.1 Test Sites

The evaluation for the project was partitioned into two phases. In the development phase, the selected approaches were developed and evaluated for four test sites. In the validation phase, the most credible data perturbation approach was tested further for two test sites.
The selection of the test sites took into consideration the planning organization’s travel modeling experiences and sought out sites from across the nation to the extent possible, rather than focus on one part of the country. The test sites are shown in Table 1-7.

Table 1-7. Model and ACS Comparison Test Sites

| Phase / Model Type | Agency (Region) | Year of Model Output (Base or Forecast) | Most Recent Validation (Base Year) |
|---|---|---|---|
| Development phase using 3-year (2006-2008) ACS | | | |
| Tour/Activity-Based Model / Population Synthesizer | Atlanta Regional Commission (Atlanta, GA) | 2007 (Forecast) | 2008 (2000)* |
| Large MPO Trip-Based Model | East-West Gateway Council of Governments (St. Louis, MO) | 2007 (Forecast) | 2002 (2000) |
| Medium/Small MPO Trip-Based Model | Madison | 2005 (Forecast)** | 2006 (2000) |
| Statewide Model | Iowa Department of Transportation (State of Iowa) | 2005 (Base)** | 2009 (2005) |
| Validation phase using 5-year (2005-2009) ACS | | | |
| Tour/Activity-Based Model / Population Synthesizer | Atlanta Regional Commission (Atlanta, GA) | 2007 (Forecast) | 2008 (2000)* |
| Medium/Small MPO Trip-Based Model | Olympia | | |

* Many of the Atlanta submodels have been refined and recalibrated during subsequent years.
** Modelers at Madison and Iowa have confirmed that growth from 2005 to the period of the ACS is negligible and thus the 2005 data is acceptable for our tests.

1.2.2 Usability Ratings

An armchair assessment of the usability of CTPP variables in travel demand models was provided in Table 1-6, where CTPP variables were rated on a scale from one to five (five being most important for transportation purposes). Note that although earnings and household (HH) income potentially could serve as proxies for one another, the ratings reflect a strong preference for and historical use by planners of HH income (rather than earnings).

1.2.3 Data Consistency Issues

The topic of ACS five-year production tabulations relates to the issue of consistency between the weighted five-year production block group estimates and the TAZ estimates coming out of the CTPP tabulations. (In many MPOs, block group geographic definitions are directly used in defining TAZ geographic boundaries.) The Transportation Research Board Panel provided guidance at the kickoff meeting that the issue of potential inconsistency of totals and demographic distributions between ACS production block group level tables and CTPP TAZ level tabulations was a secondary consideration. They cited that “rounding” had previously created differences that users understood and were able to address. In general, the Census Bureau staff concurred with this position.

That said, the merits of calibrating at higher levels of geography were considered. TAZs can be defined within a block group or can cross block group and tract boundaries, but must be nested within a county. Given feedback from the Panel, the research team planned to calibrate the perturbed data by key variables at the Public Use Microdata Area (PUMA) level; PUMAs are areas with a population of at least 100,000.
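The report does not spell out the calibration method at this point. One common way to make weighted totals from perturbed data agree with control totals is a ratio adjustment of the weights within each area and category, sketched below. The record layout, the field names `area`, `group`, and `weight`, and the numbers are assumptions made for illustration only.

```python
from collections import defaultdict


def calibrate_weights(records, control_totals):
    """Scale weights so that, within each (area, group) class, the weighted
    total of the perturbed records equals the control total.

    records: list of dicts with keys 'area', 'group', 'weight', where
        'group' is the (possibly perturbed) category of a key variable.
    control_totals: {(area, group): weighted total from the original data}.
    Returns new records; the input list is left unchanged.
    """
    current = defaultdict(float)
    for rec in records:
        current[(rec["area"], rec["group"])] += rec["weight"]

    calibrated = []
    for rec in records:
        key = (rec["area"], rec["group"])
        factor = control_totals[key] / current[key]
        calibrated.append({**rec, "weight": rec["weight"] * factor})
    return calibrated


# Hypothetical example: one PUMA, means of transportation as the key variable.
perturbed = [
    {"area": "PUMA1", "group": "drive", "weight": 10.0},
    {"area": "PUMA1", "group": "drive", "weight": 30.0},
    {"area": "PUMA1", "group": "transit", "weight": 5.0},
]
controls = {("PUMA1", "drive"): 50.0, ("PUMA1", "transit"): 4.0}

for rec in calibrate_weights(perturbed, controls):
    print(rec["group"], rec["weight"])
```

The same function applies unchanged to finer control areas by using a different area code in the `area` field.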
Based on a request by transportation planners, the research team also calibrated the weighted totals to sub-PUMA levels so that the total workers would add to the ACS total for combined TAZs that were formed to have about 4,000 workers.

1.3 ACS OPERATIONS CONSIDERATIONS

The successful implementation of a process to produce perturbed CTPP tables required comprehensive discussions and a close working relationship with the ACS operational staff, to ensure that the CTPP process would work within the fairly constrained annual ACS production and special tabulation processes. Several meetings were held with key personnel related to the CTPP special tabulations, covering such key issues as data files, data file access, timeline, and identifying tabulations other than CTPP.

1.3.1 Input Datasets

The first CTPP tabulations will be produced from the ACS five-year 2006–2010 data file. Under ideal conditions, research efforts for NCHRP Tasks 3 and 4 (February–July 2010) would have been conducted with a fully processed ACS five-year production file. However, such a file did not exist at the time that research efforts began. After discussions with the Census Bureau, the research team concluded that using the three-year production file for 2006–2008 was the best alternative. One implication of this decision was that the research file was “sparser” than a true five-year file. To partially address this limitation, the team considered that a TAZ with a population of 600 persons from a five-year file (600 × 7.4% ≈ 44 ACS sample persons) would have about as many ACS sample persons as a 1,000-person TAZ from the three-year file (1,000 × 4.5% = 45 ACS sample persons). While there could still be meaningful differences between such sized TAZs, the patterns of sparseness and potential disclosure issues should be somewhat similar.

The first five-year production data file was made available for use in time for the validation phase (Task 6). The file included the first five years of ACS at full production levels: 2005 through 2009. Public release of five-year tabulations did not occur until the end of 2010/early 2011. The file reflected complete and final review of weights and content. The file reflected the 2009 geography, which was still Census 2000 based, including the TAZ definitions. The first CTPP tabulations from the ACS 2006–2010 five-year file are expected to reflect the 2010 Census geography and updated TAZ definitions.

The research team and the DRB agreed that the imputation and swapping flags could be used to identify certain situations as not a disclosure risk that otherwise would have required some action to reduce the risk of disclosure. These flags were placed on this file. The degree of swapping could expand under the revised process for the ACS 2006–2010 five-year data files.

1.3.2 Operations Timeline/Resources

The plans call for the CTPP package tabulation to be generated once every five years, with the first coming from the ACS 2006–2010 five-year data. Scaling up for a large once-every-five-years special process has its own set of potential resource challenges.
With that in mind, the research team asked that the Census Bureau members of the Special Tabulations Group review the general class of methods under consideration, and provide feedback to the research team on the pros and cons as they relate to the Census Bureau needing ultimately to take ownership of the developed method and apply it to produce, review, and approve the 2006–2010 CTPP tabulation package. Initial feedback was that, though some methods may be more “transparent” than others, no particular method is a “show stopper” from a Census Bureau production perspective when it comes to overly onerous resource requirements associated with validating/verifying the perturbed CTPP tables. The research team continued to discuss these issues with Census Bureau staff during the initial stages of research.

1.3.3 Workplace Allocation

Nationally, extended workplace allocation is necessary for about 23 percent of records, which are missing workplace geography below the place level. This is a procedure conducted by the Census Bureau, although the timing is such that the allocations were not provided in the three-year and five-year research files. The implication of this is that Part 2 (Workplace) and Part 3 (Worker Flow) tables included on average non-missing block- and TAZ-level values of workplace allocation for only 77 percent of all the worker records. As with a three-year research file (instead of a five-year file), this caused these tables to be even “sparser” than they would be otherwise. In addition, the ACS 2005–2009 five-year data file used for implementation in Task 6 was subject to the same limitation. However, the planned five-year CTPP, based on 2006–2010 data, is expected to have the extended allocation process applied post hoc to all records. This process is expected to code workplaces to the block for another 13 percent of the microdata records, resulting in about 90 percent of records with block-level workplaces and 10 percent with only place-level workplaces.
The research team continued to correspond with ACS operations staff so that appropriate methods could be used to account for the workplace allocation status.

To address the layer of added complexity due to missing workplace TAZs in the research-based files, the team considered the effect on the research, as well as the handling of such cases in the critical production run. Given Panel feedback on proposed options discussed in Westat (2010), the team decided that for this research, cases with a missing workplace TAZ would be assigned to the remainder of the state for the development phase (remainder of the county for the validation phase) associated with the test site planning area.

1.4 INITIAL CRITICAL ASSESSMENT OF PROMISING DISCLOSURE AVOIDANCE TECHNIQUES

A review of disclosure prevention techniques was conducted to identify those most credible for the transportation planning process, with the goals of satisfying the DRB, meeting the needs of transportation analysts, and designing a system that the Census Bureau can implement with available resources and on the promised schedule. These goals were in serious tension with each other. Given this challenge, a variety of techniques were taken up for initial examination. At the end of the assessment, recommendations were made to reduce the number of approaches. Section 1.4.1 provides a discussion of the set of criteria for the assessment. Section 1.4.2 continues with an overview of the perturbation approaches that were considered, presenting the pros and cons of each approach as they relate to the set of criteria.

1.4.1 Set of Criteria

The set of disclosure avoidance techniques was measured against the following criteria:

• Disclosure risk and rules. The approach will adequately reduce disclosure risk, and satisfy the disclosure rules that the DRB has developed.
• Data utility. The approach will satisfy the needs of transportation planner analysts by minimizing the effect of statistical disclosure control (SDC) on data utility.
• Operations. It will be necessary to implement the approach on a critical production path so that the CTPP tables can be produced in a timely manner. The effort required of Census operations staff to process and check the results needs to be a consideration.
• Applicability and flexibility. The approach adopted should be flexible: the CTPP offers a variety of tables that involve different types of variables, such as unordered categorical, ordered categorical, and continuous numeric, and the approach should handle all such variables. The approach adopted should also be applicable to the generation of tables that require cell means, aggregates, and medians. Flexibility could also be measured by the ability to add new variables to the tables.
• Variance estimation. The approaches should facilitate variance estimation, not only retaining the sampling error variation, but where possible incorporating the error added due to the data perturbation approach.
• Data consistency. The approaches should provide consistent data within the set of CTPP tables. For example, Part 3 tables should align with Part 1 and Part 2 tables; the marginal totals for a variable must match the marginal totals for the same variable in another table.
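The data consistency criterion in the list above can be made concrete with a small check that the marginal totals of a shared variable agree across two tables. This is a minimal sketch: the table layout (a dictionary keyed by tuples of category labels), the function names, and the counts are all hypothetical.

```python
def marginal(table, axis):
    """Sum a table's cells over every dimension except `axis`.
    `table` maps a tuple of category labels to a count."""
    totals = {}
    for cell, value in table.items():
        totals[cell[axis]] = totals.get(cell[axis], 0) + value
    return totals


def margins_match(table_a, axis_a, table_b, axis_b, tol=1e-9):
    """True if the shared variable has the same marginal totals in both tables."""
    m_a, m_b = marginal(table_a, axis_a), marginal(table_b, axis_b)
    return m_a.keys() == m_b.keys() and all(
        abs(m_a[k] - m_b[k]) <= tol for k in m_a
    )


# Hypothetical tables that share means of transportation (MOT).
mot_by_age = {
    ("drive", "under 35"): 6, ("drive", "35+"): 4,
    ("transit", "under 35"): 3, ("transit", "35+"): 2,
}
poverty_by_mot = {
    ("poor", "drive"): 1, ("not poor", "drive"): 9,
    ("poor", "transit"): 2, ("not poor", "transit"): 3,
}

print(marginal(mot_by_age, 0))                           # {'drive': 10, 'transit': 5}
print(margins_match(mot_by_age, 0, poverty_by_mot, 1))   # True
```

A check of this kind is easier to satisfy when perturbation is applied to the microdata before tabulation, because every table is then built from the same records.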

Consideration was given to the interplay between the resources consumed by the different approaches and the benefits obtained from each. With the development of two to four approaches, the efficiency of developing one approach may overshadow the development of other approaches. Several discussions among the Senior Statistical Advisory Group brought forth critical assessments of each approach. The approaches were presented to the group, and discussion of their advantages and disadvantages occurred. The next section provides a brief description of each of the initial approaches considered and provides a summary of the discussion.

1.4.2 Assessment of Initial Approaches

In general, there are two major types of SDC approaches. The first type is deterministic, such that there is no random error introduced into the process. The second type (data perturbation approaches) includes random error as part of the process. There are two major types of data perturbation approaches: one in which table-level modifications to already-generated tables occur, and one in which unit-level modifications occur and then the tables are processed from the modified dataset. Table 1-8 provides a list of SDC treatments that were initially considered.

Table 1-8. List of Statistical Disclosure Control Treatments Initially Planned

Type of approach    Level of application       Approach
Deterministic       Variables                  Coarsening
Deterministic       TAZ                        TAZ redefinition
Perturbed           Table modifications        Small area estimation
Perturbed           Table modifications        OnTheMap approach
Perturbed           Table modifications        Bayesian/IPF
Perturbed           Microdata modifications    Semi-parametric
Perturbed           Microdata modifications    Parametric modeling
Perturbed           Microdata modifications    Data swapping
Perturbed           Microdata modifications    Super-sampling

1.4.2.1 Deterministic Approaches

Procedures for protecting tables of counts without using perturbation methods are described in Willenborg and de Waal (1996, 2001).
They describe ideas for redesign of tables (collapsing) to avoid sensitive cells, suppression of cells, rounding of cells to fixed points, such as multiples of 5, and reporting of feasible intervals. Little attention, beyond providing an illustration, was given in 1996 to concerns that arise when tables are linked, or when multiple tables sharing some of the same margins are published. More attention was paid to a linked table example in 2001, but methods were not studied extensively.

Transforming a variable by coarsening, bounding, or rounding (practices used in the ACS sample data) removes some of the information content in the values but makes identifying a unique individual based on the data values less likely. Topcoding variables such as income or commute time is an example of a commonly used transformation.

For the CTPP, two deterministic approaches were initially considered: redefining TAZs to be larger or to have a larger minimum threshold (described earlier in Section 1.1.4), and collapsing categories of CTPP variables (coarsening). Implementing these deterministic approaches would reduce disclosure risk and help to retain data usability, because they allow more sample records to contribute to the subgroups.
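The deterministic treatments described above (coarsening, topcoding, and rounding cells to multiples of 5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category mapping, the 90-minute cap, and the function names are hypothetical, and the rounding shown is simple nearest-multiple rounding, not the Census Bureau's actual rounding rules.

```python
from collections import Counter

def coarsen(value, mapping):
    """Collapse a detailed category into a broader one; unmapped values pass through."""
    return mapping.get(value, value)

def topcode(value, cap):
    """Replace any value above `cap` with `cap` itself."""
    return min(value, cap)

def round_to_base(count, base=5):
    """Round a non-negative integer count to the nearest multiple of `base` (halves round up)."""
    return base * ((count + base // 2) // base)

# Hypothetical records: a detailed industry code and a commute time in minutes.
records = [
    {"industry": "retail", "commute_min": 25},
    {"industry": "wholesale", "commute_min": 140},
    {"industry": "mining", "commute_min": 45},
    {"industry": "retail", "commute_min": 10},
]

# Hypothetical collapse of detailed industries into broader groups.
collapse = {"retail": "trade", "wholesale": "trade", "mining": "other"}

treated = [
    {
        "industry": coarsen(r["industry"], collapse),
        "commute_min": topcode(r["commute_min"], 90),  # illustrative cap
    }
    for r in records
]

cells = Counter(r["industry"] for r in treated)             # unrounded cell counts
rounded = {k: round_to_base(v) for k, v in cells.items()}   # counts as they would be reported
print(cells, rounded)
```

None of these steps involves a random draw, which is what distinguishes them from the perturbation approaches; the trade-off is that detail is lost for every record, not only for records at risk.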

One of the first steps is to process tabulations that will detect the variables that cause disclosure rule violations. If certain variables are the usual culprits and are relatively unimportant to travel demand analyses, then collapsing categories of such variables is an option. From the initial investigations, the research team recommended that industry be collapsed to seven categories for the flow tables, because two categories in the 14-category version had very small sample counts. As determined by the Census Bureau, deterministic approaches by themselves will not be sufficient for the CTPP, and data perturbation approaches are still necessary.

1.4.2.2 Perturbation Approaches

Data perturbation refers to a disclosure control strategy that generates perturbed data from one or more statistical models and releases the generated data in lieu of the raw data. Perturbation approaches introduce a random component to produce perturbed data. Several meetings of the Senior Statistical Advisory Group at Westat brought forth critical assessments of each perturbation approach under initial consideration. One by one, the approaches were presented to the group, and discussion of their advantages and disadvantages occurred. The discussion of each of the approaches that follows includes an abbreviated summary of the critical assessment (refer to Westat 2010 for more discussion).

Tabular Approaches

Traditionally for the CTPP, the tabular approach of cell suppression has been used to reduce the risk of disclosure. Issues involved in the theory and practice of cell suppression for tabular data are presented in Giessing (2001), Domingo-Ferrer and Franconi (2006), Domingo-Ferrer and Saygin (2008), and the first two sections of Domingo-Ferrer and Torra (2004).
If cell suppression were used under current DRB rules with the ACS sample sizes, the practice would degrade the usability of the produced tables. Outside of cell suppression, the literature on tabular approaches for tables of counts has been limited to adding noise to the counts in tables, such as in Willenborg and de Waal (2001). Doing so will create inconsistencies across multiple tables. See Fischetti and Salazar-González (1998) for more discussion of controlled rounding of tables of counts for disclosure protection.

Three additional methods for perturbing an existing table have been recently reported in the literature and were reviewed by the research team. They include small area estimation, a methodology used in OnTheMap, and a version of Bayesian iterative proportional fitting. Although of intellectual interest, they are not being pursued in this application for the reasons given below. All would require substantial investment of resources beyond what is feasible in the given time frame and have uncertain outcomes given the serious limitations presented.

Small Area Estimation. Small area estimation is a statistical modeling approach that produces model-dependent estimates, called “indirect” estimates to distinguish them from standard survey or “direct” estimates, which are derived directly from responses of sampled individuals who live in an area included in the assessment. Rao (2003) and Jiang and Lahiri (2006) provide comprehensive current overviews and comparisons of models and methods for small area estimation. The indirect estimates are produced using small area estimation techniques that rely on the direct estimates from the current area, estimates from other geographic areas included in the planning area, and other variables and other geographical characteristics related to the variable of concern. Area-level small area estimation models are infeasible for the CTPP for the following reasons:

1. With sparse TAZ-to-TAZ flows, there very likely are not enough flows with adequate sample size to estimate model parameters, which would be used for the prediction of 90 to 95 percent of the flows with insufficient sample size.
2. Each cell of each CTPP table would need to be estimated, which is an infeasible undertaking given 166,000 TAZs, with many more flows and with effectively over 200 tables.

OnTheMap Approach. This approach would use methods that are similar to the area-level modeling approach that generates perturbed block-to-block flows for LEHD OnTheMap, where Bayesian techniques are used to synthesize workers’ place of residence, conditional on counts of workers by place of work, industry, age, and earnings categories that can be disclosed. Machanavajjhala et al. (2008) and Abowd et al. (2009) provide a description of the approach. Given the success of the OnTheMap system, the merits of this approach demanded initial consideration for the CTPP. Although some investigation was done, the decision was not to develop this approach, for the following reasons:

1. There is no technical documentation to truly replicate this approach and apply it to CTPP variables.
2. More time and resources would be needed to develop this approach properly and investigate questions (including consistency between marginal totals) than is allowed under the current timeline, given the need to develop other approaches to evaluate.
3. More operational resources would be needed to verify and check modifications to tables than what is needed when tabulating perturbed microdata.
4. It is unclear that the OnTheMap approach would provide any added benefits beyond the microdata approaches proposed in this document.

Bayesian/IPF. In a study by Cambridge Systematics, Inc. et al. (2009), Iterative Proportional Fitting (IPF), also known as raking, was explored to generate synthetic journey-to-work or origin-destination (OD) tract-to-tract traffic flows from five-year ACS (or long-form Census) data based on fixed marginal counts at “super-tract” or tract level. Cambridge Systematics also modified the IPF-based data synthesis by combining it with a preliminary step that used a Bayesian approach to create synthetic trip origin counts. The Bayesian/IPF differs from IPF alone only in that a table of prior counts must be made before fitting the model. This approach allows more variation in interior cells of the table. Our intention was not to pursue the Bayesian/IPF further, for the following reasons:

1. The potential run time of applying this approach to make the modifications table by table is likely quite extensive, given 166,000 TAZs, with many more flows and with effectively over 200 tables, and also considering layers of geography.
2. With likely 90 percent of the flows having singletons or doubletons, it is not clear that this methodology will have successful results, given the sparsity of the data.
3. The assumption of applying models fit to high levels of geography to the individual TAZs inside those geographies would need more study.
4. The modeling potential is not used to the maximum, in that a hierarchical model makes more sense than separate models. Likewise, separate low-dimensional tables fit separately would appear to lose some prediction power that could be gained from other auxiliary data.
5. More operational resources are needed to verify and check modifications to tables than what is needed when tabulating perturbed microdata.

The Generalized Shuttle Algorithm, discussed in Cambridge Systematics, Inc. et al. (2009), was briefly considered early on among the table modification approaches. It was dismissed because of the reported computational intensity and the increasingly widespread acceptance of perturbing microdata by the Senior Statistical Advisory Group and the Census ACS operations group.

Microdata Approaches

Besides aggregating levels of variables to increase counts above two, or suppressing entries so that numbers based on small counts are not reported and cannot be derived, an option for preventing certain disclosure is to perturb the microdata that serve as input to the table before the table is created (Duncan, Fienberg et al., 2001). Doing so raises the question of whether the numbers based on small counts in the table correspond to the true composition of the sample or result from the artificial random process. It is then possible to deny that an apparent linkage to an actual individual is real. The major advantage of this approach is that the end product is a single perturbed dataset underlying all CTPP tables; that is, the tables derived from the dataset have consistent margins while simultaneously providing disclosure protection. Microdata are perturbed by adding noise, which can be done in a variety of ways.
For example, one approach is to compute model predictions for continuous dependent variables from conditional models and add random noise drawn from a normal distribution with mean 0 and a given variance. For unordered categorical dependent variables, draws from the predicted distribution resulting from conditional generalized multinomial mixed effects models can produce the perturbed data. The methods need to be implemented in a manner appropriate for the type of variable being perturbed. Sometimes two or more methods of perturbation could be applied in conjunction.

Semi-Parametric. This methodology is influenced by Judkins et al. (2007) and, to a certain extent, the Gibbs sampler. A similar approach is discussed in Bocci and Beaumont (2009). In this approach, the resulting replacement values are model-assisted rather than model-based. Because the models play a less important role in this approach, it is less critical to get the structure of these models exactly correct. The sequential nature of the process has the benefit of preserving multivariate associations. The model selection and estimation step would use linear regressions for variables with a small number of categories. This is done primarily to reduce processing time, and the choice of model is not critical as long as the ordering of the predicted values is correct. Hot deck cells are formed from the predicted values, and the original value from a randomly selected donor is used as the replacement value. Some consideration was given to the fact that the semi-parametric approach had to be developed and that doing so would take some resources.

Parametric Modeling. The parametric modeling approach is a data perturbation approach that involves difficult work to create strong parametric models. None of the other microdata approaches suggested require nearly as much modeling work. In this approach, several variations were discussed that would model several CTPP variables.
These approaches considered the Gibbs sampler (Gelfand and Smith 1990) to link the conditional models for each variable, as well as fully Bayesian models involving Markov Chain Monte Carlo (MCMC) samples. However, a simplified approach was determined to be the most promising parametric approach. It maintains the associations between variables and may be expandable if new tables, added after the production of the perturbed dataset, involve at least one perturbed variable.

The use of Bayesian parametric algorithms for data imputation has grown in recent years (Raghunathan et al., 2002). These methods, such as those discussed above, draw imputed values from a posterior predictive distribution specified by a regression model, usually with a flat or non-informative prior distribution for the regression parameters. However, they are often heavily reliant on normality assumptions and are not designed to cope well with unusually shaped distributions, such as heaping of reported income at round thousands. The simplified parametric modeling approach had these same drawbacks but was far less computationally intensive.

Data Swapping. Swapping values entails choosing a case for potentially changing the value, selecting a donor with which to exchange values, and switching values between the two units in the dataset. This could be formulated as choosing two cases to switch, but often it is applied to cases for which there is a concern about identifiability, hence the formulation as choosing a donor for a selected case. Fienberg and McIntyre (2004) give an overview of microdata swapping in the context of tables for which there is a desire to preserve some marginal counts in the tables. In general, the data utility of the approach is sometimes suspect. With swapping, variables highly associated with the swapped variable can be linked so that they change whenever the swapped variable changes.

Supersampling.
This approach selects a new ACS sample through a supersample frame. Suppose five copies of the ACS dataset are appended and a new ACS sample of size “n” is selected. A cell count of one would then result in a count of 0, 1, 2, 3, 4, or 5. A problem with supersampling is that it can actually create a sample unique where the cell was acceptable to begin with. The supersampling approach alone, as proposed, would not alleviate concerns about disclosure risk, since it could result in single records for given combinations of variables. A brief discussion with the Census Bureau DRB also indicated concerns.

1.4.3 Credible Approaches Selected for the Development Phase

While the table-level modification approaches have good properties in reducing disclosure risk and facilitating variance estimation, the resulting data utility is suspect due to the sparseness of the flow data. Also, the data consistency between tables would need to be addressed. Operationally, given the number of TAZs, flows, tables, and layers of geography, with millions of tables to be generated, the amount of maintenance and checking is, to put it mildly, non-trivial. Therefore, none of the table modification approaches were recommended for further development. Due to the lack of disclosure protection properties, the supersampling approach was also excluded from further development.

Therefore, the following three microdata perturbation approaches were selected for the development phase:

1. Parametric model-based approach;
2. Semi-parametric model-assisted approach; and
3. Data swapping (with later adaptation referred to as constrained hot deck).
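To make the third selected approach concrete, the following Python sketch shows basic data swapping as described earlier: for each flagged case, a donor is chosen at random from the same stratum and the two records exchange values of one variable. It is a minimal illustration under assumptions, not the research team's implementation and not the later constrained hot deck adaptation; the function name, field names, and the choice of county as the stratum are hypothetical.

```python
import random
from collections import Counter

def swap_values(records, at_risk, variable, stratum, seed=0):
    """Swap `variable` between each at-risk record and a random donor in the same stratum.

    `at_risk` holds list indices of records flagged as disclosure risks (an assumed input).
    Returns a modified copy; the input records are left unchanged.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]

    # Index records by stratum so that donors come from the same group.
    by_stratum = {}
    for i, r in enumerate(out):
        by_stratum.setdefault(r[stratum], []).append(i)

    for i in at_risk:
        donors = [j for j in by_stratum[out[i][stratum]] if j != i]
        if not donors:
            continue  # no donor available in this stratum; leave the record as is
        j = rng.choice(donors)
        out[i][variable], out[j][variable] = out[j][variable], out[i][variable]
    return out

# Hypothetical microdata: county of residence and means of transportation.
records = [
    {"county": "A", "mode": "bike"},
    {"county": "A", "mode": "car"},
    {"county": "A", "mode": "car"},
    {"county": "B", "mode": "transit"},
    {"county": "B", "mode": "car"},
]

swapped = swap_values(records, at_risk=[0, 3], variable="mode", stratum="county")

# Swapping only moves values between records, so one-way counts are preserved.
same_margin = Counter(r["mode"] for r in swapped) == Counter(r["mode"] for r in records)
print(same_margin)
```

The one-way counts of the swapped variable are unchanged overall and within each stratum, but its cross-classification with variables that were not swapped can change, which is the source of the data utility concern noted in the text.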

The three perturbation approaches each retain internal consistency of marginals within the set of “threshold” Set B tables (described further below) produced in the CTPP. While the parametric modeling approach is very demanding on resources to develop, and somewhat risky when it comes to convergence issues, it has great potential in facilitating variance estimation, preserving data usability, and reducing disclosure risk. The semi-parametric approach has advantages similar to those of the parametric modeling approach, and adds the strength of drawing from empirical distributions from within a small area (e.g., collapsed TAZ). The semi-parametric approach is less dependent on models and is not susceptible to convergence issues. Data swapping is appealing in that it was readily accepted by the Census Bureau, and little effort was needed to apply it to the ACS five-year data.

In addition to the perturbation approaches, the following approaches were selected for further development.

Weight Calibration

After any of the perturbation approaches relating to microdata modifications are applied, the ACS weights will be calibrated to published ACS totals through raking. Raking, sometimes called iterative poststratification or iterative proportional fitting, was introduced by Deming and Stephan (1940); more discussion can be found in Oh and Scheuren (1987). Raking forces the modified microdata file to have estimates for selected marginal dimensions that are equal to, or calibrated to, those from the unadjusted ACS.

Two Sets of CTPP Tables

This general approach uses perturbed data where tables would have been subjected to DRB disclosure rules, and uses the ACS five-year data for tables where there are no disclosure thresholds. It is designed to retain as much observed ACS data as possible.
The end result can be thought of as dividing the current CTPP tables into two sets:

• CTPP Set A (ACS five-year data tabs), based on real data and ACS weights, where the DRB agrees to release data fully, without suppression; and
• CTPP Set B (perturbed part), based on perturbed (post-disclosure-proofing) data and CTPP adjusted weights, where the DRB has concerns.

The benefit of this approach is that data are not touched unless needed, perhaps providing better data utility to the users. There would be different marginal totals for the same variable; that is, the marginals will not be consistent between the Set A and Set B CTPP tables for the same variable. Operationally, a table generator would need to call the correct version of the variables, and each table would need to be checked carefully before release.

The DRB has reviewed this approach and has accepted it, although it has specified that the usual rounding rules will apply to the Set A tables, as they do for the other special tabulations from ACS data. The rounding rules are applied to interior cells while fixing the marginals. Since the marginals are the summation of the interior cells, this in effect will cause the marginals to differ for the same variable across tables. Users will be alerted through the table title or a footnote that the Set B tables were generated from perturbed data. Further clarification is summarized as follows:

1. There will be two underlying microdata files as input into the Census Bureau CTPP table generator program. The first microdata file will contain all original data, and the second file will contain perturbed microdata for the variables in the Set B tables.
2. The perturbed microdata file resulting from the initial risk analysis for the Set B tables on TAZ-level Part 1, 2, and 3 tables will be used for all localities for the Set B tables. The tables will be generated from the same perturbed microdata for all geographies, including TAZs, Block Groups, Tracts, Transportation Analysis Districts (TADs), Places, Counties, States, and PUMAs.
3. The perturbed microdata file will be used for TAZs where there are no violations as determined by the initial risk analysis. Even if the values of variables are unchanged, the raked weights may differ from the ACS weights, and therefore the CTPP estimates will be different from the ACS estimates.
4. The list of tables (Appendix A) contains several collapsed tables: 12201C, 12201C2, and 12201C3, for example. The collapsed versions of tables will be generated from the same perturbed microdata.
5. In Appendix C, there is reference to “Large Geography Only” for some of the tables. Large geography means county, PUMA, and state.
6. The disclosure proofing process in the research used the most detailed table in the table series (e.g., 12201 was used in the risk analysis for the series 12201, 12201C, 12201C2, and 12201C3).
7. Having more detailed tables (e.g., all based on MOT(18)) would increase the amount of perturbation in the microdata. It would also impact the DRB decisions and the perturbation rates the DRB would assign. It would necessitate a reassessment of the impact on data utility.
8. On data consistency, consider a given residence TAZ. All flows for Table 33204 in Part 3 involving that residence TAZ, if added together, will produce the same results as Table 13204 for that residence TAZ from Part 1. All tables will be consistent with one another within the set of tables referred to as Set B, since they are all generated from the same perturbed microdata file.

Using the set of CTPP research tables provided by AASHTO and given in Appendix A, the Set B tables (ones subject to DRB rules) were identified, and the list is provided in Appendix C.

Variance Estimation

A variance estimation process on the resulting single perturbed dataset was constructed to capture the sampling error in the ACS sample as well as the impact of the perturbation approach. Data utility checks compared the variances before and after the perturbation approach was applied. The research team discussed the variance estimation approach, described further in this report, with the ACS operations staff.
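The weight calibration step described under Weight Calibration earlier in this section can be illustrated with a small raking sketch in Python. It repeatedly scales the weights so that the weighted totals match a set of control totals, one marginal dimension at a time. The control totals, group labels, and function names below are hypothetical; the sketch assumes every control group contains at least one record with positive weight and that the control totals of the different dimensions sum to the same overall total.

```python
def group_totals(weights, groups):
    """Sum the weights within each group label."""
    totals = {}
    for w, g in zip(weights, groups):
        totals[g] = totals.get(g, 0.0) + w
    return totals

def rake(weights, dimensions, max_iter=100, tol=1e-9):
    """Rake weights to control totals.

    `dimensions` is a list of (groups, targets) pairs: `groups` gives each record's
    label on that dimension and `targets` maps each label to its control total.
    """
    w = [float(x) for x in weights]
    for _ in range(max_iter):
        # Adjust one dimension at a time to match its control totals.
        for groups, targets in dimensions:
            totals = group_totals(w, groups)
            w = [wi * targets[g] / totals[g] for wi, g in zip(w, groups)]
        # Stop once every dimension matches its controls within tolerance.
        worst = max(
            abs(group_totals(w, groups)[g] - targets[g])
            for groups, targets in dimensions
            for g in targets
        )
        if worst < tol:
            break
    return w

# Hypothetical perturbed microdata: four records with equal starting weights.
start = [10, 10, 10, 10]
sex = ["m", "m", "f", "f"]
mode = ["car", "transit", "car", "transit"]

# Hypothetical control totals (both dimensions sum to 40).
controls = [
    (sex, {"m": 24.0, "f": 16.0}),
    (mode, {"car": 30.0, "transit": 10.0}),
]

raked = rake(start, controls)
print(raked)
```

After raking, the weighted totals by `sex` and by `mode` both match their controls, while the weight of each individual record has changed. This is why, as noted in item 3 above, estimates can differ from the ACS estimates even where no data values were perturbed.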

TRB’s National Cooperative Highway Research Program (NCHRP) Web-Only Document 180: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules explores approaches to apply data perturbation techniques that will provide Census Transportation Planning Products data users complete tables that are accurate enough to support transportation planning applications, but that also are modified enough that the Disclosure Review Board is satisfied that they prevent effective data snooping.
