Appendix C
Data Storage, Security, and Management Procedures
DATA STORAGE AND SECURITY PROCEDURES
The National Academies of Sciences, Engineering, and Medicine received the data and delivered them to RTI International, which set up the files in a secure environment and allowed only approved personnel to access the files, according to the data-handling procedures for controlled-use data described in the project contract. Only RTI International staff who had been trained in controlled-use data handling and who had signed affidavits of nondisclosure for this project worked with the data files. RTI International reviewed all output for potential disclosure issues and cleaned data of such issues before sending them to the National Academies. Table cells based on fewer than three observations were suppressed to protect confidentiality.
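The suppression rule in the last sentence can be illustrated with a minimal sketch. It is not RTI International's actual program; the function name and the table layout (a cell label mapped to an observation count and a statistic) are assumptions made for illustration only.

```python
MIN_CELL_SIZE = 3  # from the text: cells with fewer than three observations are suppressed


def suppress_small_cells(table, min_n=MIN_CELL_SIZE):
    """Return {cell: statistic}, with None in place of suppressed statistics.

    `table` maps a cell label to an (n_observations, statistic) pair.
    This layout is hypothetical, not the structure of the project's output.
    """
    released = {}
    for cell, (n_obs, statistic) in table.items():
        # Release the statistic only when the cell has enough observations.
        released[cell] = statistic if n_obs >= min_n else None
    return released
```

For example, a cell built from two observations would be released as None, while a cell built from three would keep its value.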
FILE STRUCTURE
The data files delivered by the National Opinion Research Center (NORC) came in two primary formats: data from filers and data from individual establishments. Data were also divided into multiple files, separated by year (2017 and 2018) and by data type (employer/establishment/employee). The data-collection instruments provided two separate structures, with the online-entry version containing large matrices (one for each occupation) and the data-upload version having one record per sex-race/ethnicity-occupation-pay band (SROP) cell. The files delivered by NORC followed the data-upload version and excluded cells that had both zero employees and zero hours worked, which is more efficient for data-storage purposes. For data analysis, RTI International prepared multiple versions of the files, with one version having one record per SROP cell and another having one record per establishment; some analyses were better suited to one format and some to the other format.
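The relationship between the two analytic layouts can be sketched as follows. This is an illustration under stated assumptions rather than the procedure RTI International used: the field names (estab_id, sex, race_eth, occupation, pay_band, employees, hours) are hypothetical, and the only behavior taken from the text is that a cell missing from the delivered files stands for zero employees and zero hours.

```python
from collections import defaultdict


def cells_to_establishments(cell_records):
    """Pivot one-record-per-SROP-cell rows into one record per establishment.

    Each input row is a dict with hypothetical keys: estab_id, sex, race_eth,
    occupation, pay_band, employees, hours. The result maps each establishment
    ID to a dict of {SROP key: (employees, hours)}.
    """
    establishments = defaultdict(dict)
    for row in cell_records:
        srop_key = (row["sex"], row["race_eth"], row["occupation"], row["pay_band"])
        establishments[row["estab_id"]][srop_key] = (row["employees"], row["hours"])
    return dict(establishments)


def cell_value(establishment, srop_key):
    """Look up a cell; a cell absent from the file means zero employees and zero hours."""
    return establishment.get(srop_key, (0, 0))
```

Storing only the non-empty cells keeps the per-establishment record compact while still letting an analysis read any cell of the full matrix through the zero default.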
A notable feature for analytic considerations is the difficulty of obtaining data at the firm level. In the case of firms with only a single establishment, firms’ data were directly available in the establishment file. Often, and particularly for large firms, firms acted as their own filers, again making firm-level data readily available. However, for those firms that used professional employer organizations (PEOs) as their filers, multiple firms could be included in a single filing, and there was no separate ID to distinguish a firm from a filer. NORC attempted to enumerate the firms within each PEO’s submission by creating a universe file: Comp2EmployerUniverse. If NORC was able to separate the firm from the filing PEO, the firm was given a unique ID (USERID), and the PEO ID was retained as USERID2. Otherwise, USERID was the ID for the PEO and could be associated with multiple firms. NORC also provided the federal employment identification number (FEIN or EIN), but a firm may have multiple EINs, so EINs have limitations as firm identifiers. RTI International’s approach, as recommended by NORC, was to match the analytic data files with the universe file, using a combination of EIN and USERID (U.S. EEOC, 2020h). Given NORC’s substantial data-editing efforts, this approach should be largely successful, although some matching errors likely remain.
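A match on the combination of EIN and USERID might look like the sketch below. It is a simplified illustration, not the SAS code in the project's programs: EIN, USERID, and USERID2 are the identifiers named in the text, while the function name, the row-as-dict layout, and the handling of unmatched rows are assumptions.

```python
def match_to_universe(analytic_rows, universe_rows):
    """Match analytic rows to universe rows on the (EIN, USERID) pair.

    Returns (matched, unmatched). Matched rows are copies of the analytic
    rows with the universe file's USERID2 (the retained PEO ID) attached.
    """
    # Index the universe file by the composite key for constant-time lookup.
    universe_by_key = {(u["EIN"], u["USERID"]): u for u in universe_rows}

    matched, unmatched = [], []
    for row in analytic_rows:
        universe_row = universe_by_key.get((row["EIN"], row["USERID"]))
        if universe_row is None:
            unmatched.append(row)
        else:
            matched.append({**row, "USERID2": universe_row.get("USERID2")})
    return matched, unmatched
```

Returning the unmatched rows separately, rather than dropping them, makes it possible to review how many records the composite key fails to link.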
Because firms were difficult to identify, firm sizes were also difficult to determine. PEOs provided a consolidated report, which included the size of each establishment, but if a report contained multiple Type 6 reports (for establishments with fewer than 50 employees), it was difficult to associate those establishments with firms. Firm sizes were estimated by summing the sizes of the establishments associated with each firm.
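The summation can be sketched as follows. The sketch assumes that a firm is identified by the (EIN, USERID) pair used for the universe-file match and that each establishment record carries an employee count in a field named employees; both are illustrative assumptions, not documented details of the analysis files.

```python
from collections import Counter


def estimate_firm_sizes(establishments):
    """Estimate firm size as the sum of its establishments' employee counts.

    `establishments` is an iterable of dicts with keys EIN, USERID, and
    employees (the last is a hypothetical field name). The firm key used
    here, (EIN, USERID), is an assumption for illustration.
    """
    sizes = Counter()
    for est in establishments:
        sizes[(est["EIN"], est["USERID"])] += est["employees"]
    return dict(sizes)
```

Under this approach, any establishment attached to the wrong firm key shifts employees from one firm's estimated size to another's, which is why the estimates inherit the firm-identification problems described above.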
Since EEOC’s primary planned use of the data files was with regard to establishments (where enforcement efforts are targeted), and because of the difficulties in properly identifying firms, this report primarily focused on establishments.
Another key aspect of the file structure is that the files were not designed to support merges of Components 1 and 2, or of Component 2 over time. The difficulty with identifying firms is one reason, but another is that establishment IDs were not consistent across databases, so establishment IDs cannot be used to support merges.
DATA ISSUES
As described predominantly in Chapter 5, much of the data appeared to be of high quality, but there were instances of substantial errors or likely errors large enough to potentially have a major impact on statistics produced from the data. Major errors or likely errors include the following:
- Some reported numbers of employees were so large that they would produce major changes to national estimates. For example, one firm reported having more than 245 million employees, and another reported having over 86 million. A total of 33 firms in 2017 and 29 firms in 2018 were excluded from the database because they reported more employees than the largest U.S. employer (i.e., greater than 1.4 million). For perspective, note that employees at franchises are counted separately. For example, Forbes listed McDonald’s as the fourth largest employer in the world in 2015, with 1.9 million employees;1 however, that number included franchises, and 93 percent of McDonald’s restaurants are franchises (McDonald’s Corporation, 2021). Without franchises, McDonald’s had about 200,000 employees worldwide in 2020, and less than 25 percent were in the U.S. (McDonald’s Corporation, 2021).
- Similarly, some reports of hours worked were physically impossible, requiring workers to work more hours than exist in a year. For this study, reports of hours that reflected working more than 16 hours per day every day of the year (i.e., 5,840 hours) were assigned a red flag and excluded from the primary analyses in Chapters 6 and 7. This was a conservative adjustment designed to allow for naturally occurring extreme values, with the intention that analysts could make different filtering decisions (or data corrections) depending on specific analytic needs.
- Some inconsistencies appeared in the data, such as non-zero numbers of employees reported with zero hours worked, non-zero hours worked reported with zero employees, discrepancies between Components 1 and 2, and discrepancies between 2017 and 2018 in the Component 2 data. The discrepancies between the 2017 and 2018 data could reflect real change over time, though the differences were sometimes large enough to make that highly unlikely. The differences between Components 1 and 2 could also reflect real change, but to a much lesser degree: both components were to be based on a pay period within the same October–December quarter, which lessens the likelihood of large changes.
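The thresholds in the list above can be expressed as simple flag rules. The sketch below is illustrative only and is not the flag logic documented in RTI International's technical memos: the flag names are hypothetical, and it assumes the hours threshold is applied to average hours per employee within a cell, which the text does not state explicitly.

```python
MAX_FIRM_EMPLOYEES = 1_400_000        # from the text: size of the largest U.S. employer
MAX_HOURS_PER_EMPLOYEE = 16 * 365     # from the text: 16 hours per day, every day = 5,840


def flag_cell(employees, hours):
    """Return the set of (hypothetical) flag names that apply to one cell."""
    flags = set()
    if employees > 0 and hours == 0:
        flags.add("employees_without_hours")
    if employees == 0 and hours > 0:
        flags.add("hours_without_employees")
    # Assumption: the 5,840-hour limit is checked against hours per employee.
    if employees > 0 and hours / employees > MAX_HOURS_PER_EMPLOYEE:
        flags.add("implausible_hours")
    return flags


def firm_exceeds_employee_cap(total_employees):
    """True when a firm reports more employees than the largest U.S. employer."""
    return total_employees > MAX_FIRM_EMPLOYEES
```

Keeping the flags as a set per cell, rather than deleting flagged records, matches the stated intention that analysts can make different filtering decisions depending on their needs.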
___________________
1 https://www.forbes.com/sites/niallmccarthy/2015/06/23/the-worlds-biggest-employers-infographic/?sh=5132bfb9686b
DOCUMENTATION
The analysis programs prepared by RTI International are available in a single compressed zip file, with separate folders for the file construction and for each chapter containing data analysis (Chapters 3–7). Within each chapter folder, the SAS programs are identified based on which tables each program produced. Note, however, that table numbering changed when the report authors extracted the key data to be discussed in the report; a crosswalk provided in the zip files compares the original and final table numbers. The zip file also contains technical memos describing how data merges were performed, how the flag variables were created to filter out problematic data, and a report on the geocoding performed to support matching establishments across time.