Appendix G
Summary of Data Processing Steps for Findings on DoD Funding of Institutions of Higher Education for Fiscal Years 2010–2020
Institute for Defense Analyses for National Academies of Sciences,
Engineering, and Medicine Committee
October 2021
NOTE: Data tables can be accessed via the study’s
public access file upon request.
EXECUTIVE SUMMARY
This document lists the sources and processing steps used to create and tabulate the results of the inquiry into the funding of Historically Black Colleges and Universities (HBCUs), minority institutions (MIs), minority-serving institutions (MSIs), and other institutions of higher education by the Department of Defense (DoD). The sections below enumerate the data sources, outline the data processing steps, and discuss any limitations or special considerations. Although the document includes a detailed list of processing steps, its focus is on summarizing the inquiry at a high level rather than documenting precise technical details. For the exact steps taken, please refer to the accompanying programming script written in R.
1. List of Data Sources
The section below provides an overview of the data sources used to create the deliverable tables along with a brief discussion of any relevant data processing considerations or limitations.
A. USAspending.gov
USAspending.gov was used to obtain data on contracts and grants for institutions of higher education. These data presented a transaction-level overview of a major proportion of federal government awards (both contracts and grants) and included information such as estimates of contract duration, the name of the funding entity, and the projected monetary amount. Although this source presented the best available data on grants and contracts awarded to institutions of higher education, both in terms of the breadth of available awards and institutions, there were several limiting factors that impacted data quality or increased the processing time:
Federal obligations: The most reliable measure of the amount of money spent during a period of performance was the federal obligation amount. However, this amount often does not represent the actual amount paid by a federal entity, since it may not reflect subsequent changes to the award or may represent only an early estimate of the award’s cost.
Missing or inaccurate information: The data obtained from USAspending.gov include many instances where a field is missing (e.g., a blank entry for the address of a recipient organization) or inaccurate (e.g., the identifier for institution type, such as institution of higher education, includes some organizations that are not actually institutions of higher education). These lapses necessitated a number of work-arounds and approximations to limit the damage to data quality.
Differences between data retrieval systems: USAspending.gov has two primary platforms for retrieving data: the “award search”1 interface and the “award data archive.”2 However, the data obtained from the two platforms are not exactly the same. The problem this discrepancy presented to this project is discussed in more detail below.
Disconnect between prime and sub-award reporting systems: Federal entities are obligated to report the funding they provide to external organizations, but the same is not true of sub-awards, as those are reported through
__________________
1https://www.usaspending.gov/search
2https://www.usaspending.gov/download_center/award_data_archive
a different system. In practice, this often means that sub-award and prime award amounts do not agree (e.g., there are cases where the sub-award amount is more than 100 percent of the reported prime award amount). For more information on prime and sub-award reporting practices, see www.usaspending.gov.
No consistency in recipient organization names: The names of the recipient organizations were often inconsistent from one award to another, indecipherable, or simply incorrect. This required finding an alternative method of consistently identifying a recipient organization.
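The prime/sub-award disconnect noted above can be checked mechanically. The sketch below is in Python (the study’s own processing was performed in R), and the field layout is a hypothetical simplification rather than the actual USAspending.gov schema:

```python
# Sanity check for the prime/sub-award reporting disconnect: flag awards whose
# summed sub-award amounts exceed 100 percent of the reported prime obligation.
# Field names and structures are illustrative, not the USAspending.gov schema.

def flag_inconsistent_awards(prime_obligations, subaward_records):
    """prime_obligations: dict mapping award ID -> prime obligation amount.
    subaward_records: iterable of (award ID, sub-award amount) pairs.
    Returns the sorted award IDs whose sub-award total exceeds the prime."""
    sub_totals = {}
    for award_id, amount in subaward_records:
        sub_totals[award_id] = sub_totals.get(award_id, 0.0) + amount
    return sorted(
        award_id
        for award_id, total in sub_totals.items()
        if total > prime_obligations.get(award_id, 0.0)
    )
```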
Of the above-mentioned factors, the presence of missing or inaccurate information and the discrepancy between data retrieval systems posed the biggest impediments to data ingestion and processing. This became evident after examining the initial data pull using the “award search” interface, which allows the user to filter the available data using keywords, years, and categories. The initial data pull used the following filters:
- Fiscal years: 2010–2020
- Awarding agency: Department of Defense (DoD)
- Recipient type: Higher education
However, the resulting data contained transactions associated with institutions that were not institutions of higher education and were missing transactions for institutions of higher education that other sources indicated had received DoD grants or contracts in the same time period. At the same time, removing the filter “recipient type: higher education” resulted in a dataset that was too large to download through the “award search” interface, and the “award data archive” had to be used. Comparing the new dataset obtained through the “award data archive” to the previously obtained “award search” dataset revealed that each dataset contained transactions not present in the other. Moreover, neither dataset contained an accurate identifier for institutions of higher education.
To ensure that no available data were excluded, the subsequent analyses were performed on the joint dataset from both the “award data archive” and “award search” interfaces, that is, the union of the transactions from the two datasets. A proxy technique was then used to set aside any transactions that were not associated with institutions of higher education: the recipient organization names were scanned for relevant search terms (e.g., “university,” “college”) as well as country of origin. The resulting set of recipient institutions of DoD funding from FY 2010 through FY 2020 was then checked manually for accuracy, updated to include any institutions whose names did not contain the search terms but were verified as institutions of higher education, and trimmed to exclude any “false-positive” institutions, that is, those whose names contained “university” or “college” but are not institutions of higher education.
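A minimal sketch of this name-based proxy filter, in Python (the study’s processing script is written in R); the search terms and the contents of the curated exception lists shown here are illustrative:

```python
# Proxy filter: treat a recipient as an institution of higher education (IHE)
# when its name contains one of the search terms, subject to manually curated
# exception lists. Terms and list contents are illustrative stand-ins.

SEARCH_TERMS = ("university", "college")

def is_ihe(name, verified_ihes=frozenset(), false_positives=frozenset()):
    """verified_ihes: verified IHEs whose names lack the search terms.
    false_positives: non-IHE names that nonetheless contain a search term."""
    normalized = name.strip().lower()
    if normalized in false_positives:
        return False
    if normalized in verified_ihes:
        return True
    return any(term in normalized for term in SEARCH_TERMS)
```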
B. National Science Foundation (NSF)
Several datasets from NSF were used to present information related to institutional capacity as well as to compare federal funding for different types of institutions.
1) Higher Education Research and Development (HERD) survey, 20193
This dataset was used to report the total amounts and percentages of R&D dollars spent by federal agencies and departments. It is the sole source for the data presented in Tables 19 through 26.
2) Major Research Instrumentation (MRI) program records, 2017-20214
These were used as a proxy measure for institutional capacity to bring in major scientific awards and perform the requisite research. This dataset is the primary source for the data presented in Tables 36 through 38.
3) NSF institutional profiles, 2011-20175
These institutional profiles include information on the number of R&D personnel, graduate students, and principal investigators at all surveyed institutions of higher education. The bulk data are not available for download and were instead web-scraped from the NSF website. This dataset is one of the primary sources for the data presented in Tables 13, 39, and 40.
4) Survey of science and engineering research facilities, 20196
This survey was used as a measure of the size of each institution’s research facilities. This dataset is one of the primary sources for the data presented in Table 30.
C. Integrated Postsecondary Education Data System (IPEDS)
IPEDS institutional characteristics7 were used to consolidate recipient organization names and link the transaction dataset to other supplementary datasets. IPEDS institutional characteristics included the Data Universal Numbering
__________________
3https://ncses.nsf.gov/pubs/nsf21314#data-tables
4https://www.nsf.gov/awardsearch/advancedSearchResult?ProgEleCode=1189&BooleanElement=ANY&BooleanRef=ANY&ActiveAwards=true&
5https://ncsesdata.nsf.gov/profiles/site
6https://ncses.nsf.gov/pubs/nsf21311#data-tables
7https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx?goToReportId=7
System (DUNS) number, which is designed to individually identify business entities, as well as UNITID, an identification number for institutions of higher education that is used by several education-related organizations (e.g., Carnegie Foundation).
In addition, IPEDS was used to estimate the number of science, technology, engineering, and mathematics (STEM) degrees conferred (presented in Tables 39 and 40) along with the financial endowment and state funding amounts (presented in Table 41).
D. D&B Hoovers8
Hoovers is a marketing and sales product owned by Dun & Bradstreet (developers of the DUNS number) that was used as an additional resource to assist with consolidating recipient organization names. All available DUNS numbers of institutions from the transaction dataset that were not otherwise identified via IPEDS were uploaded to Hoovers, which matched the DUNS numbers to organization names where they were available in its database.
E. Committee’s Minority Institution Identifier
The committee-provided list of HBCUs and MIs served as the primary identifier of the institution of higher education (IHE) category for the study. The names and institution categories were appended to the transaction dataset from USAspending.gov to be used for grouping institutions in the majority of the tables created for the study.
F. Rutgers Center for Minority Serving Institutions Directory of Institutions
The Rutgers Minority Serving Institutions directory,9 based on 2020 data from the Department of Education, was used to categorize institutions of higher education that were not already labeled as either MIs or HBCUs using the committee-provided list. Minority-serving institution (MSI) was included as a category separate from MI and HBCU in any tables that grouped institutions by demographic category.
G. Carnegie Classification
Carnegie Classification data were obtained to further delineate differences between institutions on the basis of the type of degrees awarded as well as the
__________________
8https://www.dnb.com/products/marketing-sales/dnb-hoovers.html
character of the research and scholarly activity. Institutions of higher education were categorized by Carnegie Classification for Tables 13 through 15.
H. MURI/DURIP/HBCU-MSI
Publicly available information on Multidisciplinary University Research Initiative (MURI)10 and Defense University Research Instrumentation Program (DURIP)11 awards was collated with DoD-provided information on HBCU-MSI awards and presented in Tables 31 through 33.
2. Data Processing and Tabulation
This section includes a concise list of data processing steps and a set of notes on the more involved steps in the process (Sections B through I).
A. Data Processing Steps
- Download and process transaction data from USAspending.gov
  - Download data from USAspending.gov
    - Obtain “award search” prime and sub-award dataset
    - Obtain “award data archive” prime and sub-award dataset
  - Join the two datasets and ensure that one record exists for each unique transaction
  - Filter non-IHE transactions
  - Identify research area for each award
  - Join prime and sub-award datasets
- Finalize organization naming scheme
  - IPEDS
    - Obtain general information dataset
    - Obtain endowment/state funding dataset
    - Obtain number of STEM degrees conferred dataset
    - Join IPEDS datasets by UNITID
    - Append joint IPEDS dataset to transaction dataset by DUNS code
  - D&B Hoovers
    - Pull DUNS codes from transaction dataset
    - Enter DUNS codes in D&B Hoovers
    - Obtain resulting DUNS + organization name dataset
__________________
10 For an example of the available MURI data, see https://www.cto.mil/2020-muri/.
11 For an example of the available DURIP data, see https://media.defense.gov/2020/Dec/01/2002543787/-1/-1/0/FY21-DURIP-SELECTIONS-FOR-PRESS-RELEASE-FINAL-20-NOV.PDF.
    - Append DUNS + organization name dataset to transaction dataset
  - Create final naming scheme
    - If an IPEDS name exists for the organization, use it
    - If there is no IPEDS name for the organization, use the D&B Hoovers name
    - If neither IPEDS nor D&B Hoovers has a name for the organization, manually adjudicate
    - Check all names to determine consistency and cohesion, and fix if necessary
- Download and append supplementary datasets
  - Carnegie
    - Obtain Carnegie Classification dataset
    - Append Carnegie Classification dataset to the transaction dataset by UNITID
  - NSF
    - Generate cross-walk between NSF and USAspending.gov IHE naming systems
    - Obtain research facilities dataset
    - Obtain R&D personnel dataset
- Assign demographic category for each institution (i.e., HBCU, non-HBCU MI, non-HBCU MSI, or other)
- Identify MURI/DURIP/HBCU-MSI awards
- Compute Tables 1-18, 27-30, 34-35, and 39-41
- Create NSF HERD tables
  - Obtain NSF HERD federal funding data for 2019
  - Compute Tables 19-26
- Create tables based on publicly available and DoD-provided MURI/DURIP/HBCU-MSI information
- Create NSF MRI tables
  - Obtain publicly available list of NSF MRI awards
  - Create a cross-walk between NSF MRI and USAspending.gov names
  - Append institutional demographic label to NSF MRI dataset
  - Compute Tables 36-38
__________________
12 For example, see https://www.cto.mil/2020-muri/.
13https://media.defense.gov/2020/Dec/01/2002543787/-1/-1/0/FY21-DURIP-SELECTIONS-FOR-PRESS-RELEASE-FINAL-20-NOV.PDF
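The name-resolution steps listed above (prefer the IPEDS name, fall back to the D&B Hoovers name, and otherwise adjudicate manually) can be sketched as follows. The sketch is in Python (the study’s script is in R), and the dictionary lookups keyed by DUNS number are an illustrative simplification of the underlying dataset joins:

```python
# Sketch of the final naming scheme: IPEDS name first, then the D&B Hoovers
# name, and otherwise queue the record for manual adjudication. Dictionary
# lookups keyed by DUNS number are an illustrative simplification.

def resolve_name(duns, ipeds_names, hoovers_names):
    """Return (name, source); source is 'ipeds', 'hoovers', or 'manual'."""
    if duns in ipeds_names:
        return ipeds_names[duns], "ipeds"
    if duns in hoovers_names:
        return hoovers_names[duns], "hoovers"
    return None, "manual"
```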
B. Identifying Research Area Per Award
Each award in the transaction dataset from USAspending.gov was assigned a research area label based on a set of decision rules. North American Industry Classification System (NAICS) codes, Catalog of Federal Domestic Assistance (CFDA) codes, and product or service codes were used to assign awards to categories. The awards were first examined for any entries associated with R&D; awards identified as “R&D” were further classified as “basic R&D” or “applied R&D” depending on the nature of the research. If an award was not associated with R&D, its characteristics were checked for terms associated with “STEM” research. If the award was designated as neither R&D nor STEM, its characteristics were checked for terms associated with infrastructure activities, and if a match was found, the research area of the award was labeled “other infrastructure.” If the characteristics of the award did not match any of the above categories, the research area of the award was labeled “miscellaneous.” The research area labels were used to group institution of higher education funding in Tables 1, 6 through 12, and 29.
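These decision rules apply in order, which a schematic Python sketch makes explicit (the study’s script is in R; the boolean inputs below stand in for the NAICS, CFDA, and product or service code checks, which are not reproduced here):

```python
# Ordered decision rules for the research area label. The boolean arguments
# abstract away the NAICS/CFDA/product or service code lookups.

def research_area(is_rd, rd_is_basic, has_stem_terms, has_infrastructure_terms):
    """Return the research area label for one award."""
    if is_rd:
        return "basic R&D" if rd_is_basic else "applied R&D"
    if has_stem_terms:
        return "STEM"
    if has_infrastructure_terms:
        return "other infrastructure"
    return "miscellaneous"
```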
C. Assigning Demographic Category Per Institution
Institutional demographic categories were assigned primarily based on the committee-provided list of HBCUs and other MIs. Any institutions that were not assigned a label after consulting the committee-provided list were assigned a label of MSI if they were present in the Rutgers Center for Minority Serving Institutions’ MSI directory, or a label of “other” if they were not present in either list.
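The precedence described above (committee list first, then the Rutgers MSI directory, then “other”) can be sketched as follows; the sketch is in Python (the study’s script is in R), and the data structures are illustrative stand-ins for the actual lists:

```python
# Demographic category assignment: the committee-provided list takes
# precedence, then the Rutgers MSI directory, then "other".

def demographic_category(institution, committee_labels, rutgers_msis):
    """committee_labels: dict mapping institution name -> 'HBCU' or 'MI'.
    rutgers_msis: set of institutions in the Rutgers MSI directory."""
    if institution in committee_labels:
        return committee_labels[institution]
    if institution in rutgers_msis:
        return "MSI"
    return "other"
```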
D. Identifying MURI/DURIP/HBCU-MSI Awards
The transaction dataset from USAspending.gov did not include an identifier for MURI, DURIP, or HBCU-MSI research awards. Instead, these awards were identified by scanning the project description of each award for the three award acronyms. However, the resulting number and total monetary amount of the awards were not consistent with external sources, so the figures obtained from the transaction dataset (Tables 16 through 18 and 29) provide a limited estimate of the actual figures. The information presented in Tables 31 through 33, obtained from a DoD-provided source, should be considered more accurate.
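The acronym scan amounts to a substring search over the free-text project descriptions, which is why it remains a limited estimate. A minimal Python sketch (the study’s script is in R):

```python
# Identify MURI/DURIP/HBCU-MSI awards by scanning the project description
# for the program acronyms. A simple substring scan over free text; any
# description that omits the acronym is necessarily missed.

PROGRAM_ACRONYMS = ("MURI", "DURIP", "HBCU-MSI")

def programs_mentioned(project_description):
    """Return the set of program acronyms found in a project description."""
    text = project_description.upper()
    return {acronym for acronym in PROGRAM_ACRONYMS if acronym in text}
```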
E. Computation of Median and Total Award Values
For any tables that included a measure of central tendency (e.g., median) or a total funding amount across several institutions, sub-award and prime award amounts were “collapsed” within each cell. Specifically, when calculating the cell value, if any ID was found to identify both a prime award and a sub-award amount, the sub-award amount was removed prior to calculating the total or median cell value.
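A sketch of this collapsing step in Python (the study’s script is in R; the record layout is an illustrative simplification):

```python
from statistics import median

# "Collapse" prime and sub-award amounts before computing a cell value:
# when an award ID carries both a prime award and a sub-award amount,
# drop the sub-award amount to avoid double counting.

def collapse_amounts(records):
    """records: iterable of (award_id, amount, kind), kind in {'prime', 'sub'}.
    Returns the amounts remaining after dropping sub-awards whose ID also
    appears as a prime award."""
    records = list(records)
    prime_ids = {award_id for award_id, _, kind in records if kind == "prime"}
    return [
        amount
        for award_id, amount, kind in records
        if kind == "prime" or award_id not in prime_ids
    ]
```

Totals and medians are then computed over the collapsed amounts (e.g., with `sum` or `statistics.median`).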
F. Comparisons for Top 25 HBCUs (Table 13)
A convenience method was used to construct a comparison group for the top 25 HBCUs in DoD funding. In particular, for each level of Carnegie Classification, median values for the number of R&D personnel, the number of R&D graduate students, and the average number of R&D principal investigators were computed based on institutions in the 2nd and 3rd quartiles of total DoD funding in FY 2010 through FY 2020.
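For one metric within one Carnegie Classification level, the computation can be sketched as follows (a Python sketch with hypothetical inputs; the study’s script is in R):

```python
from statistics import median

# Median of a metric over institutions in the 2nd and 3rd quartiles of total
# DoD funding, within one Carnegie Classification level. Single-metric
# simplification of the comparison-group computation.

def middle_quartile_median(metric, funding):
    """metric, funding: dicts keyed by institution name; funding is the
    total DoD funding used to rank institutions."""
    ranked = sorted(funding, key=funding.get)
    n = len(ranked)
    middle_half = ranked[n // 4 : (3 * n) // 4]  # 2nd and 3rd quartiles
    return median(metric[i] for i in middle_half)
```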
G. Estimates of Central Tendency
Most tables that contain an estimate of the central tendency of award amounts and award lengths present the median award amount or length as opposed to the average. This was done primarily due to the positively skewed distributions of award length and amount; the outliers in the upper portion of the distribution shifted the average value away from the densest portion of the distribution, whereas the median was unaffected by the outliers.
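A small numeric illustration of this choice, using hypothetical award amounts:

```python
from statistics import mean, median

# Hypothetical, positively skewed award amounts (in dollars): one large
# award pulls the mean far above the bulk of the distribution, while the
# median stays near its densest portion.
amounts = [50_000, 60_000, 75_000, 80_000, 5_000_000]
skewed_mean = mean(amounts)      # 1,053,000
robust_median = median(amounts)  # 75,000
```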
H. Compatibility of Tables
As indicated by the data processing section, four separate datasets were used to create the data tables. The majority of the tables (1 through 18, 27 through 30, 34 and 35, and 39 through 41) were created using the transaction dataset from USAspending.gov in conjunction with the appended supplementary datasets (e.g., IPEDS, Carnegie, NSF, HBCU/MI/MSI identifier). The remaining tables were computed based on NSF HERD (Tables 19 through 26), the MURI/DURIP/HBCU-MSI winners’ dataset (Tables 31 through 33), and NSF MRI (Tables 36 through 38). Each of the four datasets uses a different naming scheme and data structure, and includes a different number of institutions. For this reason, the tables generated by the four different datasets cannot be considered wholly compatible with one another.
I. Legal Entities (e.g., University-Affiliated Research Centers, Centers of Excellence)
The transaction dataset from USAspending.gov does not provide information that is specific enough to identify which institution of higher education–affiliated legal entity was responsible for obtaining an award. Furthermore, there is no mention of known university-affiliated research centers or centers of excellence in the project description or any other column of the data, making it difficult to pinpoint the influence that these types of legal entities have on the institution of higher education funding process.