Suggested Citation:"8 Using Multiple Data Sources for County-Level Crop Estimates." National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources. Washington, DC: The National Academies Press. doi: 10.17226/26804.

8

Using Multiple Data Sources for County-Level Crop Estimates

Previous chapters have focused on enhancing information about the U.S. population by using administrative records directly or linking them with surveys. This chapter looks at an example in the arena of business statistics—in particular, the use of surveys, administrative records, and remote sensing data to produce county-level estimates of agricultural production.

The U.S. Department of Agriculture (USDA) has been involved in producing county-level crop estimates since 1917 (Cruze et al., 2019).¹ A National Academies of Sciences, Engineering, and Medicine report described the importance of these estimates:

Participants in agricultural markets rely on such information to make decisions: for producers, about what to grow and how to manage inventories; for processors and traders, about how to organize production and determine sales; and for retailers and consumers, about how to anticipate costs and assess the availability of food. When market participants share a common understanding of the fundamentals of supply and demand, market transactions accurately reflect the value of commodities to those along the supply chain and help ensure that food is grown, processed, and consumed at the lowest cost to the nation (NASEM, 2017b, pp. 6–7).

The National Academies (NASEM, 2017b) reviewed procedures then used by the USDA National Agricultural Statistics Service (NASS) to produce

___________________

¹ For the history of agricultural statistics in the United States, see U.S. Department of Agriculture (1969); Allen (2008); and https://www.nass.usda.gov/About_NASS/History_of_Ag_Statistics/

BOX 8-1
Selected Recommendations from the National Academies of Sciences, Engineering, and Medicine Report Improving Crop Estimates by Integrating Multiple Data Sources

RECOMMENDATION 2-1: The National Agricultural Statistics Service should evolve the Agricultural Statistics Board role from one of integrating multiple data sources to one of reviewing model-based predictions; macro-editing; and ensuring that models are continually reviewed, assessed, and validated.

RECOMMENDATION 2-2: The National Agricultural Statistics Service should achieve transparency and reproducibility by developing, evaluating, validating, documenting, and using model-based estimates that combine survey data with complementary data in accordance with Office of Management and Budget standards.

RECOMMENDATION 2-3: The National Agricultural Statistics Service (NASS) should adopt and use the following publication standard:

County-level estimates may be withheld to protect confidentiality.

County-level estimates may be withheld because NASS deems them unreliable for any use, based on its measure of uncertainty.

All other county-level estimates will be published, along with their measures of uncertainty.

RECOMMENDATION 2-4: The National Agricultural Statistics Service should develop and publish uncertainty measures for county-level estimates.

county-level estimates for crops (including planted acres, harvested acres, production, and yield by commodity) and recommended pursuing a model-based approach relying on multiple data sources (see Box 8-1). The approach that the panel recommended would build on modeling research performed by NASS, the U.S. Census Bureau, Statistics Canada, and other agencies.

This chapter describes the models that NASS has developed since the National Academies’ 2017 review (NASEM, 2017b), considers challenges for integrating data from agricultural and other business surveys, and outlines additional ways that NASS might take advantage of multiple data sources. While Chapters 5 through 7 focus on linkage of income and health data and consolidation of crime data submitted by states, this chapter focuses on the use of small area models to integrate information from administrative and other data sources.

Section 8.1 briefly reviews data sources that might be used for producing crop estimates. Sections 8.2 and 8.3 discuss statistical modeling

RECOMMENDATION 2-8: The National Agricultural Statistics Service should adopt the Farm Services Agency’s Common Land Unit as its basic spatial unit.

RECOMMENDATION 2-9: The National Agricultural Statistics Service should be prepared to maintain alternative geospatial field-level boundaries (e.g., resource land units and precision agriculture measurements) in its databases to facilitate completing the geospatially referenced farm-level database.

RECOMMENDATION 3-5: The National Agricultural Statistics Service should develop a precision agriculture reporting option for the County Agricultural Production Survey/Acreage, Production, and Stocks survey system. Farmers who reported relevant precision agriculture data would either not receive an additional survey form or receive one that was simplified and easy to use.

RECOMMENDATION 3-8: The National Agricultural Statistics Service should explore collaboration with other U.S. Department of Agriculture agencies that are actively involved in remote sensing applications to obtain access to data with finer spatial resolution and possibly also to share in the costs of processing those data.

RECOMMENDATION 3-9: The National Agricultural Statistics Service (NASS) should keep abreast of emerging data sources; how they are used; and how they might be used to improve county estimates, especially of yield. Based on a careful evaluation, NASS might consider purchasing data.

SOURCE: NASEM (2017b, pp. 3–4).

approaches taken by NASS and Statistics Canada, respectively, to incorporate data from non-survey sources into crop-estimation programs, relying in part on presentations from the workshop session on Improving Agriculture Statistics with New Data Sources. Section 8.4 explores opportunities for continued improvement of agricultural statistics.

8.1 DATA SOURCES FOR CROP ESTIMATES

This section summarizes the main data sources that NASS has used to make county-level crop estimates in the United States, as well as other data sources with potential to improve model-based estimates: private-sector data and data obtained from social media, webscraping, and crowdsourcing.²

___________________

² See also Stubbs (2016, p. 1), who categorized “big data” sources for agriculture as “public-level big data,” which are “collected, maintained, and analyzed through publicly

Probability Samples

NASS conducts hundreds of surveys every year.³ A census and three probability surveys provide information for NASS’s Crops County Estimates Program.

The Census of Agriculture is taken every 5 years to provide “a complete count of U.S. farms and ranches and the people who operate them.”⁴ It collects information on characteristics of farm operators, land use, production practices, income, and expenditures. Although the intent is to include every agricultural operation, the Census of Agriculture is subject to undercoverage, nonresponse, and misclassification. Some farms, particularly smaller operations, are not on the mailing list that serves as the frame for the census, and some operations are misclassified. Estimates of the total number of farms and acreage devoted to agriculture are adjusted for undercoverage, nonresponse, and misclassification using information from the June Area Survey (USDA, 2019).

The June Area Survey (JAS) collects information on “crop acreage, grain stocks, cattle inventory, hog inventory, sheep and goat presence, land values, farm numbers, technology use, and value of sales data.”⁵ It is called an “area survey” because the sample is drawn from an area frame that identifies parcels of land (Davies, 2009). For the JAS, land segments of approximately one square mile are selected for the sample, and interviewers attempt to interview every farm operator within the boundaries of the sampled land segments. Because the sampling frame consists of parcels of land, the JAS has full coverage of farm operators (although there is still nonresponse because some of the sampled operators cannot be reached or decline to participate in the survey).

___________________

funded sources, specifically by federal agencies (e.g., farm program participant records, Soil Survey, and weather data)” and “private big data,” which “represent records generated at the production level and originate with the farmer or rancher (e.g., yield, soil analysis, irrigation levels, livestock movement, and grazing rates).”

³ See https://www.nass.usda.gov/Surveys/ and https://www.nass.usda.gov/Surveys/Guide_to_NASS_Surveys/index.php for listings and descriptions of NASS surveys and programs. Schnepf (2017, p. 5) noted that NASS uses these surveys to publish “about 400 national agricultural statistical reports and thousands of additional state agricultural statistical reports covering more than 120 crops and 45 livestock items.”

⁴ See https://www.nass.usda.gov/AgCensus/ for a description of the Census of Agriculture and https://www.census.gov/history/www/programs/agriculture/census_of_agriculture.html for its history.

⁵ https://www.nass.usda.gov/Surveys/Guide_to_NASS_Surveys/June_Area/index.php

The quarterly (March, June, September, and December) Agricultural (Crops/Stocks) Surveys are conducted in all states except Hawaii, and they provide national estimates and early-season predictions of acreages, yields, and production for major crops. Farm operators are asked about the total number of acres they operate and how much acreage is devoted to each commodity of interest.⁶ The main samples are selected from list frames (lists of known farm operations) and thus do not include operations not on the list, but farms from the June Area Survey “that are not included in the list frame sampling population are subsampled for the March, September, and December surveys so that the target population is completely represented” (NASS, 2022, p. 1).

The County Agricultural Production Survey (CAPS), conducted annually at the end of the harvest season, supplements the county-level sample sizes from the Agricultural Surveys. All counties in the 44 states in which CAPS is conducted must be represented in the sample, although the commodities studied are specific to each state. The survey is mainly conducted by mail and telephone.⁷

Figure 2-1 displays response rates for the JAS from 2000 to 2022. Response rates for the quarterly Agricultural Surveys dropped from about 85 percent in the early 1990s to the 60 percent range in 2016 (Johansson, Effland, & Coble, 2017). The December 2021 Agricultural Survey had a response rate of 50.1 percent, down from 55.7 percent the previous December (NASS, 2022, p. 6). In addition, there is item nonresponse to the questions about specific commodities.

Schnepf (2017, p. 16) wrote: “The potential bias related to nonresponse becomes increasingly important for more localized estimates. For example, NASS estimates remain most accurate at the national level, but low response rates become increasingly important for estimates at the state and especially county levels.” Increasing nonresponse to agricultural surveys suggests that assessment of alternate data sources is an appropriate next step, as recommended by the National Academies’ report on Improving Crop Estimates by Integrating Multiple Data Sources (NASEM, 2017b; see Box 8-1).

Administrative Records

Several administrative data sources provide information related to crop estimates. Agencies that collect data through program administration

___________________

⁶ https://www.nass.usda.gov/Surveys/Guide_to_NASS_Surveys/Crops_Stocks/index.php

⁷ https://www.nass.usda.gov/Surveys/Guide_to_NASS_Surveys/County_Agricultural_Production/

include the USDA Farm Service Agency (FSA), which collects individual producers’ farm record data, federal payments, and loan information used in administering various farm programs; and the USDA Risk Management Agency (RMA), which collects individual farm yield and loss information to administer the Federal Crop Insurance program.⁸

Farmers who elect to participate in FSA programs provide the agency with planted acreages and crop types. Because participation in FSA programs is voluntary, estimates of planted acreage from FSA data alone will usually underestimate the total amount of planted acreage for a crop (which will also include acreage from farmers who do not participate in FSA programs).⁹ FSA planted acreage data can therefore be treated as a lower bound on the true amount of planted acreage. The RMA, in its role as an underwriter of crop insurance policies, receives data from crop insurance providers about failed acreage (acreage that was planted but not harvested, perhaps because of local weather or flooding) and checks submissions for accuracy before making payments to farmers.¹⁰ As with FSA data, there is undercoverage of the population because some farmers do not participate in a crop insurance program. In general, the FSA and RMA data have high coverage of planted acres for major commodities (NASEM, 2017b, p. 57). However, Cruze et al. (2019, p. 303) noted that some groups are particularly prone to undercoverage; for example, “known Amish communities in Pennsylvania and other midwestern states may represent significant portions of local agricultural activity but tend not to participate in federal or commercial crop insurance programs.”
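The lower-bound idea can be illustrated with a minimal sketch. It assumes the simplest possible rule, raising a county's survey estimate to the FSA-reported acreage whenever the survey figure falls below it. The county names, acreages, and the function name `apply_admin_lower_bound` are hypothetical and do not represent NASS's estimation procedure.

```python
def apply_admin_lower_bound(survey_estimate, admin_acres):
    """Return the survey estimate, raised to the administrative total if lower.

    Assumption: administrative acreage (e.g., FSA-reported planted acres)
    undercounts the truth, so it serves as a floor for the estimate.
    """
    return max(survey_estimate, admin_acres)


# Hypothetical planted-acreage figures for two counties (illustration only).
counties = {
    "County A": {"survey": 41_000.0, "fsa": 43_500.0},  # survey below the floor
    "County B": {"survey": 58_200.0, "fsa": 52_000.0},  # survey above the floor
}

bounded = {
    name: apply_admin_lower_bound(values["survey"], values["fsa"])
    for name, values in counties.items()
}

for name, acres in bounded.items():
    print(f"{name}: {acres:,.0f} planted acres")
```

In this toy case only County A changes, because its survey estimate is smaller than the acreage already reported to FSA. A model-based approach would more plausibly treat the bound as a constraint within the model rather than as a correction applied afterward.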

One complication in combining these administrative sources with survey data is that NASS, FSA, and RMA use different definitions of farms. NASS defines a farm as “any establishment from which $1,000 or more of agricultural products were sold or would normally be sold during the year.”¹¹ NASS associates one or more operators with each farm on its list frame. For the FSA, a farm “is made up of tracts that have the same owner and the same operator” (NASEM, 2017b, p. 48). The RMA does not define

___________________

⁸ An additional possible administrative records data source is the USDA Natural Resources Conservation Service, which collects data on conservation plans, geospatial data, and conservation program activities and payments to meet the USDA’s responsibilities under the Soil and Water Resources Conservation Act of 1977. See https://www.nrcs.usda.gov/wps/portal/nrcs/main/national/about/

⁹ For overviews of FSA programs and the information collected by FSA, see https://www.fsa.usda.gov/Assets/USDA-FSA-Public/usdafiles/FactSheets/2016/farm_service_agency_programs.pdf, https://www.fsa.usda.gov/Assets/USDA-FSA-Public/usdafiles/FactSheets/2019/arc-plc_farm_bill_comparisons-fact_sheet-aug-2019.pdf, and https://www.fsa.usda.gov/Assets/USDA-FSA-Public/usdafiles/FactSheets/2022/fsa_cropacreagereporting_factsheet_22.pdf

¹⁰ For more information on the RMA see https://www.rma.usda.gov/en/Fact-Sheets/National-Fact-Sheets/About-the-Risk-Management-Agency

¹¹ https://www.nass.usda.gov/About_NASS/History_of_Ag_Statistics/index.php

farms but collects information from entities that purchase crop insurance from approved insurance providers.

The National Academies’ report on Improving Crop Estimates by Integrating Multiple Data Sources (NASEM, 2017b) recommended that NASS adopt the FSA’s Common Land Unit (similar in spirit to a farm field) as its basic spatial unit, to enhance interoperability and facilitate linkage of data sources (see Box 8-1).¹²

Satellite, Aerial Imagery, and Sensor Data

The Global Strategy to Improve Agricultural and Rural Statistics (2017) provided an overview and guidelines for using remote sensing in agricultural statistics, with chapters on land cover mapping and monitoring, detailed crop mapping, and crop area and yield estimation. Important survey-related uses of remotely sensed data include improving coverage of list frames and improving the efficiency of sampling designs; many agricultural surveys use information on land cover to stratify the sampling design (Carfagna & Carfagna, 2015).¹³

Various remote sensing sources could be used as inputs to crop models: “An increasing number of satellites, aircraft, drones, flux towers, and weather stations collect geospatially referenced data that may be useful for monitoring crop-growing conditions. These data may be available from other government agencies or for purchase from private companies” (NASEM, 2017b, p. 67).

Carletto, Dillon, and Zezza (2021, p. 4453) noted that:

Remote sensing data are being used and adapted for countless purposes in farm management, agricultural programs, agricultural statistics, and empirical agricultural economics…. For empirical applications in agricultural economics, remote sensing data offer the promise of far greater accuracy, objectivity, temporal resolution, and coverage, than could be achieved through traditional survey methods relying on farmers’ self-reporting. However, remote sensing datasets are not immune from measurement error…. Errors can be introduced through the measurement technology,

___________________

¹² FSA defined the Common Land Unit as an “individual, contiguous farming parcel,” which is the smallest unit of land that has a permanent, contiguous boundary; common land cover and land management; and a common owner and/or common producer association. See the 2017 Common Land Unit information sheet at https://www.fsa.usda.gov/Assets/USDA-FSA-Public/usdafiles/APFO/support-documents/pdfs/cluinfosheet2017Final.pdf. Ali and Dahlhaus (2022) discussed interoperability in the agricultural data context.

¹³ When data from remote sensing are used for stratifying a survey design, misclassification errors do not affect the validity of the survey. More accurate data from remote sensing will improve the efficiency of the design, but any misclassification errors will be corrected during the ground survey.

the algorithm to convert the measurement into a variable for analytical use (e.g., rainfall), or the resolution of the data. Errors can also occur in linking remote sensing data to the household, plot, or farm on which the analysis is run, as well as by using variables that are not ‘fit for purpose’ from an agronomic perspective.

NASS uses data from satellite and aerial imagery to create the Cropland Data Layer, a detailed map of crops grown across the continental United States.¹⁴ Historically, the Cropland Data Layer has had 85–95 percent accuracy for major crops (Young, 2022, slide 4). NASS adjusts for bias with a regression model that uses the observed acreages for a specific crop recorded during the June Area Survey.
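A regression adjustment of this kind can be sketched with the classical survey regression estimator. The example below is an illustration under stated assumptions, not NASS's production model. It assumes one covariate (satellite-classified acres per segment) and a straight-line relationship. The function name `regression_adjusted_total` and all segment values are hypothetical.

```python
def regression_adjusted_total(survey_acres, pixel_acres, pixel_mean_all, n_segments_all):
    """Regression estimate of total crop acreage for an area.

    survey_acres   : ground-reported acres in each sampled segment
    pixel_acres    : satellite-classified acres in the same sampled segments
    pixel_mean_all : mean classified acres per segment over ALL segments
    n_segments_all : number of segments in the area
    """
    n = len(survey_acres)
    x_bar = sum(pixel_acres) / n
    y_bar = sum(survey_acres) / n
    # Least-squares slope of reported acres on classified acres.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pixel_acres, survey_acres))
    sxx = sum((x - x_bar) ** 2 for x in pixel_acres)
    slope = sxy / sxx
    # Shift the sample mean by the slope times the gap between the
    # area-wide and sample means of the satellite covariate.
    y_bar_adjusted = y_bar + slope * (pixel_mean_all - x_bar)
    return n_segments_all * y_bar_adjusted


# Hypothetical data: the classifier undercounts reported acres by about 10 percent.
pixel = [100.0, 200.0, 300.0, 400.0]
reported = [110.0, 220.0, 330.0, 440.0]
total = regression_adjusted_total(reported, pixel, pixel_mean_all=260.0, n_segments_all=1000)
print(f"Estimated total: {total:,.0f} acres")
```

With these invented numbers the fitted slope is 1.1, so the estimate scales the satellite-derived acreage up to match what operators reported on the ground, giving roughly 286,000 acres for the 1,000 segments.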

Although the Cropland Data Layer is highly accurate overall, there are data-equity issues in that land classification based on satellite observation is less accurate for smaller fields, which may produce multiple crops or have land parcels smaller than one pixel. Smaller holdings are more likely to have the “mixed-pixel problem,” meaning a pixel contains more than one type of ground cover and may be inaccurately classified.
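The mixed-pixel problem can be made concrete with a toy example. The sketch assumes, purely for illustration, that the true share of each ground cover within a pixel is known and that the pixel receives the label of its dominant cover. Operational classifiers work from spectral signatures instead, and the cover shares and the function name `classify_pixel` are hypothetical.

```python
def classify_pixel(cover_fractions):
    """Label a pixel by its dominant ground cover.

    Returns the label and the share of the pixel's area that the
    single label misrepresents.
    """
    label = max(cover_fractions, key=cover_fractions.get)
    misassigned_share = 1.0 - cover_fractions[label]
    return label, misassigned_share


# Hypothetical pixel straddling three small fields.
mixed_pixel = {"corn": 0.45, "soybeans": 0.35, "pasture": 0.20}
label, misassigned = classify_pixel(mixed_pixel)
print(f"Label: {label}; share of pixel area mislabeled: {misassigned:.0%}")
```

Here the pixel is labeled corn even though more than half of its area is something else, which is the situation more likely to arise on small holdings.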

Satellite imagery is a valuable resource for producing crop estimates, but comes with challenges described by workshop participants.¹⁵ Nkwimi-Tchahou et al. (2022) mentioned the effects from clouds and other contaminants on data quality, the intensive information technology needs for processing satellite imagery data, potential comparability problems when satellites change (because of differences in resolution), and the rare possibility of satellite failure. Goodchild (2022) emphasized the uncertainty inherent in using remote sensing data: “The pixels of remote sensing … are not sharp boundaries on the Earth’s surface, but instead the contents of one pixel bleed quite substantially into the contents of a neighboring pixel.” Goodchild also expressed concern about propagation of uncertainties through the estimation system: “We are combining datasets which have different, independent, uncertainties associated with them.” He illustrated this with an example: Common Land Units are often defined

___________________

¹⁴ See Craig (2010) for a history of the Cropland Data Layer, and Boryan et al. (2011), https://data.nal.usda.gov/dataset/cropscape-cropland-data-layer, and https://www.nass.usda.gov/Research_and_Science/Cropland/sarsfaqs2.php for information on data sources and uses.

¹⁵ See also Gallego, Carfagna, and Baruth (2010), who identified characteristics associated with quality of data from remote sensing systems, including accuracy, objectivity, and cost-efficiency. They identified the main characteristics for use in agricultural applications as spectral resolution, spatial resolution, and the ability to provide data for large areas of land for low cost. They also mentioned the need for “high temporal frequency” to be able to “follow crop growth during the season” (p. 204).

by physical boundaries, such as roads, but farmers often do not plant to the edge of a road.¹⁶

Private-Sector Data

Private-sector entities (agricultural producers) provide data through NASS agricultural surveys and administrative records. But many farm operators collect much more detailed information than is submitted to surveys. Coble et al. (2018, p. 82) commented on “the remarkable growth in producers’ ability to collect data pertaining only to their own operation through the growth of techniques and technologies such as grid soil sampling, telematics systems for farm equipment, Global Navigation Satellite Systems (GNSS), farm aerial imagery acquired via small unmanned aerial systems (sUAS), and the like.”

These detailed data are used in precision agriculture, a field that emerged in the 1980s to take advantage of technological advances in global navigation satellite systems, geographic information systems, and computing to enable data-driven decisions about planting, fertilizer use, pest and disease management, and other aspects of agricultural production. The International Society of Precision Agriculture (2019) defined precision agriculture as “a management strategy that gathers, processes and analyzes temporal, spatial and individual data and combines it with other information to support management decisions according to estimated variability for improved resource use efficiency, productivity, quality, profitability and sustainability of agricultural production.”

Early uses of precision agriculture involved adapting fertilizer distribution to soil conditions. Since then, uses have become more sophisticated, combining information from “sensors, information systems, enhanced machinery, and informed management to optimize production by accounting for variability and uncertainties within agricultural systems” (Gebbers & Adamchuk, 2010, p. 828). Stubbs (2016, p. 8) emphasized the dependence of data collection on “physical technology, such as sensors, imagery, drones, radar, and other technologies all working together to provide detailed information about soil content, weeds and pests, sunlight and shade, nutrient deficiencies, moisture, and other factors…. Data collection is an ever-expanding area of big data and includes a number of key players, including” equipment manufacturers, chemical companies and applicators, and developers of technologies such as radio frequency identification. Mendez-Costabel (2022) described uses of linked geospatial and

___________________

¹⁶ With the advent of precision agriculture (see below), some farmers may use tractor-based data to report the actual acreage planted, which may differ from acreage that would be reported using Common Land Units.

other data sources to predict performance of seed varieties under various growing conditions (e.g., open fields, greenhouses, and small land holdings) and climates.

Individual farmers are the main beneficiaries of precision agriculture, as it provides them with better data for making decisions. But these data also hold promise for improving agricultural statistics through integration with surveys and administrative data. The National Academies’ report on Improving Crop Estimates by Integrating Multiple Data Sources (NASEM, 2017b, p. 52) commented that these data might be used to reduce burden on survey respondents or to impute data for nonrespondents.

There are several challenges, however, in using private-sector data to improve agricultural statistics (see Chapter 2). Sourav and Emanuel (2020), reviewing recent trends of “big data” technology in the field of precision agriculture, argued that sensors and machinery help farmers track temperature, humidity, and soil conditions, but the data require processing. There may also be data gaps, measurement errors, lack of documentation, or proprietary data-manipulation methods that cannot be shared.

Undercoverage may occur because not all farms use precision agriculture: large corporate farming operations are more likely to have the resources to collect such data. This may lead to data inequities, in which more timely and accurate information is available for areas with large farm operations compared with areas consisting mainly of small farms.

Beyond those are the challenges of obtaining—and continuing to obtain—access to the data. Hurst (2016, p. 6) stated that many farmers using data and analytics “are reporting higher yields, fewer inputs, more efficiency, less strain on the environment, and higher profits. Yet many are also expressing concerns about privacy, security, portability, and transparency in how their data is used, and who exactly has access.” Stock and Gardezi (2022, p. 6) also highlighted concerns about the ability of agricultural technology firms and data consolidators to protect the confidentiality of farmers’ data: they may have “a royalty-free license over this data, giving them unrestricted permission to access.”

Hurst (2016, p. 8) mentioned the issue of data ownership, noting that “the individual farmer’s data has considerably more value than the average consumer’s data.” Ryan (2019) discussed the potential for a digital divide, in which farmers with data can prosper more than those without. Public data from NASS could mitigate some of that impact, but care is needed to ensure that the benefits of data are shared by all.

One issue with using private-sector data sources for agriculture is shared with other types of business data. Private companies collect many types of data that give them a competitive advantage, and that advantage may be lessened if data are shared. The previous National Academies’ report in this series (NASEM, 2023) discussed possible benefits that could

be offered to encourage data sharing, including the value of timely and granular data, confidentiality protection, and financial incentives. One benefit would be the development of standards for the collection and processing of data.

Data from Social Media, Webscraping, and Crowdsourcing

The unknown coverage of social media, webscraping, and crowdsourcing data makes it difficult to use these data as a single source to produce statistics, but they can provide valuable information when verified and combined with other sources. In agricultural statistics, webscraped and crowdsourced data have been used for expanding sampling frames and providing “ground truth” to verify data obtained from other sources such as satellite images. Hyman, Sartore, and Young (2022) described the use of webscraping to assess the coverage of local food farms in the NASS list frame (see Section 3.2). Webscraping has similarly been used to identify urban agricultural operations (Young, Hyman, & Rater, 2018) and farmers’ markets (Young & Jacobsen, 2022). In these studies, the researchers created a list of terms that might be used on websites to identify operations (e.g., “urban farm” or “community garden”) and verified that the operations were in the target population. These efforts advance data equity by improving coverage of small farms that are missing from the list frame and expensive to capture in an area frame.
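The term-list screening step described above can be sketched as follows. This is a minimal illustration, not the procedure used in the cited studies. The two search terms come from the text, while the page names, page text, and the function name `matching_terms` are hypothetical. No web requests are made; the page text is supplied directly.

```python
import re

# The two example terms given in the text; a real list would be longer.
SEARCH_TERMS = ["urban farm", "community garden"]


def matching_terms(page_text, terms=SEARCH_TERMS):
    """Return the search terms that appear in the page text, ignoring case."""
    found = []
    for term in terms:
        # Word boundaries avoid matches inside longer words; "s?" allows plurals.
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, page_text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical page text standing in for scraped websites.
pages = {
    "example-one": "Join our Community Garden this spring!",
    "example-two": "Used tractors for sale.",
}

candidates = {}
for name, text in pages.items():
    hits = matching_terms(text)
    if hits:
        candidates[name] = hits

print(candidates)
```

A keyword match only flags a candidate. As the text notes, each flagged operation still has to be verified as belonging to the target population before it is used to assess or extend the list frame.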

In ground-truthing applications, participants visit sites that correspond to the satellite images, to verify the crops grown.¹⁷ Goodchild and Li (2012) offered three approaches to ensuring quality of “volunteered geographic information”: crowdsourcing, referring to “the ability of a group to validate and correct the errors that an individual might make” (p. 112); social, relying on “a hierarchy of trusted individuals who act as moderators or gate-keepers” (p. 114); and geographic, relying on “a comparison of a purported geographic fact with the broad body of geographic knowledge” (p. 115).

Fritz et al. (2019) discussed the possibility of using data from smartphones and social media: “The increased amount of smartphones all over the world, even among low income farmers, usually the group responsible for the largest agricultural uncertainties, allows for increased opportunities to self-report geo-located crops and parcel practices, including planting dates, fertilizer application, irrigation and expected yields, through the use of purpose-designed mobile applications” (p. 270). They also suggested: “The food security and early warning community should also make greater use of the latent predictive capacity of social media and sources such as web search data” (p. 270), and gave examples of social media messages that could have given early warning of lower-than-average wheat yields.

___________________

¹⁷ For example, Lesiv et al. (2019) discussed an effort to estimate field sizes across the globe using crowdsourcing to assess the contribution made by smallholder farms to food production. Saralioglu and Gungor (2020) provided a literature review of the use of crowdsourcing to validate remote sensing data. One example of crowdsourcing is the Geo-Wiki Project (https://www.geo-wiki.org/), “which enables volunteers from around the world to help make land cover maps more accurate” (p. 99).

8.2 MODELING CROPS COUNTY ESTIMATES IN THE UNITED STATES

Crops county estimates are used for many purposes. County yield data from surveys are used by USDA for various programs, including those administered by USDA’s Farm Service Agency and Risk Management Agency. For example, when a natural disaster such as drought or flooding impacts crop production, these data are crucial to the agriculture industry. They are also used by government agencies, researchers, and organizations “to determine many production and economic values on a small area basis” (Schnepf, 2017, p. 17).

County-level crop estimates published before 2020 were the result of an expert review process directed by the USDA Agricultural Statistics Board, which considered survey data (in particular, the quarterly Agricultural Surveys and CAPS) and other sources of information (such as administrative records from FSA and RMA) when determining an official estimate for each county (NASS, 2012; NASEM, 2017b; Cruze et al., 2019). To ensure consistency across geographic units of varying sizes, the Agricultural Statistics Board first determined the final national and state estimates for crop yield, acreage, and production. They then set estimates for agricultural statistics districts (sets of contiguous counties) and counties, ensuring that county totals summed to district totals, and district totals summed to state totals. Once the official estimates were approved, they were subject to NASS production standards for confidentiality and consistency across different-sized geographic units (Cruze et al., 2019).

Although the historical process made use of administrative records information such as that from the FSA and RMA, that information was incorporated through the expert judgment of the Agricultural Statistics Board, not through a statistical model. The process of manual assessment of separate inputs was time consuming and needed to be repeated for each state and commodity separately. Young (2022) noted that because of the subjective input from the Agricultural Statistics Board, there was a “lack of transparency and reproducibility.” In addition, no measures of uncertainty (such as margins of error) were reported.

The National Academies’ report Improving Crop Estimates by Integrating Multiple Sources recommended that NASS revise the county-level crop estimates program by using statistical models that rely on multiple data sources (see Box 8-1). They recommended development of small area statistical models as in the U.S. Census Bureau’s Small Area Income and Poverty Estimates program (see Box 2-2). The vision for 2025 had three components:

First, NASS prepares its county estimates using a transparent and well-documented process, publishing measures of uncertainty along with point estimates. Second, the NASS list frame is a georeferenced farm-level database, serving as a sampling frame for surveys and facilitating the use of farm data in statistical analysis. Third, NASS acquires all relevant georeferenced administrative and remotely sensed and ground-gathered information and uses this information to complement its traditional survey data (NASEM, 2017b, p. 17).

NASS has taken important steps toward realizing the vision in Improving Crop Estimates by Integrating Multiple Sources. Successive stages in the model-development process to include non-survey sources of data have been documented in a series of journal articles and conference presentations.¹⁸ The current panel anticipates that NASS will issue an official methodology report that consolidates the information in the research reports and describes the current production models, as that will provide important documentation for data users.

The models that have been developed can be viewed as extensions of those used for the Small Area Income and Poverty Estimates program, with additional features to meet the special challenges of producing crops county estimates, such as the additional information from FSA and RMA that can be used to set a lower bound on planted acreage in each county. Separate models were needed for planted acres, harvested acres, and yield (or production) for each commodity.¹⁹

A model-based estimate for the number of acres planted to a particular crop in a county relies on a direct estimate for that county (computed from the survey data), auxiliary information from administrative data (which provide a lower limit for planted acreage), and other sources of covariates. For the model described in Erciulescu, Cruze, and Nandram (2019), the direct estimate for acreage came from CAPS and the auxiliary data considered as covariates included:

___________________

¹⁸ See, for example, Cruze et al. (2019); Erciulescu, Cruze, and Nandram (2018, 2019, 2020); Chen and Nandram (2022); Chen, Nandram, and Cruze (2022); and Nandram et al. (2022).

¹⁹ Young and Chen (2022) noted that because production is the product of yield and harvested acres, only three models are needed for each commodity: planted acres, harvested acres, and either yield or production.

Planted acreage totals reported to FSA;
Insurance claims totals for failed acreage reported to RMA;
Information on maximum planted acreage for each operator in the NASS list sampling frame;
Cropland Data Layer information on planted acreage, derived from satellite imagery; and
Monthly weather variables from the National Oceanic and Atmospheric Administration.
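To illustrate how a direct survey estimate and covariate-based information can be combined, the following is a minimal sketch of the composite estimate from a basic area-level (Fay-Herriot-type) model, the family of models associated with the Small Area Income and Poverty Estimates program. It is not the hierarchical Bayesian subarea model used by NASS. The function names are hypothetical, and the sketch assumes that the regression prediction, the model variance, and the sampling variance are already known.

```python
def shrinkage_weight(model_variance, sampling_variance):
    """Weight given to the direct estimate: larger when the survey estimate is precise."""
    return model_variance / (model_variance + sampling_variance)


def composite_estimate(direct, synthetic, model_variance, sampling_variance):
    """Combine a direct survey estimate with a regression (synthetic) prediction.

    direct            -- county estimate computed from survey data alone
    synthetic         -- prediction from covariates (for example, administrative totals)
    model_variance    -- variance of the county-level random effect (assumed known)
    sampling_variance -- sampling variance of the direct estimate (assumed known)
    """
    gamma = shrinkage_weight(model_variance, sampling_variance)
    return gamma * direct + (1.0 - gamma) * synthetic
```

Under these assumptions, a county with a noisy direct estimate is pulled toward the covariate-based prediction, while a county with a precise direct estimate stays close to its survey value.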

Estimates produced at differing levels of geography needed to be consistent with each other, with county totals aggregating to agricultural statistics district totals, and district totals aggregating to state totals. This was done by estimating acreage for both agricultural statistics districts and counties in the same model, thereby ensuring that the estimates for counties within a district summed to the district estimate.²⁰ At the end of the estimation process, district and county estimates were multiplied by a common factor that ensured they summed to state-level estimates.
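The final scaling step can be sketched as simple ratio benchmarking. This is an illustration under assumptions: the function names and the county and district labels are hypothetical, and the sketch takes for granted that county estimates already sum to their district estimates before scaling.

```python
def common_factor(county_estimates, state_total):
    """Factor that makes the county estimates sum to the state-level estimate."""
    current_total = sum(county_estimates.values())
    if current_total <= 0:
        raise ValueError("county estimates must have a positive sum")
    return state_total / current_total


def apply_factor(estimates, factor):
    """Multiply every estimate by the same factor, preserving relative sizes."""
    return {area: value * factor for area, value in estimates.items()}


# Hypothetical example: three counties in two districts.
counties = {"county_a": 100.0, "county_b": 300.0, "county_c": 600.0}
districts = {"district_1": 400.0, "district_2": 600.0}
factor = common_factor(counties, state_total=1100.0)
benchmarked_counties = apply_factor(counties, factor)
benchmarked_districts = apply_factor(districts, factor)
```

Because the same factor is applied at both levels, counties that summed to their district before scaling still do so afterward.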

Logical constraints among the quantities measured—for example, the number of harvested acres for a county must be less than or equal to the number of planted acres—were also incorporated into the small area models. Participation in the FSA and RMA programs is voluntary (see Section 8.1), so total planted acres from those administrative datasets would miss the acreage from nonparticipating farm operators. Totals from the administrative records could, however, be viewed as “informative lower bounds” for the planted acreage in each county, and Chen, Nandram, and Cruze (2022) incorporated constraints into the planted acreage model by requiring the county-level estimate of planted acreage to be at least as large as the maximum acres planted to the crop, as determined from FSA and RMA values. Similar constraints were introduced for other models (for example, the RMA value for failed acreage provided a lower bound for that quantity).
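The constraints described above can be written as simple inequalities. The sketch below checks them for one county. It reads the planted-acreage lower bound as the larger of the FSA and RMA planted totals, which is an interpretation of the text, and the function names are hypothetical. In the cited work the constraints are built into the model itself rather than checked afterward.

```python
def planted_lower_bound(fsa_planted, rma_planted):
    """Administrative lower bound on planted acres: the larger of the two reported totals."""
    return max(fsa_planted, rma_planted)


def satisfies_constraints(planted, harvested, fsa_planted, rma_planted):
    """Check the logical constraints for one county.

    planted   -- estimated planted acres
    harvested -- estimated harvested acres
    """
    meets_lower_bound = planted >= planted_lower_bound(fsa_planted, rma_planted)
    harvested_within_planted = 0 <= harvested <= planted
    return meets_lower_bound and harvested_within_planted
```

For example, an estimate of 5,000 planted acres would violate the bound in a county where FSA records already show 5,200 planted acres.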

Ensuring county-district-state agreement and incorporating lower bounds from FSA and RMA into the estimation process “led to estimates that were consistent with the expert opinion used by the members of the Agricultural Statistics Board, which enabled the model to be considered for production” (Young & Chen, 2022, p. 890). According to Young (2022, slide 13), models for 13 crops were used for production estimates beginning in 2020, after rounding and review by state field office staff.

___________________

²⁰ Erciulescu, Cruze, and Nandram (2019) accomplished this with a hierarchical Bayesian subarea level model, in which the areas were agricultural statistics districts, and the subareas were counties. As described in Chen, Cruze, and Young (2021), the full process uses three univariate Bayesian subarea models working in concert: (1) a planted area model constrained by known minimum administrative totals from FSA and RMA; (2) a harvested area model that uses survey harvested-to-planted ratio and transformation to produce coherent harvested area totals; and (3) a crop yield model with geographic benchmarking to generate distributions and summaries for crop production totals.

Moving to small area estimates based on statistical modeling has had several advantages. First, the models have increased the transparency and reproducibility of the estimate-production process. Second, the models allow calculation of measures of uncertainty, such as variances or coefficients of variation, about the estimates. And third, “the automation of modeling, rounding, and enforcing coherence across geospatial scales has led to a substantial savings in staff time” (Young & Chen, 2022, p. 895).
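As an illustration of the second advantage, a Bayesian model yields a set of posterior draws for each county, from which a point estimate and its measures of uncertainty follow directly. The sketch below is an assumption about how such summaries might be computed, not a description of NASS production code, and the function name is hypothetical.

```python
from statistics import mean, variance


def summarize_draws(draws):
    """Return (estimate, variance, coefficient of variation) from posterior draws."""
    estimate = mean(draws)
    if estimate == 0:
        raise ValueError("coefficient of variation is undefined for a zero estimate")
    var = variance(draws)  # sample variance of the draws
    cv = (var ** 0.5) / estimate  # standard deviation relative to the estimate
    return estimate, var, cv
```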

Young and Chen (2022) described the process of moving these estimates into a production mode:

Transitioning to these models being the foundation for major survey programs including those associated with the principal federal economic indicators has required substantial changes in the final stages of the NASS processes and a major cultural shift…. For the reviews within the state field offices and by the Agricultural Statistics Board, tools are available to facilitate the review process, but were not designed for the inclusion of modeled estimates or their measures of uncertainty. These tools had to be revised to integrate the modeled estimates into the review process. Following the 2020 growing season, small area models became the foundation for crop county estimates for the 13 nationally reported crops (p. 893).

CONCLUSION 8-1: The National Agricultural Statistics Service has made substantial progress in the difficult process of developing models to produce crop estimates at different levels of geography. Important advances include producing objective estimates with measures of uncertainty.

8.3 MODELING CROP ESTIMATES IN CANADA

NASS county-level crop estimates rely on survey data as the basis for the modeling. In Canada, models have been used to completely replace some surveys. Nkwimi-Tchahou et al. (2022) developed models for estimating mid-season field crop yields from alternative data sources that could potentially be used to replace survey estimates. Traditionally, Statistics Canada conducted six field crop surveys each year, three of which asked about crop yields. The July and September surveys dealt with mid-season estimated crop yields, and the November survey asked about actual yields for the season. But mid-season estimates usually underestimated the final values for the actual yields. Nkwimi-Tchahou et al. (2022, slide 2) asked: “Can we make use of alternative data sources to reduce cost and response burden and produce a mid-season set of estimates of equal or better quality than the mid-season surveys provide?”

Data sources they considered as sources of predictor variables included:

A weekly Normalized Difference Vegetation Index computed from satellite imagery (in general, higher values are associated with higher crop-yield potentials);
Agroclimatic data about temperature, precipitation, hours of sunshine, soil moisture, and other characteristics; and
Crop insurance data provided by provincial crop insurance corporations, with information on “the location of the land parcel, what crop is being grown, the acreage of each crop sown and, after the growing season, the resulting yield” (Nkwimi-Tchahou et al., 2022, slide 4).
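For readers unfamiliar with the first predictor, the Normalized Difference Vegetation Index for a pixel is conventionally computed from near-infrared and red reflectance as (NIR − Red) / (NIR + Red). The sketch below shows that standard formula only. How Statistics Canada aggregates pixel values into a weekly index is not described here, and the function name is hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from near-infrared and red reflectance."""
    denominator = nir + red
    if denominator == 0:
        raise ValueError("NDVI is undefined when NIR + Red is zero")
    return (nir - red) / denominator
```

Dense green vegetation reflects strongly in the near-infrared and absorbs red light, so it produces values closer to 1, while bare soil produces values near 0.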

Previous research reported in Brisbane and Mohl (2014), Reichert et al. (2016), and Statistics Canada (2020b) investigated models that could be used to produce mid-season estimates of crop yield and production that could, potentially, replace estimates from the September Farm Survey. Predictor variables included Normalized Difference Vegetation Index and agroclimatic data available in August, as well as information from the July Farm Survey; these models did not include crop insurance data. Accuracy of estimates was evaluated by comparing estimates to final yields from the November survey. Reichert et al. (2016, p. 11) found that “estimates produced by the yield model were comparable to those produced by the September Farm Survey in terms of relative difference from the November Farm Survey estimates for the 15 crops modelled.” As a result, Statistics Canada decided to replace the September Farm Survey with estimates of field crops from the model, resulting in less burden on survey respondents and reduced costs, as well as earlier publication of mid-season estimates. Reichert et al. (2016, p. 12) noted that this “replacement of a statistical field crop survey with a remote sensing model-based administrative approach is a first for any statistical agency worldwide.”
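The evaluation criterion, relative difference from the November estimate, can be sketched as follows. The function names and the idea of averaging absolute relative differences across crops are illustrative assumptions, not the exact procedure in the cited reports.

```python
def relative_difference(estimate, final_value):
    """Relative difference of a mid-season estimate from the final (November) value."""
    if final_value == 0:
        raise ValueError("final value must be nonzero")
    return (estimate - final_value) / final_value


def mean_absolute_relative_difference(estimates, final_values):
    """Average absolute relative difference across crops, keyed by crop name."""
    differences = [
        abs(relative_difference(estimates[crop], final_values[crop]))
        for crop in final_values
    ]
    return sum(differences) / len(differences)
```

Computing this summary once for the model estimates and once for the September survey estimates gives a like-for-like comparison against the same November values. A negative relative difference indicates that the mid-season estimate fell below the final value.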

Statistics Canada researchers then explored whether both the July and September surveys could be eliminated (Nkwimi-Tchahou et al., 2022). They dropped July yield as an explanatory variable and included crop insurance data. Crop insurance data presented additional challenges for model building. The first challenge was acquiring access to the data from data providers, and data were not available for all provinces. Moreover, the data structure varied across provinces, with some provinces having more information than others. Undercoverage was also a challenge (as with the USDA’s RMA data; see Section 8.1) because not all crops are insured. Finally, when multiple crops were grown within a parcel, insurance data did not indicate where, within the parcel, each type of crop was grown, which prevented associating each crop with its exact Normalized Difference Vegetation Index.

The model with crop insurance data was first studied with data from Manitoba. To simplify modeling, mixed-crop parcels were dropped from the model, and estimates were adjusted to compensate. It was also assumed that uninsured crops have yields similar to those of insured crops. Technical details of the model, as well as estimates and their coefficients of variation, are given in Statistics Canada (2020a).

Nkwimi-Tchahou et al. (2022) reported that, despite the challenges in acquiring and standardizing data, the July estimates from the model were much closer to the November values than estimates from the July survey. Furthermore, using the models, they could publish estimates for additional, less-common, crops that could not be published from survey-based estimates. Nkwimi-Tchahou et al. (2022, slide 16) concluded that “[t]he methodology has shown to be a good replacement for mid-season surveys estimates,” in particular because of the cost savings and reduced burden on survey respondents.

One challenge in relying entirely on model-based estimates is the assumption that the relationship between the predictor variables and the response variable is the same for the predicted years as it was for the dataset used for model development. Nkwimi-Tchahou et al. (2022, slide 16) noted that the model had more difficulty in extreme years such as 2021, a year of severe drought for Alberta, Saskatchewan, and Manitoba. In addition, the models “still generally underestimate the values from the end of season survey.” They suggested that adding variables to the models, or using machine-learning methods, might further improve predictions.

The modeling efforts of Statistics Canada demonstrate the promise of using satellite imagery along with administrative records data for producing crop estimates that could replace estimates from surveys. As a result, Statistics Canada was able to reduce the number of Field Crop Surveys from six to four (March, June, November, and December), while relying on model-based estimates of yields and production based on satellite imagery in July and September.²¹

8.4 OPPORTUNITIES FOR IMPROVING AGRICULTURAL STATISTICS

The Data Foundation and AGree Initiative (2022) argued that farmers today face unprecedented challenges including supply chain disruptions and extreme weather events, and that timely and accurate data are essential for addressing critical issues related to food and agriculture:

___________________

²¹ https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=3401

Modernizing the national data infrastructure for the agricultural sector is the linchpin to provide critical agricultural insights, improve the effectiveness of farm bill programs, and deliver better value for farmers and taxpayers. Harnessing existing data from government, industry, and individual sources has the potential for farmers to work in a more productive, streamlined manner and economically empower rural America (p. 5).

Many opportunities exist for continued improvement of the accuracy, timeliness, detail, and transparency of agricultural statistics through the use of multiple data sources. In the short term, continued research on small area models is likely to be fruitful in producing even more accurate estimators, through inclusion of additional predictor variables (perhaps acquired from new data sources) and new developments from statistical research. Other approaches to improving small area models could also be considered, including further investigations into the properties of the datasets used as model inputs, or exploring groupings of counties other than agricultural statistics districts (NASEM, 2017b, p. 95). Young (2022, slide 16) reported that NASS is investigating the use of drones and in situ sensors to provide data, although establishing a nationwide system would be costly.

There are also opportunities for continuing to improve data equity for agriculture. Presenting an analysis of U.S. farm owners, operators, and workers by race, ethnicity, and gender, Horst and Marion (2019, p. 14) noted the importance of producing statistics that are disaggregated by these characteristics, and concluded: “Survey data should also enable intersectional analysis across race, ethnicity and gender, at national, regional state and county-levels…. We also urge collection of more detailed demographic data following emerging best practices.”

Remotely sensed data can provide information about which crops are grown, but not about whose crops they are. Survey data or administrative records are needed to answer questions about demographic characteristics of farm owners and workers, and about impacts of USDA programs on small-scale, female, minority, and new farmers. Roberts and Hernandez (2021, p. 4) argued that it is important for population groups to have more than mere representation in the data and “that there is a compelling need to improve the participation of women, people living with disabilities, and other marginalized groups in all aspects of open data for agriculture and nutrition.”

For county-level crop estimates, equity aspects could be explored by comparing measures of uncertainty about model inputs and outputs with county-level statistics about poverty, race, ethnicity, and other characteristics calculated from the decennial census or American Community Survey. An important part of data equity is identifying areas with less accurate estimates and taking steps to improve those estimates. As with the income studies in Section 5.4, it may be possible to use administrative records to study survey measurement properties and nonresponse bias.

In the longer term, improvements in data quality from non-survey sources may reduce dependence on surveys in the future. One promising area is exploring the potential for using private-sector precision-agriculture data. This would involve investing in an infrastructure for using such data that includes data standards, system interoperability, incentives for data holders to provide their data, cybersecurity, and consideration of data equity. The Data Foundation and AGree Initiative (2022, p. 3) described eight attributes that would be key for this infrastructure: “farmer and public trust, privacy and confidentiality protections, independence, data acquisition, scalability, stable funding, oversight and accountability, and intergovernmental support.” A pilot study, in which a probability sample of farm operators was selected to supplement or replace their survey data with data from internal operations, would provide information about ways of using these data to shift burden away from survey respondents.

With the increasing availability of satellite remote sensing, on-the-ground sensor networks, and social media, there is a great opportunity to improve agricultural statistics by combining these data sources at fine spatial and temporal scales. The spatial and temporal resolution of these data sources tends to become increasingly detailed as technology advances (Wang & Goodchild, 2019). While rapid change and the variety of such data sources often translate into higher uncertainty for analysis results and can affect scientific reproducibility, there are new opportunities for geospatial analysis and statistical approaches to support scalable data integration with adequate uncertainty quantification (Wang, 2016). Recent advances in artificial intelligence and machine learning provide an opportunity to harness diverse data sources for improving prediction of crop types and yields at various spatial and temporal scales (Cai et al., 2018; Jiang et al., 2019). Integration of such advances with cyberinfrastructure and cyberbased geospatial information systems and science (cyberGIS) is important to the data-intensive transformation of national agricultural statistics leading to more intelligent, robust, and transparent outcomes (Lyu et al., 2022).

CONCLUSION 8-2: Remotely sensed data have great potential for improving agricultural production models. The resolution and quality of the data are important considerations when choosing appropriate geographic units for modeling and analysis. Private-sector data, such as data from precision agriculture, could also be of value if data-sharing mechanisms that protect privacy can be developed. Data sharing could be improved by cross-agency cooperation to develop and use interoperable geographic units, and by development of quality standards for non-survey data.
