**Suggested Citation:** "Chapter 2 - Interpreting Monitoring Data." National Academies of Sciences, Engineering, and Medicine. 2017. *Interpreting the Results of Airport Water Monitoring*. Washington, DC: The National Academies Press. doi: 10.17226/24752.



# Chapter 2: Interpreting Monitoring Data

## 2.1 Introduction

This chapter presents guidance on the interpretation of water monitoring data. Interpretation of monitoring data is defined as the process of assessing the monitoring data set as a whole to provide quantitative descriptors of the data set's characteristics. Those descriptors (e.g., average, 95th percentile, standard deviation) can then be used in meeting the application objectives for the monitoring situation. The process of interpreting the data involves the following principal steps:

1. Verifying the accuracy and representativeness of the monitoring data that has been acquired
2. Analyzing the data to characterize the data set as a whole and comparing the data features to known limiting values, expected results, or standard conditions

Data interpretation in this context is essentially a mathematical exercise. The objective is to quantify the magnitude, variation, trends, and relationships of water monitoring parameters in a way that supports applying the results to actions the airport may take. Figure 11 lists the key concepts in the chapter. In the electronic version of this document, hyperlinks to the referenced sections are provided. Key terms used in this chapter are defined in the following section.

## 2.2 Terminology Critical for Interpreting Monitoring Data

A variety of statistical and laboratory analysis terms are used in reference to stormwater monitoring data interpretation at airports. Although many of these terms are used interchangeably by airport staff and consultants, they are uniquely defined in this section to facilitate a consistent understanding of the detailed concepts presented herein. Additionally, the glossary provides definitions for a variety of other terms that are relevant to overall guidebook content.

**Accuracy** - A measure of the closeness of an individual measurement, or the average of a number of measurements, to the true value. Accuracy includes a combination of random error (precision) and systematic error (bias) components that are due to sampling and analytical operations; the U.S. EPA recommends using the terms "precision" and "bias," rather than "accuracy," to convey the information usually associated with accuracy (U.S. EPA, 1998).

**Bias** - The systematic or persistent distortion of a measurement process, which causes errors in one direction (i.e., the expected sample measurement is higher or lower than the sample's true value) (U.S. EPA, 1998).

**Data analysis** - The process of summarizing data using statistical quantities and graphics to assist in the interpretation and application of monitoring results.

**Descriptive statistic** - A quantity computed from the data to describe or summarize the data. Example descriptive statistics include the mean, median, and standard deviation.

**Detection limit** - The minimum concentration of a pollutant that a laboratory can measure and report with 99 percent confidence that its concentration is greater than zero, as determined by a specific laboratory method.

**Precision** - A measure of mutual agreement among individual measurements of the same property, usually under prescribed similar conditions, generally expressed in terms of the standard deviation (U.S. EPA, 1998).

**Probability distribution** - A theoretical or empirical description of the frequency or probability of various parameter quantities. Common theoretical distributions for water resources data are the Gaussian (or normal), lognormal, exponential, and gamma distributions.

**Quantitation limit** - The level at which a laboratory can reliably report concentrations with a specified level of error.

**Random error** - Unknown or unpredictable changes in the measuring instrument or in the environmental conditions or variables that may affect measured quantities.

**Representativeness** - A qualitative term that expresses the degree to which data accurately and precisely represent a characteristic of a population.
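The descriptive statistics defined above can be computed directly. A minimal Python sketch, using hypothetical concentration values (the 95th percentile here uses linear interpolation between closest ranks, one of several common percentile conventions):

```python
import statistics

# Hypothetical storm-event concentrations (mg/L) for a single parameter
concs = [2.1, 3.4, 1.8, 5.2, 2.9, 4.1, 3.0, 6.8, 2.5, 3.7]

mean = statistics.fmean(concs)
median = statistics.median(concs)
stdev = statistics.stdev(concs)  # sample standard deviation

# 95th percentile: cut point 95 of 100, interpolated within the data range
q = statistics.quantiles(concs, n=100, method="inclusive")
p95 = q[94]
```

Different percentile conventions (inclusive vs. exclusive interpolation) can give noticeably different answers for small data sets, so the convention used should be reported alongside the statistic.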
**Systematic error** - Also referred to as bias; distortion caused by consistently high or low values reported by an instrument or analyst due to a variety of potential issues, such as device malfunction, calibration errors, inexperience, or incorrect application of monitoring protocols.

Figure 11. Key concepts for the interpretation of water monitoring data:

- 2.3 Verifying the Accuracy and Representativeness of Raw Data
  - 2.3.1 Potential Measurement Errors
  - 2.3.2 Effects of Ambient Conditions and Background Concentrations
  - 2.3.3 Analytical Considerations
  - 2.3.4 Parameter Relationships
  - 2.3.5 Concurrent Airport Activities and Events
  - 2.3.6 Applicability Ranges and Errors in Flow Monitoring
- 2.4 Analyzing the Data
  - 2.4.1 Uses of Statistical Analyses of Water Monitoring Data
  - 2.4.2 Descriptive Statistics
  - 2.4.3 Graphical Data Analysis
  - 2.4.4 Comparative Data Analysis and Hypothesis Testing
  - 2.4.5 Trend Analysis
  - 2.4.6 Hydrograph Analysis
  - 2.4.7 Censored Data (Nondetects)
  - 2.4.8 Bootstrap Methods

## 2.3 Verifying the Accuracy and Representativeness of Raw Data

Chapter 1 presented guidance that airport staff can use to acquire monitoring data. By consistently applying sound data acquisition techniques, users can improve accuracy and representativeness while reducing error. However, even a perfectly planned and executed data acquisition process cannot fully eliminate inaccuracy in data sets or guarantee that data is appropriate for use in detailed analysis or decision making. Therefore, before using the acquired data, airport staff should always assess acquired data sets to verify, to the extent possible, the accuracy and representativeness of the data.

U.S. EPA defines accuracy as a measure of the overall agreement of a measurement with a known value. Accurate measurements require that random errors and systematic errors associated with both sampling and analytical operations be minimized. Random errors influence the precision of data values and are caused by unknown or unpredictable changes in the measuring instrument or the environmental conditions. Systematic errors (or bias) are caused by consistently high or low values reported by an instrument or analyst due to a variety of potential issues, such as device malfunction, calibration errors, inexperience, or incorrect application of monitoring protocols. U.S. EPA recommends using the terms "precision" and "bias," rather than "accuracy," to convey the information usually associated with accuracy. Precision is defined by U.S. EPA as a measure of mutual agreement among individual measurements of the same property, usually under prescribed similar conditions, generally expressed in terms of the standard deviation; bias is defined by U.S. EPA as the systematic or persistent distortion of a measurement process, which causes errors in one direction (i.e., the expected sample measurement is different from the sample's true value) (U.S. EPA, 1998).
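The precision/bias decomposition described above can be illustrated numerically. A minimal sketch, assuming five hypothetical repeated readings of a known 10.0 mg/L calibration standard:

```python
import statistics

def bias_and_precision(measurements, true_value):
    """Split measurement error into bias (systematic) and precision (random).

    bias: offset of the mean measurement from the known true value
    precision: sample standard deviation of the repeated measurements
    """
    bias = statistics.fmean(measurements) - true_value
    precision = statistics.stdev(measurements)
    return bias, precision

# Hypothetical readings that are consistently high (positive bias)
# but tightly grouped (good precision).
bias, precision = bias_and_precision([10.4, 10.5, 10.3, 10.6, 10.4], 10.0)
```

In this example the instrument is precise but biased: repeating the measurement more times would not remove the roughly +0.44 mg/L offset; only recalibration would.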
Minimizing random errors (i.e., overall data variability) requires controlling, to the extent possible, environmental conditions or variables that may affect measured quantities but are not intended to be analyzed (via correlation or regression) with the collected data. In a laboratory setting, random errors are minimized by isolating the analysis from outside variables. In the field, however, this is typically not possible because of the number of potential variables affecting a measured result. Hence, the random error of field-collected data is typically much higher due to the cumulative effects of multiple uncontrolled variables.

Minimizing systematic errors requires continual verification that the measuring system is functioning properly and that field data collection personnel are consistently using appropriate techniques. Training and supervision are required to attain measurements within prescribed accuracy bounds. For example, accurate application of flow-measuring devices generally depends on standard designs or careful selection of devices, careful fabrication and installation, good calibration data and analyses, and proper user operation with sufficiently frequent inspection and maintenance procedures.

Representativeness is a qualitative term that expresses the degree to which data accurately and precisely represent a characteristic of a population. An assessment of representativeness entails evaluating whether measurements are made and physical samples collected in such a manner that the resulting data appropriately reflect the environment or condition being measured. A perfectly representative data set would include continuous measurements of all relevant parameters at a location under the complete range of conditions that contribute to the pollutant and flow characteristics at the monitoring point. Since this is not feasible, no monitoring data set is fully representative. Since there is also no standardized benchmark for assessing representativeness, there is an element of judgment involved in determining how representative the data set is for a particular site and set of conditions. As discussed later in this chapter, some statistical techniques can be used to inform this judgment. Factors that influence data accuracy and representativeness are discussed in the following text.

### 2.3.1 Potential Measurement Errors

Measurement errors can occur in the sample collection and analysis stages. Measurement error is influenced by the inherent variability of the sampled population over space and time, the sample collection design, and the number of samples collected. Limited sampling may lead to exclusion of some of the natural variation of the measurement of interest. Sampling design error occurs when the data collection design does not capture the complete variability within the sampled population space, to the extent appropriate for making conclusions. Measurement error can lead to random error (i.e., random variability or imprecision) and systematic error (bias) in estimates of population parameters, such as the mean and standard deviation. Measurement errors can be introduced in several ways, including field sampling error, variation in the conditions in which the measuring process is conducted, the precision of the measuring instrument, and the accuracy of the instrument calibration. Several commonly experienced sources of measurement error are listed in this section's Topical Tips box.

**Topical Tips: Situations that Lead to Measurement Error in Field Sampling and Monitoring**

1. Collecting Non-representative Samples
   - Low flow or stagnant flow conditions
   - Highly variable flows
   - Sampling from discharges with downstream flow constrictions (e.g., water body tailwater, pumps, weirs)
   - Highly variable pollutant conditions
   - Sampling at a non-representative location because of safety considerations
   - Co-mingled flows upstream of the sample point
   - Co-mingled flows downstream of the sample point that affect stream water quality
   - Use of grab samples instead of composite samples
   - Use of time-composite samples instead of flow-composite samples
   - Biasing monitoring results by collecting samples at "cleaner" locations or times
   - Collecting samples during the incorrect portion of an event
   - Collecting samples during times other than trigger conditions in the permit
   - Collecting an insufficient number of samples
2. Installation and Set-up Issues for Field Sampling and Monitoring Devices
   - Incorrect mounting of sample probes and monitoring instruments
   - No instrument calibration performed
   - Incorrect calibration procedure
3. Sample Handling
   - Mislabeling sample containers
   - Mismatching of chain-of-custody forms, sample bottle labels, and sampling logs
   - Using incorrect sample jars
   - Providing insufficient sample volume (e.g., not filling volatile organic compound vials)
   - Using incorrect sample preservatives
   - Introducing contamination into the sample
   - Using composite samples when grab samples are required (e.g., oil and grease)

4. Maintenance Issues
   - Loose monitoring probes
   - Frozen sample inlets
   - Fouled sample inlets
   - Sedimentation over the sample inlet
   - Damage to sampling equipment from debris
   - Power failures during part of the sampling cycle

Sample collection procedures should follow approved quality assurance project plans and standard operating procedures that are part of the water monitoring plan discussed in Chapter 1, to minimize the potential for field sampling errors. Similar quality control plans and procedures must also be followed by the analytical laboratory. Laboratory control charts are used to document process results so that adjustments can be made to keep analytical errors within acceptable limits.

Even if a monitoring plan is well developed and completely followed, airport staff should review the acquired monitoring data points, as well as the notes on field observations, to identify data points that may have obvious or potential sources of error. A case can be made for eliminating such outliers from the data set if a clear cause for the anomalous measurement is known. Care must be taken, however, not to eliminate outliers from the data set if no known cause of error can be identified, as this could bias results.

If all controllable sources of error are minimized, the uncertainty in the measurement is generally on the same order of magnitude as that of the smallest numerical value that can be estimated with the measuring instrument (usually expressed as a percentage, or relative error). Be aware that the accuracy of measurements cited by vendors of field instruments is typically determined under ideal conditions, and the true in-field accuracy of the instrument may be significantly worse. This is a key factor when selecting field monitoring instruments expected to produce accurate results at low concentrations.
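One common screening heuristic for flagging candidate outliers is the interquartile-range (IQR) fence. This technique is not prescribed by the guidebook and is only a screening step: as noted above, flagged values should be removed only when a documented cause of error exists. A minimal sketch:

```python
import statistics

def iqr_fence_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as candidate outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical event-mean concentrations (mg/L); 48.0 stands well apart.
candidates = iqr_fence_outliers([2.1, 3.4, 1.8, 2.9, 3.0, 48.0, 2.5])
```

A flagged value like 48.0 here would prompt a review of field notes and laboratory records for a cause (e.g., a fouled inlet or transcription error) before any decision to exclude it.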
The true value of the uncertainty typically falls in a range of values that reflect the experimental uncertainty of the measurement. Calculating the mean of multiple measurements (i.e., duplicates or triplicates) can provide a better estimate of the true value, and analyzing the variance of those multiple measurements can help identify whether the errors are random in nature rather than systematic.

### 2.3.2 Effects of Ambient Conditions and Background Concentrations

Background and/or ambient conditions may contribute pollutant loadings that lead to exceedances of regulatory target levels such as effluent limits or benchmarks. For example, high natural background levels of iron in soils or groundwater could cause exceedances of a benchmark value. Such conditions may result from (1) naturally occurring substances present in the environment in forms that have not been influenced by human activity and/or (2) anthropogenic substances, which are natural and human-made substances present in the environment as a result of human activities. The category of natural background pollutants does not include legacy pollutants from earlier activity on a site, or pollutants in run-on from neighboring sources that are not naturally occurring. Anthropogenic sources include run-on from adjacent properties and aerial deposition from human-made sources. Some pollutants may be present in the background as a result of both natural and anthropogenic conditions, such as naturally occurring arsenic and arsenic from treated wood, pesticide applications, or smelting operations.

The U.S. EPA MSGP states that a benchmark exceedance does not trigger a corrective action if the airport determines that the exceedance is solely attributable to natural background sources; however, in the supporting rationale, the airport must include any data previously collected by itself or others (including literature studies) that describe the levels of natural background pollutants in the airport's stormwater discharge (U.S. EPA, 2015). Quantifying natural background concentrations can be challenging. It may be possible to quantify natural background concentrations with the help of reference sites. A reference site is a monitoring location, not impacted by human development, that is located in the same watershed as the monitoring project. If such a site can be found, the assumption is that the constituents measured in the reference site sample are indicative of the natural background concentrations. Natural background concentrations could also be determined using information from a peer-reviewed publication or a local, state, or federal government publication specific to the matrix (e.g., stormwater) in the immediate region. Studies from other geographic areas, or from locations with clearly different topographies or soils, are not sufficient to meet this requirement. When no data is available and there are no known sources of the pollutant, the background concentration should be assumed to be zero. Reference site data or published data should be used with great care because its use inherently assumes that variables affecting constituent concentrations, other than the pollution-generating activities at the monitoring site, are essentially the same at the two sites and do not affect the constituent concentrations.
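The reference-site idea can be illustrated with a rough screening sketch. This is purely hypothetical and is not a substitute for the supporting rationale and data the MSGP requires; the labels and logic below are illustrative assumptions only:

```python
def background_screen(site_result, benchmark, reference_concs):
    """Screen whether a benchmark exceedance might reflect natural background.

    reference_concs: hypothetical concentrations from an unimpacted
    reference site in the same watershed. Returns an illustrative label;
    a real determination requires the documented rationale the MSGP describes.
    """
    if site_result <= benchmark:
        return "no exceedance"
    ref_max = max(reference_concs)
    # If the background site itself reaches the observed level, background
    # is a plausible contributor worth investigating further.
    if site_result <= ref_max:
        return "investigate background"
    return "likely site source"

# Hypothetical iron results (mg/L): benchmark 1.0, reference site often above it
label = background_screen(1.4, 1.0, [0.9, 1.3, 1.6, 1.1])
```

Even when the screen points to background, the caution above still applies: the comparison assumes the reference site and monitoring site differ only in the pollution-generating activities being monitored.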
Effects from ambient conditions may also be particularly important when analytical methods have very low detection limits, such as in the case of trace metals analyses, because metals are ubiquitous in the environment. Using clean sampling protocols is one means of reducing ambient contamination of environmental samples. Field blanks may be collected at a specified frequency (e.g., 20 percent of all environmental samples collected) to evaluate potential sample contamination from ambient conditions.

### 2.3.3 Analytical Considerations

**Detection Limits**

Analytical method sensitivity is a primary data quality indicator. Sensitivity is affected by the method detection limit (MDL), the instrument detection limit, and the laboratory quantitation limit. The MDL, also known as the detection level or limit, is the minimum concentration of a pollutant that a laboratory can measure and report with 99 percent confidence that its concentration is greater than zero (as determined by a specific laboratory method). The quantitation limit is the level at which a laboratory can reliably report concentrations with a specified level of error. Sample concentrations that fall between the analytical MDL and the practical quantitation limit may be reported but should be flagged by the analytical laboratory to indicate the results are estimated values.

Quantitation limits for each constituent need to be appropriate for the thresholds of concern (i.e., lower than the threshold, and ideally an order of magnitude lower). This may be challenging in cases where water quality objectives are extremely low and there is no approved analytical method that can achieve the appropriate quantitation limit. Examples include organic compounds such as PAHs, some pesticides, and mercury.

The NPDES Permit Writers' Manual states that, when effluent limits are being established, it is possible for the value of the calculated limit to fall below the MDL and the minimum level established by the approved analytical method. Regardless of whether current analytical methods are available to detect and quantify the parameter at the concentration of the calculated limitation, the limitation must be included in the permit as calculated (U.S. EPA, 2010).
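The MDL/quantitation-limit reporting convention above can be sketched as a simple qualifier assignment. The "U" and "J" letters follow common data validation usage (nondetect and estimated value, respectively); actual flag sets vary by laboratory and program, and the numeric limits here are hypothetical:

```python
def qualify_result(conc, mdl, pql):
    """Assign a reporting qualifier given a method detection limit (MDL)
    and practical quantitation limit (PQL), with MDL < PQL."""
    if conc < mdl:
        return "U"   # nondetect: not distinguishable from zero with confidence
    if conc < pql:
        return "J"   # detected but below the quantitation limit: estimated
    return ""        # reliably quantified

# Hypothetical copper results (ug/L) with MDL = 0.5 and PQL = 2.0
flags = [qualify_result(c, 0.5, 2.0) for c in [0.3, 1.2, 6.4]]
```

Results flagged "J" can still be used in data analysis, but the added uncertainty should be carried through to any comparison against thresholds of concern.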

**Dilutions**

Dilution is the act of adding distilled water and/or other preparation reagents to a sample to overcome an interferent or to bring the concentration of a target analyte back into the working calibration range of the instrument. The dilution factor is the total number of volumes, including the sample volume, in which the sample will be diluted with distilled water. The laboratory should identify any samples that were diluted for analysis and the dilution factor.

Sample dilution may have several undesirable effects. The sample quantitation limit will be raised proportionally to the amount of dilution. In addition, dilution may lessen the signal from other constituents of concern in the sample to the point that they are no longer identified. The consequence is that the sample results may be interpreted as not containing these compounds, and the false negative results may bias the sampling effort. Additionally, for organic analyses, surrogate standards that are added to each sample prior to analysis may be diluted to the point that recovery suffers or is nonexistent, which affects the ability to assess sample-specific accuracy for the analysis.

**Accuracy over Parameter Ranges**

Generally, a measured parameter value needs to be within the working range of an analytical instrument to yield accurate and representative results. The instrument working range is determined through an instrument calibration procedure that typically involves an initial calibration and continuing calibration verification at a prescribed frequency over the course of the analysis, to evaluate drift from the initial calibration. An analytical laboratory will typically dilute a sample with an elevated concentration so that the concentration is within the calibration range of the instrument. Sample results that fall outside the instrument calibration range should be qualified as estimated values.
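The proportional effect of dilution on the sample quantitation limit can be sketched directly (a hypothetical helper, with illustrative numbers; not a formula from this guidebook):

```python
def diluted_quantitation_limit(base_ql, dilution_factor):
    """The sample quantitation limit rises in proportion to the dilution.

    dilution_factor is the total number of volumes, including the sample
    volume (e.g., 1 volume of sample in 10 total volumes -> factor 10).
    """
    return base_ql * dilution_factor

# A 10x dilution turns a 2.0 ug/L quantitation limit into 20.0 ug/L,
# which may exceed the threshold of concern for other analytes in the sample.
ql = diluted_quantitation_limit(2.0, 10)
```

This is why the laboratory's reported dilution factors matter when interpreting nondetects: a "not detected" result at a dilution-raised limit says much less than one at the undiluted limit.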
**Laboratory Errors**

Certified analytical laboratories are required to implement a quality assurance and quality control (QA/QC) program that minimizes laboratory errors. The internal laboratory data verification process includes verification of the completeness, correctness, and technical compliance of the records and documentation associated with each analysis. The laboratory should produce a narrative with the analytical data report that clearly identifies any QC analyses that do not meet method criteria, or other deviations from project-specific specifications.

Manual integration by a laboratory analyst is one of the most commonly abused aspects of gas chromatography/mass spectrometry analyses. Practices may occur where integration points are moved to decrease (peak shaving) or increase (peak juicing) peak area to meet specifications. The laboratory must maintain written procedures that describe how and when the analyst should perform manual integrations. These written procedures should also describe how to note in the laboratory records and data that manual integrations were performed. Gas chromatography/mass spectrometry data systems have the ability to flag the electronic and hard-copy records of manual integrations.

The external data validator can confirm that the reported sample results make sense by checking the calculations that were used (the level of documentation provided with the data package should be discussed with the laboratory and may vary depending on regulatory requirements). Inputs to the calculation, such as dilution factors, may be checked for accuracy as well.

Due to an oversight, a laboratory may analyze a sample outside of the prescribed analytical method holding time. This will affect the quality of the data to a varying degree, depending on the magnitude of the holding time exceedance. Laboratory transcription and reporting errors are typically minimized by the use of an automated laboratory information management system to produce electronic data deliverables.

Flags

The laboratory may assign qualifiers or flags to the data to identify potential data quality problems for the data user. If flags are being used, the data user should determine whether their application was defined clearly in the data report, and whether the flags were appropriately assigned to sample results based on these definitions. After the data is received from the laboratory, additional qualifiers may be assigned when the data user/validator conducts the QA/QC review of the analytical data. Some data may need a data validation qualifier to give an indication of potential bias of the data. Data validation qualifiers may be assigned to particular sample results based on information such as laboratory qualifiers, QC summaries, and data summaries. Examples of data validation qualifiers and typical definitions are included in Table 5.

Expressing Errors

Absolute and relative methods are the standard forms for expressing errors. Absolute error is expressed as a range of values reflecting the uncertainty in the measurement and is reported in the same units as the measurement. Relative (or fractional) error is expressed as the ratio of the uncertainty in the measurement to the measurement itself. This is difficult to estimate, because it is a function of the true value of the quantity being measured, which is unknown. Typically this error estimate utilizes the measured value as the "true" value.

The type of measurement and instrumentation can provide an indication of the appropriate form of expressing errors. For example, a pressure probe used to measure depth of flow is likely to have the accuracy of the instrument expressed as a relative percentage, while readings on a staff gauge would have an absolute error related to the markings on the gauge. In these instances, the reported depth measurements would be expressed in the same manner as the precision of the measuring instrument.
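The two forms of expressing error can be sketched as follows (names are illustrative; as noted above, the relative form uses the measured value in place of the unknown true value):

```python
def absolute_error_range(measurement, abs_error):
    """Absolute form: a range in the same units as the measurement,
    e.g., a staff gauge read to within 0.01 ft."""
    return measurement - abs_error, measurement + abs_error

def relative_error(measurement, abs_error):
    """Relative (fractional) form: uncertainty divided by the measurement,
    using the measured value as the 'true' value."""
    return abs_error / measurement

low, high = absolute_error_range(1.25, 0.01)   # roughly (1.24, 1.26) ft
fraction = relative_error(1.25, 0.01)          # about 0.008, i.e., 0.8 percent
```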
Table 5. Example data validation qualifiers and definitions.

| Data Qualifier | Definition |
| --- | --- |
| U | The analyte was analyzed for, but was not detected above the level of the reported sample quantitation limit. |
| J | The result is an estimated quantity. The associated numerical value is the approximate concentration of the analyte in the sample. |
| J+ | The result is an estimated quantity, but the result may be biased high. |
| J- | The result is an estimated quantity, but the result may be biased low. |
| UJ | The analyte was analyzed for, but was not detected. The reported quantitation limit is approximate and may be inaccurate or imprecise. |
| R | The data is unusable. The sample results are rejected due to serious deficiencies in meeting QC criteria. The analyte may or may not be present in the sample. |

Source: U.S. EPA (2014).

Propagation of Errors

Quite often, measurements taken of one or more variables are used in equations to calculate the value of other variables. For example, to calculate the area of a rectangle, the length and width are usually measured. To calculate the volume of a cube, the length, width, and height are measured. Each measurement has a potential error associated with it and, as a result, the variable calculated from the combination of individual measurements will also contain some error. The magnitude

of the error in the calculated variable can be of a different order than the error associated with any one of the measurements, depending on their mathematical relationship.

2.3.4 Parameter Relationships

The results for related analytical parameters should be reviewed as an accuracy check. For example, if both total and dissolved metals analyses were performed, the dissolved metals results should not be greater than the total metals results, within an acceptable error range (e.g., 20 percent). Similarly, COD should always be greater than BOD since COD is the total mass of all chemicals in the water that can be oxidized, while BOD is the amount of food (or organic carbon) that bacteria can oxidize.

2.3.5 Concurrent Airport Activities and Events

It is important to evaluate the types of operational activities that occurred at the time of or recently prior to sample collection activities. For example, collecting a sample in a drainage area where fuel was recently spilled could result in elevated concentrations of PAHs in the sample. Therefore, analytical results should be reviewed with respect to whether they reflect typical operations or were potentially affected by an abnormal activity that could affect the sample representativeness. An evaluation of whether current sample results vary significantly from the historic data can also be useful in identifying abnormal activities that could skew the data or result in uncharacteristic outliers.

2.3.6 Applicability Ranges and Errors in Flow Monitoring

Flow monitoring is a special case when considering the applicability ranges and accuracy of monitoring parameters. Monitoring of flow rates and flow volumes is performed to some degree by many airports. In most situations, the flow monitoring occurs at surface water outfalls, discharge points to local storm sewers, or sanitary sewers.
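The parameter relationship checks described in Section 2.3.4 can be sketched as simple consistency tests (function names and the 20 percent allowance shown are illustrative):

```python
def dissolved_not_above_total(dissolved, total, tolerance=0.20):
    """Dissolved metals should not exceed total metals beyond an error allowance."""
    return dissolved <= total * (1.0 + tolerance)

def cod_not_below_bod(cod, bod):
    """COD should always be greater than or equal to BOD."""
    return cod >= bod

# 0.11 mg/L dissolved vs. 0.10 mg/L total is within a 20 percent allowance,
# but 0.15 mg/L dissolved vs. 0.10 mg/L total should be investigated.
ok = dissolved_not_above_total(0.11, 0.10)
suspect = not dissolved_not_above_total(0.15, 0.10)
```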
In some cases, monitoring of flow at points internal to the airport storm sewer system or flow monitoring within stormwater control processes is performed. Being able to rely on the accuracy of the flow measurements is important in many phases of airport operations, including permitting, compliance, planning, design, operations, and finances. A list of common flow monitoring applications is provided in Table 6.

The stormwater flow monitoring at airports typically requires mechanisms to measure flow in open channels or pipes not flowing full. Open channel and open pipe flow measurement can take many forms but typically involves calculation of flow rates and volumes based on depth and/or velocity measurements. Airports using pump stations to convey stormwater can use monitoring devices for closed pipes that are flowing full (e.g., magnetic flow meters).

When considering the true accuracy of flow monitoring techniques, it is important to differentiate between "factory accuracy" and "field accuracy." While many flow measurement devices and mechanisms can produce accuracies of ±5 percent within their intended operational range, and some devices are capable of ±1 percent under laboratory settings, accuracies in the field are typically not to the same level. As a general rule, flow measurement techniques in closed pipes flowing full are more accurate than techniques used to measure flow in open channels or partially full pipes. Flow monitoring devices for full pipes (most typically magnetic flow meters) are factory-calibrated and are not significantly affected by field conditions if installed to the manufacturer's specifications. Besides verifying that pipes monitored by a magnetic flow meter are full, the most important design criterion to support accurate measurement is setting the flow meter far enough downstream from pumps, bends, and other turbulence-generating features.
For more information about parameters associated with airport activities, see Section 1.4.2, Typical Monitoring Parameters.

For more information about correlation and regression techniques, see Section 2.4, Analyzing the Data.

Flow monitoring of stormwater in open channels or partially full pipes, on the other hand, requires field calibration and is subject to many field variables that can affect accuracy. These kinds of flow monitors are often systems rather than single devices. Design features that eliminate or minimize field variables that affect flow readings should be added (e.g., weirs, flumes, appropriate level sensors). Regular maintenance of these flow monitoring sites is also required. Selecting a flow monitoring system that is not appropriate for the site conditions can result in a non-standard installation and reduced accuracy, sometimes greater than ±10 percent. Therefore, flow monitoring systems should be designed and assessed prior to any significant use of the flow data requiring accurate measurement. Regular maintenance should also be performed to maintain accuracy.

Sources of stormwater discharge measurement error in flow monitoring devices include the choice of monitoring method or device, inherent error associated with the primary measurement device (e.g., weir, flume), and error associated with a secondary measurement device (e.g., pressure transducer, staff gauge). Other sources of error may include incorrect measurement techniques, improperly installed equipment (including installing equipment where deviations from the standard velocity profile occur), and improperly maintained equipment and/or environmental damage (e.g., corrosion). Deviation from a normal transverse or vertical flow distribution, or the presence of water surface boils, eddies, or local fast currents, is reason to suspect the accuracy of a flow-measuring device. Errors of 20 percent are common, and errors as large as 50 percent or more may occur if the approach flow conditions are very poor.
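When several independent error sources act on one measurement (for example, a primary device such as a flume plus a secondary device such as a level sensor), a common engineering convention, not prescribed by this guidebook, is to combine the relative errors in root-sum-square fashion. A minimal sketch with illustrative values:

```python
from math import sqrt

def combined_relative_error(*component_errors):
    """Root-sum-square combination of independent relative errors."""
    return sqrt(sum(e * e for e in component_errors))

# Illustrative values: a 3 percent flume rating error, a 2 percent
# level-sensor error, and a 4 percent installation-related error combine
# to about 5.4 percent overall: larger than any single component,
# but smaller than their simple sum.
total_error = combined_relative_error(0.03, 0.02, 0.04)
```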
Table 6. Typical airport flow monitoring applications and accuracy considerations.

| Flow Monitoring Application | Notes |
| --- | --- |
| Estimating outfall flow rates or volumes for stormwater discharge permit reporting | If the reported flow value is not used in pollutant load calculations, flows could be estimated using models or simplified methods without installing flow meters. |
| Calculating pollutant mass loading rates for stormwater discharge permit reporting | Accurate flow readings at monitoring points have a direct impact on compliance, as the mass loading rate is calculated using flow rates: Mass Load Limit for Pollutant = (Flow Rate) × (Pollutant Concentration). |
| Reporting flow rate and volume data to outside agencies to calculate fees | Most airports discharging to sanitary sewers need accurate flow measurements to accurately calculate fees paid to the local municipality. Some airports use flow data in calculating stormwater utility fees to the local operator of the MS4. |
| Triggering sample activities | The presence of flow or a certain flow rate/volume is used to initiate auto-sampling. |
| Characterizing base flows and flows from non-airport areas | Typically it is difficult to accurately quantify the portion of storm sewer flows that is attributable to base flows from groundwater, but it can be approximated through long-term time series plots and base flow separation analysis techniques. |
| Calibrating water quantity and water quality models | Where models are used to develop regulatory criteria or size infrastructure, calibration of modeled flows using accurate field-measured data is critical to avoid overspending or under-sizing. |
| Developing stream wasteload allocation calculations and water quality models to set pollutant permit limits | Both stream flows and outfall flows are key components in the calculation. Techniques like field development of depth-discharge curves can be used to determine flow rates. |
| Sizing conveyance, water quality control, storage, and treatment facilities | Stormwater infrastructure can be costly; accurate measurements or modeled flow is needed to correctly size it. |
| Managing day-to-day operations in stormwater and deicing treatment facilities and controls | Accurate, real-time flow monitoring is needed to support flow management, diversion, and discharge decisions. |

For example, a bend or angle in the channel just

upstream from the flow-measuring device can cause secondary flow or large eddies, which tend to concentrate the flow in part of a cross section. Excessive turbulence will adversely affect the accuracy of any measuring device but is particularly objectionable when using current meters or propeller meters of any kind. Excessive turbulence can cause measurement errors of 10 percent or more.

A common mistake made by those not familiar with the details of flow monitoring devices is to assume that the devices are accurate through the full range of conditions encountered. This is a dangerous assumption, as most devices typically have a limited range of flow conditions for which they are accurate. For example, flow monitoring devices intended to measure depth in storm pipes using a pressure sensor mounted to the bottom of pipes lose accuracy under low flow conditions when water depth is not sufficiently above the depth probe. When area-velocity flow measurements are used, there is a low velocity cutoff, with flows below the cutoff not being detected. For outfalls where there is standing water, flows will not be detected until the low velocity cutoff is reached for the entire flow cross section. Parshall flumes can provide highly accurate flow measurements but only within the ranges specified by the manufacturer. If water depth in the flume exceeds the maximum depth specified, flow rate readings can no longer be considered accurate. The range of flows applicable to a monitoring device is usually related to the need for certain prescribed flow conditions that are assumed in the development of rating curve calibrations. Large errors in measurement can occur when the flow is outside this range. It is essential that airports understand the information provided by manufacturers and design engineers related to the operating conditions for which the device is intended.
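The manufacturer-specified range discussed above can be enforced during data processing. The sketch below uses a generic rating-curve form Q = C * H**n with illustrative coefficients and head limits; actual values must come from the device documentation:

```python
def rated_flume_flow(head_ft, c=4.0, n=1.55, head_min=0.1, head_max=2.5):
    """Compute flow from a rating curve and flag readings outside the rated range.

    Returns (flow, out_of_range); out-of-range results should be treated
    as estimated values rather than accurate measurements.
    """
    out_of_range = not (head_min <= head_ft <= head_max)
    flow = c * head_ft ** n
    return flow, out_of_range

flow, flagged = rated_flume_flow(1.0)    # within the rated head range
flow2, flagged2 = rated_flume_flow(3.0)  # above the rated head: flag as estimated
```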
Generally, the flow measurement device should be selected to cover the desired flow range. For practical reasons, it may be reasonable to establish different accuracy requirements for high and low flows. Table 7 summarizes some potential sources of flow measurement error for various open channel and closed conduit measurement techniques.

Table 7. Summary of potential sources of flow measurement error, by parameter, with expected instrument error ranges.

All Flow Measurement Methods (Bureau of Reclamation and U.S. Department of Agriculture, 2001)

General (Measurement Environment):
- Deviation from a normal transverse or vertical flow distribution, or the presence of water surface boils, eddies, or local fast currents affects the accuracy of a flow measurement device. Errors of 20 percent are typical, and errors as large as 50 percent or more may occur if the approach flow conditions are very poor.
- Sediment accumulation can make flow measurement inaccurate or render the flow measurement device inoperative.
- Sediment deposits can affect approach conditions and increase the approach velocity in front of weirs, flumes, and orifices. Floating and suspended debris can plug some flow measurement devices and cause significant flow measurement problems.
- Improper protection of equipment against the site environment can cause a loss of equipment accuracy. Examples include mineral encrustation and biological growths, which may cause component wear that could cause drift from the standard instrument calibration.
- Field modifications to standard flow measurement devices may affect the pre-set instrument calibrations.
- Poorly maintained measuring devices are no longer standard, and the resulting stormwater discharge measurements may be considerably in error.
- Improperly installed equipment causes measurement error. Examples include devices installed out of level or out of plumb, devices that are skewed or out of alignment, devices that have leaking bulkheads with flow passing beneath or around them, and devices that have been set too low or too high for the existing flow conditions.

Channel Velocity:
- Current meter error: the manufacturer should provide the instrument error for different velocity ranges.
- Vertical velocity distribution error: determination of the mean velocity in a vertical segment is usually based on the one-point method but sometimes on the two-point method (or another multi-point method). The one-point method assumes the mean velocity in the vertical segment is equal to the velocity measured at 0.6 of the depth below the water surface. The two-point method assumes the mean velocity in the vertical segment is equal to the arithmetic mean of the velocities measured at 0.2 depth and 0.8 depth below the water surface. There are different measurement error ranges associated with the one-point and two-point methods.
- Soft sediments and use of a heavy sounding weight make it difficult for the stream gauger to sense the streambed. Uneven, rough streambeds (cobbles, rocks) may cause depth measurement errors.
- High velocities and deep depths may cause drag on the sounding weight and line. Depths must be corrected by applying wetline and dryline adjustments.
- Uncertainties in measuring the vertical angle of the sounding line and uncertainties in the forces acting on the weight, meter, and sounding line can cause significant errors.
- Mobile streambeds (sand) may change during the discharge measurement due to dunes and antidunes moving through the reach. The resulting standard error may be about 10 percent.
- Depth measurements made with a rod during high velocities will produce "pile-up" of water on the rod at the water surface and must be accounted for.

Open Channel Velocity-Area Flow Measurement (Sauer and Meyer, 1992)

General:
- A study conducted by the U.S. Geological Survey on the uncertainty or standard error for individual stream discharge measurements found that most stream discharge measurements will have standard errors ranging from about 3 percent to 6 percent. However, conditions such as wind, ice, boundary effects, flow obstructions, improper equipment, and incorrect measurement procedures can result in even larger standard errors. It may be challenging to estimate the measurement error associated with these conditions.*
- Systematic errors are caused by improperly calibrated equipment or improper use of equipment. Systematic errors are generally considered small, on the order of 0.5 percent.

Selection of Vertical Segments:
- Each vertical segment should represent approximately 5 percent or less of the total flow. The number of vertical segments used for the measurement affects the standard error for the horizontal distribution of velocity and depth. The general assumption is that the depth and velocity for a vertical segment apply to the segment extending halfway to the vertical on either side of the measured vertical segment.
- The reach should be straight and uniform with sufficient distance to provide uniform flow through the measured section (Nolan and Shields, 2000).

Channel Width:
- Errors measuring channel width are generally considered to be insignificant (less than 1 percent), especially where the width is determined using a measuring tape or tagline that spans the waterbody.

Channel Depth:
- A stream gauger wading in the channel or a sounding weight resting on or near the streambed may cause scour.
- The type of sounding equipment used depends on the depth being measured, and each equipment type has a different measurement error range. Rod measurements are typically used for wading measurements when depths are less than about 3 to 4 ft (error for a soft streambed ±0.05 ft; for an uneven stable streambed ±0.1 ft). Cable and weight sounding is usually used when depths are greater than about 3 to 4 ft. Acoustic sounding methods may be used for depths greater than about 5 ft. For cable suspension and acoustic depth sounding measurements for both streambed conditions, the error is about ±0.3 ft. Refer to Sauer and Meyer (1992) for the standard errors associated with various other stream conditions.

Open Channel Velocity-Area Flow Measurement (continued):
- Oblique flow (flow not perpendicular to the measurement section) can be either horizontal or vertical. Where horizontal angles are present throughout most of the vertical segments, a standard error of 1 percent should be used. Vertical oblique flow is not considered a significant source of error.

Closed Pipe

Instrument Error (Frenzel et al., 2011):
- The following instrument error ranges are published by the manufacturers for ideal boundary conditions. Deviations from ideal conditions are common in practice, so additional stipulations should be made for instrument error. In general, error limits can be affected by contamination, wear, and physical changes (e.g., corrosion).
  - Direct volume totalizer: 0.1 to 1 percent
  - Indirect volume totalizer: 0.5 to 3 percent
  - Flow meter: 0.1 to 6 percent

General (Frenzel et al., 2011; Spitzer and Furness, 2008):
- Entrained air within the pipe may cause measurement errors.
- Flow meters should not be installed where there is a distortion of the velocity profile. The ideal segment upstream of the flow meter should be straight-run piping that has no fittings, valves, or other potential obstructions. The amount of straight-run pipe needed depends on the flow meter type. However, obstructions such as elbows and tees sometimes cannot be avoided.

Free Surface Pipe

Instrument Error (Frenzel et al., 2011):
- Electromagnetic flow meter in culvert: ±0.25 percent.

Open Channel Weir

Instrument Error (Frenzel et al., 2011):
- Error should not exceed 3 percent.

General (Bureau of Reclamation and U.S. Department of Agriculture, 2001):
- Particulates that settle out in the dammed area ahead of a weir due to the decreased flow velocity may change the channel geometry and cause measurement errors. Floating particles may change the geometry even more and may plug the meter outflow.
- Measurement of the head can be difficult under non-ideal conditions, and the location where head is measured affects accuracy.
- Large errors are introduced if a sharp-crested weir blade is submerged by backwater.

Open Channel Venturi Flume

Instrument Error (Frenzel et al., 2011):
- ±6 percent.

General (Frenzel et al., 2011):
- The Venturi flume accelerates the fluid in its constricted areas and drives the solids through. Floating particles may have a negative impact on the level measurement.
- Foam buildup causes measurement errors that are a function of the type of sensor used.
- Backflow should not be present within the flume because backflow could produce a level at the measuring point that corresponds to a higher flow rate. (A Parshall flume allows for a slight backflow.)

* The results regarding standard error only apply to discharge measurements made using the velocity-area method with vertical axis cup-type current meters (i.e., Price AA and Price Pygmy current meters). The study results do not apply to discharge measurements utilizing structures such as weirs, electromagnetic meter methods, dilution methods, or ultrasonic meter methods.

Key Takeaways

Verifying the Accuracy and Representativeness of Raw Data
- Understand the differences in key terms describing accuracy and representativeness, such as accuracy, precision, bias, measurement error, random error, and systematic error.
- List and assess the potential sources of error in the field data acquisition process at the airport.
- Establish a protocol for the information the airport will communicate to the analytical laboratory.
- Establish a protocol for the information the laboratory should provide to the airport.
- Work with the laboratory to gain a firm understanding of the differences in various analytical limits (detection limit, quantitation limit) and flags reported in laboratory results.
- Be aware that stated accuracy of field monitoring instruments likely does not sufficiently account for field measurement errors. Verify accuracy of instruments in the field with split samples.
- Consider the effects of propagating errors in measurements (e.g., flow error measurements magnifying analytical errors).
- Accurate and repeatable flow monitoring requires careful installation, correctly applied calibration procedures, and regular maintenance.

2.4 Analyzing the Data

The processes of acquiring monitoring data and verifying the accuracy and representativeness of data involve individual monitoring data points. While evaluating individual data points has value (e.g., evaluating if a permit limit has been exceeded), many applications of monitoring data require a broader understanding of the monitoring data set as a whole. Gaining that understanding can be at least partially achieved through statistical analysis of the monitoring data set. Statistical analysis is the process of analyzing large quantities of monitoring data to characterize the monitored stream as a whole and provide information needed to make informed decisions.
Because it is infeasible to collect data from all monitoring points at all times, statistical analyses are used to infer characteristics of the monitored stream as a whole from a representative portion of the potential monitored data. In other words, statistical analyses allow a monitored subset of data to be used to infer the probable behavior of the entire actual population of pollutant values. Statistical analysis can be used to both quantitatively describe the monitored stream and provide an assessment of the uncertainty in the data set. Statistical analysis may be required to demonstrate compliance with water quality permit benchmarks or waste load allocations. Additionally, site descriptions, temporal trends, relationships between water quality parameters, control measure performance estimates, and much more information can be obtained through the appropriate selection and application of statistical methods.

The following sections provide an introduction to some common statistical and graphical methods used to characterize water monitoring data and to enhance understanding of data beyond simple compliance analysis. Specific topics covered include introductions to the following:

- Data distributions
- Methods for determining data distributions and parameters

- Methods for utilizing censored (nondetect) data
- Overview of error, uncertainty, and variability

This section includes example applications of the described methods. Additional statistical resources may be needed to supplement the information provided in the following sections.

2.4.1 Uses of Statistical Analyses of Water Monitoring Data

Statistical analyses can be used for multiple purposes, as shown in the Topical Tips box, to assess water monitoring data.

Topical Tips

Uses of Statistical Analysis in Evaluating Water Monitoring Data
- Quantitatively describe the magnitude, variation, and distribution of the data set (e.g., mean, 90th percentile, standard deviation)
- Graphically represent trends and distribution ranges
- Quantify parameter relationships
- Represent variation of parameters in a time series (e.g., hydrographs)
- Understand the probability of events (e.g., likelihood of effluent limit exceedance)
- Appropriately account for analytical results below detection limits
- Evaluate limitations in the monitoring data set and data acquisition program

Statistical methods to meet these objectives are discussed in the following sections.

2.4.2 Descriptive Statistics

The computation of descriptive statistics (also known as summary statistics) is a fundamental step in exploratory data analysis. Descriptive statistical analyses are used to characterize only the data that has been acquired and are not used to infer broader characteristics of the monitored stream. Common descriptive statistical measures include those listed in the Topical Tips box.
Topical Tips

Commonly Used Statistical Measures in Evaluating Water Monitoring Data
- Location or central tendency of the data set (e.g., mean, median)
- Measures of spread or variability (e.g., standard deviation, interquartile range)
- Measures of skewness (e.g., coefficient of skewness and quartile skew coefficient, used to identify the level of symmetry in how the data is distributed)

Two general approaches can be taken to compute descriptive statistics. Use of the methods depends upon the distribution of the data. Distribution is the arrangement of data points showing the observed frequency of occurrence. The most commonly known distribution is the normal

distribution, in which the data is distributed symmetrically about the mean and median. Many distribution types are possible, and how well the data fit a distribution (i.e., goodness of fit) dictates the appropriateness of any particular statistical method.

How the data is distributed has a profound effect on how the data set is interpreted and the statistical measures that can be appropriately applied. For example, Figure 12 represents a histogram of data in which the data is balanced on either side of the centerline between the highest and lowest values. This is akin to a bell curve or normal distribution. In this type of data set, the mean (average) value corresponds to the median (middle) value in the data set. Oftentimes, however, water monitoring data sets do not show such a symmetrical balance. More data points tend to be skewed toward lower concentrations, with the more extreme high values extending farther to the right in the distribution pattern than in a normal distribution. If such a data set were balanced at the centerline between the highest and lowest values, the data set would not, in fact, balance. Instead, the balance point needs to be shifted as shown in Figure 13. In this data set, the mean value differs from the median value.

Statisticians have studied how data set distributions vary and have developed sophisticated methods to account for this variation. One of the most fundamental distinctions for those analyzing monitoring data sets to understand is the difference between "parametric" and "non-parametric" statistical analysis methods.

1. Parametric Statistical Methods
- Data follows a known distribution pattern.
- Environmental data are typically assumed to follow a normal or lognormal distribution, although many other distribution types may be appropriate.
- Statistical analyses that most people are familiar with can be applied (e.g., mean, standard deviation).
- Equations for calculating descriptive statistics like the mean and standard deviation depend upon the type of distribution (e.g., normal, lognormal, binomial).
- If parametric statistical methods are applied to a data set that is not distributed normally, the results of the analyses may not be valid.

Figure 12. Symmetrical, balanced distribution of data. (Annotations: balance point aligned with centerline between highest and lowest value; mean, median, and mode the same.)
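The contrast illustrated by Figures 12 and 13 can be checked numerically. This sketch uses illustrative data and a common adjusted Fisher-Pearson skewness formula (not taken from the guidebook) to show the mean and median coinciding for a balanced data set and separating for a skewed one:

```python
from statistics import mean, median, stdev

def skewness(data):
    """Sample coefficient of skewness (adjusted Fisher-Pearson form)."""
    n, m, s = len(data), mean(data), stdev(data)
    return (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in data)

balanced = [1.0, 2.0, 3.0, 4.0, 5.0]            # symmetric, as in Figure 12
skewed = [1.0, 1.2, 1.5, 2.0, 2.4, 3.0, 12.0]   # one extreme high value, as in Figure 13

# For the balanced set, mean == median == 3.0 and the skewness is zero.
# For the skewed set, the extreme value pulls the mean (about 3.3) well
# above the median (2.0), and the skewness is strongly positive.
```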

2. Non-parametric Statistical Methods
- Data does not need to follow a known distribution to apply statistical analysis methods.
- Non-parametric statistical methods free the user from the burden of assuming a specific data distribution.
- Some non-parametric analysis techniques are less commonly known by those not heavily involved in statistics.
- Non-parametric statistical methods may be more appropriate to describe a data set than parametric methods in certain situations (e.g., data sets with outliers or heavy skew).

Table 8 summarizes the parametric and non-parametric statistics commonly used to describe data sets. The parametric and non-parametric methods used to compute these descriptive statistics are briefly described in the following sections.

Table 8. Common parametric and non-parametric descriptive statistics.

| Statistic Category | Parametric | Non-Parametric |
| --- | --- | --- |
| Measures of location | Mean | Median |
| Measures of spread | Variance, standard deviation | Interquartile range, median absolute deviation |
| Measures of skew | Coefficient of skewness | Quartile skew coefficient |

Figure 13. Unbalanced distribution of data. (Annotations: balance point not aligned with centerline between highest and lowest value; mean, median, and mode not the same.)

Parametric Distributions

Parametric statistics assume that data arise from a single statistical distribution. The most commonly known statistical distribution is the normal distribution. In environmental data, the lognormal distribution is a common parametric distribution as well. A data set that is lognormally distributed can be transformed to a data set that is normally distributed by taking the logarithms of the data points. Figure 14 illustrates a normal distribution.

Parametric methods cannot be directly applied to a data set that is highly skewed (e.g., coefficient of skewness less than -3 or greater than 3, indicating the data does not arise from a normal

distribution). However, the data set may be able to be transformed into a normal distribution prior to applying statistical procedures that depend on an assumption of normality (e.g., t-test). In such cases, tests of normality should be conducted to verify that the transformed data fit the normal distribution.

The specific distribution to which the data is modeled is often chosen by scientific judgment, graphical means (such as the methods described in the following sections), and goodness-of-fit tests. The purpose of goodness-of-fit tests is to assess how closely monitoring data fit a known distribution. Example goodness-of-fit tests include the Kolmogorov–Smirnov (K-S) test, the modified Lilliefors test, the chi-square (χ²) test, the Shapiro–Wilk test, and the probability plot correlation coefficient test. Most statistical analysis software provides common goodness-of-fit tests, including the open source program R.

Relevance of the Lognormal Distribution

Lognormal distributions can only be used on environmental parameters that have positive values. Water quality data can often be transformed to an approximate normal distribution by simply taking the log of each data point. This transformation allows parametric statistical procedures that require normality assumptions to be performed. However, a goodness-of-fit test, as previously mentioned, should be performed on the transformed data prior to conducting such parametric procedures.

The lognormal probability distribution is often used to represent environmental data because of the positively skewed nature of the data (i.e., concentration data is zero or greater). The assumption that a population is lognormally distributed implies that the standard deviation is proportional to the mean and the data is bounded by zero. Figure 15 illustrates a lognormal distribution.
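As a rough illustration of how a goodness-of-fit statistic works, the one-sample K-S statistic measures the largest vertical distance between the empirical CDF of the data and the CDF of the candidate distribution. The sketch below (Python, standard library only) fits a normal distribution to the sample itself, which corresponds to the Lilliefors variant; critical values for judging the statistic would come from published tables or software such as R:

```python
import math
import statistics

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(data):
    """One-sample K-S D statistic against a normal distribution fitted to
    the sample itself (mean and standard deviation estimated from the
    data, as in the Lilliefors variant of the test)."""
    xs = sorted(data)
    n = len(xs)
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # Compare the fitted CDF to the empirical CDF just below and at x.
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return d
```

For strongly skewed data, the statistic computed on the raw values is typically larger (worse fit to a normal) than the statistic computed on the log-transformed values, consistent with the lognormal behavior described below.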
The assumption that stormwater data and control measure effluent concentrations for many constituents are lognormally distributed has been explored and is supported by the stormwater literature (Van Buren et al., 1997; Maestre et al., 2005). Van Buren et al. (1997) found that the lognormal distribution was a better fit than the normal distribution for most stormwater

Figure 14. Normal distribution.

pollutants, but the normal distribution was preferred for total dissolved solids, chlorides, sulfate, and COD. Maestre et al. (2005) evaluated the probability distributions of the stormwater quality data in the National Stormwater Quality Database. They confirmed that lognormal distributions are very common for the constituents found in that stormwater database, with few exceptions (such as pH). A lognormal distribution for stormwater data implies that pollutant concentrations tend to be skewed toward lower concentration values, with higher concentration values spread out over a wide range. In other words, the mean value is closer to the lower end of the range of potential concentrations than would be the case for a normal distribution.

A common misconception is that the exponential of the mean of the log-transformed variable y is the mean of the untransformed variable x. However, exp(µy) is the geometric mean of the untransformed variable x, which is equal to the median of a lognormal random variable x, not the mean. To compute the arithmetic mean of a lognormal random variable, the variance must also be included in the calculation. Formulae for converting some common statistics from log-space to arithmetic space for a lognormally distributed random variable are provided as follows:

µx = exp(µy + 0.5σy²)

σx = µx √(exp(σy²) − 1)

mx = exp(µy)

where
µx is the arithmetic estimate of the mean for a lognormally distributed random variable, x
σx is the arithmetic estimate of the standard deviation for a lognormally distributed random variable, x
µy is the mean of the natural logarithms
σy is the standard deviation of the natural logarithms
mx is the geometric mean, which equals the median for a lognormal random variable.
For example, to compute arithmetic estimates of the mean, standard deviation, and median of a lognormally distributed data set, the mean and standard deviation of the natural logs of the data points (µy and σy, respectively) are first computed before applying these equations.

Figure 15. Lognormal distribution.
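These steps can be sketched in a few lines of Python (standard library only):

```python
import math
import statistics

def lognormal_arithmetic_stats(data):
    """Estimate the arithmetic mean, standard deviation, and median of a
    lognormally distributed data set from the mean and standard deviation
    of the natural logs of the data points."""
    logs = [math.log(x) for x in data]
    mu_y = statistics.mean(logs)
    sigma_y = statistics.stdev(logs)
    mean_x = math.exp(mu_y + 0.5 * sigma_y ** 2)             # arithmetic mean
    sd_x = mean_x * math.sqrt(math.exp(sigma_y ** 2) - 1.0)  # arithmetic std dev
    median_x = math.exp(mu_y)                                # geometric mean = median
    return mean_x, sd_x, median_x
```

As the formulae imply, the arithmetic mean estimate always exceeds the median estimate for a lognormal data set, since the variance term adds to the exponent.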

However, the formulae should be applied with care. If the data set distribution significantly departs from lognormality, the use of these equations is inappropriate.

Non-parametric Distributions

Non-parametric statistics are fundamentally based on the ranks of the data, with no need to assume an underlying distribution. In other words, the statistical techniques are based on the positions of the data after being sorted by their magnitude. Non-parametric statistics are therefore resistant to the occurrence of a few extreme values (e.g., high or low outlier values relative to other data points do not significantly alter statistics like the median). Many non-parametric methods are described as robust because of their resistance to outliers and good performance in describing data from a wide range of distributions.

The data median is the most basic example of a non-parametric statistic. The median, or 50th percentile, of a data set is the value at which half the data lies above and half the data lies below. Depending on the goals of the analysis and the uncertainty of the data's underlying statistical distribution, the median may be a more appropriate measure of the central tendency of the data than the sample mean since it is less influenced by the presence of a few outliers. As a result, the median concentration may be more representative of the typical or average site storm event discharge concentration because the value is more robust in the presence of outliers when compared to the mean. The mean concentration for a site, on the other hand, may be completely biased by a single event that had an abnormally high discharge concentration due to an anomalous point-source mass release. For example, adding a single 5,000 mg/L sample to a 20-sample COD data set that averages 50 mg/L raises the average to 286 mg/L. The median value in the data set, however, is unlikely to change much from a single data point.
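This effect is easy to reproduce. The sketch below uses an idealized, hypothetical COD data set in which every one of the 20 samples equals the 50 mg/L average:

```python
import statistics

# Hypothetical COD data set: 20 samples averaging 50 mg/L.
cod = [50.0] * 20
mean_before = statistics.mean(cod)

# Adding a single anomalous 5,000 mg/L sample drags the mean upward...
cod_with_outlier = cod + [5000.0]
mean_after = statistics.mean(cod_with_outlier)      # about 286 mg/L

# ...while the median is essentially unchanged.
median_after = statistics.median(cod_with_outlier)  # still 50 mg/L
```

The mean jumps from 50 mg/L to roughly 286 mg/L (6,000 divided by 21 samples), while the median remains 50 mg/L.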
Application of Descriptive Statistics

Data collected at a site through time can be characterized using descriptive statistics and exploratory data analysis techniques. The most common parametric descriptive statistics, as presented in Table 8, are the mean and standard deviation, with their non-parametric counterparts, the median and the interquartile range. It is common practice to calculate the mean and standard deviation when presenting data to regulators, though the assumption that the data arises from a normal distribution may be unsubstantiated or even false. If the data is lognormally distributed, the standard mean is not representative of the central tendency of the data. In such cases, the geometric mean, median, or arithmetic estimate of the mean computed from the log-transformed data set is a more appropriate estimate of an "average" value. Both parametric and non-parametric statistics can easily be calculated using a spreadsheet program such as Microsoft Excel, with open source programs such as R, or with commercial statistical packages such as SAS, SPSS, Systat, and Minitab.

Concentration data collected at a single location or from multiple locations throughout a site can be used to develop descriptive statistics. Descriptive statistics help to answer questions like "are benchmark concentrations being exceeded?" and "what are the probabilities of exceeding benchmarks?" Comparisons to pollutant benchmarks or effluent limits in permits are often handled through calculations of the median and the mean of the data. Probabilities of exceedance of benchmarks or limits can be calculated using non-parametric interval estimates of the median.

The median can be calculated using a spreadsheet program, such as Microsoft Excel. Using the descriptive statistics command, Excel will calculate both the mean and median of the input

data. This command will also provide the standard deviation, confidence interval, and other parametric statistics. The parametric statistics provided by Excel assume the data are normally distributed. The spread of the data can be described by the interquartile range. Again, Excel can be used to calculate these values: using the QUARTILE function, the quartiles of the data are calculated at the 25th and 75th percentiles.

Non-parametric exceedance probabilities can be calculated using a binomial distribution. A binomial distribution calculator, such as the one provided on the VassarStats website (http://vassarstats.net/binomial.html), can be used to determine the confidence interval about the median. Using the number of data points and the necessary confidence level (e.g., 95 percent confidence), a k-value is determined from the binomial distribution calculator. For a concentration data set containing 10 values, the sample size is n = 10. For the 95 percent confidence interval, a p-value of 0.05 is entered. The exact probability closest to one-half of the p-value (p/2 = 0.025) is selected from the chart and the closest k-value determined. For the example described here, the k-value is 3. The lower confidence limit is taken as the rank value equal to the k-value plus 1; the upper confidence limit is the rank value equal to the sample size (n) minus the k-value. The confidence interval may not be symmetric about the median, such as in cases where the distribution of data is highly skewed. An alternative to using the binomial distribution for computing confidence intervals is to use a bootstrap procedure as described later in this chapter.

Exploratory data analysis through the use of graphical data displays can be a very helpful method for determining data characteristics and presenting data. Graphical data displays can be produced using open source programs such as R, as well as Microsoft Excel.
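The rank-based construction can be sketched with the Python standard library. Note that conventions for reading a k-value off a binomial chart vary between references; this sketch uses the common construction (e.g., as in Helsel and Hirsch, 2002) in which k is the largest count whose lower-tail binomial probability does not exceed α/2, so the reported interval has at least the nominal coverage:

```python
import math

def binom_cdf(k, n, p=0.5):
    """Lower-tail probability P(X <= k) for a binomial(n, p) variable."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def median_confidence_ranks(n, alpha=0.05):
    """1-based ranks of the order statistics bounding a two-sided
    non-parametric confidence interval for the median, with coverage of
    at least 1 - alpha (binomial construction, p = 0.5)."""
    if binom_cdf(0, n) > alpha / 2:
        raise ValueError("sample too small for the requested coverage")
    # Largest k whose lower-tail probability stays within alpha/2.
    k = 0
    while binom_cdf(k + 1, n) <= alpha / 2:
        k += 1
    return k + 1, n - k  # (lower rank, upper rank)
```

Under this convention, for n = 10 and 95 percent confidence the interval is bounded by the 2nd-smallest and 2nd-largest values, with actual coverage of about 97.9 percent.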
The next section on graphical data analysis provides an overview of some common ways to visualize environmental data. For the other methods, the reader is referred to Statistical Methods in Water Resources (Helsel and Hirsch, 2002) for additional guidance.

2.4.3 Graphical Data Analysis

Visualizing or graphically displaying data is an essential tool for data analysts. Not only does it provide data analysts with preliminary information about the general characteristics of a data set, but it also enables them to perform a more comprehensive and statistically valid analysis. Four types of plots are often used to describe and visually display the characteristics of environmental data: histograms, box plots, quantile plots, and scatter plots.

Histograms

Histograms are used to visualize the empirical distribution of a single data set by categorizing the data into bins. The number of data points (frequency of occurrence) in each bin is then plotted on the dependent (y-) axis with the bins themselves on the independent (x-) axis. This practice provides a rough estimate of the shape or symmetry of the probability density function of the underlying distribution from which the sample data arises. Figure 16 shows example histograms displaying the frequency of total zinc runoff concentrations from industrial sites contained in the National Stormwater Quality Database. The plot on the left is untransformed data and the plot on the right is log-transformed data. Note how the transformation makes the data set more symmetric, indicating the data may arise from a lognormal distribution.

Choosing the number of bins for the histogram is an important consideration, as the shape of the distribution can be obscured with too few or too many bins. One method for determining the number of bins suggests that for a sample size, n, the number of bins, k, should be the smallest integer such that 2^k ≥ n (Helsel and Hirsch, 2002), as shown in Table 9.
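The 2^k ≥ n rule reduces to a ceiling of the base-2 logarithm; a minimal sketch:

```python
import math

def histogram_bin_count(n):
    """Smallest integer k such that 2**k >= n (Helsel and Hirsch, 2002),
    with a floor of one bin for very small samples."""
    if n < 1:
        raise ValueError("sample size must be positive")
    return max(1, math.ceil(math.log2(n)))
```

For example, a 100-sample data set gets 7 bins, since 2^6 = 64 < 100 but 2^7 = 128 ≥ 100.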

Box Plots

Box plots (or box and whisker plots) provide a schematic representation of the central tendency and spread of the data. A standard box plot consists of two boxes and two lines. The lower box expresses the range of data from the 25th percentile (1st quartile or Q1) to the median of the data (50th percentile, 2nd quartile, or Q2). An upper box represents the spread of the data from the median to the 75th percentile (3rd quartile or Q3). The total height of the two boxes is known as the interquartile range (Q3 − Q1). A "step" is 1.5 times the interquartile range. Two lines are drawn from the lower and upper bounds of the boxes to the minimum and maximum data points (respectively) within one step of the limits of the box. Asterisks or other point symbols are sometimes used to represent outlying data points. Some statistical packages, including stand-alone software and third-party spreadsheet extensions, also include the confidence interval about the median as notches in the boxes about the centerline or can be customized to include specific data percentiles (e.g., 5th, 10th, 90th, and 95th). Figure 17 shows an example box plot with each element defined. The Water Quality Data Analysis Tool developed to supplement this guidebook produces a box plot for an input data set similar to that shown in Figure 17. See Appendix E for details.

The upper and lower 95 percent confidence limits of the median allow the box plot to be used as a non-parametric, graphical analysis of variance and can be used to estimate whether the medians of two data sets are statistically different (McGill et al., 1978). For example, Figure 18 shows side-by-side box plots of median influent and effluent concentrations for grass filter strips contained in the International Stormwater BMP Database.
The confidence intervals about the median concentrations do not overlap, indicating that filter strips provide statistically significant reductions in total copper.

Figure 16. Example histograms of total zinc runoff concentration data.

Table 9. Number of bins for given sample sizes.

Number of Samples (n) | Number of Bins (k)
8 | 3
16 | 4
32 | 5
64 | 6
128 | 7
256 | 8

Interpreting Monitoring Data 87 In the comparison of paired or matched data, the extent to which the confidence intervals for the distributions of event concentrations at the inflow and outflow overlap gives a good indica- tion if the medians can be considered statistically different (i.e., the null hypothesis that the inflow and outflow medians are the same can be rejected). In most cases, the Kruskal-Wallis test and the K-S test support the results of the notched box plot. However, these hypothesis tests are generally more powerful at detecting statistical difference between two sample data sets than simply comparing the confidence intervals about the medians. Figure 17. Example box plot with definitions. Figure 18. Example side-by-side box plot. Source: Geosyntec Consultants and Wright Water Engineers (2013).

Quantile Plots and Probability Plots

Quantile plots are used to visually display data for three main reasons: (1) to compare the data distributions of two data sets (called a Q-Q plot); (2) to compare a single data set to a theoretical probability distribution (e.g., normal); or (3) to calculate exceedance frequencies. Quantile plots are constructed by ranking the sample data (i.e., observations) and then calculating the plotting position for each data point. The ranked data are placed on the x-axis and the corresponding plotting positions, or percent-less-thans (i.e., percentage of total data points below the value on the x-axis), are placed on the y-axis. This produces a sample approximation of the cumulative distribution function (CDF) where the probability of a random sample value being less than or equal to an observation can be directly determined. Conversely, the percentage of data points exceeding a water quality threshold (i.e., percent exceedance) can be simply computed as 1 minus the percentage of data points less than the value on the x-axis.

Depending on the application, there are several different formulae that can be used to compute the plotting position. Helsel and Hirsch (2002) recommend using the Cunnane formula for general use rather than applying a different formula for each application:

p = (i − 0.4) / (N + 0.2)

where
i is the rank of the data point;
N is the number of data points;
p is the plotting position (i.e., non-exceedance probability).

Probability plots are related to quantile plots, but, in this instance, the observations are plotted against the quantiles of the theoretical probability distribution, instead of the percent-less-thans. Both quantile plots and probability plots can be used to determine how well a data set fits a theoretical distribution.
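The Cunnane formula is straightforward to apply to a ranked data set; a minimal sketch:

```python
def cunnane_positions(data):
    """Cunnane plotting positions p = (i - 0.4) / (N + 0.2) for the ranked
    observations, with i = 1 assigned to the smallest value.  Returns
    (value, plotting position) pairs sorted by magnitude."""
    n = len(data)
    return [(x, (i - 0.4) / (n + 0.2))
            for i, x in enumerate(sorted(data), start=1)]
```

A convenient property of the Cunnane formula is its symmetry: the plotting positions of the smallest and largest observations always sum to 1.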
However, rather than plotting the cumulative frequency of the data overlaid with the CDF (or plotting a histogram overlaid with the probability density function), a probability plot displays the actual data plotted against quantiles of the probability distribution of interest (e.g., normal Z-scores). As such, the agreement of the data with the theoretical straight line is more easily discernible than that of a curved probability density function or the CDF. Figure 19 illustrates a quantile plot (left) and probability plot (right) using the same data set used to plot the histograms

Figure 19. Example quantile plot (left) and probability plot (right).

Interpreting Monitoring Data 89 in Figure 16 (note the log-scale on the y-axis). Basic spreadsheet or statistical software can be used to produce similar quantile plots and probability plots. Water quality observations do not generally form a straight line on normal probability paper, but they do (at least from about the 10th to 90th percentile level) on lognormal probability plots. This indicates that the samples generally have a lognormal distribution as described previously. That means that many parametric statistical tests can often be used (e.g., analysis of variance), but only after the data is log-transformed. These plots indicate the central tendency (median) of the data, along with its possible distribution type and variance [the steeper the plot, the smaller the coefficient of variation (COV) and the flatter the slope of the plot, the larger the COV]. Multiple data sets can also be plotted on the same plot (e.g., different sites, different seasons, different habitats, etc.) to indicate obvious similarities or differences in the data sets. Most sta- tistical methods that are used to compare different data sets require that the sets have the same variances, and many require normal distributions. Similar variances are indicated by generally parallel plots of the data on the probability paper, while normal distributions are reflected by data plotted in a straight line on normal probability paper (Burton and Pitt, 2001). Probability plots should be supplemented with standard statistical tests that determine if the data is normally distributed. These tests, at least some are available in most software packages, include the K-S one-sample test, the chi-square goodness-of-fit test, and the Lilliefors variation of the K-S test. They are paired tests comparing data points from the best-fitted normal curve to the observed data. 
The statistical tests may be visualized on a normal probability plot where the best-fit normal curve (a straight line) and the observed data are both plotted. If the observed data crosses the line numerous times, it is much more likely to be normally distributed than if it only crosses the line a small number of times (Burton and Pitt, 2001).

Scatter Plots and Time Series Plots

Scatter plots are the most basic of the graphical preliminary investigation tools discussed. These plots are used when discerning a potential relationship between paired data sets or the temporal trend of a single data set (how the data changes over time). For paired data, two variables can be plotted against each other to quickly identify any potential relationships that may warrant further investigation or analysis (e.g., regression). For example, TSS can often be associated with pollutants that tend to be bound to particulates. If a pollutant suspected to be mostly particulate bound is linearly related to TSS, the slope of the line provides an indication of the strength of that association (e.g., milligrams of copper per milligram of suspended solids).

When looking at potential temporal trends of a single data set, the independent variable can be time to produce a time series plot. For stormwater and riverine studies, hyetographs (rainfall versus time) and hydrographs (flow rate versus time) are the most common time series plots used. The rainfall–runoff response of a drainage area can be visually investigated by plotting rainfall and flow on the same figure, as shown in Figure 20.

2.4.4 Comparative Data Analysis and Hypothesis Testing

The field of comparative data analysis encompasses a series of tests that facilitate determining whether two data sets are statistically different.
These methods are capable of comparing totally independent (non-paired) sets of data, such as the effluent concentrations from two different studies or locations, or dependent (paired or matched) data sets, such as the inflow and outflow concentrations of a stormwater BMP. If inflow and outflow data appear to follow different or unknown distributions (e.g., normal or lognormal), or if either data set contains a high proportion (i.e., > 15 percent) of nondetects, non-parametric tests may be more appropriate than parametric tests. Both parametric and non-parametric tests are briefly described in the following subsections.

Independent Data Sets

Independent data sets can be compared using the Mann–Whitney–Wilcoxon rank sum test or the t-test. The rank sum test is a non-parametric test of the assumption that two groups arise from the same population (called the null hypothesis). Rejection of the null hypothesis indicates that the two groups are statistically different in their medians (Helsel and Hirsch, 2002). When dealing with small data sets (fewer than 10 samples), the functional statistic of this test, Wrs, is computed by summing all of the ranks of the smaller of the two data sets. For larger data sets, a second statistic, Zrs, is computed from Wrs and the mean and standard deviation of Wrs under the null hypothesis. A normal probability table is then used to assess Zrs.

The t-test can only be used on normally distributed, uncensored data sets (i.e., water monitoring data sets with no analytical nondetect values) and does not work well for small sample sizes (Helsel and Hirsch, 2002). For these reasons, the rank sum test is often preferred. The difference in magnitude between two data sets can be quantified using the Hodges–Lehmann estimator, which is the median of all possible pairwise differences between the two data sets. The difference of the sample means is rarely of any value unless the conditions prescribed for the t-test (uncensored, normally distributed) are met.

Paired Data Sets

Matched data sets can be compared using the sign test, the signed rank test, and the paired t-test. Given two matched data sets, x and y, these tests are performed solely on the differences between the two (D = x − y). The sign test is fully non-parametric and therefore is often preferred. The number of elements in D that are larger than 0 (noted as S+) is compared to the number less than 0 (noted as S−).
The signed rank test is used to determine whether x and y are samples of the same population or whether they differ only in location (e.g., median). The paired t-test is again subject to the assumptions and stipulations associated with the other t-tests mentioned previously.

Application of Hypothesis Tests

Hypothesis testing can be used to determine if different locations are behaving similarly or if the effluent concentrations from a control measure are significantly different from the influent

Figure 20. Example hyetograph and hydrograph.

concentrations. In some cases, industrial stormwater permits may only require monitoring of outfalls from drainage areas that have distinctly different land uses, activities, and stormwater discharge concentrations. Comparative statistics and hypothesis testing between locations can determine which locations are statistically similar and therefore whether one may be removed from the monitoring program. When data from different locations are compared, the data can be paired or non-paired. An example of paired data is where data is collected at approximately the same time at both locations, such that concentration data at one site is paired with concentration data at another site through the time variable. In general, non-parametric tests are preferred because they can be consistently applied without requiring a validation of normality assumptions. As described previously, the rank sum test can be used for non-paired data, and the signed rank test can be used for paired data to evaluate differences between groups.

The rank sum test is used to determine whether the null hypothesis that group 1 is the same as group 2 can be rejected and at what probability. For 95 percent confidence, the significance level would be 5 percent (p = 0.05). The alternative hypothesis is that group 1 differs from group 2 (a two-sided test). To perform the rank sum test, observations at each site are ranked together. If there are ties, the average of the individual ranks is used for each tied observation. The sum of the ranks of group 1 is then determined. This value, termed Wrs, is compared to values in a table (Helsel and Hirsch, 2002; Table B4). A p-value is determined and compared to the desired significance level. If the p-value is less than the desired significance level, then the null hypothesis is rejected.
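The rank sum computation can be sketched with the Python standard library, following the procedure above. In practice the p-value would come from the table in Helsel and Hirsch (2002) or from statistical software; the large-sample normal approximation via Zrs is shown here:

```python
import math

def rank_sum_test(group1, group2):
    """Wilcoxon rank sum statistic Wrs for group1 (the group whose ranks
    are summed), plus the large-sample normal approximation Zrs.  Tied
    observations receive the average of the tied ranks."""
    combined = sorted(group1 + group2)
    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    n, m = len(group1), len(group2)
    w_rs = sum(ranks[x] for x in group1)
    # Mean and standard deviation of Wrs under the null hypothesis.
    mu_w = n * (n + m + 1) / 2.0
    sigma_w = math.sqrt(n * m * (n + m + 1) / 12.0)
    return w_rs, (w_rs - mu_w) / sigma_w
```

When the two groups are identical, Wrs equals its null-hypothesis mean and Zrs is zero; when group 1 is uniformly lower than group 2, Zrs is negative.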
In the case of a rejected null hypothesis, the locations would be determined statistically different and therefore both would be required in the monitoring program. In the case where the null hypothesis cannot be rejected, the locations would be statistically similar and a case could be made for the removal of one of the locations from the monitoring program.

For paired data, the signed rank test can be used to determine if two locations are statistically similar. The null hypothesis of the signed rank test is that the median of the differences between data pairs is equal to zero. The alternative hypothesis is that the median of the differences is not equal to zero. To calculate the test statistic, the differences of each pair of data are calculated and their absolute values ranked. A rank is negative if the difference is less than zero and positive if the difference is greater than zero. The test statistic, W+, is equal to the sum of the positive ranks. The test statistic is compared to a table (Helsel and Hirsch, 2002; Table B6); if the p-value is greater than the desired significance level, the null hypothesis cannot be rejected. In that case, the two locations are statistically similar and a case can be made for the removal of one of the locations from the monitoring program.

2.4.5 Trend Analysis

Trend analysis of water quality data is often used to evaluate whether water quality is improving or getting worse over time and provides insight into whether changes to site conditions or implementation of control measures have had an effect on water quality parameters. Available analysis techniques range from very simple observational tools to more complex analyses involving the removal of seasonal effects from the data to detect monotonic changes.

Plotting data against time and fitting a simple linear equation is perhaps the most common type of trend analysis. Increasing trends are indicated by a positive slope and decreasing trends are indicated by a negative slope.
For skewed data sets, log-transforming the data before plotting may be needed. However, even with transformation, a trend line computed using simple linear regression may be highly influenced by a few extreme values, thereby imparting leverage on the overall slope of the line. To minimize

the effects of extreme values, the non-parametric Kendall–Theil robust line approach, which computes the slope of the line as the median of all possible pairwise slopes between data points, can be used (Granato, 2006).

To simply evaluate whether there is a monotonic trend (i.e., a general increase or decrease with time) without an assumption of linearity, the non-parametric Mann–Kendall test can be used. To apply the test, the sign of the difference between each pair of data points is calculated, subtracting the earlier value from the later value. If the difference is greater than 0, it is assigned a value of 1; if it is negative, it is assigned a value of -1; and if there is no difference, a value of 0 is assigned. After the signs of all differences are calculated, the values are summed and the sum, S, is compared to a table of probabilities (Gilbert, 1987). If the probability is less than the desired significance level (e.g., 5 percent for 95 percent confidence), then a trend exists in the data. If a trend exists, the sign of the sum indicates whether the trend is positive or negative.

Environmental data can exhibit varying degrees of seasonality, which may obscure trends. If the parameter is expected to vary seasonally, then this seasonality should be removed prior to conducting the trend analysis. For example, one way to remove seasonality is to use an average parameter value across all seasons (e.g., a year) and linearly regress the average value against time (e.g., regression of annual average stormwater discharge concentrations versus year). Another way to remove seasonality is to use the seasonal Kendall's tau method.
The seasonal Kendall's tau test is a fully non-parametric test that accounts for seasonality by computing the Mann–Kendall test on each season separately across all years in the record, and then combining the results. If the seasons are defined monthly, then January data is only compared to January data, and so on; no data is compared across seasonal boundaries. The results from each season are combined to form an overall test statistic, which is evaluated against a standard normal distribution to test the null hypothesis that no trend is present. To apply the test, the time series data must include at least three seasons and each season should have at least three measurements (Interstate Technology and Regulatory Council, 2013). The reader is referred to Gilbert (1987) for further reading.

Using a spreadsheet program, the log of the concentration data can be plotted versus time to create a time series plot. A linear regression can be performed on the data using the data analysis add-in. The regression results include both the slope of the regression line (the x-variable coefficient) and its p-value. The slope of the line is an indicator of trend: positive values indicate increasing concentrations, negative values indicate decreasing concentrations, and slope values close to zero indicate stable concentrations. The p-value provides an indication of the tendency of the dependent variable, concentration, to change with time. A large p-value (>0.05 at the 95 percent confidence level) indicates that the slope is not significantly different from zero, whereas small p-values indicate that the slope of the line is significantly different from zero and the dependent variable is likely related to the independent variable. The slope of the line and the p-value together give an indication of trend in the time series. Regression analysis of the log-transformed data assumes that the data is lognormally distributed.
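The Mann–Kendall S statistic described earlier reduces to a double loop over all pairs of observations; a minimal sketch:

```python
def mann_kendall_s(series):
    """Mann-Kendall S statistic: the sum of the signs of all pairwise
    differences series[j] - series[i] for j > i (later minus earlier).
    A strongly positive S suggests an increasing monotonic trend and a
    strongly negative S a decreasing trend; significance is judged
    against tabulated probabilities (Gilbert, 1987)."""
    def sign(d):
        return (d > 0) - (d < 0)
    n = len(series)
    return sum(sign(series[j] - series[i])
               for i in range(n) for j in range(i + 1, n))
```

For a strictly increasing series of length n, S attains its maximum of n(n − 1)/2; for a constant series, S is zero.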
Using both the Mann–Kendall test of trend and a linear regression analysis of the log-transformed concentration time series provides a statistically defensible argument for trends in concentration at a location. Trend analysis is supported by plotting the concentration time series, and the regression statistics can be provided on the plot. The probability associated with the Mann–Kendall test statistic is then supported by the second line of evidence provided by the regression analysis.

2.4.6 Hydrograph Analysis

The analysis of flow through a surface water body is an important aspect of understanding concentration data. The concentration of a particular parameter is a function of the amount of mass present in a particular volume. For samples taken during high flow events or low flow events, the concentration of a constituent may be diluted or concentrated, respectively. Determining the representative volume in which the sample was taken, as well as understanding the amount of flow resulting from a particular storm event, are important considerations when undertaking an analysis of surface water quality data.

Baseflow separation is a method of hydrograph interpretation that aids in determining what volumes are due to a storm event and what volumes are the contributions of groundwater flow, i.e., baseflow. Baseflow separation can be achieved through several techniques, depending on the watershed and particular surface water body. The straight line method of baseflow separation assumes that the contribution of groundwater flow is constant throughout the storm event. Other, more sophisticated methods assume that baseflow changes throughout the duration of a storm event or multiple storm events. Further information on baseflow separation can be found in many hydrology texts, such as Applied Hydrology (Chow et al., 1988).

The volume of flow from a hydrograph after baseflow separation can be estimated through a simple integration of the hydrograph. If the time steps of the hydrograph are constant, the integration is straightforward: the volume is the sum, over the course of the storm, of the arithmetic average of the flow at the beginning and end of each time step multiplied by the duration of the time step.
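The constant-time-step integration described above amounts to the trapezoidal rule; a minimal sketch with a hypothetical storm hydrograph:

```python
def hydrograph_volume(flows, dt):
    """Trapezoidal integration of a hydrograph sampled at a constant
    time step: sum the average of each pair of consecutive flows
    multiplied by the time-step duration."""
    return sum((flows[i] + flows[i + 1]) / 2.0 * dt
               for i in range(len(flows) - 1))

# Hypothetical storm flows (cfs) after baseflow separation,
# recorded every 30 minutes (1,800 s)
storm_flows = [0.0, 2.0, 4.0, 2.0, 0.0]
print(hydrograph_volume(storm_flows, 1800.0))  # -> 14400.0 cubic feet
```

The same routine applies to the total hydrograph before separation; subtracting the baseflow volume from the total volume gives the storm-event volume directly.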
2.4.7 Censored Data (Nondetects)

Censored data, or nondetects from laboratory analyses, are values known only to be above or below an analytical reporting limit.2 Nondetects are commonly found in analytical laboratory reports received by airports and are influenced by differing reporting limits based on changes in analytical methods, laboratories, or sample variability. If nondetects are not carefully considered when analyzing data, estimated summary statistics may become biased and non-representative of the monitored site (Helsel, 2005). Four approaches are often used to handle nondetects: (1) simple substitution, (2) maximum likelihood estimation (MLE), (3) regression on order statistics (ROS), and (4) Kaplan–Meier (K-M) estimation. Each of these methods is briefly described in the following subsections.

Simple Substitution

Simple substitution replaces all nondetect values with a constant value, such as zero, the detection limit, or half the detection limit. There is no theoretical or mathematical justification for the practice, yet it remains widely used. By substituting constant values, the distribution of the data (i.e., its histogram) is altered and the overall variability is reduced.
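The sensitivity to the substituted constant can be seen directly; with the hypothetical values below, the three common substitution choices yield three different means:

```python
import statistics

detects = [5.0, 8.0, 12.0]   # hypothetical detected concentrations
n_nd, dl = 3, 2.0            # three nondetects at a detection limit of 2.0

for sub in (0.0, dl / 2.0, dl):
    data = detects + [sub] * n_nd   # replace every nondetect with `sub`
    print(f"substitute {sub}: mean = {statistics.mean(data):.2f}, "
          f"stdev = {statistics.stdev(data):.2f}")
```

The computed mean shifts with the analyst's arbitrary choice of constant, which is exactly the bias the guidance below warns against.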
For more information about multiple detection limits and bias-corrected maximum likelihood estimators, see:
• Adjusted Maximum Likelihood Estimation of the Moments of Lognormal Populations from Type I Censored Samples (Cohn, 1988)
• Statistical Approaches to Estimating Mean Water Quality Concentrations with Detection Limits (Shumway et al., 2002)
• Maximum Likelihood Method for Parameter Estimation in Linear Model with Below-Detection Data (Sharma et al., 1995)
• Robust Estimation of Mean and Variance Using Environmental Data Sets with Below Detection Limit Observations (Singh and Nocerino, 2001)
• Estimation of Moments and Quantiles Using Censored Data (Kroll and Stedinger, 1996)

2 The terms "reporting limit" and "detection limit" are intentionally used loosely and interchangeably in this chapter. While there is clearly a difference between these values (a reporting limit, or quantitation limit, is a threshold based on a measure of the variability or noise inherent in the laboratory process, while a detection limit is a threshold below which measured values are not considered statistically different from a blank signal [Helsel, 2005]), it has become commonplace to use either term when referring to nondetects. However, some laboratories will report values between the method detection limit and the laboratory reporting limit. In these cases, Helsel (2005) recommends re-censoring the in-between values as less than the reporting limit and then using a method that can handle multiple detection limits, such that the technique accounts for the fact that the reporting limit is greater than the detection limit.

Estimates of the mean and

median may become biased high or low depending on the level of censoring and the substitution method employed. It is strongly recommended that simple substitution be avoided, especially when the level of censoring exceeds 5 to 10 percent of the observed data. If simple substitution must be performed, the only reasonable value to use is half of the detection limit, as the use of zero or the detection limit can cause more severe bias in computed summary statistics.

Maximum Likelihood Estimation

With the MLE method, both the censored and uncensored data is assumed to follow a theoretical distribution (as discussed previously, the lognormal is often a good choice for water quality data). Summary statistics are then computed as the values that maximize the log-likelihood function [see Helsel (2005) for details]. Maximum likelihood estimators for censored data sets have been refined to handle multiple detection limits, and several researchers have developed bias-corrected (Cohn, 1988; Shumway et al., 2002; Sharma et al., 1995) and robust (Singh and Nocerino, 2001; Kroll and Stedinger, 1996) MLE formulations.

Regression on Order Statistics

ROS is a category of robust methods used to estimate descriptive statistics of censored data sets that utilize the normal scores of the order statistics (Shumway et al., 2002). ROS is a plotting position method developed by Hirsch and Stedinger (1987) and later refined by Helsel and Cohn (1988) for water quality data. In this method, plotting positions are based on conditional probabilities and ranks, where the censored (below detection) and uncensored (above detection) data related to each detection limit are ranked independently. After plotting positions for the censored and uncensored values have been calculated, the log-transformed uncensored values are plotted against the z-statistic corresponding to the plotting position.
The best-fit line through the known (uncensored) data points is then derived. Using this line and the plotting positions for the censored data, values for the censored data can be extrapolated. The complete "filled in" data set can then be used to estimate descriptive statistics, either by transforming all values back to the original units and computing the statistics (non-parametric formulation) or by computing the statistics in the log-transformed units (parametric formulation) and using lognormal reconversion formulae. Refer to Helsel (2005) or Helsel and Cohn (1988) for details.

Kaplan–Meier

The K-M method is the standard method for estimating summary statistics for censored survival data (Helsel, 2005). It is a completely non-parametric method that utilizes the ranks of the data to estimate "survival probabilities." In the context of water quality data, the survival probability is the probability that a data point would occur below the next incremental concentration, given the number of data at or below that concentration or detection limit. Because this method is designed for right-censored data, all observations must be subtracted from an arbitrary value that is higher than the largest observation before it can be used. This transformation results in an empirical cumulative distribution function for the data set. See Helsel (2005) for details.

Recommended Approach for Handling Nondetects

Of the methods described herein, the K-M method is the most robust for calculating percentiles and works well on both small and large data sets; however, it cannot be used if nondetects make up more than 50 percent of the samples for a given parameter. Estimates of the mean are biased high using this approach if the lowest reported values are nondetects, which is typically the case.
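For the simplified case of a single detection limit lying below all detected values, the ROS extrapolation described earlier can be sketched as follows (the concentrations are hypothetical; the full Helsel-Cohn procedure additionally handles multiple detection limits):

```python
import math
from statistics import NormalDist, mean

def ros_impute(detects, n_censored):
    """Simplified single-detection-limit ROS: fit a line to the
    log-transformed detected values versus the normal quantiles of
    their plotting positions, then read imputed values for the
    nondetects off that line at the lowest plotting positions."""
    nd = NormalDist()
    n = len(detects) + n_censored
    detects = sorted(detects)
    # Weibull plotting positions; the nondetects occupy the lowest ranks
    z_det = [nd.inv_cdf((n_censored + i + 1) / (n + 1))
             for i in range(len(detects))]
    logs = [math.log(v) for v in detects]
    # Ordinary least-squares fit of log(value) on z
    zb, lb = mean(z_det), mean(logs)
    slope = (sum((zi - zb) * (li - lb) for zi, li in zip(z_det, logs))
             / sum((zi - zb) ** 2 for zi in z_det))
    intercept = lb - slope * zb
    # Extrapolate the censored values at their plotting positions
    return [math.exp(intercept + slope * nd.inv_cdf((i + 1) / (n + 1)))
            for i in range(n_censored)]

# Five hypothetical detects and three nondetects below a limit of 2.0
imputed = ros_impute([2.0, 5.0, 8.0, 12.0, 20.0], n_censored=3)
print(imputed)  # three imputed values to "fill in" the data set
```

The imputed values, combined with the detects, form the "filled in" data set from which summary statistics can be computed in either the original or the log-transformed units.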
For more information about log-transformed units and lognormal reconversion formulas, see:
• Nondetects and Data Analysis: Statistics for Censored Environmental Data (Helsel, 2005)
• Estimation of Descriptive Statistics for Multiply Censored Water Quality Data (Helsel and Cohn, 1988)

The variance and standard deviation tend to be sensitive to the presence of

Interpreting Monitoring Data 95 extreme values in the data set (Helsel, 2005). For these reasons, the K-M method is not recom- mended for general use, particularly when estimates of the mean and its confidence interval are desired. The MLE and ROS approaches are both useful and can be equally robust and accurate meth- ods for estimating summary statistics. They both require that a distribution be assumed and both have robust and fully parametric formulations. When the distributional assumption is valid, the MLE methods can be more precise, but these methods require larger sample sets (n â¥ 50) to estimate unbiased summary statistics using the parametric formulation. Probability plotting, as used in the ROS method, is less precise but handles small data sets better. The ROS method is more straightforward than the MLE, does not require numerical approximations, and can be relatively easily programmed into spreadsheets. As such, the ROS is generally preferred. Many statistical software packages include one or both methods. The ProUCL software package is available free from the U.S. EPA website and can be used to compute summary statistics using the ROS method. The Water Quality Data Analysis Tool developed to supplement this guide- book performs an ROS analysis on identified nondetects prior to computing summary statistics. See Appendix E for details. 2.4.8 Bootstrap Methods Bootstrap methods are a class of data resampling procedures used to estimate summary sta- tistics and their accuracy (standard error). Originally developed by Bradley Efron in 1979, many variations and improvements have been made and the number of applications has grown sig- nificantly (Efron and Tibishirani, 1993; Chernick, 1999). The basic bootstrap method includes sampling from the data set with replacement, calculating the desired descriptive statistics from the sampled data, and repeating several thousand times. 
Fundamentally, this bootstrap procedure is based on the central limit theorem, which suggests that even when the underlying population distribution is non-normal, averaging produces a distribution more closely approximated by the normal distribution than the sampled distribution (Devore, 1995).

There are a number of benefits to using the bootstrap method to estimate summary statistics rather than other standard techniques. First, the statistical distribution of the underlying population need not be assumed when using this method. Secondly, the bootstrap method provides more robust estimates of parametric statistics when an underlying distribution can be assumed. Lastly, the bootstrap method allows the accuracy of statistical estimates to be computed even when no analytical formula exists (e.g., the standard error of the median). Several methods for calculating confidence intervals, or the reliability of an estimate, are also available. Refer to Efron and Tibshirani (1993) for more information. The Water Quality Data Analysis Tool developed to supplement this guidebook uses a bootstrap method to compute confidence intervals on selected summary statistics. See Appendix E for details.

As with any statistical analysis technique, small data sets can be a problem for the bootstrap method. Small data sets underestimate the true variability of the underlying distribution, and this underestimation can become magnified with the bootstrap method due to repeated values drawn during resampling (Chernick, 1999). Therefore, as a word of caution, for small data sets (e.g., n < 30) the bootstrap method may produce inaccurate estimates of population statistics. In these cases, especially when estimating confidence intervals or exceedance frequencies, parametric methods may be more reliable than the bootstrap method.

For more information about summary statistics using the ROS method, see:
• U.S. EPA ProUCL software package

Key Takeaways: Applying Statistical Methodologies

• Statistical methods can be used to project the behavior of an entire population of water monitoring data from a subset of the total population.
• Statistical methods are not valid under all conditions. Understand the limitations of the methods (especially related to the distribution of the data) and take the time to select the proper methods.
• When possible, define the statistical methods to be used in the water monitoring plan, providing details on the method type, data points needed, limitations in applicability, assumptions, and uses.
• Use graphs to illustrate the statistical analysis when possible.
• Various software packages are available with built-in statistical methods, including methods for handling nondetects. If uncomfortable with the details of the statistical analysis, seek out experts and request that they explain the results in understandable terms.
• Set a policy for how nondetects (censored data) are handled:
  - Avoid simple substitution (use of zero, half the detection limit, or the detection limit).
  - Use a valid statistical method (regression on order statistics is recommended) instead of simple substitution.
  - If simple substitution must be used, select half the detection limit.