In the context of the U.S. Department of Defense (DoD) acquisition system, reliability metrics are summary statistics that are used to represent the degree to which a defense system’s reliability as demonstrated in a test is consistent with successful application across the likely scenarios of use. Different metrics are used in conjunction with continuously operating systems (such as tanks, submarines, aircraft), which are classified as either repairable or nonrepairable, and with “one-shot” systems (such as rockets, missiles, bombs). Reliability metrics are calculated from data generated by test programs.
This chapter discusses “reliability metrics,” such as the estimated mean time between failures for a continuously operating system. We consider repairable and nonrepairable systems, continuous and one-shot systems, and hybrids.
A system’s requirements in a request for proposal (RFP) will be written in terms of such metrics, and these metrics will be used to evaluate a system’s progress through development. Tracking a reliability metric over time, as a system design is modified and improved, leads to the topic of reliability growth models, which are the subject of the next chapter.
In developmental and operational testing, continuously operating systems that are repairable perform their functions as required until interrupted by a system failure that warrants repair or replacement (ordinarily at the subsystem or component level). For measuring and assessing operational reliability, the primary focus for “failures” is generally restricted to operationally critical failure modes, which include operational mission failure, critical failure, and system abort. Test results and requirements are often expressed accordingly—as the mean time between operational mission failures (MTBOMF), the mean time between critical failures (MTBCF), and the mean time between system aborts (MTBSA)—or as the probability of successfully completing a prescribed operational mission of a given time duration without experiencing a major failure.1,2
Standard DoD reliability analyses normally entail three analytical assumptions:
- Restoration activities return a failed test article to a state that is “as good as new.” That is, the time to first failure (from the beginning of the test3) and the subsequent times between failures for the subject test article all are taken to be statistically independent observations governed by a single probabilistic distribution.
- The same time-to-failure distribution (or failure probability for one-shot systems) applies to each test article over replications.
- The common time-to-failure distribution (or failure probability for one-shot systems) is exponential with failure rate parameter λ (alternatively parameterized in terms of a mean time to failure parameter, θ = 1/λ).
There are two advantages to invoking this nominal set of assumptions: it simplifies statistical analyses, and it facilitates the interpretability of results. Analyses are then the examination of the number of failure times and censored times (from the time of the last failure for a test article to the end of testing time for that article) that are observed, assuming a single underlying exponential distribution. A mathematically equivalent formulation is that the total number of observed failures in the total time on test (across all test articles), T, is governed by a Poisson distribution with expected value equal to λT (or T/θ). The DoD primer on reliability, availability, and maintainability (RAM) (U.S. Department of Defense, 1982), as well as numerous textbooks on reliability, address this nominal situation (within the framework of a homogeneous Poisson process) and provide straightforward estimation, confidence bounds, and test duration planning methodologies. In practice, the customary estimate of a system’s mean time between failures is simply calculated to be the total time on test, T, divided by the total number of observed failures (across all test articles).4 It is readily comprehensible and directly comparable to the value of reliability that had been projected for the test event or required to be demonstrated by the test event.

1 Lower levels of failures should not necessarily be ignored by logistics planners, especially if they will lead to substantial long-term supportability costs.

2 Another important metric, but outside of the scope of this report, is operational availability—the long-term proportion of time that a system is operationally capable of performing an assigned mission. Estimating availability requires estimating system down times due to planned and unplanned maintenance activities.

3 Test articles may undergo pretest inspections and maintenance actions.
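The customary point estimate and the associated chi-square confidence bounds can be sketched as follows. This is a minimal illustration assuming Python with SciPy; the test hours, failure count, and 80 percent confidence level are hypothetical, and the bounds shown are the common form for a time-truncated (Type I censored) test.

```python
from scipy.stats import chi2

def mtbf_estimate(total_time_on_test, n_failures):
    """Customary estimate: total time on test divided by observed failures."""
    return total_time_on_test / n_failures

def mtbf_confidence_interval(total_time_on_test, n_failures, confidence=0.80):
    """Two-sided interval for the exponential mean time between failures,
    using chi-square quantiles for a time-truncated test."""
    alpha = 1.0 - confidence
    lower = 2.0 * total_time_on_test / chi2.ppf(1.0 - alpha / 2.0, 2 * n_failures + 2)
    upper = 2.0 * total_time_on_test / chi2.ppf(alpha / 2.0, 2 * n_failures)
    return lower, upper

# Hypothetical test event: 1,000 hours of total time on test across all
# articles, with 8 observed failures.
theta_hat = mtbf_estimate(1000.0, 8)
lo, hi = mtbf_confidence_interval(1000.0, 8)
```

A failure-truncated (Type II) test would use 2n rather than 2n + 2 degrees of freedom for the lower bound; the appropriate form depends on how the test is terminated.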
Although the above assumptions support analytical tractability and are routinely adopted in DoD assessments of the reliability demonstrated in an individual test, alternative assumptions merit consideration and exploration for their viability and utility. Rather than the assumption of a return to a state “as good as new,” a physically more defensible assumption in many instances (e.g., in a system with many parts) might be that a repair or replacement of a single failed part only minimally affects the system state (relative to what it was before the failure) and the system is thus restored to a state approximately “as bad as old.” This perspective would accommodate more complex phenomena in which the system failure rate may not be constant over time (e.g., monotonically increasing, which corresponds to aging articles that tend to experience more failures as operating time accumulates). Flexible statistical models and analysis approaches suitable for these more general circumstances, both parametric and nonparametric, are widely available (e.g., Rigdon and Basu, 2000; Nelson, 2003). Sample size demands for precise estimation, however, may exceed what typical individual developmental or operational tests afford. For example, the total hours on test available for a single test article often can be quite limited—spanning only a few lifetimes (measured in terms of the prescribed reliability requirement) and sometimes even less than one lifetime. An additional issue relates to the interpretability of models that portray nonconstant failure intensities: In particular, what sort of a summary estimate for a completed developmental or operational test event should be reported for comparison to a simply specified mean time between failures prediction or requirement (that did not contemplate time-variant intensities)?
Sample size limitations likewise may hinder examinations of heterogeneity for individual or collective groupings of test articles. When data are ample, statistical tests are available for checking a number of potential hypotheses of interest, such as no variability across select subgroups (e.g., from different manufacturing processes), no outliers that may be considered for deletion from formal scoring and assessment (e.g., possibly attributable to test-specific artificialities), and the like. Caution needs to be taken, however, to recognize potential sensitivities to inherent assumptions (e.g., such as assumptions 1 and 3 above) attendant to the application of any specific methodology.

4 Under the three analytic assumptions above, the mean time between failures is synonymous with the mean time to failure or the mean time to first failure, other commonly used terms.
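One such check of subgroup homogeneity can be sketched as a likelihood-ratio test under the exponential assumption. This is an illustrative sketch assuming SciPy; the failure counts and test hours for the two hypothetical manufacturing lots are invented for the example.

```python
import math

from scipy.stats import chi2

def exponential_homogeneity_test(failures, times):
    """Likelihood-ratio test that subgroups of test articles share a common
    exponential failure rate. failures[g] and times[g] are the failure count
    and total time on test for subgroup g (each count must be positive)."""
    def loglik(n, t, rate):
        # Poisson-count form of the exponential log-likelihood (up to a constant).
        return n * math.log(rate) - rate * t

    pooled_rate = sum(failures) / sum(times)
    ll_pooled = sum(loglik(n, t, pooled_rate) for n, t in zip(failures, times))
    # Full model: each subgroup gets its own maximum-likelihood rate n/t.
    ll_full = sum(loglik(n, t, n / t) for n, t in zip(failures, times))
    lr_stat = 2.0 * (ll_full - ll_pooled)
    p_value = chi2.sf(lr_stat, df=len(failures) - 1)
    return lr_stat, p_value

# Two hypothetical lots with very similar observed rates: the test should
# find no evidence of heterogeneity (large p-value).
stat, p = exponential_homogeneity_test([8, 7], [1000.0, 900.0])
```

As the surrounding text cautions, this test inherits the exponential and as-good-as-new assumptions; a significant result may reflect a violation of those assumptions rather than genuine subgroup differences.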
Although there often are plausible motivations for asserting that an exponential time-to-failure distribution is appropriate (e.g., for electronics or for “memoryless” systems in circumstances for which aging is not a major consideration), there is no scientific basis to exclude the possibility of other distributional forms. For example, the more general two-parameter Weibull distribution (which includes the exponential as a special case) is frequently used in industrial engineering. Observed failure times and available statistical goodness-of-fit procedures can guide reliability analyses to settle on a particular distribution that reasonably represents the recorded data from a given test. The plausibility of the “as good as new” assumption warrants scrutiny when repeat failures (recorded on an individual test article) are incorporated into the analyses.
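A goodness-of-fit comparison of this kind can be sketched as a likelihood-ratio check of the exponential against the two-parameter Weibull, which nests it (shape parameter equal to 1). The sketch assumes SciPy and NumPy; the failure times are synthetic, drawn from a Weibull distribution purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical complete (uncensored) failure times, in hours.
failure_times = rng.weibull(1.5, size=60) * 200.0

# Exponential fit (location fixed at zero) and its log-likelihood.
_, expon_scale = stats.expon.fit(failure_times, floc=0)
ll_expon = stats.expon.logpdf(failure_times, scale=expon_scale).sum()

# Two-parameter Weibull fit; shape == 1 recovers the exponential.
shape, _, weib_scale = stats.weibull_min.fit(failure_times, floc=0)
ll_weibull = stats.weibull_min.logpdf(failure_times, shape, scale=weib_scale).sum()

# Likelihood-ratio statistic for exponential (shape = 1) versus Weibull;
# a small p-value argues against the exponential restriction.
lr_stat = 2.0 * (ll_weibull - ll_expon)
p_value = stats.chi2.sf(lr_stat, df=1)
```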
Distinct estimation and confidence interval methods are associated with different choices for the time-to-failure distribution. The mathematical form of the distribution function provides a direct link between parameter estimates (e.g., the mean time between failures for the exponential distribution) and the probability of the system performing without a major failure over a prescribed time period (e.g., mission reliability). For a given set of failure data, different specifications of the time-to-failure distribution can lead to different estimates for the mean time between failures and generally will lead to distinct estimates for mission reliability. For the one-parameter exponential distribution, there is a one-to-one correspondence between the mean time between failures and mission reliability. This is not the case for other distributions.
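The point can be illustrated numerically: two distributions with the same 100-hour mean time to failure yield different mission reliabilities for the same mission duration. This is a plain-Python sketch with hypothetical parameter values.

```python
import math

def mission_reliability_exponential(mtbf, mission_hours):
    """Exponential survival function exp(-t / theta): mission reliability is a
    one-to-one function of the mean time between failures."""
    return math.exp(-mission_hours / mtbf)

def mission_reliability_weibull(shape, scale, mission_hours):
    """Weibull survival function exp(-(t / scale)**shape)."""
    return math.exp(-((mission_hours / scale) ** shape))

theta = 100.0   # common mean time to failure, in hours
shape = 2.0     # a wear-out-type Weibull shape, for illustration
# Weibull mean = scale * Gamma(1 + 1/shape); choose scale to match theta.
scale = theta / math.gamma(1.0 + 1.0 / shape)

# Same mean, same 24-hour mission, different mission reliabilities.
r_exp = mission_reliability_exponential(theta, 24.0)
r_weib = mission_reliability_weibull(shape, scale, 24.0)
```

Here the Weibull model, with its low early-life failure intensity, assigns a noticeably higher 24-hour mission reliability than the exponential model despite the identical mean.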
Implicit in assumption 3 above is that the environment and operating conditions remain constant for the test article each time it is repaired and returned to service. Unless statistical extrapolation methods are applied, reliability estimates generated from a single test’s observed failure data should be interpreted as representative solely of the circumstances of that test.5 The possible effects of influential factors (e.g., characterizing the execution of the testing or description of the past usage or maintenance and storage profiles) on system reliability can be portrayed in regression models or hierarchical model structures. For instance, with sufficient test data, one could assess whether changes in storage conditions had an impact on system reliability. In general, adequate sample sizes would be needed to support parameter estimation for these more sophisticated representations of system reliability.
5 The issue of the relevance of the testing conditions to operational scenarios is considered in the next chapter.
Continuously operating nonrepairable systems (e.g., batteries, remote sensors) function until a failure occurs or until there is some signal or warning that life-ending failure is imminent; in either case, the system is swapped out. Each system experiences at most one failure (by definition—it cannot be restored and brought back into service after having failed). Some systems that routinely are subjected to minor maintenance can be viewed as nonrepairable with respect to catastrophic failure modes (e.g., a jet engine).
For these systems, a relevant reliability metric is mean time to failure. From an experimental perspective, nonrepairable systems can be tested until they fail or can be tested under censoring schemes that do not require all articles to reach their points of failure. These data provide an estimate of expected operational lifetimes and their variability. In addition, analytical models can be developed to relate expected remaining life to concomitant data, such as recorded information on past environmental or usage history, measures of accumulated damage, or other predictive indicators obtained from sensors on the systems.
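One nonparametric way to estimate lifetimes from such censored data is the Kaplan-Meier (product-limit) estimator, sketched below in plain Python. The failure and withdrawal times are hypothetical; in this simple version, no failure and censoring share the same recorded time (tied failure/censoring times would require the convention that failures precede censorings).

```python
def kaplan_meier(times, failed):
    """Product-limit estimate of the survival function from right-censored
    lifetime data. failed[i] is True when article i ran to failure and False
    when it was withdrawn from test (censored) at times[i]."""
    events = sorted(zip(times, failed))
    at_risk = len(events)
    survival, curve = 1.0, []
    for t, is_failure in events:
        if is_failure:
            # Each failure multiplies survival by the fraction surviving it.
            survival *= (at_risk - 1) / at_risk
            curve.append((t, survival))
        at_risk -= 1  # failed or censored articles leave the risk set
    return curve

# Hypothetical nonrepairable articles: three run to failure (at 120, 250,
# and 310 hours), two withdrawn still working at 400 hours.
curve = kaplan_meier([120.0, 250.0, 310.0, 400.0, 400.0],
                     [True, True, True, False, False])
```

The censored withdrawals do not add steps to the curve, but they do keep the surviving articles in the risk set, which is what lets censoring schemes inform lifetime estimates without running every article to failure.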
Nonrepairable systems are common in many commercial settings, but they are rare in DoD acquisition category I testing. However, prognostic-based reliability predictions, which assess various forms of degradation, continue to gain prominence as a means of reducing defense system life-cycle costs (Pecht and Gu, 2009; Collins and Huzurbazar, 2012). System program managers are instructed (U.S. Department of Defense, 2004, p. 4) to optimize operational readiness with “diagnostics, prognostics, and health management techniques in embedded and off-equipment applications when feasible and cost-effective.”
Testing of one-shot (or “go/no go”) systems in a given developmental or operational testing event involves a number of individual trials (e.g., separate launches of different missiles) with the observed performance in any trial being characterized as either a “success” or “failure.” Assumption 1 above generally is not germane because a test article is required to function only once.6 But assumptions 2 and 3 and the discussion above are, for the most part, very relevant. One exception is that the distribution of interest governing a single test result generally is modeled with a one-parameter Bernoulli distribution.
6 There may be exceptions. For example, it is possible that a failure to launch for a rocket in a trial could be traced to an obvious fixable wiring problem and the rocket subsequently reintroduced into the testing program.
One associated reliability metric is the estimated probability of success. The estimate can be a “best estimate” or it can be a “demonstrated” reliability, for which the metric is a specified statistical lower confidence limit on the probability of success. Because reliability can depend on environmental conditions, estimates of reliability at specified conditions may be required, rather than a single reliability metric. Reliability can be defined for individual prescribed sets of conditions that establish operational profiles and scenarios or for a predetermined collection of conditions. Estimates of system reliability derive from observed proportions of “successful” trials—either in total for a test event or specific to particular combinations of test factors (e.g., using logistic regression models). The estimates need to be interpreted as pertaining to the specific circumstances of that testing. Given sufficient data, statistical modeling could support extrapolations and interpolations to other combinations of variables.
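A demonstrated reliability of this kind is often computed as an exact (Clopper-Pearson) one-sided lower confidence limit on the success probability. The sketch below assumes SciPy; the trial counts and the 80 percent confidence level are hypothetical.

```python
from scipy.stats import beta

def demonstrated_reliability(successes, trials, confidence=0.80):
    """Exact (Clopper-Pearson) one-sided lower confidence limit on the
    probability of success for a one-shot (go/no go) system."""
    if successes == 0:
        return 0.0
    # The lower limit is the alpha quantile of Beta(s, n - s + 1).
    return beta.ppf(1.0 - confidence, successes, trials - successes + 1)

# Hypothetical test event: 23 successes in 25 trials.
point_estimate = 23 / 25
lower_bound = demonstrated_reliability(23, 25)
```

Note that even a flawless record (25 of 25) yields a demonstrated reliability strictly below 1, which is one reason the “best estimate” and “demonstrated” metrics can tell quite different stories for small trial counts.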
Hybrid models of system reliability, embodying both time-to-failure and success-failure aspects, also may be suitable for some testing circumstances. Imagine, for instance, a number of cruise missiles (without active warheads), which are dedicated to testing in-flight reliability, being repeatedly captive-carried by aircraft for extended periods to simulate the in-flight portions of an operational cruise missile mission. Observed system failures logically could be examined from the perspective of mean time between failures, facilitating the construction of estimates of in-flight reliability that correspond to distinct operational mission scenarios that span a wide spectrum of launch-to-target ranges. To obtain measures of overall reliability, these estimates could be augmented by results from separate one-shot tests, in the same developmental or operational testing event, that focus on the probabilities of successful performance for other non-in-flight elements of cruise missile performance (e.g., launch, target recognition, warhead activation, and warhead detonation).
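Under such a hybrid formulation, and assuming the mission phases fail independently, an overall mission reliability could be assembled as sketched below. This is plain Python; the in-flight mean time between failures, flight duration, and phase success probabilities are all hypothetical.

```python
import math

def overall_mission_reliability(mtbf_hours, flight_hours, phase_probabilities):
    """Hybrid reliability metric: exponential in-flight survival over the
    mission's flight time, multiplied by the success probabilities of the
    one-shot phases, assuming independent phase failures."""
    in_flight = math.exp(-flight_hours / mtbf_hours)
    one_shot = math.prod(phase_probabilities)
    return in_flight * one_shot

# Hypothetical inputs: a 50-hour in-flight MTBF estimated from captive-carry
# testing; a 3-hour launch-to-target flight; and launch, target-recognition,
# and warhead phase reliabilities from separate one-shot tests.
r = overall_mission_reliability(50.0, 3.0, [0.98, 0.95, 0.97])
```

Because the in-flight term depends on flight time, this single formulation produces distinct overall reliabilities across the spectrum of launch-to-target ranges described above.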
In this example, the mode of testing (involving repairable test articles) does not match the tactical use of the operational system (single launch, with no retrieval). From an operational perspective, the critical in-flight reliability metric for cruise missiles could be taken to be mean time to failure—which can be conceptually different from mean time between failures when assumption 1 (above) does not hold.
The process of deriving formal system reliability requirements and intermediate reliability goals to be attained at various points in the test program is undertaken well before any system testing begins. Consequently, the simple mean time between failures and probability-of-success metrics traditionally prescribed in DoD acquisition test and evaluation contexts are reasonable. For a given developmental or operational test event, knowledge of the nature of the reliability data and of the particulars of the testing circumstances, including the composition and characteristics of test articles, is available. This information can support the development of alternative models and associated specific forms of reliability metrics that are potentially useful for describing reliability performance demonstrated in testing and projected to operational missions.
Very often, the standard mean time between failures and probability-of-success metrics will be appropriate for describing system-level reliability given the confines of the available test data. Such a determination should not be cavalierly invoked, however, without due consideration of more advanced plausible formulations—especially if those formulations might yield information that will support reliability or logistics supportability improvement initiatives, motivate design enhancements for follow-on testing, or substantively inform acquisition decisions. The more sophisticated methodological approaches based on more elaborate distributions with parameters linked to storage, transportation, type of mission, and environment of use may be particularly attractive after a system is fielded (for some classes of fielded systems), when the amount and composition of reliability data may change substantially given what is available from developmental and operational testing.
Several points that have been noted warrant emphasis. For any system, whether in the midst of a developmental program or after deployment, there is no such thing as a single true mean time between failures or actual mean time to first failure (Krasich, 2009). System reliability is a function of the conditions, stresses, and operating profiles encountered by the system during a given period of testing or operations, and these can and do vary over time. System reliability likewise is influenced by the composition of the systems themselves (i.e., test articles or specific deployed articles that are monitored), which may include diverse designs, manufacturing processes, past and current usage, and maintenance profiles. Estimates of the mean time between failures, mean time to first failure, or any other metric for system reliability need to be interpreted accordingly.
Operational reliability is defined in terms of one or more operational mission profiles expected to be encountered after a defense system has attained full-rate production status and is deployed. Ideally, system-level developmental and operational testing would mimic or plausibly capture the key attributes of these operational circumstances, particularly for operational testing and the later stages of developmental testing. It is important to understand, however, that there are limitations to the extent to which operational realism is or can be reflected in those testing events. Moreover, efficient developmental testing strategies, especially when a system’s functional capabilities emerge incrementally over time, may not readily lend themselves to complete examination of system operational reliability, especially in the early stages of developmental testing. Again, as appropriate, distinctions should be drawn between estimates of system reliability and estimates of system operational reliability.