3 Test Measures

The Interim Brigade Combat Team (IBCT) equipped with the Stryker is intended to provide more combat capability than the current Light Infantry Brigade (LIB) and to be significantly more strategically deployable than a heavy Mechanized Infantry Brigade (MIB). It is anticipated that the IBCT will be used in at least two roles:

1. as part of an early entry combat capability against armed threats in small-scale contingencies (SSC). These IBCT engagements are likely to be against comparable forces, that is, forces that can inflict meaningful casualties on each other.

2. in stability and support operations against significantly smaller and less capable adversaries than anticipated in SSC. The Stryker system evaluation plan (SEP) uses the term security operations in a stability environment (SOSE); that term will be used here.

The IBCT/Stryker initial operational test (IOT) will include elements of both types of IBCT missions to address many of the issues described in the Stryker SEP. This chapter provides an assessment of ATEC's plans for measures to use in analyzing results of the IOT. We begin by offering some definitions and general information about measures as background for specific comments in subsequent sections.

INTRODUCTION TO MEASURES

Using the IBCT and the IOT as context, the following definitions are used as a basis for subsequent discussions.

The objective of the IBCT is synonymous with the mission it is assigned to perform. For example:

· "Attack to seize and secure the opposition force's (OPFOR) defended position" (SSC mission)
· "Defend the perimeter around . . . for x hours" (SSC mission)
· "Provide area presence to . . ." (SOSE mission)

Objectives will clearly vary at different levels in the IBCT organization (brigade, battalion, company, platoon), and several objectives may exist at one level and may in fact conflict (e.g., "Attack to seize the position and minimize friendly casualties").

Effectiveness is the extent to which the objectives of the IBCT in a mission are attained. Performance is the extent to which the IBCT demonstrates a capability needed to fulfill its missions effectively. Thus, performance could include, for example, the Stryker vehicle's survivability, reliability, and lethality; the IBCT's C4ISR (command, control, communications, computers, intelligence, surveillance, and reconnaissance); and situation awareness, among other things.

A measure of performance (MOP) is a metric that describes the amount (or level) of a performance capability that exists in the IBCT or some of its systems. A measure of effectiveness (MOE) is a quantitative index that indicates the degree to which a mission objective of the IBCT is attained. Often many MOEs are used in an analysis because the mission may have multiple objectives or, more likely, there is a single objective with more than one MOE. For example, in a perimeter defense mission, these may include the probability that no penetration occurs, the expected value of the time until a penetration occurs, and the expected value of the number of friendly casualties, all of which are of interest to the analyst.

For the IBCT IOT, mission-level MOEs can provide useful information to:

1. evaluate how well a particular mission or operation was (or will be) performed. Given appropriate data collection, they provide an objective and quantitative means of indicating to appropriate decision makers the degree of mission accomplishment;

2. provide a means of quantitatively comparing alternative forces (IBCT versus LIB); and

3. provide a means of determining the contribution of various incommensurate IBCT performance capabilities (survivability, lethality, C4ISR, etc.) to mission success (if they are varied during experiments) and therefore information about the utility of changing the level of particular capabilities.

Although numerical values of mission-level MOEs provide quantitative information about the degree of mission success, the analysis of operational test results should also be a diagnostic process, involving the use of various MOEs, MOPs, and other information to determine why certain mission results occurred. Using only summary MOE values as a rationale for decision recommendations (e.g., select A over B because MOE_A = 3.2 > MOE_B = 2.9) can lead to a tyranny of numbers, in which precisely stated values can be used to reach inappropriate decisions. The most important role of the analyst is to develop a causal understanding of the various factors (force size, force design, tactics, specific performance capabilities, environmental conditions, etc.) that appear to drive mission results and to report on these as well as highlight potential problem areas.

Much has been written about pitfalls and caveats in developing and using MOEs in military analyses. We mention two here because of their relevance to MOEs and analysis concepts presented in the IBCT/Stryker test and evaluation master plan (TEMP) documentation.

1. As noted above, multiple MOEs may be used to describe how well a specific mission was accomplished. Some analysts often combine these into a single overall number for presentation to decision makers. In our view, this is inappropriate, for a number of reasons. More often than not, the different component MOEs will have incommensurate dimensions (e.g., casualties, cost, time) that cannot be combined without using an explicit formula that implicitly weights them. For example, the most common formula is a linear additive weighting scheme. Such a weighting scheme assigns importance (or value) to each of the individual component MOEs, a task that is more appropriately done by the decision maker and not the analyst. Moreover, the many-to-one transformation of the formula may well mask information that is likely to be useful to the decision maker's deliberations.

2. Some MOEs are the ratio of two values, each of which, by itself, is useful in analyzing mission success. However, since both the numerator and the denominator affect the ratio, changes in (or errors in estimating) the numerator have linear effects on the ratio value, while changes (or errors) in the denominator affect the ratio hyperbolically. This effect makes the use of such measures particularly suspect when the denominator can become very small, perhaps even zero. In addition, using a ratio measure to compare a proposed organization or system with an existing one implies a specific value relationship between dimensions of the numerator and the denominator.

Although ratio MOE values may be useful in assessing degrees of mission success, reporting only this ratio may be misleading. Analysis of each of its components will usually be required to interpret the results and develop an understanding of why the mission was successful.
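Both pitfalls can be made concrete with a short sketch. The weights, MOE values, and casualty counts below are hypothetical and serve only as illustration: the first part shows two alternatives whose linear additive roll-ups coincide even though their component MOEs differ sharply, and the second shows that the same one-unit error moves a ratio MOE far more when it falls in a small denominator than in the numerator.

```python
# Pitfall 1 (hypothetical weights and MOE values): a linear additive roll-up
# can rank two alternatives as identical while their component MOEs tell very
# different stories.
moes_a = {"friendly_casualties": 4, "time_to_secure_hrs": 2.0, "civilian_casualties": 3}
moes_b = {"friendly_casualties": 1, "time_to_secure_hrs": 5.0, "civilian_casualties": 6}
weights = {"friendly_casualties": 0.5, "time_to_secure_hrs": 0.3, "civilian_casualties": 0.2}

def rollup(moes):
    """Weighted sum across incommensurate dimensions (casualties, hours, ...)."""
    return sum(weights[k] * moes[k] for k in weights)

print(f"{rollup(moes_a):.1f} {rollup(moes_b):.1f}")   # 3.2 and 3.2: the roll-up masks the tradeoff

# Pitfall 2 (hypothetical counts): errors in the numerator move a ratio MOE
# linearly, while errors in a small denominator move it hyperbolically.
enemy_casualties, friendly_casualties = 20.0, 4.0
base_ratio = enemy_casualties / friendly_casualties                 # 5.00
numerator_error = (enemy_casualties + 1) / friendly_casualties      # 5.25 (about +5%)
denominator_error = enemy_casualties / (friendly_casualties - 1)    # 6.67 (about +33%)
print(f"{base_ratio:.2f} {numerator_error:.2f} {denominator_error:.2f}")
```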

ATEC plans to use IOT data to calculate MOEs and MOPs for the IBCT/Stryker. These data will be collected in two ways: subjectively, using subject-matter experts (SMEs), and objectively, using instrumentation. Our assessment of these plans is presented in the remainder of this chapter, which discusses subjective measures (garnered through the use of SMEs) and objective measures of mission effectiveness and of reliability, availability, and maintainability. SMEs are used to subjectively collect data for MOEs (and MOPs) to assess the performance and effectiveness of a force in both SSC missions (e.g., raid and perimeter defense) and SOSE missions. Objective measures of effectiveness (including casualty-related measures, scenario-specific measures, and system degradation measures) may also be applied across these mission types, although objective casualty-related MOEs are especially useful for evaluating SSC engagements, in which both the IBCT and the OPFOR casualties are indicators of mission success. Casualty-related measures are less commonly applied to SOSE missions, in which enemy losses may have little to do with mission success. Objective measures of reliability, availability, and maintainability are applied to assess the performance and effectiveness of the system.

SUBJECTIVE SUBJECT-MATTER EXPERT MEASURES

Military judgment is an important part of the operational evaluation and will provide the bulk of numerical MOEs for the Stryker IOT. Trained SMEs observe mission tasks and subtasks and grade the results, according to agreed-upon standards and rating scales. The SMEs observe and follow each platoon throughout its mission set. Although two SMEs are assigned to each platoon and make independent assessments, they are not necessarily at the same point at the same time.

SME ratings can be binary (pass/fail, yes/no) judgments, comparisons (e.g., against baseline), or indicators on a numerical task performance rating scale. In addition to assigning a rating, the SME keeps notes with the reasoning behind the assessment. The mix of binary and continuous measures, as well as the fact that the rating scales are not particularly on a cardinal (much less a ratio) scale, makes it inappropriate to combine them in any meaningful way.

Moreover, since close combat tactical training data show that the conventional 10-point rating scale provides values that were rarely (if ever) used by SMEs, ATEC has proposed using an 8-point scale. However, it has also been observed in pretesting that the substantive difference between task performance ratings of 4 and 5 is very much greater than between 3 and 4. This is because, by agreement, ratings between 1 and 4 indicate various levels of task "failure" and ratings between 5 and 8 indicate levels of task "success." The resulting bimodal distribution has been identified by ATEC analysts as representing a technical challenge with respect to traditional statistical analysis. We prefer to regard this phenomenon as being indicative of a more fundamental psychometric issue, having to do with rating scale development and validation. Although there has also been some discussion of using two or three separate rating scales, this would be a useful approach only if there were no attempt to then combine (roll up) these separate scales by means of some arbitrary weighting scheme.

SME judgments are clearly subjective: they combine experience with observations, so that two SMEs could easily come up with different ratings based on the same observations, or a single SME, presented twice with the same observation, could produce different ratings. Using subjective data is by itself no barrier to making sound statistical or operational inferences (National Research Council, 1998b; Veit, 1996). However, to do so, care must be taken to ensure that the SME ratings have the usual properties of subjective data used in other scientific studies, that is, that they can be calibrated, are repeatable, and have been validated. One good way to support the use of SME ratings in an IOT is to present a careful analysis of SME training data, with particular attention paid to demonstrating small inter-SME variance.
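As an illustration of the kind of training-data analysis we have in mind, the sketch below compares paired SME ratings of the same tasks. The ratings are hypothetical, and the two summary statistics shown (mean absolute disagreement and a simple rater-variance-to-task-variance ratio) are only one of several reasonable ways to summarize inter-rater agreement.

```python
# Hedged sketch (hypothetical ratings): summarizing inter-SME variance from
# training data in which two SMEs independently rate the same tasks on the
# proposed 8-point scale.

import statistics

sme_1 = [6, 7, 3, 5, 8, 2, 6, 4]   # ratings by SME 1 on eight observed tasks
sme_2 = [6, 6, 4, 5, 7, 2, 5, 4]   # independent ratings by SME 2 on the same tasks

# Mean absolute disagreement between raters on the same task.
mad = statistics.mean(abs(a - b) for a, b in zip(sme_1, sme_2))

# Between-rater variance relative to the variance across tasks; values near 0
# indicate that rater disagreement is small compared with real task differences.
within_pair_var = statistics.mean((a - b) ** 2 / 2 for a, b in zip(sme_1, sme_2))
across_task_var = statistics.variance([(a + b) / 2 for a, b in zip(sme_1, sme_2)])

print(f"mean absolute disagreement: {mad:.2f} points")
print(f"rater variance / task variance: {within_pair_var / across_task_var:.2f}")
```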

OBJECTIVE MEASURES OF EFFECTIVENESS

In this section we discuss objective measures of effectiveness. Although these involve "objective" data, in the sense that two different observers will agree as to their values, experts do apply judgment in selecting the particular variables to be measured in specific test scenarios. While it is useful to provide summary statistics (e.g., for casualty measures, as discussed below), decision makers should also be provided (as we suggest earlier in this chapter) with the values of the component statistics used to calculate summary statistics, since these component statistics may (depending on analytical methods) provide important information in themselves. For example, summary brigade-level casualties (discussed below) are computed by aggregating company- and squad-level casualties, which by themselves can be of use in understanding complicated situations, events, and scenarios. There are many thousands of objective component statistics that must support complex analyses that depend on specific test scenarios. Our discussion below of casualty-related measures and of scenario-specific measures is intended to illustrate fruitful analyses.

Casualty-Related Measures

In this section we discuss some of the casualty-related MOEs for evaluating IBCT mission success, appropriate for both combat and SOSE missions, but particularly appropriate for SSC-like engagements in which both sides can inflict significant casualties on each other. Specifically, we discuss the motivation and utility of three casualty ratio MOEs presented by ATEC in its operational test plan.

Ideally, an operational test with unlimited resources would produce estimates of the probability of mission "success" (or any given degree of success), or the distribution of the number of casualties, as a function of force ratios, assets committed and lost, etc. However, given the limited replications of any particular scenario, producing such estimates is infeasible. Still, a variety of casualty-related proxy MOEs can be used, as long as they can be shown to correlate (empirically or theoretically) with these ultimate performance measures. We begin by introducing some notation and conventions. The conventions are based on analyses of the cold war security environment that led to the development and rationale underlying two of the ratio MOEs.

Let:

N = initial number of enemy forces (OPFOR) in an engagement (battle, campaign) against friendly forces
M = initial number of friendly forces in an engagement
FR_0 = N/M = initial force ratio
n(t) = number of surviving enemy forces at time t in the engagement
m(t) = number of surviving friendly forces at time t in the engagement
FR(t) = n(t)/m(t) = force ratio at time t
C_n(t) = N - n(t) = number of enemy casualties by time t
C_m(t) = M - m(t) = number of friendly casualties by time t

Although survivors and casualties vary over time during the engagement, we will drop the time notation for ease in subsequent discussions of casualty-related MOEs. In addition, we use the term "casualties" for personnel losses, even though much of the motivation for using ratio measures has been to assess losses of weapon systems (tanks, etc.). It is relatively straightforward to convert a system loss to personnel casualties by knowing the kind of system and type of system kill.

Loss Exchange Ratio

A measure of force imbalance, the loss exchange ratio (LER)[2] is defined to be the ratio of enemy (usually the attacker) losses to friendly (usually defender) losses.[3] That is,

    LER = (N - n) / (M - m) = C_n / C_m        (1)

[2] During the cold war era, measures of warfighting capability were needed to help the Army make resource allocation decisions. The LER measure was created a number of decades ago for use in simulation-based analyses of war between the Soviet-led Warsaw Pact (WP) and the U.S.-led NATO alliance. The WP possessed an overall strategic advantage in armored systems of 2:1 over NATO and a much greater operational-tactical advantage of up to 6:1. Prior to the demise of the Soviet Union in 1989-1991, NATO's warfighting objective was to reduce the conventional force imbalance in campaigns, battles, and engagements to preclude penetration of the Inter-German Border.

[3] Enemy losses will always be counted in the numerator and friendly losses in the denominator regardless of who is attacking.
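A minimal computational sketch of the LER and the initial force ratio follows; the force sizes and survivor counts are hypothetical and are used only to make the definitions concrete.

```python
# Minimal sketch (hypothetical force sizes and survivor counts) of the loss
# exchange ratio (LER) and initial force ratio FR_0 defined above.

def loss_exchange_ratio(N, M, n, m):
    """LER = enemy casualties / friendly casualties = (N - n) / (M - m)."""
    enemy_casualties = N - n
    friendly_casualties = M - m
    if friendly_casualties == 0:
        raise ValueError("LER is undefined when there are no friendly casualties")
    return enemy_casualties / friendly_casualties

N, M = 120, 60          # initial enemy (OPFOR) and friendly force sizes
n, m = 80, 50           # survivors at the end of the engagement

fr0 = N / M                                  # initial force ratio, 2.0
ler = loss_exchange_ratio(N, M, n, m)        # (120 - 80) / (60 - 50) = 4.0
print(f"FR_0 = {fr0:.1f}, LER = {ler:.1f}")
print("force imbalance being reduced:", ler > fr0)
```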

Thus, LER is an indicator of the degree to which the force imbalance is reduced in an engagement: the force imbalance is clearly being reduced while the condition LER > FR_0 = N/M holds.[4]

Since casualty-producing capability varies throughout a battle, it is often useful to examine the instantaneous LER, the ratio of the rates of enemy attacker and defender losses as a function of battle time t, in order to develop a causal understanding of the battle dynamics. Early in the battle, the instantaneous LER is high and relatively independent of the initial force ratio (and particularly threat size) because of concealment and first-shot advantages held by the defender. The LER advantage moves to the attacker as the forces become decisively engaged, because more attackers find and engage targets, and concentration and saturation phenomena come into play for the attacker. However, this pattern is not relevant in today's security environment, with new technologies (e.g., precision munitions, second-generation night vision devices, and FBCB2); more U.S. offensive engagements; and threats that employ asymmetric warfare tactics.

The utility of the LER is further evidenced by its current use by analysts of the TRADOC Analysis Command and the Center for Army Analysis (CAA, formerly the Army's Concepts Analysis Agency) in studies of the Army's Interim Force and Objective Force.

[4] The LER is usually measured at the time during an engagement when either the attacker or defender reaches a breakpoint level of casualties.
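The instantaneous LER can be approximated from time-sequenced loss data by taking the ratio of losses in successive time windows; the sketch below uses hypothetical loss counts purely for illustration.

```python
# Hedged sketch (hypothetical loss time series): the instantaneous LER as the
# ratio of enemy and friendly loss rates over successive time windows of a battle.

enemy_losses_per_window = [12, 8, 6, 3, 2]    # OPFOR losses in each 10-minute window
friendly_losses_per_window = [1, 2, 3, 3, 4]  # friendly losses in the same windows

for window, (dc_n, dc_m) in enumerate(zip(enemy_losses_per_window, friendly_losses_per_window), 1):
    inst_ler = dc_n / dc_m if dc_m else float("inf")
    print(f"window {window}: instantaneous LER = {inst_ler:.2f}")

# A declining sequence of this kind would be consistent with the classic pattern
# described above: a high defender advantage early, eroding as forces become
# decisively engaged.
```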

Force Exchange Ratios

The LER indicates the degree of mission success in tactical-level engagements and allows an examination of the impact of different weapon systems, weapon mixes, tactics, etc. At this level, each alternative in a study traditionally has the same initial U.S. force size (e.g., a battalion, a company). As analysis moves to operational-level issues (e.g., force design/structure, operational concepts) with nonlinear battlefields, alternatives in a study often have different initial force sizes. This suggests considering a measure that "normalizes" casualties with respect to initial force size, which gives rise to the force exchange ratio (FER):[5]

    FER = [(N - n)/N] / [(M - m)/M] = (C_n/N) / (C_m/M) = LER / FR_0        (2)

The FER and the LER are equally effective as indicators of the degree by which force imbalance is reduced in a campaign: an enemy's initial force size advantage is being reduced as long as the FER > 1. Some of the history behind the use of FER is summarized in Appendix B.

[5] This MOE is also referred to as the fractional loss exchange ratio and the fractional exchange ratio.
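The following sketch, using the same hypothetical numbers as the LER example above, makes the normalization explicit and confirms the identity FER = LER/FR_0 from equation (2).

```python
# Minimal sketch (same hypothetical numbers as the LER example above): the force
# exchange ratio normalizes casualties by initial force size, so FER = LER / FR_0.

N, M = 120, 60       # initial enemy and friendly force sizes
n, m = 80, 50        # survivors

ler = (N - n) / (M - m)                  # 4.0
fr0 = N / M                              # 2.0
fer = ((N - n) / N) / ((M - m) / M)      # fractional enemy losses / fractional friendly losses

print(f"FER = {fer:.2f}")                                  # 2.00
print("matches LER / FR_0:", abs(fer - ler / fr0) < 1e-9)  # True
print("initial force-size advantage being reduced:", fer > 1)
```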

Relative Loss Ratio

ATEC has proposed using a third casualty ratio, referred to as the relative loss ratio (RLR) and, at times, the "odds ratio." They briefly define and demonstrate its computation in a number of documents (e.g., TEMP, December 2001; briefing to USA-OR, June 2002) and (equally briefly) argue for its potential advantages over the LER and the FER. The basic RLR is defined by ATEC to be the ratio of [enemy to friendly casualty ratio] to [enemy to friendly survivor ratio] at some time t in the battle:

    RLR = (C_n / C_m) / (n / m) = (C_n / C_m)(m / n) = LER x SVER        (3)

where SVER = m/n is referred to as the "survivor ratio." Since the reciprocal of SVER is the force ratio FR_t = n/m at time t in the battle, RLR can be expressed as

    RLR = LER / FR_t        (4)

which is structurally similar to the FER given by equation (2). It is interesting to note that the condition

    RLR = [(N - n)/(M - m)] x (m/n) > 1

implies that FR_0 > FR_t, i.e., that the initial force imbalance is being reduced at time t. However, the condition FER > 1 also implies the same thing.

ATEC also proposes to use a normalized RLR, the relative loss ratio normalized for the initial force ratio. That is,

    RLR / FR_0 = [(C_n/N) / (C_m/M)] x (m/n) = FER x SVER = FER / FR_t        (5)
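Continuing the same hypothetical example, the sketch below computes the RLR and its normalized form and illustrates the equivalence of the RLR > 1 and FR_0 > FR_t conditions noted above.

```python
# Minimal sketch (same hypothetical numbers as above) of the relative loss
# ratio (RLR) and its relationship to the LER, FER, and force ratios.

N, M = 120, 60       # initial enemy and friendly force sizes
n, m = 80, 50        # survivors at time t

ler = (N - n) / (M - m)          # 4.0
fr0 = N / M                      # initial force ratio, 2.0
frt = n / m                      # force ratio at time t, 1.6
sver = m / n                     # "survivor ratio"
fer = ler / fr0                  # 2.0

rlr = ler * sver                 # equivalently ler / frt
rlr_normalized = rlr / fr0       # equivalently fer / frt

print(f"RLR = {rlr:.2f}, normalized RLR = {rlr_normalized:.2f}")
print("RLR > 1 coincides with FR_0 > FR_t:", (rlr > 1) == (fr0 > frt))
```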

ATEC does not discuss any specific properties or implications of using the RLR but does suggest a number of advantages of its use relative to the LER and the FER. These are listed below, with the panel's comments following each.

1. The RLR addresses casualties and survivors, whereas the LER and the FER address only casualties. When calculating LER and FER, the number of casualties is in fact the initial number of forces minus survivors.

2. The RLR can aggregate over different levels of force structure (e.g., platoons, companies, battalions), while the LER and the FER cannot. The initial numbers of forces and casualties for multiple platoon engagements in a company can be aggregated to compute company-level LERs and FERs, and they can be aggregated again over all company engagements to compute battalion-level LERs and FERs. Indeed, this is how they are regularly computed in Army studies of battalion-level engagements.

3. The RLR can aggregate different kinds of casualties (vehicles, personnel, civilians, fratricide) to present a decision maker with a single RLR measure of merit, while the LER and the FER cannot. Arbitrary linear additive functions combining these levels of measures are not useful for the reasons given in the section on introduction to measures above. In any event, personnel casualties associated with system/vehicle losses can be readily calculated using information from the Ballistics Research Laboratories/U.S. Army Materiel Systems Analysis Activity (BRL/AMSAA). It is not clear why the geometric mean computed for the RLR (p. 48 of December 2001 TEMP) could not be computed for the LER or the FER if such a computation were thought to be useful.

4. The RLR motivates commanders "to seek an optimum trade-off between friendly survivors and enemy casualties." This does not appear germane to selecting an MOE that is intended to measure the degree of mission success in the IBCT IOT.

5. The RLR has numerous attractive statistical properties. ATEC has not delineated these advantages, and we have not been able to determine what they are.

6. The RLR has many good statistical properties of a "maximum likelihood statistic," including being most precise among other attractive measures of attrition (LER and FER). It is not clear what advantage is suggested here. Maximum likelihood estimation is a technique for estimating parameters that has some useful properties, especially with large samples, but maximum likelihood estimation does not appear to address the relative merits of LER, FER, and RLR.

7. The IAV/Stryker IOT is a designed experiment. To take advantage of it, there is a standard log-linear modeling approach for analyzing attrition data that uses RLR statistics. There are equally good statistical approaches that can be used with the FER and the LER.

Fratricide and Civilian Casualties

ATEC has correctly raised the importance of developing suitable MOEs for fratricide (friendly casualties caused by friendly forces) and civilian casualties caused by friendly fires. It is hypothesized that the IBCT/Stryker weapon capabilities and the capabilities of its C4ISR suite will reduce its potential for fratricide and civilian casualties compared with the baseline. The June 2002 SEP states that in order to test this hypothesis, the "standard" RLR and a fratricide RLR (where casualties caused by friendly forces are used in place of OPFOR casualties) will be compared for both the IBCT and the LIB. A similar comparison would be done using a civilian casualties RLR.

However, the RLR (as well as the LER and the FER) is not an appropriate MOE to use, not only for the reasons noted above, but also because it does not consider the appropriate fundamental phenomena that lead to fratricide (or civilian) casualties. These casualties occur when rounds fired at the enemy go astray (for a variety of possible reasons, including erroneous intelligence information, false detections, target location errors, aiming errors, weapons malfunction, etc.). Accordingly, we recommend that ATEC report, as one MOE, the number of such casualties for the IBCT/Stryker and the baseline force and also compute a fratricide frequency (FF) defined as

    FF = (number of fratricide casualties) / (number of rounds fired at the enemy)

and a similarly defined civilian casualty frequency (CF). The denominator could be replaced by any number of other measures of the intensity (or level) of friendly fire.
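A minimal sketch of the proposed frequencies follows; the casualty and round counts are hypothetical, and, as noted above, the denominator could be replaced by another measure of friendly-fire intensity.

```python
# Hedged sketch (hypothetical counts): fratricide frequency (FF) and civilian
# casualty frequency (CF) as defined above, reported alongside the raw counts.

def casualty_frequency(casualties, rounds_fired_at_enemy):
    """Casualties per round fired at the enemy; the denominator could be swapped
    for another measure of friendly-fire intensity."""
    if rounds_fired_at_enemy == 0:
        return 0.0
    return casualties / rounds_fired_at_enemy

rounds_fired = 5400          # rounds fired at the enemy during the mission set
fratricide_casualties = 2
civilian_casualties = 1

ff = casualty_frequency(fratricide_casualties, rounds_fired)
cf = casualty_frequency(civilian_casualties, rounds_fired)
print(f"fratricide casualties: {fratricide_casualties}, civilian casualties: {civilian_casualties}")
print(f"FF = {ff:.5f} per round, CF = {cf:.5f} per round")
```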

Advantages of FER and LER Over RLR

The FER and the LER have served the Army analysis community well for many decades as mission-level MOEs for campaigns, battles, and engagements. Numerous studies have evidenced their utility and correlation to mission success. Accordingly, until similar studies show that the RLR is demonstrably superior in these dimensions, ATEC should use the FER (and the LER when appropriate), but not the RLR, as the primary mission-level MOE for analyses of engagement results. Our preference for using the FER, instead of the RLR, is based on the following reasons:

· The FER has been historically correlated with the probability of mission success (i.e., winning an engagement/battle), and the RLR has not.
· There is strong historical and simulation-based evidence that the FER is a valid measure of a force's warfighting capability given its strong correlation with win probability and casualties. It has been useful as a measure for defining "decisive force" for victory.
· The Army analysis community has used, and found useful, FER and LER as the principal MOEs in thousands of studies involving major theater war and SSC combat between forces that can inflict noticeable casualties on each other. There is no similar experience with the RLR.
· There is no compelling evidence that the purported advantages of the RLR presented by ATEC and summarized above are valid. There is little understanding of or support for its properties or value for analysis.
· Using ratio measures such as FER and LER is already a challenge to the interpretation of results when seeking causal insights. The RLR adds another variable (survivors) to the LER ratio (making it more difficult to interpret the results) but does not add any new information, since it is perfectly (albeit negatively) correlated with the casualty variables already included in the FER and the LER.

Scenario-Specific and System Degradation Measures

ATEC states that the main Army and Department of Defense (DoD) question that needs to be answered during the Stryker operational test is: Is a Stryker-equipped force more effective than the current baseline force? The TEMP states that:

    The Stryker has utility in all operational environments against all projected future threats; however, it is designed and optimized for contingency employment in urban or complex terrain while confronting low- and mid-range threats that may display both conventional and asymmetric warfare capabilities.

This statement points directly to the factors that have been used in the current test design: terrain (rural and urban), OPFOR intensity (low, medium, high), and mission type (raid, perimeter defense, security operations in a stability environment). These factors are the ones ATEC wants to use to characterize if and when the Stryker-equipped force is better than the baseline and to help explain why.

The Stryker SEP defines effectiveness and performance criteria and assigns a numbering scheme to these criteria and their associated measures. In the discussion below, the numbering of criteria adheres to the Stryker SEP format (U.S. Department of Defense, 2002c). There are three sets of measures that are appropriate for assessing each of the three mission types. These are detailed in the measures associated with Criterion 4-1: Stryker systems must successfully support the accomplishment of required operations and missions based on standards of performance matrices and associated mobility and performance requirements. In particular, the measures of effectiveness for Criterion 4-1 are:

MOE 4-1-1 Mission accomplishment.
MOE 4-1-2 Performance ratings on selected tasks and subtasks from the applicable performance assessment matrices while conducting operations at company, platoon, squad, and section level.
MOE 4-1-3 Relative attrition.

These measures of effectiveness have been addressed in the previous sections.

In addition, however, ATEC would like to know why there are differences in performance between the Stryker-equipped force and the baseline force. The reasons for performance differences can be divided into two categories: Stryker capabilities and test factors. Stryker capabilities include situation awareness (which contributes to survival by avoidance), responsiveness, maneuverability, reliability-availability-maintainability (RAM), lethality, survivability (both ballistic and nonballistic), deployability, transportability, and logistics supportability. Test factors include time of day, time of year, weather, nuclear/biological/chemical (NBC) environment, personnel, and training. Measures for reliability are addressed in detail later in this chapter; test factors are addressed in Chapter 4.

With the exception of situation awareness, responsiveness, maneuverability, and RAM, the current SEP addresses each capability using more of a technical than an operational assessment. The IBCT/Stryker IOT is not designed to address (and cannot be redesigned to address) differences in performance due to lethality, survivability, deployability, transportability, or logistics supportability. Any difference in performance that might be attributed to these factors can only be assessed using the military judgment of the evaluator supported by technical and developmental testing and modeling and simulation.

The current capability measures for situation awareness, responsiveness, and maneuverability are associated with Criterion 4-2 (the Stryker systems must be capable of surviving by avoidance of contact through integration of system speed, maneuverability, protection, and situation awareness during the conduct of operations) and Criterion 4-3 (the Stryker must be capable of hosting and effectively integrating existing and planned Army command, control, communications, computers, intelligence, surveillance, and reconnaissance, or C4ISR, systems). The associated MOEs are:

MOE 4-2-1 Improvement of force protection
MOE 4-2-2 Improvement in mission success attributed to information
MOE 4-2-3 Contributions of Army battle command systems (ABCS) information to Stryker survival

MOE 4-2-4 How well did the ABCS allow the commander and staff to gain and maintain situation awareness/understanding?
MOE 4-3-1 Ability to host C4ISR equipment and its components
MOE 4-3-2 Integration effectiveness of C4ISR demonstrated during the product verification test
MOE 4-3-3 Interoperability performance for the Stryker C4ISR in technical testing
MOE 4-3-4 Capability of the Stryker C4ISR to withstand external and internal environmental effects IAW MIL-STD-810F and/or DTC Test Operation Procedures (TOP)
MOE 4-3-5 Capability to integrate MEP and FBCB2 data

The measures associated with Criterion 4-3 are primarily technical and address the ability of the existing hardware to be integrated onto the Stryker platforms. As with many of the other capabilities, any difference in performance that might be attributed to hardware integration will be assessed using the military judgment of the evaluator supported by technical and developmental testing.

The problem with most of the MOPs associated with Criterion 4-2 (see Table 3-1) is that they are not unambiguously measurable. For example, consider MOP 4-2-2-2, communications success. The definition of success is, of course, very subjective, even with the most rigorous and validated SME training. Moreover, the distinction between transfer of information and the value of the information is important: communications can be successful in that there is a timely and complete transfer of critical information, but at the same time unsuccessful if that information is irrelevant or misleading. Or, for another example, consider MOP 4-2-1-3, incidents of BLUFOR successful avoidance of the adversary. Whether this criterion has been met can be answered only by anecdote, which is not usually considered a reliable source of data. Note that there is no clear numerator or denominator for this measure, and merely counting the frequency of incidents does not provide a reference point for assessment.

Two other categories of measures that could be more useful in assessing performance differences attributable to situation awareness, responsiveness, and maneuverability are scenario-specific and system degradation measures.

TABLE 3-1 MOPs for Criterion 4-2

MOE 4-2-1 Improvement in force protection
    MOP 4-2-1-1 Relative attrition
    MOP 4-2-1-2 Mission success rating
    MOP 4-2-1-3 Incidents of BLUFOR successful avoidance of the adversary
    MOP 4-2-1-4 Incidents where OPFOR surprises the BLUFOR

MOE 4-2-2 Improvement in mission success attributed to information
    MOP 4-2-2-1 Initial mission, commander's intent, and concept of the operations contained in the battalion and company operations and fragmentary orders
    MOP 4-2-2-2 Communications success (use MOE 4-3-5: Capability to integrate MEP and FBCB2 data)

MOE 4-2-3 Contributions of ABCS information (C2, situation awareness, etc.) to Stryker survival
    MOP 4-2-3-1 What were the ABCS message/data transfer completion rates (MCR)?
    MOP 4-2-3-2 What were the ABCS message/data transfer completion times (speed of service)?
    MOP 4-2-3-3 How timely and relevant/useful was the battlefield information (C2 messages, targeting information, friendly and enemy situation awareness updates, dissemination of orders and plans, alerts and warnings) provided by ABCS to commanders and staffs?

MOE 4-2-4 How well did the ABCS allow the commander and staff to gain and maintain situation awareness/understanding?
    MOP 4-2-4-1 Friendly force visibility
    MOP 4-2-4-2 Friendly position data distribution
    MOP 4-2-4-3 Survivability/entity data distribution

Scenario-Specific Measures

Scenario-specific measures are those that are tailored to the exigencies of the particular mission-script combinations used in the test. For example, in the perimeter defense mission, alternative measures could include answers to questions such as:

· Did the red force penetrate the perimeter? How many times?
· To what extent was the perimeter compromised (e.g., percentage of perimeter compromised, taking into account the perimeter shape)?
· How far in from the perimeter was the red force when the penetration was discovered?
· How long did it take the red force to penetrate the perimeter?
· What fraction of time was the force protected while the OPFOR was (or was not) actively engaged in attacking the perimeter?

For a raid (or assault) mission, measures might include:

· Was the objective achieved?
· How long did it take to move to the objective?
· How long did it take to secure the objective?
· How long was the objective held (if required)?

For SOSE missions, measures might include:

· For "show the flag" and convoy escort: How far did the convoy progress? How long did it take to reach the convoy? How much time transpired before losses occurred?
· For route and reconnaissance: How much information was acquired? What was the quality of the information? How long did it take to acquire the information?

We present here the principle that useful objective measures can be tied to the specific events, tasks, and objectives of missions (the unit of measurement need not always be at the mission level or at the level of the individual soldier), and so the measures suggested are intended as exemplary, not as exhaustive. Other measures could easily be tailored to such tasks as conducting presence patrols, reaching checkpoints, searching buildings, securing buildings, enforcing curfews, etc. These kinds of measures readily allow for direct comparison to the baseline, and definitions can be written so that they are measurable.
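To show that such measures can be given unambiguous, computable definitions, the sketch below works through two of the perimeter defense measures listed above. The event-log format, field names, and numbers are hypothetical, not a prescribed data-collection scheme.

```python
# Hedged sketch (hypothetical event log): computing two of the perimeter defense
# measures listed above from time-stamped test events.

penetration_times_min = [42.0, 95.5]        # minutes into the mission when the perimeter was penetrated
mission_length_min = 240.0
attack_intervals_min = [(30.0, 60.0), (80.0, 110.0)]   # periods when the OPFOR was actively attacking
protected_intervals_min = [(0.0, 42.0), (60.0, 95.5)]  # periods when the perimeter was intact

def overlap(a, b):
    """Length of overlap between two (start, end) intervals, in minutes."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

num_penetrations = len(penetration_times_min)
time_to_first_penetration = min(penetration_times_min) if penetration_times_min else mission_length_min

protected_while_attacked = sum(
    overlap(p, a) for p in protected_intervals_min for a in attack_intervals_min
)
total_attack_time = sum(end - start for start, end in attack_intervals_min)
fraction_protected_under_attack = protected_while_attacked / total_attack_time

print(f"penetrations: {num_penetrations}")
print(f"time to first penetration: {time_to_first_penetration:.1f} min")
print(f"fraction of attack time protected: {fraction_protected_under_attack:.2f}")
```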

System Degradation Measures: Situation Awareness as an Experimental Factor

The other type of measure that would be useful in attributing differences to a specific capability results from degrading this capability in a controlled manner. The most extreme form of degradation is, of course, complete removal of the capability. One obvious Stryker capability to test in this way is situation awareness. The IBCT equipped with Stryker is intended to provide more combat effectiveness than the LIB and be more strategically deployable than a heavy MIB. More combat effectiveness is achieved by providing the IBCT with significantly more firepower and tactical mobility (vehicles) than the LIB. Improved strategic mobility is provided by designing the IBCT systems with significantly less armor, thus making them lighter than systems in the heavy MIB, but at a potential price of being more vulnerable to enemy fire. The Army has hypothesized that this potential vulnerability will be mitigated by Stryker's significantly improved day and night situation awareness and C4ISR systems such as FBCB2,[6] second-generation forward-looking infrared systems, unmanned aerial vehicles, and other assets.

If all C4ISR systems perform as expected and provide near-perfect situation awareness, the IBCT should have the following types of advantages in tactical engagements over the LIB (which is expected to have much less situation awareness):

· IBCT units should be able to move better (faster, more directly) by taking advantage of the terrain and having common knowledge of friendly and enemy forces.
· With better knowledge of the enemy, IBCT units should be able to get in better positions for attack engagements and to attack more advantageously day or night by making effective use of cover in approaches to avoid enemy fires. They could structure attacks against the enemy in two directions (thus making him fight in two directions) with little or no risk of surprise ambushes by threat forces.
· IBCT units and systems should be able to acquire more enemy targets accurately at longer ranges, especially at night, facilitating more effective long-range fire.
· IBCT systems should be able to rapidly "hand off" targets to enhance unit kill rates at all ranges.
· Using combinations of the above situation awareness advantages, IBCT units should be capable of changing traditional attacker-defender battle dynamics favoring the defender at long ranges and the attacker at shorter ranges. Attacking IBCT systems should be able to avoid long-range defender fires or attrit many of the defenders at long range before closing with them.

[6] FBCB2 is a top-down fed command and control system that is supposed to provide the IBCT with timely and accurate information regarding all friendly and enemy systems.

The Army has yet to test the underlying hypothesis that the enhanced situation awareness/C4ISR will in fact make the IBCT/Stryker less vulnerable and more effective. As currently designed, the IOT (which compares the effectiveness of the IBCT/Stryker with the LIB in various missions) cannot test this hypothesis, since the IBCT/Stryker is presumably more effective than the LIB for many criteria (mobility, lethality, survivability, etc.), not just in its situation awareness/C4ISR capability. To most effectively test the underlying hypothesis, the IOT design should make situation awareness/C4ISR an explicit factor in the experiment, preferably with multiple levels, but at a minimum using a binary comparison. That is, the design should be modified to explicitly incorporate trials of the IBCT/Stryker both with and without its improved situation awareness/C4ISR in both daytime and nighttime scenarios. It is not sufficient to rely on test conditions (e.g., the unreliability of the hardware itself) to provide opportunities to observe missions without situation awareness. There must be a scripted turning off of the situation awareness hardware. This kind of controlled test condition leads to results that can be directly attributed to the situation awareness capability.

If this type of test modification is not feasible, then the underlying hypothesis should be tested using appropriate simulations at either the Intelligence School or TRAC-FLVN (Ft. Leavenworth). Although the hypothesis may not be testable in the IOT as currently designed, ATEC may be able to determine some of the value of good situation awareness/C4ISR by assessing the degree to which the situation awareness-related advantages noted above are achieved by the IBCT/IAV in combat missions. To accomplish this:

· SMEs should assess whether the IBCT/Stryker units move through the terrain better (because of better information, not better mobility) than LIB units.
· SMEs should assess whether IBCT/Stryker units get in better positions (relative to enemy locations) for attack engagements than LIB units and are able to design and implement attack plans with more covered attack routes to avoid enemy fires (i.e., reduce their vulnerability).
· ATEC should collect target acquisition data by range and by type (visual, pinpoint) for day and night missions to determine whether IBCT/Stryker systems have the potential for more long-range fires than LIB systems. ATEC should also record the time and range distribution of actual fire during missions.

· ATEC should determine the number of hand-off targets during engagements to see if the IBCT force is really more "net-centric" than the LIB.
· From a broader perspective, ATEC should compute the instantaneous LER throughout engagements to see if improved situation awareness/C4ISR allows the IBCT force to advantageously change traditional attacker-defender battle dynamics.

OBJECTIVE MEASURES OF SUITABILITY

The overall goal of the IOT is to assess baseline force versus IBCT/Stryker force effectiveness. Because inadequate levels of reliability and maintainability (R&M) would degrade or limit force effectiveness, R&M performance is important in evaluating the Stryker system. We note in passing that R&M performance will affect both sides of the comparison. It is not clear whether an assessment of baseline R&M performance is envisioned in the IOT. Such an assessment would provide an important basis for comparison and might give insights on many differences in R&M effectiveness.

Reliability

Criterion 1-3 states: "The Stryker family of interim armored vehicles (excluding GFE components and systems) will have a reliability of 1,000 mean miles between critical failures (i.e., system aborts)." This requirement is raised to 2,000 mean miles for some less stressed vehicle types. These failures could be mechanical vehicle failures or failures due to vehicle/GFE interface issues. Although GFE failures themselves do not contribute to this measure, they should and will be tracked to assess their role in the force effectiveness comparison.
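As a simple illustration of how the criterion might be checked, the sketch below pools exposure miles and critical failures across test vehicles; the mileages and failure counts are hypothetical, and the constant-hazard assumption behind the point estimate is itself discussed later in this section.

```python
# Hedged sketch (hypothetical mileage and failure counts): a simple point
# estimate of mean miles between critical failures (MMBF) against the
# 1,000-mile criterion, pooling miles driven without a critical failure
# as exposure.

vehicle_miles = [1200, 950, 1800, 600, 1400]      # miles accumulated per test vehicle
critical_failures = [1, 0, 2, 1, 0]               # critical failures (system aborts) per vehicle

total_miles = sum(vehicle_miles)
total_failures = sum(critical_failures)

# Point estimate under a constant-hazard (exponential) assumption; as discussed
# below, that assumption should itself be checked against the failure-mode data.
mmbf = total_miles / total_failures if total_failures else float("inf")

print(f"total miles: {total_miles}, critical failures: {total_failures}")
print(f"estimated MMBF: {mmbf:.0f} miles (criterion: 1,000 miles)")
```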

The IOT is not only key to decisions about meeting R&M criteria and systems comparisons, but it also should be viewed as a shakedown exercise. The IOT will provide the first view of the many mechanical and electronic pieces of equipment that can fail or go wrong in an operational environment. Some failures may repeat, while others will take a fair amount of IOT exposure to manifest themselves for the first time. Thus the IOT provides an opportunity for finding out how likely it is that other new failure issues may crop up.

For this reason, failure incidents should be collected for all vehicles for their entire lives on a vehicle-by-vehicle basis, even though much of the data may not serve the express purposes of the IOT. Currently it appears that only the Army test incident reporting system will be used. Suitable databases to maintain this information should be established.

In the remainder of this section we discuss four important aspects of reliability and maintainability assessment:

· failure modes (distinguishing between them and modeling their failure time characteristics separately);
· infant mortality, durability/wearout, and random failures (types and consequences of these three types of failure modes);
· durability, accelerated testing, and add-on armor; and
· random failures, GFE integration, and scoring criteria.

Failure Modes

Although the TEMP calls for reporting the number of identified failures and the number of distinct failure modes, these are not sufficient metrics for making assessments about systems' RAM. Failures need to be classified by failure mode. Those modes that are due to wearout have different data-recording requirements from those that are due to random causes or infant mortality. For wearout modes, the life lengths of the failed parts/systems should be observed, as well as the life lengths of all other equivalent parts that have not yet failed. Life lengths should be measured in the appropriate time scale (units of operating time, or operating miles, whichever is more relevant mechanistically). Failure times should be recorded both in terms of the life of the vehicle (time/miles) and in terms of time since last maintenance. If there are several instances of failure of the same part on a given vehicle, a record of this should be made. If, for example, the brake or tire that fails or wears out is always in the same position, this would be a significant finding that would serve as input for corrective action.

Different kinds of failure modes have different underlying hazard functions (e.g., constant, increasing, or decreasing). When considering the effect of RAM on system effectiveness, it is potentially misleading to report the reliability of a system or subsystem in terms of a MOP that is based on a particular but untested assumption. For example, reporting only the "mean time to failure" is sufficiently informative only when the underlying failure time distribution has only a single unknown parameter, such as a constant hazard function (e.g., an exponential distribution). One alternative is to report reliability MOPs separately for random types of failure modes (constant hazard function), wearout failure modes (increasing hazard function), and defect-related failure modes (decreasing hazard function). These MOPs can then be used to assess the critical reliability performance measure: the overall probability of vehicle failure during a particular future mission.
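A sketch of how such separately reported MOPs could be combined into a mission-level failure probability follows; the distributional forms, parameter values, and mission length are hypothetical and are meant only to illustrate the calculation, not to represent Stryker data.

```python
# Hedged sketch (hypothetical per-mode parameters): combining reliability MOPs
# reported separately by failure-mode type into an overall probability of
# vehicle failure during a particular future mission.

import math

mission_miles = 150.0
miles_so_far = 4000.0    # miles already accumulated on the vehicle

# Random failure modes: constant hazard (exponential), summarized by mean miles
# between failures.
random_mmbf = 2500.0
p_survive_random = math.exp(-mission_miles / random_mmbf)

# Wearout failure modes: increasing hazard, illustrated with a Weibull (shape > 1),
# conditioned on the miles already survived.
def weibull_survival(miles, shape, scale):
    return math.exp(-((miles / scale) ** shape))

p_survive_wearout = (weibull_survival(miles_so_far + mission_miles, 2.5, 20000.0)
                     / weibull_survival(miles_so_far, 2.5, 20000.0))

# Defect-related (infant mortality) modes: decreasing hazard, Weibull shape < 1.
p_survive_defect = (weibull_survival(miles_so_far + mission_miles, 0.6, 50000.0)
                    / weibull_survival(miles_so_far, 0.6, 50000.0))

p_mission_failure = 1.0 - p_survive_random * p_survive_wearout * p_survive_defect
print(f"probability of vehicle failure during the mission: {p_mission_failure:.3%}")
```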

Wearout failures may well be underrepresented in the IOT, since most vehicles are relatively new. They also depend heavily on the age mix of the vehicles in the fleet. For that reason, and to correct for this underrepresentation, it is important to model wearout failures separately. Some measure of criticality (not just "critical" or "not critical") should be assigned to each failure mode so as to better assess the effect(s) of that mode. Further subdivision (e.g., GFE versus non-GFE) may also be warranted.

Data on the arrival process of new failure modes should be carefully documented, so that they can be used in developing a model of when new failure modes occur as a function of fleet exposure time or miles. The presumably widening intervals[7] between the occurrences of new failure modes will enable an assessment of the chance of encountering any further and as yet unseen failure modes. The use of these data to make projections about the remaining number of unseen failure modes should be done with great care and appreciation of the underlying assumptions used in the projection methodology.

Although the different Stryker vehicle variants will probably have different failure modes, there is a reasonable possibility that information across these modes can be combined when assessing the reliability of the family of vehicles. In the current TEMP, failure modes from developmental test (DT) and IOT are to be assessed across the variants and configurations to determine the impact that the operational mission summary/mission profile and unique vehicle characteristics have on reliability estimates. This assessment can be handled by treating vehicle variant as a covariate. Other uncontrollable covariates, such as weather conditions, could certainly have an impact, but it is not clear whether these effects can be sorted out cleanly.

[7] Of course, these widening intervals are not likely to be true in the immediate period of transferring from developmental test to operational test, given the distinct nature of these test activities.

Although the different Stryker vehicle variants will probably have different failure modes, there is a reasonable possibility that information across these modes can be combined when assessing the reliability of the family of vehicles. In the current TEMP, failure modes from developmental test (DT) and IOT are to be assessed across the variants and configurations to determine the impact that the operational mission summary/mission profile and unique vehicle characteristics have on reliability estimates. This assessment can be handled by treating vehicle variant as a covariate.

Other uncontrollable covariates, such as weather conditions, could certainly have an impact, but it is not clear whether these effects can be sorted out cleanly. For example, one could record the degree of wetness of soil conditions on a daily basis. This might help in sorting out the potential confounding of weather conditions under which a given force (IBCT or baseline) is operating. For example, if the IBCT were to run into foul weather halfway through its testing, and if certain failures appeared only at that time, one would be able to make a better case for ascribing the failures to weather rather than to the difference in force, especially if the baseline force does not run into foul weather.
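One hedged way to operationalize "vehicle variant as a covariate" is a failure-rate regression with exposure entering as an offset. Everything in the sketch below is illustrative: the variant labels, failure counts, miles, and the wet-soil indicator are hypothetical, and the Poisson/offset formulation assumes an approximately constant hazard within each cell, which would not be appropriate for wearout modes.

# Sketch: vehicle variant and a daily-weather indicator as covariates in a
# failure-rate model.  A Poisson regression with a log-miles offset is
# equivalent to an exponential (constant-hazard) rate model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "variant":  ["ICV", "ICV", "MGS", "MGS", "RV", "RV"],   # hypothetical variant labels
    "wet_soil": [0, 1, 0, 1, 0, 1],                          # 1 = miles driven on wet-soil days
    "failures": [4, 9, 6, 14, 3, 7],                         # hypothetical failure counts
    "miles":    [2200, 1800, 2100, 1700, 2000, 1600],        # exposure per cell
})

X = pd.get_dummies(data[["variant", "wet_soil"]], columns=["variant"],
                   drop_first=True).astype(float)
X = sm.add_constant(X)

model = sm.GLM(data["failures"], X, family=sm.families.Poisson(),
               offset=np.log(data["miles"]))
result = model.fit()
# Exponentiated coefficients are failure-rate ratios (per mile) relative to the
# baseline variant and to dry-soil conditions.
print(np.exp(result.params))

If wet-soil exposure occurs only for one force or only in one part of the test period, the weather coefficient cannot be separated from the force or period effect; that is the modeling face of the confounding concern raised above.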

Infant Mortality

Operational tests, to some extent, serve the purpose of helping to uncover and identify unknown system design flaws and manufacturing problems and defects. Such "infant mortality" problems are normally corrected by making design or manufacturing changes or through the use of sufficient burn-in so that the discovered infant mortality failure modes will no longer be present in the mature system.

The SEP describes no specific MOPs for this type of reliability problem. Indeed, the SEP RAM MOPs (e.g., estimates of exponential distribution mean times) assume steady-state operation. Separate measures of the effects of infant mortality failures and of the ability to eliminate these failure modes would be useful for the evaluation of Stryker system effectiveness.

Durability and Wearout

The IOT currently has no durability requirement, but durability issues may come up in the evaluation. Vehicles used in the IOT will not have sufficient operating time to produce reliable RAM data in general, and especially for durability. Although the SEP mentions a historical 20,000-mile durability requirement, the Stryker system itself does not have a specified durability requirement. ATEC technical testing will, however, look at the durability of high-cost components. In particular, in DT, the infantry carrier vehicle will be tested in duration tests to 20,000 miles.

Add-On Armor

Whether or not vehicles are outfitted with their add-on armor (AoA) can be expected to have an important impact on certain reliability metrics. The AoA package is expected to increase vehicle weight by 20 percent. The added weight will put additional strain on many operating components, particularly the vehicle power train and related bearings and hydraulic systems. The additional weight can be expected to increase the failure rate for all types of failure modes: infant mortality, random, and, especially, durability/wearout. Because product verification test (PVT) and DT will be done under understressed conditions (that is, without AoA), any long-term durability problems that do show up can be judged to be extremely serious, and other problems that may exist are unlikely to be detected in IOT. Although the IOT will proceed without AoA (because it will not be ready for the test), weight packs should be used even if there is currently imperfect knowledge about the final weight distribution of the AoA. Testing with different weight packs will go a long way toward assessing the impact of the added weight on the reliability metrics. The details of the actual AoA weight distribution will presumably amount to only a small effect compared with the effect of the presence or absence of armor.

There is a need to use PVT, DT, and IOT results to support an early fielding decision for Stryker. Because of the absence of valid long-term durability data under realistic operating conditions (i.e., with AoA installed), the planned tests will not provide a reasonable degree of assurance that Stryker will have durability sufficient to demonstrate long-term system effectiveness, given the potential for in-service failure of critical components.

Some wearout failure modes (not necessarily weight-related) may show up during the IOT, but they are likely to be underrepresented compared with steady-state operation of the Stryker fleet, because the vehicles used in the IOT will be relatively new. For such failure modes it is important to capture the time to failure for each failed part/system and the time exposed without failure for each other equivalent part/system. This will enable correction for the underreporting of such failure modes and could lead to design or maintenance changes.
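A minimal sketch of how those two kinds of records—age at failure and age at test end without failure—are combined is shown below. The part ages and the Weibull form are assumptions for illustration; the point is that the still-working parts enter the likelihood as right-censored observations rather than being discarded.

# Sketch: fitting a wearout failure-mode distribution when many equivalent
# parts have not yet failed.  observed = 1 means the part failed at that age;
# observed = 0 means it was still working when the test ended (right-censored),
# which is the "time exposed without failure" record the text calls for.
import numpy as np
from scipy.optimize import minimize

ages     = np.array([900., 1100., 1250., 400., 650., 800., 950., 1000., 700., 500.])  # hypothetical miles
observed = np.array([1,    1,     1,     0,    0,    0,    0,    0,     0,    0])

def neg_log_lik(params):
    log_shape, log_scale = params
    k, lam = np.exp(log_shape), np.exp(log_scale)
    z = ages / lam
    log_f = np.log(k / lam) + (k - 1) * np.log(z) - z ** k   # density term, failed parts
    log_s = -z ** k                                          # survival term, censored parts
    return -np.sum(observed * log_f + (1 - observed) * log_s)

fit = minimize(neg_log_lik, x0=[0.0, np.log(ages.mean())], method="Nelder-Mead")
shape, scale = np.exp(fit.x)
print(f"Weibull shape = {shape:.2f} (>1 indicates wearout), scale = {scale:.0f} miles")
# Fitting only the three observed failures, and ignoring the seven parts still
# in service, would understate the characteristic life of this failure mode.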

Random Failures, GFE, and Scoring Criteria

Random failures are those failures that are not characterized as either infant mortality or durability/wearout failures. These should be tracked by vehicle type and failure mode. Random failures are generally caused by events external to the system itself (e.g., shocks or accidents). The excessive occurrence of random failures of a particular failure mode during IOT may indicate the need for system design changes to make one or more vehicle types more robust to such failure modes. Because of such potential experiences, it is important to track all of these random failure modes separately, even though it is tempting to lump them together to reduce paperwork requirements.

The reliability of the GFE integration is of special concern. The blending of GFE with the new physical platform may introduce new failure modes at the interface, or it may introduce new failure modes for the GFE itself due to the rougher handling and environment. R&M data will be analyzed to determine the impact of GFE reliability on the system and the GFE interfaces. Although GFE reliability is not an issue to be studied by itself in IOT, it may have an impact on force effectiveness, and for this reason R&M GFE data should be tracked and analyzed separately. Since the GFE on Stryker is a software-intensive system, software failure modes can be expected to occur. To the extent possible, MOPs should be used that distinguish among software-induced failures in the GFE, other problems with the GFE, and failures outside the GFE.

R&M test data (e.g., test incidents) will be evaluated and scored at an official R&M scoring conference in accordance with the Stryker failure definition/scoring criteria. R&M MOPs will be calculated from the resulting scores. Determination of mission-critical failure modes should not, however, be a binary decision. Scoring should be on an interval scale between 0 and 1 rather than being restricted to 0 (failure) or 1 (nonfailure). For example, reporting 10 scores of 0.6 and 10 scores of 0.4 sends a different message, and contains much more information, than reporting 10 scores of 1 and 10 scores of 0. We also suggest the use of standard language in recording events to make scoring the events easier and more consistent. The use of standard language also allows for combining textual information across events and analyzing the failure event database.
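The information gained from interval-scale scoring can be seen in a few lines of arithmetic. The twenty scores below are the hypothetical ones used in the text: both sets have the same mean, but only the graded version reveals that the incidents clustered near the pass/fail boundary.

# Sketch: binary versus interval-scaled scores for 20 hypothetical test incidents.
import statistics

binary_scores   = [1.0] * 10 + [0.0] * 10     # 0 = failure, 1 = nonfailure
interval_scores = [0.6] * 10 + [0.4] * 10     # same events, graded scoring

for label, scores in [("binary", binary_scores), ("interval", interval_scores)]:
    print(f"{label:8s} mean = {statistics.mean(scores):.2f}  "
          f"stdev = {statistics.stdev(scores):.2f}")
# The binary version cannot distinguish "10 marginal passes and 10 marginal
# failures" from "10 clear passes and 10 catastrophic failures"; the interval
# version can.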

Availability and Maintainability

MOPs for availability/maintainability, described in the SEP, include mean time to repair; the chargeable maintenance ratio (the ratio of chargeable maintenance time to the total amount of operating time); and preventive maintenance, checks, and services time required. Although these MOPs will be evaluated primarily using data obtained during DT, IOT information should be collected and used to complement this information.

Given that some reliability criteria are expressed as number of failures per 1,000 miles, and since repair time is not measured in miles, an attempt should be made to correlate time (operating time, mission time) with miles so that a supportable comparison or translation can take place.

Contractors do initial maintenance and repair and then train the soldiers to handle these tasks. MOPs computed on the basis of DT-developed contract maintainers and repairmen may not accurately reflect maintainability and repair when soldiers carry out these duties. Therefore, contractor and soldier maintenance and repair data should not be pooled until it has been established that the repair time distributions are sufficiently close to one another.
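A hedged check of that closeness is sketched below. The repair times are hypothetical, and a two-sample Kolmogorov-Smirnov test is only one of several reasonable ways to compare the contractor and soldier repair-time distributions before deciding whether to pool them.

# Sketch: checking whether contractor-performed and soldier-performed repair
# times can reasonably be pooled.  Times (hours) are hypothetical.
from scipy import stats

contractor_hours = [1.2, 0.8, 2.5, 1.9, 3.1, 1.4, 2.2, 0.9, 1.7, 2.8]
soldier_hours    = [2.0, 3.4, 1.8, 4.2, 2.9, 3.8, 2.5, 4.9, 3.1, 2.2]

# The two-sample Kolmogorov-Smirnov test compares the full distributions,
# not just the means.
res = stats.ks_2samp(contractor_hours, soldier_hours)
print(f"KS statistic = {res.statistic:.2f}, p-value = {res.pvalue:.3f}")
# A small p-value argues against pooling; even a large one should be read
# with the small sample sizes in mind.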

SUMMARY

Reporting Values of Measures of Effectiveness

1. Different MOEs should not be rolled up into a single overall number that tries to capture effectiveness or suitability.
2. Although ratio MOE values may be useful in assessing degrees of mission success, both the numerator and the denominator should be reported.

Subject-Matter Expert Measures

3. To help in the calibration of SME measures, each should be asked to review his or her own assessment of the Stryker IOT missions, for each scenario, immediately before he or she assesses the baseline missions (or vice versa).
4. ATEC should review the opportunities and possibilities for SMEs to contribute to the collection of objective data, such as times to complete certain subtasks, distances at critical times, etc.
5. The inter-SME rating variances from training data should be considered to be the equivalent of instrument error when making statistical inferences using ratings obtained from the IOT.
6. The correlation between SME results and objective measures should be reported for each mission.
7. ATEC should consider using two separate SME rating scales: one for "failures" and another for "successes."
8. As an alternative to the preceding recommendation, SMEs could assign ratings on a qualitative scale (for example, the five-point scale: "excellent," "good," "fair," "poor," and "unsatisfactory"). Any subsequent statistical analysis, particularly involving comparisons, would then involve the use of techniques suitable for ordered categorical variables.
9. If resources are available, more than one SME should be assigned to each unit and trained to make independent evaluations of the same tasks and subtasks.

Objective Casualty-Related Measures

10. FER (and the LER when appropriate), but not the RLR, should be used as the primary mission-level MOE for analyses of engagement results.
11. ATEC should use fratricide frequency and civilian casualty frequency (as defined in this chapter) to measure the amount of fratricide and collateral damage in a mission.

Objective Scenario-Specific and System Degradation Measures

12. Only MOPs that are unambiguously measurable should be used.
13. Scenario-specific MOPs should be added for SOSE missions.
14. Situation awareness should be introduced as an explicit test condition.
15. If situation awareness cannot be added as an explicit test condition, additional MOPs (discussed in this chapter) should be added as indirect measures of situation awareness.
16. ATEC should use the "instantaneous LER" measure to determine changes in traditional attacker/defender engagement dynamics due to improved situation awareness.

Measures of Reliability and Maintainability

17. The IOT should be viewed as a shakedown process and an opportunity to learn as much as possible about the RAM of the Stryker.
18. RAM data collection should be an ongoing enterprise. Failure and maintenance information should be tracked on a vehicle or part/system basis for the entire life of the vehicle or part/system. Appropriate databases should be set up. This was probably not done for those Stryker vehicles already in existence, but it could be implemented for future maintenance actions on all Stryker vehicles.
19. With respect to the difficulty of reaching a decision regarding reliability, given limited miles and the absence of add-on armor, weight packs should be used to provide information about the impact of additional weight on reliability.
20. Accelerated testing of specific system components prior to operational testing should be considered in future contracts to enable testing in shorter and more realistic time frames.
21. Failure modes should be considered separately rather than trying to develop failure rates for the entire vehicle using simple exponential models. The data reporting requirements vary depending on the failure rate function.

Next: 4. Statistical Design »