
3
The Test Development Process

In the discussion of the test development process that follows, we refer to the most widely accepted set of guidelines, the Standards for Educational and Psychological Testing, which is a joint publication of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1999), referred to as “the Standards” from here on.3 The Standards were developed by a joint committee of 15 leading testing experts and professionals appointed by the above three sponsoring organizations. While the Standards are a product of the three sponsoring organizations, more than 50 different groups provided comment and input over the multiyear development process. Committee members have observed that the Standards are cited in the technical reports of many state assessment programs. They have been adopted by several federal agencies, including the Office of Educational Research and Improvement of the U.S. Department of Education, the U.S. Department of Defense, and the Office of Personnel Management. They are also cited in policy guidance issued by the Equal Employment Opportunity Commission and cited as the authoritative standards in numerous education and employment legal cases.

TEST DEVELOPMENT

Among the Standards’ guiding principles are that test development should have a sound scientific basis and that evidence of the scientific approach should be documented. Although the exact sequence of events for developing a test varies from program to program, the Standards lay out a series of general procedures that should take place when developing most kinds of tests (Chapter 3):

  • Specify the purpose of the test and the inferences to be drawn.

  • Develop frameworks describing the knowledge and skills to be tested.

  • Build test specifications.

  • Create potential test items and scoring rubrics.

  • Review and pilot test items.

  • Evaluate the quality of items.

  • Assemble test forms.

  • If needed, set cutscores defining pass/fail or proficiency categories.

3  

The introduction to the Standards describes the types of tests to which they apply: “. . . the Standards applies most directly to standardized measures generally recognized as ‘tests’ such as measures of ability, aptitude, achievement, attitudes, interests, personality, cognitive functioning, and mental health, it may also be usefully applied in varying degrees to a broad range of less formal assessment techniques” (p. 3). The document includes general chapters about test construction, evaluation, documentation, and fairness, and also more specific chapters about psychological testing, educational testing, testing in employment and credentialing, and testing in program evaluation and public policy. Whether one considers the naturalization tests most like achievement or like certification tests, they clearly fall under the umbrella of the Standards.

Procedures

The process should begin with a clear statement of the purpose of the test and the intended inferences to be made from the test scores. A content framework for the test must then be developed that clearly delineates the constructs to be tested (examples of constructs are mathematics achievement, language proficiency, and fundamentals of U.S. history). We discuss content frameworks in detail in the next section. However, they are an integral part of the test development process, and work on test items should not proceed until the constructs are clearly specified.

Once decisions have been made about what the test is to measure and what its scores are intended to convey, the next step is to develop a set of test specifications. The test specifications should define how the test questions will sample from the larger construct, the proposed number of items, the item formats, and the desired psychometric properties of the items and of the test instrument as a whole. There should be a clear linkage between the content frameworks and the test specifications.
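For illustration only, the sketch below shows one way the main elements of a test specification might be represented as a data structure. Every field name and value here is a hypothetical example, not drawn from the actual naturalization redesign materials.

from dataclasses import dataclass

@dataclass
class TestSpecification:
    construct: str                        # construct the test samples from
    n_items: int                          # proposed number of items
    item_formats: dict[str, int]          # format -> number of items in that format
    target_p_range: tuple[float, float]   # desired item difficulty range (proportion correct)
    min_reliability: float                # desired reliability of the test as a whole

# Hypothetical values for one content area (illustrative only).
reading_spec = TestSpecification(
    construct="reading simple words and phrases in English",
    n_items=20,
    item_formats={"multiple choice": 12, "constructed response": 8},
    target_p_range=(0.40, 0.90),
    min_reliability=0.85,
)
print(reading_spec)

A real specification would also cover scoring rules, timing, and administration conditions; the point is simply that the specification is a concrete, reviewable artifact linking the framework to item development.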

According to the Standards, once the content frameworks are developed, the test developer can assemble a set of potential test items that meets the test specifications. Usually test developers create a larger set of items than will eventually be needed, to allow some items to be discarded if it is found that they do not function as intended. Rules for scoring should be developed along with test items. For open-ended questions (in contrast to multiple-choice questions), detailed, standardized rules for scoring, called scoring rubrics, are developed. Scoring rubrics specify the criteria for evaluating and assigning scores to responses and are often accompanied by sample responses at each of the score levels to illustrate the criteria.
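As a concrete illustration of what a scoring rubric captures, here is a minimal sketch in Python; the score levels, criteria, and sample responses are invented and are not taken from any naturalization test item.

from dataclasses import dataclass, field

@dataclass
class ScoreLevel:
    score: int
    criteria: str                           # what a response must show to earn this score
    sample_responses: list[str] = field(default_factory=list)

@dataclass
class ScoringRubric:
    item_id: str
    levels: list[ScoreLevel]

# Invented rubric for a hypothetical open-ended writing item.
writing_rubric = ScoringRubric(
    item_id="W-017",
    levels=[
        ScoreLevel(2, "Complete, comprehensible sentence with at most minor errors.",
                   ["I live in Chicago with my family."]),
        ScoreLevel(1, "Conveys the idea, but errors partly obscure the meaning.",
                   ["I living Chicago with family."]),
        ScoreLevel(0, "Blank, off-topic, or incomprehensible."),
    ],
)
for level in writing_rubric.levels:
    print(f"Score {level.score}: {level.criteria}")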

Evidence about the quality of the items and the corresponding scoring rubrics is ascertained through item review procedures and pilot testing. Potential test items are typically reviewed by a panel of experts for content quality, clarity and lack of ambiguity, and sensitivity to cultural issues. One important aspect of validity4 is establishing that the test measures what it is intended to measure. Content experts who were not themselves involved in creating the test questions (for instance, members of the content advisory panel) might be asked to judge the match between the test items, the content frameworks, and the test specifications.

4  

As described in Chapter 1 of the Standards, the process of validation involves accumulating evidence to provide a sound scientific basis for test score interpretations. That chapter provides a comprehensive overview of the multiple types of evidence that might be collected to build a validity argument.

Usually items are pilot tested with a group of test takers who are as representative as possible of the target population. Pilot test data are used to evaluate the quality of the test items and determine their psychometric properties, such as an item’s difficulty and whether it is biased, in the sense that it functions differently for people who are of the same ability level but members of different cultural, linguistic, gender, or age groups. Pilot test items are not used to obtain scores for test takers, but only to provide data for test development purposes.
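The sketch below illustrates, with simulated data, two of the analyses described here: item difficulty computed as the proportion of examinees answering correctly, and a Mantel-Haenszel check for differential item functioning that compares two groups after stratifying on a rest-score proxy for ability. The responses and group labels are invented for illustration; they are not pilot data from any actual test.

import numpy as np

# Simulated pilot data: 500 examinees, 20 dichotomously scored items.
rng = np.random.default_rng(0)
responses = (rng.random((500, 20)) > 0.4).astype(int)
group = rng.integers(0, 2, 500)            # 0 = reference group, 1 = focal group

# Item difficulty: proportion of examinees answering each item correctly.
difficulty = responses.mean(axis=0)

def mantel_haenszel_odds_ratio(item: int) -> float:
    """Common odds ratio for one item, stratified by total score on the
    remaining items (a proxy for ability). Values near 1.0 suggest the item
    functions similarly for both groups at the same ability level."""
    ability = responses.sum(axis=1) - responses[:, item]
    num = den = 0.0
    for s in np.unique(ability):
        stratum = ability == s
        ref = stratum & (group == 0)
        foc = stratum & (group == 1)
        n_s = stratum.sum()
        a = responses[ref, item].sum()        # reference group, correct
        b = (1 - responses[ref, item]).sum()  # reference group, incorrect
        c = responses[foc, item].sum()        # focal group, correct
        d = (1 - responses[foc, item]).sum()  # focal group, incorrect
        num += a * d / n_s
        den += b * c / n_s
    return num / den if den else float("nan")

for i in range(responses.shape[1]):
    print(f"item {i:2d}: p = {difficulty[i]:.2f}, MH odds ratio = {mantel_haenszel_odds_ratio(i):.2f}")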

Items that meet the test specifications, according to pilot test data, are assembled into a test, or into multiple alternative test forms. Often, the initial, item-level pilot test is followed by a larger scale field test, in which the test forms are tried out under conditions that are as close as possible to those that will be in place during operational testing. This is especially helpful in starting a new testing program when operational issues have not yet been finalized.
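Because alternative forms built from the same item pool are rarely identical in difficulty, field-test data are typically used to equate them, that is, to place scores from different forms on a common scale (the data needed to "build equated test forms" are discussed again below). A minimal linear-equating sketch follows, assuming randomly equivalent groups took the two forms; the score distributions are simulated.

import numpy as np

# Simulated field-test scores from two randomly equivalent groups.
rng = np.random.default_rng(1)
form_x = rng.normal(62, 9, size=1000)    # observed scores on Form X
form_y = rng.normal(58, 11, size=1000)   # observed scores on Form Y (a harder form)

def linear_equate(x: float) -> float:
    """Map a Form X score onto the Form Y scale by matching means and
    standard deviations: y = sd_y / sd_x * (x - mean_x) + mean_y."""
    return form_y.std() / form_x.std() * (x - form_x.mean()) + form_y.mean()

print(f"A Form X score of 65 corresponds to about {linear_equate(65.0):.1f} on Form Y.")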

The Standards devote an entire chapter (Chapter 6) to the importance of providing supporting documentation for tests. Such documentation may include a technical manual, a user’s guide, instructions for test administrators and scorers, and practice materials for test takers. These documents should be complete, accurate, current, and clear, and they should be available to interested individuals as appropriate. As with any scientific endeavor, one of the purposes for documentation is to ensure that the test development process is as replicable as possible. Furthermore, the documentation should provide test users with the information needed to make sound judgments about the nature and quality of the test, the resulting scores, and the appropriate interpretations of test scores.

Finally, development of the naturalization tests is an instance of a government agency’s using test results in deciding to award or withhold a valued benefit. Therefore, it is important to make complete information about the test content and the process widely and equally available to all naturalization candidates, so that everyone has the chance to prepare for the test. This is an important aspect of fairness that the Standards discuss under the topic of “equitable treatment in the testing process” (p. 75).

Three Examples

California’s High School Exit Exam

The kind of systematic, rigorous approach to test development described above has been demonstrated and documented by many large-scale, high-stakes testing programs. One example is California’s new high school exit exam (CAHSEE). The 1999 state legislation mandating the new exit exam also called for an evaluation of the test by an independent contractor. One of the questions that the evaluators were asked to address was whether development of the exam met all of the test standards for use as a graduation requirement. The evaluation contractor, the Human Resources Research Organization (HumRRO), carefully reviewed the test development procedures, which had been well documented by the test developers, the Educational Testing Service and the American Institutes for Research. The evaluators concluded that the test development process did meet all of the relevant test standards (Wise et al., 2003).

The development of the California exam involved a long sequence of activities that we do not describe in detail here, but it is instructive to note briefly some of the key steps that were taken to ensure the ultimate validity and reliability of the new test. First, the rationale for use of the CAHSEE as a high school graduation requirement is stated in the legislation establishing the exam. From there, the test was developed according to a well-established set of procedures or “industry standards” (Wise et al., 2003). The content standards and test blueprints, specifying the number and types of questions for each of the standards, were adopted by the state board of education following recommendations from independent panels of experts. Potential test questions were then developed and reviewed by the test developer and independent review panels for their alignment with the intended content standards. Each test question was also reviewed for sensitivity and fairness to different demographic groups by panels that included representatives from those groups. All questions used in operational forms of the exam were first field-tested at least once to screen out items that did not have the appropriate psychometric properties and to collect the necessary data for creating equated test forms and setting cutscores (the scores students must earn to pass the tests). The test developer convened standard-setting panels to recommend cutscores, but final decisions about cutscores rested with the state board of education. All of these steps were well documented in panel reports, meeting minutes, and the test developers’ technical manuals. The complete series of HumRRO evaluation reports (available online at http://www.cde.ca.gov/ta/tg/hs/evaluations.asp) provides a much more complete evaluation of the CAHSEE development process.

GED Tests

Another example of a high-stakes testing program that has undergone a rigorous development process and has been well documented is the General Educational Development (GED) tests (American Council on Education, 1993). The GED tests are designed to provide an opportunity for adults who have not graduated from high school to earn a high school-level educational diploma. The battery includes tests of writing, social studies, science, literature and the arts, and mathematics. As described in the GED technical manual, the test development process for the GED includes content frameworks and test specifications, as well as many levels of item and test review and two rounds of field testing, as summarized below:

  • External item writers draft test questions, then a GED staff test specialist edits or rejects them.

  • Four independent content reviewers judge content accuracy, clarity, suitability, level, and fairness of items, then a GED test specialist revises or rejects items per reviewers’ comments.

  • Three independent psychometric and fairness experts judge items to ensure sound test construction, detect item flaws, and ensure fairness. A GED test specialist then revises or rejects items per reviewers’ comments.

  • A professional editor proofs items for language and surface errors, then a GED test specialist revises or rejects items per the editor’s comments.

  • GED examinees respond to field test items during actual test administration (the scores on field test items are not used for awarding examinees a GED, only for test development purposes).

  • A GED test specialist selects items and assembles tests based on test specifications, examinee performance, and judgmental and statistical fairness reviews.

  • Seven independent reviewers judge the content and fairness of the individual items as well as the unity and clarity of the test as a whole. A GED test specialist then revises test composition per reviewers’ comments.

  • GED tests are field-tested in high schools, just prior to graduation, with a stratified random sample of graduating students.

  • Operational GED test forms are finalized.

The GED technical manual also includes chapters on norming, scaling, and equating of test forms; reliability; and validity.

BEST Plus

The BEST Plus test, developed by the Center for Applied Linguistics (CAL), is a third illustration of the test development process. Similar to the naturalization tests, the BEST Plus assesses the English language proficiency of adult, nonnative English speakers. It is an oral interview that is given individually to examinees by a test administrator with the use of a computer. The BEST Plus is used by many states to evaluate the effectiveness of adult education programs, and although the stakes are not as high for examinees as they are for the naturalization tests, adult education programs often use the results to decide whether to promote students to higher English learning levels, an outcome that is certainly important to those students.

The technical manual for the BEST Plus describes the series of steps that were taken to develop the test, which spanned several years and included two cycles of field testing (Center for Applied Linguistics, 2003). The test is based on a previously existing framework called the Student Performance Levels, which describes various levels of language proficiency. Early in the test development process, CAL convened a 10-member technical working group to provide advice throughout the development process. To prepare for the first cycle of field testing, CAL staff developed initial item and test specifications that were approved by the technical working group. Item writers from across the country were trained at workshops and then began drafting items, which then went through a review and revision process.

The first field test was conducted with a pool of items that was about one-quarter the size of the item pool that would eventually be needed for the operational test. In addition to examining how the test items functioned, the first field test focused on the computerized testing and scoring procedures. An initial reliability study was conducted with the data from this field test, which demonstrated that the computerized test could yield reliable results when administered by different test administrators to the same students. (The technical manual also describes a number of later studies that were conducted to collect a variety of evidence about the reliability and validity of the new test.)
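To illustrate the logic of such an administrator-consistency study, the sketch below correlates simulated scores assigned to the same students under two different administrators. The CAL study would have used its actual field-test data and possibly other statistics; this is only a schematic of the reasoning.

import numpy as np

# Simulated scores: the same 200 students tested under two administrators.
rng = np.random.default_rng(2)
true_ability = rng.normal(50, 10, size=200)
scores_admin_a = true_ability + rng.normal(0, 3, size=200)   # administrator A
scores_admin_b = true_ability + rng.normal(0, 3, size=200)   # administrator B

# Pearson correlation between the two administrations; values near 1.0
# indicate that scores are stable across administrators.
r = np.corrcoef(scores_admin_a, scores_admin_b)[0, 1]
print(f"Inter-administrator reliability (Pearson r) = {r:.2f}")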


Based on the results of the first field test, CAL staff determined that the items performed well technically, but that they were not giving examinees the chance to demonstrate all that they could do. The test specifications were revised, for example, to call for items that were more personally engaging for examinees and worded in a way that was natural and conversational. The new test specifications were approved by the technical working group; earlier test items were revised and new ones created to complete the item pool for the second field test. A small-scale tryout of the newest items was conducted, and test administrator workshops were held across the country. Over 40 administrators tested over 2,300 students in the full-scale field test, and those data were used to create the final BEST Plus.

If requested, the committee can provide USCIS with references to other tests that have undergone a rigorous development process.

Naturalization Test Redesign

In the committee’s judgment, the development process for the naturalization test redesign has been less rigorous than the systematic approach described and illustrated above. First, the program lacks a clear statement of the purpose of the tests, with no clear operationalization of the constructs embodied in the rather vague legislation and no systematic specification of the inferences about naturalization applicants to be drawn from the results. As described in more detail in the next section, the process used to develop the content frameworks has been unsystematic and uneven across the content areas. If the redesigned naturalization tests do not start out with clear statements of their purpose, the constructs to be tested, and the desired inferences about test takers, then ultimately it will not be possible to judge the validity of the new tests. It will not be clear whether the tests measure the intended constructs or whether the inferences drawn from the test scores are justifiable.

Second, the committee is concerned that the project lacks a coherent research and test development plan for collecting the necessary data to build a valid, reliable, and fair test. The Phase 1 Pilot was very limited in scope, serving only to try out some item formats for the English language test. To date, there has been no pilot testing of history or government test items. The fact that little pilot testing has yet been done is neither surprising nor inappropriate, given that the content frameworks are not yet finalized and the test specifications not yet developed. The problem is that USCIS and MetriTech are gearing up for the Phase 2 Pilot, which will involve administering redesigned test forms in reading, writing, speaking, and U.S. history and government to thousands of applicants (according to the USCIS Project Overview presented to the committee on April 23, 2004, by G. Ratliff). The purpose of the Phase 2 Pilot seems to be a feasibility study to try out different test forms and new standardized test administration procedures. However, such a pilot requires that numerous earlier, major steps in the process have been completed—including content frameworks, test specifications, development of scoring rubrics, and pilot testing of items to collect data to build equated test forms. There does not seem to be a plan in place for ensuring that those earlier steps occur before proceeding with the Phase 2 Pilot.


USCIS and MetriTech have plans to conduct some supplemental studies between now and the Phase 2 Pilot. However, the proposed studies do not fill the gaps in the overall research and test development design noted above. USCIS and MetriTech have outlined a series of supplemental studies to investigate (1) the possibility of giving applicants the option of taking either a constructed-response or multiple-choice version of the test; (2) the possibility of giving applicants the option of taking either a computer-based or pencil-and-paper test; (3) whether a language instruction intervention, targeted toward populations at risk of performing poorly on the redesigned tests, can systematically increase performance on the language proficiency tests; (4) the extent to which the redesigned test of English proficiency correlates with other widely used measures of language proficiency; and (5) whether applicants can master the redesigned U.S. history and government content within the proposed test preparation time frame. According to internal documents USCIS shared with the committee, these studies are intended to address “the concerns of stakeholders and ensure that the test development process is perceived as fair and open.” However, in the committee’s judgment, the proposed supplemental studies will not address the most urgent, fundamental research questions that need to be answered before the Phase 2 Pilot and in order to produce an operational test.

The committee’s concerns about the necessary test development activities that must take place before conducting the Phase 2 Pilot are heightened by the fact that USCIS plans to make pass/fail decisions about applicants based on their performance on the Phase 2 Pilot. While the committee understands that making the test “count” will motivate applicants to perform their best, and that applicants who fail the pilot will be given the opportunity to take the old test, the committee is very concerned about making pass/fail decisions about individuals based on a set of test items that are being tried out for the first time. These test items will not be known to be valid, reliable, or fair at the time of the Phase 2 Pilot administration. It is also unclear how the passing score will be set for the Phase 2 Pilot forms, since formal standard setting cannot occur until all of the pilot data have been collected and analyzed. Because the items will not be piloted prior to their use in the Phase 2 Pilot, the test forms are not likely to be equally difficult. Using a pre-established passing score for all pilot test forms would therefore probably cause some qualified candidates to fail and allow some unqualified candidates to pass. Finally, there are no internal review board mechanisms in place to ensure human-subjects protections for individuals who participate in the Phase 2 Pilot.

The committee thinks that the redesign effort currently lacks an overarching research and development plan for collecting all of the necessary data to create a valid, reliable, and fair test. The Phase 2 Pilot test is taking place too early in the test development process. A number of essential steps in the process, including definition of constructs, item-level pilot testing and evaluation of the quality of test items, and development of scoring rubrics, have not yet occurred. USCIS should postpone the Phase 2 Pilot until these steps have been completed. Furthermore, it is inappropriate to make pass/fail decisions without conducting an appropriate standard-setting study.

Recommendation 2: A detailed plan for test development should be created with help from a technical advisory panel and review by an oversight committee. The research and test development plan should comply with testing standards and should include all of the necessary steps for developing a valid, reliable, and fair test.


The plan for the Phase 2 Pilot should be revisited and reconceptualized, in light of the overall research and development design for the project. Alternatives to making pass/fail decisions based on the Phase 2 Pilot should be seriously considered. And all steps in the test development process should be well documented.

Next we discuss in more detail two very important aspects of the redesign process: developing content frameworks and standard setting (determining passing scores). Upon being informed of efforts to date and future plans by USCIS and the testing contractor, the committee decided that there are some particularly pressing issues in these two areas that require immediate guidance.

CONTENT FRAMEWORKS

As noted above, one of the critical early steps in the test development process is developing content frameworks that clearly define the constructs to be measured. Many test design decisions flow from the content frameworks. Furthermore, ultimate judgments of the validity of the interpretations of test scores (for example, that an individual is qualified to become a U.S. citizen) rely on clear definitions of what the test was intended to measure and evidence that the test does indeed measure what was intended (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; National Research Council, 1999a, 2001). Often the content framework is guided by a theory of how people develop competence in a domain (such as English language proficiency) or by an analysis of job requirements, as in the case of many licensing and employment tests.

For the redesign of the naturalization tests, as with most other high-stakes testing programs, the exact nature of the content and skills to be tested is one of the most contentious sets of issues. Because Section 312 of the Immigration and Nationality Act does not clearly define such constructs as “reading and writing simple words and phrases” or the “understanding of the fundamentals of history,” interpreting the law’s intent becomes a judgment that must be made by those involved in the redesign effort. With a highly visible testing program such as the naturalization tests, it is important to make the process of developing the content frameworks, which serve as the basis for so much of the rest of test development, as public and transparent as possible. The content panels recommended in Chapter 2 are intended to ensure that the process is based on deliberation and consensus, rather than the decisions of just a few individuals either within the agency or who work for the testing contractor.

History and Government Content

USCIS and MetriTech have already done some of the necessary work toward developing a content framework in the area of history and government. To define what constitutes appropriate history and government content, MetriTech evaluated four sources of information: (1) the content included in the current naturalization textbooks and exam, (2) national and state K-12 history and government standards, (3) input from stakeholders in the immigration process, and (4) recommendations provided by a panel of educators, historians, and government experts.


USCIS and MetriTech then defined “fundamental” (and therefore the content to be tested) as content that was identified as important by three of the four sources.

While gathering information from multiple sources is commendable, the rationale for the “three out of four” rule and for weighting the four sources equally is not obvious. For example, it is unclear why K-12 history standards would be given equal weight to the judgment of an expert panel that was formed specifically to reach a consensus about the appropriate content for the new naturalization tests. Getting input from stakeholders is a good idea, but the definition of the universe of stakeholders is limited, given the broad range of groups that have an interest in citizenship and immigration.5 The survey sample is not representative of the broad range of groups that would have opinions about what topics are very important for a naturalization applicant to know to become a citizen. The unrepresentative sample, combined with the extremely low response rate,6 results in survey data that are not a valid summary of stakeholder views.

5  

USCIS indicates that the two groups surveyed were USCIS district adjudication officers and the public at large. The operational definition of the latter was community-based organization representatives identified through three sources: (1) email lists provided by community-based organizations who attended USCIS meetings in seven major cities; (2) the listserv database of one national organization, the Catholic Legal Immigration Network; and (3) email addresses identified via electronic searches for the names of immigration-focused organizations (CIS Responses to Committee’s April 2004 Questions, Set A).

6  

The response rates reported to the committee by USCIS are 8 percent for field officers and 24 percent for representatives of community-based organizations.

While there are plans to gather public input by publishing the draft framework in the Federal Register, USCIS has no plan for how it will systematically reconcile the conflicting feedback that is bound to come, or for deciding which changes should be made to the draft framework. A more conventional approach for developing the content standards would be a sequential one, in which the K-12 standards and textbooks are the starting point, the expert panel is the main source of information, and then reactions from stakeholder groups and the public are sought and integrated into the content framework, as deemed appropriate by an oversight body.

In addition to these issues about the weighting and integration of the different sources of evidence about history and government content, the committee has concerns about the process that was used to form the history and government panel. It is not clear that the initial process for identifying panel members was sufficiently systematic or wide-reaching. Perhaps because of this, partway through the deliberations, USCIS explained to the committee that it decided to recruit 10 more panel members “to bring in additional perspectives.” The committee has questions about what criteria were used to select members for the original history and government panel and the rationale for reconstituting the group partway through the content development process. Although adding members to a panel to make the composition more balanced is not necessarily a problem, there must be a clear, public rationale for doing so.

Given these concerns—both about the rules for defining “fundamental” history and government content and the process used to identify members for the expert panel—the committee questions the soundness of the approach that was used to develop the history and government content framework and questions whether the document will stand up to public scrutiny. In the committee’s judgment, the oversight committee and a new history and government panel, which might include some members of the original panel, should revisit the process that led up to the draft content standards and come up with a plan for building on or modifying that work in a rigorous and publicly accountable way.

Throughout its deliberations, a new history and government panel will need to weigh congressional intent and its expert judgment in determining the most important and relevant information that potential citizens need to know. The committee recognizes that the standards for the history and government requirement will be difficult to discern because of the lack of clear guidance in the statutory language; thus, it is critical that the process for determining test content is well defined and systematic. For example, there seems to be no acknowledgment in the statutory record that the language load of an assessment of the principles and form of U.S. government may be considerably higher than the standards assumed in the current English language requirement. If the new test requires applicants to demonstrate understanding of U.S. history and government in ways that require reading even short passages related to these topics (as is the case in the regime currently proposed), the language load is likely to be significantly higher than the ability to understand and use “ordinary English” and “simple words and phrases” mentioned in the statute.

English Language Content

Development of the English language content framework (not yet completed at the time of this writing) followed a different course, one that was even less rigorous and systematic than in the history and government area. Instead of convening an expert panel, MetriTech relied on a few of its own English language specialists to develop a preliminary framework for the English proficiency test. The internal specialists turned to several sources: (1) the current naturalization exam, (2) educators of adult English language learners, (3) stakeholders, and (4) existing language proficiency frameworks. The process they then used to synthesize the information from these sources is not clear, and there is a lack of connection between the information gathered from the sources and the resulting draft content framework.

Many other preliminary decisions are being made about the English language test without first specifying the underlying theory or theories of language development that help guide the design of the test and item formats. For example, there is insufficient rationale for replacing current dictation and interview formats with new item formats, such as picture prompts.

There also are no explicit plans for how public feedback will be solicited and incorporated into the draft English language framework. The committee is troubled by the marked difference in the process and the levels of expertise that were used to define the history and government content and the English language content. In our judgment, a highly credible expert panel for English language content should be part of the structure for the redesigned tests.

Recommendation 3: Work on developing the content frameworks (including publishing the history and government framework in the Federal Register) should cease until a clear, transparent, and publicly accountable process is defined and vetted with an oversight group.


The oversight committee and a new history and government panel should review the process that led up to the draft content framework for the history and government test. These advisory bodies should recommend a plan for building on or modifying the existing content framework, bearing in mind testing standards related to developing content frameworks. In addition, the oversight committee and a new English language content panel should review the process currently being used to draft the content framework for the English reading, writing, and speaking tests. These advisory bodies should recommend a plan for building on or modifying that process, bearing in mind testing standards related to developing content standards.

STANDARD SETTING

USCIS has indicated that as part of its efforts to standardize the naturalization process, it plans to set passing scores, also referred to as cutscores, for the newly designed tests.7 Consequently, we include here a discussion of the principles that should guide the standard-setting process by which such cutscores are set, which is both a technical and a judgmental process. The validity of test interpretations is seriously affected by the defensibility of the cutscores. There is no single method for determining cutscores for all tests or for all purposes, nor is there any single set of procedures for establishing their defensibility, but the Standards do lay out some general principles of good testing practice (pp. 53-54, 59-60).

When the results of the standard-setting process have highly significant consequences for large numbers of examinees, the process by which cutscores are determined should be clearly documented and defensible. The method used to set the passing scores should be appropriate for the types of tasks that compose the assessment (Plake, 2002). There are several item types in the redesigned tests, some multiple-choice and some constructed-response. The method selected by MetriTech for setting the passing score, modified Angoff, is appropriate only for multiple-choice questions. There is an extended Angoff method, which is appropriate for constructed-response tests (Hambleton and Plake, 1995); there are several other standard-setting methods that are appropriate for constructed-response tasks (see Cizek, 2001). Therefore, the standard-setting method selected by MetriTech does not appear to be appropriate for all of the tasks that constitute the redesigned tests.
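For concreteness, here is a minimal sketch of the core modified-Angoff computation as it is commonly described in the standard-setting literature: each panelist estimates the probability that a minimally competent candidate would answer each multiple-choice item correctly, and the recommended cutscore is the sum of the mean item ratings. The ratings below are invented; an operational study would add multiple rounds, panelist discussion, and impact data.

import numpy as np

# Invented ratings: ratings[p, i] is panelist p's estimate of the probability
# that a minimally competent candidate answers item i correctly
# (here, 10 panelists rating 25 multiple-choice items).
rng = np.random.default_rng(3)
ratings = np.clip(rng.normal(0.65, 0.12, size=(10, 25)), 0.05, 0.95)

item_means = ratings.mean(axis=0)   # consensus estimate per item
cutscore = item_means.sum()         # expected raw score of a borderline candidate
print(f"Recommended cutscore: {cutscore:.1f} out of {ratings.shape[1]} items")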

Furthermore, the modified Angoff method is controversial. Some researchers and practitioners report that this method is “fundamentally flawed” due to the cognitive demands placed on the standard-setting panelists (Shepard, 1995; National Research Council, 1999b; Impara and Plake, 1998). Others have found that the method yields defensible cutscores (Cizek, 1993; Kane, 1995; Mehrens, 1995) and that the panelists’ item performance estimates are reliable and valid (Plake, Impara, and Irwin, 2000; Plake and Impara, 2001).

7  

Although we have not discussed the issue as a committee and therefore can make no recommendation at this time, we note that neither the legislation nor agency regulations preclude methods other than cutscores for deciding whether a naturalization application should be approved. One example is a comprehensive system in which test scores are systematically considered along with other relevant information about the applicant. Consideration of such alternatives may be a topic for which USCIS might ask the oversight committee to make recommendations.


Most standard-setting approaches use candidate performance data to inform standard-setting panelists of the difficulty of the test questions and the impact of their initial estimates on passing rates (Hambleton, 2001; Reckase, 2001). It is therefore very important that the data used in the standard-setting study are representative of the test performance of candidates in operational settings. It is not clear how well the Phase 2 Pilot candidates mirror the population of candidates who will take the operational test. First, the selection of sites for the Phase 2 Pilot does not appear to be representative of the population. Second, once the test is operational, candidates will have access to specially designed test preparation materials and released items; it is not clear whether candidates in the Phase 2 Pilot will have access to these materials. Therefore, the data from the Phase 2 Pilot may not be representative of the candidates who take the operational test. A likely outcome is that the test items would appear harder for the Phase 2 Pilot candidates than for operational candidates, resulting in a recommendation of passing scores that are too low: when resource materials are available, candidates are likely to perform better on the test, resulting in a higher than expected passing rate.

The final decision about the passing scores is a policy decision. In the test redesign plans reviewed by the committee, it is not clear how the final passing scores will be determined and by whom. Best practices suggest that this final policy decision should be made by agency officials in close consultation with an oversight committee. The decision should be informed by the results of the standard-setting procedure.

Recommendation 4: After a determination has been made about the various item formats that will be used on the redesigned test, USCIS and its testing contractor should develop a detailed plan for standard setting, with input from the technical advisory group and a final recommendation by the oversight committee.

An important issue related to making pass/fail decisions about naturalization applicants is whether exceptions, accommodations, or different passing scores will be allowed for special populations. Under the current testing regime, federal regulations require that officers consider an applicant’s background when choosing and phrasing history and government test questions and evaluating responses (referred to as “due consideration”). The issue here is how or whether to incorporate “due consideration” into the new, more standardized testing regime. There are also currently some statutory exemptions—for instance, applicants over 50 years of age who have been lawful permanent residents for at least 20 years do not have to meet the English language requirements and can take the history and government exam in their own language. How these considerations are incorporated into the redesigned testing program has not yet been determined and would be an appropriate topic for the oversight committee.
