
PREPUBLICATION COPY - UNEDITED PROOFS

4 The Overall Rating of Program Quality

The dimensional measures provide a summary of program performance along individual dimensions that are of importance in doctoral education. The overall rating combines the variables that make up the dimensional measures into a single measure. In addition to reflecting the faculty preferences in each field as derived from the faculty questionnaire, it includes the results of the importance measures derived from the rating survey. This section describes in non-technical terms how the overall rating of a program is calculated. Readers who wish more technical detail are referred to Appendix A.

THE OVERARCHING IDEA

There is a great deal of uncertainty in the ratings of the quality of programs. Uncertainty can come from a variety of sources. For example, although many academics may think that they can identify the top five or ten programs in their field, this certainty about perceived quality decreases as more and more programs are included. Furthermore, one program may be strong in one area while a second program's strengths may lie in a different area. Faculty asked to rate programs may differ in their views about the importance of these strengths, and the programs may differ in various characteristics, many of which may be considered important to the perceived quality of a doctoral program.

Describing this uncertainty was a key task of the predecessor committee that produced Assessing Research-Doctorate Programs: A Methodology Study.[22] This committee examined the methodology of the 1995 study and recommended that the next study rely more explicitly on program data. Its report also contained two key recommendations as to how the methodology of obtaining reputation measures should be revised:

"The next study should have sufficient resources to collect and analyze auxiliary information from peer raters and the programs being rated to give meaning and context to the rating ranges that are obtained for the programs...." (p. 5)

and

"Re-sampling methods should be applied to ratings to give ranges of rankings for each program that reflect the variability of ratings by peer raters. The panel investigated two related methods, one based on Bootstrap re-sampling and another closely related method based on Random Halves, and found that either method would be appropriate." (p. 5)

The dimensional ratings, described in the previous section, fulfill the first recommendation. This section describes how the second recommendation was followed and combined with the first to obtain an overall rating for each program within a field.

THE OVERALL APPROACH

A schematic description of the overall approach appears in Box 4-1 and is described in the text:

[22] National Research Council, Assessing Research-Doctorate Programs: A Methodology Study. Washington, D.C., 2003.
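The two re-sampling methods named in the second recommendation can be illustrated with a minimal sketch. It assumes that Random Halves means drawing half of the raters without replacement and that Bootstrap means drawing a same-size sample with replacement; the ratings, function names, and seed are hypothetical and are not taken from the study.

```python
import random
import statistics

def random_half_mean(ratings, rng):
    """Mean rating from a random half of the raters, drawn without replacement."""
    half = rng.sample(ratings, len(ratings) // 2)
    return statistics.mean(half)

def bootstrap_mean(ratings, rng):
    """Mean rating from a bootstrap sample: same size, drawn with replacement."""
    resample = rng.choices(ratings, k=len(ratings))
    return statistics.mean(resample)

def resampled_range(ratings, draw, n_draws=500, seed=0):
    """Return (lowest, highest) resampled mean rating over n_draws replications."""
    rng = random.Random(seed)
    means = [draw(ratings, rng) for _ in range(n_draws)]
    return min(means), max(means)

# Hypothetical ratings from ten peer raters for a single program.
ratings = [4, 5, 3, 4, 6, 4, 5, 3, 4, 5]
print("random halves:", resampled_range(ratings, random_half_mean))
print("bootstrap:    ", resampled_range(ratings, bootstrap_mean))
```

Either function turns one set of peer ratings into many plausible mean ratings, which is what allows a range rather than a single value to be reported.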

Box 4-1

Data sources shown in the schematic: Faculty; Students; Institutions and Programs; Existing Data.

1. DATA
More than 5,000 doctoral programs in 222 institutions in 61 fields across the sciences, engineering, social sciences, arts, and humanities. Institutional practices, program characteristics, and faculty and student demographics. Obtained through a combination of original surveys and existing data sources (NSF surveys and ISI publication and citation data).

2. WEIGHTS
In two surveys, program faculty provided the NRC with information on what they value most in Ph.D. programs:
1) Faculty were asked directly how important they felt 21 items in a list of program characteristics were.
2) A sample of faculty rated a sample of programs in their field. These ratings were then related through regressions to the same items as appeared in 1).

3. ANALYSIS
"Direct" and "regression-based" weights provided by faculty were averaged into one combined set of weights, reflecting the multi-dimensional views faculty hold about contributing factors to the quality of doctoral programs.

4. RANGES OF RANKINGS
Each program's rating was calculated 500 times by randomly selecting half of the raters from the faculty sample in Step #2 and also incorporating statistical and measurement variability. Similarly, 500 samples of direct weights were selected. Combined weights were then applied to 500 randomly selected sets of program data to produce ratings for each program. These ratings for each of the 500 samples determine a rank ordering of the programs. A "range of rankings" was then constructed showing the middle half of calculated rankings. What may be compared, among programs in a field, is this range of rankings.

Faculty were surveyed to get their views on the importance of different characteristics of programs as measures of quality. Ratings were based on faculty members' views of how those measures related to program quality, as discussed in the chapter on dimensional measures. The views were related to program quality using two distinct methods: (1) directly, through answers to questions on the faculty survey; and (2) regression-based, obtained by asking faculty raters to provide program ratings for a sample of programs in a field and then relating these ratings, through a regression model that corrected for correlation among the characteristics, to data on the program characteristics.

The two methods approach the ratings from different perspectives. The direct approach is a "bottom-up" approach that builds up the ratings from the importance that faculty members gave to specific program characteristics independent of reference to any actual program. The regression-based method is a "top-down" approach that starts with ratings of actual programs and uses statistical techniques to infer the weights given by the raters to specific program characteristics. The direct approach is idealized. It asks about the characteristics that faculty feel contribute to quality of doctoral programs without reference to any particular program. The second approach presented the respondent with 15 programs in his or her field and asked for ratings of program quality,[23] but the respondents were not explicitly queried about the basis of their ratings.

Because these different approaches gave results that were similar in magnitude[24] but not strongly correlated,[25] the two views of the importance of program characteristics were combined[26] to obtain an overall view (or combined weight) for each measured program characteristic. The sum of these weighted characteristics yielded a rating for each program.
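The combination step and the weighted sum can be sketched as follows. This is an illustration only: it uses simple averaging of the two weight sets, which the report describes as the case with no uncertainty, whereas the study's actual combination takes the variances of the coefficients into account (Appendix A). The characteristic names, weights, and program values are hypothetical, and the values are assumed to be already on a common scale.

```python
def combine_weights(direct, regression):
    """Average two weight sets characteristic by characteristic.

    Simplifying assumption: plain averaging, i.e. the no-uncertainty case.
    """
    return {name: (direct[name] + regression[name]) / 2 for name in direct}

def program_rating(values, weights):
    """Rating = sum over characteristics of weight times program value."""
    return sum(weights[name] * values[name] for name in weights)

# Hypothetical weights from the two methods for three characteristics.
direct = {"publications": 0.30, "citations": 0.20, "completion": 0.10}
regression = {"publications": 0.40, "citations": 0.10, "completion": 0.05}
combined = combine_weights(direct, regression)

# Hypothetical values of the same characteristics for one program.
program = {"publications": 1.2, "citations": -0.5, "completion": 0.3}
rating = program_rating(program, combined)
print(combined)
print(rating)
```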
As is explained below, each rating is recalculated 500 times using different samples of raters. The program ratings obtained from all these calculations can then be arranged in rank order and, in conjunction with all the ratings from all the other programs in the field, used to determine a range of possible rankings. Because of the various sources of uncertainty, which are discussed at greater length in Appendix A, each ranking is expressed as a range of values. These ranges were obtained by taking into account the different sources of uncertainty in these ratings (statistical variability from the estimation, program data variability, and variability among raters). The measure of uncertainty is expressed by reporting the end points of the inter-quartile range of rankings for each program; that is, the range that contains the middle half of a large number of ratings calculations that take uncertainty into account.[27] An example of the derivation of rankings for a program is given in Chapter 5.

In summary, we obtain a range of rankings for each program in a given field by first obtaining two sets of weights through two different methods, direct and regression-based. We then standardize all the measures to put them on the same scale and obtain ratings by multiplying the value of the standardized measure by the weights. We obtain both the direct weights and coefficients from regressions through calculations carried out 500 times, each time with a different set of faculty, to generate a distribution of ratings that reflects their uncertainties. We obtain the range of rankings for each program by trimming the bottom quarter and the top quarter of the 500 rankings to obtain the inter-quartile range. This method of calculating ratings and rankings takes into account variability in rater assessment of what contributes to program quality within a field, variability in values of the measures for a particular program, and the range of error in the statistical estimation. It is important that these techniques give us a range of rankings for most programs.

[23] The question given raters about program quality was: On a scale from 1 to 6, where 1 equals not adequate for doctoral education and 6 equals a distinguished program, how would you rate this program?
    1 = Not Adequate for Doctoral Education
    2 = Marginal
    3 = Adequate
    4 = Good
    5 = Strong
    6 = Distinguished
    9 = Don't Know Well Enough
[24] In the case of the resulting direct and regression-based weights.
[25] For any given measure, the results from the two methods are not highly correlated with one another, permitting us to assume that the results from the two approaches are statistically independent.
[26] If there were no uncertainty, the weights would simply be averaged. Because there is uncertainty, the optimal combined weight is not so simple, but takes into account the variances of the separate coefficients. See equations (19) and (20) in Appendix A and the related discussion.
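The summary above can be sketched in code: standardize the measures, compute a rating for every program under each of 500 sets of weights, rank the programs each time, and trim the top and bottom quarters of each program's 500 ranks. This is a simplified sketch, not the study's procedure: the 500 weight sets are simulated here by adding random noise to hypothetical base weights (standing in for weights recomputed from different samples of faculty), and variability in the program data is not simulated. All names and numbers are hypothetical.

```python
import random
import statistics

def standardize(values):
    """Rescale a list of values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def rank_positions(ratings):
    """Rank 1 goes to the highest rating; ties are broken by program order."""
    order = sorted(range(len(ratings)), key=lambda i: -ratings[i])
    ranks = [0] * len(ratings)
    for position, program in enumerate(order, start=1):
        ranks[program] = position
    return ranks

def interquartile_rank_range(ranks):
    """Drop the lowest and highest quarters of the sorted ranks; return the end points."""
    ordered = sorted(ranks)
    cut = len(ordered) // 4
    middle = ordered[cut:len(ordered) - cut]
    return middle[0], middle[-1]

def ranges_of_rankings(measures, weight_draws):
    """measures: {characteristic: [value per program]}; weight_draws: list of weight dicts."""
    z = {name: standardize(vals) for name, vals in measures.items()}
    n_programs = len(next(iter(z.values())))
    ranks_per_program = [[] for _ in range(n_programs)]
    for weights in weight_draws:
        ratings = [sum(weights[k] * z[k][p] for k in weights) for p in range(n_programs)]
        for p, r in enumerate(rank_positions(ratings)):
            ranks_per_program[p].append(r)
    return [interquartile_rank_range(r) for r in ranks_per_program]

# Hypothetical data for four programs (A to D) on two characteristics.
rng = random.Random(0)
measures = {"publications": [2.1, 1.4, 0.9, 1.5], "citations": [3.0, 2.2, 2.5, 1.0]}
base = {"publications": 0.6, "citations": 0.4}
# Assumption: noisy copies of the base weights stand in for 500 re-estimated weight sets.
weight_draws = [{k: w + rng.gauss(0, 0.2) for k, w in base.items()} for _ in range(500)]
ranges = ranges_of_rankings(measures, weight_draws)
for name, (best, worst) in zip("ABCD", ranges):
    print(name, best, worst)
```

With 500 ranks per program, the trimming step removes the 125 lowest and 125 highest ranks, leaving the middle 250.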
We do not know the exact ranking for each program, and to try to obtain one (by averaging, for example) could be misleading, because we have not imposed any particular distribution on the range of rankings.[28] The database that presents the range of rankings for each program will list the programs alphabetically and give the range for each program. Users are encouraged to look at groups of programs that are in the same range as their own programs, as well as programs whose ranges are above or below, in trying to answer the question, "Where do we stand?" The next section provides an example of how the ranges of rankings were calculated for a particular program.

[27] The inter-quartile range eliminates the top and bottom 125 ratings calculated from 500 regressions and 500 samples of direct weights from faculty. It is a range that contains half of all the rankings for a program.
[28] For example, most of the rank ordered ratings could be at the top of the range.
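One way a user might group programs relative to their own is sketched below. It assumes that "in the same range" can be read as overlapping ranges of rankings, which is an interpretation rather than a rule stated in the report; the program names and ranges are hypothetical.

```python
def compare_ranges(own, other):
    """Classify another program's range of rankings relative to one's own.

    Ranges are (best_rank, worst_rank) pairs; a smaller rank number is better.
    """
    own_best, own_worst = own
    other_best, other_worst = other
    if other_worst < own_best:
        return "above"        # the other range lies entirely at better ranks
    if other_best > own_worst:
        return "below"        # the other range lies entirely at worse ranks
    return "overlapping"      # the two ranges share at least one rank

# Hypothetical ranges of rankings for four programs in one field.
ranges = {
    "Program A": (3, 8),
    "Program B": (1, 2),
    "Program C": (6, 12),
    "Program D": (15, 20),
}
own = ranges["Program A"]
for name, other in sorted(ranges.items()):
    if name != "Program A":
        print(name, compare_ranges(own, other))
```

Programs labeled "overlapping" cannot be ordered relative to one's own program on the basis of the ranges alone.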