
Appendix G
Technical and Statistical Techniques

1. Alternate Ways to Present Rankings: Random Halves and Bootstrap Methods
2. Correlates of Reputation Analysis

Alternate Ways to Present Rankings: Random Halves and Bootstrap Methods

Reputational surveys, such as those conducted for earlier research-doctorate program assessments, were not designed to provide accurate rankings of the programs. They represented estimates of ratings, where the results could vary depending on the selected set of raters. The confidence interval analysis performed in the last two assessments illustrated this point. However, users of the assessments chose to ignore this and focused instead on specific scores obtained by averaging questionnaire responses. A far better method would be to incorporate variability into the reporting of ratings and display a range of program ratings rather than a single ranking. Random Halves and Bootstrap are two methods that could be used to assign measures of accuracy to statistical estimates as well as to present data. Both methods resample the original data set, in slightly different ways, and would provide slightly different results.

Methods

For a particular field, such as English Language and Literature, assume there are M programs and N program raters. Each rater rates only a subset of the M programs; therefore, some programs may be rated more often than others, since the number of ratings for a program depends on which raters responded to the survey and whether they actually rated that program on their questionnaire. A response matrix R can be constructed with a reputational rating $r_{ij}$ as the entry for rater i rating program j, i = 1, ..., N and j = 1, ..., M. Along each row of the matrix there will be blank spaces for programs that the rater was not asked to rate or did not rate. The different ratings for a given program are then aggregated into a single "mean" rating, $r_j$ ($r_j$ could also include weighting and trimming, for example, and may not be just the simple mean of all ratings for program j).
Random Halves Method: The Random Halves method is closely related to what is known in the statistics literature as the "random group method" for assessing variances of estimates in complex sample surveys. This approach, which has many variants, has a literature that goes back to at least 1939 (see Wolter, 1985). It is closely related to another method called the "Jackknife," which was introduced in 1949 and popularized in the 1960s. The essence of the random group or the Jackknife method is to calculate a numerical quantity of interest on a smaller part of the whole data set, and to do this for several such smaller parts of the original data. The differences between the results from these smaller parts of the data are combined to assess the amount of variability in the quantity computed on the whole data set. The Random Halves method is an example of this in which the smaller parts of the whole data are random halves of the data.

The Random Halves method is applied as follows: A random sample of N/2 of the rows of R is drawn without replacement, meaning that a row cannot be selected twice. The mean $r_j$ for each program is then computed from this random half sample of the full data. All the programs are then ranked on the basis of these mean ratings. This procedure could be repeated ten, one hundred, or several hundred times to produce a range of ratings and rankings for each program in the field. Rankings for each program could be summarized as the part of the distribution that lies within the interquartile range. Users of reputational ratings would recognize that raters rate programs differently and that half of the resampled rankings placed program j between a and b, where a is the 25th percentile of its ranking distribution and b is the 75th percentile.

Bootstrap Method: The Bootstrap method was developed more recently than the random group method, and its literature dates back only to 1979. It is well described in Efron (1982) and Efron and Tibshirani (1993). Although the Bootstrap method was not created specifically for assessing variances in complex sample surveys, it has been used for that purpose. It was created as a general method for assessing the variability of the results of any type of data analysis, complex or simple, and has become a standard tool. Instead of sampling N/2 rows of R without replacement, N of the rows would be sampled from R with replacement, meaning that a row could be selected several times. The same procedure could be used for computing the mean as in the Random Halves method.

The two methods provide very similar results. The perceived advantage of the Random Halves method lies in the process it mimics: a rater pool is selected and half the raters are sent questionnaires, and this rating process is repeated again and again for the original pool. It is not significantly different from what was done in the past, when the selection of raters and the use of a confidence interval showed that a certain percentage of the ratings would fall in a similar interval even if a different set of raters were selected. The advantage of the Bootstrap method, on the other hand, is that it is an established method with a developed theory for the statistical accuracy of survey measurements.
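The two resampling procedures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the assessments: the rating matrix `R` is hypothetical, `None` marks a program a rater did not rate, the aggregate is the simple mean, lower ratings are treated as better (the convention of the worked example in this appendix), and ties are split at random. The function names are illustrative.

```python
import random
from statistics import mean, quantiles

def mean_ratings(rows):
    """Mean rating per program over the given rater rows (None = not rated)."""
    out = []
    for j in range(len(rows[0])):
        vals = [row[j] for row in rows if row[j] is not None]
        out.append(mean(vals) if vals else None)
    return out

def rank_programs(means, rng):
    """Rank 1..M with lower mean = better (assumption), random tie splitting,
    and programs left unrated in this resample placed last."""
    order = sorted(
        range(len(means)),
        key=lambda j: (means[j] is None,
                       means[j] if means[j] is not None else 0.0,
                       rng.random()),
    )
    ranks = [0] * len(means)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks

def resample_ranks(R, method, reps, seed=0):
    """Return, for each program, the list of ranks it received over `reps` resamples."""
    rng = random.Random(seed)
    n = len(R)
    draws = [[] for _ in R[0]]
    for _ in range(reps):
        if method == "random_halves":
            rows = rng.sample(R, n // 2)   # N/2 rows without replacement
        elif method == "bootstrap":
            rows = rng.choices(R, k=n)     # N rows with replacement
        else:
            raise ValueError(f"unknown method: {method}")
        for j, rank in enumerate(rank_programs(mean_ratings(rows), rng)):
            draws[j].append(rank)
    return draws

def interquartile_ranks(draws):
    """25th and 75th percentile of each program's resampled ranks."""
    result = []
    for d in draws:
        q1, _, q3 = quantiles(d, n=4, method="inclusive")
        result.append((q1, q3))
    return result

# Hypothetical data: 6 raters (rows) by 4 programs (columns).
R = [
    [1, 2, None, 4],
    [1, 3, 2, None],
    [2, 2, 3, 4],
    [None, 1, 3, 4],
    [1, 2, 2, 3],
    [2, None, 3, 4],
]

for method in ("random_halves", "bootstrap"):
    print(method, interquartile_ranks(resample_ranks(R, method, reps=200)))
```

Reporting the pair returned by `interquartile_ranks` for each program, instead of a single rank, is one way to present the range of rankings that the text recommends.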
A Comparison of the Random Halves and Bootstrap Methods

The differences between the methods can be demonstrated by the following simple example. Consider an example where three raters rate two programs. The raters are labeled 1, 2 and 3, and the two programs are labeled A and B. The Rating Matrix is:

Table 1: The Rating Matrix

Raters            A      B      Average rating by raters
1                 0      1      0.5
2                 2      1      1.5
3                 1      0      0.5
Average rating    1      2/3

In this example, all three raters rate the same two programs on a scale of 0 to 2. In turning ratings into rankings, assume that lower ratings correspond to assessments of higher quality. Thus, rater 1 rated A higher than B, by giving A a rating of 0 and B a rating of 1. The last row of the Rating Matrix has the average ratings for each program. For these ratings, B is ranked higher than A because its average rating is slightly lower than that of A. In the discussion of the example, the rank of A will be denoted by Rank(A). Therefore, Rank(A) = 2, while Rank(B) = 1.

This example may appear to be unrealistic in at least two ways. First, it is very small. This means that it is only possible to examine the probability that A is ranked 1st or 2nd. Second, programs are not sampled for raters to rate; instead, the raters rate all of the programs in the example. However, neither of these simplifications is very important for the things that will be demonstrated by the example. On the other hand, the example shows some differences among the ratings of the three raters. Rater 1 ranks A and B differently from the way Raters 2 and 3 do. Also, the second rater's rating numbers are higher than those of the other two.

In applying Random Halves (RH) to this example, there are two variations, since the number of raters is not an even number. Denote by RH(1) the case in which the "half-sample" consists of 1 of the 3 raters chosen at random, and by RH(2) the case in which the "half-sample" consists of 2 of the 3 raters chosen at random. These are the only possibilities for the RH method in the example.

In the RH(1) case, since there are three possible raters to be sampled, each is sampled with probability 1/3, and the sampled rater's ratings are the averages. Below is a table that summarizes the three possible sample results for RH(1).

Table 2: Summary of RH(1)

Sample    Average rating A    Average rating B    Rank(A)    Probability of the sample
{1}       0                   1                   1          1/3
{2}       2                   1                   2          1/3
{3}       1                   0                   2          1/3

Because Rank(A) = 2 in two of the three possible half samples, the probability that Rank(A) = 2 is 2/3. This should be compared to the finding that in the data (i.e., the Rating Matrix in Table 1) the rank of A is 2, so the RH(1) method indicates that it could have been different from 2 about 1/3 of the time.

In the RH(2) case, two raters are sampled, and there are three possibilities: {1,2}, {1,3} and {2,3}. Suppose the two sampled raters are 1 and 2. Then the data to be averaged are the first two rows of the Rating Matrix:

Raters            A      B
1                 0      1
2                 2      1
Average rating    1      1

The table below summarizes what occurs for the three possible half samples for RH(2). Note that in the cases where the average ratings are the same, random tie splitting is used and the rank order is denoted by 1.5.

Table 3: Summary of RH(2)

Sample    Average rating A    Average rating B    Rank(A)    Probability
{1,2}     1                   1                   1.5        1/3
{1,3}     1/2                 1/2                 1.5        1/3
{2,3}     3/2                 1/2                 2          1/3
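The half-sample results in Tables 2 and 3 can be checked by enumerating every possible half-sample exactly. The sketch below uses the ratings of Table 1 and the conventions stated in the text (lower average is better, ties are split with probability 1/2); the function and variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

# rater -> (rating of A, rating of B), taken from Table 1
RATINGS = {1: (0, 1), 2: (2, 1), 3: (1, 0)}

def prob_rank_a_is_2(sample_size):
    """Exact P(Rank(A) = 2) over all equally likely half-samples of the given size."""
    samples = list(combinations(RATINGS, sample_size))
    total = Fraction(0)
    for sample in samples:
        avg_a = Fraction(sum(RATINGS[i][0] for i in sample), sample_size)
        avg_b = Fraction(sum(RATINGS[i][1] for i in sample), sample_size)
        if avg_a > avg_b:        # A has the worse (higher) average: Rank(A) = 2
            p = Fraction(1)
        elif avg_a == avg_b:     # tie: split at random
            p = Fraction(1, 2)
        else:
            p = Fraction(0)
        total += p / len(samples)
    return total

print(prob_rank_a_is_2(1), prob_rank_a_is_2(2))  # prints: 2/3 2/3
```

Exact fractions are used so that the tie-splitting terms are not blurred by rounding.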

In the case of RH(2), there are three ways to obtain Rank(A) = 2. The first is from sample {2,3}. The other two arise when either of the other two samples is drawn and the tie is split so that Rank(A) = 2. Hence, the probability is 1/3 + (1/3)(1/2) + (1/3)(1/2) = 2/3. The fraction 1/2 represents the tie splitting. Note that 2/3 is also the probability that Rank(A) = 2 in RH(1).

In summary, the RH method calls for repeatedly taking "half-samples" of the rating matrix, averaging the resulting ratings for A and B, and then ranking A and B based on these average ratings. In resampling over and over, a distribution is constructed of how many times A is ranked 1 or 2. For example, in the case of RH(1) or RH(2), A would be ranked 2 about 2/3 of the time. Therefore, while the two versions of the RH method give different data, using random tie splitting gives the same result for the probability that A is ranked 2.

In applying the Bootstrap (Boot) method to the example, three raters are sampled, and the same rater can be selected more than once. They are regarded as representative of all the possible raters who could have been sampled to rate the programs. Clearly such an assumption varies in plausibility due to various factors, such as how many raters are being considered and how they were originally chosen. It is, however, a useful assumption and appears throughout many applications of statistics. In sampling three rows from the original Rating Matrix there are 27 possible combinations, so the probability of any sample is 1/27. They are listed in the following table.

Table 4: Bootstrap samples, their average ratings for A and B, and the rank of A

Sample   A     B     Rank(A)     Sample   A     B     Rank(A)     Sample   A     B     Rank(A)
111      0/3   3/3   1           211      2/3   3/3   1           311      1/3   2/3   1
112      2/3   3/3   1           212      4/3   3/3   2           312      3/3   2/3   2
113      1/3   2/3   1           213      3/3   2/3   2           313      2/3   1/3   2
121      2/3   3/3   1           221      4/3   3/3   2           321      3/3   2/3   2
122      4/3   3/3   2           222      6/3   3/3   2           322      5/3   2/3   2
123      3/3   2/3   2           223      5/3   2/3   2           323      4/3   1/3   2
131      1/3   2/3   1           231      3/3   2/3   2           331      2/3   1/3   2
132      3/3   2/3   2           232      5/3   2/3   2           332      4/3   1/3   2
133      2/3   1/3   2           233      4/3   1/3   2           333      3/3   0/3   2

Rank(A) = 2 occurs a total of 20 times in the table above, yielding a probability of 20/27 = .74. This is different from the result of the RH methods (i.e., .67). However, it is still plausible because, while A was ranked second in a sample of 3, there is still some probability that it could have been ranked 1st in a different sample of raters. The Boot method produced a somewhat smaller estimate of the probability that A could have been ranked 1st, i.e., .26 rather than .33, but both of these values are less than 1/2 and both are plausible in such a small example. There is no very convincing, intuitive way to favor either one of these two probability estimates, .67 or .74. Hence, this example has little to offer in making an intuitive choice between the two approaches. What it does show is that the RH and Boot methods do not give the same results for something that is closely related to the types of probabilities at issue. Thus, any claim that the two methods are "equivalent" is wrong, but they are clearly "similar."

Statisticians who are specialists in variance estimation prefer the Bootstrap to ad hoc methods because it is grounded in theory. The Bootstrap method yields the nonparametric maximum likelihood estimate of the probability that Rank(A) = 2. The Random Halves method does not enjoy this property. However, variance estimation is an important subject in statistics, and many methods, in particular the Jackknife, can be tailored to situations where they provide serious competition to the Bootstrap. The next section will illustrate that, when the numbers of raters and programs are both large, there is little difference between the Random Halves and the Bootstrap methods.

Analysis of the Expected Variance for the Two Methods

A natural question to ask is: What do the Random Halves and Boot methods produce for probability distributions of average ratings for programs? Drawing on some results from probability theory, it can be shown that these methods give similar results.

Any method of resampling creates random variables with distributions that depend on the resampling method. In the rating example, let the random variables for the average ratings that result for A and B for each sample be denoted by $R_A$ and $R_B$, respectively. These are random variables with means and variances that have well-known values. The average ratings of A and B in the rating matrix are given in the last row of the Rating Matrix in Table 1, and they are denoted in general as $r_A$ and $r_B$. Thus, in the example, $r_A = 1$ and $r_B = 2/3$.
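For the three-rater example, the Bootstrap resampling distribution can be written out exactly by enumerating all 27 ordered samples of Table 4. The sketch below, with illustrative names and the ratings of Table 1, returns both the probability that Rank(A) = 2 and the distribution of the resampled average rating $R_A$.

```python
from fractions import Fraction
from itertools import product

# rater -> (rating of A, rating of B), taken from Table 1
RATINGS = {1: (0, 1), 2: (2, 1), 3: (1, 0)}

def bootstrap_enumeration():
    """Enumerate every ordered bootstrap sample of 3 raters drawn with replacement."""
    n = len(RATINGS)
    samples = list(product(RATINGS, repeat=n))   # 3**3 = 27 samples
    weight = Fraction(1, len(samples))
    p_rank_a_is_2 = Fraction(0)
    dist_a = {}                                  # value of R_A -> probability
    for sample in samples:
        avg_a = Fraction(sum(RATINGS[i][0] for i in sample), n)
        avg_b = Fraction(sum(RATINGS[i][1] for i in sample), n)
        dist_a[avg_a] = dist_a.get(avg_a, Fraction(0)) + weight
        if avg_a > avg_b:                        # lower average is better
            p_rank_a_is_2 += weight
        elif avg_a == avg_b:                     # tie split (does not occur here)
            p_rank_a_is_2 += weight / 2
    return p_rank_a_is_2, dict(sorted(dist_a.items()))

p, dist = bootstrap_enumeration()
print(p)                                         # prints: 20/27
for value, prob in dist.items():
    print(value, prob)
```

With three draws the two sums can never be equal in this example, so the tie branch is included only to keep the rule general.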
In addition to the average ratings, the variance of the ratings in each column is defined as the average of the squares of the ratings in that column minus the square of the mean rating for that column. Thus, for program A, the variance is $v_A = (0^2 + 2^2 + 1^2)/3 - 1^2 = 5/3 - 1 = 2/3$, and, for program B, it is $v_B = (1^2 + 1^2 + 0^2)/3 - (2/3)^2 = 2/3 - 4/9 = 6/9 - 4/9 = 2/9$.

Table 5 gives the results for N raters rating Program A and n raters used in the RH(n) method. If N is even, then n = N/2. In the table let $E(R_A)$ denote the "expected value" or "long-run average value" of the average rating for A, $R_A$. Statistical theory shows that it is the same value, $r_A$, for both the Boot and the RH methods. $r_A$ is also the average rating for A in the original Rating Matrix, and in general, $r_A$ is the average rating given to program A by the raters rating it. Thus, both the RH and Boot methods are unbiased for $r_A$, and any sensible resampling method will share this property.

Table 5: The mean and variance of the average rating for A in a single resample

              Bootstrap Method    Random Halves, RH(n), Method
$E(R_A)$      $r_A$               $r_A$
$Var(R_A)$    $v_A / N$           $\frac{v_A}{n} \cdot \frac{N - n}{N - 1}$

Where the two methods can differ is in the value of the variance, $Var(R_A)$. This variance is a measure of how much $R_A$ deviates on average from the mean value, $r_A$, from one random resampling to another. Observe that both formulas for $Var(R_A)$ involve $v_A$, the variance of the ratings in the column of the Rating Matrix for program A. Note that when N is even and n = N/2, the N - n in the numerator for RH(n) is n, and it cancels the n in the denominator, leaving only N - 1 in the denominator. This is to be compared to the N in the denominator for the Bootstrap method. When N, the number of raters, is large, N and N - 1 are close and the variances of the average rating, $R_A$, for the two methods are nearly the same. The factor on the right side of the formula for the RH(n) variance is known as the finite sampling correction, and it gets smaller as n increases relative to N.

In the simple example, here is what these formulas yield.

RH(1): In this case, $R_A$ takes on these three possible values with the corresponding probabilities.

Possible average ratings    0      1      2
Probabilities               1/3    1/3    1/3

The mean of this distribution is $0(1/3) + 1(1/3) + 2(1/3) = 1 = r_A$; its variance is $0^2(1/3) + 1^2(1/3) + 2^2(1/3) - 1^2 = 2/3$. Applying the formula for the variance for RH(1) from Table 5 gives $((2/3)/1)(3 - 1)/(3 - 1) = 2/3$, the same value.

RH(2): In this case, $R_A$ takes on these three possible values with the corresponding probabilities.

Possible average ratings    1/2    1      3/2
Probabilities               1/3    1/3    1/3

The mean of this distribution is $(1/2)(1/3) + 1(1/3) + (3/2)(1/3) = 1 = r_A$, as before.

Its variance is $(1/2)^2(1/3) + (1)^2(1/3) + (3/2)^2(1/3) - 1^2 = ((1/4) + 1 + (9/4))/3 - 1 = (14/4)/3 - (12/12) = 2/12 = 1/6$. Applying the formula for the variance for RH(2) from Table 5 gives $((2/3)/2)(3 - 2)/(3 - 1) = (1/3)(1/2) = 1/6$, the same value.

Boot: In this case, $R_A$ takes on seven possible values with the corresponding probabilities.

Possible average ratings    0       1/3     2/3     1       4/3     5/3     2
Probabilities               1/27    3/27    6/27    7/27    6/27    3/27    1/27

These probabilities are found by counting the Bootstrap samples in Table 4 that yield each possible value. This is a larger set of possible average ratings for A than either one of the RH methods gives, owing to the richer set of samples available under the Boot method. The mean of this distribution is $(0)(1/27) + (1/3)(3/27) + (2/3)(6/27) + (1)(7/27) + (4/3)(6/27) + (5/3)(3/27) + (2)(1/27) = 1 = r_A$, as it is for the other two methods. The variance is $(0)^2(1/27) + (1/3)^2(3/27) + (2/3)^2(6/27) + (1)^2(7/27) + (4/3)^2(6/27) + (5/3)^2(3/27) + (2)^2(1/27) - 1^2 = (1/9)(1/27)(3 + 24 + 63 + 96 + 75 + 36) - 1 = 297/((9)(27)) - 1 = (11/9) - (9/9) = 2/9$. Applying the formula for the variance for Boot from Table 5 gives $(2/3)/3 = 2/9$, the same value.

Summary of results

The mean and variance calculations as applied to this simple example illustrate the following:

(a) The RH and Boot methods are similar only when N, the number of raters rating a program, is large enough to make the difference between N and N - 1 negligible.
(b) The set of possible samples from which resampling takes place differs for the two methods; the one for the Boot method is much larger in general.
(c) Both methods are unbiased for the mean rating of a program, but they differ in their variances. When N is even, the variance of Boot is smaller; when N is odd, the variance of Boot lies between that for RH(n) and RH(n + 1), where n < N/2 < n + 1. This is observed by examining the data in Table 4.
(d) The Boot method usually has a much richer set of possible ratings in its resampling distribution, and fewer ties.
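The variance formulas of Table 5 can be compared with exact enumeration for program A of the example. The following is a small sketch with illustrative function names; it also evaluates the ratio of the RH(N/2) variance to the Boot variance for an even N, which by the formulas should equal N/(N - 1). The ten-rater ratings used for that last check are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

A = [0, 2, 1]   # ratings of program A by raters 1, 2, 3 (Table 1)

def pop_variance(x):
    """Average of the squares minus the square of the mean, as defined in the text."""
    n = len(x)
    m = Fraction(sum(x), n)
    return Fraction(sum(v * v for v in x), n) - m * m

def var_boot_formula(x):
    """Table 5, Bootstrap: v_A / N."""
    return pop_variance(x) / len(x)

def var_rh_formula(x, n):
    """Table 5, RH(n): (v_A / n) (N - n) / (N - 1)."""
    N = len(x)
    return pop_variance(x) / n * Fraction(N - n, N - 1)

def var_by_enumeration(x, samples):
    """Variance of the resampled mean over a list of equally likely index samples."""
    means = [Fraction(sum(x[i] for i in s), len(s)) for s in samples]
    return pop_variance(means)

idx = range(len(A))
print(var_rh_formula(A, 1), var_by_enumeration(A, list(combinations(idx, 1))))  # 2/3 2/3
print(var_rh_formula(A, 2), var_by_enumeration(A, list(combinations(idx, 2))))  # 1/6 1/6
print(var_boot_formula(A), var_by_enumeration(A, list(product(idx, repeat=3)))) # 2/9 2/9

# Hypothetical ratings from N = 10 raters: the ratio RH(N/2) / Boot equals N / (N - 1).
x = list(range(10))
print(var_rh_formula(x, 5) / var_boot_formula(x))  # 10/9
```

For N = 3 the Boot variance 2/9 lies between the RH(2) value 1/6 and the RH(1) value 2/3, which is the ordering described in item (c).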

References

Wolter, K. M. 1985. Introduction to Variance Estimation. New York: Springer-Verlag.
Efron, B. 1982. The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B., and Tibshirani, R. J. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall.

Correlates of Reputation Analysis

The reputational quality of a program is a purely subjective measure; however, it is related to quantitative measures in the sense that quality judgments could be made on the basis of information about programs, such as the scholarly work of the faculty and the honors awarded to the faculty for that scholarship. Therefore, it may be possible to relate or to predict quality rankings for programs using quantitative measures. It is clear that predicted quality rankings would also be subjective and that the accuracy of such predictions may change over time.

One way to construct such a relationship is to do a least squares multilinear regression. The dependent variable in the regression analysis is represented by a set of average ratings, $r_1, r_2, \ldots, r_N$, for N programs in a particular field. The predictors or independent variables would be a set of quantitative or coded program characteristics that are represented by a vector, $x_n$, for program n. The analysis would construct a function f(x) which provides a predicted average rating $f(x_n)$ for program n. In this case the relation between $r_n$ and $f(x_n)$ would be

$$ r_n = f(x_n) + e_n = a_1 x_{1,n} + a_2 x_{2,n} + \cdots + a_m x_{m,n} + a_{m+1} + e_n \qquad (1) $$

where $x_{1,n}, x_{2,n}, \ldots, x_{m,n}$ represent the m quantitative or coded characteristics for program n in the field, and $e_n$ is the residual, or the amount by which the predicted average rating differs from the actual average rating for that program. If the prediction is "good," then the residuals are relatively small. The coefficients $a_i$ are determined by minimizing the sum of the squares of the differences $r_n - f(x_n)$.

While a single regression equation is generated using quantitative data and the reputational score, the selected raters of the program provide a certain amount of variability.
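Equation (1) can be illustrated with a small least squares fit. The sketch below solves the normal equations directly in plain Python; it is not the forward stepwise procedure of a statistical package, it assumes the predictors are not collinear, and the data are hypothetical values chosen only for illustration.

```python
def least_squares(X, r):
    """Return [a_1, ..., a_m, a_{m+1}] minimizing the sum of squared residuals,
    where a_{m+1} is the constant term of equation (1).
    Solves the normal equations (Z'Z) a = Z'r by Gauss-Jordan elimination."""
    Z = [list(map(float, row)) + [1.0] for row in X]   # append the constant column
    k = len(Z[0])
    M = [[sum(z[i] * z[j] for z in Z) for j in range(k)]
         + [sum(z[i] * y for z, y in zip(Z, r))]
         for i in range(k)]                            # augmented matrix [Z'Z | Z'r]
    for c in range(k):
        pivot = max(range(c, k), key=lambda i: abs(M[i][c]))   # partial pivoting
        M[c], M[pivot] = M[pivot], M[c]
        for i in range(k):
            if i != c:
                factor = M[i][c] / M[c][c]
                M[i] = [u - factor * v for u, v in zip(M[i], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(a, x):
    """f(x) = a_1 x_1 + ... + a_m x_m + a_{m+1}."""
    return sum(c * v for c, v in zip(a, x)) + a[-1]

# Hypothetical data: 5 programs, 2 characteristics each, and their average ratings.
X = [[1, 10], [2, 8], [3, 13], [4, 9], [5, 12]]
r = [1.6, 2.1, 2.3, 3.2, 3.2]

a = least_squares(X, r)
residuals = [rn - predict(a, xn) for rn, xn in zip(r, X)]
print(a)
print(residuals)
```

Because the model includes a constant term, the residuals of the fitted equation sum to zero up to rounding, which gives a quick sanity check on the solution.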
This variability can be shown in the following manner: Associated with each coefficient $a_i$ is a 95%-confidence interval $[L_i, U_i]$, and by randomly selecting values for the coefficients within their confidence intervals, a predicted average rating $\hat{r}_n$ can be generated for program n. A measure of how close the set of $\hat{r}_n$ ratings is to the $r_n$ ratings can be calculated by

$$ \| r - \hat{r} \|^2 \le p\, s^2 F \qquad (2) $$

where $r = (r_1, r_2, \ldots, r_N)$, $\hat{r} = (\hat{r}_1, \hat{r}_2, \ldots, \hat{r}_N)$, and $\| \cdot \|^2$ denotes the sum of squares of the components of the difference vector. The bound on the inequality, $p\, s^2 F$, is a constant that is derived from the regression analysis: p = m, the number of nonconstant terms in the regression equation; $s^2$ is the "mean square for error" given in the output of a regression program; and F is the 95% cutoff point for the F-distribution with p and n - p degrees of freedom. By repeating the random selection of coefficients many times, a collection of coefficient sets can be determined that satisfies inequality (2), and the upper and lower bounds of this collection define an interval $[L'_i, U'_i]$ for each coefficient. For coefficients in these intervals a range of predicted ratings can be generated.

From the practical point of view of a program trying to estimate its quality a few years after a reputational survey is conducted, it could use a linear regression equation with coefficients in $[L'_i, U'_i]$ to generate a new range of ratings based on current program data, or, if data for all programs in the field were available, a new interquartile ranking of programs could be obtained. The following is an example where this method is applied to the 1995 ratings of programs in Mathematics.

Mathematics

Using the STATA statistical package and applying a forward stepwise, least-squares linear regression on a large number of quantitative variables that characterize publications, citations, faculty size and rank, research grant support, number of doctorates by gender and race/ethnicity, graduate students by gender, graduate student support, and time to degree, the following seven variables were identified as being the most significant:

(ginipub) Gini Coefficient for Program Publications, 1988-92: The Gini coefficient is an indicator of the concentration of publications on a small number of the program faculty during the period 1988-92.
(phds) Total Number of Doctorates FY 86-92
(perfull) Percentage of Full Professors Participating in the Program
(persupp) Percentage of Program Faculty with Research Support (1986-92)
(perfpub) Percentage of Program Faculty Publishing in the Period 1988-1992
(ratiocit) Ratio of the Total Number of Program Citations in the Period 1988-1992 to the Number of Program Faculty
(myd) Median Time Lapse from Entering Graduate School to Receipt of Ph.D. in Years

Results of the regression analysis are shown below. About 83% of the variation is explained by these variables ($R^2 = 0.8304$).
Source   |         SS     df          MS          Number of obs =     139
---------+--------------------------------        F(7, 131)     =   91.60
Model    |  112.36003      7  16.0514329          Prob > F      =  0.0000
Residual |  22.954789    131  .175227397          R-squared     =  0.8304
---------+--------------------------------        Adj R-squared =  0.8213
Total    | 135.314819    138   .98054217          Root MSE      =   .4186

quality  |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
phds     |  .3489197   .0544665     6.41    0.000    .2411721    .4566674
perfull  |   .008572   .0027864     3.08    0.003    .0030598    .0140842
persupp  |  .0183162   .0025146     7.28    0.000    .0133418    .0232906
perfpub  | -.0150464   .0035235    -4.27    0.000   -.0220167   -.0080762
ratiocit |  .0258671   .0077198     3.35    0.001    .0105955    .0411387
myd      | -.7737551   .1995707    -3.88    0.000   -1.168553   -.3789567
ginipub  | -.0294944   .0044222    -6.67    0.000   -.0382425   -.0207462
_cons    |  3.070145   .3625634     8.47    0.000    2.352908    3.787382

The resulting predictor equation is:

f(x) = 3.07 + 0.349(phds) + 0.009(perfull) + 0.018(persupp) - 0.015(perfpub) + 0.026(ratiocit) - 0.774(myd) - 0.029(ginipub)

It is noted that the Root Mean Square Error (RMSE) from the regression is 0.4186, and the variation in scores from the 1995 confidence interval calculation has an RMSE of 0.2277. The following is a scatter plot of the actual 1995 ratings and the predicted ratings.

[Figure: Plot of the Predicted Faculty Quality Score Against the Actual 1995 Score for Programs in Mathematics. Horizontal axis: 1995 Score.]

The 95%-confidence interval for each of the variables used in the regression can now be used to find a new estimate for the quality score. As described above, values for the coefficients in the regression equation are randomly selected in the intervals and tested to see if that set of coefficients satisfies the relation $\| r - \hat{r} \|^2 \le p\, s^2 F$. For the Mathematics data the bound is $p\, s^2 F = (7)(0.4186)^2(2.09) = 2.563556$. For this example 3,000 random selections were made in the coefficient intervals, and 220 coefficient sets satisfied the inequality. The corresponding maximum and minimum values of each coefficient are:

Coefficient    Max          Min
phds           0.35469      0.34314
persupp        0.018583     0.018049
ginipub        -0.029026    -0.029964
myd            -0.7526      -0.79495
perfpub        -0.014673    -0.015421
ratiocit       0.026686     0.025047
perfull        0.0088674    0.0082761
constant       3.10858      3.03164

Using the values in the above table, the maximum and minimum predicted quality scores can be calculated, and the scores for Mathematics programs are displayed in the table below. As described earlier, these maximum and minimum coefficient values could be used to construct new quality scores by randomly selecting the coefficients in the regression equation between the corresponding maximum and minimum values. If this is done repeatedly, a collection of quality scores is obtained for each program, and the interquartile range of this collection can be generated. This was done 100 times, and the results are given as the Predicted Ranks in the table, alongside the Bootstrap rankings.
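The two-stage procedure just described (accept coefficient sets that satisfy the bound, take the envelope of the accepted sets, then resample inside that envelope to obtain interquartile ranks) can be sketched as follows. Several things here are assumptions made for illustration and are not taken from the source: the program data `X`, the point estimates `coef_hat`, the intervals, and the bound are hypothetical stand-ins; candidate predictions are compared with the ratings produced by the point estimates; coefficients are drawn uniformly; and a higher predicted score is treated as a better rank.

```python
import random
from statistics import quantiles

def predict_all(X, coef):
    """Predicted ratings for all programs; coef = [a_1, ..., a_m, constant]."""
    return [sum(c * v for c, v in zip(coef, row)) + coef[-1] for row in X]

def accepted_envelope(X, coef_hat, intervals, bound, trials, seed=0):
    """Draw coefficients uniformly inside their intervals, keep the sets whose
    predictions satisfy the sum-of-squares bound, and return the per-coefficient
    (min, max) envelope together with the number of accepted sets."""
    rng = random.Random(seed)
    reference = predict_all(X, coef_hat)
    accepted = []
    for _ in range(trials):
        coef = [rng.uniform(lo, hi) for lo, hi in intervals]
        candidate = predict_all(X, coef)
        if sum((a - b) ** 2 for a, b in zip(reference, candidate)) <= bound:
            accepted.append(coef)
    if not accepted:
        raise ValueError("no coefficient set satisfied the bound")
    envelope = [(min(c[i] for c in accepted), max(c[i] for c in accepted))
                for i in range(len(coef_hat))]
    return envelope, len(accepted)

def predicted_rank_quartiles(X, envelope, reps, seed=0):
    """Resample coefficients inside the envelope and return each program's
    first and third quartile of predicted rank (rank 1 = highest score)."""
    rng = random.Random(seed)
    draws = [[] for _ in X]
    for _ in range(reps):
        coef = [rng.uniform(lo, hi) for lo, hi in envelope]
        scores = predict_all(X, coef)
        order = sorted(range(len(X)), key=lambda n: -scores[n])
        for position, n in enumerate(order, start=1):
            draws[n].append(position)
    result = []
    for d in draws:
        q1, _, q3 = quantiles(d, n=4, method="inclusive")
        result.append((q1, q3))
    return result

# Hypothetical inputs: 5 programs, 2 characteristics, point estimates and intervals.
X = [[1.0, 0.2], [0.5, 0.8], [0.3, 0.4], [0.9, 0.9], [0.1, 0.5]]
coef_hat = [0.5, -0.1, 2.0]
intervals = [(0.4, 0.6), (-0.2, 0.0), (1.9, 2.1)]
bound = 0.02   # stands in for p * s**2 * F

envelope, n_accepted = accepted_envelope(X, coef_hat, intervals, bound, trials=3000)
print(n_accepted, envelope)
print(predicted_rank_quartiles(X, envelope, reps=100))
```

In a real application the intervals, the mean square for error, and the F cutoff would come from the regression output rather than being typed in by hand.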
                                      Quality Score       Predicted Ranks      Bootstrap Ranks
                                                          1st       3rd        1st       3rd
Institution                         Maximum   Minimum   Quartile  Quartile   Quartile  Quartile
Dartmouth College                      2.73      2.51       73        76         53        62
Boston University                      2.70      2.42       77        80         48        52
Brandeis University                    3.17      2.88       49        51         32        36
Harvard University                     4.41      4.09        8         9          2         4
Massachusetts Inst of Technology       5.27      4.93        2         2          3         4
U of Massachusetts at Amherst          3.40      3.11       38        40         54        60
Northeastern University                2.41      2.13       99       103         70        80
Brown University                       4.60      4.31        5         6         26        29
Brown University-Applied Math          4.59      4.26        6         6         14        17
University of Rhode Island             1.69      1.40      128       129        122       125
University of Connecticut              2.66      2.39       79        83         98       102
Wesleyan University                    2.31      2.09      104       107        101       110
Yale University                        3.38      3.13       38        40          7         8
Adelphi University                     1.07      0.82      138       138        130       133
CUNY - Grad Sch & Univ Center          3.38      3.10       40        41         30        32
Clarkson University                    2.49      2.21       90        94        109       118
Columbia University                    4.32      3.99       11        11         10        12
Cornell University                     4.81      4.46        3         4         14        16
New York University                    4.83      4.50        3         4          7         8
Polytechnic University                 2.15      1.88      112       114         98       105
Rensselaer Polytechnic Inst            3.64      3.36       27        30         48        52
University of Rochester                3.10      2.83       52        54         56        62
State Univ of New York-Albany          2.55      2.33       85        88         82        90
State Univ of New York-Binghamton      2.55      2.33       85        87         65        75
State Univ of New York-Buffalo         3.00      2.76       57        59         61        70
State Univ of New York-Stony Brook     3.60      3.31       30        32         19        22
Syracuse University                    2.42      2.18       95       100         76        84
Princeton University                   4.52      4.21        7         7          2         3
Rutgers State Univ-New Brunswick       4.06      3.77       16        18         17        20
Stevens Inst of Technology             1.73      1.48      127       127        121       128
Carnegie Mellon University             3.63      3.33       28        31         34        40

English Language and Literature

Applying the same method to the 1995 programs in English Language and Literature, a slightly different result is obtained, since programs in this field do not have the same productivity characteristics as those in Mathematics. Again, forward stepwise least squares linear regression was applied to a large number of quantitative variables, and the following were identified as being the most significant:

(nopubs2) Number of Publications During the Period 1985-1992
(perfawd) Percentage of Program Faculty with at Least One Honor or Award for the Period 1986-1992
(acadplan) Total Number of Doctorates FY 1986-1992 with Academic Employment Plans at the 4-Year College or University Level
(ginicit) Gini Coefficient for Program Citations, 1988-1992: The Gini coefficient is an indicator of the concentration of citations on a small number of the program faculty during the period 1988-1992.
(nocits1) Number of Citations During the Period 1981-1992
(fullprof) Percentage of Full Professors Participating in the Program
(empplan) Total Number of Doctorates FY 1986-1992 with Employment Plans

None of the variables identified in the Mathematics regression are present in this regression analysis.

Results of this regression analysis are shown below. About 81% of the variation is explained by these variables ($R^2 = 0.8106$).

Source   |         SS     df          MS          Number of obs =     117
---------+--------------------------------        F(7, 109)     =   66.65
Model    |  83.985691      7  11.9979559          Prob > F      =  0.0000
Residual | 19.6227839    109   .18002554          R-squared     =  0.8106
---------+--------------------------------        Adj R-squared =  0.7984
Total    | 103.608475    116  .893176507          Root MSE      =  .42429

q93a     |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
nopubs2  |  .1202936   .1017753     1.18    0.240   -.0814218     .322009
perfawd  |  .0326877   .0041423     7.89    0.000    .0244777    .0408977
acadplan |  .7961931   .2416467     3.29    0.001    .3172573    1.275129
ginicit  | -.0007486   .0001839    -4.07    0.000    -.001113   -.0003842
nocits1  |  .0827859   .0234272     3.53    0.001     .036354    .1292178
fullprof |  .2942413   .1096454     2.68    0.008    .0769276     .511555
empplan  |  -.599897   .2698761    -2.22    0.028   -1.134783   -.0650113
_cons    |  1.955276   .1533968    12.75    0.000    1.651249    2.259304

The resulting predictor equation is:

f(x) = 1.955 + 0.12(nopubs2) + 0.033(perfawd) + 0.796(acadplan) - 0.001(ginicit) + 0.083(nocits1) + 0.294(fullprof) - 0.6(empplan)

The following is a scatter plot of the Random Halves draw from the 1995 rankings and the predicted ranking for that draw. For programs in English Language and Literature, the Root Mean Square Error (RMSE) from the regression is 0.42429, and the variation in scores from the 1995 confidence interval calculation has an RMSE of 0.2544.

[Figure: Plot of the Predicted Faculty Quality Score Against the Actual 1995 Score for Programs in English Language and Literature. Horizontal axis: 1995 Score.]

As in Mathematics, the 95%-confidence interval for each of the variables used in the regression can be used to determine a new estimate for the quality score. In this case, the bound is $p\, s^2 F = (7)(0.42429)^2(2.18) = 2.747136$. For this example 3,000 random selections were also made in the coefficient intervals, and 242 coefficient sets satisfied the inequality. The corresponding maximum and minimum values of each coefficient are:

Coefficient    Max          Min
nopubs2        0.13384      0.10684
perfawd        0.033239     0.03214
acadplan       0.82835      0.76425
ginicit        -0.00072     -0.00077
nocits1        0.085903     0.079689
fullprof       0.30883      0.27975
empplan        -0.56399     -0.63557
constant       1.97569      1.935

As in the example used with Mathematics programs, the maximum and minimum values for the coefficients can be used to calculate the maximum and minimum Predicted Quality Scores for the programs in English Language and Literature. These scores are displayed in the table below. Repeating the exercise described for Mathematics, of randomly selecting coefficient values in the maximum-minimum intervals a large number of times, an interquartile range can be generated for programs in English Language and Literature. This was again done 100 times, and the results are given as the Predicted Ranks in the table, alongside the Random Halves rankings.
                                      Quality Score       Predicted Ranks      Random Halves Ranks
                                                          1st       3rd        1st       3rd
Institution                         Maximum   Minimum   Quartile  Quartile   Quartile  Quartile
University of New Hampshire            2.74      2.56       91        93         70        77
Boston College                         2.57      2.42       96        98         59        64
Boston University                      3.80      3.59       20        21         38        42
Brandeis University                    3.63      3.40       19        21         44        55
Harvard University                     5.55      5.05        1         1          2         3
U of Massachusetts at Amherst          3.84      3.51       30        34         38        43
Tufts University                       2.35      2.22      108       110         67        74
Brown University                       4.21      3.78       15        16         13        15
University of Rhode Island             2.39      2.22      113       115         94       113
University of Connecticut              3.26      3.05       53        57         79        87
Yale University                        5.07      4.52        5         6          2         3
CUNY - Grad Sch & Univ Center          3.50      3.21       42        48         18        19
Columbia University                    4.90      4.24        9        10          7         9
Cornell University                     4.71      4.16       13        13          6         8
St John's University                   1.93      1.86      127       127        119       122
Fordham University                     2.38      2.23      103       106        104       112
New York University                    3.59      3.25       26        28         18        20
Drew University                        2.30      2.15      116       119        123       126
University of Rochester                3.30      3.02       30        33         44        48
State Univ of New York-Binghamton      3.01      2.72       62        64         65        69
State Univ of New York-Buffalo         3.65      3.16       30        37         25        27
State U of New York-Stony Brook        3.17      2.77       48        55         46        52
Syracuse University                    2.53      2.38       95        98         71        76
Indiana Univ of Pennsylvania           2.19      1.93      124       126        122       124
Princeton University                   4.82      4.39        5         6         12        14
Rutgers State Univ-New Brunswick       3.96      3.62       22        23         16        18
Carnegie Mellon University             3.17      3.01       33        35         52        54