Journal of Pediatric Psychology, Vol. 27, No. 1, 2002, pp. 59-66
© 2002 Society of Pediatric Psychology
Contrasts and Correlations in Theory Assessment
1 Temple University, 2 University of California, Riverside
All correspondence should be sent to Ralph L. Rosnow, 177 Biddulph Road, Radnor, Pennsylvania 19087-4506. E-mail: rosnow{at}temple.edu .
| Abstract |
|---|
Objective: To describe a systematic quantitative approach to assessing the predictions made by competing theories using contrasts and correlational indices of effect sizes.
Methods: We illustrate the use of the contrast F and t to compare and combine predictions when the raw data are continuous scores, and z contrasts when working with frequencies in 2 x k tables of counts.
Results: The traditional effect size correlation indicates the magnitude of the effect on individual scores of participants' assignment to particular conditions. The contrast correlation obtained from the contrast F or t is, in some cases, the easiest way of estimating the effect size correlation in designs using more than two groups. The alerting correlation is another way of appraising the predictive power of a contrast and can be used to compute the contrast F from published results when all we have are condition means and the omnibus F from an overall analysis of variance. Omnibus Fs, those with more than 1 df in the numerator, are rarely useful in data analytic work since they address unfocused questions, yielding only vague answers.
Conclusions: Asking focused questions using contrasts increases the clarity of our questions and the clarity and statistical power of our answers.
Key words: correlation; contrast correlation; theory assessment; effect size.
| Introduction |
|---|
How can we tell good from not-so-good theories in behavioral science? The traditional answer is that good theoretical propositions are grounded in credible ideas and facts, are stated in a precise and focused way, and are empirically falsifiable. However, historians and philosophers of science recognize that the very assumption of critical falsifying tests is an arguable proposition. In behavioral science, for example, McGuire (1986
The purpose of this article is to illustrate the use of contrasts and three correlational indices of effect sizes in this evaluation and adjudication process. Contrasts are focused statistical procedures for asking precise questions of two or more groups. We present two illustrative cases, one in which the metrics of measurement are unit scores on an underlying continuum and the other, frequencies in a 2 × k table of counts. In each case, two competing propositions are to be addressed. We describe how to evaluate the alternative predictions by t, F, or z contrasts and a family of conceptually related indices, which we have called the effect size correlation (symbolized as $r_{\text{effect size}}$), the alerting correlation ($r_{\text{alerting}}$), and the contrast correlation ($r_{\text{contrast}}$). We also illustrate how to combine the alternative predictions in order to assess how the competing propositions might fare together, as indicated by $r_{\text{effect size}}$, $r_{\text{alerting}}$, and $r_{\text{contrast}}$.
First, $r_{\text{effect size}}$ reflects the magnitude of the effect on each score of the participants' assignment to particular conditions, with membership in the conditions represented by lambda ($\lambda$) weights based on predictions or theoretical hunches. The special characteristic of $r_{\text{effect size}}$ is that any disagreement between the predicted and obtained values of the means is considered to be noise or error and is added to the level of the noise or error found within conditions. Next, $r_{\text{alerting}}$ reflects the aggregate relationship between the group means and $\lambda$ weights; it takes its name from the idea that it may "alert" the researcher who routinely calculates omnibus F tests (i.e., F with numerator df > 1) not to be too hasty in embracing the null hypothesis just because the omnibus F failed to reach p = .05. The special characteristic of $r_{\text{alerting}}$ is that it regards as noise or error only the disagreement between the predicted and obtained values of the means. That is, the level of noise or error found within conditions is simply set aside. Finally, $r_{\text{contrast}}$ reflects the partial correlation between the outcome scores and $\lambda$s after removal of all the noncontrast variation. In certain specifiable instances (discussed later), $r_{\text{contrast}}$ may be the only effect size correlation we can compute from other people's data in designs with more than two groups. The special characteristic of $r_{\text{contrast}}$ is that it regards as noise or error only the level of noise or error found within conditions.
To recap these three correlational indices of effect size, we can describe all three in terms of what each regards as noise or error: (1) for $r_{\text{contrast}}$, only within-group noise contributes to error; (2) for $r_{\text{alerting}}$, only between-group noise contributes to error; and (3) for $r_{\text{effect size}}$, both within- and between-group noise contribute to error. These three indices, and other aspects of the correlational approach, are discussed more fully in Rosenthal, Rosnow, and Rubin (2000); other related discussions can be found in Rosnow and Rosenthal (1996) and Rosnow, Rosenthal, and Rubin (2000).
| Case 1: Continuous Raw Scores |
|---|
Computing t and F Contrasts
To illustrate the use of these procedures, suppose a pediatric researcher is interested in evaluating two theories, A and B, each of which implies a specific prediction about how many counseling sessions it will take to improve the psychological functioning of parents of children with serious illness. Theory A predicts a minimum of four sessions to produce any benefit and implies that fewer than four sessions will be fruitless. Theory B predicts small benefits as early as the first session, with gradual improvement continuing throughout all four sessions. To assess these competing predictions, the researcher designs a fully randomized experiment consisting of four groups, corresponding to 1, 2, 3, or 4 sessions of counseling, with 3 participants in each group. Table I shows the participants' scores (higher scores implying beneficial effects), the group means (M), the variance ($S^2$) in each group, and the number of participants (n) in the group. Table II shows the overall analysis of variance computed by the researcher and the reported omnibus F.
Regrettably, many researchers feel the need to compute an omnibus F test before looking more closely at their data, as if science were a "Simon says" game in which it was necessary to seek permission from the p value associated with some vague statistical result before addressing the question of interest. The 3-df F in Table II is too imprecise to be informative, as the omnibus F would be the same whether we are interested in the prediction implied by Theory A or Theory B. Contrasts, on the other hand, allow us to address the competing predictions in a precise way. To do so, we begin by expressing the predictions as integer lambda values that sum to zero (i.e., $\sum\lambda = 0$). Theory A anticipates no benefits prior to four sessions, but a substantial benefit after Session 4, which we express by $\lambda_A$ weights of -1, -1, -1, +3. Theory B anticipates a continuous linear increase of benefits, which we express by $\lambda_B$ weights of -3, -1, +1, +3. Incidentally, an easy way to create such weights is, first, to make a guess about the mean outcome score in each group and, second, to subtract the overall mean from each group mean to create $\lambda$s. Suppose, in the case of Theory A, we predicted group means of 0, 0, 0, 4 for sessions 1, 2, 3, 4, respectively. Subtracting the overall mean of 1 from those group means gives us $\lambda_A$ weights of -1, -1, -1, +3. For Theory B, suppose we predicted group means of 1, 2, 3, 4 for the four "dosage" levels. Now, subtracting the overall mean of 2.5 yields -1.5, -0.5, +0.5, +1.5, which we multiply by 2 to create whole number $\lambda_B$ weights of -3, -1, +1, +3.
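The weight-building recipe just described (guess the group means, subtract the overall mean, rescale to whole numbers) can be sketched in a few lines. This is only an illustration; the function name `weights_from_predicted_means` is hypothetical, and the rescaling by the least common multiple of the denominators is one convenient way, among others, to clear fractions.

```python
from fractions import Fraction
from math import lcm  # available in Python 3.9+


def weights_from_predicted_means(predicted_means):
    """Turn predicted group means into integer lambda weights that sum to zero."""
    means = [Fraction(m) for m in predicted_means]
    grand_mean = sum(means) / len(means)
    # Deviations from the overall mean always sum to zero.
    deviations = [m - grand_mean for m in means]
    # Multiply by the least common denominator to obtain whole numbers.
    scale = lcm(*(d.denominator for d in deviations))
    return [int(d * scale) for d in deviations]


lambda_a = weights_from_predicted_means([0, 0, 0, 4])  # Theory A's guessed means
lambda_b = weights_from_predicted_means([1, 2, 3, 4])  # Theory B's guessed means
print(lambda_a, lambda_b)
```

Exact fractions are used so that a deviation such as 1.5 is doubled cleanly rather than rounded.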
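A convenient contrast formula for testing each of the two competing predictions using the t statistic is as follows:

$$t_{\text{contrast}} = \frac{\sum M_i \lambda_i}{\sqrt{S^2_{\text{within}} \sum \dfrac{\lambda_i^2}{n_i}}} \tag{1}$$

where $M_i$ = group mean, $n_i$ = number of participants in the group, $S^2_{\text{within}}$ = pooled within-group variance (the error term), and $\lambda_i$ = contrast weight assigned to group.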
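Substituting in equation 1 to assess Theory A's prediction using $\lambda_A$ weights of -1, -1, -1, +3, and $S^2_{\text{within}}$ = 2.5 from Table II, we find

$$t_{\text{contrast}} = \frac{\sum M_i \lambda_{A,i}}{\sqrt{2.5\left[\dfrac{(-1)^2 + (-1)^2 + (-1)^2 + (+3)^2}{3}\right]}} = 3.795,$$

which, with df = 8, has a one-tailed p = .0026. Using the $\lambda_B$ weights of -3, -1, +1, +3 (and the same means, sample sizes, and pooled error term), we find

$$t_{\text{contrast}} = \frac{\sum M_i \lambda_{B,i}}{\sqrt{2.5\left[\dfrac{(-3)^2 + (-1)^2 + (+1)^2 + (+3)^2}{3}\right]}} = 3.919,$$

which, with df = 8, has a one-tailed p = .0022.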
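As a check on the two t contrasts, the following minimal sketch assumes equation 1 has the usual form, the sum of mean-times-weight products divided by the square root of the pooled variance times the sum of squared weights over n. Table I is not available in this copy, so the group means of 2, 3, 4, 7 are an assumption: they are values that reproduce the SS between of 42 and the alerting correlations reported in the text (only their deviations from the overall mean matter for a contrast).

```python
import math


def t_contrast(means, weights, ns, s2_within):
    """t for a contrast: sum(M * lambda) / sqrt(S2_within * sum(lambda^2 / n))."""
    numerator = sum(m * w for m, w in zip(means, weights))
    error = s2_within * sum(w * w / n for w, n in zip(weights, ns))
    return numerator / math.sqrt(error)


# ASSUMPTION: means inferred from the reported summary statistics, not read from Table I.
assumed_means = [2, 3, 4, 7]
ns = [3, 3, 3, 3]   # 3 participants per group, as stated in the text
s2_within = 2.5     # pooled within-group variance reported in the text

t_a = t_contrast(assumed_means, [-1, -1, -1, 3], ns, s2_within)
t_b = t_contrast(assumed_means, [-3, -1, 1, 3], ns, s2_within)
print(round(t_a, 3), round(t_b, 3))
```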
So far, we have been working with the original raw data. However, suppose we wanted to work with someone else's published data and all we had were the reported omnibus F and the group means. We could still calculate the t or F statistic for contrasts to assess both theoretical predictions, for all we need is the squared alerting correlation ($r^2_{\text{alerting}}$). Multiplying $r^2_{\text{alerting}}$ × omnibus F × the numerator df of the omnibus F gives us the 1-df contrast F. To illustrate in the case of Theory A, correlating the group means and $\lambda_A$ weights yields $r_{\text{alerting}}$ = .9258, and thus $r^2_{\text{alerting}}$ = .8571. The contrast F(1, 8) = (.8571)(5.60)(3) = 14.399, p = .0053, which can be alternatively expressed as $t_{\text{contrast}}$(8) = 3.795, one-tailed p = .0026. For Theory B, we find $r_{\text{alerting}}$ = .9562; thus $r^2_{\text{alerting}}$ = .9143. Using the same procedure shown above, we find F(1, 8) = (.9143)(5.60)(3) = 15.360, p = .0044, and $t_{\text{contrast}}$(8) = 3.919 (one-tailed p = .0022). Another way of thinking about the squared alerting r is that it immediately tells us the proportion of $SS_{\text{between}}$ accounted for by the particular contrast weights. Here, given k = 4 groups (and, therefore, 3 df between groups), we see at once that both contrasts far exceeded the 33% of the $SS_{\text{between}}$ (i.e., 33% = the reciprocal of the df) that we might have expected from any randomly drawn contrast among these four means.
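The "published data" route can be sketched as follows: correlate the condition means with the contrast weights, square the result, and multiply by the omnibus F and its numerator df. The means 2, 3, 4, 7 are again an assumption (Table I is not available here); the omnibus F of 5.60 with 3 numerator df is taken from the text.

```python
import math


def pearson_r(x, y):
    """Plain Pearson product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def contrast_f_from_omnibus(means, weights, omnibus_f, df_between):
    """Contrast F = r_alerting^2 * omnibus F * numerator df of the omnibus F."""
    r_alerting = pearson_r(means, weights)
    return r_alerting, r_alerting ** 2 * omnibus_f * df_between


assumed_means = [2, 3, 4, 7]  # ASSUMPTION: inferred, not read from Table I
r_a, f_a = contrast_f_from_omnibus(assumed_means, [-1, -1, -1, 3], 5.60, 3)
r_b, f_b = contrast_f_from_omnibus(assumed_means, [-3, -1, 1, 3], 5.60, 3)
print(round(r_a, 4), round(f_a, 2), round(r_b, 4), round(f_b, 2))
```

The contrast F values differ from the text in the third decimal only because the text multiplies by the already rounded squared correlation.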
| Contrast and Effect Size Correlations |
|---|
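The contrast correlation ($r_{\text{contrast}}$), or partial r between the contrast weights and participants' scores after removal of all other between-group variation (i.e., removing the $SS_{\text{noncontrast}}$), can be obtained from the sums of squares by

$$r_{\text{contrast}} = \sqrt{\frac{SS_{\text{contrast}}}{SS_{\text{contrast}} + SS_{\text{within}}}} \tag{2}$$

or, equivalently, from the contrast F or t by

$$r_{\text{contrast}} = \sqrt{\frac{F_{\text{contrast}}}{F_{\text{contrast}} + df_{\text{within}}}} = \sqrt{\frac{t^2_{\text{contrast}}}{t^2_{\text{contrast}} + df_{\text{within}}}} \tag{3}$$

In the same way that we might use the contrast sums of squares and within sums of squares to obtain the contrast r, we can find the effect size r from the contrast sums of squares and total sums of squares by

$$r_{\text{effect size}} = \sqrt{\frac{SS_{\text{contrast}}}{SS_{\text{total}}}} \tag{4}$$

Here $SS_{\text{contrast}} = r^2_{\text{alerting}} \times SS_{\text{between}}$, which is (.8571)(42) = 36.0 for Theory A and (.9143)(42) = 38.4 for Theory B, with $SS_{\text{within}}$ = (2.5)(8) = 20 and $SS_{\text{total}}$ = 42 + 20 = 62. Using equation 2 in the case of Theory A yields $r_{\text{contrast}} = \sqrt{36.0/(36.0 + 20)} = .802$, as does equation 3, where we find $\sqrt{14.399/(14.399 + 8)} = .802$, a value not too much greater than that of $r_{\text{effect size}}$ (.762). For Theory B, equation 2 gives us $r_{\text{contrast}} = \sqrt{38.4/(38.4 + 20)} = .811$, as does equation 3, where $\sqrt{15.360/(15.360 + 8)} = .811$, and equation 4 gives $r_{\text{effect size}} = \sqrt{38.4/62} = .787$.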
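The sums-of-squares route to the two correlations can be sketched as below. The formulas are assumed to be the contrast SS over contrast-plus-within SS (for the contrast r) and over total SS (for the effect size r), as the surrounding text describes; the SS within of 20 is not quoted directly but follows from the reported pooled variance of 2.5 on 8 df.

```python
import math


def effect_size_correlations(r_alerting_sq, ss_between, ss_within):
    """Return (SS_contrast, r_contrast, r_effect_size) from summary sums of squares."""
    ss_contrast = r_alerting_sq * ss_between
    ss_total = ss_between + ss_within
    r_contrast = math.sqrt(ss_contrast / (ss_contrast + ss_within))
    r_effect_size = math.sqrt(ss_contrast / ss_total)
    return ss_contrast, r_contrast, r_effect_size


SS_BETWEEN = 42.0     # reported in the text
SS_WITHIN = 2.5 * 8   # derived: pooled variance times its df

ss_a, rc_a, re_a = effect_size_correlations(0.8571, SS_BETWEEN, SS_WITHIN)
ss_b, rc_b, re_b = effect_size_correlations(0.9143, SS_BETWEEN, SS_WITHIN)
print(round(rc_a, 3), round(re_a, 3), round(rc_b, 3), round(re_b, 3))
```

The effect size r for Theory A comes out at the .762 quoted in the text, which is a useful consistency check on the assumed formulas.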
Comparing Competing Contrasts
Both theories fared well, but suppose we wanted to evaluate the accuracy or predictive power of the contrast for Theory A relative to the contrast for Theory B. To do so, we compute another contrast on the difference between the weights of the two competing predictions. When contrast weights are added or subtracted, their sums and differences are influenced more by the contrast weights with larger variance than by the weights with smaller variance. Thus, to be sure that the comparison is fair (i.e., not simply reflecting the contrast with greater variance), we will standardize the $\lambda$ weights. This is done by dividing the weights of each contrast by the standard deviation ($\sigma_\lambda$) of the weights, defined as

$$\sigma_\lambda = \sqrt{\frac{\sum \lambda^2}{k}} \tag{5}$$

where the numerator ($\sum \lambda^2$) is the sum of the squared lambda weights, and the denominator (k) is the number of groups or conditions. For the contrast used to evaluate Theory A, the original $\lambda_A$ weights are -1, -1, -1, +3, and thus substitution in equation 5 yields

$$\sigma_\lambda = \sqrt{\frac{(-1)^2 + (-1)^2 + (-1)^2 + (+3)^2}{4}} = \sqrt{3} = 1.732.$$

Dividing the original $\lambda_A$ weights by $\sigma_\lambda$ = 1.732 yields new standardized (z-scored) $\lambda_A$ weights of -0.577, -0.577, -0.577, +1.732. We now do the same thing for the contrast to assess Theory B, in which the original $\lambda_B$ weights are -3, -1, +1, +3. Using equation 5 gives us

$$\sigma_\lambda = \sqrt{\frac{(-3)^2 + (-1)^2 + (+1)^2 + (+3)^2}{4}} = \sqrt{5} = 2.236.$$

Dividing the original $\lambda_B$ weights by $\sigma_\lambda$ = 2.236 gives us standardized $\lambda_B$ weights of -1.342, -0.447, +0.447, +1.342.
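The standardization step can be sketched as follows, assuming (as the text states) that the standard deviation of the weights is the square root of the sum of squared weights divided by the number of groups k. The function name `standardize` is illustrative only.

```python
import math


def standardize(weights):
    """Divide zero-sum contrast weights by sqrt(sum(lambda^2) / k)."""
    k = len(weights)
    sigma = math.sqrt(sum(w * w for w in weights) / k)
    return [w / sigma for w in weights]


z_a = standardize([-1, -1, -1, 3])
z_b = standardize([-3, -1, 1, 3])
print([round(w, 3) for w in z_a])
print([round(w, 3) for w in z_b])
```

Because the weights already sum to zero, no mean needs to be subtracted before dividing, and each standardized set has a sum of squares equal to k.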
Subtracting the z-scored $\lambda_A$ weights from the z-scored $\lambda_B$ weights gives us the precise weights we need for our difference contrast: -0.765, +0.130, +1.024, -0.390. With M, n, and $S^2$ defined as before, we now substitute in equation 1 to find $t_{\text{contrast}}$ for the difference between the two predictions. [equations not preserved in source]
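The difference contrast can be explored with a minimal sketch, under two assumptions that should be read loudly: equation 1 is taken to have the usual form (sum of mean-times-weight products over the square root of the pooled variance times the sum of squared weights over n), and the group means 2, 3, 4, 7 are inferred from the reported summary statistics because Table I is not available in this copy. The result is therefore illustrative, not a quotation of the article's value.

```python
import math


def standardize(weights):
    """Divide zero-sum contrast weights by sqrt(sum(lambda^2) / k)."""
    sigma = math.sqrt(sum(w * w for w in weights) / len(weights))
    return [w / sigma for w in weights]


def t_contrast(means, weights, ns, s2_within):
    """t for a contrast: sum(M * lambda) / sqrt(S2_within * sum(lambda^2 / n))."""
    numerator = sum(m * w for m, w in zip(means, weights))
    error = s2_within * sum(w * w / n for w, n in zip(weights, ns))
    return numerator / math.sqrt(error)


z_a = standardize([-1, -1, -1, 3])
z_b = standardize([-3, -1, 1, 3])
# Difference weights: standardized B minus standardized A.
diff_weights = [b - a for a, b in zip(z_a, z_b)]

assumed_means = [2, 3, 4, 7]  # ASSUMPTION: inferred, not read from Table I
t_diff = t_contrast(assumed_means, diff_weights, [3, 3, 3, 3], 2.5)
print([round(w, 3) for w in diff_weights], round(t_diff, 2))
```

Under these assumptions the difference t is small relative to either theory's own contrast t, which is what one would expect when both predictions fit the means about equally well. The unrounded difference weights differ from those in the text by at most .001 because the text subtracts already rounded values.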
Combining Competing Contrasts
Both theories did so well individually that we wonder how they would do together. To find out, we begin by summing the standardized weights. We recall that the z-transformed $\lambda_A$ weights are -0.577, -0.577, -0.577, +1.732; and the z-transformed $\lambda_B$ weights are -1.342, -0.447, +0.447, +1.342. Summing these values gives us combined $\lambda$s of -1.919, -1.024, -0.130, +3.074. As both theories contributed equally to the combined weights, the combined $\lambda$s should correlate equally with the weights of each theory, and indeed we find that the combined weights correlate .9420 with the $\lambda_A$ weights and with the $\lambda_B$ weights. We now use the combined weights and equation 1 to find $t_{\text{contrast}}$ for the combined prediction. [equation not preserved in source]
Routinely repeating all the other calculations we did previously, we start with the alerting correlation, which is now $r_{\text{alerting}}$ = .9990. The large size of the squared alerting correlation ($r^2_{\text{alerting}}$ = .9980) assures us that the combined prediction did exceedingly well in accounting for between-group variation. Multiplying $r^2_{\text{alerting}}$ = .9980 by $SS_{\text{between}}$ = 42 gives us $SS_{\text{contrast}}$ = 41.9, which we substitute in equation 2, with $SS_{\text{within}}$ = (2.5)(8) = 20, to find

$$r_{\text{contrast}} = \sqrt{\frac{41.9}{41.9 + 20}} = .823.$$

[two further equations not preserved in source]
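The combined-contrast figures quoted above can be checked with a short sketch. As before, the group means 2, 3, 4, 7 are an assumption (Table I is not available here), chosen because they reproduce the reported SS between of 42; the helper names are illustrative.

```python
import math


def standardize(weights):
    """Divide zero-sum contrast weights by sqrt(sum(lambda^2) / k)."""
    sigma = math.sqrt(sum(w * w for w in weights) / len(weights))
    return [w / sigma for w in weights]


def pearson_r(x, y):
    """Plain Pearson product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


lambda_a, lambda_b = [-1, -1, -1, 3], [-3, -1, 1, 3]
combined = [a + b for a, b in zip(standardize(lambda_a), standardize(lambda_b))]

assumed_means = [2, 3, 4, 7]  # ASSUMPTION: inferred, not read from Table I
r_with_a = pearson_r(combined, lambda_a)
r_with_b = pearson_r(combined, lambda_b)
r_alerting = pearson_r(assumed_means, combined)
ss_contrast = r_alerting ** 2 * 42  # SS_between = 42 as reported
print(round(r_with_a, 4), round(r_alerting, 4), round(ss_contrast, 1))
```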
| Case 2: Frequency Data |
|---|
Computing z Contrasts
For this next example, suppose the researcher were interested in the effects of four levels of medication to control hyperactivity on the proportion of children reading at grade level. Once again, there are two competing theories, now designated as X and Y. Theory X predicts that, as dosage level increases, a higher proportion of children will achieve reading at grade level. We can express this prediction in terms of $\lambda_X$ weights of -3, -1, +1, +3 (where $\sum\lambda = 0$). Theory Y, on the other hand, predicts that intermediate dosage levels will be superior to very low or very high levels, which we express using integer $\lambda_Y$ weights of -1, +1, +1, -1.
Table III shows the results of an experiment in which 200 subjects were assigned in equal numbers (n = 50) to four levels of medication, ranging from low to very high. The frequency data in rows 1 and 2 represent the number of participants at each medication level who ended up reading at grade level or below grade level. The omnibus chi-square computed on this 2 × 4 table of counts is $\chi^2$ = 4.762 (df = 3), p = .190. Row 4 transforms the row 1 frequencies into proportions (P), and rows 5 and 6 are self-explanatory. Row 7 shows the variance ($S^2$) in each column, which is the squared standard error of each proportion, obtained by dividing row 6 by the column ns (i.e., sums) in row 3, that is,

$$S_i^2 = \frac{P_i(1 - P_i)}{n_i} \tag{6}$$

The z contrast is then computed from

$$Z_{\text{contrast}} = \frac{\sum P_i \lambda_i}{\sqrt{\sum S_i^2 \lambda_i^2}} \tag{7}$$

Substituting the $\lambda_X$ weights (to assess Theory X) gives [equation not preserved in source], and substituting the $\lambda_Y$ weights (to assess Theory Y) gives [equation not preserved in source].
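To show how a z contrast on proportions might be computed, here is a minimal sketch that assumes the contrast takes the common form of the weighted sum of proportions divided by the square root of the weighted sum of their squared standard errors. Table III is not available in this copy, so the counts below (30, 35, 40, 35 of 50 reading at grade level) are hypothetical; they were chosen only because they give the omnibus chi-square of 4.762 reported in the text, and the resulting z values should not be read as the article's.

```python
import math


def z_contrast(successes, ns, weights):
    """Z = sum(P * lambda) / sqrt(sum(S2 * lambda^2)), with S2 = P(1 - P) / n."""
    props = [s / n for s, n in zip(successes, ns)]
    variances = [p * (1 - p) / n for p, n in zip(props, ns)]
    numerator = sum(p * w for p, w in zip(props, weights))
    denominator = math.sqrt(sum(v * w * w for v, w in zip(variances, weights)))
    return numerator / denominator


# HYPOTHETICAL counts, not the article's Table III.
at_grade_level = [30, 35, 40, 35]
ns = [50, 50, 50, 50]

z_x = z_contrast(at_grade_level, ns, [-3, -1, 1, 3])  # Theory X (linear)
z_y = z_contrast(at_grade_level, ns, [-1, 1, 1, -1])  # Theory Y (middle vs. extremes)
print(round(z_x, 2), round(z_y, 2))
```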
Alerting and Contrast Correlations
We again get an estimate of how well the two contrasts did by computing their alerting correlations either for our own data or in other people's data. However, we now think of the proportions (P) in row 4 as analogous to column means, redefining $r_{\text{alerting}}$ as the correlation between the proportions and $\lambda$ weights. Correlating the proportions and $\lambda_X$ weights gives us $r_{\text{alerting}}$ = .632, while correlating the same proportions with the $\lambda_Y$ weights gives us $r_{\text{alerting}}$ = .707. Squaring these alerting correlations tells us in SS terms that Theory X accounted for 40% and Theory Y, 50% of the between-condition sum of squares. In other words, there is a substantial amount of residual noncontrast variation that cannot be accounted for by either competing prediction.
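With large N (e.g., large enough that the expected frequency of each cell of the table of counts is at least 4 or 5), we can also approximate $Z_{\text{contrast}}$ from the square root of $r^2_{\text{alerting}}$ × omnibus $\chi^2$ (i.e., $\chi^2$ with df > 1), that is,

$$Z_{\text{contrast}} = \sqrt{r^2_{\text{alerting}} \times \chi^2_{\text{omnibus}}} \tag{8}$$

For Theory X, equation 8 yields $Z_{\text{contrast}} = \sqrt{(.40)(4.762)} = 1.38$, whereas Theory Y yields $Z_{\text{contrast}} = \sqrt{(.50)(4.762)} = 1.54$.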
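The large-sample approximation described in words above (the square root of the squared alerting correlation times the omnibus chi-square) can be sketched directly from the reported summary values; the function name is illustrative only.

```python
import math


def z_from_alerting(r_alerting, omnibus_chi_sq):
    """Approximate Z_contrast as sqrt(r_alerting^2 * omnibus chi-square)."""
    return math.sqrt(r_alerting ** 2 * omnibus_chi_sq)


CHI_SQ = 4.762  # omnibus chi-square (df = 3) reported in the text

z_x = z_from_alerting(0.632, CHI_SQ)  # Theory X
z_y = z_from_alerting(0.707, CHI_SQ)  # Theory Y
print(round(z_x, 2), round(z_y, 2))
```

Only the alerting correlation and the omnibus chi-square are needed, which is what makes the approximation usable on other people's published tables.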
The contrast correlation, in the case of proportions in 2 × k frequency tables, would be the partial correlation between the dummy-coded scores (e.g., grade level = 1, and below grade level = 0) and $\lambda$ weights after removing all the noncontrast variation. With large N, the contrast r can be found from

$$r_{\text{contrast}} = \frac{Z_{\text{contrast}}}{\sqrt{N}} \tag{9}$$

which gives $r_{\text{contrast}}$ for Theory X and $r_{\text{contrast}}$ for Theory Y. [values not preserved in source] Equation 9 can be used to obtain a lower limit estimate of $r_{\text{contrast}}$ from a p value reported only as "significant at .05" (or .01 or .001, etc.). Suppose a researcher reported that, using a focused statistical test and N = 370, the effect of a specified pediatric treatment was "significant at p < .05" (but gave no further details). Turning to a table of tail areas of the normal curve, we find one-tailed p = .05 has an associated z = 1.645. Equation 9 tells us that the lower limit of $r_{\text{contrast}}$ would be $1.645/\sqrt{370} = .086$.
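The lower-limit calculation can be sketched as below, assuming equation 9 has the form Z divided by the square root of N. The normal deviate is obtained from the standard library rather than a printed table; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def r_contrast_lower_limit(one_tailed_p, n_total):
    """Lower bound on r_contrast from a reported significance level: Z / sqrt(N)."""
    z = NormalDist().inv_cdf(1 - one_tailed_p)  # e.g. about 1.645 for p = .05
    return z / math.sqrt(n_total)


r_lower = r_contrast_lower_limit(0.05, 370)
print(round(r_lower, 3))
```

Because a result "significant at .05" has a Z at least this large, the computed r is a floor rather than an estimate.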
Comparing and Combining the Competing Contrasts
Neither theory did extremely well alone, but suppose we were nevertheless interested in comparing them. The contrast weights comparing these competing theories are again given by the differences between the corresponding contrast weights in z-score form. To obtain these new standardized weights, we begin by substituting in equation 5, with k = 4, and the lambda weights listed in rows 8 and 9 of Table III. For Theory X we find

$$\sigma_\lambda = \sqrt{\frac{(-3)^2 + (-1)^2 + (+1)^2 + (+3)^2}{4}} = 2.236,$$

and dividing the original $\lambda_X$ weights by $\sigma_\lambda$ = 2.236 gives us new standardized $\lambda_X$ weights of -1.342, -0.447, +0.447, and +1.342. For Theory Y we find

$$\sigma_\lambda = \sqrt{\frac{(-1)^2 + (+1)^2 + (+1)^2 + (-1)^2}{4}} = 1,$$

so the $\lambda_Y$ weights are unchanged, remaining at -1, +1, +1, -1.
Subtracting the z-scored $\lambda_X$ weights from the unchanged $\lambda_Y$ weights gives us difference weights of +0.342, +1.447, +0.553, and -2.342, which substituted in equation 7 yields $Z_{\text{contrast}}$ for the difference. [equation not preserved in source] As we expected, equation 9 reveals little difference in the predictive power of the two competing theories.
Neither of the two contrasts did all that well alone, but we wonder whether they would do better if we combined them. We can examine their combined effect by summing the z-scored $\lambda$s, which gives us new weights of -2.342, +0.553, +1.447, +0.342. The combined $\lambda$s are correlated .707 with the contrast weights of Theory X and with the contrast weights of Theory Y. (Since X and Y contributed equally, they should be correlated equally.) Substituting in equation 7 for the combined contrast yields $Z_{\text{contrast}}$ and, from it, $r_{\text{contrast}}$ [values not preserved in source], while $r_{\text{alerting}}$ was .9472. Both $r_{\text{contrast}}$ and $r_{\text{alerting}}$, therefore, were noticeably larger for the combined than for the individual theories.
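The difference and combined weights for Theories X and Y, and the .707 correlation of the combined weights with each theory's weights, can be reproduced with a short sketch. It assumes the standardization divides each set of weights by the square root of its sum of squares over k; the helper names are illustrative.

```python
import math


def standardize(weights):
    """Divide zero-sum contrast weights by sqrt(sum(lambda^2) / k)."""
    sigma = math.sqrt(sum(w * w for w in weights) / len(weights))
    return [w / sigma for w in weights]


def pearson_r(x, y):
    """Plain Pearson product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


lambda_x, lambda_y = [-3, -1, 1, 3], [-1, 1, 1, -1]
z_x, z_y = standardize(lambda_x), standardize(lambda_y)

difference = [y - x for x, y in zip(z_x, z_y)]  # Y minus X
combined = [x + y for x, y in zip(z_x, z_y)]    # X plus Y
r_x = pearson_r(combined, lambda_x)
r_y = pearson_r(combined, lambda_y)
print([round(w, 3) for w in difference], [round(w, 3) for w in combined])
print(round(r_x, 3), round(r_y, 3))
```

The two correlations are equal because the X and Y weights are uncorrelated with each other and were put on the same scale before summing.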
Conclusions
The two cases that we have described used data on two different levels of
measurement and in each case illustrated the statistical procedure that seemed
to extend most naturally to those results. However, the metric of measurement
in which a dependent variable comes to us usually makes little difference as
to allowable statistical procedures. For example, whether we think of a
variable as nominal, ordinal, interval, or ratio really depends on the
underlying construct that the variable is supposed to reflect
(Rosenthal et al., 2000).
Suppose the metric of measurement were grade levels of students in
kindergarten through eighth grade. If the underlying construct were a
categorization of the children, then "grade level" would be viewed
as a nominal variable. If the underlying construct were the highest grade
level yet attained, then grade level would be seen as ordinal. If the
underlying construct were exposure to formal educational material, then grade
level could even be considered interval or ratio (i.e., ignoring the
prekindergarten formal educational material).
Moreover, even when a variable bears the desired relation to the underlying
construct, the traditional restriction on which computations we can (or may)
perform is seldom justified, since even the traditional approach sometimes
instructs us to do things that might be contradictory. For example, one is
told that with ordinal scales, multiplication and addition are not allowed.
Also, summaries like the product-moment r are not allowed, and
instead one should use the rank-order correlation. But the rank-order
correlation is the Pearson product-moment correlation between the two sets of
ranked scores (Rosenthal & Rosnow, 1991), which themselves are presumably only ordinal (but not
necessarily if the investigators felt those ranks were interval with respect
to the relevant underlying construct). Furthermore, the computation of
r involves multiplication and addition of these ranks.
To sum up, we began by alluding to the habit of many researchers of consulting omnibus F tests that are only vaguely related to their question of interest. Johnny Weissmuller, who played Tarzan in the movies, once described his philosophy of life as "not letting go of the vine." This maxim is also sound advice for researchers who let go of predictions of interest without ever realizing it, distracted by omnibus F tests or phantom limitations of the metric of measurement. In the end, the important thing is to hang on to the prediction at least long enough to evaluate it. The contrast and correlational procedures we have described are ideal in this respect, because they encourage researchers to be precise about what it is they want to know and to provide a systematic approach for assessing, comparing, or combining alternative predictions.
| Acknowledgments |
|---|
The first author thanks Temple University for support through the Thaddeus Bolton Professorship.
Received September 1, 1999; revision received May 1, 2000; accepted June 1, 2000
| References |
|---|
McGuire, W. J. (1986). A perspectivist looks at contextualism and the future of behavioral science. In R. L. Rosnow & M. Georgoudi (Eds.), Contextualism and understanding in behavioral science: Implications for research and theory (pp. 271-301). New York: Praeger.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). New York: McGraw-Hill.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. New York: Cambridge University Press.
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1, 331-340.
Rosnow, R. L., Rosenthal, R., & Rubin, D. B. (2000). Contrasts and correlations in effect size estimation. Psychological Science, 11, 446-453.