Thursday, February 27, 2014

Correlation and Variance

In social science a correlation of R = 0.4 between two variables is typically considered a strong result. For example, both high school GPA and SAT score predict college performance with R ~ 0.4. Combining the two, one can achieve R ~ 0.5 to 0.6, depending on major. See Table 2 in my paper Data Mining the University.
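For a rough sense of how combining the two predictors lifts the correlation, here is a minimal simulation sketch. The data are synthetic, and the mutual HSGPA-SAT correlation of 0.3 is an assumed value chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed correlation structure: each predictor correlates 0.4 with college GPA,
# and the two predictors correlate 0.3 with each other (illustrative values).
cov = np.array([[1.0, 0.3, 0.4],
                [0.3, 1.0, 0.4],
                [0.4, 0.4, 1.0]])
hsgpa, sat, cgpa = rng.multivariate_normal(np.zeros(3), cov, size=n).T

print(np.corrcoef(hsgpa, cgpa)[0, 1])   # ~ 0.4
print(np.corrcoef(sat, cgpa)[0, 1])     # ~ 0.4

# Best linear combination of the two predictors
X = np.column_stack([hsgpa, sat])
beta, *_ = np.linalg.lstsq(X, cgpa, rcond=None)
print(np.corrcoef(X @ beta, cgpa)[0, 1])   # ~ 0.5

If the two predictors were more strongly correlated with each other, the combined R would be closer to 0.4; the less redundant they are, the more is gained by combining them.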

It's easy to understand why SAT and college GPA are not more strongly correlated: some students work harder than others in school, and effort level is largely independent of SAT score. (For psychometricians, Conscientiousness and Intelligence are largely uncorrelated.) Also, it is typically students in the upper half or quarter of cognitive ability, relative to the general population, who earn college degrees. If the entire range of students were enrolled in college, the SAT-GPA correlation would be higher; restricting the range of ability compresses the observed correlation (see the sketch below). Finally, there is, of course, inherent randomness in grading.
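A minimal simulation of the range-restriction effect; the full-population SAT-GPA correlation of 0.6 below is an assumed value for illustration, not an empirical estimate, and the data are synthetic.

import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Assumed full-population correlation (illustrative value)
r_full = 0.6
sat = rng.standard_normal(n)
gpa = r_full * sat + np.sqrt(1 - r_full**2) * rng.standard_normal(n)

print(np.corrcoef(sat, gpa)[0, 1])   # ~ 0.6 over the full range

# Keep only the top quarter of SAT scorers (crude model of selective enrollment)
top = sat > np.quantile(sat, 0.75)
print(np.corrcoef(sat[top], gpa[top])[0, 1])   # drops to roughly 0.35

The restricted-range correlation is substantially smaller even though the underlying relationship is unchanged.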

The figure below, from the Wikipedia entry on correlation, helps to visualize the meaning of various R values.

[Figure: scatter plots illustrating data sets with various values of the correlation coefficient R.]

I often hear complaints of the type: "R = 0.4 is negligible! It only accounts for 16% of the total variance, leaving 84% unaccounted for!" (The fraction of variance unaccounted for is 1 - R^2.) This kind of remark even finds its way into quantitative genetics and genomics: "But the alleles so far discovered only account for 20% of total heritability! OMG GWAS is a failure!"

This is a misleading complaint. Variance is the average squared deviation from the mean, so it does not even carry the same units as the quantity of interest. Variance is a convenient quantity because it is additive for uncorrelated variables, but it leads to distorted intuition about effect size: SDs are the natural unit, not SD^2!
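A quick numerical check of the additivity point, using simulated standard normal variables: variances of uncorrelated variables add directly, while SDs combine in quadrature.

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)
e = rng.standard_normal(1_000_000)

# Variances of uncorrelated variables add ...
print(np.var(x + e), np.var(x) + np.var(e))   # both ~ 2.0

# ... but SDs (the natural unit) combine in quadrature, not linearly
print(np.std(x + e), np.sqrt(np.std(x)**2 + np.std(e)**2))   # both ~ 1.41, not 2.0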

A less misleading way to think about the correlation R is as follows: given X, Y from a standardized bivariate distribution with correlation R, an increase in X leads to an expected increase in Y: dY = R dX. In other words, students with +1 SD SAT scores have, on average, roughly +0.4 SD college GPAs. Similarly, students with +1 SD college GPAs have, on average, +0.4 SD SAT scores.
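A sketch checking this rule on simulated standardized data (synthetic bivariate normal pair with R = 0.4):

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
R = 0.4

# Standardized bivariate normal pair with correlation R
x = rng.standard_normal(n)
y = R * x + np.sqrt(1 - R**2) * rng.standard_normal(n)

# The regression slope of y on x (both standardized) is just R, and symmetrically for x on y
print(np.polyfit(x, y, 1)[0])   # ~ 0.4
print(np.polyfit(y, x, 1)[0])   # ~ 0.4

# Average y among people scoring about +1 SD in x
band = np.abs(x - 1.0) < 0.05
print(y[band].mean())           # ~ 0.4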

Alternatively, if we assume that Y is the sum of (standardized) X and a noise term (the sum rescaled so that Y remains standardized), the standard deviation of the noise term is given by sqrt(1 - R^2)/R ~ 1/R for modest correlations. That is, the standard deviation of the noise is about 1/R times larger than that of the signal X. When the correlation is 1/sqrt(2) ~ 0.7, the signal and noise terms have equal SD and variance. ("Half of the variance is accounted for by the predictor X"; see for comparison the figure above with R = 0.8.)
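Tabulating the noise-to-signal SD ratio sqrt(1 - R^2)/R for a few values of R (a small sketch of the formula above):

import numpy as np

def noise_to_signal_sd(R):
    # SD of the noise term relative to the (standardized) signal X,
    # when Y = (X + noise) is rescaled to unit variance
    return np.sqrt(1 - R**2) / R

for R in (0.4, 1 / np.sqrt(2), 0.9):
    print(round(R, 3), round(noise_to_signal_sd(R), 2))

# R = 0.4    -> noise SD ~ 2.3 times the signal SD
# R = 0.707  -> noise SD equal to the signal SD
# R = 0.9    -> noise SD ~ 0.48 times the signal SD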

As another example, test-retest correlations of SAT or IQ are pretty high, R ~ 0.9 or more. What fluctuations in score does this imply? In the model above the noise SD = sqrt(1 - 0.81)/0.9 ~ 0.5, so we'd expect the test score of an individual to fluctuate by about half a population SD (i.e., ~7 points for IQ or ~50 points per SAT section). This is similar to what is observed in the SAT data of Oregon students.
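The arithmetic for the test-retest example, spelled out; the score SDs used here (15 IQ points, roughly 110 points per SAT section) are assumed round numbers for illustration.

import numpy as np

R = 0.9                              # test-retest correlation
noise_sd = np.sqrt(1 - R**2) / R     # ~ 0.48 population SD
print(noise_sd)

# Convert to test-score units (assumed SDs: 15 IQ points, ~110 points per SAT section)
print(noise_sd * 15)    # ~ 7 IQ points
print(noise_sd * 110)   # ~ 50 SAT points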

I worked this out during a boring meeting. It was partially stimulated by this article in the New Yorker about training for the SAT (if you go there, come back and read this to unfog your brain), and activist nonsense like this. Let me know if I made mistakes ...  8-)

tl;dr Go back to bed. Big people are talking.
