Sunday, December 01, 2013

Height prediction from common DNA variants

Still early days but it's clear where this is heading. Note the design using an outlier (case) and normal (control) group.

See also Five years of GWAS discovery and Recent human evolution: European height.
Common DNA variants predict tall stature in Europeans
DOI 10.1007/s00439-013-1394-0 (Journal: Human Genetics)

Abstract Genomic prediction of the extreme forms of adult body height or stature is of practical relevance in several areas such as pediatric endocrinology and forensic investigations. Here, we examine 770 extremely tall cases and 9,591 normal height controls in a population-based Dutch European sample to evaluate the capability of known height-associated DNA variants in predicting tall stature. Among the 180 normal height-associated single nucleotide polymorphisms (SNPs) previously reported by the Genetic Investigation of ANthropometric Traits (GIANT) genome-wide association study on normal stature, in our data 166 (92.2 %) showed directionally consistent effects and 75 (41.7 %) showed nominally significant association with tall stature, indicating that the 180 GIANT SNPs are informative for tall stature in our Dutch sample. A prediction analysis based on the weighted allele sums method demonstrated a substantially improved potential for predicting tall stature (AUC = 0.75; 95 % CI 0.72–0.79) compared to a previous attempt using 54 height-associated SNPs (AUC = 0.65). The achieved accuracy is approaching practical relevance such as in pediatrics and forensics. Furthermore, a reanalysis of all SNPs at the 180 GIANT loci in our data identified novel secondary association signals for extreme tall stature at TGFB2 (P = 1.8 × 10^−13) and PCSK5 (P = 7.8 × 10^−11) suggesting the existence of allelic heterogeneity and underlining the importance of fine analysis of already discovered loci. Extrapolating from our results suggests that the genomic prediction of at least the extreme forms of common complex traits in humans including common diseases are likely to be informative if large numbers of trait-associated common DNA variants are available. [ Italics mine ]
AUC = area under the ROC curve, defined as in the figure below. An ideal predictor has AUC = 1; a predictor no better than chance has AUC = 0.5.
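
To make the AUC concrete, here is a minimal Python sketch of a weighted allele sum score and the AUC computed from it. This is not the paper's pipeline: the weights, genotypes, and function names are all made up for illustration, and the AUC is computed the simple pairwise way, as the fraction of (case, control) pairs in which the case has the higher score.

```python
def weighted_allele_score(genotype, weights):
    """Sum of effect-allele counts (0, 1 or 2 per SNP), each times its weight."""
    return sum(g * w for g, w in zip(genotype, weights))


def auc(case_scores, control_scores):
    """Fraction of (case, control) pairs where the case scores higher.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-SNP weights (e.g. published effect sizes) for 3 SNPs.
weights = [5.0, 3.0, 2.0]

# Hypothetical genotypes: effect-allele counts per SNP for each person.
tall_cases = [[2, 1, 2], [1, 2, 1], [2, 2, 0]]
controls = [[0, 1, 1], [1, 0, 2], [2, 1, 1], [0, 0, 1]]

case_scores = [weighted_allele_score(g, weights) for g in tall_cases]
control_scores = [weighted_allele_score(g, weights) for g in controls]

toy_auc = auc(case_scores, control_scores)
print(f"AUC on toy data: {toy_auc:.3f}")
```

On this toy data, 11 of the 12 case/control pairs are ordered correctly, so the AUC is 11/12. The pairwise loop is quadratic in sample size; for a cohort the size of the one in the paper a rank-based formula would be the practical choice.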

4 comments:

Endre Bakken Stovner said...

(I hope this is apropos to a discussion of complex trait genomics.)


1) In your fascinating talk about the genetic architecture of intelligence you mention that the genetic distance between short people is larger than that between tall people. Do you have a citation? I tried plodding through the first papers that showed up on pubmed for various combinations of height/stature/SNP/GWAS etc., but to no avail. I did however find a paper that claimed short people had more and longer CNVs, but from your example with hamming distance I assume you were talking about SNP differences.


2) I'm wondering about some of the assumptions made to arrive at 40 SNPs ~= one SD in IQ. 2.1) For this calculation, did you make an assumption about x% of SNPs being IQ relevant? 2.2) Also, it intuitively seems like 40 SNPs shouldn't make much difference, since I imagine the average number of alternative allele SNPs in a person should be very high.


Thanks.

steve hsu said...

1) We have not published this result. The effect was not statistically significant when we tried to replicate it on a couple of other populations, so the 1 SD ~ 40 SNPs is only a rough *upper bound*. That is, if the # of SNP changes per SD change in the phenotype were larger, then we would have detected it in the other sample. It could be 40 or 30 or 50 but 100 is probably ruled out.


2) If there are N causal variants then it takes about sqrt(N) SNP changes to get 1 SD. So if N ~ 1E04 then 1 SD ~ 100 would be about right (note avg MAF of causal variants also affects this relationship). We didn't make assumptions about N. We *measured* the # of SNP changes per SD difference in phenotype by averaging SNP distance over all pairs in a population of order 5-10k individuals for whom both phenotype and genotype were known.
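
A rough sketch of where the sqrt(N) scaling comes from, under simplifying assumptions that are not stated in the comment above: N independent causal variants, all with the same additive effect and the same allele frequency `maf`, in Hardy-Weinberg equilibrium, and counting only the genetic part of the phenotype. Under those assumptions each variant contributes variance 2 * maf * (1 - maf) in units of effect size squared, so one SD of the genetic score corresponds to sqrt(2 * maf * (1 - maf) * N) allele changes. The function name is hypothetical.

```python
import math


def allele_changes_per_sd(n_variants, maf):
    """Allele flips needed to shift the genetic score by one SD.

    Assumes equal additive effects and a common allele frequency `maf`
    across independent variants (a toy model, not a fitted result).
    """
    variance_in_effect_units = 2.0 * maf * (1.0 - maf) * n_variants
    return math.sqrt(variance_in_effect_units)


for n in (1_000, 10_000, 40_000):
    for maf in (0.1, 0.5):
        changes = allele_changes_per_sd(n, maf)
        print(f"N = {n:>6}, MAF = {maf}: about {changes:.0f} changes per SD")
```

In this toy model the answer grows as sqrt(N) with an order-one prefactor set by the allele frequency, which is one way to read the remark that average MAF affects the relationship.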

Diogenes said...

i probably don't understand this. the height iq analogy suggests what? should or shouldn't something have been found in the bgi group?

Richard Seiter said...

That looks like a pretty good predictor from the ROC curve. Eyeballing the curve it looks like they see about an 85% sensitivity at an 85% specificity. From the numbers they give that would lead to: 654.5 TP, 1438.65 FP, 8152.35 TN, 115.5 FN (FPs are clearly the bigger problem so might be better to choose a different tradeoff on the curve, not surprising given prevalence is only 7.43%). This gives a positive predictive value of 0.31 and a negative predictive value of 0.99. That looks less impressive (though would need exact TP/FP rates and more careful choice of threshold to say much). By any chance did they give a plot of the PPV/NPV curve?
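
The arithmetic in the paragraph above can be reproduced with a few lines of Python. The case and control counts come from the abstract; the 85% sensitivity and specificity are the eyeballed values from the comment, not figures reported in the paper, and the function name is just for illustration.

```python
def confusion_counts(n_cases, n_controls, sensitivity, specificity):
    """Expected TP, FP, TN, FN counts at a given operating point."""
    tp = sensitivity * n_cases
    fn = n_cases - tp
    tn = specificity * n_controls
    fp = n_controls - tn
    return tp, fp, tn, fn


n_cases, n_controls = 770, 9591  # from the abstract
tp, fp, tn, fn = confusion_counts(n_cases, n_controls, 0.85, 0.85)

prevalence = n_cases / (n_cases + n_controls)
ppv = tp / (tp + fp)  # positive predictive value
npv = tn / (tn + fn)  # negative predictive value

print(f"TP = {tp:.2f}, FP = {fp:.2f}, TN = {tn:.2f}, FN = {fn:.2f}")
print(f"prevalence = {prevalence:.2%}, PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Because the sample is only about 7% cases, even a 15% false positive rate among controls produces roughly twice as many false positives as true positives, which is why the PPV is low despite the reasonable-looking ROC point.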


I don't know if binary classification is the best way to think about this. It seems more useful to do something like a scatter plot of the predicted value (probability of tall) versus actual height (drawing lines at a threshold value and height would nicely divide the scatter plot into TP, FP, TN, FN quadrants). Do the incorrect predictions cluster around the not/tall threshold or are they more erratic? (if the former this could be a surprisingly good predictor IMHO)


The paper is behind a paywall so I haven't been able to examine it. Any further comments on the results? Any thoughts about implications for the BGI study?


Regarding the N causal variants and sqrt(N) SNP changes for 1 SD, is there an assumption that the variants are of equal effect size? If so, any idea how removing that assumption might change things? Is there any assumption about the minimum effect size to qualify as a "causal variant"? (it seems like that could affect N dramatically)
