Physicist, Startup Founder, Blogger, Dad

Monday, October 06, 2014

Common variants and the biological and genomic architecture of human height

The latest from the GIANT collaboration. They also estimate ~10k causal variants in total, with 697 now identified at genome-wide significance. See On the genetic architecture of intelligence and other quantitative traits for related discussion.

With ~1k variants to work with, we can expect progress on the question of whether the ~1 SD group difference in height between northern and southern Europeans is due to selection. Uniformly higher SNP frequencies in the north for variants that slightly increase height would be strong evidence of selection. See Recent human evolution: european height.
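One simple way to frame the "uniformly higher frequencies" check is a sign test: for each height-associated SNP, ask whether the height-increasing allele is more common in the northern sample, then compare the count against the 50/50 split expected if there were no directional pressure. What follows is a minimal sketch under stated assumptions. The allele frequencies are hypothetical and made up for illustration (they are not GIANT data), and the test treats SNPs as independent, ignoring linkage and population structure that a real analysis would have to handle.

```python
from math import comb

def sign_test_p(k, n):
    """One-sided binomial p-value: P(X >= k) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical frequencies of the height-increasing allele: (north, south).
freqs = [(0.52, 0.48), (0.31, 0.29), (0.66, 0.67), (0.45, 0.41),
         (0.73, 0.70), (0.58, 0.55), (0.24, 0.25), (0.39, 0.36),
         (0.81, 0.78), (0.50, 0.47)]

# Count SNPs where the height-increasing allele is more common in the north.
higher_in_north = sum(1 for north, south in freqs if north > south)
p_value = sign_test_p(higher_in_north, len(freqs))
print(higher_in_north, len(freqs), p_value)
```

With hundreds of independent SNPs instead of ten, a much smaller directional excess over 50% would already be detectable by the same test.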
Defining the role of common variation in the genomic and biological architecture of adult human height (Nature Genetics doi:10.1038/ng.3097)

Using genome-wide data from 253,288 individuals, we identified 697 variants at genome-wide significance that together explained one-fifth of the heritability for adult height. By testing different numbers of variants in independent studies, we show that the most strongly associated ~2,000, ~3,700 and ~9,500 SNPs explained ~21%, ~24% and ~29% of phenotypic variance. Furthermore, all common variants together captured 60% of heritability. The 697 variants clustered in 423 loci were enriched for genes, pathways and tissue types known to be involved in growth and together implicated genes and pathways not highlighted in earlier efforts, such as signaling by fibroblast growth factors, WNT/β-catenin and chondroitin sulfate–related genes. We identified several genes and pathways not previously connected with human skeletal growth, including mTOR, osteoglycin and binding of hyaluronic acid. Our results indicate a genetic architecture for human height that is characterized by a very large but finite number (thousands) of causal variants.

From the discussion section:
It has been argued that the biological information emerging from GWAS will become less relevant as sample sizes increase because, as thousands of associated variants are discovered, the range of implicated genes and pathways will lose specificity and cover essentially the entire genome. If this were the case, then increasing sample sizes would not help to prioritize follow-up studies aimed at identifying and understanding new biology and the associated loci would blanket the entire genome. Our study provides strong evidence to the contrary: the identification of many hundred and even thousand associated variants can continue to provide biologically relevant information. In other words, the variants identified in larger sample sizes both display a stronger enrichment for pathways clearly relevant to skeletal growth and prioritize many additional new and relevant genes. Furthermore, the associated variants are often non-randomly and tightly clustered (typically separated by < 250kb), resulting in the frequent presence of multiple associated variants in a locus. The observations that genes and especially pathways are now beginning to be implicated by multiple variants suggests that the larger set of results retain biological specificity but that, at some point, a new set of associated variants will largely highlight the same genes, pathways and biological mechanisms as have already been seen.


Anonymous_IV said...

"Very large but finite" is a strange formulation, since the genome is very large but finite to begin with...

Emil Kirkegaard said...

We're on it. :)

Noah Carl said...

If you plot the percent of variance explained against the number of variants, you get this. It seems to imply that very many thousands of common variants will be needed to explain a significant portion (e.g. >50%) of the variance. Or am I mistaken?
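The extrapolation described in this comment can be sketched from the three (number of top SNPs, variance explained) points quoted in the abstract. The log-linear form below is an assumption chosen for illustration, not something the paper proposes. Fitting three points and extrapolating far outside their range is fragile, and the fit treats the ranked SNP lists as if they contained only true signals.

```python
from math import log10

# (number of top SNPs, % phenotypic variance explained), from the abstract.
points = [(2000, 21.0), (3700, 24.0), (9500, 29.0)]

xs = [log10(n) for n, _ in points]
ys = [v for _, v in points]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Ordinary least squares for: variance = intercept + slope * log10(n_snps)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

def snps_needed(target_pct):
    """SNP count at which the fitted line reaches target_pct (naive extrapolation)."""
    return 10 ** ((target_pct - intercept) / slope)

print(round(slope, 1), round(snps_needed(50.0)))
```

Under this naive fit, each tenfold increase in the number of SNPs adds roughly a dozen percentage points, so reaching 50% would take a very large number of variants selected the same way.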

steve hsu said...

Only the top 697 variants have been reliably identified. At higher p-values there are many false positives (those hits are not genome-wide significant, i.e. they do not reach p < 5E-08; only 697 variants do). If a larger sample size were available, one could find the *actual* top 10k variants, and these would probably account for the bulk of the heritability.
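The 5E-08 cutoff is conventionally motivated as a Bonferroni correction of 0.05 across roughly one million independent common-variant tests. The one-million figure is the usual rule of thumb and an assumption here, not a number from the post. A short sketch of why looser thresholds admit many false positives:

```python
ALPHA = 0.05                      # desired family-wise error rate
N_INDEPENDENT_TESTS = 1_000_000   # assumed effective number of independent tests

# Bonferroni: divide the error rate by the number of tests.
genome_wide_threshold = ALPHA / N_INDEPENDENT_TESTS

def expected_null_hits(p_threshold, n_tests=N_INDEPENDENT_TESTS):
    """Expected number of SNPs passing p_threshold if none were truly associated."""
    return p_threshold * n_tests

for p in (5e-8, 1e-5, 1e-3):
    print(p, expected_null_hits(p))
```

At the genome-wide threshold the expected number of chance hits is well below one, whereas at p < 1E-03 about a thousand null SNPs would pass by chance alone under the same assumption.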

Noah Carl said...

Good point; thanks for clearing that up.

steve hsu said...

Some clueless biologists ("real experts"!) thought/think the entire genome (or nearly every gene) is involved in complex traits like height. To me this just sounded stupid but the authors of this paper consider it an important point to refute. To them finite means less than the whole genome.

Bibibibibib Blubb said...

What are the genes that were involved? Almost all human genes are expressed virtually everywhere. Their expression levels change too, and on top of that depend on the expression detection study. Go check genecards.com; most genes are expressed in literally every tissue they have been tested in. Even those IQ genes: some of them are more expressed in the rectum or skin than in the brain.

Can you give me a list of the genes?
For the article: mTOR
Expressed in everything. Other than being highly expressed in bones (sometimes), it is highly expressed in the brain (like most genes), and very highly in testes, lungs and things like the pancreas. It's expressed almost everywhere.

"423 loci were enriched for genes, pathways and tissue types known to be involved in growth" Most genes are involved in growth, most are involved in everything.

Please can you give me a list of the genes? At least a few of them. The top 3 or 4.

Damn this pay wall.

Bibibibibib Blubb said...

Did they replicate these alleles in the independent samples?

Bibibibibib Blubb said...

This study does not refute them. Pretty much every gene that is mentioned in this study is involved in things that do not have to do with growth. A lot of them are more expressed in other regions, especially the brain. Some of them show no expression in growth or anything that has to do with height.

I'm looking at it right now.

Bibibibibib Blubb said...

Well, I checked most of the ones in the supplementary graphs with the clustering around genes, and only a few genes had a lot to do with bone growth, e.g. BMP6 and ACAN. These genes, as with most of the other genes, are expressed and involved in many other things like the brain, eyes, kidneys, ma you name it, it's involved.

Also, there are other genes I checked with high clustering, like USP37, DIRC3 or ANO1, that have little to do with things involved in height.

There's even one gene, ABCA17P, which is a pseudogene that has apparently lost its function.

There also seem to be genes that are clearly highly involved in growth and stature that do not show up in the study, like the rest of the BMP genes. Maybe I can't find them...

steve hsu said...

You are confused about the direction of implication. The study suggests that not every gene or region of DNA affects inter-individual height variation. That is not the same as saying that the height causal variants only influence height.

Please don't post random stuff in the comments.

Bibibibibib Blubb said...

"The study suggests that not every gene or region of DNA affects inter-individual height variation".
Neither do the scientists you referred to. They think that most of the genome is involved in height and that most of the genome involved in height is involved elsewhere too. That's exactly what I found looking through the genes.

It's actually kinda worse than that because some of the genes implicated and with a lot of clustering don't have much to do with height. A few might not even be genes.

CarlShulman said...

How much more predictive power could you eke out from the same dataset using the best statistical methods for compressed sensing?

steve hsu said...

Can't be sure until we try. Also, note that due to crappy IRB rules it's often not possible to aggregate data for a better analysis (one of the advantages of simple regression is that you can just share summary statistics across groups). But in our earlier paper we found that CS could find perhaps twice as many causal variants as simple regression with the usual genome-wide significance threshold, with a small false positive rate. See figure.
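The point about sharing summary statistics can be illustrated with a standard fixed-effects, inverse-variance-weighted meta-analysis: each group runs its own single-SNP regression and reports only an effect estimate and standard error, which are then combined without pooling individual genotypes. This is a minimal sketch of that general technique, with a hypothetical function name and made-up numbers. It is not claimed to be the exact pipeline used by any particular consortium.

```python
from math import sqrt

def inverse_variance_meta(betas, ses):
    """Fixed-effects meta-analysis of one SNP from per-cohort (beta, SE) pairs."""
    weights = [1.0 / se ** 2 for se in ses]           # precision of each cohort
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = sqrt(1.0 / sum(weights))                     # SE of the pooled estimate
    return beta, se, beta / se                        # pooled beta, SE, z-score

# Hypothetical per-cohort effect estimates for a single SNP.
cohort_betas = [0.10, 0.20]
cohort_ses = [0.10, 0.10]
beta, se, z = inverse_variance_meta(cohort_betas, cohort_ses)
print(beta, se, z)
```

A joint method that fits all SNPs at once generally needs more than per-SNP betas and standard errors, which is where restrictions on aggregating data become a practical obstacle.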


Richard Seiter said...

How do you expect the CS/regression comparison to vary by sample size (in particular just past the CS phase change)? Have you been able to test this by simulation?

Have you looked for minimum p-value SNPs in GIANT within 500kb of your red hits? It would be interesting to see if there are any close to 10e-8 (possible false negatives in GIANT?).

CarlShulman said...

Thanks Steve, that's helpful. So that would give the BGI study another advantage, aside from the case-control structure and whole genome sequencing for rare variants, to help offset small sample size.
