Friday, May 22, 2015

Genetic architecture and predictive modeling of quantitative traits



As an experiment I recorded this video using slides from a talk I gave last week at NIH. I will be giving similar talks later this spring/summer at Human Longevity Inc. and BGI. The commonality between these institutions is that all three are on the road to accumulating a million human genomes. Who will get there first?

Recording the video was easy using Keynote, although it's a bit odd to talk to yourself for an hour. I recommend that everyone do this, in order to reach a much larger audience than can fit in a lecture hall :-)


I discuss the application of Compressed Sensing (L1-penalized optimization, or LASSO) to genomic prediction. I show that matrices of human genomes are good compressed sensors, and that LASSO applied to genomic prediction exhibits a phase transition as the sample size is varied. When the sample size crosses the phase boundary, complete identification of the subspace of causal variants is possible. For typical traits of interest (e.g., with heritability ~ 0.5), the phase boundary occurs at N ~ 30s, where s (sparsity) is the number of causal variants. I give some estimates of sparsity associated with complex traits such as height and cognitive ability, which suggest s ~ 10k. In practical terms, these results imply that powerful genomic prediction will be possible for many complex traits once ~ 1 million genotypes are available for analysis.
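A minimal sketch of the idea, purely my own illustration rather than the talk's actual pipeline: simulate a sparse additive trait (s causal variants, heritability h2 ~ 0.5), fit LASSO by coordinate descent, and compare support recovery at a sample size below the ~30s boundary versus one above it. All parameter choices here (p, s, the regularization level) are arbitrary assumptions for the demo.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b                        # current residual
    col_sq = (X ** 2).sum(axis=0) / n    # per-column scale factors
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]          # remove coordinate j's contribution
            rho = X[:, j] @ r / n
            # soft-thresholding update
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(0)
p, s, h2 = 500, 10, 0.5                  # SNPs, causal variants, heritability
causal = rng.choice(p, s, replace=False)
beta = np.zeros(p)
beta[causal] = rng.normal(size=s)

for n in (5 * s, 50 * s):                # below and above the ~30s boundary
    X = rng.binomial(2, 0.5, (n, p)).astype(float)  # genotypes 0/1/2
    X -= X.mean(axis=0)
    g = X @ beta                         # genetic values
    # add environmental noise so that var(g)/var(y) = h2
    y = g + rng.normal(scale=np.sqrt(g.var() * (1 - h2) / h2), size=n)
    b = lasso_cd(X, y, lam=0.1 * np.abs(X.T @ y / n).max())
    found = np.intersect1d(np.flatnonzero(np.abs(b) > 1e-3), causal).size
    print(f"n = {n:5d}: recovered {found}/{s} causal variants")
```

With these toy settings the small-n fit typically misses most of the support, while the large-n fit recovers nearly all of it, which is the qualitative phase-transition behavior the talk describes.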

11 comments:

Richard Seiter said...

Steve, don't know if you know about using dl.dropbox.com in your link as a way of allowing direct download of the file. I find that handy. For example:
https://dl.dropbox.com/s/tjsdre0u3hvca0x/NIH-HLI.pdf?dl=0
For PDFs it's not that big of a deal, but the direct link is much nicer for HTML files.

Thanks for the slides and talk! I like your slide 16 visual (in the video preview above). Fancier than the usual 2D version. Are there any good references for "Generalization from SNPs to whole genomes still needs work"?

David Coughlin said...

I've been puzzling over how best to do slide decks with audio. I did the Udacity course on massively parallel computing [i.e. GPU computing], and I liked the way that they incorporate writing into it, but they don't also use static slides. Plus, sometimes the act of drawing is important to the description [especially when trying to illustrate dynamic concepts]. Have you tried that?

lukelea said...

Very good. You are a natural teacher. One question though: would -- or, rather, might -- people with cognitive abilities +25 SD find the world a boring place to be? The game of life would be too easy.

ben_g said...

There are always new challenges.

lukelea said...

About that 25 SD -- isn't it possible that there might be diminishing returns as more and more of those negative alleles are eliminated? Is that inconsistent with the idea of additivity?

lukelea said...

Tell it to the bored.

Bobw said...

Thanks so much, Steve. That was awesome. I had questions as I watched it, but now I can't remember them. I guess I'll have to watch it again.

Bobw said...

Hi Steve. Sorry. Dude keeps asking questions on old topics.
That empty ellipse on slide 10 (frequency vs effect magnitude). I'm trying to reason why that is empty. The math detects smaller effect sizes than that. It detects smaller frequencies than that. Why is it empty? Under selection, I reasoned that all moderate-magnitude alleles would get pushed either towards fixation or extinction, making the frequency small in that region. What aspect of the method makes it insensitive to stuff in that region?

steve hsu said...

If frequency is too low then in a given sample of individuals you won't have very many people who have the variant, so your ability to say something about it is limited. Of course if the effect size is large enough (i.e., makes people +1 foot taller) you might detect it from just a few cases. So you can see these things trade off against each other.
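A quick back-of-envelope version of this trade-off (my own illustration, with made-up numbers): under an additive model, a variant with minor allele frequency f and per-allele effect b (in trait SD units) contributes variance 2f(1-f)b^2, and in a sample of N people you expect only a handful of carriers when f is small. Both quantities shrink as f drops, which is why detection requires either more samples or a larger effect.

```python
# Hypothetical rare variant: f = 0.1% frequency, effect 0.3 SD, N = 10,000 people
f, b, N = 0.001, 0.3, 10_000

carriers = N * (1 - (1 - f) ** 2)        # expected people with >= 1 copy
var_explained = 2 * f * (1 - f) * b * b  # additive variance contributed

print(f"expected carriers: {carriers:.1f}")
print(f"trait variance explained: {var_explained:.2e}")
```

With only ~20 carriers out of 10,000 and a tiny slice of trait variance, such a variant is hard to detect unless its effect is dramatic (the "+1 foot taller" case), which is the trade-off described above.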

Bobw said...

OK, that's how I originally thought about it. However, as I said above, that region of the graph holds effects *larger* than effects detected at *lower* frequency. It's like this: If an effect of this magnitude were only *rarer*, we could detect it. Since it's so common, however, we can't.

That region should be the easiest to work with: Moderate effect, moderate frequency.

Now, maybe there's some reason that the moderately-frequent region has only *very small* effects. Strong selection might do that, I guess. Pushing things out to fixation or to extinction.

I assume I'm misunderstanding something.

Bobw said...

Oh. Dang. Never mind. Reading the plot upside-down.
