Genomic Prediction of Complex Traits
After a brief review (suitable for physicists) of computational genomics and complex traits, I describe recent progress in this area. Using methods from Compressed Sensing (L1-penalized regression; Donoho-Tanner phase transition with noise) and the UK BioBank dataset of 500k SNP genotypes, we construct genomic predictors for several complex traits. Our height predictor captures nearly all of the predicted SNP heritability for this trait -- thereby resolving the missing heritability problem. Actual heights of most individuals in validation tests are within a few cm of predicted heights. I also discuss application of these methods to cognitive ability and polygenic disease risk: sparsity estimates (of the number of causal loci), combined with phase transition scaling analysis, allow estimates of the amount of data required to construct good predictors. Finally, I discuss how these advances will affect human health and reproduction (embryo selection for In Vitro Fertilization, genetic editing) in the coming decade.
Michigan State University
I recently gave a similar talk at 23andMe (slides at link).
Note Added: Many people asked for video of this talk, but alas recording talks is not standard practice at IAS. I did give a similar talk using the same slides just a week later at the Allen Institute in Seattle (Symposium on Genetics of Complex Traits): video here.
Some Comments and Slides:
I tried to make the talk understandable to physicists and, judging by what I was told (and by my impression from the questions asked during and after the talk), largely succeeded. Early on, when I presented the phenotype function y(g), both Nima Arkani-Hamed (my host) and Ed Witten asked about the "units" of the various quantities involved. In the actual computation everything is z-scored: measured in units of SD relative to the sample mean. I didn't realize until later that there was some confusion about how this is done for the "state variable" of the genetic locus g_i. When the gene array is read, the result at each locus is 0, 1, or 2 for homozygous common allele, heterozygous, and homozygous rare allele, respectively. (I might have that backwards but you get the point.) Each locus has a minor allele frequency (MAF), which determines the sample average and SD of the distribution of 0's, 1's, and 2's. It is the z-scored version of this variable that appears in the computation. I didn't realize certain people were following the details so closely in the talk, but I should not be surprised ;-) In the future I'll include a slide specifically on this to avoid confusion.
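For readers who want to see the z-scoring step concretely, here is a minimal sketch in Python. It is an illustration only, not the code used in the actual analysis: the function names and the toy sample are made up, and the closed-form mean and SD assume Hardy-Weinberg equilibrium, which the real data need not satisfy exactly.

```python
import math

def zscore_genotypes(genotypes):
    """Z-score one locus. `genotypes` holds allele counts (0, 1, or 2), one per person."""
    n = len(genotypes)
    mean = sum(genotypes) / n
    # Population variance of the 0/1/2 counts in this sample.
    var = sum((g - mean) ** 2 for g in genotypes) / n
    sd = math.sqrt(var)
    if sd == 0.0:
        raise ValueError("monomorphic locus: SD is zero, cannot z-score")
    return [(g - mean) / sd for g in genotypes]

def hwe_mean_sd(maf):
    """Mean and SD of the allele count implied by the MAF.

    ASSUMPTION: Hardy-Weinberg equilibrium, so the count is Binomial(2, maf).
    """
    return 2.0 * maf, math.sqrt(2.0 * maf * (1.0 - maf))

# Toy sample (hypothetical data): ten individuals at one locus.
sample = [0, 0, 1, 0, 2, 1, 0, 0, 1, 0]
z = zscore_genotypes(sample)
maf = sum(sample) / (2 * len(sample))  # minor alleles / total alleles
print("MAF:", maf)
print("HWE mean, SD:", hwe_mean_sd(maf))
print("z-scored:", [round(v, 3) for v in z])
```

After this transformation every locus has sample mean 0 and SD 1, which is what makes the "units" question moot: effect sizes are then in SD of phenotype per SD of genotype.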
Looking at my slide on missing heritability, Witten immediately noted that estimating SNP heritability (as opposed to total or broad-sense heritability) is nontrivial, and I had to quickly explain the GCTA technique!
During the talk I discussed the theoretical reason we expect to find a lot of additive variance: nonlinear gadgets are fragile (easy to break through recombination in sexual reproduction), whereas additive genetic variance can be reliably passed on and is easy for natural selection to act on***. (See also Fisher's Fundamental Theorem of Natural Selection.) Usually these comments pass over the head of the audience, but at IAS I am sure quite a few people understood the point.
One non-physicist reader of this blog braved IAS security and managed to attend the lecture. I am flattered, and I invite him to share his impressions in the comments!
Afterwards there was quite a bit of additional discussion, which spilled over into tea time. The important ideas (how Compressed Sensing works, the nature of the phase transition, and how we can predict the amount of data required to build a good predictor, one capturing most of the SNP heritability, using the universality of the phase transition plus an estimate of sparsity) were clearly absorbed by the people I talked to.
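To give a flavor of the L1-penalized regression at the heart of the method, here is a toy sketch in pure Python. It is not our actual pipeline: the data are synthetic Gaussian "genotypes" rather than real SNPs, the solver is a plain coordinate-descent LASSO, and every name and parameter value (n, p, lam, the chosen support) is a made-up illustration. Note that there are fewer samples (n = 120) than candidate predictors (p = 200), yet the sparse signal can still be picked out. The sketch does not attempt to reproduce the phase transition analysis.

```python
import random

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=50):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date as beta changes
    for _ in range(sweeps):
        for j in range(p):
            col = cols[j]
            # Correlation of column j with the residual, with beta_j's own
            # contribution added back in.
            rho = sum(c * r for c, r in zip(col, resid)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta

# Synthetic problem (all values hypothetical): 5 causal loci out of 200.
rng = random.Random(0)
n, p = 120, 200
true_support = [3, 17, 42, 77, 150]
true_beta = [0.0] * p
for k, j in enumerate(true_support):
    true_beta[j] = 1.0 if k % 2 == 0 else -1.0
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.1) for row in X]

beta_hat = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
top = sorted(range(p), key=lambda j: abs(beta_hat[j]), reverse=True)[:len(true_support)]
print("nonzero coefficients:", len(selected))
print("largest effects at loci:", sorted(top))
```

The L1 penalty is what drives most coefficients to exactly zero. The surviving coefficients are shrunk toward zero by roughly lam, so in practice one might refit on the selected loci, a choice this sketch leaves out.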
*** On the genetic architecture of intelligence and other quantitative traits (p.16):
... The preceding discussion is not intended to convey an overly simplistic view of genetics or systems biology. Complex nonlinear genetic systems certainly exist and are realized in every organism. However, quantitative differences between individuals within a species may be largely due to independent linear effects of specific genetic variants. As noted, linear effects are the most readily evolvable in response to selection, whereas nonlinear gadgets are more likely to be fragile to small changes. (Evolutionary adaptations requiring significant changes to nonlinear gadgets are improbable and therefore require exponentially more time than simple adjustment of frequencies of alleles of linear effect.) One might say that, to first approximation, Biology = linear combinations of nonlinear gadgets, and most of the variation between individuals is in the (linear) way gadgets are combined, rather than in the realization of different gadgets in different individuals.
Linear models work well in practice, allowing, for example, SNP-based prediction of quantitative traits (milk yield, fat and protein content, productive life, etc.) in dairy cattle. ...