Physicist, Startup Founder, Blogger, Dad

Saturday, January 06, 2018

Institute for Advanced Study: Genomic Prediction of Complex Traits (seminar)

Genomic Prediction of Complex Traits

After a brief review (suitable for physicists) of computational genomics and complex traits, I describe recent progress in this area. Using methods from Compressed Sensing (L1-penalized regression; Donoho-Tanner phase transition with noise) and the UK BioBank dataset of 500k SNP genotypes, we construct genomic predictors for several complex traits. Our height predictor captures nearly all of the predicted SNP heritability for this trait -- thereby resolving the missing heritability problem. Actual heights of most individuals in validation tests are within a few cm of predicted heights. I also discuss application of these methods to cognitive ability and polygenic disease risk: sparsity estimates (of the number of causal loci), combined with phase transition scaling analysis, allow estimates of the amount of data required to construct good predictors. Finally, I discuss how these advances will affect human health and reproduction (embryo selection for In Vitro Fertilization, genetic editing) in the coming decade.

Steve Hsu

Michigan State University

I recently gave a similar talk at 23andMe (slides at link).

Some Comments and Slides:

I tried to make the talk understandable to physicists and, judging by what I was told (and by the questions asked during and after the talk), largely succeeded. Early on, when presenting the phenotype function y(g), both Nima Arkani-Hamed (my host) and Ed Witten asked about the "units" of the various quantities involved. In the actual computation everything is z-scored: measured in units of SD relative to the sample mean. I didn't realize until later that there was some confusion about how this is done for the "state variable" of the genetic locus g_i. In fact, when the gene array is read, the result is 0, 1, 2 for homozygous common allele, heterozygous, and homozygous rare allele, respectively. (I might have that backwards, but you get the point.) For each locus there is a minor allele frequency (MAF), and this determines the sample average and SD of the distribution of 0's, 1's, and 2's. It is the z-scored version of this variable that appears in the computation. I didn't realize certain people were following the details so closely during the talk, but I should not be surprised ;-) In the future I'll include a slide specifically on this point to avoid confusion.
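For concreteness, here is a minimal numpy sketch of the z-scoring step (my own illustration, not the code used in our analysis; the loci, MAFs, and sample size are invented). Each genotype entry counts copies of one allele (0, 1, or 2), and each locus is standardized by its own sample mean and SD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: 1000 individuals genotyped at 5 loci.
# Each entry counts copies of the minor allele: 0 (homozygous common),
# 1 (heterozygous), 2 (homozygous rare).
maf = np.array([0.05, 0.1, 0.2, 0.3, 0.5])        # minor allele frequencies
G = rng.binomial(2, maf, size=(1000, maf.size))   # Hardy-Weinberg sampling

# z-score each locus: subtract the sample mean (approximately 2*MAF) and
# divide by the sample SD (approximately sqrt(2*MAF*(1-MAF)) under H-W).
Z = (G - G.mean(axis=0)) / G.std(axis=0)
```

After this step every column of Z has mean 0 and SD 1, so effect sizes estimated from Z are directly comparable across loci regardless of allele frequency.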

Looking at my slide on missing heritability, Witten immediately noted that estimating SNP heritability (as opposed to total or broad-sense heritability) is nontrivial, and I had to quickly explain the GCTA technique!
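To give a flavor of what SNP heritability estimation involves, here is a toy simulation using Haseman-Elston regression, a simpler moment-based cousin of the GREML method implemented in GCTA (this is my own sketch, not the GCTA software; all parameters are invented). The idea: pairwise genomic relatedness, computed from the SNPs, should predict pairwise phenotypic similarity with slope h2:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy simulation: n individuals, m SNPs, additive phenotype with
# true SNP heritability h2 = 0.5.
n, m, h2 = 2000, 1000, 0.5
maf = rng.uniform(0.05, 0.5, m)
G = rng.binomial(2, maf, size=(n, m))
Z = (G - G.mean(axis=0)) / G.std(axis=0)           # standardized genotypes

beta = rng.normal(0, np.sqrt(h2 / m), m)           # small additive effects
y = Z @ beta + rng.normal(0, np.sqrt(1 - h2), n)   # phenotype
y = (y - y.mean()) / y.std()                       # standardize phenotype

A = Z @ Z.T / m                                    # genomic relationship matrix
# Haseman-Elston: regress off-diagonal phenotype products y_i*y_j on A_ij.
iu = np.triu_indices(n, k=1)
h2_est = (A[iu] * (y[:, None] * y[None, :])[iu]).sum() / (A[iu] ** 2).sum()
```

The recovered h2_est lands near the true value 0.5. GREML does the same job via restricted maximum likelihood on the relationship matrix, which is statistically more efficient but harder to summarize in a dozen lines.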

During the talk I discussed the theoretical reason we expect to find a lot of additive variance: nonlinear gadgets are fragile (easy to break through recombination in sexual reproduction), whereas additive genetic variance can be reliably passed on and is easy for natural selection to act on***. (See also Fisher's Fundamental Theorem of Natural Selection.) Usually these comments go over the audience's heads, but at IAS I am sure quite a few people understood the point.

One non-physicist reader of this blog braved IAS security and managed to attend the lecture. I am flattered, and I invite him to share his impressions in the comments!

Afterwards there was quite a bit of additional discussion, which spilled over into tea time. The important ideas (how Compressed Sensing works, the nature of the phase transition, and how the universality of the phase transition plus an estimate of sparsity lets us predict the amount of data required to build a good predictor capturing most of the SNP heritability) were clearly absorbed by the people I talked to.
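A minimal sketch of the compressed sensing phenomenon behind these estimates (my own toy example using sklearn's Lasso, not the code from the talk; all sizes are invented): with p candidate predictors and only s of them truly nonzero, L1-penalized regression fails badly when the sample size n is small, then recovers the sparse signal accurately once n passes a threshold that grows with s log p rather than p. That threshold is the Donoho-Tanner phase transition:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# s-sparse true effect vector over p candidate predictors.
p, s = 500, 10
beta = np.zeros(p)
beta[rng.choice(p, s, replace=False)] = rng.normal(0, 1, s)

errs = {}
for n in (30, 400):                       # below vs. above the transition
    X = rng.normal(0, 1, (n, p))
    y = X @ beta + rng.normal(0, 0.1, n)  # small additive noise
    fit = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
    errs[n] = np.linalg.norm(fit.coef_ - beta) / np.linalg.norm(beta)
```

Relative error is near 1 at n = 30 (recovery fails) and small at n = 400 (recovery succeeds), even though p = 500 exceeds the smaller sample size. The universality of the transition is what lets one extrapolate, from an estimate of the sparsity of a trait, how much data a good genomic predictor will require.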


Here I am in front of Fuld Hall.

*** On the genetic architecture of intelligence and other quantitative traits (p.16):
... The preceding discussion is not intended to convey an overly simplistic view of genetics or systems biology. Complex nonlinear genetic systems certainly exist and are realized in every organism. However, quantitative differences between individuals within a species may be largely due to independent linear effects of specific genetic variants. As noted, linear effects are the most readily evolvable in response to selection, whereas nonlinear gadgets are more likely to be fragile to small changes. (Evolutionary adaptations requiring significant changes to nonlinear gadgets are improbable and therefore require exponentially more time than simple adjustment of frequencies of alleles of linear effect.) One might say that, to first approximation, Biology = linear combinations of nonlinear gadgets, and most of the variation between individuals is in the (linear) way gadgets are combined, rather than in the realization of different gadgets in different individuals.

Linear models work well in practice, allowing, for example, SNP-based prediction of quantitative traits (milk yield, fat and protein content, productive life, etc.) in dairy cattle. ...

Friday, January 05, 2018

Gork revisited, 2018

It's been almost 10 years since I made the post Are you Gork?

Over the last decade, both scientists and non-scientists have become more confident that we will someday create:

A. AGI (= sentient AI, named "Gork" :-)  See Rise of the Machines: Survey of AI Researchers.

B. Quantum Computers. See Quantum Computing at a Tipping Point?

This change in zeitgeist makes the thought experiment proposed below much less outlandish. What, exactly, does Gork perceive? Why couldn't you be Gork? (Note that the AGI in Gork can be an entirely classical algorithm even though he exists in a quantum simulation.)

Slide from this [Caltech IQI] talk. See also illustrations in Big Ed.
Survey questions:

1) Could you be Gork the robot? (Do you split into different branches after observing the outcome of, e.g., a Stern-Gerlach measurement?)

2) If not, why? e.g.,

I have a soul and Gork doesn't

Decoherence solved all that! See previous post.

I don't believe that quantum computers will work as designed, e.g., sufficiently large algorithms or subsystems will lead to real (truly irreversible) collapse. Macroscopic superpositions larger than whatever was done in the lab last week are impossible.

QM is only an algorithm for computing probabilities -- there is no reality to the quantum state or wavefunction or description of what is happening inside a quantum computer.

Stop bothering me -- I only care about real stuff like the Higgs mass / SUSY-breaking scale / string Landscape / mechanism for high-Tc / LIBOR spread / how to generate alpha. 
[ 2018: Ha Ha -- first 3 real stuff topics turned out to be pretty boring use of the last decade... ]
Just as A. and B. above have become less outlandish assumptions, improved technology for creating large and complex superposition states (largely developed for quantum computing; see Schrodinger's Virus) will make the possibility that we ourselves exist in a superposition state less shocking. Future generations of physicists will wonder why it took their predecessors so long to accept Many Worlds.

Bonus! I will be visiting Caltech next week (Tues and Weds 1/8-9). Any blog readers interested in getting a coffee or beer please feel free to contact me :-)
