Tuesday, October 15, 2013

Compressed sensing and genomes

For more discussion of our recent paper (The human genome as a compressed sensor), see this blog post by my collaborator Carson Chow and another on the machine learning blog Nuit Blanche. One of our main points in the paper is that the phase transition between the regimes of poor and good recovery for L1-penalized regression (LASSO) is readily detectable, and that the scaling behavior of the phase boundary allows theoretical estimates of the amount of data required for good performance at a given sparsity. Apparently, this reasoning has appeared before in the compressed sensing literature, where it has been used to optimize hardware designs for sensors. In our case, the sensor is the human genome, and its statistical properties are fixed. Fortunately, we find that genotype matrices are in the same universality class as random matrices, which are good compressed sensors.

The black line in the figure below is the theoretical prediction (Donoho 2006) for the location of the phase boundary. The shading shows results from our simulations. The scale on the right is the squared L2 norm of the error in the recovered effects vector compared to the actual effects.
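
To get a feel for the phase transition, here is a minimal sketch in Python. It is not the code used in the paper: it uses a Gaussian random matrix as a stand-in for a genotype matrix, a simple iterative soft-thresholding (ISTA) solver for the LASSO objective, and noise-free phenotypes. All function names and parameter values (p = 200, s = 5, lam = 0.01) are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink each entry toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    # Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by ISTA.
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / n bounds the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, lam * step)
    return beta

def recovery_error(n, p=200, s=5, lam=0.01, seed=0):
    # Relative squared L2 error of the recovered effects vector.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))            # assumed stand-in for genotypes
    beta_true = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta_true[support] = rng.choice([-1.0, 1.0], size=s)
    y = X @ beta_true                          # noise-free phenotype (assumption)
    beta_hat = lasso_ista(X, y, lam)
    return np.sum((beta_hat - beta_true) ** 2) / np.sum(beta_true ** 2)

# One sample size well below the boundary, one well above it.
errors = {n: recovery_error(n) for n in (10, 100)}
for n, err in errors.items():
    print(f"n = {n:4d}  relative squared error = {err:.4f}")
```

Sweeping n and s over a grid and shading each cell by this error would produce a diagram of the same kind as the figure, though with none of the realism (noise, linkage disequilibrium, real genotypes) of the simulations in the paper.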


Perhaps we are approaching a D-T moment in genomics ;-)
... a Donoho-Tao moment in the Radar community at the next CoSeRa meeting :-). As a reminder the Donoho-Tao moment was well put in this 2008 IPAM newsletter: .... It’s David Donoho [5] reportedly exclaiming [to] a panel of NSF folks “You’ve got Terry Tao (a Fields medalist [6]) talking to geoscientists, what do you want?” ....

In previous discussions I predicted that on the order of millions of phenotype-genotype pairs would be sufficient to extract the genetic architecture of complex traits like height or g. This estimate is based on two ingredients:

1. The sparsity of these traits is probably no greater than s ~ 10k (evidence for this comes from looking at genomic Hamming distance as a function of phenotype distance).

2. The compressed sensing results suggest that good recovery can be achieved above a data threshold of roughly n ~ 30 s (assuming 1E06 SNPs and additive heritability h^2 = 0.5 or so).

Including an extra order of magnitude to be safe, this leads to n ~ millions.
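
The arithmetic behind that estimate can be spelled out in a few lines of Python. This is just a back-of-the-envelope sketch; the function and argument names are made up for illustration, and the inputs are the rough values quoted above (s ~ 10k, n ~ 30 s, one extra order of magnitude).

```python
def required_sample_size(sparsity, samples_per_nonzero=30, safety_factor=10):
    # Threshold from the phase boundary scaling n ~ 30 s, then a padded
    # estimate that includes an extra order of magnitude to be safe.
    threshold = samples_per_nonzero * sparsity
    return threshold, safety_factor * threshold

threshold, padded = required_sample_size(10_000)
print(threshold, padded)  # 300000 3000000
```

So the threshold itself sits at a few hundred thousand individuals, and the padded figure is a few million.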
