Monday, January 20, 2014

Compressed sensing and genomes v2

We have posted a new version of our compressed sensing and genomics paper on arXiv. For earlier discussion see here and here. The new results concern the effect of LD (linkage disequilibrium) on the method, and a demonstration that choosing a large L1 penalization parameter allows the selection of loci of nonzero effect at sample sizes too small to recover the entire effects vector -- see below, including the figure.

As I've mentioned before, one practical implication of this work is that a sample size of order 30 times the sparsity s (s = total number of loci of nonzero effect) is sufficient at h2 = 0.5 to extract all the loci. For a trait with s = 10k, somewhat less than a million individuals should be enough.
Application of compressed sensing to genome-wide association studies and genomic selection

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered; we discuss practical methods for detecting the boundary. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
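For readers who want to play with the idea, here is a minimal, self-contained sketch of the recovery experiment: a hand-rolled ISTA solver for the L1-penalized (Lasso) objective, applied to simulated genotypes with s = 10 nonzero effects, a sample size of 30s, and noise tuned to h2 = 0.5. The solver, the penalty scale, and all of the numbers are illustrative choices of mine, not taken from the paper.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=1000):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - step * (X.T @ (X @ b - y))   # gradient step on the squared error
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return b

rng = np.random.default_rng(0)
p, s = 1000, 10               # genotyped markers; true nonzeros (sparsity)
n = 30 * s                    # sample size ~ 30x sparsity, the h2 = 0.5 rule of thumb
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0
noise = rng.standard_normal(n) * np.sqrt(s)  # noise variance = genetic variance -> h2 = 0.5
y = X @ beta_true + noise

lam = 0.5 * np.sqrt(s) * np.sqrt(2 * n * np.log(p))  # illustrative penalty scale
b = ista_lasso(X, y, lam)
recovered = np.intersect1d(np.flatnonzero(np.abs(b) > 0.1), np.arange(s)).size
print(f"true loci recovered: {recovered} of {s}")
```

Most coefficients come out exactly zero because the soft-threshold step zeroes anything below the penalty scale; that is the "selection" in L1-penalized selection.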
An excerpt from the new version and corresponding figure:
... CS theory does not provide performance guarantees in the presence of arbitrary correlations (LD) among predictor variables. However, according to our simulations using all genotyped SNPs on chromosome 22, L1-penalized regression does select SNPs in close proximity to true nonzeros. The difficulty of fine-mapping an association signal to the actual causal variant is a limitation shared by all statistical gene-mapping approaches—including marginal regression as implemented in standard GWAS—and thus should not be interpreted as a drawback of L1 methods.

We found that a sample size of 12,464 was not sufficient to achieve full recovery of the nonzeros with respect to height. The penalization parameter λ, however, is set by CS theory so as to minimize the normalized error (NE). In some situations it might be desirable to tolerate a relatively large NE in order to achieve precise but incomplete recovery (few false positives, many false negatives). By setting λ to a strict value appropriate for a low-heritability trait, we found that a phase transition to good recovery can be achieved with smaller sample sizes, at the cost of selecting a smaller number of markers and hence suffering many false negatives. ...
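One way to see the λ tradeoff described in the excerpt: in the idealized orthogonal-design limit, the Lasso reduces to soft-thresholding the per-SNP estimates, so the effect of raising λ can be read off directly. The effect sizes and noise level below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, s = 2000, 50
beta = np.zeros(p)
beta[:s] = 0.3                        # small true effects
bhat = beta + rng.normal(0, 0.1, p)   # noisy per-SNP estimates (orthogonal design)

def selected(lam):
    """Support of the soft-thresholded estimate at penalty lam."""
    sel = np.flatnonzero(np.abs(bhat) > lam)
    return sel.size, int(np.sum(sel < s))   # (total selected, true positives)

for lam in (0.05, 0.20, 0.35):
    n_sel, tp = selected(lam)
    print(f"lambda={lam:.2f}: selected={n_sel:4d}, "
          f"true positives={tp:3d}, false positives={n_sel - tp}")
```

At the largest λ almost every selected SNP is a true nonzero, but most true nonzeros are missed: precise but incomplete recovery, exactly the tradeoff described above.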

Figure 7 Map of SNPs associated with height, as identified by the GIANT Consortium meta-analysis, L1-penalized regression, and standard GWAS. Base-pair distance is given by angle, and chromosome endpoints are demarcated by dotted lines. Starting from 3 o’clock and going counterclockwise, the map sweeps through the chromosomes in numerical order. As a scale reference, the first sector represents chromosome 1 and is ∼ 250 million base-pairs. The blue segments correspond to height-associated SNPs discovered by GIANT. Note that some of these may overlap. The yellow segments represent L1-selected SNPs that fell within 500 kb of a (blue) GIANT-identified nonzero; these met our criterion for being declared true positives. The red segments represent L1-selected SNPs that did not fall within 500 kb of a GIANT-identified nonzero. Note that some yellow and red segments overlap given this figure’s resolution. There are in total 20 yellow/red segments, representing L1-selected SNPs found using all 12,464 subjects. The white dots represent the locations of SNPs selected by marginal regression (MR) at a P-value threshold of 10−8.


david_suzuki said...

"...and a demonstration that choosing a large L1 penalization parameter
allows the selection of loci of nonzero effect at sample sizes too small
to recover the entire effects vector... of order 30 times sparsity s (number of loci of nonzero effect) is
sufficient sample size at h2 = 0.5 to extract all the loci."

then you should be able to find some loci with the bgi data?

DK said...

For a trait with s = 10k, somewhat less than a million individuals should be enough

And the march toward more modesty continues unabated... Good. First breakthrough from the endless hype is a hint that loci affecting IQ are at least a half of the genome (big, BIG surprise!) And the second is the indirect admission that "I’d say we have a fighting chance” is probably wrong by an order of magnitude or more.

steve hsu said...

If you look through all the posts on this blog (as opposed to journalistic stuff written by other people), you'll see the s ~ 10k and n ~ 1M numbers have been discussed many times. I think I've even explained to you before (in comment threads) that 10k refers to loci, not genes. So 10k is out of millions of SNPs or even billions of bps if you want to count that way (your DNA does a lot more than code for individual proteins ...). If you read the paper you'll see that success with the algorithm (i.e., sufficient sample size) means we'll be able to build a good genomic predictor of the phenotype, even if s is "large" (e.g., 10k). You do realize we can already do this with domesticated animals?

"Fighting chance" refers to finding ONE or a few genome wide significant (p<1E-08 or so) hits, not ALL of the associated loci. That was always the goal of the first BGI project, as explained in the talks (see slides) I've given on the subject. But even finding the first hits (by this definition) is a milestone for doubters who don't believe that cognitive ability is controlled by genes. Unfortunately, our group won't be the first to do this.

DK said...

10k refers to loci, not genes ... your DNA does a lot more than just code for individual proteins

Ultimately, it's all about gene expression - protein expression, in about 99% of cases. If anything, it means that those 10K loci affect more than a half of functional genes.

milestone for doubters who don't believe that cognitive ability is influenced by genetics

That's a very expensive way of proving the obvious.

This can already be done with domesticated plants and animals

No, it presently cannot be done. Not even remotely close. Probably not even with E.coli.

steve hsu said...

"No, it presently cannot be done. Not even remotely close."

Breeders have recently started switching from pedigree-based methods to statistical models incorporating SNP genotypes. We can now make good predictions of phenotypes like milk and meat production using genetic data alone. Cows are easier than people because, as domesticated animals, they have a smaller effective breeding population and less genetic diversity.

Richard Seiter said...

Steve, is the BGI study incorporating any analysis of what/where the SNP is? It seems like doing that would be helpful for increasing confidence of the hits, and perhaps justify a different standard for p values of SNPs with a clear biological relationship to the brain (e.g. neural growth). What kinds of p values are showing up for the Tay-Sachs SNPs?

DK said...

The article says it is hoped that what you describe can really be done (and be appreciably more useful). No blind real-life comparison of genomic predictions vs conventional breeding has been done. But you are right - cows and pigs will be there first, just a matter of time. IMO, however, it will always remain only a thing with some probability, never anything certain.

steve hsu said...

The predictions will always be statistical in nature, that is probably unavoidable for such complex systems.

>> No blind real life comparison of genomic predictions vs conventional breeding has been done.

It's just a matter of time for this. What can be done already is to make a model-based prediction on some set of animals or plants and check prediction vs outcome to get a correlation or other performance score. For example, I've seen impressive results in predicting the height of corn from SNPs.

steve hsu said...

PS Just to clarify further, the predictive models used thus far are quite primitive (linear): just associate an effect size with each variant or SNP. Most effect sizes are zero, the rest are +/- small fractions of a population SD. For a given individual add up the effects associated with their genome and you have a prediction. The correlation between prediction and actual phenotype may eventually be 0.5 or higher for height and g.
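The additive model described here is simple enough to spell out in a few lines. A toy version (every number below is invented for illustration): assign each SNP an effect size, mostly zero, and predict an individual's phenotype by summing the effects of the variants they carry.

```python
import numpy as np

rng = np.random.default_rng(42)
p = 5000                                    # genotyped SNPs
effects = np.zeros(p)                       # most effect sizes are zero
causal = rng.choice(p, size=50, replace=False)
effects[causal] = rng.normal(0, 0.05, 50)   # +/- small fractions of a population SD

genotypes = rng.integers(0, 3, size=p)      # minor-allele counts: 0, 1, or 2
prediction = genotypes @ effects            # additive polygenic score
print(f"predicted phenotype: {prediction:+.3f} SD")
```

Determining the `effects` vector from real data is the hard part; the prediction step itself is just this dot product.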

Determining the effect sizes is a matter of statistical analysis on a (large) sample of individuals. Our paper gives estimates of the amount of data required to find all nonzero effects using human genomes as the "compressed sensor".

Future developments will include nonlinear terms in the model, etc. Look for a paper in the future estimating sample size requirements for determining the nonlinear effects ...
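As a rough sketch of what "nonlinear terms" might look like, one could extend the additive score with pairwise (epistatic) interaction terms. The pairs and effect sizes below are hypothetical, chosen only to show the shape of such a model.

```python
import numpy as np

rng = np.random.default_rng(7)
p = 100
g = rng.integers(0, 3, size=p).astype(float)   # one individual's genotype
additive = rng.normal(0, 0.05, p)              # additive effect sizes

# hypothetical interacting SNP pairs and their epistatic effect sizes
pairs = [(3, 17), (40, 41), (8, 90)]
pair_effects = [0.10, -0.05, 0.02]

score = g @ additive + sum(e * g[i] * g[j]
                           for (i, j), e in zip(pairs, pair_effects))
print(f"additive + epistatic score: {score:+.3f}")
```

Estimating such interaction effects requires far more data than the linear case, since the number of candidate pairs grows quadratically in p.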

DK said...

I'll be very, very impressed when genomic maize beats good old breeding. But the situation with humans is a lot more like taking today's quinoa and figuring out how to tweak its genes to make seeds the size of cherries (which is what breeding has done for maize and cows). Not in our lifetime, I am afraid.

re: nonlinear effects
Excellent! I am particularly looking forward to you, some years from now, realizing that epistasis is not a footnote you once thought it was. :-) (for an example of simple QTLs in a very simple organism)

DK said...

Yeah, I remember those papers. I can't pinpoint what's wrong with them but I am pretty sure (heh!) that they are wrong. An analogy, perhaps: when the "resolution" is really bad (think wild big data scatter plot), everything looks linear or can be fit to. I think this is what we are dealing with today. The true function is not linear but the noise totally swamps it.
