As I've mentioned before, one practical implication of this work is that a sample size of order 30 times the sparsity s (s = total number of loci of nonzero effect) is sufficient at h2 = 0.5 to extract all the loci. For a trait with s = 10k, somewhat less than a million individuals should be enough.
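To make the arithmetic concrete, here is a minimal sketch of that rule of thumb. The function name and the default multiplier argument are illustrative choices, not anything from the paper; the factor of 30 is the one quoted for h2 = 0.5 and should not be assumed to hold at other heritabilities.

```python
def required_sample_size(s, multiplier=30):
    """Rule-of-thumb sample size: multiplier * sparsity.

    s          -- number of loci of nonzero effect (sparsity)
    multiplier -- ~30 is the figure quoted for h2 = 0.5 (assumed here)
    """
    if s < 0:
        raise ValueError("sparsity must be non-negative")
    return multiplier * s


# A trait with s = 10k nonzero loci:
n = required_sample_size(10_000)
print(n)  # 300000
```

So for s = 10k the estimate is about 300k individuals, comfortably under a million.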
Application of compressed sensing to genome wide association studies and genomic selection (http://arxiv.org/abs/1310.2264)

An excerpt from the new version and corresponding figure:
We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered; we discuss practical methods for detecting the boundary. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
... CS theory does not provide performance guarantees in the presence of arbitrary correlations (LD) among predictor variables. However, according to our simulations using all genotyped SNPs on chromosome 22, L1-penalized regression does select SNPs in close proximity to true nonzeros. The difficulty of fine-mapping an association signal to the actual causal variant is a limitation shared by all statistical gene-mapping approaches—including marginal regression as implemented in standard GWAS—and thus should not be interpreted as a drawback of L1 methods.
We found that a sample size of 12,464 was not sufficient to achieve full recovery of the nonzeros with respect to height. The penalization parameter λ, however, is set by CS theory so as to minimize the NE. In some situations it might be desirable to tolerate a relatively large NE in order to achieve precise but incomplete recovery (few false positives, many false negatives). By setting λ to a strict value appropriate for a low-heritability trait, we found that a phase transition to good recovery can be achieved with smaller sample sizes, at the cost of selecting a smaller number of markers and hence suffering many false negatives. ...
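For readers who want to play with the idea, the following is a rough sketch of L1-penalized regression on simulated data. It is not the authors' code. It assumes independent Gaussian predictors rather than real genotypes (so no LD), uses a plain coordinate-descent solver, and the two λ values are arbitrary illustrative choices rather than the value set by CS theory. All function names are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator (proximal map of the L1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(cols, y, lam, sweeps=30):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    cols is a list of p columns, each a list of n values.
    """
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)  # residual y - X @ beta (beta starts at zero)
    norms = [sum(v * v for v in col) / n for col in cols]
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            if norms[j] == 0.0:
                continue
            # correlation of column j with the partial residual
            rho = sum(c * r for c, r in zip(col, resid)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * col[i]
                beta[j] = new
    return beta


def simulate(n, p, s, h2, seed=0):
    """Synthetic data: the first s of p predictors have effects of +1 or -1."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    beta = [0.0] * p
    for j in range(s):
        beta[j] = rng.choice((-1.0, 1.0))
    g = [sum(cols[j][i] * beta[j] for j in range(s)) for i in range(n)]
    # genetic variance is about s, so this noise level gives heritability ~ h2
    noise_sd = (s * (1 - h2) / h2) ** 0.5
    y = [gi + rng.gauss(0, noise_sd) for gi in g]
    return cols, y, beta


if __name__ == "__main__":
    s = 5
    cols, y, truth = simulate(n=100, p=200, s=s, h2=0.5)
    for lam in (0.3, 0.8):  # lenient vs. strict penalty (arbitrary values)
        est = lasso_cd(cols, y, lam)
        selected = {j for j, b in enumerate(est) if b != 0.0}
        true_pos = len(selected & set(range(s)))
        print(f"lambda={lam}: selected={len(selected)}, true positives={true_pos}")
```

Varying n at fixed s and comparing the selected set against the true nonzeros is one way to look for the transition described above. Raising λ should generally shrink the selected set, trading false positives for false negatives.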
Figure 7 Map of SNPs associated with height, as identified by the GIANT Consortium meta-analysis, L1-penalized regression, and standard GWAS. Base-pair distance is given by angle, and chromosome endpoints are demarcated by dotted lines. Starting from 3 o’clock and going counterclockwise, the map sweeps through the chromosomes in numerical order. As a scale reference, the first sector represents chromosome 1 and is ~250 million base-pairs. The blue segments correspond to height-associated SNPs discovered by GIANT. Note that some of these may overlap. The yellow segments represent L1-selected SNPs that fell within 500 kb of a (blue) GIANT-identified nonzero; these met our criterion for being declared true positives. The red segments represent L1-selected SNPs that did not fall within 500 kb of a GIANT-identified nonzero. Note that some yellow and red segments overlap given this figure’s resolution. There are in total 20 yellow/red segments, representing L1-selected SNPs found using all 12,454 subjects. The white dots represent the locations of SNPs selected by MR at a P-value threshold of 10^-8.
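The yellow/red split in the caption amounts to a proximity test. Below is a minimal sketch of such a test, assuming SNPs are given as (chromosome, base-pair position) pairs and treating "within 500 kb" as inclusive. The function name and the coordinates in the usage lines are made up for illustration and are not taken from the paper or from GIANT.

```python
from bisect import bisect_left

WINDOW_BP = 500_000  # the 500 kb criterion from the caption


def classify_hits(selected, reference, window=WINDOW_BP):
    """Split selected SNPs into (true_positives, others).

    selected, reference -- iterables of (chromosome, base_pair_position).
    A selected SNP counts as a true positive if some reference SNP on the
    same chromosome lies within `window` base pairs (inclusive, an assumption).
    """
    by_chrom = {}
    for chrom, pos in reference:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    true_pos, others = [], []
    for chrom, pos in selected:
        positions = by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # the nearest reference SNP sits just left or just right of index i
        near = [abs(positions[k] - pos) for k in (i - 1, i) if 0 <= k < len(positions)]
        if near and min(near) <= window:
            true_pos.append((chrom, pos))
        else:
            others.append((chrom, pos))
    return true_pos, others


# Hypothetical coordinates, for illustration only.
reference_hits = [(1, 1_000_000), (1, 5_000_000), (2, 3_000_000)]
l1_selected = [(1, 1_400_000), (1, 3_000_000), (3, 3_000_000)]
yellow, red = classify_hits(l1_selected, reference_hits)
print(yellow)  # [(1, 1400000)]
print(red)     # [(1, 3000000), (3, 3000000)]
```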