Showing posts sorted by relevance for query: compressed sensing genomics.

Friday, September 15, 2017

Phase Transitions and Genomic Prediction of Cognitive Ability

James Thompson (University College London) recently blogged about my prediction that with a sample size of order a million genotype-phenotype pairs, one could construct a good genomic predictor for cognitive ability and identify most of the associated common SNPs.
The Hsu Boundary

... The “Hsu boundary” is Steve Hsu’s estimate that a sample size of roughly 1 million people may be required to reliably identify the genetic signals of intelligence.

... the behaviour of an optimization algorithm involving a million variables can change suddenly as the amount of data available increases. We see this behavior in the case of Compressed Sensing applied to genomes, and it allows us to predict that something interesting will happen with complex traits like cognitive ability at a sample size of the order of a million individuals.

Machine learning is now providing new methods of data analysis, and this may eventually simplify the search for the genes which underpin intelligence.
There are many comments on Thompson's blog post, some of them confused. Comments from a user "Donoho-Student" are mostly correct -- he or she seems to understand the subject. (The phase transition discussed is related to the Donoho-Tanner phase transition. More from Igor Carron.)

The chain of logic leading to this prediction has been discussed here before. The excerpt below is from a 2013 post The human genome as a compressed sensor:


Compressed sensing (see also here) is a method for efficient solution of underdetermined linear systems: y = Ax + noise, using a form of penalized regression (L1 penalization, or LASSO). In the context of genomics, y is the phenotype, A is a matrix of genotypes, x a vector of effect sizes, and the noise is due to nonlinear gene-gene interactions and the effect of the environment. (Note the figure above, which I found on the web, uses different notation than the discussion here and the paper below.)

Let p be the number of variables (i.e., genetic loci = dimensionality of x), s the sparsity (number of variables or loci with nonzero effect on the phenotype = nonzero entries in x), and n the number of measurements of the phenotype (i.e., the number of individuals in the sample = dimensionality of y). Then A is an n x p matrix. Traditional statistical thinking suggests that n > p is required to fully reconstruct the solution x (i.e., reconstruct the effect sizes of each of the loci). But recent theorems in compressed sensing show that n > C s log p is sufficient if the matrix A has the right properties (is a good compressed sensor). These theorems guarantee that the performance of a compressed sensor is nearly optimal -- within an overall constant of what is possible if an oracle were to reveal in advance which s loci out of p have nonzero effect. In fact, one expects a phase transition in the behavior of the method as n crosses a critical threshold given by the inequality. In the good phase, full recovery of x is possible.
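A concrete toy version of this setup (my own sketch, not code from the paper; the sizes, noise level, and penalty alpha are illustrative assumptions, scaled far down from real genomic dimensions):

```python
# Sparse recovery of effect sizes x from y = Ax + noise via LASSO.
# Toy illustration only: p, s, n, noise level, and alpha are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s = 2000, 20                    # loci and nonzero effects (sparsity)
n = 30 * s                         # sample size near the n ~ 30 s threshold

A = rng.standard_normal((n, p))    # stand-in for a z-scored genotype matrix
x_true = np.zeros(p)
x_true[rng.choice(p, s, replace=False)] = rng.standard_normal(s)

g = A @ x_true                                # additive genetic values
y = g + rng.standard_normal(n) * g.std()      # h2 = 0.5: noise var = genetic var

fit = Lasso(alpha=0.05).fit(A, y)             # L1-penalized regression
found = set(np.flatnonzero(fit.coef_)) & set(np.flatnonzero(x_true))
print(f"recovered {len(found)} of {s} true nonzero loci")
```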

In the paper below, available on arxiv, we show that

1. Matrices of human SNP genotypes are good compressed sensors and are in the universality class of random matrices. The phase behavior is controlled by scaling variables such as  rho = s/n  and our simulation results predict the sample size threshold for future genomic analyses.

2. In applications with real data the phase transition can be detected from the behavior of the algorithm as the amount of data  n  is varied. A priori knowledge of  s  is not required; in fact one deduces the value of  s  this way.

3.  For heritability h2 = 0.5 and p ~ 1E06 SNPs, the value of  C log p  is ~ 30. For example, a trait which is controlled by s = 10k loci would require a sample size of n ~ 300k individuals to determine the (linear) genetic architecture.
For more posts on compressed sensing, L1-penalized optimization, etc., see here. Because s could be larger than 10k, because the common-SNP heritability of cognitive ability might be less than 0.5, because the phenotype measurements are noisy, and because a million is a nice round figure, I usually give that as my rough estimate of the critical sample size for good results. The estimate that s ~ 10k for cognitive ability and height originates here, but it is now supported by other work: see, e.g., Estimation of genetic architecture for complex traits using GWAS data.

We have recently finished analyzing height using L1-penalization and the phase transition technique on a very large data set (many hundreds of thousands of individuals). The paper has been submitted for review, and the results support the claims made above with s ~ 10k, h2 ~ 0.5 for height.


Added: Here are comments from "Donoho-Student":
Donoho-Student says:
September 14, 2017 at 8:27 pm GMT • 100 Words

The Donoho-Tanner transition describes the noise-free (h2=1) case, which has a direct analog in the geometry of polytopes.

The n = 30s result from Hsu et al. (specifically the value of the coefficient, 30, when p is the appropriate number of SNPs on an array and h2 = 0.5) is obtained via simulation using actual genome matrices, and is original to them. (There is no simple formula that gives this number.) The D-T transition had only been established in the past for certain classes of matrices, like random matrices with specific distributions. Those results cannot be immediately applied to genomes.

The estimate that s is (order of magnitude) 10k is also a key input.

I think Hsu refers to n = 1 million instead of 30 * 10k = 300k because the effective SNP heritability of IQ might be less than h2 = 0.5 — there is noise in the phenotype measurement, etc.


Donoho-Student says:
September 15, 2017 at 11:27 am GMT • 200 Words

Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing. These results give performance guarantees and describe phase transition behavior, but because they are rigorous theorems they only apply to specific classes of sensor matrices, such as simple random matrices. Genomes have correlation structure, so the theorems do not directly apply to the real world case of interest, as is often true.

What the Hsu paper shows is that the exact D-T phase transition appears in the noiseless (h2 = 1) problem using genome matrices, and a smoothed version appears in the problem with realistic h2. These are new results, as is the prediction for how much data is required to cross the boundary. I don’t think most gwas people are familiar with these results. If they did understand the results they would fund/design adequately powered studies capable of solving lots of complex phenotypes, medical conditions as well as IQ, that have significant h2.

Most ML people who use lasso, as opposed to people who prove theorems, are not aware of the D-T transition. Even most people who prove theorems have followed the Candes-Tao line of attack (restricted isometry property) and don’t think much about D-T. Although D eventually proved some things about the phase transition using high dimensional geometry, it was initially discovered via simulation using simple random matrices.

Wednesday, October 09, 2013

The human genome as a compressed sensor



Compressed sensing (see also here) is a method for efficient solution of underdetermined linear systems: y = Ax + noise, using a form of penalized regression (L1 penalization, or LASSO). In the context of genomics, y is the phenotype, A is a matrix of genotypes, x a vector of effect sizes, and the noise is due to nonlinear gene-gene interactions and the effect of the environment. (Note the figure above, which I found on the web, uses different notation than the discussion here and the paper below.)

Let p be the number of variables (i.e., genetic loci = dimensionality of x), s the sparsity (number of variables or loci with nonzero effect on the phenotype = nonzero entries in x), and n the number of measurements of the phenotype (i.e., the number of individuals in the sample = dimensionality of y). Then A is an n x p matrix. Traditional statistical thinking suggests that n > p is required to fully reconstruct the solution x (i.e., reconstruct the effect sizes of each of the loci). But recent theorems in compressed sensing show that n > C s log p is sufficient if the matrix A has the right properties (is a good compressed sensor). These theorems guarantee that the performance of a compressed sensor is nearly optimal -- within an overall constant of what is possible if an oracle were to reveal in advance which s loci out of p have nonzero effect. In fact, one expects a phase transition in the behavior of the method as n crosses a critical threshold given by the inequality. In the good phase, full recovery of x is possible.

In the paper below, available on arxiv, we show that

1. Matrices of human SNP genotypes are good compressed sensors and are in the universality class of random matrices. The phase behavior is controlled by scaling variables such as  rho = s/n  and our simulation results predict the sample size threshold for future genomic analyses.

2. In applications with real data the phase transition can be detected from the behavior of the algorithm as the amount of data  n  is varied. A priori knowledge of  s  is not required; in fact one deduces the value of  s  this way.

3.  For heritability h2 = 0.5 and p ~ 1E06 SNPs, the value of  C log p  is ~ 30. For example, a trait which is controlled by s = 10k loci would require a sample size of n ~ 300k individuals to determine the (linear) genetic architecture.
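Points 2 and 3 can be illustrated with a toy scan over n (again my own sketch, not the paper's code; all sizes and the penalty are assumptions):

```python
# Detect the phase transition by increasing n and watching the normalized
# error of the recovered effects vector; the transition location scales like
# C s log p, so one can read off the sparsity s without knowing it in advance.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
p, s, n_max = 1000, 10, 2000
A = rng.standard_normal((n_max, p))
x_true = np.zeros(p)
x_true[rng.choice(p, s, replace=False)] = 1.0
y = A @ x_true + 0.5 * rng.standard_normal(n_max)

for n in (50, 100, 200, 400, 800, 1600):
    coef = Lasso(alpha=0.05, max_iter=10000).fit(A[:n], y[:n]).coef_
    err = np.linalg.norm(coef - x_true) / np.linalg.norm(x_true)
    print(f"n = {n:4d}   normalized L2 error = {err:.3f}")
# The error drops sharply once n crosses a threshold ~ C s log p.
# (The value C log p ~ 30 quoted above is for p ~ 1E06; for this toy p
# the threshold sits lower.)
```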
Application of compressed sensing to genome wide association studies and genomic selection          
http://arxiv.org/abs/1310.2264
Authors: Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
Categories: q-bio.GN
Comments: 27 pages, 4 figures; Supplementary Information 5 figures

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.

Monday, January 20, 2014

Compressed sensing and genomes v2

We placed a new version of our compressed sensing and genomics paper on arxiv. For earlier discussion see here and here. New results concern the effect of LD (linkage disequilibrium) on the method, and a demonstration that choosing a large L1 penalization parameter allows the selection of loci of nonzero effect at sample sizes too small to recover the entire effects vector -- see below, including figure.

As I've mentioned before, one practical implication of this work is that a sample size of order 30 times the sparsity s (s = total number of loci of nonzero effect) is sufficient, at h2 = 0.5, to extract all the loci. For a trait with s = 10k, somewhat less than a million individuals should be enough.
Application of compressed sensing to genome wide association studies and genomic selection (http://arxiv.org/abs/1310.2264)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered; we discuss practical methods for detecting the boundary. For h2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
An excerpt from the new version and corresponding figure:
... CS theory does not provide performance guarantees in the presence of arbitrary correlations (LD) among predictor variables. However, according to our simulations using all genotyped SNPs on chromosome 22, L1-penalized regression does select SNPs in close proximity to true nonzeros. The difficulty of fine-mapping an association signal to the actual causal variant is a limitation shared by all statistical gene-mapping approaches—including marginal regression as implemented in standard GWAS—and thus should not be interpreted as a drawback of L1 methods.

We found that a sample size of 12,464 was not sufficient to achieve full recovery of the nonzeros with respect to height. The penalization parameter λ, however, is set by CS theory so as to minimize the NE. In some situations it might be desirable to tolerate a relatively large NE in order to achieve precise but incomplete recovery (few false positives, many false negatives). By setting λ to a strict value appropriate for a low-heritability trait, we found that a phase transition to good recovery can be achieved with smaller sample sizes, at the cost of selecting a smaller number of markers and hence suffering many false negatives. ...
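The lambda trade-off described above is easy to see in a toy example (mine, not the paper's analysis; sizes and penalty values are assumptions):

```python
# With n below the full-recovery threshold, raising the L1 penalty trades
# recall for precision: fewer selected markers, fewer false positives,
# many false negatives.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
p, s, n = 1000, 50, 400            # n deliberately below ~ 30 s
A = rng.standard_normal((n, p))
x = np.zeros(p)
x[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
y = A @ x + rng.standard_normal(n) * np.std(A @ x)   # h2 ~ 0.5

truth = set(np.flatnonzero(x))
for alpha in (0.01, 0.05, 0.2, 0.5):                 # increasing penalty
    sel = set(np.flatnonzero(Lasso(alpha=alpha, max_iter=10000).fit(A, y).coef_))
    print(f"alpha = {alpha:<4}  selected = {len(sel):4d}  "
          f"true positives = {len(sel & truth)}")
```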


Figure 7 Map of SNPs associated with height, as identified by the GIANT Consortium meta-analysis, L1-penalized regression, and standard GWAS. Base-pair distance is given by angle, and chromosome endpoints are demarcated by dotted lines. Starting from 3 o'clock and going counterclockwise, the map sweeps through the chromosomes in numerical order. As a scale reference, the first sector represents chromosome 1 and is ∼ 250 million base-pairs. The blue segments correspond to height-associated SNPs discovered by GIANT. Note that some of these may overlap. The yellow segments represent L1-selected SNPs that fell within 500 kb of a (blue) GIANT-identified nonzero; these met our criterion for being declared true positives. The red segments represent L1-selected SNPs that did not fall within 500 kb of a GIANT-identified nonzero. Note that some yellow and red segments overlap given this figure's resolution. There are in total 20 yellow/red segments, representing L1-selected SNPs found using all 12,454 subjects. The white dots represent the locations of SNPs selected by MR (marginal regression) at a P-value threshold of 10^-8.

Thursday, August 28, 2014

Determination of Nonlinear Genetic Architecture using Compressed Sensing

It is a common belief in genomics that nonlinear interactions (epistasis) in complex traits make the task of reconstructing genetic models extremely difficult, if not impossible. In fact, it is often suggested that overcoming nonlinearity will require much larger data sets and significantly more computing power. Our results show that in broad classes of plausibly realistic models, this is not the case.
Determination of Nonlinear Genetic Architecture using Compressed Sensing (arXiv:1408.6583)
Chiu Man Ho, Stephen D.H. Hsu
Subjects: Genomics (q-bio.GN); Applications (stat.AP)

We introduce a statistical method that can reconstruct nonlinear genetic models (i.e., including epistasis, or gene-gene interactions) from phenotype-genotype (GWAS) data. The computational and data resource requirements are similar to those necessary for reconstruction of linear genetic models (or identification of gene-trait associations), assuming a condition of generalized sparsity, which limits the total number of gene-gene interactions. An example of a sparse nonlinear model is one in which a typical locus interacts with several or even many others, but only a small subset of all possible interactions exist. It seems plausible that most genetic architectures fall in this category. Our method uses a generalization of compressed sensing (L1-penalized regression) applied to nonlinear functions of the sensing matrix. We give theoretical arguments suggesting that the method is nearly optimal in performance, and demonstrate its effectiveness on broad classes of nonlinear genetic models using both real and simulated human genomes.
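A minimal sketch of the idea (my construction, not the authors' code; sizes, effects, and the penalty are assumptions): build nonlinear functions of the sensing matrix by appending pairwise-interaction columns, then run the same L1-penalized regression.

```python
# CS applied to nonlinear genetic models: expand the genotype matrix with
# products g_i * g_j and let the lasso select additive and epistatic terms.
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 600, 30
G = rng.integers(0, 3, size=(n, p)).astype(float)    # 0/1/2 allele counts
G = (G - G.mean(axis=0)) / G.std(axis=0)             # z-score each locus

pairs = list(combinations(range(p), 2))
X = np.hstack([G] + [G[:, [i]] * G[:, [j]] for i, j in pairs])

# sparse truth: two additive effects plus one gene-gene interaction
signal = 1.0 * G[:, 0] - 0.8 * G[:, 5] + 0.6 * G[:, 2] * G[:, 9]
y = signal + rng.standard_normal(n) * signal.std()   # h2 ~ 0.5

names = [f"g{i}" for i in range(p)] + [f"g{i}*g{j}" for i, j in pairs]
picked = np.flatnonzero(Lasso(alpha=0.05, max_iter=10000).fit(X, y).coef_)
print("selected terms:", [names[k] for k in picked])
```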

Tuesday, October 15, 2013

Compressed sensing and genomes

For more discussion of our recent paper (The human genome as a compressed sensor), see this blog post by my collaborator Carson Chow and another on the machine learning blog Nuit Blanche. One of our main points in the paper is that the phase transition between the regimes of poor and good recovery of the L1 penalized algorithm (LASSO) is readily detectable, and that the scaling behavior of the phase boundary allows theoretical estimates for the necessary amount of data required for good performance at a given sparsity. Apparently, this reasoning has appeared before in the compressed sensing literature, and has been used to optimize hardware designs for sensors. In our case, the sensor is the human genome, and its statistical properties are fixed. Fortunately, we find that genotype matrices are in the same universality class as random matrices, which are good compressed sensors.

The black line in the figure below is the theoretical prediction (Donoho 2006) for the location of the phase boundary. The shading shows results from our simulations. The scale on the right is L2 (norm squared) error in the recovered effects vector compared to the actual effects.


Perhaps we are approaching a D-T moment in genomics ;-)
... a Donoho-Tao moment in the Radar community at the next CoSeRa meeting :-). As a reminder the Donoho-Tao moment was well put in this 2008 IPAM newsletter: .... It’s David Donoho [5] reportedly exclaiming [to] a panel of NSF folks “You’ve got Terry Tao (a Fields medalist [6]) talking to geoscientists, what do you want?” ....

In previous discussions I predicted that of order millions of phenotype-genotype pairs would be sufficient to extract the genetic architecture of complex traits like height or g. This estimate is based on two ingredients:

1. The sparsity of these traits is probably no greater than s ~ 10k (evidence for this comes from looking at genomic Hamming distance as a function of phenotype distance).

2. The compressed sensing results suggest that good recovery can be achieved above a data threshold of roughly n ~ 30 s (assuming 1E06 SNPs and additive heritability h2 = 0.5 or so).

Including an extra order of magnitude to be safe, this leads to n ~ millions.
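Spelling out the arithmetic (my restatement of the numbers already quoted above; the factor of ten is the safety margin just mentioned):

\[
n \;\gtrsim\; C\, s \log p \;\approx\; 30 \times 10^{4} \;=\; 3 \times 10^{5},
\qquad
10 \times 3 \times 10^{5} \;=\; 3 \times 10^{6} \;\sim\; \text{millions}.
\]

(For p ~ 1E06, the quoted C log p ~ 30 corresponds to C ~ 2, since ln 10^6 ≈ 13.8.)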

Sunday, October 11, 2015

Additivity in yeast quantitative traits



A new paper from the Kruglyak lab at UCLA shows yet again (this time in yeast) that population variation in quantitative traits tends to be dominated by additive effects. There are deep evolutionary reasons for this to be the case -- see excerpt below (at bottom of this post). For other examples, including humans, mice, chickens, cows, plants, see links here.
Genetic interactions contribute less than additive effects to quantitative trait variation in yeast (http://dx.doi.org/10.1101/019513)

Genetic mapping studies of quantitative traits typically focus on detecting loci that contribute additively to trait variation. Genetic interactions are often proposed as a contributing factor to trait variation, but the relative contribution of interactions to trait variation is a subject of debate. Here, we use a very large cross between two yeast strains to accurately estimate the fraction of phenotypic variance due to pairwise QTL-QTL interactions for 20 quantitative traits. We find that this fraction is 9% on average, substantially less than the contribution of additive QTL (43%). Statistically significant QTL-QTL pairs typically have small individual effect sizes, but collectively explain 40% of the pairwise interaction variance. We show that pairwise interaction variance is largely explained by pairs of loci at least one of which has a significant additive effect. These results refine our understanding of the genetic architecture of quantitative traits and help guide future mapping studies.


Genetic interactions arise when the joint effect of alleles at two or more loci on a phenotype departs from simply adding up the effects of the alleles at each locus. Many examples of such interactions are known, but the relative contribution of interactions to trait variation is a subject of debate [1-5]. We previously generated a panel of 1,008 recombinant offspring ("segregants") from a cross between two strains of yeast: a widely used laboratory strain (BY) and an isolate from a vineyard (RM) [6]. Using this panel, we estimated the contribution of additive genetic factors to phenotypic variation (narrow-sense or additive heritability) for 46 traits and resolved nearly all of this contribution (on average 87%) to specific genome-wide-significant quantitative trait loci (QTL). ...

We detected nearly 800 significant additive QTL. We were able to refine the location of the QTL explaining at least 1% of trait variance to approximately 10 kb, and we resolved 31 QTL to single genes. We also detected over 200 significant QTL-QTL interactions; in most cases, one or both of the loci also had significant additive effects. For most traits studied, we detected one or a few additive QTL of large effect, plus many QTL and QTL-QTL interactions of small effect. We find that the contribution of QTL-QTL interactions to phenotypic variance is typically less than a quarter of the contribution of additive effects. These results provide a picture of the genetic contributions to quantitative traits at an unprecedented resolution.

... One can test for interactions either between all pairs of markers (full scan), or only between pairs where one marker corresponds to a significant additive QTL (marginal scan). In principle, the former can detect a wider range of interactions, but the latter can have higher power due to a reduced search space. Here, the two approaches yielded similar results, detecting 205 and 266 QTL-QTL interactions, respectively, at an FDR of 10%, with 172 interactions detected by both approaches. In the full scan, 153 of the QTL-QTL interactions correspond to cases where both interacting loci are also significant additive QTL, 36 correspond to cases where one of the loci is a significant additive QTL, and only 16 correspond to cases where neither locus is a significant additive QTL.
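The full-scan/marginal-scan distinction above is easy to state in code (my own sketch, not the paper's pipeline; the crude product-term test and all thresholds are assumptions):

```python
# Full scan: test all ~p^2/2 marker pairs for interaction.
# Marginal scan: only test pairs anchored on a significant additive QTL.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 1000, 200
G = rng.standard_normal((n, p))                     # stand-in marker matrix
y = (0.9 * G[:, 0] + 0.7 * G[:, 1]
     + 0.5 * G[:, 0] * G[:, 1] + rng.standard_normal(n))

additive_p = np.array([stats.pearsonr(G[:, i], y)[1] for i in range(p)])
qtl = np.flatnonzero(additive_p < 0.05 / p)         # Bonferroni threshold
pairs = {(min(i, j), max(i, j)) for i in qtl for j in range(p) if j != i}
print(f"full scan: {p * (p - 1) // 2} tests; marginal scan: {len(pairs)} tests")

# crude interaction test: correlate the product term with the phenotype
hits = [(i, j) for i, j in pairs
        if stats.pearsonr(G[:, i] * G[:, j], y)[1] < 0.05 / len(pairs)]
print("interaction hits:", hits)
```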
For related discussion of nonlinear genetic models, see here:
It is a common belief in genomics that nonlinear interactions (epistasis) in complex traits make the task of reconstructing genetic models extremely difficult, if not impossible. In fact, it is often suggested that overcoming nonlinearity will require much larger data sets and significantly more computing power. Our results show that in broad classes of plausibly realistic models, this is not the case.
Determination of Nonlinear Genetic Architecture using Compressed Sensing (arXiv:1408.6583)
Chiu Man Ho, Stephen D.H. Hsu
Subjects: Genomics (q-bio.GN); Applications (stat.AP)

We introduce a statistical method that can reconstruct nonlinear genetic models (i.e., including epistasis, or gene-gene interactions) from phenotype-genotype (GWAS) data. The computational and data resource requirements are similar to those necessary for reconstruction of linear genetic models (or identification of gene-trait associations), assuming a condition of generalized sparsity, which limits the total number of gene-gene interactions. An example of a sparse nonlinear model is one in which a typical locus interacts with several or even many others, but only a small subset of all possible interactions exist. It seems plausible that most genetic architectures fall in this category. Our method uses a generalization of compressed sensing (L1-penalized regression) applied to nonlinear functions of the sensing matrix. We give theoretical arguments suggesting that the method is nearly optimal in performance, and demonstrate its effectiveness on broad classes of nonlinear genetic models using both real and simulated human genomes.
I've discussed additivity many times previously, so I'll just quote below from Additivity and complex traits in mice:
You may have noticed that I am gradually collecting copious evidence for (approximate) additivity. Far too many scientists and quasi-scientists are infected by the epistasis or epigenetics meme, which is appealing to those who "revel in complexity" and would like to believe that biology is too complex to succumb to equations. ...

I sometimes explain things this way:

There is a deep evolutionary reason behind additivity: nonlinear mechanisms are fragile and often "break" due to DNA recombination in sexual reproduction. Effects which are only controlled by a single locus are more robustly passed on to offspring. ...

Many people confuse the following statements:

"The brain is complex and nonlinear and many genes interact in its construction and operation."

"Differences in brain performance between two individuals of the same species must be due to nonlinear (non-additive) effects of genes."

The first statement is true, but the second does not appear to be true across a range of species and quantitative traits. On the genetic architecture of intelligence and other quantitative traits (p.16):
... The preceding discussion is not intended to convey an overly simplistic view of genetics or systems biology. Complex nonlinear genetic systems certainly exist and are realized in every organism. However, quantitative differences between individuals within a species may be largely due to independent linear effects of specific genetic variants. As noted, linear effects are the most readily evolvable in response to selection, whereas nonlinear gadgets are more likely to be fragile to small changes. (Evolutionary adaptations requiring significant changes to nonlinear gadgets are improbable and therefore require exponentially more time than simple adjustment of frequencies of alleles of linear effect.) One might say that, to first approximation, Biology = linear combinations of nonlinear gadgets, and most of the variation between individuals is in the (linear) way gadgets are combined, rather than in the realization of different gadgets in different individuals.

Linear models work well in practice, allowing, for example, SNP-based prediction of quantitative traits (milk yield, fat and protein content, productive life, etc.) in dairy cattle. ...
See also Explain it to me like I'm five years old.

Saturday, January 06, 2018

Institute for Advanced Study: Genomic Prediction of Complex Traits (seminar)


Genomic Prediction of Complex Traits

After a brief review (suitable for physicists) of computational genomics and complex traits, I describe recent progress in this area. Using methods from Compressed Sensing (L1-penalized regression; Donoho-Tanner phase transition with noise) and the UK BioBank dataset of 500k SNP genotypes, we construct genomic predictors for several complex traits. Our height predictor captures nearly all of the predicted SNP heritability for this trait -- thereby resolving the missing heritability problem. Actual heights of most individuals in validation tests are within a few cm of predicted heights. I also discuss application of these methods to cognitive ability and polygenic disease risk: sparsity estimates (of the number of causal loci), combined with phase transition scaling analysis, allow estimates of the amount of data required to construct good predictors. Finally, I discuss how these advances will affect human health and reproduction (embryo selection for In Vitro Fertilization, genetic editing) in the coming decade.

FEATURING
Steve Hsu

SPEAKER AFFILIATION
Michigan State University

I recently gave a similar talk at 23andMe (slides at link).

Note Added: Many people asked for video of this talk, but alas recording talks is not standard practice at IAS. I did give a similar talk using the same slides just a week later at the Allen Institute in Seattle (Symposium on Genetics of Complex Traits): video here.


Some Comments and Slides:

I tried to make the talk understandable to physicists, and at least according to what I was told (and my impression from the questions asked during and after the talk), I largely succeeded. Early on, when I presented the phenotype function y(g), both Nima Arkani-Hamed (my host) and Ed Witten asked some questions about the "units" of the various quantities involved. In the actual computation everything is z-scored: measured in units of SD relative to the sample mean.

I didn't realize until later that there was some confusion about how this is done for the "state variable" of the genetic locus g_i. When the gene array is read, the result is 0, 1, or 2 for homozygous common allele, heterozygous, or homozygous rare allele, respectively. (I might have that backwards but you get the point.) For each locus there is a minor allele frequency (MAF), and this determines the sample average and SD of the distribution of 0's, 1's, and 2's. It is the z-scored version of this variable that appears in the computation.

I didn't realize certain people were following the details so closely in the talk, but I should not be surprised ;-) In the future I'll include a slide specifically on this to avoid confusion.
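For the record, the z-scoring step in code (a minimal sketch of what was just described, not the actual analysis pipeline; drawing toy genotypes under Hardy-Weinberg equilibrium is my assumption):

```python
# Genotypes are coded 0/1/2 (copies of one allele) at each locus; the MAF
# fixes the mean and SD (2*MAF and sqrt(2*MAF*(1-MAF)) under Hardy-Weinberg),
# and the z-scored variable is what enters the computation. Here we simply
# use the empirical per-locus mean and SD.
import numpy as np

rng = np.random.default_rng(5)
n_subjects, n_snps = 500, 8
maf = rng.uniform(0.05, 0.5, size=n_snps)           # minor allele frequencies

G = rng.binomial(2, maf, size=(n_subjects, n_snps)).astype(float)  # 0/1/2
Z = (G - G.mean(axis=0)) / G.std(axis=0)            # z-scored state variables

print(Z.mean(axis=0).round(2))                      # ~ 0 at every locus
print(Z.std(axis=0).round(2))                       # ~ 1 at every locus
```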

Looking at my slide on missing heritability, Witten immediately noted that estimating SNP heritability (as opposed to total or broad sense heritability) is nontrivial and I had to quickly explain the GCTA technique!

During the talk I discussed the theoretical reason we expect to find a lot of additive variance: nonlinear gadgets are fragile (easy to break through recombination in sexual reproduction), whereas additive genetic variance can be reliably passed on and is easy for natural selection to act on***. (See also Fisher's Fundamental Theorem of Natural Selection. More.) Usually these comments pass over the head of the audience but at IAS I am sure quite a few people understood the point.

One non-physicist reader of this blog braved IAS security and managed to attend the lecture. I am flattered, and I invite him to share his impressions in the comments!

Afterwards there was quite a bit of additional discussion which spilled over into tea time. The important ideas: how Compressed Sensing works, the nature of the phase transition, how we can predict the amount of data required to build a good predictor (capturing most of the SNP heritability) using the universality of the phase transition + estimate of sparsity, etc. were clearly absorbed by the people I talked to.

Slides


*** On the genetic architecture of intelligence and other quantitative traits (p.16):
... The preceding discussion is not intended to convey an overly simplistic view of genetics or systems biology. Complex nonlinear genetic systems certainly exist and are realized in every organism. However, quantitative differences between individuals within a species may be largely due to independent linear effects of specific genetic variants. As noted, linear effects are the most readily evolvable in response to selection, whereas nonlinear gadgets are more likely to be fragile to small changes. (Evolutionary adaptations requiring significant changes to nonlinear gadgets are improbable and therefore require exponentially more time than simple adjustment of frequencies of alleles of linear effect.) One might say that, to first approximation, Biology = linear combinations of nonlinear gadgets, and most of the variation between individuals is in the (linear) way gadgets are combined, rather than in the realization of different gadgets in different individuals.

Linear models work well in practice, allowing, for example, SNP-based prediction of quantitative traits (milk yield, fat and protein content, productive life, etc.) in dairy cattle. ...

Sunday, February 16, 2014

Genetic architecture of intelligence (video with improved sound)

This is video of a talk given to cognitive scientists about a year ago (slides). The original video had poor sound quality, but I think I've improved it using Google's video editing tool. I found that the earlier version was OK if you used headphones, but I think they are not necessary for this one.

I don't have any update on the BGI Cognitive Genomics study. We are currently held up by the transition from Illumina to Complete Genomics technology. (Complete Genomics was acquired by BGI last year and they are moving from the Illumina platform to an improved CG platform.) This is highly frustrating but I will just leave it at that.

On one of the last slides I mention Compressed Sensing (L1 penalized convex optimization) as a method for extracting genomic information from GWAS data. The relevant results are in this paper.


Wednesday, October 01, 2014

Adventures in the high dimensional space of genomes


2000+ views in 4 months is not bad considering that this is a genomics paper but uses terms like phase transition, sparsity, L1-penalized regression, Gaussian random matrices, etc. I wish I knew how many views the arXiv and BioMed Central versions of the paper have received. Related posts.
Dear Dr Hsu,

We thought you might be interested to know how many people have read your article:

Applying compressed sensing to genome-wide association studies
Shashaank Vattikuti, James J Lee, Christopher C Chang, Stephen D H Hsu and Carson C Chow
GigaScience, 3:10 (16 Jun 2014)
http://www.gigasciencejournal.com/content/3/1/10

Total accesses to this article since publication: 2266

This figure includes accesses to the full text, abstract and PDF of the article on the GigaScience website. It does not include accesses from PubMed Central or other archive sites (see http://www.biomedcentral.com/about/archive). The total access statistics for your article are therefore likely to be significantly higher. ...
My guess is still that it will take some time before these methods become widely understood in genomics.
Crossing boundaries: ... In a similar way Turing found a home in Cambridge mathematical culture, yet did not belong entirely to it. The division between 'pure' and 'applied' mathematics was at Cambridge then as now very strong, but Turing ignored it, and he never showed mathematical parochialism. If anything, it was the attitude of a Russell that he acquired, assuming that mastery of so difficult a subject granted the right to invade others.

PS I will be at the ASHG meeting in San Diego later this month along with (I think) all of the other authors of the paper. Vattikuti will be presenting a poster.

Tuesday, April 30, 2019

Dialogs


In a high corner office, overlooking Cambridge and the Harvard campus.
How big a role is deep learning playing right now in building genomic predictors?

So far, not a big one. Other ML methods perform roughly on par with DL. The additive component of variance is largest, and we have compressed sensing theorems showing near-optimal performance for capturing it. There are nonlinear effects, and eventually DL will likely be useful for learning multi-locus features. But at the moment everything is limited by statistical power, and nonlinear features are even harder to detect than additive ones. ...

The bottom line is that with enough statistical power predictors will capture the expected heritability for most traits. Are people in your field ready for this?

Some are, but for others it will be very difficult.
Conference on AI and Genomics / Precision Medicine (Boston).
I enjoyed your talk. I work for [leading AgBio company], but my PhD is in Applied Math. We've been computing Net Merit for bulls using SNPs for a long time. The human genetics people have been lagging...

Caught up now, though. And first derivative (sample size growth rate) is much larger...

Yes. It's funny because sperm is priced by Net Merit and when we or USDA revise models some farmers or breeders get very angry because the value of their bull can change a lot!
A Harvard Square restaurant.
I last saw Roman at the Fellows spring dinner, many years ago. I was back from Yale to see friends. He was drinking, with serious intent. He told me about working with Wilson at Cornell. He also told me an old story about Jeffrey and the Higgs mechanism. Jeffrey almost had it, soon after his work on the Goldstone boson. But Sidney talked him out of it -- something to the effect of "if you can only make sense of it in unitary gauge, it must be an artifact" ... Afterwards, at MIT they would say When push comes to shove, Sidney is wrong. ...

Genomics is in the details now. Lots of work to be done, but conceptually it's clear what to do. I wouldn't say that about AGI. There are still important conceptual breakthroughs that need to be made.
The Dunster House courtyard, overlooking the Charles.
We used to live here, can you let us in to look around?

I remember it all -- the long meals, the tutors, the students, the concerts in the library. Yo Yo Ma and Owen playing together.

A special time, at least for us. But long vanished except in memory.

Wheeler used to say that the past only exists as memory records.

Not very covariant! Why not a single four-manifold that exists all at once?
The Ritz-Carlton.
Flying private is like crack. Once you do it, you can't go back...
It's not like that. They never give you a number. They just tell you that the field house is undergoing a renovation and there's a naming opportunity. Then your kid is on the right list. They've been doing this for a hundred years...

Card had to do the analysis that way. Harvard was paying him...

I went to the session on VC for newbies. Now I realize "valuation" is just BS... Now you see how it really works...

Then Bobby says "What's an LP? I wanna be an LP because you gotta keep them happy."

Let me guess, you want a dataset with a million genomes and FICO scores?

I've helped US companies come to China for 20+ years. At first it was rough. Now if I'm back in the states for a while and return, Shenzhen seems like the Future. The dynamism is here.

To most of Eurasia it just looks like two competing hegemons. Both systems have their pluses and minuses, but it's not an existential problem...

Sure, Huawei is a big threat because they won't put in backdoors for the NSA. Who was tapping Merkel's cellphone? It was us...

Humans are just smart enough to create an AGI, but perhaps not smart enough to create a safe one.

Maybe we should make humans smarter first, so there is a better chance that our successors will look fondly on us. Genetically engineered super-geniuses might have a better chance at implementing Asimov's Laws of Robotics.  

Friday, November 09, 2018

DeepMind Talk: Genomic Prediction of Complex Traits and Disease Risks via Machine Learning


I'll be at DeepMind in London next week to give the talk below. Quite a thrill for me given how much I've admired their AI breakthroughs in recent years. Perhaps AlphaGo can lead to AlphaGenome :-)

Hope the weather holds up!
Title: Genomic Prediction of Complex Traits and Disease Risks via Machine Learning

Abstract: After a brief review (suitable for non-specialists) of computational genomics and complex traits, I describe recent progress in this area. Using methods from Compressed Sensing (L1-penalized regression; Donoho-Tanner phase transition with noise) and the UK BioBank dataset of 500k SNP genotypes, we construct genomic predictors for several complex traits. Our height predictor captures nearly all of the predicted SNP heritability for this trait -- actual heights of most individuals in validation tests are within a few cm of predicted heights. I also discuss application of these methods to cognitive ability and polygenic disease risk: sparsity estimates (of the number of causal loci), combined with phase transition scaling analysis, allow estimates of the amount of data required to construct good predictors. We can now identify risk outliers for conditions such as heart disease, diabetes, breast cancer, hypothyroidism, etc. using inexpensive genotyping. Finally, I discuss how these advances will affect human reproduction (embryo selection for In Vitro Fertilization (IVF); gene editing) in the coming decade.

Bio: Stephen Hsu is VP for Research and Professor of Theoretical Physics at Michigan State University. He is also a researcher in computational genomics and founder of several Silicon Valley startups, ranging from information security to biotech. Educated at Caltech and Berkeley, he was a Harvard Junior Fellow and held faculty positions at Yale and the University of Oregon before joining MSU.

Action Photos!

Friday, January 01, 2016

GCTA, Missing Heritability, and All That

New Update: Yang, Visscher et al. respond here.

Update: see detailed comments and analysis here and here by Sasha Gusev. Gusev claims that the problems identified in Figs. 4 and 7 are the result of incorrect calculation of the standard errors (Fig. 4) and failure to exclude related individuals in the Framingham data (Fig. 7).


Bioinformaticist E. Stovner asked about a recent PNAS paper which is critical of GCTA. My comments are below.

It's a shame that we don't have a better online platform (e.g., like Quora or StackOverflow) for discussing scientific papers. This would allow the authors of a paper to communicate directly with interested readers, immediately after the paper appears. If the authors of this paper want to correct my misunderstandings, they are welcome to comment here!
I took a quick look at it. My guess is that Visscher et al. will respond to the paper. It has not changed my opinion of GCTA. Note I have always thought the standard errors quoted for GCTA are too optimistic, as the method makes strong assumptions (e.g., fits a model with Gaussian random effect sizes). But directionally it is obvious that total h2 accounted for by all common SNPs is much larger than what you get from only genome wide significant hits obtained in early studies. For example, the number of genome wide significant hits for some traits (e.g., height) has been growing steadily, along with h2 accounted for using just those hits, eventually approaching the GCTA prediction. That is, even *without* GCTA the steady progress of GWAS shows that common variants account for significant heritability (amount of "missing" heritability steadily declines with GWAS sample size), so the precise reliability of GCTA becomes less important.

Regarding this paper, they make what sound like strong theoretical points in the text, but the simulation results don't seem to justify the aggressive rhetoric. The only point they really make in Figs. 4 and 7 is that the error estimates from GCTA are way off when SNP coverage is inadequate (i.e., using 5k out of 50k SNPs). But this doesn't correspond to any real-world study that we care about. Real-world results show that the h2 estimate asymptotes (approaches its limiting value) once the number of SNPs used reaches a few hundred thousand, because at that point there is enough coverage of common variants. The authors of the paper seem confused about this point -- see the "Saturation of heritability estimates" section.

What they should do is simulate repeatedly with multiple disjoint populations (using good SNP coverage) and see how the heritability results fluctuate. But I think that kind of calculation has been done by other people and does not show large fluctuations in h2.

Well, since you got me to write this much already I suppose I should promote this to an actual blog post at some point ... Please keep in mind that I've only given the paper a quick read so I might be missing something important. Happy New Year!
Here is the paper:
Limitations of GCTA as a solution to the missing heritability problem
http://www.pnas.org/content/early/2015/12/17/1520109113

The genetic contribution to a phenotype is frequently measured by heritability, the fraction of trait variation explained by genetic differences. Hundreds of publications have found DNA polymorphisms that are statistically associated with diseases or quantitative traits [genome-wide association studies (GWASs)]. Genome-wide complex trait analysis (GCTA), a recent method of analyzing such data, finds high heritabilities for such phenotypes. We analyze GCTA and show that the heritability estimates it produces are highly sensitive to the structure of the genetic relatedness matrix, to the sampling of phenotypes and subjects, and to the accuracy of phenotype measurements. Plausible modifications of the method aimed at increasing stability yield much smaller heritabilities. It is essential to reevaluate the many published heritability estimates based on GCTA.
It's important to note that although GCTA fits a model with random effects, it purports to estimate the heritability of more realistic genetic architectures with some other (e.g., sparse) distribution of effect sizes (see Lee and Chow paper at bottom of this post). The authors of this PNAS paper seem to take the random effects assumption more seriously than the GCTA originators themselves. The latter fully expected a saturation effect once enough SNPs are used; the former seem to think it violates the fundamental nature of the model. Indeed, AFAICT, the toy models in the PNAS simulations assume all 50k SNPs affect the trait, and they run simulations where only 5k at a time are included in the computation. This is likely the opposite of the real world situation, in which a relatively small number (e.g., ~10k SNPs) affect the trait, and by using a decent array with > 200k SNPs one already obtains sensitivity to the small subset.
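The saturation behavior is easy to see in a toy simulation (my own, not from either paper): place all the causal SNPs among the first 1,000 markers to mimic adequate coverage, then estimate h2 with a crude Haseman-Elston-style regression (pairwise phenotype products regressed on relatedness; see the sketch further below) as more markers are added to the relatedness matrix.

```python
# The h2 estimate rises while markers covering the causal subset are being
# added, then saturates: extra null markers do not change it. All sizes are
# toy assumptions, and the estimator is deliberately crude and noisy.
import numpy as np

rng = np.random.default_rng(6)
n, p, s, h2 = 1000, 5000, 100, 0.5
Z = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[rng.choice(1000, s, replace=False)] = rng.standard_normal(s) * np.sqrt(h2 / s)
y = Z @ beta + rng.standard_normal(n) * np.sqrt(1 - h2)
y = (y - y.mean()) / y.std()

iu = np.triu_indices(n, k=1)                        # distinct pairs only
yy = y[iu[0]] * y[iu[1]]
for m in (250, 500, 1000, 2500, 5000):              # markers in the GRM
    a = (Z[:, :m] @ Z[:, :m].T / m)[iu]
    print(f"m = {m:4d}   h2 estimate = {np.dot(a, yy) / np.dot(a, a):.2f}")
```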

One can easily show that genetic architectures of complex traits tend to be sparse: most of the heritability is accounted for by a small subset of alleles. (Here "small" means a small fraction of ~ millions of SNPs: e.g., 10k SNPs.) See section 3.2 of On the genetic architecture of intelligence and other quantitative traits for an explanation of how to roughly estimate the sparsity using genetic Hamming distances. In our work on Compressed Sensing applied to genomics, we showed that much of the heritability for many complex traits can be recovered if sample sizes of order millions are available for analysis. Once these large data sets are available, this entire debate about missing heritability and GCTA heritability estimates will recede in importance. (See talk and slides here.)

For more discussion, see Why does GCTA work?
This paper, by two of my collaborators, examines the validity of a recently introduced technique called GCTA (Genome-wide Complex Trait Analysis). GCTA allows an estimation of heritability due to common SNPs using relatively small sample sizes (e.g., a few thousand genotype-phenotype pairs). The new method is independent of, but delivers results consistent with, "classical" methods such as twin and adoption studies. To oversimplify, it examines pairs of unrelated individuals and computes the correlation between pairwise phenotype similarity and genotype similarity (relatedness). It has been applied to height, intelligence, and many medical and psychiatric conditions.
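To oversimplify even further, in code (a minimal Haseman-Elston-style sketch of the idea, not the actual GCTA/GREML implementation, which fits the random-effects model by REML; all sizes here are assumptions):

```python
# Regress pairwise phenotype products y_i * y_j on genetic relatedness A_ij
# (off-diagonal entries only, i.e., nominally unrelated pairs). The slope
# estimates the SNP heritability h2.
import numpy as np

rng = np.random.default_rng(7)
n, p, h2 = 800, 2000, 0.5
Z = rng.standard_normal((n, p))                     # z-scored genotypes
beta = rng.standard_normal(p) * np.sqrt(h2 / p)     # small random effects
y = Z @ beta + rng.standard_normal(n) * np.sqrt(1 - h2)
y = (y - y.mean()) / y.std()

A = Z @ Z.T / p                                     # genetic relatedness matrix
iu = np.triu_indices(n, k=1)
a, yy = A[iu], y[iu[0]] * y[iu[1]]
print(f"h2 estimate: {np.dot(a, yy) / np.dot(a, a):.2f}   (simulated h2 = {h2})")
```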

When the original GCTA paper (Common SNPs explain a large proportion of the heritability for human height) appeared in Nature Genetics it stimulated quite a lot of attention. But I was always uncertain of the theoretical justification for the technique -- what are the necessary conditions for it to work? What are conservative error estimates for the derived heritability? My impression, from talking to some of the authors, is that they had a mainly empirical view of these questions. The paper below elaborates significantly on the theory behind GCTA. 
Conditions for the validity of SNP-based heritability estimation

James J Lee, Carson C Chow
doi: 10.1101/003160

...
