
Physicist, Startup Founder, Blogger, Dad

Tuesday, April 01, 2014

Sequencing and GWAS

A very nice discussion of the challenges associated with sequence data, as opposed to SNP array output, in GWAS. All of these issues are familiar to our team as we work with our high cognitive ability sample at BGI.
8 Realities of the Sequencing GWAS

For several years, the genome-wide association study (GWAS) has served as the flagship discovery tool for genetic research, especially in the arena of common diseases. The wide availability and low cost of high-density SNP arrays made it possible to genotype 500,000 or so informative SNPs in thousands of samples. These studies spurred development of tools and pipelines for managing large-scale GWAS, and thus far they’ve revealed hundreds of new genetic associations.

As we all know, the cost of DNA sequencing has plummeted. Now it’s possible to do targeted, exome, or even whole-genome sequencing in cohorts large enough to power GWAS analyses. While we can leverage many of the same tools and approaches developed for SNP array-based GWAS, the sequencing data comes with some very important differences.

...

These caveats of the sequencing GWAS, while important, should not detract from the advantages over SNP array-based experiments. Sequencing studies enable the discovery, characterization, and association of many forms of sequence variation — SNPs, DNPs, indels, etc. — in a single experiment. They capture known as well as unknown variants.

Sequencing also produces an archive that can be revisited and re-analyzed in the future. That’s why submitting BAM files and good clinical data to public repositories — like dbGaP — is so important. Single analyses and meta-analyses of sequencing GWAS may ultimately help us understand the contribution of all forms of genetic variation (common, rare, SNPs, indels) to important human traits.

7 comments:

nooffensebut said...

"Expect Many Ugly Variants... sequencing picks up a lot of the 'ugly' variants that SNP arrays so meticulously avoid: low-complexity regions, indels, triallelic SNPs, tandem repeats, etc... These caveats of the sequencing GWAS, while important, should not detract from the advantages over SNP array-based experiments."

Oh, no! Sequencing will study all of the candidate repeat alleles that the GWAS Jihadists have been dismissing all this time without any evidence at all. Thanks for the warning!

Didn't some mathematicians recently figure out that a whole other set of genes was hidden in centromeres? When does the real whole-genome analysis begin?

a last a loved a long the said...

But in the end it's still just an attempt to find the best linear approximation to the function which takes vertices of a very high dimensional hypercube to the quantitative trait or conditional probability of a binary trait. And obviously a least squares linear approximation means nothing if the conditional expectation surface isn't linear.
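To make the commenter's geometric picture concrete, here is a minimal sketch of a least-squares linear fit over every vertex of a tiny hypercube. The function `additive_fit` and the two toy traits are hypothetical illustrations, not taken from any GWAS tool, and the sketch assumes every genotype combination is equally frequent. The XOR trait is a deliberately extreme case of pure interaction, chosen only to show what the fit reports when the surface is not linear.

```python
from itertools import product


def additive_fit(trait, n_variants):
    """Least-squares linear fit of `trait` over all vertices of {0,1}^n.

    Assumption: every vertex is weighted equally. The 0/1 coordinates are
    then uncorrelated, so each slope equals the mean trait value with the
    variant minus the mean trait value without it.
    """
    vertices = list(product((0, 1), repeat=n_variants))
    values = [trait(v) for v in vertices]
    mean = sum(values) / len(values)

    effects = []
    for j in range(n_variants):
        with_j = [y for v, y in zip(vertices, values) if v[j] == 1]
        without_j = [y for v, y in zip(vertices, values) if v[j] == 0]
        effects.append(sum(with_j) / len(with_j) - sum(without_j) / len(without_j))

    # Each coordinate has mean 0.5 over the full cube.
    intercept = mean - 0.5 * sum(effects)
    return intercept, effects


def additive_trait(g):
    # Hypothetical trait that really is linear in the genotype.
    return 1.0 * g[0] + 0.5 * g[1]


def xor_trait(g):
    # Hypothetical trait driven purely by an interaction between two variants.
    return float(g[0] != g[1])


print(additive_fit(additive_trait, 2))  # (0.0, [1.0, 0.5])
print(additive_fit(xor_trait, 2))       # (0.5, [0.0, 0.0])
```

For the linear trait the fit recovers the effect sizes exactly. For the XOR trait every estimated effect is zero even though the trait depends entirely on the two variants. How close real traits sit to either extreme is an empirical question that this toy example does not settle.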

As soon as one "looks" at the real world gxe to trait function this best linear approximation looks very bad and ridiculous.

So multiple gwases will overlap, but the hard core of non-zero variants will be elusive. It's a wild goose chase.

I like two quotes Visscher gave in his review paper:

From Tim Crow, Molecular Psychiatry 2011:

“There comes a point at which the genetic skeptic can be pardoned the suggestion that if the genes are so small and so multiple, what they are hardly matters, the dividing line between polygenes and no genes is of little practical consequence. Have we reached this point?”

Jonathan Latham, on guardian.co.uk, 17 April 2011:

“Among all the genetic findings for common illnesses, such as heart disease, cancer and mental illnesses, only a handful are of genuine significance for human health. Faulty genes rarely cause, or even mildly predispose us, to disease, and as a consequence the science of human genetics is in deep crisis. Since the Collins paper [Manolio et al. 2009] was published nothing has happened to change that conclusion. It now seems that the original twin-study critics were more right than they imagined. The most likely explanation for why genes for common diseases have not been found is that, with few exceptions, they do not exist.”

a last a loved a long the said...

But in the end it's still just an attempt to find the best linear approximation to the function which takes vertices of a very high dimensional unit cube to the quantitative trait or conditional probability of a binary trait.

As soon as one "looks" at the real world gxe to trait function this best linear approximation looks very bad.

So multiple gwases with L1 "penalization" will overlap, but the hard core will be elusive. It's a wild goose chase.

From Tim Crow, Molecular Psychiatry 2011:

"There comes a point at which the genetic skeptic can be pardoned the suggestion that if the genes are so small and so multiple, what they are hardly matters, the dividing line between polygenes and no genes is of little practical consequence. Have we reached this point?"

BlackRoseML said...

For various reasons, I thought genome sequencing would have diminishing marginal returns compared to GWAS.

steve hsu said...

In the real world plant and animal breeders have fairly accurate predictive models built using similar methods. BTW, effect sizes are not 0 or 1.

a last a loved a long the said...

The vertices of the unit cube indicate genome not effect size. For a long list of variants, the genome is a sequence of 1s and 0s. The linear estimate vector is the vector of effect sizes. Multiply them and you get the estimate of trait or probability of trait. Some vertices will have no value as some 1s require 0s in the genome vector.
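The "multiply them" step described above is just a dot product. A minimal sketch, with a hypothetical `predict_trait` function and made-up genotype and effect-size values chosen purely for illustration:

```python
def predict_trait(genome, effect_sizes, intercept=0.0):
    """Additive prediction: intercept plus the sum of effect sizes of the variants present.

    `genome` is a sequence of 0s and 1s (one entry per variant) and
    `effect_sizes` is the matching sequence of estimated effects.
    """
    if len(genome) != len(effect_sizes):
        raise ValueError("genome and effect_sizes must have the same length")
    return intercept + sum(g * b for g, b in zip(genome, effect_sizes))


# Illustrative values only; these are not estimates from any study.
genome = [1, 0, 1, 1]
effect_sizes = [0.25, -0.5, 0.125, 1.0]

print(predict_trait(genome, effect_sizes))  # 1.375
```

The genome vector picks out which effects are summed, while the effect sizes themselves can take any real value.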



And BTW, humans are homogeneous at genetic level compared to apes, yet are more heterogeneous in behavior than any species. An eskimo isn't born knowing how to hunt narwhal. Domesticated animals and plants may be even more homogeneous genetically, but behavior isn't like other traits, especially in humans.

ronthehedgehog said...

Plomin, R., Caspi, A., Corley, R., Fulker, D.W., & DeFries, J. (1998). Adoption results for self-reported personality: Evidence for nonadditive genetic effects? Journal of Personality and Social Psychology, 75, 211-218.

In contrast to twin studies, we find little evidence for hereditary or shared environmental influences in parent-offspring and sibling analyses of self-report personality data. Although several factors might contribute to the discrepancy between twin and adoption results, we suggest that nonadditive genetic effects, which can be detected by twin studies but not adoption studies, are a likely culprit. This has important implications for attempts to identify specific genes responsible for genetic influence on personality.


Yeah. That's it. That's the ticket.
