Information Processing: Genetic architecture of complex traits and disease risk predictors

Tuesday, July 21, 2020

The published version of the paper below is now available here:
www.nature.com/articles/s41598-020-68881-8

Genetic architecture of complex traits and disease risk predictors

Soke Yuen Yong, Timothy G. Raben, Louis Lello & Stephen D. H. Hsu

Genomic prediction of complex human traits (e.g., height, cognitive ability, bone density) and disease risks (e.g., breast cancer, diabetes, heart disease, atrial fibrillation) has advanced considerably in recent years. Using data from the UK Biobank, predictors have been constructed using penalized algorithms that favor sparsity: i.e., which use as few genetic variants as possible. We analyze the specific genetic variants (SNPs) utilized in these predictors, which can vary from dozens to as many as thirty thousand. We find that the fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits—i.e., existing PRS cannot be computed from exome data alone. We also study the fraction of SNPs and of variance that is in common between pairs of predictors. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.

The paper contains many detailed results, but two main points deserve emphasis:

1. Much of the genetic risk identified in polygenic predictors is outside genic (protein coding) regions, and not accessible through exome sequencing.

2. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint, suggesting that most genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.
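To make "largely disjoint" concrete, here is a minimal sketch of how one might tabulate the fraction of SNPs shared between pairs of predictors. The predictor names and rsIDs are made up for illustration, and `shared_fraction` is a hypothetical helper, not the procedure used in the paper. It compares exact SNP identifiers only; a fuller treatment would compare genomic regions, since distinct nearby SNPs can tag the same causal variant.

```python
from itertools import combinations

# Hypothetical predictor SNP sets; these rsIDs are placeholders, not real data.
predictors = {
    "condition_A": {"rs1", "rs2", "rs3", "rs4"},
    "condition_B": {"rs4", "rs5", "rs6"},
    "condition_C": {"rs7", "rs8"},
}


def shared_fraction(a, b):
    """Fraction of the smaller SNP set that also appears in the other set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Shared fraction for every unordered pair of predictors.
pairwise = {
    (p, q): shared_fraction(predictors[p], predictors[q])
    for p, q in combinations(sorted(predictors), 2)
}

for (p, q), frac in pairwise.items():
    print(f"{p} vs {q}: shared fraction = {frac:.2f}")
```

Normalizing by the smaller set is one choice among several; a Jaccard index (intersection over union) would serve equally well for a rough picture.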

The space of genetic variation is high dimensional, and extends far beyond individual (protein coding) genes. Intuitions about strong pleiotropy are likely wrong -- they were developed before we knew anything about real genetic architectures. There seem to be many causal variants that can be independently modified.

The bioRxiv (preprint) version of the paper was discussed in these earlier posts:

Live Long and Prosper: Genetic Architecture of Complex Traits and Disease Risk Predictors

Pleiotropy: Myths and Reality

From the Pleiotropy post above:

1. Regions of DNA correlated to different disease risks are largely disjoint.

2. It is plausible that causal genetic variants lie in these regions. For example, the predictor SNPs themselves could be causal, or they could tag (be highly correlated in state with) nearby causal variants.

3. Hypothetically, one could edit these causal variants independently, making the beneficiary simultaneously low risk for many conditions. The number of standard deviations of effect size in the polygenic score for each disease that can be modified independently (i.e., without affecting other disease risks or traits) is large and can be directly estimated from our results.

As the figure below (source) makes clear, a change of a few SD (e.g., ~5 SD, from 99th percentile to 1st percentile) in polygenic score for a given disease risk can lead to a 10x or possibly 100x decrease in the absolute probability of having the condition. Our results suggest that the amount of variance available for engineering is much greater than this.
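The percentile-to-SD conversion can be checked with a short sketch. It assumes the polygenic score is standard normal, and, purely for illustration, that risk changes by a constant multiplicative factor per SD (a reasonable approximation only for rare conditions). The function `implied_fold_per_sd` is a hypothetical helper, not something taken from the paper.

```python
from math import exp, log
from statistics import NormalDist

# Assumption: polygenic scores follow a standard normal distribution.
std_normal = NormalDist()

z_high = std_normal.inv_cdf(0.99)  # z-score of the 99th percentile
z_low = std_normal.inv_cdf(0.01)   # z-score of the 1st percentile
sd_gap = z_high - z_low            # distance between them in SD units


def implied_fold_per_sd(total_fold_change, sd_shift):
    """Per-SD risk factor that compounds to total_fold_change over sd_shift SD.

    Assumes a constant multiplicative change in risk per SD (illustrative only).
    """
    return exp(log(total_fold_change) / sd_shift)


print(f"99th to 1st percentile spans {sd_gap:.2f} SD")
for fold in (10, 100):
    per_sd = implied_fold_per_sd(fold, sd_gap)
    print(f"{fold}x total change -> about {per_sd:.2f}x per SD")
```

Under these assumptions the 99th-to-1st percentile gap is about 4.65 SD, and a 10x to 100x total change corresponds to roughly a 1.6x to 2.7x change in risk per SD of score.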

Some orders of magnitude:

1E07 or ~10M common SNP differences between two individuals.

1E04 or ~10k SNPs (on average; could be much fewer) control most of the common variance for a typical complex trait.

So, in principle, there could be 1E03 or ~1k entirely independent complex traits with zero pleiotropy between them. These might include dozens of common disease risks, ~100 cosmetic traits, including facial and body morphology parameters, dozens of psychometric variables, including personality traits, etc. Clearly, individual differences are well accommodated by a ~1k dimensional phenotype space embedded in a ~10M dimensional space of genetic variants.

Of course, it is an unrealistic idealization for the traits to be entirely independent in genetic architecture. We expect that some genetic variants affect more than one trait. But our results suggest that a significant part of the genetic variance of each trait can be modified (e.g., via editing) independently of the other traits. This is simply a consequence of high dimensionality.
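The order-of-magnitude argument can be sketched numerically. The two constants come from the estimates above; everything else is an assumption made for illustration, in particular that each trait's SNPs are drawn uniformly at random from the common SNPs, which real genetic architectures need not satisfy.

```python
import random

N_SNPS = 10_000_000   # ~1E07 common SNP differences (estimate from the text)
K_PER_TRAIT = 10_000  # ~1E04 SNPs per typical complex trait (from the text)

# Upper bound on the number of traits with completely non-overlapping SNP sets.
max_disjoint_traits = N_SNPS // K_PER_TRAIT

# Assumption: SNP sets are uniform random subsets. The expected overlap of two
# such subsets is the hypergeometric mean K * K / N.
expected_shared = K_PER_TRAIT * K_PER_TRAIT / N_SNPS
expected_shared_fraction = expected_shared / K_PER_TRAIT

# One seeded simulation of two random "traits" as a sanity check.
rng = random.Random(0)
trait_a = set(rng.sample(range(N_SNPS), K_PER_TRAIT))
trait_b = set(rng.sample(range(N_SNPS), K_PER_TRAIT))
simulated_shared = len(trait_a & trait_b)

print(f"max disjoint traits: {max_disjoint_traits}")
print(f"expected shared SNPs: {expected_shared} "
      f"({expected_shared_fraction:.1%} of each trait's SNPs)")
print(f"simulated shared SNPs: {simulated_shared}")
```

Even without any mechanism enforcing independence, two random 10k-SNP sets drawn from 10M SNPs are expected to share only about 10 SNPs, or 0.1% of each set. That is the sense in which near-disjointness follows from dimensionality alone.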
