Showing posts with label population structure. Show all posts

Saturday, May 14, 2011

Height, breeding values and selection

I can't wait to read this paper. The results were reported on the blog Genetic Inference, based on a talk at Biology of Genomes 2011.

A few comments:

1. Although known alleles for height account for only 5-10% of variance (out of the expected 80-90% heritability), it is very plausible that loci of smaller effect or lower MAF (minor allele frequency) account for the "missing heritability". We still lack the statistical power to detect most individual loci of this type, but it's a matter of time. See the beautiful paper from Visscher's group. The results described below suggest that loci just below the (arbitrary) significance threshold currently in use might also be height associated. There is a whole distribution of loci with smaller effect sizes and MAF that are just waiting to be discovered -- we have only found the tip of the iceberg.

2. Even with only a fraction of total additive variance identified, one can still make estimates of breeding value for groups by simply computing the prevalence of known associated loci in each group. How indicative these (large effect/MAF) loci are of the actual breeding values can't be answered a priori, but I would bet they are a good indicator, and this seems to be the case for height.

3. If the results on selection hold up this will be clear evidence for differential selection between groups of a quantitative trait (as opposed to lactose or altitude tolerance, which are controlled by small sets of loci). We may soon be able to conclude that there has been enough evolutionary time for selection to work within European populations on a trait that is controlled by hundreds (probably thousands) of loci.

4. With luck we might get to this level of analysis for g in the next 5-10 years. (I originally wrote 3-5 years but one of my more sober collaborators convinced me that would be quite unlikely!)

5. Understanding the evolution and distribution of quantitative traits like height and g at this level is an important milestone in scientific history.
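
To make comment 2 concrete, here is a minimal sketch of a group-level score built from known associated loci: the expected additive contribution is the sum over loci of 2 x (allele frequency) x (per-allele effect). The function name, the frequencies and the effect sizes below are all made up for illustration; this is not the GIANT analysis, just the arithmetic behind "computing the prevalence of known associated loci in each group".

```python
def mean_group_score(freqs, effects):
    """Expected additive score for a group.

    freqs   -- frequency of the trait-increasing allele at each locus (0..1)
    effects -- per-allele effect at each locus (arbitrary units, hypothetical)

    Each diploid individual carries on average 2 * p copies of an allele
    with frequency p, so the expected score is sum(2 * p_i * beta_i).
    """
    if len(freqs) != len(effects):
        raise ValueError("freqs and effects must have the same length")
    return sum(2.0 * p * b for p, b in zip(freqs, effects))


# Hypothetical numbers, for illustration only.
effects = [0.3, 0.2, 0.1]
north = mean_group_score([0.60, 0.50, 0.40], effects)
south = mean_group_score([0.50, 0.45, 0.35], effects)
print(north - south)  # positive if increasing alleles are more common in "north"
```

How well a score built from the handful of known loci tracks the true breeding value is exactly the question that can't be answered a priori.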

It's amazing to see scientific and technological progress verify models that you've had in your head since age 12 :-)

Genetic Inference: ... Europeans differ systematically in their height, and these differences correlate with latitude. The average Italian is 171cm, whereas the average Swede is a full 4cm taller. Are these differences genetic? Have they been under evolutionary selection in recent human history?

Michael Turchin gave some pretty convincing answers to these questions, using genetic data from the 129 thousand individuals in the GIANT consortium. He compared the frequencies of alleles that are known to increase height, and found that they are more common in Northern Europe. Interestingly, he found the same relationship for alleles that have weaker evidence for height association, showing that there are still a large number of common height variants hiding in the genome, which are also more frequent in Northern Europe.

Height differences are thus heritable, but have they been under evolutionary selection? Or are these differences merely down to genetic drift? This can also be tested using the GIANT data, which shows significant statistical evidence of selection on height variants in recent history. On top of that, the magnitude of the selection is correlated with the effect size of the height variant, providing strong evidence that these variants are being selected specifically for their impact on height.

This is a textbook example of how an evolutionary study should be done; you show a phenotypic difference exists, that it is heritable, and that it is under selection. This opens the question as to why height has been selected in Northern Europe (or shortness in Southern Europe). Could the same data be used to test specific hypotheses there?

Sunday, December 07, 2008

Resolution of population genetic structure

How finely can we resolve the substructure of a population with a given amount of data? The paper below gives a quantitative answer. With current technology, we should have no problem resolving even small national populations (see italicized text in quote below), with nearest neighbor FST as small as .0001 (i.e., 99.99 percent of variation is within-group and only .01 percent between groups)! According to this Table of European, Nigerian and East Asian FSTs, the FST between France and Spain is .0008, whereas between Nigeria and Japan it is about .19.
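
For a feel of how small an FST of .0001 is, here is a rough single-locus sketch using the textbook heterozygosity definition, FST = (H_T - H_S) / H_T, for two equally sized populations. The function name and the allele frequencies are my own illustrative choices; real estimates average over many loci and correct for sample size, which this does not do.

```python
def fst_two_pops(p1, p2):
    """Wright's FST at one biallelic locus for two equal-sized populations.

    p1, p2 -- frequency of the same allele in each population.
    H_T is the expected heterozygosity of the pooled population,
    H_S the average expected heterozygosity within populations.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # locus is fixed in both populations; no variation to partition
    return (h_total - h_within) / h_total


# A one-percentage-point frequency difference near p = 0.5
# gives an FST of roughly .0001.
print(fst_two_pops(0.50, 0.51))
```

In other words, FST of .0001 corresponds to allele frequencies that typically differ by about a percentage point between the two groups.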

Within the European + HapMap sample analyzed here, over 100 statistically significant PCA vectors were identified. That is, there is a >100 dimensional space within which structure can be teased out. (However, the largest single vector accounts for only a percent of total variation, and the integral over all 100 vectors is probably only a few percent.) Norwegians and Swedes could be resolved with 90 percent accuracy. Note the Patterson et al. paper was written before this recent analysis, which confirms their theoretical predictions of sensitivity. (Figure below.)



The first author, Nick Patterson (profiled here), is a mathematician turned cryptographer turned quant (Renaissance) turned bioinformaticist.

Population Structure and Eigenanalysis

Nick Patterson et al. (Broad Institute of Harvard and MIT)

Abstract
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.



...Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, then the BBP threshold for two equal subpopulations, each of size 500, is FST = .0001. An FST value of .001 will thus be trivial to detect. To put this into context, we note that a typical value of FST between human populations in Northern and Southern Europe is about .006 [15]. Thus, we predict: most large genetic datasets with human data will show some detectable population structure.
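
The numbers in the quote can be reproduced with a short sketch. I am assuming the threshold takes the form FST = 1/sqrt(n x m), with n markers and m samples in two equal subpopulations; that form matches the quoted example (100,000 markers, 1,000 samples, threshold .0001), but check the paper for the exact statement and its conditions. The function names are hypothetical.

```python
import math


def bbp_fst_threshold(n_markers, n_samples):
    """Assumed detection threshold: FST = 1 / sqrt(n_markers * n_samples)."""
    return 1.0 / math.sqrt(n_markers * n_samples)


def markers_needed(fst, n_samples):
    """Invert the assumed threshold: markers needed so that `fst` sits at it."""
    return math.ceil(1.0 / (fst ** 2 * n_samples))


# Reproduces the example in the quote: 100,000 markers, 1,000 samples.
print(bbp_fst_threshold(100_000, 1_000))

# Markers needed to sit at the threshold for France vs. Spain (FST = .0008)
# with 1,000 samples, under the same assumption.
print(markers_needed(0.0008, 1_000))
```

Because the threshold falls off as the square root of (markers x samples), the same sensitivity can be bought with more markers or with more individuals.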
