Showing posts with label fst. Show all posts

Thursday, June 25, 2009

Genetic clustering: 40 years of progress

Represent each individual human by their DNA sequence. When the sequences are compared, individuals cluster into readily identifiable groups. This has been known for 40 years now, although the technology and methods of analysis continue to improve. Below are results from 1966, 1978 and 2008.

If this seems counterintuitive to you, it might be because the space of genetic variation is of very high dimension. See here for more discussion and an illustration.


Population Structure and Human Evolution
L. L. Cavalli-Sforza

Proceedings of the Royal Society of London. Series B, Biological Sciences, Vol. 164, No. 995, A Symposium from Mendel's Factors to the Genetic Code (Mar. 22, 1966), pp. 362-379
http://www.jstor.org/stable/pdfplus/75457.pdf




Measurement of Differentiation: Reply to Lewontin, Powell, and Taylor
Jeffry B. Mitton

The American Naturalist, Vol. 112, No. 988 (Nov. - Dec., 1978), pp. 1142-1144
http://www.jstor.org/stable/2460361?origin=JSTOR-pdf





Current state of the art, as discussed here. Figure: The three clusters shown below are European (top, green + red), Nigerian (light blue) and E. Asian (purple + blue).



According to the mathematical analysis given in this paper, populations with FST as low as .0001 can be resolved with current technology. (Typical FST between northern and southern Europe is about .006, between Europe and E. Asia about .1, and between Europe and Nigeria about .14.)

Sunday, December 07, 2008

Resolution of population genetic structure

How finely can we resolve the substructure of a population with a given amount of data? The paper below gives a quantitative answer. With current technology, we should have no problem resolving even small national populations (see italicized text in quote below), with nearest-neighbor FST as small as .0001 (i.e., 99.99 percent of variation is within-group and only .01 percent between groups)! According to this Table of European, Nigerian and East Asian FSTs, the FST between France and Spain is .0008, whereas between Nigeria and Japan it is about .19.

Within the European + HapMap sample analyzed here, over 100 statistically significant PCA vectors were identified. That is, there is a >100 dimensional space within which structure can be teased out. (However, the largest single vector accounts for only about a percent of total variation, and the sum over all 100 vectors is probably only a few percent.) Norwegians and Swedes could be resolved with 90 percent accuracy. Note that the Patterson et al. paper was written before this recent analysis, which confirms their theoretical predictions of sensitivity. (Figure below.)



The first author, Nick Patterson (profiled here), is a mathematician turned cryptographer turned quant (Renaissance) turned bioinformaticist.

Population Structure and Eigenanalysis

Nick Patterson et al. (Broad Institute of Harvard and MIT)

Abstract
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.



...Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, then the BBP threshold for two equal subpopulations, each of size 500, is FST = .0001. An FST value of .001 will thus be trivial to detect. To put this into context, we note that a typical value of FST between human populations in Northern and Southern Europe is about .006 [15]. Thus, we predict: most large genetic datasets with human data will show some detectable population structure.
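The threshold in the quote can be reproduced in a few lines of Python. This is only a sketch: it assumes the threshold has the form FST ≈ 1/sqrt(n·m) for n markers and m samples split into two equal subpopulations. That formula is an assumption here (it is not stated in the excerpt), although it does reproduce the .0001 figure. The function name is hypothetical.

```python
import math

def bbp_fst_threshold(num_markers, num_samples):
    """Assumed detection threshold: FST ~ 1 / sqrt(markers * samples)."""
    return 1.0 / math.sqrt(num_markers * num_samples)

# Numbers from the quoted passage: 100,000 markers, 1,000 samples.
threshold = bbp_fst_threshold(100_000, 1_000)
print(f"threshold FST = {threshold}")

# FST values quoted in this post, compared with that threshold.
quoted_fst = {
    "France vs. Spain": 0.0008,
    "Northern vs. Southern Europe": 0.006,
    "Nigeria vs. Japan": 0.19,
}
for pair, fst in quoted_fst.items():
    print(f"{pair}: FST = {fst}, about {fst / threshold:.0f}x the threshold")
```

Under this assumption, every FST value quoted in the post sits well above the threshold for a dataset of that size.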

Saturday, November 29, 2008

Human genetic variation, Fst and Lewontin's fallacy in pictures

In an earlier post, European genetic substructure, I displayed the following graphic, illustrating the genetic clustering of human populations.




Figure: The three clusters shown above are European (top, green + red), Nigerian (light blue) and E. Asian (purple + blue).

The figure seems to contradict an often-stated observation about human genetic diversity: genetic variation between two random individuals in a given population accounts for 80% or more of the total variation within the entire human population. The conclusion usually drawn from this, known among experts as Lewontin's fallacy, is that any classification of humans into groups ("races") based on genetic information is impossible. ("More variation within groups than between groups.")

To understand this statement better, consider the F statistic of population genetics, introduced by Sewall Wright:

Fst = 1 - Dw / Db

Db and Dw represent the average number of pairwise differences between two individuals sampled from different populations (Db = "difference between") or the same population (Dw = "difference within"). Even in the most widely separated human populations, Fst < .2, so Dw / Db > .8 (roughly). This may not sound like very much genetic diversity, but it is more than in many other animal species. See here for recent high-statistics Fst values by nationality.

Dw / Db > .8 means that the average genetic distance measured in number of base pair differences between two members of a group (e.g., two randomly selected Europeans) is at least 80 percent of the average distance between distant groups (e.g., Europeans and Asians or Africans). In other words, if two individuals from very distant groups (e.g., a Japanese and a Nigerian) have on average N base pair differences, then two from the same group (e.g., two Nigerians or two Japanese) will on average have roughly .8 N base pair differences.
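The arithmetic above is short enough to write out. A minimal sketch, using the definition Fst = 1 - Dw / Db from this post; the function names are hypothetical, and the value N = 1000 is purely illustrative, not a measured count.

```python
def fst(d_within, d_between):
    """Fst = 1 - Dw / Db, with Dw and Db the mean pairwise differences."""
    return 1.0 - d_within / d_between

def within_group_differences(fst_value, d_between):
    """Invert the definition: Dw = (1 - Fst) * Db."""
    return (1.0 - fst_value) * d_between

# Illustrative only: suppose two people from distant groups differ at N sites.
n_between = 1000.0

# With Fst = .2, two people from the same group differ at roughly .8 N sites.
d_within = within_group_differences(0.2, n_between)
print(d_within, fst(d_within, n_between))
```

The smaller Fst is, the closer the within-group distance gets to the between-group distance.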

How can the Fst result ("more variation within groups than between groups") be consistent with the clusters shown in the figure? I've had to explain this on numerous occasions, always with great difficulty because the explanation requires a little mathematics. In order to make the point more accessible, I've created the figures below, which show two population clusters, each represented by an ellipsoid (blob). The different figures depict the same pair of objects, just viewed from different angles.

The blobs are constructed and arranged so that the average distance between two points (individuals) within the same cluster is almost as big as the average distance between two points (individuals) in different clusters. This is easy to achieve if the ellipsoids are big and flat (like pancakes) and placed close to each other along the flat directions. The figure is meant to show how one can have small Fst, as in humans, yet easily resolved clusters. The direction in which the gap between the clusters appears is one of the principal components in the space of human genetic variation, as recently found by bioinformaticists. The figure at the top of this post plots individuals as points in the space generated by the two largest principal components extracted from the combination of data from HapMap and from large statistics sampling of Europeans. Exhibited this way, isolated clusters ("races") are readily apparent.

The real space of genetic variation has many more than 3 dimensions, so it can't be easily visualized. But some aspects of the figures below still apply: there will be particular directions of variation over which different populations are more or less identical (orthogonal to the principal component; i.e. along the flat directions of each pancake), and there will be directions in which different populations differ radically and have little or no overlap. Note, however, that we are specifically referring to genetic variation, which may or may not translate into phenotypic variation.
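The pancake picture can also be checked numerically. The sketch below is a toy model, not real genetic data: it draws two Gaussian clusters in 30 dimensions that are wide along 29 axes and thin along one, with the cluster centers offset only along the thin axis. All parameter values (dimension, widths, gap, sample size) are assumptions chosen for illustration, and ordinary Euclidean distance stands in for the number of pairwise differences.

```python
import math
import random

def make_cluster(n, dim, center_thin, rng, wide_sd=1.0, thin_sd=0.2):
    """Return n points; coordinate 0 is the thin axis, the rest are wide."""
    return [
        [rng.gauss(center_thin, thin_sd)]
        + [rng.gauss(0.0, wide_sd) for _ in range(dim - 1)]
        for _ in range(n)
    ]

def mean_distance(a, b, same):
    """Mean Euclidean distance over pairs (each unordered pair once if same)."""
    total, count = 0.0, 0
    for i, p in enumerate(a):
        for j, q in enumerate(b):
            if same and j <= i:
                continue  # skip self-pairs and duplicate pairs
            total += math.dist(p, q)
            count += 1
    return total / count

def simulate(n=100, dim=30, gap=3.0, seed=0):
    rng = random.Random(seed)
    a = make_cluster(n, dim, -gap / 2, rng)
    b = make_cluster(n, dim, +gap / 2, rng)
    d_within = 0.5 * (mean_distance(a, a, True) + mean_distance(b, b, True))
    d_between = mean_distance(a, b, False)
    fst_like = 1.0 - d_within / d_between
    # Classify each point by the sign of its coordinate along the thin axis.
    correct = sum(p[0] < 0 for p in a) + sum(q[0] > 0 for q in b)
    return d_within, d_between, fst_like, correct / (2 * n)

d_within, d_between, fst_like, accuracy = simulate()
print(f"Dw = {d_within:.2f}, Db = {d_between:.2f}")
print(f"Fst-like ratio = {fst_like:.3f}, accuracy = {accuracy:.3f}")
```

In this toy setup the within-cluster distance comes out only slightly smaller than the between-cluster distance, yet a single cut along the thin axis assigns essentially every point to the correct cluster.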







Related posts: "no scientific basis for race", metric on the space of genomes.

The existence of this clustering has been known for 40 years.
