Saturday, November 29, 2008

Human genetic variation, Fst and Lewontin's fallacy in pictures

In an earlier post, European genetic substructure, I displayed the following graphic, which illustrates the genetic clustering of human populations.

Figure: The three clusters shown above are European (top, green + red), Nigerian (light blue) and E. Asian (purple + blue).

The figure seems to contradict an often-stated observation about human genetic diversity: genetic variation between two random individuals in a given population accounts for 80% or more of the total variation within the entire human population. The inference commonly drawn from this observation has become known among experts as Lewontin's fallacy: that any classification of humans into groups ("races") based on genetic information is therefore impossible. ("More variation within groups than between groups.")

To understand this statement better, consider the F statistic of population genetics, introduced by Sewall Wright:

Fst = 1 - Dw / Db

Db and Dw represent the average number of pairwise differences between two individuals sampled from different populations (Db = "difference between") or from the same population (Dw = "difference within"). Even in the most widely separated human populations, Fst < 0.2, so Dw / Db > 0.8 (roughly). This may not sound like very much genetic diversity, but it is more than in many other animal species. See here for recent high-statistics Fst values by nationality.

Dw / Db > 0.8 means that the average genetic distance, measured in number of base pair differences, between two members of the same group (e.g., two randomly selected Europeans) is at least 80 percent of the average distance between members of distant groups (e.g., Europeans and Asians or Africans). In other words, if two individuals from very distant groups (e.g., a Japanese and a Nigerian) have on average N base pair differences, then two from the same group (e.g., two Nigerians or two Japanese) will on average have roughly 0.8 N base pair differences.
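As a rough illustration of the definition Fst = 1 - Dw / Db, here is a minimal Python sketch that counts pairwise differences directly. Everything in it is an assumption made for illustration: the allele frequencies are simulated (not real data), each individual is treated as a single sequence of 0/1 sites, and the function names `mean_pairwise_diff` and `fst` are hypothetical.

```python
import numpy as np

def mean_pairwise_diff(a, b=None):
    """Average number of differing sites between pairs of rows.

    With b=None, pairs are distinct rows of a (a "within" average);
    otherwise each pair has one row from a and one from b ("between").
    """
    if b is None:
        n = len(a)
        diffs = (a[:, None, :] != a[None, :, :]).sum(axis=2)
        return diffs[np.triu_indices(n, k=1)].mean()
    return (a[:, None, :] != b[None, :, :]).sum(axis=2).mean()

def fst(pop1, pop2):
    """Return (Fst, Dw, Db) with Fst = 1 - Dw / Db."""
    dw = 0.5 * (mean_pairwise_diff(pop1) + mean_pairwise_diff(pop2))
    db = mean_pairwise_diff(pop1, pop2)
    return 1.0 - dw / db, dw, db

rng = np.random.default_rng(0)
n_sites = 2000

# Hypothetical allele frequencies: population 2 is a modest
# perturbation of population 1 at every site.
p1 = rng.uniform(0.1, 0.9, n_sites)
p2 = np.clip(p1 + rng.normal(0.0, 0.2, n_sites), 0.01, 0.99)

# 50 simulated individuals per population, one 0/1 value per site.
pop1 = (rng.random((50, n_sites)) < p1).astype(np.int8)
pop2 = (rng.random((50, n_sites)) < p2).astype(np.int8)

f, dw, db = fst(pop1, pop2)
print(f"Dw = {dw:.1f}, Db = {db:.1f}, Dw/Db = {dw / db:.3f}, Fst = {f:.3f}")
```

With these made-up parameters, Dw comes out only slightly smaller than Db, so Fst is small, which is the regime described above for humans.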

How can the Fst result ("more variation within groups than between groups") be consistent with the clusters shown in the figure? I've had to explain this on numerous occasions, always with great difficulty because the explanation requires a little mathematics. In order to make the point more accessible, I've created the figures below, which show two population clusters, each represented by an ellipsoid (blob). The different figures depict the same pair of objects, just viewed from different angles.

The blobs are constructed and arranged so that the average distance between two points (individuals) within the same cluster is almost as big as the average distance between two points (individuals) in different clusters. This is easy to achieve if the ellipsoids are big and flat (like pancakes) and placed close to each other along the flat directions. The figure is meant to show how one can have small Fst, as in humans, yet easily resolved clusters. The direction in which the gap between the clusters appears is one of the principal components in the space of human genetic variation, as recently found by bioinformaticists. The figure at the top of this post plots individuals as points in the space generated by the two largest principal components extracted from the combination of data from HapMap and from large statistics sampling of Europeans. Exhibited this way, isolated clusters ("races") are readily apparent.
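The pancake picture can be checked numerically. The sketch below is only a geometric analogy under assumed numbers (the blob widths, thickness and gap are invented for illustration): it builds two flat Gaussian blobs, compares average within-blob and between-blob distances, and then separates the blobs with a single cut along the thin direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical pancake shape: wide in x and y, thin in z.
scales = np.array([10.0, 10.0, 0.5])
gap = 4.0  # offset between the two blobs along the thin (z) direction

a = rng.normal(size=(n, 3)) * scales
b = rng.normal(size=(n, 3)) * scales + np.array([0.0, 0.0, gap])

def mean_dist_within(x):
    """Mean Euclidean distance between distinct points of one blob."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    m = len(x)
    return d.sum() / (m * (m - 1))  # self-distances are zero, so exclude them

def mean_dist_between(x, y):
    """Mean Euclidean distance between a point of x and a point of y."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2).mean()

dw = 0.5 * (mean_dist_within(a) + mean_dist_within(b))
db = mean_dist_between(a, b)

# Classify every point with one cut halfway across the gap.
z = np.concatenate([a[:, 2], b[:, 2]])
truth = np.concatenate([np.zeros(n, dtype=bool), np.ones(n, dtype=bool)])
accuracy = ((z > gap / 2) == truth).mean()

print(f"within/between distance ratio = {dw / db:.3f}")
print(f"accuracy of a single cut along z = {accuracy:.3f}")
```

The within/between ratio stays close to 1 because most of the distance between any two points comes from the wide directions, while nearly every point is still assigned to the correct blob by looking only at the thin direction.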

The real space of genetic variation has many more than 3 dimensions, so it can't be easily visualized. But some aspects of the figures below still apply: there will be particular directions of variation over which different populations are more or less identical (orthogonal to the principal component; i.e. along the flat directions of each pancake), and there will be directions in which different populations differ radically and have little or no overlap. Note, however, that we are specifically referring to genetic variation, which may or may not translate into phenotypic variation.
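The same effect can be sketched in a space with thousands of dimensions. The example below uses simulated genotypes, not HapMap data, and all parameters (sample sizes, number of loci, size of the frequency differences) are assumptions chosen for illustration. It computes the Fst implied by the simulated allele frequencies and then runs a principal component analysis on the pooled sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per_pop, n_loci = 200, 3000

# Hypothetical allele frequencies that differ only slightly per locus.
p1 = rng.uniform(0.2, 0.8, n_loci)
p2 = np.clip(p1 + rng.normal(0.0, 0.1, n_loci), 0.01, 0.99)

# Fst implied by these frequencies: per locus, Db - Dw = (p1 - p2)^2.
db_per_locus = p1 * (1 - p2) + p2 * (1 - p1)
fst_expected = ((p1 - p2) ** 2).sum() / db_per_locus.sum()

# Diploid genotypes: count of one allele (0, 1 or 2) at each locus.
g1 = rng.binomial(2, p1, size=(n_per_pop, n_loci))
g2 = rng.binomial(2, p2, size=(n_per_pop, n_loci))
g = np.vstack([g1, g2]).astype(float)
labels = np.array([0] * n_per_pop + [1] * n_per_pop)

# PCA via SVD of the column-centered genotype matrix.
centered = g - g.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = u[:, 0] * s[0]  # coordinate of each individual along the first PC

# Cut PC1 at zero. The sign of a principal component is arbitrary,
# so score both possible labelings and keep the better one.
pred = (pc1 > 0).astype(int)
accuracy = max((pred == labels).mean(), (pred != labels).mean())

print(f"Fst implied by the simulated frequencies = {fst_expected:.3f}")
print(f"accuracy of a cut along PC1 = {accuracy:.3f}")
```

In this toy setup the population difference at any single locus is tiny, but the differences add up along one direction of the space, and the first principal component picks out that direction without being told which individual belongs to which population.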

Related posts: "no scientific basis for race", metric on the space of genomes.

The existence of this clustering has been known for 40 years.


Carson C. Chow said...

Nice post Steve. Are these just for SNP's or do they include copy number variations as well?

Steve Hsu said...

The specific data in the first figure is SNPs, although clustering is observed on essentially every type of genetic information thus far examined. I haven't seen specific results for copy number variation -- if you can find some, please let me know.

Anonymous said...


This all seems to come down to a simple principle of logic: the more you know about something, the more unique it will appear to be. If all I know about a man is his height, it will be impossible to distinguish him from millions of other people. The more pieces of information ('dimensions') I have on him, the less he will overlap with other people. Eventually, with enough information, this man will have zero overlap with others. He will be unique.

Lewontin looked at human genetic variation in terms of one variable at a time. So, perhaps unsurprisingly, he found a lot of overlap.

Steve Hsu said...

"Eventually, with enough information, this man will have zero overlap with others. He will be unique."

But what is interesting is that long before each individual becomes unique, one can discern discrete clusters that correspond to traditional folk notions of ethnicity.

If you go back and read what Lewontin wrote, he was trying to use his 85-15 statistic to suggest that this would not be the case.

It *could* have been the case that each population overlaps strongly in *each* direction of gene space, which is what Lewontin wanted to suggest. However, it turns out not to be the case...

"In 1972 Richard Lewontin of Harvard University 'found that nearly 85 per cent of humanity's genetic diversity occurs among individuals within a single population.' 'In other words, two individuals are different because they are individuals, not because they belong to different races.' In 2001, the Human Genome edition of Nature(3) came with a compact disc containing a similar statement, quoted above."

Anonymous said...

Hi, I was wondering if someone could help me out here. I'm trying to use FST statistics to analyze SNP data from the HGDP database.

But I only seem to find the old version of FST, first introduced by Wright. It seems this formula had some flaws, and now geneticists use the FST formula introduced by Cockerham and Weir (1984). I can't seem to find the formula online, much less an explanation of it.

Anyone have a link to a webpage that deals with the FST formula by Cockerham and Weir?


Steve Hsu said...

I'm not familiar with the 1984 reference, but the modified FST definition used by the human genome project is given here:

"In the above equation xij is the estimated frequency (proportion) of the minor allele at SNP i in population j, nij is the number of genotyped chromosomes at that position, and nj is the number of chromosomes analysed in that population. The lack of the j subscript in the denominator indicates that statistics ni and xi are calculated across the combined data sets."

botti said...

***This may not sound like very much genetic diversity, but it is more than in many other animal species. ***

The Goodrum link to the FAQ page doesn't seem to work anymore, but there is a pdf by Goodrum.
Same data also used by

Woodley, 2009. Is Homo sapiens polytypic? Human taxonomic diversity and its implications

DevilDocNowCiv said...


To cut to the chase with the non-math-savvy among us: Some insist that pesky "Race-IQ Data" should be made to go away because -- Gosh! We just noticed! Purely coincidentally, this fact becomes useful for this pesky "Race-IQ Data" -- races don't exist! No need for desegregation -- it's just confusion! Racism? Just confusion! Go now amongst the masses, ye converted, and spread the good news -- they are confused! What they think are different races, some very strongly identifiable, some seemingly an identifiable "mix," and other variants -- this is mistaken! Go! Let the vast majority of the world population know it's wrong!
If this idea were truly held, and not promulgated as a "lame" attempt to mitigate the "equality confusion" potentially generated by that pesky "Race-IQ Data," those who argue for the "No Races Theory" would be earnestly spreading the word, and not very, very specifically only using it to mitigate pesky "Race-IQ Data" in article comment blocs and opinion pieces.
