Thursday, January 04, 2007

Metric on the space of genomes and the scientific basis for race

[Note: the science has advanced considerably since I wrote this post. See here for a nice picture of clustering by ethnicity. I've added some comments at the bottom.]


Suppose that the human genome has 30,000 distinct genes, which we will label i = 1, 2, ..., N, where N = 30k. Next, suppose that there are n_i variants or alleles (mutations) of the i-th gene. Then each human's genetic information can be described as a point on a lattice of size n_1 x n_2 x ... x n_N, or equivalently as an N-tuple of integers whose i-th entry ranges from 1 to n_i. For the simplified case where there are exactly 10 variants of each gene, the number of points in this N-dimensional space is 10^N or 10^{30k}, one for each distinct 30k-digit number. It's a space of very high dimension, but this doesn't stop us from defining a metric, or measure of distance, between any two points in the space. (For simplicity we ignore restrictions on this space which might result from incompatibility of certain combinations, etc.)
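To make the counting concrete, here is a minimal Python sketch. It assumes the simplified case above (exactly 10 variants of every gene); the function name log10_lattice_size and the toy genome are illustrative only, not part of any real genomics tool.

```python
import math

def log10_lattice_size(allele_counts):
    """Return log10 of n_1 * n_2 * ... * n_N, the number of lattice points."""
    # Summing logarithms keeps the result readable; the product itself
    # would be an integer with tens of thousands of digits.
    return sum(math.log10(n) for n in allele_counts)

N = 30_000                 # number of genes, as in the text
allele_counts = [10] * N   # simplifying assumption: n_i = 10 for every gene

# One hypothetical genome: an N-tuple whose i-th entry lies in 1..n_i.
genome = tuple(1 + (i % n) for i, n in enumerate(allele_counts))

print(len(genome))                        # number of coordinates
print(log10_lattice_size(allele_counts))  # exponent of the lattice size
```

With unequal n_i the same function still works, since it only needs the list of per-gene variant counts.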

Note that the genomes of all of the humans who have ever lived occupy only a small subset of this space -- most possible variations have never been realized. For this reason, the surprise expressed by biologists that humans have so few genes (not many more than a worm, and far fewer than the 100k of earlier estimates) is no cause for concern -- the number of possible organisms that might result from 30k genes is enormous -- far more than the number of molecules in the visible universe.

To define a metric, we need a notion of how far apart two different alleles are. We can do this by counting base pair differences -- most mutations only alter a few base pairs in the genetic code. We can define the distance between two alleles as the number of base pair changes between them (zero for identical alleles, positive otherwise). Then we can define the distance between two genomes as the sum, over i = 1, 2, ..., N, of the individual gene distances. It is natural, although perhaps not always possible, to choose the labeling of the alleles of each gene to reflect relative distances, so that (say) variants 1 and 2 are close together, and both are very far from variant 10.
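The metric just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each allele is given as a string of bases, the two alleles of a gene have the same length (so only substitutions are counted, not insertions or deletions), and the names allele_distance and genome_distance, along with the three-gene toy genomes, are made up for the example.

```python
def allele_distance(a: str, b: str) -> int:
    """Count the base positions at which two equal-length alleles differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles substitutions; lengths must match")
    return sum(x != y for x, y in zip(a, b))

def genome_distance(g1, g2) -> int:
    """Sum the per-gene allele distances over genes i = 1..N."""
    if len(g1) != len(g2):
        raise ValueError("genomes must list the same genes in the same order")
    return sum(allele_distance(a, b) for a, b in zip(g1, g2))

# Two hypothetical individuals with N = 3 very short "genes".
person_1 = ["ACGT", "GGCA", "TTAG"]
person_2 = ["ACGA", "GGCA", "TAAC"]

print(genome_distance(person_1, person_2))  # 1 + 0 + 2 differing bases
```

Because it is a sum of per-position mismatches, this distance is symmetric, vanishes only for identical genomes, and satisfies the triangle inequality, so it is a metric in the usual sense.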

The exact definition of the metric and the allele labeling are somewhat arbitrary, but you can see it is easy to define a meaningful measure of how far apart any two individuals are in genome space.

Now plot the genome of each human as a point on our lattice. Not surprisingly, there are readily identifiable clusters of points, corresponding to traditional continental ethnic groups: Europeans, Africans, Asians, Native Americans, etc. (See, for example, Risch et al., Am. J. Hum. Genet. 76:268–275, 2005.) Of course, we can get into endless arguments about how we define European or Asian, and of course there is substructure within the clusters, but it is rather obvious that there are identifiable groupings, and as the Risch study shows, they correspond very well to self-identified notions of race.

From the conclusions of the Risch paper (Am. J. Hum. Genet. 76:268–275, 2005):

Attention has recently focused on genetic structure in the human population. Some have argued that the amount of genetic variation within populations dwarfs the variation between populations, suggesting that discrete genetic categories are not useful (Lewontin 1972; Cooper et al. 2003; Haga and Venter 2003). On the other hand, several studies have shown that individuals tend to cluster genetically with others of the same ancestral geographic origins (Mountain and Cavalli-Sforza 1997; Stephens et al. 2001; Bamshad et al. 2003). Prior studies have generally been performed on a relatively small number of individuals and/or markers. A recent study (Rosenberg et al. 2002) examined 377 autosomal microsatellite markers in 1,056 individuals from a global sample of 52 populations and found significant evidence of genetic clustering, largely along geographic (continental) lines. Consistent with prior studies, the major genetic clusters consisted of Europeans/West Asians (whites), sub-Saharan Africans, East Asians, Pacific Islanders, and Native Americans. ... We have shown a nearly perfect correspondence between genetic cluster and SIRE [self-identified race/ethnicity] for major ethnic groups living in the United States, with a discrepancy rate of only 0.14%.

This clustering is a natural consequence of geographical isolation, inheritance and natural selection operating over the last 50k years since humans left Africa.

Every allele probably occurs in each ethnic group, but with varying frequency. Suppose that for a particular gene there are 3 common variants (v1, v2, v3), all the rest being very rare. Then, for example, one might find that in ethnic group A the distribution is v1 75%, v2 15%, v3 10%, while for ethnic group B the distribution is v1 2%, v2 6%, v3 92%. Suppose this pattern is repeated for several genes, with the common variants in population A being rare in population B, and vice versa. Then one might find a very dramatic difference in expressed phenotype between the two populations. For example, if skin color is determined by (say) 10 genes, and those genes have the distribution pattern given above, nearly all of population A might be fair skinned while nearly all of population B is dark, even though there is complete overlap in the set of common alleles. Perhaps having the third type of variant v3 in at least 7 out of 10 pigmentation genes makes you dark. This is highly likely for an individual in population B with the given probabilities, but highly unlikely in population A.
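The "7 out of 10" claim is easy to check numerically. The sketch below treats the toy model literally and adds some assumptions of its own: each of the 10 genes contributes a single allele per individual (diploidy is ignored), the genes are independent, and all 10 share the v3 frequencies quoted above (10% in A, 92% in B). The function name prob_at_least is just a label for a binomial tail sum.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): at least k of n genes carry v3."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

N_GENES = 10     # pigmentation genes in the toy example
THRESHOLD = 7    # number of v3 variants assumed to give the dark phenotype

p_dark_a = prob_at_least(THRESHOLD, N_GENES, 0.10)  # v3 frequency in group A
p_dark_b = prob_at_least(THRESHOLD, N_GENES, 0.92)  # v3 frequency in group B

print(f"group A: {p_dark_a:.2e}")
print(f"group B: {p_dark_b:.4f}")
```

Under these assumptions the probability comes out above 99% for group B and on the order of one in a hundred thousand for group A, even though both groups carry all three variants of every gene.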

We see that there can be dramatic group differences in phenotypes even if there is complete allele overlap between two groups, as long as the frequency or probability distributions are distinct. And it is differences in these distributions that the metric we defined earlier picks up: two groups that form distinct clusters are likely to exhibit different frequency distributions over various genes, leading to group differences.
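To see how differing frequency distributions translate into distance, here is a small sketch using the example frequencies for groups A and B. It swaps in a cruder distance than base pair counting: a gene contributes 1 if two individuals carry different variants and 0 if they carry the same one. The function name mismatch_probability is illustrative, and alleles are assumed to be drawn independently from each group's distribution.

```python
def mismatch_probability(freq_x, freq_y):
    """Chance that one allele drawn from each distribution are different variants."""
    return 1.0 - sum(px * py for px, py in zip(freq_x, freq_y))

# Frequencies of (v1, v2, v3) from the example in the text.
group_a = (0.75, 0.15, 0.10)
group_b = (0.02, 0.06, 0.92)

within_a = mismatch_probability(group_a, group_a)  # two members of A
within_b = mismatch_probability(group_b, group_b)  # two members of B
between = mismatch_probability(group_a, group_b)   # one from A, one from B

print(f"within A: {within_a:.4f}")
print(f"within B: {within_b:.4f}")
print(f"between:  {between:.4f}")
```

Per gene, the expected distance between an A individual and a B individual is larger than the expected distance within either group. Summed over many genes with this pattern, that gap is what makes the two groups show up as separate clusters.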

This leads us to two very different possibilities in human genetic variation:

Hypothesis 1: (the PC mantra) The only group differences that exist between the clusters (races) are innocuous and superficial, for example related to skin color, hair color, body type, etc.

Hypothesis 2: (the dangerous one) Group differences exist which might affect important (let us say, deep rather than superficial) and measurable characteristics, such as cognitive abilities, personality, athletic prowess, etc.

Note that H1 is under constant revision, as new genetically driven group differences (particularly in disease resistance) are being discovered. According to the mantra of H1 these must all (by definition) be superficial differences.

A standard argument against H2 is that the 50k years during which groups have been separated is not long enough for differential natural selection to cause any group differences in deep characteristics. I find this argument quite naive, given what we know about animal breeding and how evolution has affected the (ever expanding list of) "superficial" characteristics. Many genes are now suspected of having been subject to strong selection over timescales of order 5k years or less. For further discussion of H2 by Steve Pinker, see here.

The predominant view among social scientists is that H1 is obviously correct and H2 obviously false. However, this is mainly wishful thinking. Official statements by the American Sociological Association and the American Anthropological Association even endorse the view that race is not a valid biological concept, which is clearly incorrect.

As scientists, we don't know whether H1 or H2 is correct, but given the revolution in biotechnology, we will eventually. Let me reiterate, before someone labels me a racist: we don't know with high confidence whether H1 or H2 is correct.

Finally, it is important to note that group differences are statistical in nature and do not imply anything definitive about a particular individual. Rather than rely on the scientifically unsupported claim that we are all equal, it would be better to emphasize that we all have inalienable human rights regardless of our abilities or genetic makeup.

[See here (Economist's View blog) for more comments.]

[See Gene Expression for more discussion and references.]

Note added: See related Risch article which discusses H1 and H2: Assessing genetic contributions to phenotypic differences among 'racial' and 'ethnic' groups (Nature Genetics 2004), and also these 2013 comments in an interview with Ta-Nehisi Coates for The Atlantic:

Q: One last question. Your paper on assessing genetic contributions to phenotype seemed skeptical that we would ever tease out a group-wide genetic component when looking at things like cognitive skills or personality disposition. Am I reading that right? Are "intelligence" and "disposition" just too complicated?

A: (Risch) Joanna Mountain and I tried to explain this in our Nature Genetics paper on group differences. It is very challenging to assign causes to group differences. As far as genetics goes, if you have identified a particular gene which clearly influences a trait, and the frequency of that gene differs between populations, that would be pretty good evidence. But traits like "intelligence" or other behaviors (at least in the normal range), to the extent they are genetic, are "polygenic." That means no single genes have large effects -- there are many genes involved, each with a very small effect. Such gene effects are difficult if not impossible to find. The problem in assessing group differences is the confounding between genetic and social/cultural factors. If you had individuals who are genetically one thing but socially another, you might be able to tease it apart, but that is generally not the case.

In our paper, we tried to show that a trait can appear to have high "genetic heritability" in any particular population, but the explanation for a group difference for that trait could be either entirely genetic or entirely environmental or some combination in between.

So, in my view, at this point, any comment about the etiology of group differences, for "intelligence" or anything else, in the absence of specific identified genes (or environmental factors, for that matter), is speculation.

To repeat, I think it is fair to say that we don't know*** whether H1 or H2 is correct, even though we do know that populations cluster (genetically) by ancestry. Clustering makes it possible that H2 is correct, because the alleles (genetic variants) affecting a particular phenotype will tend to have different frequencies in different groups. The average value of the trait might or might not be the same in different groups. In the case of height, enough is known to suggest that variants which lead to increased height are more frequent in northern Europe than in the south, and there is evidence that this is due to selection, not drift.

*** I think it's especially important to be epistemologically careful in thinking about these matters, because of our difficult history with race. I would much rather live in a world where H1 is true and H2 false. But my preference alone does not make it so. (I would also much rather live in a universe created by a loving God, and in which I and my children have eternal souls; not a cruel Darwinian universe in which our species arose merely by chance. But my preference does not make it so.)
