Information Processing: Learning can hurt

Wednesday, May 01, 2013

Learning can hurt

This essay, from 2007, summarizes how progress in genomics came to confront convenient but incorrect views on the genetic clustering of human populations. The vast majority of people, even scientists, are still confused about this subject. See also Human genetic variation, Fst and Lewontin's fallacy in pictures.

The flipside of serendipity: human genetics rediscovers race

... In this paper I investigate the recent re-emergence of genetic race in more detail, and endeavour to ascertain how, after a half-century hiatus, a young team of population geneticists could casually rediscover race in 2002. Was it happenstance? Or was it opportune timing? My tentative conclusions are: (a) the flipside of serendipity is sociopolitical context; and (b) in the first few years of the new millennium, on the face of it, the time was indeed ripe for genetic race to re-emerge within the scientific mainstream.

IT ALL STARTED WITH A CLUSTER ANALYSIS ...

'We have sequenced the genome of three females and two males, who have identified themselves as Hispanic, Asian, Caucasian or African American ... to help illustrate that the concept of race has no genetic or scientific basis. In the five Celera genomes, there is no way to tell one ethnicity from another.'--J Craig Venter, White House press conference, June 2000. (4)

Craig Venter was the 'maverick' geneticist who founded Celera Genomics--a private biotech company that, in the late 1990s, took on a trans-national network of publicly and privately funded researchers in a frantic race to sequence the human genome. In June 2000, at the White House press conference that heralded the completion of the 'first assembly' of the genome, standing next to his public-sector adversary Francis Collins, United States President Bill Clinton, and (via satellite feed) United Kingdom Prime Minister Tony Blair, in a triumphal moment for Big Science and liberal humanism Venter announced: 'the concept of race has no genetic or scientific basis.'

In saying this, he reaffirmed evolutionary biologist Richard Lewontin's oft-cited claim that racial classification is of 'virtually no genetic or taxonomic significance'. (5) He also echoed a gaggle of other geneticists and an army of anthropologists who, since the middle of the twentieth century, have similarly sought to deny that the 'race concept' has any basis in biology. In short, he took up what might be termed the 'race as social construct' position--a position that has come to function as a default setting within the social sciences in recent decades. Never mind that his assertion was based on an analysis of only five individual genomes and thus a statistical fallacy; the important thing was that it befitted the hyperbole of the occasion and had the aura of incontrovertible truth.

[[ Who could have possibly accepted these results as decisive? See Bounded Cognition. ]]

Over the next few years the incontrovertibility of the 'race as social construct' position began to erode. Just four years after Venter's speech, in the pages of a Nature genetics supplement dedicated to the issue of race in genomics, Francis Collins--who had shared both podium and Time magazine cover with Venter on that aforementioned momentous day--offered this cautious qualification: 'As ancestral origins in many cases have a correlation, albeit often imprecise, with self-identified race or ethnicity, it is not strictly true that race or ethnicity has no biological connection.' (6)

... Collins ... is the director of the US National Human Genome Research Institute, a renowned medical genetics researcher, and by all accounts not a racist. So why is a distinguished geneticist, not some racist crackpot on the margins of academia, suddenly making the claim: 'it is not strictly true that race or ethnicity has no biological connection'? What happened in those few short years that separate Venter's exultant assertion and Collins's wary disclaimer?

Ostensibly, the answer to these questions can be found in Collins's paper, for he cites evidence to support his claim: a population genetics study conducted by Noah Rosenberg and his team of researchers that was published in Science in December 2002. Drawing on samples from the Human Genome Diversity Cell Line Panel, (8) these researchers investigated the 'correspondence of predefined groups with those inferred from individual multilocus genotypes'. (9) They used a complex computer algorithm to sort 1,064 genome samples, from fifty-two different populations, on the basis of 4,199 different alleles, at 377 highly variable 'junk-DNA' loci, into varying numbers of statistically significant genetic clusters, and then compared the clusters with the geographical origins of the populations from which they were drawn. Put simply, they took the labels off the samples and tried to see if the computer could sort them back into meaningful groups based solely on their genetic similarities.

They found that 'predefined labels' (such as 'Yoruba', 'Italian' or 'Japanese') were 'highly informative about membership in genetic clusters'. (10) Further, when asked to identify five clusters, the computer grouped the samples into sets roughly corresponding to five geographical regions: (i) sub-Saharan Africa, (ii) Europe and West Asia, (iii) East Asia, (iv) Oceania, and (v) the Americas (see the row marked 'K=5' in Figure 1 below for a graphical representation of these clusters). Curiously, these regions are roughly geographically concordant to those occupied by the 'black', 'white', 'yellow', 'tawny' and 'copper-coloured' 'varieties' outlined in Johann Friedrich Blumenbach's seminal eighteenth-century racial typology. (11)

These results should have come as no surprise to most population geneticists, as it had long been assumed that human groups separated by physical, environmental, linguistic, and/or cultural barriers would display some degree of genetic differentiation. (12) Nonetheless, they were quite significant, as it was the first time that human populations had been comprehensively shown to cluster together on the basis of genetic likeness in a 'blind' test.

Still, Rosenberg and his colleagues were relatively circumspect in their conclusions. Their point was not to show that old anthropological conceptions of race are genetically 'real', but to argue that differentiating between human populations is both methodologically and statistically valid, and that such distinctions can be legitimately used for tracing the origins and migrations of peoples, as well as for medical and epidemiological purposes.

The words 'race' and 'ethnicity' never appeared in their paper, only 'population' and 'ancestry', and they were careful to choose neutral colours to code the continental clusters they identified--sub-Saharan Africa was orange, Europe and West Asia pale-blue, East Asia pink, Oceania green, and the Americas purple (not black, white, yellow, tawny and copper-coloured). Here they seemingly followed a convention established by the father of human population genetics and the prime mover behind the Human Genome Diversity Project, Luigi Luca Cavalli-Sforza, whose co-authored magnum opus, The History and Geography of Human Genes (1994, abridged 2006), featured a colour-coded atlas on its cover (Figure 2), with the caption: 'Four major ethnic regions are shown. Africans are yellow, Australians red, [Mongoloids blue] and Caucasoids green.' (13) A reviewer from Time magazine described Cavalli-Sforza's book as 'a landmark global study' that 'flattens The Bell Curve, proving that racial differences are only skin deep.' (14) Similarly, the Rosenberg study was lauded for its 'humanitarian' findings and adjudged 'paper of the year' by The Lancet in 2003. (15)

On the surface the consensus was that nothing much had changed. Population genetics had proven racial differences were only 'skin deep', and the Rosenberg findings only served to further confirm this incontrovertible fact. Everyone was content to repeat the familiar mantra: 'no race here ... nothing to look at ... move along'. However, beneath the 'race as social construct' ideological edifice a fuse had been lit and a 'race as biological reality' powder keg was about to go off.

THE MESSY FALLOUT

Even before the Rosenberg paper appeared, a group of belligerent geneticists headed up by Neil Risch from Stanford University had published an opinion piece in Genome Biology defending 'the validity of race/ethnicity categories for biomedical and genetic research' on the basis of cluster analyses similar to that used by the Rosenberg study.

[[ See Neil Risch, Portrait of a mathematical geneticist. ]]

The following year A W F Edwards, who had pioneered the statistical techniques behind cluster analysis with Cavalli-Sforza in the 1960s, penned a curt refutation of Lewontin's 1972 claim about the insignificance of racial classification, citing the Rosenberg study to buttress his argument. (17)

Following on from Edwards's assault, over the next few years several review essays appeared in major journals, (18) and the issue exploded onto the pages of the major broadsheets. Two articles from The New York Times tell the story of this re-emergence quite well: in August 2000 the headline read, 'Do races differ? Not really, genes show', with the author quoting Venter and others; (19) by July 2006 another headline read, 'Imperfect, imprecise but useful: Your race', and the article discussed the utility of racial classifications for biomedical research. (20)

A further twist in the plot of this recent race potboiler began on 14 March 2005, when evolutionary biologist Armand Marie Leroi published a now infamous op-ed piece, again in The New York Times (seemingly the arbiter of all things genetic), entitled 'A family tree in every gene'. (21) Taking up the Rosenberg study bait, he heralded their findings as suggesting: 'the consensus about social constructs was unravelling', and 'looked at the right way, the genetic data show that races clearly do exist.' (22) Further, he celebrated new advances in genetics that will soon mean, 'we shall no longer gawp ignorantly at the gallery [of racial differences]; we shall be able to name the painters.' (23)

Of course, more than a few people took exception to this talk of 'genetic painters' and 'racial galleries', and Leroi's article prompted a swift multi-pronged rebuttal. Within a month the US Social Science Research Council (SSRC) had sponsored a web forum collecting together critical responses from many of the usual 'social construct' suspects, including Alan Goodman, Evelynn Hammonds, Joseph Graves, Ruth Hubbard, Richard Lewontin and Jonathan Marks (many of whom also appeared in the 2003 PBS documentary series 'Race--the Power of an Illusion'). (24)

Labouring valiantly to plug the holes emerging in the epistemic dyke separating old race pseudoscience from modern human genetics, these ageing anti-race activists accused Leroi of 'reifying' race, and argued that race is most certainly not a biological category but rather an all-too-real social category that should be kept as far away as possible from the objective realm of science. ...

... This is what more sophisticated constructivists mean when they say race is a social construct--that the social and political aspects of race cannot easily be disentangled from the biological. (28) However, some constructivists want to go as far as to suggest that there is no biological dimension to race at all, with Richard Lewontin being a notorious culprit. Lewontin's research in the 1970s did not show that racial classifications were genetically insignificant, only that differences between racial groups were less significant than within groups. (29) His denial of the genetic significance of race is a non sequitur.
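
The gap between Lewontin's observation and his conclusion can be illustrated with a toy simulation. This is a minimal sketch in the spirit of Edwards's argument, not a reconstruction of any published analysis: the two groups, the frequency offset `delta`, and the function names are all hypothetical. At every locus the groups differ only slightly in allele frequency, so about 9% of single-locus variance lies between groups, yet a simple allele count over many loci assigns almost everyone to the right group.

```python
import random


def between_group_share(delta):
    """Share of single-locus variance that lies between two equal-sized groups.

    Assumes allele frequencies of 0.5 + delta and 0.5 - delta, so the pooled
    frequency is 0.5, total variance is 0.25 and between-group variance is delta^2.
    """
    return delta ** 2 / 0.25


def simulate_accuracy(n_loci, delta=0.15, n_per_group=500, seed=0):
    """Fraction of simulated individuals assigned to their true group.

    Each individual carries one allele (0 or 1) per locus. The classifier is
    deliberately crude: count the 1-alleles and compare with n_loci / 2.
    Use an odd n_loci so that ties cannot occur.
    """
    rng = random.Random(seed)
    freqs = {"A": 0.5 + delta, "B": 0.5 - delta}  # hypothetical groups
    correct = 0
    for group, p in freqs.items():
        for _ in range(n_per_group):
            count = sum(rng.random() < p for _ in range(n_loci))
            guess = "A" if count > n_loci / 2 else "B"
            correct += (guess == group)
    return correct / (2 * n_per_group)


if __name__ == "__main__":
    print(f"between-group share per locus: {between_group_share(0.15):.2f}")
    for n in (1, 11, 101, 1001):
        print(f"{n:5d} loci -> accuracy {simulate_accuracy(n):.3f}")
```

With a single locus the expected accuracy is 65%, not far above a coin flip. The assumption doing the work is that the small per-locus differences are independent, so they accumulate across loci while the noise averages out.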

Putting aside Lewontin's fallacy, the major mistakes that unsophisticated social constructivists make are: (a) to suggest that being bound up with social and political concerns renders race 'unscientific' (as if scientific knowledge about human subjects could ever be separated from society and politics); and (b) to believe that by replacing race with the term population we can somehow escape this problem. The latter move just 'purifies' the concept of race, allowing geneticists to investigate racial groups such as 'Yoruba', 'Japanese', 'Caucasoid' or 'Australian' under the rubric of population, whilst denying that their research has anything to do with race. ...

10 comments:

JayMan said...: Excellent! I've added this post to my "master list" page of human biodiversity fundamentals (where it accompanies many other works by you):

HBD Fundamentals | JayMan's Blog; 9:17 PM
steve hsu said...: For the record, while I agree with the opinions expressed in the excerpt above, I do NOT agree with all of the opinions expressed in the articles linked to by your "HBD Fundamentals" list. (Don't ask me to go into further detail ...); 9:35 PM
RKU said...: I'll admit I wasn't really following the public debate during that decade, but still find it difficult to believe that anyone smart ever seriously believed that 1+1=17.

Presumably under a somewhat different cultural milieu Venter's public 2000 declaration would have been that his recent genetic research had conclusively proven the truth of the Holy Trinity...; 12:18 AM
marcel proust said...: I am not by any stretch of the imagination a geneticist of any sort (population or otherwise), so talk to me like I am stupid, as another blogger sometimes says...

... (i) sub-Saharan Africa, (ii) Europe and West Asia, (iii) East Asia, (iv) Oceania, and (v) the Americas

I am trying to understand some of the mechanics of how this classification would be done, in part stimulated by a statement in re: population genetics that is among the most frequent that I have encountered: genetic variation across sub-Saharan African populations exceeds that in the rest of the world.

And I guess the flip side is that variation in the pre-Columbian Americas was less than in any comparable area of the world (or more likely comparably sized population) at that time.

So I can see how the Americas would be identified, and I imagine that Oceania is a similar case: people in these areas are so similar to each other across so many locations in the genome that they are almost perfect matches for each other.

Once they are cleared out of the sample, you can do rough matching to identify groups ii and iii in the list above.

So is sub-Saharan Africa then "none of the above", everyone that's left over after membership in groups ii-v is identified? In what sense would this make sub-Saharan Africans a race in any genetic sense?; 10:42 AM
marcel proust said...: Thank you for the link.

Apropos the last paragraph of your response, either I don't understand something about it, or it does not answer the question that I thought I asked (which of course leaves open that I was not at all clear).

I think what I am asking is first, whether it makes sense to consider sub-Saharan Africans (sSAs) a race, or whether they are so diverse that it would be more sensible to look for clusters within that population where the diversity is comparable to that of groups (say) ii and iii in the list. I notice, for instance, that in the graph at the first link in the link you supplied, the Hadza are separated out from the rest of sSAs, and I think I understand the reason for that, despite their being, literally, sSAs: genetically, the Hadza do not cluster with any other population of humans. Are there other clusters within sSAs that are as distinct from each other as (say) groups ii and iii above and no more internally heterogeneous than these groups (and I know nothing about the relative degrees of internal vs. external heterogeneity of ii or iii - I have not yet read your links, that's for tonight, but you seem to be responding now, so I thought I'd follow up. Thanks).

A second related question is whether there is a rough variance ceiling above which it makes no sense to cluster individuals into the same race and below which it does, or is this purely context dependent? Are Amerindians (&/or Oceanians) effectively a very homogeneous sub-group within East Asians, and a significant (if not the only) reason for distinguishing them from East Asians is that we have other evidence that they were physically distinct for hundreds of generations? It sure looks like that from the graph mentioned above, and in this case, I think it would then follow that context must play some role in the clustering.[1]

That African-Americans don't appear closer to west Europeans in the graph surprised me at first, but then suggests (to me) that they must be clustering closely to the African part of their ancestry, and that that must be distinct or at least distinguishable from other parts of sSA.

Thanks again.

[1] I am using "clustering" over and over. I understand that the particular technique is PCA, but this, AFAIK is 1 technique of cluster analysis, and if I am not confusing things by talking a bit loosely, ...; 12:10 PM
Iamexpert said...: We need to define what we mean by race. Are races people who share relatively recent common ancestry, or are races people who genetically PRESERVE the phenotype of their common ancestor regardless of how ancient that common ancestry might be? If it's the latter, one could argue there are only the 3 races of old and that scientists have artificially separated Africans from Oceanians by hyper-focusing on selectively neutral genes that correlate with divergence dates instead of genes that correlate with phenotype. It's not how long ago a population split that matters, it's how much change occurred since.; 5:04 PM
Emil Kirkegaard said...: But even before we had modern genetic knowledge, experts were not unjustified in believing races to be real. This modern research is only a confirmation, one that wasn't strictly speaking needed to be justified in the belief.

It is similar to the genetics of intelligence. We will find the genes in the next few decades, but it is surely possible now to believe that intelligence is highly heritable.; 12:43 AM
botti said...: It's staggering that Leroi's op-ed could be described as "infamous". Then again I suppose for people who have books like 'Not in Our Genes' as their authoritative texts, it might be.

http://www.nytimes.com/2005/03/14/opinion/14leroi.html?pagewanted=print&_r=0; 6:22 PM
okamiden said...: I don't usually comment on science blogs, but since you are trying to make sense out of this, I think I can make this quick and to the point: you can use different techniques, sure. Hierarchical clustering, for example. Or k-means. PCA is not a clustering technique per se, but it supplies the coordinates in which clustering algorithms can measure distance.

On the first picture that you are referring to, the Hadza are separated out from the rest of sSAs along the third axis. Look at the legend: the third component captures only 3.5% of variation, compared to 20% for the first. To visualize, imagine stretching the first axis approximately 6-fold.

To answer your original question, you get five "races" (I-V) because you asked for five clusters (it's a parameter you set before running the algorithm). If you run the same clustering algorithm asking to identify more clusters, perhaps Hadza will get their own cluster. But if you set out to identify just 5 (five) clusters in the entire human population, the five "races" roughly corresponding to the continents is what you get.; 10:44 PM
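
The point that the number of clusters is an input rather than a finding can be sketched with plain k-means. This is a toy illustration under loud assumptions: the three populations, their allele frequencies, and the helper names `simulate` and `kmeans` are invented here, and k-means stands in for the model-based method used in the Rosenberg study, which works differently.

```python
import random


def simulate(n_pops=3, n_per_pop=30, n_loci=800, delta=0.25, seed=1):
    """Return (genotypes, true_labels) for hypothetical populations.

    Each population's allele frequency at each locus is 0.5 +/- delta,
    chosen independently; each individual carries one allele per locus.
    """
    rng = random.Random(seed)
    freqs = [[0.5 + rng.choice((-delta, delta)) for _ in range(n_loci)]
             for _ in range(n_pops)]
    data, truth = [], []
    for pop, pop_freqs in enumerate(freqs):
        for _ in range(n_per_pop):
            data.append([1.0 if rng.random() < p else 0.0 for p in pop_freqs])
            truth.append(pop)
    return data, truth


def sq_dist(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(data, k, n_iter=20):
    """Lloyd's algorithm with farthest-point initialisation; k is chosen by the caller."""
    centers = [data[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(data, key=lambda x: min(sq_dist(x, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(x, centers[j]))
                      for x in data]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    data, truth = simulate()
    for k in (2, 3):
        labels = kmeans(data, k)
        print(f"K={k}: clusters found = {len(set(labels))}")
```

Asking for K=2 on the same data still returns two clusters, so the count itself carries no information. What is informative is how well the clusters line up with the population labels that were withheld from the algorithm.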
Matthew Carnegie said...: As I understand it, if you had more subtribes of Hadza, then the algorithm might find it explains more of the sample variance to assign them a cluster earlier in the process.

E.g. Sarah Tishkoff's study of clustering in Africa tends to break off intra-African clusters earlier than many intra-Out of Africa clusters (e.g. Hadza differentiation before Oceanian and American differentiation from East Eurasians), basically because it reverses the normal ratio of African to non-African populations.

So there is an element of deciding which populations to include on the panel which can introduce a kind of circularity to the process.

e.g. do you sample lots of African populations because they are diverse or not many because it is just one continent and many of the African subpopulations have small populations? Either of these motivations is viable and defensible, it is only when you are getting into the realm of, for example, "Everyone knows that differentiation of East Eurasians from West Eurasians is more important than intra African variation, so let's design our panel appropriately" that you begin to get into problems in terms of being able to defend the choice.

But, this said, clustering is generally still fairly robust to population choices (at least intellectually viable population choices).; 7:21 AM
