Tuesday, November 25, 2008

East Asian genetic substructure

Below are some results on East and Southeast Asian genetic substructure. As you can see, Koreans are (sort of) midway between Chinese and Japanese. It will almost certainly be possible to differentiate between regional origins based on DNA once studies with larger sample sizes become available. (See European results.)

Figures: Each point is an individual, and the axes are two principal components in the space of genetic variation. Colors correspond to individuals of different Asian ancestry.

Thanks to Chao Tian of UC Davis for sending me an early draft of the paper.

Analysis of East Asia Genetic Substructure: Population Differentiation and PCA Clusters Correlate with Geographic Distribution.

C. Tian1, R. Kosoy1, A. Lee2, P. Gregersen2, J. Belmont2, M. Seldin1

1) Rowe Program Human Genetics, Univ California Sch Medicine, Davis, CA; 2) North Shore-LIJ Res Inst, Manhasset, NY, Baylor Col Med., Houston TX.

Accounting for genetic substructure within European populations has been important in reducing type 1 errors in genetic studies of complex disease. As efforts to understand complex genetic disease are expanded to other continental populations, an understanding of genetic substructure within these continents will be useful in the design and execution of association tests. In this study, population differentiation (Fst) and Principal Components Analyses (PCA) are examined using >200K genotypes from multiple populations of East Asian ancestry (total 298 subjects). The population groups included those from the Human Genome Diversity Panel [Cambodian (CAMB), Yi, Daur, Mongolian (MGL), Lahu, Dai, Hezhen, Miaozu, Naxi, Oroqen, She, Tu, Tujia, and Xibo], HapMap (CHB and JPT), and East Asian or East Asian American subjects of Vietnamese (VIET), Korean (KOR), Filipino (FIL) and Chinese ancestry. Paired Fst (Weir and Cockerham) showed close relationships between CHB and several large East Asian population groups (CHB/KOR, 0.0019; CHB/JPT, 0.00651; CHB/VIET, 0.0065) with larger separation from FIL (CHB/FIL, 0.014). Low levels of differentiation were also observed between DAI and VIET (0.0045) and between VIET and CAMB (0.0062). Similarly, small Fst's were observed among different presumed Han Chinese populations originating in different regions of mainland China and Taiwan (Fst < 0.0025 with CHB). For PCA, the first two PC's showed a pattern of relationships that closely followed the geographic distribution of the different East Asian populations. For example, the four "corner" groups were JPT, FIL, CAMB and MGL with the CHB forming the center group, and KOR was between CHB and JPT. Other small ethnic groups were also in rough geographic correlation with their putative origins.
These studies have also enabled the selection of a subset of East Asian substructure ancestry informative markers (EASTASAIMS) that may be useful for future genetic association studies in reducing type 1 errors and in identifying homogeneous groups.
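
For readers who want a feel for what a pairwise Fst measures, here is a minimal sketch in Python. It is not the Weir and Cockerham estimator named in the abstract, which also corrects for sample size. It is the simpler Wright-style ratio (H_T - H_S) / H_T computed from allele frequencies, with the two populations weighted equally. The function name and the input frequencies are made up for illustration.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Wright-style Fst from per-SNP allele frequencies in two populations.

    Uses the ratio-of-sums form, sum(H_T - H_S) / sum(H_T), with equal
    population weights and no sample-size correction. This is NOT the
    Weir and Cockerham estimator; it is a simplified stand-in.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("frequency lists must cover the same SNPs")
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_mean = (p_a + p_b) / 2.0
        # Expected heterozygosity if the two populations were pooled.
        h_total = 2.0 * p_mean * (1.0 - p_mean)
        # Average expected heterozygosity within each population.
        h_within = (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        numerator += h_total - h_within
        denominator += h_total
    if denominator == 0.0:
        raise ValueError("all SNPs are monomorphic in the pooled sample")
    return numerator / denominator


# Hypothetical allele frequencies at three SNPs (not from the paper).
population_a = [0.40, 0.55, 0.10]
population_b = [0.60, 0.50, 0.15]
print(fst_two_populations(population_a, population_b))
```

Identical frequencies give 0 and fixed differences give 1, so values of a few thousandths, like those quoted above, mean the allele frequencies differ only slightly.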

Related posts: "no scientific basis for race", metric on the space of genomes


Anonymous said...

Interesting -- you can almost see the four major ancient trade/migration routes.

The northern route traced by the Central Asian nomads (MGL, ORO, XIBO, etc.); the main westward Silk Route traced by the Tu; the southern maritime Silk Route traced by coastal peoples (SHE, VIET, PHIL, etc.), and the southern mountain Tea Horse Route traced by the hill peoples (LAHU and somehow CAMB?).

All these traces somewhat converge on CHB. The genetic flow along these routes stands in stark contrast to the lack of genetic flow along the main E-W axis of JPT-KOR-CHB-NAXI/YI.

That the NAXI and YI genetically overlap proves the ancient legends. I therefore extrapolate and expect that TIBET, NAXI, YI, BAI, and QIANG are all genetically closely allied.

Luigi said...

Would be interesting to correlate with rice genetic structure: http://www.vaviblog.com/rice-domestication-a-different-story/

Steve Sailer said...

So, this is just like the recent genetic graph of Europe where the nationalities were placed just as they are on a geographic map?

Steve Hsu said...


The location of each dot (individual) is determined by the projection of their genetic information along a particular direction in the space of possible genes. The two directions chosen (out of thousands or more) are the ones along which the variation among the individuals in the sample is largest (principal components). What is interesting is that, plotted this way, the clustering of individuals reproduces the geographical map. It shows that, in general, genetic distance is highly correlated with geographical distance and/or the structure of political boundaries.

Practically speaking, it means that in many cases an individual's ethnic origins can be determined with good accuracy simply by measuring their genetic information along those two principal components; "the Japanese guy did it!"
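
To make the projection concrete, here is a rough sketch in plain Python. The genotype matrix is invented (six individuals, five SNPs, coded 0/1/2), and the power-iteration routine is just one simple way to find the leading directions. A real analysis on >200K SNPs would use dedicated software and per-SNP normalization, so treat this as an illustration of the idea rather than the method used in the paper.

```python
import math
import random


def center_columns(rows):
    """Subtract each SNP's mean genotype so every column averages zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """SNP-by-SNP sample covariance matrix of the centered genotypes."""
    n = len(centered)
    d = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=500, seed=0):
    """Leading eigenvalue and unit eigenvector of cov by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return eigenvalue, v


def principal_components(rows, k=2):
    """Return the top-k directions and each individual's coordinates."""
    centered = center_columns(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        eigenvalue, v = top_component(cov)
        components.append(v)
        # Deflation: remove the direction just found, then repeat.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    coords = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, coords


# Invented toy data: the first three individuals form group "A",
# the last three group "B". Not real genotypes.
genotypes = [
    [2, 2, 1, 0, 1],
    [2, 1, 2, 0, 1],
    [2, 2, 2, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],
]
labels = ["A", "A", "A", "B", "B", "B"]

components, coords = principal_components(genotypes, k=2)
for label, (pc1, pc2) in zip(labels, coords):
    print(label, round(pc1, 3), round(pc2, 3))
```

Each printed pair is one dot on a plot like the figures above. The overall sign of a component is arbitrary, so only the relative positions of the dots carry meaning.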

James X. Li said...

It would be interesting to know how large the eigenvalues are for the third and fourth principal components, so that we know how much information gets lost in these 2D maps.
PCA is a quite limited method for high-dimensional non-linear data.
Why not try some more powerful methods, e.g. the non-linear dimensionality reduction methods developed in recent years?
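
The eigenvalue question comes down to a ratio: the share of total variance kept by the first k components is the sum of the k largest eigenvalues divided by the sum of all of them. A small sketch follows; the eigenvalues in it are invented for illustration and are not taken from the paper.

```python
def variance_retained(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    if not eigenvalues:
        raise ValueError("need at least one eigenvalue")
    if k < 0:
        raise ValueError("k must be non-negative")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return sum(ordered[:k]) / total


# Made-up spectrum, for illustration only.
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
retained = variance_retained(example_eigenvalues, 2)
print(f"share kept by a 2D map: {retained:.2f}")
print(f"share lost: {1 - retained:.2f}")
```

Comparing the values for k=2 and k=4 shows how much the third and fourth components would add.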

Steve Hsu said...


See this figure for some information about the first 25 PCA vectors from a combined European + HapMap sample.


I suggest you get in touch with Chao Tian of UC Davis and see if you can get his data to play with -- from your blog it appears you have some expertise in data mining!
