Showing posts with label pca. Show all posts

Tuesday, April 21, 2015

China's Ideological Spectrum


These researchers identify a dominant principal component in the Chinese ideological spectrum. Discussed on Sinica podcast.

China's Ideological Spectrum

Jennifer Pan (Harvard University - Graduate School of Arts and Sciences)
Yiqing Xu (MIT - Department of Political Science)

We offer the first large scale empirical analysis of ideology in contemporary China to determine whether individuals fall along a discernible and coherent ideological spectrum, and whether there are regional and inter-group variations in ideological orientation. Using principal component analysis (PCA) on a survey of 171,830 individuals, we identify one dominant ideological dimension in China. Individuals who are politically conservative, who emphasize the supremacy of the state and nationalism, are also likely to be economically conservative, supporting a return to socialism and state-control of the economy, and culturally conservative, supporting traditional, Confucian values. In contrast, political liberals, supportive of constitutional democracy and individual liberty, are also likely to be economic liberals who support market-oriented reform and social liberals who support modern science and values such as sexual freedom. This uni-dimensionality of ideology is robust to a wide variety of diagnostics and checks. Using post-stratification based on census data, we find a strong relationship between liberal orientation and modernization -- provinces with higher levels of economic development, trade openness, urbanization are more liberal than their poor, rural counterparts, and individuals with higher levels of education and income are more liberal than their less educated and lower-income peers.
Warning: PCA is the tool of the devil ;-)

Thursday, December 16, 2010

g at work

This is a caricature of about a dozen conversations I've had in the last few months, including today.

Colleague (physicist): What's this genomics stuff you're doing?

me: We're looking for genes associated with IQ or g. Sequencing costs are going down at ...

physicist: Yes, yes, super-exponential rate. IQ, is that like those Mensa clowns? What's g?

me: Actually the research results are pretty clear...

physicist: Aren't there lots of independent cognitive abilities? Like memory, geometric visualization, vocabulary, etc.?

me: Yes, but it turns out they are all positively correlated. Suppose there are N abilities; represent each person as a point in the N dimensional space. The population distribution is like an ellipsoid and the longest axis is what they call g. It turns out g has a lot of predictive power...

physicist: Oh, I get it. Neat! Can I be in your study?

Below are responses from analogous conversations I've had with different, uh, kinds of people (like other kinds of, uh, scientists ;-)

Sequencing costs? Sequencing what?

IQ? That was all discredited by Stephen J. Gould!

Aren't the scores just determined by the SES of the parents?

N dimensional what?!?


Note added: I hereby apologize for any psychological trauma my post may have inflicted on others :-/

Tuesday, August 24, 2010

Connect the dots

I thought I'd share this beautiful graphic of human genetic variation from the blog Gene Expression. The original paper is from Science.


For panel A, PC1 = 20% of the variance, PC2 = 5%, and PC3 = 3.5%. For panel B, PC1 = 11%, PC2 = 6%, PC3 = 5% and PC4 = 4%.

Saturday, November 21, 2009

IQ, compression and simple models

I get yelled at from all sides whenever I mention IQ in a post, but I'm a stubborn guy, so here we go again.

Imagine that you would like to communicate something about the size of an object, using as short a message as possible -- i.e., a single number. What would be a reasonable algorithm to employ? There's obviously no unique answer, and the "best" algorithm depends on the distribution of object types that you are trying to describe. Here's a decent algorithm:

Let rough size S = the radius of the smallest sphere within which the object will fit.

This algorithm allows a perfect reconstruction of the object if it is spherical, but isn't very satisfactory if the object is a javelin or bicycle wheel.

Nevertheless, it would be unreasonable to reject this definition as a single number characterization of object size, given no additional information about the distribution of object types.

I suggest we think about IQ in a similar way.

Q1: If you had to supply a single number meant to characterize the general cognitive ability of an individual, how would you go about determining that number?

I claim that the algorithm used to define IQ is roughly as defensible for characterizing cognitive ability as the quantity S, defined above, is for characterizing object size. The next question, which is an empirical one, is

Q2: Does the resulting quantity have any practical use?

In my opinion reasonable people should focus on the second question, that of practical utility, as it is rather obvious that there is no unique or perfect answer to the first question.

To define IQ, or the general factor g of cognitive ability, we first define some different tests of cognitive ability, i.e., which measure capabilities like memory, verbal ability, spatial ability, pattern recognition, etc. Of course this set of tests is somewhat arbitrary, just as the primitive concept "size of an object" is somewhat arbitrary (is a needle "bigger" than a thimble?). Let's suppose we decide on N different kinds of tests. An individual's score on this battery of tests is an N-vector. Sample from a large population and plot each vector in the N-dimensional space. We might find that the resulting points are concentrated on a submanifold of the N-dimensional space, such that a single variable (which is a special linear combination of the N coordinates) captures most of the variation. As an extreme example, imagine the points form a long thin ellipse with one very long axis; position on this long axis almost completely specifies the N vector. The figure below shows real data, ~100k individuals tested. The principal axis of the ellipsoid is g (roughly speaking; as I've emphasized it is not entirely well-defined).


What I've just described geometrically is the case where the N mental abilities display a lot of internal correlation, and have a dominant single factor that arises from factor analysis. This dominant factor is what we call g. Note it did not have to be the case that there was a single dominant factor -- the sampled points could have had any shape -- but for the set of generally agreed upon human cognitive abilities, there is.
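The geometric picture above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the real test data in the figure: the number of tests, the 0.7 weights, and the function name `first_principal_component` are all assumptions made for the example. Each synthetic test is a shared latent factor plus independent noise, so the tests come out positively correlated and the longest axis of the point cloud plays the role of g. Note that this uses principal components rather than a full factor-analysis model.

```python
import numpy as np

def first_principal_component(scores):
    """Return (unit vector, fraction of variance) for the longest axis."""
    # Standardize each test so no single test dominates through its units.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    top = eigvecs[:, -1]
    if top.sum() < 0:                         # the sign of an eigenvector is arbitrary
        top = -top
    return top, eigvals[-1] / eigvals.sum()

# Synthetic data (an assumption, not real test results): N = 5 tests,
# each a shared latent factor plus independent noise.
rng = np.random.default_rng(0)
n_people, n_tests = 10_000, 5
latent = rng.normal(size=(n_people, 1))
scores = 0.7 * latent + 0.7 * rng.normal(size=(n_people, n_tests))

axis, explained = first_principal_component(scores)

# Each person's position along the longest axis: the single-number summary.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
g = z @ axis

print("loadings:", np.round(axis, 2))
print("fraction of variance on first axis:", round(float(explained), 2))
```

With these made-up weights every pair of tests has correlation near 0.5, so all loadings are positive and of similar size. Had the tests been uncorrelated, no axis would stand out.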

(What this implies about underlying brain wetware is an interesting question but would take us too far afield. I will mention that g, defined as above using cognitive tests, correlates with neurophysical quantities like reaction time! So it's at least possible that high g has something to do with generally effective brain function -- being wired up efficiently. It's now acknowledged even by hard line egalitarians that g is at least partly heritable, but for the purposes of this discussion we only require a weaker property -- that adult g is relatively stable.)

To summarize, g is the best single number compression of the N vector characterizing an individual's cognitive profile. (This is a lossy compression -- knowing g does not allow exact reconstruction of the N vector.) Of course, the choice of the N tests used to deduce g was at least somewhat arbitrary, and a change in tests results in a different definition of g. There is no unique or perfect definition of a general factor of intelligence. As I emphasized above, given the nature of the problem it seems unreasonable to criticize the specific construction of g, or to try to be overly precise about the value of g for a particular individual. The important question is Q2: what good is it?
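To make "lossy compression" concrete, here is a small sketch, again on synthetic scores (the four tests and the 0.7 weights are assumptions chosen for illustration). It keeps one number per person, rebuilds the best possible N-vector from that number alone, and measures how much of the variation is lost.

```python
import numpy as np

rng = np.random.default_rng(1)
latent = rng.normal(size=(5_000, 1))
# Synthetic scores on N = 4 tests: shared factor plus independent noise.
scores = 0.7 * latent + 0.7 * rng.normal(size=(5_000, 4))

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# SVD of the standardized data: rows of vt are the principal axes,
# ordered from largest to smallest singular value.
u, s, vt = np.linalg.svd(z, full_matrices=False)

g = z @ vt[0]                 # the compressed description: one number per person
z_hat = np.outer(g, vt[0])    # reconstruction of the N-vector from g alone

# Fraction of total variation that the single number fails to capture.
lost = ((z - z_hat) ** 2).sum() / (z ** 2).sum()
print("fraction of variation lost:", round(float(lost), 2))
```

The lost fraction equals one minus the share of variance on the first axis. Keeping a second or third axis (the 2 or 3-vector compression) would reduce it further.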

A tremendous amount of research has been conducted on Q2. For a nice summary, see Why g matters: the complexity of ordinary life by psychologist Linda Gottfredson, or click on the IQ or psychometrics label link for this blog. Links and book recommendations here. The short answer is that g does indeed correlate with life outcomes. If you want to argue with me about any of this in the comments, please at least first read some of the literature cited above.

From Gottfredson (WPT = Wonderlic Personnel Test):

Personnel selection research provides much evidence that intelligence (g) is an important predictor of performance in training and on the job, especially in higher level work. This article provides evidence that g has pervasive utility in work settings because it is essentially the ability to deal with cognitive complexity, in particular, with complex information processing. The more complex a work task, the greater the advantages that higher g confers in performing it well.

... These conclusions concerning training potential, particularly at the lower levels, seem confirmed by the military’s last half century of experience in training many millions of recruits. The military has periodically inducted especially large numbers of “marginal men” (percentiles 10-16, or WPT 10-12), either by necessity (World War II), social experiment (Secretary of Defense Robert McNamara’s Project 100,000 in the late 1960s), or accident (the ASVAB misnorming in the early 1980s). In each case, the military has documented the consequences of doing so (Laurence & Ramsberger, 1991; Sticht et al., 1987; U.S. Department of the Army, 1965).

... all agree that these men were very difficult and costly to train, could not learn certain specialties, and performed at a lower average level once on a job. Many such men had to be sent to newly created special units for remedial training or recycled one or more times through basic or technical training.

Limitations and open questions:

1. Are there group differences in g? Yes, this is actually uncontroversial. The hard question is whether these observed differences are due to genetic causes.

2. Is it useful to consider sub-factors? What about, e.g., a 2 or 3-vector compression instead of a scalar quantity? Yes, that's why the SAT has an M and a V section. Some people are strong verbally, but weak mathematically, and vice versa. Some people are really good at visualizing geometric relationships, some aren't, etc.

3. Does g become less useful in the tail of the distribution? Quite possibly. It's harder and harder to differentiate people in the tail.

4. How stable is g? Adult g is pretty stable -- I've seen results with .9 correlation or greater for measurements taken a year apart. However, g measured in childhood is nowhere near a perfect predictor of adult g. If someone has a reference with good data on childhood/adult g correlation, please let me know.

5. Isn't g just the same as class or SES? No. Although there is a weak correlation between g and SES, there are obviously huge variations in g within any particular SES group. Not all rich kids can master calculus, and not all disadvantaged kids read below grade level.

6. How did you get interested in this subject? In elementary school we had to take the ITED (Iowa Test of Educational Development). This test had many subsections (vocabulary, math, reading, etc.) with 99th percentile ceilings. For some reason the teachers (or was it my parents?) let me see my scores, and I immediately wondered whether performance on different sections was correlated. If you were 99 on the math, what was the probability you were also 99 on the reading? What are the odds of all 99s? This leads immediately to the concept of g, which I learned about by digging around at the university library. I also found all five volumes of the Terman study.

7. What are some other useful compressed descriptions? It is claimed that one can characterize personality using the Big Five factors. The results are not as good as for g, I would say, but it's an interesting possibility, and these factors were originally deduced in an information theoretic way. Big Five factors have been shown to be stable and somewhat heritable, although not as heritable as g. Role playing games often use compressed descriptions of individuals (Strength, Dexterity, Intelligence, ...) as do NFL scouts (40 yd dash, vertical leap, bench press, Wonderlic score, ... ) ;-)


It's a shame that I have to write this post at all. This subject is of such fundamental importance and the results so interesting and clear cut (especially for something from the realm of social science) that everyone should have studied it in school. (Everyone does take the little tests in school...) It's too bad that political correctness means that I will be subject to abuse for merely discussing these well established scientific results.

Why think about any of this? Here's what I said in response to a comment on this earlier post:
Intelligence, genius, and achievement are legitimate subjects for serious study. Anyone who hires or fires employees, mentors younger people, trains students, has kids, or even just has an interest in how human civilization evolved and will evolve should probably think about these questions -- using statistics, biography, history, psychological studies, really whatever tools are available.

Sunday, December 07, 2008

Resolution of population genetic structure

How much can we resolve the substructure of a population with a given amount of data? The paper below gives a quantitative answer. With current technology, we should have no problem resolving even small national populations (see italicized text in quote below), with nearest neighbor FST as small as .0001 (i.e., 99.99 percent of variation is within-group and only .01 percent between groups)! According to this Table of European, Nigerian and East Asian FSTs, the FST between France and Spain is .0008, whereas between Nigeria and Japan it is about .19.

Within the European + HapMap sample analyzed here, over 100 statistically significant PCA vectors were identified. That is, there is a >100 dimensional space within which structure can be teased out. (However, the largest single vector accounts for only a percent of total variation, and the integral over all 100 vectors is probably only a few percent.) Norwegians and Swedes could be resolved with 90 percent accuracy. Note the Patterson et al. paper was written before this recent analysis, which confirms their theoretical predictions of sensitivity. (Figure below.)



The first author, Nick Patterson (profiled here), is a mathematician turned cryptographer turned quant (Renaissance) turned bioinformaticist.

Population Structure and Eigenanalysis

Nick Patterson et al. (Broad Institute of Harvard and MIT)

Abstract
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.



...Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, then the BBP threshold for two equal subpopulations, each of size 500, is FST = .0001. An FST value of .001 will thus be trivial to detect. To put this into context, we note that a typical value of FST between human populations in Northern and Southern Europe is about .006 [15]. Thus, we predict: most large genetic datasets with human data will show some detectable population structure.
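The numbers in the quote are consistent with a detection threshold of the form FST ≈ 1/sqrt(n × m), where n is the sample size and m the number of markers. Treat that formula as an assumption here (it is my reading of the result for two equal subpopulations; check the paper for the exact statement and its conditions). The function names below are made up for the example.

```python
import math

def bbp_fst_threshold(n_samples, n_markers):
    """Assumed threshold for two equal subpopulations: 1 / sqrt(n * m)."""
    return 1.0 / math.sqrt(n_samples * n_markers)

def markers_needed(fst, n_samples):
    """Invert the assumed threshold: markers needed to reach a given FST."""
    return 1.0 / (n_samples * fst ** 2)

# The case quoted above: 1,000 samples on a 100,000 marker array.
threshold = bbp_fst_threshold(1_000, 100_000)
print("threshold FST:", threshold)

# How many markers would 1,000 samples need to reach FST = .0008,
# the France/Spain value mentioned earlier in the post?
print("markers for FST = .0008:", round(markers_needed(0.0008, 1_000)))
```

Under this assumed form, halving the detectable FST costs a factor of four in n × m, which is why large arrays make even closely related populations resolvable.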

Saturday, November 29, 2008

Human genetic variation, Fst and Lewontin's fallacy in pictures

In an earlier post, European genetic substructure, I displayed the following graphic, illustrating the genetic clustering of human populations.




Figure: The three clusters shown above are European (top, green + red), Nigerian (light blue) and E. Asian (purple + blue).

The figure seems to contradict an often stated observation about human genetic diversity, which has become known among experts as Lewontin's fallacy: genetic variation between two random individuals in a given population accounts for 80% or more of the total variation within the entire human population. Therefore, according to the fallacy, any classification of humans into groups ("races") based on genetic information is impossible. ("More variation within groups than between groups.")

To understand this statement better, consider the F statistic of population genetics, introduced by Sewall Wright:

Fst = 1 - Dw / Db

Db and Dw represent the average number of pairwise differences between two individuals sampled from different populations (Db = "difference between") or the same population (Dw = "difference within"). Even in the most widely separated human populations Fst < .2 so Dw / Db > .8 (roughly). This may not sound like very much genetic diversity, but it is more than in many other animal species. See here for recent high statistics Fst values by nationality.

Dw / Db > .8 means that the average genetic distance measured in number of base pair differences between two members of a group (e.g., two randomly selected Europeans) is at least 80 percent of the average distance between distant groups (e.g., Europeans and Asians or Africans). In other words, if two individuals from very distant groups (e.g., a Japanese and a Nigerian) have on average N base pair differences, then two from the same group (e.g., two Nigerians or two Japanese) will on average have roughly .8 N base pair differences.
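The arithmetic above is short enough to check directly. The sketch below just encodes the definition Fst = 1 - Dw / Db from this post; the function names and the example count of 100 differences are made up for illustration.

```python
def fst(d_within, d_between):
    """Fst = 1 - Dw / Db, as defined in the post."""
    return 1.0 - d_within / d_between

def within_over_between(fst_value):
    """Rearranged: Dw / Db = 1 - Fst."""
    return 1.0 - fst_value

# If two people from distant groups differ at 100 sites on average and
# two people from the same group differ at 80, then Fst is about .2.
print(fst(80, 100))

# Going the other way: an Fst of .19 means within-group pairs show
# about 81 percent as many differences as between-group pairs.
print(within_over_between(0.19))
```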

How can the Fst result ("more variation within groups than between groups") be consistent with the clusters shown in the figure? I've had to explain this on numerous occasions, always with great difficulty because the explanation requires a little mathematics. In order to make the point more accessible, I've created the figures below, which show two population clusters, each represented by an ellipsoid (blob). The different figures depict the same pair of objects, just viewed from different angles.

The blobs are constructed and arranged so that the average distance between two points (individuals) within the same cluster is almost as big as the average distance between two points (individuals) in different clusters. This is easy to achieve if the ellipsoids are big and flat (like pancakes) and placed close to each other along the flat directions. The figure is meant to show how one can have small Fst, as in humans, yet easily resolved clusters. The direction in which the gap between the clusters appears is one of the principal components in the space of human genetic variation, as recently found by bioinformaticists. The figure at the top of this post plots individuals as points in the space generated by the two largest principal components extracted from the combination of data from HapMap and from large statistics sampling of Europeans. Exhibited this way, isolated clusters ("races") are readily apparent.

The real space of genetic variation has many more than 3 dimensions, so it can't be easily visualized. But some aspects of the figures below still apply: there will be particular directions of variation over which different populations are more or less identical (orthogonal to the principal component; i.e. along the flat directions of each pancake), and there will be directions in which different populations differ radically and have little or no overlap. Note, however, that we are specifically referring to genetic variation, which may or may not translate into phenotypic variation.
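The pancake construction can also be checked numerically. The following is a toy sketch in 3 dimensions with made-up numbers (the spreads, the gap, and the sample size are all assumptions), and it uses squared Euclidean distance as a stand-in for the number of base pair differences. It is not real genetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Two "pancakes": wide in x and y, thin in z, offset along the thin axis.
spread = np.array([10.0, 10.0, 0.5])   # hypothetical standard deviations
gap = 3.0                               # hypothetical offset along z
a = rng.normal(size=(n, 3)) * spread
b = rng.normal(size=(n, 3)) * spread + np.array([0.0, 0.0, gap])

def mean_sq_dist(p, q):
    """Mean squared distance over all pairs (one point from p, one from q)."""
    return (p ** 2).sum(axis=1).mean() + (q ** 2).sum(axis=1).mean() \
        - 2.0 * p.mean(axis=0) @ q.mean(axis=0)

dw = 0.5 * (mean_sq_dist(a, a) + mean_sq_dist(b, b))   # "difference within"
db = mean_sq_dist(a, b)                                  # "difference between"
fst_like = 1.0 - dw / db

# Classify using only the thin coordinate: which side of the midpoint?
accuracy = 0.5 * ((a[:, 2] < gap / 2).mean() + (b[:, 2] >= gap / 2).mean())

print("Dw / Db:", round(float(dw / db), 3))
print("Fst-like statistic:", round(float(fst_like), 3))
print("classification accuracy:", round(float(accuracy), 4))
```

Within-cluster distances come out nearly as large as between-cluster distances, yet almost every point is classified correctly once you look along the right direction. In this toy the separating direction is known in advance; in real high-dimensional data it has to be found, which is where the principal components come in.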







Related posts: "no scientific basis for race" , metric on the space of genomes.

The existence of this clustering has been known for 40 years.

Sunday, November 23, 2008

European genetic substructure



Figure: Each point is an individual, and the axes are two principal components in the space of genetic variation. Colors correspond to individuals of different European ancestry.

The figure above is from this paper, published in a Nature Publishing Group journal: European Journal of Human Genetics (2008) 16, 1413–1429; doi:10.1038/ejhg.2008.210

Abstract: An investigation into fine-scale European population structure was carried out using high-density genetic variation on nearly 6000 individuals originating from across Europe. The individuals were collected as control samples and were genotyped with more than 300 000 SNPs in genome-wide association studies using the Illumina Infinium platform. A major East–West gradient from Russian (Moscow) samples to Spanish samples was identified as the first principal component (PC) of the genetic diversity. The second PC identified a North–South gradient from Norway and Sweden to Romania and Spain. ...

Some interesting points:

1) Significant East-West and North-South substructure is apparent already from the figure. The resolution of the study is sufficiently high that Swedes and Norwegians can be distinguished with 90 percent accuracy (Table 4). Crime scene forensics will never be the same -- "the Swede did it!" ;-)

In conclusion, we have shown that using PCA techniques it is possible to detect fine-level genetic variation in European samples. The genetic and geographic distances between samples are highly correlated, resulting in a striking concordance between the scatter plot of the first two components from a PCA of European samples and a geographic map of sample origins. We have shown how this information can be used to predict the origin of unknown samples in a rapid, precise and robust manner, and that this prediction can be performed without requiring access to the individual genotype data on the original samples of known origin. ...


2) Genetic distances between population clusters are roughly as follows: the distance between two neighboring western European populations is of order one in units of standard deviations and the distance to the Russian cluster is several times larger than that -- say, 3 or 4. From HapMap data, the distance from Russian to Chinese and Japanese clusters is about 18, and the distance of southern Europeans to the Nigerian cluster is about 19. The chance of mis-identifying a European as an African or E. Asian is exponentially small! (Table 5)

...The distance measure is a measure of the distance in standard deviations from a sample to the center of the closest matching population.

...For the other HapMap populations, the classification procedure assigned 100% of the YRI [Yoruban = Nigerian] samples to France, and almost 100% of the CHB and JPT [Chinese and Japanese] samples to Russia. However, the distribution of the distance measure for the four populations was quite different. For the CEU [HapMap European] samples, the median and 95% CI of the distance measure were 0.41 (0.11–1.01), whereas for the YRI, CHB and JPT populations, the median and 95% CIs were 19.3 (18.0–20.6), 17.7 (15.9–19.3) and 18.0 (15.4–19.6), respectively.

...The Yoruban [Nigerian] and Asian samples were identified as belonging to the countries on the south and east edges, respectively, of the European cluster, and the distance measure clearly indicates that they do not fit well into any of the proposed populations. ...
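The point of the distance measure in the quoted passages is that a nearest-match classifier always returns some label, so the distance is what tells you whether the label means anything. Here is a minimal sketch of that idea. The population names, centers, and standard deviations are hypothetical, and the paper's actual procedure may differ in detail.

```python
import numpy as np

# Hypothetical population centers and per-axis standard deviations in a
# 2-D principal-component space (illustrative numbers, not from the paper).
centers = {
    "A": np.array([0.0, 0.0]),
    "B": np.array([1.5, 0.0]),
    "C": np.array([4.0, 1.0]),
}
sds = {name: np.array([1.0, 1.0]) for name in centers}

def classify(sample):
    """Return (closest population, distance to its center in standard deviations)."""
    dists = {
        name: float(np.linalg.norm((sample - centers[name]) / sds[name]))
        for name in centers
    }
    best = min(dists, key=dists.get)
    return best, dists[best]

# A sample near one of the reference populations: small distance.
near_label, near_dist = classify(np.array([0.2, -0.1]))

# A sample far outside all of them still gets a label (the population on
# the nearest edge), but the distance shows it fits none of them.
far_label, far_dist = classify(np.array([20.0, 5.0]))

print(near_label, round(near_dist, 2))
print(far_label, round(far_dist, 2))
```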



Figure: The three clusters shown above are European (top, green + red), Nigerian (light blue) and E. Asian (purple + blue).


See additional discussion at gnxp (the modified figure is from Razib), Dienekes

Related posts: "no scientific basis for race" , metric on the space of genomes

Friday, September 16, 2005

PC censorship

I used to be an admirer of Berkeley economics professor Brad DeLong. I've been a regular reader of his blog for several years. Indeed, he, more than any other individual, inspired me to start blogging. This recent post on the US dollar conundrum shows him at his analytical and expository best.

I was shocked to discover that Professor DeLong actively and surreptitiously censors comments on his blog. You can read about it here. I can't explain his actions without assuming his goal is obfuscation, rather than truth seeking.

I've copied three of my comments below, originally posted to this discussion of human evolution. The third was removed without explanation.

Posted by: steve | Sep 12, 2005 7:55:47 PM

As mentioned, Brad's calculation neglects the possibility that mutations which might be adaptive in one environment are not necessarily adaptive in another.

According to the recent Lahn research (linked to by Anne, above), certain alleles of genes with direct effects on brain function have been subject to strong selection over the last tens of thousands of years. In certain populations (e.g., in Eurasia), the new alleles have rapidly replaced their predecessors, so they were clearly adaptive in those environments. In other populations (sub-Saharan Africa), the new alleles are rare, so they were probably less adaptive.

I don't know what will happen in the future, but current research shows that geographically isolated populations can have very different distributions of certain alleles, and not just those related to superficial features like skin or hair color.

Most fascinating is the possibility that relatively recent mutations had something to do with the rapid advancement of human civilization over the last 5000 years. (The ASPM variant may have emerged about that long ago.)

Posted by: steve | Sep 13, 2005 10:07:14 AM

Janet,

As already noted, it will take hundreds of years (at minimum) for mixing to eliminate the correlations between genes and "race" (or ancestral geographic lineage) that we currently have. That is a very long time from the perspective of social policy, although not in evolutionary terms.

I am by no means a fan of Sullivan, but I think he is correct to say that most liberals (I am one myself) have, due to wishful thinking, gratefully accepted the "there is no scientific basis for race" line. Anne's post of the NYTimes op-ed by LeRoi gives the history of this facile, but now doomed, position. (Cochran's explanation above is very clear - better than LeRoi's.) I don't think most people appreciate that we are now on a Moore's Law growth curve for genomic information. Google "hapmap" and have a look for yourself at the state of the art.

Rather than rely on the scientifically unsupported claim that "we are all equal," it would be better to teach our students that we all have inalienable human rights regardless of our abilities or genetic make up. Continuing to rely on the false equality premise only undermines the liberal position on race issues.

Posted by: steve | Sep 13, 2005 11:17:21 PM

gcochran wrote:

"Do principal component analysis on the covariance matrix for many loci (or cluster analysis) and !presto! - Bob's your uncle."

This gets right to the point (see an earlier post by gcochran for a less terse explanation). Too bad that very few readers here will understand (or even try to understand) what it means. Bambi vs Godzilla had the insight to ask the question properly. Will he or she make the effort to understand the answer?

Imagine each individual's genetic code as a point in a space of *very high* dimension. Then look at clusters of points. (Define a cluster as a group of points whose distance from each other is less than some radius; distinct clusters are separated by distances larger than this radius.) These clusters map directly onto traditional groupings of ethnicity. In fact, a recent study by Neil Risch at UCSF showed that self-reported "race" correlates very well with the clustering results. (Mixed race people are obviously an exception, but as discussed they are a small fraction of the total population, and will continue to be for some time.)
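The cluster definition in the paragraph above can be turned into a short procedure. The sketch below takes one reading of it: points are linked when they are closer than the radius, and a cluster is everything reachable through such links. That is slightly looser than requiring every pair in a cluster to be within the radius. The data are synthetic Gaussian blobs in 50 dimensions, and the radius and offsets are assumptions picked for the example.

```python
import numpy as np

def radius_clusters(points, radius):
    """Label points so that any two closer than `radius` share a cluster."""
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Flood-fill outward from an unlabeled point.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            d = np.linalg.norm(points - points[i], axis=1)
            for j in np.flatnonzero(d < radius):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Two synthetic groups of 100 "individuals" in a 50-dimensional space.
rng = np.random.default_rng(0)
dim = 50
group_a = rng.normal(size=(100, dim))
group_b = rng.normal(size=(100, dim)) + 5.0   # shifted in every coordinate
points = np.vstack([group_a, group_b])

labels = radius_clusters(points, radius=20.0)
print("number of clusters found:", len(set(labels)))
```

The procedure is never told which group a point came from. Comparing its labels with the known group membership is the analogue of comparing clustering results with self-reported ancestry.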

People (especially professors of social science) who confidently state to their students that "there is no genetic basis for race" should think through the analysis described above and look at the data carefully if they want to retain their credentials as scientists.

From the conclusions of the Risch paper (Am. J. Hum. Genet. 76:268–275, 2005):

Attention has recently focused on genetic structure in the human population. Some have argued that the amount of genetic variation within populations dwarfs the variation between populations, suggesting that discrete genetic categories are not useful (Lewontin 1972; Cooper et al. 2003; Haga and Venter 2003). On the other hand, several studies have shown that individuals tend to cluster genetically with others of the same ancestral geographic origins (Mountain and Cavalli-Sforza 1997; Stephens et al. 2001; Bamshad et al. 2003). Prior studies have generally been performed on a relatively small number of individuals and/or markers. A recent study (Rosenberg et al. 2002) examined 377 autosomal micro-satellite markers in 1,056 individuals from a global sample of 52 populations and found significant evidence of genetic clustering, largely along geographic (continental) lines. Consistent with prior studies, the major genetic clusters consisted of Europeans/West Asians (whites), sub-Saharan Africans, East Asians, Pacific Islanders, and Native Americans. ... ethnic groups living in the United States, with a discrepancy rate of only 0.14%.
