Within the European + HapMap sample analyzed here, over 100 statistically significant PCA vectors were identified. That is, there is a space of more than 100 dimensions within which structure can be teased out. (However, the largest single vector accounts for only about one percent of total variation, and the sum over all 100 vectors is probably only a few percent.) Norwegians and Swedes could be resolved with 90 percent accuracy. Note that the Patterson et al. paper was written before this recent analysis, which confirms its theoretical predictions of sensitivity. (Figure below.)
The first author, Nick Patterson (profiled here), is a mathematician turned cryptographer turned quant (Renaissance) turned bioinformaticist.
Population Structure and Eigenanalysis
Nick Patterson et al. (Broad Institute of Harvard and MIT)
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general “phase change” phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.
...Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, then the BBP threshold for two equal subpopulations, each of size 500, is FST = .0001. An FST value of .001 will thus be trivial to detect. To put this into context, we note that a typical value of FST between human populations in Northern and Southern Europe is about .006. Thus, we predict: most large genetic datasets with human data will show some detectable population structure.
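
As a rough check on the numbers in the quoted passage: for two equal-sized subpopulations, the BBP threshold is, as far as I understand the paper, approximately FST = 1/sqrt(n * m), where n is the number of markers and m the total sample size. The sketch below assumes that form; the function names are my own, purely for illustration, and nothing here comes from the authors' software.

```python
import math


def bbp_fst_threshold(n_markers: int, n_samples: int) -> float:
    """Approximate FST detection threshold for two equal subpopulations.

    Assumes the threshold has the form 1 / sqrt(n_markers * n_samples).
    """
    return 1.0 / math.sqrt(n_markers * n_samples)


def markers_needed(fst: float, n_samples: int) -> float:
    """Invert the same assumed formula: markers at which `fst` sits at threshold."""
    return 1.0 / (fst ** 2 * n_samples)


if __name__ == "__main__":
    # The example from the quoted passage: 100,000 markers, 1,000 samples.
    threshold = bbp_fst_threshold(100_000, 1_000)
    print(f"threshold FST ~ {threshold:.4g}")

    # How far above threshold is a typical North/South Europe FST of .006?
    print(f"ratio to threshold: {0.006 / threshold:.0f}x")

    # Markers at which FST = .006 would sit right at threshold for 1,000 samples.
    print(f"markers needed: {markers_needed(0.006, 1_000):.1f}")
```

Under this assumption the 100,000 marker, 1,000 sample example reproduces the quoted threshold of .0001, and an FST of .006 sits well above it, which is why the abstract expects structure to be detectable in most large human datasets.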