I posted back in 2007 on some earlier research of Penner's which showed an 8 percent larger variance in male math ability already at the beginning of kindergarten. (This is not so different from the adult difference in variance.)
For a relatively balanced overview of this topic see Women's underrepresentation in science: Sociocultural and biological considerations; Ceci, Stephen J.; Williams, Wendy M.; Barnett, Susan M. Psychological Bulletin. Vol 135(2), Mar 2009, 218-261. Abstract, PDF.
In the more recent paper Penner claims that national variation in gender gaps in mathematical ability implies that the effect is culturally moderated. While I don't doubt that culture affects development of mathematical ability, and perhaps in such a way as to favor males, I question whether his paper or other recent papers relying on international tests like TIMSS and PISA really have the statistical power to investigate this issue very well. It is already hard to capture national differences in average ability level from tests of only a few thousand students (ensuring that these students are representative of the whole population is difficult); gender gaps are even smaller effects and therefore more sensitive to statistical and systematic error. See here (figure 4) for a convincing demonstration that PISA data on country-by-country gender gaps are noise dominated: the gaps are not stable between the 2003 and 2006 results. Only by aggregating the data over many countries do we arrive at a stable gap. This makes me suspicious of TIMSS results, because PISA has significantly larger sample sizes. A meta-analysis suggests that cultural effects, while perhaps non-zero, are relatively small.
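To make the stability argument concrete, here is a minimal simulation sketch. Every number in it (40 countries, a true gap of 10 points that varies by 2 points between countries, a 6 point standard error per measurement) is an assumption chosen for illustration, not a value taken from PISA or TIMSS. It generates two test waves from the same underlying country gaps and compares the wave-to-wave correlation of the measured gaps with the stability of the all-country average.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_waves(n_countries=40, true_gap=10.0, true_sd=2.0, se=6.0, seed=0):
    """Return measured gender gaps for two test waves.

    All defaults are hypothetical. Each country has a fixed true gap;
    each wave adds independent measurement error with standard error `se`.
    """
    rng = random.Random(seed)
    true_gaps = [rng.gauss(true_gap, true_sd) for _ in range(n_countries)]
    wave_a = [g + rng.gauss(0.0, se) for g in true_gaps]
    wave_b = [g + rng.gauss(0.0, se) for g in true_gaps]
    return wave_a, wave_b


if __name__ == "__main__":
    wave_a, wave_b = simulate_waves()
    print("country-level correlation between waves:", pearson(wave_a, wave_b))
    print("all-country mean gap, wave A:", statistics.fmean(wave_a))
    print("all-country mean gap, wave B:", statistics.fmean(wave_b))
```

When the measurement error is large compared with the true between-country spread, the expected correlation between waves is roughly true_sd² / (true_sd² + se²), which is 0.1 for these assumed values, while the all-country mean is much better determined because the errors average down over countries.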
Andrew and I had an interesting discussion about his paper; my side is summarized in the message and two figures below.
Andrew,
Sorry I had to leave early from your talk and didn't get to discuss this in person. As I mentioned yesterday and in my earlier email, country level gender gaps are not stable between PISA 2003 and 2006, whereas the meta-analysis gap, averaging over all countries, is stable. This to me is clearly a signal that the PISA country level data on gender gaps is dominated by statistical error, and makes me strongly suspect the same is true for TIMSS.
In your talk you said that a biological model would imply the same gender gap in every country, and that country by country variation would undermine the biological model. However, you neglected to mention that statistical error would lead to country by country variation (of measured gaps) even in the biological model.
In the 1995 TIMSS table below there are 8 "gold standard" countries that complied with the statistical procedures. The data from the remaining countries would be suspect, since, as I mentioned, getting a representative sample for a country of millions is not an easy task. In particular, the standard error for countries outside the first group of 8 is likely to be much larger than quoted. (See the column labeled "Difference" in the table. The number in parentheses is the standard error for the gender gap.)
For the gold standard countries, it appears that all gender gaps are within roughly 1-2 standard errors (using the quoted values) of the group average, with the exception of Hungary, which is an outlier. This suggests that the variation within this group could be entirely statistical. That is, if one formulated a "null model" with constant gender gap across countries, and asked whether TIMSS disfavors that model, the answer might be no, at least not in a statistically significant way. (Actually I suspect that the standard error given is an underestimate, because of systematic errors in the sampling procedures even in the gold standard countries.) Note that within this set of countries there is a lot of variation in your societal indicators.
To summarize, I think the claim that TIMSS data supports country level variation in gender gaps has to be considered carefully for statistical significance. As I mentioned, I doubt one can really trust the TIMSS quoted standard errors, so a real test would be time stability of (measured) gender gaps -- a test which PISA fails.
One final comment on your talk: it seems to me that all of the societal variables you listed (labor force participation, wage gap, etc.) have changed significantly in the last 40 years in the US. Nevertheless, I believe gender gaps on the SAT-M (a measurement with a truly large sample) have not narrowed during that time. (See second figure below.)
Steve
(Click for larger versions.)
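The "null model" check proposed in the message above can be sketched as follows. This is only an illustration under stated assumptions: the gaps and standard errors in the example are invented placeholders, not the TIMSS figures, and the function names are hypothetical. The sketch computes the inverse-variance weighted mean gap, the sum of squared standardized deviations from it (often called Cochran's Q), and a Monte Carlo estimate of how often a constant-gap model with the quoted standard errors would produce at least that much scatter.

```python
import random


def weighted_mean(gaps, ses):
    """Inverse-variance weighted mean of the measured gaps."""
    weights = [1.0 / s ** 2 for s in ses]
    return sum(w * g for w, g in zip(weights, gaps)) / sum(weights)


def q_statistic(gaps, ses):
    """Sum of squared deviations from the weighted mean, in units of each SE."""
    mean = weighted_mean(gaps, ses)
    return sum(((g - mean) / s) ** 2 for g, s in zip(gaps, ses))


def null_p_value(gaps, ses, n_sim=20000, seed=0):
    """Monte Carlo p-value for the constant-gap null model.

    Simulated gaps are drawn with mean zero: Q does not depend on the
    value of the common gap, so any constant gives the same distribution.
    """
    rng = random.Random(seed)
    q_obs = q_statistic(gaps, ses)
    exceed = 0
    for _ in range(n_sim):
        fake = [rng.gauss(0.0, s) for s in ses]
        if q_statistic(fake, ses) >= q_obs:
            exceed += 1
    return q_obs, exceed / n_sim


if __name__ == "__main__":
    # Hypothetical gaps and standard errors for 8 countries (NOT TIMSS data).
    gaps = [12.0, 8.0, 15.0, 10.0, 5.0, 11.0, 9.0, 14.0]
    ses = [5.0, 6.0, 5.5, 4.5, 6.5, 5.0, 6.0, 5.5]
    q_obs, p = null_p_value(gaps, ses)
    print("weighted mean gap:", weighted_mean(gaps, ses))
    print("Q statistic:", q_obs, "on", len(gaps) - 1, "degrees of freedom")
    print("Monte Carlo p-value under constant gap:", p)
```

A large p-value would mean the data do not disfavor a constant gap. If the quoted standard errors are underestimates, as suspected above, the true p-value would be larger still, so this check is conservative toward finding variation.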

