Predicting the diagnosis of autism spectrum disorder using gene pathway analysis (Nature Molecular Psychiatry)
Skafidas E, Testa R, Zantomio D, Chana G, Everall IP, Pantelis C.
Centre for Neural Engineering, The University of Melbourne, Parkville, VIC, Australia.
Autism spectrum disorder (ASD) depends on a clinical interview with no biomarkers to aid diagnosis. The current investigation interrogated single-nucleotide polymorphisms (SNPs) of individuals with ASD from the Autism Genetic Resource Exchange (AGRE) database. SNPs were mapped to Kyoto Encyclopedia of Genes and Genomes (KEGG)-derived pathways to identify affected cellular processes and develop a diagnostic test. This test was then applied to two independent samples from the Simons Foundation Autism Research Initiative (SFARI) and Wellcome Trust 1958 normal birth cohort (WTBC) for validation. Using AGRE SNP data from a Central European (CEU) cohort, we created a genetic diagnostic classifier consisting of 237 SNPs in 146 genes that correctly predicted ASD diagnosis in 85.6% of CEU cases. This classifier also predicted 84.3% of cases in an ethnically related Tuscan cohort; however, prediction was less accurate (56.4%) in a genetically dissimilar Han Chinese cohort (HAN). Eight SNPs in three genes (KCNMB4, GNAO1, GRM5) had the largest effect in the classifier with some acting as vulnerability SNPs, whereas others were protective. Prediction accuracy diminished as the number of SNPs analyzed in the model was decreased. Our diagnostic classifier correctly predicted ASD diagnosis with an accuracy of 71.7% in CEU individuals from the SFARI (ASD) and WTBC (controls) validation data sets. In conclusion, we have developed an accurate diagnostic test for a genetically homogeneous group to aid in early detection of ASD. While SNPs differ across ethnic groups, our pathway approach identified cellular processes common to ASD across ethnicities. Our results have wide implications for detection, intervention and prevention of ASD.
It looks like they used a quasi-linear ("superadditive") prediction model after using biochemical pathway analysis to restrict to a subset of candidate genes. It doesn't matter how you get the candidate genes -- all that matters is that you obtain predictive power.
Predicting ASD phenotype based upon candidate SNPs
For each individual, a 775-dimensional vector was constructed, corresponding to 775 unique SNPs identified as part of the GSEA. To examine whether SNPs could predict an individual’s clinical status (ASD versus non-ASD), two-tail unpaired t-tests were used to identify which of the 775 SNPs had statistically significant differences in mean SNP value (P<0.005). This significance level provided low classification error while maintaining acceptable variance in estimation of regression coefficients for each SNP’s contribution status, and provided the set of SNPs that maximized the classifier output between the populations (Fig 2 and S2). This resulted in 237 SNPs selected for regression analysis. Each dimension of the vector was assigned a value of 0, 1 or 3, dependent on a SNP having two copies of the dominant allele, heterozygous or two copies of the minor allele. The ‘0, 1, 3’ weighting provided greater classification accuracy over ‘0, 1, 2’. Such approaches using superadditive models have been used previously to understand genetic interactions.
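The selection step described in the excerpt can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, "dominant allele" is read here as the common (major) allele, and the p-value uses a large-sample normal approximation to an unequal-variance t-test rather than whatever exact test the paper ran.

```python
import math
from statistics import NormalDist, mean, variance

# Assumed reading of the '0, 1, 3' weighting: index by minor-allele count.
# 0 copies -> 0, 1 copy (heterozygous) -> 1, 2 copies -> 3.
SUPERADDITIVE = {0: 0, 1: 1, 2: 3}


def encode(minor_allele_counts):
    """Map per-SNP minor-allele counts (0, 1, 2) to the '0, 1, 3' coding."""
    return [SUPERADDITIVE[c] for c in minor_allele_counts]


def welch_p_value(a, b):
    """Two-tailed p-value for a difference in means.

    Normal approximation (an assumption made to stay within the standard
    library); reasonable only for fairly large groups.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both groups are constant: identical means give p = 1, else p = 0.
        return 1.0 if mean(a) == mean(b) else 0.0
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def select_snps(cases, controls, alpha=0.005):
    """Return indices of SNPs whose mean coded value differs at P < alpha.

    cases, controls: lists of per-individual coded vectors of equal length.
    """
    keep = []
    for j in range(len(cases[0])):
        a = [row[j] for row in cases]
        b = [row[j] for row in controls]
        if welch_p_value(a, b) < alpha:
            keep.append(j)
    return keep
```

In the paper, the analogous filter took 775 candidate SNPs down to 237, which then went into the regression.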
These results, if they hold up, demonstrate just how much information is thrown away in conventional GWAS with 5E-08 "genome-wide" significance thresholds (i.e., P<0.05 Bonferroni-corrected for 1E06 SNPs). In the conventional methodology a SNP is considered a "hit" only if its p-value falls below this threshold, and the "total variance accounted for" by the aggregate of all hits is typically modest (although in the case of height the total is getting fairly large now). This conservative approach reduces the number of false hits (i.e., those that do not replicate) that plagued human genetics a decade ago, but it squanders a lot of predictive power.
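To make the two thresholds concrete, here is a small back-of-the-envelope calculation. The helper names are hypothetical, and the expected-hits figure assumes every test is null and treats the tests as independent, which correlated SNPs in real data will not satisfy.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error near alpha."""
    return alpha / n_tests


def expected_null_hits(n_tests, p_threshold):
    """Expected number of tests passing p_threshold if no SNP has any effect."""
    return n_tests * p_threshold


# Conventional GWAS: 0.05 spread over about one million SNPs, about 5E-08.
gwas_cutoff = bonferroni_threshold(0.05, 1_000_000)

# Pathway-restricted set from the quoted methods: 775 SNPs tested at P < 0.005.
# Under the null this gives roughly 3.9 chance hits, versus 237 SNPs selected.
chance_hits = expected_null_hits(775, 0.005)

print(f"GWAS cutoff: {gwas_cutoff:.1e}")
print(f"Expected chance hits among 775 SNPs: {chance_hits:.3f}")
```

Under these assumptions, only about four of the 237 selected SNPs would be expected by chance alone, although the calculation cannot say which ones.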
The approach taken here first selects 775 SNPs of interest based on pathway information (not considered in standard GWAS) and then requires only P<5E-03 significance. A linear predictor is formed from the 237 SNPs that pass this threshold. The ultimate test is, of course, whether the predictor actually works on (independent) validation samples. Once you have a statistically valid predictor, it doesn't matter how you arrived at it.
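A linear predictor and its validation check might look something like the sketch below. The weights, intercept, cutoff and function names are all placeholders for illustration; the paper fits its own regression coefficients, which are not reproduced here.

```python
def linear_score(weights, intercept, x):
    """Weighted sum of coded SNP values plus an intercept."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))


def predict(weights, intercept, x, cutoff=0.0):
    """Classify as case (1) if the score exceeds the cutoff, else control (0)."""
    return 1 if linear_score(weights, intercept, x) > cutoff else 0


def accuracy(weights, intercept, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(
        predict(weights, intercept, x) == y for x, y in zip(samples, labels)
    )
    return correct / len(labels)


# Hypothetical usage: weights come from the training cohort only, and accuracy
# is then measured on individuals not used for SNP selection or fitting.
toy_weights = [1.0, -0.5]  # positive = vulnerability SNP, negative = protective
toy_validation = [[3, 0], [0, 3], [1, 0], [3, 1]]
toy_labels = [1, 0, 0, 1]
print(accuracy(toy_weights, -1.0, toy_validation, toy_labels))
```

The point of the held-out step is that both SNP selection and coefficient fitting use the training cohort, so only accuracy on unseen individuals says whether the predictor is real.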
The key is the additional information used in the initial guess. If one could cleverly narrow down the set of variants for intelligence to, say, 10k (e.g., by looking at the loci at which modern humans differ from Neanderthals or other earlier ancestors), and then test that subset at, e.g., 1E-04 significance, the resulting predictor *might* be able to reliably distinguish high-g individuals from low-g individuals. When will this approach be tried out? Stay tuned.