Thursday, October 22, 2020

Replications of Height Genomic Prediction: Harvard, Stanford, 23andMe

These are two replications of our 2017 height prediction results (also recently validated using sibling data) that I neglected to blog about previously.

1. Senior author Liang is in the Departments of Epidemiology and Biostatistics at Harvard.
Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes 
Wonil Chung, Jun Chen, Constance Turman, Sara Lindstrom, Zhaozhong Zhu, Po-Ru Loh, Peter Kraft and Liming Liang 
Nature Communications volume 10, Article number: 569 (2019) 
We introduce cross-trait penalized regression (CTPR), a powerful and practical approach for multi-trait polygenic risk prediction in large cohorts. Specifically, we propose a novel cross-trait penalty function with the Lasso and the minimax concave penalty (MCP) to incorporate the shared genetic effects across multiple traits for large-sample GWAS data. Our approach extracts information from the secondary traits that is beneficial for predicting the primary trait based on individual-level genotypes and/or summary statistics. Our novel implementation of a parallel computing algorithm makes it feasible to apply our method to biobank-scale GWAS data. We illustrate our method using large-scale GWAS data (~1M SNPs) from the UK Biobank (N = 456,837). We show that our multi-trait method outperforms the recently proposed multi-trait analysis of GWAS (MTAG) for predictive performance. The prediction accuracy for height by the aid of BMI improves from R² = 35.8% (MTAG) to 42.5% (MCP + CTPR) or 42.8% (Lasso + CTPR) with UK Biobank data.

2. This is a 2019 Stanford paper. Tibshirani and Hastie are famous researchers in statistics and machine learning. Figure is from their paper.

A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems 
Junyang Qian, Wenfei Du, Yosuke Tanigawa, Matthew Aguirre, Robert Tibshirani, Manuel A. Rivas, Trevor Hastie 
Department of Statistics, Stanford University; Department of Biomedical Data Science, Stanford University
Since its first proposal in statistics (Tibshirani, 1996), the lasso has been an effective method for simultaneous variable selection and estimation. A number of packages have been developed to solve the lasso efficiently. However as large datasets become more prevalent, many algorithms are constrained by efficiency or memory bounds. In this paper, we propose a meta algorithm batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and build a scalable lasso solution for large datasets. We also introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) for large-scale single nucleotide polymorphism (SNP) datasets that are widely studied in genetics. We demonstrate results on a large genotype-phenotype dataset from the UK Biobank, where we achieve state-of-the-art heritability estimation on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.
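
Both papers above build on L1-penalized (lasso) regression, so here is a toy illustration of the core idea for readers who have not seen it. This is only a minimal sketch under my own assumptions, not the CTPR or BASIL/snpnet implementation: the function names (soft_threshold, lasso_cd, r_squared, demo) are hypothetical, the data are synthetic Gaussian stand-ins for standardized genotypes, and the solver is plain cyclic coordinate descent in pure Python rather than anything biobank-scale.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of phenotypes. No intercept is fitted, so
    both are assumed to be centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, starting from b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual, adding back the
            # contribution of beta[j] itself (the partial residual).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def r_squared(y, y_hat):
    """Fraction of phenotype variance explained by the predictions."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def demo(seed=0, n_train=150, n_test=100, p=20, lam=0.1):
    """Fit on synthetic data with 3 causal features, score on held-out rows."""
    rng = random.Random(seed)
    true_beta = [1.0, -0.8, 0.6] + [0.0] * (p - 3)

    def sample(n):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 1) for row in X]
        return X, y

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test)
    beta = lasso_cd(X_tr, y_tr, lam)
    y_hat = [sum(b * x for b, x in zip(beta, row)) for row in X_te]
    return beta, r_squared(y_te, y_hat)


if __name__ == "__main__":
    beta, r2 = demo()
    print("nonzero coefficients:", sum(1 for b in beta if b != 0.0))
    print("held-out R^2:", round(r2, 3))
```

The soft-threshold step is what sets most coefficients exactly to zero, which is why the lasso does variable selection and estimation at once. The held-out R² is the same kind of accuracy measure quoted in the Harvard abstract.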

The very first validation I heard about came soon after we posted our paper (2018, IIRC). I visited 23andMe to give a talk about genomic prediction, and one of the PhD researchers there said that they had reproduced our results, presumably using their own data. At a meeting later in the day, one of the VPs from the business side, who had missed my talk that morning, was shocked when I mentioned an accuracy of a few cm for height. He turned to one of the 23andMe scientists in the room and exclaimed:

I thought WE were the best in the world at this stuff!?
