Monday, April 05, 2021

Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank

These new results arose from initial investigations of blood biomarker predictions from DNA. The lipoprotein A predictor we built correlates almost 0.8 with the measured result, and this agreement would probably be even stronger if day to day fluctuations were averaged out. It is the most accurate genomic predictor for a complex trait that we are aware of.

We then became interested in the degree to which biomarkers alone could be used to predict disease risk. Some of the biomarker-based disease risk predictors we built (e.g., for kidney or liver problems) do not, as far as we know, have widely used clinical counterparts. Further research may show that predictors of this kind have broad utility. 

Statistical learning in a space of ~50 biomarkers is considered a "high dimensional" problem from the perspective of medical diagnosis, however compared to genomic prediction using a million SNP features, it is rather straightforward. 
 
Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank  
Erik Widen, Timothy G. Raben, Louis Lello, Stephen D.H. Hsu 
doi: https://doi.org/10.1101/2021.04.01.21254711 
We use UK Biobank data to train predictors for 48 blood and urine markers such as HDL, LDL, lipoprotein A, glycated haemoglobin, ... from SNP genotype. For example, our predictor correlates  ~ 0.76 with lipoprotein A level, which is highly heritable and an independent risk factor for heart disease. This may be the most accurate genomic prediction of a quantitative trait that has yet been produced (specifically, for European ancestry groups). We also train predictors of common disease risk using blood and urine biomarkers alone (no DNA information). Individuals who are at high risk (e.g., odds ratio of > 5x population average) can be identified for conditions such as coronary artery disease (AUC ~ 0.75), diabetes (AUC ~ 0.95), hypertension, liver and kidney problems, and cancer using biomarkers alone. Our atherosclerotic cardiovascular disease (ASCVD) predictor uses ~10 biomarkers and performs in UKB evaluation as well as or better than the American College of Cardiology ASCVD Risk Estimator, which uses quite different inputs (age, diagnostic history, BMI, smoking status, statin usage, etc.). We compare polygenic risk scores (risk conditional on genotype: (risk score | SNPs)) for common diseases to the risk predictors which result from the concatenation of learned functions (risk score | biomarkers) and (biomarker | SNPs).

No comments:

Post a Comment