Tuesday, June 29, 2021

Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank (published version)


This is the published version of our MedRxiv preprint discussed back in April 2021. It is in the special issue Application of Genomic Technology in Disease Outcome Prediction of the journal Genes. 

There is a lot in this paper: genomic prediction of important biomarkers (e.g., lipoprotein A, mean platelet (thrombocyte) volume, bilirubin, platelet count) and prediction of important disease risks from biomarkers (novel ML in a ~65 dimensional space), with potential clinical applications. As is typical, genomic predictors trained in a European ancestry population perform less well in distant populations (e.g., S. Asians, E. Asians, Africans). This is probably due to different SNP LD (correlation) structure across populations. However, predictors of disease risk that use directly measured biomarkers do not show this behavior; they can be applied even to distant ancestry groups.

The referees did not like our conditional probability notation:
( biomarkers | SNPs )   and   ( disease risk | biomarkers )
So we ended up with lots of acronyms to refer to the various predictors.

Some of the biomarkers identified by ML as important for predicting specific disease risk are not familiar to practitioners and have not been previously discussed (as far as we could tell from the literature) as relevant to that specific disease. One medical school professor and practitioner, upon seeing our results, said he would in future add several new biomarkers to routine blood tests ordered for his patients.
 
Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank 
Erik Widen 1,*, Timothy G. Raben 1, Louis Lello 1,2,* and Stephen D. H. Hsu 1,2 
1 Department of Physics and Astronomy, Michigan State University, 567 Wilson Rd, East Lansing, MI 48824, USA 
2 Genomic Prediction, Inc., 675 US Highway One, North Brunswick, NJ 08902, USA 
*Authors to whom correspondence should be addressed. 
Academic Editor: Sulev Koks 
Genes 2021, 12(7), 991; https://doi.org/10.3390/genes12070991 (registering DOI) 
Received: 30 March 2021 / Revised: 22 June 2021 / Accepted: 23 June 2021 / Published: 29 June 2021 
(This article belongs to the Special Issue Application of Genomic Technology in Disease Outcome Prediction) 
Abstract 
We use UK Biobank data to train predictors for 65 blood and urine markers such as HDL, LDL, lipoprotein A, glycated haemoglobin, etc. from SNP genotype. For example, our Polygenic Score (PGS) predictor correlates ∼0.76 with lipoprotein A level, which is highly heritable and an independent risk factor for heart disease. This may be the most accurate genomic prediction of a quantitative trait that has yet been produced (specifically, for European ancestry groups). We also train predictors of common disease risk using blood and urine biomarkers alone (no DNA information); we call these predictors biomarker risk scores, BMRS. Individuals who are at high risk (e.g., odds ratio of >5× population average) can be identified for conditions such as coronary artery disease (AUC∼0.75), diabetes (AUC∼0.95), hypertension, liver and kidney problems, and cancer using biomarkers alone. Our atherosclerotic cardiovascular disease (ASCVD) predictor uses ∼10 biomarkers and performs in UKB evaluation as well as or better than the American College of Cardiology ASCVD Risk Estimator, which uses quite different inputs (age, diagnostic history, BMI, smoking status, statin usage, etc.). We compare polygenic risk scores (risk conditional on genotype: PRS) for common diseases to the risk predictors which result from the concatenation of learned functions BMRS and PGS, i.e., applying the BMRS predictors to the PGS output.
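
To make the "concatenation of learned functions" in the last sentence concrete, here is a minimal sketch of applying a BMRS to PGS output. Everything in it is an illustrative assumption rather than the paper's implementation: the biomarker names, the weights, the linear form of the PGS, and the logistic form of the BMRS are all made up for the example.

```python
import math

# Hypothetical toy weights; the paper's trained predictors are not reproduced here.
# Each biomarker maps to (per-SNP effect sizes, intercept).
PGS_WEIGHTS = {
    "biomarker_a": ([0.5, -0.2, 0.1], 1.0),
    "biomarker_b": ([0.0, 0.3, -0.4], 2.0),
}
BMRS_WEIGHTS = {"biomarker_a": 0.8, "biomarker_b": -0.5}
BMRS_INTERCEPT = -1.0


def pgs(genotype):
    """Predict each biomarker level from SNP allele counts (0, 1 or 2).

    Assumes a simple linear model per biomarker.
    """
    return {
        name: intercept + sum(b * g for b, g in zip(betas, genotype))
        for name, (betas, intercept) in PGS_WEIGHTS.items()
    }


def bmrs(biomarkers):
    """Map biomarker levels to a disease probability.

    Assumes a logistic model; this is a stand-in for whatever model is trained.
    """
    z = BMRS_INTERCEPT + sum(w * biomarkers[name] for name, w in BMRS_WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def composed_risk(genotype):
    """Apply the BMRS to PGS-predicted (rather than measured) biomarkers."""
    return bmrs(pgs(genotype))


if __name__ == "__main__":
    print(composed_risk([0, 1, 2]))
```

The point of the composition is that the risk estimate needs only genotype as input, so it can be compared directly with a PRS trained on the disease itself.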


Figure 11. The ASCVD BMRS and the ASCVD Risk Estimator both make accurate risk predictions but with partially complementary information. (Upper left): Predicted risk by BMRS, the ASCVD Risk Estimator and a PRS predictor were binned and compared to the actual disease prevalence within each bin. The gray 1:1 line indicates perfect prediction. ... The ASCVD Risk Estimator was applied to 340k UKB samples while the others were applied to an evaluation set of 28k samples, all of European ancestry. (Upper right) shows a scatter plot and distributions of the risk predicted by BMRS versus the risk predicted by the ASCVD Risk Estimator for the 28k Europeans in the evaluation set. The BMRS distribution has a longer tail of high predicted risk, providing the tighter confidence interval in this region. The left plot y-axis is the actual prevalence within the horizontal and vertical cross-sections, as illustrated with the shaded bands corresponding to the hollow squares to the left. Notably, both predictors perform well despite the differences in assigned stratification. The hexagons are an overlay of the (lower center) heat map of actual risk within each bin (numbers are bin sizes). Both high risk edges have varying actual prevalence but with a very strong enrichment when the two predictors agree.
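
The binned comparison of predicted risk to actual prevalence described in the caption can be sketched as follows. This is a minimal illustration, assuming equal-width bins on [0, 1] and 0/1 outcome labels; the function name and the binning choice are mine, not taken from the paper.

```python
def calibration_bins(predicted, outcomes, n_bins=5):
    """Group samples into equal-width predicted-risk bins on [0, 1].

    Returns a list of (mean predicted risk, observed prevalence, bin size)
    for every non-empty bin. A well-calibrated predictor has the first two
    values close to each other, i.e. points near the 1:1 line.
    """
    sums = [0.0] * n_bins
    cases = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(predicted, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[i] += p
        cases[i] += y
        counts[i] += 1
    return [
        (sums[i] / counts[i], cases[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


if __name__ == "__main__":
    # Made-up predictions and outcomes, for illustration only.
    for row in calibration_bins([0.1, 0.15, 0.9, 0.95], [0, 0, 1, 1], n_bins=2):
        print(row)
```

The bin sizes are returned alongside the prevalence because sparsely populated high-risk bins have wide confidence intervals.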
