Information Processing: Search results for genomic prediction

Sunday, May 01, 2022

Complex Trait Prediction: Methods and Protocols (Springer 2022)

My research group contributed a chapter to this new book on Complex Trait Prediction (see below). The book is unusual in covering applications to humans, plants, and animals in a single volume.

Complex Trait Prediction: Methods and Protocols (Springer Nature)

Editors:

Nourollah Ahmadi and Jérôme Bartholomé

CIRAD, UMR AGAP Institut, Montpellier, France

About this book

This volume explores the conceptual framework and the practical issues related to genomic prediction of complex traits in human medicine and in animal and plant breeding. The book is organized into five parts. Part One reminds molecular genetics approaches intending to predict phenotypic variations. Part Two presents the principles of genomic prediction of complex traits, and reviews factors that affect its reliability. Part Three describes genomic prediction methods, including machine-learning approaches, accounting for different degree of biological complexity, and reviews the associated computer-packages. Part Four reports on emerging trends such as phenomic prediction and incorporation into genomic prediction models of “omics” data and crop growth models. Part Five is dedicated to lessons learned from case studies in the fields of human health and animal and plant breeding, and to methods for analysis of the economic effectiveness of genomic prediction.

Written in the highly successful Methods in Molecular Biology series format, the book provides theoretical bases and practical guidelines for an informed decision making of practitioners and identifies pertinent routes for further methodological researches. Cutting-edge and thorough, Complex Trait Predictions: Methods and Protocols is a valuable resource for scientists and researchers who are interested in learning more about this important and developing field.

Our article (pp 421–446):

From Genotype to Phenotype: Polygenic Prediction of Complex Human Traits

T. Raben, L. Lello, E. Widen, and S. Hsu

Decoding the genome confers the capability to predict characteristics of the organism (phenotype) from DNA (genotype). We describe the present status and future prospects of genomic prediction of complex traits in humans. Some highly heritable complex phenotypes such as height and other quantitative traits can already be predicted with reasonable accuracy from DNA alone. For many diseases, including important common conditions such as coronary artery disease, breast cancer, type I and II diabetes, individuals with outlier polygenic scores (e.g., top few percent) have been shown to have 5 or even 10 times higher risk than average. Several psychiatric conditions such as schizophrenia and autism also fall into this category. We discuss related topics such as the genetic architecture of complex traits, sibling validation of polygenic scores, and applications to adult health, in vitro fertilization (embryo selection), and genetic engineering.

Ungated arXiv version.

Previous discussion:

From Genotype to Phenotype: polygenic prediction of complex human traits

Clinical Applications of Polygenic Risk Scores

Sunday, May 29, 2022

Genomic Prediction in Bloomberg

A nice article in Bloomberg describing polygenic embryo selection in IVF: DNA Testing for Embryos Promises to Predict Genetic Diseases, by Carey Goldberg.

Bloomberg: Simone Collins knew she was pregnant the moment she answered the phone. ... Embryo 3, the fertilized egg that Collins and her husband, Malcolm, had picked, could soon be their daughter—a little girl with, according to their tests, an unusually good chance of avoiding heart disease, cancer, diabetes, and schizophrenia.

This isn’t a story about Gattaca-style designer babies. No genes were edited in the creation of Collins’s embryo. The promise, from dozens of fertility clinics around the world, is just that the new DNA tests they’re using can assess, in unprecedented detail, whether one embryo is more likely than the next to develop a range of illnesses long thought to be beyond DNA-based predictions. It’s a new twist on the industry-standard testing known as preimplantation genetic testing, which for decades has checked embryos for rare diseases, such as cystic fibrosis, that are caused by a single gene.

One challenge with leading killers like cancer and heart disease is that they’re usually polygenic: linked to many different genes with complex interactions. Patients such as Collins can now take tests that assess thousands of DNA data points to decode these complexities and compute the disease risks. Genomic Prediction, the five-year-old New Jersey company that handled the tests for her fertility clinic, generates polygenic risk scores, predicting in percentage terms each embryo’s chances of contracting each disease in the panel, plus a composite score for overall health. Parents with multiple embryos can then weigh the scores when deciding which one to implant.

...

This new form of genetic embryo testing appears to move humanity one step closer to control of its evolution. The $14 billion IVF industry brings more than 500,000 babies into the world each year, and with infertility rates rising, the market is expected to more than double this decade. Companies including Genomic Prediction bet many going into that process have seen enough loved ones suffer from a polygenic disease to want risk scoring.

[ Note I think the number of IVF babies born worldwide each year is more like 1 million, but there is some uncertainty in estimates. ]

...

In December, Genomic Prediction doubled its venture funding to about $25 million and says it will use the cash to expand and add to its testing panel. Boston IVF, one of the biggest fertility networks in the US, recently started offering Genomic Prediction’s polygenic testing to its patients, says CEO David Stern. “Like anything else, you have early adopters,” he says. “We have had patients who worked in the biotech field or the Harvard milieu who came in and asked for it.” Stern predicts that, like egg freezing, polygenic embryo testing will grow slowly at first, but steadily, and eventually demand will reflect the powerful appeal of lowering a child’s odds for disease.

...

Believers such as Collins and her husband support government subsidies for fertility and parenthood but aren’t interested in any conversation about slowing down. “This is about the people who care about giving their children every opportunity,” she says. “I do not believe that law or social norms are going to stop parents from giving their kids advantages.”

This article is well-written and informative. It covers polygenic screening from multiple perspectives: the parents who want a healthy child, the IVF doctors and genetic counselors who help the parents toward that goal, the scientists who study polygenic prediction and its ability to differentiate risk among siblings (i.e., embryos), the bioethicists who worry about a slippery slope to GATTACA.

An important point that is not discussed in the article (understandable, given the complexity of the topics listed above) is that precise genotyping of embryos leads to higher success rates in IVF.

See Preimplantation Genetic Testing for Aneuploidy: New Methods and Higher Pregnancy Rates:

... improved success rates resulting from higher accuracy in aneuploidy screening of embryos will affect millions of families around the world, and over 60% of all IVF families in the US.

The SNP array platform allows very accurate genotyping of each embryo at ~1 million locations in the genome, and the subsequent bioinformatic analysis produces a much more accurate prediction of chromosomal normality than the older methods.

Millions of embryos are screened each year using PGT-A, about 60% of all IVF embryos in the US.

Klaus Wiemer is the laboratory director for Poma Fertility near Seattle. He conducted this study independently, without informing Genomic Prediction.

There are ~3000 embryos in the dataset, all biopsied at Poma, with samples allocated to three testing labs (A, B, C) using the two different methods. The family demographics (e.g., maternal age) were similar in all three groups. Lab B is Genomic Prediction; A and C are two of the largest IVF testing labs in the world, both using NGS.

The results imply lower false-positive rates, lower false-negative rates, and higher accuracy overall from our methods. These lead to a significantly higher pregnancy success rate.

The new technology has the potential to help millions of families all over the world.

This increase in pregnancy success rates was not something we directly aimed for -- rather, we were simply trying to get the most accurate characterization of chromosomal abnormality (aneuploidy) using the high precision genotype from our platform. After Dr. Wiemer surprised us with these results, it became plausible that significant increases in success rates per IVF cycle could still exist as low-hanging fruit. The ~3k embryos used in his study are considered a big sample size in fertility research, whereas in genomics today a big sample is hundreds of thousands or a million individuals.

Prioritizing research in IVF using large sample sizes could plausibly raise success rates per cycle to, e.g., ~80%. The qualitative experience of parents using IVF will improve as average success rates rise, perhaps relieving much of the angst and uncertainty.

Sunday, August 03, 2014

It's all in the gene: cows

Some years ago a German driver took me from the Perimeter Institute to the Toronto airport. He was an immigrant to Canada and had a background in dairy farming. During the ride he told me all about driving German farmers to buy units of semen produced by highly prized Canadian bulls. The use of linear polygenic models in cattle breeding is already widespread, and the review article below gives some idea as to the accuracy.

See also Genomic Prediction: No Bull and Plenty of room at the top.

Invited Review: Reliability of genomic predictions for North American Holstein bulls

Journal of Dairy Science Volume 92, Issue 1, Pages 16–24, January 2009.
DOI: http://dx.doi.org/10.3168/jds.2008-1514

Genetic progress will increase when breeders examine genotypes in addition to pedigrees and phenotypes. Genotypes for 38,416 markers and August 2003 genetic evaluations for 3,576 Holstein bulls born before 1999 were used to predict January 2008 daughter deviations for 1,759 bulls born from 1999 through 2002. Genotypes were generated using the Illumina BovineSNP50 BeadChip and DNA from semen contributed by US and Canadian artificial-insemination organizations to the Cooperative Dairy DNA Repository. Genomic predictions for 5 yield traits, 5 fitness traits, 16 conformation traits, and net merit were computed using a linear model with an assumed normal distribution for marker effects and also using a nonlinear model with a heavier tailed prior distribution to account for major genes. The official parent average from 2003 and a 2003 parent average computed from only the subset of genotyped ancestors were combined with genomic predictions using a selection index. Combined predictions were more accurate than official parent averages for all 27 traits. The coefficients of determination (R2) were 0.05 to 0.38 greater with nonlinear genomic predictions included compared with those from parent average alone. Linear genomic predictions had R2 values similar to those from nonlinear predictions but averaged just 0.01 lower. The greatest benefits of genomic prediction were for fat percentage because of a known gene with a large effect. The R2 values were converted to realized reliabilities by dividing by mean reliability of 2008 daughter deviations and then adding the difference between published and observed reliabilities of 2003 parent averages. When averaged across all traits, combined genomic predictions had realized reliabilities that were 23% greater than reliabilities of parent averages (50 vs. 27%), and gains in information were equivalent to 11 additional daughter records. 
Reliability increased more by doubling the number of bulls genotyped than the number of markers genotyped. Genomic prediction improves reliability by tracing the inheritance of genes even with small effects.

Results and Discussion: ... Marker effects for most other traits were evenly distributed across all chromosomes with only a few regions having larger effects, which may explain why the infinitesimal model and standard quantitative genetic theories have worked well. The distribution of marker effects indicates primarily polygenic rather than simple inheritance and suggests that the favorable alleles will not become homozygous quickly, and genetic variation will remain even after intense selection. Thus, dairy cattle breeders may expect genetic progress to continue for many generations.

... Most animal breeders will conclude that these gains in reliability are sufficient to make genotyping profitable before breeders invest in progeny testing or embryo transfer. Rates of genetic progress should increase substantially as breeders take advantage of these new tools for improving animals (Schaeffer, 2008). Further increases in number of genotyped bulls, revisions to the statistical methods, and additional edits should increase the precision of future genomic predictions.

Table 3

Trait	Parent average (published)	Parent average (observed)	Genomic prediction (expected)	Genomic prediction (linear)	Genomic prediction (nonlinear)	Gain: nonlinear genomic prediction vs. published parent average
Net merit	30	14	67	53	53	23
Milk yield	35	32	69	56	58	23
Fat yield	35	17	69	65	68	33
Protein yield	35	31	69	58	57	22
Fat percentage	35	29	69	69	78	43
Protein percentage	35	32	69	62	69	34
Productive life	27	28	55	42	45	18
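The last column can be reproduced from the others: in every row shown, the gain equals the nonlinear genomic prediction minus the published parent average. A minimal Python check, with the numbers copied from the rows above (the variable and function names are mine, chosen for illustration):

```python
# Rows copied from Table 3 above:
# (trait, published PA, observed PA, expected, linear, nonlinear, reported gain)
TABLE_3 = [
    ("Net merit", 30, 14, 67, 53, 53, 23),
    ("Milk yield", 35, 32, 69, 56, 58, 23),
    ("Fat yield", 35, 17, 69, 65, 68, 33),
    ("Protein yield", 35, 31, 69, 58, 57, 22),
    ("Fat percentage", 35, 29, 69, 69, 78, 43),
    ("Protein percentage", 35, 32, 69, 62, 69, 34),
    ("Productive life", 27, 28, 55, 42, 45, 18),
]


def gain_over_published(row):
    """Nonlinear genomic prediction minus published parent average."""
    _, published, _, _, _, nonlinear, _ = row
    return nonlinear - published


for row in TABLE_3:
    trait, reported = row[0], row[-1]
    computed = gain_over_published(row)
    print(f"{trait:20s} computed={computed:3d} reported={reported:3d}")

# Trait with the largest gain among the rows shown
best_trait = max(TABLE_3, key=gain_over_published)[0]
print("largest gain:", best_trait)
```

The largest gain among these rows is for fat percentage, consistent with the abstract's remark about a known gene of large effect.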

"Horses ain't like people, man. They can't make themselves better than they're born. See, with a horse, it's all in the gene. It's the fucking gene that does the running. The horse has got absolutely nothing to do with it." --- Paulie (Eric Roberts) in The Pope of Greenwich Village

Tuesday, June 29, 2021

Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank (published version)

This is the published version of our MedRxiv preprint discussed back in April 2021. It is in the special issue Application of Genomic Technology in Disease Outcome Prediction of the journal Genes.

There is a lot in this paper: genomic prediction of important biomarkers (e.g., lipoprotein A, mean platelet (thrombocyte) volume, bilirubin, platelet count), and prediction of important disease risks from biomarkers (novel ML in a ~65 dimensional space), with potential clinical applications. As is typical, genomic predictors trained in a European ancestry population perform less well in distant populations (e.g., S. Asians, E. Asians, Africans). This is probably due to different SNP LD (correlation) structure across populations. However, predictors of disease risk using directly measured biomarkers do not show this behavior -- they can be applied even to distant ancestry groups.

The referees did not like our conditional probability notation:

( biomarkers | SNPs ) and ( disease risk | biomarkers )

So we ended up with lots of acronyms to refer to the various predictors.

Some of the biomarkers identified by ML as important for predicting specific disease risk are not familiar to practitioners and have not been previously discussed (as far as we could tell from the literature) as relevant to that specific disease. One medical school professor and practitioner, upon seeing our results, said he would in future add several new biomarkers to routine blood tests ordered for his patients.

Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank

Erik Widen 1,*, Timothy G. Raben 1, Louis Lello 1,2,* and Stephen D. H. Hsu 1,2

1 Department of Physics and Astronomy, Michigan State University, 567 Wilson Rd, East Lansing, MI 48824, USA

2 Genomic Prediction, Inc., 675 US Highway One, North Brunswick, NJ 08902, USA

*Authors to whom correspondence should be addressed.

Academic Editor: Sulev Koks

Genes 2021, 12(7), 991; https://doi.org/10.3390/genes12070991

Received: 30 March 2021 / Revised: 22 June 2021 / Accepted: 23 June 2021 / Published: 29 June 2021

(This article belongs to the Special Issue Application of Genomic Technology in Disease Outcome Prediction)

Abstract

We use UK Biobank data to train predictors for 65 blood and urine markers such as HDL, LDL, lipoprotein A, glycated haemoglobin, etc. from SNP genotype. For example, our Polygenic Score (PGS) predictor correlates ∼0.76 with lipoprotein A level, which is highly heritable and an independent risk factor for heart disease. This may be the most accurate genomic prediction of a quantitative trait that has yet been produced (specifically, for European ancestry groups). We also train predictors of common disease risk using blood and urine biomarkers alone (no DNA information); we call these predictors biomarker risk scores, BMRS. Individuals who are at high risk (e.g., odds ratio of >5× population average) can be identified for conditions such as coronary artery disease (AUC∼0.75), diabetes (AUC∼0.95), hypertension, liver and kidney problems, and cancer using biomarkers alone. Our atherosclerotic cardiovascular disease (ASCVD) predictor uses ∼10 biomarkers and performs in UKB evaluation as well as or better than the American College of Cardiology ASCVD Risk Estimator, which uses quite different inputs (age, diagnostic history, BMI, smoking status, statin usage, etc.). We compare polygenic risk scores (risk conditional on genotype: PRS) for common diseases to the risk predictors which result from the concatenation of learned functions BMRS and PGS, i.e., applying the BMRS predictors to the PGS output.

Figure 11. The ASCVD BMRS and the ASCVD Risk Estimator both make accurate risk predictions but with partially complementary information. (Upper left): Predicted risk by BMRS, the ASCVD Risk Estimator and a PRS predictor were binned and compared to the actual disease prevalence within each bin. The gray 1:1 line indicates perfect prediction. ... The ASCVD Risk Estimator was applied to 340k UKB samples while the others were applied to an evaluation set of 28k samples, all of European ancestry. (Upper right) shows a scatter plot and distributions of the risk predicted by BMRS versus the risk predicted by the ASCVD Risk Estimator for the 28k Europeans in the evaluation set. The BMRS distribution has a longer tail of high predicted risk, providing the tighter confidence interval in this region. The left plot y-axis is the actual prevalence within the horizontal and vertical cross-sections, as illustrated with the shaded bands corresponding to the hollow squares to the left. Notably, both predictors perform well despite the differences in assigned stratification. The hexagons are an overlay of the (lower center) heat map of actual risk within each bin (numbers are bin sizes). Both high risk edges have varying actual prevalence but with a very strong enrichment when the two predictors agree.
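The "concatenation of learned functions" mentioned in the abstract is just function composition: a (biomarkers | SNPs) predictor feeds a (disease risk | biomarkers) predictor. Below is a toy Python sketch of that structure. Everything specific in it is an assumption for illustration: the SNP and biomarker names, the weights, and the choice of a linear score followed by a logistic model are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy weights, NOT from the paper.
# PGS: per-biomarker linear weights on allele counts (0, 1, or 2).
PGS_WEIGHTS = {
    "lipoprotein_a": {"snp_1": 0.8, "snp_2": -0.3},
    "ldl": {"snp_2": 0.4, "snp_3": 0.5},
}
# BMRS: logistic-model weights on biomarker values (assumed model form).
BMRS_WEIGHTS = {"lipoprotein_a": 0.6, "ldl": 0.9}
BMRS_INTERCEPT = -2.0


def pgs(genotype):
    """(biomarkers | SNPs): one linear score per biomarker."""
    return {
        marker: sum(w * genotype.get(snp, 0) for snp, w in weights.items())
        for marker, weights in PGS_WEIGHTS.items()
    }


def bmrs(biomarkers):
    """(disease risk | biomarkers): logistic function of a weighted sum."""
    z = BMRS_INTERCEPT + sum(BMRS_WEIGHTS[m] * v for m, v in biomarkers.items())
    return 1.0 / (1.0 + math.exp(-z))


def bmrs_of_pgs(genotype):
    """Composition: apply the BMRS predictor to the PGS output."""
    return bmrs(pgs(genotype))


genotype = {"snp_1": 2, "snp_2": 0, "snp_3": 1}
predicted_biomarkers = pgs(genotype)
predicted_risk = bmrs_of_pgs(genotype)
print(predicted_biomarkers)
print(round(predicted_risk, 3))
```

The point of the sketch is only the plumbing: the composed predictor takes genotype as input, so it can be compared directly with a PRS trained on genotype alone.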

Sunday, June 09, 2019

L1 vs Deep Learning in Genomic Prediction

The paper below by some of my MSU colleagues examines the performance of a number of ML algorithms, both linear and nonlinear, including deep neural nets, in genomic prediction across several different species.

When I give talks about prediction of disease risks and complex traits in humans, I am often asked why we are not using fancy (trendy?) methods such as Deep Learning (DL). Instead, we focus on L1 penalization methods ("sparse learning") because 1. the theoretical framework (including theorems providing performance guarantees) is well-developed, and (relatedly) 2. the L1 methods perform as well as or better than other methods in our own testing.

The term theoretical framework may seem unusual in ML, which is at the moment largely an empirical subject. Experience in theoretical physics shows that when powerful mathematical results are available, they can be very useful to guide investigation. In the case of sparse learning we can make specific estimates for how much data is required to "solve" a trait -- i.e., capture most of the estimated heritability in the predictor. Five years ago we predicted a threshold of a few hundred thousand genomes for height, and this turned out to be correct. Currently, this kind of performance characterization is not possible for DL or other methods.

What is especially powerful about deep neural nets is that they yield a quasi-convex (or at least reasonably efficient) optimization procedure which can learn high dimensional functions. The class of models is both tractable from a learning/optimization perspective and highly expressive. As I wrote here in my ICML notes (see also Elad's work which relates DL to Sparse Learning):

It may turn out that the problems on which DL works well are precisely those in which the training data (and underlying generative processes) have a hierarchical structure which is sparse, level by level. Layered networks perform a kind of coarse graining (renormalization group flow): first layers filter by feature, subsequent layers by combinations of features, etc. But the whole thing can be understood as products of sparse filters, and the performance under training is described by sparse performance guarantees (ReLU = thresholded penalization?).

However, currently in genomic prediction one typically finds that nonlinear interactions are small, which means features more complicated than single SNPs are unnecessary. (In a recent post I discussed a new T1D predictor that makes use of nonlinear haplotype interaction effects, but even there the effects are not large.) Eventually I expect this situation to change -- when we have enough whole genomes to work with, a DL approach which can (automatically) identify important features (motifs?) may allow us to go beyond SNPs and simple linear models.

Note, though, that from an information theoretic perspective (see, e.g., any performance theorems in compressed sensing) it is obvious that we will need much more data than we currently have to advance this program. Also, note that Visscher et al.'s recent GCTA work suggests that additive SNP models using rare variants (i.e., extracted from whole genome data), can account for nearly all the expected heritability for height. This implies that the power of nonlinear methods like DL may not yield qualitatively better results than simpler L1 approaches, even in the limit of very large whole genome datasets.
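For readers who have not seen L1 penalization in action, here is a minimal pure-Python sketch of LASSO via cyclic coordinate descent on synthetic data. It is an illustration under stated assumptions, not the pipeline used in any of the papers discussed here: the features are Gaussian rather than real genotypes, and the problem size, effect sizes, noise level, and penalty `lam` are arbitrary choices of mine.

```python
import random


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; this is what produces exact zeros."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes j itself
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Synthetic sparse problem: 5 of 50 features have a true effect (assumed values).
random.seed(0)
n, p = 200, 50
true_support = (3, 11, 19, 27, 42)
true_beta = [1.0 if j in true_support else 0.0 for j in range(p)]
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(row[j] * true_beta[j] for j in true_support) + random.gauss(0, 0.5) for row in X]

beta_hat = lasso_coordinate_descent(X, y, lam=0.2)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("selected features:", selected)
```

The soft-threshold step sets most coefficients exactly to zero, so the fitted model is sparse; the features it keeps should include the ones with a true effect when there is enough data relative to the sparsity.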

Benchmarking algorithms for genomic prediction of complex traits

Christina B. Azodi, Andrew McCarren, Mark Roantree, Gustavo de los Campos, Shin-Han Shiu

The usefulness of Genomic Prediction (GP) in crop and livestock breeding programs has led to efforts to develop new and improved GP approaches including non-linear algorithm, such as artificial neural networks (ANN) (i.e. deep learning) and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of GP datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and five non-linear algorithms, including ANNs. First, we found that hyperparameter selection was critical for all non-linear algorithms and that feature selection prior to model training was necessary for ANNs when the markers greatly outnumbered the number of training lines. Across all species and trait combinations, no one algorithm performed best, however predictions based on a combination of results from multiple GP algorithms (i.e. ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms vary more between traits than that of linear algorithms. Although ANNs did not perform best for any trait, we identified strategies (i.e. feature selection, seeded starting weights) that boosted their performance near the level of other algorithms. These results, together with the fact that even small improvements in GP performance could accumulate into large genetic gains over the course of a breeding program, highlights the importance of algorithm selection for the prediction of trait values.

Monday, January 18, 2021

From Genotype to Phenotype: polygenic prediction of complex human traits

New paper, prepared for the book Genomic Prediction of Complex Traits, Springer Nature series Methods in Molecular Biology.

From Genotype to Phenotype: polygenic prediction of complex human traits

arXiv.org > q-bio > arXiv:2101.05870 33 pages, 7 figures, 1 table

Timothy G. Raben, Louis Lello, Erik Widen, Stephen D.H. Hsu

Decoding the genome confers the capability to predict characteristics of the organism (phenotype) from DNA (genotype). We describe the present status and future prospects of genomic prediction of complex traits in humans. Some highly heritable complex phenotypes such as height and other quantitative traits can already be predicted with reasonable accuracy from DNA alone. For many diseases, including important common conditions such as coronary artery disease, breast cancer, type I and II diabetes, individuals with outlier polygenic scores (e.g., top few percent) have been shown to have 5 or even 10 times higher risk than average. Several psychiatric conditions such as schizophrenia and autism also fall into this category. We discuss related topics such as the genetic architecture of complex traits, sibling validation of polygenic scores, and applications to adult health, in vitro fertilization (embryo selection), and genetic engineering.

From the introduction:

I, on the other hand, knew nothing, except ... physics and mathematics and an ability to turn my hand to new things. — Francis Crick

The challenge of decoding the genome has loomed large over biology since the time of Watson and Crick. Initially, decoding referred to the relationship between DNA and specific proteins or molecular mechanisms, but the ultimate goal is to deduce the relationship between DNA and phenotype — the character of the organism itself. How does Nature encode the traits of the organism in DNA? In this review we describe recent advances toward this goal, which have resulted from the application of machine learning (ML) to large genomic data sets. Genomic prediction is the real decoding of the genome: the creation of mathematical models which map genotypes to complex traits.

It is a peculiarity of ML and artificial intelligence (AI) applied to complex systems that these methods can often “solve” a problem without explicating, in a manner that humans can absorb, the intricate mechanisms that lie intermediate between input and output. For example, AlphaGo [1] achieved superhuman mastery of an ancient game that had been under serious study for thousands of years. Yet nowhere in the resulting neural network with millions of connection strengths is there a human-comprehensible guide to Go strategy or game dynamics. Similarly, genomic prediction has produced mathematical functions which predict quantitative human traits with surprising accuracy — e.g., height, bone density, and cholesterol or lipoprotein A levels in blood (see Table 1); using typically thousands of genetic variants as input (see next section for details) — but without explicitly revealing the role of these variants in actual biochemical mechanisms. Characterizing these mechanisms — which are involved in phenomena such as bone growth, lipid metabolism, hormonal regulation, protein interactions — will be a project which takes much longer to complete.

If recent trends persist, in particular the continued growth of large genotype | phenotype data sets, we will likely have good genomic predictors for a host of human traits within the next decade. ...

Tuesday, September 19, 2017

Accurate Genomic Prediction Of Human Height

I've been posting preprints on arXiv since its beginning ~25 years ago, and I like to share research results as soon as they are written up. Science functions best through open discussion of new results! After some internal deliberation, my research group decided to post our new paper on genomic prediction of human height on bioRxiv and arXiv.

But the preprint culture is nascent in many areas of science (e.g., biology), and it seems to me that some journals are not yet fully comfortable with the idea. I was pleasantly surprised to learn, just in the last day or two, that most journals now have official policies that allow online distribution of preprints prior to publication. (This has been the case in theoretical physics since before I entered the field!) Let's hope that progress continues.

The work presented below applies ideas from compressed sensing, L1 penalized regression, etc. to genomic prediction. We exploit the phase transition behavior of the LASSO algorithm to construct a good genomic predictor for human height. The results are significant for the following reasons:

We applied novel machine learning methods ("compressed sensing") to ~500k genomes from UK Biobank, resulting in an accurate predictor for human height which uses information from thousands of SNPs.

1. The actual heights of most individuals in our replication tests are within a few cm of their predicted height.

2. The variance captured by the predictor is similar to the estimated GCTA-GREML SNP heritability. Thus, our results resolve the missing heritability problem for common SNPs.

3. Out-of-sample validation on ARIC individuals (a US cohort) shows the predictor works on that population as well. The SNPs activated in the predictor overlap with previous GWAS hits from GIANT.

The scatterplot figure below gives an immediate feel for the accuracy of the predictor.

Accurate Genomic Prediction Of Human Height
(bioRxiv)

Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos, and Stephen D.H. Hsu

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ∼40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
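The two height numbers in the abstract are consistent with each other: a correlation of about 0.65 corresponds to about 42% of variance captured, since variance explained is the square of the correlation. The same correlation also implies a typical prediction error of a few cm. A short Python sketch of the arithmetic; the 6.5 cm within-sex standard deviation of height is my illustrative assumption, not a number from the paper.

```python
import math


def variance_explained(r):
    """Fraction of variance captured by a predictor with correlation r."""
    return r * r


def residual_sd(trait_sd, r):
    """Standard deviation of (actual - predicted) for a best linear predictor."""
    return trait_sd * math.sqrt(1.0 - r * r)


r_height = 0.65
assumed_height_sd_cm = 6.5  # assumption for illustration, not from the paper

print(f"variance explained: {variance_explained(r_height):.2f}")
print(f"residual SD: {residual_sd(assumed_height_sd_cm, r_height):.1f} cm")
```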

This figure compares predicted and actual height on a validation set of 2000 individuals not used in training: males + females, actual heights (vertical axis) uncorrected for gender. For training we z-score by gender and age (due to Flynn Effect for height). We have also tested validity on a population of US individuals (i.e., out of sample; not from UKBB).

This figure illustrates the phase transition behavior at fixed sample size n and varying penalization lambda.
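For readers who want to see how the penalization lambda controls the number of activated coefficients, here is a minimal sketch of L1-penalized regression by coordinate descent on small synthetic data. It is not the paper's pipeline: the problem sizes, the noise level, and the lambda grid are assumptions chosen so that pure Python runs quickly, and it only shows the support growing as lambda shrinks, not the phase transition figure itself.

```python
import random


def soft_threshold(value, lam):
    """Shrink `value` toward zero by `lam`; the core of the L1 update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(cols, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    `cols` holds the columns of X (one list per SNP).
    """
    n, p = len(y), len(cols)
    col_sq = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, starting from b = 0
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            col = cols[j]
            rho = sum(c * r for c, r in zip(col, resid)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new_b
    return beta


# Synthetic data (all sizes are illustrative assumptions).
rng = random.Random(0)
n, p, s = 60, 80, 5
cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
true_beta = [1.0 if j < s else 0.0 for j in range(p)]
y = [sum(cols[j][i] * true_beta[j] for j in range(s)) + 0.1 * rng.gauss(0.0, 1.0)
     for i in range(n)]

# Smallest penalty at which the all-zero vector is already optimal.
lam_max = max(abs(sum(c * v for c, v in zip(col, y)) / n) for col in cols)

support_sizes = []
for frac in (1.0, 0.5, 0.2, 0.05):
    beta = lasso_cd(cols, y, frac * lam_max)
    support = [j for j, b in enumerate(beta) if b != 0.0]
    support_sizes.append(len(support))
    print(f"lambda = {frac:.2f} * lam_max -> {len(support)} active coefficients")
```

At `lam_max` nothing is activated; as lambda decreases, more coefficients enter the model. In practice one would use an optimized solver rather than this loop.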

These are the SNPs activated in the predictor -- about 20k in total, uniformly distributed across all chromosomes; vertical axis is effect size of minor allele:

The big picture implication is that heritable complex traits controlled by thousands of genetic loci can, with enough data and analysis, be predicted from DNA. I expect that with good genotype | phenotype data from a million individuals we could achieve similar success with cognitive ability. We've also analyzed the sample size requirements for disease risk prediction, and they are similar (i.e., ~100 times sparsity of the effects vector; so ~100k cases + controls for a condition affected by ~1000 loci).
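The sample size arithmetic in the previous paragraph is simple enough to write down directly. The helper name below is hypothetical, and the factor of 100 is just the heuristic quoted above, not a universal constant.

```python
def rule_of_thumb_sample_size(n_loci, factor=100):
    """Sample size suggested by the ~100x-sparsity heuristic quoted above."""
    return factor * n_loci


# A condition affected by ~1000 loci, as in the example above.
cases_plus_controls = rule_of_thumb_sample_size(1000)
print(cases_plus_controls)  # 100000
```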

Note Added: Further comments in response to various questions about the paper.

1) We have tested the predictor on other ethnic groups and there is an (expected) decrease in correlation that is roughly proportional to the "genetic distance" between the test population and the white/British training population. This is likely due to different LD structure (SNP correlations) in different populations. A SNP which tags the true causal genetic variation in the Euro population may not be a good tag in, e.g., the Chinese population. We may report more on this in the future. Note, despite the reduction in power our predictor still captures more height variance than any other existing model for S. Asians, Chinese, Africans, etc.

2) We did not explore the biology of the activated SNPs because that is not our expertise. GWAS hits found by SSGAC, GIANT, etc. have already been connected to biological processes such as neuronal growth, bone development, etc. Plenty of follow up work remains to be done on the SNPs we discovered.

3) Our initial reduction of candidate SNPs to the top 50k or 100k is simply to save computational resources. The L1 algorithms can handle much larger values of p, but keeping all of those SNPs in the calculation is extremely expensive in CPU time, memory, etc. We tested computational cost vs benefit in improved prediction from including more (>100k) candidate SNPs in the initial cut but found it unfavorable. (Note, we also had a reasonable prior that ~10k SNPs would capture most of the predictive power.)

4) We will have more to say about nonlinear effects, additional out-of-sample tests, other phenotypes, etc. in future work.

5) Perhaps most importantly, we have a useful theoretical framework (compressed sensing) within which to think about complex trait prediction. We can make quantitative estimates for the sample size required to "solve" a particular trait.
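The pre-screening described in item 3 can be sketched as a simple top-k filter. The ranking criterion is not stated above, so this sketch assumes ranking by absolute marginal correlation with the phenotype; the function names and the toy genotype matrix are hypothetical.

```python
from statistics import mean


def marginal_scores(genotypes, phenotype):
    """Absolute Pearson correlation between each SNP column and the phenotype."""
    y_mean = mean(phenotype)
    y_c = [v - y_mean for v in phenotype]
    y_norm = sum(v * v for v in y_c) ** 0.5
    scores = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        col_mean = mean(col)
        c = [v - col_mean for v in col]
        norm = sum(v * v for v in c) ** 0.5
        if norm == 0.0 or y_norm == 0.0:
            scores.append(0.0)  # a constant column carries no signal
        else:
            scores.append(abs(sum(a * b for a, b in zip(c, y_c))) / (norm * y_norm))
    return scores


def top_k_snps(genotypes, phenotype, k):
    """Indices of the k SNPs with the largest marginal score (assumed criterion)."""
    scores = marginal_scores(genotypes, phenotype)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: 6 individuals x 4 SNPs, allele counts in {0, 1, 2}; values are made up.
toy_genotypes = [
    [0, 1, 0, 2],
    [1, 0, 1, 2],
    [2, 1, 2, 0],
    [0, 2, 0, 1],
    [1, 1, 2, 1],
    [2, 0, 1, 0],
]
# Phenotype built from SNP index 2 only, so that SNP should rank first.
toy_phenotype = [170.0 + 3.0 * row[2] for row in toy_genotypes]
kept = top_k_snps(toy_genotypes, toy_phenotype, k=2)
```

The L1 fit is then run only on the columns in `kept`, which is where the savings in CPU time and memory come from.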

I leave you with some remarks from Francis Crick:

Crick had to adjust from the "elegance and deep simplicity" of physics to the "elaborate chemical mechanisms that natural selection had evolved over billions of years." He described this transition as, "almost as if one had to be born again." According to Crick, the experience of learning physics had taught him something important — hubris — and the conviction that since physics was already a success, great advances should also be possible in other sciences such as biology. Crick felt that this attitude encouraged him to be more daring than typical biologists who tended to concern themselves with the daunting problems of biology and not the past successes of physics.

Wednesday, November 01, 2017

The Future is Here: Genomic Prediction in MIT Technology Review

MIT Technology Review reports on our startup Genomic Prediction. Some basic points worth clarifying:

1. GP's first product, announced at the annual ASRM (American Society for Reproductive Medicine) meeting this week, tests for chromosomal abnormality. It is a less expensive but more accurate version of existing tests.


2. The polygenic product, to be launched in 2018, checks for hundreds of known single-gene ("Mendelian") disease risks, and will likely have some true polygenic predictive capabilities. This last part is the main emphasis of the story, but it is just one component of the overall product offering. The article elides a lot of challenging laboratory work on DNA amplification, etc.

3. GP will only deliver results requested by an IVF physician. It is not a DTC (Direct to Consumer) company.

4. All medical risk analysis proceeds from statistical data (analyzing groups of people) to produce recommendations concerning a specific individual.

5. I am on the Board of Directors of GP but am not an employee of the company.

MIT Technology Review

Eugenics 2.0: We’re at the Dawn of Choosing Embryos by Health, Height, and More

Will you be among the first to pick your kids’ IQ? As machine learning unlocks predictions from DNA databases, scientists say parents could have choices never before possible.

Nathan Treff was diagnosed with type 1 diabetes at 24. It’s a disease that runs in families, but it has complex causes. More than one gene is involved. And the environment plays a role too.

So you don’t know who will get it. Treff’s grandfather had it, and lost a leg. But Treff’s three young kids are fine, so far. He’s crossing his fingers they won’t develop it later.

Now Treff, an in vitro fertilization specialist, is working on a radical way to change the odds. Using a combination of computer models and DNA tests, the startup company he’s working with, Genomic Prediction, thinks it has a way of predicting which IVF embryos in a laboratory dish would be most likely to develop type 1 diabetes or other complex diseases. Armed with such statistical scorecards, doctors and parents could huddle and choose to avoid embryos with failing grades.

IVF clinics already test the DNA of embryos to spot rare diseases, like cystic fibrosis, caused by defects in a single gene. But these “preimplantation” tests are poised for a dramatic leap forward as it becomes possible to peer more deeply at an embryo’s genome and create broad statistical forecasts about the person it would become.

The advance is occurring, say scientists, thanks to a growing flood of genetic data collected from large population studies. ...

Spotting outliers

The company’s plans rely on a tidal wave of new knowledge showing how small genetic differences can add up to put one person, but not another, at high odds for diabetes, a neurotic personality, or a taller or shorter height. Already, such “polygenic risk scores” are used in direct-to-consumer gene tests, such as reports from 23andMe that tell customers their genetic chance of being overweight.

For adults, risk scores are little more than a novelty or a source of health advice they can ignore. But if the same information is generated about an embryo, it could lead to existential consequences: who will be born, and who stays in a laboratory freezer.

“I remind my partners, ‘You know, if my parents had this test, I wouldn’t be here,’” says Treff, a prize-winning expert on diagnostic technology who is the author of more than 90 scientific papers.

Genomic Prediction was founded this year and has raised funds from venture capitalists in Silicon Valley, though it declines to say who they are. Tellier, whose inspiration is the science fiction film Gattaca, says the company plans to offer reports to IVF doctors and parents identifying “outliers”—those embryos whose genetic scores put them at the wrong end of a statistical curve for disorders such as diabetes, late-life osteoporosis, schizophrenia, and dwarfism, depending on whether models for those problems prove accurate. ...

This week, Genomic Prediction manned a booth at the annual meeting of the American Society for Reproductive Medicine. That organization, which represents fertility doctors and scientists, has previously said it thinks testing embryos for late-life conditions, like Alzheimer’s, would be “ethically justified.” It cited, among other reasons, the “reproductive liberty” of parents.

... Hsu’s prediction is that “billionaires and Silicon Valley types” will be the early adopters of embryo selection technology, becoming among the first “to do IVF even though they don’t need IVF.” As they start producing fewer unhealthy children, and more exceptional ones, the rest of society could follow suit.

“I fully predict it will be possible,” says Hsu of selecting embryos with higher IQ scores. “But we’ve said that we as a company are not going to do it. It’s a difficult issue, like nuclear weapons or gene editing. There will be some future debate over whether this should be legal, or made illegal. Countries will have referendums on it.”

Tuesday, November 20, 2018

Super-smart designer babies (Guardian UK)

The title (likely chosen by an editor at The Guardian) is a bit disturbing. But the article itself is clear and insightful.

Super-smart designer babies could be on offer soon. But is that ethical?

... The company [ Genomic Prediction ] says it is only offering such testing to spot embryos with an IQ low enough to be classed as a disability, and won’t conduct analyses for high IQ.

... The development must be set, too, against what is already possible and permitted in IVF embryo screening. The procedure called pre-implantation genetic diagnosis (PGD) involves extracting cells from embryos at a very early stage and “reading” their genomes before choosing which to implant. It has been enabled by rapid advances in genome-sequencing technology, making the process fast and relatively cheap.

... before we get too indignant about the horrors of designer babies, bear in mind that already we permit, even in the UK, prenatal screening for Down’s syndrome, a disability that produces low to moderate intellectual disability. It’s not easy to make a moral or philosophical case that the screening offered by Genomic Prediction for low IQ is any different. There may be more uncertainty but, given not all IVF embryos will be implanted anyway, can we object to tipping the scales? And how can we condone efforts to improve your child’s intelligence after birth but not before?

The questions are complicated. How to balance individual rights against what is good for society as a whole? When does avoidance of disease and disability shade into enhancement? Should society be more receptive to disability rather than seeing it as something to be eradicated? When does choice become tyranny?

In the UK we are extraordinarily lucky to have the HFEA [Human Fertilisation and Embryology Authority], which frames binding regulation after careful deliberation and acts as a brake so the technology does not outrun the debate. “Embryo selection needs robust regulation that society can be confident in,” says Ewan Birney, director of the European Bioinformatics Institute in Cambridge. Leaving a matter such as this to unregulated market forces is dangerous.

The writer is Philip Ball, a longtime editor at Nature and PhD in Physics.

The US does not have an HFEA. Here is what the Genomic Prediction FAQ says:

There is a formal bioethics position on this general subject, recently updated in 2018, written by the Ethics Committee of the American Society for Reproductive Medicine.

A key phrase from the paper:

PGD for adult-onset conditions is ethically justified when the condition is serious and no safe, effective interventions are available. It is ethically allowed for conditions of lesser severity or penetrance. The Committee strongly recommends that an experienced genetic counselor play a major role in counseling patients considering such procedures.

The recent burst of news coverage of Genomic Prediction was triggered by articles in The Economist and New Scientist. My earlier comments are reproduced below:

This Economist article on Genomic Prediction has been in waiting for weeks, to appear in The World in 2019 special issue. I spent a couple hours briefing their science team on what is coming in AI and genomics -- I would guess there will be more coverage of polygenic scores and health care in the future.

See also this New Scientist article on GP.

2019 may be the Year of the Designer Baby, if journos are to be believed ;-) Of course, this is sensationalism. It is more accurate to say that 2019 will see the first deployment of advanced genetic tests which can be used to screen against complex disease and health risks. Already today ~1 million IVF embryos per year are screened worldwide using less sophisticated genetic tests for single gene disease mutations and chromosomal abnormality.

The Economist:

In 2019, ... [ GP clients ] will have an opportunity to give their offspring a greater chance of living a long and healthy life.

"Expert" opinion seems to have evolved as follows:

1. Of course babies can't be "designed" because genes don't really affect anything -- we're all products of our environment!

2. Gulp, even if genes do affect things it's much too complicated to ever figure out!

3. Anyone who wants to use this technology (hmm... it works) needs to tread carefully, and to seriously consider the ethical issues.

Only point 3 is actually correct, although there are still plenty of people who believe 1 and 2 :-(

Thursday, December 27, 2018

Genomic Prediction of Complex Disease Risk (bioRxiv)

Our new paper describes over a dozen genomic predictors for common disease risk, constructed via machine learning on hundreds of thousands of genotypes. The predictors use anywhere from a few tens (e.g., 20 or 50) to thousands of SNPs to compute the risk PGS (Poly-Genic Score) for a specific disease.
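Computing a PGS from a fitted predictor is just a weighted sum over the activated SNPs. Here is a minimal sketch; the SNP identifiers and effect weights are made up for illustration and are not taken from the paper.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts over the SNPs used by the predictor.

    SNPs missing from `allele_counts` are treated as contributing zero,
    which is a simplifying assumption of this sketch.
    """
    return sum(w * allele_counts.get(snp, 0) for snp, w in weights.items())


# Hypothetical predictor with three activated SNPs.
weights = {"snp_a": 0.12, "snp_b": -0.05, "snp_c": 0.30}
# Hypothetical individual: number of copies (0, 1, or 2) of each scored allele.
individual = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_score(individual, weights)
```

A separate `weights` table per condition is all that is needed to score many diseases from the same array genotype.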

The figure above (Atrial Fibrillation) shows out-of-sample testing of risk prediction (black dots with error bars) compared to theoretical prediction (red line). The theoretical prediction uses the empirical fact that cases and controls are normally-distributed in PGS score, with the two distributions shifted relative to each other. Cases have, on average, higher risk scores, and come to dominate in high PGS percentile bins. So, conditional on a high PGS risk score (e.g., 99th percentile PGS), the probability of the condition can be significantly elevated (e.g., ~8 times typical probability of developing atrial fibrillation).
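The theoretical curve can be sketched with Bayes' rule applied to two normal distributions. The means, standard deviations, and prevalence below are illustrative assumptions, not the fitted values from the paper.

```python
from statistics import NormalDist


def risk_given_pgs(z, prevalence, case_dist, control_dist):
    """P(case | PGS = z) when cases and controls are each normal in PGS."""
    case_part = prevalence * case_dist.pdf(z)
    control_part = (1.0 - prevalence) * control_dist.pdf(z)
    return case_part / (case_part + control_part)


# Illustrative parameters: controls centered at 0, cases shifted up by 0.5 SD.
controls = NormalDist(mu=0.0, sigma=1.0)
cases = NormalDist(mu=0.5, sigma=1.0)
prevalence = 0.05

for z in (-2.0, 0.0, 2.0, 3.0):
    risk = risk_given_pgs(z, prevalence, cases, controls)
    print(f"PGS z = {z:+.1f}: risk = {risk:.3f}, relative risk = {risk / prevalence:.2f}")
```

With equal standard deviations the risk rises monotonically with PGS, and it equals the prevalence exactly at the midpoint between the two means.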

We can identify, from SNP genotype alone, a subset of the population with unusual risk for conditions like Atrial Fibrillation or Diabetes or Breast Cancer or Prostate Cancer.

Just a year or two ago this would have seemed like science fiction to biomedical researchers...

Empirical validation of risk is limited by availability of out-of-sample populations for whom we have genotype and disease status. However, it is clear from the results that the theoretical models do a good job of predicting odds ratios once the properties of the case and control normal distributions (mean and standard deviation of PGS) are known.

These predictors only require data from an inexpensive ~$50 SNP array. Once the ~1 million SNPs on the array are measured *all* of the disease risks can be computed for an individual patient. It is only a matter of time before genotyping of this kind becomes Standard of Care in health systems around the world.

In the paper we also analyze the rate of improvement of prediction AUC as training sample size increases. With more data these predictors will become significantly more accurate -- the relevant timescale is just a few years!
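For reference, AUC is the probability that a randomly chosen case has a higher score than a randomly chosen control. A minimal sketch follows, with an empirical version and the closed form for two normal distributions; the helper names and toy scores are assumptions for illustration.

```python
from statistics import NormalDist


def empirical_auc(case_scores, control_scores):
    """Fraction of (case, control) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def binormal_auc(case_dist, control_dist):
    """AUC when case and control scores are each normally distributed."""
    gap = case_dist.mean - control_dist.mean
    spread = (case_dist.variance + control_dist.variance) ** 0.5
    return NormalDist().cdf(gap / spread)


# Toy scores, made up for illustration.
auc_toy = empirical_auc([0.9, 1.4, 0.2], [0.1, 0.5, -0.3])
auc_model = binormal_auc(NormalDist(0.5, 1.0), NormalDist(0.0, 1.0))
```

The pairwise loop is quadratic in sample size; for large cohorts a rank-based computation gives the same number much faster.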

Genomic Prediction of Complex Disease Risk

Louis Lello, Timothy Raben, Soke Yuen Yong, Laurent CAM Tellier, Stephen D. H. Hsu

We construct risk predictors using polygenic scores (PGS) computed from common Single Nucleotide Polymorphisms (SNPs) for a number of complex disease conditions, using L1-penalized regression (also known as LASSO) on case-control data from UK Biobank. Among the disease conditions studied are Hypothyroidism, (Resistive) Hypertension, Type 1 and 2 Diabetes, Breast Cancer, Prostate Cancer, Testicular Cancer, Gallstones, Glaucoma, Gout, Atrial Fibrillation, High Cholesterol, Asthma, Basal Cell Carcinoma, Malignant Melanoma, and Heart Attack. We obtain values for the area under the receiver operating characteristic curves (AUC) in the range 0.58 - 0.71 using SNP data alone. Substantially higher predictor AUCs are obtained when incorporating additional variables such as age and sex. Some SNP predictors alone are sufficient to identify outliers (e.g., in the 99th percentile of PGS) with 3-8 times higher risk than typical individuals. We validate predictors out-of-sample using the eMERGE dataset, and also with different ancestry subgroups within the UK Biobank population. Our results indicate that substantial improvements in predictive power are attainable using training sets with larger case populations. We anticipate rapid improvement in genomic prediction as more case-control data become available for analysis.

https://www.biorxiv.org/content/early/2018/12/27/506600

Monday, December 17, 2018

Advances in Genomic Prediction: Breast Cancer Risk

This is a new paper on polygenic prediction for breast cancer by a large collaboration that has been working for many years on GWAS and, more recently, genomic risk prediction.

"The lifetime risk of overall breast cancer in the top centile of the PRSs was 32.6%" !

Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes
https://www.cell.com/ajhg/fulltext/S0002-9297(18)30405-1

Stratification of women according to their risk of breast cancer based on polygenic risk scores (PRSs) could improve screening and prevention strategies. Our aim was to develop PRSs, optimized for prediction of estrogen receptor (ER)-specific disease, from the largest available genome-wide association dataset and to empirically validate the PRSs in prospective studies. The development dataset comprised 94,075 case subjects and 75,017 control subjects of European ancestry from 69 studies, divided into training and validation sets. Samples were genotyped using genome-wide arrays, and single-nucleotide polymorphisms (SNPs) were selected by stepwise regression or lasso penalized regression. The best performing PRSs were validated in an independent test set comprising 11,428 case subjects and 18,323 control subjects from 10 prospective studies and 190,040 women from UK Biobank (3,215 incident breast cancers). For the best PRSs (313 SNPs), the odds ratio for overall disease per 1 standard deviation in ten prospective studies was 1.61 (95%CI: 1.57–1.65) with area under receiver-operator curve (AUC) = 0.630 (95%CI: 0.628–0.651). The lifetime risk of overall breast cancer in the top centile of the PRSs was 32.6%. Compared with women in the middle quintile, those in the highest 1% of risk had 4.37- and 2.78-fold risks, and those in the lowest 1% of risk had 0.16- and 0.27-fold risks, of developing ER-positive and ER-negative disease, respectively. Goodness-of-fit tests indicated that this PRS was well calibrated and predicts disease risk accurately in the tails of the distribution. This PRS is a powerful and reliable predictor of breast cancer risk that may improve breast cancer prevention programs.

Note the roughly 10x (ER-negative: 2.78 / 0.27) to 27x (ER-positive: 4.37 / 0.16) range of risk between the lowest and highest percentiles of PRS.
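The odds ratio per standard deviation and the AUC quoted in the abstract can be checked against each other. The sketch below assumes the PRS is normally distributed with equal variance in cases and controls, in which case an odds ratio per SD of e^d corresponds to a mean separation of d standard deviations and AUC = Φ(d/√2). That modeling assumption is mine, not a statement from the paper.

```python
from math import log, sqrt
from statistics import NormalDist


def auc_from_or_per_sd(odds_ratio_per_sd):
    """AUC implied by an odds ratio per SD under an equal-variance normal model."""
    d = log(odds_ratio_per_sd)  # case/control mean separation in SD units
    return NormalDist().cdf(d / sqrt(2.0))


# Odds ratio per SD reported for the 313-SNP PRS.
auc_313 = auc_from_or_per_sd(1.61)
print(f"implied AUC = {auc_313:.3f}")
```

The implied value comes out at about 0.63, in line with the reported AUC of 0.630, so the two summary statistics tell a consistent story.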

One of the senior authors (Paul Pharoah of Cambridge) details the history of his work on genetics of breast cancer in a tweet thread. He describes historical progress from simple GWAS associations to full-blown genomic prediction:

1/n This paper has been many years in the making both conceptually and in terms of the time to generate the data. It has been part of almost all my scientific life (or at least since I started my PhD).

...

11/n This PRS is now being used in an EU funded trial of risk stratified screening. It is the culmination of many years of many people working together on samples donated by hundreds of thousands of patients.

https://twitter.com/paulpharoah/status/1073677455372759041

My small team of physicists has constructed a breast cancer predictor of similar power using UKBB data and our own automated ML pipeline :-)

See earlier post Advances in Genomic Prediction.

Thursday, October 11, 2018

Population-wide Genomic Prediction of Health Risks

The UK is ahead of the US in the application of genomics in clinical practice. Part of this is due to their leadership in projects like the UK Biobank (500k genomes with extensive biomedical phenotyping), and part is due to having a single-payer system that can adopt obviously beneficial (and cost-beneficial) practices after some detailed study. Former Prime Minister David Cameron's son has a rare genetic disease, which contributed to his strong support of genomics research in the UK. The decentralized (broken) US health care system, which does not focus on quality of outcome, is having a hard time with no-brainer decisions like making inexpensive genotyping Standard of Care. Will insurance reimburse?

I estimate that within a year or so there will be more than 10 good genomic predictors covering very significant disease risks, ranging from heart disease to diabetes to hypothyroidism to various cancers. These predictors will be able to identify the, e.g., few percent of the population that are outliers in risk -- for example, have 5x or 10x the normal likelihood of getting the disease at some point in their lives. Risk predictions can be made at birth (or before! or in adulthood), and preventative care allocated appropriately. All of these risk scores can be computed using a genotype read from an inexpensive (< $50 per person) array that probes ~1M or so common SNPs.

In technical papers my research group anticipated years ago that even very complex traits would be predictable once a data threshold was crossed. The phenomenon is related to what physicists refer to as a phase transition in algorithm performance. The rapid appearance now of practically useful risk predictors for disease is one anticipated consequence of this phase transition. Medicine in well-functioning health care systems will be transformed over the next 5 years or so.

Test could predict risk of future heart disease for just £40 (Guardian)

Genomic Risk Score test is cheap enough to allow population-wide screening of children, researchers believe

A one-off genetic test costing less than £40 can show if a person is born with a predisposition to heart disease.

The Genomic Risk Score (GRS) test is cheap enough to allow population-wide screening of children, researchers believe. Medical and lifestyle interventions could then be employed to reduce the chances of those most at risk of suffering heart attacks in adulthood.

A study found that participants with a GRS in the top 20% were more than four times more likely to develop coronary heart disease than those with scores in the bottom 20%. Many in the “at risk” category lacked the usual heart disease indicators, such as high cholesterol and blood pressure.

Senior author Sir Nilesh Samani, the professor of cardiology at the University of Leicester and medical director of the British Heart Foundation charity, said: “At the moment, we assess people for their risk of coronary heart disease in their 40s through NHS health checks. But we know this is imprecise and also that coronary heart disease starts much earlier, several decades before symptoms develop.

“Therefore, if we are going to do true prevention, we need to identify those at increased risk much earlier. This study shows that the GRS can now identify such individuals.

“Applying it could provide a most cost-effective way of preventing the enormous burden of coronary heart disease, by helping doctors select patients who would most benefit from interventions.”

Coronary heart disease is the leading cause of death worldwide and claims 66,000 lives each year in the UK. Healthcare costs related to heart and circulatory diseases in the UK are estimated at £9bn per year. ...

Tuesday, August 14, 2018

Genomic Prediction of disease risk using polygenic scores (Nature Genetics)

It seems to me we are just at the tipping point -- soon it will be widely understood that with large enough data sets we can predict complex traits and complex disease risk from genotype, capturing most of the estimated heritable variance. People will forget that many "experts" doubted this was possible -- the term missing heritability will gradually disappear.

In just a few years genotyping will start to become "standard of care" in many health systems. In 5 years there will be ~100M genotypes in storage (vs ~20M now), a large fraction available for scientific analysis.

Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations (Nature Genetics)

A key public health need is to identify individuals at high risk for a given disease to enable enhanced screening or preventive therapies. Because most common diseases have a genetic component, one important approach is to stratify individuals based on inherited DNA variation1. Proposed clinical applications have largely focused on finding carriers of rare monogenic mutations at several-fold increased risk. Although most disease risk is polygenic in nature2,3,4,5, it has not yet been possible to use polygenic predictors to identify individuals at risk comparable to monogenic mutations. Here, we develop and validate genome-wide polygenic scores for five common diseases. The approach identifies 8.0, 6.1, 3.5, 3.2, and 1.5% of the population at greater than threefold increased risk for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer, respectively. For coronary artery disease, this prevalence is 20-fold higher than the carrier frequency of rare monogenic mutations conferring comparable risk6. We propose that it is time to contemplate the inclusion of polygenic risk prediction in clinical care, and discuss relevant issues.

Using much larger studies and improved algorithms, we set out to revisit the question of whether a GPS can identify subgroups of the population with risk approaching or exceeding that of a monogenic mutation. We studied five common diseases with major public health impact: CAD, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer.

For each of the diseases, we created several candidate GPSs based on summary statistics and imputation from recent large GWASs in participants of primarily European ancestry (Table 1). Specifically, we derived 24 predictors based on a pruning and thresholding method, and 7 additional predictors using the recently described LDPred algorithm13 (Methods, Fig. 1 and Supplementary Tables 1–6). These scores were validated and tested within the UK Biobank, which has aggregated genotype data and extensive phenotypic information on 409,258 participants of British ancestry (average age: 57 years; 55% female)14,15.

We used an initial validation dataset of the 120,280 participants in the UK Biobank phase 1 genotype data release to select the GPSs with the best performance, defined as the maximum area under the receiver-operator curve (AUC). We then assessed the performance in an independent testing dataset comprised of the 288,978 participants in the UK Biobank phase 2 genotype data release. For each disease, the discriminative capacity within the testing dataset was nearly identical to that observed in the validation dataset.

In the talk below @21:45 I discuss prospects for genomic prediction of disease risk.

Saturday, June 11, 2022

Genomic Prediction on WHYY The Pulse

This 20 minute podcast segment is very well done. Congratulations to science journalist Teresa Carey.

Startup offers genetic testing that promises to predict healthiest embryo

Aurea toddles around in her pink sparkly sneakers, climbing up the steps that, to her, are nearly waist high. Her tiny t-shirt is the epitome of how adorable she is. It says “you + me + snuggles.” Aurea’s father, Rafal Smigrodzki, watches over his little girl. He is clearly proud of her. “She’s very lively. I think she’s a pretty, pretty happy baby,” Smigrodzki said, “a very often smiley baby.”

Of course, Smigrodzki thinks his baby is special — most parents do. But Aurea is indeed unique. She was born almost two years ago and happens to be the first child born as the result of a new type of genetic screening, which carefully selected her embryo. Smigrodzki and his girlfriend used in vitro fertilization and an advanced selection process from a startup called Genomic Prediction.

The New Jersey startup offers genetic tests and promises to help prospective parents select embryos with the best possible genes. The company says its test can screen embryos for a variety of diseases and health conditions, like heart disease, diabetes, or breast cancer.

Smigrodzki, a neurologist with a PhD in genetics, stumbled across the company in 2017.

“I was always interested and reading about all kinds of new developments,” he said. “And just happened to read an article in the MIT Technology Review about Genomic Prediction.”

...

For more information, see (audio + transcript):

Shai Carmi: Polygenic risk scores and embryo screening — Manifold Podcast #5

Steve Hsu: Complex trait prediction in Genomics, and Genomic Prediction / Embryo Selection — Manifold Podcast #2

Saturday, September 05, 2009

Some favorite posts

NEW! Podcast show Manifold.

I started writing this blog in 2004 (it has had millions of visitors!), and by now the content is a bit unwieldy to navigate, even with labels and search. I thought I'd make a list of some of my favorite posts and topics. Please suggest other posts to add to the list!

Pessimism of the Intellect, Optimism of the Will.

Feynman and me, Memories of Feynman. Labels: Feynman, path integrals.

Richard Feynman and the 19 year old me at my Caltech graduation:

Mama said knock you out (learning how to fight).

Many Worlds and quantum mechanics, a brief guide. Label: Many Worlds. Note, the usual quantum probabilities do not emerge naturally in this interpretation. See my papers On the Origin of Probability in Quantum Mechanics and The measure problem in no-collapse (many worlds) quantum mechanics.

Cognitive limitations: statistics, higher ed. Labels: bounded rationality, human capital. Brainpower and globalization.

Expert predictions in soft subjects are unreliable. Intellectual honesty. Frauds! Label: expert predictions.

We can (crudely) measure cognitive ability using simple tests. (It is amazing to me that this is a controversial statement.) Randomly sampled eminent scientists have (very) high IQs, and given the observed stability of adult IQ the causality is clear: psychometrics works. The cult of genius? Income, Wealth, and IQ, One hundred thousand brains. Bezos on the Big Brains. Label: psychometrics.

Historically isolated groups of humans cluster genetically according to geographical ancestry. Explained in pictures, words, more words.

I am skeptical of all but the weakest claims of market efficiency. My talk on the 2008 credit crisis. Venn diagram for economics.

Careers, advice to geeks: A tale of two geeks, success vs ability. Labels: careers, startups, entrepreneurs.

Net worth, life satisfaction, happiness, the gilded age.

What is the likely development path for China in the next decades? Sustainability of China growth; China development: how big is the middle class?; Back to the future; Shanghai from an Indian perspective.

That curious institution, Caltech. How did a 16 year old kid from Iowa end up there? (See memories of Feynman above.)

There are geniuses in the world. The cult of genius.

My lovely kids. Photos. Autobiographical.

Update:

Credentialism and elite careers, Defining merit, elitism, brainpower

Recent videos (talks on genomics): https://www.youtube.com/results?search_query=hsu+genomics

Talks (some with slides + video):

Cold Spring Harbor Laboratory
Berkeley Innovative Genomics Institute and OpenAI
Janelia Research Campus (HHMI)
Allen Institute (Seattle) meeting on Genetics of Complex Traits

Review article: On the genetic architecture of cognitive ability and other quantitative traits (2014)

I work on algorithms for phenotype prediction from genotype, using new methods from high dimensional statistics. My estimate is that prediction of complex traits such as height, cognitive ability, or highly polygenic disease conditions will require data sets of order one million individuals (i.e., to build a model which accounts for most of the genetic variance). Once these models are available, human reproduction (and evolution!) will be revolutionized.

These papers are somewhat technical:
https://arxiv.org/abs/1310.2264
http://arxiv.org/abs/1408.6583

This one is a bit less technical and gives a broader overview:
http://arxiv.org/abs/1408.3421
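
For readers who want a concrete picture of what "high dimensional statistics" can mean for genotype-to-phenotype prediction, here is a minimal sketch. It assumes L1-penalized (lasso) regression solved by coordinate descent as one representative method; it is not taken from the papers above, and the function names, the toy genotype matrix, and the penalty value are all hypothetical. The point is only that the L1 penalty drives small effects to exactly zero, so the fitted model keeps a sparse set of SNPs.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it falls inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows (individuals) of centered genotype values,
    y is the centered phenotype, lam is the L1 penalty (hypothetical value).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of SNP j with the residual that excludes SNP j
            z = 0.0    # mean squared value of SNP j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Toy data (made up): two SNPs, four individuals, centered genotype values.
X = [[1, 0], [0, 1], [-1, 0], [0, -1]]
# Phenotype built as 2.0 * SNP0 + 0.1 * SNP1, so SNP1 has a tiny effect.
y = [2.0, 0.1, -2.0, -0.1]

beta = lasso_coordinate_descent(X, y, lam=0.1)
print(beta)  # SNP0 keeps a (shrunken) effect; SNP1 is set to exactly zero
```

In this toy case the small-effect SNP is dropped and the large-effect SNP is retained with some shrinkage. Real applications involve hundreds of thousands of SNPs and individuals and use optimized solvers rather than a pure-Python loop.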

Cow genomics (an existence proof):
http://infoproc.blogspot.com/2012/08/genomic-prediction-no-bull.html
http://infoproc.blogspot.com/2014/08/its-all-in-gene-cows.html

These are for popular audiences (Nautilus Magazine):
http://nautil.us/issue/18/genius/super_intelligent-humans-are-coming
http://nautil.us/issue/28/2050/dont-worry-smart-machines-will-take-us-with-them

2018: As anticipated, we now have good height predictors thanks to the 500k genome release of UK Biobank data: Scientists of Stature

Genomic predictors for common disease risk, constructed via machine learning on hundreds of thousands of genotypes. The predictors use anywhere from a few tens (e.g., 20 or 50) to thousands of SNPs to compute the risk PGS (Poly-Genic Score) for conditions such as diabetes, breast cancer, heart attack, and more: Genomic Prediction of Complex Disease Risk.
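
To make the PGS idea concrete: in its simplest additive form, a polygenic score is a weighted sum over the selected SNPs, where each weight is a learned effect size and each genotype is the count (0, 1, or 2) of the effect allele. The sketch below assumes that additive form; the SNP identifiers, weights, and the rule for missing SNPs are hypothetical and do not come from any published predictor.

```python
# Hypothetical effect sizes for three SNPs; real predictors learn these from data
# and may use anywhere from tens to thousands of SNPs.
weights = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30}


def polygenic_score(genotype, weights, intercept=0.0):
    """Return intercept + sum over SNPs of (effect size * effect-allele count)."""
    score = intercept
    for snp, effect in weights.items():
        # Simplifying assumption: a SNP missing from the genotype counts as 0.
        dosage = genotype.get(snp, 0)
        if dosage not in (0, 1, 2):
            raise ValueError(f"allele count for {snp} must be 0, 1, or 2")
        score += effect * dosage
    return score


# One individual's effect-allele counts (made up).
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
score = polygenic_score(person, weights)
print(round(score, 2))  # 0.19
```

A raw score like this is only meaningful relative to a reference population; turning it into a disease risk estimate requires a further calibration step that is not shown here.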

The Economist on polygenic risk scores (2019).

Detailed analysis of genetic architectures of disease risk predictors. Implications for pleiotropy.

Sibling validation of genomic predictors.

Recent papers from my group:

https://www.genetics.org/content/210/2/477
https://www.nature.com/articles/s41598-019-51258-x
https://www.nature.com/articles/s41598-020-68881-8
https://www.nature.com/articles/s41598-020-69927-7

2021 review article, prepared for the book Genomic Prediction of Complex Traits, Springer Nature series Methods in Molecular Biology:

From Genotype to Phenotype: polygenic prediction of complex human traits

Sunday, May 01, 2022

Complex Trait Prediction: Methods and Protocols (Springer 2022)

Sunday, May 29, 2022

Genomic Prediction in Bloomberg

Sunday, August 03, 2014

It's all in the gene: cows

Tuesday, June 29, 2021

Machine Learning Prediction of Biomarkers from SNPs and of Disease Risk from Biomarkers in the UK Biobank (published version)

Sunday, June 09, 2019

L1 vs Deep Learning in Genomic Prediction

Monday, January 18, 2021

From Genotype to Phenotype: polygenic prediction of complex human traits

Tuesday, September 19, 2017

Accurate Genomic Prediction Of Human Height

Wednesday, November 01, 2017

The Future is Here: Genomic Prediction in MIT Technology Review

Tuesday, November 20, 2018

Super-smart designer babies (Guardian UK)

Thursday, December 27, 2018

Genomic Prediction of Complex Disease Risk (bioRxiv)

Monday, December 17, 2018

Advances in Genomic Prediction: Breast Cancer Risk

Thursday, October 11, 2018

Population-wide Genomic Prediction of Health Risks

Tuesday, August 14, 2018

Genomic Prediction of disease risk using polygenic scores (Nature Genetics)

Saturday, June 11, 2022

Genomic Prediction on WHYY The Pulse

Saturday, September 05, 2009

Some favorite posts
