Information Processing: 08/2018

Tuesday, August 28, 2018

Scientists of Stature

The link below is to the published version of the paper we posted on biorxiv in late 2017 (see blog discussion). Our results have since been replicated by several groups in academia and in Silicon Valley.

Biorxiv article metrics: abstract views 31k, paper downloads 6k. Not bad! Perhaps that means the community understands now that genomic prediction of complex traits is a reality, given enough data.

Had we taken a poll on the eve of releasing our biorxiv article, I suspect 90+ percent of genomics researchers would have said that ~few cm accuracy in predicted human height from genotype alone was impossible.

Since our article appeared, interesting results for complex phenotypes such as educational attainment, heart disease, diabetes, and other disease risks have been obtained.

Accurate Genomic Prediction Of Human Height

Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos and Stephen D. H. Hsu

GENETICS Early online August 27, 2018; https://doi.org/10.1534/genetics.118.301267

We construct genomic predictors for heritable but extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). The constructed predictors explain, respectively, ∼40, 20, and 9 percent of total variance for the three traits, in data not used for training. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The proportion of variance explained for height is comparable to the estimated common SNP heritability from Genome-Wide Complex Trait Analysis (GCTA), and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for SNPs. Thus, our results close the gap between prediction R-squared and common SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common variants. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier Genome-Wide Association Studies (GWAS) for out-of-sample validation of our results.
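The abstract quotes both a correlation (~0.65) and a fraction of variance explained (~40 percent); the two are linked by squaring. The sketch below checks that arithmetic, assuming a simple linear predictor whose errors are uncorrelated with its predictions. The function names are illustrative and are not taken from the paper.

```python
import math

def variance_explained(r):
    """Fraction of variance explained by a predictor with correlation r."""
    return r * r

def residual_sd(r, trait_sd):
    """Standard deviation of prediction error, in the units of trait_sd.

    Assumes a calibrated linear predictor, so the unexplained variance
    is (1 - r^2) times the trait variance.
    """
    return trait_sd * math.sqrt(1.0 - r * r)

r = 0.65                          # correlation quoted in the abstract
r2 = variance_explained(r)        # 0.4225, consistent with "~40 percent"
resid = residual_sd(r, 1.0)       # error SD in units of the trait SD
print(f"R^2 = {r2:.4f}, residual SD = {resid:.3f} trait SDs")
```

Multiplying the residual figure by the population standard deviation of the trait gives the typical prediction error in physical units.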

The published version of the paper contains several new analyses in response to reviewer comments.

We added detailed comparisons between the top SNPs activated in our predictor and earlier GIANT GWAS hits. We analyze the correlation structure of L1-activated SNPs -- the algorithm (as expected) automatically selects variants which are mostly decorrelated (statistically independent) from each other.

We compare our L1 method to simpler algorithms, such as windowing: choose a genomic window size (e.g., 200k bp) and use only the SNP in each window which accounts for the most variance. This does not work as well as L1 optimization, but can produce a respectable predictor.
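A minimal sketch of the windowing idea, for illustration only: it is not the code used in the paper. It assumes a single chromosome, fixed non-overlapping windows, and that "accounts for the most variance" means the largest squared correlation between a SNP and the phenotype. The names per_snp_r2 and window_select are hypothetical.

```python
import numpy as np

def per_snp_r2(genotypes, phenotype):
    """Squared correlation of each SNP (column) with the phenotype."""
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    g = g - g.mean(axis=0)
    y = y - y.mean()
    cov = g.T @ y
    denom = np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    # Monomorphic SNPs have zero variance; give them r = 0 rather than NaN.
    r = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    return r ** 2

def window_select(positions, r2, window_bp=200_000):
    """Return the index of the highest-r2 SNP in each window of window_bp."""
    window_ids = (np.asarray(positions) // window_bp).tolist()
    best = {}  # window id -> index of the best SNP seen so far
    for idx, w in enumerate(window_ids):
        if w not in best or r2[idx] > r2[best[w]]:
            best[w] = idx
    return sorted(best.values())

# Toy example: five SNPs with base-pair positions and made-up r2 values.
positions = [10, 150_000, 250_000, 390_000, 410_000]
r2 = [0.10, 0.30, 0.20, 0.05, 0.40]
selected = window_select(positions, r2)  # one SNP per 200k bp window
```

The selected SNPs would then be combined in an ordinary regression. Unlike L1 optimization, this picks variants one window at a time and never weighs them against each other jointly.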

We investigate the correlation structure of height-associated SNPs: to what extent can the best linear combination of GIANT GWAS-significant SNPs predict the state of one of the predictor SNPs? This raises the interesting question: how much total information (entropy) is in the human genome?

Friday, August 24, 2018

Death from the Sky: Drone Assassination

This is a ~$1000 drone, max velocity ~70kph (~45mph), range ~30min flying time, controller range ~5km. It's only 1 kilo -- so payload is limited. It is optimized for photography, not for speed or range or payload. But it gives you an idea of what is possible at the same cost as, say, a couple of cheap AR15s... A real hobbyist could construct something cheaper, faster, with bigger payload. But this you can buy with one click ready to go.

It's never been easier for a bad guy to deliver an explosive charge (e.g., fraction of a kilo) to a target from a mile away. Operating a drone like this takes almost no training.

Defeating two of them coming from different directions, staggered by a few seconds, would be extremely hard even for an active security detail. Follow the target in their car and detonate the drone near the gas tank when the car stops at an intersection. Or have the drone waiting near the intersection if you know the route in advance.

If your target is commercial aviation, hit a 747 near its fuel tank as it waits to take off. A sitting duck, and no fooling around with military gear like MANPADs -- remember, you can be a mile or more away from the airport, sitting on your hotel room balcony, or in your car ready to hit the freeway.

Will this ever happen? Thank goodness terrorists tend to be incompetent... But 9/11 was a good example of what can happen when they are not.

See also Assassination by Drone.

Tuesday, August 21, 2018

MSU New Faculty Welcome 2018

This is my welcome message to new MSU faculty and staff, presented at the 2018 New Faculty and Staff Orientation lunch.

Good afternoon and Welcome!

We are so pleased that you are here at Michigan State University. You have joined a leading research university, at a very exciting time.

I usually don’t give lengthy remarks, but since I’m not standing between you and lunch, and because so many exciting things are happening on campus, I could not resist giving something of an overview today.

With increased funding, new infrastructure, and an aggressive hiring initiative, we are positioning MSU research, and the university, for continued success. This success is built on a rich research history spanning many decades and disciplines.

In an MSU lab in 1965, Barnett Rosenberg and his team discovered that cisplatin prevents the DNA in cancer cells from replicating. Cisplatin is now a widely used chemotherapy medication. His “ah-ha” moment led to further research, but not without difficulties. The team initially failed to replicate their first results. But they worked extremely hard to resolve the issue, and subsequently had the drug through trials and approved in record time. It’s this kind of Spartan tenacity and effectiveness that we should all emulate.

Two weeks ago we celebrated the 40th anniversary of the FDA approval of cisplatin, a therapy that is still considered the gold standard to which most new cancer treatments are compared. This discovery not only continues to help those afflicted with cancer, but the resulting royalties also fuel new research and discovery in the form of internal grants and other investments from the MSU Foundation.

My sincere hope is that one of you someday discovers the next cisplatin, or makes scholarly advances of equal importance.

Our team at the MSU Innovation Center is ready to assist faculty and student entrepreneurs with the next “big idea”. They steward more than 150 discoveries annually into a pipeline of patents, products and startup businesses. In 2017, this productivity resulted in 75 license and option agreements with companies around the world, as well as $2.4M in royalties being distributed to our faculty and departments. Applied research helps to build a diversified economy and brings jobs to Michigan and beyond. It is an increasingly important part of university activity.

I’d like to give you a bit of context for the size and scope of the research enterprise here at MSU.

MSU research continues on an upward growth trajectory. For 2017, total research expenditures were about $700M. This is a number reported each year to NSF for their Higher Education Research and Development (HERD) report. Only 5 years ago our number was closer to $500M, so this represents significant growth.

Based on the HERD comparison data, MSU ranks 1st in the Big Ten and 2nd in the nation in combined Department of Energy and National Science Foundation research expenditures.

We expect to continue our leadership in DOE and NSF funding, in part due to the Facility for Rare Isotope Beams, but also due to our work with the Plant Research Laboratory, the Great Lakes Bioenergy Research Center and other interdisciplinary and multi-institutional research projects.

Our strategic plan outlines a number of new initiatives that leverage our current strengths and/or build new capacity to expand our portfolio, increase our competitiveness, and ultimately solve many of tomorrow’s pressing problems.

As I mentioned, MSU is home to the Facility for Rare Isotope Beams. FRIB will be a scientific user facility for the Office of Nuclear Physics in the Office of Science of the U.S. Department of Energy.

FRIB will be operational in 2021 and will deliver the highest intensity beams of rare isotopes available anywhere in the world. Estimates of the total investment in this project are roughly $1 billion -- a huge milestone for MSU. Operated by MSU, FRIB will enable scientists to make discoveries about the properties of rare isotopes (which are unusual forms of the elements) in order to better understand the physics of nuclei, nuclear astrophysics, and the fundamental interactions of nature. It will also produce practical applications for society, including in medicine, homeland security, and industry.

Last weekend, FRIB held a public open house attracting some 3000 people. If you didn’t have a chance to visit, you will get a glimpse this Thursday at the new faculty research orientation. I hope to see many of you there.

But new infrastructure doesn’t stop with FRIB, and I’m sure you’ve noticed all the construction on campus.

Two years ago, we opened the new BioEngineering building, which houses the Institute for Quantitative Health Science and Engineering, colloquially referred to as “IQ”. This collaboration of the colleges of Engineering, Human Medicine and Natural Science will apply quantitative methods to biomedicine and life science in an interdisciplinary setting. IQ’s researchers will develop new medical tools and treatments that will advance biomedicine in creative ways. We hope it will fundamentally change the way healthcare is delivered.

We’re already far along in construction of another, larger building next to IQ that will house precision health researchers and several other new initiatives. This building, along with IQ and Radiology, will create an entire area of campus dedicated to biomedical research.

Last year, we opened a new health research facility in Grand Rapids to complement our medical school there. Researchers in Grand Rapids, and our East Lansing biomedical neighborhood, will make discoveries in health science, and attract additional funding to expedite our growth trajectory. Our performance in NIH funding lags the stellar results I mentioned concerning NSF and DOE, but the investments listed above are meant to improve this situation. In addition, I should mention that for the first time MSU will have a research hospital on our campus, through a partnership with McLaren. MSU research integration with major health systems in Michigan has never been stronger, and we anticipate announcement of major collaborative efforts in the near future.

On August 31, we will break ground on a new STEM education building. New laboratory teaching and research spaces will support MSU’s increasing student enrollment in STEM fields. We look forward to the opportunities this new facility will create for both our students and faculty.

In June, construction began on a new music pavilion. This state-of-the-art facility will incorporate highly advanced acoustical engineering to create high-quality teaching, practice, rehearsal and research spaces that meet the needs of 21st century musicians. This addition further elevates our reputation in the arts, with a particular focus on student learning.

MSU will continue to invest in infrastructure improvements to support our faculty and students, increase our competitiveness, and attract top recruits like yourselves to the university.

Another recent development is a new department called Computational Mathematics, Science, and Engineering or CMSE. This department was planned, authorized, and operational in only three years—quite a feat in academia. I often compare “startup time” (the fast pace at which things are accomplished in Silicon Valley) to “academic time” (i.e., nothing gets done, other than committee meetings, or a no-brainer project takes a decade to complete), but with CMSE this was a case of something on campus getting done in startup time. CMSE is one of very few such departments in the country -- it is focused on data science, machine learning, advanced computation and related applications, but is not a traditional CS department. It supports many of the new efforts on campus that require the analysis of large data sets and development of new tools and algorithms. Researchers in this department utilize datasets drawn from areas such as astrophysics, business analytics, mobile data, materials science, human and plant genomics, and many other areas. The department was conceived as fundamentally interdisciplinary -- bringing together experts in computation with subject matter experts in fields of science which are becoming increasingly reliant on data.

I can’t help mentioning a couple of big data examples related to my own research interests: we’ve created a compute resource with 500k human genomes from the UK Biobank, which is open to interested investigators on campus. All of the data is stored at our High Performance Computing Center or HPCC. Using this data, our collaboration demonstrated for the first time that machine learning applied to large genomic datasets could produce accurate predictors for complex human traits. We can now predict adult human height from genome alone, with accuracy of roughly 1 inch. The predictor uses ~20k genetic variants distributed throughout the genome. Predictors of complex disease risk, for conditions such as heart disease, diabetes, low blood platelet count, and breast cancer, have been developed and replicated in out-of-sample tests. See the NYTimes science section just a few days ago. This is only the beginning for genomics-informed Precision Medicine.

Over the summer, through a CEO friend in Silicon Valley, I obtained access for MSU researchers to mobile geolocation data covering the movements of over 30 million Americans. Yes, geolocation coordinates every 10 minutes or so for 30 million people, via their smartphones. I hope you all were aware of this when you clicked “I Accept” :-) If you can think of interesting research uses for this data, please contact Dirk Colbry in CMSE for more information.

The most important component of a university is not buildings, or even laboratory or compute or data infrastructure. The most important resource is people -- talented research faculty, postdocs, students, and support staff.

Some of you joining us today may have been hired under the Global Impact Initiative (GI2). Launched in 2014, the goal of GI2 is to hire 100 new faculty whose research has breakthrough potential to shape the future. Over the last four years, we’ve recruited new faculty with a focus on key areas of innovation, such as machine learning, precision medicine, computational genomics, autonomous vehicles, advanced materials, gene-editing, and advanced plant science. Nearly 80 positions have been filled, with candidates hired from Harvard, Stanford, Princeton, MIT, Johns Hopkins University, Lawrence Berkeley National Lab, Los Alamos National Lab, and many other top institutions. But we're not done yet. We look forward, with enthusiasm, to the next year of recruiting.

Working here, you will be surrounded by world-class faculty, including members of the National Academy; Guggenheim, Packard, and Sloan Fellows; a recipient of the Stockholm Water Prize; Pulitzer Prize winners; and many more.

In 2018 alone, faculty at MSU received a record 11 NSF CAREER awards across a number of disciplines including engineering, communication arts and sciences, physics and astronomy, plant science, and others. This speaks volumes about the caliber of our young faculty, and is one reason why I’m looking forward to seeing their progress.

As you begin your time here at MSU, we urge you to think big and act boldly. If you are a new faculty member, still near the beginning of your career, we want to support your growth in every way possible. If you are a senior faculty member, we want to push your research program to that next higher level of impact. And, we hope that you can provide valuable mentorship to younger scholars around you.

If there is a problem -- tell us about it! -- whether it has to do with grant submissions, or startup incubation, or child care, food options on campus, your functional or dysfunctional department. We’re here to fix things, and to provide the best possible environment for your teaching and research.

Only one in a thousand people in our society have the privilege to engage full time in discovery -- in curiosity driven research -- for the benefit of humankind. You are part of that lucky one in a thousand, and we are here to help you succeed.

The bar has been set very high, but with the resources and new opportunities here at MSU, your potential is limitless.

My very best wishes to you all :-)

Sunday, August 19, 2018

Genomic Prediction: A Hypothetical (Embryo Selection), Part 2

The figures below are from the recent paper Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations (Nature Genetics), discussed previously here.

As you can see, genomic prediction of risk allows us to identify outliers for conditions like heart disease and diabetes. Individuals in the top 1% of polygenic risk score are many times (approaching an order of magnitude) more likely to exhibit the condition than the typical person.

In an earlier post, Genomic Prediction: A Hypothetical (Embryo Selection), I pointed out a similar situation with regard to the SSGAC predictor for Educational Attainment. Negative outliers on that polygenic score (e.g., bottom 1%) are much more likely to have difficulty in school. I then posed this hypothetical:

You are an IVF physician advising parents who have exactly 2 viable embryos, ready for implantation.

The parents want to implant only one embryo.

All genetic and morphological information about the embryos suggests that they are both viable, healthy, and free of elevated disease risk.

However, embryo A has polygenic score (as in figure above) in the lowest quintile (elevated risk of struggling in school) while embryo B has polygenic score in the highest quintile (less than average risk of struggling in school). We could sharpen the question by assuming, e.g., that embryo A has score in the bottom 1% while embryo B is in the top 1%.

You have no other statistical or medical information to differentiate between the two embryos.

What do you tell the parents? Do you inform them about the polygenic score difference between the embryos?

We can pose the analogous hypothetical for the risk scores displayed below. Should the parents be informed if, for instance, one of the embryos is in the top 1% risk for heart disease or Type 2 Diabetes? Is there a difference between the case of the EA predictor and disease risk predictors?

In the case of monogenic (Mendelian) genetic risk, e.g., Tay-Sachs, Cystic Fibrosis, BRCA, etc., deliberate genetic screening is increasingly common, even if penetrance is imperfect (i.e., the probability of the condition given the presence of the risk variant is less than 100%).

Note, the risk ratio between top 1% and bottom 1% individuals is potentially very large (see below), although more careful analysis is probably required to understand this better.

These hypotheticals will not be hypothetical for very much longer: the future is here.

(CAD = coronary artery disease.)

Tuesday, August 14, 2018

Genomic Prediction of disease risk using polygenic scores (Nature Genetics)

It seems to me we are just at the tipping point -- soon it will be widely understood that with large enough data sets we can predict complex traits and complex disease risk from genotype, capturing most of the estimated heritable variance. People will forget that many "experts" doubted this was possible -- the term missing heritability will gradually disappear.

In just a few years genotyping will start to become "standard of care" in many health systems. In 5 years there will be ~100M genotypes in storage (vs ~20M now), a large fraction available for scientific analysis.

Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations (Nature Genetics)

A key public health need is to identify individuals at high risk for a given disease to enable enhanced screening or preventive therapies. Because most common diseases have a genetic component, one important approach is to stratify individuals based on inherited DNA variation [1]. Proposed clinical applications have largely focused on finding carriers of rare monogenic mutations at several-fold increased risk. Although most disease risk is polygenic in nature [2,3,4,5], it has not yet been possible to use polygenic predictors to identify individuals at risk comparable to monogenic mutations. Here, we develop and validate genome-wide polygenic scores for five common diseases. The approach identifies 8.0, 6.1, 3.5, 3.2, and 1.5% of the population at greater than threefold increased risk for coronary artery disease, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer, respectively. For coronary artery disease, this prevalence is 20-fold higher than the carrier frequency of rare monogenic mutations conferring comparable risk [6]. We propose that it is time to contemplate the inclusion of polygenic risk prediction in clinical care, and discuss relevant issues.

Using much larger studies and improved algorithms, we set out to revisit the question of whether a GPS can identify subgroups of the population with risk approaching or exceeding that of a monogenic mutation. We studied five common diseases with major public health impact: CAD, atrial fibrillation, type 2 diabetes, inflammatory bowel disease, and breast cancer.

For each of the diseases, we created several candidate GPSs based on summary statistics and imputation from recent large GWASs in participants of primarily European ancestry (Table 1). Specifically, we derived 24 predictors based on a pruning and thresholding method, and 7 additional predictors using the recently described LDPred algorithm [13] (Methods, Fig. 1 and Supplementary Tables 1–6). These scores were validated and tested within the UK Biobank, which has aggregated genotype data and extensive phenotypic information on 409,258 participants of British ancestry (average age: 57 years; 55% female) [14,15].

We used an initial validation dataset of the 120,280 participants in the UK Biobank phase 1 genotype data release to select the GPSs with the best performance, defined as the maximum area under the receiver-operator curve (AUC). We then assessed the performance in an independent testing dataset comprised of the 288,978 participants in the UK Biobank phase 2 genotype data release. For each disease, the discriminative capacity within the testing dataset was nearly identical to that observed in the validation dataset.
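The select-then-test procedure described in that excerpt can be sketched roughly as follows. This is an illustration, not the authors' code: the AUC is computed with the standard rank (Mann-Whitney) formula, and the function names, candidate names, and toy numbers are all hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum formula (ties share ranks)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        j = i
        # Extend j over any run of tied scores.
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def select_best(candidates, labels):
    """Pick the candidate score with the highest AUC on the validation set."""
    return max(candidates, key=lambda name: auc(candidates[name], labels))

# Toy validation set: 1 = case, 0 = control; two hypothetical candidate scores.
val_labels = [0, 0, 1, 1]
val_scores = {
    "candidate_a": [0.10, 0.40, 0.35, 0.80],
    "candidate_b": [0.10, 0.20, 0.70, 0.80],
}
best = select_best(val_scores, val_labels)
```

The chosen score would then be evaluated once on the held-out testing set, which is why a test AUC close to the validation AUC is good evidence against overfitting in the selection step.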

In the talk below @21:45 I discuss prospects for genomic prediction of disease risk.

Wednesday, August 08, 2018

Life and Fate, Before Sunset

This Hollywood oral history tells the story of Richard Linklater's "Before" Trilogy: Before Sunrise, Before Sunset, and Before Midnight. The films appeared 9 years apart, and tell the story of Jesse (Ethan Hawke) and Celine (Julie Delpy) in their 20s, 30s, and 40s. I find the second film to be the most interesting, really a masterpiece of filmmaking (I have a copy on the hard drive of the laptop I write this on :-). The events in Before Sunset take place in real time -- i.e., the story transpires over the run time of the movie, a single afternoon. Shooting it must have been extremely challenging for Delpy and Hawke, and for the crew.

The video above should start at 23:30, and explains how Linklater, Delpy, and Hawke came together to do the sequel. I think that event was, in some sense, the most contingent of those responsible for the trilogy. The first movie made very little money, and hence the idea to make a second, very different film -- about the complexity of life, the passage of time, lost chances -- was neither obvious nor inevitable.

The first movie is about a one night tryst between 20-something travelers, but the second movie takes place a decade later. The protagonists, while still young, have experienced more of life and the second film is richer and more complex, despite taking place over an even shorter period of time. I remember being excited to see it, not so much because of Before Sunrise (which I found entertaining, but not as special), but because of the intriguing premise of two lovers meeting again by chance after losing track of each other for so long.

Here's a scene from Before Sunset: a long take of walking and conversation in beautiful Paris, camera following Hawke and Delpy in a totally naturalistic way.

I hesitate to include this trailer because it's kind of cheesy, but if you're not familiar with the trilogy it explains the premise of the first two films.

The video below is a nice discussion of the trilogy. Just now I learned (thanks, AI!) that Before Sunrise is based on actual events in Linklater's life -- see here for the poignant story of the real life muse for these films.

Richard Linklater also directed Dazed and Confused -- one of the greatest high school movies ever made, and a beautiful evocation of adolescence in late-70s, early-80s America.

Saturday, August 04, 2018

Assassination by Drone

I have been waiting for this to happen:

Reuters: CARACAS - Drones loaded with explosives detonated close to a military event where Venezuelan President Nicolas Maduro was giving a speech on Saturday, but he and top government officials alongside him escaped unharmed from what Information Minister Jorge Rodriguez called an “attack” targeting the leftist leader. Seven National Guard soldiers were injured, Rodriguez added.

See this 2015 post on drone racing and ask yourself how you'd stop one of these drones from getting close to its target.

Countermeasures will be quite difficult, especially if drone operators use sophisticated frequency hopping control.

One doesn't even need pilot operators. The drones can be programmed to fly to a GPS coordinate using an evasive approach.

1. The exact coordinate can be marked by someone in the audience of a public appearance of the target.

2. It would be a formidable challenge even to stop some medium sized drones, each with a few kilo payload, from flying through the windows of the Oval Office (known GPS coordinate; known presence of targets at specific times).

This is still Science Fiction, for now:

Twenty years ago I told a PhD student that a terrorist -- willing to die and able to fly an airplane -- could probably take out the White House. After 9/11 he reminded me that I had identified this hole in the system well in advance. It's the same thing here with small and medium size drones. They are accessible to non-state actors with limited resources, and very difficult to defeat, even for state security.

Barista Bots

Still think low-skill immigration is a good idea?

If you accept the thesis that automation is a threat to low-skill employment, then you should be willing to reconsider the long term cost-benefit analysis of low-skill immigration.

Thursday, August 02, 2018

Arnold: The Will to Power

I don't know whether Arnold ever read Nietzsche, but he certainly developed the Will to Power early in life. I quite like the video above -- I even made my kids watch it :-)

When I was in high school I came across his book Arnold: The Education of a Bodybuilder, a combination autobiography and training manual published in 1977. I found a copy in the remainder section of the book store and bought it for a few dollars. The most interesting part of the book is the description of his early life in Austria and his introduction to weightlifting and bodybuilding. I highly recommend it to anyone interested in golden age bodybuilding, the early development of physical training, or the psychology of human drive and high achievement. Young Arnold displays a kind of unbridled and unironic egoism that can no longer be expressed without shame in today's feminized society.

Chapter 2: Before long, people began looking at me as a special person. Partly this was the result of my own changing attitude about myself. I was growing, getting bigger, gaining confidence. I was given consideration I had never received before; it was as though I were the son of a millionaire. I'd walk into a room at school and my classmates would offer me food or ask if they could help me with my homework. Even my teachers treated me differently. Especially after I started winning trophies in the weight-lifting contests I entered.

This strange new attitude toward me had an incredible effect on my ego. It supplied me with something I had been craving. I'm not sure why I had this need for special attention. Perhaps it was because I had an older brother who'd received more than his share of attention from our father. Whatever the reason, I had a strong desire to be noticed, to be praised. I basked in this new flood of attention. I turned even negative responses to my own satisfaction.

I'm convinced most of the people I knew didn't really understand what I was doing at all. They looked at me as a novelty, a freak. ...

"Why did you have to pick the least-favorite sport in Austria?" they always asked. It was true. We had only twenty or thirty bodybuilders in the entire country. I couldn't come up with an answer. I didn't know. It had been instinctive. I had just fallen in love with it. I loved the feeling of the gym, of working out, of having muscles all over.

Now, looking back, I can analyze it more clearly. My total involvement had a lot to do with the discipline, the individualism, and the utter integrity of bodybuilding. But at the time it was a mystery even to me. Bodybuilding did have its rewards, but they were relatively small. I wasn't competing yet, so my gratification had to come from other areas. In the summer at the lake I could surprise everyone by showing up with a different body. They'd say, "Jesus, Arnold, you grew again. When are you going to stop?"

"Never," I'd tell them. We'd all laugh. They thought it amusing. But I meant it.

...

The strangest thing was how my new body struck girls. There were a certain number of girls who were knocked out by it and a certain number who found it repulsive. There was absolutely no in-between. It seemed cut and dried. I'd hear their comments in the hallway at lunchtime, on the street, or at the lake. "I don't like it. He's weird—all those muscles give me the creeps." Or, "I love the way Arnold looks—so big and powerful. It's like sculpture. That's how a man should look."

These reactions gave me added motivation to continue building my body. I wanted to get bigger so I could really impress the girls who liked it and upset the others even more. Not that girls were my main reason for training. Far from it. But they added incentive and I figured as long as I was getting this attention from them I might as well use it. I had fun. I could tell if a girl was repelled by my size. And when I'd catch her looking at me in disbelief, I would casually raise my arm, flex my bicep, and watch her cringe. It was always good for a laugh. ...

Arnold, age 17 or 18:
