Monday, July 20, 2015

What is medicine’s 5 sigma?

Editorial in the Lancet, reflecting on the Symposium on the Reproducibility and Reliability of Biomedical Research held April 2015 by the Wellcome Trust.

... much of the [BIOMEDICAL] scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, [BIOMEDICAL] science has taken a turn towards darkness. As one participant put it, “poor methods get results”. The Academy of Medical Sciences, Medical Research Council, and Biotechnology and Biological Sciences Research Council have now put their reputational weight behind an investigation into these questionable research practices. The apparent endemicity of bad research behaviour is alarming. In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory of the world. ...

One of the most convincing proposals came from outside the biomedical community. Tony Weidberg is a Professor of Particle Physics at Oxford. ... the particle physics community ... invests great effort into intensive checking and rechecking of data prior to publication. By filtering results through independent working groups, physicists are encouraged to criticise. Good criticism is rewarded. The goal is a reliable result, and the incentives for scientists are aligned around this goal. Weidberg worried we set the bar for results in biomedicine far too low. In particle physics, significance is set at 5 sigma—a p value of 3 × 10^{-7} or 1 in 3·5 million (if the result is not true, this is the probability that the data would have been as extreme as they are). The conclusion of the symposium was that something must be done ...
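As a quick sanity check on the quoted numbers: 5 sigma corresponds to a one-tailed standard normal tail probability of about 3 × 10^{-7}, i.e. roughly 1 in 3.5 million. A minimal sketch using only the Python standard library:

```python
import math

def sigma_to_p(n_sigma: float) -> float:
    """One-tailed tail probability of a standard normal beyond n_sigma.

    P(Z > n) = 0.5 * erfc(n / sqrt(2))
    """
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

p5 = sigma_to_p(5.0)
print(f"5 sigma -> p = {p5:.3g}")   # about 2.87e-07, i.e. ~1 in 3.5 million
```

(The two-tailed convention would double this; the Lancet's "1 in 3.5 million" is the one-tailed figure.)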
I once invited a famous evolutionary theorist (MacArthur Fellow) at Oregon to give a talk in my institute, to an audience of physicists, theoretical chemists, mathematicians and computer scientists. The Q&A was, from my perspective, friendly and lively. A physicist of Hungarian extraction politely asked the visitor whether his models could ever be falsified, given the available field (ecological) data. I was shocked that he seemed shocked to be asked such a question. Later I sent an email thanking the speaker for his visit and suggesting he come again some day. He replied that he had never been subjected to such aggressive and painful attack and that he would never come back. Which community of scientists is more likely to produce replicable results?

See also Medical Science? and Is Science Self-Correcting?

To answer the question posed in the title of the post / editorial, an example of a statistical threshold which is sufficient for high confidence of replication is the p < 5 × 10^{-8} significance requirement in GWAS. This is just the traditional p < 0.05 threshold Bonferroni-corrected for multiple testing of ~10^6 independent SNPs. Early "candidate gene" studies, which did not impose this correction, have very low replication rates. See the comment below for what this implies about the validity of priors based on biological intuition.
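The arithmetic behind that threshold is a straightforward Bonferroni correction (the 10^6 figure is the conventional count of roughly independent common SNPs tested in a scan); a minimal sketch:

```python
alpha = 0.05          # traditional per-test significance level
n_tests = 1_000_000   # ~10^6 roughly independent common SNPs per GWAS

# Bonferroni-corrected per-SNP threshold: 0.05 / 10^6 = 5e-8
threshold = alpha / n_tests
print(f"genome-wide significance: p < {threshold:.0e}")

# Family-wise error rate if all tests are null and independent:
# the chance of at least one false positive across the whole scan
fwer = 1 - (1 - threshold) ** n_tests
print(f"chance of >= 1 false hit per scan: {fwer:.3f}")   # close to 0.05
```

The second number shows why the correction works: even after a million tests, a whole scan still produces a spurious genome-wide-significant hit only about 5% of the time.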

I discuss this a bit with John Ioannidis in the video below.


Douglas Knight said...

I don't think that the history of genetic associations that you present to Ioannidis is correct. I believe there was a period of GWAS with correction for multiple comparisons that still failed to replicate, before the current period of replication. I think this was a period of very few hits per paper, which probably should have been a warning sign.

Also, I am disturbed by this emphasis on setting a strong alpha. If the problem is multiple comparisons, correct for multiple comparisons. GWAS does not use an alpha of 10^-8. If it did, that would soon become obsolete as the number of genes went up! Similarly, I was disturbed by what I read about the Higgs boson. I am not sure I understood it correctly, but it sounded like 10^-5 was not correcting for multiple comparisons and it was really only 10^-3. That's a perfectly fine alpha, but people should report corrected p-values, not raw.

Presumably physics uses a small alpha because it doesn't trust people to account for all multiple comparisons. This is more of a problem in small experiments than large. Large experiments are less subject to publication bias. Moreover, they tend to pre-specify analysis. This is true both for GWAS and for LHC. (At least, I think LHC pinned down the analysis of the mass of the Higgs boson; I don't know how much flexibility there would be for the discovery of an unexpected particle.) It's hard to imagine what you could do about hidden experimenter degrees of freedom except a strong alpha, but I fear that confusing it with correcting for explicit multiple comparison is dangerous.
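Knight's distinction between a raw and a corrected p-value can be made concrete with the Bonferroni adjustment (the numbers below are hypothetical, chosen to match his 10^-5 vs 10^-3 example, i.e. a factor-of-100 implicit multiple comparison):

```python
def bonferroni_corrected(p_raw: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value: raw p times number of comparisons, capped at 1."""
    return min(1.0, p_raw * n_comparisons)

# A raw p of 1e-5 spread over 100 implicit comparisons is really p ~ 1e-3
print(bonferroni_corrected(1e-5, 100))   # ~0.001
```

Reporting the corrected value, as Knight suggests, keeps the stated p honest about how many looks at the data it survived.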

steve hsu said...

>> I don't think that the history of genetic associations that you present to Ioannidis is correct. I believe that there was a period of GWAS with correction for multiple comparisons that still failed to replicate before the current period of replication. <<

I'm not aware of this history you reference, but I am only a recent entrant into this field. On the other hand Ioannidis is both a long time genomics researcher and someone who does meta-research on science, so he should know. He may have even written a paper on this subject -- I seem to recall he had hard numbers on the rate of replication of candidate gene studies and claimed it was in the low percents. BTW, this result shows that the vaunted intuition of biomedical types about "how things really work" in the human body is worth very little. We are much better off, in my opinion, relying on machine learning methods and brute force statistical power than priors based on, e.g., knowledge of biochemical pathways or cartoon models of cell function. (Even though such things are sometimes deemed sufficient to raise ~$100m in biotech investment!) This situation may change in the future but the record from the first decade of the 21st century is there for any serious scholar of the scientific method to study.

Both Ioannidis and I (through separate and independent analyses) feel that modern genomics is a good example of biomedical science that (now) actually works and produces results that replicate with relatively high confidence. It should be a model for other areas ...

Re: particle physics, AFAIK the p value threshold used at, e.g., the LHC is (1) not based on any rigorous calculation, and (2) not just a correction for multiple tests. There are also always hard-to-estimate systematic errors (for example, in Monte Carlo simulation of background events using theoretical approximations) that cannot be taken into account precisely. So 5 SD is really a rule of thumb. Even a 6 SD result could be wrong due to a subtle mistake by the experimenters.

Science is hard!

Nat Philosopher said...

The scientific literature itself has its problems, but the real issue is the transition to medical practice. Everybody who has looked seriously at the issue, from the US Congress Office of Technology Assessment, to the BMJ, to the Cochrane Collaboration, finds that no more than 10-30% of medical practice is supported by scientific research. The Mayo Clinic recently looked at studies testing an already-implemented change in medical practice, and found that more often than not, the studies found the old practice was better!

To cite a specific example, Bishop et al. (NEJM, 1997) found that every 40 mcg/kg of intravenous aluminum injected in a neonate cost roughly an IQ point, and a follow-up showed loss of bone density at 15 years of age. Every animal study injecting aluminum into neonates finds damage. Every epidemiological study sensitive to the issue I've seen finds damage. The Hep B vaccine given at birth alone has 250 mcg of aluminum, and the whole series has 4000 mcg in the first six months. The FDA is relying on Mitkus et al., which is a theory paper using an MRL based on dietary experiments in weaned animals. It is not informed about the toxicity of parenteral aluminum in neonates in any way I can see, and every study that is so informed reports damage.

BobSykes said...

We live in a Lysenkoist age. Academic researchers in particular are under pressure and are rewarded to produce the results desired by their researcher sponsors, who are mostly highly politicized government agencies. It should be noted that unlike industrial researchers, academics are almost entirely unsupervised and unmonitored. They are hired and promoted based on their ability to bring in research dollars, and nearly all universities have tacit dollar quotas, usually adjusted for discipline.

For successful researchers, grant dollars sustain a jet-set lifestyle and provide fame and not a little fortune. The referee process at even the best journals is unable (and unwilling) to detect errors, especially errors in statistical procedure, and fraud. In areas like medicine, biology, and climatology, research is so hard that it's easy to befuddle reviewers, readers, and even the researchers themselves. Faculty can enhance their relative status by sensational public statements. Consider how many utterly delusional statements are made by climatologists and environmentalists. So, combined with the possible rewards, the lack of supervision and monitoring and the ineptness of referees make cheating and sloppy, bad work very attractive, and in some disciplines they are the norm. That half of all published research is nonsense is merely to be expected. Science and scientists are always suspect and never deserve the deference they get.

Even when bad behavior is detected, senior university administrators always cover it up. Think of the Baltimore and Gallo scandals. Baltimore got off scot-free. Gallo was ultimately punished by the Nobel Prize Committee, but Fauci, his former boss, still defends him. Unlike physicians and engineers, university faculty and administrators have no sense of professional ethics, and, consequently, no restraints on their personal behavior.

Industrial researchers are closely supervised and the government scrutinizes all aspects of their work because the stakes are so high. Consequently bad and dishonest research is largely a university faculty monopoly. However, even industrial research can be corrupted, especially in the social sciences, viz.

Some academic fields are famously free of fraud and deceit. Mathematics, physics, and chemistry head the list, mostly because these fields are populated by people with very high intellects and a strong commitment to their fields. Also, deceit and fraud are nearly impossible there.

Raghuveer Parthasarathy said...

Somewhat related:

(On false claims in Physics.) Despite this, I agree that the biomedical literature is in poor shape, though I'd claim the problem is much more due to a lack of understanding of what things like p-values mean than a lack of sufficiently stringent "significance" criteria.
