Wednesday, November 30, 2011

DNA data deluge

NYTimes reports: DNA Sequencing Caught in Deluge of Data. Spooks and other scientists have similar problems. See also here.

NYTimes: BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day.

BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data, via FedEx.

“It sounds like an analog solution in a digital age,” conceded Sifei He, the head of cloud computing for BGI, formerly known as the Beijing Genomics Institute. But for now, he said, there is no better way.

The field of genomics is caught in a data deluge. DNA sequencing is becoming faster and cheaper at a pace far outstripping Moore’s law, which describes the rate at which computing gets faster and cheaper.

The result is that the ability to determine DNA sequences is starting to outrun the ability of researchers to store, transmit and especially to analyze the data.

... The cost of sequencing a human genome — all three billion bases of DNA in a set of human chromosomes — plunged to $10,500 last July from $8.9 million in July 2007, according to the National Human Genome Research Institute.

That is a decline by a factor of more than 800 over four years. By contrast, computing costs would have dropped by perhaps a factor of four in that time span.

The lower cost, along with increasing speed, has led to a huge increase in how much sequencing data is being produced. World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high, according to Michael Schatz, assistant professor of quantitative biology at the Cold Spring Harbor Laboratory on Long Island.

There will probably be 30,000 human genomes sequenced by the end of this year, up from a handful a few years ago, according to the journal Nature. And that number will rise to millions in a few years.

In a few cases, human genomes are being sequenced to help diagnose mysterious rare diseases and treat patients. But most are being sequenced as part of studies. The federally financed Cancer Genome Atlas, for instance, is sequencing the genomes of thousands of tumors and of healthy tissue from the same people, looking for genetic causes of cancer. ...
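As a rough check on the cost figures quoted above, here is a minimal Python sketch that computes the decline factor and the implied halving time. The function names are illustrative, and the factor-of-four figure for computing is taken straight from the quote rather than from independent data.

```python
import math

# Figures from the NHGRI numbers quoted in the article.
COST_JULY_2007 = 8_900_000   # USD per human genome
COST_JULY_2011 = 10_500      # USD per human genome
YEARS = 4

def decline_factor(start_cost, end_cost):
    """How many times cheaper the end cost is than the start cost."""
    return start_cost / end_cost

def halving_time_months(factor, years):
    """Months needed for the cost to halve, assuming a steady exponential decline."""
    return 12 * years / math.log2(factor)

seq_factor = decline_factor(COST_JULY_2007, COST_JULY_2011)
seq_halving = halving_time_months(seq_factor, YEARS)

# The article's estimate for computing: roughly a factor of four in four years.
computing_halving = halving_time_months(4, YEARS)

print(f"sequencing: factor {seq_factor:.0f}, halving every {seq_halving:.1f} months")
print(f"computing:  factor 4, halving every {computing_halving:.0f} months")
```

With these inputs, sequencing cost halves roughly every five months, against every 24 months for computing.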

Here's a slide I sometimes use in talks.


ytrewq123 said...

Is there no (relatively) compact representation of genomic data? 

steve hsu said...

For the case of humans, you can compress using a diff wrt a reference sequence. Individuals differ at only a fraction of a percent of loci.

See, e.g.,

Or Wikipedia: "The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be coded by 2 bits. Since individual genomes vary by less than 1% from each other, they can be losslessly compressed to roughly 4 megabytes.

The entropy rate of the genome differs significantly between coding and non-coding sequences. It is close to the maximum of 2 bits per base pair for the coding sequences (about 45 million base pairs), but less for the non-coding parts. It ranges between 1.5 and 1.9 bits per base pair for the individual chromosome, except for the Y-chromosome, which has an entropy rate below 0.9 bits per base pair."

But this is a one-time gain. If we end up sequencing every person on earth, which according to recent trends would be economically feasible fairly soon, that's about 10^16 bytes of data, fully compressed.
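To make the arithmetic in this comment concrete, here is a small Python sketch of 2-bit packing and a reference diff. It is a toy under stated assumptions: the diff handles substitutions only (no insertions or deletions), the function names are made up for illustration, and the population figure of 7 billion is an assumption rather than a number from the thread.

```python
# Map each base to a 2-bit code.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def diff_vs_reference(ref, sample):
    """Return (position, base) pairs where sample differs from ref."""
    if len(ref) != len(sample):
        raise ValueError("this sketch handles equal-length sequences only")
    return [(i, s) for i, (r, s) in enumerate(zip(ref, sample)) if r != s]

def apply_diff(ref, diff):
    """Rebuild the sample from the reference plus its diff."""
    seq = list(ref)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)

# Back-of-envelope sizes using the figures quoted above.
raw_bytes = 2_900_000_000 * 2 // 8        # 2 bits per base, about 725 MB
population = 7_000_000_000                # assumption: rough world population
total_bytes = population * 4_000_000      # about 4 MB per compressed genome

reference = "ACGTACGTACGT"
individual = "ACGAACGTACTT"
diff = diff_vs_reference(reference, individual)
print(diff, apply_diff(reference, diff) == individual)
print(f"raw genome: {raw_bytes:.2e} bytes, everyone: {total_bytes:.1e} bytes")
```

The total comes out at a few times 10^16 bytes, the same order of magnitude as the estimate above.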

RKU1 said...

I wonder if genomic storage needs will soon begin replacing video storage as the main driving force behind the semiconductor demand curve...

5371 said...

All that data is so far no more useful than it would be to scan the sole of everyone's foot at molecular resolution.

steve hsu said...

Video is still the driver. 10^16 bytes = 10^4 terabytes is not that much. You could fit it in your house if you stuffed it full of terabyte hard drives :-)

RKU1 said...

But let's approximate the typical hi-def movie as something like 10 GB = 10^10 bytes and there can't be much more than about 10^5 movies = 10^15 bytes total.  Meanwhile, live TV would stream at probably about 5 GB per hour = 100 GB per day = 10^11 bytes per day = 10^13.5 bytes per year per channel, and maybe a few dozen channels with the 24/7/365 production values worth recording = 10^15 bytes per year for all of TV before (massively) factoring out repeats.

I think the only sort of video which would be in the same range is all the CCTV footage used to stop burglars, which probably can take lots of lossy compression and doesn't really need to be centralized. The same goes for all the individual clips people take of their cats and upload to YouTube, which can probably be pretty lossy too.

Meanwhile, on the other side, there's the full DNA of all the individual dogs and cats and hamsters and snails...  :-)
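The order-of-magnitude estimates in this exchange can be redone in a few lines of Python. This is only a sketch: the per-movie and per-hour sizes are the guesses given above, and "a few dozen channels" is pinned to 36 purely as an assumption.

```python
import math

GB = 10**9
TB = 10**12

# Genomes: the fully compressed figure from earlier in the thread.
genomes_total = 10**16
genomes_in_tb = genomes_total // TB            # number of 1 TB drives needed

# Movies: about 10 GB each, at most about 10^5 titles.
movies_total = (10 * GB) * 10**5

# Live TV: about 5 GB per hour, recorded around the clock.
tv_per_day = 5 * GB * 24
tv_per_year = tv_per_day * 365
channels = 36                                   # assumption for "a few dozen"
tv_total_per_year = tv_per_year * channels

for label, n in [
    ("all genomes", genomes_total),
    ("all movies", movies_total),
    ("one TV channel, one year", tv_per_year),
    ("all TV, one year", tv_total_per_year),
]:
    print(f"{label}: about 10^{math.log10(n):.1f} bytes")
```

Computing 24 hours exactly gives 120 GB per day rather than the rounded 100 GB, which leaves the yearly totals in the same range.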

Allan Folz said...

Instead, it sends computer disks containing the data, via FedEx. “It sounds like an analog solution in a digital age,”

Ha, reminds me of the quip from Andrew Tanenbaum, "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

Mariano Chouza said...

Seeing the glass half full, we will be able to store the DNA sequences of everyone in just a row of cabinets:

steve hsu said...

Antti, thanks for the compliments :-)

This paper might be of interest:

Note though that in terms of pure entropy humans and other biological creatures are not so exceptional. I think perhaps you might be interested in measures of complexity or computational / logical depth. Evolved creatures tend to have great computational depth, as opposed to entropy.

