My impression is that the limiting factor at the moment is the human brainpower necessary to understand the idiosyncrasies of the particular problem and, simultaneously, to develop the appropriate algorithms. There are simply not enough people around who are good at this, and it's not just a matter of algorithms: you need insight into the specific situation. Of equal importance, the (usually non-technical) decision makers who have to act on the data need some rough grasp of the strengths and limitations of the methods, so that they aren't forced to treat the results as the output of a black box.
To give you my little example of big data: on my desk (in Oakland, not in Eugene) I have stacks of terabyte drives with copies of essentially every Windows executable (i.e., a program that runs on some flavor of Windows) that has appeared on the web in the past few years. About 5 percent of these are malware, and for each executable we also store a record of what it does once installed. Gathering this data was only modestly hard; analyzing it in a meaningful way is a lot harder!
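To make the "analyzing it" part concrete, here is a minimal sketch of the kind of classification pipeline one might start from, assuming static features (imported API calls, section entropy, file size, and so on) have already been extracted for each executable. The synthetic feature matrix and the random-forest model below are illustrative assumptions, not a description of our actual system.

    # Illustrative sketch only: assume static features (imported API calls,
    # section entropy, file size, ...) have already been extracted into a
    # matrix X, with labels y (1 = malware, 0 = benign). The data here are
    # synthetic stand-ins.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 64))            # 10,000 executables, 64 features
    y = (rng.random(10_000) < 0.05).astype(int)  # ~5% malware, as above

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # With only ~5% positives, class_weight="balanced" keeps the model from
    # simply predicting "benign" for everything.
    clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                 random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

On synthetic noise the classifier will of course find nothing; the point is only the shape of the pipeline. The hard part, as I said above, is deciding what the features should be, which takes exactly the kind of problem-specific insight that is in short supply.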
NY Review of Books: On a remote edge of Utah's dry and arid high desert, where temperatures often zoom past 100 degrees, hard-hatted construction workers with top-secret clearances are preparing to build what may become America's equivalent of Jorge Luis Borges's "Library of Babel," a place where the collection of information is both infinite and at the same time monstrous, where the entire world's knowledge is stored, but not a single word is understood. At a million square feet, the mammoth $2 billion structure will be one-third larger than the US Capitol and will use the same amount of energy as every house in Salt Lake City combined.
Unlike Borges's "labyrinth of letters," this library expects few visitors. It's being built by the ultra-secret National Security Agency—which is primarily responsible for "signals intelligence," the collection and analysis of various forms of communication—to house trillions of phone calls, e-mail messages, and data trails: Web searches, parking receipts, bookstore visits, and other digital "pocket litter." Lacking adequate space and power at its city-sized Fort Meade, Maryland, headquarters, the NSA is also completing work on another data archive, this one in San Antonio, Texas, which will be nearly the size of the Alamodome.
Just how much information will be stored in these windowless cybertemples? A clue comes from a recent report prepared by the MITRE Corporation, a Pentagon think tank. "As the sensors associated with the various surveillance missions improve," says the report, referring to a variety of technical collection methods, "the data volumes are increasing with a projection that sensor data volume could potentially increase to the level of Yottabytes (10^24 Bytes) by 2015."[1] Roughly equal to about a septillion (1,000,000,000,000,000,000,000,000) pages of text, numbers beyond Yottabytes haven't yet been named. Once vacuumed up and stored in these near-infinite "libraries," the data are then analyzed by powerful infoweapons, supercomputers running complex algorithmic programs, to determine who among us may be—or may one day become—a terrorist. In the NSA's world of automated surveillance on steroids, every bit has a history and every keystroke tells a story.
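An aside on that scale, relating the projection to the terabyte drives on my desk above; decimal units (1 TB = 10^12 bytes) are assumed:

    # Back-of-the-envelope: a yottabyte expressed in one-terabyte drives.
    yottabyte = 10**24  # bytes
    terabyte = 10**12   # bytes (decimal definition)
    print(f"{yottabyte // terabyte:,} one-terabyte drives")  # 1,000,000,000,000

That is, roughly a trillion of them.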
... Where does all this leave us? Aid concludes that the biggest problem facing the agency is not the fact that it's drowning in untranslated, indecipherable, and mostly unusable data, problems that the troubled new modernization plan, Turbulence, is supposed to eventually fix. "These problems may, in fact, be the tip of the iceberg," he writes. Instead, what the agency needs most, Aid says, is more power. But the type of power to which he is referring is the kind that comes from electrical substations, not statutes. "As strange as it may sound," he writes, "one of the most urgent problems facing NSA is a severe shortage of electrical power." With supercomputers measured by the acre and estimated $70 million annual electricity bills for its headquarters, the agency has begun browning out, which is the reason for locating its new data centers in Utah and Texas. And as it pleads for more money to construct newer and bigger power generators, Aid notes, Congress is balking.
The issue is critical because at the NSA, electrical power is political power. In its top-secret world, the coin of the realm is the kilowatt. More electrical power ensures bigger data centers. Bigger data centers, in turn, generate a need for more access to phone calls and e-mail and, conversely, less privacy. The more data that comes in, the more reports flow out. And the more reports that flow out, the more political power for the agency.
Rather than give the NSA more money for more power—electrical and political—some have instead suggested just pulling the plug. "NSA can point to things they have obtained that have been useful," Aid quotes former senior State Department official Herbert Levin, a longtime customer of the agency, "but whether they're worth the billions that are spent, is a genuine question in my mind."
Based on the NSA's history of often being on the wrong end of a surprise and a tendency to mistakenly get the country into, rather than out of, wars, it seems to have a rather disastrous cost-benefit ratio. Were it a corporation, it would likely have gone belly-up years ago. The September 11 attacks are a case in point. For more than a year and a half the NSA was eavesdropping on two of the lead hijackers, knowing they had been sent by bin Laden, while they were in the US preparing for the attacks. The terrorists even chose as their command center a motel in Laurel, Maryland, almost within eyesight of the director's office. Yet the agency never once sought an easy-to-obtain FISA warrant to pinpoint their locations, or even informed the CIA or FBI of their presence.
But pulling the plug, or even allowing the lights to dim, seems unlikely given President Obama's hawkish policies in Afghanistan. However, if the war there turns out to be the train wreck many predict, then Obama may decide to take a much closer look at the spy world's most lavish spender. It is a prospect that has some in the Library of Babel very nervous. "It was a great ride while it lasted," said one.
The GPGPU future means you don't *even* really need to be that good at it. With just a little expertise built into your system, you'll be able to brute force a lot of problems. Can't even get someone in authority to sign off on that. Solutions have to be deterministic, not speculative.
ReplyDelete"...Yottabytes (1024 Bytes)..."
ReplyDeleteI think he means 10^24 bytes.
I'd also put in that there is probably a hell of a lot of good work physicists can do in this domain. Not because their prior tools or experience are helpful (though dealing with the LHC data set can't hurt!), but because it's exactly the kind of field that can be advanced by a good combination of theory and hard-nosed practicality. In fact, if I were a physics Ph.D. I'd probably think seriously about making sure I was working in a field that had exposure to large data sets and the algorithmic challenges therein.
Today's NYTimes says the same thing -- Big Data rules!
http://www.nytimes.com/2009/10/12/technology/12data.html?hpw