Information Processing: 03/2007

Saturday, March 31, 2007

Letter to a former student

Here's some advice I wrote to a former student who is headed into software development. Can anyone add to or improve my comments?

1) software development in general

The Mythical Man Month (overrated, in my view, but everyone in the industry has read it)

Joel on Software, by Joel Spolsky (a successful entrepreneur who has a big following in the developer community; he has an extensive web site)

Paul Graham (a CS PhD who writes about software and startups; very opinionated; check out his web site)

2) algorithms

The standard text is Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein, but for an informal introduction to some of the best stuff in CS try The Turing Omnibus. For security and crypto, try Applied Cryptography by Schneier. Knuth's volumes of The Art of Computer Programming are probably mainly of academic interest, but you might enjoy a look.

3) general

Anyone in the working world should read How To Win Friends and Influence People by Dale Carnegie -- I swear by it. There is an outline of the whole book on my blog somewhere :-)

4) miscellaneous comments

Unix shell tools are amazingly powerful and will serve you time and again -- grep, piping, awk, sed, emacs, etc. Many Windows programmers are unfamiliar with these and are blown away by their power to do quick and dirty stuff.

The more you know the more effective you become -- experienced programmers think a lot before they code, and they tend to reuse existing code or libraries.

Don't forget to think like a physicist -- get to the heart of the problem before starting in any direction; make quick and dirty models and test them out; test your assumptions throughout the process by checking against the accumulating evidence. After you have a feel for the problem spend some time generalizing about or abstracting from what you've done.

Thursday, March 29, 2007

I'm trying to understand how the subprime mortgage mess is going to unwind. Below are links to two recent articles I found useful.

1) WSJ on leading subprime lender New Century and how its bankruptcy was driven by decisions on Wall St. by firms like Citi, JP Morgan and Merrill that both lend to mortgage issuers and repackage their loans for resale as CMOs.

2) The Economist's remarkably sanguine summary of the situation. (Some useful excerpts below -- the first hard analysis numbers I've yet seen.)

America's residential mortgage market is huge. It consists of some $10 trillion worth of loans, of which around 75% are repackaged into securities, mainly by the government-sponsored mortgage giants, Fannie Mae and Freddie Mac. Most of this market involves little risk. Two-thirds of mortgage borrowers enjoy good credit and a fixed interest rate and can depend on the value of their houses remaining far higher than their borrowings. But a growing minority of loans look very different, with weak borrowers, adjustable rates and little, or no, cushion of home equity.

For a decade, the fastest growth in America's mortgage markets has been at the bottom. Subprime borrowers—long shut out of home ownership—now account for one in five new mortgages and 10% of all mortgage debt, thanks to the expansion of mortgage-backed securities (and derivatives based on them). Low short-term interest rates earlier this decade led to a bonanza in adjustable-rate mortgages (ARMs). Ever more exotic products were dreamt up, including “teaser” loans with an introductory period of interest rates as low as 1%.

When the housing market began to slow, lenders pepped up the pace of sales by dramatically loosening credit standards, lending more against each property and cutting the need for documentation. Wall Street cheered them on. Investors were hungry for high-yielding assets and banks and brokers could earn fat fees by pooling and slicing the risks in these loans.

Standards fell furthest at the bottom of the credit ladder: subprime mortgages and those one rung higher, known as Alt-As. A recent report by analysts at Credit Suisse estimates that 80% of subprime loans made in 2006 included low “teaser” rates; almost eight out of ten Alt-A loans were “liar loans”, based on little or no documentation; loan-to-value ratios were often over 90% with a second piggyback loan routinely thrown in. America's weakest borrowers, in short, were often able to buy a house without handing over a penny.

Lenders got the demand for loans that they wanted—and more fool them. Amid the continuing boom, some 40% of all originations last year were subprime or Alt-A. But as these mortgages were reset to higher rates and borrowers who had lied about their income failed to pay up, the trap was sprung. A new study by Christopher Cagan, an economist at First American CoreLogic, based on his firm's database of most American mortgages, calculates that 60% of all adjustable-rate loans made since 2004 will be reset to payments that will be 25% higher or more. A fifth will see monthly payments soar by 50% or more.

...Mr Cagan marries the statistics and concludes that—going by today's prices—some 1.1m mortgages (or 13% of all adjustable-rate mortgages originated between 2004 and 2006), worth $326 billion, are heading for repossession in the next few years. The suffering will be concentrated: only 7% of mainstream adjustable mortgages will be affected, whereas one in three of the recent “teaser” loans will end in default. The harshest year will be 2008, when many mortgages will be reset and few borrowers will have much equity.

Mr Cagan's study considers only the effect of higher payments (ignoring defaults from job loss, divorce, and so on). But it is a guide to how much default rates may worsen even if the economy stays strong and house prices stabilise. According to RealtyTrac, some 1.3m homes were in default on their mortgages in 2006, up 42% from the year before. This study suggests that figure could rise much further. And if house prices fall, the picture darkens. Mr Cagan's work suggests that every percentage point drop in house prices would bring 70,000 extra repossessions.

The direct damage to Wall Street is likely to be modest. A repossessed property will eventually be sold, albeit at a discount. As a result, Mr Cagan's estimate of $326 billion of repossessed mortgages translates into roughly $112 billion of losses, spread over several years. Even a loss several times larger than that would barely ruffle America's vast financial markets: about $600 billion was wiped out on the stockmarkets as share prices fell on February 27th.

In theory, the chopping up and selling on of risk should spread the pain. The losses ought to be manageable even for banks such as HSBC and Wells Fargo, the two biggest subprime mortgage lenders, and Bear Stearns, Wall Street's largest underwriter of mortgage-backed securities. Subprime mortgages make up only a small part of their business. Indeed, banks so far smell an opportunity to buy the assets of imploding subprime lenders on the cheap.
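As a back-of-the-envelope check on the figures quoted in the excerpt above, here is a small Python sketch that derives the implied loss severity, the average size of the affected loans, and the size of the 2004-2006 adjustable-rate pool. All inputs are numbers from the excerpt; the variable names and the linear extrapolation for falling house prices are assumptions of this sketch, not part of Mr Cagan's study.

```python
# Figures quoted in the Economist excerpt.
repossessed_loans = 1.1e6        # mortgages heading for repossession
repossessed_value = 326e9        # dollars of mortgages behind those loans
share_of_arms = 0.13             # fraction of ARMs originated 2004-2006
expected_losses = 112e9          # dollars, spread over several years
extra_repos_per_point = 70_000   # per percentage-point fall in house prices
one_day_stock_loss = 600e9       # dollars wiped off share prices on February 27th

# Fraction of the repossessed balance that is never recovered.
loss_severity = expected_losses / repossessed_value

# Average balance of a loan heading for repossession.
average_loan = repossessed_value / repossessed_loans

# Implied number of adjustable-rate mortgages originated 2004-2006.
arm_pool = repossessed_loans / share_of_arms


def repossessions(price_drop_points):
    """Repossessions after a house price fall, ASSUMING the quoted
    70,000-per-point sensitivity stays linear (a simplification)."""
    return repossessed_loans + extra_repos_per_point * price_drop_points


print(f"Implied loss severity: {loss_severity:.0%}")
print(f"Average loan size: ${average_loan:,.0f}")
print(f"Implied 2004-2006 ARM pool: {arm_pool / 1e6:.1f}m loans")
print(f"Repossessions after a 10-point price fall: {repossessions(10) / 1e6:.1f}m")
print(f"Losses relative to the one-day stock drop: {expected_losses / one_day_stock_loss:.0%}")
```

The loss severity is the number to watch: the $112 billion estimate assumes that roughly two-thirds of each repossessed balance is eventually recovered at sale, so a weaker recovery rate scales the losses directly.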

Some open questions (experts please help!):

1) Effect on housing bubble: how much of the recent bubble was driven specifically by increased availability of credit (as opposed to the usual irrational exuberance or speculation)?

2) How much of bad mortgage debt is insured by CDO derivatives? Who is on the hook? Selling this kind of insurance was reportedly a popular income strategy for hedge funds.

3) Is $100B of mortgage-related losses over several years a big number or a small one? Will anyone blow up (CMO insurers)? Who is holding the riskiest CMO tranches?

4) If $100B over several years is chump change, what are the chances of contagion still leading to a housing bust, credit crunch and recession?

Earnings of big Wall St. banks will be negatively impacted, but some smart guys are surely buying up these loans on the cheap, as markets overreact in the negative direction.

Tuesday, March 27, 2007

Proton pumps: modular, swappable genetic units

Wow! It's the modularity that is amazing. Swap a bit of DNA and suddenly you have bacteria that can harvest energy from light! (Via GNXP.)

Technology Review: Some bacteria, such as cyanobacteria, use photosynthesis to make sugars, just as plants do. But others have a newly discovered ability to harvest light through a different mechanism: using light-activated proteins known as proteorhodopsins, which are similar to proteins found in our retinas. When the protein is bound to a light-sensitive molecule called retinal and hit with light, it pumps positively charged protons across the cell membrane. That creates an electrical gradient that acts as a source of energy, much like the voltage, or electromotive force, supplied by batteries.

First discovered in marine organisms in 2000, scientists recently found that the genes for the proteorhodopsin system--essentially a genetic module that includes the genes that code for both the protein and the enzymes required to produce retinal--are frequently swapped among different microorganisms in the ocean. (While we usually think of genes being passed from parent to offspring, microorganisms can exchange bits of DNA laterally.)

Intrigued by the prospect that a single piece of DNA is really all an organism needs to harvest energy from light, the researchers inserted it into E. coli. They found that the microorganisms synthesized all the necessary components and assembled them in the cell membrane, using the system to generate energy. "All it takes to derive energy from sunlight is that bit of DNA," says Ed Delong, professor of biological engineering at MIT and author of the study. The results were published last week in the Proceedings of the National Academy of Sciences.

The findings have implications for both marine ecology and for synthetic biology, an emerging field that aims to design and build new life forms that can perform useful functions. Giant genomic studies of the ocean have found that the rhodopsin system is surprisingly widespread. The fact that a single gene transfer can result in an entirely new functionality helps explain how this genetic module traveled so widely. In fact for microbes, this kind of module swapping may be the rule rather than the exception. "A new paradigm is emerging in microbiology: [microorganisms] are much more fluid than we thought," says Ford Doolittle, Canada Research Chair in comparative genomics at Dalhousie University, in Nova Scotia.

Monday, March 26, 2007

Income inequality: Manhattan toddlers

From the Times, this story tells a lot about what's happening in Manhattan. My friends there say it's very kid-friendly these days, with crime way down from a few decades ago.

Given how the hedonic treadmill works, I can't imagine living in Manhattan if I were a Columbia or NYU professor. Who wants to be the poorest family in the neighborhood? ;-) One of the families in the article, in which the father is a management consultant, says they won't be able to afford the Upper West Side once their kids each need a bedroom of their own.

The analysis shows that Manhattan’s 35,000 or so white non-Hispanic toddlers are being raised by parents whose median income was $284,208 a year in 2005, which means they are growing up in wealthier households than similar youngsters in any other large county in the country.

Among white families with toddlers, San Francisco ranked second, with a median income of $150,763, followed by Somerset, N.J. ($136,807); San Jose, Calif. ($134,668); Fairfield, Conn. ($132,427); and Westchester ($122,240).

Median household income of families with children ages 0 to 4. (Left is all ethnic groups, right is non-Hispanic whites only.)

For more on income inequality, including the interesting observation that it is primarily driven by financiers and tech entrepreneurs (third link), see here.

Saturday, March 24, 2007

The Mechanical Turk and Searle's Chinese Room

The Times has an article about Jeff Bezos' Mechanical Turk project, which lets machines outsource certain tasks to humans. (The original Mechanical Turk was an 18th century hoax in which a hidden human operated a chess-playing automaton.) As Bezos describes,

“Normally, a human makes a request of a computer, and the computer does the computation of the task,” he said. “But artificial artificial intelligences like Mechanical Turk invert all that. The computer has a task that is easy for a human but extraordinarily hard for the computer. So instead of calling a computer service to perform the function, it calls a human.”

...The company opened Mechanical Turk as a public site in November 2005. Today, there are more than 100,000 “Turk Workers” in more than 100 countries who earn micropayments in exchange for completing a wide range of quick tasks called HITs, for human intelligence tasks, for various companies.

The Times writer, Jason Pontin (who is also editor and publisher of MIT's Technology Review), gives Turk working a try and finds it disorienting:

What is it like to be an individual component of these digital, collective minds?

To find out, I experimented. After registering at www.mturk.com, I was confronted with a table of HITs that I could perform, together with the price that I would be paid. I first accepted a job from ContentSpooling.net that asked me to write three titles for an article about annuities and their use in retirement planning. Then I viewed a series of images apparently captured from a vehicle moving through the gray suburbs of North London, and, at the request of Geospatial Vision, a division of the British technology company Oxford Metrics Group, identified objects like road signs and markings.

For all this, my Amazon account was credited the lordly sum of 12 cents. The entire experience lasted no more than 15 minutes, and from my point of view, as an occluded part of the hive-mind, it made no sense at all.

This is reminiscent of philosopher John Searle's thought experiment called the Chinese Room, in which he posits a large team of humans implementing an algorithm that translates Chinese to English. Since each human performs only a small task (e.g., sorting according to a rule set), none has any understanding of the overall process. Searle asks where, exactly, the understanding of Chinese and English resides in this device. Searle considered his thought experiment as evidence against strong AI, whereas I just consider Searle to be confused. It's obvious that a Turk worker might be a small cog in some larger process that "understands" the world and processes information in a useful way. This depends not at all on what the little cog understands or does not understand.

Thursday, March 22, 2007

Academic impact

Berkeley PhD student Jo Guldi describes how Google Book Search is revolutionizing her historical research.

Google Book Search is a relatively recent phenomenon... six months ago, right? About six months ago I was pottering around there, finding a few illustrated nineteenth-century texts, a lot of contemporary books for sale, and not much of too much interest. Six months turns out to be a long time in book land. In that period of time, Book Search has accomplished enough to transform the academic profession.

I was idly trying a search on "roads" to see what sort of a literature would turn up for the period of my dissertation research, 1740-1850. I didn't expect much. I've spent the last two years wandering through the Yale, Harvard, and California libraries, the British Library, Britain's National Archives, and the immense reserves of North American Inter Library Loan reading every book on London, pavement, or travel I could get my hands on.

Surprise. In a single idle search I just added twenty extra full-text books to my list.

Which are, by the way, full-text searchable --

-- and subject to word-count analysis --

-- and replete with full illustrations --

-- and instantly digestible into visuals for powerpoint presentations.

Hallelujah, GoogleBooks. And holy mackerel! Good work.

By now, the first half of the nineteenth century exists in a very complete form on Google Books. In the last six months, while academic history has meandered in its habituated paths of grinding research, the possibilities of scholarship have been utterly transformed.

...

What this signals, by the way, is the opportunity for a new age of scholarship. Cultural and image analysis used to be painfully time-consuming, heavy lifting, involving rare kinds of access, full fellowships, immense travel, and long waits for delicate books. Comparison between different cultural sources was even harder, placing absurd demands on the cultural historian's personal memory and note-taking skills. Cultural historians, despite their many skills, stood second in depth of research on any particular topic to political historians, for whom one visit to a Parliamentary archive and one visit to a personal residence outfitted them with every last detail of historical change. Now all that is changing. Comparing a hundred images is no longer a problem for a year's labor in an out-of-the-way museum reading room. Comparing a hundred personal accounts from working men is no longer a task to eat up a social historian's entire year.

Wednesday, March 21, 2007

Eric Schmidt interview

Interesting interview (podcast) with Google CEO Eric Schmidt.

On the infrastructure barrier of entry to compete with Google:

Q: Ray Ozzie recently made some comments about Microsoft's data center construction being a big competitive advantage. I wanted to see how you think what Google is doing is different from what their competitors are doing with their data centers.

Eric: Google, by any measure, is much more capital-intensive than our competitors, and we’re much further along in the build-out of data centers. So without commenting on our competitor’s quote and positioning, let me say that in an internet market where you deliver your services by computers with spinning disks, we have a competitive advantage... because we have the cheapest and most scalable such architecture. We hope that in the course of innovation we will be able to build products which are almost impossible for our competitors to replicate, because we simply learn how to implement them at scale. The internet is a scaled business, and running large internet scale businesses is very, very difficult. All the challenges – staying up, dealing with spammers, dealing with other access problems – and we believe we do it best in the world.

Q: What are some of those big areas of capital expenditures today... kinda how do you see those going forward?

Eric: Well, virtually all the capital goes into the expenses that are associated with running data centers. And that literally are the computers, and the disks, and the appropriate networking hardware, and literally the buildings now that house them. It used to be, right after the bubble burst, that you could go and purchase these inexpensive data centers that had been written off by people who overbuilt. But those data centers are long gone, and that’s one of our primary focuses.

On entrepreneurship:

Q: Talking before, you’ve spoken glowingly about Larry and Sergey, and also Chad and Steve from YouTube. What are the characteristics of very successful entrepreneurs?

Eric: I think the most important characteristic of an entrepreneur is that they’re going to do it whether you give them permission or not. They are motivated by something inside of them. It’s not something that can be taught... they feel it, they want it, they’re driven to it. And when you find such a person, they’re usually a pretty good person to hang out with. They’re gonna do some interesting things. In Larry and Sergey’s case, and more recently with YouTube with Steve and Chad, they were going to be successful, and you can tell when you talk to them. Everything else is a tactic compared to that passion and vision that an entrepreneur has. It’s relatively rare, and it’s important to identify and respect it. (...)

Q: Eric, you’ve been a leader in the Valley for a long time now. What strategy is different today than in the past about the areas of innovation and entrepreneurship?

Eric: Yeah, it’s funny that every generation thinks that they’ve invented cold fusion... but every previous generation did too. And one of the rules about innovation is that it’s a constant process. The innovators today are not different – they’re just innovators now, as opposed to innovators from 10 years ago, or probably 100 years ago. In watching old BBC movies about England in the late 1800s/early 1900s, you have to imagine what it meant to be an innovator, when roads and gaslights and electricity and so forth were being invented. It’s the same person, it’s the same type, it’s the same economic sense of leverage, it’s the same passion that drives innovators. The Valley is fortunate that these cycles are almost perpetual here, and innovators are drawn to both the culture, and I think the good weather...

Partial transcript found here.

Monday, March 19, 2007

Barbados pix

Some photos from Barbados, regrettably taken with my wimpy cellphone camera. The first two are from the beach near Bellairs Research Institute (where we had the workshop) and the third is Jonathan Oppenheim of Cambridge lecturing on black hole information.

I also added a fourth photo of the whole group, taken with Charles Bennett's non-wimpy camera. Charles is a talented photographer as well as information theorist :-)

Saturday, March 10, 2007

Barbados, here I come!

I'm off to a workshop on black holes and quantum information. Although I'll have network access you might not hear much from me for the next week or so :-)

Bellairs workshop schedule.

Update: As predicted, I haven't had much time to post about this meeting. It's been a wonderful chance to talk to quantum information theorists while enjoying the Caribbean. I don't know whether the high point has been the excellent talks (which have often gone 2 hours, due to the informal atmosphere), getting to meet well-known theorists like Don Page (a relativist whose papers I first read while in grad school) or Charlie Bennett (IBM Fellow and co-inventor of quantum teleportation), or swimming with turtles yesterday. I'll post some photos when I get back this weekend.

Machine intelligence

We just completed some head-to-head testing of Robot Genius' web crawl data. We've built a fully automated process that downloads every Windows executable on the web (terabytes of data, now in the can), installs it on an analysis machine, and determines whether it is malware.

We compared our results for 8000 executable URLs (something our farm can do in much less than a day) against the databases of two public security companies. The first is a leader in the web protection space, and the second is one of the 3 largest antivirus and desktop security vendors. The results of the second head-to-head comparison are below (in the first comparison the public company did even worse).

Conclusion: the Robot Genius defeats teams of hundreds of security engineers located in multiple countries.

Summary statistics:

Caught by RG: 100
Caught by leading AV company: 96

False negatives by RG: 4 (6 if you include Alexa toolbar, which we do not)
False negatives by leading AV company: 58

False positives by RG: 2, due to a dumb mistake that we have fixed
False positives in leading AV company data: 52 (50 if you consider Alexa toolbar bad, which we do not)
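For readers who prefer precision and recall, the summary statistics above can be converted with a few lines of Python. This is a sketch under one assumption that the post does not state: that "caught" counts everything a tool flagged, false positives included. Under that reading both rows imply the same number of truly malicious samples, which is some evidence the reading is right. The function and key names are made up for illustration, and the Alexa toolbar is treated as benign, as in the post.

```python
def detection_metrics(flagged, false_positives, false_negatives):
    """Derive precision and recall from the three reported counts.

    ASSUMPTION: `flagged` ("caught") includes the false positives.
    """
    true_positives = flagged - false_positives
    malicious = true_positives + false_negatives  # ground-truth malware count
    return {
        "true_positives": true_positives,
        "malicious": malicious,
        "precision": true_positives / flagged,   # flagged items that were bad
        "recall": true_positives / malicious,    # bad items that were flagged
    }


# Counts from the summary statistics (Alexa toolbar treated as benign).
rg = detection_metrics(flagged=100, false_positives=2, false_negatives=4)
av = detection_metrics(flagged=96, false_positives=52, false_negatives=58)

for name, m in (("RG", rg), ("Leading AV", av)):
    print(f"{name}: precision {m['precision']:.1%}, recall {m['recall']:.1%}, "
          f"malicious samples implied {m['malicious']}")
```

Both sets of counts imply 102 malicious samples, so the two tools are being scored against the same ground truth.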

Tuesday, March 06, 2007

Hedge fund intelligence

It's a hedge fund universe: over a trillion dollars under management now by the largest US funds. Thanks to an intrepid correspondent for the data :-)

Hedge Fund Intelligence: As money continued to flow into alternatives last year, combined assets at the largest U.S. hedge funds finally crossed the trillion-dollar mark. The top three firms each exceeded $30 billion in assets for the first time, with JPMorgan Asset Management unseating Goldman Sachs Asset Management as the largest hedge fund platform, with $34 billion. Among the top tier, Renaissance Technologies joins the top 10 for the first time, and Cerberus Capital Management rejoins the list after a three-year hiatus.

The Absolute Return Billion Dollar Club, our biannual survey of U.S. hedge funds, shows that 241 firms, each managing more than $1 billion, held a combined total of nearly $1.2 trillion as of January 1. That is about $215 billion more than the top 218 firms were managing this past summer and $347 billion more than the top 207 firms were running at the beginning of last year. And the huge increases came during a year when hedge fund closures, including that of $9.1 billion Amaranth Advisors, erased $35 billion from the market. (See "Poof! $35 billion gone from hedge funds," p14.)

I originally posted more information, including a long list of >$4B funds, but was asked by Hedge Fund Intelligence to remove most of it.

Friday, March 02, 2007

AI and Google

Here is a little video snippet of Larry Page talking about AI at the recent AAAS meeting. He points out that our genetic information is about 600MB compressed, so smaller than any modern operating system. The connection between compression and AI has been made by many people -- what is intelligence, after all, if not an algorithm capable of taking in data (observations about the world) and turning it into predictions about what will happen next? Prediction (or, equivalently, modeling) is nothing more than compression! Newton's laws plus some initial data compress all the information about the trajectory of a spaceship -- trading bits of stored information (the spacetime coordinates of the trajectory) for CPU flops (necessary to uncompress the trajectory from the initial position and velocity -- i.e., evolve it forward in time).
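The trade of stored bits for CPU flops can be made concrete with a toy example. The Python sketch below is only an illustration: it uses a one-dimensional body under constant acceleration rather than a real spaceship, and the function names, time step, and initial conditions are all assumptions chosen for the demonstration. It recovers the three model parameters from the first three observations and then regenerates the whole trajectory from them.

```python
def positions(x0, v0, g, dt, steps):
    """Closed-form positions of a body under constant acceleration g."""
    return [x0 + v0 * (k * dt) + 0.5 * g * (k * dt) ** 2 for k in range(steps)]


def fit(first_three, dt):
    """Recover (x0, v0, g) from three consecutive samples by finite differences."""
    p0, p1, p2 = first_three
    g = (p2 - 2.0 * p1 + p0) / dt ** 2
    v0 = (p1 - p0) / dt - 0.5 * g * dt
    return p0, v0, g


DT, STEPS = 0.1, 100

# "Uncompressed" data: every observed position stored explicitly.
observed = positions(0.0, 20.0, -9.81, DT, STEPS)

# "Compressed" data: the law of motion plus three fitted numbers.
model = fit(observed[:3], DT)

# "Decompression" costs computation: evolve the model forward in time.
predicted = positions(*model, DT, STEPS)

max_error = max(abs(a - b) for a, b in zip(observed, predicted))
print(f"{len(observed)} stored numbers replaced by {len(model)} parameters; "
      f"largest reconstruction error {max_error:.2e}")
```

The same three numbers also predict positions beyond the observed window, which is the sense in which a good model and a good compressor are the same thing.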

Page guesses that AI will result more from "lots of computation" than from "whiteboard stuff" (i.e., we won't really "understand" how it happens from a theoretical or analytical perspective) and that "we aren't as far off as many people think"!

A lot of people like to speculate that Google is working like mad on AI, and indeed certain related problems like machine translation, pattern recognition and, of course, search, are things they devote a lot of resources to. However, the vast majority of their 10,000 employees (yes, the number really has been doubling every year for a while) are working on just keeping the existing services up and running. There isn't yet a blue-sky, Bell Labs-like research arm at Google.

See here for previous related posts. I have a bet with a former PhD student about machines passing a strong version of the Turing test in the next 50 years.
