Wednesday, August 27, 2014

Neural Networks and Deep Learning 2

Inspired by the topics discussed in this earlier post, I've been reading Michael Nielsen's online book on neural nets and deep learning. I particularly liked the subsection quoted below. For people who think deep learning is anything close to a solved problem, or anticipate a near-term, quick take-off to the Singularity, I suggest they read the passage below and grok it deeply.
Neural Networks and Deep Learning (Chapter 3):

You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well. -- Question and answer with neural networks researcher Yann LeCun

Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.

You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?

In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.

[ Sufficiently advanced AI will come to resemble biology, even psychology, in its complexity and resistance to rigorous generalization ... ]

One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works (from ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012): "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.
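The mechanics behind the dropout story quoted above are simple to sketch. Here is a minimal NumPy illustration of "inverted" dropout (one common formulation, not necessarily the exact variant used in the Krizhevsky et al. paper): each unit is zeroed with probability p, and survivors are rescaled so the expected activation is unchanged, which is why nothing special is needed at test time.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, rng=None):
    """Inverted dropout: zero each unit with probability p_drop,
    then scale survivors by 1/(1 - p_drop) so the expected
    activation matches the no-dropout (test-time) network."""
    rng = np.random.default_rng(rng)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

# During training, a different random subset of units survives on
# every forward pass, so no unit can rely on any particular partner.
a = np.ones((4, 10))
dropped = dropout_forward(a, p_drop=0.5, rng=0)
# At test time dropout is simply disabled (identity function).
```

Because each forward pass samples a fresh mask, a neuron is trained against many random subsets of its peers, which is precisely the "reduced co-adaptation" heuristic the quoted passage describes.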

Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
See also here from an earlier post on this blog:
... evolution has [ encoded the results of a huge environment-dependent optimization ] in the structure of our brains (and genes), a process that AI would have to somehow replicate. A very crude estimate of the amount of computational power used by nature in this process leads to a pessimistic prognosis for AI even if one is willing to extrapolate Moore's Law well into the future. [ Moore's Law (Dennard scaling) may be toast for the next decade or so! ] Most naive analyses of AI and computational power only ask what is required to simulate a human brain, but do not ask what is required to evolve one. I would guess that our best hope is to cheat by using what nature has already given us -- emulating the human brain as much as possible.
If indeed there are good (deep) generalized learning architectures to be discovered, that will take time. Even with such a learning architecture at hand, training it will require interaction with a rich exterior world -- either the real world (via sensors and appendages capable of manipulation) or a computationally expensive virtual world. Either way, I feel confident in my bet that a strong version of the Turing test (allowing, e.g., me to communicate with the counterpart over weeks or months; to try to teach it things like physics and watch its progress; eventually for it to teach me) won't be passed until at least 2050 and probably well beyond.

Turing as polymath: ... In a similar way Turing found a home in Cambridge mathematical culture, yet did not belong entirely to it. The division between 'pure' and 'applied' mathematics was at Cambridge then as now very strong, but Turing ignored it, and he never showed mathematical parochialism. If anything, it was the attitude of a Russell that he acquired, assuming that mastery of so difficult a subject granted the right to invade others.


David Coughlin said...

This is a realization that I judge people by: "Of course, there is not time for any single person to investigate all these heuristic explanations in depth." That says to me that they have become intellectual grown-ups.

Richard Seiter said...

Great quote: "When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking."

Glad to see him mention the end of Dennard scaling. So many discussions of Moore's law leave that out. I don't agree with equating them though. The key distinction is that even if we can maintain feature size reduction (Moore's law), not being able to maintain the power/speed improvements we got for free for decades (Dennard scaling) dramatically changes the observed benefit. Another issue is that even if Moore's law continues to hold we seem to have entered a period of diminished ROI on decreasing process size.

The Turing test is probably going to go through a long period where it is redefined as we meet milestones. This is similar to what we saw with computer chess (where the different steps took decades). An interesting question is what levels enable what capabilities. In this vein, has anyone done an analysis of what the trajectory of computer chess progress implies about the variation of human ability? (e.g. linear or not? can we measure a meaningful distance between performance levels? perhaps computer processing power as a metric?)

slocklin said...

I quite enjoyed Schmidhuber's seminar as well, and have been inspired to fool around in Torch7 as a result. Good piece of kit, particularly if you have a decent GPU.

You might also enjoy Bengio & company's book.

Another topic which melted my brain at this seminar is reservoir computing, which was being touted by one of the people there. As it happens, you can build a giant recurrent net and *only fit the output nodes* and it will often work really well. I'll be blogging on this myself at some point.

I have avoided NNs until this seminar for precisely the reason stated above: nobody really seems to know how or why they work. Plus, sitting around and waiting for your net to converge is not a good way to ship product. Gradient Boost and Random Forests generally do pretty well. Heck, even variations on the KNN idea tend to compete well in contests.
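The KNN baseline mentioned above really is competitive for how little machinery it needs. Here is a plain NumPy sketch of k-nearest-neighbours classification (an illustration, not any particular library's implementation): classify each test point by majority vote among its k closest training points.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """k-nearest-neighbours: majority vote among the k closest
    training points under Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest_labels = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest_labels).argmax())
    return np.array(preds)

# Two well-separated clusters: KNN recovers the labels trivially.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.2], [5.1, 5.1]])
labels = knn_predict(X_train, y_train, X_test, k=3)
```

There is no training phase at all, and no convergence to wait for, which is part of the commenter's point about shipping product.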

dxie48 said...

There are noted failures of deep learning.

Not only has deep learning independently discovered internet cats, it has also independently discovered China!
At one stage, Google Translate (which is known to rely on deep learning) translated the Latin phrase
"lorem ipsum ipsum ipsum lorem" as "China is the winner".
Was it that the Vatican had made heavy use of GT for their internal memos? Don't know.
Anyway, there are numerous publicly available random lorem ipsum generators for Latin and Chinese, e.g.
Maybe GT choked on indexing these.

To learn the technical weaknesses of corporations, it is easier just to see what their competitors say, e.g.
Bearing in mind that Google has to serve a larger userbase, some technical shortcuts may have been made.
A striking example is a similar-image visual search for a dog with a bow and a collar. Somehow something
else more interesting was matched to the collar :) Maybe it is a reflection of the Google userbase and their
preferences, which supply the data for the deep learning.
