
Saturday, August 08, 2015

Deep Learning in Nature

When I travel I often carry a stack of issues of Nature and Science to read (and then discard) on the plane.

The article below is a nice review of the current state of the art in deep neural networks. See earlier posts Neural Networks and Deep Learning 1 and 2, and Back to the Deep.
Deep learning
Yann LeCun, Yoshua Bengio, Geoffrey Hinton
Nature 521, 436–444 (28 May 2015) doi:10.1038/nature14539 
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
The article seems to give a somewhat, er, compressed version of the history of the field. See these comments by Schmidhuber:
Machine learning is the science of credit assignment. The machine learning community itself profits from proper credit assignment to its members. The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it). Relatively young research areas such as machine learning should adopt the honor code of mature fields such as mathematics: if you have a new theorem, but use a proof technique similar to somebody else's, you must make this very clear. If you "re-invent" something that was already known, and only later become aware of this, you must at least make it clear later.

As a case in point, let me now comment on a recent article in Nature (2015) about "deep learning" in artificial neural networks (NNs), by LeCun & Bengio & Hinton (LBH for short), three CIFAR-funded collaborators who call themselves the "deep learning conspiracy" (e.g., LeCun, 2015). They heavily cite each other. Unfortunately, however, they fail to credit the pioneers of the field, which originated half a century ago. All references below are taken from the recent deep learning overview (Schmidhuber, 2015), except for a few papers listed beneath this critique focusing on nine items.

1. LBH's survey does not even mention the father of deep learning, Alexey Grigorevich Ivakhnenko, who published the first general, working learning algorithms for deep networks (e.g., Ivakhnenko and Lapa, 1965). A paper from 1971 already described a deep learning net with 8 layers (Ivakhnenko, 1971), trained by a highly cited method still popular in the new millennium. Given a training set of input vectors with corresponding target output vectors, layers of additive and multiplicative neuron-like nodes are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set, where regularisation is used to weed out superfluous nodes. The numbers of layers and nodes per layer can be learned in problem-dependent fashion.

2. LBH discuss the importance and problems of gradient descent-based learning through backpropagation (BP), and cite their own papers on BP, plus a few others, but fail to mention BP's inventors. BP's continuous form was derived in the early 1960s (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969). Dreyfus (1962) published the elegant derivation of BP based on the chain rule only. BP's modern efficient version for discrete sparse networks (including FORTRAN code) was published by Linnainmaa (1970). Dreyfus (1973) used BP to change weights of controllers in proportion to such gradients. By 1980, automatic differentiation could derive BP for any differentiable graph (Speelpenning, 1980). Werbos (1982) published the first application of BP to NNs, extending thoughts in his 1974 thesis (cited by LBH), which did not have Linnainmaa's (1970) modern, efficient form of BP. BP for NNs on computers 10,000 times faster per Dollar than those of the 1960s can yield useful internal representations, as shown by Rumelhart et al. (1986), who also did not cite BP's inventors. [ THERE ARE 9 POINTS IN THIS CRITIQUE ]

... LBH may be backed by the best PR machines of the Western world (Google hired Hinton; Facebook hired LeCun). In the long run, however, historic scientific facts (as evident from the published record) will be stronger than any PR. There is a long tradition of insights into deep learning, and the community as a whole will benefit from appreciating the historical foundations. 
One very striking aspect of the history of deep neural nets, which is acknowledged both by Schmidhuber and LeCun et al., is that the subject was marginal to "mainstream" AI and CS research for a long time, and that new technologies (i.e., GPUs) were crucial to its current flourishing in terms of practical results. The theoretical results, such as they are, appeared decades ago! It is clear that there are many unanswered questions concerning guarantees of optimal solutions, the relative merits of alternative architectures, use of memory networks, etc.

Some additional points:

1. Prevalence of saddle points over local minima in high-dimensional geometries: apparently early researchers were concerned about incomplete optimization of DNNs due to local minima in parameter space. But saddle points are much more common than local minima in high-dimensional spaces, and local minima have turned out not to be a big problem. (A numerical sketch of this appears after these points.)

2. Optimized neural networks are similar in important ways to biological (e.g., monkey) brains! When monkeys and a ConvNet are shown the same pictures, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex. (A schematic sketch of this kind of comparison also appears below.)
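On point 1, here is a minimal numerical illustration (my own toy sketch, not from any of the papers discussed above): treat the Hessian at a random critical point as a random symmetric matrix and count how often all of its eigenvalues are positive, i.e., how often the critical point is a true local minimum rather than a saddle.

import numpy as np

rng = np.random.default_rng(0)

def fraction_of_minima(dim, trials=2000):
    # Fraction of random symmetric "Hessians" with all-positive eigenvalues.
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2.0
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

for d in [1, 2, 5, 10, 20]:
    print(f"dim = {d:2d}   fraction of critical points that are minima ~ {fraction_of_minima(d):.4f}")

The fraction collapses rapidly with dimension: in a high-dimensional parameter space a generic critical point is overwhelmingly likely to be a saddle, which is why poor local minima turned out to be a non-issue in practice.

On point 2, comparisons of this kind are typically done by fitting a cross-validated (regularized) linear map from ConvNet unit activations to recorded neural responses and measuring explained variance on held-out stimuli. A schematic sketch, with synthetic data standing in for both the ConvNet features and the recordings:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_stimuli, n_features, n_neurons = 500, 200, 160   # images, ConvNet units, "IT" neurons

features = rng.standard_normal((n_stimuli, n_features))        # stand-in for ConvNet activations
mapping = 0.1 * rng.standard_normal((n_features, n_neurons))   # unknown linear relation
noise = rng.standard_normal((n_stimuli, n_neurons))            # unexplained neural variability
responses = features @ mapping + noise                         # stand-in for recorded responses

X_tr, X_te, y_tr, y_te = train_test_split(features, responses, test_size=0.25, random_state=0)
model = Ridge(alpha=10.0).fit(X_tr, y_tr)
print("held-out variance explained:", r2_score(y_te, model.predict(X_te)))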

Some comments on the relevance of all this to the quest for human-level AI from an earlier post:
.. evolution has encoded the results of a huge environment-dependent optimization in the structure of our brains (and genes), a process that AI would have to somehow replicate. A very crude estimate of the amount of computational power used by nature in this process leads to a pessimistic prognosis for AI even if one is willing to extrapolate Moore's Law well into the future. [ Moore's Law (Dennard scaling) may be toast for the next decade or so! ] Most naive analyses of AI and computational power only ask what is required to simulate a human brain, but do not ask what is required to evolve one. I would guess that our best hope is to cheat by using what nature has already given us -- emulating the human brain as much as possible.

If indeed there are good (deep) generalized learning architectures to be discovered, that will take time. Even with such a learning architecture at hand, training it will require interaction with a rich exterior world -- either the real world (via sensors and appendages capable of manipulation) or a computationally expensive virtual world. Either way, I feel confident in my bet that a strong version of the Turing test (allowing, e.g., me to communicate with the counterpart over weeks or months; to try to teach it things like physics and watch its progress; eventually for it to teach me) won't be passed until at least 2050 and probably well beyond.
Relevant remarks from Schmidhuber:
[Link] ...Ancient algorithms running on modern hardware can already achieve superhuman results in limited domains, and this trend will accelerate. But current commercial AI algorithms are still missing something fundamental. They are no self-referential general purpose learning algorithms. They improve some system’s performance in a given limited domain, but they are unable to inspect and improve their own learning algorithm. They do not learn the way they learn, and the way they learn the way they learn, and so on (limited only by the fundamental limits of computability). As I wrote in the earlier reply: "I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on this topic." However, additional algorithmic breakthroughs may be necessary to make this a practical reality.
[Link] The world of RNNs is such a big world because RNNs (the deepest of all NNs) are general computers, and because efficient computing hardware in general is becoming more and more RNN-like, as dictated by physics: lots of processors connected through many short and few long wires. It does not take a genius to predict that in the near future, both supervised learning RNNs and reinforcement learning RNNs will be greatly scaled up. Current large, supervised LSTM RNNs have on the order of a billion connections; soon that will be a trillion, at the same price. (Human brains have maybe a thousand trillion, much slower, connections - to match this economically may require another decade of hardware development or so). In the supervised learning department, many tasks in natural language processing, speech recognition, automatic video analysis and combinations of all three will perhaps soon become trivial through large RNNs (the vision part augmented by CNN front-ends). The commercially less advanced but more general reinforcement learning department will see significant progress in RNN-driven adaptive robots in partially observable environments. Perhaps much of this won’t really mean breakthroughs in the scientific sense, because many of the basic methods already exist. However, much of this will SEEM like a big thing for those who focus on applications. (It also seemed like a big thing when in 2011 our team achieved the first superhuman visual classification performance in a controlled contest, although none of the basic algorithms was younger than two decades: http://people.idsia.ch/~juergen/superhumanpatternrecognition.html)

So what will be the real big thing? I like to believe that it will be self-referential general purpose learning algorithms that improve not only some system’s performance in a given domain, but also the way they learn, and the way they learn the way they learn, etc., limited only by the fundamental limits of computability. I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on this topic, but now I can see how it is starting to become a practical reality. Previous work on this is collected here: http://people.idsia.ch/~juergen/metalearner.html
See also Solomonoff universal induction. I don't believe that completely general purpose learning algorithms have to become practical before we achieve human-level AI. Humans are quite limited, after all! When was the last time you introspected to learn about the way you learn the way you learn ...? Perhaps it is happening "under the hood" to some extent, but not in maximum generality; we have hardwired limits.
Do we really need Solomonoff? Did Nature make use of his Universal Prior in producing us? It seems like cheaper tricks can produce "intelligence" ;-)

Sunday, February 07, 2021

Gradient Descent Models Are Kernel Machines (Deep Learning)

This paper shows that models which result from gradient descent training (e.g., deep neural nets) can be expressed, approximately, as a weighted sum of similarity functions (kernels) which measure the similarity of a given instance to the examples used in training. The kernels are defined by the inner product of model gradients in the parameter space, integrated over the descent (learning) path.

Roughly speaking, two data points x and x' are similar, i.e., have a large kernel value K(x,x'), if they have similar effects on the model parameters during gradient descent. With respect to the learning algorithm, x and x' have similar information content. The learned model y = f(x) matches x to similar data points x_i: the resulting value y is simply a weighted (linear) sum of kernel values K(x,x_i).
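Schematically (following the paper up to notational details, and derived in the gradient-flow limit of infinitesimally small steps), the kernel in question is the "path kernel": tangent-kernel inner products of model gradients accumulated along the learning trajectory c(t), with the trained model approximately a weighted sum of kernel values over the training points:

\begin{align}
K^{p}(x, x') &= \int_{c(t)} \nabla_{w} f_{w}(x) \cdot \nabla_{w} f_{w}(x') \, dt , \\
f(x) &\approx \sum_{i} a_i \, K^{p}(x, x_i) + b ,
\end{align}

where the coefficients a_i are built from the loss derivatives L'(y_i^*, f(x_i)) averaged along the learning path.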

This result makes it very clear that without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples. See note added at bottom for more on this point, re: AGI, etc. Given the complexity (e.g., dimensionality) of the ground truth model, one can place bounds on the amount of data required for successful training.

This formulation locates the nonlinearity of deep learning models in the kernel function. The superposition of kernels is entirely linear as long as the loss function is additive over training data.
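To make the single-step version of this concrete, here is a toy numerical check (mine, not from the paper): for squared loss, one small gradient-descent step changes the prediction at a new point x by approximately -eta * sum_i (f(x_i) - y_i) * K(x, x_i), where K(x, x') is the tangent kernel (the inner product of model gradients) that gets accumulated into the path kernel above.

import numpy as np

rng = np.random.default_rng(0)
d, h, n = 5, 16, 8                    # input dim, hidden units, training points
W = 0.5 * rng.standard_normal((h, d))
v = 0.5 * rng.standard_normal(h)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
x_test = rng.standard_normal(d)
eta = 1e-3

def f(W, v, x):                       # one-hidden-layer net, scalar output
    return v @ np.tanh(W @ x)

def grads(W, v, x):                   # gradients of f with respect to (W, v)
    z = np.tanh(W @ x)
    return np.outer(v * (1.0 - z**2), x), z

# One gradient-descent step on the squared loss over the training set.
residuals = np.array([f(W, v, X[i]) - y[i] for i in range(n)])
gW = [grads(W, v, X[i])[0] for i in range(n)]
gv = [grads(W, v, X[i])[1] for i in range(n)]
W_new = W - eta * sum(residuals[i] * gW[i] for i in range(n))
v_new = v - eta * sum(residuals[i] * gv[i] for i in range(n))

# Tangent-kernel prediction of the change in f(x_test), versus the actual change.
gW_t, gv_t = grads(W, v, x_test)
K = np.array([np.sum(gW_t * gW[i]) + gv_t @ gv[i] for i in range(n)])
print("actual change:    ", f(W_new, v_new, x_test) - f(W, v, x_test))
print("kernel prediction:", -eta * residuals @ K)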
 
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine  
P. Domingos      
https://arxiv.org/pdf/2012.00152.pdf
Deep learning’s successes are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features like other learning methods. We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel. This improved understanding should lead to better learning algorithms.
From the paper:
... Here we show that every model learned by this method, regardless of architecture, is approximately equivalent to a kernel machine with a particular type of kernel. This kernel measures the similarity of the model at two data points in the neighborhood of the path taken by the model parameters during learning. Kernel machines store a subset of the training data points and match them to the query using the kernel. Deep network weights can thus be seen as a superposition of the training data points in the kernel’s feature space, enabling their efficient storage and matching. This contrasts with the standard view of deep learning as a method for discovering representations from data. ... 
... the weights of a deep network have a straightforward interpretation as a superposition of the training examples in gradient space, where each example is represented by the corresponding gradient of the model. Fig. 2 illustrates this. One well-studied approach to interpreting the output of deep networks involves looking for training instances that are close to the query in Euclidean or some other simple space (Ribeiro et al., 2016). Path kernels tell us what the exact space for these comparisons should be, and how it relates to the model’s predictions. ...
See also this video which discusses the paper. 

You can almost grasp the result from the figure and definitions below.

Note Added:
I was asked to elaborate further on this sentence, especially regarding AGI and human cognition: 

... without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples.

It should not be taken as a suggestion that gradient descent models can't achieve AGI, or that our minds can't be (effectively) models of this kernel type. 

1. The universe is highly compressible: it is governed by very simple effective models. These models can be learned, which allows for prediction beyond specific examples.

2. A sufficiently complex neural net can incorporate layers of abstraction. Thus a new instance and a previously seen example might be similar in an abstract (non-explicit) sense, but that similarity is still incorporated into the kernel. When Einstein invented Special Relativity he was not exactly aping another physical theory he had seen before, but at an abstract level the physical constraint (speed of light constant in all reference frames) and algebraic incorporation of this fact into a description of spacetime (Lorentz symmetry) may have been "similar" to examples he had seen already in simple geometry / algebra. (See Poincare and Einstein for more.)
Ulam: Banach once told me, "Good mathematicians see analogies between theorems or theories, the very best ones see analogies between analogies." Gamow possessed this ability to see analogies between models for physical theories to an almost uncanny degree... 

Friday, October 22, 2021

The Principles of Deep Learning Theory - Dan Roberts IAS talk

 

This is a nice talk that discusses, among other things, subleading 1/width corrections to the infinite-width limit of neural networks. I was expecting that someone would work out these corrections when I wrote the post on the NTK and large-width limit at the link below. Apparently, the infinite-width limit does not capture the behavior of realistic neural nets, and it is only at the first nontrivial order in the expansion that the desired properties emerge. Roberts claims that when the depth-to-width ratio r is small but nonzero one can characterize network dynamics in a controlled expansion, whereas when r > 1 it becomes a problem of strong dynamics.
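Schematically (in the spirit of the book, not its exact notation): at infinite width the distribution of preactivations is Gaussian, with a kernel that obeys a layer-to-layer recursion; the leading finite-width corrections (connected four-point correlators) are suppressed by 1/width but accumulate with depth, so the controlled expansion parameter is the aspect ratio r = depth/width:

\begin{align}
K^{(\ell+1)}(x, x') &= C_b + C_W \, \mathbb{E}_{z \sim \mathcal{N}(0,\, K^{(\ell)})}
  \big[ \phi(z(x)) \, \phi(z(x')) \big] ,
  && \text{(infinite-width, layer-to-layer recursion)} \\
\text{connected 4-point correlator at layer } \ell &\sim \mathcal{O}\!\left( \frac{\ell}{n} \right) ,
  && \text{(grows with depth } \ell\text{, shrinks with width } n\text{)}
\end{align}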

The talk is based on the book
The Principles of Deep Learning Theory 
https://arxiv.org/abs/2106.10165 
This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
Dan Roberts web page

This essay looks interesting:
Why is AI hard and Physics simple? 
https://arxiv.org/abs/2104.00008 
We discuss why AI is hard and why physics is simple. We discuss how physical intuition and the approach of theoretical physics can be brought to bear on the field of artificial intelligence and specifically machine learning. We suggest that the underlying project of machine learning and the underlying project of physics are strongly coupled through the principle of sparsity, and we call upon theoretical physicists to work on AI as physicists. As a first step in that direction, we discuss an upcoming book on the principles of deep learning theory that attempts to realize this approach.

May 2021 post: Neural Tangent Kernels and Theoretical Foundations of Deep Learning
Large width seems to provide a limiting case (analogous to the large-N limit in gauge theory) in which rigorous results about deep learning can be proved. ... 
The overparametrized (width ~ w^2) network starts in a random state and by concentration of measure this initial kernel K is just the expectation, which is the NTK. Because of the large number of parameters the effect of training (i.e., gradient descent) on any individual parameter is 1/w, and the change in the eigenvalue spectrum of K is also 1/w. It can be shown that the eigenvalue spectrum is positive and bounded away from zero, and this property does not change under training. Also, the evolution of f is linear in K up to corrections which are suppressed by 1/w. Hence evolution follows a convex trajectory and can achieve global minimum loss in a finite (polynomial) time.
The parametric 1/w expansion may depend on quantities such as the smallest NTK eigenvalue k: the proof might require k >> 1/w or wk large. 
In the large w limit the function space has such high dimensionality that any typical initial f is close (within a ball of radius 1/w?) to an optimal f. These properties depend on the specific choice of loss function.
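A toy numerical sketch of the frozen-kernel statement above (my own example in the NTK parametrization, not from the post): compute the empirical NTK, the matrix of parameter-gradient inner products of f on the training inputs, before and after training, and watch its relative change shrink as the width w grows.

import numpy as np

rng = np.random.default_rng(0)
d, n, steps, eta = 3, 10, 50, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def ntk(W, v, X, w):
    # Empirical NTK of f(x) = (1/sqrt(w)) v . tanh(W x): sum over all parameters
    # of grad f(x_i) * grad f(x_j).
    Z = np.tanh(X @ W.T)                                              # (n, w)
    Gv = Z / np.sqrt(w)                                               # df/dv
    GW = ((v * (1 - Z**2)) / np.sqrt(w))[:, :, None] * X[:, None, :]  # df/dW
    return Gv @ Gv.T + np.einsum('iab,jab->ij', GW, GW)

for w in [10, 100, 1000]:
    W = rng.standard_normal((w, d))
    v = rng.standard_normal(w)
    K0 = ntk(W, v, X, w)
    for _ in range(steps):                        # gradient descent on squared loss
        Z = np.tanh(X @ W.T)
        preds = Z @ v / np.sqrt(w)
        r = preds - y                             # residuals
        grad_v = Z.T @ r / np.sqrt(w)
        grad_W = (((v * (1 - Z**2)) / np.sqrt(w)) * r[:, None]).T @ X
        v -= eta * grad_v
        W -= eta * grad_W
    K1 = ntk(W, v, X, w)
    rel = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
    print(f"width {w:5d}: relative change in NTK after training ~ {rel:.4f}")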
See related remarks: ICML notes (2018).
It may turn out that the problems on which DL works well are precisely those in which the training data (and underlying generative processes) have a hierarchical structure which is sparse, level by level. Layered networks perform a kind of coarse graining (renormalization group flow): first layers filter by feature, subsequent layers by combinations of features, etc. But the whole thing can be understood as products of sparse filters, and the performance under training is described by sparse performance guarantees (ReLU = thresholded penalization?). Given the inherent locality of physics (atoms, molecules, cells, tissue; atoms, words, sentences, ...) it is not surprising that natural phenomena generate data with this kind of hierarchical structure.

Monday, February 23, 2015

Back to the deep



The Chronicle has a nice profile of Geoffrey Hinton, which details some of the history behind neural nets and deep learning. See also Neural networks and deep learning and its sequel.

The recent flourishing of deep neural nets is not primarily due to theoretical advances, but rather the appearance of GPUs and large training data sets.
Chronicle: ... Hinton has always bucked authority, so it might not be surprising that, in the early 1980s, he found a home as a postdoc in California, under the guidance of two psychologists, David E. Rumelhart and James L. McClelland, at the University of California at San Diego. "In California," Hinton says, "they had the view that there could be more than one idea that was interesting." Hinton, in turn, gave them a uniquely computational mind. "We thought Geoff was remarkably insightful," McClelland says. "He would say things that would open vast new worlds."

They held weekly meetings in a snug conference room, coffee percolating at the back, to find a way of training their error correction back through multiple layers. Francis Crick, who co-discovered DNA’s structure, heard about their work and insisted on attending, his tall frame dominating the room even as he sat on a low-slung couch. "I thought of him like the fish in The Cat in the Hat," McClelland says, lecturing them about whether their ideas were biologically plausible.

The group was too hung up on biology, Hinton said. So what if neurons couldn’t send signals backward? They couldn’t slavishly recreate the brain. This was a math problem, he said, what’s known as getting the gradient of a loss function. They realized that their neurons couldn’t be on-off switches. If you picture the calculus of the network like a desert landscape, their neurons were like drops off a sheer cliff; traffic went only one way. If they treated them like a more gentle mesa—a sigmoidal function—then the neurons would still mostly act as a threshold, but information could climb back up.

...

A decade ago, Hinton, LeCun, and Bengio conspired to bring them back. Neural nets had a particular advantage compared with their peers: While they could be trained to recognize new objects—supervised learning, as it’s called—they should also be able to identify patterns on their own, much like a child, if left alone, would figure out the difference between a sphere and a cube before its parent says, "This is a cube." If they could get unsupervised learning to work, the researchers thought, everyone would come back. By 2006, Hinton had a paper out on "deep belief networks," which could run many layers deep and learn rudimentary features on their own, improved by training only near the end. They started calling these artificial neural networks by a new name: "deep learning." The rebrand was on.

Before they won over the world, however, the world came back to them. That same year, a different type of computer chip, the graphics processing unit, became more powerful, and Hinton’s students found it to be perfect for the punishing demands of deep learning. Neural nets got 30 times faster overnight. Google and Facebook began to pile up hoards of data about their users, and it became easier to run programs across a huge web of computers. One of Hinton’s students interned at Google and imported Hinton’s speech recognition into its system. It was an instant success, outperforming voice-recognition algorithms that had been tweaked for decades. Google began moving all its Android phones over to Hinton’s software.

It was a stunning result. These neural nets were little different from what existed in the 1980s. This was simple supervised learning. It didn’t even require Hinton’s 2006 breakthrough. It just turned out that no other algorithm scaled up like these nets. "Retrospectively, it was just a question of the amount of data and the amount of computations," Hinton says. ...

Sunday, June 09, 2019

L1 vs Deep Learning in Genomic Prediction

The paper below by some of my MSU colleagues examines the performance of a number of ML algorithms, both linear and nonlinear, including deep neural nets, in genomic prediction across several different species.

When I give talks about prediction of disease risks and complex traits in humans, I am often asked why we are not using fancy (trendy?) methods such as Deep Learning (DL). Instead, we focus on L1 penalization methods ("sparse learning") because 1. the theoretical framework (including theorems providing performance guarantees) is well-developed, and (relatedly) 2. the L1 methods perform as well as or better than other methods in our own testing.

The term theoretical framework may seem unusual in ML, which is at the moment largely an empirical subject. Experience in theoretical physics shows that when powerful mathematical results are available, they can be very useful to guide investigation. In the case of sparse learning we can make specific estimates for how much data is required to "solve" a trait -- i.e., capture most of the estimated heritability in the predictor. Five years ago we predicted a threshold of a few hundred thousand genomes for height, and this turned out to be correct. Currently, this kind of performance characterization is not possible for DL or other methods.
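As a toy illustration of the threshold behavior behind such estimates (synthetic data, sklearn's Lasso, illustrative numbers only): L1 recovery of a sparse linear "trait" improves sharply once the sample size n passes a multiple of s log p, for s nonzero effects among p markers.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, s = 2000, 20                                   # markers, nonzero effects

beta = np.zeros(p)
beta[rng.choice(p, s, replace=False)] = rng.standard_normal(s)

def recovery_corr(n):
    # Correlation between Lasso-estimated and true effect sizes at sample size n.
    X = rng.standard_normal((n, p))
    y = X @ beta + 0.5 * rng.standard_normal(n)
    fit = Lasso(alpha=0.05, max_iter=5000).fit(X, y)
    return np.corrcoef(fit.coef_, beta)[0, 1]

for n in [50, 100, 200, 400, 800, 1600]:
    print(f"n = {n:5d}   corr(estimated, true effects) = {recovery_corr(n):.3f}")

Recovery is poor below the threshold and near-perfect above it; the same logic, with realistic sparsity and marker counts, is roughly what lay behind the "few hundred thousand genomes" estimate for height mentioned above.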

What is especially powerful about deep neural nets is that they yield a quasi-convex (or at least reasonably efficient) optimization procedure which can learn high-dimensional functions. The class of models is both tractable from a learning/optimization perspective and highly expressive. As I wrote here in my ICML notes (see also Elad's work which relates DL to Sparse Learning):
It may turn out that the problems on which DL works well are precisely those in which the training data (and underlying generative processes) have a hierarchical structure which is sparse, level by level. Layered networks perform a kind of coarse graining (renormalization group flow): first layers filter by feature, subsequent layers by combinations of features, etc. But the whole thing can be understood as products of sparse filters, and the performance under training is described by sparse performance guarantees (ReLU = thresholded penalization?).
However, currently in genomic prediction one typically finds that nonlinear interactions are small, which means features more complicated than single SNPs are unnecessary. (In a recent post I discussed a new T1D predictor that makes use of nonlinear haplotype interaction effects, but even there the effects are not large.) Eventually I expect this situation to change -- when we have enough whole genomes to work with, a DL approach which can (automatically) identify important features (motifs?) may allow us to go beyond SNPs and simple linear models.

Note, though, that from an information theoretic perspective (see, e.g., any performance theorems in compressed sensing) it is obvious that we will need much more data than we currently have to advance this program. Also, note that Visscher et al.'s recent GCTA work suggests that additive SNP models using rare variants (i.e., extracted from whole genome data) can account for nearly all the expected heritability for height. This implies that nonlinear methods like DL may not yield qualitatively better results than simpler L1 approaches, even in the limit of very large whole genome datasets.
Benchmarking algorithms for genomic prediction of complex traits

Christina B. Azodi, Andrew McCarren, Mark Roantree, Gustavo de los Campos, Shin-Han Shiu

The usefulness of Genomic Prediction (GP) in crop and livestock breeding programs has led to efforts to develop new and improved GP approaches, including non-linear algorithms such as artificial neural networks (ANN) (i.e. deep learning) and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of GP datasets and models. Using data of 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and five non-linear algorithms, including ANNs. First, we found that hyperparameter selection was critical for all non-linear algorithms and that feature selection prior to model training was necessary for ANNs when the markers greatly outnumbered the number of training lines. Across all species and trait combinations, no one algorithm performed best; however, predictions based on a combination of results from multiple GP algorithms (i.e. ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits than that of linear algorithms. Although ANNs did not perform best for any trait, we identified strategies (i.e. feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. These results, together with the fact that even small improvements in GP performance could accumulate into large genetic gains over the course of a breeding program, highlight the importance of algorithm selection for the prediction of trait values.


Wednesday, August 27, 2014

Neural Networks and Deep Learning 2

Inspired by the topics discussed in this earlier post, I've been reading Michael Nielsen's online book on neural nets and deep learning. I particularly liked the subsection quoted below. For people who think deep learning is anything close to a solved problem, or anticipate a near term, quick take-off to the Singularity, I suggest they read the passage below and grok it deeply.
Neural Networks and Deep Learning (Chapter 3):

You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong [...] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well. -- Question and answer with neural networks researcher Yann LeCun

Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with "I'm very sympathetic to your point of view, but [...]". Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I'd rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn't always have unarguable new progress to report.

You may have noticed a verbal tic similar to "I'm very sympathetic [...]" in the current book. To explain what we're seeing I've often fallen back on saying "Heuristically, [...]", or "Roughly speaking, [...]", following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I've presented has often been pretty thin. If you look through the research literature you'll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?

In many parts of science - especially those parts that deal with simple phenomena - it's possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it's exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.

[ Sufficiently advanced AI will come to resemble biology, even psychology, in its complexity and resistance to rigorous generalization ... ]

One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works* *From ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton (2012).: "This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons." This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we've discussed. Each heuristic is not just a (potential) explanation, it's also a challenge to investigate and understand in more detail.

Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It's going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
See also here from an earlier post on this blog:
... evolution has [ encoded the results of a huge environment-dependent optimization ] in the structure of our brains (and genes), a process that AI would have to somehow replicate. A very crude estimate of the amount of computational power used by nature in this process leads to a pessimistic prognosis for AI even if one is willing to extrapolate Moore's Law well into the future. [ Moore's Law (Dennard scaling) may be toast for the next decade or so! ] Most naive analyses of AI and computational power only ask what is required to simulate a human brain, but do not ask what is required to evolve one. I would guess that our best hope is to cheat by using what nature has already given us -- emulating the human brain as much as possible.
If indeed there are good (deep) generalized learning architectures to be discovered, that will take time. Even with such a learning architecture at hand, training it will require interaction with a rich exterior world -- either the real world (via sensors and appendages capable of manipulation) or a computationally expensive virtual world. Either way, I feel confident in my bet that a strong version of the Turing test (allowing, e.g., me to communicate with the counterpart over weeks or months; to try to teach it things like physics and watch its progress; eventually for it to teach me) won't be passed until at least 2050 and probably well beyond.

Turing as polymath: ... In a similar way Turing found a home in Cambridge mathematical culture, yet did not belong entirely to it. The division between 'pure' and 'applied' mathematics was at Cambridge then as now very strong, but Turing ignored it, and he never showed mathematical parochialism. If anything, it was the attitude of a Russell that he acquired, assuming that mastery of so difficult a subject granted the right to invade others.

Wednesday, May 30, 2018

Deep Learning as a branch of Statistical Physics

Via Jess Riedel, an excellent talk by Naftali Tishby given recently at the Perimeter Institute.

The first 15 minutes is a very nice summary of the history of neural nets, with an emphasis on the connection to statistical physics. In the large network (i.e., thermodynamic) limit, one observes phase transition behavior -- sharp transitions in performance, and also a kind of typicality (concentration of measure) that allows for general statements that are independent of some detailed features.

Unfortunately I don't know how to embed video from Perimeter so you'll have to click here to see the talk.

An earlier post on this work: Information Theory of Deep Neural Nets: "Information Bottleneck"

Title and Abstract:
The Information Theory of Deep Neural Networks: The statistical physics aspects

The surprising success of learning with deep neural networks poses two fundamental challenges: understanding why these networks work so well and what this success tells us about the nature of intelligence and our biological brain. Our recent Information Theory of Deep Learning shows that large deep networks achieve the optimal tradeoff between training size and accuracy, and that this optimality is achieved through the noise in the learning process.

In this talk, I will focus on the statistical physics aspects of our theory and the interaction between the stochastic dynamics of the training algorithm (Stochastic Gradient Descent) and the phase structure of the Information Bottleneck problem. Specifically, I will describe the connections between the phase transition and the final location and representation of the hidden layers, and the role of these phase transitions in determining the weights of the network.
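For reference (standard notation, my addition rather than anything specific to the talk), the Information Bottleneck tradeoff referred to above compresses the input X into a representation T while preserving information about the label Y:

\begin{equation}
\min_{p(t \mid x)} \; I(X; T) \, - \, \beta \, I(T; Y) ,
\end{equation}

with the Lagrange multiplier \beta setting the tradeoff between compression and prediction; the phase structure mentioned in the abstract arises as \beta is varied.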

About Tishby:
Naftali (Tali) Tishby נפתלי תשבי

Physicist, professor of computer science and computational neuroscientist
The Ruth and Stan Flinkman professor of Brain Research
Benin school of Engineering and Computer Science
Edmond and Lilly Safra Center for Brain Sciences (ELSC)
Hebrew University of Jerusalem, 96906 Israel

I work at the interfaces between computer science, physics, and biology which provide some of the most challenging problems in today’s science and technology. We focus on organizing computational principles that govern information processing in biology, at all levels. To this end, we employ and develop methods that stem from statistical physics, information theory and computational learning theory, to analyze biological data and develop biologically inspired algorithms that can account for the observed performance of biological systems. We hope to find simple yet powerful computational mechanisms that may characterize evolved and adaptive systems, from the molecular level to the whole computational brain and interacting populations.

Sunday, May 01, 2016

The Future of Machine Intelligence


See you at Foo Camp in June! Get a free copy of this book at the link.
The Future of Machine Intelligence 
Perspectives from Leading Practitioners
By David Beyer

Publisher: O'Reilly
Released: March 2016

Advances in both theory and practice are throwing the promise of machine learning into sharp relief. The field has the potential to transform a range of industries, from self-driving cars to intelligent business applications. Yet machine learning is so complex and wide-ranging that even its definition can change from one person to the next.

The series of interviews in this exclusive report unpack concepts and innovations that represent the frontiers of ever-smarter machines. You’ll get a rare glimpse into this exciting field through the eyes of some of its leading minds.

In these interviews, these ten practitioners and theoreticians cover the following topics:

Anima Anandkumar: high-dimensional problems and non-convex optimization
Yoshua Bengio: Natural Language Processing and deep learning
Brendan Frey: deep learning meets genomic medicine
Risto Miikkulainen: the startling creativity of evolutionary algorithms
Ben Recht: a synthesis of machine learning and control theory
Daniela Rus: the autonomous car as a driving partner
Gurjeet Singh: using topology to uncover the shape of your data
Ilya Sutskever: the promise of unsupervised learning and attention models
Oriol Vinyals: sequence-to-sequence machine learning
Reza Zadeh: the evolution of machine learning and the role of Spark

About the editor: David Beyer is an investor with Amplify Partners, an early-stage VC focused on the next generation of infrastructure IT, data, and information security companies. Part of the founding team at Patients Know Best, one of the world’s leading cloud-based Personal Health Record (PHR) companies, he was also the co-founder and CEO of Chartio.com, a pioneering provider of cloud-based data visualization and analytics.

Tuesday, July 17, 2018

ICML notes

There has never been a better time to work on AI/ML. Vast resources are being deployed in this direction, by corporations and governments alike. In addition to the marvelous practical applications in development, a theoretical understanding of Deep Learning may emerge in the next few years.

The notes below are to keep track of some interesting things I encountered at the meeting.

Some ML learning resources:

Metacademy
Depth First study of AlphaGo


I heard a more polished version of this talk by Elad at the Theory of Deep Learning workshop. He is trying to connect results in sparse learning (e.g., performance guarantees for L1 or threshold algos) to Deep Learning. (Video is from UCLA IPAM.)



It may turn out that the problems on which DL works well are precisely those in which the training data (and underlying generative processes) have a hierarchical structure which is sparse, level by level. Layered networks perform a kind of coarse graining (renormalization group flow): first layers filter by feature, subsequent layers by combinations of features, etc. But the whole thing can be understood as products of sparse filters, and the performance under training is described by sparse performance guarantees (ReLU = thresholded penalization?). Given the inherent locality of physics (atoms, molecules, cells, tissue; atoms, words, sentences, ...) it is not surprising that natural phenomena generate data with this kind of hierarchical structure.


Off-topic: At dinner with one of my former students and his colleague (both researchers at an AI lab in Germany), the subject of Finitism came up due to a throwaway remark about the Continuum Hypothesis.

Wikipedia
Horizons of Truth
Chaitin on Physics and Mathematics

David Deutsch:
The reason why we find it possible to construct, say, electronic calculators, and indeed why we can perform mental arithmetic, cannot be found in mathematics or logic. The reason is that the laws of physics "happen" to permit the existence of physical models for the operations of arithmetic such as addition, subtraction and multiplication.
My perspective: We experience the physical world directly, so the highest confidence belief we have is in its reality. Mathematics is an invention of our brains, and cannot help but be inspired by the objects we find in the physical world. Our idealizations (such as "infinity") may or may not be well-founded. In fact, mathematics with infinity included may be very sick, as evidenced by Gödel's results or by paradoxes in set theory. There is no reason that infinity is needed (as far as we know) to do physics. It is entirely possible that there are only a (large but) finite number of degrees of freedom in the physical universe.

Paul Cohen:
I will ascribe to Skolem a view, not explicitly stated by him, that there is a reality to mathematics, but axioms cannot describe it. Indeed one goes further and says that there is no reason to think that any axiom system can adequately describe it.
This "it" (mathematics) that Cohen describes may be the set of idealizations constructed by our brains extrapolating from physical reality. But there is no guarantee that these idealizations have a strong kind of internal consistency and indeed they cannot be adequately described by any axiom system.

Saturday, January 27, 2018

Mathematical Theory of Deep Neural Networks (Princeton workshop)

This looks interesting. Deep Learning would benefit from a stronger theoretical understanding of why it works so well. I hope they put the talks online!
Mathematical Theory of Deep Neural Networks

Tuesday March 20th, Princeton Neuroscience Institute.
PNI Psychology Lecture Hall 101

Recent advances in deep networks, combined with open, easily-accessible implementations, have moved empirical results far faster than formal understanding. The lack of rigorous analysis for these techniques limits their use in addressing scientific questions in the physical and biological sciences, and prevents systematic design of the next generation of networks. Recently, long-past-due theoretical results have begun to emerge. These results, and those that will follow in their wake, will begin to shed light on the properties of large, adaptive, distributed learning architectures, and stand to revolutionize how computer science and neuroscience understand these systems.

This intensive one-day technical workshop will focus on state of the art theoretical understanding of deep learning. We aim to bring together researchers from the Princeton Neuroscience Institute (PNI) and of the theoretical machine learning group at the Institute for Advanced Studies (IAS) interested in more rigorously understanding deep networks to foster increased discussion and collaboration across these intrinsically related groups.

Saturday, April 15, 2017

History of Bayesian Neural Networks



This talk gives the history of neural networks in the framework of Bayesian inference. Deep learning is (so far) quite empirical in nature: things work, but we lack a good theoretical framework for understanding why or even how. The Bayesian approach offers some progress in these directions, and also toward quantifying prediction uncertainty.

I was sad to learn from this talk that David MacKay passed away last year, from cancer. I recommended his book Information Theory, Inference, and Learning Algorithms back in 2007.

Yarin Gal's dissertation Uncertainty in Deep Learning, mentioned in the talk.
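One concrete technique popularized by Gal's work is "MC dropout": keep dropout active at test time and read off predictive uncertainty from the spread across stochastic forward passes. A minimal sketch (toy untrained network, numbers purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 64
W1 = 0.5 * rng.standard_normal((h, d))
W2 = 0.5 * rng.standard_normal(h)
keep_prob = 0.8

def forward(x, stochastic=True):
    z = np.maximum(W1 @ x, 0.0)                    # ReLU hidden layer
    if stochastic:                                 # dropout stays ON at test time for MC dropout
        mask = (rng.random(h) < keep_prob) / keep_prob
        z = z * mask
    return W2 @ z

x = rng.standard_normal(d)
samples = np.array([forward(x) for _ in range(200)])
print("predictive mean:", samples.mean(), "  predictive std (uncertainty):", samples.std())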

I suppose I can thank my Caltech education for a quasi-subconscious understanding of neural nets despite never having worked on them. They were in the air when I was on campus, due to the presence of John Hopfield (he co-founded the Computation and Neural Systems PhD program at Caltech in 1986). See also Hopfield on physics and biology.

Amusingly, I discovered this talk via deep learning: YouTube's recommendation engine, powered by deep neural nets, suggested it to me this Saturday afternoon :-)

Friday, December 08, 2017

Recursive Cortical Networks: data efficient computer vision



Will knowledge from neuroscience inform the design of better AIs (neural nets)? These results from startup Vicarious AI suggest that the answer is yes! (See also this company blog post describing the research.)

It has often been remarked that evolved biological systems (e.g., a baby) can learn much faster, and with much less data, than existing artificial neural nets. Significant improvements in AI are almost certainly within reach...

Thanks to reader and former UO Physics colleague Raghuveer Parthasarathy for a pointer to this paper!
A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs

Science 08 Dec 2017: Vol. 358, Issue 6368, eaag2612
DOI: 10.1126/science.aag2612

INTRODUCTION
Compositionality, generalization, and learning from a few examples are among the hallmarks of human intelligence. CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart), images used by websites to block automated interactions, are examples of problems that are easy for people but difficult for computers. CAPTCHAs add clutter and crowd letters together to create a chicken-and-egg problem for algorithmic classifiers—the classifiers work well for characters that have been segmented out, but segmenting requires an understanding of the characters, which may be rendered in a combinatorial number of ways. CAPTCHAs also demonstrate human data efficiency: A recent deep-learning approach for parsing one specific CAPTCHA style required millions of labeled examples, whereas humans solve new styles without explicit training.

By drawing inspiration from systems neuroscience, we introduce recursive cortical network (RCN), a probabilistic generative model for vision in which message-passing–based inference handles recognition, segmentation, and reasoning in a unified manner. RCN learns with very little training data and fundamentally breaks the defense of modern text-based CAPTCHAs by generatively segmenting characters. In addition, RCN outperforms deep neural networks on a variety of benchmarks while being orders of magnitude more data-efficient.

RATIONALE
Modern deep neural networks resemble the feed-forward hierarchy of simple and complex cells in the neocortex. Neuroscience has postulated computational roles for lateral and feedback connections, segregated contour and surface representations, and border-ownership coding observed in the visual cortex, yet these features are not commonly used by deep neural nets. We hypothesized that systematically incorporating these findings into a new model could lead to higher data efficiency and generalization. Structured probabilistic models provide a natural framework for incorporating prior knowledge, and belief propagation (BP) is an inference algorithm that can match the cortical computational speed. The representational choices in RCN were determined by investigating the computational underpinnings of neuroscience data under the constraint that accurate inference should be possible using BP.

RESULTS
RCN was effective in breaking a wide variety of CAPTCHAs with very little training data and without using CAPTCHA-specific heuristics. By comparison, a convolutional neural network required a 50,000-fold larger training set and was less robust to perturbations to the input. Similar results are shown on one- and few-shot MNIST (modified National Institute of Standards and Technology handwritten digit data set) classification, where RCN was significantly more robust to clutter introduced during testing. As a generative model, RCN outperformed neural network models when tested on noisy and cluttered examples and generated realistic samples from one-shot training of handwritten characters. RCN also proved to be effective at an occlusion reasoning task that required identifying the precise relationships between characters at multiple points of overlap. On a standard benchmark for parsing text in natural scenes, RCN outperformed state-of-the-art deep-learning methods while requiring 300-fold less training data.

CONCLUSION
Our work demonstrates that structured probabilistic models that incorporate inductive biases from neuroscience can lead to robust, generalizable machine learning models that learn with high data efficiency. In addition, our model’s effectiveness in breaking text-based CAPTCHAs with very little training data suggests that websites should seek more robust mechanisms for detecting automated interactions.

Tuesday, March 15, 2016

Geoff Hinton on Deep Learning

This is a recent, and fairly non-technical, introduction to Deep Learning by Geoff Hinton.



In the most interesting part of the talk (@25 min; see arxiv:1409.3215 and arxiv:1506.00019) he describes extracting "thought vectors" or semantic (meaning) relationships from plain text. This involves a deep net, human text, and resulting vectors of weights.
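A toy sketch of the mechanics only (random weights, tiny hypothetical vocabulary, nothing trained): a recurrent encoder reads a sentence word by word, and its final hidden state is the fixed-length "thought vector". In a trained sequence-to-sequence model (arxiv:1409.3215) this is the vector the decoder conditions on, and sentences with similar meanings end up with nearby vectors.

import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("the cat sat on mat a dog ran".split())}
d_emb, d_hid = 8, 16
E = 0.1 * rng.standard_normal((len(vocab), d_emb))       # word embeddings
Wx = 0.1 * rng.standard_normal((d_hid, d_emb))           # input-to-hidden weights
Wh = 0.1 * rng.standard_normal((d_hid, d_hid))           # hidden-to-hidden weights

def thought_vector(sentence):
    h = np.zeros(d_hid)
    for word in sentence.split():
        h = np.tanh(Wx @ E[vocab[word]] + Wh @ h)        # simple RNN update
    return h                                             # final state = "thought vector"

v = thought_vector("the cat sat on the mat")
print("thought vector shape:", v.shape)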

The slide below summarizes some history. Most of the theoretical ideas behind Deep Learning have been around for a long time. Hinton sometimes characterizes the advances as resulting from a factor of a million in hardware capability (increase in compute power and data availability), and an order of magnitude from new tricks. See also Moore's Law and AI.

Sunday, January 20, 2019

Ditchley Foundation Conference: The intersection of machine learning and genetic engineering


I'll be back in the UK soon for the meeting described below. Above, Ditchley House.
The intersection of machine learning and genetic engineering: what should be our check list for society and state as we blast off?

07 - 09 FEB 2019

Advances in machine learning and genetic engineering are combining to produce rapid advances in medicine, development of materials and genetic engineering. Parallel advances in robotics and automation have made the practical process of gene editing scalable. The possibility exists that advances in quantum computing could further accelerate progress on machine learning, bringing a second boost to this technological rocket.

This Ditchley conference will bring together an unusual mix of deep expertise and scientific renown in the disciplines; thinkers on religion, ethics and law; investors fueling innovation; and political leaders looking to shape the approach of society and state to fast emerging possibilities. We will attempt to establish sufficient common understanding of what the science promises and what it doesn’t and then explore the opportunities and risks that are likely to unfold at speed. This will be a first pass at preparation for potential blast off – what should be our moral, legal, economic and national security checklist as we wait on the launch pad of a new age?

The progress on machine learning is quite narrow in scope – deep learning using neural networks and other techniques on large data sets that did not previously exist and that can now be stored and computed on in ways that were not possible before. But whereas progress towards general AI is often overstated, full general AI is not required to radically accelerate gene sequencing, editing and programming, with costs falling all the time and scale and speed increasing.

We will examine and try to come to preliminary conclusions on questions such as the following:

How should the most aggressive genetic engineering technologies be regulated?

How can societies best assess the ethical issues raised by these technologies to find an optimal balance between fostering genetic technologies for the common good while preventing abuse?

What are the implications for the global economy and economic cooperation and competition between states? Are we entering a period of bio-nationalism as well as AI nationalism? Should this be compared to the space race of the Cold War? How can we avoid competition between states driving abandonment of norms and moral standards? What will be the impact on the labour force of the new combined technologies of AI and bio-engineering? Within countries, will potential applications of the new technologies further intensify the concentration of wealth and power in a few hands?

What are the implications of rapid combined advances in AI and bioengineering for defence and national security? Will countries be tempted to pursue military applications either through bio-weapons or through the genetic improvement of military forces? What new materials will emerge and how will they affect the balance of power in warfare?

What are the implications for medicine and public health? If we are able to find targeted genetic cures for diseases like cancer then what will the impact be on the population? What are the implications for ageing or declining populations?

How should we handle the implications of deeper knowledge about the influence of our genes on our characteristics and on the characteristics of groups? How do we chart a course between remaining scientifically objective and providing material that could be misused to support racist conclusions by those tending to that view?

What opportunities and threats are there in the potential of these combined technologies for democracies and the equal value put on the view point of each citizen in the electoral system and the rule of law? More philosophically, how can we make sure the development of these technologies contributes to a positive sense of human progress and meaning, rather than to a sense of alienation and loss of purpose? How can we manage the tension between science and religion as human capability to shape the genetic world increases?
I'm only briefly in London on my way there, but might be able to squeeze in a few meetings :-)

See also:

The Future of IVF and Gene-Editing (Psychology Today interview)

The Future is Here: Genomic Prediction in MIT Technology Review

Genomic Prediction of Complex Disease Risk (bioRxiv)

Sunday, April 16, 2017

Yann LeCun on Unsupervised Learning



This is a recent Yann LeCun talk at CMU. Toward the end he discusses recent breakthroughs using GANs (Generative Adversarial Networks, see also Ian Goodfellow here and here).

LeCun tells an anecdote about the discovery of backpropagation. The first implementation of the algorithm didn't work, probably because of a bug in the program. But they convinced themselves that the failure was due to the network getting caught in a local minimum that would prevent further improvement. It turns out that such bad local minima are very improbable in high-dimensional spaces, which is part of the reason behind the great success of deep learning. As I wrote here:
In the limit of high dimensionality a critical point is overwhelmingly likely to be a saddle point (i.e., the Hessian has at least one negative eigenvalue). This means that even though the surface is not strictly convex the optimization is tractable.
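A quick numerical check of that claim, under the standard simplifying assumption that the Hessian at a generic critical point looks like a random symmetric matrix:

import numpy as np

rng = np.random.default_rng(0)

def frac_with_negative_eigenvalue(n, trials=200):
    """Fraction of random symmetric n x n matrices (toy Hessians) with at least one negative eigenvalue."""
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(n, n))
        h = (a + a.T) / 2                     # symmetrize to get a Hessian-like matrix
        if np.linalg.eigvalsh(h).min() < 0:
            count += 1
    return count / trials

for n in (2, 10, 100):
    print(n, frac_with_negative_eigenvalue(n))

Already at n = 10 essentially every sample has a negative direction; the probability of a fully positive spectrum falls off roughly like exp(-const · n²), so in the million-dimensional loss surfaces of deep nets a generic critical point of this type is almost never a true local minimum.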
This new textbook on deep learning by Goodfellow, Bengio, and Courville (free online version!) looks very good. See also Michael Nielsen's book.

If I were a young person I would be working in this area (perhaps with an eye toward applications in genomics, or perhaps working directly on the big problem of AGI). I hope after I retire they will let me hang out at one of the good AI places like Google Brain or Deep Mind :-)

Tuesday, April 23, 2019

Backpropagation in the Brain? Part 2



If I understand correctly, the issue is how to realize something like backprop when most of the information flow is feed-forward (as in real neurons). How do you transport weights "non-locally"? The L2 optimization studied here doesn't actually transport weights. Rather, the optimized solution realizes the same set of weights in two places...

See earlier post Backpropagation in the Brain? Thanks to STS for the reference.

Center for Brains, Minds and Machines (CBMM)
Published on Apr 3, 2019
Speaker: Dr. Jon Bloom, Broad Institute

Abstract: When trained to minimize reconstruction error, a linear autoencoder (LAE) learns the subspace spanned by the top principal directions but cannot learn the principal directions themselves. In this talk, I'll explain how this observation became the focus of a project on representation learning of neurons using single-cell RNA data. I'll then share how this focus led us to a satisfying conversation between numerical analysis, algebraic topology, random matrix theory, deep learning, and computational neuroscience. We'll see that an L2-regularized LAE learns the principal directions as the left singular vectors of the decoder, providing a simple and scalable PCA algorithm related to Oja's rule. We'll use the lens of Morse theory to smoothly parameterize all LAE critical manifolds and the gradient trajectories between them; and see how algebra and probability theory provide principled foundations for ensemble learning in deep networks, while suggesting new algorithms. Finally, we'll come full circle to neuroscience via the "weight transport problem" (Grossberg 1987), proving that L2-regularized LAEs are symmetric at all critical points. This theorem provides local learning rules by which maximizing information flow and minimizing energy expenditure give rise to less-biologically-implausible analogues of backpropagation, which we are excited to explore in vivo and in silico. Joint learning with Daniel Kunin, Aleksandrina Goeva, and Cotton Seed.
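A toy numerical check of that symmetry claim (my own sketch, not the speakers' code): train a tiny L2-regularized linear autoencoder by plain gradient descent and compare the encoder with the transposed decoder at convergence.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 500))            # 8-dimensional data, 500 samples
W1 = 0.1 * rng.normal(size=(3, 8))       # encoder: 8 -> 3
W2 = 0.1 * rng.normal(size=(8, 3))       # decoder: 3 -> 8
lam, lr, m = 0.1, 0.01, X.shape[1]

for _ in range(20000):
    R = W2 @ W1 @ X - X                  # reconstruction residual
    gW2 = R @ (W1 @ X).T / m + lam * W2  # gradient of 0.5*||R||^2/m + 0.5*lam*||W2||^2
    gW1 = W2.T @ R @ X.T / m + lam * W1
    W1 -= lr * gW1
    W2 -= lr * gW2

print(np.max(np.abs(W1 - W2.T)))         # small: encoder and transposed decoder agree near the critical point

Note that nothing in the update for W1 copies W2 across the network, yet the regularized objective pushes the two weight matrices toward transposes of each other; this is the sense in which the "weight transport" comes for free.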

Tuesday, September 05, 2017

DeepMind and StarCraft II Learning Environment



This Learning Environment will enable researchers to attack the problem of building an AI that plays StarCraft II at a high level. As observed in the video, this infrastructure development required significant investment of resources by DeepMind / Alphabet. Now, researchers in academia and elsewhere have a platform from which to explore an important class of AI problems that are related to real world strategic planning. Although StarCraft is "just" a video game, it provides a rich virtual laboratory for machine learning.
StarCraft II: A New Challenge for Reinforcement Learning
https://arxiv.org/abs/1708.04782

This paper introduces SC2LE (StarCraft II Learning Environment), a reinforcement learning environment based on the StarCraft II game. This domain poses a new grand challenge for reinforcement learning, representing a more difficult class of problems than considered in most prior work. It is a multi-agent problem with multiple players interacting; there is imperfect information due to a partially observed map; it has a large action space involving the selection and control of hundreds of units; it has a large state space that must be observed solely from raw input feature planes; and it has delayed credit assignment requiring long-term strategies over thousands of steps. We describe the observation, action, and reward specification for the StarCraft II domain and provide an open source Python-based interface for communicating with the game engine. In addition to the main game maps, we provide a suite of mini-games focusing on different elements of StarCraft II gameplay. For the main game maps, we also provide an accompanying dataset of game replay data from human expert players. We give initial baseline results for neural networks trained from this data to predict game outcomes and player actions. Finally, we present initial baseline results for canonical deep reinforcement learning agents applied to the StarCraft II domain. On the mini-games, these agents learn to achieve a level of play that is comparable to a novice player. However, when trained on the main game, these agents are unable to make significant progress. Thus, SC2LE offers a new and challenging environment for exploring deep reinforcement learning algorithms and architectures.
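For readers who want to poke at the environment, here is the shape of a minimal scripted agent in the style of the open-source pysc2 examples. Treat it as a sketch: the module paths and launcher flags below are from the released package as I remember it and may differ between versions.

# no_op_agent.py -- a do-nothing baseline agent for SC2LE
from pysc2.agents import base_agent
from pysc2.lib import actions


class NoOpAgent(base_agent.BaseAgent):
    """Issues no_op every step; useful only as scaffolding for a real learned policy."""

    def step(self, obs):
        super(NoOpAgent, self).step(obs)
        # obs.observation carries the raw feature planes described in the paper;
        # a learned policy would map them to one of the available action functions.
        return actions.FunctionCall(actions.FUNCTIONS.no_op.id, [])

Something like python -m pysc2.bin.agent --map MoveToBeacon --agent no_op_agent.NoOpAgent then runs it on one of the mini-games mentioned in the abstract (again, exact flag names may vary by version).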

Tuesday, May 15, 2018

AGI in the Alps: Schmidhuber in Bloomberg


A nice profile of AI researcher Jürgen Schmidhuber in Bloomberg. I first met Schmidhuber at SciFoo some years ago. See also Deep Learning in Nature.
Bloomberg: ... Schmidhuber’s dreams of an AGI began in Bavaria. The middle-class son of an architect and a teacher, he grew up worshipping Einstein and aspired to go a step further. “As a teenager, I realized that the grandest thing that one could do as a human is to build something that learns to become smarter than a human,” he says while downing a latte. “Physics is such a fundamental thing, because it’s about the nature of the world and how the world works, but there is one more thing that you can do, which is build a better physicist.”

This goal has been Schmidhuber’s all-consuming obsession for four decades. His younger brother, Christof, remembers taking long family drives through the Alps with Jürgen philosophizing away in the back seat. “He told me that you can build intelligent robots that are smarter than we are,” Christof says. “He also said that you could rebuild a brain atom by atom, and that you could do it using copper wires instead of our slow neurons as the connections. Intuitively, I rebelled against this idea that a manufactured brain could mimic a human’s feelings and free will. But eventually, I realized he was right.” Christof went on to work as a researcher in nuclear physics before settling into a career in finance.

... AGI is far from inevitable. At present, humans must do an incredible amount of handholding to get AI systems to work. Translations often stink, computers mistake hot dogs for dachshunds, and self-driving cars crash. Schmidhuber, though, sees an AGI as a matter of time. After a brief period in which the company with the best one piles up a great fortune, he says, the future of machine labor will reshape societies around the world.

“In the not-so-distant future, I will be able to talk to a little robot and teach it to do complicated things, such as assembling a smartphone just by show and tell, making T-shirts, and all these things that are currently done under slavelike conditions by poor kids in developing countries,” he says. “Humans are going to live longer, healthier, happier, and easier lives, because lots of jobs that are now demanding on humans are going to be replaced by machines. Then there will be trillions of different types of AIs and a rapidly changing, complex AI ecology expanding in a way where humans cannot even follow.” ...
Schmidhuber has annoyed many of his colleagues in AI by insisting on proper credit assignment for groundbreaking work done in earlier decades. Because neural networks languished in obscurity through the 1980s and 1990s, a lot of theoretical ideas that were developed then do not today get the recognition they deserve.

Schmidhuber points out that machine learning is itself based on accurate credit assignment. Good learning algorithms assign higher weights to features or signals that correctly predict outcomes, and lower weights to those that are not predictive. His analogy between science itself and machine learning is often lost on critics.
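To make the weight-based notion of credit assignment concrete, here is a single gradient step of logistic regression (an ordinary textbook example, nothing specific to Schmidhuber's work): features that pointed toward the observed label gain weight, and features that pointed away are pushed the other way.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, -1.0])   # one example with three features
y = 1.0                          # positive label
w = np.zeros(3)

pred = sigmoid(w @ x)            # 0.5 before any learning
grad = (pred - y) * x            # gradient of the log loss with respect to w
w -= 1.0 * grad                  # learning rate 1 for clarity

print(w)                         # [0.5, 0., -0.5]: each weight moves so that w_i * x_i pushes toward the label

The middle feature, which carried no signal on this example, receives neither credit nor blame.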

What is still missing on the road to AGI:
... Ancient algorithms running on modern hardware can already achieve superhuman results in limited domains, and this trend will accelerate. But current commercial AI algorithms are still missing something fundamental. They are not self-referential general-purpose learning algorithms. They improve some system’s performance in a given limited domain, but they are unable to inspect and improve their own learning algorithm. They do not learn the way they learn, and the way they learn the way they learn, and so on (limited only by the fundamental limits of computability). As I wrote in the earlier reply: "I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on this topic." However, additional algorithmic breakthroughs may be necessary to make this a practical reality.

Sunday, January 31, 2016

Deep Neural Nets and Go: AlphaGo beats European champion

I'm surprised that this happened so fast. I guess I need to update some priors :-)

AlphaGo uses two neural nets, one for move selection ("policy") and the other for position evaluation ("value"), combined with Monte Carlo tree search (MCTS). Its strength is roughly within the top 1000 or so of all human players. In a few months it is scheduled to play one of the very best players in the world.
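To make the policy/value/search combination concrete, here is a bare-bones version of the kind of selection rule such a search applies at each node (a PUCT-style formula; the constant and details are illustrative rather than AlphaGo's exact ones).

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                 # move probability assigned by the policy network
    visits: int = 0
    value_sum: float = 0.0       # accumulated evaluations from the value network / rollouts
    children: dict = field(default_factory=dict)

def select_child(node, c_puct=1.5):
    """Pick the child maximizing Q + U: exploit good evaluations, explore high-prior, rarely visited moves."""
    total = sum(ch.visits for ch in node.children.values())
    def score(ch):
        q = ch.value_sum / ch.visits if ch.visits else 0.0
        u = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
        return q + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

root = Node(prior=1.0, children={"D4": Node(prior=0.40),
                                 "Q16": Node(prior=0.35),
                                 "K10": Node(prior=0.25)})
move, child = select_child(root)  # before any visits, the highest-prior move (D4) is explored first

The policy network thus narrows the search to plausible moves, while the value network (and rollouts) supply the Q estimates; that is why the tree can stay so much smaller than a brute-force search.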

For training they used a database of about 30 million positions from expert games (from the KGS Go Server). I have no intuition as to whether this is enough data to train the policy and value NNs. The quality of these NNs must be relatively good, because the Monte Carlo tree search used is much smaller than the brute-force search Deep Blue needed with its hand-crafted evaluation function.

Some grandmasters who reviewed AlphaGo's games were impressed by the "humanlike" quality of its play. More discussion: HNN, Reddit.
Mastering the game of Go with deep neural networks and tree search

Nature 529, 484–489 (28 January 2016) doi:10.1038/nature16961

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.


Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution p_σ(a|s) or p_ρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value v_θ(s′) that predicts the expected outcome in position s′.


Related News: commenter STS points me to work showing a correspondence between deep learning and the renormalization group (RG) in physics. See also Quanta magazine. The key aspect of RG here is the identification of the important degrees of freedom in the process of coarse graining. These are the degrees of freedom that appear in so-called effective field theories in particle physics.


These are the days of miracle and wonder!

Saturday, August 16, 2014

Neural Networks and Deep Learning



One of the SCI FOO sessions I enjoyed the most this year was a discussion of deep learning by AI researcher Juergen Schmidhuber. For an overview of recent progress, see this paper. Also of interest: Michael Nielsen's pedagogical book project.

An application which especially caught my attention is described by Schmidhuber here:
Many traditional methods of Evolutionary Computation [15-19] can evolve problem solvers with hundreds of parameters, but not millions. Ours can [1,2], by greatly reducing the search space through evolving compact, compressed descriptions [3-8] of huge solvers. For example, a Recurrent Neural Network [34-36] with over a million synapses or weights learned (without a teacher) to drive a simulated car based on a high-dimensional video-like visual input stream.
More details here. They trained a deep neural net to drive a car using visual input (pixels from the driver's perspective, generated by a video game); the output consists of steering orientation and accelerator/brake activation. There was no hard-coded structure corresponding to physics -- the neural net optimized a utility function primarily defined by the time between crashes. It learned how to drive the car around the track after fewer than 10k training sessions.
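Here is a toy sketch of the "compressed description" idea, my own illustration rather than their code: the genome being evolved is a handful of low-frequency DCT coefficients, which are decoded into a much larger weight matrix, so the evolutionary search happens in a space of 16 numbers instead of 4,096. The real objective was of course driving performance in the simulator, not the stand-in fitness used here.

import numpy as np
from scipy.fftpack import idct

def decode(genome, shape=(64, 64)):
    """Place the short genome as low-frequency 2-D DCT coefficients and invert to get full weights."""
    k = int(len(genome) ** 0.5)
    coeffs = np.zeros(shape)
    coeffs[:k, :k] = np.asarray(genome).reshape(k, k)
    return idct(idct(coeffs, axis=0, norm="ortho"), axis=1, norm="ortho")

def fitness(weights, target):
    """Stand-in objective; the real one would be something like time between crashes."""
    return -np.linalg.norm(weights - target)

rng = np.random.default_rng(0)
target = decode(rng.normal(size=16))   # pretend a 'good' 64 x 64 weight matrix exists
genome = rng.normal(size=16)           # 16 coefficients stand in for 4,096 weights
best = fitness(decode(genome), target)

for _ in range(500):                   # simple (1+1) evolution strategy on the compact genome
    cand = genome + 0.1 * rng.normal(size=16)
    f = fitness(decode(cand), target)
    if f > best:
        genome, best = cand, f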

For some earlier discussion of deep neural nets and their application to language translation, see here. Schmidhuber has also worked on Solomonoff universal induction.

These TED videos give you some flavor of Schmidhuber's sense of humor :-) Apparently his younger brother (mentioned in the first video) has transitioned from theoretical physics to algorithmic finance. Schmidhuber on China.


