Sunday, February 07, 2021

Gradient Descent Models Are Kernel Machines (Deep Learning)

This paper shows that models which result from gradient descent training (e.g., deep neural nets) can be expressed as a weighted sum of similarity functions (kernels) which measure the similarity of a given instance to the examples used in training. The kernels are defined by the inner product of model gradients in the parameter space, integrated over the descent (learning) path.

Roughly speaking, two data points x and x' are similar, i.e., have a large kernel value K(x,x'), if they have similar effects on the model parameters during gradient descent. With respect to the learning algorithm, x and x' then have similar information content. The learned model y = f(x) matches x to similar data points x_i: the resulting value y is simply a weighted (linear) sum of kernel values K(x,x_i).

This result makes it very clear that without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples. (See the note added at the bottom for more on this point, re: AGI, etc.) Given the complexity (e.g., dimensionality) of the ground truth model, one can place bounds on the amount of data required for successful training.

This formulation locates the nonlinearity of deep learning models in the kernel function. The superposition of kernels is entirely linear as long as the loss function is additive over training data.
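
A rough numerical illustration of this linear superposition, as a sketch under stated assumptions rather than anything taken from the paper: the tiny tanh network, the squared loss, the toy data, and all function names below are hypothetical choices made for this example. Each gradient step changes f(x) by approximately -lr * sum_i L'_i * grad f(x) . grad f(x_i), so summing these terms along the training path (a discrete stand-in for the integrated kernel) should approximately reproduce the trained model's output at a query point.

```python
import numpy as np

def f(params, x):
    # Toy model (an assumption for this sketch): f(x) = v . tanh(W x)
    W, v = params
    return v @ np.tanh(W @ x)

def grad_f(params, x):
    # Gradient of f with respect to all parameters, flattened as [W, v]
    W, v = params
    h = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - h ** 2), x)
    return np.concatenate([dW.ravel(), h])

def train_and_track(X, y, x_query, steps=3000, lr=1e-4, hidden=8, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.5, size=(hidden, X.shape[1]))
    v = rng.normal(scale=0.5, size=hidden)
    f0 = f((W, v), x_query)        # initial model output at the query
    kernel_sum = 0.0               # accumulated kernel contributions
    for _ in range(steps):
        params = (W, v)
        G = np.stack([grad_f(params, xi) for xi in X])      # (n, p)
        resid = np.array([f(params, xi) for xi in X]) - y   # L'_i for squared loss
        tangent_k = G @ grad_f(params, x_query)             # K_t(x, x_i), shape (n,)
        # First-order change in f(x_query) caused by this gradient step
        kernel_sum += -lr * (resid @ tangent_k)
        # Ordinary gradient descent step on 0.5 * sum_i (f(x_i) - y_i)^2
        step = lr * (resid @ G)
        W = W - step[:W.size].reshape(W.shape)
        v = v - step[W.size:]
    return f0 + kernel_sum, f((W, v), x_query)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
x_query = np.array([0.3, -0.2])

kernel_pred, direct = train_and_track(X, y, x_query)
print("kernel-sum prediction:", kernel_pred)
print("trained model output: ", direct)
```

The two printed numbers should agree up to an error that shrinks with the learning rate, which is the sense in which the result is "approximately" a kernel machine: the identity becomes exact only in the limit of infinitesimal steps.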
 
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine  
P. Domingos      
https://arxiv.org/pdf/2012.00152.pdf
Deep learning’s successes are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features like other learning methods. We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel. This improved understanding should lead to better learning algorithms.
From the paper:
... Here we show that every model learned by this method, regardless of architecture, is approximately equivalent to a kernel machine with a particular type of kernel. This kernel measures the similarity of the model at two data points in the neighborhood of the path taken by the model parameters during learning. Kernel machines store a subset of the training data points and match them to the query using the kernel. Deep network weights can thus be seen as a superposition of the training data points in the kernel’s feature space, enabling their efficient storage and matching. This contrasts with the standard view of deep learning as a method for discovering representations from data. ... 
... the weights of a deep network have a straightforward interpretation as a superposition of the training examples in gradient space, where each example is represented by the corresponding gradient of the model. Fig. 2 illustrates this. One well-studied approach to interpreting the output of deep networks involves looking for training instances that are close to the query in Euclidean or some other simple space (Ribeiro et al., 2016). Path kernels tell us what the exact space for these comparisons should be, and how it relates to the model’s predictions. ...
See also this video which discusses the paper. 

You can almost grasp the result from the figure and definitions below.
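
For reference, a LaTeX sketch of the key definitions and the main result, following the paper's notation as closely as I can reproduce it here (consult the paper for the exact statement and conditions). Here w are the model parameters, c(t) is the path they trace during gradient descent, w_0 is the initial parameter vector, and L' is the derivative of the loss with respect to the model output.

```latex
% Tangent kernel: inner product of model gradients at fixed parameters w
K^{g}_{f,w}(x, x') = \nabla_w f_w(x) \cdot \nabla_w f_w(x')

% Path kernel: tangent kernel integrated along the learning path c(t)
K^{p}_{f,c}(x, x') = \int_{c(t)} K^{g}_{f,w(t)}(x, x') \, dt

% Main result: in the limit of small learning rate, the learned model is
y = \sum_{i=1}^{m} a_i \, K^{p}_{f,c}(x, x_i) + b,
\qquad a_i = -\overline{L'}(y_i^{*}, y_i),
\qquad b = f_{w_0}(x)
```

Here the bar denotes the average of the loss derivative along the path, weighted by the tangent kernel. Note that the coefficients a_i and the offset b depend on the query x, which is one way in which this differs from a textbook kernel machine.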

Note Added:
I was asked to elaborate further on this sentence, especially regarding AGI and human cognition: 

... without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples.

It should not be taken as a suggestion that gradient descent models can't achieve AGI, or that our minds can't be (effectively) models of this kernel type. 

1. The universe is highly compressible: it is governed by very simple effective models. These models can be learned, which allows for prediction beyond specific examples.

2. A sufficiently complex neural net can incorporate layers of abstraction. Thus a new instance and a previously seen example might be similar in an abstract (non-explicit) sense, but that similarity is still incorporated into the kernel. When Einstein invented Special Relativity he was not exactly aping another physical theory he had seen before, but at an abstract level the physical constraint (speed of light constant in all reference frames) and algebraic incorporation of this fact into a description of spacetime (Lorentz symmetry) may have been "similar" to examples he had seen already in simple geometry / algebra. (See Poincare and Einstein for more.)
Ulam: Banach once told me, "Good mathematicians see analogies between theorems or theories, the very best ones see analogies between analogies." Gamow possessed this ability to see analogies between models for physical theories to an almost uncanny degree... 
