## Wednesday, May 12, 2021

### Neural Tangent Kernels and Theoretical Foundations of Deep Learning

A colleague recommended this paper to me recently. See also earlier post Gradient Descent Models Are Kernel Machines.
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Arthur Jacot, Franck Gabriel, Clément Hongler
At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function fθ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function fθ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
The results are remarkably well summarized in the wikipedia entry on Neural Tangent Kernels:

An Artificial Neural Network (ANN) with scalar output consists in a family of functions ${\displaystyle f\left(\cdot ,\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R} }$ parametrized by a vector of parameters ${\displaystyle \theta \in \mathbb {R} ^{P}}$.

The Neural Tangent Kernel (NTK) is a kernel ${\displaystyle \Theta :\mathbb {R} ^{n_{\mathrm {in} }}\times \mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R} }$ defined by

${\displaystyle \Theta \left(x,y;\theta \right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f\left(x;\theta \right)\partial _{\theta _{p}}f\left(y;\theta \right).}$
In the language of kernel methods, the NTK ${\displaystyle \Theta }$ is the kernel associated with the feature map ${\displaystyle \left(x\mapsto \partial _{\theta _{p}}f\left(x;\theta \right)\right)_{p=1,\ldots ,P}}$
This is a very brief (3 minute) summary by the first author:

This 15 minute IAS talk gives a nice overview of the results, and their relation to fundamental questions (both emprirical and theoretical) in deep learning.

I hope to find time to explore this in more depth. Large width seems to provide a limiting case (analogous to the large-N limit in gauge theory) in which rigorous results about deep learning can be proved.

Some naive questions:

What is the expansion parameter of the finite width expansion?

What role does concentration of measure play in the results?

Is there a similar result for the infinite connection or infinite neurons limit? (Relaxing the infinite width requirement to something like infinite volume, perhaps with a W/L threshold?)

Simplification seems to be a consequence of overparametrization. But the proof method seems to apply to a regularized (but still convex, e.g., using L1 penalization) loss function that imposes sparsity. It would be interesting to examine this specific case in more detail.

## Saturday, May 08, 2021

### Three Thousand Years and 115 Generations of 徐 (Hsu / Xu)

Over the years I have discussed economic historian Greg Clark's groundbreaking work on the persistence of social class. Clark found that intergenerational social mobility was much less than previously thought, and that intergenerational correlations on traits such as education and occupation were consistent with predictions from an additive genetic model with a high degree of assortative mating.

See Genetic correlation of social outcomes between relatives (Fisher 1918) tested using lineage of 400k English individuals, and further links therein. Also recommended: this recent podcast interview Clark did with Razib Khan.

The other day a reader familiar with Clark's work asked me about my family background. Obviously my own family history is not a scientific validation of Clark's work, being only a single (if potentially illustrative) example. Nevertheless it provides an interesting microcosm of the tumult of 20th century China and a window into the deep past...

I described my father's background in the post Hsu Scholarship at Caltech:
Cheng Ting Hsu was born December 1, 1923 in Wenling, Zhejiang province, China. His grandfather, Zan Yao Hsu was a poet and doctor of Chinese medicine. His father, Guang Qiu Hsu graduated from college in the 1920's and was an educator, lawyer and poet.
Cheng Ting was admitted at age 16 to the elite National Southwest Unified University (Lianda), which was created during WWII by merging Tsinghua, Beijing, and Nankai Universities. This university produced numerous famous scientists and scholars such as the physicists C.N. Yang and T.D. Lee.
Cheng Ting studied aerospace engineering (originally part of Tsinghua), graduating in 1944. He became a research assistant at China's Aerospace Research Institute and a lecturer at Sichuan University. He also taught aerodynamics for several years to advanced students at the air force engineering academy.
In 1946 he was awarded one of only two Ministry of Education fellowships in his field to pursue graduate work in the United States. In 1946-1947 he published a three-volume book, co-authored with Professor Li Shoutong, on the structures of thin-walled airplanes.
In January 1948, he left China by ocean liner, crossing the Pacific and arriving in San Francisco. ...
My mother's father was a KMT general, and her family related to Chiang Kai Shek by marriage. Both my grandfather and Chiang attended the military academy Shinbu Gakko in Tokyo. When the KMT lost to the communists, her family fled China and arrived in Taiwan in 1949. My mother's family had been converted to Christianity in the 19th century and became Methodists, like Sun Yat Sen. (I attended Methodist Sunday school while growing up in Ames IA.) My grandfather was a partner of T.V. Soong in the distribution of bibles in China in the early 20th century.

My father's family remained mostly in Zhejiang and suffered through the communist takeover, Great Leap Forward, and Cultural Revolution. My father never returned to China and never saw his parents again.

When I met my uncle (a retired Tsinghua professor) and some of my cousins in Hangzhou in 2010, they gave me a four volume family history that had originally been printed in the 1930s. The Hsu (Xu) lineage began in the 10th century BC and continued to my father, in the 113th generation. His entry is the bottom photo below.
Wikipedia: The State of Xu (Chinese: 徐) (also called Xu Rong (徐戎) or Xu Yi (徐夷)[a] by its enemies)[4][5] was an independent Huaiyi state of the Chinese Bronze Age[6] that was ruled by the Ying family (嬴) and controlled much of the Huai River valley for at least two centuries.[3][7] It was centered in northern Jiangsu and Anhui. ...

Generations 114 and 115:

Four volume history of the Hsu (Xu) family, beginning in the 10th century BC. The first 67 generations are covered rather briefly, only indicating prominent individuals in each generation of the family tree. The books are mostly devoted to generations 68-113 living in Zhejiang. (Earlier I wrote that it was two volumes, but it's actually four. The printing that I have is two thick books.)

## Sunday, May 02, 2021

### 40 Years of Quantum Computation and Quantum Information

This is a great article on the 1981 conference which one could say gave birth to quantum computing / quantum information.
Technology Review: Quantum computing as we know it got its start 40 years ago this spring at the first Physics of Computation Conference, organized at MIT’s Endicott House by MIT and IBM and attended by nearly 50 researchers from computing and physics—two groups that rarely rubbed shoulders.
Twenty years earlier, in 1961, an IBM researcher named Rolf Landauer had found a fundamental link between the two fields: he proved that every time a computer erases a bit of information, a tiny bit of heat is produced, corresponding to the entropy increase in the system. In 1972 Landauer hired the theoretical computer scientist Charlie Bennett, who showed that the increase in entropy can be avoided by a computer that performs its computations in a reversible manner. Curiously, Ed Fredkin, the MIT professor who cosponsored the Endicott Conference with Landauer, had arrived at this same conclusion independently, despite never having earned even an undergraduate degree. Indeed, most retellings of quantum computing’s origin story overlook Fredkin’s pivotal role.
Fredkin’s unusual career began when he enrolled at the California Institute of Technology in 1951. Although brilliant on his entrance exams, he wasn’t interested in homework—and had to work two jobs to pay tuition. Doing poorly in school and running out of money, he withdrew in 1952 and enlisted in the Air Force to avoid being drafted for the Korean War.
A few years later, the Air Force sent Fredkin to MIT Lincoln Laboratory to help test the nascent SAGE air defense system. He learned computer programming and soon became one of the best programmers in the world—a group that probably numbered only around 500 at the time.
Upon leaving the Air Force in 1958, Fredkin worked at Bolt, Beranek, and Newman (BBN), which he convinced to purchase its first two computers and where he got to know MIT professors Marvin Minsky and John McCarthy, who together had pretty much established the field of artificial intelligence. In 1962 he accompanied them to Caltech, where McCarthy was giving a talk. There Minsky and Fredkin met with Richard Feynman ’39, who would win the 1965 Nobel Prize in physics for his work on quantum electrodynamics. Feynman showed them a handwritten notebook filled with computations and challenged them to develop software that could perform symbolic mathematical computations. ...
... in 1974 he headed back to Caltech to spend a year with Feynman. The deal was that Fredkin would teach Feynman computing, and Feynman would teach Fredkin quantum physics. Fredkin came to understand quantum physics, but he didn’t believe it. He thought the fabric of reality couldn’t be based on something that could be described by a continuous measurement. Quantum mechanics holds that quantities like charge and mass are quantized—made up of discrete, countable units that cannot be subdivided—but that things like space, time, and wave equations are fundamentally continuous. Fredkin, in contrast, believed (and still believes) with almost religious conviction that space and time must be quantized as well, and that the fundamental building block of reality is thus computation. Reality must be a computer! In 1978 Fredkin taught a graduate course at MIT called Digital Physics, which explored ways of reworking modern physics along such digital principles.
Feynman, however, remained unconvinced that there were meaningful connections between computing and physics beyond using computers to compute algorithms. So when Fredkin asked his friend to deliver the keynote address at the 1981 conference, he initially refused. When promised that he could speak about whatever he wanted, though, Feynman changed his mind—and laid out his ideas for how to link the two fields in a detailed talk that proposed a way to perform computations using quantum effects themselves.
Feynman explained that computers are poorly equipped to help simulate, and thereby predict, the outcome of experiments in particle physics—something that’s still true today. Modern computers, after all, are deterministic: give them the same problem, and they come up with the same solution. Physics, on the other hand, is probabilistic. So as the number of particles in a simulation increases, it takes exponentially longer to perform the necessary computations on possible outputs. The way to move forward, Feynman asserted, was to build a computer that performed its probabilistic computations using quantum mechanics.
[ Note to reader: the discussion in the last sentences above is a bit garbled. The exponential difficulty that classical computers have with quantum calculations has to do with entangled states which live in Hilbert spaces of exponentially large dimension. Probability is not really the issue; the issue is the huge size of the space of possible states. Indeed quantum computations are strictly deterministic unitary operations acting in this Hilbert space. ]

Feynman hadn’t prepared a formal paper for the conference, but with the help of Norm Margolus, PhD ’87, a graduate student in Fredkin’s group who recorded and transcribed what he said there, his talk was published in the International Journal of Theoretical Physics under the title “Simulating Physics with Computers.” ...

Feynman's 1981 lecture Simulating Physics With Computers.

Fredkin was correct about the (effective) discreteness of spacetime, although he probably did not realize this is a consequence of gravitational effects: see, e.g., Minimum Length From First Principles. In fact, Hilbert Space (the state space of quantum mechanics) itself may be discrete.

Related:

My paper on the Margolus-Levitin Theorem in light of gravity:

We derive a fundamental upper bound on the rate at which a device can process information (i.e., the number of logical operations per unit time), arising from quantum mechanics and general relativity. In Planck units a device of volume V can execute no more than the cube root of V operations per unit time. We compare this to the rate of information processing performed by nature in the evolution of physical systems, and find a connection to black hole entropy and the holographic principle.

Participants in the 1981 meeting:

Physics of Computation Conference, Endicott House, MIT, May 6–8, 1981. 1 Freeman Dyson, 2 Gregory Chaitin, 3 James Crutchfield, 4 Norman Packard, 5 Panos Ligomenides, 6 Jerome Rothstein, 7 Carl Hewitt, 8 Norman Hardy, 9 Edward Fredkin, 10 Tom Toffoli, 11 Rolf Landauer, 12 John Wheeler, 13 Frederick Kantor, 14 David Leinweber, 15 Konrad Zuse, 16 Bernard Zeigler, 17 Carl Adam Petri, 18 Anatol Holt, 19 Roland Vollmar, 20 Hans Bremerman, 21 Donald Greenspan, 22 Markus Buettiker, 23 Otto Floberth, 24 Robert Lewis, 25 Robert Suaya, 26 Stand Kugell, 27 Bill Gosper, 28 Lutz Priese, 29 Madhu Gupta, 30 Paul Benioff, 31 Hans Moravec, 32 Ian Richards, 33 Marian Pour-El, 34 Danny Hillis, 35 Arthur Burks, 36 John Cocke, 37 George Michaels, 38 Richard Feynman, 39 Laurie Lingham, 40 P. S. Thiagarajan, 41 Marin Hassner, 42 Gerald Vichnaic, 43 Leonid Levin, 44 Lev Levitin, 45 Peter Gacs, 46 Dan Greenberger. (Photo courtesy Charles Bennett)