Information Processing: 05/2021

Wednesday, May 26, 2021

How Dominic Cummings And The Warner Brothers Saved The UK

Photo above shows the white board in the Prime Minister's office which Dominic Cummings and team (including the brothers Marc and Ben Warner) used to convince Boris Johnson to abandon the UK government COVID herd immunity plan and enter lockdown. Date: March 13 2020.

Only now can the full story be told. In early 2020 the UK government had a COVID herd immunity plan in place that would have resulted in disaster. The scientific experts (SAGE) advising the government strongly supported this plan -- there are public, on the record briefings to this effect. These are people who are not particularly good at order of magnitude estimates and first-principles reasoning.

Fortunately Dom was advised by the brothers Marc and Ben Warner (both physics PhDs, now working in AI and data science), DeepMind founder Demis Hassabis, Fields Medalist Tim Gowers, and others. In the testimony (see ~23m, ~35m, ~1h02m, ~1h06m in the video below) he describes the rather dramatic events that led to a switch from the original herd immunity plan to a lockdown Plan B. More details in this tweet thread.

I checked my emails with Dom during February and March, and they confirm his narrative. I wrote the March 9 blog post Covid-19 Notes in part for Dom and his team, and I think it holds up over time. Tim Gowers' document reaches similar conclusions.

Seven hours of riveting Dominic Cummings testimony from earlier today.

Transcript.

Shorter summary video (Channel 4). Summary live-blog from the Guardian.

This is a second white board used in the March 14 meeting with Boris Johnson:

Saturday, May 22, 2021

Feynman Lectures on the Strong Interactions (Jim Cline notes)

Professor James Cline (McGill University) recently posted a set of lecture notes from Feynman's last Caltech course, on quantum chromodynamics. Cline, then a graduate student, was one of the course TAs and the notes were meant to be assembled into a monograph. Thanks to Tim Raben for pointing these out to me.

The content seems a bit more elementary than in John Preskill's Ph230abc, a special topics course on QCD taught in 1983-4. I still consider John's notes to be one of the best overviews of nonperturbative aspects of QCD, a rather deep subject. However as Cline remarks there is unsurprisingly something special about the lectures: Feynman was an inspiring teacher, presenting everything in an incisive and fascinating way, that obviously had his own mark on it.

The material on QFT in non-integer spacetime dimensions is, as far as I know, original to Feynman. Dimensional regularization of gauge theory was popularized by 't Hooft and Veltman, but the analytic continuation to d = 4 - ε is specifc to the loop integrals (i.e., concrete mathematical expressions) that appear in perturbation theory. Here Feynman is, more ambitiously, exploring whether the quantum gauge theory itself can be meaningfully extended to a non-integer number of spacetime dimensions.

Feynman Lectures on the Strong Interactions

Richard P. Feynman, James M. Cline

These twenty-two lectures, with exercises, comprise the extent of what was meant to be a full-year graduate-level course on the strong interactions and QCD, given at Caltech in 1987-88. The course was cut short by the illness that led to Feynman's death. Several of the lectures were finalized in collaboration with Feynman for an anticipated monograph based on the course. The others, while retaining Feynman's idiosyncrasies, are revised similarly to those he was able to check. His distinctive approach and manner of presentation are manifest throughout. Near the end he suggests a novel, nonperturbative formulation of quantum field theory in D dimensions. Supplementary material is provided in appendices and ancillary files, including verbatim transcriptions of three lectures and the corresponding audiotaped recordings.

The image below is from some of Feynman's handwritten notes (in this case, about the Gribov ambiguity in Fadeev-Popov gauge fixing) that Cline included in the manuscript. There are also links to audio from some of the lectures. As in some earlier notebooks, Feynman sometimes writes "guage" instead of gauge.

Sunday, May 16, 2021

Ditchley Foundation meeting: China Today and Tomorrow

China Today and Tomorrow

20 MAY 2021 - 21 MAY 2021

This Ditchley conference will focus on China, its internal state and sense of self today, its role in the region and world, and how these might evolve in years to come.

There are broadly two current divergent narratives about China. The first is that China’s successful response to the pandemic has accelerated China’s ascent to be the world’s pre-eminent economic power. The Made in China 2025 strategy will also see China take the lead in some technologies beyond 5G, become self-sufficient in silicon chip production and free itself largely of external constraints on growth. China’s internal market will grow, lessening dependence on exports and that continued growth will maintain the bargain between the Chinese people and the Chinese Communist Party through prosperity and stability. Retaining some elements of previous Chinese strategy though, this confidence is combined with a degree of humility: China is concerned with itself and its region, not becoming a global superpower or challenging the US. Economic supremacy is the aim but military strategy remains focused on defence, not increasing international leverage or scope of action.

The second competing narrative is that China’s position is more precarious than it appears. The Belt and Road Initiative will bring diplomatic support from client countries but not real economic gains. Human rights violations will damage China abroad. Internally the pressures on natural resources will prove hard to sustain. Democratic and free-market innovation, combined with a bit more industrial strategy, will outstrip China’s efforts. Careful attention to supply chains in the West will meanwhile reduce critical reliance on China and curb China’s economic expansion. This perceived fragility is often combined though with a sense of heightened Chinese ambition abroad, not just through the Belt and Road Initiative but in challenging the democratic global norms established since 1989 by presenting technologically-enabled and effective authoritarian rule as an alternative model for the world, rather than just a Chinese solution.

What is the evidence today for where we should settle between these narratives? What trends should we watch to determine likely future results? ...

[Suggested background reading at link above.]

Unfortunately this meeting will be virtual. The video below gives some sense of the unique charm of in-person workshops at Ditchley.

See also this 2020 post about an earlier Ditchley meeting I attended: World Order Today

... analysis by German academic Gunnar Heinsohn. Two of his slides appear below.

1. It is possible that by 2050 the highly able STEM workforce in PRC will be ~10x larger than in the US and comparable to or larger than the rest of the world combined. Here "highly able" means roughly top few percentile math ability in developed countries (e.g., EU), as measured by PISA at age 15.

[ It is trivial to obtain this kind of estimate: PRC population is ~4x US population and fraction of university students in STEM is at least ~2x higher. Pool of highly able 15 year olds as estimated by PISA or TIMMS international testing regimes is much larger than in US, even per capita. Heinsohn's estimate is somewhat high because he uses PISA numbers that probably overstate the population fraction of Level 6 kids in PRC. Current PISA studies disproportionately sample from more developed areas of China. At bottom (asterisk) he uses results from Taiwan/Macau that give a smaller ~20x advantage of PRC vs USA. My own ~10x estimate is quite conservative in comparison. ]

2. The trajectory of international patent filings shown below is likely to continue.

Wednesday, May 12, 2021

Neural Tangent Kernels and Theoretical Foundations of Deep Learning

A colleague recommended this paper to me recently. See also earlier post Gradient Descent Models Are Kernel Machines.

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot, Franck Gabriel, Clément Hongler

At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function fθ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function fθ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.

The results are remarkably well summarized in the wikipedia entry on Neural Tangent Kernels:

For most common neural network architectures, in the limit of large layer width the NTK becomes constant. This enables simple closed form statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. For example, it guarantees that wide enough ANNs converge to a global minimum when trained to minimize an empirical loss. ...

An Artificial Neural Network (ANN) with scalar output consists in a family of functions $f\left(\cdot ,\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ parametrized by a vector of parameters $\theta \in \mathbb {R} ^{P}$ .
The Neural Tangent Kernel (NTK) is a kernel $\Theta :\mathbb {R} ^{n_{\mathrm {in} }}\times \mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ defined by
$\Theta \left(x,y;\theta \right)=\sum _{p=1}^{P}\partial _{\theta _{p}}f\left(x;\theta \right)\partial _{\theta _{p}}f\left(y;\theta \right).$ In the language of kernel methods, the NTK $\Theta$ is the kernel associated with the feature map $\left(x\mapsto \partial _{\theta _{p}}f\left(x;\theta \right)\right)_{p=1,\ldots ,P}$

For a dataset $\left(x_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R} ^{n_{\mathrm {in} }}$ with scalar labels $\left(z_{i}\right)_{i=1,\ldots ,n}\subset \mathbb {R}$ and a loss function $c:\mathbb {R} \times \mathbb {R} \to \mathbb {R}$ , the associated empirical loss, defined on functions $f:\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ , is given by
${\mathcal {C}}\left(f\right)=\sum _{i=1}^{n}c\left(f\left(x_{i}\right),z_{i}\right).$
When training the ANN $f\left(\cdot ;\theta \right):\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ is trained to fit the dataset (i.e. minimize ${\mathcal {C}}$ ) via continuous-time gradient descent, the parameters $\left(\theta \left(t\right)\right)_{t\geq 0}$ evolve through the ordinary differential equation:
$\partial _{t}\theta \left(t\right)=-\nabla {\mathcal {C}}\left(f\left(\cdot ;\theta \right)\right).$
During training the ANN output function follows an evolution differential equation given in terms of the NTK:
$\partial _{t}f\left(x;\theta \left(t\right)\right)=-\sum _{i=1}^{n}\Theta \left(x,x_{i};\theta \right)\partial _{w}c\left(w,z_{i}\right){\Big |}_{w=f\left(x_{i};\theta \left(t\right)\right)}.$
This equation shows how the NTK drives the dynamics of $f\left(\cdot ;\theta \left(t\right)\right)$ in the space of functions $\mathbb {R} ^{n_{\mathrm {in} }}\to \mathbb {R}$ during training.

This is a very brief (3 minute) summary by the first author:

This 15 minute IAS talk gives a nice overview of the results, and their relation to fundamental questions (both empirical and theoretical) in deep learning. Longer (30m) version: On the Connection between Neural Networks and Kernels: a Modern Perspective.

I hope to find time to explore this in more depth. Large width seems to provide a limiting case (analogous to the large-N limit in gauge theory) in which rigorous results about deep learning can be proved.

Some naive questions:

What is the expansion parameter of the finite width expansion?

What role does concentration of measure play in the results? (See 30m video linked above.)

Simplification seems to be a consequence of overparametrization. But the proof method seems to apply to a regularized (but still convex, e.g., using L1 penalization) loss function that imposes sparsity. It would be interesting to examine this specific case in more detail.

Notes to self:

The overparametrized (width ~ w^2) network starts in a random state and by concentration of measure this initial kernel K is just the expectation, which is the NTK. Because of the large number of parameters the effect of training (i.e., gradient descent) on any individual parameter is 1/w, and the change in the eigenvalue spectrum of K is also 1/w. It can be shown that the eigenvalue spectrum is positive and bounded away from zero, and this property does not change under training. Also, the evolution of f is linear in K up to corrections with are suppressed by 1/w. Hence evolution follows a convex trajectory and can achieve global minimum loss in a finite (polynomial) time.

The parametric 1/w expansion may depend on quantities such as the smallest NTK eigenvalue k: the proof might require k >> 1/w or wk large.

In the large w limit the function space has such high dimensionality that any typical initial f is close (within a ball of radius 1/w?) to an optimal f.

These properties depend on specific choice of loss function.

Saturday, May 08, 2021

Three Thousand Years and 115 Generations of 徐 (Hsu / Xu)

Over the years I have discussed economic historian Greg Clark's groundbreaking work on the persistence of social class. Clark found that intergenerational social mobility was much less than previously thought, and that intergenerational correlations on traits such as education and occupation were consistent with predictions from an additive genetic model with a high degree of assortative mating.

See Genetic correlation of social outcomes between relatives (Fisher 1918) tested using lineage of 400k English individuals, and further links therein. Also recommended: this recent podcast interview Clark did with Razib Khan.

The other day a reader familiar with Clark's work asked me about my family background. Obviously my own family history is not a scientific validation of Clark's work, being only a single (if potentially illustrative) example. Nevertheless it provides an interesting microcosm of the tumult of 20th century China and a window into the deep past...

I described my father's background in the post Hsu Scholarship at Caltech:

Cheng Ting Hsu was born December 1, 1923 in Wenling, Zhejiang province, China. His grandfather, Zan Yao Hsu was a poet and doctor of Chinese medicine. His father, Guang Qiu Hsu graduated from college in the 1920's and was an educator, lawyer and poet.

Cheng Ting was admitted at age 16 to the elite National Southwest Unified University (Lianda), which was created during WWII by merging Tsinghua, Beijing, and Nankai Universities. This university produced numerous famous scientists and scholars such as the physicists C.N. Yang and T.D. Lee.

Cheng Ting studied aerospace engineering (originally part of Tsinghua), graduating in 1944. He became a research assistant at China's Aerospace Research Institute and a lecturer at Sichuan University. He also taught aerodynamics for several years to advanced students at the air force engineering academy.

In 1946 he was awarded one of only two Ministry of Education fellowships in his field to pursue graduate work in the United States. In 1946-1947 he published a three-volume book, co-authored with Professor Li Shoutong, on the structures of thin-walled airplanes.

In January 1948, he left China by ocean liner, crossing the Pacific and arriving in San Francisco. ...

My mother's father was a KMT general, and her family related to Chiang Kai Shek by marriage. Both my grandfather and Chiang attended the military academy Shinbu Gakko in Tokyo. When the KMT lost to the communists, her family fled China and arrived in Taiwan in 1949. My mother's family had been converted to Christianity in the 19th century and became Methodists, like Sun Yat Sen. (I attended Methodist Sunday school while growing up in Ames IA.) My grandfather was a partner of T.V. Soong in the distribution of bibles in China in the early 20th century.

My father's family remained mostly in Zhejiang and suffered through the communist takeover, Great Leap Forward, and Cultural Revolution. My father never returned to China and never saw his parents again.

When I met my uncle (a retired Tsinghua professor) and some of my cousins in Hangzhou in 2010, they gave me a four volume family history that had originally been printed in the 1930s. The Hsu (Xu) lineage began in the 10th century BC and continued to my father, in the 113th generation. His entry is the bottom photo below.

Wikipedia: The State of Xu (Chinese: 徐) (also called Xu Rong (徐戎) or Xu Yi (徐夷)[a] by its enemies)[4][5] was an independent Huaiyi state of the Chinese Bronze Age[6] that was ruled by the Ying family (嬴) and controlled much of the Huai River valley for at least two centuries.[3][7] It was centered in northern Jiangsu and Anhui. ...

Generations 114 and 115:

Four volume history of the Hsu (Xu) family, beginning in the 10th century BC. The first 67 generations are covered rather briefly, only indicating prominent individuals in each generation of the family tree. The books are mostly devoted to generations 68-113 living in Zhejiang. (Earlier I wrote that it was two volumes, but it's actually four. The printing that I have is two thick books.)

Sunday, May 02, 2021

40 Years of Quantum Computation and Quantum Information

This is a great article on the 1981 conference which one could say gave birth to quantum computing / quantum information.

Technology Review: Quantum computing as we know it got its start 40 years ago this spring at the first Physics of Computation Conference, organized at MIT’s Endicott House by MIT and IBM and attended by nearly 50 researchers from computing and physics—two groups that rarely rubbed shoulders.

Twenty years earlier, in 1961, an IBM researcher named Rolf Landauer had found a fundamental link between the two fields: he proved that every time a computer erases a bit of information, a tiny bit of heat is produced, corresponding to the entropy increase in the system. In 1972 Landauer hired the theoretical computer scientist Charlie Bennett, who showed that the increase in entropy can be avoided by a computer that performs its computations in a reversible manner. Curiously, Ed Fredkin, the MIT professor who cosponsored the Endicott Conference with Landauer, had arrived at this same conclusion independently, despite never having earned even an undergraduate degree. Indeed, most retellings of quantum computing’s origin story overlook Fredkin’s pivotal role.

Fredkin’s unusual career began when he enrolled at the California Institute of Technology in 1951. Although brilliant on his entrance exams, he wasn’t interested in homework—and had to work two jobs to pay tuition. Doing poorly in school and running out of money, he withdrew in 1952 and enlisted in the Air Force to avoid being drafted for the Korean War.

A few years later, the Air Force sent Fredkin to MIT Lincoln Laboratory to help test the nascent SAGE air defense system. He learned computer programming and soon became one of the best programmers in the world—a group that probably numbered only around 500 at the time.

Upon leaving the Air Force in 1958, Fredkin worked at Bolt, Beranek, and Newman (BBN), which he convinced to purchase its first two computers and where he got to know MIT professors Marvin Minsky and John McCarthy, who together had pretty much established the field of artificial intelligence. In 1962 he accompanied them to Caltech, where McCarthy was giving a talk. There Minsky and Fredkin met with Richard Feynman ’39, who would win the 1965 Nobel Prize in physics for his work on quantum electrodynamics. Feynman showed them a handwritten notebook filled with computations and challenged them to develop software that could perform symbolic mathematical computations. ...

... in 1974 he headed back to Caltech to spend a year with Feynman. The deal was that Fredkin would teach Feynman computing, and Feynman would teach Fredkin quantum physics. Fredkin came to understand quantum physics, but he didn’t believe it. He thought the fabric of reality couldn’t be based on something that could be described by a continuous measurement. Quantum mechanics holds that quantities like charge and mass are quantized—made up of discrete, countable units that cannot be subdivided—but that things like space, time, and wave equations are fundamentally continuous. Fredkin, in contrast, believed (and still believes) with almost religious conviction that space and time must be quantized as well, and that the fundamental building block of reality is thus computation. Reality must be a computer! In 1978 Fredkin taught a graduate course at MIT called Digital Physics, which explored ways of reworking modern physics along such digital principles.

Feynman, however, remained unconvinced that there were meaningful connections between computing and physics beyond using computers to compute algorithms. So when Fredkin asked his friend to deliver the keynote address at the 1981 conference, he initially refused. When promised that he could speak about whatever he wanted, though, Feynman changed his mind—and laid out his ideas for how to link the two fields in a detailed talk that proposed a way to perform computations using quantum effects themselves.

Feynman explained that computers are poorly equipped to help simulate, and thereby predict, the outcome of experiments in particle physics—something that’s still true today. Modern computers, after all, are deterministic: give them the same problem, and they come up with the same solution. Physics, on the other hand, is probabilistic. So as the number of particles in a simulation increases, it takes exponentially longer to perform the necessary computations on possible outputs. The way to move forward, Feynman asserted, was to build a computer that performed its probabilistic computations using quantum mechanics.

[ Note to reader: the discussion in the last sentences above is a bit garbled. The exponential difficulty that classical computers have with quantum calculations has to do with entangled states which live in Hilbert spaces of exponentially large dimension. Probability is not really the issue; the issue is the huge size of the space of possible states. Indeed quantum computations are strictly deterministic unitary operations acting in this Hilbert space. ]

Feynman hadn’t prepared a formal paper for the conference, but with the help of Norm Margolus, PhD ’87, a graduate student in Fredkin’s group who recorded and transcribed what he said there, his talk was published in the International Journal of Theoretical Physics under the title “Simulating Physics with Computers.” ...

Feynman's 1981 lecture Simulating Physics With Computers.

Fredkin was correct about the (effective) discreteness of spacetime, although he probably did not realize this is a consequence of gravitational effects: see, e.g., Minimum Length From First Principles. In fact, Hilbert Space (the state space of quantum mechanics) itself may be discrete.

The Quantum Simulation Hypothesis

Hedgehogs, Foxes, and Feynman

My paper on the Margolus-Levitin Theorem in light of gravity:

Physical limits on information processing

We derive a fundamental upper bound on the rate at which a device can process information (i.e., the number of logical operations per unit time), arising from quantum mechanics and general relativity. In Planck units a device of volume V can execute no more than the cube root of V operations per unit time. We compare this to the rate of information processing performed by nature in the evolution of physical systems, and find a connection to black hole entropy and the holographic principle.

Participants in the 1981 meeting:

Physics of Computation Conference, Endicott House, MIT, May 6–8, 1981. 1 Freeman Dyson, 2 Gregory Chaitin, 3 James Crutchfield, 4 Norman Packard, 5 Panos Ligomenides, 6 Jerome Rothstein, 7 Carl Hewitt, 8 Norman Hardy, 9 Edward Fredkin, 10 Tom Toffoli, 11 Rolf Landauer, 12 John Wheeler, 13 Frederick Kantor, 14 David Leinweber, 15 Konrad Zuse, 16 Bernard Zeigler, 17 Carl Adam Petri, 18 Anatol Holt, 19 Roland Vollmar, 20 Hans Bremerman, 21 Donald Greenspan, 22 Markus Buettiker, 23 Otto Floberth, 24 Robert Lewis, 25 Robert Suaya, 26 Stand Kugell, 27 Bill Gosper, 28 Lutz Priese, 29 Madhu Gupta, 30 Paul Benioff, 31 Hans Moravec, 32 Ian Richards, 33 Marian Pour-El, 34 Danny Hillis, 35 Arthur Burks, 36 John Cocke, 37 George Michaels, 38 Richard Feynman, 39 Laurie Lingham, 40 P. S. Thiagarajan, 41 Marin Hassner, 42 Gerald Vichnaic, 43 Leonid Levin, 44 Lev Levitin, 45 Peter Gacs, 46 Dan Greenberger. (Photo courtesy Charles Bennett)

Information Processing

About Me