Information Processing: information theory

Showing posts with label information theory. Show all posts

Thursday, April 07, 2022

Scott Aaronson: Quantum Computing, Unsolvable Problems, & Artificial Intelligence — Manifold podcast #9

Scott Aaronson is the David J. Bruton Centennial Professor of Computer Science at The University of Texas at Austin, and director of its Quantum Information Center. Previously, he taught for nine years in Electrical Engineering and Computer Science at MIT. His research interests center around the capabilities and limits of quantum computers, and computational complexity theory more generally.

Scott also writes the blog Shtetl Optimized: https://scottaaronson.blog/

Steve and Scott discuss:

1. Scott's childhood and education, first exposure to mathematics and computers.

2. How he became interested in computational complexity, pursuing it rather than AI/ML.

3. The development of quantum computation and quantum information theory from the 1980s to the present.

4. Scott's work on quantum supremacy.

5. AGI, AI Safety

ManifoldOne page. Transcript.

Friday, March 18, 2022

Quantum Hair from Gravity (published version in Physical Review Letters)

This is the published version of our paper on Quantum Hair on black holes, in Physical Review Letters:

Quantum Hair from Gravity

Xavier Calmet, Roberto Casadio, Stephen D. H. Hsu, and Folkert Kuipers

Phys. Rev. Lett. 128, 111301 – Published 17 March 2022

We explore the relationship between the quantum state of a compact matter source and of its asymptotic graviton field. For a matter source in an energy eigenstate, the graviton state is determined at leading order by the energy eigenvalue. Insofar as there are no accidental energy degeneracies there is a one to one map between graviton states on the boundary of spacetime and the matter source states. Effective field theory allows us to compute a purely quantum gravitational effect which causes the subleading asymptotic behavior of the graviton state to depend on the internal structure of the source. This establishes the existence of ubiquitous quantum hair due to gravitational effects.

The paper establishes that the quantum state of the graviton field (equivalently, the spacetime metric) of a compact matter source depends on the quantum state of the source. This can be established without a short distance theory of quantum gravity -- i.e., near the Planck length. Our results are long wavelength effects and are insensitive to the details of short distance physics, such as whether gravitons are excitations of strings, or something else, at the most fundamental level.

Classical theorems in General Relativity indicate that black holes are nearly featureless -- only a few aspects of the hole, such as its total mass, charge, and angular momentum, are manifested in its asymptotic gravitational field. We show that this "no hair" property does not extend to the quantum realm. Indeed at the quantum level the situation is the opposite: the full quantum state of the compact object can be recovered from the asymptotic graviton state.

In this companion paper we show how these results resolve Hawking's black hole information paradox, which has been an open problem for 46 years.

Quantum hair and black hole information

Physics Letters B Volume 827, 10 April 2022, 136995

Xavier Calmet and Stephen D.H. Hsu

It has been shown that the quantum state of the graviton field outside a black hole horizon carries information about the internal state of the hole. We explain how this allows unitary evaporation: the final radiation state is a complex superposition which depends linearly on the initial black hole state. Under time reversal, the radiation state evolves back to the original black hole quantum state. Formulations of the information paradox on a fixed semiclassical geometry describe only a small subset of the evaporation Hilbert space, and do not exclude overall unitarity.

Note to experts: the companion paper explains why Mathur's Theorem (i.e., entanglement entropy must always increase by ~ln 2 with each emitted qubit) is evaded once one considers BH evolution in the full radiation Hilbert space. The radiation Hilbert space is much larger than the small subspace which remains after conditioning on any specific spacetime background or BH recoil trajectory. Even exponentially small entanglement between different radiation states (mediated by quantum hair) can unitarize the evaporation process.

This is also explained in detail in the talk video and slides linked below.

Press coverage:

BBC

Guardian

Independent

Earlier discussion, with more background on the Hawking paradox. See especially the important work by Suvrat Raju and collaborators:

Quantum Hair and Black Hole Information (December 2021)

Has Hawking's Black Hole Information Paradox Been Resolved? (November 2021)

Video of seminar and slides.

Tuesday, June 12, 2018

Big Ed on Classical and Quantum Information Theory

I'll have to carve out some time this summer to look at these :-) Perhaps on an airplane...

When I visited IAS earlier in the year, Witten was sorting out Lieb's (nontrivial) proof of strong subadditivity. See also Big Ed.

A Mini-Introduction To Information Theory
https://arxiv.org/abs/1805.11965

This article consists of a very short introduction to classical and quantum information theory. Basic properties of the classical Shannon entropy and the quantum von Neumann entropy are described, along with related concepts such as classical and quantum relative entropy, conditional entropy, and mutual information. A few more detailed topics are considered in the quantum case.

Notes On Some Entanglement Properties Of Quantum Field Theory
https://arxiv.org/abs/1803.04993

These are notes on some entanglement properties of quantum field theory, aiming to make accessible a variety of ideas that are known in the literature. The main goal is to explain how to deal with entanglement when – as in quantum field theory – it is a property of the algebra of observables and not just of the states.

Years ago at Caltech, walking back to Lauritsen after a talk on quantum information, with John Preskill and a famous string theorist not to be named. When I asked the latter what he thought of the talk, he laughed and said Well, after all, it's just linear algebra :-)

Saturday, October 07, 2017

Information Theory of Deep Neural Nets: "Information Bottleneck"

This talk discusses, in terms of information theory, how the hidden layers of a deep neural net (thought of as a Markov chain) create a compressed (coarse grained) representation of the input information. To date the success of neural networks has been a mainly empirical phenomenon, lacking a theoretical framework that explains how and why they work so well.

At ~44min someone asks how networks "know" to construct (local) feature detectors in the first few layers. I'm not sure I followed Tishby's answer but it may be a consequence of the hierarchical structure of the data, not specific to the network or optimization.

Naftali (Tali) Tishby נפתלי תשבי

Physicist, professor of computer science and computational neuroscientist
The Ruth and Stan Flinkman professor of Brain Research
Benin school of Engineering and Computer Science
Edmond and Lilly Safra Center for Brain Sciences (ELSC)
Hebrew University of Jerusalem, 96906 Israel

I work at the interfaces between computer science, physics, and biology which provide some of the most challenging problems in today’s science and technology. We focus on organizing computational principles that govern information processing in biology, at all levels. To this end, we employ and develop methods that stem from statistical physics, information theory and computational learning theory, to analyze biological data and develop biologically inspired algorithms that can account for the observed performance of biological systems. We hope to find simple yet powerful computational mechanisms that may characterize evolved and adaptive systems, from the molecular level to the whole computational brain and interacting populations.

Another Tishby talk on this subject.

Saturday, November 21, 2009

IQ, compression and simple models

I get yelled at from all sides whenever I mention IQ in a post, but I'm a stubborn guy, so here we go again.

Imagine that you would like to communicate something about the size of an object, using as short a message as possible -- i.e., a single number. What would be a reasonable algorithm to employ? There's obviously no unique answer, and the "best" algorithm depends on the distribution of object types that you are trying to describe. Here's a decent algorithm:

Let rough size S = the radius of the smallest sphere within which the object will fit.

This algorithm allows a perfect reconstruction of the object if it is spherical, but isn't very satisfactory if the object is a javelin or bicycle wheel.

Nevertheless, it would be unreasonable to reject this definition as a single number characterization of object size, given no additional information about the distribution of object types.

I suggest we think about IQ in a similar way.

Q1: If you had to supply a single number meant to characterize the general cognitive ability of an individual, how would you go about determining that number?

I claim that the algorithm used to define IQ is roughly as defensible for characterizing cognitive ability as the quantity S, defined above, is for characterizing object size. The next question, which is an empirical one, is

Q2: Does the resulting quantity have any practical use?

In my opinion reasonable people should focus on the second question, that of practical utility, as it is rather obvious that there is no unique or perfect answer to the first question.

To define IQ, or the general factor g of cognitive ability, we first define some different tests of cognitive ability, i.e., which measure capabilities like memory, verbal ability, spatial ability, pattern recognition, etc. Of course this set of tests is somewhat arbitrary, just as the primitive concept "size of an object" is somewhat arbitrary (is a needle "bigger" than a thimble?). Let's suppose we decide on N different kinds of tests. An individual's score on this battery of tests is an N-vector. Sample from a large population and plot each vector in the N-dimensional space. We might find that the resulting points are concentrated on a submanifold of the N-dimensional space, such that a single variable (which is a special linear combination of the N coordinates) captures most of the variation. As an extreme example, imagine the points form a long thin ellipse with one very long axis; position on this long axis almost completely specifies the N vector. The figure below shows real data, ~100k individuals tested. The principal axis of the ellipsoid is g (roughly speaking; as I've emphasized it is not entire well-defined).

What I've just described geometrically is the case where the N mental abilities display a lot of internal correlation, and have a dominant single factor that arises from factor analysis. This dominant factor is what we call g. Note it did not have to be the case that there was a single dominant factor -- the sampled points could have had any shape -- but for the set of generally agreed upon human cognitive abilities, there is.

(What this implies about underlying brain wetware is an interesting question but would take us too far afield. I will mention that g, defined as above using cognitive tests, correlates with neurophysical quantities like reaction time! So it's at least possible that high g has something to do with generally effective brain function -- being wired up efficiently. It's now acknowledged even by hard line egalitarians that g is at least partly heritable, but for the purposes of this discussion we only require a weaker property -- that adult g is relatively stable.)

To summarize, g is the best single number compression of the N vector characterizing an individual's cognitive profile. (This is a lossy compression -- knowing g does not allow exact reconstruction of the N vector.) Of course, the choice of the N tests used to deduce g was at least somewhat arbitrary, and a change in tests results in a different definition of g. There is no unique or perfect definition of a general factor of intelligence. As I emphasized above, given the nature of the problem it seems unreasonable to criticize the specific construction of g, or to try to be overly precise about the value of g for a particular individual. The important question is Q2: what good is it?

A tremendous amount of research has been conducted on Q2. For a nice summary, see Why g matters: the complexity of ordinary life by psychologist Linda Gottfredson, or click on the IQ or psychometrics label link for this blog. Links and book recommendations here. The short answer is that g does indeed correlate with life outcomes. If you want to argue with me about any of this in the comments, please at least first read some of the literature cited above.

From Gottfredson (WPT = Wonderlic Personnel Test):

Personnel selection research provides much evidence that intelligence (g) is an important predictor of performance in training and on the job, especially in higher level work. This article provides evidence that g has pervasive utility in work settings because it is essentially the ability to deal with cognitive complexity, in particular, with complex information processing. The more complex a work task, the greater the advantages that higher g confers in performing it well.

... These conclusions concerning training potential, particularly at the lower levels, seem confirmed by the military’s last half century of experience in training many millions of recruits. The military has periodically inducted especially large numbers of “marginal men” (percentiles 10-16, or WPT 10-12), either by necessity (World War II), social experiment (Secretary of Defense Robert McNamara’s Project 100,000 in the late 196Os), or accident (the ASVAB misnorming in the early 1980s). In each case, the military has documented the consequences of doing so (Laurence & Ramsberger, 1991; Sticht et al., 1987; U.S. Department of the Army, 1965).

... all agree that these men were very difficult and costly to train, could not learn certain specialties, and performed at a lower average level once on a job. Many such men had to be sent to newly created special units for remedial training or recycled one or more times through basic or technical training.

Limitations and open questions:

1. Are there group differences in g? Yes, this is actually uncontroversial. The hard question is whether these observed differences are due to genetic causes.

2. Is it useful to consider sub-factors? What about, e.g., a 2 or 3-vector compression instead of a scalar quantity? Yes, that's why the SAT has an M and a V section. Some people are strong verbally, but weak mathematically, and vice versa. Some people are really good at visualizing geometric relationships, some aren't, etc.

3. Does g become less useful in the tail of the distribution? Quite possibly. It's harder and harder to differentiate people in the tail.

4. How stable is g? Adult g is pretty stable -- I've seen results with .9 correlation or greater for measurements taken a year apart. However, g measured in childhood is nowhere near a perfect predictor of adult g. If someone has a reference with good data on childhood/adult g correlation, please let me know.

5. Isn't g just the same as class or SES? No. Although there is a weak correlation between g and SES, there are obviously huge variations in g within any particular SES group. Not all rich kids can master calculus, and not all disadvantaged kids read below grade level.

6. How did you get interested in this subject? In elementary school we had to take the ITED (Iowa Test of Educational Development). This test had many subsections (vocabulary, math, reading, etc.) with 99th percentile ceilings. For some reason the teachers (or was it my parents?) let me see my scores, and I immediately wondered whether performance on different sections was correlated. If you were 99 on the math, what was the probability you were also 99 on the reading? What are the odds of all 99s? This leads immediately to the concept of g, which I learned about by digging around at the university library. I also found all five volumes of the Terman study.

7. What are some other useful compressed descriptions? It is claimed that one can characterize personality using the Big Five factors. The results are not as good as for g, I would say, but it's an interesting possibility, and these factors were originally deduced in an information theoretic way. Big Five factors have been shown to be stable and somewhat heritable, although not as heritable as g. Role playing games often use compressed descriptions of individuals (Strength, Dexterity, Intelligence, ...) as do NFL scouts (40 yd dash, veritcal leap, bench press, Wonderlic score, ... ) ;-)

It's a shame that I have to write this post at all. This subject is of such fundamental importance and the results so interesting and clear cut (especially for something from the realm of social science) that everyone should have studied it in school. (Everyone does take the little tests in school...) It's too bad that political correctness means that I will be subject to abuse for merely discussing these well established scientific results.

Why think about any of this? Here's what I said in response to a comment on this earlier post:

Intelligence, genius, and achievement are legitimate subjects for serious study. Anyone who hires or fires employees, mentors younger people, trains students, has kids, or even just has an interest in how human civilization evolved and will evolve should probably think about these questions -- using statistics, biography, history, psychological studies, really whatever tools are available.

Thursday, September 20, 2007

Information theory, inference and learning algorithms

I'd like to recommend the book Information theory, inference and learning algorithms by David Mackay, a Cambridge professor of physics. I wish I'd had a course on this material from Mackay when I was a student! Especially nice are the introductory example on Bayesian inference (Ch. 3) and the discussion of Occam's razor from a Bayesian perspective (Ch. 28). I'm sure I'll find other gems in this book, but I'm still working my way through.

I learned about the book through Nerdwisdom, which I also recommend highly. Nerdwisdom is the blog of Jonathan Yedidia, a brilliant polymath (theoretical physicist turned professional chess player turned computer scientist) with whom I consumed a lot of French wine at dinners of the Harvard Society of Fellows.

Sunday, March 27, 2005

How much information in the universe?

Suppose the universe is described by a single wavefunction which evolves in time according to Schrodinger dynamics. In this framework remarkably little information is required to describe the universe as a whole.

Consider the sub-volume V, just after the big bang, which evolves into our observable 15 Gy universe. Now suppose there is an ultraviolet cutoff (or minimum resolvable length) given by the Planck length. Then the entropy (log of number of degrees of freedom) is of order V in Planck units. To neglect quantum gravity effects, we need to wait sufficiently long after the big bang that curvatures are small in Planck units, so this entropy is a large number, but still much smaller than the observed entropy of our current universe. Given the wavefunction over this large but finite Hilbert space, and the Hamiltonian, we can evolve this system forward in time until today. Neither the informational complexity (number of bits required to specify the initial state) nor the algorithmic complexity (length of program required to evolve the system) is very large. (Note that memory requirements could grow quite rapidly - especially in an expanding universe.)

If this confuses you, just imagine you had to write a computer program to evolve this finite system (remember, we have both UV and IR cutoffs) forward in time. How much input would your program need, and how long would the code be? Then compare the answer to what would be required, e.g., to simulate to just a small part of planet earth today.

So what is the origin of the apparent complexity of our world? The answer is that the wavefunction described above contains all branches of Everett's many worlds (see previous post). In order to locate your particular branch (the one on which your consciousness resides), you have to specify the outcomes of the branchings in your past (or, at least, the important ones - I think Gell-Mann and Hartle refer to this as decoherent histories). The amount of information required to specify a particular history is related to the number of possible present universes, and is mainly responsible for the complexity we observe.

Information Processing

About Me