Showing posts with label probability.

Monday, October 18, 2021

Embryo Screening and Risk Calculus

Over the weekend The Guardian and The Times (UK) both ran articles on embryo selection. 



I recommend the first article. Philip Ball is an accomplished science writer and former scientist. He touches on many of the most important aspects of the topic, not easy given the length restriction he was working with. 

However, I'd like to cover an aspect of embryo selection that is often missed, for example by the bioethicists quoted in Ball's article.

Several independent labs have published results on risk reduction from embryo selection, and all find that the technique is effective. But some people who are not following the field closely (or are not quantitative) still characterize the benefits -- incorrectly, in my view -- as modest. I honestly think they lack understanding of the actual numbers.

Some examples:
Carmi et al. find a ~50% risk reduction for schizophrenia from selecting the lowest risk embryo from a set of 5. For a selection among 2 embryos the risk reduction is ~30%. (We obtain a very similar result using empirical data: real adult siblings with known phenotype.) 
Visscher et al. find the following results, see Table 1 and Figure 2 in their paper. To their credit they compute results for a range of ancestries (European, E. Asian, African). We have performed similar calculations using siblings but have not yet published the results for all ancestries.  
Relative Risk Reduction (RRR)
Hypertension: 9-18% (ranges depend on specific ancestry) 
Type 2 Diabetes: 7-16% 
Coronary Artery Disease: 8-17% 
Absolute Risk Reduction (ARR)
Hypertension: 4-8.5% (ranges depend on specific ancestry) 
Type 2 Diabetes: 2.6-5.5% 
Coronary Artery Disease: 0.55-1.1%
I don't view these risk reductions as modest. Given that an IVF family is already going to make a selection, they clearly benefit from the additional information that comes with genotyping each embryo. The cost is a small fraction of the overall cost of an IVF cycle.
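To get a feel for where numbers like these come from, here is a minimal simulation sketch under a liability threshold model -- my own toy version, not the actual calculation in Carmi et al. or Visscher et al. The prevalence, predictor accuracy, and embryo count below are hypothetical inputs chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only.
prevalence = 0.01    # lifetime prevalence of the condition
r2 = 0.10            # fraction of liability variance captured by the risk score
n_embryos = 5
n_families = 200_000

t = norm.ppf(1 - prevalence)   # liability threshold for this prevalence

# Score component: family mean plus per-embryo deviation (siblings are correlated).
fam_mean = rng.normal(0.0, np.sqrt(r2 / 2), size=(n_families, 1))
score = fam_mean + rng.normal(0.0, np.sqrt(r2 / 2), size=(n_families, n_embryos))
resid = rng.normal(0.0, np.sqrt(1 - r2), size=(n_families, n_embryos))
liability = score + resid      # each embryo's marginal liability is N(0, 1)

affected = liability > t

# Compare picking the lowest-score embryo against picking one at random.
best = np.argmin(score, axis=1)
risk_selected = affected[np.arange(n_families), best].mean()
risk_random = affected[:, 0].mean()   # any fixed column is equivalent to a random pick

print(f"risk, random embryo      : {risk_random:.4f}")
print(f"risk, lowest-score embryo: {risk_selected:.4f}")
print(f"relative risk reduction  : {1 - risk_selected / risk_random:.0%}")
```

Varying r2, prevalence, and the number of embryos in this toy model reproduces the qualitative pattern in the numbers above: relative reductions are largest for low-prevalence conditions with good predictors, while absolute reductions track the prevalence.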

But here is the important mathematical point which many people miss: We buy risk insurance even when the expected return is negative, in order to ameliorate the worst possible outcomes. 

Consider the example of home insurance. A typical family will spend tens of thousands of dollars over the years on home insurance, which protects against risks like fire or earthquake. However, very few homeowners (e.g., ~1 percent) ever suffer a really large loss! At the end of their lives, looking back, most families might conclude that the insurance was "a waste of money"!

So why buy the insurance? To avoid ruin in the event you are unlucky and your house does burn down. It is tail risk insurance.
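A toy calculation makes the point explicit (all numbers invented for illustration): the premium has a negative expected return, yet buying it removes the small chance of a ruinous loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented numbers, for illustration only.
years = 30
premium = 1_500        # annual premium
p_loss = 0.0003        # annual probability of a catastrophic (total) loss
loss = 400_000         # size of that loss
n = 200_000            # simulated households

events = rng.random((n, years)) < p_loss
uninsured_cost = events.sum(axis=1) * loss
insured_cost = years * premium          # the same for every household

print("expected cost, insured      :", insured_cost)
print("expected cost, uninsured    :", round(uninsured_cost.mean()))
print("fraction ever suffering loss:", round((uninsured_cost > 0).mean(), 4))
print("worst uninsured outcome     :", uninsured_cost.max())
```

Looking backward, most households "wasted" the premium; looking forward, no one knows which ~1 percent will need it.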

Now consider an "unlucky" IVF family. At, say, the 1 percent level of "bad luck" they might have some embryos which are true outliers (e.g., at 10 times normal risk, which could mean over 50% absolute risk) for a serious condition like schizophrenia or breast cancer. This is especially likely if they have a family history. 

What is the benefit to this specific subgroup of families? It is enormous -- using the embryo risk score they can avoid having a child with a very high likelihood of a serious health condition. This benefit is many, many times (> 100x!) larger than the cost of the genetic screening, and it is not captured by the average risk reductions given above.

The situation is very similar to that of aneuploidy testing (screening against Down syndrome), which is widespread, not just in IVF. The prevalence of trisomy 21 (extra copy of chromosome 21) is only ~1 percent, so almost all families doing aneuploidy screening are "wasting their money" if one uses faulty logic! Nevertheless, the families in the affected category are typically very happy to have paid for the test, and even families with no trisomy warning understand that it was worthwhile.

The point is that no one knows ahead of time whether their house will burn down, or that one or more of their embryos has an important genetic risk. The calculus of average return is misleading -- i.e., it says that home insurance is a "rip off" when in fact it serves an important social purpose of pooling risk and helping the unfortunate. 

The same can be said for embryo screening in IVF -- one should focus on the benefit to "unlucky" families to determine the value. We can't identify the "unlucky" in advance, unless we do genetic screening!

Wednesday, May 12, 2021

Neural Tangent Kernels and Theoretical Foundations of Deep Learning

A colleague recommended this paper to me recently. See also earlier post Gradient Descent Models Are Kernel Machines.
Neural Tangent Kernel: Convergence and Generalization in Neural Networks 
Arthur Jacot, Franck Gabriel, Clément Hongler 
At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit, thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function fθ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinite-width limit, the network function fθ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
The results are remarkably well summarized in the wikipedia entry on Neural Tangent Kernels:

For most common neural network architectures, in the limit of large layer width the NTK becomes constant. This enables simple closed form statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. For example, it guarantees that wide enough ANNs converge to a global minimum when trained to minimize an empirical loss. ...

An Artificial Neural Network (ANN) with scalar output consists in a family of functions $f(\cdot\,;\theta): \mathbb{R}^{n_{\mathrm{in}}} \to \mathbb{R}$ parametrized by a vector of parameters $\theta \in \mathbb{R}^{P}$.

The Neural Tangent Kernel (NTK) is a kernel $\Theta: \mathbb{R}^{n_{\mathrm{in}}} \times \mathbb{R}^{n_{\mathrm{in}}} \to \mathbb{R}$ defined by

$$\Theta(x, y; \theta) = \sum_{p=1}^{P} \partial_{\theta_p} f(x; \theta)\, \partial_{\theta_p} f(y; \theta).$$

In the language of kernel methods, the NTK $\Theta$ is the kernel associated with the feature map $x \mapsto \big( \partial_{\theta_p} f(x; \theta) \big)_{p=1,\dots,P}$.

For a dataset $(x_i)_{i=1}^{n}$ with scalar labels $(y_i)_{i=1}^{n}$ and a loss function $c: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, the associated empirical loss, defined on functions $f: \mathbb{R}^{n_{\mathrm{in}}} \to \mathbb{R}$, is given by

$$\mathcal{C}(f) = \sum_{i=1}^{n} c\big(f(x_i), y_i\big).$$

When the ANN $f_{\theta}$ is trained to fit the dataset (i.e. minimize $\mathcal{C}$) via continuous-time gradient descent, the parameters $\theta(t)$ evolve through the ordinary differential equation

$$\partial_t \theta(t) = -\nabla_{\theta}\, \mathcal{C}\big(f_{\theta(t)}\big).$$

During training the ANN output function follows an evolution differential equation given in terms of the NTK:

$$\partial_t f_{\theta(t)}(x) = -\sum_{i=1}^{n} \Theta\big(x, x_i; \theta(t)\big)\, \partial_z c(z, y_i)\big|_{z = f_{\theta(t)}(x_i)}.$$

This equation shows how the NTK drives the dynamics of $f_{\theta}$ in the space of functions during training.
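As a quick numerical check of the definition above, here is a sketch (mine, not from the paper) that computes the empirical NTK of a one-hidden-layer ReLU network in the NTK parametrization, f(x) = a.relu(Wx)/sqrt(width), directly from the parameter gradients. The point to notice is concentration of measure: the kernel value fluctuates less and less across random initializations as the width grows.

```python
import numpy as np

def empirical_ntk(x1, x2, W, a):
    """Empirical NTK of f(x) = a . relu(W x) / sqrt(width) at inputs x1, x2."""
    width = W.shape[0]
    h1, h2 = W @ x1, W @ x2
    # df/da_i = relu(w_i.x)/sqrt(width);  df/dw_i = a_i * 1[w_i.x > 0] * x / sqrt(width)
    term_a = np.maximum(h1, 0) @ np.maximum(h2, 0)
    term_w = (a**2 * (h1 > 0) * (h2 > 0)).sum() * (x1 @ x2)
    return (term_a + term_w) / width

rng = np.random.default_rng(0)
d = 10
x1, x2 = rng.normal(size=d), rng.normal(size=d)

for width in (10, 100, 1_000, 10_000):
    vals = [empirical_ntk(x1, x2,
                          rng.normal(size=(width, d)),
                          rng.normal(size=width))
            for _ in range(20)]              # 20 random initializations
    print(f"width={width:6d}  NTK={np.mean(vals):+.3f} +/- {np.std(vals):.3f}")
```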

This is a very brief (3 minute) summary by the first author:



This 15 minute IAS talk gives a nice overview of the results, and their relation to fundamental questions (both empirical and theoretical) in deep learning. Longer (30m) version: On the Connection between Neural Networks and Kernels: a Modern Perspective.  




I hope to find time to explore this in more depth. Large width seems to provide a limiting case (analogous to the large-N limit in gauge theory) in which rigorous results about deep learning can be proved.

Some naive questions:

What is the expansion parameter of the finite width expansion?

What role does concentration of measure play in the results? (See 30m video linked above.)

Simplification seems to be a consequence of overparametrization. But the proof method seems to apply to a regularized (but still convex, e.g., using L1 penalization) loss function that imposes sparsity. It would be interesting to examine this specific case in more detail.

Notes to self:

The overparametrized (width ~ w^2) network starts in a random state, and by concentration of measure this initial kernel K is just its expectation, which is the NTK. Because of the large number of parameters, the effect of training (i.e., gradient descent) on any individual parameter is 1/w, and the change in the eigenvalue spectrum of K is also 1/w. It can be shown that the eigenvalue spectrum is positive and bounded away from zero, and this property does not change under training. Also, the evolution of f is linear in K up to corrections which are suppressed by 1/w. Hence the evolution follows a convex trajectory and can achieve the global minimum of the loss in finite (polynomial) time.
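A sketch of the linearized ("lazy") dynamics behind these notes, for least-squares loss and assuming the kernel stays frozen at its infinite-width value (my summary, not a quote from the paper):

```latex
% f_t = vector of network outputs on the n training points, y = labels,
% K_{ij} = \Theta(x_i, x_j) the (frozen) NTK Gram matrix.
\begin{align}
  \frac{d f_t}{dt} &= -\,K\,(f_t - y), \\[4pt]
  f_t - y &= e^{-K t}\,(f_0 - y)
           = \sum_a e^{-\lambda_a t}\, v_a\, \big(v_a^{\top}(f_0 - y)\big),
\end{align}
% where (\lambda_a, v_a) are the eigenpairs of K.
```

If the spectrum is bounded away from zero (lambda_min > 0), the residual decays to zero, i.e., the training loss reaches its global minimum, and the decay is fastest along the top kernel principal components -- which is also the abstract's motivation for early stopping.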

The parametric 1/w expansion may depend on quantities such as the smallest NTK eigenvalue k: the proof might require  k >> 1/w  or  wk large.

In the large w limit the function space has such high dimensionality that any typical initial f is close (within a ball of radius 1/w?) to an optimal f. 

These properties depend on the specific choice of loss function.

Monday, November 30, 2015

The measure problem in many worlds quantum mechanics

"I am a Quantum Engineer, but on Sundays I have principles." -- J.S. Bell

"My own conclusion ... there is no interpretation of quantum mechanics that does not have serious flaws." -- Steve Weinberg
I wrote this paper mainly for non-specialists: any theorist should be able to read and understand it. However, I feel the main point — that subjective probability analyses do not resolve the measure problem in many worlds quantum mechanics — is often overlooked, even by the experts.
The measure problem in no-collapse (many worlds) quantum mechanics
arXiv:1511.08881 [quant-ph]

We explain the measure problem (cf. origin of the Born probability rule) in no-collapse quantum mechanics. Everett defined maverick branches of the state vector as those on which the usual Born probability rule fails to hold -- these branches exhibit highly improbable behaviors, including possibly the breakdown of decoherence or even the absence of an emergent semi-classical reality. An ab initio probability measure is necessary to explain why we do not occupy a maverick branch. Derivations of the Born rule which originate in decision theory or subjective probability do not resolve this problem, because they are circular: they assume, a priori, that we reside on a non-maverick branch.
To put it very succinctly: subjective probability or decision theoretic arguments can justify the Born rule to someone living on a non-maverick branch. But they don't explain why that someone isn't on a maverick branch in the first place.

It seems to me absurd that many tens of thousands of papers have been written about the hierarchy problem in particle physics, but only a small number of theorists realize we don't have a proper (logically complete) quantum theory at the fundamental level.

Friday, June 28, 2013

Solomonoff Universal Induction

In an earlier post (Kolmogorov, Solomonoff, and de Finetti) I linked to a historical article on the problem of induction. Here's an even better one, which gives a very clear introduction to Solomonoff Induction.
A Philosophical Treatise of Universal Induction
Samuel Rathmanner, Marcus Hutter

Understanding inductive reasoning is a problem that has engaged mankind for thousands of years. This problem is relevant to a wide range of fields and is integral to the philosophy of science. It has been tackled by many great minds ranging from philosophers to scientists to mathematicians, and more recently computer scientists. In this article we argue the case for Solomonoff Induction, a formal inductive framework which combines algorithmic information theory with the Bayesian framework. Although it achieves excellent theoretical results and is based on solid philosophical foundations, the requisite technical knowledge necessary for understanding this framework has caused it to remain largely unknown and unappreciated in the wider scientific community. The main contribution of this article is to convey Solomonoff induction and its related concepts in a generally accessible form with the aim of bridging this current technical gap. In the process we examine the major historical contributions that have led to the formulation of Solomonoff Induction as well as criticisms of Solomonoff and induction in general. In particular we examine how Solomonoff induction addresses many issues that have plagued other inductive systems, such as the black ravens paradox and the confirmation problem, and compare this approach with other recent approaches.
Of course, properties such as Turing machine-independence and other key results are asymptotic in nature (only in the limit of very long sequences of data does it cease to matter exactly which reference Turing machine you choose to define program length). When it comes to practical implementations, the devil is in the details! You can think of the Solomonoff Universal Prior as a formalization of  the a priori assumption that the information in our Universe is highly compressible (i.e., there are underlying simple algorithms -- laws of physics -- governing its evolution). See also Information, information processing and black holes. From the paper:
... The formalization of Solomonoff induction makes use of concepts and results from computer science, statistics, information theory, and philosophy. It is interesting that the development of a rigorous formalization of induction, which is fundamental to almost all scientific inquiry, is a highly multi-disciplinary undertaking, drawing from these various areas. Unfortunately this means that a high level of technical knowledge from these various disciplines is necessary to fully understand the technical content of Solomonoff induction. This has restricted a deep understanding of the concept to a fairly small proportion of academia which has hindered its discussion and hence progress.

... Every major contribution to the foundations of inductive reasoning has been a contribution to understanding rational thought. Occam explicitly stated our natural disposition towards simplicity and elegance. Bayes inspired the school of Bayesianism which has made us much more aware of the mechanics behind our belief system. Now, through Solomonoff, it can be argued that the problem of formalizing optimal inductive inference is solved.

Being able to precisely formulate the process of (universal) inductive inference is also hugely significant for general artificial intelligence. Obviously reasoning is synonymous with intelligence, but true intelligence is a theory of how to act on the conclusions we make through reasoning. It may be argued that optimal intelligence is nothing more than optimal inductive inference combined with optimal decision making. Since Solomonoff provides optimal inductive inference and decision theory solves the problem of choosing optimal actions, they can therefore be combined to produce intelligence. ... [ Do we really need Solomonoff? Did Nature make use of his Universal Prior in producing us? It seems like cheaper tricks can produce "intelligence" ;-) ]
Here are some nice informal comments by Solomonoff himself.
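To make the compressibility point above slightly more concrete, here is a deliberately crude sketch. True Kolmogorov complexity is uncomputable, so the code below substitutes zlib-compressed length as a rough proxy for description length and weights sequences by 2^(-length) in the spirit of the universal prior. This is only an illustration of the idea, not an implementation of Solomonoff induction.

```python
import random
import zlib

def description_length(bits: str) -> int:
    """Crude proxy for Kolmogorov complexity: zlib-compressed length, in bits."""
    return 8 * len(zlib.compress(bits.encode()))

random.seed(0)
patterned = "01" * 500                                      # a simple underlying "law"
noise = "".join(random.choice("01") for _ in range(1000))   # no visible structure

dl_p = description_length(patterned)
dl_n = description_length(noise)

print("description length, patterned:", dl_p, "bits")
print("description length, noise    :", dl_n, "bits")
# Universal-prior-style weighting 2^(-length) favors the compressible sequence
# by a factor of roughly 2^(dl_n - dl_p).
print("log2 of the prior ratio       ~", dl_n - dl_p)
```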

Wednesday, June 26, 2013

Kolmogorov, Solomonoff, and de Finetti

This is a nice historical article that connects a number of key figures in the history of probability and information theory. See also Frequentists and Bayesians, Jaynes and Bayes, and On the Origin of Probability in Quantum Mechanics.
Induction: From Kolmogorov and Solomonoff to De Finetti and Back to Kolmogorov
John J. McCall DOI: 10.1111/j.0026-1386.2004.00190.x

ABSTRACT This paper compares the solutions to “the induction problem” by Kolmogorov, de Finetti, and Solomonoff. Brief sketches of the intellectual history of de Finetti and Kolmogorov are also composed. Kolmogorov's contributions to information theory culminated in his notion of algorithmic complexity. The development of algorithmic complexity was inspired by information theory and randomness. Kolmogorov's best-known contribution was the axiomatization of probability in 1933. Its influence on probability and statistics was swift, dramatic, and fundamental. However, Kolmogorov was not satisfied by his treatment of the frequency aspect of his creation. This in time gave rise to Kolmogorov complexity. De Finetti, on the other hand, had a profound vision early in his life which was encapsulated in his exchangeability theorem. This insight simultaneously resolved a fundamental philosophical conundrum—Hume's problem, and provided the bricks and mortar for de Finetti's constructive probabilistic theory. Most of his subsequent research involved extensions of his representation theorem. De Finetti was against determinism and celebrated quantum theory, while Kolmogorov was convinced that in every seemingly indeterministic manifestation there lurked a hidden deterministic mechanism. Solomonoff introduced algorithmic complexity independently of Kolmogorov and Chaitin. Solomonoff's motivation was firmly focused on induction. His interest in induction was to a marked extent sparked by Keynes’ 1921 seminal book. This interest in induction has never faltered, remaining prominent in his most recent research. The decisive connection between de Finetti and Kolmogorov was their lifelong interest in the frequency aspect of induction. Kolmogorov's solution to the problem was algorithmic complexity. De Finetti's solution to his frequency problem occurred early in his career with the discovery of the representation theorem. In this paper, we try to explain these solutions and mention related topics which captured the interest of these giants.
Excerpt below from the paper. I doubt Nature (evolution) uses Solomonoff's Universal Prior (determined by minimum length programs), as it is quite expensive to compute. I think our priors and heuristics are much more specialized and primitive.
... There are a host of similarities and equivalences joining concepts like Shannon entropy, Kolmogorov complexity, maximum entropy, Bayes etc. Some of these are noted. In short, the psychological aspects of probability and induction emphasized by de Finetti, Ramsey and Keynes may eventually emerge from a careful and novel neuroscientific study. In this study, neuronal actors would interact as they portray personal perception and memories. In this way, these versatile neuronal actors comprise the foundation of psychology.

... In comparing de Finetti and Kolmogorov one becomes entangled in a host of controversial issues. de Finetti was an indeterminist who championed a subjective inductive inference. Kolmogorov sought determinism in even the most chaotic natural phenomena and reluctantly accepted a frequency approach to statistics, an objective science. In the 1960s Kolmogorov questioned his frequency position and developed algorithmic complexity, together with Solomonoff and Chaitin. With respect to the frequency doubts, he would have been enlightened by de Finetti’s representation theorem. The KCS research remains an exciting research area in probability, statistics and computer science. It has raised a whole series of controversial issues. It appears to challenge both the frequency and Bayesian schools: the former by proclaiming that the foundations of uncertainty are to be found in algorithms, information and combinatorics, rather than probability; the latter by replacing the Bayes prior which differs across individuals in accord with their distinctive beliefs with a universal prior applicable to all and drained of subjectivity.

Sunday, February 17, 2013

Weinberg on quantum foundations



I have been eagerly awaiting Steven Weinberg's Lectures on Quantum Mechanics, both because Weinberg is a towering figure in theoretical physics, and because of his cryptic comments concerning the origin of probability in no collapse (many worlds) formulations:
Einstein's Mistakes
Steve Weinberg, Physics Today, November 2005

Bohr's version of quantum mechanics was deeply flawed, but not for the reason Einstein thought. The Copenhagen interpretation describes what happens when an observer makes a measurement, but the observer and the act of measurement are themselves treated classically. This is surely wrong: Physicists and their apparatus must be governed by the same quantum mechanical rules that govern everything else in the universe. But these rules are expressed in terms of a wavefunction (or, more precisely, a state vector) that evolves in a perfectly deterministic way. So where do the probabilistic rules of the Copenhagen interpretation come from?

Considerable progress has been made in recent years toward the resolution of the problem, which I cannot go into here. [ITALICS MINE. THIS REMINDS ME OF FERMAT'S COMMENT IN THE MARGIN!] It is enough to say that neither Bohr nor Einstein had focused on the real problem with quantum mechanics. The Copenhagen rules clearly work, so they have to be accepted. But this leaves the task of explaining them by applying the deterministic equation for the evolution of the wavefunction, the Schrödinger equation, to observers and their apparatus. The difficulty is not that quantum mechanics is probabilistic—that is something we apparently just have to live with. The real difficulty is that it is also deterministic, or more precisely, that it combines a probabilistic interpretation with deterministic dynamics. ...
Weinberg's coverage of quantum foundations in section 3.7 of the new book is consistent with what is written above, although he does not resolve the question of how probability arises from the deterministic evolution of the wavefunction. (See here for my discussion, which involves, among other things, the distinction between objective and subjective probabilities; the latter can arise even in a deterministic universe).

1. He finds Copenhagen unsatisfactory: it does not allow QM to be applied to the observer and measuring process; it does not have a clean dividing line between observer and system.

2. He finds many worlds (no collapse, decoherent histories, etc.) unsatisfactory not because of the so-called basis problem (he accepts the unproved dynamical assumption that decoherence works as advertised), but rather because of the absence of a satisfactory origin of the Born rule for probabilities. (In other words, he doesn't elaborate on the "considerable progress..." alluded to in his 2005 essay!)

Weinberg's concluding paragraph:
There is nothing absurd or inconsistent about the ... general idea that the state vector serves only as a predictor of probabilities, not as a complete description of a physical system. Nevertheless, it would be disappointing if we had to give up the "realist" goal of finding complete descriptions of physical systems, and of using this description to derive the Born rule, rather than just assuming it. We can live with the idea that the state of a physical system is described by a vector in Hilbert space rather than by numerical values of the positions and momenta of all the particles in the system, but it is hard to live with no description of physical states at all, only an algorithm for calculating probabilities. My own conclusion (not universally shared) is that today there is no interpretation of quantum mechanics that does not have serious flaws [italics mine] ...
It is a shame that very few working physicists, even theoreticians, have thought carefully and deeply about quantum foundations. Perhaps Weinberg's fine summary will stimulate greater awareness of this greatest of all unresolved problems in science.
"I am a Quantum Engineer, but on Sundays I have principles." -- J.S. Bell

Wednesday, August 15, 2012

Better to be lucky than good

Shorter Taleb (much of this was discussed in his first book, Fooled by Randomness):
Fat tails + nonlinear feedback means that the majority of successful traders were successful due to luck, not skill. It's painful to live in the shadow of such competitors.
What other fields are dominated by noisy feedback loops? See Success vs Ability, Nonlinearity and noisy outcomes, The illusion of skill, and Fake alpha.
Why It is No Longer a Good Idea to Be in The Investment Industry 
Nassim N. Taleb 
Abstract: A spurious tail is the performance of a certain number of operators that is entirely caused by luck, what is called the “lucky fool” in Taleb (2001). Because of winner-take-all-effects (from globalization), spurious performance increases with time and explodes under fat tails in alarming proportions. An operator starting today, no matter his skill level, and ability to predict prices, will be outcompeted by the spurious tail. This paper shows the effect of powerlaw distributions on such spurious tail. The paradox is that increase in sample size magnifies the role of luck. 
... The “spurious tail” is therefore the number of persons who rise to the top for no reasons other than mere luck, with subsequent rationalizations, analyses, explanations, and attributions. The performance in the “spurious tail” is only a matter of number of participants, the base population of those who tried. Assuming a symmetric market, if one has for base population 1 million persons with zero skills and ability to predict starting Year 1, there should be 500K spurious winners Year 2, 250K Year 3, 125K Year 4, etc. One can easily see that the size of the winning population in, say, Year 10 depends on the size of the base population Year 1; doubling the initial population would double the straight winners. Injecting skills in the form of better-than-random abilities to predict does not change the story by much. 
Because of scalability, the top, say 300, managers get the bulk of the allocations, with the lion’s share going to the top 30. So it is obvious that the winner-take-all effect causes distortions ...
Conclusions: The “fooled by randomness” effect grows under connectivity where everything on the planet flows to the “top x”, where x is becoming a smaller and smaller share of the top participants. Today, it is vastly more acute than in 2001, at the time of publication of (Taleb 2001). But what makes the problem more severe than anticipated, and causes it to grow even faster, is the effect of fat tails. For a population composed of 1 million track records, fat tails multiply the threshold of spurious returns by between 15 and 30 times. 
Generalization: This condition affects any business in which prevail (1) some degree of fat-tailed randomness, and (2) winner-take-all effects in allocation. 
To conclude, if you are starting a career, move away from investment management and performance related lotteries as you will be competing with a swelling future spurious tail. Pick a less commoditized business or a niche where there is a small number of direct competitors. Or, if you stay in trading, become a market-maker.
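The halving arithmetic in the excerpt is easy to reproduce with a toy simulation of a zero-skill "market" (a fair coin each year; the parameters are mine, not Taleb's): count how many of the original million still have an unblemished track record after each year.

```python
import numpy as np

rng = np.random.default_rng(0)

n_operators = 1_000_000    # zero-skill operators starting in Year 1
n_years = 10

# Each year, each operator beats the market with probability 1/2, independently.
wins = rng.random((n_operators, n_years)) < 0.5
flawless = np.cumprod(wins, axis=1)        # 1 while the winning streak is unbroken

for year in range(n_years):
    print(f"after year {year + 1:2d}: {int(flawless[:, year].sum()):7d} unbroken track records")
```

Fat tails do not change this counting much, but they multiply the size of the best spurious track records, which is the 15-30x threshold effect described in the excerpt.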

Bonus question: what are the ramifications for tax and economic policies (i.e., meant to ensure efficiency and just outcomes) of the observation that a particular industry is noise dominated?

Tuesday, October 04, 2011

On the origin of probability in quantum mechanics

New paper! This is a brief writeup of the talk I gave last year in Benasque, as well as a few other places. Slides are available at the link above.

On the origin of probability in quantum mechanics

http://arxiv.org/abs/1110.0549

I give a brief introduction to many worlds or "no wavefunction collapse" quantum mechanics, suitable for non-specialists. I then discuss the origin of probability in such formulations, distinguishing between objective and subjective notions of probability.

Here's what I say in the conclusion.

Decoherence does not resolve the collapse question, contrary to what many physicists think. Rather, it illuminates the process of measurement and reveals that pure Schrodinger evolution (without collapse) can produce the quantum phenomena we observe. This of course raises the question: do we need collapse? If the conventional interpretation was always ill-defined (again, see Bell for an honest appraisal [1]; Everett referred to it as a "philosophical monstrosity''), why not remove the collapse or von Neumann projection postulates entirely from quantum mechanics?

The origin of probability is the real difficulty within many worlds interpretations. The problem is subtle and experts are divided as to whether it has been resolved satisfactorily. Because the wave function evolves entirely deterministically in many worlds, all probabilities are necessarily subjective and the interpretation does not require true randomness, thereby preserving Einstein's requirement that outcomes have causes.

[1] J.S. Bell's famous article Against Measurement.

Sunday, May 09, 2010

Climate change priors and posteriors



I recommend this nice discussion of climate change on Andrew Gelman's blog. Physicist Phil, the guest-author of the post, gives his prior and posterior probability distribution for temperature sensitivity as a function of CO2 density. I guess I'm somewhere between Skeptic and Phil Prior.

As an aside, I think it is worth distinguishing between a situation where one has a high confidence level about a probability distribution (e.g., at an honest casino game like roulette or blackjack) versus in the real world, where even the pdf itself isn't known with any confidence (Knightian uncertainty). Personally, I am in the latter situation with climate science.

Here is an excerpt from a skeptic's comment on the post:

... So where are we on global climate change? We have some basic physics that predicts some warming caused by CO2, but a lot of positive and negative feedbacks that could amplify and attenuate temperature increases. We have computer models we can't trust for a variety of reasons. We have temperature station data that might have been corrupted by arbitrary "adjustments" to produce a warming trend. We have the north polar ice area decreasing, while the south polar ice area is constant or increasing. Next year an earth satellite will launch that should give us good measurements of polar ice thickness using radar. Let's hope that data doesn't get corrupted. We have some alternate theories to explain temperature increases such as cosmic ray flux. All this adds up to a confused and uncertain picture. The science is hardly "settled."

Finally the public is not buying AGW. Anyone with common sense can see that the big funding governments have poured into climate science has corrupted it. Until this whole thing gets an independent review from trustworthy people, it will not enjoy general acceptance. You can look for that at the ballot box next year.

For a dose of (justified?) certitude, see this angry letter, signed by numerous National Academy of Sciences members, that appeared in Science last week. See here for a systematic study of the record of expert predictions about complex systems. Scientists are only slightly less susceptible than others to groupthink.

Monday, December 22, 2008

More than this






I discovered today that one of my colleagues also loves this song. Bryan Ferry of Roxy Music had real artistic insight into chance and determinism. I like the 10,000 Maniacs version better, which is in the lower screen.

More Than This

I could feel at the time
There was no way of knowing
Fallen leaves in the night
Who can say where they're blowing

As free as the wind
And hopefully learning
Why the sea on the tide
Has no way of turning

More than this - there is nothing
More than this - tell me one thing
More than this - there is nothing

It was fun for a while
There was no way of knowing
Like a dream in the night
Who can say where we're going

No care in the world
Maybe I'm learning
Why the sea on the tide
Has no way of turning

More than this - there is nothing
More than this - tell me one thing
More than this - there is nothing

Thursday, December 11, 2008

Jaynes and Bayes

E.T. Jaynes, although a physicist, was one of the great 20th century proponents of Bayesian thinking. See here for a wealth of information, including some autobiographical essays. I recommend this article on probability, maximum entropy and Bayesian thinking, and this, which includes his recollections of Dyson, Feynman, Schwinger and Oppenheimer.

Here are the first three chapters of his book Probability Theory: the Logic of Science. The historical material in the preface is fascinating.

Jaynes started as an Oppenheimer student, following his advisor from Berkeley to Princeton. But Oppenheimer's mystical adherence to the logically incomplete Copenhagen interpretation (Everett's "philosophic monstrosity") led Jaynes to switch advisors, becoming a student of Wigner.
Edwin T. Jaynes was one of the first people to realize that probability theory, as originated by Laplace, is a generalization of Aristotelian logic that reduces to deductive logic in the special case that our hypotheses are either true or false. This web site has been established to help promote this interpretation of probability theory by distributing articles, books and related material. As Ed Jaynes originated this interpretation of probability theory we have a large selection of his articles, as well as articles by a number of other people who use probability theory in this way.
See Carson Chow for a nice discussion of how Bayesian inference is more like human reasoning than formal logic.
The seeds of the modern era could arguably be traced to the Enlightenment and the invention of rationality. I say invention because although we may be universal computers and we are certainly capable of applying the rules of logic, it is not what we naturally do. What we actually use, as coined by E.T. Jaynes in his iconic book Probability Theory: The Logic of Science, is plausible reasoning. Jaynes is famous for being a major proponent of Bayesian inference during most of the second half of the last century. However, to call Jaynes’s book a book about Bayesian statistics is to wholly miss Jaynes’s point, which is that probability theory is not about measures on sample spaces but a generalization of logical inference. In the Jaynes view, probabilities measure a degree of plausibility.

I think a perfect example of how unnatural the rules of formal logic are is to consider the simple implication A -> B, which means: if A is true then B is true. By the rules of formal logic, if A is false then B can be true or false (i.e. a false premise can prove anything). Conversely, if B is true, then A can be true or false. The only valid conclusion you can deduce is that if B is false then A is false. ...

However, people don’t always (seldom?) reason this way. Jaynes points out that the way we naturally reason also includes what he calls weak syllogisms: 1) If A is false then B is less plausible and 2) If B is true then A is more plausible. In fact, more likely we mostly use weak syllogisms and that interferes with formal logic. Jaynes showed that weak syllogisms as well as formal logic arise naturally from Bayesian inference.

[Carson gives a nice example here -- see the original.]

...I think this strongly implies that the brain is doing Bayesian inference. The problem is that depending on your priors you can deduce different things. This explains why two perfectly intelligent people can easily come to different conclusions. This also implies that reasoning logically is something that must be learned and practiced. I think it is important to know when you draw a conclusion, whether you are using deductive logic or if you are depending on some prior. Even if it is hard to distinguish between the two for yourself, at least you should recognize that it could be an issue.
While I think the brain is doing something like Bayesian inference (perhaps with some kinds of heuristic shortcuts), there are probably laboratory experiments showing that we make a lot of mistakes and often do not properly apply Bayes' theorem. A quick look through the old Kahneman and Tversky literature would probably confirm this :-)
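Carson's weak syllogisms fall straight out of Bayes' theorem. A minimal numerical check (the prior and likelihoods are arbitrary, chosen only so that A implies B):

```python
# Encode A -> B as P(B|A) = 1; the other numbers are arbitrary.
p_a = 0.3              # prior plausibility of A
p_b_given_a = 1.0      # A implies B
p_b_given_not_a = 0.4  # B can also occur without A

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
p_a_given_b = p_b_given_a * p_a / p_b     # Bayes' theorem

print(f"P(B)      = {p_b:.3f}")
print(f"P(B | ~A) = {p_b_given_not_a:.3f}  < P(B): A false makes B less plausible")
print(f"P(A)      = {p_a:.3f}")
print(f"P(A | B)  = {p_a_given_b:.3f}  > P(A): B true makes A more plausible")
```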

Monday, December 01, 2008

Frequentists vs Bayesians

Noted Berkeley statistician David Freedman recently passed away. I recommend the essay below if you are interested in the argument between frequentists (objectivists) and Bayesians (subjectivists). I never knew Freedman, but based on his writings I think I would have liked him very much -- he was clearly an independent thinker :-)

In everyday life I tend to be sympathetic to the Bayesian point of view, but as a physicist I am willing to entertain the possibility of true quantum randomness.

I wish I understood better some of the foundational questions mentioned below. In the limit of infinite data will two Bayesians always agree, regardless of priors? Are exceptions contrived?

Some issues in the foundation of statistics

Abstract: After sketching the conflict between objectivists and subjectivists on the foundations of statistics, this paper discusses an issue facing statisticians of both schools, namely, model validation. Statistical models originate in the study of games of chance, and have been successfully applied in the physical and life sciences. However, there are basic problems in applying the models to social phenomena; some of the difficulties will be pointed out. Hooke’s law will be contrasted with regression models for salary discrimination, the latter being a fairly typical application in the social sciences.


...The subjectivist position seems to be internally consistent, and fairly immune to logical attack from the outside. Perhaps as a result, scholars of that school have been quite energetic in pointing out the flaws in the objectivist position. From an applied perspective, however, the subjectivist position is not free of difficulties. What are subjective degrees of belief, where do they come from, and why can they be quantified? No convincing answers have been produced. At a more practical level, a Bayesian’s opinion may be of great interest to himself, and he is surely free to develop it in any way that pleases him; but why should the results carry any weight for others? To answer the last question, Bayesians often cite theorems showing "inter-subjective agreement:" under certain circumstances, as more and more data become available, two Bayesians will come to agree: the data swamp the prior. Of course, other theorems show that the prior swamps the data, even when the size of the data set grows without bounds-- particularly in complex, high-dimensional situations. (For a review, see Diaconis and Freedman, 1986.) Theorems do not settle the issue, especially for those who are not Bayesians to start with.

My own experience suggests that neither decision-makers nor their statisticians do in fact have prior probabilities. A large part of Bayesian statistics is about what you would do if you had a prior. For the rest, statisticians make up priors that are mathematically convenient or attractive. Once used, priors become familiar; therefore, they come to be accepted as "natural" and are liable to be used again; such priors may eventually generate their own technical literature. ...

It is often urged that to be rational is to be Bayesian. Indeed, there are elaborate axiom systems about preference orderings, acts, consequences, and states of nature, whose conclusion is-- that you are a Bayesian. The empirical evidence shows, fairly clearly, that those axioms do not describe human behavior at all well. The theory is not descriptive; people do not have stable, coherent prior probabilities.

Now the argument shifts to the "normative:" if you were rational, you would obey the axioms, and be a Bayesian. This, however, assumes what must be proved. Why would a rational person obey those axioms? The axioms represent decision problems in schematic and highly stylized ways. Therefore, as I see it, the theory addresses only limited aspects of rationality. Some Bayesians have tried to win this argument on the cheap: to be rational is, by definition, to obey their axioms. ...

How do we learn from experience? What makes us think that the future will be like the past? With contemporary modeling techniques, such questions are easily answered-- in form if not in substance.

·The objectivist invents a regression model for the data, and assumes the error terms to be independent and identically distributed; "iid" is the conventional abbreviation. It is this assumption of iid-ness that enables us to predict data we have not seen from a training sample-- without doing the hard work of validating the model.

·The classical subjectivist invents a regression model for the data, assumes iid errors, and then makes up a prior for unknown parameters.

·The radical subjectivist adopts an exchangeable or partially exchangeable prior, and calls you irrational or incoherent (or both) for not following suit.

In our days, serious arguments have been made from data. Beautiful, delicate theorems have been proved; although the connection with data analysis often remains to be established. And an enormous amount of fiction has been produced, masquerading as rigorous science. [!!!]
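On the question I raised above -- whether two Bayesians always agree in the limit of infinite data -- the simplest case behaves as Freedman's "data swamp the prior" theorems say: for i.i.d. coin flips with Beta priors, very different priors give posteriors that merge as observations accumulate. His caveat is that this can fail in complex, high-dimensional (nonparametric) problems, which this toy sketch does not touch.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.7
flips = rng.random(100_000) < true_p

# Two Bayesians with strongly opposed Beta(a, b) priors on the coin's bias.
priors = {"optimist": (20.0, 2.0), "pessimist": (2.0, 20.0)}

for n in (0, 10, 100, 10_000, 100_000):
    heads = int(flips[:n].sum())
    # Posterior is Beta(a + heads, b + n - heads); print its mean.
    means = {name: (a + heads) / (a + b + n) for name, (a, b) in priors.items()}
    print(f"n = {n:6d}  " + "  ".join(f"{k}: {v:.3f}" for k, v in means.items()))
```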

Tuesday, October 09, 2007

Bounded cognition

Many people lack standard cognitive tools useful for understanding the world around them. Perhaps the most egregious case: probability and statistics, which are central to understanding health, economics, risk, crime, society, evolution, global warming, etc. Very few people have any facility for calculating risk, visualizing a distribution, understanding the difference between the average, the median, variance, etc.

A remnant of the cold war era curriculum still in place in the US: if students learn advanced math it tends to be calculus, whereas a course on probability, statistics and thinking distributionally would be more useful. (I say this reluctantly, since I am a physical scientist and calculus is in the curriculum largely for its utility in fields related to mine.)

In the post below, blogger Mark Liberman (a linguist at Penn) notes that our situation parallels the absence of concepts for specific numbers (e.g., "ten") among primitive cultures like the Pirahã of the Amazon. We may find their condition amusing, or even sad. Personally, I find it tragic that leading public intellectuals around the world are mostly innumerate and don't understand basic physics.

Language Log

The Pirahã language and culture seem to lack not only the words but also the concepts for numbers, using instead less precise terms like "small size", "large size" and "collection". And the Pirahã people themselves seem to be surprisingly uninterested in learning about numbers, and even actively resistant to doing so, despite the fact that in their frequent dealings with traders they have a practical need to evaluate and compare numerical expressions. A similar situation seems to obtain among some other groups in Amazonia, and a lack of indigenous words for numbers has been reported elsewhere in the world.

Many people find this hard to believe. These are simple and natural concepts, of great practical importance: how could rational people resist learning to understand and use them? I don't know the answer. But I do know that we can investigate a strictly comparable case, equally puzzling to me, right here in the U.S. of A.

Until about a hundred years ago, our language and culture lacked the words and ideas needed to deal with the evaluation and comparison of sampled properties of groups. Even today, only a minuscule proportion of the U.S. population understands even the simplest form of these concepts and terms. Out of the roughly 300 million Americans, I doubt that as many as 500 thousand grasp these ideas to any practical extent, and 50,000 might be a better estimate. The rest of the population is surprisingly uninterested in learning, and even actively resists the intermittent attempts to teach them, despite the fact that in their frequent dealings with social and biomedical scientists they have a practical need to evaluate and compare the numerical properties of representative samples.

[OK, perhaps 500k is an underestimate... Surely >1% of the population has been exposed to these ideas and remembers the main points?]

...Before 1900 or so, only a few mathematical geniuses like Gauss (1777-1855) had any real ability to deal with these issues. But even today, most of the population still relies on crude modes of expression like the attribution of numerical properties to prototypes ("A woman uses about 20,000 words per day while a man uses about 7,000") or the comparison of bare-plural nouns ("men are happier than women").

Sometimes, people are just avoiding more cumbersome modes of expression -- "Xs are P-er than Ys" instead of (say) "The mean P measurement in a sample of Xs was greater than the mean P measurement in a sample of Ys, by an amount that would arise by chance fewer than once in 20 trials, assuming that the two samples were drawn from a single population in which P is normally distributed". But I submit that even most intellectuals don't really know how to think about the evaluation and comparison of distributions -- not even simple univariate gaussian distributions, much less more complex situations. And many people who do sort of understand this, at some level, generally fall back on thinking (as well as talking) about properties of group prototypes rather than properties of distributions of individual characteristics.

If you're one of the people who find distribution-talk mystifying, and don't really see why you should have to learn it, or perhaps think that you're just not the kind of person who learns things like this -- congratulations, you now know exactly how (I imagine) the Pirahã feel about number-talk.

Does this matter? Well, in the newspapers every week, there are dozens of stories about risks and rewards, epidemiology and politics, social trends and psychological differences, with serious public-policy implications, which you can't understand without understanding distribution-talk. And usually you won't just feel baffled -- instead, you'll think you understand, and draw the wrong conclusions.

In fact, the people who write these stories mostly don't understand distribution-talk themselves, and in any case they believe that they need to write for an audience that doesn't understand it. As a result, news stories on these topics are usually impossible to understand correctly unless you go back to the primary sources in order to recover the information that's been distorted or omitted. I imagine that something similar must happen when one Pirahã tells another about the deal that this month's river trader is offering on knives.
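A small worked example of the distribution-talk Liberman has in mind (numbers invented): even when group Y's mean is genuinely higher than group X's, the distributions can overlap so much that a randomly chosen X beats a randomly chosen Y a large fraction of the time -- which the prototype statement "Ys are higher than Xs" completely hides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups differing by 0.3 standard deviations in the mean.
n = 1_000_000
x = rng.normal(0.0, 1.0, n)   # group X
y = rng.normal(0.3, 1.0, n)   # group Y

print("mean difference (Y - X):", round(float(y.mean() - x.mean()), 3))
print("P(random X > random Y) :", round(float((x > y).mean()), 3))   # about 0.42
```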

Here's a great comment:

For many years I attempted to teach Biology and Genetics students the rudiments of statistics, with, alas, only limited success. The notions of population, sample, variance, hypothesis testing, etc. require more time and practice than can be devoted to them in such courses. Most students in the life sciences are math-phobic and few take statistics courses until they reach graduate school. Even among professional biologists publishing in journals like Science and Nature you can find examples of statistical ignorance. Is it any wonder that the average man on the street doesn't understand them either? Practical statistics needs to be incorporated into high school math courses and, possibly, earlier. But I'd remain doubtful that even then the average person would understand enough to be critical of what they read in the papers.

Posted by: Dale Hoyt | October 7, 2007 8:21 PM

For a real life example, see Gary Taubes' book on nutrition and public health research, reviewed here. Even the medical establishment adopted hypotheses that were not in any way supported by good statistical data.

Thursday, September 20, 2007

Information theory, inference and learning algorithms

I'd like to recommend the book Information theory, inference and learning algorithms by David MacKay, a Cambridge professor of physics. I wish I'd had a course on this material from MacKay when I was a student! Especially nice are the introductory example on Bayesian inference (Ch. 3) and the discussion of Occam's razor from a Bayesian perspective (Ch. 28). I'm sure I'll find other gems in this book, but I'm still working my way through it.

I learned about the book through Nerdwisdom, which I also recommend highly. Nerdwisdom is the blog of Jonathan Yedidia, a brilliant polymath (theoretical physicist turned professional chess player turned computer scientist) with whom I consumed a lot of French wine at dinners of the Harvard Society of Fellows.

Sunday, July 02, 2006

Hollywood genius

Physicist turned author and screenwriter Leonard Mlodinow has a nice article in the LA Times on the hit or miss nature of the movie industry. He recapitulates the myth of expertise as it applies to studio executives, whom he compares to dart throwing monkeys (a la fund managers in finance).

Mlodinow wrote a charming memoir about his time as a postdoc at Caltech in the early 1980s. Fresh from Berkeley, having written a PhD dissertation on the large-d expansion (d is the number of dimensions), he was in over his head at Caltech, but found a friend and mentor in the ailing Richard Feynman.

We all understand that genius doesn't guarantee success, but it's seductive to assume that success must come from genius. As a former Hollywood scriptwriter, I understand the comfort in hiring by track record. Yet as a scientist who has taught the mathematics of randomness at Caltech, I also am aware that track records can deceive.

That no one can know whether a film will hit or miss has been an uncomfortable suspicion in Hollywood at least since novelist and screenwriter William Goldman enunciated it in his classic 1983 book "Adventures in the Screen Trade." If Goldman is right and a future film's performance is unpredictable, then there is no way studio executives or producers, despite all their swagger, can have a better track record at choosing projects than an ape throwing darts at a dartboard.

That's a bold statement, but these days it is hardly conjecture: With each passing year the unpredictability of film revenue is supported by more and more academic research.

That's not to say that a jittery homemade horror video could just as easily become a hit as, say, "Exorcist: The Beginning," which cost an estimated $80 million, according to Box Office Mojo, the source for all estimated budget and revenue figures in this story. Well, actually, that is what happened with "The Blair Witch Project" (1999), which cost the filmmakers a mere $60,000 but brought in $140 million—more than three times the business of "Exorcist." (Revenue numbers reflect only domestic receipts.)

What the research shows is that even the most professionally made films are subject to many unpredictable factors that arise during production and marketing, not to mention the inscrutable taste of the audience. It is these unknowns that obliterate the ability to foretell the box-office future.

But if picking films is like randomly tossing darts, why do some people hit the bull's-eye more often than others? For the same reason that in a group of apes tossing darts, some apes will do better than others. The answer has nothing to do with skill. Even random events occur in clusters and streaks.

...If the mathematics is counterintuitive, reality is even worse, because a funny thing happens when a random process such as the coin-flipping experiment is actually carried out: The symmetry of fairness is broken and one of the films becomes the winner. Even in situations like this, in which we know there is no "reason" that the coin flips should favor one film over the other, psychologists have shown that the temptation to concoct imagined reasons to account for skewed data and other patterns is often overwhelming.

...Actors in Hollywood understand best that the industry runs on luck. As Bruce Willis once said, "If you can find out why this film or any other film does any good, I'll give you all the money I have." (For the record, the film to which he referred, 1993's "Striking Distance," didn't do any good.) Willis understands the unpredictability of the film business not simply because he's had box-office highs and lows. He knows that random events fueled his career from the beginning, and his story offers another case in point...
