*Estimated read time (minus contemplative pauses): 15 min.*

Bertrand’s Box Paradox has been mentioned in comments a couple of times here at *Untrammeled Mind*, so I figured I’d give a quick explanation. It’s a straightforward probability problem, if at first glance counterintuitive. Which is to say it doesn’t take much to resolve the so-called paradox.

I’ll first give a short explanation, then a slightly longer one. As a cool added bonus, the slightly longer explanation will lead us to discover Bayes’ theorem, whose innards I’ll then take the opportunity to briefly, gently probe with Bertrand’s Box Paradox as a guide.

Bertrand’s Box Paradox comes from Joseph Bertrand’s 1889 book *Calcul des probabilités*. I’ll lift it from Wikipedia:

There are three boxes:

(1) a box containing two gold coins,

(2) a box containing two silver coins,

(3) a box containing one gold coin and one silver coin.

The “paradox” is in the probability, after choosing a box at random and withdrawing one coin at random, if that happens to be a gold coin, of the next coin drawn from the same box also being a gold coin.

So, to summarize: You randomly choose a box, then randomly pull a coin from that box. The coin is gold. You don’t put the coin back in the box. What’s the probability the second coin you pull from that same box is also gold?

**Short Explanation:**

To aid mental visualizing, I’ll sketch the boxes:

(1) [g,g]

~~(2) [s,s]~~

(3) [s,g]

You know you’ve got either box (1) or (3), so forget about box (2). We presume the same likelihood for choosing any given box. This suggests the intuitive response of there being an equal likelihood of having chosen box (1) or (3), and thus one-to-one (i.e., 50-50) odds for the next coin being gold or silver. In other words, you might now think the answer is 1/2. But this is wrong.

Notice that, given that box (1) contains twice as many gold coins as does box (3), you are twice as likely to have chosen box (1). In other words, there’s a 2:1 ratio in favor of box (1). That’s 2-to-1 odds, which means that, on average, for every three times you play this game and first pull a gold coin, two out of three of those runs will be a box (1) scenario, and one out of three will be a box (3) scenario. So the probability that the second coin is gold is two out of three, or 2/3.

In short, the first coin you pull will be gold twice as often in the box (1) situation as in the box (3) situation. So you’re twice as likely to be in the box (1) situation as in the box (3) situation. That’s 2:1 odds, which translates to a 2/3 probability for having chosen box (1) and a 1/3 probability for having chosen box (3).
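If you'd rather see this than take my word for it, here is a minimal simulation sketch in Python. The function names `play_once` and `estimate`, the trial count, and the seed are all my own choices for illustration; the only thing taken from the problem is the three boxes and the rule "keep only the rounds where the first coin is gold."

```python
import random

def play_once(rng):
    """Play one round and return (first_coin, second_coin)."""
    boxes = [["g", "g"], ["s", "s"], ["s", "g"]]
    box = list(rng.choice(boxes))  # choose a box at random (copied so we can shuffle it)
    rng.shuffle(box)               # random order in which the two coins come out
    return box[0], box[1]

def estimate(trials=100_000, seed=0):
    """Estimate P(second coin is gold | first coin is gold) by repeated play."""
    rng = random.Random(seed)
    gold_first = 0
    gold_both = 0
    for _ in range(trials):
        first, second = play_once(rng)
        if first == "g":           # condition on the evidence: first coin is gold
            gold_first += 1
            if second == "g":
                gold_both += 1
    return gold_both / gold_first

print(estimate())
```

The printed estimate should hover near 2/3 rather than 1/2, and it tightens as `trials` grows.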

**Slightly Longer Explanation:**

I’ll now explain Bertrand’s Box Paradox more thoroughly with a tree diagram, some simple math, and a little common sense. Hopefully it’s clear; if not immediately, then with a little persistence. I tend to describe the same thing in different ways with the hope that, taken as a whole, the result is understanding rather than algorithmic dependence (both for myself and the reader). (If this stuff is new to you and my delivery is confusing, I apologize; if it’s familiar and my delivery seems patronizing, I apologize. Striking the right balance is tough to do. So I follow the rule of trying to explain things the way I think I’d like to have seen them explained when I first encountered them—also surprisingly tough to do.)

First, let’s define our starting sample space and give the boxes names instead of numbers: {GG, SG, SS}

So, “GG,” for example, means “the box with two gold coins.” Also, I’ll use lower case to symbolize the coins—so, “g” means “gold coin.”

Let’s now use theoretical frequencies to run the game 60 times. Each box carries the same probability of being chosen, so we end up with:

SS is crossed out because we already know we’re not in an SS situation. It’s at this point that one’s intuition pipes up, “Ah, two boxes, both equally likely to have been chosen. The probability of GG must be 1/2.” This fails, however, to account for the evidence the gold coin provides. We must extend our branches:
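One way to tally those extended branches is a few lines of Python. This is only a sketch: the names `games`, `boxes`, `times_chosen`, and `gold_first` are mine, and the counts are theoretical frequencies, not simulated draws.

```python
from fractions import Fraction

games = 60  # theoretical number of plays, as in the text

# Each box maps to the probability that the first coin pulled from it is gold.
boxes = {"GG": Fraction(1), "SG": Fraction(1, 2), "SS": Fraction(0)}

# Each box is chosen equally often: 60 / 3 = 20 times apiece.
times_chosen = {name: Fraction(games, len(boxes)) for name in boxes}

# Expected number of plays in which the first coin is gold, per box.
gold_first = {name: times_chosen[name] * p_gold for name, p_gold in boxes.items()}

for name in boxes:
    print(f"{name}: chosen {times_chosen[name]} times, gold first {gold_first[name]} times")
print("gold first, in total:", sum(gold_first.values()))
```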

In total, we pull a gold coin 30 times, 20 of which happen in a GG scenario, and the other 10 happen in SG. We then set a ratio of the *desired outcomes* against all outcomes that meet the broader condition of *we pulled a gold coin*:
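Spelled out in words (the wording is mine, following the counts just described), that ratio looks something like:

$$
P(\text{GG} \mid \text{gold first}) = \frac{\text{number of plays where we chose GG and pulled gold}}{\text{number of plays where we pulled gold from any box}}
$$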

Now plug in the corresponding numbers from the tree:
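With 20 gold pulls in the GG scenario and 10 in SG, the arithmetic presumably comes out as:

$$
\frac{20}{20 + 10} = \frac{20}{30} = \frac{2}{3}
$$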

And there you have it. But I’d like to break this down a little further and repeat the above method using probabilities. Essentially, you can think of this as the same as what we did above, but playing one game instead of 60. This amounts to the exact same proportions everywhere, but with the relevant fractions simplified and the math more transparent.

For example, instead of starting with 60, we start with 60/60 = 1. Instead of choosing GG 60/3 = 20 times, we choose it (60/3)/60 = 20/60 = 1/3 times (or we might say, “1/3 of the time”), which is just the probability of selecting GG. You can then find the 2/3 answer with the same ratio we used above (which I represent in the next diagram below), or you can do what amounts to essentially the same thing by multiplying the probabilities on the branches on a given path in order to figure out the probability of that path’s happening, and then make an appropriate ratio (I’ll do that in a moment).

Let’s go right to the full tree:

We can interpret the above multiple ways. We can say, “On average, 1/3 of the times we play this game, we choose box GG, at which point we always pull a gold coin; and, on average, 1/3 of the times we play this game, we choose box SG, and we then pull a gold coin about 1/2 those times.” Those fractions are, of course, probabilities. Which is to say that we can multiply them to get the probability of a given path. In other words, we can also, and more efficiently, interpret the above, by noting that there’s a (1/3) × (2/2) = 1/3 probability of choosing box GG then pulling a gold coin; and there’s a (1/3) × (1/2) = 1/6 probability of choosing box SG then pulling a gold coin.
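The multiply-along-the-path idea can be sketched with exact fractions in Python. The variable names `p_box`, `p_gold_given_box`, and `path` are mine, chosen only for this illustration.

```python
from fractions import Fraction

p_box = Fraction(1, 3)  # each box is equally likely to be chosen

# Probability that the first coin is gold, given the box.
p_gold_given_box = {"GG": Fraction(2, 2), "SG": Fraction(1, 2), "SS": Fraction(0, 2)}

# Multiply along each path: P(choose box) * P(gold | box).
path = {box: p_box * p for box, p in p_gold_given_box.items()}

print(path["GG"])  # 1/3
print(path["SG"])  # 1/6
```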

We can now find the probability that *we chose box GG given the evidence of a gold coin* by setting a ratio of our desired outcome’s probability against the sum of all the probabilities corresponding to the outcomes that meet our condition of having pulled a gold coin. I’ll make this explicit. Allow me to introduce some notation for readability. If I write P(x), read that as “the probability that x happens.” Notice that the components of the ratio follow the stories told by our branch paths:
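As a sketch in that notation (the phrasing inside each P(…) is mine), the ratio reads:

$$
P(\text{chose GG given gold}) = \frac{P(\text{choose GG, then pull gold})}{P(\text{choose GG, then pull gold}) + P(\text{choose SG, then pull gold})}
$$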

To recap, the numerator has *the outcome we’re hoping for*, while the denominator has *the outcome we’re hoping for* PLUS *the (relevant possible) outcome(s) we’re* not *hoping for*. Or, more simply put, the numerator has *the probability of this evidence arising given the situation we’re hoping for (or at least that makes our hypothesis true)*, and the denominator has *the probability of this evidence arising period*. In specific terms, this translates, respectively, to *the probability of the first coin being gold when we’ve chosen box* GG and *the probability of the first coin being gold period*.

Now we just plug in the numbers. This introduces nothing new into the conversation except for a visual representation of the math we’ve already noticed:
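Using the path probabilities (1/3) × (2/2) and (1/3) × (1/2) worked out earlier, the plugged-in version should look roughly like this:

$$
\frac{\frac{1}{3} \times 1}{\left(\frac{1}{3} \times 1\right) + \left(\frac{1}{3} \times \frac{1}{2}\right)} = \frac{\frac{1}{3}}{\frac{1}{3} + \frac{1}{6}} = \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}
$$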

And there you have it!^{1}

**Hey, That’s Just Bayes’ Theorem:**

The above amounts to an application of what we’ve come to call “Bayes’ theorem” (or “law” or “rule”). Often with conditional probability problems like these, where the aim is to revise a probability (or “degree of belief”) estimate based on new evidence or data, folks will trot out Bayes’ theorem, plug in some numbers, and say, “And there you have it!” Here, though, we’ve gotten to the theorem from the inside out. The theorem really isn’t anything magical, but it is powerful. And when you understand it, it’s intuitive enough that it doesn’t need to be rotely memorized.

My goal today isn’t to talk at length about Bayes’ theorem, but rather to say enough to show how our above work matches up with it. I make this more complicated than it needs to be for the sake of continuing with the idea of working from the inside out. (For a more in-depth overview of Bayes’ theorem, see my post, “An Easier Counterintuitive Conditional Probability Problem [with and Without Bayes’ Theorem],” where I also apply the theorem twice to the same hypothesis. For a more thorough primer, along with some fun examples, check out Dan Morris’s book, *Bayes’ Theorem Examples: A Visual Introduction For Beginners*. For an overview of Bayesian statistics and Bayesian inference more generally, see this Wikipedia entry, where you’ll find links to other useful articles.)

A common way to represent Bayes’ theorem is as follows.
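Written out, this is the standard two-hypothesis form, typeset here with ~ for “not”:

$$
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid {\sim}H)\,P({\sim}H)}
$$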

Where:

H = *hypothesis*

E = *evidence*

~ = *not*

P(H|E) = something like, “The probability my *hypothesis* is true given this *evidence*” or, for short, “the probability of H given E.”

This is essentially an abstraction of what we produced at the end of the last section. To review: the numerator has *the probability of this evidence arising given the situation we’re hoping for (or at least that makes our hypothesis true)*, and the denominator has *the probability of this evidence arising period*. In specific terms, this translates, respectively, to *the probability of the first coin being gold when we’ve chosen box* GG and *the probability of the first coin being gold period*.

Notice that, since the denominator is really just the probability of getting the evidence at all, it is equivalent to P(E). If P(E) is already known or easily calculated, you can skip the longer formula and plug P(E) into the denominator. Some problems are much more easily worked out that way, so Bayes’ theorem is often represented as:
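That shorter form, with P(E) standing in for the whole denominator, is usually written:

$$
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}
$$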

In fact, this is Bayes’ theorem (though I’ll refer to it here as “abbreviated”). We derive the longer version, when useful, by breaking up P(E) with the Law of Total Probability.

Arguably, Bertrand’s Box Paradox is most easily addressed with the abbreviated form, given that we can infer P(E) to be 1/2, due to the situation’s general symmetry and there being a total of 3 gold and 3 silver coins in the three boxes; but let’s let that 1/2 figure emerge from the inside out, using the longer version of the formula. I’ll do this two ways.

First, I’ll run it in a way that matches the longer version of the theorem above. Some of the numbers will at first differ from what we saw at the end of the last section (for reasons that will be obvious), but will quickly line up as expected. Then I’ll run it again with a slight variation, but in a way that is intuitively streamlined; in fact, the numbers there will more closely match what we did at the end of the last section.

Here’s the first way. The hypothesis here is that box GG was chosen, or “GG” for short. The evidence is that the first coin pulled was gold, or “g” for short. Start by plugging a verbal representation of this into Bayes’ theorem:
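A verbal version along these lines (my wording, laid over the longer form of the theorem) would be:

$$
P(\text{chose GG} \mid \text{first coin gold}) = \frac{P(\text{first coin gold} \mid \text{chose GG})\,P(\text{chose GG})}{P(\text{first coin gold} \mid \text{chose GG})\,P(\text{chose GG}) + P(\text{first coin gold} \mid \text{didn't choose GG})\,P(\text{didn't choose GG})}
$$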

Tighten this up with the symbolic representation:
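In symbols, with GG as the hypothesis and g as the evidence, that becomes:

$$
P(GG \mid g) = \frac{P(g \mid GG)\,P(GG)}{P(g \mid GG)\,P(GG) + P(g \mid {\sim}GG)\,P({\sim}GG)}
$$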

If need be, take a moment to compare these steps with our earlier formulation and convince yourself that this story’s on the right track. Now we find the relevant probabilities for the equation.

P(GG) is the probability of choosing box GG before conditioning on our current evidence. That’s 1/3.

P(~GG) is the probability of choosing either box SG or SS (again, before introducing our evidence). That’s 2/3.

P(g|GG) is the probability of the first coin being gold when we’ve chosen box GG. That’s clearly 1.

P(g|~GG) is the probability of the first coin being gold when we’ve chosen either box SG or SS. In other words, this asks us to look at what can happen when we don’t select box GG, and find the proportion of time that our first pull is a gold coin among all the ways things can go, win or lose, in that condition. I’ll sketch the relevant sample space as:

{(choose SG then pull silver), (choose SG then pull gold), (choose SS then pull silver)}

There are other ways to model and represent and think about this, but this’ll do. Now we’re looking for the *ratio of*…

…the probability of selecting box SG (i.e., 1/3) then pulling a gold coin (1/2), which comes out to (1/3) × (1/2) = 1/6

*to*

…the probability of selecting box SG then pulling a silver coin (1/6) PLUS the probability of selecting box SG then pulling a gold coin (also 1/6) PLUS the probability of selecting box SS (1/3) (in which case we always pull a silver coin) = (1/6) + (1/6) + (1/3) = 2/3.

We take that ratio: (1/6)/(2/3) and get 1/4.

So, P(g|~GG) is 1/4. (This makes sense, given that, of the four coins in SG and SS, one is gold. Yeah, I am indeed making this more complicated than it needs to be—you know, for learning.)
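Here is the same ratio as a small Python sketch with exact fractions. The variable names (`sg_then_gold` and so on) are just my labels for the three paths that remain when box GG wasn't chosen.

```python
from fractions import Fraction

third = Fraction(1, 3)
half = Fraction(1, 2)

# Paths available when box GG was NOT chosen.
sg_then_gold = third * half    # 1/6
sg_then_silver = third * half  # 1/6
ss_then_silver = third * 1     # 1/3 (box SS always yields silver)

# Total probability of not choosing GG, then the share of it that yields gold.
p_not_gg = sg_then_gold + sg_then_silver + ss_then_silver  # 2/3
p_gold_given_not_gg = sg_then_gold / p_not_gg

print(p_gold_given_not_gg)  # 1/4
```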

Now plug in the numbers:
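With P(GG) = 1/3, P(~GG) = 2/3, P(g|GG) = 1, and P(g|~GG) = 1/4, the substitution should run:

$$
P(GG \mid g) = \frac{1 \times \frac{1}{3}}{\left(1 \times \frac{1}{3}\right) + \left(\frac{1}{4} \times \frac{2}{3}\right)} = \frac{\frac{1}{3}}{\frac{1}{3} + \frac{1}{6}} = \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}
$$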

And again we get a final answer of 2/3. And, as expected, the denominator is 1/2 = P(E). (You might notice that there are more efficient ways to do the math as I go along here. But I’d like to make clear the sorts of steps one might take when the numbers aren’t so convenient.)

Finally, I’ll run it again with the same hypothesis and evidence, but this time, rather than using ~GG in the denominator, I’ll use the competing hypothesis that we chose SG, or “SG” for short.

Technically, we should also include the third hypothesis, “chose box SS” (or “SS” for short), but we could skip it here because its term comes out to 0, given that P(g|SS) = 0. The idea, though, is that our unconditioned probabilities add up to 1. That is: P(GG) + P(SG) + P(SS) = 1/3 + 1/3 + 1/3 = 3/3 = 1.^{2} This is consistent with what we’ve done so far where P(GG) + P(~GG) = 1, and where ~GG is just SG or SS, and P(SG) + P(SS) = 2/3.^{3}

In fact, we can generalize Bayes’ theorem where “E” is an evidencing event that can occur in conjunction with one of the mutually exclusive and exhaustive hypothesis events H_{1}, H_{2}, H_{3}… H_{n}; in other words, H_{i}, where *i* runs over the counting numbers from 1 to *n*. So, for P(H_{i}|E) we can use the usual numerator of P(E|H_{i})×P(H_{i}), and for the denominator we get:
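In sigma notation (I use *j* as the summation index so it doesn't collide with the *i* of the hypothesis we care about), the denominator is the sum over every hypothesis:

$$
\sum_{j=1}^{n} P(E \mid H_j)\,P(H_j)
$$

Putting that under the standard Bayes numerator, the likelihood times the prior of H_{i}, gives the general form:

$$
P(H_i \mid E) = \frac{P(E \mid H_i)\,P(H_i)}{\sum_{j=1}^{n} P(E \mid H_j)\,P(H_j)}
$$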

(If you’re not familiar with sigma notation or need a quick refresher, here’s a quick and clear tutorial.)

I first put our problem strictly in terms of GG in order to stick with the way the theorem is popularly encountered. In our new formulation, we end up with the following (I’ll include SS for the sake of thoroughness):
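With the three hypotheses GG, SG, and SS, that formulation would read:

$$
P(GG \mid g) = \frac{P(g \mid GG)\,P(GG)}{P(g \mid GG)\,P(GG) + P(g \mid SG)\,P(SG) + P(g \mid SS)\,P(SS)}
$$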

I’ve skipped the verbal representation this time, but the story is still there and should be consistent with what we’ve done so far.

We already know all these probabilities, but let’s review:

P(GG) = 1/3

P(SG) = 1/3

P(SS) = 1/3

P(g|GG) = 1

P(g|SG) = 1/2

P(g|SS) = 0

Plug in to get:
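Substituting the six values just listed, the calculation should look like:

$$
P(GG \mid g) = \frac{1 \times \frac{1}{3}}{\left(1 \times \frac{1}{3}\right) + \left(\frac{1}{2} \times \frac{1}{3}\right) + \left(0 \times \frac{1}{3}\right)} = \frac{\frac{1}{3}}{\frac{1}{3} + \frac{1}{6} + 0} = \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}
$$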

And again we get 2/3.
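For anyone who'd like to reuse the multi-hypothesis version, here is a minimal Python sketch. The function name `posterior` and its dictionary-based interface are my own invention for illustration, not a standard library API, and it assumes the priors cover mutually exclusive, exhaustive hypotheses.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(H_i | E) for every hypothesis H_i.

    priors:      {hypothesis: P(H_i)}, assumed to sum to 1
    likelihoods: {hypothesis: P(E | H_i)}
    """
    # Numerator for each hypothesis: P(E | H_i) * P(H_i).
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # Denominator: P(E), via the law of total probability.
    p_evidence = sum(joint.values())
    return {h: joint[h] / p_evidence for h in priors}

priors = {"GG": Fraction(1, 3), "SG": Fraction(1, 3), "SS": Fraction(1, 3)}
likelihoods = {"GG": Fraction(1), "SG": Fraction(1, 2), "SS": Fraction(0)}

result = posterior(priors, likelihoods)
print(result["GG"], result["SG"], result["SS"])  # 2/3 1/3 0
```

Using `Fraction` keeps the arithmetic exact, so the output matches the hand calculation instead of a rounded decimal.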

There are of course yet other ways to model this problem. For example, you might try it with the hypothesis that the second coin pulled will be gold (which can happen in the SG condition, provided the first coin was silver).

As always, I invite you to comment with suggestions for improvement, corrections, and questions.

**Closing Remarks:**

It shouldn’t be surprising that different routes can lead to the same answer depending on how evidence is modeled, especially when it comes to problems where the rates of outcome are reliably predictable. For example, suppose you want to get a 3♥ out of a well-shuffled deck of 52 playing cards. You pull a card and get a 7♣. You don’t need Bayes’ theorem to know that if you pull another card from the remaining 51 cards, the probability of 3♥ is now 1/51. But if you do use the theorem, you’ll need to decide whether to represent E as 7♣ (whose probability is 1/52) or as ~3♥ (i.e., 51/52).
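To make that concrete, here is a sketch of both choices. I'm assuming the hypothesis H is “the second card pulled is 3♥,” which the example above doesn't state outright; under that assumption P(H) = 1/52 before any card is seen.

$$
P(H \mid 7\clubsuit) = \frac{P(7\clubsuit \mid H)\,P(H)}{P(7\clubsuit)} = \frac{\frac{1}{51} \times \frac{1}{52}}{\frac{1}{52}} = \frac{1}{51}
$$

$$
P(H \mid {\sim}3\heartsuit) = \frac{P({\sim}3\heartsuit \mid H)\,P(H)}{P({\sim}3\heartsuit)} = \frac{1 \times \frac{1}{52}}{\frac{51}{52}} = \frac{1}{51}
$$

Either way of modeling the evidence lands on 1/51.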

Such toy problems are for getting comfortable with the tools and sharpening your chops. Real-world questions (“Did the defendant do it?”) are where you’ll find the real and messy action. Bayes’ theorem is only as good as the numbers that go into it, so a big chunk of its value may lie not only in generating a series of probability estimate updates as new bits of evidence are introduced, but also in the care it urges us to take in developing the numbers that often come from the subjective evaluation of that evidence.

That’s a discussion for another day and for which I’m excitedly accumulating notes and instructive readings like this article: “Bayes and the Law”^{4}; and this book: *Bayesian Data Analysis*^{5}; and many others.

**UPDATE 4/27/20:** A big thanks to the VSauce2 channel for linking this post as a source in their explanation of this problem! Check it out: “The Easiest Problem Everyone Gets Wrong.”

**Post Script & Bonus Question:**

Just as I was about to send this post to press, I discovered Allen Downey’s blog, *Probably Overthinking It*. Downey is author of the *Think X* series (e.g., *Think Bayes*; looks good, though I haven’t read it, partly because I don’t know Python, which seems to be a prereq—maybe it’s time to learn!). In a blog post, he poses a problem formally similar to Bertrand’s Box Paradox:

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of Bowl #1?

Find that, plus a trickier variation involving M&M’s, here: “My Favorite Bayes’s Theorem Problems.” Solutions are in a separate post: “All Your Bayes Are Belong to Us!” I found his solution to the M&M’s problem especially instructive. Without giving too much away: his formulation of the hypotheses allowed for tidy calculations that could have otherwise been cluttered. (The post includes some other problems I haven’t looked at yet.)
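If you'd like to check your own answer to the cookie problem, here is a minimal Python sketch that applies the same steps we used on the boxes. The variable names are mine, and this is just one way to set it up, not necessarily how Downey does it. Running it prints the answer, so work the problem first if you want to avoid the spoiler.

```python
from fractions import Fraction

# Prior: Fred picks either bowl with equal probability.
p_bowl1 = Fraction(1, 2)
p_bowl2 = Fraction(1, 2)

# Likelihood of pulling a plain cookie from each bowl.
p_plain_given_bowl1 = Fraction(30, 40)  # 30 plain out of 40 cookies
p_plain_given_bowl2 = Fraction(20, 40)  # 20 plain out of 40 cookies

# P(plain), via the law of total probability.
p_plain = p_plain_given_bowl1 * p_bowl1 + p_plain_given_bowl2 * p_bowl2

# Bayes' theorem: P(bowl 1 | plain).
p_bowl1_given_plain = p_plain_given_bowl1 * p_bowl1 / p_plain

print(p_bowl1_given_plain)
```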


#### Footnotes:

- For fun, here’s yet another way to do the tree, in which we toss out box SS and update boxes GG and SG each to 1/2. Notice that so long as the initial numbers leading to GG and SG are the same (except for 0), they cancel out and you end up with 1/(3/2), just as we did in the previous diagram.
- We can also achieve this by renormalizing the GG and SG probabilities so they add to 1; i.e., by making each 1/2, which we can do once we’ve removed SS from the picture.
- This is just like when you can treat “the die didn’t land 1” as either P(~1) = (1 – P(1)) = (1 – 1/6) = 5/6; or as P(2) + P(3) + P(4) + P(5) + P(6) = 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 5/6. Notice, again, that P(1) + P(2 or 3 or 4 or 5 or 6) = P(1 or ~1) = 6/6 = 1.
- by Norman Fenton, Martin Neil, Daniel Berger (*Annu Rev Stat Appl.* 2016 Jun; 3: 51–77. Published online 2016 Mar 9. doi: 10.1146/annurev-statistics-041715-033428)
- 3rd edition (2013) by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin.
