# 1. A Counterintuitive Probability Problem

Mathematician Gil Kalai recently posted the following intuition-bending probability problem at his blog:

*You throw a dice until you get 6. What is the expected number of throws (including the throw giving 6) conditioned on the event that all throws gave even numbers.*\*

(\*The problem is originally due to Elchanan Mossel. The term “a dice” is used here to denote a single die. Some people dislike this usage, though it does not obscure the question.)

Most people give the initially intuitive answer: three. But that’s wrong. I’ll get to the right answer below. Solutions, along with commenters’ expressions of bewilderment and contention, have been posted at Kalai’s blog, Math with Bad Drawings, and at Mind Your Decisions’ blog and YouTube channel. It’s interesting to note that one YouTube commenter accepted the correct solution, even giving an explanation in their own words, then later reverted to the incorrect answer of three.

Methods for getting the correct solution are often hard to follow due to involving complicated-looking math or relying on background knowledge. I’ll share a solution that I think makes the problem more intuitive, and that only requires a basic understanding of probability, along with a little precalculus. Still, I’ll be sure to review the most relevant concepts as I go along, just in case it’s helpful.

# 2. The Meaning of “Expectation” and Basic Probability Rules

First, we need to be clear about what the problem is asking. I’ll start by clarifying what is meant here by “expect,” given that several commenters have expressed confusion about the word.

In the present context, “expect” has a technical meaning, though it is related to the ordinary usage of the word. The long and short of it is that “expect” here means what you expect to see *on average* over several trials. With that in mind, I’ll first consider the ordinary usage, and will tease out the technical meaning from that. I’ll also cover some simpler probability problems and review basic probability rules.

If you already know about this stuff, you might want to skip to the next section: “Solving the Problem.”

Let’s start with a simple variation on the problem at hand: *How many times would you expect to flip a fair quarter to see a Heads, including the Heads flip?*

(I might not always put the “including the Heads” part, but consider it assumed unless otherwise noted.)

The ordinary usage of “expect” yields several interpretations. For starters, we might think, “Well, as many flips as it takes to exceed a 50% chance of getting at least one Heads.” Ok, so how many would that be? The probability of getting Heads in one flip is obviously 50%. The probability of getting at least one Heads in two flips can be worked out in various ways. The most intuitive is to work out all the possible outcomes in the sample space, then make a ratio of the desired outcomes to all possible outcomes:

**H**,**H**

**H**,T

T,**H**

T,T

That comes out to four possible outcomes, three of which have *at least* one Heads. Therefore, there is a 3/4, or 75%, chance of getting at least one Heads in two flips of a fair coin. So perhaps you might expect to require two flips in order to feel confident, but not overly confident, that you’ll see a Heads.
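If you like, this little sample-space count can be verified with a few lines of Python (just an illustrative sketch; nothing in the argument depends on it):

```python
from itertools import product

# Enumerate all possible outcomes of two fair coin flips.
outcomes = list(product("HT", repeat=2))

# Keep the outcomes containing at least one Heads.
at_least_one_heads = [o for o in outcomes if "H" in o]

p = len(at_least_one_heads) / len(outcomes)
print(len(outcomes), len(at_least_one_heads), p)  # 4 3 0.75
```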

Before moving on, let’s look at two other ways to solve this problem. Remember that when you want to find the probability of two independent events happening, you multiply their individual probabilities. A coin flip is independent because the result of one coin flip doesn’t depend on any other coin flips. So, the probabilities of the above example are as follows (I’ll denote the probability of something as “P(something)”):

H,H = P(H)×P(H) =(1/2)×(1/2) = 1/4 = .25 = 25%

H,T = P(H)×P(T) =(1/2)×(1/2) = 1/4 = .25 = 25%

T,H = P(T)×P(H) =(1/2)×(1/2) = 1/4 = .25 = 25%

T,T = P(T)×P(T) =(1/2)×(1/2) = 1/4 = .25 = 25%

Notice that if you add them all up you get: 1/4 + 1/4 + 1/4 + 1/4 = 4/4 = 1 = 100%. The probability of all possible outcomes—i.e., the sample space—should always add up to 1.

Now, to find the probability that one or another of two mutually exclusive events—i.e., events that can’t happen together—will happen, you add their probabilities. So, the probability of getting Heads or Tails in a single flip is P(H) + P(T) = 1/2 + 1/2 = 1. We can use this to answer the question regarding the probability of getting at least one Heads in two flips, by simply adding together all the outcomes that satisfy that condition:

H,H = P(H)×P(H) =(1/2)×(1/2) = 1/4 = .25 = 25%

Or: H,T = P(H)×P(T) =(1/2)×(1/2) = 1/4 = .25 = 25%

Or: T,H = P(T)×P(H) =(1/2)×(1/2) = 1/4 = .25 = 25%

We now add those up to once again get 75%.

There’s yet another way to calculate this probability. It’s the most important way, because we’ll need it when we have too many possibilities to list out. Remember that the probabilities of a sample space sum to 1. This means we can ask the opposite, or complementary, question, and then subtract that answer from 1: *What is the probability of getting *zero* Heads in two flips?* Well, there’s one scenario where that happens: T,T. That probability is 1/2 × 1/2 (which from here on I’ll denote as (1/2)^2). We then subtract that from 1 to get 3/4.
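The complement trick is easy to express in code; here’s a small Python sketch (the function name is mine, not anything standard):

```python
# P(at least one Heads in n flips) = 1 - P(all n flips are Tails)
def p_at_least_one_heads(n):
    return 1 - (1 / 2) ** n

print(p_at_least_one_heads(1))  # 0.5
print(p_at_least_one_heads(2))  # 0.75
```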

Ok, now back to “expectation.” 75% sounds reasonable. But what if the stakes were higher? What if your life depended on seeing a Heads? Then how many flips would you expect? “Hmmm… now I’d like to hit over 80%, or even 99%, certainty of seeing a Heads, just to be safe.” We can figure that out by recognizing that we need to subtract from 1 the probability of a certain run of Tails—namely, the run of all Tails that has a 1% chance of happening. We can use this equation:

1 − (1/2)^x = .99

Where x is some number of Tails in a row, such that (1/2)^x equals .01.

We could do a guess and check to get this answer pretty quickly:

1 − (1/2)^3 = 1 − 1/8 = 7/8 = .875

1 − (1/2)^4 = 1 − 1/16 = 15/16 = .9375

…and so on.

A quicker way is to use a logarithm. Just pop “log base .5 of .01” into a calculator like so: log(.01)÷log(.5) ≈ 6.6439. Let’s round up to 7 and test:

1 – (1/2)^6 = 63/64 = .984375

1 – (1/2)^7 = 127/128 = .9921875

It works out.
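The same logarithm calculation can be done in Python, if that’s handier than a calculator (a small sketch; the variable names are mine):

```python
import math

# Solve (1/2)^x = 0.01 for x: the run of consecutive Tails
# that has only a 1% chance of occurring.
x = math.log(0.01) / math.log(0.5)
n = math.ceil(x)  # round up to a whole number of flips

print(x)             # ≈ 6.6439
print(n)             # 7
print(1 - 0.5 ** n)  # 0.9921875, i.e., > 99%
```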

But, of course, 99% is so high that we wouldn’t really *expect* seven flips, though we might expect no more than seven flips. This high probability just bolsters our certainty that we’ll see one or more Heads (though I still wouldn’t bet my life on it!). Notice that getting *exactly* one Heads is a very different question. For example, getting exactly one Heads in two flips has a probability of 1/2, because two out of the four possible scenarios have exactly one Heads. Getting exactly one Heads in four flips can be calculated in the following way. I won’t elaborate much on this because it’s not necessary for the problem at hand, but it’s worth having a look at.

The numerator will be 4-choose-1^{1}, or 4, because any single flip can be the Heads. The denominator will be however many possible outcomes there are, which in this case is 2^4, or 16.^{2} So, the probability of getting exactly one Heads in four flips is 4/16, or 1/4. To bear this out, here are the possible outcomes that fulfill the desired outcome:

**H**,T,T,T

T,**H**,T,T

T,T,**H**,T

T,T,T,**H**

Each has a probability of (1/2)^4, or 1/16. We add them up to get 4/16, or 1/4. Notice that there must be 12 “losing” scenarios that I didn’t list here, such as H,H,T,T.
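For what it’s worth, the n-choose-k calculation generalizes nicely; here’s a quick Python sketch using the standard library’s `math.comb` (the wrapper function is my own naming):

```python
from math import comb

# P(exactly k Heads in n flips of a fair coin) = C(n, k) / 2^n
def p_exactly_k_heads(k, n):
    return comb(n, k) / 2 ** n

print(p_exactly_k_heads(1, 2))  # 0.5   (2 of the 4 outcomes)
print(p_exactly_k_heads(1, 4))  # 0.25  (4 of the 16 outcomes)
```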

Ok, this is all good for reviewing probability rules and helping us warm up our probability intuitions. Now, how do we determine how many flips to “expect,” in the technical sense, before seeing a Heads?

Expectation, in this sense, is not about a single trial, nor is it necessarily about an exact number. It’s about what you’d likely see on average if you do the trial over and over and over again. In fact, this applies to the above examples as well. When we say that the probability of getting at least one Heads in two flips is 75%, we’re saying that if you flip a coin twice, then flip it twice, then repeat, repeat, repeat, over and over again, and then you look at all the sets of two flips, you’re likely to see, on average, that about 75% of them have at least one Heads. And about 25% of them will be T,T.

That said, here’s the solution. The probability of a quarter landing Heads is 1/2. That means that there will be about one Heads for every two flips in a large set of flips. So, you expect two flips, on average, in order to see a Heads. Notice that 2 is just the reciprocal of 1/2. (A convenient, and highly intuitive, feature of the geometric distribution that we’ll take for granted.)

To be abundantly clear, this does *not* mean that if you flip a coin 100 times, you should literally expect to see 50 Heads and 50 Tails. Indeed, the probability of getting exactly 50 Heads in 100 flips is only about 8%. But this is still a higher probability than getting exactly 49 Heads (which means you landed 51 Tails), or any other number of Heads. In other words, the most likely outcome is 50 Heads, though its likelihood is still fairly low. I’ve popped some of this into Desmos to demonstrate, though you need not spend much time on these diagrams; just notice the outcomes.

The diagrams show, from top to bottom, out of 100 flips: The probability of getting 49 Heads and 51 Tails (or vice versa). The probability of getting 50 each of Heads and Tails. The probability of getting from 40 to 60 Heads inclusive.^{3}

The key thing to notice is that the probability of getting exactly 50 Heads is fairly low, but the probability of getting from 40 to 60 Heads is around 96%. This means that if you flip a fair quarter 100 times over and over again, you’ll likely see an average of about 50% Heads across all those trials of 100 flips. For example, suppose one trial has 40 Heads and another has 60; that’s an average of 50 Heads per trial. And across those trials, which will show much variation, the fraction with *exactly* 50 Heads is expected to be about 8%.
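If you’d rather not use Desmos, the same figures can be checked with a short Python sketch (exact binomial arithmetic via `math.comb`; the helper name is mine):

```python
from math import comb

# P(exactly k Heads in n flips of a fair coin)
def p_heads(k, n=100):
    return comb(n, k) / 2 ** n

p_50 = p_heads(50)
p_40_to_60 = sum(p_heads(k) for k in range(40, 61))

print(p_50)        # ≈ 0.08 (exactly 50 Heads)
print(p_40_to_60)  # ≈ 0.96 (from 40 to 60 Heads inclusive)
```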

Ok, let’s translate this into a simple question about rolling a die: *How many times would you expect to roll a die to see a 6?* The probability of getting a six in a single throw is 1/6. Therefore, on average, you’ll have about six throws for every appearance of a 6. In other words, you can expect an average of 6 throws in order to see a 6 (as usual, this includes the throw that gives the 6).
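A quick simulation bears this out. Here’s a hedged Python sketch (the function name, seed, and trial count are my own choices) that rolls until a 6 appears and averages the counts:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def rolls_until_six():
    """Roll a fair die until a 6 appears; return the number of rolls (including the 6)."""
    count = 0
    while True:
        count += 1
        if random.randint(1, 6) == 6:
            return count

trials = 100_000
average = sum(rolls_until_six() for _ in range(trials)) / trials
print(average)  # ≈ 6, the reciprocal of 1/6
```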

This concept is commonly applied for assessing the “expected value” of some event. For example, consider a game in which you win $12 every time a die lands on a 6. In 60 rolls, you’d expect about 1/6 of them to be a 6. That means you’d expect to get 60/6, or 10, payments of $12. So the expected value of 60 rolls would be $120. This averages out to $2 per roll. That is, the probability of getting a 6 times the payout for getting a 6: (1/6)×12 = 2.

That’s a simple example. Usually we’d include some probability for loss. Such as: You’ll have to pay a dollar for any non-6 roll. This means for every six rolls, you can expect to pay $5, but earn $12. In other words, you’ll net $7 per six rolls. This makes the expected value of a given roll 1/6 of $7, or about $1.17. I think most people would play this game if given, say, 120 trials, in which case you’d expect to earn $140: i.e., the 120 rolls should yield about 20 instances of a 6, meaning getting paid $12 about 20 times, or $240; then subtract the 100 instances of paying $1: $240 – $100 = $140. This is also what you get if you multiply 120 by $7/6.

But would you really expect to earn $1.17 if given only one roll? Only in the technical sense. In the ordinary sense of “expect,” you’d probably expect to lose a dollar. Not sure I’d take that bet. Though I would if the loss were a penny and the gain a million dollars.
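The dollar figures above are easy to sanity-check by simulation. Here’s a sketch of the hypothetical game in Python (payouts as described above; the function name, seed, and roll count are mine):

```python
import random

random.seed(1)  # fixed seed for reproducibility

# One roll of the game: win $12 on a 6, pay $1 on anything else.
def payout():
    return 12 if random.randint(1, 6) == 6 else -1

rolls = 200_000
average = sum(payout() for _ in range(rolls)) / rolls
print(average)  # should hover near 7/6 ≈ 1.17 per roll
```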

At any rate, this gets at some of the difficulty of relating single trials to large sets of trials (which I’ll call an “experiment”), a difficulty that I think served as a source of confusion for many of the commentators I observed discussing the problem at hand. Hopefully that confusion is now resolved, and we can agree that “expect” here is about the result you’ll likely see on average.

We’re now ready to solve the problem in an (I hope) intuitive way.

# 3. Solving the Problem

I’ll first solve it using an experiment involving 120 trials; this will conclude with a computer simulation of the game. Then I’ll solve it just using math.

Once again, Kalai states the problem as follows:

*Test your intuition: You throw a dice until you get 6. What is the expected number of throws (including the throw giving 6) conditioned on the event that all throws gave even numbers.*

Here’s my statement of it:

You play a game in which you roll a fair die. The rules of the game are:

1. If you roll a 6, you win and the game ends.

2. If you roll a 1, 3, or 5, you lose and the game ends.

3. If you roll a 2 or a 4, you roll again.

4. Repeat the above until the game ends (i.e., until you win or lose).

Question: How many times can you expect to roll the die, on average, in those sequences that include a 6? In other words, what will the average number of rolls be in a winning sequence?

Some comments on my statement of the problem. Anything I’ve added to the content of Kalai’s statement (e.g., that it’s a standard 6-sided die), I consider to be blatantly obvious. Beyond that, I’m only explicitly stating things implied by the original statement. I’ll belabor this a little, given that I’ve seen many complaints about Kalai’s statement being ambiguous. It isn’t.

We’re throwing a fair, six-sided die. This means we have to consider the likelihood of any one of the die’s sides coming up in any given throw. This also means we’ll end up with a large set of outcomes if we run the trial (i.e., the game) many times, and many of those outcomes will terminate with a 1, 3, or 5. Those trials will be thrown out, and along with them many instances of 2 and 4. But it’s not as though those numbers don’t exist on the die and won’t come up in a trial. It’s just that we’ll only tally from the trials that have a 6 in them—I call the collection of these trials the “winning” set. That’s the set we’re interested in. It will include every 6 that was rolled, but not every 2 or 4, though it will include some of the 2’s and 4’s rolled (i.e., the 2’s and 4’s that preceded a 6). Indeed, 6 is the only number on the die that can make the winning set on its own; this gives it a big advantage. A major theme here will be that we should expect more 6’s than 2’s or 4’s in the winning set.

That said, here are some examples of how the game can go:

You roll a 1 and lose (a probability of 1/6).

You roll a 6 and win (a probability of 1/6).

You roll a 2 and then a 3 and lose (a probability of (1/6)^2 = 1/36).

You roll a 4 and then a 6 and win (a probability of (1/6)^2 = 1/36).

Notice that the probability of getting a single 6 is higher than that of getting a 4 and then a 6. Again, we should expect more 6’s than 2’s or 4’s. (Remember, the intuitive wrong answer assigns 2, 4, and 6 equal probabilities of 1/3 each, and thus predicts getting equal numbers of 2’s, 4’s, and 6’s in the winning set.)

## 3.1 Solving by Application

Now, consider you play the game 120 times. (Note that I’m using “trial” and “game” interchangeably.) In those 120 games, you expect the following results (I could parse out the sample space in other ways, but this way seems instructive):

[1] 1/6 of the 120 trials will be a 6. (YOU WIN)

[2] Some fraction of the 120 trials will have at least one 2 or 4, and then terminate in a 6. (I’ll figure this out below.) (YOU WIN)

[3] 1/2 of the 120 trials will be a 1, 3, or 5 (that’s P(1) + P(3) + P(5) = 3/6 or 1/2). (YOU LOSE)

[4] Some fraction of the 120 trials will have at least one 2 or 4, and then terminate in a 1, 3, or 5. (I’ll figure this out below.) (YOU LOSE)

The probabilities above should add up as follows: [1]+[2]+[3]+[4] = 1 (or 100%).

The probabilities for [2] and [4] require a bit of precalculus. I don’t have space here to thoroughly explain these techniques, but hopefully what I do explain will suffice. I’ll provide links to explanations for anyone who wants to delve deeper.

For [2], let’s first consider some examples of how things can go. You could roll:

2, 6

2, 4, 6

2, 2, 4, 6

Notice again that these outcomes are all less likely than rolling a 6. The third trial, for example, has a probability of (1/6)^4 or 1/1296, or about .078%. Now, let’s state what can happen in [2] more specifically. To be included in the [2] condition, you could get:

2 or 4, then a 6

Or: 2 or 4, then 2 or 4, then a 6

Or: 2 or 4, then 2 or 4, then 2 or 4, then a 6

Or… well, this can go on forever!

Putting the above in terms of probabilities, where P(2 or 4) = 2/6 = 1/3, and P(6) = 1/6, the [2] condition may be satisfied as follows:

(1/3)×(1/6)

Or: (1/3)×(1/3)×(1/6)

Or: (1/3)×(1/3)×(1/3)×(1/6)

Or: (1/3)×(1/3)×(1/3)×(1/3)×(1/6), and so on forever, each new possibility adding another (1/3)

This creates an infinite geometric series. We now must add it all up, all infinitely many terms of it. Luckily, there are two simple ways to do it. First, we need to recognize the first term in the series. It’s (1/3)×(1/6), or 1/18. Then we need to recognize what that term is being multiplied by each time we perform a new roll of the die. We see that this is 1/3. So, the above can be expressed as:

(1/18)×(1/3)^0 + (1/18)×(1/3)^1 + (1/18)×(1/3)^2 + (1/18)×(1/3)^3…. and so on indefinitely.

(Remember: A number raised to the zero power equals 1.)

We have two ways to add this up. We can use a sigma calculator, such as Wolfram|Alpha’s.

Or, we can use a simple formula: a/(1–r), where *a* is the first term and *r* is the common ratio, which we’ve observed here to be 1/3.

Applying the formula: (1/18)÷(1 – 1/3) = (1/18)÷(2/3) = (1/18)×(3/2) = 1/12.

So, [2] will contain 1/12 of our 120 trials.
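You can confirm numerically that the series converges to 1/12; here’s a small Python sketch comparing a long partial sum against the a/(1–r) formula:

```python
# Series for [2]: first term a = (1/3)*(1/6) = 1/18, common ratio r = 1/3.
a, r = 1 / 18, 1 / 3

partial_sum = sum(a * r ** k for k in range(50))  # 50 terms: effectively converged
closed_form = a / (1 - r)

print(partial_sum)  # ≈ 0.08333... = 1/12
print(closed_form)  # 0.08333... = 1/12
```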

Now we do the same for [4]. This time I’ll just use the formula. Let’s figure out the first term and the common ratio, where P(2 or 4) = 1/3, and P(1 or 3 or 5) = 1/2. To be included in [4], you must get:

(1/3)×(1/2)

Or: (1/3)×(1/3)×(1/2)

Or: (1/3)×(1/3)×(1/3)×(1/2)

Or: (1/3)×(1/3)×(1/3)×(1/3)×(1/2), and so on forever, each new possibility adding another (1/3)

The first term is (1/3)×(1/2) = 1/6, and the common ratio is 1/3. So, using the formula:

(1/6)÷(1–1/3) = (1/6)÷(2/3) = (1/6)×(3/2) = 1/4.

So, [4] will contain 1/4 of the trials.

Let’s add these up and make sure they come out to 100%.

[1] = 1/6 = 20 trials

[2] = 1/12 = 10 trials

[3] = 1/2 = 60 trials

[4] = 1/4 = 30 trials

1/6 + 1/12 + 1/2 + 1/4 = 1 (or 100%)

And: 20 + 10 + 60 + 30 = 120

Looking good.
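To keep the arithmetic honest, here’s the same bookkeeping done with exact fractions in Python (the labels are my own):

```python
from fractions import Fraction as F

# Probabilities of the four outcome types, [1] through [4].
probs = {
    "[1] win on first roll":  F(1, 6),
    "[2] win after 2s/4s":    F(1, 12),
    "[3] lose on first roll": F(1, 2),
    "[4] lose after 2s/4s":   F(1, 4),
}

assert sum(probs.values()) == 1  # the sample space sums to 100%

counts = {label: p * 120 for label, p in probs.items()}
print(counts)  # expected trial counts out of 120: 20, 10, 60, 30
```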

Ok, so we know all the trial outcome types are accounted for in our sample space. Now we need to refine that space according to the conditions of the problem. The problem is asking how many rolls to expect in order to see a 6 within a trial terminating with a 6. In other words: What is the probability of getting a 6 in a single roll within the winning set—that is, within [1] and [2] combined? This is a key point. Remember that the usual probability of getting a 6 in one roll (i.e., 1/6) is equal to the probability of getting a 6 within the sets [1], [2], [3], and [4] combined. (Perhaps that’s an obvious point, but it becomes less obvious in the problem at hand, where the probability will not be 1/6.) The same concept applies here, but we’re only interested in [1] and [2].

(A note of caution: treating the rolls this way is a convenient assumption allowable because die rolls are independent. You have to be careful with this. Playing-card problems, for example, tend to involve dependent events. Consider this question: You have a normal deck of 52 playing cards. You start taking cards off the top. How many cards do you expect to remove in order to see an Ace? The answer isn’t 13, but 10.6. This problem is from Frederick Mosteller’s *Fifty Challenging Problems in Probability with Solutions*, which I recommend; you can also find a discussion of it, with beautiful Excel simulations, at Fifty Challenging Problems in Probability, and a short and sweet explanation in the PDF When Some Tricks Are Trickier Than Others: A Collection of Probability Tricks. For discussions that include more solutions to the present problem, see these Mathematics Stack Exchange threads: What is the expected value of the number of die rolls necessary to get a specific number? and On average, how many times must I roll a dice until I get a 6?)

In [1] and [2], we have 30 trials total. Twenty of those were trials in which a 6 was rolled on the first throw. The other 10 trials also contain a 6, but each also contains one or more 2’s or 4’s. (In a moment I’ll consider how many 2’s or 4’s there are.) Now we just have to calculate what fraction of all the winning trials yielded a 6 on the first roll. That is 20 out of 30 trials, or 2/3. So, within the context of the problem at hand, there is a 2/3 chance of rolling a 6.

This means that for every three trials, two yielded a 6 in the first throw. This also means that for every three individual throws in the set of all winning trials, there are two 6’s. All this boils down, on average, to one 6 per 1.5 throws. Notice that this is just the reciprocal of 2/3.^{4} So that’s the answer. You can expect to throw 3/2 times to see a 6, given that the preceding throws, if any, are even.

If this is correct, there must be about twice as many 6’s in the winning set as there are 2’s and 4’s combined. In other words, 2’s and 4’s combined make up 1/3 of the throws. Those two outcomes are equally likely, so they must each have a probability of 1/6 in the context of this set.

Looking at 2’s and 4’s is useful, I think, because it seems that a big part of why this problem is counterintuitive has to do with overestimating their quantity. As I noted above, there should mostly be 6’s here: every trial here is guaranteed to have a 6 in it. Not so for 2 or 4.

That said, let’s calculate how many of each number to expect in the winning set. We have 30 6’s total. Since 30 is 2/3 of 45, there must be 45 throws total, 15 of which are 2’s and 4’s (i.e., after subtracting the 30 6’s). A 2 and a 4 are equally likely, so there must be 7.5 of each on average.
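Before the spreadsheet experiment below, here’s a Monte Carlo sketch of the game in Python (my own implementation of the rules stated above, with an arbitrary seed and trial count):

```python
import random

random.seed(2)  # fixed seed for reproducibility

def play():
    """Play one game; return (won, number_of_throws)."""
    throws = 0
    while True:
        throws += 1
        roll = random.randint(1, 6)
        if roll == 6:
            return True, throws   # rule 1: win
        if roll in (1, 3, 5):
            return False, throws  # rule 2: lose
        # rule 3: roll was 2 or 4, so roll again

games = [play() for _ in range(100_000)]
winning_throws = [t for won, t in games if won]

win_rate = len(winning_throws) / len(games)
avg_throws = sum(winning_throws) / len(winning_throws)

print(win_rate)    # ≈ 1/4 of games are won
print(avg_throws)  # ≈ 1.5 throws per winning game
```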

Let’s put this to the test! I programmed the game into Excel and ran the 120-trial experiment 11 times, with the following results:^{5}

I expected…

- …20 wins with a 6 on the first roll. The average was 20.18.
- …10 wins with a 6 after getting some number of 2’s or 4’s. The average was 9.64.
- …30 wins total. The average was 29.82.
- …66.67% (i.e., around 2/3) of the winning trials to contain a 6 only. The average was 67.65%. Notice the wide range of values for this column, going from 53.33% to 92.59%! But the final average still comes close to the expectation—it overshoots by less than one percentage point (about .98 percentage points, actually). The wide range here has to do with the wide range of 2’s and 4’s (see below), due, I suppose, to throwing so many of those out, leaving a small sample of them. The star here really is 6. It is always well in the majority, usually by a large margin.
- …7.5 of each 2 and 4. I got 7.27 and 6.73, respectively.
- …15 2’s and 4’s total. I got 14 on average.
- …a 50/50 split of 2’s and 4’s. I got close, at around 52/48. Also notice that the range for the 2’s is 14–1=13; and the range for the 4’s is also 14–1=13.
- …6’s to be twice as many as the 2’s and 4’s combined. They were 3.19 times as many. Well, at least it wasn’t half as many (which is what the most common wrong answer predicts). Notice the impact of Experiment 5 here, where there were a whopping 13.5 times as many 6’s! Without that experiment, the average would have been 2.16.
- …6’s to make up 66.67% (or 2/3) of all outcomes. The average here is 68.05%. That’s a difference of 1.38 percentage points. Not bad.
- …an average of 1.5 throws to see a 6. The average here is 1.46. Had I rounded everything to one decimal point, it would have been 1.5, but I opted for a little more precision.

Had I not been convinced before, I would be now! Though it’s still possible I could be making an error in my math somewhere. I’m open to critique.

## 3.2 Solving with Basic Probability Math

Drawing from the probabilities I worked out earlier in this section, we know that, for any given run of this game, there’s a 1/6 chance of getting a 6 on the first throw, and a 1/12 chance of getting some sequence of 2’s and/or 4’s that terminates in a 6. Add those up and you get 1/4. So, there’s a 1/4 chance of winning. In other words, 1/4 of all outcomes (including those with 1’s, 3’s, and 5’s) will be winners.

We now calculate what fraction of the winning trials consists of a 6 only. That is (1/6)÷(1/4) = (1/6)×(4/1) = 4/6 = 2/3. Take the reciprocal to get 3/2 = 1.5.
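That whole calculation fits in a few lines of exact-fraction Python, if you want to see it end to end (a sketch; the variable names are mine):

```python
from fractions import Fraction as F

p_win_first = F(1, 6)                   # a 6 on the first throw
p_win_later = F(1, 18) / (1 - F(1, 3))  # geometric series a/(1 - r) = 1/12
p_win = p_win_first + p_win_later       # 1/6 + 1/12 = 1/4

# Fraction of winning trials decided on the first throw:
p_six_within_wins = p_win_first / p_win  # 2/3

expected_throws = 1 / p_six_within_wins  # the reciprocal: 3/2
print(expected_throws)  # 3/2
```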

# 4. Errors?

Did I get something wrong here? Conceptually? Computationally? If so, let me know!

#### Footnotes:

- If you need a refresher on combinations, here’s a Khan Academy video.
- If you need a refresher on permutations, you can check out this Khan Academy video
- Go here for a refresher on sigma notation. In the denominator, I’ve put the formula for the aforementioned nCr, which, again, you can brush up on at Khan Academy.
- A convenient, and highly intuitive, feature of the geometric distribution.
- This modest program got the job done:
1. In cell A1, I entered: =RANDBETWEEN(1, 6)

2. In cell B1, I entered: =IF(OR(A1=2,A1=4),RANDBETWEEN(1,6),IF(A1=6,"WIN","LOSE"))

3. I then copied and pasted B1 into C1 through H1 (only one of the 1,320 trials made it to H1; it ended in a 1).

4. I then copied and pasted that row, from A1 to H1, so that it covered 120 rows.

5. I then ran the game 11 times, copying and pasting each result into its own sheet for tallying.