*Estimated read time (minus contemplative pauses): 44 min.*

This is a followup to the post, “Dice Probability: Expected Rolls to See All Six Sides (Coupon Collector’s Problem), which ends like this:

…I’ll soon publish a post looking at the median number of rolls for seeing your first 6, as well as the much harder question of the median number of rolls for seeing all six sides. The latter poses some frustrating, though also instructive (and maybe kinda fascinating), difficulties. Along the way, I’ll explain/derive the geometric distribution’s cumulative distribution function and median, among other things, mostly for the sake of intuition-training.

Looking around at what others have written about the coupon collector’s problem, a common observation I see is that the median is more useful than the mean for guessing how many trials will actually be needed. But it’s hard to calculate, so a simulation is the best way to go.

See, for example, an SAS blog post called “How many times must you roll a die until each side has appeared?” from Rick Wicklin, a researcher in computational statistics whose small simulation yielded a median of 10.5:

So what is the expected median number of rolls before all six faces appear? I don’t know. The only way that I know to estimate the median is through simulation. If you know a reference or a better way, leave a comment.

In the comments, user Will Sawyer uses a Markov chain approach to get ~12.811, I’ll say more about Markov chains below, as it’s the approach I’ll use here.

Wicklin posted a followup two days later—”Simulating the Coupon Collector’s Problem“—detailing a more extensive simulation, from which he concludes that “The random variable in this problem is discrete, but 51% of the time all six faces occur by the 13th roll.”

Ok, so ~13 is the number to look out for.

I’ll break this post into two parts. In the first, I’ll try to impart what’s going on with the problem and why it’s harder than finding the mean. In the second part, I’ll solve the problem, or will at least will detail a solid approximation. This will require some linear algebra.

I don’t have space here to explain all the background concepts needed for addressing the problem. Instead, I will be clear about what I’m doing from step to step, and I’ll point to resources for further study.

Essentially, I’ll include what I wish I could have found when I first saw this and similar problems. If I say anything wrong or confusing, let me know!

## Part 1: Why Is the Coupon Collector’s Median Hard to Find?

To be clear, this is *a* coupon collector’s problem, not *the* coupon collector’s problem. But engaging it is a step in the direction of understanding more general solutions.

You can read about the general problem at the “Coupon Collector’s Problem” *Wikipedia* entry, but it doesn’t mention the median there. Nor do any of the papers on this problem I came across in an admittedly brief search, except here and there in the context of simulation.

This lack may give some indication of the difficulties—or unnecessary tedium—of approaching it mathematically. I enjoyed approaching the problem mathematically, however, and look forward to sharing what I did.

But I’m getting ahead of myself. First, l’ll explore an intuitive approach that I at first thought would work, but doesn’t. I find such work illuminating. It led me, for example, to derive the cumulative distribution function and median of a geometrically distributed random variable (see below).

(I’ve taken this approach before, in a post called “Gambler’s Ruin & Random Walk: Probability, Expectation, Steal the Chips.” I start there with an investigation into why an approach that I and others found intuitive, but doesn’t work, and then I work from there to an approach that does. If you’re looking for an introductory application for difference equations, you might like it.)

To get started, let’s find the median of an easier question:

*What is the median number of rolls for seeing your first 6 (including the 6)?*

Another way to put it (e.g., for simulation purposes):

*Were you to roll a die until you saw 6 many, many times, what would be the median number of rolls it took to see your first 6 (including the 6)?*

A more specific way to put the question is:

*When rolling a die until you see your first 6 (including the 6), within what number of rolls do you have a 50% chance of succeeding?*

In other words, if you have a 50% chance of seeing your first 6 within the first four rolls, then the median is 4, which is in fact the answer (or at least one answer; see below). But how to show it? We’re getting there.

In more technical terms, we can say that the median depends on the cumulative distribution function (CDF). See this definition, for example, in my favorite probability textbook, *Introduction to Probability*, by Joseph Blitzstein and Jessica Hwang (buy here or read online for free here):

A

medianof an r.v. [random variable]Yis a valuemfor which P(Y≤m) ≥ 1/2 and P(Y≥m) ≥ 1/2 (this is also called a median of thedistributionofY; note that the notion is completely determined by the CDF ofY). Every distribution has a median, but for some distributions it is not unique. (p 187).

I’m not going to get into what a random variable is. The book does a fantastic job of explaining it, or see this short Brilliant.org entry: “Discrete Random Variables – Definition.”

Instead, I’ll just say that, in the present example, the random variable—which is what we call “geometrically distributed”—involves the set of of possible outcomes, quantified as die rolls. For example, we can land our first 6 in 1, 2, 3, 4, 5, 6, 7, … and so on rolls. Those numbers are the possible values that *Y* can take on in with respect to the above definition. (That set of possible values is known as the random variable’s “support.”)

More specifically, since the median number of rolls is 4, it tells us that the probability of landing our first 6 in less than or equal to four rolls—i.e., in 1, 2, 3, or 4 rolls—must be 50% or greater. And landing our first 6 in 4, 5, 6, 7, … rolls rolls must also be 50% or greater.

A simpler, and more common way, to conceptualize this is found at *Wikipedia*‘s entry on “Median“:

In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population or a probability distribution.

This definition isn’t quite right but gets the main idea across, at least for a starting goal. Which is to figure out at what point we hit 50% as we add up the probabilities of of landing our first 6 in one roll, two rolls, three rolls, etc. This is where the CDF helps us, but before getting to that, I’m going to use a shortcut to show that four is an acceptable answer.

I’ll use a formula (which I’ll also derive in a moment) listed at the *Wikipedia* entry on “Geometric Distribution” for finding the median of a geometrically distributed random variable, where *p* is the probability of success:

(Review logarithm properties at Math Is Fun‘s quick and clear webpage. It’s interesting to note, by the way, that the Brilliant.com entry for geometric distribution states: “Three of these values—the mean, mode, and variance—are generally calculable for a geometric distribution. The median, however, is not generally determined.”)

Note that the formula includes the successful roll, and that *p* cannot equal 1 (as that results in an undefined expression).

For the question at hand, the probability of landing 6 (or any other die face) is just 1/6, so I’ll plug that in:

This uses a ceiling function (i.e., tells you to round up to the next integer). Without that, we’d get 3.8. This is good. You might have observed at *Wikipedia* that the median is not unique if the formula outputs an integer without the ceiling function. It’s not uncommon for a distribution to have a range of values that work as the median, particularly for a discrete random variable (though it can happen with a continuous ones as well).

There’s a nice discussion of this in a PDF document by Steven J. Miller called “The Probability Lifesaver: Order Statistics and the Median Theorem.” For example, he shows that the median of a fair die is not unique—any number works that is greater than equal to 3, but less than 4. For instance, there’s a 50% chance of rolling less than or equal to 3, less than or equal to 3.5, and same goes for 3.98. See the document to go deeper than I will here.

(The homepage where Miller shares that document is a fantastic resource for digging into probability. The document is supplemental to a book that’s new to me, but looks excellent and math-heavy and is joining my library; the e-book price ($16.17) is a steal: *The Probability Lifesaver: All the tools you need to understand chance.*)

At any rate, the difference between the CDFs for 3.8 and 4 rolls will be interesting to ponder. But I want to derive the CDF rather than jump straight to it. And from that, I’ll derive the above formula. I hope that makes things less abstract. I’ll approach the CDF by way of the rolls-to-first-6 question, which was the topic of my previous post on the coupon collector’s problem.

Recall that you expect six rolls to see your first 6. But it won’t always take that long. The probability that it takes 1 or 2 or 3 rolls, for example is:

P(1st roll) + P(2nd roll) + P(3rd roll) = 1/6 + (5/6)(1/6) + (5/6)^{2}(1/6) ≈ .42.

The CDF—which again, is short for “cumulative distribution function”—is just a shortcut for calculating such accumulations. The CDF for a geometric distribution is just the probability of success *p* within *k* trials (or in this case *k* rolls). In the above example, *k* = 3.

Notice that the complement to this is *not* seeing 6 within four rolls—which means seeing 6 in four or more rolls.

The probability of it taking four or more rolls is easily calculated here. It is just the probability of three failures. Once you’ve rolled three times without seeing a 6, you are *promised* to require at least four rolls to see a 6 (provided you still count those first three rolls; of course, rolls are independent, so you should *expect*, technically speaking, six more rolls to see 6, but this isn’t an expectation problem). The probability of three failures in a row is (5/6)^{3} ≈ .58.

This, again, is just the complement to the above probability of .42. In other words, 1–(5/6)^{3} ≈ .42. And how do we find 5/6? That’s just the probability of not landing a 6, which is 1–(1/6). Which means we could have just as easily expressed it as 1–(1–(1/6))^{3} ≈ .42

We can generalize that as 1–(1–*p*)* ^{k}*. And there you have the CDF of a geometrically distributed random variable where the successful outcome has a probability of

*p*, failure has a probability of 1–

*p*, given

*k*trials. This is a standard formula, found at the afore-linked

*Wikipedia*and in textbooks and such. Note also that, again, I’m including in

*k*the final, successful trial.

The CDF can easily be used to find the median here, since the definition of median tells us we’re looking for CDF of 1/2. I’ll designate *k* here as *m*.

(I choose log_2 because it plays well with the 1/2. And because I’ve already seen how this ends.)

And that is the same thing found at *Wikipedia*, minus the ceiling function.

If I pop 1/6 in for *p*, I get:

Which I’ll pop into the CDF to test:

Check. Here that is rounded up to 4, per the ceiling function:

Ok, so we again see that the definition of “median” being a point that divides the probability in half is false, unless we go with the 3.8 answer. And why wouldn’t we? I’m honestly unclear on this from a theoretical standpoint. The mean number of rolls for the coupon collector’s die problem is 14.7, so why can’t the median be 3.8?

Perhaps, for one thing, in many cases you’d awkwardly end up with a median of less than 1, as with collecting the second face, which has a probability of 5/6, and thus would, without ceiling function, have a median of ~0.387 rolls, which is less than the median number of rolls for collecting the first face, which is trivially 1. Collecting the third face, with probability 4/6, would have a median of ~0.631. That can’t possibly be less than the median for something that has a 100% chance of happening.

That said, given Miller’s assertion that the median roll for a fair die can be, say, 3.8 (among infinitely many other numbers), then there should be no problem with this being the median roll for seeing your first 6. And yet I’ve seen the ceiling function for discrete random variables elsewhere as well.

It strikes me, then, that these medians are approximations either way, and their utility is in surveying all of the results, however messy. But I’ll leave the depths of this question for later contemplation as I continue learning about probability theory. For now, at least the math makes intuitive sense. To deepen the intuition, let’s graph it.

I’ll use a cool little app I have on my iPhone, just called Probability Distributions (by Matthew Bognar, also available on his website!; note that there two geometric functions there, because, again, you can model these including or not including the success—I include the success).

I’ll graph both below and above the median (in red):

We get ~51.78% for being less than or equal to 4. The app rejects 3.8, by the way, because the support for the distribution is just {1, 2, 3, 4, …}. Thus, I suppose, the ceiling function.

So how to use all this for the coupon collector’s problem? The geometric distribution of a die can easily be used to find expectation, facilitated by linearity of expectation. That is, all we have to do is add up the expected number of rolls to fill slot A (i.e., collect the first face), then to fill slot B, and so on up to slot F. Add those up, and you get 14.7.

Why not do that here? Is there no general “linearity of median”? That was my first thought. So I tried it.

That is, I found the median for filling slot A (which I take to be just one roll, as slot A gets filled with the first roll no matter what). I added each median up, ceiling function included, to get:

(Without the ceiling function, you get ~8.529.)

The 1 out front because otherwise you end up with a 1 in the logarithm argument, which equals 0, making the the thing undefined.

Let me write out the (rounded up) values being added here, in order: {1, 1, 1, 1, 2, 4}. This feels fine, intuitively. Here’s my thinking.

First, there is a 100% chance of collecting your first face on your first roll, so the median roll is 1. The median roll for getting your second face on your next roll (after having collected one face) is also 1, which sounds about right, given you have have a 5/6 (or about 83%) chance of doing so. Once you’ve got two faces, you have a 2/3, or about 66.67% chance of getting your third face on the next roll, so it’s not surprising that the median there would be 1. And so on.

That’s all well and good if the formula is to be trusted. My next thought was that I should be able to add this up to get the overall median. While this felt a bit too good to be true, the intuition was bolstered by finding Wilkin’s result of 10 from his first simulation.

It is, in fact, too good to be true. Why? With a little reflection, it seems it could have to do with the fact that, in accumulation, there’s a good chance of deviating from some of these medians, so the overall median is greater than 10.

Medians in general cannot be added together the way expectations can. You can’t even assume that the median of the medians of several subsets will be the median of the union of those sets. Here’s can example where it works:

{1, 2, 3} , {4, 5, 6} , { 7, 8, 9}, where the respective medians are {2, 5, 8}, which itself has a median of 5. So, too, does the combined set: {1, 2, 3, 4, 5, 6, 7, 8, 9} have a median of 5.

But it’s easy to concoct a counterexample, as with:

{1, 2, 3, 4, 5, 6} , {0, 1, 3} , {2, 3, 4, 5, 6}, whose (ordered) medians are {1, 3.5, 4}, which itself has a median of 3.5. But if you combine these three sets you get: {0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6}, which has a median of 3.

Stuff gets more complicated when we start talking about median linearity. I obviously don’t have a good intuition about this, so I ran five coupon collector’s scenarios to get a closer look.

I’ve conceptualized this as having to fill each of six slots—labeled A through F—with a unique die face. I list here the median rolls it took to fill each slot, where median is just the middle value:

A | B | C | D | E | F | row medians: | |

1 | 1 | 1 | 1 | 3 | 13 | 20 | |

1 | 2 | 1 | 2 | 1 | 1 | 8 | |

1 | 2 | 3 | 3 | 1 | 4 | 14 | |

1 | 1 | 3 | 1 | 3 | 4 | 13 | |

1 | 2 | 4 | 2 | 3 | 14 | 26 | |

column medians: | 1 | 2 | 3 | 2 | 3 | 4 | 11 |

The median amount of times it took to fill a given slot added up to 11. In other words, the (median to fill slot A) + (median to fill slot B) + … + (median to fill slot F) = 11. That’s pretty close to the predicted 1 + 1 + 1 + 1 + 2 + 4 = 10.

But the actual median of the number of rolls required to fill slots A through F is 14. That’s pretty close to the predicted 13, which I’m stilly trying to figure out how to calculate.

The culprit seems to be, in line with what I imagined, loss of deviation in the individual medians. In particular, we lose the bigger outliers, which are more significant than the smaller ones (you have a 16.67% chance of getting 6 in one roll, but you have a 16.15% chance of getting not-6 ten times in a row, and a 6.49% chance of getting not-6 fifteen times in a row). So, the individual medians get two layers of filtering, a process that is unfriendly to outliers.

This helps my intuition about why there’s no “median linearity.” This counterexample seems sufficient to prove that it isn’t linear. But if you’d like to see a discussion about formally telling when a median is or isn’t linear, along with the easier question of how to prove linearity of expectation, see this Mathematics Stack Exchange thread: “Expected value is a linear operator? Under what conditions is median also a linear operator?” For a more streamlined discussion, see this one: “How can I prove that the median is a nonlinear function?” (See also Miller’s discussion of the Median Theorem in the above document.)

Regarding linearity of mean (another word for “expectation”), by the way, note that, whether you take the mean from the rows or columns, you get 16.2.

This is intuitive, as the rows and columns must add up to the same thing (81), and they have the same denominator (5), but it also works with different sized sets, so long as you weight everything correctly (i.e., find a common denominator). There’s more to say about linearity of mean, but that was the core topic of the previous post in this series.

So I’ll move on to finally solving this problem!

## Part 2: Finding Median for Seeing All Six Sides of Die with Linear Algebra, Markov Chains, and Newton’s Method

Let’s return to the cumulative distribution function (CDF). What we are dealing with here is not a geometric distribution, because of the conditionality involved. That is, once you’ve collected one face, the probability of collecting a second in the next roll changes from 100% to 83.33%.

These probabilities aren’t difficult to conceptualize, but aren’t so straightforwardly computed. That is, using the simpler definition of median, wherein we’re looking for the number of rolls such that succeeding at or below that number of rolls has a 50% chance of occurring, all we need to do is add up the probabilities of succeeding in such and such number of rolls until we hit (or pass) 50%. We know we’re testing 13 rolls, so I’d expect:

P(success in 6 rolls) + P(success in 7 rolls) + P(success in 8 rolls) + … + P(success in 13 rolls) = ~0.5.

How to figure those out? Well, I first tried brute counting, with some combinatoric techniques as a shortcut. This was interesting and gave a little more insight into what’s going on, but this got difficult quickly. I won’t share all the details of that here as it is not an efficient way to go, but will say that involves multiset permutations and multinomial coefficients, the core of which is the multinomial theorem, or what’s often called the “MISSISSIPPI rule,” all of which you can read about at the “Multinomial Theorem” *Wikipedia* entry (or see the above-link textbooks). Or just see this five-minute video by YouTuber patrickJMT:

Then you just have to sit and think it through. For example, you can collect six faces with six throws exactly one way. For seven throws, you could roll something like: 1, 2, 3, 4, 5, <repeat roll>, 6. That repeat roll can be any number but 6, and can be placed between any two rolls. And, of course, it could have been a different order, like: 6, 3, 2, 4, 5, <repeat roll>, 1 (where the repeat roll can be anything but 1). And so on for collecting all six faces in eight throws.

You can see how tedious this is. But I played with it for a good while and got this for winning in exactly six, seven, eight, nine, or ten rolls:

Tedious. And I made a mistake with the last two terms there, but it’s not bad. It should, I later found, add to 0.271812128487, so it’s only 0.036258287617 off. I think it could be interesting to figure out the correct multinonials, to see if there’s a pattern (at one point I jotted down “Catalan numbers,” though now I don’t see it, including among the other failed attempts I’m not sharing here). That’s the stuff of obsession run amuck.

Instead, return to the answer given by commenter Will Sawyer at Wicklin’s SAS blog post: ~12.811. I’ll reprint the full comment here as a reference, as it is the basis for how I came to unravel the problem, though I’ll take a lot more steps than this (and will add several decimal places to the final output):

I used the Markov Chain approach to solve this problem. There are six states corresponding to the number of sides of the die already seen and the transition matrix is bidiagonal, with only the current state of the next highest one possible. It is then possible to create the set of difference equations linking the probabilty vector after N rolls of the die to the vector after N-1 rolls. These equations solve to give values of p1(N),p2(N),…,p6(n) which are the probabilities of having seen 1,2,..6 faces after n rolls, respectively. It is more convenient to write the probabilities are (n+1) throws so:

p1(n+1) = (1/6)^n

p2(n+1) = 5 * [ (2/6)^n – (1/6)^n ]

p3(n+1) = 10 * [ (3/6)^n – 2(2/6)^n + (1/6)^n ]

p4(n+1) = 10 * [ (4/6)^n – 3(3/6)^n + 3(2/6)^n – (1/6)^n ]

p5(n+1) = 5 * [ (5/6)^n – 4(4/6)^n + 6(3/6)^n – 4(2/6)^n +(1/6)^n ]

p6(n+1) = 1 – 5(5/6)^n + 10(4/6)^n – 10(3/6)^n + 5(2/6)^n – (1/6)^n ]The pattern here, for anyone familiar with Pascal’s Trinagle and combinatorics is clear, with the sign of the combinations alternating between + and -, and the proportion being summed reducing by 1/6.

The probability that it will need N throws is just the (1/6) times the probability that 5 sides had been seen after N-1 throws, i.e.

P(N rolls before all 6 sides are seen) =

(5/6) * [ (5/6)^(N-2) – 4*(4/6)^(N-2) + 6*(3/6)^(N-2) – 4*(2/6)^(N-2) + (1/6)^(N-2) ]

for N=2,3,4,…

and 0, otherwise.

By the way, it can be quicly seen that for N=2,3,4,5, this formula gives 0, as it should.The mean of a random varable with thie PDF can be worked out to be 14.7 using basic results of sums of infinite series.

The Median is solved by seeing at which value of N the cumulative probabilities reach 0.5.

The CDF, i.e. P(number of rolls <=N) is, by summing the series:1 – 5*(5/6)^(N-1) + 10*(4/6)^(N-1) – 10*(3/6)^(N-1) + 5*(2/6)^(N-1) – (1/6)^(N-1)

This cannot be solved explicitly, but by iteration gives a solution of the median as 12.8110473

Thanks, William Sawyer, for setting me on the right track with this response!

All right. So we need to think about this in terms of state changes. That makes great sense. It means, for instance, that if after 11 rolls we have collected five faces (call this “being in State5”), then we have a 1/6 probability of going on to State6 in the next roll. So we just need the probability of being in State5, and then we can multiply that by 1/6 for the overall probability of winning in exactly 13 rolls. Do this for winning in rolls 6 through 12 as well, add all that up, and it should be ~.50.

But I’m getting ahead of myself. First of all, what is a Markov chain? I think of it as a system of states, such that there is some probability of transitioning (or not) from any given state to any other given state. You make one move at a time. I won’t worry about making this formal here (see the above textbooks for that), but I will embed some videos to make things more explicit as we go along.

That said, an excellent, and typical place to start is with a bubble diagram. For the present example, we have seven states: State0 (no faces collected) through State6 (all six faces collected). You must go through these states in order, of course.

The Markov chain bubble diagram looks like this:

This just says that, starting in State0, the probability of going on (i.e., “transitioning”) to State1 in the next roll is 1 (while the probability of staying in State0 is 0); in State1, the probability of transitioning to State2 in the next roll is 5/6 (while staying in State1 is 1/6), and so on until you are in State6, at which there is only one thing that can happen so long as you keep rolling: you stay in State6. When you hit a state that you can’t leave, that’s called an “absorbing” state.

This is a powerful tool. And the obvious patterns (i.e., 6, 5, 4, 3, 2, 1 on top and 1, 2, 3, 4, 5, 6 on bottom) make it easy to generalize. For example, we can reduce the transition process as follows, as *n* goes from 0 to 5:

With this simple idea, we can run a lot of calculation, such as the probability of being in State6 after 13 rolls. But we’ll need to re-conceptualize the chain as a matrix rather than as a bubble diagram.

Before doing that, I’ll note that there are other ways we could visualize and organize this system. For example, we could use a probability tree or you could graph it on a coordinate plane like a random walk in which, with each step, you must move either one step to the right or one step to the diagonal-right. These methods get cluttered quickly, I find.

Or you can list the possible scenarios straightforwardly and systematically; for example, how many ways can you get from State0 to State3 in precisely five rolls? Easy (where the numbers represent States 1 through 3):

Roll1 Roll2 Roll3 Roll4 Roll5

1 1 1 2 3

1 1 2 2 3

1 1 2 2 3

1 2 2 2 3

Those probabilities are easily calculated, and in fact yield a pattern. We hold the 3 fixed, so all we’re concerned with is the probability that we’ve collected two faces after two rolls, three rolls, four rolls, etc. For above, not including the probability of transitioning to State3 (i.e., 4/6) on the fifth row, we get, it’s clearly:

If you take ten seconds to write out each step in the above sigma machine, you’ll see that it matches the probabilities of each row. (The first row, for example, including the 3, is: (1)(1/6)(1/6)(5/6)(4/6).) Or skip it and move on, as this isn’t a great method for dealing with the problem at hand. It starts to get unwieldy well before you get closer to State6.

So I’ll now return to the reconfiguring the Markov chain as a matrix, which cleanly organizes what the above methods quickly turn into a mess:

The labels along the rows and columns represent State0, State1, State2, and so on. The rows represent the state you are in now, and the columns represent the states you may transition to. So you read it as follows.

If you’re currently in State0, then the probability of advancing to State1 on the next roll is 6/6 (i.e., 1, but I keep 6’s in the denominators for transparency). If you’re currently in State1 (i.e., see the row marked “1”), the probability of remaining in State1 on the next roll is 1/6. And so on.

Now, if we wish to know what happens given two rolls of the die, then we square this matrix, and the probabilities change. And we cube it for 3 rolls. In case you don’t know how to perform basic operations (by hand) with matrices, I’ll embed some videos further down. Right now, I’m going to jump ahead and confirm that transitioning from State0 to State6 by the 13th roll has a probability of ~50%.

I’ll do this with a fantastic calculator that you can find here: Matrix Calculator. I’m going to remove the 0 rows and columns, however, because they aren’t needed (we are promised to get to State1 on our first role). Since this eliminates State0 from the matrix, it means we’re looking at going from State1 to State6. In other words, the first roll is a freebie, so we’ll raise the matrix to the 12th power.

I’ll also change to parenthesis style because it’s easier to input here.

I could have made this look less awful by telling the calculator to give a decimal approximation, but that would result in a lot of 0’s. At any rate, the main thing to zero in on is the entry at the very end of the first row. That’s the first row/sixth column spot. In other words, the probability of having made it from State1 to State6 in 12 rolls or, equivalently, from State0 to State6 in 13 rolls. In yet other terms, getting from having collected no faces to all six faces within 13 rolls.

The number in that spot is 485485/944784 = ~.514, as expected based on Wicklin’s second blog post. Recall that Sawyer’s answer is ~12.8110473 rolls for hitting much closer to 50%. This is consistent with that. But also recall that, when dealing above with the median for a geometric distribution, we used a ceiling function in order to output a whole number. If we wanted to do that here, we’d be finished. The answer is 13.

Either way, it’s unique. There’s a 50% chance of collecting all six faces in less than or equal to 12.8110473 (to however many decimal places) rolls. This is not true of any other number of rolls (adding more decimal places aside), But we’ve yet to show that it’s true for ~12.8110473.

I haven’t finished reading, much less grappling with, Miller’s PDF document on the median theorem, but I think it’s important to know how to hit as close to 50% as possible. I want to know, anyway. I also want to know how Sawyer pulled out those nice looking formulas for streamlining the probabilities here, no matrices required.

My goal is to manipulate the matrix in such a way that helps me make an equation with some variables on one side and .5 on the other. To follow this, you’ll at least need to understand basic matrix operations.

Here’s a quick overview for adding and subtraction matrices:

And here’s one for multiplication:

Finally, to see this in action in the context of Markov chains, that also relates the bubble diagram and probability tree to a matrix, see this informative ~12.15-minute video:

I’ll start by factoring 1/6 out of the starting matrix, which has the nice property of being bidiagonal with lots of 0’s and with every entry having a denominator of 6.

Call this matrix **A**. In other words, our starting matrix is equal to (1/6)**A**.

The task now is to diagonalize **A**. What this means, and why we’re doing it, will become apparent in a moment. There are several steps. As I go along, I will explain the gist of what I’m doing and will share resources for going deeper.

Diagonalizing can be done in a flash with the above-linked matrix calculator. As can the intermediate steps to get there. I’ll take the latter route. Matrix **A** turns out to be relatively easy to diagonalize thanks to all the zeroes.

What we’re looking for is **A** = **CDC**^{-1}, where **C** is a matrix whose diagonal is six (because **A** is a 6×6 matrix) independent eigenvalues of **A**. What is an eigenvalue? I have sitting here an excellent linear algebra, self-teaching oriented and reasonably priced textbook called *Linear Algebra Step by Step* (by Kuldeep Singh), whose final chapter is an introduction to eigenvalues and eigenvectors; it defines these as (note that a “scalar” is just a number, as opposed to a vector; what’s a “vector”? for our purposes, it’s a list of numbers, e.g., the column or row of a matrix, but of course it’s much more than that):

For a

non-zerovectoruthe scalar λ is called aneigenvalueof the matrixAand the vectoruis called aneigenvectorbelonging to or corresponding to λ, which satisfiesAu= λu. (p 492)

I’ll embed some videos below on how to calculate these. But to really delve into what they are would require a deeper dive into linear algebra than I can attempt here. There are some wonderful videos out there, which I myself am still absorbing; I assume that if you’re going that deep, you’ll run into them as they are deservedly popular.*

*One collection of which is Gilbert Strang’s MIT lectures. To motivate that journey, hear Strang on the *Artificial Intelligence with Lex Fridman* podcast (11/25/19), in which Sang makes the, I think correct, point that calculus is overhyped in comparison to linear algebra, particularly in this day and age, when linear algebra is increasingly important. I’ll also add that I think I would have found multivariate calculus more mysterious with some linear algebra background:

Let’s find the eigenvalues of **A**.

We do this by finding the determinant of (**A** − λ**I**), which just says that we subtract from matrix **A** a variable (by convention “λ”) that is multiplied by the identity matrix with the same dimensions as **A**.

“Identity matrix” is easy to define: it’s the matrix that, when you multiply A by that matrix, you get out **A**; so, just as 3 × 1 = 3, we have that **A** × **I** = **A**. It is, beautifully, always just a matrix with a diagonal of 1’s, with 0’s everywhere else.

“Determinant” is harder to define. Here’s a video for the usual way to find that:

There are other ways, such as cofactor expansion, which will be used in one of the videos I’ll embed at the end of this in case you’d like to see the a less convenient matrix diagonalized. **A** happens to be very easy to work with because it’s a lower triangular matrix (i.e., everything under the diagonal is a 0). So we end up with eigenvalues of 1, 2, 3, 4, 5, 6.

How’s how to it works. Again, need the determinant of (**A** − λ**I**). That means we need the determinant of:

To find the determinant of the resulting matrix above—that is, of (**A** − λ**I**)—all we have to do is set the product of each entry of the diagonal to 0:

(1 − λ)(2 − λ)(3 − λ)(4 − λ)(5 − λ)(6 − λ) = 0

(Again, this direct method works because everything below the diagonal is a 0.)

Then find all the λ values make the equation true, which are clearly 1, 2, 3, 4, 5, 6.

Since, by the way, we get six distinct eigenvalues, which matches the dimensions of **A**, which is 6×6, we know that **A** is diagonizable (see below for a video on this).

Recall that we are looking for three matrices whose product is **A**. That is, we want **A** = **CDC**^{-1}. **D** is made up of the eigenvalues, like this:

Now to find **C**, and then the inverse of **C** (i.e., **C**^{-1}). **C** is a matrix whose columns are eigenvectors of **A**. There are a lot of videos out there on finding eigenvectors some rather long. It might be the most complicated aspect of this process, so I’ll include two shorter ones and two medium ones that seem helpful (I sometimes find two short explanations more helpful than one long one):

At ~3 minutes:

At ~7 minutes:

At ~13.5 minutes:

At 18.25 minutes:

Developing eigenvectors for **C** is, again, easy. All we have to do is more subtraction, similar to (**A** − λ**I**), but this time we actually plug in each eigenvalue for λ. I’ll do one of these to demonstrate. For λ = 2, we get (**A** − 2**I**):

Now we need to simplify this to reduced row echelon form, which we can easily do with row reduction. (Note that it’s already in row echelon form.) All we have to do is get the pivots (i.e., the first non-zero number in each row) to be 1 (which in this case is the diagonal entries), which is super easy, just divide each by itself:

That’s for λ = 2. It needs to be done for 1 through 6. And then we put the results into parametric vector form. This is explained well here:

Next, find the inverse of **C**, which is nicely explained in this video:

Once I do all this (or, when too tedious, when I ask my calculator to do it), I get the following, which is very easy to use. That is, the beauty of the diagonal matrix with 0’s otherwise is that if you raise it to a power, all you have to do is raise each entry to that same power in order to calculate the new matrix. In other words:

When *n* = 1, we have **D**. Recall also that I factored out a 1/6 at the beginning, so I’ll put that back here, and it will also have to be raised to *n*. All together, this gives:

You can test for yourself that these are the same at the awesome matrix calculator.

Here are three videos in case you’d like to see diagonalizing performed.

At ~8.75 minutes:

At ~9.25 minutes:

At ~12 minutes:

At ~23 minutes, with some computation done in Octave:

And this one is on how to know if a matrix is diagonizable:

In a calculator, the very first matrix we started with is easiest to work with. But the diagonalized form is more transparent. So we can use it to develop a set of equations. And here’s an interesting observation. **C** and **C**^{-1} look an awful lot like the first five rows of Pascal’s triangle, but upside down:

In fact, I could easily rearrange stuff differently so that it’s not upside down (so **D**‘s diagonal would go 6, 5, 4, 3, 2, 1 instead of 1, 2, 3, 4, 5, 6), but I went with the calculator’s default.

Anyway, this is a hyper cool thing that will also be visible in the final equations. Which I’ll now do.

The probability of going from no faces collected (State0) to one face collected (State1), given that you’re in State0, is 1. So that’s easy enough. The probability of remaining in State1 on a given roll is 1/6. Also easy. I’ll represent that as:

P(1|*r*)= (1/6)^{r-1}

Which is read as “the probability of having collected only one face in *r* rolls is 1/6 to the power of *r*-1 rolls.” For example, the probability of having collected only one face after three rolls is the probability of collecting one face with probability 1), then rolling that same face three times in a row with probability (1/6)^{3}.

This requires no matrix to figure out. For State2 (i.e., having collected two faces), I’ll need the **CDC**^{-1} matrices, which again are:

Note that I’m letting *r* − 1 = *n*. So if *n* = 2, that will mean there have been three rolls total (no different than when I first introduced *n* above). What I’d like to know is the probability of having only collected two faces given *n* rolls (which, to be abundantly clear, means there have been *r* = *n* + 1 rolls overall).

What we need, then, is the very top entry of the second column—i.e., the entry that tells us about going from having collected one face to two faces in *n* rolls. So, I’ll go ahead and just pull out the top row of **C**:

And now we take resulting vector (which, if we had the whole 6×6 matrix, would be its first row), and multiply that by **C**^{-1}:

And there we have our formula for finding the probability of having collected two faces after *n* + 1 rolls (recall that this is because *n* is how many rolls there are *after* the first roll, which is to say that *n* must be at least 1 in this case, or the probability will be 0). I’ll rearrange this a little to make it look nicer (i.e., factor out the 5 and distribute the (1/6)* ^{n}*):

Now do the same for 3 through 6. That is, for being in State3 after three, four, five, … rolls, you look at the second row third column, and so on. A nice pattern emerges wherein you are always subtracting whatever came before, while inserting rows from Pascal’s triangle (plus you see the triangle’s fifth row show up vertically)! I’ll include the 1’s in the equations to make that more conspicuous. Also, as promised, this ends up looking just like the equations given by Sawyer in the SAS blog post:

Beautiful. What we’re considered with today is the P(6|*r*). We want find *r* such that it equals 0.5. Now, it’s easy enough to pop this into Desmos as a function to verify Sawyer’s 12.8110473:

Looks great, but how to get to 12.8110473 just given the equation? One way to get close is to graph y = .5 and look at where the graphs intersect:

That works (or, equivalently, subtract 0.5 from P(6|*r*) and look at the x-intercept)*,* but we’ve lost decimal places. Plus, it would be unsatisfying to stop there. We can do better with the Newton-Raphson method you learn in first-year calculus:

You also need to know how to take the derivative of this thing. It’s not tough, but tedious and you’ll need a calculator eventually. I messed around with it a bit in Desmos (where you can define a function as say, f(x), then easily get its derivative by inputting “f'(x)”).

Once I was convinced all was well, I streamlined it by typing “newton’s method” at Wolfram|Alpha. I then entered the P(6|*r*) equation equation for a result of *r* = 12.81104728506638. Putting that in for *r* at Desmos—like so:

—yields exactly 0.5. At Wolfram|Alpha it yields ~0.5000000000000002.

Not bad!

Play around with P(6|*r*) yourself at Wolfram|Alpha. That link should auto-populate, but if not here’s the input in their notation:

(6/6)^(r – 1) – 5 (5/6)^(r – 1) + 10 (4/6)^(r – 1) – 10 (3/6)^(r – 1) + 5 (2/6)^(r – 1) – (1/6)^(r – 1) = .5

I set the starting point to 0 and changed the variable to *r*. Here’s a snippet of what I got there:

(You can also just put the equation into the basic homepage prompt but you just get 12.811 with no steps available for review.)

As far as I know, this is about as good as it gets given free online resources. There’s no closed from solution to such an equation. So there you have it.

One more thing before wrapping up. What we are finding with, say, P(6|13) is the probability of having succeeded in collecting all six faces in six throws or seven throws or eight throws or… all the way up to 13 rolls. At 13 rolls is where the probability of success hits 50% or greater, and so it is a good approximation for being the median number of rolls needed for success. (In a simulation, where you run many versions of the game, it’s the number of rolls needed for success that shows up 50% or more of the time, roughly splitting the data.)

In other words, with the equations, we can easily find what I was struggling to calculate earlier, which was the probability collecting all six faces in six, seven, eight, etc. throws in order to add them up until 50% is broached. I tried to do this through counting, resulting in:

I said this is about 0.0363 off. Now we can see which of those terms is off and by how much. First, I’ll do this by finding P(5|*r*) for each scenario, then multiplying it by 1/6. For example, the P(5|12) is the probability of having collected only five faces after rolling 12 times. Obviously, the probability of the next roll being the sixth face is 1/6. So, the probability of collecting all six faces in exactly 13 rolls is P(5|12)(1/6) ≈ 0.0760, rounded to four decimal places (which I’ll do from here on).

For convenience, I’ll define a function *g* as:

Now, here are the terms, as decimals, from my failed combinatorics (which I’m quelling the temptation to fix):

0.0154 + 0.0385 + 0.0600 + 0.0650 +

That’s for succeeding in six, seven, eight, nine, or ten rolls, respectively.

Here are the corresponding numbers for *g*(*n*):

*g*(5)(1/6) + *g*(6)(1/6) + *g*(7)(1/6) + *g*(8)(1/6) + *g*(9)(1/6) =

0.0154 + 0.0385 + 0.0600 + 0.0750 + 0.0823 = **0.2712**

The difference between this is: 0.2712 −

And finally, of course, if we go all the way to the 13th roll we get, as expected:

If I’ve said anything wrong or confusing today let me know!

* Post Script: After writing this, I noticed that Wicklin* states in a comment that the probability for a given number of rolls can be expressed with:

Where, in the die context, *m* is the number of rolls, *n* is the number of faces collected, and S_{2}(*m*,*n*) is a Stirling number of the second kind. I’ve written up a new post to explore how this works: “Dice Probability: Median Rolls to 6 Sides (Coupon Collector’s) with Stirling Numbers.”

*Enjoy or find this post useful? Please consider pitching in a dollar or three to help me do a better job of populating this website with worthwhile words and music. Let me know what you'd like to see more of while you're at it. Transaction handled by PayPal.*