Entropy I
Motivating Shannon
The word “entropy” is surrounded by a strange mystique. This is confusing. Let us illuminate it.
This is part of a series on entropy.
- Entropy I: Motivating Shannon (this post)
- Entropy II: The Entropy Function
1. Information Density
Consider the division of a whole into parts: a row split into some number of blocks. You can count or label a row of equally-sized blocks easily enough. What about rows whose blocks come in different sizes?
We could, of course, count the number of distinct blocks. Two binary bits are enough to count the first row of 3 (using $00$, $01$, and $10$, with one code to spare).
Or, four binary bits could count the 13 blocks in the last row, with three codes to spare.
For any row, we could use the set of binary numbers as “codes”, “labels”, “pointers”, or “addresses” to refer to the blocks in that row. To write out a sequence of, say, 100 blocks from the last row of 13 would take 400 binary bits: one 4-bit number per block.
But could we do a bit better? The set of all four-bit binary numbers is $\{0000, 0001, \ldots, 1111\}$: sixteen codes for thirteen blocks.
Of those, exactly four start with a pair of zeros: $0000$, $0001$, $0010$, $0011$. Suppose we sacrifice those four codes and instead assign the bare two-bit code $00$ to the big block that takes up a quarter of the row.
The rule will be: if we encounter $00$, it refers to the big block and the code ends there; any other two-bit prefix means a full four-bit code follows.
If, furthermore, the big block shows up in proportion to its size—one fourth of the time—we can expect to save two bits each time we encode it, meaning it will take on average only 350 binary bits to list 100 of these blocks (4 bits three-quarters of the time, 2 bits one-quarter of the time).
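To see the arithmetic play out, here is a minimal sketch of my own (not from the post), which assigns a hypothetical 2-bit code to the quarter-length block and 4-bit codes to the other twelve, then measures the average code length over a long random sequence of blocks drawn in proportion to their sizes. It assumes, for concreteness, that the remaining twelve blocks are equal sixteenths of the row.

```python
import random

# Hypothetical code assignment for the 13-block row: the quarter-length block
# gets the 2-bit code "00"; the twelve remaining blocks get the 4-bit codes
# that do not begin with "00".
big_block = "quarter"
small_blocks = [f"b{i}" for i in range(12)]

codes = {big_block: "00"}
four_bit_codes = [f"{prefix:02b}{suffix:02b}" for prefix in (1, 2, 3) for suffix in range(4)]
for block, code in zip(small_blocks, four_bit_codes):
    codes[block] = code

# Draw blocks in proportion to their sizes: 1/4 for the big block and
# (assumed here) 1/16 for each of the twelve small ones.
population = [big_block] + small_blocks
weights = [1 / 4] + [1 / 16] * 12

random.seed(0)
sequence = random.choices(population, weights=weights, k=100_000)
total_bits = sum(len(codes[block]) for block in sequence)

print(total_bits / len(sequence))  # ~3.5 bits per block, versus 4 for fixed-length codes
```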
Can we do any better? Since there are 13 distinct blocks, could we perhaps do the job with only $\log_2 13 \approx 3.7$ bits per block, or even fewer?
So in some sense, the blocks in the row with the quarter-length block seem to be able to be encoded with, not 4, but only 3.5 bits of data apiece.
This “3.5 bits” characterization, remember, relied on the interpretation that the blocks have an inherent probability proportional to their size.
And, in fact, we can arrive at that same number directly from the block sizes:

$$\tfrac{1}{4}\log_2 4 \;+\; 12\cdot\tfrac{1}{16}\log_2 16 \;=\; \tfrac{1}{4}\cdot 2 + \tfrac{3}{4}\cdot 4 \;=\; 3.5.$$

Apparently, the quarter-length block brought along its own code length of $\log_2 4 = 2$ bits, weighted by the fraction $\tfrac{1}{4}$ of the time it appears, and each sixteenth-length block did the same with $\log_2 16 = 4$ bits.
By a similar approach, you could determine the bits required to describe any of the oddly-sized rows. Even blocks that don’t divide their row evenly can be shoved into the above formula: a block spanning a fraction $p$ of the row costs $\log_2 \frac{1}{p}$ bits, whether or not that comes out to a whole number.
What we’ve just found is that the contribution of a single block comprising a fraction $p$ of its row is $p \log_2 \frac{1}{p}$ bits, and so the number of bits required to encode an entire row of blocks is just a sum of such terms, one per block.
This is what is called the “base-2 Shannon entropy” or just “entropy”, which I will write¹ as

$$H[p] = \sum_i p_i \log_2 \frac{1}{p_i}.$$
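As a quick numerical sanity check (my own sketch, not part of the original derivation), the formula can be evaluated directly for the rows discussed above:

```python
from math import log2

def entropy(p):
    """Base-2 Shannon entropy of a distribution given as a list of probabilities."""
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

# A row of m equal blocks costs log2(m) bits per block:
print(entropy([1/3] * 3))            # ~1.585
print(entropy([1/16] * 16))          # 4.0

# The non-uniform row: one quarter-length block and twelve sixteenth-length blocks.
print(entropy([1/4] + [1/16] * 12))  # 3.5
```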
A row of $m$ equally-sized blocks has an entropy of $\log_2 m$ bits per block, and our non-uniform rows have entropies of their own, each computable from the block sizes alone. The last of those rows, the one with the quarter-length block, has an entropy of 3.5 bits.
This problem—“how to optimally encode a sequence of samples from a distribution”—is probably the standard way of arriving at “entropy”. But here the mystique of that word is nowhere to be seen: it is not at all clear why the thing just described would be called “entropy”, or why anyone should care about it unless they are trying to write a compression algorithm. It seems to be a synonym for “information density”, perhaps.
2. Log-Likelihood per Sample
Now a second scenario.
Suppose some process in reality emits data according to a probability distribution $p = (p_1, \ldots, p_n)$ over $n$ possible outcomes.
If you sample from this distribution (i.e. roll the dice) over and over, you get a long sequence of outcomes.
You expect, on average, to see each of the $n$ outcomes in proportion to its probability: outcome $i$ about $p_i N$ times in $N$ samples.
Suppose we take a very large number of samples $N$, so that each outcome $i$ occurs essentially exactly $p_i N$ times. The probability of obtaining that particular sequence of samples is then

$$\prod_i p_i^{\,p_i N}.$$

We’ll call the above the “likelihood of the sequence”.
If the distribution is uniform, the likelihood shrinks by exactly a factor of $n$ with every new sample.
For any other distribution the rate at which the likelihood shrinks would vary from sample to sample, but in the long run it would center on some average rate. Furthermore we could weaken our assumption that our samples occur in exact proportion to their probabilities: over a long enough run, the average rate would come out the same.
The above expression for the likelihood is unwieldy because it decreases by a fraction every time we add a new sample. To get something directly interpretable as a “rate of increase”, we should a) invert it, b) take a logarithm, and c) divide out the number of samples $N$:

$$\frac{1}{N}\log_2 \frac{1}{\prod_i p_i^{\,p_i N}} = \sum_i p_i \log_2 \frac{1}{p_i} = H[p].$$
And voilà: we have found the same “Shannon entropy” formula as in the first example. The interpretation this time is as the “average log-inverse-likelihood per sample”—meaning what?
We got here by considering the likelihood of the entire sequence of samples, which is to say the probability that anyone could have predicted the process’s exact output in advance.
So the “entropy” here seems to describe the inherent difficulty of predicting the results of a probability distribution. It places some kind of upper bound on how precisely a sample from that distribution can be predicted. It seems now to have a sense of “innate uncertainty” or perhaps “irreducible complexity”.
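A small simulation of my own (with an arbitrary three-outcome distribution chosen purely for illustration) shows the average log-inverse-likelihood per sample settling onto the entropy:

```python
import random
from math import log2

# An arbitrary three-outcome distribution, chosen only for illustration.
p = {"a": 0.5, "b": 0.3, "c": 0.2}

random.seed(1)
samples = random.choices(list(p), weights=list(p.values()), k=200_000)

# Average log-inverse-likelihood per sample...
average_surprise = sum(log2(1 / p[x]) for x in samples) / len(samples)

# ...compared with the entropy of the distribution itself.
entropy = sum(q * log2(1 / q) for q in p.values())

print(average_surprise, entropy)  # both come out near 1.485 bits
```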
3. Limit of a Log-Multinomial
A third and final derivation.
Consider again some process with $n$ possible outcomes, but suppose this time that every outcome is equally likely. Take $N$ samples and record how many times each outcome occurred in a vector of counts $(N_1, N_2, \ldots, N_n)$, with $\sum_i N_i = N$.
Given one such vector, the probability of having observed exactly those counts is

$$P(N_1, \ldots, N_n) = \frac{1}{n^N}\,\frac{N!}{N_1!\,N_2!\cdots N_n!}.$$
Now apply any selection process to the set of possible outcomes $(N_1, \ldots, N_n)$: some rule that keeps certain count vectors and discards the rest.
It will help to have a few examples in mind:
- “the player folds all hands worse than a given threshold hand” (Texas Hold’em, where $n$ is the number of cards in the deck and $N$ the number of cards in a hand)
- “the word contains an “H” in the first position but does not have any of “OUSE”” (Wordle, where $n$ is the number of candidate words)
- “the measured volume is some particular value” (Ignoring distinguishability considerations, $n$ is the number of possible states per particle, some enormous number according to ChatGPT, while $N$ is the number of particles)
- “if it can’t metabolize lactose it dies” ($n$ is the number of distinct genes which can mutate, for E. coli or something; $N$ is however many cells you want to study)
- “keep only distributions with mean $\mu$ and variance $\sigma^2$” ($n$ is the discretization of the space, $N$ can be whatever you want, and you might consider the limit $N \to \infty$)
What vectors $(N_1, \ldots, N_n)$ should we expect to see among those that survive the selection?
The answer is: whichever have the highest probabilities among all those surviving the selection.
Which are these? What is the “posterior distribution” after this selection process?
We can make two simplifications to the above probabilities:
- all of the $P(N_1, \ldots, N_n)$ have the same denominator (it’s just $n^N$), so their relative sizes will depend on the multinomials in the numerator only.
- we can safely take a logarithm of both sides without changing the relative sizes of any terms.
Therefore the most probable vectors are those for which

$$\log_2 \frac{N!}{N_1!\,N_2!\cdots N_n!}$$

is greatest.
We can simplify further by applying the Stirling approximation for the logarithm of a factorial, $\log_2 N! \approx N \log_2 N - N \log_2 e$:

$$\log_2 \frac{N!}{N_1!\cdots N_n!} \;\approx\; N\log_2 N - \sum_i N_i \log_2 N_i \;=\; N \sum_i \frac{N_i}{N}\log_2\frac{N}{N_i} \;=\; N \cdot H\!\left[\frac{N_i}{N}\right]$$

…and we find that the most probable surviving vectors are exactly those for which the probability distribution implied by the counts, $p_i = N_i / N$, has the greatest entropy.
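Here is a quick numerical check of that approximation (my own sketch): the log of the multinomial coefficient, divided by $N$, approaches the entropy of the empirical distribution $N_i/N$ as $N$ grows.

```python
from math import lgamma, log, log2

def log2_multinomial(counts):
    """log2 of N! / (N_1! N_2! ... N_n!), via log-gamma to avoid huge integers."""
    N = sum(counts)
    return (lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

def entropy(p):
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

# Counts held in the fixed proportions 1/2 : 1/3 : 1/6 while N grows.
for N in (60, 600, 6000, 60000):
    counts = [N // 2, N // 3, N // 6]
    print(N, log2_multinomial(counts) / N, entropy([c / N for c in counts]))
# The middle column climbs toward the entropy in the last column (~1.459 bits).
```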
The interpretation is this: out of all distributions compatible with the selection process (or a constraint, or any information we have), the most probable are those with the highest entropy.
In fact, we can plug this expression for the entropy back in to the original probabilities:

$$P(N_1, \ldots, N_n) \;\approx\; \frac{2^{\,N \cdot H[N_i/N]}}{n^N}.$$

We find that each outcome of the original uniform distribution (in the large-$N$ limit) occurs with a probability that grows exponentially with the entropy of the distribution of counts it implies.
I find this clarifying so I’ll restate it in terms of the distributions themselves.
High-entropy distributions are common; the uniform distribution is the highest of all. The growth rate of this probability with respect to the entropy is fantastically high: it goes as $2^{N \cdot H}$, with $N$ typically enormous.
This is really no different from our earlier “average likelihood per sample” formulation, since the above can be rearranged into something like

$$H\!\left[\frac{N_i}{N}\right] \;\approx\; \frac{1}{N}\log_2 P(N_1, \ldots, N_n),$$

though this relation is neither an equality (since we’ve discarded the denominators and approximated the multinomials), nor is it a “proportional to” (since we took a logarithm); instead we might say that the entropy tracks the log of the probability, per sample.
The derivation just given is the Wallis derivation of the “Principle of Maximum Entropy”, which states that the optimal posterior distribution to expect from any system about which we have some partial knowledge is that with the highest Shannon Entropy consistent with our knowledge (or with any constraints we know to apply).
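To make the principle concrete, here is a toy sketch of my own (not from the original post): three equally likely outcomes $\{0, 1, 2\}$, $N = 30$ samples, and a selection process that keeps only count vectors whose sample mean is exactly 1.2. The most probable surviving count vector is also the one whose implied distribution has the highest entropy.

```python
from math import lgamma, log, log2

N = 30  # samples from three equally likely outcomes {0, 1, 2}

def log2_multinomial(counts):
    """log2 of the multinomial coefficient N! / (N_1! N_2! N_3!)."""
    return (lgamma(sum(counts) + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

def entropy(p):
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

# Selection process: keep only count vectors whose sample mean is exactly 1.2,
# i.e. 0*n0 + 1*n1 + 2*n2 == 1.2 * N == 36.
survivors = []
for n0 in range(N + 1):
    for n1 in range(N + 1 - n0):
        n2 = N - n0 - n1
        if n1 + 2 * n2 == 36:
            survivors.append((n0, n1, n2))

most_probable = max(survivors, key=log2_multinomial)
highest_entropy = max(survivors, key=lambda c: entropy([x / N for x in c]))
print(most_probable, highest_entropy)  # both are (7, 10, 13) for this example
```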
Most standard probability distributions are exactly the maximum-entropy distributions for certain sets of constraints:
- a uniform distribution has the maximum entropy under no constraints
- a normal distribution $\mathcal{N}(\mu, \sigma^2)$ has the highest entropy of all distributions on $\mathbb{R}$ with its specific mean and variance.
- a Bernoulli distribution, which models a flip of an unfair coin with probabilities $(p, 1-p)$ on the two-element set $\{1, 0\}$, has the maximum entropy of any distribution on that set with a mean of $p$.
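As a toy check of the first claim in that list (my own sketch): among a large number of randomly generated distributions over five outcomes, none exceeds the $\log_2 5 \approx 2.32$ bits achieved by the uniform distribution.

```python
import random
from math import log2

def entropy(p):
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

random.seed(2)
best = 0.0
for _ in range(100_000):
    weights = [random.random() for _ in range(5)]
    total = sum(weights)
    best = max(best, entropy([w / total for w in weights]))

print(best, entropy([1/5] * 5))  # every random draw stays below log2(5) ~ 2.3219
```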
All of this is fairly mind-bending to me. It’s a strange and nonphysical way of thinking about probability distributions: obviously a Bernoulli distribution—the distribution of an unfair coin—does not arise because you flipped a fair coin over and over and threw out any sequences which don’t have mean $p$.
Still I find this to be the most elementary derivation of the equation $H[p] = \sum_i p_i \log_2 \frac{1}{p_i}$.
- The $H$ was probably intended originally to be a Greek capital “eta”, for “entropy”, but it’s indistinguishable from the Latin letter H. The square brackets indicate that this thing is not exactly a function of its argument; it is really a function of the full vector of $p_i$ values. ↩
- It turns out that the entropy of a distribution, defined in this way, places a theoretical limit on the compressibility of any data which arrives randomly according to that distribution. The compression algorithm sketched in the introduction is a prefix code, which uses shorter codes to encode the more-frequent members of the set. If the data is non-random (like real data), you can probably do much better by dedicating codes to commonly-occurring sequences. ↩
- Predicting this sequence with some other distribution $q$ would alter the probability inside the logarithms (which are used to “model” the sequence) but would not change the probabilities outside, which arose from the actual process which produced each value $p_i N$ times. The resulting expression would be $\sum_i p_i \log_2 \frac{1}{q_i}$, which is called the “cross entropy”—a term which, to me, is unsuggestive of its meaning. ↩
- It is clarifying that this argument also works if the underlying distribution is some other distribution $q$ rather than the uniform one. In that case the probabilities are $P(N_1, \ldots, N_n) = \frac{N!}{N_1!\cdots N_n!}\prod_i q_i^{N_i}$, which in the large-$N$ limit implies minimization of the relative entropy between $N_i/N$ and $q$, $\sum_i \frac{N_i}{N}\log_2\frac{N_i/N}{q_i}$, a.k.a. the Kullback-Leibler divergence. The simpler case of a Shannon entropy falls out if you set $q$ to the uniform distribution. ↩
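As a small numerical aside of my own, tying these last two footnotes together: for any two distributions $p$ and $q$ over the same outcomes, the cross entropy splits into the entropy of $p$ plus the Kullback-Leibler divergence from $p$ to $q$.

```python
from math import log2

p = [0.5, 0.3, 0.2]   # the "true" distribution
q = [0.4, 0.4, 0.2]   # a model used to predict samples that really come from p

entropy_p     = sum(p_i * log2(1 / p_i) for p_i in p)
cross_entropy = sum(p_i * log2(1 / q_i) for p_i, q_i in zip(p, q))
kl_divergence = sum(p_i * log2(p_i / q_i) for p_i, q_i in zip(p, q))

print(cross_entropy, entropy_p + kl_divergence)  # equal: cross entropy = H[p] + KL(p, q)
```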