Entropy I
Motivating Shannon
The word “entropy” is surrounded by a strange mystique. This is confusing. Let us illuminate it.
This is part of a series on entropy.
- Entropy I: Motivating Shannon (this post)
- Entropy II: The Entropy Function
1. Information Density
Consider the division of a whole into parts: a row split into some number of blocks. You can count or label a row of equally-sized blocks easily enough. What about rows whose blocks come in different sizes?
We could, of course, count the number of distinct blocks. Two binary bits are enough to count the first row of 3 (using $00$, $01$, and $10$, with one code to spare).
Or, four binary bits could count the 13 blocks in the last row, with three codes to spare.
For any row, we could use the set of binary numbers as “codes”, “labels”, “pointers”, or “addresses” to refer to the blocks in that row. To write out a sequence of, say, 100 blocks from the last row of 13 would take 400 binary bits: one 4-bit number per block.
But could we do a bit better? The set of all four-bit binary numbers is $\{0000, 0001, \ldots, 1111\}$: sixteen codes for thirteen blocks.
Of those, exactly four start with a pair of zeros: $0000$, $0001$, $0010$, $0011$. Suppose we sacrifice those four codes and instead assign the bare two-bit code $00$ to the big block that takes up a quarter of the row.
The rule will be: if we encounter $00$, it refers to the big block and the code ends there; any other two-bit prefix means a full four-bit code follows.
If, furthermore, the big block shows up in proportion to its size—one fourth of the time—we can expect to save two bits each time we encode it, meaning it will take on average only 350 binary bits to list 100 of these blocks (4 bits three-quarters of the time, 2 bits one-quarter of the time).
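To see the arithmetic play out, here is a minimal sketch of my own (not from the post), which assigns a hypothetical 2-bit code to the quarter-length block and 4-bit codes to the other twelve, then measures the average code length over a long random sequence of blocks drawn in proportion to their sizes. It assumes, for concreteness, that the remaining twelve blocks are equal sixteenths of the row.

```python
import random

# Hypothetical code assignment for the 13-block row: the quarter-length block
# gets the 2-bit code "00"; the twelve remaining blocks get the 4-bit codes
# that do not begin with "00".
big_block = "quarter"
small_blocks = [f"b{i}" for i in range(12)]

codes = {big_block: "00"}
four_bit_codes = [f"{prefix:02b}{suffix:02b}" for prefix in (1, 2, 3) for suffix in range(4)]
for block, code in zip(small_blocks, four_bit_codes):
    codes[block] = code

# Draw blocks in proportion to their sizes: 1/4 for the big block and
# (assumed here) 1/16 for each of the twelve small ones.
population = [big_block] + small_blocks
weights = [1 / 4] + [1 / 16] * 12

random.seed(0)
sequence = random.choices(population, weights=weights, k=100_000)
total_bits = sum(len(codes[block]) for block in sequence)

print(total_bits / len(sequence))  # ~3.5 bits per block, versus 4 for fixed-length codes
```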
Can we do any better? Since there are 13 distinct blocks, could we perhaps do the job with only $\log_2 13 \approx 3.7$ bits per block, or even fewer?
So in some sense, the blocks in the row with the quarter-length block seem to be able to be encoded with, not 4, but only 3.5 bits of data apiece.
This “3.5 bits” characterization, remember, relied on the interpretation that the blocks have an inherent probability proportional to their size.
And, in fact, we can arrive at that same number directly from the block sizes:

$$\tfrac{1}{4}\log_2 4 \;+\; 12\cdot\tfrac{1}{16}\log_2 16 \;=\; \tfrac{1}{4}\cdot 2 + \tfrac{3}{4}\cdot 4 \;=\; 3.5.$$

Apparently, the quarter-length block brought along its own code length of $\log_2 4 = 2$ bits, weighted by the fraction $\tfrac{1}{4}$ of the time it appears, and each sixteenth-length block did the same with $\log_2 16 = 4$ bits.
By a similar approach, you could determine the bits required to describe any of the oddly-sized rows. Even blocks that don’t divide their row evenly can be shoved into the above formula: a block spanning a fraction $p$ of the row costs $\log_2 \frac{1}{p}$ bits, whether or not that comes out to a whole number.
What we’ve just found is that the contribution of a single block comprising a fraction $p$ of its row is $p \log_2 \frac{1}{p}$ bits, and so the number of bits required to encode an entire row of blocks is just a sum of such terms, one per block.
This is what is called the “base-2 Shannon entropy” or just “entropy”, which I will write¹ as

$$H[p] = \sum_i p_i \log_2 \frac{1}{p_i}.$$
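As a quick numerical sanity check (my own sketch, not part of the original derivation), the formula can be evaluated directly for the rows discussed above:

```python
from math import log2

def entropy(p):
    """Base-2 Shannon entropy of a distribution given as a list of probabilities."""
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

# A row of m equal blocks costs log2(m) bits per block:
print(entropy([1/3] * 3))            # ~1.585
print(entropy([1/16] * 16))          # 4.0

# The non-uniform row: one quarter-length block and twelve sixteenth-length blocks.
print(entropy([1/4] + [1/16] * 12))  # 3.5
```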
A row of $m$ equally-sized blocks has an entropy of $\log_2 m$ bits per block, and our non-uniform rows have entropies of their own, each computable from the block sizes alone. The last of those rows, the one with the quarter-length block, has an entropy of 3.5 bits.
This problem—“how to optimally encode a sequence of samples from a distribution”—is probably the standard way of arriving at “entropy”. But here the mystique of that word is nowhere to be seen: it is not at all clear why the thing just described would be called “entropy”, or why anyone should care about it unless they are trying to write a compression algorithm. It seems to be a synonym for “information density”, perhaps.
2. Log-Likelihood per Sample
Now a second scenario.
Suppose some process in reality emits data according to a probability distribution $p = (p_1, \ldots, p_n)$ over $n$ possible outcomes.
If you sample from this distribution (i.e. roll the dice) over and over, you get a long sequence of outcomes.
You expect, on average, to see each of the $n$ outcomes in proportion to its probability: outcome $i$ about $p_i N$ times in $N$ samples.
Suppose we take a very large number of samples $N$, so that each outcome $i$ occurs essentially exactly $p_i N$ times. The probability of obtaining that particular sequence of samples is then

$$\prod_i p_i^{\,p_i N}.$$

We’ll call the above the “likelihood of the sequence”.
If the distribution is uniform, the likelihood shrinks by exactly a factor of $n$ with every new sample.
For any other distribution the rate at which the likelihood shrinks would vary from sample to sample, but in the long run it would center on some average rate. Furthermore we could weaken our assumption that our samples occur in exact proportion to their probabilities: over a long enough run, the average rate would come out the same.
The above expression for the likelihood is unwieldy because it decreases by a fraction every time we add a new sample. To get something directly interpretable as a “rate of increase”, we should a) invert it, b) take a logarithm, and c) divide out the number of samples $N$:

$$\frac{1}{N}\log_2 \frac{1}{\prod_i p_i^{\,p_i N}} = \sum_i p_i \log_2 \frac{1}{p_i} = H[p].$$
And voilà: we have found the same “Shannon entropy” formula as in the first example. The interpretation this time is as the “average log-inverse-likelihood per sample”—meaning what?
We got here by considering the likelihood of the entire sequence of samples, which is to say the probability that anyone could have predicted the process’s exact output in advance.
So the “entropy” here seems to describe the inherent difficulty of predicting the results of a probability distribution. It places some kind of upper bound on how precisely a sample from that distribution can be predicted. It seems now to have a sense of “innate uncertainty” or perhaps “irreducible complexity”.
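A small simulation of my own (with an arbitrary three-outcome distribution chosen purely for illustration) shows the average log-inverse-likelihood per sample settling onto the entropy:

```python
import random
from math import log2

# An arbitrary three-outcome distribution, chosen only for illustration.
p = {"a": 0.5, "b": 0.3, "c": 0.2}

random.seed(1)
samples = random.choices(list(p), weights=list(p.values()), k=200_000)

# Average log-inverse-likelihood per sample...
average_surprise = sum(log2(1 / p[x]) for x in samples) / len(samples)

# ...compared with the entropy of the distribution itself.
entropy = sum(q * log2(1 / q) for q in p.values())

print(average_surprise, entropy)  # both come out near 1.485 bits
```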
3. Limit of a Log-Multinomial
A third and final derivation.
Consider again some process with $n$ possible outcomes, but suppose this time that every outcome is equally likely. Take $N$ samples and record how many times each outcome occurred in a vector of counts $(N_1, N_2, \ldots, N_n)$, with $\sum_i N_i = N$.
Given one such vector, the probability of having observed exactly those counts is

$$P(N_1, \ldots, N_n) = \frac{1}{n^N}\,\frac{N!}{N_1!\,N_2!\cdots N_n!}.$$
Now apply any selection process to the set of possible outcomes $(N_1, \ldots, N_n)$: some rule that keeps certain count vectors and discards the rest.
It will help to have a few examples in mind:
- “the player folds all hands worse than a given threshold hand” (Texas Hold’em, where $n$ is the number of cards in the deck and $N$ the number of cards in a hand)
- “the word contains an “H” in the first position but does not have any of “OUSE”” (Wordle, where $n$ is the number of candidate words)
- “the measured volume is some particular value” (Ignoring distinguishability considerations, $n$ is the number of possible states per particle, some enormous number according to ChatGPT, while $N$ is the number of particles)
- “if it can’t metabolize lactose it dies” ($n$ is the number of distinct genes which can mutate, for E. coli or something; $N$ is however many cells you want to study)
- “keep only distributions with mean $\mu$ and variance $\sigma^2$” ($n$ is the discretization of the space, $N$ can be whatever you want, and you might consider the limit $N \to \infty$)
What vectors $(N_1, \ldots, N_n)$ should we expect to see among those that survive the selection?
The answer is: whichever have the highest probabilities among all those surviving the selection.
Which are these? What is the “posterior distribution” after this selection process?
We can make two simplifications to the above probabilities:
- all of the $P(N_1, \ldots, N_n)$ have the same denominator (it’s just $n^N$), so their relative sizes will depend on the multinomials in the numerator only.
- we can safely take a logarithm of both sides without changing the relative sizes of any terms.
Therefore the most probable vectors are those for which

$$\log_2 \frac{N!}{N_1!\,N_2!\cdots N_n!}$$

is greatest.
We can simplify further by applying the Stirling approximation for the logarithm of a factorial, $\log_2 N! \approx N \log_2 N - N \log_2 e$:

$$\log_2 \frac{N!}{N_1!\cdots N_n!} \;\approx\; N\log_2 N - \sum_i N_i \log_2 N_i \;=\; N \sum_i \frac{N_i}{N}\log_2\frac{N}{N_i} \;=\; N \cdot H\!\left[\frac{N_i}{N}\right]$$

…and we find that the most probable surviving vectors are exactly those for which the probability distribution implied by the counts, $p_i = N_i / N$, has the greatest entropy.
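Here is a quick numerical check of that approximation (my own sketch): the log of the multinomial coefficient, divided by $N$, approaches the entropy of the empirical distribution $N_i/N$ as $N$ grows.

```python
from math import lgamma, log, log2

def log2_multinomial(counts):
    """log2 of N! / (N_1! N_2! ... N_n!), via log-gamma to avoid huge integers."""
    N = sum(counts)
    return (lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

def entropy(p):
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

# Counts held in the fixed proportions 1/2 : 1/3 : 1/6 while N grows.
for N in (60, 600, 6000, 60000):
    counts = [N // 2, N // 3, N // 6]
    print(N, log2_multinomial(counts) / N, entropy([c / N for c in counts]))
# The middle column climbs toward the entropy in the last column (~1.459 bits).
```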
The interpretation is this: out of all distributions compatible with the selection process (or a constraint, or any information we have), the most probable are those with the highest entropy.
In fact, we can plug this expression for the entropy back in to the original probabilities:

$$P(N_1, \ldots, N_n) \;\approx\; \frac{2^{\,N \cdot H[N_i/N]}}{n^N}.$$

We find that each outcome of the original uniform distribution (in the large-$N$ limit) occurs with a probability that grows exponentially with the entropy of the distribution of counts it implies.
I find this clarifying so I’ll restate it in terms of the distributions themselves.
High-entropy distributions are common; the uniform distribution is the highest of all. The growth rate of this probability with respect to the entropy is fantastically high: it goes as $2^{N \cdot H}$, with $N$ typically enormous.
This is really no different from our earlier “average likelihood per sample” formulation, since the above can be rearranged into something like

$$H\!\left[\frac{N_i}{N}\right] \;\approx\; \frac{1}{N}\log_2 P(N_1, \ldots, N_n),$$

though this relation is neither an equality (since we’ve discarded the denominators and approximated the multinomials), nor is it a “proportional to” (since we took a logarithm); instead we might say that the entropy tracks the log of the probability, per sample.
The derivation just given is the Wallis derivation of the “Principle of Maximum Entropy”, which states that the optimal posterior distribution to expect from any system about which we have some partial knowledge is that with the highest Shannon Entropy consistent with our knowledge (or with any constraints we know to apply).
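To make the principle concrete, here is a toy sketch of my own (not from the original post): three equally likely outcomes $\{0, 1, 2\}$, $N = 30$ samples, and a selection process that keeps only count vectors whose sample mean is exactly 1.2. The most probable surviving count vector is also the one whose implied distribution has the highest entropy.

```python
from math import lgamma, log, log2

N = 30  # samples from three equally likely outcomes {0, 1, 2}

def log2_multinomial(counts):
    """log2 of the multinomial coefficient N! / (N_1! N_2! N_3!)."""
    return (lgamma(sum(counts) + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

def entropy(p):
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

# Selection process: keep only count vectors whose sample mean is exactly 1.2,
# i.e. 0*n0 + 1*n1 + 2*n2 == 1.2 * N == 36.
survivors = []
for n0 in range(N + 1):
    for n1 in range(N + 1 - n0):
        n2 = N - n0 - n1
        if n1 + 2 * n2 == 36:
            survivors.append((n0, n1, n2))

most_probable = max(survivors, key=log2_multinomial)
highest_entropy = max(survivors, key=lambda c: entropy([x / N for x in c]))
print(most_probable, highest_entropy)  # both are (7, 10, 13) for this example
```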
Most standard probability distributions are exactly the maximum-entropy distributions for certain sets of constraints:
- a uniform distribution has the maximum entropy under no constraints
- a normal distribution $\mathcal{N}(\mu, \sigma^2)$ has the highest entropy of all distributions on $\mathbb{R}$ with its specific mean and variance.
- a Bernoulli distribution, which models a flip of an unfair coin with probabilities $(p, 1-p)$ on the two-element set $\{1, 0\}$, has the maximum entropy of any distribution on that set with a mean of $p$.
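As a toy check of the first claim in that list (my own sketch): among a large number of randomly generated distributions over five outcomes, none exceeds the $\log_2 5 \approx 2.32$ bits achieved by the uniform distribution.

```python
import random
from math import log2

def entropy(p):
    return sum(p_i * log2(1 / p_i) for p_i in p if p_i > 0)

random.seed(2)
best = 0.0
for _ in range(100_000):
    weights = [random.random() for _ in range(5)]
    total = sum(weights)
    best = max(best, entropy([w / total for w in weights]))

print(best, entropy([1/5] * 5))  # every random draw stays below log2(5) ~ 2.3219
```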
All of this is fairly mind-bending to me. It’s a strange and nonphysical way of thinking about probability distributions: obviously a Bernoulli distribution—the distribution of an unfair coin—does not arise because you flipped a fair coin over and over and threw out any sequences which don’t have mean $p$.
Still I find this to be the most elementary derivation of the equation $H[p] = \sum_i p_i \log_2 \frac{1}{p_i}$.
- The $H$ was probably intended originally to be a Greek capital “eta”, for “entropy”, but it’s indistinguishable from the Latin letter H. The square brackets indicate that this thing is not exactly a function of its argument; it is really a function of the full vector of $p_i$ values. ↩
- It turns out that the entropy of a distribution, defined in this way, places a theoretical limit on the compressibility of any data which arrives randomly according to that distribution. The compression algorithm sketched in the introduction is a prefix code, which uses shorter codes to encode the more-frequent members of the set. If the data is non-random (like real data), you can probably do much better by dedicating codes to commonly-occurring sequences. ↩
- Predicting this sequence with some other distribution $q$ would alter the probability inside the logarithms (which are used to “model” the sequence) but would not change the probabilities outside, which arose from the actual process which produced each value $p_i N$ times. The resulting expression would be $\sum_i p_i \log_2 \frac{1}{q_i}$, which is called the “cross entropy”—a term which, to me, is unsuggestive of its meaning. ↩
- It is clarifying that this argument also works if the underlying distribution is some other distribution $q$ rather than the uniform one. In that case the probabilities are $P(N_1, \ldots, N_n) = \frac{N!}{N_1!\cdots N_n!}\prod_i q_i^{N_i}$, which in the large-$N$ limit implies minimization of the relative entropy between $N_i/N$ and $q$, $\sum_i \frac{N_i}{N}\log_2\frac{N_i/N}{q_i}$, a.k.a. the Kullback-Leibler divergence. The simpler case of a Shannon entropy falls out if you set $q$ to the uniform distribution. ↩
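As a small numerical aside of my own, tying these last two footnotes together: for any two distributions $p$ and $q$ over the same outcomes, the cross entropy splits into the entropy of $p$ plus the Kullback-Leibler divergence from $p$ to $q$.

```python
from math import log2

p = [0.5, 0.3, 0.2]   # the "true" distribution
q = [0.4, 0.4, 0.2]   # a model used to predict samples that really come from p

entropy_p     = sum(p_i * log2(1 / p_i) for p_i in p)
cross_entropy = sum(p_i * log2(1 / q_i) for p_i, q_i in zip(p, q))
kl_divergence = sum(p_i * log2(p_i / q_i) for p_i, q_i in zip(p, q))

print(cross_entropy, entropy_p + kl_divergence)  # equal: cross entropy = H[p] + KL(p, q)
```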