
1. Typical Sequences

Here we’ll pick up the thread from the first post in this series and continue attempting to articulate exactly what it is that the Shannon Entropy is describing. In the process we’ll encounter two related concepts, the “cross entropy” and the “relative entropy”/“Kullback-Leibler Divergence”.

Since that first post I’ve stabilized my own understanding of how entropy arises, so let us briefly review, hopefully with some new clarity.

As before we’ll work with a discrete probability distribution $p = (p_1, \ldots, p_k)$ over a discrete set of elements $X = \{1, \ldots, k\}$. We take $N$ samples, and will use $n_i$ to represent the count of each outcome $i$ in a sample, with $\sum_i n_i = N$.

Sequences of $N$ samples will be represented by vectors $\mathbf{x} = (x_1, \ldots, x_N)$, with each $x_j$ taking a value in $X$.

Then the probability of any particular sequence in which each outcome $i$ appears $n_i$ times may be represented in two ways: either as a product over the samples,

$$P(\mathbf{x}) = \prod_{j=1}^{N} p_{x_j},$$

or, over the outcomes:

$$P(\mathbf{x}) = \prod_{i=1}^{k} p_i^{n_i}.$$

The second expression is easier to work with. Its logarithm is:

$$\log P(\mathbf{x}) = \sum_{i=1}^{k} n_i \log p_i = -N \sum_{i=1}^{k} \frac{n_i}{N} \log \frac{1}{p_i} \equiv -N\, H\!\left(\tfrac{\mathbf{n}}{N},\, p\right)$$

This entropy expression $H(\tfrac{\mathbf{n}}{N}, p)$, with two arguments, is the cross-entropy between the empirical distribution of our samples and the true distribution $p$.

We’ll give the empirical distribution the name $q$, with $q_i = n_i/N$. Then the cross entropy may be written:

$$H(q, p) = -\sum_i q_i \log p_i$$

Note that if $q = p$, this expression reduces to the Shannon entropy $H(p) = -\sum_i p_i \log p_i$, and if $p$ is a uniform distribution then $H(q, p) = \log k$, regardless of $q$.

We can write the exact probability of obtaining any single sequence whose empirical distribution is $q$ with a cross-entropy:

$$P(\mathbf{x}) = \prod_i p_i^{N q_i} = e^{-N H(q, p)}$$

When the empirical distribution coincides exactly with $p$, the cross-entropy reduces to the ordinary Shannon entropy, giving the result from the first post in this series that the probability of any “typical sequence” is directly related to the Shannon entropy of $p$:

$$P(\mathbf{x}_{\text{typical}}) = e^{-N H(p)}$$
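
To make this concrete, here is a quick numerical check, with made-up numbers of my own, that the probability of a single sequence is controlled by the cross-entropy of its empirical distribution (working in nats):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # true distribution over three outcomes
counts = np.array([6, 3, 1])    # counts n_i in one hypothetical sequence
N = counts.sum()
q = counts / N                  # empirical distribution

prob_sequence = np.prod(p ** counts)      # probability of any one such sequence
cross_entropy = -np.sum(q * np.log(p))    # H(q, p)

print(prob_sequence)                      # ~8.44e-05
print(np.exp(-N * cross_entropy))         # the same number
```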

In general the empirical distribution $q$ will tend to fluctuate around the true distribution $p$, but will converge to the true distribution “in probability” as $N \to \infty$, by the law of large numbers.

Now, for any counts $\mathbf{n} = (n_1, \ldots, n_k)$, the number of sequences with exactly these counts is given by a multinomial coefficient:

$$W(\mathbf{n}) = \frac{N!}{n_1!\, n_2! \cdots n_k!}$$

As we saw in the earlier post, a Stirling approximation for the logarithm of a multinomial produces another expression involving Shannon entropy:

$$\log W(\mathbf{n}) \approx N \log N - \sum_i n_i \log n_i = -N \sum_i \frac{n_i}{N} \log \frac{n_i}{N} = N\, H\!\left(\tfrac{\mathbf{n}}{N}\right)$$

Note that, while the error in a single Stirling approximation is positive (the simple form $\log n! \approx n \log n - n$ falls short of the true value), for a multinomial the errors in the denominators outweigh the numerator, making the overall approximation an overestimate.

Undoing the logarithm and writing $H(q) = -\sum_i q_i \log q_i$ for the Shannon entropy of the empirical distribution, we get an approximation for the multinomial itself,

$$W(\mathbf{n}) \approx e^{N H(q)},$$

where, as before, this is a slight overestimate.
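
As a sanity check on the direction of the error, we can compare the exact multinomial to $e^{N H(q)}$ for some small made-up counts (my own example; the gap is large at tiny $N$, but only polynomial in $N$, so it is subleading in the exponent):

```python
from math import factorial, prod, log, exp

counts = [6, 3, 1]
N = sum(counts)
q = [n / N for n in counts]

W_exact = factorial(N) // prod(factorial(n) for n in counts)   # exact multinomial = 840
H_q = -sum(qi * log(qi) for qi in q)                           # Shannon entropy of q, in nats
W_approx = exp(N * H_q)

print(W_exact, W_approx)   # 840 vs ~7.9e3: the approximation comes out high
```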

Putting these two factors together, the total probability of observing a set of counts $\mathbf{n}$ is the product of the probability of any one of its sequences with the number of sequences:

$$P(\mathbf{n}) = W(\mathbf{n})\, e^{-N H(q, p)} \approx e^{N H(q)}\, e^{-N H(q, p)} = e^{-N \left[ H(q, p) - H(q) \right]}$$

The combination $H(q, p) - H(q)$ is called the relative entropy or Kullback-Leibler Divergence. I’ll use the first term. The usual notation is something like $D_{\mathrm{KL}}(q \,\|\, p)$, but I don’t like that, so I will borrow the symbol $D$ only and write it as $D(q, p)$.

Hence the probability of seeing any particular vector of counts is approximately

$$P(\mathbf{n}) \approx e^{-N D(q, p)}$$
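
Here is the same headline approximation checked against the exact multinomial probability (again with made-up numbers of my own; the exact value carries an extra polynomial-in-$N$ prefactor that the exponential form ignores):

```python
import numpy as np
from scipy.stats import multinomial

p = np.array([0.5, 0.3, 0.2])
counts = np.array([45, 35, 20])
N = counts.sum()
q = counts / N

exact = multinomial.pmf(counts, n=N, p=p)   # exact probability of these counts
D = np.sum(q * np.log(q / p))               # relative entropy D(q, p)
approx = np.exp(-N * D)

print(exact, approx)   # ~0.005 vs ~0.52: same exponential rate, different prefactor
```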

Of the two concepts “cross” and “relative” entropy, the latter is more fundamental, maybe even more so than Shannon entropy itself. The exact expression for the relative entropy is

$$D(q, p) = H(q, p) - H(q) = \sum_i q_i \log \frac{q_i}{p_i}$$

Recall that the logarithm of any $x > 0$ obeys $\log x \le x - 1$, with equality only if $x = 1$. Therefore the logarithm in the relative entropy obeys $\log \frac{q_i}{p_i} \ge 1 - \frac{p_i}{q_i}$, and we can deduce the following (a quick numerical check follows the list):

  • Relative entropy is never negative, as $D(q, p) = \sum_i q_i \log \frac{q_i}{p_i} \ge \sum_i q_i \left( 1 - \frac{p_i}{q_i} \right) = \sum_i q_i - \sum_i p_i = 0$.

  • Relative entropy is $0$ if and only if $q_i = p_i$ for all $i$.

  • When $p$ is a uniform distribution, $D(q, p) = \log k - H(q)$.
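
These properties are easy to verify numerically. Here is a quick sketch of my own, checking the first and third bullets on random distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(q, p):
    """Relative entropy sum_i q_i log(q_i / p_i), in nats."""
    return np.sum(q * np.log(q / p))

k = 5
for _ in range(1000):
    q, p = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
    assert D(q, p) >= 0                      # never negative

uniform = np.full(k, 1 / k)
q = rng.dirichlet(np.ones(k))
H_q = -np.sum(q * np.log(q))
print(D(q, uniform), np.log(k) - H_q)        # these agree
```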

From these properties, and our result for $P(\mathbf{n})$, we can draw a number of conclusions.

First: the probability of observing an empirical distribution exactly equal to the true generating distribution goes to $1$ as $N \to \infty$:

$$P(q = p) \approx e^{-N D(p, p)} = e^{0} = 1$$

In this limit, all of the other empirical distributions go to probability zero. (This is the Law of Large Numbers.)

For large but finite $N$, $q = p$ will occur with some probability slightly less than $1$, because our approximation to the multinomial, which gave us the first factor in $P(\mathbf{n}) \approx e^{N H(q)}\, e^{-N H(q, p)}$, was a slight overestimate. Instead we should see some $W$ slightly less than $e^{N H(p)}$.

Second: in the form

$$P(\mathbf{n}) \approx e^{N H(q)}\, e^{-N H(q, p)}$$

we see that the limiting probability as $N \to \infty$ is achieved by balancing the rate of increase of the first factor (an approximation to the multinomial count of states) with the rate of decrease of the second (the probability of seeing each state). In the limit these two factors multiply to $1$, so the second must be the inverse of the first. That is, the probability of any single sequence is exactly the inverse of the number of sequences.

This means that the distribution has effectively become a uniform one over the sequences, each with probability $e^{-N H(p)}$. This is called the Asymptotic Equipartition Property.
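
A small simulation of my own illustrates the concentration: for sequences actually drawn from $p$, the per-sample log-probability settles onto $H(p)$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
H_p = -np.sum(p * np.log(p))

N = 10_000
for _ in range(3):
    x = rng.choice(len(p), size=N, p=p)       # one sequence of N samples
    log_prob = np.sum(np.log(p[x]))           # log-probability of that sequence
    print(-log_prob / N, "vs H(p) =", H_p)    # each draw lands near H(p)
```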

Third: if the underlying distribution $p$ is a uniform distribution, the probability of any particular empirical distribution $q$ depends only on its entropy:

$$P(\mathbf{n}) \approx e^{-N \left[ \log k - H(q) \right]} = k^{-N}\, e^{N H(q)}$$

This is the property we employed in the “Wallis Derivation” in the first post of this series, and is the underlying logic of the “Maximum Entropy Principle” of Bayesian inference.

Before we move on, let’s sum up the results of this section in a table.


| Distribution $p$ | $P(\mathbf{n})$ with general $q$ | $P(\mathbf{n})$ with $q = p$ |
| --- | --- | --- |
| Arbitrary | $e^{-N D(q, p)}$ | $\approx 1$ |
| Uniform over $k$ | $k^{-N}\, e^{N H(q)}$ | $\approx 1$ |

2. The Growth Rate of Sample Spaces

All of this is standard information theory, but I had never really wrapped my head around it when I was learning physics. I now find it tremendously clarifying.

What emerges is a picture of “entropy” as fundamental to probability itself. A “probability”, a frequentist might declare, is nothing but the long run proportion of outcomes of a particular event. And any finite sample from some set of events occurring according to a probability distribution necessarily has multinomially distributed counts. Entropy enters immediately. Entropy, it turns out, is intimately related to the basic act of “sampling”. This I find much more satisfying than Shannon’s communications-focused approach.

But it still seems mysterious to me that $H$ has the form that it does. Why is it $-\sum_i p_i \log p_i$? This expression is so simple that it feels like its form should be obvious.

This expression has the curious property of acting like a “derivation”, i.e. it obeys the Leibniz Rule like a derivative. If we write $\eta(x) = -x \log x$, then $\eta(xy) = x\, \eta(y) + y\, \eta(x)$. Compare with $d(xy) = x\, dy + y\, dx$. Is $\eta$ itself the derivative of something?

Furthermore it is quite unsatisfactory that entropy has so much to do with probability distributions, which are often “subjective”, representing “degrees of belief” or hypothetical long-run frequencies, rather than “objective” features of reality. Entropy, it feels, is more fundamental than this.

If, for some underlying distribution $p$, we take $N$ large enough that the empirical distribution is effectively equal to $p$, then our results above say that the “effective” size of the entire sample space is the inverse of the probability of any of the sequences of $N$ samples. Calling this $W$:

$$W = e^{N H(p)}$$

This is an immediate simplification, actually. We can do away with the idea of a sequence of samples with a probability entirely. From here on we will think of entropy as characterizing the growth rate of a space.

My first thought was to try to identify the entropy as something like a surface area, which for a 3D cube with side length $\ell$ and volume $\ell^3$ comes out to $6\ell^2$, and for the 3-sphere with radius $r$ to $4\pi r^2$. But to get a volume, these expressions would have to be integrated rather than exponentiated.

The better analog of $N$ is the dimension argument of a $d$-dimensional volume. After all, $W$ itself is the volume of an $N$-dimensional product of our underlying set $X$.

Let us examine some volume expressions we do understand, the $d$-cube, $d$-sphere, and $d$-simplex:

$$V_{\text{cube}} = x^d, \qquad V_{\text{sphere}} = \frac{\pi^{d/2}\, x^d}{(d/2)!}, \qquad V_{\text{simplex}} = \frac{x^d}{d!}$$

Note I have used $(d/2)!$ rather than $\Gamma(\tfrac{d}{2} + 1)$ in the denominator of the sphere for readability. I’ve also used the same length scale $x$ in the cube and simplex.

What is the derivative of the volume of a $d$-cube with respect to $d$?

$$\frac{\partial}{\partial d} x^d = x^d \log x$$

Now we get something reminiscent of entropy. Evidently we can get the prefactor on its own by taking a logarithmic derivative:

$$\frac{\partial \log V_{\text{cube}}}{\partial d} = \frac{1}{x^d} \frac{\partial x^d}{\partial d} = \log x$$

Well, it’s already clear that we can express the entropy in exactly this form, since we’ve already worked out the multinomial approximation $W(\mathbf{n}) \approx e^{N H(q)}$. We have:

$$\frac{\partial \log W}{\partial N} = \frac{1}{W} \frac{\partial W}{\partial N} = H(q)$$

This is nothing but the same “entropy as rate of decrease of the log-likelihood”, but now explicitly as a “growth rate of the sample space”.
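
We can check this growth-rate reading directly with a finite difference on the exact log-multinomial (my own sketch, holding the proportions $q$ fixed while $N$ grows):

```python
import numpy as np
from math import lgamma

def log_multinomial(counts):
    """log of N! / (n_1! ... n_k!), via the log-gamma function."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

q = np.array([0.5, 0.3, 0.2])
H_q = -np.sum(q * np.log(q))

N1, N2 = 10_000, 10_100
growth = (log_multinomial(q * N2) - log_multinomial(q * N1)) / (N2 - N1)
print(growth, "vs H(q) =", H_q)   # the growth rate per added sample is H(q)
```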

It is looking as though the $\log$ in the entropy is typical of derivatives with respect to dimension. The volume of a $d$-dimensional cube increased by a factor of $x$ each time a new dimension of length $x$ was added. Then $e^{H(p)}$ is like the “side length” or “length scale” of an $N$-dimensional multinomial. And in fact, a uniform distribution over $k$ elements would have a sample space of exactly an $N$-cube of side $k$ and would have entropy $\log k$.

This answers one of my questions: the entropy is like a logarithm of a whole distribution. $H(p)$ appears to serve the same role for the set $X$ with measure $p$ as $\log x$ does for the cube, like the logarithm of the “effective length-scale of the set”. I expect it doesn’t matter if we interpret $x$ as representing a length of $x$ units or a division of the unit length into $1/x$ subdivisions; likewise we can probably think of a distribution as representing either a division of unity into subsets of size $p_i$ or as an expansion of unity into something of effective size $e^{H(p)}$ (but what kind of thing?).

Let us make a quick reference table of the result of $\frac{\partial \log V}{\partial d}$ for the cube, sphere, simplex, and multinomial. I’ve also included a bare factorial for later reference. These results will freely make use of large $d$, the Stirling approximation $\log d! \approx d \log d - d$, and $n_i = N q_i$ where appropriate. (A numerical spot check of the sphere row follows the table.)


| Shape | Volume | Log-Derivative (w.r.t. $d$ or $N$) |
| --- | --- | --- |
| $d$-Cube | $x^d$ | $\log x$ |
| $d$-Simplex | $\dfrac{x^d}{d!}$ | $\log x - \log d$ |
| $d$-Sphere | $\dfrac{\pi^{d/2}\, x^d}{(d/2)!}$ | $\log x + \tfrac{1}{2} \log \dfrac{2\pi}{d}$ |
| $N$-Multinomial | $\dfrac{N!}{n_1! \cdots n_k!} \approx e^{N H(q)}$ | $H(q)$ |
| $N$-Factorial | $N!$ | $\log N$ |
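
As a spot check on the least obvious row, here is the sphere entry compared against the exact derivative (my own check, using the digamma function for the derivative of the log-gamma):

```python
import numpy as np
from scipy.special import digamma

d, x = 100.0, 2.0
exact = 0.5 * np.log(np.pi) + np.log(x) - 0.5 * digamma(d / 2 + 1)  # d/dd of log V_sphere
stirling = np.log(x) + 0.5 * np.log(2 * np.pi / d)                  # the table entry
print(exact, stirling)   # agree to within about a percent at d = 100
```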

And as another reference, here is how to use a log-derivative to generate a finite translation:

$$V(d + a) = V(d)\, e^{\log V(d + a) - \log V(d)}$$

$$V(d + a) = V(d)\, \exp\!\left[ \int_d^{d+a} \frac{\partial \log V(d')}{\partial d'}\, dd' \right]$$

The last line is just the Fundamental Theorem of Calculus for the log-derivative:

$$\log V(d + a) - \log V(d) = \int_d^{d+a} \frac{\partial \log V(d')}{\partial d'}\, dd'$$

3. Leibniz Rule?

What about the Leibniz rule?

We might hope that $H$ being defined as some kind of derivative gives us this for free. But the definition of $H$ itself that we found was as a log-derivative, and these are not derivations:

$$\frac{\partial \log (fg)}{\partial d} = \frac{\partial \log f}{\partial d} + \frac{\partial \log g}{\partial d}$$

Of course it would work to consider a product of a suitable form, but this does not feel too natural. Perhaps the view of $W$ as a plain eigenfunction of $\partial/\partial N$, with eigenvalue $H$, will do instead?

No, this “eigenfunction” does not give us Leibniz; it’s equivalent to the log-derivative: for $W_1 = e^{N H_1}$ and $W_2 = e^{N H_2}$,

$$\frac{\partial}{\partial N} (W_1 W_2) = (H_1 + H_2)\, W_1 W_2$$

We may be looking in the wrong place. Let’s try to be specific about the sense in which $H$ acted like a derivation. The summand $\eta(x) = -x \log x$ appears to obey a Leibniz rule term-by-term:

$$\eta(p_i q_j) = -p_i q_j \log (p_i q_j) = p_i\, \eta(q_j) + q_j\, \eta(p_i)$$

This is like representing the set $X$, which underlies our whole sample space, as a product, $X = Y \times Z$ with distribution $p \otimes r$. But $H$ itself transforms as

$$H(p \otimes r) = \Big( \sum_i p_i \Big) H(r) + \Big( \sum_j r_j \Big) H(p) = H(p) + H(r),$$

i.e., it’s only slightly more of a derivation than the log-derivatives we just saw; the prefactors are there, but they come out to $1$.

The sensible thing to do may be to simply interpret $\mu$ as a measure on $X$ rather than a probability, i.e. to drop the requirement $\sum_i \mu_i = 1$. Then the Leibniz property will hold as expected for a Cartesian product of sets $X \times Y$, giving

$$H(\mu \otimes \nu) = \mu(X)\, H(\nu) + \nu(Y)\, H(\mu),$$

where $\mu(X) = \sum_i \mu_i$ and $\nu(Y) = \sum_j \nu_j$ are the measures of the constituent sets $X$ and $Y$. That seems to work well, but it isn’t very illuminating.
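
A quick numerical check of my own of this measure version of the rule, with arbitrary unnormalized weights:

```python
import numpy as np

def H(w):
    """Sum of -w log w over an unnormalized weight vector."""
    return -np.sum(w * np.log(w))

rng = np.random.default_rng(2)
mu = rng.uniform(0.1, 2.0, size=4)           # weights on X, not normalized
nu = rng.uniform(0.1, 2.0, size=3)           # weights on Y, not normalized

lhs = H(np.outer(mu, nu).ravel())            # H of the product measure on X x Y
rhs = mu.sum() * H(nu) + nu.sum() * H(mu)    # the Leibniz form
print(lhs, rhs)                              # identical
```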

It feels like there’s some view missing here. Vaguely it feels like $H$ is some kind of directional derivative of a quantity which is later evaluated at the normalization $\sum_i p_i = 1$, causing the factors to disappear from the product rule.

Perhaps we should be looking at the combination $N \log N - \sum_i n_i \log n_i$ instead? Recall that the expression which led to this was:

$$\log \frac{N!}{n_1! \cdots n_k!} \approx N \log N - \sum_i n_i \log n_i$$

Terms like $n \log n$ also obey a Leibniz rule, but the $n$ need not be normalized. Maybe we can replace $N$ with a product, $N = N_1 N_2$?

… no, it doesn’t work. The $\sum_i n_i \log n_i$ term has to belong to one factor or the other; we can’t associate it with both.

I do find the form $N \log N - \sum_i n_i \log n_i$ to be intriguing. If we regard a multinomial as a function of $N$ and all the $n_i$ separately, then this expression appears to be describing the approximate construction of a multinomial as a certain combination of generators of factorials (recall $\frac{\partial}{\partial N} \log N! \approx \log N$). Unlike the cube’s $\log x$, the generator $\log N$ varies with $N$, so to determine the full action we need to integrate the generator:

$$\log N! \approx \int_0^N \log n\, dn = N \log N - N$$

A multinomial then looks like “apply the generator of the variable eventually called $N$ exactly $N$ times, then subtract the generator of the variable called $n_1$ exactly $n_1$ times, then that of $n_2$, etc.” An odd view.
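
For what it’s worth, the “integrate the generator” step checks out numerically (a quick check of my own):

```python
from math import lgamma, log

N = 1000
integral = N * log(N) - N      # the integral of log(n) from 0 to N
exact = lgamma(N + 1)          # log(N!)
print(integral, exact)         # ~5907.8 vs ~5912.1
```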

Well, that’s enough for now. This is by far the most concrete my sense of entropy has ever been, so I’m quite satisfied.