Table of Contents

1. Typical Sequences
2. The Growth Rate of a Sample Space
3. Leibniz Rule?

1. Typical Sequences

Here we’ll pick up the thread from the first post in this series, attempting to articulate exactly what it is that the Shannon entropy describes. In the process we’ll encounter two related concepts, the “cross entropy” and the “relative entropy”, also known as the “Kullback-Leibler Divergence”.

Since that first post I’ve stabilized my own understanding of how entropy arises, so let us briefly review, hopefully with some new clarity.

As before we’ll work with a discrete probability distribution $p = (p_1, \dots, p_k)$ over $k$ discrete elements $x_1, \dots, x_k$. We imagine taking $N$ samples from this distribution, forming a sequence vector $\mathbf{x} = (x^{(1)}, \dots, x^{(N)})$. We’ll use $n_i$ to denote the count of each of the $k$ outcomes in the sample, with $\sum_i n_i = N$.

Then the probability of any particular sequence $\mathbf{x}$ in which each outcome $x_i$ appears $n_i$ times may be represented in two ways: either as a product over the samples:

$$P(\mathbf{x}) = \prod_{j=1}^{N} p_{x^{(j)}}$$

or, over the outcomes:

$$P(\mathbf{x}) = \prod_{i=1}^{k} p_i^{n_i}$$

The second expression is easier to work with. Its logarithm is:

$$\log P(\mathbf{x}) = \sum_{i=1}^{k} n_i \log p_i = N \sum_{i=1}^{k} \frac{n_i}{N} \log p_i$$

This entropy expression $-\sum_i \frac{n_i}{N} \log p_i$, with two arguments, is the cross-entropy between the empirical distribution of our samples and the true distribution $p$.

We’ll give the empirical distribution the name $q$, with $q_i = n_i / N$. Then the cross-entropy itself is written:

$$H(q, p) = -\sum_i q_i \log p_i$$

Note that if $q = p$ this expression reduces to the Shannon entropy $H(p) = -\sum_i p_i \log p_i$, and if the samples are distributed as $p$ then $\langle H(q, p) \rangle = H(p)$, regardless of $N$.

We can write the exact probability of obtaining any single sequence whose empirical distribution is $q$ with a cross-entropy:

$$P(\mathbf{x}) = \prod_i p_i^{N q_i} = e^{-N H(q, p)}$$
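The two product forms and the cross-entropy form can be compared numerically. This is just an illustrative sketch; the distribution, the sequence, and all the names here are my own choices:

```python
import math
from collections import Counter

# An illustrative true distribution p over three outcomes.
p = {"a": 0.5, "b": 0.3, "c": 0.2}

# One particular sequence of N = 5 samples.
seq = ["a", "a", "b", "c", "a"]
N = len(seq)

# Probability as a product over the samples...
prob_samples = math.prod(p[x] for x in seq)

# ...and as a product over the outcomes, using the counts n_i.
counts = Counter(seq)
prob_outcomes = math.prod(p[x] ** n for x, n in counts.items())

# Cross-entropy of the empirical distribution q against p (in nats).
q = {x: n / N for x, n in counts.items()}
cross_entropy = -sum(qi * math.log(p[x]) for x, qi in q.items())

print(prob_samples)                  # 0.5^3 * 0.3 * 0.2 = 0.0075
print(prob_outcomes)                 # the same value
print(math.exp(-N * cross_entropy))  # the same value again
```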

When the empirical distribution coincides exactly with $p$, the cross-entropy reduces to the ordinary Shannon entropy, leading to the result from the first post in this series: that the probability of any “typical” sequence is directly related to the Shannon entropy of $p$:

$$P(\mathbf{x}) = e^{-N H(p)}$$

In general the empirical distribution $q$ will tend to fluctuate around the true distribution $p$, but will converge to the true distribution “in probability” as $N \to \infty$, by the law of large numbers.

Now, for any count vector $\mathbf{n} = (n_1, \dots, n_k)$, the number of sequences with exactly these counts is given by a multinomial:

$$W(\mathbf{n}) = \binom{N}{n_1, \dots, n_k} = \frac{N!}{n_1! \cdots n_k!}$$

As we saw in the earlier post, a Stirling approximation of the logarithm of a multinomial produces another expression involving Shannon entropy:

$$\log W(\mathbf{n}) \approx N \log N - \sum_i n_i \log n_i = -N \sum_i q_i \log q_i = N H(q)$$

Note that, while the error in a single Stirling approximation is positive (the approximation $n \log n - n$ slightly undershoots $\log n!$), the error for the total multinomial is negative—the errors in the denominator factorials outweigh the one in the numerator, so that $N H(q)$ is actually a slight overestimate of $\log W$.

Undoing the logarithm and writing $H(q) = -\sum_i q_i \log q_i$, we get an approximation for the multinomial itself,

$$W(\mathbf{n}) \approx e^{N H(q)}$$

where, as just noted, this is a slight overestimate.
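Here’s a quick numerical look at how good, and in which direction, this approximation is; the count vector is an arbitrary illustrative choice and the helper names are mine:

```python
import math

def log_multinomial(counts):
    """Exact log of the multinomial N! / (n_1! ... n_k!), via lgamma."""
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)

counts = [50, 30, 20]
N = sum(counts)
q = [n / N for n in counts]
shannon = -sum(qi * math.log(qi) for qi in q)

exact = log_multinomial(counts)   # ~98.3
approx = N * shannon              # ~103.0: sits slightly above the exact value
print(exact, approx)
```

The gap between the two is only logarithmic in $N$, so it vanishes from the growth rate as $N \to \infty$.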

Putting these two factors together, the total probability of observing a set of counts $\mathbf{n}$ is the product of the probability of any one of its sequences with the number of sequences:

$$P(\mathbf{n}) = W(\mathbf{n}) \prod_i p_i^{n_i} \approx e^{N H(q)} e^{-N H(q, p)} = e^{-N \left( H(q, p) - H(q) \right)}$$

The combination $H(q, p) - H(q)$ is called the relative entropy or Kullback-Leibler Divergence. I’ll prefer the former, “relative entropy”. The usual notation is something like $D_{\mathrm{KL}}(q \| p)$, but I don’t like that, so I will borrow the symbol $\|$ only and write it as $H(q \| p)$.

Hence the probability of seeing any particular vector of counts is approximately

$$P(\mathbf{n}) \approx e^{-N H(q \| p)}$$
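We can watch this approximation take hold numerically: the decay rate $-\frac{1}{N}\log P(\mathbf{n})$ of the exact multinomial probability should approach the relative entropy as $N$ grows. The distributions and helper names below are my own illustrative choices:

```python
import math

def exact_count_prob(counts, p):
    """Exact multinomial probability of one particular count vector."""
    N = sum(counts)
    log_w = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)
    return math.exp(log_w + sum(n * math.log(pi) for n, pi in zip(counts, p)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]   # a fixed empirical distribution, different from p
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# Decay rate -(1/N) log P(n) for growing N, holding q fixed.
rates = [-math.log(exact_count_prob([round(N * qi) for qi in q], p)) / N
         for N in (100, 1000, 10000)]

print(kl)     # ~0.0258
print(rates)  # decreasing toward kl
```

The absolute probabilities differ from the pure exponential by a polynomial prefactor in $N$, but the rate itself converges.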

Of the two concepts, “cross” and “relative” entropy, the latter is the more fundamental, perhaps even more so than the Shannon entropy itself. The exact expression for the relative entropy is

$$H(q \| p) = H(q, p) - H(q) = \sum_i q_i \log \frac{q_i}{p_i}$$

Recall that the logarithm of any $x > 0$ obeys $\log x \le x - 1$, with equality only if $x = 1$. Therefore the logarithm in the relative entropy obeys $\log \frac{p_i}{q_i} \le \frac{p_i}{q_i} - 1$, and we can deduce:

  • Relative entropy is never negative, as

$$H(q \| p) = -\sum_i q_i \log \frac{p_i}{q_i} \ge -\sum_i q_i \left( \frac{p_i}{q_i} - 1 \right) = \sum_i (q_i - p_i) = 0$$

  • Relative entropy is $0$ if and only if $q_i = p_i$ for all $i$.

  • When $p$ is a uniform distribution, $p_i = 1/k$, the relative entropy reduces to $H(q \| p) = \log k - H(q)$.
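These three properties are easy to verify numerically. A minimal sketch, where the helper functions, the seed, and the choice of $k$ are all my own:

```python
import math, random

def relative_entropy(q, p):
    """Relative entropy sum_i q_i log(q_i / p_i), in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def random_dist(k, rng):
    """A random probability distribution over k outcomes."""
    w = [rng.random() for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
k = 4
p = random_dist(k, rng)

# Never negative, over many random q:
assert all(relative_entropy(random_dist(k, rng), p) >= 0 for _ in range(1000))

# Exactly zero when q == p:
assert relative_entropy(p, p) == 0

# Against a uniform p it reduces to log k minus the Shannon entropy of q:
q = random_dist(k, rng)
uniform = [1 / k] * k
shannon_q = -sum(qi * math.log(qi) for qi in q)
print(relative_entropy(q, uniform), math.log(k) - shannon_q)  # equal
```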

From these properties, and our result $P(\mathbf{n}) \approx e^{-N H(q \| p)}$, we can draw a number of conclusions.

First: the probability of observing an empirical distribution exactly equal to the true generating distribution goes to $1$ as $N \to \infty$:

$$P(q = p) \approx e^{-N H(p \| p)} = e^{0} = 1$$

In this limit, all of the other empirical distributions go to probability zero. (This is the Law of Large Numbers.)

For large but finite $N$, $q = p$ will occur with some probability slightly less than $1$, because our approximation to the multinomial, which gave us the first factor in $P(\mathbf{n})$, was a slight overestimate. Instead we should see some $W$ slightly less than $e^{N H(p)}$, and hence a probability slightly less than $1$.

Second: in the form

$$P(\mathbf{n}) \approx e^{N H(q)} \, e^{-N H(q, p)}$$

we see that the limiting probability of $1$ as $N \to \infty$ is achieved by balancing the rate of increase of the first factor (an approximation to the multinomial count of states) with the rate of decrease of the second (the probability of seeing each state). In the limit these two factors multiply to $1$, so the second must be the inverse of the first. That is, the probability of any single sequence is exactly the inverse of the number of sequences.

This means that the distribution has effectively become a uniform one over the $e^{N H(p)}$ typical sequences, each with probability $e^{-N H(p)}$. This is called the Asymptotic Equipartition Property.
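The equipartition behavior can be watched directly by sampling: for sequences drawn from $p$, the per-symbol surprisal $-\frac{1}{N} \log P(\mathbf{x})$ concentrates on $H(p)$. A small simulation sketch (the distribution and names are my illustrative choices):

```python
import math, random

p = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(pi * math.log(pi) for pi in p.values())  # ~1.0297 nats

rng = random.Random(1)
outcomes, weights = list(p), list(p.values())

def sample_rate(N):
    """Draw one length-N sequence from p and return -(1/N) log P(sequence)."""
    seq = rng.choices(outcomes, weights=weights, k=N)
    return -sum(math.log(p[x]) for x in seq) / N

for N in (10, 100, 10000):
    print(N, sample_rate(N))  # concentrates near H as N grows
```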

Third: if the underlying distribution $p$ is a uniform distribution, the probability of any particular empirical distribution depends on its entropy only as:

$$P(\mathbf{n}) \approx e^{-N (\log k - H(q))} \propto e^{N H(q)}$$

This is the property we employed in the “Wallis Derivation” in the first post of this series, and is the underlying logic of the “Maximum Entropy Principle” of Bayesian inference. Evidently, an analogous principle could be devised with any other distribution as its basis, merely by using the relative entropy with respect to that distribution instead.

Before we move on, we’ll sum up the results of this section in a table:

| Distribution | $P(\mathbf{x})$ with empirical $q$ | $P(\mathbf{n})$ with empirical $q$ |
|---|---|---|
| Arbitrary $p$ | $e^{-N H(q, p)}$ | $e^{-N H(q \Vert p)}$ |
| Uniform $p_i = 1/k$ | $e^{-N \log k}$ | $e^{-N (\log k - H(q))}$ |


2. The Growth Rate of a Sample Space

All of this is standard information theory, but I never really wrapped my head around it when I was encountering it for the first time. I now find these observations to be tremendously clarifying.

What emerges is a picture of “entropy” as fundamental to probability itself. A “probability”, a frequentist might declare, is nothing but the long-run proportion of outcomes of a particular event. And any finite sample from some set of events occurring according to a probability distribution necessarily has multinomially-distributed counts. Entropy enters immediately. Entropy, it turns out, is intimately related to the basic act of “sampling”, or, even more fundamentally, to the basic notion of an “event space”. This I find much more satisfying than Shannon’s communications-focused approach.

But it still seems mysterious to me that $H$ has the form that it does. Why is it $-\sum_i p_i \log p_i$? This expression is so simple that it feels like its form should be obvious.

This expression has the curious property of acting like a “derivation”, i.e. it obeys the Leibniz Rule like a derivative. If we write $s(x) = -x \log x$, then $s(xy) = x\, s(y) + y\, s(x)$. Compare with $\mathrm{d}(xy) = x\, \mathrm{d}y + y\, \mathrm{d}x$. Is $s$ itself the derivative of something?
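Writing $s(x) = -x \log x$ for the summand of the entropy, this Leibniz behavior is a one-line check (the name `s` and the values are mine):

```python
import math

def s(x):
    """The entropy summand s(x) = -x log x."""
    return -x * math.log(x)

x, y = 0.3, 0.7
print(s(x * y), x * s(y) + y * s(x))  # equal, up to rounding
```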

Furthermore it feels unsatisfactory that the entropy has so much to do with probability distributions, which are often “subjective”, representing “degrees of belief”, or else represent hypothetical long-run frequencies, rather than “objective” features of reality. Entropy, it feels, is more fundamental than this.

If, for some underlying distribution $p$, we take $N$ large enough that the empirical distribution is effectively equal to $p$, then our results above say that the “effective” size of the entire sample space is the inverse of the probability of any one of the typical sequences. Calling this size $\Omega$:

$$\Omega(N) = e^{N H(p)}$$

This is an immediate simplification, actually. We can do away with the idea of a sequence of samples with a probability entirely. From here on we will think of entropy as characterizing the growth rate of the volume of a space.

From here, my first thought was to try to identify $H$ itself as something like a surface area $A = \frac{\mathrm{d}V}{\mathrm{d}\ell}$, i.e. as analogous, for a 3D cube with side length $\ell$ and volume $V = \ell^3$, to $A = 3\ell^2$, and for the 3-sphere with radius $r$, to $A = 4\pi r^2$. But to get volume from these expressions, they would have to be integrated rather than exponentiated as we do with the entropy.

The better analog for $N$ is the dimension argument of a $d$-dimensional volume. After all, $\Omega(N)$ itself is the volume of an $N$-dimensional product of our underlying set $X$.

Let us examine some volume expressions we do understand, the $d$-cube, $d$-sphere, and $d$-simplex:

$$V_{\text{cube}} = \ell^d \qquad V_{\text{sphere}} = \frac{\pi^{d/2}}{(d/2)!}\, r^d \qquad V_{\text{simplex}} = \frac{\ell^d}{d!}$$

Note I have used $(d/2)!$ rather than $\Gamma(d/2 + 1)$ in the denominator of the sphere for readability.
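These formulas can be sanity-checked at $d = 3$ against the familiar volumes. A sketch, where the function names are mine and $(d/2)!$ is computed as $\Gamma(d/2 + 1)$:

```python
import math

def v_cube(d, l):
    return l ** d

def v_simplex(d, l):
    return l ** d / math.gamma(d + 1)   # d! as Gamma(d + 1)

def v_sphere(d, r):
    # (d/2)! computed as Gamma(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

l = r = 2.0
print(v_cube(3, l))     # 8.0
print(v_simplex(3, l))  # 8/6 = 1.333...
print(v_sphere(3, r))   # (4/3) pi r^3 = 33.51...
```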

What is the derivative of the volume of a $d$-cube with respect to $d$?

$$\frac{\partial V_{\text{cube}}}{\partial d} = \frac{\partial}{\partial d}\, \ell^d = \ell^d \log \ell = V_{\text{cube}} \log \ell$$

Now we get something reminiscent of entropy. Evidently we can get the prefactor $\log \ell$ on its own by taking a logarithmic derivative:

$$\frac{\partial \log V_{\text{cube}}}{\partial d} = \log \ell$$

Well—it’s already clear that we can express the entropy in exactly this form, since we’ve already worked out the multinomial approximation $\Omega(N) \approx e^{N H(p)}$. We have:

$$H(p) = \frac{\partial \log \Omega}{\partial N}$$

This is nothing but the same “entropy as rate of decrease of the log-likelihood”, but now explicitly as a “growth rate of the sample space”.
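This growth-rate reading can be checked directly: a finite difference of the exact log-multinomial with respect to $N$, holding the empirical distribution fixed, recovers the entropy. The numbers and helper names here are illustrative choices of mine:

```python
import math

def log_multinomial(counts):
    """Exact log of N! / (n_1! ... n_k!) via lgamma."""
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)

q = [0.5, 0.3, 0.2]
H = -sum(qi * math.log(qi) for qi in q)

# Growth rate of log W between N = 1000 and N = 1100, with q held fixed.
N1, N2 = 1000, 1100
rate = (log_multinomial([round(N2 * qi) for qi in q])
        - log_multinomial([round(N1 * qi) for qi in q])) / (N2 - N1)

print(rate, H)  # rate ~ H ~ 1.0297
```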

It is looking as though the $\log$ in the entropy is typical of derivatives with respect to dimensions. The volume of a $d$-dimensional cube increased by a factor of $\ell$ each time a new dimension of length $\ell$ was added. Then $e^{H(p)}$ is like the “side length” or “length scale” of an $N$-dimensional multinomial. And in fact, a uniform distribution over $k$ elements would have a sample space of exactly an $N$-cube with side length $k$ and would have entropy $\log k$.

This answers one of my questions: the entropy is like a logarithm of a whole distribution. $H(p)$ appears to serve the same role for the set $X$ with measure $p$ as $\log \ell$ does for the cube; it is like the logarithm of the “effective length-scale of the set”. I expect it doesn’t matter if we interpret $\ell$ as representing a length of $\ell$ units or a division of the unit length into $1/\ell$ subdivisions; likewise we can probably think of a distribution as representing either a division of unity into subsets of size $p_i$ or an expansion of unity into something of effective size $e^{H(p)}$ (but what kind of thing?).

Let us make a quick reference table of the result of $\frac{\partial \log V}{\partial d}$ for the cube, sphere, simplex, and multinomial. I’ve also included a bare factorial for later reference. These results will freely make use of $(d/2)! = \Gamma(d/2 + 1)$, the Stirling approximation $\log d! \approx d \log d - d$, and $N = \sum_i n_i$ where appropriate.


| Shape | Volume | Log-Derivative |
|---|---|---|
| $d$-Cube | $\ell^d$ | $\log \ell$ |
| $d$-Simplex | $\ell^d / d!$ | $\log \ell - \log d$ |
| $d$-Sphere | $\pi^{d/2}\, r^d / (d/2)!$ | $\log r + \frac{1}{2} \log \frac{2\pi}{d}$ |
| $N$-Multinomial | $e^{N H(q)}$ | $H(q)$ |
| $d$-Factorial | $d!$ | $\log d$ |
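The log-derivative column can be spot-checked with finite differences. A sketch, where `log_deriv`, the step size, and the parameter values are my own choices, and `lgamma` stands in for $\log d!$ and $\log (d/2)!$:

```python
import math

def log_deriv(log_v, d, h=1e-6):
    """Central finite difference of log V with respect to the dimension d."""
    return (log_v(d + h) - log_v(d - h)) / (2 * h)

l = r = 2.0
d = 50.0

log_v_cube    = lambda d: d * math.log(l)
log_v_simplex = lambda d: d * math.log(l) - math.lgamma(d + 1)
log_v_sphere  = lambda d: (d / 2) * math.log(math.pi) + d * math.log(r) - math.lgamma(d / 2 + 1)

print(log_deriv(log_v_cube, d),    math.log(l))                                    # log l
print(log_deriv(log_v_simplex, d), math.log(l) - math.log(d))                      # ~ log l - log d
print(log_deriv(log_v_sphere, d),  math.log(r) + 0.5 * math.log(2 * math.pi / d))  # ~ log r + (1/2) log(2 pi / d)
```

The simplex and sphere rows agree only to the accuracy of the Stirling approximation, i.e. up to $O(1/d)$ corrections.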

And as another reference, here is how to use a log-derivative to generate infinitesimal and finite translations:

$$V(d + \epsilon) \approx \left( 1 + \epsilon \frac{\partial}{\partial d} \right) V(d) = V(d) \left( 1 + \epsilon \frac{\partial \log V}{\partial d} \right)$$

$$V(d + a) = e^{a \frac{\partial}{\partial d}} V(d) = V(d)\, \exp\left( \int_d^{d+a} \frac{\partial \log V}{\partial d'}\, \mathrm{d}d' \right)$$

The last line is just the Fundamental Theorem of Calculus for the log-derivative:

$$\int_d^{d+a} \frac{\partial \log V}{\partial d'}\, \mathrm{d}d' = \log V(d + a) - \log V(d)$$

3. Leibniz Rule?

What about the Leibniz rule?

We might hope that $H$ being defined as some kind of derivative gives us this for free. But the definition of $H$ itself we found was of a log-derivative, and a log-derivative is not a derivation:

$$\frac{\partial \log (fg)}{\partial d} = \frac{\partial \log f}{\partial d} + \frac{\partial \log g}{\partial d}$$

Of course it would work to consider a product of the form $e^{fg}$, whose log-derivative is $f \frac{\partial g}{\partial d} + g \frac{\partial f}{\partial d}$, but this does not feel too natural. Perhaps the view of $\Omega = e^{N H}$ as a plain eigenfunction of $\frac{\partial}{\partial N}$ will do instead?

No, this “eigenfunction” does not give us Leibniz; it’s equivalent to the log-derivative:

$$\frac{\partial}{\partial N} e^{N H} = H\, e^{N H}$$

We may be looking in the wrong place. Let’s try to be specific about the sense in which $-x \log x$ acted like a derivation. $s(x) = -x \log x$ appears to obey a Leibniz rule term-by-term:

$$s(p_i q_j) = -p_i q_j \log (p_i q_j) = p_i\, s(q_j) + q_j\, s(p_i)$$

This is like representing the set $Z$, which underlies our whole sample space, as a product, $Z = X \times Y$. But $H$ itself transforms as

$$H(p \otimes q) = \Big( \sum_j q_j \Big) H(p) + \Big( \sum_i p_i \Big) H(q) = H(p) + H(q)$$

i.e., it’s only slightly more of a derivation than the log-derivatives we just saw; the prefactors are there, but they come out to $1$.

The sensible thing to do may be to simply interpret $p$ as a measure $m$ on $X$ rather than a probability, i.e. to allow $\sum_i m_i = M \ne 1$. Then the Leibniz property will hold as expected for a Cartesian product of sets $Z = X \times Y$, giving

$$H(m_X \otimes m_Y) = M_Y\, H(m_X) + M_X\, H(m_Y)$$

where $M_X$ and $M_Y$ are the total measures of the constituent sets $X$ and $Y$. That seems to work well, but it isn’t very illuminating.
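This measure-level Leibniz rule is easy to confirm numerically; the particular measures below are arbitrary illustrative choices of mine:

```python
import math
from itertools import product

def measure_entropy(m):
    """H(m) = -sum_i m_i log m_i for a general (unnormalized) measure m."""
    return -sum(x * math.log(x) for x in m)

m_x = [0.5, 1.5, 2.0]   # total measure M_X = 4.0 (deliberately not 1)
m_y = [3.0, 1.0]        # total measure M_Y = 4.0
M_x, M_y = sum(m_x), sum(m_y)

# Product measure on the Cartesian product X x Y.
m_z = [a * b for a, b in product(m_x, m_y)]

print(measure_entropy(m_z))                                     # left-hand side
print(M_y * measure_entropy(m_x) + M_x * measure_entropy(m_y))  # Leibniz form, equal
```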

It feels like there’s some view missing here. Vaguely it feels like $H$ is some kind of directional derivative of a quantity which is later evaluated at the normalization $\sum_i p_i = 1$, causing the prefactors to disappear from the product rule.

Perhaps we should be looking at the combination $N H(q)$ instead? Recall that the expression which led to this was:

$$\log W \approx N \log N - \sum_i n_i \log n_i$$

Terms like $-n \log n$ also obey a Leibniz rule, and $n$ need not be normalized. Maybe we can replace $W$ with a product, $W = W_X W_Y$?

… no, it doesn’t work. The $N \log N$ term has to belong to $W_X$ or to $W_Y$; we can’t associate it with both.

I do find the form $\log W \approx N \log N - \sum_i n_i \log n_i$ to be intriguing. If we regard a multinomial as a function of $N$ and all the $n_i$ separately, then this expression appears to be describing the approximate construction of a multinomial as a certain combination of generators of factorials (recall $\frac{\partial \log d!}{\partial d} \approx \log d$). Unlike $\log \ell$, the generator $\log n$ varies with $n$, so to determine the full action we need to integrate the generator:

$$\int_0^n \log x \, \mathrm{d}x = n \log n - n \approx \log n!$$
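Numerically, the integrated generator reproduces $\log n!$ up to the usual Stirling correction (the value of $n$ is an illustrative choice):

```python
import math

n = 100
integral = n * math.log(n) - n   # the integral of log x from 0 to n
exact = math.lgamma(n + 1)       # log n!
print(integral, exact)           # ~360.5 vs ~363.7: agree at Stirling accuracy
```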

A multinomial then looks like: “apply the generator of the variable eventually called $N$ exactly $N$ times, then subtract the generator of the variable called $n_1$ exactly $n_1$ times, that of $n_2$ exactly $n_2$ times, etc.” An odd view.

Well, that’s enough for now. This is by far the most concrete my sense of entropy has ever been, so I’m quite satisfied.