This is part of a series on entropy.

  1. Entropy I: Motivating Shannon
  2. (This post)

Table of Contents

1. Interpreting Shannon
2. Basic Properties
3. Visualizing Shannon
4. Entropy vs. Variance

Here again is the “Shannon entropy” of a distribution over a set with $N$ elements:

$$H[p] \;=\; -\sum_{i=1}^{N} p_i \log p_i.$$

Its argument is a probability distribution, written for instance as $p$ or $\{p_i\}$, or a random variable with its distribution given elsewhere. The square brackets should be read as indicating that it is not really a “function” of its argument’s numeric value; instead it is a function of the entire distribution.

The simplest example is the entropy of a uniform distribution over $N$ elements:

$$H[\mathrm{unif}(N)] \;=\; -\sum_{i=1}^{N} \frac{1}{N}\log\frac{1}{N} \;=\; \log N.$$

We get $\log N$, which is easily seen as the “number of bits required to count or label a set of $N$ elements” (in whatever logarithm base we’re using), and the formula also interpolates cleanly over non-uniform distributions.

We saw in the first post in this series that it is interpreted as representing the “information density” or “innate uncertainty” of a sample from this distribution.

It is also the “average log-likelihood per sample” for a typical sequence drawn from the distribution $p$:

$$-\frac{1}{n}\log P(x_1, x_2, \ldots, x_n) \;=\; -\frac{1}{n}\sum_{k=1}^{n}\log p(x_k) \;\xrightarrow{\;n\to\infty\;}\; H[p].$$

We also saw that the same formula arises when taking the large-$n$ limit of the log of a multinomial coefficient,

$$\frac{1}{n}\log \binom{n}{n p_1,\, n p_2,\, \ldots,\, n p_N} \;\xrightarrow{\;n\to\infty\;}\; H[p],$$

which can be seen as the relative size of a given outcome of many samples of a uniform distribution.
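To make the “average log-likelihood per sample” reading concrete, here is a quick numpy sketch; the example distribution and sample size are arbitrary choices of mine, not anything from the post’s figures:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary example distribution

# Shannon entropy in bits
H = -np.sum(p * np.log2(p))

# Draw a long sequence and compute its average negative log-likelihood per sample
n = 100_000
samples = rng.choice(len(p), size=n, p=p)
avg_neg_loglik = -np.mean(np.log2(p[samples]))

print(f"H[p]              = {H:.4f} bits")                 # 1.7500
print(f"-(1/n) log2 P(x)  = {avg_neg_loglik:.4f} bits")    # close to 1.75 for large n
```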

We can arrive at another characterization by noting that the entropy can be written as an expectation:

$$H[p] \;=\; \mathbb{E}_{i\sim p}\big[-\log p_i\big],$$

where the object inside the expectation is either the “number of bits required to express that probability” or the “information required to enumerate $1/p_i$ elements”. For each probability $1/N$ in the uniform distribution it is $\log N$. This is called the “information function” or just “information”, written

$$I(p_i) \;=\; -\log p_i \;=\; \log\frac{1}{p_i}.$$

We can then express $H$ as

$$H[p] \;=\; \mathbb{E}_{i\sim p}\big[I(p_i)\big],$$

which suggests we interpret the entropy either as the “average information required to specify an element of $p$” or as the “expected information gained by learning the exact value of a sample from $p$”.
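A minimal sketch of that reading, with an arbitrary example distribution of my own, checking that the direct sum and the expectation of the information function agree:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # arbitrary example distribution

info = -np.log2(p)            # I(p_i) = -log2(p_i), the "information" of each outcome

H_sum    = -np.sum(p * np.log2(p))   # entropy as the usual sum
H_expect = np.sum(p * info)          # entropy as the expected information E_p[I]

print(info)                  # [1. 2. 3. 3.]  bits needed for each outcome
print(H_sum, H_expect)       # both 1.75 bits
```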

2. Basic Properties

If we start with a single set

and then we divide it in two

the entropy, as measured in base-2 bits, goes up by $1$, from $0$ to $1$.

The same applies for any division into two:

For each block of weight $p$, a fraction $p$ of the whole, dividing it into two half-cells increases its information contribution from $-p\log_2 p$ to $-2\cdot\tfrac{p}{2}\log_2\tfrac{p}{2} = -p\log_2 p + p$, i.e. by $p$ bits:

Therefore, dividing every block in two at once increases the total entropy by $\sum_i p_i = 1$ bit:

If we interpret entropy as giving the information required to “address” or “label” the elements of the set, then dividing each element in two can be interpreted as saying: the new distribution can be addressed by first specifying an element of the original distribution $p$ (costing $H[p]$ bits on average), followed by a single additional bit, 0 or 1, representing the “left” or “right” element of each new subdivision.
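Here’s a small sketch of the halving rule, using an arbitrary three-cell distribution of my own; splitting every cell in two should add exactly one bit:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.3, 0.2])       # arbitrary example distribution

# Split every cell into two equal halves: each p_i becomes two cells of weight p_i / 2
p_fine = np.repeat(p / 2, 2)

print(H(p))          # ~1.4855 bits
print(H(p_fine))     # ~2.4855 bits: exactly one bit more
```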

Likewise subdividing or “fine-graining” by another factor $m$ adds $\log_2 m$ bits:

Such a “fine-graining” is equivalent to replacing the distribution $p$ by its product with a uniform distribution $\mathrm{unif}(m)$:

$$H\big[p \times \mathrm{unif}(m)\big] \;=\; H[p] \;+\; \log m,$$

which leads us to a general rule that entropies add over products:

$$H[p \times q] \;=\; H[p] \;+\; H[q].$$

Note this is the same as the behavior of a logarithm on numbers:

$$\log(M \cdot N) \;=\; \log M \;+\; \log N,$$

and the two laws coincide in the case of uniform distributions1:

$$H\big[\mathrm{unif}(M) \times \mathrm{unif}(N)\big] \;=\; H\big[\mathrm{unif}(MN)\big] \;=\; \log(MN) \;=\; \log M + \log N.$$
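A quick numerical check of the product rule and its uniform special case (the particular distributions are arbitrary choices of mine):

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.3, 0.2])       # arbitrary example distributions
q = np.array([0.25, 0.75])

# The product distribution assigns weight p_i * q_j to each pair (i, j)
pq = np.outer(p, q).ravel()
print(H(pq), H(p) + H(q))           # equal: entropies add over products

# The uniform case reduces to the logarithm law
M, N = 4, 8
u = np.full(M * N, 1.0 / (M * N))
print(H(u), np.log2(M) + np.log2(N))   # both log2(M*N) = 5 bits
```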

We can go in the other direction to calculate the entropy of a coarse-graining or aggregating operation. For any distribution (with rational probabilities) there will be some common denominator $M$ with which we can write each probability as a fraction: $p_i = n_i / M$. For example, if our probabilities are $\{1/2, 1/3, 1/6\}$ we can write these as $\{3/6, 2/6, 1/6\}$. Then we can imagine constructing this distribution by first subdividing the whole into $M$ uniform blocks, then merging blocks to wind up with our final distribution. Merging $n$ blocks into a single cell will change the entropy from

$$\log M \quad\text{to}\quad \log M \;-\; \frac{n}{M}\log n.$$

So grouping $n$ cells into one of size $n$ decreases the entropy by a fraction $\frac{n}{M}$ of $\log_2 n$; in the example above, merging the three cells that form the $\tfrac{1}{2}$ block decreases it by $\frac{3}{6}$ of $\log_2 3$, which is about $0.79$ bits:

And grouping all cells into one of course brings the entropy down to zero:

Note that the term subtracted in the above is exactly (the fraction occupied by the merged block) $\times$ (the entropy of a uniform distribution over as many elements as are in the merged block). We can create any distribution of blocks $\{p_i = n_i/M\}$ by merging a uniform $\mathrm{unif}(M)$ into groups of sizes $\{n_i\}$:

$$H[p] \;=\; \log M \;-\; \sum_i \frac{n_i}{M}\,\log n_i.$$

Rearranging gives

$$H\big[\mathrm{unif}(M)\big] \;=\; H[p] \;+\; \sum_i p_i\, H\big[\mathrm{unif}(n_i)\big],$$

in which we can read the entropies of two equivalent algorithms:

  1. either we first select a block $i$ from the uneven distribution $p$, then select uniformly from within that block according to a $\mathrm{unif}(n_i)$,
  2. or, we select one element from a uniform distribution over all $M$ underlying cells.

Clearly we should be able to count or address a set of $M$ elements either way—these two selection processes are the same; they must contain or require the same amount of information, hence their entropies must also be the same.2
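Here’s a sketch of that equivalence for one concrete choice of block sizes (mine, not from the figures): twelve uniform cells merged into blocks of sizes 6, 3, 2, and 1:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Merge a uniform distribution over M = 12 cells into blocks of these sizes
sizes = np.array([6, 3, 2, 1])
M = sizes.sum()
p_blocks = sizes / M                  # the coarse-grained block distribution

# Two equivalent ways to address one of the 12 underlying cells:
# 1. pick a block (H[p] bits), then pick uniformly inside it (log2 of its size)
two_stage = H(p_blocks) + np.sum(p_blocks * np.log2(sizes))
# 2. pick a cell uniformly from all 12 at once
one_stage = np.log2(M)

print(two_stage, one_stage)           # both 3.58496... bits
```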

3. Visualizing Shannon

Each term in the Shannon entropy has the form

$$-\,p\log p,$$

whose graph looks like



That’s an odd shape, but it doesn’t mean much on its own. It is the product of $p$ (which increases linearly) and $-\log p$, which goes to $+\infty$ at $p = 0$ and goes to $0$ at $p = 1$.

More informative is the graph of $H$ on a two-element distribution, such as that of an unbalanced coin that comes up heads with probability $q$. The distribution is simply $\{q, 1-q\}$ and the entropy, as a function of $q$, looks like the sum of the graph above with itself reflected horizontally, which smooths out the asymmetry, giving



It’s nearly a perfect circle, peaking at a value of $\ln 2 \approx 0.693$ at $q = \tfrac{1}{2}$. (In base-2 its peak value would be exactly $1$.)

We can also visualize the entropy over all possible probability distributions on a three-element set. The three probabilities are constrained to the two-dimensional simplex $p_1 + p_2 + p_3 = 1$. The entropy function looks like:



Again we see that the entropy attains its highest value in the center (a uniform distribution) and goes to zero in all three corners (the indicators on any of the three elements). It treats each probability symmetrically; it is indifferent to the ordering of the three.

For more than three probabilities no direct visualization is possible. Instead, let us try to say something about the shape of the space and the range of entropies on it.

Within the space of all possible distributions on an $N$-element set, there are $N$ indicators $\delta_i$, each with entropy $0$, and one uniform distribution with entropy $\log N$. All other distributions $\{p_i\}$ fall somewhere between these two extremes:

$$0 \;\le\; H[p] \;\le\; \log N.$$

What happens in the middle? What fraction of all possible distributions attains any given entropy? Certainly there are vastly more uneven distributions than either the indicators or the single uniform.

Let’s first try to take on a simpler problem by considering only distributions which can arise by grouping $N$ underlying cells. This is then the problem of “counting partitions of a set”. However, there are no longer $N$ distinct indicators but only one—the partition consisting of the whole set. Clearly there will be more distributions in between, but what is the exact shape?

Let’s play with a few, starting from the single whole-set partition with $H = 0$. Chipping away at it increases the entropy…

… but the three-way partitions quickly overtake the two-way ones, and of course there are a lot of these. The total number of partitions of a set of size $n$ is given by the $n$th Bell number; the 16th Bell number is around 10 billion. Many of these have the same entropies, though, so the problem of determining the distribution of entropies is actually quite tractable. We only need to calculate the entropies of distinct integer partitions (e.g. $16 = 8 + 4 + 2 + 1 + 1$), and the number of these is much smaller; the 16th is only 231. It’s then relatively straightforward to determine how many partitions of a set correspond to each integer partition—for an integer partition into blocks of sizes $\{n_i\}$, the answer is given by a multinomial times a factor for rearranging amongst the $m_k$ blocks of each size $k$:

$$\frac{N!}{\prod_i n_i!} \times \frac{1}{\prod_k m_k!}.$$
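A sketch of that count in Python, using the counting formula just given; it recovers both the 231 integer partitions of 16 and the 16th Bell number:

```python
from math import factorial, prod
from collections import Counter

def integer_partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - k, k):
            yield (k,) + rest

def count_set_partitions(blocks):
    """How many partitions of an N-element set have exactly these block sizes."""
    N = sum(blocks)
    mult = Counter(blocks)            # m_k = number of blocks of size k
    return factorial(N) // (
        prod(factorial(b) for b in blocks) * prod(factorial(m) for m in mult.values())
    )

parts = list(integer_partitions(16))
print(len(parts))                                      # 231 integer partitions of 16
print(sum(count_set_partitions(p) for p in parts))     # 10480142147, the 16th Bell number
```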

As it’s 2025, I can ask an AI to count them all in about two minutes.3 Here’s a visualization:



There’s a clear shape, peaking (the AI tells me) near . Neat. Note the log scale: at the peak value there appear to be about 800 million set-partitions!

Returning to the first question: what if we did the same for the set of all distributions on $N$ elements? The question now is to estimate the measure attaining each value of $H$ within the $(N-1)$-simplex formed by the probabilities. This is probably hard to calculate, but should be fairly simple to estimate numerically. Again I conjure a visualization from the AI:



Well: it’s a similar shape. Note the y-axis is not a log-scale in this one, as the whole region near 0 would go to $-\infty$. Still we see that the vast majority of distributions have entropies a bit lower than the maximum. This isn’t too enlightening, but I was curious.
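For what it’s worth, here is roughly how such a numerical estimate could be set up, sampling the simplex uniformly with a flat Dirichlet; the choice of $N$ and the sample count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                               # number of elements
samples = rng.dirichlet(np.ones(N), size=200_000)    # uniform sampling of the simplex

p = np.clip(samples, 1e-300, 1.0)                    # guard against exact zeros
H = -np.sum(p * np.log2(p), axis=1)                  # entropy of each sample, in bits

print(np.log2(N))                # 4.0, the maximum possible entropy
print(H.mean(), H.max())         # the bulk sits somewhat below the maximum
```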

Before we move on, I want to register a couple of stray thoughts:

  • I find the distinction between the entropy of partitions and distributions here to be suggestive. I suspect that entropy is more naturally defined on partitions: elements with $p_i = 0$ effectively don’t exist from the perspective of the entropy. All indicator distributions are the same thing, information-wise (it’s the underlying set which is different). I’ll have to think on this.
  • The first visualization acquires its shape only from the integer partitions with the highest set-partition multiplicities. It might make more sense to smooth this curve out, so as to actually get a sense of the measures in the neighborhood of each value.
  • I don’t think a “uniform measure” makes sense on the space of real-valued probability distributions, so the second visualization just given is probably not very meaningful. It might actually make sense to use the entropy itself as the measure on that space, since it gives the asymptotic measure of a distribution arising from many samples of a uniform, by the third argument from the previous post in this series. But I won’t try to take that on now.

4. Entropy vs. Variance

Here’s one visualization of some entropies. Now—unlike all the earlier examples—we will associate the indices $i$ with their numeric value, i.e. treating this as a random variable (henceforth “r.v.”) $X$ with $P(X = i) = p_i$. This will allow us to compute a mean and variance for the same distribution and, in particular, to compare the entropy to the variance:



Playing with this, we can observe a few things:

  • the uniform distribution has the highest possible entropy for a given number of bins.
  • all of the indicator/delta-function distributions have entropy zero (and variance zero).
  • in general, entropy will be low for very peaked distributions, and high for spread-out ones. In this respect it is similar to the variance.
  • but the entropy is unchanged when the cells are shuffled, while the variance will tend to vary a lot.

(Note that you can put in unnormalized distributions, but they will be normalized before calculating the stats.)

Both the entropy and variance characterize the “uncertainty” or “spread-out-ness” of a distribution. Both are zero for an indicator and large for a uniform distribution—but the entropy attains its highest possible value on a uniform, while it’s easy to make the variance even larger by creating something bimodal. (If you click the “Beta” button enough times you’ll get something bimodal, or you can draw your own.)
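Here’s a small sketch of that comparison, with three hand-picked distributions on ten values (my own examples):

```python
import numpy as np

x = np.arange(10)

def stats(p):
    """Entropy (bits) and variance of a discrete r.v. with P(X = x_i) = p_i."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p > 0
    H = -np.sum(p[nz] * np.log2(p[nz]))
    mean = np.sum(p * x)
    return H, np.sum(p * (x - mean) ** 2)

delta = np.zeros(10)
delta[4] = 1.0                     # all mass on a single value
uniform = np.full(10, 0.1)
bimodal = np.zeros(10)
bimodal[[0, 9]] = 0.5              # mass split between the two ends

for name, p in [("delta", delta), ("uniform", uniform), ("bimodal", bimodal)]:
    print(name, stats(p))
# delta:   H = 0.00 bits, var = 0.00
# uniform: H = 3.32 bits, var = 8.25
# bimodal: H = 1.00 bits, var = 20.25  (low entropy, but the largest variance)
```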

What’s the difference?

We’ll write the entropy of an r.v. as a function of the r.v. rather than of the probabilities themselves:

$$H[X] \;\equiv\; H\big[\{p_i\}\big] \;=\; -\sum_i P(X = i)\,\log P(X = i).$$

Despite this notation, the entropy does not depend on the values of $X$, only on the probabilities, while the variance does depend on the values attained by $X$.

Yet if we plot the two against each other for a few typical distributions, they appear to be closely related:


(Widget Loading...)

(These are discretized normals and betas, not continuous distributions. We’ll address the entropy of continuous distributions later.)

What’s going on?

Let’s investigate further. Here is a family of normal distributions with a common center and varying widths:

(Widget Loading...)

We see a clear pattern, which interpolates between the nearly-indicator-like distributions at small widths and the nearly-uniform ones at large widths.

Here now is a family of distributions, representing the posterior probability that an unfair coin (whose actual bias is some fixed value) comes up heads. At first we have only a broad idea of the value of this bias, but over many flips our posterior belief converges to the true value.

(Widget Loading...)

Again we see entropies and variances which decrease together.

But all of these distributions have been approximately normal; guided by our first widget we probably need to be looking at more spread-out distributions.

Let’s try a contrived scenario: you flip a coin of unknown heads-probability $q$ repeatedly and get the same result $n$ times in a row, but you don’t know if these are $n$ heads or $n$ tails. The probability of such a streak given $q$ is

$$P(\text{streak of } n \mid q) \;=\; q^{n} + (1-q)^{n}.$$

If we take our prior on $q$ to be uniform (i.e. a $\mathrm{Beta}(1,1)$), the posterior distribution of $q$ goes as

$$P(q \mid \text{streak of } n) \;\propto\; q^{n} + (1-q)^{n},$$

which looks like

(Widget Loading...)

There’s still a tight relationship between entropy and variance, but now it goes the other way.

The difference seems to be that these new distributions are bimodal.
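A sketch of the streak posterior on a discretized grid (the grid and streak lengths are my own choices), showing the entropy and variance moving in opposite directions as the streak grows:

```python
import numpy as np

q = np.linspace(0.001, 0.999, 999)    # discretized values of the coin's bias

def stats(p):
    p = p / p.sum()
    H = -np.sum(p * np.log2(p))
    mean = np.sum(p * q)
    return H, np.sum(p * (q - mean) ** 2)

for n in [0, 2, 5, 20, 100]:
    posterior = q ** n + (1 - q) ** n   # flat prior times the streak likelihood
    H, var = stats(posterior)
    print(f"n = {n:3d}:  H = {H:6.3f} bits   var = {var:.4f}")
# As the streak grows the posterior piles up near 0 and 1:
# the entropy falls while the variance rises toward 0.25.
```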

Apparently we can change the variance of a distribution quite freely without affecting the entropy, simply by shuffling it to be more or less spread out.

Then, if we can find some family of distributions with equal variances, their entropies should still be free to vary.

Let’s try it. In the following I started with a single base distribution, and then had the AI conjure up a handful of shuffled versions of that distribution (which should have different variances) and a handful of Beta distributions with this same variance but otherwise different shapes (which should therefore have different entropies):

(Widget Loading...)

The point is: entropy and variance are closely related for unimodal distributions, but not in general. They can vary independently, though it can take some pretty contrived distributions to demonstrate it.
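Here’s a minimal sketch of the shuffling trick, starting from an arbitrary bell-shaped distribution of my own; every permutation has the same entropy but a different variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(30)

def stats(p):
    nz = p > 0
    H = -np.sum(p[nz] * np.log2(p[nz]))
    mean = np.sum(p * x)
    return H, np.sum(p * (x - mean) ** 2)

# An arbitrary unimodal (discretized bell-shaped) starting distribution
p = np.exp(-0.5 * ((x - 15) / 3.0) ** 2)
p /= p.sum()

print(stats(p))                          # original entropy and variance
for _ in range(3):
    print(stats(rng.permutation(p)))     # same entropy every time, very different variances
```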

Further investigation sheds some light. Apparently, there exists a series expansion for the Shannon entropy in terms of the cumulants of a distribution, which are certain functions of the moments which distribute across independent distributions: $\kappa_n(X + Y) = \kappa_n(X) + \kappa_n(Y)$.4 Up through third order, they are identical with the (central) moments—$\kappa_2$ is just the variance. It seems that any distribution can be expanded in a series in its cumulants around a normal distribution, called an Edgeworth series. Hence there exists a “local approximation” of the entropy which will be fairly accurate for distributions which are close to normal:

$$H[X] \;\approx\; \frac{1}{2}\log\big(2\pi e\,\sigma^2\big) \;-\; \frac{1}{12}\left(\frac{\kappa_3}{\sigma^3}\right)^{2} \;-\; \frac{1}{48}\left(\frac{\kappa_4}{\sigma^4}\right)^{2} \;-\; \cdots
This was surprising to me, since the Shannon entropy doesn’t depend on the values taken by the r.v. $X$, while the variance and cumulants do. How can it be that $\sigma^2$ shows up in the formula for $H$, then? The explanation is that, here, we are specifying which distribution we are talking about by its variance and higher cumulants; the entropy depends on these because it depends on the distribution! I.e.:

$$H[X] \;=\; H\big[\{p_i\}\big] \;=\; H\big[\,p(\sigma^2, \kappa_3, \kappa_4, \ldots)\,\big].$$
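As a sanity check on the leading (Gaussian) term, here’s a sketch using a finely discretized normal; the $\log_2(\Delta x)$ correction converts the discrete entropy into the differential one, and the choices of $\sigma$ and grid are arbitrary:

```python
import numpy as np

sigma = 2.0
dx = 0.01
x = np.arange(-10 * sigma, 10 * sigma, dx)

p = np.exp(-0.5 * (x / sigma) ** 2)   # a finely discretized normal distribution
p /= p.sum()

H_discrete = -np.sum(p * np.log2(p))  # entropy of the discrete bins

# Adding log2(dx) converts this to the differential entropy of the underlying
# density, which for a normal is (1/2) log2(2*pi*e*sigma^2)
print(H_discrete + np.log2(dx))                     # ~3.047
print(0.5 * np.log2(2 * np.pi * np.e * sigma**2))   # ~3.047
```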


That’s enough for now. In writing this I answered a number of my own questions about entropy, and raised a few more. My plan for the next post is to take on relative and joint entropies, with, hopefully, the same kind of interactive widgets, as these have been very clarifying.




Interactive visualizations for this post were authored with D3.js in Typescript, embedded in a Marimo (Python) notebook via anywidget with considerable help from Claude Code, and then manually ported into React components to be consumed by Astro, which builds this site. The simpler plots were built with Bokeh and exported as HTML. Mostly this was a pain and I wouldn’t do it this way again, but it is worth noting that Claude is excellent at one-off D3-type visualizations.

  1. I find this law to be very suggestive—entropy appears to act like a generalization of “logarithms” to the large space of distributions, where we identify the uniform distributions with the original space of numbers themselves. We can follow the analogy in the other direction to interpret logarithms as an “information” function in all cases. Aren’t integers, after all, just an abstract “count” of something?

  2. One can also derive the formula for $H$ in the first place by asserting a handful of properties like continuity, the logarithm-like property on product distributions, and the equivalence of the two selection processes just given; see wiki.

  3. Here’s the chat, including a lot of extra—and impressive—analysis.

  4. In statistical mechanics, the logarithm of the partition function, $\log Z$, turns out to be a cumulant generating function, which is one way of explaining its many surprising properties.