This is part of a series on entropy.

  1. Entropy I: Motivating Shannon
  2. (This post)

Table of Contents

1. Interpreting Shannon
2. Basic Properties
3. Visualizing Shannon
4. Entropy vs. Variance

Here again is the “Shannon entropy” of a distribution over a set with $N$ elements:

$$H[p] \;=\; -\sum_{i=1}^{N} p_i \log p_i.$$

Its argument is a probability distribution, written for instance as $p$ or $\{p_i\}$, or a random variable with its distribution given elsewhere. The square brackets should be read as indicating that it is not really a “function” of its argument’s numeric value; instead it is a function of the entire distribution.

The simplest example is the entropy of a uniform distribution over $N$ elements:

$$H[\mathrm{unif}(N)] \;=\; -\sum_{i=1}^{N} \frac{1}{N}\log\frac{1}{N} \;=\; \log N.$$

We get $\log N$, which is easily seen as the “number of bits required to count or label a set of $N$ elements” (in whatever logarithm base we’re using), and the formula also interpolates cleanly over non-uniform distributions.

We saw in the first post in this series that it is interpreted as representing the “information density” or “innate uncertainty” of a sample from this distribution.

It is also the “average log-likelihood per sample” for a typical sequence drawn from the distribution $p$:

$$-\frac{1}{n}\log P(x_1, x_2, \ldots, x_n) \;=\; -\frac{1}{n}\sum_{k=1}^{n}\log p(x_k) \;\xrightarrow{\;n\to\infty\;}\; H[p].$$

We also saw that the same formula arises when taking the large-$n$ limit of the log of a multinomial coefficient,

$$\frac{1}{n}\log \binom{n}{n p_1,\, n p_2,\, \ldots,\, n p_N} \;\xrightarrow{\;n\to\infty\;}\; H[p],$$

which can be seen as the relative size of a given outcome of many samples of a uniform distribution.
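To make the “average log-likelihood per sample” reading concrete, here is a quick numpy sketch; the example distribution and sample size are arbitrary choices of mine, not anything from the post’s figures:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary example distribution

# Shannon entropy in bits
H = -np.sum(p * np.log2(p))

# Draw a long sequence and compute its average negative log-likelihood per sample
n = 100_000
samples = rng.choice(len(p), size=n, p=p)
avg_neg_loglik = -np.mean(np.log2(p[samples]))

print(f"H[p]              = {H:.4f} bits")                 # 1.7500
print(f"-(1/n) log2 P(x)  = {avg_neg_loglik:.4f} bits")    # close to 1.75 for large n
```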

We can arrive at another characterization by noting that the entropy can be written as an expectation:

$$H[p] \;=\; \mathbb{E}_{i\sim p}\big[-\log p_i\big],$$

where the object inside the expectation is either the “number of bits required to express that probability” or the “information required to enumerate $1/p_i$ elements”. For each probability $1/N$ in the uniform distribution it is $\log N$. This is called the “information function” or just “information”, written

$$I(p_i) \;=\; -\log p_i \;=\; \log\frac{1}{p_i}.$$

We can then express $H$ as

$$H[p] \;=\; \mathbb{E}_{i\sim p}\big[I(p_i)\big],$$

which suggests we interpret the entropy either as the “average information required to specify an element of $p$” or as the “expected information gained by learning the exact value of a sample from $p$”.
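A minimal sketch of that reading, with an arbitrary example distribution of my own, checking that the direct sum and the expectation of the information function agree:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # arbitrary example distribution

info = -np.log2(p)            # I(p_i) = -log2(p_i), the "information" of each outcome

H_sum    = -np.sum(p * np.log2(p))   # entropy as the usual sum
H_expect = np.sum(p * info)          # entropy as the expected information E_p[I]

print(info)                  # [1. 2. 3. 3.]  bits needed for each outcome
print(H_sum, H_expect)       # both 1.75 bits
```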

2. Basic Properties

If we start with a single set

and then we divide it in two

the entropy, as measured in base-2 bits, goes up by $1$, from $0$ to $1$.

The same applies for any division into two:

For each block of weight $p$, a fraction $p$ of the whole, dividing it into two half-cells increases its information contribution from $-p\log_2 p$ to $-2\cdot\tfrac{p}{2}\log_2\tfrac{p}{2} = -p\log_2 p + p$, i.e. by $p$ bits:

Therefore, dividing every block in two at once increases the total entropy by $\sum_i p_i = 1$ bit:

If we interpret entropy as giving the information required to “address” or “label” the elements of the set, then dividing each element in two can be interpreted as saying: the new distribution can be addressed by first specifying an element of the original distribution $p$ (costing $H[p]$ bits on average), followed by a single additional bit, 0 or 1, representing the “left” or “right” element of each new subdivision.
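Here’s a small sketch of the halving rule, using an arbitrary three-cell distribution of my own; splitting every cell in two should add exactly one bit:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.3, 0.2])       # arbitrary example distribution

# Split every cell into two equal halves: each p_i becomes two cells of weight p_i / 2
p_fine = np.repeat(p / 2, 2)

print(H(p))          # ~1.4855 bits
print(H(p_fine))     # ~2.4855 bits: exactly one bit more
```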

Likewise subdividing or “fine-graining” by another factor $m$ adds $\log_2 m$ bits:

Such a “fine-graining” is equivalent to replacing the distribution $p$ by its product with a uniform distribution $\mathrm{unif}(m)$:

$$H\big[p \times \mathrm{unif}(m)\big] \;=\; H[p] \;+\; \log m,$$

which leads us to a general rule that entropies add over products:

$$H[p \times q] \;=\; H[p] \;+\; H[q].$$

Note this is the same as the behavior of a logarithm on numbers:

$$\log(M \cdot N) \;=\; \log M \;+\; \log N,$$

and the two laws coincide in the case of uniform distributions1:

$$H\big[\mathrm{unif}(M) \times \mathrm{unif}(N)\big] \;=\; H\big[\mathrm{unif}(MN)\big] \;=\; \log(MN) \;=\; \log M + \log N.$$
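A quick numerical check of the product rule and its uniform special case (the particular distributions are arbitrary choices of mine):

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p = np.array([0.5, 0.3, 0.2])       # arbitrary example distributions
q = np.array([0.25, 0.75])

# The product distribution assigns weight p_i * q_j to each pair (i, j)
pq = np.outer(p, q).ravel()
print(H(pq), H(p) + H(q))           # equal: entropies add over products

# The uniform case reduces to the logarithm law
M, N = 4, 8
u = np.full(M * N, 1.0 / (M * N))
print(H(u), np.log2(M) + np.log2(N))   # both log2(M*N) = 5 bits
```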

We can go in the other direction to calculate the entropy of a coarse-graining or aggregating operation. For any distribution (with rational probabilities) there will be some common denominator $M$ with which we can write each probability as a fraction: $p_i = n_i / M$. For example, if our probabilities are $\{1/2, 1/3, 1/6\}$ we can write these as $\{3/6, 2/6, 1/6\}$. Then we can imagine constructing this distribution by first subdividing the whole into $M$ uniform blocks, then merging blocks to wind up with our final distribution. Merging $n$ blocks into a single cell will change the entropy from

$$\log M \quad\text{to}\quad \log M \;-\; \frac{n}{M}\log n.$$

So grouping $n$ cells into one of size $n$ decreases the entropy by a fraction $\frac{n}{M}$ of $\log_2 n$; in the example above, merging the three cells that form the $\tfrac{1}{2}$ block decreases it by $\frac{3}{6}$ of $\log_2 3$, which is about $0.79$ bits:

And grouping all cells into one of course brings the entropy down to zero:

Note that the term subtracted in the above is exactly (the fraction occupied by the merged block) $\times$ (the entropy of a uniform distribution over as many elements as are in the merged block). We can create any distribution of blocks $\{p_i = n_i/M\}$ by merging a uniform $\mathrm{unif}(M)$ into groups of sizes $\{n_i\}$:

$$H[p] \;=\; \log M \;-\; \sum_i \frac{n_i}{M}\,\log n_i.$$

Rearranging gives

$$H\big[\mathrm{unif}(M)\big] \;=\; H[p] \;+\; \sum_i p_i\, H\big[\mathrm{unif}(n_i)\big],$$

in which we can read the entropies of two equivalent algorithms:

  1. either we first select a block $i$ from the uneven distribution $p$, then select uniformly from within that block according to a $\mathrm{unif}(n_i)$,
  2. or, we select one element from a uniform distribution over all $M$ underlying cells.

Clearly we should be able to count or address a set of $M$ elements either way—these two selection processes are the same; they must contain or require the same amount of information, hence their entropies must also be the same.2
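Here’s a sketch of that equivalence for one concrete choice of block sizes (mine, not from the figures): twelve uniform cells merged into blocks of sizes 6, 3, 2, and 1:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Merge a uniform distribution over M = 12 cells into blocks of these sizes
sizes = np.array([6, 3, 2, 1])
M = sizes.sum()
p_blocks = sizes / M                  # the coarse-grained block distribution

# Two equivalent ways to address one of the 12 underlying cells:
# 1. pick a block (H[p] bits), then pick uniformly inside it (log2 of its size)
two_stage = H(p_blocks) + np.sum(p_blocks * np.log2(sizes))
# 2. pick a cell uniformly from all 12 at once
one_stage = np.log2(M)

print(two_stage, one_stage)           # both 3.58496... bits
```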

3. Visualizing Shannon

Each term in the Shannon entropy has the form

$$-\,p\log p,$$

whose graph looks like



That’s an odd shape, but it doesn’t mean much on its own. It is the product of $p$ (which increases linearly) and $-\log p$, which goes to $+\infty$ at $p = 0$ and goes to $0$ at $p = 1$.

More informative is the graph of $H$ on a two-element distribution, such as that of an unbalanced coin that comes up heads with probability $q$. The distribution is simply $\{q, 1-q\}$ and the entropy, as a function of $q$, looks like the sum of the graph above with itself reflected horizontally, which smooths out the asymmetry, giving



It’s nearly a perfect circle, peaking at a value of $\ln 2 \approx 0.693$ at $q = \tfrac{1}{2}$. (In base-2 its peak value would be exactly $1$.)

We can also visualize the entropy over all possible probability distributions on a three-element set. The three probabilities are constrained to the two-dimensional simplex $p_1 + p_2 + p_3 = 1$. The entropy function looks like:



Again we see that the entropy attains its highest value in the center (a uniform distribution) and goes to zero in all three corners (the indicators on any of the three elements). It treats each probability symmetrically; it is indifferent to the ordering of the three.

For more than three probabilities no direct visualization is possible. Instead, let us try to say something about the shape of the space and the range of entropies on it.

Within the space of all possible distributions on an $N$-element set, there are $N$ indicators $\delta_i$, each with entropy $0$, and one uniform distribution with entropy $\log N$. All other distributions $\{p_i\}$ fall somewhere between these two extremes:

$$0 \;\le\; H[p] \;\le\; \log N.$$

What happens in the middle? What fraction of all possible distributions attains any given entropy? Certainly there are vastly more uneven distributions than either the indicators or the single uniform.

Let’s first try to take on a simpler problem by considering only distributions which can arise by grouping $N$ underlying cells. This is then the problem of “counting partitions of a set”. However, there are no longer $N$ distinct indicators but only one—the partition consisting of the whole set. Clearly there will be more distributions in between, but what is the exact shape?

Let’s play with a few, starting from the single whole-set partition with $H = 0$. Chipping away at it increases the entropy…

… but the three-way partitions quickly overtake the two-way ones, and of course there are a lot of these. The total number of partitions of a set of size $n$ is given by the $n$th Bell number; the 16th Bell number is around 10 billion. Many of these have the same entropies, though, so the problem of determining the distribution of entropies is actually quite tractable. We only need to calculate the entropies of distinct integer partitions (e.g. $16 = 8 + 4 + 2 + 1 + 1$), and the number of these is much smaller; the 16th is only 231. It’s then relatively straightforward to determine how many partitions of a set correspond to each integer partition—for an integer partition into blocks of sizes $\{n_i\}$, the answer is given by a multinomial times a factor for rearranging amongst the $m_k$ blocks of each size $k$:

$$\frac{N!}{\prod_i n_i!} \times \frac{1}{\prod_k m_k!}.$$
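A sketch of that count in Python, using the counting formula just given; it recovers both the 231 integer partitions of 16 and the 16th Bell number:

```python
from math import factorial, prod
from collections import Counter

def integer_partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - k, k):
            yield (k,) + rest

def count_set_partitions(blocks):
    """How many partitions of an N-element set have exactly these block sizes."""
    N = sum(blocks)
    mult = Counter(blocks)            # m_k = number of blocks of size k
    return factorial(N) // (
        prod(factorial(b) for b in blocks) * prod(factorial(m) for m in mult.values())
    )

parts = list(integer_partitions(16))
print(len(parts))                                      # 231 integer partitions of 16
print(sum(count_set_partitions(p) for p in parts))     # 10480142147, the 16th Bell number
```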

As it’s 2025, I can ask an AI to count them all in about two minutes.3 Here’s a visualization:



There’s a clear shape, peaking (the AI tells me) near . Neat. Note the log scale: at the peak value there appear to be about 800 million set-partitions!

Returning to the first question: what if we did the same for the set of all distributions on $N$ elements? The question now is to estimate the measure attaining each value of $H$ within the $(N-1)$-simplex formed by the probabilities. This is probably hard to calculate, but should be fairly simple to estimate numerically. Again I conjure a visualization from the AI:



Well: it’s a similar shape. Note the y-axis is not a log-scale in this one, as the whole region near 0 would go to $-\infty$. Still we see that the vast majority of distributions have entropies a bit lower than the maximum. This isn’t too enlightening, but I was curious.
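For what it’s worth, here is roughly how such a numerical estimate could be set up, sampling the simplex uniformly with a flat Dirichlet; the choice of $N$ and the sample count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16                                               # number of elements
samples = rng.dirichlet(np.ones(N), size=200_000)    # uniform sampling of the simplex

p = np.clip(samples, 1e-300, 1.0)                    # guard against exact zeros
H = -np.sum(p * np.log2(p), axis=1)                  # entropy of each sample, in bits

print(np.log2(N))                # 4.0, the maximum possible entropy
print(H.mean(), H.max())         # the bulk sits somewhat below the maximum
```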

Before we move on, I want to register a couple of stray thoughts:

  • I find the distinction between the entropy of partitions and distributions here to be suggestive. I suspect that entropy is more naturally defined on partitions: elements with $p_i = 0$ effectively don’t exist from the perspective of the entropy. All indicator distributions are the same thing, information-wise (it’s the underlying set which is different). I’ll have to think on this.
  • The first visualization acquires its shape only from the integer partitions with the highest set-partition multiplicities. It might make more sense to smooth this curve out, so as to actually get a sense of the measures in the neighborhood of each value.
  • I don’t think a “uniform measure” makes sense on the space of real-valued probability distributions, so the second visualization just given is probably not very meaningful. It might actually make sense to use the entropy itself as the measure on that space, since it gives the asymptotic measure of a distribution arising from many samples of a uniform, by the third argument from the previous post in this series. But I won’t try to take that on now.

4. Entropy vs. Variance

Here’s one visualization of some entropies. Now—unlike all the earlier examples—we will associate the indices $i$ with their numeric value, i.e. treating this as a random variable (henceforth “r.v.”) $X$ with $P(X = i) = p_i$. This will allow us to compute a mean and variance for the same distribution and, in particular, to compare the entropy to the variance:



Playing with this, we can observe a few things:

  • the uniform distribution has the highest possible entropy for a given number of bins.
  • all of the indicator/delta-function distributions have entropy zero (and variance zero).
  • in general, entropy will be low for very peaked distributions, and high for spread-out ones. In this respect it is similar to the variance.
  • but the entropy is unchanged when the cells are shuffled, while the variance will tend to vary a lot.

(Note that you can put in unnormalized distributions, but they will be normalized before calculating the stats.)

Both the entropy and variance characterize the “uncertainty” or “spread-out-ness” of a distribution. Both are zero for an indicator and large for a uniform distribution—but the entropy attains its highest possible value on a uniform, while it’s easy to make the variance even larger by creating something bimodal. (If you click the “Beta” button enough times you’ll get something bimodal, or you can draw your own.)
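Here’s a small sketch of that comparison, with three hand-picked distributions on ten values (my own examples):

```python
import numpy as np

x = np.arange(10)

def stats(p):
    """Entropy (bits) and variance of a discrete r.v. with P(X = x_i) = p_i."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p > 0
    H = -np.sum(p[nz] * np.log2(p[nz]))
    mean = np.sum(p * x)
    return H, np.sum(p * (x - mean) ** 2)

delta = np.zeros(10)
delta[4] = 1.0                     # all mass on a single value
uniform = np.full(10, 0.1)
bimodal = np.zeros(10)
bimodal[[0, 9]] = 0.5              # mass split between the two ends

for name, p in [("delta", delta), ("uniform", uniform), ("bimodal", bimodal)]:
    print(name, stats(p))
# delta:   H = 0.00 bits, var = 0.00
# uniform: H = 3.32 bits, var = 8.25
# bimodal: H = 1.00 bits, var = 20.25  (low entropy, but the largest variance)
```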

What’s the difference?

We’ll write the entropy of an r.v. as a function of the r.v. rather than of the probabilities themselves:

$$H[X] \;\equiv\; H\big[\{p_i\}\big] \;=\; -\sum_i P(X = i)\,\log P(X = i).$$

Despite this notation, the entropy does not depend on the values of $X$, only on the probabilities, while the variance does depend on the values attained by $X$.

Yet if we plot the two against each other for a few typical distributions, they appear to be closely related:


(Widget Loading...)

(These are discretized normals and betas, not continuous distributions. We’ll address the entropy of continuous distributions later.)

What’s going on?

Let’s investigate further. Here is a family of normal distributions with a common center and varying widths:

(Widget Loading...)

We see a clear pattern, which interpolates between the nearly-indicator-like distributions at small widths and the nearly-uniform ones at large widths.

Here now is a family of distributions, representing the posterior probability that an unfair coin (whose actual bias is some fixed value) comes up heads. At first we have only a broad idea of the value of this bias, but over many flips our posterior belief converges to the true value.

(Widget Loading...)

Again we see entropies and variances which decrease together.

But all of these distributions have been approximately normal; guided by our first widget we probably need to be looking at more spread-out distributions.

Let’s try a contrived scenario: you flip a coin of unknown heads-probability $q$ repeatedly and get the same result $n$ times in a row, but you don’t know if these are $n$ heads or $n$ tails. The probability of such a streak given $q$ is

$$P(\text{streak of } n \mid q) \;=\; q^{n} + (1-q)^{n}.$$

If we take our prior on $q$ to be uniform (i.e. a $\mathrm{Beta}(1,1)$), the posterior distribution of $q$ goes as

$$P(q \mid \text{streak of } n) \;\propto\; q^{n} + (1-q)^{n},$$

which looks like

(Widget Loading...)

There’s still a tight relationship between entropy and variance, but now it goes the other way.

The difference seems to be that these new distributions are bimodal.
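A sketch of the streak posterior on a discretized grid (the grid and streak lengths are my own choices), showing the entropy and variance moving in opposite directions as the streak grows:

```python
import numpy as np

q = np.linspace(0.001, 0.999, 999)    # discretized values of the coin's bias

def stats(p):
    p = p / p.sum()
    H = -np.sum(p * np.log2(p))
    mean = np.sum(p * q)
    return H, np.sum(p * (q - mean) ** 2)

for n in [0, 2, 5, 20, 100]:
    posterior = q ** n + (1 - q) ** n   # flat prior times the streak likelihood
    H, var = stats(posterior)
    print(f"n = {n:3d}:  H = {H:6.3f} bits   var = {var:.4f}")
# As the streak grows the posterior piles up near 0 and 1:
# the entropy falls while the variance rises toward 0.25.
```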

Apparently we can change the variance of a distribution quite freely without affecting the entropy, simply by shuffling it to be more or less spread out.

Then, if we can find some family of distributions with equal variances, their entropies should still be free to vary.

Let’s try it. In the following I started with a single base distribution, and then had the AI conjure up a handful of shuffled versions of that distribution (which should have different variances) and a handful of Beta distributions with this same variance but otherwise different shapes (which should therefore have different entropies):

(Widget Loading...)

The point is: entropy and variance are closely related for unimodal distributions, but not in general. They can vary independently, though it can take some pretty contrived distributions to demonstrate it.
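Here’s a minimal sketch of the shuffling trick, starting from an arbitrary bell-shaped distribution of my own; every permutation has the same entropy but a different variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(30)

def stats(p):
    nz = p > 0
    H = -np.sum(p[nz] * np.log2(p[nz]))
    mean = np.sum(p * x)
    return H, np.sum(p * (x - mean) ** 2)

# An arbitrary unimodal (discretized bell-shaped) starting distribution
p = np.exp(-0.5 * ((x - 15) / 3.0) ** 2)
p /= p.sum()

print(stats(p))                          # original entropy and variance
for _ in range(3):
    print(stats(rng.permutation(p)))     # same entropy every time, very different variances
```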

Further investigation sheds some light. Apparently, there exists a series expansion for the Shannon entropy in terms of the cumulants of a distribution, which are certain functions of the moments which distribute across independent distributions: $\kappa_n(X + Y) = \kappa_n(X) + \kappa_n(Y)$.4 Up through third order, they are identical with the (central) moments—$\kappa_2$ is just the variance. It seems that any distribution can be expanded in a series in its cumulants around a normal distribution, called an Edgeworth series. Hence there exists a “local approximation” of the entropy which will be fairly accurate for distributions which are close to normal:

$$H[X] \;\approx\; \frac{1}{2}\log\big(2\pi e\,\sigma^2\big) \;-\; \frac{1}{12}\left(\frac{\kappa_3}{\sigma^3}\right)^{2} \;-\; \frac{1}{48}\left(\frac{\kappa_4}{\sigma^4}\right)^{2} \;-\; \cdots
This was surprising to me, since the Shannon entropy doesn’t depend on the values taken by the r.v. $X$, while the variance and cumulants do. How can it be that $\sigma^2$ shows up in the formula for $H$, then? The explanation is that, here, we are specifying which distribution we are talking about by its variance and higher cumulants; the entropy depends on these because it depends on the distribution! I.e.:

$$H[X] \;=\; H\big[\{p_i\}\big] \;=\; H\big[\,p(\sigma^2, \kappa_3, \kappa_4, \ldots)\,\big].$$
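As a sanity check on the leading (Gaussian) term, here’s a sketch using a finely discretized normal; the $\log_2(\Delta x)$ correction converts the discrete entropy into the differential one, and the choices of $\sigma$ and grid are arbitrary:

```python
import numpy as np

sigma = 2.0
dx = 0.01
x = np.arange(-10 * sigma, 10 * sigma, dx)

p = np.exp(-0.5 * (x / sigma) ** 2)   # a finely discretized normal distribution
p /= p.sum()

H_discrete = -np.sum(p * np.log2(p))  # entropy of the discrete bins

# Adding log2(dx) converts this to the differential entropy of the underlying
# density, which for a normal is (1/2) log2(2*pi*e*sigma^2)
print(H_discrete + np.log2(dx))                     # ~3.047
print(0.5 * np.log2(2 * np.pi * np.e * sigma**2))   # ~3.047
```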


That’s enough for now. In writing this I answered a number of my own questions about entropy, and raised a few more. My plan for the next post is to take on relative and joint entropies, with, hopefully, the same kind of interactive widgets, as these have been very clarifying.




Interactive visualizations for this post were authored with D3.js in Typescript, embedded in a Marimo (Python) notebook via anywidget with considerable help from Claude Code, and then manually ported into React components to be consumed by Astro, which builds this site. The simpler plots were built with Bokeh and exported as HTML. Mostly this was a pain and I wouldn’t do it this way again, but it is worth noting that Claude is excellent at one-off D3-type visualizations.

  1. I find this law to be very suggestive—entropy appears to act like a generalization of “logarithms” to the large space of distributions, where we identify the uniform distributions with the original space of numbers themselves. We can follow the analogy in the other direction to interpret logarithms as an “information” function in all cases. Aren’t integers, after all, just an abstract “count” of something?

  2. One can also derive the formula for $H$ in the first place by asserting a handful of properties like continuity, the logarithm-like property on product distributions, and the equivalence of the two selection processes just given; see wiki.

  3. Here’s the chat, including a lot of extra—and impressive—analysis.

  4. In statistical mechanics, the logarithm of the partition function, $\log Z$, turns out to be a cumulant generating function, which is one way of explaining its many surprising properties.