Entropy II
The Entropy Function
This is part of a series on entropy.
- Entropy I: Motivating Shannon
- Entropy II: The Entropy Function (this post)
1. Interpreting Shannon
Here again is the “Shannon entropy” of a distribution: $$H(p) = \sum_i p_i \log_2 \frac{1}{p_i} = -\sum_i p_i \log_2 p_i.$$
Its argument is a probability distribution, such as a list of nonnegative numbers $p = (p_1, \ldots, p_N)$ summing to one.
The simplest example is the entropy of a uniform distribution over $N$ elements, $p_i = 1/N$. We get $$H = \sum_{i=1}^{N} \frac{1}{N} \log_2 N = \log_2 N,$$ and the formula also interpolates cleanly over non-uniform distributions.
We saw in the first post in this series that it is interpreted as representing the “information density” or “innate uncertainty” of a sample from this distribution.
It is also the “average log-likelihood per sample” for a typical sequence drawn from the distribution. We also saw that the same formula arises when taking the large-$n$ limit of a multinomial coefficient, which can be seen as the relative size of a given outcome of many samples of a uniform distribution.
We can arrive at another characterization by noting that the entropy can be written as an expectation: $$H(p) = \mathbb{E}_{x \sim p}\left[\log_2 \frac{1}{p(x)}\right],$$
where the object inside the expectation is either “the number of bits required to express that probability” or “the information required to enumerate a set of $1/p(x)$ equally likely possibilities”.
We can then express the entropy as this per-outcome information averaged over the distribution, which suggests we interpret the entropy either as the “average information required to specify an element in the distribution” or as the “average number of bits needed to address or label a sample”.
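Here is a minimal numerical sketch of these formulas, assuming base-2 logarithms as above (the helper name `H` and the example distribution are mine, not from the interactive widgets):

```python
import numpy as np

def H(p):
    """Shannon entropy, in bits, of a (possibly unnormalized) distribution p."""
    p = np.asarray(p, float); p = p / p.sum()
    p = p[p > 0]                     # 0 * log(0) is taken to be 0
    return -(p * np.log2(p)).sum()

print(H(np.ones(8)))                 # uniform over 8 outcomes: log2(8) = 3.0 bits
p = np.array([0.5, 0.25, 0.125, 0.125])
print(H(p), np.sum(p * np.log2(1 / p)))   # the expectation form gives the same 1.75
```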
2. Basic Properties
If we start with a single set carrying all of the probability weight, its entropy is $\log_2 1 = 0$, and then if we divide it in two, into two blocks of probability $\tfrac12$ each, the entropy goes up by one bit: $H = 2 \cdot \tfrac12 \log_2 2 = 1$.
The same applies for any division into two: for each block of weight $w$, splitting it into two halves of weight $w/2$ turns its contribution $w \log_2 \frac{1}{w}$ into $2 \cdot \frac{w}{2} \log_2 \frac{2}{w} = w \log_2 \frac{1}{w} + w$, so dividing every block in two at once raises the total entropy by exactly $\sum_i w_i = 1$ bit.
In the “addressing” or “labeling” interpretation, we think of this as saying: the subdivided distribution requires one more bit per sample, to specify which half of its block an element fell into.
Likewise subdividing or “fine-graining” by another factor of $k$ adds $\log_2 k$ bits.
Such a “fine-graining” is equivalent to replacing the distribution $p$ with the product $p \times u_k$ of $p$ and a uniform distribution on $k$ elements, which leads us to a general rule that entropies add over products: $$H(p \times q) = H(p) + H(q).$$
Note this is the same as the behavior of a logarithm on numbers: $\log(MN) = \log M + \log N$,
and the two laws coincide in the case of uniform distributions1: $$H(u_M \times u_N) = \log_2(MN) = \log_2 M + \log_2 N = H(u_M) + H(u_N).$$
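Both laws are easy to check numerically; a quick sketch, with arbitrarily chosen example weights of my own:

```python
import numpy as np

def H(p):
    p = np.asarray(p, float); p = p / p.sum()
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

p = np.array([0.5, 0.3, 0.2])

# Splitting every block of weight w into two halves of weight w/2 adds one bit.
split = np.repeat(p / 2, 2)
print(H(split) - H(p))                          # 1.0 (one bit)

# Entropies add over independent products: H(p x q) = H(p) + H(q).
q = np.array([0.6, 0.4])
print(H(np.outer(p, q).ravel()), H(p) + H(q))   # equal
```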
We can go in the other direction to calculate the entropy of a coarse-graining or aggregating operation. For any distribution, merging a group of blocks with weights $p_i$ and total weight $w = \sum_i p_i$ into a single block replaces their contribution $\sum_i p_i \log_2 \frac{1}{p_i}$ with the single term $w \log_2 \frac{1}{w}$, lowering the entropy by $w \sum_i \frac{p_i}{w} \log_2 \frac{w}{p_i}$. So grouping blocks always lowers (or at least preserves) the entropy. And grouping all of the blocks into one block of weight $1$ brings the entropy down to zero.
Note that the term subtracted in the above is exactly the (fraction occupied by the merged block) times (the entropy of the relative distribution within that block), $w \cdot H(p_{\text{within}})$ with $p_{\text{within},i} = p_i / w$. Rearranging gives $$H(p) = H(p_{\text{coarse}}) + w \cdot H(p_{\text{within}}),$$
in which we can read the entropy of two equivalent algorithms for sampling, say, uniformly from $N$ elements grouped into blocks of sizes $n_j$:
- either we first select a block from an uneven distribution of block weights $w_j = n_j / N$, then select uniformly from within that block according to a uniform distribution on its $n_j$ elements,
- or, we select one element from a uniform distribution over all $N$ elements in a single step.
Clearly we should be able to count or address a set of $N$ elements either way, block-by-block or all at once, and use the same number of bits on average.
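Here is a small numerical check of that bookkeeping, with block sizes chosen arbitrarily for illustration:

```python
import numpy as np

def H(p):
    p = np.asarray(p, float); p = p / p.sum()
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

# Address 12 equally likely elements in two stages: pick one of the blocks of
# sizes (6, 4, 2) with probability proportional to its size, then pick
# uniformly within that block.
N, sizes = 12, np.array([6, 4, 2])
weights = sizes / N                                   # the uneven block distribution
two_stage = H(weights) + np.sum(weights * np.log2(sizes))
print(two_stage, np.log2(N))                          # both log2(12), about 3.585 bits
```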
3. Visualizing Shannon
Each term in the Shannon entropy has the form $p \log_2 \frac{1}{p}$, whose graph looks like this:
That’s an odd shape, but it doesn’t mean much on its own. It is the product of a linear factor $p$ and a logarithmic factor $\log_2 \frac{1}{p}$, one increasing and one decreasing.
More informative is the graph of the full entropy of a two-outcome distribution, $H(p) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p}$:
It’s nearly a perfect circle, peaking at a value of $1$ bit at $p = \tfrac{1}{2}$.
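A quick numerical confirmation of that peak (grid resolution is my choice):

```python
import numpy as np

# The two-outcome entropy h(p) = p*log2(1/p) + (1-p)*log2(1/(1-p)).
p = np.linspace(1e-6, 1 - 1e-6, 2001)
h = p * np.log2(1 / p) + (1 - p) * np.log2(1 / (1 - p))
print(p[np.argmax(h)], h.max())   # peak of 1.0 bit at p = 0.5
```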
We can also visualize the entropy over all possible probability distributions on a three-element set. The three probabilities $(p_1, p_2, p_3)$ sum to one, so the distributions form a triangle (a 2-simplex), and we can plot the entropy as a height over it:
Again we see that the entropy is maximized at the uniform distribution in the center, where it reaches $\log_2 3 \approx 1.58$ bits, and falls to zero at the three corners, which are the indicator distributions.
For more than three probabilities no direct visualization is possible. Within the space of all possible distributions on $N$ elements, we at least know the extremes: the $N$ indicator distributions all have entropy $0$, and the single uniform distribution attains the maximum, $\log_2 N$.

What happens in the middle? What fraction of all possible distributions has a given entropy? Certainly there are vastly more uneven distributions than either the indicators or the single uniform.
Let’s first try to take on a simpler problem by considering only distributions which can arise by grouping the outcomes of a uniform distribution into blocks, that is, the distributions induced by partitions of a finite set.

Let’s play with a few, starting from the single-block partition, whose entropy is zero.
… but the three-way partitions quickly overtake the two-way ones, and of course there are a lot of these. The total number of partitions is given by the Bell numbers; the 16th Bell number is around 10 billion. But the number of distinct integer partitions (e.g. $16 = 8 + 4 + 3 + 1$, recording only the block sizes) is far smaller, only 231 for a 16-element set, and the block sizes are all the entropy depends on, so we can enumerate those directly.
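Here is a sketch of how one might enumerate those integer partitions and their entropies; the enumeration code is mine, and the printed summary is only a crude stand-in for the histogram:

```python
import numpy as np
from collections import Counter

def H(p):
    p = np.asarray(p, float); p = p / p.sum()
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

def partitions(n, max_part=None):
    """Yield the integer partitions of n as tuples of block sizes."""
    max_part = n if max_part is None else max_part
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield (k,) + rest

n = 16
entropies = [H(np.array(blocks)) for blocks in partitions(n)]
print(len(entropies))                                   # 231 integer partitions of 16
print(sorted(Counter(np.round(entropies, 1)).items()))  # rough histogram of their entropies
```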
There’s a clear shape, peaking (the AI tells me) near
Returning to the first question: what if we did the same for the set of all real-valued distributions on $n$ elements, sampled from across the whole probability simplex?
Well: it’s a similar shape. Note the y-axis is not a log-scale in this one, as the whole region near 0 would go to $-\infty$.
Before we move on, I want to register a couple of stray thoughts:
- I find the distinction between the entropy of partitions and distributions here to be suggestive. I suspect that entropy is more naturally defined on partitions: elements with probability zero effectively don’t exist from the perspective of the entropy. All indicator distributions are the same thing, information-wise (it’s the underlying set which is different). I’ll have to think on this.
- I don’t think a “uniform measure” makes sense on the space of real-valued probability distributions, so the second visualization just given is probably not very meaningful. It might actually make sense to use the entropy itself as the measure on that space, as $2^{nH(p)}$ is the asymptotic measure of a distribution $p$ arising from $n$ samples from a uniform, by the third argument from the previous post in this series. But I won’t try to take that on now.
4. Entropy vs. Variance
Here’s one visualization of some entropies. Now—unlike all the earlier examples—we will associate the indices with numerical values along an axis, so that statistics like the mean and variance are also meaningful.
You can observe:
- the uniform distribution always has the highest entropy.
- all of the indicator/delta-function distributions have entropy zero (and variance zero).
- in general, entropy is lower for very peaked distributions, and higher for spread-out ones. In this respect it is similar to the variance.
- but the entropy is unchanged when the cells are shuffled, while the variance will tend to vary a lot.
(Note that you can put in unnormalized distributions, but they will be normalized before calculating stats.)
Both the entropy and variance characterize the “uncertainty” or “spread-out-ness” of a distribution. Both are zero for an indicator and large for a uniform distribution—but the entropy attains its highest possible value on a uniform, while it’s easy to make the variance even larger by creating something bimodal. (If you click the “Beta” button enough times you’ll get something bimodal, or you can draw your own.)
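The shuffle observation in particular is easy to check numerically. A small sketch, with a made-up peaked distribution (the helpers are mine, and they normalize their input just as the widget does):

```python
import numpy as np

def H(p):
    p = np.asarray(p, float); p = p / p.sum()
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

def var(p, x):
    p = np.asarray(p, float); p = p / p.sum()
    return np.sum(p * x**2) - np.sum(p * x) ** 2

rng = np.random.default_rng(0)
x = np.arange(10.0)                           # the cell indices, treated as values
p = np.exp(-0.5 * ((x - 3.0) / 1.5) ** 2)     # a peaked (unnormalized) distribution
for _ in range(3):
    q = rng.permutation(p)                    # shuffle the cells
    print(round(H(q), 3), round(var(q, x), 3))
# the entropy column is identical on every row; the variance column is not
```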
What’s the difference?
We’ll write the entropy of a r.v. $X$ with distribution $p$ as $H(X) = H(p)$, alongside its variance $\operatorname{Var}(X)$. Clearly the entropy does not depend on the values of $X$ at all, only on their probabilities, while the variance certainly does.
And yet if we plot the two against each other for a few distributions, they seem to be closely related:
(These are discretized normals and betas, not continuous distributions. We’ll get to the entropy of continuous distributions later.)
What’s going on?
Here are a family of normal distributions centered on the same point, discretized onto a fixed grid, with increasing width. We see a clear pattern, which interpolates between the nearly-indicator-like narrow distributions (low entropy, low variance) and the broad, spread-out ones (high entropy, high variance).
Here now are a family of (discretized) Beta distributions. Again we see entropies and variances which decrease together.
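To make the pattern concrete, here is roughly what the normal-family numbers look like if you discretize onto a grid yourself; the grid and the widths below are my choices, not the post’s:

```python
import numpy as np

def H(p):
    p = np.asarray(p, float); p = p / p.sum()
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

x = np.linspace(-10, 10, 201)                 # a fixed grid of cells
for sigma in [0.3, 1.0, 3.0]:
    p = np.exp(-0.5 * (x / sigma) ** 2)       # discretized normal of width sigma
    p = p / p.sum()
    variance = np.sum(p * x**2) - np.sum(p * x) ** 2
    print(sigma, round(H(p), 2), round(variance, 2))
# entropy and variance rise together as the distributions widen
```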
Let’s try a more contrived scenario: you flip a coin of unknown bias some number of times and look at the resulting distribution over the number of heads. If we take our prior on the bias to be spread out toward the two extremes, the resulting distributions are bimodal, which looks like this:
There’s still a tight relationship between entropy and variance, but now it goes the other way.
The difference seems to be that these are bimodal distributions. After all, we should be able to shuffle the probability of any unimodal distribution and change the variance but not the entropy. Likewise, we should be able to find some family of distributions with constant variance whose entropies vary.
Let’s try it. In the following I started with a fixed variance and constructed a family of distributions that all achieve it, with very different entropies:
The point appears to be: entropy and variance are closely related for unimodal distributions, but not in general. They can vary independently, though it can take some pretty contrived distributions to demonstrate it.
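For instance, here is one such contrived family, a construction of my own rather than necessarily the one plotted above: hold the variance at exactly 1 by pushing a shrinking amount of probability mass out to $\pm c$.

```python
import numpy as np

def H(p):
    p = np.asarray(p, float); p = p / p.sum()
    return -(p[p > 0] * np.log2(p[p > 0])).sum()

# Put mass eps/2 at +/-c with eps * c^2 = 1, and the remaining 1 - eps at zero.
# The variance is pinned at 1 while the entropy varies freely.
for eps in [1.0, 0.5, 0.1, 0.01]:
    c = 1 / np.sqrt(eps)
    x = np.array([-c, 0.0, c])
    p = np.array([eps / 2, 1 - eps, eps / 2])
    variance = np.sum(p * x**2) - np.sum(p * x) ** 2
    print(round(variance, 3), round(H(p), 3))
# variance: 1.0 on every row; entropy: 1.0, 1.5, 0.569, 0.091 bits
```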
Further investigation sheds some light. Apparently, there exists a series expansion for the Shannon entropy in terms of the cumulants of a distribution, which are certain functions of its moments (the variance is the second cumulant). This was surprising to me, since the Shannon entropy doesn’t depend on the values taken by the R.V. at all, while the cumulants certainly do.
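For what it’s worth, the first few cumulants are simple combinations of the central moments; a small sketch of what these objects are (the helper is mine):

```python
import numpy as np

def cumulants(x, p):
    """First four cumulants of a discrete r.v. taking values x with probabilities p."""
    p = np.asarray(p, float) / np.sum(p)
    mu = np.sum(p * x)
    m = lambda k: np.sum(p * (x - mu) ** k)       # central moments
    return mu, m(2), m(3), m(4) - 3 * m(2) ** 2   # k1, k2 (the variance), k3, k4

x = np.arange(11.0)
p = np.exp(-0.5 * ((x - 5) / 1.5) ** 2)           # a roughly Gaussian shape
print([round(k, 3) for k in cumulants(x, p)])     # k3 and k4 come out near zero here
```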
Interactive visualizations for this post were authored with D3.js in Typescript, embedded in a Marimo (Python) notebook via anywidget with considerable help from Claude Code, and then manually ported into React components to be consumed by Astro, which builds this site. The simpler plots were built with Bokeh and exported as HTML. Mostly this was a pain and I wouldn’t do it this way again, but it is worth noting that Claude is excellent at one-off D3-type visualizations.
1. I find this law to be very suggestive—entropy appears to act like a generalization of “logarithms” to the large space of distributions, where we identify the uniform distributions with the original space of numbers themselves. We can follow the analogy in the other direction to interpret logarithms as an “information” function in all cases. Aren’t integers, after all, just an abstract “count” of something? ↩
2. One can also derive the formula for $H(p)$ in the first place by asserting a handful of properties like continuity, the logarithm-like property on product distributions, and the equivalence of the two selection processes just given; see wiki. ↩
3. Here’s the chat, including a lot of extra—and impressive—analysis. ↩
4. In statistical mechanics, the logarithm of the partition function, $\log Z$, turns out to be a cumulant generating function, which is one way of explaining its many surprising properties. ↩