Entropy and Generating Functions
Consider the binomial generating function \({(x+y)}^N\):
- We can read this literally as “\(N\) copies of a two-state system”, where the states \(x\) and \(y\) are indistinguishable between the \(N\) systems.
- Plug in \(x = y = 1\) and you count the total number of states of \(N\) 1-bit systems, \(2^N\).
- Expanding this gives a binomial series: \(\sum_k {N \choose k} x^k y^{N-k}\), representing the number of ways to make each state “\(k\) of \(x\)”, for \(k = 0, 1, \ldots, N\)
- Plug in \(x=p, y=1-p\) and you get a \(\mathrm{binomial}[p]\) distribution, representing the chances of getting \(k\) heads in \(N\) flips. Across all \(k\) these sum to 1: \({(p + (1-p))}^N = 1^N = 1\).
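A quick numeric check of these readings (a minimal Python sketch; the values \(N = 5\) and \(p = 0.3\) are arbitrary):

```python
from math import comb

N, p = 5, 0.3  # arbitrary illustration values

# Coefficients of the expanded g.f.: C(N, k) multiplies x^k y^(N-k)
coeffs = [comb(N, k) for k in range(N + 1)]

# Plugging in x = y = 1 counts all states of N 1-bit systems: 2^N
assert sum(coeffs) == 2**N

# Plugging in x = p, y = 1 - p gives a binomial[p] distribution summing to 1
probs = [comb(N, k) * p**k * (1 - p)**(N - k) for k in range(N + 1)]
assert abs(sum(probs) - 1) < 1e-12
```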
Now: what is the “information” of this g.f.? A few different ideas come to mind, but they contradict each other. This small post will untangle them.
The Information of a Generating Function
1.
The expression \(2^N\) represents the data in \(N\) 1-bit systems; the information content ought to be \(N \log 2\), which is just \(N\) if the \(\log\) is a \(\log_2\).
My immediate instinct is to say that \(N \log{(x+y)}\) ought to be the information of the g.f. above. But this doesn’t make use of the fact that all the \(x\) and \(y\) states are indistinguishable between the \(N\) systems. If we make them distinguishable, our g.f. becomes \(\Pi_i (x_i + y_i)\). This gives the same number of states, \(2^N\), if we plug in \(x_i = y_i = 1\). But if we interpret it as a distribution, it is simply the uniform distribution over the \(2^N\) choices of “\(x_i\) or \(y_i\)” for each system, and it has entropy \(N \log 2\).
So “plugging in 1 to count states” is not the same operation as “entropy”; they only coincide when all states are distinct. The “information” ought to make use of the fact that \(x, y\) are distinct. However you define the entropy of a g.f., it shouldn’t be the same for the g.f.s \(({x+y})^N\) and \(\Pi_i^N (x_i + y_i)\); the operation of marking certain states as indistinguishable ought to change it—by narrowing the distribution.
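To make the distinction concrete (a minimal sketch; \(N = 5\) is arbitrary): both g.f.s count the same \(2^N\) states at \(x = y = 1\), but read as distributions at \(p = \frac{1}{2}\) their entropies differ:

```python
from math import comb, log

N = 5  # arbitrary

# (x+y)^N and prod_i (x_i + y_i) count the same 2^N states at x = y = 1
assert sum(comb(N, k) for k in range(N + 1)) == 2**N

# Distinguishable case: uniform over 2^N strings, entropy N ln 2
H_distinct = N * log(2)

# Indistinguishable case: binomial over k at p = 1/2
probs = [comb(N, k) / 2**N for k in range(N + 1)]
H_binom = sum(-q * log(q) for q in probs)

assert H_binom < H_distinct  # identifying states narrows the distribution
```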
2.
The Shannon entropy of the distribution represented by the fully-expanded g.f.
\[H\left[p_k = {N \choose k} x^k y^{N-k}\right] = \sum_k p_k \ln \frac{1}{p_k}\]This is not a normalized distribution; its sum is—what else—\({(x+y)}^N\). But the entropy of an unnormalized pdf is related to the normalized one by a Leibniz rule: \(H[p_i] = H\left[\frac{1}{Z} \times (Z p_i)\right]\), where \(Z\) is a normalization constant, which rearranges into:
\[H[p_i] = \frac{H[Z p_i]}{Z} + \ln Z\]So we get, for the normalized distribution with \(p = \frac{x}{x+y}\) and \(1-p = \frac{y}{x+y}\):
\[H\left[p_k = {N \choose k}p^k {(1-p)}^{N-k}\right] = \frac{H\left[p_k = {N \choose k}x^k y^{N-k}\right]}{ {(x+y)}^N} + N \ln {(x+y)}\]These are just two ways of writing the entropy of the distribution \({N \choose k} x^k y^{N-k}\), as a function of the values of \(x, y\).
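A numeric check of this rearrangement (a minimal sketch; the values \(N = 5\), \(x = 0.3\), \(y = 0.5\) are arbitrary, chosen so the series is not normalized):

```python
from math import comb, log

def H(weights):
    """Entropy sum(w * ln(1/w)) of a (possibly unnormalized) set of weights."""
    return sum(-w * log(w) for w in weights if w > 0)

N, x, y = 5, 0.3, 0.5  # arbitrary; x + y != 1, so the series is unnormalized

unnorm = [comb(N, k) * x**k * y**(N - k) for k in range(N + 1)]
Z = (x + y)**N                  # the normalization constant, sum of the terms
norm = [w / Z for w in unnorm]  # the binomial with p = x / (x + y)

# H[p_i] = H[Z p_i] / Z + ln Z, with ln Z = N ln(x + y)
assert abs(H(norm) - (H(unnorm) / Z + log(Z))) < 1e-12
```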
3.
But now another idea comes to mind. The Shannon entropy just discussed is arguably not computed on the actual distribution represented by the g.f. \(\sum_k {N \choose k} x^k y^{N-k}\), which is just a distribution over the terms \(x^k y^{N-k}\). If we view these as generic expressions \(z_k\), then the g.f. is \(\sum_k {N \choose k} z_k\), representing an unnormalized series with \(p(k) \propto {N \choose k}\); the normalization constant is \(\sum_k {N \choose k} = 2^N\). The Shannon entropy of this would be \(H[p_k = {N\choose k}] = -\sum_k {N \choose k} \ln {N \choose k}\), which is some negative number, and the normalized form would be
\[H\left[p_k = 2^{-N}{N\choose k}\right] = \frac{H\left[p_k = {N \choose k}\right]}{2^N} + N \ln 2\]Note this is just the earlier entropy \(H\left[p_k = {N \choose k}x^k {y}^{N-k}\right]\) for \(x=y=\frac{1}{2}\), since the factors \(x^k y^{N-k}\) are then equal to \(2^{-N}\). Our first stab at the information, \(N \ln 2\), has popped out—it was the entropy of something, but it wasn’t the best interpretation of the entropy of our g.f.
4.
And one more idea. Doesn’t entropy add across independent systems? A product of g.f.s represents a product of systems, so shouldn’t \(H[xy] = H[x] + H[y]\), and \(H[z^N] = NH[z]\)? What does this have to do with the Leibniz rule that makes \(H\) a derivation \(H[pq] = pH[q] + qH[p]\)?
Here it turns out I’m just confusing a few things. We have to distinguish between pointwise multiplication of distributions vs. the Cartesian product.
Pointwise multiplication is like a dot product: \(\{p_i\} \cdot \{q_i\} = \{p_i q_i\}\) between two sets defined on the same domain. This has no simple expression in terms of the g.f.s themselves: if \(g_a(x) = \sum_n a_n x^{n}\), and likewise for \(g_b(x)\), then the pointwise multiplication is \(g_{a \cdot b} = \sum_n a_n b_n x^n\). The Shannon entropy of a pointwise multiplication is a derivation: \(H[p_i q_i] = \sum_i [p_i H(q_i) + q_i H(p_i)]\). This is most useful when one of \(p_i, q_i\) is a constant, as in \(H[p_i] = H\left[\frac{1}{Z} \times (Z p_i)\right]\) earlier.
The Cartesian product of two sets is \(\{p_i\} \times \{q_j\} = \{p_i q_j\}\), with one weight for each pair \((i, j)\), and its entropy is, by the previous derivation rule:
\[H[p_i q_j] = \sum_{ij} \left[ q_j H(p_i) + p_i H(q_j)\right] = \sum_j q_j H[p_i] + \sum_i p_i H[q_j] = q H[p_i] + pH[q_j]\]and this thing is simply the sum of the entropies \(H[p] + H[q]\) iff both sets are normalized to 1. So this is the rule I was thinking of, but it only applies to distributions. Then the product of \(N\) independent copies of the same distribution would be \(H[z^N] = NH[z]\). But this corresponds to the g.f. \((x_1 + y_1)(x_2 + y_2) \cdots (x_N + y_N)\), rather than the g.f. \({(x+y)}^N\), which has additional structure.
So my confusion comes from the fact that \(H[pq] = H[p] + H[q]\) only holds for normalized distributions, and therefore only holds for single numbers or one-element sets if both numbers are 1. The derivation rule is the general one. And neither applies to \({(x+y)}^N\), because that g.f. has supplied additional information by grouping the different \(x, y\) terms: its entropy should be correspondingly lower than the full \(N \ln 2\) we’d get if every combination of \(x\)s and \(y\)s were distinguishable, by the correction term \(\frac{H\left[{N \choose k}\right]}{2^N}\) from above (which is a negative number, since the distribution \({N \choose k}\) is not normalized).
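A quick check of this correction term (a minimal sketch; \(N = 5\) is arbitrary):

```python
from math import comb, log

def H(weights):
    """Entropy sum(w * ln(1/w)) of a (possibly unnormalized) set of weights."""
    return sum(-w * log(w) for w in weights if w > 0)

N = 5  # arbitrary

coeffs = [comb(N, k) for k in range(N + 1)]  # the unnormalized series C(N, k)
binom_half = [c / 2**N for c in coeffs]      # the binomial at p = 1/2

correction = H(coeffs) / 2**N  # negative, since every C(N, k) >= 1
assert correction < 0

# Binomial entropy = N ln 2 plus the negative correction: strictly below N ln 2
assert abs(H(binom_half) - (N * log(2) + correction)) < 1e-12
```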
We can add to this list of composition laws the entropy of the sum of two disjoint sets: \(H[\{p_i\} \cup \{q_j\}]\) which… also adds, since entropy is a sum across the set in its argument anyway. This property of \(H\) is curious: it treats (normalized) Cartesian products like (unnormalized) disjoint unions. I wonder two things which I won’t pursue right now:
- Can the form of \(H\) be derived from this fact alone?
- Is there a clear relationship between how \(H\) acts on mixture-products and how it acts on non-disjoint unions?
If the input sets are unnormalized, with totals \(p = \sum p_i\) and \(q = \sum q_j\), the entropy of the sum is related to the entropy of the normalized sum by a derivation-rule. But it doesn’t look like anything:
\[\begin{align*} H[\{p_i\} \cup \{q_j\}] &= (p+q) H\left[\frac{\{p_i\} + \{q_j\}}{p+q}\right] + H(p+q) \end{align*}\]That last term is the entropy of a single number, \(H(p+q) = - (p+q) \ln (p+q)\).
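These composition laws are easy to check numerically (a minimal sketch; the two weight sets are arbitrary and deliberately unnormalized):

```python
from math import log

def H(weights):
    """Entropy sum(w * ln(1/w)) of a (possibly unnormalized) set of weights."""
    return sum(-w * log(w) for w in weights if w > 0)

ps = [0.2, 0.5, 0.1]  # arbitrary unnormalized weights, total p = 0.8
qs = [0.3, 0.3]       # arbitrary unnormalized weights, total q = 0.6
p, q = sum(ps), sum(qs)

# Cartesian product: H[{p_i q_j}] = q H[{p_i}] + p H[{q_j}]
product = [pi * qj for pi in ps for qj in qs]
assert abs(H(product) - (q * H(ps) + p * H(qs))) < 1e-12

# Disjoint union: H[{p_i} ∪ {q_j}] = H[{p_i}] + H[{q_j}]
union = ps + qs
assert abs(H(union) - (H(ps) + H(qs))) < 1e-12

# Unnormalized vs. normalized union: H[u] = (p+q) H[u / (p+q)] + H(p+q)
normalized = [w / (p + q) for w in union]
single = -(p + q) * log(p + q)  # entropy of the single number p + q
assert abs(H(union) - ((p + q) * H(normalized) + single)) < 1e-12
```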
5.
In a couple of places we have also encountered the Shannon entropy of a single number, like the \(H(p+q) = - (p+q) \ln (p+q)\) we just saw. You could apply this to a g.f. like \(H[{(x+y)^N}] = -N {(x+y)}^N \ln (x+y)\), but I’m not sure what this would mean.
The Entropy of a Generating Function
Well. Those are the different ideas that come to mind when I wonder about the “information of a g.f.”.
Apparently, however we define the entropy of a g.f., it cannot depend only on the value of the g.f., since \(g(p) = {(p + (1-p))}^N\) equals \(1\) regardless of the value of \(p\). Nor is it a function of the variable in which the g.f. is a polynomial, like \(p\), since the individual terms \(p^k {(1-p)}^{N-k}\) are themselves polynomials in \(p\): the split into terms is not determined by the polynomial alone.
It must be defined relative to the thing we’ve decided we’re counting in the g.f.: it needs to be affected by the way we choose to group terms. For example, for \(N=4\) and \(p=\frac{1}{4}\) in the previous g.f., the terms \(p^3(1-p)\) and \(p(1-p)^3\) have the same binomial coefficient, but they represent distinct “states” if we’re counting “number of heads”, and ought to enter the entropy as two terms rather than one. Combining two states into one always makes the entropy smaller: a uniform distribution over \(N\) elements has entropy \(\ln N\), and if you keep combining these until there’s one big state you’ll get 0 in the end.
It may be that, for a choice of how to group the terms, you can mark this grouping with a specific set of variables \(x_i\) or powers of a single variable \(x^n\); our earlier attempts to define an entropy were being careless about the choice of which variables mark the distribution vs which parameterize it. We can define an entropy functional relative to a marker variable. This rhymes nicely with the idea that entropy is always relative to something.
For the binomial, the standard way to mark terms would be by the powers \(x^k\) in \({(px + (1-p))}^N\). This takes the monomials \(x^k\) as the “basis” for the entropy, and everything else is part of the probability. (Then \({(px + (1-p)y)}^N\) would be effectively the same basis, since a choice of \(x^k\) fixes \(y^{N-k}\), but both would be different from, say, \({(px + (1-p)x^2)}^N\), which groups certain \(k\) terms.) Then we can define entropy relative to this basis as a functional on g.f.s:
\[\begin{align*} H[g(x)] &= H\left[p_k = \left.\frac{1}{k!}{(\partial_x)}^k g(x)\right|_{x=0}\right]\\ &= \sum_k \left.\left[ -\frac{ {(\partial_x)}^k g(x)}{k!} \ln \left(\frac{ {(\partial_x)}^k g(x)}{k!} \right) \right]\right|_{x=0} \end{align*}\]This is not too revealing, but I see a few things:
- Some kind of chain rule?
- Something that looks like a translation operator \(e^{-\partial_x}\), or a Taylor series.
- We could get rid of the \(k!\)s with a \(\partial_{x^k}\), or with some other kind of g.f.
- You could e.g. take an entropy w.r.t. one variable of a two-variable g.f., like \(H_y g(x, y)\). Or w.r.t. both at once.
- We could ask what happens if you plug in different values than \(x=0\), or leave this unevaluated. Say, \(e^{-\beta}\).
- This entropy is now an analytic function of \(p\), or of \(x, y\), perhaps its derivatives mean something.
- This ought to allow us to define an entropy relative to any parameter, like \(p\), even if the polynomial in that expression is not well formed. Well, not quite: \({(\partial_p)}^k {(p + (1-p))}^N = {(\partial_p)}^k (1) = 0\) for \(k \geq 1\). Hum. What am I reaching for here?
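As a sanity check on the derivative-based definition above (a minimal sympy sketch; the values \(N = 4\), \(p = \frac{3}{10}\) are arbitrary), extracting the coefficients of \({(px + (1-p))}^N\) by differentiating at \(x = 0\) reproduces the usual binomial Shannon entropy:

```python
import sympy as sp

x = sp.symbols('x')
N, p = 4, sp.Rational(3, 10)  # arbitrary illustration values

g = (p * x + (1 - p))**N

# p_k = (1/k!) (d/dx)^k g(x) evaluated at x = 0, i.e. the coefficient of x^k
pk = [sp.diff(g, x, k).subs(x, 0) / sp.factorial(k) for k in range(N + 1)]

# H[g] = sum_k p_k ln(1/p_k), applied to those coefficients
H_functional = sum(-w * sp.log(w) for w in pk)

# The Shannon entropy of binomial(N, p), written out directly
H_direct = sum(-sp.binomial(N, k) * p**k * (1 - p)**(N - k)
               * sp.log(sp.binomial(N, k) * p**k * (1 - p)**(N - k))
               for k in range(N + 1))

assert sp.simplify(H_functional - H_direct) == 0
```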
If instead we mark terms with \(x_k\), we cannot write the g.f. as a single power, but we can define an entropy on the full series
\[\begin{align*} g(x_k) &= \left[ \sum_k {N \choose k} p^k {(1-p)}^{N-k} x_k\right] \\ H[g(x_k)] &= \sum_{k} \partial_{x_k} g \ln \frac{1}{\partial_{x_k} g} \end{align*}\]
In Summary
That’s enough musing. I’ll summarize the progress I’ve made on untangling my confusions:
- We cannot simply take \(\log\) of a g.f. at \(x=y=1\) to get its “information”; this is a measure of information but it is not the “entropy of the series encoded by the g.f.”.
- We can define a Shannon entropy on a g.f. as a certain functional, where the choice of “which terms we are defining the entropy with-respect-to” is baked into the form of the entropy functional itself, though there might be some “natural” choice for simple g.f.s. This correctly makes the entropy decrease as we set variables equal to each other, like \(x_i = x\). We find that the entropy of a g.f. ranges from a maximum of \(\ln g(1)\), where \(g(1)\) is the sum of all its coefficients, down to \(0\) if all states are identified with each other.
- In general the entropy of a g.f. is only defined w.r.t. some choice of terms, and you could always plug in more fine-grained terms, like \(x = (1+a)\), and then take an entropy w.r.t. \(a\), which would be even larger.
And regarding the composition laws of entropies:
- The derivation property \(H[\{p_i q_i\}] = \sum_i [p_i H(q_i) + q_i H(p_i)]\) holds for pointwise multiplication of distributions, where the \(\cdot\) acts like a dot product, pairing terms with the same index \(i\). This holds for numbers; the pointwise-multiplication form just acts on each numeric term separately. As a bit of an abuse of notation we can also write this with a \(\cdot\): \(H[p \cdot q] = p \cdot H[q] + q \cdot H[p]\). This doesn’t mean much of anything on g.f.s.
- The derivation property \(H[p\times q] = p H[q] + qH[p]\) holds for the Cartesian product of distributions, where the \(p, q\) outside of \(H\) are now the sums \(p = \sum p_i\), etc. This amounts to the previous case for one-element distributions, i.e. numbers. Cartesian products correspond to the product of g.f.s in unrelated variables.
- The derivation property on products reduces to \(H[p \times q] = H[p] + H[q]\) for normalized distributions.
- \(H[p \cup q] = H[p] + H[q]\) holds for the disjoint union of sets/distributions, which corresponds to a sum of g.f.s. in different variables.
- There might be a relationship between the action of \(H\) on mixed products (i.e. g.f. convolutions) and non-disjoint unions.