1. Counting States?

The expression $(h+t)^N$ represents the data in an $N$-bit system of heads and tails; the information content ought to be $N$ bits, and is just $\log 2^N = N$ if the $\log$ is a $\log_2$.

My immediate instinct is to say that $\log\big[(h+t)^N\big|_{h=t=1}\big] = \log 2^N$ ought to be the information of the g.f. above. But this doesn’t make use of the distinction that all the $h$ states and all the $t$ states are indistinguishable among themselves. If we make them distinguishable, our g.f. becomes $(h_1+t_1)(h_2+t_2)\cdots(h_N+t_N)$. This gives the same number of states if we plug in $1$ for every variable. But if we interpret it as a distribution, it is simply the uniform distribution over the $2^N$ choices of “one of $h_i$ or $t_i$ from each factor”, and it will have entropy $\log 2^N = N$ bits.

So “plugging in 1 to count the total states” is not the same operation as “entropy”; they only coincide when all states are distinct. The “information” ought to make use of the fact that not all of the states are distinct. However you define the entropy of a g.f., it shouldn’t be the same for the g.f.s $(h+t)^N$ and $(h_1+t_1)\cdots(h_N+t_N)$; the operation of marking certain states as indistinguishable ought to change it—by narrowing the distribution.
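
A minimal numeric check of that distinction in Python (nothing beyond the standard library; the setup is my own): plugging in $1$ counts the $2^N$ states, while the Shannon entropy of the grouped coefficients $\binom{N}{k}$ is strictly smaller.

```python
from math import comb, log2

N = 10

# "Plug in 1": the g.f. (h+t)^N evaluated at h = t = 1 counts 2^N states.
total_states = sum(comb(N, k) for k in range(N + 1))   # = 2**N
print(log2(total_states))                              # 10.0, i.e. N bits

# Shannon entropy (in bits) of the *grouped* weights C(N, k), i.e. the
# distribution over "number of heads" rather than over individual sequences.
p = [comb(N, k) / total_states for k in range(N + 1)]
print(-sum(pk * log2(pk) for pk in p))                 # ~2.71 bits: strictly smaller
```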

2. Shannon?

The Shannon entropy of the distribution represented by the fully-expanded g.f.

$$(h+t)^N = \sum_k \binom{N}{k}\, h^k\, t^{N-k}$$

treats the coefficients $\binom{N}{k}$ as the weights of a distribution over $k$:

$$H\big(\{\tbinom{N}{k}\}\big) = -\sum_k \binom{N}{k} \log \binom{N}{k}.$$

This is not a normalized distribution; its sum is—what else?—$2^N$. But the entropy of an unnormalized pdf is related to the normalized one by a Leibniz rule: $H(Zp) = Z\,H(p) + H(Z)$, where $Z$ is the normalization constant and $H(Z) = -Z\log Z$, which rearranges into:

$$H(p) = \frac{H(Zp) - H(Z)}{Z} = \frac{1}{Z}\,H(Zp) + \log Z.$$

So we get for the normalized distribution, with $Z = 2^N$ and $p_k = \binom{N}{k}/2^N$:

$$H(p) = N \log 2 - \frac{1}{2^N} \sum_k \binom{N}{k} \log \binom{N}{k} = -\sum_k \frac{\binom{N}{k}}{2^N} \log \frac{\binom{N}{k}}{2^N}.$$

These are just two ways of writing the entropy of the distribution $p_k = \binom{N}{k}/2^N$, as a function of the values of the coefficients $\binom{N}{k}$.
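
A quick sanity check of the Leibniz rule and its rearrangement, using the binomial coefficients as the unnormalized weights (the helper name `H` is mine):

```python
from math import comb, log, isclose

def H(weights):
    """(Unnormalized) Shannon entropy H(q) = -sum_k q_k log q_k."""
    return -sum(q * log(q) for q in weights if q > 0)

N = 10
q = [comb(N, k) for k in range(N + 1)]   # unnormalized weights, summing to 2^N
Z = sum(q)
p = [qk / Z for qk in q]                 # the normalized distribution

# Leibniz rule for scaling by a constant Z: H(Z*p) = Z*H(p) + H(Z),
# where H(Z) = -Z log Z is the "entropy of a single number".
print(isclose(H(q), Z * H(p) + (-Z * log(Z))))                    # True

# Rearranged: H(p) = log Z - (1/Z) * sum_k q_k log q_k.
print(isclose(H(p), log(Z) - sum(qk * log(qk) for qk in q) / Z))  # True
```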

3. Shannon again?

But now another idea comes to mind. The Shannon entropy just discussed is arguably not that of the actual distribution represented by the g.f. $(h+t)^N$, which is a distribution over the full terms $\binom{N}{k} h^k t^{N-k}$, not just over their coefficients. If we view the monomials as generic expressions, then the g.f. is $\sum_k q_k$ with $q_k = \binom{N}{k} h^k t^{N-k}$, representing an unnormalized series; the normalization constant is $Z = (h+t)^N$. The Shannon entropy of this would be

$$H(q) = -\sum_k q_k \log q_k,$$

which is some negative number, and the normalized form would be

$$H(p) = \frac{H(q) - H(Z)}{Z} = \frac{1}{Z}\,H(q) + \log Z.$$

Note this is just the earlier entropy for $h = t = 1$, since the factors $h^k t^{N-k}$ are then equal to $1$. Our first stab at the information, $\log 2^N$, has popped out (as the $\log Z$ term)—it was the entropy of something, but it wasn’t the best interpretation of the entropy of our g.f.
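
A sketch of this version, with arbitrary example values for $h$ and $t$: the raw entropy of the monomial weights is indeed negative, and the same Leibniz rearrangement relates it to the normalized one.

```python
from math import comb, log, isclose

def H(weights):
    """Unnormalized Shannon entropy, H(q) = -sum_k q_k log q_k."""
    return -sum(q * log(q) for q in weights if q > 0)

N, h, t = 10, 2.0, 3.0          # h, t are arbitrary example values

# The weights are now the full monomials of (h+t)^N, not just the coefficients.
q = [comb(N, k) * h**k * t**(N - k) for k in range(N + 1)]
Z = (h + t)**N                  # normalization constant = value of the g.f.
p = [qk / Z for qk in q]

print(H(q) < 0)                               # True: the raw entropy is negative here
print(isclose(H(p), H(q) / Z + log(Z)))       # True: same Leibniz rearrangement

# At h = t = 1 the monomial factors equal 1, so q reduces to the bare
# coefficients C(N, k) and H(p) reduces to the earlier entropy.
```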

4. Composing Entropies

And one more idea. Doesn’t entropy add across independent systems? A product of g.f.s represents a product of systems, so shouldn’t $H(fg) = H(f) + H(g)$, and $H\big[(h+t)^N\big] = N\,H(h+t)$? What does this have to do with the Leibniz rule that makes $H$ a derivation, $H(Zp) = Z\,H(p) + H(Z)$?

Here it turns out I’m just confusing a few things. We have to distinguish pointwise multiplication of distributions from the Cartesian product.

Pointwise multiplication is like a dot product: $(p \cdot q)_i = p_i q_i$ between two sets defined on the same domain. This has no particular expression in terms of a g.f. If $p = \{p_1, p_2, \ldots\}$, and likewise for $q$, then the pointwise multiplication is $\{p_1 q_1, p_2 q_2, \ldots\}$. The Shannon entropy of a pointwise multiplication is a derivation:

$$H(p \cdot q) = \sum_i \big[ q_i H(p_i) + p_i H(q_i) \big], \qquad H(x) = -x \log x.$$

This is most useful when one of $p, q$ is a constant, as in $H(Zp) = Z\,H(p) + H(Z)\sum_i p_i$ earlier (with $\sum_i p_i = 1$ there).
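
A small numeric check of that term-by-term derivation rule, with made-up weight lists on a shared three-element domain:

```python
from math import log, isclose

def H(x):
    """Entropy of a single positive number, H(x) = -x log x."""
    return -x * log(x)

def H_set(ws):
    """Entropy of an (unnormalized) set of weights: the sum of single-number entropies."""
    return sum(H(w) for w in ws)

p = [0.5, 1.2, 0.3]          # two weight-sets on the same 3-element domain
q = [2.0, 0.7, 0.4]          # (neither needs to be normalized)

pointwise = [pi * qi for pi, qi in zip(p, q)]

# Derivation (Leibniz) rule, term by term: H(p_i q_i) = q_i H(p_i) + p_i H(q_i)
lhs = H_set(pointwise)
rhs = sum(qi * H(pi) + pi * H(qi) for pi, qi in zip(p, q))
print(isclose(lhs, rhs))      # True
```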

The Cartesian product of two sets is $p \times q = \{p_i q_j\}_{ij}$, and its entropy is, by the previous derivation rule:

$$H(p \times q) = -\sum_{ij} p_i q_j \log(p_i q_j) = H(p)\sum_j q_j + H(q)\sum_i p_i,$$

and this thing is simply the sum of the entropies iff both sets are normalized to 1. So this is the rule I was thinking of, but it only applies to distributions. Then the product of $N$ independent copies of the same normalized distribution $p$ would have entropy $H(p^{\times N}) = N\,H(p)$. But this corresponds to the g.f. $(h_1+t_1)(h_2+t_2)\cdots(h_N+t_N)$ in distinct variables, rather than the g.f. $(h+t)^N$, which has additional structure. (TODO: check this, I got confused.)
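
And a check of the Cartesian-product rule with made-up weights, one set normalized and one not, showing the collapse to $H(p) + H(q)$ only in the normalized case:

```python
from math import log, isclose
from itertools import product

def H_set(ws):
    # unnormalized Shannon entropy, -sum w log w
    return -sum(w * log(w) for w in ws)

p = [0.2, 0.5, 0.3]                 # normalized (totals 1)
q = [1.0, 2.0, 3.0]                 # unnormalized (totals 6)

cart = [pi * qj for pi, qj in product(p, q)]   # Cartesian product of weights

lhs = H_set(cart)
rhs = H_set(p) * sum(q) + H_set(q) * sum(p)
print(isclose(lhs, rhs))            # True: the derivation rule holds in general

# Only when both sets are normalized does this collapse to H(p) + H(q):
qn = [x / sum(q) for x in q]
cart_n = [pi * qj for pi, qj in product(p, qn)]
print(isclose(H_set(cart_n), H_set(p) + H_set(qn)))   # True
```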

So my confusion comes from the fact that $H(p \times q) = H(p) + H(q)$ only holds for normalized distributions, and therefore $H(ab) = H(a) + H(b)$ only holds for single numbers, or one-element sets, if both numbers are $1$. The derivation rule is the general one. And neither applies to $(h+t)^N$, because that g.f. has supplied additional information by grouping the different $h^k t^{N-k}$ terms—its entropy should be correspondingly lower than the full $N \log 2$ we’d get if every combination of $h$s and $t$s was independent, by the correction term $-\frac{1}{2^N}\sum_k \binom{N}{k}\log\binom{N}{k}$ from above (which is a negative number since the distribution is not normalized).

We can add to this list of composition laws the entropy of the sum of two disjoint sets, $H(p \sqcup q) = H(p) + H(q)$, which… also adds, since entropy is a sum across the set in its argument anyway. This property of $H$ is curious: it treats (normalized) Cartesian products like (unnormalized) disjoint unions. I wonder two things which I won’t pursue right now:

  • Can the form of $H$ be derived from this fact alone?
  • Is there a clear relationship between the way $H$ acts on mixture-products and how it acts on non-disjoint unions?

If the input sets are unnormalized, with totals $Z_p$ and $Z_q$, the entropy of the sum is related to the entropy of the normalized sum by a derivation rule, but it doesn’t look like anything:

$$H(p \sqcup q) = (Z_p + Z_q)\, H\!\left(\frac{p \sqcup q}{Z_p + Z_q}\right) + H(Z_p + Z_q).$$

That last term is the entropy of a single number, $H(Z_p + Z_q) = -(Z_p + Z_q)\log(Z_p + Z_q)$.
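
One more made-up check: entropy adds across a disjoint union (concatenation of the weight lists) whether or not the sets are normalized, and the normalized version differs by exactly that single-number term.

```python
from math import log, isclose

def H_set(ws):
    return -sum(w * log(w) for w in ws)

p = [0.4, 1.1, 0.5]       # unnormalized, total Z_p = 2.0
q = [0.3, 0.7]            # unnormalized, total Z_q = 1.0

union = p + q             # disjoint union: just concatenate the weight lists
Z = sum(p) + sum(q)       # Z_p + Z_q

# Entropy simply adds across a disjoint union, normalized or not:
print(isclose(H_set(union), H_set(p) + H_set(q)))                     # True

# Relation to the normalized union via the same derivation rule, with the
# "entropy of a single number" H(Z) = -Z log Z as the extra term:
normalized = [w / Z for w in union]
print(isclose(H_set(union), Z * H_set(normalized) + (-Z * log(Z))))   # True
```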

5. Counting Again.

In a couple of places we have also encountered the Shannon entropy of a single number, $H(x) = -x\log x$, like the $H(Z_p + Z_q)$ we just saw. You could apply this to a g.f. like $(h+t)^N$, giving $-(h+t)^N \log (h+t)^N$, but I’m not sure what this would mean.




The Entropy of a Generating Function

Well. Those are the different ideas that come to mind when I wonder about the “information of a g.f.”.

Apparently, however we define the entropy of a g.f., it cannot depend only on the value of the g.f., since the grouped g.f. $(h+t)^N$ equals the distinguishable one $(h_1+t_1)\cdots(h_N+t_N)$ at $h_i = h$, $t_i = t$, regardless of the values of $h$ and $t$. Nor is it a function of “the variable in which the g.f. is a polynomial”, like $h$, since terms of the form $h+t$ are themselves polynomials in $h$.

It must be defined relative to the thing we’ve decided we’re counting in the g.f.—it needs to be affected by the way we choose to group terms. For example, in the previous g.f. the terms $h^k t^{N-k}$ and $h^{N-k} t^k$ will have the same coefficient, but they represent distinct “states” if we’re counting “number of heads”, and ought to enter the entropy as two terms rather than one. Combining two states into one always makes the entropy smaller—a uniform distr. over $n$ elements has entropy $\log n$, and if you keep combining these until there’s one big state you’ll get $0$ in the end.
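
A tiny illustration of that last claim, with a made-up pairwise merging of states:

```python
from math import log, isclose

def H(p):
    return -sum(x * log(x) for x in p if x > 0)

n = 8
p = [1 / n] * n
print(isclose(H(p), log(n)))      # True: uniform over n states has entropy log n

# Merge states pairwise, over and over; each merge adds probabilities together
# and lowers the entropy, until one big state remains with entropy 0.
while len(p) > 1:
    p = [p[i] + p[i + 1] for i in range(0, len(p), 2)]
    print(len(p), H(p))           # 4 -> log 4, then 2 -> log 2, then 1 -> 0.0
```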

It may be that, for a choice of how to group the terms, you can mark this grouping with a specific set of variables, or with powers of a single variable; our earlier attempts to define an entropy were being careless about the choice of which variables mark the distribution vs. which parameterize it. We can define an entropy functional relative to a marker variable. This rhymes nicely with the idea that entropy is always relative to something.

For the binomial, the standard way to mark terms would be by powers of $h$ in $(h+t)^N$. This takes the monomials $h^k$ as the “basis” for the entropy, and everything else is part of the probability. (Then $h^k t^{N-k}$ would be effectively the same basis, since a choice of $k$ fixes $N-k$, but both would be different from, say, a coarser marking which groups certain terms together.) Then we can define entropy relative to this basis as a functional on g.f.s:

$$H_h[f] = \log f(1) - \frac{1}{f(1)} \sum_k f_k \log f_k, \qquad f(h) = \sum_k f_k\, h^k,$$

with any remaining parameters in the $f_k$ evaluated at $1$ (for now).

This is not too revealing, but I see a few things:

  • Some kind of chain rule?
  • Something that looks like a translation operator, or a Taylor series.
  • We could get rid of the $\log$s with a derivative, via $\log f_k = \partial_s f_k^s\big|_{s=0}$, or with some other kind of g.f.
  • You could e.g. take an entropy w.r.t. one variable of a two-variable g.f., like $H_h\big[(h+t)^N\big]$ with $t$ left as a parameter. Or w.r.t. both at once.
  • We could ask what happens if you plug in values other than $1$, or leave this unevaluated.
  • This entropy is now an analytic function of whatever parameters are left unevaluated; perhaps its derivatives mean something.
  • This ought to allow us to define an entropy relative to any parameter, even if the polynomial in that expression is not well formed. Well, not quite. Hum. What am I reaching for here?

If instead we mark terms with both $h$ and $t$, we cannot write the g.f. as a polynomial in a single marker variable, but we can define an entropy on the full series of monomials in the same way:

$$H_{h,t}[f] = \log f(1,1) - \frac{1}{f(1,1)} \sum_{j,k} f_{jk} \log f_{jk}, \qquad f(h,t) = \sum_{j,k} f_{jk}\, h^j t^k.$$
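
To make this concrete, here is a Python/sympy sketch of such a functional; the name `gf_entropy` and the exact normalization (the normalized form from earlier) are my own choices, not anything standard. It collects the coefficient of each distinct monomial in the chosen marker variables, sets any leftover symbols to $1$, and computes $\log Z - \frac{1}{Z}\sum_k q_k \log q_k$. The fully distinguishable g.f. attains the full $N\log 2$, the grouped binomial attains something smaller, and marking by $h$ alone agrees with marking by $h$ and $t$ together:

```python
import sympy as sp
from math import log

def gf_entropy(expr, markers, values=None):
    """Entropy of a g.f. relative to marker variables (my own construction):
    take the coefficient q_k of each distinct monomial in `markers`, evaluate
    any remaining symbols at 1 (or per `values`), and return
    log Z - (1/Z) * sum_k q_k log q_k with Z = sum_k q_k."""
    poly = sp.Poly(sp.expand(expr), *markers)
    q = []
    for c in poly.coeffs():
        c = c.subs(values or {})
        c = c.subs({s: 1 for s in c.free_symbols})   # leftover parameters -> 1
        q.append(float(c))
    Z = sum(q)
    return log(Z) - sum(qk * log(qk) for qk in q if qk > 0) / Z

h, t = sp.symbols('h t', positive=True)
N = 4
hs = sp.symbols(f'h0:{N}', positive=True)    # distinguishable heads h0..h3
ts = sp.symbols(f't0:{N}', positive=True)    # distinguishable tails t0..t3

grouped = (h + t)**N                                     # indistinguishable flips
distinct = sp.Mul(*[hs[i] + ts[i] for i in range(N)])    # distinguishable flips

print(gf_entropy(distinct, hs + ts))   # ~2.77 = N log 2: every state is its own term
print(gf_entropy(grouped, (h, t)))     # ~1.41: smaller, states grouped by number of heads
print(gf_entropy(grouped, (h,)))       # same as above: the power of h already fixes the power of t
```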



In Summary

That’s enough musing. I’ll summarize the progress I’ve made on untangling my confusions:

  • We cannot simply take the $\log$ of a g.f. evaluated at $1$ to get its “information”; this is a measure of information, but it is not the “entropy of the series encoded by the g.f.”.
  • We can define a Shannon entropy on a g.f. as a certain functional, where the choice of “which terms we are defining the entropy with respect to” is baked into the form of the entropy functional itself, though there might be some “natural” choice for simple g.f.s. This correctly makes the entropy decrease as we set variables equal to each other, like $h_i \to h$. We find that the entropy of some g.f. ranges from a maximum of $\log f(1, \ldots, 1)$, the $\log$ of the sum of all its coefficients, down to $0$ if all states are identified with each other.
  • In general the entropy of a g.f. is only defined w.r.t. some choice of terms, and you could always plug in more fine-grained terms, like $h \to h_1 + h_2$, and then take an entropy w.r.t. those finer terms, which would be even larger.

And regarding the composition laws of entropies:

  • The $H(ab) = a\,H(b) + b\,H(a)$ derivation property holds for pointwise multiplication of distributions, where the multiplication acts like a dot product between terms of different distributions on the same domain. This holds for numbers; the pointwise-multiplication form just acts on each numeric term separately. As a bit of an abuse of notation we can also write this with a sum: $H(p \cdot q) = \sum_i \big[ q_i H(p_i) + p_i H(q_i) \big]$. This doesn’t mean much of anything on g.f.s.
  • The same derivation property holds for the Cartesian products of distributions, where the factors outside of $H(p)$ and $H(q)$ are now the sums $\sum_j q_j$ and $\sum_i p_i$, etc. This amounts to the previous case for one-element distributions, i.e. numbers. Cartesian products correspond to the product of g.f.s. in unrelated variables.
  • The derivation property on products reduces to $H(p \times q) = H(p) + H(q)$ for normalized distributions.
  • $H(p \sqcup q) = H(p) + H(q)$ holds for the disjoint union of sets/distributions, which corresponds to a sum of g.f.s. in different variables.
  • There might be a relationship between the action of $H$ on mixed products (i.e. g.f. convolutions) and non-disjoint unions.