
Part 1: The Partition Function

Multiplicity

Statistical mechanics begins with the postulate that the macroscopic behavior of an isolated system depends only on its entropy, defined as the logarithm of the “multiplicity” of states:

Measurable thermodynamic quantities like pressures and heat capacities are then found by taking derivatives of the entropy in various combinations.

Thus the system is completely characterized by the functional form of . For example, if our system consists of non-interacting quantum-mechanical free particles in a box of volume , then can be found by fixing and , then counting the quantized states with energies in .

This is variously called the “multiplicity”, “count”, or just “number” of states. In some contexts it is the “density of states” or “degeneracy” of states, where it is usually written or .
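To make the counting concrete, here is a minimal sketch in Python of my own toy example (a collection of two-level sites, not the particle-in-a-box count described above): if a state’s energy is the number of excited sites, the multiplicity at each energy is a binomial coefficient and the entropy is its logarithm (taking k_B = 1).

```python
from math import comb, log

N = 100  # number of two-level sites in the toy system

def multiplicity(n):
    """Count of microstates with exactly n excited sites: C(N, n)."""
    return comb(N, n)

def entropy(n):
    """Entropy as the logarithm of the multiplicity (units of k_B)."""
    return log(multiplicity(n))

# The count is smallest at the edges of the spectrum and largest in the middle:
for n in (0, 10, 50, 90, 100):
    print(n, multiplicity(n), round(entropy(n), 2))
```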

For two independent systems considered together, the joint count of states will be the product of the individual counts, and the entropies will add. But in this context it’s not really accurate to call either factor on its own the “count of states”. It’s just one factor in a product, like the side of a rectangle or the cardinality of a single set in a Cartesian product. Even when a system is said to be “isolated” or “independent”, this just means that the joint state of the whole world can be factored. Each state of an “isolated” system is still a whole slice, of size equal to the count of everything else.

So I’ll go on calling the “multiplicity” or “count of states”, but with the understanding that this is the size of one factored dimension in a product.

We can visualize the multiplicity of three independent systems together as a rectangular prism:

Product of three multiplicities

Maybe you constructed this system by taking products of the smaller systems. Once you have you might forget that it factors and look only at the joint dependence on parameters like . Or, you might begin with a single expression and then find that it factors into .

For completeness, we should also mention that disjoint—mutually exclusive—sets of states add their multiplicities. This is just how we count the alternative states of a single system. The rule is: multiplication represents “and” and addition represents “or”.

We can restate all of this in the language of probabilities. If we postulate that a system of specified energy has an equal probability of being found in each of its states, then the distribution over its states will be uniform, each state having probability one over the multiplicity.

This “uniform distribution of states” is called the “microcanonical ensemble”, but I’ll avoid the term in this essay. In fact I intend to avoid discussing probabilities at all, preferring to let the multiplicity stand on its own as a description of a system, to the extent possible.

For any joint distribution, picking a single value for determines the state up to a conditional distribution . If are independent, then this conditional does not depend on the choice of ; it is simply the marginal distribution , for all . So independent systems have distributions which can be factored as a product of marginal distributions:


The heat reservoir via generating functions

Let’s now look at the standard “heat reservoir” derivation of the partition function. We imagine a system which can exchange energy freely with a much larger reservoir . We label the states of each by their energies , and each has a multiplicity function which depends only on its energy: . I will be departing a bit from the standard textbook argument by making liberal use of the language of “generating functions” (“g.f.s” from now on). So we will represent the states of and separately as two generating functions:

where and are placeholder variables which track the “energy cost” of each state, which is a standard g.f. method. These should be thought of as “transformations” of the underlying series .

We’ll want to have a few facts about g.f.s on hand:

  • The g.f.s above are written as sums over energies , but they would come to the same thing if we summed over individual states : , since
  • The sum can be taken over the actual spectrum of system , since the coefficient of any term not in the spectrum would just be zero.
  • We can use a g.f. to count the total states over all by evaluating at : . This just reduces the g.f. to a sum of across all states.
  • We can convert a g.f. to a “probability generating function” (“p.g.f”) by normalizing it: .
  • We can also plug in other values for to assign a “cost” to each unit of . This is best illustrated with the binomial series, as in the following examples (and in the code sketch just after this list):
    • Plugging in counts the total number of states, which is
    • Plugging in gives a distribution, representing the probability of getting heads in coin flips: .
    • Plugging in any other will give a different distribution. This works because . Any assignment of values that makes the g.f. equal to 1 can be interpreted as a p.g.f.
    • Plugging in marks the two variables as indistinguishable, reducing the series to .
    • Other expressions like can usually be given some interpretation in terms of weights, costs, indistinguishability, etc.
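Here is the promised code sketch of these evaluations, using the two-variable binomial g.f. (h + t)^N for N coin flips; the specific N is my own choice for illustration.

```python
from math import comb

N = 10  # number of coin flips (illustrative choice)

def gf(h, t):
    """Two-variable binomial g.f.: sum_k C(N,k) h^k t^(N-k) = (h + t)^N.
    The power of h tracks the number of heads, t the number of tails."""
    return sum(comb(N, k) * h**k * t**(N - k) for k in range(N + 1))

print(gf(1, 1), 2**N)        # h = t = 1 counts every state: 2^N

# h = t = 1/2 makes the series sum to 1, so each term C(N,k)/2^N is the
# probability of k heads in N fair flips: the g.f. has become a p.g.f.
print(gf(0.5, 0.5))

print(gf(0.3, 0.7))          # any other h + t = 1 is a p.g.f. for a biased coin

# Setting h = t = x marks heads and tails as indistinguishable,
# collapsing the series to (2x)^N, a g.f. in the single variable x.
x = 0.25
print(gf(x, x), (2 * x)**N)
```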

Now, if our system and reservoir do not interact, we can track their energies separately in a two-variable joint g.f., which is just the product

Clearly the series would factor; each term for a given power of contains a full copy of the series in and vice versa. Likewise any fully-factorizable g.f. polynomial must represent a product of independent states—this is just the same statement about marginal probabilities from before.

If our system and reservoir can interact, by exchanging energy, then we need to represent this by using the same cost-tracking variable in each series. Now the terms mix, and we can rewrite the product as a sum over total energies:

… and we observe that the product of two g.f.s is the g.f. of the convolution of their series:

Note that each term is picking up contributions from a diagonal line in the space of energies :

Line of constant E total

So this convolution is just a way of “reindexing” the joint series by the combined energies . The underlying expression hasn’t changed yet.
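A quick numerical check of that statement, with made-up multiplicity series for the two subsystems: multiplying the two g.f.s as polynomials in the same variable gives exactly the convolution of their coefficient lists, i.e. the joint count reindexed by total energy.

```python
import numpy as np

# Toy multiplicity series, indexed by integer energy (values chosen arbitrarily).
omega_S = np.array([1, 3, 3, 1])       # system counts at E_S = 0, 1, 2, 3
omega_R = np.array([1, 4, 6, 4, 1])    # reservoir counts at E_R = 0, ..., 4

# Coefficients of the product of the two generating functions...
product = np.polynomial.polynomial.polymul(omega_S, omega_R)

# ...equal the convolution of the two coefficient series: the number of joint
# states at each total energy E_S + E_R.
convolution = np.convolve(omega_S, omega_R)

print(np.array_equal(product, convolution))  # True
print(convolution)                           # joint counts indexed by total energy
```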


Linearizing the reservoir

Now that we have our joint g.f., we’ll add the assumption that the reservoir is so much larger than system that, while its multiplicity will vary as energy is exchanged (otherwise they’d be independent), its state will not change appreciably. Thus the rate of the variation will be approximately constant, and we can replace the function with a linear approximation. But we’ll actually linearize rather than , which I’ll only try to justify with a hand-wave-y argument for now: entropies compose additively, , so they represent “extensive” physical quantities and are the “sorts of thing we take derivatives of”, while multiplicities compose multiplicatively. More on this later.

The typical version of this argument linearizes around , interpreting the reservoir as containing “nearly all of the energy”. This interpretation isn’t necessary, though—all we need is that the reservoir’s slope is approximately constant. So I’ll linearize around an arbitrary energy instead:

where .

We can now rewrite the term in the joint g.f. with the linearization:

We see that, no matter which we linearize at, the effect is to isolate the dependence of on as .

Now we insert the linearization back into the g.f.:

… and there appears the partition function .

The effect of our linearization has been to characterize the influence of the external reservoir on (and really, the influence of the entire external world) in terms of a single parameter . Read this as: the reservoir converts energy into entropy at a fixed exchange rate . Then , which in isolation had a certain dependence of entropy-on-energy , will instead experience an effective dependence , since any energy traded away comes at a constant entropy cost.

(This is a lot like linearizing the gravitational potential at the surface of the earth as . We’re not really doing anything more complicated than elementary physics here!)
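To see the linearization numerically, here is a sketch with an assumed reservoir (a large collection of two-level sites, so its log-multiplicity is a log-binomial; none of these numbers come from the text): over the few units of energy the system might borrow, the reservoir’s entropy is very nearly linear, so its multiplicity is very nearly just multiplied by e^{−βE}.

```python
from math import lgamma

def S_R(E, N=10**6):
    """Reservoir entropy: ln C(N, E) for N two-level sites, via log-gamma."""
    return lgamma(N + 1) - lgamma(E + 1) - lgamma(N - E + 1)

E0 = 300_000     # reservoir energy at the linearization point (arbitrary choice)
dE = 1.0
beta = (S_R(E0) - S_R(E0 - dE)) / dE   # local slope dS_R/dE, i.e. beta

# When the system takes energy E from the reservoir, the exact entropy
# S_R(E0 - E) is almost exactly the linearized S_R(E0) - beta * E:
for E in (0, 5, 10, 20):
    exact = S_R(E0 - E)
    linear = S_R(E0) - beta * E
    print(E, round(exact - linear, 6))   # discrepancies of order E^2 / N
```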


The partition function as the multiplicity of the combined system

Let’s now try to interpret the partition function in our joint g.f. We had:

We can read the terms

as the product of multiplicities of three systems. The first is just , but the second and third are a fictitious division of into a part which captures its full -dependence, and a part which doesn’t depend on at all: . This fictitious could depend on all kinds of other variables, and in fact is summing up the contribution of the rest of the whole world, but our approximation has guaranteed it won’t depend on .

We can visualize the joint distribution of the -dependent systems, :

boltzmann visualized by marginals

The rectangles’ widths grow according to while their heights shrink according to . The partition function is the combined area of all of the rectangles.1

Their joint multiplicity represents the areas, which grow and then shrink again:

Boltzmann areas

The “third” system can be visualized as an independent dimension, called below:

boltzmann visualized with constant 3rd dimension
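A numeric rendering of the rectangle picture, with the same sort of toy two-level system and an assumed β: the widths (the multiplicities) grow with E, the heights e^{−βE} shrink, and their products—the areas—rise and then fall, summing to Z.

```python
from math import comb, exp

N, beta = 50, 0.5   # toy system size and inverse temperature (assumed values)

energies = list(range(N + 1))
widths = [comb(N, E) for E in energies]          # multiplicities: grow toward mid-spectrum
heights = [exp(-beta * E) for E in energies]     # Boltzmann factors: shrink with E
areas = [w * h for w, h in zip(widths, heights)] # the rectangles in the figure

Z = sum(areas)                                   # the partition function: total area
E_peak = max(energies, key=lambda E: areas[E])
print("Z =", round(Z), " peak rectangle at E =", E_peak)
```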

We can also read a physical interpretation of off the joint generating function

For each , this has the form of a product of two systems . We conclude:

represents the total contribution of and the energy exchange with to the multiplicity of the combined system .

In other words:

is the multiplicity of all systems which depend on .

In still other words:

is the effective multiplicity of system , accounting for its interaction with the environment.


The partition function as a Laplace transform

If we write the definition of as an integral and drop all the subscripts, the partition function reveals itself to be the Laplace transform of the multiplicity:

$$Z(\beta) = \int_0^\infty \Omega(E)\, e^{-\beta E}\, dE = \mathcal{L}[\Omega](\beta)$$

That g.f.s are very nearly Laplace transforms isn’t too surprising, since multiplication of g.f.s corresponded to the convolution of underlying functions, which is one of the defining properties of the Laplace transform.

From this perspective the partition function arose from the system’s own Laplace transform. Apparently our linear approximation took the reservoir’s functional form out of the picture entirely:

Thus the linear approximation for the reservoir makes the joint state depend only on system . Maybe if you have a great intuition for Laplace transforms you could work out the form without the external reservoir argument at all; apparently we have parameterized system by its derivative , though it’s hard for me to imagine where that form would come from without the reservoir argument… maybe the reservoir argument will give some insight into the Laplace transform.


as an information cost

One more interpretation: we arrived at by plugging in for the cost-tracking variable in a g.f. Other ways of evaluating g.f.s at specific values, like or , were interpretable. So what does it mean to plug in ?

We can first observe that any g.f. like can be reindexed by evaluation at : . The transformed g.f. represents a reindexed series , where the set of weights is unchanged but their assignment to indices, which represent counts of a resource, has been altered. In evaluating at , we are first assigning each unit of the original cost to units of a new cost. This can also be seen as altering the energy scale . (We could express the partition function as .)

Then we have also set the scaled cost to . This suggests an interpretation as “information”, based on the fact that the Shannon Entropy (about which more later) is usually interpreted as an expectation of information: . This suggests that any probability can be seen as the exponent of an “information cost”:

We shouldn’t read too much into this: this does not necessarily imply that is a particularly privileged unit of information, beyond the fact that it interacts nicely with derivatives. After all, the logarithm base in the entropy can easily be changed to 2 or anything else.

But it feels safe to read the assignment as assigning an information-cost to each unit of energy.2



Part 2: The Free Energy

Postulates for the Physical State

We have so far avoided describing the system in terms of probabilities, except for a few interpretative remarks. The only physical content is the form of the function (dropping the label now), and the experimental fact that derivatives of behave like measurable thermodynamic variables.

We have also not discussed equilibrium. The partition function is so far only the multiplicity of a larger system than alone. Nothing about the “heat reservoir” argument required equilibrium, only the exchange of energy, because we have not made any claim as to which state is to be found in. We are still completely free to apply different postulates as to which state the system (along with the interacting part of the reservoir) is actually expected to be found in experimentally.

Three choices of postulate are common:

  1. (Mode Postulate) The system is simply found in the state of maximal weight in the series, i.e. the mode, which we’ll call .
  2. (Mean Postulate) Or, the system is found to be in the mean-energy state of this series, which we’ll call .
  3. (Distribution Postulate) Or, the system is found in all possible states with uniform probability over the system and reservoir combined. This is:

In terms of the system alone, applying the linearization as before, we get a physical state which is a drawn from a “canonical ensemble”, whose states are distributed according to the “Boltzmann distribution” with as the normalization constant:

Each postulate above answers the same question: how do we make physical predictions of thermodynamic properties from only a description of the multiplicities of states of a system ?

None of these postulates are inherently correct, except inasmuch as they reproduce experimental results.

We can make some progress on picking among the postulates by observing how they relate to each other. The distribution in postulate 3 will have as its mode and mean the exact values and chosen by the first two postulates (since it’s proportional to the same underlying series for ). And it will turn out that, for most physical systems, the mode and mean will be effectively equal to each other, and in fact the series will almost always be very nearly a delta function around its modal value .

We can illustrate this with a simple system of binary bits, each with energy or . The “exact” Boltzmann distribution as a function of the number of -bits is:

For some temperature , the state with of the bits will be the largest term in this series. How dominant is it? If , and , the peak at has 2.5% of the total probability, and the range , only 2% of the range of , supports about 50% of the total probability. That’s pretty sharp, but this is nothing compared to typical thermal systems, which usually have . One can show with a Stirling approximation that the width of this peak grows as ; thermal systems are very sharply peaked indeed.
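The numbers in the paragraph above depend on the particular system size and temperature chosen, which I can’t reproduce exactly here, but the scaling is easy to check with a sketch: for N independent two-level bits the Boltzmann distribution over n is exactly a binomial, and the window around the peak that holds half the probability shrinks, as a fraction of the full range, roughly like 1/√N. (The bit energy and the probability mass threshold below are assumed values.)

```python
import numpy as np
from scipy.stats import binom

def half_mass_window_fraction(N, beta_eps=1.0, mass=0.5):
    """Fraction of the full range 0..N needed, around the peak, to hold `mass`
    of the Boltzmann (binomial) distribution over the number of excited bits."""
    p = 1.0 / (1.0 + np.exp(beta_eps))          # single-bit excitation probability
    pmf = binom.pmf(np.arange(N + 1), N, p)
    largest_first = np.sort(pmf)[::-1]          # take terms from the peak outward
    count = np.searchsorted(np.cumsum(largest_first), mass) + 1
    return count / (N + 1)

for N in (10**2, 10**4, 10**6):
    print(N, half_mass_window_fraction(N))      # shrinks roughly like 1 / sqrt(N)
```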

The sharpness of these peaks justifies the “thermodynamic limit”: for large we assume that the mode (, postulate 1) and mean (, postulate 2) are equal, , and often approximate the entire distribution (postulate 3) as only its peak value, or as a sharp Gaussian around the peak value, the variances of which turn out to correspond to second-order thermodynamic properties like “specific heat” and “compressibility”.

(This last observation is essentially the Central Limit Theorem: the sum of many identical distributions tends to converge to a Gaussian. We could call this Gaussian approximation “postulate 4”, but we won’t need it.)

But I’m not going to take this limit yet. I will instead examine the three postulates separately, and apply the thermodynamic limits separately to each one, so we can clearly see where different definitions of the same concepts arise, and which simplifying limits lead to their familiar forms.

1. Free Energy in the Modal State

Where is the extremal energy of the series? We maximize that term, which is equivalent to minimizing the exponent:

where is the “free energy” in a state with energy .

Then the extremal value of gives the thermodynamic Free Energy:

I call this “thermodynamic” because a single formula like this can serve as an entry point into classical thermodynamics, which makes predictions about the properties of materials by taking different derivatives of the Free Energy (which generally would be a function of and as well, so ), in the same way that derivatives of corresponded to physical properties of an isolated system.

Note how the extremization has consumed the parameter, with its “dual”, the temperature, falling out as an argument which parameterizes how sharply the extremization w.r.t. should affect the original function .

This relationship expresses as a Legendre transform of the entropy , which can be best understood as inverting the first derivative, or saying: the extremum of occurs at the energy where , or, the modal energy . In other words, occurs where the rate of increase of cancels the rate of decrease of . We can locate this on the visualization from before: the extremal occurs when the rectangle areas are stationary, that is, at the energy for which the widths are growing at the same rate that the heights are shrinking:

Boltzmann area growth rate

Strictly speaking, the above equation is not how the Legendre transform is usually written. The standard form would be , Legendre-transforming w.r.t. a slope which we name . This definition can be written more naturally in terms of with a “dimensionless free energy” and “dimensionless entropy” , then we have:

You can also see this as a Legendre transform of through the parameter , if you had started with the inverted as . These different formulations are all equivalent, only amounting to different rearrangements of the underlying differential .

Note that we have still asserted nothing about equilibrium or probabilities! Our expression simply represents the value of the series as its maximum:

This I will call the “modal Free Energy”, so we can distinguish it from the definitions of Free Energy arising under the other postulates.
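A quick numeric check of the modal construction (same toy two-level system, k_B = 1, an assumed temperature): minimizing E − T·S(E) over the spectrum lands on the same E as maximizing the series term, the minimum value is exactly −T times the log of that largest term, and for large N it is already close, per site, to the full −T ln Z.

```python
from math import comb, exp, log

N, T = 1000, 1.5            # toy system size and temperature (k_B = 1, assumed)
beta = 1.0 / T

def S(E):                    # entropy of the toy system: ln C(N, E)
    return log(comb(N, E))

energies = range(N + 1)
terms = [comb(N, E) * exp(-beta * E) for E in energies]   # multiplicity times Boltzmann factor

E_star = min(energies, key=lambda E: E - T * S(E))        # minimize F(E) = E - T S(E)
F_modal = E_star - T * S(E_star)

print(E_star, max(energies, key=lambda E: terms[E]))      # same modal energy
print(F_modal, -T * log(terms[E_star]))                   # same value, up to rounding

# In the thermodynamic limit this approaches -T ln Z; per site they already agree well:
Z = sum(terms)
print(F_modal / N, -T * log(Z) / N)
```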

Let’s now try to interpret it.

First, note that is the extremal value of the , the “free energy of a state”, which can be traced back to . This is simply the number of states in the combined system when system is at a given energy .

If we think of “turning on” a reservoir suddenly, the free energy of a state tells us “how does the entropy of a state of energy change when we turn on the reservoir, as a function of the temperature?” Answer: .

The thermodynamic free energy instead tells us: “how does entropy at the extremal energy change, as a function of the temperature?”

can also tell us: “how does the extremal value of the energy change?” The extremal value of a Legendre transform is encoded in a derivative:

In effect we’ve transformed from an isolated-system version of postulate 1 to an interacting version, i.e. from:

  1. (isolated) the system is found in the state of maximum entropy for a given energy

to:

  1. (interacting) the system is found in the state of minimum free energy for a given temperature , which is the state of maximum entropy for the larger system .

Note that we’ve now strung together two approximations: first we linearized the reservoir, then, in adopting postulate 1, we have assumed that a system described by a series can be found in its state of modal . If we additionally assume the whole series is exactly a delta function at its mode—as is justified in the thermodynamic limit of very-large-—then the value at the peak would be , and the approximate series is just:

But the l.h.s of the second line is just the partition function , so in the thermodynamic limit we get:

We can read this one of two ways:

  • either we are “approximating up” to encompass all of the other states
  • or, we are “approximating the partition function down” to forget about all states besides the extremum.

In the very-large- limit these are equivalent, but I prefer the first interpretation so we don’t have to forget that the rest of the energy states exist.

Finally, the above expression for the large- modal looks a lot like . This says once again that the free energy in the form is nothing but the entropy of the larger system , whose multiplicity is just :


2. Free Energy in the Mean State

Now onto postulate 2: approximating the system by its mean energy. We don’t need a probabilistic interpretation to do this; a series has a mean too:

(That the mean is equal to a derivative of the log of the partition function is a consequence of the partition function being a moment generating function. More on this later.)

This is the mean of the series after the reservoir is “turned on”. But what does it mean to turn on the reservoir? There are a few interpretations:

  1. We start with and increase to its final value, representing:
    a. that initially , energy is free and all states are equally accessible; each energy corresponds to a number of states —so the “isolated system” we considered originally is an “infinite temperature” system. Then we view the reservoir as turning the temperature down from , or putting a finite “cost” on units of energy.
    b. or that initially , which would seem to represent a sort of “classical limit” akin to .
  2. Or, our system is initially isolated at energy . Then we introduce a reservoir at exactly the temperature , and the exchange of energy allows other states than to become accessible.

The two cases starting with don’t suggest any particular characterization of the mean state before the reservoir: all states are accessible, all energies appear with multiplicity . Typical systems have unbounded energy levels and unbounded , therefore the “mode” or “mean” energies would just be infinite as well.

The latter case is more interesting: before the reservoir, the system was a specific state with entropy . Afterwards, states with energies around become accessible; the effective entropy of the whole system is now , as we’ve seen. What is the difference ? How much entropy was gained?

(Note we’re assuming is continuous so the mean actually appears in the series.)

The last line brings us one approximation away from the thermodynamic equation , which is the same Legendre transform we saw before but now in terms of . (Note that is a shorthand for or .)

We can now define a “mean” Free Energy , as opposed to our earlier “modal” Free Energy. We have a choice of what definition to take as fundamental:

  • either we choose , the exact entropy of the expanded system , but only holds approximately,
  • or, we choose , and only holds approximately.

In either case we get in a large- approximation. We also have, again, two choices of how to interpret the approximation:

  • either all states have the mean energy,
  • or, the contributions of all states other than the mean energy vanish: for

I find it helpful to see that this is most naturally understood as a characterization of “introducing a reservoir to a system already at energy .” And we have still managed to avoid the probabilistic postulate 3! Our only simplifications have been approximations.
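Before moving on, a quick check of the derivative relationship used at the start of this section (toy system, assumed β, finite differences): the mean of the series equals −∂ ln Z/∂β.

```python
from math import comb, exp, log

N = 200   # toy two-level system

def weights(beta):
    return [comb(N, E) * exp(-beta * E) for E in range(N + 1)]

def ln_Z(beta):
    return log(sum(weights(beta)))

def mean_energy(beta):
    w = weights(beta)
    return sum(E * x for E, x in enumerate(w)) / sum(w)

beta, h = 0.7, 1e-5
finite_diff = -(ln_Z(beta + h) - ln_Z(beta - h)) / (2 * h)
print(mean_energy(beta), finite_diff)   # agree to roughly O(h^2)
```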


3. Free Energy of a Distribution of States

Finally we consider the third postulate: that the physical state of the system can take on any of the accessible values with a Boltzmann distribution .

We can take for granted that the conclusions of the previous two sections apply to this distribution:

  • the mode of this distribution is and we can define
  • the mean of this distribution is and we can define
  • if we approximate the entire distribution as a delta function, then the above two equations hold exactly, and so does , and the whole system can in some respects be treated as possessing only a single state of multiplicity , entropy , and energy .

To consider our distribution as a distribution, we’ll bring in the Shannon entropy:

The Shannon Entropy of the Boltzmann distribution is:

Clearly this is just another expression for . We’ve already identified as . Now we can identify the Shannon Entropy with . But this is curious, because this same quantity was previously found to be the thermodynamic entropy of the single state of energy before the introduction of the reservoir!
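A numeric confirmation (toy system, k_B = 1, assumed β): the Shannon entropy of the Boltzmann distribution over individual microstates equals β(U − F), i.e. βU + ln Z; and the coarser distribution over energies falls short of it by exactly a term that accounts for how many states share each energy, the correction discussed below.

```python
from math import comb, exp, log

N, beta = 200, 0.7                                   # toy system, assumed values
weights = [comb(N, E) * exp(-beta * E) for E in range(N + 1)]
Z = sum(weights)

U = sum(E * w for E, w in enumerate(weights)) / Z     # mean energy
F = -log(Z) / beta                                    # free energy (k_B = 1)

# Shannon entropy over microstates: each of the C(N, E) states at energy E
# has probability p = e^{-beta E} / Z, so ln p = -beta*E - ln Z.
H_states = -sum(comb(N, E) * (exp(-beta * E) / Z) * (-beta * E - log(Z))
                for E in range(N + 1))
print(H_states, beta * (U - F))                       # equal

# Shannon entropy over energies: p(E) = C(N, E) e^{-beta E} / Z.
p_E = [w / Z for w in weights]
H_energies = -sum(p * log(p) for p in p_E)
correction = sum(p * log(comb(N, E)) for E, p in enumerate(p_E))
print(H_energies + correction)                        # equals H_states again
```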

We also note the fact that the Shannon entropy of any uniform distribution over elements is . Then:

  • admits interpretation as the Shannon entropy of either:
    • The Boltzmann distribution
    • Or of a uniform distribution over the states at the mean energy.
  • While is the Shannon entropy of a uniform distribution over the accessible states in the joint system .

Turning on the reservoir appears to have “smeared” the fixed state , whose Shannon and thermodynamic entropies were both , into a distribution of states centered around with the same Shannon entropy, but whose effective thermodynamic entropy is greater by .

If we express the Shannon entropy of a uniform distribution of elements as , with , we can rewrite the Shannon entropy to make these interpretations clear:

which we can read as describing the entropy of two equivalent subdivisions of the system:

  1. choosing one of total states of .
  2. choosing a state of system with probability , then for each choosing one of the states of the reservoir .

If we had computed the Shannon entropy of the distribution of energies instead of , we would have gotten:

In the second line we’ve written this in terms of the “full” entropy ; we see a correction term which apparently accounts for the way the underlying states are grouped into energies; i.e. for switching from the finest-grained description in terms of to the coarser .

If we break this up differently, writing :

This now expresses two ways of viewing the states of the combined system . On the left we have representing a uniform distribution over all states; on the r.h.s. we see the entropy of, first, a Boltzmann distribution of energies of alone, and then for each , choosing uniformly from one of states of system and one of the states of system .

So we see that the Free-Energy equation can be written in these two separate ways as the sort of “the entropy of two equivalent divisions of a system must be equal” argument that we use to characterize the Shannon entropy in the first place.

We can also express this in terms of the “relative entropy”, “information gain”, or “K-L divergence” between two distributions:

Then:

  • The term can be read as , the information gained by moving from a uniform distribution over all states to a Boltzmann distribution.
  • The difference of the two granularities of entropy above, is a relative entropy , which we can read as the information gained by grouping the energies—marking states indistinguishable—according to .
  • We can view as the divergence , where is a distribution localized at any single point . Thus the Shannon entropy of a uniform distribution represents the information to be gained by going from that entire uninformative uniform distribution to any single state.

Hence we can express the Shannon entropies entirely as differences of divergences, capturing the fact that entropy is only ever defined relative to some finer-grained description of a system.

The first line represents the identification of of the underlying states as single states; the second the identification of of the states as single states.3

In summary: when we finally apply the probabilistic interpretation of our distribution of states, we find that the Shannon entropy of this distribution is , which was the thermodynamic entropy of the mean state before the reservoir was turned on. Two things about this surprised me:

  • I expected the Shannon entropy of the distribution to equal , I think, but in hindsight this would be impossible because (or rather ) is the entropy of a uniform distribution over all of the joint states of , distributed as rather than of system alone distributed as .
  • When we translate into information-theoretic terms, it turns out to be nothing but the definition of relative entropy.

4. Translating the Energy Scale

We have seen three derivations of the same expressions for Free Energy, arising from three postulates as to the physical state of our system.

One final derivation is as follows. If we have some very simple system with a single state with multiplicity and entropy , we can ask: how do these change as we alter the definition of energy ? We make our same linear approximation:

But this is the same as the Free Energy relationship, written as , if we identify:

  • as the entropy before the shift
  • as the multiplicity after the shift, which will apparently equal
  • as the entropy after the shift, i.e. , which equals .
  • as the new zero-energy.

It seems this relation is the only sensible way for entropy to transform under a change of energy scale. We can write these as a transformation law between and :

In this form we can clearly see that, when the effect of a reservoir or a change of energy is to transform each energy’s multiplicity , the same transformation is carried out on the “macroscopic” description: . The effect of the reservoir on the average-energy state is exactly the average of its effects on all the microscopic energy states: . And the r.h.s. expresses the opposite effect, i.e. turning off the reservoir.

I interpret this as follows. If you initially describe a system at a fine-grained scale where all states are discernible, then:

  • the effect of turning on a heat reservoir at temperature , which adds Boltzmann factors to each state, then summarizing the system by a single total
  • …must be, to lowest order, the same as the effect of summarizing the system by a single , and mean energy , then turning on the reservoir and adding a Boltzmann factor to the single state

There is a “commutator” of this sequence of operations—all the terms we drop in a thermodynamic limit—but again, to lowest order, the requirement that the result transform self-consistently with an energy scale implies there is no other way for these two operations to behave!


Conclusions

We see, in conclusion, that the same equation can represent:

  • How the modal entropy is affected by a reservoir
  • How the effective entropy of the state of mean energy is affected by a reservoir
  • How the Shannon entropy of the state at energy is altered by a reservoir
  • Or, how the entropy of any single state is altered by a change of energy scale

In each of the first three cases we are approximating our large- system as a single function of entropy, and evidently there is only one form this function can take, which is determined solely from how it must transform with a change of energy scale. All of this rhymes closely with classical mechanics—the other setting in physics where one encounters Legendre transforms—where the effects of interactions can always be equivalently viewed as transformations of the coordinate system. And this argument explains why the different postulates all lead to the same asymptotic : there really isn’t any other way for a one-dimensional system to behave.

It is also remarkable to me that no probabilities are needed to produce the results of thermodynamics. The multiplicity function alone completely characterizes the system. We can make statements about its mode and mean and even properties like variances without having to make any assertion as to which state we expect to find when we measure it. We only need a postulate as to which state a system is to be found in, whether probabilistic or not, at the moment of measurement, and this choice can be made solely on the basis of experiments. I assume these are the grounds for the use of “probabilities”.



Part 3: General Techniques

When studying stat-mech for the first time, many of its tools—partition functions, generating functions, free energies—are encountered on an ad hoc basis in some vaguely historical order, and wind up feeling like a bag of magic tricks with no overall logic to them. Each separate encounter has a separate name; it is as if we gave separate names to all of the countless different Fourier Transforms encountered in a physics education. There are in fact only a handful of techniques being used over and over, and one of my goals in spelling out these derivations has been to disentangle them. We’ve seen in particular that is nothing but a generating function, Legendre transform, Laplace transform, and a Shannon entropy, all at once—and all of these techniques have to produce the same object because, it turns out, there’s only one sensible way to summarize a giant thermodynamic-scale system by a single formula.

The following few sections will spell out some of the less-familiar general techniques of which the standard stat-mech arguments are specific examples.


Maximum-Entropy Estimation of Distributions

If you take for granted the probabilistic postulate, you can derive the partition function and free energy formulas via another route. We consider the following constrained optimization problems:

  • Maximize the Shannon Entropy w.r.t. a known mean energy and the constraint that the probabilities are normalized .
  • Or, maximize the multinomial coefficient over a set of occupation numbers , subject to constraints and .

The second problem arises when considering stat-mech as arising from an imaginary ensemble of states sharing energy (a trick to make the probabilities palatable to frequentists, I suspect). In the large- limit, it amounts to the same problem as the Shannon entropy optimization, because the Shannon entropy itself is simply the asymptotic form of a multinomial—

—an observation which helps to interpret the Shannon entropy, but won’t shine any light on the present situation.

I’ll sketch the first constrained-optimization derivation to illustrate the method; the full details can be found in any stat-mech textbook. We add Lagrange multipliers for the constraints to get a Lagrangian function for the problem:

Here is the Shannon entropy .

The condition ought to hold for any variation , , or at the extremal . A little calculus leads to:

The normalization () constraint reads:

which sets .

Then the average energy constraint () gives , which we identify as . This is an implicit formula for in terms of the provided :

Finally, evaluating , we get:

Interestingly, we find that the thermodynamic entropy is specifically the extremal value of the Shannon entropy over all distributions adhering to the constraints. And has spelled out the same Legendre transform again.
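Here is a sketch of that recipe as an actual numerical optimization, with made-up energy levels and a made-up target mean (none of these numbers come from the text): maximizing the Shannon entropy under the two constraints lands on a Boltzmann distribution, from which a β can be read off.

```python
import numpy as np
from scipy.optimize import minimize

E = np.array([0.0, 1.0, 2.0, 3.0, 5.0])    # toy energy levels (assumed)
U_target = 1.5                              # prescribed mean energy (assumed)

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)             # keep the log finite at the boundary
    return np.sum(p * np.log(p))            # minimizing -H maximizes H

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},          # normalization
    {"type": "eq", "fun": lambda p: np.dot(p, E) - U_target},  # fixed mean energy
]
p0 = np.full(len(E), 1.0 / len(E))
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(0.0, 1.0)] * len(E), constraints=constraints)
p_maxent = res.x

# The solution should be Boltzmann: p_i proportional to exp(-beta * E_i).
beta = np.log(p_maxent[0] / p_maxent[1]) / (E[1] - E[0])   # read beta off two levels
p_boltzmann = np.exp(-beta * E) / np.sum(np.exp(-beta * E))
print(np.round(p_maxent, 4))
print(np.round(p_boltzmann, 4))
```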

I never learned it in my physics education, but this derivation is a completely general technique—called maximum-entropy—which can be used to estimate a probability distribution from knowledge of some set of expectation values .

The same derivation with constraints has as a solution:

The last line expresses the extremal entropy as a Legendre transform of w.r.t. all of the s at once. We can also write this in a symmetric form as:

This is exactly how one arrives at a “grand canonical partition function”: a second constraint on the total particle number is added, which amounts to a second Legendre transform of . It is unfortunate we’re stuck with such a terrible name as “grand canonical” for such an elementary operation.

Just about all of the properties of stat-mech turn out to be completely general. The various derivative relationships of are found to be properties of Legendre transforms:

Note the two first-derivative relationships in the first line are how we actually perform the Legendre transforms . We isolate, say, , then invert the implicit equations to write the as functions of the . The full form is:

These formulas produce most of the zoo of thermodynamic equations like and , as well as various other definitions of specific heats and compressibilities. This is all so clear in Jaynes that I now think it a crime to teach stat-mech without mentioning these in generality.

We can go on to express identities pertaining to the distribution by re-interpreting our as . Then we have, say:

These properties derive from the fact that is a kind of moment generating function; more on this in a bit.

In any case, these identities give measures of dispersion and fluctuations in stat-mech, among them the fact that , which, because , means that the standard deviation of energy measurements will grow only as ; another characterization of the sharpness of the peak around in the Boltzmann distribution.

For more, see Jaynes. The takeaway from this section is simply that partition functions arise generally when estimating a distribution from knowledge of its expectations like . This is comforting to me: it turns out all of this is a completely natural thing to do, rather than some weird trick we’ve wedged in at the bottom of our tower of abstractions.


is a Legendre Transform of Shannon Entropy

In the maximum-entropy derivation we just saw, we ended up finding a Legendre duality between and the extremal value of , which was . We can make this explicit by writing the solution in terms of the full Lagrangian function (I’ll write all the extremizations as rather than thinking about which, if any, ought to be …):

The outermost maximization represents the Legendre transform between and . This makes me curious: what about the other s? Can we interpret those as Legendre transforms too? It looks like is two Legendre transforms removed from the Shannon entropy itself. We have the following expression for :

This will be easier to interpret if we take a detour to look at the relationship between Lagrange multipliers and Legendre Transforms.

If you have some constrained optimization problem “maximize subject to ”, the Lagrangian function will look like , and the solution will be:

If were just the function , then , and the Lagrange multiplier method just amounts to a roundabout way of evaluating at . But the equation for also suggests an even more roundabout way of evaluating a function: you can (for suitably well-behaved functions ) set by taking two Legendre transforms:

A more general constraint function like constrains the optimization to the subspace . Generically it can be rewritten in terms of , though will be a set or subspace unless is one-to-one. In the one-to-one case it reduces to the simple case of “evaluate a function”, except that the function is now . If not one-to-one:

This still looks like “evaluating a function with two Legendre transforms”, with the thorny details of the constraint surface now stashed away in the definition of .

This line of thinking also suggests it is sensible to take a Legendre transform w.r.t. any (suitable) function on the space , not just one of the available parameters. (Of course, any such transform will be equivalent to a transform with respect to the parameters after some reparameterization—this is obvious in hindsight, but is rather hidden by the general obscurity which surrounds all things Legendre.)

Now returning to our Lagrange-multiplier expression for :

We can read this as:

  • Legendre transform w.r.t. its zeroth moment
  • Legendre transform back to also evaluate this at (this transform doesn’t have a in the above expression because it has been evaluated.)
  • Then, Legendre the average energy
  • Then the final transformation sets the value of the average energy

If we view and as vectors, the second to last transform reads: “reparameterize in terms of its component along ”, and the final Legendre transform sets this component to . The normalization constraint can also be written in vector notation as ; this reparameterizes by its derivative along the line , then of course we transform back to set the component along this line to 1, which constrains the distribution to the unit simplex. The combination of the two constraints restricts the system to the intersection of these two planes, which we can visualize in 3D:

intersection of normalization and U constraints

Clearly you could rotate to a basis that isolates the and components, which would be analogous to the reparameterization .

One question arises: can these planes fail to intersect for some value of ? Yes: obviously we have no physical solution if . But it certainly seems possible to imagine other sets of energies for which satisfying has no solutions… I wonder if these have a physical interpretation.

It’s curious that the transform removes only a single degree of freedom (leaving if we have discrete ), while the transform removes the other : we extremize all of the at once, with a single constraint value shared between them. This will clearly limit the space of variations that are possible! I suppose the argument is that we have no reason to prefer one or another, or to prefer any particular ordering.

Another curiosity: this same line of thinking implies that we can break out the -maximization to define a “Shannon entropy constrained to the simplex”, like:

This thing is perhaps even more natural than , since it bakes in the fact that the arguments are probabilities. Clearly it is related to by a single Legendre transform:

… except that this is really separate Legendres w.r.t. all the remaining degrees of freedom of the , setting across the board. Hm.

That’s probably enough about this, although my mind wanders off in a few more directions. I find the equivalence of Lagrange Multipliers and Legendre transforms to be really interesting. It came up in my classes now and then, but I’ve never seen it spelled out as a general relationship between the two techniques, and nobody ever mentioned you could Legendre-transform w.r.t. any constraint function—probably because it’s either equivalent to a reparameterization in the one-to-one case, or very complicated in any other case.


The partition function is a few kinds of generating functions at once

We have mentioned that is a generating function, but it’s a little unclear what sort of generating function it is. Its derivatives are moments: . But the standard moment-generating function defined in probability theory, which looks like , is normally utilized by evaluating its derivatives at zero: . The moments of don’t take , though.

If you stare at this for a minute you’ll realize that is indeed an m.g.f., but it’s the standard m.g.f. of the uniform distribution over all energy states. But when we use it to calculate moments of the Boltzmann distribution, we evaluate it not at , but at some other finite . If we call this point we can document the three series expansions of and below:

So is the m.g.f. of a uniform distribution. This is not especially useful: it corresponds to , i.e. infinite temperature. Now :

We have to divide out to produce a proper m.g.f.:

Since I’ve spelled out the dependence, you would use this to calculate e.g. by taking . So, like a standard m.g.f., we “set something” to remove the terms we don’t care about. But we normally don’t distinguish between , which makes usage simpler.

Finally we have the series :

It appears to generate central moments, but this turns out not to hold past the third moment. This is a general property of the logarithm of any m.g.f., which is called the “cumulant generating function”.
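A finite-difference sketch of the cumulant structure, with a toy system and an assumed β (and a sign flip at odd orders, because our expansion variable is −β rather than β): the first derivative of ln Z gives the mean, the second gives the variance, and the match with central moments stops after third order.

```python
from math import comb, exp, log

N, beta, h = 100, 0.8, 1e-3    # toy system, assumed beta, finite-difference step

def ln_Z(b):
    return log(sum(comb(N, E) * exp(-b * E) for E in range(N + 1)))

# Moments computed directly from the Boltzmann weights:
weights = [comb(N, E) * exp(-beta * E) for E in range(N + 1)]
Z = sum(weights)
mean = sum(E * w for E, w in enumerate(weights)) / Z
var = sum((E - mean) ** 2 * w for E, w in enumerate(weights)) / Z

# Finite-difference derivatives of ln Z with respect to beta:
d1 = (ln_Z(beta + h) - ln_Z(beta - h)) / (2 * h)
d2 = (ln_Z(beta + h) - 2 * ln_Z(beta) + ln_Z(beta - h)) / h**2

print(-d1, mean)   # first cumulant:  -d(ln Z)/d(beta) = <E>
print(d2, var)     # second cumulant: d^2(ln Z)/d(beta)^2 = Var(E)
# The third cumulant still equals the third central moment, but the fourth
# cumulant is mu_4 - 3 * Var^2, not the fourth central moment itself.
```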

Note that these are all just normal Taylor series! They will generalize in the usual way to multiple dimensions, so if involves constraint functions like , it will have separate first derivative terms, a Hessian for a second derivative, etc., and all of these will have interpretations as expectations and covariances.


Factoring a Delta Function with Lagrange Multipliers

When we originally linearized , we handwaved an argument as to why we linearized the entropy rather than the multiplicity .

Perhaps we can do better? Here’s another argument. You could write the term for a given with a -function as

Now, we often reach for Lagrange multipliers to implement the concept of “constraints”. Let’s try to translate the -constraint into that language, even though this isn’t actually an optimization problem. We’ll get:

where the should be interpreted as transforming the terms of the series into the -function in the previous formula. Therefore we should think of it as selecting all terms which are locally extremal, rather than a single true maximum.

On this equation, the condition that give should lead to , and since this becomes the condition . But then we’re stuck; there’s no clear way forward to simplify the series. The problem is that the Lagrange multiplier term is being added rather than multiplied, so it can’t “kill” the terms that don’t meet the constraint.

If instead we translate the -function into an exponentiated “Lagrange multiplier”, we get:

then the constraint says . Because these expressions don’t depend on the other system, we can factor the series into a product of individual extremizations, effectively decoupling the constrained variables:

I find this clarifying: it tells us to think of as a “multiplicative Lagrange multiplier”, or as “part of a -function”. Adding to each term of the series amounts to abstracting out how that series will participate in an enclosing .

We can distill a general technique: if you want to analyze the behavior of subject to , write:

which encodes the constraint as the condition . This is hardly a rigorous treatment, but it gives some suggestive basis for the otherwise inscrutable technique of tacking on terms without any justification, which I’ve seen employed on a few occasions by grandmasters of physics like Landau. What is happening, apparently, is a rewriting of a -function by encoding it as its Laplace transform, somewhat like the more-familiar replacement by the Fourier transform .
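For reference, here is the identity I believe is being invoked, written in the standard inverse-Laplace (Bromwich) form; the normalization and contour, and the labels S and R for system and reservoir, are my own bookkeeping rather than something taken from the argument above:

```latex
\delta\!\left(E_{\mathrm{tot}} - E_S - E_R\right)
  = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}
      e^{\beta\,(E_{\mathrm{tot}} - E_S - E_R)}\,\mathrm{d}\beta
  = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty}
      e^{\beta E_{\mathrm{tot}}}\, e^{-\beta E_S}\, e^{-\beta E_R}\,\mathrm{d}\beta
```

Read this way, constraining the total energy with a δ-function is the same as weighting each subsystem’s states by e^{−βE} and then integrating over β; evaluating that integral by its saddle point is what singles out the physical value of β.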


References

  1. E. T. Jaynes, “Probability Theory: The Logic of Science”
  2. “Making Sense of the Legendre Transform”, arXiv
  3. John C. Baez, “What is Entropy?”, ebook

Footnotes

  1. Here I’m plotting on the axes, not ; the values are labeling the bins. If the distribution were bounded, this would be like plotting against the quantile of rather than itself, or against the “position on ‘s c.d.f.”. Basically this is a copula plot, a concept I only learned about when I was trying to figure out what this thing I was drawing was.

  2. I am led to contemplate whether the energy going into any particular mode in our system , e.g. a wave mode with energy , can itself be seen as an information cost, but one which doesn’t depend on the ambient temperature, being quantized and unable to exchange freely in units smaller than itself. I often think of the canonical momentum as the “information cost of changing Lorentz reference frames”; a bound state of a wave must change its frame at every coordinate within the boundary at once in order to be consistent with the B.C.s; this quantizes it. But, if this view is right, at root all of the energy is just “configuration entropy”…

  3. This makes me think it should be possible to add a third expression representing the complete coarse graining of the system down to a single state of energy , perhaps with an error term capturing the fact that, unlike the first two lines, this alters the system. But I can’t quite see how to do it.