Somehow I managed to make it through an undergraduate physics degree, two years of grad school, and three years of work as a data scientist, without ever taking a course in statistics.

I picked up a fair amount through courses and ambiently on the intellectual internet, but this education was incomplete, and it turned out to be a fairly noticeable handicap to be less-than-fluent in the common knowledge of other professional math-people—not surprisingly.1

Yet mathematics, when actually applied to the world, is quite a bit like a skill, and statistics is the central vocabulary of this practical skillset. Somehow I missed it entirely.

This is how I came to work through MathAcademy’s high-school-AP-level statistics course remedially at the age of 35, despite my mathematical experience substantially exceeding what it expected of me.

This turned out to be something of a rare opportunity—to encounter something as a student but with an adult’s ability to examine my experience. Here’s what I found.


Random Variables

The moment we encounter the “random variable” we are confused. There is a sense of “too many cooks” having spoiled something—of an abstraction created for one purpose and then formalized with a different “sense”, of programming in a codebase without a coherent architecture; no unifying sense of “intelligent design”; no other mind for my mind to mirror.

What exactly is my problem? I am not sure whether my issue is with the naming or the pedagogy or the concepts themselves. I will find out by trying to explain it.

One begins statistics with elementary calculations of probabilities—permutations, combinations, replacements, conditionals, Bayes’ Theorem. The mental gestures entailed are, first, the mere counting of possibilities, which proves incredibly effective, and second the clever application of the rules for transforming these counts. With some practice one starts to see the patterns to these operations and gets a feel for their power. All is grounded in the notion that a probability is defined as a ratio of measures of sets,

$$P(A) = \frac{|A|}{|\Omega|},$$

which is basically compatible with both a frequentist/objective “long run relative frequency” and a Bayesian/subjective “degree of belief” view, so long as the latter assigns a consistent meaning to the counting operation $|\cdot|$, for example, by a statement such as “in the absence of particular information, all orientations of the dice are equally likely to occur”.
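These counting calculations are easy to make concrete. Here is a minimal sketch in Python (the `prob` helper is my own devising, not any library's), computing probabilities of dice events literally as ratios of counts:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely orientations of two dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = |A| / |Omega|: a probability as a ratio of counts."""
    A = [w for w in omega if event(w)]
    return Fraction(len(A), len(omega))

p_seven = prob(lambda w: w[0] + w[1] == 7)    # six of the 36 pairs sum to 7
p_doubles = prob(lambda w: w[0] == w[1])      # six doubles out of 36
```

Everything downstream—conditionals, Bayes—reduces to more such counting.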

At this point we have grown accustomed to the differing treatment of data depending on whether it is “categorical”, “ordinal”, or “numerical”, and for numerical data by whether it is discrete or continuous, or in general by its domain. Of course you can speak of an average temperature or a roll of the dice; you can speak of a median poker hand but not a mean; you can speak of the distribution of results of a coin flip but not its median or mean, unless you assign $0$ and $1$ to “tails” and “heads” as we often do; of course there is no particular reason for this assignment besides convention and the opposite assignment works just as well. So, while the sides of a die or a coin or the cards in a deck are not inherently numeric, we find we have the option of assigning numbers to these events and thereby can calculate their means and other statistics.

At this point we have basically introduced the thing to be called a “random variable”—it is simply an assignment of numerical values to events.

We assign probabilities to real numbers according to the pushforward, along this assignment $X$, of the measure on $\Omega$. This is convenient, because from here on a course in statistics will deal almost exclusively with numerical data by way of various normal approximations to binomials and the like.

Often the elements of $\Omega$ are uniformly distributed, and the nonuniformity in the distribution of values $X(\omega)$ arises only from the non-bijectivity of $X$.
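A sketch of this pushforward, again with two dice as the uniform sample space and taking $X$ to be the sum of the faces (my own choice of example)—the familiar triangular distribution of dice sums is exactly the nonuniformity produced by $X$'s non-bijectivity:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Uniform measure on Omega: all 36 orientations of two dice.
omega = list(product(range(1, 7), repeat=2))

def X(w):
    """X: Omega -> R, here the (non-bijective) sum of the two faces."""
    return w[0] + w[1]

# Pushforward: P(X = x) = |preimage of x| / |Omega|.
pushforward = {x: Fraction(k, len(omega))
               for x, k in Counter(map(X, omega)).items()}
```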

The random variables $X$, $Y$, etc. defined this way have many nice properties, and we may proceed to define variance, standard deviation, covariances, etc.

But these random variables are at this point very general things, not particularly “random” and not particularly “probabilistic”; the term “random variable” is not well-motivated from this direction. These constructions are really applicable to any “measurable set”, and their nice properties derive mainly from their being vectors, though, pedagogically, we of course do not speak of “measurable sets” or “vectors” in the abstract at this point.

So we will try to arrive at the idea of a “random variable” from a second direction. We start by looking at a collection of data and label it with an integer in the obvious way,

$$x_1, x_2, \ldots, x_n.$$
And of course we would like to express things like a “mean” as a function of the data,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
which suggests interpretations both as an “operator” $\bar{\,\cdot\,}$ on the name $x$ or as a separate thing which we name $\bar{x}$ just for convenience.

Then, starting from the original notion of “probabilities as counts”, we find ourselves trying to ask the question: given a hypothesized underlying probability distribution for the $x_i$, what “sampling distribution” do we expect to see for a “statistic” like $\bar{x}$? We find ourselves needing a second symbol to act almost exactly like $x_i$ or $\bar{x}$ but as a kind of “slot” for a hypothetical value. We would like to use these values in the same operations, $\bar{X}$, $f(X_1)$, etc. We choose to notate this new variable in an upper case, i.e.

$$X_1, X_2, \ldots, X_n,$$
and we declare that these variables are “random variables”. Likewise we notate a “random vector” of random variables as

$$\mathbf{X} = (X_1, X_2, \ldots, X_n).$$
So far we have not really said how the probabilities work—that these should be considered “random” has no precise meaning. At present our random variables are something like the arguments to a function in code:

function calcSomeStats(Xs: Array[Real]):
  X_avg: Real = Xs.sum() / Xs.length()
  f_of_X1: Real = f(Xs[0])
  ... etc.

And, for now, the $\bar{X}$ and $f(X_1)$ r.v.s are a different kind of thing than the $X_i$ themselves—rather than arguments, they are completely determined once the $X_i$ are known.

Then we additionally assert that the input r.v.s are distributed in a certain way. The vector $\mathbf{X}$ is distributed as the joint distribution of the $X_i$ (unless we assume some correlations), and the distributions of $\bar{X}$ and $f(X_1)$ are determined as well. We write $\sim$ for “is distributed as”:

$$X_i \sim \mathcal{N}(\mu, \sigma^2).$$
To express this idea of a random variable in software, we would have to write some kind of DAG structure which treats the inputs and derived nodes equivalently:

function myModel():
    m = model()
    Xs: RandomVector = m.input()
    X_avg: RandomVariable = Xs.sum() / Xs.length()
    f_of_X1: RandomVariable = f(Xs[0])
    return m

m = myModel()
m.set_input(dist=Normal(mu, sigma^2), samples=N)
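The pseudocode above can be made runnable. Here is a minimal sketch of such a structure; the `RandomVariable` class and its `sample` interface are my own invention, not any real library's API:

```python
import random

class RandomVariable:
    """A node in a lazily evaluated graph; sampling is deferred ("late-bound")."""
    def __init__(self, sampler):
        self.sampler = sampler            # a zero-argument function returning a float

    def sample(self):
        return self.sampler()

    def __add__(self, other):
        return RandomVariable(lambda: self.sample() + other.sample())

    def __truediv__(self, k):
        return RandomVariable(lambda: self.sample() / k)

def normal(mu, sigma):
    return RandomVariable(lambda: random.gauss(mu, sigma))

# Input nodes and derived nodes are the same type of thing.
n = 16
Xs = [normal(0.0, 1.0) for _ in range(n)]
X_avg = sum(Xs[1:], Xs[0]) / n            # itself a RandomVariable

# Repeated sampling traces out the sampling distribution of the mean.
draws = [X_avg.sample() for _ in range(2000)]
```

Each draw of `X_avg` resamples every input node independently, so the empirical distribution of `draws` approximates the sampling distribution of the mean (variance $\sigma^2/n$).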

Now the name “random variable” feels appropriate: these are “variables” which are “random”. We can go on to calculate further statistics like variances and standard deviations, and these too will be the same type of thing—random variables.

Conveniently, the “types” of $X_i$, $\bar{X}$, and $f(X_1)$ are now all the same: all are “nodes” in the graph representing variables with an associated probability distribution. But this is actually a minor discomfort to me: it feels odd that our input r.v.s $X_i$ and the derived r.v.s like $\bar{X}$ are both called “random variables”. They are not equally random—both have distributions, but the former are inputs while the latter are determined; this is not, I think, what the name “random variable” implies.

The code examples clarify that the gesture being made by the use of random variables is a familiar one from software: the shift from an imperative program flow to something like a “late binding” representation of a program, often seen in DSLs, which allows additional information to be carried alongside the flow of the program (here, the arithmetic).

A bit of discomfort at this point arises from the fact these notations suggest that $X_i$ is a particular “instance” of a random process named $X$, and that $\bar{X}$ is an operator applied to the process $X$. These readings are misleading, and it would be best to think of $X_i$, $\bar{X}$ merely as suggestive names for otherwise-generic random variables. But I would like to think of $X$ as the name of a process and $x_i$ as a particular name for an instance of the process’s output.

To arrive at the technical definition of a random variable we need to take one more step. We assert that the probability of $X$ taking any particular value $x$ is the measure of the preimage of $x$ under a function $X$ defined on a set with a probability measure, i.e.

$$P(X = x) = \mu\!\left(X^{-1}(\{x\})\right).$$
In the r.h.s. we are using $X$ according to its technical definition as a function mapping $\Omega$, which is the essentially-probabilistic thing, to $\mathbb{R}$. That is,

$$X : \Omega \to \mathbb{R},$$
where $\Omega$ is equipped with a suitable probability measure $\mu$. (The range may be a subset of $\mathbb{R}$.) Our other random variables like $\bar{X}$, etc. can be given the same interpretation, possibly over different underlying sets.

Note this random-variable-as-a-function is the same thing we came up with when we wanted to assign numerical values to ordinal or categorical data! We have only reached it from a different direction.

What we have here is best expressed in code as a SQL groupby

SELECT
  X(w),
  SUM(p(w)) AS p
FROM omega AS w
GROUP BY 1

but there is not really an obvious way to use this construction in a modeling context.

But we go on to use these functions in strange ways,

$$P(X \le x),$$

where $X$ is treated at times like a “variable” which can take different values, at least inside the odd $P(\cdot)$ operator. Here $X$ seems to stand for $X(\omega)$, and $P$ for $\mu$.

Now that I’m seeing this clearly, it looks quite strange. This is something like notating a “restriction” with a predicate,

$$\mu\!\left(\{\omega : X(\omega) \le x\}\right).$$

Here it perhaps makes sense to drop the $\omega$ and $\mu$ and just write $\{X \le x\}$, and then we’re at something very similar to $P(X \le x)$.




I think I have located my confusion. The technical random variable-as-function is really the first thing you would do to assign probabilities to a set. But the name “random variable” and its notations make the most sense when viewed as “placeholder variables understood to be equipped with probability distributions”, and it’s hard to intuit how the technical definition as a function winds up supporting this.

To my eye the name “random variable” ought to be reserved for the notation $X$ in expressions like $X \sim \mathcal{N}(\mu, \sigma^2)$, $X$ being like a normal variable except random—equipped with a probability distribution.

A different word should be used for the sense of $X$ as a function $\Omega \to \mathbb{R}$. Using a “function” as a “variable” is misleading, and when using $X$ in this second sense as a function we should not write nonsense like $X \le x$. That’s really it.

I really don’t even like using $X(\omega)$. My intuition wants $X$-the-variable to not be a function; instead we should write $x = M(\omega)$ to assign a certain value according to some “mapping” $M$. In general I want to keep thinking of $X$ as a variable which attains different values—using a “variable” as a “function” is misleading!

Perhaps the best thing to do would be to allow a predicate like $X \le x$ to do double duty as “the subset of $X$’s domain on which $X \le x$ holds”. This is not the first time I’ve felt that “predicates” seem to be underutilized in math notation. We do not find an indicator $\mathbb{1}_A$ on a subset $A$ to be too disturbing, or generally a restriction $f|_A$; why not formalize this?
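In code the predicate-as-subset reading is a one-liner. A sketch (names of my own choosing), with the two-dice space and $X$ the sum of faces:

```python
from fractions import Fraction
from itertools import product

# Sample space: two dice; X is the sum of the faces.
omega = list(product(range(1, 7), repeat=2))

def X(w):
    return w[0] + w[1]

def event(pred):
    """Read a predicate on X's values as the subset of Omega on which it holds."""
    return [w for w in omega if pred(X(w))]

def prob(pred):
    return Fraction(len(event(pred)), len(omega))

p = prob(lambda x: x > 9)   # the predicate "X > 9" doing double duty as a subset
```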

Kolmogorov’s View

Now for another discomfort I feel with basic probability.

The problem roughly has to do with the relationship between the variables and processes under discussion and the enclosing “universe” which these inhabit.

Suppose you are handed one random variable $X$ defined on a sample space $\Omega_1$ and another completely unrelated random variable $Y$ on $\Omega_2$.

There is absolutely nothing stopping you from defining a new random variable $Z = X + Y$ on the space $\Omega_1 \times \Omega_2$, which will obviously have some distribution—probably a simple product, if you take $X$ and $Y$ to be independent, but it could as well be something more complicated.
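A sketch of this whipping-up, with a coin and a die standing in for the two unrelated spaces (my own choice of example):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Two unrelated spaces: a coin and a die, each uniform.
omega1 = ["T", "H"]
omega2 = [1, 2, 3, 4, 5, 6]

def X(w1):                  # X on Omega_1: tails -> 0, heads -> 1
    return 1 if w1 == "H" else 0

def Y(w2):                  # Y on Omega_2: the face value
    return w2

# Z = X + Y is defined on the product space Omega_1 x Omega_2,
# with the product (independence) measure.
omega = list(product(omega1, omega2))
dist_Z = {z: Fraction(k, len(omega))
          for z, k in Counter(X(w1) + Y(w2) for w1, w2 in omega).items()}
```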

Something already irks me. It feels strange that we can whip up new product spaces like $\Omega_1 \times \Omega_2$ out of thin air whenever we want, and that, when we speak first of $X$ and $Y$ and later of $Z$, we are switching into and out of $\Omega_1 \times \Omega_2$ as we go.

The natural way we are introduced to a product of sets is by thinking of the two sets as the sides of a rectangle, or entries of a tuple, and of course their cardinalities multiply. Each logical construction we imagine existing in a featureless intellectual vacuum (much like the thought experiments of Galileo), and we develop the habit of waving into existence new entities in that vacuum without much of a thought. But this is only a model of physical reality (again like Galileo) and, while it is effective at isolating concepts and factoring them into their constituent abstractions, it is not true.

Consider the following example from Jaynes. The probability of a proposition $A$, e.g. “the roof will leak today”, should not be changed if we decompose it into a product with a sum over all possible values of some irrelevant piece of information, e.g. a proposition $N_n$, “the number of penguins in Antarctica is $n$”. We ought to have, with multiplication as a logical “and”,

$$P(A) = P\!\left(A \cdot \sum_n N_n\right) = \sum_n P(A \cdot N_n).$$
Evidently this sum-over-penguins was always present in premise $A$; apparently a simple statement of a proposition may always be regarded as eliding any sum we like over the remaining universe of irrelevant propositions.
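The elision is easy to verify numerically. Here is a toy sketch in which both $P(A)$ and the penguin distribution are entirely made up:

```python
from fractions import Fraction

# P(A): "the roof will leak today" -- a made-up number.
p_A = Fraction(3, 10)

# A made-up, irrelevant penguin-count distribution (sums to 1).
penguins = {10_000: Fraction(1, 2), 20_000: Fraction(1, 3), 30_000: Fraction(1, 6)}

# Irrelevance as independence: P(A and N_n) = P(A) * P(N_n).
joint = {n: p_A * p for n, p in penguins.items()}

# Summing over the penguins recovers P(A) exactly.
recovered = sum(joint.values())
```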

Let us try to write this another way, not as a logic of propositions but as a “sample space”, with $A$ not as a single proposition but as a system with some set of substates $\Omega_A$. Then, along the lines of the penguin example, it appears to be the case that a claim about “system $A$” in isolation is practically equivalent to a division of the state space of the entire universe into

$$\Omega_{\text{universe}} = \Omega_A \times \Omega_{\text{rest}}.$$
The penguins live in $\Omega_{\text{rest}}$. But so do plenty of states which correlate with $A$, for example, the state of rain, humidity, or weather forecasts. We should factor the universe into three parts:

  • system $A$ itself, with states $\Omega_A$
  • the part of the rest of the universe which correlates with or contains information about $A$, which I’ll denote by $\Omega_E$
  • the rest of the universe, $\Omega_R$.

that is,

$$\Omega_{\text{universe}} = \Omega_A \times \Omega_E \times \Omega_R.$$
Anything in $\Omega_E$ we have to be careful considering at the same time as $A$, but $\Omega_R$ we can freely disregard or include; Jaynes’ penguins live here.

This is a statistical-physics-inspired view: $\Omega_E$ might be the surrounding “environment” whose temperature affects $A$, while $\Omega_R$ represents the remaining degrees of freedom of the world at some suitably distant remove so as to be effectively uncorrelated with $A$.

I am trying to sketch my way towards the sense in which, when we add two unrelated random variables $X + Y$, they are not really unrelated, as in Kolmogorov’s view of probability—the sample spaces $\Omega_1$ and $\Omega_2$ are not free-floating sets we have just now brought into relation with each other; instead I wish to view both as separate systems within the same universe.

The claim is that we see, in an example like Jaynes’ penguins, that probabilistic language about beliefs and propositions always existed in relationship to irrelevant states and information.

It feels as though there is a different way of thinking about probability distributions which makes this manifest in the first place: that, say, declaring the existence of a sample space and a random variable $X$ entails making a distinction among the states of an undiscriminated universe. In writing $X$ we go from considering a sample space of a single state to one which makes an $X$-distinction, giving $\{x_1, x_2, \ldots\}$. Likewise for $Y$. Then $X + Y$ means making both distinctions at once: $\{x_i y_j\}$. Or you might presume a correlation, equivalent to assigning a weighting to the joint states different from the product-weighting.

In this view the act of writing $X + Y$ acts like a concatenation of their labels or distinctions. It is suggestive of a “direct sum” (of a vector space, say) rather than a “direct product” (though my abstract mathematics is too faded at present to be more precise than that).

I don’t have much more to say on this now, but it reminds me of a number of other observations. It feels helpful to link them together.

I originally came by the sense of this other “view” of probabilities in statistical mechanics. It is reminiscent also of reaction network theory and is related to my series of sketches on dimensions. I am also reminded of the way the first term in the series expansion of the Shannon Entropy admits an interpretation as counting “distinctions”, which I mention here—in particular, the picture of an uneven probability distribution as being described, not in terms of an equivalence relation on an underlying set, but an inequivalence relation.

I suspect that the view I am grasping for would, if achieved, be a complete inversion of thinking about probability. My hunch is that this is the way we should be thinking of and teaching probability in the first place—that some immense amount of misunderstanding arises from attempting to extend that “idealized sets in a vacuum” view far beyond the point where it is sane to do so. In the inverted view, Jaynes’ penguins would be entirely unsurprising—because it should be obvious in the first place that any distinction you are not making could be made and would carve up your sample space further.2

Confidence Intervals

Let me spell out a basic confidence interval calculation before I start complaining.

We begin with some data which we hypothesize to be generated by a process whose distribution is normal with an unknown mean $\mu$ and (let’s say) unknown variance $\sigma^2$. Some standard probability derivations show us that:

  • the mean $\bar{x}$ of a sample of $n$ points from this process will therefore be distributed with the same mean $\mu$ and variance $\sigma^2/n$.
  • the sample variance $s^2$, standardized as $(n-1)s^2/\sigma^2$, will be distributed $\chi^2_{n-1}$.
  • the t-statistic $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, for a given true mean $\mu_0$, will have a $t_{n-1}$ distribution.

Then we can test a specific candidate $\mu_0$ (that of the null hypothesis, say) by determining whether the value of this final statistic falls within some interval around its mean, for example, the interval where the $t_{n-1}$ distribution has between $2.5\%$ and $97.5\%$ cumulative probability, for a “95% confidence interval”.

Now: a notation is sorely missing! It is relatively straightforward to write the three distributions I just spelled out, but the remaining steps of the process are taught by a patchwork of paragraphs and equations, and are far harder to follow the logic of; as a result the exact logical content of a “confidence interval” is obscure to many (which obscures, among other things, how objectionable it is!)

To repair this we need the “interval notation” $a \pm b$, where $a \pm b$ is an interval $[a - b,\ a + b]$. Then an expression like $\mu_0 \pm z\,\sigma/\sqrt{n}$ stands for an interval as well:

$$\mu_0 \pm z\,\frac{\sigma}{\sqrt{n}} = \left[\mu_0 - z\,\frac{\sigma}{\sqrt{n}},\ \mu_0 + z\,\frac{\sigma}{\sqrt{n}}\right].$$
Usually $a = 0$, in which case we might replace $0 \pm b$ with a “neighborhood of zero” $\pm b$:

$$\pm b = [-b,\ b].$$
We also need to allow our functions to map intervals to intervals:

$$f([a,\ b]) = [f(a),\ f(b)].$$
This only works for nice enough functions, but we can always map sets to sets.
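These interval tools are straightforward to realize as a small type; the `Interval` class below is my own sketch, not a standard library, and its `map` assumes a monotone function (otherwise endpoints do not suffice):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @staticmethod
    def around(center, radius):
        """The interval [center - radius, center + radius], i.e. center +/- radius."""
        return Interval(center - radius, center + radius)

    def map(self, f):
        """Image under a monotone function; endpoints suffice."""
        a, b = f(self.lo), f(self.hi)
        return Interval(min(a, b), max(a, b))

    def __contains__(self, x):
        return self.lo <= x <= self.hi

I = Interval.around(0.0, 1.96)          # a "neighborhood of zero"
J = I.map(lambda t: 5.0 + 2.0 * t)      # e.g. t -> mu + sigma * t
```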

With these tools we may express the rest of the confidence interval argument purely in equations. Here $F$ is the CDF of some distribution, e.g. the $t_{n-1}$ distribution just discussed, $f$ is its p.d.f., $F^{-1}$ is its inverse which maps quantiles to $t$-values, $\alpha$ is our target “confidence level” e.g. $0.95$, and $Q = \left[\tfrac{1-\alpha}{2},\ \tfrac{1+\alpha}{2}\right]$ is the interval of probabilities which “do not reject a hypothesis at the confidence level $\alpha$”. A line with a $\overset{?}{\in}$ in its relation represents a condition which, if false, means we reject the hypothesis for $\mu_0$.

$$t(\mu_0) = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \ \overset{?}{\in}\ F^{-1}(Q) = F^{-1}\!\left(\left[\tfrac{1-\alpha}{2},\ \tfrac{1+\alpha}{2}\right]\right).$$
This to me is crystal-clear: we reject any $\mu_0$ not falling within the interval

$$\mu_0 \ \overset{?}{\in}\ \bar{x} - \frac{s}{\sqrt{n}}\,F^{-1}(Q) = \bar{x} \pm \frac{s}{\sqrt{n}}\,F^{-1}\!\left(\tfrac{1+\alpha}{2}\right).$$
Likewise when we estimate a true variance $\sigma^2$ we could perform an analogous calculation on the $\chi^2_{n-1}$ distribution to determine which values $\sigma_0^2$ imply the probability of attaining the measured value $s^2$ falls within a range of quantiles $Q$, which maps to a range of standard values $F^{-1}(Q)$:

$$\frac{(n-1)\,s^2}{\sigma_0^2} \ \overset{?}{\in}\ F^{-1}(Q), \qquad F = \text{CDF of } \chi^2_{n-1}.$$
These kinds of manipulations of sets are, I think, a programmer’s instincts at work. Conventional written mathematics prefers certain kinds of abstractions and abuses of notation and shies away from others; programmers are less shy about re-defining syntaxes. In particular, using $\pm$ for sets in the last few lines of the derivation offends my math brain but is obviously correct. Really I don’t even want to think of these as “sets”—intervals are a more-specialized thing; we simply have endowed the “interval” type with an interface adequate to express our thoughts.

In short, we ought to let a “confidence interval” be a particular instance of a general thing, an “interval”!
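As a sketch of the whole procedure in Python’s standard library: since `statistics.NormalDist` provides only the normal quantile function, this is the simplified known-variance ($z$) variant rather than the $t$ version, and the data are made up:

```python
from statistics import NormalDist, mean

# Made-up sample; assume the generating process has known sigma.
data = [4.1, 5.3, 4.8, 5.0, 4.6] * 5       # n = 25 points
sigma = 2.0
n = len(data)
x_bar = mean(data)

alpha = 0.95                                # confidence level
z = NormalDist().inv_cdf((1 + alpha) / 2)   # quantile -> z-value, about 1.96

# The interval of mu_0 we do not reject at this confidence level:
# x_bar +/- z * sigma / sqrt(n).
radius = z * sigma / n ** 0.5
ci = (x_bar - radius, x_bar + radius)
```

The `inv_cdf` call plays the role of $F^{-1}$ above; swapping in a $t$ quantile (from e.g. a stats library) recovers the unknown-variance case.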

That’s enough for now, though there’s plenty more to say. In a separate post I also intend to give my thoughts on MathAcademy in particular.




  1. We moderns tend to underrate the degree to which math is a skill like any trade, and the degree to which a mêtis—a body of practical knowledge, acquired by experience and mentorship—is involved in rendering it useful. The math we learn in school has a completely different sense to it, and only pays lip service to practical applications, far short of what would be required to acquire this mêtis; to the extent we encounter mêtis it is in spreadsheets, and in programming if we partake of those classes. In school, instead, we learn a very different sense of mathematics, which aspires to entirely different ideals than would a useful skill. High-school math is learned in a manner resembling Latin grammar, all vocabulary and declensions and inflections and the like, without ever even reading anything aloud, much less conversing or arguing. Higher math is like an art, which aims to bring into view a tremendous edifice of logical implication, which must be understood in compressed forms and maps only, being too immense to comprehend in totality; Category Theory of course is well-directed towards this ideal, though I wonder if past generations of mathematicians were essentially doing something else. Physics and computer science, then, seem to undertake a slightly different “gesture” of the mind than pure-math—not exactly “compression” but the “factoring” of physical reality into discrete components, building-blocks, and principles; a sort of middle-ground between math and the mêtis of real work.3

  2. Furthermore, such a view might encourage an entirely different approach to rational thought. How much of modern thinking is downstream of the paradigmatic ways of thinking in which we are well-trained by early educations in math and physics? It is impossible to speak of such things without seeming crazy, yet, consider the following example: today’s philosophical discourse gives an enormous amount of attention to utilitarianism, and to attempting to devise the exact assignment of “lives” and “life-years” and “pain qualia” and the like to “moral value”, e.g. in Effective Altruist circles and similar. Is this not directly a consequence of a probabilistic way of thinking? To attempt to model morality by assigning equal moral value to each human life is to impose a “uniform distribution” on a sample space of human lives which is, in reality, devoid of any “quantity” akin to moral value. This has always seemed insane to me—perhaps a consequence of missing out on a statistics class in my formative years. Human lives are, in reality, unalike things—incommensurable and non-fungible, unless we choose to assign them real-number or at least ordinal values. The choice to do so is itself a moral act, and may be an appropriate one in certain contexts (e.g. a state with a duty to treat its population equitably) but is, to my eye, an absurd view for a human being to hold, and is at odds with the natural moral sentiments out of which ethics arises.4

  3. I have come to see the abstract mathematics as a dreamlike pursuit—to endlessly form connections and see clearly as an end in itself, towards no other end in particular—grasping to comprehend a certain God essentially as a mystical act. To be “useful” is nearly antithetical to the aims of the mathematical mind. Only wartime (the atom bomb, encryption, competitive pressure, etc.) can really divert this mystical inclination downward to vulgar reality, where, of course, it turns out to grant incredible power. The reclusive wizards of modern fantasy are remixes of mathematicians and physicists, above all; the aims of this memetic archetype are always obscure; at best they are servants of some divine mission (e.g. Gandalf) invisible to worldly mortals; at worst their power is employed in self-interest and always reads as evil.

  4. Likewise it is absurd to my eye that the “trolley problem” is considered interesting at all—it is the philosophical equivalent of a Galilean thought experiment in a vacuum, and depends on and encourages a view of morality in terms of identitarian features, or the mere counting of bodies, rather than the context of the scenario in the enclosing world. In reality our judgment of just action in such a scenario would depend almost entirely on how the scenario came to be—whether duties were neglected, whether anyone was to blame vs. fate alone, whether the scenario had occurred before or might recur. Morality, in my view, has nothing to do with such scenarios considered in a vacuum. One may counter that physics gets quite a bit of use out of thought experiments in vacuums, but physics does not truly consider vacuums—there is always an external force $F$, or a gravitational field $g$, or ingoing or outgoing edges to a Feynman diagram—the external world is factored into its atomic interactions, diagonalized w.r.t. an operator which is linear over the dynamic in question, and thus its thought experiments lead to genuinely composable abstractions. Not so for the trolley problem, not generally, though we try to add such edges, “what if one of the people is Hitler?”, “what if they’re suicidal?”, etc.—morality does not decompose; these exercises are nonsense; entertaining them encourages a paradigm of thinking in the Kuhnian sense which is entirely unable to contemplate non-utilitarian moralities.