After months away I have once again picked up the tools of programming, and have once again been assailed by a torrent of frustrations: nothing is clean, I cannot express my intention, all my ideas are mired in implementation details, everything is documented badly, if at all…

Surely we can do better—much better—

Part 1: Type System Ergonomics

We’ll start small.

Data types are, on one hand, required for languages and their compilers to operate sanely: to allocate memory efficiently and safely, say.

On the other hand types are needed to help the user of the language operate sanely: as documentation, to guard against programmer errors, etc.

The type “systems” (if they can be called that) added to Python and JS are clearly starting from the second sense, since they have no effect on “memory allocation” anyway, and typically do not even cover the entire codebase.

But the type systems so far implemented in high-level languages seem to be based on a concept of “types” starting from the low-level, compiler-focused end of the spectrum, working “upward” to add a few ergonomic features, rather than starting from the “user ergonomics”-focused needs and working “downward” to add static guarantees. And this state of things is quite inadequate to the needs of actual programmers. Perhaps two systems are needed which meet in the middle: one rigorous and verified, one imperfect and only checked as well as feasible; people don’t seem to get too bothered by imperfect type systems anyway.

Here are some basic asks—things I think any sane language ought to have (there are no sane languages). I’m not a programming-language designer, and as such will only lightly refer to technical terms where I’m familiar with them—these are a user’s feature-asks.

Constraints

First up: constraints. (This is my catch-all name for “refinement types”, “contracts”, and “dependent types”.)

This is so basic I can’t believe it’s not commonplace. There is a difference between an integer variable

x: int

and an

x: int
assert x > 0

The fact that x is positive should flow through the program just like the fact that x is an Int. You should check this constraint once when instantiating x at the program’s “boundary”, and then if you receive an x declared this way you can trust that the constraint holds.

It should additionally not require a separate assert or a wrapper class or any convoluted trait or the like to declare this. The syntax should be immediately obvious to a high schooler learning to program for the first time. The typical syntax is like

Type posInt = {x: int | x > 0}

but this is ugly because of, first, the extraneous variable name x, and second because in many cases I would like to avoid having to give the constrained type a name, instead declaring it inline. My instinct is to use a Scala-like “anonymous argument” symbol _, acting as the constrained variable:

x: {int, _ > 0}

Or perhaps we do without the braces, though this obviously collides with objects in JS and tuples in Python:

x: int, _ > 0

Perhaps both are supported when unambiguous.

(My code examples will be in no particular language—some kind of blur of Python, Typescript, and Scala.)

These syntaxes read as if _ > 0 is a “statement”, with the existence of such an x implying that any statements in its definition are true. I then imagine giving int the same semantics, though this would conflict with the conventional meaning of int(x) as a cast or constructor.
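
As a rough sketch, present-day Python can approximate this with an Annotated type carrying a predicate. Here Constraint and validate are invented names, not a real library:

```python
from typing import Annotated, get_args

# Hypothetical helper: a predicate that rides along with a type.
class Constraint:
    def __init__(self, check):
        self.check = check

# The analogue of x: {int, _ > 0}
PosInt = Annotated[int, Constraint(lambda v: v > 0)]

def validate(value, annotated_type):
    """Check a value against the base type and any Constraint metadata."""
    base, *meta = get_args(annotated_type)
    if not isinstance(value, base):
        raise TypeError(f"expected {base.__name__}, got {type(value).__name__}")
    for m in meta:
        if isinstance(m, Constraint) and not m.check(value):
            raise ValueError(f"constraint failed for {value!r}")
    return value

x = validate(3, PosInt)   # passes; validate(-1, PosInt) would raise
```

The check runs once, at the boundary; after that the value is an ordinary int, which is exactly the weakness the proposed syntax would fix.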

By all means give a type a name, though:

Email = Type {
  Str,
  regex = re("[^@]+@[^@]+\.[^@]+"), # Checks @ and .
  regex.match(_)
}

def send_email(msg: str, e: Email) = { ... }

This has the advantage of letting us interpret Email itself as a sort of object. One could elsewhere refer to Email.regex to make use of the same (statically-defined) regex object. (Here I’ve mangled together a few notations, including a Scala-like object instantiation syntax; one would have to take things further to make this consistent with Python or JS objects.)
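
In today’s Python the closest approximation is an ordinary class whose constructor is the validator and whose metadata lives in a class attribute (a sketch of the idea, not the proposed syntax):

```python
import re

# A sketch: the "type" is an ordinary class; its regex is an attribute
# (usable elsewhere, as Email.regex), and its constructor validates.
class Email(str):
    regex = re.compile(r"[^@]+@[^@]+\.[^@]+")  # checks @ and .

    def __new__(cls, value: str):
        if not cls.regex.fullmatch(value):
            raise ValueError(f"invalid email address: {value!r}")
        return super().__new__(cls, value)

e = Email("a@b.c")         # validated once, at construction
assert isinstance(e, str)  # still usable anywhere a str is expected
```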

One advantage of type constraints is not having to redeclare a bunch of things between a method’s docstring and the method body.


def send_email(msg: Str, e: Str):
  """
  send an email
  
  args: 
  - msg  string  the msg to be sent
  - e    string  the email address, which must contain the characters @ and .
  """
  if not re("[^@]+@[^@]+\.[^@]+").match(e):
    raise "Invalid Email Address"
  ...

Instead we can encode as much as possible in the types themselves, reusing the logic and documentation between all instances, and then auto-gen a docstring from this information while trusting that any argument we get has already been validated as an email:

Email = Type { ... }

def send_email(msg: Str, e: Email):
  """
  send {msg} to {e}.
  
  (an example of a succinct docstring)
  """  
  // no checks needed.


msg: Str = "hi"
e: Str = "a@b.c"

// validation occurs at call site when passing a Str 
send_email(msg, e)

If instead you passed a pre-validated Email, the validation would be skipped, which of course is what you would want for good performance. To the extent possible, values should be validated when they enter the program’s “boundary” and then trusted through the rest of execution, which of course is what types do already.

A few other types of constraints come to mind quickly:

  • constraints between multiple fields in a structure, which would be thought of as constraints on the enclosing structure rather than on the types of the individual fields.
  • a condition on the length of a list (which, if checked at compile time, would be an example of a “dependent type”). You could then place constraints on the arguments to a function, e.g. List.head would require a nonempty list, or Vec.dot would require two vectors of the same length. (All linear algebra code would be vastly more enjoyable to work with given some type constraints.)
  • a condition that a list is sorted; this is preferable to a dumb wrapper SortedList that conveys the information to a reader but carries no semantic information.
  • that a type is non-null (in languages where such things are allowed and it’s not sufficient to distinguish int | None from int).
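
The list-length cases above can be sketched at runtime in Python; these are exactly the checks a dependent-type system could instead discharge at compile time:

```python
# Runtime stand-ins for length constraints a dependent type could verify.
def head(xs: list):
    assert len(xs) > 0, "head requires a nonempty list"
    return xs[0]

def dot(xs: list[float], ys: list[float]) -> float:
    assert len(xs) == len(ys), "dot requires vectors of equal length"
    return sum(x * y for x, y in zip(xs, ys))

head([5])                    # ok
dot([1.0, 2.0], [3.0, 4.0])  # ok; dot([1.0], []) would fail the assert
```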

Constraint Validation

Where should constraints be validated? And when?

The obvious answer, already mentioned, is at the point where the type is instantiated. This would amount to a type “contract”—a runtime type check.

But in the example of a SortedList, we wouldn’t necessarily want an automatic check—it would amount to an extra unnecessary pass over the data after sorting.

The best case would be a compile-time check that the sorting algorithm actually produces a sorted list, with the aid of some kind of theorem prover (making this a “refinement type”). But for non-trivial examples this is not likely to be possible. In most instances it would be overkill anyway: let the programmer figure it out, and let the test cases prove it to their satisfaction. Therefore it should also be possible to put off a check in a constructor, i.e. marking the instance unchecked or verified or something:

// I'm not sure what the best syntax is here.
// Maybe just _.isSorted
SortedList[T] = Type {list[T], _[i] <= _[i+1] for i in _.index[:-1]}

def mergeSort[T](xs: list[T]) = {
  ... // sorting algorithm here
  xs = ...
  return xs verified as SortedList[T]
}
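
A Python sketch of the unchecked constructor: here trust is an invented name standing in for the “verified as” construction:

```python
# SortedList validates on construction; `trust` skips the check, the
# analogue of `xs verified as SortedList`.
class SortedList(list):
    def __init__(self, xs):
        super().__init__(xs)
        if any(self[i] > self[i + 1] for i in range(len(self) - 1)):
            raise ValueError("list is not sorted")

    @classmethod
    def trust(cls, xs):
        out = cls.__new__(cls)   # bypass __init__, hence the check
        list.__init__(out, xs)
        return out

def merge_sort(xs: list) -> SortedList:
    # sorted() stands in for a real merge sort implementation
    return SortedList.trust(sorted(xs))
```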

I imagine doing this incrementally: a compiler checks as much as it can based on its ability to understand the algorithm; the programmer asserts whatever is left to close the gap to the constraint desired; perhaps the compiler can specifically request any assertions needed to complete its proof. Over time the compiler can be improved further and the gaps closed. I’m sure refinement-type languages already do this—but let’s go mainstream with it already, please?

Anonymous Type Definitions

It would often be handy to define a type inline and then refer to it elsewhere. Something like this:

an_email: {
  Str,
  regex = re("[^@]+@[^@]+\.[^@]+"), # Checks @ and .
  regex.match(_)
} = ...

...
def send_email(msg: str, e: typeof(an_email)) = { ... }

That typeof is ugly, though. For something so fundamental I’d prefer a symbol, such as ~an_email, with ~ having a bit of a sense of “similar to”.

The main application of this is an ergonomics feature I find myself wanting all the time: the ability to refer to the input and output types of a function anonymously. For the time being let’s define ~f to refer to the input type of a function f and f~ to refer to its output type. Then an example would be:


def step1 = (a: int, b: str) => {
  ...
  return {x: b, y: a + 1}
}

def step3 = {xs: [int]} => { ... }

def step2 = step1~ => ~step3

Here we therefore have:

step1~ = Type {x: str, y: int}
~step3 = Type {xs: list[int]}

Obviously you could define the required types separately, but that’s so verbose! And, usually, the set of “valid function signatures” is different from the “set of valid types”—most languages’ functions take positional and keyword arguments; the basic struct or object type could support the same thing such that every normal function is just a mapping from one object type to another. Why not?
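
Python can already introspect a function’s input and output types with the standard typing.get_type_hints; here is a small sketch of extracting what I’ve written as ~f and f~:

```python
from typing import get_type_hints

def step1(a: int, b: str) -> dict:
    return {"x": b, "y": a + 1}

def in_types(f) -> dict:
    """The analogue of ~f: the function's argument types, by name."""
    hints = dict(get_type_hints(f))
    hints.pop("return", None)
    return hints

def out_type(f):
    """The analogue of f~: the function's declared return type."""
    return get_type_hints(f).get("return")

in_types(step1)   # {'a': <class 'int'>, 'b': <class 'str'>}
out_type(step1)   # <class 'dict'>
```

What’s missing is a way to use these introspected types as first-class annotations elsewhere, which is precisely the ask.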

Some other particularly useful “anonymous types” are:

  • a copy of another type but without any of its constraints, to be used as a container for unvalidated input
  • a copy of another type but with all its fields optional, to be used as a builder or partial constructor
  • subsections of another type/struct/object, like a subset of the fields or a sub-structure
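
The “all fields optional” copy, at least, can be sketched with the stdlib dataclasses machinery; partial_of is an invented helper name:

```python
from dataclasses import dataclass, fields, make_dataclass
from typing import Optional

@dataclass
class User:
    name: str
    age: int

def partial_of(cls):
    """Invented helper: a copy of cls with every field optional."""
    return make_dataclass(
        f"Partial{cls.__name__}",
        [(f.name, Optional[f.type], None) for f in fields(cls)],
    )

PartialUser = partial_of(User)
p = PartialUser(name="ada")   # age is simply omitted (defaults to None)
```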

Type Metadata in General

For the programmer, the primary function of “types” is to convey “metadata” through the program structure, as labels on the “pipes” containing the data, rather than having to build a detailed model of the program in the brain. There’s nothing about “types” themselves that distinguishes them from other kinds of information, and “constraints” are just one example of additional information that closely resembles “types” themselves.

Then we can ask: what other sorts of metadata do we find ourselves wanting to convey?

Type constraints make me think of a SQL table definition: the basic constraints above are like column constraints, while a constraint between different fields can be seen as a “row constraint” or “check”, and a constraint like SortedList acts like a “table constraint”. Then the other standard features of SQL tables make good examples of additional metadata:

  • documentation: why not attach a docstring to a type rather than documenting it everywhere it appears, especially in function arguments? I find myself wanting this constantly.
  • default_values: let a type have a built-in default to be supplied when a value is not given.
  • serialization metadata, e.g. to JSON, protobuf, or SQL, which is often annotated with language magic or macros.
  • a “unique” constraint, which says that each instance in scope or in memory must be distinct. This seems intuitive enough to me, but I’m sure it would be complicated in practice.

One also ought to be able to distinguish, say, a user_id from an event_id at the language level, even if both are of type “nonnegative int”. Scala has these under the strange names “value classes”, “opaque type aliases”, or “tagged types”; the “tagging” concept makes the most sense to me, but the particular syntax in Scala makes use of some obscure library macro syntax and I can’t recommend it. I certainly think this should be the norm, rather than passing around user_id: Int everywhere, and the type-checker needs to be designed such that it understands this. The right idea might be to always “tag” subtypes, but not “opaquely”: a UserId is still a subtype of Int even though it is also definitely a UserId. In some instances it might make sense for a cast of Int to UserId to require validation, but the condition “is a UserId” is un-validatable, so it has to be asserted with the “x verified as UserId” sort of construction from before. In any case this should be a first-class language construct, and the default rather than an exception to the rule; Scala’s feature always felt like an afterthought to me.
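
Python’s typing.NewType is a real, if minimal, version of the tagging idea: the tag is visible to a static checker like mypy but free at runtime:

```python
from typing import NewType

# NewType: distinct to a static checker (mypy), a plain int at runtime.
UserId = NewType("UserId", int)
EventId = NewType("EventId", int)

def get_user(uid: UserId) -> str:
    return f"user:{uid}"

uid = UserId(42)          # the "cast": no validation, just a tag
get_user(uid)             # ok
# get_user(EventId(42))   # mypy would reject this; Python itself won't
assert uid == 42          # at runtime it's still just an int
```

Note this is tagging without subtyping in the direction I want: to the checker, UserId is a subtype of int, but nothing stops the untagged int from circulating everywhere else.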

Then another ergonomic feature comes to mind: once you’re typing the variable user_id as UserId, you hardly need a variable name. In many cases it should be possible, as long as there aren’t two values of the same type in scope, to use the type name as the variable name, with variable names themselves being like “anonymous tags” to distinguish conflicting instances of the same type in scope. For example, using @UserId as if @ stands for the word “the”:

def login(UserId, password: Str) = {
  // here @UserId refers to "the user"
  token = checkPassword(@UserId, password)
  if token:
    user = getUser(@UserId)
    return loggedInResponse(user, token)
  else:
    ...
}

// here we name the two users
def transfer(amount: {Money, _>0}, u1: UserId, u2: UserId) = {
  ...
}

At this point I wonder if I’ve lost my mind and simply arrived at “variables” after circling the block. But I think there’s something here: a program is just a series of tubes, and not all of those tubes need names all the time!

In all likelihood it would make sense to allow the programmer to define their own kinds of “metadata”, and potentially to add “inference rules” to the compiler such that it can deduce properties from one another.

Applications

At this point I ought to give some more compelling examples of how this extended type system would be used. The clearest cases are the kinds of boilerplate seen in every script you write: configuration from a file or environment variables, and CLI parameters. With all the type system elements presented so far, it is possible to write the configuration and CLI spec for a simple app as, more or less, a single type:

Config = {
  database_url: PostgresURL,
  port: http.Port = 8000,
  n_processes: {Int, 0 < _ < cpu.n_processes} = 1,  // or could use _ = 1 for default assignment
  some_param: {Str, "a required configuration field"}   // documentation as a bare string in the object
}

config = env.for(Config, capitalize=True, prefix='my_app_')

// `some_param`'s doc becomes the --help description of that parameter
// database_url's doc would include the documentation for the type `PostgresURL` which would describe
// the format of that string
config = cli.for(Config)  

Branching can always be modeled with a union type PostgresConfig | SqliteConfig, say. In general my philosophy would be for the type system to recommend a standard way of doing things, with a normal imperative library always as a verbose option for complex cases.
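
As a hedged sketch of what env.for(Config) might do, here is a plain-Python version over a dataclass; env_for and its prefix handling are invented for illustration:

```python
import os
from dataclasses import dataclass, fields

@dataclass
class Config:
    port: int = 8000
    some_param: str = ""

def env_for(cls, prefix: str = "MY_APP_"):
    """Invented helper: fill a config struct from environment variables."""
    values = {}
    for f in fields(cls):
        raw = os.environ.get(prefix + f.name.upper())
        if raw is not None:
            values[f.name] = f.type(raw)   # coerce with the annotation
    return cls(**values)

os.environ["MY_APP_PORT"] = "9000"
config = env_for(Config)   # Config(port=9000, some_param='')
```

The point is how little the library has to invent once the struct carries types, defaults, and (in the proposed system) documentation and constraints.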

Another obvious example is in the translation to/from a SQL database: most of the metadata described so far has a direct translation into SQL, meaning either that struct types can be used directly in an ORM, or that they can be derived statically from a query (given one-off access to the DB itself from the editor, which is perhaps wishful thinking):

MyStruct = Type { ... }
myStructTable = orm.table_of(MyStruct)
// or add some type metadata
myStructTable = orm.table_of({ MyStruct, primary_key = id })


// the language should be able to introspect the query to determine the return type, and then fill it in
// or check it
rows: List[...] = db.query(q).fetch_all()
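
A minimal sketch of what orm.table_of could derive from a struct’s fields; the type mapping and the function name are assumptions, not a real ORM:

```python
from dataclasses import dataclass, fields

# Invented mapping from Python annotations to SQL column types.
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}

@dataclass
class MyStruct:
    id: int
    name: str

def table_of(cls, primary_key=None) -> str:
    """Invented helper: derive a CREATE TABLE statement from a struct."""
    cols = []
    for f in fields(cls):
        col = f"{f.name} {SQL_TYPES[f.type]}"
        if f.name == primary_key:
            col += " PRIMARY KEY"
        cols.append(col)
    return f"CREATE TABLE {cls.__name__.lower()} ({', '.join(cols)})"

table_of(MyStruct, primary_key="id")
# CREATE TABLE mystruct (id INTEGER PRIMARY KEY, name TEXT)
```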

As mentioned above, the possibility of defining additional “type metadata” in the language itself will give a place to put the kinds of extra annotations (serialization formats, aliased names, etc) that normally would be done with meta-level language features like macros, which one sees all the time in these sorts of examples.

The most natural application of the type system I’ve described is in the basic “domain models” in every program. The features I’ve described provide a middle ground between the patterns employed in one-off scripts, where one typically writes functions that declare their arguments directly with little organization into a sensible set of types or structs, and the kind of verbose, detailed organization occurring in large programs.

The popularity of dataclasses and Pydantic in Python demonstrates this: people really want a more sophisticated yet lightweight “struct” class in their everyday programming language, and will jump through considerable—and often illegible—hoops to get one. Pydantic’s docs are more or less a list of the same features discussed here, but bolted onto the inadequate constructs of Python itself, with considerable cost in complexity.

Of the languages I’ve used, Scala gets closest to realizing the ideal of a type system, but is—it seems to me—hopelessly confined to the compiler-focused sense of “types”. The “types” in the current essay are instead a kind of “privileged object” within the language, which interact with the type-checker and get special syntaxes but, being objects, can also be interacted with normally. The kinds of types the JVM expects could be produced by a compilation stage that strips away the extra metadata from the types I’ve discussed here.

Typescript too has converged in a similar direction, coming out better than Python but still riddled with “jankiness”, in my opinion. It’s telling that all these languages seem to be reaching for the same “ideal”: what I’ve described is my attempt to glimpse the ideal that these languages are already converging on. What holds them back is thinking of type systems from the “compiler” side of things rather than as a user question: how do we get the features we want? How close can we get; how many imperfections can we tolerate? The “click” in this kind of thinking is to realize that “conveying structured metadata through a program” is incredibly valuable on its own, even if you make little or no effort to enforce it statically.

With all that said, the next post in this series will step out of the weeds of single types and variables to look at programming from the opposite end: the high level.