After months away I have once again picked up the tools of programming, and have once again been assailed by a torrent of frustrations: nothing is clean, I cannot express my intention, all my ideas are mired in implementation details, everything is documented badly, if at all… surely we can do better…

Part 1: Type System Ergonomics

Starting small. Data types are, on one hand, required for languages and their compilers to operate sanely: to allocate memory efficiently and safely, say.

On the other hand, types are needed to help the user of the language operate sanely: as documentation, to guard against programmer errors, etc.

The type “systems” (if they can be called that) added to Python and JavaScript in recent years clearly prioritize the second sense, as they have no effect on “memory allocation” anyway, and typically do not even cover the entire codebase.

But the type systems so far added to high-level dynamic languages seem to be rooted in the concept of “types” as it appears at the low-level and compiler-focused end of the spectrum, working “upward” from there to add a few ergonomic features, rather than starting from the “user ergonomics” end and working “downward” to add static guarantees. The resulting state of things is quite inadequate to the needs of actual programmers. Perhaps two systems are needed which meet in the middle: one rigorous and verified, one imperfect and only checked as well as feasible; people don’t seem to get too bothered by imperfect type systems anyway.

Here then are some things I think any sane language ought to have (there are no sane languages, as far as I know). I’ve never written a programming language, and as such am only lightly familiar with technical terminology. Consider this to be a user’s feature-asks, to be treated as you would the requests of any user.

Constraints

First up: constraints. (This is my catch-all name for “refinement types”, “contracts”, and “dependent types”.)

This is so basic I can’t believe it’s not commonplace. There is a difference between an integer variable

x: int

and an

x: int
assert x > 0

The fact that x is positive should flow through the program just like the fact that x is an int. You should check this constraint once when instantiating x at the program’s “boundary”, and then if you receive an x declared this way you can trust that the constraint holds.

It should additionally not require a separate assert or a wrapper class or any convoluted trait or the like to declare this. The syntax should be immediately obvious to a high schooler learning to program for the first time. The typical syntax is like

Type nnInt = {x: int | x > 0}

but this is ugly, first because of the extraneous variable name x, and second because in many cases I would like to avoid having to give the constrained type a name, instead declaring it inline. My instinct is to use a Scala-like “anonymous argument” symbol _, acting as a stand-in for the value being constrained:

x: {int, _ > 0}

Or perhaps we do without the braces, though this obviously collides with objects in JS and tuples in Python:

x: int, _ > 0

Perhaps both are supported when unambiguous.
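For what it’s worth, today’s Python can approximate this shape with typing.Annotated, letting a predicate travel alongside the base type and get checked once at the boundary. The Constraint wrapper and validate helper below are invented names, a minimal sketch rather than any real library:

```python
from typing import Annotated, get_args

# Hypothetical Constraint wrapper: the predicate travels with the type.
class Constraint:
    def __init__(self, predicate):
        self.predicate = predicate

PositiveInt = Annotated[int, Constraint(lambda v: v > 0)]

def validate(value, annotated_type):
    """Check a value against an Annotated type at the program's boundary."""
    base, *metadata = get_args(annotated_type)
    if not isinstance(value, base):
        raise TypeError(f"expected {base.__name__}, got {type(value).__name__}")
    for m in metadata:
        if isinstance(m, Constraint) and not m.predicate(value):
            raise ValueError(f"constraint failed for {value!r}")
    return value
```

The check runs only where validate is called; inside the program the annotation is pure documentation, which is exactly the “imperfect, checked as well as feasible” half of the two-system idea.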

(My code examples will be in no particular language—some kind of blur of Python, Typescript, and Scala.)

These syntaxes read as if _ > 0 is a “statement”, with the existence of such an x implying that any statements in its definition are true. I then imagine giving int the same semantics, though this would conflict with the conventional meaning of int(x) as a cast or constructor. Maybe it makes sense to define a type as a list of “constraint statements”:

x: [int, _ > 0]

I’ll use { ... } for the rest of the post, but I suspect I will eventually prefer the square brackets [ ... ]. We’ll see.

By all means give a type a name:

Email = Type {
  Str,
  regex = re("[^@]+@[^@]+\.[^@]+"), // Checks @ and .
  regex.match(_)
}

def send_email(msg: str, e: Email) = { ... }

This has the advantage of letting us interpret Email itself as a sort of object. One could elsewhere refer to Email.regex to make use of the same (statically-defined) regex object.
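As a rough Python sketch of the “type as object” idea: a str subclass can carry its regex as class data, validate once at construction, and expose Email.regex for reuse elsewhere. The class below is illustrative, not a proposal for real syntax:

```python
import re

class Email(str):
    """A str subtype that carries its own validation rule as class data."""
    regex = re.compile(r"[^@]+@[^@]+\.[^@]+")

    def __new__(cls, value):
        # validate once, at the boundary; afterwards the type is trusted
        if not cls.regex.match(value):
            raise ValueError(f"invalid email address: {value!r}")
        return super().__new__(cls, value)
```

Any code holding an Email can trust the constraint, and Email.regex refers to the same statically defined object from anywhere.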

One advantage of type constraints is not having to redeclare a bunch of things between a method’s docstring and the method body.


def send_email(msg: Str, e: Str):
  """
  send an email
  
  args: 
  - msg  string  the msg to be sent
  - e    string  the email address, which must contain the characters @ and .
  """
  if not re("[^@]+@[^@]+\.[^@]+").match(e):
    raise "Invalid Email Address"
  ...

Instead we can encode as much as possible in the types themselves, reusing the logic and documentation between all instances. Then we can trust that any argument we get has already been validated as an email:

Email = Type { ... }

def send_email(msg: Str, e: Email):
  """
  send {msg} to {e}.
  
  (an example of a succinct docstring)
  """  
  // no checks needed.


msg: Str = "hi"
e: Str = "a@b.c"

// validation occurs at call site when passing a Str 
send_email(msg, e)

In that example a full docstring could be programmatically generated; those format-strings like {e} would be filled in with something like the type information rather than the value of e itself.

If instead you passed a pre-validated Email, the validation would be skipped, which of course is what you would want to do for good performance. To the extent possible, values should be validated when they enter the program’s “boundary” and then trusted through the rest of execution, which of course is what types do already.

A few other types of constraints come to mind quickly:

  • constraints between multiple fields in a structure, which would be thought of as constraints on the enclosing structure rather than on the types of the individual fields.
  • a condition on the length of a list (which, if checked at compile time, would be an example of a “dependent type”). You could then place constraints on the arguments to a function, e.g. List.head would require a nonempty list, or Vec.dot would require two vectors of the same length. (All linear algebra code would be vastly more enjoyable to work with given the help of some type constraints.)
  • a condition that a list is sorted; this is preferable to a dumb wrapper SortedList that conveys the information to a reader but carries no semantic information.
  • that a type is non-null (in languages where such things are allowed and it’s not sufficient to distinguish int | None from int).
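Lacking real dependent types, the list constraints above can at least be sketched as runtime checks at the function boundary. In Python, hypothetical head and dot helpers might look like:

```python
def head(xs):
    # runtime stand-in for a "nonempty list" constraint on the argument
    if not xs:
        raise ValueError("head requires a non-empty list")
    return xs[0]

def dot(xs, ys):
    # runtime stand-in for a "same length" constraint between the arguments
    if len(xs) != len(ys):
        raise ValueError("dot requires vectors of the same length")
    return sum(x * y for x, y in zip(xs, ys))
```

The point of a real constraint system is that these checks would live on the types, be enforced once, and ideally be discharged statically rather than repeated in every function body.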

Constraint Validation

Where should constraints be validated? And when?

The obvious answer, already mentioned, is at the point where the type is instantiated. This would amount to a type “contract”—a runtime type check.

But in the example of a SortedList, we wouldn’t necessarily want an automatic check—it would amount to an extra unnecessary pass over the data after sorting.

The best case would be a compile-time check that the sorting algorithm actually produces a sorted list, with the aid of some kind of theorem prover (making this a “refinement type”). But for non-trivial examples this is not likely to be possible. In most instances it would be overkill anyway: let the programmer figure it out, and let the test cases prove it to their satisfaction. Therefore it should also be possible to optionally put off a check in a constructor, i.e. by marking the instance unchecked or verified or something:

// I'm not sure what the best syntax is here.
// Maybe just _.isSorted
SortedList[T] = Type {list[T], _[i] < _[i+1] for i in _.index}

def mergeSort[T](xs: list[T]) = {
  ... // sorting algorithm here
  xs = ...
  return xs verified as SortedList[T]
}
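A runtime approximation of this escape hatch in Python: the constructor checks the invariant unless the producer vouches for it. The verified flag is an invented name mirroring the pseudocode above:

```python
class SortedList(list):
    """A list whose constructor checks sortedness unless the caller vouches."""
    def __init__(self, items, *, verified=False):
        super().__init__(items)
        # skip the extra O(n) pass when the producer asserts the invariant
        if not verified and any(a > b for a, b in zip(self, self[1:])):
            raise ValueError("list is not sorted")

def merge_sort(xs):
    # stand-in for a real sorting algorithm; the result is sorted by
    # construction, so the check is skipped, as with "verified as" above
    return SortedList(sorted(xs), verified=True)
```
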

I imagine doing this incrementally: a compiler checks as much as it can based on its ability to understand the algorithm; the programmer asserts whatever is left to close the gap to the constraint desired; perhaps the compiler can specifically request any assertions needed to complete its proof. Over time the compiler can be improved further and the gaps closed. I’m sure refinement-type languages already do this—but let’s go mainstream with it already, please?

Anonymous Type Definitions

It would often be handy to define a type inline and then refer to it elsewhere. Something like this:

an_email: {
  Str,
  regex = re("[^@]+@[^@]+\.[^@]+"), // Checks @ and .
  regex.match(_)
} = ...

...
def send_email(msg: str, e: typeof(an_email)) = { ... }

That typeof is ugly, though. For something so fundamental I’d prefer a symbol, such as ~an_email, with ~ having a bit of a sense of “similar to”.

The main application of this is an ergonomics feature I find myself wanting all the time: the ability to refer to input and output types of a function anonymously. For the time being let’s define f(~) to refer to the input type and f~ for the output type. Then an example would be:


def step1 = (a: int, b: str) => {
  ...
  return {x: b, y: a + 1}
}

def step3 = {xs: [int]} => { ... }

def step2 = step1~ => step3(~)

Here we therefore have:

step1~ == Type {x: str, y: int}
step3(~) == Type {xs: list[int]}

Obviously you could define the required types separately, but that would be quite verbose! And one issue here is that the set of “valid function signatures” is different from the set of “valid types” in most languages—most languages’ functions can take both positional and keyword arguments. The basic struct/object type would have to be able to represent the same things, which would make all functions interpretable as a mapping from a single object type to another. But that’s a convenient unification, so why not?
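In Python, introspection can already fake these accessors for annotated functions; input_type and output_type below are hypothetical stand-ins for the f(~) and f~ syntax:

```python
from typing import get_type_hints

def step1(a: int, b: str) -> dict:
    return {"x": b, "y": a + 1}

def input_type(f):
    """Approximate f(~): the argument names and types of f."""
    hints = dict(get_type_hints(f))
    hints.pop("return", None)
    return hints

def output_type(f):
    """Approximate f~: the declared return type of f."""
    return get_type_hints(f).get("return")
```

The difference is that these return plain runtime data, whereas the proposal is for the type-checker itself to understand and propagate them.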

Some other particularly useful “anonymous types” would be:

  • a copy of another type but without any of its constraints, to be used as a container for unvalidated input
  • a copy of another type but with all its fields optional, to be used as a builder or partial constructor
  • a subset of the fields of another struct, including a syntax for “all fields except…”
  • a sub-struct within a struct
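The “all fields optional” variant, at least, is mechanically derivable today. A sketch in Python using dataclasses, with partial_of as an invented helper name:

```python
from dataclasses import dataclass, field, fields, make_dataclass
from typing import Optional

@dataclass
class User:
    id: int
    name: str

def partial_of(cls):
    """Invented helper: a copy of cls with every field made optional."""
    return make_dataclass(
        "Partial" + cls.__name__,
        [(f.name, Optional[f.type], field(default=None)) for f in fields(cls)],
    )

PartialUser = partial_of(User)
```

The other variants (strip constraints, subset of fields) would follow the same pattern of deriving one type from another’s field metadata.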

Type Metadata in General

For the programmer, the primary function of “types” is to convey “metadata” through the program structure, as labels on the “pipes” containing the data, rather than having to build a detailed model of the program in the brain. There’s nothing about “types” themselves that distinguishes them from other kinds of information, and “constraints” are just one example of additional information that closely resembles “types” themselves.

Then we can ask: what other sorts of metadata do we find ourselves wanting to convey?

Type constraints make me think of a SQL table definition: the basic constraints above are like column constraints, while a constraint between different fields can be seen as a “row constraint” or “check”, and a constraint like SortedList acts like a “table constraint”. Then the other standard features of SQL tables make good examples of additional metadata:

  • documentation: why not attach a docstring to a type rather than documenting it everywhere it appears, especially in function arguments? I find myself wanting this constantly.
  • default values: let a type have a built-in default to be supplied when a value is not given.
  • serialization metadata, e.g. to JSON, protobuf, or SQL, which is often annotated with language magic or macros.
  • a “unique” constraint, which says that each instance in scope (or perhaps in memory) must be distinct. This seems intuitive enough to me, but I’m sure it would be complicated in practice.

One also ought to be able to distinguish, say, a user_id from an event_id at a language level, even if both are of type “nonnegative int”. Scala has these under the strange names “value classes” or “opaque type aliases” or “tagged types”; the “tagging” concept makes the most sense to me, but the particular syntax in Scala makes use of some obscure library macro syntax and I can’t recommend it. I certainly think this should be the norm, rather than passing around user_id: Int everywhere, and the type-checker needs to be designed such that it understands this. The right idea might be to always “tag” subtypes, but not “opaquely”: a UserId is still a subtype of Int even though it is also definitely a UserId. In some instances it might make sense for a cast from Int to UserId to require validation, but the condition “is a UserId” is un-validatable, so it has to be asserted with the “x verified as UserId” sort of construction from before. In any case this should be a first-class language construct, and the default rather than an exception to the rule; Scala’s feature always felt like an afterthought to me.

Then another ergonomic feature comes to mind: once you’re defining the type user_id as UserId, you hardly need a variable name. In many cases it should be possible, as long as there aren’t two values of the same type in scope, to use the type name only, with variable names themselves being like “anonymous tags” which distinguish between conflicting instances of the same type in scope. For example, using @UserId as if @ stands for the word “the”:

def login(@UserId, password: Str) = {
  // here @UserId refers to "the user"
  token = checkPassword(@UserId, password)

  // or maybe your @UserId is passed implicitly, and anon values can similarly be 
  // injected into the current scope from return values
  if token:
    @getUser(token)  // shorthand for @User = getUser(@UserId, token)
    return loggedInResponse(token)
  else:
    ...
}

// here we name the two users
def transfer(amount: {Money, _>0}, u1: UserId, u2: UserId) = {
  ...
}

Or you could skip passing @UserId explicitly at all, relying on the implicit injection sketched above.

At this point I wonder if I’ve lost my mind and simply arrived at “variables” after circling the block. But I think there’s something here: a program is just a series of tubes, and not all of those tubes need names all the time!

In all likelihood it would make sense to allow the programmer to define their own types of “metadata”, and potentially to add their “inference rules” to the compiler such that it can deduce properties from each other. But it would take a prototype to start to see what these would look like from here. Better to move on.

Applications

At this point I ought to give some more compelling examples of how this extended type system would be used. The clearest cases would be the sorts of boilerplate you put in every script—configuration from a file or environment variable and CLI params. With all the type system elements presented so far, it should be possible to write the configuration and CLI spec for a simple app as, more or less, a single type:

Config = {
  database_url: PostgresURL,
  port: http.Port = 8000,
  // or could use _ = 1 for a default
  n_processes: {Int, 0 < _ < cpu.n_processes} = 1,
  // documentation as a bare string inside the type?
  some_param: {Str, "a required configuration field"}
}

config_from_env: Config = env.get(Config, capitalize=True, prefix='my_app_')

// `some_param`'s doc becomes the --help description of that parameter
// database_url's doc would include the documentation for the type `PostgresURL`,
// which would describe the format of that string
config_from_cli: Config = cli.get(Config)  

// and then you can easily write "cli with env fallback"
// maybe a special syntax. It's like a field-wise "x or y".
config: Config = merge(config_from_cli, config_from_env)
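The sketch above can be approximated in plain Python today; the from_env helper, the MY_APP_ prefix, and the dict-based merge below are illustrative assumptions, not an existing library:

```python
import os
from dataclasses import dataclass, fields

@dataclass
class Config:
    database_url: str
    port: int = 8000
    n_processes: int = 1

def from_env(prefix="MY_APP_"):
    """Illustrative boundary loader: collect whichever fields appear in env."""
    found = {}
    for f in fields(Config):
        raw = os.environ.get(prefix + f.name.upper())
        if raw is not None:
            found[f.name] = f.type(raw)  # cast the string to the declared type
    return found

os.environ["MY_APP_PORT"] = "9090"
env_part = from_env()
cli_part = {"database_url": "postgres://localhost/db"}
# field-wise "cli with env fallback": dict merge, with cli values winning
config = Config(**{**env_part, **cli_part})
```

What the type-driven version adds is that the casting, defaults, constraints, and --help documentation would all come from the Config type itself instead of being rebuilt per loader.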

Branching can always be modeled with a union type PostgresConfig | SqliteConfig, say. In general my philosophy would be for the type system to recommend a standard way of doing things, with a normal imperative library always as a verbose option for complex cases.

Another obvious example is in the translation to/from a SQL database: most of the metadata described so far has a direct translation into SQL, meaning either that struct types can be used directly in an ORM, or statically derived from a query (given one-off access to the DB itself from the editor, which is perhaps wishful thinking):

MyStruct = Type { ... }
myStructTable = orm.table_of(MyStruct)
// or add some type metadata
myStructTable = orm.table_of({ MyStruct, primary_key = id })


// the language should be able to introspect the query to determine the return type, and then fill it in
// or check it
rows: List[...] = db.query(q).fetch_all()
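As a sketch of how far field metadata alone gets you toward orm.table_of: deriving a CREATE TABLE statement from a dataclass. The SQL type mapping and the table_of signature are invented for illustration:

```python
from dataclasses import dataclass, fields

# illustrative mapping from language types to SQL column types
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}

@dataclass
class MyStruct:
    id: int
    name: str

def table_of(cls, primary_key=None):
    """Hypothetical orm.table_of: derive a DDL statement from field metadata."""
    cols = []
    for f in fields(cls):
        col = f.name + " " + SQL_TYPES[f.type]
        if f.name == primary_key:
            col += " PRIMARY KEY"
        cols.append(col)
    return "CREATE TABLE " + cls.__name__.lower() + " (" + ", ".join(cols) + ")"
```

With the richer metadata discussed earlier (constraints, defaults, docs), the same derivation could emit CHECK clauses, DEFAULT values, and column comments.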

As mentioned above, the possibility of defining additional “type metadata” in the language itself will give a place to put the kinds of extra annotations (serialization formats, aliased names, etc) that normally would be done with meta-level language features like macros, which one sees all the time in these sorts of examples.

The most natural application of the type system I’ve described is in the basic “domain models” in every program. The features I’ve described provide a middle ground between the patterns employed in one-off scripts, where one typically writes functions that declare their arguments directly with little organization into a sensible set of types or structs, vs the kind of verbose detailed organization occurring in large programs.

The popularity of dataclasses and Pydantic in Python demonstrates this: people really want a more-sophisticated and lightweight “struct” class in their everyday programming language, and will jump through considerable—and often illegible—hoops to get one. Pydantic’s docs are more-or-less a list of the same features discussed here, but bolted on to the inadequate constructs of Python itself, with considerable cost in complexity.

Of the languages I’ve used, Scala gets closest to realizing the ideal of a type system, but is—it seems to me—hopelessly confined to the compiler-focused sense of “types”. The “types” in the current essay are instead a kind of “privileged object” within the language, which interact with the type-checker and get special syntaxes, but, being objects, can also be interacted with normally. The kinds of types the JVM expects could be produced by a compilation stage that strips away the extra metadata from the types I’ve discussed here.

TypeScript too has converged in a similar direction, coming out better than Python but still riddled with “jankiness”, in my opinion. It’s telling that all these languages seem to be reaching for the same “ideal”: what I’ve described is my attempt to glimpse the ideal that these languages are already converging on. What holds them back is thinking of type systems from the “compiler” side of things rather than as a user question: how do we get the features we want? How close can we get; how many imperfections can we tolerate? The “click” in this kind of thinking is to realize that “conveying structured metadata through a program” is incredibly valuable on its own, even if you make little or no effort to enforce it statically.

With all that said, the next post in this series will step out of the weeds of single types and variables to look at programming from the opposite end, where I’ll try to put into words some of the features of high-level program organization which our present-day languages are already trying to converge to.