On Domain-Specific Languages

I wrote a little bit about CMake the other day, which left me thinking "I've done a fair amount of work on domain-specific languages, what do I know about how to get them right?" aka "It's easy to criticise"!

I've written a bit about DSLs before, mostly from the point of view of configuration languages. I'm now going to try to approach it from another angle: How to make a good DSL.

Structure the DSL to match the domain

In my exotic derivatives payout language FPF, I included a "fold" functor. In retrospect, this was sub-optimal. We nearly always wanted to perform some aggregation across a set of assets or a set of dates. These have different semantics: The former is usually a basket of stocks evaluated together, while the latter represents doing something in timesteps. By distinguishing between the two, you can more cleanly manage various aspects of the trade lifecycle - a time fold represents lifecycle steps whereas an asset fold doesn't.

In a similar way, we started supporting "strategy trades" with FPF. You don't need the details, but they were a poor fit for the existing infrastructure, and the experience sucked. The language should match the domain.

Outside DSLs, a similar thing you see is the effort C compilers need to go through, reversing "for" loops into the underlying iteration structure. Is this a map, parallelisable over an array? The programmer knows, the compiler wants to know, the language obscures. How sad.

Mathing the domain often means making the DSL declarative rather than imperative; expressing intent rather than a detailed plan. The canonical example is a build system (like make) - express dependencies and how to build each step, and let the system do the rest.

You see the same idea with SQL, and in the infrastructure world this is "intent-driven configuration", where the DSL system will do what's necessary to close the gap between the config and reality.

A key thing to realise is that declarative is not always better. Sometimes the domain requires control. Vanilla SQL leaves so much unspecified that performance is unpredictable - horrific for the production serving path. A naive intent-based configuration system skims over important aspects of how to roll out incrementally and safely.

Consider carefully how to map the domain to your language.

Consider your users when designing the DSL

In my limited experience, DSLs have two core demographics: A small number of experts, or a large number of passers-by.

In the former case, you're building a tool that will be heavily used by a small number of people. It's probably a key part of their job, and the investment in the DSL is to make them more efficient. I dealt with this in banking a couple of times, with FPF and with a language for algorithmic trading. It's really nice: You can do subtle, clever things, and your expert users will thank you for improving their effectiveness.

In the latter case, you're dealing with one corner, often neglected, of a wider world, often incidental to the key role of many, many people. CMake is an example of this. Parser generators and other mini-languages fall in this categoy. Users don't want to become experts, they want to get the job done. Anything subtle or complex will make your users hate you. The tool should be obvious, guessable, and friendly to untrained users. This is really hard. I am glad I've never needed to do this

Languages need decent tooling

Like pets, DSLs need a lot of tedious extra work to thrive. In particular, they need debugability. Ranging from sensible syntax errors through the ability to trace execution, maintainability is probably the biggest tax on DSLs. Please don't neglect this angle, particularly if you're writing a "many shallow users" DSL.

Is this a configuration language?

Pretty much anything can be shoe-horned into a configuration language, because code can be expressed as data. However, some things have a natural representation, and fighting that leads to sadness. XSLT still upsets me, an XML transformation about as ugly as it's possible to be, just because it felt it needed to be represented by XML.

If you really are just trying to express something fairly config-ish, such that formatting it as a config file does not obfuscate the meaning, then use an existing config library/language. There's no point reinventing the wheel in a well-populated space, especially if it increases the barrier to entry on a "many shallow users" DSL. "Few expert users" situations tend to be much less like configs.

This is probably where I can get a personal grudge out: In a workplace of almost universal protobufs, it's very tempting to configure everything as one, and... it can get ugly. Do you really want to express "5 < 10" as "left: 5, op: LESS_THAN, right: 10"?

Have you considered an Embedded Domain Specific Language?

A cheap way to make a DSL is to embed it in another language. This can decrease the debugability/maintainability, but you understand trade-offs, right?

"Embedding" here usually means providing functions that generate their own syntax tree, that can then be interpreted. Sometimes you build a mini-language by having the functions directly do the thing they describe, but you get a lot more flexibility by generating the syntax tree. I'm not an ML person, but I think this is the approach behind TensorFlow, for example.

I really like EDSLs, and they provide a spectrum from full access to a general-purpose programming language through to a highly-constrained language.

There are some subtle constraints on EDSLs. As evaluation generates a syntax tree (well, DAG) from the call tree, let-binding and other structural hints from the original syntax are lost. This works against debuggability and makes analysis harder.

Think carefully about the edges of your language

If you are writing an EDSL, you need to choose your embedding language carefully, and consider how much of the original language you support. Haskell is a surprisingly nice language for embedding in, and it's easy to disable the standard prelude. Lua also makes it straightforward to control exactly what functionality is available - e.g. creating a sandbox with zero filesystem access. Python is a perennial favourite for embedding DSLs, but my experience has been that it's a right pain to sandbox.

You need to decide whether you want you DSL to basically be the embedding language with a few extra features, or your own language that just happens to share a syntax. This also ties up with whether your DSL should be Turing-complete, as I discussed previously. Spoiler alert: I favour constrained languages as more maintainable.

Python inevitably encourages you to write Python programs with your own extras on top. I've seen multiple projects at scale where there was a big clean-up effort later to make the DSL code-base into a properly constrained language, taking out all the general-Python-isms. It's miserable. I personally think the lesson is to not use Python for EDSLs, it's too leaky, but no-one listens to me. ;)

Use programming language theory

Either you decide up-front that you prefer a proper stand-alone language, or maybe you've outgrown an EDSL. If you're heading down that path build a proper language. The theory is there, use it. Design a syntax that looks like a proper programming language, with an intent to be used, not some can't-be-bothered mess.

Here, I'm clearly looking at CMake and its ridiculous confused-config-language syntax ("else()"? Really?), but also at TeX and its horrible backslashed macro messes obscuring a Turing-complete language and making proper work on (La)TeX styles incredibly tedious.

At a smaller level, it also means taking syntax seriously so that you don't end up with "but it seemed like a convenient shortcut"isms like the way make cares so deeply about tabs at the start of lines.

Obviously the language you're building shouldn't look like a config language, because if it did, you'd be using an existing config language.

Don't build a text-manipuation language

Unless your domain really is text, don't make your language a text-processing language. By this, I mean you shouldn't look to languages like the Bourne shell or Tcl for inspiration. Don't build a templating language based on textual substitution.

Such languages are incredibly brittle. If your DSL has any security requirements, it'll screw them up, since plain textual substitution inevitably leads to escaping issues. You can get yourself in a pickle trying to work out how many levels of evaluation are needed to substitute all the variables in all the strings that contain other variables. And just fundamentally, you're not modeling the domain any more, are you? You're playing with strings.

Any time I see a DSL with $DOLLAR_PREFIXED_SHOUTY_VARIABLES, I get really twitchy. I've not been comfortable writing code like that since BASIC with line numbers. Shell is not a well-designed programming language, and anything that decides to ape it is ill-thought-through. Shell is compact and convenient, admittedly, but it is quite possible to build convenient, efficient DSLs without copying a syntactic approach associated with unmaintainable hacked-up code. We have learnt a lot in the last forty years.

There's probably a bunch of other hugely important factors to take into account writing a DSL that I've forgotten. There's plenty of space to get a little less abstract and start giving examples of DSLs that are good, maybe develop one myself. However, I feel this is enough for now.

Posted 2023-04-30.