Determinism is awesome

I saw a post on Mastodon saying "the older i get, the more i realize that I/O is at the bottom of literally every engineering annoyance i deal with. the further i can push I/O to the edge of my program state, the better." This is something I agree with, and it got me thinking about what it is about this approach that I really appreciate.

Pulling the I/O out has several advantages. I/O is great at generating errors, so pulling it out lets you lump all the error handling together, rather than propagating errors through the whole stack of code. In the opposite direction, it makes mocking for testing easier, since you don't have to dependency-inject a mock right into the core of your code. And pulling I/O out helps make your core code deterministic.
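
To make that concrete, here's a minimal sketch of the shape (the example itself is mine, not from any real system): a pure core function with a thin I/O shell around it.

    #include <iostream>
    #include <stdexcept>
    #include <vector>

    // Pure core: no I/O. Given the same prices, it always gives the same
    // answer, so it's trivially unit-testable and replayable.
    double averagePrice(const std::vector<double>& prices) {
        if (prices.empty()) throw std::invalid_argument("no prices");
        double sum = 0.0;
        for (double p : prices) sum += p;
        return sum / prices.size();
    }

    // Imperative shell: all the I/O, and hence most of the error handling,
    // lives out here at the edge.
    int main() {
        std::vector<double> prices;
        double p;
        while (std::cin >> p) prices.push_back(p);
        try {
            std::cout << averagePrice(prices) << '\n';
        } catch (const std::exception& e) {
            std::cerr << "error: " << e.what() << '\n';
            return 1;
        }
    }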

To elaborate on that final point, I/O is great at not behaving the same every time. Doing I/O interacts with a giant, hidden pile of state. In contrast, pure-functional-style code that does no I/O has a decent chance of working the same every time.

For me, based on my early career, determinism is a big win. I spent most of a decade working on pure-functional-style systems - a quant (financial maths) library, and a DSL (domain-specific language) written in Haskell. In these systems, you send in some input and get some output. The output is determined solely by the input, and the output is the only thing you care about. Testing is just sending in input and checking the output, and if you provide the ability to record inputs and outputs in prod, you can trivially debug offline. No heisenbugs or schrödinbugs. I find fixing bugs to be one of the most unpredictably time-consuming parts of development, and with deterministic code life is good (usually - see later).

(I'm going to play a little fast and loose with terminology by conflating pure functional behaviour with deterministic code. Sorry. I'm sure you'll work it out.)

I later spent a couple of years working on an algorithmic trading system, and while it did plenty of I/O, the core algos were deliberately written in a pure functional style, with very explicit inputs and outputs. The huge advantage of that is that exactly the same code would run in exactly the same way against live markets and against recorded data for (back) testing. I've heard of similar approaches being used in computer games.
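
A sketch of that shape (all the names here are hypothetical, not the real system): the algo is a function from state and market event to new state and orders, and only the surrounding driver knows whether the events are live or recorded.

    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical types, just to show the shape.
    struct MarketEvent { std::string symbol; double bid, ask; };
    struct Order       { std::string symbol; double price; int quantity; };
    struct AlgoState   { double lastMid = 0.0; };

    // The algo itself is a pure function: same state and event in,
    // same state and orders out, whether live or in a backtest.
    std::pair<AlgoState, std::vector<Order>>
    step(const AlgoState& state, const MarketEvent& ev) {
        AlgoState next = state;
        next.lastMid = (ev.bid + ev.ask) / 2.0;
        std::vector<Order> orders;
        // ... decide orders purely from `next` and `ev` ...
        return {next, orders};
    }

    // Only the drivers differ, and they contain all the I/O: the live
    // driver feeds in exchange events and sends the orders out, while the
    // backtest driver feeds in recorded events and checks the orders.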

From there, I moved on to large-scale distributed systems. There's no hiding the I/O there. Yet even there, you do what you can to regain control. Making the I/O into uniform RPCs allows you to apply standard patterns. Standardise the error handling, e.g. by making things idempotent and limiting the error-handling logic to "try again" or "give up". Build observability tooling, and use heavy logging to catch the inputs and outputs you can. If you squint, you can try to design each RPC handler as a pure function. It won't be perfect, but you can push towards that style as much as possible.
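
To sketch what that squinting looks like (an invented example, not any particular RPC framework): give each request an idempotency key, make the handler's output depend only on the request and the state you explicitly hand it, and keep retries as a uniform policy on the outside.

    #include <map>
    #include <string>

    // Hypothetical request/response types; the request carries an
    // idempotency key so retries are safe.
    struct Request  { std::string key; int amount; };
    struct Response { bool ok; int newBalance; };

    // The handler is as close to a pure function as we can get: its output
    // is determined by the request and the state explicitly passed in.
    // Transport, storage and retries all stay outside.
    Response handle(std::map<std::string, int>& processed, int& balance,
                    const Request& req) {
        auto it = processed.find(req.key);
        if (it != processed.end()) {
            // Seen this key before: return the same answer as last time.
            return {true, it->second};
        }
        balance += req.amount;
        processed[req.key] = balance;
        return {true, balance};
    }

    // The caller side then gets one uniform error-handling policy for
    // every RPC: retry a few times, then give up.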

If you extend your use of determinism, you get nice extras. If your top-level system behaves in a pure functional way and is built out of pure functional components, you get compositionality: each sub-component can be analysed separately in terms of its inputs and outputs, and you can take an easy, reductionist approach to debugging, dividing and conquering. Refactoring can be done safely through local changes, and tests can be arranged hierarchically. This is roughly what you get with the encapsulation of OO and its associated unit tests. From another angle, if you can fully enumerate the intermediate state, you can probably checkpoint your code.
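
A toy illustration of why that works (the stages and types here are invented): when each stage is pure, the value passed between stages is the whole of the intermediate state, so you can test, bisect and checkpoint at that boundary.

    #include <vector>

    // An invented two-stage pipeline with an explicit intermediate value.
    struct RawQuotes { std::vector<double> quotes; };
    struct Curve     { std::vector<double> rates; };
    struct Price     { double value; };

    // Pure: quotes -> curve (toy stand-in for a real bootstrap).
    Curve buildCurve(const RawQuotes& q) {
        Curve c;
        for (double quote : q.quotes) c.rates.push_back(quote / 100.0);
        return c;
    }

    // Pure: curve -> price (toy stand-in for a real pricer).
    Price priceTrade(const Curve& c) {
        double sum = 0.0;
        for (double r : c.rates) sum += r;
        return {sum};
    }

    Price priceFromQuotes(const RawQuotes& q) {
        Curve curve = buildCurve(q);
        // Because both stages are pure, `curve` is the *entire* intermediate
        // state, so you can:
        //  - test buildCurve and priceTrade separately,
        //  - debug by bisection: is the curve wrong, or the pricing?
        //  - checkpoint/resume by serialising the Curve and restarting here.
        return priceTrade(curve);
    }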

I/O is a key determinism-killer, but it's not the only one. Some I/O is non-obvious: caring about the current time introduces non-determinism. Multi-threading is well known for creating non-determinism, and any multi-threaded code needs a clear model of why it's going to behave correctly - "human thinks hard" won't cut it. Using real random numbers obviously creates non-determinism; outside of cryptography, you almost always actually want pseudo-random numbers, for reproducibility. And perhaps most frustrating of all, memory-unsafety bugs create a particularly malicious, hard-to-debug form of non-determinism, with spooky action at a distance: the damage turns up far away from, and much later than, where it was created.
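
The time and randomness ones can at least be handled in the same explicit-inputs spirit: pass the clock reading in as an argument, and seed the pseudo-random generator explicitly. A minimal sketch:

    #include <chrono>
    #include <iostream>
    #include <random>

    // Non-deterministic: reads the wall clock and takes a fresh random seed,
    // so every run behaves differently.
    double noisyValueBad() {
        auto now = std::chrono::system_clock::now().time_since_epoch().count();
        std::mt19937_64 rng(std::random_device{}());
        return (now % 100) + std::uniform_real_distribution<>(0.0, 1.0)(rng);
    }

    // Deterministic: the "current time" and the seed are explicit inputs, so
    // the same inputs always give the same output - and tests can pin both.
    double noisyValueGood(long long nowTicks, unsigned long long seed) {
        std::mt19937_64 rng(seed);
        return (nowTicks % 100) + std::uniform_real_distribution<>(0.0, 1.0)(rng);
    }

    int main() {
        std::cout << noisyValueBad() << '\n';           // different every run
        std::cout << noisyValueGood(42, 12345) << '\n'; // same every run
    }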

So, finally, a couple of stories of unexpected non-determinism cropping up in our theoretically-pure-functional quant library. We had some heavy models that got distributed out onto a compute grid. The price for one particularly complicated product would jump around given the same input - maybe something to do with which machine it was running on.

Further investigation revealed that the product in question relied on a particularly heavyweight calibration stage. The calibrator, relying on numerical methods and probably some not-so-great market data, was not expected to always converge, and if it didn't, it would fall back to a simpler model. It would do so non-deterministically.

The fallback code looked something like "try { calibrate(); } catch (...) { fallback_calibrate(); }". The blanket exception-catcher turned out to be catching a memory allocation failure exception (this was a heavy calculation!). In other words, the profit-and-loss for this particular trade would be determined by how much RAM the machine had!
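
In C++ terms - and this is a reconstruction, not the actual code - the problem is that catch (...) also swallows things like std::bad_alloc. Catching only the failure you actually expect keeps the fallback an explicit decision:

    #include <stdexcept>

    // Hypothetical exception for "the calibrator failed to converge".
    struct CalibrationError : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    void calibrate()         { throw CalibrationError("did not converge"); } // placeholder
    void fallbackCalibrate() { /* simpler model */ }

    void price() {
        try {
            calibrate();
        } catch (const CalibrationError&) {
            // Expected, model-level failure: falling back is a deliberate choice.
            fallbackCalibrate();
        }
        // Anything else - std::bad_alloc included - propagates and fails
        // loudly, rather than silently changing the price based on how much
        // RAM the box has.
    }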

The reason given sounded plausible: Surely you'd prefer to get some price out, rather than have your trading book non-deterministically fail to price? It turns out exotic derivatives traders are a detail-oriented bunch, and no, they would rather have something clearly and explicitly error than just randomly change in price, unannounced, in a way that slowly drives you mad (and screws up finite-difference-based risk calculations). Exotics traders have something of the dev mindset.

That one was interesting, and gave me some understanding of priorities, but I also saw another, which I found more depressing. A good quant understands finance, stochastic calculus, numerical methods and software engineering. In practice, most are weak at software engineering (so they employ quant-devs - hi!), and they pick up finance on the job over a number of years. You want the core competencies of stochastic calculus and numerical methods. You usually get just the stochastic calculus. Hence this story.

All this took place before the cloud was commonplace, and even now, at that level of utilisation, it makes sense to have your own hardware. So, we had a bunch of machines to run the heavyweight models on. They were Intel machines, but AMD had just started producing chips with incredibly good price/performance, so we bought a bunch. And... a bunch of trades priced differently on them!

Investigation followed. I forget the exact outcome, but it hinged on the way that Intel FPUs have 80-bit floating-point registers, while values get stored in memory as 64-bit doubles (I think you can force the internal representation down to 64 bits too, but - hey! - free precision!). For whatever reason, the AMD machines did not behave identically - whether they worked at 64 bits internally, or the code took a different, CPU-specific path that did or didn't spill intermediate values to RAM, I forget. The upshot was that some intermediate values were truncated from 80 bits to 64 on one architecture and not the other.
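
If you want to see this class of effect for yourself, something like the following usually shows it on x86 with gcc - though exactly what you get depends on compiler, flags and optimisation level, so treat it as an illustration rather than a guarantee. With x87 maths the intermediate a * 10.0 can be held in an 80-bit register, whose larger exponent range avoids the overflow that plain 64-bit SSE arithmetic hits:

    #include <cstdio>

    int main() {
        volatile double a = 1e308;   // volatile: discourage constant folding
        double b = a * 10.0 / 10.0;  // the intermediate overflows at 64 bits...
        std::printf("%g\n", b);      // ...but not in an 80-bit x87 register
    }
    // g++ -O0 -m32 -mfpmath=387 demo.cpp && ./a.out   -> typically 1e+308
    // g++ -O0 -mfpmath=sse demo.cpp && ./a.out        -> typically inf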

There was much internal hand-wringing about all this - about buying machines without sufficient testing, about bad hardware and bad compilers and all the rest of it. No-one seemed to see the headline that I did: If your code produces wildly different results with 64-bit and 80-bit precision (and you didn't know this already because you'd designed it that way), your code is numerically unstable, and the numbers that are coming out of it are largely noise. You have discovered a bug. Be grateful for it, fix the bug, and don't blame the messenger.

So, uh, yeah, the compiler and microarchitecture can be a source of unobvious non-determinism too, if you try hard enough!

I love determinism.

Posted 2024-03-07.