Ugh. I've always hated the word 'blog'. In any case, this is a chronologically ordered selection of my ramblings.
Recently, Twitter introduced me to the Circular Law, which is extremely cool. Not only do the (complex) eigenvalues of a random matrix end up distributed inside a circle, they do so quite uniformly due to a "repulsion" effect.
I wondered a little bit about what this repulsion looks like, and wrote some code to visualise it - https://github.com/simon-frankau/eigenvalues. I won't bother with an image here, you can click through to see what it looks like, but I found it a fun little project to do some basic numerical stuff. The "nalgebra" and "plotters" crates were a pleasant surprise.
I made a Tweet I'm quite proud of, that got double-digit likes! ;) It goes something like this:
Teleological incident postmortem: "The root cause of this incident was the need for this system to be more reliable in the future."
280 characters doesn't do nuance, so I thought I'd write a little more. Even if, to be frank, there's not a lot of nuance.
Despite my advancing years, I only just learnt about teleology. My knowledge of ancient Greek thought is very limited, but I've started working my way through Bertrand Russell's History of Western Philosophy. It's a bit of a doorstop, but surprisingly readable with plenty of snark. I'm sure I'll do a review in due course. Anyway, the idea of teleology is (roughly) to answer the question "Why?" in terms of what it causes in the future. It's a bit like cause and effect, except with the event being the cause, not the effect.
I find it a startling way to think. I'm so used to "Why?" being in terms of cause and effect, that this framing, with its implicit hint of fate, is surprisingly alien. Why on earth would I apply it to incident response, beyond "I have a new hammer, let's hammer everything!"?
Firstly, I think it's an empowering framing. While I think it's not the way it worked for the ancient Greeks, it's good to be able to say "When we look back, what do want to say this event lead to? Let's make it that way!". It brings the right mindset to reviewing an incident, pushing it in the direction of making things better, rather than dwelling too much on the past.
Secondly, thinking about this slightly alien, human-story-centred approach to "Why?" can help remind us of the assumptions baked into our normal "cause and effect" approach to "Why?". Cause-and-effect covers a multitude of attitudes. It can be making up a story as to why a thing happened, or scientifically-grounded causation. The latter works best in situations with repeatable experiments, where the alternatives can be tested. Incident analysis, where we don't have access to the universe where the outage didn't happen, tends to the rather human story-telling end of things, and we should think about that.
In some contexts, sometimes "Why?" makes no sense. It's just trying to impose human values of meaning onto systems that aren't like that. Some events are random, some things are just the way they are. Not everything is about causation, and "Why not?" is just as (in)valid as "Why?". For incidents with human-built systems, this is less of a concern - if an outage happened due to bad lack or coincidences, you can still ask "Why didn't we have another layer to defend against this?". We should still proceed with care.
The lesson to take away, I think, is that there are different kinds of "Why?" for different purposes. It's a fun route back to the same old story. "Who did this?" is a bad "Why?" for improving the system. "What was the trigger?" is a bad "Why?". "What was the (singular) root cause?" is a bad "Why?". "What are all the key things that, had they been different, would have prevented or ameliorated this outage?" is perhaps a better "Why?".
In the end, our "Why?"s need to be driven by our ultimate goal for a PM, which is to improve the reliability of the system. Perhaps, counter to all my intuition, the teleological "Why?" is the actual best "Why?" in this situation!
Hot on the heels of picking up Fez, I decided to play Tumbleweed Park, a graphical adventure in the style of SCUMM games like Monkey Island, which is not surprising as it's written by the author of those games, Ron Gilbert. I'd heard about it a while ago, got hooked in by the freebie mini-game staring the character of Dolores from the game, and, yep, ended up buying it.
It's weird to see where we are with such games. It's very pixel-arty, with various bits drawn deliberately below the native resolution, and just composited together, so you have big pixels on top of small pixels. To show off Advanced Technology you get to control five (!) different characters throughout the story, often making them work together. It's got the old-school limited set of verbs, and manages to capture the feeling of those old games while still taking advantage of the modern hardware. There aren't compreomises on the sound with Adlib-style music, though. Strangely, one of the things that hooked me in is the somewhat melancholy main theme!
This is very clearly the product of a Kickstarter. The phone book is full of backers who you can phone for messages, and the library is well stocked with a vast number of books. Strangely it kind of adds a chunk of depth to the world, and can be a great big distraction if you've got too much time on your hands.
Anyway, the game. It's nice to see there's an easy and hard mode. Obviously I want straight for the hard mode. ;) You could tell how that works through some of the puzzles being multi-part (nearly, but... just one more step!) where I assume the easy mode would just declare something done.
Speaking of puzzles, they were mostly pretty reasonable. If you can keep making progress in the game, it gives you faith that the puzzles are fair, and not just "try to imagine what the designer was thinking, even if it's not reasonable" or "combine all possible nouns and verbs until something happens". The only time I lost that was when I couldn't get into the factory, and it turned out it was blocked on (spoiler!) finding the pizza van, even though there's no actual logical dependency. I ended up taking a hint even if I think it wasn't strictly necessary, through frustration at that point.
This is a bit of a shame, as mostly the plot/puzzle motivation is pretty clear - you aren't wandering around trying random things, but are specifically trying to solve particular problems to advance the plot. This is pleasantly aided by giving each character a "To do" list. The only weakness is that sometimes you'll be given a puzzle you can't solve yet, and it's not obvious you can't, but mostly you'll have other problems to solve and not get stuck. Of course, I did get wedged for a while as I failed to mouse over a few pixels representing a key item, but that's how these games go!
Speaking of which, it's a game that's more than willing to break the fourth wall, as well as commenting on adventure gaming. Amusingly it takes a dig at Sierra adventure games for all their pointless deaths, reassures you it wouldn't do such a thing, and then chucks in one or two possible "game over" scenarios. Admittedly, not in a frustrating way!
(As an aside, my first PC adventure game was Space Quest III, and it was fun to later realise that constant unpredictable and surprising deaths are not actually necessary. That got replaced by the frustration of never quite getting my sound card configuration right to get the audio voices for Day of the Tentacle working (such a good game). Man, that was annoying.)
The ending is a bit meta, and quite frankly I also thought it was a bit meh. Having said that, I never really got on with the ending of Monkey Island 2, so what's new? :) However, the destination is not the journey, and I surprised myself with how much I enjoyed the game as a whole. Highly recommended, if you like that sort of thing.
Fez is a Famous Indie Game, and I bought ages ago during a sale and never played it. My current medical leave has given me a chance to catch up a bit, and I've finally played it.
In my limited indie gaming vocabulary, I see elements of Undertale, The Witness and Monument Valley. It has the meta elements of Undertale, it has the layers of puzzles of The Witness, and it has the perspective puzzles of Monument Valley. In all cases, though I think it is a weaker game than those I'm comparing it with.
On the surface, the closest comparison is with Monument Valley. Fez's vocabulary of puzzles is far more limited than that of Monument Valley, making that part of Fez a somewhat simple and even unsatisfying puzzle platformer, despite the neat rotation mechanic, and cutesy graphics and sound.
However, Fez is two games in one. It's superficially a simple platformer, bright and friendly and easy. However, there's also a second level, of cryptographic-style puzzles. That level is... internet aware. Some of these puzzles are just horrible, because they're a challenge for the whole audience, not just you, and I guess spoilers are available if you want to complete the unreasonable puzzles?
I don't like this. I want a game that is a reasonable personal challenge to me, not a paean to internet sleuthing. The way these levels are mixed together is also jarring. There will be levels and puzzles that you have no chance of solving when you first see them, but this is not obvious when you encounter them, so you end up wasting time before realising you need to give up for now. This is particularly frustrating when the cutesy platformer puzzles are mixed with the secondary puzzles. Compare this to The Witness, where the layering is distinct and structured and... very playable. Discoveries there are enjoyable and surprising, not annoying and disappointing.
I think the epitome of this, including the frankly lazy cross-platform translation, are the tuning fork puzzles. I'm going to spoil them, and I don't care because they're rubbish. On the XBox (?) you'd see a tuning fork, and you'd get a left-right haptic sequence on your controller that you'd parrot back to unlock a secret. On the Mac, this has been translated to a left-right sequence of sounds. Which is completely non-obvious unless you're wearing headphones. The stereo effect on the built-in speakers is... just not there. A painfully obvious puzzle becomes obscure, simply through carelessness.
It's a wilfully annoying game. The clock puzzle makes that abundantly clear. It's a challenge to the players, but more in terms of what they'll put up with in the name of games, rather than in terms of an enjoyable gaming experience. Perhaps this is art, challenging preconceived notions of what a game is, but I'd prefer something enjoyable and engaging.
I'm useless at thought leadership. I am fundamentally incapable of getting my hot takes out before everyone else. Indeed, I only tend to get my thoughts down once everyone else has got bored and wandered off. That's what I'm doing here. I'm not unduly bothered by this; hopefully there's room to move on to more durable truths that stay relevant...
Rather intriguingly, the outage report has been rewritten to remove any reference to a single engineer receiving consequences for the actions they took, but the initial draft... had that. To be clear, the first iteration I saw did not place that as the sole issue - it was mentioned with things like poor automation - but it certainly struck a chord.
A typical Twitter view was that anything like that is beyond the pale and a sign of an irrevocably broken culture. And I just can't quite get on-board with that. I view all outages as systemic issues, but that's not incompatible with having someone being negligent or even malicious. That is a systemic issue - why did you hire this person? Why did you not deal with it earlier? Why aren't your systems resilient against this? But it's also a people issue - you don't just go "Yeah, sure, you're negligent, but that's not your fault, we shouldn't have hired you. It's on us.".
To be clear, almost certainly we do have a screwed-up culture here, not One Bad Apple. I'm being pedantic because I think it's useful to avoid lazy thinking. How would you know the difference between "one bad engineer" and widespread issues? Well, the main thing is whether their pattern of behaviour is normal or not. This emergency change required approval, and was apparently inappropriately approved. Did that happen often? Are there consequences for the approvers? Canarying and staged rollouts were supposed to happen, but didn't - was that unusual?
I guess what I don't like about "blaming the engineer is bad" is that the opposite is not the real fix. You can avoid blaming the engineer and still make no progress. You need to understand the context, the norms, the reasons why alleged processes weren't followed and checks didn't work. The thing that really made me feel bad about the Salesforce outage write-up is that it had no answers to those questions, and moreover didn't even ask them. Take out the engineer-blaming, and it's still a concerning write-up.
As well as culture issues, this was, at heart, a "should have had better automation, with canarying and incremental rollouts" outage, which I think makes a nice comparison with the Fastly outage. So, let's move on.
Despite being very short, I liked the Fastly write-up. For one, it didn't focus on the trigger, but on what happened. Buggy software was rolled out, and then a customer config change triggered it. It sounds simple, but so many incidents start with the immediate trigger (preferably "human error"), and are instantly wrong-footed in terms of managing the systemic issues. As it is, Fastly's response looks systemic.
So far, so boring. What was interesting to me was a Twitter conversation I had while the issue was still poorly understood. The assumption was that this was another operational config push gone wrong, not a customer config push. My correspondent asserted: 1) They wanted a write-up explaining how whoever was doing that operational work screwed up, and 2) They wanted proper multi-stage rollouts, with a human in the loop between steps, to prevent global failure in the future. I found this interesting, because there were little echos of Salesforce.
(My correspondent may disagree with how I'm portraying their views. Oh well. Pretend I'm making up a straw person to argue with, then! :)
Leading with "what the human operator did wrong" is, as I said above, dangerous to systemic thinking that will actually fix the system. Indeed, it's bad enough that people assumed that a config-push outage was an operational error, rather than customer control-path failure. Assuming it's human operator error is as bad as saying "It's always DNS", except it's not even a (misinformed, tedious) joke. Ok, in Salesforce's case it was DNS. *sigh*
The other point, about human inspection of multi-stage rollouts is... a lot more subtle. If you don't trust your rollouts, having a human in the loop and watching sounds good. In practice, it doesn't scale, and it sucks. Humans make mistakes, good automation is much more reliable. So, any decent system will have automated checks anyway. What's the human doing? Watching the machine to make sure its checks are working correctly. That's the approach behind Tesla's autopilot, and people have died because of that.
Get good at trusting your automation. We don't put human checks on the business path - no human is canarying the rollouts of customer-driven configuration. To say that the operational automation needs a human watching it means we're just not holding it to the same standard of software engineering as the application, and that is how you get outages. You don't want to stick humans in the loop by choice - they are just so darned unreliable. Look at all those outages caused by operator error! So, adding human vigilence is a sign of some underlying cognitive dissonance.
Where does this concern about automation come from? Well... *sad face*... there are valid concerns, which brings me on to the third outage, which I never intended to cover when I started writing this, but it's so informative:
This outage is a case of automation taking down a large service in a very visible way. Automation makes it possible to do so exceedingly efficiently (and surprisingly!). Is full automation bad? Do we need humans in the loop?
One of the things about automation is it enables scaling. It enables complexity. If every production change were done by hand, we'd have a huge error rate and any large distributed change would be overwhelmed by the overhead of coordinating people. Replace that with robots, and you can keep scaling. You can build a really complicated system, and it'll function (most of the time). If you read the write-up, this is clearly a mega-complex automation system, with many layers and... it got too complex.
People might take away the lesson that automation is dangerous. They'll be tempted to put the much less efficient, much less safe humans back in the loop, and stop treating the problem like software engineering. Don't do that. Trust automation, but build automation that you can trust. Code lets complexity quietly pile up in a way that a human bureaucracy would immediately spot. It's a horrible temptation. Keep the automation simple and clear. Build it so that you can trust it.
Sometimes you'll read something that's clear, coherent, thoughtful and utterly at odds with your experience. That's just happened to me. In particular, it's this article on using a general-purpose programming language to specify the structure of your system. Well, "specify the structure of your system" is probably incorrect, since there's no hard line between the bit about system shape and the operations you perform, but that's roughly the gist? Anyway, the idea is to use general-purpose code to configure production.
As usual, bear in mind that I'm arguing against what I read, which may not be what was written. Interpretation is fun.
My TL;DR is that any time I've seen a general-purpose language used for something that doesn't need it, at scale it will turn into a horror such that you end up Googlng "The Office no meme". The freedom gets abused. I'm going to spend the rest of this post trying to add some subtlety to that viewpoint....
First, let's introduce DSLs...
Domain-Specific Languages (DSLs) are languages cooked up for a particular purpose. They should look like and feel like a general-purpose programming language, but they're specialised for a particular task.
A DSL can be Turing-complete - it can be set up so that any program you want to write can be written in that language, and you cannot, in general, tell if a given DSL program will terminate - but for most tasks you don't need that power and it's a pain to work with, so you'll make your DSL not Turing-complete. You can still have conditionals and loops, for example, but the loops must be bounded (kinda primitive recursive-level power).
I've seen DSLs used to great effect throughout my career. I spent a bunch of time in banking, and used them for specifying exotic equity derivative payouts and real-time trading algorithms (which are extremely different uses, even if the words mean nothing to you), as well as for configuring infrastructure. My PhD thesis was also based on a DSL, but I gotta say, looking back, it was more an exercise in how not to do it!
In general, I've found DSLs to work really well. I've found that when people break the box open and shovel in general-purpose language behaviour, things go really badly.
Note that I'm not against using general-purpose languages to write your DSL code in. Embedded DSLs, where you write your DSL within the syntax of another language, are a quick and easy way to bootstrap a DSL, and can give you a lot of mileage before you hit any problems. The important thing to note, though, is that you need to be clear about the DSL boundaries, and never leak into full general-purpose language functionality.
The linked article uses a rather personal naming system. The ideas embedded in it seem to closely, but not necessarily precisely, align with concepts I have, so rather than risk confusing ideas, I'm going to use my own personal naming scheme, and suggest alignments. My stack-of-abstractions looks like this:
Roughly, the article is "code sucks, use software", and my response is "general-purpose sucks, use DSLs". So, let's put forward my position...
The reason I like DSLs over general-purpose languages can be summarised as "If you give people a general-purpose language, they will use its full power, and this will screw you over." Breaking this down, we have:
In practice, the main "analysis" that you use on the DSL is "expand out a fully-unfolded data structure/config", but it gives you the option to do better than that. And I always want the option to see the fully-unfolded config, as it will be used downstream, without actually performing any operations, as that's an absolute baseline for observability. And a badly-done general-purpose-programming-language-based system won't give you that.
Much of what I read in the article rails against the current state of tooling. All the YAML. Templating. Crappy tooling. Code as config/data with a thin coat of paint.
I think I've been really lucky, in that my career has kept me away from the "state of the art" of open source infra tools, and has let me play around with decent DSLs, supported by the people who use them. I've not had that pain.
If you have crappy "code" systems, I can see the temptation to move to "software" systems. I've seen people give in to that temptation, and as things scale up, I've seen them regret it. General-purpose languages are harder to reason about than DSLs, and we don't need more complexity. DSLs are an investment in keeping things simple, because if you're operating large-scale distributed systems complexity is a complete reliability killer.
My take is that if infrastructure as code sucks, we should make it not suck, not move to general-purpose languages. I get the impression that a lot of the tooling is built by infra people without understanding of programming language design, so we end up with bad tools. The proposal is to jump to general-purpose languages to steal those lessons and tools, but that's not the lesson I want to take away. Which brings me on to...
I felt the article's idea of "software engineers" vs. "infrastructure engineers" very revealing. I could be reading way too much into it, as someone with a strong software background who's spent some time now in infra, but it feels like the infra inferiority complex.
The message I'm getting is "Infra engineers write code in these crappy tools, while proper software engineers write in proper languages like Go. Infra engineers should be like software engineers.". I wholeheartedly agree, but draw a different conclusion.
Proper Software Engineering throws in tonnes of abstractions. It builds DSLs. Look at Unix: You want to manipulate text? C sucks for that, let's build awk and sed and stuff. The software engineer approach is DSLs, but with real ownership - tools built by those who use the tools. They treat it like a real language, and build out the tooling so that it's not a joke to use. They don't just create a better string manipulation library for C.
I think the issue is the ownership. Right now, the "code" end is for infra engineers, and the tooling that actuates it is "software", usually part of some OSS project, and they're viewed as somewhat distinct, and in the end the infra people are just writing some templated config.
This is not how it should be. "Infra engineers" should have the same software engineering skills as "software engineers", and hence engineer software similarly. They should own the system end to end - both the config/what and the actuation/how parts - but still respect abstractions. "Software engineers" do not (ahem, should not) mix config into their source tree, even if they can. Infra engineers should be just as good at abstracting.
Is the article really suggesting something I totally hate? I can't be sure, but I think so. The "func MyApplication" example code is pretty much an embedded DSL - a data structure being built within a general-purpose language can be clearly distinguished from the embedding language. This does not scare me.
On the other hand, the linked pull request is clearly structured in a way that mixes deployment behaviour with parameters, breaking all the abstractions I'm such a fan of. So... yeah, I think I don't like this.
What now? I'm not sure. I'm sympathetic to existing tooling sucking badly. I really don't like the proposed solution. I've had a go at explaining why I dislike it, I have no idea if I've got that across or not. I don't have the time or energy to propose a better alternative, and TBH I don't know the OSS tools anything like well enough to make a sensible suggestion. All I can say is "if this is a proposed improvement, I think the starting point must be really bad.".
Or... maybe I lie. I do have strong opinions of my own. I like declarative systems, with reconciliation-based actuation. I like DSLs for those declarative systems, static-analysis-based tooling to leverage those descriptions, and strong end-to-end ownership of the full system by infrastructure engineers so that infrastructure management is truly treated as software engineering. Alas, I still lack the time and energy to act on those opinions.
In any case, that article really got me thinking about config for the first time in a while.
Something I've been thinking about writing, pretty much for years, is a little about the very worst books I've read. I've done book reviews for years on end, and I rarely slate books. I'm also something of a completionist, so once I've started a book, I tend not to give up. After all, it might get better, right? This is almost never the case.
To start with, I'm not a fan of Kurt Vonnegut. I don't get why people like him so much. So it goes. Kilgore Trout, his fictional unsuccessful sci-fi author, seems to just be an outlet for bad sci-fi ideas, and we don't need more of those, intermediated or not. I acknowledge that Vonnegut played with ideas and tried various innovative things... they just all fell flat with me.
While I'm at it, to show I'm not just down on modern authors, I really don't rate Hardy. I don't know if it's the writing or the distance of time, but Tess of the d'Urbervilles left me nothing but bored.
Having said that, none of Vonnegut's or Hardy's books make my top three. Notably, my top three are all doorstops. A short bad book is one thing, but trawling through a long, tedious work is so much worse. They're also books that think highly of themselves, or at the very least have staunch supporters:
#3: Stranger in a Strange Land by Robert Heinlein
A lot of people think Heinlein is a classic sci-fi author. I read SISL and never read anything else by him. Judging by the descriptions of his other work, it varies and SISL is not fully representative, but... I don't care. I've wasted enough of my life reading Heinlein.
Why is it a bad book? It thinks a lot of itself, yet it's fundamentally naff. It tries to combine so many ideas, yet so many of the ideas are just so bad. It's suffused by mock spirituality, with the most tedious of '60s free love combined with... whatever can be found in the kitchen sink. It grinds away at the brain until, finally, several hundred pages later, you're released, thinking "What was the point of that?".
#2: Gravity's Rainbow by Thomas Pynchon
Why am I a glutton for punishment? I read a review of Gravity's Rainbow in the Cambridge University student newspaper by some English student, and decided to read it. The review was positively effusive, but it was really praising the book's cleverness as a proxy for the cleverness of the reviewer in understanding it.
I'm sure it is a clever book. It clearly thinks it is. But it's just too much like hard work. It's some kind of tedious shaggy dog story that is much more interested in showing off than actually telling a story. Is it meandering, or is it just plain lost?
All of which might be forgivable if it wasn't hundreds and hundreds of pages long. All the better to meander with. As it is, I'll be very happy to never read another "Slothrop sez".
#1: The Illuminatus Trilogy by Robert Shea and Robert Anton Wilson
Gravity's Rainbow was a clear product of the '70s, and Stranger in a Strange Land has its hippie free-love elements. There's some kind of strand here that gets combined in The Illuminatus Trilogy.
There's a paragraph in Fear and Loathing in Las Vegas about the wave of hippie idealism breaking and rolling back, and... that's the seventies in so much culture. The leftovers of the '60s gone soured and seedy.
I've never been a hippie person. When choosing between those who fought actual Nazis and built rockets to the moon, and people who dropped out, I know which side my modernist, straight-laced self would fall on. It took me a long time to realise it was ok to not like hippies - growing up, the media did rather go on about how the '60s were the best of times, and it took me forever to understand that this was because the kids of the '60s were running the media.
'70s counterculture, as epitomised by The Illuminatus Trilogy, though, that's something else. Crappy conspiracy theories that look like a trial run for Trumpian reality avoidance. Awful, awful writing. Nonsense plot. More crappy, bad, seedy sex than you can shake a bargepole at. A thousand and one other bad qualities, and behind it all... an astonishing vacuity. It would have had the redeeming quality of being short, but it's a trilogy in an omnibus book and for some reason I read the lot in search for something when I should have given it up as a bad job.
Why on earth would I read such a thing? It's a long and complicated story! As a computer-liking kid in a small town in Gloucestershire in the early '90s, I felt somewhat socially lost. In one rare bookshop trip, I found and bought The New Hacker's Dictionary, which was like a window into the history of computer geekery culture - there was a tribe out there I could belong to!
TNHD was itself derived from "The Jargon File" which had been floating around on the proto-internet (the culture described in it actually being a mish-mash of several subgroups, but that's not terribly important). The bookification process involved Eric S. Raymond (ESR), who also added an appendix on "Hacker Culture", which overlooked the variety of people making up "hackerdom" in exchange for really just scribbling down his own politics. Naively, I bought in.
Oh, and his politics just happen to be quite crappy, with an interest in awful '70s counter-culture. So, when The Illuminatus Trilogy was mentioned as a way of understanding the mindset, I scooped it up. It's crap, I suffered cognitive dissonance, and came to realise his politics sucked. So, I guess it did some good. The real lesson I eventually learnt was that you want to be extremely careful who your role models are. Since then, it's become increasingly obvious that ESR is not a great person, and widespread internet access has made computer geekery into a very broad church - effectively TNHD has been made obsolete as the communities have become directly accessible.
Later, I found out that a friend of mine saw me reading it, took a skim when I put it down, and thought I was some kind of weirdo for reading that crap. To read it and enjoy it, that's a bit weirdo. To read it, not enjoy it, and continue reading anyway... that's my idiocy.
Worst book ever. Seriously.
The launch of the Raspberry Pi Pico has reminded me to play about with microcontrollers. I have completed a first-year-undergrad-like project: display something on a little 128x32 OLED display, using a Teensy 2.0! My biggest fear was installing the toolchain, but it turned out to be easy. Maybe things have improved in the last few years. Microcontrollers being big enough and powerful enough to happily run C rather than really prefering assembly is also definitely pleasant.
Of course, a photo is necessary:
Big-banging I2C was relatively painless (having previously done SPI for SD card access on my Dirac Z80-based machine), and I got to play with very simple graphical effects. So, fun.
Of course, the whole thing is a yak-shaving exercise. I want to do this as a warm-up before using the Teensy 2.0 (5v compatible, yay!) to drive a 30-pin SIMM, as an early experiment towards building a 68K Mac-compatible, which has been on my project bucket list for ages. I'm starting to realise how these MCUs can make a nice support device for these retrocomputing experiments.
I think it says a lot that my last update was January. 2020 has been quite the year, and I don't think I've got an awful lot to add on top of the general conversation. What I want to do, though, is remind myself that it was still a year where I managed to do stuff - managed to build things and learn and achieve.
So much has been going on at home, between the rounds of children learning from home and just trying to keep sane. And all the work-from-home, of course. Getting a good videoconference set-up was like a project in itself. My son moved school this year (to one in central London, even! It seemed like a good idea at the time.), so I had my share of masked school runs. However, I don't really blog family stuff, so let's skip over that.
I accidentally got an MBA. I like MOOCs, and as a techie it's very easy to have a career where you can vaguely see the business in the distance, but never really understand it, so learning more about it all seemed extremely interesting. I got some very polite smart.ly spam on LinkedIn, thought "Why not?" and joined up for this non-accredited online-only Executive MBA course - the official degree status was not important to me, I was just interested in Learning Stuff. A year later, they're now quantic.mba, they received full accreditation, and I have an Executive MBA with Honors. Ooops! :)
I got promoted at work. This one was very much not accidental, being the culmination of years bashing at the system, but I got there. I'm strongly of the opinion that the easiest way to get to a level at Google is to be hired at that level. It was a bright point in a very grinding year of work.
I dealt with my father's death. Not exactly a deliberate project, but something hard that we got through nonetheless. The timing was fortuitous in otherwise bleak circumstances - lockdown lifting meant we could visit him in time and hold a small funeral in person. My sibling was a star in the hospital visits, my wife has done amazing admin work, and family friends really helped us out. Everyone was wonderful in an otherwise difficult time.
I learnt to draw. Ever the book-learning type, I bought a couple of introductory books and started practising. I'm by no means any good, but it's great to pick up something that I feel I have no natural aptitude for, and just... learn a new thing, be bad at it, and that's ok, and just gradually get better. It also made a nice distraction from dealing with my father.
I became a licensed radio amateur (M7FTL). Software-defined radio and the like sounded fun, and a work colleague mentioned the exams had gone online-only, so it seemed a great opportunity to find distraction from the grimness of 2020 in study. Who doesn't love an exam?! I completed my Foundation license and, well, I've not done much with it, but it was a fair amount of fun, and I learnt a bunch.
Last year, I learnt to sail. I was really looking forward to getting better this year, but in the end only got a quick, single session in. It does make me feel a bit better that I have managed to find other things to learn and improve at instead. Screw 2020, and here's to our achievements despite it.
I decided to re-read A Fire Upon The Deep. It looks like I last read it... 15 years ago?! Really?! Well, there you go. It's been one of my favourites, largely because it so effectively steals ideas from computer science to make a science fiction novel that's ostensibly space opera.
It's been really interesting to come back to it. I suppose it's fun to go look at the themes involved:
As well as the compsci elements, there's the structure of the story: Pham Nuwem is a Christ figure. A god (well, transcended Power) made flesh, sacrificed to save people from their mistakes. Also resurrected, although in this case at the beginning of the story.
Not everything has stood the test of time for me. Going beyond the fun compsci space opera, I felt the characters were somewhat weak and the battle at the end was pretty muddled. These complaints are still missing the point, though, when the ideas are so fun. I'll probably re-read A Deepness in the Sky in a year or two. :)
Yak-shaving at its finest!
I want to build a 68K-based machine. To avoid needing too many TTL chips, I wanted to use GALs. I played with GALs, and eventually got them working, but not before looking at the source to galasm and deciding that it was a bit old and crufty and I didn't really trust that I understood its corner cases.
So, I wrote Galette, a largely-galasm-compatible GAL assembler, written in Rust and hopefully a bit more maintainable. Yaks-shave completed.
So, my reading of the kernel code ends not with a bang but with a whimper. tcp.c is the last source file to read, and is a 3.7kloc mess. Most of the code looks plausible, but would take far too much time to understand if it's doing the right things. tcp_ack is pretty horrible. tcp_rcv is too. The structure make responsibilities unclear (not helped by C's tendency to mix the lowest-level details with the highest level abstractions). I guess one could charitably say it's compact, but it's in the "so complex there's no obvious bugs" category. To be frank, I just skimmed it.
There are various things I've learnt reading this kernel. One is that a clear and consistent programming model is necessary if you want a maintainable kernel (e.g. around how to avoid race conditions, how to handle operation completions, etc.). Another is that C is a really bad language for this kind of thing, and RAII-like tooling is necessary to stop resource leaks being risky. Clear abstractions are great, good comments are important, and life's too short to read hard-to-read code.
After another hiatus, I've almost finished reading through the Linux 1.0 kernel. Building on last time, I've now read everything but tcp.[ch], which is probably a good time to take a break as tcp.c is pretty near 4000 lines on its own, and the longest source file in the tree.
We took a short break to Norwich (surprisingly nice), so obviously it's time for a new hobby. A bit of background: I've already tried my hand at knitting, and there was a rather nice crocheted/knitted Groot at work. Then, at some point, it disappeared. I thought it should be replaced.
I found instructions for the amigurumi Groot model here, and followed them. It turns out YouTube is actually really useful for learning a new manual skill. This is the result:
I'm actually rather proud of it!
I am finally learning Rust!
I should have designed Rust. It's been clear for far too long that languages like C and even C++ are far too dangerous, unnecessarily allowing many classes of bug to exist, somehow in the name of power and being close to the metal. So many things that could be fixed with decent use of a type system and static checks. On the other hand, great languages like Haskell require GC, which makes them an odd choice for systems programming. Go says "No, you can write your systems programs in a GC'd language, and it'll be ok", but there are still applications where this feels wrong, and Go makes some really suspect decisions in the name of simplicity (*). There was an obvious gap in the market.
(*) The trade-off being suspiciously similar to Not Invented Here syndrome, where "here" is Bell Labs in the late '70s, possibly early '80s if you're lucky. Hmmm.
At the same time, this is kinda my area of expertise. My PhD was on statically-allocated functional languages, so it was right up my street. Seeing the problem, and knowing the area, I could have tried something in this space. I didn't.
And you know what? I'm glad I didn't. The result is really good, and I don't think I'd have produced something as nice as Rust. It's pleasantly pragmatic, the lifetime inference rules are simple (certainly compared to the thing I saw in research), and I got the language I wanted without a decade of hard work!
The documentation is pretty good, so I'm learning the language from the site, although the language still appears to be in flux even now - despite learning the language much later than I originally hoped I'd be able to (I first planned to learn it several years ago, but never found the time), I don't regret it too much given how it seems to have changed! I'm not entirely sure the docs are up-to-date.
I don't know whether it's me getting more experience in programming, or just the nature of Rust, but I'm not feeling confident just from reading the docs - I want to understand the idiomatic style of the language and read some good Rust code. I've not done that yet, but I have at least started writing code, rather than just reading the manual in the abstract.
For this, I've done the Advent of Code, admittedly extremely late (which is quite good, as it stops me trying to be competitive. :). Running through a couple of dozen simple algorithmic coding challenges is a great way to learn the feel of a language, and understand the subtleties that docs don't get across. And Rust does have subtleties. It doesn't have the pointless accumulated crufty corner cases that C++ has, but the features combine in ways that take a little while to get used to. So, that's what I've been doing.
My code is now up on https://github.com/simon-frankau/advent_of_code_2018 and I plan to carry on with my learning. I think it's my clear favourite systems language now. Being able to code without the fear of segmentation-faulting stupid mistakes from C and C++, but with the full feel of an effective systems language is great.
I'd heard good things about Stephen's Sausage Roll and on Black Friday it was ridiculously discounted. So I bought it. And it ate my time.
As people have said, it's a puzzle tricky game. I never got stuck for too long, but there were certainly puzzles where I had to quit the game, go away and think for a while. So, I guess that counts as a nice balances.
It weirdly feels like a linear interpolation of The Witness and Sokoban. It's got that pushing-stuff element of Sokoban, but the explore-the-world-working-out-how-puzzles-work-for-yourself-and-then-uncovering-the-story-of-the-land aspect of The Witness.
Making it a game about sausages just makes it weird, even as the story gets darker, and I think this ties up with the lo-fi aesthetics. I found the visuals just offensively bad - they hurt my eyes while playing the game. It's certainly A Look. I kind of wonder what it would look like, even if just the textures were upgraded.
However, it's clearly about the puzzles, and I think the visuals cement that message. Moving from land to land and working out the different ways the sausage and fork work, discovering ideas (or, more realistically, being subtly taught them by the game) and using them is fantastic. Even if it sounds ludicrous.
Back by popular demand! I never actually posted the image for the NTK Elite t-shirt, as I've now had my second request ever to have a copy of it, I've now put it up online. If you're lucky it'll not show up on your browser - it's white on a transparent background. I'm sure you can work it out. :)
On to the next bit of the network stack. My plan had been to cover the stack from the bottom up, but in the end this turned into an "edges in" approach, covering both the device-interface layer and the user API.
So, having both sides of the interface defined, I'm at a point where we can do a little bit of filling the middle and reading up the rest of the stack.
Work's performance appraisals continue, so more distraction was needed, and I decided to go for an Infocom game, Enchanter. It's magic-based, and pretty fun. I got through it all on a single hint.
I particularly liked the in-game hint system. When you sleep, your dreams give you hints as to what you need to do. Which is helpful, since some of the things are rather unobvious - there's a bunch of solving problems you didn't even know you had.
I got particularly stuck when I found a thing that was clearly supposed to do something, but no idea what it was needed for. This was made trickier by the fact there were some red herrings.
What I forgot is that '80s games, especially text adventures, were sadistic. Despite the game not supporting several solutions that I thought were quite reasonable, that it didn't like, it also allowed me to solve one problem, apparently succesfully, in a way that left the game unfinishable. And once the puzzle was solved, the dream hints didn't cover that puzzle, leaving me without hints at the actual correct solution.
Once I'd read online hints and gone back to the start to replay it successfully, the in-game sleep hint was really obvious. Most frustrating.
Between that and the occasional "guess the verb" there was plenty of frustration, but a fair amount of fun and exploration too. A very '80s experience, not quite as good as Trinity or Bureaucracy. Followed by Sorcerer and Spellbreaker.
On to networking! This is a fairly chunky area in itself, so I thought I'd start with some peripheral areas: The non-IP networking stack, and the infrastructure parts of the inet stack. As usual, I started with the Makefiles and READMEs to get an overview. Not much of interest.
From there, I looked at the source in the "net" directory itself. ddi.c and Space.c are a simple module mechanism that has hopeful comments that it'll be picked up for everything else in the system. socket.c handles the socket-related system calls. It's mostly standardised wrappers around the fake-virtual-calls-in-C for the specific socket types, called from a big switch statement, plus some helper functions. This is pretty useful to help me get my bearings. Now's a convenient time to read a few related headers.
Then, net/unix. net/unix/sock.c is annoyingly fiddly, and makes me think that when people say "In Unix, everything's a file.", it's in an Animal Farm world where some things are more like files than others. Maybe Plan 9 gets this right. I should probably take a look sometime.
Having read net/unix, it's time to start on net/inet. I'll skip the headers and just concentrate on the C source fils I read:
And that's laid the groundwork. Now, I plan to work from the bottom up, starting with dev.c, and eventually making my way to sock.c. Let's see how long that takes...
Work's got busy and stressful - it's performance appraisal and quarterly planning time, simultaneously, and so I've been looking for a light distraction. Myth is it. A short text adventure, original not sold but given away as a gift with a subscription service, it's short and sweet.
It's a simple, solid text adventure. Nothing like the mess of Corruption, and much more clearly structured than, say, Jinxter or Guild of Thieves, it was pleasant and quick to complete - I think it took me three evenings of not particularly strenuous effort. I couldn't really ask for more!
Now, the terminals end of the characters devices. I've already read serial.c, which means I've read a terminal driver already without being properly aware. pty.c is the first one I read deliberately. Yes, it looks like a couple of connected terminals.
Then, onto keyboard code. defkemp.c, kefkeymap.map and diacr.h look like dull scan code conversion tables, so meh. kd.h is complicated - it's not just keyboard stuff but also includes video modes etc. kbd_kern.h covers LEDs, locks and modes for the keyboard. keyboard.c handles the keyboard itself - mostly how to handle incoming key presses (unsurprisingly), with a horrlbe mix of table-driven code, switch statements and ifs. Messy but not mysterious. It even uses an octal constant.
vt.h is short and obscure. vt_kern.h is short and obscure. vt.c is the ioctls for the terminal.
Then, onto console.c, a big and interesting file to implement the console on a text mode display. It starts off with a giant structure and then macros to make it convenient to access those fields. Scrolling functionality is a pile of assembly, which I'm sure is efficient, but I'm not going to try to decode inline gcc assembly.
This first part of the file is actually pleasantly readable video buffer wrangling. Then there's a 300 line function to interpret the stream of characters written to the console, then other console functionality before finally doing some font-setting stuff. In the end, much more straight-forward than I expected for a 2kloc file.
tty_ioctl.c is what I'm used to for this kind of file by now. Spoiler: It's a great long switch statement.
Finally, tty_io.c. Ugh, so tedious. I'm impressed Linus wrote all this as a student. Writing schedulers and VM and stuff is fun, but this is deeply boring and very practical, and... it's done. It's needed to make a basic usable Unix clone, and he did it. I really don't want to read it, and I'm mostly skimming.
There are some comments about "be really careful modifying this code, don't modify unless you understand it, as it's full of race conditions". Hint: Don't do this. If there are risks of races, explain them explicitly or even better remove the risk. Don't write an obscure comment, feel proud about how clever you and your code are, and wander off.
One function in particular I found interesting: tty_write_data covers the same issue I saw in the SCSI stack with "How do I call callbacks if I don't want to overflow the stack?". In this case, it's implemented with a bottom-half handler, which I guess is following the same pattern of "queue it up until the current work unit is complete". In other words, we convert the usual LIFO execution model for a FIFO model, from stack to queue.
Anyway, I declare character devices both boring and done. 17 thousand lines of network stack to go and I'm done with the Linux kernel!
Having finished off SCSI drivers, I've turned my attention to character devices. There's a lot of console stuff here, but I thought I'd start with the not-terminal devices first, and work my way up to that level of tediousness.
Starting with the Makefile, I chose to read through the short and optionally-compiled mouse driver files - atixlmouse, busmouse, msbusmouse, psaux and mouse. There's a standard pattern to the bus mice, they're easy to read and it's a good start. By the time we get to psaux, it's not clear to me what's going on with psaux vs. 82c710, but I don't really care.
Then I read lp, mem, serial and tpqic02. mem is in some ways the most fun. It supports not just /dev/mem, but /dev/null, /dev/zero and a couple of others. It also contains the main chr_dev_init function, presumably because it's really not optional. lp, serial and tpqic02 are all longish (lp shorter than the others), full of device-specific details, well-commented and tedious.
It's really devices like this that make me want microkernels or something. Why should the kernel have to care about such tedious details of specific devices? Apparently you need to have full access to all memory and state of the system to decide whether or not the tape drive should be rewound.
While much of the resource management of the kernel makes me wish for explicit RAII, the low-level drivers, with their callbacks and interrupts, top and bottom halves etc. make me wish for some kind of explicit coroutine support. And the layered structure and asynchronous completion makes me think of microkernels and monitors/servers for the drivers. There must be a neater way of doing all this.
In any case, I've read the easy stuff, and now it's time for various terminal-ish things.
I enjoyed reading the low-level SCSI driver code much more than I expected. After reading the sound and network card drivers, I imagined the low-level drivers for SCSI would be quite boring, and potentially rather uniform and repetitive. They turned out to be rather interesting.
As I mentioned in the previous post, I'd read a bit around SCSI as I didn't know much about it, and this background knowledge came in rather useful for the low-level drivers.
I started with scsi_debug, which is a fake low-level SCSI driver (and hence a good introduction!). In various cases it calls back into the "done" callback immediately, which can cause internal_cmnd to then call back recursively into the scsi_debug code. Kernels aren't supposed to do recursion, so I can see why scsi.c cares about stack overflow. This recursion is a bad thing. Why aren't they handling it? In other cases, they hook the callbacks off timers, creating a more realistic set of timings (and probably helping to avoid stack overflow).
Having got the basic idea with scsi_debug, I moved onto the smallest self-contained real drivers. I say "self-contained" because some of the 5380-based drivers are small, but rely on the rather long NCR5380 source file. So, next up, aha1740 and wd7000. These are both simple drivers that rely on a smart card. You just fill in a memory-mapped struct, queue up the mailbox request, let the card do its thing and DMA, and then get an interrupt when it's done. Lovely. A great introduction to the basic ideas of SCSI as you can elide so many details! All the hard work is carried out by the hardware.
So, that was nice. Let's have a go at the NCR 5380-based drivers. g_NCR5380, t128 and pas15 all use NCR5380. Lots of funky (maybe icky) macros allow the code to be reused, with various different modes, including stuff like pseudo-DMA. There's a lot more low-level big-banging and implementing the SCSI protocols by hand - much more code, much more to learn from, and much more CPU overhead.
NCR5380 is nicely commented and has to handle all the conditionally-compiled cases, but it's still loooong. There does seem to be an inverse relationship between how long the driver is and how nice the hardware is. Short drivers and clever cards, please!
However, it's nice to see the NCR5380 "run_main" takes recursion seriously - it is implemented as a coroutine that refuses to run itself if already running, and checks for more work to do before returning, thus effectively implementing a queue for the work rather than a stack. Having not dealt with much interrupt-driven kernel workloads before (beyond "interrupt arrives and wakes up process"), it's nice to see this pattern to handle this shape of problem.
Then I'm onto the remaining drivers, that are longer (and hence generally based on dumber hardware), but not 5380-based. The ultrastor driver is for clever hardware. The length comes from highly-commented code and dealing with multiple card variants. It shows a few signs of being thoughtful in its design, including caring about re-entrancy.
The fdomain card is fairly dumb, so we have to twiddle most of the lines ourselves, but it does at least to DMA for the data. Bizarrely the earlier chip doesn't seem to tell you which direction the data's going, so there's 130 lines saying which direction data goes based on the command id.
The aha152x.c driver is another long driver (2.4kloc), for a dumb card. However, the lines get spent in weird ways - the code is quite vertically-oritented, with planety of ifdefs, there's around 1000 lines of introduction, and then aha152x_intr is a 1000 line function. I don't believe a 1000 line function is ever good style. It's well commented, though. It then finishes off with about 250 lines of verbose debug-text-printing code.
The final driver is seagate. This is another driver that worries about stack overflows and works around it. Yay. However, it also has a 1000 line "internal_command" function. Boo. "Fast mode" bit-bangs the data out without any form of handshaking. This means that a) there's no DMA. Oh well. b) Surely fast machines will go beyond the the synchronous SCSI speed limit? (Well, maybe the card will insert wait states. Given the cheapness of the card otherwise, I doubt it, though.) It's another vertically-oriented source file, but this "big long function" approach is still ridiculous. Finally, it's amusing that the code seems to have been written by a reader of the New Hackers Dictionary. "bork" and "bletcherous" and all the rest.
And I'm done with SCSI! It was kinda fun, although pleasant that there's only a smallish number of drivers. One of the things I learnt through all this is that SCSI is not tricky, it's just pretentious. "Contingent allegance condtion", "request sense", "nexus", etc. Sounds complicated, but actually pretty simple. Why give it silly names? I don't know.
After something of an extended hiatus, I've taken this post-operation recovery time opportunity to read a bit more Linux 1.0 kernel. 'cos why not? :)
I finished reading the sound driver stuff, and I must admit I mostly skimmed it as the low-level details of individual sound cards didn't seem terribly exciting to me. "gus_card.c" is the longest source file outside of "tcp.c" - 3.5k lines! There's a lot of faff in there that I didn't care about, with both low-level hardware mangling and having to care about sound stuff. In comparison, the Soundblaster midi code is a doddle. Anyway, the sound directory is done.
For a bit of a change, I thought I'd attack SCSI. SCSI had always been a bit of a mystery to me. When I built my first Linux-specific PC around 1997 or so, I got a rather fancy Adaptec 2940 PCI card and SCSI disk, since SCSI disks were deemed important for performance back in the day (along with plenty of RAM - 16MB at 30 quid a meg!). However, the details of it all were magic, and this is a chance to learn!
Indeed, Linux 1.0 here doesn't support PCI or the AHA2940, but I guess only having the simpler stuff is better for me as a reader. :) Lots of the supported cards are built around the NCR 5380, so for preparatory reading I read the Wikipedia pages about SCSI, and skimmed the NCR 5380 data sheet. Then I plunged into the source...
The source is split into layers, with the abstract drivers, the intermediate glue layers, and the low-level device-specific drivers. I decided to start at the top end and work my way down. The generic end is around 8.5kloc, the device-specific code is 14kloc.
Once again, the code reminds me of how basic C makes it difficult to ensure invariants are met, and keep reasonable abstractions. For example, there are several places where the runnabilty of the current process is changed in different ways, directly, rather than using standardised sleep/wait mechanisms. And error handling... ugh. So easy to leak resources in little-used error paths.
I sometimes think "How could the code be structured better?". There are some core multi-hundred-line functions that could be broken up. Given some operations span over multiple functions through completion functions etc., I think there could be explicit state machines. Mostly though... yeah, I want invariants and abstractions. I want the code to be obviously correct, not mysterious. Doesn't seem too easy to me. :/
And that brings us to the end of the device-independent part. Next up, I'll be working through that, but I imagine it'll be pretty dull, with just some tedious register mangling. We'll see, shan't we?