Simon Frankau's blog

Raft, done weirdly

From time to time I'll think about consensus protocols, and then slowly relearn how Raft works. I've never found the Raft paper to be particularly helpful - it focuses on how Raft should work, rather than how it doesn't fail to work. This is not entirely fair, in that it does describe the properties that remain true, and then show how they remain invariant through the various operations.

What I end up doing is reconstructing a mental model in my head that's invariant-first, and going from there. What I should do, and what I'm finally doing this time, is writing this model down so that I can shortcut to the important bit when I forget. Maybe it'll be helpful to someone else?

The goal in my version of Raft is to build a transaction log. You ask the system to append an item to the log, and if you get a "done" back you can be sure the transaction you've just submitted will be in the log, in that position, from here on out.

In my version of Raft you're given an unreliable computer that can reset at any time (losing all its state), a ticket machine and a semi-reliable storage mechanism. I hope "a computer that can reset at any time, losing all state" is understandable and relatable. The ticket machine is also pretty straightforward: Every time you call it, you get the next number. It returns numbers in an increasing order, and never returns the same number twice. The memory is just a little bit more complicated.

The memory system runs something like this: It stores a set, and has a read and a write operation. If you write and it returns before your unreliable computer explodes, you have a complete write, and it's durable - any time you read, you can get the value back. If it explodes, you'll have an incomplete write - when you read, you'll get the value back non-deterministically. To simplify, when you read and get a value back, you can't be sure if it's part of a complete write or not, but you can (try to) make it complete by writing the value again (in real Raft you can sometimes know a write is complete when reading).

This isn't an exact match for what Raft does, but it's close enough to be continuously deformable into the published algorithm.

How do we build a transaction log with these tools?

First, everything written needs a unique id to identify it. When our unreliable computer starts up it grabs a ticket, which we'll call the "term number", and then keeps its own counter. Each entry is identified by a unique (term number, counter) pair, which monotonically increases with time (I guess you could grab a new ticket every time, but this'll be cheaper).

Then, we need to sequence the transactions. We can't assume that every single write will be in the transaction log - we may start a write, and then crash, leaving an incomplete write from which we can't recover the data. So, each transaction will refer to the previous transaction's (term number, counter) pair, creating a chain of transactions.

Without further constraints, this chain could actually be a tree. The main trick of Raft is working out how to create a unique main chain on which all committed transactions live.

In practice, we do generate a tree, but we ensure there's a main backbone of committed transactions. Off this may hang uncommitted attempts to build a transaction, but the core chain holds all committed transactions.

We're going to make the rule that every non-leaf entry of the tree is a complete write. After all, we don't want entries whose predecessors aren't available! So, before writing a new entry refering to an older entry, we must know that older entry is a complete write - either because the write is in the same term and we saw it complete, or it's from a previous term but we have rewritten the entry and seen the rewrite succeed as complete.

One of the side-effects of this is that if we see an entry with another entry chained after it, we know that the earlier entry is definitely a complete write.

As things can get messy around having multiple incomplete writes outstanding (if our unreliable computer keeps crashing near the start of a bunch of terms), and then deciding which ones to complete, we can't actually just guarantee that every complete write is part of the committed backbone transaction log, but we can guarantee a different invariant: Any write completed in its own term (the term number being written is the current term) is committed (will always be on the backbone).

To keep this invariant, all a new term needs to do is to always build on the highest (term number, counter) entry it sees (having rewritten that entry to ensure it's complete). Why does this work? Let's say the new term immediately follows the last term with a committed transaction. We'll be building on either that transaction, or the incomplete entry chained immediately afterwards. In either case, the chain includes the last committed transaction. If there have been other terms in-between with transactions not committed during their term, they will have been chained onto the last actually committed transaction (by induction), so we can safely build on those too.

And that's morally the equivalent of how Raft works. The unreliable data store is just "keep writing until you've written to a majority of the nodes" to write and "union the results of everything you can see, assuming you're reading from a majority of the nodes" to read. The unreliable computer is "do a leadership election, have the elected device run the algorithm", and the ticket machine is a side effect of the the described election algorithm". Raft does a bunch of stuff like delete the entries that aren't on the backbone, but the end result is the same.

That's Raft, done weirdly. I think I'll take a look at Paxos again.

Posted 2023-07-15.

Full index