Code reading: Lua 5.1.4

One of my resolutions is to read more of others' code. Especially decent open source code, as there's plenty of it and it provides a rather different view compared to work code. So, Lua seemed a pretty good choice as it's featureful, interesting and 17kloc in total. 5.1.4 is just the version I have lying around and use. Let's get started!

The code is really very compact and terse. It's written as if it's meant to be read on an 80x25 monitor. Much more squished up than the Linux source, and much less commented. I know comments can be misleading, but sometimes just trying to work out what the code is supposed to be doing is a bit painful. I think they're trying a bit too hard on the density.

Reading through the code, I can understand the general shape of what's going on, but I have very little confidence that I'm not missing bugs as I'm reading it. Unsurprisingly it's rather stateful, but it's also not clear what the invariants are, or how they are maintained. While functionality is split quite nicely across the source files, it's not well-encapsulated (C makes this a lot more difficult than C++, with objects).

To understand in depth, I think I'd need to trace the code as it runs, rather than just stare at the code.

The source is divided into the core part, and the libraries. I started with the core, which is broken down by function. Each source file is nice and short, generally a few hundred lines with the longest (the parser) being a touch over 1300 lines. I started with the periphery/base layers - the implementation of strings, tables, allocation etc. All fairly straightforward.

The chunkier parts are the VM, garbage collection, and code generation (the Lua input is converted to bytecode that the VM runs).

The Virtual Machine

The VM is stack-based, but it's not pure stack access. Each call frame has a base address, to which offsets are applied. It's rather like the Sparc moving register file. There's a separate stack for values (as just described) and calls, a bit like Haskell's STG.

Lua's "upvalues" (lexically-scoped variables) can be open or closed. If they're open, they're still live on the stack, and are represented by pointers to those values. When the item comes off the stack, they become closed, and become self-contained.

There's some weird GC handling of upvalues I don't quite understand. If an upvalue is unused, it can be resurrected if something else refers to it. I don't know if this is an optimisation, or just a way to prevent multiple upvalues pointing to the same actual value.

OP_CLOSURE's definition doesn't match what really happens. Or rather, instructions to load the values needed for the closure appear after the OP_CLOSURE instruction. This perplexed me both when reading the VM code and the code generation code!

Some of the opcodes seemed rather specialist, to deal with Lua's specific structures. Not unreasonable, I guess.

Garbage Collection

The idea of incremental garbage collection in Lua seemed mildly magic, given the hoops jumped through by languages like Java to do this kind of thing. In the end, the implementation is surprisingly pedestrian: The marking can happen incrementally, and if you put a garbage-collectable value somewhere outside the heap that's already been marked, add it to the "to process" list. Life is made a lot easier by the code being single-threaded, with the garbage collector being able to synchronously kick in at nice, safe points.

The code initially confused me with heap objects being black, white or grey. It took me a while to work out that black means marked (fully processed), white means unmarked, and grey means incompletely processed (we know it's accessible, but haven't chased through it). Obvious in retrospect, I guess. There are two whites, presumably so that we can distinguish "not yet marked in this iteration" from "should have been deleted already", and to avoid nasty mess around the transition.

Weak tables and finalisation provide a bit more fun, as usual. Ugh.

There are a few little bits of non-native-English-speaking amusement in the code. For example, the variable to hold the GC "debt" is "gcdept".

Code generation

This is split between the lexer, parser and actual code generation functions. The lexer and parser are hand-written. For this kind of thing, I certainly don't blame them. The parser is relatively long, but better-structured than the code generator, and makes the syntactically-driven code generation a bit clearer. The split of work between the parser and code generator does seem a little uneasy, though. Unclear invariants and repsonsibilities.

The code generator is mildly funky, mostly because it tries to perform peephole optimisations as it goes along. Jump handling is also a little tricky. It keeps lists of jumps that need their destinations overwriting as a linked list, linked through the jump destination field that will be overwritten. This is made more complex by the code that fills in that linked list, which has a special case not to do so if the target of the jumps is another jump (instead jump straight through to the final destination).

Similar tricks abound for handling variables and values, to make sure they end up in the right place at the right time without extraneous register moves. Boolean expressions are handled as expressions with jump pointers for the true and false cases, which I guess is a neater representation if you want to handle conditionals and short-circuiting nicely, but then there's extra conversion if you just want to treat the result as a boolean value.

Libraries

Having read through the core, the libraries seemed to make a nice wind-down coda. Mostly, it's wrappers around useful functions. "loadlib.c" is a bit interesting, as it breaks Lua's usual pretence of ANSI C, and has some quite platform-specific code. The string-matching code is... interesting, too. Clearly they want the code to be compact and simple, but the Lua regexp-alikes use back-tracking. Ugh.

Summary

The code, in itself, is readable, I guess. It never makes life easy, though. None of the above is documented explicitly, so you have to work out everything for yourself. Somewhat frustrating. I guess it's good exercise, though.

Posted 2015-03-11.