Delimiters, escaping, stuffing and all that: How long is a piece of string?

I've been reading about shell scripting, and one thing that has always bothered me is how easy it is to get things wrong by having to deal with filenames with funny characters in - how do you handle filenames with spaces, newlines or even just colons in? Disappointingly, the book's recommendation is mostly along the lines of "Don't do that, then".

Indeed, this is a fairly common issue to deal with in computing, at all levels - how do you know when the field/line/lump of data you're reading is over? In a non-streaming, binary-blob world, the answer is relatively simple - use a length prefix to tell you how much to expect.

In the text-oriented, streaming Unix world, this isn't really much of an answer. Unix, in its simplicity, tends to go for special characters for terminating or delimiting. In /etc/passwd, newlines terminate records, and colons delimit fields.

However, this approach is so fundamental it's taken for granted. C-style strings are terminated with NUL - we have a character we can't put in our strings, and we mostly don't worry about it. I do like C, and I'm terribly used to NUL-terminated strings, but Pascal-style length-prefixed strings do start to look pretty sensible. Not only can you put in any characters you like, but it becomes more difficult to buffer overflow if you're forced to track the length.

Back in the day, I guess having to track the length/end position in an extra processor register was a pain and performance issue, but nowadays a Pascal-style string probably helps a modern processor's branch prediction - a string copy can be unrolled, knowing the exact length in advance, without having to do conditional behaviour based on the contents you're reading, checking at every byte. In summary, I guess I like length prefixing, where you can do it.

Having said that, C++'s std::string (and various other languages' equivalent) do track the length, but in most cases people using such types then don't really bother trying to support embedded NULs for most applications. Hey ho.

The basic Unix approach, then, is special characters as delimiters. It can get a bit silly. Say that you're reading /etc/passwd. You might be processing it using NUL-terminated strings, so that's an internal delimiter. You read records separated by newline, and break the fields apart with colons. You pull out the home directory, and the path components are separated by slashes. That's four characters used to delimit elements in a single context. If you want a home directory name containing any of those elements, bad luck.

In this most basic approach, the delimiters are special, and cannot be used for anything else. In the filesystem, file and directory names may not contain slashes or NULs. Period. (They may contain periods, though :). This seems a fair trade-off, since escaping slashes or NULs in file names would introduce some very unpleasant and risky complexity.

In many other places, though, the delimiter characters not being available otherwise is convention, enforced by things breaking if you don't follow the convention, as anyone putting spaces in their file names will know.

There are plenty of times when you do want to shove in whatever data you like, despite the delimiters and terminators. What can you do? A basic approach is an escape character. In shell, you can remove the file "A B" with "remove A\ B", and the backslash escapes the space which would otherwise separate the filenames. In C, a string literal which would otherwise be terminated with a double quote can have that double quote escaped.

Escaping provides more than just the ability to put a terminating character inside the data without terminating the data. It allows you to put in special characters (e.g. "\n" in C strings), or handle lots of special characters (e.g. "*" in regex vs. the literal asterisk in "\*"). However, basic escaping suffers from one particular hassle: The terminator can appear literally in the message, and the context must be checked to see if it's a terminator or not. In other words, if my fields on a line are separated by ":", and it can be escaped as "\:", I still can't just naively split on ":" itself.

The next question, therefore, is "How can I write my message so the delimiter never appears in the message?". If you know what the message is, and can choose your delimiter, you simply choose a delimiter not contained in the message. This is precisely the approach used by "here docs" in shell or Perl. In effect, it's also the half-hearted approach in shell when you put quote marks around a string - you're switching the delimiter from a space to the quote marks. Of course, if your string contains quote marks another approach is necessary, usually escaping.

This approach is also the one taken by Lua's "long strings" and similar comments. These are opened by tokens of the form '[======[', and closed by square brackets with the same number of equals signs, allowing you to embed anything inside as long as you've chosen the appropriate number of equals. If you know your string content, the chosen delimiter approach is nice, since the core data itself can be represented literally - there are no escapes or stuffing. The downside is that the delimiters have to be able to be multi-character, and there is the need to choose the delimiter, which is especially fun if you are streaming the content and don't know it in advance.

Another approach is to have fixed delimiters, but arrange your escaping so that the delimiter never appears in the content. Probably the best-known example of this is HTML/XML. A piece of text contains tags, but the tagging characters "<" and ">" never appear literally in your source. They are replaced with &lt; and &gt;, and the ampersand is escaped as &amp;. This means delimiter scanning can be done without understanding the escaping, although you clearly need to understand the escaping to return the content to its original form.

A similar way of doing this is described in the Lua book. All characters you wish to eliminate from your string are converted to numeric escapes - i.e. \xxx for some numeric xxx (backslash being escaped like this too, of course). This means that you can then safely use the escaped characters as delimiters, without fear of seeing them anywhere else. Once the values have been split up, converting back to the original string is simple, too. A rather neat approach, I thought.

The "separator never appears in data" approach also appears fairly commonly in a multi-byte-separator context, but it's not immediately obvious what's going on, since it looks like you can embed the separator "character" in the message. Byte stuffing for mail messages has "\n.\n" as the terminator, and avoids this appearing literally by adding an extra period to the start of any line that starts with a period. CSV files escape quotes by repeating them, thus distinguishing them from the terminating quote marks, which are followed by a field-separating comma, newline or end of file. In both these cases, you might think the terminator is a single character (period or double quote), but really, it's a sequence.

The same idea applies in lower-level binary systems, too. Synchronous HDLC uses a particular bit pattern as a delimiter, and uses bit stuffing to insert bits to prevent that pattern appearing in the message. HDLC with asynchronous framing uses byte stuffing to prevent the delimiter appearing in the data.

In summary, the approaches are:

  • Prefix a length
  • Use a special terminator that may not appear in the message
  • Use a special terminator that may appear in the message, escaped
  • Use a user-defined terminator chosen to not appear in the message
  • Use a special terminator that may be encoded so that it never appears literally in the message

While I'm on the topic, I thought I might write a little on some related stuff.

Shell scripting mostly uses 'terminator that may not appear in the message', and escaping if you're lucky. This makes it extremely difficult to make scripts that work reliably over lists of values which have to be encoded in a single string. *sigh*

Regular expressions need both special characters representing regex operations and to be able to express the underlying characters. So, escaping's a key part of it. It's a real shame that the standard regex expressions are so ad hoc. Some things get escaped, character ranges have special cases for special characters, some things are special in some contexts and not in others, and some characters become special when escaped. Oh, and the extended RE syntax is not a simple superset of the basic RE expressions. *sigh*

UTF-8, while not being strictly about delimiters or escaping, is a great example of a good design, thanks to its self-synchronising properties - a great deal of thought has been put into what bytes may appear where. This seems vaguely ironic to me, as it was designed by Thompson and Pike, who were heavily involved in all the unix systems that were somewhat weak at escaping delimiters! UTF-8 is neat, and all is forgiven.

Quines - programs that print their own source - are all about escaping. The basic idea is to have a program that contains a literal string (with some escaping used by the language), and it will then print out that literal string, with the escaped version of that string substituted into it.

For example, in Haskell, we can do:

let x = "let x =  ; (y, z) = splitAt 8 x in putStrLn $ y ++ show x ++ z" ; (y, z) = splitAt 8 x in putStrLn $ y ++ show x ++ z

Posted 2014-11-28.