Simon Frankau's blog

Linux 1.0 kernel source reading - 5: Memory management

Over to the "mm/" directory.

kmalloc.c: Hmmmm. Something else that implements kmalloc/kfree. Careful reading of lib/Makefile reveals that lib/malloc.c is not actually compiled, so I guess this is used instead!

I prefer the other implementation. This one has crazy indents and fewer comments. Also, it puts the control block in the page with the data, which means allocation blocks of powers of two don't fit in neatly, and you can't actually allocate large chunks efficiently.

Oh, and it allows a pointless race in the code to allocate from a new page: If there are no blocks available, it constructs a new page, adds it to the list, and then races against anything else to take an item out of that list, complete with code to give up if it loses the race too often. More simply, it could just allocate the chunk from the newly constructed page before hooking the remainder into the pool. Naff design.

And there are other minor stylistic things I don't like.

swap.c: _get_free_page lives here, so it seems a good place to start. In the standard style of C, it's written in a bottom-up fashion, and there isn't high-level explanation comments, so it's a bit of an exercise to work out what it intends to do as we go along.

The basic idea at start of file is to treat swapped-out pages as a specialised resource, with functions to to claim, free, duplicate (increase ref count), etc.

It looks like swapped-out pages get duplicated when they're swapped in... but the pages that actually get paged out are ones that are not already backed by a file, so the paging back in effectively just does a copy-on-write-like step as the page is pulled in.

The next part of the file moves on to the mechanism for trying to swap out pages (which calls into other areas, mildly breaking the layering of abstractions). Then later parts of the file deal with freeing pages (free_page), and getting a page (__get_free_page), kicking off some swapping if necessary. The file concludes with the turning on and off of swap devices and swap files.

memory.c: This file is a chunky one, and the next layer up the abstraction stack. Rather than dealing with the allocation of individual pages (and swapping), it deals with page tables (modulo swapping) and the mapping of ranges of VM.

The "oom" function is commented to send an "untrappable SIGSEGV". The code seems to send a SIGKILL. For some reason, this amused me.

Various functions have heavily duplicate code - "clear_page_tables" vs. "free_page_tables". This reveals the pain of working in a low-level language like C (I'm currently reading On Lisp).

"share_page", to find a page for a VM mapping that can be shared, looks hideously expensive, and the wrong way to do things. VM mappings should know where their pages are, not have to scrape processes! Sharing with buffer cache seems a more pleasant alternative, too (and is perhaps how you should really do this look-up).

The end of this file deals with the page fault handling mechanism, and the basic VM ops structure for a file memory mapping.

mmap.c: This builds memory mapping infrastructure on the above. Mostly works on the VM area linked list, so it's not clear how this is kept in sync with the actual page tables. e.g. munmap partially unmapping an area doesn't actually necessarily update the page table, so partially unmapped areas may actually remain accessible in the address space. Or I could be misreading the code. The fun of maintaining invariants in a low-level language!

vmalloc.c: This one works a little more independently. Its purpose is not explained, but it is evidently to do multi-page kernel memory allocations. It does this by finding free physical pages and assigning virtual addresses to create the needed contiguous memory in virtual address space. It stops physical page fragmentation from being a problem for large kernel allocations. Rather obviously, it assumes that you have a lot more virtual address space than physical RAM, which was certainly the case back in the day. It creates the kernel memory mapping by updating the kernel memory page tables for each process, which seems an inefficient but simple way to do it.

Theoretically, I think I've covered the real core of the system now. Realistically, given "everything is a file" (more or less), I need to read the generic parts of "fs/" to claim I've read the core parts. For example, "exec" is in fs/. Next up, I'll take a brief detour to ipc/ and ibcs/, and then time to tackle fs/.

Posted 2015-04-19.

Full index