How it works: Automatic garbage collection

Most “modern” programming languages do not require (or permit) the programmer to write code that directly handles memory. The reason is two-fold: first, allowing this means that the programmer may end up introducing hard-to-find bugs, which can lead to the program leaking memory, corrupting data, or allowing carefully crafted input to take control of the program.

Secondly: programmer convenience. Do you really want to keep track of what memory is allocated somewhere else in your program? Do you really want to have to free it manually every time you are done using it, but not while some other part of the program is still using it? Or would you prefer to spend your attention on solving actual problems (whether they are the kind of problem your boss pays you to solve or the kind of problem you like to solve in your spare time)? Would you rather not spend hours chasing down a bug because you forgot to free a variable, and now the program leaks so much memory that it can’t stay stable for more than 15 minutes? Or patching a security issue at 4 am because you reused a variable that could contain user-addressable content, and now some two-bit noob in Uzbekistan has used that flaw in his latest malware?

Automatic memory management allows us to pretend that there is infinite memory, and that we can therefore simply allocate new memory as needed and it will always be available. This is of course not true (hence the pretending), but pretending it is true is very, very convenient and only requires us to deny reality (which is easy). Oh, and it requires that memory which the program has allocated, but which is no longer in use, is made available for reuse; that the program never requires more memory at any given time than is available on the machine it runs on (although manual memory management wouldn’t make a difference in that case: you wouldn’t be able to run the program, no matter what); and that the computer has some spare CPU to use for the garbage collector (this is almost always the case, but even when it is not, you could write the rest of the program in a language with garbage collection and then write the part that requires all the performance in C/C++).

Well, at least until the abstraction leaks. And all abstractions leak. This abstraction leaks first of all by altering timing, because the garbage collector might choose at any point to reclaim unreachable memory. Technically the program can’t know that this happened, because there is no way for it to know how long executing your method should take, but in practice the user will see it when your fancy new app pauses and jitters. Worse, this happens only slightly less often even if the user has a fancy fast computer.

The second way the abstraction can leak is when you store some object that the program is never going to read (such as storing every location a player has visited so that you can debug it later, but then turning off the debugging interface so there is no code which reads this stored data). A perfect garbage collector would detect when this happens and free the memory. Unfortunately such a garbage collector cannot be written, because it would need to be able to solve the halting problem (which is of course mathematically impossible), but it is possible to write garbage collectors that are more accurate than the naïve algorithms I introduce in this blog post.

But first, the burning question on everybody’s mind (since you made it this far):

How the hell does automatic garbage collection actually work?

The key to understanding how garbage collection works is that it exists in a world “outside” your program and can consequently control your program completely. Automatic garbage collection is not unique in this respect; virtual machines (such as the JVM, CPython, the various Ruby VMs, and thousands of others) also exist “outside” the programs they are running.

Because of this control, the garbage collector can remember and act on each allocation so that it can maintain the illusion of infinite memory.

At its most basic, the garbage collector will remember each allocation (whenever you write e.g. new JButton("hello");) and, when it suits the garbage collector, pause your program, check all the memory the program uses, and reclaim whatever memory it finds that the program can no longer access.

How exactly does it do this? Well there are many ways, and many variations of these approaches, but the simplest is:

Reference counting

Reference counting is a very simple way to ensure that memory is always freed when no longer in use. The way reference counting works is that every time an object is referenced, an internal (and, to the programmer, invisible) counter is incremented, and whenever a variable that used to refer to the object is changed, the internal counter is decremented.

As soon as the internal counter reaches zero, the object is automatically freed, decrementing other counters (and possibly freeing other objects, etc).
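The mechanism can be sketched in a few lines. This is a toy model (the `Obj` class and `incref`/`decref` helpers are made up for illustration, not how CPython actually implements it; real counters live inside the object header and are invisible to you):

```python
class Obj:
    """A toy heap object carrying its own reference count."""
    def __init__(self):
        self.refcount = 0
        self.freed = False

def incref(obj):
    # A new variable starts referring to obj.
    obj.refcount += 1

def decref(obj):
    # A variable that referred to obj is changed or goes out of scope.
    obj.refcount -= 1
    if obj.refcount == 0:
        obj.freed = True  # simulate handing the memory back to the allocator

a = Obj()
incref(a)           # e.g. `x = a`
incref(a)           # e.g. `y = a`
decref(a)           # e.g. `x = None`
assert not a.freed  # y still refers to the object
decref(a)           # e.g. `y = None`: the counter hits zero
assert a.freed      # the object is freed immediately, no pause needed
```

Note how the freeing happens at a predictable point, immediately when the last reference goes away; this is where the low overhead and easy predictability mentioned below come from.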

This method is used in, among others, Gtk, Objective-C, Python and PHP.

It is simple, reasonably easy to predict and has a low overhead.

It just has one problem: if two objects refer to each other (no matter how indirectly), neither of those objects, nor any other object they refer to, will ever be freed.

And circular references are pretty common, especially in GUI frameworks (because a parent component often needs to refer to its child components, and a child component often wants to refer to its parent component). Good thing Gtk and Objective-C are frequently used to make GUIs. Oh wait.
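The leak is easy to demonstrate with the same kind of toy counter (again a made-up `Node` class, not any real runtime's implementation; CPython, for instance, ships an extra cycle detector precisely because plain counting can't handle this):

```python
class Node:
    """Toy refcounted object; the count starts at 1 for the creating variable."""
    def __init__(self):
        self.refcount = 1
        self.other = None

parent, child = Node(), Node()
parent.other = child
child.refcount += 1    # parent now refers to child
child.other = parent
parent.refcount += 1   # child refers back: a cycle

# The program now forgets both objects:
parent.refcount -= 1   # e.g. `parent = None`
child.refcount -= 1    # e.g. `child = None`

# Each counter is still 1, propped up only by the cycle itself,
# so a pure reference counter will never free either object.
assert parent.refcount == 1 and child.refcount == 1
```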

Fortunately both Gtk and Objective-C have ways around this. Gtk has an API design that makes it clear when an object takes ownership of another object versus merely refers to it, and Objective-C (with the newer automatic reference counting) has weak vs. strong references, with only strong references keeping an object alive. (Interestingly, and not commonly known, Java has strong vs. weak references too, despite using a different automatic memory management algorithm.)

Neither of these workarounds would do for a language with entirely automatic memory management, like Java.

Instead, most platforms that use automatic memory management use some variation of the mark-and-sweep algorithm.

Mark-and-sweep

So what do we do if the object we want memory for might have cycles?

We have to choose another garbage collection algorithm, naturally!

Fortunately there exist a number of different garbage collection algorithms that can handle cycles, but probably the most common is mark-and-sweep (you have no doubt used it, as the JVM uses a variation of it by default). So we will examine it in more detail. (For All The Details, look up any recent work on compiler construction, e.g. the second edition of the Dragon Book, aka the Purple Dragon.)

A program has access to two types of memory: the stack, where memory is allocated in a stack-like fashion, and the heap, where memory is allocated and deallocated non-linearly.

The stack is always available and is pretty simple to clean up: once a function has finished executing, the computer just decrements the stack pointer, and the memory that function used is automatically reused by the next function.

The heap is much more complicated, because the order in which objects are allocated is not guaranteed to be the order in which they are deallocated. In fact it is almost certainly not going to be; it would surprise me if there existed a single program anywhere that deallocated everything in allocation order.

The key insight that makes garbage collection possible is that every object on the heap either can be safely reclaimed, or is referenced by at least one other object, on the heap or on the stack, that cannot itself be safely reclaimed.

You might ask: well, what about global variables? Those are counted as well, but they are essentially just another stack.

What this means is that any object on the heap which cannot safely be reclaimed has at least one path to it that starts on the stack. It is this fact that the mark-and-sweep algorithm exploits.

As the name indicates, the mark-and-sweep algorithm is a two-step process. First all the objects that are directly referenced from the stack are marked; then their children are recursively marked (unless they have previously been marked, in which case they, and their children, are skipped). The mark phase ends when there are no references left to nodes that have not been marked.

The second phase of the algorithm simply reclaims all the memory used by the objects that have not been marked. Remember that the garbage collector knows about all the allocations, even those the program can no longer access, so it can free them. And since the garbage collector exists outside your program, the references it holds do not count.
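Both phases can be sketched on a toy heap. The heap is modeled as a dictionary from object names to the objects they reference (all names here are made up for illustration; a real collector works on addresses and object headers, not strings):

```python
# A toy heap: each object maps to the list of objects it references.
heap = {
    "a": ["b"],   # reachable from the stack
    "b": ["a"],   # a cycle, but still reachable via "a"
    "c": ["d"],   # unreachable garbage: forming a cycle does not save it
    "d": ["c"],
}
roots = ["a"]     # references held by the stack (and globals)

def mark(heap, roots):
    """Phase 1: mark everything reachable from the roots."""
    marked = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in marked:
            continue              # already visited: skip it and its children
        marked.add(obj)
        worklist.extend(heap[obj])  # recurse into the object's children
    return marked

def sweep(heap, marked):
    """Phase 2: reclaim every allocation the mark phase never reached."""
    for obj in list(heap):
        if obj not in marked:
            del heap[obj]

sweep(heap, mark(heap, roots))
assert set(heap) == {"a", "b"}    # the unreachable cycle c<->d was collected
```

Note that, unlike reference counting, the cycle between c and d is no obstacle: the mark phase simply never reaches it, so the sweep phase frees it.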

Only once the memory has been freed in the sweep phase can the program reuse it (although the program itself could theoretically be resumed right after the mark phase, if enough memory is already available to fulfill the request). More advanced variations of the mark-and-sweep algorithm can isolate threads so that it isn’t necessary to stop the entire program (algorithms which do require that, including the basic mark-and-sweep, are called stop-the-world algorithms); other variations can guarantee that they won’t take more than X amount of time before the program is allowed to run again. Both of these variations are very useful for programs where small delays are perceptible to the user (e.g. in computer games or animations) or where the system has to interact with or monitor other systems.

A very common optimization that most mark-and-sweep implementations use is to keep separate sets of objects that are reclaimed at different frequencies. This is based on the observation that the lifetime of objects in a computer program is much like that of turtles: most die very young, but those that don’t tend to live very long lives.

This optimization (which, when used together with mark-and-sweep, is sometimes referred to as a generational mark-and-sweep algorithm) often works by having an eden, where new objects are allocated, and one or two separate areas of memory to which objects are promoted as they survive more and more garbage collection cycles. The objects in the later areas are assumed to stay alive for most of the cycles, so they don’t get scanned (the automatic memory system keeps track of any references from objects outside eden to objects inside eden, so those don’t get garbage collected too soon).
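The promotion idea can be sketched like this (the `live` set passed to `minor_collect` stands in for the result of a mark phase over eden only; the names `eden`, `old`, and the functions are invented for illustration):

```python
# Toy generational collection: young objects live in "eden"; any that
# survive a collection are promoted to the long-lived "old" generation,
# which is scanned far less often.
eden, old = [], []

def allocate(obj):
    eden.append(obj)

def minor_collect(live):
    """Collect only eden; promote survivors instead of rescanning old."""
    global eden
    old.extend(o for o in eden if o in live)  # survivors are promoted
    eden = []                                 # everything else is reclaimed

allocate("req-1")
allocate("cache-entry")
allocate("req-2")
minor_collect(live={"cache-entry"})  # the two requests died young, turtle-style
assert eden == [] and old == ["cache-entry"]
```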

Another common optimization for garbage collection is to “compact” memory. I use quotes here because, despite the name, the memory isn’t compacted; it is the empty space between allocations that is removed. The memory is merely moved, so that there is no gap between allocations, which makes it much more effective from a caching perspective. Not just that: if your program needs a large (contiguous) area of memory, it is much faster to find a suitable space.
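A sketch of the sliding part of compaction, on a toy memory array (real compactors also rewrite every reference through a forwarding table like the one built here; that rewriting step is omitted):

```python
# Toy compaction: live allocations are slid toward the start of memory and
# a forwarding table records each object's new address.
memory = ["obj1", None, None, "obj2", None, "obj3", None, None]

def compact(memory):
    forwarding = {}
    free = 0
    for addr, obj in enumerate(memory):
        if obj is not None:
            memory[free] = obj        # slide the object down
            forwarding[obj] = free    # references get rewritten via this table
            free += 1
    for addr in range(free, len(memory)):
        memory[addr] = None           # one big contiguous free region remains
    return forwarding

forwarding = compact(memory)
assert memory == ["obj1", "obj2", "obj3", None, None, None, None, None]
assert forwarding == {"obj1": 0, "obj2": 1, "obj3": 2}
```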

Finally, I mentioned earlier that Java has weak vs. strong references. This is an advanced topic that you normally do not have to worry about as a programmer, but which may be useful on some special occasions, just like understanding automatic garbage collection.

The first thing you need to know is that normally all Java references are strong; that is, an object can’t be garbage collected as long as there is at least one strong reference to it (more formally, such an object is said to be strongly reachable). However there is a special class, java.lang.ref.WeakReference&lt;T&gt;, which can be created to wrap any object. If an object is wrapped in a WeakReference then it may be garbage collected, because it is only weakly reachable (unless some other object has a strong reference to it).

So what use is a WeakReference? Well, first it has a get method which returns the object it wraps, or null if the object has been garbage collected (the garbage collector clears the reference atomically with collecting the object, so you will never see a dangling reference). Secondly, since normal Java references are strong, the object won’t suddenly be reclaimed while you are using the value returned from get.
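These semantics are easiest to show in Python, whose weakref.ref is the close analogue of Java’s WeakReference (calling the reference object plays the role of get). One caveat: the immediate collection below relies on CPython’s reference counting; on the JVM you would additionally have to wait for a GC run:

```python
import weakref

class Bitmap:      # stand-in for some large object worth tracking weakly
    pass

bmp = Bitmap()
ref = weakref.ref(bmp)   # analogue of `new WeakReference<>(bmp)` in Java

assert ref() is bmp      # while a strong reference exists, we get the object

strong = ref()           # the returned value is a *strong* reference, so the
assert strong is bmp     # object can't vanish while we are using it

del bmp, strong          # drop all strong references...
assert ref() is None     # ...and the weak reference now reports "collected"
```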

Okay, I hear you scream, why would you ever want to do that? Why have an object that may or may not keep a reference to some other object, rather than use any of Java’s normal containers? Personally I have only ever used it once, to build a bitmap caching system for Android. Bitmaps take up a relatively large amount of space, but at the same time they also take quite a bit of time to load, so if you have the space you want to keep them around. By having a hash table from (strong) string keys to (weak) references to bitmaps, I was able to make the garbage collector evict the bitmaps that were no longer in use when not enough memory was available.
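A sketch of that caching pattern, again using Python’s weakref module as a stand-in for the Java version (WeakValueDictionary is the ready-made “strong keys, weak values” table; the Bitmap class and file names are invented for illustration, and the final assertion relies on CPython collecting immediately):

```python
import weakref

class Bitmap:                       # stand-in for an expensive-to-load image
    def __init__(self, name):
        self.name = name

# Keys are held strongly; values only weakly, so the collector is free to
# evict any bitmap that nothing else in the program still uses.
cache = weakref.WeakValueDictionary()

def load_bitmap(name):
    bmp = cache.get(name)
    if bmp is None:                 # miss: (re)load it and remember it weakly
        bmp = Bitmap(name)
        cache[name] = bmp
    return bmp

icon = load_bitmap("icon.png")           # held strongly by the caller
assert load_bitmap("icon.png") is icon   # cache hit while the bitmap is in use

del icon                                 # no strong references remain...
assert "icon.png" not in cache           # ...so the entry is gone (on CPython)
```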

Java has two additional reference types (for a total of four): soft, which is the same as a weak reference except that the garbage collector won’t reclaim a softly reachable object unless it has to (i.e. when memory runs low), and phantom. Phantom references (which you may well stumble upon if you google weak references) are essentially even weaker than weak references: you can’t even get the object they refer to back from them!

I googled around, and about the only reasons I could find for ever using them were to handle cleanup outside a finalizer block (such as removing files on disk after the program is done reading them), to attempt to reclaim database connection pools, and to profile memory usage.

All in all, you are not likely to ever need to use them, but they are there, should you ever need to.