Encyclopedia > Automatic garbage collection

  Article Content

Automatic garbage collection

Garbage collection in computing is a system of automatic memory management. It is commonly abbreviated as GC. A garbage collector is the part of a computer system that is responsible for recycling memory via garbage collection.

When a system has a garbage collector it is usually part of the language run-time system and integrated into the language. The language is said to be garbage collected. Garbage collection was invented by John McCarthy as part of the first Lisp system. It is a bit sloppy to talk of a garbage collected language just as it is slightly incorrect to talk of a compiled language. Being garbage collected is a property of an implementation of a language.

Table of contents

Basic principle

The basic principle of automatic memory management is simple:
  1. determine what data objects in a program cannot possibly be needed.
  2. reclaim the storage used by those objects.

(liveness vs reachability)

Garbage collectors use a reachability-based criterion for determining whether a data object will be needed (but see note about reference counting below).

Reachability of an object

The principles of reachability are:
  1. A distinguished set of objects are assumed to be reachable, these are known as the roots. In a typical system these objects will be the machine registers, the stack, the instruction pointer, the global variables. In other words everything that a program can reach directly.
  2. anything referenced from a reachable object is itself reachable.

In other words a reachable object is one that the program could get to by following a chain of pointer references.

Basic algorithm

Garbage collectors work by performing garbage collection cycles. A cycle is started when the collector decides (or is notified) that it needs to reclaim storage. A cycle consists of the following steps:

  1. Create initial white, grey, and black sets; these sets will be used to maintain progress during the cycle. Initially the black set is empty, the grey set consists of specially denoted objects known as the "roots" and possibly some additional objects chosen according to the particular algorithm used, and the white set is everything else. At any time in the algorithm a particular object will be in exactly one of the three sets. The set of white objects can be thought of as the set of objects that we are trying to reclaim the storage for; in the course of the cycle the algorithm will remove many objects from the white set, leaving behind a set of objects that it can reclaim the storage for.
  2. (This step is repeated until there are no objects in the grey set.) Pick an object from the grey set, move that object from the grey set to the black set, move all the white objects that are referenced (reachable in one step of following pointers) from the selected object into the grey set.
  3. When there are no more objects in the grey set, then all the objects remaining in the white set are not reachable and the storage occupied by them can be reclaimed.

The tricolour invariant is an important property of the objects and their colours. It is simply this:

No black object points directly to a white object.

Notice that the algorithm above preserves the tricolour invariant. The initial partition has no black objects, so the invariant trivially holds. Subsequently whenever an object is made black any white objects that it references are made grey, ensuring that the invariant remains true. When the last step of the algorithm is reached, because the tricolour invariant holds, none of the objects in the black set point to any of the objects in the white set (and there are no grey objects) so the white objects must be unreachable from the roots, and the system calls their finalisers and frees their storage.

Some variations on the algorithm do not preserve the tricolour invariant but they use a modified form for which all the important properties hold.

Variants of the basic algorithm

The basic algorithm has several variants.
  • There is the issue of whether the garbage collector moves objects in memory (that is, changes their address) or not: moving or non-moving GC.

  • Some collectors can correctly identify all pointers (references) in an object; these are called "exact" collectors, the opposite being a "conservative" or "partly conservative" collector. "Conservative" collectors have to assume that any bit pattern in memory is a pointer if (when interpreted as a pointer) it would point into any allocated object. Thus, conservative collectors may have some false negatives, where storage is not released because of accidental fake pointers. However, this is not a big drawback in practice.

  • There is the issue of whether the garbage collector can run interleaved or in parallel with any of the rest of the system: the simplest garbage collectors stop the rest of the system while they perform a cycle; they are non-incremental, whereas incremental collectors interleave their work with units of activity from the rest of the system. Some incremental collectors can run fully parallel in a separate thread; these can theoretically run on a separate CPU, but cache issues make this less practical than it may initially appear.

Mark and sweep
Collectors can also be divided by considering how the three sets (of white, grey, and black objects) are implemented. A Mark sweep GC[?] maintains a bit (or two) with each object to record whether it is white or black; the grey set is either maintained as a separate list or using another bit. A Copying GC[?] identifies grey and black objects by copying them to another area of memory (the copy space) and often distinguishes black from grey objects by dividing the copy space into two portions (in the simple case by maintaining a single pointer that marks the division between black and grey objects).

Generational GC
There is the issue of when to perform a cycle and what objects to place in the initial grey set. A simple collector will always put only the roots in the initial grey set, and everything else will be initially white.

Statistically speaking, the objects most recently created in the runtime system are also those most likely to quickly become unreachable. A Generational GC[?] divides objects into generations and will generally only place the objects of a subset of all the generations into the initial white set (the grey set being everything else). This can result in faster cycles.

Reference counting
Reference counting is commonly referred to as garbage collection, but this usage is not accepted by all. It could probably be argued that reference counting is a class of conservative garbage collection. Some punters like to reserve the term garbage collection for algorithms and systems that determine the liveness of data based on reachability.

Reference counting has serious problems anyway, such as a general inability to deal with circular references and a high overhead in both space and time. On the other hand, it does dispose of most garbage fairly promptly, which can have advantages if there are finalisers to be run to free up scarce resources other than heap memory. Hybrid systems that use reference counting to achieve (mostly) immediate freeing of resources, and occasionally call a mark-sweep collector to free stuff containing reference loops, have been proposed and occasionally implemented. This gives the best of both worlds in some ways, but still has a performance overhead.

Languages which use automatic garbage collection

Further Reading



All Wikipedia text is available under the terms of the GNU Free Documentation License

 
  Search Encyclopedia

Search over one million articles, find something about almost anything!
 
 
  
  Featured Article
Finite mathematics

... mathematics, is the study of mathematical structures that are fundamentally discrete, in the sense of not supporting or requiring the notion of continuity. Most, if not all, ...