Garbage collection (computer science)
From Wikipedia, the free encyclopedia
In computer science, garbage collection (also known as GC) is a form of automatic memory management. The garbage collector or collector attempts to reclaim garbage, or memory used by objects that will never again be accessed or mutated by the application. Garbage collection was invented by John McCarthy around 1959 to solve the problems of manual memory management in his Lisp programming language.
Garbage collection is often portrayed as the opposite of manual memory management,which requires the programmer to specify which objects to deallocateand return to the memory system. However, many systems use acombination of the two approaches, and there are other techniques beingstudied (such as region inference) to solve the same fundamental problem. Note that there is an ambiguity of terms, as theory often uses the terms manual garbage-collection and automatic garbage-collection rather than manual memory management and garbage-collection, and does not restrict garbage-collection to memory management, rather considering that any logical or physical resource may be garbage-collected.
Contents[hide] |
[edit] Description
The basic principle of how a garbage collector works is:
- Determine what data objects in a program will not be accessed in the future
- Reclaim the storage used by those objects
By making manual memory deallocation unnecessary (and typicallyimpossible), garbage collection frees the programmer from having toworry about releasing objects that are no longer needed, which canotherwise consume a significant amount of design effort. It also aidsprogrammers in their efforts to make programs more stable, because itprevents several classes of runtime errors. For example, it prevents dangling pointererrors, where a reference to a deallocated object is used. (The pointerstill points to the location in memory where the object or data was,even though the object or data has since been deleted and the memorymay now be used for other purposes, creating a dangling pointer.)
Many computer languages require garbage collection, either as part of the language specification (e.g. Java, C#, and most scripting languages) or effectively for practical implementation (e.g. formal languages like lambda calculus); these are said to be garbage-collected languages. Other languages were designed for use with manual memory management, but have garbage collected implementations (e.g., C, C++). Some languages, like Modula-3,allow both garbage collection and manual memory management to co-existin the same application by using separate heaps for collected andmanually managed objects, or yet others like D,which is garbage-collected but allows the user to manually deleteobjects and also entirely disable garbage collection when speed isrequired. In any case, it is far easier to implement garbage collectionas part of the language's compiler and runtime system, but post hocGC systems exist, including ones that do not require recompilation. Thegarbage collector will almost always be closely integrated with the memory allocator.
[edit] Tracing garbage collectors
Tracing garbage collectors are the most common type of garbage collector. They focus on determining which objects are reachable (or potentially reachable), and then discard all remaining objects.
[edit] Reachability of an object
Informally, a reachable object can be defined as an object for whichthere exists some name in the program environment that leads to it,either directly or through references from other reachable objects.More precisely, objects can be reachable in only two ways:
- A distinguished set of objects are assumed to be reachable -- these are known as the roots. Typically, these include all the objects referenced from anywhere in the call stack (that is, all local variables and parameters in the functions currently being invoked), and any global variables.
- Anything referenced from a reachable object is itself reachable. This is referred to as transitivity.
The reachability definition of "garbage" is not optimal, insofar asthe last time a program uses an object could be long before that objectfalls out of the environment scope. A distinction is sometimes drawnbetween syntactic garbage, those objects the program cannot possibly reach, and semantic garbage,those objects the program will in fact never again use. The problem ofprecisely identifying semantic garbage can easily be shown to be partially decidable: a program that allocates an object X, runs an arbitrary input program P, and uses X if and only if P finishes would require a semantic garbage collector to solve the halting problem.Although conservative heuristic methods for semantic garbage detectionremain an active research area, essentially all practical garbagecollectors focus on syntactic garbage as described here.
[edit] Basic algorithm
Tracing garbage collectors use an algorithmin which they perform garbage collection cycles. A cycle is startedwhen the collector decides (or is notified) that it needs to reclaimstorage, which in particular happens when the system is low on memory.All tracing garbage collectors implement some variant of the tri-colour marking abstraction, but simple collectors (such as the mark-and-sweep collector) often do not make this abstraction explicit. Tri-colour marking works as follows:
- Create initial white, grey, and black sets; these sets will be usedto maintain progress during the cycle. Initially the white set or condemned setis the set of objects that are candidates for having their memoryrecycled. The black set is the set of objects that cheaply can beproven to have no references to objects in the white set; in manyimplementations the black set starts off empty. The grey set is all theremaining objects that may or may not have references to objects in thewhite set (and elsewhere). These sets partition memory; every object in the system, including the root set, is in precisely one set.
- (This step is repeated until the grey set is empty.) Pick an object from the grey set. Blacken this object (move it to the black set), by greying all the white objects it references directly.
- When there are no more objects in the grey set, then all theobjects remaining in the white set are probably not reachable and thestorage occupied by them can be reclaimed.
The tri-colour marking algorithm preserves an important invariant:
- No black object points directly to a white object.
This ensures that the white objects can be safely destroyed once the grey set is empty.
Some variations on the algorithm do not preserve the tricolourinvariant but they use a modified form for which all the importantproperties hold.
[edit] Implementation strategies
In order to implement the basic tri-color algorithm, severalimportant design decisions must be made, which can significantly affectthe performance characteristics of the garbage collector.
[edit] Moving vs. non-moving
Once the unreachable set has been determined, the garbage collector may simply release the unreachable objectsand leave everything else as it is, or it may copy some or all of thereachable objects into a new area of memory, updating all references tothose objects as needed. These are called "non-moving" and "moving"garbage collectors, respectively.
At first, a moving GC strategy may seem inefficient and costlycompared to the non-moving approach, since much more work would appearto be required on each cycle. In fact, however, the moving GC strategyleads to several performance advantages, both during the garbagecollection cycle itself and during actual program execution:
- No additional work is required to reclaim the space freed by deadobjects; the entire region of memory from which reachable objects weremoved can be considered free space. In contrast, a non-moving GC mustvisit each unreachable object and somehow record that the memory italone occupied is available.
- Similarly, new objects can be allocated very quickly. Since largecontiguous regions of memory are usually made available by the movingGC strategy, new objects can be allocated by simply incrementing a'free memory' pointer. A non-moving strategy may, after some time, leadto a heavily fragmented heap, requiring expensive consultation of "freelists" of small available blocks of memory in order to allocate newobjects.
- If an appropriate traversal order is used, objects that refer toeach other frequently can be moved very close to each other in memory,increasing the likelihood that they will be located in the same cacheline or virtual memory page. This can significantly speed up access tothese objects through these references.
[edit] Copying vs. mark-and-sweep
To further refine the distinction, tracing collectors can also bedivided by considering how the three sets of objects (white, grey, andblack) are maintained during a collection cycle.
The most straightforward approach is the semi-space collector, whichdates to 1969. In this moving GC scheme, memory is partitioned into a"from space" and "to space". Initially, objects are allocated into "tospace", until it becomes full and a collection is triggered. At thestart of a collection, the "to space" becomes the "from space", andvice versa. The objects reachable from the root set are copied from the"from space" to the "to space". These objects are scanned in turn, andall objects that they point to are copied to "to space", until allreachable objects have been copied to "to space". Once the programcontinues execution, new objects are once again allocated from the "tospace" until it is once again full and the process is repeated. Thisapproach has the advantage of conceptual simplicity (the three objectcolor sets are implicitly constructed during the copying process), butthe disadvantage that a (possibly) very large contiguous region of freememory is necessarily required on every collection cycle.
A "mark and sweep" garbage collector maintains a bit (or two) witheach object to record whether it is white or black; the grey set iseither maintained as a separate list or using another bit. As thereference tree is traversed during a collection cycle, these bits aremanipulated by the collector to reflect the current state. The mark andsweep strategy has the advantage that, once the unreachable set isdetermined, either a moving or non-moving collection strategy can bepursued; this choice of strategy can even be made at runtime, asavailable memory permits. It has the disadvantage of "bloating" objectsby a small amount.
[edit] Generational GC (aka Ephemeral GC)
It has been empirically observed that in many programs, the mostrecently created objects are also those most likely to quickly becomeunreachable (known as infant mortality or the generational hypothesis).A generational GC divides objects into generations and, on most cycles,will place only the objects of a subset of generations into the initialwhite (condemned) set. Furthermore, the runtime system maintainsknowledge of when references cross generations by observing thecreation and overwriting of references. When the garbage collectorruns, it may be able to use this knowledge to prove that some objectsin the initial white set are unreachable without having to traverse theentire reference tree. If the generational hypothesis holds, thisresults in much faster collection cycles while still reclaiming mostunreachable objects.
In order to implement this concept, many generational garbagecollectors use separate memory regions for different ages of objects.When a region becomes full, those few objects that are referenced fromolder memory regions are promoted (copied) up to the next highestregion, and the entire region can then be overwritten with freshobjects. This technique permits very fast incremental garbagecollection, since the garbage collection of only one region at a timeis all that is typically required.
Generational garbage collection is a heuristicapproach, and some unreachable objects may not be reclaimed on eachcycle. It may therefore occasionally be necessary to perform a fullmark and sweep or copying garbage collection to reclaim all availablespace. In fact, runtime systems for modern programming languages (suchas Java and the .NET Framework)usually use some hybrid of the various strategies that have beendescribed thus far; for example, most collection cycles might look onlyat a few generations, while occasionally a mark-and-sweep is performed,and even more rarely a full copying is performed to combatfragmentation. The terms "minor cycle" and "major cycle" are sometimesused to describe these different levels of collector aggressiveness.
[edit] Stop-the-world vs. incremental vs. concurrent
Simple stop-the-world garbage collectors completely haltexecution of the program to run a collection cycle, thus guaranteeingthat new objects are not allocated and objects do not suddenly becomeunreachable while the collector is running. This has the obviousdisadvantage that the program can perform no useful work while acollection cycle is running (sometimes called the "embarassing pause").
Incremental garbage collectors are designed to reduce thisdisruption by interleaving their work with activity from the mainprogram. Careful design is necessary to ensure that the main programdoes not interfere with the garbage collector and vice versa; forexample, when the program needs to allocate a new object, the runtimesystem may either need to suspend it until the collection cycle iscomplete, or somehow notify the garbage collector that there exists anew, reachable object.
Finally, a concurrent garbage collector can run concurrently in real time with the main program on a symmetric multiprocessing machine. Complex lockingregimes may be necessary in order to guarantee correctness, and cacheissues also make this less helpful than one might imagine. Nonetheless,concurrent GC may be desirable for SMP applications with highperformance requirements.
[edit] Precise vs. conservative and internal pointers
Some collectors can correctly identify all pointers (references) inan object; these are called "precise" (or "exact" or "accurate")collectors, the opposite being a "conservative" or "partlyconservative" collector. "Conservative" collectors have to assume thatany bit pattern in memory is a pointer if (when interpreted as apointer) it would point into any allocated object. Thus, conservativecollectors may have some false negatives, where storage is not releasedbecause of accidental fake pointers, but this is rarely a significantdrawback in practice. Whether a precise collector is practical usuallydepends on type safety properties of the programming language.
A related issue concerns internal pointers, or pointers tofields within an object. If the semantics of a language allow internalpointers, then there may be many different addresses that can refer tothe same object, which complicates determining whether an object isgarbage or not.
[edit] Performance implications
Tracing garbage collectors require some implicit runtime overheadthat may be beyond the control of the programmer, and can sometimeslead to performance problems. For example, the tendency of the runtimesystem to pause program execution at arbitrary times to run the garbagecollector may make GC languages inappropriate for some embeddedsystems, high-performance server software, and applications with real-time needs.
A more fundamental issue is that garbage collectors violate locality of reference,since they deliberately go out of their way to find bits of memory thathaven't been accessed recently. The performance of modern computerarchitectures is increasingly tied to caching,which depends on the assumption of locality of reference for itseffectiveness. Some garbage collection methods result in betterlocality of reference than others. Generational garbage collection isrelatively cache-friendly, and copying collectors automaticallydefragment memory helping to keep related data together. Nonetheless,poorly timed garbage collection cycles could have a severe performanceimpact on some computations, and for this reason many runtime systemsprovide mechanisms that allow the program to temporarily suspend, delayor activate garbage collection cycles.
Despite these issues, for many practical purposes,allocation/deallocation-intensive algorithms implemented in moderngarbage collected languages can actually be faster than theirequivalents using explicit memory management (at least without heroicoptimizations by an expert programmer). A major reason for this is thatthe garbage collector allows the runtime system to amortizeallocation and deallocation operations in a potentially advantageousfashion. For example, consider the following program in the(garbage-collected) C# language:
class A {
private int x;
public A() { x = 0; ++x; }
}
class Example {
public static void Main() {
for (int i = 0; i < 1000000000; ++i) {
A a = new A();
}
System.Console.WriteLine("DING!");
}
}
And a nearly equivalent C++ program:
#include <iostream>
class A {
int x;
public:
A() { x = 0; ++x; }
};
int main() {
for (int i = 0; i < 1000000000; ++i) {
A *a = new A();
delete a;
}
std::cout << "DING!" << std::endl;
}
Using standard libraries and typical compiler configurations, the C# program executes manytimes faster than the C++ program. This is because the C++ allocatormust hunt for free blocks of memory in a potentially fragmented heap in order to allocate new instances of the A class. In contrast, the C#runtime system can allocate memory by incrementing a pointer from someregion of memory previously set aside for new allocations, relying onthe garbage collector to eventually release the unused objects andcompact the heap. On the other hand, the garbage-collected program mayhave somewhat greater memory usage averaged over time, since instancesof the A class are not deallocated as quickly as they could be.
Each has different sorts of overheads:
- Manual allocation:
- search for best/first-fit block of sufficient size
- free list maintenance
- Garbage collection:
- locate reachable objects
- copy reachable objects for moving collectors
- read/write barriers for incremental collectors
- search for best/first-fit block and free list maintenance for non-moving collectors
It is difficult to compare the two cases directly, as their behaviordepends on the situation. For example, in the best case for agarbage-collecting system, allocation just increments a pointer; but inthe best speed performance case for manual allocation, the allocatormaintains freelists of specific sizes and allocation only requiresfollowing a pointer. However this size segregation usually cause alarge degree of external fragmentation and this can impact cachebehaviour.
The overhead of write barriers is more likely to be noticeable in animperative-style program which frequently writes pointers into existingdata structures than in a functional-style program which constructsdata only once and never changes them.
Some advances in garbage collection can be understood as reactionsto performance issues. Early collectors were stop-the-world collectors,but the performance of this approach was distracting in interactiveapplications. Incremental collection avoided this disruption, but atthe cost of decreased efficiency due to the need for barriers.Generational collection techniques are used with both stop-the-worldand incremental collectors to increase performance--the trade-off isthat some garbage is not detected as garbage for longer than normal.
[edit] Reference counting
- Main article: reference counting
In contrast to tracing garbage collection, reference countingis a form of automatic memory management where each object has a countof the number of references to it. An object's reference count isincremented when a reference to it is created, and decremented when areference is destroyed. The object's memory is reclaimed when the countreaches zero.
There are two major disadvantage to reference counting:
- If two objects refer to each other, they can create a cycle wherebyneither will be collected as their mutual references never let theirreference counts become zero. Some GC systems using reference countinguse specific cycle-detecting algorithms to deal with this issue.
- Each assignment of a reference and each reference falling out ofscope require modifications of one or more reference counters. Whenused in a multithreaded environment, these modifications (increment anddecrement) need to be interlocked. This is usually a very expensiveoperation.
[edit] Implementations
GC comes either as a part of a programming language or an externallibrary that can be added to a language that does not have built-insupport of GC. Boehm garbage collector is the chief example for the latter.
Functional programming languages, like ML, Haskell and Lisp, traditionally use garbage collection. Lisp, which introduced functional programming, is especially notable for using garbage collection before this technique was commonly used. Script languages like Perl, Ruby, Python and PHP tend to have built-in support of GC. Also, many object-oriented programming languages like Smalltalk, Java and ECMAScript usually provide integrated garbage collection, a notable exception being C++.
[edit] See also
[edit] External links
- The Memory Management Reference
- Publications by the OOPS group at the University of Texas at Austin: http://www.cs.utexas.edu/users/oops/papers.html
- A garbage collector for C and C++ by Hans Boehm (the site also links to a number of articles on garbage collection in general)
- Article "Garbage Collector Basics and Performance Hints" by Rico Mariani
- Citations from CiteSeer
- A Glance At Garbage Collection In Object-Oriented Languages
- On-the-fly garbage collection: an exercise in cooperation by Edsger W. Dijkstra and Leslie Lamport and A.J.Martin and C.S.Scholten and E.F.M.Steffens
- Richard Jones and Rafael Lins, Garbage Collection: Algorithms for Automated Dynamic Memory Management, Wiley and Sons (1996), ISBN 0-471-94148-4
'emotional developer > detect-Java' 카테고리의 다른 글
JUnit best practices (0) | 2006.12.06 |
---|---|
Plain Old Java Object, POJO (1) | 2006.12.02 |
List인터페이스의 subList 메소드 사용시. (0) | 2006.11.03 |