You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
765 lines
30 KiB
765 lines
30 KiB
<!doctype html public "-//w3c//dtd html 4.01 transitional//en"> |
|
<!-- $Id: $ --> |
|
<html> |
|
<head> |
|
<title>TCMalloc : Thread-Caching Malloc</title> |
|
<link rel="stylesheet" href="designstyle.css"> |
|
<style type="text/css"> |
|
em { |
|
color: red; |
|
font-style: normal; |
|
} |
|
</style> |
|
</head> |
|
<body> |
|
|
|
<h1>TCMalloc : Thread-Caching Malloc</h1> |
|
|
|
<address>Sanjay Ghemawat</address> |
|
|
|
<h2><A name=motivation>Motivation</A></h2> |
|
|
|
<p>TCMalloc is faster than the glibc 2.3 malloc (available as a |
|
separate library called ptmalloc2) and other mallocs that I have |
|
tested. ptmalloc2 takes approximately 300 nanoseconds to execute a |
|
malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc |
|
implementation takes approximately 50 nanoseconds for the same |
|
operation pair. Speed is important for a malloc implementation |
|
because if malloc is not fast enough, application writers are inclined |
|
to write their own custom free lists on top of malloc. This can lead |
|
to extra complexity, and more memory usage unless the application |
|
writer is very careful to appropriately size the free lists and |
|
scavenge idle objects out of the free list.</p> |
|
|
|
<p>TCMalloc also reduces lock contention for multi-threaded programs. |
|
For small objects, there is virtually zero contention. For large |
|
objects, TCMalloc tries to use fine grained and efficient spinlocks. |
|
ptmalloc2 also reduces lock contention by using per-thread arenas but |
|
there is a big problem with ptmalloc2's use of per-thread arenas. In |
|
ptmalloc2 memory can never move from one arena to another. This can |
|
lead to huge amounts of wasted space. For example, in one Google |
|
application, the first phase would allocate approximately 300MB of |
|
memory for its URL canonicalization data structures. When the first |
|
phase finished, a second phase would be started in the same address |
|
space. If this second phase was assigned a different arena than the |
|
one used by the first phase, this phase would not reuse any of the |
|
memory left after the first phase and would add another 300MB to the |
|
address space. Similar memory blowup problems were also noticed in |
|
other applications.</p> |
|
|
|
<p>Another benefit of TCMalloc is space-efficient representation of |
|
small objects. For example, N 8-byte objects can be allocated while |
|
using space approximately <code>8N * 1.01</code> bytes. I.e., a |
|
one-percent space overhead. ptmalloc2 uses a four-byte header for |
|
each object and (I think) rounds up the size to a multiple of 8 bytes |
|
and ends up using <code>16N</code> bytes.</p> |
|
|
|
|
|
<h2><A NAME="Usage">Usage</A></h2> |
|
|
|
<p>To use TCMalloc, just link TCMalloc into your application via the |
|
"-ltcmalloc" linker flag.</p> |
|
|
|
<p>You can use TCMalloc in applications you didn't compile yourself, |
|
by using LD_PRELOAD:</p> |
|
<pre> |
|
$ LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary> |
|
</pre> |
|
<p>LD_PRELOAD is tricky, and we don't necessarily recommend this mode |
|
of usage.</p> |
|
|
|
<p>TCMalloc includes a <A HREF="heap_checker.html">heap checker</A> |
|
and <A HREF="heapprofile.html">heap profiler</A> as well.</p> |
|
|
|
<p>If you'd rather link in a version of TCMalloc that does not include |
|
the heap profiler and checker (perhaps to reduce binary size for a |
|
static binary), you can link in <code>libtcmalloc_minimal</code> |
|
instead.</p> |
|
|
|
|
|
<h2><A NAME="Overview">Overview</A></h2> |
|
|
|
<p>TCMalloc assigns each thread a thread-local cache. Small |
|
allocations are satisfied from the thread-local cache. Objects are |
|
moved from central data structures into a thread-local cache as |
|
needed, and periodic garbage collections are used to migrate memory |
|
back from a thread-local cache into the central data structures.</p> |
|
<center><img src="overview.gif"></center> |
|
|
|
<p>TCMalloc treats objects with size <= 32K ("small" objects) |
|
differently from larger objects. Large objects are allocated directly |
|
from the central heap using a page-level allocator (a page is a 4K |
|
aligned region of memory). I.e., a large object is always |
|
page-aligned and occupies an integral number of pages.</p> |
|
|
|
<p>A run of pages can be carved up into a sequence of small objects, |
|
each equally sized. For example a run of one page (4K) can be carved |
|
up into 32 objects of size 128 bytes each.</p> |
|
|
|
|
|
<h2><A NAME="Small_Object_Allocation">Small Object Allocation</A></h2> |
|
|
|
<p>Each small object size maps to one of approximately 60 allocatable |
|
size-classes. For example, all allocations in the range 833 to 1024 |
|
bytes are rounded up to 1024. The size-classes are spaced so that |
|
small sizes are separated by 8 bytes, larger sizes by 16 bytes, even |
|
larger sizes by 32 bytes, and so forth. The maximal spacing is |
|
controlled so that not too much space is wasted when an allocation |
|
request falls just past the end of a size class and has to be rounded |
|
up to the next class.</p> |
|
|
|
<p>A thread cache contains a singly linked list of free objects per |
|
size-class.</p> |
|
<center><img src="threadheap.gif"></center> |
|
|
|
<p>When allocating a small object: (1) We map its size to the |
|
corresponding size-class. (2) Look in the corresponding free list in |
|
the thread cache for the current thread. (3) If the free list is not |
|
empty, we remove the first object from the list and return it. When |
|
following this fast path, TCMalloc acquires no locks at all. This |
|
helps speed-up allocation significantly because a lock/unlock pair |
|
takes approximately 100 nanoseconds on a 2.8 GHz Xeon.</p> |
|
|
|
<p>If the free list is empty: (1) We fetch a bunch of objects from a |
|
central free list for this size-class (the central free list is shared |
|
by all threads). (2) Place them in the thread-local free list. (3) |
|
Return one of the newly fetched objects to the applications.</p> |
|
|
|
<p>If the central free list is also empty: (1) We allocate a run of |
|
pages from the central page allocator. (2) Split the run into a set |
|
of objects of this size-class. (3) Place the new objects on the |
|
central free list. (4) As before, move some of these objects to the |
|
thread-local free list.</p> |
|
|
|
<h3><A NAME="Sizing_Thread_Cache_Free_Lists"> |
|
Sizing Thread Cache Free Lists</A></h3> |
|
|
|
<p>It is important to size the thread cache free lists correctly. If |
|
the free list is too small, we'll need to go to the central free list |
|
too often. If the free list is too big, we'll waste memory as objects |
|
sit idle in the free list.</p> |
|
|
|
<p>Note that the thread caches are just as important for deallocation |
|
as they are for allocation. Without a cache, each deallocation would |
|
require moving the memory to the central free list. Also, some threads |
|
have asymmetric alloc/free behavior (e.g. producer and consumer threads), |
|
so sizing the free list correctly gets trickier.</p> |
|
|
|
<p>To size the free lists appropriately, we use a slow-start algorithm |
|
to determine the maximum length of each individual free list. As the |
|
free list is used more frequently, its maximum length grows. However, |
|
if a free list is used more for deallocation than allocation, its |
|
maximum length will grow only up to a point where the whole list can |
|
be efficiently moved to the central free list at once.</p> |
|
|
|
<p>The psuedo-code below illustrates this slow-start algorithm. Note |
|
that <code>num_objects_to_move</code> is specific to each size class. |
|
By moving a list of objects with a well-known length, the central |
|
cache can efficiently pass these lists between thread caches. If |
|
a thread cache wants fewer than <code>num_objects_to_move</code>, |
|
the operation on the central free list has linear time complexity. |
|
The downside of always using <code>num_objects_to_move</code> as |
|
the number of objects to transfer to and from the central cache is |
|
that it wastes memory in threads that don't need all of those objects. |
|
|
|
<pre> |
|
Start each freelist max_length at 1. |
|
|
|
Allocation |
|
if freelist empty { |
|
fetch min(max_length, num_objects_to_move) from central list; |
|
if max_length < num_objects_to_move { // slow-start |
|
max_length++; |
|
} else { |
|
max_length += num_objects_to_move; |
|
} |
|
} |
|
|
|
Deallocation |
|
if length > max_length { |
|
// Don't try to release num_objects_to_move if we don't have that many. |
|
release min(max_length, num_objects_to_move) objects to central list |
|
if max_length < num_objects_to_move { |
|
// Slow-start up to num_objects_to_move. |
|
max_length++; |
|
} else if max_length > num_objects_to_move { |
|
// If we consistently go over max_length, shrink max_length. |
|
overages++; |
|
if overages > kMaxOverages { |
|
max_length -= num_objects_to_move; |
|
overages = 0; |
|
} |
|
} |
|
} |
|
</pre> |
|
|
|
See also the section on <a href="#Garbage_Collection">Garbage Collection</a> |
|
to see how it affects the <code>max_length</code>. |
|
|
|
<h2><A NAME="Large_Object_Allocation">Large Object Allocation</A></h2> |
|
|
|
<p>A large object size (> 32K) is rounded up to a page size (4K) |
|
and is handled by a central page heap. The central page heap is again |
|
an array of free lists. For <code>i < 256</code>, the |
|
<code>k</code>th entry is a free list of runs that consist of |
|
<code>k</code> pages. The <code>256</code>th entry is a free list of |
|
runs that have length <code>>= 256</code> pages: </p> |
|
<center><img src="pageheap.gif"></center> |
|
|
|
<p>An allocation for <code>k</code> pages is satisfied by looking in |
|
the <code>k</code>th free list. If that free list is empty, we look |
|
in the next free list, and so forth. Eventually, we look in the last |
|
free list if necessary. If that fails, we fetch memory from the |
|
system (using <code>sbrk</code>, <code>mmap</code>, or by mapping in |
|
portions of <code>/dev/mem</code>).</p> |
|
|
|
<p>If an allocation for <code>k</code> pages is satisfied by a run |
|
of pages of length > <code>k</code>, the remainder of the |
|
run is re-inserted back into the appropriate free list in the |
|
page heap.</p> |
|
|
|
|
|
<h2><A NAME="Spans">Spans</A></h2> |
|
|
|
<p>The heap managed by TCMalloc consists of a set of pages. A run of |
|
contiguous pages is represented by a <code>Span</code> object. A span |
|
can either be <em>allocated</em>, or <em>free</em>. If free, the span |
|
is one of the entries in a page heap linked-list. If allocated, it is |
|
either a large object that has been handed off to the application, or |
|
a run of pages that have been split up into a sequence of small |
|
objects. If split into small objects, the size-class of the objects |
|
is recorded in the span.</p> |
|
|
|
<p>A central array indexed by page number can be used to find the span to |
|
which a page belongs. For example, span <em>a</em> below occupies 2 |
|
pages, span <em>b</em> occupies 1 page, span <em>c</em> occupies 5 |
|
pages and span <em>d</em> occupies 3 pages.</p> |
|
<center><img src="spanmap.gif"></center> |
|
|
|
<p>In a 32-bit address space, the central array is represented by a a |
|
2-level radix tree where the root contains 32 entries and each leaf |
|
contains 2^15 entries (a 32-bit address spave has 2^20 4K pages, and |
|
the first level of tree divides the 2^20 pages by 2^5). This leads to |
|
a starting memory usage of 128KB of space (2^15*4 bytes) for the |
|
central array, which seems acceptable.</p> |
|
|
|
<p>On 64-bit machines, we use a 3-level radix tree.</p> |
|
|
|
|
|
<h2><A NAME="Deallocation">Deallocation</A></h2> |
|
|
|
<p>When an object is deallocated, we compute its page number and look |
|
it up in the central array to find the corresponding span object. The |
|
span tells us whether or not the object is small, and its size-class |
|
if it is small. If the object is small, we insert it into the |
|
appropriate free list in the current thread's thread cache. If the |
|
thread cache now exceeds a predetermined size (2MB by default), we run |
|
a garbage collector that moves unused objects from the thread cache |
|
into central free lists.</p> |
|
|
|
<p>If the object is large, the span tells us the range of pages covered |
|
by the object. Suppose this range is <code>[p,q]</code>. We also |
|
lookup the spans for pages <code>p-1</code> and <code>q+1</code>. If |
|
either of these neighboring spans are free, we coalesce them with the |
|
<code>[p,q]</code> span. The resulting span is inserted into the |
|
appropriate free list in the page heap.</p> |
|
|
|
|
|
<h2>Central Free Lists for Small Objects</h2> |
|
|
|
<p>As mentioned before, we keep a central free list for each |
|
size-class. Each central free list is organized as a two-level data |
|
structure: a set of spans, and a linked list of free objects per |
|
span.</p> |
|
|
|
<p>An object is allocated from a central free list by removing the |
|
first entry from the linked list of some span. (If all spans have |
|
empty linked lists, a suitably sized span is first allocated from the |
|
central page heap.)</p> |
|
|
|
<p>An object is returned to a central free list by adding it to the |
|
linked list of its containing span. If the linked list length now |
|
equals the total number of small objects in the span, this span is now |
|
completely free and is returned to the page heap.</p> |
|
|
|
|
|
<h2><A NAME="Garbage_Collection">Garbage Collection of Thread Caches</A></h2> |
|
|
|
<p>Garbage collecting objects from a thread cache keeps the size of |
|
the cache under control and returns unused objects to the central free |
|
lists. Some threads need large caches to perform well while others |
|
can get by with little or no cache at all. When a thread cache goes |
|
over its <code>max_size</code>, garbage collection kicks in and then the |
|
thread competes with the other threads for a larger cache.</p> |
|
|
|
<p>Garbage collection is run only during a deallocation. We walk over |
|
all free lists in the cache and move some number of objects from the |
|
free list to the corresponding central list.</p> |
|
|
|
<p>The number of objects to be moved from a free list is determined |
|
using a per-list low-water-mark <code>L</code>. <code>L</code> |
|
records the minimum length of the list since the last garbage |
|
collection. Note that we could have shortened the list by |
|
<code>L</code> objects at the last garbage collection without |
|
requiring any extra accesses to the central list. We use this past |
|
history as a predictor of future accesses and move <code>L/2</code> |
|
objects from the thread cache free list to the corresponding central |
|
free list. This algorithm has the nice property that if a thread |
|
stops using a particular size, all objects of that size will quickly |
|
move from the thread cache to the central free list where they can be |
|
used by other threads.</p> |
|
|
|
<p>If a thread consistently deallocates more objects of a certain size |
|
than it allocates, this <code>L/2</code> behavior will cause at least |
|
<code>L/2</code> objects to always sit in the free list. To avoid |
|
wasting memory this way, we shrink the maximum length of the freelist |
|
to converge on <code>num_objects_to_move</code> (see also |
|
<a href="#Sizing_Thread_Cache_Free_Lists">Sizing Thread Cache Free Lists</a>). |
|
|
|
<pre> |
|
Garbage Collection |
|
if (L != 0 && max_length > num_objects_to_move) { |
|
max_length = max(max_length - num_objects_to_move, num_objects_to_move) |
|
} |
|
</pre> |
|
|
|
<p>The fact that the thread cache went over its <code>max_size</code> is |
|
an indication that the thread would benefit from a larger cache. Simply |
|
increasing <code>max_size</code> would use an inordinate amount of memory |
|
in programs that have lots of active threads. Developers can bound the |
|
memory used with the flag --tcmalloc_max_total_thread_cache_bytes.</p> |
|
|
|
<p>Each thread cache starts with a small <code>max_size</code> |
|
(e.g. 64KB) so that idle threads won't pre-allocate memory they don't |
|
need. Each time the cache runs a garbage collection, it will also try |
|
to grow its <code>max_size</code>. If the sum of the thread cache |
|
sizes is less than --tcmalloc_max_total_thread_cache_bytes, |
|
<code>max_size</code> grows easily. If not, thread cache 1 will try |
|
to steal from thread cache 2 (picked round-robin) by decreasing thread |
|
cache 2's <code>max_size</code>. In this way, threads that are more |
|
active will steal memory from other threads more often than they are |
|
have memory stolen from themselves. Mostly idle threads end up with |
|
small caches and active threads end up with big caches. Note that |
|
this stealing can cause the sum of the thread cache sizes to be |
|
greater than --tcmalloc_max_total_thread_cache_bytes until thread |
|
cache 2 deallocates some memory to trigger a garbage collection.</p> |
|
|
|
<h2><A NAME="performance">Performance Notes</A></h2> |
|
|
|
<h3>PTMalloc2 unittest</h3> |
|
|
|
<p>The PTMalloc2 package (now part of glibc) contains a unittest |
|
program <code>t-test1.c</code>. This forks a number of threads and |
|
performs a series of allocations and deallocations in each thread; the |
|
threads do not communicate other than by synchronization in the memory |
|
allocator.</p> |
|
|
|
<p><code>t-test1</code> (included in |
|
<code>tests/tcmalloc/</code>, and compiled as |
|
<code>ptmalloc_unittest1</code>) was run with a varying numbers of |
|
threads (1-20) and maximum allocation sizes (64 bytes - |
|
32Kbytes). These tests were run on a 2.4GHz dual Xeon system with |
|
hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with |
|
one million operations per thread in each test. In each case, the test |
|
was run once normally, and once with |
|
<code>LD_PRELOAD=libtcmalloc.so</code>. |
|
|
|
<p>The graphs below show the performance of TCMalloc vs PTMalloc2 for |
|
several different metrics. Firstly, total operations (millions) per |
|
elapsed second vs max allocation size, for varying numbers of |
|
threads. The raw data used to generate these graphs (the output of the |
|
<code>time</code> utility) is available in |
|
<code>t-test1.times.txt</code>.</p> |
|
|
|
<table> |
|
<tr> |
|
<td><img src="tcmalloc-opspersec.vs.size.1.threads.png"></td> |
|
<td><img src="tcmalloc-opspersec.vs.size.2.threads.png"></td> |
|
<td><img src="tcmalloc-opspersec.vs.size.3.threads.png"></td> |
|
</tr> |
|
<tr> |
|
<td><img src="tcmalloc-opspersec.vs.size.4.threads.png"></td> |
|
<td><img src="tcmalloc-opspersec.vs.size.5.threads.png"></td> |
|
<td><img src="tcmalloc-opspersec.vs.size.8.threads.png"></td> |
|
</tr> |
|
<tr> |
|
<td><img src="tcmalloc-opspersec.vs.size.12.threads.png"></td> |
|
<td><img src="tcmalloc-opspersec.vs.size.16.threads.png"></td> |
|
<td><img src="tcmalloc-opspersec.vs.size.20.threads.png"></td> |
|
</tr> |
|
</table> |
|
|
|
|
|
<ul> |
|
<li> TCMalloc is much more consistently scalable than PTMalloc2 - for |
|
all thread counts >1 it achieves ~7-9 million ops/sec for small |
|
allocations, falling to ~2 million ops/sec for larger |
|
allocations. The single-thread case is an obvious outlier, |
|
since it is only able to keep a single processor busy and hence |
|
can achieve fewer ops/sec. PTMalloc2 has a much higher variance |
|
on operations/sec - peaking somewhere around 4 million ops/sec |
|
for small allocations and falling to <1 million ops/sec for |
|
larger allocations. |
|
|
|
<li> TCMalloc is faster than PTMalloc2 in the vast majority of |
|
cases, and particularly for small allocations. Contention |
|
between threads is less of a problem in TCMalloc. |
|
|
|
<li> TCMalloc's performance drops off as the allocation size |
|
increases. This is because the per-thread cache is |
|
garbage-collected when it hits a threshold (defaulting to |
|
2MB). With larger allocation sizes, fewer objects can be stored |
|
in the cache before it is garbage-collected. |
|
|
|
<li> There is a noticeable drop in TCMalloc's performance at ~32K |
|
maximum allocation size; at larger sizes performance drops less |
|
quickly. This is due to the 32K maximum size of objects in the |
|
per-thread caches; for objects larger than this TCMalloc |
|
allocates from the central page heap. |
|
</ul> |
|
|
|
<p>Next, operations (millions) per second of CPU time vs number of |
|
threads, for max allocation size 64 bytes - 128 Kbytes.</p> |
|
|
|
<table> |
|
<tr> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.64.bytes.png"></td> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.256.bytes.png"></td> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.1024.bytes.png"></td> |
|
</tr> |
|
<tr> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.4096.bytes.png"></td> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.8192.bytes.png"></td> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.16384.bytes.png"></td> |
|
</tr> |
|
<tr> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.32768.bytes.png"></td> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.65536.bytes.png"></td> |
|
<td><img src="tcmalloc-opspercpusec.vs.threads.131072.bytes.png"></td> |
|
</tr> |
|
</table> |
|
|
|
<p>Here we see again that TCMalloc is both more consistent and more |
|
efficient than PTMalloc2. For max allocation sizes <32K, TCMalloc |
|
typically achieves ~2-2.5 million ops per second of CPU time with a |
|
large number of threads, whereas PTMalloc achieves generally 0.5-1 |
|
million ops per second of CPU time, with a lot of cases achieving much |
|
less than this figure. Above 32K max allocation size, TCMalloc drops |
|
to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost |
|
to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU |
|
time is being burned spinning waiting for locks in the heavily |
|
multi-threaded case).</p> |
|
|
|
|
|
<H2><A NAME="runtime">Modifying Runtime Behavior</A></H2> |
|
|
|
<p>You can more finely control the behavior of the tcmalloc via |
|
environment variables.</p> |
|
|
|
<p>Generally useful flags:</p> |
|
|
|
<table frame=box rules=sides cellpadding=5 width=100%> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_SAMPLE_PARAMETER</code></td> |
|
<td>default: 0</td> |
|
<td> |
|
The approximate gap between sampling actions. That is, we |
|
take one sample approximately once every |
|
<code>tcmalloc_sample_parmeter</code> bytes of allocation. |
|
This sampled heap information is available via |
|
<code>MallocExtension::GetHeapSample()</code> or |
|
<code>MallocExtension::ReadStackTraces()</code>. A reasonable |
|
value is 524288. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_RELEASE_RATE</code></td> |
|
<td>default: 1.0</td> |
|
<td> |
|
Rate at which we release unused memory to the system, via |
|
<code>madvise(MADV_DONTNEED)</code>, on systems that support |
|
it. Zero means we never release memory back to the system. |
|
Increase this flag to return memory faster; decrease it |
|
to return memory slower. Reasonable rates are in the |
|
range [0,10]. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD</code></td> |
|
<td>default: 1073741824</td> |
|
<td> |
|
Allocations larger than this value cause a stack trace to be |
|
dumped to stderr. The threshold for dumping stack traces is |
|
increased by a factor of 1.125 every time we print a message so |
|
that the threshold automatically goes up by a factor of ~1000 |
|
every 60 messages. This bounds the amount of extra logging |
|
generated by this flag. Default value of this flag is very large |
|
and therefore you should see no extra logging unless the flag is |
|
overridden. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES</code></td> |
|
<td>default: 16777216</td> |
|
<td> |
|
Bound on the total amount of bytes allocated to thread caches. This |
|
bound is not strict, so it is possible for the cache to go over this |
|
bound in certain circumstances. This value defaults to 16MB. For |
|
applications with many threads, this may not be a large enough cache, |
|
which can affect performance. If you suspect your application is not |
|
scaling to many threads due to lock contention in TCMalloc, you can |
|
try increasing this value. This may improve performance, at a cost |
|
of extra memory use by TCMalloc. See <a href="#Garbage_Collection"> |
|
Garbage Collection</a> for more details. |
|
</td> |
|
</tr> |
|
|
|
</table> |
|
|
|
<p>Advanced "tweaking" flags, that control more precisely how tcmalloc |
|
tries to allocate memory from the kernel.</p> |
|
|
|
<table frame=box rules=sides cellpadding=5 width=100%> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_SKIP_MMAP</code></td> |
|
<td>default: false</td> |
|
<td> |
|
If true, do not try to use <code>mmap</code> to obtain memory |
|
from the kernel. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_SKIP_SBRK</code></td> |
|
<td>default: false</td> |
|
<td> |
|
If true, do not try to use <code>sbrk</code> to obtain memory |
|
from the kernel. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_DEVMEM_START</code></td> |
|
<td>default: 0</td> |
|
<td> |
|
Physical memory starting location in MB for <code>/dev/mem</code> |
|
allocation. Setting this to 0 disables <code>/dev/mem</code> |
|
allocation. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_DEVMEM_LIMIT</code></td> |
|
<td>default: 0</td> |
|
<td> |
|
Physical memory limit location in MB for <code>/dev/mem</code> |
|
allocation. Setting this to 0 means no limit. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_DEVMEM_DEVICE</code></td> |
|
<td>default: /dev/mem</td> |
|
<td> |
|
Device to use for allocating unmanaged memory. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_MEMFS_MALLOC_PATH</code></td> |
|
<td>default: ""</td> |
|
<td> |
|
If set, specify a path where hugetlbfs or tmpfs is mounted. |
|
This may allow for speedier allocations. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_MEMFS_LIMIT_MB</code></td> |
|
<td>default: 0</td> |
|
<td> |
|
Limit total memfs allocation size to specified number of MB. |
|
0 means "no limit". |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_MEMFS_ABORT_ON_FAIL</code></td> |
|
<td>default: false</td> |
|
<td> |
|
If true, abort() whenever memfs_malloc fails to satisfy an allocation. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_MEMFS_IGNORE_MMAP_FAIL</code></td> |
|
<td>default: false</td> |
|
<td> |
|
If true, ignore failures from mmap. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>TCMALLOC_MEMFS_MAP_PRVIATE</code></td> |
|
<td>default: false</td> |
|
<td> |
|
If true, use MAP_PRIVATE when mapping via memfs, not MAP_SHARED. |
|
</td> |
|
</tr> |
|
|
|
</table> |
|
|
|
|
|
<H2><A NAME="compiletime">Modifying Behavior In Code</A></H2> |
|
|
|
<p>The <code>MallocExtension</code> class, in |
|
<code>malloc_extension.h</code>, provides a few knobs that you can |
|
tweak in your program, to affect tcmalloc's behavior.</p> |
|
|
|
<h3>Releasing Memory Back to the System</h3> |
|
|
|
<p>By default, tcmalloc will release no-longer-used memory back to the |
|
kernel gradually, over time. The <a |
|
href="#runtime">tcmalloc_release_rate</a> flag controls how quickly |
|
this happens. You can also force a release at a given point in the |
|
progam execution like so:</p> |
|
<pre> |
|
MallocExtension::instance()->ReleaseFreeMemory(); |
|
</pre> |
|
|
|
<p>You can also call <code>SetMemoryReleaseRate()</code> to change the |
|
<code>tcmalloc_release_rate</code> value at runtime, or |
|
<code>GetMemoryReleaseRate</code> to see what the current release rate |
|
is.</p> |
|
|
|
<h3>Memory Introspection</h3> |
|
|
|
<p>There are several routines for getting a human-readable form of the |
|
current memory usage:</p> |
|
<pre> |
|
MallocExtension::instance()->GetStats(buffer, buffer_length); |
|
MallocExtension::instance()->GetHeapSample(&string); |
|
MallocExtension::instance()->GetHeapGrowthStacks(&string); |
|
</pre> |
|
|
|
<p>The last two create files in the same format as the heap-profiler, |
|
and can be passed as data files to pprof. The first is human-readable |
|
and is meant for debugging.</p> |
|
|
|
<h3>Generic Tcmalloc Status</h3> |
|
|
|
<p>TCMalloc has support for setting and retrieving arbitrary |
|
'properties':</p> |
|
<pre> |
|
MallocExtension::instance()->SetNumericProperty(property_name, value); |
|
MallocExtension::instance()->GetNumericProperty(property_name, &value); |
|
</pre> |
|
|
|
<p>It is possible for an application to set and get these properties, |
|
but the most useful is when a library sets the properties so the |
|
application can read them. Here are the properties TCMalloc defines; |
|
you can access them with a call like |
|
<code>MallocExtension::instance()->GetNumericProperty("generic.heap_size", |
|
&value);</code>:</p> |
|
|
|
<table frame=box rules=sides cellpadding=5 width=100%> |
|
|
|
<tr valign=top> |
|
<td><code>generic.current_allocated_bytes</code></td> |
|
<td> |
|
Number of bytes used by the application. This will not typically |
|
match the memory use reported by the OS, because it does not |
|
include TCMalloc overhead or memory fragmentation. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>generic.heap_size</code></td> |
|
<td> |
|
Bytes of system memory reserved by TCMalloc. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>tcmalloc.pageheap_free_bytes</code></td> |
|
<td> |
|
Number of bytes in free, mapped pages in page heap. These bytes |
|
can be used to fulfill allocation requests. They always count |
|
towards virtual memory usage, and unless the underlying memory is |
|
swapped out by the OS, they also count towards physical memory |
|
usage. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>tcmalloc.pageheap_unmapped_bytes</code></td> |
|
<td> |
|
Number of bytes in free, unmapped pages in page heap. These are |
|
bytes that have been released back to the OS, possibly by one of |
|
the MallocExtension "Release" calls. They can be used to fulfill |
|
allocation requests, but typically incur a page fault. They |
|
always count towards virtual memory usage, and depending on the |
|
OS, typically do not count towards physical memory usage. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>tcmalloc.slack_bytes</code></td> |
|
<td> |
|
Sum of pageheap_free_bytes and pageheap_unmapped_bytes. Provided |
|
for backwards compatibility only. Do not use. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>tcmalloc.max_total_thread_cache_bytes</code></td> |
|
<td> |
|
A limit to how much memory TCMalloc dedicates for small objects. |
|
Higher numbers trade off more memory use for -- in some situations |
|
-- improved efficiency. |
|
</td> |
|
</tr> |
|
|
|
<tr valign=top> |
|
<td><code>tcmalloc.current_total_thread_cache_bytes</code></td> |
|
<td> |
|
A measure of some of the memory TCMalloc is using (for |
|
small objects). |
|
</td> |
|
</tr> |
|
|
|
</table> |
|
|
|
<h2><A NAME="caveats">Caveats</A></h2> |
|
|
|
<p>For some systems, TCMalloc may not work correctly with |
|
applications that aren't linked against <code>libpthread.so</code> (or |
|
the equivalent on your OS). It should work on Linux using glibc 2.3, |
|
but other OS/libc combinations have not been tested.</p> |
|
|
|
<p>TCMalloc may be somewhat more memory hungry than other mallocs, |
|
(but tends not to have the huge blowups that can happen with other |
|
mallocs). In particular, at startup TCMalloc allocates approximately |
|
240KB of internal memory.</p> |
|
|
|
<p>Don't try to load TCMalloc into a running binary (e.g., using JNI |
|
in Java programs). The binary will have allocated some objects using |
|
the system malloc, and may try to pass them to TCMalloc for |
|
deallocation. TCMalloc will not be able to handle such objects.</p> |
|
|
|
<hr> |
|
|
|
<address>Sanjay Ghemawat, Paul Menage<br> |
|
<!-- Created: Tue Dec 19 10:43:14 PST 2000 --> |
|
<!-- hhmts start --> |
|
Last modified: Sat Feb 24 13:11:38 PST 2007 (csilvers) |
|
<!-- hhmts end --> |
|
</address> |
|
|
|
</body> |
|
</html>
|
|
|