Slide 1: Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson
Department of Computer Sciences
*
Department of Computer Science
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 2: Motivation
Parallel multithreaded programs becoming prevalent
web servers, search engines, database managers, etc. run on SMP’s for high performance often embarrassingly parallel
Memory allocation is a bottleneck
prevents scaling with number of processors
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 3: Assessment Criteria for Multiprocessor Allocators
Speed
competitive with uniprocessor allocators on one processor
Scalability
performance linear with the number of processors
Fragmentation (= max allocated / max in use)
competitive with uniprocessor allocators
worst-case and average-case
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 4: Uniprocessor Allocators on Multiprocessors
Fragmentation: Excellent Very low for most programs [Wilson & Johnstone] Speed & Scalability: Poor Heap contention a single lock protects the heap Can exacerbate false sharing different processors can share cache lines
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 5: Allocator-Induced False Sharing
Allocators cause false sharing!
Cache lines can end up spread across a number of processors Practically all allocators do this
processor 1
x1 = malloc(s);
A cache line
processor 2
x2 = malloc(s);
thrash…
thrash…
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 6: Existing Multiprocessor Allocators
Speed:
One concurrent heap (e.g., concurrent B-tree): too expensive
too many locks/atomic updates O(log n) cost per memory operation
⇒ Fast allocators use multiple heaps
Scalability:
Allocator-induced false sharing and other bottlenecks
Fragmentation: P-fold increase or even unbounded
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 7: Multiprocessor Allocator I: Pure Private Heaps
Pure private heaps: one heap per processor.
malloc gets memory
processor 1
x1= malloc(s) x2= malloc(s) free(x1) x3= malloc(s)
processor 2
free(x2) x4= malloc(s)
from the processor's heap or the system
free puts memory on the
processor's heap
Avoids heap contention
Examples: STL, ad hoc (e.g., Cilk 4.1)
= allocated by heap 1 = free, on heap 2
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 8: How to Break Pure Private Heaps: Fragmentation
Pure private heaps: memory consumption can grow without bound! Producer-consumer: processor 1 allocates processor 2 frees
processor 1
x1= malloc(s) free(x1) x2= malloc(s) free(x2) x3= malloc(s) free(x3)
processor 2
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 9: Multiprocessor Allocator II: Private Heaps with Ownership
Private heaps with ownership: free puts memory back on the originating processor's heap. Avoids unbounded memory consumption
Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]
processor 1
x1= malloc(s) free(x1) x2= malloc(s) free(x2)
processor 2
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 10: How to Break Private Heaps with Ownership:Fragmentation
Private heaps with ownership: memory consumption can blowup by a factor of P. Round-robin producerconsumer:
processor i allocates processor i+1 frees
processor 1
x1= malloc(s) free(x1) x2= malloc(s) free(x2) x3=malloc(s) free(x3)
processor 2
processor 3
This really happens (NDS).
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 11: So What Do We Do Now?
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 12: The Hoard Multiprocessor Memory Allocator
Manages memory in page-sized superblocks of samesized objects
- Avoids false sharing by not carving up cache lines - Avoids heap contention - local heaps allocate & free small blocks from their set of superblocks
Adds a global heap that is a repository of superblocks When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
- Avoids blowup in memory consumption
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 13: Hoard Example
Hoard: one heap per processor + a global heap
malloc gets memory from a
processor 1
x1= malloc(s) …some mallocs …some frees free(x7)
global heap
superblock on its heap.
free returns memory to its
Empty fraction = 1/3
superblock. If the heap is “too empty”, it moves a superblock to the global heap.
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 14: Summary of Analytical Results
Worst-case memory consumption: O(n log M/m + P) [instead of O(P n log M/m)] n = memory required
M = biggest object size m = smallest object size P = number of processors
Best possible: O(n log M/m) [Robson] Provably low synchronization in most cases
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 15: Experiments
Run on a dedicated 14-processor Sun Enterprise
300 MHz UltraSparc, 1 GB of RAM Solaris 2.7
All programs compiled with g++ version 2.95.1 Allocators:
Hoard version 2.0.2 Solaris (system allocator) Ptmalloc (GNU libc – private heaps with ownership) mtmalloc (Sun’s “MT-hot” allocator)
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 16: Performance: threadtest
speedup(x,P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 17: Performance: Larson
Server-style benchmark with sharing
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 18: Performance: false sharing
Each thread reads & writes heap data
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 19: Fragmentation Results
On most standard uniprocessor benchmarks, Hoard’s fragmentation was low:
p2c (Pascal-to-C): 1.20 LRUsim: 1.05 espresso: Ghostscript: 1.47 1.15
Within 20% of Lea’s allocator
On the multiprocessor benchmarks and other codes:
Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 20: Hoard Conclusions
Speed: Excellent
As fast as a uniprocessor allocator on one processor
amortized O(1) cost 1 lock for malloc, 2 for free Scalability: Excellent
Scales linearly with the number of processors Avoids false sharing
Fragmentation: Very good
Worst-case is provably close to ideal Actual observed fragmentation is low
Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Slide 21: Hoard Heap Details
“Segregated size class” allocator
Size classes are logarithmicallyspaced Superblocks hold objects of one size class
8 16 24 32 40 48 sizeclass bins radix-sorted superblock lists (emptiest to fullest)
empty superblocks are “recycled”
Approximately radix-sorted:
Allocate from mostly-full superblocks superblocks Fast removal of mostly-empty superblocks Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000 Hoard: A Scalable Memory