Slab-Optimized Caching
Our community continually inspires us to refine Duva, and a recent discussion sparked a crucial realization. While our initial LRU cache design offered excellent O(1) performance, it left us exposed to a deeper, more pervasive problem: external memory fragmentation. Today, I want to dive into the implementation of our Least Recently Used (LRU) cache, explore the design choices we made, and explain how it integrates seamlessly and safely with our actor-based architecture. In particular, I'll focus on how we tackle both algorithmic performance and critical memory challenges like fragmentation and cache locality.
Why LRU?
The LRU cache eviction policy is a classic for a reason. It's simple, intuitive, and generally effective. When the cache reaches its capacity, the item that hasn't been accessed for the longest time is evicted, making space for new data. This policy assumes that data accessed recently is likely to be accessed again soon, which holds true for many real-world workloads.
Performance: Amortized O(1), Enhanced with Slab Allocation
Duva's LRU implementation combines a `HashMap` for key lookups with an index-based doubly linked list for ordering. This core design provides amortized O(1) performance for every operation: lookups, insertions, removals, and updates. But we didn't stop there. To push the boundaries of memory efficiency and CPU cache locality, we've integrated a custom slab allocator for managing our cache nodes. This holistic approach ensures our O(1) operations are not just theoretically fast, but blazing fast in practice, thanks to eliminated memory fragmentation and significantly better memory access patterns. A minimal sketch of the layout follows the operation list below.
1. GET
2. INSERT
3. REMOVE
4. UPDATE
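To make that concrete, here is a minimal, self-contained sketch of the idea: a `HashMap` maps each key to a slot index, and `prev`/`next` indices stored in the nodes form the recency list inside one contiguous `Vec`. The names (`LruCache`, `Node`, `NIL`) are illustrative rather than Duva's actual types, and capacity checks, eviction, and slot reuse are omitted for brevity (slot reuse is covered in the slab section below).

```rust
use std::collections::HashMap;
use std::hash::Hash;

const NIL: usize = usize::MAX; // sentinel index meaning "no neighbor"

// Hypothetical, simplified node: `prev` and `next` are slot indices, not pointers.
// (A real implementation would also store the key here so eviction can remove
// the corresponding map entry.)
struct Node<V> {
    value: V,
    prev: usize,
    next: usize,
}

struct LruCache<K, V> {
    map: HashMap<K, usize>, // key -> slot index: amortized O(1) lookup
    nodes: Vec<Node<V>>,    // contiguous, slab-like storage for every node
    head: usize,            // index of the most recently used node
    tail: usize,            // index of the least recently used node
}

impl<K: Hash + Eq, V> LruCache<K, V> {
    fn new() -> Self {
        Self { map: HashMap::new(), nodes: Vec::new(), head: NIL, tail: NIL }
    }

    // GET: one hash lookup, then two O(1) re-links to move the node to the front.
    fn get(&mut self, key: &K) -> Option<&V> {
        let idx = *self.map.get(key)?;
        self.detach(idx);
        self.push_front(idx);
        Some(&self.nodes[idx].value)
    }

    // INSERT (capacity check and eviction omitted): add a node, link it at the front.
    fn insert(&mut self, key: K, value: V) {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, prev: NIL, next: NIL });
        self.map.insert(key, idx);
        self.push_front(idx);
    }

    // Unlink a node from the recency list by rewiring its neighbors' indices.
    fn detach(&mut self, idx: usize) {
        let (prev, next) = (self.nodes[idx].prev, self.nodes[idx].next);
        if prev != NIL { self.nodes[prev].next = next; } else { self.head = next; }
        if next != NIL { self.nodes[next].prev = prev; } else { self.tail = prev; }
    }

    // Link a node in at the head (most recently used position).
    fn push_front(&mut self, idx: usize) {
        self.nodes[idx].prev = NIL;
        self.nodes[idx].next = self.head;
        if self.head != NIL { self.nodes[self.head].prev = idx; }
        self.head = idx;
        if self.tail == NIL { self.tail = idx; }
    }
}

fn main() {
    let mut cache = LruCache::new();
    cache.insert("a", 1);
    cache.insert("b", 2);
    assert_eq!(cache.get(&"a"), Some(&1)); // touching "a" makes "b" the LRU entry
    println!("least recently used value: {}", cache.nodes[cache.tail].value);
}
```

Because every link is just a `usize`, moving a node to the front of the recency list never touches the allocator; it only rewrites a handful of indices.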
Slab Allocation: The Foundation for Eliminating Fragmentation and Boosting Cache Locality
Conventional doubly linked lists often rely on separate heap allocations for each node, which can lead to significant inefficiencies, especially with frequent additions and removals of small objects. This causes external memory fragmentation, where free memory gets scattered in small, unusable chunks, and poor cache locality, where related data is spread far apart in memory, slowing down CPU access.
This is where slab allocation helps our LRU cache. Instead of asking the operating system for tiny, disparate blocks of memory for each cache node, we employ a custom slab allocator that takes precise control of memory for these specific objects.
How it works:
- Contiguous Chunks (Slabs): Our slab allocator pre-allocates large, contiguous blocks of memory (the "slabs"). Each slab is then internally divided into fixed-size slots, perfectly sized to hold a single `CacheNode`.
- Index-Based Linked List: Crucially, the `prev` and `next` "pointers" within each `CacheNode` are not raw memory addresses. Instead, they are indices into the slab's internal array (a `Vec`). This means our "logical" doubly linked list is built directly within this contiguous memory block.
- Efficient Node Management: When the LRU cache needs a new node, it requests a slot from our slab allocator, which quickly provides an available index from a pre-existing slab. When a node is removed (evicted), its slot (index) is simply marked as free within the slab, making it immediately available for future reuse without involving the slower system allocator. A sketch of this slot reuse follows this list.
The Benefits:
- Drastically Reduced Memory Fragmentation: By reusing fixed-size slots within slabs, we virtually eliminate external memory fragmentation. This ensures memory remains in larger, more usable chunks, making allocation and deallocation much more predictable and efficient for sustained high-performance operations.
- Superior Cache Locality: Since many `CacheNode`s reside within the same or very few contiguous slabs, they are physically close to each other in memory. When the CPU fetches one node, there's a much higher probability that nearby nodes (which are often needed next during list traversals) will also be pulled into the CPU's fast L1/L2 cache. This significantly reduces cache misses and speeds up memory access.
- Lower Allocation Overhead: Bypassing the general-purpose system allocator for each tiny `CacheNode` allocation reduces the computational overhead associated with memory management, leading to faster overall operations. This directly contributes to the sustained O(1) performance.
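To see why the locality argument holds, here is a tiny standalone illustration (a general observation about `Vec` layout, not a dump of Duva's internals): nodes stored in a single `Vec` sit back-to-back in memory, so hopping from one index to a neighboring one stays within a small, contiguous region that maps onto very few cache lines.

```rust
struct CacheNode {
    prev: usize, // index of the previous node in the slab (usize::MAX = none)
    next: usize, // index of the next node in the slab (usize::MAX = none)
    value: u64,
}

fn main() {
    // Four nodes wired into a chain 0 <-> 1 <-> 2 <-> 3, all inside one Vec.
    let nodes: Vec<CacheNode> = (0..4usize)
        .map(|i| CacheNode {
            prev: if i == 0 { usize::MAX } else { i - 1 },
            next: if i == 3 { usize::MAX } else { i + 1 },
            value: i as u64 * 10,
        })
        .collect();

    // Because the Vec owns one contiguous buffer, node i+1 starts exactly
    // size_of::<CacheNode>() bytes after node i.
    let base = nodes.as_ptr() as usize;
    for (i, node) in nodes.iter().enumerate() {
        let offset = (node as *const CacheNode as usize) - base;
        println!(
            "node {i}: value {}, prev {}, next {}, byte offset {offset}",
            node.value, node.prev, node.next
        );
    }
    println!("node size: {} bytes", std::mem::size_of::<CacheNode>());
}
```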
Safety in Rust: Leveraging the Type System for Robustness
Our use of a `HashMap` and a `Vec`-backed `Slab` with `usize` indices for the linked list makes all of this possible in safe Rust. Rust's strict ownership and borrowing rules prevent common pitfalls like dangling pointers, use-after-free errors, and data races at compile time, providing strong guarantees without sacrificing speed.
The efficiency gains from our index-based linked list within a slab are achieved not through unsafe pointer manipulation, but by carefully designing the data structure to align with Rust's type system and memory management principles. This approach ensures high performance while maintaining the robust safety and reliability Rust is known for.
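To illustrate that last point, here is a tiny sketch (the helper `value_at` and the `Vec<Option<…>>` layout are hypothetical simplifications, not Duva's code): every access goes through a bounds-checked index into memory the slab still owns, so a stale or out-of-range index can at worst produce a miss or a panic, never a use-after-free.

```rust
// A stale or out-of-range index is a logic error, not a memory-safety bug:
// `get` is bounds-checked, and a vacated slot is simply `None`, so no index
// can ever reach freed memory.
fn value_at(slots: &[Option<String>], idx: usize) -> Option<&String> {
    slots.get(idx)?.as_ref()
}

fn main() {
    let mut slots: Vec<Option<String>> = vec![Some("a".into()), Some("b".into())];
    slots[1] = None; // "remove" a node by vacating its slot

    assert_eq!(value_at(&slots, 0).map(String::as_str), Some("a"));
    assert!(value_at(&slots, 1).is_none()); // stale index: a cache miss, not use-after-free
    assert!(value_at(&slots, 9).is_none()); // out-of-range index: also just a miss
    println!("all index accesses stayed safe");
}
```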
Conclusion
By building an index-based doubly linked list underpinned by a custom slab allocator, we’ve engineered a high-performance LRU cache that offers O(1) access patterns with optimal memory utilization, exceptional cache locality, and a dramatic reduction in external memory fragmentation. Thanks to Rust's powerful type system, these gains are achieved with inherent safety, and our actor-based architecture provides an additional layer of concurrency safety. This fusion of performance and correctness is key to a scalable, reliable key-value store.
If you’re interested, dive into our codebase and join the project!