[Code Index] by Mike Marynowski

programming for work and fun

Fastest Possible Thread-Safe Lock-Free Lookup Dictionary (CopyOnWriteDictionary)

I have been writing a ton of code lately that involves permanently caching objects of various types: generated type schemas, compiled expression queries, reflection lookups for MethodInfo or PropertyInfo, etc. Usually these are updated infrequently in quick bursts, e.g. at application startup, when a new assembly loads, or when an operation is invoked for the first time and uses items that haven't been cached yet. I always felt that the existing options were lacking in usability and/or performance and that we could do much better - CopyOnWriteDictionary is the culmination of that pursuit and represents what I consider the ideal dictionary implementation for use cases like this.

Analyzing The Problem

In the case of our database engine, some of these caches might be getting hammered by 16+ concurrent threads in tight loops while under heavy load. This was mostly a problem for the schema cache, which gets hit for every row that gets materialized. Picking a data structure to store these cached objects has always involved a performance or design trade-off that I knew we could significantly improve upon. Our options initially were something like this:

  1. Use ConcurrentDictionary. Take the performance hit on every single lookup (which is substantial) due to its internal synchronization overhead, but allow multiple threads to read the dictionary without blocking.
  2. Use an ImmutableDictionary. Our performance testing has shown that this is basically never the optimal solution...it's very slow in comparison to the other options, even for just plain reading.
  3. Use a global lock on a standard Dictionary. Uncontested locks are pretty quick so individual lookups are much faster, but things can get dicey when locks start blocking concurrent threads.
  4. Use the standard Dictionary but treat it like it's immutable, meaning updates are always done to a copy of the dictionary (under a lock) and then the reference is swapped atomically. Threads that are just reading the dictionary don't need to synchronize access to it at all, so lookups are optimally fast at the cost of copying the dictionary when you want to update it.
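Option #4 can be sketched in a few lines. This is a simplified illustration of the copy-and-swap technique, not the actual CopyOnWriteDictionary source; the class and member names here are my own invention:

```csharp
using System.Collections.Generic;

// Sketch of option #4: readers use the current dictionary reference with no
// locking at all; writers copy the whole dictionary under a lock and then
// atomically publish the new reference.
public class CopyOnSwapCache<TKey, TValue>
{
    private readonly object _writeLock = new object();
    private volatile Dictionary<TKey, TValue> _current = new Dictionary<TKey, TValue>();

    // Lock-free read: a volatile reference read plus a plain Dictionary lookup.
    public bool TryGetValue(TKey key, out TValue value)
        => _current.TryGetValue(key, out value);

    // Write: copy, mutate the copy, then swap. Reference assignment is atomic
    // in .NET, so readers always see either the old or the new snapshot.
    public void Add(TKey key, TValue value)
    {
        lock (_writeLock)
        {
            var copy = new Dictionary<TKey, TValue>(_current) { { key, value } };
            _current = copy;
        }
    }
}
```

The lock only serializes writers against each other; readers never touch it, which is exactly what makes this approach attractive for read-heavy caches.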

We decided on option #4 for very heavily accessed caches that were updated infrequently. I created a CopyOnWriteDictionary wrapper around Dictionary which internalized the details of doing the copy and atomic swap whenever changes were made. Copying the entire dictionary for every single Add() operation proved to be very slow though; adding 100 items one-by-one to a dictionary that already holds 1,000 items means each Add() first copies 1,000+ existing entries, so you end up performing over 100,000 inserts instead of just 100.

Coming Up with a Better Solution

My initial way of handling this problem was the obvious one - I introduced an AddRange() method that accepted a list of items and only copied the dictionary once per call. This helped write performance but meant that we had to batch updates together before calling AddRange(), so we ended up predictively "eager loading" a lot of things that might never actually be needed. Some rather awkward and hacky code started turning up in places trying to batch updates together. Ensuring everything was synchronized properly while keeping the whole ensemble completely lock-free for lookups became quite an exercise at times, especially when cached items needed access to other cached items that might actually be sitting in a temporary batching collection somewhere, waiting to go into the proper cache.

And then the "aha!" moment came and a no-compromise design for this specific purpose was born - CopyOnWriteDictionary received a CopyDelay property, which changed everything. We can now specify how long writes to the dictionary must be inactive before the copy operation is actually invoked and the "main" internal immutable dictionary is swapped. New items are first added to a lock-synchronized mutable "pending changes" dictionary, and when the delay timer expires the copy operation runs on a background thread and merges the pending changes into the main dictionary.

Since the main dictionary is immutable, threads fetching values from it aren't blocked even while the copy operation is running, so the world doesn't stop and lots of other work still gets done on all the other running threads. Lookups for main dictionary values run at 100% plain Dictionary speed at all times because no read synchronization is ever required. The only time locks are taken for lookups is when the pending changes dictionary needs to be checked because a key wasn't present in the main dictionary, which is usually very rare. Those lookups are still very fast (they are just synchronized Dictionary lookups), and the slowdown on those keys is only momentary while they wait to be merged into the main dictionary.
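The delayed-merge idea can be sketched roughly like this. Again, this is a simplified model of the design described above, not the actual implementation; the names, the System.Threading.Timer usage, and the merge details are all assumptions:

```csharp
using System.Collections.Generic;
using System.Threading;

// Sketch of the CopyDelay design: writes land in a locked "pending" dictionary,
// a timer restarts on every write, and once writes go quiet the pending items
// are merged into a fresh copy of the main dictionary, which is then swapped in.
public class DelayedCopyDictionary<TKey, TValue>
{
    private readonly object _lock = new object();
    private readonly Dictionary<TKey, TValue> _pending = new Dictionary<TKey, TValue>();
    private volatile Dictionary<TKey, TValue> _main = new Dictionary<TKey, TValue>();
    private readonly Timer _mergeTimer;

    public int CopyDelay { get; set; } = 100; // ms of write inactivity before merging

    public DelayedCopyDictionary()
        => _mergeTimer = new Timer(_ => Merge(), null, Timeout.Infinite, Timeout.Infinite);

    public bool TryGetValue(TKey key, out TValue value)
    {
        if (_main.TryGetValue(key, out value)) // lock-free fast path
            return true;
        lock (_lock)                           // rare slow path for very recent writes
            return _pending.TryGetValue(key, out value);
    }

    public void Add(TKey key, TValue value)
    {
        lock (_lock)
        {
            _pending.Add(key, value);
            _mergeTimer.Change(CopyDelay, Timeout.Infinite); // restart the delay
        }
    }

    private void Merge()
    {
        lock (_lock)
        {
            if (_pending.Count == 0)
                return;
            var copy = new Dictionary<TKey, TValue>(_main);
            foreach (var pair in _pending)
                copy[pair.Key] = pair.Value;
            _main = copy; // atomic reference swap; readers never block
            _pending.Clear();
        }
    }
}
```

Note how a burst of Add() calls keeps pushing the timer back, so the whole burst is absorbed by a single copy operation once it subsides.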

One other benefit worth mentioning is that iterating over keys/values/both is normally very fast as well because we don't need to make copies of anything except pending items (if there are any, which usually isn't the case). As a bonus you also get a proper point-in-time snapshot of the dictionary without having to do any locking during the enumeration, unlike ConcurrentDictionary.
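The snapshot behavior falls straight out of the copy-on-write convention, and is easy to demonstrate with a plain Dictionary standing in for the internal "main" dictionary:

```csharp
using System.Collections.Generic;

// Once a reader captures the current reference, later swaps can't affect it -
// that captured reference IS the point-in-time snapshot.
var main = new Dictionary<string, int> { ["a"] = 1 };

var snapshot = main; // reader captures the immutable-by-convention reference

// A writer publishes a new copy with an extra entry (the copy-on-write swap).
main = new Dictionary<string, int>(main) { ["b"] = 2 };

System.Console.WriteLine(snapshot.Count); // still 1
System.Console.WriteLine(main.Count);     // 2
```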

Everything is much simpler now because we can just add items to the dictionary whenever we need to (i.e. when we can't already find a cached item we need in there) and the batching process is handled automatically. We typically use a delay of 50 - 100ms, which is generally more than enough to ensure that any bursts of related Add() calls get grouped together into a single copy operation and full performance for those newly added keys is quickly established. If you are loading cache items from a slower source like a hard drive or a network connection then you might want to pick a larger delay value.

The dictionary is currently append-only, meaning items can't be removed once they have been added. It wouldn't be difficult to support removal as well, I just haven't had the need for it since I've only been using it for permanent caching. Let me know in the comments if remove operations would be really beneficial to you and I can see about adding them.

Usage Information

You can find the full source code for the dictionary here: Thread-Safe / Lock-Free / Append-Only / Copy-On-Write Dictionary

Typical usage goes something like this:

public class TypeDataCache
{
    private static readonly CopyOnWriteDictionary<Type, TypeData> _lookup;

    static TypeDataCache()
    {
        _lookup = new CopyOnWriteDictionary<Type, TypeData>() { CopyDelay = 100 };

        foreach (var type in GetInitialTypes())
            _lookup.Add(type, new TypeData(type));
    }

    public static TypeData GetData(Type type)
    {
        if (!_lookup.TryGetValue(type, out var data))
            data = _lookup.AddOrGetValue(type, () => new TypeData(type));

        return data;
    }
}

It is recommended that you only call Add() during initialization, when no other threads have access to the dictionary. After that, you should first try to grab the value you need with TryGetValue(), since that's the option with the fastest lock-free path; if that fails, a call to AddOrGetValue() will add the item or grab the existing one in a thread-safe manner. It's important to consider the possibility that another thread added the value you were looking for between the TryGetValue() and AddOrGetValue() calls. If you use this pattern and make sure you use the return value of AddOrGetValue(), then everything works properly with no possibility of duplicate key exceptions. A value factory will only ever execute once for a given key, even if multiple threads call AddOrGetValue() at the same time.
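The reason this pattern is race-free comes down to a double check: the key is re-checked under the lock before the factory ever runs. A minimal standalone illustration of that idea (assumed semantics, over a single locked dictionary rather than the real main/pending pair):

```csharp
using System;
using System.Collections.Generic;

// Even if two threads both miss in TryGetValue, only the first one to take the
// lock runs the factory; the second finds the entry during its re-check and
// returns the existing value instead of throwing a duplicate-key exception.
public class AddOrGetExample<TKey, TValue>
{
    private readonly object _lock = new object();
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();

    public bool TryGetValue(TKey key, out TValue value)
    {
        lock (_lock) return _items.TryGetValue(key, out value);
    }

    public TValue AddOrGetValue(TKey key, Func<TValue> valueFactory)
    {
        lock (_lock)
        {
            if (_items.TryGetValue(key, out var existing))
                return existing; // a racing thread added it first; factory not run
            var value = valueFactory();
            _items.Add(key, value);
            return value;
        }
    }
}
```

This is also why you must use the return value of AddOrGetValue() rather than the value you constructed: if you lost the race, the dictionary holds the other thread's instance, not yours.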

If you read often and write occasionally in quick bursts then this is the fastest concurrent dictionary available. Happy caching and be sure to let me know if this helped you out :)