Solving the ABA Problem for Lock-Free Free Lists

When writing a lock-free data structure in a language that doesn't have garbage collection (like C++), you need to manage the memory yourself. Often (even in managed languages), it's to your advantage to re-use objects instead of freeing them completely; this avoids the overhead of malloc(), and also avoids its worst-case behaviour of taking an operating system lock (a lock-free algorithm isn't very lock-free if its hot path makes library calls to functions that can take locks!).

So, once you know an object is no longer useful, how do you track it so that you can get it back later? The simplest structure that comes to mind, of course, is a free list -- put the object on once everyone's done with it, then take the head one off the next time you need a free one.

It's broken, Jim

Alas, if only it were that simple. The joy of lock-free programming is that it makes even trivial tasks challenging. Consider this naïve implementation (C++11, but the algorithm is what's important) of a lock-free linked-list that uses compare-and-swap (CAS) loops to add and remove nodes:

struct Node
{
    void* data;                // Whatever you want to put in the free list
    std::atomic<Node*> next;   // The next node in the list
};

struct FreeList
{
    FreeList() : freeListHead(nullptr) { }

    void add(Node* node)
    {
        // Memory ordering directives have been ommitted for the sake of simplicity
        auto head = freeListHead.load();
        do {
            node->next.store(head);
        } while (!freeListHead.compare_exchange_strong(head, node));
    }

    Node* try_get()
    {
        auto head = freeListHead.load();
        while (head != nullptr && !freeListHead.compare_exchange_strong(head, head->next.load()))
            continue;

        return head;
    }

private:
    std::atomic<Node*> freeListHead;
};

Can you spot the bug? It's quite subtle! The add() method is fine, but try_get() is flawed: It's possible that, in between the time that the head->next is read and the time that the CAS is done, somebody else removes the head (let's call it Sammy) from the list as well as the next node (say, Tommy), then puts Sammy back on free list. Why is this bad? Well, consider: The first thread thinks Sammy is the head of the list, and that the next item is Tommy, and it's about to do a CAS to replace Sammy with Tommy. Despite the fact that Sammy left and came back, he's still at the head of the list, so the CAS is going to succeed. But because Sammy left and came back, the list had the opportunity to change in the meantime -- Tommy is no longer the next node of the head (he was removed and still hasn't been put back by anyone). The original thread thinks he is, though, and has no way of knowing he isn't. So the CAS succeeds, setting the head of the list to Tommy, who isn't actually in the list anymore, and everything goes wahoonie-shaped after that. This is called the ABA problem: Something appears to be A at the time you check it, changes to B, then changes back to A (but with other parts of the data structure in a non-equivalent state), causing a CAS to succeed when it shouldn't.

Oh no, what to do?!

So, what can we do to fix this ABA problem? There's several alternatives! First, we could simply not use compare-and-swap in the first place. But CAS is nice, because it's commonly implemented on most hardware architectures, and it doesn't require extra blocks of memory (like array-based approaches do) -- it works nicely with pointers linked to each other. We could try to make it so that it doesn't matter if ABA occurs or not -- you'll notice that the add() method above doesn't guard against ABA in any way, and yet it doesn't matter to the algorithm. Unfortunately, that's not possible here (as far as I can tell, anyway). Another common approach is to do a CAS not just on the head pointer, but on a version number ("tag") too. Each time an item is put back into the free list, the version is incremented, and so any CAS that is done after an ABA will fail, because while the pointer is the same, the version has changed.

Versioned/tagged pointers sound great -- but there's a catch: the compare-and-swap needs to be done on both the pointer and its tag in the same atomic step, which means that either: the CAS operation must be capable of working on a word that is twice the size of a pointer; or, that you somehow reduce the size of the pointer so that the pointer and tag can fit within the same machine word. The first option is obviously great if your target architecture supports it, but if not (and not all do), then using it is not lock-free, because the multi-word CAS has to be emulated using a lock. The second option limits things somewhat, since it reduces the number of objects you can have in your list to, say, 65535 (with 16 bits for each pointer and 16 bits for each tag on a 32-bit platform). Hmm.

Yet another option is to use the linked-load/store-conditional (LL/SC) atomic primitive instead of CAS. LL/SC isn't susceptible to the ABA problem in the first place, because the store-conditional fails if anybody has touched the memory since the linked-load, regardless of whether it's been set back to the same value or not. The catch with LL/SC is that not all common architectures support it -- recent versions of ARM do, for example, but x86 does not. There are also limits on what you can do in between the linked-load and the store-conditional. Sigh.

It's a bird, it's a plane, it's... a solution to ABA!

Versioned pointers come close, but aren't always appropriate. Most of the other approaches not mentioned here that I've come across either rely on unicorn-like atomic primitives that don't exist in common hardware, or they introduce (temporary) locks.

So, I spent some time and came up with another approach, which is slightly more complex but also completely general (to the lock-free free list problem, anyway). It's fundamentally based on the naïve CAS-based algorithm above, minus the bug.

First, the root of the problem is that in between the time the head is read and the head's next pointer is read, the head could have changed, meaning the next is not necessarily valid by the time the CAS is executed. So, I introduced a reference count on each node, which is incremented before reading the next, and decremented after the CAS. While the reference count is non-zero, nobody is allowed to change the node's next.

This reference-counting has a few implications: when a node is taken off and an attempt is made to put it back, the reference count has to be zero before it's actually put back on the list -- otherwise the reference count isn't really doing anything. One option would be to spin-wait in add() until the reference reaches 0, but that's technically a lock. Instead, we can be a little more clever and take advantage of the handy fact that the order of nodes in a free list doesn't matter: if the reference count isn't 0, we can set a flag indicating that the item should be put on the list, and let the thread that brings the reference count to 0 take care of adding it back on later. This makes add() lock-free! Note that care has to be taken when setting the should-be-on-free-list flag: In order to avoid a race condition between the time the flag is set and the time the reference count reaches 0, we first increment the reference count before setting the flag, then decrement it afterwards (and if the reference count is then zero, the thread calling add() ends up being the one responsible for putting the node on the list after all).

Another problem is that in try_get(), the reference count shouldn't be re-incremented once it's zero, since this causes a race condition between the time the reference count hits zero (and thus becomes eligible for adding back onto the list), and the time the reference is incremented indicating that it's not eligible to be on the list (i.e. the reference count shouldn't be incremented after the add-to-list process has already started). To solve this, we first make sure the reference count is never zero if the node is in the list (as if the list itself has a reference to the node), and instead of using the fetch-and-add atomic primitive to increment the reference count, we use a compare-and-swap loop (which lets us check that the count is not zero before doing the CAS).

Finally, when adding an item to the list, its reference count must be set to 1 before it takes the place of the head, otherwise any thread that reads the head before the reference count is incremented will not be able to proceed (because the reference count is still zero). But this introduces a slight problem, since it's possible that a thread still has a pointer to the node that had been removed from the list and is about to be added back, and once it sees the reference count is non-zero it increments it and attempts to dequeue it from the list. It turns out that this is OK in the case that the CAS to add the node to the list succeeds, since then the node will be on the list, and the other thread will either see it in time and successfully take it off, or won't and harmlessly try again. If the CAS to add the node to the list fails, the thread will still try to take it off the list, failing harmlessly (since it's not on the list). But another attempt cannot be made to put the node on the list until its reference count goes back to zero -- fortunately, we already have the mechanism for that!

In terms of correctness, I believe this algorithm will work properly under all conditions. I don't have a formal proof, but I did think rather hard about it for a while. If somebody spots a bug, please let me know!

Update

Updated July 20, 2014

I did find a bug (that's what I get for releasing code before I get around to writing unit tests for it). There is, of course, a race condition between the time the reference count of a node reaches zero and the time the should-be-on-freelist flag is checked, which can lead to two threads both seeing the flag set after both saw the reference count at zero (causing mayhem to ensue when they both try to add the node to the free list at the same time).

The easiest way to fix this is to squeeze the reference count and flag into the same shared variable (putting the flag in the high bit, say), so that updates to the state are fully atomic. This has the nice side-effect of reducing the number of atomic operations required in general (increased performance), though at the cost of slightly less readable code.

I've updated the code below to include this fix. I can now say with a fair degree of confidence that it's correct; it passes my unit tests and relacy doesn't find any races in my model of it.

Use the source, Luke

Here's the source code in full, including comments (the free list is templated, for convenience). (Use it pretty much however you want, under the terms of the simplified BSD license.)

template <typename N>
struct FreeListNode
{
    FreeListNode() : freeListRefs(0), freeListNext(nullptr) { }

    std::atomic<std::uint32_t> freeListRefs;
    std::atomic<N*> freeListNext;
};

// A simple CAS-based lock-free free list. Not the fastest thing in the world under heavy contention,
// but simple and correct (assuming nodes are never freed until after the free list is destroyed),
// and fairly speedy under low contention.
template<typename N>    // N must inherit FreeListNode or have the same fields (and initialization)
struct FreeList
{
    FreeList() : freeListHead(nullptr) { }

    inline void add(N* node)
    {
        // We know that the should-be-on-freelist bit is 0 at this point, so it's safe to
        // set it using a fetch_add
        if (node->freeListRefs.fetch_add(SHOULD_BE_ON_FREELIST, std::memory_order_release) == 0) {
            // Oh look! We were the last ones referencing this node, and we know
            // we want to add it to the free list, so let's do it!
            add_knowing_refcount_is_zero(node);
        }
    }

    inline N* try_get()
    {
        auto head = freeListHead.load(std::memory_order_acquire);
        while (head != nullptr) {
            auto prevHead = head;
            auto refs = head->freeListRefs.load(std::memory_order_relaxed);
            if ((refs & REFS_MASK) == 0 || !head->freeListRefs.compare_exchange_strong(refs, refs + 1,
                    std::memory_order_acquire)) {
                head = freeListHead.load(std::memory_order_acquire);
                continue;
            }

            // Good, reference count has been incremented (it wasn't at zero), which means
            // we can read the next and not worry about it changing between now and the time
            // we do the CAS
            auto next = head->freeListNext.load(std::memory_order_relaxed);
            if (freeListHead.compare_exchange_strong(head, next,
                    std::memory_order_acquire, std::memory_order_relaxed)) {
                // Yay, got the node. This means it was on the list, which means
                // shouldBeOnFreeList must be false no matter the refcount (because
                // nobody else knows it's been taken off yet, it can't have been put back on).
                assert((head->freeListRefs.load(std::memory_order_relaxed) &
                    SHOULD_BE_ON_FREELIST) == 0);

                // Decrease refcount twice, once for our ref, and once for the list's ref
                head->freeListRefs.fetch_add(-2, std::memory_order_relaxed);

                return head;
            }

            // OK, the head must have changed on us, but we still need to decrease the refcount we
            // increased
            refs = prevHead->freeListRefs.fetch_add(-1, std::memory_order_acq_rel);
            if (refs == SHOULD_BE_ON_FREELIST + 1) {
                add_knowing_refcount_is_zero(prevHead);
            }
        }

        return nullptr;
    }

    // Useful for traversing the list when there's no contention (e.g. to destroy remaining nodes)
    N* head_unsafe() const { return freeListHead.load(std::memory_order_relaxed); }

private:
    inline void add_knowing_refcount_is_zero(N* node)
    {
        // Since the refcount is zero, and nobody can increase it once it's zero (except us, and we
        // run only one copy of this method per node at a time, i.e. the single thread case), then we
        // know we can safely change the next pointer of the node; however, once the refcount is back
        // above zero, then other threads could increase it (happens under heavy contention, when the
        // refcount goes to zero in between a load and a refcount increment of a node in try_get, then
        // back up to something non-zero, then the refcount increment is done by the other thread) --
        // so, if the CAS to add the node to the actual list fails, decrease the refcount and leave
        // the add operation to the next thread who puts the refcount back at zero (which could be us,
        // hence the loop).
        auto head = freeListHead.load(std::memory_order_relaxed);
        while (true) {
            node->freeListNext.store(head, std::memory_order_relaxed);
            node->freeListRefs.store(1, std::memory_order_release);
            if (!freeListHead.compare_exchange_strong(head, node,
                    std::memory_order_release, std::memory_order_relaxed)) {
                // Hmm, the add failed, but we can only try again when the refcount goes back to zero
                if (node->freeListRefs.fetch_add(SHOULD_BE_ON_FREELIST - 1,
                        std::memory_order_acq_rel) == 1) {
                    continue;
                }
            }
            return;
        }
    }

private:
    static const std::uint32_t REFS_MASK = 0x7FFFFFFF;
    static const std::uint32_t SHOULD_BE_ON_FREELIST = 0x80000000;

    // Implemented like a stack, but where node order doesn't matter (nodes are
    // inserted out of order under contention)
    std::atomic<N*> freeListHead;
};

I hope somebody finds this useful! (Or at least interesting.)

Another update

Updated January 6, 2015

Following MorphingDragon's suggestion in the comments, here's another, simpler free list that takes advantage of DCAS (so, only use this on platforms that support it). Beware, it's untested! :-)

// If you know your target architecture supports a double-wide compare-and-swap
// instruction, you can avoid the reference counting dance and use the naïve algorithm
// with a tag to avoid ABA, while still having it be lock-free. (If the tag of the head
// pointer is incremented each time it's changed, then ABA can't happen because all
// head values appear unique even if an element is re-added.)

template <typename N>
struct FreeListNode
{
    FreeListNode() : freeListNext(nullptr) { }

    std::atomic<N*> freeListNext;
};

template<typename N>    // N must inherit FreeListNode or have the same fields (and initialization)
struct FreeList
{
    FreeList()
        : head(HeadPtr())
    {
        assert(head.is_lock_free() && "Your platform must support DCAS for this to be lock-free");
    }

    inline void add(N* node)
    {
        HeadPtr currentHead = head.load(std::memory_order_relaxed);
        HeadPtr newHead = { node };
        do {
            newHead.tag = currentHead.tag + 1;
            node->freeListNext.store(currentHead.ptr, std::memory_order_relaxed);
        } while (!freeListHead.compare_exchange_weak(currentHead, newHead, std::memory_order_release, std::memory_order_relaxed));
    }

    inline N* try_get()
    {
        HeadPtr currentHead = head.load(std::memory_order_acquire);
        HeadPtr newHead;
        while (currentHead.ptr != nullptr) {
            newHead.ptr = currentHead.ptr->freeListNext.load(std::memory_order_relaxed);
            newHead.tag = currentHead.tag + 1;
            if (head.compare_exchange_weak(currentHead, newHead, std::memory_order_release, std::memory_order_acquire)) {
                break;
            }
        }
        return currentHead.ptr;
    }

    // Useful for traversing the list when there's no contention (e.g. to destroy remaining nodes)
    N* head_unsafe() const { return head.load(std::memory_order_relaxed).ptr; }


private:
    struct HeadPtr
    {
        N* ptr;
        std::uintptr_t tag;
    };

    std::atomic<HeadPtr> head;
};