tef


reference counting is a pretty great form of garbage collection: it's deterministic, pause times are predictable, and it's easy to implement too. it might not be perfect, but it's easy to reason about, and that makes it a good choice, despite the naysayers.

concurrent reference counting is less great: atomic increments and decrements are expensive and cause heavy contention on popular objects, but it doesn't have to be this way.


one solution to the cost of atomic refcounts is biased reference counting:

  • objects can be biased to some thread (storing a thread id)
  • objects store a pair of counts (owner, shared)
  • increments and decrements use the owner count (on the owning thread) or the shared count (on any other thread)
  • the owner thread periodically merges counts, and other threads can request merges
  • counts must be merged before deallocation

this technique was invented for swift, and it's a great idea. threads that mind their own business don't have to pay for atomic actions, and threads that need to share are only a little bit slower than before.
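
roughly, in rust. the names here are mine, not swift's, and the merge itself is elided — a sketch, not the real thing:

    use std::cell::Cell;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // a sketch of a biased refcount header. the owner count is only
    // ever touched by the thread the object is biased to, so it needs
    // no synchronisation; everyone else uses the atomic shared count.
    struct BiasedHeader {
        owner: usize,              // id of the thread the object is biased to
        owner_count: Cell<usize>,  // plain reads and writes, owner thread only
        shared_count: AtomicUsize, // atomic, any thread
    }

    impl BiasedHeader {
        fn incref(&self, my_tid: usize) {
            if my_tid == self.owner {
                // fast path: an ordinary, unsynchronised increment
                self.owner_count.set(self.owner_count.get() + 1);
            } else {
                // slow path: an atomic increment on the shared count
                self.shared_count.fetch_add(1, Ordering::Relaxed);
            }
        }

        fn decref(&self, my_tid: usize) {
            if my_tid == self.owner {
                self.owner_count.set(self.owner_count.get() - 1);
                // if this hits zero, the owner merges in the shared
                // count before deciding the object is dead (elided)
            } else {
                self.shared_count.fetch_sub(1, Ordering::Release);
                // if this hits zero, request a merge from the owner (elided)
            }
        }
    }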

there's a few more details that i won't cover, like how the counts are merged, or the exact mechanisms in the fast and slow paths, but there are two details worth noting: counts must be merged before deallocation, and the owner thread is the only thread that can merge counts.

this means that under biased reference counting, some objects hang on a little longer before being returned to the allocator, but it's not the end of the world. objects still get cleared faster than they would under a tracing collector.

anyway, by splitting up the expensive and cheap paths, biased reference counting has a significant performance impact. programs run 22.5% faster, on average.

it's not the only choice.

another reference counting system, buffered reference counting, avoids the expensive operations in a different way:

  • there is a global epoch (a counter)
  • there is one thread dedicated to handling reference count changes
  • every thread has a list of increments and decrements
  • these lists are sent to the counting thread at regular intervals
  • increments can be applied immediately by the counting thread
  • when all threads have reported in, the epoch advances
  • decrements from the last epoch can now be applied

it's quite a beautiful trick, so it bears a second explanation:

  • there's only one thread adjusting counts, so it needs no locks around them
  • sending the counts to the counting thread can be very fast and very cheap
  • as all increments are applied before decrements, and all decrements are applied in the next epoch, we know objects do not prematurely hit a count of zero
  • any object that hits zero can be deallocated immediately

instead of splitting things into fast and slow paths, we batch up operations and perform them in bulk. i don't have any numbers on how fast it makes things, but it's competitive.
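
here's a sketch of the counting thread's side of it, in rust. objects are plain usizes rather than pointers, the mutators' flush code is elided, and i'm assuming each thread flushes one buffer per epoch:

    use std::collections::HashMap;
    use std::sync::mpsc::Receiver;

    // each mutator flushes one of these per epoch
    struct Buffer {
        epoch: u64,       // the flushing thread's epoch
        incs: Vec<usize>, // objects whose count went up
        decs: Vec<usize>, // objects whose count went down
    }

    fn counting_thread(buffers: Receiver<Buffer>, num_threads: usize) {
        let mut counts: HashMap<usize, isize> = HashMap::new();
        let mut pending: HashMap<u64, Vec<usize>> = HashMap::new(); // deferred decrements
        let mut flushed: HashMap<u64, usize> = HashMap::new();      // threads heard from
        let mut epoch: u64 = 0;

        for buf in buffers {
            // increments are safe to apply straight away: an extra
            // increment can never cause a premature zero
            for obj in buf.incs {
                *counts.entry(obj).or_insert(0) += 1;
            }
            // decrements wait until every thread has flushed this epoch
            pending.entry(buf.epoch).or_default().extend(buf.decs);
            *flushed.entry(buf.epoch).or_insert(0) += 1;

            // once everyone has reported in, the epoch advances, and
            // the last epoch's decrements can finally be applied
            while flushed.get(&epoch).copied().unwrap_or(0) == num_threads {
                for obj in pending.remove(&epoch).unwrap_or_default() {
                    let count = counts.get_mut(&obj).expect("decrement without increment");
                    *count -= 1;
                    if *count == 0 {
                        counts.remove(&obj); // deallocate the object here
                    }
                }
                epoch += 1;
            }
        }
    }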

like with biased counting, deallocations might take a while to happen. every thread needs to report in before it's safe to free an item, and one slow thread can stall memory reclamation for everyone.

even so, it's also a neat trick

which brings me to the final trick. can't we take the split reference counts from biased counting, and the batched increments and decrements from buffered reference counting, and make them play nicely?

here's fast and slow reference counting:

  • there is a global epoch, and a counting thread
  • objects have an owner, an owner count, and a shared count
  • threads keep a list of increments and decrements
  • increments and decrements have a fast and a slow path
    • if the object is owned, we use the owned count directly (fast)
    • if the object is shared, we put it on the list of increments or decrements (slow)
  • threads send the lists to the counting thread, and after every thread has reported, the epoch advances
  • the counting thread applies increments immediately, and applies the decrements after the epoch advances
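
on the mutator's side, that might look something like this — `LOCAL_OPS` is a stand-in for the per-thread buffer of +1s and -1s, and the flush to the counting thread is elided:

    use std::cell::{Cell, RefCell};

    thread_local! {
        // the per-thread buffer that gets shipped to the counting thread
        static LOCAL_OPS: RefCell<Vec<(usize, i8)>> = RefCell::new(Vec::new());
    }

    struct FsHeader {
        owner: usize,             // thread the object is biased to
        owner_count: Cell<usize>, // plain count, owner thread only
        // the shared count lives with the counting thread, not here
    }

    fn incref(obj: &FsHeader, obj_id: usize, my_tid: usize) {
        if my_tid == obj.owner {
            // fast path: no atomics, no buffer traffic
            obj.owner_count.set(obj.owner_count.get() + 1);
        } else {
            // slow path: defer a +1 to the counting thread
            LOCAL_OPS.with(|ops| ops.borrow_mut().push((obj_id, 1)));
        }
    }

    fn decref(obj: &FsHeader, obj_id: usize, my_tid: usize) {
        if my_tid == obj.owner {
            let n = obj.owner_count.get() - 1;
            obj.owner_count.set(n);
            if n == 0 {
                // now what? this is the tricky part (see below)
            }
        } else {
            // slow path: defer a -1 to the counting thread
            LOCAL_OPS.with(|ops| ops.borrow_mut().push((obj_id, -1)));
        }
    }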

there's one tiny (big) problem: merging counts.

in biased refcounting, owning threads must merge the counts before deallocation, but we can't easily do the same thing here. the counting thread owns the shared count, the owning thread owns the owned count, and neither is protected by a lock.

instead of merging the counts, we hold one extra shared reference for as long as there's an owned count:

  • when an object is allocated, we give it an owner, an owner count of one, and enqueue an increment of one for its shared count.
  • when the owner count reaches zero, it enqueues a decrement for the shared count.
  • when the shared count hits zero, we deallocate the object
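
as a sketch, reusing the names from the last one:

    use std::cell::Cell;

    enum Op { Inc(usize), Dec(usize) }

    fn enqueue(_op: Op) { /* push onto the per-thread buffer (elided) */ }

    struct FsHeader { owner: usize, owner_count: Cell<usize> }

    fn allocate(obj_id: usize, my_tid: usize) -> FsHeader {
        // one owned reference, plus one shared reference held on the
        // owner count's behalf: the two counts never need merging
        enqueue(Op::Inc(obj_id));
        FsHeader { owner: my_tid, owner_count: Cell::new(1) }
    }

    fn owner_decref(obj: &FsHeader, obj_id: usize) {
        let n = obj.owner_count.get() - 1;
        obj.owner_count.set(n);
        if n == 0 {
            // the owner is done: release the shared reference the owner
            // count was holding. when the counting thread applies this
            // and the shared count hits zero, the object is dead.
            enqueue(Op::Dec(obj_id));
        }
    }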

but wait, there's more!

we can handle lock-free reclamation, too!

with a concurrent object, it's possible for one thread to free an object, while another is trying to increment the reference count, and this is known as a "bad thing" in the industry at large.

normally, we put a lock around an object, so readers and writers can't be active at the same time. to delete an item, we lock it, clear the pointer to the object, and decrement the object's ref count before releasing the lock. lock-free code doesn't have it so easy.

for lock-free access to objects, we have to keep deleted objects around in memory until we're sure that they're unreachable by other threads. the usual solution to this is to [check notes] add an epoch [hold up] and have threads report in at safe points.

and we're already doing 95% of that. here's the last 5%:

  • when an object's shared count reaches zero, it is put on a zero count list for the current epoch
  • when the epoch advances, we can reclaim the last epoch's zero count list

nice, huh?
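
on the counting thread, it's about this much code — a sketch, with the deallocation itself elided:

    // objects that hit zero sit out one full epoch on a zero count
    // list before being freed, so no thread is left holding a stale
    // pointer it read before the decrement was applied
    struct Reclaimer {
        zero_now: Vec<usize>,  // hit zero during the current epoch
        zero_prev: Vec<usize>, // hit zero last epoch
    }

    impl Reclaimer {
        fn on_zero(&mut self, obj: usize) {
            // some thread may still hold a pointer to this: park it
            self.zero_now.push(obj);
        }

        fn on_epoch_advance(&mut self) {
            // every thread has passed a safe point since these hit
            // zero, so nothing can reach them any more: free them
            for obj in self.zero_prev.drain(..) {
                let _ = obj; // deallocate the object here
            }
            std::mem::swap(&mut self.zero_prev, &mut self.zero_now);
        }
    }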

there's just a few more details to cover

biased reference counting keeps everything inside a word so it can do atomic actions on the flags and shared count at the same time. buffered reference counting doesn't need to store counts in the object header, as they're never accessed by mutators.
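
here's a sketch of why that packing matters. the field layout and flag name are invented, but the idea stands: with the flags and the shared count in one atomic word, a compare-and-swap can update both in a single step:

    use std::sync::atomic::{AtomicU64, Ordering};

    const MERGE_REQUESTED: u64 = 1 << 63;  // hypothetical flag bit
    const COUNT_MASK: u64 = (1 << 32) - 1; // low 32 bits: the shared count

    // drop a shared reference and raise the merge-request flag
    // together; returns true if we released the last shared reference.
    // (assumes the caller really holds a reference, i.e. count > 0)
    fn shared_decref(word: &AtomicU64) -> bool {
        let mut old = word.load(Ordering::Acquire);
        loop {
            let new = (old - 1) | MERGE_REQUESTED; // count down, flag up, one word
            match word.compare_exchange_weak(old, new, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return (old & COUNT_MASK) == 1,
                Err(current) => old = current,
            }
        }
    }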

with fast and slow reference counting, we don't need to store the owner and the shared counts in the same place. owner counts can reside in the object header, but shared counts could be stored in card tables, or even in page headers à la mimalloc.

it's also worth noting that like with epoch based reclamation, or buffered reference counting, fast and slow counting can also be held back by an errant thread. so it goes.

oh and i guess it doesn't collect cycles. i have to mention this, i know.



in reply to @tef's post:

i'm not even sure if we need the zero count list for reclamation to be safe under lock-free access

my intuition is that things get deleted in epoch n, zero counted in n+1, and reclaimed in n+2, much like epoch based reclamation

but well, we know that to have a refcount of zero, all references should have been deleted, and that when we apply decrements, every thread has already gone through a safe point, so

maybe we don't need the free list after all?

edit: we still do

if we advanced the epoch every time a deletion happened, we could immediately deallocate

but as we batch things up into broader epochs, we need to have two waves of safe points to be sure.

a thread that deletes something at the end of an epoch, after every other thread has already been through a safe point, can still expose those threads to deleted data

the overhead is the increment/decrement buffer, and it's essentially akin to a write barrier, a cost tracing gcs invoke and get away with

in particular, you can use a fixed-size buffer and send it over when it gets full, too

it occurs to me that the "add one to the shared count and subtract one when the owned count is zero" trick can be used to do biased counting without the rest of the faff

like before:

  • give objects an owner, an owner refcount, and a shared refcount
  • use atomic increment on the shared refcount, normal on the owner
  • instead of a merge, a shared header bit, and per-thread queues, allocate things with an owner, an owner count of 1, and a shared count of 1
  • when the owner count hits 0, decrement the shared count
  • when the shared count hits 0, you can deallocate

the downside is that all deallocations involve an atomic operation, but increments and decrements can be biased, and there's no queue or count merging needed

for a language like rust, having per-thread queues can get a little hairy, and this avoids it entirely
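
as a sketch, with hypothetical names:

    use std::cell::Cell;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // the queue-free variant: the owner count holds one reference on
    // the shared count, and only the handoff and the final release
    // ever touch the atomic
    struct HybridHeader {
        owner: usize,
        owner_count: Cell<usize>,  // plain ops, owner thread only
        shared_count: AtomicUsize, // atomic ops, everyone else
    }

    fn allocate(my_tid: usize) -> HybridHeader {
        HybridHeader {
            owner: my_tid,
            owner_count: Cell::new(1),
            shared_count: AtomicUsize::new(1), // one held by the owner count itself
        }
    }

    // returns true when the caller should deallocate the object
    fn decref(h: &HybridHeader, my_tid: usize) -> bool {
        if my_tid == h.owner {
            let n = h.owner_count.get() - 1;
            h.owner_count.set(n);
            if n != 0 {
                return false; // fast path: no atomics at all
            }
            // owner is done: fall through and release the shared
            // reference the owner count was holding
        }
        // the atomic decrement settles who saw the count hit zero
        h.shared_count.fetch_sub(1, Ordering::AcqRel) == 1
    }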

edit: someone else already thought of this https://docs.rs/hybrid-rc/latest/src/hybrid_rc/lib.rs.html#286 , neat!