i got some code working, but i never wrote a paper about it, as is traditional for writing a data structure in rust
anyway, some shit on the timeline has nerdsniped me back into it.
what i implemented was a vec of boxed items that had a multiple-compare-and-swap atomic operation
this might not seem like much, but it would let me implement all manner of trees atop it, much like how a bw-tree works: instead of making a tree out of pointers, i would use array offsets into the atomic vec, which would mean that operations touching multiple tree nodes would also be atomic.
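to make that concrete, here's roughly the shape of it. this is a sketch with invented names (`AtomicVec`, `store`, `cas`), not the actual code: a fixed-capacity vec of atomic pointers, where tree "links" are just indices into the vec.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

struct AtomicVec<T> {
    slots: Vec<AtomicPtr<T>>, // each slot holds a *mut T from Box::into_raw
}

impl<T> AtomicVec<T> {
    fn with_capacity(cap: usize) -> Self {
        let slots = (0..cap)
            .map(|_| AtomicPtr::new(std::ptr::null_mut()))
            .collect();
        AtomicVec { slots }
    }

    // swap a boxed item into a slot, returning the old pointer; freeing
    // the old one safely is the reclamation story below
    fn store(&self, idx: usize, value: Box<T>) -> *mut T {
        self.slots[idx].swap(Box::into_raw(value), Ordering::AcqRel)
    }

    // a single-slot cas; the mcas below is built out of many of these
    fn cas(&self, idx: usize, old: *mut T, new: *mut T) -> bool {
        self.slots[idx]
            .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```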
aside: suffice to say, implementing your own miniature heap in a rust data structure atop vec is also pretty traditional
anyway, how did i do it? keir fraser's two-phase mcas algorithm, from his thesis about RCU and other lock-free things. it's a real good thesis.
the idea is this: you create a descriptor table of all the compare and swaps you are going to do. you then swap in a pointer to this table for all the values you want to change, in some defined order. you then swap back in the new values, and you're done.
this is, in some ways, a form of 2-phase commit, and by claiming the slots in the same order each time, you avoid deadlock. it also means other competing readers can spin on the contended items, or even help the writer swap in the final values, if you're feeling fancy.
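squashed into code, the two phases look something like this. a real implementation needs a way to tell descriptor pointers apart from value pointers (usually a tag bit) and lets stuck threads help finish the operation; both are hand-waved here, so treat this as a sketch of the shape, not working mcas:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

struct CasEntry<T> {
    idx: usize,  // which slot to hit
    old: *mut T, // expected value
    new: *mut T, // replacement value
}

fn mcas<T>(slots: &[AtomicPtr<T>], mut entries: Vec<CasEntry<T>>) -> bool {
    // always claim slots in increasing index order to avoid deadlock
    entries.sort_by_key(|e| e.idx);
    let desc = entries.as_ptr() as *mut T; // stand-in for a tagged descriptor pointer

    // phase 1: swap every old value for a pointer to the descriptor table
    for (i, e) in entries.iter().enumerate() {
        if slots[e.idx]
            .compare_exchange(e.old, desc, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            // someone beat us to a slot: roll back what we claimed, in reverse
            for e in entries[..i].iter().rev() {
                let _ = slots[e.idx]
                    .compare_exchange(desc, e.old, Ordering::AcqRel, Ordering::Acquire);
            }
            return false;
        }
    }

    // phase 2: swap the descriptor pointers for the new values
    for e in entries.iter().rev() {
        let _ = slots[e.idx]
            .compare_exchange(desc, e.new, Ordering::AcqRel, Ordering::Acquire);
    }
    true
}
```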
... and to implement it in rust? well, i had to write a concurrent garbage collector first. i went with a mixture of epoch and quiescent based state reclamation. it's very much like a tri-color garbage collector, but applied to the readers, not the items. once every reader has caught up to the current moment, you know you can deallocate shit from a while back.
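the reclamation scheme, boiled way down, was something like this: a global reader epoch, a count of pinned readers per epoch, and a garbage list per epoch. once no reader is pinned a couple of epochs back, everything retired then can be dropped. the names and the three-bucket layout are invented for this sketch, and the real thing also has to handle the race between loading the epoch and pinning it:

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Mutex;

const EPOCHS: usize = 3; // tri-color-ish: current, last, reclaimable

struct Collector<T> {
    epoch: AtomicU64,
    active: [AtomicUsize; EPOCHS],         // readers pinned in each epoch
    garbage: [Mutex<Vec<Box<T>>>; EPOCHS], // items retired in each epoch
}

impl<T> Collector<T> {
    // a reader announces itself before touching the structure...
    fn pin(&self) -> usize {
        let e = self.epoch.load(Ordering::Acquire) as usize % EPOCHS;
        self.active[e].fetch_add(1, Ordering::AcqRel);
        e
    }

    // ...and leaves when it's done
    fn unpin(&self, e: usize) {
        self.active[e].fetch_sub(1, Ordering::AcqRel);
    }

    // once every reader from two epochs back has caught up, nothing can
    // still see those retired pointers, so dropping the boxes is safe
    fn try_advance(&self) {
        let e = self.epoch.load(Ordering::Acquire);
        let old = (e as usize + EPOCHS - 2) % EPOCHS;
        if self.active[old].load(Ordering::Acquire) == 0 {
            self.garbage[old].lock().unwrap().clear(); // drop the boxes
            self.epoch.fetch_add(1, Ordering::AcqRel);
        }
    }
}
```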
anyway, i got this atomic vec working, i got the multiple-compare-and-swap working, and i even got the concurrent reclamation working too. it actually did pretty ok in benchmarks, absolutely leaving a Mutex<Arc> behind in the dust.
then i put it to one side because at the time, i wanted to do something fancier: i wanted a lock free key-value table that i could persist to disk. this meant making things a little more serializable, or if you will, linearizable. i couldn't just dump a snapshot out of the table without being a little more clever, and it felt like in doing so i'd be undermining all the fancy steps i took to get there.
or worse, reimplementing all the fancy steps. it looked like i needed to add some mvcc and serializability checks, which would, in fact, undermine any real need for a multiple atomic compare-and-swap. i could just handle things being partially updated by skipping over to the older versions of entries.
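the mvcc direction would have looked roughly like this, with each entry carrying a version and a link to its predecessor so a reader can skip back past partial updates. hypothetical types, nothing i actually shipped:

```rust
struct Version<V> {
    version: u64,
    value: V,
    prev: Option<Box<Version<V>>>, // older versions, for readers to skip back to
}

// a reader with snapshot version `v` walks back until it finds
// something old enough to be safely visible
fn read_at<V>(mut head: &Version<V>, v: u64) -> Option<&V> {
    loop {
        if head.version <= v {
            return Some(&head.value);
        }
        head = head.prev.as_deref()?;
    }
}
```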
this would allow me to have stable reads of the table, something the mcas algorithm doesn't quite manage on its own. you see, even though the multiple-compare-and-swap is atomic, that doesn't mean a reader will see a consistent snapshot of the data structure.
if a reader knew in advance every key it would touch, it could scan those keys in the same order as a writer would, and therefore either bump into a writer or get away with it, knowing that nothing changed while it was looking elsewhere. another way to implement atomic reads/snapshot reads is to make a fake write to the data structure, i.e. do an mcas with the same before and after values. neither of these options sounds great, frankly.
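the first option, sketched: read the slots in the same ascending order a writer claims them, and bail if you ever hit a descriptor. `is_descriptor` stands in for the tag-bit check i keep hand-waving:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

fn ordered_read<T>(
    slots: &[AtomicPtr<T>],
    mut idxs: Vec<usize>,
    is_descriptor: impl Fn(*mut T) -> bool,
) -> Option<Vec<*mut T>> {
    idxs.sort_unstable(); // same ascending order as a writer's first phase
    let mut out = Vec::with_capacity(idxs.len());
    for i in idxs {
        let p = slots[i].load(Ordering::Acquire);
        if is_descriptor(p) {
            return None; // bumped into a writer; caller retries
        }
        out.push(p);
    }
    Some(out)
}
```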
this, along with the reasons above, and the treatment-resistant adhd, meant i put it to one side to rot.
fast forward a year or two, and a friend did some fancy lock free reclamation on a radix trie, and i thought i'd mention this cool paper on transactional mutex locks i'd seen in the meantime.
so the transactional mutex lock is a very neat single-writer lock that doesn't block the writer when there are multiple readers, at the expense of forcing readers to retry or abort their operations.
- there's a global writer epoch, and when it's even, no writer is active
- readers spin until the epoch is even, store the epoch locally, and then check it hasn't changed at the end of their transaction
- writers spin until the epoch is even, then increment it to take ownership of the lock, do whatever the fuck they like, increment the epoch again to release it
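the whole thing fits in a handful of lines; this is my paraphrase of the idea, not the paper's code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct Tml {
    epoch: AtomicU64, // even = no writer, odd = writer active
}

impl Tml {
    // readers: snapshot an even epoch to start...
    fn read_begin(&self) -> u64 {
        loop {
            let e = self.epoch.load(Ordering::Acquire);
            if e % 2 == 0 {
                return e;
            }
            std::hint::spin_loop(); // writer active, wait it out
        }
    }

    // ...and validate it hasn't moved at the end; false = abort/retry
    fn read_validate(&self, start: u64) -> bool {
        self.epoch.load(Ordering::Acquire) == start
    }

    // writers: bump even -> odd to take ownership of the lock...
    fn write_lock(&self) -> u64 {
        loop {
            let e = self.epoch.load(Ordering::Acquire);
            if e % 2 == 0
                && self
                    .epoch
                    .compare_exchange(e, e + 1, Ordering::AcqRel, Ordering::Acquire)
                    .is_ok()
            {
                return e + 1;
            }
            std::hint::spin_loop();
        }
    }

    // ...and odd -> even to release it
    fn write_unlock(&self, locked: u64) {
        self.epoch.store(locked + 1, Ordering::Release);
    }
}
```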
it's honestly kinda beautifully simple, and it got me thinking: if a reader started out and a writer interrupted it, could you recover the operation without aborting? what if you had some table or log of the last changes to the data structure in question, so that when you saw the writer epoch didn't match, you could check your reads against the list of changes and see whether you actually needed to restart?
... a list of changes you say? that reminded me of the mcas descriptor table. i already had a reader epoch (for reclamation); what if i added a writer epoch (for detecting conflicts)?
so let's go back to my fancy pants mcas lock free data structure for a moment:
the writer is mostly unchanged (there's a code sketch after the asides below):
- find the reader epoch, store it, increment the number of active readers for that epoch
- create a compare-and-swap descriptor table
- in increasing order, replace the old values (a pointer to a boxed thing) with a pointer to the descriptor table
- increment the writer epoch
- push the descriptor table entries onto the end of a ring buffer
- increment the writer epoch again
- in decreasing order, replace the descriptor pointers with the pointers to the new value
- decrement the number of active readers for our stored reader epoch
(now you might say: wait, we could probably just increment the epoch once, and you'd be right, but for now let's just leave some room for an exclusive writer lock across the entire structure)
(and you might also say: why do we need separate reader and writer epochs? well, it makes the logic simpler for reclamation when there's no active writers, but it's not impossible)
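gluing the earlier sketches together, the writer path comes out something like this. `LogEntry`, the `Mutex<Vec<...>>` standing in for the ring buffer, and the tagged descriptor pointer are all still invented/hand-waved, and the reclamation pin/unpin bracketing is elided:

```rust
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};
use std::sync::Mutex;

// one logged change
struct LogEntry<T> {
    epoch: u64,  // writer epoch this change landed in
    idx: usize,  // which slot changed
    old: *mut T, // value before
    new: *mut T, // value after
}

fn write_mcas<T>(
    writer_epoch: &AtomicU64,
    slots: &[AtomicPtr<T>],
    ring: &Mutex<Vec<LogEntry<T>>>,
    mut entries: Vec<(usize, *mut T, *mut T)>, // (idx, old, new)
) -> bool {
    entries.sort_by_key(|e| e.0); // increasing order, same as before
    let desc = entries.as_ptr() as *mut T; // tagged descriptor pointer, hand-waved

    // phase 1: claim every slot in increasing order
    for (i, &(idx, old, _)) in entries.iter().enumerate() {
        if slots[idx]
            .compare_exchange(old, desc, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            // conflict: roll back what we claimed, in reverse order
            for &(idx, old, _) in entries[..i].iter().rev() {
                let _ = slots[idx]
                    .compare_exchange(desc, old, Ordering::AcqRel, Ordering::Acquire);
            }
            return false;
        }
    }

    // bump the writer epoch, publish the log entries, bump it again
    let e = writer_epoch.fetch_add(1, Ordering::AcqRel) + 1;
    {
        let mut log = ring.lock().unwrap();
        for &(idx, old, new) in entries.iter() {
            log.push(LogEntry { epoch: e, idx, old, new });
        }
    }
    writer_epoch.fetch_add(1, Ordering::AcqRel);

    // phase 2: release the slots in decreasing order with the new values
    for &(idx, _, new) in entries.iter().rev() {
        let _ = slots[idx].compare_exchange(desc, new, Ordering::AcqRel, Ordering::Acquire);
    }
    true
}
```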
the reader is a little different this time (again, sketch after the list):
- find the reader epoch, store it, increment the number of active readers for that epoch
- find the writer epoch, and store it
- do your unordered reads, but keep an ordered list of every pointer you looked at
- if you find a descriptor, take the old value
- check the writer epoch is unchanged
- if the writer epoch has changed, we can look in the ring buffer of descriptors, check to see if any of our reads have changed, and avoid spuriously aborting the read
- decrement the number of active readers for our stored reader epoch
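and the reader path, same caveats as the writer sketch, reusing the same `LogEntry` shape:

```rust
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};
use std::sync::Mutex;

// same LogEntry as in the writer sketch
struct LogEntry<T> { epoch: u64, idx: usize, old: *mut T, new: *mut T }

fn read_slots<T>(
    writer_epoch: &AtomicU64,
    slots: &[AtomicPtr<T>],
    ring: &Mutex<Vec<LogEntry<T>>>,
    idxs: &[usize],
) -> Option<Vec<*mut T>> {
    let start = writer_epoch.load(Ordering::Acquire);
    let mut seen = Vec::with_capacity(idxs.len());
    for &i in idxs {
        // a real reader would spot a tagged descriptor pointer here and
        // take the entry's old value instead; elided
        seen.push(slots[i].load(Ordering::Acquire));
    }
    if writer_epoch.load(Ordering::Acquire) == start {
        return Some(seen); // nobody wrote while we were looking
    }
    // the epoch moved: only abort if a logged change overlaps our reads
    let log = ring.lock().unwrap();
    if log.iter().any(|e| e.epoch > start && idxs.contains(&e.idx)) {
        None // a slot we read really did change; caller retries
    } else {
        Some(seen)
    }
}
```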
once again you might be asking yourself questions like: wait, we could just redo the read, checking against the table! what does this get us?
sure, you could just repeat the read, but you run the risk of hitting another conflict. instead, if we check the ring buffer we can see if the conflict was actually meaningful, and thus avoid rolling the dice again.
(and now, if we felt fancy, we could retry the read while holding the write lock to prevent repeated aborts)
so what do we have now?
well, we have a lock free data structure, with lock free reclamation of deleted items, with lock free multiple-compare-and-swap for writers, and lock free snapshot reads too! all wrapped up with a very lightweight mutex.
but wait! there's more! as the ring buffer essentially forms an undo log, we could use it to take consistent snapshots of the entire structure. it's honestly not a lot more work to turn this into a proper write ahead log.
maybe one day i'll test this idea out. for now, it's chosting time
