this week i've been looking up stuff about succinct data structures, and it's exactly the sort of gribbly bithacking shit you were promised.
it all started here, with this paper:
Semi-Indexing Semi-Structured Data in Tiny Space
We propose a technique to attach to a raw, unparsed document (even in compressed form) a “semi-index”: a succinct data structure that supports operations on the document
tree. Our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.
What if we didn't make a parse tree — what if we kept a compressed bitvector of where tokens started, and a string of ()'s to represent the tree structure instead?
{"a": 1, "b": {"l": [1, null], "v": true}}
pos 110000100100001100001100100000010000100000
bp (()()()(()(()())()()))
This is a semi-index, or storing just enough information about the document that you don't have to reparse it to navigate it. It turns out to take up a lot less space, it's a lot faster to construct, and it's almost as fast for reading in data.
Which is how I ended up going down a rabbit hole:
Quasi-Succinct Indices
This is about compressed inverted indexes, but it also serves as a good explanation of elias-fano encoding—a means of compressing a sequence of monotonically increasing numbers, but still allowing some fast lookup operations.
In other words, if i have a table of token offsets in a file, I can compress it using elias-fano, and still have fast lookup of the n-th token.
Succinct Trees in Practice
The other part to a semi-index is the string of ()s to represent a tree. This survey paper compares three of the big approaches.
- Balanced Parenthesis (BP)
- Depth-First Unary Degree Sequence (DFUDS)
- LOUDS (level-ordered unary degree sequence)
All of these use different approaches to turn a tree into a series of ()'s, each with different tradeoffs. Some operations are much faster in some forms, other forms need more space. All of them use rank/select operations.
rank, and select are operators on bitstrings. rank means "tell me how many bits are set before this bit", or popcount over a prefix of the string. select means "find me the position of the n-th 1-bit in the string".
The elias-fano encoding has a fast implementation of select, but we need fast rank operations for BP/DFUDS/LOUDS style encodings
Rank Select
-
Fast, Small, Simple Rank/Select on Bitmap: https://users.dcc.uchile.cl/~gnavarro/ps/sea12.1.pdf
rank/select has a lot of theoretical work on lowering minimums but it isn't always practical. this paper outlines a simpler to implement approach for datastructures
-
Rank/Select Queries over Mutable Bitmaps: https://arxiv.org/pdf/2009.12809.pdf
-
rrr: https://www.alexbowe.com/rrr/
this is an explanation of how to implement
rank, and a link to a thesis, too -
Broadword Implementation of Rank/Select Queries: https://vigna.di.unimi.it/ftp/papers/Broadword.pdf
This paper is about 'broadword' implementations, or 'simd in a register' style programming
-
A Fast x86 Implementation of Select: https://arxiv.org/pdf/1706.00990.pdf
-
Engineering Compact Data Structures for Rank and Select Queries on Bit Vectors: https://arxiv.org/pdf/2206.01149.pdf
-
Ultra-succinct representation of ordered trees. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=90ed160d5cd94c5102961ad6699d4bfdc47faa59
Extending BP and DFUDS to support more operations.
-
Range-Min-Max Trees: It's a tree datastructure for computing rank/select, and there's a lot of work around
- Succinct Trees in Practice: https://users.dcc.uchile.cl/~gnavarro/ps/alenex10.pdf
- Fully-Functional Static and Dynamic Succinct Trees: https://arxiv.org/abs/0905.0768
- Simple and efficient functional trees: https://arxiv.org/abs/1601.06939
Succinct Structures
-
Compact PAT Tries: https://uwspace.uwaterloo.ca/bitstream/handle/10012/64/nq21335.pdf?sequence=1&isAllowed=y
Encoding a radix trie in much smaller space using succinct structures.
