tef

bad poster & mediocre photographer

  • they/them

this week i've been looking up stuff about succinct data structures, and it's exactly the sort of gribbly bithacking shit you were promised.

it all started here, with this paper:

Semi-Indexing Semi-Structured Data in Tiny Space

We propose a technique to attach to a raw, unparsed document (even in compressed form) a “semi-index”: a succinct data structure that supports operations on the document
tree. Our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.

What if we didn't make a parse tree — what if we kept a compressed bitvector of where tokens started, and a string of ()'s to represent the tree structure instead?

    {"a": 1, "b": {"l": [1, null], "v": true}}
pos 110000100100001100001100100000010000100000
bp  (()()()(()(()())()()))

This is a semi-index, or storing just enough information about the document that you don't have to reparse it to navigate it. It turns out to take up a lot less space, it's a lot faster to construct, and it's almost as fast for reading in data.

Which is how I ended up going down a rabbit hole:


Quasi-Succinct Indices

This is about compressed inverted indexes, but it also serves as a good explanation of elias-fano encoding—a means of compressing a sequence of monotonically increasing numbers, but still allowing some fast lookup operations.

In other words, if i have a table of token offsets in a file, I can compress it using elias-fano, and still have fast lookup of the n-th token.

Succinct Trees in Practice

The other part to a semi-index is the string of ()s to represent a tree. This survey paper compares three of the big approaches.

  • Balanced Parenthesis (BP)
  • Depth-First Unary Degree Sequence (DFUDS)
  • LOUDS (level-ordered unary degree sequence)

All of these use different approaches to turn a tree into a series of ()'s, each with different tradeoffs. Some operations are much faster in some forms, other forms need more space. All of them use rank/select operations.

rank, and select are operators on bitstrings. rank means "tell me how many bits are set before this bit", or popcount over a prefix of the string. select means "find me the position of the n-th 1-bit in the string".

The elias-fano encoding has a fast implementation of select, but we need fast rank operations for BP/DFUDS/LOUDS style encodings

Rank Select

Succinct Structures


You must log in to comment.

in reply to @tef's post:

I'd been mulling over a vague idea of something like this for processing large JSON files like Twitter or Slack archive exports. Nice to see there's prior thought about it that I can build on (if and when I get around to doing so).