vogon

the evil "Website Boy"

member of @staff, lapsed linguist and drummer, electronics hobbyist

zip's bf

no supervisor but ludd means the threads any good


twitter (inactive)
twitter.com/vogon
bluesky
if bluesky has a million haters I am one of them, if bluesky has one hater that's me, if bluesky has no haters then I am no more on the earth (more details: https://cohost.org/vogon/post/1845751-bonus-pure-speculati)
irl
seattle, WA

staff
@staff

small week because jae continues to be sick in ways that are annoying, but with some business news and big changes!

  • business news: @aidan is now officially a co-owner of the company! congrats and/or condolences!
  • HUGE change to how we cache data about posts that makes everything way faster. we’d had a blind spot in our knowledge of database development that has now been filled in.
    • deployment took longer than expected and unfortunately made posting Not Work for like 15 minutes but it’s fine now.
      • lesson learned: run big database changes against more faithful clones of the production environment, to get reasonable estimates of their performance characteristics instead of just guessing based on how they ran on a slice of the database from months and months ago.
  • fixed some issues with embeds causing a lot of flashing
  • tweaked the long post collapsing threshold based on feedback
  • fixed a bug where notifications would sometimes display “ERROR STATE” when there simply wasn’t any text to show
  • made fairly substantial changes to how post publishing works
    • fixes a bug that most people would have never experienced but that the three of us experienced literally every single time we posted

that’s all for this week, thanks for using cohost! :eggbug:


vogon
@vogon

what was wrong?

there's a few pieces of data about each post which are rendered (almost) everywhere a post is, but aren't stored with the post for various reasons:

  • how many comments it has
  • which tags are applied to it
  • which attachments it has

all of these things were being recomputed every time a post was rendered, and even though we'd done all of the easy things to make each individual computation fairly quick (a handful of milliseconds), there were a bunch of them to do on certain pages (the tag page, dashboard page, etc.) that added up to entire seconds overall.
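
to make that concrete, the shape of the problem looks something like this (an illustrative sketch only; the table and column names here are made up, not the real schema):

-- sketch: every rendered post needs its comment count, tags, and attachments,
-- and each of those is another subquery per post
select
  p.chost_id,
  (select count(*) from comments c where c.chost_id = p.chost_id) as comment_count,
  (select array_agg(t.tag) from chost_tags t where t.chost_id = p.chost_id) as tags,
  (select array_agg(a.url) from attachments a where a.chost_id = p.chost_id) as attachments
from chosts p
where p.chost_id = any(:posts_on_this_page);

a handful of milliseconds per subquery, times three subqueries, times every post on a dashboard page, adds up to whole seconds.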

how did you fix it?

same way anyone fixes basically anything slow in computing: computing a result once, storing the result of the computation, and then only recomputing it when you have to (i.e., when someone makes a post, comments on a post, changes the tags on a post, adds an attachment to a post, etc.). it's called a space-time tradeoff and it's extremely cool.

why didn't you fix it earlier?

with our level of understanding of relational databases at the time[1], we assumed that we would have to do something super complicated and fragile involving other services; we also knew that the queries that built these pages were slow, but hadn't done the digging into exactly which parts of the queries were slow, and didn't know how much faster the slow parts would get if we fixed them. so we put it all off for a while hoping that we'd find something else that improved performance while being less of a headache to implement.

but then by chance, reading through the PostgreSQL documentation, we discovered that (duh) these problems aren't new, so people had seen fit to make this technique straightforward in-database decades ago, in the form of materialized views. we didn't get to use PostgreSQL's built-in materialized views for this, because they have to be manually refreshed across the whole view at once -- i.e., any time anyone posted, commented, etc., we would have to recompute these statistics across the whole database[2] -- but this is a well-known limitation, and if you know the magic phrase, there are plenty of intermediate-level blog posts about building your own materialized view by hand that works the way you want it to. here's the one we used.
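
the general shape of a hand-rolled materialized view (sketched here with made-up names, and simplified way down from anything you'd actually run) is a plain table that caches the answer, plus a trigger that keeps it fresh:

-- sketch only, not the real schema: a cache table that stores the answer...
create table chost_comment_counts (
  chost_id      bigint primary key,
  comment_count bigint not null default 0
);

-- ...and a trigger that nudges it whenever the underlying data changes
create function bump_comment_count() returns trigger as $$
declare
  target bigint;
  delta  bigint;
begin
  if tg_op = 'INSERT' then
    target := new.chost_id; delta := 1;
  else
    target := old.chost_id; delta := -1;
  end if;

  insert into chost_comment_counts (chost_id, comment_count)
  values (target, delta)
  on conflict (chost_id)
  do update set comment_count = chost_comment_counts.comment_count + excluded.comment_count;

  return null;
end;
$$ language plpgsql;

create trigger comments_bump_count
after insert or delete on comments
for each row execute function bump_comment_count();

-- at render time the count is a single primary-key lookup instead of a recount:
-- select comment_count from chost_comment_counts where chost_id = :id;

a real version has more to cover (backfilling existing posts, the other cached fields, and so on), but that's the whole trick.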

no new techniques here, no strokes of genius, no tech talk material, just knowing the right two words to punch into google to find a how-to article.


  1. I had the opportunity to take a databases class back in college but discarded the idea as boring[3] so I didn't learn back then, and while we've used databases plenty over our careers so far, this is our first time dealing with database-bound operations as a core part of our day job.

  2. this process of computing the statistics for the entire database, incidentally, is what took almost 20 minutes to run during the deploy on Wednesday -- so you can see how that's not something we want to do a few thousand times a day.

  3. still 100% sure I was right about it being boring.[4]

  4. should've still taken it anyway, probably. I took classes in operating systems and artificial intelligence instead and I've barely used either of those.



in reply to @staff's post:

the owners-only bug is that when anyone posts, that post is distributed to people’s dashboards in a background task, and the very last thing that background task did was distribute a copy of the post to you, the poster; if you have a lot of followers, this meant that your posts sometimes didn’t show up on your own timeline for a couple of minutes. now, all of the copies are distributed in parallel, which means the process itself goes faster since multiple copies can be distributed at once.

i don't think i have nearly as many followers but my posts weren't showing up on my own timeline until a refresh/some other posts made a 'new posts!' banner happen, i wonder if this'll affect me

also brain go whirrr at fun behind da scenes stuff

in reply to @vogon's post:

views are usually just subqueries you can set permissions on and/or treat as if they are a table that actually contains fewer/different/joined columns. materialized views actually write the result to disk when they are created/refreshed and can have their own indexes, which fuckin rocks.
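
in postgres terms the difference looks roughly like this (toy example, names invented):

-- a plain view: just a saved query, re-run every time you select from it
create view comment_counts_v as
select chost_id, count(*) as comment_count
from comments
group by chost_id;

-- a materialized view: same query, but the result is written to disk and can be indexed
create materialized view comment_counts_mv as
select chost_id, count(*) as comment_count
from comments
group by chost_id;

create unique index on comment_counts_mv (chost_id);

-- the catch from the post above: refreshing recomputes the entire thing at once
refresh materialized view comment_counts_mv;

(refresh materialized view concurrently at least doesn't lock out reads while it runs, and needs that unique index, but it still recomputes the whole thing.)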

yeah, my assumption is that plain old views primarily get used for

  1. being able to shim old and new schema shapes onto one another during schema changes
  2. databases deployed in contexts where there are data entry people who get a user interface generated directly off of the database schema

yeah i'm pretty sure a lot of people use them for permissions gating sensitive columns! you can just lock down the parent table and then provide a set of limited views that users are actually allowed to look at

also good for aliasing stuff in three-phase schema migrations, saving queries you keep using and don't want to have to always remember, or even rendering multiple different layouts read-transparent at the top level so citus or whatever can just obliviously treat whatever-this-shard-is as if it's a normal table despite you having a handful of totally different ways the data might be actually stored.
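
the permissions-gating version of that is basically (toy example, role and table names invented):

-- lock the table down, expose only the safe columns through a view
create table users (
  user_id bigint primary key,
  handle  text not null,
  email   text not null  -- sensitive; most roles have no business reading this
);

revoke all on users from reporting_role;

create view users_public as
select user_id, handle from users;

grant select on users_public to reporting_role;

reporting_role can select from users_public but never from users directly, since by default the view reads the underlying table with its owner's permissions.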

'just knowing the right two words to punch into google to find a how-to article' is extremely relatable. i think every problem i've ever had on the computer was solvable this way whether i found the words or not

whenever i talk to our database engineers at riot i just say "hello i have noticed that this thing is weird. please help?" then they utter unknowable phrases of power and things im pretty sure are spells while i nod politely then say "thanks!" when it somehow fixes it

pretty sure any database course in uni is boring because it's gonna be some guy telling you about codd's 12 rules of databases, back in the day when databases went offline, the majority of users were directly writing sql, and all application code lived inside the database

anyway, one tiny problem with denormalizing your database, which is sorta what you're doing by storing things in a way that doesn't need joins, is that some things are a bit harder

so like tags, that should be fine, one person will update tags, so you can just do "store this in the array/jsonb/text column too" and hooray life moves on
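
i.e. something like (sketch, made-up column name, postgres array flavor):

-- hypothetical denormalized copy of a post's tags right on the post row
alter table chosts add column cached_tags text[] not null default '{}';

-- tag edits come from one writer at a time, so a plain overwrite is fine
update chosts
set cached_tags = array['tag one', 'tag two']
where chost_id = :id;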

but, well, with things like aggregates, well, it gets finickity

you kinda wanna do "add a post" and then "increment this number by one"

but well, if two people do a comment at the same time, then you end up with two queries going "replace this with 9" instead of one with "replace this with 9" and "replace this with 10"

so, one way to fix it, is to say "update this post count where post id is blah and post_count is old_count", but, well, then you get the problem where someone else ran the update first and now you gotta loop and that feels bad somehow

another way, which feels less bad, is to do "update this post count to the sum of the comments", which sure enough, for the same reasons, might be wrong sometimes, but when things are calm, or later on, the post count will always be correct, which is nice
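
in sql terms, the shapes being described here look something like (sketches, made-up names):

-- the racy version: two clients both read 8, both write 9, one comment is lost
update chosts set comment_count = 9 where chost_id = :id;

-- the compare-and-swap version: if this touches 0 rows, someone beat you to it
-- and you have to re-read and loop
update chosts
set comment_count = :old_count + 1
where chost_id = :id
  and comment_count = :old_count;

-- the recompute version: might briefly disagree under concurrent writes, but
-- always converges back to the truth
update chosts
set comment_count = (select count(*) from comments c where c.chost_id = :id)
where chost_id = :id;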

there's another secret way you can do this that is interesting and i want to try out: you can borrow a concept from LSMT and create a table for that aggregate that has the columns (group, partial_agg, partial_agg_2, ...) and (optionally, for read locality) a covering index.

if all updates insert partial aggregate values equal to the amount of change they caused in each metric (like adding a comment inserts +1, deleting one inserts -1), you can read the aggregate like SELECT sum(partial_comment_count), count(*) FROM chosts_partial_agg WHERE chost_id = :id;.

if the count(*) ever exceeds some limit you have set, you can (lazily, in the background!) run a transaction that reads-and-deletes all the rows it can and inserts a new one with their combined aggregate. even these cleanup transactions can be pretty fast and can be designed to be run concurrently.

this way the aggregates still read quickly, can insert very quickly with minimal contention, and are always immediately visible to reads.
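
as a concrete sketch (the table and column names are borrowed from the SELECT above, everything else is invented; the compaction step itself gets sketched further down the thread):

-- the partial-aggregate table: nothing is ever updated in place, only inserted
create table chosts_partial_agg (
  chost_id              bigint not null,
  partial_comment_count bigint not null
);
create index on chosts_partial_agg (chost_id) include (partial_comment_count);

-- adding a comment inserts +1, deleting one inserts -1, so concurrent writers
-- never touch the same row and never contend
insert into chosts_partial_agg (chost_id, partial_comment_count) values (:id, +1);

-- reading the aggregate; count(*) doubles as the check against your compaction threshold
select sum(partial_comment_count), count(*)
from chosts_partial_agg
where chost_id = :id;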

thing is, a count(chost_parent_id) over an indexed column should be faster than a count() over a table of +1/-1 operations

even then, caching the value in some way/materializing it, that makes more sense given the heavy read load and minor write load

a more lsmt-style idea is really for getting aggregates over a write-heavy and read-light table, but one where you don't want to do a full scan for counts

the cheapo way i'd do that is to have several rows, a hash function on the reply id, and effectively partitioning the count update
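
something like this, i think (sketch, invented names, 16 shards picked arbitrarily):

-- sharded counter: each post gets up to 16 counter rows, and a writer picks one
-- by hashing the reply id, so concurrent writers mostly hit different rows
create table chost_comment_count_shards (
  chost_id bigint not null,
  shard    int    not null,
  n        bigint not null,
  primary key (chost_id, shard)
);

insert into chost_comment_count_shards (chost_id, shard, n)
values (:id, :reply_id % 16, 1)  -- any cheap, stable hash of the reply id works
on conflict (chost_id, shard)
do update set n = chost_comment_count_shards.n + 1;

-- readers sum across the shards
select coalesce(sum(n), 0) from chost_comment_count_shards where chost_id = :id;

reads get slightly more expensive and writes stop fighting over one hot row, which is the whole point.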

but at this point, anyway, people won't remember the exact count of posts, and it would be more interesting to show "new posts since you last looked" i think???

the trick here is the aggregates table can be collapsed every time it hits, like, 32 rows per group or whatever threshold. and you can always collapse it to 1 row as long as nothing else is going on. (if something else is going on you only end up collapsing the ones that were visible when you started.) so changing the aggregate is still linear, but instantly and transactionally visible with very low contention; updating still has approximately constant amortized cost; and reading it is also expected constant cost.

to put it another way i guess the idea here is that LSMTs are good for write-heavy, read-light workloads because they eliminate contention and because they can let the log grow large; but if you invest resources in optimizing reads by collapsing when there are relatively few rows, you still benefit from the lack of contention. it's also great because these aggregates are often going to be O(1) size in 1 row, like counts and sums. (the expected write amplification is a factor of approximately O(1 + 1/t), where t is your threshold, because that's how many extra new collapsed rows you write; each row that's written otherwise is always eventually deleted afterwards assuming the values keep changing.) and the practical read overhead can be very low, because individual partial aggregate entries will often all be found on a single page.

it is possible for the table to desync because it isn't enforced by the database, but if it does it's because you made a mistake in the implementation. basically this is just a fun way you can get both high throughput and effectively perfect read-after-write without overwhelming write amplification or high read cost.

"as long as nothing else is going on"

that's the hard bit. and usually the expensive bit in any concurrent system

i mean you could do SKIP LOCKED

"(if something else is going on you only end up collapsing the ones that were visible when you started.)"

now, the problem here is that in order for this to be correct, you need not just serializable but linearizable access to the table, if i'm thinking right?

well, maybe just serializable, so that's a honking big lock

or you need to partition the rows ahead of time such that any particular writer doesn't delete and insert from overlapping segments

i can see where you're coming from but, well, it's the sort of hijinks that works when you're working at the b-tree level, not at the sql level

yeah. it's something you can build in the database, even if it's not the most perfectly efficient fit. you get some tradeoffs, but it can be Good Enough and a big step up for a lot of use cases.

the algorithm can pretty easily be made transactionally MVCC-correct if you use snapshot isolation modes like REPEATABLE READ, which also give the stronger standard for repeatable-read consistency because the algorithm reads everything it modifies.

the locking isn't overwhelmingly important; you can even just do it naively as

with t_old as (delete from t where chost_id = :id returning partial_agg)
insert into t (chost_id, partial_agg)
select :id, sum(partial_agg) from t_old
having count(*) > 0;

...and not care about locks too much, because the compacting action never changes a read result and doesn't need to be done synchronously with any read or write.

with SKIP LOCKED it actually still works great, because any transaction that deletes a set of rows and inserts their sums is still correct. "as long as nothing else is going on", the compaction will delete all the other rows... but if it doesn't, that's not actually a big deal! the main difference you see if you do different kinds of locks is that these compaction transactions make different mixes of progress, but the reads and writes never care.

it would be a cool type of weird index to implement inside the database itself, but we don't really have good syntax to declare an index on agg(x) group by y, and there are some other things about it that don't fit well. it could also be implemented in a different store that's designed for it, but honestly the most attractive part is the way you can embed exact transactional aggregates into the MVCC source of truth :)