ConfuSomu

one of the many twilight sparkles

gender is a mess, maybe something like (genderfluid) transfem enby, but like: horse! gay! girl!!

:eggbug: 🏳️‍⚧️

name-color: purple


my website (with contact links)
twilightsparkle.space/

vogon
@vogon

hey there! you're (hopefully!) noticing that the site is running a little more quickly today. we mentioned it in the patch notes in passing, but I wanted to write a little bit longer of a description of what we did to speed it up.


there were two big changes:

content_from_project_ids....... 2!

there's a field associated with each post that lists which pages [horrible voice] contributed content to it -- that is, made the post, or made any post to which it is replying which contains text -- internally called content_from_project_ids; this field is used to filter out posts which are by pages who have blocked you, or whom you have blocked or silenced.

on an abstract level, the check we need to make here is called set intersection; a post is visible if the intersection of content_from_project_ids and the sets of the pages you have silenced or blocked is empty. this is one of those operations in computer science that's common (so there are prebuilt tools to use to make it fast) but not that common (so you can't just expect to be able to do it with anything off the shelf), so, as with a huge amount of software you use on a day to day basis, when we were building this for the first time, we checked what stack overflow had to say about standard solutions and went with hstore, because it deduplicates set elements for you, which means less code to write.

we weren't making any obvious mistakes; our hstore queries were indexed, and all told, the queries were pretty fast -- microseconds a piece -- but we were getting reports from people whose bookmarked tags page was timing out, and the common thread between them was that they had a lot of blocks, suggesting a performance problem with content_from_project_ids. I built a small piece of code to generate a small subset of our database schema and benchmark hstore queries versus a couple of other possible implementations (integer arrays, with both Postgres's native array intersection operator and the contributed implementation in intarray); intarray ran 4-fold faster than hstore even with an index, and the native array intersection ran 4-fold faster than intarray!

here's a fun little graph of the benchmark results (note that the y-axis is logarithmic)1:

a small graph showing the performance of hstore queries degrading from 10ms/query with 10 outgoing blocks to 100ms/query with 100; intarray degrading from 7ms/query with 10 to 240ms/query with 500; and native array operations degrading from 6ms/query with 10 to a mere 30ms/query with 500

for the time being, to speed up performance with as little work as possible, whenever content_from_project_ids is edited, we tell postgresql to copy it into another field, stored as an array, called content_from_project_ids_2; the data is written into the old field and then read from the new field. obviously now we're storing two redundant copies of the same data, so we're going to go back and clean up the old version when we get the chance, but that can occur invisibly in the background.

after this change, one of the reporters' bookmarked tags pages went from taking 22 seconds to populate, down to 13 seconds. however, this check is run by pretty much every part of the site that builds a list of posts, so everyone will get at least some boost out of it.

search tags

obviously, a 13 second page load time is still pretty bad, so I also went looking for performance problems elsewhere.

first, a brief history lesson: when we launched tag search, it searched for tags on every post. unfortunately, at some point, people started replying to posts which were obviously intended by the original poster to be shared mostly/only with their followers, and making them publicly searchable by adding a tag with a large readership. this was a major problem, so we made a hotfix to completely ignore all tags on all replies in site-wide tag searches while we figured out a more permanent fix. eventually, we moderated that by counting tags on replies toward site-wide tag searches if either of two things were true2:

  • you made the original post;
  • you didn't make the original post, but the original poster used the same tag.

we were always vaguely aware that this check had something to do with the performance problems on the bookmarked tag page, but in the past, it was always one problem among many, and this was a good opportunity to reassess the performance characteristics of it. when we did, we discovered that recomputing the second half of this check alone now made up more than 90% of the database execution time for those views under certain circumstances -- and, similarly to the first change, the check itself is as fast as we can make it, but it needs to be run potentially many, many times per page load. so now, alongside a few other pieces of possible-but-expensive-to-recompute-on-the-fly information, every post is associated with a precomputed list of "search tags" -- the tags it's searchable by -- and the queries just check to see if the tag you're searching for is in that list.

this change reduced the reporter's 13-second query time to half a second or so -- which still isn't world-beating, but it's good enough to live with for now.

conclusion

hopefully this provides a little more insight than just a little bullet point that says "bug fixes and performance improvements." if there's a takeaway here: we wouldn't have been able to fix either of these bugs without people e-mailing us at support@. there's too much stuff to work on and our visibility into performance still isn't perfect, and knowing of specific users having a specific problem helped us work on specific solutions.

thanks for reading all this, and thanks as always for using cohost!


  1. observant readers will note that this graph shows each query taking tens of milliseconds rather than the microseconds I mentioned above -- there's a reason for this! after several frustratingly ambiguous test passes, I realized that because the benchmark database was so much simpler than our production one, to keep all of the query execution times out of the noise floor created by the overhead of connecting to PostgreSQL and having it parse the query, I had to make it perform a lot more of these set intersection operations per query than in production to actually visualize any difference.

  2. note that per-user searches are exempt from this check; e.g., if you tag all of your shares #rebug, they'll all be accessible from cohost.org/username/tagged/rebug regardless of whether you made the original post.


You must log in to comment.

in reply to @vogon's post:

Question! What if I want to add a useful tag to another user's post? Say they posted something technical and just tagged it "code" and I want to rebug and add the more specific tags "javascript", "deno", etc. Will my tag enhancements just do nothing now?

to be clear, this isn't a behavior change; this is how it's worked since very shortly after we launched tag search. we originally wanted it to be possible for people to help out like this, but it was almost immediately misused.

we might have a fancier solution for this in the future, but for now, you can always hop into their comments and suggest that they might want to add some appropriate tags to the original post!