staff
@staff

hi folks! quick turnaround on this postmortem because we know Exactly What Happened and mostly Why It Happened

WHAT HAPPENED

While deploying tag silencing (live now by the way, check your settings page), a required database change led to the database needing to rebuild the index of posts by tag, taking it offline.

We knew the index rebuild would happen, so we planned for it to occur all at once during the deployment. That made the deployment a little slower, but we thought it would leave us with a working site afterwards. However, for unforeseen reasons, the index went offline and didn’t come back.

This meant that instead of being able to quickly grab a big list of posts in, for example, the "eggbug" tag, we had to go through every single post, individually, in order, until we found ones that were tagged with "eggbug". There are a lot of posts on cohost, so this was incredibly slow and caused serious performance issues, bringing the entire site to a halt.
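To make the difference concrete, here is a toy sketch in Python. It is not how postgresql stores or searches anything internally, and the post data and function names are made up for illustration. It only contrasts "check every post in order" with "look the tag up in a prebuilt index".

```python
from collections import defaultdict

# Hypothetical toy data: (post_id, tags). Not cohost's actual schema.
POSTS = [
    (1, ["eggbug", "art"]),
    (2, ["music"]),
    (3, ["eggbug"]),
    (4, ["art", "music"]),
]


def find_by_scan(posts, tag):
    """Check every post, in order. Work grows with the total number of posts."""
    return [post_id for post_id, tags in posts if tag in tags]


def build_tag_index(posts):
    """Build a mapping from each tag to the ids of posts carrying that tag."""
    index = defaultdict(list)
    for post_id, tags in posts:
        for t in tags:
            index[t].append(post_id)
    return index


def find_by_index(index, tag):
    """A single lookup, no matter how many unrelated posts exist."""
    return index.get(tag, [])


TAG_INDEX = build_tag_index(POSTS)
```

Both approaches return the same posts. The scan just has to touch every post to get there, which is what hurts when there are a lot of them.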

We opted to roll back to the previous deploy, which we knew also wouldn’t work (the database change was breaking for old versions), but would error out quickly instead of thrashing the database. This was the cause of the "e.filter is not a function" error you may have seen.


WHY THIS HAPPENED

This part is jargony, not a lot we can do there.

We know that the change made to the per-post tags index — a small data type change in the indexed column — did not cause postgresql to internally run analyze afterward, which meant that the query planner didn’t know it could use the index. Even if the index was totally fine (it was), the database didn’t use it. We do not know why postgresql did not run analyze; this is not an issue we have encountered in the past.

The fix, once we had determined the issue, was as simple as manually running analyze on the relevant table and column. This query took approximately 700ms and immediately fixed the problem. Once we confirmed it was fine, we re-deployed the newest version of cohost.

WHY THIS WON’T HAPPEN AGAIN

  • Now that we are aware this is an issue, we will refrain from doing anything that could cause an implicit in-place index change.
  • In the event it’s unavoidable, we will make sure the manual analyze step is run as part of the migration.
  • In order to hopefully detect these issues before deploying, we are working on tools to replicate a production-scale database in our development environments. Many aspects of postgresql’s query planner behave differently in our dev environments vs. in production, as many optimizations against large datasets are unused for small ones, and vice versa.
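As a rough sketch of the second point, a migration could emit the type change and the analyze step together so the analyze can't be forgotten. This is only an illustration under assumptions: the helper name, the table name post_tags, the column name tag, and the target type text are all hypothetical, and this is not the actual cohost migration. The two SQL statement forms themselves are standard postgresql.

```python
def migration_statements(table, column, new_type):
    """Return the SQL for a column type change followed by an explicit ANALYZE.

    Identifiers are interpolated without quoting, so this is only suitable for
    trusted, hard-coded names inside a migration file (an assumption of this sketch).
    """
    return [
        # Changing the column type makes the database rebuild indexes on that column.
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {new_type};",
        # Refresh planner statistics for just this column right after the change.
        f"ANALYZE {table} ({column});",
    ]


# Hypothetical names, for illustration only.
STATEMENTS = migration_statements("post_tags", "tag", "text")
```

Running the statements in this order, inside whatever migration tool is in use, means the planner statistics get refreshed right after the index is rebuilt.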

WHAT YOU CAN DO TO HELP

keep thinking about eggbug. maybe silence a tag, if you want! you can do that now.

sorry for the inconvenience, and thanks as always for using cohost!



in reply to @staff's post:

I assume:

  • Tag Muffling = Post still appears on feed, just behind a clickthrough, if it has a Muffled Tag.
  • Tag Silencing = Post doesn't appear on your feed at all if it has a Silenced Tag

I also assume Silencing is not temporary, just like with all other forms of Silencing on Cohost 🤔

Gotcha gotcha. I misworded. "Temporary" isn't right. I was thinking about silenced posts and then "oh look another post from the same person!". I goofed it.

Thanks for the explainer though!