hi folks! quick turnaround on this postmortem because we know Exactly What Happened and mostly Why It Happened
WHAT HAPPENED
While deploying tag silencing (live now by the way, check your settings page), a required database change led to the database needing to rebuild the index of posts by tag, taking it offline.
We knew that the index rebuild would happen, so we planned for it to occur all at once during the deployment process. That made the deployment a little slower, but we thought it would leave us with a working site afterwards. However, for unforeseen reasons, the index went offline and didn’t come back.
This meant that instead of being able to quickly grab a big list of posts in, for example, the "eggbug" tag, we had to go through every single post, individually, in order, until we found ones that were tagged with "eggbug". There are a lot of posts on cohost, so this was incredibly slow and caused serious performance issues, bringing the entire site to a halt.
We opted to roll back to the previous deploy, which we knew also wouldn’t work (the database change was breaking for old versions), but would error out quickly instead of thrashing the database. This was the cause of the "e.filter is not a function" error you may have seen.
WHY THIS HAPPENED
This part is jargony, not a lot we can do there.
We know that the change made to the per-post tags index — a small data type change in the indexed column — did not cause postgresql to internally run analyze afterward, which meant that the query planner didn’t know it could use the index. Even if the index was totally fine (it was), the database didn’t use it. We do not know why postgresql did not run analyze; this is not an issue we have encountered in the past.
The fix, once we had determined the issue, was as simple as manually running analyze on the relevant table and column. This query took approximately 700ms and immediately fixed the problem. Once we confirmed it was fine, we re-deployed the newest version of cohost.
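For the curious, here is a rough sketch of what that kind of fix looks like. The table and column names (post_tags, tag) are made up for illustration and are not our real schema, and the exact plan output will vary.

```sql
-- Hypothetical names: post_tags and tag stand in for the real table and column.

-- Refresh the planner's statistics for just the affected column.
ANALYZE post_tags (tag);

-- Check the plan afterwards. An index scan (or bitmap index scan) instead of
-- a sequential scan over the whole table means the planner is using the
-- index again.
EXPLAIN SELECT post_id FROM post_tags WHERE tag = 'eggbug';
```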
WHY THIS WON’T HAPPEN AGAIN
- Now that we are aware this is an issue, we will refrain from doing anything that could cause an implicit in-place index change.
- In the event it’s unavoidable, we will make sure the manual analyze step is run as part of the migration.
- In order to hopefully detect these issues before deploying, we are working on tools to replicate a production-scale database in our development environments. Many aspects of postgresql’s query planner behave differently in our dev environments vs. in production, as many optimizations against large datasets are unused for small ones, and vice versa.
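As a sketch of the "run analyze as part of the migration" item above: a migration that changes the type of an indexed column could end with an explicit analyze in the same transaction. The names and the target type here are hypothetical, not the actual migration we ran.

```sql
BEGIN;

-- Hypothetical type change on an indexed column; the index is rebuilt in place.
ALTER TABLE post_tags ALTER COLUMN tag TYPE text;

-- Explicitly refresh statistics instead of relying on postgresql to do it
-- on its own after the change.
ANALYZE post_tags (tag);

COMMIT;
```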
WHAT YOU CAN DO TO HELP
keep thinking about eggbug. maybe silence a tag, if you want! you can do that now.
sorry for the inconvenience, and thanks as always for using cohost!


