We appear to have recovered from an outage that lasted just under an hour this afternoon.
We’ve determined the cause to be a large number of requests for a single user’s profile, sent from a single Pleroma instance in under two minutes, which caused a site-wide denial of service. The instance in question appears to connect via Tor, which makes automated detection and blocking difficult. However, we currently believe the incident was accidental, caused by a web spider with an overly aggressive retry policy.
We have blocked access from this instance at the firewall level. Immediately after the blocking rule went into place at 20:15, we began seeing the application recover across the platform.
We’re looking into ways to prevent this sort of outage in the future, and we’re fast-tracking some large architectural work so that this kind of behavior can no longer cause downtime. Fortunately, we were able to scope the blocking rule to that single instance, so other Pleroma instances are not affected.
