Summary: January 29 Outage
On January 29, our Async Transcription and LLM Gateway went down for 25 minutes, followed by two hours of degraded performance. Here's what happened.
Our traffic has been growing steadily, and on January 29 we received a larger-than-usual influx of batch transcription jobs. Autoscaling did what it should and spun up more workers. We ran out of spot instances, fell back to on-demand, and then hit AWS's Elastic Network Interface (ENI) limit. Every ECS task needs its own ENI. With no ENIs available, our authentication service couldn't rotate its tasks. It scaled to zero. Everything stopped.
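For context, here is a minimal sketch of the kind of headroom check we were missing. It assumes boto3 credentials are available, and the quota value is a placeholder, not our actual regional limit:

```python
import boto3

# Placeholder: the per-region ENI quota for the account. The real value can be
# looked up in Service Quotas; 5000 here is illustrative only.
REGIONAL_ENI_QUOTA = 5000

def eni_utilization(region: str = "us-east-1") -> float:
    """Return the fraction of the regional ENI quota currently consumed."""
    ec2 = boto3.client("ec2", region_name=region)
    in_use = 0
    # Each ECS task in awsvpc networking mode owns one ENI, so this count
    # rises with task count on top of instance primary ENIs and other
    # attachments (load balancers, NAT gateways, and so on).
    paginator = ec2.get_paginator("describe_network_interfaces")
    for page in paginator.paginate():
        in_use += len(page["NetworkInterfaces"])
    return in_use / REGIONAL_ENI_QUOTA

if __name__ == "__main__":
    utilization = eni_utilization()
    print(f"ENI utilization: {utilization:.0%}")
    if utilization > 0.8:
        print("WARNING: approaching the regional ENI limit")
```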
When we tried to recover, a Lambda function meant to help with failures was hammering the AWS API with describe calls, exhausting the rate limit for the very operations we needed to bring things back. A tool built for recovery was blocking recovery.
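The general shape of the fix is to let the SDK back off instead of hammering. A sketch of the pattern (not our actual Lambda), using botocore's adaptive retry mode and one paginated describe pass per invocation:

```python
import boto3
from botocore.config import Config

# Adaptive retry mode makes the SDK back off client-side when it sees
# throttling, instead of retrying immediately and burning the shared API
# rate limit that manual recovery operations also depend on.
ec2 = boto3.client(
    "ec2",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}),
)

def snapshot_instance_states() -> dict[str, str]:
    """Hypothetical single pass: one paginated describe call per invocation,
    with the invocation rate controlled by the scheduler rather than a tight
    retry loop inside the function."""
    states: dict[str, str] = {}
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                states[instance["InstanceId"]] = instance["State"]["Name"]
    return states
```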
What actually went wrong:
We didn't monitor ENI utilization, and our authentication service shared capacity with the batch workers. A traffic spike in one starved the other.
What we're doing:
Short-term: ENI utilization alerts (see the sketch after this list), isolate auth on dedicated capacity, fix the Lambda.
Medium-term: Full multi-region for everything.
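For the alerting piece, one way to wire it up is to publish ENI utilization as a custom CloudWatch metric and alarm on it. A sketch, with the namespace, metric name, threshold, and SNS topic all placeholders rather than our real configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_eni_utilization(utilization: float) -> None:
    """Push the current ENI utilization (0.0-1.0) as a custom metric."""
    cloudwatch.put_metric_data(
        Namespace="Infra/Capacity",            # placeholder namespace
        MetricData=[{
            "MetricName": "EniUtilization",    # placeholder metric name
            "Value": utilization,
            "Unit": "None",
        }],
    )

def create_eni_alarm(sns_topic_arn: str) -> None:
    """Alarm when ENI utilization stays above 80% for two 5-minute periods."""
    cloudwatch.put_metric_alarm(
        AlarmName="eni-utilization-high",
        Namespace="Infra/Capacity",
        MetricName="EniUtilization",
        Statistic="Maximum",
        Period=300,                # 5-minute periods
        EvaluationPeriods=2,       # two consecutive breaches
        Threshold=0.8,             # placeholder threshold
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )
```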
This was preventable. We had the warning signs and didn't connect them. We're sorry, and we're fixing it.