Async + LLM Gateway APIs returning 5xx

Incident Report for AssemblyAI

Postmortem

Summary: January 29 Outage

On January 29, our Async Transcription and LLM Gateway went down for 25 minutes, followed by two hours of degraded performance. Here's what happened.

Our traffic has been growing steadily, and we received a larger-than-usual influx of batch transcription jobs. Autoscaling did what it should: it spun up workers. We ran out of spot instances, fell back to on-demand, and then hit AWS's Elastic Network Interface (ENI) limit. Every ECS task in our setup needs its own ENI. With no ENIs available, our authentication service couldn't rotate its tasks. It scaled to zero. Everything stopped.
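For illustration only, here is a minimal sketch (not our production tooling) of how the end of that chain can be caught. It assumes boto3 credentials and a placeholder cluster name, and flags any ECS service whose running task count has fallen below its desired count, which is the condition our authentication service hit.

```python
# Hypothetical check: flag ECS services running fewer tasks than desired.
# Cluster name is a placeholder; this is a sketch, not our actual tooling.
import boto3

ecs = boto3.client("ecs")
CLUSTER = "primary"  # placeholder cluster name

service_arns = []
for page in ecs.get_paginator("list_services").paginate(cluster=CLUSTER):
    service_arns.extend(page["serviceArns"])

# describe_services accepts at most 10 services per call
for i in range(0, len(service_arns), 10):
    resp = ecs.describe_services(cluster=CLUSTER, services=service_arns[i:i + 10])
    for svc in resp["services"]:
        if svc["runningCount"] < svc["desiredCount"]:
            print(
                f"{svc['serviceName']}: running {svc['runningCount']} "
                f"of {svc['desiredCount']} desired tasks"
            )
```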

When we tried to recover, a Lambda function meant to help during failures was hammering the AWS API with describe calls, triggering rate limits on the very operations we needed to bring things back. A tool built for recovery was blocking recovery.
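The fix on that front is conceptually simple: cap and pace the describe calls instead of letting them retry aggressively. As a hedged sketch (assuming the function uses boto3; the client and the describe call are illustrative, not our actual Lambda), botocore's built-in adaptive retry mode does most of the work:

```python
# Sketch: pace AWS describe calls so a recovery tool doesn't consume
# the API rate limit it depends on. Illustrative only.
import boto3
from botocore.config import Config

# Cap attempts and enable client-side rate limiting on throttling errors,
# so bursts of describe calls back off instead of hammering the API.
paced = Config(retries={"max_attempts": 4, "mode": "adaptive"})

ec2 = boto3.client("ec2", config=paced)

def lambda_handler(event, context):
    # Illustrative describe call; the real function's queries differ.
    reservations = ec2.describe_instances()["Reservations"]
    return {"instance_count": sum(len(r["Instances"]) for r in reservations)}
```

The "adaptive" mode layers client-side rate limiting on top of standard retries, so a burst of throttling errors slows the caller down rather than amplifying the load.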

What actually went wrong:

We didn't monitor ENI utilization, and our authentication service shared resources with our batch workers. A traffic spike in one starved the other.

What we're doing:

Short-term: add ENI utilization alerts (sketched below), isolate the authentication service on dedicated capacity, and fix the Lambda.

Medium-term: run every service fully multi-region.
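To make the first short-term item concrete, here is a sketch of the kind of alert we mean, with placeholder namespace, quota value, threshold, and SNS topic; our real alerting pipeline will differ. It publishes regional ENI utilization as a custom CloudWatch metric and alarms when it crosses a threshold.

```python
# Sketch: publish ENI utilization as a custom metric and alarm on it.
# Namespace, quota, threshold, and SNS topic ARN are placeholders.
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

# Count in-use ENIs in the region.
in_use = 0
for page in ec2.get_paginator("describe_network_interfaces").paginate():
    in_use += len(page["NetworkInterfaces"])

REGIONAL_ENI_QUOTA = 5000  # placeholder; read the real value from Service Quotas
utilization = 100.0 * in_use / REGIONAL_ENI_QUOTA

cloudwatch.put_metric_data(
    Namespace="Custom/Networking",
    MetricData=[{
        "MetricName": "EniUtilizationPercent",
        "Value": utilization,
        "Unit": "Percent",
    }],
)

cloudwatch.put_metric_alarm(
    AlarmName="eni-utilization-high",
    Namespace="Custom/Networking",
    MetricName="EniUtilizationPercent",
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder topic
)
```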

This was preventable. We had the warning signs and didn't connect them. We're sorry, and we're fixing it.

Posted Feb 05, 2026 - 18:35 UTC

Resolved

We have cleared the backlog, and turnaround times are back to normal. Our team continues to actively monitor the situation. We will provide a postmortem when available.
Posted Jan 29, 2026 - 19:32 UTC

Update

We are continuing to work through the backlog, and turnaround times are improving. Our team continues to actively monitor the situation. We will provide updates as more information becomes available.
Posted Jan 29, 2026 - 19:18 UTC

Update

We are continuing to see requests complete successfully. We are working through a backlog of requests, which is resulting in elevated turnaround times while we recover. Our team continues to actively monitor the situation. We will provide updates as more information becomes available.
Posted Jan 29, 2026 - 18:42 UTC

Monitoring

We are seeing requests beginning to successfully complete and have moved to the monitoring phase. Some requests are still returning 500 errors. Our website is down, and logins may be affected. Our team continues to actively investigate and monitor the situation. We will provide updates as more information becomes available.
Posted Jan 29, 2026 - 18:30 UTC

Update

We are seeing requests beginning to successfully complete and have moved to the monitoring phase. Some requests are still returning 500 errors. Our team continues to actively investigate and monitor the situation. We will provide updates as more information becomes available.
Posted Jan 29, 2026 - 18:29 UTC

Update

We are currently investigating an issue affecting our Async and LLM Gateway services, which are returning 500 errors. Our team is actively working to identify and resolve the problem. We will provide updates as more information becomes available.

All Streaming API and EU endpoint services are running normally. This incident is isolated to the NA API endpoints for the Async and LLM Gateway services.
Posted Jan 29, 2026 - 18:09 UTC

Update

We are continuing to investigate this issue.
Posted Jan 29, 2026 - 18:02 UTC

Investigating

We are currently investigating an issue affecting async + LLM gateway services. More updates to come as we learn more.
Posted Jan 29, 2026 - 17:54 UTC
This incident affected: APIs (Asynchronous API, Streaming API, LLM Gateway) and Web (Website, Playground, Dashboard).