Slowdown in transcription completions
Incident Report for AssemblyAI
Postmortem

On January 3rd and January 8th, 2025, we experienced two incidents that resulted in service disruptions and performance degradation for our asynchronous API.

Root Cause Analysis
Increased demand for the AssemblyAI asynchronous API has required changes to our datastore infrastructure to maintain performance and reliability.

  • January 3rd: The Transcription Record Service did not scale out aggressively enough to handle the traffic. We identified a misconfiguration in the scaling policies that constrained the service’s ability to scale out. This resulted in degraded performance and high latency for all requests during the 1-hour window.
  • January 8th: Our team deployed a configuration update to our infrastructure that unexpectedly caused a redeployment of the Transcription Record Service. This resulted in failures for all customer requests during the 20-minute window while the service was redeployed and scaled back out to handle current traffic.

While both of these incidents impacted the same service, the underlying cause of each incident was different.

Resolution and Next Steps
To prevent similar incidents in the future, we have:

  • Adjusted the auto-scaling rules of the Transcription Record Service to scale out more aggressively, incorporating additional performance metrics (an illustrative sketch follows this list)
  • Implemented additional pre-deployment validation checks on infrastructure code changes
  • Strengthened our infrastructure configuration management processes
  • Improved the communication cadence to the status page throughout the incident lifecycle
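
For illustration, the sketch below shows the kind of scale-out calculation referenced in the first item: desired capacity is driven by the worst of several signals (here, CPU utilization and p95 request latency) rather than by CPU alone, so the service grows before it saturates. The metric names, targets, and replica math are simplified, hypothetical examples and do not reflect our production configuration.

```python
import math


def desired_replicas(
    current_replicas: int,
    cpu_utilization: float,            # average CPU utilization across instances, 0.0-1.0
    p95_latency_ms: float,             # 95th percentile request latency in milliseconds
    cpu_target: float = 0.60,          # hypothetical target: scale out above 60% CPU
    latency_target_ms: float = 500.0,  # hypothetical target: scale out above 500 ms p95
    max_replicas: int = 200,
) -> int:
    """Return the replica count needed to bring every metric back to its target."""
    # A ratio > 1.0 means that metric is over target and more capacity is needed.
    cpu_ratio = cpu_utilization / cpu_target
    latency_ratio = p95_latency_ms / latency_target_ms

    # Take the most pessimistic signal so either metric can trigger a scale-out.
    pressure = max(cpu_ratio, latency_ratio, 1.0)
    needed = math.ceil(current_replicas * pressure)

    # Never scale in from this path; a separate, slower policy would handle scale-in.
    return min(max(needed, current_replicas), max_replicas)


# Example: latency is over target even though CPU looks healthy,
# so the calculation still asks for more capacity (10 -> 18 replicas).
print(desired_replicas(current_replicas=10, cpu_utilization=0.45, p95_latency_ms=900.0))
```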

Customer Commitment
We apologize for any inconvenience or disruption these incidents may have caused. We are committed to providing a reliable and high-performing service and are taking steps to prevent similar incidents from occurring in the future. As always, you can rely on the status page for up-to-date information on our current API status and health.

Posted Jan 12, 2025 - 17:20 UTC

Resolved
We have continued to see good performance in processing times and error rates, and all previously throttled jobs have now completed, so we are marking this issue as resolved.
Posted Jan 03, 2025 - 17:57 UTC
Monitoring
We have made changes to address the issues we were seeing earlier. Processing times and error rates are now returning to the normal range. Some requests are still being throttled as a result of the earlier slowdown, but that number is steadily declining as we continue to monitor performance.
Posted Jan 03, 2025 - 17:46 UTC
Update
We are continuing to see improvements and will be moving into monitoring status shortly.
Posted Jan 03, 2025 - 17:43 UTC
Identified
We are working to reset unhealthy instances in our pipeline, and we are now seeing an increase in successful requests and a reduction in the number of failed requests. We are still experiencing longer-than-usual processing times, but we are seeing improvements now.
Posted Jan 03, 2025 - 17:35 UTC
Update
We are still seeing issues leading to longer-than-usual processing times. This slowdown is causing throttling for some accounts as well as some failed requests due to timeout errors.
Posted Jan 03, 2025 - 17:03 UTC
Update
We are still investigating the current issue. The impact is slower-than-usual processing times for requests to our Async endpoint, as well as some failed requests with "the operation timed out" errors.
Posted Jan 03, 2025 - 16:42 UTC
Investigating
We are currently investigating an issue that is resulting in slowdowns in transcription completions. We will share more details as soon as we learn more.
Posted Jan 03, 2025 - 16:38 UTC
This incident affected: APIs (Asynchronous API).