On January 3rd, 2025 and January 8th, 2025, we experienced two incidents that caused service disruptions and performance degradation for our asynchronous API.
Root Cause Analysis
Increased demand for the AssemblyAI asynchronous API has required changes to our datastore infrastructure to maintain performance and reliability.
- January 3rd: The Transcription Record Service did not scale out aggressively enough to handle the traffic. We identified a misconfiguration in the scaling policies that constrained the service’s ability to scale out. This resulted in degraded performance and high latency for all requests during the 1-hour window.
- January 8th: Our team deployed a configuration update to our infrastructure that unexpectedly triggered a redeployment of the Transcription Record Service. All customer requests failed during the 20-minute window while the service was redeployed and scaled back out to handle current traffic.
While both of these incidents impacted the same service, the underlying cause of each incident was different.
Resolution and Next Steps
To prevent similar incidents in the future, we have:
- Adjusted the auto-scaling rules of the Transcription Record Service to scale out more aggressively, using additional performance metrics
- Implemented additional pre-deployment validation checks on infrastructure code changes
- Begun strengthening our infrastructure configuration management processes
- Begun improving the communication cadence to the status page throughout the incident lifecycle
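
To illustrate why scaling on additional metrics and removing an overly tight limit both matter, here is a minimal sketch of a proportional scale-out rule. It is not our production configuration. The function name, the metric names, and all numbers are hypothetical, and the rule itself (replicas scaled by the ratio of observed to target value, taking the largest proposal across metrics) is borrowed from the approach used by autoscalers such as the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_replicas(current, metrics, min_replicas, max_replicas):
    """Return a replica count for a service (illustrative only).

    metrics maps a hypothetical metric name to (observed, target).
    Each metric proposes ceil(current * observed / target) replicas,
    and the largest proposal wins so the busiest signal drives scaling.
    """
    proposals = [
        math.ceil(current * observed / target)
        for observed, target in metrics.values()
    ]
    wanted = max(proposals, default=current)
    # The configured bounds always apply; a ceiling that is set too low
    # holds the service back no matter what the metrics report.
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical load: CPU is moderately high, request backlog is very high.
cpu_only = {"cpu_percent": (90, 60)}
cpu_and_backlog = {"cpu_percent": (90, 60), "queued_requests": (400, 100)}

one_metric = desired_replicas(10, cpu_only, min_replicas=2, max_replicas=100)
two_metrics = desired_replicas(10, cpu_and_backlog, min_replicas=2, max_replicas=100)
tight_cap = desired_replicas(10, cpu_and_backlog, min_replicas=2, max_replicas=20)
```

With these made-up numbers, CPU alone asks for 15 replicas, adding the backlog metric raises the request to 40, and a ceiling of 20 clamps the result to 20 even though the metrics call for more.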
Customer Commitment
We apologize for any inconvenience or disruption these incidents may have caused. We are committed to providing a reliable, high-performing service and are taking steps to prevent similar incidents from occurring in the future. As always, you can rely on the status page for up-to-date information on the API's current status and health.