Delay in API response time

Incident Report for Telnyx

Postmortem

Summary

On May 3, 2019, starting at 11:00 UTC, we experienced delayed responses from api.telnyx.com.

On May 8, 2019, starting at 20:00 UTC, we experienced additional response time delays.

From April 22 to the present, we have experienced momentary spikes in API response times and increased 5XX errors, including on:

Thursday, April 25
Wednesday, May 8
Friday, May 10
Monday, May 13

Impact

Customers making API calls to api.telnyx.com experienced increased response times or 5XX errors.

Root Cause

Increased transit of API requests
Increased API call queuing and timeouts

Timeline (UTC)

March and April: Telnyx observes instability in Google Cloud Central, where some of our infrastructure is located. This results in delayed 200 OK responses and 5XX errors for Telnyx services such as Call Control and Mission Control.

May 3rd, 20:00 UTC: To mitigate the risk of additional Google Cloud Central outages, Telnyx deploys two additional instances of its API Gateway with two different cloud providers, in two different regions. API calls are now routed round robin across four API Gateway instances spanning multiple regions and cloud providers. As a result, calls routed to the new instances inherently take longer to reach the API Gateway.
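
To illustrate why round-robin routing across more distant instances raises average response time, here is a minimal sketch. The gateway names and latency figures are hypothetical assumptions chosen for illustration, not Telnyx measurements.

```python
from itertools import cycle

# Hypothetical network latencies (ms) from a client to each gateway.
# Names and numbers are illustrative assumptions, not measured values.
GATEWAY_LATENCY_MS = {
    "gateway-original-1": 20,
    "gateway-original-2": 20,
    "gateway-new-1": 60,
    "gateway-new-2": 80,
}


def round_robin_latencies(latencies, n_requests):
    """Assign requests to gateways in strict rotation; return each request's latency."""
    rotation = cycle(latencies)  # iterates over the gateway names in order
    return [latencies[next(rotation)] for _ in range(n_requests)]


def mean(values):
    return sum(values) / len(values)


original_only = {k: v for k, v in GATEWAY_LATENCY_MS.items() if "original" in k}
before = round_robin_latencies(original_only, 1000)
after = round_robin_latencies(GATEWAY_LATENCY_MS, 1000)

print(f"mean latency, two gateways:  {mean(before):.1f} ms")
print(f"mean latency, four gateways: {mean(after):.1f} ms")
```

Under these assumed numbers, half of all requests land on the slower new instances, so the mean rises even though no single gateway is unhealthy.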

May 8th, 11:00 UTC: Telnyx migrates database masters from Central to East. API calls that require database look-ups now incur increased latency from communication between Central and East.
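
The effect of cross-region database look-ups can be sketched with simple arithmetic. The processing time, look-up count, and round-trip times below are hypothetical assumptions, and the model assumes look-ups are issued one after another.

```python
def request_latency_ms(processing_ms, db_lookups, db_round_trip_ms):
    """Total latency for one API call when database look-ups run sequentially."""
    return processing_ms + db_lookups * db_round_trip_ms


# Hypothetical figures: 10 ms of processing and 3 look-ups per API call.
same_region = request_latency_ms(processing_ms=10, db_lookups=3, db_round_trip_ms=1)
cross_region = request_latency_ms(processing_ms=10, db_lookups=3, db_round_trip_ms=30)

print(f"service and database in the same region: {same_region} ms")
print(f"service in Central, database in East:    {cross_region} ms")
```

Because the round-trip cost is paid once per look-up, calls that touch the database several times are affected most.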

Action Items

Bypass Telnyx Legacy Edge Stack for latency-sensitive API commands, such as Call Control
Update edge proxy configuration to allow for increased API traffic
Enable region-based service-to-service interactions for API-based services
Add Call Control API response times to status.telnyx.com
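
As a sketch of how published response times could surface momentary spikes like those in this incident, the example below compares the median with a high percentile. The sample values and the choice of the nearest-rank method are assumptions for illustration; they do not describe how status.telnyx.com computes its metrics.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times (ms) containing one momentary spike.
samples_ms = [42, 45, 47, 51, 48, 44, 950, 46, 49, 43]

p50 = percentile(samples_ms, 50)
p95 = percentile(samples_ms, 95)
print(f"p50: {p50} ms, p95: {p95} ms")
```

The median stays near typical values while the 95th percentile exposes the spike, which is why a high percentile is a useful companion to an average on a status page.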

Posted May 14, 2019 - 18:55 UTC

Resolved

This incident has been resolved.

Posted May 10, 2019 - 22:41 UTC

Identified

We have identified an issue with our API response time and are currently investigating. We will provide updates shortly.

Posted May 10, 2019 - 22:04 UTC

This incident affected: Mission Control API (US East Region, US Central Region, US West Region).