Telnyx Service Disruption
Postmortem

Summary

On November 2, 2018, starting at 19:44 UTC (14:44 CDT), we experienced a disruption on a server running one instance of our telephony services. This disruption caused all calls anchored on one of our San Jose servers to be disconnected.

Telnyx has a number of High Availability mechanisms in place for the Telnyx Telephony Engine, including:

  • Distributed nodes where calls can be routed.
  • Automatic re-routing of new call attempts when node-specific issues are detected, including application restarts or server crashes.
  • Call recovery during application restarts.

Telnyx currently lacks an active-call recovery mechanism for the case in which the server running a given set of Telnyx Telephony Engine applications crashes or otherwise becomes unresponsive. This is what happened in this incident.
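The automatic re-routing of new call attempts described above can be sketched as follows. This is a minimal illustration, not Telnyx's actual implementation: the node names, heartbeat mechanism, and timeout threshold are all assumptions.

```python
import time


class NodeRegistry:
    """Illustrative sketch: route new call attempts only to nodes
    with a recent heartbeat. All names and thresholds are hypothetical."""

    def __init__(self, nodes, heartbeat_timeout=3.0):
        self.heartbeat_timeout = heartbeat_timeout
        # Treat every node as freshly seen at startup.
        self.last_heartbeat = {node: time.monotonic() for node in nodes}

    def record_heartbeat(self, node):
        self.last_heartbeat[node] = time.monotonic()

    def healthy_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        return [n for n, t in self.last_heartbeat.items()
                if now - t <= self.heartbeat_timeout]

    def route_new_call(self, now=None):
        candidates = self.healthy_nodes(now)
        if not candidates:
            raise RuntimeError("no healthy nodes available")
        # Simple balancing: pick the least-recently heard-from healthy node.
        return min(candidates, key=lambda n: self.last_heartbeat[n])
```

In a scheme like this, a crashed or unresponsive node simply stops sending heartbeats and ages out of the candidate set, so new call attempts never reach it; calls already anchored there, however, are not helped, which is the gap described above.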

Impact

Approximately 500 customer calls were disconnected.

Cause

The docker-engine process segfaulted at 19:44:35 UTC and recovered at 19:44:38 UTC.

Action Items

  • Telnyx is implementing a new High Availability mechanism that will allow active calls to be recovered. This will require low-level changes to our Back-to-Back User Agent (B2BUA) software.
  • The changes will be implemented so that they do not interfere with existing support features, do not affect metrics such as PDD or CPT, and do not require significantly more resources.
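One way to think about the planned active-call recovery is checkpointing each call's minimal state to replicated storage, so a surviving node can re-anchor calls after a server crash. The sketch below is a high-level assumption for illustration only; the store, field names, and recovery flow are hypothetical, and the actual low-level B2BUA changes are not shown.

```python
import json


class CallStateStore:
    """Hypothetical sketch: checkpoint active-call state so a peer node
    could recover calls anchored on a crashed server."""

    def __init__(self):
        self._store = {}  # stand-in for a replicated key-value store

    def checkpoint(self, call_id, state):
        # Persist only what a peer would need to re-anchor the call.
        self._store[call_id] = json.dumps(state)

    def release(self, call_id):
        # Drop state once the call ends normally.
        self._store.pop(call_id, None)

    def recover_calls(self, failed_node):
        # Collect state for every call anchored on the failed node.
        recovered = {}
        for call_id, raw in self._store.items():
            state = json.loads(raw)
            if state.get("node") == failed_node:
                recovered[call_id] = state
        return recovered
```

The design trade-off hinted at in the action items is that checkpointing on every call-state change must stay cheap enough not to inflate metrics like PDD or CPT.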
Posted Nov 15, 2018 - 14:46 UTC

Resolved
The disruption has been resolved. We will provide updates on the root cause of the issue as we identify it.
Posted Nov 02, 2018 - 20:31 UTC
Monitoring
We experienced a brief disruption to our telephony services at 14:44 CDT. Active calls anchored on one of our San Jose servers at that time were disconnected. Our infrastructure team is working to identify the root cause and implement a solution.

We will provide updates on the root cause of the issue as we identify it.
Posted Nov 02, 2018 - 20:27 UTC