Telnyx Service Disruption

Incident Report for Telnyx

Postmortem

Summary

On November 5, 2018, starting at 20:19 UTC / 16:19 CST, a server hosting telephony services in our London PoP experienced a disruption. The disruption disconnected approximately 50 active calls that the server was handling at the time of the incident.

Telnyx has a number of High Availability mechanisms in place for the Telnyx Telephony Engine, including:

Distributed nodes where calls can be routed.
Automatic re-routing of new call attempts when node-specific issues are detected, including application restarts or server crashes.
Call recovery during application restart.

Telnyx currently lacks an active call recovery mechanism for the case where the server hosting a given set of Telnyx Telephony Engine applications crashes or otherwise becomes unresponsive. This is what happened in this incident.
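The behavior described above can be illustrated with a minimal sketch. It is not Telnyx's implementation: the Node and Router classes, the round-robin selection, and the node names used with them are all hypothetical, chosen only to show how new call attempts can be steered away from a failed node while calls already anchored on that node are lost when no active call recovery exists.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical telephony node (illustrative only, not a Telnyx type)."""
    name: str
    healthy: bool = True
    active_calls: set = field(default_factory=set)


class Router:
    """Assigns new calls to healthy nodes, round-robin (an assumed policy)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # round-robin cursor

    def route_new_call(self, call_id):
        # Try each node at most once, starting from the cursor,
        # and skip any node currently marked unhealthy.
        for offset in range(len(self.nodes)):
            index = (self._next + offset) % len(self.nodes)
            node = self.nodes[index]
            if node.healthy:
                node.active_calls.add(call_id)
                self._next = (index + 1) % len(self.nodes)
                return node
        raise RuntimeError("no healthy node available")

    def mark_failed(self, node):
        # Simulates a server crash. With no active call recovery,
        # every call anchored on the node is dropped.
        node.healthy = False
        dropped = set(node.active_calls)
        node.active_calls.clear()
        return dropped
```

With two nodes, say Node("lon-1") and Node("lon-2"), calling mark_failed on the first returns the set of calls that were anchored there, and every later route_new_call lands on the second. This mirrors the incident: new calls were re-routed, but calls already in progress on the failed server were disconnected.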

Impact

Approximately 50 customer calls were disconnected.

Cause

The server experienced a hardware failure at 20:19:35 UTC. New calls were immediately re-routed.

Action Items

Telnyx is currently implementing a new High Availability mechanism for media that will enable recovery of active calls. This will require low-level changes to our Back-to-Back User Agent software. The changes will be implemented so that they neither interfere with existing support features, nor affect metrics like PDD or CPT, nor require significantly more resources.

Posted Nov 15, 2018 - 22:12 UTC

Resolved

The disruption has been resolved. We will provide updates on the root cause of the issue as we identify it.

Posted Nov 05, 2018 - 23:51 UTC

Identified

We experienced a brief disruption to our telephony services at 16:19 CT. Active calls anchored on one of our London servers at that time were disconnected. Our Telephony team is working to identify the root cause and implement a solution.

Posted Nov 05, 2018 - 22:51 UTC