Telnyx experienced intermittent network issues during the following windows:
January 8th from 14:56 UTC to 17:00 UTC
January 9th from 00:13 UTC to 02:00 UTC
January 9th from 09:54 UTC to 16:48 UTC
This disruption resulted in failures across the following services:
Voice: Outbound calls and inbound calls using credential authentication intermittently failed.
Messaging: Outbound message delivery was intermittently blocked.
Call Control: Increased rate of failed API commands to api.telnyx.com and decreased rate of successfully delivered webhooks.
Website and Portal Access: www.telnyx.com, portal.telnyx.com, and other telnyx.com pages were intermittently unavailable during these windows.
January 8th, 2019
14:56 UTC - Impacted systems listed above begin to experience intermittent disruption. The cause is unknown at this point.
16:22 UTC - Network team identifies brief periods of traffic spikes that overwhelmed certain links between Telnyx core routers in Chicago and Washington DC and cloud providers in these regions. The spikes resulted in dropped data packets, including BGP peering status packets. The loss of BGP peering status packets resulted in brief periods of network unavailability.
17:00 UTC - Intermittent network unavailability ceases
January 9th, 2019
00:13 UTC - Intermittent network unavailability resumes
02:00 UTC - Intermittent network unavailability ceases
09:54 UTC - Intermittent network unavailability resumes
11:55 UTC - Telnyx determines that the increased network traffic and CPU utilization in Chicago and Washington DC were caused by an exploit used to mine cryptocurrency.
16:48 UTC - Telnyx completes software patches, deploys a script to halt the exploit, and updates network configuration to mitigate the unavailability.
Excessive network traffic and CPU utilization caused by the cryptocurrency-mining exploit resulted in intermittent network unavailability on specific links between Telnyx core routers in Chicago and Washington DC and their associated cloud provider regions.
There are two high-level categories of failure that compounded to cause this outage.
Our system was exploited due to a misconfigured firewall: the responsible party took advantage of a known vulnerability in HashiCorp's Consul product, which serves as our service discovery tool. Moving forward, we will increase the regularity and intensity of our network penetration testing.
The exploited vulnerability enabled the third party to use Telnyx compute resources to mine cryptocurrency. After reverse engineering the code, we found no evidence that any customer or Telnyx information was compromised.
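The class of Consul vulnerability described above is typically abused through an exposed HTTP API combined with script-based health checks. As a hedged sketch (the settings below are illustrative, not Telnyx's actual configuration), the standard hardening looks like:

```hcl
# Illustrative Consul agent hardening (not Telnyx's production config).
# Bind the HTTP API to loopback so it is unreachable from outside the host.
addresses {
  http = "127.0.0.1"
}

# Disallow remotely registered script checks; permit only locally
# registered ones if script checks are needed at all.
enable_script_checks       = false
enable_local_script_checks = true

# Enable ACLs with a default-deny policy so anonymous requests cannot
# register services or health checks.
acl {
  enabled        = true
  default_policy = "deny"
}
```

Together with firewall rules restricting the Consul ports to trusted hosts, this removes the path by which arbitrary commands could be registered and executed on agents.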
Regardless of the cause of the network unavailability, we should not have experienced a prolonged period of intermittent outages across such a broad range of services.
The prolonged period of the outage was a result of suboptimal monitoring of our network, which prevented us from quickly pinpointing that our links were periodically being saturated. We are actively working to improve our network monitoring so that we can quickly identify the root cause of network impairments and proactively address them.
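To illustrate the kind of check involved (the thresholds and sampling here are hypothetical, not our monitoring stack), link saturation can be detected by comparing two interface octet-counter samples against link capacity:

```python
# Hypothetical link-saturation check; counters and thresholds are illustrative.

def link_utilization(prev_octets: int, curr_octets: int,
                     interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity used between two octet-counter samples."""
    bits = (curr_octets - prev_octets) * 8
    return bits / (interval_s * link_bps)

def is_saturated(util: float, threshold: float = 0.9) -> bool:
    """Flag the link when utilization crosses the alerting threshold."""
    return util >= threshold

# Example: a 10 Gb/s link sampled over a 30-second window.
util = link_utilization(prev_octets=0, curr_octets=37_500_000_000,
                        interval_s=30, link_bps=10e9)
# 37.5e9 octets * 8 bits = 300e9 bits over 300e9 bits of capacity → 1.0
```

Short sampling intervals matter here: the traffic spikes in this incident were brief, so averaging over long windows would have hidden the saturation.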
The broad scope of the outage was a result of how we were using Consul for service discovery, as well as how Consul and other critical services were deployed within our infrastructure.
In our current Consul implementation, DNS lookups depend on a highly available Consul cluster; however, that cluster exists within a single cloud provider region and is thus a single point of failure. Moving forward, we will configure services to resolve domain names in a more distributed fashion, with highly available, anycast DNS to fail over quickly if localized options are unavailable.
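One way to make resolution less dependent on a single cluster is to forward service-discovery lookups to Consul DNS endpoints in multiple regions and send all other names to anycast resolvers. A dnsmasq sketch, with hypothetical addresses:

```
# Illustrative dnsmasq forwarding rules (all addresses are hypothetical).
# Send *.consul queries to Consul DNS endpoints in two regions;
# dnsmasq tries the next server if one is unreachable.
server=/consul/10.0.1.10#8600
server=/consul/10.1.1.10#8600
# All other names go to anycast resolvers.
server=10.100.0.53
server=10.200.0.53
```

With this shape, the loss of one Consul region degrades only lookups for that region's services rather than taking down name resolution as a whole.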
In addition to removing our Consul cluster's single point of failure, we will perform a system-wide audit of other critical services and their availability. We will also test, observe, and improve our services' resiliency to similar outages through controlled experiments.
Enhance network monitoring to better detect temporary traffic spikes
Expand penetration testing of our network
Accelerate chaos engineering tests, observe impact, and address weaknesses
Audit our existing infrastructure for additional single points of failure
Resolve domain names locally
Federate our Consul cluster across multiple cloud providers and regions