Telnyx Service Disruption
Incident Report for Telnyx
Postmortem

Summary

Telnyx experienced intermittent network issues during the following periods:

  • January 8th from 14:56 UTC to 17:00 UTC

  • January 9th from 00:13 UTC to 02:00 UTC

  • January 9th from 09:54 UTC to 16:48 UTC

Impact

This disruption resulted in failures across the following services:

  • Voice: Outbound calls and inbound calls using credential authentication intermittently failed.

  • Messaging: Outbound message delivery was intermittently blocked.

  • Call Control: Increased rate of failed API commands to api.telnyx.com and decreased rate of successfully delivered webhooks.

  • Website and Portal Access: www.telnyx.com, portal.telnyx.com, and other telnyx.com pages were intermittently unavailable during these periods.

Timeline

January 8th, 2019

  • 14:56 UTC - Impacted systems listed above begin to experience intermittent disruption. The cause is unknown at this point.

  • 16:22 UTC - Network team identifies brief traffic spikes that overwhelmed certain links between Telnyx core routers in Chicago and Washington DC and cloud providers in those regions. The spikes resulted in dropped packets, including the BGP keepalive packets that maintain peering sessions; the loss of these keepalives caused peering sessions to reset, producing brief periods of network unavailability.

  • 17:00 UTC - Intermittent network unavailability ceases.

January 9th, 2019

  • 00:13 UTC - Intermittent network unavailability resumes.

  • 02:00 UTC - Intermittent network unavailability ceases.

  • 09:54 UTC - Intermittent network unavailability resumes.

  • 11:55 UTC - Telnyx determines that the increased network traffic and CPU utilization in Chicago and Washington DC were caused by an exploit used to mine cryptocurrency.

  • 16:48 UTC - Telnyx completes software patches, deploys a script to halt the exploit, and updates network configuration to mitigate the unavailability.

Cause

Excessive network traffic and CPU utilization, caused by the exploit used to mine cryptocurrency, resulted in intermittent unavailability of specific links between Telnyx core routers in Chicago and Washington DC and their associated cloud provider regions.

There are two high-level categories of failure that compounded to cause this outage.

Causes related to security

Our system was exploited due to a misconfigured firewall. The responsible party took advantage of a known vulnerability in HashiCorp’s Consul product, which serves as our service discovery tool. Moving forward, we will be increasing the regularity and intensity of our network penetration testing.

The exploited vulnerability enabled the third party to use Telnyx compute resources to mine cryptocurrency. After reverse engineering the exploit code, we have no reason to believe that any customer or Telnyx information was compromised.
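For illustration, the sketch below shows the kind of external check that would catch this class of exposure: probing Consul’s default ports from a vantage point that should be blocked by the firewall. The hostname is a placeholder and the ports are generic Consul defaults, not a description of our actual configuration.

```python
# A hypothetical external exposure probe: run from an untrusted vantage point and
# confirm that Consul's default ports are NOT reachable. The hostname is a
# placeholder, not a real host.
import socket

CONSUL_DEFAULT_PORTS = {
    8300: "server RPC",
    8301: "Serf LAN gossip",
    8500: "HTTP API",
    8600: "DNS interface",
}

def exposed_ports(host, timeout=2.0):
    """Return the Consul default ports that accept a TCP connection from this vantage point."""
    reachable = []
    for port in CONSUL_DEFAULT_PORTS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass  # closed or filtered: the desired result from outside the perimeter
    return reachable

if __name__ == "__main__":
    for port in exposed_ports("consul.example.internal"):
        print("WARNING: port %d (%s) reachable from an untrusted network"
              % (port, CONSUL_DEFAULT_PORTS[port]))
```

A probe like this only confirms that the perimeter blocks the ports; hardening the Consul agents themselves (for example, enabling Consul’s ACL system and leaving script checks disabled) is a separate control.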

Causes related to availability and resiliency

Regardless of the cause of the network unavailability, we should not have experienced a prolonged period of intermittent outages across such a broad range of services.

The prolonged duration of the outage was a result of suboptimal network monitoring, which prevented us from quickly pinpointing that the affected links were periodically being saturated. We are actively working to improve our network monitoring so that we can quickly identify the root cause of network impairments and proactively address them.
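For illustration, the sketch below shows one simple way to flag brief link saturation from interface byte counters (e.g. SNMP ifHCInOctets) sampled at a fixed interval. The link capacity, threshold, sampling interval, and counter values are assumptions for the example, not our production settings.

```python
# A sketch of spike detection over interface byte counters sampled at a fixed
# interval. Capacity, threshold, interval, and the sample data are illustrative.

LINK_CAPACITY_BPS = 10_000_000_000   # assume a 10 Gb/s link
ALERT_THRESHOLD = 0.90               # flag intervals above 90% utilization
SAMPLE_INTERVAL_S = 10               # seconds between counter samples

def utilization(prev_octets, curr_octets, interval_s):
    """Turn two successive octet-counter samples into link utilization (0..1)."""
    bits_sent = (curr_octets - prev_octets) * 8
    return bits_sent / (LINK_CAPACITY_BPS * interval_s)

def saturated_intervals(samples):
    """Return the indices of sampling intervals whose utilization crossed the threshold."""
    return [
        i for i in range(1, len(samples))
        if utilization(samples[i - 1], samples[i], SAMPLE_INTERVAL_S) >= ALERT_THRESHOLD
    ]

if __name__ == "__main__":
    # Synthetic counters: a brief spike (interval 3) in otherwise normal traffic.
    counters = [0, 2_000_000_000, 4_000_000_000, 16_000_000_000, 18_000_000_000]
    print("Saturated intervals:", saturated_intervals(counters))   # -> [3]
```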

The broad scope of the outage was a result of how we were using Consul for service discovery, as well as how Consul and other critical services were deployed within our infrastructure.

In the current Consul implementation, DNS lookups depend on a highly available Consul cluster; however, that cluster exists within a single cloud provider region and is therefore a single point of failure. Moving forward, we will configure services to resolve domain names in a more distributed fashion, with highly available anycast DNS to fail over quickly if localized options are unavailable.
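As an illustration of that fallback pattern, the sketch below tries the local Consul agent’s DNS interface (default port 8600) first and then falls back to a second tier of hypothetical anycast resolvers; the addresses and service name are placeholders, and it uses the dnspython package.

```python
# A sketch of distributed name resolution with failover: try the local Consul
# agent's DNS interface first, then fall back to anycast resolvers.
# Addresses and the service name are illustrative placeholders.
import dns.exception
import dns.resolver

# Ordered preference: (nameserver list, port). Consul serves DNS on 8600 by default.
RESOLVER_TIERS = [
    (["127.0.0.1"], 8600),               # local Consul agent
    (["10.0.0.53", "10.1.0.53"], 53),    # hypothetical anycast resolvers
]

def resolve_with_fallback(name):
    """Resolve `name` to A records, trying each resolver tier in order."""
    for nameservers, port in RESOLVER_TIERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        resolver.port = port
        try:
            answer = resolver.resolve(name, "A", lifetime=1.0)
            return [rr.to_text() for rr in answer]
        except dns.exception.DNSException:
            continue  # this tier is unavailable; try the next, less-local one
    raise RuntimeError("all resolver tiers failed for " + name)

if __name__ == "__main__":
    print(resolve_with_fallback("my-service.service.consul"))
```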

In addition to removing our Consul cluster’s single point of failure, we will perform a system-wide audit of other critical services and their availability. We will also test, observe, and improve our services’ resiliency to similar outages through controlled experiments.

Short Term Action Items

  • Enhance network monitoring to better detect temporary traffic spikes

  • Expand penetration testing of our network

  • Accelerate chaos engineering tests, observe impact, and address weaknesses

  • Audit our existing infrastructure for additional single points of failure

Long Term Action Items

  • Resolve domain names locally

  • Federate our Consul cluster across multiple cloud providers and regions

Posted Jan 18, 2019 - 21:35 UTC

Resolved
All Telnyx systems have been stable for over an hour now; we are continuing to closely monitor. Please note, some API updates are still processing, and we expect those all to clear shortly.

Below is a quick summary of the services that were intermittently impacted this morning. Exact timelines and additional details will be forthcoming:

*Messaging*
---Intermittent periods of a few minutes where we wouldn't accept messages; we sent failure responses, prompting a retry. The Messaging services caught up a few minutes thereafter.

*Voice*
---User/pass registration for Credential Connections (did not impact IP-based or FQDN connections)

*Call Control*
---Decreased rate of successfully delivered webhooks

*Website Access*
---Periods where customers couldn't access www.telnyx.com / other telnyx.com pages
Posted Jan 08, 2019 - 18:17 UTC
Update
We are continuing to monitor for additional issues.
Posted Jan 08, 2019 - 17:06 UTC
Update
We are experiencing network issues affecting multiple systems. We're investigating and will provide more details shortly.
Posted Jan 08, 2019 - 16:31 UTC
Update
We are experiencing network issues affecting multiple systems. We're investigating and will provide more details shortly.
Posted Jan 08, 2019 - 16:07 UTC
Update
We are continuing to monitor for any further issues.
Posted Jan 08, 2019 - 13:13 UTC
Monitoring
At 12:08 PM UTC we identified connectivity issues affecting calls and access to the Mission Control portal. This was identified as a network issue, and our networking team began to investigate.

As of 12:35 PM UTC, the identified network issue was alleviated and the outage resolved.

Our networking team is working to identify the root cause and implement measures to help prevent future incidents of this nature.

Our NOC team continues to monitor the situation at this time, and a detailed postmortem will be provided following a full investigation.
Posted Jan 08, 2019 - 12:55 UTC
Identified
We have identified possible issues with our network and are currently investigating. As of right now, this does not appear to be a widespread issue. We will provide updates shortly.
Posted Jan 08, 2019 - 12:46 UTC
This incident affected: Programmable Voice - Voice API (US), Mission Control API (East Region, Central Region, West Region), and Messaging.