Telnyx experienced intermittent network issues during the following windows:
January 8th from 14:56 UTC to 17:00 UTC
January 9th from 00:13 UTC to 02:00 UTC
January 9th from 09:54 UTC to 16:48 UTC
This disruption resulted in failures across the following services:
Voice: Outbound calls and inbound calls using credential authentication intermittently failed.
Messaging: Outbound message delivery was intermittently blocked.
Call Control: Increased rate of failed API commands to api.telnyx.com and decreased rate of successfully delivered webhooks.
Website and Portal Access: www.telnyx.com, portal.telnyx.com, and other telnyx.com pages were intermittently unavailable during these windows.
January 8th, 2019
14:56 UTC - Impacted systems listed above begin to experience intermittent disruption. The cause is unknown at this point.
16:22 UTC - Network team identifies brief periods of traffic spikes that overwhelmed certain links between Telnyx core routers in Chicago and Washington DC and cloud providers in these regions. The spikes resulted in dropped data packets, including BGP peering status packets. The loss of BGP peering status packets resulted in brief periods of network unavailability.
17:00 UTC - Intermittent network unavailability ceases
January 9th, 2019
00:13 UTC - Intermittent network unavailability resumes
02:00 UTC - Intermittent network unavailability ceases
09:54 UTC - Intermittent network unavailability resumes
11:55 UTC - Telnyx determines that the increased network traffic and CPU utilization in Chicago and Washington DC were caused by an exploit used to mine cryptocurrency.
16:48 UTC - Telnyx completes software patches, deploys a script to halt the exploit, and updates network configuration to mitigate the unavailability.
Excessive network traffic and CPU utilization caused by the cryptocurrency-mining exploit resulted in intermittent network unavailability on specific links between Telnyx core routers in Chicago and Washington DC and their associated cloud provider regions.
There are two high-level categories of failure that compounded to cause this outage.
Our system was exploited due to a misconfigured firewall: the responsible party took advantage of a known vulnerability in HashiCorp's Consul product, which serves as our service discovery tool. Moving forward, we will increase the regularity and intensity of our network penetration testing.
The exploited vulnerability enabled the third party to use Telnyx compute resources to mine cryptocurrency. After reverse engineering the code, we found no evidence that any customer or Telnyx information was compromised.
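The class of Consul vulnerability described above is typically abused through an exposed HTTP API combined with script-based health checks. As a hedged sketch (the settings below are illustrative, not Telnyx's actual configuration), the standard hardening looks like:

```hcl
# Illustrative Consul agent hardening (not Telnyx's production config).
# Bind the HTTP API to loopback so it is unreachable from outside the host.
addresses {
  http = "127.0.0.1"
}

# Disallow remotely registered script checks; permit only locally
# registered ones if script checks are needed at all.
enable_script_checks       = false
enable_local_script_checks = true

# Enable ACLs with a default-deny policy so anonymous requests cannot
# register services or health checks.
acl {
  enabled        = true
  default_policy = "deny"
}
```

Together with firewall rules restricting the Consul ports to trusted hosts, this removes the path by which arbitrary commands could be registered and executed on agents.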
Regardless of the cause of the network unavailability, we should not have experienced a prolonged period of intermittent outages across such a broad range of services.
The prolonged period of the outage was a result of suboptimal monitoring of our network, which prevented us from quickly pinpointing that our links were periodically being saturated. We are actively working to improve our network monitoring so that we can quickly identify the root cause of network impairments and proactively address them.
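To illustrate the kind of check involved (the thresholds and sampling here are hypothetical, not our monitoring stack), link saturation can be detected by comparing two interface octet-counter samples against link capacity:

```python
# Hypothetical link-saturation check; counters and thresholds are illustrative.

def link_utilization(prev_octets: int, curr_octets: int,
                     interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity used between two octet-counter samples."""
    bits = (curr_octets - prev_octets) * 8
    return bits / (interval_s * link_bps)

def is_saturated(util: float, threshold: float = 0.9) -> bool:
    """Flag the link when utilization crosses the alerting threshold."""
    return util >= threshold

# Example: a 10 Gb/s link sampled over a 30-second window.
util = link_utilization(prev_octets=0, curr_octets=37_500_000_000,
                        interval_s=30, link_bps=10e9)
# 37.5e9 octets * 8 bits = 300e9 bits over 300e9 bits of capacity → 1.0
```

Short sampling intervals matter here: the traffic spikes in this incident were brief, so averaging over long windows would have hidden the saturation.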
The broad scope of the outage was a result of how we were using Consul for service discovery, as well as how Consul and other critical services were deployed within our infrastructure.
In our current Consul implementation, DNS lookups depend on a highly available Consul cluster; however, that cluster exists within a single cloud provider region and is thus a single point of failure. Moving forward, we will configure services to resolve domain names in a more distributed fashion, with highly available, anycast DNS to fail over quickly if localized options are unavailable.
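One way to make resolution less dependent on a single cluster is to forward service-discovery lookups to Consul DNS endpoints in multiple regions and send all other names to anycast resolvers. A dnsmasq sketch, with hypothetical addresses:

```
# Illustrative dnsmasq forwarding rules (all addresses are hypothetical).
# Send *.consul queries to Consul DNS endpoints in two regions;
# dnsmasq tries the next server if one is unreachable.
server=/consul/10.0.1.10#8600
server=/consul/10.1.1.10#8600
# All other names go to anycast resolvers.
server=10.100.0.53
server=10.200.0.53
```

With this shape, the loss of one Consul region degrades only lookups for that region's services rather than taking down name resolution as a whole.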
In addition to removing our Consul cluster's single point of failure, we will perform a system-wide audit of other critical services and their availability. We will also test, observe, and improve our services' resiliency to similar outages through controlled experiments.
Enhance network monitoring to better detect temporary traffic spikes
Expand penetration testing of our network
Accelerate chaos engineering tests, observe impact, and address weaknesses
Audit our existing infrastructure for additional single points of failure
Resolve domain names locally
Federate our Consul cluster across multiple cloud providers and regions