Degraded Audio
Incident Report for Telnyx
Postmortem

Summary

From 2019-07-16 13:25:29 UTC to 2019-07-16 17:00:40 UTC, several customers experienced degraded audio quality on both inbound and outbound voice calls.

Immediate Cause

The degraded audio quality was caused by increased packet loss, which began when our Core Network team brought a new provider online to add a direct link to our Chicago PoP.

Our customers connect to our core network via their own ISPs. The route they take is based on how we advertise our public prefixes via our transit link providers. When we brought this new provider online, our public prefixes were advertised to the provider, which in turn re-advertised them to its other direct-connect partners.

As a result, any customer whose traffic passed through those direct-connect partners to reach the Chicago PoP used the new direct link instead of one of our existing links.
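The re-advertisement described above can be illustrated with a toy model. This is a minimal sketch, not a representation of our actual topology or router configuration: the network names, the `peers` map, and the `propagate` function are all hypothetical, and real BGP route selection involves far more than reachability. The only real-world element is the well-known NO_EXPORT community (RFC 1997), the mechanism named in the action items, which tells a receiving network not to pass a route on to its external peers.

```python
from collections import deque

NO_EXPORT = "no-export"  # well-known BGP community defined in RFC 1997


def propagate(origin, peers, communities=frozenset()):
    """Return the set of networks that learn a prefix originated by `origin`.

    `peers` maps each network name to the networks it advertises routes to.
    This is a simplification: a network that receives the prefix tagged with
    NO_EXPORT keeps the route but does not re-advertise it any further.
    """
    learned = set()
    queue = deque(peers.get(origin, ()))
    while queue:
        network = queue.popleft()
        if network == origin or network in learned:
            continue
        learned.add(network)
        if NO_EXPORT in communities:
            continue  # route stays inside the receiving network
        queue.extend(peers.get(network, ()))
    return learned


# Hypothetical topology: one new provider with two direct-connect partners.
peers = {
    "origin_network": ["new_provider"],
    "new_provider": ["partner_a", "partner_b"],
    "partner_a": [],
    "partner_b": [],
}

without_tag = propagate("origin_network", peers)
with_tag = propagate("origin_network", peers, {NO_EXPORT})

print(sorted(without_tag))  # ['new_provider', 'partner_a', 'partner_b']
print(sorted(with_tag))     # ['new_provider']
```

In the untagged case the partners learn the prefix and can send traffic over the new link. In the tagged case only the provider itself learns it.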

This in itself would not have been an issue except that the link was unexpectedly rate-limited far below our standard thresholds. Once the bandwidth was saturated, packets were dropped by the new provider.
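To make the saturation effect concrete, the sketch below estimates the share of traffic dropped once the offered load exceeds a link's rate limit. It assumes a simple model in which everything above the limit is discarded. The function name and the figures in the example are illustrative only; the actual rate limit and traffic volumes are not stated in this report.

```python
def expected_loss_fraction(offered_mbps: float, rate_limit_mbps: float) -> float:
    """Estimate the fraction of traffic dropped on a rate-limited link.

    Simplifying assumption: traffic up to the rate limit is forwarded and
    everything above it is dropped.
    """
    if rate_limit_mbps < 0 or offered_mbps < 0:
        raise ValueError("rates must be non-negative")
    if offered_mbps == 0:
        return 0.0
    excess = max(0.0, offered_mbps - rate_limit_mbps)
    return excess / offered_mbps


# Illustrative numbers only, not measurements from the incident.
print(expected_loss_fraction(80, 100))   # 0.0 (below the limit, nothing dropped)
print(round(expected_loss_fraction(150, 100), 3))  # 0.333
```

Real-time voice tolerates very little packet loss, so a link running even modestly above its rate limit can produce audible degradation.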

Underlying Causes

  • The new provider link was ordered almost a year before the incident but was not activated until the day of the incident. This unnecessarily long lead time helped obscure the fact that the link was so heavily rate-limited.
  • We hadn't anticipated that our public prefixes would be re-advertised and that customers would then be able to reach us via that route.
  • While our Core Network team has extensive monitoring and alerting in place for our backbone providers, proper monitoring and alerting on this particular external link was not available during this incident.
  • Not all members of the Core Network team were aware of the new link, which delayed the response effort significantly.

Timeline (UTC)

2018-09
The new provider link was ordered
2019-07-16 13:25:29
The new provider link was brought online
2019-07-16 13:26:00
Inbound and outbound calls begin to experience degraded audio quality
2019-07-16 14:46:00
The Telnyx Telephony, Core Network, and NOC teams begin investigating per our internal incident response process
2019-07-16 14:56:00
While our Telephony team was able to detect the degraded audio in their monitoring, the Network team’s monitoring indicated a healthy internal backbone. This suggested that the source of the packet loss was outside our network.
2019-07-16 15:48:00
Core Network team starts a call with affected customers to expedite troubleshooting.
2019-07-16 17:00:34
Core Network team shuts down peering with the new provider to resolve the issue
2019-07-16 17:00:40
The issue is resolved
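The intervals between the key timeline entries can be computed directly from the timestamps above. The sketch below uses only those timestamps; the dictionary keys and the `elapsed` helper are labels introduced here for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline (UTC); the keys are illustrative labels.
timeline = {
    "link_online": "2019-07-16 13:25:29",
    "degradation_begins": "2019-07-16 13:26:00",
    "investigation_begins": "2019-07-16 14:46:00",
    "peering_shut_down": "2019-07-16 17:00:34",
    "resolved": "2019-07-16 17:00:40",
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two named timeline entries."""
    return datetime.strptime(timeline[end], FMT) - datetime.strptime(timeline[start], FMT)


time_to_investigate = elapsed("degradation_begins", "investigation_begins")
investigate_to_fix = elapsed("investigation_begins", "peering_shut_down")
total_impact = elapsed("link_online", "resolved")

print(time_to_investigate)  # 1:20:00
print(investigate_to_fix)   # 2:14:34
print(total_impact)         # 3:35:11
```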

Action Items

  • Ensure a Network Engineer works from the outset with Vendor Management on the design detail of every new link.
  • Advertise prefixes to non-backbone providers with the “no-export” BGP community so that those providers do not re-advertise our prefixes.
  • Implement thorough monitoring & alerting on all external links, not just those of the backbone providers.
  • Keep a global change-log of our network so all members of the Core Network team are aware of recent changes.
Posted Jul 19, 2019 - 15:31 CDT

Resolved
This incident has been resolved.
Posted Jul 16, 2019 - 13:01 CDT
Monitoring
Audio quality has returned to its normal state. We have taken the offending vendor out of route and are now monitoring traffic.
Posted Jul 16, 2019 - 12:14 CDT
Identified
We believe we have identified a vendor that is causing packet loss. We are in the process of confirming this, while also diverting all traffic away from this vendor.
Posted Jul 16, 2019 - 11:11 CDT
Update
We are continuing to investigate this issue.
Posted Jul 16, 2019 - 10:27 CDT
Investigating
We are currently investigating degraded audio quality issues. More updates to come shortly.
Posted Jul 16, 2019 - 09:47 CDT