On March 30th, we experienced a brief connectivity problem between our data center and a public cloud vendor. This caused a very small percentage of Answers requests to return 404s.
On April 1st, the connectivity problem resurfaced more persistently and affected YextCI builds, Code Editors, Live Previews, and a small percentage of Pages publishes for approximately one hour.
The root cause was a confluence of three factors. First, when one system hit connectivity issues, it recorded empty data instead of detecting the error and retrying the connection. Second, another system responded to connection issues by retrying aggressively, ultimately attempting thousands of separate connections. Third, the sudden increase in attempted connections exhausted the TCP port pool available to our NAT gateways, exacerbating the issue.
All three factors are being fixed independently. The error handling in both pieces of software has already been updated to retry on error, with an appropriate backoff timer between attempts. For TCP port exhaustion, we are adding telemetry that alerts us when we are approaching this condition.
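
As an illustration of the error-handling change described above, here is a minimal Python sketch of retrying with exponential backoff and jitter. It is not our production code: the names `fetch_with_backoff` and `EmptyResponseError`, the attempt limit, and the delay values are all assumptions chosen for the example. It treats an empty response as a failure rather than recording it, and it spaces out retries so that a struggling connection is not hit with a burst of new connection attempts.

```python
import random
import time


class EmptyResponseError(Exception):
    """Hypothetical error: the call returned, but with no data."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call fetch() until it returns data, backing off between attempts.

    All parameter names and defaults are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            result = fetch()
            if not result:
                # Do not record empty data; surface it as an error instead.
                raise EmptyResponseError("empty response treated as a failure")
            return result
        except (ConnectionError, TimeoutError, EmptyResponseError):
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller see the failure.
            # The delay ceiling doubles each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait a random fraction of the ceiling so that many
            # clients do not all reconnect at the same moment.
            sleep(ceiling * rng())
```

The cap on attempts bounds how many connections a single caller can open during an outage, and the growing delay keeps the total connection rate low while the underlying problem persists.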