Answers Errors
Incident Report for Yext
Postmortem

Summary

On March 30th, we experienced a brief connectivity problem between our data center and a public cloud vendor.  This caused a very small percentage of Answers requests to return 404s.

On April 1st, the connectivity problem resurfaced more persistently and disrupted YextCI builds, Code Editors, Live Previews, and a small percentage of Pages publishes for approximately 1 hour.

Root Cause

The root cause was a confluence of three factors.  First, when one system hit connectivity issues, it recorded empty data rather than detecting the error and retrying the connection.  Second, another system that hit connection issues retried aggressively, ultimately attempting thousands of separate connections.  Third, the sudden increase in attempted connections exhausted the TCP port pool available to our NAT gateways, exacerbating the issue.

Remediation

The three factors listed above are being fixed independently.  The error handling in both pieces of software has already been updated to retry upon error, with an appropriate backoff timer between attempts.  For TCP port exhaustion, we are creating telemetry to alert us when we are approaching this condition.
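
As an illustration only, the retry behavior described above can be sketched in Python. This is not the code from either affected system. The names `fetch_with_backoff`, `backoff_delays`, and `EmptyResponseError`, along with the default attempt count and delay values, are assumptions made for the example. The sketch treats an empty payload as an error instead of recording it, and waits a capped, randomized, exponentially growing delay between attempts so that a struggling dependency is not hit with a burst of new connections.

```python
import random
import time


class EmptyResponseError(Exception):
    """Raised when a call returns no data even though no transport error occurred."""


def backoff_delays(max_attempts, base_delay=0.5, max_delay=30.0, rng=random.random):
    """Yield one capped, jittered delay for each retry that may follow a failed attempt."""
    for attempt in range(max_attempts - 1):
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # "Full jitter": pick a delay uniformly in [0, ceiling) so that many
        # clients failing at the same moment do not retry in lockstep.
        yield ceiling * rng()


def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call fetch() until it returns non-empty data or the attempts run out.

    `fetch` is a hypothetical zero-argument callable standing in for the
    real network call; it is an assumption of this sketch.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    last_error = None
    for _ in range(max_attempts):
        try:
            data = fetch()
            if not data:
                # Surface the empty result as a failure rather than storing it.
                raise EmptyResponseError("empty payload")
            return data
        except (ConnectionError, TimeoutError, EmptyResponseError) as exc:
            last_error = exc
            delay = next(delays, None)
            if delay is None:
                break  # no retries left
            sleep(delay)
    raise last_error
```

Bounding the number of attempts and capping the delay keeps the worst-case number of connections per caller small and predictable, which is the property that matters when outbound connections share a limited NAT port pool.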

Posted Apr 08, 2021 - 08:50 EDT

Resolved
This incident has been resolved.
Posted Mar 30, 2021 - 09:59 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 30, 2021 - 07:55 EDT
Identified
We have identified the issue, and determined that it's limited to a small subset of Answers sites. We will deploy a fix shortly.
Posted Mar 30, 2021 - 07:29 EDT
Investigating
We are currently investigating a small rate of errors on Answers sites.
Posted Mar 30, 2021 - 07:20 EDT
This incident affected: Answers Serving.