Pages serving partial outage
Incident Report for Yext
Postmortem

Summary

On October 22nd, at 7:41 p.m. Eastern, sites hosted by Yext Pages became partially unavailable. During this time, some entity locator requests and individual entity pages returned an error. At 7:50 p.m. Eastern, Yext engineers removed the failing infrastructure and restored service to normal.

Root Cause

A DDoS (Distributed Denial of Service) attack against Route 53, the Amazon Web Services DNS service, caused widespread DNS resolution errors in one of our serving regions. Rerouting traffic away from the affected region restored service. More details regarding this attack are available at https://twitter.com/AWSSupport/status/1186735657387003904 and https://www.theregister.co.uk/2019/10/22/aws_dns_ddos/.

Remediation

We plan to improve our recovery mechanisms so that they detect region-specific outages at Amazon Web Services more quickly and automatically fail over to our other regions.
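
As a rough illustration of the kind of detection and failover logic described above, the following Python sketch tracks recent health-probe results per region and drops a region from rotation once its error rate crosses a threshold. It is not Yext's implementation: the class and function names, the window size, the threshold, and the region names are all hypothetical, and the probes themselves (for example, periodic DNS lookups made from each region) are assumed to happen elsewhere.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RegionHealth:
    """Sliding window of recent probe outcomes for one serving region."""
    window: int = 20              # hypothetical: number of recent probes kept
    max_error_rate: float = 0.5   # hypothetical: above this, region is unhealthy
    results: deque = field(default_factory=deque)

    def record(self, ok: bool) -> None:
        """Store one probe outcome, discarding the oldest beyond the window."""
        self.results.append(ok)
        if len(self.results) > self.window:
            self.results.popleft()

    def error_rate(self) -> float:
        """Fraction of failed probes in the window (0.0 if no data yet)."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def healthy(self) -> bool:
        return self.error_rate() <= self.max_error_rate


def active_regions(health: dict[str, RegionHealth]) -> list[str]:
    """Return the regions that should receive traffic.

    If every region looks unhealthy, keep all of them in rotation
    rather than routing traffic nowhere.
    """
    ok = [name for name, h in health.items() if h.healthy()]
    return sorted(ok) if ok else sorted(health)


if __name__ == "__main__":
    # Hypothetical region names; region-a simulates failing DNS probes.
    health = {"region-a": RegionHealth(), "region-b": RegionHealth()}
    for _ in range(10):
        health["region-a"].record(False)
        health["region-b"].record(True)
    print(active_regions(health))
```

The fallback in `active_regions` is a deliberate choice in this sketch: when every region appears to be failing, the probes themselves may be at fault, so the routing set is left unchanged instead of being emptied.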

Posted Nov 01, 2019 - 15:55 EDT

Resolved
Amazon Web Services has confirmed that they have resolved their intermittent DNS resolution errors. As Yext services have continued to operate normally, this issue is now resolved.
Posted Oct 22, 2019 - 23:19 EDT
Update
We have returned to normal operation, but are continuing to monitor the situation. We can confirm that the initial outage was related to the intermittent DNS resolution errors currently being experienced by Amazon Web Services, as reported on their service health dashboard (https://status.aws.amazon.com/).
Posted Oct 22, 2019 - 21:46 EDT
Monitoring
We are monitoring reports of a partial outage affecting sites hosted by Yext Pages. Our telemetry indicates that most store locator requests and some individual store page requests resulted in errors from 7:41 p.m. until 7:50 p.m. US Eastern Time. At 7:50 p.m., we took the problematic infrastructure out of service. Error rates have since returned to normal, though pages are currently loading slightly more slowly than usual.

We are continuing to monitor the situation and will provide an update when we are able to return to normal operation.
Posted Oct 22, 2019 - 20:02 EDT
This incident affected: Pages Serving.