Pages Serving Elevated Error Rates
Incident Report for Yext
Postmortem

Summary

On October 11, 2019, at 3:18 PM ET, Yext engineers were alerted to elevated error rates affecting less than 0.05% of Pages Serving traffic. Engineers monitored and investigated the issue, and restored normal service at 5:20 PM ET.

Root Cause

A partial hardware failure on one machine in the serving cluster caused errors in less than 0.05% of Pages Serving traffic. Once Yext engineers identified the offending machine and removed it from the cluster, error rates returned to normal.

Remediation

We have existing mechanisms to automatically detect and respond to hardware failures, but they did not engage during this incident because the failure was only partial. We plan to improve these mechanisms so that they detect this type of partial failure and automatically remove failing machines from the serving pool.
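
For illustration only, one common way to catch a partially failing machine is to compare each machine's recent error rate against that of its peers, rather than relying on an up/down health check that a partially working machine can still pass. The sketch below is not Yext's implementation. The names `HostStats`, `find_outliers`, and `eject`, and every threshold, are hypothetical and chosen only to show the idea.

```python
import statistics
from collections import deque


class HostStats:
    """Sliding window of recent request outcomes for one machine (hypothetical)."""

    def __init__(self, window=1000):
        # Keep only the most recent `window` outcomes; True means success.
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        errors = sum(1 for ok in self.outcomes if not ok)
        return errors / len(self.outcomes)


def find_outliers(stats, min_requests=100, ratio=5.0, floor=0.01):
    """Return hosts whose error rate is far above the median of their peers.

    All thresholds are assumed values for this sketch:
    - min_requests: ignore hosts with too little traffic to judge
    - ratio: how many times the median rate counts as an outlier
    - floor: minimum error rate before any host is flagged
    """
    rates = {
        host: s.error_rate()
        for host, s in stats.items()
        if len(s.outcomes) >= min_requests
    }
    if len(rates) < 3:
        return []  # too few peers for a meaningful comparison
    median = statistics.median(rates.values())
    threshold = max(floor, ratio * median)
    return sorted(host for host, rate in rates.items() if rate > threshold)


def eject(pool, outliers, max_fraction=0.25):
    """Remove outliers from the pool, capped at a fraction of the pool.

    The cap keeps a fleet-wide problem from emptying the serving pool.
    """
    limit = int(len(pool) * max_fraction)
    removed = outliers[:limit]
    remaining = [host for host in pool if host not in removed]
    return remaining, removed
```

In this sketch, a machine that fails one request in ten while its peers serve cleanly is flagged even though it still answers most requests. The cap in `eject` is a design choice: if many machines look unhealthy at once, the cause is more likely shared than hardware-specific, and removing them all would make the outage worse.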

Posted Oct 25, 2019 - 12:43 EDT

Resolved
Error rates have remained normal during our monitoring period. This incident is now resolved.
Posted Oct 11, 2019 - 20:55 EDT
Monitoring
We have deployed a fix and error rates have returned to normal. We will continue to monitor for any issues.
Posted Oct 11, 2019 - 17:45 EDT
Investigating
We are currently investigating increased error rates in 0.05% of our Pages Serving requests. We will update as soon as we have more information.
Posted Oct 11, 2019 - 17:16 EDT
This incident affected: Pages Serving.