On October 11th, beginning at 3:18 PM ET, Yext engineers were alerted to elevated error rates for less than 0.05% of Pages Serving traffic. Engineers began to monitor and investigate, and restored normal service at 5:20 PM ET.
A partial hardware failure caused errors in less than 0.05% of Pages traffic. Once Yext engineers identified the offending machine and removed it from the cluster, error rates returned to normal.
Although we have existing mechanisms to automatically detect and respond to hardware failures, these mechanisms failed to engage during this incident because of the partial nature of this failure. We plan to improve our mechanisms to better detect this type of partial failure, and automatically remove failing machines out of the serving pool.