Customer Portal Degradation
Incident Report for Yext
Postmortem

Summary

On Monday, August 13th, 2018, at 11:59 AM EDT, the Customer Portal began returning errors for a large portion of search requests. Our engineers were alerted immediately and began an investigation. After noticing that a large portion of traffic was going to a single node in the search cluster, we shut down that node, and the cluster recovered shortly thereafter. Service returned to normal at 12:15 PM EDT.

Root Cause

We discovered a bug in the node configuration that greatly reduced the available space in the cluster. As a result, during a routine operation to increase the serving capacity of our search cluster, the cluster began routing a large portion of traffic to a single node. That node was unable to keep up with the load and began returning errors. Removing the node from the cluster entirely allowed the cluster to rebalance traffic properly.

Going forward, we plan to add capacity to our clusters and make our alerts more sensitive to space issues so that they fire before customer traffic is impacted.
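
As an illustration only, the kind of check described above could look like the following minimal sketch. It is not our production alerting: the `NodeStats` fields, the `find_alerts` function, and both thresholds are hypothetical names and values chosen for this example. It flags any node that receives more than a set share of cluster traffic, and any node whose free space falls below a set fraction.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    # All fields are hypothetical; real metrics would come from monitoring.
    name: str
    requests_per_sec: float
    free_space_fraction: float  # 0.0 (full) to 1.0 (empty)


def find_alerts(nodes, max_traffic_share=0.5, min_free_space=0.2):
    """Return warnings for skewed traffic or low free space.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    alerts = []
    total = sum(n.requests_per_sec for n in nodes)
    for n in nodes:
        # Share of cluster traffic handled by this node.
        share = n.requests_per_sec / total if total > 0 else 0.0
        if len(nodes) > 1 and share > max_traffic_share:
            alerts.append(f"{n.name}: receiving {share:.0%} of traffic")
        if n.free_space_fraction < min_free_space:
            alerts.append(f"{n.name}: only {n.free_space_fraction:.0%} space free")
    return alerts


if __name__ == "__main__":
    sample = [
        NodeStats("node-a", 900.0, 0.5),
        NodeStats("node-b", 50.0, 0.1),
        NodeStats("node-c", 50.0, 0.6),
    ]
    for alert in find_alerts(sample):
        print(alert)
```

Checking traffic share and free space together reflects the sequence in this incident, where reduced space preceded the traffic imbalance; a space warning alone would ideally fire before any routing change.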

Posted Aug 24, 2018 - 12:57 EDT

Resolved
This incident has been resolved.
Posted Aug 13, 2018 - 15:21 EDT
Monitoring
We have implemented a fix and will continue to monitor the Customer Portal for any issues.
Posted Aug 13, 2018 - 13:01 EDT
Identified
We are investigating reports of search degradation in the Customer Portal. Some search functionality may not be available or requests may be slow at this time.
Posted Aug 13, 2018 - 11:59 EDT
This incident affected: Customer Portal.