On Monday, August 13th, 2018, at 11:59 AM, the Customer Portal began returning errors for a large portion of search requests. Our engineers were alerted immediately and began investigating. After noticing that a disproportionate share of traffic was going to a single node in the search cluster, we shut down that node, and the cluster recovered shortly afterward. Service returned to normal at 12:15 PM.
We discovered a bug in the node configuration that greatly reduced the available space in the cluster. As a result, during a routine operation to increase the serving capacity of our search cluster, the cluster began routing a large portion of traffic to a single node. That node was unable to keep up with the load and began returning errors. Removing the node from the cluster entirely allowed the cluster to rebalance traffic properly.
Going forward, we plan to add capacity to our clusters and tune our alerts so that they detect space issues before those issues affect customer traffic.