Summary

On Monday, August 13, 2018, at 11:59 AM, the Customer Portal began returning errors for a large portion of search requests. Our engineers were alerted immediately and began investigating. After noticing that a large portion of traffic was going to a single node in the search cluster, we shut down that node, and the cluster recovered shortly afterward. Service returned to normal at 12:15 PM.

Root Cause

We discovered a bug in the node configuration that greatly reduced the available space in the cluster. As a result, during a routine operation to increase the serving capacity of our search cluster, the cluster began routing a large portion of traffic to a single node. That node was unable to keep up with the load and began returning errors. Removing the node from the cluster entirely allowed the cluster to rebalance traffic properly.

Going forward, we plan to add capacity to our clusters and make our alerts more sensitive to space issues, so that we catch them before they affect customer traffic.
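
As a rough illustration of the kind of checks described above, the sketch below flags nodes that are low on free space and nodes that receive an outsized share of traffic. It is not our production alerting: the `NodeStats` fields, the function names, and every threshold value are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    # Hypothetical per-node metrics; field names are illustrative only.
    name: str
    disk_total_bytes: int
    disk_free_bytes: int
    requests_per_min: int


def free_space_alerts(nodes, warn_free_ratio=0.25, critical_free_ratio=0.10):
    """Return (node name, severity) pairs for nodes that are low on free space.

    The default thresholds are assumptions, not values from the incident.
    """
    alerts = []
    for node in nodes:
        free_ratio = node.disk_free_bytes / node.disk_total_bytes
        if free_ratio < critical_free_ratio:
            alerts.append((node.name, "critical"))
        elif free_ratio < warn_free_ratio:
            alerts.append((node.name, "warning"))
    return alerts


def traffic_hotspots(nodes, max_share=0.5):
    """Return names of nodes serving more than max_share of all requests."""
    total = sum(node.requests_per_min for node in nodes)
    if total == 0:
        return []
    return [
        node.name
        for node in nodes
        if node.requests_per_min / total > max_share
    ]


if __name__ == "__main__":
    # Made-up sample data, not measurements from the incident.
    cluster = [
        NodeStats("search-1", 100, 50, 100),
        NodeStats("search-2", 100, 20, 100),
        NodeStats("search-3", 100, 5, 800),
    ]
    print(free_space_alerts(cluster))
    print(traffic_hotspots(cluster))
```

Raising the warning threshold makes the space check fire earlier, which is the sense in which an alert can be made more sensitive; the traffic-share check is a separate signal for the single-node overload seen in this incident.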

Posted Aug 24, 2018 - 12:57 EDT

Resolved

This incident has been resolved.

Posted Aug 13, 2018 - 15:21 EDT

Monitoring

We have implemented a fix and will continue to monitor the Customer Portal for any issues.

Posted Aug 13, 2018 - 13:01 EDT

Identified

We are investigating reports of search degradation in the Customer Portal. Some search functionality may not be available or requests may be slow at this time.

Posted Aug 13, 2018 - 11:59 EDT

This incident affected: Customer Portal Login.