On June 4th, 2019, at 16:41 EDT, engineering was alerted to elevated error rates and latencies in the Customer Portal and Platform API. Mitigation efforts began immediately, and the root cause was identified and resolved by 17:15 EDT, at which point service was restored.
A runaway process had overwhelmed a high-traffic backend service, preventing it from serving requests from other systems. Engineering immediately identified and terminated the process, which allowed the backend service to recover.
We will upgrade the backend service and implement systems that anticipate runaway processes, minimizing their impact on our platform.