On March 21st, at 2:02 PM EST, engineering received reports of elevated page load times in the Customer Portal. Investigation began immediately, and mitigations were implemented by 2:21 PM EST, at which time page load times returned to normal.
A bug in an asynchronous process was triggered shortly after midnight on March 21st, causing the process to gradually increase its resource consumption. The unchecked consumption reached a critical point in the early afternoon, resulting in increased latency for Customer Portal requests. Once the source was identified, the asynchronous process was immediately terminated, releasing the resources.
Because the resource consumption grew gradually, our alerting did not detect the issue until it began to affect other systems. Going forward, we plan to implement mechanisms that limit the resources such runaway processes can consume. We also plan to add alerting that detects latency increases developing over longer periods, so that such scenarios are caught before they impact customer requests.
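
As an illustration of what alerting on gradual change could look like, the sketch below fits a least-squares trend line to a long window of samples and flags sustained growth even when no single reading crosses a fixed threshold. This is a minimal sketch and not our actual alerting configuration: the function names, the hourly sampling, the metric, and the threshold values are all assumptions made for the example.

```python
from statistics import mean


def slope_per_hour(samples):
    """Least-squares slope of (hour, value) samples, in value units per hour."""
    xs = [hour for hour, _ in samples]
    ys = [value for _, value in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        # All samples share one timestamp, so no trend can be computed.
        return 0.0
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def sustained_growth(samples, min_slope, min_hours):
    """True if the window spans at least min_hours and trends up by at least min_slope per hour."""
    if len(samples) < 2:
        return False
    span = samples[-1][0] - samples[0][0]
    return span >= min_hours and slope_per_hour(samples) >= min_slope


# Hypothetical data: a metric (e.g. memory in MB) sampled hourly from midnight.
leaking = [(hour, 500 + 50 * hour) for hour in range(14)]
steady = [(hour, 500) for hour in range(14)]

leak_alert = sustained_growth(leaking, min_slope=20, min_hours=6)
steady_alert = sustained_growth(steady, min_slope=20, min_hours=6)
```

With these assumed values, the steadily growing series triggers the check while the flat series does not. The same approach could be applied to latency percentiles or to per-process resource usage; the appropriate window length and slope threshold would need to be tuned against real traffic.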