Customer Portal Performance Degradation

Incident Report for Yext

Postmortem

Summary

On March 21st at 2:02 PM EDT, engineering received reports of elevated page load times in the Customer Portal. Investigation began immediately, and mitigations were implemented by 2:21 PM EDT, at which point page load times returned to normal.

Root Cause

Shortly after midnight on March 21st, a bug in an asynchronous process caused the process to slowly increase its resource consumption. The unchecked growth reached a critical point in the early afternoon, resulting in increased latency for Customer Portal requests. Once the source was identified, the asynchronous process was terminated immediately, releasing the resources.

Because the resource consumption grew gradually, our alerting did not detect the issue until it began to affect other systems. Going forward, we plan to implement mechanisms that limit how many resources such runaway processes can consume. We also plan to add alerting that actively detects latency increases over longer periods, so that similar scenarios are caught before they impact customer requests.
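
As an illustration of what alerting on latency increases over longer periods could look like, the sketch below fits a straight line to hourly latency samples and flags a sustained upward trend. It is a minimal example under stated assumptions, not a description of our monitoring stack: the function names, the 5 ms/hour threshold, and the 6-hour minimum window are all hypothetical.

```python
from statistics import mean


def latency_slope(samples):
    """Least-squares slope, in ms per hour, over (hour, latency_ms) samples."""
    if len(samples) < 2:
        return 0.0
    mean_t = mean(t for t, _ in samples)
    mean_y = mean(y for _, y in samples)
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def is_drifting(samples, max_slope_ms_per_hour=5.0, min_hours=6.0):
    """Return True when latency has trended upward over a long enough window.

    Both thresholds are hypothetical values chosen for illustration.
    Samples are assumed to be sorted by time.
    """
    if not samples:
        return False
    span_hours = samples[-1][0] - samples[0][0]
    if span_hours < min_hours:
        return False  # too short a window to call it a gradual drift
    return latency_slope(samples) > max_slope_ms_per_hour


# Hypothetical data: latency creeping up 10 ms every hour for 12 hours.
creeping = [(hour, 200 + 10 * hour) for hour in range(13)]
steady = [(hour, 200) for hour in range(13)]

print(is_drifting(creeping))
print(is_drifting(steady))
```

A slope-based check like this complements fixed-threshold alerts: each individual sample in a slow climb can stay under a static limit for hours while the trend is already visible.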

Posted Apr 01, 2019 - 14:52 EDT

Resolved

This incident has been resolved.
Posted Mar 22, 2019 - 11:14 EDT

Monitoring

We have implemented mitigations and page load times have returned to normal. We will actively monitor the Customer Portal for any issues.
Posted Mar 21, 2019 - 17:21 EDT

Investigating

We are currently investigating reports of elevated page load times in the Customer Portal. We will update as soon as we have more details.
Posted Mar 21, 2019 - 14:01 EDT
This incident affected: Customer Portal Login.