Reviews Service Unavailable
Incident Report for Yext
Postmortem

Summary

On November 18, 2019, from 11:19 a.m. until 11:40 a.m. US Eastern Time, reviews data, such as aggregate ratings and review text, was unavailable in the Customer Portal and Platform API. Data access was restored at 11:40 a.m., but service remained degraded until 12:08 p.m.: some requests returned errors, and successful requests had higher latency than normal. No data was lost during this incident, and the display of reviews on Pages sites was unaffected.

Root Cause

The distributed database that stores and searches reviews in Yext was pushed beyond its capacity by a new set of analytic queries, which eventually caused the entire cluster to become unavailable. The on-call engineers identified the issue and mitigated it by halting the analytic queries and shutting down non-essential and offline processes to reduce load on the database, partially restoring service at 11:40 a.m. The database returned to full operation over the following half hour.

Remediation

As an immediate remediation, we have provisioned additional capacity for this database cluster. To prevent a recurrence, we are improving query monitoring so that we can identify queries that consume a disproportionate share of database resources. We are also implementing new alerts to detect this scenario and tighter timeouts to limit the resources that slow queries can consume. Lastly, we are investigating alternative implementations for the demanding analytic queries to permanently remove that source of load from the operational systems.
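
As a purely illustrative sketch, and not a description of the tooling used at Yext, the idea of identifying queries that consume a disproportionate share of database resources could look like the following. The function name `flag_heavy_queries`, the query labels, the `(label, seconds)` sample format, and the 25% threshold are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flag_heavy_queries(
    samples: Iterable[Tuple[str, float]],
    max_share: float = 0.25,
) -> List[Tuple[str, float]]:
    """Return (label, share) pairs for query labels whose share of the
    total observed execution time exceeds max_share, largest first.

    `samples` is a hypothetical stream of (query_label, seconds) pairs,
    for example as collected from a slow-query log.
    """
    # Sum execution time per query label.
    totals: Dict[str, float] = defaultdict(float)
    for label, seconds in samples:
        totals[label] += seconds

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # Nothing observed, so nothing to flag.

    # Keep only labels whose share of total time is above the threshold.
    flagged = [
        (label, total / grand_total)
        for label, total in totals.items()
        if total / grand_total > max_share
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples: one analytic query dominates database time.
    observed = [
        ("review_lookup", 1.0),
        ("analytics_rollup", 8.0),
        ("review_lookup", 1.0),
    ]
    for label, share in flag_heavy_queries(observed):
        print(f"{label} uses {share:.0%} of observed query time")
```

A check like this would normally feed an alert rather than run by hand, and the threshold would need tuning against normal traffic so that routine operational queries are not flagged.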

Posted Dec 04, 2019 - 10:23 EST

Resolved
This incident has been resolved.
Posted Nov 18, 2019 - 15:07 EST
Monitoring
We have completely restored the affected component and removed the temporary mitigation. We will continue to monitor the system closely and develop additional preventative measures.
Posted Nov 18, 2019 - 13:12 EST
Identified
We have identified the issue and put a temporary mitigation into place to restore service to reviews-related pages & API endpoints.
Posted Nov 18, 2019 - 12:20 EST
Investigating
We are investigating reports that the reviews-related pages in the Customer Portal and the reviews-related endpoints in the Platform API are unavailable.
Posted Nov 18, 2019 - 11:49 EST
This incident affected: Content (Management API) and Customer Portal Login.