On November 18, 2019, from 11:19 a.m. until 11:40 a.m. US Eastern Time, data about reviews, like aggregate ratings or review text, was unavailable in the Customer Portal and Platform API. Data access was restored at 11:40 a.m., but access continued to be degraded until 12:08 p.m., with some requests returning errors and successful requests at higher latencies. No data was lost during this incident, and the display of reviews on Pages sites was unaffected.
The distributed database that stores and searches reviews stored in Yext was pushed beyond its capacity by a new set of analytic queries, eventually causing the full cluster to become unavailable. The on-call engineers identified and mitigated the issue by shutting down non-essential / offline processes to reduce load on the database and halting the analytic queries, partially restoring service at 11:40 a.m. The database was restored to full operation over the following half hour.
As an immediate remediation, we have provisioned additional capacity for this database cluster. To prevent recurrence in the future, we are implementing improvements in query monitoring, to allow us to identify queries that require a disproportionate amount of database resources. We are also implementing new alerts to identify this scenario and tighter timeouts to prevent slow queries from taking as many resources. Lastly, we are investigating alternative implementations for the demanding analytic queries to permanently remove that source of load from the operational systems.