On August 23, from 11:07 a.m. to 11:51 a.m. US Eastern Time, the Customer Portal and the Platform API experienced elevated error rates, particularly in the Knowledge Graph area.
A routine change to a service contained a subtle resource leak that eventually overwhelmed the machine that service was running on after many hours, negatively affecting the other services also running on the same machine.
We will add monitoring and alerting at both the service and machine levels for the class of resource that was exhausted during this incident.