On March 24 between 10:32am and 8:12pm ET, and again from 9:16pm until 6:00am ET the following morning, services that provide access to Knowledge Graph data experienced elevated load. The resulting degradation in response times delayed the propagation of updates to Live API and Answers and caused sporadic failures in the Customer Portal.
A new usage pattern in accessing data from related entities caused a multiplicative increase in the processing required for updates in one particular large account. Specific, but common, requests began to load more than 100x as much profile data as before, which overwhelmed the in-memory caching system and delayed other requests.
We identified the usage pattern that led to the elevated load and implemented a more efficient mechanism that supports it without the corresponding increase in required resources.