On December 22 at 7:43pm ET, a number of nodes were removed from the Knowledge Graph Search cluster for routine maintenance. This is typically a non-event. However, the preceding configuration change, which should have stopped search requests from being routed to those nodes, had been only partially applied, so every request sent to a removed node failed. The issue was discovered the next morning at 10:30am ET, and the configuration change to stop routing requests to those nodes was fully applied by 11am ET. That restored service, and no data was lost.
Removing nodes from a search cluster is performed by an automated process. In this case, it failed to function as expected because the nodes' actual state had drifted from their description in Terraform. The time to detection was lengthened because post-change verification is performed on the search cluster itself, which remained fully operational.
We will implement periodic automated checks for drift between the actual state of infrastructure nodes and their checked-in configuration. We will also tie relevant application dashboards more closely to the infrastructure change process, both to identify more classes of errors and to reduce the time to detection.
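As an illustration only, the core of such a drift check could look like the sketch below. It is not the tooling used in this incident: the function name `find_drift`, the `Drift` record, the node names, and the `in_routing_pool` attribute are all hypothetical, and the sketch assumes that both the checked-in configuration and the observed state have already been loaded into plain dictionaries keyed by node name.

```python
from dataclasses import dataclass

# Placeholder used when an attribute exists on one side only.
_MISSING = "<missing>"


@dataclass(frozen=True)
class Drift:
    """One attribute whose declared and actual values disagree (hypothetical record)."""
    node: str
    key: str
    declared: object
    actual: object


def find_drift(declared, actual):
    """Compare declared and actual node state and return every mismatch.

    Both arguments are assumed to map node name -> {attribute: value}.
    A node or attribute present on only one side is reported as drift.
    """
    drifts = []
    for node in sorted(set(declared) | set(actual)):
        want = declared.get(node, {})
        have = actual.get(node, {})
        for key in sorted(set(want) | set(have)):
            w = want.get(key, _MISSING)
            h = have.get(key, _MISSING)
            if w != h:
                drifts.append(Drift(node, key, w, h))
    return drifts


# Hypothetical example: the checked-in configuration says search-node-1 is
# out of the routing pool, but the live system still routes to it.
declared_state = {
    "search-node-1": {"in_routing_pool": False, "role": "search"},
    "search-node-2": {"in_routing_pool": True, "role": "search"},
}
actual_state = {
    "search-node-1": {"in_routing_pool": True, "role": "search"},
    "search-node-2": {"in_routing_pool": True, "role": "search"},
}

for d in find_drift(declared_state, actual_state):
    print(f"{d.node}: {d.key} declared={d.declared!r} actual={d.actual!r}")
```

Run on a schedule, a check of this shape would flag a mismatch before a maintenance step relies on the checked-in description being accurate. How the two dictionaries are populated (for example from Terraform state and from the live routing layer) is left open here.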