This incident was caused by an unresponsive service on a single application node.
The node was removed from service, reviewed, refreshed, and reintroduced. Monitoring detected the issue, but detection was delayed by the configured thresholds. We have adjusted those thresholds to target faster response and resolution times.
Teams are also reviewing the underlying cause and additional opportunities to improve resiliency and reduce the impact of similar scenarios in the future.