

Between 02:20 UTC and 07:30 UTC on 23 March 2023, you may have experienced issues using Azure Resource Manager (ARM) when performing resource management operations in the West Europe region. This impacted users of Azure CLI, Azure PowerShell, the Azure portal, as well as Azure services which depend upon ARM for their internal resource management operations. The primary source of impact was limited to ARM API calls being processed in our West Europe region. This caused up to 50% of customer requests to this region to fail (approximately 3% of global requests at the time). This principally affected customers and workloads in geographic proximity to our West Europe region, while customers located elsewhere would not have been impacted – with limited exceptions for VPN users and those on managed corporate networks. Additionally, Azure services that leverage the ARM API as part of their own internal workflows, and customers of these services, may have experienced issues managing Azure resources located in West Europe as a result.

This incident was the result of a positive feedback loop leading to saturation on the ARM Web API tier. This was caused by high-volume, short-held lock contention on the request serving path, which triggered a significant increase in spin-waits against these locks, driving up CPU load and preventing threads from picking up asynchronous background work. As a result, latency for long-running asynchronous operations (such as outgoing database and web requests) increased, leading to timeouts. These timeouts caused both internal and external clients to retry requests, further increasing load and contention on these locks, eventually causing our Web API tier to saturate its available CPU capacity. Several factors contributed to amplifying this feedback loop; however, the ultimate trigger was the recent introduction of a cache used to reduce the time spent parsing complex feature flag definitions in hot loops.

This change was intended to reduce the performance impact of using feature flags on the request serving path, and had previously been load tested and validated in our internal testing and canary environments, demonstrating a significant reduction in performance impact in those scenarios.

This change was rolled out following our standard safe deployment practices, progressively deployed to increasingly larger regions over the course of four days prior to being deployed to West Europe.
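To illustrate the "high-volume, short-held lock contention" pattern at the heart of this incident, the sketch below shows a cache of parsed feature flag definitions guarded by a single coarse lock on the request path. ARM's implementation is not public, so this is an assumption-laden analogue (in Java, with invented class and method names): even though each lock hold is brief, the lock is acquired on every request, so contention and runtime spin-waiting scale with request volume.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: a cache of parsed feature flag definitions guarded by
// a single coarse lock. Each hold is short, but the lock is acquired on every
// request, so under high request volume threads contend constantly and the runtime
// spin-waits before parking them, burning CPU that could otherwise be used to pick
// up completed asynchronous work.
public class FlagDefinitionCache {

    // Stand-in for a complex, expensive-to-parse feature flag definition.
    public record ParsedFlag(String name) {
        static ParsedFlag parse(String raw) {
            return new ParsedFlag(raw); // real parsing would be far more expensive
        }
    }

    private final Object lock = new Object();                // one global lock
    private final Map<String, ParsedFlag> cache = new HashMap<>();

    // Called on the hot request-serving path for every flag evaluation.
    public ParsedFlag get(String flagName, String rawDefinition) {
        synchronized (lock) {                                // high-volume, short-held acquisition
            ParsedFlag parsed = cache.get(flagName);
            if (parsed == null) {
                parsed = ParsedFlag.parse(rawDefinition);    // cache miss: parse once, then reuse
                cache.put(flagName, parsed);
            }
            return parsed;
        }
    }
}
```

Note that in a design like this, caching reduces parse time but not how often the lock is taken; a common alternative is to make cache reads lock-free once entries are populated (for example, via ConcurrentHashMap.computeIfAbsent), so that only cache misses pay for synchronization. This is offered purely as an illustration of the contention pattern described above, not as a description of ARM's actual code or fix.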
