Azure Storage outage

Starting around 13:00 (JST) on February 26, some customers in Japan East seem to have hit failures in services such as VMs. (The actual problem was in Storage, though.)

Azure Storage – Japan East – Mitigated (Tracking ID PLWV-BT0)

Summary of impact: Between 03:29 UTC and 10:02 UTC on 26 Feb 2021, a subset of customers in Japan East may have experienced service degradation and latency for resources utilising Azure Storage. Some Azure services utilising Storage may have also experienced downstream impact.
Preliminary Root Cause: During the window of impact, an increase in utilisation was observed and this combined with the sudden loss of backend instances meant an operational threshold was reached. This in turn caused failures for customers and Azure services attempting to utilise Storage resources.
Mitigation: Unhealthy instances were recovered which released resources to the service.
Once Storage services were recovered around 06:56 UTC, dependent services started recovering. We’ve declared full mitigation at 10:02 UTC.
Next steps: We will continue to investigate to establish the full root cause and prevent future occurrences. Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.

So increased Storage utilization and the sudden loss of some backend instances coincided, an operational threshold was reached, and customers using Storage resources saw failures, something like that? I was only affected in some Log Analytics / Application Insights data, but there are probably plenty of people who were hit via VMs, App Service and so on.

I plan to update this post once the RCA is published.

Update, March 2, 2021, 13:00: the RCA is out.

RCA – Azure Storage and dependent services – Japan East (Tracking ID PLWV-BT0)

Summary of Impact: Between 03:29 UTC and 10:02 UTC on 26 Feb 2021, a subset of customers in Japan East may have experienced service degradation and increased latency for resources utilizing Azure Storage, including failure of virtual machine disks. Some Azure services utilizing Storage may have also experienced downstream impact.
Root Cause: There were contributing factors that led to customer impact.
Firstly, we had an active deployment in progress on a single storage scale unit. Our safe deployment process normally reserves some resources within a scale unit so that deployments can take place. In addition to this space being reserved for the deployment, some nodes in the scale unit entered an unhealthy state and so they were removed from use from the scale unit. The final factor was that resource demand on the scale unit was unusually high.
In this case, our resource balancing automation was not able to keep up and spread the load to other scale units. A combination of all these factors resulted in a high utilization of this scale unit causing it to be heavily throttled in order to prevent failure. This resulted in a loss of availability for customers and Azure services attempting to utilize Storage resources within the impacted storage scale unit.
Mitigation: To mitigate customer impact as fast as possible, unhealthy nodes were recovered which restored resources to the service. In addition, engineers took steps to aggressively balance resource load out of the storage scale unit.
Once Storage services were recovered around 06:56 UTC, dependent services started recovering. We declared full mitigation at 10:02 UTC.
Next steps: We sincerely apologize for the impact this event had on our customers. Next steps include but are not limited to:

  • Improve detection and alerting when auto-balancing is not keeping up to help quickly trigger manual mitigation steps.
  • Reduce the maximum allowed resource utilization levels for smaller storage scale units to help ensure increased resource headroom in the face of multiple unexpected events.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: https://aka.ms/AzurePIRSurvey

First, a deployment was in progress on a single storage scale unit. The safe deployment process normally reserves some resources within the scale unit so the rollout can proceed. On top of that reserved space, some nodes in the scale unit became unhealthy (presumably for an unrelated reason) and were removed from use, and on top of that resource demand on the scale unit was unusually high. Resource balancing is automated and would normally spread the load to other scale units, but it couldn't keep up, and the combination of all these factors drove utilization of this scale unit high enough that it was heavily throttled to prevent outright failure.
As a result, Storage resources on the throttled scale unit lost availability.
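
To make the arithmetic in that chain of events a bit more concrete, here is a minimal Python sketch. Every name, node count and the 0.85 threshold are invented for illustration only; this has nothing to do with how Azure Storage is actually implemented.

```python
# Toy model of a storage scale unit's effective capacity and the point where
# protective throttling would kick in. All values are illustrative assumptions.

THROTTLE_THRESHOLD = 0.85  # stand-in for the "operational threshold"

def effective_capacity(total_nodes: int, unhealthy_nodes: int,
                       reserved_for_deployment: int,
                       capacity_per_node: float = 1.0) -> float:
    """Capacity actually available to serve customer load."""
    usable = total_nodes - unhealthy_nodes - reserved_for_deployment
    return usable * capacity_per_node

def should_throttle(demand: float, capacity: float) -> bool:
    """Throttle heavily once utilization crosses the operational threshold."""
    return demand / capacity >= THROTTLE_THRESHOLD

# 75 units of demand on 100 healthy nodes is comfortable (75% utilization),
# but with 10 nodes reserved for a rollout and 6 unhealthy nodes removed,
# the same demand pushes the scale unit past the threshold.
capacity = effective_capacity(total_nodes=100, unhealthy_nodes=6,
                              reserved_for_deployment=10)
print(should_throttle(demand=75.0, capacity=capacity))  # True (75 / 84 ≈ 0.89)
```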

Mitigation consisted of recovering the unhealthy nodes, restoring resources to the service, aggressively balancing load out of the scale unit, and so on.

Going forward they will improve detection and alerting for when auto-balancing is not keeping up, so that manual mitigation steps can be triggered quickly, and lower the maximum allowed resource utilization for smaller storage scale units to leave more headroom for multiple unexpected events. If you have anything to say, send feedback via the survey link.
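
For what it's worth, the "detection and alerting when auto-balancing is not keeping up" item is easy to picture as a simple rate comparison. The sketch below is purely hypothetical; the signal names and thresholds are my own assumptions, not anything Microsoft has described.

```python
# Hypothetical "auto-balancing is falling behind" check: alert when load is
# arriving faster than migrations can drain it while the scale unit is
# already running hot. Signal names and thresholds are assumptions.

WARNING_UTILIZATION = 0.75  # warn well before the throttling threshold

def balancing_falling_behind(utilization: float,
                             load_growth_rate: float,
                             migration_drain_rate: float) -> bool:
    return utilization >= WARNING_UTILIZATION and migration_drain_rate < load_growth_rate

# 80% utilized, load growing at 2.0 units/min, migrations draining 0.5 units/min:
# time to trigger manual mitigation before throttling kicks in.
print(balancing_falling_behind(0.80, load_growth_rate=2.0, migration_drain_rate=0.5))  # True
```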

Update, 2021/03/06: an even more detailed follow-up RCA has been published.

RCA – Azure Storage and dependent services – Japan East (Tracking ID PLWV-BT0)

Summary of Impact: Between 03:26 UTC and 10:02 UTC on 26 Feb 2021, a subset of customers in Japan East may have experienced service degradation and increased latency for resources utilizing Azure Storage, including failure of virtual machine disks. Some Azure services utilizing Storage may have also experienced downstream impact.
Summary Root Cause: During this incident, the impacted storage scale unit was under heavier than normal utilization. This was due to:

  • Incorrect limits set on the scale unit which allowed more load than desirable to be placed on it. This reduced the headroom that is usually available for unexpected events such as sudden spikes in growth which allows time to take load-balancing actions.
  • Additionally, the load balancing automation was not sufficiently spreading the load to other scale units within the region.

This high utilization triggered heavy throttling of storage operations to protect the scale unit from catastrophic failures. This throttling resulted in failures or increased latencies for storage operations on the scale unit.

Note: The original RCA mistakenly identified a deployment as a triggering event for the increased load. This is because during an upgrade, the nodes to be upgraded are removed from rotation, temporarily increasing load on remaining nodes. An upgrade was in queue on the scale unit but had not yet started. Our apologies for the initial mistake.

Background: An internal automated load balancing system actively monitors resource utilization of storage scale units to optimize load across scale units within an Azure region. For example, resources such as disk space, CPU, memory and network bandwidth are targeted for balancing. During this load balancing, storage data is migrated to a new scale unit, validated for data integrity at the destination and finally the data is cleaned up on the source to return free resources. This automated load-balancing happens continuously and in real-time to ensure workloads are properly optimized across available resources.

Detailed Root Cause: Prior to the start of impact, our automated load-balancing system had detected high utilization on the scale-unit and was performing data migrations to balance the load. Some of these load-balancing migrations did not make sufficient progress, creating a situation where the resource utilization on the scale unit reached levels that were above the safe thresholds that we try to maintain for sustained production operation. This kick-started automated throttling on incoming storage write requests to protect the scale unit from catastrophic failures. When our engineers were engaged, they also detected that the utilization limits that were set on the scale unit to control how much data and traffic should be directed to the scale unit was higher than expected. This did not give us sufficient headroom to complete load-balancing actions to prevent customer facing impact.
Mitigation: To mitigate customer impact as fast as possible, we took the following actions:

  • Engineers took steps to aggressively balance resource load out of the storage scale unit. The load-balancing migrations that were previously unable to finish were manually unblocked and completed, allowing a sizeable quantity of resources to be freed up for use. Additionally, load-balancing operations were tuned to improve its throughput to more effectively distribute load.
  • We prioritized recovery of nodes with hardware failures that had been taken out of rotation to bring additional resources online.

These actions brought the resource utilization on the scale unit to a safe level which was well below throttling thresholds. Once Storage services were recovered around 06:56 UTC, dependent services started recovering. We declared full mitigation at 10:02 UTC.
Next steps: We sincerely apologize for the impact this event had on our customers. Next steps include but are not limited to:

  • Optimize the maximum allowed resource utilization levels on this scale unit to provide increased headroom in the face of multiple unexpected events.
  • Improve existing detection and alerting for cases when load-balancing is not keeping up, so corrective action can be triggered early to help avoid customer impact.
  • Improve load-balancing automation to handle certain edge-cases under resource pressure where manual intervention is currently required to help prevent impactful events.
  • Improve emergency-levers to allow for faster mitigation of impactful resource utilization related events.

Provide Feedback: Please help us improve the Azure customer communications experience by taking our survey: https://aka.ms/AzurePIRSurvey

During the incident the impacted storage scale unit was under heavier-than-normal utilization, and this RCA gives the reasons. First, the limits set on the scale unit were incorrect, so more load than desirable could be placed on it; that reduced the headroom normally kept for unexpected events such as sudden spikes in growth. Second, the load-balancing automation was not sufficiently spreading the load to other scale units within the region.
The high utilization triggered heavy throttling of storage operations, which protected the scale unit from catastrophic failure but also caused failed storage operations and increased latency on that scale unit. The previous RCA mistakenly identified a deployment as the trigger for the increased load, since during an upgrade the nodes being upgraded are removed from rotation and the load on the remaining nodes temporarily rises, but in reality the upgrade was still queued on the scale unit and had not actually started.
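
A rough way to picture the "incorrect limits" factor: the higher the placement limit on a scale unit, the less spare capacity is left to absorb a spike before throttling starts. The numbers and names below are invented purely to illustrate the idea.

```python
# Toy headroom calculation. placement_limit is a stand-in for the per-scale-unit
# limit the RCA says was set higher than expected; throttle_threshold is where
# protective throttling would start. All values are illustrative assumptions.

def spike_headroom(capacity: float, placement_limit: float,
                   throttle_threshold: float = 0.95) -> float:
    """Spare capacity left for sudden spikes once placement has filled the
    scale unit up to its configured limit."""
    return capacity * (throttle_threshold - placement_limit)

print(spike_headroom(100.0, placement_limit=0.70))  # 25.0 -> spikes can be absorbed
print(spike_headroom(100.0, placement_limit=0.90))  # 5.0  -> a small spike triggers throttling
```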

An internal automated load-balancing system actively monitors the resource utilization of storage scale units and optimizes load across the scale units within an Azure region, covering resources such as disk space, CPU, memory and network bandwidth. During load balancing, storage data is migrated to a new scale unit, data integrity is validated at the destination, and finally the data is cleaned up on the source to return the freed resources. This happens continuously and in real time so that workloads stay properly optimized across the available resources.
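
The migrate → validate → clean-up cycle described there can be sketched roughly like this. ScaleUnit, rebalance_one and the checksum helper are all invented for illustration; the real system obviously does far more (incremental copies, consistency while writes continue, and so on).

```python
# Greatly simplified sketch of one load-balancing step: copy a partition to
# the destination, verify integrity, then clean up the source to free resources.
import hashlib
from dataclasses import dataclass, field

@dataclass
class ScaleUnit:
    name: str
    partitions: dict = field(default_factory=dict)  # partition id -> bytes

    def used(self) -> int:
        return sum(len(data) for data in self.partitions.values())

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def rebalance_one(source: ScaleUnit, destination: ScaleUnit) -> None:
    # Pick the largest partition on the hot scale unit.
    pid, data = max(source.partitions.items(), key=lambda kv: len(kv[1]))
    destination.partitions[pid] = data                              # migrate
    assert checksum(destination.partitions[pid]) == checksum(data)  # validate at destination
    del source.partitions[pid]                                      # clean up source

hot = ScaleUnit("unit-a", {"p1": b"x" * 1000, "p2": b"y" * 200})
cool = ScaleUnit("unit-b")
rebalance_one(hot, cool)
print(hot.used(), cool.used())  # 200 1000
```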

Detailed root cause: before the impact started, the automated load-balancing system had detected high utilization on the scale unit and was running data migrations to spread the load. Some of those migrations did not make enough progress, and the scale unit's resource utilization rose above the safe thresholds that are maintained for sustained production operation. That kicked off automatic throttling of incoming storage write requests (which did protect the scale unit from catastrophic failure). When the engineers got involved, they also found that the utilization limits set on the scale unit, which control how much data and traffic is directed to it, were higher than expected, so there was not enough headroom to complete the load-balancing actions before customers were impacted.
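
From a client's point of view, that write throttling looks roughly like the sketch below. The threshold value and the error type are assumptions made up for illustration; real callers would simply see storage errors or increased latency.

```python
# Minimal sketch of threshold-based write throttling: above a safe utilization
# level, incoming writes are rejected to protect the scale unit. Threshold and
# error shape are illustrative assumptions.

SAFE_WRITE_THRESHOLD = 0.85

class ServerBusyError(Exception):
    """Stand-in for the throttling errors callers would observe."""

def handle_write(payload: bytes, used: float, capacity: float) -> None:
    utilization = used / capacity
    if utilization >= SAFE_WRITE_THRESHOLD:
        # Fail fast rather than risk catastrophic failure of the scale unit.
        raise ServerBusyError(f"scale unit at {utilization:.0%}, write throttled")
    # ...otherwise accept and persist the write...

handle_write(b"ok", used=70.0, capacity=100.0)        # accepted
try:
    handle_write(b"nope", used=90.0, capacity=100.0)
except ServerBusyError as err:
    print(err)                                        # scale unit at 90%, write throttled
```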
As mitigation, they manually unblocked and completed the load-balancing migrations that had previously been unable to finish, which freed up a sizeable amount of resources for reuse, and they tuned the load-balancing operations for better throughput so load could be distributed more effectively. They also prioritized recovering nodes that had been taken out of rotation because of hardware failures, to bring additional resources back online.

So that's the more detailed information that has come out.
