Message Monitoring
Introduction
It has suddenly gotten cold. This is the day-7 article of the OpenShift Advent Calendar 2024. In the day-5 article the other day, we looked at how the alert notification destination can change. Today I would like to try out message monitoring with custom alerts, one of the Cluster Logging features.
OpenShift Cluster Logging
Starting with Cluster Logging 6, Elasticsearch is no longer provided and only Grafana Loki is available. The message-monitoring alert feature has been available since a little before that. Are you already using it? In this article, just as with yesterday's alert notifications, we will look at whether message-monitoring notifications are delivered to the administrators' Alertmanager (Cluster Monitoring) or to the users' Alertmanager (User Workload Monitoring).
Among Loki's components, the Ruler is the one responsible for this monitoring, so let's take a look at it.
Ruler
Grafana Loki includes a component called the Ruler. The Ruler is responsible for continuously evaluating a configured set of queries and taking action based on the results.
Concretely, the two custom resources AlertingRule and RecordingRule are used to evaluate log messages and fire an alert when a condition matches. These custom resources are compatible with the Prometheus rule syntax, so if you already know Prometheus the additional learning cost is low. (A RecordingRule sketch follows the AlertingRule example below.)
Example of an AlertingRule
groups:
  - name: should_fire
    rules:
      - alert: HighPercentageError
        expr: |
          sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
            /
          sum(rate({app="foo", env="production"}[5m])) by (job)
            > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
  - name: credentials_leak
    rules:
      - alert: http-credentials-leaked
        annotations:
          message: "{{ $labels.job }} is leaking http basic auth credentials."
        expr: 'sum by (cluster, job, pod) (count_over_time({namespace="prod"} |~ "http(s?)://(\\w+):(\\w+)@" [5m]) > 0)'
        for: 10m
        labels:
          severity: critical
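For comparison, a RecordingRule follows the same structure but writes the query result back as a metric instead of firing an alert. A minimal sketch under assumed values (the rule name, metric name, and LogQL query here are invented for illustration and are not part of the setup used later in this article):

apiVersion: loki.grafana.com/v1
kind: RecordingRule
metadata:
  name: sample-record
spec:
  tenantID: "application"
  groups:
    - name: sample-records
      interval: 1m
      rules:
        # record the per-minute rate of nginx container log lines as a metric
        - record: nginx:loglines:rate1m
          expr: |
            sum(rate({container="nginx"}[1m]))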
Which AlertManager receives message-monitoring notifications
Yesterday, in [Controlling alert notifications](https://rheb.hatenablog.com/entry/2024/12/06/215706), we confirmed the following three patterns:
- Users can send alert notifications, but the notification destination is managed by the administrator
- Users can send alert notifications, and the notification destination is also managed by the users themselves
- Rather than the administrator adding custom alerts, the load is shared with the users (the users must now manage the notification-destination settings themselves)
Whether the notification destination is controlled by the users or by the administrator was sorted out the other day, so today we will check whether message-monitoring notifications go to the administrators' AlertManager or to the users' AlertManager.
We will check patterns 1 and 3.
The application and message-monitoring rule used
We use an application that prints the string ERROR at regular intervals. No special functionality is needed, so we deploy the app with the following Deployment.
cat <<EOF | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: error-logger
  labels:
    app: error-logger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: error-logger
  template:
    metadata:
      labels:
        app: error-logger
    spec:
      containers:
      - name: error-logger
        image: busybox
        command: ["sh", "-c", "while true; do echo ERROR; sleep 1; done"]
EOF
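The AlertingRule in the next step matches logs from the log-sample namespace, so the Deployment is assumed to live there. As a quick sanity check (a sketch; adjust the namespace to wherever you actually applied the Deployment), confirm that the ERROR lines are being produced:

oc logs -n log-sample deployment/error-logger --tail=5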
Next, we configure message monitoring so that an alert fires when the string ERROR accounts for more than a certain fraction of the log output.
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: sample-alert
  labels:
    example.jp/system: sample
spec:
  tenantID: "application"
  groups:
    - name: SampleError
      rules:
        - alert: HighPercentageError
          expr: |
            sum(rate({kubernetes_namespace_name="log-sample", kubernetes_pod_name=~".*"} |= "ERROR" [1m])) by (job)
              /
            sum(rate({kubernetes_namespace_name="log-sample", kubernetes_pod_name=~".*"}[1m])) by (job)
              > 0.01
          for: 10s
          labels:
            severity: critical
          annotations:
            summary: This is summary
            description: This is description
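Once applied, the AlertingRule is an ordinary namespaced resource and can be listed like any other object (a sketch, assuming the rule was created in the log-sample namespace):

oc get alertingrules.loki.grafana.com -n log-sample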
Enabling the alert-monitoring configuration
Message monitoring is not enabled on a LokiStack by default, so set the following fields on the LokiStack. Adjust the namespace-selector labels and so on for your environment.
...
spec:
  ...
  rules:
    enabled: true
    namespaceSelector:
      matchLabels:
        example.jp/alert: 'true'
    selector:
      matchLabels:
        example.jp/system: sample
  ...
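For these selectors to actually pick up the AlertingRule, the namespace containing the rule must carry the matching label (the rule itself already carries example.jp/system: sample). A sketch, again assuming the rule lives in the log-sample namespace:

oc label namespace log-sample example.jp/alert=true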
Pattern 1: Notification to the administrators' AlertManager
To the commands used for checking alert notifications, we add one that shows the Ruler configuration, and use the following set of commands to check the situation.
echo "User: alerts" oc exec -it alertmanager-user-workload-0 -n openshift-user-workload-monitoring -- amtool alert query --alertmanager.url http://localhost:9093 echo "" echo "Cluster: alerts" oc exec -it alertmanager-main-1 -n openshift-monitoring -- amtool alert query --alertmanager.url http://localhost:9093 echo "" echo "User workload: alertmanager.yaml" oc exec -it alertmanager-user-workload-0 -n openshift-user-workload-monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml; echo "" echo "" echo "Cluster: alertmanager.yaml" oc exec -it alertmanager-main-1 -n openshift-monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml; echo "" echo "" echo "Cluster Logging: Ruler" oc exec -it logging-loki-ruler-0 -n openshift-logging -- cat /etc/loki/config/config.yaml | egrep "^ruler:" -A 10; echo ""
First, the administrators' Monitoring configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
No users' Monitoring configuration is needed for this alert notification.
In this state, when the alert fires, the result looks like the output below. Since there is no users' Monitoring configuration, querying the user-workload AlertManager returns an error.
Looking at the Loki Ruler's configuration, you can see that it points to https://_web._tcp.alertmanager-operated.openshift-monitoring.svc, that is, the administrators' AlertManager. As this configuration indicates, we can confirm that the message-monitoring alert HighPercentageError was delivered to the administrators' AlertManager.
User: alerts
Error from server (NotFound): pods "alertmanager-user-workload-0" not found

Cluster: alerts
Alertname  Starts At  Summary  State
Watchdog  2024-12-07 01:36:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
UpdateAvailable  2024-12-07 01:38:06 UTC  Your upstream update recommendation service recommends you update your cluster.  active
PrometheusOperatorRejectedResources  2024-12-07 01:42:32 UTC  Resources rejected by Prometheus operator  active
InsightsRecommendationActive  2024-12-07 01:44:33 UTC  An Insights recommendation is active for this cluster.  active
KubeDaemonSetMisScheduled  2024-12-07 01:52:49 UTC  DaemonSet pods are misscheduled.  active
KubeDaemonSetMisScheduled  2024-12-07 01:52:49 UTC  DaemonSet pods are misscheduled.  active
KubeDaemonSetMisScheduled  2024-12-07 01:52:49 UTC  DaemonSet pods are misscheduled.  active
KubeDaemonSetRolloutStuck  2024-12-07 02:07:49 UTC  DaemonSet rollout is stuck.  active
KubeDaemonSetRolloutStuck  2024-12-07 02:07:49 UTC  DaemonSet rollout is stuck.  active
KubeDaemonSetRolloutStuck  2024-12-07 02:07:49 UTC  DaemonSet rollout is stuck.  active
PrometheusDuplicateTimestamps  2024-12-07 02:37:55 UTC  Prometheus is dropping samples with duplicate timestamps.  active
ClusterNotUpgradeable  2024-12-07 02:38:10 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps  2024-12-07 02:38:25 UTC  Prometheus is dropping samples with duplicate timestamps.  active
PodDisruptionBudgetAtLimit  2024-12-07 02:39:38 UTC  The pod disruption budget is preventing further disruption to pods.  active
HighOverallControlPlaneMemory  2024-12-07 05:17:04 UTC  Memory utilization across all control plane nodes is high, and could impact responsiveness and stability.  active
HighPercentageError  2024-12-07 11:55:32 UTC  This is summary  active

User workload: alertmanager.yaml
Error from server (NotFound): pods "alertmanager-user-workload-0" not found

Cluster: alertmanager.yaml
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - channel: '#openshift-on-kvm'
        api_url: >-
          https://hooks.slack.com/services/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
  - name: Watchdog
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: Default
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: Watchdog
    - matchers:
        - severity = critical
      receiver: Critical

Cluster Logging: Ruler
ruler:
  enable_api: true
  enable_sharding: true
  alertmanager_url: https://_web._tcp.alertmanager-operated.openshift-monitoring.svc
  enable_alertmanager_v2: true
  enable_alertmanager_discovery: true
  alertmanager_refresh_interval: 1m
  wal:
    dir: /tmp/wal
    truncate_frequency: 60m
    min_age: 5m
Pattern 3: Notification to the users' AlertManager
This time we send notifications to the users' AlertManager. We enable the users' AlertManager and enable AlertmanagerConfig support (a sketch of such an AlertmanagerConfig follows the Monitoring configurations below).
The administrators' Monitoring configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
The users' Monitoring configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
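With enableAlertmanagerConfig: true, users can define their own routing via an AlertmanagerConfig resource. The following is only a minimal sketch of a config that could produce the generated receiver alert-sample/slack-routing/sample seen in the output below; the Secret name slack-webhook and its key url are assumptions, and depending on the cluster version the v1alpha1 API may be required instead of v1beta1:

apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: slack-routing
  namespace: alert-sample
spec:
  route:
    receiver: sample
  receivers:
    - name: sample
      slackConfigs:
        - channel: '#openshift-on-kvm'
          apiURL:
            # assumed Secret in the same namespace holding the Slack webhook URL
            name: slack-webhook
            key: url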
In this state, we run the message monitoring again.
The results are shown below. We can confirm that the monitoring notification reached the users' AlertManager. However, the Ruler configuration still points at the administrators' AlertManager. What is going on?
User: alerts
Alertname  Starts At  Summary  State
HighPercentageError  2024-12-07 12:16:04 UTC  This is summary  active

Cluster: alerts
Alertname  Starts At  Summary  State
Watchdog  2024-12-07 01:36:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
UpdateAvailable  2024-12-07 01:38:06 UTC  Your upstream update recommendation service recommends you update your cluster.  active
PrometheusOperatorRejectedResources  2024-12-07 01:42:32 UTC  Resources rejected by Prometheus operator  active
InsightsRecommendationActive  2024-12-07 01:44:33 UTC  An Insights recommendation is active for this cluster.  active
KubeDaemonSetMisScheduled  2024-12-07 01:52:49 UTC  DaemonSet pods are misscheduled.  active
KubeDaemonSetMisScheduled  2024-12-07 01:52:49 UTC  DaemonSet pods are misscheduled.  active
KubeDaemonSetMisScheduled  2024-12-07 01:52:49 UTC  DaemonSet pods are misscheduled.  active
KubeDaemonSetRolloutStuck  2024-12-07 02:07:49 UTC  DaemonSet rollout is stuck.  active
KubeDaemonSetRolloutStuck  2024-12-07 02:07:49 UTC  DaemonSet rollout is stuck.  active
KubeDaemonSetRolloutStuck  2024-12-07 02:07:49 UTC  DaemonSet rollout is stuck.  active
PrometheusDuplicateTimestamps  2024-12-07 02:37:55 UTC  Prometheus is dropping samples with duplicate timestamps.  active
ClusterNotUpgradeable  2024-12-07 02:38:10 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps  2024-12-07 02:38:25 UTC  Prometheus is dropping samples with duplicate timestamps.  active
PodDisruptionBudgetAtLimit  2024-12-07 02:39:38 UTC  The pod disruption budget is preventing further disruption to pods.  active
HighOverallControlPlaneMemory  2024-12-07 05:17:04 UTC  Memory utilization across all control plane nodes is high, and could impact responsiveness and stability.  active

User workload: alertmanager.yaml
route:
  receiver: Default
  group_by:
    - namespace
  routes:
    - receiver: alert-sample/slack-routing/sample
      matchers:
        - namespace="alert-sample"
      continue: true
receivers:
  - name: Default
  - name: alert-sample/slack-routing/sample
    slack_configs:
      - api_url: https://hooks.slack.com/services/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
        channel: '#openshift-on-kvm'
templates: []

Cluster: alertmanager.yaml
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - channel: '#openshift-on-kvm'
        api_url: >-
          https://hooks.slack.com/services/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
  - name: Watchdog
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: Default
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: Watchdog
    - matchers:
        - severity = critical
      receiver: Critical

Cluster Logging: Ruler
ruler:
  enable_api: true
  enable_sharding: true
  alertmanager_url: https://_web._tcp.alertmanager-operated.openshift-monitoring.svc
  enable_alertmanager_v2: true
  enable_alertmanager_discovery: true
  alertmanager_refresh_interval: 1m
  wal:
    dir: /tmp/wal
    truncate_frequency: 60m
    min_age: 5m
In fact, the Ruler component's configuration is partially overridden by yet another file, /etc/loki/config/runtime-config.yaml. Its contents look like this:
---
overrides:
  application:
    ruler_alertmanager_config:
      alertmanager_url: https://_web._tcp.alertmanager-operated.openshift-user-workload-monitoring.svc
      enable_alertmanager_v2: true
      enable_alertmanager_discovery: true
      alertmanager_refresh_interval: 1m
      alertmanager_client:
        tls_ca_path: /var/run/ca/alertmanager/service-ca.crt
        tls_server_name: alertmanager-user-workload.openshift-user-workload-monitoring.svc.cluster.local
        type: Bearer
        credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
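For reference, this override file can be dumped the same way as the main Ruler config (a sketch reusing the ruler pod name from the commands earlier in this article):

oc exec -it logging-loki-ruler-0 -n openshift-logging -- cat /etc/loki/config/runtime-config.yaml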
Here we can confirm that the notification-destination Alertmanager is the users' one, overriding the base Ruler configuration for the application tenant. Now the actual behavior and the configuration match.
Summary
We checked which AlertManager receives message-monitoring notifications, comparing the actual configuration with the observed behavior. We also inspected the configuration actually mounted in the Pods and confirmed that it matches that behavior. Now I can head into the weekend with peace of mind.