Configuring OpenShift Alert Notifications
Introduction
I feel like I only ever write articles at this time of year, but this is the December 5 entry of the OpenShift Advent Calendar 2024. It's the season when, if you don't write your letter to Santa Claus soon, there's a chance the present you want won't arrive. If it doesn't arrive as intended, it could be a sad Christmas. In an OpenShift cluster, too, where alerts get delivered matters: if they don't arrive correctly, you might end up giving up your holidays.
In this article I'll sort out how AlertManager is used with OpenShift's Cluster Monitoring.
The OpenShift monitoring stack
The figure below shows the layout of Cluster Monitoring and User Workload Monitoring. The parts outlined with red dotted lines and labeled "optional" are components that can be added through user configuration.
First, let's briefly review what Cluster Monitoring and User Workload Monitoring are.
Cluster Monitoring
Quoting directly from the product documentation:
A set of platform monitoring components are installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.
These components are illustrated in the Installed by default section in the following diagram.
User Workload Monitoring
Likewise, for User Workload Monitoring:
After optionally enabling monitoring for user-defined projects, additional monitoring components are installed in the openshift-user-workload-monitoring project. This provides monitoring for user-defined projects. These components are illustrated in the User section in the following diagram.
Although the quoted text does not mention it, AlertManager also appears among the User Workload Monitoring components. In other words, OpenShift lets you run two AlertManagers.
To make the two roles easier to refer to, from here on I'll use the terms "administrator's" for Cluster Monitoring and "user's" for User Workload Monitoring.
Controlling alert notifications
Administrators monitor the state of the cluster, and users likewise monitor the applications they operate. When alerts need to be delivered, you will likely be asked to manage notification destinations separately for each party. OpenShift can handle the following cases:
- Users want to send alert notifications, but the destinations are managed by the administrator
- Users want to send alert notifications, and also want to manage the destinations themselves
- The administrator has added many custom alerts and wants to spread the load with users (users then have to manage their destination settings themselves)
Let's verify each of these with actual configuration on a real cluster. The verification uses OpenShift 4.17, and the notification destination for the administrator's AlertManager is Slack. The user's destination is also Slack. Sensitive information is masked, so substitute values from your own environment if you try this yourself.
Case 1: Users want to configure only the alerts
In OpenShift, enabling the user's Monitoring allows users to configure their own alerts. In this case there is normally only the administrator's AlertManager, so alert destinations are managed solely by the administrator. For verification, I define an alert named ExampleUserAlert in the alert-sample project. The administrator's and user's Monitoring settings are as follows.
Administrator's Monitoring configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
No user-side Monitoring configuration is required for alert notification.
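For reference, the ExampleUserAlert used in this verification could be defined with a PrometheusRule along these lines. This is a minimal sketch: the expression, severity label, and rule/group names are assumptions of mine; only the alert name, project, and summary text appear in the verification output.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-user-alert
  namespace: alert-sample
spec:
  groups:
    - name: example          # group name is an assumption
      rules:
        - alert: ExampleUserAlert
          expr: vector(1)    # placeholder: always fires, handy for testing routing
          labels:
            severity: warning
          annotations:
            summary: This is sample summary
```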
With this configuration, when a user defines an alert and it fires, the notification goes to the administrator's AlertManager. It's a crude way to do it, but the following script checks the state:
```shell
echo "User: alerts"
oc exec -it alertmanager-user-workload-0 -n openshift-user-workload-monitoring -- amtool alert query --alertmanager.url http://localhost:9093
echo "Cluster: alerts"
oc exec -it alertmanager-main-1 -n openshift-monitoring -- amtool alert query --alertmanager.url http://localhost:9093
echo "User workload: alertmanager.yaml"
oc exec -it alertmanager-user-workload-0 -n openshift-user-workload-monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml; echo ""
echo "Cluster: alertmanager.yaml"
oc exec -it alertmanager-main-1 -n openshift-monitoring -- cat /etc/alertmanager/config_out/alertmanager.env.yaml; echo ""
```
The result is shown below. You can see that only the administrator's AlertManager exists (the user-workload pod is not found), that the notification reached the administrator's AlertManager, and that the only destination configured is the Slack channel set up by the administrator.
```
User: alerts
Error from server (NotFound): pods "alertmanager-user-workload-0" not found
Cluster: alerts
Alertname                            Starts At                Summary                                                                                  State
Watchdog                             2024-12-05 23:06:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
UpdateAvailable                      2024-12-05 23:07:36 UTC  Your upstream update recommendation service recommends you update your cluster.          active
PrometheusOperatorRejectedResources  2024-12-05 23:12:32 UTC  Resources rejected by Prometheus operator                                                active
InsightsRecommendationActive         2024-12-05 23:15:03 UTC  An Insights recommendation is active for this cluster.                                   active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
ClusterNotUpgradeable                2024-12-06 00:07:40 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                active
PodDisruptionBudgetAtLimit           2024-12-06 00:09:08 UTC  The pod disruption budget is preventing further disruption to pods.                      active
ExampleUserAlert                     2024-12-06 12:19:23 UTC  This is sample summary                                                                   active
User workload: alertmanager.yaml
Error from server (NotFound): pods "alertmanager-user-workload-0" not found
Cluster: alertmanager.yaml
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - channel: '#openshift-on-kvm'
        api_url: >-
          https://hooks.slack.com/services/XXXXXX
  - name: Watchdog
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: Default
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: Watchdog
    - matchers:
        - severity = critical
      receiver: Critical
```
Case 2: Users want to configure both the alerts and the notification destinations
For users to configure alert notification destinations, they create an AlertmanagerConfig resource, and its settings are applied to the administrator's AlertManager. Because the AlertmanagerConfig resource a user creates is namespace-scoped, one is required per Namespace/Project.
The configuration is as follows.
Administrator's Monitoring configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
```
No user-side Monitoring configuration is required for alert notification.
With that configuration in place, create an AlertmanagerConfig. The Slack URL used as the destination is defined as the url key of a Secret named webhook.
```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: slack-routing
spec:
  route:
    receiver: sample
  receivers:
    - name: sample
      slackConfigs:
        - channel: '#openshift-on-kvm'
          apiURL:
            name: webhook
            key: url
```
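The webhook Secret referenced by apiURL above is not shown in the article; it could be created with a manifest along these lines. This is a sketch: the hook URL is a masked placeholder, and the Secret is assumed to live in the same namespace as the AlertmanagerConfig (here, alert-sample).

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: webhook
  namespace: alert-sample   # must match the AlertmanagerConfig's namespace
type: Opaque
stringData:
  url: https://hooks.slack.com/services/XXXXX   # masked placeholder
```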
After configuring the sample alert and waiting a while, the result is as follows. The alert is delivered to the administrator's AlertManager, and you can confirm that a destination specific to the alert-sample project has been configured.
```
User: alerts
Error from server (NotFound): pods "alertmanager-user-workload-0" not found
Cluster: alerts
Alertname                            Starts At                Summary                                                                                  State
Watchdog                             2024-12-05 23:06:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
UpdateAvailable                      2024-12-05 23:07:36 UTC  Your upstream update recommendation service recommends you update your cluster.          active
PrometheusOperatorRejectedResources  2024-12-05 23:12:32 UTC  Resources rejected by Prometheus operator                                                active
InsightsRecommendationActive         2024-12-05 23:15:03 UTC  An Insights recommendation is active for this cluster.                                   active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
ClusterNotUpgradeable                2024-12-06 00:07:40 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                active
PodDisruptionBudgetAtLimit           2024-12-06 00:09:08 UTC  The pod disruption budget is preventing further disruption to pods.                      active
ExampleUserAlert                     2024-12-06 12:19:23 UTC  This is sample summary                                                                   active
User workload: alertmanager.yaml
Error from server (NotFound): pods "alertmanager-user-workload-0" not found
Cluster: alertmanager.yaml
route:
  receiver: Default
  group_by:
    - namespace
  routes:
    - receiver: alert-sample/slack-routing/sample
      matchers:
        - namespace="alert-sample"
      continue: true
    - receiver: Watchdog
      matchers:
        - alertname = Watchdog
    - receiver: Critical
      matchers:
        - severity = critical
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
inhibit_rules:
  - target_matchers:
      - severity =~ warning|info
    source_matchers:
      - severity = critical
    equal:
      - namespace
      - alertname
  - target_matchers:
      - severity = info
    source_matchers:
      - severity = warning
    equal:
      - namespace
      - alertname
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXXXX
        channel: '#openshift-on-kvm'
  - name: Watchdog
  - name: alert-sample/slack-routing/sample
    slack_configs:
      - api_url: https://hooks.slack.com/services/YYYYYY
        channel: '#openshift-on-kvm'
templates: []
```
The AlertManager receiver name, alert-sample/slack-routing/sample, is long, but alerts from the alert-sample Namespace are routed to it.
Case 3: The administrator sends many alert notifications and wants to spread the load with users
This case is driven by the administrator rather than the users: to spread the load between user alert notifications and administrator alert notifications, separate AlertManagers are used for each.
The configuration is as follows; the user's Monitoring enables its own Alertmanager and enables AlertmanagerConfig.
Administrator's Monitoring configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```
User's Monitoring configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```
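Once this ConfigMap is applied, dedicated alertmanager-user-workload pods should appear. A quick way to check, assuming you are logged in with oc as a cluster admin, is:

```shell
# After alertmanager.enabled is set to true, alertmanager-user-workload-*
# pods should be listed as Running in this namespace.
oc get pods -n openshift-user-workload-monitoring
```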
With this configuration in place, create an AlertmanagerConfig. The content is the same as in the previous case:
```yaml
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: slack-routing
spec:
  route:
    receiver: sample
  receivers:
    - name: sample
      slackConfigs:
        - channel: '#openshift-on-kvm'
          apiURL:
            name: webhook
            key: url
```
After configuring the sample alert and waiting a while, the result is as follows. This time the alert is delivered to the user's AlertManager, and the destination for the alert-sample project is configured in the user's AlertManager.
```
User: alerts
Alertname         Starts At                Summary                 State
ExampleUserAlert  2024-12-06 12:47:43 UTC  This is sample summary  active
Cluster: alerts
Alertname                            Starts At                Summary                                                                                  State
Watchdog                             2024-12-05 23:06:52 UTC  An alert that should always be firing to certify that Alertmanager is working properly.  active
UpdateAvailable                      2024-12-05 23:07:36 UTC  Your upstream update recommendation service recommends you update your cluster.          active
PrometheusOperatorRejectedResources  2024-12-05 23:12:32 UTC  Resources rejected by Prometheus operator                                                active
InsightsRecommendationActive         2024-12-05 23:15:03 UTC  An Insights recommendation is active for this cluster.                                   active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetMisScheduled            2024-12-05 23:22:49 UTC  DaemonSet pods are misscheduled.                                                         active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
KubeDaemonSetRolloutStuck            2024-12-05 23:37:49 UTC  DaemonSet rollout is stuck.                                                              active
ClusterNotUpgradeable                2024-12-06 00:07:40 UTC  One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.  active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                active
PrometheusDuplicateTimestamps        2024-12-06 00:07:55 UTC  Prometheus is dropping samples with duplicate timestamps.                                active
PodDisruptionBudgetAtLimit           2024-12-06 00:09:08 UTC  The pod disruption budget is preventing further disruption to pods.                      active
User workload: alertmanager.yaml
route:
  receiver: Default
  group_by:
    - namespace
  routes:
    - receiver: alert-sample/slack-routing/sample
      matchers:
        - namespace="alert-sample"
      continue: true
receivers:
  - name: Default
  - name: alert-sample/slack-routing/sample
    slack_configs:
      - api_url: https://hooks.slack.com/services/YYYYY
        channel: '#openshift-on-kvm'
templates: []
Cluster: alertmanager.yaml
inhibit_rules:
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info
  - equal:
      - namespace
      - alertname
    source_matchers:
      - severity = warning
    target_matchers:
      - severity = info
receivers:
  - name: Critical
  - name: Default
    slack_configs:
      - channel: '#openshift-on-kvm'
        api_url: >-
          https://hooks.slack.com/services/XXXXX
  - name: Watchdog
route:
  group_by:
    - namespace
  group_interval: 5m
  group_wait: 30s
  receiver: Default
  repeat_interval: 12h
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: Watchdog
    - matchers:
        - severity = critical
      receiver: Critical
```
Summary
We've confirmed that, by changing the administrator's and user's Monitoring settings, alert notification can be arranged in various ways, including delegating destination settings to users. With this, you can design your alerts with confidence. To close, I had ChatGPT make up a riddle, so I'll leave it here as the finale.
What do a letter to Santa Claus and an alert's notification destination have in common?
For both, it matters who they reach! 🎅📩
ãã¤ãããã¾ã§ããã