forked from cloudnative-pg/charts
Added the ability to exclude specific PrometheusRules (cloudnative-pg#232)
* Added the ability to exclude specific PrometheusRules
Signed-off-by: Itay Grudev <[email protected]>
1 parent ac0a34e, commit 7412346
Showing 14 changed files with 206 additions and 164 deletions.
@@ -0,0 +1,24 @@
{{- $alert := "CNPGClusterHACritical" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Cluster has no standby replicas!
  description: |-
    CloudNativePG Cluster "{{ .labels.job }}" has no ready standby replicas. Your cluster is at a severe
    risk of data loss and downtime if the primary instance fails.

    The primary instance is still online and able to serve queries, although connections to the `-ro` endpoint
    will fail. The `-r` endpoint is operating at reduced capacity and all traffic is being served by the primary.

    This can happen during a normal fail-over or automated minor version upgrades in a cluster with 2 or fewer
    instances. The replaced instance may need some time to catch up with the cluster primary instance.

    This alarm will always trigger if your cluster is configured to run with only 1 instance. In this
    case you may want to silence it.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHACritical.md
expr: |
  max by (job) (cnpg_pg_replication_streaming_replicas{namespace="{{ .namespace }}"} - cnpg_pg_replication_is_wal_receiver_up{namespace="{{ .namespace }}"}) < 1
for: 5m
labels:
  severity: critical
{{- end -}}
@@ -0,0 +1,22 @@
{{- $alert := "CNPGClusterHAWarning" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Cluster has fewer than 2 standby replicas.
  description: |-
    CloudNativePG Cluster "{{ .labels.job }}" has only {{ .value }} standby replicas, putting
    your cluster at risk if another instance fails. The cluster is still able to operate normally, although
    the `-ro` and `-r` endpoints operate at reduced capacity.

    This can happen during a normal fail-over or automated minor version upgrades. The replaced instance may
    need some time to catch up with the cluster primary instance.

    This alarm will be constantly triggered if your cluster is configured to run with fewer than 3 instances.
    In this case you may want to silence it.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
expr: |
  max by (job) (cnpg_pg_replication_streaming_replicas{namespace="{{ .namespace }}"} - cnpg_pg_replication_is_wal_receiver_up{namespace="{{ .namespace }}"}) < 2
for: 5m
labels:
  severity: warning
{{- end -}}
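Both HA alerts above state that they fire permanently on clusters with few instances. With the `has $alert .excludeRules` guard, such alerts could be dropped at render time instead of being silenced afterwards. A minimal values sketch of that idea follows. The key path `cluster.monitoring.prometheusRule.excludeRules` is an assumption (the values file is not part of the diff shown here); only the `excludeRules` list, `cluster.instances`, and the alert names come from the templates themselves.

```yaml
# values.yaml (sketch; the key path is assumed, not shown in this diff)
cluster:
  instances: 1
  monitoring:
    prometheusRule:
      # Any alert name listed here makes `has $alert .excludeRules` true,
      # so the matching rule template renders to nothing.
      excludeRules:
        - CNPGClusterHACritical
        - CNPGClusterHAWarning
```

Names must match the `$alert` string in each template exactly, since `has` does a plain equality check against the list entries.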
15 changes: 15 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-high_connection-critical.yaml
@@ -0,0 +1,15 @@
{{- $alert := "CNPGClusterHighConnectionsCritical" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Instance maximum number of connections critical!
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" instance {{ .labels.pod }} is using {{ .value }}% of
    the maximum number of connections.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
expr: |
  sum by (pod) (cnpg_backends_total{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) * 100 > 95
for: 5m
labels:
  severity: critical
{{- end -}}
15 changes: 15 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-high_connection-warning.yaml
@@ -0,0 +1,15 @@
{{- $alert := "CNPGClusterHighConnectionsWarning" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Instance is approaching the maximum number of connections.
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" instance {{ .labels.pod }} is using {{ .value }}% of
    the maximum number of connections.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
expr: |
  sum by (pod) (cnpg_backends_total{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) / max by (pod) (cnpg_pg_settings_setting{name="max_connections", namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) * 100 > 80
for: 5m
labels:
  severity: warning
{{- end -}}
17 changes: 17 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-high_replication_lag.yaml
@@ -0,0 +1,17 @@
{{- $alert := "CNPGClusterHighReplicationLag" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Cluster high replication lag
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" is experiencing a high replication lag of
    {{ .value }}ms.

    High replication lag indicates network issues, busy instances, slow queries or suboptimal configuration.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
expr: |
  max(cnpg_pg_replication_lag{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) * 1000 > 1000
for: 5m
labels:
  severity: warning
{{- end -}}
17 changes: 17 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-instances_on_same_node.yaml
@@ -0,0 +1,17 @@
{{- $alert := "CNPGClusterInstancesOnSameNode" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Cluster instances are located on the same node.
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" has {{ .value }}
    instances on the same node {{ .labels.node }}.

    A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
expr: |
  count by (node) (kube_pod_info{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"}) > 1
for: 5m
labels:
  severity: warning
{{- end -}}
22 changes: 22 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-low_disk_space-critical.yaml
@@ -0,0 +1,22 @@
{{- $alert := "CNPGClusterLowDiskSpaceCritical" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Instance is running out of disk space!
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" is running extremely low on disk space. Check attached PVCs!
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
expr: |
  max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}"} / kubelet_volume_stats_capacity_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}"})) > 0.9 OR
  max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-wal"} / kubelet_volume_stats_capacity_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-wal"})) > 0.9 OR
  max(sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-tbs.*"})
      /
      sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-tbs.*"})
      *
      on(namespace, persistentvolumeclaim) group_left(volume)
      kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"{{ .podSelector }}"}
  ) > 0.9
for: 5m
labels:
  severity: critical
{{- end -}}
22 changes: 22 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-low_disk_space-warning.yaml
@@ -0,0 +1,22 @@
{{- $alert := "CNPGClusterLowDiskSpaceWarning" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Instance is running out of disk space.
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" is running low on disk space. Check attached PVCs.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
expr: |
  max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}"} / kubelet_volume_stats_capacity_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}"})) > 0.7 OR
  max(max by(persistentvolumeclaim) (1 - kubelet_volume_stats_available_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-wal"} / kubelet_volume_stats_capacity_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-wal"})) > 0.7 OR
  max(sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-tbs.*"})
      /
      sum by (namespace,persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="{{ .namespace }}", persistentvolumeclaim=~"{{ .podSelector }}-tbs.*"})
      *
      on(namespace, persistentvolumeclaim) group_left(volume)
      kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~"{{ .podSelector }}"}
  ) > 0.7
for: 5m
labels:
  severity: warning
{{- end -}}
@@ -0,0 +1,17 @@
{{- $alert := "CNPGClusterOffline" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Cluster has no running instances!
  description: |-
    CloudNativePG Cluster "{{ .labels.job }}" has no ready instances.

    Having an offline cluster means your applications will not be able to access the database, leading to
    potential service disruption and/or data loss.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterOffline.md
expr: |
  ({{ .Values.cluster.instances }} - count(cnpg_collector_up{namespace=~"{{ .namespace }}",pod=~"{{ .podSelector }}"}) OR vector(0)) > 0
for: 5m
labels:
  severity: critical
{{- end -}}
16 changes: 16 additions & 0 deletions
charts/cluster/prometheus_rules/cluster-zone_spread-warning.yaml
@@ -0,0 +1,16 @@
{{- $alert := "CNPGClusterZoneSpreadWarning" -}}
{{- if not (has $alert .excludeRules) -}}
alert: {{ $alert }}
annotations:
  summary: CNPG Cluster instances in the same zone.
  description: |-
    CloudNativePG Cluster "{{ .cluster }}" has instances in the same availability zone.

    A disaster in one availability zone will lead to a potential service disruption and/or data loss.
  runbook_url: https://github.com/cloudnative-pg/charts/blob/main/charts/cluster/docs/runbooks/CNPGClusterZoneSpreadWarning.md
expr: |
  {{ .Values.cluster.instances }} > count(count by (label_topology_kubernetes_io_zone) (kube_pod_info{namespace=~"{{ .namespace }}", pod=~"{{ .podSelector }}"} * on(node,instance) group_left(label_topology_kubernetes_io_zone) kube_node_labels)) < 3
for: 5m
labels:
  severity: warning
{{- end -}}
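The rule files above all read from a shared context: `.excludeRules`, `.namespace`, `.cluster`, `.podSelector`, `.value`, `.labels` and `.Values`. The template that builds this context is not among the files shown here, so the following is only a sketch of how a parent template might supply it. The values key path, the use of `.Release.Name` for the cluster name, the pod regex, and the `rules:` wrapper are all assumptions; only the context key names are taken from the rule files.

```yaml
{{- /* Hypothetical parent template; not the chart's actual implementation. */ -}}
{{- $ctx := dict "Values" .Values "Template" .Template -}}
{{- /* Assumed values path for the exclusion list. */ -}}
{{- $_ := set $ctx "excludeRules" .Values.cluster.monitoring.prometheusRule.excludeRules -}}
{{- $_ := set $ctx "namespace" .Release.Namespace -}}
{{- /* Assumption: the cluster is named after the release. */ -}}
{{- $_ := set $ctx "cluster" .Release.Name -}}
{{- $_ := set $ctx "podSelector" (printf "%s-([1-9][0-9]*)$" .Release.Name) -}}
{{- /* Quoted strings: Helm emits them literally so Prometheus expands them at alert time. */ -}}
{{- $_ := set $ctx "value" "{{ $value }}" -}}
{{- $_ := set $ctx "labels" (dict "job" "{{ $labels.job }}" "node" "{{ $labels.node }}" "pod" "{{ $labels.pod }}") -}}
rules:
{{- range $path, $_ := .Files.Glob "prometheus_rules/*.yaml" }}
  {{- $rule := tpl ($.Files.Get $path) $ctx }}
  {{- /* An excluded rule renders to an empty string and is skipped. */ -}}
  {{- if $rule }}
  - {{ $rule | indent 4 | trim }}
  {{- end }}
{{- end }}
```

`Template` is carried in the context because some Helm 3 releases look up `Template.BasePath` inside `tpl`. Keeping the exclusion check inside each rule file, as this commit does, means a parent template like this one only has to drop empty results and needs no per-alert logic.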