DOCS-2166 Service Checks automation (DataDog#9642)
ruthnaebeck authored Jul 8, 2021
1 parent 6dfd836 commit a59f214
Showing 118 changed files with 446 additions and 754 deletions.
4 changes: 2 additions & 2 deletions activemq/README.md
@@ -113,8 +113,7 @@ The ActiveMQ check does not include any events.

### Service Checks

**activemq.can_connect**:<br>
Returns `CRITICAL` if the Agent is unable to connect to and collect metrics from the monitored ActiveMQ instance, otherwise returns `OK`.
See [service_checks.json][16] for a list of service checks provided by this integration.

## Troubleshooting

@@ -142,3 +141,4 @@ Additional helpful documentation, links, and articles:
[13]: https://docs.datadoghq.com/agent/kubernetes/integrations/
[14]: https://docs.datadoghq.com/agent/kubernetes/log/?tab=containerinstallation#setup
[15]: https://github.com/DataDog/integrations-core/blob/master/activemq/datadog_checks/activemq/data/metrics.yaml
[16]: https://github.com/DataDog/integrations-core/blob/master/activemq/assets/service_checks.json
30 changes: 14 additions & 16 deletions airflow/README.md
@@ -303,38 +303,35 @@ Tips for Kubernetes installations:

[Run the Agent's status subcommand][5] and look for `airflow` under the Checks section.

## Data Collected
## Annexe

### Metrics
### Airflow DatadogHook

See [metadata.csv][6] for a list of metrics provided by this check.
In addition, [Airflow DatadogHook][11] can be used to interact with Datadog:

### Service Checks
- Send Metric
- Query Metric
- Post Event

## Data Collected

**airflow.can_connect**:<br>
Returns `CRITICAL` if unable to connect to Airflow. Returns `OK` otherwise.
### Metrics

**airflow.healthy**:<br>
Returns `CRITICAL` if Airflow is not healthy. Returns `OK` otherwise.
See [metadata.csv][6] for a list of metrics provided by this check.

### Events

The Airflow check does not include any events.

## Annexe

### Airflow DatadogHook

In addition, [Airflow DatadogHook][11] can be used to interact with Datadog:
### Service Checks

- Send Metric
- Query Metric
- Post Event
See [service_checks.json][17] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][7].


[1]: https://airflow.apache.org/docs/stable/metrics.html
[2]: https://app.datadoghq.com/account/settings#agent
[3]: https://github.com/DataDog/integrations-core/blob/master/airflow/datadog_checks/airflow/data/conf.yaml.example
@@ -351,3 +348,4 @@ Need help? Contact [Datadog support][7].
[14]: https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=kubernetes#configuration
[15]: https://github.com/DataDog/integrations-core/tree/master/airflow/tests/k8s_sample
[16]: https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
[17]: https://github.com/DataDog/integrations-core/blob/master/airflow/assets/service_checks.json
18 changes: 5 additions & 13 deletions amazon_msk/README.md
@@ -34,23 +34,14 @@ Follow the instructions below to install and configure this check for an Agent r

See [metadata.csv][11] for a list of metrics provided by this check.

### Service Checks

**aws.msk.can_connect**:<br>
Returns `CRITICAL` if the Agent is unable to discover nodes of the MSK cluster. Otherwise, returns `OK`.

**aws.msk.prometheus.health**:<br>
Returns `CRITICAL` if the check cannot access a metrics endpoint. Otherwise, returns `OK`.

When using the Agent 7+ implementation by setting `use_openmetrics` to `true`:

**aws.msk.openmetrics.health**:<br>
Returns `CRITICAL` if the Agent is unable to connect to the OpenMetrics endpoint, otherwise returns `OK`.

### Events

The Amazon MSK check does not include any events.

### Service Checks

See [service_checks.json][14] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][12].
@@ -68,3 +59,4 @@ Need help? Contact [Datadog support][12].
[11]: https://github.com/DataDog/integrations-core/blob/master/amazon_msk/metadata.csv
[12]: https://docs.datadoghq.com/help/
[13]: https://docs.aws.amazon.com/msk/latest/developerguide/open-monitoring.html
[14]: https://github.com/DataDog/integrations-core/blob/master/amazon_msk/assets/service_checks.json
15 changes: 6 additions & 9 deletions ambari/README.md
@@ -113,23 +113,19 @@ If service metrics collection is enabled with `collect_service_metrics` this int

See [metadata.csv][8] for a list of all metrics provided by this integration.

### Service Checks

**ambari.can_connect**:<br>
Returns `OK` if the cluster is reachable, otherwise returns `CRITICAL`.

**ambari.state**:<br>
Returns `OK` if the service is installed or running, `WARNING` if the service is stopping or uninstalling,
or `CRITICAL` if the service is uninstalled or stopped.

### Events

Ambari does not include any events.

### Service Checks

See [service_checks.json][10] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][9].


[1]: https://ambari.apache.org
[2]: https://docs.datadoghq.com/agent/
[3]: https://github.com/DataDog/integrations-core/blob/master/ambari/datadog_checks/ambari/data/conf.yaml.example
@@ -139,3 +135,4 @@ Need help? Contact [Datadog support][9].
[7]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[8]: https://github.com/DataDog/integrations-core/blob/master/ambari/metadata.csv
[9]: https://docs.datadoghq.com/help/
[10]: https://github.com/DataDog/integrations-core/blob/master/ambari/assets/service_checks.json
4 changes: 2 additions & 2 deletions apache/README.md
@@ -213,8 +213,7 @@ The Apache check does not include any events.

### Service Checks

**apache.can_connect**:<br>
Returns `CRITICAL` if the Agent cannot connect to the configured `apache_status_url`, otherwise returns `OK`.
See [service_checks.json][26] for a list of service checks provided by this integration.

## Troubleshooting

@@ -258,3 +257,4 @@ Additional helpful documentation, links, and articles:
[23]: https://docs.datadoghq.com/agent/docker/integrations/?tab=docker
[24]: https://docs.datadoghq.com/agent/amazon_ecs/logs/?tab=linux
[25]: https://docs.datadoghq.com/agent/docker/log/?tab=containerinstallation#log-integrations
[26]: https://github.com/DataDog/integrations-core/blob/master/apache/assets/service_checks.json
13 changes: 5 additions & 8 deletions azure_iot_edge/README.md
@@ -111,18 +111,14 @@ Once the Agent has been deployed to the device, [run the Agent's status subcomma

See [metadata.csv][8] for a list of metrics provided by this check.

### Service Checks

**azure.iot_edge.edge_agent.prometheus.health**:<br>
Returns `CRITICAL` if the Agent is unable to reach the Edge Agent metrics Prometheus endpoint. Returns `OK` otherwise.

**azure.iot_edge.edge_hub.prometheus.health**:<br>
Returns `CRITICAL` if the Agent is unable to reach the Edge Hub metrics Prometheus endpoint. Returns `OK` otherwise.

### Events

Azure IoT Edge does not include any events.

### Service Checks

See [service_checks.json][11] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][9].
@@ -141,3 +137,4 @@ Need help? Contact [Datadog support][9].
[8]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/metadata.csv
[9]: https://docs.datadoghq.com/help/
[10]: https://www.datadoghq.com/blog/monitor-azure-iot-edge-with-datadog/
[11]: https://github.com/DataDog/integrations-core/blob/master/azure_iot_edge/assets/service_checks.json
4 changes: 2 additions & 2 deletions cassandra/README.md
@@ -105,8 +105,7 @@ The Cassandra check does not include any events.

### Service Checks

**cassandra.can_connect**:<br>
Returns `CRITICAL` if the Agent is unable to connect to and collect metrics from the monitored Cassandra instance, otherwise returns `OK`.
See [service_checks.json][16] for a list of service checks provided by this integration.

## Troubleshooting

@@ -133,3 +132,4 @@ Need help? Contact [Datadog support][4].
[13]: https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics
[14]: https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics
[15]: https://www.datadoghq.com/blog/monitoring-cassandra-with-datadog
[16]: https://github.com/DataDog/integrations-core/blob/master/cassandra/assets/service_checks.json
4 changes: 2 additions & 2 deletions cassandra_nodetool/README.md
@@ -51,8 +51,7 @@ The Cassandra_nodetool check does not include any events.

### Service Checks

**cassandra.nodetool.node_up**:<br>
The agent sends this service check for each node of the monitored cluster. Returns `CRITICAL` if the node is down, otherwise `OK`.
See [service_checks.json][14] for a list of service checks provided by this integration.

## Troubleshooting

@@ -77,3 +76,4 @@ Need help? Contact [Datadog support][10].
[11]: https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics
[12]: https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics
[13]: https://www.datadoghq.com/blog/monitoring-cassandra-with-datadog
[14]: https://github.com/DataDog/integrations-core/blob/master/cassandra_nodetool/assets/service_checks.json
60 changes: 2 additions & 58 deletions ceph/README.md
@@ -75,64 +75,7 @@ The Ceph check does not include any events.

### Service Checks

**ceph.overall_status**:<br>
The Datadog Agent submits a service check for each of Ceph's host health checks.

In addition to this service check, the Ceph check also collects a configurable list of health checks for Ceph luminous and later. By default, these are:

**ceph.osd_down**:<br>
Returns `OK` if your OSDs are all up. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.osd_orphan**:<br>
Returns `OK` if you have no orphan OSD. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.osd_full**:<br>
Returns `OK` if your OSDs are not full. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.osd_nearfull**:<br>
Returns `OK` if your OSDs are not near full. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pool_full**:<br>
Returns `OK` if your pools have not reached their quota. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pool_near_full**:<br>
Returns `OK` if your pools are not near reaching their quota. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pg_availability**:<br>
Returns `OK` if there is full data availability. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pg_degraded**:<br>
Returns `OK` if there is full data redundancy. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pg_degraded_full**:<br>
Returns `OK` if there is enough space in the cluster for data redundancy. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pg_damaged**:<br>
Returns `OK` if there are no inconsistencies after data scrubbing. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pg_not_scrubbed**:<br>
Returns `OK` if the PGs were scrubbed recently. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.pg_not_deep_scrubbed**:<br>
Returns `OK` if the PGs were deep scrubbed recently. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.cache_pool_near_full**:<br>
Returns `OK` if the cache pools are not near full. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.too_few_pgs**:<br>
Returns `OK` if the number of PGs is above the min threshold. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.too_many_pgs**:<br>
Returns `OK` if the number of PGs is below the max threshold. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.object_unfound**:<br>
Returns `OK` if all objects can be found. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.request_slow**:<br>
Returns `OK` if requests are taking a normal time to process. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.

**ceph.request_stuck**:<br>
Returns `OK` if requests are taking a normal time to process. Otherwise, returns `WARNING` if the severity is `HEALTH_WARN`, else `CRITICAL`.
See [service_checks.json][11] for a list of service checks provided by this integration.

## Troubleshooting

@@ -151,3 +94,4 @@ Need help? Contact [Datadog support][8].
[8]: https://docs.datadoghq.com/help/
[9]: https://www.datadoghq.com/blog/monitor-ceph-datadog
[10]: https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent
[11]: https://github.com/DataDog/integrations-core/blob/master/ceph/assets/service_checks.json
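
The Ceph descriptions removed from the README in this file all follow one rule: `OK` when the health check is not raised, `WARNING` when its severity is `HEALTH_WARN`, and `CRITICAL` otherwise. A minimal Python sketch of that rule follows. It is not the integration's actual code; the `Status` enum, the `health_check_status` function, and the shape of the `raised` mapping are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical stand-in for service check statuses."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def health_check_status(raised, check_name):
    """Map a Ceph health check to a status.

    `raised` is an assumed mapping from the name of each currently
    raised health check to its severity string, e.g. "HEALTH_WARN".
    """
    severity = raised.get(check_name)
    if severity is None:
        # The health check is not raised at all.
        return Status.OK
    if severity == "HEALTH_WARN":
        return Status.WARNING
    # Any other severity is treated as critical.
    return Status.CRITICAL


raised = {"OSD_DOWN": "HEALTH_WARN", "PG_DAMAGED": "HEALTH_ERR"}
print(health_check_status(raised, "OSD_DOWN").name)
print(health_check_status(raised, "PG_DAMAGED").name)
print(health_check_status(raised, "OSD_FULL").name)
```

Treating every severity other than `HEALTH_WARN` as critical mirrors the "else `CRITICAL`" wording of the removed descriptions.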
10 changes: 5 additions & 5 deletions cilium/README.md
@@ -110,15 +110,14 @@ Collecting logs is disabled by default in the Datadog Agent. To enable it, see [

See [metadata.csv][7] for a list of all metrics provided by this integration.

### Service Checks

**cilium.prometheus.health**:<br>
Returns `CRITICAL` if the Agent cannot reach the metrics endpoints, `OK` otherwise.

### Events

Cilium does not include any events.

### Service Checks

See [service_checks.json][12] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][8].
@@ -134,3 +133,4 @@ Need help? Contact [Datadog support][8].
[9]: https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/?tab=k8sfile#create-manifest
[10]: https://docs.datadoghq.com/agent/kubernetes/log/
[11]: https://docs.datadoghq.com/agent/kubernetes/integrations/
[12]: https://github.com/DataDog/integrations-core/blob/master/cilium/assets/service_checks.json
4 changes: 2 additions & 2 deletions cisco_aci/README.md
@@ -90,8 +90,7 @@ The Cisco ACI check sends tenant faults as events.

### Service Checks

**cisco_aci.can_connect**:<br>
Returns `CRITICAL` if the Agent cannot connect to the Cisco ACI API to collect metrics, otherwise `OK`.
See [service_checks.json][9] for a list of service checks provided by this integration.

## Troubleshooting

@@ -105,3 +104,4 @@ Need help? Contact [Datadog support][8].
[6]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
[7]: https://github.com/DataDog/integrations-core/blob/master/cisco_aci/metadata.csv
[8]: https://docs.datadoghq.com/help/
[9]: https://github.com/DataDog/integrations-core/blob/master/cisco_aci/assets/service_checks.json
11 changes: 6 additions & 5 deletions clickhouse/README.md
@@ -85,19 +85,19 @@ Collecting logs is disabled by default in the Datadog Agent. To enable it, see [

See [metadata.csv][7] for a list of metrics provided by this integration.

### Service Checks

**clickhouse.can_connect**:<br>
Returns `CRITICAL` if the Agent is unable to connect to the monitored ClickHouse database. Otherwise, returns `OK`.

### Events

The ClickHouse check does not include any events.

### Service Checks

See [service_checks.json][10] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][8].


[1]: https://clickhouse.yandex
[2]: https://docs.datadoghq.com/agent/kubernetes/integrations/
[3]: https://docs.datadoghq.com/agent/
@@ -107,3 +107,4 @@ Need help? Contact [Datadog support][8].
[7]: https://github.com/DataDog/integrations-core/blob/master/clickhouse/metadata.csv
[8]: https://docs.datadoghq.com/help/
[9]: https://docs.datadoghq.com/agent/kubernetes/log/
[10]: https://github.com/DataDog/integrations-core/blob/master/clickhouse/assets/service_checks.json
9 changes: 5 additions & 4 deletions cloud_foundry_api/README.md
@@ -29,18 +29,19 @@ No additional installation is needed on your server.

See [metadata.csv][6] for a list of metrics provided by this check.

### Service Checks

See [service_checks.json][8] for a list of service checks provided by this check.

### Events

The Cloud Foundry API integration collects the configured audit events.

### Service Checks

See [service_checks.json][8] for a list of service checks provided by this integration.

## Troubleshooting

Need help? Contact [Datadog support][7].


[1]: http://v3-apidocs.cloudfoundry.org
[2]: https://docs.datadoghq.com/agent/kubernetes/integrations
[3]: https://github.com/DataDog/integrations-core/blob/master/cloud_foundry_api/datadog_checks/cloud_foundry_api/data/conf.yaml.example