add cAdvisor and document detailed alert queries in README
Add cAdvisor container to the monitoring stack for container-level metrics. Configure Alloy to scrape cAdvisor. Expand the README Recommended Alerts section with exact PromQL/LogQL queries, thresholds, and Grafana alert rule configuration for all five alerts.
87
README.md
@@ -330,15 +330,88 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron
 
 ### Recommended Alerts
 
-Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
+Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
+and select the appropriate data source (Prometheus or Loki).
 
-| Alert | Condition | Severity |
-|----------------------|-----------------------------------------------------------------------|----------|
-| Disk usage high | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2 | Critical |
-| Container restarting | Container restart count > 3 in 10 min | Warning |
-| High memory usage | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning |
-| High CPU usage | `node_cpu_seconds_total` idle < 10% sustained 5 min | Warning |
-| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min | Warning |
+| Alert | Condition | Severity |
+|----------------------|--------------------------------------|----------|
+| Disk usage high | Available disk < 20% | Critical |
+| Container restarting | Restart count > 3 in 10 min | Warning |
+| High memory usage | Available memory < 10% | Warning |
+| High CPU usage | CPU usage > 90% sustained 5 min | Warning |
+| Nextcloud cron stale | No cron log lines in 15 min | Warning |
 
+#### Disk usage high
+
+Fires when the root filesystem (`/`) drops below 20% free space.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
+  ```
+- **Expression (B):** Threshold, `A IS BELOW 20`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: critical`
+
+#### Container restarting
+
+Fires when any container's start time changes more than 3 times in 10 minutes, indicating repeated restarts.
+Requires cAdvisor (included in the monitoring stack).
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  changes(container_start_time_seconds{name!=""}[10m])
+  ```
+- **Expression (B):** Threshold, `A IS ABOVE 3`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`
+
+#### High memory usage
+
+Fires when available memory drops below 10% of total.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
+  ```
+- **Expression (B):** Threshold, `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### High CPU usage
+
+Fires when average CPU usage exceeds 90% for 5 minutes, measured as idle time staying below 10%.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
+  ```
+- **Expression (B):** Threshold, `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### Nextcloud cron stale
+
+Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped.
+
+- **Data source:** Loki
+- **Query (A):**
+  ```logql
+  count_over_time({container="/nextcloud-cron"}[15m])
+  ```
+- **Expression (B):** Threshold, `A IS BELOW 1`
+- **Alert condition:** also trigger on **No Data**
+- **Evaluate every:** `5m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`
+
 ### Recommended Dashboards
 
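The README additions above pair each threshold with an evaluation interval and a pending period. As a rough illustration of how those two settings interact, the sketch below models a rule's state across successive evaluations of an `IS BELOW` threshold. It is a simplified model under stated assumptions, not Grafana's implementation: `rule_states` and its parameters are hypothetical names, and the sample values are invented.

```python
from typing import List


def rule_states(values: List[float], threshold: float,
                interval_s: int = 60, pending_s: int = 300) -> List[str]:
    """Return the rule state after each evaluation for an 'IS BELOW' threshold.

    Simplified model (an assumption, not Grafana's code): a breach moves the
    rule to "pending"; it becomes "firing" once the breach has lasted at
    least `pending_s` seconds; any non-breaching sample resets it to "normal".
    """
    states = []
    breach_start = None  # evaluation time (seconds) of the first breaching sample
    for i, value in enumerate(values):
        now = i * interval_s
        if value < threshold:
            if breach_start is None:
                breach_start = now
            firing = now - breach_start >= pending_s
            states.append("firing" if firing else "pending")
        else:
            breach_start = None
            states.append("normal")
    return states


# Hypothetical free-disk percentages, one sample per 1m evaluation.
samples = [25, 19, 18, 17, 16, 15, 14]
print(rule_states(samples, threshold=20))               # 1m interval, 5m pending
print(rule_states(samples, threshold=20, pending_s=0))  # pending period of 0s
```

In this model a `0s` pending period, as used for the container restart and cron alerts, fires on the first breaching evaluation, while a `5m` pending period tolerates short dips.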
@@ -54,6 +54,18 @@ prometheus.scrape "node" {
   scrape_interval = "60s"
 }
 
+// ============================================================
+// cAdvisor container metrics -> Grafana Cloud Prometheus
+// ============================================================
+
+prometheus.scrape "cadvisor" {
+  targets = [{"__address__" = "cadvisor:8080"}]
+  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
+
+  scrape_interval = "60s"
+  metrics_path = "/metrics"
+}
+
 prometheus.remote_write "grafana_cloud" {
   endpoint {
     url = env("GRAFANA_CLOUD_PROMETHEUS_URL")
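The container restart alert relies on `changes(container_start_time_seconds{name!=""}[10m])`, which counts how often a sampled value differs from the previous sample inside the window. With the 60s `scrape_interval` configured here, a 10-minute window holds roughly ten samples. The sketch below mimics that counting to show what the `A IS ABOVE 3` threshold sees; `count_changes` is a hypothetical helper and the timestamps are invented.

```python
from typing import Sequence


def count_changes(samples: Sequence[float]) -> int:
    """Count how many times consecutive samples differ, like PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical container_start_time_seconds values, one per 60s scrape.
stable = [1000] * 10
crash_loop = [1000, 1000, 1130, 1130, 1250, 1250, 1370, 1490, 1490, 1490]

print(count_changes(stable))          # no restarts in the window
print(count_changes(crash_loop))      # four restarts observed
print(count_changes(crash_loop) > 3)  # threshold "A IS ABOVE 3"
```

Restarts that happen between two scrapes collapse into a single change, so a fast crash loop may be undercounted at this scrape interval.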
@@ -33,6 +33,27 @@ services:
         max-size: "10m"
         max-file: "3"
 
+  cadvisor:
+    image: gcr.io/cadvisor/cadvisor:v0.52.1
+    container_name: cadvisor
+    restart: unless-stopped
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - /proc:/host/proc:ro
+      - /sys:/host/sys:ro
+      - /:/rootfs:ro
+    command:
+      - --docker_only=true
+      - --housekeeping_interval=30s
+      - --disable_metrics=accelerator,cpu_topology,disk,diskIO,hugetlb,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp
+    networks:
+      - monitoring
+    logging:
+      driver: json-file
+      options:
+        max-size: "10m"
+        max-file: "3"
+
   alloy:
     image: grafana/alloy:v1.14.1
     container_name: alloy