diff --git a/README.md b/README.md
index aa56e10..349b983 100644
--- a/README.md
+++ b/README.md
@@ -330,15 +330,88 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron
 
 ### Recommended Alerts
 
-Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
+Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
+and select the appropriate data source (Prometheus or Loki).
 
-| Alert                | Condition                                                             | Severity |
-|----------------------|-----------------------------------------------------------------------|----------|
-| Disk usage high      | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2    | Critical |
-| Container restarting | Container restart count > 3 in 10 min                                 | Warning  |
-| High memory usage    | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning  |
-| High CPU usage       | `node_cpu_seconds_total` idle < 10% sustained 5 min                   | Warning  |
-| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min                           | Warning  |
+| Alert                | Condition                            | Severity |
+|----------------------|--------------------------------------|----------|
+| Disk usage high      | Available disk < 20%                 | Critical |
+| Container restarting | Restart count > 3 in 10 min          | Warning  |
+| High memory usage    | Available memory < 10%               | Warning  |
+| High CPU usage       | CPU usage > 90% sustained 5 min      | Warning  |
+| Nextcloud cron stale | No cron log lines in 15 min          | Warning  |
+
+#### Disk usage high
+
+Fires when the root filesystem (`/`) drops below 20% free space.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 20`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: critical`
+
+#### Container restarting
+
+Fires when any container's start time changes more than 3 times in 10 minutes, indicating repeated restarts.
+Requires cAdvisor (included in the monitoring stack).
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  changes(container_start_time_seconds{name!=""}[10m])
+  ```
+- **Expression (B):** Threshold — `A IS ABOVE 3`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`
+
+#### High memory usage
+
+Fires when available memory drops below 10% of total.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### High CPU usage
+
+Fires when average CPU usage exceeds 90% (idle time below 10%) for 5 minutes.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### Nextcloud cron stale
+
+Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped.
+
+- **Data source:** Loki
+- **Query (A):**
+  ```logql
+  count_over_time({container="/nextcloud-cron"}[15m])
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 1`
+- **Alert condition:** also trigger on **No Data**
+- **Evaluate every:** `5m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`
 
 ### Recommended Dashboards
 
diff --git a/monitoring/config.alloy b/monitoring/config.alloy
index 706e5af..c02ebe2 100644
--- a/monitoring/config.alloy
+++ b/monitoring/config.alloy
@@ -54,6 +54,18 @@ prometheus.scrape "node" {
   scrape_interval = "60s"
 }
 
+// ============================================================
+// cAdvisor container metrics -> Grafana Cloud Prometheus
+// ============================================================
+
+prometheus.scrape "cadvisor" {
+  targets = [{"__address__" = "cadvisor:8080"}]
+  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
+
+  scrape_interval = "60s"
+  metrics_path = "/metrics"
+}
+
 prometheus.remote_write "grafana_cloud" {
   endpoint {
     url = env("GRAFANA_CLOUD_PROMETHEUS_URL")
diff --git a/monitoring/docker-compose.yml b/monitoring/docker-compose.yml
index 20c09c2..c4ec2d5 100644
--- a/monitoring/docker-compose.yml
+++ b/monitoring/docker-compose.yml
@@ -33,6 +33,27 @@ services:
         max-size: "10m"
         max-file: "3"
 
+  cadvisor:
+    image: gcr.io/cadvisor/cadvisor:v0.52.1
+    container_name: cadvisor
+    restart: unless-stopped
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - /proc:/host/proc:ro
+      - /sys:/host/sys:ro
+      - /:/rootfs:ro
+    command:
+      - --docker_only=true
+      - --housekeeping_interval=30s
+      - --disable_metrics=accelerator,cpu_topology,disk,diskIO,hugetlb,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp
+    networks:
+      - monitoring
+    logging:
+      driver: json-file
+      options:
+        max-size: "10m"
+        max-file: "3"
+
   alloy:
     image: grafana/alloy:v1.14.1
     container_name: alloy