add cAdvisor and document detailed alert queries in README

Add cAdvisor container to the monitoring stack for container-level metrics. Configure Alloy to scrape cAdvisor. Expand the README Recommended Alerts section with exact PromQL/LogQL queries, thresholds, and Grafana alert rule configuration for all five alerts.
2026-03-22 22:51:22 +01:00
parent c736c23e9a
commit 926766346c
3 changed files with 114 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -330,15 +330,88 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron

 ### Recommended Alerts

-Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
+Set these up in the Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
+and select the appropriate data source (Prometheus or Loki).

-| Alert                | Condition                                                             | Severity |
-|----------------------|-----------------------------------------------------------------------|----------|
-| Disk usage high      | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2    | Critical |
-| Container restarting | Container restart count > 3 in 10 min                                 | Warning  |
-| High memory usage    | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning  |
-| High CPU usage       | `node_cpu_seconds_total` idle < 10% sustained 5 min                   | Warning  |
-| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min                           | Warning  |
+| Alert                | Condition                            | Severity |
+|----------------------|--------------------------------------|----------|
+| Disk usage high      | Available disk < 20%                 | Critical |
+| Container restarting | Restart count > 3 in 10 min          | Warning  |
+| High memory usage    | Available memory < 10%               | Warning  |
+| High CPU usage       | CPU usage > 90% sustained 5 min      | Warning  |
+| Nextcloud cron stale | No cron log lines in 15 min          | Warning  |
+
+#### Disk usage high
+
+Fires when the root filesystem (`/`) drops below 20% free space.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 20`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: critical`
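+
+To watch every mounted filesystem rather than only `/`, a variant of the query could drop the `mountpoint` matcher. This is a sketch, not part of the configured stack: the `fstype` exclusion list is an assumption and may need adjusting to the host.
+
+```promql
+node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"} * 100
+```
+
+Each filesystem then becomes its own alert instance, evaluated against the same `A IS BELOW 20` threshold.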
+
+#### Container restarting
+
+Fires when any container's start time changes more than 3 times in 10 minutes, indicating repeated restarts.
+Requires cAdvisor (included in the monitoring stack).
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  changes(container_start_time_seconds{name!=""}[10m])
+  ```
+- **Expression (B):** Threshold — `A IS ABOVE 3`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`
+
+#### High memory usage
+
+Fires when available memory drops below 10% of total.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### High CPU usage
+
+Fires when average CPU usage exceeds 90% for 5 minutes. The query returns the idle percentage, so the threshold checks for idle time below 10%.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
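+
+An equivalent query, as a sketch, reports usage instead of idle time so that the threshold reads the same way as the summary table. It assumes the same `node_cpu_seconds_total` metric and 5-minute rate window as Query (A):
+
+```promql
+100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+```
+
+With this form, Expression (B) would become `A IS ABOVE 90`; the alert fires under the same conditions.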
+
+#### Nextcloud cron stale
+
+Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped.
+
+- **Data source:** Loki
+- **Query (A):**
+  ```logql
+  count_over_time({container="/nextcloud-cron"}[15m])
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 1`
+- **Alert condition:** also trigger on **No Data**, because `count_over_time` returns no series at all when no log lines match
+- **Evaluate every:** `5m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`

 ### Recommended Dashboards

--- a/monitoring/config.alloy
+++ b/monitoring/config.alloy
@@ -54,6 +54,18 @@ prometheus.scrape "node" {
  scrape_interval = "60s"
 }

+// ============================================================
+// cAdvisor container metrics -> Grafana Cloud Prometheus
+// ============================================================
+
+prometheus.scrape "cadvisor" {
+  targets    = [{"__address__" = "cadvisor:8080"}]
+  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
+
+  scrape_interval = "60s"
+  metrics_path    = "/metrics"
+}
+
 prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_CLOUD_PROMETHEUS_URL")
--- a/monitoring/docker-compose.yml
+++ b/monitoring/docker-compose.yml
@@ -33,6 +33,27 @@ services:
        max-size: "10m"
        max-file: "3"

+  cadvisor:
+    image: gcr.io/cadvisor/cadvisor:v0.52.1
+    container_name: cadvisor
+    restart: unless-stopped
+    volumes:
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - /proc:/host/proc:ro
+      - /sys:/host/sys:ro
+      - /:/rootfs:ro
+    command:
+      - --docker_only=true
+      - --housekeeping_interval=30s
+      - --disable_metrics=accelerator,cpu_topology,disk,diskIO,hugetlb,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp
+    networks:
+      - monitoring
+    logging:
+      driver: json-file
+      options:
+        max-size: "10m"
+        max-file: "3"
+
  alloy:
    image: grafana/alloy:v1.14.1
    container_name: alloy