add cAdvisor and document detailed alert queries in README

Add cAdvisor container to the monitoring stack for container-level metrics. Configure Alloy to scrape cAdvisor. Expand the README Recommended Alerts section with exact PromQL/LogQL queries, thresholds, and Grafana alert rule configuration for all five alerts.
2026-03-22 22:51:22 +01:00
parent c736c23e9a
commit 926766346c
3 changed files with 114 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -330,15 +330,88 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron
 ### Recommended Alerts
-Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
+Set these up in the Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
 and select the appropriate data source (Prometheus or Loki).
 | Alert                | Condition                            | Severity |
-|----------------------|-----------------------------------------------------------------------|----------|
+|----------------------|--------------------------------------|----------|
-| Disk usage high      | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2    | Critical |
+| Disk usage high      | Available disk < 20%                 | Critical |
-| Container restarting | Container restart count > 3 in 10 min                                 | Warning  |
+| Container restarting | Restart count > 3 in 10 min          | Warning  |
-| High memory usage    | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning  |
+| High memory usage    | Available memory < 10%               | Warning  |
-| High CPU usage       | `node_cpu_seconds_total` idle < 10% sustained 5 min                   | Warning  |
+| High CPU usage       | CPU usage > 90% sustained 5 min      | Warning  |
-| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min                           | Warning  |
+| Nextcloud cron stale | No cron log lines in 15 min          | Warning  |
 #### Disk usage high
 Fires when the root filesystem (`/`) has less than 20% of its space available.
 - **Data source:** Prometheus
 - **Query (A):**
  ```promql
  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
  ```
 - **Expression (B):** Threshold — `A IS BELOW 20`
 - **Evaluate every:** `1m`
 - **Pending period (For):** `5m`
 - **Labels:** `severity: critical`
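 The query above matches only the `/` mount. If other mounts should be watched too, a variant along these lines may work. This is a sketch: the `fstype` exclusion list is an assumption and should be adjusted to the filesystems actually present on the host.

 ```promql
 # Percent available per filesystem, skipping pseudo and overlay filesystems (assumed list)
 node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
   / node_filesystem_size_bytes * 100
 ```

 The same `A IS BELOW 20` threshold applies. Each returned series (one per device and mount point) is evaluated as its own alert instance.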
 #### Container restarting
 Fires when any container's start time changes more than 3 times in 10 minutes, indicating repeated restarts.
 Requires cAdvisor (included in the monitoring stack).
 - **Data source:** Prometheus
 - **Query (A):**
  ```promql
  changes(container_start_time_seconds{name!=""}[10m])
  ```
 - **Expression (B):** Threshold — `A IS ABOVE 3`
 - **Evaluate every:** `1m`
 - **Pending period (For):** `0s`
 - **Labels:** `severity: warning`
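 Before relying on this rule, it may help to confirm that cAdvisor series actually reach Prometheus. A minimal check to run in **Explore**, assuming the `name` label carries the Docker container name as cAdvisor normally sets it:

 ```promql
 # One result per running container that cAdvisor reports
 count by (name) (container_start_time_seconds{name!=""})
 ```

 An empty result suggests the cAdvisor scrape is not delivering metrics, in which case the restart alert would never fire.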
 #### High memory usage
 Fires when available memory drops below 10% of total.
 - **Data source:** Prometheus
 - **Query (A):**
  ```promql
  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
  ```
 - **Expression (B):** Threshold — `A IS BELOW 10`
 - **Evaluate every:** `1m`
 - **Pending period (For):** `5m`
 - **Labels:** `severity: warning`
 #### High CPU usage
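 Fires when average CPU usage exceeds 90% for 5 minutes. The query returns the idle percentage, so the threshold is inverted: idle below 10% means usage above 90%.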
 - **Data source:** Prometheus
 - **Query (A):**
  ```promql
  avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
  ```
 - **Expression (B):** Threshold — `A IS BELOW 10`
 - **Evaluate every:** `1m`
 - **Pending period (For):** `5m`
 - **Labels:** `severity: warning`
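 If a usage-oriented number is preferred in alert messages or dashboards, the same data can be expressed as a busy percentage. A sketch of the equivalent query:

 ```promql
 # CPU busy percentage per instance: 100 minus the averaged idle percentage
 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
 ```

 With this form the threshold expression becomes `A IS ABOVE 90`; the evaluation interval and pending period stay the same.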
 #### Nextcloud cron stale
 Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped.
 - **Data source:** Loki
 - **Query (A):**
  ```logql
  count_over_time({container="/nextcloud-cron"}[15m])
  ```
 - **Expression (B):** Threshold — `A IS BELOW 1`
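 - **No data handling:** set **Alert state if no data or all values are null** to **Alerting**; `count_over_time` returns no series at all when there are no matching log lines, so the threshold alone never fires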
 - **Evaluate every:** `5m`
 - **Pending period (For):** `0s`
 - **Labels:** `severity: warning`
 ### Recommended Dashboards
--- a/monitoring/config.alloy
+++ b/monitoring/config.alloy
@@ -54,6 +54,18 @@ prometheus.scrape "node" {
  scrape_interval = "60s"
 }
 // ============================================================
 // cAdvisor container metrics -> Grafana Cloud Prometheus
 // ============================================================
 prometheus.scrape "cadvisor" {
  targets    = [{"__address__" = "cadvisor:8080"}]
  forward_to = [prometheus.remote_write.grafana_cloud.receiver]
  scrape_interval = "60s"
  metrics_path    = "/metrics"
 }
 prometheus.remote_write "grafana_cloud" {
  endpoint {
    url = env("GRAFANA_CLOUD_PROMETHEUS_URL")
--- a/monitoring/docker-compose.yml
+++ b/monitoring/docker-compose.yml
@@ -33,6 +33,27 @@ services:
        max-size: "10m"
        max-file: "3"
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.52.1
    container_name: cadvisor
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --docker_only=true
      - --housekeeping_interval=30s
      - --disable_metrics=accelerator,cpu_topology,disk,diskIO,hugetlb,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp
    networks:
      - monitoring
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
  alloy:
    image: grafana/alloy:v1.14.1
    container_name: alloy