add cAdvisor and document detailed alert queries in README

Add cAdvisor container to the monitoring stack for container-level metrics. Configure Alloy to scrape cAdvisor. Expand the README Recommended Alerts section with exact PromQL/LogQL queries, thresholds, and Grafana alert rule configuration for all five alerts.
2026-03-22 22:51:22 +01:00
parent c736c23e9a
commit 926766346c
3 changed files with 114 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -330,15 +330,88 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron

 ### Recommended Alerts

-Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
+Set these up in the Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
+and select the appropriate data source (Prometheus or Loki).

-| Alert                | Condition                                                             | Severity |
-|----------------------|-----------------------------------------------------------------------|----------|
-| Disk usage high      | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2    | Critical |
-| Container restarting | Container restart count > 3 in 10 min                                 | Warning  |
-| High memory usage    | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning  |
-| High CPU usage       | `node_cpu_seconds_total` idle < 10% sustained 5 min                   | Warning  |
-| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min                           | Warning  |
+| Alert                | Condition                            | Severity |
+|----------------------|--------------------------------------|----------|
+| Disk usage high      | Available disk < 20%                 | Critical |
+| Container restarting | Restart count > 3 in 10 min          | Warning  |
+| High memory usage    | Available memory < 10%               | Warning  |
+| High CPU usage       | CPU usage > 90% sustained 5 min      | Warning  |
+| Nextcloud cron stale | No cron log lines in 15 min          | Warning  |
+
+#### Disk usage high
+
+Fires when the root filesystem (`/`) drops below 20% free space.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 20`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: critical`
+
+#### Container restarting
+
+Fires when any container's start time changes more than 3 times in 10 minutes, indicating repeated restarts.
+Requires cAdvisor (included in the monitoring stack).
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  changes(container_start_time_seconds{name!=""}[10m])
+  ```
+- **Expression (B):** Threshold — `A IS ABOVE 3`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`
+
+#### High memory usage
+
+Fires when available memory drops below 10% of total.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### High CPU usage
+
+Fires when average CPU usage exceeds 90% for 5 minutes. The query returns the idle percentage, so the threshold checks for idle time below 10%.
+
+- **Data source:** Prometheus
+- **Query (A):**
+  ```promql
+  avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 10`
+- **Evaluate every:** `1m`
+- **Pending period (For):** `5m`
+- **Labels:** `severity: warning`
+
+#### Nextcloud cron stale
+
+Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped.
+
+- **Data source:** Loki
+- **Query (A):**
+  ```logql
+  count_over_time({container="/nextcloud-cron"}[15m])
+  ```
+- **Expression (B):** Threshold — `A IS BELOW 1`
+- **Alert condition:** also trigger on **No Data** (`count_over_time` returns no series, rather than `0`, when no log lines match)
+- **Evaluate every:** `5m`
+- **Pending period (For):** `0s`
+- **Labels:** `severity: warning`

 ### Recommended Dashboards