add cAdvisor and document detailed alert queries in README

Add cAdvisor container to the monitoring stack for container-level
metrics. Configure Alloy to scrape cAdvisor. Expand the README
Recommended Alerts section with exact PromQL/LogQL queries, thresholds,
and Grafana alert rule configuration for all five alerts.
2026-03-22 22:51:22 +01:00
parent c736c23e9a
commit 926766346c
3 changed files with 114 additions and 8 deletions

@@ -330,15 +330,88 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron
### Recommended Alerts
Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
Set these up in the Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
and select the appropriate data source (Prometheus or Loki).
| Alert | Condition | Severity |
|----------------------|-----------------------------------------------------------------------|----------|
| Disk usage high | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2 | Critical |
| Container restarting | Container restart count > 3 in 10 min | Warning |
| High memory usage | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning |
| High CPU usage | `node_cpu_seconds_total` idle < 10% sustained 5 min | Warning |
| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min | Warning |
| Alert | Condition | Severity |
|----------------------|--------------------------------------|----------|
| Disk usage high | Available disk < 20% | Critical |
| Container restarting | Restart count > 3 in 10 min | Warning |
| High memory usage | Available memory < 10% | Warning |
| High CPU usage | CPU usage > 90% sustained 5 min | Warning |
| Nextcloud cron stale | No cron log lines in 15 min | Warning |
#### Disk usage high
Fires when the root filesystem (`/`) drops below 20% free space.
- **Data source:** Prometheus
- **Query (A):**
```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
```
- **Expression (B):** Threshold — `A IS BELOW 20`
- **Evaluate every:** `1m`
- **Pending period (For):** `5m`
- **Labels:** `severity: critical`
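
The query above watches only the root mount. As a rough sketch, a variant that covers every mounted filesystem could filter on node_exporter's `fstype` label instead. The list of excluded filesystem types is an assumption and should be adjusted to the host:

```promql
# Percent free per filesystem, skipping pseudo filesystems.
# The exclusion list (tmpfs, overlay, squashfs) is an assumed starting point.
node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
  / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"} * 100
```

The same `IS BELOW 20` threshold would apply, and each filesystem would produce its own alert instance.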
#### Container restarting
Fires when any container's start time changes more than 3 times in 10 minutes, indicating repeated restarts.
Requires cAdvisor (included in the monitoring stack).
- **Data source:** Prometheus
- **Query (A):**
```promql
changes(container_start_time_seconds{name!=""}[10m])
```
- **Expression (B):** Threshold — `A IS ABOVE 3`
- **Evaluate every:** `1m`
- **Pending period (For):** `0s`
- **Labels:** `severity: warning`
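
cAdvisor attaches several labels to each series, so the raw query can return more label detail than the alert needs. If one alert instance per container name is preferred, aggregating by `name` is one option. This is a sketch, assuming the `name` label identifies the container as it does in the query above:

```promql
# Highest restart count in the last 10 minutes, one series per container name.
max by (name) (changes(container_start_time_seconds{name!=""}[10m]))
```

The threshold stays at `A IS ABOVE 3`.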
#### High memory usage
Fires when available memory drops below 10% of total.
- **Data source:** Prometheus
- **Query (A):**
```promql
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
```
- **Expression (B):** Threshold — `A IS BELOW 10`
- **Evaluate every:** `1m`
- **Pending period (For):** `5m`
- **Labels:** `severity: warning`
#### High CPU usage
Fires when average CPU usage exceeds 90% for 5 minutes. The query returns the idle percentage, so the threshold checks for idle time below 10%.
- **Data source:** Prometheus
- **Query (A):**
```promql
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```
- **Expression (B):** Threshold — `A IS BELOW 10`
- **Evaluate every:** `1m`
- **Pending period (For):** `5m`
- **Labels:** `severity: warning`
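
The query above returns the idle percentage, which is why its threshold reads `IS BELOW 10`. An equivalent form that reports usage directly, offered as an alternative sketch rather than the configured rule, subtracts the idle share from 100:

```promql
# CPU usage percent per instance: 100 minus the average idle percentage.
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```

With this form the threshold expression would become `A IS ABOVE 90`.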
#### Nextcloud cron stale
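Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped. The selector matches the label value `/nextcloud-cron`; Docker's API reports container names with a leading slash, so adjust the value if your relabeling strips it.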
- **Data source:** Loki
- **Query (A):**
```logql
count_over_time({container="/nextcloud-cron"}[15m])
```
- **Expression (B):** Threshold — `A IS BELOW 1`
- **Alert condition:** also trigger on **No Data**
- **Evaluate every:** `5m`
- **Pending period (For):** `0s`
- **Labels:** `severity: warning`
### Recommended Dashboards