Compare commits

18 Commits

Author SHA1 Message Date
Thomas Gräfenstein
2281ebcb6d improved container restart alert 2026-03-22 23:56:27 +01:00
Thomas Gräfenstein
2942ff15bc remove unused /dev/kmsg device mount from cAdvisor (oom_event is disabled) 2026-03-22 23:34:29 +01:00
Thomas Gräfenstein
24e80de43c upgrade cAdvisor to v0.54.1 for Docker 29 containerd image store support 2026-03-22 23:30:24 +01:00
Thomas Gräfenstein
cfc8b61f98 connect cAdvisor to containerd socket for Docker 29 image store compatibility 2026-03-22 23:24:50 +01:00
Thomas Gräfenstein
b063128049 grant cAdvisor privileged access for cgroup v2 container discovery 2026-03-22 23:17:33 +01:00
Thomas Gräfenstein
a07adedd00 fix cAdvisor container discovery by mounting /sys and /var/lib/docker correctly 2026-03-22 23:14:32 +01:00
Thomas Gräfenstein
31705ad888 fix cAdvisor crash by removing unsupported accelerator metric group 2026-03-22 23:06:34 +01:00
Thomas Gräfenstein
b5c5c11114 ensure monitoring stack starts before all other services 2026-03-22 22:55:42 +01:00
Thomas Gräfenstein
926766346c add cAdvisor and document detailed alert queries in README
Add cAdvisor container to the monitoring stack for container-level
metrics. Configure Alloy to scrape cAdvisor. Expand the README
Recommended Alerts section with exact PromQL/LogQL queries, thresholds,
and Grafana alert rule configuration for all five alerts.
2026-03-22 22:51:22 +01:00
Thomas Gräfenstein
c736c23e9a enable NETWORKS in docker-socket-proxy for Alloy container discovery 2026-03-22 21:27:26 +01:00
Thomas Gräfenstein
a02f33e96e move text compression from Caddy to nginx for lower latency
Nginx is closer to the origin, so compressing there avoids an
extra hop. Removes the Caddy encode block for Nextcloud and adds
gzip in nginx with level 4 targeting text, CSS, JS, JSON, XML, SVG.
2026-03-22 21:08:40 +01:00
Thomas Gräfenstein
d62b627093 add .mjs MIME type to nginx to fix NS_ERROR_CORRUPTED_CONTENT
nginx doesn't know .mjs by default and serves it as
application/octet-stream, which breaks ES module loading
and causes Caddy compression mismatches.
2026-03-22 20:56:10 +01:00
Thomas Gräfenstein
fb1de4f079 limit Caddy compression to text content types to fix slow file downloads
Caddy was compressing all responses including binary file downloads
(PDFs, images, videos), which severely throttled download speed to
~130KB/s despite 30MB/s VPS bandwidth. Now only compresses text-based
types (HTML, CSS, JS, JSON, XML, SVG) where compression actually helps.
2026-03-22 20:26:03 +01:00
Thomas Gräfenstein
3bf80f6940 disable file compression temporary 2026-03-22 20:20:37 +01:00
Thomas Gräfenstein
1c2fb3c807 fix nginx redirect loop 2026-03-22 18:12:18 +01:00
Thomas Gräfenstein
b918e713e5 align nginx and Caddy config with official Nextcloud docs
Move security headers to Caddy (edge proxy), remove nginx gzip
(Caddy already compresses), add asset_immutable map for versioned
cache control, add missing static file extensions, fix .well-known
block, and hide X-Powered-By header.
2026-03-22 17:58:26 +01:00
Thomas Gräfenstein
ac3bff9351 fix nginx to fall through to PHP for dynamic assets like theming CSS
Static file locations were returning hard 404s instead of falling
through to PHP, which broke dynamically generated assets like
theming CSS files.
2026-03-22 17:49:45 +01:00
Thomas Gräfenstein
0088c11d5e enable Caddy response compression to fix slow page loads
Caddy was decompressing nginx's gzip responses and sending them
uncompressed to the browser, causing core-common.js (5.7MB) to
take 25s to download. Adding encode zstd gzip compresses it to
1.3MB at the edge.
2026-03-22 17:43:24 +01:00
8 changed files with 168 additions and 28 deletions

View File

@@ -330,15 +330,91 @@ or low cost, and Restic handles encryption + deduplication automatically. A cron
### Recommended Alerts
Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules**):
Set these up in Grafana Cloud UI (**Alerting** -> **Alert rules** -> **New alert rule**). Choose **Grafana-managed rule**
and select the appropriate data source (Prometheus or Loki).
| Alert | Condition | Severity |
|----------------------|-----------------------------------------------------------------------|----------|
| Disk usage high | `node_filesystem_avail_bytes` / `node_filesystem_size_bytes` < 0.2 | Critical |
| Container restarting | Container restart count > 3 in 10 min | Warning |
| High memory usage | `node_memory_MemAvailable_bytes` / `node_memory_MemTotal_bytes` < 0.1 | Warning |
| High CPU usage | `node_cpu_seconds_total` idle < 10% sustained 5 min | Warning |
| Nextcloud cron stale | No log line from `nextcloud-cron` in 15 min | Warning |
| Alert | Condition | Severity |
|----------------------|--------------------------------------|----------|
| Disk usage high | Available disk < 20% | Critical |
| Container restarting | Restart count > 3 in 10 min | Warning |
| High memory usage | Available memory < 10% | Warning |
| High CPU usage | CPU usage > 90% sustained 5 min | Warning |
| Nextcloud cron stale | No cron log lines in 15 min | Warning |
#### Disk usage high
Fires when the root filesystem (`/`) drops below 20% free space.
- **Data source:** Prometheus
- **Query (A):**
```promql
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
```
- **Expression (B):** Threshold — `A IS BELOW 20`
- **Evaluate every:** `1m`
- **Pending period (For):** `5m`
- **Labels:** `severity: critical`
#### Container restarting
Fires when any container restarts more than 3 times in 10 minutes, indicating a crash loop.
Detects both in-place restarts (`docker restart`) and ID-changing restarts (`docker compose down/up`).
Requires cAdvisor (included in the monitoring stack).
- **Data source:** Prometheus
- **Query (A):**
```promql
sum by (name) (changes(container_start_time_seconds{name!=""}[10m]))
+
count by (name) (count_over_time(container_start_time_seconds{name!=""}[10m])) - 1
```
- **Expression (B):** Threshold — `A IS ABOVE 3`
- **Evaluate every:** `1m`
- **Pending period (For):** `0s`
- **Labels:** `severity: warning`
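To see why the query adds two terms, the sketch below mimics it in Python on hand-made data. It is an illustration under assumptions, not how Prometheus evaluates the expression: `changes` and `restart_estimate` are hypothetical helper names, and each list stands for the samples of one `container_start_time_seconds` series (one container ID) inside the 10-minute window.

```python
from typing import Dict, List


def changes(samples: List[float]) -> int:
    """Count value changes between consecutive samples, similar to PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def restart_estimate(series_by_id: Dict[str, List[float]]) -> int:
    """Approximate the alert query for a single container name.

    series_by_id maps a container ID to the start-time samples seen in the
    window. IDs with no samples in the window must be left out, because the
    query would not return a series for them either.
    """
    if not series_by_id:
        raise ValueError("no series in the window")
    # First term: start time changed within the same series (in-place restart).
    in_place = sum(changes(samples) for samples in series_by_id.values())
    # Second term: every additional series means the container was recreated.
    recreated = len(series_by_id) - 1
    return in_place + recreated


# Hypothetical data: one in-place restart, then the container is recreated.
example = {
    "id-a": [100.0, 100.0, 250.0, 250.0],
    "id-b": [600.0, 600.0],
}
print(restart_estimate(example))
```

With this data the first term contributes one change and the second term contributes one extra series, so a single name can cross the threshold through either kind of restart or a mix of both.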
#### High memory usage
Fires when available memory drops below 10% of total.
- **Data source:** Prometheus
- **Query (A):**
```promql
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
```
- **Expression (B):** Threshold — `A IS BELOW 10`
- **Evaluate every:** `1m`
- **Pending period (For):** `5m`
- **Labels:** `severity: warning`
#### High CPU usage
Fires when average CPU usage exceeds 90% for 5 minutes. The query returns the idle percentage, so the threshold checks for idle time below 10%.
- **Data source:** Prometheus
- **Query (A):**
```promql
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```
- **Expression (B):** Threshold — `A IS BELOW 10`
- **Evaluate every:** `1m`
- **Pending period (For):** `5m`
- **Labels:** `severity: warning`
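The arithmetic behind this rule can be sketched in Python as follows. This is a simplified model, not the actual `rate()` implementation: it ignores counter resets and extrapolation, and `idle_percent` and `cpu_alert_fires` are hypothetical names used only for illustration.

```python
from typing import List, Tuple


def idle_percent(idle_counters: List[Tuple[float, float]], window_seconds: float) -> float:
    """Average idle share across CPUs, in percent.

    Each tuple holds one CPU's idle counter (in seconds) at the start and at
    the end of the window, loosely mirroring rate(...[5m]) followed by avg.
    """
    rates = [(end - start) / window_seconds for start, end in idle_counters]
    return sum(rates) / len(rates) * 100


def cpu_alert_fires(idle_pct: float, threshold: float = 10.0) -> bool:
    """Mirror the 'A IS BELOW 10' threshold: little idle time means high usage."""
    return idle_pct < threshold


# Hypothetical 5-minute window on a 2-core host that is almost fully busy.
busy = idle_percent([(1000.0, 1006.0), (2000.0, 2012.0)], window_seconds=300.0)
print(busy, cpu_alert_fires(busy))
```

Alerting on the idle share rather than summing the busy modes keeps the query to a single `mode` selector; CPU usage is simply `100` minus the value returned.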
#### Nextcloud cron stale
Fires when no log output from the `nextcloud-cron` container appears for 15 minutes, indicating background jobs have stopped.
- **Data source:** Loki
- **Query (A):**
```logql
count_over_time({container="/nextcloud-cron"}[15m])
```
- **Expression (B):** Threshold — `A IS BELOW 1`
- **Alert condition:** also trigger on **No Data**
- **Evaluate every:** `5m`
- **Pending period (For):** `0s`
- **Labels:** `severity: warning`
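The No Data setting matters because of how an empty window behaves. The Python sketch below models that behaviour under one assumption: that the query returns an empty result, rather than `0`, when the stream has no lines in the window. `count_in_window` and `cron_stale` are hypothetical names, and timestamps are plain seconds.

```python
from typing import List, Optional


def count_in_window(timestamps: List[float], now: float, window_seconds: float) -> Optional[int]:
    """Count log lines in (now - window, now]; None models an empty query result."""
    count = sum(1 for t in timestamps if now - window_seconds < t <= now)
    return count if count > 0 else None


def cron_stale(count: Optional[int]) -> bool:
    """Fire on No Data or when fewer than one line was seen in the window."""
    return count is None or count < 1


# Hypothetical log timestamps: the last cron line is older than 15 minutes.
now = 2000.0
silent = count_in_window([100.0, 400.0], now, window_seconds=900.0)
print(silent, cron_stale(silent))
```

Under that assumption, the `A IS BELOW 1` threshold alone would have nothing to evaluate once the logs stop, so it is the No Data condition that makes the alert fire.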
### Recommended Dashboards

View File

@@ -12,6 +12,11 @@ nextcloud.t-gstone.de {
reverse_proxy nextcloud-nginx:80
header Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
header Referrer-Policy "no-referrer"
header X-Content-Type-Options "nosniff"
header X-Frame-Options "SAMEORIGIN"
header X-Permitted-Cross-Domain-Policies "none"
header X-Robots-Tag "noindex, nofollow"
request_body {
max_size 10G

View File

@@ -3,6 +3,8 @@ services:
image: caddy:2-alpine
container_name: caddy
restart: unless-stopped
depends_on:
- alloy
ports:
- "80:80"
- "443:443"

View File

@@ -3,6 +3,8 @@ services:
image: gitea/gitea:1.25.5-rootless
container_name: gitea
restart: unless-stopped
depends_on:
- alloy
env_file: .env
volumes:
- ${DATA_ROOT}/gitea/data:/var/lib/gitea

View File

@@ -54,6 +54,18 @@ prometheus.scrape "node" {
scrape_interval = "60s"
}
// ============================================================
// cAdvisor container metrics -> Grafana Cloud Prometheus
// ============================================================
prometheus.scrape "cadvisor" {
targets = [{"__address__" = "cadvisor:8080"}]
forward_to = [prometheus.remote_write.grafana_cloud.receiver]
scrape_interval = "60s"
metrics_path = "/metrics"
}
prometheus.remote_write "grafana_cloud" {
endpoint {
url = env("GRAFANA_CLOUD_PROMETHEUS_URL")

View File

@@ -16,7 +16,7 @@ services:
- EXEC=0
- IMAGES=0
- INFO=0
- NETWORKS=0
- NETWORKS=1
- NODES=0
- PLUGINS=0
- SERVICES=0
@@ -33,6 +33,29 @@ services:
max-size: "10m"
max-file: "3"
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.54.1
container_name: cadvisor
restart: unless-stopped
privileged: true
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
- /run/containerd/containerd.sock:/run/containerd/containerd.sock:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
command:
- --docker_only=true
- --housekeeping_interval=30s
- --containerd=/run/containerd/containerd.sock
- --disable_metrics=cpu_topology,disk,diskIO,hugetlb,memory_numa,network,oom_event,percpu,perf_event,process,referenced_memory,resctrl,sched,tcp,udp
networks:
- monitoring
logging:
driver: json-file
options:
max-size: "10m"
max-file: "3"
alloy:
image: grafana/alloy:v1.14.1
container_name: alloy

View File

@@ -57,6 +57,8 @@ services:
image: postgres:17-alpine
container_name: nextcloud-postgres
restart: unless-stopped
depends_on:
- alloy
env_file: .env
volumes:
- ${DATA_ROOT}/nextcloud/db:/var/lib/postgresql/data
@@ -77,6 +79,8 @@ services:
image: redis:8-alpine
container_name: nextcloud-redis
restart: unless-stopped
depends_on:
- alloy
command: redis-server --requirepass ${REDIS_PASSWORD}
env_file: .env
networks:

View File

@@ -6,19 +6,30 @@ map $uri $nonce_uri {
default "";
}
map $arg_v $asset_immutable {
"" "";
default ", immutable";
}
server {
listen 80;
server_name _;
client_max_body_size 10G;
client_body_timeout 300s;
fastcgi_buffers 64 4K;
include mime.types;
types {
application/javascript mjs;
}
gzip on;
gzip_vary on;
gzip_comp_level 4;
gzip_min_length 256;
gzip_types application/javascript application/json text/css text/plain text/xml application/xml image/svg+xml;
gzip_proxied any;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml image/svg+xml;
client_max_body_size 10G;
client_body_timeout 300s;
fastcgi_buffers 64 4K;
root /var/www/html;
index index.php index.html /index.php$request_uri;
@@ -27,27 +38,18 @@ server {
location ^~ /.well-known {
location = /.well-known/carddav { return 301 /remote.php/dav/; }
location = /.well-known/caldav { return 301 /remote.php/dav/; }
location ^~ /.well-known { return 301 /index.php$uri; }
location /.well-known/acme-challenge { try_files $uri $uri/ =404; }
location /.well-known/pki-validation { try_files $uri $uri/ =404; }
return 301 /index.php$request_uri;
}
# Deny access to internal paths
location ~ ^/(?:build|tests|config|lib|3rdparty|templates|data)(?:$|/) { return 404; }
location ~ ^/(?:\.|autotest|occ|issue|indie|db_|console) { return 404; }
# Serve static files directly — only if file exists on disk
location ~ \.(?:css|js|mjs|svg|gif|png|jpg|ico|wasm|tflite|map|ogg|flac)$ {
try_files $uri =404;
expires 6M;
access_log off;
}
location ~ \.woff2?$ {
try_files $uri =404;
expires 7d;
access_log off;
}
# PHP handling
# PHP handling (must be before static file locations so that internal
# redirects like /index.php/apps/theming/theme/dark.css match here
# instead of cycling back into the static file try_files)
location ~ \.php(?:$|/) {
fastcgi_split_path_info ^(.+?\.php)(/.*)$;
set $path_info $fastcgi_path_info;
@@ -60,10 +62,24 @@ server {
fastcgi_param front_controller_active true;
fastcgi_pass php-handler;
fastcgi_intercept_errors on;
fastcgi_hide_header X-Powered-By;
fastcgi_request_buffering off;
fastcgi_max_temp_file_size 0;
}
# Serve static files directly, fall through to PHP for dynamic assets (e.g. theming)
location ~ \.(?:css|js|mjs|svg|gif|ico|jpg|png|webp|wasm|tflite|map|ogg|flac|mp4|webm)$ {
try_files $uri /index.php$request_uri;
add_header Cache-Control "public, max-age=15778463$asset_immutable";
access_log off;
}
location ~ \.woff2?$ {
try_files $uri /index.php$request_uri;
expires 7d;
access_log off;
}
# Default handler — route everything else through PHP front controller
location / {
rewrite ^ /index.php$request_uri last;