RCA: the exabgp container was OOM-killed — its 512m mem_limit was far too
small for the full-table feature (900K route objects in memory). Raises the
limit to a parameterized 6g default (EXABGP_MEM_LIMIT).
Adds Docker healthchecks to 14 services (port/HTTP probes) so unhealthy
containers are visible. Adds a Telegraf docker input that collects per-
container CPU/memory/IO into InfluxDB, plus a "Stack Resources" dashboard —
so resource pressure is caught before it causes an OOM crash. telegraf runs
with an overridden entrypoint so it keeps root and can read the docker socket.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The telemetry dashboards' router/interface variables used a keep|distinct
Flux pattern that returned only one source; switch to schema.tagValues so all
streaming routers and interfaces are listed. Parameterize telegraf.conf gNMI
addresses and credentials via GNMI_ADDRESSES/GNMI_USERNAME/GNMI_PASSWORD so
the telemetry fleet can scale without editing the config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets mem_limit on every service to cap the OOM/swap-exhaustion risk (the lab
host had only 5 MiB swap free). The three heavy services (psql, kafka,
psql-app) read their limits from .env so production can raise them; the rest
use lab-appropriate fixed values. Total ~25 GB, leaving headroom on the 31 GB
lab host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the Compose project name and splits services into core / test / auth
profiles so the BMP collector core can deploy standalone. Adds setup.sh
(idempotent bootstrap), .env.example, and repo-resident Authelia config
templates so a fresh host deploys without manual steps. Parameterizes
hardcoded host IP and domain; points the Grafana InfluxDB datasource at the
container name.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>