Telegraf now collects host filesystem usage ([[inputs.disk]], via a
read-only /hostfs mount) and PostgreSQL database + per-table sizes
([[inputs.postgresql_extensible]]) into InfluxDB. Surfaces RIB growth
and disk pressure — relevant now that the full-table GoBGP feed has
pushed the openbmp DB to ~30 GB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RCA: the exabgp container was OOM-killed — its 512m mem_limit was far too
small for the full-table feature (900K route objects in memory). Raises the
limit to a parameterized 6g default (EXABGP_MEM_LIMIT).
Adds Docker healthchecks to 14 services (port/HTTP probes) so unhealthy
containers are visible. Adds a Telegraf docker input that collects per-
container CPU/memory/IO into InfluxDB, plus a "Stack Resources" dashboard —
so resource pressure is caught before it causes an OOM crash. telegraf runs
with an overridden entrypoint so it keeps root and can read the docker socket.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The telemetry dashboards' router/interface variables used a keep|distinct
Flux pattern that returned only one source; switch to schema.tagValues so all
streaming routers and interfaces are listed. Parameterize telegraf.conf gNMI
addresses and credentials via GNMI_ADDRESSES/GNMI_USERNAME/GNMI_PASSWORD so
the telemetry fleet can scale without editing the config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Switch Telegraf from native IOS-XR YANG paths to OpenConfig
(openconfig-interfaces:interfaces/interface/state/counters)
- Use json_ietf encoding instead of proto (IOS-XR 24.3.1 compat)
- Target only CORE-01/CORE-02 (R9K routers blocked by CML mgmt net)
- Update all 3 Grafana dashboard queries to match OpenConfig field
names (in-octets, out-octets, in-pkts, out-pkts, in-errors, etc.)
- Rewrite gnmi_grpc_config.py to use SSH/CLI via paramiko instead of
NETCONF (IOS-XR 24.3.1 rejects NETCONF gRPC edit-config)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- gNMI integration: NETCONF script to enable gRPC on all 9 routers,
Telegraf container with gnmi input plugin, InfluxDB for time-series
storage, 3 Grafana telemetry dashboards (utilization, errors, combined)
- Traffic generator: Scapy-based dual-mode container (sender/responder)
with Flask API, RFC 2544 test suite (throughput, latency, frame-loss,
back-to-back), Vue 3 web UI with flow builder, test runner, real-time
stats monitor, and results export
- docker-compose.yml updated with influxdb, telegraf, traffic-gen,
traffic-gen-ui services
- Full documentation in DOCS.md sections 15-16
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>