Removes hardcoded 10.40.40.202 references so a fresh clone + .env-only
edit can stand the stack up on a new compute node.
* docker-compose.yml: rib-poller PG_DSN now uses ${HOST_IP:-...}.
* obmp-rib-poller/poller.py: default PG_DSN host falls back to
${HOST_IP} env (compose passes it; manual runs honour $HOST_IP too).
* cml/gobgp_peering_config.py: GOBGP_IP read from $HOST_IP or the
HOST_IP= line in repo-root .env, with a small _env_default helper.
* cml/proxmox_bmp_config.py: COLLECTOR_HOST resolved the same way.
For gobgp/gobgpd.conf and gobgp-evpn/gobgpd.conf -- jauderho/gobgp is
distroless (no shell), so we can't sed-substitute at container start.
Pattern instead:
* gobgpd.conf is now gobgpd.conf.tmpl with __HOST_IP__ placeholders
(committed). The rendered gobgpd.conf is gitignored.
* setup.sh renders the .tmpl(s) to .conf using $HOST_IP from .env.
* compose `command` stays the simple `gobgpd -f /config/gobgpd.conf`.
After cloning on a new host: cp .env.example .env -> edit HOST_IP ->
./setup.sh -> docker compose up -d. Verified locally by force-recreating
gobgp; all 6 sessions (4 cores + 2 Bromirski) re-established in <60s.
Known portability gaps still to address (separate work):
* Hardcoded lab-router inventories in cml/*.py and
obmp-rib-poller/poller.py.
* The /etc/cron.d/openbmp */5 -> */15 edit inside obmp-psql-app is
not persistent (regenerated by config_cron on every container start).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
psql-app-consumer: profile-gated (scale-out) horizontal scale-out for the
Kafka->Postgres ingestion path. Shares the primary's /config read-only so
it reuses obmp-psql.yml, whose fixed group.id makes Kafka rebalance
partitions across the primary and every replica. Its command runs ONLY
the consumer jar -- no cron, RPKI/IRR/DBIP or initdb -- so it does not
duplicate the primary's DB-maintenance jobs (config_cron wires those up
unconditionally in /usr/sbin/run). Each replica brings its own consumer
and writer threads.
Measured: one consumer-only replica took the post-storm backlog drain
from a cold-start ~3.7k msg/s to ~48k msg/s; group membership 8->16. With
2 consumers feeding it, Postgres becomes the next bottleneck (~500% CPU)
-- DB write capacity is the ceiling beyond ~2-3 consumers.
docker compose --profile scale-out up -d --scale psql-app-consumer=2
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On a cold boot all containers start together; psql-app finishes its
RPKI/IRR/DBIP setup and opens its single Postgres connection while the
DB is still initialising -> "the database system is starting up" ->
ConsumerApp.main throws and the consumer dies. The container does NOT
exit (the wrapper keeps cron/rsyslog alive), so restart: unless-stopped
never fires and the consumer stays dead silently.
Add depends_on psql: condition: service_healthy (plus kafka) so Compose
holds psql-app until Postgres passes its pg_isready healthcheck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
obmp-churn-monitor: a decoupled fast-path BGP churn consumer. Reads
openbmp.parsed.unicast_prefix with its own Kafka consumer group and only
counts announcements/withdrawals per (router,peer) into churn_metrics
(010_churn_metrics.sql) -- no relational RIB write. Storm-tested: it
stayed real-time (tracked 1k->85k msg/s) while the psql-app bulk
pipeline lag grew 3.8M->5.6M. Live BGP Churn dashboard reads it.
tools/churn_storm.py: programmatic churn-storm generator (flaps GoBGP's
eBGP sessions to the lab cores) for load testing.
Stress-test finding: fleet-wide full table from 18 routers exceeds this
31 GiB host. The bottleneck is RAM, not CPU -- at 16 cores the host
still hit load 33 because it was swap-thrashing (swap 2/2 full, <1.5 GiB
free). Lag ran away 3.8M->20M+. Recourse: more host RAM for bulk
throughput; the fast-path consumer for visibility regardless.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router
post-policy accepted/advertised prefix counts and route-policy bindings
over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy
Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff
dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected.
GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to
both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route
reflectors the cores propagate it to every R9K client, so all 18 lab
routers carry and BMP-export a full table -- an intentional stress test
of the ingestion/storage path. cml/gobgp_peering_config.py applies and
rolls back the core-side config; gobgp/README.md documents the rollback.
Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every
30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard
gives visibility into the pipeline under churn load.
Peer Detail dashboard: the Peer selector is now router-qualified
(router -> peer) so it is unambiguous in an iBGP route-reflector mesh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A standalone Python Kafka consumer that subscribes to the
openbmp.parsed.evpn topic (which the stock psql-app ignores) and
writes BGP EVPN routes into evpn_rib. Field positions are pinned to
the verified collector 2.2.3 / v1.7 message layout; route_type is
derived from which fields populate. Profile-gated ('evpn-test')
alongside the gobgp-evpn injector.
Verified end to end: 5 injected type-2/type-3 routes land in evpn_rib
with correct RD, ethernet-tag, MAC, IP, label and route-target.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A profile-gated GoBGP instance (Compose profile 'evpn-test', not part
of the normal stack) that originates synthetic BGP EVPN routes and
BMP-exports its local RIB to the collector. Verified end to end: the
injected type-2/3/5 routes are parsed by the collector and land on
the openbmp.parsed.evpn Kafka topic, ready for the EVPN consumer.
inject-evpn.sh pushes type-2 (MAC/IP), type-3 (inclusive multicast)
and type-5 (IP-prefix) routes. Start with:
docker compose --profile evpn-test up -d gobgp-evpn
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Telegraf now collects host filesystem usage ([[inputs.disk]], via a
read-only /hostfs mount) and PostgreSQL database + per-table sizes
([[inputs.postgresql_extensible]]) into InfluxDB. Surfaces RIB growth
and disk pressure — relevant now that the full-table GoBGP feed has
pushed the openbmp DB to ~30 GB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The IPv6 eBGP session never established because the Docker bridge
has no IPv6. Switch the gobgp container to network_mode: host so it
uses the host's real dual-stack connectivity — both sessions to
AS57355 now source from the host's public v4/v6 addresses.
Host mode binds the host's port namespace, so disable GoBGP's
inbound BGP listener (port = -1) — we only originate outbound
sessions, and a non-root container cannot bind privileged port 179.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New gobgp service: GoBGP peers eBGP-multihop with the AS57355 lab
route server (Bromirski) for the full real IPv4 + IPv6 Internet table
and BMP-exports it to the OpenBMP collector, landing in ip_rib as a
monitored peer.
Config follows the route server's published peering spec: local AS
65001, no password, keepalive 3600 / hold-time 7200, IPv4 feed on the
v4 session and IPv6 feed on the v6 session. gobgp/mrt-refresh.sh is a
cron-safe fallback that injects RouteViews MRT RIB dumps when the live
session is down. The live BGP session is not started here — bringing
gobgp up establishes the external session and loads ~1M routes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RCA: the exabgp container was OOM-killed — its 512m mem_limit was far too
small for the full-table feature (900K route objects in memory). Raises the
limit to a parameterized 6g default (EXABGP_MEM_LIMIT).
Adds Docker healthchecks to 14 services (port/HTTP probes) so unhealthy
containers are visible. Adds a Telegraf docker input that collects per-
container CPU/memory/IO into InfluxDB, plus a "Stack Resources" dashboard —
so resource pressure is caught before it causes an OOM crash. telegraf runs
with an overridden entrypoint so it keeps root and can read the docker socket.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The telemetry dashboards' router/interface variables used a keep|distinct
Flux pattern that returned only one source; switch to schema.tagValues so all
streaming routers and interfaces are listed. Parameterize telegraf.conf gNMI
addresses and credentials via GNMI_ADDRESSES/GNMI_USERNAME/GNMI_PASSWORD so
the telemetry fleet can scale without editing the config.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets mem_limit on every service to cap the OOM/swap-exhaustion risk (the lab
host had only 5 MiB swap free). The three heavy services (psql, kafka,
psql-app) read their limits from .env so production can raise them; the rest
use lab-appropriate fixed values. Total ~25 GB, leaving headroom on the 31 GB
lab host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins the Compose project name and splits services into core / test / auth
profiles so the BMP collector core can deploy standalone. Adds setup.sh
(idempotent bootstrap), .env.example, and repo-resident Authelia config
templates so a fresh host deploys without manual steps. Parameterizes
hardcoded host IP and domain; points the Grafana InfluxDB datasource at the
container name.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Authelia (forward-auth) and nginx portal container for single-endpoint
authenticated access via Caddy reverse proxy. Configures Grafana auth proxy
for header-based auto-login. Updates Vue UI base paths and API routes for
/exabgp/ and /traffic/ subpath serving. Adds traffic-gen responder container
on dedicated Docker network.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gNMI integration: NETCONF script to enable gRPC on all 9 routers,
Telegraf container with gnmi input plugin, InfluxDB for time-series
storage, 3 Grafana telemetry dashboards (utilization, errors, combined)
- Traffic generator: Scapy-based dual-mode container (sender/responder)
with Flask API, RFC 2544 test suite (throughput, latency, frame-loss,
back-to-back), Vue 3 web UI with flow builder, test runner, real-time
stats monitor, and results export
- docker-compose.yml updated with influxdb, telegraf, traffic-gen,
traffic-gen-ui services
- Full documentation in DOCS.md sections 15-16
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* collector v2.2.3
* collector using debian-stable-slim
* dev-image updated to use debian-stable-slim
* Upgraded librdkafka to v1.9.2
* Fixed permission problems with postgres
* Grafana upgraded to 9.1.7
* psql-app v2.2.2
* postgres updated to use timescaledb-ha:pg14-ts2.8
* Update psql-app container to use MEM for heap setting
This fixes issue where psql-app would run out of memory
* Update psql-app container to restart psql consumer if
if stops. This handles restart on out of memory exit.
When first deploying the collector and kafka, it takes
kafka a couple minutes to start. In some cases, the
collector would proceed to startup without waiting for
kafka. This resulted in the first few messages to be dropped,
such as dropping the router init and peer up messages.
* Upgrades to all containers
* Resolves#7, resolves#6, resolves#2
* Compose changed to use versions instead of latest
* OBMP containers now use a version tag instead of build numbers