obmp-docker

Author	SHA1	Message	Date
sam	2634aada24	Parameterize HOST_IP everywhere -- portable to another lab host Removes hardcoded 10.40.40.202 references so a fresh clone + .env-only edit can stand the stack up on a new compute node. * docker-compose.yml: rib-poller PG_DSN now uses ${HOST_IP:-...}. * obmp-rib-poller/poller.py: default PG_DSN host falls back to ${HOST_IP} env (compose passes it; manual runs honour $HOST_IP too). * cml/gobgp_peering_config.py: GOBGP_IP read from $HOST_IP or the HOST_IP= line in repo-root .env, with a small _env_default helper. * cml/proxmox_bmp_config.py: COLLECTOR_HOST resolved the same way. For gobgp/gobgpd.conf and gobgp-evpn/gobgpd.conf -- jauderho/gobgp is distroless (no shell), so we can't sed-substitute at container start. Pattern instead: * gobgpd.conf is now gobgpd.conf.tmpl with __HOST_IP__ placeholders (committed). The rendered gobgpd.conf is gitignored. * setup.sh renders the .tmpl(s) to .conf using $HOST_IP from .env. * compose `command` stays the simple `gobgpd -f /config/gobgpd.conf`. After cloning on a new host: cp .env.example .env -> edit HOST_IP -> ./setup.sh -> docker compose up -d. Verified locally by force-recreating gobgp; all 6 sessions (4 cores + 2 Bromirski) re-established in <60s. Known portability gaps still to address (separate work): * Hardcoded lab-router inventories in cml/*.py and obmp-rib-poller/poller.py. * The /etc/cron.d/openbmp */5 -> */15 edit inside obmp-psql-app is not persistent (regenerated by config_cron on every container start). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 18:34:51 -07:00
sam	06019ef74c	Add consumer-only psql-app replica for ingestion scale-out psql-app-consumer: profile-gated (scale-out) horizontal scale-out for the Kafka->Postgres ingestion path. Shares the primary's /config read-only so it reuses obmp-psql.yml, whose fixed group.id makes Kafka rebalance partitions across the primary and every replica. Its command runs ONLY the consumer jar -- no cron, RPKI/IRR/DBIP or initdb -- so it does not duplicate the primary's DB-maintenance jobs (config_cron wires those up unconditionally in /usr/sbin/run). Each replica brings its own consumer and writer threads. Measured: one consumer-only replica took the post-storm backlog drain from a cold-start ~3.7k msg/s to ~48k msg/s; group membership 8->16. With 2 consumers feeding it, Postgres becomes the next bottleneck (~500% CPU) -- DB write capacity is the ceiling beyond ~2-3 consumers. docker compose --profile scale-out up -d --scale psql-app-consumer=2 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 14:04:34 -07:00
sam	22d26f0e0f	Gate psql-app startup on Postgres health (fix cold-boot race) On a cold boot all containers start together; psql-app finishes its RPKI/IRR/DBIP setup and opens its single Postgres connection while the DB is still initialising -> "the database system is starting up" -> ConsumerApp.main throws and the consumer dies. The container does NOT exit (the wrapper keeps cron/rsyslog alive), so restart: unless-stopped never fires and the consumer stays dead silently. Add depends_on psql: condition: service_healthy (plus kafka) so Compose holds psql-app until Postgres passes its pg_isready healthcheck. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 13:53:09 -07:00
sam	d7084aba54	Add fast-path churn monitor and churn-storm load tool obmp-churn-monitor: a decoupled fast-path BGP churn consumer. Reads openbmp.parsed.unicast_prefix with its own Kafka consumer group and only counts announcements/withdrawals per (router,peer) into churn_metrics (010_churn_metrics.sql) -- no relational RIB write. Storm-tested: it stayed real-time (tracked 1k->85k msg/s) while the psql-app bulk pipeline lag grew 3.8M->5.6M. Live BGP Churn dashboard reads it. tools/churn_storm.py: programmatic churn-storm generator (flaps GoBGP's eBGP sessions to the lab cores) for load testing. Stress-test finding: fleet-wide full table from 18 routers exceeds this 31 GiB host. The bottleneck is RAM, not CPU -- at 16 cores the host still hit load 33 because it was swap-thrashing (swap 2/2 full, <1.5 GiB free). Lag ran away 3.8M->20M+. Recourse: more host RAM for bulk throughput; the fast-path consumer for visibility regardless. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 13:17:09 -07:00
sam	b681c473c0	Add Policy Diff, fleet-wide full-table feed, and Kafka lag monitoring Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router post-policy accepted/advertised prefix counts and route-policy bindings over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected. GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route reflectors the cores propagate it to every R9K client, so all 18 lab routers carry and BMP-export a full table -- an intentional stress test of the ingestion/storage path. cml/gobgp_peering_config.py applies and rolls back the core-side config; gobgp/README.md documents the rollback. Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every 30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard gives visibility into the pipeline under churn load. Peer Detail dashboard: the Peer selector is now router-qualified (router -> peer) so it is unambiguous in an iBGP route-reflector mesh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 12:42:25 -07:00
sam	107cbf6ac5	Add obmp-evpn-consumer: openbmp.parsed.evpn -> evpn_rib (roadmap E5) A standalone Python Kafka consumer that subscribes to the openbmp.parsed.evpn topic (which the stock psql-app ignores) and writes BGP EVPN routes into evpn_rib. Field positions are pinned to the verified collector 2.2.3 / v1.7 message layout; route_type is derived from which fields populate. Profile-gated ('evpn-test') alongside the gobgp-evpn injector. Verified end to end: 5 injected type-2/type-3 routes land in evpn_rib with correct RD, ethernet-tag, MAC, IP, label and route-target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:28:19 -07:00
sam	f7532b62ef	Add modular gobgp-evpn EVPN test-route injector (roadmap E5) A profile-gated GoBGP instance (Compose profile 'evpn-test', not part of the normal stack) that originates synthetic BGP EVPN routes and BMP-exports its local RIB to the collector. Verified end to end: the injected type-2/3/5 routes are parsed by the collector and land on the openbmp.parsed.evpn Kafka topic, ready for the EVPN consumer. inject-evpn.sh pushes type-2 (MAC/IP), type-3 (inclusive multicast) and type-5 (IP-prefix) routes. Start with: docker compose --profile evpn-test up -d gobgp-evpn Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 09:15:44 -07:00
sam	fc164a5689	Add disk-space and DB-size monitoring to Telegraf Telegraf now collects host filesystem usage ([[inputs.disk]], via a read-only /hostfs mount) and PostgreSQL database + per-table sizes ([[inputs.postgresql_extensible]]) into InfluxDB. Surfaces RIB growth and disk pressure — relevant now that the full-table GoBGP feed has pushed the openbmp DB to ~30 GB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:28:36 -07:00
sam	cffb835f30	Enable IPv6 feed: run GoBGP in host network mode The IPv6 eBGP session never established because the Docker bridge has no IPv6. Switch the gobgp container to network_mode: host so it uses the host's real dual-stack connectivity — both sessions to AS57355 now source from the host's public v4/v6 addresses. Host mode binds the host's port namespace, so disable GoBGP's inbound BGP listener (port = -1) — we only originate outbound sessions, and a non-root container cannot bind privileged port 179. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:08:55 -07:00
sam	88a5546e29	Add GoBGP full-table feed container (roadmap E1) New gobgp service: GoBGP peers eBGP-multihop with the AS57355 lab route server (Bromirski) for the full real IPv4 + IPv6 Internet table and BMP-exports it to the OpenBMP collector, landing in ip_rib as a monitored peer. Config follows the route server's published peering spec: local AS 65001, no password, keepalive 3600 / hold-time 7200, IPv4 feed on the v4 session and IPv6 feed on the v6 session. gobgp/mrt-refresh.sh is a cron-safe fallback that injects RouteViews MRT RIB dumps when the live session is down. The live BGP session is not started here — bringing gobgp up establishes the external session and loads ~1M routes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:39:12 -07:00
sam	9d74940614	Fix ExaBGP OOM, add container health checks and resource monitoring RCA: the exabgp container was OOM-killed — its 512m mem_limit was far too small for the full-table feature (900K route objects in memory). Raises the limit to a parameterized 6g default (EXABGP_MEM_LIMIT). Adds Docker healthchecks to 14 services (port/HTTP probes) so unhealthy containers are visible. Adds a Telegraf docker input that collects per- container CPU/memory/IO into InfluxDB, plus a "Stack Resources" dashboard — so resource pressure is caught before it causes an OOM crash. telegraf runs with an overridden entrypoint so it keeps root and can read the docker socket. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:03:52 -07:00
sam	a662496e53	Fix telemetry dashboard variables and parameterize gNMI targets The telemetry dashboards' router/interface variables used a keep|distinct Flux pattern that returned only one source; switch to schema.tagValues so all streaming routers and interfaces are listed. Parameterize telegraf.conf gNMI addresses and credentials via GNMI_ADDRESSES/GNMI_USERNAME/GNMI_PASSWORD so the telemetry fleet can scale without editing the config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 21:10:57 -07:00
sam	4e9bd7cc5a	Add container memory limits to all services Sets mem_limit on every service to cap the OOM/swap-exhaustion risk (the lab host had only 5 MiB swap free). The three heavy services (psql, kafka, psql-app) read their limits from .env so production can raise them; the rest use lab-appropriate fixed values. Total ~25 GB, leaving headroom on the 31 GB lab host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 20:04:37 -07:00
sam	cf4e5b07c6	Add Compose profiles, setup.sh bootstrap, and config templates for portable deployment Pins the Compose project name and splits services into core / test / auth profiles so the BMP collector core can deploy standalone. Adds setup.sh (idempotent bootstrap), .env.example, and repo-resident Authelia config templates so a fresh host deploys without manual steps. Parameterizes hardcoded host IP and domain; points the Grafana InfluxDB datasource at the container name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 19:21:04 -07:00
sam	45f4c9859d	Add Authelia auth gateway, portal landing page, and subpath routing Adds Authelia (forward-auth) and nginx portal container for single-endpoint authenticated access via Caddy reverse proxy. Configures Grafana auth proxy for header-based auto-login. Updates Vue UI base paths and API routes for /exabgp/ and /traffic/ subpath serving. Adds traffic-gen responder container on dedicated Docker network. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-15 14:23:09 -07:00
sam	dcebf15bb3	Add Phase 4: gNMI streaming telemetry and traffic generator - gNMI integration: NETCONF script to enable gRPC on all 9 routers, Telegraf container with gnmi input plugin, InfluxDB for time-series storage, 3 Grafana telemetry dashboards (utilization, errors, combined) - Traffic generator: Scapy-based dual-mode container (sender/responder) with Flask API, RFC 2544 test suite (throughput, latency, frame-loss, back-to-back), Vue 3 web UI with flow builder, test runner, real-time stats monitor, and results export - docker-compose.yml updated with influxdb, telegraf, traffic-gen, traffic-gen-ui services - Full documentation in DOCS.md sections 15-16 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 15:29:44 -07:00
sam	6621942032	Add Phase 2: Vue 3 control panel, 6 learning dashboards, new BGP scenarios - exabgp-ui/: Vue 3 + Vite SPA served by NGINX on :5001; proxies /api/ to ExaBGP Flask on :5050; includes StatusBar, ScenarioPanel, RouteTable, AnnounceForm, PeerStatus, ChurnControl components - docker-compose.yml: add obmp-exabgp-ui service (host network, port 5001) - exabgp/scenarios/__init__.py: add convergence_test, route_leak, hijack_simulation scenarios for structured BGP learning exercises - exabgp/inject.py: add 'peers' and 'monitor' subcommands; live-refresh terminal status view with ANSI cursor repositioning - obmp-grafana/dashboards/Learning/: 6 new OBMP-Learning dashboards (update rate, peer health, AS path, RPKI, churn, attributes) - obmp-grafana/provisioning/dashboards/openbmp-dashboards.yml: add OpenBMP-Learning folder provider pointing to dashboards/Learning/ - DOCS.md: document Web UI, 3 new scenarios, 6 learning dashboards; fix section numbering (10-14) and architecture diagram (23 dashboards) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 15:37:16 -07:00
sam	233dadbb41	Add ExaBGP route injector, Grafana dashboards, and full documentation - Add exabgp/ container: ExaBGP 5.x + Flask REST API for on-demand BGP route injection into CML IOS-XR lab (AS 65020 via eBGP from AS 65100) - Add 6 injection scenarios: internet_sample, churn, blackhole, anycast, full_table, lab_prefixes - Add inject.py CLI wrapper for the ExaBGP API - Add iosxr_bgp_config.md with IOS-XR neighbor config and NETCONF script - Add obmp-grafana/ dashboards and provisioning (17 dashboards) - Update docker-compose.yml: add exabgp service, fix Kafka external listener IP, extend log retention from 90min to 720min - Add DOCS.md: full project documentation including architecture, setup, user guide, sanity checks, troubleshooting, and command reference - Update .gitignore: exclude .env and .claude/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-05 14:46:37 -07:00
Tim Evens	3f38af5312	Version 2.2.3 updates * collector v2.2.3 * collector using debian-stable-slim * dev-image updated to use debian-stable-slim * Upgraded librdkafka to v1.9.2 * Fixed permission problems with postgres * Grafana upgraded to 9.1.7 * psql-app v2.2.2 * postgres updated to use timescaledb-ha:pg14-ts2.8	2022-10-20 07:12:08 -07:00
Tim Evens	0f3312a719	Updates for v2.2.1	2022-06-17 18:20:05 -07:00
Tim Evens	e19e5ac73a	Fix psql container rm file issue	2022-06-10 12:53:24 -07:00
Tim Evens	237345b476	Add ENABLE_DBIP to ```psql-app``` container to auto import DB-IP geo data	2022-06-08 14:53:55 -07:00
Tim Evens	84bec5293b	version 2.2.0 updates	2022-06-08 11:53:55 -07:00
Tim Evens	e7fad858d9	Change global_ip_rib function cron job	2022-05-17 10:38:19 -07:00
Tim Evens	eb52eace41	Updates for 2.1.1	2022-03-31 12:13:46 -07:00
Tim Evens	c5f3d6ef59	2.1.1 Updates * Update psql-app container to use MEM for heap setting This fixes issue where psql-app would run out of memory * Update psql-app container to restart psql consumer if it stops. This handles restart on out of memory exit.	2022-03-28 15:51:15 -07:00
Tim Evens	0a0d2ceec1	2.1.1 updates * Fix vpnv6/l3vpn next-hop decoding * Fix ip_rib_log enabling compression to be after hypertable creation * Add pg_cron to postgres container * Upgraded postgres container to timescaledb 2.6.0-pg14	2022-03-28 12:43:37 -07:00
Tim Evens	b9b8c44713	Change max_wal_size to 10GB by default and add missing upgrade sql file	2022-03-09 10:48:58 -08:00
Tim Evens	05737d2682	v2.1.0 updates * Add peeringdb script and cron job * Fix running more than one cronjob at a time * Update upgrade script for psql-app	2022-03-04 07:27:23 -08:00
Tim Evens	492c000ce9	Add whois and upgrade to 2.1.0	2022-02-22 14:30:05 -08:00
Tim Evens	fd2874d00e	Fix for collector kafka startup issue When first deploying the collector and kafka, it takes kafka a couple minutes to start. In some cases, the collector would proceed to startup without waiting for kafka. This resulted in the first few messages to be dropped, such as dropping the router init and peer up messages.	2022-02-01 12:49:17 -08:00
Tim Evens	a0e6a5bc6f	Fixes to psql-app, version 2.0.2	2022-01-31 11:05:58 -08:00
Tim Evens	c3839aa8fb	Security fixes, issues resolved, and more * Upgrades to all containers * Resolves #7, resolves #6, resolves #2 * Compose changed to use versions instead of latest * OBMP containers now use a version tag instead of build numbers	2022-01-28 15:12:01 -08:00
sydon7	cd25509e39	adding cron drops for all timeseries tables	2021-07-30 22:55:53 +00:00
sydon7	fc362aab60	rpki updates	2021-04-30 14:14:27 +00:00
Tim Evens	eba244cdf7	Fix postgres to create ts. Update compose to use latest	2021-03-31 00:13:09 -07:00
Tim Evens	c61f766cc3	Adjust defaults in compose and fix postgres mem setting	2021-03-30 22:31:06 -07:00
Tim Evens	74154229ad	more changes to compose	2021-03-30 19:00:25 -07:00
Tim Evens	574bf5e8a9	Add psql-app container and docker compose	2021-03-30 14:25:24 -07:00

39 Commits
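
Commit 2634aada24 describes two related steps: resolving HOST_IP from the environment or from the HOST_IP= line in the repo-root .env, and rendering gobgpd.conf.tmpl to gobgpd.conf by substituting the __HOST_IP__ placeholder. The Python below is a minimal sketch of that pattern, not the repository's code. The names env_default and render_template, the sample template line, and the 192.0.2.10 address are assumptions made for illustration; the commit itself says setup.sh does the rendering and the helper is called _env_default.

```python
import os
import tempfile
from pathlib import Path

PLACEHOLDER = "__HOST_IP__"  # placeholder name taken from the commit message


def env_default(name, env_file, fallback=None):
    """Return $name from the process environment, else from a NAME=value
    line in env_file, else fallback. Hypothetical stand-in for the
    _env_default helper the commit mentions."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            # Skip comments and lines that are not KEY=value pairs.
            if line.startswith("#") or "=" not in line:
                continue
            key, _, val = line.partition("=")
            if key.strip() == name:
                return val.strip().strip("'\"")
    return fallback


def render_template(tmpl_path, out_path, host_ip):
    """Write out_path as a copy of tmpl_path with every placeholder
    replaced by host_ip. Returns the number of substitutions made."""
    text = Path(tmpl_path).read_text()
    count = text.count(PLACEHOLDER)
    Path(out_path).write_text(text.replace(PLACEHOLDER, host_ip))
    return count


if __name__ == "__main__":
    # Demo in a scratch directory so no real config is touched.
    with tempfile.TemporaryDirectory() as tmp:
        env_file = Path(tmp) / ".env"
        env_file.write_text("HOST_IP=192.0.2.10\n")  # documentation address
        tmpl = Path(tmp) / "gobgpd.conf.tmpl"
        # Illustrative line only, not real GoBGP configuration.
        tmpl.write_text('collector-address = "__HOST_IP__"\n')

        host_ip = env_default("HOST_IP", env_file)
        if host_ip is None:
            raise SystemExit("HOST_IP is not set in the environment or .env")
        render_template(tmpl, Path(tmp) / "gobgpd.conf", host_ip)
        print((Path(tmp) / "gobgpd.conf").read_text(), end="")
```

Rendering ahead of time keeps the container command unchanged, which matters for a distroless image with no shell to run a substitution at start-up.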