Derived from the 2026-05-19 ingestion stress-test session. psql-app's
unicast_prefix drain rate caps at a few hundred msg/s under continuous
Postgres maintenance (autovacuum on ip_rib + update_global_ip_rib() /
update_chg_stats() / update_peer_rib_counts() crons) competing for
ip_rib disk I/O.
ALTER TABLE ip_rib SET (autovacuum_vacuum_scale_factor = 0.02) -- run
more often on smaller chunks. cost_limit is kept at its OpenBMP-default
3000 so each run finishes fast; the consumer runs flat out between
bursts instead of being throttled continuously.
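To illustrate why a smaller scale factor means more frequent, smaller
vacuum runs, here is a rough worked example of the autovacuum trigger
formula from the PostgreSQL documentation. The row count is a
hypothetical placeholder, and the comparison assumes ip_rib previously
used the stock Postgres defaults (scale factor 0.2, threshold 50).

```python
# Postgres starts an autovacuum on a table once its dead tuples exceed
#   autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples


def vacuum_trigger(reltuples, scale_factor, threshold=50):
    """Dead-tuple count at which autovacuum fires for a table."""
    return threshold + scale_factor * reltuples


ROWS = 20_000_000  # hypothetical ip_rib row count, not a measured value

default_trigger = vacuum_trigger(ROWS, 0.2)   # stock Postgres scale factor
tuned_trigger = vacuum_trigger(ROWS, 0.02)    # value set in this change

print(f"default: vacuum after {default_trigger:,.0f} dead tuples")
print(f"tuned:   vacuum after {tuned_trigger:,.0f} dead tuples")
```

Under these assumptions each run has about a tenth as many dead tuples
to clean, which is the "smaller chunks" effect described above.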
DROP INDEX for four unused/redundant indexes (every INSERT updates every
index; these all had 0 scans in ~2h of heavy activity):
- ip_rib_hash_id_idx (907 MB)
- ip_rib_base_attr_hash_id_idx (558 MB)
- ip_rib_prefix_idx (1538 MB, GiST)
- ip_rib_origin_as_idx (364 MB)
9 -> 5 indexes; ~3.4 GB freed (6,715 MB -> 3,348 MB). Reduces index
write-amplification per UPSERT by ~45% and shortens autovacuum on
ip_rib by roughly the same proportion.
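The figures above can be cross-checked with a few lines of arithmetic.
The ~45% estimate is treated here as a simple index-count ratio, which
assumes every index costs about the same per write; that is a
simplification, not a measurement.

```python
# Sizes of the dropped indexes, in MB, as listed above.
dropped_mb = {
    "ip_rib_hash_id_idx": 907,
    "ip_rib_base_attr_hash_id_idx": 558,
    "ip_rib_prefix_idx": 1538,
    "ip_rib_origin_as_idx": 364,
}

freed_mb = sum(dropped_mb.values())
before_mb, after_mb = 6715, 3348
indexes_before, indexes_after = 9, 5

# Fraction of per-write index updates removed, assuming equal cost per index.
fraction = (indexes_before - indexes_after) / indexes_before

print(f"freed: {freed_mb} MB (total went {before_mb} -> {after_mb} MB)")
print(f"index updates removed per write: {fraction:.0%}")
```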
Measurement note: across-cycle 25-min runs were inconclusive on the
sustained-rate effect (inflow was near-zero by then -- gobgp stopped --
so the consumer was largely idle). The real test is re-enabling the
fleet-wide feed with the consumer-replica + 62 GiB RAM and seeing
whether unicast_prefix keeps up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
obmp-churn-monitor: a decoupled fast-path BGP churn consumer. Reads
openbmp.parsed.unicast_prefix with its own Kafka consumer group and only
counts announcements/withdrawals per (router,peer) into churn_metrics
(010_churn_metrics.sql) -- no relational RIB write. Storm-tested: it
stayed real-time (tracked 1k->85k msg/s) while the psql-app bulk
pipeline lag grew 3.8M->5.6M. Live BGP Churn dashboard reads it.
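A minimal sketch of the counting step only, to show why the fast path
is cheap: it keeps one counter per (router, peer, action) and never
touches the RIB. The record layout and field names (router, peer,
action with values "add"/"del") are assumptions for illustration, not
the actual obmp-churn-monitor code, and the Kafka consumer loop is left
out.

```python
from collections import Counter


def count_churn(records):
    """Count announcements ("add") and withdrawals ("del") per (router, peer)."""
    counts = Counter()
    for rec in records:
        action = rec["action"]
        if action not in ("add", "del"):
            continue  # ignore anything that is not an announce/withdraw
        counts[(rec["router"], rec["peer"], action)] += 1
    return counts


# Hypothetical already-parsed unicast_prefix records.
sample = [
    {"router": "r1", "peer": "10.0.0.1", "action": "add"},
    {"router": "r1", "peer": "10.0.0.1", "action": "add"},
    {"router": "r1", "peer": "10.0.0.1", "action": "del"},
    {"router": "r2", "peer": "10.0.0.1", "action": "del"},
]

for key, n in sorted(count_churn(sample).items()):
    print(key, n)
```

Keying on router as well as peer keeps the same peer address seen from
two routers separate.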
tools/churn_storm.py: programmatic churn-storm generator (flaps GoBGP's
eBGP sessions to the lab cores) for load testing.
Stress-test finding: fleet-wide full table from 18 routers exceeds this
31 GiB host. The bottleneck is RAM, not CPU -- at 16 cores the host
still hit load 33 because it was swap-thrashing (swap 2/2 full, <1.5 GiB
free). Lag ran away 3.8M->20M+. Recourse: more host RAM for bulk
throughput; the fast-path consumer for visibility regardless.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router
post-policy accepted/advertised prefix counts and route-policy bindings
over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy
Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff
dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected.
GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to
both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route
reflectors the cores propagate it to every R9K client, so all 18 lab
routers carry and BMP-export a full table -- an intentional stress test
of the ingestion/storage path. cml/gobgp_peering_config.py applies and
rolls back the core-side config; gobgp/README.md documents the rollback.
Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every
30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard
gives visibility into the pipeline under churn load.
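For reference, consumer-group lag is the gap between each partition's
log-end offset and the group's committed offset. The sketch below shows
only that arithmetic on plain dicts; how kafka-lag-monitor actually
fetches the offsets is not shown, and treating a missing commit as
offset 0 is an assumption made here.

```python
def group_lag(end_offsets, committed):
    """Per-partition lag: log-end offset minus committed offset.

    Both arguments map (topic, partition) -> offset. A partition with no
    committed offset is treated as committed at 0 (an assumption).
    """
    lag = {}
    for tp, end in end_offsets.items():
        pos = committed.get(tp, 0)
        lag[tp] = max(end - pos, 0)  # clamp in case of a stale end offset
    return lag


# Hypothetical sample taken at one 30s tick.
end_offsets = {("openbmp.parsed.unicast_prefix", 0): 1_000,
               ("openbmp.parsed.unicast_prefix", 1): 2_500}
committed = {("openbmp.parsed.unicast_prefix", 0): 400}

lag = group_lag(end_offsets, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```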
Peer Detail dashboard: the Peer selector is now router-qualified
(router -> peer) so it is unambiguous in an iBGP route-reflector mesh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table
(roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib;
a dedicated consumer will populate it.
- production-sizing.md: corrected retention figures to the actual
policy values, added a measured-data section (one full feed ≈
+5 GB current state; DB now ~30 GB), and a horizontal-scaling
section — the bottleneck is the psql-app consumer + disk IOPS, so
scale psql-app as a Kafka consumer group (cap = partition count),
and treat multi-collector as HA/locality, not throughput.
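The "cap = partition count" point follows from Kafka assigning each
partition to at most one member of a consumer group. The sketch below
uses a plain round-robin assignment to show it; Kafka's real assignors
differ in detail, and the instance names are hypothetical.

```python
def assign(partitions, consumers):
    """Round-robin partitions over consumers; extra consumers stay idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = [0, 1, 2, 3]
consumers = [f"psql-app-{n}" for n in range(6)]  # hypothetical instance names

result = assign(partitions, consumers)
idle = [c for c, parts in result.items() if not parts]
print(result)
print("idle:", idle)
```

With four partitions, the fifth and sixth instances receive nothing, so
adding them does not raise throughput.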
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The AS map previously exploded ~4.4M base_attrs AS_PATH rows live,
three times per load (one per panel), ~1.8s each — slow enough that
navigating away cancelled the queries mid-flight.
Add mv_as_adjacency: undirected consecutive-AS pairs with occurrence
counts over the full RIB (17k rows), refreshed hourly by pg_cron via
REFRESH ... CONCURRENTLY. The dashboard panels now read the view in
~1ms. Min-occurrence options rescaled for full-RIB counts
(2000/5000/10000/50000, default 2000 -> ~63-node graph).
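As an illustration of what the view aggregates, here is a small sketch
of undirected consecutive-AS pair counting plus the min-occurrence
filter the panels apply. It is not the view's SQL; skipping repeated
ASNs (prepending) is an assumption made here, and the sample paths are
made up.

```python
from collections import Counter


def as_adjacency(as_paths):
    """Count undirected consecutive-AS pairs across a set of AS paths."""
    pairs = Counter()
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            if a == b:
                continue  # assumption: prepending is not an adjacency
            pairs[(min(a, b), max(a, b))] += 1  # order-independent key
    return pairs


def filter_min_occurrence(pairs, min_occurrence):
    """Keep only edges seen at least min_occurrence times."""
    return {p: n for p, n in pairs.items() if n >= min_occurrence}


# Hypothetical AS paths (private-use / documentation ASNs).
paths = [
    [65020, 64500, 64501],
    [64501, 64500, 64500, 65021],
]

edges = as_adjacency(paths)
print(edges)
print(filter_min_occurrence(edges, 2))
```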
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* collector v2.2.3
* collector using debian-stable-slim
* dev-image updated to use debian-stable-slim
* Upgraded librdkafka to v1.9.2
* Fixed permission problems with postgres
* Grafana upgraded to 9.1.7
* psql-app v2.2.2
* postgres updated to use timescaledb-ha:pg14-ts2.8
When the collector and Kafka are first deployed, Kafka takes a
couple of minutes to start. In some cases, the collector would
start up without waiting for Kafka. As a result, the first few
messages were dropped, including the router init and peer up
messages.
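One common way to avoid this kind of race is a readiness gate that
blocks startup until the broker port accepts connections. The sketch
below is a generic example of that idea, not the collector's actual
fix; the function name and defaults are hypothetical, and an open TCP
port does not by itself prove the broker is fully ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, else False."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Connection succeeded: something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)  # back off before the next attempt
```

A startup script could call wait_for_port with the Kafka host and port
and only launch the collector when it returns True.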
* Upgrades to all containers
* Resolves #7, resolves #6, resolves #2
* Compose changed to use versions instead of latest
* OBMP containers now use a version tag instead of build numbers