Derived from the 2026-05-19 ingestion stress-test session. psql-app's
unicast_prefix drain rate caps at a few-hundred msg/s under continuous
Postgres maintenance (autovacuum on ip_rib + update_global_ip_rib() /
update_chg_stats() / update_peer_rib_counts() crons) competing for
ip_rib disk I/O.
ALTER TABLE ip_rib SET autovacuum_vacuum_scale_factor=0.02 -- run more
often on smaller chunks. cost_limit kept at its OpenBMP-default 3000 so
each run finishes fast; the consumer runs flat out between bursts
instead of being throttled continuously.
DROP INDEX for four unused/redundant indexes (every INSERT updates every
index; these all had 0 scans in ~2h of heavy activity):
- ip_rib_hash_id_idx (907 MB)
- ip_rib_base_attr_hash_id_idx (558 MB)
- ip_rib_prefix_idx (1538 MB, GiST)
- ip_rib_origin_as_idx (364 MB)
9 -> 5 indexes; ~3.4 GB freed (6,715 MB -> 3,348 MB). Reduces index
write-amplification per UPSERT by ~45% and shortens autovacuum on
ip_rib by ~the same.
Measurement note: across-cycle 25-min runs were inconclusive on the
sustained-rate effect (inflow was near-zero by then -- gobgp stopped --
so the consumer was largely idle). The real test is re-enabling the
fleet-wide feed with the consumer-replica + 62 GiB RAM and seeing
whether unicast_prefix keeps up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
obmp-churn-monitor: a decoupled fast-path BGP churn consumer. Reads
openbmp.parsed.unicast_prefix with its own Kafka consumer group and only
counts announcements/withdrawals per (router,peer) into churn_metrics
(010_churn_metrics.sql) -- no relational RIB write. Storm-tested: it
stayed real-time (tracked 1k->85k msg/s) while the psql-app bulk
pipeline lag grew 3.8M->5.6M. Live BGP Churn dashboard reads it.
tools/churn_storm.py: programmatic churn-storm generator (flaps GoBGP's
eBGP sessions to the lab cores) for load testing.
Stress-test finding: fleet-wide full table from 18 routers exceeds this
31 GiB host. The bottleneck is RAM, not CPU -- at 16 cores the host
still hit load 33 because it was swap-thrashing (swap 2/2 full, <1.5 GiB
free). Lag ran away 3.8M->20M+. Recourse: more host RAM for bulk
throughput; the fast-path consumer for visibility regardless.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router
post-policy accepted/advertised prefix counts and route-policy bindings
over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy
Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff
dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected.
GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to
both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route
reflectors the cores propagate it to every R9K client, so all 18 lab
routers carry and BMP-export a full table -- an intentional stress test
of the ingestion/storage path. cml/gobgp_peering_config.py applies and
rolls back the core-side config; gobgp/README.md documents the rollback.
Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every
30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard
gives visibility into the pipeline under churn load.
Peer Detail dashboard: the Peer selector is now router-qualified
(router -> peer) so it is unambiguous in an iBGP route-reflector mesh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table
(roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib;
a dedicated consumer will populate it.
- production-sizing.md: corrected retention figures to the actual
policy values, added a measured-data section (one full feed ≈
+5 GB current state; DB now ~30 GB), and a horizontal-scaling
section — the bottleneck is the psql-app consumer + disk IOPS, so
scale psql-app as a Kafka consumer group (cap = partition count),
treat multi-collector as HA/locality not throughput.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The AS map previously exploded ~4.4M base_attrs AS_PATH rows live,
three times per load (one per panel), ~1.8s each — slow enough that
navigating away cancelled the queries mid-flight.
Add mv_as_adjacency: undirected consecutive-AS pairs with occurrence
counts over the full RIB (17k rows), refreshed hourly by pg_cron via
REFRESH ... CONCURRENTLY. The dashboard panels now read the view in
~1ms. Min-occurrence options rescaled for full-RIB counts
(2000/5000/10000/50000, default 2000 -> ~63-node graph).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>