obmp-docker

Author	SHA1	Message	Date
sam	2a82bd9a94	ip_rib perf tuning: per-table autovacuum + drop 4 unused indexes Derived from the 2026-05-19 ingestion stress-test session. psql-app's unicast_prefix drain rate caps at a few hundred msg/s under continuous Postgres maintenance (autovacuum on ip_rib + update_global_ip_rib() / update_chg_stats() / update_peer_rib_counts() crons) competing for ip_rib disk I/O. ALTER TABLE ip_rib SET (autovacuum_vacuum_scale_factor = 0.02) -- run more often on smaller chunks. cost_limit kept at its OpenBMP-default 3000 so each run finishes fast; the consumer runs flat out between bursts instead of being throttled continuously. DROP INDEX for four unused/redundant indexes (every INSERT updates every index; these all had 0 scans in ~2h of heavy activity): - ip_rib_hash_id_idx (907 MB) - ip_rib_base_attr_hash_id_idx (558 MB) - ip_rib_prefix_idx (1538 MB, GiST) - ip_rib_origin_as_idx (364 MB) 9 -> 5 indexes; ~3.4 GB freed (6,715 MB -> 3,348 MB). Reduces index write-amplification per UPSERT by ~45% and shortens autovacuum on ip_rib by roughly the same proportion. Measurement note: across-cycle 25-min runs were inconclusive on the sustained-rate effect (inflow was near-zero by then -- gobgp stopped -- so the consumer was largely idle). The real test is re-enabling the fleet-wide feed with the consumer-replica + 62 GiB RAM and seeing whether unicast_prefix keeps up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 16:50:15 -07:00
sam	d7084aba54	Add fast-path churn monitor and churn-storm load tool obmp-churn-monitor: a decoupled fast-path BGP churn consumer. Reads openbmp.parsed.unicast_prefix with its own Kafka consumer group and only counts announcements/withdrawals per (router,peer) into churn_metrics (010_churn_metrics.sql) -- no relational RIB write. Storm-tested: it stayed real-time (tracked 1k->85k msg/s) while the psql-app bulk pipeline lag grew 3.8M->5.6M. Live BGP Churn dashboard reads it. tools/churn_storm.py: programmatic churn-storm generator (flaps GoBGP's eBGP sessions to the lab cores) for load testing. Stress-test finding: fleet-wide full table from 18 routers exceeds this 31 GiB host. The bottleneck is RAM, not CPU -- at 16 cores the host still hit load 33 because it was swap-thrashing (swap 2/2 full, <1.5 GiB free). Lag ran away 3.8M->20M+. Recourse: more host RAM for bulk throughput; the fast-path consumer for visibility regardless. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 13:17:09 -07:00
sam	b681c473c0	Add Policy Diff, fleet-wide full-table feed, and Kafka lag monitoring Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router post-policy accepted/advertised prefix counts and route-policy bindings over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected. GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route reflectors the cores propagate it to every R9K client, so all 18 lab routers carry and BMP-export a full table -- an intentional stress test of the ingestion/storage path. cml/gobgp_peering_config.py applies and rolls back the core-side config; gobgp/README.md documents the rollback. Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every 30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard gives visibility into the pipeline under churn load. Peer Detail dashboard: the Peer selector is now router-qualified (router -> peer) so it is unambiguous in an iBGP route-reflector mesh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 12:42:25 -07:00
sam	2d83d6c02e	Add evpn_rib schema; update production sizing with measured data - postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table (roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib; a dedicated consumer will populate it. - production-sizing.md: corrected retention figures to the actual policy values, added a measured-data section (one full feed ≈ +5 GB current state; DB now ~30 GB), and a horizontal-scaling section — the bottleneck is the psql-app consumer + disk IOPS, so scale psql-app as a Kafka consumer group (cap = partition count), treat multi-collector as HA/locality not throughput. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:44:09 -07:00
sam	cc0d20bf9e	Back AS Relationship Map with a materialized view The AS map previously exploded ~4.4M base_attrs AS_PATH rows live, three times per load (one per panel), ~1.8s each — slow enough that navigating away cancelled the queries mid-flight. Add mv_as_adjacency: undirected consecutive-AS pairs with occurrence counts over the full RIB (17k rows), refreshed hourly by pg_cron via REFRESH ... CONCURRENTLY. The dashboard panels now read the view in ~1ms. Min-occurrence options rescaled for full-RIB counts (2000/5000/10000/50000, default 2000 -> ~63-node graph). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 07:04:38 -07:00
Tim Evens	84bec5293b	version 2.2.0 updates	2022-06-08 11:53:55 -07:00
RaviTeja Buddabathuni (rbuddaba)	a630c5db7d	feat: add pg_cron extension for cron jobs	2022-03-15 13:08:48 -05:00

7 Commits
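
Commit cc0d20bf9e describes mv_as_adjacency as undirected consecutive-AS pairs with occurrence counts over the full RIB. The view itself is SQL and its definition does not appear in this log, so the Python below is only a rough sketch of that counting logic, not the repository's implementation. The function name as_adjacency, the space-separated AS_PATH input format, and the choice to ignore prepending (a repeated AS) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def as_adjacency(as_paths: Iterable[str], min_occurrence: int = 1) -> dict:
    """Count undirected consecutive-AS pairs across AS_PATH strings.

    Assumes each path is a space-separated list of plain AS numbers
    (no AS_SET braces); this input format is hypothetical.
    """
    counts = Counter()
    for path in as_paths:
        hops = [int(tok) for tok in path.split()]
        # Walk each pair of neighbouring hops in the path.
        for left, right in zip(hops, hops[1:]):
            if left == right:
                # Assumption: AS-path prepending is not an adjacency.
                continue
            # Normalise the order so A-B and B-A count as the same edge.
            counts[(min(left, right), max(left, right))] += 1
    # Mirror the dashboard's min-occurrence filter.
    return {pair: n for pair, n in counts.items() if n >= min_occurrence}
```

Called on a handful of paths, it returns a mapping from (lower AS, higher AS) to the number of times that pair appeared side by side. Raising min_occurrence prunes rare edges, which is the effect the commit's rescaled Min-occurrence options have on graph size.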