105 Commits

Author SHA1 Message Date
sam
ef932fe1e8 Dashboard QoL: fill the viewport, push legends to bottom
Two recurring layout issues across dashboards I built this session:

  1) Right-placed legend tables ate 30% of each panel width.
  2) Default h:9 panels left ~50% of the viewport empty on a 1080p
     display (total dashboard height ~18 grid rows vs ~30 available).

Stack Resources (Telemetry-3001/stack_resources.json):
  * 3 timeseries: legend placement right -> bottom, calcs [max] -> [last,max],
    added sortBy: Max desc so top consumers float to the top of the legend.
  * Bumped all 4 panels h: 9 -> 14 (dashboard total 18 -> 28 rows).

Kafka Ingestion Lag and Live BGP Churn (Telemetry-3001/*):
  * Bumped timeseries panels h: 9 -> 12; second-row y: 13 -> 16.
    Dashboard total 22 -> 28 rows.

Policy Diff (obmp/History-1002/policy_diff.json):
  * Bumped bottom-row panels h: 8 -> 11. Total 24 -> 27 rows.

Untouched (already adequate, scrollable by design, or built earlier):
  evpn_rib (30 rows), global_table (38), router_diff (52), and the
  Maps-1006 dashboards (already h:22-28 single panels).
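
The row totals quoted above follow from Grafana's gridPos model, where a
dashboard's height is the largest y + h over its panels. A minimal sketch
of that arithmetic, assuming a 2x2 layout for Stack Resources and a
4-row header strip above the Kafka timeseries panels (both layouts are
assumptions, chosen to match the quoted totals):

```python
def dashboard_rows(panels):
    """Total grid rows used: the largest y + h over all panels."""
    return max(p["y"] + p["h"] for p in panels)


# Stack Resources, assumed 2x2: two rows of equal-height panels.
stack_before = [{"y": 0, "h": 9}, {"y": 0, "h": 9},
                {"y": 9, "h": 9}, {"y": 9, "h": 9}]
stack_after = [{"y": 0, "h": 14}, {"y": 0, "h": 14},
               {"y": 14, "h": 14}, {"y": 14, "h": 14}]

# Kafka Ingestion Lag, assumed: a 4-row header, then two timeseries rows.
kafka_before = [{"y": 0, "h": 4}, {"y": 4, "h": 9}, {"y": 13, "h": 9}]
kafka_after = [{"y": 0, "h": 4}, {"y": 4, "h": 12}, {"y": 16, "h": 12}]

print(dashboard_rows(stack_before), "->", dashboard_rows(stack_after))
print(dashboard_rows(kafka_before), "->", dashboard_rows(kafka_after))
```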

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 19:58:33 -07:00
sam
2634aada24 Parameterize HOST_IP everywhere -- portable to another lab host
Removes hardcoded 10.40.40.202 references so a fresh clone + .env-only
edit can stand the stack up on a new compute node.

  * docker-compose.yml: rib-poller PG_DSN now uses ${HOST_IP:-...}.
  * obmp-rib-poller/poller.py: default PG_DSN host falls back to
    ${HOST_IP} env (compose passes it; manual runs honour $HOST_IP too).
  * cml/gobgp_peering_config.py: GOBGP_IP read from $HOST_IP or the
    HOST_IP= line in repo-root .env, with a small _env_default helper.
  * cml/proxmox_bmp_config.py: COLLECTOR_HOST resolved the same way.
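
The lookup order described for the cml scripts (environment first, then
the HOST_IP= line in .env) can be sketched as follows. This is an
illustration only, not the committed _env_default helper; the function
name, signature and quote handling are assumptions:

```python
import os
from pathlib import Path


def resolve_setting(name, env_file=".env", fallback=None):
    """Return $name from the environment, else the name= line in env_file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_file)
    if path.is_file():
        for raw in path.read_text().splitlines():
            line = raw.strip()
            if line.startswith(name + "="):
                # Drop optional surrounding quotes around the value.
                return line.split("=", 1)[1].strip().strip("\"'")
    return fallback
```

Usage would look like resolve_setting("HOST_IP", "/path/to/repo/.env"),
where the path is a placeholder for the repo-root .env.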

For gobgp/gobgpd.conf and gobgp-evpn/gobgpd.conf -- jauderho/gobgp is
distroless (no shell), so we can't sed-substitute at container start.
Pattern instead:

  * gobgpd.conf is now gobgpd.conf.tmpl with __HOST_IP__ placeholders
    (committed). The rendered gobgpd.conf is gitignored.
  * setup.sh renders the .tmpl(s) to .conf using $HOST_IP from .env.
  * compose `command` stays the simple `gobgpd -f /config/gobgpd.conf`.
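
The render step itself is a plain placeholder substitution. setup.sh
does this in shell; the Python below is only an equivalent sketch, and
the function name is hypothetical:

```python
from pathlib import Path


def render_template(tmpl_path, host_ip):
    """Write gobgpd.conf next to gobgpd.conf.tmpl with __HOST_IP__ filled in."""
    tmpl = Path(tmpl_path)
    if tmpl.suffix != ".tmpl":
        raise ValueError(f"expected a .tmpl file, got {tmpl.name}")
    out = tmpl.with_suffix("")  # gobgpd.conf.tmpl -> gobgpd.conf
    out.write_text(tmpl.read_text().replace("__HOST_IP__", host_ip))
    return out
```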

After cloning on a new host:  cp .env.example .env  -> edit HOST_IP ->
./setup.sh -> docker compose up -d. Verified locally by force-recreating
gobgp; all 6 sessions (4 cores + 2 Bromirski) re-established in <60s.

Known portability gaps still to address (separate work):
  * Hardcoded lab-router inventories in cml/*.py and
    obmp-rib-poller/poller.py.
  * The /etc/cron.d/openbmp */5 -> */15 edit inside obmp-psql-app is
    not persistent (regenerated by config_cron on every container start).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 18:34:51 -07:00
sam
2a82bd9a94 ip_rib perf tuning: per-table autovacuum + drop 4 unused indexes
Derived from the 2026-05-19 ingestion stress-test session. psql-app's
unicast_prefix drain rate caps at a few-hundred msg/s under continuous
Postgres maintenance (autovacuum on ip_rib + update_global_ip_rib() /
update_chg_stats() / update_peer_rib_counts() crons) competing for
ip_rib disk I/O.

ALTER TABLE ip_rib SET autovacuum_vacuum_scale_factor=0.02 -- run more
often on smaller chunks. cost_limit kept at its OpenBMP-default 3000 so
each run finishes fast; the consumer runs flat out between bursts
instead of being throttled continuously.

DROP INDEX for four unused/redundant indexes (every INSERT updates every
index; these all had 0 scans in ~2h of heavy activity):
  - ip_rib_hash_id_idx           (907 MB)
  - ip_rib_base_attr_hash_id_idx (558 MB)
  - ip_rib_prefix_idx            (1538 MB, GiST)
  - ip_rib_origin_as_idx         (364 MB)

9 -> 5 indexes; ~3.4 GB freed (6,715 MB -> 3,348 MB). Reduces index
write-amplification per UPSERT by ~45% and shortens autovacuum on
ip_rib by about the same proportion.
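
The figures above can be cross-checked from the per-index sizes listed
in this message. The only assumption is that write cost scales with the
number of indexes:

```python
dropped_mb = {
    "ip_rib_hash_id_idx": 907,
    "ip_rib_base_attr_hash_id_idx": 558,
    "ip_rib_prefix_idx": 1538,
    "ip_rib_origin_as_idx": 364,
}

freed_mb = sum(dropped_mb.values())
before_mb = 6715
after_mb = before_mb - freed_mb

# Share of indexes removed, taken as a proxy for per-UPSERT index writes.
index_share = len(dropped_mb) / 9

print(f"freed {freed_mb} MB, {before_mb} -> {after_mb} MB")
print(f"indexes removed: {index_share:.0%}")
```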

Measurement note: across-cycle 25-min runs were inconclusive on the
sustained-rate effect (inflow was near-zero by then -- gobgp stopped --
so the consumer was largely idle). The real test is re-enabling the
fleet-wide feed with the consumer-replica + 62 GiB RAM and seeing
whether unicast_prefix keeps up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 16:50:15 -07:00
sam
06019ef74c Add consumer-only psql-app replica for ingestion scale-out
psql-app-consumer: profile-gated (Compose profile scale-out) replica for the
Kafka->Postgres ingestion path. Shares the primary's /config read-only so
it reuses obmp-psql.yml, whose fixed group.id makes Kafka rebalance
partitions across the primary and every replica. Its command runs ONLY
the consumer jar -- no cron, RPKI/IRR/DBIP or initdb -- so it does not
duplicate the primary's DB-maintenance jobs (config_cron wires those up
unconditionally in /usr/sbin/run). Each replica brings its own consumer
and writer threads.

Measured: one consumer-only replica raised the post-storm backlog drain
rate from ~3.7k msg/s (cold start) to ~48k msg/s; group membership 8->16. With
2 consumers feeding it, Postgres becomes the next bottleneck (~500% CPU)
-- DB write capacity is the ceiling beyond ~2-3 consumers.

  docker compose --profile scale-out up -d --scale psql-app-consumer=2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 14:04:34 -07:00
sam
22d26f0e0f Gate psql-app startup on Postgres health (fix cold-boot race)
On a cold boot all containers start together; psql-app finishes its
RPKI/IRR/DBIP setup and opens its single Postgres connection while the
DB is still initialising -> "the database system is starting up" ->
ConsumerApp.main throws and the consumer dies. The container does NOT
exit (the wrapper keeps cron/rsyslog alive), so restart: unless-stopped
never fires and the consumer stays dead silently.

Add depends_on psql: condition: service_healthy (plus kafka) so Compose
holds psql-app until Postgres passes its pg_isready healthcheck.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:53:09 -07:00
sam
d7084aba54 Add fast-path churn monitor and churn-storm load tool
obmp-churn-monitor: a decoupled fast-path BGP churn consumer. Reads
openbmp.parsed.unicast_prefix with its own Kafka consumer group and only
counts announcements/withdrawals per (router,peer) into churn_metrics
(010_churn_metrics.sql) -- no relational RIB write. Storm-tested: it
stayed real-time (tracked 1k->85k msg/s) while the psql-app bulk
pipeline lag grew 3.8M->5.6M. Live BGP Churn dashboard reads it.
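
The fast path is cheap because it only keeps counters. A minimal sketch
of the per-(router, peer) tally, assuming each parsed message has
already been decoded into a dict with router, peer and action keys (the
dict shape and the add/del action values are assumptions here, and the
Kafka consumer itself is left out):

```python
from collections import defaultdict


def count_churn(records):
    """Tally announcements and withdrawals per (router, peer)."""
    counts = defaultdict(lambda: {"announce": 0, "withdraw": 0})
    for rec in records:
        key = (rec["router"], rec["peer"])
        # Assumed convention: "add" is an announcement, anything else a withdrawal.
        kind = "announce" if rec["action"] == "add" else "withdraw"
        counts[key][kind] += 1
    return dict(counts)
```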

tools/churn_storm.py: programmatic churn-storm generator (flaps GoBGP's
eBGP sessions to the lab cores) for load testing.

Stress-test finding: fleet-wide full table from 18 routers exceeds this
31 GiB host. The bottleneck is RAM, not CPU -- at 16 cores the host
still hit load 33 because it was swap-thrashing (swap 2/2 full, <1.5 GiB
free). Lag ran away 3.8M->20M+. Recourse: more host RAM for bulk
throughput; the fast-path consumer for visibility regardless.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 13:17:09 -07:00
sam
b681c473c0 Add Policy Diff, fleet-wide full-table feed, and Kafka lag monitoring
Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router
post-policy accepted/advertised prefix counts and route-policy bindings
over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy
Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff
dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected.

GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to
both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route
reflectors the cores propagate it to every R9K client, so all 18 lab
routers carry and BMP-export a full table -- an intentional stress test
of the ingestion/storage path. cml/gobgp_peering_config.py applies and
rolls back the core-side config; gobgp/README.md documents the rollback.

Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every
30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard
gives visibility into the pipeline under churn load.
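
Consumer-group lag is the gap between each partition's log-end offset
and the group's committed offset. A sketch of that calculation on plain
dicts, not the kafka-lag-monitor code; treating a partition with no
committed offset as fully unread is an assumption:

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per (topic, partition): log-end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: no commit yet means the whole partition is unread.
        committed = committed_offsets.get(tp, 0)
        lag[tp] = max(end - committed, 0)
    return lag


def total_lag(end_offsets, committed_offsets):
    """Group lag summed over all partitions."""
    return sum(partition_lag(end_offsets, committed_offsets).values())
```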

Peer Detail dashboard: the Peer selector is now router-qualified
(router -> peer) so it is unambiguous in an iBGP route-reflector mesh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 12:42:25 -07:00
sam
565ebdbee0 Roadmap E5: mark EVPN lab-testable scope complete
evpn_rib table, gobgp-evpn injector, obmp-evpn-consumer and the EVPN
RIB dashboard are built and verified for type-2/type-3. type-5 and
real (non-synthetic) EVPN remain limited by collector 2.2.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:31:08 -07:00
sam
4e0f3fb0ff Add EVPN RIB dashboard (roadmap E5)
Visualises BGP EVPN routes from evpn_rib: route counts and EVIs,
route-type breakdown, a per-EVI summary, and detail tables for type-2
MAC/IP advertisements (MAC, host IP, VNI/label, route-targets, ESI)
and type-3 inclusive-multicast routes. Scoped by an RD/EVI variable.
Lives in the OBMP-L3VPN folder.

Completes roadmap E5's lab-testable scope: evpn_rib table, gobgp-evpn
injector, obmp-evpn-consumer, and this dashboard — verified end to
end with synthetic type-2/type-3 routes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:30:43 -07:00
sam
107cbf6ac5 Add obmp-evpn-consumer: openbmp.parsed.evpn -> evpn_rib (roadmap E5)
A standalone Python Kafka consumer that subscribes to the
openbmp.parsed.evpn topic (which the stock psql-app ignores) and
writes BGP EVPN routes into evpn_rib. Field positions are pinned to
the verified collector 2.2.3 / v1.7 message layout; route_type is
derived from which fields populate. Profile-gated ('evpn-test')
alongside the gobgp-evpn injector.

Verified end to end: 5 injected type-2/type-3 routes land in evpn_rib
with correct RD, ethernet-tag, MAC, IP, label and route-target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:28:19 -07:00
sam
41ec96c3ac EVPN injector: drop type-5 (collector 2.2.3 mis-decodes it)
Verified against the live collector: EVPN type-2 (MAC/IP) and type-3
(inclusive multicast) parse cleanly onto openbmp.parsed.evpn, but
type-5 (IP-prefix) is mis-decoded — the IP prefix corrupts the RD
field. inject-evpn.sh now injects only type-2/3; the type-5
limitation is documented in the injector README and roadmap E5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:24:08 -07:00
sam
f7532b62ef Add modular gobgp-evpn EVPN test-route injector (roadmap E5)
A profile-gated GoBGP instance (Compose profile 'evpn-test', not part
of the normal stack) that originates synthetic BGP EVPN routes and
BMP-exports its local RIB to the collector. Verified end to end: the
injected type-2/3/5 routes are parsed by the collector and land on
the openbmp.parsed.evpn Kafka topic, ready for the EVPN consumer.

inject-evpn.sh pushes type-2 (MAC/IP), type-3 (inclusive multicast)
and type-5 (IP-prefix) routes. Start with:
  docker compose --profile evpn-test up -d gobgp-evpn

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:15:44 -07:00
sam
2d83d6c02e Add evpn_rib schema; update production sizing with measured data
- postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table
  (roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib;
  a dedicated consumer will populate it.
- production-sizing.md: corrected retention figures to the actual
  policy values, added a measured-data section (one full feed ≈
  +5 GB current state; DB now ~30 GB), and a horizontal-scaling
  section — the bottleneck is the psql-app consumer + disk IOPS, so
  scale psql-app as a Kafka consumer group (cap = partition count),
  treat multi-collector as HA/locality not throughput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:44:09 -07:00
sam
c18d11a48f Roadmap E5: refine with EVPN research findings
The OpenBMP collector already decodes EVPN and emits openbmp.parsed.evpn;
the gap is solely the psql-app (no subscription/handler) and the missing
schema table. L2VPN-VPLS is unsupported entirely. Records the two
implementation paths: fork the Java psql-app, or run GoBMP as a second
EVPN-capable collector with a thin Postgres consumer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:31:44 -07:00
sam
e55e12b778 Add Storage & Feed Health dashboard
Surfaces the new Telegraf disk/DB metrics plus GoBGP feed health:
openbmp database size (current + trend), largest tables, host
filesystem usage % and free space, GoBGP feed route count, and the
state of the IPv4/IPv6 BGP sessions to AS57355. Lives in the
OBMP-Telemetry folder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:31:08 -07:00
sam
fc164a5689 Add disk-space and DB-size monitoring to Telegraf
Telegraf now collects host filesystem usage ([[inputs.disk]], via a
read-only /hostfs mount) and PostgreSQL database + per-table sizes
([[inputs.postgresql_extensible]]) into InfluxDB. Surfaces RIB growth
and disk pressure — relevant now that the full-table GoBGP feed has
pushed the openbmp DB to ~30 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:28:36 -07:00
sam
cffb835f30 Enable IPv6 feed: run GoBGP in host network mode
The IPv6 eBGP session never established because the Docker bridge
has no IPv6. Switch the gobgp container to network_mode: host so it
uses the host's real dual-stack connectivity — both sessions to
AS57355 now source from the host's public v4/v6 addresses.

Host mode binds the host's port namespace, so disable GoBGP's
inbound BGP listener (port = -1) — we only originate outbound
sessions, and a non-root container cannot bind privileged port 179.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:08:55 -07:00
sam
7766525787 Roadmap: add E5 — L2VPN/EVPN needs platform work, not dashboards
This OpenBMP deployment has no EVPN/L2VPN schema; supporting it
requires collector + psql-app + schema changes upstream, not a
Grafana dashboard. Captured as E5 with a research-spike first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:01:59 -07:00
sam
6496b60311 Make L3VPN RD/VRF filter dynamic (roadmap E4)
Both L3VPN dashboards had a static custom 'rd' variable holding only
the '-' (all) sentinel — you could not actually filter by a VRF.
Convert 'rd' to a query variable that discovers route distinguishers
from l3vpn_rib. Degrades cleanly on the (currently empty) lab table:
the query always returns '-', so behaviour is unchanged until real
L3VPN data exists, then RDs auto-populate. Existing panel SQL
('$rd' = '-' OR rd = '$rd') is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:00:50 -07:00
sam
0451b2aa87 Add Global Internet Table dashboard (roadmap E3)
Explores the real DFZ table received from the AS57355 route server
via the GoBGP feed (the '$feed' / GoBGP BMP peer): IPv4/IPv6 prefix
counts, distinct origin ASes, prefix-length distribution, top origin
ASes by prefix count, and an overlap-based prefix lookup. Serves as
the comparison baseline for the Router Diff dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:57:47 -07:00
sam
af4b816bef Fix GoBGP BMP target: use host IP, not collector hostname
GoBGP's BMP config requires a literal IP — 'obmp-collector' failed
to parse and the container crash-looped. Point BMP export at the
docker host IP (10.40.40.202) where the collector publishes port
5000; stable across container recreation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:51:41 -07:00
sam
8ced62e491 Add generic Router Diff dashboard (roadmap E2)
Generalizes the RR-specific RR Loc-RIB Diff into a comparison of up
to four selectable routers. router1/router2 are required; router3/
router4 default to a '-- none --' sentinel and drop cleanly out of
every query (no empty-IN, no dangling predicates).

Panels: per-router prefix counts, divergent-prefix count, a presence
matrix (row per prefix, column per router, cell = best-path next-hop),
a divergence detail table classifying missing / next-hop / AS-path
disagreement, and a per-prefix all-paths drill-down. Once the GoBGP
global feed (E1) is up, GLOBAL-FEED is selectable as any of the four
for lab-vs-Internet diffing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:41:49 -07:00
sam
88a5546e29 Add GoBGP full-table feed container (roadmap E1)
New gobgp service: GoBGP peers eBGP-multihop with the AS57355 lab
route server (Bromirski) for the full real IPv4 + IPv6 Internet table
and BMP-exports it to the OpenBMP collector, landing in ip_rib as a
monitored peer.

Config follows the route server's published peering spec: local AS
65001, no password, keepalive 3600 / hold-time 7200, IPv4 feed on the
v4 session and IPv6 feed on the v6 session. gobgp/mrt-refresh.sh is a
cron-safe fallback that injects RouteViews MRT RIB dumps when the live
session is down. The live BGP session is not started here — bringing
gobgp up establishes the external session and loads ~1M routes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:39:12 -07:00
sam
d60c582ff6 Add roadmap Track E: Internet-scale routing analytics
Plan for a local full-Internet routing table, a generalized N-way
router diff, and VRF/RD scoping:

- E1: GoBGP container peering AS57355 (Bromirski lab route server)
  for a live full v4/v6 table, MRT RIB dumps as a 2-hourly fallback,
  BMP-exported into ip_rib as a GLOBAL-FEED peer.
- E2: generic up-to-4-router diff dashboard (presence matrix),
  generalized from the RR-specific rr_locrib_diff.
- E3: global table exploration dashboard.
- E4: VRF/RD scoping across unicast + L3VPN dashboards (built to
  schema; not lab-verifiable with CML IOS-XR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:19:34 -07:00
sam
cc0d20bf9e Back AS Relationship Map with a materialized view
The AS map previously exploded ~4.4M base_attrs AS_PATH rows live,
three times per load (one per panel), ~1.8s each — slow enough that
navigating away cancelled the queries mid-flight.

Add mv_as_adjacency: undirected consecutive-AS pairs with occurrence
counts over the full RIB (17k rows), refreshed hourly by pg_cron via
REFRESH ... CONCURRENTLY. The dashboard panels now read the view in
~1ms. Min-occurrence options rescaled for full-RIB counts
(2000/5000/10000/50000, default 2000 -> ~63-node graph).
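
What the view stores can be sketched in a few lines: every pair of
consecutive ASes in a path, direction ignored, with a count. This is an
illustration of the idea rather than the view's SQL; skipping prepended
repeats of the same AS is an assumption:

```python
from collections import Counter


def as_adjacency(as_paths):
    """Count undirected consecutive-AS pairs across a set of AS paths."""
    pairs = Counter()
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            if a != b:  # assumed: prepends are not adjacencies
                pairs[(min(a, b), max(a, b))] += 1
    return pairs
```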

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:04:38 -07:00
sam
0190ef5fb8 Fix BGP Peer Map blank graph: connect disconnected lab components
The node graph rendered blank because the two CML/PROX labs formed
two disconnected components (iBGP-only meshes within each lab), and
Grafana's nodeGraph layout renders nothing for a disconnected graph.

Match BGP sessions to monitored routers by peer IP as well as peer
BGP-ID, so the real cross-lab eBGP sessions become graph edges. The
graph is now one connected component (30 iBGP + 4 eBGP edges) and
lays out. The companion external-neighbours table uses the same
peer-IP check so those sessions are no longer double-listed.
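
Whether a set of sessions forms one connected graph can be checked
offline with a small union-find. A sketch with hypothetical node names:

```python
def count_components(nodes, edges):
    """Number of connected components in an undirected graph."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(n) for n in nodes})


nodes = ["cml-1", "cml-2", "prox-1", "prox-2"]
ibgp = [("cml-1", "cml-2"), ("prox-1", "prox-2")]
ebgp = [("cml-1", "prox-1")]
print(count_components(nodes, ibgp))         # labs only
print(count_components(nodes, ibgp + ebgp))  # with the cross-lab session
```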

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:42:31 -07:00
sam
940f54c553 Fix BGP Peer Map blank node graph: numeric edge mainstat
The node graph rendered empty because the edges target returned a
string mainstat ('iBGP'/'eBGP'). Grafana's nodeGraph treats edge
mainStat as numeric for layout/labelling; a string value silently
breaks the layout so no nodes are drawn (the working LS map and the
original ls_topo both cast edge mainstat to an integer).

Edge mainstat is now COUNT(DISTINCT feed)::int (BMP peer-feed count
for the router pair); the iBGP/eBGP label moves to secondarystat and
detail__session_type, which accept strings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:14:05 -07:00
sam
d815a4774b Use proven singlequote format for RR variable in BGP Peer Map
Switch the route-reflector membership test to
= ANY(ARRAY[${rr_loopbacks:singlequote}]::text[]) — the singlequote
format is the one already proven to interpolate correctly in this
Grafana instance (rr_locrib_diff uses it), and the ARRAY[...]::text[]
wrapper stays valid (empty array) when the variable resolves empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:53:43 -07:00
sam
1acdc32dda Fix LS Topology map double-quoted protocol variable
The protocol variable has includeAll enabled, so Grafana auto-quotes
its value ('IS-IS_L2'); the SQL then wrapped it again, producing
''IS-IS_L2'' and a syntax error that blanked the node graph. Replace
the quoted equality filter with IN ($protocol) — Grafana already
emits a quoted CSV — and make the variable multi-select so "All"
expands cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:20:09 -07:00
sam
0200932ea0 Fix BGP Peer Map empty-variable crash in RR detection
When the rr_loopbacks variable resolved empty, the IN
(${rr_loopbacks:singlequote}) clause expanded to IN (), a SQL
syntax error that blanked the topology panel. Switch to
= ANY(string_to_array('${rr_loopbacks:csv}', ',')), which yields
a no-match (not a syntax error) on an empty variable.
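
The failure is easiest to see by expanding the two forms by hand. The
helpers below only mimic the singlequote and csv variable formats for
illustration, and peer_ip is a hypothetical column name:

```python
def singlequote(values):
    """Mimic ${var:singlequote}: 'a','b'."""
    return ",".join(f"'{v}'" for v in values)


def csv(values):
    """Mimic ${var:csv}: a,b."""
    return ",".join(values)


def in_clause(values):
    return f"peer_ip IN ({singlequote(values)})"


def any_clause(values):
    return f"peer_ip = ANY(string_to_array('{csv(values)}', ','))"


print(in_clause([]))   # empty parentheses: rejected by the SQL parser
print(any_clause([]))  # still parses; the array is empty, so nothing matches
```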

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:12:12 -07:00
sam
f6a100e673 Add OBMP-Maps dashboard suite: BGP/IGP/AS topology and geo maps
Create a new OBMP-Maps Grafana folder (folderUid 1006) with four
data-visualization dashboards built on nodeGraph and geomap panels:

- BGP Peer Map: routers as nodes, BGP sessions as edges; iBGP/eBGP
  edge typing and operator-editable rr_loopbacks variable to denote
  route reflectors; companion table for sessions to non-monitored
  neighbours.
- IGP / Link-State Topology Map: reworked from LinkState-1004 and
  moved here (uid preserved); scoped by peer feed / protocol / AS so
  the 489-node BGP-LS topology stays readable; SR-capability rings.
- AS Relationship Map: AS adjacency graph from consecutive AS_PATH
  pairs over a 200k-route sample; min-occurrence and focus-AS
  variables; nodes enriched from info_asn whois.
- Geographic Prefix Map: geomap of RIB prefixes and origin ASes by
  IP geolocation, with a note that lab 10.x loopbacks do not
  geolocate; bounded geo_ip join via a sample-size variable.

Also add a data link on the Looking Glass ASN Info panel's origin_as
column that jumps to the ASN View dashboard scoped to that AS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:07:41 -07:00
sam
26dea47a55 Make the ASN View origin-AS selector a free-text input
asn_num was a fixed custom variable; converting it to a textbox lets an
operator look up any origin AS and see all of its RIB prefixes, upstreams,
and downstreams.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:22:21 -07:00
sam
9d74940614 Fix ExaBGP OOM, add container health checks and resource monitoring
RCA: the exabgp container was OOM-killed — its 512m mem_limit was far too
small for the full-table feature (900K route objects in memory). Raises the
limit to a parameterized 6g default (EXABGP_MEM_LIMIT).

Adds Docker healthchecks to 14 services (port/HTTP probes) so unhealthy
containers are visible. Adds a Telegraf docker input that collects per-
container CPU/memory/IO into InfluxDB, plus a "Stack Resources" dashboard —
so resource pressure is caught before it causes an OOM crash. telegraf runs
with an overridden entrypoint so it keeps root and can read the docker socket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:03:52 -07:00
sam
482c0cdc01 Add ipv6 unicast to ExaBGP neighbor family
The IOS-XR routers negotiate IPv6 unicast capability, but the generated
exabgp.conf declared only ipv4 unicast — producing repeated "route family
(ipv6/unicast) is not configured" errors that crashed ExaBGP. Declaring
ipv6 unicast on the neighbor matches the routers' capabilities and stops
the crash-restart cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:40:32 -07:00
sam
6d3387dfe5 Add RR next-hop sanity check to the RR Loc-RIB Diff dashboard
Adds a panel that flags the next-hop-self-on-an-RR anti-pattern: reflected
routes (those carrying ORIGINATOR_ID) whose NEXT_HOP is an RR loopback while
the route was originated by a different router — meaning the RR rewrote
next-hop to itself and has been pulled into the forwarding path. RR-originated
routes and legitimately-imported eBGP routes (originator == next-hop) are
excluded. An editable rr_loopbacks template variable keeps it environment-
agnostic — useful for validating RR behavior during an IOS-XR to Junos
migration.
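
The rule the panel applies can be written as a single predicate. A
sketch on plain dicts rather than the panel's SQL; the key names are
assumptions:

```python
def rr_rewrote_next_hop(route, rr_loopbacks):
    """True when a reflected route points its NEXT_HOP at an RR that did not originate it."""
    originator = route.get("originator_id")
    if not originator:
        return False  # no ORIGINATOR_ID: not a reflected route
    next_hop = route["next_hop"]
    # originator == next_hop covers RR-originated and imported eBGP routes.
    return next_hop in rr_loopbacks and originator != next_hop
```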

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:18:22 -07:00
sam
a662496e53 Fix telemetry dashboard variables and parameterize gNMI targets
The telemetry dashboards' router/interface variables used a keep|distinct
Flux pattern that returned only one source; switch to schema.tagValues so all
streaming routers and interfaces are listed. Parameterize telegraf.conf gNMI
addresses and credentials via GNMI_ADDRESSES/GNMI_USERNAME/GNMI_PASSWORD so
the telemetry fleet can scale without editing the config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:10:57 -07:00
sam
0732ebfa07 Add production-readiness deliverables: security, backup, alerting
Adds a prioritized security-hardening checklist, a PostgreSQL logical-backup
script (pg-backup.sh) with a documented restore procedure, and Grafana
alerting provisioning (peer-down, flap-storm, RPKI-invalid, router-down rules
plus a contact-point template). The alerting YAML and contact points need
operator review before being relied on for paging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:55:03 -07:00
sam
7e3370b5a5 Rework Grafana dashboard information architecture
Reorganizes 31 dashboards into an operator-first structure with real
navigation. Adds Router Detail and Peer Detail drilldown dashboards; merges
LS Nodes+Links and the two L3VPN dashboards; modernizes all deprecated panels
(table-old/graph/worldmap). Every dashboard gets the obmp-nav dropdown so the
whole set is reachable from anywhere. Graduates the operational "Learning"
dashboards into Operations/Routing/LinkState folders, retires the Tops folder,
and relabels folders (Base->Operations, History->Routing, Learning->Reference).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:55:03 -07:00
sam
f430758992 Scope NOC Overview "Peers Down" panels to the dashboard time range
The scorecard and table counted every bgp_peers row in a down state,
including peers removed long ago (OpenBMP never prunes bgp_peers). They now
filter on the peer's last state-change timestamp via $__timeFilter, so the
panel reflects current/recent problems rather than all-time history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:29:59 -07:00
sam
f1558946ae Add production sizing guide for 40 full-table-edge routers
Documents compute, memory, and storage requirements for a production
deployment: ~100-150M NLRI estimate, 96-128 GB RAM, 16-32 vCPU, 3-5 TB NVMe,
a split-host architecture option, PostgreSQL tuning, and a BMP RIB-scope
recommendation (Adj-RIB-In only initially).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:06:25 -07:00
sam
960806fc06 Add NOC Overview dashboard and rebuild home as a navigation hub
NOC Overview is the new flagship operator landing dashboard — health
scorecards, peer session timeline, BGP update rate, and attention tables for
peers down, churning prefixes, RPKI invalids, and topology changes. All counts
come from stats_* aggregate tables so it stays fast at production scale.
OBMP-Home is rebuilt as a lightweight navigation hub pointing at NOC Overview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:04:37 -07:00
sam
4e9bd7cc5a Add container memory limits to all services
Sets mem_limit on every service to cap the OOM/swap-exhaustion risk (the lab
host had only 5 MiB swap free). The three heavy services (psql, kafka,
psql-app) read their limits from .env so production can raise them; the rest
use lab-appropriate fixed values. Total ~25 GB, leaving headroom on the 31 GB
lab host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:04:37 -07:00
sam
8ac156ce86 Add second-lab ExaBGP peering and bulk BMP config script
Generalizes exabgp/startup.sh to template BGP neighbors from an EXABGP_PEERS
list (ip:peer_as:description), so ExaBGP can peer with multiple labs. Adds
cml/proxmox_bmp_config.py to apply the bmp server block to a lab's IOS-XR
routers over SSH (BMP config is not exposed via NETCONF YANG on current XR).
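
The EXABGP_PEERS entry format (ip:peer_as:description) can be parsed as
below. startup.sh does the templating in shell; this Python is only a
sketch, the comma separator between entries is an assumption, and the
plain split handles IPv4 addresses only:

```python
from typing import NamedTuple


class Peer(NamedTuple):
    ip: str
    peer_as: int
    description: str


def parse_peers(spec, sep=","):
    """Parse 'ip:peer_as:description' entries separated by sep."""
    peers = []
    for entry in spec.split(sep):
        entry = entry.strip()
        if not entry:
            continue
        # maxsplit=2 keeps any colons inside the description intact.
        ip, peer_as, description = entry.split(":", 2)
        peers.append(Peer(ip, int(peer_as), description))
    return peers
```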

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 19:21:11 -07:00
sam
cf4e5b07c6 Add Compose profiles, setup.sh bootstrap, and config templates for portable deployment
Pins the Compose project name and splits services into core / test / auth
profiles so the BMP collector core can deploy standalone. Adds setup.sh
(idempotent bootstrap), .env.example, and repo-resident Authelia config
templates so a fresh host deploys without manual steps. Parameterizes
hardcoded host IP and domain; points the Grafana InfluxDB datasource at the
container name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 19:21:04 -07:00
sam
31286d5d3e Add platform roadmap: multi-lab CML integration and production deployment
Four-track roadmap covering configuration centralization (inventory.yaml),
CML API automation (virl2_client), production ISP deployment (multi-vendor
IOS-XR + Junos), and packaging for distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:23:38 -07:00
sam
da49b3e462 Add CML integration: XRd and ExaBGP node/image definitions and build scripts
CML 2.9 node definitions for XRd Control-Plane (third RR) and ExaBGP route
injector as Docker-based CML nodes. Includes build scripts to export Docker
images as tars for CML import, with IOS-XR startup configs for IS-IS, BGP,
and BMP.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:23:30 -07:00
sam
541f018bc5 Add RR Loc-RIB diff dashboard and route diversity config
Dashboard compares Adj-RIB-In tables between two Route Reflectors via BMP,
showing missing prefixes, attribute diffs (next-hop, AS path), and per-client
consistency. Route diversity script deploys 29 prefixes across R9K-01 through R9K-07 via
NETCONF to create verifiable next-hop differences between RRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:23:19 -07:00
sam
45f4c9859d Add Authelia auth gateway, portal landing page, and subpath routing
Adds Authelia (forward-auth) and nginx portal container for single-endpoint
authenticated access via Caddy reverse proxy. Configures Grafana auth proxy
for header-based auto-login. Updates Vue UI base paths and API routes for
/exabgp/ and /traffic/ subpath serving. Adds traffic-gen responder container
on dedicated Docker network.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:23:09 -07:00
sam
422b98d555 Fix telemetry dashboards: update Flux queries and InfluxDB datasource URL
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:22:58 -07:00
sam
d691b512f9 Add full internet table injection with background worker and progress tracking
Generates realistic IPv4 routing tables (1K-900K prefixes) with DFZ-like
prefix length distribution, varied AS paths, and transit ASN diversity.
Background injection with progress API, CLI follow mode, and Vue UI
component with preset sizes.
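
The weighted prefix-length idea can be sketched as follows. This is not
the committed generator: the weights are hypothetical (chosen so /24
dominates, not measured data), reserved address space is not filtered,
and AS-path generation is left out:

```python
import ipaddress
import random

# Hypothetical prefix-length weights; /24 is the most common length.
LENGTH_WEIGHTS = {16: 3, 19: 3, 20: 5, 21: 5, 22: 12, 23: 10, 24: 62}


def generate_prefixes(count, seed=0):
    """Return `count` unique IPv4 networks with weighted prefix lengths."""
    rng = random.Random(seed)
    lengths = list(LENGTH_WEIGHTS)
    weights = list(LENGTH_WEIGHTS.values())
    prefixes = {}  # dict keeps insertion order and drops duplicates
    while len(prefixes) < count:
        plen = rng.choices(lengths, weights=weights)[0]
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        base = rng.getrandbits(32) & mask  # clear host bits
        prefixes[ipaddress.ip_network((base, plen))] = None
    return list(prefixes)
```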

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:22:51 -07:00