30 Commits

Author SHA1 Message Date
sam
b681c473c0 Add Policy Diff, fleet-wide full-table feed, and Kafka lag monitoring
Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router
post-policy accepted/advertised prefix counts and route-policy bindings
over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy
Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff
dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected.

GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to
both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route
reflectors the cores propagate it to every R9K client, so all 18 lab
routers carry and BMP-export a full table -- an intentional stress test
of the ingestion/storage path. cml/gobgp_peering_config.py applies and
rolls back the core-side config; gobgp/README.md documents the rollback.

Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every
30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard
gives visibility into the pipeline under churn load.

Peer Detail dashboard: the Peer selector is now router-qualified
(router -> peer) so it is unambiguous in an iBGP route-reflector mesh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 12:42:25 -07:00
sam
4e0f3fb0ff Add EVPN RIB dashboard (roadmap E5)
Visualises BGP EVPN routes from evpn_rib: route counts and EVIs,
route-type breakdown, a per-EVI summary, and detail tables for type-2
MAC/IP advertisements (MAC, host IP, VNI/label, route-targets, ESI)
and type-3 inclusive-multicast routes. Scoped by an RD/EVI variable.
Lives in the OBMP-L3VPN folder.

Completes roadmap E5's lab-testable scope: evpn_rib table, gobgp-evpn
injector, obmp-evpn-consumer, and this dashboard — verified end to
end with synthetic type-2/type-3 routes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 09:30:43 -07:00
sam
e55e12b778 Add Storage & Feed Health dashboard
Surfaces the new Telegraf disk/DB metrics plus GoBGP feed health:
openbmp database size (current + trend), largest tables, host
filesystem usage % and free space, GoBGP feed route count, and the
state of the IPv4/IPv6 BGP sessions to AS57355. Lives in the
OBMP-Telemetry folder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:31:08 -07:00
sam
6496b60311 Make L3VPN RD/VRF filter dynamic (roadmap E4)
Both L3VPN dashboards had a static custom 'rd' variable holding only
the '-' (all) sentinel — you could not actually filter by a VRF.
Convert 'rd' to a query variable that discovers route distinguishers
from l3vpn_rib. Degrades cleanly on the (currently empty) lab table:
the query always returns '-', so behaviour is unchanged until real
L3VPN data exists, then RDs auto-populate. Existing panel SQL
('$rd' = '-' OR rd = '$rd') is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:00:50 -07:00
sam
0451b2aa87 Add Global Internet Table dashboard (roadmap E3)
Explores the real DFZ table received from the AS57355 route server
via the GoBGP feed (the '$feed' / GoBGP BMP peer): IPv4/IPv6 prefix
counts, distinct origin ASes, prefix-length distribution, top origin
ASes by prefix count, and an overlap-based prefix lookup. Serves as
the comparison baseline for the Router Diff dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:57:47 -07:00
sam
8ced62e491 Add generic Router Diff dashboard (roadmap E2)
Generalizes the RR-specific RR Loc-RIB Diff into a comparison of up
to four selectable routers. router1/router2 are required; router3/
router4 default to a '-- none --' sentinel and drop cleanly out of
every query (no empty-IN, no dangling predicates).

Panels: per-router prefix counts, divergent-prefix count, a presence
matrix (row per prefix, column per router, cell = best-path next-hop),
a divergence detail table classifying missing / next-hop / AS-path
disagreement, and a per-prefix all-paths drill-down. Once the GoBGP
global feed (E1) is up, GLOBAL-FEED is selectable as any of the four
for lab-vs-Internet diffing.
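
As a rough illustration of the divergence classification described above, here is a minimal Python sketch. It is not the dashboard's SQL: the `ribs` layout, the router names, and the order in which the three categories are checked are assumptions made for the example.

```python
NONE = "-- none --"  # sentinel used by the optional router3/router4 variables

def selected(routers):
    """Drop the sentinel so unused slots never reach the comparison."""
    return [r for r in routers if r != NONE]

def classify(prefix, ribs, routers):
    """Classify one prefix across the selected routers.

    ribs is a hypothetical mapping: router -> {prefix: best-path entry}.
    The check order (missing, then next-hop, then AS path) is an assumption.
    """
    entries = [ribs[r].get(prefix) for r in selected(routers)]
    if any(e is None for e in entries):
        return "missing"
    if len({e["next_hop"] for e in entries}) > 1:
        return "next-hop"
    if len({tuple(e["as_path"]) for e in entries}) > 1:
        return "as-path"
    return "consistent"

ribs = {
    "r1": {
        "192.0.2.0/24": {"next_hop": "10.0.0.1", "as_path": [64500]},
        "198.51.100.0/24": {"next_hop": "10.0.0.1", "as_path": [64500]},
        "203.0.113.0/24": {"next_hop": "10.0.0.1", "as_path": [64500, 64501]},
    },
    "r2": {
        "192.0.2.0/24": {"next_hop": "10.0.0.2", "as_path": [64500]},
        "203.0.113.0/24": {"next_hop": "10.0.0.1", "as_path": [64502, 64501]},
    },
}
routers = ["r1", "r2", NONE, NONE]
results = {p: classify(p, ribs, routers) for p in ribs["r1"]}
```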

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:41:49 -07:00
sam
cc0d20bf9e Back AS Relationship Map with a materialized view
The AS map previously exploded ~4.4M base_attrs AS_PATH rows live,
three times per load (one per panel), ~1.8s each — slow enough that
navigating away cancelled the queries mid-flight.

Add mv_as_adjacency: undirected consecutive-AS pairs with occurrence
counts over the full RIB (17k rows), refreshed hourly by pg_cron via
REFRESH ... CONCURRENTLY. The dashboard panels now read the view in
~1ms. Min-occurrence options rescaled for full-RIB counts
(2000/5000/10000/50000, default 2000 -> ~63-node graph).
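
The pair counting behind mv_as_adjacency can be sketched in Python as follows. This only illustrates the idea (undirected consecutive-AS pairs with occurrence counts, then a min-occurrence cut); the real view is SQL, and skipping prepended repeats of the same AS is an assumption of this sketch.

```python
from collections import Counter

def as_adjacency(as_paths):
    """Count undirected consecutive-AS pairs across a set of AS paths."""
    counts = Counter()
    for path in as_paths:
        for left, right in zip(path, path[1:]):
            if left == right:
                continue  # assumption: AS-path prepending is not an adjacency
            # Sort the pair so (a, b) and (b, a) land on the same key.
            counts[(min(left, right), max(left, right))] += 1
    return counts

def edges_at_least(counts, min_occurrence):
    """Keep only pairs seen at least min_occurrence times."""
    return {pair: n for pair, n in counts.items() if n >= min_occurrence}

# Hypothetical sample paths using private and documentation AS numbers.
paths = [
    [65020, 65100, 64500],
    [65021, 65100, 64500],
    [64500, 65100, 65100, 65020],
]
counts = as_adjacency(paths)
graph_edges = edges_at_least(counts, 2)
```

Raising the threshold shrinks the graph, which is the effect the min-occurrence options have on the dashboard.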

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:04:38 -07:00
sam
0190ef5fb8 Fix BGP Peer Map blank graph: connect disconnected lab components
The node graph rendered blank because the two CML/PROX labs formed
two disconnected components (iBGP-only meshes within each lab), and
Grafana's nodeGraph layout renders nothing for a disconnected graph.

Match BGP sessions to monitored routers by peer IP as well as peer
BGP-ID, so the real cross-lab eBGP sessions become graph edges. The
graph is now one connected component (30 iBGP + 4 eBGP edges) and
lays out. The companion external-neighbours table uses the same
peer-IP check so those sessions are no longer double-listed.
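
One way to check whether a session graph is a single connected component is a small union-find pass, sketched below in Python. The node names and edge lists are hypothetical stand-ins for the two labs, not data from the dashboard.

```python
def count_components(nodes, edges):
    """Return the number of connected components in an undirected graph."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(n) for n in nodes})

# Hypothetical routers: one core and one client per lab.
nodes = ["cml-core", "cml-r1", "prox-core", "prox-r1"]
ibgp = [("cml-core", "cml-r1"), ("prox-core", "prox-r1")]
ebgp = [("cml-core", "prox-core")]

before = count_components(nodes, ibgp)         # iBGP edges only
after = count_components(nodes, ibgp + ebgp)   # cross-lab eBGP edge added
```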

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:42:31 -07:00
sam
940f54c553 Fix BGP Peer Map blank node graph: numeric edge mainstat
The node graph rendered empty because the edges target returned a
string mainstat ('iBGP'/'eBGP'). Grafana's nodeGraph treats edge
mainStat as numeric for layout/labelling; a string value silently
breaks the layout so no nodes are drawn (the working LS map and the
original ls_topo both cast edge mainstat to an integer).

Edge mainstat is now COUNT(DISTINCT feed)::int (BMP peer-feed count
for the router pair); the iBGP/eBGP label moves to secondarystat and
detail__session_type, which accept strings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:14:05 -07:00
sam
d815a4774b Use proven singlequote format for RR variable in BGP Peer Map
Switch the route-reflector membership test to
= ANY(ARRAY[${rr_loopbacks:singlequote}]::text[]) — the singlequote
format is the one already proven to interpolate correctly in this
Grafana instance (rr_locrib_diff uses it), and the ARRAY[...]::text[]
wrapper stays valid (empty array) when the variable resolves empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:53:43 -07:00
sam
1acdc32dda Fix LS Topology map double-quoted protocol variable
The protocol variable has includeAll enabled, so Grafana auto-quotes
its value ('IS-IS_L2'); the SQL then wrapped it again, producing
''IS-IS_L2'' and a syntax error that blanked the node graph. Replace
the quoted equality filter with IN ($protocol) — Grafana already
emits a quoted CSV — and make the variable multi-select so "All"
expands cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:20:09 -07:00
sam
0200932ea0 Fix BGP Peer Map empty-variable crash in RR detection
When the rr_loopbacks variable resolved empty, the IN
(${rr_loopbacks:singlequote}) clause expanded to IN (), a SQL
syntax error that blanked the topology panel. Switch to
= ANY(string_to_array('${rr_loopbacks:csv}', ',')), which yields
a no-match (not a syntax error) on an empty variable.
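
The difference between the two expansions can be seen by rendering them by hand. The Python helpers below only approximate what Grafana's singlequote and csv formats produce for a multi-value variable; they are illustrative, not Grafana code.

```python
def in_clause(values):
    """Approximate IN (${var:singlequote}): each value quoted, comma-joined."""
    quoted = ",".join("'" + v + "'" for v in values)
    return "IN (" + quoted + ")"

def any_clause(values):
    """Approximate = ANY(string_to_array('${var:csv}', ','))."""
    return "= ANY(string_to_array('" + ",".join(values) + "', ','))"

loopbacks = ["10.0.0.1", "10.0.0.2"]  # hypothetical RR loopbacks

populated_in = in_clause(loopbacks)
empty_in = in_clause([])       # "IN ()" is the fragment that fails to parse
populated_any = any_clause(loopbacks)
empty_any = any_clause([])     # still well-formed SQL; matches no rows
```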

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:12:12 -07:00
sam
f6a100e673 Add OBMP-Maps dashboard suite: BGP/IGP/AS topology and geo maps
Create a new OBMP-Maps Grafana folder (folderUid 1006) with four
data-visualization dashboards built on nodeGraph and geomap panels:

- BGP Peer Map: routers as nodes, BGP sessions as edges; iBGP/eBGP
  edge typing and operator-editable rr_loopbacks variable to denote
  route reflectors; companion table for sessions to non-monitored
  neighbours.
- IGP / Link-State Topology Map: reworked from LinkState-1004 and
  moved here (uid preserved); scoped by peer feed / protocol / AS so
  the 489-node BGP-LS topology stays readable; SR-capability rings.
- AS Relationship Map: AS adjacency graph from consecutive AS_PATH
  pairs over a 200k-route sample; min-occurrence and focus-AS
  variables; nodes enriched from info_asn whois.
- Geographic Prefix Map: geomap of RIB prefixes and origin ASes by
  IP geolocation, with a note that lab 10.x loopbacks do not
  geolocate; bounded geo_ip join via a sample-size variable.

Also add a data link on the Looking Glass ASN Info panel's origin_as
column that jumps to the ASN View dashboard scoped to that AS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:07:41 -07:00
sam
26dea47a55 Make the ASN View origin-AS selector a free-text input
asn_num was a fixed custom variable; converting it to a textbox lets an
operator look up any origin AS and see all of its RIB prefixes, upstreams,
and downstreams.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:22:21 -07:00
sam
9d74940614 Fix ExaBGP OOM, add container health checks and resource monitoring
RCA: the exabgp container was OOM-killed — its 512m mem_limit was far too
small for the full-table feature (900K route objects in memory). Raises the
limit to a parameterized 6g default (EXABGP_MEM_LIMIT).

Adds Docker healthchecks to 14 services (port/HTTP probes) so unhealthy
containers are visible. Adds a Telegraf docker input that collects per-
container CPU/memory/IO into InfluxDB, plus a "Stack Resources" dashboard —
so resource pressure is caught before it causes an OOM crash. telegraf runs
with an overridden entrypoint so it keeps root and can read the docker socket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:03:52 -07:00
sam
6d3387dfe5 Add RR next-hop sanity check to the RR Loc-RIB Diff dashboard
Adds a panel that flags the next-hop-self-on-an-RR anti-pattern: reflected
routes (those carrying ORIGINATOR_ID) whose NEXT_HOP is an RR loopback while
the route was originated by a different router — meaning the RR rewrote
next-hop to itself and has been pulled into the forwarding path. RR-originated
routes and legitimately-imported eBGP routes (originator == next-hop) are
excluded. An editable rr_loopbacks template variable keeps it environment-
agnostic — useful for validating RR behavior during an IOS-XR to Junos
migration.
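
The rule the panel applies can be sketched as a Python predicate. The route dictionaries and field names here are hypothetical; the dashboard itself evaluates the equivalent condition in SQL against the BMP tables.

```python
def rr_nexthop_violations(routes, rr_loopbacks):
    """Return prefixes whose NEXT_HOP was rewritten to an RR loopback."""
    flagged = []
    for route in routes:
        originator = route.get("originator_id")
        if originator is None:
            continue  # no ORIGINATOR_ID: not a reflected route
        next_hop = route["next_hop"]
        # Flag only when the next hop is an RR and some other router originated
        # the route; originator == next-hop covers the excluded cases.
        if next_hop in rr_loopbacks and originator != next_hop:
            flagged.append(route["prefix"])
    return flagged

rr_loopbacks = {"10.0.0.1"}  # hypothetical value of the rr_loopbacks variable
routes = [
    {"prefix": "192.0.2.0/24", "next_hop": "10.0.0.1", "originator_id": "10.0.0.11"},
    {"prefix": "198.51.100.0/24", "next_hop": "10.0.0.1", "originator_id": "10.0.0.1"},
    {"prefix": "203.0.113.0/24", "next_hop": "10.0.0.11", "originator_id": "10.0.0.11"},
    {"prefix": "10.9.0.0/24", "next_hop": "10.0.0.1"},
]
violations = rr_nexthop_violations(routes, rr_loopbacks)
```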

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:18:22 -07:00
sam
a662496e53 Fix telemetry dashboard variables and parameterize gNMI targets
The telemetry dashboards' router/interface variables used a keep|distinct
Flux pattern that returned only one source; switch to schema.tagValues so all
streaming routers and interfaces are listed. Parameterize telegraf.conf gNMI
addresses and credentials via GNMI_ADDRESSES/GNMI_USERNAME/GNMI_PASSWORD so
the telemetry fleet can scale without editing the config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 21:10:57 -07:00
sam
0732ebfa07 Add production-readiness deliverables: security, backup, alerting
Adds a prioritized security-hardening checklist, a PostgreSQL logical-backup
script (pg-backup.sh) with a documented restore procedure, and Grafana
alerting provisioning (peer-down, flap-storm, RPKI-invalid, router-down rules
plus a contact-point template). The alerting YAML and contact points need
operator review before being relied on for paging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:55:03 -07:00
sam
7e3370b5a5 Rework Grafana dashboard information architecture
Reorganizes 31 dashboards into an operator-first structure with real
navigation. Adds Router Detail and Peer Detail drilldown dashboards; merges
LS Nodes+Links and the two L3VPN dashboards; modernizes all deprecated panels
(table-old/graph/worldmap). Every dashboard gets the obmp-nav dropdown so the
whole set is reachable from anywhere. Graduates the operational "Learning"
dashboards into Operations/Routing/LinkState folders, retires the Tops folder,
and relabels folders (Base->Operations, History->Routing, Learning->Reference).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:55:03 -07:00
sam
f430758992 Scope NOC Overview "Peers Down" panels to the dashboard time range
The scorecard and table counted every bgp_peers row in a down state,
including peers removed long ago (OpenBMP never prunes bgp_peers). They now
filter on the peer's last state-change timestamp via $__timeFilter, so the
panel reflects current/recent problems rather than all-time history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:29:59 -07:00
sam
960806fc06 Add NOC Overview dashboard and rebuild home as a navigation hub
NOC Overview is the new flagship operator landing dashboard — health
scorecards, peer session timeline, BGP update rate, and attention tables for
peers down, churning prefixes, RPKI invalids, and topology changes. All counts
come from stats_* aggregate tables so it stays fast at production scale.
OBMP-Home is rebuilt as a lightweight navigation hub pointing at NOC Overview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:04:37 -07:00
sam
cf4e5b07c6 Add Compose profiles, setup.sh bootstrap, and config templates for portable deployment
Pins the Compose project name and splits services into core / test / auth
profiles so the BMP collector core can deploy standalone. Adds setup.sh
(idempotent bootstrap), .env.example, and repo-resident Authelia config
templates so a fresh host deploys without manual steps. Parameterizes
hardcoded host IP and domain; points the Grafana InfluxDB datasource at the
container name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 19:21:04 -07:00
sam
541f018bc5 Add RR Loc-RIB diff dashboard and route diversity config
Dashboard compares Adj-RIB-In tables between two Route Reflectors via BMP,
showing missing prefixes, attribute diffs (next-hop, AS path), and per-client
consistency. Route diversity script deploys 29 prefixes across R9K-01-07 via
NETCONF to create verifiable next-hop differences between RRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:23:19 -07:00
sam
422b98d555 Fix telemetry dashboards: update Flux queries and InfluxDB datasource URL
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:22:58 -07:00
sam
c28c9b2527 Fix gNMI telemetry: OpenConfig paths, json_ietf encoding, SSH config
- Switch Telegraf from native IOS-XR YANG paths to OpenConfig
  (openconfig-interfaces:interfaces/interface/state/counters)
- Use json_ietf encoding instead of proto (IOS-XR 24.3.1 compat)
- Target only CORE-01/CORE-02 (R9K routers blocked by CML mgmt net)
- Update all 3 Grafana dashboard queries to match OpenConfig field
  names (in-octets, out-octets, in-pkts, out-pkts, in-errors, etc.)
- Rewrite gnmi_grpc_config.py to use SSH/CLI via paramiko instead of
  NETCONF (IOS-XR 24.3.1 rejects NETCONF gRPC edit-config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 16:19:16 -07:00
sam
dcebf15bb3 Add Phase 4: gNMI streaming telemetry and traffic generator
- gNMI integration: NETCONF script to enable gRPC on all 9 routers,
  Telegraf container with gnmi input plugin, InfluxDB for time-series
  storage, 3 Grafana telemetry dashboards (utilization, errors, combined)
- Traffic generator: Scapy-based dual-mode container (sender/responder)
  with Flask API, RFC 2544 test suite (throughput, latency, frame-loss,
  back-to-back), Vue 3 web UI with flow builder, test runner, real-time
  stats monitor, and results export
- docker-compose.yml updated with influxdb, telegraf, traffic-gen,
  traffic-gen-ui services
- Full documentation in DOCS.md sections 15-16

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 15:29:44 -07:00
sam
f23e222bc0 Add Phase 3: TE/SR analytics, anomaly detection, DB schema reference
- 4 new Grafana dashboards:
  - Database Schema Map (obmp-learn-07): interactive schema reference
    with live row counts, relationship diagrams, column details
  - TE & Segment Routing Analytics (obmp-learn-08): exposes BGP-LS TE/SR
    fields (bandwidth, admin groups, SRLG, SR SIDs, protection types)
  - Topology Change & Anomaly Detection (obmp-learn-09): link state
    change tracking, origin AS hijack detection, convergence timeline
  - Link Utilization & TE Thought Experiment (obmp-learn-10): capacity
    data from BGP-LS + streaming telemetry integration guide
- DB_SCHEMA.md: standalone database reference (33 tables, 11 views)
- 3 new ExaBGP scenarios: te_community_steering, origin_shift, path_diversity
- Updated DOCS.md with Phase 3 dashboards and scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 13:31:03 -07:00
sam
f4d5bd7c85 Fix LS Topology dashboard: relax igp_metric filter for CML lab
IOS-XR 9000v in CML uses igp_metric=16000000 on all IS-IS links.
The stock dashboard filter (< 16000000) excluded all links, leaving
the Node dropdown empty and the topology panel with no data. Changed
to <= 16777215 (the IS-IS wide-metric maximum, 2^24 - 1) so lab links
are included.
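
The boundary condition is easy to check by hand; a small Python sketch of the two filters, using the metric value quoted above:

```python
LAB_METRIC = 16_000_000        # igp_metric reported on the lab IS-IS links
WIDE_METRIC_MAX = 2**24 - 1    # 16777215, the 24-bit wide-metric ceiling

def stock_filter(metric):
    """Original dashboard predicate: strictly below 16000000."""
    return metric < 16_000_000

def relaxed_filter(metric):
    """Updated predicate: anything up to the wide-metric maximum."""
    return metric <= WIDE_METRIC_MAX

stock_keeps_lab_links = stock_filter(LAB_METRIC)
relaxed_keeps_lab_links = relaxed_filter(LAB_METRIC)
```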

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-06 12:49:48 -07:00
sam
6621942032 Add Phase 2: Vue 3 control panel, 6 learning dashboards, new BGP scenarios
- exabgp-ui/: Vue 3 + Vite SPA served by NGINX on :5001; proxies /api/ to
  ExaBGP Flask on :5050; includes StatusBar, ScenarioPanel, RouteTable,
  AnnounceForm, PeerStatus, ChurnControl components
- docker-compose.yml: add obmp-exabgp-ui service (host network, port 5001)
- exabgp/scenarios/__init__.py: add convergence_test, route_leak,
  hijack_simulation scenarios for structured BGP learning exercises
- exabgp/inject.py: add 'peers' and 'monitor' subcommands; live-refresh
  terminal status view with ANSI cursor repositioning
- obmp-grafana/dashboards/Learning/: 6 new OBMP-Learning dashboards
  (update rate, peer health, AS path, RPKI, churn, attributes)
- obmp-grafana/provisioning/dashboards/openbmp-dashboards.yml: add
  OpenBMP-Learning folder provider pointing to dashboards/Learning/
- DOCS.md: document Web UI, 3 new scenarios, 6 learning dashboards;
  fix section numbering (10-14) and architecture diagram (23 dashboards)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 15:37:16 -07:00
sam
233dadbb41 Add ExaBGP route injector, Grafana dashboards, and full documentation
- Add exabgp/ container: ExaBGP 5.x + Flask REST API for on-demand BGP
  route injection into CML IOS-XR lab (AS 65020 via eBGP from AS 65100)
- Add 6 injection scenarios: internet_sample, churn, blackhole, anycast,
  full_table, lab_prefixes
- Add inject.py CLI wrapper for the ExaBGP API
- Add iosxr_bgp_config.md with IOS-XR neighbor config and NETCONF script
- Add obmp-grafana/ dashboards and provisioning (17 dashboards)
- Update docker-compose.yml: add exabgp service, fix Kafka external
  listener IP, extend log retention from 90min to 720min
- Add DOCS.md: full project documentation including architecture, setup,
  user guide, sanity checks, troubleshooting, and command reference
- Update .gitignore: exclude .env and .claude/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 14:46:37 -07:00