obmp-docker/docs/ROADMAP.md
sam d60c582ff6 Add roadmap Track E: Internet-scale routing analytics
Plan for a local full-Internet routing table, a generalized N-way
router diff, and VRF/RD scoping:

- E1: GoBGP container peering AS57355 (Bromirski lab route server)
  for a live full v4/v6 table, MRT RIB dumps as a 2-hourly fallback,
  BMP-exported into ip_rib as a GLOBAL-FEED peer.
- E2: generic up-to-4-router diff dashboard (presence matrix),
  generalized from the RR-specific rr_locrib_diff.
- E3: global table exploration dashboard.
- E4: VRF/RD scoping across unicast + L3VPN dashboards (built to
  schema; not lab-verifiable with CML IOS-XR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:19:34 -07:00


OpenBMP Platform Roadmap

Context

This BMP monitoring platform is being developed against CML virtual labs (IOS-XR) and will be deployed into an ISP production network running IOS-XR and Juniper routers/route reflectors. The two tracks share a common foundation: configuration must be environment-agnostic so the same stack runs identically against virtual or production routers.

Currently, router IPs, AS numbers, and credentials are hardcoded across 8+ files, tightly coupling the stack to a single CML lab. This roadmap addresses both the multi-lab development workflow and production deployment.


Track A: Configuration Centralization (Foundation for Both Tracks)

A1. Create inventory.yaml — unified topology inventory

File: inventory.yaml (new)

Single source of truth for all environments. Structure:

platform:
  host_ip: 10.40.40.202
  bmp_port: 5000
  exabgp_port: 5050

environments:
  cml-lab1:
    type: cml            # cml | production
    description: "CML RR cluster - 9 IOS-XR virtual routers"
    cml_server: "https://10.40.40.174"
    cml_user: webui
    bgp_as: 65020
    netconf: { user: webui, password: cisco, port: 830 }
    exabgp:
      local_as: 65100
      peers:
        - { ip: 10.100.0.100, name: CORE-01, peer_as: 65020 }
        - { ip: 10.100.0.200, name: CORE-02, peer_as: 65020 }
    routers:
      CORE-01:  { mgmt: 10.100.0.100, loopback: 10.10.255.0, role: rr, vendor: iosxr, gnmi: true }
      CORE-02:  { mgmt: 10.100.0.200, loopback: 10.10.255.20, role: rr, vendor: iosxr, gnmi: true }
      R9K-01:   { mgmt: 10.100.0.1, loopback: 10.10.255.1, role: client, vendor: iosxr }
      # ...

  cml-lab2:
    type: cml
    description: "Second CML Lab (TBD topology)"
    cml_server: "https://<lab2-ip>"
    routers: {}

  production:
    type: production
    description: "ISP production network"
    bgp_as: <prod-as>
    netconf: { user: <prod-user>, port: 830 }
    routers:
      # IOS-XR and Juniper RRs + routers
      PROD-RR1: { mgmt: x.x.x.x, role: rr, vendor: iosxr, gnmi: true }
      PROD-RR2: { mgmt: x.x.x.x, role: rr, vendor: junos }
      # ...

Key design decisions:

  • vendor: iosxr | junos — drives NETCONF dialect, gNMI paths, and config templates
  • type: cml | production — CML environments have cml_server for API automation; production does not
  • Credentials in inventory.yaml (gitignored) or pulled from env vars

A2. Create config_loader.py — Python inventory helper

File: config_loader.py (new)

Functions: get_env(name), get_all_routers(), get_routers_by_vendor(vendor), get_exabgp_peers(), get_gnmi_targets(), get_routers_for_env(env_name)
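A minimal sketch of how these helpers could look, assuming the inventory.yaml layout shown in A1. The `load_inventory` and `use_inventory` names are hypothetical additions (not in the function list above), and returning peers and routers across all environments from the zero-argument helpers is an assumption about the intended behavior. PyYAML is imported lazily so the query helpers stay usable without it.

```python
"""Sketch of config_loader.py: query helpers over a parsed inventory.yaml."""
from typing import Any

# Active inventory; replaced by load_inventory() or use_inventory().
_inventory: dict[str, Any] = {"environments": {}}


def load_inventory(path: str = "inventory.yaml") -> dict[str, Any]:
    """Parse inventory.yaml and make it the active inventory (hypothetical helper)."""
    import yaml  # PyYAML, third-party; imported lazily on purpose

    with open(path, encoding="utf-8") as fh:
        data = yaml.safe_load(fh) or {}
    use_inventory(data)
    return data


def use_inventory(data: dict[str, Any]) -> None:
    """Install an already-parsed inventory (hypothetical helper, handy in tests)."""
    global _inventory
    _inventory = data


def get_env(name: str) -> dict[str, Any]:
    return _inventory["environments"][name]


def get_routers_for_env(env_name: str) -> list[dict[str, Any]]:
    """Router records for one environment, with name and env folded in."""
    routers = get_env(env_name).get("routers") or {}
    return [{"name": name, "env": env_name, **attrs} for name, attrs in routers.items()]


def get_all_routers() -> list[dict[str, Any]]:
    envs = _inventory.get("environments") or {}
    return [router for env in envs for router in get_routers_for_env(env)]


def get_routers_by_vendor(vendor: str) -> list[dict[str, Any]]:
    return [r for r in get_all_routers() if r.get("vendor") == vendor]


def get_gnmi_targets() -> list[dict[str, Any]]:
    return [r for r in get_all_routers() if r.get("gnmi")]


def get_exabgp_peers() -> list[dict[str, Any]]:
    """ExaBGP peers from every environment that defines an exabgp section."""
    peers: list[dict[str, Any]] = []
    for env in (_inventory.get("environments") or {}).values():
        peers.extend((env.get("exabgp") or {}).get("peers") or [])
    return peers
```

A script would call `load_inventory()` once at startup and then use the query helpers in place of its local ROUTERS dict.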

A3. Refactor hardcoded Python scripts

Replace ROUTERS dicts/lists with config_loader calls:

  • exabgp/route_diversity_config.py (line 47)
  • exabgp/bgpls_config.py (line 35)
  • gnmi/gnmi_grpc_config.py (line 25)

A4. Expand .env and parameterize docker-compose.yml

Add to .env:

OBMP_DATA_ROOT=/var/openbmp
DOCKER_HOST_IP=10.40.40.202
EXABGP_LOCAL_IP=10.40.40.202
EXABGP_LOCAL_AS=65100
EXABGP_PEER_AS=65020
EXABGP_PEER_1=10.100.0.100
EXABGP_PEER_2=10.100.0.200

Replace hardcoded IPs in docker-compose.yml (Kafka listener, ExaBGP env vars).

A5. Telegraf config parameterization

Replace hardcoded gNMI addresses in telegraf/telegraf.conf with env var substitution. Pass GNMI_TARGETS from docker-compose.yml.

A6. Fix InfluxDB datasource URL

obmp-grafana/provisioning/datasources/influxdb-ds.yml: replace http://10.40.40.202:8086 with http://obmp-influxdb:8086.


Track B: Multi-Lab CML Development

B1. Dynamic ExaBGP multi-peer support

File: exabgp/startup.sh

Accept EXABGP_PEERS env var (comma-separated ip:as:description), generate N neighbor blocks. Keep PEER_1/PEER_2 fallback.
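The parsing and fallback rules can be sketched in Python before they are ported to startup.sh. This is an illustration only: the `Peer`, `parse_peers`, and `render_neighbors` names are hypothetical, using the peer IP as the description when none is given is an assumption, and the ip:as:description format as written only works for IPv4 peers because IPv6 addresses contain colons.

```python
from typing import Mapping, NamedTuple


class Peer(NamedTuple):
    ip: str
    peer_as: int
    description: str


def parse_peers(env: Mapping[str, str]) -> list[Peer]:
    """Read EXABGP_PEERS, falling back to EXABGP_PEER_1 / EXABGP_PEER_2."""
    raw = env.get("EXABGP_PEERS", "").strip()
    if raw:
        peers = []
        for item in raw.split(","):
            # maxsplit=2 keeps any colons inside the description intact.
            ip, asn, *rest = item.strip().split(":", 2)
            peers.append(Peer(ip, int(asn), rest[0] if rest else ip))
        return peers
    # Legacy fallback: up to two fixed peers sharing EXABGP_PEER_AS.
    legacy = [env[key] for key in ("EXABGP_PEER_1", "EXABGP_PEER_2") if env.get(key)]
    if not legacy:
        return []
    peer_as = int(env["EXABGP_PEER_AS"])
    return [Peer(ip, peer_as, ip) for ip in legacy]


def render_neighbors(peers: list[Peer], local_ip: str, local_as: int) -> str:
    """Emit one ExaBGP neighbor block per peer."""
    blocks = []
    for peer in peers:
        blocks.append(
            f"neighbor {peer.ip} {{\n"
            f'    description "{peer.description}";\n'
            f"    router-id {local_ip};\n"
            f"    local-address {local_ip};\n"
            f"    local-as {local_as};\n"
            f"    peer-as {peer.peer_as};\n"
            f"}}\n"
        )
    return "\n".join(blocks)
```

Calling `parse_peers(os.environ)` and passing the result to `render_neighbors` with EXABGP_LOCAL_IP and EXABGP_LOCAL_AS would produce the N neighbor blocks; address families and API process sections are left out of this sketch.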

B2. CML API client module

File: cml/cml_client.py (new)

Python module using virl2_client SDK:

  • Connect to CML server (creds from inventory.yaml)
  • Upload node/image definitions
  • Import/export topology YAML
  • Start/stop/destroy labs
  • Get node status

B3. Topology template system

File: cml/templates/xrd_rr.j2 (new)

Jinja2 templates for XRd startup config. Parameterize: hostname, loopback, link IPs, IS-IS NET, BGP AS, neighbor IPs, BMP target.
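Of the template parameters, the IS-IS NET is the one that can be derived rather than stored. The sketch below assumes the common convention of zero-padding each loopback octet to three digits and regrouping into a six-byte system ID; the area prefix 49.0001 and the `isis_net` / `template_vars` names are assumptions for illustration, not something the lab configs are known to use.

```python
import ipaddress


def isis_net(loopback: str, area: str = "49.0001") -> str:
    """Derive an IS-IS NET from an IPv4 loopback (zero-padded-octet convention)."""
    octets = ipaddress.IPv4Address(loopback).packed  # also validates the address
    digits = "".join(f"{octet:03d}" for octet in octets)  # 10.10.255.1 -> 010010255001
    system_id = ".".join(digits[i:i + 4] for i in range(0, 12, 4))
    return f"{area}.{system_id}.00"


def template_vars(hostname: str, loopback: str, bgp_as: int, bmp_target: str) -> dict:
    """Collect the values a startup-config template would be rendered with."""
    return {
        "hostname": hostname,
        "loopback": loopback,
        "isis_net": isis_net(loopback),
        "bgp_as": bgp_as,
        "bmp_target": bmp_target,
    }
```

The resulting dict is what would be handed to the Jinja2 template's render call; link IPs and neighbor lists are omitted here because they depend on the topology.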

B4. CLI deployment tool

File: cml/deploy.py (new)

python3 cml/deploy.py --env cml-lab1 status
python3 cml/deploy.py --env cml-lab1 upload-images
python3 cml/deploy.py --env cml-lab2 create
python3 cml/deploy.py --env cml-lab2 start
python3 cml/deploy.py --env cml-lab2 destroy
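The command shape above maps onto a top-level `--env` option plus one subcommand. A minimal argparse skeleton, assuming nothing about the eventual CML client calls (the dispatch is a placeholder that only prints what it would do):

```python
import argparse

COMMANDS = ("status", "upload-images", "create", "start", "destroy")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="deploy.py", description="Manage a CML lab environment"
    )
    parser.add_argument(
        "--env", required=True, help="environment name from inventory.yaml"
    )
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in COMMANDS:
        subcommands.add_parser(name, help=f"run '{name}' against the selected environment")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder dispatch: the real tool would call into cml/cml_client.py here.
    print(f"[{args.env}] would run: {args.command}")
    return 0
```

Wiring `main()` to the script entry point (for example `raise SystemExit(main())` under the usual `__name__` guard) gives the invocations listed above; an unknown subcommand or a missing `--env` exits with argparse's usage error.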

B5. Update build scripts with API push

cml/build-cml-image.sh and cml/build-xrd-image.sh get --push <env-name> flag.


Track C: Production ISP Deployment

C1. Multi-vendor NETCONF support

Current scripts assume IOS-XR NETCONF only. For Juniper RRs:

  • config_loader.py provides vendor field per router
  • NETCONF scripts branch on vendor for dialect differences (ncclient device_params={'name': 'iosxr'} vs device_params={'name': 'junos'})
  • Route diversity, BGP-LS config scripts get Junos templates alongside IOS-XR
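The vendor branch can be kept in one place by building the connection arguments from the inventory record. This sketch assumes the scripts use ncclient, whose `manager.connect` takes `device_params` as a dict with a `name` key; the `netconf_connect_kwargs` helper is hypothetical and only builds the arguments, it does not open a session.

```python
from typing import Any

# inventory.yaml vendor value -> ncclient device handler name
NCCLIENT_DEVICE = {"iosxr": "iosxr", "junos": "junos"}


def netconf_connect_kwargs(router: dict[str, Any], creds: dict[str, Any]) -> dict[str, Any]:
    """Build keyword arguments for ncclient.manager.connect() from inventory data."""
    vendor = router.get("vendor")
    if vendor not in NCCLIENT_DEVICE:
        raise ValueError(f"unsupported vendor {vendor!r} for {router.get('name', '?')}")
    return {
        "host": router["mgmt"],
        "port": creds.get("port", 830),
        "username": creds["user"],
        "password": creds.get("password"),
        "device_params": {"name": NCCLIENT_DEVICE[vendor]},
    }
```

A script would then call `manager.connect(**netconf_connect_kwargs(router, creds))` and select the IOS-XR or Junos config template by the same `vendor` field.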

C2. Multi-vendor gNMI paths

Telegraf gNMI subscriptions currently use OpenConfig paths, which work for both IOS-XR and Junos, but:

  • Verify Juniper gNMI support on target hardware
  • Add vendor-specific path overrides in inventory.yaml if needed
  • Telegraf can subscribe to multiple targets with different configs via [[inputs.gnmi]] blocks
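One way to combine the last two points is to render a separate [[inputs.gnmi]] block per vendor from the inventory, with optional per-vendor subscription overrides. The sketch below is an assumption-heavy illustration: the `render_gnmi_inputs` helper, the GNMI_USER / GNMI_PASSWORD variable names, the default port 57400, the 30s interval, and the default subscription are all placeholders to adjust per platform.

```python
from collections import defaultdict

# Assumed default; the origin/path pair mirrors the interface-counters example
# commonly shown for Telegraf's gNMI input and may need changing per vendor.
DEFAULT_SUBSCRIPTIONS = [
    {"name": "ifcounters", "origin": "openconfig-interfaces",
     "path": "/interfaces/interface/state/counters"},
]


def render_gnmi_inputs(routers, port=57400, overrides=None, interval="30s"):
    """Render one [[inputs.gnmi]] block per vendor that has gNMI-enabled routers."""
    overrides = overrides or {}
    by_vendor = defaultdict(list)
    for router in routers:
        if router.get("gnmi"):
            by_vendor[router["vendor"]].append(f'{router["mgmt"]}:{port}')

    lines = []
    for vendor in sorted(by_vendor):
        addresses = ", ".join(f'"{addr}"' for addr in by_vendor[vendor])
        lines += [
            "[[inputs.gnmi]]",
            f"  addresses = [{addresses}]",
            # ${...} is left literal so Telegraf substitutes it from its environment.
            '  username = "${GNMI_USER}"',
            '  password = "${GNMI_PASSWORD}"',
        ]
        for sub in overrides.get(vendor, DEFAULT_SUBSCRIPTIONS):
            lines += [
                "  [[inputs.gnmi.subscription]]",
                f'    name = "{sub["name"]}"',
                f'    origin = "{sub["origin"]}"',
                f'    path = "{sub["path"]}"',
                '    subscription_mode = "sample"',
                f'    sample_interval = "{interval}"',
            ]
        lines.append("")
    return "\n".join(lines)
```

Routers without `gnmi: true` are skipped, so the output only lists targets the inventory marks as gNMI-capable.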

C3. BMP considerations for production

  • BMP collector (port 5000) accepts connections from any router — no changes needed
  • Production routers need BMP config pushed (manual or via NETCONF automation)
  • Consider: separate BMP server IDs per environment for dashboard filtering
  • Juniper BMP config differs from IOS-XR — add Junos BMP config templates

C4. Dashboard multi-environment awareness

  • Add a Grafana template variable for environment filtering (by router name prefix or a tag)
  • Consider a "Network Overview" dashboard that shows all environments side-by-side
  • Existing dashboards work as-is — router dropdowns will show all BMP-reporting routers

C5. Security hardening for production

  • Move credentials out of inventory.yaml into environment variables or a secrets manager
  • Authelia config: stronger passwords, TOTP enforcement, session timeouts
  • PostgreSQL: restrict access, enable SSL
  • Kafka: consider authentication if exposed beyond localhost
  • BMP port: firewall to only accept connections from known router management IPs
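For the first item, an environment-variable override lets the same inventory file carry lab credentials while production secrets stay out of it. A minimal sketch, assuming a hypothetical naming scheme of OBMP_<ENV>_NETCONF_<FIELD>; neither the scheme nor the helper names come from the existing code.

```python
import os
import re
from typing import Any, Mapping


def secret_env_name(env_name: str, field: str) -> str:
    """cml-lab1 + password -> OBMP_CML_LAB1_NETCONF_PASSWORD (hypothetical scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", env_name).strip("_").upper()
    return f"OBMP_{slug}_NETCONF_{field.upper()}"


def resolve_netconf_creds(
    env_name: str,
    env_cfg: Mapping[str, Any],
    environ: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Merge inventory NETCONF settings with environment-variable overrides."""
    netconf = dict(env_cfg.get("netconf") or {})
    for field in ("user", "password"):
        override = environ.get(secret_env_name(env_name, field))
        if override:
            netconf[field] = override  # the environment wins over the file
    if not netconf.get("password"):
        raise ValueError(
            f"no NETCONF password for {env_name}: set "
            f"{secret_env_name(env_name, 'password')} or add it to inventory.yaml"
        )
    return netconf
```

With this shape the production entry in inventory.yaml can omit the password entirely, and a secrets manager only has to inject one variable per environment.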

C6. Scalability considerations

  • Monitor PostgreSQL disk usage and query performance with production-scale RIBs
  • TimescaleDB compression policies for historical data (ip_rib_log, ls_*_log)
  • Kafka topic partitioning if message throughput is high
  • Consider read replicas or materialized views for heavy Grafana queries

Track D: Packaging & Distribution

D1. Configuration templates

  • inventory.yaml.example — documented example with placeholder values
  • .env.example — all environment variables with descriptions

D2. Bootstrap script

setup.sh that:

  • Creates required directories ($OBMP_DATA_ROOT/authelia, etc.)
  • Copies example configs if originals don't exist
  • Validates inventory.yaml syntax
  • Generates Telegraf config from inventory
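The validation step could go beyond YAML syntax and catch the cross-environment problems listed under Risks. A sketch under the assumption that the parsed inventory follows the A1 layout; `validate_inventory` is a hypothetical helper, and it checks for duplicate management IPs rather than full subnet overlap.

```python
import ipaddress

VALID_TYPES = {"cml", "production"}
VALID_VENDORS = {"iosxr", "junos"}


def validate_inventory(inventory: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    errors = []
    seen_names: dict[str, str] = {}  # router name -> first environment using it
    seen_mgmt: dict[object, str] = {}  # mgmt address -> first "env/router" using it

    for env_name, env in (inventory.get("environments") or {}).items():
        if env.get("type") not in VALID_TYPES:
            errors.append(f"{env_name}: type must be one of {sorted(VALID_TYPES)}")
        for name, attrs in (env.get("routers") or {}).items():
            where = f"{env_name}/{name}"
            if attrs.get("vendor") not in VALID_VENDORS:
                errors.append(f"{where}: vendor must be one of {sorted(VALID_VENDORS)}")
            if name in seen_names:
                errors.append(f"{where}: router name already used in {seen_names[name]}")
            seen_names.setdefault(name, env_name)
            try:
                mgmt = ipaddress.ip_address(str(attrs.get("mgmt")))
            except ValueError:
                errors.append(f"{where}: mgmt is not a valid IP address")
                continue
            if mgmt in seen_mgmt:
                errors.append(f"{where}: mgmt {mgmt} already used by {seen_mgmt[mgmt]}")
            seen_mgmt.setdefault(mgmt, where)
    return errors
```

setup.sh could run this after parsing the file and exit non-zero when the returned list is not empty.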

D3. Published Docker images

Push custom images to a registry (Docker Hub or GHCR):

  • obmp-exabgp
  • obmp-exabgp-ui
  • obmp-traffic-gen
  • obmp-traffic-gen-ui
  • obmp-portal

Replace build: with image: in docker-compose.yml (keep build as override).

D4. Documentation

  • docs/quickstart.md — 5-minute setup guide
  • docs/adding-a-lab.md — how to add a CML lab environment
  • docs/production-deployment.md — production hardening checklist
  • docs/architecture.md — system diagram, data flow, port map

Track E: Internet-Scale Routing Analytics

Adds a local copy of the real global routing table, generalizes router comparison to an N-way diff, and threads VRF/RD scoping through the dashboards. The full-table feed (E1) is the foundation — E2/E3 consume it.

E1. GoBGP full-table feed → BMP → ip_rib

Files: docker-compose.yml (new gobgp service), gobgp/gobgpd.conf (new), gobgp/mrt-refresh.sh (new)

Stand up a GoBGP container that obtains a full Internet table (IPv4 ~1M + IPv6 ~200k) and BMP-exports it to the existing OpenBMP collector, so the global table lands in ip_rib as an ordinary monitored peer — every existing dashboard and the diff then work against it for free.

  • Primary feed: eBGP multihop session to Łukasz Bromirski's lab route server, AS57355 (85.232.240.179, 2001:1a68:2c:2::179). Use a private local ASN (e.g. 65199), announce nothing (receive-only), and set the eBGP multihop TTL to about 64.
  • BMP export: GoBGP [[bmp-servers]] block pointing at the collector (port 5000), with route-monitoring-policy = pre-policy.
  • Fallback / seed: gobgp/mrt-refresh.sh, run every 2h (host cron or a sidecar), downloads the latest RouteViews (archive.routeviews.org) or RIPE-RIS MRT RIB dump and loads it into the same instance with gobgp mrt inject.
  • Identification: distinct BMP router name (e.g. GLOBAL-FEED) so dashboards can include or exclude it.

Caveats:

  • The route server is a single volunteer-run host, no SLA — the MRT fallback is the reliability backstop, not optional.
  • A full table roughly triples ip_rib size — see E-scale below.
  • The feed carries no VRF/L3VPN routes — global unicast only.

E2. Generic multi-router diff dashboard

File: obmp-grafana/dashboards/.../router_diff.json (new, uid router-diff), generalized from rr_locrib_diff.json

Replace the hardwired RR1-vs-RR2 model with up to 4 selectable routers:

  • Template vars router1-router4 (query type); router1/router2 required, router3/router4 default to a "— none —" sentinel and their panels hide when unset.
  • Presence matrix — rows = prefixes, columns = selected routers, cell = present / next-hop / origin-AS; the core view.
  • Divergence view — table of prefixes where the selected routers disagree (missing on some, or differing best-path attributes).
  • Keep the per-prefix all-paths drill-down from the RR diff.
  • The global feed (E1) is selectable as any of the 4 → "lab vs the real Internet." The existing rr-locrib-diff stays as the RR-specific quick view.
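The presence matrix and divergence view reduce to a small set operation over per-router best paths. The dashboard itself would express this in SQL; the Python below is only a sketch of the intended semantics, with hypothetical names (`BestPath`, `presence_matrix`, `divergent`) and with "best-path attributes" narrowed to next-hop and origin AS as an assumption.

```python
from typing import NamedTuple, Optional


class BestPath(NamedTuple):
    next_hop: str
    origin_as: int


Rib = dict[str, BestPath]  # prefix -> best path on one router


def presence_matrix(ribs: dict[str, Rib]) -> dict[str, dict[str, Optional[BestPath]]]:
    """Rows are prefixes, columns are routers; a cell is None when the prefix is absent."""
    if not 2 <= len(ribs) <= 4:
        raise ValueError("select between 2 and 4 routers")
    prefixes = sorted(set().union(*ribs.values()))
    return {
        prefix: {router: rib.get(prefix) for router, rib in ribs.items()}
        for prefix in prefixes
    }


def divergent(matrix: dict[str, dict[str, Optional[BestPath]]]) -> list[str]:
    """Prefixes missing on some router or carrying differing best-path attributes."""
    return [prefix for prefix, row in matrix.items() if len(set(row.values())) > 1]
```

Unset router3/router4 selections simply do not appear in `ribs`. Comparing next-hops against the global feed will flag nearly every prefix, so a real view would probably let the user choose which attributes count as a disagreement.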

E3. Global table exploration dashboard

File: obmp-grafana/dashboards/.../global_table.json (new)

Explorable dashboard over the GLOBAL-FEED peer: prefix count by AFI, origin-AS distribution, prefix-length histogram, search by prefix/AS, more-/less-specific lookups. Doubles as the comparison baseline for E2.
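Two of these panels, the prefix-length histogram and the more-/less-specific lookups, can be illustrated with the standard ipaddress module. The dashboard queries would run in PostgreSQL rather than Python; this sketch only pins down the expected results, and the function names are hypothetical.

```python
import ipaddress
from collections import Counter


def prefix_length_histogram(prefixes):
    """Count prefixes per (IP version, prefix length)."""
    hist = Counter()
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        hist[(net.version, net.prefixlen)] += 1
    return hist


def more_specifics(target, prefixes):
    """Prefixes strictly contained in `target` (same address family only)."""
    parent = ipaddress.ip_network(target)
    found = []
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError across families, so check the version first.
        if net.version == parent.version and net != parent and net.subnet_of(parent):
            found.append(prefix)
    return found


def less_specifics(target, prefixes):
    """Prefixes that strictly cover `target` (same address family only)."""
    child = ipaddress.ip_network(target)
    found = []
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix)
        if net.version == child.version and net != child and net.supernet_of(child):
            found.append(prefix)
    return found
```

Given 10.0.0.0/8, 10.1.0.0/16, and 10.1.2.0/24 in the table, looking up more-specifics of the /8 returns the /16 and /24, and looking up less-specifics of the /24 returns the /8 and /16.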

E4. VRF / RD awareness

Files: existing unicast + L3VPN dashboards

Thread a Route-Distinguisher / VRF scoping dimension through the dashboards:

  • Add a vrf / rd template variable to the L3VPN dashboards and unicast dashboards where applicable.
  • VRF/RD columns and filters on RIB tables.
  • The diff (E2) gains a per-VRF scope.

Constraint (stated plainly): CML IOS-XR images can't originate L3VPN routes and the global feed carries none — so E4 is built to the L3VPN schema and unverifiable in this lab; it validates only against production routers. Keep E4 scope minimal until there's a real L3VPN source.

E-scale. PostgreSQL sizing for a full table

A full v4+v6 table is ~1.2M prefixes; with attributes and history this is a multi-GB addition to ip_rib / ip_rib_log. Before enabling E1 continuously: confirm disk headroom on $OBMP_DATA_ROOT, apply TimescaleDB compression to ip_rib_log (also flagged in C6). The mv_as_adjacency materialized view (already in place — postgres/scripts/006_obmp_matviews.sql) becomes far more valuable once real-Internet AS paths are present.


Implementation Order

| Priority | Step | Track | Description |
|---|---|---|---|
| 1 | A1 | Foundation | Create inventory.yaml |
| 2 | A2 | Foundation | Create config_loader.py |
| 3 | A3 | Foundation | Refactor hardcoded Python scripts |
| 4 | A4 | Foundation | Parameterize .env + docker-compose |
| 5 | A5-A6 | Foundation | Telegraf + InfluxDB datasource fixes |
| 6 | B1 | CML Dev | Dynamic ExaBGP multi-peer |
| 7 | B2-B4 | CML Dev | CML API client + deploy CLI |
| 8 | C1 | Production | Multi-vendor NETCONF (Junos support) |
| 9 | C3 | Production | Junos BMP config templates |
| 10 | C5 | Production | Security hardening |
| 11 | D1-D2 | Packaging | Config templates + bootstrap script |
| 12 | D3 | Packaging | Publish Docker images to registry |
| 13 | D4 | Packaging | Documentation |
| 14 | E1 | Analytics | GoBGP full-table feed (AS57355 live + MRT fallback) |
| 15 | E2 | Analytics | Generic 4-router diff dashboard |
| 16 | E3 | Analytics | Global table exploration dashboard |
| 17 | E4 | Analytics | VRF/RD scoping (to schema, lab-unverifiable) |

Steps 1-5 (Track A) unblock Tracks B-D. Steps 6-7 and 8-10 can proceed in parallel once the foundation is in place. Track E is independent of A-D: E1 is the foundation for E2/E3; E4 can proceed any time but is lab-unverifiable.


Verification

  1. Config centralization: Change a router IP in inventory.yaml, verify all scripts pick it up
  2. ExaBGP multi-peer: Set 3+ peers, restart, verify BGP sessions establish
  3. CML API: deploy.py --env cml-lab1 status connects and lists nodes
  4. BMP multi-source: Router from lab 2 sends BMP, appears in SELECT * FROM routers and Grafana
  5. Junos support: NETCONF script connects to a Juniper router, pushes config
  6. Production dry-run: Point a test router from the ISP network at the collector, verify end-to-end
  7. Clean deploy: Clone repo on a fresh host, run setup.sh, docker compose up, confirm stack starts

Risks

  • Router name collisions: Enforce unique hostnames across all environments
  • Address space overlap: Each environment needs distinct management subnets
  • Juniper BMP differences: Junos BMP implementation may differ in supported tables/TLVs — test early
  • Production scale: lab runs with 500K routes are already slow; production full tables will stress PostgreSQL further
  • Credentials in inventory: Must be gitignored; consider env var fallback for CI/CD
  • Volunteer route server (E1): the AS57355 full-table feed has no SLA and can flap or be retired — the 2-hourly MRT fallback is mandatory, not optional
  • Full-table DB growth (E1): a live global feed roughly triples ip_rib; size disk and enable ip_rib_log compression before turning it on continuously
  • VRF work unverifiable (E4): no L3VPN source in the CML lab — E4 ships to schema correctness only, validated later against production