# OpenBMP Platform Roadmap

## Context

This BMP monitoring platform is being developed against CML virtual labs (IOS-XR) and will be deployed into an ISP production network running IOS-XR and Juniper routers/route reflectors. The two tracks share a common foundation: configuration must be environment-agnostic so the same stack runs identically against virtual or production routers.

Currently, router IPs, AS numbers, and credentials are hardcoded across 8+ files, tightly coupling the stack to a single CML lab. This roadmap addresses both the multi-lab development workflow and production deployment.

---

## Track A: Configuration Centralization (Foundation for Both Tracks)

### A1. Create `inventory.yaml`: unified topology inventory

**File**: `inventory.yaml` (new)

Single source of truth for all environments. Structure:

```yaml
platform:
  host_ip: 10.40.40.202
  bmp_port: 5000
  exabgp_port: 5050

environments:
  cml-lab1:
    type: cml  # cml | production
    description: "CML RR cluster - 9 IOS-XR virtual routers"
    cml_server: "https://10.40.40.174"
    cml_user: webui
    bgp_as: 65020
    netconf: { user: webui, password: cisco, port: 830 }
    exabgp:
      local_as: 65100
      peers:
        - { ip: 10.100.0.100, name: CORE-01, peer_as: 65020 }
        - { ip: 10.100.0.200, name: CORE-02, peer_as: 65020 }
    routers:
      CORE-01: { mgmt: 10.100.0.100, loopback: 10.10.255.0, role: rr, vendor: iosxr, gnmi: true }
      CORE-02: { mgmt: 10.100.0.200, loopback: 10.10.255.20, role: rr, vendor: iosxr, gnmi: true }
      R9K-01: { mgmt: 10.100.0.1, loopback: 10.10.255.1, role: client, vendor: iosxr }
      # ...
  cml-lab2:
    type: cml
    description: "Second CML Lab (TBD topology)"
    cml_server: "https://"
    routers: {}
  production:
    type: production
    description: "ISP production network"
    bgp_as:
    netconf: { user: , port: 830 }
    routers:
      # IOS-XR and Juniper RRs + routers
      PROD-RR1: { mgmt: x.x.x.x, role: rr, vendor: iosxr, gnmi: true }
      PROD-RR2: { mgmt: x.x.x.x, role: rr, vendor: junos }
      # ...
```

Key design decisions:
- `vendor: iosxr | junos`: drives NETCONF dialect, gNMI paths, and config templates
- `type: cml | production`: CML environments have `cml_server` for API automation; production does not
- Credentials in `inventory.yaml` (gitignored) or pulled from env vars

### A2. Create `config_loader.py`: Python inventory helper

**File**: `config_loader.py` (new)

Functions: `get_env(name)`, `get_all_routers()`, `get_routers_by_vendor(vendor)`, `get_exabgp_peers()`, `get_gnmi_targets()`, `get_routers_for_env(env_name)`

### A3. Refactor hardcoded Python scripts

Replace `ROUTERS` dicts/lists with `config_loader` calls:
- `exabgp/route_diversity_config.py` (line 47)
- `exabgp/bgpls_config.py` (line 35)
- `gnmi/gnmi_grpc_config.py` (line 25)

### A4. Expand `.env` and parameterize `docker-compose.yml`

Add to `.env`:

```env
OBMP_DATA_ROOT=/var/openbmp
DOCKER_HOST_IP=10.40.40.202
EXABGP_LOCAL_IP=10.40.40.202
EXABGP_LOCAL_AS=65100
EXABGP_PEER_AS=65020
EXABGP_PEER_1=10.100.0.100
EXABGP_PEER_2=10.100.0.200
```

Replace hardcoded IPs in `docker-compose.yml` (Kafka listener, ExaBGP env vars).

### A5. Telegraf config parameterization

Replace hardcoded gNMI addresses in `telegraf/telegraf.conf` with env var substitution. Pass `GNMI_TARGETS` from docker-compose.yml.

### A6. Fix InfluxDB datasource URL

`obmp-grafana/provisioning/datasources/influxdb-ds.yml`: replace `http://10.40.40.202:8086` with `http://obmp-influxdb:8086`.

---

## Track B: Multi-Lab CML Development

### B1. Dynamic ExaBGP multi-peer support

**File**: `exabgp/startup.sh`

Accept `EXABGP_PEERS` env var (comma-separated `ip:as:description`), generate N neighbor blocks. Keep `PEER_1`/`PEER_2` fallback.

### B2. CML API client module

**File**: `cml/cml_client.py` (new)

Python module using `virl2_client` SDK:
- Connect to CML server (creds from `inventory.yaml`)
- Upload node/image definitions
- Import/export topology YAML
- Start/stop/destroy labs
- Get node status

### B3. Topology template system

**File**: `cml/templates/xrd_rr.j2` (new)

Jinja2 templates for XRd startup config. Parameterize: hostname, loopback, link IPs, IS-IS NET, BGP AS, neighbor IPs, BMP target.

### B4. CLI deployment tool

**File**: `cml/deploy.py` (new)

```bash
python3 cml/deploy.py --env cml-lab1 status
python3 cml/deploy.py --env cml-lab1 upload-images
python3 cml/deploy.py --env cml-lab2 create
python3 cml/deploy.py --env cml-lab2 start
python3 cml/deploy.py --env cml-lab2 destroy
```

### B5. Update build scripts with API push

`cml/build-cml-image.sh` and `cml/build-xrd-image.sh` get `--push ` flag.

---

## Track C: Production ISP Deployment

### C1. Multi-vendor NETCONF support

Current scripts assume IOS-XR NETCONF only. For Juniper RRs:
- `config_loader.py` provides `vendor` field per router
- NETCONF scripts branch on vendor for dialect differences (`device_params='iosxr'` vs `device_params='junos'`)
- Route diversity, BGP-LS config scripts get Junos templates alongside IOS-XR

### C2. Multi-vendor gNMI paths

Telegraf gNMI subscriptions currently use OpenConfig paths which work for both IOS-XR and Junos, but:
- Verify Juniper gNMI support on target hardware
- Add vendor-specific path overrides in `inventory.yaml` if needed
- Telegraf can subscribe to multiple targets with different configs via `[[inputs.gnmi]]` blocks

### C3. BMP considerations for production

- BMP collector (port 5000) accepts connections from any router, so no changes are needed
- Production routers need BMP config pushed (manual or via NETCONF automation)
- Consider: separate BMP server IDs per environment for dashboard filtering
- Juniper BMP config differs from IOS-XR, so add Junos BMP config templates

### C4. Dashboard multi-environment awareness

- Add a Grafana template variable for environment filtering (by router name prefix or a tag)
- Consider a "Network Overview" dashboard that shows all environments side-by-side
- Existing dashboards work as-is: router dropdowns will show all BMP-reporting routers

### C5. Security hardening for production

- Move credentials out of `inventory.yaml` into environment variables or a secrets manager
- Authelia config: stronger passwords, TOTP enforcement, session timeouts
- PostgreSQL: restrict access, enable SSL
- Kafka: consider authentication if exposed beyond localhost
- BMP port: firewall to only accept connections from known router management IPs

### C6. Scalability considerations

- Monitor PostgreSQL disk usage and query performance with production-scale RIBs
- TimescaleDB compression policies for historical data (ip_rib_log, ls_*_log)
- Kafka topic partitioning if message throughput is high
- Consider read replicas or materialized views for heavy Grafana queries

---

## Track D: Packaging & Distribution

### D1. Configuration templates

- `inventory.yaml.example`: documented example with placeholder values
- `.env.example`: all environment variables with descriptions

### D2. Bootstrap script

`setup.sh` that:
- Creates required directories (`$OBMP_DATA_ROOT/authelia`, etc.)
- Copies example configs if originals don't exist
- Validates inventory.yaml syntax
- Generates Telegraf config from inventory

### D3. Published Docker images

Push custom images to a registry (Docker Hub or GHCR):
- `obmp-exabgp`
- `obmp-exabgp-ui`
- `obmp-traffic-gen`
- `obmp-traffic-gen-ui`
- `obmp-portal`

Replace `build:` with `image:` in docker-compose.yml (keep build as override).

### D4. Documentation

- `docs/quickstart.md`: 5-minute setup guide
- `docs/adding-a-lab.md`: how to add a CML lab environment
- `docs/production-deployment.md`: production hardening checklist
- `docs/architecture.md`: system diagram, data flow, port map

---

## Track E: Internet-Scale Routing Analytics

Adds a local copy of the real global routing table, generalizes router comparison to an N-way diff, and threads VRF/RD scoping through the dashboards. The full-table feed (E1) is the foundation; E2/E3 consume it.

### E1. GoBGP full-table feed → BMP → `ip_rib`

**Files**: `docker-compose.yml` (new `gobgp` service), `gobgp/gobgpd.conf` (new), `gobgp/mrt-refresh.sh` (new)

Stand up a GoBGP container that obtains a full Internet table (IPv4 ~1M + IPv6 ~200k) and BMP-exports it to the existing OpenBMP collector, so the global table lands in `ip_rib` as an ordinary monitored peer. Every existing dashboard and the diff then work against it for free.

- **Primary feed**: eBGP multihop session to Łukasz Bromirski's lab route server, **AS57355** (`85.232.240.179`, `2001:1a68:2c:2::179`). Local ASN private (e.g. 65199); announce nothing; `ebgp-multihop` TTL ~64; receive-only.
- **BMP export**: GoBGP `[[bmp-servers]]` block at the collector (port 5000), `route-monitoring-policy = pre-policy`.
- **Fallback / seed**: `gobgp/mrt-refresh.sh`, run every 2h (host cron or a sidecar): download the latest RouteViews (`archive.routeviews.org`) or RIPE-RIS MRT RIB dump and `gobgp mrt inject` it into the same instance.
- **Identification**: distinct BMP router name (e.g. `GLOBAL-FEED`) so dashboards can include/exclude it.

Caveats:
- The route server is a single volunteer-run host with no SLA; the MRT fallback is the reliability backstop, not optional.
- A full table roughly triples `ip_rib` size (see E-scale below).
- The feed carries **no VRF/L3VPN** routes: global unicast only.

### E2. Generic multi-router diff dashboard

**File**: `obmp-grafana/dashboards/.../router_diff.json` (new, uid `router-diff`), generalized from `rr_locrib_diff.json`

Replace the hardwired RR1-vs-RR2 model with up to **4 selectable routers**:
- Template vars `router1`-`router4` (query type); `router1`/`router2` required, `router3`/`router4` default to a "— none —" sentinel and their panels hide when unset.
- **Presence matrix**: rows = prefixes, columns = selected routers, cell = present / next-hop / origin-AS; the core view.
- **Divergence view**: table of prefixes where the selected routers disagree (missing on some, or differing best-path attributes).
- Keep the per-prefix all-paths drill-down from the RR diff.
- The global feed (E1) is selectable as any of the 4 → "lab vs the real Internet."

The existing `rr-locrib-diff` stays as the RR-specific quick view.

### E3. Global table exploration dashboard

**File**: `obmp-grafana/dashboards/.../global_table.json` (new)

Explorable dashboard over the `GLOBAL-FEED` peer: prefix count by AFI, origin-AS distribution, prefix-length histogram, search by prefix/AS, more-/less-specific lookups. Doubles as the comparison baseline for E2.

### E4. VRF / RD awareness

**Files**: existing unicast + L3VPN dashboards

Thread a Route-Distinguisher / VRF scoping dimension through the dashboards:
- Add a `vrf` / `rd` template variable to the L3VPN dashboards and unicast dashboards where applicable.
- VRF/RD columns and filters on RIB tables.
- The diff (E2) gains a per-VRF scope.

Constraint (stated plainly): CML IOS-XR images can't originate L3VPN routes and the global feed carries none, so E4 is **built to the L3VPN schema and unverifiable in this lab**; it validates only against production routers. Keep E4 scope minimal until there's a real L3VPN source.

### E-scale. PostgreSQL sizing for a full table

A full v4+v6 table is ~1.2M prefixes; with attributes and history this is a multi-GB addition to `ip_rib` / `ip_rib_log`.
Before enabling E1 continuously: confirm disk headroom on `$OBMP_DATA_ROOT` and apply TimescaleDB compression to `ip_rib_log` (also flagged in C6). The `mv_as_adjacency` materialized view (already in place in `postgres/scripts/006_obmp_matviews.sql`) becomes far more valuable once real-Internet AS paths are present.

---

## Implementation Order

| Priority | Step | Track | Description |
|----------|------|-------|-------------|
| 1 | A1 | Foundation | Create `inventory.yaml` |
| 2 | A2 | Foundation | Create `config_loader.py` |
| 3 | A3 | Foundation | Refactor hardcoded Python scripts |
| 4 | A4 | Foundation | Parameterize `.env` + docker-compose |
| 5 | A5-A6 | Foundation | Telegraf + InfluxDB datasource fixes |
| 6 | B1 | CML Dev | Dynamic ExaBGP multi-peer |
| 7 | B2-B4 | CML Dev | CML API client + deploy CLI |
| 8 | C1 | Production | Multi-vendor NETCONF (Junos support) |
| 9 | C3 | Production | Junos BMP config templates |
| 10 | C5 | Production | Security hardening |
| 11 | D1-D2 | Packaging | Config templates + bootstrap script |
| 12 | D3 | Packaging | Publish Docker images to registry |
| 13 | D4 | Packaging | Documentation |
| 14 | E1 | Analytics | GoBGP full-table feed (AS57355 live + MRT fallback) |
| 15 | E2 | Analytics | Generic 4-router diff dashboard |
| 16 | E3 | Analytics | Global table exploration dashboard |
| 17 | E4 | Analytics | VRF/RD scoping (to schema, lab-unverifiable) |

Steps 1-5 (Track A) unblock everything else. Steps 6-7 and 8-10 can proceed in parallel once the foundation is in place. Track E is independent of A-D: E1 is the foundation for E2/E3; E4 can proceed any time but is lab-unverifiable.

---

## Verification

1. **Config centralization**: Change a router IP in `inventory.yaml`, verify all scripts pick it up
2. **ExaBGP multi-peer**: Set 3+ peers, restart, verify BGP sessions establish
3. **CML API**: `deploy.py --env cml-lab1 status` connects and lists nodes
4. **BMP multi-source**: Router from lab 2 sends BMP, appears in `SELECT * FROM routers` and Grafana
5. **Junos support**: NETCONF script connects to a Juniper router, pushes config
6. **Production dry-run**: Point a test router from the ISP network at the collector, verify end-to-end
7. **Clean deploy**: Clone repo on a fresh host, run `setup.sh`, `docker compose up`, confirm stack starts

---

## Risks

- **Router name collisions**: Enforce unique hostnames across all environments
- **Address space overlap**: Each environment needs distinct management subnets
- **Juniper BMP differences**: Junos BMP implementation may differ in supported tables/TLVs; test early
- **Production scale**: 500K-route labs are slow; production full tables will stress PostgreSQL more
- **Credentials in inventory**: Must be gitignored; consider env var fallback for CI/CD
- **Volunteer route server (E1)**: the AS57355 full-table feed has no SLA and can flap or be retired; the 2-hourly MRT fallback is mandatory, not optional
- **Full-table DB growth (E1)**: a live global feed roughly triples `ip_rib`; size disk and enable `ip_rib_log` compression before turning it on continuously
- **VRF work unverifiable (E4)**: no L3VPN source in the CML lab, so E4 ships to schema correctness only, validated later against production
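
To make the A2 helper functions concrete, here is a minimal sketch of how `config_loader.py` might look. It is not the planned implementation. It assumes the `inventory.yaml` layout shown in A1, and it assumes PyYAML for file loading. One deliberate difference from the signatures listed in A2: each helper takes the parsed inventory (`inv`) as an explicit first argument, and `get_exabgp_peers` takes an environment name, so the functions can be exercised without a file on disk.

```python
"""Sketch of config_loader.py helpers working on a parsed inventory dict."""
from typing import Any, Dict, List

Inventory = Dict[str, Any]


def load_inventory(path: str = "inventory.yaml") -> Inventory:
    """Read the inventory file. Assumes PyYAML is installed."""
    import yaml  # third-party; imported lazily so the helpers below work without it

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def get_env(inv: Inventory, name: str) -> Dict[str, Any]:
    """Return one environment block, failing loudly on an unknown name."""
    envs = inv.get("environments") or {}
    if name not in envs:
        raise KeyError(f"unknown environment: {name!r}")
    return envs[name] or {}


def get_routers_for_env(inv: Inventory, env_name: str) -> List[Dict[str, Any]]:
    """Flatten the routers mapping into dicts tagged with 'name' and 'env'."""
    routers = get_env(inv, env_name).get("routers") or {}
    return [
        {"name": name, "env": env_name, **(attrs or {})}
        for name, attrs in routers.items()
    ]


def get_all_routers(inv: Inventory) -> List[Dict[str, Any]]:
    """Routers from every environment, in inventory order."""
    out: List[Dict[str, Any]] = []
    for env_name in inv.get("environments") or {}:
        out.extend(get_routers_for_env(inv, env_name))
    return out


def get_routers_by_vendor(inv: Inventory, vendor: str) -> List[Dict[str, Any]]:
    """Routers whose 'vendor' field matches, e.g. 'iosxr' or 'junos'."""
    return [r for r in get_all_routers(inv) if r.get("vendor") == vendor]


def get_gnmi_targets(inv: Inventory) -> List[Dict[str, Any]]:
    """Routers flagged with 'gnmi: true'."""
    return [r for r in get_all_routers(inv) if r.get("gnmi")]


def get_exabgp_peers(inv: Inventory, env_name: str) -> List[Dict[str, Any]]:
    """ExaBGP peer entries for one environment, or an empty list if none."""
    return (get_env(inv, env_name).get("exabgp") or {}).get("peers") or []
```

A script refactored under A3 could then replace its `ROUTERS` dict with something like `get_routers_for_env(load_inventory(), "cml-lab1")`. Tagging each router with its environment name also gives the C4 dashboards and the name-collision check in Risks something to key on.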
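
The `EXABGP_PEERS` parsing described in B1 can be sketched as follows. B1 targets `exabgp/startup.sh`, so this Python version only illustrates the logic: split the comma-separated `ip:as:description` entries, fall back to the existing `EXABGP_PEER_1`/`EXABGP_PEER_2`/`EXABGP_PEER_AS` variables from A4, and emit one neighbor block per peer. The function names are hypothetical, IPv4 peers are assumed (the colon separator would clash with IPv6), and the neighbor template is a generic ExaBGP block that may not match what `startup.sh` currently writes.

```python
import ipaddress
from typing import List, Mapping, NamedTuple


class Peer(NamedTuple):
    ip: str
    peer_as: int
    description: str


def parse_peers(spec: str) -> List[Peer]:
    """Parse 'ip:as:description,ip:as:description' into Peer tuples."""
    peers = []
    for entry in (e.strip() for e in spec.split(",")):
        if not entry:
            continue  # tolerate trailing commas
        # Split on the first two colons only, so a description may contain ':'.
        parts = entry.split(":", 2)
        if len(parts) < 2:
            raise ValueError(f"expected ip:as[:description], got {entry!r}")
        # Assumption: IPv4 only. Raises ValueError on a malformed address.
        ip = str(ipaddress.IPv4Address(parts[0].strip()))
        peer_as = int(parts[1])
        # The description is optional here; default to the peer IP.
        description = parts[2].strip() if len(parts) == 3 else ip
        peers.append(Peer(ip, peer_as, description))
    return peers


def peers_from_env(env: Mapping[str, str]) -> List[Peer]:
    """Prefer EXABGP_PEERS; otherwise fall back to the legacy PEER_1/PEER_2 vars."""
    spec = env.get("EXABGP_PEERS", "").strip()
    if spec:
        return parse_peers(spec)
    peer_as = env.get("EXABGP_PEER_AS")
    legacy = [env.get("EXABGP_PEER_1"), env.get("EXABGP_PEER_2")]
    return [Peer(ip, int(peer_as), ip) for ip in legacy if ip and peer_as]


def render_neighbors(peers: List[Peer], local_ip: str, local_as: int) -> str:
    """Render one ExaBGP neighbor block per peer (template is an assumption)."""
    blocks = []
    for p in peers:
        blocks.append(
            f"neighbor {p.ip} {{\n"
            f'    description "{p.description}";\n'
            f"    router-id {local_ip};\n"
            f"    local-address {local_ip};\n"
            f"    local-as {local_as};\n"
            f"    peer-as {p.peer_as};\n"
            f"}}\n"
        )
    return "\n".join(blocks)
```

In a container entrypoint this might be called as `render_neighbors(peers_from_env(os.environ), os.environ["EXABGP_LOCAL_IP"], int(os.environ["EXABGP_LOCAL_AS"]))`, which also covers the "set 3+ peers" verification step without changing the two-peer `.env` files already in use.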