2026-05-15 14:23:38 -07:00
# OpenBMP Platform Roadmap
## Context
This BMP monitoring platform is being developed against CML virtual labs (IOS-XR) and will be deployed into an ISP production network running IOS-XR and Juniper routers/route reflectors. The two tracks share a common foundation: configuration must be environment-agnostic so the same stack runs identically against virtual or production routers.
Currently, router IPs, AS numbers, and credentials are hardcoded across 8+ files, tightly coupling the stack to a single CML lab. This roadmap addresses both the multi-lab development workflow and production deployment.
---
## Track A: Configuration Centralization (Foundation for Both Tracks)
### A1. Create `inventory.yaml` — unified topology inventory
**File**: `inventory.yaml` (new)
Single source of truth for all environments. Structure:
```yaml
platform:
  host_ip: 10.40.40.202
  bmp_port: 5000
  exabgp_port: 5050
environments:
  cml-lab1:
    type: cml  # cml | production
    description: "CML RR cluster - 9 IOS-XR virtual routers"
    cml_server: "https://10.40.40.174"
    cml_user: webui
    bgp_as: 65020
    netconf: { user: webui, password: cisco, port: 830 }
    exabgp:
      local_as: 65100
      peers:
        - { ip: 10.100.0.100, name: CORE-01, peer_as: 65020 }
        - { ip: 10.100.0.200, name: CORE-02, peer_as: 65020 }
    routers:
      CORE-01: { mgmt: 10.100.0.100, loopback: 10.10.255.0, role: rr, vendor: iosxr, gnmi: true }
      CORE-02: { mgmt: 10.100.0.200, loopback: 10.10.255.20, role: rr, vendor: iosxr, gnmi: true }
      R9K-01: { mgmt: 10.100.0.1, loopback: 10.10.255.1, role: client, vendor: iosxr }
      # ...
  cml-lab2:
    type: cml
    description: "Second CML Lab (TBD topology)"
    cml_server: "https://<lab2-ip>"
    routers: {}
  production:
    type: production
    description: "ISP production network"
    bgp_as: <prod-as>
    netconf: { user: <prod-user>, port: 830 }
    routers:
      # IOS-XR and Juniper RRs + routers
      PROD-RR1: { mgmt: x.x.x.x, role: rr, vendor: iosxr, gnmi: true }
      PROD-RR2: { mgmt: x.x.x.x, role: rr, vendor: junos }
      # ...
```
Key design decisions:
- `vendor: iosxr | junos` — drives NETCONF dialect, gNMI paths, and config templates
- `type: cml | production` — CML environments have `cml_server` for API automation; production does not
- Credentials in `inventory.yaml` (gitignored) or pulled from env vars
### A2. Create `config_loader.py` — Python inventory helper
**File**: `config_loader.py` (new)
Functions: `get_env(name)`, `get_all_routers()`, `get_routers_by_vendor(vendor)`, `get_exabgp_peers()`, `get_gnmi_targets()`, `get_routers_for_env(env_name)`
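A minimal sketch of how these helpers could look, assuming the inventory layout from A1. Two choices here are assumptions rather than part of the plan: PyYAML as the parser, and passing the loaded inventory explicitly (the listed signatures take no such argument) so the helpers can be exercised on plain dicts.

```python
"""Sketch of config_loader.py helpers; assumes the inventory layout shown in A1."""
from typing import Any, Dict, List, Optional

Inventory = Dict[str, Any]


def load_inventory(path: str = "inventory.yaml") -> Inventory:
    """Parse the inventory file. PyYAML is an assumed dependency."""
    import yaml  # imported lazily so the helpers below work without it

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def get_env(inv: Inventory, name: str) -> Dict[str, Any]:
    """Return one environment block, failing loudly on an unknown name."""
    try:
        return inv["environments"][name]
    except KeyError:
        raise KeyError(f"unknown environment: {name!r}") from None


def get_routers_for_env(inv: Inventory, env_name: str) -> List[Dict[str, Any]]:
    """Flatten an environment's routers into dicts carrying name and env."""
    routers = get_env(inv, env_name).get("routers") or {}
    return [{"name": n, "env": env_name, **attrs} for n, attrs in routers.items()]


def get_all_routers(inv: Inventory) -> List[Dict[str, Any]]:
    """Routers from every environment, in inventory order."""
    out: List[Dict[str, Any]] = []
    for env_name in inv.get("environments", {}):
        out.extend(get_routers_for_env(inv, env_name))
    return out


def get_routers_by_vendor(inv: Inventory, vendor: str) -> List[Dict[str, Any]]:
    """Routers whose `vendor` field matches (iosxr or junos)."""
    return [r for r in get_all_routers(inv) if r.get("vendor") == vendor]


def get_gnmi_targets(inv: Inventory, env_name: Optional[str] = None) -> List[Dict[str, Any]]:
    """Routers flagged `gnmi: true`, optionally limited to one environment."""
    routers = get_routers_for_env(inv, env_name) if env_name else get_all_routers(inv)
    return [r for r in routers if r.get("gnmi")]


def get_exabgp_peers(inv: Inventory, env_name: str) -> List[Dict[str, Any]]:
    """ExaBGP peer list for an environment; empty when none is defined."""
    return get_env(inv, env_name).get("exabgp", {}).get("peers", [])
```

Carrying `name` and `env` on each router dict keeps later consumers (NETCONF scripts, Telegraf generation) from needing a second lookup.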
### A3. Refactor hardcoded Python scripts
Replace `ROUTERS` dicts/lists with `config_loader` calls:
- `exabgp/route_diversity_config.py` (line 47)
- `exabgp/bgpls_config.py` (line 35)
- `gnmi/gnmi_grpc_config.py` (line 25)
### A4. Expand `.env` and parameterize `docker-compose.yml`
Add to `.env`:
```env
OBMP_DATA_ROOT=/var/openbmp
DOCKER_HOST_IP=10.40.40.202
EXABGP_LOCAL_IP=10.40.40.202
EXABGP_LOCAL_AS=65100
EXABGP_PEER_AS=65020
EXABGP_PEER_1=10.100.0.100
EXABGP_PEER_2=10.100.0.200
```
Replace hardcoded IPs in `docker-compose.yml` (Kafka listener, ExaBGP env vars).
### A5. Telegraf config parameterization
Replace hardcoded gNMI addresses in `telegraf/telegraf.conf` with env var substitution. Pass `GNMI_TARGETS` from `docker-compose.yml`.
### A6. Fix InfluxDB datasource URL
`obmp-grafana/provisioning/datasources/influxdb-ds.yml`: replace `http://10.40.40.202:8086` with `http://obmp-influxdb:8086`.
---
## Track B: Multi-Lab CML Development
### B1. Dynamic ExaBGP multi-peer support
**File**: `exabgp/startup.sh`
Accept an `EXABGP_PEERS` env var (comma-separated `ip:as:description` entries) and generate one neighbor block per entry. Keep the `EXABGP_PEER_1`/`EXABGP_PEER_2` variables as a fallback.
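The parsing and rendering logic can be sketched as follows. This is Python for illustration only (the real change lands in the shell script `exabgp/startup.sh`), and the function names are hypothetical. It assumes IPv4 peers, since an IPv6 address would collide with the `:` separator.

```python
from typing import List, NamedTuple


class Peer(NamedTuple):
    ip: str
    asn: int
    description: str


def parse_peers(spec: str) -> List[Peer]:
    """Parse 'ip:as:description,ip:as:description' into Peer tuples."""
    peers = []
    for entry in filter(None, (e.strip() for e in spec.split(","))):
        # Split on the first two colons only, so a description may contain ':'.
        parts = entry.split(":", 2)
        if len(parts) != 3:
            raise ValueError(f"expected ip:as:description, got {entry!r}")
        ip, asn, description = parts
        peers.append(Peer(ip, int(asn), description))
    return peers


def render_neighbors(peers: List[Peer], local_ip: str, local_as: int) -> str:
    """Render one ExaBGP neighbor block per peer."""
    blocks = []
    for p in peers:
        blocks.append(
            f"neighbor {p.ip} {{\n"
            f'    description "{p.description}";\n'
            f"    router-id {local_ip};\n"
            f"    local-address {local_ip};\n"
            f"    local-as {local_as};\n"
            f"    peer-as {p.asn};\n"
            f"}}\n"
        )
    return "\n".join(blocks)
```

Rejecting malformed entries up front is deliberate: a silently skipped peer would only show up later as a missing BGP session.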
### B2. CML API client module
**File**: `cml/cml_client.py` (new)
Python module using `virl2_client` SDK:
- Connect to CML server (creds from `inventory.yaml`)
- Upload node/image definitions
- Import/export topology YAML
- Start/stop/destroy labs
- Get node status
### B3. Topology template system
**File**: `cml/templates/xrd_rr.j2` (new)
Jinja2 templates for XRd startup config. Parameterize: hostname, loopback, link IPs, IS-IS NET, BGP AS, neighbor IPs, BMP target.
### B4. CLI deployment tool
**File**: `cml/deploy.py` (new)
```bash
python3 cml/deploy.py --env cml-lab1 status
python3 cml/deploy.py --env cml-lab1 upload-images
python3 cml/deploy.py --env cml-lab2 create
python3 cml/deploy.py --env cml-lab2 start
python3 cml/deploy.py --env cml-lab2 destroy
```
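The command shape above maps onto a small `argparse` front end. The sketch below covers only argument handling; dispatch into `cml/cml_client.py` is left as a stub, and the function names are illustrative assumptions.

```python
import argparse
from typing import List, Optional

# The five operations shown in the usage examples.
ACTIONS = ("status", "upload-images", "create", "start", "destroy")


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI: a required --env option plus one positional action."""
    parser = argparse.ArgumentParser(
        prog="deploy.py", description="Manage a CML lab environment"
    )
    parser.add_argument(
        "--env", required=True, help="environment name from inventory.yaml"
    )
    parser.add_argument("action", choices=ACTIONS, help="operation to run")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments and report the requested operation."""
    args = build_parser().parse_args(argv)
    # Stub: a real tool would look up args.env in the inventory and call
    # the CML client here.
    print(f"{args.action} requested for environment {args.env}")
    return 0
```

Using `choices` makes `argparse` reject unknown actions with a usage message before any lab is touched.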
### B5. Update build scripts with API push
`cml/build-cml-image.sh` and `cml/build-xrd-image.sh` get a `--push <env-name>` flag.
---
## Track C: Production ISP Deployment
### C1. Multi-vendor NETCONF support
Current scripts assume IOS-XR NETCONF only. For Juniper RRs:
- `config_loader.py` provides `vendor` field per router
- NETCONF scripts branch on vendor for dialect differences (with ncclient, `device_params={'name': 'iosxr'}` vs `device_params={'name': 'junos'}`)
- Route diversity, BGP-LS config scripts get Junos templates alongside IOS-XR
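One way to keep the vendor branch in a single place is a helper that turns an inventory router entry into connection arguments. The sketch assumes ncclient is the NETCONF library (its `manager.connect` accepts `device_params` as a dict with a `name` key); the helper name is hypothetical and nothing here opens a connection.

```python
from typing import Any, Dict, Mapping

# ncclient device handler names, keyed by the inventory's `vendor` field.
DEVICE_HANDLERS = {"iosxr": "iosxr", "junos": "junos"}


def netconf_connect_kwargs(
    router: Mapping[str, Any], netconf: Mapping[str, Any]
) -> Dict[str, Any]:
    """Build keyword arguments intended for ncclient's manager.connect()."""
    vendor = router.get("vendor")
    if vendor not in DEVICE_HANDLERS:
        label = router.get("name", router["mgmt"])
        raise ValueError(f"unsupported vendor {vendor!r} for router {label}")
    return {
        "host": router["mgmt"],
        "port": netconf.get("port", 830),
        "username": netconf["user"],
        "password": netconf.get("password"),
        "device_params": {"name": DEVICE_HANDLERS[vendor]},
    }
```

A caller would then write something like `manager.connect(**netconf_connect_kwargs(router, env["netconf"]))`, and an unknown vendor fails before any session is attempted.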
### C2. Multi-vendor gNMI paths
Telegraf gNMI subscriptions currently use OpenConfig paths, which work for both IOS-XR and Junos, but:
- Verify Juniper gNMI support on target hardware
- Add vendor-specific path overrides in `inventory.yaml` if needed
- Telegraf can subscribe to multiple targets with different configs via `[[inputs.gnmi]]` blocks
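As a rough sketch of the per-vendor approach, the function below groups gNMI-enabled routers by vendor and emits one `[[inputs.gnmi]]` block per group. The port default, the `GNMI_USER`/`GNMI_PASSWORD` variable names, the subscription name, and the override mapping are all assumptions for illustration, not settings taken from the existing `telegraf.conf`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Optional

# OpenConfig interface counters, used when a vendor has no override.
DEFAULT_PATH = "/interfaces/interface/state/counters"


def render_gnmi_inputs(
    routers: Iterable[Mapping[str, object]],
    port: int = 57400,  # assumed gRPC port; confirm per platform
    path_overrides: Optional[Mapping[str, str]] = None,
) -> str:
    """Render one Telegraf [[inputs.gnmi]] block per vendor."""
    overrides = path_overrides or {}
    targets: Dict[str, List[str]] = defaultdict(list)
    for router in routers:
        if router.get("gnmi"):  # only routers flagged gnmi: true
            targets[str(router["vendor"])].append(f'{router["mgmt"]}:{port}')

    blocks = []
    for vendor in sorted(targets):
        addresses = ", ".join(f'"{addr}"' for addr in targets[vendor])
        blocks.append(
            "[[inputs.gnmi]]\n"
            f"  addresses = [{addresses}]\n"
            '  username = "${GNMI_USER}"\n'
            '  password = "${GNMI_PASSWORD}"\n'
            "  [[inputs.gnmi.subscription]]\n"
            f'    name = "{vendor}_ifcounters"\n'
            f'    path = "{overrides.get(vendor, DEFAULT_PATH)}"\n'
            '    subscription_mode = "sample"\n'
            '    sample_interval = "10s"\n'
        )
    return "\n".join(blocks)
```

Credentials stay as `${...}` placeholders so Telegraf resolves them from its environment at start-up rather than having them written into the generated file.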
### C3. BMP considerations for production
- BMP collector (port 5000) accepts connections from any router — no changes needed
- Production routers need BMP config pushed (manual or via NETCONF automation)
- Consider: separate BMP server IDs per environment for dashboard filtering
- Juniper BMP config differs from IOS-XR — add Junos BMP config templates
### C4. Dashboard multi-environment awareness
- Add a Grafana template variable for environment filtering (by router name prefix or a tag)
- Consider a "Network Overview" dashboard that shows all environments side-by-side
- Existing dashboards work as-is — router dropdowns will show all BMP-reporting routers
### C5. Security hardening for production
- Move credentials out of `inventory.yaml` into environment variables or a secrets manager
- Authelia config: stronger passwords, TOTP enforcement, session timeouts
- PostgreSQL: restrict access, enable SSL
- Kafka: consider authentication if exposed beyond localhost
- BMP port: firewall to only accept connections from known router management IPs
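The first item could start as an environment-variable override with the inventory as fallback. In this sketch the variable naming scheme (`OBMP_<ENV>_NETCONF_USER` and `OBMP_<ENV>_NETCONF_PASSWORD`) is hypothetical, as is the function name.

```python
import os
from typing import Dict, Mapping


def resolve_netconf_credentials(
    env_name: str,
    netconf: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Prefer environment variables; fall back to the inventory block."""
    # "cml-lab1" -> "OBMP_CML_LAB1"
    prefix = "OBMP_" + env_name.upper().replace("-", "_")
    user = environ.get(f"{prefix}_NETCONF_USER", netconf.get("user"))
    password = environ.get(f"{prefix}_NETCONF_PASSWORD", netconf.get("password"))
    if not user or not password:
        raise RuntimeError(f"NETCONF credentials for {env_name} are not configured")
    return {"user": user, "password": password}
```

With this in place the production block in `inventory.yaml` can omit `password` entirely, matching the A1 example where only `user` and `port` are listed.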
### C6. Scalability considerations
- Monitor PostgreSQL disk usage and query performance with production-scale RIBs
- TimescaleDB compression policies for historical data (`ip_rib_log`, `ls_*_log`)
- Kafka topic partitioning if message throughput is high
- Consider read replicas or materialized views for heavy Grafana queries
---
## Track D: Packaging & Distribution
### D1. Configuration templates
- `inventory.yaml.example` — documented example with placeholder values
- `.env.example` — all environment variables with descriptions
### D2. Bootstrap script
`setup.sh` that:
- Creates required directories (`$OBMP_DATA_ROOT/authelia`, etc.)
- Copies example configs if originals don't exist
- Validates `inventory.yaml` syntax
- Generates Telegraf config from inventory
### D3. Published Docker images
Push custom images to a registry (Docker Hub or GHCR):
- `obmp-exabgp`
- `obmp-exabgp-ui`
- `obmp-traffic-gen`
- `obmp-traffic-gen-ui`
- `obmp-portal`
Replace `build:` with `image:` in `docker-compose.yml` (keep build as override).
### D4. Documentation
- `docs/quickstart.md` — 5-minute setup guide
- `docs/adding-a-lab.md` — how to add a CML lab environment
- `docs/production-deployment.md` — production hardening checklist
- `docs/architecture.md` — system diagram, data flow, port map
---
2026-05-19 07:19:34 -07:00
## Track E: Internet-Scale Routing Analytics
Adds a local copy of the real global routing table, generalizes router
comparison to an N-way diff, and threads VRF/RD scoping through the
dashboards. The full-table feed (E1) is the foundation — E2/E3 consume it.
### E1. GoBGP full-table feed → BMP → `ip_rib`
**Files**: `docker-compose.yml` (new `gobgp` service), `gobgp/gobgpd.conf` (new), `gobgp/mrt-refresh.sh` (new)
Stand up a GoBGP container that obtains a full Internet table (IPv4 ~1M +
IPv6 ~200k) and BMP-exports it to the existing OpenBMP collector, so the
global table lands in `ip_rib` as an ordinary monitored peer — every
existing dashboard and the diff then work against it for free.
- **Primary feed**: eBGP multihop session to Łukasz Bromirski's lab route
server, **AS57355** (`85.232.240.179`, `2001:1a68:2c:2::179`). Local ASN
private (e.g. 65199); announce nothing; `ebgp-multihop` TTL ~64; receive-only.
- **BMP export**: GoBGP `[[bmp-servers]]` block pointing at the collector (port 5000),
`route-monitoring-policy = pre-policy`.
- **Fallback / seed**: `gobgp/mrt-refresh.sh`, run every 2h (host cron or a
sidecar), downloads the latest RouteViews (`archive.routeviews.org`) or
RIPE-RIS MRT RIB dump and injects it into the same instance with `gobgp mrt inject`.
- **Identification**: distinct BMP router name (e.g. `GLOBAL-FEED`) so
dashboards can include/exclude it.
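The refresh job has to decide which dump to fetch. The sketch below, in Python for illustration (the planned script is shell), floors the current time to the most recent 2-hour boundary after allowing a publication lag. The 30-minute lag and the URL template are assumptions; the archive's actual directory layout and dump schedule should be checked before relying on them.

```python
from datetime import datetime, timedelta, timezone

# Assumed archive layout; verify against the RouteViews archive before use.
URL_TEMPLATE = "http://archive.routeviews.org/bgpdata/{ym}/RIBS/rib.{stamp}.bz2"


def latest_rib_slot(now: datetime, lag: timedelta = timedelta(minutes=30)) -> datetime:
    """Most recent 2-hour UTC boundary at least `lag` in the past.

    `now` must be timezone-aware.
    """
    t = (now - lag).astimezone(timezone.utc)
    return t.replace(hour=t.hour - t.hour % 2, minute=0, second=0, microsecond=0)


def rib_url(slot: datetime, template: str = URL_TEMPLATE) -> str:
    """Fill the URL template for a given dump slot."""
    return template.format(
        ym=slot.strftime("%Y.%m"), stamp=slot.strftime("%Y%m%d.%H%M")
    )
```

Subtracting the lag before flooring means a run shortly after a boundary falls back to the previous dump instead of requesting a file that may not be published yet.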
Caveats:
- The route server is a single volunteer-run host, no SLA — the MRT fallback
is the reliability backstop, not optional.
- A full table roughly triples `ip_rib` size — see E-scale below.
- The feed carries **no VRF/L3VPN** routes — global unicast only.
### E2. Generic multi-router diff dashboard
**File**: `obmp-grafana/dashboards/.../router_diff.json` (new, uid `router-diff`), generalized from `rr_locrib_diff.json`
Replace the hardwired RR1-vs-RR2 model with up to **4 selectable routers**:
- Template vars `router1`-`router4` (query type); `router1`/`router2` required,
`router3`/`router4` default to a "— none —" sentinel and their panels hide
when unset.
- **Presence matrix** — rows = prefixes, columns = selected routers, cell =
present / next-hop / origin-AS; the core view.
- **Divergence view** — table of prefixes where the selected routers disagree
(missing on some, or differing best-path attributes).
- Keep the per-prefix all-paths drill-down from the RR diff.
- The global feed (E1) is selectable as any of the 4 → "lab vs the real
Internet." The existing `rr-locrib-diff` stays as the RR-specific quick view.
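The presence matrix and divergence view reduce to a small set computation. In the dashboard this would be SQL over `ip_rib`; the Python sketch below only illustrates the logic, with a simplified best-path tuple of (next-hop, origin-AS) as an assumption.

```python
from typing import Dict, List, Optional, Tuple

BestPath = Tuple[str, int]  # (next_hop, origin_as), simplified for illustration
Rib = Dict[str, BestPath]   # prefix -> best path
Matrix = Dict[str, Dict[str, Optional[BestPath]]]


def presence_matrix(ribs: Dict[str, Rib]) -> Matrix:
    """Rows are prefixes, columns are routers; None marks an absent prefix."""
    prefixes = set().union(*(rib.keys() for rib in ribs.values()))
    return {
        prefix: {router: rib.get(prefix) for router, rib in ribs.items()}
        for prefix in sorted(prefixes)
    }


def divergent(matrix: Matrix) -> List[str]:
    """Prefixes missing on some router or with differing best-path attributes."""
    return [p for p, row in matrix.items() if len(set(row.values())) > 1]
```

Treating "absent" as just another cell value (`None`) lets one comparison cover both divergence cases: a prefix missing on some routers, and a prefix present everywhere with different attributes.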
### E3. Global table exploration dashboard
**File**: `obmp-grafana/dashboards/.../global_table.json` (new)
Explorable dashboard over the `GLOBAL-FEED` peer: prefix count by AFI,
origin-AS distribution, prefix-length histogram, search by prefix/AS,
more-/less-specific lookups. Doubles as the comparison baseline for E2.
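The histogram and the more-/less-specific lookups can be prototyped with the standard `ipaddress` module before writing the dashboard queries. The function names are illustrative, and the linear scan is only meant to show the semantics, not to run against a full table.

```python
import ipaddress
from collections import Counter
from typing import Iterable, List


def length_histogram(prefixes: Iterable[str]) -> Counter:
    """Count prefixes per mask length (IPv4 and IPv6 share one counter here)."""
    return Counter(ipaddress.ip_network(p).prefixlen for p in prefixes)


def more_specifics(target: str, prefixes: Iterable[str]) -> List[str]:
    """Prefixes strictly contained in `target`."""
    net = ipaddress.ip_network(target)
    out = []
    for p in prefixes:
        cand = ipaddress.ip_network(p)
        # subnet_of() raises on mixed address families, so compare versions first.
        if cand.version == net.version and cand != net and cand.subnet_of(net):
            out.append(p)
    return out


def less_specifics(target: str, prefixes: Iterable[str]) -> List[str]:
    """Prefixes that strictly contain `target`."""
    net = ipaddress.ip_network(target)
    out = []
    for p in prefixes:
        cand = ipaddress.ip_network(p)
        if cand.version == net.version and cand != net and cand.supernet_of(net):
            out.append(p)
    return out
```

In PostgreSQL the same containment tests are available directly on `cidr`/`inet` columns through the `<<` and `>>` operators, which is the more likely form for the Grafana panels.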
### E4. VRF / RD awareness
**Files**: existing unicast + L3VPN dashboards
Thread a Route-Distinguisher / VRF scoping dimension through the dashboards:
- Add a `vrf` / `rd` template variable to the L3VPN dashboards and unicast
dashboards where applicable.
- VRF/RD columns and filters on RIB tables.
- The diff (E2) gains a per-VRF scope.
Constraint (stated plainly): CML IOS-XR images can't originate L3VPN routes
and the global feed carries none — so E4 is **built to the L3VPN schema and
unverifiable in this lab**; it validates only against production routers.
Keep E4 scope minimal until there's a real L3VPN source.
2026-05-19 08:01:59 -07:00
### E5. L2VPN / EVPN support — platform-level, not a dashboard task
L2VPN/EVPN was requested alongside L3VPN. **It cannot be done as a dashboard
change.** Research findings on where the gap actually is:
- **Collector** (`openbmp/collector`): *already decodes EVPN*. It has an
`EVPN.cpp` parser and emits a parsed `openbmp.parsed.evpn` Kafka topic
(RD, ESI, MAC, ethernet-tag, IP, labels, route-targets). No work needed.
- **psql-app** (`openbmp/psql-app`): **drops it**. It never subscribes to
`openbmp.parsed.evpn`, has no `EvpnQuery` handler, and the PostgreSQL
schema has no EVPN table. This is the whole gap.
- **L2VPN-VPLS (AFI 25, SAFI 65)**: not supported anywhere; only EVPN (AFI 25, SAFI 70).
Two viable paths:
1. **Fork the psql-app** (Java): subscribe to the evpn topic, add an
`EvpnQuery` class, add an `evpn_rib` table + history/stats. Keeps one
unified schema; cost is owning a Java fork of a slow-moving upstream and
inheriting the collector's older EVPN parser (likely no RFC 9251/9572
route types).
2. **Run GoBMP** (`sbezverk/gobmp`, Go) as a second collector, with the strongest,
most current EVPN decoding, plus a thin Kafka→Postgres consumer landing
an `evpn_rib` table. Less code than the Java fork, but two collectors and
two ingest paths.
Recommended: path 2 for fastest EVPN visibility; path 1 if a single unified
OpenBMP schema outweighs the extra effort. Either way, then build EVPN
dashboards (per-EVI, MAC mobility, RT scoping).
2026-05-19 09:24:08 -07:00
**Status / measured:** the `evpn_rib` table is in place
(`postgres/scripts/007_obmp_evpn.sql`) and a profile-gated `gobgp-evpn`
injector exercises the pipeline. Verified against the running collector
2.2.3: EVPN **type-2 (MAC/IP)** and **type-3 (inclusive multicast)** parse
cleanly onto `openbmp.parsed.evpn`; **type-5 (IP-prefix) is mis-decoded** (the
prefix corrupts the RD field). So a path-1 fork inherits a collector that
only does type-2/3 reliably — another point in favour of path 2 (GoBMP) if
type-5 matters. Next: the `obmp-evpn-consumer` for type-2/3, then the
dashboard.
### E-scale. PostgreSQL sizing for a full table
A full v4+v6 table is ~1.2M prefixes; with attributes and history this is a
multi-GB addition to `ip_rib` / `ip_rib_log`. Before enabling E1 continuously,
confirm disk headroom on `$OBMP_DATA_ROOT` and apply TimescaleDB compression to
`ip_rib_log` (also flagged in C6). The `mv_as_adjacency` materialized view
(already in place in `postgres/scripts/006_obmp_matviews.sql`) becomes far
more valuable once real-Internet AS paths are present.
---
## Implementation Order
| Priority | Step | Track | Description |
|----------|------|-------|-------------|
| 1 | A1 | Foundation | Create `inventory.yaml` |
| 2 | A2 | Foundation | Create `config_loader.py` |
| 3 | A3 | Foundation | Refactor hardcoded Python scripts |
| 4 | A4 | Foundation | Parameterize `.env` + docker-compose |
| 5 | A5-A6 | Foundation | Telegraf + InfluxDB datasource fixes |
| 6 | B1 | CML Dev | Dynamic ExaBGP multi-peer |
| 7 | B2-B4 | CML Dev | CML API client + deploy CLI |
| 8 | C1 | Production | Multi-vendor NETCONF (Junos support) |
| 9 | C3 | Production | Junos BMP config templates |
| 10 | C5 | Production | Security hardening |
| 11 | D1-D2 | Packaging | Config templates + bootstrap script |
| 12 | D3 | Packaging | Publish Docker images to registry |
| 13 | D4 | Packaging | Documentation |
| 14 | E1 | Analytics | GoBGP full-table feed (AS57355 live + MRT fallback) |
| 15 | E2 | Analytics | Generic 4-router diff dashboard |
| 16 | E3 | Analytics | Global table exploration dashboard |
| 17 | E4 | Analytics | VRF/RD scoping for L3VPN (to schema, lab-unverifiable) |
| 18 | E5 | Platform | L2VPN/EVPN support — research spike, then collector/schema work |
Steps 1-5 (Track A) unblock Tracks B-D. Steps 6-7 and 8-10 can proceed in parallel once the foundation is in place. Track E is independent of A-D: E1 is the foundation for E2/E3; E4 can proceed any time but is lab-unverifiable.
---
## Verification
1. **Config centralization**: Change a router IP in `inventory.yaml`, verify all scripts pick it up
2. **ExaBGP multi-peer**: Set 3+ peers, restart, verify BGP sessions establish
3. **CML API**: `deploy.py --env cml-lab1 status` connects and lists nodes
4. **BMP multi-source**: Router from lab 2 sends BMP, appears in `SELECT * FROM routers` and Grafana
5. **Junos support**: NETCONF script connects to a Juniper router, pushes config
6. **Production dry-run**: Point a test router from the ISP network at the collector, verify end-to-end
7. **Clean deploy**: Clone repo on a fresh host, run `setup.sh`, `docker compose up`, confirm stack starts
---
## Risks
- **Router name collisions**: Enforce unique hostnames across all environments
- **Address space overlap**: Each environment needs distinct management subnets
- **Juniper BMP differences**: Junos BMP implementation may differ in supported tables/TLVs — test early
- **Production scale**: 500K-route labs are slow; production full tables will stress PostgreSQL more
- **Credentials in inventory**: Must be gitignored; consider env var fallback for CI/CD
- **Volunteer route server (E1)**: the AS57355 full-table feed has no SLA and can flap or be retired — the 2-hourly MRT fallback is mandatory, not optional
- **Full-table DB growth (E1)**: a live global feed roughly triples `ip_rib`; size disk and enable `ip_rib_log` compression before turning it on continuously
- **VRF work unverifiable (E4)**: no L3VPN source in the CML lab — E4 ships to schema correctness only, validated later against production