# OpenBMP Platform Roadmap

## Context

This BMP monitoring platform is being developed against CML virtual labs (IOS-XR) and will be deployed into an ISP production network running IOS-XR and Juniper routers and route reflectors. Lab development and production deployment share a common foundation: configuration must be environment-agnostic so that the same stack runs identically against virtual or production routers.

Currently, router IPs, AS numbers, and credentials are hardcoded across 8+ files, tightly coupling the stack to a single CML lab. This roadmap addresses both the multi-lab development workflow and production deployment.

---

## Track A: Configuration Centralization (Foundation for Both Tracks)

### A1. Create `inventory.yaml`: unified topology inventory

**File**: `inventory.yaml` (new)

Single source of truth for all environments. Structure:

```yaml
platform:
  host_ip: 10.40.40.202
  bmp_port: 5000
  exabgp_port: 5050

environments:
  cml-lab1:
    type: cml                    # cml | production
    description: "CML RR cluster - 9 IOS-XR virtual routers"
    cml_server: "https://10.40.40.174"
    cml_user: webui
    bgp_as: 65020
    netconf: { user: webui, password: cisco, port: 830 }
    exabgp:
      local_as: 65100
      peers:
        - { ip: 10.100.0.100, name: CORE-01, peer_as: 65020 }
        - { ip: 10.100.0.200, name: CORE-02, peer_as: 65020 }
    routers:
      CORE-01: { mgmt: 10.100.0.100, loopback: 10.10.255.0, role: rr, vendor: iosxr, gnmi: true }
      CORE-02: { mgmt: 10.100.0.200, loopback: 10.10.255.20, role: rr, vendor: iosxr, gnmi: true }
      R9K-01: { mgmt: 10.100.0.1, loopback: 10.10.255.1, role: client, vendor: iosxr }
      # ...

  cml-lab2:
    type: cml
    description: "Second CML Lab (TBD topology)"
    cml_server: "https://<lab2-ip>"
    routers: {}

  production:
    type: production
    description: "ISP production network"
    bgp_as: <prod-as>
    netconf: { user: <prod-user>, port: 830 }
    routers:
      # IOS-XR and Juniper RRs + routers
      PROD-RR1: { mgmt: x.x.x.x, role: rr, vendor: iosxr, gnmi: true }
      PROD-RR2: { mgmt: x.x.x.x, role: rr, vendor: junos }
      # ...
```

Key design decisions:

- `vendor: iosxr | junos`: drives the NETCONF dialect, gNMI paths, and config templates
- `type: cml | production`: CML environments have `cml_server` for API automation; production does not
- Credentials live in `inventory.yaml` (gitignored) or are pulled from environment variables
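
The last decision leaves open how environment variables and the inventory file interact. One possible reading, sketched below, is that an environment variable overrides the inventory value when both exist. The variable naming scheme (`<ENV>_NETCONF_USER`, `<ENV>_NETCONF_PASSWORD`) and the function name are hypothetical; the roadmap does not define them.

```python
import os


def _env_prefix(env_name):
    """Turn an environment name such as 'cml-lab1' into 'CML_LAB1'."""
    return env_name.upper().replace("-", "_")


def resolve_netconf_credentials(env_name, env_cfg, environ=None):
    """Return NETCONF credentials, letting env vars override inventory values.

    The variable names (<PREFIX>_NETCONF_USER / <PREFIX>_NETCONF_PASSWORD)
    are an assumption made for this sketch, not part of the roadmap.
    """
    environ = os.environ if environ is None else environ
    netconf = env_cfg.get("netconf", {})
    prefix = _env_prefix(env_name)

    user = environ.get(f"{prefix}_NETCONF_USER", netconf.get("user"))
    password = environ.get(f"{prefix}_NETCONF_PASSWORD", netconf.get("password"))
    if user is None or password is None:
        raise KeyError(f"no NETCONF credentials found for environment {env_name!r}")

    # 830 mirrors the port used throughout the inventory example.
    return {"user": user, "password": password, "port": netconf.get("port", 830)}
```

With this shape, the `production` entry can omit `password` from the file entirely and supply it at runtime, while the lab entries keep working from the file alone.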

### A2. Create `config_loader.py`: Python inventory helper

**File**: `config_loader.py` (new)

Functions: `get_env(name)`, `get_all_routers()`, `get_routers_by_vendor(vendor)`, `get_exabgp_peers()`, `get_gnmi_targets()`, `get_routers_for_env(env_name)`
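
A minimal sketch of how these helpers might look, assuming the inventory layout from A1. Unlike the signatures listed above, each function here takes the parsed inventory as an explicit first argument so it can be exercised without a file on disk; a real module could cache the loaded file and drop that argument. `load_inventory` and the added `name`/`env` keys are choices made for this sketch.

```python
def load_inventory(path="inventory.yaml"):
    """Parse the inventory file. Requires PyYAML, imported lazily."""
    import yaml  # third-party: pip install pyyaml

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def get_env(inventory, name):
    """Return the configuration block for one environment."""
    return inventory["environments"][name]


def get_routers_for_env(inventory, env_name):
    """Return router dicts for one environment, tagged with name and env."""
    routers = get_env(inventory, env_name).get("routers") or {}
    return [dict(attrs, name=name, env=env_name) for name, attrs in routers.items()]


def get_all_routers(inventory):
    """Return routers from every environment, in inventory order."""
    return [
        router
        for env_name in inventory["environments"]
        for router in get_routers_for_env(inventory, env_name)
    ]


def get_routers_by_vendor(inventory, vendor):
    """Return routers whose vendor field matches (e.g. 'iosxr' or 'junos')."""
    return [r for r in get_all_routers(inventory) if r.get("vendor") == vendor]


def get_exabgp_peers(inventory, env_name):
    """Return the ExaBGP peer list for one environment (empty if undefined)."""
    return get_env(inventory, env_name).get("exabgp", {}).get("peers", [])


def get_gnmi_targets(inventory):
    """Return routers flagged with gnmi: true."""
    return [r for r in get_all_routers(inventory) if r.get("gnmi")]
```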

### A3. Refactor hardcoded Python scripts

Replace `ROUTERS` dicts/lists with `config_loader` calls:

- `exabgp/route_diversity_config.py` (line 47)
- `exabgp/bgpls_config.py` (line 35)
- `gnmi/gnmi_grpc_config.py` (line 25)

### A4. Expand `.env` and parameterize `docker-compose.yml`

Add to `.env`:

```env
OBMP_DATA_ROOT=/var/openbmp
DOCKER_HOST_IP=10.40.40.202
EXABGP_LOCAL_IP=10.40.40.202
EXABGP_LOCAL_AS=65100
EXABGP_PEER_AS=65020
EXABGP_PEER_1=10.100.0.100
EXABGP_PEER_2=10.100.0.200
```

Replace the hardcoded IPs in `docker-compose.yml` (Kafka listener, ExaBGP env vars) with references to these variables.
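
To keep `.env` from drifting away from `inventory.yaml`, the ExaBGP-related lines could be generated rather than hand-edited. The sketch below is one way to do that; the function name is hypothetical, and it assumes that `EXABGP_LOCAL_IP` equals the platform `host_ip` and that all peers share one AS (both hold for the `cml-lab1` example, but neither is stated as a rule in this roadmap).

```python
def render_env_lines(inventory, env_name):
    """Build the ExaBGP-related .env lines for one environment."""
    platform = inventory["platform"]
    exabgp = inventory["environments"][env_name]["exabgp"]
    peers = exabgp.get("peers", [])

    lines = [
        f"DOCKER_HOST_IP={platform['host_ip']}",
        # Assumption: ExaBGP binds to the same address as the Docker host.
        f"EXABGP_LOCAL_IP={platform['host_ip']}",
        f"EXABGP_LOCAL_AS={exabgp['local_as']}",
    ]
    if peers:
        # Assumption: a single peer AS, taken from the first peer.
        lines.append(f"EXABGP_PEER_AS={peers[0]['peer_as']}")
    for index, peer in enumerate(peers, start=1):
        lines.append(f"EXABGP_PEER_{index}={peer['ip']}")
    return lines
```

Writing the result is then a matter of joining the lines with newlines and appending values that do not come from the inventory, such as `OBMP_DATA_ROOT`.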

### A5. Telegraf config parameterization

Replace hardcoded gNMI addresses in `telegraf/telegraf.conf` with env var substitution. Pass `GNMI_TARGETS` from `docker-compose.yml`.
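
The value of `GNMI_TARGETS` could be derived from the routers flagged `gnmi: true` in the inventory. The sketch below assumes a comma-separated `host:port` format, a default gRPC port of 57400, and an optional per-router `gnmi_port` key; none of these are fixed by the roadmap, so treat them as placeholders to adjust.

```python
DEFAULT_GNMI_PORT = 57400  # assumption: set this to the gRPC port configured on the routers


def build_gnmi_targets(routers, default_port=DEFAULT_GNMI_PORT):
    """Return 'host:port,host:port' for every router flagged gnmi: true.

    `routers` is the mapping found under an environment's `routers` key.
    """
    targets = []
    for name in sorted(routers):  # sorted for a stable, diff-friendly result
        attrs = routers[name]
        if attrs.get("gnmi"):
            # gnmi_port is a hypothetical per-router override key.
            port = attrs.get("gnmi_port", default_port)
            targets.append(f"{attrs['mgmt']}:{port}")
    return ",".join(targets)
```

How `telegraf.conf` consumes the variable (one list entry per target versus a single substituted string) still needs to be checked against the Telegraf version in use.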

### A6. Fix InfluxDB datasource URL

`obmp-grafana/provisioning/datasources/influxdb-ds.yml`: replace `http://10.40.40.202:8086` with `http://obmp-influxdb:8086`.

---

## Track B: Multi-Lab CML Development

### B1. Dynamic ExaBGP multi-peer support

**File**: `exabgp/startup.sh`

Accept an `EXABGP_PEERS` env var (comma-separated `ip:as:description` entries) and generate one neighbor block per entry. Keep the `PEER_1`/`PEER_2` variables as a fallback.
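
The parsing and fallback logic can be prototyped in Python before it is ported to `startup.sh`. This is a sketch under assumptions: the fallback reads `EXABGP_PEER_1`, `EXABGP_PEER_2`, and `EXABGP_PEER_AS` as named in the `.env` example, the rendered neighbor block contains only a handful of common ExaBGP directives, and the `ip:as:description` format limits it to IPv4 peers because IPv6 addresses contain colons.

```python
def parse_peers(environ):
    """Return peer dicts from EXABGP_PEERS, or from the legacy variables."""
    raw = environ.get("EXABGP_PEERS", "").strip()
    peers = []
    if raw:
        for entry in raw.split(","):
            # maxsplit=2 lets the description itself contain colons.
            ip, asn, description = entry.strip().split(":", 2)
            peers.append({"ip": ip, "as": int(asn), "description": description})
        return peers

    # Fallback: the two fixed peer variables sharing one peer AS.
    peer_as = environ.get("EXABGP_PEER_AS")
    for key in ("EXABGP_PEER_1", "EXABGP_PEER_2"):
        ip = environ.get(key)
        if ip and peer_as:
            peers.append({"ip": ip, "as": int(peer_as), "description": key})
    return peers


def render_neighbors(peers, local_ip, local_as):
    """Render one ExaBGP neighbor block per peer."""
    blocks = []
    for peer in peers:
        lines = [
            "neighbor {} {{".format(peer["ip"]),
            '    description "{}";'.format(peer["description"]),
            "    router-id {};".format(local_ip),
            "    local-address {};".format(local_ip),
            "    local-as {};".format(local_as),
            "    peer-as {};".format(peer["as"]),
            "}",
        ]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + ("\n" if blocks else "")
```

Address families, capabilities, and any API process sections that the existing `startup.sh` emits would need to be added to each block.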

### B2. CML API client module

**File**: `cml/cml_client.py` (new)

Python module using the `virl2_client` SDK:

- Connect to CML server (creds from `inventory.yaml`)
- Upload node/image definitions
- Import/export topology YAML
- Start/stop/destroy labs
- Get node status

### B3. Topology template system

**File**: `cml/templates/xrd_rr.j2` (new)

Jinja2 templates for XRd startup config. Parameterize: hostname, loopback, link IPs, IS-IS NET, BGP AS, neighbor IPs, BMP target.
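
Several of these parameters can be computed from the inventory instead of being typed per router. The sketch below derives an IS-IS NET from the loopback using the common convention of zero-padding each octet to three digits and regrouping into four-digit blocks. The area `49.0001`, the function names, and the context key names are assumptions for illustration; link IPs and neighbor IPs are left out because the inventory example does not carry them.

```python
import ipaddress


def isis_net_from_loopback(loopback, area="49.0001"):
    """Derive an IS-IS NET from an IPv4 loopback address.

    Each octet is zero-padded to three digits, and the resulting twelve
    digits are regrouped into three blocks of four to form the system ID.
    """
    addr = ipaddress.IPv4Address(loopback)  # raises ValueError on bad input
    digits = "".join(f"{octet:03d}" for octet in addr.packed)
    system_id = ".".join(digits[i:i + 4] for i in range(0, 12, 4))
    return f"{area}.{system_id}.00"


def template_context(name, attrs, bgp_as, bmp_host, bmp_port):
    """Collect the values a startup-config template would need for one router."""
    return {
        "hostname": name,
        "loopback": attrs["loopback"],
        "isis_net": isis_net_from_loopback(attrs["loopback"]),
        "bgp_as": bgp_as,
        "bmp_host": bmp_host,
        "bmp_port": bmp_port,
    }


# 10.10.255.1 -> 010 010 255 001 -> 0100.1025.5001
print(isis_net_from_loopback("10.10.255.1"))  # 49.0001.0100.1025.5001.00
```

The resulting dict can be passed straight to a Jinja2 template's `render(**context)` call.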

### B4. CLI deployment tool

**File**: `cml/deploy.py` (new)

```bash
python3 cml/deploy.py --env cml-lab1 status
python3 cml/deploy.py --env cml-lab1 upload-images
python3 cml/deploy.py --env cml-lab2 create
python3 cml/deploy.py --env cml-lab2 start
python3 cml/deploy.py --env cml-lab2 destroy
```
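
The command shape above maps naturally onto `argparse` with a required `--env` option and a positional action. The skeleton below is a sketch only: the dispatch is a stub, and the function names are illustrative rather than taken from an existing `deploy.py`.

```python
import argparse

# The five actions shown in the usage examples above.
ACTIONS = ("status", "upload-images", "create", "start", "destroy")


def build_parser():
    """Build the command-line parser for the deployment tool."""
    parser = argparse.ArgumentParser(
        prog="deploy.py", description="Manage CML lab environments"
    )
    parser.add_argument("--env", required=True, help="environment name from inventory.yaml")
    parser.add_argument("action", choices=ACTIONS, help="operation to perform")
    return parser


def main(argv=None):
    """Parse arguments and report the requested action (dispatch is a stub)."""
    args = build_parser().parse_args(argv)
    # A real implementation would look up args.env in the inventory and
    # call the matching function in cml/cml_client.py.
    print(f"{args.action} requested for environment {args.env}")
    return 0
```

In the actual script, `main()` would be invoked from the usual `if __name__ == "__main__":` guard, and a check that the selected environment has `type: cml` would reject actions such as `destroy` against `production`.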

### B5. Update build scripts with API push

`cml/build-cml-image.sh` and `cml/build-xrd-image.sh` gain a `--push <env-name>` flag that uploads the built image to the named environment through the CML API.

---

## Track C: Production ISP Deployment

### C1. Multi-vendor NETCONF support

Current scripts assume IOS-XR NETCONF only. For Juniper RRs:

- `config_loader.py` provides a `vendor` field per router
- NETCONF scripts branch on vendor for dialect differences (with ncclient, `device_params={'name': 'iosxr'}` vs `device_params={'name': 'junos'}`)
- Route diversity and BGP-LS config scripts get Junos templates alongside the IOS-XR ones
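
The vendor branch can be isolated in one helper that builds the connection arguments, so the individual scripts never hardcode a dialect. The sketch assumes the scripts use ncclient and that credentials arrive as a dict with `user`, `password`, and `port` keys; the helper name and the `verify_host_key` parameter are hypothetical.

```python
# Maps the inventory `vendor` field to ncclient's device handler name.
DEVICE_PARAMS = {
    "iosxr": {"name": "iosxr"},
    "junos": {"name": "junos"},
}


def netconf_connect_kwargs(router, credentials, verify_host_key=True):
    """Build keyword arguments for ncclient's manager.connect().

    `router` is one inventory entry (needs `mgmt` and `vendor`).
    """
    vendor = router.get("vendor")
    if vendor not in DEVICE_PARAMS:
        raise ValueError(f"unsupported vendor {vendor!r}; expected one of {sorted(DEVICE_PARAMS)}")
    return {
        "host": router["mgmt"],
        "port": credentials.get("port", 830),
        "username": credentials["user"],
        "password": credentials["password"],
        # Disabling host key checks is a lab convenience, not a production setting.
        "hostkey_verify": verify_host_key,
        "device_params": DEVICE_PARAMS[vendor],
    }
```

A script would then call `manager.connect(**netconf_connect_kwargs(router, creds))` after `from ncclient import manager`, and select the IOS-XR or Junos config template using the same `vendor` value.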

### C2. Multi-vendor gNMI paths

Telegraf gNMI subscriptions currently use OpenConfig paths, which should work for both IOS-XR and Junos. Before relying on that:

- Verify Juniper gNMI support on the target hardware
- Add vendor-specific path overrides in `inventory.yaml` if needed
- Telegraf can subscribe to multiple targets with different configs via separate `[[inputs.gnmi]]` blocks

### C3. BMP considerations for production

- The BMP collector (port 5000) accepts connections from any router, so no collector changes are needed
- Production routers need BMP config pushed (manually or via NETCONF automation)
- Consider separate BMP server IDs per environment for dashboard filtering
- Juniper BMP config differs from IOS-XR, so add Junos BMP config templates

### C4. Dashboard multi-environment awareness

- Add a Grafana template variable for environment filtering (by router name prefix or a tag)
- Consider a "Network Overview" dashboard that shows all environments side by side
- Existing dashboards work as-is: router dropdowns will show all BMP-reporting routers

### C5. Security hardening for production

- Move credentials out of `inventory.yaml` into environment variables or a secrets manager
- Authelia config: stronger passwords, TOTP enforcement, session timeouts
- PostgreSQL: restrict access, enable SSL
- Kafka: consider authentication if exposed beyond localhost
- BMP port: firewall it to accept connections only from known router management IPs

### C6. Scalability considerations

- Monitor PostgreSQL disk usage and query performance with production-scale RIBs
- TimescaleDB compression policies for historical data (`ip_rib_log`, `ls_*_log`)
- Kafka topic partitioning if message throughput is high
- Consider read replicas or materialized views for heavy Grafana queries

---

## Track D: Packaging & Distribution

### D1. Configuration templates

- `inventory.yaml.example`: documented example with placeholder values
- `.env.example`: all environment variables with descriptions

### D2. Bootstrap script

`setup.sh` that:

- Creates required directories (`$OBMP_DATA_ROOT/authelia`, etc.)
- Copies example configs if originals don't exist
- Validates `inventory.yaml` syntax
- Generates Telegraf config from inventory
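
The validation step could go beyond YAML syntax and check the handful of rules implied by the design decisions in A1. The sketch below is one possible checker that `setup.sh` might call; the function name and the specific rules enforced (allowed `type` and `vendor` values, `cml_server` required for CML environments, `mgmt` required per router) are this sketch's interpretation of the inventory layout, not a defined schema.

```python
VALID_TYPES = {"cml", "production"}
VALID_VENDORS = {"iosxr", "junos"}


def validate_inventory(inventory):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    environments = inventory.get("environments") or {}
    if not environments:
        errors.append("no environments defined")

    for env_name, env in environments.items():
        env_type = env.get("type")
        if env_type not in VALID_TYPES:
            errors.append(f"{env_name}: type must be cml or production, got {env_type!r}")
        # Only CML environments carry a cml_server for API automation.
        if env_type == "cml" and not env.get("cml_server"):
            errors.append(f"{env_name}: cml environments need cml_server")

        for router, attrs in (env.get("routers") or {}).items():
            vendor = attrs.get("vendor")
            if vendor not in VALID_VENDORS:
                errors.append(f"{env_name}/{router}: vendor must be iosxr or junos, got {vendor!r}")
            if "mgmt" not in attrs:
                errors.append(f"{env_name}/{router}: missing mgmt address")
    return errors
```

Returning every problem at once, rather than raising on the first, lets the bootstrap script print a complete report before exiting non-zero.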

### D3. Published Docker images

Push custom images to a registry (Docker Hub or GHCR):

- `obmp-exabgp`
- `obmp-exabgp-ui`
- `obmp-traffic-gen`
- `obmp-traffic-gen-ui`
- `obmp-portal`

Replace `build:` with `image:` in `docker-compose.yml` (keep build as an override).

### D4. Documentation

- `docs/quickstart.md`: 5-minute setup guide
- `docs/adding-a-lab.md`: how to add a CML lab environment
- `docs/production-deployment.md`: production hardening checklist
- `docs/architecture.md`: system diagram, data flow, port map

---

## Implementation Order

| Priority | Step | Track | Description |
|----------|------|-------|-------------|
| 1 | A1 | Foundation | Create `inventory.yaml` |
| 2 | A2 | Foundation | Create `config_loader.py` |
| 3 | A3 | Foundation | Refactor hardcoded Python scripts |
| 4 | A4 | Foundation | Parameterize `.env` + docker-compose |
| 5 | A5-A6 | Foundation | Telegraf + InfluxDB datasource fixes |
| 6 | B1 | CML Dev | Dynamic ExaBGP multi-peer |
| 7 | B2-B4 | CML Dev | CML API client + deploy CLI |
| 8 | C1 | Production | Multi-vendor NETCONF (Junos support) |
| 9 | C3 | Production | Junos BMP config templates |
| 10 | C5 | Production | Security hardening |
| 11 | D1-D2 | Packaging | Config templates + bootstrap script |
| 12 | D3 | Packaging | Publish Docker images to registry |
| 13 | D4 | Packaging | Documentation |

Steps 1-5 (Track A) unblock everything else. Steps 6-7 and 8-10 can proceed in parallel once the foundation is in place.

---

## Verification

1. **Config centralization**: Change a router IP in `inventory.yaml`, verify all scripts pick it up
2. **ExaBGP multi-peer**: Set 3+ peers, restart, verify BGP sessions establish
3. **CML API**: `deploy.py --env cml-lab1 status` connects and lists nodes
4. **BMP multi-source**: Router from lab 2 sends BMP, appears in `SELECT * FROM routers` and Grafana
5. **Junos support**: NETCONF script connects to a Juniper router, pushes config
6. **Production dry-run**: Point a test router from the ISP network at the collector, verify end-to-end
7. **Clean deploy**: Clone repo on a fresh host, run `setup.sh`, `docker compose up`, confirm stack starts

---

## Risks

- **Router name collisions**: Enforce unique hostnames across all environments
- **Address space overlap**: Each environment needs distinct management subnets
- **Juniper BMP differences**: The Junos BMP implementation may differ in supported tables/TLVs, so test early
- **Production scale**: 500K-route labs are already slow; production full tables will stress PostgreSQL more
- **Credentials in inventory**: `inventory.yaml` must be gitignored; consider an env var fallback for CI/CD
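
The first two risks can be caught mechanically at inventory load time. The sketch below reports hostnames and management addresses that appear in more than one place; it compares exact `mgmt` addresses only, since the inventory example lists individual IPs rather than subnets. The function name and the shape of the result are choices made for this sketch.

```python
from collections import defaultdict


def find_collisions(inventory):
    """Report hostnames and mgmt addresses used by more than one router."""
    names = defaultdict(list)      # hostname -> environments that define it
    addresses = defaultdict(list)  # mgmt IP  -> "env/router" owners

    for env_name, env in (inventory.get("environments") or {}).items():
        for router, attrs in (env.get("routers") or {}).items():
            names[router].append(env_name)
            mgmt = attrs.get("mgmt")
            if mgmt:
                addresses[mgmt].append(f"{env_name}/{router}")

    return {
        "hostnames": {n: envs for n, envs in names.items() if len(envs) > 1},
        "mgmt_addresses": {a: owners for a, owners in addresses.items() if len(owners) > 1},
    }
```

An empty dict under both keys means the inventory is free of these two collision types; a true subnet-overlap check would need each environment to declare its management prefix.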