obmp-docker/DOCS.md
sam 233dadbb41 Add ExaBGP route injector, Grafana dashboards, and full documentation
- Add exabgp/ container: ExaBGP 5.x + Flask REST API for on-demand BGP
  route injection into CML IOS-XR lab (AS 65020 via eBGP from AS 65100)
- Add 6 injection scenarios: internet_sample, churn, blackhole, anycast,
  full_table, lab_prefixes
- Add inject.py CLI wrapper for the ExaBGP API
- Add iosxr_bgp_config.md with IOS-XR neighbor config and NETCONF script
- Add obmp-grafana/ dashboards and provisioning (17 dashboards)
- Update docker-compose.yml: add exabgp service, fix Kafka external
  listener IP, extend log retention from 90min to 720min
- Add DOCS.md: full project documentation including architecture, setup,
  user guide, sanity checks, troubleshooting, and command reference
- Update .gitignore: exclude .env and .claude/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 14:46:37 -07:00

# OpenBMP + ExaBGP Route Injector — Full Documentation
## Table of Contents
1. [What Is This Project?](#1-what-is-this-project)
2. [Architecture](#2-architecture)
3. [Prerequisites](#3-prerequisites)
4. [Initial Setup (First Time)](#4-initial-setup-first-time)
5. [IOS-XR Router Configuration](#5-ios-xr-router-configuration)
6. [Starting and Stopping](#6-starting-and-stopping)
7. [Route Injection User Guide](#7-route-injection-user-guide)
8. [Grafana Dashboards](#8-grafana-dashboards)
9. [Sanity Checks](#9-sanity-checks)
10. [Relevant Commands Reference](#10-relevant-commands-reference)
11. [Troubleshooting](#11-troubleshooting)
12. [Data Retention](#12-data-retention)
13. [Environment Variables Reference](#13-environment-variables-reference)
---
## 1. What Is This Project?
This is a **BGP Monitoring Protocol (BMP) lab stack** built on OpenBMP and deployed via Docker Compose. It collects, stores, and visualizes BGP routing data from a Cisco IOS-XR lab network (running in Cisco Modeling Labs / CML).
**What it does:**
- Receives BMP (BGP Monitoring Protocol, RFC 7854) telemetry from routers on TCP port 5000
- Streams BMP data through Kafka into a TimescaleDB/PostgreSQL database
- Provides 17 Grafana dashboards for real-time and historical BGP analysis
- Includes an **ExaBGP route injector** that peers with the two CORE routers and injects synthetic BGP routes, enabling testing of BGP policy, route propagation, and Grafana dashboards without needing internet connectivity
**The lab network:**
- AS 65020 — 9 Cisco IOS-XR routers in CML (iBGP mesh via route reflectors)
- AS 65100 — ExaBGP container (eBGP peer to both CORE routers)
- CORE-01: `10.100.0.100` (CML-R9K-CORE-01)
- CORE-02: `10.100.0.200` (CML-R9K-CORE-02)
- Host IP: `10.40.40.202` (ExaBGP binds here; reachable from CML management network)
---
## 2. Architecture
```
IOS-XR Routers (9x, AS 65020)
BMP telemetry on TCP 5000
|
v
obmp-collector (openbmp/collector:2.2.3)
|
v
obmp-kafka (confluentinc/cp-kafka:7.1.1)
+ obmp-zookeeper (confluentinc/cp-zookeeper:7.1.1)
|
v
obmp-psql-app (openbmp/psql-app:2.2.2)
Java consumer — writes parsed BGP data to PostgreSQL
|
v
obmp-psql (openbmp/postgres:2.2.1)
PostgreSQL 14 + TimescaleDB
|
+---------> obmp-grafana (grafana/grafana:9.1.7) :3000
| 17 dashboards, PostgreSQL datasource
+---------> obmp-whois (openbmp/whois:2.2.0) :4300
WHOIS query server backed by the DB
ExaBGP (obmp-exabgp, built locally)
python:3.11-slim + exabgp 5.x + Flask API
Peers eBGP to CORE-01 and CORE-02 (AS 65100 -> AS 65020)
HTTP API on :5050 — inject/withdraw routes on demand
Routes propagate via iBGP mesh to all 9 routers -> BMP -> DB -> Grafana
```
### Container Summary
| Container | Image | Port(s) | Role |
|-----------|-------|---------|------|
| obmp-zookeeper | confluentinc/cp-zookeeper:7.1.1 | 2181 (internal) | Kafka coordination |
| obmp-kafka | confluentinc/cp-kafka:7.1.1 | 9092 | Message broker |
| obmp-collector | openbmp/collector:2.2.3 | 5000 | BMP receiver |
| obmp-psql-app | openbmp/psql-app:2.2.2 | 9005 | Kafka→PostgreSQL consumer |
| obmp-psql | openbmp/postgres:2.2.1 | 5432 | TimescaleDB storage |
| obmp-grafana | grafana/grafana:9.1.7 | 3000 | Visualization |
| obmp-whois | openbmp/whois:2.2.0 | 4300 | WHOIS query server |
| obmp-exabgp | local build | 5050 (host net) | BGP route injector |
---
## 3. Prerequisites
- Docker Engine (20.10+) and Docker Compose v2
- Host IP `10.40.40.202` reachable from the CML management network
- CML routers with BMP configured pointing to `10.40.40.202:5000`
- CML CORE routers configured with ExaBGP as eBGP neighbor (see Section 5)
- `OBMP_DATA_ROOT` directory created (default: `/var/openbmp`)
---
## 4. Initial Setup (First Time)
### 4.1 Clone the repository
```bash
git clone <this-repo-url>
cd obmp-docker
```
### 4.2 Create persistent data directories
```bash
export OBMP_DATA_ROOT=/var/openbmp
sudo mkdir -p "$OBMP_DATA_ROOT"
# The subdirectories also need sudo: the root directory is owned by root until the chmod below
sudo mkdir -p "${OBMP_DATA_ROOT}/config"
sudo mkdir -p "${OBMP_DATA_ROOT}/kafka-data"
sudo mkdir -p "${OBMP_DATA_ROOT}/zk-data"
sudo mkdir -p "${OBMP_DATA_ROOT}/zk-log"
sudo mkdir -p "${OBMP_DATA_ROOT}/postgres/data"
sudo mkdir -p "${OBMP_DATA_ROOT}/postgres/ts"
sudo mkdir -p "${OBMP_DATA_ROOT}/grafana"
sudo mkdir -p "${OBMP_DATA_ROOT}/grafana/dashboards"
sudo chmod -R 777 "$OBMP_DATA_ROOT"
```
### 4.3 Initialise the database (first run only)
Create the init trigger file — this causes psql-app to create all tables on startup:
```bash
touch ${OBMP_DATA_ROOT}/config/init_db
```
> **Warning:** Do not create this file on subsequent runs unless you want to wipe and recreate the entire database.
### 4.4 Copy Grafana provisioning files
```bash
cp -r obmp-grafana/provisioning ${OBMP_DATA_ROOT}/grafana/
cp -r obmp-grafana/dashboards ${OBMP_DATA_ROOT}/grafana/
```
### 4.5 Start the stack
```bash
OBMP_DATA_ROOT=/var/openbmp docker compose -p obmp up -d
```
Wait about 2 minutes for all services to initialise, especially PostgreSQL and psql-app, which run schema migrations.
### 4.6 Verify everything is running
```bash
docker compose -p obmp ps
docker compose -p obmp logs --tail=20 psql-app
```
---
## 5. IOS-XR Router Configuration
The ExaBGP container peers eBGP with both CORE routers. Each CORE router must be configured with:
### 5.1 Route policies (apply once per router)
```
route-policy EXABGP_IN
pass
end-policy
route-policy EXABGP_OUT
drop
end-policy
```
### 5.2 BGP neighbor block
```
router bgp 65020
neighbor 10.40.40.202
remote-as 65100
description ExaBGP-Route-Injector
ebgp-multihop 5
update-source MgmtEth0/RP0/CPU0/0
!
address-family ipv4 unicast
route-policy EXABGP_IN in
route-policy EXABGP_OUT out
next-hop-self
!
!
!
```
### 5.3 Static route for next-hop resolution
IOS-XR BGP does not use the default route (0.0.0.0/0) to resolve BGP next-hops. A more-specific static route for the ExaBGP host subnet is required in the default VRF:
```
router static
address-family ipv4 unicast
10.40.40.0/24 10.100.0.254
!
!
```
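The resolution rule described above can be modelled in a few lines of Python. This is only a sketch of the behaviour as this section describes it, not IOS-XR's actual implementation; the function name `resolves_next_hop` and the `allow_default` flag are hypothetical and exist purely for illustration.

```python
import ipaddress

def resolves_next_hop(next_hop, rib_prefixes, allow_default=False):
    """Return the longest matching prefix for next_hop, or None.

    Models the behaviour described above: a 0.0.0.0/0 match is ignored
    unless allow_default is set (hypothetical flag, for illustration only).
    """
    addr = ipaddress.ip_address(next_hop)
    best = None
    for prefix in rib_prefixes:
        net = ipaddress.ip_network(prefix)
        if net.prefixlen == 0 and not allow_default:
            continue  # the default route is not eligible for next-hop resolution
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best

# Only a default route: the ExaBGP next-hop stays unresolved.
print(resolves_next_hop("10.40.40.202", ["0.0.0.0/0"]))
# With the static route from this section, the next-hop resolves.
print(resolves_next_hop("10.40.40.202", ["0.0.0.0/0", "10.40.40.0/24"]))
```

The first call prints `None` and the second prints `10.40.40.0/24`, which mirrors why the static route is needed before injected routes can become bestpaths.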
### 5.4 Config notes
| Knob | Why |
|------|-----|
| `remote-as 65100` | ExaBGP presents as AS 65100 (eBGP to your AS 65020 mesh) |
| `ebgp-multihop 5` | Host and router are on different subnets |
| `update-source MgmtEth0/RP0/CPU0/0` | ExaBGP is reachable via the management interface |
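| `next-hop-self` | Intended to replace ExaBGP's next-hop (10.40.40.202) with the CORE router's address when routes are reflected into iBGP, so that all routers can resolve the next-hop. On IOS-XR this knob rewrites the next-hop only in updates sent to the neighbor it is configured under, so it also needs to be present on the iBGP-facing neighbors to have this effect |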
| `EXABGP_OUT` drops | Prevents the lab from advertising its own prefixes back to ExaBGP |
| Static route | Required: IOS-XR BGP will not install injected routes as bestpaths without a specific route to the next-hop |
### 5.5 NETCONF alternative
See `exabgp/iosxr_bgp_config.md` for a Python/ncclient script that pushes all of the above config programmatically.
Credentials: `username=webui`, `password=cisco`, port 830.
---
## 6. Starting and Stopping
### Start all services
```bash
OBMP_DATA_ROOT=/var/openbmp docker compose -p obmp up -d
```
### Stop all services (preserve data)
```bash
docker compose -p obmp down
```
### Stop and remove all data (full reset)
```bash
docker compose -p obmp down -v
sudo rm -rf /var/openbmp
```
### Rebuild the ExaBGP container (after code changes)
```bash
docker compose -p obmp build exabgp
docker compose -p obmp up -d exabgp
```
### Restart a single service
```bash
docker compose -p obmp restart <service>
# e.g.:
docker compose -p obmp restart exabgp
docker compose -p obmp restart psql-app
```
---
## 7. Route Injection User Guide
The ExaBGP container exposes a Flask REST API on port 5050 (host network). The `inject.py` CLI wraps this API.
### 7.1 Setup
```bash
cd exabgp
pip install requests # only needed if running inject.py from the host
```
### 7.2 Check status
```bash
python3 inject.py status
```
Output shows API health, active route count, and peer states:
```json
{
"status": "ok",
"active_routes": 77,
"peers": {
"10.100.0.100": {"state": "up", "updated": "2026-03-05T10:00:00Z"},
"10.100.0.200": {"state": "up", "updated": "2026-03-05T10:00:00Z"}
}
}
```
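For scripted checks, the status payload can be inspected programmatically. The following is a minimal sketch that assumes the JSON shape shown above; the helper name `peers_not_up` is hypothetical and not part of `inject.py`, and the `"down"` state value is an assumption used only to show a failing case.

```python
import json

def peers_not_up(status):
    """Return the addresses of peers whose state is anything other than 'up'."""
    peers = status.get("peers", {})
    return sorted(addr for addr, info in peers.items() if info.get("state") != "up")

# Sample payload copied from the output above.
sample = json.loads("""
{
  "status": "ok",
  "active_routes": 77,
  "peers": {
    "10.100.0.100": {"state": "up", "updated": "2026-03-05T10:00:00Z"},
    "10.100.0.200": {"state": "up", "updated": "2026-03-05T10:00:00Z"}
  }
}
""")

print(peers_not_up(sample))  # [] because both peers are up

# Hypothetical failure case: the value "down" is an assumption for illustration.
sample["peers"]["10.100.0.200"]["state"] = "down"
print(peers_not_up(sample))  # ['10.100.0.200']
```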
### 7.3 List available scenarios
```bash
python3 inject.py scenarios
```
| Scenario | Routes | Description |
|----------|--------|-------------|
| `internet_sample` | ~94 | Partial internet table — real public prefixes, realistic AS paths (Cloudflare, Google, AWS, Azure, etc.) |
| `churn` | 30 | RFC documentation prefixes for announce/withdraw churn testing |
| `blackhole` | 5 | /32 prefixes with RTBH community (65100:666 + 65535:666) |
| `anycast` | 3 | Same prefixes with varying AS paths and MEDs (best-path testing) |
| `full_table` | 500+ | Large partial internet table with synthetic /24s |
| `lab_prefixes` | 8 | Enterprise/SP-style routes with communities and local-pref |
### 7.4 Load a scenario
```bash
python3 inject.py scenario internet_sample
```
Routes propagate: ExaBGP → CORE-01/CORE-02 (eBGP) → all 9 routers (iBGP) → BMP → Kafka → PostgreSQL → Grafana.
### 7.5 Withdraw a scenario
```bash
python3 inject.py withdraw-scenario internet_sample
```
### 7.6 Announce individual prefixes
```bash
python3 inject.py announce 10.0.0.0/8 \
--as-path 65100 3356 15169 \
--community 65100:100 \
--med 100
```
### 7.7 Withdraw individual prefixes
```bash
python3 inject.py withdraw 10.0.0.0/8
```
### 7.8 Withdraw everything
```bash
python3 inject.py withdraw-all
```
### 7.9 Generate route churn (populate history tables)
The `churn` command cycles the churn scenario repeatedly, generating `ip_rib_log` and `stats_chg_*` entries that power Grafana's history dashboards.
```bash
# 5 cycles, 30 seconds apart
python3 inject.py churn --count 5 --interval 30
# Run indefinitely until Ctrl+C
python3 inject.py churn
```
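Conceptually, a churn run is an announce/withdraw loop with a pause between steps. The sketch below shows that idea only; the real loop inside `inject.py` is not reproduced in this document, so the function name `run_churn`, the ordering of steps, and the use of `interval` between each step are all assumptions. The announce and withdraw actions are passed in as callables so the loop can be exercised without a running API.

```python
import time

def run_churn(announce, withdraw, count=None, interval=30.0, sleep=time.sleep):
    """Alternate announce and withdraw calls.

    count=None keeps cycling until interrupted with Ctrl+C.
    Returns the number of completed cycles.
    """
    cycles = 0
    try:
        while count is None or cycles < count:
            announce()
            sleep(interval)
            withdraw()
            sleep(interval)
            cycles += 1
    except KeyboardInterrupt:
        pass  # Ctrl+C ends an open-ended run cleanly
    return cycles

# Dry run with stand-in callables and no real waiting.
events = []
done = run_churn(
    announce=lambda: events.append("announce"),
    withdraw=lambda: events.append("withdraw"),
    count=2,
    interval=0,
    sleep=lambda seconds: None,
)
print(done, events)
```

Each completed cycle produces one announce and one withdraw per prefix, which is what creates change-log rows for the history dashboards.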
### 7.10 REST API directly (curl)
```bash
BASE=http://localhost:5050
# Health
curl $BASE/healthz
# List scenarios
curl $BASE/scenarios
# Load scenario
curl -X POST $BASE/scenario/internet_sample
# Announce custom prefix
curl -X POST $BASE/announce \
-H 'Content-Type: application/json' \
-d '{"prefixes":["10.0.0.0/8"],"as_path":[65100,3356,15169],"communities":["65100:100"]}'
# Withdraw all
curl -X POST $BASE/withdraw/all
# Peer state
curl $BASE/peers
```
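The same announce call can be made from Python using only the standard library. This is a hedged sketch: the endpoint and JSON field names are taken from the curl example above, while the helper names `build_announce` and `send` are hypothetical, and the assumption that the API replies with JSON is based only on the status output shown earlier.

```python
import json
import urllib.request

BASE = "http://localhost:5050"

def build_announce(prefixes, as_path, communities=(), base=BASE):
    """Build (but do not send) the POST /announce request from the curl example."""
    body = {
        "prefixes": list(prefixes),
        "as_path": list(as_path),
        "communities": list(communities),
    }
    return urllib.request.Request(
        f"{base}/announce",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=5):
    """Send a prepared request and decode the reply (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_announce(["10.0.0.0/8"], [65100, 3356, 15169], ["65100:100"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# send(req)  # uncomment when the ExaBGP API is reachable
```

Separating request construction from sending keeps the payload easy to inspect before anything is injected into the lab.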
### 7.11 Adding custom scenarios
Edit `exabgp/scenarios/__init__.py`. Add an entry to `SCENARIOS` following the existing pattern:
```python
SCENARIOS['my_scenario'] = {
'description': 'My custom routes',
'routes': [
_r('192.0.2.0/24', [65100, 65200], communities=['65100:100']),
],
}
```
The `scenarios/` directory is volume-mounted into the container, so edits do not require a rebuild. However, the Python module is imported once at container start, so **restart the container** after editing:
```bash
docker compose -p obmp restart exabgp
```
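Because a typo in a scenario only surfaces after the restart, it can help to check new prefixes beforehand. The helper below is a hypothetical pre-check using Python's standard `ipaddress` module; it is not part of the repository and does not use the project's `_r` helper.

```python
import ipaddress

def invalid_prefixes(prefixes):
    """Return (prefix, reason) pairs for entries that are not valid CIDR networks."""
    problems = []
    for prefix in prefixes:
        try:
            # strict=True rejects prefixes with host bits set, e.g. 192.0.2.1/24
            ipaddress.ip_network(prefix, strict=True)
        except ValueError as exc:
            problems.append((prefix, str(exc)))
    return problems

# One valid prefix, one with host bits set, one with an out-of-range octet.
for prefix, reason in invalid_prefixes(["192.0.2.0/24", "192.0.2.1/24", "300.0.0.0/8"]):
    print(f"{prefix}: {reason}")
```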
---
## 8. Grafana Dashboards
Access: `http://10.40.40.202:3000`
Default credentials: `admin` / `openbmp` (anonymous access also enabled)
### Dashboard Categories
| Category | Dashboard | Description |
|----------|-----------|-------------|
| General | OBMP Home | Overview / landing page |
| Base | Inventory | Router and peer inventory |
| Base | Looking Glass | Real-time RIB lookup by prefix |
| Base | ASN View | ASN-level routing view |
| History | Prefix History | Route change history for a prefix |
| History | Prefix History by ASN | Filtered by origin AS |
| History | Prefix History by Community | Filtered by BGP community |
| Tops | Top Prefixes | Most-updated prefixes |
| Tops | Top L3VPN Prefixes | L3VPN equivalent |
| Link State | LS Nodes | IS-IS link-state node database |
| Link State | LS Links | IS-IS link-state link database |
| Link State | LS Topology | Network topology map |
| Link State | LS Prefixes | Link-state prefix database |
| Link State | LS History | Link-state change history |
| L3VPN | L3VPN Looking Glass | VPN RIB lookup |
| L3VPN | L3VPN Prefix History | VPN route change history |
| L3VPN | L3VPN RIB Browser | Full VPN RIB browser |
> History dashboards require `ip_rib_log` and `stats_chg_*` table data. Run `inject.py churn` to populate these.
---
## 9. Sanity Checks
### 9.1 All containers running
```bash
docker compose -p obmp ps
```
All containers should show `running`. If any are restarting, check logs:
```bash
docker compose -p obmp logs --tail=50 <service>
```
### 9.2 ExaBGP peers up
```bash
python3 exabgp/inject.py status
```
Both `10.100.0.100` and `10.100.0.200` should show `"state": "up"`.
Or check from the router side:
```
show bgp neighbors 10.40.40.202
show bgp summary | inc 10.40.40.202
```
### 9.3 Routes accepted by CORE routers
After loading `internet_sample`:
```
! On CORE-01 or CORE-02:
show bgp summary
! Expect: 77 accepted prefixes, 77 are bestpaths from 10.40.40.202
show bgp 8.8.8.0/24
! Expect: best path via 10.40.40.202 (eBGP), also iBGP copies from other routers
```
### 9.4 Routes in OpenBMP database
```bash
docker exec -it obmp-psql psql -U openbmp -c "
SELECT count(DISTINCT prefix) AS unique_prefixes,
count(DISTINCT peer_hash_id) AS peers_reporting
FROM ip_rib
WHERE isIPv4 = true AND isWithdrawn = false;
"
```
Expect roughly 129 in `unique_prefixes` and 56 in `peers_reporting` (9 routers × ~6 peers each) after loading `internet_sample`.
### 9.5 Kafka is healthy
```bash
docker exec -it obmp-kafka kafka-topics --bootstrap-server localhost:29092 --list
```
Should show topics like `openbmp.parsed.unicast_prefix`, `openbmp.parsed.peer`, etc.
### 9.6 Grafana datasource
Open `http://10.40.40.202:3000` → Configuration → Data Sources → OpenBMP → Test.
Should return "Database Connection OK".
### 9.7 BMP collector receiving data
```bash
docker compose -p obmp logs --tail=30 collector
```
Should show connections from router management IPs.
### 9.8 psql-app consumer is caught up
```bash
docker compose -p obmp logs --tail=30 psql-app
```
Should show periodic cron job outputs (RPKI sync, IRR sync, global_ip_rib updates).
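
Several of the checks above come down to whether a published port accepts TCP connections. The sketch below automates that first-pass check with the standard `socket` module. The port numbers are taken from the Container Summary table; the names `SERVICE_PORTS`, `port_open`, and `report` are hypothetical, and an open port only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Ports taken from the Container Summary table (ZooKeeper is internal only).
SERVICE_PORTS = {
    "collector (BMP)": 5000,
    "kafka": 9092,
    "psql-app": 9005,
    "postgres": 5432,
    "grafana": 3000,
    "whois": 4300,
    "exabgp api": 5050,
}

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def report(host, ports=SERVICE_PORTS):
    """Print one line per service showing whether its port accepts connections."""
    for name, port in ports.items():
        state = "open" if port_open(host, port) else "CLOSED"
        print(f"{name:16} {port:5} {state}")

# report("10.40.40.202")  # run from a machine that can reach the Docker host
```
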
---
## 10. Relevant Commands Reference
### Docker Compose
```bash
# Start stack
OBMP_DATA_ROOT=/var/openbmp docker compose -p obmp up -d
# Stop stack
docker compose -p obmp down
# Show status
docker compose -p obmp ps
# Follow logs (all services)
docker compose -p obmp logs -f
# Follow logs (specific service)
docker compose -p obmp logs -f exabgp
docker compose -p obmp logs -f psql-app
docker compose -p obmp logs -f collector
# Rebuild and restart ExaBGP
docker compose -p obmp build exabgp && docker compose -p obmp up -d exabgp
# Restart a service
docker compose -p obmp restart psql-app
```
### Route Injection (from `exabgp/` directory)
```bash
# API health and peer states
python3 inject.py status
# List active routes
python3 inject.py routes
# List scenarios
python3 inject.py scenarios
# Load a scenario
python3 inject.py scenario internet_sample
python3 inject.py scenario churn
python3 inject.py scenario blackhole
python3 inject.py scenario full_table
python3 inject.py scenario lab_prefixes
# Withdraw a scenario
python3 inject.py withdraw-scenario internet_sample
# Withdraw all active routes
python3 inject.py withdraw-all
# Announce a specific prefix
python3 inject.py announce 10.0.0.0/8 --as-path 65100 3356 15169 --community 65100:100
# Withdraw a specific prefix
python3 inject.py withdraw 10.0.0.0/8
# Run churn (populate history tables)
python3 inject.py churn --count 5 --interval 30
```
### Database Queries
```sql
-- Connect to the database first (run from the host shell):
--   docker exec -it obmp-psql psql -U openbmp -d openbmp
-- Count unique prefixes in RIB
SELECT count(DISTINCT prefix) FROM ip_rib WHERE isIPv4=true AND isWithdrawn=false;
-- Show recent route changes
SELECT prefix, origin_as, iswithdrawn, timestamp
FROM ip_rib_log
ORDER BY timestamp DESC LIMIT 20;
-- Show peer summary
SELECT name, state, timestamp_last_updated
FROM bgp_peers
ORDER BY state, name;
-- Show routes from ExaBGP peer
SELECT prefix, origin_as, as_path
FROM ip_rib
WHERE peer_hash_id IN (
SELECT hash_id FROM bgp_peers WHERE peer_addr = '10.40.40.202'
)
AND isWithdrawn = false;
```
### IOS-XR Verification (on router CLI)
```
show bgp neighbors 10.40.40.202
show bgp neighbors 10.40.40.202 received routes
show bgp summary
show bgp 8.8.8.0/24
show bgp 1.1.1.0/24
show route 8.8.8.0/24
```
---
## 11. Troubleshooting
### ExaBGP container keeps restarting
Check logs:
```bash
docker compose -p obmp logs --tail=50 exabgp
```
Common causes and fixes:
| Symptom | Cause | Fix |
|---------|-------|-----|
| Exits after "welcome" banner | Missing or wrong env file path | `startup.sh` generates `/usr/local/etc/exabgp/exabgp.env`; verify this path exists in the container |
| Process `api` killed 5 times | Wrong Python path in conf | The conf must use `/usr/local/bin/python3`, which is the correct path for python:3.11-slim |
| `drop = true` in env | ExaBGP drops privileges to `nobody` and can't bind port 179 | `startup.sh` patches this to `drop = false`; check that its sed lines ran |
| `__pycache__ Permission denied` during build | Root-owned cache from a previous container run | `.dockerignore` should exclude `**/__pycache__`; confirm the file exists |
### BGP sessions not establishing
1. Verify host IP `10.40.40.202` is reachable from CML management network: `ping 10.40.40.202` from router
2. Check ExaBGP peer state: `python3 exabgp/inject.py status`
3. On router: `show bgp neighbors 10.40.40.202` — look for error codes
4. Common IOS-XR errors:
- `no-update-source-config` — add `update-source MgmtEth0/RP0/CPU0/0`
- `no-ipv6-address` — ensure only IPv4 unicast AF is configured (no IPv6)
- TCP refused — check port 179 is reachable (ExaBGP uses `network_mode: host`)
### Routes received but not bestpath
IOS-XR BGP requires a specific route to resolve the BGP next-hop (10.40.40.202). The default route (0.0.0.0/0) is insufficient.
```
router static
address-family ipv4 unicast
10.40.40.0/24 10.100.0.254
```
Verify: `show bgp 1.1.1.0/24` — should show `Status: s (active), bestpath`.
### Grafana shows no data
1. Check datasource: Configuration → Data Sources → OpenBMP → Test
2. Verify psql-app is writing: `docker compose -p obmp logs psql-app`
3. Check the database directly (see database queries above)
4. History dashboards need route churn — run `python3 inject.py churn`
### Kafka not starting
Zookeeper must be healthy first. Check:
```bash
docker compose -p obmp logs zookeeper
docker compose -p obmp restart kafka
```
### psql-app fails to start
Usually a PostgreSQL connection issue or schema mismatch. Check:
```bash
docker compose -p obmp logs psql-app
# If "relation does not exist" errors: re-trigger DB init
# WARNING: this wipes and recreates the entire database (see Section 4.3)
touch /var/openbmp/config/init_db
docker compose -p obmp restart psql-app
```
---
## 12. Data Retention
Configured in `docker-compose.yml` via `POSTGRES_DROP_*` environment variables:
| Table | Default Retention |
|-------|-------------------|
| peer_event_log | 1 year |
| stat_reports | 4 weeks |
| ip_rib_log | 4 weeks |
| alerts | 4 weeks |
| ls_nodes_log | 4 months |
| ls_links_log | 4 months |
| ls_prefixes_log | 4 months |
| stats_chg_byprefix | 4 weeks |
| stats_chg_byasn | 4 weeks |
| stats_chg_bypeer | 4 weeks |
| stats_ip_origins | 4 weeks |
| stats_peer_rib | 4 weeks |
| stats_peer_update_counts | 4 weeks |
Adjust in `docker-compose.yml` under the `psql-app` service environment block.
---
## 13. Environment Variables Reference
### ExaBGP container
| Variable | Default | Description |
|----------|---------|-------------|
| `EXABGP_LOCAL_IP` | `10.40.40.202` | Host IP ExaBGP binds to and uses as router-id |
| `EXABGP_LOCAL_AS` | `65100` | ExaBGP's AS number |
| `EXABGP_PEER_AS` | `65020` | AS of the IOS-XR lab |
| `EXABGP_PEER_1` | `10.100.0.100` | First CORE router to peer with |
| `EXABGP_PEER_2` | `10.100.0.200` | Second CORE router to peer with |
| `EXABGP_API_PORT` | `5050` | Flask API port |
### psql-app container (key variables)
| Variable | Default | Description |
|----------|---------|-------------|
| `MEM` | `3` | JVM heap in GB |
| `ENABLE_RPKI` | `1` | Enable RPKI sync from Cloudflare |
| `ENABLE_IRR` | `1` | Enable IRR sync |
| `ENABLE_DBIP` | `1` | Enable DB-IP geolocation import |
| `POSTGRES_REPORT_WINDOW` | `8 minute` | Aggregation window for summary tables |
### inject.py (CLI)
| Variable | Default | Description |
|----------|---------|-------------|
| `EXABGP_API` | `http://localhost:5050` | ExaBGP API base URL |
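
A client script can honour `EXABGP_API` with a fallback to the documented default. How `inject.py` reads the variable is not shown in this document, so the snippet below is only an illustrative sketch; the function name `api_base` is hypothetical, and trimming a trailing slash is an added convenience rather than documented behaviour.

```python
import os

DEFAULT_API = "http://localhost:5050"

def api_base(environ=os.environ):
    """Return the API base URL, preferring EXABGP_API and trimming a trailing slash."""
    return environ.get("EXABGP_API", DEFAULT_API).rstrip("/")

# Unset variable falls back to the default.
print(api_base({}))
# An override pointing at the Docker host from another machine.
print(api_base({"EXABGP_API": "http://10.40.40.202:5050/"}))
```

The two calls print `http://localhost:5050` and `http://10.40.40.202:5050` respectively.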