obmp-docker/gobgp/README.md
sam b681c473c0 Add Policy Diff, fleet-wide full-table feed, and Kafka lag monitoring
Policy Diff (roadmap E2 follow-up): obmp-rib-poller pulls per-router
post-policy accepted/advertised prefix counts and route-policy bindings
over CLI+NETCONF (BMP on XRv9000 24.3.1 carries only pre-policy
Adj-RIB-In). New tables in 008_obmp_policy_diff.sql; Policy Diff
dashboard joins them against BMP ip_rib for received-vs-kept-vs-rejected.

GoBGP fleet-wide feed: GoBGP re-advertises the full Bromirski table to
both labs' core routers (CML AS65020, PROX AS65021) over eBGP; as route
reflectors the cores propagate it to every R9K client, so all 18 lab
routers carry and BMP-export a full table -- an intentional stress test
of the ingestion/storage path. cml/gobgp_peering_config.py applies and
rolls back the core-side config; gobgp/README.md documents the rollback.

Kafka lag monitoring: kafka-lag-monitor samples consumer-group lag every
30s into TimescaleDB (009_kafka_lag.sql); Kafka Ingestion Lag dashboard
gives visibility into the pipeline under churn load.

Peer Detail dashboard: the Peer selector is now router-qualified
(router -> peer) so it is unambiguous in an iBGP route-reflector mesh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 12:42:25 -07:00

# GoBGP global Internet table feed (roadmap E1)
This service runs [GoBGP](https://github.com/osrg/gobgp) to pull the **full real
Internet routing table** (IPv4 ~1M + IPv6 ~200k routes) from Łukasz Bromirski's
lab route server (**AS57355**) and BMP-export every received route to the
OpenBMP collector. The table lands in PostgreSQL `ip_rib` as a monitored peer.
- Image: `jauderho/gobgp:v4.5.0` — community-maintained, multi-arch, tracks
upstream GoBGP releases (rebuilt within an hour of each release). Chosen
because the official `osrg/gobgp` image is published less consistently.
- Local AS: **65001** (private). Router-id: `10.40.40.250`.
- The route-server sessions (one IPv4, one IPv6) are **receive-only**: we announce nothing to AS57355.
## Files
| File | Purpose |
|------------------|----------------------------------------------------------------|
| `gobgpd.conf` | GoBGP daemon config (global, neighbors, BMP export). TOML. |
| `mrt-refresh.sh` | MRT full-table fallback loader (cron-driven). |
| `mrt/` | Created at runtime; cached RouteViews RIB dumps. |
## Bring it up
The `gobgp` service is defined in the repo `docker-compose.yml`, on the same
default compose network as `collector`, and `depends_on` it.
```sh
docker compose config # validate compose is well-formed
docker compose up -d gobgp # start (depends_on also brings up collector)
docker logs -f obmp-gobgp
```
> The live BGP cutover is performed manually by an operator. Bringing the container
> up is the only step required, because GoBGP initiates the eBGP-multihop sessions automatically.
## Confirm the session and route count
```sh
# session state: expect both AS57355 neighbors in "Establ" (4 sessions once the CML feed below is applied)
docker exec obmp-gobgp gobgp neighbor
# received route counts — expect ~1M IPv4, ~200k IPv6
docker exec obmp-gobgp gobgp global rib summary -a ipv4
docker exec obmp-gobgp gobgp global rib summary -a ipv6
```
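
When the check above needs to be scripted (for example, before a test run that assumes the feed is live), the `gobgp neighbor` table can be polled for sessions in `Establ` state. This is a minimal sketch, assuming the default human-readable output in which an established session shows the word `Establ`; `established_count` and `wait_established` are hypothetical helper names, not part of this repo.

```sh
#!/bin/sh
# Count lines of `gobgp neighbor` output (read on stdin) whose state is "Establ".
# grep -c prints 0 and exits 1 when nothing matches; `|| true` keeps the exit status 0.
established_count() {
  grep -cw 'Establ' || true
}

# Poll until at least $1 sessions are Established, giving up after $2 seconds (default 300).
wait_established() {
  want=$1 timeout=${2:-300} waited=0
  while :; do
    n=$(docker exec obmp-gobgp gobgp neighbor | established_count)
    [ "$n" -ge "$want" ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "only $n/$want sessions Establ after ${timeout}s" >&2
      return 1
    fi
    sleep 10
    waited=$((waited + 10))
  done
}
```

For example, `wait_established 2 600` waits up to ten minutes for the two AS57355 sessions; pass `4` once the CML core neighbors are configured.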
## How the data appears in OpenBMP
GoBGP opens an outbound **BMP** session to `obmp-collector:5000` with
`route-monitoring-policy = "pre-policy"` (Adj-RIB-In, pre import-policy —
consistent with the rest of the OpenBMP fleet).

In OpenBMP / PostgreSQL the source is identified by the **BMP router**, which
GoBGP reports using its `router-id` (`10.40.40.250`) and `local-as` (`65001`):
- `routers` table — a row with `ip_address` / name derived from `10.40.40.250`.
- `bgp_peers` table — two peer rows for `85.232.240.179` and
`2001:1a68:2c:2::179`, both `peer_as = 57355`.
- `ip_rib` — every prefix from the global table, attributed to those peers.

To find it in Grafana/SQL, filter on `peer_as = 57355` or the router-id above.
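
For a quick per-peer view without Grafana, a query along these lines counts the active rows (roughly one per prefix) that GoBGP has exported for each AS57355 neighbor. This is a sketch: only `peer_as` and `iswithdrawn` are confirmed by this README, so the join on `ip_rib.peer_hash_id = bgp_peers.hash_id` and the `bgp_peers.peer_addr` column are assumptions about the OpenBMP PostgreSQL schema, and the `PSQL` override and `rib_by_peer` name are hypothetical.

```sh
#!/bin/sh
# psql invocation; override PSQL to target a different container or host.
PSQL=${PSQL:-"docker exec -i obmp-psql psql -U openbmp -d openbmp -At"}

# Active (non-withdrawn) rows per AS57355 peer; psql reads the query from stdin.
# Column names other than peer_as / iswithdrawn are assumed, not confirmed.
rib_by_peer() {
  $PSQL <<'SQL'
SELECT p.peer_addr, count(*) AS prefixes
FROM ip_rib r
JOIN bgp_peers p ON p.hash_id = r.peer_hash_id
WHERE p.peer_as = 57355
  AND r.iswithdrawn = false
GROUP BY p.peer_addr
ORDER BY p.peer_addr;
SQL
}
```

With the live feed up, the IPv4 and IPv6 peers should each report a count in the same range as the `gobgp global rib summary` output above.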
## Fleet-wide full-table feed into the CML lab (stress test)
GoBGP additionally re-advertises the full table to the two CML core routers
(CORE-01/CORE-02, AS65020). As route reflectors the cores propagate it to all
seven R9K clients, so every lab router carries and BMP-exports a full table —
an intentional stress test of the OpenBMP ingestion/storage path (the database
grows toward ~55-65 GB).
- **GoBGP side**: `gobgpd.conf` neighbors `10.100.0.100` / `10.100.0.200`
  (peer-as 65020, eBGP-multihop, IPv4+IPv6, `prefix-limit` caps). The
  route-server sessions carry `default-export-policy = "reject-route"` so the
  lab's own routes can never leak back to AS57355.
- **Router side**: `cml/gobgp_peering_config.py` adds a `neighbor 10.40.40.202`
  block to both cores, with `maximum-prefix` caps of 1.5M (IPv4) and 400k (IPv6).
  GoBGP is host-networked, so it sources its BGP TCP sessions from the host IP
  `10.40.40.202` rather than its router-id `10.40.40.250`; the cores therefore
  peer with the host IP.
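
The router-side caps leave headroom above today's table size (about 1M IPv4 and 200k IPv6 routes against 1.5M and 400k). A small arithmetic sketch for tracking how much of each cap is in use; `cap_used_pct` and `check_cap` are hypothetical helpers, the 90% warning threshold is an arbitrary choice, and the received counts would have to be read from the cores or approximated from the route-server `#Received` column of `gobgp neighbor`.

```sh
#!/bin/sh
# Percentage of a prefix cap already in use (integer, rounded down).
cap_used_pct() {
  received=$1 cap=$2
  echo $(( received * 100 / cap ))
}

# Print one status line per address family; WARN at or above 90% of the cap.
check_cap() {
  family=$1 received=$2 cap=$3
  pct=$(cap_used_pct "$received" "$cap")
  if [ "$pct" -ge 90 ]; then
    echo "WARN $family: $received/$cap prefixes (${pct}%)"
  else
    echo "ok   $family: $received/$cap prefixes (${pct}%)"
  fi
}
```

For example, `check_cap ipv4 1000000 1500000` prints `ok   ipv4: 1000000/1500000 prefixes (66%)`.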
### Apply
```sh
python3 cml/gobgp_peering_config.py # configure both cores
docker compose up -d --force-recreate gobgp # load gobgpd.conf changes
```
> A volume-mounted config change does NOT trigger a recreate on its own —
> `--force-recreate` is required for GoBGP to re-read `gobgpd.conf`.
### Rollback
**Emergency stop** (fastest — feed off within seconds, no router change):
```sh
docker compose stop gobgp
```
Stopping GoBGP drops the eBGP sessions; the cores withdraw the full table and
the withdrawal propagates to every client. The `ip_rib` rows are marked
withdrawn and aged out by the existing TimescaleDB retention.

**Full revert** (also removes the router-side config):
```sh
python3 cml/gobgp_peering_config.py --remove # delete neighbor from cores
docker compose stop gobgp
```
To keep the Bromirski feed running but drop only the lab injection, delete the
two `10.100.0.x` `[[neighbors]]` blocks from `gobgpd.conf` and
`docker compose up -d --force-recreate gobgp`.
### What to watch during convergence
```sh
docker exec obmp-gobgp gobgp neighbor # 4 sessions Establ
docker logs --tail 20 obmp-psql-app # consumer lag
docker exec obmp-psql psql -U openbmp -d openbmp -c \
"SELECT count(*) FROM ip_rib WHERE iswithdrawn = false;" # row growth
```
If `psql-app` consumer lag climbs without draining, or PostgreSQL CPU/IO
saturates, use the emergency stop above.
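
To tell whether ingestion is keeping up, sampling the row count from the check above at a fixed interval gives a growth rate that should taper off as convergence completes (and turn negative after an emergency stop, as rows are marked withdrawn). A rough sketch; `rib_rows`, `rows_per_sec`, and `watch_rib` are hypothetical helpers, and `count(*)` over a full-table `ip_rib` is itself a heavy query, so the long default interval is deliberate.

```sh
#!/bin/sh
# Current number of active (non-withdrawn) rows in ip_rib.
rib_rows() {
  docker exec obmp-psql psql -U openbmp -d openbmp -At \
    -c "SELECT count(*) FROM ip_rib WHERE iswithdrawn = false;"
}

# Integer growth rate (rows/s) between samples $1 and $2 taken $3 seconds apart.
rows_per_sec() {
  echo $(( ($2 - $1) / $3 ))
}

# Print the growth rate every $1 seconds (default 60) until interrupted.
watch_rib() {
  interval=${1:-60}
  prev=$(rib_rows)
  while sleep "$interval"; do
    cur=$(rib_rows)
    echo "$(date -u +%H:%M:%S) rows=$cur rate=$(rows_per_sec "$prev" "$cur" "$interval")/s"
    prev=$cur
  done
}
```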
## MRT fallback
AS57355 is a **single volunteer-run host with no SLA** — it can and does go
away. `mrt-refresh.sh` keeps the global table in `ip_rib` warm when the live
feed is down:
1. If any AS57355 session is `Established`, the script does nothing — the live
feed is authoritative and must not be overwritten with a stale dump.
2. Otherwise it downloads the latest full RIB dump from RouteViews
(`https://archive.routeviews.org/route-views/bgpdata/YYYY.MM/RIBS/rib.YYYYMMDD.HHMM.bz2`,
published every 2 hours UTC) and runs `gobgp mrt inject global <file>`,
which installs every prefix into the running daemon. BMP export to the
collector then happens automatically.
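
The two-step decision above can be sketched as follows. This is an illustration, not the actual contents of `mrt-refresh.sh`: it assumes `gobgp neighbor` prints the peer AS in the second column, assumes a `date` that accepts `-d @EPOCH` (GNU coreutils does), and builds the dump URL from the pattern quoted above by flooring the time to a 2-hour UTC slot. `live_feed_up` and `rib_url` are hypothetical names.

```sh
#!/bin/sh
# Step 1: succeed if any AS57355 neighbor in `gobgp neighbor` output (stdin) is Established.
live_feed_up() {
  awk 'NR > 1 && $2 == 57355 { for (i = 3; i <= NF; i++) if ($i == "Establ") found = 1 }
       END { exit !found }'
}

# Step 2: RouteViews RIB dump URL for the 2-hour UTC slot containing epoch $1,
# optionally stepping back $2 slots in case the newest dump is not yet published.
rib_url() {
  slot=$(( $1 - $1 % 7200 - ${2:-0} * 7200 ))
  month=$(date -u -d "@$slot" +%Y.%m)
  stamp=$(date -u -d "@$slot" +%Y%m%d.%H%M)
  echo "https://archive.routeviews.org/route-views/bgpdata/$month/RIBS/rib.$stamp.bz2"
}
```

In a run, the gate would be fed from `gobgp neighbor` and the download skipped when it succeeds; stepping back one slot (`rib_url "$(date -u +%s)" 1`) is one way to avoid requesting a dump that has not been uploaded yet.
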

The script is idempotent (re-uses an already-downloaded dump), guarded by a
`flock` against overlapping runs, and prunes to the 4 most recent dumps.
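
Each of those three safeguards maps to a common shell idiom. A sketch of each follows, assuming `flock(1)` and `curl` are available where the script runs and that dumps keep the RouteViews `rib.YYYYMMDD.HHMM.bz2` naming (so lexical order is chronological); the function names and default lock path are hypothetical, not taken from `mrt-refresh.sh`.

```sh
#!/bin/sh
# Overlap guard: hold an exclusive, non-blocking lock on fd 9 for the rest of the run.
# Returns non-zero if another run already holds the lock.
take_lock() {
  exec 9>"${1:-/tmp/mrt-refresh.lock}"
  flock -n 9
}

# Idempotent download: skip the fetch if a non-empty copy already exists.
# Writes to a temporary name first so an interrupted run leaves no partial dump.
fetch_once() {
  url=$1 dest=$2
  [ -s "$dest" ] && return 0
  curl -fsSL -o "$dest.part" "$url" && mv "$dest.part" "$dest"
}

# Retention: keep the newest $2 dumps (default 4) in directory $1, delete older ones.
prune_dumps() {
  dir=$1 keep=${2:-4}
  ls -1 "$dir" | grep '^rib\.[0-9]\{8\}\.[0-9]\{4\}\.bz2$' | sort -r |
    tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$dir/$f"; done
}
```

A typical run would call `take_lock || exit 0` first, so a cron invocation that overlaps a slow download simply exits.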
### Schedule it (host crontab, 2-hour cadence)
```cron
0 */2 * * * docker exec obmp-gobgp /config/mrt-refresh.sh >> /var/log/gobgp-mrt.log 2>&1
```
Run it once manually to verify:
```sh
docker exec obmp-gobgp /config/mrt-refresh.sh
```
## Caveats
- **No SLA.** AS57355 is a volunteer lab route server; treat the live feed as
best-effort and rely on the MRT fallback for continuity.
- eBGP-multihop TTL is set to 64 — the route server is many hops away.
- A full table is ~1M+ prefixes; expect a noticeable load spike in the
collector and PostgreSQL when the session first establishes or an MRT dump
is injected.