- postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table (roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib; a dedicated consumer will populate it. - production-sizing.md: corrected retention figures to the actual policy values, added a measured-data section (one full feed ≈ +5 GB current state; DB now ~30 GB), and a horizontal-scaling section — the bottleneck is the psql-app consumer + disk IOPS, so scale psql-app as a Kafka consumer group (cap = partition count), treat multi-collector as HA/locality not throughput. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
147 lines
7.6 KiB
Markdown
147 lines
7.6 KiB
Markdown
# OpenBMP Production Sizing — 40 Full-Table-Edge Routers
|
||
|
||
Sizing guidance for deploying the OpenBMP stack against a production ISP
|
||
network of **40 full-table-edge routers** with gNMI streaming telemetry.
|
||
Derived from the OpenBMP `psql-app` sizing guidance and measured lab behavior.
|
||
|
||
## Workload assumptions
|
||
|
||
| Parameter | Value |
|
||
|-----------|-------|
|
||
| Monitored routers | 40, full-table edge |
|
||
| BMP RIB scope | Adj-RIB-In (see recommendation below) |
|
||
| Full feeds per router | ~2–3 eBGP peers carrying the full DFZ |
|
||
| Routes per full feed | ~1.2M (≈1M IPv4 + ~0.2M IPv6) |
|
||
| **Estimated total NLRIs** | **~100–150M** in Adj-RIB-In |
|
||
| Telemetry | gNMI via Telegraf → InfluxDB, ~50–200 interfaces/router, 10 s interval |
|
||
| History retention | `ip_rib_log` 2 months, LS logs 8 weeks, `peer_event_log` 4 months (lab policy defaults; tunable) |
|
||
|
||
The NLRI estimate (40 × ~2.5 feeds × 1.2M) places this deployment at the top
|
||
of the OpenBMP `psql-app` guidance tier (150M NLRIs → 64 GB heap).
|
||
|
||
## Measured data point (lab, 2026)
|
||
|
||
Real numbers from the lab after adding **one** full-table feed (GoBGP →
|
||
AS57355, ~1.04M IPv4 + ~0.25M IPv6 routes):
|
||
|
||
| Metric | Before feed | After 1 full feed |
|
||
|--------|-------------|-------------------|
|
||
| `openbmp` DB size | ~25 GB | **~30 GB** |
|
||
| `ip_rib` (current state) | small | 5.3 GB |
|
||
| `ip_rib_log` (history hypertable) | — | 7.75 GB, 82/97 chunks compressed |
|
||
| `base_attrs` | ~1 GB | 2.3 GB |
|
||
| `geo_ip` (fixed reference data) | 8.8 GB | 8.8 GB |
|
||
|
||
So **one full feed ≈ +5 GB current-state**, plus history that accrues against
|
||
the 2-month `ip_rib_log` retention. The ~1.3M-route initial dump ingested in
|
||
minutes with no Kafka consumer lag. Extrapolating linearly, 40 routers × ~2.5
|
||
feeds ≈ 100 feed-equivalents → on the order of **0.5 TB current state** before
|
||
history and indexes; the 2–4 TB storage target below holds with headroom.
|
||
|
||
## BMP RIB scope — recommendation
|
||
|
||
**Deploy with Adj-RIB-In only.** It is the OpenBMP default, is what every
|
||
dashboard is built on, and captures the highest-value data — what each peer
|
||
advertises. Alternatives and their cost:
|
||
|
||
- **Loc-RIB** — adds a full post-best-path converged table per router
|
||
(~40 × 1.2M ≈ +48M NLRIs). Add later, selectively, only where best-path
|
||
analysis is needed; verify the IOS-XR release supports Loc-RIB BMP.
|
||
- **Adj-RIB-Out** — multiplies further (per advertised peer). Not recommended
|
||
for the initial deployment.
|
||
- **Post-policy Adj-RIB-In** — if inbound policy is restrictive this trims
|
||
volume meaningfully; with permissive import it is similar to pre-policy.
|
||
|
||
## Compute & memory
|
||
|
||
| Component | Lab today | Production target | Rationale |
|
||
|-----------|-----------|-------------------|-----------|
|
||
| **Total RAM** | 31 GB | **96–128 GB** | psql-app heap 48–64 GB + PostgreSQL shared_buffers/cache + Kafka 4–8 GB + InfluxDB + Grafana + collector |
|
||
| **CPU** | 8 cores | **16–32 vCPU** | PostgreSQL is CPU-bound under full-table churn — lab psql already sustains ~287% (3 cores) at 18 routers |
|
||
| `psql-app` JVM heap (`MEM`) | 3 GB | **48–64 GB** | OpenBMP guidance: 4 GB ≈ 10M NLRIs, 64 GB ≈ 150M NLRIs |
|
||
| `psql-app` container `mem_limit` | 4 GB | **heap + ~8 GB** | Set `PSQL_APP_MEM_LIMIT` above the JVM heap |
|
||
| `psql` container `mem_limit` | 6 GB | **48–64 GB** | Set `PSQL_MEM_LIMIT`; PostgreSQL wants ~25% as `shared_buffers` and the rest for OS cache |
|
||
| `kafka` container `mem_limit` | 4 GB | **8–12 GB** | Set `KAFKA_MEM_LIMIT`; full-table initial dumps from 40 routers are bursty |
|
||
|
||
## Storage
|
||
|
||
| Store | Lab today | Production target | Notes |
|
||
|-------|-----------|-------------------|-------|
|
||
| **PostgreSQL** | 30 GB | **2–4 TB NVMe SSD** | `ip_rib` current state (~100–150M rows) + `ip_rib_log` history (2-month retention, the dominant grower) + `base_attrs` + `geo_ip` (~9 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom. |
|
||
| **Kafka** | 0.2 GB | **100–500 GB** | 12 h retention; sized for full-table initial-dump bursts × 40 routers |
|
||
| **InfluxDB (telemetry)** | minimal | **50–200 GB** | 40 routers × ~50–200 interfaces × 10 s gNMI × 30 d; compresses well |
|
||
| **Total** | — | **~3–5 TB fast NVMe** | Use NVMe; PostgreSQL random-IO under churn is the bottleneck on slow disks |
|
||
|
||
Put the PostgreSQL data directory and the TimescaleDB tablespace on NVMe.
|
||
`ip_rib_log` retention (2 months in the lab) is the main storage tuning knob
|
||
— revisit once production update volume is measured; halving it roughly
|
||
halves the dominant history table.
|
||
|
||
## Architecture
|
||
|
||
A single host is viable only if large (**≥128 GB RAM, ≥32 vCPU, multi-TB
|
||
NVMe**). **Preferred: split services across hosts** —
|
||
|
||
| Host | Services | Profile |
|
||
|------|----------|---------|
|
||
| **DB host** (heaviest) | postgres | — |
|
||
| **Pipeline host** | kafka, zookeeper, collector, psql-app | core |
|
||
| **Presentation host** | grafana, influxdb, telegraf, whois | core + telemetry |
|
||
|
||
Whichever layout: every service already carries a Compose `mem_limit` — raise
|
||
`PSQL_MEM_LIMIT` / `PSQL_APP_MEM_LIMIT` / `KAFKA_MEM_LIMIT` in `.env` for the
|
||
production hosts.
|
||
|
||
## Horizontal scaling — where it actually helps
|
||
|
||
The ingestion bottleneck is **not** the collector or Kafka — it is the
|
||
`psql-app` consumer writing to PostgreSQL, and ultimately **disk IOPS**.
|
||
Plan scaling accordingly:
|
||
|
||
- **Scale `psql-app` as a Kafka consumer group.** Run multiple `psql-app`
|
||
containers with the **same group ID**; Kafka rebalances partitions across
|
||
them and fails over automatically. This is the real throughput lever and
|
||
also provides HA. **Hard cap = Kafka partition count** — the compose sets
|
||
`KAFKA_NUM_PARTITIONS: 8`, so ≤ 8 useful instances. **Raise the partition
|
||
count before scaling past a few consumers** — it cannot easily be reduced
|
||
later.
|
||
- **Disk IOPS is the named bottleneck.** Target **≥ 5000 IOPS** (NVMe) for
|
||
the PostgreSQL store; this buys more headroom than any container count.
|
||
- **Multiple collectors are an HA / locality decision, not a throughput
|
||
one.** A BMP session is one stateful TCP connection and cannot be load
|
||
balanced — you distribute routers by pointing each router's `bmp server`
|
||
config at a specific collector. All collectors feed one Kafka. Shard
|
||
collectors for fault isolation / POP locality, not for performance, and
|
||
note a dead collector's routers go dark until reconfigured (no auto-
|
||
failover at the collector tier).
|
||
- Within one `psql-app`, writer threads already auto-scale per type
|
||
(`writer_max_threads_per_type`); the consumer-group is the across-instance
|
||
layer on top.
|
||
|
||
Bursts (every collector restart triggers simultaneous full-table dumps from
|
||
all peers) are absorbed by Kafka — size Kafka retention so a slow consumer
|
||
never loses data during a convergence storm.
|
||
|
||
## PostgreSQL tuning
|
||
|
||
- `shared_buffers` ≈ 25% of host RAM; large `effective_cache_size`.
|
||
- Raise `work_mem` (dashboard aggregate queries) and `maintenance_work_mem`.
|
||
- `max_wal_size` already 10 GB — keep or raise for churn bursts.
|
||
- Enable parallel query (`max_parallel_workers_per_gather`).
|
||
- Aggressive autovacuum on churn tables (`ip_rib`, `base_attrs`, `ip_rib_log`)
|
||
— applied in the lab; persist these settings in production provisioning.
|
||
- TimescaleDB compression is already enabled on `ip_rib_log` and the `stats_*`
|
||
hypertables — keep it.
|
||
|
||
## Reference bill of materials (single-host option)
|
||
|
||
| Resource | Spec |
|
||
|----------|------|
|
||
| CPU | 32 vCPU |
|
||
| RAM | 128 GB |
|
||
| Storage | 4 TB NVMe SSD |
|
||
| Network | 1 GbE+ to the routers' BMP source network |
|
||
|
||
For the split-host option, divide per the architecture table — the DB host
|
||
takes the bulk of RAM and all of the fast storage.
|