obmp-docker/docs/production-sizing.md
sam 2d83d6c02e Add evpn_rib schema; update production sizing with measured data
- postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table
  (roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib;
  a dedicated consumer will populate it.
- production-sizing.md: corrected retention figures to the actual
  policy values, added a measured-data section (one full feed ≈
  +5 GB current state; DB now ~30 GB), and a horizontal-scaling
  section — the bottleneck is the psql-app consumer + disk IOPS, so
  scale psql-app as a Kafka consumer group (cap = partition count),
  treat multi-collector as HA/locality not throughput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 08:44:09 -07:00

# OpenBMP Production Sizing — 40 Full-Table-Edge Routers
Sizing guidance for deploying the OpenBMP stack against a production ISP
network of **40 full-table-edge routers** with gNMI streaming telemetry.
Derived from the OpenBMP `psql-app` sizing guidance and measured lab behavior.
## Workload assumptions
| Parameter | Value |
|-----------|-------|
| Monitored routers | 40, full-table edge |
| BMP RIB scope | Adj-RIB-In (see recommendation below) |
| Full feeds per router | ~2–3 eBGP peers carrying the full DFZ |
| Routes per full feed | ~1.2M (≈1M IPv4 + ~0.2M IPv6) |
| **Estimated total NLRIs** | **~100–150M** in Adj-RIB-In |
| Telemetry | gNMI via Telegraf → InfluxDB, ~50–200 interfaces/router, 10 s interval |
| History retention | `ip_rib_log` 2 months, LS logs 8 weeks, `peer_event_log` 4 months (lab policy defaults; tunable) |

The NLRI estimate (40 × ~2.5 feeds × 1.2M ≈ 120M) places this deployment near
the top of the OpenBMP `psql-app` guidance tier (150M NLRIs → 64 GB heap).
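
The estimate is simple multiplication, and a short sketch makes it easy to re-run with different peer counts. The constant and function names are illustrative, and the 2 to 3 feed range is an assumption taken from the workload table.

```python
# Workload figures from the assumptions table; names are illustrative.
ROUTERS = 40
ROUTES_PER_FEED = 1_200_000  # ~1M IPv4 + ~0.2M IPv6


def total_nlris(feeds_per_router: float) -> int:
    """Adj-RIB-In NLRIs across all routers for a given full-feed count."""
    return round(ROUTERS * feeds_per_router * ROUTES_PER_FEED)


low, mid, high = (total_nlris(f) for f in (2, 2.5, 3))
print(f"{low / 1e6:.0f}M / {mid / 1e6:.0f}M / {high / 1e6:.0f}M NLRIs")
```

With 2, 2.5 and 3 feeds per router this gives 96M, 120M and 144M NLRIs, which brackets the ~100–150M figure in the table.
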
## Measured data point (lab, 2026)
Real numbers from the lab after adding **one** full-table feed (GoBGP →
AS57355, ~1.04M IPv4 + ~0.25M IPv6 routes):
| Metric | Before feed | After 1 full feed |
|--------|-------------|-------------------|
| `openbmp` DB size | ~25 GB | **~30 GB** |
| `ip_rib` (current state) | small | 5.3 GB |
| `ip_rib_log` (history hypertable) | — | 7.75 GB, 82/97 chunks compressed |
| `base_attrs` | ~1 GB | 2.3 GB |
| `geo_ip` (fixed reference data) | 8.8 GB | 8.8 GB |

So **one full feed ≈ +5 GB current-state**, plus history that accrues against
the 2-month `ip_rib_log` retention. The ~1.3M-route initial dump ingested in
minutes with no Kafka consumer lag. Extrapolating linearly, 40 routers × ~2.5
feeds ≈ 100 feed-equivalents → on the order of **0.5 TB current state** before
history and indexes; the 2–4 TB storage target below holds with headroom.
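
The linear extrapolation can be written out as a small sketch. It assumes current-state size scales linearly with feed count, as the paragraph above does; the names are illustrative.

```python
GB_PER_FEED = 5.0  # measured in the lab: one full feed added ~5 GB of current state


def current_state_gb(routers: int, feeds_per_router: float) -> float:
    """Current-state size, assuming it grows linearly with the number of full feeds."""
    return routers * feeds_per_router * GB_PER_FEED


est = current_state_gb(40, 2.5)
print(f"{est:.0f} GB (~{est / 1000:.1f} TB) before history and indexes")
```

This covers only current state. History growth depends on production update volume, which has not been measured yet.
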
## BMP RIB scope — recommendation
**Deploy with Adj-RIB-In only.** It is the OpenBMP default, is what every
dashboard is built on, and captures the highest-value data — what each peer
advertises. Alternatives and their cost:
- **Loc-RIB** — adds a full post-best-path converged table per router
(~40 × 1.2M ≈ +48M NLRIs). Add later, selectively, only where best-path
analysis is needed; verify the IOS-XR release supports Loc-RIB BMP.
- **Adj-RIB-Out** — multiplies further (per advertised peer). Not recommended
for the initial deployment.
- **Post-policy Adj-RIB-In** — if inbound policy is restrictive this trims
volume meaningfully; with permissive import it is similar to pre-policy.
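
To see why Loc-RIB is best added selectively, the sketch below adds one converged table per Loc-RIB-enabled router to the Adj-RIB-In baseline and compares the total with the 150M NLRI guidance tier quoted in this document. The names are illustrative and the 120M baseline is the 40 × ~2.5 × 1.2M estimate.

```python
ROUTES_PER_TABLE = 1_200_000
BASELINE_NLRIS = 120_000_000    # 40 x ~2.5 feeds x 1.2M, Adj-RIB-In only
GUIDANCE_CEILING = 150_000_000  # top psql-app guidance tier quoted in this document


def with_loc_rib(routers_enabled: int) -> int:
    """Total NLRIs when Loc-RIB is enabled on the given number of routers."""
    return BASELINE_NLRIS + routers_enabled * ROUTES_PER_TABLE


for n in (0, 10, 25, 40):
    total = with_loc_rib(n)
    status = "over" if total > GUIDANCE_CEILING else "within"
    print(f"{n:2d} routers with Loc-RIB: {total / 1e6:.0f}M NLRIs ({status} guidance)")
```

Under these assumptions, enabling Loc-RIB on all 40 routers lands at roughly 168M NLRIs, beyond the 150M tier, while a subset of up to about 25 routers stays within it.
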
## Compute & memory
| Component | Lab today | Production target | Rationale |
|-----------|-----------|-------------------|-----------|
| **Total RAM** | 31 GB | **96–128 GB** | psql-app heap 48–64 GB + PostgreSQL shared_buffers/cache + Kafka 4–8 GB + InfluxDB + Grafana + collector |
| **CPU** | 8 cores | **16–32 vCPU** | PostgreSQL is CPU-bound under full-table churn; the lab `psql` container already sustains ~287% CPU (about 3 cores) at 18 routers |
| `psql-app` JVM heap (`MEM`) | 3 GB | **48–64 GB** | OpenBMP guidance: 4 GB ≈ 10M NLRIs, 64 GB ≈ 150M NLRIs |
| `psql-app` container `mem_limit` | 4 GB | **heap + ~8 GB** | Set `PSQL_APP_MEM_LIMIT` above the JVM heap |
| `psql` container `mem_limit` | 6 GB | **48–64 GB** | Set `PSQL_MEM_LIMIT`; PostgreSQL wants ~25% as `shared_buffers` and the rest for OS cache |
| `kafka` container `mem_limit` | 4 GB | **8–12 GB** | Set `KAFKA_MEM_LIMIT`; full-table initial dumps from 40 routers are bursty |
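
One way to pick a heap value between the two guidance points is linear interpolation. This is an assumption made for illustration, not something the OpenBMP guidance states, and the names below are hypothetical.

```python
# The two guidance points quoted in the table: (NLRIs, heap in GB).
LOW_POINT = (10_000_000, 4.0)
HIGH_POINT = (150_000_000, 64.0)
CONTAINER_OVERHEAD_GB = 8.0  # "heap + ~8 GB" for the container mem_limit


def heap_gb(nlris: int) -> float:
    """Linearly interpolate heap size between the guidance points (an assumption)."""
    (x0, y0), (x1, y1) = LOW_POINT, HIGH_POINT
    return y0 + (nlris - x0) * (y1 - y0) / (x1 - x0)


heap = heap_gb(120_000_000)
print(f"heap ~{heap:.0f} GB, mem_limit ~{heap + CONTAINER_OVERHEAD_GB:.0f} GB")
```

For the ~120M NLRI estimate this gives a heap of about 51 GB and a container limit of about 59 GB. Treat the result as a starting point and confirm it against observed JVM behaviour.
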
## Storage
| Store | Lab today | Production target | Notes |
|-------|-----------|-------------------|-------|
| **PostgreSQL** | 30 GB | **2–4 TB NVMe SSD** | `ip_rib` current state (~100–150M rows) + `ip_rib_log` history (2-month retention, the dominant grower) + `base_attrs` + `geo_ip` (~9 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom. |
| **Kafka** | 0.2 GB | **100–500 GB** | 12 h retention; sized for full-table initial-dump bursts × 40 routers |
| **InfluxDB (telemetry)** | minimal | **50–200 GB** | 40 routers × ~50–200 interfaces × 10 s gNMI × 30 d; compresses well |
| **Total** | — | **~3–5 TB fast NVMe** | Use NVMe; PostgreSQL random-IO under churn is the bottleneck on slow disks |

Put the PostgreSQL data directory and the TimescaleDB tablespace on NVMe.
`ip_rib_log` retention (2 months in the lab) is the main storage tuning knob
— revisit once production update volume is measured; halving it roughly
halves the dominant history table.
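
As a cross-check on the total, the per-store targets can be summed. The sketch reads the production targets as 2 to 4 TB for PostgreSQL, 100 to 500 GB for Kafka and 50 to 200 GB for InfluxDB; the dictionary name is illustrative.

```python
# (low, high) production targets in GB, as read from the storage table.
TARGETS_GB = {
    "postgresql": (2000, 4000),
    "kafka": (100, 500),
    "influxdb": (50, 200),
}

low = sum(lo for lo, _ in TARGETS_GB.values())
high = sum(hi for _, hi in TARGETS_GB.values())
print(f"{low / 1000:.2f} to {high / 1000:.2f} TB across all stores")
```

The sum comes to 2.15 to 4.70 TB, so a total in the 3 to 5 TB range leaves some headroom at both ends.
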
## Architecture
A single host is viable only if it is large (**≥128 GB RAM, ≥32 vCPU, multi-TB
NVMe**). **Preferred: split services across hosts:**
| Host | Services | Profile |
|------|----------|---------|
| **DB host** (heaviest) | postgres | — |
| **Pipeline host** | kafka, zookeeper, collector, psql-app | core |
| **Presentation host** | grafana, influxdb, telegraf, whois | core + telemetry |

Whichever layout is chosen, every service already carries a Compose
`mem_limit`. Raise `PSQL_MEM_LIMIT` / `PSQL_APP_MEM_LIMIT` / `KAFKA_MEM_LIMIT`
in `.env` for the production hosts.
## Horizontal scaling — where it actually helps
The ingestion bottleneck is **not** the collector or Kafka — it is the
`psql-app` consumer writing to PostgreSQL, and ultimately **disk IOPS**.
Plan scaling accordingly:
- **Scale `psql-app` as a Kafka consumer group.** Run multiple `psql-app`
containers with the **same group ID**; Kafka rebalances partitions across
them and fails over automatically. This is the real throughput lever and
also provides HA. **Hard cap = Kafka partition count** — the compose sets
`KAFKA_NUM_PARTITIONS: 8`, so ≤ 8 useful instances. **Raise the partition
count before scaling past a few consumers** — it cannot easily be reduced
later.
- **Disk IOPS is the named bottleneck.** Target **≥ 5000 IOPS** (NVMe) for
the PostgreSQL store; this buys more headroom than any container count.
- **Multiple collectors are an HA / locality decision, not a throughput
one.** A BMP session is one stateful TCP connection and cannot be load
balanced — you distribute routers by pointing each router's `bmp server`
config at a specific collector. All collectors feed one Kafka. Shard
collectors for fault isolation / POP locality, not for performance, and
note that a dead collector's routers go dark until reconfigured (no
auto-failover at the collector tier).
- Within one `psql-app`, writer threads already auto-scale per type
(`writer_max_threads_per_type`); the consumer-group is the across-instance
layer on top.
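
The partition cap on the consumer group can be illustrated with a toy round-robin assignment. This is a simplified model and not Kafka's actual assignor; the function names are hypothetical.

```python
def assign(partitions: int, consumers: int) -> list[list[int]]:
    """Spread partition ids round-robin over consumers (simplified model)."""
    owned: list[list[int]] = [[] for _ in range(consumers)]
    for p in range(partitions):
        owned[p % consumers].append(p)
    return owned


def active_consumers(partitions: int, consumers: int) -> int:
    """Count consumers that own at least one partition."""
    return sum(1 for parts in assign(partitions, consumers) if parts)


PARTITIONS = 8  # matches KAFKA_NUM_PARTITIONS in the compose file
for n in (2, 8, 12):
    print(f"{n:2d} psql-app instances -> {active_consumers(PARTITIONS, n)} doing work")
```

With 8 partitions, 12 instances leave 4 idle. Extra instances beyond the partition count only act as standbys.
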
Bursts (every collector restart triggers simultaneous full-table dumps from
all peers) are absorbed by Kafka — size Kafka retention so a slow consumer
never loses data during a convergence storm.
## PostgreSQL tuning
- `shared_buffers` ≈ 25% of host RAM; large `effective_cache_size`.
- Raise `work_mem` (dashboard aggregate queries) and `maintenance_work_mem`.
- `max_wal_size` already 10 GB — keep or raise for churn bursts.
- Enable parallel query (`max_parallel_workers_per_gather`).
- Aggressive autovacuum on churn tables (`ip_rib`, `base_attrs`, `ip_rib_log`)
— applied in the lab; persist these settings in production provisioning.
- TimescaleDB compression is already enabled on `ip_rib_log` and the `stats_*`
hypertables — keep it.
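
The memory-related settings can be derived from host RAM with a small helper. The 25% figure comes from the list above. The 75% used for `effective_cache_size` is an assumption standing in for "large", and the function name is hypothetical.

```python
def pg_memory_settings(host_ram_gb: int) -> dict[str, str]:
    """Suggest memory settings for a dedicated PostgreSQL host."""
    shared = host_ram_gb // 4     # ~25% of RAM, per the tuning list
    cache = host_ram_gb * 3 // 4  # assumption: 75% of RAM counts as "large"
    return {
        "shared_buffers": f"{shared}GB",
        "effective_cache_size": f"{cache}GB",
        "max_wal_size": "10GB",   # current value per this document
    }


for key, value in pg_memory_settings(64).items():
    print(f"{key} = {value}")
```

The printed lines use `postgresql.conf` syntax. The 64 GB input is only an example for a DB host; `work_mem` and `maintenance_work_mem` are left out because they depend on query concurrency.
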
## Reference bill of materials (single-host option)
| Resource | Spec |
|----------|------|
| CPU | 32 vCPU |
| RAM | 128 GB |
| Storage | 4 TB NVMe SSD |
| Network | 1 GbE+ to the routers' BMP source network |

For the split-host option, divide per the architecture table; the DB host
takes the bulk of RAM and all of the fast storage.