obmp-docker/docs/production-sizing.md
sam f1558946ae Add production sizing guide for 40 full-table-edge routers
Documents compute, memory, and storage requirements for a production
deployment: ~100-150M NLRI estimate, 96-128 GB RAM, 16-32 vCPU, 3-5 TB NVMe,
a split-host architecture option, PostgreSQL tuning, and a BMP RIB-scope
recommendation (Adj-RIB-In only initially).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:06:25 -07:00

4.9 KiB
Raw Permalink Blame History

OpenBMP Production Sizing — 40 Full-Table-Edge Routers

Sizing guidance for deploying the OpenBMP stack against a production ISP network of 40 full-table-edge routers with gNMI streaming telemetry. Derived from the OpenBMP psql-app sizing guidance and measured lab behavior.

Workload assumptions

Parameter Value
Monitored routers 40, full-table edge
BMP RIB scope Adj-RIB-In (see recommendation below)
Full feeds per router ~23 eBGP peers carrying the full DFZ
Routes per full feed ~1.2M (≈1M IPv4 + ~0.2M IPv6)
Estimated total NLRIs ~100150M in Adj-RIB-In
Telemetry gNMI via Telegraf → InfluxDB, ~50200 interfaces/router, 10 s interval
History retention ip_rib_log 4 weeks, LS logs 4 months, peer_event_log 1 year

The NLRI estimate (40 × ~2.5 feeds × 1.2M) places this deployment at the top of the OpenBMP psql-app guidance tier (150M NLRIs → 64 GB heap).

BMP RIB scope — recommendation

Deploy with Adj-RIB-In only. It is the OpenBMP default, is what every dashboard is built on, and captures the highest-value data — what each peer advertises. Alternatives and their cost:

  • Loc-RIB — adds a full post-best-path converged table per router (~40 × 1.2M ≈ +48M NLRIs). Add later, selectively, only where best-path analysis is needed; verify the IOS-XR release supports Loc-RIB BMP.
  • Adj-RIB-Out — multiplies further (per advertised peer). Not recommended for the initial deployment.
  • Post-policy Adj-RIB-In — if inbound policy is restrictive this trims volume meaningfully; with permissive import it is similar to pre-policy.

Compute & memory

Component Lab today Production target Rationale
Total RAM 31 GB 96128 GB psql-app heap 4864 GB + PostgreSQL shared_buffers/cache + Kafka 48 GB + InfluxDB + Grafana + collector
CPU 8 cores 1632 vCPU PostgreSQL is CPU-bound under full-table churn — lab psql already sustains ~287% (3 cores) at 18 routers
psql-app JVM heap (MEM) 3 GB 4864 GB OpenBMP guidance: 4 GB ≈ 10M NLRIs, 64 GB ≈ 150M NLRIs
psql-app container mem_limit 4 GB heap + ~8 GB Set PSQL_APP_MEM_LIMIT above the JVM heap
psql container mem_limit 6 GB 4864 GB Set PSQL_MEM_LIMIT; PostgreSQL wants ~25% as shared_buffers and the rest for OS cache
kafka container mem_limit 4 GB 812 GB Set KAFKA_MEM_LIMIT; full-table initial dumps from 40 routers are bursty

Storage

Store Lab today Production target Notes
PostgreSQL 25 GB 24 TB NVMe SSD ip_rib current state (~100150M rows) + ip_rib_log history (4-week retention, the dominant grower) + base_attrs + geo_ip (~7 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom.
Kafka 0.2 GB 100500 GB 12 h retention; sized for full-table initial-dump bursts × 40 routers
InfluxDB (telemetry) minimal 50200 GB 40 routers × ~50200 interfaces × 10 s gNMI × 30 d; compresses well
Total ~35 TB fast NVMe Use NVMe; PostgreSQL random-IO under churn is the bottleneck on slow disks

Put the PostgreSQL data directory and the TimescaleDB tablespace on NVMe. ip_rib_log 4-week retention is the main storage tuning knob — revisit once production update volume is measured.

Architecture

A single host is viable only if large (≥128 GB RAM, ≥32 vCPU, multi-TB NVMe). Preferred: split services across hosts

Host Services Profile
DB host (heaviest) postgres
Pipeline host kafka, zookeeper, collector, psql-app core
Presentation host grafana, influxdb, telegraf, whois core + telemetry

Whichever layout: every service already carries a Compose mem_limit — raise PSQL_MEM_LIMIT / PSQL_APP_MEM_LIMIT / KAFKA_MEM_LIMIT in .env for the production hosts.

PostgreSQL tuning

  • shared_buffers ≈ 25% of host RAM; large effective_cache_size.
  • Raise work_mem (dashboard aggregate queries) and maintenance_work_mem.
  • max_wal_size already 10 GB — keep or raise for churn bursts.
  • Enable parallel query (max_parallel_workers_per_gather).
  • Aggressive autovacuum on churn tables (ip_rib, base_attrs, ip_rib_log) — applied in the lab; persist these settings in production provisioning.
  • TimescaleDB compression is already enabled on ip_rib_log and the stats_* hypertables — keep it.

Reference bill of materials (single-host option)

Resource Spec
CPU 32 vCPU
RAM 128 GB
Storage 4 TB NVMe SSD
Network 1 GbE+ to the routers' BMP source network

For the split-host option, divide per the architecture table — the DB host takes the bulk of RAM and all of the fast storage.