diff --git a/docs/production-sizing.md b/docs/production-sizing.md
index ec01b97..fc1d5f6 100644
--- a/docs/production-sizing.md
+++ b/docs/production-sizing.md
@@ -14,11 +14,30 @@ Derived from the OpenBMP `psql-app` sizing guidance and measured lab behavior.
 | Routes per full feed | ~1.2M (≈1M IPv4 + ~0.2M IPv6) |
 | **Estimated total NLRIs** | **~100–150M** in Adj-RIB-In |
 | Telemetry | gNMI via Telegraf → InfluxDB, ~50–200 interfaces/router, 10 s interval |
-| History retention | `ip_rib_log` 4 weeks, LS logs 4 months, `peer_event_log` 1 year |
+| History retention | `ip_rib_log` 2 months, LS logs 8 weeks, `peer_event_log` 4 months (lab policy defaults; tunable) |
 
 The NLRI estimate (40 × ~2.5 feeds × 1.2M) places this deployment at the top of
 the OpenBMP `psql-app` guidance tier (150M NLRIs → 64 GB heap).
 
+## Measured data point (lab, 2026)
+
+Real numbers from the lab after adding **one** full-table feed (GoBGP →
+AS57355, ~1.04M IPv4 + ~0.25M IPv6 routes):
+
+| Metric | Before feed | After 1 full feed |
+|--------|-------------|-------------------|
+| `openbmp` DB size | ~25 GB | **~30 GB** |
+| `ip_rib` (current state) | small | 5.3 GB |
+| `ip_rib_log` (history hypertable) | — | 7.75 GB, 82/97 chunks compressed |
+| `base_attrs` | ~1 GB | 2.3 GB |
+| `geo_ip` (fixed reference data) | 8.8 GB | 8.8 GB |
+
+So **one full feed ≈ +5 GB current-state**, plus history that accrues against
+the 2-month `ip_rib_log` retention. The ~1.3M-route initial dump ingested in
+minutes with no Kafka consumer lag. Extrapolating linearly, 40 routers × ~2.5
+feeds ≈ 100 feed-equivalents → on the order of **0.5 TB current state** before
+history and indexes; the 2–4 TB storage target below holds with headroom.
+
 ## BMP RIB scope — recommendation
 
 **Deploy with Adj-RIB-In only.** It is the OpenBMP default, is what every
@@ -48,14 +67,15 @@ advertises. Alternatives and their cost:
 
 | Store | Lab today | Production target | Notes |
 |-------|-----------|-------------------|-------|
-| **PostgreSQL** | 25 GB | **2–4 TB NVMe SSD** | `ip_rib` current state (~100–150M rows) + `ip_rib_log` history (4-week retention, the dominant grower) + `base_attrs` + `geo_ip` (~7 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom. |
+| **PostgreSQL** | 30 GB | **2–4 TB NVMe SSD** | `ip_rib` current state (~100–150M rows) + `ip_rib_log` history (2-month retention, the dominant grower) + `base_attrs` + `geo_ip` (~9 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom. |
 | **Kafka** | 0.2 GB | **100–500 GB** | 12 h retention; sized for full-table initial-dump bursts × 40 routers |
 | **InfluxDB (telemetry)** | minimal | **50–200 GB** | 40 routers × ~50–200 interfaces × 10 s gNMI × 30 d; compresses well |
 | **Total** | — | **~3–5 TB fast NVMe** | Use NVMe; PostgreSQL random-IO under churn is the bottleneck on slow disks |
 
 Put the PostgreSQL data directory and the TimescaleDB tablespace on NVMe.
-`ip_rib_log` 4-week retention is the main storage tuning knob — revisit once
-production update volume is measured.
+`ip_rib_log` retention (2 months in the lab) is the main storage tuning knob
+— revisit once production update volume is measured; halving it roughly
+halves the dominant history table.
 
 ## Architecture
 
@@ -72,6 +92,36 @@ Whichever layout: every service already carries a Compose `mem_limit` — raise
 `PSQL_MEM_LIMIT` / `PSQL_APP_MEM_LIMIT` / `KAFKA_MEM_LIMIT` in `.env` for the
 production hosts.
 
+## Horizontal scaling — where it actually helps
+
+The ingestion bottleneck is **not** the collector or Kafka — it is the
+`psql-app` consumer writing to PostgreSQL, and ultimately **disk IOPS**.
+Plan scaling accordingly:
+
+- **Scale `psql-app` as a Kafka consumer group.** Run multiple `psql-app`
+  containers with the **same group ID**; Kafka rebalances partitions across
+  them and fails over automatically. This is the real throughput lever and
+  also provides HA. **Hard cap = Kafka partition count** — the compose sets
+  `KAFKA_NUM_PARTITIONS: 8`, so ≤ 8 useful instances. **Raise the partition
+  count before scaling past a few consumers** — it cannot easily be reduced
+  later.
+- **Disk IOPS is the named bottleneck.** Target **≥ 5000 IOPS** (NVMe) for
+  the PostgreSQL store; this buys more headroom than any container count.
+- **Multiple collectors are an HA / locality decision, not a throughput
+  one.** A BMP session is one stateful TCP connection and cannot be load
+  balanced — you distribute routers by pointing each router's `bmp server`
+  config at a specific collector. All collectors feed one Kafka. Shard
+  collectors for fault isolation / POP locality, not for performance, and
+  note a dead collector's routers go dark until reconfigured (no auto-
+  failover at the collector tier).
+- Within one `psql-app`, writer threads already auto-scale per type
+  (`writer_max_threads_per_type`); the consumer-group is the across-instance
+  layer on top.
+
+Bursts (every collector restart triggers simultaneous full-table dumps from
+all peers) are absorbed by Kafka — size Kafka retention so a slow consumer
+never loses data during a convergence storm.
+
 ## PostgreSQL tuning
 
 - `shared_buffers` ≈ 25% of host RAM; large `effective_cache_size`.
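The lab figures in the measured-data-point table of `docs/production-sizing.md` could be re-collected with queries along these lines. This is a sketch, not the commands that produced the numbers. It assumes the database is named `openbmp` (as in that table), that `ip_rib`, `base_attrs` and `geo_ip` are plain tables on the search path, and that TimescaleDB 2.x is installed so that `hypertable_size()` and the `timescaledb_information.chunks` view are available.

```sql
-- Whole-database size (database name taken from the sizing table).
SELECT pg_size_pretty(pg_database_size('openbmp')) AS db_size;

-- Plain tables: heap + indexes + TOAST.
-- Assumption: these are ordinary (non-partitioned) tables on the search path.
SELECT t AS table_name,
       pg_size_pretty(pg_total_relation_size(t::regclass)) AS total_size
FROM unnest(ARRAY['ip_rib', 'base_attrs', 'geo_ip']) AS t
ORDER BY pg_total_relation_size(t::regclass) DESC;

-- History hypertable: total size across all chunks.
-- Assumption: TimescaleDB 2.x, where hypertable_size() is provided.
SELECT pg_size_pretty(hypertable_size('ip_rib_log')) AS ip_rib_log_size;

-- Compressed vs. total chunks for the history hypertable.
SELECT count(*) FILTER (WHERE is_compressed) AS compressed_chunks,
       count(*)                              AS total_chunks
FROM timescaledb_information.chunks
WHERE hypertable_name = 'ip_rib_log';
```

Running the same four queries before and after adding a feed gives the per-feed delta that the linear extrapolation in the doc relies on.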
diff --git a/postgres/scripts/007_obmp_evpn.sql b/postgres/scripts/007_obmp_evpn.sql
new file mode 100644
index 0000000..5691c8a
--- /dev/null
+++ b/postgres/scripts/007_obmp_evpn.sql
@@ -0,0 +1,44 @@
+-- BGP EVPN RIB table (roadmap E5)
+--
+-- The OpenBMP collector already decodes EVPN and emits the
+-- 'openbmp.parsed.evpn' Kafka topic, but the stock psql-app consumer never
+-- subscribes to it and the base schema has no table for it. This table is
+-- the landing zone; a dedicated consumer (obmp-evpn-consumer, separate)
+-- subscribes to the topic and writes here.
+--
+-- Mirrors l3vpn_rib conventions. route_type is derived by the consumer from
+-- which fields are populated (the parsed EVPN message has no explicit type),
+-- so it is nullable.
+CREATE TABLE IF NOT EXISTS evpn_rib (
+    hash_id               uuid NOT NULL,
+    base_attr_hash_id     uuid,
+    peer_hash_id          uuid NOT NULL,
+    rd                    varchar(128) NOT NULL,
+    rd_type               smallint,
+    route_type            smallint, -- EVPN route type 1..5
+    origin_as             bigint,
+    eth_segment_id        varchar(255), -- ESI
+    eth_tag_id            bigint,
+    mac                   macaddr,
+    mac_len               smallint,
+    ip                    inet,
+    ip_len                smallint,
+    orig_router_ip        inet,
+    mpls_label1           bigint, -- VXLAN VNI when encap = vxlan
+    mpls_label2           bigint,
+    ext_community_list    varchar(50)[], -- route-targets
+    path_id               bigint,
+    timestamp             timestamp(6) without time zone NOT NULL DEFAULT (now() AT TIME ZONE 'utc'),
+    first_added_timestamp timestamp(6) without time zone NOT NULL DEFAULT (now() AT TIME ZONE 'utc'),
+    iswithdrawn           boolean NOT NULL DEFAULT false,
+    isprepolicy           boolean NOT NULL DEFAULT true,
+    isadjribin            boolean NOT NULL DEFAULT true,
+    PRIMARY KEY (peer_hash_id, hash_id)
+);
+CREATE INDEX IF NOT EXISTS evpn_rib_hash_id_idx ON evpn_rib (hash_id);
+CREATE INDEX IF NOT EXISTS evpn_rib_base_attr_idx ON evpn_rib (base_attr_hash_id);
+CREATE INDEX IF NOT EXISTS evpn_rib_rd_idx ON evpn_rib (rd);
+CREATE INDEX IF NOT EXISTS evpn_rib_route_type_idx ON evpn_rib (route_type);
+CREATE INDEX IF NOT EXISTS evpn_rib_mac_idx ON evpn_rib (mac);
+CREATE INDEX IF NOT EXISTS evpn_rib_extcomm_idx ON evpn_rib USING gin (ext_community_list);
+CREATE INDEX IF NOT EXISTS evpn_rib_timestamp_idx ON evpn_rib ("timestamp");
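To illustrate how the indexes on `evpn_rib` might be used once a consumer is writing to it, here are two example queries. They are a sketch only. The route-target literal `'rt=65000:100'` is hypothetical, because the diff does not show the string format the consumer stores in `ext_community_list`. The `route_type = 2` filter assumes the consumer uses the RFC 7432 numbering, where type 2 is a MAC/IP Advertisement route.

```sql
-- Active (non-withdrawn) EVPN routes per peer and route type.
-- route_type is nullable, so rows the consumer could not classify
-- appear as a NULL group.
SELECT peer_hash_id, route_type, count(*) AS routes
FROM evpn_rib
WHERE NOT iswithdrawn
GROUP BY peer_hash_id, route_type
ORDER BY routes DESC;

-- MAC/IP routes that carry a given route-target.
-- The @> (array contains) operator can use the GIN index on
-- ext_community_list. 'rt=65000:100' is a placeholder value; substitute
-- whatever format the consumer actually writes.
SELECT rd, mac, ip, mpls_label1 AS vni_or_label
FROM evpn_rib
WHERE route_type = 2
  AND ext_community_list @> ARRAY['rt=65000:100']::varchar[]
  AND NOT iswithdrawn;
```

The first query is a quick sanity check that the consumer is populating `route_type`; a large NULL group suggests its field-based derivation is missing cases.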