Add evpn_rib schema; update production sizing with measured data

- postgres/scripts/007_obmp_evpn.sql: the evpn_rib landing table
  (roadmap E5 step 1), applied to the live DB. Mirrors l3vpn_rib;
  a dedicated consumer will populate it.
- production-sizing.md: corrected retention figures to the actual
  policy values, added a measured-data section (one full feed ≈
  +5 GB current state; DB now ~30 GB), and a horizontal-scaling
  section: the bottleneck is the psql-app consumer + disk IOPS, so
  scale psql-app as a Kafka consumer group (cap = partition count) and
  treat multi-collector as HA/locality, not throughput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sam 2026-05-19 08:44:09 -07:00
parent c18d11a48f
commit 2d83d6c02e
2 changed files with 98 additions and 4 deletions

@@ -14,11 +14,30 @@ Derived from the OpenBMP `psql-app` sizing guidance and measured lab behavior.
| Routes per full feed | ~1.2M (≈1M IPv4 + ~0.2M IPv6) |
| **Estimated total NLRIs** | **~100–150M** in Adj-RIB-In |
| Telemetry | gNMI via Telegraf → InfluxDB, ~50–200 interfaces/router, 10 s interval |
| History retention | `ip_rib_log` 4 weeks, LS logs 4 months, `peer_event_log` 1 year |
| History retention | `ip_rib_log` 2 months, LS logs 8 weeks, `peer_event_log` 4 months (lab policy defaults; tunable) |
The NLRI estimate (40 × ~2.5 feeds × 1.2M) places this deployment at the top
of the OpenBMP `psql-app` guidance tier (150M NLRIs → 64 GB heap).
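
The arithmetic behind that estimate can be checked with a short sketch. The router count, feeds per router, and routes per feed are the figures quoted in this document; the variable names and the tier comparison are illustrative assumptions, not part of OpenBMP.

```python
# Back-of-envelope NLRI estimate using the figures quoted in this document.
ROUTERS = 40                  # production router count
FEEDS_PER_ROUTER = 2.5        # approximate full-feed equivalents per router
ROUTES_PER_FEED = 1_200_000   # ~1M IPv4 + ~0.2M IPv6

# Upper bound of the psql-app guidance tier cited above (150M NLRIs -> 64 GB heap).
TIER_NLRI_LIMIT = 150_000_000


def estimated_nlris(routers, feeds_per_router, routes_per_feed):
    """Return the estimated number of NLRIs held in Adj-RIB-In."""
    return routers * feeds_per_router * routes_per_feed


total = estimated_nlris(ROUTERS, FEEDS_PER_ROUTER, ROUTES_PER_FEED)
tier_fraction = total / TIER_NLRI_LIMIT

print(f"estimated NLRIs: {total / 1e6:.0f}M")
print(f"share of the 150M tier: {tier_fraction:.0%}")
```

With these inputs the estimate lands at 120M, about 80% of the 150M tier, which is why the deployment sits near the top of that tier rather than beyond it.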
## Measured data point (lab, 2026)
Real numbers from the lab after adding **one** full-table feed (GoBGP →
AS57355, ~1.04M IPv4 + ~0.25M IPv6 routes):
| Metric | Before feed | After 1 full feed |
|--------|-------------|-------------------|
| `openbmp` DB size | ~25 GB | **~30 GB** |
| `ip_rib` (current state) | small | 5.3 GB |
| `ip_rib_log` (history hypertable) | — | 7.75 GB, 82/97 chunks compressed |
| `base_attrs` | ~1 GB | 2.3 GB |
| `geo_ip` (fixed reference data) | 8.8 GB | 8.8 GB |
So **one full feed ≈ +5 GB current-state**, plus history that accrues against
the 2-month `ip_rib_log` retention. The ~1.3M-route initial dump ingested in
minutes with no Kafka consumer lag. Extrapolating linearly, 40 routers × ~2.5
feeds ≈ 100 feed-equivalents → on the order of **0.5 TB current state** before
history and indexes; the 2–4 TB storage target below holds with headroom.
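
The linear extrapolation can be written out as a small sketch. It assumes every feed costs the same ~5 GB of current state as the one measured feed, and it reads the PostgreSQL storage target as a 2 to 4 TB range; both are simplifications, and the function and variable names are illustrative.

```python
# Linear extrapolation of current-state storage from the single measured feed.
GB_PER_FEED = 5.0                # measured: one full feed added ~5 GB of current state
ROUTERS = 40
FEEDS_PER_ROUTER = 2.5
STORAGE_TARGET_TB = (2.0, 4.0)   # assumption: storage target read as 2 to 4 TB


def current_state_tb(routers, feeds_per_router, gb_per_feed):
    """Extrapolate current-state size in TB, assuming linear growth per feed."""
    feed_equivalents = routers * feeds_per_router
    return feed_equivalents * gb_per_feed / 1000.0   # decimal TB


state_tb = current_state_tb(ROUTERS, FEEDS_PER_ROUTER, GB_PER_FEED)
low, high = STORAGE_TARGET_TB

print(f"current state: {state_tb:.2f} TB")
print(f"left for history and indexes: {low - state_tb:.2f} to {high - state_tb:.2f} TB")
```

Under these assumptions current state takes about 0.5 TB, leaving roughly 1.5 to 3.5 TB of the target for `ip_rib_log` history and indexes. Real feeds share attributes through `base_attrs`, so actual growth may be sub-linear.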
## BMP RIB scope — recommendation
**Deploy with Adj-RIB-In only.** It is the OpenBMP default, is what every
@@ -48,14 +67,15 @@ advertises. Alternatives and their cost:
| Store | Lab today | Production target | Notes |
|-------|-----------|-------------------|-------|
| **PostgreSQL** | 25 GB | **2–4 TB NVMe SSD** | `ip_rib` current state (~100–150M rows) + `ip_rib_log` history (4-week retention, the dominant grower) + `base_attrs` + `geo_ip` (~7 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom. |
| **PostgreSQL** | 30 GB | **2–4 TB NVMe SSD** | `ip_rib` current state (~100–150M rows) + `ip_rib_log` history (2-month retention, the dominant grower) + `base_attrs` + `geo_ip` (~9 GB fixed). OpenBMP guidance: 500 GB main + 1 TB TimescaleDB; add headroom. |
| **Kafka** | 0.2 GB | **100–500 GB** | 12 h retention; sized for full-table initial-dump bursts × 40 routers |
| **InfluxDB (telemetry)** | minimal | **50–200 GB** | 40 routers × ~50–200 interfaces × 10 s gNMI × 30 d; compresses well |
| **Total** | — | **~3–5 TB fast NVMe** | Use NVMe; PostgreSQL random-IO under churn is the bottleneck on slow disks |
Put the PostgreSQL data directory and the TimescaleDB tablespace on NVMe.
`ip_rib_log` 4-week retention is the main storage tuning knob — revisit once
production update volume is measured.
`ip_rib_log` retention (2 months in the lab) is the main storage tuning knob
— revisit once production update volume is measured; halving it roughly
halves the dominant history table.
## Architecture
@@ -72,6 +92,36 @@ Whichever layout: every service already carries a Compose `mem_limit` — raise
`PSQL_MEM_LIMIT` / `PSQL_APP_MEM_LIMIT` / `KAFKA_MEM_LIMIT` in `.env` for the
production hosts.
## Horizontal scaling — where it actually helps
The ingestion bottleneck is **not** the collector or Kafka — it is the
`psql-app` consumer writing to PostgreSQL, and ultimately **disk IOPS**.
Plan scaling accordingly:
- **Scale `psql-app` as a Kafka consumer group.** Run multiple `psql-app`
containers with the **same group ID**; Kafka rebalances partitions across
them and fails over automatically. This is the real throughput lever and
also provides HA. **Hard cap = Kafka partition count** — the compose sets
`KAFKA_NUM_PARTITIONS: 8`, so ≤ 8 useful instances. **Raise the partition
count before scaling past a few consumers** — it cannot easily be reduced
later.
- **Disk IOPS is the underlying bottleneck.** Target **≥ 5000 IOPS** (NVMe) for
the PostgreSQL store; this buys more headroom than any container count.
- **Multiple collectors are an HA / locality decision, not a throughput
one.** A BMP session is one stateful TCP connection and cannot be load
balanced — you distribute routers by pointing each router's `bmp server`
config at a specific collector. All collectors feed one Kafka. Shard
collectors for fault isolation / POP locality, not for performance, and
note a dead collector's routers go dark until reconfigured (no auto-
failover at the collector tier).
- Within one `psql-app`, writer threads already auto-scale per type
(`writer_max_threads_per_type`); the consumer group is the cross-instance
scaling layer on top of that.
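
The partition-count cap on the consumer group can be illustrated with a simplified model. This is a sketch only: it spreads partitions round-robin and is not Kafka's actual assignor, and the function names are hypothetical. The partition count of 8 matches the `KAFKA_NUM_PARTITIONS: 8` setting mentioned above.

```python
# Simplified model of how a Kafka consumer group spreads partitions.
# This is NOT Kafka's real assignor; it only illustrates the partition-count cap.

def assign_partitions(num_partitions, num_consumers):
    """Spread partitions across consumers round-robin; return a list of lists."""
    assignment = [[] for _ in range(num_consumers)]
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


def useful_consumers(num_partitions, num_consumers):
    """Number of consumers that receive at least one partition."""
    return min(num_partitions, num_consumers)


PARTITIONS = 8  # matches KAFKA_NUM_PARTITIONS: 8 in the compose file

for consumers in (1, 4, 8, 10):
    plan = assign_partitions(PARTITIONS, consumers)
    idle = sum(1 for parts in plan if not parts)
    busy = useful_consumers(PARTITIONS, consumers)
    print(f"{consumers:>2} consumers -> {busy} busy, {idle} idle")
```

In this model a ninth or tenth `psql-app` instance receives no partitions and adds only standby capacity, which is why the partition count has to be raised before scaling out further.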
Bursts (every collector restart triggers simultaneous full-table dumps from
all peers) are absorbed by Kafka — size Kafka retention so a slow consumer
never loses data during a convergence storm.
## PostgreSQL tuning
- `shared_buffers` ≈ 25% of host RAM; large `effective_cache_size`.

@@ -0,0 +1,44 @@
-- BGP EVPN RIB table (roadmap E5)
--
-- The OpenBMP collector already decodes EVPN and emits the
-- 'openbmp.parsed.evpn' Kafka topic, but the stock psql-app consumer never
-- subscribes to it and the base schema has no table for it. This table is
-- the landing zone; a dedicated consumer (obmp-evpn-consumer, separate)
-- subscribes to the topic and writes here.
--
-- Mirrors l3vpn_rib conventions. route_type is derived by the consumer from
-- which fields are populated (the parsed EVPN message has no explicit type),
-- so it is nullable.
CREATE TABLE IF NOT EXISTS evpn_rib (
hash_id uuid NOT NULL,
base_attr_hash_id uuid,
peer_hash_id uuid NOT NULL,
rd varchar(128) NOT NULL,
rd_type smallint,
route_type smallint, -- EVPN route type 1..5
origin_as bigint,
eth_segment_id varchar(255), -- ESI
eth_tag_id bigint,
mac macaddr,
mac_len smallint,
ip inet,
ip_len smallint,
orig_router_ip inet,
mpls_label1 bigint, -- VXLAN VNI when encap = vxlan
mpls_label2 bigint,
ext_community_list varchar(50)[], -- route-targets
path_id bigint,
timestamp timestamp(6) without time zone NOT NULL DEFAULT (now() AT TIME ZONE 'utc'),
first_added_timestamp timestamp(6) without time zone NOT NULL DEFAULT (now() AT TIME ZONE 'utc'),
iswithdrawn boolean NOT NULL DEFAULT false,
isprepolicy boolean NOT NULL DEFAULT true,
isadjribin boolean NOT NULL DEFAULT true,
PRIMARY KEY (peer_hash_id, hash_id)
);
CREATE INDEX IF NOT EXISTS evpn_rib_hash_id_idx ON evpn_rib (hash_id);
CREATE INDEX IF NOT EXISTS evpn_rib_base_attr_idx ON evpn_rib (base_attr_hash_id);
CREATE INDEX IF NOT EXISTS evpn_rib_rd_idx ON evpn_rib (rd);
CREATE INDEX IF NOT EXISTS evpn_rib_route_type_idx ON evpn_rib (route_type);
CREATE INDEX IF NOT EXISTS evpn_rib_mac_idx ON evpn_rib (mac);
CREATE INDEX IF NOT EXISTS evpn_rib_extcomm_idx ON evpn_rib USING gin (ext_community_list);
CREATE INDEX IF NOT EXISTS evpn_rib_timestamp_idx ON evpn_rib ("timestamp");
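
The header comment in the SQL file says `route_type` is derived by the consumer from which fields are populated. One possible heuristic is sketched below in Python. It is a hypothetical illustration, not the actual `obmp-evpn-consumer` logic: the dictionary keys reuse the `evpn_rib` column names, the function name is invented, and it assumes absent fields arrive as empty or missing values (an all-zero ESI would need normalising to empty first).

```python
# Hypothetical heuristic: infer the EVPN route type from populated columns.
# Not the actual obmp-evpn-consumer logic; keys follow the evpn_rib columns.

def derive_route_type(row):
    """Return an EVPN route type (1..5) or None when the fields are inconclusive."""
    has_mac = bool(row.get("mac"))
    has_ip = bool(row.get("ip"))
    has_esi = bool(row.get("eth_segment_id"))
    has_orig = bool(row.get("orig_router_ip"))

    if has_mac:
        return 2          # MAC/IP Advertisement carries a MAC address
    if has_orig:
        # Ethernet Segment (4) and Inclusive Multicast (3) routes both carry
        # an originating router IP; only type 4 carries an ESI.
        return 4 if has_esi else 3
    if has_ip:
        return 5          # IP Prefix route: a prefix but no MAC
    if has_esi:
        return 1          # Ethernet Auto-Discovery: ESI plus Ethernet tag
    return None           # stored as NULL in evpn_rib.route_type


print(derive_route_type({"mac": "00:11:22:33:44:55", "ip": "10.0.0.1"}))
print(derive_route_type({"ip": "10.1.0.0"}))
print(derive_route_type({}))
```

Returning `None` for inconclusive rows is consistent with the column being nullable, so ambiguous messages are stored rather than dropped.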