sam 0732ebfa07 Add production-readiness deliverables: security, backup, alerting

Adds a prioritized security-hardening checklist, a PostgreSQL logical-backup
script (pg-backup.sh) with a documented restore procedure, and Grafana
alerting provisioning (peer-down, flap-storm, RPKI-invalid, router-down rules
plus a contact-point template). The alerting YAML and contact points need
operator review before being relied on for paging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 20:55:03 -07:00


OpenBMP Backup & Restore

How to back up and restore the OpenBMP PostgreSQL database, what the backup covers, and what it deliberately does not.

What `scripts/pg-backup.sh` backs up

The script runs pg_dump inside the obmp-psql container and produces a single timestamped, compressed, custom-format dump of the entire openbmp database:

All BMP/BGP operational tables — routers, bgp_peers, ip_rib, base_attrs, global_ip_rib, l3vpn_rib, the ls_* link-state tables.
All history / TimescaleDB hypertables — ip_rib_log, peer_event_log, stat_reports, and the stats_* aggregate tables.
Reference / enrichment data — geo_ip, info_asn, info_route, rpki_validator, pdb_exchange_peers.
Schema objects — table definitions, indexes, views, functions, triggers, enum types, and the TimescaleDB hypertable configuration.

The dump is taken against a live database: pg_dump reads from a single MVCC snapshot, so neither downtime nor a service stop is required. The dump is written atomically (to a .partial file that is renamed on success), so an interrupted run never leaves a truncated dump that looks valid.

Output: ${OBMP_DATA_ROOT:-/var/openbmp}/backups/openbmp-YYYYMMDD-HHMMSS.dump
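The write-then-rename behaviour described above can be sketched as follows. This is an illustration under assumptions, not the actual contents of pg-backup.sh: the helper name dump_atomically is hypothetical, and the pg_dump flags in the usage comment are inferred from the rest of this document (custom format, --no-owner, --no-privileges).

```shell
# Sketch only: run a command, capture its stdout in DEST.partial, and rename
# the file to DEST only if the command succeeded.
# dump_atomically is a hypothetical helper, not a function from pg-backup.sh.
dump_atomically() {
  local dest="$1"; shift
  local partial="${dest}.partial"
  if "$@" > "${partial}"; then
    # A rename within one filesystem is atomic, so DEST is either absent or complete.
    mv -f -- "${partial}" "${dest}"
  else
    local rc=$?
    rm -f -- "${partial}"
    return "${rc}"
  fi
}

# Possible usage (requires Docker and the obmp-psql container):
# dump_atomically "/var/openbmp/backups/openbmp-$(date +%Y%m%d-%H%M%S).dump" \
#   docker exec obmp-psql pg_dump -U openbmp -d openbmp \
#     --format=custom --no-owner --no-privileges
```

Because the .partial name does not match the openbmp-*.dump pattern, a leftover partial file from a killed run would not be mistaken for a finished dump.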

TimescaleDB note

The OpenBMP database uses TimescaleDB hypertables (ip_rib_log, peer_event_log, and the stats_* tables, with compression policies). A pg_dump logical backup captures the _timescaledb_catalog metadata, so the hypertable structure, chunks, and compression settings are recreated on restore. No special flags are needed for the dump. The restore target must have the TimescaleDB extension available, ideally at the same version that produced the dump. The openbmp/postgres image provides the extension, so restoring into a fresh obmp-psql built from the same image is expected to work. TimescaleDB's own backup documentation additionally recommends calling timescaledb_pre_restore() before pg_restore and timescaledb_post_restore() after it; check that guidance against the TimescaleDB version in your image.

Scheduling

Make the script executable once:

chmod +x scripts/pg-backup.sh

Add a cron entry (crontab -e) — daily at 02:30, logging to a file:

30 2 * * * OBMP_DATA_ROOT=/var/openbmp /home/user/obmp-docker/scripts/pg-backup.sh >> /var/openbmp/backups/pg-backup.log 2>&1

The cron user must be able to reach the Docker daemon — run it as a user in the docker group, or as root. A systemd timer is an equally valid alternative.
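For the systemd timer alternative, a pair of units along these lines could replace the cron entry. This is a sketch under assumptions: the unit name obmp-pg-backup and the helper write_backup_units are hypothetical, docker.service is assumed to be the name of the Docker unit on the host, and the script path is the one used in the cron example.

```shell
# Sketch only: writes a oneshot service and a daily 02:30 timer into DIR.
# Unit names are hypothetical; adjust the paths to your checkout.
write_backup_units() {
  local dir="$1"
  cat > "${dir}/obmp-pg-backup.service" <<'EOF'
[Unit]
Description=OpenBMP PostgreSQL logical backup
Wants=docker.service
After=docker.service

[Service]
Type=oneshot
Environment=OBMP_DATA_ROOT=/var/openbmp
ExecStart=/home/user/obmp-docker/scripts/pg-backup.sh
EOF
  cat > "${dir}/obmp-pg-backup.timer" <<'EOF'
[Unit]
Description=Daily OpenBMP PostgreSQL backup

[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

Running write_backup_units /etc/systemd/system as root, followed by systemctl daemon-reload and systemctl enable --now obmp-pg-backup.timer, would activate it. With this setup the script's output goes to the journal rather than to pg-backup.log, and Persistent=true makes systemd run a missed backup after the host comes back up.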

Configuration

All settings are environment variables with sensible defaults:

| Variable | Default | Purpose |
|---|---|---|
| `OBMP_DATA_ROOT` | `/var/openbmp` | Base data dir; backups go to `${OBMP_DATA_ROOT}/backups` |
| `OBMP_BACKUP_DIR` | (unset) | Explicit backup dir, overrides the default |
| `OBMP_PG_CONTAINER` | `obmp-psql` | Postgres container name |
| `OBMP_PG_DB` | `openbmp` | Database name |
| `OBMP_PG_USER` | `openbmp` | Database user |
| `OBMP_BACKUP_RETENTION_DAYS` | `14` | Dumps older than this are pruned each run |
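To see how these defaults combine, the snippet below prints the effective settings for the current environment. It is a sketch that mirrors the table, not code taken from pg-backup.sh; the function name show_backup_config is hypothetical, and it assumes an empty variable is treated the same as an unset one.

```shell
# Sketch only: resolves each setting the way the table describes.
# ${VAR:-default} falls back to the default when VAR is unset or empty.
show_backup_config() {
  printf 'backup_dir=%s\n' "${OBMP_BACKUP_DIR:-${OBMP_DATA_ROOT:-/var/openbmp}/backups}"
  printf 'container=%s\n' "${OBMP_PG_CONTAINER:-obmp-psql}"
  printf 'database=%s\n' "${OBMP_PG_DB:-openbmp}"
  printf 'user=%s\n' "${OBMP_PG_USER:-openbmp}"
  printf 'retention_days=%s\n' "${OBMP_BACKUP_RETENTION_DAYS:-14}"
}
```

Running it with the same environment as the cron job is a quick way to confirm where a run will write before scheduling it.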

Retention only prunes files matching the script's own openbmp-*.dump naming pattern — nothing else in the directory is touched.
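A pattern-scoped prune of this kind can be sketched with find. The function name prune_old_dumps is hypothetical and the real script may implement retention differently; the sketch assumes age is judged by file modification time.

```shell
# Sketch only: delete dumps matching the script's naming pattern that are
# older than DAYS, leaving every other file in the directory alone.
prune_old_dumps() {
  local dir="$1" days="$2"
  # -maxdepth 1 keeps find out of subdirectories; -mtime +N matches files
  # whose age in whole 24-hour periods is greater than N.
  find "${dir}" -maxdepth 1 -type f -name 'openbmp-*.dump' -mtime +"${days}" -delete
}

# Possible usage:
# prune_old_dumps /var/openbmp/backups 14
```

Replacing -delete with -print gives a dry run that lists what would be removed.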

Production recommendations

Copy dumps off-host. A local backup does not survive host loss. Sync the backup directory to object storage / a backup server (e.g. nightly rclone, restic, or your existing ISP backup tooling).
Size the backup volume — at production scale (~100–150M NLRIs) the dump can be tens of GB even compressed. See docs/production-sizing.md.
Test restores periodically — an untested backup is not a backup.
For tighter RPO than once-daily logical dumps, consider PostgreSQL continuous archiving / PITR (WAL archiving + pg_basebackup). That is out of scope for this script but worth planning for a production deployment.
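As one way to handle the off-host copy, the sketch below uses rclone. It assumes rclone is installed and that a remote has already been configured; the remote name offsite:obmp-backups and the function name copy_dumps_offhost are hypothetical.

```shell
# Sketch only: copy finished dumps to a pre-configured rclone remote.
# The --include filter skips .partial files and the log file.
copy_dumps_offhost() {
  local src="$1" dest="$2"
  rclone copy "${src}" "${dest}" --include 'openbmp-*.dump'
}

# Possible usage, for example chained after the backup in cron:
# copy_dumps_offhost /var/openbmp/backups offsite:obmp-backups
```

rclone copy does not delete anything on the destination, so local retention pruning would not shorten the off-host history; off-host retention then needs its own policy.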

Restore procedure

This restores a dump into a fresh, empty obmp-psql database. Restoring over a populated database risks conflicts — start clean.

1. Stop the writers

Stop the services that write to the database so nothing races the restore:

docker compose -p obmp stop psql-app collector

Leave obmp-psql running.

2. Recreate an empty database

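Drop and recreate the openbmp database inside the running container. DROP DATABASE fails while any other session is still connected to openbmp, so close remaining clients (dashboards, ad-hoc psql sessions) first: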

docker exec -i obmp-psql psql -U openbmp -d postgres <<'EOSQL'
DROP DATABASE IF EXISTS openbmp;
CREATE DATABASE openbmp OWNER openbmp;
EOSQL

Restoring into a brand-new container? Bring obmp-psql up first and let it initialize, but do not create the config/init_db trigger file — the schema comes from the dump, not from psql-app's first-run migration.

3. Restore the dump

Copy the dump into the container and run pg_restore:

DUMP=/var/openbmp/backups/openbmp-YYYYMMDD-HHMMSS.dump

docker cp "${DUMP}" obmp-psql:/tmp/restore.dump

docker exec -i obmp-psql \
  pg_restore -U openbmp -d openbmp --no-owner --no-privileges \
             --jobs=4 /tmp/restore.dump

docker exec obmp-psql rm -f /tmp/restore.dump

--no-owner --no-privileges: the dump was created with the same flags, so objects are recreated owned by the connecting role.
--jobs=4: parallel restore, which custom-format dumps support; raise it on a many-core host to speed up the large ip_rib / ip_rib_log tables. TimescaleDB's documentation advises against parallel restore for databases with hypertables, so if hypertables come back incomplete, rerun without --jobs.
Some non-fatal warnings (e.g. about the TimescaleDB extension or existing objects) are normal. pg_restore exits non-zero whenever it skipped an error, even if the rest of the restore succeeded, so read the output before assuming either success or failure.

Alternatively, stream the restore without docker cp:

docker exec -i obmp-psql pg_restore -U openbmp -d openbmp \
  --no-owner --no-privileges < "${DUMP}"

(Streaming via stdin disables --jobs parallelism — use docker cp for large dumps.)
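If you want to follow TimescaleDB's documented restore sequence, the restore can be wrapped in its pre- and post-restore calls, roughly as below. This is a sketch, not part of the procedure above: the function name restore_with_timescale_hooks is hypothetical, it assumes the dump was already copied into the container with docker cp, and it assumes the openbmp role is allowed to create the extension.

```shell
# Sketch only: pg_restore bracketed by TimescaleDB's restore hooks.
restore_with_timescale_hooks() {
  local container="$1" dump_in_container="$2"
  # Put the extension in place and switch the database into restore mode.
  docker exec -i "${container}" psql -U openbmp -d openbmp -v ON_ERROR_STOP=1 \
    -c 'CREATE EXTENSION IF NOT EXISTS timescaledb;' \
    -c 'SELECT timescaledb_pre_restore();' || return 1
  # Single-job restore; pg_restore may exit non-zero on errors it skipped.
  docker exec -i "${container}" pg_restore -U openbmp -d openbmp \
    --no-owner --no-privileges "${dump_in_container}"
  local rc=$?
  # Always leave restore mode, even if pg_restore reported problems.
  docker exec -i "${container}" psql -U openbmp -d openbmp -v ON_ERROR_STOP=1 \
    -c 'SELECT timescaledb_post_restore();' || return 1
  return "${rc}"
}

# Possible usage:
# docker cp "${DUMP}" obmp-psql:/tmp/restore.dump
# restore_with_timescale_hooks obmp-psql /tmp/restore.dump
```

The function returns pg_restore's exit status, so a non-zero result still calls for reading the output rather than treating the restore as failed outright.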

4. Verify

docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
  SELECT (SELECT count(*) FROM routers)    AS routers,
         (SELECT count(*) FROM bgp_peers)  AS peers,
         (SELECT count(*) FROM ip_rib)     AS rib_rows;"

Confirm hypertables came back:

docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
  SELECT hypertable_name FROM timescaledb_information.hypertables;"
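For a scripted restore test, the hypertable check can be turned into a pass/fail step. The sketch below assumes the same container, role, and database names as the commands above; the function name check_restore is hypothetical.

```shell
# Sketch only: fail if the restored database reports no hypertables.
check_restore() {
  local container="$1" n
  # -A (unaligned) and -t (tuples only) make psql print just the number.
  n="$(docker exec -i "${container}" psql -U openbmp -d openbmp -At \
        -c 'SELECT count(*) FROM timescaledb_information.hypertables;')" || return 1
  if [ "${n}" -gt 0 ]; then
    echo "ok: ${n} hypertables present"
  else
    echo "FAIL: no hypertables found after restore" >&2
    return 1
  fi
}

# Possible usage:
# check_restore obmp-psql
```

A count of zero usually means the tables came back as plain PostgreSQL tables, which is worth catching before the writers are restarted.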

5. Restart the writers

docker compose -p obmp start collector psql-app

The collector reconnects to the routers' BMP sessions and psql-app resumes consuming from Kafka. Live state catches up from the routers.

What is NOT covered

This backup is PostgreSQL only. The following are out of scope and need their own handling:

Kafka data is transient. The obmp-kafka topics are a short-retention pipeline buffer (KAFKA_LOG_RETENTION_MINUTES: 720 — 12 hours). They are not a system of record and do not need backing up. After a restore, routers re-send BMP and the pipeline refills naturally.

InfluxDB telemetry has its own backup. The gNMI streaming-telemetry data lives in obmp-influxdb (bucket telemetry), not in PostgreSQL. pg_dump does not touch it. Back it up separately with the Influx CLI:

# Backup
docker exec obmp-influxdb influx backup /var/lib/influxdb2/backup \
  --token "$INFLUXDB_ADMIN_TOKEN"
docker cp obmp-influxdb:/var/lib/influxdb2/backup \
  /var/openbmp/backups/influxdb-$(date +%Y%m%d)

# Restore
docker cp /var/openbmp/backups/influxdb-YYYYMMDD \
  obmp-influxdb:/var/lib/influxdb2/restore
docker exec obmp-influxdb influx restore /var/lib/influxdb2/restore \
  --token "$INFLUXDB_ADMIN_TOKEN"

Telemetry is also less critical than BMP data (30-day retention, data-plane counters) — back it up if you need historical telemetry to survive a host loss; otherwise the 30-day window simply re-fills.

Grafana — dashboards and datasources are provisioned from files in the repo (obmp-grafana/provisioning/ and obmp-grafana/dashboards/), so they are already version-controlled in git. The Grafana database under ${OBMP_DATA_ROOT}/grafana (users, preferences, manually-created dashboards, alert state) is not covered by this script — back up that directory separately if it holds anything not reproducible from the repo.
Configuration & secrets — .env, docker-compose.yml, and the ${OBMP_DATA_ROOT}/config directory. Keep these in version control / your secrets manager.
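If the Grafana directory does hold state worth keeping, a plain archive of it is one option. The sketch below is an assumption-laden example: the function name backup_grafana_dir and the archive naming are hypothetical, and copying Grafana's database file while Grafana is running may capture an inconsistent state, so stopping the Grafana container first is the safer choice.

```shell
# Sketch only: archive ${DATA_ROOT}/grafana into OUT_DIR, using the same
# write-then-rename approach as the PostgreSQL dump.
backup_grafana_dir() {
  local data_root="$1" out_dir="$2"
  local dest="${out_dir}/grafana-$(date +%Y%m%d).tar.gz"
  tar -czf "${dest}.partial" -C "${data_root}" grafana \
    && mv -f -- "${dest}.partial" "${dest}"
}

# Possible usage:
# backup_grafana_dir /var/openbmp /var/openbmp/backups
```

The archive name does not match openbmp-*.dump, so the PostgreSQL retention pruning leaves it alone and these archives need their own cleanup.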
