obmp-docker/docs/backup-restore.md
sam 0732ebfa07 Add production-readiness deliverables: security, backup, alerting
Adds a prioritized security-hardening checklist, a PostgreSQL logical-backup
script (pg-backup.sh) with a documented restore procedure, and Grafana
alerting provisioning (peer-down, flap-storm, RPKI-invalid, router-down rules
plus a contact-point template). The alerting YAML and contact points need
operator review before being relied on for paging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 20:55:03 -07:00


# OpenBMP Backup & Restore
How to back up and restore the OpenBMP PostgreSQL database, what the backup
covers, and what it deliberately does not.

---
## What `scripts/pg-backup.sh` backs up
The script runs `pg_dump` inside the `obmp-psql` container and produces a
single timestamped, compressed, custom-format dump of the **entire `openbmp`
database**:
- All BMP/BGP operational tables — `routers`, `bgp_peers`, `ip_rib`,
`base_attrs`, `global_ip_rib`, `l3vpn_rib`, the `ls_*` link-state tables.
- All history / TimescaleDB hypertables — `ip_rib_log`, `peer_event_log`,
`stat_reports`, and the `stats_*` aggregate tables.
- Reference / enrichment data — `geo_ip`, `info_asn`, `info_route`,
`rpki_validator`, `pdb_exchange_peers`.
- Schema objects — table definitions, indexes, views, functions, triggers,
enum types, and the TimescaleDB hypertable configuration.
The dump is taken against a **live database**: `pg_dump` uses an MVCC
snapshot, so no downtime and no service stop is required. It is written
atomically (to a `.partial` file, renamed on success) so an interrupted run
never leaves a dump that looks valid but is truncated.
Output: `${OBMP_DATA_ROOT:-/var/openbmp}/backups/openbmp-YYYYMMDD-HHMMSS.dump`
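
The write-then-rename behaviour described above can be sketched in a few lines of shell. This illustrates the pattern only and is not the contents of `scripts/pg-backup.sh`; the helper name `atomic_dump` and the cleanup of the `.partial` file on failure are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch of the write-then-rename pattern (hypothetical helper, not pg-backup.sh).
# Usage: atomic_dump <target-file> <command...>
# Captures the command's stdout in "<target>.partial" and renames it to
# <target> only if the command exits successfully.
atomic_dump() {
  local target="$1"; shift
  local partial="${target}.partial"
  if "$@" > "${partial}"; then
    mv -f "${partial}" "${target}"   # rename is atomic within one filesystem
  else
    local rc=$?                      # exit status of the failed command
    rm -f "${partial}"               # assumption: discard the truncated file
    return "${rc}"
  fi
}

# Example (illustrative flags; -Fc selects pg_dump's custom format):
#   atomic_dump "/var/openbmp/backups/openbmp-$(date +%Y%m%d-%H%M%S).dump" \
#     docker exec obmp-psql pg_dump -U openbmp -d openbmp -Fc
```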
### TimescaleDB note
The OpenBMP database uses TimescaleDB hypertables (`ip_rib_log`,
`peer_event_log`, the `stats_*` tables, with compression policies).
The dump itself needs no special flags: `pg_dump` captures the
`_timescaledb_catalog` metadata, so the hypertable structure, chunks, and
compression settings can be recreated on restore. The restore side has two
requirements. First, the **restore target must have the TimescaleDB extension
available**, ideally at the same version that produced the dump; the
`openbmp/postgres` image provides it, so a fresh `obmp-psql` built from the
same image qualifies. Second, TimescaleDB's documented restore procedure wraps
`pg_restore` in calls to `timescaledb_pre_restore()` and
`timescaledb_post_restore()`; check the TimescaleDB documentation for the
version you run before relying on a restore that skips them.

---
## Scheduling
Make the script executable once:
```bash
chmod +x scripts/pg-backup.sh
```
Add a cron entry (`crontab -e`) — daily at 02:30, logging to a file:
```cron
30 2 * * * OBMP_DATA_ROOT=/var/openbmp /home/user/obmp-docker/scripts/pg-backup.sh >> /var/openbmp/backups/pg-backup.log 2>&1
```
The cron user must be able to reach the Docker daemon — run it as a user in
the `docker` group, or as root. A systemd timer is an equally valid
alternative.
### Configuration
All settings are environment variables with sensible defaults:
| Variable | Default | Purpose |
|----------|---------|---------|
| `OBMP_DATA_ROOT` | `/var/openbmp` | Base data dir; backups go to `${OBMP_DATA_ROOT}/backups` |
| `OBMP_BACKUP_DIR` | (unset) | Explicit backup dir, overrides the default |
| `OBMP_PG_CONTAINER` | `obmp-psql` | Postgres container name |
| `OBMP_PG_DB` | `openbmp` | Database name |
| `OBMP_PG_USER` | `openbmp` | Database user |
| `OBMP_BACKUP_RETENTION_DAYS` | `14` | Dumps older than this are pruned each run |
Retention only prunes files matching the script's own `openbmp-*.dump`
naming pattern — nothing else in the directory is touched.
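
How the directory defaults and the retention rule fit together can be sketched as follows. This is a minimal illustration that assumes the variable precedence shown in the table; the function names `resolve_backup_dir` and `prune_old_dumps` are hypothetical, and the real logic lives in `scripts/pg-backup.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers mirroring the configuration table above.

# Print the effective backup directory: OBMP_BACKUP_DIR wins if set and
# non-empty, otherwise ${OBMP_DATA_ROOT}/backups (default root /var/openbmp).
resolve_backup_dir() {
  printf '%s\n' "${OBMP_BACKUP_DIR:-${OBMP_DATA_ROOT:-/var/openbmp}/backups}"
}

# Delete only files named openbmp-*.dump that are older than the retention
# window. find's "-mtime +N" matches files modified more than N full
# 24-hour periods ago, so nothing newer than N days is ever removed.
prune_old_dumps() {
  local dir="$1"
  local days="${2:-${OBMP_BACKUP_RETENTION_DAYS:-14}}"
  find "${dir}" -maxdepth 1 -type f -name 'openbmp-*.dump' \
    -mtime "+${days}" -delete
}

# Example:
#   prune_old_dumps "$(resolve_backup_dir)"
```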
### Production recommendations
- **Copy dumps off-host.** A local backup does not survive host loss. Sync
the backup directory to object storage / a backup server (e.g. nightly
`rclone`, `restic`, or your existing ISP backup tooling).
- **Size the backup volume.** At production scale (~100–150M NLRIs) the
dump can be tens of GB even compressed. See `docs/production-sizing.md`.
- **Test restores periodically** — an untested backup is not a backup.
- For tighter RPO than once-daily logical dumps, consider PostgreSQL
continuous archiving / PITR (WAL archiving + `pg_basebackup`). That is out
of scope for this script but worth planning for a production deployment.
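
When dumps are copied off-host, a checksum written next to each dump gives the receiving side a cheap way to confirm the copy is intact. The sketch below assumes GNU `sha256sum` is available on both hosts; `write_checksum` and `verify_checksum` are hypothetical helper names, not part of `scripts/pg-backup.sh`. It checks file integrity only and is no substitute for a test restore.

```shell
#!/usr/bin/env bash
# Sketch: sidecar checksums for off-host copies (hypothetical helpers).

# Write "<dump>.sha256" beside the dump. The checksum file records only the
# base name, so it stays valid after the pair is copied to another directory.
write_checksum() {
  local dump="$1"
  ( cd "$(dirname "${dump}")" &&
    sha256sum "$(basename "${dump}")" > "$(basename "${dump}").sha256" )
}

# Return 0 if the dump still matches its recorded checksum, non-zero otherwise.
verify_checksum() {
  local dump="$1"
  ( cd "$(dirname "${dump}")" &&
    sha256sum --check --quiet "$(basename "${dump}").sha256" )
}

# Example: run write_checksum after each backup, then verify_checksum on the
# backup server once the sync has finished.
```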
---
## Restore procedure
This restores a dump into a **fresh, empty** `obmp-psql` database. Restoring
over a populated database risks conflicts — start clean.
### 1. Stop the writers
Stop the services that write to the database so nothing races the restore:
```bash
docker compose -p obmp stop psql-app collector
```
Leave `obmp-psql` running.
### 2. Recreate an empty database
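Drop and recreate the `openbmp` database inside the running container.
`DROP DATABASE` fails while any session is still connected to `openbmp`, so
close remaining clients (for example Grafana or an open `psql` shell) first: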
```bash
docker exec -i obmp-psql psql -U openbmp -d postgres <<'EOSQL'
DROP DATABASE IF EXISTS openbmp;
CREATE DATABASE openbmp OWNER openbmp;
EOSQL
```
> Restoring into a **brand-new container**? Bring `obmp-psql` up first and let
> it initialize, but **do not** create the `config/init_db` trigger file —
> the schema comes from the dump, not from psql-app's first-run migration.
### 3. Restore the dump
Copy the dump into the container and run `pg_restore`:
```bash
DUMP=/var/openbmp/backups/openbmp-YYYYMMDD-HHMMSS.dump
docker cp "${DUMP}" obmp-psql:/tmp/restore.dump
docker exec -i obmp-psql \
pg_restore -U openbmp -d openbmp --no-owner --no-privileges \
--jobs=4 /tmp/restore.dump
docker exec obmp-psql rm -f /tmp/restore.dump
```
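- `--no-owner --no-privileges`: the dump was created with the same flags;
objects are recreated owned by the connecting role.
- `--jobs=4`: parallel restore, which custom-format dumps support; raise it
on a many-core host to speed up the large `ip_rib` / `ip_rib_log` tables.
TimescaleDB's documentation cautions against `-j` when restoring hypertables,
so if hypertables or chunks come back incomplete, rerun without `--jobs`.
- `pg_restore` continues past errors by default and then exits non-zero with
a count of ignored errors. Some of these (e.g. about the TimescaleDB
extension or objects that already exist) are expected, but read the output
and confirm each one is benign before treating the restore as good.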
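
The `DUMP` path above contains a placeholder timestamp. One way to pick the most recent dump is sketched below; it assumes dumps keep the `openbmp-YYYYMMDD-HHMMSS.dump` naming, so that sorting by name is the same as sorting by time. The function name `latest_dump` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: select the newest completed dump by file name (hypothetical helper).
# Files still being written end in ".partial" and do not match the glob.
latest_dump() {
  local dir="${1:-${OBMP_DATA_ROOT:-/var/openbmp}/backups}"
  local newest="" f
  for f in "${dir}"/openbmp-*.dump; do
    [ -e "${f}" ] || continue   # the glob matched nothing
    newest="${f}"               # globs expand in sorted order; last is newest
  done
  [ -n "${newest}" ] || return 1
  printf '%s\n' "${newest}"
}

# Example:
#   DUMP="$(latest_dump)" || { echo "no dump found" >&2; exit 1; }
```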
Alternatively, stream the restore without `docker cp`:
```bash
docker exec -i obmp-psql pg_restore -U openbmp -d openbmp \
--no-owner --no-privileges < "${DUMP}"
```
(Streaming via stdin disables `--jobs` parallelism — use `docker cp` for
large dumps.)
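
TimescaleDB's documentation describes bracketing `pg_restore` with `timescaledb_pre_restore()` and `timescaledb_post_restore()`. The sketch below shows that sequence under several assumptions: the container, role, and database names used in this document, an `openbmp` role that is allowed to create the extension, and a dump already copied to `/tmp/restore.dump` inside the container. The `DOCKER` variable and both function names are hypothetical, added so the sequence can be dry-run with `DOCKER=echo`, and `--jobs` is left out on purpose. Treat it as a starting point and check it against the TimescaleDB documentation for your version.

```shell
#!/usr/bin/env bash
# Sketch: pg_restore wrapped in TimescaleDB's pre/post restore calls.
# DOCKER is a hypothetical override; DOCKER=echo prints the commands instead
# of running them.
DOCKER="${DOCKER:-docker}"

# Run one SQL statement against the openbmp database, stopping on error.
psql_openbmp() {
  "${DOCKER}" exec -i obmp-psql psql -U openbmp -d openbmp \
    -v ON_ERROR_STOP=1 -c "$1"
}

restore_with_timescale_hooks() {
  local dump_in_container="$1"   # path inside obmp-psql
  local rc=0
  psql_openbmp "CREATE EXTENSION IF NOT EXISTS timescaledb;" || return 1
  psql_openbmp "SELECT timescaledb_pre_restore();" || return 1
  "${DOCKER}" exec -i obmp-psql pg_restore -U openbmp -d openbmp \
    --no-owner --no-privileges "${dump_in_container}" || rc=$?
  # Always leave restore mode, even if pg_restore reported errors.
  psql_openbmp "SELECT timescaledb_post_restore();" || return 1
  return "${rc}"
}

# Example:
#   restore_with_timescale_hooks /tmp/restore.dump
```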
### 4. Verify
```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
SELECT (SELECT count(*) FROM routers) AS routers,
(SELECT count(*) FROM bgp_peers) AS peers,
(SELECT count(*) FROM ip_rib) AS rib_rows;"
```
Confirm hypertables came back:
```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
SELECT hypertable_name FROM timescaledb_information.hypertables;"
```
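
The hypertable listing can also be checked automatically instead of by eye. The sketch below compares the restored names against a list you supply; `missing_hypertables` is a hypothetical helper, and the two names in the example come from the TimescaleDB note in this document, not from an authoritative schema list.

```shell
#!/usr/bin/env bash
# Sketch: report expected hypertables that are absent after a restore.
# Reads the restored hypertable names on stdin (one per line) and prints each
# expected name (given as arguments) that does not appear.
missing_hypertables() {
  local actual expected
  actual="$(cat)"
  for expected in "$@"; do
    printf '%s\n' "${actual}" | grep -qxF -- "${expected}" ||
      printf '%s\n' "${expected}"
  done
}

# Example (psql -At prints unaligned, tuples-only output):
#   docker exec -i obmp-psql psql -U openbmp -d openbmp -At \
#     -c "SELECT hypertable_name FROM timescaledb_information.hypertables;" \
#     | missing_hypertables ip_rib_log peer_event_log
# Empty output means every expected hypertable was found.
```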
### 5. Restart the writers
```bash
docker compose -p obmp start collector psql-app
```
The collector reconnects to the routers' BMP sessions and psql-app resumes
consuming from Kafka. Live state catches up from the routers.

---
## What is NOT covered
This backup is **PostgreSQL only**. The following are out of scope and need
their own handling:
- **Kafka data is transient.** The `obmp-kafka` topics are a short-retention
pipeline buffer (`KAFKA_LOG_RETENTION_MINUTES: 720` — 12 hours). They are
not a system of record and do not need backing up. After a restore, routers
re-send BMP and the pipeline refills naturally.
- **InfluxDB telemetry has its own backup.** The gNMI streaming-telemetry
  data lives in `obmp-influxdb` (bucket `telemetry`), not in PostgreSQL.
  `pg_dump` does not touch it. Back it up separately with the Influx CLI:

  ```bash
  # Backup
  docker exec obmp-influxdb influx backup /var/lib/influxdb2/backup \
    --token "$INFLUXDB_ADMIN_TOKEN"
  docker cp obmp-influxdb:/var/lib/influxdb2/backup \
    /var/openbmp/backups/influxdb-$(date +%Y%m%d)
  # Restore
  docker cp /var/openbmp/backups/influxdb-YYYYMMDD \
    obmp-influxdb:/var/lib/influxdb2/restore
  docker exec obmp-influxdb influx restore /var/lib/influxdb2/restore \
    --token "$INFLUXDB_ADMIN_TOKEN"
  ```

  Telemetry is also less critical than BMP data (30-day retention,
  data-plane counters): back it up if you need historical telemetry to
  survive a host loss; otherwise the 30-day window simply re-fills.
- **Grafana** — dashboards and datasources are provisioned from files in the
repo (`obmp-grafana/provisioning/` and `obmp-grafana/dashboards/`), so they
are already version-controlled in git. The Grafana database under
`${OBMP_DATA_ROOT}/grafana` (users, preferences, manually-created
dashboards, alert state) is *not* covered by this script — back up that
directory separately if it holds anything not reproducible from the repo.
- **Configuration & secrets** — `.env`, `docker-compose.yml`, and the
`${OBMP_DATA_ROOT}/config` directory. Keep these in version control /
your secrets manager.
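
For the file-based items in the list above, a plain archive stored next to the dumps is often enough. The sketch below assumes the `grafana` and `config` directories sit directly under `${OBMP_DATA_ROOT}` as described; `archive_dirs` is a hypothetical helper. Two caveats: Grafana's database is best copied while Grafana is stopped, and the resulting archive may contain secrets, so protect it accordingly.

```shell
#!/usr/bin/env bash
# Sketch: archive selected directories under OBMP_DATA_ROOT (hypothetical helper).
# Usage: archive_dirs <output.tar.gz> <dir-relative-to-root>...
archive_dirs() {
  local root="${OBMP_DATA_ROOT:-/var/openbmp}"
  local out="$1"; shift
  # -C makes the stored paths relative to the data root.
  tar -czf "${out}" -C "${root}" "$@"
}

# Example:
#   archive_dirs "/var/openbmp/backups/obmp-files-$(date +%Y%m%d).tar.gz" \
#     grafana config
```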