Adds a prioritized security-hardening checklist, a PostgreSQL logical-backup script (pg-backup.sh) with a documented restore procedure, and Grafana alerting provisioning (peer-down, flap-storm, RPKI-invalid, router-down rules plus a contact-point template). The alerting YAML and contact points need operator review before being relied on for paging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# OpenBMP Backup & Restore

How to back up and restore the OpenBMP PostgreSQL database, what the backup
covers, and what it deliberately does not.

---

## What `scripts/pg-backup.sh` backs up

The script runs `pg_dump` inside the `obmp-psql` container and produces a
single timestamped, compressed, custom-format dump of the **entire `openbmp`
database**:

- All BMP/BGP operational tables: `routers`, `bgp_peers`, `ip_rib`,
  `base_attrs`, `global_ip_rib`, `l3vpn_rib`, the `ls_*` link-state tables.
- All history / TimescaleDB hypertables: `ip_rib_log`, `peer_event_log`,
  `stat_reports`, and the `stats_*` aggregate tables.
- Reference / enrichment data: `geo_ip`, `info_asn`, `info_route`,
  `rpki_validator`, `pdb_exchange_peers`.
- Schema objects: table definitions, indexes, views, functions, triggers,
  enum types, and the TimescaleDB hypertable configuration.

The dump is taken against a **live database**. `pg_dump` reads from an MVCC
snapshot, so neither downtime nor a service stop is required. The dump is
written atomically (to a `.partial` file, renamed on success), so an
interrupted run never leaves a file that looks valid but is truncated.

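The write-then-rename behaviour described above can be sketched as follows. This is a minimal illustration, not the actual body of `pg-backup.sh`; the helper name `atomic_write` is hypothetical, and the `pg_dump` flags in the usage comment are assumptions based on the "custom-format" and `--no-owner --no-privileges` statements in this document.

```bash
# atomic_write DEST CMD [ARGS...]   (hypothetical helper)
# Runs CMD with stdout redirected to DEST.partial. The file is renamed to
# DEST only if CMD exits 0; otherwise the partial file is removed and CMD's
# exit status is returned, so DEST never holds a truncated dump.
atomic_write() {
  local dest="$1"; shift
  local partial="${dest}.partial"
  if "$@" > "${partial}"; then
    # rename is atomic as long as both paths are on the same filesystem
    mv -f "${partial}" "${dest}"
  else
    local status=$?
    rm -f "${partial}"
    return "${status}"
  fi
}

# Possible usage (assumed flags, adjust to the real script):
#   atomic_write "/var/openbmp/backups/openbmp-$(date +%Y%m%d-%H%M%S).dump" \
#     docker exec obmp-psql pg_dump -U openbmp -d openbmp -Fc \
#       --no-owner --no-privileges
```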
Output: `${OBMP_DATA_ROOT:-/var/openbmp}/backups/openbmp-YYYYMMDD-HHMMSS.dump`

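### TimescaleDB note

The OpenBMP database uses TimescaleDB hypertables (`ip_rib_log`,
`peer_event_log`, the `stats_*` tables, with compression policies).
**A `pg_dump` logical backup restores hypertables correctly**: the dump
captures the `_timescaledb_catalog` metadata, and on restore the hypertable
structure, chunks, and compression settings are recreated. No special flags
are needed for the dump. The main requirement is that the **restore target
has the TimescaleDB extension available**, ideally at the same version as
the source. The `openbmp/postgres` image provides it, so restoring into a
fresh `obmp-psql` built from the same image needs no extra setup.

TimescaleDB's own backup documentation additionally describes calling
`timescaledb_pre_restore()` before `pg_restore` and
`timescaledb_post_restore()` after it, and cautions against parallel restore
(`-j`). If the hypertable check in the Verify step fails, follow that
documented procedure for the TimescaleDB version in your image.
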
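To confirm that a restore target actually has the extension available, something like the following could be run against it. This is a sketch: the function name is hypothetical, and the container, database, and user defaults are taken from the configuration defaults in this document. `pg_available_extensions` is a standard PostgreSQL view; an empty second field means the extension is available but not yet created in that database.

```bash
# timescaledb_versions [CONTAINER] [DB] [USER]   (hypothetical helper)
# Prints "default_version|installed_version" for the timescaledb extension
# as seen by the target database. Prints nothing if the extension is not
# available in the image at all.
timescaledb_versions() {
  local container="${1:-obmp-psql}" db="${2:-openbmp}" user="${3:-openbmp}"
  docker exec -i "${container}" psql -U "${user}" -d "${db}" -At -c \
    "SELECT default_version, installed_version
       FROM pg_available_extensions
      WHERE name = 'timescaledb';"
}

# Possible usage: run it on the source host and on the restore target and
# compare the two lines before restoring.
#   timescaledb_versions
```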
---

## Scheduling

Make the script executable once:

```bash
chmod +x scripts/pg-backup.sh
```

Add a cron entry (`crontab -e`) that runs daily at 02:30 and logs to a file:

```cron
30 2 * * * OBMP_DATA_ROOT=/var/openbmp /home/user/obmp-docker/scripts/pg-backup.sh >> /var/openbmp/backups/pg-backup.log 2>&1
```

The cron user must be able to reach the Docker daemon: run the job as a user
in the `docker` group, or as root. A systemd timer is an equally valid
alternative.

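For the systemd alternative, a small generator along these lines could write the two unit files. It is a sketch under assumptions: the unit names `obmp-pg-backup.service` / `obmp-pg-backup.timer` and the helper name are hypothetical, and the schedule simply mirrors the 02:30 cron entry. After writing the units into `/etc/systemd/system`, `systemctl daemon-reload` followed by `systemctl enable --now obmp-pg-backup.timer` would activate the timer.

```bash
# write_backup_timer_units UNIT_DIR SCRIPT_PATH [DATA_ROOT]   (hypothetical)
# Writes a oneshot service that runs the backup script and a timer that
# triggers it daily at 02:30. Persistent=true makes systemd run a missed
# backup at next boot if the host was down at the scheduled time.
write_backup_timer_units() {
  local unit_dir="$1" script="$2" data_root="${3:-/var/openbmp}"
  mkdir -p "${unit_dir}"
  cat > "${unit_dir}/obmp-pg-backup.service" <<EOF
[Unit]
Description=OpenBMP PostgreSQL logical backup

[Service]
Type=oneshot
Environment=OBMP_DATA_ROOT=${data_root}
ExecStart=${script}
EOF
  cat > "${unit_dir}/obmp-pg-backup.timer" <<EOF
[Unit]
Description=Daily OpenBMP PostgreSQL backup

[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Possible usage:
#   write_backup_timer_units /etc/systemd/system \
#     /home/user/obmp-docker/scripts/pg-backup.sh
```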
### Configuration

All settings are environment variables with sensible defaults:

| Variable | Default | Purpose |
|----------|---------|---------|
| `OBMP_DATA_ROOT` | `/var/openbmp` | Base data dir; backups go to `${OBMP_DATA_ROOT}/backups` |
| `OBMP_BACKUP_DIR` | (unset) | Explicit backup dir, overrides the default |
| `OBMP_PG_CONTAINER` | `obmp-psql` | Postgres container name |
| `OBMP_PG_DB` | `openbmp` | Database name |
| `OBMP_PG_USER` | `openbmp` | Database user |
| `OBMP_BACKUP_RETENTION_DAYS` | `14` | Dumps older than this are pruned each run |

Retention only prunes files matching the script's own `openbmp-*.dump`
naming pattern; nothing else in the directory is touched.

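The directory override and the retention rule can be illustrated with two small functions. These are sketches of the documented behaviour, not excerpts from `pg-backup.sh`; both function names are hypothetical, and the use of `find -mtime` is an assumption about how "older than N days" is evaluated.

```bash
# resolve_backup_dir   (hypothetical helper)
# OBMP_BACKUP_DIR wins when set and non-empty; otherwise the directory is
# ${OBMP_DATA_ROOT}/backups, with OBMP_DATA_ROOT defaulting to /var/openbmp.
resolve_backup_dir() {
  printf '%s\n' "${OBMP_BACKUP_DIR:-${OBMP_DATA_ROOT:-/var/openbmp}/backups}"
}

# prune_old_dumps DIR DAYS   (hypothetical helper)
# Deletes only regular files directly in DIR that match openbmp-*.dump and
# whose age, counted in whole 24-hour periods, is greater than DAYS.
# Deleted paths are printed so the cron log records what was removed.
prune_old_dumps() {
  local dir="$1" days="$2"
  find "${dir}" -maxdepth 1 -type f -name 'openbmp-*.dump' \
    -mtime "+${days}" -print -delete
}

# Possible usage:
#   prune_old_dumps "$(resolve_backup_dir)" "${OBMP_BACKUP_RETENTION_DAYS:-14}"
```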
### Production recommendations

- **Copy dumps off-host.** A local backup does not survive host loss. Sync
  the backup directory to object storage / a backup server (e.g. nightly
  `rclone`, `restic`, or your existing ISP backup tooling).
- **Size the backup volume.** At production scale (~100–150M NLRIs) the
  dump can be tens of GB even compressed. See `docs/production-sizing.md`.
- **Test restores periodically.** An untested backup is not a backup.
- For tighter RPO than once-daily logical dumps, consider PostgreSQL
  continuous archiving / PITR (WAL archiving + `pg_basebackup`). That is out
  of scope for this script but worth planning for a production deployment.

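One way to make off-host copies checkable is to store a checksum next to each dump and verify it on the receiving side before a test restore. The sketch below assumes GNU coreutils `sha256sum`; the function names and the `.sha256` sidecar convention are illustrative and not something `pg-backup.sh` is documented to do.

```bash
# write_checksum DUMP   (hypothetical helper)
# Writes DUMP.sha256 next to the dump, recording only the file's base name
# so the pair can be verified from any directory after being copied.
write_checksum() {
  local dump="$1"
  ( cd "$(dirname "${dump}")" &&
    sha256sum "$(basename "${dump}")" > "$(basename "${dump}").sha256" )
}

# verify_checksum DUMP   (hypothetical helper)
# Returns 0 if the dump still matches its sidecar checksum, non-zero if the
# file was altered, truncated, or the sidecar is missing.
verify_checksum() {
  local dump="$1"
  ( cd "$(dirname "${dump}")" &&
    sha256sum --check --quiet "$(basename "${dump}").sha256" )
}
```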
---

## Restore procedure

This restores a dump into a **fresh, empty** `obmp-psql` database. Restoring
over a populated database risks conflicts, so start clean.

### 1. Stop the writers

Stop the services that write to the database so nothing races the restore:

```bash
docker compose -p obmp stop psql-app collector
```

Leave `obmp-psql` running.

### 2. Recreate an empty database

Drop and recreate the `openbmp` database inside the running container:

```bash
docker exec -i obmp-psql psql -U openbmp -d postgres <<'EOSQL'
DROP DATABASE IF EXISTS openbmp;
CREATE DATABASE openbmp OWNER openbmp;
EOSQL
```

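`DROP DATABASE` is refused while other sessions are still connected to the target database, so it can help to count them first. The following is a sketch; the function name is hypothetical and the defaults are the container, database, and user names used throughout this document. A non-zero result usually means a writer from step 1, or a Grafana datasource, is still connected.

```bash
# count_db_sessions [CONTAINER] [DB] [USER]   (hypothetical helper)
# Prints the number of sessions currently connected to DB. The query runs
# against the "postgres" maintenance database so it does not count itself.
count_db_sessions() {
  local container="${1:-obmp-psql}" db="${2:-openbmp}" user="${3:-openbmp}"
  docker exec -i "${container}" psql -U "${user}" -d postgres -At -c \
    "SELECT count(*) FROM pg_stat_activity WHERE datname = '${db}';"
}

# Possible usage before the DROP:
#   [ "$(count_db_sessions)" = "0" ] || echo "openbmp still has connections"
```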
> Restoring into a **brand-new container**? Bring `obmp-psql` up first and let
> it initialize, but **do not** create the `config/init_db` trigger file:
> the schema comes from the dump, not from psql-app's first-run migration.

### 3. Restore the dump

Copy the dump into the container and run `pg_restore`:

```bash
DUMP=/var/openbmp/backups/openbmp-YYYYMMDD-HHMMSS.dump

docker cp "${DUMP}" obmp-psql:/tmp/restore.dump

docker exec -i obmp-psql \
  pg_restore -U openbmp -d openbmp --no-owner --no-privileges \
  --jobs=4 /tmp/restore.dump

docker exec obmp-psql rm -f /tmp/restore.dump
```

- `--no-owner --no-privileges`: the dump was created with the same flags;
  objects are recreated owned by the connecting role.
- `--jobs=4`: parallel restore; raise it on a many-core host to speed up the
  large `ip_rib` / `ip_rib_log` tables. Custom-format dumps support this.
  TimescaleDB's documentation cautions against `-j` for databases with
  hypertables, so drop the flag if the hypertable check in step 4 fails.
- Some non-fatal warnings (e.g. about the TimescaleDB extension or existing
  objects) are normal. `pg_restore` can exit non-zero even when every
  reported problem is harmless, so inspect the output before assuming
  failure.

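To make "inspect the output" concrete, the restore's stderr can be saved to a file and the error lines counted. This sketch assumes the `pg_restore: error:` line prefix used by recent PostgreSQL releases (older releases format errors differently); the function name and the `restore.log` file name are hypothetical.

```bash
# count_restore_errors LOGFILE   (hypothetical helper)
# Prints how many lines in LOGFILE pg_restore tagged as errors. Warnings are
# not counted. grep exits non-zero when nothing matches, so "|| true" keeps
# the function's own exit status at 0 for a clean log.
count_restore_errors() {
  grep -c '^pg_restore: error:' "$1" || true
}

# Possible usage:
#   docker exec -i obmp-psql pg_restore -U openbmp -d openbmp \
#     --no-owner --no-privileges /tmp/restore.dump 2> restore.log
#   echo "errors: $(count_restore_errors restore.log)"
#   grep '^pg_restore: error:' restore.log   # read each one before deciding
```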
Alternatively, stream the restore without `docker cp`:

```bash
docker exec -i obmp-psql pg_restore -U openbmp -d openbmp \
  --no-owner --no-privileges < "${DUMP}"
```

(Streaming via stdin disables `--jobs` parallelism, so use `docker cp` for
large dumps.)

### 4. Verify

```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
  SELECT (SELECT count(*) FROM routers)   AS routers,
         (SELECT count(*) FROM bgp_peers) AS peers,
         (SELECT count(*) FROM ip_rib)    AS rib_rows;"
```

Confirm hypertables came back:

```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
  SELECT hypertable_name FROM timescaledb_information.hypertables;"
```

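The hypertable check can also be scripted so that a missing table is reported explicitly. The sketch below reads hypertable names on stdin and prints any expected name that is absent; the function name is hypothetical, and the expected list in the usage comment uses only the two hypertables this document names individually.

```bash
# missing_hypertables EXPECTED...   (hypothetical helper)
# Reads actual hypertable names on stdin, one per line, and prints each
# EXPECTED name that does not appear. No output means all were found.
missing_hypertables() {
  local actual name
  actual="$(cat)"
  for name in "$@"; do
    printf '%s\n' "${actual}" | grep -qxF -- "${name}" ||
      printf '%s\n' "${name}"
  done
}

# Possible usage (-At gives one bare name per line):
#   docker exec -i obmp-psql psql -U openbmp -d openbmp -At -c \
#     "SELECT hypertable_name FROM timescaledb_information.hypertables;" \
#     | missing_hypertables ip_rib_log peer_event_log
```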
### 5. Restart the writers

```bash
docker compose -p obmp start collector psql-app
```

The collector reconnects to the routers' BMP sessions and psql-app resumes
consuming from Kafka. Live state catches up from the routers.

---

## What is NOT covered

This backup is **PostgreSQL only**. The following are out of scope and need
their own handling:

- **Kafka data is transient.** The `obmp-kafka` topics are a short-retention
  pipeline buffer (`KAFKA_LOG_RETENTION_MINUTES: 720`, i.e. 12 hours). They
  are not a system of record and do not need backing up. After a restore,
  routers re-send BMP and the pipeline refills naturally.

- **InfluxDB telemetry has its own backup.** The gNMI streaming-telemetry
  data lives in `obmp-influxdb` (bucket `telemetry`), not in PostgreSQL.
  `pg_dump` does not touch it. Back it up separately with the Influx CLI:

  ```bash
  # Backup
  docker exec obmp-influxdb influx backup /var/lib/influxdb2/backup \
    --token "$INFLUXDB_ADMIN_TOKEN"
  docker cp obmp-influxdb:/var/lib/influxdb2/backup \
    /var/openbmp/backups/influxdb-$(date +%Y%m%d)

  # Restore
  docker cp /var/openbmp/backups/influxdb-YYYYMMDD \
    obmp-influxdb:/var/lib/influxdb2/restore
  docker exec obmp-influxdb influx restore /var/lib/influxdb2/restore \
    --token "$INFLUXDB_ADMIN_TOKEN"
  ```

  Telemetry is also less critical than BMP data (30-day retention,
  data-plane counters). Back it up if you need historical telemetry to
  survive a host loss; otherwise the 30-day window simply re-fills.

- **Grafana.** Dashboards and datasources are provisioned from files in the
  repo (`obmp-grafana/provisioning/` and `obmp-grafana/dashboards/`), so they
  are already version-controlled in git. The Grafana database under
  `${OBMP_DATA_ROOT}/grafana` (users, preferences, manually-created
  dashboards, alert state) is *not* covered by this script; back up that
  directory separately if it holds anything not reproducible from the repo.

- **Configuration & secrets.** `.env`, `docker-compose.yml`, and the
  `${OBMP_DATA_ROOT}/config` directory. Keep these in version control /
  your secrets manager.
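
For the Grafana directory mentioned in the list above, a plain archive is often enough. The sketch below is one possible approach, not part of `pg-backup.sh`: the function name and archive naming are hypothetical, and it assumes Grafana is using its default file-based database, in which case stopping the Grafana container for the duration of the copy avoids archiving a file that is mid-write.

```bash
# archive_grafana_dir DATA_ROOT DEST_DIR   (hypothetical helper)
# Creates DEST_DIR/grafana-YYYYMMDD.tar.gz containing DATA_ROOT/grafana and
# prints the archive path on success. Paths inside the archive are relative
# ("grafana/..."), so it can be unpacked under any data root.
archive_grafana_dir() {
  local data_root="$1" dest_dir="$2"
  local archive="${dest_dir}/grafana-$(date +%Y%m%d).tar.gz"
  mkdir -p "${dest_dir}"
  tar -czf "${archive}" -C "${data_root}" grafana &&
    printf '%s\n' "${archive}"
}

# Possible usage:
#   archive_grafana_dir "${OBMP_DATA_ROOT:-/var/openbmp}" \
#     "${OBMP_DATA_ROOT:-/var/openbmp}/backups"
```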