obmp-docker/docs/backup-restore.md

# OpenBMP Backup & Restore
How to back up and restore the OpenBMP PostgreSQL database, what the backup
covers, and what it deliberately does not.

---
## What `scripts/pg-backup.sh` backs up
The script runs `pg_dump` inside the `obmp-psql` container and produces a
single timestamped, compressed, custom-format dump of the **entire `openbmp`
database**:
- All BMP/BGP operational tables — `routers`, `bgp_peers`, `ip_rib`,
`base_attrs`, `global_ip_rib`, `l3vpn_rib`, the `ls_*` link-state tables.
- All history / TimescaleDB hypertables — `ip_rib_log`, `peer_event_log`,
`stat_reports`, and the `stats_*` aggregate tables.
- Reference / enrichment data — `geo_ip`, `info_asn`, `info_route`,
`rpki_validator`, `pdb_exchange_peers`.
- Schema objects — table definitions, indexes, views, functions, triggers,
enum types, and the TimescaleDB hypertable configuration.

The dump is taken against a **live database**: `pg_dump` reads from an MVCC
snapshot, so no downtime and no service stop is required. It is written
atomically (to a `.partial` file, renamed on success) so an interrupted run
never leaves a dump that looks valid but is truncated.

Output: `${OBMP_DATA_ROOT:-/var/openbmp}/backups/openbmp-YYYYMMDD-HHMMSS.dump`
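
The write-then-rename pattern described above can be sketched as follows. This is a minimal illustration, not the code from `scripts/pg-backup.sh`: the `atomic_write` helper name is hypothetical, and it assumes the wrapped command writes the dump to stdout.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from pg-backup.sh): run a command that writes
# to stdout, capture the output in "<target>.partial", and rename it to
# "<target>" only if the command succeeded.
atomic_write() {
    local target="$1"; shift
    local partial="${target}.partial"
    if "$@" > "${partial}"; then
        # rename(2) is atomic as long as both paths are on one filesystem
        mv -f -- "${partial}" "${target}"
    else
        local rc=$?
        # never leave a truncated file that could be mistaken for a dump
        rm -f -- "${partial}"
        return "${rc}"
    fi
}
```

A call might look like `atomic_write "$out" docker exec obmp-psql pg_dump -U openbmp -d openbmp -Fc`, where `-Fc` selects the custom format. Keeping the `.partial` file in the same directory as the final dump is what makes the rename atomic.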
### TimescaleDB note
The OpenBMP database uses TimescaleDB hypertables (`ip_rib_log`,
`peer_event_log`, the `stats_*` tables, with compression policies).
**A `pg_dump` logical backup restores hypertables correctly** — the dump
captures the `_timescaledb_catalog` metadata, and on restore the hypertable
structure, chunks, and compression settings are recreated. No special flags
are needed for the dump. The only requirement is that the **restore target
has the TimescaleDB extension available** — which the `openbmp/postgres`
image provides, so restoring into a fresh `obmp-psql` works out of the box.

---
## Scheduling
Make the script executable once:
```bash
chmod +x scripts/pg-backup.sh
```
Add a cron entry (`crontab -e`) — daily at 02:30, logging to a file:
```cron
30 2 * * * OBMP_DATA_ROOT=/var/openbmp /home/user/obmp-docker/scripts/pg-backup.sh >> /var/openbmp/backups/pg-backup.log 2>&1
```
The cron user must be able to reach the Docker daemon — run it as a user in
the `docker` group, or as root. A systemd timer is an equally valid
alternative.
### Configuration
All settings are environment variables with sensible defaults:
| Variable | Default | Purpose |
|----------|---------|---------|
| `OBMP_DATA_ROOT` | `/var/openbmp` | Base data dir; backups go to `${OBMP_DATA_ROOT}/backups` |
| `OBMP_BACKUP_DIR` | (unset) | Explicit backup dir, overrides the default |
| `OBMP_PG_CONTAINER` | `obmp-psql` | Postgres container name |
| `OBMP_PG_DB` | `openbmp` | Database name |
| `OBMP_PG_USER` | `openbmp` | Database user |
| `OBMP_BACKUP_RETENTION_DAYS` | `14` | Dumps older than this are pruned each run |

Retention only prunes files matching the script's own `openbmp-*.dump`
naming pattern; nothing else in the directory is touched.
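
As a rough sketch of how the directory defaults and the retention rule fit together, the two functions below resolve the backup directory and prune old dumps. The function names are hypothetical and the use of `find -mtime` is an assumption; the actual script may implement pruning differently.

```bash
#!/usr/bin/env bash
# Resolve the backup directory as the table describes:
# OBMP_BACKUP_DIR wins, otherwise ${OBMP_DATA_ROOT}/backups.
resolve_backup_dir() {
    printf '%s\n' "${OBMP_BACKUP_DIR:-${OBMP_DATA_ROOT:-/var/openbmp}/backups}"
}

# Delete only regular files named openbmp-*.dump that are older than the
# retention window, and print each path that was removed.
# Assumption: age is judged by file modification time.
prune_old_dumps() {
    local dir="$1"
    local days="${2:-${OBMP_BACKUP_RETENTION_DAYS:-14}}"
    find "${dir}" -maxdepth 1 -type f -name 'openbmp-*.dump' \
        -mtime "+${days}" -print -delete
}
```

Calling `prune_old_dumps "$(resolve_backup_dir)"` would apply the configured retention. Because the `-name` pattern is explicit, logs and other files in the same directory are left alone.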
### Production recommendations
- **Copy dumps off-host.** A local backup does not survive host loss. Sync
the backup directory to object storage / a backup server (e.g. nightly
`rclone`, `restic`, or your existing ISP backup tooling).
- **Size the backup volume.** At production scale (~100-150M NLRIs) the
dump can be tens of GB even compressed. See `docs/production-sizing.md`.
- **Test restores periodically** — an untested backup is not a backup.
- For tighter RPO than once-daily logical dumps, consider PostgreSQL
continuous archiving / PITR (WAL archiving + `pg_basebackup`). That is out
of scope for this script but worth planning for a production deployment.
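
One way to make off-host copies verifiable is to write a checksum file next to each dump before syncing it, then check it on the receiving side. The sketch below assumes GNU `sha256sum` is available; the function names are hypothetical and nothing here is part of `pg-backup.sh`. The `.sha256` files do not match the `openbmp-*.dump` retention pattern, so they would need their own cleanup.

```bash
#!/usr/bin/env bash
# Write "<dump>.sha256" next to the dump. The checksum file stores only the
# base name, so it stays valid after the pair is copied to another directory.
write_checksum() {
    local dump="$1"
    local dir base
    dir="$(dirname -- "${dump}")"
    base="$(basename -- "${dump}")"
    ( cd "${dir}" && sha256sum -- "${base}" > "${base}.sha256" )
}

# Return 0 only if the dump still matches its recorded checksum.
verify_checksum() {
    local dump="$1"
    local dir base
    dir="$(dirname -- "${dump}")"
    base="$(basename -- "${dump}")"
    ( cd "${dir}" && sha256sum --check --status -- "${base}.sha256" )
}
```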
---
## Restore procedure
This restores a dump into a **fresh, empty** `obmp-psql` database. Restoring
over a populated database risks conflicts — start clean.
### 1. Stop the writers
Stop the services that write to the database so nothing races the restore:
```bash
docker compose -p obmp stop psql-app collector
```
Leave `obmp-psql` running.
### 2. Recreate an empty database
Drop and recreate the `openbmp` database inside the running container:
```bash
docker exec -i obmp-psql psql -U openbmp -d postgres <<'EOSQL'
DROP DATABASE IF EXISTS openbmp;
CREATE DATABASE openbmp OWNER openbmp;
EOSQL
```
> Restoring into a **brand-new container**? Bring `obmp-psql` up first and let
> it initialize, but **do not** create the `config/init_db` trigger file —
> the schema comes from the dump, not from psql-app's first-run migration.
### 3. Restore the dump
Copy the dump into the container and run `pg_restore`:
```bash
DUMP=/var/openbmp/backups/openbmp-YYYYMMDD-HHMMSS.dump
docker cp "${DUMP}" obmp-psql:/tmp/restore.dump
docker exec -i obmp-psql \
    pg_restore -U openbmp -d openbmp --no-owner --no-privileges \
    --jobs=4 /tmp/restore.dump
docker exec obmp-psql rm -f /tmp/restore.dump
```
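- `--no-owner --no-privileges`: the dump was created with the same flags;
objects are recreated owned by the connecting role.
- `--jobs=4`: parallel restore, which custom-format dumps support. Raising it
on a many-core host speeds up the large `ip_rib` / `ip_rib_log` tables. Note
that TimescaleDB's documentation advises against `pg_restore -j` for databases
containing hypertables; if the hypertable check in step 4 comes back
incomplete, recreate the database and restore again without `--jobs`.
- Some non-fatal warnings (e.g. about the TimescaleDB extension or existing
objects) are normal. `pg_restore` exits non-zero whenever it reports an error
it chose to continue past, so a non-zero exit does not by itself mean the
restore failed: read the output and confirm every reported error is benign.

Alternatively, stream the restore without `docker cp`: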
```bash
docker exec -i obmp-psql pg_restore -U openbmp -d openbmp \
    --no-owner --no-privileges < "${DUMP}"
```
(Streaming via stdin disables `--jobs` parallelism — use `docker cp` for
large dumps.)
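
Rather than typing the timestamp by hand, `DUMP` can be set to the newest file in the backup directory. The helper below is a hypothetical sketch that relies only on the `openbmp-YYYYMMDD-HHMMSS.dump` naming shown earlier.

```bash
#!/usr/bin/env bash
# Print the newest openbmp-*.dump in a directory; return 1 if there is none.
# The YYYYMMDD-HHMMSS timestamp sorts lexicographically, and glob results are
# sorted, so the last match is the most recent dump.
latest_dump() {
    local dir="${1:-${OBMP_DATA_ROOT:-/var/openbmp}/backups}"
    local f newest=""
    for f in "${dir}"/openbmp-*.dump; do
        # skip the unexpanded pattern when nothing matches
        if [ -f "${f}" ]; then
            newest="${f}"
        fi
    done
    [ -n "${newest}" ] || return 1
    printf '%s\n' "${newest}"
}
```

For example, `DUMP="$(latest_dump)" || exit 1` aborts early when no dump exists. `.partial` files are ignored because they do not end in `.dump`.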
### 4. Verify
```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
SELECT (SELECT count(*) FROM routers) AS routers,
(SELECT count(*) FROM bgp_peers) AS peers,
(SELECT count(*) FROM ip_rib) AS rib_rows;"
```
Confirm hypertables came back:
```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
SELECT hypertable_name FROM timescaledb_information.hypertables;"
```
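
The hypertable check can also be scripted so it fails loudly. In the sketch below, the expected list contains only the two hypertables this document names explicitly, which is an assumption about your schema; extend it as needed. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Read hypertable names (one per line) on stdin and report every expected
# name that is absent. Returns non-zero if anything is missing.
# Assumption: the expected list below mirrors the tables named in this
# document; adjust it to match your deployment.
missing_hypertables() {
    local expected=(ip_rib_log peer_event_log)
    local found name rc=0
    found="$(cat)"
    for name in "${expected[@]}"; do
        # -x: whole-line match, -F: literal string rather than a regex
        if ! grep -qxF -- "${name}" <<<"${found}"; then
            printf 'missing hypertable: %s\n' "${name}"
            rc=1
        fi
    done
    return "${rc}"
}

# Example (psql -At prints bare values, one per line):
#   docker exec -i obmp-psql psql -U openbmp -d openbmp -At \
#       -c "SELECT hypertable_name FROM timescaledb_information.hypertables;" \
#     | missing_hypertables
```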
### 5. Restart the writers
```bash
docker compose -p obmp start collector psql-app
```
The collector reconnects to the routers' BMP sessions and psql-app resumes
consuming from Kafka. Live state catches up from the routers.

---
## What is NOT covered
This backup is **PostgreSQL only**. The following are out of scope and need
their own handling:
- **Kafka data is transient.** The `obmp-kafka` topics are a short-retention
pipeline buffer (`KAFKA_LOG_RETENTION_MINUTES: 720` — 12 hours). They are
not a system of record and do not need backing up. After a restore, routers
re-send BMP and the pipeline refills naturally.
- **InfluxDB telemetry has its own backup.** The gNMI streaming-telemetry
data lives in `obmp-influxdb` (bucket `telemetry`), not in PostgreSQL.
`pg_dump` does not touch it. Back it up separately with the Influx CLI:
```bash
# Backup
docker exec obmp-influxdb influx backup /var/lib/influxdb2/backup \
    --token "$INFLUXDB_ADMIN_TOKEN"
docker cp obmp-influxdb:/var/lib/influxdb2/backup \
    "/var/openbmp/backups/influxdb-$(date +%Y%m%d)"

# Restore
docker cp /var/openbmp/backups/influxdb-YYYYMMDD \
    obmp-influxdb:/var/lib/influxdb2/restore
docker exec obmp-influxdb influx restore /var/lib/influxdb2/restore \
    --token "$INFLUXDB_ADMIN_TOKEN"
```
Telemetry is also less critical than BMP data (30-day retention,
data-plane counters) — back it up if you need historical telemetry to
survive a host loss; otherwise the 30-day window simply re-fills.
- **Grafana** — dashboards and datasources are provisioned from files in the
repo (`obmp-grafana/provisioning/` and `obmp-grafana/dashboards/`), so they
are already version-controlled in git. The Grafana database under
`${OBMP_DATA_ROOT}/grafana` (users, preferences, manually-created
dashboards, alert state) is *not* covered by this script — back up that
directory separately if it holds anything not reproducible from the repo.
- **Configuration & secrets** — `.env`, `docker-compose.yml`, and the
`${OBMP_DATA_ROOT}/config` directory. Keep these in version control /
your secrets manager.