Add production-readiness deliverables: security, backup, alerting

Adds a prioritized security-hardening checklist, a PostgreSQL logical-backup script (pg-backup.sh) with a documented restore procedure, and Grafana alerting provisioning (peer-down, flap-storm, RPKI-invalid, router-down rules plus a contact-point template). The alerting YAML and contact points need operator review before being relied on for paging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 7e3370b5a5
commit 0732ebfa07
223
docs/backup-restore.md
Normal file
@ -0,0 +1,223 @@
|
|||||||
|
# OpenBMP Backup & Restore
|
||||||
|
|
||||||
|
How to back up and restore the OpenBMP PostgreSQL database, what the backup
|
||||||
|
covers, and what it deliberately does not.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What `scripts/pg-backup.sh` backs up
|
||||||
|
|
||||||
|
The script runs `pg_dump` inside the `obmp-psql` container and produces a
|
||||||
|
single timestamped, compressed, custom-format dump of the **entire `openbmp`
|
||||||
|
database**:
|
||||||
|
|
||||||
|
- All BMP/BGP operational tables — `routers`, `bgp_peers`, `ip_rib`,
|
||||||
|
`base_attrs`, `global_ip_rib`, `l3vpn_rib`, the `ls_*` link-state tables.
|
||||||
|
- All history / TimescaleDB hypertables — `ip_rib_log`, `peer_event_log`,
|
||||||
|
`stat_reports`, and the `stats_*` aggregate tables.
|
||||||
|
- Reference / enrichment data — `geo_ip`, `info_asn`, `info_route`,
|
||||||
|
`rpki_validator`, `pdb_exchange_peers`.
|
||||||
|
- Schema objects — table definitions, indexes, views, functions, triggers,
|
||||||
|
enum types, and the TimescaleDB hypertable configuration.
|
||||||
|
|
||||||
|
The dump is taken against a **live database** — `pg_dump` uses an MVCC
|
||||||
|
snapshot, so no downtime and no service stop is required. It is written
|
||||||
|
atomically (to a `.partial` file, renamed on success) so an interrupted run
|
||||||
|
never leaves a dump that looks valid but is truncated.
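
The write-then-rename pattern described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/pg-backup.sh`: `atomic_dump` is a hypothetical helper name, and it assumes the producer command writes the dump to stdout.

```shell
#!/usr/bin/env bash
# Sketch of the write-then-rename pattern. atomic_dump is a hypothetical
# helper for illustration; it is not taken from scripts/pg-backup.sh.
atomic_dump() {
  local target="$1"; shift
  local partial="${target}.partial"

  # Run the producer command ("$@") and capture its stdout in the .partial file.
  if "$@" > "${partial}"; then
    # rename() is atomic within one filesystem, so readers see either no
    # file or the complete file, never a half-written one.
    mv -f -- "${partial}" "${target}"
  else
    local rc=$?
    rm -f -- "${partial}"   # discard the truncated output
    return "${rc}"
  fi
}
```

An assumed invocation would be `atomic_dump "$OUT" docker exec obmp-psql pg_dump -U openbmp -d openbmp --format=custom --no-owner --no-privileges`, with `$OUT` set to the timestamped path. Keeping the `.partial` file in the same directory as the final dump is what makes the rename atomic.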
|
||||||
|
|
||||||
|
Output: `${OBMP_DATA_ROOT:-/var/openbmp}/backups/openbmp-YYYYMMDD-HHMMSS.dump`
|
||||||
|
|
||||||
|
### TimescaleDB note
|
||||||
|
|
||||||
|
The OpenBMP database uses TimescaleDB hypertables (`ip_rib_log`,
`peer_event_log`, the `stats_*` tables, with compression policies).
**A `pg_dump` logical backup captures hypertables correctly**: the dump
includes the `_timescaledb_catalog` metadata, so the hypertable structure,
chunks, and compression settings can be recreated on restore. No special
flags are needed for the dump. The **restore target must have the TimescaleDB
extension available**, at the same version as the source database. The
`openbmp/postgres` image provides it, so a fresh `obmp-psql` is a suitable
target. TimescaleDB's documented restore procedure also calls
`timescaledb_pre_restore()` before `pg_restore` and
`timescaledb_post_restore()` after it; if the hypertable check in the restore
procedure fails, repeat the restore with those calls.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Scheduling
|
||||||
|
|
||||||
|
Make the script executable once:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
chmod +x scripts/pg-backup.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Add a cron entry (`crontab -e`) — daily at 02:30, logging to a file:
|
||||||
|
|
||||||
|
```cron
|
||||||
|
30 2 * * * OBMP_DATA_ROOT=/var/openbmp /home/user/obmp-docker/scripts/pg-backup.sh >> /var/openbmp/backups/pg-backup.log 2>&1
|
||||||
|
```
|
||||||
|
|
||||||
|
The cron user must be able to reach the Docker daemon — run it as a user in
|
||||||
|
the `docker` group, or as root. A systemd timer is an equally valid
|
||||||
|
alternative.
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
All settings are environment variables with sensible defaults:
|
||||||
|
|
||||||
|
| Variable | Default | Purpose |
|----------|---------|---------|
| `OBMP_DATA_ROOT` | `/var/openbmp` | Base data dir; backups go to `${OBMP_DATA_ROOT}/backups` |
| `OBMP_BACKUP_DIR` | (unset) | Explicit backup dir, overrides the default |
| `OBMP_PG_CONTAINER` | `obmp-psql` | Postgres container name |
| `OBMP_PG_DB` | `openbmp` | Database name |
| `OBMP_PG_USER` | `openbmp` | Database user |
| `OBMP_BACKUP_RETENTION_DAYS` | `14` | Dumps older than this are pruned each run |
|
||||||
|
|
||||||
|
Retention only prunes files matching the script's own `openbmp-*.dump`
|
||||||
|
naming pattern — nothing else in the directory is touched.
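
A retention rule of this shape could look like the sketch below. `prune_old_dumps` is a hypothetical helper written for illustration; the actual script may implement pruning differently. It assumes a `find` that supports `-maxdepth` and `-delete` (GNU and BSD both do).

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the retention rule; not the script's own code.
prune_old_dumps() {
  local backup_dir="$1" retention_days="${2:-14}"

  # Match only regular files directly inside backup_dir that follow the
  # openbmp-*.dump naming pattern and were last modified more than
  # retention_days days ago. Everything else is left alone.
  find "${backup_dir}" -maxdepth 1 -type f -name 'openbmp-*.dump' \
    -mtime +"${retention_days}" -delete
}
```

Called as `prune_old_dumps "${OBMP_DATA_ROOT:-/var/openbmp}/backups" "${OBMP_BACKUP_RETENTION_DAYS:-14}"`, it removes expired dumps and ignores logs or other files kept in the same directory.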
|
||||||
|
|
||||||
|
### Production recommendations
|
||||||
|
|
||||||
|
- **Copy dumps off-host.** A local backup does not survive host loss. Sync
|
||||||
|
the backup directory to object storage / a backup server (e.g. nightly
|
||||||
|
`rclone`, `restic`, or your existing ISP backup tooling).
|
||||||
|
- **Size the backup volume** — at production scale (~100–150M NLRIs) the
|
||||||
|
dump can be tens of GB even compressed. See `docs/production-sizing.md`.
|
||||||
|
- **Test restores periodically** — an untested backup is not a backup.
|
||||||
|
- For tighter RPO than once-daily logical dumps, consider PostgreSQL
|
||||||
|
continuous archiving / PITR (WAL archiving + `pg_basebackup`). That is out
|
||||||
|
of scope for this script but worth planning for a production deployment.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Restore procedure
|
||||||
|
|
||||||
|
This restores a dump into a **fresh, empty** `obmp-psql` database. Restoring
|
||||||
|
over a populated database risks conflicts — start clean.
|
||||||
|
|
||||||
|
### 1. Stop the writers
|
||||||
|
|
||||||
|
Stop the services that write to the database so nothing races the restore:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose -p obmp stop psql-app collector
|
||||||
|
```
|
||||||
|
|
||||||
|
Leave `obmp-psql` running.
|
||||||
|
|
||||||
|
### 2. Recreate an empty database
|
||||||
|
|
||||||
|
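Drop and recreate the `openbmp` database inside the running container.
`DROP DATABASE` fails while any session is still connected to `openbmp`, so
stop or disconnect other clients such as Grafana and whois first, and restart
them after the restore: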
|
||||||
|
|
||||||
|
```bash
docker exec -i obmp-psql psql -U openbmp -d postgres <<'EOSQL'
DROP DATABASE IF EXISTS openbmp;
CREATE DATABASE openbmp OWNER openbmp;
EOSQL
```
|
||||||
|
|
||||||
|
> Restoring into a **brand-new container**? Bring `obmp-psql` up first and let
|
||||||
|
> it initialize, but **do not** create the `config/init_db` trigger file —
|
||||||
|
> the schema comes from the dump, not from psql-app's first-run migration.
|
||||||
|
|
||||||
|
### 3. Restore the dump
|
||||||
|
|
||||||
|
Copy the dump into the container and run `pg_restore`:
|
||||||
|
|
||||||
|
```bash
DUMP=/var/openbmp/backups/openbmp-YYYYMMDD-HHMMSS.dump

docker cp "${DUMP}" obmp-psql:/tmp/restore.dump

docker exec -i obmp-psql \
  pg_restore -U openbmp -d openbmp --no-owner --no-privileges \
  --jobs=4 /tmp/restore.dump

docker exec obmp-psql rm -f /tmp/restore.dump
```
|
||||||
|
|
||||||
|
- `--no-owner --no-privileges`: the dump was created with the same flags;
  objects are recreated owned by the connecting role.
- `--jobs=4`: parallel restore, which custom-format dumps support; raise it on
  a many-core host to speed up the large `ip_rib` / `ip_rib_log` tables.
  TimescaleDB's documentation advises against parallel `pg_restore` for
  databases with hypertables, so drop `--jobs` if the hypertable check in
  step 4 fails.
- Some non-fatal warnings (e.g. about the TimescaleDB extension or existing
  objects) are normal. `pg_restore` exits non-zero whenever it logged an
  error, including errors it skipped past, so inspect the output rather than
  treating the exit code alone as failure.
|
||||||
|
|
||||||
|
Alternatively, stream the restore without `docker cp`:
|
||||||
|
|
||||||
|
```bash
docker exec -i obmp-psql pg_restore -U openbmp -d openbmp \
  --no-owner --no-privileges < "${DUMP}"
```
|
||||||
|
|
||||||
|
(Streaming via stdin disables `--jobs` parallelism — use `docker cp` for
|
||||||
|
large dumps.)
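
For databases with hypertables, TimescaleDB's documented logical-restore procedure brackets `pg_restore` with `timescaledb_pre_restore()` and `timescaledb_post_restore()`. The sketch below shows one way to wire that into the `docker cp` flow. `restore_openbmp_dump` is a hypothetical helper, not part of the repository; it assumes the connecting role may create the extension and call both functions, and it restores serially (no `--jobs`).

```shell
#!/usr/bin/env bash
# Sketch only: restore_openbmp_dump is a hypothetical helper, not repo code.
restore_openbmp_dump() {
  local dump="$1"
  local container="${OBMP_PG_CONTAINER:-obmp-psql}"
  local db="${OBMP_PG_DB:-openbmp}" user="${OBMP_PG_USER:-openbmp}"

  docker cp "${dump}" "${container}:/tmp/restore.dump" || return 1

  # Load the extension in the empty database and switch it to restore mode.
  docker exec -i "${container}" psql -U "${user}" -d "${db}" -v ON_ERROR_STOP=1 \
    -c "CREATE EXTENSION IF NOT EXISTS timescaledb;" \
    -c "SELECT timescaledb_pre_restore();" || return 1

  # Serial restore. pg_restore may exit non-zero after skippable errors, so
  # remember the status but still leave restore mode afterwards.
  local rc=0
  docker exec -i "${container}" pg_restore -U "${user}" -d "${db}" \
    --no-owner --no-privileges /tmp/restore.dump || rc=$?

  docker exec -i "${container}" psql -U "${user}" -d "${db}" -v ON_ERROR_STOP=1 \
    -c "SELECT timescaledb_post_restore();" || rc=$?

  docker exec "${container}" rm -f /tmp/restore.dump
  return "${rc}"
}
```

Run it as `restore_openbmp_dump "${DUMP}"` after recreating the empty database. A non-zero return still warrants reading the `pg_restore` output before deciding whether the restore failed.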
|
||||||
|
|
||||||
|
### 4. Verify
|
||||||
|
|
||||||
|
```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
  SELECT (SELECT count(*) FROM routers)   AS routers,
         (SELECT count(*) FROM bgp_peers) AS peers,
         (SELECT count(*) FROM ip_rib)    AS rib_rows;"
```
|
||||||
|
|
||||||
|
Confirm hypertables came back:
|
||||||
|
|
||||||
|
```bash
docker exec -i obmp-psql psql -U openbmp -d openbmp -c "
  SELECT hypertable_name FROM timescaledb_information.hypertables;"
```
|
||||||
|
|
||||||
|
### 5. Restart the writers
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose -p obmp start collector psql-app
|
||||||
|
```
|
||||||
|
|
||||||
|
The collector reconnects to the routers' BMP sessions and psql-app resumes
|
||||||
|
consuming from Kafka. Live state catches up from the routers.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What is NOT covered
|
||||||
|
|
||||||
|
This backup is **PostgreSQL only**. The following are out of scope and need
|
||||||
|
their own handling:
|
||||||
|
|
||||||
|
- **Kafka data is transient.** The `obmp-kafka` topics are a short-retention
|
||||||
|
pipeline buffer (`KAFKA_LOG_RETENTION_MINUTES: 720` — 12 hours). They are
|
||||||
|
not a system of record and do not need backing up. After a restore, routers
|
||||||
|
re-send BMP and the pipeline refills naturally.
|
||||||
|
|
||||||
|
- **InfluxDB telemetry has its own backup.** The gNMI streaming-telemetry
|
||||||
|
data lives in `obmp-influxdb` (bucket `telemetry`), not in PostgreSQL.
|
||||||
|
`pg_dump` does not touch it. Back it up separately with the Influx CLI:
|
||||||
|
|
||||||
|
  ```bash
  # Backup
  docker exec obmp-influxdb influx backup /var/lib/influxdb2/backup \
    --token "$INFLUXDB_ADMIN_TOKEN"
  docker cp obmp-influxdb:/var/lib/influxdb2/backup \
    /var/openbmp/backups/influxdb-$(date +%Y%m%d)

  # Restore
  docker cp /var/openbmp/backups/influxdb-YYYYMMDD \
    obmp-influxdb:/var/lib/influxdb2/restore
  docker exec obmp-influxdb influx restore /var/lib/influxdb2/restore \
    --token "$INFLUXDB_ADMIN_TOKEN"
  ```
|
||||||
|
|
||||||
|
Telemetry is also less critical than BMP data (30-day retention,
|
||||||
|
data-plane counters) — back it up if you need historical telemetry to
|
||||||
|
survive a host loss; otherwise the 30-day window simply re-fills.
|
||||||
|
|
||||||
|
- **Grafana** — dashboards and datasources are provisioned from files in the
|
||||||
|
repo (`obmp-grafana/provisioning/` and `obmp-grafana/dashboards/`), so they
|
||||||
|
are already version-controlled in git. The Grafana database under
|
||||||
|
`${OBMP_DATA_ROOT}/grafana` (users, preferences, manually-created
|
||||||
|
dashboards, alert state) is *not* covered by this script — back up that
|
||||||
|
directory separately if it holds anything not reproducible from the repo.
|
||||||
|
|
||||||
|
- **Configuration & secrets** — `.env`, `docker-compose.yml`, and the
|
||||||
|
`${OBMP_DATA_ROOT}/config` directory. Keep these in version control /
|
||||||
|
your secrets manager.
|
||||||
488
docs/security-hardening.md
Normal file
@ -0,0 +1,488 @@
|
|||||||
|
# OpenBMP Production Security Hardening
|
||||||
|
|
||||||
|
A prioritized checklist for hardening the OpenBMP Docker stack before exposing
|
||||||
|
it to a production ISP network of 40 full-table-edge routers. Work top to
|
||||||
|
bottom — items are ordered roughly by risk reduction per unit effort.
|
||||||
|
|
||||||
|
This document **recommends** changes. It does not modify `docker-compose.yml`
|
||||||
|
or any running service. Apply the changes in a maintenance window and test.
|
||||||
|
|
||||||
|
> Threat model in brief: the stack ingests BMP from production routers, stores
|
||||||
|
> the full DFZ in PostgreSQL, and exposes Grafana to operators. The crown
|
||||||
|
> jewels are (a) the database, (b) the Grafana admin plane, and (c) the BMP
|
||||||
|
> ingest port. Everything below protects one of those three.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Priority 0 — Credentials (do this first)
|
||||||
|
|
||||||
|
Every service currently ships with the placeholder credential `openbmp` or a
close variant, and these defaults are committed to the repository:
|
||||||
|
|
||||||
|
| Service | Setting | Current value |
|---------|---------|---------------|
| PostgreSQL | `POSTGRES_USER` / `POSTGRES_PASSWORD` | `openbmp` / `openbmp` |
| psql-app | `POSTGRES_PASSWORD` | `openbmp` |
| whois | `POSTGRES_PASSWORD` | `openbmp` |
| Grafana | `GF_SECURITY_ADMIN_PASSWORD` | `openbmp` |
| InfluxDB | `DOCKER_INFLUXDB_INIT_PASSWORD` | `openbmp123` |
| InfluxDB | `DOCKER_INFLUXDB_INIT_ADMIN_TOKEN` | `openbmp-telemetry-token` |
| Grafana datasource | `secureJsonData.password` | `openbmp` (in `openbmp-ds.yml`) |
|
||||||
|
|
||||||
|
### 0.1 Move every secret to `.env` (or a secrets manager)
|
||||||
|
|
||||||
|
`.env` is git-ignored. As a minimum, replace the hardcoded literals in
|
||||||
|
`docker-compose.yml` with `${VAR}` references and define them in `.env`:
|
||||||
|
|
||||||
|
```env
|
||||||
|
# .env — never commit this file
|
||||||
|
POSTGRES_PASSWORD=<long-random-string>
|
||||||
|
GF_SECURITY_ADMIN_PASSWORD=<long-random-string>
|
||||||
|
INFLUXDB_ADMIN_PASSWORD=<long-random-string>
|
||||||
|
INFLUXDB_ADMIN_TOKEN=<long-random-token>
|
||||||
|
```
|
||||||
|
|
||||||
|
```yaml
# docker-compose.yml (recommended edit; operator applies)
grafana:
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:?set in .env}
psql:
  environment:
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:?set in .env}
```
|
||||||
|
|
||||||
|
The `:?` form makes the stack fail fast if a secret is missing rather than
|
||||||
|
silently falling back to a default.
|
||||||
|
|
||||||
|
Generate strong values:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl rand -base64 32 # passwords
|
||||||
|
openssl rand -hex 32 # tokens
|
||||||
|
```
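
A pre-deploy check along these lines can catch a `.env` that still carries one of the placeholder values from the credentials table. It is a sketch under assumptions: `check_env_secrets` is a hypothetical helper, the variable names are the ones from the `.env` example above, and the file is assumed to use plain `VAR=value` lines without quoting.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy check; not part of the repository.
check_env_secrets() {
  local env_file="$1" rc=0 var value

  for var in POSTGRES_PASSWORD GF_SECURITY_ADMIN_PASSWORD \
             INFLUXDB_ADMIN_PASSWORD INFLUXDB_ADMIN_TOKEN; do
    # Value of the last "VAR=..." line in the file (empty if VAR is absent).
    value="$(sed -n "s/^${var}=//p" "${env_file}" | tail -n 1)"
    case "${value}" in
      ''|openbmp|openbmp123|openbmp-telemetry-token|'<'*)
        # Missing, a known default, or an unfilled <template> placeholder.
        echo "insecure or missing: ${var}" >&2
        rc=1
        ;;
    esac
  done
  return "${rc}"
}
```

Running `check_env_secrets .env && docker compose up -d` refuses to start the stack while any of the four variables is unset or still a known default. It complements the `:?` form, which only catches variables that are missing altogether.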
|
||||||
|
|
||||||
|
### 0.2 For a real production deployment, use a secrets manager
|
||||||
|
|
||||||
|
`.env` on disk is better than committed literals, but it is still a
|
||||||
|
plaintext file readable by anyone in the `docker` group. For production:
|
||||||
|
|
||||||
|
- **Docker Compose secrets** (`secrets:` block, files mounted at
|
||||||
|
`/run/secrets/...`) — the lowest-friction upgrade; keep the secret files
|
||||||
|
outside the repo, `chmod 600`, owned by root.
|
||||||
|
- **HashiCorp Vault**, **AWS Secrets Manager**, **Bitwarden Secrets**, or your
|
||||||
|
existing ISP secret store — inject at deploy time via a wrapper that renders
|
||||||
|
`.env` from the vault and shreds it after `docker compose up`.
|
||||||
|
|
||||||
|
Whatever the choice: rotate every credential in the table above on first
production deploy. The default values have been in git history and must be
considered compromised.
|
||||||
|
|
||||||
|
### 0.3 Rotate the Grafana datasource password in lockstep
|
||||||
|
|
||||||
|
`obmp-grafana/provisioning/datasources/openbmp-ds.yml` carries
|
||||||
|
`secureJsonData.password`. It is read at Grafana start. When you change the
|
||||||
|
PostgreSQL password, update this file too (it supports `$__file{}` and
|
||||||
|
env-var expansion: `password: $POSTGRES_PASSWORD`) and restart Grafana.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Priority 1 — Network exposure / firewalling
|
||||||
|
|
||||||
|
The host currently publishes these ports to `0.0.0.0`: 5000 (BMP), 5432
|
||||||
|
(PostgreSQL), 9092 (Kafka), 3000 (Grafana), 8086 (InfluxDB), 4300 (whois),
|
||||||
|
9091 (Authelia). Most should not be world-reachable.
|
||||||
|
|
||||||
|
### 1.1 BMP collector (port 5000) — restrict to router management subnets
|
||||||
|
|
||||||
|
The collector accepts a BMP session from any source. A rogue BMP feed can
|
||||||
|
inject bogus routers/peers/prefixes into the database. Firewall it to the
|
||||||
|
router management subnets only.
|
||||||
|
|
||||||
|
`nftables` example (preferred on modern hosts):
|
||||||
|
|
||||||
|
```nft
# /etc/nftables.conf: adjust subnets to your router management ranges
table inet obmp {
    chain input {
        type filter hook input priority 0; policy accept;

        # BMP ingest: only from router management subnets
        tcp dport 5000 ip saddr { 10.100.0.0/24, 10.100.1.0/24 } accept
        tcp dport 5000 drop
    }
}
```
|
||||||
|
|
||||||
|
`iptables` equivalent:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
iptables -A INPUT -p tcp --dport 5000 -s 10.100.0.0/24 -j ACCEPT
|
||||||
|
iptables -A INPUT -p tcp --dport 5000 -s 10.100.1.0/24 -j ACCEPT
|
||||||
|
iptables -A INPUT -p tcp --dport 5000 -j DROP
|
||||||
|
```
|
||||||
|
|
||||||
|
> Traffic to container-published ports is DNATed and forwarded, so it does
> not traverse the host's `INPUT` chain; Docker evaluates the `DOCKER-USER`
> chain for it instead. Put the rules in `DOCKER-USER` so Docker does not
> bypass them:
|
||||||
|
> ```bash
> # -I inserts at the top of the chain, so add the DROP first and the allow
> # rules afterwards; the allow rules then sit above the DROP.
> iptables -I DOCKER-USER -p tcp --dport 5000 -j DROP
> iptables -I DOCKER-USER -p tcp --dport 5000 -s 10.100.0.0/24 -j RETURN
> iptables -I DOCKER-USER -p tcp --dport 5000 -s 10.100.1.0/24 -j RETURN
> ```
|
||||||
|
|
||||||
|
### 1.2 PostgreSQL (5432), Kafka (9092), InfluxDB (8086), whois (4300)
|
||||||
|
|
||||||
|
None of these need to be reachable from outside the stack:
|
||||||
|
|
||||||
|
- **PostgreSQL** — only `psql-app`, `whois`, and `grafana` connect, all on the
|
||||||
|
Compose network. Bind the published port to loopback only, or drop the
|
||||||
|
`ports:` mapping entirely:
|
||||||
|
```yaml
|
||||||
|
# docker-compose.yml — psql service
|
||||||
|
ports:
|
||||||
|
- "127.0.0.1:5432:5432" # localhost only; or remove entirely
|
||||||
|
```
|
||||||
|
- **Kafka 9092** — see Priority 2.
|
||||||
|
- **InfluxDB 8086** — only Grafana and Telegraf use it; bind to loopback or
|
||||||
|
drop the mapping (Telegraf uses host networking and reaches it via
|
||||||
|
localhost; Grafana reaches it on the Compose network).
|
||||||
|
- **whois 4300** — expose only if you actually offer a public whois service;
|
||||||
|
otherwise bind to loopback.
|
||||||
|
|
||||||
|
For anything that genuinely must be reachable, restrict by source with the
|
||||||
|
firewall pattern from 1.1.
|
||||||
|
|
||||||
|
### 1.3 Grafana (3000) — keep it behind Authelia
|
||||||
|
|
||||||
|
Authelia already fronts Grafana (the `auth` profile + `GF_AUTH_PROXY_*`
|
||||||
|
settings). Make that the *only* path:
|
||||||
|
|
||||||
|
- Bind Grafana's published port to loopback: `127.0.0.1:3000:3000`, and let
|
||||||
|
the reverse proxy / Authelia terminate TLS and reach it internally.
|
||||||
|
- Do **not** leave port 3000 directly reachable — `GF_AUTH_PROXY_ENABLED=true`
|
||||||
|
trusts the `Remote-User` header, so any client that can reach 3000 directly
|
||||||
|
and set that header bypasses authentication entirely.
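
One way to confirm the direct path is closed is to send a request with a forged `Remote-User` header and see whether Grafana accepts it. The sketch below is illustrative only: `grafana_header_bypass` is a hypothetical helper, the user name `admin` is an assumption, and `/api/user` is used as a simple authenticated endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical probe; succeeds (exit 0) when the forged header is ACCEPTED,
# i.e. when the deployment is vulnerable.
grafana_header_bypass() {
  local base_url="$1" code

  # Ask for the current user while claiming to be "admin" via the proxy header.
  # On a connection failure curl reports 000 and exits non-zero.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
            -H 'Remote-User: admin' "${base_url}/api/user")" || code="000"

  [ "${code}" = "200" ]
}
```

Run from a machine outside the host, `grafana_header_bypass "http://<host>:3000" && echo "port 3000 trusts forged headers"` should print nothing once Grafana is bound to loopback and reachable only through Authelia.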
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Priority 2 — Kafka transport security
|
||||||
|
|
||||||
|
Kafka is currently **PLAINTEXT** and advertises a host-IP listener:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://obmp-kafka:29092,PLAINTEXT_HOST://${HOST_IP}:9092
|
||||||
|
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
|
||||||
|
```
|
||||||
|
|
||||||
|
The `obmp-kafka:29092` listener is internal to the Compose network and is the
|
||||||
|
only one the collector and psql-app use. The `PLAINTEXT_HOST://...:9092`
|
||||||
|
listener exists only for outside access and is not needed by the core stack.
|
||||||
|
|
||||||
|
**Recommended (simplest, most secure): remove the host listener.** If nothing
|
||||||
|
outside the Compose network consumes Kafka, drop the `9092` port mapping and
|
||||||
|
the `PLAINTEXT_HOST` advertised listener so Kafka is reachable only on the
|
||||||
|
internal Docker network:
|
||||||
|
|
||||||
|
```yaml
kafka:
  # remove the - "9092:9092" ports entry
  environment:
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://obmp-kafka:29092
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
```
|
||||||
|
|
||||||
|
**If external Kafka access is genuinely required** (e.g. a separate analytics
|
||||||
|
consumer, or the split-host architecture in `production-sizing.md` where
|
||||||
|
Kafka and the DB are on different hosts), do **not** leave it PLAINTEXT on a
|
||||||
|
routed network. Enable SASL_SSL on the external listener:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://obmp-kafka:29092,SASL_SSL://${HOST_IP}:9092
|
||||||
|
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,SASL_SSL:SASL_SSL
|
||||||
|
KAFKA_SASL_ENABLED_MECHANISMS: SCRAM-SHA-512
|
||||||
|
KAFKA_SSL_KEYSTORE_LOCATION: /etc/kafka/secrets/kafka.keystore.jks
|
||||||
|
KAFKA_SSL_KEYSTORE_PASSWORD: ${KAFKA_KEYSTORE_PASSWORD}
|
||||||
|
KAFKA_SSL_KEY_PASSWORD: ${KAFKA_KEY_PASSWORD}
|
||||||
|
KAFKA_SSL_TRUSTSTORE_LOCATION: /etc/kafka/secrets/kafka.truststore.jks
|
||||||
|
KAFKA_SSL_TRUSTSTORE_PASSWORD: ${KAFKA_TRUSTSTORE_PASSWORD}
|
||||||
|
```
|
||||||
|
|
||||||
|
Keep the internal `PLAINTEXT://obmp-kafka:29092` listener for the collector
|
||||||
|
and psql-app — intra-Compose traffic on a private bridge does not need TLS and
|
||||||
|
adding SASL there means re-configuring both clients. At minimum, never publish
|
||||||
|
a PLAINTEXT Kafka listener on an IP that routes beyond the host.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Priority 3 — PostgreSQL hardening
|
||||||
|
|
||||||
|
### 3.1 Change the default `openbmp` / `openbmp` credentials
|
||||||
|
|
||||||
|
Covered in Priority 0. Note that `POSTGRES_USER`/`POSTGRES_PASSWORD` only take
|
||||||
|
effect when the data directory is initialized. To rotate on an existing
|
||||||
|
database, change the password in SQL and update every consumer:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker exec -it obmp-psql psql -U openbmp -d openbmp \
|
||||||
|
-c "ALTER ROLE openbmp WITH PASSWORD '<new-strong-password>';"
|
||||||
|
```
|
||||||
|
|
||||||
|
Then update `POSTGRES_PASSWORD` for `psql-app` and `whois`, the
|
||||||
|
`secureJsonData.password` in `openbmp-ds.yml`, and restart those services.
|
||||||
|
|
||||||
|
### 3.2 Create a least-privilege role for Grafana
|
||||||
|
|
||||||
|
Grafana only needs to read. Do not let it connect as the owning role:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
CREATE ROLE grafana_ro LOGIN PASSWORD '<strong-password>';
|
||||||
|
GRANT CONNECT ON DATABASE openbmp TO grafana_ro;
|
||||||
|
GRANT USAGE ON SCHEMA public TO grafana_ro;
|
||||||
|
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafana_ro;
|
||||||
|
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO grafana_ro;
|
||||||
|
```
|
||||||
|
|
||||||
|
Point `openbmp-ds.yml` at `grafana_ro`. This contains a Grafana compromise to
|
||||||
|
read-only and blocks SQL-panel writes.
|
||||||
|
|
||||||
|
### 3.3 Restrict `pg_hba.conf`
|
||||||
|
|
||||||
|
The default OpenBMP image is permissive (`host all all all md5` or similar).
|
||||||
|
Tighten it so only the stack's own subnet can connect, and require
|
||||||
|
`scram-sha-256`:
|
||||||
|
|
||||||
|
```conf
# pg_hba.conf (inside the obmp-psql container / mounted)
# TYPE   DATABASE  USER        ADDRESS         METHOD
local    all       all                         scram-sha-256
host     openbmp   openbmp     172.16.0.0/12   scram-sha-256   # Docker bridge range
host     openbmp   grafana_ro  172.16.0.0/12   scram-sha-256
hostssl  openbmp   openbmp     0.0.0.0/0       scram-sha-256   # only if remote DB host
# reject everything else
host     all       all         0.0.0.0/0       reject
```
|
||||||
|
|
||||||
|
Identify the actual Compose network subnet with
|
||||||
|
`docker network inspect obmp_default` and scope `ADDRESS` to it. Reload with
|
||||||
|
`docker exec obmp-psql psql -U openbmp -c "SELECT pg_reload_conf();"`.
|
||||||
|
|
||||||
|
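> `scram-sha-256` requires `password_encryption = scram-sha-256` in
> `postgresql.conf` and that passwords were set/rotated *after* that change.
> The `local` line also applies to Unix-socket sessions, so commands run as
> `docker exec obmp-psql psql -U openbmp ...` will need a password as well
> (for example via `PGPASSWORD` or a `.pgpass` file).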
|
||||||
|
|
||||||
|
### 3.4 Enable SSL/TLS
|
||||||
|
|
||||||
|
The Grafana datasource already requests `sslmode: "require"` — but the server
|
||||||
|
must actually present a certificate. In `postgresql.conf`:
|
||||||
|
|
||||||
|
```conf
|
||||||
|
ssl = on
|
||||||
|
ssl_cert_file = '/var/lib/postgresql/server.crt'
|
||||||
|
ssl_key_file = '/var/lib/postgresql/server.key'
|
||||||
|
```
|
||||||
|
|
||||||
|
Generate a cert (self-signed is acceptable for an internal DB; use your
|
||||||
|
internal CA if you have one):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl req -new -x509 -days 825 -nodes -text \
|
||||||
|
-out server.crt -keyout server.key -subj "/CN=obmp-psql"
|
||||||
|
chmod 600 server.key # PostgreSQL refuses a world-readable key
|
||||||
|
```
|
||||||
|
|
||||||
|
Mount both files into the container at the paths configured above. For the strongest
|
||||||
|
posture, move clients to `sslmode: verify-full` once a proper CA chain is in
|
||||||
|
place. This is most important if PostgreSQL runs on a separate host (the
|
||||||
|
split-host architecture in `production-sizing.md`) — intra-host Compose
|
||||||
|
traffic is lower-risk but TLS is still recommended.
|
||||||
|
|
||||||
|
### 3.5 Limit listen addresses
|
||||||
|
|
||||||
|
If PostgreSQL must accept connections from another host (split-host layout),
|
||||||
|
keep `listen_addresses` scoped — do not leave it at `*` if a single interface
|
||||||
|
suffices:
|
||||||
|
|
||||||
|
```conf
|
||||||
|
listen_addresses = 'localhost,172.18.0.1'  # example: loopback + one specific address; it must exist on an interface PostgreSQL itself can bind
|
||||||
|
```
|
||||||
|
|
||||||
|
On a single-host deployment, drop the `5432` port mapping entirely (1.2) so
|
||||||
|
the listener is reachable only on the Compose network.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Priority 4 — Drop `privileged: true` on the `psql` service
|
||||||
|
|
||||||
|
```yaml
psql:
  privileged: true   # <-- remove or replace
  shm_size: 1536m
  sysctls:
    - net.ipv4.tcp_keepalive_intvl=30
    - net.ipv4.tcp_keepalive_probes=5
    - net.ipv4.tcp_keepalive_time=180
```
|
||||||
|
|
||||||
|
**Why it is a risk:** `privileged: true` gives the container *all* Linux
|
||||||
|
capabilities, disables seccomp/AppArmor confinement, and grants access to all
|
||||||
|
host devices. A compromise of PostgreSQL — the process most exposed to
|
||||||
|
untrusted route data — would then be a near-complete host compromise. This is
|
||||||
|
the single largest container-isolation gap in the stack.
|
||||||
|
|
||||||
|
**Why it is probably there:** PostgreSQL needs adequate shared memory and
|
||||||
|
benefits from the TCP keepalive `sysctls`. The compose file already sets
|
||||||
|
`shm_size: 1536m` and the `sysctls:` list explicitly — both of which Docker
|
||||||
|
applies *without* needing privileged mode. So `privileged: true` is most
|
||||||
|
likely a leftover, not a hard requirement.
|
||||||
|
|
||||||
|
**Recommended action — test without it:**
|
||||||
|
|
||||||
|
1. In a maintenance window, remove `privileged: true` and start the service.
|
||||||
|
2. Confirm PostgreSQL starts, the namespaced `sysctls` apply
   (`docker exec obmp-psql sysctl net.ipv4.tcp_keepalive_time`), and shared
   memory is honored (`docker exec obmp-psql df -h /dev/shm` should report the
   configured size; `/proc/meminfo` inside a container shows host-wide
   figures). Also watch for `could not resize shared memory segment` errors
   in the log.
|
||||||
|
3. If everything is healthy, leave it removed.
|
||||||
|
|
||||||
|
If a specific capability turns out to be needed, add only that one instead of
|
||||||
|
going fully privileged:
|
||||||
|
|
||||||
|
```yaml
psql:
  # privileged: true   <-- removed
  shm_size: 1536m
  cap_drop:
    - ALL
  cap_add:
    - CHOWN
    - SETUID
    - SETGID
    - DAC_OVERRIDE   # add only capabilities proven necessary by testing
  sysctls:
    - net.ipv4.tcp_keepalive_intvl=30
    - net.ipv4.tcp_keepalive_probes=5
    - net.ipv4.tcp_keepalive_time=180
```
|
||||||
|
|
||||||
|
The `sysctls:` block stays — those are namespaced and do not require
|
||||||
|
privileged mode.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Priority 5 — Container hardening (defense in depth)
|
||||||
|
|
||||||
|
Apply across services after the higher-priority items. Test each service
|
||||||
|
individually — `read_only` in particular will surface paths a service writes
|
||||||
|
to that then need explicit `tmpfs` mounts.
|
||||||
|
|
||||||
|
### 5.1 `no-new-privileges`
|
||||||
|
|
||||||
|
Prevents a process inside a container from gaining privileges via setuid
|
||||||
|
binaries. Safe to apply to every service:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
security_opt:
|
||||||
|
- no-new-privileges:true
|
||||||
|
```
|
||||||
|
|
||||||
|
### 5.2 Drop capabilities
|
||||||
|
|
||||||
|
Most of these services need almost no Linux capabilities. Start from zero and
|
||||||
|
add back only what breaks:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
cap_drop:
|
||||||
|
- ALL
|
||||||
|
```
|
||||||
|
|
||||||
|
- `grafana`, `whois`, `portal`, `zookeeper` — typically run fine with
|
||||||
|
`cap_drop: [ALL]`.
|
||||||
|
- `collector`, `kafka`, `psql`, `psql-app` — drop ALL, then add back any
|
||||||
|
capability proven necessary (see Priority 4 for `psql`).
|
||||||
|
- `traffic-gen*` legitimately need `NET_RAW`/`NET_ADMIN` (Scapy) — leave those
|
||||||
|
`cap_add` entries; they are already minimal.
|
||||||
|
|
||||||
|
### 5.3 Read-only root filesystem
|
||||||
|
|
||||||
|
Make the root filesystem immutable where the service only writes to known
|
||||||
|
volumes:
|
||||||
|
|
||||||
|
```yaml
grafana:
  read_only: true
  tmpfs:
    - /tmp
  # /var/lib/grafana is already a bind mount; writes go there, not to rootfs

portal:
  read_only: true   # nginx:alpine static site; add tmpfs for nginx
  tmpfs:
    - /tmp
    - /var/cache/nginx
    - /var/run
```
|
||||||
|
|
||||||
|
`read_only` is straightforward for `grafana`, `portal`, and `whois`. It is
|
||||||
|
trickier for `psql`, `kafka`, and `zookeeper` (they write to data volumes but
|
||||||
|
also expect a writable rootfs in places) — test individually and add `tmpfs`
|
||||||
|
mounts for any write paths, or skip `read_only` for those and rely on
|
||||||
|
`cap_drop` + `no-new-privileges`.
|
||||||
|
|
||||||
|
### 5.4 Pin and scan images
|
||||||
|
|
||||||
|
Images are already version-pinned (`grafana:9.1.7`, `cp-kafka:7.1.1`,
|
||||||
|
`openbmp/postgres:2.2.1`, etc.) — good. Add periodic vulnerability scanning:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
trivy image openbmp/postgres:2.2.1
|
||||||
|
trivy image grafana/grafana:9.1.7
|
||||||
|
```
|
||||||
|
|
||||||
|
Note Grafana 9.1.7 is old; review Grafana security advisories and plan an
|
||||||
|
upgrade path. Track CVEs for the pinned Confluent and OpenBMP images too.

### 5.5 Resource limits

Every service already has a `mem_limit`. For production also set `cpus:` (or
`deploy.resources.limits`) so a runaway query or ingest burst cannot starve
the host; this also mitigates local denial-of-service. See
`docs/production-sizing.md` for target values.
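
Before committing to `cpus:` values it can help to see how the containers
actually behave. The sketch below ranks containers by CPU share from a single
`docker stats` sample; the helper name `rank_by_cpu` is invented for this
example, and one sample is only a rough hint, not a sizing method.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rank containers by CPU% from `docker stats` output.
rank_by_cpu() {
    # Input lines: "<name> <cpu%> <mem%>", e.g. "obmp-psql 37.50% 41.20%".
    # Strip the percent signs, then sort numerically on the CPU column.
    tr -d '%' | LC_ALL=C sort -k2,2nr
}

# Usage (needs the Docker daemon, so it is left commented out):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' | rank_by_cpu
```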

---

## Priority 6 — Authelia / access control

Authelia fronts Grafana (ROADMAP C5). For production:

- Enforce **TOTP / 2FA** for all operator accounts; do not allow `one_factor`
  for the Grafana route.
- Set short session timeouts and an inactivity expiry in the Authelia config.
- Use strong, unique passwords; back the user store with your IdP / LDAP if
  available rather than the file backend.
- Ensure Authelia's own secrets (`jwt_secret`, `session.secret`,
  `storage.encryption_key`) are strong and stored as secrets, not literals.
- Confirm the reverse proxy strips any client-supplied `Remote-User` header
  before Authelia sets it; otherwise the auth-proxy trust model is bypassable
  (see 1.3).
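
The header-stripping point in the last bullet can be spot-checked from outside
with a forged header. This is a rough sketch: the function name `spoof_check`
and the URL are placeholders, and a `200` is a reason to look closer rather
than proof of a bypass, since some setups serve a login page with that status.

```bash
#!/usr/bin/env bash
# Hypothetical check: send an unauthenticated request carrying a forged
# Remote-User header and report the HTTP status the proxy returns.
spoof_check() {
    local url="$1" code
    code="$(curl -s -o /dev/null -w '%{http_code}' -H 'Remote-User: admin' "${url}")"
    case "${code}" in
        200) echo "REVIEW ${code}: forged header may have been trusted"; return 1 ;;
        *)   echo "OK ${code}: request was not served as authenticated" ;;
    esac
}

# Usage (placeholder URL):
#   spoof_check https://grafana.example.net/
```

A redirect to the Authelia portal or a `401` is the expected outcome when the
proxy drops the client-supplied header.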

---

## Quick checklist

- [ ] Rotate all six default credentials; remove literals from compose, move to `.env` / secrets manager
- [ ] Update `openbmp-ds.yml` datasource password to match
- [ ] Firewall BMP port 5000 to router management subnets (`DOCKER-USER` chain)
- [ ] Bind 5432 / 8086 / 4300 to loopback or drop the port mappings
- [ ] Bind Grafana 3000 to loopback; reach it only via Authelia
- [ ] Remove the Kafka `PLAINTEXT_HOST` listener + 9092 mapping (or enable SASL_SSL if external access is needed)
- [ ] Create `grafana_ro` least-privilege DB role; repoint the datasource
- [ ] Tighten `pg_hba.conf`; require `scram-sha-256`
- [ ] Enable PostgreSQL `ssl = on` with a server certificate
- [ ] Test removing `privileged: true` from `psql`; replace with specific `cap_add` if needed
- [ ] Add `security_opt: [no-new-privileges:true]` to all services
- [ ] Add `cap_drop: [ALL]` and add back only required capabilities
- [ ] Add `read_only: true` + `tmpfs` to `grafana` / `portal` / `whois`
- [ ] Add `cpus:` limits per service
- [ ] Scan images with `trivy`; plan a Grafana upgrade off 9.1.7
- [ ] Enforce TOTP and short sessions in Authelia

71
obmp-grafana/provisioning/alerting/contact-points.yaml
Normal file
@ -0,0 +1,71 @@
# OpenBMP: Grafana contact points & notification policy provisioning
# Grafana 9.1.7 (apiVersion: 1)
#
# Defines WHERE alert notifications go (contact points) and WHICH alerts go
# there (the notification policy tree). Pairs with obmp-alerts.yaml in this
# directory.
#
# ----------------------------------------------------------------------
# OPERATOR REVIEW: this file ships with PLACEHOLDERS. Fill them in.
# ----------------------------------------------------------------------
#   * The 'obmp-ops' contact point below has BOTH an email and a webhook
#     receiver as examples. Delete whichever you do not use and fill in real
#     values for the one you keep.
#   * EMAIL requires Grafana SMTP to be configured (the [smtp] section of
#     grafana.ini, or GF_SMTP_* env vars on the obmp-grafana container).
#     Without working SMTP the email receiver silently fails.
#   * WEBHOOK url: point it at your alerting system (Slack incoming webhook,
#     PagerDuty Events API, Mattermost, an internal handler, etc.).
#   * After editing, restart Grafana and verify under
#     Alerting > Contact points > (test).
# ----------------------------------------------------------------------

apiVersion: 1

# --- Contact points ----------------------------------------------------
contactPoints:
  - orgId: 1
    name: obmp-ops
    receivers:
      # ---- Email receiver (requires Grafana SMTP configured) ----
      - uid: obmp-ops-email
        type: email
        settings:
          # REPLACE with the real NOC / on-call distribution address(es).
          # Comma-separate multiple recipients.
          addresses: noc@example.net
          singleEmail: false
        disableResolveMessage: false

      # ---- Webhook receiver (Slack / PagerDuty / internal handler) ----
      # Delete this block if you only use email.
      - uid: obmp-ops-webhook
        type: webhook
        settings:
          # REPLACE with your real webhook endpoint.
          url: https://hooks.example.net/services/REPLACE-ME
          httpMethod: POST
        disableResolveMessage: false

# --- Notification policy tree -----------------------------------------
# The root policy routes every alert from obmp-alerts.yaml to 'obmp-ops'.
# Sub-routes split by the `severity` label so critical alerts can page
# faster / repeat sooner than warnings.
policies:
  - orgId: 1
    receiver: obmp-ops
    # Group alerts that share these labels into a single notification.
    group_by: ['alertname', 'service']
    # Timing for the default (warning-ish) path.
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      # Critical alerts (peer down, router BMP down): notify fast, repeat
      # more often until resolved.
      - receiver: obmp-ops
        matchers:
          - severity = critical
        group_wait: 10s
        group_interval: 2m
        repeat_interval: 1h
270
obmp-grafana/provisioning/alerting/obmp-alerts.yaml
Normal file
@ -0,0 +1,270 @@
# OpenBMP: Grafana unified-alerting rule provisioning
# Grafana 9.1.7 (apiVersion: 1)
#
# Provisioned alert rules for the OpenBMP BGP-monitoring stack. They query the
# PostgreSQL datasource (uid: obmp_postgres) and fire on BGP peer/router
# session loss, peer flap storms, and RPKI-invalid routes.
#
# ----------------------------------------------------------------------
# DEPLOYMENT
# ----------------------------------------------------------------------
# This file is read by Grafana from /etc/grafana/provisioning/alerting/.
# The compose stack bind-mounts ${OBMP_DATA_ROOT}/grafana/provisioning into
# the container, so copy this directory there and restart Grafana:
#
#   cp -r obmp-grafana/provisioning/alerting ${OBMP_DATA_ROOT}/grafana/provisioning/
#   docker compose -p obmp restart grafana
#
# Pair it with contact-points.yaml (in this directory) for notifications.
#
# ----------------------------------------------------------------------
# OPERATOR REVIEW: fields you should check before relying on these
# ----------------------------------------------------------------------
#   * folder: OBMP-Base reuses the existing 'OBMP-Base' dashboard folder
#     (UID '1001') so the rules have a home in the UI. The rule group below
#     references the folder by name, not by UID. Change it to a dedicated
#     alerting folder name if you prefer, and confirm the rules appear under
#     that folder after first load.
#   * datasourceUid: obmp_postgres is confirmed correct for this stack.
#   * Thresholds and `for:` durations below are reasonable starting points.
#     Tune them against your production baseline (40 full-table routers will
#     have a different normal flap/churn profile than the lab).
#   * The reduce/threshold expression refIds (B, C) are internal to each
#     rule; do not rename them without updating the matching `expression:`
#     and `condition:` references.
#   * Alert-rule provisioning YAML is intricate. These definitions are
#     intentionally minimal and well-commented. After first load, open each
#     rule in the Grafana UI (Alerting > Alert rules) and confirm it
#     evaluates without error before depending on it for paging.
# ----------------------------------------------------------------------

apiVersion: 1

groups:
  - orgId: 1
    name: OpenBMP BGP Health
    folder: OBMP-Base
    # How often every rule in this group is evaluated.
    interval: 1m
    rules:

      # ------------------------------------------------------------------
      # (a) BGP peer down within the last 15 minutes
      # ------------------------------------------------------------------
      # bgp_peers.state is an enum ('up'/'down'); .timestamp is the last
      # state-change time. A peer whose state is 'down' AND changed within
      # the last 15 min indicates a recent session loss.
      - uid: obmp-peer-down
        title: BGP Peer Down (recent)
        condition: C
        for: 5m
        data:
          - refId: A
            relativeTimeRange: { from: 600, to: 0 }
            datasourceUid: obmp_postgres
            model:
              refId: A
              datasource: { type: postgres, uid: obmp_postgres }
              format: table
              rawSql: >
                SELECT count(*)::float8 AS value
                FROM bgp_peers
                WHERE state = 'down'
                  AND timestamp > (now() AT TIME ZONE 'utc') - interval '15 minutes';
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              datasource: { type: __expr__, uid: __expr__ }
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              datasource: { type: __expr__, uid: __expr__ }
              expression: B
              # Fire when one or more peers went down in the last 15 min.
              conditions:
                - evaluator: { type: gt, params: [0] }
        labels:
          severity: critical
          service: bmp
        annotations:
          summary: One or more BGP peers went down in the last 15 minutes
          description: >
            {{ $values.B }} BGP peer(s) are in state 'down' with a state
            change within the last 15 minutes. Check the OBMP peer
            inventory and the affected routers.

      # ------------------------------------------------------------------
      # (b) Peer flap storm: >5 down-events for one peer in 1 hour
      # ------------------------------------------------------------------
      # peer_event_log records every peer state transition. Counting 'down'
      # events per peer over the last hour detects a flapping session even
      # if the peer is currently 'up'. The inner query groups per peer; the
      # outer takes the worst offender's count.
      - uid: obmp-peer-flap-storm
        title: BGP Peer Flap Storm
        condition: C
        for: 0m
        data:
          - refId: A
            relativeTimeRange: { from: 3600, to: 0 }
            datasourceUid: obmp_postgres
            model:
              refId: A
              datasource: { type: postgres, uid: obmp_postgres }
              format: table
              rawSql: >
                SELECT coalesce(max(c), 0)::float8 AS value
                FROM (
                  SELECT count(*) AS c
                  FROM peer_event_log
                  WHERE state = 'down'
                    AND timestamp > (now() AT TIME ZONE 'utc') - interval '1 hour'
                  GROUP BY peer_hash_id
                ) s;
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              datasource: { type: __expr__, uid: __expr__ }
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              datasource: { type: __expr__, uid: __expr__ }
              expression: B
              # >5 down-events for a single peer within 1h = flap storm.
              conditions:
                - evaluator: { type: gt, params: [5] }
        labels:
          severity: warning
          service: bmp
        annotations:
          summary: A BGP peer is flapping (more than 5 resets in the last hour)
          description: >
            At least one peer has logged {{ $values.B }} 'down' events in
            peer_event_log within the last hour. Investigate link/session
            instability on the affected peer.

      # ------------------------------------------------------------------
      # (c) RPKI-invalid routes present
      # ------------------------------------------------------------------
      # ip_rib has no RPKI column on this schema, so validity is derived by
      # joining against rpki_validator (ROA cache, refreshed by the psql-app
      # RPKI cron). A route is "invalid" when a covering ROA exists for the
      # prefix but NO ROA matches its origin AS.
      #
      # NOTE: rpki_validator stays empty until the RPKI cron (enabled with
      # ENABLE_RPKI=1, runs every ~2h) has completed at least once. Until
      # then this rule reports 0 because there are no ROAs to match against,
      # not because the RIB is clean.
      - uid: obmp-rpki-invalid
        title: RPKI-Invalid Routes Present
        condition: C
        for: 10m
        data:
          - refId: A
            relativeTimeRange: { from: 600, to: 0 }
            datasourceUid: obmp_postgres
            model:
              refId: A
              datasource: { type: postgres, uid: obmp_postgres }
              format: table
              rawSql: >
                SELECT count(*)::float8 AS value
                FROM ip_rib r
                WHERE r.iswithdrawn = false
                  AND r.origin_as IS NOT NULL
                  AND EXISTS (
                    SELECT 1 FROM rpki_validator v
                    WHERE r.prefix <<= v.prefix
                      AND r.prefix_len BETWEEN masklen(v.prefix) AND v.prefix_len_max
                  )
                  AND NOT EXISTS (
                    SELECT 1 FROM rpki_validator v2
                    WHERE r.prefix <<= v2.prefix
                      AND r.prefix_len BETWEEN masklen(v2.prefix) AND v2.prefix_len_max
                      AND v2.origin_as = r.origin_as
                  );
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              datasource: { type: __expr__, uid: __expr__ }
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              datasource: { type: __expr__, uid: __expr__ }
              expression: B
              # Any RPKI-invalid route is worth surfacing. Raise the param
              # (e.g. to 10) if you expect a steady-state baseline of
              # invalids and only want to alert on spikes.
              conditions:
                - evaluator: { type: gt, params: [0] }
        labels:
          severity: warning
          service: routing-security
        annotations:
          summary: RPKI-invalid routes are present in the RIB
          description: >
            {{ $values.B }} route(s) in ip_rib are RPKI-invalid (a covering
            ROA exists but none matches the route's origin AS). Possible
            mis-origination or hijack; review the RPKI Validation dashboard.

      # ------------------------------------------------------------------
      # (d) Router BMP session down
      # ------------------------------------------------------------------
      # routers.state is the BMP session state for each monitored router.
      # 'down' means the router's BMP feed to the collector has dropped.
      - uid: obmp-router-bmp-down
        title: Router BMP Session Down
        condition: C
        for: 5m
        data:
          - refId: A
            relativeTimeRange: { from: 600, to: 0 }
            datasourceUid: obmp_postgres
            model:
              refId: A
              datasource: { type: postgres, uid: obmp_postgres }
              format: table
              rawSql: >
                SELECT count(*)::float8 AS value
                FROM routers
                WHERE state = 'down';
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              datasource: { type: __expr__, uid: __expr__ }
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              datasource: { type: __expr__, uid: __expr__ }
              expression: B
              # Any router with a down BMP session.
              conditions:
                - evaluator: { type: gt, params: [0] }
        labels:
          severity: critical
          service: bmp
        annotations:
          summary: One or more routers have a down BMP session
          description: >
            {{ $values.B }} router(s) are in BMP state 'down', meaning the
            collector is no longer receiving BMP from them. Check the
            router BMP config and reachability to the collector on port 5000.
105
scripts/pg-backup.sh
Executable file
@ -0,0 +1,105 @@
#!/usr/bin/env bash
#
# pg-backup.sh: logical backup of the OpenBMP PostgreSQL database.
#
# Performs a `pg_dump` of the `openbmp` database inside the obmp-psql
# container, writes a timestamped compressed dump to a backup directory,
# and prunes dumps older than the configured retention.
#
# Usage:
#   ./pg-backup.sh
#
# Configuration (environment variables, all optional):
#   OBMP_DATA_ROOT              Base data dir. Default: /var/openbmp
#                               Backups go to ${OBMP_DATA_ROOT}/backups unless
#                               OBMP_BACKUP_DIR is set.
#   OBMP_BACKUP_DIR             Explicit backup directory. Overrides the default.
#   OBMP_PG_CONTAINER           Postgres container name. Default: obmp-psql
#   OBMP_PG_DB                  Database name. Default: openbmp
#   OBMP_PG_USER                Database user. Default: openbmp
#   OBMP_BACKUP_RETENTION_DAYS  Prune dumps older than N days. Default: 14
#
# Output format:
#   pg_dump custom format (-Fc), gzip-level compressed by pg_dump itself.
#   Restore with `pg_restore`; see docs/backup-restore.md.
#
# This script is idempotent and safe to run repeatedly. It does not stop
# the database; pg_dump takes a consistent MVCC snapshot of a live DB.
#
# Make it executable once:
#   chmod +x scripts/pg-backup.sh
#
# ----------------------------------------------------------------------
# Scheduling via cron
# ----------------------------------------------------------------------
# Run `crontab -e` and add (daily at 02:30, log to a file):
#
#   30 2 * * * OBMP_DATA_ROOT=/var/openbmp /home/user/obmp-docker/scripts/pg-backup.sh >> /var/openbmp/backups/pg-backup.log 2>&1
#
# The script must be able to reach the Docker daemon, so run it as a user
# in the `docker` group (or root). For systemd-based hosts a
# systemd timer is an equally good alternative to cron.
# ----------------------------------------------------------------------

set -euo pipefail

# --- Configuration -----------------------------------------------------
OBMP_DATA_ROOT="${OBMP_DATA_ROOT:-/var/openbmp}"
BACKUP_DIR="${OBMP_BACKUP_DIR:-${OBMP_DATA_ROOT}/backups}"
PG_CONTAINER="${OBMP_PG_CONTAINER:-obmp-psql}"
PG_DB="${OBMP_PG_DB:-openbmp}"
PG_USER="${OBMP_PG_USER:-openbmp}"
RETENTION_DAYS="${OBMP_BACKUP_RETENTION_DAYS:-14}"

TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
DUMP_NAME="openbmp-${TIMESTAMP}.dump"
DUMP_PATH="${BACKUP_DIR}/${DUMP_NAME}"
DUMP_TMP="${DUMP_PATH}.partial"

log()  { printf '%s [pg-backup] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"; }
fail() { log "ERROR: $*" >&2; exit 1; }

# --- Pre-flight checks -------------------------------------------------
command -v docker >/dev/null 2>&1 || fail "docker command not found in PATH"

if ! docker inspect -f '{{.State.Running}}' "${PG_CONTAINER}" 2>/dev/null | grep -q true; then
    fail "container '${PG_CONTAINER}' is not running"
fi

mkdir -p "${BACKUP_DIR}" || fail "cannot create backup directory ${BACKUP_DIR}"

# --- Backup ------------------------------------------------------------
# Write to a .partial file first, then atomically rename on success so a
# crashed/interrupted run never leaves a truncated dump that looks valid.
log "starting backup of database '${PG_DB}' from container '${PG_CONTAINER}'"

if docker exec "${PG_CONTAINER}" \
        pg_dump -U "${PG_USER}" -d "${PG_DB}" -Fc --no-owner --no-privileges \
        > "${DUMP_TMP}"; then
    mv -f "${DUMP_TMP}" "${DUMP_PATH}"
else
    rm -f "${DUMP_TMP}"
    fail "pg_dump failed; no backup written"
fi

DUMP_SIZE="$(du -h "${DUMP_PATH}" | cut -f1)"
log "backup complete: ${DUMP_PATH} (${DUMP_SIZE})"

# --- Prune old backups -------------------------------------------------
# Only prune files matching our own naming pattern, so nothing else in the
# directory (logs, manual dumps) is touched.
log "pruning dumps older than ${RETENTION_DAYS} days"
PRUNED=0
while IFS= read -r -d '' old; do
    rm -f "${old}"
    log "  removed $(basename "${old}")"
    PRUNED=$((PRUNED + 1))
done < <(find "${BACKUP_DIR}" -maxdepth 1 -type f \
            -name 'openbmp-*.dump' -mtime "+${RETENTION_DAYS}" -print0)
log "pruned ${PRUNED} old dump(s)"

# Also clean up any stale .partial files from previous crashed runs.
find "${BACKUP_DIR}" -maxdepth 1 -type f -name 'openbmp-*.dump.partial' \
    -mtime +1 -delete 2>/dev/null || true

log "done"