# OpenBMP Production Security Hardening
A prioritized checklist for hardening the OpenBMP Docker stack before exposing
it to a production ISP network of 40 full-table-edge routers. Work top to
bottom — items are ordered roughly by risk reduction per unit effort.
This document **recommends** changes. It does not modify `docker-compose.yml`
or any running service. Apply the changes in a maintenance window and test.
> Threat model in brief: the stack ingests BMP from production routers, stores
> the full DFZ in PostgreSQL, and exposes Grafana to operators. The crown
> jewels are (a) the database, (b) the Grafana admin plane, and (c) the BMP
> ingest port. Everything below protects one of those three.
---
## Priority 0 — Credentials (do this first)
Every service currently ships with the placeholder credential `openbmp` (or a
close variant), and these defaults are committed in the repository, mostly in
`docker-compose.yml`:
| Service | Setting | Current value |
|---------|---------|---------------|
| PostgreSQL | `POSTGRES_USER` / `POSTGRES_PASSWORD` | `openbmp` / `openbmp` |
| psql-app | `POSTGRES_PASSWORD` | `openbmp` |
| whois | `POSTGRES_PASSWORD` | `openbmp` |
| Grafana | `GF_SECURITY_ADMIN_PASSWORD` | `openbmp` |
| InfluxDB | `DOCKER_INFLUXDB_INIT_PASSWORD` | `openbmp123` |
| InfluxDB | `DOCKER_INFLUXDB_INIT_ADMIN_TOKEN` | `openbmp-telemetry-token` |
| Grafana datasource | `secureJsonData.password` | `openbmp` (in `openbmp-ds.yml`) |
### 0.1 Move every secret to `.env` (or a secrets manager)
`.env` is git-ignored. As a minimum, replace the hardcoded literals in
`docker-compose.yml` with `${VAR}` references and define them in `.env`:
```env
# .env — never commit this file
POSTGRES_PASSWORD=<long-random-string>
GF_SECURITY_ADMIN_PASSWORD=<long-random-string>
INFLUXDB_ADMIN_PASSWORD=<long-random-string>
INFLUXDB_ADMIN_TOKEN=<long-random-token>
```
```yaml
# docker-compose.yml (recommended edit; operator applies)
grafana:
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=${GF_SECURITY_ADMIN_PASSWORD:?set in .env}
psql:
  environment:
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:?set in .env}
```
The `:?` form makes the stack fail fast if a secret is missing; a plain
`${VAR}` would be replaced with an empty string and only produce a warning.
Generate strong values:
```bash
openssl rand -base64 32 # passwords
openssl rand -hex 32 # tokens
```
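
The two steps above can be combined into a small helper that writes a fresh `.env`. This is a sketch under stated assumptions: the key names are those from the `.env` example, hex is used for every value (a simplification that avoids base64's `+`, `/` and `=`), and `render_env` / `gen_secret` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: create a .env with fresh random secrets.

gen_secret() {
  # 32 random bytes rendered as 64 hex characters.
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  fi
}

render_env() {
  local out="$1" key
  if [ -e "$out" ]; then
    echo "refusing to overwrite existing $out" >&2
    return 1
  fi
  (
    umask 077   # the file is created 0600, never briefly world-readable
    for key in POSTGRES_PASSWORD GF_SECURITY_ADMIN_PASSWORD \
               INFLUXDB_ADMIN_PASSWORD INFLUXDB_ADMIN_TOKEN; do
      printf '%s=%s\n' "$key" "$(gen_secret)"
    done > "$out"
  )
}
```

`render_env .env` refuses to overwrite an existing file, so rotating means moving the old file aside first.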
### 0.2 For a real production deployment, use a secrets manager
`.env` on disk is better than committed literals, but it is still a plaintext
file, and the values it feeds into container environments are visible through
`docker inspect` to anyone in the `docker` group. For production:
- **Docker Compose secrets** (`secrets:` block, files mounted at
`/run/secrets/...`) — the lowest-friction upgrade; keep the secret files
outside the repo, `chmod 600`, owned by root.
- **HashiCorp Vault**, **AWS Secrets Manager**, **Bitwarden Secrets**, or your
existing ISP secret store — inject at deploy time via a wrapper that renders
`.env` from the vault and shreds it after `docker compose up`.
Whatever the choice: rotate every credential in the table above on first
production deploy. They have been in git history as `openbmp` (or a variant)
and must be considered compromised.
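
The render-and-shred wrapper mentioned above could look roughly like this. It is a sketch, not a drop-in script: `fetch_secret` is a hypothetical placeholder for whatever CLI your secret store provides, and the four variable names are the ones from the `.env` example in 0.1.

```bash
#!/usr/bin/env bash
# Placeholder: replace the body with your secret store's CLI call.
# The function name and the one-secret-per-key convention are hypothetical.
fetch_secret() {
  echo "fetch_secret is not implemented for key: $1" >&2
  return 1
}

deploy_with_ephemeral_env() {
  local envfile key value rc=0
  envfile="$(mktemp)" || return 1   # mktemp creates the file with mode 0600
  for key in POSTGRES_PASSWORD GF_SECURITY_ADMIN_PASSWORD \
             INFLUXDB_ADMIN_PASSWORD INFLUXDB_ADMIN_TOKEN; do
    if ! value="$(fetch_secret "$key")"; then
      rc=1
      break
    fi
    printf '%s=%s\n' "$key" "$value" >> "$envfile"
  done
  if [ "$rc" -eq 0 ]; then
    docker compose --env-file "$envfile" up -d || rc=$?
  fi
  # Remove the rendered file whether or not the deploy succeeded.
  shred -u "$envfile" 2>/dev/null || rm -f "$envfile"
  return "$rc"
}
```

`shred` gives limited guarantees on journaling or copy-on-write filesystems; pointing `TMPDIR` at a tmpfs such as `/dev/shm` avoids writing the file to disk at all.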
### 0.3 Rotate the Grafana datasource password in lockstep
`obmp-grafana/provisioning/datasources/openbmp-ds.yml` carries
`secureJsonData.password`, which Grafana reads at startup. When you change the
PostgreSQL password, update this file too and restart Grafana. The file
supports `$__file{}` and env-var expansion (`password: $POSTGRES_PASSWORD`),
provided the variable is set in the Grafana container's environment.

---
## Priority 1 — Network exposure / firewalling
The host currently publishes these ports to `0.0.0.0`: 5000 (BMP), 5432
(PostgreSQL), 9092 (Kafka), 3000 (Grafana), 8086 (InfluxDB), 4300 (whois),
9091 (Authelia). Most should not be world-reachable.
### 1.1 BMP collector (port 5000) — restrict to router management subnets
The collector accepts a BMP session from any source. A rogue BMP feed can
inject bogus routers/peers/prefixes into the database. Firewall it to the
router management subnets only.
`nftables` example (preferred on modern hosts):
```nft
# /etc/nftables.conf — adjust subnets to your router management ranges
table inet obmp {
    chain input {
        type filter hook input priority 0; policy accept;
        # BMP ingest: only from router management subnets
        tcp dport 5000 ip saddr { 10.100.0.0/24, 10.100.1.0/24 } accept
        tcp dport 5000 drop
    }
}
```
`iptables` equivalent:
```bash
iptables -A INPUT -p tcp --dport 5000 -s 10.100.0.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 5000 -s 10.100.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 5000 -j DROP
```
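> Docker publishes container ports through NAT and the `FORWARD` chain, so
> rules in `INPUT` (including the nftables `input` hook above) never see that
> traffic. Put the rules in the `DOCKER-USER` chain instead. `-I` inserts at
> the head of the chain, so add the `DROP` first and the allow rules after it;
> appending the `DROP` with `-A` can leave it behind an earlier `RETURN`, where
> it never matches:
> ```bash
> iptables -I DOCKER-USER -p tcp --dport 5000 -j DROP
> iptables -I DOCKER-USER -p tcp --dport 5000 -s 10.100.1.0/24 -j RETURN
> iptables -I DOCKER-USER -p tcp --dport 5000 -s 10.100.0.0/24 -j RETURN
> ```
> `DOCKER-USER` sees packets after DNAT, so `--dport` matches the container
> port; that equals the published port only for a `5000:5000` mapping.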
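
With 40 routers the allow list tends to grow beyond two subnets, and the rule order is easy to get wrong by hand. The sketch below only prints the `iptables` commands for review; the function name and the dry-run approach are illustrative choices, not part of the stack.

```bash
#!/usr/bin/env bash
# Sketch: print DOCKER-USER rules for one port and a list of allowed subnets.

bmp_docker_user_rules() {
  local port="$1" subnet
  shift
  # -I inserts at the head of the chain, so the DROP is emitted first and
  # every allow rule lands above it. Final order: allows, DROP, existing rules.
  printf 'iptables -I DOCKER-USER -p tcp --dport %s -j DROP\n' "$port"
  for subnet in "$@"; do
    printf 'iptables -I DOCKER-USER -p tcp --dport %s -s %s -j RETURN\n' \
      "$port" "$subnet"
  done
}
```

Review the output, then pipe it to a root shell, for example `bmp_docker_user_rules 5000 10.100.0.0/24 10.100.1.0/24 | sudo sh`. Rules added this way do not survive a reboot unless saved with your distribution's persistence mechanism.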
### 1.2 PostgreSQL (5432), Kafka (9092), InfluxDB (8086), whois (4300)
None of these need to be reachable from outside the stack:
- **PostgreSQL** — only `psql-app`, `whois`, and `grafana` connect, all on the
Compose network. Bind the published port to loopback only, or drop the
`ports:` mapping entirely:
```yaml
# docker-compose.yml — psql service
ports:
- "127.0.0.1:5432:5432" # localhost only; or remove entirely
```
- **Kafka 9092** — see Priority 2.
- **InfluxDB 8086** — only Grafana and Telegraf use it; bind to loopback or
drop the mapping (Telegraf uses host networking and reaches it via
localhost; Grafana reaches it on the Compose network).
- **whois 4300** — expose only if you actually offer a public whois service;
otherwise bind to loopback.
For anything that genuinely must be reachable, restrict by source with the
firewall pattern from 1.1.
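
To see what is currently published on all interfaces, the port column of `docker ps` can be filtered. This is a minimal sketch; `flag_public_ports` is a hypothetical helper name, and it only inspects the text that `docker ps` prints.

```bash
#!/usr/bin/env bash
# Sketch: flag containers that publish a port on every interface.

flag_public_ports() {
  # Reads "name<TAB>ports" lines on stdin and prints those where a port is
  # bound to 0.0.0.0 or the IPv6 wildcard instead of a specific address.
  local name ports
  while IFS=$'\t' read -r name ports; do
    case "$ports" in
      *0.0.0.0:*|*:::*|*'[::]:'*) printf '%s\t%s\n' "$name" "$ports" ;;
    esac
  done
}
```

Run it as `docker ps --format '{{.Names}}\t{{.Ports}}' | flag_public_ports`; after the changes in this section the output should list only the ports you deliberately expose.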
### 1.3 Grafana (3000) — keep it behind Authelia
Authelia already fronts Grafana (the `auth` profile + `GF_AUTH_PROXY_*`
settings). Make that the *only* path:
- Bind Grafana's published port to loopback: `127.0.0.1:3000:3000`, and let
the reverse proxy / Authelia terminate TLS and reach it internally.
- Do **not** leave port 3000 directly reachable — `GF_AUTH_PROXY_ENABLED=true`
trusts the `Remote-User` header, so any client that can reach 3000 directly
and set that header bypasses authentication entirely.
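
After rebinding the port, it is worth probing from another machine to confirm the bypass is closed. The sketch below assumes the header name is `Remote-User` as described above, that `/api/user` is a reasonable authenticated endpoint to test, and that `admin` is an existing login; `probe_grafana_direct` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: check whether Grafana's direct port accepts a spoofed auth header.
# Usage (from a host outside the stack): probe_grafana_direct http://<host>:3000

probe_grafana_direct() {
  local url="$1" code
  # curl prints 000 and exits non-zero when it cannot connect at all.
  code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' \
            -H 'Remote-User: admin' "$url/api/user")" || code="000"
  case "$code" in
    000) echo "OK: $url is not reachable" ;;
    200) echo "FAIL: $url accepted a client-supplied Remote-User header"
         return 1 ;;
    *)   echo "WARN: $url is reachable (HTTP $code); the port should not be exposed" ;;
  esac
}
```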
---
## Priority 2 — Kafka transport security
Kafka is currently **PLAINTEXT** and advertises a host-IP listener:
```yaml
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://obmp-kafka:29092,PLAINTEXT_HOST://${HOST_IP}:9092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
```
The `obmp-kafka:29092` listener is internal to the Compose network and is the
only one the collector and psql-app use. The `PLAINTEXT_HOST://...:9092`
listener exists only for outside access and is not needed by the core stack.
**Recommended (simplest, most secure): remove the host listener.** If nothing
outside the Compose network consumes Kafka, drop the `9092` port mapping and
the `PLAINTEXT_HOST` advertised listener so Kafka is reachable only on the
internal Docker network:
```yaml
kafka:
  # remove the - "9092:9092" ports entry
  environment:
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://obmp-kafka:29092
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
```
**If external Kafka access is genuinely required** (e.g. a separate analytics
consumer, or the split-host architecture in `production-sizing.md` where
Kafka and the DB are on different hosts), do **not** leave it PLAINTEXT on a
routed network. Enable SASL_SSL on the external listener:
```yaml
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://obmp-kafka:29092,SASL_SSL://${HOST_IP}:9092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,SASL_SSL:SASL_SSL
KAFKA_SASL_ENABLED_MECHANISMS: SCRAM-SHA-512
KAFKA_SSL_KEYSTORE_LOCATION: /etc/kafka/secrets/kafka.keystore.jks
KAFKA_SSL_KEYSTORE_PASSWORD: ${KAFKA_KEYSTORE_PASSWORD}
KAFKA_SSL_KEY_PASSWORD: ${KAFKA_KEY_PASSWORD}
KAFKA_SSL_TRUSTSTORE_LOCATION: /etc/kafka/secrets/kafka.truststore.jks
KAFKA_SSL_TRUSTSTORE_PASSWORD: ${KAFKA_TRUSTSTORE_PASSWORD}
```
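Keep the internal `PLAINTEXT://obmp-kafka:29092` listener for the collector
and psql-app: intra-Compose traffic on a private bridge does not need TLS, and
adding SASL there means reconfiguring both clients. Treat the block above as a
starting point rather than a complete configuration: SCRAM also needs a broker
JAAS configuration and user credentials created in Kafka, and the Confluent
image may expect its own `KAFKA_SSL_*` variable variants, so verify against the
documentation for the pinned `cp-kafka` version. At minimum, never publish
a PLAINTEXT Kafka listener on an IP that routes beyond the host.

---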
## Priority 3 — PostgreSQL hardening
### 3.1 Change the default `openbmp` / `openbmp` credentials
Covered in Priority 0. Note that `POSTGRES_USER`/`POSTGRES_PASSWORD` only take
effect when the data directory is initialized. To rotate on an existing
database, change the password in SQL and update every consumer:
```bash
docker exec -it obmp-psql psql -U openbmp -d openbmp \
-c "ALTER ROLE openbmp WITH PASSWORD '<new-strong-password>';"
```
Then update `POSTGRES_PASSWORD` for `psql-app` and `whois`, the
`secureJsonData.password` in `openbmp-ds.yml`, and restart those services.
### 3.2 Create a least-privilege role for Grafana
Grafana only needs to read. Do not let it connect as the owning role:
```sql
CREATE ROLE grafana_ro LOGIN PASSWORD '<strong-password>';
GRANT CONNECT ON DATABASE openbmp TO grafana_ro;
GRANT USAGE ON SCHEMA public TO grafana_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafana_ro;
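-- Default privileges are per creating role; this assumes openbmp creates the tables.
ALTER DEFAULT PRIVILEGES FOR ROLE openbmp IN SCHEMA public GRANT SELECT ON TABLES TO grafana_ro;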
```
Point `openbmp-ds.yml` at `grafana_ro`. This contains a Grafana compromise to
read-only and blocks SQL-panel writes.
### 3.3 Restrict `pg_hba.conf`
The default OpenBMP image is permissive (`host all all all md5` or similar).
Tighten it so only the stack's own subnet can connect, and require
`scram-sha-256`:
```conf
# pg_hba.conf (inside the obmp-psql container / mounted)
# TYPE DATABASE USER ADDRESS METHOD
local all all scram-sha-256
host openbmp openbmp 172.16.0.0/12 scram-sha-256 # Docker bridge range
host openbmp grafana_ro 172.16.0.0/12 scram-sha-256
hostssl openbmp openbmp 0.0.0.0/0 scram-sha-256 # only if remote DB host
# reject everything else
host all all 0.0.0.0/0 reject
```
Identify the actual Compose network subnet with
`docker network inspect obmp_default` and scope `ADDRESS` to it. Reload with
`docker exec obmp-psql psql -U openbmp -c "SELECT pg_reload_conf();"`.
> `scram-sha-256` requires `password_encryption = scram-sha-256` in
> `postgresql.conf` and that passwords were set/rotated *after* that change.
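
One way to check the second condition is to look at the stored verifiers. The sketch below assumes the container name `obmp-psql` and the `openbmp` superuser used elsewhere in this document; reading `pg_authid` requires superuser rights, and `list_non_scram_roles` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: list login roles whose stored password is not a SCRAM verifier.

list_non_scram_roles() {
  # -A unaligned output, -t tuples only: one role name per line.
  docker exec obmp-psql psql -U openbmp -d openbmp -At -c \
    "SELECT rolname FROM pg_authid
      WHERE rolcanlogin AND rolpassword IS NOT NULL
        AND rolpassword NOT LIKE 'SCRAM-SHA-256\$%';"
}
```

Any role it prints still has a non-SCRAM hash and needs its password set again before `scram-sha-256` in `pg_hba.conf` will let it log in.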
### 3.4 Enable SSL/TLS
The Grafana datasource already requests `sslmode: "require"` — but the server
must actually present a certificate. In `postgresql.conf`:
```conf
ssl = on
ssl_cert_file = '/var/lib/postgresql/server.crt'
ssl_key_file = '/var/lib/postgresql/server.key'
```
Generate a cert (self-signed is acceptable for an internal DB; use your
internal CA if you have one):
```bash
openssl req -new -x509 -days 825 -nodes -text \
  -out server.crt -keyout server.key -subj "/CN=obmp-psql"
chmod 600 server.key # PostgreSQL rejects a key that is group- or world-accessible
```
Mount both files at the paths configured above, with the key owned by the user
PostgreSQL runs as inside the container. For the strongest posture, move
clients to `sslmode: verify-full` once a proper CA chain is in place. This
matters most if PostgreSQL runs on a separate host (the split-host
architecture in `production-sizing.md`); intra-host Compose traffic is
lower-risk, but TLS is still recommended.
### 3.5 Limit listen addresses
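If PostgreSQL must accept connections from another host (split-host layout),
keep `listen_addresses` scoped; do not leave it at `*` if a single interface
suffices. The addresses must exist where PostgreSQL itself runs. For a
containerized server that is the container's network namespace, so a host-side
address such as the Docker bridge gateway cannot be bound there:
```conf
listen_addresses = 'localhost,10.100.2.10' # loopback + one service interface (example address)
```
For the containerized `psql` service it is usually simpler to keep the image's
listen setting and scope exposure at the port mapping instead, for example
`"10.100.2.10:5432:5432"`. On a single-host deployment, drop the `5432` port
mapping entirely (1.2) so the listener is reachable only on the Compose network.

---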
## Priority 4 — Drop `privileged: true` on the `psql` service
```yaml
psql:
  privileged: true # <-- remove or replace
  shm_size: 1536m
  sysctls:
    - net.ipv4.tcp_keepalive_intvl=30
    - net.ipv4.tcp_keepalive_probes=5
    - net.ipv4.tcp_keepalive_time=180
```
**Why it is a risk:** `privileged: true` gives the container *all* Linux
capabilities, disables seccomp/AppArmor confinement, and grants access to all
host devices. A compromise of PostgreSQL — the process most exposed to
untrusted route data — would then be a near-complete host compromise. This is
the single largest container-isolation gap in the stack.
**Why it is probably there:** PostgreSQL needs adequate shared memory and
benefits from the TCP keepalive `sysctls`. The compose file already sets
`shm_size: 1536m` and the `sysctls:` list explicitly — both of which Docker
applies *without* needing privileged mode. So `privileged: true` is most
likely a leftover, not a hard requirement.
**Recommended action — test without it:**
1. In a maintenance window, remove `privileged: true` and start the service.
2. Confirm PostgreSQL starts, the namespaced `sysctls` apply
(`docker exec obmp-psql sysctl net.ipv4.tcp_keepalive_time`), and the shared
memory size is honored (`docker exec obmp-psql df -h /dev/shm`; `/proc/meminfo`
inside a container reports host-wide figures, so it cannot confirm `shm_size`).
Watch the log for `could not resize shared memory segment` errors.
3. If everything is healthy, leave it removed.
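
The checks in step 2 can be scripted so they are repeatable after each change. A minimal sketch, assuming the container name `obmp-psql` and the keepalive value `180` from the compose file; `check_psql_unprivileged` is a hypothetical name, and the 10-minute log window is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch: verify the psql container runs unprivileged with its tuning intact.

check_psql_unprivileged() {
  local c="${1:-obmp-psql}" rc=0
  if [ "$(docker inspect --format '{{.HostConfig.Privileged}}' "$c")" != "false" ]; then
    echo "FAIL: $c is still privileged"
    rc=1
  fi
  # The namespaced sysctl from the compose file should read back as 180.
  if [ "$(docker exec "$c" cat /proc/sys/net/ipv4/tcp_keepalive_time)" != "180" ]; then
    echo "FAIL: tcp_keepalive_time was not applied"
    rc=1
  fi
  # The size of /dev/shm reflects shm_size; printed for manual inspection.
  docker exec "$c" df -h /dev/shm
  if docker logs --since 10m "$c" 2>&1 \
       | grep -q 'could not resize shared memory segment'; then
    echo "FAIL: shared memory errors in the recent log"
    rc=1
  fi
  return "$rc"
}
```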
If a specific capability turns out to be needed, add only that one instead of
going fully privileged:
```yaml
psql:
  # privileged: true <-- removed
  shm_size: 1536m
  cap_drop:
    - ALL
  cap_add:
    - CHOWN
    - SETUID
    - SETGID
    - DAC_OVERRIDE # add only capabilities proven necessary by testing
  sysctls:
    - net.ipv4.tcp_keepalive_intvl=30
    - net.ipv4.tcp_keepalive_probes=5
    - net.ipv4.tcp_keepalive_time=180
```
The `sysctls:` block stays; those settings are namespaced and do not require
privileged mode.

---
## Priority 5 — Container hardening (defense in depth)
Apply across services after the higher-priority items. Test each service
individually — `read_only` in particular will surface paths a service writes
to that then need explicit `tmpfs` mounts.
### 5.1 `no-new-privileges`
Prevents a process inside a container from gaining privileges via setuid
binaries. Safe to apply to every service:
```yaml
security_opt:
- no-new-privileges:true
```
### 5.2 Drop capabilities
Most of these services need almost no Linux capabilities. Start from zero and
add back only what breaks:
```yaml
cap_drop:
- ALL
```
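- `grafana`, `whois`, `zookeeper`: typically run fine with `cap_drop: [ALL]`.
- `portal`: a stock `nginx:alpine` master process usually needs `CHOWN`,
`SETGID`, and `SETUID` added back to prepare its cache directories and start
workers; confirm by testing, or switch to an unprivileged nginx image.
- `collector`, `kafka`, `psql`, `psql-app`: drop ALL, then add back any
capability proven necessary (see Priority 4 for `psql`).
- `traffic-gen*` legitimately need `NET_RAW`/`NET_ADMIN` (Scapy); leave those
`cap_add` entries, they are already minimal.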
### 5.3 Read-only root filesystem
Make the root filesystem immutable where the service only writes to known
volumes:
```yaml
grafana:
  read_only: true
  tmpfs:
    - /tmp
  # /var/lib/grafana is already a bind mount, so writes go there, not to rootfs
portal:
  read_only: true # nginx:alpine static site; add tmpfs for nginx
  tmpfs:
    - /tmp
    - /var/cache/nginx
    - /var/run
```
`read_only` is straightforward for `grafana`, `portal`, and `whois`. It is
trickier for `psql`, `kafka`, and `zookeeper` (they write to data volumes but
also expect a writable rootfs in places) — test individually and add `tmpfs`
mounts for any write paths, or skip `read_only` for those and rely on
`cap_drop` + `no-new-privileges`.
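
Before turning on `read_only`, `docker diff` can show which rootfs paths a running container has already written to; those are the candidates for `tmpfs` mounts. A minimal sketch; `rootfs_writes` is a hypothetical helper and the container name is supplied by the operator.

```bash
#!/usr/bin/env bash
# Sketch: list rootfs paths a running container has added or changed.

rootfs_writes() {
  # docker diff prints one change per line: "A /path" (added),
  # "C /path" (changed) or "D /path" (deleted). Keep added and changed paths.
  docker diff "$1" | sed -n 's/^[AC] //p' | sort
}
```

Run it against a service that has been up under normal load, for example `rootfs_writes <container-name>`. Writes to volumes and bind mounts do not appear in `docker diff`, so only rootfs paths are listed.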
### 5.4 Pin and scan images
Images are already version-pinned (`grafana:9.1.7`, `cp-kafka:7.1.1`,
`openbmp/postgres:2.2.1`, etc.) — good. Add periodic vulnerability scanning:
```bash
trivy image openbmp/postgres:2.2.1
trivy image grafana/grafana:9.1.7
```
Note Grafana 9.1.7 is old; review Grafana security advisories and plan an
upgrade path. Track CVEs for the pinned Confluent and OpenBMP images too.
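
To cover the whole stack rather than two images, the scan can loop over everything the compose file references. A sketch, assuming Compose v2 (`docker compose config --images`) and that `trivy` is installed on the host; `scan_stack_images` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: scan every image referenced by the compose file in the current directory.

scan_stack_images() {
  local img rc=0
  # One image name per line; sort -u removes images shared by several services.
  for img in $(docker compose config --images | sort -u); do
    echo "== $img"
    # trivy exits 1 when HIGH or CRITICAL findings exist for this image.
    trivy image --severity HIGH,CRITICAL --exit-code 1 "$img" || rc=1
  done
  return "$rc"
}
```

The non-zero return makes it usable as a gate in a scheduled job or CI step.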
### 5.5 Resource limits
Every service already has a `mem_limit`. For production also set `cpus:` (or
`deploy.resources.limits`) so a runaway query or ingest burst cannot starve
the host; this also mitigates local denial-of-service. See
`docs/production-sizing.md` for target values.

---
## Priority 6 — Authelia / access control
Authelia fronts Grafana (ROADMAP C5). For production:
- Enforce **TOTP / 2FA** for all operator accounts; do not allow `one_factor`
for the Grafana route.
- Set short session timeouts and an inactivity expiry in the Authelia config.
- Use strong, unique passwords; back the user store with your IdP / LDAP if
available rather than the file backend.
- Ensure Authelia's own secrets (`jwt_secret`, `session.secret`,
`storage.encryption_key`) are strong and stored as secrets, not literals.
- Confirm the reverse proxy strips any client-supplied `Remote-User` header
before Authelia sets it — otherwise the auth-proxy trust model is bypassable
(see 1.3).
---
## Quick checklist
- [ ] Rotate every default credential; remove literals from compose, move to `.env` / secrets manager
- [ ] Update `openbmp-ds.yml` datasource password to match
- [ ] Firewall BMP port 5000 to router management subnets (`DOCKER-USER` chain)
- [ ] Bind 5432 / 8086 / 4300 to loopback or drop the port mappings
- [ ] Bind Grafana 3000 to loopback; reach it only via Authelia
- [ ] Remove the Kafka `PLAINTEXT_HOST` listener + 9092 mapping (or enable SASL_SSL if external access needed)
- [ ] Create `grafana_ro` least-privilege DB role; repoint the datasource
- [ ] Tighten `pg_hba.conf`; require `scram-sha-256`
- [ ] Enable PostgreSQL `ssl = on` with a server certificate
- [ ] Test removing `privileged: true` from `psql`; replace with specific `cap_add` if needed
- [ ] Add `security_opt: [no-new-privileges:true]` to all services
- [ ] Add `cap_drop: [ALL]` and add back only required capabilities
- [ ] Add `read_only: true` + `tmpfs` to `grafana` / `portal` / `whois`
- [ ] Add `cpus:` limits per service
- [ ] Scan images with `trivy`; plan a Grafana upgrade off 9.1.7
- [ ] Enforce TOTP and short sessions in Authelia