2 Commits

Author SHA1 Message Date
sam
ef932fe1e8 Dashboard QoL: fill the viewport, push legends to bottom
Two recurring layout issues across dashboards I built this session:

  1) Right-placed legend tables ate 30% of each panel width.
  2) Default h:9 panels left ~40% of the viewport empty on a 1080p
     display (total dashboard height ~18 grid rows vs ~30 available).
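
For reference, the "total dashboard height" figure can be derived from the dashboard JSON as the lowest panel edge, i.e. the largest `gridPos.y + gridPos.h`. A minimal sketch, assuming the usual Grafana `gridPos` fields (`x`, `y`, `w`, `h`); the 2x2 layout and the function name are hypothetical, chosen only to reproduce the 18-row figure above:

```python
def dashboard_rows(panels):
    """Return the dashboard height in grid rows: the lowest panel edge."""
    return max((p["gridPos"]["y"] + p["gridPos"]["h"] for p in panels), default=0)


# Hypothetical 2x2 layout of default-height (h:9) panels.
panels = [
    {"title": "a", "gridPos": {"x": 0, "y": 0, "w": 12, "h": 9}},
    {"title": "b", "gridPos": {"x": 12, "y": 0, "w": 12, "h": 9}},
    {"title": "c", "gridPos": {"x": 0, "y": 9, "w": 12, "h": 9}},
    {"title": "d", "gridPos": {"x": 12, "y": 9, "w": 12, "h": 9}},
]

total = dashboard_rows(panels)
print(total, "rows used of ~30 available")
```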

Stack Resources (Telemetry-3001/stack_resources.json):
  * 3 timeseries: legend placement right -> bottom, calcs [max] -> [last,max],
    added sortBy: Max desc so top consumers float to the top of the legend.
  * Bumped all 4 panels h: 9 -> 14 (dashboard total 18 -> 28 rows).
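
The legend change above could be scripted roughly as follows. This is a sketch, not the edit actually applied: it assumes the timeseries panel keeps its legend under `options.legend` with `placement`, `calcs`, `sortBy` and `sortDesc` keys, and the function name is hypothetical.

```python
import copy


def move_legend_to_bottom(panel):
    """Return a copy of a timeseries panel with a bottom legend sorted by Max, descending."""
    out = copy.deepcopy(panel)
    if out.get("type") != "timeseries":
        return out  # leave stat/table/other panel types untouched
    legend = out.setdefault("options", {}).setdefault("legend", {})
    legend["placement"] = "bottom"     # was "right"
    legend["calcs"] = ["last", "max"]  # was ["max"]
    legend["sortBy"] = "Max"           # top consumers first
    legend["sortDesc"] = True
    return out


# Hypothetical panel fragment in the "before" state.
panel = {
    "type": "timeseries",
    "options": {"legend": {"displayMode": "table", "placement": "right", "calcs": ["max"]}},
}
updated = move_legend_to_bottom(panel)
print(updated["options"]["legend"])
```

Working on a deep copy keeps the original panel intact, so the before and after states can be diffed before writing the JSON back.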

Kafka Ingestion Lag and Live BGP Churn (Telemetry-3001/*):
  * Bumped timeseries panels h: 9 -> 12; second-row y: 13 -> 16.
    Dashboard total 22 -> 28 rows.
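
Growing a panel's height means every row below it has to move down by the same amount. A minimal sketch of that re-stacking, assuming panels that share a `y` value form one row; the 4-row header panel is an assumption inferred from the 22 -> 28 totals, and `restack` is a hypothetical helper:

```python
def restack(panels, new_height, panel_type="timeseries"):
    """Set h=new_height on matching panels, re-stack rows top to bottom,
    and return the new total height in grid rows. Mutates panels in place."""
    rows = {}
    for p in panels:
        rows.setdefault(p["gridPos"]["y"], []).append(p)
    next_y = 0
    for y in sorted(rows):
        for p in rows[y]:
            if p["type"] == panel_type:
                p["gridPos"]["h"] = new_height
            p["gridPos"]["y"] = next_y
        # The tallest panel in a row decides where the next row starts.
        next_y += max(p["gridPos"]["h"] for p in rows[y])
    return next_y


# Hypothetical layout: a 4-row header, then two rows of h:9 timeseries.
panels = [
    {"type": "stat", "gridPos": {"x": 0, "y": 0, "w": 24, "h": 4}},
    {"type": "timeseries", "gridPos": {"x": 0, "y": 4, "w": 24, "h": 9}},
    {"type": "timeseries", "gridPos": {"x": 0, "y": 13, "w": 24, "h": 9}},
]
total = restack(panels, new_height=12)
print(total, [p["gridPos"]["y"] for p in panels])
```

Under these assumptions the second timeseries row moves from y: 13 to y: 16 and the total grows from 22 to 28 rows, matching the numbers above.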

Policy Diff (obmp/History-1002/policy_diff.json):
  * Bumped bottom-row panels h: 8 -> 11. Total 24 -> 27 rows.

Untouched (already adequate, scrollable by design, or built earlier):
  evpn_rib (30 rows), global_table (38), router_diff (52), and the
  Maps-1006 dashboards (already h:22-28 single panels).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 19:58:33 -07:00
sam
9d74940614 Fix ExaBGP OOM, add container health checks and resource monitoring
RCA: the exabgp container was OOM-killed because its 512m mem_limit was far
too small for the full-table feature (900K route objects in memory). Raises
the limit to a parameterized 6g default (EXABGP_MEM_LIMIT).
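
A quick back-of-the-envelope check of why 512m was too tight. It only divides each limit by the 900K route count from the RCA and assumes Docker's binary interpretation of the `m` and `g` suffixes. The real per-route footprint of ExaBGP was not measured here, so the figures are a budget, not a measurement:

```python
ROUTES = 900_000  # route objects held in memory, per the RCA
MIB = 1024 ** 2
GIB = 1024 ** 3


def bytes_per_route(limit_bytes, routes=ROUTES):
    """Memory budget per route if the whole limit went to route objects."""
    return limit_bytes / routes


old_budget = bytes_per_route(512 * MIB)
new_budget = bytes_per_route(6 * GIB)
print(f"512m: {old_budget:.0f} B/route, 6g: {new_budget:.0f} B/route")
```

At 512m the budget is under 600 bytes per route before counting the interpreter and any other process overhead; at 6g it is roughly twelve times larger.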

Adds Docker healthchecks (port/HTTP probes) to 14 services so unhealthy
containers are visible. Adds a Telegraf docker input that collects
per-container CPU/memory/IO into InfluxDB, plus a "Stack Resources"
dashboard, so resource pressure is caught before it causes an OOM crash.
Telegraf runs with an overridden entrypoint so it stays root and can read the
Docker socket.
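
To illustrate what a port probe checks, here is a minimal sketch of a TCP connect test. It is not the healthcheck command used in this commit, and `port_is_open` is a hypothetical helper; the demo probes a throwaway local listener rather than a real service:

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a throwaway listener on an OS-assigned local port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

up = port_is_open("127.0.0.1", demo_port)
listener.close()
down = port_is_open("127.0.0.1", demo_port)
print(up, down)
```

A healthcheck wrapper would turn the boolean into an exit status (0 for healthy, 1 for unhealthy), which is what Docker uses to mark the container.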

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:03:52 -07:00