### Current observations

- **L40S 48GB** → faster than realtime. Use `pace:"realtime"` to avoid client over-buffering.
- **L4 24GB** → slightly **below** realtime even with pre-roll buffering, TF32/Autotune, smaller chunks (`max_decode_frames`), and the **base** checkpoint.

### Practical guidance

- For consistent realtime, target **~40GB VRAM per active stream** (e.g., **A100 40GB**, or MIG slices of ≈ **35–40GB** on newer GPUs).
- Keep client-side **overlap-add** (25–40 ms) for seamless chunk joins.
- Prefer **`pace:"realtime"`** once playback begins; use **ASAP** only to build a short pre-roll if needed.
- Optional knob: **`max_decode_frames`** (default **50** ≈ 2.0 s). Reducing it to **36–45** can lower per-chunk latency and VRAM, but doesn't increase frames/sec throughput.

### Concurrency

This research build is designed for **one active jam per GPU**. Concurrency would require GPU partitioning (MIG) or horizontal scaling with a session scheduler.
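The client-side overlap-add join mentioned in the guidance above can be sketched as a simple crossfade: the last 25–40 ms of the previous chunk is blended into the first 25–40 ms of the next one. This is a minimal pure-Python sketch, not the actual client code; the sample rate and 30 ms overlap are assumptions for illustration.

```python
SAMPLE_RATE = 48_000           # assumption: 48 kHz output audio
OVERLAP_MS = 30                # within the suggested 25-40 ms window
OVERLAP = SAMPLE_RATE * OVERLAP_MS // 1000   # samples in the crossfade region

def overlap_add(prev_tail, next_chunk):
    """Crossfade the tail of the previous chunk into the head of the next.

    prev_tail:  last N samples of the previous chunk (N >= 2)
    next_chunk: the incoming chunk; its first N samples are blended.
    Returns a chunk the same length as next_chunk, joined seamlessly.
    """
    n = len(prev_tail)
    joined = []
    for i in range(n):
        w = i / (n - 1)        # linear fade 0 -> 1; equal-power also works
        joined.append(prev_tail[i] * (1.0 - w) + next_chunk[i] * w)
    return joined + list(next_chunk[n:])
```

A Hann (equal-power) window can replace the linear fade if the linear crossfade audibly dips on correlated material, but at 25–40 ms either is usually inaudible.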
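The "ASAP only to build a short pre-roll" guidance above amounts to a small client-side state machine: request chunks as fast as possible until a pre-roll buffer is filled, then switch to `pace:"realtime"` so the server stops outrunning playback. The following is a hedged sketch; `request_chunk`, `play`, and the 2 s pre-roll target are hypothetical stand-ins, not the real client API (only the `pace` values come from the notes above).

```python
PREROLL_SECONDS = 2.0          # assumed pre-roll target; tune for network jitter

def stream_session(request_chunk, play, chunk_seconds=2.0):
    """Pull chunks in ASAP mode until pre-rolled, then hold pace:"realtime".

    request_chunk(pace=...) -> bytes | None   # hypothetical transport call
    play(chunk)                               # hypothetical playback sink
    """
    buffered = 0.0
    pace = "asap"
    while True:
        chunk = request_chunk(pace=pace)
        if chunk is None:      # stream ended
            break
        play(chunk)
        buffered += chunk_seconds
        if pace == "asap" and buffered >= PREROLL_SECONDS:
            pace = "realtime"  # avoid client over-buffering from here on
```

The point of the switch is that ASAP mode on an over-realtime GPU (e.g., the L40S case) would otherwise grow the client buffer without bound, adding latency to any interactive control changes.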
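The `max_decode_frames` arithmetic above is worth making explicit: if the default of 50 frames corresponds to ≈ 2.0 s of audio, the decoder runs at 25 frames per second, and the suggested 36–45 range maps to 1.44–1.8 s chunks. A quick sketch of that conversion (the 25 fps rate is derived from the stated default, not separately documented):

```python
# Derived from the notes: 50 frames ≈ 2.0 s ⇒ 25 decoder frames per second.
FRAMES_PER_SECOND = 50 / 2.0

def chunk_seconds(max_decode_frames: int) -> float:
    """Audio duration produced per decode chunk for a given frame budget."""
    return max_decode_frames / FRAMES_PER_SECOND
```

Smaller chunks arrive sooner and peak lower in VRAM, but the decoder still emits the same 25 frames/s overall, which is why shrinking `max_decode_frames` cannot rescue a GPU that is below realtime to begin with.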