### Current observations | |
- **L40S 48GB** → faster than realtime. Use `pace:"realtime"` to avoid client over‑buffering. | |
- **L4 24GB** → slightly **below** realtime even with pre‑roll buffering, TF32/Autotune, smaller chunks (`max_decode_frames`), and the **base** checkpoint. | |
### Practical guidance | |
- For consistent realtime, target **~40GB VRAM per active stream** (e.g., **A100 40GB**, or MIG slices ≈ **35–40GB** on newer GPUs). | |
- Keep client‑side **overlap‑add** (25–40 ms) for seamless chunk joins. | |
- Prefer **`pace:"realtime"`** once playback begins; use **ASAP** only to build a short pre‑roll if needed. | |
- Optional knob: **`max_decode_frames`** (default **50** ≈ 2.0 s). Reducing to **36–45** can lower per‑chunk latency/VRAM, but doesn't increase frames/sec throughput. | |
### Concurrency | |
This research build is designed for **one active jam per GPU**. Concurrency would require GPU partitioning (MIG) or horizontal scaling with a session scheduler. |