
Current observations

  • L40S 48GB → faster than realtime. Use pace:"realtime" to avoid client over‑buffering.
  • L4 24GB → slightly below realtime even with pre‑roll buffering, TF32/Autotune, smaller chunks (max_decode_frames), and the base checkpoint.
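Whether a GPU keeps up comes down to its realtime factor: audio seconds generated per wall-clock second. A minimal sketch of that measurement (the `gen_fn` stand-in below is hypothetical; in practice it would be one call to the model's chunk generator):

```python
import time

def realtime_factor(gen_fn, chunk_seconds: float) -> float:
    """Audio-seconds generated per wall-clock second.
    >1.0 means faster than realtime; <1.0 means the client will underrun."""
    start = time.perf_counter()
    gen_fn()  # generate one chunk of `chunk_seconds` of audio
    elapsed = time.perf_counter() - start
    return chunk_seconds / elapsed

# Stand-in generator that "takes" 0.5 s to produce 2.0 s of audio:
rtf = realtime_factor(lambda: time.sleep(0.5), chunk_seconds=2.0)
print(round(rtf, 1))  # ~4.0 → comfortably faster than realtime
```

On the L40S this factor stays above 1.0; on the L4 it sits just below, which is why the tuning steps above can't fully close the gap.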

Practical guidance

  • For consistent realtime, target ~40GB VRAM per active stream (e.g., A100 40GB, or MIG slices ≈ 35–40GB on newer GPUs).
  • Keep client‑side overlap‑add (25–40 ms) for seamless chunk joins.
  • Prefer pace:"realtime" once playback begins; use ASAP only to build a short pre‑roll if needed.
  • Optional knob: max_decode_frames (default 50 frames ≈ 2.0 s of audio). Reducing it to 36–45 can lower per‑chunk latency and VRAM use, but doesn't increase overall frames/sec throughput.
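The client-side overlap-add mentioned above can be sketched like this: crossfade the tail of one chunk into the head of the next over a 25–40 ms window. This is an illustrative NumPy sketch, not the app's actual client code; the sample rate and window length are assumptions.

```python
import numpy as np

SR = 48_000    # assumed client sample rate
XFADE_MS = 30  # within the 25–40 ms range above

def join_chunks(a: np.ndarray, b: np.ndarray, sr: int = SR, ms: int = XFADE_MS) -> np.ndarray:
    """Overlap-add two consecutive audio chunks with an equal-power crossfade."""
    n = int(sr * ms / 1000)
    t = np.linspace(0.0, np.pi / 2, n)
    fade_out, fade_in = np.cos(t), np.sin(t)  # equal-power fade curves
    mixed = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], mixed, b[n:]])

left = np.ones(SR)   # 1 s of placeholder "audio"
right = np.ones(SR)
out = join_chunks(left, right)
# Result length: 2 s minus one crossfade window
print(len(out) == 2 * SR - int(SR * XFADE_MS / 1000))  # True
```

Equal-power curves keep perceived loudness roughly constant through the join, which matters when consecutive chunks are similar but not sample-identical.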

Concurrency

This research build is designed for one active jam per GPU. Concurrency would require GPU partitioning (MIG) or horizontal scaling with a session scheduler.
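If you do scale horizontally, the session scheduler could be as simple as a FIFO of free GPU slots. A minimal sketch under the one-jam-per-GPU constraint (class and identifiers are hypothetical; in a real deployment `gpu_ids` would be device indices or MIG slice UUIDs):

```python
import threading
from queue import Queue

class GPUJamScheduler:
    """Admit one active jam per GPU slot; callers block until a slot frees up."""
    def __init__(self, gpu_ids):
        self.free = Queue()
        for g in gpu_ids:
            self.free.put(g)
        self.active = {}  # session_id -> gpu_id
        self.lock = threading.Lock()

    def acquire(self, session_id, timeout=None):
        gpu = self.free.get(timeout=timeout)  # blocks while all slots are busy
        with self.lock:
            self.active[session_id] = gpu
        return gpu

    def release(self, session_id):
        with self.lock:
            gpu = self.active.pop(session_id)
        self.free.put(gpu)

sched = GPUJamScheduler(["gpu0", "gpu1"])
a = sched.acquire("jam-a")
b = sched.acquire("jam-b")
sched.release("jam-a")
c = sched.acquire("jam-c")   # reuses the slot freed by jam-a
print(sorted(sched.active))  # ['jam-b', 'jam-c']
```

With MIG, each slice simply becomes one entry in `gpu_ids`; without it, the queue serializes jams onto whole GPUs.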