akhaliq (HF Staff) committed
Commit c456be0 · 1 Parent(s): 375b1c6

update system prompt

Files changed (1): app.py (+76, -0)

app.py CHANGED
@@ -223,6 +223,7 @@ Duration Guidelines:
- Video generation: typically 60-180 seconds
- Audio/music generation: typically 30-90 seconds
- Model loading + inference: add 10-30s buffer
+ - AoT compilation during startup: use @spaces.GPU(duration=1500) for maximum allowed duration

Functions that typically need @spaces.GPU:
- Image generation (text-to-image, image-to-image)
@@ -231,6 +232,43 @@ Functions that typically need @spaces.GPU:
- Model inference with transformers, diffusers
- Any function using .to('cuda') or GPU operations

+ ## Advanced ZeroGPU Optimization (Recommended)
+
+ For production Spaces with heavy models, consider ahead-of-time (AoT) compilation for 1.3x-1.8x speedups:
+
+ ```python
+ import spaces
+ import torch
+ from diffusers import DiffusionPipeline
+
+ pipe = DiffusionPipeline.from_pretrained(..., torch_dtype=torch.bfloat16)
+ pipe.to('cuda')
+
+ @spaces.GPU(duration=1500)  # Max duration for compilation
+ def compile_transformer():
+     with spaces.aoti_capture(pipe.transformer) as call:
+         pipe("arbitrary example prompt")
+
+     exported = torch.export.export(
+         pipe.transformer,
+         args=call.args,
+         kwargs=call.kwargs,
+     )
+     return spaces.aoti_compile(exported)
+
+ compiled_transformer = compile_transformer()
+ spaces.aoti_apply(compiled_transformer, pipe.transformer)
+
+ @spaces.GPU
+ def generate(prompt):
+     return pipe(prompt).images
+ ```
+
+ Optional enhancements:
+ - FP8 quantization with torchao for additional 1.2x speedup (H200 compatible)
+ - Dynamic shapes for variable input sizes
+ - FlashAttention-3 via kernels library for attention speedups
+
## Complete Gradio API Reference

This reference is automatically synced from https://www.gradio.app/llms.txt to ensure accuracy.
 
@@ -316,6 +316,7 @@ Duration Guidelines:
- Video generation: typically 60-180 seconds
- Audio/music generation: typically 30-90 seconds
- Model loading + inference: add 10-30s buffer
+ - AoT compilation during startup: use @spaces.GPU(duration=1500) for maximum allowed duration

Functions that typically need @spaces.GPU:
- Image generation (text-to-image, image-to-image)
@@ -325,6 +325,43 @@ Functions that typically need @spaces.GPU:
- Model inference with transformers, diffusers
- Any function using .to('cuda') or GPU operations

+ ## Advanced ZeroGPU Optimization (Recommended)
+
+ For production Spaces with heavy models, consider ahead-of-time (AoT) compilation for 1.3x-1.8x speedups:
+
+ ```python
+ import spaces
+ import torch
+ from diffusers import DiffusionPipeline
+
+ pipe = DiffusionPipeline.from_pretrained(..., torch_dtype=torch.bfloat16)
+ pipe.to('cuda')
+
+ @spaces.GPU(duration=1500)  # Max duration for compilation
+ def compile_transformer():
+     with spaces.aoti_capture(pipe.transformer) as call:
+         pipe("arbitrary example prompt")
+
+     exported = torch.export.export(
+         pipe.transformer,
+         args=call.args,
+         kwargs=call.kwargs,
+     )
+     return spaces.aoti_compile(exported)
+
+ compiled_transformer = compile_transformer()
+ spaces.aoti_apply(compiled_transformer, pipe.transformer)
+
+ @spaces.GPU
+ def generate(prompt):
+     return pipe(prompt).images
+ ```
+
+ Optional enhancements:
+ - FP8 quantization with torchao for additional 1.2x speedup (H200 compatible)
+ - Dynamic shapes for variable input sizes
+ - FlashAttention-3 via kernels library for attention speedups
+
## Complete Gradio API Reference

This reference is automatically synced from https://www.gradio.app/llms.txt to ensure accuracy.