thepatch
/

sao-small-onnx

thecollabagepatch

Update README.md

2a53414 verified 2 months ago

---
license: other
license_name: stability-ai
license_link: https://stability.ai/license
---

attempting to run stable-audio-open-small with onnxruntime in swift/iOS

this is a mess. these models run successfully in python when validating them. haven't gotten the iphone to stop crashing yet.

when using the fp16_tools version, the diffusion component crashes the iphone on step 0.

when using the initial version, the decoder (`autoencoder_arm.onnx`) crashes the iphone.
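
since both crashes only show up on device, one thing that might be worth ruling out is a mismatch between what the swift code feeds in and what each graph actually declares (names, dtypes, shapes), especially for the fp16 export. a minimal sketch for dumping that from python, relying only on onnxruntime's `get_inputs()` / `get_outputs()`. `describe_io` and `find_half_precision` are made-up helper names, and none of this is a confirmed cause of the crash:

```python
def describe_io(session):
    """Return (kind, name, type, shape) rows for every graph input and output.

    `session` is expected to be an onnxruntime.InferenceSession; only its
    get_inputs() / get_outputs() methods are used.
    """
    rows = []
    for kind, args in (("input", session.get_inputs()),
                       ("output", session.get_outputs())):
        for arg in args:
            # arg.type is a string such as "tensor(float)" or "tensor(float16)"
            rows.append((kind, arg.name, arg.type, list(arg.shape)))
    return rows


def find_half_precision(rows):
    """Names of tensors that the graph declares as float16."""
    return [name for _, name, typ, _ in rows if typ == "tensor(float16)"]
```

usage would be something like `describe_io(ort.InferenceSession("diffusion_dit_arm.onnx"))`, then comparing the rows against the tensors built on the swift side.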

nothing to see here, yet... just wanted a place to store these.

like everything else i do... pure vibes, zero real knowledge.

here's a python script i used to validate outputs against the original pytorch model.

there's another one using cfg stuff that gets essentially the same outputs.

```python
#!/usr/bin/env python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoTokenizer

# Load ONNX models
dit = ort.InferenceSession("diffusion_dit_arm.onnx")
cond = ort.InferenceSession("conditioners.onnx")
dec = ort.InferenceSession("autoencoder_arm.onnx")

# Config
prompt = "lo-fi hip-hop beat with pianos 90bpm"
steps = 10
dt = 1.0 / steps  # fixed Euler step size (0.1 for 10 steps)
rng = np.random.RandomState(12345)
x = rng.randn(1, 64, 256).astype(np.float32)  # initial noise latent

# Conditioning
tok = AutoTokenizer.from_pretrained("t5-base")
tokens = tok(prompt, truncation=True, padding="max_length", max_length=128, return_tensors="np")
conds = cond.run(None, {
    "input_ids": tokens["input_ids"].astype(np.int64),
    "attention_mask": tokens["attention_mask"].astype(np.int64),
    "seconds_total": np.array([10.0], dtype=np.float32),
})
cross, _, glob = conds

# Run `steps` Euler steps with t going linearly from 1.0 to 0.0, no CFG
for i in range(steps):
    t_val = 1.0 - i / (steps - 1)
    t = np.array([t_val], dtype=np.float32)

    v = dit.run(None, {
        "x": x, "t": t,
        "cross_attn_cond": cross,
        "global_cond": glob,
    })[0]

    x -= dt * v  # fixed Euler step

# Decode
audio = dec.run(None, {"sampled": x})[0]
if audio.shape[0] == 2:
    audio = audio.T  # (channels, samples) -> (samples, channels) for soundfile
audio /= np.abs(audio).max()  # peak-normalize
sf.write("onnx_lofi_linear.wav", audio, 44100)
print("✅ onnx_lofi_linear.wav written!")
```
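
the cfg variant mentioned above isn't included here. as a rough sketch of how classifier-free guidance usually slots into the same kind of Euler loop: run the model twice per step (with and without the prompt conditioning) and blend the two predictions. everything here is an assumption rather than the actual script: `predict` is a hypothetical stand-in for the two `dit.run(...)` calls, `scale` is a hypothetical guidance strength, and how the unconditioned pass gets built (what is fed as `cross_attn_cond`) is left open.

```python
def cfg_combine(v_uncond, v_cond, scale):
    """Classifier-free guidance blend.

    scale = 1.0 gives back the conditioned prediction, scale = 0.0 the
    unconditioned one. Works on floats or elementwise on NumPy arrays.
    """
    return v_uncond + scale * (v_cond - v_uncond)


def euler_sample(x, predict, steps, scale):
    """Fixed-step Euler loop with CFG added.

    `predict(x, t, conditioned)` is a hypothetical callable standing in for
    a model call with (True) or without (False) the prompt conditioning.
    """
    dt = 1.0 / steps
    for i in range(steps):
        # linear schedule from 1.0 down to 0.0 (guarded for steps == 1)
        t = 1.0 - i / max(steps - 1, 1)
        v_cond = predict(x, t, True)
        v_uncond = predict(x, t, False)
        x = x - dt * cfg_combine(v_uncond, v_cond, scale)
    return x
```

with `scale=1.0` this reduces to the no-CFG loop, which is one way to sanity-check that both scripts agree.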