Option to output speakers to different audio files
First, thank you for the effort you've put into this, I really appreciate it. Using this project (https://github.com/BobRandomNumber/ComfyUI-DiaTTS), I was able to use Dia in ComfyUI very easily and got interesting - mostly good - results.
I like how Dia voices are able to sound more human with coughs, laughs, and so on. Prompting is also easy, and having two speakers is great. On that note, I have a suggestion: please make it possible to output speaker 1 and speaker 2 to separate audio files.
Why? If you want to create a talking avatar, it is now fairly easy to do with AI from an input image and a voice audio file, and you can even mask the video so it can be overlaid on a background image or video. With Dia, you could build an interesting conversation between two avatars this way, but the workflow falls apart when both voices are mixed into the same audio file.
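To illustrate what I mean: assuming the node could expose the generated audio as per-segment chunks tagged with their speaker (this is a hypothetical interface, not something Dia or the ComfyUI node currently provides), splitting the dialogue into two time-aligned tracks is simple. Each speaker's track keeps their own segments and has silence wherever the other speaker talks, so both files line up with the original mixed output:

```python
def split_speakers(segments):
    """Split dialogue audio into one track per speaker.

    `segments` is a list of (speaker_label, samples) tuples in dialogue
    order, where `samples` is a list of float samples. Each returned
    track is the full dialogue length, with silence (zeros) wherever
    the other speaker is talking, so the tracks stay time-aligned.
    """
    total = sum(len(samples) for _, samples in segments)
    tracks = {}
    cursor = 0
    for speaker, samples in segments:
        # Create the track lazily, pre-filled with silence.
        track = tracks.setdefault(speaker, [0.0] * total)
        track[cursor:cursor + len(samples)] = samples
        cursor += len(samples)
    return tracks

# Tiny illustrative "audio": three alternating segments.
segments = [("S1", [0.1, 0.2]), ("S2", [0.3]), ("S1", [0.4])]
tracks = split_speakers(segments)
# tracks["S1"] -> [0.1, 0.2, 0.0, 0.4]
# tracks["S2"] -> [0.0, 0.0, 0.3, 0.0]
```

Each track could then be written to its own file and fed to a separate avatar pipeline; since the tracks are the same length, the two avatar videos would stay in sync when composited.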
As others have already said, another challenge is voice consistency. You have to split the text into chunks of a certain length to get decent output from Dia, but then each chunk comes out with a different voice despite using the same seed. If you use an audio prompt, the voice doesn't sound like the input and still tends to vary a bit every time, even with the same audio prompt and seed. Well, at least it doesn't switch between male and female then...
I don't mind splitting the input text into chunks, but hopefully you can make Dia output the same voice consistently; then it could become a very useful open-source TTS model. It would be awesome to be able to create NotebookLM-style TTS in a controlled manner.