Please support our community by starring it. Thank you for your support!

[**Highlights**](#highlights) | [**Introduction**](#introduction) | [**Installation**](#installation) | [**Quick Start**](#quick-start) | [**Tutorial**](https://github.com/FunAudioLLM/InspireMusic#tutorial) | [**Models**](#model-zoo) | [**Contact**](#contact)

---

## Highlights

**InspireMusic** focuses on music generation, song generation, and audio generation.

- A unified toolkit designed for music, song, and audio generation.
- Music generation tasks with high audio quality.
- Long-form music generation.

## Introduction

> [!Note]
> This repo contains the algorithm infrastructure and some simple examples. It currently only supports English text prompts.

> [!Tip]
> To preview the performance, please refer to the [InspireMusic Demo Page](https://funaudiollm.github.io/inspiremusic).

InspireMusic is a unified framework for music, song, and audio generation that couples audio tokenization with an autoregressive transformer and a flow-matching based model. The original motivation of this toolkit is to empower everyday users to innovate soundscapes and improve sound quality in research through music, song, and audio crafting. The toolkit provides both training and inference code for AI-based generative models that create high-quality music. Featuring a unified framework, InspireMusic incorporates audio tokenizers with autoregressive transformer and super-resolution flow-matching modeling, allowing controllable generation of music, song, and audio from both text and audio prompts. The toolkit currently supports music generation; song generation and audio generation will be supported in the future.
## InspireMusic

*Figure 1: An overview of the InspireMusic framework.*

We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality long-form audio. InspireMusic consists of the following three key components:

- **Audio tokenizers** convert the raw audio waveform into discrete audio tokens that can be efficiently processed by and used to train the autoregressive transformer model. Audio waveforms at a lower sampling rate are converted into discrete tokens via a high-bitrate-compression audio tokenizer[1].
- The **autoregressive transformer**, with Qwen2.5[2] as its backbone, is trained with a next-token prediction objective on both text and audio tokens, enabling it to generate coherent and contextually relevant token sequences.
- The **super-resolution flow-matching model** maps the generated tokens to latent features carrying high-resolution, fine-grained acoustic details[3] obtained from audio at a higher sampling rate, so that the acoustic information flows through the models with high fidelity. A vocoder then generates the final audio waveform from these enhanced latent features.

InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.
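To make the data flow between these three components concrete, below is a minimal, runnable PyTorch sketch of the pipeline shape: waveform to discrete tokens, text plus audio-prompt tokens to autoregressively generated audio tokens, then flow-matching latents and a vocoder output. Every class, constant, and dimension in it is an illustrative placeholder, not InspireMusic's actual implementation; see the training and inference code in this repository for the real models.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1024   # placeholder audio-token codebook size
TEXT_VOCAB = 32000  # placeholder text vocabulary size
DIM = 256           # placeholder hidden size


class AudioTokenizer(nn.Module):
    """Placeholder for the high-bitrate-compression tokenizer: waveform -> discrete tokens."""
    def encode(self, wav: torch.Tensor) -> torch.Tensor:
        frames = wav.shape[-1] // 320  # pretend one token per 320 samples
        return torch.randint(0, VOCAB_SIZE, (wav.shape[0], frames))


class ARTransformer(nn.Module):
    """Placeholder for the Qwen2.5-based backbone trained with next-token prediction."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + VOCAB_SIZE, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    @torch.no_grad()
    def generate(self, prefix: torch.Tensor, steps: int) -> torch.Tensor:
        tokens = prefix
        for _ in range(steps):
            hidden = self.blocks(self.embed(tokens))
            next_tok = self.head(hidden[:, -1]).argmax(-1, keepdim=True)  # greedy next token
            tokens = torch.cat([tokens, TEXT_VOCAB + next_tok], dim=1)    # offset into audio-token range
        return tokens[:, prefix.shape[1]:] - TEXT_VOCAB                   # newly generated audio tokens


class FlowMatchingSR(nn.Module):
    """Placeholder for the super-resolution flow-matching model: tokens -> high-res latents."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(nn.Embedding(VOCAB_SIZE, DIM), nn.Linear(DIM, DIM))

    def forward(self, audio_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_tokens)  # (batch, frames, DIM) latent features


class Vocoder(nn.Module):
    """Placeholder vocoder: high-resolution latents -> waveform."""
    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return latents.reshape(latents.shape[0], -1)  # fake waveform samples


# End-to-end flow: text prompt (and optional audio prompt) -> audio tokens -> latents -> waveform.
tokenizer, lm, sr_model, vocoder = AudioTokenizer(), ARTransformer(), FlowMatchingSR(), Vocoder()
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 12))                    # stand-in text prompt
prompt_tokens = tokenizer.encode(torch.randn(1, 16000)) + TEXT_VOCAB   # optional audio-prompt tokens
audio_tokens = lm.generate(torch.cat([text_tokens, prompt_tokens], dim=1), steps=50)
waveform = vocoder(sr_model(audio_tokens))
print(audio_tokens.shape, waveform.shape)
```

In the actual toolkit the tokenizer, backbone, flow-matching model, and vocoder are separately trained components; this sketch only shows how their inputs and outputs connect.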