In the realm of text-to-music synthesis, the quality of generated content has been advancing, but the controllability of musical aspects remains largely unexplored. A team of researchers from the Singapore University of Technology and Design and Queen Mary University of London introduced a solution to this challenge, named Mustango. It extends the Tango text-to-audio model, aiming to control generated music not only with general text captions but also with richer captions containing specific instructions related to chords, beats, tempo, and key.
The researchers introduce Mustango as a music-domain-knowledge-inspired text-to-music system based on diffusion models. They highlight the unique challenges in generating music directly from a diffusion model, emphasizing the need to balance alignment with conditional text and musicality. Mustango enables musicians, producers, and sound designers to create music clips with specific conditions such as chord progression, tempo, and key selection.
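The idea of feeding predicted music features into the denoising loop can be illustrated with a toy sketch. Everything below is hypothetical: the function names, the hand-rolled feature encoding, and the simple additive mixing stand in for MuNet's learned encoders and attention layers inside the diffusion UNet.

```python
# Toy sketch of feature-conditioned denoising. All names and arithmetic
# are illustrative; Mustango's MuNet uses learned encoders and attention
# inside a diffusion UNet, not this additive update.

KEYS = ["C major", "G major", "D major"]  # truncated illustrative key list

def embed_features(chords, tempo, key):
    """Map predicted music features to a toy conditioning vector."""
    return [len(chords) / 8.0, tempo / 200.0, KEYS.index(key) / 12.0]

def denoise_step(latent, text_cond, music_cond, weight=0.1):
    """One toy denoising update mixing text and music conditioning."""
    return [x - weight * (t + m)
            for x, t, m in zip(latent, text_cond, music_cond)]

latent = [0.5, -0.2, 0.3]
text_cond = [0.1, 0.0, 0.2]  # stands in for a text embedding
music_cond = embed_features(["C", "G", "Am", "F"], tempo=120, key="C major")
latent = denoise_step(latent, text_cond, music_cond)
```

The point of the sketch is only the data flow: music features predicted from the prompt enter every denoising step alongside the text conditioning, which is what gives the model its extra control signal.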
As part of Mustango, the researchers propose MuNet, a Music-Domain-Knowledge-Informed UNet sub-module. MuNet integrates music-specific features predicted from the text prompt (chords, beats, key, and tempo) into the diffusion denoising process. To overcome the limited availability of open datasets pairing music with text captions, the researchers introduce a novel data augmentation method: they alter the harmonic, rhythmic, and dynamic aspects of music audio and use Music Information Retrieval methods to extract music features, which are then appended to the existing text descriptions. The result is the MusicBench dataset.
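The caption-enrichment step can be sketched as follows. This is a minimal illustration, not the exact MusicBench pipeline: the sentence templates are assumptions, and the feature values (which MIR tools would extract from audio in practice) are supplied directly.

```python
# Illustrative sketch of MusicBench-style caption enrichment: append
# control sentences describing extracted music features to an existing
# caption. Templates and values are hypothetical examples.

def augment_caption(caption, tempo_bpm, key, chords):
    """Append control sentences for tempo, key, and chord progression."""
    sentences = [
        caption.rstrip("."),
        f"The bpm is {tempo_bpm}",
        f"The key is {key}",
        "The chord progression is " + ", ".join(chords),
    ]
    return ". ".join(sentences) + "."

augmented = augment_caption(
    "A mellow acoustic guitar piece",
    tempo_bpm=92,
    key="G major",
    chords=["G", "Em", "C", "D"],
)
```

Training on captions enriched this way is what lets the model later respond to explicit chord, tempo, and key instructions in a prompt.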
The MusicBench dataset contains over 52,000 instances, enriching the original text descriptions with beat and downbeat locations, the underlying chord progression, key, and tempo. The researchers conduct extensive experiments demonstrating that Mustango achieves state-of-the-art music quality. They emphasize the controllability of Mustango through music-specific text prompts, showcasing superior performance in capturing desired chords, beats, keys, and tempo across multiple datasets. They also assess the adaptability of the feature predictors in scenarios where control sentences are absent from the prompt and observe that Mustango outperforms Tango in such cases, indicating that the control predictors do not compromise performance.
The experiments include comparisons with baselines such as Tango and with variants of Mustango, demonstrating the effectiveness of the proposed data augmentation approach in enhancing performance. Mustango trained from scratch emerges as the best performer, surpassing Tango and the other variants in terms of audio quality, rhythm presence, and harmony. Mustango has 1.4B parameters, considerably more than Tango.
In conclusion, the researchers introduce Mustango as a significant advancement in text-to-music synthesis. They address the controllability gap in existing systems and demonstrate the effectiveness of their proposed method through extensive experiments. Mustango not only achieves state-of-the-art music quality but also provides enhanced controllability, making it a valuable contribution to the field. The researchers release the MusicBench dataset, offering a resource for future research in text-to-music synthesis.
Check out the Paper and Github. All credit for this research goes to the researchers of this project.
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast and has a keen interest in the scope of software and data science applications. She is always reading about developments in different fields of AI and ML.