Music on demand: How AI composes entire songs with just a few keywords


The ability of generative AI to produce eloquent texts and impressive images is by now common knowledge. In addition to these achievements, however, a remarkable development is emerging in the field of music: generative AI can now compose entire songs, and all it takes is the input of a few keywords.

AI technologies for creating melodies or songs are known as generative music models; they translate a text entered by the user into a unique musical composition. The generated sounds reflect the mood, genre, or even specific details described in the text [1]. This is a major advance in how AI can support human creativity. Whether it’s creating background music for videos, composing thematic soundscapes for video games, or personalizing songs, text-conditioned music generation is becoming increasingly important.

Let’s look at an example. I used the following keywords to describe the music: “Simple melodic house track”. The text-to-music software used [2] then spits out a three-minute song within seconds. Here you can hear a short excerpt of the result:

Audio: excerpt of the generated track

How does text-conditioned music generation work?

The example above shows how a ready-to-use music model works in principle. The user formulates the desired sound in keywords, whereupon the AI translates the description into suitable music. To create such a link between text and audio, however, the model must first learn to identify key elements such as genre, mood, instruments and tempo in the text. This learning process takes place during the “training” of the model. Generative music models are trained on large amounts of music data consisting of pairs of songs and their key elements in text form. During training, the model learns which words are associated with which sounds. For example, it learns that certain words, such as “calm” or “fast”, correspond to specific sound patterns, such as soft melodies or fast rhythms. At the same time, it learns to recognize the patterns and structures of the songs themselves, which enables it to construct coherent pieces.
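To make the training setup concrete, the corpus can be pictured as a list of song/description pairs. The following minimal sketch (file names and captions are made up for illustration, not taken from any particular model) shows how such pairs might be served to a model during training:

```python
import torchaudio
from torch.utils.data import Dataset

# Hypothetical training pairs: each audio file comes with a textual
# description of its key elements (genre, mood, instruments, tempo).
TRAINING_PAIRS = [
    ("calm_piano_001.wav", "a calm, soft piano melody at a slow tempo"),
    ("fast_techno_042.wav", "a fast techno track with a driving rhythm"),
    ("jazzfunk_017.wav", "an upbeat jazz-funk tune with lush chords"),
]

class MusicCaptionDataset(Dataset):
    """Serves (waveform, caption) pairs for training a generative music model."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        waveform, sample_rate = torchaudio.load(path)  # decode audio to a tensor
        return waveform, caption
```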

 

Figure: Generative Music Principle

But how can text and sound be associated at all? After all, they are distinct modalities that are represented differently. This is precisely the crux of the matter. A generative music model maps the entered text descriptions and the audio data in such a way that they can be interpreted jointly and thus compared. Specifically, both modalities are transformed into the same latent space, a shared mathematical representation.

We can imagine this in simplified terms as follows: after the transformation, each input is represented by an arrow (a vector). If the two arrows point in a similar direction, the text and the sound are similar. Based on the countless training pairs, the AI learns how to tweak the transformation so that the arrows of each pair end up aligned. The direction of an arrow thus encodes the key elements of the text and the corresponding attributes of the sound.
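In code, each “arrow” is an embedding vector, and pointing in a similar direction means high cosine similarity. A common way to learn this alignment is a contrastive objective, as used in CLIP/CLAP-style models; the sketch below illustrates the idea (the function name and the temperature value are illustrative, not taken from a specific paper):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, audio_emb, temperature=0.07):
    """CLIP/CLAP-style loss: pull matching text/audio 'arrows' together.

    text_emb, audio_emb: (batch, dim) embeddings of paired captions and
    audio clips in the shared latent space.
    """
    # Normalize so that only the *direction* of each arrow matters.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Cosine similarity between every caption and every clip in the batch.
    logits = text_emb @ audio_emb.T / temperature

    # The i-th caption belongs to the i-th clip, so matching pairs lie on
    # the diagonal; training raises their similarity and lowers the rest.
    targets = torch.arange(len(text_emb), device=text_emb.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

Minimizing this loss is exactly what “aligning the arrows of a pair” means geometrically: matched text and audio embeddings are rotated toward each other, mismatched ones apart.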

 

Figure: Latent Space

Crucially, the model can also convert an arrow back into an audio signal. When we use a trained generative music model, we simply enter a text description. The transformation creates an arrow that can be interpreted both as the text and as an associated sound, and thus represents its attributes. Transforming this arrow back into audio produces a sound that matches the entered description.
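Expressed as code, inference is a two-step pipeline: encode the prompt into the latent space, then decode the resulting arrow back into audio. In this sketch, encode_text and decode_to_audio are placeholders for a concrete model’s encoder and decoder, not a real API:

```python
def generate_music(prompt: str, encode_text, decode_to_audio):
    """Conceptual text-to-music inference.

    encode_text:     text -> latent vector (the 'arrow')
    decode_to_audio: latent vector -> waveform
    Both are stand-ins for whatever components a concrete model provides.
    """
    latent = encode_text(prompt)    # place the description in the latent space
    return decode_to_audio(latent)  # transform the arrow back into sound
```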

 

Figure: Inference

Popular generative music models

Meta’s MusicGen [3]

Riffusion [4]

Suno [2]

All three examples were generated with the same description: “An upbeat deep house song from the 1980s with lush jazz-funk chords and touches of soul music”.
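Of these, MusicGen [3] is openly available and can be tried directly. Below is a minimal sketch based on its Hugging Face transformers implementation; the checkpoint size and token budget are chosen for illustration (roughly 256 new tokens correspond to about five seconds of audio):

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load the small pretrained MusicGen checkpoint and its text processor.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# The same description used for the examples above.
prompt = ("An upbeat deep house song from the 1980s with lush "
          "jazz-funk chords and touches of soul music")

inputs = processor(text=[prompt], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)

# Write the generated waveform to disk at the model's sampling rate.
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate,
                       data=audio_values[0, 0].numpy())
```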

Ethical and copyright considerations for generative music

While text-conditioned music models offer exciting possibilities, they also raise important ethical and legal issues, particularly in relation to copyright [5]. These models are trained on extensive datasets of existing music, some of which may be protected by copyright [6]. It is therefore conceivable that AI-generated music could inadvertently mimic essential elements of protected works, leading to potential copyright infringement [7].

In ethical terms, AI-supported creation has both transformative and disruptive effects on music makers [8]. It can help musicians increase their productivity and stimulate their creativity. Conversely, it could reduce employment opportunities for creatives. At the same time, generative music models depend on human-made audio data for their training in order to produce high-quality output; without artists, there is no generative artificial intelligence. A balance between the benefits of AI and fair compensation and recognition for human artists is therefore crucial in the further development of this technology.


References

[1] Bengesi, Staphord, et al. “Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers.” IEEE Access (2024).

[2] Suno. Suno AI. https://suno.com/. Accessed Sept. 12, 2024.

[3] Copet, Jade, et al. “Simple and Controllable Music Generation.” Advances in Neural Information Processing Systems 36 (2024).

[4] Forsgren, Seth, and Hayk Martiros. “Riffusion – Stable Diffusion for Real-Time Music Generation.” Riffusion, 2022, https://riffusion.com/about. Accessed Sept. 12, 2024.

[5] Deng, Junwei, Shiyuan Zhang, and Jiaqi Ma. “Computational Copyright: Towards a Royalty Model for Music Generative AI.”

[6] Peukert, Christian, and Margaritha Windisch. “The Economics of Copyright in the Digital Age.” Journal of Economic Surveys (2024).

[7] Henderson, Peter, et al. “Foundation Models and Fair Use.” Journal of Machine Learning Research 24.400 (2023): 1-79.

[8] Lin, Tsen-Fang, and Liang-Bi Chen. “Harmony and Algorithm: Exploring the Advancements and Impacts of AI-Generated Music.” IEEE Potentials (2024).


AUTHOR: Yannis Schmutz

Yannis Schmutz is a research associate at the Generative AI Lab at Bern University of Applied Sciences. His research focuses on audio and image generation as well as deep learning-based weather reconstruction.
