Music and Audio with AI

SingSong: Generating musical accompaniments from singing
Chris Donahue, Antoine Caillon¹, Adam Roberts, Ethan Manilow, Philippe Esling¹, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel
Google Research, ¹IRCAM, Equal Contribution
We present SingSong, a system which generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus of music audio to produce aligned pairs of vocals and instrumental sources. Then, we adapt AudioLM, a state-of-the-art approach for unconditional audio generation, to be suitable for conditional “audio-to-audio” generation tasks, and train it on the source-separated (vocal, instrumental) pairs. To improve our system’s generalization from source-separated training data (where the vocals contain artifacts of the instrumental) to isolated vocals we might expect from users, we explore a number of different featurizations of vocal inputs, the best of which improves quantitative performance on isolated vocals by 53% relative to the default AudioLM featurization. In a pairwise comparison with the same vocal inputs, listeners expressed a significant preference for instrumentals generated by SingSong compared to those from a strong retrieval baseline.

We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music.

Prime Voice AI
The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling.

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., Fréchet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at this https URL.

Byte Pair Encoding for Symbolic Music

Symbolic Music: Music stored in a notation-based format (e.g., MIDI), which contains explicit information about note onsets and pitch on individual tracks (for different instruments) but, in contrast to digital audio, no sound.

Nathan Fradet, Jean-Pierre Briot, Fabien Chhel, Amal El Fallah Seghrouchni, Nicolas Gutowski

The symbolic music modality is nowadays mostly represented as discrete and used with sequential models such as Transformers for deep learning tasks. Recent research has put effort into tokenization, i.e., the conversion of data into sequences of integers intelligible to such models. This can be achieved in many ways, as music can be composed of simultaneous tracks and of simultaneous notes with several attributes. Until now, the proposed tokenizations have been based on small vocabularies describing the note attributes and time events, resulting in fairly long token sequences. In this paper, we show how Byte Pair Encoding (BPE) can improve the results of deep learning models while also improving their performance. We experiment on music generation and composer classification, study the impact of BPE on how models learn the embeddings, and show that it can help to increase their isotropy, i.e., the uniformity of the variance of their positions in the space.
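To make the idea concrete, here is a minimal sketch of BPE applied to a sequence of integer tokens, assuming a toy vocabulary in which each integer stands for a note attribute or time event. The token ids, the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the example sequence are hypothetical illustrations, not the authors' implementation or any library's API.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Count adjacent token pairs across all sequences; return (pair, count) or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Greedily merge the most frequent adjacent pair, up to `num_merges` times."""
    merges = {}
    seqs = [list(s) for s in seqs]
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < 2:
            break  # nothing repeats, so a further merge would not shorten anything
        pair = best[0]
        new_id = first_new_id + len(merges)
        merges[pair] = new_id
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


# Hypothetical toy vocabulary (an assumption for illustration only):
# 0 = pitch C4, 1 = velocity, 2 = duration, 3 = time shift, 4 = pitch E4
track = [0, 1, 2, 3, 0, 1, 2, 3, 4, 1, 2]
merges, encoded = learn_bpe([track], num_merges=2, first_new_id=5)
print(len(track), "tokens before,", len(encoded[0]), "tokens after")
```

In this toy case the pair (velocity, duration) is merged first because it follows every note, and two merges shorten the sequence from 11 tokens to 6. The vocabulary grows by one id per merge, which is the trade-off the abstract describes: a larger vocabulary in exchange for shorter sequences.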

AudioLM – A Language Modeling Approach to Audio Generation

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

Abstract. We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.

Audio AI timeline:

From Simple Patterns to Sentience’s Complexity

“For AI to be able to take over a programmer’s job, the client would need to know what he wants. We’re safe…”
This job-related predicament arises, as I see it, only for a brief moment in a much broader natural course. It is the kind of meme that strives to survive; it probably deserves a definitive NFT minting as of 03.2022, before it soon falls into oblivion.
I want to start tackling its survival efficiency by talking about AI and its power as I see it from this standpoint: over the past decade, AI has taken only the first steps of its infancy, yet with clear signals of world-changing capabilities. This goes further, through refinements of its (rightly chosen) ontology.
We live in a world (or better said, this is the way the world works) in which almost everything employs, from within, some mechanics of refinement and adaptation: a machinery that moves things towards escaping chaos, at great energy cost, but with energy that partly comes from within (as the “will” opposed to the “death drive”). This goes from the inherent patterns in nature up to the far end of the spectrum of human activity(1). It is the underlying battle between order and entropy(2) on the surface of our particular(3) universe with its laws. Above this lies the layer of simple emerging patterns, leading to the ultimate, macro refinement substrate of sentient manifestations, and on top of that the layer of the symbolic, of cultures and abstract thinking. A predictable pattern, all right, rising further within the brain’s energy patterns to a culmination that leads to the creation of synthetic worlds, artificial sentience, and transcendental states of being in the digital universe as the next steps. A macro, ever-growing vertical ontology at work.
Within this broader context, refinement further deepens the domains. Within the AI domain, current advances allow neural training on larger-than-ever data sets, as in language and vision, with a touch of the symbolic, towards incipient meaning. In the context of programming, we see the first results of training on a good part of all human-written programming code, and we are able to put that to work on business requirements, with programming languages, in real use cases (necessary for a program to have a purpose), for now in the form of AI-assisted programming(4).
Further refinement would lead to a more natural way of conceiving programs through language processing of the requirements, from the problems to the actual code generation. And, again, a further refinement into symbolic AI would bring the predictable outcome of AI not only answering questions but offering solutions before the question is asked(5). All of that within domain criteria based on programming/AI ethics, best-practice solutions, security, cultural impact, etc.
On the side of symbolic AI, there is currently an upward trend of trying different models of processing, a process that in itself requires further research. At the same time, I see that this process is hindered by the fact that the models still map, or try to mimic, some partial models of the mind, trying to explain how the brain works and posing answers to questions related to consciousness(6).
I am still on this path, researching my own symbolic model within the essential, unspoiled concepts advanced through the innovative approach of Ludwig Wittgenstein:

“The reason computers have no understanding of the sentences they process is not that they lack sufficient neuronal complexity, but that they are not, and cannot be, participants in the culture to which the sentences belong. A sentence does not acquire meaning through the correlation, one to one, of its words with objects in the world; it acquires meaning through the use that is made of it in the communal life of human beings.”

What lies ahead is not only what many would call, by yesterday’s standards, a “grim” future; it should also be remembered that one cannot oppose the refinements, because doing so also requires effort and energy that no one possesses in sufficient measure. Resisting through a self-cultivated death drive will only help in the short run… so remember this meme and laugh at its NFT later.

(1) forms of life with language games adaptation, creation activities with continuous refinements of their ontological models, circulating concept cultures.
(2) with simple patterns from which something emerges and with the counter action of opposite forces from nature up to the psyche and symbolic, the death drive.
(3) multiverse theory, which, very briefly explained, holds that the eternal, timeless energy waves produce bubbles of universes, each with its own fundamentals.
(4) Copilot software, which has as its basis all of the GitHub source code.
(5) if we have the right domain question we have the answer, in that the answer is there, it is only that briefly something is obscuring it from view.
(6) on questions related to the knowledge of ourselves, which in fact, are not of scientific nature.

C. Stefan / 24.03.2022