Thursday, 4 May 2023

TANGO: The Latent Diffusion Model for Realistic Text-to-Audio Generation


Introduction

Declare Lab is a Singapore-based research lab that works on AI problems requiring deep cognitive and language understanding. The lab's mission is to breathe cognitive and language skills of human-like depth into machines by solving challenging NLP problems, such as dialogue comprehension and generation, commonsense reasoning, multimodal understanding, and more. The lab identified the lack of large-scale datasets with high-quality text-audio pairs as one of the main reasons why audio still lags behind in large-scale multimodal generative modeling. To address this issue, the lab developed a model called 'Tango' that performs competitively despite being trained on a small dataset.

What is Tango?

Tango is a text-to-audio generative model that uses FLAN-T5 as its text encoder and trains a UNet-based latent diffusion model for audio generation. Despite being trained on a much smaller dataset than other state-of-the-art models (TANGO's LDM uses roughly 63x less data), it performs comparably across both objective and subjective metrics.

Components of Tango

The Tango project consists of three main components: a textual-prompt encoder, a latent diffusion model, and a mel-spectrogram/audio VAE.
  • Textual-Prompt Encoder: This component receives the input text description of the desired audio and encodes it into a textual representation.
  • Latent Diffusion Model: This component constructs a latent representation of the desired audio, starting from standard Gaussian noise and refining it through reverse diffusion, guided by the textual representation from the encoder.
  • Mel-Spectrogram/Audio VAE: This component decodes the latent audio representation into a mel-spectrogram, which is then converted into the final audio output by a vocoder.
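The three-stage pipeline above can be sketched in miniature. This is a toy illustration in plain NumPy, not the real model: `encode_text`, `denoise_step`, and `decode` are stand-ins for the FLAN-T5 encoder, the UNet denoiser, and the VAE decoder plus vocoder, respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str) -> np.ndarray:
    # Stand-in for the FLAN-T5 text encoder: map a prompt to a fixed vector.
    prompt_rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return prompt_rng.standard_normal(8)

def denoise_step(z: np.ndarray, text_emb: np.ndarray, t: int) -> np.ndarray:
    # Stand-in for the UNet: nudge the noisy latent toward the text embedding.
    return z + 0.1 * (text_emb - z)

def decode(z: np.ndarray) -> np.ndarray:
    # Stand-in for the mel-spectrogram VAE decoder + vocoder.
    return np.tanh(z)

def generate(prompt: str, steps: int = 50) -> np.ndarray:
    text_emb = encode_text(prompt)
    z = rng.standard_normal(8)          # start from standard Gaussian noise
    for t in reversed(range(steps)):    # reverse diffusion, guided by the text
        z = denoise_step(z, text_emb, t)
    return decode(z)                    # map latent back to the output space

audio = generate("a dog barking in the rain")
print(audio.shape)  # (8,)
```

The real components are neural networks trained on text-audio pairs; only the data flow (text → latent via reverse diffusion → decoded audio) is faithful here.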

Key Features
  • Tango generates audio from textual descriptions.
  • It can accurately represent different environments and sounds based on the input description.
  • Tango is able to refine sounds with additional details, such as a racing car approaching and then fading away.
  • It can generate complex soundscapes, such as a battlefield.

Comparison with Audio LDM


AudioLDM is another system that generates audio from text inputs. It is a latent diffusion model that learns continuous audio representations from contrastive language-audio pretraining (CLAP) latents, and it can generate text-conditional sound effects, human speech, and music. AudioLDM is trained on a single GPU without text supervision, and it enables zero-shot text-guided audio style transfer, inpainting, and super-resolution. As a TTA system it benefits from computational efficiency and text-conditional audio manipulations while achieving state-of-the-art generation quality with continuous LDMs. AudioLDM was the first TTA system able to generate realistic audio samples given any text input.
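The CLAP idea underlying AudioLDM can be sketched with toy vectors: text and audio are embedded into a shared space, and matched pairs should score higher than mismatched ones. The embeddings below are hand-made stand-ins, not real CLAP outputs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity score used to compare text and audio embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for CLAP's text and audio encoders.
text_emb  = np.array([0.9, 0.1, 0.0])   # embedding of "dog barking"
audio_pos = np.array([1.0, 0.2, 0.1])   # recording of a dog barking
audio_neg = np.array([0.0, 0.1, 1.0])   # recording of rainfall

# Contrastive pretraining pushes matched pairs together in this space,
# so the matched pair scores much higher than the mismatched one.
print(round(cosine_similarity(text_emb, audio_pos), 2))  # high
print(round(cosine_similarity(text_emb, audio_neg), 2))  # low
```

Because the shared space aligns text with audio, AudioLDM can condition its diffusion process on CLAP latents rather than requiring paired text supervision during training.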

Compared to AudioLDM, Tango produces more refined and accurate sound outputs. Let us check this out with the examples below.

Examples of Sound Outputs

Tango can create realistic soundscapes, like a stadium full of cheering fans.
  • Audio LDM

  • Tango
Source: https://tango-web.github.io/
You can create sound effects by typing in a short prompt such as "Gentle water stream, birds chirping and sudden gunshot".
Tango will then generate an audio response based on this prompt.

  • Audio LDM

  • Tango

Source: https://tango-web.github.io/
Results 

There are many more such examples comparing how Tango and AudioLDM respond to the same prompts, and the results are astonishing.
Despite being trained on much smaller datasets, the Tango model outperforms current state-of-the-art models in text-to-audio generation. Its performance holds up across a range of objective and subjective metrics that measure different aspects of text-to-audio quality.

Using Tango on the Web Front

To use Tango on the web front, simply enter your prompt and click "Submit". The response time may be slow due to high usage, but you can increase the number of steps and the guidance scale, and tweak other parameters for different responses. If you have a powerful GPU, it is recommended to run Tango locally.
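The "guidance scale" knob mentioned above corresponds to classifier-free guidance, a standard trick in diffusion samplers: the model's prediction with the text prompt is pushed away from its prediction with an empty prompt. A minimal sketch with made-up prediction vectors:

```python
import numpy as np

def guided_prediction(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the text-conditioned one by `scale`.
    return uncond + scale * (cond - uncond)

cond   = np.array([1.0, 2.0])   # toy noise prediction given the text prompt
uncond = np.array([0.0, 0.0])   # toy noise prediction with an empty prompt

print(guided_prediction(cond, uncond, 1.0))  # [1. 2.] - plain conditional output
print(guided_prediction(cond, uncond, 3.0))  # [3. 6.] - stronger pull toward the prompt
```

A higher scale makes the output adhere more strongly to the prompt at the cost of diversity, which is why tweaking it changes the character of the generated audio. More denoising steps similarly trade speed for fidelity.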


All the links you need to access the research paper, repository, demo, and installation instructions for the Tango project are provided in the 'Sources' section at the end of this article. Be sure to check it out!

Limitations of Tango
  • Tango has been trained on a relatively small dataset called AudioCaps, which limits its ability to generate good audio samples for concepts not included in the dataset.
  • It may not be able to generate good quality outputs for singing or monologues as it has not been trained on these datasets yet.
  • It requires a large amount of data to train effectively.
  • It can be difficult to fine-tune the model for specific use cases.

Future Developments of Tango

The Declare Lab research team is currently training another version of the model on larger datasets to enhance its generalization, compositional, and controllable generation ability. The team is inviting collaborators and sponsors to help train TANGO on larger datasets.

Conclusion

Overall, the TANGO model is a machine learning model developed by Declare Lab that generates realistic audio from textual prompts. It consists of three main components: a textual-prompt encoder, a latent diffusion model, and a mel-spectrogram/audio VAE. Because it is trained on the small AudioCaps dataset, it may not generate good audio samples for concepts it has not seen in training. Even so, this latent diffusion model (LDM)-based approach outperforms the state-of-the-art AudioLDM on most metrics and stays comparable on the rest, and it has the potential to improve further when trained on larger datasets.


Sources
Declare-Lab website: https://declare-lab.net/_pages/tango/
Research paper: https://arxiv.org/abs/2304.13731
Demo: https://huggingface.co/spaces/declare-lab/tango
Repo: https://github.com/declare-lab/tango

