
Introducing NVIDIA’s Audio Flamingo, the Next Frontier in Audio Language Models


Understanding sound is undeniably essential for an agent’s interaction with the world. Despite the impressive capabilities of large language models (LLMs) in comprehending and reasoning over textual data, their understanding of sound remains limited.

In their recent paper titled “Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities,” a team of researchers from NVIDIA introduces Audio Flamingo, a novel audio language model. This model incorporates in-context learning (ICL), retrieval-augmented generation (RAG), and multi-turn dialogue capabilities, achieving state-of-the-art performance across various audio understanding tasks.

The team summarizes their key contributions as follows:

  1. We propose Audio Flamingo: a Flamingo-based audio language model for audio understanding with a series of innovations. Audio Flamingo achieves state-of-the-art results on several close-ended and open-ended audio understanding tasks.
  2. We design a series of methodologies for efficient use of ICL and retrieval, which lead to the state-of-the-art few-shot learning results.
  3. We enable Audio Flamingo to have strong multi-turn dialogue ability, and show significantly better results compared to baseline methods.
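
The article does not describe how retrieval selects in-context examples. As a rough illustration only, the sketch below assumes one common approach: rank stored (embedding, caption) pairs by cosine similarity to the query audio embedding and keep the top k as few-shot examples. The function names, the toy two-dimensional embeddings, and the captions are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_few_shot(query_emb: Sequence[float],
                      store: List[Tuple[Sequence[float], str]],
                      k: int = 2) -> List[str]:
    """Return the captions of the k stored items most similar to the query.

    Assumption: similarity search over audio embeddings; the article does
    not state which retrieval method Audio Flamingo actually uses.
    """
    ranked = sorted(store, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


# Toy data, purely illustrative.
store = [
    ([1.0, 0.0], "dog barking"),
    ([0.0, 1.0], "rain falling"),
    ([0.9, 0.1], "dog growling"),
]
examples = retrieve_few_shot([1.0, 0.0], store, k=2)
```

The retrieved captions would then be placed in the prompt ahead of the query audio, which is the general idea behind retrieval-based in-context learning.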

The Audio Flamingo architecture consists of four components: i) an audio feature extractor with sliding window, ii) audio representation transformation layers, iii) a decoder-only language model, and iv) gated xattn-dense layers.

Specifically, the team uses ClapCap (Elizalde et al., 2023b) as the backbone for the audio feature extractor, processing 7-second, 44.1kHz raw audio inputs into a 1024-dimensional vector representation. For longer audio segments, they use sliding windows to capture temporal information effectively.
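
To make the sliding-window idea concrete, here is a minimal sketch of how a long clip could be cut into 7-second windows at 44.1kHz before each window is encoded into one 1024-dimensional vector. The window length and sample rate come from the text above; the hop size (`hop_seconds`) is a hypothetical parameter, since the article does not state the overlap, and `sliding_windows` is an illustrative helper rather than the authors' code.

```python
from typing import List, Tuple

SAMPLE_RATE = 44_100    # Hz, as stated in the article
WINDOW_SECONDS = 7.0    # window length, as stated in the article


def sliding_windows(num_samples: int,
                    hop_seconds: float,
                    sample_rate: int = SAMPLE_RATE,
                    window_seconds: float = WINDOW_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of fixed-length windows over a clip.

    hop_seconds is an assumption for illustration only. The last window may
    extend past the end of the clip; a caller would zero-pad that window.
    """
    window = int(round(window_seconds * sample_rate))
    hop = int(round(hop_seconds * sample_rate))
    if hop <= 0:
        raise ValueError("hop_seconds must be positive")
    spans = [(0, window)]
    start = hop
    # Keep adding windows until the end of the clip is covered.
    while spans[-1][1] < num_samples:
        spans.append((start, start + window))
        start += hop
    return spans


# A 20-second clip with a hypothetical 3.5-second hop.
spans = sliding_windows(20 * SAMPLE_RATE, hop_seconds=3.5)
```

Under these assumptions each span would be passed to the feature extractor separately, so a clip covered by n windows yields n vectors of 1024 dimensions each.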

The audio representation transformation layers consist of three self-attention layers, each with 8 heads and an inner dimension of 2048. For the language model, they use OPT-IML-MAX-1.3B (Iyer et al., 2022), a model with 1.3 billion parameters and 24 LM blocks. They integrate gated xattn-dense layers from Flamingo to condition the model on audio inputs.
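
The gating idea behind the xattn-dense layers can be sketched in a few lines. In the original Flamingo design, the cross-attention output is scaled by tanh of a learnable scalar that starts at zero, so the language model initially behaves as if no extra input were attached. The code below is a simplified single-head illustration in plain Python: learned projections and the gated dense (feed-forward) half of the layer are omitted, and all names are hypothetical rather than taken from the Audio Flamingo code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, audio: List[Vector]) -> Vector:
    """Single-head dot-product attention of one text state over audio vectors.

    Query, key, and value projections are left as identity to keep the
    sketch short; a real layer would learn them.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * a for q, a in zip(query, vec)) / scale for vec in audio]
    weights = softmax(scores)
    return [sum(w * vec[i] for w, vec in zip(weights, audio))
            for i in range(len(query))]


def gated_xattn(text_state: Vector, audio: List[Vector], alpha: float) -> Vector:
    """Residual update: text_state + tanh(alpha) * attention(text_state, audio)."""
    gate = math.tanh(alpha)
    attended = cross_attend(text_state, audio)
    return [t + gate * a for t, a in zip(text_state, attended)]


text_state = [1.0, 0.0]
audio = [[0.0, 1.0], [1.0, 0.0]]
unchanged = gated_xattn(text_state, audio, alpha=0.0)  # gate closed
mixed = gated_xattn(text_state, audio, alpha=1.0)      # gate partly open
```

With `alpha` at zero the gate is closed and the text state passes through unchanged; as `alpha` moves away from zero during training, audio information is gradually mixed in.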

The researchers evaluated Audio Flamingo across a diverse range of close-ended and open-ended benchmarks. A single Audio Flamingo model outperforms previous state-of-the-art systems on most benchmarks, with the dialogue version significantly surpassing baseline performance on dialogue tasks.

The team intends to open-source both the training and inference code for Audio Flamingo, with a demo website available at https://audioflamingo.github.io/.

The paper Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities is on arXiv.


Author: Hecate He | Editor: Chain Zhang



The post Introducing NVIDIA’s Audio Flamingo, the Next Frontier in Audio Language Models first appeared on Synced.


via Synced https://ift.tt/mE2PSgp

February 11, 2024 at 04:14PM
