

Abstract

Spatial audio is a crucial component in creating immersive experiences. However, recording it precisely can require expensive and specialized setups. Traditional simulation-based approaches to generating spatial audio rely on expert knowledge, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task: generating first-order Ambisonics given a sound category and the sound source's spatial location. We propose Diff-SAGe, an end-to-end flow-matching transformer diffusion model for this task. We carry out comprehensive evaluations using objective metrics that assess both adherence to the input conditioning and generation quality. Through extensive comparisons on two datasets, we show that our end-to-end method outperforms traditional baselines on both quantitative and qualitative metrics.
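For readers unfamiliar with the output representation, the sketch below illustrates how a mono source at a known direction maps onto the four first-order Ambisonics (FOA) channels under the common ACN/SN3D convention. This is a generic illustration of the target format only, not part of the Diff-SAGe model; the function name, sample rate, and test signal are made up for the example.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into first-order Ambisonics (ACN/SN3D)
    for a single static source direction.

    mono          : 1-D numpy array of samples
    azimuth_deg   : source azimuth in degrees (0 = front, counter-clockwise positive)
    elevation_deg : source elevation in degrees (0 = horizontal plane)

    Returns a (4, num_samples) array with channels ordered W, Y, Z, X (ACN).
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)

    w = mono                            # omnidirectional component (SN3D gain 1)
    y = mono * np.sin(az) * np.cos(el)  # left-right component
    z = mono * np.sin(el)               # up-down component
    x = mono * np.cos(az) * np.cos(el)  # front-back component

    return np.stack([w, y, z, x])

# Example: a hypothetical 1-second, 440 Hz tone placed 45 degrees to the left, slightly elevated.
sr = 24000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 440 * t), azimuth_deg=45, elevation_deg=10)
print(foa.shape)  # (4, 24000)
```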

Examples for TAU-NIGENS Spatial Sound Events 2020 (TAU-NIGENS-20) dataset

Randomly selected examples. (Some samples may have low volume, so you may need to increase the audio level.)

Methods compared (one audio player per column): Ground-Truth, Diff-SAGe (Ours), Reference Audio (Simulated), AudioLDM (Simulated), Tango (Simulated)

Sound categories:
Alarm
Baby Crying
Crash
Dog Barking
Engine Running
Female Screaming
Female Speaking
Fire Burning
Footsteps
Door Knock
Male Screaming
Male Speaking
Phone Ringing
Piano

Examples for TAU Spatial Sound Events 2019 (TAU-19) dataset

Randomly selected examples. (Some samples may have low volume, so you may need to increase the audio level.)

Methods compared (one audio player per column): Ground-Truth, Diff-SAGe (Ours), Reference Audio (Simulated), AudioLDM (Simulated), Tango (Simulated)

Sound categories:
Clearthroat
Cough
Door Slam
Drawer
Keyboard
Keys Drop
Knock
Laughter
Page Turn
Phone
Speech