Spatial audio is a crucial component of immersive experiences, but recording it precisely can require expensive, specialized setups. Traditional simulation-based approaches to generating spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task: generating first-order Ambisonics given a sound category and the spatial location of the sound source. We propose Diff-SAGe, an end-to-end flow-matching transformer diffusion model, to approach this task. We carry out comprehensive evaluations using objective metrics related to both the input conditioning and generation quality. Through extensive comparisons on two datasets, we show that our end-to-end method outperforms the traditional baselines on both quantitative and qualitative metrics.
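To make the task's output format concrete, here is a minimal sketch (not the Diff-SAGe model) of how a simulation-style approach can place a mono recording at a given direction in first-order Ambisonics. The AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), the anechoic plane-wave assumption, and the function names `encode_foa` and `estimate_direction` are assumptions made for illustration; the abstract does not specify them.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as a plane-wave source in first-order Ambisonics.

    Assumed convention: AmbiX (ACN order W, Y, Z, X; SN3D normalization).
    Azimuth is measured counterclockwise from the front, elevation upward
    from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left-right
        math.sin(el),                 # Z: up-down
        math.cos(az) * math.cos(el),  # X: front-back
    )
    # Each output channel is the mono signal scaled by its directional gain.
    return [[g * s for s in mono] for g in gains]


def estimate_direction(foa):
    """Estimate (azimuth, elevation) in degrees from the time-averaged
    product of W with each directional channel (an intensity-vector estimate)."""
    w, y, z, x = foa
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# A 440 Hz tone sampled at 16 kHz, placed 30 degrees to the left and 10 degrees up.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
foa = encode_foa(tone, azimuth_deg=30.0, elevation_deg=10.0)
az, el = estimate_direction(foa)
```

Because the gains in this sketch depend only on direction, the same spatial pattern is applied to every sound category, which is one way to read the independence between semantic and spatial information mentioned above. The direction estimate is included only as a sanity check on the encoding and is not necessarily one of the metrics used in the paper.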
Randomly selected examples. (Some samples may have low volume, so you may need to increase the audio level.)
| Sound Category | Ground-Truth | Diff-SAGe (Ours) | Reference Audio (Simulated) | AudioLDM (Simulated) | Tango (Simulated) |
|---|---|---|---|---|---|
| Alarm | | | | | |
| Baby Crying | | | | | |
| Crash | | | | | |
| Dog Barking | | | | | |
| Engine Running | | | | | |
| Female Screaming | | | | | |
| Female Speaking | | | | | |
| Fire Burning | | | | | |
| Footsteps | | | | | |
| Door Knock | | | | | |
| Male Screaming | | | | | |
| Male Speaking | | | | | |
| Phone Ringing | | | | | |
| Piano | | | | | |
Randomly selected examples. (Some samples may have low volume, so you may need to increase the audio level.)
| Sound Category | Ground-Truth | Diff-SAGe (Ours) | Reference Audio (Simulated) | AudioLDM (Simulated) | Tango (Simulated) |
|---|---|---|---|---|---|
| Clearthroat | | | | | |
| Cough | | | | | |
| Door Slam | | | | | |
| Drawer | | | | | |
| Keyboard | | | | | |
| Keys Drop | | | | | |
| Knock | | | | | |
| Laughter | | | | | |
| Page Turn | | | | | |
| Phone | | | | | |
| Speech | | | | | |
@article{kushwaha2024diff,
title={Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models},
author={Kushwaha, Saksham Singh and Ma, Jianbo and Thomas, Mark RP and Tian, Yapeng and Bruni, Avery},
journal={arXiv preprint arXiv:2410.11299},
year={2024}
}