Visually aligned sound generation via sound-producing motion parsing [paper]
We propose to tame visually aligned sound generation by projecting the sound-producing motion onto a discriminative temporal visual embedding. This embedding can then distinguish transient visual motion from complex background information, which leads to generated sounds with high temporal alignment to the video. We refer to our method as SPMNet.
Code, pre-trained models, and all demos will be released here. Feel free to watch this repository for the latest updates.
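Until the official code is released, the following is a minimal, illustrative PyTorch sketch of the core idea described above: projecting per-frame visual features into a temporal embedding and using frame-wise saliency scores to separate transient sound-producing motion from static background context. All module names, layer choices (linear projection, GRU, sigmoid saliency), and dimensions are assumptions for illustration only, not the released SPMNet architecture.

```python
import torch
import torch.nn as nn


class MotionParsingSketch(nn.Module):
    """Toy sketch (not the official SPMNet): parse sound-producing motion
    into a discriminative temporal embedding via frame-wise saliency."""

    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        # Frame-wise projection of appearance features to the embedding space.
        self.proj = nn.Linear(feat_dim, embed_dim)
        # Temporal modeling over the frame sequence.
        self.temporal = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Scalar score per frame: high for frames with salient
        # (sound-producing) motion, low for background-only frames.
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) pre-extracted visual features.
        x = self.proj(frame_feats)          # (B, T, D)
        h, _ = self.temporal(x)             # (B, T, D) temporal context
        w = torch.sigmoid(self.score(h))    # (B, T, 1) motion saliency
        motion_embed = w * h                # emphasize transient motion
        background = (1.0 - w) * h          # residual background context
        return motion_embed, background, w.squeeze(-1)


if __name__ == "__main__":
    feats = torch.randn(2, 16, 2048)        # 2 clips, 16 frames each
    model = MotionParsingSketch()
    motion, bg, saliency = model(feats)
    print(motion.shape, saliency.shape)     # (2, 16, 512), (2, 16)
```

In such a sketch, the motion embedding would condition the sound generator so that acoustic onsets follow the frames with high saliency, while the background branch carries scene-level context.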
dog_1.mp4
dog_6.mp4
drum_1.mp4
drum_2.mp4
firework_1.mp4
firework_2.mp4
The generated audio samples can be heard in the demo videos above.
Our paper has been accepted by Neurocomputing. Please use the following BibTeX entry if you would like to cite our work:
@article{Ma2022VisuallyAS,
  title={Visually Aligned Sound Generation via Sound-Producing Motion Parsing},
  author={Xin Ma and Wei Zhong and Long Ye and Qin Zhang},
  journal={Neurocomputing},
  year={2022}
}
We acknowledge the following work: