Visually aligned sound generation via sound-producing motion parsing [paper]
We propose to tame visually aligned sound generation by projecting the sound-producing motion onto a discriminative temporal visual embedding. This embedding can then distinguish transient visual motion from complex background information, which leads to generated sounds with high temporal alignment to the video. We refer to our method as SPMNet.
Code, pre-trained models, and all demos will be released here. Feel free to watch this repository for the latest updates.
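Until the official code is released, the following is a minimal, illustrative PyTorch sketch of the core idea described above: projecting per-frame visual features into a temporal embedding and using frame-wise saliency scores to separate transient sound-producing motion from static background context. All module names, layer choices (linear projection, GRU, sigmoid saliency), and dimensions are assumptions for illustration only, not the released SPMNet architecture.

```python
import torch
import torch.nn as nn


class MotionParsingSketch(nn.Module):
    """Toy sketch (not the official SPMNet): parse sound-producing motion
    into a discriminative temporal embedding via frame-wise saliency."""

    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        # Frame-wise projection of appearance features to the embedding space.
        self.proj = nn.Linear(feat_dim, embed_dim)
        # Temporal modeling over the frame sequence.
        self.temporal = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Scalar score per frame: high for frames with salient
        # (sound-producing) motion, low for background-only frames.
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) pre-extracted visual features.
        x = self.proj(frame_feats)          # (B, T, D)
        h, _ = self.temporal(x)             # (B, T, D) temporal context
        w = torch.sigmoid(self.score(h))    # (B, T, 1) motion saliency
        motion_embed = w * h                # emphasize transient motion
        background = (1.0 - w) * h          # residual background context
        return motion_embed, background, w.squeeze(-1)


if __name__ == "__main__":
    feats = torch.randn(2, 16, 2048)        # 2 clips, 16 frames each
    model = MotionParsingSketch()
    motion, bg, saliency = model(feats)
    print(motion.shape, saliency.shape)     # (2, 16, 512), (2, 16)
```

In such a sketch, the motion embedding would condition the sound generator so that acoustic onsets follow the frames with high saliency, while the background branch carries scene-level context.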
dog_1.mp4
dog_6.mp4
drum_1.mp4
drum_2.mp4
firework_1.mp4
firework_2.mp4
The generated audio samples can be heard in the demo videos above.
Our paper has been accepted by Neurocomputing. Please use the following BibTeX entry if you would like to cite our work:
@article{Ma2022VisuallyAS,
  title={Visually Aligned Sound Generation via Sound-Producing Motion Parsing},
  author={Xin Ma and Wei Zhong and Long Ye and Qin Zhang},
  journal={Neurocomputing},
  year={2022}
}
We acknowledge the following work: