DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
pip install -r requirements.txt
Alternatively, you can use the provided Dockerfile to build a containerized environment.
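If you take the Docker route, a minimal sketch looks like this (the image tag, GPU flag, and mount point are illustrative assumptions, not repo-defined names):

docker build -t deepresonance .
docker run --gpus all -it -v $(pwd):/workspace deepresonance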
- Put DeepResonance_α and DeepResonance_β downloaded from Hugging Face into ./ckpt/ (the expected layout is sketched after this list).
- Referring to NExT-GPT, please prepare the checkpoints of ImageBind (huge) and Vicuna (7b-v0) in ./ckpt/pretrained_ckpt.
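For reference, the checkpoint directory is then expected to look roughly as follows. The delta-checkpoint paths come from the inference commands below; the pretrained_ckpt subfolder names follow NExT-GPT conventions and are assumptions to verify against your download:

ckpt/
├── deepresonance_alpha_delta_ckpt/deepresonance/7b_tiva_v0/   # DeepResonance_α
├── deepresonance_beta_delta_ckpt/deepresonance/7b_tiva_v0/    # DeepResonance_β
└── pretrained_ckpt/
    ├── imagebind_ckpt/   # ImageBind (huge)
    └── vicuna_ckpt/      # Vicuna (7b-v0)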
cd code
DATASET=musiccaps # select from [musicqa, musiccaps, music4way_musiccaps, music4way_mi2t, music4way_mv2t, music4way_any2t]; dataset paths are defined in inference_deepresonance.py
OUTPUT=musiccaps_dra_res
python inference_deepresonance.py --dataset $DATASET --result_file_name $OUTPUT --ckpt_path ../ckpt/deepresonance_alpha_delta_ckpt/deepresonance/7b_tiva_v0 --imagebind_embs_seq
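To sweep every supported dataset in one run, a simple bash loop over the list above works (the result-file naming scheme here is just an illustrative convention):

for DATASET in musicqa musiccaps music4way_musiccaps music4way_mi2t music4way_mv2t music4way_any2t; do
  python inference_deepresonance.py --dataset $DATASET --result_file_name ${DATASET}_dra_res \
    --ckpt_path ../ckpt/deepresonance_alpha_delta_ckpt/deepresonance/7b_tiva_v0 --imagebind_embs_seq
done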
cd code
DATASET=musiccaps # select from [musicqa, musiccaps, music4way_musiccaps, music4way_mi2t, music4way_mv2t, music4way_any2t]; dataset paths are defined in inference_deepresonance.py
OUTPUT=musiccaps_drb_res
python inference_deepresonance.py --dataset $DATASET --result_file_name $OUTPUT --ckpt_path ../ckpt/deepresonance_beta_delta_ckpt/deepresonance/7b_tiva_v0 --prellmfusion --imagebind_embs_seq
- Put all the text datasets downloaded from Hugging Face into ./data.
- The multimodal source data (music, videos, and images) must be downloaded separately using the IDs listed in each text file; we cannot redistribute it due to licensing issues. All of it originally comes from the AudioSet dataset. Refer to the filtered subset of M2UGen and download all the video and music pairs from YouTube (see the sketch below).
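As a starting point for fetching the clips, here is a hedged sketch using yt-dlp (yt-dlp is not part of this repo, and the ID-list file name and output layout are assumptions; adapt them to the text files in ./data):

# ids.txt: one YouTube video ID per line (hypothetical file name)
while read -r YT_ID; do
  yt-dlp -o "data/raw/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=${YT_ID}"
done < ids.txt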
- Train DeepResonance-α:
cd code
source scripts/train_deepresonance_alpha.sh ../ckpt/deepresonance_alpha_delta_ckpt_exp1
- Train DeepResonance-β:
cd code
source scripts/train_deepresonance_beta.sh ../ckpt/deepresonance_beta_delta_ckpt_exp1
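After training, you can evaluate the new delta checkpoint by pointing the inference script at the training output directory (assuming the saved layout mirrors the released checkpoints):

cd code
python inference_deepresonance.py --dataset musiccaps --result_file_name musiccaps_exp1_res \
  --ckpt_path ../ckpt/deepresonance_beta_delta_ckpt_exp1/deepresonance/7b_tiva_v0 --prellmfusion --imagebind_embs_seq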
If you find this repo useful, please consider citing:
@article{DBLP:journals/corr/abs-2502-12623,
  author     = {Zhuoyuan Mao and
                Mengjie Zhao and
                Qiyu Wu and
                Hiromi Wakaki and
                Yuki Mitsufuji},
  title      = {DeepResonance: Enhancing Multimodal Music Understanding via Music-centric
                Multi-way Instruction Tuning},
  journal    = {CoRR},
  volume     = {abs/2502.12623},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2502.12623},
  doi        = {10.48550/ARXIV.2502.12623},
  eprinttype = {arXiv},
  eprint     = {2502.12623},
  timestamp  = {Wed, 19 Mar 2025 11:49:47 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2502-12623.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}