Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee
```bash
git clone https://github.com/jjunak-yun/FLowHigh_code.git
cd FLowHigh_code
pip install -r requirements.txt
```
- Download the VCTK dataset.
- Remove speakers `p280` and `p315` from the dataset.
- Create a `train` directory and a `test` directory, then split the dataset accordingly (a minimal split sketch follows this list).
- Update the `data_path` in the `configs/config.json` file with the path to your newly created `train` directory.
- To adjust the training conditions, modify the `configs/config.json` file according to your preferences.
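The exact train/test split is left to the user. Below is a minimal Python sketch of the preprocessing steps above; the paths, the held-out speaker count, and the use of `shutil.copytree` are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative preprocessing sketch: exclude p280/p315 and split VCTK
# into train/ and test/ by speaker. Paths and split size are assumptions.
import random
import shutil
from pathlib import Path

VCTK_ROOT = Path("/PATH/VCTK-Corpus/wav48")  # hypothetical download location
OUT_ROOT = Path("/PATH/dataset")             # hypothetical output location
EXCLUDED = {"p280", "p315"}                  # speakers removed per the steps above

speakers = sorted(d for d in VCTK_ROOT.iterdir()
                  if d.is_dir() and d.name not in EXCLUDED)
random.seed(0)
random.shuffle(speakers)

n_test = 8  # arbitrary held-out speaker count
for split, spk_dirs in [("test", speakers[:n_test]), ("train", speakers[n_test:])]:
    for spk in spk_dirs:
        shutil.copytree(spk, OUT_ROOT / split / spk.name)

# Afterwards, point "data_path" in configs/config.json to /PATH/dataset/train.
```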
```bash
CUDA_VISIBLE_DEVICES=0 python train.py
```
- `FLowHigh_indep_adaptive_400k`: The main model from our paper, an audio super-resolution model designed to reconstruct low-resolution audio into high-resolution audio at 48 kHz.
- `FLowHigh_basic_400k`: A model that adopts a conditional probability path based on basic flow matching; it likewise reconstructs low-resolution audio into high-resolution audio at 48 kHz.
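For reference, here is a minimal sketch of the two standard conditional probability path constructions these checkpoints are named after, assuming the formulations from the papers cited in the parameter table below (Lipman et al. for `basic_cfm`, Tong et al. for independent CFM). The adaptive variant used by the main model is defined in our paper; see the TorchCFM-based training code for the exact implementation.

```python
# Sketch of the standard conditional probability paths (not this
# repository's exact code). Given data x1 (and a source sample x0),
# each returns a time t, a point x_t on the path, and the velocity
# target that the vector-field network regresses.
import torch

def basic_cfm(x1, sigma):
    """Basic flow matching (Lipman et al., 2022)."""
    x0 = torch.randn_like(x1)                 # Gaussian source sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1))
    xt = (1 - (1 - sigma) * t) * x0 + t * x1  # conditional path
    target = x1 - (1 - sigma) * x0            # velocity target
    return t, xt, target

def independent_cfm_constant(x0, x1, sigma):
    """Independent coupling with a constant noise scale (Tong et al., 2023)."""
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1))
    mu = (1 - t) * x0 + t * x1                # straight-line interpolant
    xt = mu + sigma * torch.randn_like(mu)    # constant-sigma perturbation
    target = x1 - x0                          # velocity target
    return t, xt, target
```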
- Prepare the checkpoint of the trained model.
- Prepare a downsampled audio sample with a sampling rate lower than 48 kHz (e.g., 12 kHz, 16 kHz). Note: if you wish to match the experimental setup in our paper, use `scipy.signal.resample_poly()` to downsample the audio (a sketch is given after the command below).
- Run the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
--input_path {downsampled_audio_path} --output_path {save_output_audio_path} \
--target_sampling_rate 48000 --up_sampling_method scipy --architecture='transformer' \
--time_step 1 --ode_method={ode_solver} --cfm_method={cfm_path} --sigma 0.0001 \
--model_path {model_checkpoint_path} \
--n_layers 2 --n_heads 16 --dim_head 64 \
--n_mels 256 --f_max 24000 --n_fft 2048 --win_length 2048 --hop_length 480 \
--vocoder 'bigvgan' --vocoder_path='/PATH/vocoder/BIGVGAN/checkpoint/g_48_00850000' \
--vocoder_config_path='/PATH/vocoder/BIGVGAN/config/bigvgan_48khz_256band_config.json'
```
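As noted above, the paper's setup downsamples with `scipy.signal.resample_poly`. Here is a minimal sketch of preparing a low-resolution input; the file names and the `soundfile` I/O library are illustrative assumptions, not requirements of this repository.

```python
# Create a low-resolution test input via polyphase resampling, matching
# the paper's stated setup. File names and the soundfile library are
# illustrative choices.
import math
import soundfile as sf
from scipy.signal import resample_poly

wav, sr = sf.read("speech_48k.wav")  # source audio, e.g. 48 kHz
target_sr = 16000                    # e.g., 12 kHz or 16 kHz
g = math.gcd(target_sr, sr)
lowres = resample_poly(wav, up=target_sr // g, down=sr // g)
sf.write("speech_16k.wav", lowres, target_sr)
```

The resulting file is then passed as `--input_path`.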
| Parameter Name | Description |
|---|---|
| `--time_step` | The number of steps for solving the ODE (ordinary differential equation). In our paper, we utilized a single-step approach (`time_step=1`). While increasing `time_step` generally enhances quality, in our case the improvement was not significantly noticeable. |
| `--ode_method` | Choose between `euler` and `midpoint`. The `midpoint` method improves performance but doubles the NFE (number of function evaluations). Recommendation: despite the increase in NFE, we recommend the `midpoint` method for better performance. Note: the choice of `ode_method` is independent of the training settings. |
| `--cfm_method` | Sets the conditional probability path. In our paper, we used the path `independent_cfm_adaptive`. Other available options include `basic_cfm` (https://arxiv.org/abs/2210.02747) and `independent_cfm_constant` (https://arxiv.org/abs/2302.00482). |
| `--sigma` | Influences the path setting. Ensure you use the same value for `sigma` as was used during training. |
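To make the `time_step`/`ode_method` trade-off concrete, here is a minimal solver sketch; the vector-field network `vf(x, t)` is an illustrative signature, not the repository's actual interface.

```python
# Fixed-step ODE integration from t=0 to t=1 for a learned vector
# field vf(x, t). Euler costs 1 NFE per step; midpoint costs 2, which
# is why midpoint doubles the NFE for the same number of steps.
import torch

def solve_ode(vf, x0, n_steps=1, method="midpoint"):
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        if method == "euler":
            x = x + dt * vf(x, t)                 # 1 evaluation
        else:
            x_mid = x + 0.5 * dt * vf(x, t)       # evaluation 1
            x = x + dt * vf(x_mid, t + 0.5 * dt)  # evaluation 2
    return x
```

With `time_step=1`, `euler` uses a single network evaluation while `midpoint` uses two.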
- add base training code
- add requirements.txt
- upload pre-trained checkpoint for independent_cfm_adaptive
- upload pre-trained checkpoint for basic_cfm
- optimize the training speed
This implementation was developed based on the following repositories:
- Voicebox (unofficial PyTorch implementation): https://github.com/lucidrains/voicebox-pytorch.git (for the architecture backbone)
- Fre-painter: https://github.com/FrePainter/code.git (for the audio super-resolution implementation)
- TorchCFM: https://github.com/atong01/conditional-flow-matching.git (for the CFM logic)
- BigVGAN: https://github.com/NVIDIA/BigVGAN.git (for the pre-trained vocoder)
- NU-Wave 2: https://github.com/maum-ai/nuwave2.git (for data processing)