
FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching

The official implementation of FLowHigh

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

Clone our repository

git clone https://github.com/jjunak-yun/FLowHigh_code.git
cd FLowHigh_code

Install the requirements

pip install -r requirements.txt

Data preparation

  • Download the VCTK dataset.
  • Remove speakers p280 and p315 from the dataset.
  • Create a train directory and a test directory, then split the dataset accordingly (a minimal split sketch follows this list).
  • Update the data_path in the configs/config.json file with the path to your newly created train directory.
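For reference, here is a minimal sketch of the split step, assuming the standard VCTK wav48 layout (one subdirectory per speaker). The paths and the test-speaker selection are hypothetical placeholders, not the paper's exact split:

import shutil
from pathlib import Path

VCTK_ROOT = Path("/PATH/VCTK/wav48")       # assumed layout: one folder per speaker
TRAIN_DIR = Path("/PATH/vctk/train")       # set this as data_path in configs/config.json
TEST_DIR = Path("/PATH/vctk/test")
EXCLUDED = {"p280", "p315"}                # removed, as described above
TEST_SPEAKERS = {"p360", "p361"}           # hypothetical test split; choose your own

for spk_dir in sorted(VCTK_ROOT.iterdir()):
    if not spk_dir.is_dir() or spk_dir.name in EXCLUDED:
        continue
    dest = TEST_DIR if spk_dir.name in TEST_SPEAKERS else TRAIN_DIR
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copytree(spk_dir, dest / spk_dir.name, dirs_exist_ok=True)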

Training

  • To adjust the training conditions, modify the configs/config.json file according to your preferences.
CUDA_VISIBLE_DEVICES=0 python train.py 

Pre-trained checkpoints

FLowHigh_indep_adaptive_400k : The main model from our paper, an audio super-resolution model that reconstructs low-resolution audio into high-resolution audio at 48 kHz.

FLowHigh_basic_400k : A variant that adopts the conditional probability path of basic flow matching; it likewise reconstructs low-resolution audio into high-resolution audio at 48 kHz.

Inference of audio

  • Prepare the checkpoint of the trained model.
  • Prepare a downsampled audio sample with a sampling rate lower than 48 kHz (e.g., 12 kHz, 16 kHz).
    Note: To match the experimental setup in our paper, downsample the audio with scipy.signal.resample_poly() (a minimal sketch follows the command below).
  • Run the following command:
CUDA_VISIBLE_DEVICES=0 python inference.py \
    --input_path {downsampled_audio_path} --output_path {save_output_audio_path} \
    --target_sampling_rate 48000 --up_sampling_method scipy --architecture='transformer' \
    --time_step 1 --ode_method={ode_solver} --cfm_method={cfm_path} --sigma 0.0001 \
    --model_path {model_checkpoint_path} \
    --n_layers 2 --n_heads 16 --dim_head 64 \
    --n_mels 256 --f_max 24000 --n_fft 2048 --win_length 2048 --hop_length 480 \
    --vocoder 'bigvgan' --vocoder_path='/PATH/vocoder/BIGVGAN/checkpoint/g_48_00850000' \
    --vocoder_config_path='/PATH/vocoder/BIGVGAN/config/bigvgan_48khz_256band_config.json'
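As noted above, the low-resolution input can be created with scipy.signal.resample_poly. A minimal sketch, using the soundfile package for I/O; the paths and the 16 kHz target are illustrative:

import soundfile as sf
from scipy.signal import resample_poly

# Downsample a 48 kHz recording to 16 kHz (scipy reduces 16000/48000 internally).
audio, sr = sf.read("/PATH/source_48khz.wav")         # expects sr == 48000
target_sr = 16000
lowres = resample_poly(audio, up=target_sr, down=sr)  # polyphase, anti-aliased resampling
sf.write("/PATH/downsampled_16khz.wav", lowres, target_sr)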
Parameter descriptions:

--time_step : The number of steps used to solve the ODE (ordinary differential equation). In our paper, we used a single-step approach (time_step=1). Increasing time_step generally enhances quality, but in our case the improvement was not significantly noticeable.

--ode_method : Choose between euler and midpoint. The midpoint method improves performance but doubles the NFE (number of function evaluations). Recommendation: despite the increase in NFE, we recommend the midpoint method for better performance. Note: the choice of ode_method is independent of the training settings. A minimal solver sketch follows this table.

--cfm_method : Sets the conditional probability path. In our paper, we used the independent_cfm_adaptive path. Other available options are basic_cfm (https://arxiv.org/abs/2210.02747) and independent_cfm_constant (https://arxiv.org/abs/2302.00482).

--sigma : The standard deviation used when constructing the conditional probability path. Use the same value for sigma as was used during training.
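To make the time_step/ode_method trade-off concrete, here is a generic sketch of Euler and midpoint sampling for a flow-matching model. It is not the repository's actual sampler; v stands in for the trained vector-field network:

import torch

def sample_ode(v, x, n_steps=1, method="euler"):
    # Integrate dx/dt = v(x, t) from t=0 to t=1.
    # Euler uses one function evaluation per step; midpoint uses two,
    # which is why it doubles the NFE for the same number of steps.
    dt = 1.0 / n_steps
    t = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_steps):
        if method == "euler":
            x = x + dt * v(x, t)
        else:  # midpoint
            x_mid = x + 0.5 * dt * v(x, t)
            x = x + dt * v(x_mid, t + 0.5 * dt)
        t = t + dt
    return x

# Single-step sampling as in the paper (time_step=1):
# x1 = sample_ode(v_net, x0, n_steps=1, method="midpoint")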

To-do list

  • add base training code
  • add requirements.txt
  • upload pre-trained checkpoint for independent_cfm_adaptive
  • upload pre-trained checkpoint for basic_cfm
  • optimize the training speed

References

This implementation was developed based on the following repository:
