I am training ConvTasNet on the LibriMix train-100 dataset. Training works fine with the sep_noisy task, but it fails with the following error when I use the enh_single task:
Results from the following experiment will be stored in exp/train_convtasnet_3rd_causal
Stage 2: Training
/O/asteroid/asteroid/models/conv_tasnet.py:89: UserWarning: In causal configuration cumulative layer normalization (cgLN)or channel-wise layer normalization (chanLN) must be used. Changing cLN to cLN
warnings.warn(
/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python train.py --exp_dir exp/train_convtasnet_3rd_causal - ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W CUDAAllocatorConfig.h:30] Warning: expandable_segments not supported on this platform (function operator())
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type           | Params
---------------------------------------------
0 | model     | ConvTasNet     | 5.1 M
1 | loss_func | PITLossWrapper | 0
---------------------------------------------
5.1 M Trainable params
0 Non-trainable params
5.1 M Total params
20.202 Total estimated model params size (MB)
{'data': {'n_src': 2,
          'sample_rate': 8000,
          'segment': 3,
          'task': 'enh_single',
          'train_dir': 'data/wav8k/min/train-100',
          'valid_dir': 'data/wav8k/min/dev'},
 'filterbank': {'kernel_size': 16, 'n_filters': 512, 'stride': 8},
 'main_args': {'exp_dir': 'exp/train_convtasnet_3rd_causal', 'help': None},
 'masknet': {'bn_chan': 128,
             'hid_chan': 512,
             'mask_act': 'relu',
             'n_blocks': 8,
             'n_repeats': 3,
             'skip_chan': 128},
 'optim': {'lr': 0.001, 'optimizer': 'adam', 'weight_decay': 0.0},
 'positional arguments': {},
 'training': {'batch_size': 14,
              'early_stop': True,
              'epochs': 200,
              'half_lr': True,
              'num_workers': 4}}
Drop 0 utterances from 13900 (shorter than 3 seconds)
Drop 0 utterances from 13900 (shorter than 3 seconds)
Sanity Checking: | | 0/? [00:00<?, ?it/s]
Traceback (most recent call last):
File "O/asteroid/egs/librimix/ConvTasNet/train.py", line 146, in <module>
main(arg_dic)
File "O/asteroid/egs/librimix/ConvTasNet/train.py", line 112, in main
trainer.fit(system)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 987, in _run
results = self._run_stage()
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_stage
self._run_sanity_check()
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1060, in _run_sanity_check
val_loop.run()
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 128, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
out = next(self.iterators[0])
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
data = self._next_data()
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
return self._process_data(data)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
data.reraise()
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'source_2_path'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "O/asteroid/asteroid/data/librimix_dataset.py", line 106, in __getitem__
source_path = row[f"source_{i + 1}_path"]
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/series.py", line 1112, in __getitem__
return self._get_value(key)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/series.py", line 1228, in _get_value
loc = self.index.get_loc(label)
File "/home/ionotronics/.pyenv/versions/audio_tf/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
raise KeyError(key) from err
KeyError: 'source_2_path'
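The last frame inside asteroid (librimix_dataset.py, line 106) looks up `source_{i + 1}_path` in the metadata row once per source, so the error suggests that the row has no second source column while n_src is 2. Below is a minimal sketch of that lookup. It assumes, without having checked the CSV files, that the enh_single metadata lists only one source. The dict stands in for the pandas row, and `collect_source_paths` and all column values are hypothetical.

```python
def collect_source_paths(row, n_src):
    """Mimic the per-source lookup shown in the traceback."""
    return [row[f"source_{i + 1}_path"] for i in range(n_src)]


# Hypothetical metadata row with a single source column (assumption).
row = {
    "mixture_path": "mix_single/example.wav",
    "source_1_path": "s1/example.wav",
    "noise_path": "noise/example.wav",
}

# One source: the lookup succeeds.
print(collect_source_paths(row, n_src=1))

# Two sources: the second lookup fails with the same key as in the log.
try:
    collect_source_paths(row, n_src=2)
except KeyError as err:
    print("KeyError:", err)
```

If that assumption holds, the key that fails here is the same one reported above, which would point at the combination of n_src and task rather than at the training code.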
And here is my run.sh file:
#!/bin/bash
# Exit on error
set -e
set -o pipefail
# If you haven't generated LibriMix start from stage 0
# Main storage directory. You'll need disk space to store LibriSpeech, WHAM noises
# and LibriMix. This is about 500 Gb
storage_dir=O/asteroid/datasets
# After running the recipe a first time, you can run it from stage 3 directly to train new models.
# Path to the python you'll use for the experiment. Defaults to the current python
# You can run ./utils/prepare_python_env.sh to create a suitable python environment, paste the output here.
python_path=python
# Example usage
# ./run.sh --stage 3 --tag my_tag --task sep_noisy --id 0,1
# General
stage=0 # Controls from which stage to start
tag="" # Controls the directory name associated to the experiment
# You can ask for several GPUs using id (passed to CUDA_VISIBLE_DEVICES)
id=$CUDA_VISIBLE_DEVICES
out_dir=librimix # Controls the directory name associated to the evaluation results inside the experiment directory
# Network config
n_blocks=8 # Number of conv blocks in each repeat
n_repeats=3 # Number of repeats in the Conv-TasNet
mask_act=relu
# Training config
epochs=200
batch_size=14
num_workers=4
half_lr=yes
early_stop=yes
# Optim config
optimizer=adam
lr=0.001
weight_decay=0.
# Data config
sample_rate=8000
mode=min # max for val_acc, min for val_loss
n_src=2 # Number of voice sources in the speech
segment=3
task=enh_single # one of 'enh_single', 'enh_both', 'sep_clean', 'sep_noisy'
eval_use_gpu=1
# Need to --compute_wer 1 --eval_mode max to be sure the user knows all the metrics
# are for the all mode.
compute_wer=0
eval_mode=
. utils/parse_options.sh
sr_string=$(($sample_rate/1000))
suffix=wav${sr_string}k/$mode
if [ -z "$eval_mode" ]; then
  eval_mode=$mode
fi
train_dir=data/$suffix/train-100
valid_dir=data/$suffix/dev
test_dir=data/wav${sr_string}k/$eval_mode/test
if [[ $stage -le 0 ]]; then
  echo "Stage 0: Generating Librimix dataset"
  if [ -z "$storage_dir" ]; then
    echo "Need to fill in the storage_dir variable in run.sh to run stage 0. Exiting"
    exit 1
  fi
  . local/generate_librimix.sh --storage_dir $storage_dir --n_src $n_src
fi
if [[ $stage -le 1 ]]; then
  echo "Stage 1: Generating csv files including wav path and duration"
  . local/prepare_data.sh --storage_dir $storage_dir --n_src $n_src
fi
# Generate a random ID for the run if no tag is specified
uuid=$($python_path -c 'import uuid, sys; print(str(uuid.uuid4())[:8])')
if [[ -z ${tag} ]]; then
  tag=${uuid}
fi
expdir=exp/train_convtasnet_${tag}
mkdir -p $expdir && echo $uuid >> $expdir/run_uuid.txt
echo "Results from the following experiment will be stored in $expdir"
if [[ $stage -le 2 ]]; then
  echo "Stage 2: Training"
  mkdir -p logs
  CUDA_VISIBLE_DEVICES=$id $python_path train.py --exp_dir $expdir \
    --n_blocks $n_blocks \
    --n_repeats $n_repeats \
    --mask_act $mask_act \
    --epochs $epochs \
    --batch_size $batch_size \
    --num_workers $num_workers \
    --half_lr $half_lr \
    --early_stop $early_stop \
    --optimizer $optimizer \
    --lr $lr \
    --weight_decay $weight_decay \
    --train_dir $train_dir \
    --valid_dir $valid_dir \
    --sample_rate $sample_rate \
    --n_src $n_src \
    --task $task \
    --segment $segment | tee logs/train_${tag}.log
  cp logs/train_${tag}.log $expdir/train.log
  # Get ready to publish
  mkdir -p $expdir/publish_dir
  echo "librimix/ConvTasNet" > $expdir/publish_dir/recipe_name.txt
fi
if [[ $stage -le 3 ]]; then
  echo "Stage 3 : Evaluation"
  if [[ $compute_wer -eq 1 ]]; then
    if [[ $eval_mode != "max" ]]; then
      echo "Cannot compute WER without max mode. Start again with --stage 2 --compute_wer 1 --eval_mode max"
      exit 1
    fi
    # Install espnet if not installed
    if ! python -c "import espnet" &> /dev/null; then
      echo 'This recipe requires espnet. Installing requirements.'
      $python_path -m pip install espnet_model_zoo
      $python_path -m pip install jiwer
      $python_path -m pip install tabulate
    fi
  fi
  $python_path eval.py \
    --exp_dir $expdir \
    --test_dir $test_dir \
    --out_dir $out_dir \
    --use_gpu $eval_use_gpu \
    --compute_wer $compute_wer \
    --task $task | tee logs/eval_${tag}.log
  cp logs/eval_${tag}.log $expdir/eval.log
fi
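Since the script passes both --n_src and --task to train.py, one possible pre-flight check is to compare n_src against the number of `source_<k>_path` columns in the metadata CSV that the task uses. This is only a sketch: `count_source_columns` is a hypothetical helper, and the header below is an assumed example rather than the content of an actual file under data/wav8k/min/dev.

```python
import csv
import io
import re


def count_source_columns(csv_file):
    """Count header columns named source_<k>_path in an open CSV file."""
    header = next(csv.reader(csv_file))
    return sum(1 for name in header if re.fullmatch(r"source_\d+_path", name))


# Assumed example header standing in for a real metadata CSV.
example = io.StringIO("mixture_ID,mixture_path,source_1_path,noise_path,length\n")
n_available = count_source_columns(example)

n_src = 2  # the value set in run.sh
if n_src > n_available:
    print(f"n_src={n_src} but only {n_available} source column(s) found")
```

To run it on a real file, the `io.StringIO` object could be replaced with `open(path, newline="")`, where `path` is the metadata CSV for the chosen task.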
Could you please tell me whether there is an issue in my run.sh configuration?
Thanks,
Colin