
ConvTasNet pretrained huggingface model inference setup  #697

@Rodolfo-S


I'm trying to run inference with this pretrained ConvTasNet single-source enhancement model on Hugging Face, and I'm getting noticeably poor output.

I passed in an ~18.5 s, 16 kHz clean speech clip mixed with white Gaussian noise at -40 dB. The output seemed to have about the same SNR as the input, and the scaling ballooned well past +/-1 (max sample value around 1500). Additionally, the speech itself sounds slightly distorted.
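For reference, here's roughly how I created the mixture (a sketch; clean_speech.wav is a placeholder path, and I'm interpreting the -40 dB as noise RMS 40 dB below the speech RMS):

import numpy as np
import soundfile as sf

# Load the clean 16 kHz speech clip (placeholder filename).
clean, sr = sf.read("clean_speech.wav")

# White Gaussian noise 40 dB below the speech level (assumed interpretation).
speech_rms = np.sqrt(np.mean(clean ** 2))
noise = np.random.randn(len(clean)) * speech_rms * 10 ** (-40 / 20)

noisy_audio = (clean + noise).astype(np.float32)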

I should note that I also tried passing just the clean speech through the model and got similar results, as far as the added distortion goes.
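That clean-speech pass looked roughly like this (a sketch, assuming a direct call to the model without the overlap-add wrapper; clean is the array from the mixing snippet above):

import torch

model = torch.hub.load('mpariente/asteroid', 'conv_tasnet',
                       'JorisCos/ConvTasNet_Libri1Mix_enhsingle_16k')
model.eval()

# Asteroid models accept (batch, time) and return (batch, n_src, time).
with torch.no_grad():
    est = model(torch.from_numpy(clean[None, :]).float())

print(est.shape, est.abs().max())  # the distortion is audible even on this pass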

I'm trying to figure out whether I've configured everything correctly to run inference with LambdaOverlapAdd. I mostly used the "Process large audio files" notebook as a reference. Here's my code:

import torch
from asteroid.dsp.overlap_add import LambdaOverlapAdd

# Encoder parameters from the model config on the Hugging Face page.
kernel_size = 32
stride = 16

model = torch.hub.load('mpariente/asteroid', 'conv_tasnet',
                       'JorisCos/ConvTasNet_Libri1Mix_enhsingle_16k')
model.eval()

# Wrap the model for chunked inference over the long input.
continuous_nnet = LambdaOverlapAdd(
    nnet=model,
    n_src=1,
    window_size=kernel_size,
    hop_size=stride,
    window=None,
    reorder_chunks=False,
)

# LambdaOverlapAdd expects a (batch, channels, time) tensor.
in_tensor = torch.from_numpy(noisy_audio[None, None, :]).float()
with torch.no_grad():
    out_tensor = continuous_nnet(in_tensor)

out_wav = out_tensor.numpy().squeeze()

Here noisy_audio is the 1-D noisy speech signal as a float32 NumPy array, and window_size and hop_size were inferred from the config provided on the model's Hugging Face page.
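For completeness, this is roughly how I'm inspecting and writing the result (a sketch; the peak normalization is only there so the file is playable, not a fix):

import soundfile as sf

peak = abs(out_wav).max()
print("output peak:", peak)  # around 1500 here, far outside [-1, 1]

# Peak-normalize only to make the file playable.
sf.write("enhanced.wav", out_wav / max(peak, 1.0), 16000)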

Is there something I'm missing or doing wrong here?
