Skip to content

HF interface#2116

Merged
mravanelli merged 314 commits into
speechbrain:unstable-v0.6from
mhn226:HF_interface
Oct 11, 2023
Merged

HF interface#2116
mravanelli merged 314 commits into
speechbrain:unstable-v0.6from
mhn226:HF_interface

Conversation

@mhn226

@mhn226 mhn226 commented Aug 10, 2023

Copy link
Copy Markdown
Collaborator

This PR simplifies the #1596 PR following @TParcollet 's ideas.
Briefly, we create an interface class called HuggingFaceTransformer which only implements several esensial methods, for example, modeling loading using HuggingFace's Auto Classes (https://huggingface.co/docs/transformers/model_doc/auto).
The users would need to customize their own HF's class which inherits HuggingFaceTransformer. The adapted huggingface_wav2vec.py and huggingface_whisper.py would serve as examples of how to make such classes.

@mravanelli mravanelli added the enhancement New feature or request label Aug 10, 2023

@TParcollet TParcollet left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! I will request ideas from @Adel-Moumen and @pplantinga here. I think I like this easy to use abstraction. The main remaining question here being: do we put everything in a single file called HuggingfaceTransformers.py or do we create a directory with multiple files, with each file containing our implementation of a model? Also we have to think about one thing: dropping the wav2vec2 in the naming of our implementation, because it also works with Hubert and wavlm. So maybe something like DefaultHuggingfaceTransformer? wdyt @Adel-Moumen @pplantinga and @mravanelli

Comment thread speechbrain/lobes/models/huggingface.py Outdated


class HuggingFaceTransformer(nn.Module):
"""This lobe provides AutoClass architecture loading into SpeechBrain modules.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe slightly more precise about what it means: being able to load any model from HuggingFace transformer.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to give information about how to use this class as well. Like it gives an ensemble of functions that the user must override by inheriting from this class to support any HuggingFace model.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread speechbrain/lobes/models/huggingface.py Outdated
tokenizer=False,
for_pretraining=False,
freeze=False,
sampling_rate=16000,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the sampling_rate here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it there because whisper puts it:

feature_extractor = WhisperFeatureExtractor.from_pretrained(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be solved by passing a **kwarg instead. This is to avoid having 1 billion arguments once we have many model implemented

Comment thread speechbrain/lobes/models/huggingface.py Outdated
err_msg = f"{path} does not contain a .bin or .ckpt checkpoint !"
raise FileNotFoundError(err_msg)

def _modify_state_dict_partial_fn(self):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this function for?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it for modifying the state_dict when needed. For example with wav2vec, loading sb model would require some modifications to be compatible (

)
This should be overwritten by users depending on the model in use.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the partial_fn and add a description to explain what it is for?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread speechbrain/lobes/models/huggingface.py Outdated
return config


def make_padding_masks(src, wav_len=None, pad_idx=0):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have a function in SpeechBrain to do this already? I believe we do, maybe in util, or attention or somewhere.

Comment thread speechbrain/lobes/models/huggingface.py Outdated
return src_key_padding_mask


def make_masks(src, wav_len=None, pad_idx=0):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took these two from huggingface_wav2vec.py

@mhn226

mhn226 commented Aug 11, 2023

Copy link
Copy Markdown
Collaborator Author

Thanks for the work! I will request ideas from @Adel-Moumen and @pplantinga here. I think I like this easy to use abstraction. The main remaining question here being: do we put everything in a single file called HuggingfaceTransformers.py or do we create a directory with multiple files, with each file containing our implementation of a model? Also we have to think about one thing: dropping the wav2vec2 in the naming of our implementation, because it also works with Hubert and wavlm. So maybe something like DefaultHuggingfaceTransformer? wdyt @Adel-Moumen @pplantinga and @mravanelli

what is default huggingface transformer exactly? I agree that "wav2vec" is not representative for hubert and wavlm, but DefaultHuggingfaceTransformer is a bit too much? :D

@TParcollet

Copy link
Copy Markdown
Collaborator

Well, I have no idea actually, trying to build a name out of the fact that a lot of models share the same default encoder structure, and this is why they can use the same class.

@Adel-Moumen

Adel-Moumen commented Aug 11, 2023

Copy link
Copy Markdown
Collaborator

Thanks for the work! I will request ideas from @Adel-Moumen and @pplantinga here. I think I like this easy to use abstraction. The main remaining question here being: do we put everything in a single file called le files, with each file containing our implementation of a model? Also we have to think about one thing: dropping the wav2vec2 in the naming of our implementation, because it also works with Hubert and wavlm. So maybe something like DefaultHuggingfaceTransformer? wdyt @Adel-Moumen @pplantinga and @mravanelli

I like the idea of having a new folder (e.g. called huggingface_transformers) with multiple files. We need to separate each model. So, having wav2vec.py, wavlm.py, etc...

Why should we drop the wav2vec2 naming ? We could just leave it as it is and then for wavlm.py create a new class which inherits from the class in wav2vec.py (e.g. WavLM(Wav2vec2)). Even though we are not going to change anything (e.g. fwd etc), but from an end user perspective, it could be cleaner like that.

@mhn226

mhn226 commented Aug 11, 2023

Copy link
Copy Markdown
Collaborator Author

Thanks for the work! I will request ideas from @Adel-Moumen and @pplantinga here. I think I like this easy to use abstraction. The main remaining question here being: do we put everything in a single file called le files, with each file containing our implementation of a model? Also we have to think about one thing: dropping the wav2vec2 in the naming of our implementation, because it also works with Hubert and wavlm. So maybe something like DefaultHuggingfaceTransformer? wdyt @Adel-Moumen @pplantinga and @mravanelli

I like the idea of having a new folder (e.g. called huggingface_transformers) with multiple files. We need to separate each model. So, having wav2vec.py, wavlm.py, etc...

Why should we drop the wav2vec2 naming ? We could just leave it as it is and then for wavlm.py create a new class which inherits from the class in wav2vec.py (e.g. WavLM(Wav2vec2)). Even though we are not going to change anything (e.g. fwd etc), but from an end user perspective, it could be cleaner like that.

I also like this idea of making a huggingface_transformers directory which hosts separate files for each HF's model. One question, huggingface_transformers should be inside or outside the transformers directory?

@mhn226

mhn226 commented Aug 14, 2023

Copy link
Copy Markdown
Collaborator Author

I added a huggingface_transformers directory inside speechbrain/lobes/models which hosts the interface huggingface.py, wav2vec class huggingface_wav2vec.py, and whisper class huggingface_whisper.py.
This would slightly affects some recipes which were modified accordingly.
There should be a huggingface_wavlm.py and huggingface_hubert.py added if we follow @Adel-Moumen and @TParcollet 's suggestion. They should technically resemble huggingface_wav2vec.py. I will soon add them.

self.tokenizer = AutoTokenizer.from_pretrained(source, **kwarg)


def make_padding_masks(src, wav_len=None, pad_idx=0):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you double-check that we don't have this somewhere already? I really think we do.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find it anywhere else rather than in huggingface_wav2vec.py
It looks the same as the make_masks method though. So I removed make_masks.

@TParcollet TParcollet left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went again throughout the code. I like the current state of it. I think it is close to be ready. I will just ask @Adel-Moumen to run recipe testing to make sure that nothing breaks. Also @mhn226 could you please modify all the tutorials? You will need to manually check unfortunately ...

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

I think we should change the name of huggingface_whisper to only whisper. The class is already located in speechbrain/lobes/models/huggingface_transformers so it is already obvious that whisper is coming from HuggingFace... This comment applies to all models including wav2vec.

Furthermore, @TParcollet, you said you wanted to separate WavLM from Hubert etc... this is not the current state of the code. We are still creating a WavLM obj through the huggingface_wav2vec class! What about creating new files for them ? (e.g. wavlm.py which is inheriting from wav2vec etc...)

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

Note: when merging, please select merge with git stash!

@mhn226

mhn226 commented Aug 15, 2023

Copy link
Copy Markdown
Collaborator Author

I think we should change the name of huggingface_whisper to only whisper. The class is already located in speechbrain/lobes/models/huggingface_transformers so it is already obvious that whisper is coming from HuggingFace... This comment applies to all models including wav2vec.

Furthermore, @TParcollet, you said you wanted to separate WavLM from Hubert etc... this is not the current state of the code. We are still creating a WavLM obj through the huggingface_wav2vec class! What about creating new files for them ? (e.g. wavlm.py which is inheriting from wav2vec etc...)

I somehow don't have problems with the file names because huggingface.py would cease to nothing.py following your logic, unless you have another suggestion for it?
I have huggingface_wavlm.py and huggingface_hubert.py already. I will push them soon!

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

I think we should change the name of huggingface_whisper to only whisper. The class is already located in speechbrain/lobes/models/huggingface_transformers so it is already obvious that whisper is coming from HuggingFace... This comment applies to all models including wav2vec.
Furthermore, @TParcollet, you said you wanted to separate WavLM from Hubert etc... this is not the current state of the code. We are still creating a WavLM obj through the huggingface_wav2vec class! What about creating new files for them ? (e.g. wavlm.py which is inheriting from wav2vec etc...)

I somehow don't have problems with the file names because huggingface.py would cease to nothing.py following your logic, unless you have another suggestion for it? I have huggingface_wavlm.py and huggingface_hubert.py already. I will push them soon!

No, it could be a special case or could be renamed to something like base_transformers.py or base_huggingface.py idk. I think this is really redundant to name them huggingface_* while there are already in the huggingface folder. Everything gets longer for 0 amount of informations. In a yaml file its 12 chars in more. What do you think @TParcollet @mravanelli ?

@mhn226

mhn226 commented Aug 16, 2023

Copy link
Copy Markdown
Collaborator Author

I think we should change the name of huggingface_whisper to only whisper. The class is already located in speechbrain/lobes/models/huggingface_transformers so it is already obvious that whisper is coming from HuggingFace... This comment applies to all models including wav2vec.
Furthermore, @TParcollet, you said you wanted to separate WavLM from Hubert etc... this is not the current state of the code. We are still creating a WavLM obj through the huggingface_wav2vec class! What about creating new files for them ? (e.g. wavlm.py which is inheriting from wav2vec etc...)

I somehow don't have problems with the file names because huggingface.py would cease to nothing.py following your logic, unless you have another suggestion for it? I have huggingface_wavlm.py and huggingface_hubert.py already. I will push them soon!

No, it could be a special case or could be renamed to something like base_transformers.py or base_huggingface.py idk. I think this is really redundant to name them huggingface_* while there are already in the huggingface folder. Everything gets longer for 0 amount of informations. In a yaml file its 12 chars in more. What do you think @TParcollet @mravanelli ?

Ok I see your point. I'll modify the model files as your suggestion. I'll keep huggingface.py as it is because "base" would create undesirable confusion with the model size in my opinion.

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

I think we should change the name of huggingface_whisper to only whisper. The class is already located in speechbrain/lobes/models/huggingface_transformers so it is already obvious that whisper is coming from HuggingFace... This comment applies to all models including wav2vec.
Furthermore, @TParcollet, you said you wanted to separate WavLM from Hubert etc... this is not the current state of the code. We are still creating a WavLM obj through the huggingface_wav2vec class! What about creating new files for them ? (e.g. wavlm.py which is inheriting from wav2vec etc...)

I somehow don't have problems with the file names because huggingface.py would cease to nothing.py following your logic, unless you have another suggestion for it? I have huggingface_wavlm.py and huggingface_hubert.py already. I will push them soon!

No, it could be a special case or could be renamed to something like base_transformers.py or base_huggingface.py idk. I think this is really redundant to name them huggingface_* while there are already in the huggingface folder. Everything gets longer for 0 amount of informations. In a yaml file its 12 chars in more. What do you think @TParcollet @mravanelli ?

Ok I see your point. I'll modify the model files as your suggestion. I'll keep huggingface.py as it is because "base" would create undesirable confusion with the model size in my opinion.

Yes. Good.

@mhn226 mhn226 changed the base branch from develop to unstable-v0.6 August 17, 2023 15:28
@mhn226

mhn226 commented Sep 29, 2023

Copy link
Copy Markdown
Collaborator Author

Hi, so this PR passed the recipe tests (thanks to @Adel-Moumen for doing the tests). The remaining problem is that it doesn't seem to give the same scores when loading some pre-trained (old) models we published on dropbox. For example, this model for Italian from this recipe gave super weird WER. And it's the same if I do the test with the develop branch. Retraining the model from scratch gave the same outcome for both this PR and the develop branch.

With this recipe, Italian model from this recipe, we have relatively the same scores as what we published (tested with both this PR and the develop branch). So it's good here.

But wav2vec+CTC doesn't work with the french model. same problem!

So I don't know how we can conclude? I personally think that if we see the same scores between this PR and the develop branch, we can say it is OK? The underlying problem doesn't come from the PR? Can you give your thoughts? @Adel-Moumen @TParcollet @mravanelli

Comment thread speechbrain/lobes/models/huggingface_transformers/whisper.py

@mhn226 mhn226 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adapt changes to comments from @TParcollet and @Adel-Moumen

Comment thread speechbrain/lobes/models/huggingface_transformers/wav2vec.py
Comment thread speechbrain/lobes/models/huggingface.py Outdated


class HuggingFaceTransformer(nn.Module):
"""This lobe provides AutoClass architecture loading into SpeechBrain modules.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread speechbrain/lobes/models/huggingface.py Outdated
err_msg = f"{path} does not contain a .bin or .ckpt checkpoint !"
raise FileNotFoundError(err_msg)

def _modify_state_dict_partial_fn(self):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

self.model.train()

def _from_pretrained(
self, source, config, model, save_path, cache_dir,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

logger = logging.getLogger(__name__)


class HuggingFaceHuBERT(HuggingFaceWav2Vec2):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name changed

def __init__(self, hub, num_layers, layernorm=False):
super().__init__()
self.encoder = AutoModel.from_pretrained(hub, output_hidden_states=True)
super().__init__(source=hub, save_path=".")

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread speechbrain/lobes/models/huggingface_transformers/wavlm.py
Comment thread speechbrain/lobes/models/huggingface_transformers/whisper.py
Comment thread speechbrain/lobes/models/huggingface_transformers/huggingface.py Outdated
self.tokenizer = AutoTokenizer.from_pretrained(source, **kwarg)


def make_padding_masks(src, wav_len=None, pad_idx=0):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find it anywhere else rather than in huggingface_wav2vec.py
It looks the same as the make_masks method though. So I removed make_masks.

@mravanelli

Copy link
Copy Markdown
Collaborator

Hi @mhn226, I suggest merging the latest version of develop. We recently merge a new recipes that used HF to get a pretrained version of GPT2 (see here). Do you think we should integrate it here?

@mhn226

mhn226 commented Oct 5, 2023

Copy link
Copy Markdown
Collaborator Author

Hi @mhn226, I suggest merging the latest version of develop. We recently merge a new recipes that used HF to get a pretrained version of GPT2 (see here). Do you think we should integrate it here?

Hi Mirco, I also plan to implement text-based pretrained models like mbart, labse, nllb, etc. too.
What are your opinions about merging to develop? @TParcollet @Adel-Moumen

@mravanelli

Copy link
Copy Markdown
Collaborator

Note that @poonehmousavi is working on an HF interface to integrate Llama2. Maybe you guys can talk to each other to better coordinate your efforts.

Note: we need to merge in unstable. We just need to update the unstable branch and the code in your PR with the latest develop such that you have have the new GPT2 recipe

@mhn226

mhn226 commented Oct 5, 2023

Copy link
Copy Markdown
Collaborator Author

Note that @poonehmousavi is working on an HF interface to integrate Llama2. Maybe you guys can talk to each other to better coordinate your efforts.

Note: we need to merge in unstable. We just need to update the unstable branch and the code in your PR with the latest develop such that you have have the new GPT2 recipe

OK I see. Thanks for the clarification!

@poonehmousavi

Copy link
Copy Markdown
Collaborator

Note that @poonehmousavi is working on an HF interface to integrate Llama2. Maybe you guys can talk to each other to better coordinate your efforts.
Note: we need to merge in unstable. We just need to update the unstable branch and the code in your PR with the latest develop such that you have have the new GPT2 recipe

OK I see. Thanks for the clarification!

@mhn226 , The gpt recipe is already merged, it is not that much different from other HF models that we previously have. But it might be different slightly for LLAMA2. I will add you to the LLAMA2 PR when I feel more confident about it. Let me know if I could be of any help.

@mravanelli

Copy link
Copy Markdown
Collaborator

I merged the latest version of dev, which includes the GPT2 fine-tuning recipe implemented by @poonehmousavi. @mhn226, could you please revise it and integrate into the HF refactor?

@mhn226

mhn226 commented Oct 9, 2023

Copy link
Copy Markdown
Collaborator Author

I merged the latest version of dev, which includes the GPT2 fine-tuning recipe implemented by @poonehmousavi. @mhn226, could you please revise it and integrate into the HF refactor?

Hi Marco, yes I'm working on that. Thank you for the merging!

@mhn226 mhn226 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @poonehmousavi : I adapt your recipe to this HF interface PR. Could you please take a look at the changes, especially at my comments? Please leave some comments if needed.

P/s: I see that you're using directly the generate() method from HF. I don't have any problem with that but I wonder if we want to use our decoding method in the future as @Adel-Moumen has been making wonderful effort in the scoring and searching part?


# gpt model
gpt_model: !new:speechbrain.lobes.models.huggingface_gpt.HuggingFaceGPT
gpt_model: !new:speechbrain.lobes.models.huggingface_transformers.gpt.GPT

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poonehmousavi : this got changed because we created a dir for all HF models (huggingface_transformers), in which gpt.py is located. Long story but we omitted "huggingface" everywhere in the py file names and also the name of the model classes because we believe that the huggingface_transformers dir implicitly indicates the HF nature of these models.

)

# Load tokenizer and add special tokens
tokenizer = GPT2Tokenizer.from_pretrained(

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poonehmousavi : I removed these lines because the tokenizer will be automatically loaded by the GPT class.

tokenizer = GPT2Tokenizer.from_pretrained(
hparams["gpt_hub"], pad_token=None
)
tokenizer = hparams["gpt_model"].tokenizer

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poonehmousavi : here I call the tokenizer of the GPT model

logger = logging.getLogger(__name__)


class GPT(HFTransformersInterface):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poonehmousavi : This class defines the GPT model. Could you please complete the docstring? I think it lacks quite a lot of the parameters used by the class. I could do it but I think you're more familiar with the class.

self.top_k = top_k
self.top_p = top_p
self.num_beams = num_beams
self.early_stopping = early_stopping

@mhn226 mhn226 Oct 9, 2023

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poonehmousavi : I made eos_token_id a param because different gpt models might have different eos_token_id. This is especially true when a multilingual model is used.

self.num_beams = num_beams
self.early_stopping = early_stopping

self.load_tokenizer(source=source, pad_token=None, use_fast=False)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@poonehmousavi : here the tokenizer is loaded. "use_fast" is set False because I see that you specifically use GPT2Tokenizer instead of GPT2TokenizerFast which is loaded by default by AutoTokenizer.

@Adel-Moumen

Copy link
Copy Markdown
Collaborator

P/s: I see that you're using directly the generate() method from HF. I don't have any problem with that but I wonder if we want to use our decoding method in the future as @Adel-Moumen has been making wonderful effort in the scoring and searching part?

No. This is a different beam search :)

@mravanelli mravanelli self-requested a review October 11, 2023 14:29
@mravanelli

Copy link
Copy Markdown
Collaborator

Thank you @mhn226 and all the other contributors for this PR. I did some tests and everything LGTM. It significantly improves the way we fetch models from HuggingFace.

@mravanelli mravanelli merged commit bf40fb0 into speechbrain:unstable-v0.6 Oct 11, 2023
@mhn226 mhn226 mentioned this pull request Oct 19, 2023
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants