-
Notifications
You must be signed in to change notification settings - Fork 211
adds fields for sequence logic in ActivationsStore #492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adds fields for sequence logic in ActivationsStore #492
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #492 +/- ##
==========================================
- Coverage 85.65% 85.55% -0.10%
==========================================
Files 28 28
Lines 3568 3593 +25
Branches 443 448 +5
==========================================
+ Hits 3056 3074 +18
- Misses 331 336 +5
- Partials 181 183 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| if self.prepend_bos and not self.exclude_bos_between_sequences: | ||
| sequence_separator_token_id = bos_token_id | ||
|
|
||
| yield from concat_and_batch_sequences( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to adapt the concat_and_batch_sequences function to handle disable_concat_sequences, so then the pretokenize runner could also gain this capability too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I've added disable_concat_sequences for concat_and_batch_sequences() and PretokenizeRunnerConfig.
sae_lens/config.py
Outdated
| store_batch_size_prompts: int = 32 | ||
| seqpos_slice: tuple[int | None, ...] = (None,) | ||
| disable_concat_sequences: bool = False | ||
| exclude_bos_between_sequences: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have the same options both here and in pretokenize_runner? I could see an argument that the pretokenize_runner arguments are too confusing / complicated if it's best to stick with just bos here, but can also see it being nice if we have the same options and thus same capabilities regardless of pretokenizing or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you asking whether we should add exclude_bos_between_sequences to PretokenizeRunnerConfig, or change exclude_bos_between_sequences to be like PretokenizeRunnerConfig.sequence_separator_token?
I think we should do the latter. If that sounds good, I can make the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was thinking the latter, it is sort of strange these have different options, but can see the argument that maybe the pretokenize runner version is too complex if we want to keep them separate. It looks like the 6.0 release will have a lot of breaking changes so if we're going to unify these now is probably a good time to do it.
| [3, 4, 5, 6, 7], | ||
| [10, 11, 12, 13, 14], | ||
| ] | ||
| assert batches.tolist() == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
chanind
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
@chanind I realized I didn't change the logic for I'm thinking I should add And then for Does that sound good to you? |
|
Sounds reasonable! |
e1270c1 to
f0b2a2a
Compare
|
Because of jbloomAus/SAEDashboard#46 (comment), I've actually made changes so |
|
You should be able to check if a field is explicitly set on |
|
In the case of |
…rch#492) * adds fields for sequence logic in ActivationsStore * adds docstrings to LanguageModelSAERunnerConfig * groups params for ActivationsStore together * adds disable_concat_sequences to concat_and_batch_sequences() * adds disable_concat_sequences to PretokenizedDatasetMetadata * replaces exclude_bos_between_sequences with sequence_separator_token * updates test name * fixes tests * adds params to ActivationsStore.from_sae() * adds fields to SAEMetdata
Description
Adds a field, where if the value is
True, excludes the BOS token between concatenated sequences, and a field, where if the value isTrue, disables concatenating sequences and ignores sequences shorter than the context size. Gemma 2 wasn't trained on BOS tokens and we want to match this during training. We can do this usingPretokenizeRunner, but not without it.Fixes #472
Type of change
Please delete options that are not relevant.
Checklist:
You have tested formatting, typing and tests
make check-cito check format and linting. (you can runmake formatto format code if needed.)Performance Check.
If you have implemented a training change, please indicate precisely how performance changes with respect to the following metrics:
Please links to wandb dashboards with a control and test group.