fix: tokenization of special characters: by antoine-lizee · Pull Request #850 · abetlen/llama-cpp-python

antoine-lizee · 2023-10-30T11:10:24Z

It should behave like llama.cpp, where most out of the box usages treat special characters accordingly. See #838 (comment) for more details.

I checked that with this fix, the vanilla call to llm.create_completion(temperature=0) leads to exactly the same results for a simple chat prompt than when using ./main --temp 0 from llama.cpp - which it didn't before.

I changed the behaviour also for the embeddings and the LlamaTokenizer. I'm missing context so might be wrong on those, but I figured it would be good to be consistent.

It should behave like llama.cpp, where most out of the box usages treat special characters accordingly

antoine-lizee · 2023-10-30T11:17:40Z

This should also make Chat Templates work properly ( #711 ) provided that we update a few of them with the eos in the right place (eg: </s> for llama2). Should solve #801, may address #800?

antoine-lizee · 2023-11-01T20:39:19Z

@abetlen In case you missed this.

fourdim · 2023-11-01T21:04:26Z

What about removing the empty test.py file?

abetlen · 2023-11-01T23:38:46Z

@antoine-lizee looks good, I'm slightly hesistant to change the default behaviour of the completion function, would it be sufficient to only do this for chat_completion?

fourdim · 2023-11-01T23:47:45Z

Nope, that will not be sufficient. In my case, I'm infilling codes using bigcode/starcoder.
It has special tokens <fim_prefix>, <fim_suffix>, <fim_middle> to guide starcoder infilling the code in the middle rather than the normal completion.
If we set special to False, the model only outputs something random.

abetlen · 2023-11-02T01:30:05Z

I'll go ahead and merge this in as is for now, should have time in the next week to address any issues if this causes breaking changes.

@antoine-lizee thank you for the contribution!

It should behave like llama.cpp, where most out of the box usages treat special characters accordingly

* Add low-level batching notebook * fix: tokenization of special characters: (#850) It should behave like llama.cpp, where most out of the box usages treat special characters accordingly * Update CHANGELOG * Cleanup * Fix runner label * Update notebook * Use llama_decode and batch api * Support logits_all parameter --------- Co-authored-by: Antoine Lizee <[email protected]>

fix: tokenization of special characters:

8c7b4c1

It should behave like llama.cpp, where most out of the box usages treat special characters accordingly

antoine-lizee mentioned this pull request Oct 30, 2023

Error with special tokens tokenization #838

Closed

abetlen merged commit 47ca05a into abetlen:main Nov 2, 2023

abetlen pushed a commit that referenced this pull request Nov 2, 2023

fix: tokenization of special characters: (#850)

4d4e0f1

It should behave like llama.cpp, where most out of the box usages treat special characters accordingly

abetlen mentioned this pull request Nov 2, 2023

0.2.9 broke the bos/eos/sys handling for chat sequences #800

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: tokenization of special characters:#850

fix: tokenization of special characters:#850
abetlen merged 1 commit intoabetlen:mainfrom
antoine-lizee:tokenisation-special-characters

antoine-lizee commented Oct 30, 2023 •

edited

Loading

Uh oh!

antoine-lizee commented Oct 30, 2023 •

edited

Loading

Uh oh!

antoine-lizee commented Nov 1, 2023

Uh oh!

fourdim commented Nov 1, 2023

Uh oh!

abetlen commented Nov 1, 2023

Uh oh!

fourdim commented Nov 1, 2023 •

edited

Loading

Uh oh!

abetlen commented Nov 2, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

antoine-lizee commented Oct 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoine-lizee commented Oct 30, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoine-lizee commented Nov 1, 2023

Uh oh!

fourdim commented Nov 1, 2023

Uh oh!

abetlen commented Nov 1, 2023

Uh oh!

fourdim commented Nov 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abetlen commented Nov 2, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

antoine-lizee commented Oct 30, 2023 •

edited

Loading

antoine-lizee commented Oct 30, 2023 •

edited

Loading

fourdim commented Nov 1, 2023 •

edited

Loading