
Conversation

@ghost ghost commented Nov 17, 2024

Very optional, very out of the way, very explicit warnings about how badly it can go, ample room for comfortable configuration ({{random}} on prefills should work out of the box, which is the most useful usecase for {{random}}, PHIs and A/Ns take minor configuration).

Checklist:

@ghost ghost (Author) commented Nov 17, 2024

OpenRouter support added as well. To clarify, the 2 in the code is supposed to be from the cache immediately before the previous message (if available) regardless of what's going on with injections at depth (which should all presumably have user role for Claude?), hence why depth is strict role switches (unless I'm doing something absurdly stupid here, always possible).
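For readers following along, the "strict role switches" idea can be sketched like this (a minimal illustration with hypothetical names, not the PR's actual code): walk backwards from the end of the chat, count each change of role as one level of depth, and attach the `cache_control` breakpoint to the last content block of the message where the count matches.

```js
// Illustrative sketch only. Counts "depth" as strict role switches from the
// end of the chat and attaches a cache breakpoint there. Anthropic expects
// cache_control on a content block, not on the message object itself.
function placeCacheControlAtDepth(messages, depth) {
    let switches = 0;
    for (let i = messages.length - 1; i >= 0; i--) {
        // A role switch is counted when this message's role differs from the next one's.
        if (i < messages.length - 1 && messages[i].role !== messages[i + 1].role) {
            switches++;
        }
        if (switches === depth) {
            const content = messages[i].content;
            if (Array.isArray(content)) {
                content[content.length - 1].cache_control = { type: 'ephemeral' };
            } else {
                // Promote a plain string to a content array so the block can carry the marker.
                messages[i].content = [{ type: 'text', text: content, cache_control: { type: 'ephemeral' } }];
            }
            return i; // index of the message that received the breakpoint
        }
    }
    return -1; // chat too short for the requested depth
}
```

With depth 0 this marks the last message; depth 1 marks the message just before the most recent role switch, regardless of how many same-role messages sit in between.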

@cloak1505 cloak1505 (Contributor) commented Nov 18, 2024

> (which should all presumably have user role for Claude?)

I doubt that's a requirement. It's on user messages in the docs because, naturally, the last message in a request is a user message (a solo chat being the typical use case). The user sends on turn 1, then on turn 2, and so on; the current and previous breakpoints just happen to land on user messages. You get the idea.

> hence why depth is strict role switches (unless I'm doing something absurdly stupid here, always possible)

https://docs.anthropic.com/en/release-notes/api#october-8th-2024
Anthropic recently loosened restrictions to allow consecutive same-role messages, but they say those messages will be combined into a single message, which makes me worry that consecutive assistant messages will break the cache. If they do, then we'll have to default to caching only the user messages to be safe. If not, then caching assistant messages will help group chats, where most messages come from various characters.

Also, we are allowed up to 4 breakpoints, so we should use all 4 (edit: 4 if enableSystemPromptCache is off, otherwise 3... uh, 2). This would allow users to edit up to 7 messages back (after the 4th-last user turn) and keep the cache.

#2693

My original idea was the system prompt + the last 3 user messages, which would allow you to restart the chat, assuming the system prompt is at least 1024 tokens.
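That budget arithmetic can be sketched roughly as follows (illustrative only; the function and field names are made up, and this assumes each breakpoint goes on the last content block of the chosen message):

```js
// Hypothetical sketch: spend one of Anthropic's 4 cache breakpoints on the
// system prompt and the remainder on the most recent user messages.
function markBreakpoints(systemBlocks, messages, budget = 4) {
    // One breakpoint on the final system block.
    systemBlocks[systemBlocks.length - 1].cache_control = { type: 'ephemeral' };
    let used = 1;
    // Walk backwards, marking user messages until the budget runs out.
    for (let i = messages.length - 1; i >= 0 && used < budget; i--) {
        if (messages[i].role !== 'user') continue;
        const blocks = messages[i].content;
        blocks[blocks.length - 1].cache_control = { type: 'ephemeral' };
        used++;
    }
    return used; // breakpoints actually spent
}
```

In a short chat the function simply spends fewer breakpoints, which matches how a fresh chat would still get the system-prompt breakpoint on its own.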

@ghost ghost (Author) commented Nov 18, 2024

I very intentionally only used 2 breakpoints to avoid breaking the system prompt caching option (which uses up to 2 breakpoints).

EDIT: Anyway, the current setup is entirely rational. It's customizable, and it hits good caches between swipes and between messages, assuming people aren't going back and editing the chat history. It's extremely well optimized for 90% of use cases other than swiping 50 times (and still optimized for that).

EDIT 2: Exclusively caching on user messages might be a good heuristic, I'll sleep on it. Group chats were always a mess and will remain a mess and I refuse to waste more than 5 neurons on them.

@cloak1505 cloak1505 (Contributor) commented
System prompt caching uses 2? My bad. But where and why?

@ghost ghost (Author) commented Nov 18, 2024

```js
convertedPrompt.systemPrompt[convertedPrompt.systemPrompt.length - 1]['cache_control'] = { type: 'ephemeral' };

requestBody.tools[requestBody.tools.length - 1]['cache_control'] = { type: 'ephemeral' };
```

One breakpoint on the system prompt and another on the tools. It doesn't NEED both, but if I were using a third breakpoint, I'd just place it on the prefill to optimize swipes all the way (with a third configuration option).

EDIT: To clarify, the prefill breakpoint would be more of a "might as well" thing, because heuristically your prefill shouldn't be big enough for caching to make a difference, and it's not worth breaking a feature that already exists.
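If that third breakpoint were ever added, the placement might look like this (purely illustrative; not code from this PR):

```js
// Hypothetical: a prefill is a trailing assistant message, so a swipe-friendly
// breakpoint would sit on its last content block.
function markPrefillBreakpoint(messages) {
    const last = messages[messages.length - 1];
    if (!last || last.role !== 'assistant') return false;
    const blocks = Array.isArray(last.content)
        ? last.content
        : (last.content = [{ type: 'text', text: last.content }]);
    blocks[blocks.length - 1].cache_control = { type: 'ephemeral' };
    return true;
}
```

Returning `false` when there is no prefill keeps the caller free to spend that breakpoint elsewhere.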

@ghost ghost (Author) commented Nov 18, 2024

[Image: a funny drawing to explain the intended cache hits.]

```js
    bodyParams['route'] = 'fallback';
}

let cachingAtDepth = getConfigValue('claude.cachingAtDepth', -1);
```
A Member commented on the diff above:

I'd maybe put that into a separate function (i.e. in the prompt-converters module) because it makes the endpoint harder to read.

@ghost ghost (Author) commented Nov 18, 2024

Done. I also edited the Anthropic code so it doesn't rely on ST's current squashing behavior for messages. Anything else?

EDIT: by "squashing" I mean "flattening the content arrays and/or just intentionally always putting everything in a single content array".
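The "squashing" described here can be illustrated roughly like so (a simplified sketch, not ST's actual implementation):

```js
// Simplified sketch: flatten a message's content array into a single string,
// which is what "squashing" refers to above. Non-text blocks are dropped
// in this toy version.
function squashContent(message) {
    if (Array.isArray(message.content)) {
        message.content = message.content
            .filter(block => block.type === 'text')
            .map(block => block.text)
            .join('\n');
    }
    return message;
}
```

Code that assumes every message arrives pre-squashed like this would misplace a `cache_control` marker on multi-block content, which is presumably why the PR stopped relying on it.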

@Cohee1207 Cohee1207 (Member) left a comment

I poked at it (with N = 0) and it appears to be functional in the ideal circumstances. Can't say any more than that.

@Cohee1207 Cohee1207 merged commit 54db498 into SillyTavern:staging Nov 18, 2024
@ghost ghost (Author) commented Nov 18, 2024

Thx. Sorry about the eslint thing.

@Wolfsblvt Wolfsblvt added 🟨 ⬤⬤⬤○○ [PR][🎯Auto-applied] [Medium]100-500 lines changed 🏭 Backend Changes [PR] Contains changes to the backend and/or API 🤖 API / Model [ISSUE][PR] Related to specific APIs or Models ⚙️ config.yaml [ISSUE][PR] Relates to changes to the config.yaml labels Mar 22, 2025