An LLM fine-tuning course and online conference for everything LLMs.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
Next up:
- Fine-Tuning Workshop 2
- Fine-Tuning Workshop 3
- Fine-Tuning Workshop 4
- Conference Talk: From prompt to model (with Kyle Corbitt)
- Conference Talk: Inspect, An OSS framework for LLM evals (with JJ Allaire)
- Conference Talk: Beyond the basics of Retrieval (Augmenting Generation) (with Ben Clavié)
Syllabus:
- Our Philosophy
- Start easy, step up the ladder of complexity slowly
- Shorten the development cycle
- What makes a good first product
- When to fine-tune
- How fine-tuning works
- When to use base model as is
- Practice with many example use cases
- Picking an LLM Use Case
Homework
List 5 LLM use cases you personally have or that you think are especially interesting.
Describe each use case in 1-2 sentences.
Then write 1-2 sentences describing whether it is best served with fine-tuning, RAG, prompt engineering, or some combination.
Post ONE use case and the reasoning behind your approach in the course forum by May 21.
A glimpse into the course content
- Develop intuition for how fine tuning works
- Understand when to fine-tune
Compared to any other resource, this course is going to focus more on our experiences working on a wide range of commercial projects. Everything is going to be as practical and actionable as possible, fueled by business experience rather than conventional academic research.
The State of LLMs
With the transition to GenAI and LLMs, especially the initial ChatGPT moment, the thing that was most striking to Dan Becker about our field is that really no one knew where to apply these models or what the patterns were for successfully deploying LLMs to solve a range of business problems. This is still largely the case. And the only way to figure out what those patterns are is to get exposure to a wide range of different problems.
- Hands on
- Practical rather than theoretical
- Interactive
- Finish more capable than you started
- DO NOT start with fine-tuning. Prompt eng first (that is much quicker)
- Prompt eng is much quicker. Fine-tuning and using a fine-tuned LLM is more complex and time consuming.
- Hamel pointed out that they are not here to sell you fine-tuning. They want to give you an intuition about when and where fine-tuning may help you, and when it may not.
- Use OpenAI, Claude, etc.
- "Vibe-checks" are OK in the beginning
- "Vibe-checks" where you will at the output and does it look good, what do you like about it, what do you not like.
- Write simple tests & assertions
- Over time, you will start using more and more programmatic tests & assertions. You will probably always do vibe checks, but you will accumulate an increasing, and eventually quite large, set of tests & assertions that you run (see the sketch after this list).
- Ship fast
- Ship something simple very quickly to get a project started.
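To make the "simple tests & assertions" point concrete, here is a minimal sketch of what such checks can look like. The task (summarizing support emails), the checks, and the thresholds are all illustrative assumptions, not the course's examples:

```python
# Hypothetical, minimal assertions for an LLM that summarizes support emails.
# The checks and thresholds below are illustrative, not prescriptive.

def check_summary(email: str, summary: str) -> list[str]:
    """Return a list of failed checks; an empty list means the output passes."""
    failures = []
    if not summary.strip():
        failures.append("empty output")
    if len(summary.split()) > 80:
        failures.append("summary longer than 80 words")
    if "as an ai language model" in summary.lower():
        failures.append("boilerplate refusal/disclaimer text")
    if "http" in summary and "http" not in email:
        failures.append("URL in summary that is not in the source email")
    return failures

# Run the checks over (input, output) pairs you have collected from vibe checks.
examples = [
    ("Hi, my order #1234 never arrived. Can you help?",
     "Customer reports order #1234 missing and asks for help."),
]
for email, summary in examples:
    print(check_summary(email, summary) or "all checks passed")
```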
Dan's story: He recently had a project that made the above especially clear. They were working with a company that takes academic journal articles, extracts info from them, structures that info, and sells the results to physical manufacturers. (watch the recording at 00:12:32)
Key take-aways: For almost all use cases, simple things work well enough to start making progress. (Simple things frequently work at least tolerably well, and you can improve on them from there.)
A workflow for how to continuously improve your models, especially with fine-tuning.
Hamel's blog post: Your AI Product Needs Evals -- How to construct domain-specific LLM evaluation systems.
I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.
Walkthrough of fine-tuning LLM.
We will do that reasonably quickly, without going too much into technical or mathematical detail. A quick refresher is broadly useful for everyone.
The model decides one next word (token) at a time: what is the likelihood of this particular next word given the words seen so far? This likelihood (predicted probabilities of different tokens) is calculated from the model's weights, which were learned from the text it was trained on.
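To make next-token prediction concrete, here is a minimal sketch using the Hugging Face transformers library; GPT-2 is used purely because it is small and freely downloadable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small base model, used only to illustrate next-token probabilities.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is"
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probabilities for the *next* token, given all the tokens seen so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tok.decode(int(token_id))!r}: {prob.item():.3f}")
```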
Base models aren't helpful assistants out of the box. As their name suggests, they serve as a good baseline for fine-tuning.
Fine-Tuning
When we do fine-tuning, we will start with a dataset. It will have many pairs of input and output (prompt and response). We are going to train it to take something in the form of the input and create something in the form of output. We want to harness next token prediction.
The trick to do this is to put it in something called a template (see screenshot below).
Above is a very simple template. The template is a string containing:
- input that's highlighted in yellow
- output that's highlighted in green
- one token in between that's highlighted in red
The one token in between will be our way of making the model, at inference time, short-circuit all the other training it has had and say: when I see this token, the likely next tokens after it are, in this case, an answer, a helpful answer.
So this is our way of training with next token prediction.
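Here is a minimal sketch of how a template like this turns an (input, output) pair into one training string. The exact template and separator token from the screenshot are not reproduced in these notes, so the "### Answer:" tag below is purely illustrative:

```python
# Illustrative template: input, then a separator, then output. The separator
# plays the role of the red-highlighted token on the slide.
TEMPLATE = "### Question:\n{input}\n### Answer:\n{output}"

def build_training_text(example: dict) -> str:
    # At training time, the full string (input + separator + output) is what
    # the model learns to continue, token by token.
    return TEMPLATE.format(input=example["input"], output=example["output"])

example = {"input": "What does LLM stand for?", "output": "Large Language Model."}
print(build_training_text(example))

# At inference time, we render only up to (and including) the separator and
# let the model generate everything after it.
prompt = TEMPLATE.split("{output}")[0].format(input="What does LLM stand for?")
print(prompt)
```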
**Need consistent templating between training & inference**
One thing we're going to call out many times (it is the bane of my existence), and that is templating. (watch the recording at 00:23:34)
This is actually a harder problem than it sounds. Hamel has done some pretty cool work on how this relates to tokenization. There's a whole rabbit hole there.
Hamel: Yes, this is the biggest nightmare. As you know, we're going to spend time learning Axolotl, and when I teach Axolotl, the bulk of the time is making sure that you understand what the template is, because that is where 99% of errors happen in practice. It sounds like, "Oh, OK, why would you ever get this wrong?" The fact of the matter is, there are many different kinds of templates out there. There are things that assemble templates for you. It's really easy to misunderstand what the template is. It's often the case that you don't assemble this template yourself. If you don't precisely understand what that template is, you can get into trouble really fast. The reason it comes up so much is that there are a lot of abstractions out there. Roughly half of the time, I've seen something go wrong between what you expect to happen and what is actually being assembled.
Hamel dropped a link in the chat to a very detailed analysis of these tokens and how you can be misled even when they look the same. (You can read it if you're interested.)
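One way to catch template mismatches is to render the template you think you are training on and the template the inference stack actually produces for the same example, then compare token IDs rather than strings. A minimal sketch, assuming a chat model whose tokenizer ships a chat template (the model name and the hand-written template are illustrative):

```python
from transformers import AutoTokenizer

# Illustrative model; any chat model whose tokenizer ships a chat template works.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [{"role": "user", "content": "What does LLM stand for?"}]

# What the inference stack will actually feed the model.
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# What you *think* the template is (a hand-written assumption that may not match).
assumed = "<|user|>\nWhat does LLM stand for?\n<|assistant|>\n"

print(repr(rendered))
print(repr(assumed))
# Compare token IDs, not strings: strings that look identical can still
# tokenize differently (e.g., special tokens vs. their literal text).
print(tok(rendered)["input_ids"] == tok(assumed)["input_ids"])
```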
Observation (watch the recording at 00:28:51)
Sharing Tweets by Anton (@abacaj) and Emmanuel Ameisen (@mlpowered).
Interest in fine tuning is like waves. It increases and decreases.
Story: A year ago, at an OpenAI event, they said: we think there's going to be one model to rule them all, and we don't think there are going to be lots of small models that are useful for specific purposes.
There is no question that sometimes fine-tuning is the right approach. You're going to have bespoke data.
There's been an important trend toward longer context windows, which means you can give more examples in your prompt. That trend favors less fine-tuning and more of just dropping a lot into the prompt.
Dan doesn't think either extreme view is right, and the community will move back and forth between them over time.
Hamel's take on this:
My sentiment is: you should definitely try NOT to fine-tune first. You need to prove to yourself that you should fine-tune. The way to prove it is to have some minimal evaluation system. Once you stop making progress, move to fine-tuning.
It's important to learn how to do prompting. It's kind of funny to think of prompting as a skill, but actually I think it is. I practice it quite a bit.
You may have a generic model which has not seen specific data. Then you need fine-tuning.
The task is to predict a value based on a description.
Because it's essentially regression, you could use classical NLP or ML techniques for this. There's an important reason we did not want to do that: classical NLP/regression was not useful because the descriptions won't cover all the words. Users might enter new words, and the classical approach will not handle them.
Slide 16: Unacceptable results
- Learned that responses were round numbers, but not great at getting approximately right values (inappropriate loss function)
  - Output numbers are just strings (tokens) to the LLM, so treating this as a regression problem gives the wrong loss: the model produces round numbers like 10, 5, 100, 5000, not values like 97.
- Training data had "wrong" small values
- Be careful with training data. Values were entered low for various reasons (e.g., insurance cost), so users would put 10 instead of 500.
- Many incomprehensible descriptions due to length limit
- Companies use acronyms so that descriptions fit in the 80-character limit. Thus the model needs access to each company's acronyms and related conventions. This is all revealed by looking at the raw data.
- Conventional NLP/ML also not good enough
- In this case, NLP/ML was not useful.
(watch the recording at 00:39:35)
I think this is a really great use case in which to learn some of the important nuances about fine-tuning.
Honeycomb provides a domain-specific query language for observability. It's basically telemetry; you can think of it like DataDog, which people use for some of these things.
They created a natural language query assistant and it will build the query (DSL not SQL) for you using LLMs.
In the alpha version of the product that they released, the user provides two inputs: a natural-language query + the schema. The input gets assembled into a prompt that is sent to GPT-3.5, and out comes a Honeycomb query. The prompt itself is very complex.
The Prompt
System Message + Content (column names) + Query Spec + Tips
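A hypothetical sketch of how a prompt with that shape might be assembled; the actual system message, query spec, and tips that Honeycomb used are not reproduced here:

```python
# All strings below are placeholders standing in for Honeycomb's real content.
SYSTEM_MESSAGE = "You translate natural-language questions into Honeycomb queries."
QUERY_SPEC = "...the full Honeycomb query-language specification..."
TIPS = "...best practices and few-shot examples..."

def build_prompt(user_question: str, schema_columns: list[str]) -> list[dict]:
    # The schema columns are the per-request "content" part of the prompt.
    column_context = "Available columns: " + ", ".join(schema_columns)
    system = "\n\n".join([SYSTEM_MESSAGE, column_context, QUERY_SPEC, TIPS])
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_question},
    ]

messages = build_prompt(
    "slowest endpoints in the last hour",
    ["duration_ms", "endpoint", "status_code"],
)
```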
Problems
Complex tips, best practices, few-shot examples, etc. are very difficult for the LLM to follow, and difficult to express in a prompt.
That's a smell that fine-tuning might be useful.
Slide 21:
- Data privacy
- Honeycomb is using GPT-3.5. They're able to roll it out to a select number of customers. But they have to get permission from their customers to ship their data to OpenAI. Not everybody wants to opt-in to that. They also don't want to ask everybody for that. It's kind of disruptive. They wanted to see if there's a way that they can own the model, run these workloads inside like a trusted boundary where data doesn't leave their property.
- Quality vs. latency tradeoff
- Honeycomb experimented with GPT-4. It was a bit too expensive and also slower than they wanted. The reason you want to think about fine-tuning is that maybe it's possible to train a smaller model to do better, approaching the quality of a bigger model with lower latency.
- Extremely narrow problem
- Honeycomb problem is a very narrow problem -- the domain is very narrow.
- Prompt engineering is impractical
- Prompt engineering in Honeycomb's case is impractical -- to express all the nuances of the Honeycomb query language, just even expressing that as a human being is very difficult.
RESULT: Fine-tuned model was faster, more compliant from the data privacy perspective & higher quality vs. GPT 3.5
What we will do is simulate that in this course. We will give you access to synthetic data and walk you through how we did it.
Slide 24:
Imagine you decide to build a chatbot for your first fine-tuning project. What factors determine whether fine-tuning a chatbot will work well or not?
(watch the recording at 01:09:18)
(watch the recording at 01:20:39)
Slide 26:
- Email composer
- Listing Finder
- CMA (Comparative Market Analysis)
- Create Marketing Website
- Create Social Media Post
- Query Knowledge Base
- ... 25 other tools
People want to use chatbots. 9 out of 10 will ask for one. It is better to say NO in most cases.
The idea: let's put a chat bot on your software and you can ask it anything.
So that breaks really fast because the surface area is extremely large, and it kind of devolves into AGI in the sense of "hey, ask it to do anything." It's not really scoped, and it's hard to make progress on something that isn't scoped.
Slide 27:
- Manage user expectations
- Large surface area
- Combinations of tools
- Compromise: specificity
A chatbot can help in narrow cases, but users' expectations are very high. Instead of one chatbot that does everything, build a chatbot for each of the blocks above.
Slide 28
(watch the recording at 01:28:05)
Dan: We were working on a chat bot for a package delivery company called DPD. Actually I told them I thought it was not ready to be released but they were antsy so they released it.
A DPD chatbot error caused it to swear at a customer.
So this, I think just speaks to the fact that we don't really have a great sense for what people's expectations are.
Someone commented about guardrails in the chat. There are a bunch of tools that are meant to be guardrails and to check for these so-called prompt injections. None of them work especially well. Guardrails are not foolproof. (watch the recording at 01:31:01)
Hamel: I'll drop a blog post in the chat about looking at your prompt and how important that is, which highlights things like different kinds of guardrails. (aside: Simon Willison's prompt injection blog post series)
Slide 29:
- Want bespoke behavior
- Valuable enough to justify operational complexity
- Have examples of desired input/outputs
(watch the recording at 01:32:10)
Slide 30
Table:
Prompt-Response pair 1 - Great answer!
Prompt-Response pair 2 - OK response
Prompt-Response pair 3 - Too long-winded
Prompt-Response pair 4 - Pretty good
Prompt-Response pair 5 - Not bad. A little repetitive
(watch the recording at 01:34:20)
While it's difficult to write perfect responses (as shown in the previous slide), humans are typically pretty good at saying, given 2 choices, which they like more.
So there is a whole field of techniques that are preference optimization algorithms.
Regarding the screenshot of the Tweet: the top models on this leaderboard use a technique called DPO, which is short for Direct Preference Optimization.
What is DPO?
(watch the recording at 01:35:11)
In supervised fine-tuning (prompt + response pairs), the model learns to imitate the behavior or style of the responses to those prompts. DPO builds on this: in addition to the prompt and response, you tell the model which is the better and which is the worse response to a query.
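A minimal sketch of what a preference dataset and a DPO fine-tuning setup can look like, using the Hugging Face TRL library. The records, model name, and hyperparameters are illustrative, and argument names vary a bit across TRL versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Each record pairs one prompt with a preferred ("chosen") and a rejected response.
# These rows are invented for illustration.
train_dataset = Dataset.from_list([
    {
        "prompt": "Customer asks: where is my order #1234?",
        "chosen": "Thanks for reaching out! Your order shipped on Monday and should arrive by Friday.",
        "rejected": "We will look into it.",
    },
])

model_name = "HuggingFaceH4/zephyr-7b-beta"  # illustrative; Dan's project used a Zephyr model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-customer-service", beta=0.1),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions call this argument `tokenizer`
)
trainer.train()
```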
Dan: I did a project like that for a large publisher. This is an example we worked on relatively little data.
So they had incoming emails. For each of 200 emails, we had 2 different customer service agents write a response. Their manager took these pairs of responses and said which of the 2 he preferred. Then we fine-tuned a Zephyr model with DPO and compared the model to alternative response sources on new emails.
(watch the recording at 01:37:12)
Test results showed the DPO fine-tuned model's responses were rated better than GPT-4's.
Slide 34
(watch the recording at 01:39:09)
Quiz:
- Restaurants, customer service emails (answer: good use case for fine-tuning)
- A medical publisher has an army of analysts that classify each new research article into some complex ontology that they've built. (answer: great use case for fine-tuning)
- A startup wants to build the world's best short fiction writer. Here, most people said this is a poor fit for fine-tuning.
  - Dan: If I were a startup trying to build this, I would for a period of time have two different models produce different responses and have people rank the stories. Then we could do DPO and say story A is better than story B. The model can, in a very granular and data-informed way, learn about people's preferences, what they like, in a way that I don't think is possible at all without some sort of preference optimization.
  - Hamel: poor fit for fine-tuning.
- A company (name fudged) wants to give each employee an automated summary of new articles on specific topics at the beginning of the day. They need an LLM-based service that takes news articles as inputs and responds with summaries. (answer: poor fit for fine-tuning. Dan: ChatGPT can do a great job of this; I don't really understand what data you would have internally.)
Questions in the Zoom chat.
- Wade: Can you show us some examples of assertions and simple tests?
- Hamel: We will do that in the course when we get to that point.
- What is the difference between pre-training and fine-tuning?
- Hamel: They are the same thing, the same procedure; it's just a matter of different data. Pre-training is not focused on a specific domain: you're trying to feed a wide, diverse set of data to teach a model general skills, whereas fine-tuning is training a model to do really well on a very specific domain. Pre-training is where your big base models come from, and then you can fine-tune on top of those.
- Dan: They are both basically the same mathematically. In terms of purpose, pre-training is really teaching the model to learn basic language, and fine-tuning is, as the name suggests, tuning it for a specific purpose that you're going to want to use it for in the future.
(watch the recording at 00:51:20)
- How do you know whether it is a fine-tuning versus a RAG question?
- It's a common confusion actually. These 2 techniques RAG and fine-tuning are not competing with each other per se.
- RAG is useful when the LLM can go to a data store to check the latest info. Fine-tuning can be done on the output of RAG.
- Consider fine-tuning when a good prompt and RAG does not work.
- Can we fine-tune a model to make it better at doing function calls?
- Hamel: Yes, absolutely. There are some open models, built on Llama 3 and certainly Llama 2, that have already been fine-tuned with the specific purpose of function calling.
- Dan: You need lots of data so that it maps to all the parts in your problem space.
- How many samples are necessary for fine-tuning?
- Dan: It varies quite a bit. The least I have used that I think we viewed as a success is 100 samples. It wouldn't surprise me if there are examples with less than that. The most important determinant is how broad your problem is.
- Can you have too much data?
- Dan: No. I'm hesitant to say never.
- Is there a value in fine-tuning a model on both correct and incorrect responses?
- Dan: Soon we will talk about preference optimization, which isn't exactly this but is pretty close. Instead of right and wrong, you have better and worse. For example, for a publisher we built a tool to automate responding to emails, and we had better and worse samples. We used preference optimization and came up with something better than what conventional supervised fine-tuning produced.
- The Gorilla leaderboard: https://gorilla.cs.berkeley.edu/blogs/7_open_functions_v2.html
- Hamel: It is for function calling. That's great, but keep in mind the Gorilla leaderboard is a bit overfit to function calling. In practice, you're going to have a mix of function calling and non-function calling. Take every leaderboard with a grain of salt. Also look at the test data and think about how it might apply to your use case. But it's an OK way to get a general sense.
- Multimodal fine-tuning
- One thing I would emphasize is that the LLaVA model is very good. There's a script in the LLaVA repository for fine-tuning, and just getting that set up has, if anything, been easier than I expected. Maybe I will write a blog post about it. If you look at the LLaVA repository, you would be surprised at how well it can be done with an amount of effort that's not as immense as I expected beforehand.
- Does synthetic data have to come from a more powerful model?
- Hamel: Yes, if you can. One of the key reasons I like LLMs as opposed to classic ML is that it's more fun to work on those projects, because I can get unblocked if I run into a situation where I don't have enough data. I usually use the most powerful model I can to generate synthetic data, usually Mistral Large, because the terms and conditions don't scare anybody: they're very permissive, so you can generate synthetic data and train another model with it. There are a lot of different ways to do that. One way is taking existing data and perturbing it, i.e., asking a language model to rewrite it and then change the output in accordance with that, using evals in the middle (see the sketch after this Q&A section). Another way is to generate test cases or inputs to your LLM system. Your LLM system might be some complex system that has RAG in it, does function calls, and then finally returns something to the user, so you can generate inputs into that system. There's a lot to say in words; we'll show you more in upcoming lessons about what that means.
- Do I use base model or instruction tune models for fine-tuning?
- Hamel: Instruction tuned models are already fine-tuned. Base models are generally preferred.
- What is the model size?
- Hamel: I try to get away with the smallest size that I can, so I try to fine-tune a 7-billion-parameter model. I use my intuition at this point, like how narrow the domain is, based on the other things that I've done. The best thing you can do is try to train a 7-billion-parameter model; that's the sweet spot. If you can get something into a very small (or small-ish) package, fine-tuning is going to make more sense. The larger the model, the more you have to justify it: it's going to cost more, it's going to be harder to host, and so on. Small packages are where the payoff is really big.
Q&A again. (watch the recording at 01:52:56)
- Quantization
- Hamel explained quantization (see the recording).
- Dan: We have the CTO of Predibase as a speaker. He is an expert in this area.
- Hallucination, taking the example of classifying academic or science articles onto a complex ontology (thousands of classes): how do you make sure the LLM only outputs valid classes?
- Hamel: We have enough examples that only use a specific set of classes. We have a set of metrics that we are checking all the time, and if the model outputs an invalid class, we just treat that as a misclassification (see the validity-check sketch after this Q&A section).
- Is there any homework?
- Go to the Maven platform and check the Workshop 1 syllabus. The homework is there.
- Come up with a set of 5 use cases you think would be interesting, rate each of them on whether it is a good or bad use case for fine-tuning, and share that in the Discord.
- The customer service DPO fine-tuned model is better than GPT-4. Can you share more detail?
- Dan: Take the McDonald's example about gluten-free policies. GPT-4 responded to the customer service manager with "I have no idea". So the idea that you're going to forever tell GPT-4 enough that it can respond to all the questions that come in, that is fiction.
- Does prompt engineering or few shot examples complement fine-tuning?
- Dan: It is not necessarily the case that you would need to use just one or the other. But for the most part I think of those as alternatives. You could use both.
- Hamel: One rule of thumb is: anything in your prompt that stays exactly the same and doesn't change from one LLM invocation to the next, fine-tuning should be able to completely remove. It's kind of dead weight; you can implicitly teach your model whatever it is you're repeating every single time, so you don't need to say it anymore. Now, if your few-shot examples are dynamic, it depends. The more extensively you fine-tune your model, the less you should need few-shot examples; few-shot examples are more of a prompt-engineering technique. I haven't actually tested that, though, to be honest; it always surprises me what works. There's a spectrum. But if you have few-shot examples in your prompt and they're never changing, those are things you can definitely get rid of with fine-tuning.
- Human annotation
- Hamel: We'll cover data annotation a bit later in the course. You want to have a human in the loop when you're doing evals, and you want to be able to look at lots of different examples and curate which ones are good and bad. You also want to look at your failure modes and curate data that covers all the different use cases you can think of. Every time I try to use off-the-shelf tools for looking at data, I get frustrated, because every domain is very specific. I like to build my own tools with something like Gradio or Streamlit. I'll put a blog post that I wrote about this topic in the chat, "Curating LLM data".
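Following up on the synthetic-data answer above, here is a minimal sketch of the "perturb existing data" idea: ask a strong model to rewrite an existing example and adjust the output to match. It uses the OpenAI Python SDK purely for illustration (Hamel mentioned he often uses Mistral Large); the prompt wording and model name are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def perturb_example(original_input: str, original_output: str) -> dict:
    """Ask a strong model to turn one training example into a new, similar variant."""
    instructions = (
        "Rewrite the following input so it describes a similar but different scenario, "
        "then produce the corresponding output in the same format.\n\n"
        f"Input: {original_input}\nOutput: {original_output}\n\n"
        "Return the new input on the first line and the new output on the following lines."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of a strong model
        messages=[{"role": "user", "content": instructions}],
    )
    new_input, new_output = resp.choices[0].message.content.split("\n", 1)
    return {"input": new_input.strip(), "output": new_output.strip()}

# In practice you would run this over a seed dataset and filter the results with
# evals before using them for fine-tuning.
```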
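And for the ontology-classification answer above, a minimal sketch of treating any out-of-ontology output as a misclassification; the class names and the normalization are illustrative assumptions:

```python
# Illustrative ontology; the real one has thousands of classes.
VALID_CLASSES = {"oncology/breast-cancer", "cardiology/arrhythmia", "neurology/stroke"}

def score_prediction(predicted: str, gold: str) -> bool:
    """Count a prediction as correct only if it is a valid class AND matches the label."""
    predicted = predicted.strip().lower()
    if predicted not in VALID_CLASSES:
        return False  # an invalid (e.g., hallucinated) class counts as a miss
    return predicted == gold

predictions = [
    ("oncology/breast-cancer", "oncology/breast-cancer"),
    ("oncology/lung-cancer-v2", "neurology/stroke"),  # hallucinated class -> miss
]
accuracy = sum(score_prediction(p, g) for p, g in predictions) / len(predictions)
print(f"accuracy: {accuracy:.2f}")
```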
A collection of 98% of links posted in the chat:
AI Product Evaluation
- Your AI Product Needs Evals - https://hamel.dev/blog/posts/evals/
- Langtrace AI - Monitor, eval & improve your LLM apps - https://langtrace.ai
- Observability for LLMs - https://www.honeycomb.io/llm
- Inspect, a framework for large language model evaluations created by the UK AI Safety Institute - https://ukgovernmentbeis.github.io/inspect_ai/
Programming and Development Tools
- DSPy: Programming—not prompting—Foundation Models - https://github.com/stanfordnlp/dspy
- Cohere Toolkit to quickly build and deploy RAG apps - https://docs.cohere.com/docs/cohere-toolkit
- Open UI - https://v0.dev and https://github.com/wandb/openui
- Effortless Python web applications - https://shiny.posit.co/py/
- Fireworks’ GPT-4-level function calling model - https://fireworks.ai/blog/firefunction-v1-gpt-4-level-function-calling
- Code for the Hermes Pro Large Language Model to perform function calling based on the provided schema - https://github.com/NousResearch/Hermes-Function-Calling
- InternVL - A Pioneering Open-Source Alternative to GPT-4V - https://github.com/OpenGVLab/InternVL
- Gorilla: Large Language Model Connected with Massive APIs - https://gorilla.cs.berkeley.edu/blogs/7_open_functions_v2.html
- Get your LLM app from prototype to production - https://www.langchain.com/langsmith
Tokenization and Fine-Tuning
- Fine-tuning: Axolotl vs Unsloth vs TorchTune - https://swaroopch.com/notes/fine-tuning-library
- Curating LLM data - https://hamel.dev/notes/llm/finetuning/04_data_cleaning.html
- Tools for curating LLM data - https://hamel.dev/notes/llm/04_data_cleaning.html
- Notebook fine-tuning on a Captcha image dataset - https://github.com/vikhyat/moondream/blob/main/notebooks/Finetuning.ipynb
Prompt Engineering
- Anthropic's Prompt Engineering Interactive Tutorial - https://docs.google.com/spreadsheets/d/19jzLgRruG9kjUQNKtCg1ZjdD6l6weA6qRXG5zLIAhC8/htmlview?usp=sharing
- Fuck You, Show Me The Prompt - https://hamel.dev/blog/posts/prompt/
- Series: Prompt injection - https://simonwillison.net/series/prompt-injection/
Research Papers and Studies
- Constitutional AI: Harmlessness from AI Feedback - https://arxiv.org/pdf/2212.08073
- The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions - https://arxiv.org/abs/2404.13208 (Making the model follow system prompts)
- Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study - https://arxiv.org/abs/2404.10719
- RAFT: Adapting Language Model to Domain Specific RAG - https://arxiv.org/abs/2403.10131
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? - https://arxiv.org/abs/2405.05904
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model - https://arxiv.org/pdf/2305.18290
- The Unreasonable Ineffectiveness of the Deeper Layers - https://twitter.com/kwindla/status/1788224280754618393
- YouTube Channel by Umar Jamil - https://www.youtube.com/@umarjamilai
News and Articles
- The End of Finetuning — with Jeremy Howard of Fast.ai - https://www.latent.space/p/fastai
- Air Canada Has to Honor a Refund Policy Its Chatbot Made Up - https://www.wired.com/story/air-canada-chatbot-refund-policy/
Source: Discord
Some of these are on the instructors' recommended reading list for today's workshop.
Some highlights:
- Summary of some of the main questions and answers from Hamel and Dan: Session Summary from Limitless: https://discord.com/channels/1238365980128706560/1239614536298795121/1240020377296441445
- Unrelated: Dan's DPO project (refer to slide on "DPO For Customer Service At Large Publisher")
I will be writing a white paper for Straive on our DPO project, but haven't written it yet. There are also some limits on what we can say based on the downstream client's preferences.
I will share the white paper here and with you when it's ready.
- Many learners are curious about fine-tuning embeddings
- @mwildehahn on Discord says: "Same! Since the netflix paper (https://arxiv.org/html/2403.05440v1) there has been a lot of discussion about how cosine similarity isn't a great metric for semantic similarity but I haven't seen a lot around fine tuning your own embedding model or when that would be necessary. I get doing that for a very specific domain like medical language or company jargon but from today's session, it seems like the general purpose embeddings from something like openai would be best for embeddings given the base language knowledge?"