This episode provides an overview of the real-world technologies involved in the umbrella phrase Artificial Intelligence. Anthony Alford explains just enough about machine learning, large language models, retrieval-augmented generation, and other AI terms which today’s software architects need to be able to discuss.
Key Takeaways
- The term "Artificial Intelligence" often means generative AI (genAI) because that is the most common implementation people are familiar with
- LLMs, like any ML model, take input and provide output, like a function you can call via an API
- Before you adopt LLMs in your application, define your success criteria
- Retrieval-Augmented Generation (RAG) should be an early step for improving your LLM adoption, if prompt engineering is not sufficient
- Vector databases provide nearest-neighbor searches, which helps find related content to use in the context provided to the LLM
Transcript
Introduction [00:42]
Thomas Betts: Hi, everyone. Here at InfoQ, we try to provide our audience with information about the latest software innovations and trends. And I personally recognize that sometimes there's a lot of new information out there and we tend to focus on the subjects that are most relevant to what we're currently working on and what we're interested in. Then sometimes you realize that what used to be one of those subjects off to the side is now right in front of you and you can't ignore it anymore. And I'll admit that that was my approach for a lot of the news over the past decade or so about big data, machine learning, artificial intelligence. I found it interesting, but because it wasn't what I was working with, I had this very thin, high-level understanding of most of those topics. And that's fine. That's how software architects usually approach a problem.
We tend to be T-shaped in our knowledge. We have a broad range of subjects we need to know about, and we only go deep in our understanding of a few of them until we have to go deeper in our understanding for something else. That's where I think we've gotten with ML and AI. It's no longer something off to the side. Architects have to deal with these every day. They're front and center because product owners, CTOs, CEOs, maybe even our customers are asking, "Can you put some AI in that?" to just about everything, it seems.
That gets me to today's episode. I've invited Anthony Alford on to help explain some of these ML and AI concepts that are now, I think, required knowledge to be an effective software architect. Anthony's voice will probably sound familiar because he's another InfoQ editor. He co-hosts the Generally AI podcast with Roland Meertens, and I believe that just started its second season. Anthony, thanks for joining me on my episode of the InfoQ Podcast.
Anthony Alford: Thank you for having me.
Thomas Betts: I think a useful way to go through this today in our discussion is to do this big AI glossary. There's a lot of terms that get thrown around, and that's where, I think, architects need to understand what is that term and then figure out how much do I need to know about it so they can have intelligent conversations with their coworkers. I want to provide today just enough information so that those architects can go and have those conversations and realize when something comes up and they have to start implementing that for a project or thinking about a design, they have a little bit more context and that will help them be more successful as they do more research. Sound like a plan?
Anthony Alford: Sounds great.
AI usually means deep learning or neural networks [03:00]
Thomas Betts: All right. First give me your definition. What is AI?
Anthony Alford: AI is artificial intelligence.
Thomas Betts: And we're done.
Anthony Alford: Yay. And, in fact, when I talk to people about this, I say, "AI really tells you more about the difficulty of the problem you're trying to solve". It's not an actual solution. The good news is when most people are talking about AI, they're actually talking about some type of machine learning. And machine learning is definitely a technology. It's a well-studied, well-defined branch of science. And, in fact, the part of machine learning that most people mean now is something called deep learning, which is also known as neural networks. This has been around since the 1950s, so it's pretty widely studied.
ML models are just functions that take input and provide output [03:48]
Thomas Betts: Yes, I think that's the idea that AI is not a product you can go buy. You can go buy a machine learning model. You can build a machine learning model. You can add it to your system, but you can't just say, "I want an AI". But that's the way people are talking about it. Let's start talking about the things that exist, the tangible elements. Give me some examples of what people are thinking when they say, "I want AI in my system". What are the machine learning elements they're talking about?
Anthony Alford: Of course, most people are talking about something like a large language model or a generative AI. What I like to tell people as software developers, the way you can think about these things is it's a function. We write code that calls functions in external libraries all the time. At one level you can think about it. It is just a function that you can call. The inputs and outputs are quite complex, right? The input might be an entire image or a podcast audio, and the output might also be something big like the transcript of the podcast or a summary.
Thomas Betts: And that's where we get into the... Most people are thinking of generative AI, gen AI. Give me some text, give me an image, give me some sound. That's the input. Machine learning model, it all comes down to ones and zeros, right? It's breaking that up into some sort of data it can understand and doing math on it, right?
Anthony Alford: Yes, that's right. Again, when I talk to software developers, I say, "When you think about the input and output of these functions, the input and output is just an array of floats". Actually, it's possibly a multidimensional array. The abstract term for that is a tensor. And if you look at some of the common machine learning libraries, they're going to use the word tensor. It just means a multidimensional array, but you have to be able to express all your inputs and outputs as these tensors.
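To make the "array of floats" idea concrete, here is a minimal sketch in plain Python using nested lists. The `shape` helper and all sample values are illustrative assumptions, not part of any particular library; machine learning libraries ship their own tensor types.

```python
# A tensor is a (possibly multidimensional) array of floats.
# All values below are made up for illustration.

def shape(tensor):
    """Return the dimensions of a nested-list tensor, e.g. (2, 3)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

# One dimension: for example, a single row of numeric features.
vector = [0.5, 1.0, 2.0]

# Two dimensions: for example, a tiny grayscale image.
matrix = [[0.0, 0.1, 0.2],
          [0.3, 0.4, 0.5]]

# Three dimensions: for example, height x width x RGB channels.
image = [[[0.0, 0.0, 1.0] for _ in range(4)] for _ in range(2)]

print(shape(vector), shape(matrix), shape(image))
```

Whatever the original input is (text, audio, an image), it has to be converted into a structure like one of these before the model can do math on it.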
Building an ML model is like writing a lot of unit tests and refining the function [05:42]
Thomas Betts: Yes, these are the things I learned in math back in university years ago, but because I'm not a data scientist, I don't use those words every day and forget, "Oh yes, multidimensional array, I understand what that is". But exactly. That's like several extra syllables I don't need to say. I've got these tensors I'm putting in. What do I do with it? How do I build one of these models?
Anthony Alford: Okay, if you want to build your own model, which you actually might want to consider not doing, we can talk about that later. But in general, the way these models are built is a process called training and supervised learning. What you really need, again, from our perspective as software developers, we need a suite of unit tests. A really big suite of unit tests, which means just what we expect, some inputs to the function and expected outputs from the function. The training process essentially is randomly writing a function. It starts with a random function and then just keeps fixing bugs in that function until the unit tests pass somewhat. They don't actually have to exactly pass. You also tell it, "Here's a way to compute how badly the tests are failing and just make that number smaller every time".
Thomas Betts: That's where you get to the probability that this all comes down to math. Again, I'm used to writing unit tests and I say, "My inputs are A and B, and I expect C to come out". You're saying, "Here's A and B and I expect C". But here's how you can tell how close you are to C?
Anthony Alford: Exactly, yes. It depends on the data type. I mentioned they all turn into tensors, but the easiest one is, let's say, you're building a model that outputs an actual number. Maybe you're building a model that the inputs are things like the square feet of a house and the number of rooms, et cetera. And the output is the expected house price. If you give it unit tests, you can get a measure of how off the unit test is just by subtracting the number that you get out from the number that you expect. You can do sum of squared errors. Then the machine learning will just keep changing the function to make that sum of squared errors lower. With something like text or an image, it may be a little trickier to come up with a measurement of how off the unit tests are.
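The house price example can be sketched in a few lines. This is a toy illustration under stated assumptions: the data points are invented, the "function" is a single weight multiplied by the input, and plain gradient descent stands in for whatever optimizer a real training process would use.

```python
# Toy "unit tests": (square feet in thousands, price in units of $100k).
# The data is made up so that the ideal weight is exactly 2.0.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def predict(weight, x):
    """The function being trained: here just a single multiplication."""
    return weight * x

def sum_squared_error(weight):
    """How badly the 'unit tests' fail for a given weight."""
    return sum((predict(weight, x) - y) ** 2 for x, y in examples)

def train(weight=0.0, learning_rate=0.01, steps=200):
    """Repeatedly nudge the weight in the direction that lowers the error."""
    for _ in range(steps):
        # Derivative of the sum of squared errors with respect to the weight.
        gradient = sum(2 * (predict(weight, x) - y) * x for x, y in examples)
        weight -= learning_rate * gradient
    return weight

initial_error = sum_squared_error(0.0)
trained_weight = train()
final_error = sum_squared_error(trained_weight)
print(initial_error, trained_weight, final_error)
```

The training loop never sees a rule such as "price is twice the size". It only sees the error number and keeps making it smaller.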
Language models are trained using sentences to predict the probability of the next word in the sentence [07:59]
Thomas Betts: We're getting into all the ideas of gen AI. Let's just take the text example for now and we'll leave off audio and images and everything else, because it's the same principles. Most people are familiar with interacting with ChatGPT. I type in something and it gives me a bunch of text. How did those come about and how did we create these LLMs that people said, "When I type in this sentence, I expect this sentence in response".
Anthony Alford: Okay, so how long of the story do you want here? We can go back to 2017 or even earlier.
Thomas Betts: Let's give the high level details. If it's an important milestone, I think it's useful to sometimes have the origin story.
Anthony Alford: You're right. The short answer is these things like ChatGPT are what are called language models. The input of the function is a sequence of words or, more abstractly, tokens. The output is all possible tokens along with their probability of being the next one. Let me give you an example. If I give you the input sequence once upon a... What's the next word?
Thomas Betts: I'm going to guess time.
Anthony Alford: Right. What the LLM will give you is every possible word with its probability, and time will have a very high probability of being the next one. Then something like pancake would have a lower probability. That's a probability distribution. We actually know the answer. In training, we know that in the unit test, the word time has the probability of 100%. Every other word has a probability of zero. That's one probability distribution. The probability distribution it gives us is another one. And there's a measure of how different those are. That's called cross-entropy loss.
That's how you can train it to improve that. It'll shift its output distribution to have time be closer to 100% and everything else zero. That's a language model, and the method that I described is really how they're trained. You take a whole bunch of text and you take sequences of that text and you chop out a word or multiple words and you have it fill in those words. The way it fills it in is it gives you a probability distribution for every possible word. Ideally, the one you chopped out has the highest probability.
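A minimal sketch of cross-entropy loss for the "once upon a..." example follows. The three-word vocabulary and the two predicted distributions are invented for illustration; a real model's vocabulary has many thousands of tokens.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one.

    Terms where the target probability is zero contribute nothing,
    so they are skipped.
    """
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

vocabulary = ["time", "pancake", "dog"]

# In training, the true next word "time" has probability 100%.
target = [1.0, 0.0, 0.0]

# Two hypothetical model outputs over the same vocabulary.
poor_guess = [0.2, 0.5, 0.3]
good_guess = [0.9, 0.05, 0.05]

print(cross_entropy(target, poor_guess))  # larger loss
print(cross_entropy(target, good_guess))  # smaller loss
```

As the predicted probability for "time" moves toward 100%, the loss moves toward zero, which is the number the training process tries to shrink.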
Thomas Betts: Got you. It's like the image recognitions that we've seen for years.
Anthony Alford: Exactly.
Thomas Betts: We've had image recognition models. It's like, "How do I identify this is a dog? This is a cat?" and we trained it. It's like, "This is a cat. This is a dog". And it started putting that into its model somehow. It's like when I see this array of pixels, the answer is such a probability that it is a dog in that picture.
Anthony Alford: Yes. And in general, if we want to talk about data types, this is an enumeration. With enumeration data types, this thing is what you might call a classifier. You were talking about a dog or a cat. It'll give you an answer for every possible output class. Every possible enumeration value has a probability associated with it. You want the real one to be close to 100%, and you want the rest of them to be close to zero. It's the same for text. The entire vocabulary is given in a probability distribution.
Neural networks are doing matrix multiplication, with extremely large matrices [11:14]
Thomas Betts: That's when you hear about how big these models are, it's how much they've been trained on. The assumption is that ChatGPT and GPT-4 were basically trained on everything that you could possibly get off the internet. I don't know how true that is, but that's the way people talk about it.
Anthony Alford: It's close enough to be true. That's the data set. There's also the number of parameters that make up the model. When we're talking about these deep learning models, those are neural networks. And neural networks are, at heart, matrix multiplication. I mentioned those input tensors. You could think of them as like matrices. You can multiply that times the model's matrix. We talk about those matrix entries are sometimes called weights because ultimately what you're doing is a weighted sum of the input values. When we talk about how big is the model, we're talking about how many matrix parameters are in that thing. For GPT-4, we don't know. We were not told. If you go all the way back to GPT-2, there was like one and a half billion parameters in the matrices inside it.
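The "weighted sum" idea can be shown with a tiny matrix. This sketch is a simplification: the weights and inputs are made-up numbers, and a real neural network stacks many such multiplications with non-linear steps in between.

```python
def matvec(weights, inputs):
    """Multiply a weight matrix by an input vector.

    Each output value is a weighted sum of all the input values.
    """
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def parameter_count(weights):
    """Count the matrix entries, which is what 'model size' refers to."""
    return sum(len(row) for row in weights)

# A 2 x 3 weight matrix: 6 parameters (illustrative values).
weights = [[0.5, -1.0, 2.0],
           [1.0, 0.0, 0.5]]
inputs = [2.0, 1.0, 4.0]

outputs = matvec(weights, inputs)
print(outputs, parameter_count(weights))
```

A model described as having one and a half billion parameters has that many entries across all of its matrices, compared with the six here.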
Thomas Betts: Yes, I think we're now seeing...
Anthony Alford: Hundreds of billions.
Thomas Betts: Hundreds of billions, yes.
Where does large language model come in? Is it in the billions?
Anthony Alford: Yes. Well, it's not a hard number. But what we're seeing now is if something is tens or hundreds of billions, that's probably large. We have smaller ones now where you'll see Llama or something like... What is it, Gemma from Google? And Phi from Microsoft. Those are still billions, but they're only... From 1 to 10 billion is considered a small model now. That's small enough to run on your laptop actually.
Thomas Betts: Okay, you just threw out several other names and these are the things that I'm talking about that architects were like, "Oh, I think I've heard of Llama. Gemma sounds familiar". And was it Psi?
Anthony Alford: Phi, P-H-I, right. The Greek letter. Here in America, Phi, but other places it's Phee.
Hugging Face is like GitHub for language models [13:28]
Thomas Betts: I know you can go out and find details of some of these. There's a site called Hugging Face that I don't understand, but you can go and find the models and you can test the models. What is that?
Anthony Alford: Hugging Face, you can think of as the GitHub for language models. In fact, I mentioned a library. They have an SDK. They have a library, Python library you can install on your laptop that will behind the scenes download and run these smaller language models that you can actually run on your machine. What they do is they have files that contain those matrix entries that I mentioned.
Thomas Betts: That's the composed model, if you will, right? I always think the training is I'm going to run my program and the output is the model. The training process might take hours or days, but once it's done, it's done and it's baked. Now I have the model, and now that model, for large language models or small language models, you're saying it's something that I can put on my laptop. Some of those, if they were smaller machine learning models, we've been able to move those around for a while, right?
Two phases of the machine learning life cycle [14:35]
Anthony Alford: Oh, yes. We can think of two phases in the life cycle of a machine learning model. There's the training that you mentioned. We could think of that as developing a function, and then once it's developed, once we've written the function, we might build it and deploy it as a jar, for example, or some kind of library that you can use. The trained model is like that, and when you load it up and you put an input into it and get an output out, that's called inference. The model infers some output from your input. Those are the two big chunks of the model lifecycle.
Auto-regressive models take the output and feed it back in as the next input, adding to the context [15:12]
Thomas Betts: Back to the large language models where you're talking about predict the next word and then predict the next word. This is where it's feeding it back in. The way I've understood is it's just auto-complete on steroids, one letter, one word. It's like, "I'll just do all of it". It keeps feeding that sentence that it's building back into the context, and so that's the next thing.
Anthony Alford: That's right. And you'll hear these models referred to as autoregressive, and that's exactly what they're doing. You start with initial input, which sometimes we call that the prompt. We also call the input to the model, the context. The prompt is the initial context and then it outputs one more token that's stuck on the end and then it feeds back as the new context and the process just repeats. These things also are able to output a token that basically says, "Stop". And that's how they know to stop. Whereas I've tried that auto-complete with my phone where I just keep auto-completing over and over. It eventually produces gibberish, but it is the exact same idea.
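The autoregressive loop can be sketched as follows. The lookup table is a hypothetical stand-in for a real language model, which would return a probability distribution over its whole vocabulary; the `STOP` marker and the `max_tokens` cap are also assumptions for the sake of the example.

```python
STOP = "<stop>"

# Hypothetical stand-in for a model: the most likely token after each token.
NEXT_TOKEN = {"once": "upon", "upon": "a", "a": "time", "time": STOP}

def next_token(context):
    """Pretend model call: look only at the last token of the context."""
    return NEXT_TOKEN.get(context[-1], STOP)

def generate(prompt, max_tokens=10):
    """Generate tokens one at a time, feeding each one back in."""
    context = list(prompt)  # the prompt is the initial context
    for _ in range(max_tokens):
        token = next_token(context)
        if token == STOP:
            break  # the model signalled that it is done
        context.append(token)  # the output becomes part of the next input
    return context

print(generate(["once"]))
```

A real model looks at the entire context rather than just the last token, but the loop around it has this same shape.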
Tokens are the words or parts of words that the model can respond with [16:18]
Thomas Betts: You've now said token a few times, and I keep saying word. And I know the layman is usually interchanging those, and it's not exactly the same thing. That a token is not a word all the time. What is a token in terms of these language models?
Anthony Alford: When people first started, it was words. We're probably familiar with the idea with search engines of doing things like stemming or things like that where the word itself doesn't actually become the token. The reason you want to do something that's not exactly the word is I mentioned you can only get an output that is one of the tokens that it knows about. You've seen things like, "Well, let's just use the bytes as tokens". I think now it's byte pairs. Basically, it's no longer at the word level. A token is smaller than a word. You might see a token be a couple of letters or characters or bytes.
Thomas Betts: And what's the advantage of shrinking those down? Instead of predicting the next word is once upon a time, it would predict T and then I and then M then E.
Anthony Alford: Or something like that, or TI. The reason is so that you can output words that are not real words, that wouldn't be in the regular vocabulary.
Thomas Betts: Now is it smart enough to say that time is one possible token and TI might be a different one? Does it break it down both ways?
Anthony Alford: The tokenization is, that's almost become a commodity in itself. Most people are not really looking at what the specific token data set is. I think typically you want something a little bigger than one character, but you want something smaller than a word. This is something that researchers have experimented with.
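To illustrate how a vocabulary can contain both whole words and smaller pieces, here is a toy tokenizer that greedily takes the longest matching piece. This is not byte-pair encoding as used by production models; the vocabulary is invented and the greedy strategy is chosen only because it is short to show.

```python
# Toy vocabulary, made up for illustration. Real tokenizers learn theirs from data.
VOCAB = {"time", "ti", "me", "pan", "cake",
         "t", "i", "m", "e", "p", "a", "n", "c", "k", "z"}

def tokenize(word):
    """Split a word into tokens, always taking the longest known piece."""
    tokens = []
    position = 0
    while position < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for end in range(len(word), position, -1):
            piece = word[position:end]
            if piece in VOCAB:
                tokens.append(piece)
                position = end
                break
        else:
            raise ValueError(f"no token covers {word[position]!r}")
    return tokens

print(tokenize("time"))     # a common word can be a single token
print(tokenize("pancake"))  # a longer word is assembled from pieces
print(tokenize("timez"))    # a made-up word still gets tokenized
```

Because single characters are in the vocabulary as a fallback, the model can emit words that were never in any dictionary.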
Thomas Betts: And my interaction with knowing the number of tokens is... When I've played around with these things, used ChatGPT or the OpenAI API, it's measuring how many tokens are being used. And you're sometimes being billed by the number of tokens.
Anthony Alford: Yes, that's right. Because essentially the output is a token, and the input we mentioned, that's called the context, the models have a maximum size of the context or input in the number of tokens. It's in the order of thousands or maybe even hundreds of thousands now with a lot of these models. But eventually, it will have to stop because effectively you can't take a larger input.
Thomas Betts: Yes, and I remember people found those limits when ChatGPT came out is you'd have this conversation that would go on and on and on, and pretty soon you watched the first part of your conversation just fall off the stack, if you will.
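One simple way a chat application could stay within a fixed context size is to keep only the most recent tokens. This sketch assumes that strategy for illustration; the conversation does not say how any specific product handles it, and the tiny limit of 4 is made up.

```python
def fit_to_context(tokens, max_context):
    """Keep only the most recent tokens that fit in the context window."""
    if max_context <= 0:
        return []
    return tokens[-max_context:]

conversation = ["first", "question", "then", "a", "long", "follow", "up"]

# With a tiny hypothetical limit of 4 tokens, the start of the conversation is lost.
print(fit_to_context(conversation, 4))
```

Anything dropped this way is invisible to the model on the next call, which is why early parts of a long conversation appear to be forgotten.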
Anthony Alford: Yes, the maximum context length is built into the model. And there's a problem: the algorithmic complexity is the square of that context size. As the context gets bigger, the model gets bigger as the square of that, and the runtime increases as the square of that, et cetera.
Efficiency and power consumption [19:15]
Thomas Betts: That's where you're getting into the efficiency of these models. There's been some discussion of how much power is being consumed in data centers all around the world to build these models, run these models, and that's one of those things that you can get your head around. If you have this thing it takes...
Anthony Alford: It's an awful lot.
Thomas Betts: It's a lot. It's an awful lot. Say it takes 30,000, 32,000 tokens and you're saying the square of that, that suddenly gets very, very large.
Anthony Alford: Oh, yes. Not only does it grow as a square of that, but it's like there's a big multiplier as well. Training these models consumes so much power, only the people who do it know how much. But really they're just looking at their cloud bill. Nobody knows what the cloud bill was for training GPT-3 or 4, but it's a lot.
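A quick calculation shows what "the square of that" means for the 32,000-token figure mentioned above. The other two context lengths are arbitrary comparison points, and the function counts only token-to-token pairs, ignoring the "big multiplier" Anthony mentions.

```python
def attention_pairs(context_length):
    """Token-to-token comparisons when every token is compared with every token."""
    return context_length * context_length

for length in (1_000, 32_000, 128_000):
    print(f"{length:>7} tokens -> {attention_pairs(length):>14,} pairs")
```

Going from 1,000 to 32,000 tokens is a 32x increase in input size but a 1,024x increase in the number of pairs.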
Thomas Betts: Yes, that's why people are looking not to build your own model. Most people are not in the business of needing to create their own LLM. These things are done, but people are using them to replace Google searches. One of the problems is you don't have the context because the model wasn't trained on current events. It's not searching Google and giving you results. It's just predicting words.
Anthony Alford: Exactly. Now they are trying to build that in. If you use Bing, Bing is actually using GPT-4, and it will include search results in its answer, which when we get to the... I don't want to spoiler, when we get to RAG, we can talk about that.
Transformers - GPT means Generative, Pretrained Transformer [20:43]
Thomas Betts: Well, let's leave RAG off to the side a little bit. Let's dig a little bit into transformer without rewriting the entire history. I think you and Roland have talked about that a little bit on your podcast.
Anthony Alford: Right, we've mentioned LLMs in general and the GPT family in particular. Well, the T in GPT stands for transformer, and this was something that a Google research team came up with in 2017. They wrote a paper called Attention is All You Need. They were working on translation and before that the translation models were using recurrence, which is different from what we were talking about with autoregression. Anyway, they came up with this model that really just uses a feature called attention or a mechanism called attention. They called it the transformer.
Now really all the language models are based on this. That's what the T in GPT stands for. GPT stands for generative pre-trained transformer, and they all use this attention mechanism. You could think of attention as a way for the model to pick out what's important in that input sequence. The word is, I think, sometimes used in... It's similar to information retrieval, so it uses a lot of concepts like queries and keys and values. But at a high level, it's a way for the model to... Given that input sequence, identify the important parts of it and use that to generate the next token.
Attention is weighting the input [22:13]
Thomas Betts: It might throw out some of your input or recategorize and say, "These are the important words in that context".
Anthony Alford: The mathematics is, it finds keys that match the query and then returns the values that are associated with those. A lot of times it does focus on certain parts of the input versus other pieces.
Thomas Betts: That's where weighting comes into play, right?
Anthony Alford: Exactly. That's how it is.
Thomas Betts: You mentioned that these matrices have weights on them. It's going to figure out which words or parts of that input, and that one word doesn't always have the same weight. It's in the context of that input, it might have more weight.
Anthony Alford: Yes, you did a better job explaining that than I did.
Thomas Betts: It's not my first time trying to explain this. I get a little bit better every time. Again, one of the points of why I wanted to do this episode.
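The query, key, and value idea can be sketched as scaled dot-product attention for a single query. This is a simplified illustration: the vectors are hand-picked, whereas in a real transformer the queries, keys, and values are computed from the input by learned weight matrices.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # How well does each key match the query?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted sum of the value vectors.
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, output

# Three input positions, each with a key and a value (illustrative numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = attention(query, keys, values)
print(weights, output)
```

The second position's key does not match the query, so it receives a much smaller weight than the other two. A different query would produce different weights for the same inputs, which is the sense in which a word's weight depends on context.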
Adding an LLM to your product [23:03]
Thomas Betts: We've got transformers, it's just a term, and the attention, that's how we're figuring out what goes in. That outputs, in the case of GPT outputs, GPT, but that's a branded term. LLM is the generic term, right?
Anthony Alford: Right.
Thomas Betts: It's like Kleenex versus tissue. Let's say I want to use one of these LLMs in my application. This is the thing that my product owner, my CEO is like, "Put some AI on it". I want to look like we're being innovative. We've got to have something that is this predictive thing like, "Look at how it looked at our model and comes up with something". How do we go about doing that?
Anthony Alford: Can I plug an InfoQ piece already? Just earlier this year I edited the eMag, the Practical Applications of Generative AI e-magazine. And we had several experts on LLMs in particular to talk about this. Definitely recommend everybody read that, but what they recommended is... You have publicly available commercial LLMs like GPT for ChatGPT. There's also Claude. There's also Google's Gemini. AWS has some as well. Anyway, if you find one of these that seems to work, try it out. So you can quickly adopt LLM functionality by using one of these commercial ones. It's just an API. It's a web-based API. You call it using an SDK, so it looks like any kind of web service.
That's number one. Number two, for long-term cost maybe, right? Because it's a web service and API, like we said, we're paying per token. It's actually probably pretty cheap. But longer term there's cost concerns, and there may be privacy concerns because these commercial LLMs have gotten better at their promises about, "We're not gonna keep your data. We're going to keep your data safe". But there's also the data that it gives you back in the case of, say, like code generation.
I think there was a lawsuit just recently. I think people whose code was used to train this, they're saying that this thing is outputting my code, right? There's concerns about copyright violation. Anyway, longer term, if you want to bring that LLM capability in house, you can use one of these open source models. You can run it in your own cloud, or you can run it in a public cloud but on your own machine. Then you have more control over that.
Thomas Betts: Yes, it's kind of the build versus buy model. Right?
Anthony Alford: Exactly.
Thomas Betts: And I like the idea of, "Let's see if this is going to work". Do the experiments. Run those tests on the public one and maybe put some very tight guardrails. Make sure you aren't sending private data. I think it was to plug another InfoQ thing. Recently the AI, ML trends report came out. I listened to that podcast. That was one where it mentioned that because they were setting up so many screens to filter and clean out the data before sending it to OpenAI or whichever API they were using, that scrubbed out some of the important context and the results coming back weren't as good. Once you brought the model in house and you could say, "Oh, we own the data. It never leaves our network. We'll send it everything". All of a sudden your quality goes up too.
Anthony Alford: It's definitely very easy to experiment with. And if you find that the experiment works, it may make sense to bring it in house. There's the short answer.
Hosting an open-source LLM yourself [26:36]
Thomas Betts: Like you said, "If you want to pay per use and it's easy to get started". That's one way to go. When you're talking about bringing in house, you mentioned you can have it on your own cloud. Like we're on Azure, AWS. Is that basically I spin up an EC2 instance and I just install my own.
Anthony Alford: That's one way. Of course, the service providers like AWS are going to give you a value add version where they spin it up for you and it's very much like the regular model where you pay per use. But yes, you could do that. You could do it right on EC2.
Thomas Betts: Yes. Are you doing the product as a service, the platform as a service, the infrastructure as a service, then you can do whatever you want on it. Your results may vary, but that might be another way to do that next phase of your experiment as you're trying to figure out what this is. How easy is it for me to spin up something, put out a model there and say, "Okay, here's our results using this public API, and here's if we bring it in house with our private API". Maybe you look at the cost. Maybe look at the quality of the results.
Anthony Alford: Yep, for sure.
Comparing LLMs [27:37]
Thomas Betts: How are people comparing those things? What is the apples to apples comparison of, "I'm going to use OpenAI versus one of the things I pull off of Hugging Face?"
Anthony Alford: This is actually a problem. As these things get better, it's tricky to judge. In the olden days where we had things like linear regression and we had that supervised learning where we know the answer, we can get a metric that's based on something like accuracy. What is the total sum of squared error? But nowadays, how good is the output of ChatGPT? Well, if you're having it do your homework, if you get an A, then it was pretty good. And, in fact, believe it or not, this is very much a common thing that they're doing now with these models is they're saying, "We train this model, it can take the AP Chemistry exam and make a passing grade".
Another thing I see a lot in the literature is if they're comparing their model to a baseline model, they'll have both models produce the output from the same input and have human judges compare them. It's like Coke versus Pepsi, which four out of five people chose Pepsi. And even more fun is do that, but with ChatGPT as the judge. And believe it or not, a lot of people are doing that as well. I guess the answer is it's not easy.
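The side-by-side comparison boils down to counting preferences. In this sketch the list of judgments is invented, and each entry could come from a human judge or from another model acting as judge.

```python
from collections import Counter

def win_rates(judgments):
    """Fraction of comparisons won by each model."""
    counts = Counter(judgments)
    total = len(judgments)
    return {name: counts[name] / total for name in counts}

# Each entry is the model whose output a judge preferred for one input.
judgments = ["candidate", "candidate", "baseline", "candidate", "candidate"]

print(win_rates(judgments))
```

With only five comparisons the result is not very trustworthy; a real evaluation would need many more inputs, and ideally more than one judge per input.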
Thomas Betts: Yes, that's where I tend to say these things are non-deterministic. You talked about the probability, you don't know that the answer is going to come out. Your test is not... I asked this question, I got this answer. Because you don't necessarily know what types of questions are going to be going in, so you don't know what outputs are going to come out.
Anthony Alford: Yes, exactly. That's actually one of the most scary things is you don't know what's going to come out. Something very unpleasant or embarrassing might come out and that's really got people concerned about using these things in production environments.
Thomas Betts: Yes.
Before you adopt LLMs in your application, define your success criteria [29:38]
Anthony Alford: But I will say one thing... Again, going back to the e-magazine, one of my experts said, "Before you adopt LLMs in your application, you should have good success criteria lined out for that". That may be the harder part to do. How will I know if it's successful? It's going to depend on your application, but it's something you should think hard about.
Thomas Betts: Well, I like that because it puts back the question on the product owners. The CTOs are saying, "I need some AI in it". What do you want to have happen? Because there's a lot of places where you shouldn't put AI. I work on an accounting system. You should not have it just guess your books.
Retrieval-Augmented Generation (RAG) should be an early step for improving your LLM adoption [30:19]
Thomas Betts: When we're talking about using these for ourselves, whether we're hosting them or bringing them in house, how do we get those better quality results? Do we just use them out of the box? I had a podcast a while ago and learned about retrieval augmented generation. I hear RAG talked about a lot. Give me the high level overview of what that is and why that should be a first step to make your LLM adoption better.
Anthony Alford: Again, on my expert panel, they said, "The first thing is to try better prompts". We've probably heard of prompt engineering. We know that the way you phrase something to ChatGPT makes a big difference in how it responds. Definitely try doing stuff with prompts. The next step, retrieval augmented generation or RAG. I think we mentioned, the LLMs, they're trained and they don't know anything that happened after that training. If we ask who won the football game last night? It doesn't know, or it might not say it doesn't know, it might actually make up something completely not true. This is also a problem for a business where you want it to know about your internal knowledge base, right? If you want it to know things that are on your Wiki or in your documentation, things like that. What RAG is is you take your documents, you break them up into chunks, but essentially you take a big chunk of text and you run it through an LLM that generates a single vector for that chunk of text.
This is called an embedding. And that vector in some way encodes the meaning of that text. You do this with all your documents and then you have a database where each document has a vector associated with it that tells you something about its meaning. Then when you go and ask the LLM a question, you do the same thing. You take your question and you turn that into a vector, and the vector database lets you quickly and efficiently find vectors that are close to that and therefore are close to your question in meaning. It takes the content from that and shoves that into the LLM context along with your question. And now it knows all that stuff along with your question. We know that these LLMs are very good at... If you give it a chunk of text and say, "Explain this". Or, "Here's a question about this chunk of text". It is quite good. That's what the attention mechanism does, is it lets it find parts of that chunk of text that answer the question or solve the problem that you're asking.
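As a rough illustration of the flow described above (chunk the documents, embed each chunk, embed the question, retrieve the closest chunk, and put it in the context), here is a minimal Python sketch. It rests on loud assumptions: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the sample chunks are invented, and `build_prompt` only assembles the text that would be sent to an LLM. No real LLM or vector database API is called.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): word counts as a vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks):
    """Embed every chunk once, keeping the vector next to its source text."""
    return [(embed(chunk), chunk) for chunk in chunks]


def retrieve(index, question, k=1):
    """Embed the question the same way and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine_similarity(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, context_chunks):
    """Place the retrieved chunks in the context alongside the question."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the question using only these documents:\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical knowledge base chunks, invented for this example.
chunks = [
    "To reset your password, open Settings and choose Reset Password.",
    "Invoices are emailed on the first day of each month.",
    "The mobile app supports offline mode on Android and iOS.",
]
index = build_index(chunks)
question = "How do I reset my password?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the word-count vectors would be replaced by embeddings from a model, and the sorted scan by a vector database query; the shape of the pipeline stays the same.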
Thomas Betts: The way I've heard that explained is, let's say I do my search and instead of me writing a really elaborate prompt, because I'm willing to sit there and type for 30 seconds. That's all the words I'm going to come up with. Instead, I would say, "Answer the question based on these documents". And I can give all those documents in the context and now it knows, "Okay, that's what I'm going to use". I'm not going to use just my base level LLM predict the next word. I'm going to predict the next word based on this context.
Anthony Alford: Right. And the retrieval part is finding those documents automatically and including them in the context for you. That's the key component... If you actually know the documents, and let's say somebody gave you, "Here's our user manual, answer questions about it". Which is a pretty cool use case for someone who's, say, in customer service. If the user manual is small enough to fit into the context, which it probably is at hundreds of thousands of tokens, then that's great. But maybe you don't have that. Maybe you have a bunch of knowledge base articles. This will go and find the right knowledge base article and then answer the question based on that.
Thomas Betts: Right, because our knowledge base has tens of thousands of articles as opposed to a couple of hundred pages.
Anthony Alford: Exactly.
Thomas Betts: And you're still using the LLM, which has all of its knowledge of, "Here's how I complete a sentence".
Anthony Alford: Yep.
Fine-tuning is one option to make an LLM better suited for your needs [34:07]
Thomas Betts: You are not building a new model based off of your knowledge base or your training documents.
Anthony Alford: Exactly. But let's say you did want to do that and that might be a better solution in some cases, this process is called fine-tuning. I mentioned the T in GPT was transformer. The P is pre-trained. This is a whole subfield of machine learning called transfer learning where you train a model, you pre-train it and it's general purpose. Then you can fine-tune it for a specific case. In the case of GPT-2, 3 and higher, they found out you don't need to. It's pretty good on its own. But what the fine-tuning does is it's additional training on that model. Instead of using the model as is, you restart the training process. You've got your own set of unit tests. You've got your own fine-tuning data where you know the inputs, you know the outputs. The advantage is that for fine-tuning, that data set can be much smaller than what is needed to train the full GPT.
Thomas Betts: And that's because you're starting from what already exists.
Anthony Alford: Exactly. Right.
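To make "starting from what already exists" concrete, here is a deliberately tiny Python sketch. It is an assumption-heavy toy, not how an LLM is fine-tuned: the "model" is a one-variable linear function, the data sets are invented, and plain gradient descent stands in for the training process. It only shows that a few training steps on a small task-specific data set get much closer to the target when they start from pre-trained weights than when they start from zero.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(w, b, xs, ys, steps, lr=0.1):
    """Run `steps` iterations of gradient descent, starting from the given weights."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# "Pre-training": many steps on a larger, general data set (here y = 2x).
general_xs = [i / 10 for i in range(-10, 11)]
general_ys = [2.0 * x for x in general_xs]
pre_w, pre_b = train(0.0, 0.0, general_xs, general_ys, steps=500)

# Fine-tuning data: only three labeled examples from a related task (y = 2.5x + 0.5).
task_xs = [-1.0, 0.0, 1.0]
task_ys = [-2.0, 0.5, 3.0]

# Same small training budget, two different starting points.
tuned_w, tuned_b = train(pre_w, pre_b, task_xs, task_ys, steps=5)
scratch_w, scratch_b = train(0.0, 0.0, task_xs, task_ys, steps=5)

tuned_loss = mse(tuned_w, tuned_b, task_xs, task_ys)
scratch_loss = mse(scratch_w, scratch_b, task_xs, task_ys)
print(f"fine-tuned from pre-trained weights: {tuned_loss:.4f}")
print(f"trained from scratch:               {scratch_loss:.4f}")
```

The fine-tuned model ends with a lower loss on the task data than the one trained from scratch with the same five steps, which is the intuition behind needing far less data and compute for fine-tuning than for pre-training.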
Thomas Betts: You're not starting from baseline or nothing. It's just saying tweak your model. That's going back to things that I understood, again, at a superficial level with machine learning training is like, "You can overtrain the data". If you give too many answers in one area, it's like, "Look, we got to a 99.9%". But then something comes in and it doesn't know about and it has no answer. It's way off base. In this case, if I'm trying to get the model to be very specific to my company's applications, my data, that might be the desired outcome. I don't want someone to be using my customer service chatbot to ask about when's the next Taylor Swift show?
Anthony Alford: Yes, exactly. In fact, the original ChatGPT and newer models, they do fine-tune them to give more helpful answers and follow instructions. This is something with GPT-3.5, again, that model is pre-trained on basically the whole internet, and it could give you answers that were pretty good, but they found that sometimes it would just give you answers that were... It's that whole joke about this is technically true but not at all useful. So they fine-tuned it to give you answers that are more helpful and to follow instructions. They call it alignment, and the way they do that is they have a small data set of, "This was the input. Here's the output you gave, but this output here is better". They fine-tune it to work towards the more appropriate output.
Vector databases provide nearest-neighbor searches [36:45]
Thomas Betts: I need to back up just a little bit. When you mentioned we're going to create these vectors, I'm going to have a vector database, I'm going to do a vector search. Another one of those terms that gets thrown around and people are like, "Well, do I have a vector database?" I think Azure just announced that they're going to... I think it's in beta right now. Basically turn your Cosmos database into a vector database, like flip a checkbox in the portal and all of a sudden you have vectors. What does that do for me? Why is that an advantage?
Anthony Alford: Okay, I have an upcoming podcast on this very problem. We mentioned that for a chunk of text you can create from that a vector that encodes the meaning. The vector is very high dimensional. It's hundreds, maybe thousands of dimensions. You're trying to solve the problem of given one vector, how do you find those vectors in the database that are close to that input vector? You could just run through all of them. You do a table scan basically, and sort the output. That probably is actually fine for a small data set, but the complexity is high enough that at scale it's not going to perform great. What you need is something more like a B-tree lookup, which is log-N. The vector database is actually... The vectors are probably not the important part, it's the nearest neighbor search. This is the problem we're solving, is given a vector input what are the nearest neighbors to that vector in your database? That's the problem that you want to solve in an efficient, scalable way.
Thomas Betts: Got you. It's going through and looking at my data and saying, "Here are the vectors for all the parameters". And based on that, these are related words...?
Anthony Alford: Well, no, it literally doesn't look at that. It's just given two vectors. How close are...
Thomas Betts: How close are the vectors? It doesn't know what it came from?
Anthony Alford: Exactly, right. Now, once it finds the ones that are close, then those are in the same database row or there's a pointer to the content that it came from, which is what you actually care about.
Thomas Betts: Got you.
Anthony Alford: But the database, its purpose is to do the nearest neighbor search where you give it a vector and it finds the top K in its database that are closest to it.
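The "table scan and sort" baseline mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three-dimensional vectors, row ids, and content strings are made up, Euclidean distance is one of several possible closeness measures, and `top_k_nearest` is a hypothetical helper name. It compares the query against every stored vector, which is the linear-time approach a vector database's index is meant to avoid at scale.

```python
import heapq
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(rows, query, k):
    """Brute-force top-K search: score every row against the query vector.

    Each row is (row_id, vector, content). The search only looks at the
    vectors; the content rides along so it can be returned afterwards.
    """
    return heapq.nsmallest(k, rows, key=lambda row: euclidean(row[1], query))


# Invented rows: real embeddings would have hundreds or thousands of dimensions.
rows = [
    (1, [0.9, 0.1, 0.0], "How to reset a password"),
    (2, [0.0, 0.8, 0.2], "Monthly invoice schedule"),
    (3, [0.8, 0.2, 0.1], "Recovering a locked account"),
    (4, [0.1, 0.1, 0.9], "Offline mode in the mobile app"),
]
query = [1.0, 0.0, 0.0]

nearest = top_k_nearest(rows, query, k=2)
nearest_ids = [row_id for row_id, _, _ in nearest]
for row_id, _, content in nearest:
    print(row_id, content)
```

Every query here touches all N rows. Vector databases typically replace this scan with an index, often an approximate one, so that finding the top K stays fast as the number of rows grows.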
Thomas Betts: Yes. This is where, I think, we're going back to the beginning that AI as a product isn't something that exists. We've had fuzzy search techniques for a while. This has been something people have wanted and everyone's gotten used to Google. I can type in whatever I want, and it figures out. Like you said, you take the stems of the words... This is another one of those, I didn't give you exactly what the answer asked for. So it's not a find this row in the database, but find records that are close to what I intended and that's what they're doing.
Anthony Alford: Yes, I think you might find this referred to as semantic search or maybe neural search. The neural meaning that's how the vectors are generated from a neural network.
Thomas Betts: But it's all about that. I don't have a specific thing in mind to find the intent.
An LLM is a tool to solve natural language processing (NLP) problems [39:45]
Thomas Betts: I guess LLMs really fall under in my head the category of natural language processing, right?
Anthony Alford: Yes, exactly.
Thomas Betts: Because that used to be a thing. I had data scientists on my team who were working in the field of natural language processing. Is that still a thing? Is that just a subset or has it just gotten overwhelmed in the news by LLMs?
Anthony Alford: I think you could think of an LLM as a tool to solve natural language processing problems. For example, we used to look at things like named-entity recognition, parts of speech recognition, that kind of thing. That's still something you have to do, but an LLM can do it.
Thomas Betts: Right.
Anthony Alford: And it can do it pretty well and works out of the box. If you look at what, I think... Again, we were talking about Google and Attention is All You Need. They came up with a version of that called BERT, and it would do this stuff like named entities and part-of-speech tagging and things like that.
LLMs are useful because they are very general, but that does not make them general AI [40:44]
Thomas Betts: Got you. And that's one of those LLMs are these generalists. Find ways to make them more specific. If you have a specific use case in mind, you can go down a fine-tuning route. You can find a different model that's just closer, and that's going to have those benefits of... It's going to cost less to run. It's probably going to be better quality answers. It's probably going to return faster, I'm assuming if it's less computational.
Anthony Alford: Yes, this is one of the reasons people are excited about LLMs is that they are very general. That's one of the things where people started saying, "Is this general AI?" That's been the holy grail of AI research forever. Yes, we can make a program that plays chess very well, but it can't drive a car. The holy grail is to build one model that can solve just about any problem. Just if we flatter ourselves as human beings, we can do lots of different tasks. We can do podcasts. We can build model race cars or read books. The holy grail of AI is one model to rule them all, and LLMs could do so much without additional training. That's what one of the early GPT papers was like, "Look, we built this thing out of the box. It can do summarization, question answering, translation, code generation, all these tasks". And that's one of the things that got people really excited about it. It looks like it could do everything.
Thomas Betts: Yes, I think it's the now how do we use it? Because that seems so powerful. But going back to your point, you need to have a specific output in mind. What is your goal? Why would you add this? Because it sounds exciting. Everyone wants to use it, but you have an intent of how does that fit into your product? How does that fit into your solution?
Anthony Alford: Yes. It's always about what business problem am I trying to solve? And how do I know if I succeeded?
AI copilots versus AI agents [42:38]
Thomas Betts: We're over on time. I'm going to have one last bonus round question and we'll wrap it up.
Anthony Alford: Yes.
Thomas Betts: A lot of people talk about having AI Copilots. I can't remember how many Microsoft Copilots and GitHub Copilots. Everything's a copilot. Distinguish that from an AI agent because that's another term that's being thrown around. They both sound like the embodiment of this thing as a person. There's a whole different discussion about that. But these are two different things. What's a co-pilot versus an agent?
Anthony Alford: I think we did talk about this on the trends podcast. An agent has some degree of autonomy. The co-pilot is you've got to push the button to make it go eventually. Again, I don't want to turn this into AI fear. The fear that people have of AI is autonomous AI, in my opinion. If we can mitigate that fear by keeping things as co-pilots, then maybe that's the way to go. But I think the key is autonomy. You have to agree to the co-pilot's answer and make it go.
Thomas Betts: The agents can do stuff on their own, but maybe we have supervisor agents. And like you said, "I don't know how to tell the output, so I'm going to ask ChatGPT, 'Did I train my model correctly?'" And you feed it back into yet another AI. The AI agent story is you have supervisor agents who watch the other ones, and then it's who's watching the watchers?
Anthony Alford: Watches the watchers? Yes, indeed.
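The autonomy distinction drawn above can be expressed as a small Python sketch. Everything here is hypothetical and for illustration only: the function names, the `approve` callback standing in for a human pushing the button, and the sample actions are assumptions, not any product's API. The only difference between the two functions is whether a proposed action needs approval before it runs.

```python
def run_as_copilot(proposals, approve, execute):
    """Co-pilot style: each proposed action runs only if the human approves it."""
    executed = []
    for action in proposals:
        if approve(action):
            execute(action)
            executed.append(action)
    return executed


def run_as_agent(proposals, execute):
    """Agent style: proposed actions run on their own, with no approval step."""
    for action in proposals:
        execute(action)
    return list(proposals)


log = []  # stands in for the outside world that the actions affect
proposals = ["draft reply to customer", "issue refund"]

# The human approves the draft but declines the refund.
copilot_done = run_as_copilot(
    proposals,
    approve=lambda action: action != "issue refund",
    execute=log.append,
)
agent_done = run_as_agent(proposals, execute=log.append)
print(copilot_done)
print(agent_done)
```

A supervisor agent, as mentioned in the conversation, would amount to passing another model's judgment in as the `approve` callback instead of a person's.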
Wrap-up [43:57]
Thomas Betts: Well, I really appreciate all your time. I learned a lot. I hope this was useful for the audience.
Anthony Alford: Me too.
Thomas Betts: It's always good to go through and do this little refresher of here's what I think I understand, but bounced off someone who really knows. I'll be sure to provide links to the things we mentioned. The eMag is great.
Anthony Alford: Yes.
Thomas Betts: Then the trends report and podcast and some other stuff. Anthony, thanks again for joining me on the InfoQ Podcast.
Anthony Alford: It was a pleasure. Thanks for having me.
Thomas Betts: And we hope you'll join us again next time.