KBall and returning guest Tejas Kumar dive into the topic of building LLM agents using JavaScript. What they are, how they can be useful (including how Tejas used home-built agents to double his podcasting productivity) & how to get started building and running your own agents, even all on your own device with local models.
Sponsors
Socket – Secure your supply chain and ship with confidence. Install the GitHub app, book a demo or learn more
Neon – Fleets of Postgres! Enterprises use Neon to operate hundreds of thousands of Postgres databases: Automated, instant provisioning of the world’s most popular database.
Chapters
Chapter Number | Chapter Start Time | Chapter Title | Chapter Duration |
1 | 00:00 | It's party time, y'all | 00:56 |
2 | 00:56 | Hellooo party people | 00:45 |
3 | 01:40 | Welcoming back Tejas 👀 | 02:13 |
4 | 03:53 | What makes an agent | 03:31 |
5 | 07:24 | When to use agents | 02:16 |
6 | 09:40 | Who builds these? | 01:34 |
7 | 11:14 | Ollama is... | 02:45 |
8 | 13:59 | The agent framework | 02:14 |
9 | 16:13 | Sponsor: Socket | 02:49 |
10 | 19:02 | From the ground up | 05:42 |
11 | 24:44 | Gotchas | 04:11 |
12 | 28:55 | Running locally | 03:45 |
13 | 32:40 | Levels of local resources | 03:55 |
14 | 36:34 | Shortcomings | 05:31 |
15 | 42:05 | Jumping to binaries | 01:28 |
16 | 43:32 | Sponsor: Neon | 03:33 |
17 | 47:05 | Where to use this stuff | 04:01 |
18 | 51:06 | Decision making | 01:06 |
19 | 52:12 | Just (useful) tools | 03:37 |
20 | 55:50 | Closing time | 02:11 |
21 | 58:01 | Next up on the pod | 01:20 |
Transcript
Play the audio to listen along while you enjoy the transcript. 🎧
Hello, JS Party people. I’m Kball, I’m your host today, and I am doing another one of these fun deep-dive interviews. I’m joined today with Tejas Kumar. Tejas, how are you doing, man?
Hey, it’s good to be here. I’m doing well. Thanks.
Thanks. I’m excited to have you on. I think we’ve had you on the show once before, maybe a year or two ago, and we got deep into like the vibes, and how to have good energy, and succeed… Today we’re doing a more focused technical topic; you have been getting really into a topic that’s interesting to me, which is AI or LLM agents, and doing it in JavaScript, which is different than what I’ve been doing it in. So I’m really excited to get into that. But let’s maybe kind of start – give our listeners a little bit of a background if they didn’t listen to that old episode, who you are, how you got into this stuff… And it feels like a bit of a shift from where you started… So yeah, kind of curious how you got here.
Yeah, I’m Tejas. If you didn’t listen to the last episode, no worries. I have been a web engineer for most of my life. I got into it as a kid, from age eight, just sort of building things with Frontpage, and Dreamweaver, and had an internship at 15… I’d say that’s where my professional career began, and then I’ve just sort of been doing that for the past 16 years.
Eventually, some people, some leaders recognized that I had a gift for communication, specifically I’m talking about Guillermo Rauch, from Vercel, who asked me to come lead Developer Relations at Zeit back then for a little while… And that was sort of my getting into developer relations, which was where I was when we did the last JS Party episode. I was the director of Developer Relations at Xata, which is a serverless database company. It’s run by some great friends.
And today, as you mentioned, I’m doing a lot of AI. I work as a developer relations engineer for generative AI, or Gen AI, at DataStax. And my whole job is to live and breathe Gen AI, and understand it as deeply as I can, so I can teach it with as much quality and as much fidelity as I can. So DataStax is heavily focused on RAG. That’s like bringing real-time context into prompts that we send to large language models, and so we help them come up to date and hallucinate less. But there’s also the whole other side of the equation, which is agentic workflows… Which is what I’ve been spending a lot of my sort of extracurricular time on, let’s say… This technique, RAG, for those listening, stands for Retrieval-Augmented Generation, and essentially, as a developer, you retrieve some data and you give it as a prompt to the LLM, and you use it to augment the generated output. So that’s RAG. But with agentic workflows, this changes a little bit, where instead of you the developer doing the retrieval, the LLM itself can sort of recognize when it’s time to call the tool. That’s actually just one [unintelligible 00:03:41.19] I’m happy to talk about agent workflows in a broader sense, but RAG initiated by agents, where they themselves retrieve data, is something that I’ve also been working on. So yeah.
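To make that distinction concrete, here is a minimal sketch of developer-driven RAG in TypeScript: the code retrieves context up front and augments the prompt before the model generates. The helper names (`retrieve`, `complete`) are hypothetical placeholders standing in for whatever retrieval layer and model call you use, not any specific SDK mentioned in this episode.

```ts
// Minimal developer-driven RAG sketch. The injected functions are placeholders:
// `retrieve` is your similarity search, `complete` is any LLM call
// (a hosted API or a local model behind an HTTP endpoint).

type Doc = { id: string; text: string };

type Deps = {
  retrieve: (query: string, limit: number) => Promise<Doc[]>;
  complete: (prompt: string) => Promise<string>;
};

export async function answerWithRag(question: string, deps: Deps): Promise<string> {
  // 1. Developer-driven retrieval: fetch documents relevant to the question.
  const docs = await deps.retrieve(question, 3);
  const context = docs.map((d) => d.text).join("\n---\n");

  // 2. Augment the prompt with the retrieved context - the "AG" in RAG.
  const prompt = [
    "Answer the question using only the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");

  // 3. Generate the answer grounded in that context.
  return deps.complete(prompt);
}
```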
Yeah. Awesome. Well, yeah, let’s maybe start with agent, because I feel like that is a word that gets thrown around a lot. Sometimes I feel like people use agent to mean anything that is “I don’t understand what it means, but it’s going to do something for me.” So how do you define what an agent is in this sort of new world?
Yeah, that’s a good question. I just follow a number of experts’ definitions of this thing. I tend not to try and coin terms myself, mainly because I’m just not very credentialed, if we’re being honest… So how do I see agents? I summarize it, I summate it - I’m trying to find the right word - I deduce it from definitions from industry experts who have done it before me. So people like Andrew Ng, the founder of Coursera, and now the founder of DeepLearning.AI - I think he’s got some great content about this, where he defines agentic workflows as workflows that have LLMs perform three tasks, either all three or a subset of them. And those are reflection, meaning generate some output and reflect on it, “Is it good, is it not?”, and then iteratively work on it until it cannot be improved further. So there’s reflection. There’s tool calling, as I mentioned, with RAG, where the large language model will, sort of like a human being, recognize – for example, if you ask me to do a complex calculation, like 324 divided by 9 times 7, I’ll just be like “It’s time to get a calculator.” I’ll recognize that this is the sort of boundary of my capabilities, and go use a tool. So number two is tool calling. And number three - I think it was agent collaboration, where you have – yeah, it’s LLM as judge. It’s this model where a capable model (pun intended) coordinates less capable models towards an outcome you want. So it’s like GPT-4o, being the most capable of OpenAI’s models, would orchestrate like three or four different GPT-3.5 Turbo models that are doing various tasks or generations. And so those three, either one of them or all of them, make up, according to Andrew Ng, an agentic workflow.
[05:47] According to David Khourshid, AI agents are an implementation of the actor model, which is just a programming model where you have an entity called an actor, that sort of acts in response to observing its environment. So the classic implementation of the actor model is Pac-Man. Actually, a great example of AI, but rule-based AI, where the rules are known ahead of time, is Pac-Man, where you have Pac-Man, the little yellow pizza thing, and it’s observing the environment: where are the ghosts, where are the cherries, where are the dots… And you as the player take on the role of the actor. But there’s also demo mode, where the actor model is in play. And according to David Khourshid, this implements agentic workflows. However, it’s rule-based, it’s not generative, but it’s still an agentic workflow, where Pac-Man is an agent.
So I just take a mishmash of those two - these are the preeminent leaders in the space in my mind - and marry them, and that’s the working definition that I have for an agent. So it’s not a sentence, it’s not a nutshell, but I’m trying to give you more sort of a broad framework of how I see agent workflows. I have seen this term abused, where people will build – maybe abused is too strong… But people will build a custom GPT; this is a feature you can use from OpenAI’s GPT-4. They’ll just build a custom GPT, add a system prompt, add some knowledge that GPT-4 can do RAG on, and call this an agent. I disagree; I don’t think that’s an agent, that’s just a RAG application. It doesn’t really do any of the things like we talked about: reflection, tool calling, collaboration, or observing an environment and responding accordingly. So I’d say those four tenets make an agent an agent.
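For a concrete feel for the first of those tenets, here is a rough sketch of the reflection pattern in TypeScript: generate a draft, have the model critique it, and revise until the critique says it cannot be improved (or a round budget runs out). The `complete` function is a stand-in for whatever model call you use, local or hosted; it's an assumption for illustration, not a specific API.

```ts
// A rough sketch of the "reflection" agentic pattern:
// draft -> critique -> revise, looping until the critic is satisfied.
// `complete` is a placeholder for whatever LLM call you use.

type Complete = (prompt: string) => Promise<string>;

export async function reflectAndImprove(
  task: string,
  complete: Complete,
  maxRounds = 3
): Promise<string> {
  let draft = await complete(`Complete this task:\n${task}`);

  for (let round = 0; round < maxRounds; round++) {
    const critique = await complete(
      `Task: ${task}\n\nDraft:\n${draft}\n\n` +
        `Critique the draft. If it cannot be improved further, reply with exactly "DONE".`
    );

    // Stop iterating once the model judges the draft good enough.
    if (critique.trim() === "DONE") break;

    // Otherwise feed the critique back in and ask for a revision.
    draft = await complete(
      `Task: ${task}\n\nPrevious draft:\n${draft}\n\nCritique:\n${critique}\n\n` +
        `Rewrite the draft, addressing the critique.`
    );
  }

  return draft;
}
```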
Interesting. Maybe another way we could take this is like what would you use an agent for? When is this the right tool to sort of pull out of the tool chest?
Yeah, I think the term agent is so cool, because it applies to human beings, and we tend to anthropomorphize AI a little bit… I’m not saying that’s right or wrong, I’m not qualified to make that judgment, but I think it’s the same. When would you use a human agent? And this maybe will trigger some doomers, like “Oh my gosh, they’re going to take our jobs.” I think it will take jobs. There’s no question about that. And so I think it’s good to be prepared.
So for example, I have a podcast - we actually use Riverside, the application we’re using to record this… And Riverside has some great capabilities with webhooks, as does cal.com, which is a really great scheduling tool. And so what we use agents for is to orchestrate, across a variety of webhooks, operations that a team of people would do. And so my podcast is run entirely by some AI agents. It’s not fully automated end to end, but for example if you schedule an episode with me, depending on which scheduling link you use - one is experimental, where we experiment and use the agent workflow, and the other one is just manual. So if you schedule with an agent link, what will happen is, as soon as the event is scheduled, it will immediately fire off an agentic task to discover you. So “Who is Kevin Ball? What – okay, this email address in the calendar invite… Let’s go find where it occurs on the internet. Okay, GitHub, okay, Twitter, okay, Google…” And then it’s going to find out things that you’re passionate about.
The whole point of my podcast, Contagious Code, is to take what people are passionate about, and make that contagious to the listeners. And so it will find – literally, this is the task that the agent has… It finds what you’re passionate about, and then it will construct a discussion outline for the length of our discussion, and then upload that to probably GitHub or Google Drive. We’re still deciding. Right now it’s a gist on GitHub. So if you go to my GitHub gists, there’s a bunch. The agent is using my access token, so it’s the one making the discussion outlines. And then it will attach that link to the calendar invite.
So then when we come to record, we both have the discussion outline. The job is done. After we record, there’s similar post-processing steps. All of this used to be done by people, but doesn’t need to be anymore. So where would I use agents? The same place I would use human agents, as much as I can.
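For a sense of the shape of such a pipeline (Tejas’s actual code isn’t public), the scheduling-webhook flow described above might be wired up roughly like this. Every name here is a hypothetical placeholder, not a real API from Riverside, cal.com, or GitHub.

```ts
// Hypothetical sketch of the webhook-to-outline pipeline described above.
// The Agent, Gists, and Calendar types are placeholders for whatever
// agent runtime and API clients you actually use.

type SchedulingEvent = { guestName: string; guestEmail: string; eventId: string };

type Agent = {
  // The agentic part: the model decides which tools (web search, GitHub, etc.)
  // to call while researching the guest and drafting the outline.
  run: (instruction: string) => Promise<string>;
};

type Gists = { create: (title: string, body: string) => Promise<{ url: string }> };
type Calendar = { attachLink: (eventId: string, url: string) => Promise<void> };

export async function onEpisodeScheduled(
  event: SchedulingEvent,
  agent: Agent,
  gists: Gists,
  calendar: Calendar
): Promise<void> {
  // 1. Research the guest and draft a discussion outline.
  const outline = await agent.run(
    `Find out what ${event.guestName} (${event.guestEmail}) is passionate about, ` +
      `then write a discussion outline for a two-hour podcast episode.`
  );

  // 2. Upload the outline (currently a GitHub gist, per the discussion above).
  const gist = await gists.create(`Discussion outline: ${event.guestName}`, outline);

  // 3. Attach the link to the calendar invite so both participants have it.
  await calendar.attachLink(event.eventId, gist.url);
}
```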
Yeah. So are you building those yourself?
Yeah, I actually am building and have – well, I’ve built a framework similar to Next.js. I’ve shown you this off-podcast… Where you just define a bunch of tools, and the large language model will, just like a human being, recognize when “Okay, I don’t know –” So for example, go find out what Kevin Ball is passionate about. If you don’t have the right tools, the large language model would be like “Hey, listen, I can’t.”
Or it will confidently make something up, and be like “Hey, Kevin Ball is this person over there –”
Right. Even worse.
[10:12] I mean, for my name in particular, there’s a couple of famous people with that name, and so it’ll probably tell you that I’m a football player, or an actor, or something like that.
Right. Which you maybe are. But – exactly. And if you go to GPT 3.5 Turbo, for example, it will just say “I don’t have the capacity to browse the internet”, something like that. But the moment you introduce these tools, it can do that.
And so it’s iterative. It’s up to the developer in developer land to just define a bunch of tools and pray to the AI gods that it will use them… How the AI actually knows when to call the tool is part of its training data. And as we know, OpenAI doesn’t make any of that public, but Meta does, and so does Mistral. So you can just pick your model. You can also - and I do - run this stuff locally. So it’s not in someone’s cloud, where they can steal my data… Since it’s really just single user, I run this totally like at home. I’ve got Llama, I’ve got Mixtral 8x22B, which has support for function calling… It just – it works on a single device. It doesn’t need to scale, because it’s not mass market yet.
Yeah. Well, and so for this audience, who’s probably – well, maybe they’re playing with these tools, maybe not. Ollama is…
Yeah, thanks for mentioning it. So if anyone’s familiar with Docker - I suspect they are; it’s JS Party, and we’ve probably built a Node.js server and dockerized it… So Docker is a way to run servers and other software in what are called containers. These are abstractions on a virtual machine, so it’s just a nice, isolated environment. The team from Docker, a large portion of them, quit, and went to start and join a company called Ollama. And Ollama is basically like Docker, but for LLMs. The syntax – so you have an Ollama file, it’s called a Modelfile, so like a Dockerfile, so literally… The concept overlap is incredible. Like, in Docker you have a Dockerfile, with Ollama you have a Modelfile, and you write syntax that looks exactly like a Dockerfile. FROM, and then you specify the base model, so Mistral 7B, and then you can add like a system prompt, you can do a bunch of stuff.
If you just have a FROM statement, and using the CLI you type ollama run, it will run that model locally for you. And then you can just use it like you would an LLM. “Generate me an email”, ask questions about whatever, and it will do it locally, on your GPU.
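As a tiny illustration of what that looks like, a Modelfile plus the CLI commands to build and run it look roughly like this; the model tag and system prompt are just examples.

```
# Modelfile - same shape as a Dockerfile
FROM mistral:7b
SYSTEM "You are a concise assistant that helps prep podcast episodes."

# Then, from the shell:
#   ollama create prep-bot -f Modelfile
#   ollama run prep-bot "Draft three interview questions about RAG."
```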
The cool thing about Ollama is it’s a centralized effort to be able to run large language models across a variety of hardware architectures. So if you download it for macOS, with Apple silicon, it will just work. If you download it on Windows, it will – it’s like the Docker principle as well. So it’s really cool. So you can use Ollama to then – I also mentioned some models that maybe people aren’t exposed to every day… GPT-4 and GPT-3 - and these are the models behind ChatGPT, that OpenAI gives you… But OpenAI is highly controversial, because they are not public. Nothing is open about this company.
It’s the whole thing about like you name yourself to cover up your greatest weakness, right? They are Open AI, it’s just they are not open at all.
Yeah. You should just call me like Rich Tejas, or something, because I’m really not. So their weights, their actual models - not open source. The training data - not open source. They don’t really publish papers as often as things like Google Brain, or even Meta. I think Meta is doing a tremendously good job, a really good job of being open with the – Meta should have a department called Open AI, and it’ll actually be open.
In any case, maybe you don’t want to use those models. There’s a French company called Mistral - I say that proudly, as a resident of the European Union - that has a bunch of open source models. These models are fully open source, you can clone them locally, you can tweak them, you can fine-tune them, you can do whatever you want. And so what I run is Mixtral 8x22B. This is their largest open source model that has support for function calling, and I run that with Ollama. So Ollama is an inference engine, to answer your question in a super-long-winded way. Ollama is an inference engine that you can run either locally or in the cloud, and then you pair that with a language model, and then you basically can build your own ChatGPT.
[13:59] Awesome. So let’s come back to your agent framework… You showed me a little bit – you said it’s like Next, it’s in JavaScript… Is that open source? Can people play with it?
Not yet, mainly because I feel like - and I may be wrong here, but in my mind before I open source something, I want to make sure it’s… It’s already useful, but I’m not sure it’s clean enough. It’s sort of like how you get dressed before you go out, usually, ideally, hopefully… It’s not dressed yet. And so not yet. Also, there are people, friends - Sunil Pai from PartyKit and David Khourshid from Stately - working on exactly the same thing. And theirs is open source. So David has stately.ai/agent. That’s his agent library, and that’s fully open source and ready to go. I think he’s still working on the documentation, but it will be soon, if not already done.
So if you were starting today, would you use Stately, or would you still build your own?
No, I’d build my own. I’ve always built my own. I really don’t –
You built your own React, right? That was your famous talk for a while.
Yeah, I don’t do like npm create next app. Instead, what I’ll do is [unintelligible 00:15:04.03] and I’ll bootstrap everything myself… Because I sort of like that control. I think it’s sort of like the car enthusiasts who will only drive stick shift, even though there’s, some would say, better ways. It’s like that. I like the raw control.
It’s so hard to get a stick shift in the States these days…
Oh, really?
Yeah. Nobody carries them anymore. It’s really depressing.
I wonder if we can do an electric stick shift. How would that work? That might be interesting.
I mean, it’s a total sideline, but electric motors, part of the advantage is that they can continuously apply torque throughout all this, so you don’t need to shift gears in the same way.
Yeah, but you want to feel something.
It’s like the way that they will play engine sounds for the electric…
Yeah. The Hyundai Elantra, the new electric one is just absolutely bananas. They actually have like paddle shifters… It’s fully electric, and they have paddle shifters and they mimic the torque. It’s wild.
Break: [16:03]
Let’s say somebody wanted to follow in your footsteps and build it from the ground up, because they wanted to explore all the different pieces. How do you interact, from JavaScript, with these models? What does that end up looking like?
This is a great question. It’s not difficult. I want to just preface by saying that. And people say “Oh, Tejas, but you say it’s not difficult, and then you tell us to do a difficult thing.”
Yeah, you built React in a 30-minute – or maybe it was an hour talk, right?
Yeah. It really just takes – I hope I’m not being difficult about this, but I think it just takes a little bit of thought. So how do you do it? I regret not wearing this black T-shirt based on what I’m about to say, but you would use the Vercel AI SDK… Oh, my gosh… I’m getting flashbacks to my old job at Zeit. Anyway, the Vercel AI SDK is a really great piece of software. And again, it’s very capable, but I use it because I’m confident I could build it given the time. So my thing with abstraction is I typically don’t trust blackbox abstractions unless I know how they work on the inside. And then I’m like “Yeah, cool. This saves me a bunch of time.” If I don’t know how they work on the inside, I tend to be uncomfortable to the point where I have to build it myself, or at least the bare bones, like I did with React, so that I understand kind of what’s happening.
Okay, so the Vercel AI SDK - what does it do? It’s pretty cool. It exports a function called create ai, then you can give it a language model [unintelligible 00:20:22.22] Think of it as an abstraction on top of like the OpenAI SDK, and the Mistral SDK. So a lot of these large-language-model-as-a-service companies like OpenAI and Mistral and Replicate and whatever, they all have SDKs. And the SDKs are not standardized. JavaScript is a standardized programming language. But these SDKs aren’t standardized. And so if you build your entire company on like OpenAI’s GPT-4o, and then you’re like “Oh, this is way too expensive. We need to shift to self-hosted Mistral”, that’s going to be painful, changing from one SDK to another. So the Vercel AI SDK is a general SDK, where you can just swap out the language model pretty easily. It can do that because the language model is just an input parameter, and the functions you call use that input parameter. So it’s very nice and standardized. So I would use that.
With the AI SDK, alongside the model that you give as the input parameter, you can also pass in a set of tools. And what is a tool? A tool is just a function, literally an async JavaScript function that does a task and returns a message. So think of it this way - when you call the OpenAI API, and you send a prompt, the role of the message you’re sending is user, and the content of this message is “Convert for me 100 US dollars into euros.” That’s the prompt.
Now, if there are no tools, the response will be “I don’t know how to do that with today’s exchange rate, but here’s some nonsense based on some exchange rate I imagined.” It won’t say that, but that’s what you’ll get.
It will confidently tell you the wrong answer.
Yeah, unfortunately. It won’t even say this is nonsense. So that’s how you would call the SDK. But when you add in tools as this input parameter, how does a tool look? A tool, indeed, is just a function, but this function also has metadata. So the metadata has a description, and it’s literally just a plain text description, and a schema of input parameters. And this is just a Zod schema. So it’s a JavaScript object, you can have keys and values. And so based on the description of the metadata of the tool, the large language model will call it, because it’s a language model. And the tie here is really language. So if the description of your tool using the Vercel AI SDK is “Get the current exchange rate, or get a list of current exchanges”, that’s the description, then the language model will see “Okay–” It’s just vector similarity, right? It will see that the input prompt contains “exchange rate”, and this tool’s description contains “exchange rate” - “I’m just going to call this and hope for the best.” And that tool will return a message.
So we talked about the role being user, and the content being your prompt. The tool will return a message where the role is tool call, and the content is whatever that function returned as a string. And so then OpenAI, or the large language model, has been trained to recognize the JSON, where the type is tool call, and will take that and add it to its context. “A-ha, now the exchange rates are this. I got this from the tool. I’m going to generate some text for you.” So this is RAG, really. Tool calling is RAG, because it did retrieve the exchange rates, and then used it to generate its own output, or to augment its own generated output, I should say. But yeah, that’s how the AI SDK works.
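Condensed into code, that flow with the Vercel AI SDK looks roughly like the sketch below. Exact option names have shifted a bit across SDK versions, so treat this as illustrative rather than canonical; `lookupRate` is a hypothetical helper.

```ts
// Illustrative tool-calling sketch with the Vercel AI SDK.
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
// Swapping providers is mostly swapping this import, e.g.:
// import { mistral } from "@ai-sdk/mistral";
import { z } from "zod";

// Hypothetical helper - replace with a real exchange-rate lookup.
async function lookupRate(from: string, to: string): Promise<number> {
  return 0.92; // placeholder value
}

const result = await generateText({
  model: openai("gpt-4o"),
  prompt: "Convert 100 US dollars into euros for me.",
  tools: {
    getExchangeRate: tool({
      // The plain-text description is what the model matches against the prompt.
      description: "Get the current exchange rate between two currencies",
      parameters: z.object({
        from: z.string().describe("ISO code of the source currency"),
        to: z.string().describe("ISO code of the target currency"),
      }),
      // The tool is literally an async function; its return value goes back
      // to the model as a tool message it can use in its final answer.
      execute: async ({ from, to }) => ({ rate: await lookupRate(from, to) }),
    }),
  },
  maxSteps: 2, // allow one tool call plus a final text generation
});

console.log(result.text);
```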
[23:40] How would I build this? It’s exactly like that. I would add the AI SDK to my project, create an instance of their AI inference function object, and pass it a bunch of tools, and then just send prompts to it. That basically is just my – this is why also I don’t feel like open sourcing it, because it’s not revolutionary. It’s just using another library. Also, Stately’s AI agent framework is exactly the same. Also, Next.js is exactly the same. It’s just using React in an opinionated way. But Next.js is open source, so maybe I should open source my thing.
I think you should. I think you should absolutely open source it. I mean, even if it’s not ready, say it’s not ready, but you’re learning in public, and you’re showing people. You’re talking about it… I’d love to see how you do it.
Yeah. One caveat with this though is that it does get very expensive. This is why I opt to run the models locally, because then it’s free… But to run this at scale, as in like multi-user workloads is going to always be expensive at this point in time. So that’s something people should probably know about.
Yeah. Well, and that moves into a topic area around gotchas for this. And one that I’ve definitely noticed playing with these tools and then trying to help developers use these tools is… I think calling LLMs artificial intelligence maybe gets in the way of people using them well… Because if you think of them as intelligence, you expect them to be able to, for example, infer things that to a human seem the same…
Yeah.
But going back to your tool description, the text of that description is really important. It should linguistically be very close to the language that will trigger when it wants to do this.
Yeah. There’s one caveat here, because they can also translate. This is absolutely bananas. If somebody is speaking Korean with the language model, and is like “Convert this currency” in Korean, it will still do it. Because the vector dimensional space transcends language the way we know it. And I think that’s very cool.
Yeah, no, it’s super-cool. And they are very powerful. And when I’ve used them sometimes – it’s similar to what you’re talking about in terms of blackbox. These are sort of black boxes… And the lines between what works and what doesn’t, for example referencing a tool call, often are unintuitive. I’m trying to think of a good example… But using language that to me might mean the same thing will not trigger it at all.
Absolutely. Yeah, I’m totally picking up what you’re putting down. And I think this is why I did go a level deeper and just trained my own model that does tool calling… Which - now I see it. I see the matrix, so to speak. And I think it’s very important to always look at the person behind the curtain.
Yeah. I’m happy to go down that trail if you want, and we actually have an episode on my podcast with Kyle Corbett, the founder and CEO of a company called OpenPipe, that does fine-tuning as a service. It’s very cool. Full disclosure, I’m an investor, so I need to disclose that… But he is just a genius about fine-tuning, and tool calling, and machine – he’s got a background in this stuff. He’s had it for years, and so he was able to teach me a lot, and I did eventually come up with a large language model completely my own, that can call the right tools, and so on. It makes sense now.
So when you do that, are you also implementing that in JavaScript, the wrapping around it? Or how does that work?
Yeah, you can’t, as far as I know. So I haven’t fine-tuned models in JavaScript, mainly because I need access to my GPUs, which I know you can do with TensorFlow JS, I just haven’t… The tooling in Python is just fundamentally different and fundamentally better. Like, you’ve got so many – you could like npm install… The ecosystem of things you can use in Python for your machine learning workflows is just unparalleled. And really, this is the great gap. Like, if the JavaScript ecosystem wants to mobilize and create an equivalent level of tooling that Python has in JavaScript, we could really take over the space. But for whatever reason, we don’t have it.
What am I talking about? Well, specifically, Hugging Face the company - for those who maybe don’t get to play with this, Hugging Face is like GitHub, but for machine learning models. People can upload their models there, and fork them, and clone them, and download them, and do whatever they want… At least the ones that are open source.
So Hugging Face is just the biggest contributor to the Python ecosystem. They have a great library called Transformers. This thing is bananas. It’s like the bedrock of all fine-tuning operations that you would do maybe as an enthusiast. I can’t speak for like Academia and research and people with H100s from NVIDIA, but for me, with my Apple silicon, Hugging Face Transformers comes with so many great declarative abstractions out of the box.
[28:17] For example, you instantiate a trainer, and it’s a class, and you give it a bunch of hyperparameters, like “I want this many epochs, I want this learning rate”, and so on and so forth. And then once you’ve configured this instance, you literally just call trainer.train(). How cool is that? And it will do that. And it will look for a GPU, if you have one - on Apple silicon it’s called an MPS device. Or it will try its best effort to do it on your CPU, and will probably crash your system. I’ve crashed my computer many times. But all of this in JavaScript is just at this point in time not as accessible because of the ecosystem as it is in Python.
Interesting. So something you talked about there - you said you’ve crashed your computer, and you are running everything locally… So what are the gotchas if you want to start running locally? How likely are you to crash things? How fast or slow is this? What does that end up looking like?
Yeah, that’s a great question. You’re unlikely to crash things, especially like – so I’m working on a 2021 MacBook Pro with an M1 chip. It’s pretty old. It’s three years old. It’s a very old device. And it works just fine. It’s pretty much impossible to crash, unless you really get around it.
For example, the macOS kernel is really extremely world-class at making sure you don’t crash the system. There’s plenty of safeguards in place. So what will typically happen is your application itself will freeze, so not your entire system. And at some point it will say “Hey, this thing’s taking too long. Do you want to force it to quit?” and the kernel will just pull the kill switch. Very cool.
There’s a way around this using an environment variable. So PyTorch is the thing that’s causing the memory problems. PyTorch is an open source library from Meta that helps with machine learning. And so PyTorch allows you to set an environment variable called PYTORCH_MPS_HIGH_WATERMARK_RATIO. And this controls at what point it throws an out-of-memory exception. Because high watermark literally means you’re about to reach the watermark, the level where if this is a tide pool, you’re going to start losing water; you’re going to overflow, literally. I love the language we have in computer science. Overflow, watermark etc.
So you set this, it’s a threshold before overflow, at which point PyTorch will just kill the process. You can set that to zero. And then what will happen is you just completely bypass – so you’re like “You know what? If we have an overflow, we have an overflow. I’ll just like hard reboot.” And so you set that to zero, and then you’ll crash your system. Because all your GPU, all your CPU is going to be consumed, and you’re not going to have free resources to respond to the Caps Lock key and turn the light green. You’re not gonna have resources to –
So what you’re saying is unless you go out of your way to tell your computer “It’s okay to crash”, it’s not going to crash.
Yeah.
And just so I understand - you referenced PyTorch. That’s getting run by Ollama under the covers, or are you explicitly running that?
No, no, I’m explicitly running it. To be fair, Hugging Face Transformers runs it. Yeah. So my fine-tunes, by the way, just to make this really accessible to everyone - I do it through a Jupyter notebook. For those who aren’t familiar, a Jupyter notebook is just a big JSON file that has cells. And each cell runs in isolation, but they share scope. So it looks like a notebook with code snippets, and you can run those code snippets… It’s basically like text, snippet, text, snippet, text snippet. And you can run snippets in isolation, and they share scope. So you can say like [unintelligible 00:31:39.05] in a snippet somewhere high above, and then way down under a bunch of text you could reference a and it will just know the value.
And the reason I do this is because where there are these snippets in a notebook there are also checkpoints… Meaning I can go up to the point where I run npm install safely, and then the step after that could crash, but I’ll still have my dependencies. That’s really cool for an iterative training process, because with fine-tuning and training machine learning models you have to load a bunch of stuff into memory, and sort of keep it there. And the loading takes time. So if the loading step crashes, then you have to load it again, and again… So it’s really cool that you’re able to just load things into memory, and run an inference later, and if the inference fails, the stuff’s still in memory.
[32:21] Yeah. So unless you go out of your way, you’re not going to crash things. I recommend going out of your way, because you’re still not going to – like, we’re very protected. Worst case, your computer becomes fully unresponsive, and then you press and hold the power button for like 10 seconds, and it just does a reboot, and you’re fine. Nothing will explode, so it’s worth playing around.
So coming back then, peeling back the layers… So if you wanted to get involved, or start playing with this stuff, the simplest thing which most people have done is you just go to one of these online services. You go to ChatGPT or something like that, you play with it there. You see “What can this thing do for me in this setting?” Next layer is you’re using some sort of local code, maybe it’s an agentic framework, something like that, but you’re still interacting with an online model…
API.
An online API. You don’t have to do anything. One layer beyond that, you’re downloading Ollama, running a local model of some sort. Now, let’s talk briefly about like levels of local resources. So you’re on a three-year-old MacBook Pro. I’m guessing something like 16 gigs of memory, or something like that.
Exactly.
How much do you need, exactly, to run these things locally?
That’s the cool thing about Ollama - they will work on anything. And if they don’t, they’re explicit about it upfront. So that’s the promise with Ollama, is they detect your – it’s sort of like Docker. You don’t really think of “What hardware am I working on?” Just like, in your Dockerfile, you’re like “FROM Ubuntu or whatever, do it.” And it will virtualize that for you. And Ollama is exactly the same. So it really is – there’s quite a bit of interop. It’ll work on three-year-old devices, it will work on Windows, or on Linux… It’s pretty cool.
Okay. So now you’ve got your local model… You’re probably at this layer still not fine-tuning, but you’re just running against a local model, using your JavaScript, or I guess it looks like the Vercel SDK is TypeScript, so Nick Nisi will be happy, he’ll be willing to play with it…
Yeah. Well, let me say this… If you want to go the local route, you need really just two things, and none of them are JavaScript, but you can add JavaScript later. We’ll talk about that in a second. But just to make sure, just to get really clear… If you want to run any large language model, or frankly any machine learning model locally, you just need two things. One is a very, very typically large file called the weights. And this is literally a neural network. Think of it as a brain on your computer. These are on the order of gigabytes - 70 gigabytes, sometimes terabytes. They’re very, very large. And all they are are big, almost graph-like data structures, with a bunch of nodes and a bunch of edges. And each node has a number associated with it. Think of it as like if you see a soundboard on an audio engineer’s desk, there’s a billion different knobs for like EQ settings, and volume, and stuff. That’s it. So turning these knobs is how inference works, is how training works. Setting the values on each knob is basically the training process.
So you have this huge file, that’s the weights, and you have an inference engine. Something to run the algorithm that those weights express. They take an input, they pass it through those weights, and get a predicted output with some degree of certainty. The lowest-level inference engine is something called llama.cpp. It’s exactly what it sounds like. For me, this is beyond the scope of my knowledge. Ollama abstracts on top of llama.cpp, and it just makes it more comfortable for people like me. So you need those two things. That’s it.
Now, if you’re running inference locally, meaning you can send input tokens, you get output tokens, you can say “Hey, ChatGPT, what’s two plus four?”, you get six. Or not ChatGPT. “Hey, local model, what’s two plus four?”, you get six, maybe. Then the inference engines typically expose a web API, or you can wrap it with a web API.
Ollama runs a web API on localhost. It’s some weird port, like localhost port and then a five-digit thing… But once that’s running locally, then you can just do a fetch request from JavaScript to it.
[36:15] The cool thing about Ollama’s HTTP API is that it’s 100% OpenAI-compatible. So you could literally like run an inference with ChatGPT, copy that as fetch, and change the URL to instead of like chatgpt.com/whatever, localhost port something slash whatever, and it will just work with your local model.
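Concretely, once a model is up (via ollama run or ollama serve), a plain fetch from JavaScript is all it takes. Ollama listens on port 11434 by default and exposes its OpenAI-compatible endpoints under /v1; the model name below is just an example of something you've already pulled.

```ts
// Talk to a local Ollama model through its OpenAI-compatible HTTP API.
// No signup, no API key - it's all on localhost.
const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "mistral", // whichever model you've pulled with `ollama pull`
    messages: [{ role: "user", content: "What's two plus four?" }],
  }),
});

const data = await response.json();
// Same response shape as OpenAI's chat completions API.
console.log(data.choices[0].message.content);
```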
Well, and that gives a really low barrier to entry for just hacking around with code here…
Yes.
…because now suddenly you don’t have to worry about a number of things. You don’t have to worry about signing up for an API-based account on ChatGPT, because that’s a separate thing from their web interface. You don’t have to worry about “Are they stealing my data? What are they doing with it? Where is it going?” You don’t have to worry about any of this stuff. Now, what are the shortcomings? GPT-4o, you mentioned it’s sort of like the state-of-the-art, highest-power model. If I step down to using something - like, you’re using Mixtral 8x22B… How is that going to feel different?
Oh, it’s gonna feel very different. As a consequence of these models being ethical and open source, they kind of suck.
[laughs]
If anyone’s gone from like Adobe Photoshop to like The Gimp, which was – I don’t know if you remember, this was like the old open source –
I do remember Gimp. I spent a lot of time in Gimp because I didn’t want to pay for Photoshop, and it was miserable.
Yeah, it’s exactly like that. It’s like going from macOS to Linux. It’s the tax of open source. It just really, really sucks. But you can work around it. Through some system prompt engineering, through some RAG… And honestly, through fine-tuning – there’s a model called Mistral 7B Instruct. It was like purpose-built for fine-tuning. And so this is the secret sauce. This is what people should be doing. You work with a crappy model, and over time you tend to get some really good inferences. And so you collect all your good inferences, pair them up with the prompts that led to them, and then use those to fine-tune a smaller model, like Mistral 7B Instruct. And then you’ve got a really high-quality model that’s specialized at what you want. And then you can run it locally, and it’s going to be better than – it will probably outperform GPT-4o for your use cases, because it just more intimately knows your data.
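One way to picture that collection step, as a hedged sketch: log every inference you were happy with as a prompt/completion pair in a JSONL file, which is the rough shape most fine-tuning tooling can ingest. The chat-message format below is a common convention, not a requirement of any particular tool.

```ts
// Sketch: append "good" prompt/completion pairs to a JSONL training file,
// one JSON object per line. Adapt the shape to whatever fine-tuning tool you use.
import { appendFile } from "node:fs/promises";

type TrainingExample = {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
};

export async function recordGoodInference(
  systemPrompt: string,
  userPrompt: string,
  goodCompletion: string,
  file = "finetune-data.jsonl"
): Promise<void> {
  const example: TrainingExample = {
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
      { role: "assistant", content: goodCompletion },
    ],
  };
  // JSONL: one training example per line.
  await appendFile(file, JSON.stringify(example) + "\n");
}
```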
Yeah. Well, and this highlights – one of the things that going to these smaller models does is it peels back, once again, the layers… If you go and interact with ChatGPT, it can kind of just feel like magic. And that’s dangerous, because it means that you assume that it’s better than it is; you assume that it can do all of these different things. And it will try and it will look good in a lot of different ways, but it’s so powerful that it demos incredibly well, it gets really, really close most of the time out of the box… And it’s hard to see “What is actually happening under there?” It feels magic.
Yes. Sorry to interrupt - especially tool calling feels so magical. “How does it call a function…?” But this is just training data; text in, text out.
Yeah. So you get down to those smaller models and you start to see that a little bit more raw. And you say “Oh, it’s just making *bleep* up based on pattern matching.” And you can get better at how you present patterns to it, you can teach it the patterns that matter to you.
Yes. And it will then produce patterns that also matter more to you. I think also what’s worth noting is the large language models themselves don’t really call any tools. They return data, they return text that then a layer in front of them, on OpenAI’s side or whatever [unintelligible 00:39:38.14] can reason about the format of the string it produced and then call the function. So it’s not like the large language models have function calling capabilities. They have text generation capabilities. They will just generate like some JSON. Think of an object with URL this, and parameters that, and then your API that talks to the language model will receive the string, parse it, and then call a tool, and then return text after that. So like you said, it seems magical, but there’s just layers of APIs on layers of APIs.
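In other words, the “function calling” happens in ordinary code wrapped around the model. A bare-bones sketch of that surrounding layer might look like this; the tool name and the JSON shape the model is asked to emit are hypothetical examples.

```ts
// The model only emits text; this surrounding layer parses it and calls tools.
// Assumes the model was prompted to answer with JSON like:
//   { "tool": "getExchangeRate", "args": { "from": "USD", "to": "EUR" } }

type ToolFn = (args: Record<string, unknown>) => Promise<string>;

// Hypothetical tool registry with a stubbed implementation.
const tools: Record<string, ToolFn> = {
  getExchangeRate: async (args) => `1 ${String(args.from)} = 0.92 ${String(args.to)}`,
};

export async function handleModelOutput(modelText: string): Promise<string> {
  let parsed: { tool?: string; args?: Record<string, unknown> };
  try {
    parsed = JSON.parse(modelText);
  } catch {
    // Not a tool call - just plain text for the user.
    return modelText;
  }

  const fn = parsed.tool ? tools[parsed.tool] : undefined;
  if (!fn) return modelText;

  // Run the tool; in a real loop you'd hand this result back to the model
  // as a new message so it can generate the final answer.
  return fn(parsed.args ?? {});
}
```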
[40:09] Well, and you can peel back the layers on that too, even within like a ChatGPT. Just ask it to render JSON, or ask it to render YAML. It turns out YAML is a really nice language for large language models, compared to JSON, because it encodes meaning in the spaces, and it is very human-readable, which means it’s very close to language, which means it’s something they grok really well… But yeah, you could do all sorts of things. You can ask it to, instead of just outputting an answer, output three answers; output a summary of your conversation so far inside the summary tag, and then your answer. “Okay, sure.” It will do that.
Yeah, exactly. So I want people to know that it’s not magic. And I do want people to also care about the magician behind the magic. I think that’s really going to be important for us as we move forward in the AI age. Also, one thing I want to clarify - and this is speculative, but - I think there’s a lot of positive valence casting on like “Oh my gosh, OpenAI released ChatGPT, and it was so great, and they did good for humanity, and so on”, but if we balance that out a little bit with what we know about big tech and capitalism, I think another side that’s worth discussing, that I just don’t see us discussing enough, is the idea that they just really needed human feedback. Because the way you make these language models really great is through a technique called RLHF, or reinforcement learning from human feedback. They need people to do text generations at scale, and click on the thumbs up or thumbs down, because this helps them create new models, new – like, GPT-4 is just a successor to GPT-3.5 because 100 million users used it, and then clicked on thumbs up and thumbs down, and they used those to fine-tune GPT-3.5, and so on, and so on. So without that, OpenAI wouldn’t be OpenAI today. And I think that’s another reason to – so I challenge the idea that it’s just like fully altruistic, like “We’re gonna give something good to humanity, and do research.” They also need us, as much as we need it.
Yeah. Well, and I think this gets into a little bit – one of the challenges of being a software developer is we tend to jump to binaries, we like to nail things down… And so I talk with a lot of developers who are either “AI is the future of everything”, or, possibly even more common, “This stuff is all bull****. Just not good for anything.” And I think what’s much more interesting to me is like the line along the way of saying “The hype machine is the hype machine, it’s going to do what it’s going to do, it’s gonna go crazy, and a lot of the stuff they’re saying is not there.” And as you highlight, they have their reasons for doing it. Sam Altman’s not out there talking about AGI because he actually thinks it’s coming, he’s doing that because it pumps up OpenAI, and it gets all sorts of outcomes that he wants out of it. He’s not, as far as I can tell, an altruist in any of that.
Yeah. And also, should they discover AGI, they have an incentive to not reveal that they’ve discovered AGI, because it would give Microsoft and the stakeholders an enormous advantage in the market. So why would you then share that, or open source that, unless you absolutely have to? But how can you absolutely have to unless you have people overlooking you and holding you accountable to do that? …which, as far as I know, they don’t. So yeah, it’s worth – I think those discussions are exceedingly important as AI continues to grow in maturity.
Break: [43:29]
The question I’m kind of leading to here is how do you think about kind of where to use this stuff, where it’s going to fit in application development in the future? Obviously, as we’ve highlighted, it’s really easy to get started playing with it, and you can do it with JavaScript. You’re using it to do real work with your podcast agents; they are making it easier for you to do what you wanted to do… How do you see this playing out in the ecosystem?
Yeah. I will add that the podcast grew in terms of production efficiency by 100%. We literally doubled the number of episodes we ship, from once a week to twice a week, because of the agentic workflows. And it’s me; there’s no one else that works on this podcast other than me, and a bunch of agents. And each episode is nearly two hours, and it’s quite a bit of work, but that’s the power of agents.
Where can people use this? I think that’s a really great question. I think we just have to get curious a little bit. Because as I mentioned, anything that you could do with a human, that is even yourself – I could produce all this podcast stuff myself manually, right? But there’s a better way. So I think where people can use this is in the places that they’re already spending manual energy.
For example, I know runners, people who will go for a run and they’ll get into Strava, and look at their stats, and be like “Oh, I was slow today”, or “I was fast today.” What if you didn’t have to do that? You just go for your run, jump in the shower, come out, and you just have a summary, like “Hey, this is how you stacked up to all the other workouts.” And it’s not reactive, where you’re like having to send a prompt and get a response. It’s proactive. You literally like just go about your day, and somewhere your agent interrupts you with “Just so you know, your run today was actually better than your past three efforts in the same route”, things like that.
Or, another place people can use this is – so I think Apple Intelligence actually is going to change the game on this, because Apple Intelligence makes AI personal, I think for the first time ever… And I think what they’re maybe not talking about, but I think this future is coming, is the age of not just personal AI, but proactive AI. So not reactive, I send you a prompt, you send me a generation, it’s more I’m just gonna go about my day and you’re gonna tell me things that are super-useful. For example, I could have a calendar event next week for lunch with [unintelligible 00:49:20.01] and I forget about it. And so I go play tennis, and then I come back from tennis and my agent is like “Hey, just so you know, you have lunch with Yanni next week, and there is no location in the calendar invite. By the way, the last time you both talked, you liked this place, so I went and made a reservation for you, and it’s attached to the calendar invite now.” That whole thing just happens. And that can happen with agentic workflows.
So I think this is where people will end up using it, or should end up using it. We don’t live in that future today, but we will. And I think there’s companies to be started there, and open source projects to be made, and a lot of stuff there. Am I going to start one of these companies? Absolutely not. I just don’t care enough.
[49:59] [laughs]
We talked about this, Kevin… I care about the novelty and the - not so much the novelty, but the complexity of it. I care about how it works, and knowing how it works, and the person behind the curtain… But I know all that. And that sort of removes the fun from building it. Because cool. Yes, I can. And so it’s this weird thing where when I recognize I can build something, I don’t know. But when I’m chasing the knowledge, then I build a bunch of stuff. So anyway…
So it sounds like, essentially, think about what you’re doing today that you would rather not be doing, and see if you can figure out how to get an AI to do it.
Yeah. Because – I was just gonna say, one of the things that I think people talk about is this future where the machines take over. And then what do we do? The doomer theory. And some people see this as a good thing. The machines run everything, and we just like paint all day, and eat pizza, and chill, and do sports, and whatever we want. I think we could make that future. Like you said, automate away the things you don’t want to do with agents, and just live your life. So that’s sort of what I would do. That’s what I’m doing, actually, with my podcast.
One of the things that I’ve found playing with these tools is, at least in their current state, they’re really not good at things like decision-making, but they’re pretty good at “I want you to do a thing. Go and do it.” Especially if you’re willing to spend some time to figure out “How do I tune this prompt? How do I write the right tools?” or “How do I ask the AI to write the right tools, and get it to do things?” And so I think there’s kind of an interesting question there around “Can we use these tools to get rid of the drudgework, but then elevate the interesting decision-making, ideation, exploration pieces?” In your example that you shared, personally, I wouldn’t want it to book a restaurant, but I’d want it to suggest it and say “Hey, here’s the restaurant. Do you want me to book it?” And then I can make that decision, and then it can go and do the work for me.
Yeah. And dialing in that threshold is I think also where a lot of the complexity in the work is. I think Apple Intelligence, again, does this really well, where it uses Open AI’s models as a tool, literally. We talked about tool calling… Apple Intelligence, they have a small on-device model that does tool calling, but to another LLM. And I think that that’s pretty cool, and I think we’ll see a lot of that as well.
Well, and I think that’s the model that for me I want to bring this back to for developers. The danger is either you dismiss these as they’re not useful at all, or you think they’re magic, and they’ll do everything. They’re just tools. They’re useful tools, they create some new capabilities, some very powerful capabilities, but we need to figure out how we incorporate those tools in the software we’re writing.
Yeah. And there was this scientist, the godfather of AI, Dr. Hinton, who was working at Google, and who left Google so that he could speak more freely about the dangers of AI… And he mentioned that within the next few years - I forget how many. I think it was 20 or so years; forgive me. But he says within the next few years there’s a 50/50 chance that artificial intelligence will be smarter than human beings. And if you listen to him speak, it sounds really dangerous and scary. And he says the only instance in existence that we have where something less capable controls something more capable is when a baby controls the mother to feed it. But this is rare. There is no other instance where something less capable controls something more capable. So his theory is that in the next 20 or so years there is a 50/50 chance that we will achieve ASI, artificial superintelligence, and this will be more capable than us, therefore it will control us. But I tend to not agree with this. And it’s kind of stupid for me to like disagree with such an established person, right? But at this time in history there’s – robotics is the bottleneck. Like, so what if ASI controls us and is smart? It can’t really do anything in the physical world at this point in time.
And so yes, maybe some systems will go wrong, and things will be deleted or whatever, restaurants will be booked… But we’ll recognize we messed up and adjust it. We always do. Like we did with the airline industry. This was new, and when it was nascent, planes would literally fall out of the sky. There’s so many incidents of like Pan Am, and KLM, and Cathay Pacific having all kinds of issues. But now it’s the safest way to travel. And I think that’s part of the human story, is that we’ll introduce the right safety measures, and it’ll be okay. We will make mistakes along the way, but I think we’ll get there.
[54:23] I also feel like there’s a little bit around the development of AI that reminds me of fusion, in the sense that people have been saying “We’re 5 to 10 years away from fusion power” for the last 60 years. And maybe we’ll get there, but it just keeps being there, and I feel like that has been true in the AI world as well. People are like “Oh my gosh, we’re gonna match human intelligence in the next 10 years.” And you can find people saying that going back almost as far as there are computers, because I think part of it is you get so into thinking about these computers that you maybe don’t realize the extent of what actually happens in human intelligence. Like, we do a lot more than next token prediction.
Yeah. Although, although, the thing that makes us human, that sets us apart from lesser animals is the prefrontal cortex. It’s the center of the brain that literally, literally just does predictions. And based on those predictions it’ll either quiet down other circuits, or raise their activity. It will inhibit or excite. But predictions are so crucial to the human experience. And so I think it’s important to not undervalue that, but also not overvalue it. And so next token prediction is still prediction, on some level.
We will see. Yeah, I mean, my personal opinion is this is another example of our ability to get fooled into thinking S curves are exponentials. Are there any things that we haven’t talked about, that you would like to leave our listeners with?
Yeah. I mean, I work at DataStax, so I’d love it if people check out our tools. We make a vector database that’s super-useful for similarity search, but I think one thing that not enough people are as hyped about as I am is something called Langflow. It’s a low-code and no-code builder for RAG and other Gen AI workflows. It’s really cool. So you have this drag-and-drop-style interface where you can generate RAG pipelines and other things. And it makes these things more accessible. This is what I’m excited about, right? It’s the democratization of Gen AI. It makes it more accessible to a wider net of people.
I was talking to swyx, Shawn Wang, the founder of Smol AI, and he mentioned “Dude, it’s like the internet just began.” And there’s a lot of work to be done, and there’s a lot of room at the table… And so a lot of our work ought to be spent on making this stuff accessible. So that’s what I’m really into. That’s what I would invite people to do as well, is come and play. And if there’s questions you have, and if there’s support you need, I’m here. Kevin, you’re here. A bunch of us are – we’ve been around for a little bit longer, and we’re happy to support you.
Absolutely. Yeah, this stuff is going fast, but despite all the hype around how many people have tried it, and all these different things, it’s still very early days. We haven’t figured out how to use these things effectively in very many instances. I love this agent example that you have, because it is concrete, visible and it has clearly accelerated your work, and I think there’s many, many more opportunities out there.
So yeah, let’s close with that… It’s early days, you can still get involved in this stuff. I do think it is going to transform our industry, so I think the head in the sand approach is probably not the right one. Like, if you’re looking for what’s your next learning thing, maybe not the next frontend framework, instead look at Stately AI and how you can interact with LLMs using TypeScript.
Yeah.
Alright. With that, thank you, Tejas. I’m Kball, and this has been JS Party. Catch you all next week.
Our transcripts are open source on GitHub. Improvements are welcome. 💚