As with a lot of organisations, the idea of using LLMs can be a reasonably frightening one, as people freely hand over internal IP and sensitive comms to remote services that are data-hungry by nature. I know it was on our minds when deciding on LLMs and their role within the team and wider company. Six months ago, I set out to explore what the offerings were like in the self-hosted and/or OSS space, and whether anything useful could be achieved locally. After using this setup since then, and after getting a lot of questions about it, I thought I'd share some of the things I've come across and how to get it all set up.
Cue Ollama and Continue. Ollama is an easy way to locally download, manage and run models. It's very similar to Docker in its usage, and can probably be most conceptually aligned with it in how it operates; think images = LLMs. Continue is the other side of that, being the bridge/interconnect that lets VSCode talk to Ollama in a way that makes sense.
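To make the Docker comparison a bit more concrete, day-to-day usage from the terminal looks roughly like this (the model names are just examples):

```sh
# Download a model (think `docker pull`)
ollama pull qwen2.5-coder:7b

# See what's already downloaded locally (think `docker images`)
ollama list

# Chat with a model interactively; it'll pull it first if it's missing
ollama run qwen2.5-coder:7b

# Remove a model you no longer need (think `docker rmi`)
ollama rm qwen2.5-coder:7b
```

Re-running `ollama pull` on a tag you already have will also update it to the latest published version.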
Depending on the model you choose, the capability varies, but I've found it makes for an excellent companion in all the areas you'd expect, especially code autocomplete and documentation. The benefit is that it's all fully local and performant.
My current dev machine is a 16" MBP with an M1 Pro APU (2E/8P @ 2/3.2GHz CPU, 5.2 TFLOP GPU) and 32GB RAM (LPDDR5 6400MT/s, shared between RAM and VRAM). That should hopefully give you a reference point on CPU/GPU/RAM/VRAM, the main factors in LLM performance (depending on hardware acceleration, whether you're running it on CPU or GPU, and whether it's in a shared/pooled memory environment such as Apple Silicon).
With this hardware, I'm getting performance roughly similar to ChatGPT 3.5. Ollama will automatically load/unload the model in the background to conserve memory and system load when the model is not in use. This causes a slight delay on the first call after the idle timeout has unloaded the model, although the timeout can be configured.
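For reference, that unload behaviour is driven by Ollama's keep-alive setting; as I understand it, you can raise it globally via an environment variable when running the server yourself, or per request through the local API, something along these lines:

```sh
# Keep models in memory for an hour after last use, instead of the default
# few minutes (this assumes you're running `ollama serve` yourself rather
# than the menu bar app)
OLLAMA_KEEP_ALIVE=1h ollama serve

# Or set it per request via the local API
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "hello", "keep_alive": "30m"}'
```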
The best models I've found to use so far are either the more broad Llama3 or Qwen's more code-focused models; at the time of writing Qwen2.5-Coder is the latest, but they are constantly updated, so keep an eye out for new versions and improvements. Qwen's code models come very close to the performance of ChatGPT 4o and Claude 3.5.
I usually tend to stick with the 7B/8B models, as a good balance of performance (both the LLM's and the system's), memory use and storage requirements (4-5GB per model on disk).
You'll also notice suffixes like `-chat` or `-instruct`; these are just the contexts that the model has been additionally trained on to hone it towards a more specific purpose. Something like `-chat` will have been trained from the base model on chat transcripts to predict a more conversational style, whereas something like `-instruct` will have been trained more on task assignment, to deal with direct instruction.
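If you want a specific variant rather than whatever the default tag points at, you can pull it explicitly. The exact tag names below are from memory, so check the model's page in the Ollama library, but it looks something like this:

```sh
# Pull the instruction-tuned variant of a model explicitly
ollama pull qwen2.5-coder:7b-instruct

# Or a chat-tuned variant of an older model
ollama pull llama2:7b-chat
```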
Just keep your guard up, trust yourself more than it, and if something doesn't smell right, it probably isn't. Double-check its output. LLMs are superb tools, but that's it: it's a tool that you wield, and therefore accept responsibility for using. They're not intelligent and won't reason; they're just probability engines wrapped around language, so never blindly follow what they say.
With all that in mind, let's get things set up. I'm going to assume `brew` is installed and ready to use (if it isn't, it really should be!);
- Install Ollama with `brew install --cask ollama`
  - You'll notice we're grabbing the `--cask` version of Ollama instead of the pure CLI version. This is personal preference, but if you run the app version, it will run `ollama serve` in the background as a menu bar app, freeing up an active terminal
- Pull your LLM of choice, e.g. Qwen2.5-Coder-7B, via `ollama run qwen2.5-coder:7b` in a terminal to download the model
- Install Continue in VSCode
- Once installed, select `Ollama` as the model provider, then select the `Autodetect` model preset
  - This will read all available models from the Ollama API (you can query it yourself too; see the example after the config snippet below) and populate them in your model selection at the bottom of the sidebar window, for us to then further customise later
  - This is mainly to ensure everything is wired up and working; it will also give you a reference point in Continue's `config.json`
- Using the settings cog to open Continue's `config.json`, or the command palette, change the `tabAutocompleteModel` to the following, to switch it to Qwen's Coder model:
"tabAutocompleteModel": {
"model": "qwen2.5-coder:7b",
"provider": "ollama",
"title": "Qwen Coder"
}
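As mentioned above, Continue's autodetect reads from Ollama's local API, and you can hit it yourself to sanity-check the wiring. If everything is running, you should get back a JSON list of the models you've pulled:

```sh
# Ask the local Ollama API which models it has available
# (the same list Continue's Autodetect preset populates from)
curl http://localhost:11434/api/tags
```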
And that's it! It should all 'just work'. There are loads of keyboard shortcuts and command palette options to explore, so have an adventure!
I find myself using the Continue integration in 4 key ways;
- Really good code autocomplete
- I find it will often get what I'm about to write, or at least get in the ballpark, pretty consistently, meaning hitting tab saves time that adds up!
- Really good at writing tests & code refactoring
- As it's contextually aware, it's really good at taking what you've already written and producing complementary additions, like refactored code or unit/feature tests. I've never tried using it to generate code in a TDD scenario, but give it a go and see what it does! Although review these carefully, as you would an external PR!
- If you're refactoring code, I'd almost certainly ensure you have a test suite already too.
- Superb documentation autocomplete
- As Continue feeds in context, I find it is really good at writing method/class docblocks and inline code comments
- Sidebar chat for code generation and more in depth querying
- Continue will allow you to define attached context or initiate chat from the context menu on highlighted code, for those times you need a more nuanced reflection
It's a nice tool to have in your arsenal. It feels like having a constant mid-senior engineer just sitting there, ready to be bounced off for some code review or a random question. Given how terrible search engines are nowadays, I often find myself turning to an LLM, scoped to the specific docs I'm requiring or for a more open-ended search, to prune through the rubbish and get to the heart of what I'm after, especially being able to search and explain problems more naturally, instead of having to distil them down to keywords and finesse positive/negative query terms to whittle down results. Google-Fu, or probably more Duck-Duck-Fu now.
Just keep your guard up, trust yourself more than it, and if something doesn't smell right, it probably isn't. LLMs are superb tools, but that's it: they're tools. They're not intelligent and won't reason; they're just probability engines wrapped around language, so never blindly follow what they say.
Here's an example below of my `config.json` from Continue (the `customCommands` are just the default ones) to hopefully get you started.
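If you'd rather start from a minimal file, something along these lines is the general shape (an illustrative sketch of Continue's JSON config format as it stands at the time of writing, not a verbatim copy of mine; swap the model names for whatever you've pulled):

```json
{
  "models": [
    {
      "title": "Qwen Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    },
    {
      "title": "Llama 3",
      "provider": "ollama",
      "model": "llama3:8b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Coder",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
```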