v0.2.0

Released by github-actions on 02 Jul 19:25 · 53da2c6

Concurrency

Ollama 0.2.0 is now available with concurrency support. This unlocks two specific features:

Parallel requests

Ollama can now serve multiple requests at the same time, using only a little bit of additional memory for each request. This enables use cases such as:

  • Handling multiple chat sessions at the same time
  • Hosting a code completion LLM for your internal team
  • Processing different parts of a document simultaneously
  • Running several agents at the same time
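
As a rough sketch of what parallel requests look like in practice, two generation calls can be issued concurrently against the local HTTP API and both are served from a single loaded copy of the model. The port, model name, and prompts below are only illustrative, and assume the model has already been pulled:

# Issue two generation requests at the same time against a local Ollama server
# (assumes the default port 11434 and that llama3 has been pulled)
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Summarize the plot of Hamlet.", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Write a haiku about the ocean.", "stream": false}' &
wait   # both requests run concurrently; neither waits for the other to finish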

Multiple models

Ollama now supports loading different models at the same time, dramatically improving:

  • Retrieval Augmented Generation (RAG): both the embedding and text completion models can be loaded into memory simultaneously
  • Agents: multiple different agents can now run simultaneously
  • Running large and small models side by side

Models are automatically loaded and unloaded based on requests and how much GPU memory is available.
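
As a minimal sketch of using two models at once, for example an embedding model alongside a text completion model in a RAG pipeline, the requests below target a local server. The model names and prompts are only examples, and both models must already be pulled:

# Embedding request and generation request served by two different loaded models
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "all-minilm", "prompt": "What is concurrency?"}'
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain concurrency in one paragraph.", "stream": false}'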

To see which models are loaded, run ollama ps:

% ollama ps
NAME                    ID              SIZE    PROCESSOR       UNTIL
gemma:2b                030ee63283b5    2.8 GB  100% GPU        4 minutes from now
all-minilm:latest       1b226e2802db    530 MB  100% GPU        4 minutes from now
llama3:latest           365c0bd3c000    6.7 GB  100% GPU        4 minutes from now

For more information on concurrency, see the FAQ.
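
The FAQ also covers the settings that control scheduling. As an illustrative example, concurrency limits can be tuned with environment variables when starting the server manually; the values below are placeholders, and by default Ollama chooses limits automatically based on available memory:

# Example values only: allow up to 4 parallel requests per model and 2 loaded models
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve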

New models

  • GLM-4: A strong multilingual general language model with performance competitive with Llama 3.
  • CodeGeeX4: A versatile model for AI software development scenarios, including code completion.
  • Gemma 2: Improved output quality; base text generation models are now available.
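
Each of the new models can be tried directly from the command line once this release is installed; the tags below follow the names listed in the Ollama library:

ollama run glm4
ollama run codegeex4
ollama run gemma2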

What's Changed

  • Improved Gemma 2
    • Fixed issue where the model would generate invalid tokens after hitting the context window
    • Fixed inference output issues with gemma2:27b
    • Re-downloading the model may be required: ollama pull gemma2 or ollama pull gemma2:27b
  • Ollama will now show a better error if a model architecture isn't supported
  • Improved handling of quotes and spaces in Modelfile FROM lines (see the example after this list)
  • Ollama will now return an error if the system does not have enough memory to run a model on Linux
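
For the Modelfile change above, a FROM line can now more reliably reference a quoted local path that contains spaces. The sketch below uses a hypothetical path and an example parameter:

# Modelfile pointing at a local GGUF file via a quoted path with spaces (path is illustrative)
FROM "/Users/me/My Models/llama3.Q4_0.gguf"
PARAMETER temperature 0.7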

New Contributors

Full Changelog: v0.1.48...v0.2.0