November 21, 2024

Details matter with open source models

Open source models like Qwen 2.5 32B Instruct are performing very well on aider’s code editing benchmark, rivaling closed source frontier models.

But pay attention to how your model is being served and quantized, as it can impact code editing skill. Open source models are often available at a variety of quantizations, and can be served with different token limits. These details matter when working with code.

The graph and table below compares different versions of the Qwen 2.5 Coder 32B Instruct model, served both locally and from a variety of cloud providers.

The HuggingFace BF16 weights served via glhf.chat.
4bit and 8bit quants for mlx.
The results from OpenRouter’s mix of providers which serve the model with different levels of quantization.
Results from individual providers served via OpenRouter and directly to their own APIs.
Ollama locally serving different quantizations from the Ollama model library.

This benchmarking effort highlighted a number of pitfalls and details which can have a significant impact on the model’s ability to correctly edit code:

Quantization – Open source models are often available at dozens of different quantizations. Most seem to only modestly decrease code editing skill, but stronger quantizations do have a real impact.
Context window – Cloud providers can decide how large a context window to accept, and they often choose differently. Ollama defaults to a tiny 2k context window, and silently discards data that exceeds it. Such a small window has catastrophic effects on performance.
Output token limits – Open source models are often served with wildly differing output token limits. This has a direct impact on how much code the model can write or edit in a response.
Buggy cloud providers – Between Qwen 2.5 Coder 32B Instruct and DeepSeek V2.5, there were multiple cloud providers with broken or buggy API endpoints. They seemed to be returning result different from expected based on the advertised quantization and context sizes. The harm caused to the code editing benchmark varied from serious to catastrophic.

The best versions of the model rival GPT-4o, while the worst performing quantization is more like the older GPT-4 Turbo. Even an excellent fp16 quantization falls to GPT-3.5 Turbo levels of performance if run with Ollama’s default 2k context window.

Sections

Benchmark results

Model	Percent completed correctly	Percent using correct edit format	Command	Edit format
Fireworks: unknown	72.2%	94.0%	`aider --model fireworks_ai/accounts/fireworks/models/qwen2p5-coder-32b-instruct`	diff
Deepinfra: BF16	72.2%	94.7%	`aider --model deepinfra/Qwen/Qwen2.5-Coder-32B-Instruct`	diff
mlx-community: 8bit	72.2%	92.5%	`aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-8bit`	diff
mlx-community: 4bit	72.2%	88.7%	`aider --model openai/mlx-community/Qwen2.5-Coder-32B-Instruct-4bit`	diff
Ollama: fp16	71.4%	90.2%	`aider --model ollama/qwen2.5-coder:32b-instruct-fp16`	diff
HuggingFace via GLHF: BF16	71.4%	94.7%	`aider --model openai/hf:Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://glhf.chat/api/openai/v1`	diff
Deepinfra via OpenRouter: BF16	69.9%	89.5%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
Hyperbolic: BF16	69.2%	91.7%	`aider --model openai/Qwen/Qwen2.5-Coder-32B-Instruct --openai-api-base https://api.hyperbolic.xyz/v1/`	diff
Hyperbolic via OpenRouter: BF16	68.4%	89.5%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
Fireworks via OpenRouter: unknown	67.7%	94.0%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
OpenRouter: multiple	67.7%	95.5%	`aider --model openrouter/qwen/qwen-2.5-coder-32b-instruct`	diff
Ollama: q4_K_M	66.9%	94.0%	`aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M`	diff
Ollama: q2_K	61.7%	91.7%	`aider --model ollama/qwen2.5-coder:32b-instruct-q2_K`	diff
Ollama: fp16, 2k ctx	51.9%	46.2%	`aider --model ollama/qwen2.5-coder:32b-instruct-fp16 # num_ctx: 2048`	diff

Setting Ollama’s context window size

Ollama uses a 2k context window by default, which is very small for working with aider. Unlike most other LLM servers, Ollama does not throw an error if you submit a request that exceeds the context window. Instead, it just silently truncates the request by discarding the “oldest” messages in the chat to make it fit within the context window.

Except for the single 2k context result, all of the Ollama results above were collected with at least an 8k context window. An 8k window is large enough to attempt all the coding problems in the benchmark. Aider sets Ollama’s context window to 8k by default, starting in aider v0.65.0.

You can change the Ollama server’s context window with a .aider.model.settings.yml file like this:

- name: ollama/qwen2.5-coder:32b-instruct-fp16
  extra_params:
    num_ctx: 8192

Choosing providers with OpenRouter

OpenRouter allows you to ignore specific providers in your preferences. This can be used to limit your OpenRouter requests to be served by only your preferred providers.

Notes

This article went through many revisions as I received feedback from numerous members of the community. Here are some of the noteworthy learnings and changes:

The first version of this article included incorrect Ollama models.
Earlier Ollama results used the too small default 2k context window, artificially harming the benchmark results.
The benchmark results appear to have uncovered a problem in the way OpenRouter was communicating with Hyperbolic. They fixed the issue 11/24/24, shortly after it was pointed out.