llama.cr


Crystal bindings for llama.cpp, a C/C++ implementation of LLaMA, Falcon, GPT-2, and other large language models.

The version in shard.yml corresponds to the compatible llama.cpp build number.

This project is under active development and may change rapidly.

Versioning Policy

  • This library version tracks the upstream llama.cpp build number.
  • The version in shard.yml uses the numeric build value (for example 8119).
  • Git tags use the v<build> format (for example v8119).
  • Compatibility target is one upstream build at a time.
  • Consumers should pin an exact shard version (for example 8119), not a version range.

Features

  • Low-level bindings to the llama.cpp C API
  • High-level Crystal wrapper classes for easy usage
  • Memory management for C resources
  • Simple text generation interface
  • Advanced sampling methods (Min-P, Typical, Mirostat, etc.)
  • Batch processing for efficient token handling
  • KV cache management for optimized inference
  • State saving and loading

Installation

Prerequisites

You need the llama.cpp shared library (libllama) available on your system.

1. Download Prebuilt Binary (Recommended)

LLAMA_BUILD="b$(shards version)"
curl -L "https://github.com/ggml-org/llama.cpp/releases/download/${LLAMA_BUILD}/llama-${LLAMA_BUILD}-bin-ubuntu-x64.tar.gz" -o llama.tar.gz
tar -xzf llama.tar.gz
sudo cp llama-${LLAMA_BUILD}/*.so* /usr/local/lib/
sudo ldconfig

For macOS, replace ubuntu-x64 with macos-arm64 and *.so with *.dylib.
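
Spelled out, the macOS (Apple Silicon) steps look like this — the same commands as above with the asset name and library extension swapped (there is no `ldconfig` on macOS):

```shell
LLAMA_BUILD="b$(shards version)"
curl -L "https://github.com/ggml-org/llama.cpp/releases/download/${LLAMA_BUILD}/llama-${LLAMA_BUILD}-bin-macos-arm64.tar.gz" -o llama.tar.gz
tar -xzf llama.tar.gz
sudo cp llama-${LLAMA_BUILD}/*.dylib /usr/local/lib/
```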

Alternative: Use local libraries with standard linker flags

If you prefer not to install system-wide, point Crystal and the runtime loader to your local llama.cpp library directory:

export LLAMA_LIB_DIR=/path/to/llama.cpp
LIBRARY_PATH="$LLAMA_LIB_DIR" crystal build examples/simple.cr --link-flags "-L$LLAMA_LIB_DIR -Wl,-rpath,$LLAMA_LIB_DIR -lllama -lggml"
LD_LIBRARY_PATH="$LLAMA_LIB_DIR" ./simple --model models/tiny_model.gguf

On macOS, replace LD_LIBRARY_PATH with DYLD_LIBRARY_PATH.

If backend auto-detection fails in newer llama.cpp builds, also set GGML_BACKEND_PATH to a backend shared library file (not a directory), for example:

export GGML_BACKEND_PATH="$LLAMA_LIB_DIR/libggml-cpu-haswell.so"
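
The exact backend filename varies by build and CPU feature set (the `haswell` suffix above is just one example), so it can help to list which backend libraries your copy of llama.cpp actually shipped:

```shell
# Linux; use *.dylib instead of *.so* on macOS
ls "$LLAMA_LIB_DIR"/libggml-*.so*
```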

For local development/tests, a full example is:

MODEL_PATH=/path/to/model.gguf \
LIBRARY_PATH="$LLAMA_LIB_DIR" \
LD_LIBRARY_PATH="$LLAMA_LIB_DIR" \
GGML_BACKEND_PATH="$LLAMA_LIB_DIR/libggml-cpu-haswell.so" \
crystal spec

Minimal examples:

# Linux
LIBRARY_PATH="$LLAMA_LIB_DIR" crystal build examples/simple.cr --link-flags "-L$LLAMA_LIB_DIR -Wl,-rpath,$LLAMA_LIB_DIR -lllama -lggml"
LD_LIBRARY_PATH="$LLAMA_LIB_DIR" ./simple --model models/tiny_model.gguf

# macOS
LIBRARY_PATH="$LLAMA_LIB_DIR" crystal build examples/simple.cr --link-flags "-L$LLAMA_LIB_DIR -Wl,-rpath,$LLAMA_LIB_DIR -lllama -lggml"
DYLD_LIBRARY_PATH="$LLAMA_LIB_DIR" ./simple --model models/tiny_model.gguf
2. Build from Source (advanced users)

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
LLAMA_BUILD="b$(shards version ..)"  # ".." assumes this llama.cpp checkout sits inside your project directory
git checkout "${LLAMA_BUILD}"
mkdir build && cd build
cmake .. && cmake --build . --config Release
sudo cmake --install . && sudo ldconfig

Obtaining GGUF Model Files

You'll need a model file in GGUF format. For testing, smaller quantized models (1-3B parameters) with Q4_K_M quantization are recommended.

Popular options include small instruction-tuned models published in GGUF format on Hugging Face (search for "GGUF").

Adding to Your Project

Add the dependency to your shard.yml:

We strongly recommend pinning an exact version because llama.cpp updates can include breaking changes between build numbers.

dependencies:
  llama:
    github: kojix2/llama.cr
    version: "8119"

Then run shards install.

Usage

Basic Text Generation

require "llama"

# Load a model
model = Llama::Model.new("/path/to/model.gguf")

# Create a context
context = model.context

# Generate text
response = context.generate("Once upon a time", max_tokens: 100, temperature: 0.8)
puts response

# Or use the convenience method
response = Llama.generate("/path/to/model.gguf", "Once upon a time")
puts response

Advanced Sampling

require "llama"

model = Llama::Model.new("/path/to/model.gguf")
context = model.context

# Create a sampler chain with multiple sampling methods
chain = Llama::SamplerChain.new
chain.add(Llama::Sampler::TopK.new(40))
chain.add(Llama::Sampler::MinP.new(0.05, 1))
chain.add(Llama::Sampler::Temp.new(0.8))
chain.add(Llama::Sampler::Dist.new(42))

# Generate text with the custom sampler chain
result = context.generate_with_sampler("Write a short poem about AI:", chain, 150)
puts result

Chat Conversations

require "llama"
require "llama/chat"

model = Llama::Model.new("/path/to/model.gguf")
context = model.context

# Create a chat conversation
messages = [
  Llama::ChatMessage.new("system", "You are a helpful assistant."),
  Llama::ChatMessage.new("user", "Hello, who are you?")
]

# Generate a response
response = context.chat(messages)
puts "Assistant: #{response}"

# Continue the conversation
messages << Llama::ChatMessage.new("assistant", response)
messages << Llama::ChatMessage.new("user", "Tell me a joke")
response = context.chat(messages)
puts "Assistant: #{response}"

Embeddings

require "llama"

model = Llama::Model.new("/path/to/model.gguf")

# Create a context with embeddings enabled
context = model.context(embeddings: true)

# Get embeddings for text
text = "Hello, world!"
tokens = model.vocab.tokenize(text)
batch = Llama::Batch.get_one(tokens)
context.decode(batch)
embeddings = context.get_embeddings_seq(0)

puts "Embedding dimension: #{embeddings.size}"
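
Embeddings are typically compared with cosine similarity. Here is a minimal helper in plain Crystal (no llama.cpp calls involved; the element type `Array(Float32)` is assumed to match what `get_embeddings_seq` returns):

```crystal
# Cosine similarity between two embedding vectors.
# Raises if the vectors have different lengths.
def cosine_similarity(a : Array(Float32), b : Array(Float32)) : Float64
  raise ArgumentError.new("size mismatch") unless a.size == b.size
  dot = 0.0
  norm_a = 0.0
  norm_b = 0.0
  a.each_with_index do |x, i|
    dot += x * b[i]
    norm_a += x * x
    norm_b += b[i] * b[i]
  end
  dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
end

sim = cosine_similarity([1.0_f32, 2.0_f32], [2.0_f32, 4.0_f32])
puts sim.round(4) # parallel vectors => ~1.0
```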

Utilities

System Info

puts Llama.system_info

Tokenization Utility

model = Llama::Model.new("/path/to/model.gguf")
puts Llama.tokenize_and_format(model.vocab, "Hello, world!", ids_only: true)

Examples

The examples directory contains sample code demonstrating various features:

  • simple.cr - Basic text generation
  • chat.cr - Chat conversations with models
  • tokenize.cr - Tokenization and vocabulary features

API Documentation

See kojix2.github.io/llama.cr for full API docs.

Core Classes

  • Llama::Model - Represents a loaded LLaMA model
  • Llama::Context - Handles inference state for a model
  • Llama::Vocab - Provides access to the model's vocabulary
  • Llama::Batch - Manages batches of tokens for efficient processing
  • Llama::KvCache - Controls the key-value cache for optimized inference
  • Llama::State - Handles saving and loading model state
  • Llama::SamplerChain - Combines multiple sampling methods

Samplers

  • Llama::Sampler::TopK - Keeps only the top K most likely tokens
  • Llama::Sampler::TopP - Nucleus sampling (keeps tokens until cumulative probability exceeds P)
  • Llama::Sampler::Temp - Applies temperature to logits
  • Llama::Sampler::Dist - Samples from the final probability distribution
  • Llama::Sampler::MinP - Keeps tokens with probability >= P * max_probability
  • Llama::Sampler::Typical - Selects tokens based on their "typicality" (entropy)
  • Llama::Sampler::Mirostat - Dynamically adjusts sampling to maintain target entropy
  • Llama::Sampler::Penalties - Applies penalties to reduce repetition

Development

See DEVELOPMENT.md for development guidelines.

This software is primarily developed using AI-generated code.

Do you need commit rights?

  • If you need commit rights to my repository or want to get admin rights and take over the project, please feel free to contact @kojix2.
  • Many OSS projects become abandoned because only the founder has commit rights to the original repository.

Contributing

  1. Fork it (https://github.com/kojix2/llama.cr/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

This project is available under the MIT License. See the LICENSE file for more info.
