Releases · predibase/lorax
v0.12.1
🎉 Enhancements
- Add support for adapter loading in mllama by @ajtejankar in #669
- Record number of skipped tokens in the response by @tgaddair in #681
- Record TTFT and TPOT in response headers by @tgaddair in #684
- Add cli arg --speculation-max-batch-size by @tgaddair in #686
- Use --predibase-api-token parameter when downloading by @joseph-predibase in #687
- Launcher args for compile max batch size and rank by @tgaddair in #690
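The TTFT and TPOT response headers added in #684 can be inspected by calling the REST API directly and dumping the headers. A minimal sketch in Python, assuming a LoRAX server on localhost:8080 and the standard /generate payload shape; the exact header names depend on the server build, so the sketch simply prints everything it gets back.

```python
import requests

# Minimal sketch: send a generate request to a local LoRAX server and print
# every response header, which is where the new timing metadata shows up.
# Assumes the server is reachable at http://localhost:8080.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is LoRAX?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
resp.raise_for_status()

for name, value in resp.headers.items():
    print(f"{name}: {value}")

print(resp.json()["generated_text"])
```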
🐛 Bugfixes
- Fix stella embeddings + Integration tests for lorax by @magdyksaleh in #668
- Fix lora loading and indexing bug in mllama by @ajtejankar in #682
- Set maximum grpc message receive size to 2GiB by @tgaddair in #667
- Fix frequency_penalty and presence_penalty by @tgaddair in #672
- Fix scores (remove debug code) by @tgaddair in #673
- Fix top_p to allow setting it to 1.0 by @magdyksaleh in #676
- Format fixes tool calling by @magdyksaleh in #680
- Use predibase API token when downloading pbase files by @joseph-predibase in #688
- Pbase adapter source resolution by @magdyksaleh in #689
- fix: Make logprob field optional for response Pydantic validation by @jeffreyftang in #692
🔧 Maintenance
- Only use sha tag for running int tests by @magdyksaleh in #674
- Fix int tests 2 by @magdyksaleh in #675
- Always build and push image before running IT by @arnavgarg1 in #678
- Only push main if int tests pass by @magdyksaleh in #677
- Remove bad check by @magdyksaleh in #683
Full Changelog: v0.12.0...v0.12.1
v0.12.0: Multi-LoRA prefix caching, fp8 kv cache, Mllama, function calling
🎉 Enhancements
- Prompt prefix caching for multi-LoRA by @tgaddair in #655
- Convert to Triton Punica kernels by @tgaddair in #658
- Support FP8 KV Cache by @ajtejankar in #652
- Added Mllama by @tgaddair in #619
- Flash mllama by @tgaddair in #622
- support MRL embeddings for qwen2 by @magdyksaleh in #621
- Support for Embeddings with XLM-RoBERTa and Adapters by @jfhetzer in #656
- Merge weights by @magdyksaleh in #600
- feat: Function calling with output schema enforcement by @jeffreyftang in #536
- Chunked prefill by @tgaddair in #653
- add num inputs to metrics by @magdyksaleh in #615
- Add --predibase-api-token CLI arg by @joseph-predibase in #617
- Add --disable-sgmv flag by @joseph-predibase in #639
- Enhance Structured Output Interface by @GirinMan in #644
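Function calling with output schema enforcement (#536) is served through the OpenAI-compatible chat completions route. A minimal sketch, assuming the openai Python package, a LoRAX deployment at localhost:8080 exposing /v1, and a made-up get_weather tool used purely for illustration:

```python
from openai import OpenAI

# Sketch of function calling over the OpenAI-compatible API (#536).
# Assumes a LoRAX server at localhost:8080; get_weather is a hypothetical
# tool and "my-base-model" is a placeholder model/adapter name.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="my-base-model",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(completion.choices[0].message)
```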
🐛 Bugfixes
- Add done message to openai endpoints by @magdyksaleh in #618
- Fix CUDA graph compilation by @tgaddair in #627
- Fix CUDA graphs for Medusa by @tgaddair in #628
- Fix retrace message by @tgaddair in #629
- Fix prefix plumbing and BGMV compiler dimensions by @tgaddair in #631
- Fix punica kernel compilation by @tgaddair in #632
- Fix FlashInfer when not using prefix caching by @tgaddair in #633
- Fix cuda graph tracing without lora ranks by @tgaddair in #634
- Added ranks 96 and 128 to BGMV kernel by @tgaddair in #630
- Look for language model lm head by @Infernaught in #640
- Return n choices for chat completions API by @tgaddair in #638
- Fix llava_next for llama 3.2 vision cross attention states by @tgaddair in #641
- Fix compile for qwen-2.5-32b by @tgaddair in #645
- Added backwards compatible field to OpenAI json_object API by @tgaddair in #648
- Fix PREDIBASE_API_TOKEN env var being thrown away by @joseph-predibase in #654
- Fix absent fp8_kv property on llama and qwen models by @ajtejankar in #662
- Fix seqlen bug for sliding window models like Mistral v0.1 by @ajtejankar in #660
- Fix sliding window + compile bug by @ajtejankar in #666
🔧 Maintenance
- upgrade poetry by @magdyksaleh in #613
- Fix deps4 by @magdyksaleh in #614
- Remove LD_PRELOAD from Docker and improve error message by @tgaddair in #623
- add label to id this as a lorax image by @noyoshi in #626
- pass correct stuff to predibase-reporter by @magdyksaleh in #635
- try using arc runner for build by @noyoshi in #646
- change runner 2 by @magdyksaleh in #650
New Contributors
- @joseph-predibase made their first contribution in #617
- @jfhetzer made their first contribution in #656
Full Changelog: v0.11.0...v0.12.0
v0.11.0: Prefix caching, VLMs, BERT (embed, NER), FP8
🎉 Enhancements
- Add prefix caching by @tgaddair in #581
- Add Llava Next (VLM) by @tgaddair in #586
- Embedder Service v0 with FlashBert by @magdyksaleh in #385
- Added eager prefill option by @tgaddair in #524
- BERT NER support by @magdyksaleh in #531
- Preload adapters during init by @tgaddair in #543
- Add support for batching to embedder models by @tgaddair in #503
- Bert to gpu by @magdyksaleh in #507
- Add distilbert by @magdyksaleh in #508
- feat: return usage in ChatCompletionStreamResponse by @GirinMan in #506
- Added Gemma2 by @tgaddair in #530
- Move kv cache allocation to router to ensure correct block allocation by @tgaddair in #545
- Tokenize inputs in router by @tgaddair in #548
- Add support for Llama 3 rotary embeddings by @tgaddair in #551
- Apply chat template in router to properly validate input length by @tgaddair in #538
- Allow eager_prefill to be set in Helm chart by @bdalal in #557
- Support FP8 for Mistral by @ajtejankar in #559
- Support FP8 for LLaMa by @ajtejankar in #562
- Support classify batch by @magdyksaleh in #577
- Adding longrope to serve Phi-3 by @huytuong010101 in #576
- Add new agnostic health endpoint by @magdyksaleh in #588
- Support FlashInfer for BERT by @tgaddair in #597
- Speed up NER inference by @magdyksaleh in #598
- Disable healthcheck tracing and add metrics to classify + classify_batch endpoints by @magdyksaleh in #603
- Added launcher args for preloaded_adapter_source and backend by @tgaddair in #604
- Parallelize tokenization for /classify_batch and remove block allocator for non-causal LMs by @tgaddair in #609
- support bge-base-en-v1.5 by @magdyksaleh in #593
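For the new classification paths (BERT NER in #531, batched classification in #577, and the classify/classify_batch endpoints instrumented in #603), here is a rough sketch of calling the REST route directly. The {"inputs": ...} payload shape is an assumption borrowed from /generate, so treat it as illustrative rather than the documented contract.

```python
import requests

# Rough sketch: run token classification (NER) against a local LoRAX server.
# The payload shape is assumed to mirror /generate; check the LoRAX docs for
# the exact request schema.
resp = requests.post(
    "http://localhost:8080/classify",
    json={"inputs": "Predibase is headquartered in San Francisco."},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # expected: predicted entities/labels for the input text
```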
🐛 Bugfixes
- Fix for the LM_HEAD issue by @ajtejankar in #475
- fix: load tokenizer/config with trust_remote_code by @thincal in #476
- Fix issue with Medusa batch load signature by @tgaddair in #492
- add missed dtypes for 8bit kv cache by @flozi00 in #490
- Fix quant cache OOM by @flozi00 in #494
- Add retries on common session errors for the client by @gyanesh-mishra in #495
- Revert AWQ to stable commit by @tgaddair in #498
- Fixed phi-3 with Su Rotary Embedding by @tgaddair in #499
- Fixed case where loaded lora adapter has no segments by @tgaddair in #510
- fix batching bug by @magdyksaleh in #513
- Fix issue with GQA initialization for Qwen2 by @arnavgarg1 in #514
- Disable fp8 kv cache for lovelace by @tgaddair in #520
- Bug fix for illegal memory access error caused when running medusa lora and plain loras in parallel. by @ajtejankar in #525
- bug : fix the type checking errors thrown by new ruff version by @ajtejankar in #533
- bug : fix Qwen-2 sliding_window config bug by @ajtejankar in #532
- Infer dtype from model config when not explicitly specified by @arnavgarg1 in #534
- Fix gemma2 by @Infernaught in #539
- Fix : compile bug causing models to error with 'lora' key not found by @ajtejankar in #547
- Fix: short circuit download, load, offload for preloaded adapters by @tgaddair in #552
- Fix the attention bug caused by upgrading vLLM by @ajtejankar in #555
- Fix LM head interaction with Medusa by @tgaddair in #567
- Fix adapter mask when using speculative decoding + LM head LoRA by @tgaddair in #570
- Fix outlines compatibility with speculative decoding by @tgaddair in #578
- Fix qwen lora by @magdyksaleh in #585
- Fix classify and classify_batch for Python client by @tgaddair in #608
- Fix ner entity merging by @magdyksaleh in #596
- Fix class ner by @magdyksaleh in #602
- Fix dependencies to address high urgency dependabot alerts by @magdyksaleh in #612
📝 Docs
- docs: update development_env.md by @eltociear in #515
- Doc updates for Medusa training by @arnavgarg1 in #544
- Add "pbase" to adapter_source docstrings by @alexsherstinsky in #583
- Add prerequisites to readme by @csabakecskemeti in #584
🔧 Maintenance
- chore: update infer.rs by @eltociear in #487
- start porting latest tgi by @flozi00 in #480
- Bump client to v0.6.1 by @tgaddair in #496
- Update Makefile-awq by @flozi00 in #493
- hqq upgrades by @flozi00 in #491
- try out an integration test workflow by @noyoshi in #516
- no warm up by @magdyksaleh in #540
- Update PyTorch, CUDA, vLLM, and Bitsandbytes by @ajtejankar in #553
- Added missing nvidia-ml-py package by @tgaddair in #558
- parse headers for errored requests by @noyoshi in #564
- handle folders for predibase by @noyoshi in #565
- enable mistral nemo by @magdyksaleh in #568
- bump version by @noyoshi in #569
- Install flashinfer in Docker by @tgaddair in #582
- feat : use --no-cache-dir flag to pip in dockerfiles to save space by @Rajpratik71 in #587
- Add missing configs by @magdyksaleh in #590
- Address rust compiler warnings by @magdyksaleh in #589
New Contributors
- @eltociear made their first contribution in #487
- @ajtejankar made their first contribution in #475
- @bdalal made their first contribution in #557
- @Rajpratik71 made their first contribution in #587
- @csabakecskemeti made their first contribution in #584
Full Changelog: v0.10.0...v0.11.0
v0.10.0: Speculative decoding adapters and SGMV + BGMV
🎉 Enhancements
- Added support for Medusa speculative decoding adapters by @tgaddair in #372
- Added Medusa adapters per request by @tgaddair in #454
- Support jointly trained Medusa + LoRA adapters by @tgaddair in #482
- Adds prompt lookup decoding (ngram speculation) by @tgaddair in #375
- Use SGMV for prefill BGMV for decode by @tgaddair in #464
- Added phi3 by @tgaddair in #445
- Added support for C4AI Command-R (cohere) by @tgaddair in #411
- Add DBRX by @tgaddair in #423
- Refactor adapter interface to support adapters other than LoRA (e.g., speculative decoding) by @tgaddair in #359
- Initializing server with an adapter sets it as the default by @tgaddair in #370
- Implement Seed Parameter Support for OpenAI-Compatible API Endpoints by @GirinMan in #374
- lorax launcher now has --default-adapter-source by @noyoshi in #419
- enh: Make client's handling of error responses more robust and user-friendly by @jeffreyftang in #418
- Support both medusa v1 and v2 by @tgaddair in #421
- use default HF HUB token when checking for base model info by @noyoshi in #428
- Added adapter_source and api_token to completions API by @tgaddair in #446
- Increase max stop sequences by @tgaddair in #453
- Support LORAX_USE_GLOBAL_HF_TOKEN by @tgaddair in #462
- Allow setting temperature=0 by @tgaddair in #467
- Merge medusa segments by @tgaddair in #471
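Per-request adapter selection, including the new Medusa adapters from #454, goes through the same adapter_id parameter used for LoRA. A minimal sketch with the Python client, assuming a server on localhost:8080 and a placeholder adapter repository name:

```python
from lorax import Client

# Minimal sketch: route a single request through a specific adapter.
# "my-org/my-adapter" is a placeholder; point it at a real LoRA or Medusa
# adapter available from the chosen adapter_source.
client = Client("http://127.0.0.1:8080")
response = client.generate(
    "Write a haiku about GPUs.",
    adapter_id="my-org/my-adapter",
    adapter_source="hub",  # or "pbase" / "s3", per the sources supported by the client
    max_new_tokens=64,
)
print(response.generated_text)
```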
🐛 Bugfixes
- Fix CUDA compile when using long sequence lengths by @tgaddair in #363
- Fix CUDA graph compile with speculative decoding by @tgaddair in #381
- Fix mixtral for speculative decoding by @tgaddair in #382
- Fix import of EntryNotFoundError by @tgaddair in #401
- Fix warmup when using speculative decoding by @tgaddair in #402
- fix: assign bias directly by @thincal in #398
- fix: Enable ignoring botocore ClientError during download_file by @jeffreyftang in #404
- Fix Pydantic v2 adapter_id and merged_adapters validation by @claudioMontanari in #408
- fix: Suppress pydantic warning over model_id field in DeployedModel by @jeffreyftang in #409
- Fix phi by @noyoshi in #410
- fix: Missing / in pbase endpoint by @jeffreyftang in #415
- Print correct number of key value heads on dimension assertion. by @dstripelis in #414
- Fix request variable by @Infernaught in #416
- fix: Rename _get_slice to get_slice by @tgaddair in #424
- fix: Hack for llama3 eos_token_id by @tgaddair in #427
- fix: checking the base_model_name_or_path of adapter_config and early return if null by @thincal in #431
- fix: use logits to calculate alternative tokens by @JTS22 in #425
- Fixed default pbase endpoint url by @tgaddair in #435
- fix: Downloading private adapters from HF by @tgaddair in #443
- Fix Outlines compatibility with speculative decoding by @tgaddair in #447
- fix: Handle edge case where allowed tokens are out of bounds by @tgaddair in #449
- Fix special tokens showing up in the response by @tgaddair in #450
- Fix Medusa + LoRA by @tgaddair in #455
- Ensure Llama 3 stops on all EOS tokens by @arnavgarg1 in #456
- Reuse session per class instance by @gyanesh-mishra in #468
📝 Docs
- Fix chat completion and docs by @GirinMan in #358
- Added batch processing example by @tgaddair in #386
- Medusa docs by @tgaddair in #459
- Updated supported base models in docs by @arnavgarg1 in #458
- Docs for private HF models by @tgaddair in #460
- Auth header docs by @tgaddair in #461
🔧 Maintenance
- Add CNAME file for Docs by @martindavis in #364
- Update tagging logic and add flake8 linter by @magdyksaleh in #365
- Apply black formatting by @tgaddair in #376
- Switch formatting and linting to ruff by @tgaddair in #378
- Style: change line length to 120 and enforce import sort order by @tgaddair in #383
- Bump pydantic version to >2, <3 by @claudioMontanari in #405
- refactor: set config into weights for quantization feature support more easily by @thincal in #400
- Update Predibase integration to support v2 API by @jeffreyftang in #403
- logging by @magdyksaleh in #436
- revert by @magdyksaleh in #437
- Upgrade to CUDA 12.1 and PyTorch 2.3.0 by @tgaddair in #472
- int: Bump Lorax Client to 3.9 by @gyanesh-mishra in #486
- Bump lorax client v0.6.0 by @tgaddair in #488
New Contributors
- @GirinMan made their first contribution in #358
- @martindavis made their first contribution in #364
- @thincal made their first contribution in #398
- @claudioMontanari made their first contribution in #405
- @dstripelis made their first contribution in #414
Full Changelog: v0.9.0...v0.10.0
v0.9.0
🎉 Enhancements
- Allow assigning dedicated memory reservation for adapters on GPU by @tgaddair in #303
- Enforce adapters cannot be loaded past --adapter-memory-fraction by @tgaddair in #306
- Added Qwen2 by @tgaddair in #327
- Make max_new_tokens optional, default to max_total_tokens - input_length by @tgaddair in #353
- Expose ignore_eos_token option in generate requests by @jeffreyftang in #340
- Generate to max_total_tokens during warmup by @tgaddair in #286
- Add support for returning alternative tokens by @JTS22 in #297
- feat: add repetition_penalty and top_k to openai by @huytuong010101 in #288
- Add support for LoRA adapters trained with Rank-Stabilized scaling by @arnavgarg1 in #299
- Provide more granular methods to configure the embedded S3 client. by @mitchklusty in #325
- Allow specifying base model as model param in OpenAI API by @tgaddair in #331
- Add ignore_eos_token param to completions and chat completions endpoints by @jeffreyftang in #344
- Log whether SGMV kernel is enabled by @tgaddair in #342
- Log generated tokens out to file when streaming by @magdyksaleh in #309
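The ignore_eos_token option (#340, #344, #341) and the now-optional max_new_tokens (#353) are ordinary generation parameters. A small sketch with the Python client, assuming a local server; with ignore_eos_token=True the server keeps generating past EOS until the token budget is spent:

```python
from lorax import Client

# Sketch: exercise the ignore_eos_token request option added in this release.
# Useful for benchmarking fixed-length decodes, since generation no longer
# stops early at the EOS token.
client = Client("http://127.0.0.1:8080")
response = client.generate(
    "Count from one to ten:",
    max_new_tokens=128,
    ignore_eos_token=True,
)
print(response.generated_text)
```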
🐛 Bugfixes
- Fix tensor parallelism with SGMV to use true rank of the LoRA after splitting by @tgaddair in #324
- Fix hanging caused by tqdm stderr not being printed by @tgaddair in #352
- Fix dynamic RoPE by @tgaddair in #350
- Only update cache during warmup by @tgaddair in #351
- Prevent model loading errors from appearing as flash attention import errors by @tgaddair in #328
- Make architecture compatibility check non-fatal if base model config cannot be loaded by @tgaddair in #317
- Fix Qwen2 LoRA loading by @tgaddair in #345
- Remove vec wrapping from OpenAI-compatible response by @jeffreyftang in #273
- Disallow early stopping during warmup by @tgaddair in #290
- Skip returning EOS token on finish_reason 'stop' by @jeffreyftang in #289
- Fixed static adapter loading with same arch by @tgaddair in #300
- Ensure model_id is a string when using a model from s3 by @fadebek in #291
- Fix name for adapter id by @noyoshi in #284
- Update AsyncClient with ignore_eos_token parameter by @jeffreyftang in #341
📝 Docs
- Update docs now that we no longer return a list from OpenAI-compatible endpoints by @jeffreyftang in #281
- Change guided generation to structured generation by @jeffreyftang in #302
- Clarify getting started documentation regarding port number used in pre-built Docker image. by @alexsherstinsky in #313
- Added system requirements to README by @tgaddair in #293
- Update README.md by @tgaddair in #294
🔧 Maintenance
- Split out server and router unit tests by @tgaddair in #275
- Add in response headers to streaming endpoint by @noyoshi in #282
- Propagate bearer token from header if one exists for OpenAI-compatible endpoints by @jeffreyftang in #278
- Update tokenizers to v0.15 to be consistent with server by @tgaddair in #285
- Autogen python client docs by @tgaddair in #295
- Reporting on total tokens by @noyoshi in #349
New Contributors
- @huytuong010101 made their first contribution in #288
- @fadebek made their first contribution in #291
- @JTS22 made their first contribution in #297
- @alexsherstinsky made their first contribution in #313
- @mitchklusty made their first contribution in #325
Full Changelog: v0.8.1...v0.9.0
v0.8.1: Gemma support
🎉 Enhancements
- Added Gemma by @tgaddair in #267
- Pass details param into client by @magdyksaleh in #265
🔧 Maintenance
- bump version by @magdyksaleh in #268
- Bump by @magdyksaleh in #270
Full Changelog: v0.8.0...v0.8.1
v0.8: Structured Output via Outlines
🎉 Enhancements
- Added Outlines logits processor for JSON schema validation by @tgaddair in #224
- Enable JSON guided generation via OpenAI-compatible API by @jeffreyftang in #243
- JSON schema for guided generation now optionally respects field order by @jeffreyftang in #264
- Set default adapter source by @magdyksaleh in #223
- Pad LoRA ranks to ensure compatibility with SGMV kernel by @tgaddair in #256
- Add model and adapter response headers by @magdyksaleh in #220
- Add Cors params by @magdyksaleh in #221
- Add expose headers by @magdyksaleh in #230
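Guided JSON generation via Outlines (#224) and its OpenAI-compatible exposure (#243) attach a JSON schema to the request. A rough sketch against the chat completions route using plain requests; the schema key inside response_format is an assumed field name here, so check the guided generation guide (#240) for the documented shape.

```python
import requests

# Rough sketch of schema-guided JSON output over the OpenAI-compatible API.
# The "schema" field inside response_format is an assumed name; see the
# guided generation guide for the exact contract. Model name is a placeholder.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "my-base-model",
        "messages": [{"role": "user", "content": "Describe a fictional person as JSON."}],
        "response_format": {"type": "json_object", "schema": schema},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```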
🐛 Bugfixes
- Properly split out model_id when retrieving adapter weights downloaded from S3 by @jeffreyftang in #246
- Fixed TIES merging to calculate sign before applying weights by @tgaddair in #239
- Update s3.py by @llama-shepard in #234
- Fix concatenate for flash batch by @tgaddair in #254
- Fixed batch merging and filtering to handle Outlines state by @tgaddair in #263
📝 Docs
- Add guide for guided generation by @jeffreyftang in #240
- Added contributing guide by @tgaddair in #226
- Update README to include model merging by @tgaddair in #225
- Updated structured output by @tgaddair in #258
- Minor corrections to development env setup instructions by @jeffreyftang in #228
🔧 Maintenance
- Upgrade docker to use rust 1.75 and ubuntu 22.04 by @tgaddair in #250
- Upgrading rust for dependency changes by @DhruvaBansal00 in #248
- fix paths on runner by @noyoshi in #242
New Contributors
- @jeffreyftang made their first contribution in #228
- @DhruvaBansal00 made their first contribution in #248
Full Changelog: v0.7.0...v0.8.0
v0.7: LoRA Merging (linear, TIES, DARE) per request
🎉 Enhancements
- Merge multiple LoRA adapters per request (linear, TIES, DARE) by @tgaddair in #212
- Eetq by @flozi00 in #195
- hqq JIT Quantization by @flozi00 in #147
- Added Bloom dynamic adapter loading by @tgaddair in #187
- Added pbase adapter_source and expose api_token in client by @tgaddair in #181
- Cloudflare R2 Source by @llama-shepard in #198
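Request-level adapter merging (#212) passes several adapter IDs plus a merge strategy in a single generate call. A rough sketch of the /generate payload; only the merged_adapters field name comes from these notes, while the inner keys (ids, weights, merge_strategy) and the adapter IDs are assumptions for illustration, so verify against the model merging docs (#225).

```python
import requests

# Rough sketch: merge two LoRA adapters at request time (linear/TIES/DARE).
# The inner field names (ids, weights, merge_strategy) are assumed for
# illustration, and the adapter IDs are placeholders.
payload = {
    "inputs": "Summarize the plot of Hamlet in two sentences.",
    "parameters": {
        "max_new_tokens": 128,
        "merged_adapters": {
            "ids": ["my-org/adapter-a", "my-org/adapter-b"],
            "weights": [0.5, 0.5],
            "merge_strategy": "ties",
        },
    },
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```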
🐛 Bugfixes
- Fixed Phi for new HF format by @tgaddair in #192
- Fixed OpenAI stream response data by @tgaddair in #193
- fix: OpenAI response format by @tgaddair in #184
- Fix RoPE and YARN scaling by @tgaddair in #202
- check for base model earlier in the adapter function by @noyoshi in #196
🔧 Maintenance
- Upgrade to pytorch==2.2.0 by @tgaddair in #217
- upgrade exllama kernel by @flozi00 in #209
- Add a model cache to avoid running out of storage by @magdyksaleh in #201
New Contributors
- @llama-shepard made their first contribution in #198
Full Changelog: v0.6.0...v0.7.0
v0.6: OpenAI compatible API
🎉 Enhancements
- OpenAI v1 Completions API by @tgaddair in #170
- OpenAI v1 Chat Completions API by @tgaddair in #171
- Added prompt_tokens to the response by @tgaddair in #165
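With the OpenAI-compatible routes from #170 and #171, the stock openai Python client can talk to a LoRAX deployment directly. A minimal sketch, assuming a server at localhost:8080; the model value is a placeholder for the deployed base model or an adapter ID.

```python
from openai import OpenAI

# Minimal sketch: drive LoRAX through its OpenAI-compatible chat endpoint.
# Assumes a server at localhost:8080; "my-base-model" is a placeholder.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

chat = client.chat.completions.create(
    model="my-base-model",
    messages=[{"role": "user", "content": "Give me one sentence about LoRA."}],
    max_tokens=64,
)
print(chat.choices[0].message.content)
```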
🔧 Maintenance
- fix: Only install stanford-stk on linux by @tgaddair in #169
- added separate installation for torch by @asingh9530 in #173
New Contributors
- @asingh9530 made their first contribution in #173
Full Changelog: v0.5.0...v0.6.0
v0.5: CUDA graph compilation
🐛 Bugfixes
- Fixed deadlock in sgmv_shrink kernel caused by imbalanced segments by @tgaddair in #156
- Fixed loading adapter from absolute s3 path by @tgaddair in #161
📝 Docs
- Update client docs with new endpoint source by @abidwael in #126
- Update client docs with new endpoint source by @abidwael in #146
🔧 Maintenance
- Reduce Docker size by removing duplicate torch install by @tgaddair in #144
- remove CACHE_MANAGER in flash_causal_lm.py by @michaelfeil in #157
New Contributors
- @michaelfeil made their first contribution in #157
Full Changelog: v0.4.1...v0.5.0