evaluation

Here are 2,248 public repositories matching this topic...

mlflow / mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

open-source machine-learning ai apache-spark evaluation ml openai agents observability model-management mlops mlflow agentops prompt-engineering ai-governance langchain llmops llm-evaluation

Updated Mar 15, 2026
Python

langfuse / langfuse

Star

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated Mar 14, 2026
TypeScript

comet-ml / opik

Star

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

open-source playground evaluation openai hacktoberfest llm prompt-engineering hacktoberfest2025 langchain llmops llama-index llm-evaluation llm-observability

Updated Mar 14, 2026
Python

promptfoo / promptfoo

Star

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Mar 15, 2026
TypeScript

Tencent / WeKnora

Star

LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using RAG paradigm.

Updated Mar 13, 2026
Go

vibrantlabsai / ragas

Star

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops

Updated Feb 24, 2026
Python

mrgloom / awesome-semantic-segmentation

Star

🤘 awesome-semantic-segmentation

benchmark evaluation deeplearning semantic-segmentation

Updated May 8, 2021

oumi-ai / oumi

Star

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

evaluation inference llama fine-tuning sft dpo slms llms vlms gpt-oss gpt-oss-120b gpt-oss-20b

Updated Mar 13, 2026
Python

open-compass / opencompass

Star

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Mar 13, 2026
Python

coze-dev / coze-loop

Star

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated Mar 15, 2026
Go

Helicone / helicone

Star

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

Updated Mar 7, 2026
TypeScript

Kiln-AI / Kiln

Star

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Updated Mar 15, 2026
Python

Marker-Inc-Korea / AutoRAG

Star

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

Updated Mar 10, 2026
Python

MichaelGrupp / evo

Star

Python package for the evaluation of odometry and SLAM

benchmark robotics tum mapping metrics evaluation ros slam trajectory-analysis odometry trajectory ros2 kitti euroc trajectory-evaluation

Updated Feb 11, 2026
Python

Knetic / govaluate

Star

Arbitrary expression evaluation for golang

go parsing evaluation expression

Updated Mar 25, 2025
Go

Agenta-AI / agenta

Star

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

evaluation agents observability prompt-engineering llmops prompt-management llm-tools llm-framework llm-playground llm-platform llm-evaluation rag-evaluation llm-monitoring llm-as-a-judge llm-observability

Updated Mar 14, 2026
TypeScript

open-compass / VLMEvalKit

Star

Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks

computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip claude openai-api gpt4 large-language-models llm chatgpt llava qwen gpt-4v

Updated Mar 13, 2026
Python

EvolvingLMMs-Lab / lmms-eval

Star

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

benchmark evaluation agi video-understanding vlm multimodal large-language-models vision-language-model llm-evaluation audio-evaluation multimodal-evaluation

Updated Mar 15, 2026
Python

sdiehl / write-you-a-haskell

Star

Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)

compiler functional-programming book lambda-calculus evaluation type-theory type pdf-book type-checking haskel type-system functional-language hindley-milner type-inference intermediate-representation

Updated Jan 11, 2021
Haskell

CLUEbenchmark / SuperCLUE

Star

SuperCLUE: 中文通用大模型综合性基准 | A Benchmark for Foundation Models in Chinese

evaluation chinese gpt-4 foundation-models chatgpt

Updated Feb 6, 2026

Improve this page

Add a description, image, and links to the evaluation topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the evaluation topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

evaluation

Here are 2,248 public repositories matching this topic...

mlflow / mlflow

langfuse / langfuse

comet-ml / opik

promptfoo / promptfoo

Tencent / WeKnora

vibrantlabsai / ragas

mrgloom / awesome-semantic-segmentation

oumi-ai / oumi

open-compass / opencompass

coze-dev / coze-loop

Helicone / helicone

Kiln-AI / Kiln

Marker-Inc-Korea / AutoRAG

MichaelGrupp / evo

Knetic / govaluate

Agenta-AI / agenta

open-compass / VLMEvalKit

EvolvingLMMs-Lab / lmms-eval

sdiehl / write-you-a-haskell

CLUEbenchmark / SuperCLUE

Improve this page

Add this topic to your repo