hidai25 / eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

python testing cli mcp evaluation pytest regression-testing ai-agents autogen llm anthropic langchain-agent openai-assistants crewai langgraph agentic-ai agent-evaluation agent-benchmark

Updated Jun 15, 2026
Python

Cre4T3Tiv3 / ai-agents-reality-check

Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Apr 2, 2026
Python

collinear-ai / tau-trait

TraitBasis applied to TauBench

rl-envs rl-training agent-benchmark

Updated Nov 11, 2025
Python

edholofy / dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

Updated May 2, 2026
TypeScript

NoesisVision / nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

Updated Jun 10, 2026
Python

justindobbs / Tracecore

Deterministic runtime for agent evaluation

reliability-engineering specification ai-agents benchmarking-framework autogen fastapi langchain observability-platform ai-evaluation-framework agent-testing agent-benchmark deterministic-testing autoresearch

Updated Mar 25, 2026
Python

dataanswer / awesome-agent-benchmarks

A curated collection of the world’s most advanced benchmark datasets for evaluating Large Language Model (LLM) Agents.

agent benchmarks awesome-list agent-based-modeling awesome-list-awesome-list ai-agent llm-agent llm-evaluation llm-agents agentic-ai guiagents agent-benchmark evaluation-dataset

Updated Dec 21, 2025

he-yufeng / CodeJoust

Pit AI coding agents against the same bug. Score them on tests, diff, cost, and time — pick the winning patch.

python gemini codex cli-tool git-worktree llm aider claude-code coding-agent parallel-agents agent-benchmark ai-arena

Updated May 12, 2026
Python

haoyifan / Silicon-Pantheon

Silicon Pantheon - A tactics game played by AI agents coached by humans

mcp turn-based gpt strategy-game ai-agents llm claude-code agent-benchmark competitive-ai

Updated May 4, 2026
Python

jackjin1997 / AgentBench-Live

Variance-aware benchmark for AI coding agents. Same agent + same task can swing 70 points — we publish min/max, not just averages. Claude Code · Gemini CLI · Codex CLI · Aider · 10 tasks · Docker sandbox · MIT.

benchmark leaderboard evaluation variance reproducibility ai-agents aider llm-evaluation gemini-cli claude-code codex-cli agent-benchmark cli-agents

Updated May 14, 2026
Python

ArshVermaGit / open-ev-code-handler

Deterministic evaluation environment for AI code reviewers covering bugs, security (OWASP), and architecture via FastAPI + OpenEnv.

security-audit ai static-analysis owasp code-review software-architecture evaluation-framework ai-agents fastapi llm llm-evaluation agent-benchmark openenv

Updated Apr 8, 2026
Python

shaumik / PokeArena

A Pokémon battle arena where any agent can play — human, deterministic game-tree AI, or LLM — over an open WebSocket protocol (MCP, CLI, or your own client). One leaderboard, ranked by who plays best.

Updated Jun 16, 2026
Go

AgentBenchAudit / evidence-bounds

Release repository for agent benchmark evidence-reporting artifacts and reproduction workflows.

python benchmark evaluation reproducibility agent-benchmark research-artifacts

Updated May 19, 2026
HTML

chenrui333 / prediction-agent-arena

Prediction-market agent arena for AI agent evaluation, paper trading, practice rounds, contests, and leaderboard-based battle testing.

go docker redis sqlite nextjs leaderboard competitions prediction-markets ai-agents paper-trading fly-io trading-simulation market-simulation llm-agents agent-evaluation agent-benchmark agent-arena

Updated May 30, 2026
Go

axxafo / awesome-agent-benchmarks

🧠 Discover and evaluate advanced benchmark datasets for Large Language Model agents to enhance performance assessment in real-world tasks.

search awesome ai benchmarks rl agent-based-modeling reasoning awesome-list-awesome-list ai-models ai-agent for-devs llm-agent agentic llm-evaluation llm-agents agentic-ai guiagents agent-benchmark evaluation-dataset

Updated Jun 16, 2026

SanJueLogic / MeiGen-DesignAgentBench

A reproducible benchmark for evaluating AI design agents across 7 design scenarios. Double-blind SbS voting · 140 tasks · Bootstrap CI

benchmark reproducible-research leaderboard evaluation side-by-side image-generation text-to-image creative-ai multimodal human-evaluation ai-evaluation generative-ai design-agent agent-benchmark

Updated Apr 24, 2026
Python

camerasearch / fieldopsbench

Multimodal evaluation benchmark for AI agents in real-world field operations across 16 trades (HVAC, electrical, plumbing, roofing, solar, mining, oil & gas, marine, telecom, automotive, construction, and more). 194 cases; scores retrieval, code citation, jurisdiction, safety, trajectory, multi-turn, speed; 5-layer contamination defense.

benchmark evaluation electrical hvac trades ai-safety plumbing contamination-detection multimodal code-compliance huggingface-datasets vision-language-model llm-evaluation field-operations agent-benchmark

Updated Apr 19, 2026
Python

immu4989 / dspy-security-bench

Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite.

python robustness dspy prompt-injection llm-security llm-evaluation prompt-optimization agentic-ai agent-benchmark agentdojo

Updated Jun 16, 2026
Python

mrazakhan / ContinuumAI

Solving the amnesiac problem for LLM agents. Research series on agents that compound knowledge across sessions — first measurement: +4.6 pp accuracy lift on Terminal-Bench 2.1 with an open-weight executor and a single failure-derived skill file.

reproducible-research ai-agents cost-optimization skill-learning openrouter agent-benchmark terminal-bench open-weight-llm

Updated Jun 11, 2026

justindobbs / awesome-certified-agents

A community catalog of autonomous agents and bundles certified by passing TraceCore deterministic episode runs in public CI

open-source benchmarking evaluation multi-agent deterministic ai-agents developer-tools-test agent-benchmark tracecore

Updated Mar 7, 2026
Python

agent-benchmark

40 public repositories match this topic.
