A/B Experiments
Use the experiments frontmatter section to compare workflow variants across
repeated runs. Each experiment declares a name and a set of variants. On every
run, the activation job picks one variant and exposes it to the prompt.
Experiments work best when you test one workflow choice at a time, such as:
- prompt wording
- model selection
- whether to delegate to a sub-agent
- which subskill (inline skill) to invoke
Declaring experiments
Section titled “Declaring experiments”Add an experiments map to the workflow frontmatter. Each key names an
experiment. The value is either a simple array of variants (bare-array form) or
a rich object with additional metadata fields.
Bare-array form
Section titled “Bare-array form”---on: issues: types: [opened]
engine: copilot
experiments: style: [concise, detailed]---
Summarize this issue in a **${{ experiments.style }}** way.Rich object form
Section titled “Rich object form”Use the object form when you want built-in reporting and experiment metadata:
---on: schedule: daily on weekdays
engine: copilot
experiments: prompt_style: variants: [concise, detailed] description: "Test whether a concise prompt reduces cost without quality loss" hypothesis: "H0: no change in aic. H1: concise reduces AIC by >=15%" metric: aic secondary_metrics: [duration_ms, discussion_word_count] guardrail_metrics: - name: success_rate threshold: ">=0.95" - name: empty_output_rate direction: min threshold: 0.0 weight: [50, 50] min_samples: 25 start_date: "2026-05-05" end_date: "2026-07-25" issue: 1234---
Summarize the findings in a **${{ experiments.prompt_style }}** way.[!NOTE] Experiment names must be valid identifiers: start with a letter or underscore, followed by letters, digits, or underscores. For example, use
styleorfeature_1. Names that do not match this pattern are ignored.
Using variants in the prompt
Section titled “Using variants in the prompt”Reference a variant with ${{ experiments.<name> }}. At runtime, gh-aw
replaces the expression with the selected variant string, such as concise.
Use the {{#if experiments.<name> }} block syntax for conditional prompt
sections. A variant value of no is treated as falsy, which makes yes/no
experiments easy to express:
---experiments: caveman: [yes, no]---
{{#if experiments.caveman }}Talk like a caveman in all your responses. Me test. You run.{{/if}}
Address the issue described above.Common experiment ideas
Section titled “Common experiment ideas”Most experiments compare a single decision in the workflow. The examples below show common patterns.
Try different prompt styles
Section titled “Try different prompt styles”---experiments: style: [concise, detailed]---
Summarize this issue in a **${{ experiments.style }}** way.Try different models
Section titled “Try different models”Model experiments are useful when you want to compare speed, cost, and output
quality. gh-aw model aliases such as small and large are often a good place
to start. See Model Aliases.
---engine: id: copilot model: ${{ experiments.model }}
experiments: model: [small, large]---
Review the issue and recommend the next action.Try using a sub-agent
Section titled “Try using a sub-agent”This pattern compares a direct prompt with a delegated sub-agent flow.
---experiments: use_summarizer: [yes, no]---
{{#if experiments.use_summarizer }}Use the `file-summarizer` sub-agent to summarize `README.md`, then continue.{{/if}}
Write a short project overview for maintainers.
## agent: `file-summarizer`---model: smalldescription: Summarizes a file in a few sentences---Read the given file and return a concise summary.See Inline Sub-Agents for the full syntax.
Try different subskills
Section titled “Try different subskills”This pattern compares two reusable instruction blocks, sometimes called subskills, without changing the main workflow prompt.
---experiments: triage_skill: [triage-fast, triage-deep]---
Use the `${{ experiments.triage_skill }}` skill to classify this issue.
## skill: `triage-fast`---description: Fast issue triage---Classify the issue and suggest the smallest next step.
## skill: `triage-deep`---description: Detailed issue triage---Classify the issue, identify missing context, and recommend a fuller follow-upplan.Statistical balancing
Section titled “Statistical balancing”The activation job tracks how often each variant has been selected. The counter
is stored using the storage setting in the experiments: block. By default,
gh-aw chooses the least-used variant on each run. If multiple variants are tied,
including on the first run, one of them is chosen at random. Over time, this
keeps usage roughly balanced across variants.
When you provide a weight array, gh-aw uses weighted random selection instead
of least-used selection. For example, [70, 30] gives the first variant a 70%
selection probability. If start_date or end_date is set and the current
date falls outside that range, gh-aw returns the control variant (the first
entry) without incrementing any counter.
Storage Configuration
Section titled “Storage Configuration”The storage key inside the experiments: map controls where experiment state
is persisted:
experiments: storage: repo # or: cache (default: repo) prompt_style: [concise, detailed]| Value | Behavior |
|---|---|
repo (default) | Commits state to a git branch named experiments/{sanitizedWorkflowID} (workflow ID lowercased with hyphens removed, e.g. my-workflow → experiments/myworkflow). Durable — survives cache evictions. Requires contents: write permission (added automatically by the compiler). |
cache | Uses GitHub Actions cache (legacy). State may be evicted after 7 days of inactivity. |
When storage: repo, the compiler adds a push_experiments_state job after the
activation job and commits the updated state.json to the experiments branch.
Accessing assignments downstream
Section titled “Accessing assignments downstream”Each experiment exposes its selected variant as an activation job output:
| Expression | Description |
|---|---|
needs.activation.outputs.<name> | Selected variant for experiment <name> |
needs.activation.outputs.experiments | All assignments as a JSON object |
Use these expressions in downstream jobs defined in the jobs: frontmatter section.
Analyzing results
Section titled “Analyzing results”The activation job uploads the counter state as an experiment artifact. Download and inspect it with the gh aw CLI:
# Download the experiment artifact for a specific rungh aw audit <run-id> --artifacts experiment
# Display experiment assignments in the audit reportgh aw audit <run-id>The A/B Experiments section of the audit report shows the variant chosen on the most recent run and the cumulative counts across all runs:
A/B Experiments • caveman = yes (cumulative: no:4, yes:5) • style = concise (cumulative: concise:5, detailed:4)Filtering audit results by variant
Section titled “Filtering audit results by variant”Use --experiment and --variant to filter audit runs to a specific variant:
gh aw audit <run-id> --experiment prompt_style --variant conciseStep summary
Section titled “Step summary”Each activation job writes a Markdown step summary that shows the selected
variants, cumulative counts, and, when you use the object form, progress toward
min_samples:
## A/B Experiment Assignments
| Experiment | Selected Variant | All Variants | Cumulative Counts || --- | --- | --- | --- || prompt_style | concise | concise, detailed | concise: 8, detailed: 7|
### Sampling Progress
prompt_style (target: 25 per variant) concise: ████████░░░░░░░░░░░░ 8/25 (32%) detailed: ███████░░░░░░░░░░░░░ 7/25 (28%)
### Experiment Details
**prompt_style**
> Test whether a concise prompt reduces cost without quality loss
**Hypothesis:** H0: no change in aic. H1: concise reduces AIC by >=15%
**Guardrail metrics:**- `success_rate` >=0.95- `empty_output_rate` ==0
Tracking issue: [#1234](https://github.com/owner/repo/issues/1234)Frontmatter reference
Section titled “Frontmatter reference”Bare-array form
Section titled “Bare-array form”| Field | Type | Description |
|---|---|---|
experiments | object | Map of experiment name → variant array or config object |
experiments.<name> | string[] | Array of two or more variant strings for one experiment |
Object form fields
Section titled “Object form fields”| Field | Type | Required | Description |
|---|---|---|---|
variants | string[] | ✓ | Array of two or more variant strings |
description | string | Human-readable explanation of what the experiment tests | |
hypothesis | string | Null and alternative hypothesis (e.g. "H0: no change. H1: concise reduces AIC by >=15%") | |
metric | string | Primary metric to observe (e.g. aic, duration_ms) | |
secondary_metrics | string[] | Additional metrics to track alongside the primary metric | |
guardrail_metrics | object[] | List of guardrail objects with name (string), threshold (comparison string like >=0.95 or bare number like 0.0), and optional direction ("min" or "max"). When threshold is a bare number, direction governs the pass condition (≤ for min, ≥ for max). See experiments-specification §4.4 for full semantics. | |
min_samples | integer | Minimum runs per variant required before statistical analysis is considered reliable. The step summary shows a progress bar toward this target. | |
weight | integer[] | Per-variant probability weights (same length as variants). Enables weighted-random selection; values are relative and need not sum to 100. | |
issue | integer | GitHub issue number that tracks this experiment’s lifecycle | |
start_date | string | ISO-8601 date (YYYY-MM-DD) before which the experiment is inactive. The control variant is returned before this date without incrementing any counter. | |
end_date | string | ISO-8601 date (YYYY-MM-DD) after which the experiment is inactive. The control variant is returned after this date without incrementing any counter. |