GitHub Agentic Workflows

A/B Experiments

Use the experiments frontmatter section to compare workflow variants across repeated runs. Each experiment declares a name and a set of variants. On every run, the activation job picks one variant and exposes it to the prompt.

Experiments work best when you test one workflow choice at a time, such as:

  • prompt wording
  • model selection
  • whether to delegate to a sub-agent
  • which subskill (inline skill) to invoke

Add an experiments map to the workflow frontmatter. Each key names an experiment. The value is either a simple array of variants (bare-array form) or a rich object with additional metadata fields.

---
on:
  issues:
    types: [opened]
engine: copilot
experiments:
  style: [concise, detailed]
---
Summarize this issue in a **${{ experiments.style }}** way.

Use the object form when you want built-in reporting and experiment metadata:

---
on:
  schedule: daily on weekdays
engine: copilot
experiments:
  prompt_style:
    variants: [concise, detailed]
    description: "Test whether a concise prompt reduces cost without quality loss"
    hypothesis: "H0: no change in aic. H1: concise reduces AIC by >=15%"
    metric: aic
    secondary_metrics: [duration_ms, discussion_word_count]
    guardrail_metrics:
      - name: success_rate
        threshold: ">=0.95"
      - name: empty_output_rate
        direction: min
        threshold: 0.0
    weight: [50, 50]
    min_samples: 25
    start_date: "2026-05-05"
    end_date: "2026-07-25"
    issue: 1234
---
Summarize the findings in a **${{ experiments.prompt_style }}** way.

[!NOTE] Experiment names must be valid identifiers: start with a letter or underscore, followed by letters, digits, or underscores. For example, use style or feature_1. Names that do not match this pattern are ignored.
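
Because invalid names are ignored rather than rejected, it can help to check them before committing a workflow. The sketch below is an illustration only, not gh-aw's implementation: the regular expression is transcribed from the note above and assumes ASCII letters, and `is_valid_experiment_name` is a hypothetical helper name.

```python
import re

# Pattern transcribed from the note: a letter or underscore first,
# then any mix of letters, digits, and underscores.
# Assumption: "letters" means ASCII letters only.
EXPERIMENT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_experiment_name(name: str) -> bool:
    """Return True when the whole name matches the identifier rule."""
    return EXPERIMENT_NAME.fullmatch(name) is not None


for candidate in ["style", "feature_1", "_private", "1st_try", "prompt-style"]:
    print(f"{candidate!r}: {is_valid_experiment_name(candidate)}")
```

Names such as `1st_try` (leading digit) or `prompt-style` (hyphen) fail this check, so an experiment declared under either name would be skipped.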

Reference a variant with ${{ experiments.<name> }}. At runtime, gh-aw replaces the expression with the selected variant string, such as concise.

Use the {{#if experiments.<name> }} block syntax for conditional prompt sections. A variant value of no is treated as falsy, which makes yes/no experiments easy to express:

---
experiments:
  caveman: [yes, no]
---
{{#if experiments.caveman }}
Talk like a caveman in all your responses. Me test. You run.
{{/if}}
Address the issue described above.
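
To make the substitution rules concrete, the following sketch mimics how a prompt could be rendered once variants are chosen. It is a simplified model under stated assumptions, not gh-aw's renderer: `render` is a hypothetical function, only the value `no` is treated as falsy (as described above), an experiment missing from the assignments is also treated as falsy, and nested `{{#if}}` blocks are not handled.

```python
import re

# Matches {{#if experiments.<name> }} ... {{/if}} (non-nested only).
IF_BLOCK = re.compile(
    r"\{\{#if\s+experiments\.(\w+)\s*\}\}(.*?)\{\{/if\}\}\n?", re.DOTALL
)
# Matches ${{ experiments.<name> }}.
EXPRESSION = re.compile(r"\$\{\{\s*experiments\.(\w+)\s*\}\}")


def render(prompt: str, assignments: dict[str, str]) -> str:
    """Apply conditional blocks first, then substitute variant values."""

    def keep_or_drop(match: re.Match) -> str:
        # Assumption: an unknown experiment behaves like the value "no".
        value = assignments.get(match.group(1), "no")
        return "" if value == "no" else match.group(2).lstrip("\n")

    prompt = IF_BLOCK.sub(keep_or_drop, prompt)
    # Unknown expressions are left untouched rather than raising.
    return EXPRESSION.sub(
        lambda m: assignments.get(m.group(1), m.group(0)), prompt
    )


template = (
    "{{#if experiments.caveman }}\n"
    "Talk like a caveman in all your responses.\n"
    "{{/if}}\n"
    "Summarize this issue in a ${{ experiments.style }} way.\n"
)

print(render(template, {"caveman": "yes", "style": "concise"}))
print(render(template, {"caveman": "no", "style": "detailed"}))
```

With `caveman` set to `no`, the conditional section disappears and only the final instruction remains.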

Most experiments compare a single decision in the workflow. The examples below show common patterns, starting with the simplest one: varying the prompt wording.

---
experiments:
  style: [concise, detailed]
---
Summarize this issue in a **${{ experiments.style }}** way.

Model experiments are useful when you want to compare speed, cost, and output quality. gh-aw model aliases such as small and large are often a good place to start. See Model Aliases.

---
engine:
  id: copilot
  model: ${{ experiments.model }}
experiments:
  model: [small, large]
---
Review the issue and recommend the next action.

This pattern compares a direct prompt with a delegated sub-agent flow.

---
experiments:
  use_summarizer: [yes, no]
---
{{#if experiments.use_summarizer }}
Use the `file-summarizer` sub-agent to summarize `README.md`, then continue.
{{/if}}
Write a short project overview for maintainers.
## agent: `file-summarizer`
---
model: small
description: Summarizes a file in a few sentences
---
Read the given file and return a concise summary.

See Inline Sub-Agents for the full syntax.

This pattern compares two reusable instruction blocks, sometimes called subskills, without changing the main workflow prompt.

---
experiments:
  triage_skill: [triage-fast, triage-deep]
---
Use the `${{ experiments.triage_skill }}` skill to classify this issue.
## skill: `triage-fast`
---
description: Fast issue triage
---
Classify the issue and suggest the smallest next step.
## skill: `triage-deep`
---
description: Detailed issue triage
---
Classify the issue, identify missing context, and recommend a fuller follow-up
plan.

The activation job tracks how often each variant has been selected. The counter is stored using the storage setting in the experiments: block. By default, gh-aw chooses the least-used variant on each run. If multiple variants are tied, including on the first run, one of them is chosen at random. Over time, this keeps usage roughly balanced across variants.
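
The default strategy described above can be sketched in a few lines. This is an illustrative model, not the activation job's actual code: `pick_least_used` is a hypothetical name, and the in-memory `counts` dictionary stands in for the persisted counter state.

```python
import random


def pick_least_used(variants, counts, rng=random):
    """Pick the variant with the lowest count; break ties at random.

    `counts` maps variant name to the number of times it was selected
    and is updated in place, mirroring the counter described above.
    """
    lowest = min(counts.get(v, 0) for v in variants)
    tied = [v for v in variants if counts.get(v, 0) == lowest]
    choice = rng.choice(tied)
    counts[choice] = counts.get(choice, 0) + 1
    return choice


counts = {}
for _ in range(10):
    pick_least_used(["concise", "detailed"], counts)

print(sorted(counts.items()))  # [('concise', 5), ('detailed', 5)]
```

With two variants, the counts can never differ by more than one, which is why usage stays balanced regardless of how the random ties fall.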

When you provide a weight array, gh-aw uses weighted random selection instead of least-used selection. For example, [70, 30] gives the first variant a 70% selection probability. If start_date or end_date is set and the current date falls outside that range, gh-aw returns the control variant (the first entry) without incrementing any counter.
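
The weighted path and the date window can be modeled as follows. This is a sketch under assumptions, not gh-aw's implementation: `is_active` and `pick_weighted` are hypothetical names, the boundary dates themselves are treated as active (matching the "before" and "after" wording), and the second return value only signals whether a counter would be incremented.

```python
import random
from datetime import date


def is_active(today, start_date=None, end_date=None):
    """Return False when `today` falls outside the optional date window."""
    if start_date and today < date.fromisoformat(start_date):
        return False
    if end_date and today > date.fromisoformat(end_date):
        return False
    return True


def pick_weighted(variants, weights, today,
                  start_date=None, end_date=None, rng=random):
    """Return (variant, counted).

    Outside the window the control variant (first entry) is returned
    and `counted` is False, so no counter should be incremented.
    """
    if not is_active(today, start_date, end_date):
        return variants[0], False
    # Weights are relative; random.choices normalizes them internally.
    return rng.choices(variants, weights=weights, k=1)[0], True


rng = random.Random(0)  # seeded only to make the demo repeatable
draws = [
    pick_weighted(["concise", "detailed"], [70, 30], date(2026, 6, 1),
                  "2026-05-05", "2026-07-25", rng)[0]
    for _ in range(10_000)
]
share = draws.count("concise") / len(draws)
print(f"concise share: {share:.2f}")

# Before start_date: control variant, not counted.
print(pick_weighted(["concise", "detailed"], [70, 30], date(2026, 1, 1),
                    "2026-05-05", "2026-07-25"))
```

Because the weights are relative, `[7, 3]` and `[70, 30]` describe the same split.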

The storage key inside the experiments: map controls where experiment state is persisted:

experiments:
  storage: repo # or: cache (default: repo)
  prompt_style: [concise, detailed]

| Value | Behavior |
| --- | --- |
| repo (default) | Commits state to a git branch named experiments/{sanitizedWorkflowID} (workflow ID lowercased with hyphens removed, e.g. my-workflow → experiments/myworkflow). Durable: survives cache evictions. Requires contents: write permission (added automatically by the compiler). |
| cache | Uses GitHub Actions cache (legacy). State may be evicted after 7 days of inactivity. |
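
The branch naming rule for repo storage can be expressed as a small helper. This is a sketch that applies only the two transformations stated above (lowercasing and hyphen removal); `experiments_branch` is a hypothetical name, and how gh-aw treats other characters is not covered here.

```python
def experiments_branch(workflow_id: str) -> str:
    """Derive the state branch name from a workflow ID.

    Assumption: only lowercasing and hyphen removal are applied,
    as described for the repo storage mode.
    """
    sanitized = workflow_id.lower().replace("-", "")
    return f"experiments/{sanitized}"


print(experiments_branch("my-workflow"))  # experiments/myworkflow
```

Knowing the branch name makes it easy to locate state.json when inspecting experiment state by hand.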

When storage is set to repo, the compiler adds a push_experiments_state job after the activation job. That job commits the updated state.json to the experiments branch.

Each experiment exposes its selected variant as an activation job output:

| Expression | Description |
| --- | --- |
| needs.activation.outputs.<name> | Selected variant for experiment <name> |
| needs.activation.outputs.experiments | All assignments as a JSON object |

Use these expressions in downstream jobs defined in the jobs: frontmatter section.
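
A downstream job script might consume the combined output like this. The sketch assumes the JSON object is a flat map from experiment name to variant string, and that the job passes `needs.activation.outputs.experiments` to the script through an environment variable; the variable name `EXPERIMENTS_JSON` and the helper `load_assignments` are hypothetical.

```python
import json
import os


def load_assignments(raw: str) -> dict:
    """Parse the experiments output into a name -> variant mapping."""
    assignments = json.loads(raw)
    if not isinstance(assignments, dict):
        raise ValueError("expected a JSON object of experiment assignments")
    return assignments


# Hypothetical: a downstream job would set EXPERIMENTS_JSON from
# needs.activation.outputs.experiments. The fallback is sample data.
raw = os.environ.get(
    "EXPERIMENTS_JSON", '{"style": "concise", "caveman": "yes"}'
)

for name, variant in sorted(load_assignments(raw).items()):
    print(f"{name} = {variant}")
```

Reading the single JSON output avoids wiring one job input per experiment when a workflow declares several.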

The activation job uploads the counter state as an experiment artifact. Download and inspect it with the gh aw CLI:

# Download the experiment artifact for a specific run
gh aw audit <run-id> --artifacts experiment
# Display experiment assignments in the audit report
gh aw audit <run-id>

The A/B Experiments section of the audit report shows the variant chosen on the most recent run and the cumulative counts across all runs:

A/B Experiments
• caveman = yes (cumulative: no:4, yes:5)
• style = concise (cumulative: concise:5, detailed:4)

Use --experiment and --variant to filter audit runs to a specific variant:

gh aw audit <run-id> --experiment prompt_style --variant concise

Each activation job writes a Markdown step summary that shows the selected variants, cumulative counts, and, when you use the object form, progress toward min_samples:

## A/B Experiment Assignments
| Experiment | Selected Variant | All Variants | Cumulative Counts |
| --- | --- | --- | --- |
| prompt_style | concise | concise, detailed | concise: 8, detailed: 7 |
### Sampling Progress
prompt_style (target: 25 per variant)
concise: ████████░░░░░░░░░░░░ 8/25 (32%)
detailed: ███████░░░░░░░░░░░░░ 7/25 (28%)
### Experiment Details
**prompt_style**
> Test whether a concise prompt reduces cost without quality loss
**Hypothesis:** H0: no change in aic. H1: concise reduces AIC by >=15%
**Guardrail metrics:**
- `success_rate` >=0.95
- `empty_output_rate` ==0
Tracking issue: [#1234](https://github.com/owner/repo/issues/1234)

The experiments frontmatter accepts the following fields:

| Field | Type | Description |
| --- | --- | --- |
| experiments | object | Map of experiment name → variant array or config object |
| experiments.<name> | string[] | Array of two or more variant strings for one experiment |

In the object form, each experiment accepts these fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| variants | string[] | Yes | Array of two or more variant strings |
| description | string | No | Human-readable explanation of what the experiment tests |
| hypothesis | string | No | Null and alternative hypothesis (e.g. "H0: no change. H1: concise reduces AIC by >=15%") |
| metric | string | No | Primary metric to observe (e.g. aic, duration_ms) |
| secondary_metrics | string[] | No | Additional metrics to track alongside the primary metric |
| guardrail_metrics | object[] | No | List of guardrail objects with name (string), threshold (comparison string like >=0.95 or bare number like 0.0), and optional direction ("min" or "max"). When threshold is a bare number, direction governs the pass condition (≤ for min, ≥ for max). See experiments-specification §4.4 for full semantics. |
| min_samples | integer | No | Minimum runs per variant required before statistical analysis is considered reliable. The step summary shows a progress bar toward this target. |
| weight | integer[] | No | Per-variant probability weights (same length as variants). Enables weighted-random selection; values are relative and need not sum to 100. |
| issue | integer | No | GitHub issue number that tracks this experiment’s lifecycle |
| start_date | string | No | ISO-8601 date (YYYY-MM-DD) before which the experiment is inactive. The control variant is returned before this date without incrementing any counter. |
| end_date | string | No | ISO-8601 date (YYYY-MM-DD) after which the experiment is inactive. The control variant is returned after this date without incrementing any counter. |
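
The guardrail threshold rules can be illustrated with a small evaluator. This is a sketch of the semantics summarized in the guardrail_metrics description, not the specification's reference implementation: `guardrail_passes` is a hypothetical name, the set of comparison operators accepted in threshold strings is an assumption, and a bare number without a direction is rejected here because its behavior is not described above.

```python
import operator
import re

# Assumption: these are the operators a comparison string may use.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
COMPARISON = re.compile(r"\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*")


def guardrail_passes(observed, threshold, direction=None):
    """Return True when an observed metric value satisfies a guardrail."""
    if isinstance(threshold, str):
        match = COMPARISON.fullmatch(threshold)
        if match is None:
            raise ValueError(f"unrecognized threshold: {threshold!r}")
        op, limit = match.groups()
        return OPERATORS[op](observed, float(limit))
    # Bare number: direction decides the comparison.
    if direction == "min":
        return observed <= threshold
    if direction == "max":
        return observed >= threshold
    raise ValueError("a bare-number threshold needs direction 'min' or 'max'")


print(guardrail_passes(0.97, ">=0.95"))               # True
print(guardrail_passes(0.02, 0.0, direction="min"))   # False
```

Under this reading, the `empty_output_rate` guardrail from the earlier example (direction min, threshold 0.0) passes only when no run produces empty output.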