feat: strengthen agentic audit baselines and execution observability#22711

Open
mnkiefer wants to merge 8 commits into main from obs-tools

@mnkiefer mnkiefer commented Mar 24, 2026

  • This PR adds baseline-aware auditing, policy-driven observability checks, and ready-to-use monitoring workflows so teams can detect risky agentic behavior earlier and roll out that monitoring across repositories.

  • This matters because agentic workflows need to be judged on behavior, not just outcomes: a workflow can keep passing while becoming more expensive, less controlled, or less predictable, and standard GitHub Actions status does not show that.

  • Logs: better gh aw logs reporting across multiple runs

    • Shows recurring patterns and trends instead of only one-run summaries
    • Surfaces observability insights so users can spot risky, noisy, resource-heavy, or weakly controlled behavior
    • Gives a more operational view of workflow behavior over time
    • Makes it easier to monitor many executions and identify patterns worth follow-up
  • Audits: smarter gh aw audit output that explains what happened in a run

    • Compares a run against a similar successful baseline
    • Shows meaningful changes like blocked requests, MCP failures, and execution posture shifts
    • Adds task-domain detection to help interpret whether the workflow behavior fits the task
    • Adds behavioral fingerprinting to show the overall execution style of a run
    • Adds agentic assessments to show whether a workflow looks lean, risky, overbuilt, or weakly controlled
    • Adds key findings so the most important conclusions are visible immediately
    • Adds recommendations so users know what to investigate or change next
    • Adds performance metrics so users can see whether runs are getting slower, costlier, or heavier
    • Gives users a faster way to debug regressions and understand why a run is concerning
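
The baseline comparison described above can be pictured roughly as follows. This is a minimal sketch, not the code in `pkg/cli/audit_comparison.go`: the `Snapshot` and `Comparison` types, the `Compare` function, the `*_increase` and `mcp_failures_decrease` codes, and the `improved` / `risky` labels are assumptions made for illustration. Only `turns_decrease`, `blocked_requests_decrease`, `mcp_failures_resolved`, and the `stable` classification are named in this PR.

```go
package main

import "fmt"

// Snapshot is a hypothetical summary of one run; the field names are assumptions.
type Snapshot struct {
	Turns           int
	BlockedRequests int
	MCPFailures     int
}

// Comparison carries an illustrative classification plus the reason codes behind it.
type Comparison struct {
	Classification string // "stable", "improved", or "risky" (labels are assumptions)
	ReasonCodes    []string
}

// Compare diffs the current run against a baseline run and classifies the result.
func Compare(baseline, current Snapshot) Comparison {
	var reasons []string
	risky := false

	// delta records a reason code for any change in a single metric.
	delta := func(name string, before, after int) {
		switch {
		case after > before:
			reasons = append(reasons, name+"_increase")
			risky = true
		case after < before:
			reasons = append(reasons, name+"_decrease")
		}
	}
	delta("turns", baseline.Turns, current.Turns)
	delta("blocked_requests", baseline.BlockedRequests, current.BlockedRequests)

	switch {
	case current.MCPFailures > baseline.MCPFailures:
		reasons = append(reasons, "mcp_failures_increase")
		risky = true
	case current.MCPFailures == 0 && baseline.MCPFailures > 0:
		reasons = append(reasons, "mcp_failures_resolved")
	case current.MCPFailures < baseline.MCPFailures:
		reasons = append(reasons, "mcp_failures_decrease")
	}

	// A run is only "stable" when nothing changed; beneficial changes get
	// their own label instead of being folded into "stable".
	c := Comparison{ReasonCodes: reasons}
	switch {
	case len(reasons) == 0:
		c.Classification = "stable"
	case risky:
		c.Classification = "risky"
	default:
		c.Classification = "improved"
	}
	return c
}

func main() {
	baseline := Snapshot{Turns: 10, BlockedRequests: 2, MCPFailures: 1}
	current := Snapshot{Turns: 8, BlockedRequests: 0, MCPFailures: 0}
	fmt.Printf("%+v\n", Compare(baseline, current))
}
```

Treating any regression as `risky`, even when other metrics improved, is a deliberately conservative choice in this sketch; the real classification may weigh the deltas differently.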

Agentic Observability Kit

  • Adds a drop-in observability workflow kit that users can adopt without building monitoring from scratch. A central-reporting variant is also included for teams managing many repositories.
  • This gives teams an operational workflow for turning observability into ongoing reporting and follow-up. It also makes it easier to standardize monitoring across repos instead of inventing a custom setup each time.

Copilot AI review requested due to automatic review settings March 24, 2026 14:51

Copilot AI left a comment

Pull request overview

Adds richer observability and auditing capabilities for agentic workflow runs, including baseline comparison, execution-path insights, a policy evaluation CLI, and an opt-in GitHub Actions job summary section.

Changes:

  • Introduces new JSON schemas for observability reports and observability guardrail policies.
  • Adds baseline comparison to the audit report (compare against last successful run) and surfaces observability insights in audit/logs output.
  • Adds an opt-in workflow compilation step to generate an “Observability” GitHub Actions job summary section, plus a new observability-policy CLI command for policy evaluation.
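
As a rough illustration of what evaluating a guardrail policy against an observability payload might involve, here is a minimal Go sketch. The `Rule`, `Payload`, and `Match` types and the `Evaluate` function are hypothetical and are not taken from `pkg/cli/observability_policy.go`. The `min_blocked_requests` threshold, the `warn` action, and the evidence string format do appear in this PR's discussion; the `block` action name is an assumption based on the "blocking behavior" covered by the command tests.

```go
package main

import "fmt"

// Rule is a hypothetical policy rule with a single threshold condition.
type Rule struct {
	Name               string
	MinBlockedRequests int    // matches when blocked requests >= this value; 0 means unset
	Action             string // "info", "warn", or "block" (action names partly assumed)
}

// Payload is a hypothetical slice of the observability report.
type Payload struct {
	BlockedRequests int
}

// Match records which rule fired and the evidence that triggered it.
type Match struct {
	Rule     string
	Action   string
	Evidence string
}

// Evaluate checks every rule against the payload. It returns all matches and
// reports whether any matching rule asked to block.
func Evaluate(rules []Rule, p Payload) (matches []Match, blocked bool) {
	for _, r := range rules {
		if r.MinBlockedRequests > 0 && p.BlockedRequests >= r.MinBlockedRequests {
			matches = append(matches, Match{
				Rule:   r.Name,
				Action: r.Action,
				// Evidence includes both the threshold and the actual count.
				Evidence: fmt.Sprintf("blocked_requests_gte=%d actual=%d",
					r.MinBlockedRequests, p.BlockedRequests),
			})
			if r.Action == "block" {
				blocked = true
			}
		}
	}
	return matches, blocked
}

func main() {
	rules := []Rule{
		{Name: "noisy-network", MinBlockedRequests: 5, Action: "warn"},
		{Name: "excessive-blocks", MinBlockedRequests: 20, Action: "block"},
	}
	matches, blocked := Evaluate(rules, Payload{BlockedRequests: 7})
	for _, m := range matches {
		fmt.Printf("[%s] %s: %s\n", m.Action, m.Rule, m.Evidence)
	}
	fmt.Println("blocked:", blocked)
}
```

A CLI built on a function like this could exit non-zero when `blocked` is true, so a CI job fails only on rules that explicitly ask for it.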

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| schemas/observability-report.json | New schema describing run/fleet observability payloads (lineage, execution, reasoning, tooling/network/actuation, insights/findings). |
| schemas/observability-policy.json | New schema for policy rules evaluated against observability payloads. |
| pkg/workflow/observability_job_summary_test.go | Tests ensuring an opt-in job-summary step is injected (and absent by default). |
| pkg/workflow/frontmatter_types_test.go | Adds frontmatter parsing coverage for observability.job-summary. |
| pkg/workflow/frontmatter_types.go | Adds ObservabilityConfig and exposes it via FrontmatterConfig. |
| pkg/workflow/compiler_yaml_main_job.go | Hooks observability summary generation into main job step synthesis. |
| pkg/workflow/compiler_yaml_ai_execution.go | Implements opt-in observability summary step and mode extraction. |
| pkg/parser/schemas/main_workflow_schema.json | Extends workflow frontmatter schema to allow observability.job-summary on/off. |
| pkg/parser/schema_test.go | Validates schema acceptance of observability.job-summary. |
| pkg/cli/observability_schema_test.go | Verifies the observability report schema includes key sections/defs. |
| pkg/cli/observability_policy_test.go | Adds unit tests for policy evaluation + schema parsing. |
| pkg/cli/observability_policy_command_test.go | Tests CLI evaluation output, JSON emission, and blocking behavior. |
| pkg/cli/observability_policy_command.go | Implements new observability-policy eval command and rendering/error behavior. |
| pkg/cli/observability_policy.go | Adds policy/payload types and matching logic for rule evaluation. |
| pkg/cli/observability_insights_test.go | Adds tests for audit/logs observability insight generation. |
| pkg/cli/observability_insights.go | Implements observability insight generation + console rendering helper. |
| pkg/cli/mcp_schema_test.go | Ensures schema generation works for ObservabilityPolicy. |
| pkg/cli/logs_report.go | Adds aggregated observability insights to logs data and console rendering. |
| pkg/cli/audit_test.go | Updates audit tests to account for new Comparison field. |
| pkg/cli/audit_report_render.go | Renders new “Comparison” and “Observability Insights” sections in audit output. |
| pkg/cli/audit_report.go | Adds Comparison + ObservabilityInsights to audit JSON and populates insights. |
| pkg/cli/audit_comparison_test.go | Adds tests for comparison behavior (no baseline / risky / stable). |
| pkg/cli/audit_comparison.go | Implements baseline lookup, snapshot extraction, delta/classification, and recommendations. |
| pkg/cli/audit.go | Computes current snapshot + attaches baseline comparison into audit output. |
| cmd/gh-aw/main.go | Registers the new observability-policy command under the analysis group. |
| actions/setup/js/generate_observability_summary.test.cjs | Adds vitest coverage for summary generation and opt-out behavior. |
| actions/setup/js/generate_observability_summary.cjs | Implements job summary synthesis from runtime artifacts (aw_info, agent_output, gateway logs). |

@mnkiefer mnkiefer self-assigned this Mar 24, 2026
@pelikhan
@copilot review comments, lint go

Copilot AI commented Mar 24, 2026

@copilot review comments, lint go

Addressed all review comments in commit 5abdb23:

  1. countBlockedRequests() — now sums across all gatewayEventPaths files instead of returning after the first match.
  2. Audit comparison classification — added reason codes for beneficial changes (turns_decrease, blocked_requests_decrease, mcp_failures_resolved) and explicit switch cases so runs that differ from baseline are no longer classified as stable.
  3. renderObservabilityInsights — added critical and low severity icons.
  4. min_blocked_requests evidence — now includes both threshold and actual count, e.g. blocked_requests_gte=5 actual=7.
  5. warn action — added explicit case "warn" using FormatWarningMessage so it renders distinctly from info messages.
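
The first fix in the list above (summing across every gateway event file rather than returning after the first match) can be sketched in Go as follows. This is an illustrative reconstruction, not the code from commit 5abdb23: the `gatewayEvent` shape with a `blocked` field, the one-JSON-object-per-line file layout, the injected `read` function, and the file names in `main` are all assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// gatewayEvent is a hypothetical event shape: one JSON object per line.
type gatewayEvent struct {
	Blocked bool `json:"blocked"`
}

// countBlockedRequests sums blocked events across every path.
// read is injected so the sketch can be exercised without touching disk.
func countBlockedRequests(paths []string, read func(string) ([]byte, error)) int {
	total := 0
	for _, p := range paths {
		data, err := read(p)
		if err != nil {
			continue // unreadable or missing file: skip it and keep scanning the rest
		}
		for _, line := range strings.Split(string(data), "\n") {
			var ev gatewayEvent
			// Blank or malformed lines fail to parse and are ignored.
			if json.Unmarshal([]byte(line), &ev) == nil && ev.Blocked {
				total++
			}
		}
		// No early return here: returning after the first file would undercount.
	}
	return total
}

func main() {
	// Hypothetical file names; files that do not exist are simply skipped.
	paths := []string{"gateway.jsonl", "rpc-messages.jsonl"}
	fmt.Println(countBlockedRequests(paths, os.ReadFile))
}
```

Accumulating into `total` inside the loop and returning only after it finishes is what keeps the count correct when blocked requests are spread over several files.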

Copilot AI requested a review from pelikhan March 24, 2026 18:06
@mnkiefer mnkiefer changed the title feat: improve agentic audit baselines and execution observability feat: strengthen agentic audit baselines and execution observability Mar 24, 2026
@pelikhan pelikhan marked this pull request as ready for review March 24, 2026 20:00