
Fix/replay after startup failure #3964

Open

AuraReaper wants to merge 6 commits into main from fix/replay-after-startup-failure

Conversation

@AuraReaper
Contributor

Describe the changes that are made

  • Preserve per-test-set startup failure status during replay so Docker Compose app shutdowns are reported as APP_HALTED instead of being downgraded to INTERNAL_ERR.
  • Continue replaying the remaining test sets in Docker Compose mode when one test set causes the app to stop during startup.
  • Add focused unit tests for replay status resolution and abort behavior.
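The status-preservation and abort rules described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the real `resolveTestSetStatus` and `shouldAbortTestRun` helpers operate on `models.TestSetStatus` and a command type, while this sketch models statuses as plain strings and Docker Compose mode as a boolean.

```go
package main

import "fmt"

// Simplified stand-ins for models.TestSetStatus values.
type TestSetStatus string

const (
	StatusAppHalted   TestSetStatus = "APP_HALTED"
	StatusAppFault    TestSetStatus = "APP_FAULT"
	StatusInternalErr TestSetStatus = "INTERNAL_ERR"
)

// resolveTestSetStatus keeps an already-recorded startup-failure status
// (APP_HALTED/APP_FAULT) instead of downgrading it to a later generic error.
func resolveTestSetStatus(current, proposed TestSetStatus) TestSetStatus {
	if current == StatusAppHalted || current == StatusAppFault {
		return current
	}
	return proposed
}

// shouldAbortTestRun decides whether the whole replay run must stop.
// In Docker Compose mode an app halt fails only that test set; the
// remaining test sets keep running.
func shouldAbortTestRun(status TestSetStatus, isDockerCompose bool) bool {
	switch status {
	case StatusAppHalted, StatusAppFault:
		return !isDockerCompose
	case StatusInternalErr:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(resolveTestSetStatus(StatusAppHalted, StatusInternalErr)) // APP_HALTED
	fmt.Println(shouldAbortTestRun(StatusAppHalted, true))                // false
}
```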

What type of PR is this? (check all applicable)

  • 📦 Chore
  • 🍕 Feature
  • 🐞 Bug Fix
  • 📝 Documentation Update
  • 🎨 Style
  • 🧑‍💻 Code Refactor
  • 🔥 Performance Improvements
  • ✅ Test
  • 🔁 CI
  • ⏩ Revert

Added e2e test pipeline?

  • 👍 yes
  • 🙅 no, because they aren't needed
  • 🙋 no, because I need help

Added comments for hard-to-understand areas?

  • 👍 yes
  • 🙅 no, because the code is self-explanatory

Added to documentation?

  • 📜 README.md
  • 📓 Wiki
  • 🙅 no documentation needed

Are there any sample code or steps to test the changes?

  • 👍 yes, mentioned below
  • 🙅 no, because it is not needed

Self Review done?

  • ✅ yes
  • ❌ no, because I need help

Any relevant screenshots, recordings or logs?

  • Manual cloud replay verification was done locally.
  • Verified that the broken test set is marked as APP_HALTED and the subsequent test sets continue to run.

🧠 Semantics for PR Title & Branch Name

PR Title: fix: continue replay after docker compose startup failure
Branch Name: fix/replay-after-startup-failure


Copilot AI review requested due to automatic review settings March 23, 2026 17:18
@github-actions

github-actions bot commented Mar 23, 2026

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


0 out of 2 committers have signed the CLA.
@yash
@AuraReaper
Yash does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting 'recheck' in this pull request.

@github-actions

The CLA check failed. Please ensure you have:

  • Signed the CLA by commenting 'I have read the CLA Document and I hereby sign the CLA.'
  • Used the correct email address in your commits (matches the one you used to sign the CLA).

After fixing these issues, comment 'recheck' to trigger the workflow again.




Copilot AI left a comment


Pull request overview

This PR adjusts replay status handling so that startup failures (notably in Docker Compose mode) preserve an APP_HALTED/APP_FAULT classification instead of being overwritten as INTERNAL_ERR, and so that a single failing test set doesn’t necessarily abort the entire Docker Compose replay run.

Changes:

  • Added helper functions to map application/run-time errors to TestSetStatus, and to decide when a replay run should abort.
  • Updated replay flow to avoid aborting the full run on APP_HALTED / APP_FAULT when running in Docker Compose mode.
  • Added logic to resolve/override status in several error paths (consumed mocks, mock params updates, test case results retrieval, and final status resolution).
Comments suppressed due to low confidence (1)

pkg/service/replay/replay.go:450

  • In Docker Compose mode, APP_HALTED/APP_FAULT no longer aborts the full run, but the flaky-check attempt loop will still continue rerunning the same failing test set (since abortTestRun stays false). That can waste time and may restart the same broken compose stack multiple times before moving on. Consider breaking out of the per-testset attempt loop immediately for terminal statuses (APP_HALTED/APP_FAULT/INTERNAL_ERR), while still allowing the outer loop to continue to the next test set in Docker Compose mode.
			switch testSetStatus {
			case models.TestSetStatusAppHalted:
				testSetResult = false
				abortTestRun = shouldAbortTestRun(testSetStatus, cmdType)
			case models.TestSetStatusInternalErr:
				testSetResult = false
				abortTestRun = shouldAbortTestRun(testSetStatus, cmdType)
			case models.TestSetStatusFaultUserApp:
				testSetResult = false
				abortTestRun = shouldAbortTestRun(testSetStatus, cmdType)
			case models.TestSetStatusUserAbort:
				return nil
			case models.TestSetStatusFailed:
				testSetResult = false
			case models.TestSetStatusPassed:
				testSetResult = true
			case models.TestSetStatusIgnored:
				testSetResult = false
			case models.TestSetStatusNoTestsToRun:
				testSetResult = false
			}

			if testSetStatus != models.TestSetStatusIgnored {
				testRunResult = testRunResult && testSetResult
				if abortTestRun {
					break
				}
			}
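The early exit the reviewer suggests could look roughly like this. The `isTerminalStatus` helper is hypothetical (it does not exist in the PR); the loop simulates the flaky-check attempt loop under that assumption.

```go
package main

import "fmt"

type TestSetStatus string

const (
	StatusAppHalted    TestSetStatus = "APP_HALTED"
	StatusFaultUserApp TestSetStatus = "APP_FAULT"
	StatusInternalErr  TestSetStatus = "INTERNAL_ERR"
	StatusFailed       TestSetStatus = "FAILED"
)

// isTerminalStatus reports whether retrying the same test set is pointless,
// e.g. because the compose stack is already down. (Hypothetical helper.)
func isTerminalStatus(s TestSetStatus) bool {
	switch s {
	case StatusAppHalted, StatusFaultUserApp, StatusInternalErr:
		return true
	}
	return false
}

func main() {
	// Simulated per-test-set attempt loop: a terminal status breaks out
	// immediately instead of restarting the broken app on each retry.
	attempts := 0
	for attempt := 1; attempt <= 3; attempt++ {
		attempts = attempt
		status := StatusAppHalted // result of this attempt (simulated)
		if isTerminalStatus(status) {
			break
		}
	}
	fmt.Println("attempts made:", attempts) // attempts made: 1
}
```

An ordinary `FAILED` status would still go through all retries, so genuinely flaky test sets keep their rerun chances.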



strings.Contains(errMsg, "connection reset by peer") ||
strings.Contains(errMsg, "broken pipe") ||
strings.Contains(errMsg, "server closed") ||
strings.Contains(errMsg, "eof") ||

WARNING: Overly broad string match

The check for "eof" is very short and could inadvertently match error messages that happen to contain "eof" as a substring (e.g., "malloc of size...", or any error mentioning "EOF" in a different context). Consider using a more specific match like "eof" at word boundaries, or checking for the exact string "eof" (as the full error message) or "unexpected eof".

default:
testSetStatusByErrChan = models.TestSetStatusAppHalted
}
utils.LogError(r.logger, err, "application failed to run")

SUGGESTION: Error log lacks actionable next step

Per coding guidelines, error logs should provide a logical next step for the user, not just state what failed. Consider enhancing this message to guide users on troubleshooting, e.g., "application failed to run; check the application logs for details or verify the app command is correct".

default:
testSetStatusByErrChan = models.TestSetStatusAppHalted
}
utils.LogError(r.logger, err, "application failed to run")

SUGGESTION: Error log lacks actionable next step

Same as line 855 - this error log should provide guidance to the user. Consider: "application failed to run; check the application logs for details or verify the app command is correct".

@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 24, 2026

Code Review Summary

Status: 3 Issues Found (unchanged from previous review) | Recommendation: Address before merge

Overview

Severity    Count
CRITICAL    0
WARNING     1
SUGGESTION  2

Issue Details

WARNING

File Line Issue
pkg/service/replay/replay.go 92 Overly broad string match for "eof" in isDockerComposeReplayShutdown could match unintended errors

SUGGESTION

File Line Issue
pkg/service/replay/replay.go 855 Error log "application failed to run" lacks actionable next step for the user
pkg/service/replay/replay.go 1107 Error log "application failed to run" lacks actionable next step for the user
Incremental Review Notes

New commits reviewed since 8e6a0a9:

  • Performance test workflow added: New CI workflow and shell scripts for performance testing with k6. No issues found - scripts include proper error handling with actionable messages.

  • DNS mock lookup refactored: pkg/agent/proxy/dns.go simplified by removing retry loop and updateDNSMock. Now uses stateless lookup via GetStatelessMocks. This is a good simplification that removes complexity and potential race conditions.

  • MockManager optimized: Added GetStatelessMocks method with optimized O(1) lookup maps (statelessFiltered/statelessUnfiltered). Implementation is thread-safe with proper mutex protection.

  • util.go improved: Added nil checks in filterByTimeStamp and filterByMapping. Also optimized filterByMapping to use map-based lookup (O(1)) instead of nested loops (O(n*m)).
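The map-based rewrite described for `filterByMapping` follows a standard Go pattern; a generic sketch with illustrative names and string IDs (the PR's actual signatures differ) contrasts the two shapes:

```go
package main

import "fmt"

// filterByIDsNested is the original O(n*m) shape: one inner scan per mock.
func filterByIDsNested(mocks, wanted []string) []string {
	var out []string
	for _, m := range mocks {
		for _, w := range wanted {
			if m == w {
				out = append(out, m)
				break
			}
		}
	}
	return out
}

// filterByIDsMap builds a set once in O(m), then filters in O(n) with
// constant-time membership checks.
func filterByIDsMap(mocks, wanted []string) []string {
	set := make(map[string]struct{}, len(wanted))
	for _, w := range wanted {
		set[w] = struct{}{}
	}
	var out []string
	for _, m := range mocks {
		if _, ok := set[m]; ok {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	mocks := []string{"dns-1", "http-2", "dns-3"}
	wanted := []string{"dns-3", "dns-1"}
	fmt.Println(filterByIDsMap(mocks, wanted)) // [dns-1 dns-3]
}
```

Both versions preserve the input order of `mocks`, so the optimization changes complexity but not results.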

Previous issues remain: The 3 issues from the initial review are still present in unchanged code and should still be addressed.

Other Observations (not in diff)

Other error logs in unchanged code also lack actionable next steps (lines 1182, 1309, 1347, 1638). Per custom instructions, error logs should provide a logical next step for the user, not just state what failed. Consider a follow-up to enhance these messages.

Positive Notes
  • The mutex protection for testSetStatusByErrChan via setErrStatus/getErrStatus properly addresses the data race concern
  • The mapAppErrorToTestSetStatus function correctly handles ErrTestBinStopped and has a sensible default case
  • The refactoring consolidates error-to-status mapping logic into reusable helper functions
  • The status resolution logic in resolveTestSetStatus properly handles precedence of terminal statuses
  • New GetStatelessMocks provides efficient O(1) DNS mock lookup, improving performance
  • Nil checks added to mock filtering functions improve robustness
Files Reviewed (7 files)
  • pkg/service/replay/replay.go - 3 issues (unchanged from previous review)
  • pkg/agent/proxy/dns.go - 0 new issues (simplified stateless lookup)
  • pkg/agent/proxy/mockmanager.go - 0 new issues (added GetStatelessMocks)
  • pkg/util.go - 0 new issues (added nil checks, optimized filterByMapping)
  • .github/workflows/keploy-performance-test.yml - 0 issues (new CI workflow)
  • .github/workflows/test_workflow_scripts/performance/*.sh - 0 issues (CI scripts with proper error handling)



Reviewed by claude-4.5-opus-20251124 · 447,999 tokens


@AuraReaper AuraReaper self-assigned this Mar 25, 2026
@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

Run  P50     P90     P99       RPS     Error Rate  Status
1    3.02ms  6.1ms   606.16ms  98.93   0.00%       ❌ FAIL
2    2.63ms  3.31ms  4.9ms     100.00  0.00%       ✅ PASS
3    2.55ms  3.23ms  4.95ms    100.00  0.00%       ✅ PASS

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 1 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers
