fix(ci): isolate test files into separate processes to stop kuzu SIGSEGV #317
Merged
Running the full suite in one process accumulated native runtimes (cocoindex + lancedb Tokio, kuzu scheduler, torch) that corrupted the heap, crashing kuzu's NodeTableScanState::scanNext with a SIGSEGV at ~53%. pytest-xdist --dist loadfile keeps every test file on a single worker and spreads the files across separate worker processes, so no one process accumulates the whole suite's native state. Verified on real x86 CI: 771 passed / 9 skipped, no segfault.

Co-Authored-By: Claude <[email protected]>
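To make the grouping concrete, here is a simplified model of how `--dist loadfile` treats a test file as one unit. It is not pytest-xdist's actual scheduler: the round-robin assignment and the node IDs other than `test_ladybug_queries.py` are assumptions made for illustration (xdist hands groups to workers as they become free).

```python
from collections import defaultdict


def group_by_file(node_ids):
    """Group pytest node IDs by the file part before the first '::'."""
    groups = defaultdict(list)
    for node_id in node_ids:
        groups[node_id.split("::", 1)[0]].append(node_id)
    return dict(groups)


def assign_round_robin(groups, num_workers):
    """Hand whole file groups to workers.

    Round-robin is an assumption of this sketch; the point is only that
    a file is never split across two worker processes.
    """
    plan = {worker: [] for worker in range(num_workers)}
    for index, path in enumerate(sorted(groups)):
        plan[index % num_workers].append(path)
    return plan


# Hypothetical node IDs; only test_ladybug_queries.py is named in this PR.
node_ids = [
    "tests/test_ladybug_queries.py::test_a",
    "tests/test_ladybug_queries.py::test_b",
    "tests/test_other.py::test_c",
]
groups = group_by_file(node_ids)
plan = assign_round_robin(groups, num_workers=2)
print(plan)
```

With two workers, each file lands on exactly one worker, so the native state built up by one file's tests stays inside that worker's process.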
Root cause
CI segfaulted (exit 139) at ~53% in ladybug/kuzu's `NodeTableScanState::scanNext`: a bad-pointer `memcpy` (glibc AVX) during the `find_by_name_or_fqn` `MATCH (s:Symbol)` scan. `ladybug 0.17.1` is kuzu (re-vendored as `lbug`).

Diagnosis runs (real x86, via a temporary gdb workflow on `debug/segfault-gdb`) put the crash frame at `lbug::storage::NodeTableScanState::scanNext` and compared three configurations: `test_ladybug_queries.py` alone, `OMP_NUM_THREADS=1 RAYON_NUM_THREADS=1`, and `pytest-xdist --dist loadfile`.

So the crash is accumulated cross-library native process-state corruption: by 53% one process has ~280 threads (cocoindex + lancedb each run their own Tokio runtime, plus kuzu's `TaskScheduler` and torch). That corrupts the heap; kuzu's parallel scanner later reads a Symbol string property from a bad pointer. Spreading the suite across worker processes keeps any single process from accumulating that state.

Fix

- Add `pytest-xdist` to dev deps.
- Run `pytest tests -n auto --dist loadfile -v`: all tests from a file run on the same worker, and the files are spread across separate worker processes.

This fixes CI/tests. The underlying native-stack corruption under sustained single-process load is a potential concern for the MCP server, which loads the same cocoindex + lancedb + kuzu + torch stack. Worth a follow-up: investigate the Tokio-runtime proliferation in cocoindex/lancedb and/or report upstream.
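Two of the numbers in the diagnosis can be checked with a few lines of Python. This is a minimal sketch, not the gdb workflow used for this PR: `describe_exit_code` and `native_thread_count` are hypothetical helper names, and the thread count relies on the Linux `/proc` filesystem, so it returns `None` on other platforms.

```python
import os
import signal


def describe_exit_code(code):
    """Map a shell exit code above 128 to the name of the terminating signal."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None
    return None


def native_thread_count(pid="self"):
    """Count OS-level threads via /proc/<pid>/task (Linux only).

    Unlike threading.active_count(), this also sees threads started by
    native libraries, which is where runtimes such as Tokio live.
    """
    try:
        return len(os.listdir(f"/proc/{pid}/task"))
    except OSError:
        return None


print(describe_exit_code(139))  # SIGSEGV, since 139 = 128 + 11
print(native_thread_count())
```

Logging `native_thread_count()` at the end of each test file, for example from a session-scoped fixture, would show whether the per-process thread total still climbs toward the ~280 seen in the single-process run.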
Cleanup
The throwaway debug workflow + branch (`debug/segfault-gdb`) can be deleted after this merges.

🤖 Generated with Claude Code