Context
#317 fixed the CI segfault by isolating test files into separate processes (pytest-xdist --dist loadfile). Root cause: running the full suite in one process accumulated native runtimes (cocoindex + lancedb Tokio, kuzu scheduler, torch) that corrupted the heap, crashing kuzu's NodeTableScanState::scanNext with a SIGSEGV at ~53%.
That fix is CI/tests only. This issue tracks the open production question: is the long-running MCP server at risk of the same native-stack corruption?
What the server loads at runtime
- ✅ kuzu (ladybug/lbug) — graph reads
- ✅ lancedb — vector reads
- ✅ torch / sentence-transformers — query encoding
- ❌ cocoindex — not imported by the server; only the indexer (java_index_flow_lancedb.py, via init/increment) uses it.
So the server's steady-state stack (kuzu + lancedb + torch) is lighter than the crashing test process (which also had cocoindex + hundreds of repeated create/discard cycles). Risk is plausibly lower but unconfirmed — we don't know whether (a) the server leaks native resources per request, or (b) kuzu + lancedb + torch alone corrupt under sustained load.
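The "not imported by the server" claim can be checked mechanically rather than by reading the code. Below is a minimal sketch, assuming the top-level package names match the list above; the import names (in particular whether ladybug imports as `lbug`) and the server module name are assumptions, not confirmed by this issue.

```python
import sys

# Top-level package names are assumptions based on the list above.
# Adjust them if the installed distributions import under different names.
NATIVE_STACKS = ("kuzu", "lbug", "lancedb", "torch", "cocoindex")


def loaded_native_stacks(modules=None):
    """Return {package: bool} telling which native stacks are imported."""
    modules = sys.modules if modules is None else modules
    # Reduce "torch.nn.functional" to "torch" so submodules count too.
    top_level = {name.split(".", 1)[0] for name in modules}
    return {pkg: pkg in top_level for pkg in NATIVE_STACKS}


if __name__ == "__main__":
    # Import the server entry point first (hypothetical module name), e.g.
    #   import my_mcp_server
    for pkg, loaded in loaded_native_stacks().items():
        print(f"{pkg:10s} {'loaded' if loaded else 'not loaded'}")
```

Running this after importing the server module, and again after serving one graph query and one vector query, would show whether cocoindex is pulled in lazily by some code path.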
Plan
- Measure (do first): sustained-load test of the server on real x86 — index the bank-chat corpus (or a fixture), start the server, drive thousands of mixed graph + vector queries, monitor thread count, RSS, crashes over time. Flat + no crash → low risk. Growth or crash → confirmed.
- If at risk:
  - Audit connection reuse — ensure one lancedb + one kuzu connection (singletons), not one per request.
  - Lazy-load torch — defer SBERT until the first vector query.
  - Architectural isolation — split vector search (lancedb + torch) from graph queries (kuzu) into separate processes.
  - Upstream — report to ladybug/lancedb if co-load corruption is confirmed.
- Always good (regardless of measurement): run the server under a supervisor with restart=always + health check; consider a periodic graceful restart.
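The monitoring half of the measurement step could be sketched as follows. This is a Linux-only sketch that samples `/proc/<pid>/status` for the server process. The 10% growth tolerance, the sampling interval, and the "compare first quarter to last quarter" rule are illustrative assumptions, not thresholds agreed in this issue.

```python
import time


def parse_proc_status(text):
    """Extract thread count and RSS (kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return {
        "threads": int(fields["Threads"]),
        "rss_kb": int(fields["VmRSS"].split()[0]),  # value looks like "123456 kB"
    }


def sample(pid):
    """Take one sample from a live process."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_status(fh.read())


def grew(samples, key, tolerance=0.10):
    """True if the last quarter's mean exceeds the first quarter's by > tolerance."""
    n = max(1, len(samples) // 4)
    head = sum(s[key] for s in samples[:n]) / n
    tail = sum(s[key] for s in samples[-n:]) / n
    return tail > head * (1 + tolerance)


def monitor(pid, interval_s=5.0, count=720):
    """Sample until `count` samples are collected or the process disappears."""
    samples = []
    for _ in range(count):
        try:
            samples.append(sample(pid))
        except (FileNotFoundError, ProcessLookupError, KeyError):
            # Process exited, or became a zombie with no VmRSS field.
            print("server process is gone: treat as a crash")
            break
        time.sleep(interval_s)
    return samples
```

Run `monitor(server_pid)` in parallel with the query driver, then check `grew(samples, "threads")` and `grew(samples, "rss_kb")`. Both false with no early exit corresponds to the "flat + no crash" outcome above.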
Evidence
gdb backtrace from the debug investigation (real x86): kuzu NodeTableScanState::scanNext → bad-pointer memcpy (glibc AVX), on a kuzu TaskScheduler worker thread, with ~280 threads present (cocoindex + lancedb tokio-rt-worker + kuzu + torch). See the experiment matrix in #317.
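To compare the server's thread population against the ~280 threads seen in the crashing test process, threads can be grouped by name. This is a minimal Linux-only sketch that reads `/proc/<pid>/task/*/comm`. Only the `tokio-rt-worker` name comes from the backtrace above; the names used by kuzu and torch threads are not stated here and would have to be read off the output.

```python
from collections import Counter
from pathlib import Path


def thread_names(pid):
    """Return the kernel-visible name of every thread in the process."""
    names = []
    for comm in Path(f"/proc/{pid}/task").glob("*/comm"):
        try:
            names.append(comm.read_text().strip())
        except OSError:
            continue  # thread exited between listing and reading
    return names


def summarize(names):
    """Group thread names and return (name, count) pairs, most common first."""
    return Counter(names).most_common()


if __name__ == "__main__":
    import os

    # Inspect this process by default; pass the server's PID in practice.
    for name, count in summarize(thread_names(os.getpid())):
        print(f"{count:5d}  {name}")
```

A per-name count that keeps rising across requests would point at unknown (a), a per-request native resource leak.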
🤖 Generated with Claude Code