Follow-up: quantify & mitigate native-stack crash risk for the production MCP server #318

## Context
#317 fixed the CI segfault by isolating test files into separate processes (`pytest-xdist --dist loadfile`). The root cause was that running the full suite in one process accumulated native runtimes (cocoindex + lancedb Tokio, kuzu scheduler, torch), which corrupted the heap and crashed kuzu's `NodeTableScanState::scanNext` with a SIGSEGV at ~53% of the run.

That fix is **CI/tests only**. This issue tracks the open production question: **is the long-running MCP server at risk of the same native-stack corruption?**

## What the server loads at runtime
- &#9989; kuzu (ladybug/lbug) &mdash; graph reads
- &#9989; lancedb &mdash; vector reads
- &#9989; torch / sentence-transformers &mdash; query encoding
- &#10060; cocoindex &mdash; **not** imported by the server; only the indexer (`java_index_flow_lancedb.py`, via `init`/`increment`) uses it.

So the server's steady-state stack (kuzu + lancedb + torch) is lighter than the crashing test process (which also had cocoindex + hundreds of repeated create/discard cycles). Risk is plausibly lower but **unconfirmed** &mdash; we don't know whether (a) the server leaks native resources per request, or (b) kuzu + lancedb + torch alone corrupt under sustained load.
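
A quick way to confirm the stack listed above from inside the running server is to inspect `sys.modules`. This is a minimal sketch, not code from the repo: the helper names are hypothetical, and the import names `kuzu` and `lbug` are assumptions based on how the libraries are referred to in this issue.

```python
import sys

# Top-level import names for the native stacks discussed in this issue.
# ASSUMPTION: the graph library is importable as "kuzu" or "lbug"; adjust to
# whatever the server actually imports.
NATIVE_STACKS = ("kuzu", "lbug", "lancedb", "torch", "cocoindex")


def loaded_native_stacks(modules=None):
    """Return the native stacks currently imported, in NATIVE_STACKS order."""
    names = list(sys.modules if modules is None else modules)
    top_level = {name.split(".")[0] for name in names}
    return [stack for stack in NATIVE_STACKS if stack in top_level]


def assert_no_cocoindex(modules=None):
    """Raise if cocoindex has been imported into this process."""
    loaded = loaded_native_stacks(modules)
    if "cocoindex" in loaded:
        raise RuntimeError(f"cocoindex is loaded in the server process: {loaded}")
    return loaded
```

Calling `assert_no_cocoindex()` from a startup hook or a health endpoint would catch a future change that pulls the indexer's dependencies into the server by accident.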

## Plan
1. **Measure (do first):** sustained-load test of the server on real x86 &mdash; index the `bank-chat` corpus (or a fixture), start the server, drive thousands of mixed graph + vector queries, monitor **thread count, RSS, crashes** over time. Flat + no crash &rarr; low risk. Growth or crash &rarr; confirmed.
2. **If at risk:**
   - **Audit connection reuse** &mdash; ensure one lancedb + one kuzu connection (singletons), not one per request.
   - **Lazy-load torch** &mdash; defer SBERT until the first vector query.
   - **Architectural isolation** &mdash; split vector search (lancedb + torch) from graph queries (kuzu) into separate processes.
   - **Upstream** &mdash; report to ladybug/lancedb if co-load corruption is confirmed.
3. **Always good (regardless of measurement):** run the server under a supervisor with `restart=always` + health check; consider a periodic graceful restart.
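
The connection-reuse and lazy-load items in step 2 could look roughly like the sketch below: each native resource is created once, on first use, and shared by every request afterwards. `LazySingleton`, `open_lancedb`, `load_encoder`, the path, and the model name are all hypothetical placeholders, not the server's actual code. A kuzu connection would follow the same pattern.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Create a resource on first use, then hand back the same instance."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[T] = None
        self._created = False

    def get(self) -> T:
        # Double-checked locking: concurrent first calls run the factory once.
        if not self._created:
            with self._lock:
                if not self._created:
                    self._instance = self._factory()
                    self._created = True
        return self._instance


def open_lancedb(uri: str = "./lancedb-data"):
    """ASSUMPTION: placeholder path; the import happens only when called."""
    import lancedb
    return lancedb.connect(uri)


def load_encoder(model_name: str = "all-MiniLM-L6-v2"):
    """ASSUMPTION: placeholder model; torch is pulled in here, not at startup."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


# Module-level holders: nothing native is loaded until .get() is first called.
vector_db = LazySingleton(open_lancedb)
encoder = LazySingleton(load_encoder)
```

A request handler would call `vector_db.get()` and `encoder.get()` instead of opening a connection or constructing a model itself, so graph-only traffic never loads torch and no request creates a second connection.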

## Evidence
gdb backtrace from the debug investigation (real x86): kuzu `NodeTableScanState::scanNext` &rarr; bad-pointer `memcpy` (glibc AVX), on a kuzu `TaskScheduler` worker thread, with ~280 threads present (cocoindex + lancedb `tokio-rt-worker` + kuzu + torch). See the experiment matrix in #317.
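
To track the same signals (total thread count, thread mix by name, RSS) during the sustained-load test, the server process can be sampled from `/proc`. This is a minimal Linux-only sketch using only the standard library; the function names are hypothetical. Note that the kernel truncates thread names in `comm` to 15 characters.

```python
from collections import Counter
from pathlib import Path


def parse_status(status_text):
    """Extract (thread count, RSS in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    threads = int(fields["Threads"])
    rss_kb = int(fields["VmRSS"].split()[0])  # value looks like "123456 kB"
    return threads, rss_kb


def thread_names(pid):
    """Count threads by name via /proc/<pid>/task/<tid>/comm."""
    counts = Counter()
    for comm in Path(f"/proc/{pid}/task").glob("*/comm"):
        try:
            counts[comm.read_text().strip()] += 1
        except OSError:
            continue  # thread exited between listing and reading
    return counts


def sample(pid):
    """Take one snapshot of the process; call periodically and log the result."""
    threads, rss_kb = parse_status(Path(f"/proc/{pid}/status").read_text())
    return {"threads": threads, "rss_kb": rss_kb, "by_name": dict(thread_names(pid))}
```

Logging `sample(server_pid)` every few seconds while the load driver runs gives the time series needed for the flat-versus-growing judgment in the plan, and `by_name` shows which runtime any extra threads belong to.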

&#129302; Generated with [Claude Code](https://claude.com/claude-code)
