Skip to content

Add extensions framework with entity management and graph sync#548

Merged
jfrench9 merged 17 commits intomainfrom
feature/ledger-graph-sync
Mar 27, 2026
Merged

Add extensions framework with entity management and graph sync#548
jfrench9 merged 17 commits intomainfrom
feature/ledger-graph-sync

Conversation

@jfrench9
Copy link
Copy Markdown
Member

Summary

Introduces a generalized extensions framework that replaces the previous ledger-scoped module structure, enabling support for multiple extension types (starting with RoboLedger). This PR also adds a new entity management system for the ledger API and enhances the graph materialization pipeline to support dynamic synchronization between the entity graph and extension data.

Key Changes

Architecture: ledgerextensions Refactor

  • Renamed and reorganized the ledger module hierarchy to extensions, with RoboLedger becoming a sub-module under extensions/roboledger/
  • Migrated all models, API schemas, operations, database configuration, and Alembic migrations from ledger to extensions namespacing
  • Updated QuickBooks pipeline references from roboledger to extensions
  • Renamed database configuration (db/ledger.pydb/extensions.py) and extended it to support the new entity tables

Entity Management

  • Added new Entity SQLAlchemy model and corresponding API models for CRUD operations
  • Introduced a new /ledger/entity router with full entity creation and update endpoints
  • Added Alembic migration (1eec7fa2156e) for the entities table
  • Integrated entity router into the ledger API module

Graph Materialization & Sync

  • Added robosystems/operations/extensions/materialize.py (540+ lines) — a comprehensive materialization engine that syncs graph data into extension tables
  • Enhanced materialization logic with dynamic QName prefixes based on external source
  • Added support for graph-type-aware upload and materialization flows
  • Refactored entity_graph_service.py to simplify and streamline entity creation/update processes

Dagster Integration

  • Added new robosystems/dagster/jobs/extensions.py with job definitions for extension materialization workflows
  • Updated Dagster definitions to register the new extension jobs

Infrastructure

  • Updated Postgres initialization to accommodate the renamed extensions schema
  • Updated environment variable examples and entrypoint scripts for the new database naming
  • Disabled the accounting demo (noted for re-enablement when entity graph features are ready)

Breaking Changes

  • Module renaming: All imports referencing robosystems.models.ledger, robosystems.db.ledger, robosystems.operations.ledger, and robosystems.models.api.ledger must be updated to their extensions equivalents
  • Migration path renamed: Alembic config moved from migrations/ledger.inimigrations/extensions.ini; existing deployments will need migration path updates
  • Environment variables: Database configuration env vars have changed (see .env.example updates)
  • Accounting demo disabled: The demo workflow is temporarily unavailable pending entity graph feature completion

Testing

  • Added unit tests for the new materialization logic (tests/operations/extensions/test_materialize.py)
  • Added entity router tests (tests/routers/ledger/test_entity.py)
  • Added materialization source routing tests (tests/routers/graphs/test_materialize_source.py)
  • Migrated and updated existing loader tests to the new extensions path
  • Updated all affected test files (QuickBooks pipeline, graph entity service, ledger router tests) to reflect the refactored module structure

Infrastructure Considerations

  • Database initialization scripts have been updated for the new schema naming — existing environments will require re-initialization or manual schema migration
  • Alembic migration configuration has moved; deployment pipelines referencing the old migration path will need updates
  • New Dagster jobs should be registered in orchestration environments to enable extension materialization scheduling

🤖 Generated with Claude Code

Branch Info:

  • Source: feature/ledger-graph-sync
  • Target: main
  • Type: feature

Co-Authored-By: Claude [email protected]

jfrench9 added 13 commits March 26, 2026 14:47
- Introduced support for ledger materialization jobs in the Dagster definitions.
- Added installation and loading of the postgres_scanner in the DuckDB connection pool.
- Updated the RoboLedger operations docstring to reflect new materialization features.
- Adjusted test assertions to expect 16 accounts instead of 15, ensuring accuracy in transaction validation.
- Added the "Entity" table to the materialization order to support ENTITY_HAS_TRANSACTION relationships.
- Updated the connection string in `build_postgres_connstr` to clarify its usage context for DuckDB's postgres_scanner.
- Modified SQL queries in `_staging_sql` to include the tenant schema in `postgres_scan` calls, ensuring accurate data retrieval for various tables.
- Introduced a new method to update entity information from the CompanyInfo connector, enhancing the entity management workflow.
- Simplified the entity creation process in the EntityGraphService by removing the controlled ingestion pattern and directly inserting into the OLTP database.
- Updated SQL materialization logic to ensure accurate retrieval of entity data, including additional fields for better data integrity.
- Enhanced unit tests to reflect changes in entity creation and update flows, ensuring robust testing of the new functionality.
- Included the entity router in the main ledger router to enhance modularity and support for entity-related operations.
- This change improves the organization of the API routes within the ledger module.
- Updated environment configuration to replace `ROBOLEDGER_DATABASE_URL` with `EXTENSIONS_DATABASE_URL` for better clarity and modularity.
- Modified database initialization scripts to create the `extensions` database and its test counterpart.
- Adjusted migration scripts and entry points to support the new extensions database structure.
- Enhanced the Dagster job definitions to include materialization for extensions data, ensuring compatibility with the new database schema.
- Introduced new models and API responses for extensions, facilitating integration with existing systems.
…sed on external source

- Modified the `_staging_sql` function to generate the `qname` field with dynamic prefixes ('qb:', 'xero:', 'plaid:', or 'rl:') based on the `external_source` value, enhancing data representation in the materialization process.
…sion

- Revised the description of the RoboLedger extension to specify that it utilizes OLTP tables in the `extensions` database with a schema-per-tenant approach, enhancing clarity on the database architecture.
- Renamed and updated the materialization job and configuration classes from `Ledger` to `Extensions` to reflect the new focus on extensions data.
- Modified the materialization process to support extensions data from the OLTP database, including changes to the API for triggering materialization.
- Enhanced the `MaterializeRequest` model to include a `source` field for specifying the data source during materialization.
- Updated logging messages and job tags to align with the new extensions functionality.
- Updated the `MaterializeRequest` model to allow `source` to be optional, with auto-detection based on graph type.
- Added validation to ensure the correct data source is used for entity and generic graphs during materialization.
- Modified the file upload logic to prevent uploads for entity graphs, directing users to use the materialization endpoint instead.
- Improved unit tests to reflect changes in graph type handling and validation logic.
… deprecation. The accounting demo is currently disabled due to changes in the entity graph handling, with a note to re-enable it when relevant features are ready.
…eline and related models

- Changed the database references in the QuickBooks pipeline to reflect the use of the extensions PostgreSQL database instead of the roboledger database.
- Updated descriptions and comments in the codebase to ensure consistency with the new database structure.
- Adjusted model documentation to clarify that entity details and fiscal period controls now pertain to the extensions database.
- Updated the README to provide a clearer description of the RoboSystems platform, emphasizing its capabilities in connecting data sources and building domain ontologies.
- Introduced new unit tests for the materialization pipeline, covering various scenarios including connection handling, table generation, and source validation for entity and generic graphs.
- Added tests for the materialize endpoint to ensure correct source inference and validation based on graph types, enhancing the robustness of the API.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Mar 27, 2026

Claude finished @jfrench9's task in 2m 24s —— View job


PR Review: Add extensions framework with entity management and graph sync

Tasks:

  • Gather context and understand the PR
  • Review core new files (materialize.py, entity models, extensions db)
  • Review API/router changes
  • Review Dagster integration
  • Review tests
  • Post final review

Overall this is a solid, well-structured PR. The architecture is clean — the ledger → extensions rename is thorough, the multi-tenant schema pattern is properly implemented with SQL injection defense, and the materialization pipeline is well-documented. A few issues warrant attention before merging.


Security

Credentials in SQL strings sent to Graph APImaterialize.py:95

build_postgres_connstr builds a connection string including the database password. This connstr is then embedded verbatim into every staging SQL statement in _staging_sql and sent to the Graph API via client.query_table. Any query logging at the DuckDB/Graph API layer will expose the database password in plain text.

# materialize.py:95
connstr = f"dbname={dbname} user={user} password={password} host={host} port={port}"
# ...then at :145:
FROM postgres_scan('{c}', '{s}', 'entities')  # c = connstr with password

This is an accepted pattern in postgres_scanner usage, but worth auditing whether the Graph API logs incoming SQL queries in staging/prod. If it does, consider filtering or truncating query logs for postgres_scan calls. Fix this →


Bugs

result.status stays "success" when critical staging tables failmaterialize.py:506-514

When a non-dimension relationship table fails to stage, the error is appended to result.errors but the status remains "success". The Dagster job checks result.status == "error" to decide whether to raise Failure. A partial failure in LINE_ITEM_RELATES_TO_ELEMENT or ENTITY_HAS_TRANSACTION will silently produce an incomplete graph with a success status.

# materialize.py:507-514
except Exception as e:
    error_msg = f"Failed to stage {table_name}: {e!s}"
    logger.warning(error_msg)
    if table_name in RELATIONSHIP_TABLES and "DIMENSION" in table_name:
        logger.info(f"Skipping {table_name} (no dimension data)")
    else:
        result.errors.append(error_msg)
        # ← result.status never set to "error" here

Fix this →

Dimension junction tables are never populated but always queriedloader.py:366-367, materialize.py:354-376

The OLTPLoader has a TODO acknowledging that transaction_dimensions, entry_dimensions, and line_item_dimensions junction tables are never written to. But _staging_sql still builds SQL to read from them. When these tables don't exist, staging will fail and the error will be silently swallowed (since they contain "DIMENSION" in the name). This is fine for now but should be tracked — if someone adds dimension data in the future, the junction wiring is missing.

session.query(Entity).first() is non-deterministicentity.py:75, 109

Both get_entity and update_entity call session.query(Entity).first() without an ORDER BY. With the parent/child hierarchy support (is_parent, parent_entity_id), there can be multiple entity rows per schema. The result depends on the DB execution plan. Should at minimum filter by Entity.is_parent == True or order by Entity.created_at.

build_postgres_connstr takes graph_id but never uses itmaterialize.py:69

The function signature accepts graph_id: str which is never referenced in the function body. This is currently harmless since it's used as the schema name in _staging_sql, but the unused parameter is misleading. Fix this →


Code Quality

Feature flag naming mismatchdb/extensions.py:64

The renamed module is extensions everywhere, but the guard still checks LEDGER_ENABLED:

# extensions.py:64
if not env.LEDGER_ENABLED:
    raise RuntimeError(
        "Extensions database access attempted but LEDGER_ENABLED is false. ..."
    )

The env var name and the error message both say LEDGER_ENABLED while the module is now called extensions. If the env var is intentionally staying as LEDGER_ENABLED for backwards compatibility, the error message should be updated to clarify this. Fix this →

Tests in test_materialize_source.py inline production logictests/routers/graphs/test_materialize_source.py:43-154

TestSourceInference, TestSourceValidation, and TestEntityGraphFileUploadGuard don't call the actual endpoint — they duplicate the logic inline with SimpleNamespace. If the logic in materialize_graph ever changes, these tests will still pass. Consider either:

  • Testing the handler directly with mocked dependencies
  • Or documenting these as "logic unit tests" reflecting the expected behavior spec
# test_materialize_source.py:48-52
# Simulate the inference logic from the endpoint
if req.source is None:
    req.source = "extensions" if graph.graph_type == "entity" else "staged"
assert req.source == "extensions"
# ← Tests the test's own code, not the production endpoint

Account deletion scoped to source type, not connectionloader.py:153-158

The existing TODO comment accurately captures a real data integrity issue: if a graph has two QuickBooks connections, re-syncing one will delete accounts from both. Worth flagging on the PR as something to address before supporting multi-connection graphs.


Minor Notes

  • robosystems/dagster/jobs/extensions.py:60asyncio.new_event_loop() in a Dagster op is the right pattern for running async code synchronously, but asyncio.set_event_loop(loop) sets the global event loop, which could interfere if Dagster ever moves to an async worker model. The loop.close() in the finally block is correct.
  • entity.py router uses legacy session.query() API. Not a blocker, but the rest of the codebase may be moving toward select().
  • The Structure table (line 248-261 in materialize.py) uses a SELECT with no FROM clause to create a single synthetic row. This is valid in DuckDB, but worth a comment explaining the intent.

Summary: The core architecture is sound. The two issues most worth fixing before merge are the silent success status on staging failures (which could produce corrupted graphs with no visible error), and the non-deterministic entity fetch without ordering/filtering.

- Updated the `_get_engine` function to provide clearer instructions for enabling the extensions OLTP database.
- Refactored the `build_postgres_connstr` function to remove the unnecessary `graph_id` parameter, simplifying connection string construction.
- Adjusted entity retrieval queries in the ledger router to filter for parent entities, improving data accuracy.
- Enhanced unit tests to reflect changes in connection string handling and entity query logic, ensuring comprehensive coverage.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude have we fixed all issues?

@claude
Copy link
Copy Markdown

claude Bot commented Mar 27, 2026

Claude finished @jfrench9's task in 1m 38s —— View job


Here's the status of every issue from the previous review:

Issue Status

# Issue Status
1 result.status stays "success" on staging failures ✅ Fixed
2 Non-deterministic session.query(Entity).first() ✅ Fixed
3 build_postgres_connstr takes unused graph_id param ✅ Fixed
4 LEDGER_ENABLED naming mismatch in db/extensions.py ⚠️ Partially fixed
5 Credentials embedded in SQL strings to Graph API ℹ️ Unchanged (accepted pattern)
6 Tests in test_materialize_source.py inline production logic ❌ Not fixed
7 Account deletion scoped to source type, not connection ❌ Not fixed (TODO still present)
8 Dimension junction tables never populated ℹ️ Unchanged (known limitation)

Details

✅ Bug fixes confirmed

  • result.status on staging failures (materialize.py:510-512): Now correctly sets result.status = "error" for non-dimension table failures. The Dagster job will raise Failure as expected.

  • Non-deterministic entity fetch (entity.py:75, 109): Both get_entity and update_entity now filter with .filter(Entity.is_parent.is_(True)), making the query deterministic.

  • Unused graph_id param (materialize.py:69): build_postgres_connstr() signature no longer accepts graph_id. Callers updated accordingly.


⚠️ LEDGER_ENABLED naming — partially fixed

db/extensions.py:64-69 still checks env.LEDGER_ENABLED and env.py:575 still defines it as LEDGER_ENABLED. The error message was updated to name which modules the flag enables, which is helpful, but the env var itself wasn't renamed to EXTENSIONS_ENABLED. This is a backwards-compatibility choice — just ensure it's documented. Fix this →


❌ Tests still inline production logic (tests/routers/graphs/test_materialize_source.py)

TestSourceInference and TestSourceValidation still replicate the endpoint logic using SimpleNamespace rather than calling the actual handler. For example:

# test_materialize_source.py:49-52
if req.source is None:
    req.source = "extensions" if graph.graph_type == "entity" else "staged"
assert req.source == "extensions"

This tests the test's own code. If the inference logic in materialize_graph ever changes, these tests will still pass. The TestMaterializeRequestSourceField tests (lines 11-34) are solid — they test the actual MaterializeRequest model. The remaining two classes need to be rewritten to call the handler directly with mocked dependencies, or explicitly marked as behavior specification tests with a comment explaining why. Fix this →


❌ Account deletion scoped to source type (loader.py:153-155)

The TODO comment is still present and the issue is unresolved — re-syncing one QuickBooks connection in a multi-connection graph still deletes accounts from all connections of the same source type. Not a blocker for the current use case (single connection), but should be addressed before multi-connection support ships.


ℹ️ Credentials in SQL stringsbuild_postgres_connstr() still embeds the database password in the connection string passed to the Graph API (this is required by postgres_scanner). The original note stands: audit whether the Graph API logs incoming staging SQL queries in prod. No code change needed unless query logging is confirmed to capture these strings.


Summary: 3 of 5 actionable bugs are fixed. The two remaining items (test_materialize_source.py logic isolation and LEDGER_ENABLED rename) are low-risk but worth cleaning up before merge.
|

- Added the `resolve_materialization_source` function to infer and validate the materialization source based on the graph type, enhancing clarity and maintainability.
- Refactored the materialization logic to utilize the new function, streamlining source inference and validation for entity and generic graphs.
- Updated unit tests to cover various scenarios for source resolution, ensuring robust validation and error handling for incompatible source and graph type combinations.
- Added `EXTENSIONS_ENABLED` feature flag to the environment configuration, allowing control over the extensions OLTP database and related functionalities.
- Updated the README to clarify the role of the extensions in the RoboSystems platform, including the removal of the `LEDGER_ENABLED` references in favor of a more modular approach.
- Refactored the entrypoint and database access logic to utilize the new `EXTENSIONS_ENABLED` flag, ensuring proper handling of extensions migrations and database access.
- Enhanced comments and documentation throughout the codebase to reflect the changes in feature flagging and database structure.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude can you review the latest changes. We have implemented a new EXTENSIONS_ENABLED to properly gate the extensions database creation

@claude
Copy link
Copy Markdown

claude Bot commented Mar 27, 2026

Claude finished @jfrench9's task in 1m 8s —— View job


PR Review: Latest Changes (EXTENSIONS_ENABLED)

Tasks:

  • Gather context
  • Review EXTENSIONS_ENABLED implementation in config/env.py
  • Review db/extensions.py guard
  • Review remaining open issues from prior review
  • Post final review

EXTENSIONS_ENABLED — Correctly Implemented ✅

The naming issue from the prior review is fully resolved. The implementation is clean:

config/env.py:577-587 — Two well-separated flags with clear documentation:

# Controls the extensions PostgreSQL database (engine, migrations, Dagster jobs).
# Required for any extension module (RoboLedger, RoboInvestor, etc.).
EXTENSIONS_ENABLED = get_bool_env("EXTENSIONS_ENABLED", ...)

# Controls the /v1/ledger/* API routes. Requires EXTENSIONS_ENABLED=true.
LEDGER_ENABLED = get_bool_env("LEDGER_ENABLED", ...)

db/extensions.py:64-69 — Guard now checks env.EXTENSIONS_ENABLED with accurate error message.

dagster/definitions.py:127 — Dagster jobs gated on env.EXTENSIONS_ENABLED.

routers/__init__.py:158 — Ledger API routes still gated on env.LEDGER_ENABLED (correct — the API layer has a separate toggle from the database layer).

.env.example — Both flags documented and commented out by default.

The two-flag design is well-thought-out: EXTENSIONS_ENABLED gates the database infrastructure, while LEDGER_ENABLED gates the API surface. A deployment could have the database enabled without exposing the API endpoints, which is useful for Dagster-only workflows.


Prior Review — Updated Issue Tracker

# Issue Status
1 result.status stays "success" on staging failures ✅ Fixed
2 Non-deterministic session.query(Entity).first() ✅ Fixed
3 build_postgres_connstr takes unused graph_id param ✅ Fixed
4 LEDGER_ENABLED naming mismatch in db/extensions.py ✅ Fixed
5 Credentials embedded in SQL strings to Graph API ℹ️ Accepted pattern
6 Tests in test_materialize_source.py inline production logic ✅ Fixed
7 Account deletion scoped to source type, not connection ❌ TODO still present
8 Dimension junction tables never populated ℹ️ Known limitation

Issue 6 is now fixed: TestResolveSource correctly calls resolve_materialization_source (the actual production function extracted from the handler), not a duplicated inline copy. TestMaterializeRequestSourceField tests the Pydantic model directly. Both are now well-structured.

The TestEntityGraphFileUploadGuard class still uses SimpleNamespace (lines 80-92), but it's testing Python attribute access semantics rather than any business logic — it's low-value but not misleading like before.


One Outstanding Issue

Account deletion scoped to source type (loader.py:153-158) — the TODO acknowledging that re-syncing one QuickBooks connection deletes accounts from all connections of the same source type is still unresolved. Not a blocker for a single-connection setup, but worth flagging again before multi-connection support ships.


Summary: All 4 bugs flagged in the prior review are fixed, the EXTENSIONS_ENABLED rename is properly implemented, and the test quality issues are resolved. The only remaining item is the known TODO on account deletion scope. This PR is ready to merge.

@jfrench9 jfrench9 merged commit 3194440 into main Mar 27, 2026
7 checks passed
@jfrench9 jfrench9 deleted the feature/ledger-graph-sync branch March 27, 2026 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant