@knguyen1 knguyen1 commented Mar 20, 2025

Description

This PR adds S3 integration to GraphRAG. It supports both AWS S3 and S3-compatible services such as MinIO (via endpoint_url).

Related Issues

#1306

Proposed Changes

  • Add S3 pipeline storage implementation with full PipelineStorage interface support (graphrag/storage/s3_pipeline_storage.py)
  • Add S3 workflow callbacks for logging workflow events to S3 buckets (graphrag/callbacks/s3_workflow_callbacks.py)
  • Add S3 prompt loading capability for retrieving prompts directly from S3 buckets (graphrag/config/prompt_getter.py)
  • Add configuration support for S3 across all storage components (input, output, cache, reporting)
  • Add comprehensive documentation covering configuration, authentication options, and troubleshooting (docs/config/s3.md)
  • Add unit tests with mocked AWS services for all S3 components
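For readers unfamiliar with the "mocked AWS services" approach in the last item, here is a minimal sketch of the idea using only the standard library's unittest.mock. It is not taken from the PR: the helper list_keys and the test name are hypothetical, and the PR's own tests may well use a different mocking tool. The only real API assumed is the boto3 S3 client's get_paginator("list_objects_v2") call and its paginate(Bucket=..., Prefix=...) method.

```python
from unittest.mock import MagicMock


def list_keys(client, bucket, prefix=""):
    """Hypothetical helper: return every object key under `prefix`.

    `client` is expected to behave like a boto3 S3 client.
    """
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def test_list_keys_follows_pagination():
    # The mock stands in for the S3 client, so no network or credentials are needed.
    client = MagicMock()
    client.get_paginator.return_value.paginate.return_value = [
        {"Contents": [{"Key": "output/a.parquet"}]},
        {"Contents": [{"Key": "output/b.parquet"}]},
        {},
    ]

    assert list_keys(client, "my-bucket", "output/") == [
        "output/a.parquet",
        "output/b.parquet",
    ]
    client.get_paginator.assert_called_once_with("list_objects_v2")
    client.get_paginator.return_value.paginate.assert_called_once_with(
        Bucket="my-bucket", Prefix="output/"
    )
```

Passing the client in as an argument is what makes the mock easy to substitute; the same test would run unchanged under pytest.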

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

  • Supports multiple authentication methods: explicit credentials, environment variables, AWS credential chain, and IAM roles
  • Compatible with S3-compatible storage services via configurable endpoint URLs
  • Implements lazy loading of S3 clients for improved performance
  • Includes proper error handling and logging for S3 operations
  • Storage paths are configurable via environment variables or YAML configuration
  • All S3 operations are thoroughly tested with mocked AWS services
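The notes on endpoint URLs, credential fallback, and lazy client loading can be illustrated with a short sketch. This is an assumption-laden illustration, not the code in graphrag/storage/s3_pipeline_storage.py: the class LazyS3Storage, its fields, and the client_factory hook are all hypothetical. The boto3 calls it relies on (boto3.client("s3", ...) with endpoint_url, region_name, aws_access_key_id, and aws_secret_access_key, plus get_object) do exist with those names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


def _default_client_factory(**kwargs: Any) -> Any:
    # boto3 is imported here so that constructing the storage object
    # does not require boto3 until a client is actually needed.
    import boto3

    return boto3.client("s3", **kwargs)


@dataclass
class LazyS3Storage:
    """Hypothetical storage wrapper that creates its S3 client on first use."""

    bucket_name: str
    prefix: str = ""
    endpoint_url: str | None = None  # e.g. a MinIO endpoint
    region_name: str | None = None
    aws_access_key_id: str | None = None
    aws_secret_access_key: str | None = None
    client_factory: Callable[..., Any] = _default_client_factory
    _client: Any = field(default=None, init=False, repr=False)

    def client_kwargs(self) -> dict[str, Any]:
        # Only pass options that were set explicitly. Anything left out
        # falls back to boto3's own resolution (environment variables,
        # shared credential files, IAM roles).
        candidates = {
            "endpoint_url": self.endpoint_url,
            "region_name": self.region_name,
            "aws_access_key_id": self.aws_access_key_id,
            "aws_secret_access_key": self.aws_secret_access_key,
        }
        return {k: v for k, v in candidates.items() if v is not None}

    @property
    def client(self) -> Any:
        # Build the client once, on first access, then reuse it.
        if self._client is None:
            self._client = self.client_factory(**self.client_kwargs())
        return self._client

    def key_for(self, name: str) -> str:
        base = self.prefix.strip("/")
        return f"{base}/{name}" if base else name

    def get_text(self, name: str, encoding: str = "utf-8") -> str:
        response = self.client.get_object(
            Bucket=self.bucket_name, Key=self.key_for(name)
        )
        return response["Body"].read().decode(encoding)
```

Under these assumptions, LazyS3Storage(bucket_name="my-bucket", endpoint_url="http://localhost:9000") would target an S3-compatible service, while omitting endpoint_url and the key fields would leave endpoint and credential resolution entirely to boto3.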

@knguyen1 knguyen1 requested review from a team as code owners March 20, 2025 14:45
@knguyen1 knguyen1 changed the title from "Feat/add s3 support" to "feat(aws): add s3 support to input, storage, output, cache, etc." Mar 20, 2025
@knguyen1
Author

@microsoft-github-policy-service agree

@Sirorororo

Can you add the option to enter the endpoint URL to the boto3 client as well so that storage to other platforms such as minIO is also possible through the S3 API?

@knguyen1
Author

knguyen1 commented Apr 9, 2025

> Can you add the option to enter the endpoint URL to the boto3 client as well so that storage to other platforms such as minIO is also possible through the S3 API?

Done: f1fd55d

@knguyen1
Author

knguyen1 commented Apr 9, 2025

Please review @natoverse

@knguyen1 knguyen1 force-pushed the feat/add-s3-support branch from 30380b4 to 4dcc89d Compare April 10, 2025 06:44
@knguyen1 knguyen1 force-pushed the feat/add-s3-support branch from 6272869 to 4040d4b Compare April 24, 2025 13:07
@knguyen1
Author

@natoverse @AlonsoGuevara review please?

@qcloop

qcloop commented May 21, 2025

What is the status of this PR? We run our infra on AWS, so it would be cool to have this functionality.

@knguyen1
Author

Unless you review this PR soon, I'm going to close without merging. I am now getting conflicts too numerous and too complex to resolve cleanly. @natoverse @AlonsoGuevara

knguyen1 added 2 commits June 13, 2025 08:48
- Resolved conflicts in graphrag/config/defaults.py by accepting upstream changes and removing S3-specific fields from OutputDefaults and UpdateIndexOutputDefaults
- Resolved conflicts in graphrag/config/enums.py by renaming OutputType to StorageType and removing InputType as per upstream changes
- Resolved conflicts in graphrag/index/input/factory.py by accepting upstream simplification that uses passed storage parameter
- Removed graphrag/config/models/output_config.py as it was deleted upstream and replaced with generic StorageConfig
- Updated graphrag/config/models/graph_rag_config.py to use new enum names and fix S3 validation logic to work with new config structure
- Fixed graphrag/storage/factory.py to use StorageType instead of OutputType
- Added proper type annotations to dataclass fields in defaults.py to fix linting errors
- Maintained S3 support in InputConfig, CacheConfig, and ReportingConfig where it's still available
@knguyen1
Author

knguyen1 commented Jun 13, 2025

Resolved conflicts and rebased: 40b0aff
Moved s3 configs to StorageConfig class: e893636
Update documentation: 980371e

@natoverse @AlonsoGuevara

@knguyen1 knguyen1 closed this Jun 26, 2025
@knguyen1
Author

Closing due to inactivity.

@knguyen1 knguyen1 mentioned this pull request Jul 19, 2025
3 tasks