Scalable Data Version Control

Manage your data as code using Git-like operations and achieve reproducible, high-quality data pipelines. Available as open source or in the cloud.

Take control of your data

Compute Engines

lakeFS supports all standard computation engines.

lakeFS

lakeFS uses metadata to manage data versions. Its versioning engine is highly scalable, with only a minor impact on storage performance.

Formats

lakeFS is format agnostic: it works with structured data, unstructured data, open table formats, or anything else.

Object Storage

lakeFS supports data in all object stores, including those of all major cloud providers (S3, Azure Blob, GCP) and on-prem options such as MinIO, Ceph, Dell EMC, and any other S3-compatible storage.

Use Cases

lakeFS helps data engineers and data scientists in every field manage their data like code, at scale.

Perform Local Checkouts on Data

Clone specific portions of your lakeFS data to your local environment, and keep the remote and local locations in sync.

Deduplicated Experimentation

Use lakeFS branches to run experiments in parallel with zero-copy clones in a fully deduplicated data lake, so you can compare them and select the best one.
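
To illustrate why branching can be zero-copy, here is a minimal sketch of a toy in-memory model. It is not lakeFS's implementation or API: the `ToyRepo` class and its methods are hypothetical names invented for this example, and it assumes that objects are stored once by content hash while a branch is only a mapping from paths to hashes.

```python
import hashlib


class ToyRepo:
    """Hypothetical toy model: objects are stored once, branches hold only metadata."""

    def __init__(self):
        self.objects = {}              # content hash -> bytes, stored once
        self.branches = {"main": {}}   # branch name -> {path: content hash}

    def put(self, branch, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(digest, data)   # identical content is deduplicated
        self.branches[branch][path] = digest

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source):
        # Copies only the path -> hash mapping; no object bytes are duplicated.
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.put("main", "train.csv", b"a,b\n1,2\n")

# Two parallel experiments start from the same data without copying it.
repo.create_branch("exp-1", "main")
repo.create_branch("exp-2", "main")

# Only exp-1 changes the file, so only one new object is stored.
repo.put("exp-1", "train.csv", b"a,b\n1,2\n3,4\n")
```

In this toy model, three branches share two stored objects, and `main` and `exp-2` still read the original file after `exp-1` has changed its own view.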

Reproducible Feature Engineering & Model Training

Commit the results of your experiments and use the lakeFS Git integration to reproduce any experiment with the right version of the data, the code and the model weights.

Isolated Dev/Test Environments

Create isolated dev/test environments using lakeFS branches and reduce your testing time by 80%. Clean data, handle outliers, fill in missing values, and so on, and ensure your pre-processing pipelines are robust and produce high-quality data.

Promote Only High-Quality Data to Production

Implement CI/CD for data with lakeFS hooks to automate quality validation checks.
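
As a rough illustration of what an automated quality gate might check, the sketch below validates candidate rows before they are promoted. It does not use or model the lakeFS hooks mechanism itself: the function names, the `required` columns, and the sample rows are all hypothetical, and a real check would read data from a branch rather than from in-memory lists.

```python
def find_quality_problems(rows, required=("id", "amount")):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            # Treat a missing key, None, or an empty string as a missing value.
            if row.get(col) in (None, ""):
                problems.append(f"row {i}: missing {col}")
    return problems


def gate_promotion(rows):
    """Allow promotion only when every check passes (hypothetical gate)."""
    return not find_quality_problems(rows)


good = [{"id": 1, "amount": 9.5}]
bad = [{"id": 2, "amount": None}]

print(gate_promotion(good))
print(find_quality_problems(bad))
```

The design choice here is to return the list of problems rather than a bare boolean, so that a failed check can report exactly which rows blocked the promotion.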

Fix Bad Data with Production Rollback

Save entire consistent snapshots of your data using commits, allowing you to roll back to previous commits in case of bad data.
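
The idea of rolling back to a consistent snapshot can be sketched with a toy in-memory model. This is an assumption-laden illustration, not how lakeFS stores commits or implements rollback: `ToyBranch` and its methods are hypothetical, and each commit here is simply a complete mapping from paths to values.

```python
class ToyBranch:
    """Hypothetical toy model: each commit is a full, immutable snapshot."""

    def __init__(self):
        self.commits = [{}]   # commit id is the index into this list
        self.head = 0         # commit the branch currently points at
        self.staged = {}      # uncommitted writes

    def write(self, path, value):
        self.staged[path] = value

    def commit(self):
        # A new snapshot is the current head plus the staged changes.
        snapshot = {**self.commits[self.head], **self.staged}
        self.commits.append(snapshot)
        self.head = len(self.commits) - 1
        self.staged = {}
        return self.head

    def rollback(self, commit_id):
        # Point the branch back at an earlier snapshot; history is kept.
        self.head = commit_id
        self.staged = {}

    def read(self, path):
        return self.commits[self.head][path]


branch = ToyBranch()
branch.write("users.parquet", "v1")
last_good = branch.commit()

# A bad pipeline run touches two files in one commit.
branch.write("users.parquet", "corrupt")
branch.write("orders.parquet", "corrupt")
branch.commit()

# Rolling back restores both files to the earlier state at once.
branch.rollback(last_good)
```

Because the snapshot covers every path, the rollback undoes the bad run as a whole instead of file by file, and the bad commit remains in `branch.commits` for later inspection.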

Data Collaboration

Provide your team with tools to easily collaborate and communicate on the data they use. Utilizing Git-like semantics, share a branch of a data repository or a commit ID to specify the data version being used or shared.

Data Auditing

Keep track of the data changes made, and by whom.
A full audit of all data-related actions, in all environments, allows you to trace back any result provided or experiment performed.

Storage Cost Reduction

Prevent your data lake from becoming a data swamp.
The use of a zero-copy branch allows data practitioners to get an isolated data lake for their use without creating actual copies that increase costs and pollute the data lake.

lakeFS is already helping thousands of organizations

Faster time to market

Increased data quality

Improved security and governance

Seamless integration with your entire data stack

Object Storage
Compute Engines
Ingest Technologies
Data Storage Formats
Orchestration & Workflow
Research and ML
Data Quality

All common ingest technologies integrate with lakeFS.

lakeFS is format agnostic! Regardless of the format you’re using, lakeFS will support it.

Manage orchestration and workflows better with the popular orchestration tools that lakeFS supports.

Data quality is essential to the health of your data lake. Ensure and maintain the highest data quality with lakeFS.
