BemiDB

BemiDB is a Postgres read replica optimized for analytics. It consists of a single binary that seamlessly connects to a Postgres database, replicates the data in a compressed columnar format, and allows you to run complex queries using its Postgres-compatible analytical query engine.

Highlights

Performance: runs analytical queries up to 2000x faster than Postgres.
Single Binary: consists of a single binary that can be run on any machine.
Postgres Replication: automatically syncs data from Postgres databases.
Compressed Data: uses an open columnar format for tables with 4x compression.
Scalable Storage: storage is separated from compute and can natively work on S3.
Query Engine: embeds a query engine optimized for analytical workloads.
Postgres-Compatible: integrates with any services and tools in the Postgres ecosystem.
Open-Source: released under an OSI-approved license.

Use cases

Run complex analytical queries like it's your Postgres database. Without worrying about performance impact and indexing.
Simplify your data stack down to a single binary. No complex setup, no data movement, no CDC, no ETL, no DW.
Integrate with Postgres-compatible tools and services. Query and visualize data with BI tools, notebooks, and ORMs.
Have all data automatically synced into your data lakehouse. Using Iceberg tables with Parquet data on object storage.

Quickstart

Install BemiDB:

curl -sSL https://raw.githubusercontent.com/BemiHQ/BemiDB/refs/heads/main/scripts/install.sh | bash

Sync data from a Postgres database:

./bemidb --pg-database-url postgres://postgres:postgres@localhost:5432/dbname sync

Sync data periodically from a Postgres database:

./bemidb --pg-database-url postgres://postgres:postgres@localhost:5432/dbname --interval 1h sync

This will sync the data every hour.

Alternatively, you can set the interval using an environment variable. Add the following line to your .env file:

PG_SYNC_INTERVAL=1h

Start the BemiDB database:

./bemidb start

Run Postgres queries on top of the BemiDB database:

# List all tables
psql postgres://localhost:54321/bemidb -c "SELECT * FROM information_schema.tables"

# Query a table
psql postgres://localhost:54321/bemidb -c "SELECT COUNT(*) FROM [table_name]"

Configuration

Local disk storage

By default, BemiDB stores data on the local disk. Here is an example of running BemiDB with default settings and storing data in a local iceberg directory:

# Data is stored under $PWD/iceberg/*
bemidb \
  --port 54321 \
  --database bemidb \
  --storage-type LOCAL \
  --iceberg-path ./iceberg \
  --init-sql ./init.sql \
  --log-level INFO \
  start

S3 object storage

BemiDB natively supports S3 storage. You can specify the S3 settings using the following flags:

# Data is stored under s3://[AWS_S3_BUCKET]/iceberg/*
bemidb \
  --port 54321 \
  --database bemidb \
  --storage-type AWS_S3 \
  --iceberg-path iceberg \
  --aws-region [AWS_REGION] \
  --aws-s3-bucket [AWS_S3_BUCKET] \
  --aws-access-key-id [AWS_ACCESS_KEY_ID] \
  --aws-secret-access-key [AWS_SECRET_ACCESS_KEY] \
  start
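
Because the S3 flags take four separate values, it can help to confirm that all of them are available before starting the server. The following is a minimal sketch, assuming the values are kept in shell variables named after the placeholders above; the `check_s3_settings` helper is hypothetical and not part of BemiDB.

```shell
# Hypothetical pre-flight check: report every missing S3 setting and
# return a non-zero status if at least one of them is empty or unset.
check_s3_settings() {
  missing=0
  for name in AWS_REGION AWS_S3_BUCKET AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to run `check_s3_settings || exit 1` in a wrapper script right before the `bemidb ... start` command, passing the same variables to the corresponding flags.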

Here is the minimal IAM policy required for BemiDB to work with S3:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::[AWS_S3_BUCKET]",
                "arn:aws:s3:::[AWS_S3_BUCKET]/*"
            ]
        }
    ]
}
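
The bucket name appears twice in the policy: once as the bucket itself (which `s3:ListBucket` applies to) and once with `/*` for the objects inside it. As a sketch, the policy can be generated for a concrete bucket with a small shell function; `bemidb_iam_policy` and the bucket name `my-bemidb-bucket` are illustrative names, not something BemiDB provides.

```shell
# Hypothetical helper: print the policy above with the bucket name filled in.
bemidb_iam_policy() {
  cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::${1}",
                "arn:aws:s3:::${1}/*"
            ]
        }
    ]
}
EOF
}

# Example bucket name (an assumption for illustration only).
bemidb_iam_policy my-bemidb-bucket
```

The output can be redirected to a file and attached to the IAM user or role whose credentials are passed to BemiDB.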

Architecture

BemiDB consists of the following main components:

Database Server: implements the Postgres protocol to enable Postgres compatibility.
Query Engine: embeds the DuckDB query engine to run analytical queries.
Storage Layer: uses the Iceberg table format to store data in columnar compressed Parquet files.
Postgres Connector: connects to a Postgres database to sync tables' schema and data.

Benchmark

BemiDB is optimized for analytical workloads and can run complex queries up to 2000x faster than Postgres.

On the TPC-H benchmark with 22 sequential queries, BemiDB outperforms Postgres by a significant margin:

Scale factor: 0.1
- BemiDB unindexed: 2.3s 👍
- Postgres unindexed: 1h23m13s 👎 (2,170x slower)
- Postgres indexed: 1.5s 👍 (99.97% bottleneck reduction)
Scale factor: 1.0
- BemiDB unindexed: 25.6s 👍
- Postgres unindexed: ∞ 👎 (infinitely slower)
- Postgres indexed: 1h34m40s 👎 (220x slower)

See the benchmark directory for more details.
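
As a quick sanity check of the slowdown factors quoted above, the durations can be converted to seconds and divided. This is only an arithmetic sketch; the `to_seconds` and `slowdown` helpers are made up for this illustration and are not part of the benchmark scripts.

```shell
# Convert a duration such as "1h23m13s" or "25.6s" into seconds.
to_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0
    s = $0
    # Consume one "<number><unit>" chunk at a time from the left.
    while (match(s, /^[0-9.]+[hms]/)) {
      value = substr(s, 1, RLENGTH - 1) + 0
      unit = substr(s, RLENGTH, 1)
      if (unit == "h") total += value * 3600
      else if (unit == "m") total += value * 60
      else total += value
      s = substr(s, RLENGTH + 1)
    }
    print total
  }'
}

# Print how many times slower the first duration is than the second,
# rounded to a whole number.
slowdown() {
  awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
    'BEGIN { printf "%.0f\n", slow / fast }'
}

slowdown 1h23m13s 2.3s   # scale factor 0.1, Postgres unindexed: prints 2171
slowdown 1h34m40s 25.6s  # scale factor 1.0, Postgres indexed: prints 222
```

The results, 2171 and 222, match the rounded 2,170x and 220x figures in the list.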

Data type mapping

Primitive data types are mapped as follows:

PostgreSQL	Parquet	Iceberg
`char`	`BYTE_ARRAY` (`UTF8`)	`string`
`varchar`	`BYTE_ARRAY` (`UTF8`)	`string`
`text`	`BYTE_ARRAY` (`UTF8`)	`string`
`bpchar`	`BYTE_ARRAY` (`UTF8`)	`string`
`int2`	`INT32`	`int`
`int4`	`INT32`	`int`
`int8`	`INT64`	`long`
`float4`	`FLOAT`	`float`
`float8`	`FLOAT`	`float`
`numeric`	`FIXED_LEN_BYTE_ARRAY` (`DECIMAL`)	`decimal(P, S)`
`bool`	`BOOLEAN`	`boolean`
`date`	`INT32` (`DATE`)	`date`
`time`	`INT64` (`TIME_MICROS` / `TIME_MILLIS`)	`time`
`timetz`	`INT64` (`TIME_MICROS` / `TIME_MILLIS`)	`time`
`timestamp`	`INT64` (`TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS`)	`timestamp` / `timestamp_ns`
`timestamptz`	`INT64` (`TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS`)	`timestamptz` / `timestamptz_ns`
`uuid`	`FIXED_LEN_BYTE_ARRAY`	`uuid`
`bytea`	`BYTE_ARRAY` (`UTF8`)	`binary`
`interval`	`BYTE_ARRAY` (`INTERVAL`)	`string`
`json`	`BYTE_ARRAY` (`UTF8`)	`string`
`jsonb`	`BYTE_ARRAY` (`UTF8`)	`string`
`tsvector`	`BYTE_ARRAY` (`UTF8`)	`string`
`_*` (array)	`REPEATED` `*`	`list`
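
To make the table easier to apply when inspecting a schema, the Postgres-to-Iceberg column can be written as a lookup. The sketch below only restates the table; `iceberg_type` is a hypothetical helper, and for `timestamp` and `timestamptz` it returns just the first of the two variants listed.

```shell
# Hypothetical lookup: Postgres type name -> Iceberg type from the table above.
iceberg_type() {
  case "$1" in
    _*) echo "list" ;;  # array types are prefixed with "_"
    char|varchar|text|bpchar) echo "string" ;;
    int2|int4) echo "int" ;;
    int8) echo "long" ;;
    float4|float8) echo "float" ;;
    numeric) echo "decimal(P, S)" ;;
    bool) echo "boolean" ;;
    date) echo "date" ;;
    time|timetz) echo "time" ;;
    timestamp) echo "timestamp" ;;
    timestamptz) echo "timestamptz" ;;
    uuid) echo "uuid" ;;
    bytea) echo "binary" ;;
    interval|json|jsonb|tsvector) echo "string" ;;
    *) echo "unmapped type: $1" >&2; return 1 ;;
  esac
}

iceberg_type int8    # prints long
iceberg_type jsonb   # prints string
iceberg_type _int4   # prints list
```

Types that are not in the table make the function return a non-zero status instead of guessing a mapping.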

Future roadmap

Native support for complex data structures like JSON and arrays.
Incremental data synchronization into Iceberg tables.
Direct Postgres-compatible write operations.
Real-time replication from Postgres using CDC.
TLS and authentication support for Postgres connections.
Iceberg table compaction and partitioning.
Cache layer for frequently accessed data.
Support for materialized views.

Alternatives

BemiDB vs PostgreSQL

PostgreSQL pros:

It is the most loved general-purpose transactional (OLTP) database 💛
Capable of running analytical queries at small scale

PostgreSQL cons:

Slow for analytical (OLAP) queries on medium and large datasets
Requires creating indexes for specific analytical queries, which impacts the "write" performance for transactional queries
Materialized views as a "cache" require manual maintenance and become increasingly slow to refresh as the data grows
Further tuning may not be possible when running varied ad-hoc analytical queries

BemiDB vs PostgreSQL extensions

PostgreSQL extensions pros:

There is a wide range of extensions available in the PostgreSQL ecosystem
Open-source community driven

PostgreSQL extensions cons:

Analytical queries add performance overhead that affects transactional queries
Limited support for installable extensions in managed PostgreSQL services (for example, AWS Aurora allowlist)
Increased PostgreSQL maintenance complexity when upgrading versions
Require manual data syncing and schema mapping if data is stored in a different format

Main types of extensions for analytics:

Foreign data wrapper extensions (parquet_fdw, parquet_s3_fdw, etc.)
- Pros: allow querying external data sources like columnar Parquet files directly from PostgreSQL
- Cons: rely on query engines that are not optimized for analytics
OLAP query engine extensions (pg_duckdb, pg_analytics, etc.)
- Pros: integrate an analytical query engine directly into PostgreSQL
- Cons: cumbersome to use (creating foreign tables, calling custom functions), data layer is neither integrated nor optimized

BemiDB vs DuckDB

DuckDB pros:

Designed for OLAP use cases
Easy to run with a single binary

DuckDB cons:

Limited support in the data ecosystem (notebooks, BI tools, etc.)
Requires manual data syncing and schema mapping for best performance
Limited features compared to a full-fledged database: no support for writing into Iceberg tables, reading from Iceberg according to the spec, etc.

BemiDB vs real-time OLAP databases (ClickHouse, Druid, etc.)

Real-time OLAP databases pros:

High performance, optimized for real-time analytics

Real-time OLAP databases cons:

Require expertise to set up and manage distributed systems
Limitations on data mutability
Steeper learning curve
Require manual data syncing and schema mapping

BemiDB vs big data query engines (Spark, Trino, etc.)

Big data query engines pros:

Distributed SQL query engines for big data analytics

Big data query engines cons:

Complex to set up and manage a distributed query engine (ZooKeeper, JVM, etc.)
Don't have a storage layer themselves
Require manual data syncing and schema mapping

BemiDB vs proprietary solutions (Snowflake, Redshift, BigQuery, Databricks, etc.)

Proprietary solutions pros:

Fully managed cloud data warehouses and lakehouses optimized for OLAP

Proprietary solutions cons:

Can be expensive compared to other alternatives
Vendor lock-in and limited control over the data
Require separate systems for data syncing and schema mapping

Development

We develop BemiDB using Devbox to ensure a consistent development environment without relying on Docker.

To start developing BemiDB and run tests, follow these steps:

cp .env.sample .env
make install
make test

To run BemiDB locally, use the following command:

make up

To sync data from a Postgres database, use the following command:

make sync

License

Distributed under the terms of the AGPL-3.0 License. If you need to modify and distribute the code, please release it to contribute back to the open-source community.
