Welcome to OLake
The fastest open-source tool for replicating databases to Apache Iceberg or a data lakehouse. ⚡ Efficient, fast, and scalable data ingestion for real-time analytics. Visit olake.io for the full documentation and benchmarks.
Introduction to OLake
OLake is a blazing-fast open-source tool that replicates data from diverse sources into Apache Iceberg and Parquet, delivering real-time lakehouse analytics without the pain of ETL scripts or vendor lock-in.
GitHub Repository: https://github.com/datazip-inc/olake
What is OLake?
OLake is an open-source ELT framework written entirely in Golang for memory efficiency and high performance. It replicates data from sources such as PostgreSQL, MySQL, MongoDB, Oracle, and Kafka (WIP) directly into open lakehouse formats such as Apache Iceberg and Parquet. Using Incremental Sync and Change Data Capture (CDC), OLake keeps data continuously in sync while minimizing infrastructure overhead, offering a simple, reliable, and scalable path to building a modern lakehouse.
This allows organizations to:
- Replicate data at scale
- Power near real-time analytics
- Transform data lakes into fully functional lakehouses without the overhead of complex ETL tools
Why OLake?
- Fastest Path to a Lakehouse → Achieve high throughput with parallelized chunking, resumable historical snapshots, and fast incremental updates, with exactly-once delivery even on massive datasets.
- Efficient Data Capture → Take a full snapshot of your tables or collections, then keep them in sync through near real-time CDC using native database logs (WAL, binlogs, oplogs).
- Schema-Aware Replication → Automatically detect schema changes to keep your pipelines consistent and reliable.
- Open by Design → Store data in open formats like Parquet and Iceberg, enabling engine-agnostic analytics and eliminating vendor lock-in.
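The snapshot-then-CDC flow described above can be illustrated with a minimal Go sketch. This is not OLake's implementation: the `Event`, `Table`, and `Apply` names are hypothetical, and `Pos` stands in for whatever log position the source exposes (for example a WAL LSN or binlog offset). The idea it shows is that tracking the last applied position makes replaying a batch after a restart a no-op.

```go
package main

import "fmt"

// Op is the kind of change read from a database log (hypothetical names).
type Op string

const (
	Insert Op = "insert"
	Update Op = "update"
	Delete Op = "delete"
)

// Event is one change record. Pos is an assumed, monotonically
// increasing log position such as an LSN or binlog offset.
type Event struct {
	Pos int64
	Op  Op
	Key string
	Row map[string]any
}

// Table holds the replicated rows plus the last applied log position.
type Table struct {
	Rows    map[string]map[string]any
	LastPos int64
}

// Apply applies events in order and skips any event at or before LastPos,
// so re-delivering a batch does not change the result.
func (t *Table) Apply(events []Event) {
	for _, e := range events {
		if e.Pos <= t.LastPos {
			continue // already applied
		}
		switch e.Op {
		case Insert, Update:
			t.Rows[e.Key] = e.Row
		case Delete:
			delete(t.Rows, e.Key)
		}
		t.LastPos = e.Pos
	}
}

func main() {
	// State after an initial full snapshot taken at log position 100.
	tbl := &Table{
		Rows:    map[string]map[string]any{"1": {"name": "a"}},
		LastPos: 100,
	}
	events := []Event{
		{Pos: 101, Op: Update, Key: "1", Row: map[string]any{"name": "b"}},
		{Pos: 102, Op: Insert, Key: "2", Row: map[string]any{"name": "c"}},
		{Pos: 103, Op: Delete, Key: "1"},
	}
	tbl.Apply(events)
	tbl.Apply(events) // replaying the same batch changes nothing
	fmt.Println(len(tbl.Rows), tbl.LastPos) // prints: 1 103
}
```

A real pipeline would persist the position together with the written data so that both advance atomically. The sketch keeps everything in memory to stay short.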
OLake Features Overview
Source-Level Features
Supported Connectors
- PostgreSQL → Supports Full Refresh, Incremental Sync, and two WAL-based modes: Full Refresh + CDC and Strict CDC (RDS, Aurora, Supabase, etc.)
- MySQL → Supports Full Refresh, Incremental Sync, and two binlog-based modes: Full Refresh + CDC and Strict CDC (MySQL RDS, Aurora, older community versions)
- MongoDB → Supports Full Refresh, Incremental Sync, and two oplog-based modes: Full Refresh + CDC and Strict CDC (sharded or replica-set clusters)
- Oracle → Supports Full Refresh and Incremental Sync
Optimized Chunking Strategies
- PostgreSQL → CTID ranges, batch-size splits, next-query paging
- MySQL → Range splits with LIMIT/OFFSET
- MongoDB → Split-Vector, Bucket-Auto, Timestamp
- Oracle → DBMS Parallel Execute
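To make the idea of chunking concrete, here is a minimal Go sketch of a plain range split over an integer key. It is an illustration only, not one of the strategies listed above as OLake implements them: `Chunk` and `SplitRange` are hypothetical names, and the sketch assumes a numeric primary key whose minimum and maximum are already known.

```go
package main

import "fmt"

// Chunk is a half-open key range [Start, End) that one worker can read
// independently of the others.
type Chunk struct {
	Start, End int64
}

// SplitRange divides the inclusive key range [lo, hi] into chunks that each
// cover at most size keys. It returns nil for an empty range or a
// non-positive size. Overflow near the int64 limit is not handled.
func SplitRange(lo, hi, size int64) []Chunk {
	if size <= 0 || hi < lo {
		return nil
	}
	var chunks []Chunk
	for start := lo; start <= hi; start += size {
		end := start + size
		if end > hi+1 {
			end = hi + 1 // clamp the last chunk to the end of the range
		}
		chunks = append(chunks, Chunk{Start: start, End: end})
	}
	return chunks
}

func main() {
	// Each chunk maps to one bounded query that can run in parallel
	// and be retried on its own if it fails.
	for _, c := range SplitRange(1, 10, 4) {
		fmt.Printf("WHERE id >= %d AND id < %d\n", c.Start, c.End)
	}
}
```

Because each chunk is an independent, bounded read, a snapshot can record which chunks have finished and resume from the remaining ones after an interruption.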
Destination-Level Features
Supported Connectors
- S3 Parquet Writer → MinIO, S3, GCS
- Apache Iceberg:
Catalog Integrations
- AWS Glue
- REST Catalog (Nessie, Polaris, Unity, LakeKeeper, S3 Tables)
- Hive Metastore
- JDBC Catalog
To learn more, see OLake Catalog Integration.
Query Engine Compatibility
OLake outputs are immediately queryable in any Iceberg v2-compatible engine, including:
- AWS Athena
- Trino
- Spark
- Flink
- Presto
- Hive
- Snowflake
To learn more, see OLake Query Engines Compatibility.