
Welcome to OLake

OLake

The fastest open-source tool for replicating databases to Apache Iceberg or a data lakehouse. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics. Visit olake.io for the full documentation and benchmarks.

Introduction to OLake

OLake is a blazing-fast open-source tool that replicates data from diverse sources into Apache Iceberg and Parquet, delivering real-time lakehouse analytics without the pain of ETL scripts or vendor lock-in.

GitHub Repository: https://github.com/datazip-inc/olake

What is OLake?

OLake is an open-source ELT framework written entirely in Go (Golang) for memory efficiency and high performance. It replicates data from sources such as PostgreSQL, MySQL, MongoDB, Oracle, and Kafka (WIP) directly into open lakehouse formats such as Apache Iceberg and Parquet. Using Incremental Sync and Change Data Capture (CDC), OLake keeps data continuously in sync while minimizing infrastructure overhead, offering a simple, reliable, and scalable path to building a modern lakehouse.

This allows organizations to:

  • Replicate data at scale
  • Power near real-time analytics
  • Transform data lakes into fully functional lakehouses without the overhead of complex ETL tools

Why OLake?

  • Fastest Path to a Lakehouse → Achieve high throughput, even on massive datasets, with parallelized chunking, resumable historical snapshots, and fast incremental updates, all with exactly-once delivery.

  • Efficient Data Capture → Capture data efficiently with a full snapshot of your tables or collections, then keep them in sync through near real-time CDC using native database logs (WAL, binlogs, oplogs).

  • Schema-Aware Replication → Automatically detect schema changes to keep your pipelines consistent and reliable.

  • Open by Design → Store data in open formats like Parquet and Iceberg, enabling engine-agnostic analytics and eliminating vendor lock-in.
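To make the schema-aware replication idea concrete, here is a minimal Go sketch of how a replication tool might compare two versions of a table schema and report what changed. It is an illustration only, not OLake's actual implementation: the `Schema` and `SchemaChange` types and the `DiffSchemas` function are hypothetical names, and column types are represented as plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// Schema maps a column name to its type name.
// Hypothetical type used only for this illustration.
type Schema map[string]string

// SchemaChange describes one difference between two schema versions.
type SchemaChange struct {
	Column  string
	Kind    string // "added", "removed", or "type_changed"
	OldType string
	NewType string
}

// DiffSchemas returns the changes needed to go from prev to next,
// sorted by column name so the result is deterministic.
func DiffSchemas(prev, next Schema) []SchemaChange {
	var changes []SchemaChange
	for col, newType := range next {
		oldType, ok := prev[col]
		switch {
		case !ok:
			changes = append(changes, SchemaChange{Column: col, Kind: "added", NewType: newType})
		case oldType != newType:
			changes = append(changes, SchemaChange{Column: col, Kind: "type_changed", OldType: oldType, NewType: newType})
		}
	}
	for col, oldType := range prev {
		if _, ok := next[col]; !ok {
			changes = append(changes, SchemaChange{Column: col, Kind: "removed", OldType: oldType})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Column < changes[j].Column })
	return changes
}

func main() {
	before := Schema{"id": "bigint", "email": "text", "age": "int"}
	after := Schema{"id": "bigint", "email": "varchar", "signup_ts": "timestamp"}
	for _, c := range DiffSchemas(before, after) {
		fmt.Printf("%-10s %-13s %q -> %q\n", c.Column, c.Kind, c.OldType, c.NewType)
	}
}
```

A pipeline could run a comparison like this before each sync and decide, per change kind, whether to evolve the destination table or stop and alert.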


OLake Features Overview

Source-Level Features

Supported Connectors

  • PostgreSQL → Supports Full Refresh, Incremental Sync, WAL-based Full Refresh + CDC, and Strict CDC (RDS, Aurora, Supabase, etc.)
  • MySQL → Supports Full Refresh, Incremental Sync, Binlog-based Full Refresh + CDC, and Strict CDC (MySQL RDS, Aurora, older community versions)
  • MongoDB → Supports Full Refresh, Incremental Sync, Oplog-based Full Refresh + CDC, and Strict CDC (sharded or replica-set clusters)
  • Oracle → Supports Full Refresh and Incremental Sync
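The difference between Full Refresh and Incremental Sync can be sketched in a few lines of Go. This is a simplified illustration under stated assumptions, not OLake's code: the table is an in-memory slice, the cursor is a hypothetical `UpdatedAt` column, and `Row`, `State`, `FullRefresh`, and `IncrementalSync` are names invented for this example.

```go
package main

import "fmt"

// Row is a hypothetical source record. UpdatedAt plays the role of the
// cursor column (for example, a Unix timestamp that grows on every update).
type Row struct {
	ID        int
	UpdatedAt int64
}

// State remembers the highest cursor value seen by the previous sync.
type State struct {
	Cursor int64
}

// FullRefresh returns a copy of every row and ignores any saved state.
func FullRefresh(table []Row) []Row {
	return append([]Row(nil), table...)
}

// IncrementalSync returns only the rows whose cursor is strictly greater
// than the saved cursor, then advances the saved cursor to the highest
// value it returned.
func IncrementalSync(table []Row, st *State) []Row {
	var out []Row
	maxCursor := st.Cursor
	for _, r := range table {
		if r.UpdatedAt > st.Cursor {
			out = append(out, r)
			if r.UpdatedAt > maxCursor {
				maxCursor = r.UpdatedAt
			}
		}
	}
	st.Cursor = maxCursor
	return out
}

func main() {
	table := []Row{{ID: 1, UpdatedAt: 100}, {ID: 2, UpdatedAt: 200}}
	st := &State{}

	fmt.Println("first sync: ", IncrementalSync(table, st))

	// A new row arrives between syncs.
	table = append(table, Row{ID: 3, UpdatedAt: 300})
	fmt.Println("second sync:", IncrementalSync(table, st))
	fmt.Println("full refresh:", FullRefresh(table))
}
```

A cursor-based approach like this one only sees rows that still exist and whose cursor value moved forward, so it cannot observe deletes. Log-based CDC reads changes from the database log (WAL, binlog, oplog) instead of comparing a cursor column.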

Optimized Chunking Strategies

  • PostgreSQL → CTID ranges, batch-size splits, next-query paging
  • MySQL → Range splits with LIMIT/OFFSET
  • MongoDB → Split-Vector, Bucket-Auto, Timestamp
  • Oracle → DBMS Parallel Execute
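As a rough illustration of what chunking buys you, the Go sketch below splits a numeric key range into fixed-size chunks, processes them with a small worker pool, and records finished chunks so that a later run can resume. It is a generic range-split example, not the CTID, Split-Vector, or DBMS Parallel Execute strategies listed above, and it is not taken from OLake's source. `Chunk`, `SplitRange`, `ProcessChunks`, and `errTransient` are hypothetical names, and the `done` map stands in for whatever persistent state a real tool would keep.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTransient simulates a recoverable failure while copying a chunk.
var errTransient = errors.New("transient failure")

// Chunk is a half-open key range [Start, End).
type Chunk struct {
	Start, End int64
}

// SplitRange cuts the inclusive key range [lo, hi] into chunks that each
// cover at most size keys. Overflow near the int64 limits is not handled.
func SplitRange(lo, hi, size int64) []Chunk {
	if size <= 0 || hi < lo {
		return nil
	}
	var chunks []Chunk
	for start := lo; start <= hi; start += size {
		end := start + size
		if end > hi+1 {
			end = hi + 1
		}
		chunks = append(chunks, Chunk{Start: start, End: end})
	}
	return chunks
}

// ProcessChunks runs fn over every chunk not already marked in done, using
// the given number of workers. Successful chunks are marked in done, so a
// later call with the same map skips them and retries only the rest.
func ProcessChunks(chunks []Chunk, done map[Chunk]bool, workers int, fn func(Chunk) error) {
	if workers < 1 {
		workers = 1
	}
	var mu sync.Mutex // guards done
	var wg sync.WaitGroup
	jobs := make(chan Chunk)

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range jobs {
				if err := fn(c); err != nil {
					continue // leave the chunk unmarked so a retry picks it up
				}
				mu.Lock()
				done[c] = true
				mu.Unlock()
			}
		}()
	}

	for _, c := range chunks {
		mu.Lock()
		skip := done[c]
		mu.Unlock()
		if !skip {
			jobs <- c
		}
	}
	close(jobs)
	wg.Wait()
}

func main() {
	chunks := SplitRange(1, 10, 4)
	fmt.Println("chunks:", chunks)

	done := make(map[Chunk]bool)

	// First attempt: the chunk starting at key 5 fails.
	ProcessChunks(chunks, done, 2, func(c Chunk) error {
		if c.Start == 5 {
			return errTransient
		}
		return nil
	})
	fmt.Println("done after first run:", len(done), "of", len(chunks))

	// Resume: only the failed chunk is processed again.
	ProcessChunks(chunks, done, 2, func(c Chunk) error { return nil })
	fmt.Println("done after resume:", len(done), "of", len(chunks))
}
```

Because chunks are independent, throughput scales with the number of workers until the source database or the network becomes the bottleneck, and a failure costs at most one chunk of rework.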

Destination-Level Features

Supported Connectors

  • S3 Parquet Writer → MinIO, S3, GCS

  • Apache Iceberg:

    Catalog Integrations

    • AWS Glue
    • REST Catalog (Nessie, Polaris, Unity, LakeKeeper, S3 Tables)
    • Hive Metastore
    • JDBC Catalog

    To learn more, read OLake Catalog Integration.


Query Engine Compatibility

OLake outputs are immediately queryable in any Iceberg v2-compatible engine, including:

  • AWS Athena
  • Trino
  • Spark
  • Flink
  • Presto
  • Hive
  • Snowflake

To learn more, read OLake Query Engines Compatibility.

Know More About OLake

🚀 Curious how OLake performs? Check out our Benchmarks

🔍 Want to dive deeper into OLake? Read Features



💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!