Welcome to OLake

The fastest open-source tool for replicating databases to Apache Iceberg and data lakehouses. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics. Visit olake.io for full documentation and benchmarks.

Introduction to OLake

Welcome to OLake – the fastest open-source database-to-data-lakehouse pipeline, designed to bring your database data (Postgres, MySQL, MongoDB) into modern analytics ecosystems like Apache Iceberg. OLake was born out of the need to eliminate the toil of one-off ETL scripts, combat performance bottlenecks, and avoid vendor lock-in with a clean, high-performing solution.

GitHub Repository: https://github.com/datazip-inc/olake

Overview

OLake’s primary goal is simple: to provide the fastest data pipeline from your database to a data lakehouse—in this case, Apache Iceberg. With OLake you can:

  • Capture data efficiently: Start with a full snapshot of your database tables/collections, then transition seamlessly to near real-time Change Data Capture (CDC) using each database’s change-stream mechanism (WAL, binlogs, oplogs).
  • Achieve high throughput: Utilize parallelized chunking and integrated Destinations to handle large volumes of data—ensuring rapid full loads and lightning-fast incremental updates.
  • Maintain schema integrity: Detect and adapt to evolving document structures with built-in alerts for any schema changes.
  • Embrace openness: Store your data in open formats (e.g., Parquet and Apache Iceberg) to keep your analytics engine agnostic and avoid vendor lock-in.

1. OLake native features

  1. Open data formats – OLake writes raw Parquet and fully ACID snapshots in Apache Iceberg so your lakehouse stays engine-agnostic.
  2. Iceberg writer – The dedicated Iceberg Java Writer (migrating to Iceberg Go writer) produces exactly-once, rollback-ready Iceberg v2 tables.
  3. Smart partitioning – We support both Iceberg partitioning rules and AWS S3 path partitioning for Parquet, enabling fast scans and efficient querying.
  4. Parallelised chunking – OLake splits big tables into smaller chunks that are processed in parallel, slashing total sync time.
  5. Change Data Capture – We capture WAL for Postgres, binlogs for MySQL and oplogs for MongoDB in near real-time to keep the lake fresh without reloads.
  6. Schema evolution & datatype changes – Column adds, drops and type promotions are auto-detected and written per Iceberg v2 spec, so pipelines never break.
  7. Stateful, resumable syncs – If a job crashes (or is paused), OLake resumes from the last committed checkpoint—no manual fixes needed.
  8. Back-off & retries – OLake supports a configurable back-off retry count: if a sync fails, it is retried after a delay.
  9. Synchronization modes – OLake supports full, CDC (Change Data Capture) and strict CDC (tracks only new changes from the current position in the MongoDB change stream, without performing an initial backfill) synchronization modes. Incremental sync is a work in progress and will be released soon.
  10. Level-1 JSON flattening – Nested JSON fields are optionally expanded into top-level columns for easier SQL; see the sketch after this list.
  11. Airflow-first orchestration – Drop our Docker images into your DAGs (EC2 or Kubernetes) and drive syncs via Airflow; a sample DAG sketch follows this list.
  12. Developer playground – A one-click OLake + Iceberg sandbox lets you experiment locally with Trino, Spark, Flink or Snowflake readers.
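
To make Level-1 flattening concrete, here is a minimal, self-contained Python sketch of the idea; the record shape and the underscore column-naming convention are illustrative assumptions, not OLake's actual (Go) implementation.

```python
# Level-1 flattening: expand first-level nested objects into top-level
# columns; deeper nesting is left untouched at this level.
def flatten_level1(record: dict) -> dict:
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                # naming convention assumed for illustration: parent_child
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

doc = {"id": 1, "address": {"city": "Pune", "zip": "411001"}}
print(flatten_level1(doc))
# {'id': 1, 'address_city': 'Pune', 'address_zip': '411001'}
```

And for the Airflow-first orchestration item, a hedged sketch of a DAG that wraps a sync in a DockerOperator; the image tag, command arguments, and schedule are placeholders, so consult the connector docs for real invocations.

```python
# Sketch only: runs an OLake sync container on a schedule from Airflow.
# Requires apache-airflow-providers-docker (3.x shown); config/state volume
# mounts are omitted and the image/command values are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="olake_sync",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # Airflow 2.4+ keyword
    catchup=False,
) as dag:
    sync = DockerOperator(
        task_id="run_sync",
        image="olakego/source-postgres:latest",  # placeholder image tag
        command="sync",                          # placeholder arguments
        auto_remove="success",
    )
```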

2. Source-level features

  1. PostgreSQL – Full loads and WAL-based CDC for RDS, Aurora, Supabase, etc. See the Postgres connector.

  2. MySQL – Full loads plus binlog CDC for MySQL RDS, Aurora and older community versions. See the MySQL connector.

  3. MongoDB – High-throughput oplog capture for sharded or replica-set clusters. See the MongoDB connector.

  4. Optimised chunking strategies (a Postgres sketch follows this list)

    • MongoDB – Split-Vector, Bucket-Auto & Timestamp; details in the blog What Makes OLake Fast.
    • MySQL – Range splits driven by LIMIT/OFFSET next-query logic.
    • Postgres – CTID ranges, batch-size column splits or next-query paging.
  5. Work-in-progress connectors

  6. Datatype mapping – We map each source database datatype to a corresponding Iceberg datatype, so your source schema remains largely unaffected by the conversion from source database to Iceberg; an illustrative mapping follows this list.
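
To make the Postgres CTID strategy concrete, here is a small Python sketch that plans disjoint CTID ranges from a table's page count (pg_class.relpages); the chunk size and SQL shape are illustrative assumptions, not OLake's actual planner.

```python
# Plan disjoint CTID ranges for parallel full-table scans in Postgres.
# A ctid is (block_number, tuple_index); '(N,0)' addresses the start of
# block N, so consecutive ranges cover the table without overlap.
def ctid_chunks(total_pages: int, pages_per_chunk: int = 1000):
    for start in range(0, total_pages, pages_per_chunk):
        end = start + pages_per_chunk
        yield f"WHERE ctid >= '({start},0)'::tid AND ctid < '({end},0)'::tid"

for clause in ctid_chunks(total_pages=3500):
    print(f"SELECT * FROM my_table {clause}")  # one query per parallel chunk
```

And to make the datatype-mapping idea concrete, an illustrative lookup table for common Postgres types using Iceberg's primitive type names; this table is an assumption for illustration, not OLake's actual mapping.

```python
# Illustrative Postgres -> Iceberg primitive type mapping (assumed, not
# OLake's real table); Iceberg type names follow the Iceberg spec.
POSTGRES_TO_ICEBERG = {
    "smallint": "int",
    "integer": "int",
    "bigint": "long",
    "real": "float",
    "double precision": "double",
    "numeric": "decimal(38, 18)",  # assumed default precision/scale
    "text": "string",
    "boolean": "boolean",
    "date": "date",
    "timestamp": "timestamp",
    "timestamptz": "timestamptz",
    "jsonb": "string",             # serialized as JSON text
}
```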

3. Destination-level features

  1. Apache Iceberg – Primary target format; writes to AWS S3, Azure and GCS. See the Iceberg Writer docs.

  2. Catalog options:

  3. Plain Parquet – Write partitioned Parquet to S3 or Google Cloud Storage (GCS) with the Parquet Writer.

  4. Query engines – Any Iceberg v2-aware tool (Trino, Spark, Flink, Snowflake, etc.) can query OLake outputs immediately; a minimal Spark example follows this list.
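
As a quick illustration, here is a minimal PySpark session reading an OLake-written Iceberg table; the catalog name, warehouse path, and table identifier are assumptions, and the iceberg-spark-runtime package must be on Spark's classpath.

```python
# Query an Iceberg table from Spark; any Iceberg v2-aware engine works
# similarly. Catalog, warehouse, and table names below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("read-olake-iceberg")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql("SELECT COUNT(*) FROM lake.db.orders").show()
```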

4. Upcoming features

  1. Telemetry hooks – Segment IO & Mixpanel integration (PR #290).
  2. Universal MySQL CDC – MariaDB, Percona & TiDB support (PR #359).
  3. Incremental sync – Rolling out for MongoDB PR #268, then Postgres and MySQL.
  4. Roadmap tracker – See the live OLake roadmap.

Architectural Overview

OLake is designed as a modular, high-performance system with distinct components that each excel at their core responsibilities. The following diagram (see architecture cover image) provides a visual overview of how data flows through OLake:

Data Flow in OLake

  1. Initial Snapshot:

    • Executes a full read of database tables/collections by issuing queries.
    • Divides each table/collection into parallel chunks for rapid processing.
  2. Change Data Capture (CDC):

    • Sets up database change streams (WAL, binlogs, or oplogs) to capture near real-time updates.
    • Ensures any changes that occur during the snapshot are also captured.
  3. Parallel Processing:

    • Users can configure the number of parallel threads, balancing speed against the load on your database cluster; see the sketch after this list.
  4. Transformation & Normalization:

    • Flattens complex, semi-structured fields into relational streams.
    • Provides basic (Level 1) flattening now, with more advanced nested JSON support on the way.
  5. Integrated Writes:

    • Pushes transformed data directly to target destinations (e.g., local Parquet files, Iceberg tables on S3) without intermediary buffering.
    • This integration minimizes latency and avoids blocking reads.
  6. Monitoring & Alerts:

    • Continuously monitors schema changes and system metrics.
    • Raises alerts for any discrepancies or potential issues, ensuring early detection of data loss or transformation errors.
  7. Logs & Testing:

    • Provides detailed logging for transparency.
    • Supports unit, integration, and workflow testing, ensuring the reliability of both full loads and incremental syncs.
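
As a toy sketch of the parallel-processing knob from step 3, here is a Python thread pool whose size plays the role of OLake's thread count; the chunk-processing body is a placeholder, not OLake's actual worker logic.

```python
# More workers = faster syncs but more load on the source database.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk_id: int) -> int:
    # placeholder for "read one chunk from the source, write to the lake"
    return chunk_id

thread_count = 8  # analogous to a configurable thread-count setting
with ThreadPoolExecutor(max_workers=thread_count) as pool:
    results = list(pool.map(process_chunk, range(100)))
```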

Core Components

OLake is built around several key modules:

  • CLI & Commands:
    Offers commands like spec, check, discover, and sync for seamless pipeline orchestration. Configurable flags (such as --batch_size and --thread_count) allow you to optimize performance; a hypothetical invocation sketch follows this component list.

  • Framework (CDK):
    The robust foundation that powers OLake’s orchestration and modular design.

  • Connectors (Drivers):
    Each driver encapsulates the logic required to interact with a specific source system and manages:

    • Full Load: Efficiently partitioning and processing large collections.
    • CDC: Setting up and maintaining change streams to capture incremental changes.
    • Incremental Sync: Ensuring that only new or modified data is processed after the initial snapshot. (WIP, releasing soon)
    • Schema Discovery: Automatically identifying the schema of the source data.
    • Schema Evolution: Detecting and adapting to changes in the source schema.
    • Data Transformation: Flattening complex structures into a more manageable format. We currently support basic (Level 1) flattening, with plans for more advanced nested JSON handling in the future.
  • Destinations:

    • Tightly integrated with drivers, Destinations (e.g., Apache Iceberg) ensure that once data is extracted, it is immediately pushed to your chosen destination—whether that be local storage or cloud-based solutions.
    • AWS S3 partitioning is supported, allowing you to store data in a structured manner that aligns with your analytical needs.
    • Iceberg data partitioning is also supported, enabling efficient querying and data management.
  • Monitoring & Alerting:
    An integrated system to keep track of process status, performance metrics, and schema evolution.

  • SDK & Testing Setup:
    Provides an SDK for custom integrations and a comprehensive testing suite to ensure robust data synchronization.
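
As a hypothetical sketch of the CLI workflow described above, here is how the documented commands and flags might be driven from Python; the binary path and exact argument syntax are placeholders, so check the connector docs for real usage.

```python
# Hypothetical OLake CLI invocation: spec/check/discover/sync and the
# --batch_size / --thread_count flags come from the docs above; the binary
# path and any config-file arguments are placeholders.
import subprocess

BINARY = "./olake"  # placeholder path to a driver binary

for command in ("spec", "check", "discover"):
    subprocess.run([BINARY, command], check=True)

subprocess.run(
    [BINARY, "sync", "--batch_size", "10000", "--thread_count", "8"],
    check=True,
)
```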

OLake in a Lakehouse Ecosystem

By storing data in open formats like Parquet and Apache Iceberg, OLake offers:

  • Flexibility: Seamless integration with popular query engines such as Spark, Trino, Flink, and even Snowflake external tables.
  • Avoidance of Vendor Lock-In: Your data remains accessible and queryable regardless of the analytical tool you choose.
  • Real-Time Data Replication: Near real-time updates through each Database's change streams keep your lakehouse data fresh.

Performance Benchmarks*

  1. Postgres Connector to Apache Iceberg: (See Detailed Benchmark)

    1. Full load - Syncs at 46,262 RPS for 4 billion rows. (101x Airbyte, 11.6x Estuary, 3.1x Debezium (memiiso))
    2. CDC - Syncs at 36,982 RPS for 50 million changes. (63x Airbyte, 12x Estuary, 2.7x Debezium (memiiso), 1.4x Fivetran)
  2. MongoDB Connector to Apache Iceberg: (See Detailed Benchmark)

    1. Syncs 35,694 records/sec; 230 million rows in 46 minutes for a 664 GB dataset (20× Airbyte, 15× Embedded Debezium, 6× Fivetran)
  3. MySQL Connector to Apache Iceberg: (See Detailed Benchmark)

    1. Syncs 1,000,000 records/sec for a 10 GB dataset; ~209 minutes for 100+ GB.

*These are preliminary results; we'll publish fully reproducible benchmark scores soon.

Future Enhancements

OLake is continually evolving. Upcoming features include:

  • Enhanced Nested JSON Handling: Advanced flattening for deeper nested structures.
  • Simplified Deployment: A single self-contained binary for easier setup and maintenance.
  • Flexible Deployment Options: Support for Bring Your Own Cloud (BYOC), On-Prem, and multiple cloud platforms (GCP, Azure).
  • Enterprise-Grade Security & Consistency: Instant transactional consistency and robust security integrations.
  • Expanded Connector Support: Future connectors for Kafka, S3 and DynamoDB. Visit roadmap for more detailed info.
  • Unified UI & Server Management: A centralized interface for managing all OLake features; see the OLake UI GitHub repo.
  • Schema Evolution: Support for schema changes in the source database.

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: discuss future roadmaps, report bugs, get help debugging issues you're facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!