Skip to main content

Welcome to OLake Documentation

olake


OLake

Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Starting with MongoDB. Visit olake.io/ for the full documentation, and benchmarks

Introduction to OLake

Welcome to OLake – the fastest open source DB-to-Data LakeHouse pipeline designed to bring your MongoDB data into modern analytics ecosystems like Apache Iceberg. OLake was born out of the need to eliminate the toil of one-off ETL scripts, combat performance bottlenecks, and avoid vendor lock-in with a clean, high-performing solution.

GitHub Repository: https://github.com/datazip-inc/olake

Overview

OLake’s primary goal is simple: to provide the fastest data pipeline from your database (starting with MongoDB) to a Data LakeHouse—in this case, Apache Iceberg. With OLake you can:

  • Capture data efficiently: Start with a full snapshot of your MongoDB collections, then transition seamlessly to near real-time Change Data Capture (CDC) using MongoDB’s change streams.
  • Achieve high throughput: Utilize parallelized chunking and integrated Writers to handle large volumes of data—ensuring rapid full loads and lightning-fast incremental updates.
  • Maintain schema integrity: Detect and adapt to evolving document structures with built-in alerts for any schema changes.
  • Embrace openness: Store your data in open formats (e.g., Parquet and Apache Iceberg) to keep your analytics engine agnostic and avoid vendor lock-in.

Architectural Overview

OLake is designed as a modular, high-performance system with distinct components that each excel at their core responsibilities. The following diagram (see architecture cover image) provides a visual overview of how data flows through OLake:

OLake Architecture

Data Flow in OLake

  1. Initial Snapshot:

    • Executes a full collection read from MongoDB by firing queries.
    • Divides the collection into parallel chunks for rapid processing.
  2. Change Data Capture (CDC):

    • Sets up MongoDB change streams based on oplogs to capture near real-time updates.
    • Ensures any changes that occur during the snapshot are also captured.
  3. Parallel Processing:

    • Users can configure the number of parallel threads thus balancing speed and the load on your MongoDB cluster.
  4. Transformation & Normalization:

    • Flattens complex, semi-structured fields into relational streams.
    • Provides basic (Level 0) flattening now, with more advanced nested JSON support on the way.
  5. Integrated Writes:

    • Pushes transformed data directly to target destinations (e.g., local Parquet, S3 Iceberg Parquet) without intermediary buffering.
    • This integration minimizes latency and avoids blocking reads.
  6. Monitoring & Alerts:

    • Continuously monitors schema changes and system metrics.
    • Raises alerts for any discrepancies or potential issues, ensuring early detection of data loss or transformation errors.
  7. Logs & Testing:

    • Provides detailed logging for transparency.
    • Supports unit, integration, and workflow testing, ensuring the reliability of both full loads and incremental syncs.

Core Components

OLake is built around several key modules:

  • CLI & Commands:
    Offers commands like spec, check, discover, and sync for seamless pipeline orchestration. Configurable flags (such as --batch_size and --thread_count) allow you to optimize performance.

  • Framework (CDK):
    The robust foundation that powers OLake’s orchestration and modular design.

  • Connectors (Drivers):
    Each driver encapsulates the logic required to interact with a specific source system. Starting with MongoDB, drivers manage:

    • Full Load: Efficiently partitioning and processing large collections.
    • CDC: Setting up and maintaining change streams to capture incremental changes.
  • Writers:
    Tightly integrated with drivers, Writers ensure that once data is extracted, it is immediately pushed to your chosen destination—whether that be local storage or cloud-based solutions.

  • Monitoring & Alerting:
    An integrated system to keep track of process status, performance metrics, and schema evolution.

  • SDK & Testing Setup:
    Provides an SDK for custom integrations and a comprehensive testing suite to ensure robust data synchronization.

OLake in a Lakehouse Ecosystem

By storing data in open formats like Parquet and Apache Iceberg, OLake offers:

  • Flexibility: Seamless integration with popular query engines such as Spark, Trino, Flink, and even Snowflake external tables.
  • Avoidance of Vendor Lock-In: Your data remains accessible and queryable regardless of the analytical tool you choose.
  • Real-Time Data Replication: Near real-time updates through MongoDB’s change streams keep your lakehouse data fresh.

Performance Highlights

OLake has been engineered with throughput in mind. In benchmark tests:

  • Full Load:
    Processed 230 million rows (~664.81 GB) in just 46 minutes.
  • Incremental Sync:
    Achieved approximately 35,694 records per second, ensuring minimal lag between source updates and target ingestion.

For more detailed benchmarks, please see our OLake MongoDB Benchmark.

Future Enhancements

OLake is continually evolving. Upcoming features include:

  • Enhanced Nested JSON Handling:
    Advanced flattening for deeper nested structures.
  • Simplified Deployment:
    A single self-contained binary for easier setup and maintenance.
  • Flexible Deployment Options:
    Support for Bring Your Own Cloud (BYOC), On-Prem, and multiple cloud platforms (GCP, Azure).
  • Enterprise-Grade Security & Consistency:
    Instant transactional consistency and robust security integrations.
  • Expanded Connector Support:
    Future connectors for Kafka, Postgres, MySQL, and DynamoDB.
  • Unified UI & Server Management:
    A centralized interface for managing all OLake features.

Get Started

Ready to revolutionize your data pipeline? Explore OLake today

For any questions or feedback, please contact us at hello@datazip.io or schedule a quick demo here.