
OLake Features Overview

OLake is an open-source, high-performance data pipeline designed to extract data from databases (such as MongoDB, MySQL, and Postgres) and load it into modern data lakehouse formats like Apache Iceberg and Parquet.

Its modular architecture is built around fast, parallelized data ingestion, real-time change data capture (CDC), and native support for schema evolution and S3 data partitioning.

OLake minimizes vendor lock-in while offering extensive configurability—from full snapshot loads to incremental syncs and stateful resumption—making it an ideal solution for modern data engineering challenges.

1. OLake native features

  1. Open data formats – OLake writes raw Parquet and fully ACID snapshots in Apache Iceberg so your lakehouse stays engine-agnostic.
  2. Iceberg writer – The dedicated Iceberg Java Writer (migrating to Iceberg Go writer) produces exactly-once, rollback-ready Iceberg v2 tables.
  3. Smart partitioning – We support both Iceberg partitioning rules and AWS S3 partitioning for Parquet, enabling fast scans and efficient querying.
  4. Parallelised chunking – OLake splits big tables into smaller chunks that are processed in parallel, slashing total sync time.
  5. Change Data Capture – We capture WALs for Postgres, binlogs for MySQL and oplogs for MongoDB in near real-time to keep the lake fresh without reloads.
  6. Schema evolution & datatype changes – Column adds, drops and type promotions are auto-detected and written per Iceberg v2 spec, so pipelines never break.
  7. Stateful, resumable syncs – If a job crashes (or is paused), OLake resumes from the last committed checkpoint—no manual fixes needed.
  8. Back-off & retries – If a sync fails, OLake retries it after a backoff interval, up to a configurable retry count.
  9. Synchronization modes – OLake supports full, CDC (Change Data Capture) and strict CDC (tracking only new changes from the current position in the MongoDB change stream, without an initial backfill) synchronization modes. Incremental sync is a work in progress and will be released soon.
  10. Level-1 JSON flattening – Nested JSON fields are optionally expanded into top-level columns for easier SQL; see the illustration after this list.
  11. Airflow-first orchestration – Drop our Docker images into your DAGs (EC2 or Kubernetes) and drive syncs via Airflow.
  12. Developer playground – A one-click OLake + Iceberg sandbox lets you experiment locally with Trino, Spark, Flink or Snowflake readers.
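
To make level-1 flattening (item 10) concrete, here is a hypothetical example; the column naming shown is illustrative and the exact convention may differ. Given a nested source document:

{ "id": 1, "user": { "name": "Alice", "age": 30 } }

level-1 flattening expands the first level of nesting into top-level columns such as id, user_name and user_age, so downstream SQL can filter on user_name directly instead of parsing JSON.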

2. Source-level features

  1. PostgreSQL – Full loads and WAL-based CDC for RDS, Aurora, Supabase, etc. See the Postgres connector.

  2. MySQL – Full loads plus binlog CDC for MySQL RDS, Aurora and older community versions. See the MySQL connector.

  3. MongoDB – High-throughput oplog capture for sharded or replica-set clusters. See the MongoDB connector.

  4. Optimised chunking strategies

    • MongoDB – Split-Vector, Bucket-Auto & Timestamp strategies; details in the blog What Makes OLake Fast.
    • MySQL – Range splits driven by LIMIT/OFFSET next-query logic.
    • Postgres – CTID ranges, batch-size column splits or next-query paging.
  5. Work-in-progress connectors

  6. Datatype mapping – We map each source database datatype to a corresponding Iceberg datatype, so your source schema remains largely unaffected by the conversion from source database to Iceberg.

3. Destination-level features

  1. Apache Iceberg – Primary target format; writes to AWS S3, Azure and GCS. See the Iceberg Writer docs.

  2. Catalog options – see the Iceberg Writer docs for the supported catalog types.

  3. Plain Parquet – Write partitioned Parquet to S3 or Google Cloud Storage (GCS) with the Parquet Writer.

  4. Query engines – Any Iceberg v2-aware tool (Trino, Spark, Flink, Snowflake, etc.) can query OLake outputs immediately.

4. Upcoming features

  1. Telemetry hooks – Segment IO & Mixpanel integration (PR #290).
  2. Universal MySQL CDC – MariaDB, Percona & TiDB support (PR #359).
  3. Incremental sync – Rolling out for MongoDB PR #268, then Postgres and MySQL.
  4. Roadmap tracker – See the live OLake roadmap.

1. Connector Features

OLake’s connector includes several features that ensure efficient and reliable data replication:

Native BSON Extraction [for MongoDB]

  • What it does: Extracts data in its raw BSON form directly from MongoDB.
  • Benefit: Maintains data fidelity while boosting speed by decoding on the ETL side.

Parallel Chunking

  • What it does: Splits large collections into smaller virtual chunks.
  • Benefit: Enables parallel reads, significantly reducing the time required for full collection snapshots.
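
As a rough sketch of the idea (not OLake's actual implementation; the ID range, chunk size and readChunk function are illustrative assumptions), the following Go snippet splits a keyspace into fixed-size chunks and reads them concurrently:

package main

import (
	"fmt"
	"sync"
)

// readChunk stands in for a range query such as
// SELECT ... WHERE id >= lo AND id < hi.
func readChunk(lo, hi int64) {
	fmt.Printf("synced rows in [%d, %d)\n", lo, hi)
}

func main() {
	const minID, maxID, chunkSize int64 = 0, 1_000_000, 100_000
	var wg sync.WaitGroup
	// Each chunk is an independent unit of work, so chunks can be
	// read in parallel and retried individually.
	for lo := minID; lo < maxID; lo += chunkSize {
		hi := lo + chunkSize
		if hi > maxID {
			hi = maxID
		}
		wg.Add(1)
		go func(lo, hi int64) {
			defer wg.Done()
			readChunk(lo, hi)
		}(lo, hi)
	}
	wg.Wait()
}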

CDC Cursor Preservation

  • What it does: Keeps track of CDC (Change Data Capture) offsets throughout the data ingestion process.
  • Benefit: Supports the addition of new large collections without interrupting ongoing incremental syncs.

Open Format for Freedom

  • What it does: Uses open formats like Parquet and table formats like Apache Iceberg.
  • Benefit: Sidesteps vendor lock-in and allows multi-engine querying.

Resumable Full Load & Incremental Sync

  • Both full snapshots and incremental (CDC) syncs are designed to be resumable. If a sync is interrupted, OLake uses state files to resume progress seamlessly.

S3 Data Partitioning

  • What it does: Supports partitioning data when writing to S3.
  • Benefit: Improves query performance by organizing data into directories based on user-defined partitioning logic.

Schema Evolution

  • What it does: Detects changes in source document structures and handles non-breaking changes.
  • Benefit: Minimizes disruptions when the underlying schema evolves; separate Parquet files are created if schema evolution is detected (currently in S3).
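
For context, the Apache Iceberg spec defines which type promotions are non-breaking; common examples include:

  • int → long
  • float → double
  • decimal(P, S) → decimal(P', S) where P' > P (precision widening, same scale)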

Normalization vs. Denormalization

  • What it does: Offers an option to choose between normalization and level-1 denormalization (flattening).
  • Benefit: Gives flexibility in data transformation depending on downstream requirements.

Supported Data Types

OLake supports a variety of data types. The table below summarizes the supported types along with descriptions and sample values:

Data Type      | Description                                              | Sample Value
Null           | Represents a null or missing value                       | null
Int64          | 64-bit integer value                                     | 42
Float64        | 64-bit floating point number                             | 3.14159
String         | Text string                                              | "Hello, world!"
Bool           | Boolean value                                            | true
Object         | JSON object (key-value pairs)                            | { "name": "Alice", "age": 30 }
Array          | Array of values                                          | [1, 2, 3]
Unknown        | Unspecified or unrecognized type                         | "unknown"
Timestamp      | Timestamp with default precision                         | "2025-02-18T10:00:00Z"
TimestampMilli | Timestamp with millisecond precision (3 decimal places)  | "2025-02-18T10:00:00.123Z"
TimestampMicro | Timestamp with microsecond precision (6 decimal places)  | "2025-02-18T10:00:00.123456Z"
TimestampNano  | Timestamp with nanosecond precision (9 decimal places)   | "2025-02-18T10:00:00.123456789Z"

Synchronization Modes

The streams file defines how streams are synchronized:

  • supported_sync_modes: Lists the supported modes (e.g., full_refresh, cdc).
  • sync_mode: Indicates the active sync mode for a stream.
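
For illustration, a stream entry combining both fields might look like the following (modeled on the sample configuration shown later on this page; exact discover output may differ):

{
  "stream": {
    "name": "table1",
    "namespace": "namespace",
    "supported_sync_modes": ["full_refresh", "cdc"],
    "sync_mode": "cdc"
  }
}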

Backoff Retry Count

  • What it does: Implements an exponential backoff (1, 2, 4, 8, 16 minutes…) for retrying sync operations.
  • Default: Set to 3 (a value of -1 falls back to this default).
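
A minimal sketch of that schedule (illustrative only, not OLake's code): with a retry count of 3, a failing sync would wait 1, 2 and 4 minutes between attempts.

package main

import (
	"fmt"
	"time"
)

// backoff returns the wait before retry number attempt (0-based):
// 1, 2, 4, 8, 16 minutes and so on.
func backoff(attempt int) time.Duration {
	return time.Duration(1<<attempt) * time.Minute
}

func main() {
	const retryCount = 3 // the documented default
	for attempt := 0; attempt < retryCount; attempt++ {
		fmt.Printf("retry %d after %v\n", attempt+1, backoff(attempt))
	}
}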

Sync with State of Last Synced Cursor

  • What it does: Saves progress in a state file (e.g., state.json) so that an interrupted sync can resume from the last known checkpoint.

2. Data Flow in OLake

OLake’s data flow is designed for high throughput and resiliency:

Initial Snapshot

  • Process: Executes a full collection read by firing queries.
  • Parallelism: Divides the collection into chunks that are processed in parallel.
  • Insight: Provides an estimated remaining time along with real-time stats (e.g., memory usage, running threads, records per second) in a stats.json file that is created automatically in the same directory as your configs (source.json and destination.json).
{
  "Estimated Remaining Time": "340.85 s",
  "Memory": "92329 mb",
  "Running Threads": 10,
  "Seconds Elapsed": "148.00",
  "Speed": "5773.03 rps",
  "Synced Records": 854411
}

Change Data Capture (CDC)

  • Process: Sets up MongoDB change streams based on oplogs to capture near real-time updates.
  • Resilience: Ensures that any changes during the snapshot are also captured, and subsequent updates are continuously synced.

Parallel Processing

  • Configuration: Users can set the number of parallel threads to balance speed with the load on the MongoDB cluster.

Transformation & Normalization

  • What it does: Flattens complex, semi-structured fields into relational streams.
  • Current Support: Level-0 flattening is available, with more advanced nested JSON support coming soon.

Integrated Writes

  • What it does: Pushes transformed data directly to target destinations (e.g., local Parquet files, S3 Iceberg Parquet) without intermediary buffering.
  • Benefit: Minimizes latency and avoids blocking reads during writes.

Logs & Testing

  • Logging: Detailed logs via Zerolog and Lumberjack help monitor the process; a logs.txt file is created that captures all logs produced during the sync.

3. Architecture and Components

OLake is built with a modular design to separate responsibilities across different components:

Drivers (Connectors)

  • Function: Embed the logic necessary to interact with the source system (starting with MongoDB, with plans for Kafka, Postgres, MySQL, and DynamoDB).
  • Responsibilities:
    • Full Load: Efficiently fetch complete collections by processing them in parallel.
    • CDC Management: Configure change streams to capture and sync incremental changes.

Destinations

  • Function: Tightly integrated with drivers to directly push records to destinations.
  • Supported Destinations:
    • Local Parquet files
    • AWS S3
    • Apache Iceberg

Core Functionality

  • CLI & Commands:
    • spec returns a JSON Schema for frontend configuration.
    • check validates configuration, streams, state files, and credentials.
    • discover lists streams and their schemas.
    • sync orchestrates the ETL process.
  • State Management: Ensures that CDC streams can resume after interruptions.
  • Configuration & Logging: Manages concurrency levels, connection credentials, and provides real-time monitoring of progress and throughput.
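
By way of example, a typical flow runs discover and then sync against mounted config files (the image name appears elsewhere on this page, but the exact flags and paths shown here are illustrative; consult the connector docs for precise usage):

docker run -v "$PWD:/mnt/config" olakego/source-mongodb:latest discover --config /mnt/config/source.json
docker run -v "$PWD:/mnt/config" olakego/source-mongodb:latest sync --config /mnt/config/source.json --catalog /mnt/config/streams.json --destination /mnt/config/destination.json --state /mnt/config/state.json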

4. S3 Data Partitioning

OLake provides robust support for S3 data partitioning, enabling efficient data organization and faster query performance.

Overview

  • How It Works: After schema discovery, you can define a partition_regex in the streams.json file for each stream. This regex dynamically partitions data into folders in your S3 bucket.
  • Resulting Structure: The generated S3 path typically includes the database name, table/collection name, and partition values (e.g., date components or key-value pairs).

Sample Configuration

{
  "selected_streams": {
    "namespace": [
      {
        "stream_name": "table1",
        "partition_regex": "/{column_name, default_value, granularity}",
        "normalization": false,
        "append_only": false
      },
      {
        "stream_name": "table2",
        "partition_regex": "",
        "normalization": false,
        "append_only": false
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "table1",
        "namespace": "namespace",
        "sync_mode": "cdc"
      }
    },
    {
      "stream": {
        "name": "table2",
        "namespace": "namespace",
        "sync_mode": "cdc"
      }
    }
  ]
}

Partition Regex Attributes

  • Structure:
    "partition_regex": "/{column_name, default_value, granularity}"
  • Components:
    • column_name: (Required) Specifies which column’s value determines the partition.
    • default_value: (Optional) A fallback if the column is null or unparseable.
    • granularity: (Optional) For time-based columns, define granularity (e.g., HH, DD, MM, YYYY).

Supported Patterns

  • now(): Uses the current local timestamp.
  • default: Uses a constant fallback value.
  • {col_name}: Inserts the specified column value.
  • Combined patterns:
    • {now(),2025,YYYY} yields the year (e.g., 2025).
    • {now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD} combines year, month, and day.

Examples

  • Date-based Partition:

    "partition_regex": "/{title,,}/{now(),2025,DD}/{latest_version,,}"
  • Key-Value Style Partitioning:

    "partition_regex": "/age={age_column, default_age,}/city={city_column, default_city,}/type={type_column, default_type,}"

    For a record:

    { "age": "30", "city": "mumbai", "type": "existing" }

    The resulting S3 path would be:
    s3://your_bucket/age=30/city=mumbai/type=existing/<filename>.parquet

  • Composite Partitions Including Date:

    "partition_regex": "/dt={now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/country={country,default_country,}/city={city,default_city,}"

    This produces a path such as:
    s3://your_bucket/dt=2025-02-18/country=USA/city=NewYork/<filename>.parquet

Supported Timestamp Formats

OLake supports a wide range of timestamp formats, including:

  • "2006-01-02"
  • "2006-01-02 15:04:05"
  • "2006-01-02T15:04:05"
  • "2020-08-17T05:50:22.895Z"
  • and several others

For custom timestamp columns, replace now() with the column name and specify the desired format extraction.
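
For example, assuming a hypothetical created_at timestamp column, partitioning by its year would look like:

    "partition_regex": "/year={created_at,2025,YYYY}"

Here 2025 is the fallback used when created_at is null or unparseable.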

Test Image Releaser

  • What it does: Adds support for building test images via a “test_mode” option in the GitHub workflow.
  • Benefits and changes:
    • Test Mode Active: Tags are prefixed with testing- (e.g., testing-dev-123).
    • Validation: Bypasses semantic version validation for test images, while production images must follow strict versioning.
    • Output Examples:
      Test mode: true
      ✅ Test mode active - skipping semantic version validation for version: dev-123
      **** Releasing olakego/source-mongodb for platforms [linux/amd64,linux/arm64] with version [testing-dev-123] ****
      Release successful for olakego/source-mongodb version testing-dev-123

5. Concluding Remarks and Future Roadmap

OLake is designed to be fast, flexible, and future-proof:

  • Future Enhancements:
    • Advanced nested JSON flattening (beyond level-0).
    • Effortless deployment with a single self-contained binary.
    • BYOC (Bring Your Own Cloud) support for both cloud and on-prem deployments.
    • Enterprise-grade security with no data leaving your infrastructure.
    • Kafka integration as a destination and expanded connector support.
    • A unified UI to manage drivers, writers, and all configuration details.
  • Performance Focus:
    Extensive benchmarks show OLake completing massive full loads (e.g., 230 million rows in 46 minutes) and incremental syncs at record speeds (~35,694 records/second).
  • Ecosystem Integration:
    By leveraging open formats and avoiding vendor lock-in, OLake seamlessly integrates with Spark, Trino, Flink, Snowflake, and other analytical engines.

OLake’s modular, open-source architecture is a modern solution to the challenges of high-volume, real-time data ingestion, transformation, and loading. It empowers data engineers to build robust, scalable, and efficient data pipelines with minimal operational overhead.

For more detailed information, examples, and the latest updates, please refer to our GitHub repository and the corresponding documentation sections.


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: discuss future roadmaps, report bugs, get help debugging issues you are facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!