OLake Features Overview
OLake is an open-source, high-performance data pipeline designed to extract data from databases (like MongoDB, MySQL, Postgres) and load it into modern data lakehouse formats like Apache Iceberg and Parquet.
Its modular architecture is built around fast, parallelized data ingestion, real-time change data capture (CDC), and native support for schema evolution and S3 data partitioning.
OLake minimizes vendor lock-in while offering extensive configurability—from full snapshot loads to incremental syncs and stateful resumption—making it an ideal solution for modern data engineering challenges.
1. OLake native features
- Open data formats – OLake writes raw Parquet and fully ACID snapshots in Apache Iceberg so your lakehouse stays engine-agnostic.
- Iceberg writer – The dedicated Iceberg Java Writer (migrating to Iceberg Go writer) produces exactly-once, rollback-ready Iceberg v2 tables.
- Smart partitioning – We support both Iceberg partitioning rules and AWS S3 partitioning for Parquet, enabling fast scans and efficient querying.
- Parallelised chunking – OLake splits big tables into smaller chunks and processes them in parallel, slashing total sync time.
- Change Data Capture – We capture WALs for Postgres, binlogs for MySQL and oplogs for MongoDB in near real-time to keep the lake fresh without reloads.
- Schema evolution & datatype changes – Column adds, drops and type promotions are auto-detected and written per Iceberg v2 spec, so pipelines never break.
- Stateful, resumable syncs – If a job crashes (or is paused), OLake resumes from the last committed checkpoint—no manual fixes needed.
- Back-off & retries – OLake supports backoff retry count, meaning, if a sync fails, it will retry the sync after a certain period of time.
- Synchronization modes – OLake supports full, CDC (Change Data Capture) and strict CDC (which tracks only new changes from the current position in the MongoDB change stream, without performing an initial backfill) synchronization modes. Incremental sync is a work in progress and will be released soon.
- Level-1 JSON flattening – Nested JSON fields are optionally expanded into top-level columns for easier SQL.
- Airflow-first orchestration – Drop our Docker images into your DAGs (EC2 or Kubernetes) and drive syncs via Airflow.
- Developer playground – A one-click OLake + Iceberg sandbox lets you experiment locally with Trino, Spark, Flink or Snowflake readers.
2. Source-level features
- PostgreSQL – Full loads and WAL-based CDC for RDS, Aurora, Supabase, etc. See the Postgres connector.
- MySQL – Full loads plus binlog CDC for MySQL RDS, Aurora and older community versions. See the MySQL connector.
- MongoDB – High-throughput oplog capture for sharded or replica-set clusters. See the MongoDB connector.
- Optimised chunking strategies:
  - MongoDB – `Split-Vector`, `Bucket-Auto` & `Timestamp`; details in the blog What Makes OLake Fast.
  - MySQL – Range splits driven by `LIMIT/OFFSET` next-query logic.
  - Postgres – `CTID` ranges, `batch-size` column splits or `next-query` paging.
- Work-in-progress connectors:
  - S3 source – GitHub issue #86
  - Kafka – PR #339
- We map each database datatype to a corresponding Iceberg datatype, so your source schema remains largely unaffected by the type conversion from the source database to Iceberg.
3. Destination-level features
- Apache Iceberg – Primary target format; can write to AWS S3, Azure, GCS. See the Iceberg Writer docs.
- Catalog options:
  - AWS Glue – Glue catalog guide
  - REST catalog – REST guide with sub-sections for Gravitino, Nessie, Polaris, Unity and LakeKeeper
  - Hive Metastore – Hive catalog guide
  - JDBC – JDBC catalog guide
  - S3 Table Bucket (`SigV4`) – S3 Tables Guide
  - Unity Catalog (Databricks) – Unity Catalog Guide
- Plain Parquet – Write partitioned Parquet to S3 or Google Cloud Storage (GCS) with the Parquet Writer.
- Query engines – Any Iceberg v2-aware tool (Trino, Spark, Flink, Snowflake, etc.) can query OLake outputs immediately. See here for more.
4. Upcoming features
- Telemetry hooks – Segment IO & Mixpanel integration (PR #290).
- Universal MySQL CDC – MariaDB, Percona & TiDB support (PR #359).
- Incremental sync – Rolling out for MongoDB (PR #268), then Postgres and MySQL.
- Roadmap tracker – See the live OLake roadmap.
1. Connector Features
OLake’s connector includes several features that ensure efficient and reliable data replication:
Native BSON Extraction [for MongoDB]
- What it does: Extracts data in its raw BSON form directly from MongoDB.
- Benefit: Maintains data fidelity while boosting speed by decoding on the ETL side.
Parallel Chunking
- What it does: Splits large collections into smaller virtual chunks.
- Benefit: Enables parallel reads, significantly reducing the time required for full collection snapshots.
CDC Cursor Preservation
- What it does: Keeps track of CDC (Change Data Capture) offsets throughout the data ingestion process.
- Benefit: Supports the addition of new large collections without interrupting ongoing incremental syncs.
Open Format for Freedom
- What it does: Uses open formats like Parquet and table formats like Apache Iceberg.
- Benefit: Sidesteps vendor lock-in and allows multi-engine querying.
Resumable Full Load & Incremental Sync
- Both full snapshots and incremental (CDC) syncs are designed to be resumable. If a sync is interrupted, OLake uses state files to resume progress seamlessly.
S3 Data Partitioning
- What it does: Supports partitioning data when writing to S3.
- Benefit: Improves query performance by organizing data into directories based on user-defined partitioning logic.
Schema Evolution
- What it does: Detects changes in source document structures and handles non-breaking changes.
- Benefit: Minimizes disruptions when the underlying schema evolves; separate Parquet files are created if schema evolution is detected (currently in S3).
Normalization vs. Denormalization
- What it does: Offers an option to choose between normalization and level-1 denormalization (flattening).
- Benefit: Gives flexibility in data transformation depending on downstream requirements.
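As a sketch of what level-1 flattening does (the column-naming scheme here is illustrative, not OLake's guaranteed output), a nested record such as:

{ "id": 1, "user": { "name": "Alice", "age": 30 } }

would have its first level of nesting expanded into top-level columns along the lines of:

{ "id": 1, "user_name": "Alice", "user_age": 30 }

With flattening disabled, the user object would instead be stored as a single JSON value.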
Supported Data Types
OLake supports a variety of data types. The table below summarizes the supported types along with descriptions and sample values:
| Data Type | Description | Sample Value |
|---|---|---|
| Null | Represents a null or missing value | null |
| Int64 | 64-bit integer value | 42 |
| Float64 | 64-bit floating point number | 3.14159 |
| String | Text string | "Hello, world!" |
| Bool | Boolean value | true |
| Object | JSON object (key-value pairs) | { "name": "Alice", "age": 30 } |
| Array | Array of values | [1, 2, 3] |
| Unknown | Unspecified or unrecognized type | "unknown" |
| Timestamp | Timestamp with default precision | "2025-02-18T10:00:00Z" |
| TimestampMilli | Timestamp with millisecond precision (3 decimal places) | "2025-02-18T10:00:00.123Z" |
| TimestampMicro | Timestamp with microsecond precision (6 decimal places) | "2025-02-18T10:00:00.123456Z" |
| TimestampNano | Timestamp with nanosecond precision (9 decimal places) | "2025-02-18T10:00:00.123456789Z" |
Synchronization Modes
The streams file defines how streams are synchronized:
- `supported_sync_modes`: Lists the supported modes (e.g., `full_refresh`, `cdc`).
- `sync_mode`: Indicates the active sync mode for a stream.
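As an illustrative sketch (modeled on the sample configuration in the S3 partitioning section below; exact field placement may differ), a stream entry combining both fields could look like:

{
  "stream": {
    "name": "table1",
    "namespace": "namespace",
    "supported_sync_modes": ["full_refresh", "cdc"],
    "sync_mode": "cdc"
  }
}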
Backoff Retry Count
- What it does: Implements an exponential backoff (1, 2, 4, 8, 16 minutes…) for retrying sync operations.
- Default: 3 (setting it to -1 falls back to the default).
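As a hedged example (the field name here is an assumption for illustration; check the connector docs for the exact key), the retry count would typically sit in the source configuration:

{
  "hosts": ["localhost:27017"],
  "database": "mydb",
  "backoff_retry_count": 3
}

With the exponential schedule above, a value of 3 means retries after roughly 1, 2, and 4 minutes before the sync is marked failed.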
Sync with State of Last Synced Cursor
- What it does: Saves progress in a state file (e.g., `state.json`) so that an interrupted sync can resume from the last known checkpoint.
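The exact layout of state.json is internal and may change; purely as an illustration of the idea, a checkpoint could record a per-stream cursor along these lines:

{
  "streams": [
    {
      "stream_name": "table1",
      "namespace": "namespace",
      "cursor": "<last committed CDC position or chunk offset>"
    }
  ]
}

On restart, OLake reads this file and resumes from the recorded position instead of starting over.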
2. Data Flow in OLake
OLake’s data flow is designed for high throughput and resiliency:
Initial Snapshot
- Process: Executes a full collection read by issuing queries against the source.
- Parallelism: Divides the collection into chunks that are processed in parallel.
- Insight: Provides an estimated remaining time along with real-time stats (e.g., memory usage, running threads, records per second) in a `stats.json` file, created automatically in the same directory as your configs (`source.json` and `destination.json`).
{
"Estimated Remaining Time": "340.85 s",
"Memory": "92329 mb",
"Running Threads": 10,
"Seconds Elapsed": "148.00",
"Speed": "5773.03 rps",
"Synced Records": 854411
}
Change Data Capture (CDC)
- Process: Sets up MongoDB change streams based on oplogs to capture near real-time updates.
- Resilience: Ensures that any changes during the snapshot are also captured, and subsequent updates are continuously synced.
Parallel Processing
- Configuration: Users can set the number of parallel threads to balance speed with the load on the MongoDB cluster.
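For illustration (field name assumed; see the MongoDB connector docs for the authoritative key), the thread count is typically set in source.json:

{
  "hosts": ["localhost:27017"],
  "database": "mydb",
  "max_threads": 10
}

Higher values speed up snapshots but increase read pressure on the cluster, so tune this against your replica set's headroom.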
Transformation & Normalization
- What it does: Flattens complex, semi-structured fields into relational streams.
- Current Support: Level-0 flattening is available, with more advanced nested JSON support coming soon.
Integrated Writes
- What it does: Pushes transformed data directly to target destinations (e.g., local Parquet files, S3 Iceberg Parquet) without intermediary buffering.
- Benefit: Minimizes latency and avoids blocking reads during writes.
Logs & Testing
- Logging: Detailed logs using Zerolog and Lumberjack help monitor the process; OLake creates a logs.txt file that saves all logs from the sync process.
3. Architecture and Components
OLake is built with a modular design to separate responsibilities across different components:
Drivers (Connectors)
- Function: Embed the logic necessary to interact with the source system (starting with MongoDB, with plans for Kafka, Postgres, MySQL, and DynamoDB).
- Responsibilities:
- Full Load: Efficiently fetch complete collections by processing them in parallel.
- CDC Management: Configure change streams to capture and sync incremental changes.
Destinations
- Function: Tightly integrated with drivers to directly push records to destinations.
- Supported Destinations:
- Local Parquet files
- AWS S3
- Apache Iceberg
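As a rough, hypothetical sketch of a destination.json for the simplest case (local Parquet; the field names are assumptions, see the Parquet Writer docs for the real schema):

{
  "type": "PARQUET",
  "writer": {
    "local_path": "/tmp/olake-output"
  }
}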
Core Functionality
- CLI & Commands:
spec
returns a JSON Schema for frontend configuration.check
validates configuration, streams, state files, and credentials.discover
lists streams and their schemas.sync
orchestrates the ETL process.
- State Management: Ensures that CDC streams can resume after interruptions.
- Configuration & Logging: Manages concurrency levels, connection credentials, and provides real-time monitoring of progress and throughput.
4. S3 Data Partitioning
OLake provides robust support for S3 data partitioning, enabling efficient data organization and faster query performance.
Overview
- How It Works: After schema discovery, you can define a `partition_regex` in the `streams.json` file for each stream. This regex dynamically partitions data into folders in your S3 bucket.
- Resulting Structure: The generated S3 path typically includes the database name, table/collection name, and partition values (e.g., date components or key-value pairs).
Sample Configuration
{
"selected_streams": {
"namespace": [
{
"stream_name": "table1",
"partition_regex": "/{column_name, default_value, granularity}",
"normalization": false,
"append_only": false
},
{
"stream_name": "table2",
"partition_regex": "",
"normalization": false,
"append_only": false
}
]
},
"streams": [
{
"stream": {
"name": "table1",
"namespace": "namespace",
"sync_mode": "cdc"
}
},
{
"stream": {
"name": "table2",
"namespace": "namespace",
"sync_mode": "cdc"
}
}
]
}
Partition Regex Attributes
- Structure:
"partition_regex": "/{column_name, default_value, granularity}"
- Components:
- column_name: (Required) Specifies which column’s value determines the partition.
- default_value: (Optional) A fallback if the column is null or unparseable.
- granularity: (Optional) For time-based columns, define granularity (e.g., HH, DD, MM, YYYY).
Supported Patterns
- `now()`: Uses the current local timestamp.
- `default`: Uses a constant fallback value.
- `{col_name}`: Inserts the specified column value.
- Combined patterns: `{now(),2025,YYYY}` yields the year (e.g., `2025`); `{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}` combines year, month, and day.
Examples
- Date-based Partition:
  "partition_regex": "/{title,,}/{now(),2025,DD}/{latest_version,,}"
- Key-Value Style Partitioning:
  "partition_regex": "/age={age_column, default_age,}/city={city_column, default_city,}/type={type_column, default_type,}"
  For a record:
  { "age": "30", "city": "mumbai", "type": "existing" }
  the resulting S3 path would be:
  s3://your_bucket/age=30/city=mumbai/type=existing/<filename>.parquet
- Composite Partitions Including Date:
  "partition_regex": "/dt={now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/country={country,default_country,}/city={city,default_city,}"
  This produces a path such as:
  s3://your_bucket/dt=2025-02-18/country=USA/city=NewYork/<filename>.parquet
Supported Timestamp Formats
OLake supports a wide range of timestamp formats, including:
"2006-01-02"
"2006-01-02 15:04:05"
"2006-01-02T15:04:05"
"2020-08-17T05:50:22.895Z"
- and several others
For custom timestamp columns, replace `now()` with the column name and specify the desired format extraction.
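For example, assuming a hypothetical `created_at` timestamp column, the composite date pattern shown earlier becomes:

"partition_regex": "/dt={created_at,2025,YYYY}-{created_at,06,MM}-{created_at,13,DD}"

Records are then bucketed by the date in created_at, with the literal defaults (2025, 06, 13) used when the value is missing or unparseable.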
Test Image Releaser
- What it does: Adds support for building test images via a “test_mode” option in the GitHub workflow.
- Benefits and changes:
- Test Mode Active: Tags are prefixed with `testing-` (e.g., `testing-dev-123`).
- Validation: Bypasses semantic version validation for test images, while production images must follow strict versioning.
- Output Examples:
  Test mode: true
  ✅ Test mode active - skipping semantic version validation for version: dev-123
  **** Releasing olakego/source-mongodb for platforms [linux/amd64,linux/arm64] with version [testing-dev-123] ****
  Release successful for olakego/source-mongodb version testing-dev-123
5. Concluding Remarks and Future Roadmap
OLake is designed to be fast, flexible, and future-proof:
- Future Enhancements:
- Advanced nested JSON flattening (beyond level-0).
- Effortless deployment with a single self-contained binary.
- BYOC (Bring Your Own Cloud) support for both cloud and on-prem deployments.
- Enterprise-grade security with no data leaving your infrastructure.
- Kafka integration as a destination and expanded connector support.
- A unified UI to manage drivers, writers, and all configuration details.
- Performance Focus: Extensive benchmarks show OLake completing massive full loads (e.g., 230 million rows in 46 minutes) and incremental syncs at record speeds (~35,694 records/second).
- Ecosystem Integration: By leveraging open formats and avoiding vendor lock-in, OLake seamlessly integrates with Spark, Trino, Flink, Snowflake, and other analytical engines.
OLake’s modular, open-source architecture is a modern solution to the challenges of high-volume, real-time data ingestion, transformation, and loading. It empowers data engineers to build robust, scalable, and efficient data pipelines with minimal operational overhead.
For more detailed information, examples, and the latest updates, please refer to our GitHub repository and the corresponding documentation sections.