
OLake Features

Source Level Features

1. Parallelised Chunking

Parallel chunking is a technique that splits large datasets or collections into smaller virtual chunks, allowing them to be read and processed simultaneously. It is used in sync modes such as Full Refresh, Full Refresh + CDC, and Full Refresh + Incremental.

What it does:

  • Splits big collections into manageable pieces without altering the underlying data.
  • Each chunk can be processed in parallel and independently.

Benefit:

  • Enables parallel reads, dramatically reducing the time needed to perform full snapshots or scans of large datasets.
  • Improves ingestion speed, scalability, and overall system performance.
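The idea can be sketched as follows. This is a minimal illustration of virtual chunking over a numeric key range, not OLake's actual implementation; the chunk size, worker count, and the `read_chunk` placeholder are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def make_chunks(min_id, max_id, chunk_size):
    """Split the key range [min_id, max_id] into virtual chunks.

    Each chunk is just a (start, end) bound; the underlying data is untouched.
    """
    chunks = []
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        chunks.append((start, end))
        start = end + 1
    return chunks

def read_chunk(bounds):
    # Placeholder for a bounded read, e.g.
    # SELECT * FROM t WHERE id BETWEEN bounds[0] AND bounds[1]
    start, end = bounds
    return list(range(start, end + 1))

chunks = make_chunks(1, 10_000, chunk_size=2_500)   # 4 virtual chunks
with ThreadPoolExecutor(max_workers=4) as pool:     # read them in parallel
    results = list(pool.map(read_chunk, chunks))

rows = [r for part in results for r in part]
```

Because each chunk is bounded and independent, a failed chunk can be retried on its own without touching the others.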

2. Sync Modes Supported

OLake supports the following sync modes to provide flexibility across use cases:

  • Full Refresh → Loads the complete table from the source.
  • Full Refresh + Incremental → Performs an initial full load, then captures subsequent changes using incremental logic (Primary or Fallback Cursor).
  • Full Refresh + CDC → Performs an initial full load, then continuously captures inserts, updates, and deletes in near real-time via Change Data Capture (CDC).
  • Strict CDC → Only captures changes from the source database logs (inserts, updates, deletes), without performing an initial full load.
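To make the incremental logic above concrete, here is a minimal sketch of cursor-based incremental reads. The cursor field name, the in-memory "table", and the return shape are all illustrative assumptions, not OLake's API.

```python
def incremental_read(rows, cursor_field, last_cursor):
    """Return rows newer than the saved cursor, plus the new cursor to persist.

    A cursor of None means no checkpoint exists yet, so everything is read
    (equivalent to the initial full load).
    """
    new_rows = [r for r in rows
                if last_cursor is None or r[cursor_field] > last_cursor]
    new_cursor = max((r[cursor_field] for r in new_rows), default=last_cursor)
    return new_rows, new_cursor

table = [
    {"id": 1, "updated_at": "2025-08-01"},   # ISO dates compare correctly as strings
    {"id": 2, "updated_at": "2025-08-15"},
    {"id": 3, "updated_at": "2025-08-22"},
]

# First run: no cursor yet, so everything is read (the initial full load).
batch, cursor = incremental_read(table, "updated_at", None)

# Next run: only rows updated after the stored cursor are fetched.
table.append({"id": 4, "updated_at": "2025-09-01"})
delta, cursor = incremental_read(table, "updated_at", cursor)
```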

3. Stateful, Resumable Syncs

OLake ensures that data syncs resume automatically from the last checkpoint after interruptions. This applies to the Full Refresh + CDC, Full Refresh + Incremental, and Strict CDC sync modes.

What it does:

  • Maintains state of in-progress syncs.
  • Automatically resumes after crashes, network failures, or manual pauses.
  • Eliminates the need for restarting jobs from scratch.

Benefit:

  • Reduces data duplication and processing time.
  • Ensures reliable, fault-tolerant pipelines.
  • Minimizes manual intervention for operational teams.
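Checkpoint-based resumption can be sketched like this. The state-file path, its JSON shape, and the chunk names are hypothetical; the point is that progress is persisted after each unit of work, so a restart skips everything already done.

```python
import json
import os
import tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "olake_demo_state.json")

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"done_chunks": []}

def save_state(state):
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

def sync(chunks):
    state = load_state()
    for chunk in chunks:
        if chunk in state["done_chunks"]:
            continue                    # already synced before the interruption
        # ... read the chunk from the source and write it here ...
        state["done_chunks"].append(chunk)
        save_state(state)               # checkpoint after every chunk
    return state["done_chunks"]

if os.path.exists(STATE_FILE):
    os.remove(STATE_FILE)               # start the demo from a clean slate
done = sync(["chunk-1", "chunk-2", "chunk-3"])
```

A second call to `sync` with the same chunks finds every chunk already checkpointed and does no work, which is exactly the resume-after-interruption behavior described above.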

4. Configurable Max Connections

OLake allows configuring the maximum number of database connections per source, helping prevent overload and ensuring stable performance on the source system.
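One common way to enforce such a cap is a semaphore-guarded connection pool; the sketch below is an assumption about the general technique, not OLake's internals, and the cap of 3 is arbitrary.

```python
import threading
import time

MAX_CONNECTIONS = 3                      # illustrative cap, configurable per source
slots = threading.BoundedSemaphore(MAX_CONNECTIONS)
peak = 0
active = 0
lock = threading.Lock()

def query(_):
    global peak, active
    with slots:                          # blocks once MAX_CONNECTIONS are in use
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)                 # stand-in for running a query on the source
        with lock:
            active -= 1

threads = [threading.Thread(target=query, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many workers are launched, the semaphore guarantees the source never sees more than the configured number of concurrent connections.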

5. Exact Source Data Type Mapping

OLake guarantees accurate mapping of source database types to Iceberg, maintaining schema integrity and ensuring reliable data replication.
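As an illustration, a fragment of such a mapping for a PostgreSQL source might look like the table below. This is a hand-written subset for explanation only; it is not OLake's actual mapping table, and the fallback-to-string choice is an assumption.

```python
# Illustrative subset of a source-to-Iceberg type map (PostgreSQL shown).
PG_TO_ICEBERG = {
    "smallint": "int",
    "integer": "int",
    "bigint": "long",
    "real": "float",
    "double precision": "double",
    "numeric": "decimal",
    "boolean": "boolean",
    "text": "string",
    "varchar": "string",
    "timestamp": "timestamp",
    "timestamptz": "timestamptz",
    "date": "date",
    "uuid": "uuid",
}

def map_type(pg_type):
    # Conservative fallback for types outside the table (assumption).
    return PG_TO_ICEBERG.get(pg_type.lower(), "string")
```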

6. Data Filter

Data Filters let you replicate only the rows you need, based on specified column values, during full refresh syncs (Full Refresh, Full Refresh + Incremental, and Full Refresh + CDC). By filtering at the source, they reduce database load, save storage and processing resources, and make downstream queries faster and more efficient.
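Filtering at the source amounts to pushing the predicate into the read query, so excluded rows never leave the database. The sketch below shows the general idea; the table name, filter tuples, and `%s` placeholder style are illustrative assumptions.

```python
def build_filtered_query(table, filters):
    """Render a read query that pushes the filter down to the source.

    `filters` is a list of (column, operator, value) tuples; values are
    passed as bind parameters rather than interpolated into the SQL.
    """
    if not filters:
        return f"SELECT * FROM {table}"
    clauses = [f"{col} {op} %s" for col, op, _ in filters]
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)

# Hypothetical filter: only replicate recent, active rows.
filters = [("status", "=", "active"), ("created_at", ">=", "2025-01-01")]
sql = build_filtered_query("public.orders", filters)
params = [value for _, _, value in filters]
```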


Destination Level Features

1. Data Deduplication

Data Deduplication ensures that only unique records are stored and processed, saving space, reducing costs, and improving data quality. OLake automatically deduplicates data using the primary key from the source tables, guaranteeing that each primary key maps to a single row in the destination along with its corresponding olake_id.
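In essence, later records for the same primary key replace earlier ones (upsert semantics). A minimal in-memory sketch of that behavior, leaving out the olake_id bookkeeping and Iceberg delete files that the real pipeline uses:

```python
def deduplicate(records, key="id"):
    """Keep one row per primary key, letting later records win."""
    latest = {}
    for rec in records:
        latest[rec[key]] = rec          # later occurrence overwrites the earlier one
    return list(latest.values())

incoming = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
    {"id": 1, "name": "alpha-v2"},      # update for primary key 1
]
rows = deduplicate(incoming)
```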

2. Hive Style Partitioning

Partitioning is the process of dividing large datasets into smaller, more manageable segments based on specific column values (e.g., date, region, or category), improving query performance, scalability, and data organization.

  • Iceberg partitioning → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution.
  • S3-style partitioning → Traditional folder-based layout (e.g., year=2025/month=08/day=22/) for compatibility with external tools.
  • Normalization → Automatically expands level-1 nested JSON fields into top-level columns.

What Normalization does:

  • Converts nested JSON objects into flat columns for easier querying.
  • Preserves all data while simplifying structure.

Benefit:

  • Makes SQL queries simpler and faster.
  • Reduces the need for complex JSON parsing in queries.
  • Improves readability and downstream analytics efficiency.
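Level-1 flattening can be sketched as below. The `parent_child` column-naming convention is an assumption for illustration; note that objects nested deeper than one level stay intact inside their promoted column.

```python
def normalize(record):
    """Expand level-1 nested objects into top-level columns.

    Deeper nesting is left as-is inside the promoted column.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value   # promote one level
        else:
            flat[key] = value
    return flat

row = {"id": 7, "address": {"city": "Pune", "geo": {"lat": 18.52}}}
flat = normalize(row)
```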

3. Schema Evolution & Data Type Changes

OLake automatically handles changes in your table's schema without breaking downstream jobs. Read more: Schema Evolution in OLake.

What it does:

  • Detects column additions, deletions, or renames.
  • Supports data type promotions as of Iceberg v2 (e.g., int → long, float → double).
  • Updates table metadata seamlessly.

Benefit:

  • Ensures pipeline stability even as source schemas evolve.
  • Eliminates costly manual migrations or pipeline rewrites.
  • Keeps data consistent and queries reliable at scale.
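The promotion check can be sketched as a small compatibility table. The `int → long` and `float → double` widenings come from the Iceberg spec (which also permits decimal precision widening, omitted here for brevity); the function shape and return labels are illustrative assumptions.

```python
# Widenings permitted by the Iceberg spec: these require no rewrite
# of existing data files.
PROMOTIONS = {
    ("int", "long"),
    ("float", "double"),
}

def evolve_column(schema, column, new_type):
    old_type = schema.get(column)
    if old_type is None:
        schema[column] = new_type          # column addition: always safe
        return "added"
    if old_type == new_type:
        return "unchanged"
    if (old_type, new_type) in PROMOTIONS:
        schema[column] = new_type          # safe widening, e.g. int -> long
        return "promoted"
    return "incompatible"                  # cannot evolve in place

schema = {"id": "int", "price": "float"}
evolve_column(schema, "id", "long")        # safe promotion
evolve_column(schema, "discount", "double")  # new column
```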

4. Append Mode

Append mode adds all incoming data from the source to the destination table without performing deduplication. Full loads always run in append mode, and for CDC or incremental syncs it can be used to disable upsert behavior. By contrast, upsert mode ensures no duplicate records by writing delete entries for existing rows before inserting new ones.

5. Dead Letter Queue Columns (WIP)

The DLQ column captures values whose data type changes are not supported by Iceberg/Parquet type promotions, storing them safely without loss. This prevents sync failures and keeps downstream models stable. By isolating incompatible values, it lets users continue syncing data seamlessly while addressing type mismatches at their convenience, improving reliability and reducing manual intervention.
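Since the feature is still in progress, the following is only a conceptual sketch of the routing idea: values that fail the type check are parked in a side column instead of aborting the sync. The `_olake_dlq` column name and the text encoding are hypothetical.

```python
def write_value(record, column, value, expected_type):
    """Store a value that matches the column's type, or park the raw value in
    a DLQ column so the sync can continue instead of failing."""
    if isinstance(value, expected_type):
        record[column] = value
    else:
        dlq = record.setdefault("_olake_dlq", {})   # hypothetical DLQ column name
        dlq[column] = repr(value)                   # preserved losslessly as text
        record[column] = None

row = {}
write_value(row, "price", 19.99, float)
write_value(row, "price", "unknown", float)         # type mismatch -> DLQ
```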



💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!