OLake Features
Source Level Features
1. Parallelised Chunking
Parallel chunking is a technique that splits large datasets or collections into smaller virtual chunks, allowing them to be read and processed simultaneously. It is used in sync modes such as Full Refresh, Full Refresh + CDC, and Full Refresh + Incremental.
What it does:
- Splits big collections into manageable pieces without altering the underlying data.
- Each chunk can be processed in parallel, independently of the others.
Benefit:
- Enables parallel reads, dramatically reducing the time needed to perform full snapshots or scans of large datasets.
- Improves ingestion speed, scalability, and overall system performance.
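To make the idea concrete, here is a minimal Python sketch of range-based chunking with parallel reads, assuming a numeric primary key. The `fetch_rows` and `write_to_destination` helpers are hypothetical placeholders, not OLake's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def make_chunks(min_id, max_id, chunk_size):
    """Split the key range [min_id, max_id] into non-overlapping virtual chunks."""
    bounds = list(range(min_id, max_id + 1, chunk_size)) + [max_id + 1]
    return list(zip(bounds, bounds[1:]))

def read_chunk(bounds):
    lo, hi = bounds
    # Placeholder for a range-bounded read, e.g.
    #   SELECT * FROM orders WHERE id >= lo AND id < hi
    return fetch_rows("orders", lo, hi)  # hypothetical helper

chunks = make_chunks(min_id=1, max_id=10_000_000, chunk_size=100_000)
with ThreadPoolExecutor(max_workers=8) as pool:
    for rows in pool.map(read_chunk, chunks):
        write_to_destination(rows)  # hypothetical helper
```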
2. Sync Modes Supported
OLake supports the following sync modes to provide flexibility across use cases:
- Full Refresh → Loads the complete table from the source.
- Full Refresh + Incremental → Performs an initial full load, then captures subsequent changes using incremental logic (Primary or Fallback Cursor).
- Full Refresh + CDC → Performs an initial full load, then continuously captures inserts, updates, and deletes in near real-time via Change Data Capture (CDC).
- Strict CDC → Only captures changes from the source database logs (inserts, updates, deletes), without performing an initial full load.
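As an illustration of the incremental logic, the sketch below reads only rows whose cursor column advanced past the last checkpoint. The `orders` table, `updated_at` cursor column, and sqlite3-style connection are assumptions for the example, not OLake internals:

```python
def incremental_read(conn, last_cursor):
    """Fetch only rows whose cursor value is newer than the last checkpoint."""
    rows = conn.execute(
        "SELECT id, updated_at, payload FROM orders"
        " WHERE updated_at > ? ORDER BY updated_at",
        (last_cursor,),
    ).fetchall()
    # Advance the cursor to the newest value seen so the next run starts there.
    # A fallback cursor (e.g. an auto-increment id) can replace updated_at when
    # the primary cursor column is unreliable or absent.
    new_cursor = rows[-1][1] if rows else last_cursor
    return rows, new_cursor
```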
3. Stateful, Resumable Syncs
OLake ensures that data syncs resume automatically from the last checkpoint after interruptions. This applies to the Full Refresh + CDC, Full Refresh + Incremental, and Strict CDC sync modes.
What it does:
- Maintains state of in-progress syncs.
- Automatically resumes after crashes, network failures, or manual pauses.
- Eliminates the need for restarting jobs from scratch.
Benefit:
- Reduces data duplication and processing time.
- Ensures reliable, fault-tolerant pipelines.
- Minimizes manual intervention for operational teams.
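A simplified sketch of checkpoint-based resumption, assuming a local JSON state file and a hypothetical `sync_chunk` helper (OLake's real state store differs):

```python
import json
import os

STATE_FILE = "sync_state.json"  # illustrative location only

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"completed_chunks": []}

def save_state(state):
    # Write atomically so a crash mid-write cannot corrupt the checkpoint.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

state = load_state()
for chunk_id, chunk in enumerate(chunks):  # chunks as in the earlier sketch
    if chunk_id in state["completed_chunks"]:
        continue  # already synced before the interruption
    sync_chunk(chunk)  # hypothetical helper
    state["completed_chunks"].append(chunk_id)
    save_state(state)  # checkpoint after every chunk
```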
4. Configurable Max Connections
OLake allows configuring the maximum number of database connections per source, helping prevent overload and ensuring stable performance on the source system.
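One way to picture the cap, as a sketch only (in OLake this is a source-level setting, not user code): a semaphore that blocks new work once the configured number of connections is in use.

```python
import threading

class BoundedPool:
    """Cap concurrent source connections with a semaphore."""

    def __init__(self, connect, max_connections):
        self._connect = connect  # callable that opens one source connection
        self._slots = threading.Semaphore(max_connections)

    def run(self, work):
        with self._slots:  # blocks once max_connections are in use
            conn = self._connect()
            try:
                return work(conn)
            finally:
                conn.close()
```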
5. Exact Source Data Type Mapping
OLake guarantees accurate mapping of source database types to Iceberg, maintaining schema integrity and ensuring reliable data replication.
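For illustration, a subset of what such a mapping looks like for a PostgreSQL source; the authoritative mapping ships with each OLake connector, so treat these pairs as examples rather than a specification:

```python
# Illustrative subset of a source-to-Iceberg type map (PostgreSQL shown).
POSTGRES_TO_ICEBERG = {
    "smallint": "int",
    "integer": "int",
    "bigint": "long",
    "real": "float",
    "double precision": "double",
    "numeric(P, S)": "decimal(P, S)",  # precision and scale carried over
    "boolean": "boolean",
    "text": "string",
    "date": "date",
    "timestamptz": "timestamptz",
    "uuid": "uuid",
}
```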
6. Data Filter
Data Filters let you replicate only the rows you need, based on specified column values, during full-refresh-based syncs (Full Refresh, Full Refresh + Incremental, and Full Refresh + CDC). By filtering at the source, they reduce database load, save storage and processing resources, and make downstream queries faster and more efficient.
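A minimal sketch of the pushdown idea, with a hypothetical filter config of column, operator, and value (the SQL shape is illustrative, not OLake's generated query):

```python
def build_filtered_query(table, column, op, value):
    """Push the filter down to the source so unneeded rows never leave the database."""
    allowed_ops = {"=", "!=", ">", ">=", "<", "<="}
    if op not in allowed_ops:  # never interpolate an unvalidated operator
        raise ValueError(f"unsupported operator: {op}")
    return f"SELECT * FROM {table} WHERE {column} {op} ?", (value,)

# e.g. replicate only orders created in 2025 or later:
query, params = build_filtered_query("orders", "created_at", ">=", "2025-01-01")
```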
Destination Level Features
1. Data Deduplication
Data Deduplication ensures that only unique records are stored and processed, saving space, reducing costs, and improving data quality. OLake automatically deduplicates data using the primary key from the source tables, guaranteeing that each primary key maps to a single row in the destination along with its corresponding olake_id.
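Conceptually, deduplication keeps the latest row per key. In this sketch a plain dictionary stands in for the destination's merge machinery, with `olake_id` as the key derived from the source primary key:

```python
def deduplicate(records):
    """Keep exactly one row per key; later versions of a key replace earlier ones."""
    latest = {}
    for rec in records:
        latest[rec["olake_id"]] = rec  # last write wins per primary key
    return list(latest.values())

rows = [
    {"olake_id": "a1", "status": "pending"},
    {"olake_id": "a1", "status": "shipped"},  # update to the same key
    {"olake_id": "b2", "status": "pending"},
]
print(deduplicate(rows))  # one row per olake_id; a1 keeps status "shipped"
```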
2. Hive Style Partitioning
Partitioning is the process of dividing large datasets into smaller, more manageable segments based on specific column values (e.g., date, region, or category), improving query performance, scalability, and data organization.
- Iceberg partitioning → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution.
- S3-style partitioning → Traditional folder-based layout (e.g., year=2025/month=08/day=22/) for compatibility with external tools.
- Normalization → Automatically expands level-1 nested JSON fields into top-level columns.
What it does:
- Converts nested JSON objects into flat columns for easier querying.
- Preserves all data while simplifying structure.
Benefit:
- Makes SQL queries simpler and faster.
- Reduces the need for complex JSON parsing in queries.
- Improves readability and downstream analytics efficiency.
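The sketch below shows both ideas on one record: building an S3-style partition path and expanding level-1 nested JSON into top-level columns. The `year`/`month`/`day` columns and the `_` separator are illustrative choices, not OLake's fixed conventions:

```python
def flatten_level1(record):
    """Expand level-1 nested JSON objects into top-level columns."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value  # separator is illustrative
        else:
            flat[key] = value
    return flat

def hive_path(record):
    """Build an S3-style (Hive) partition path from date columns."""
    return f"year={record['year']}/month={record['month']:02d}/day={record['day']:02d}/"

row = {"order_id": 7, "year": 2025, "month": 8, "day": 22,
       "customer": {"id": 42, "tier": "gold"}}
print(flatten_level1(row))  # {'order_id': 7, ..., 'customer_id': 42, 'customer_tier': 'gold'}
print(hive_path(row))       # year=2025/month=08/day=22/
```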
3. Schema Evolution & Data Type Changes
OLake automatically handles changes in your table's schema without breaking downstream jobs. Read more: Schema Evolution in OLake.
What it does:
- Detects column additions, deletions, or renames.
- Supports data type promotions per the Iceberg v2 spec (e.g., int → long, float → double).
- Updates table metadata seamlessly.
Benefit:
- Ensures pipeline stability even as source schemas evolve.
- Eliminates costly manual migrations or pipeline rewrites.
- Keeps data consistent and queries reliable at scale.
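As a simplified picture of detection, a name-set diff between two sync runs; real Iceberg rename handling works through immutable field IDs rather than names, so this is only an illustration:

```python
def diff_schemas(old_cols, new_cols):
    """Detect column additions and deletions between two sync runs."""
    old, new = set(old_cols), set(new_cols)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

print(diff_schemas(["id", "email"], ["id", "email", "phone"]))
# {'added': ['phone'], 'removed': []}
```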
4. Append Mode
Append Mode adds all incoming data from the source to the destination table without performing deduplication. Full loads always run in append mode; for CDC or incremental syncs, it can be used to disable upsert behavior. (Upsert mode prevents duplicate records by writing delete entries for existing rows before inserting new ones.)
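The difference between the two modes, as a sketch; `delete_where` and `append` are hypothetical destination calls standing in for Iceberg delete entries and data-file writes:

```python
def write_batch(destination, batch, mode="upsert"):
    """Append blindly adds rows; upsert removes matching keys first."""
    if mode == "upsert":
        keys = {rec["olake_id"] for rec in batch}
        destination.delete_where(keys)  # hypothetical: emits delete entries
    destination.append(batch)  # hypothetical: writes the new rows
```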
5. Dead Letter Queue Columns (WIP)
The DLQ column captures values whose data type changes are not covered by Iceberg/Parquet type promotions, storing them safely without loss. This prevents sync failures and keeps downstream models stable. By isolating incompatible values, it lets users continue syncing data seamlessly while addressing type mismatches at their convenience, improving reliability and reducing manual intervention.
Schema Evolution and Data Type Changes
This document explains how OLake handles schema changes and data type changes in your data pipelines. It covers two distinct features that help maintain pipeline resilience when your source data structures evolve.
Schema Evolution
Schema evolution refers to changes in your database structure like adding, removing, or renaming columns and tables. OLake handles these changes to prevent pipeline failures and data loss.
Schema Evolution — Column-Level Changes
Change Type | How OLake Detects & Handles It | Typical Pipeline Impact | Extra Details & Tips |
---|---|---|---|
Adding a column | OLake runs a schema discovery at the start of every sync. When a new source column appears, it is automatically added to the Iceberg schema (new field-ID) and starts receiving values immediately. If the source back-fills historical rows, CDC registers them as updates. No user action is required. | No breakage. Historical rows show NULL until back-filled. | • Monitor write throughput if a back-fill is large. |
Deleting (dropping) a column | OLake notices the column is removed in the source streams. • The deleted column still exists in the destination (so old snapshots stay queryable). | No breakage. ETL continues with a “virtual” column (null-filled). | • Downstream BI tools won't break, but they might show the column full of nulls; communicate schema changes to analysts. • You can later run a “rewrite manifests” job to strip the dead column if storage footprint matters. |
Renaming a column | Source column renamed → old column stays in destination (no new values) → new column with updated name is created and receives all incoming data. WIP: Iceberg keeps immutable field IDs, so on rename (customer_id → client_id) OLake just updates the column's name on the same field ID; no data migration required. | No breakage. | • Renames are instant; no file rewrites. • If you have SQL downstream, update your queries to use the new column name. |
JSON / Semi-structured key add / remove / rename | OLake flattens keys to a canonical path inside a single JSON column (or keeps raw JSON). • Added keys appear automatically. • Removed keys simply vanish from new rows. • Renamed keys are treated as “remove + add” because JSON has no intrinsic field ID. | No breakage. |
- A sparse new column will not be synced to the destination unless it has at least one non-NULL value, because Iceberg stores data column-wise (Parquet).
Schema Evolution — Table-Level Changes
Change Type | How OLake Detects & Handles It | Typical Pipeline Impact | Extra Details & Tips |
---|---|---|---|
Adding a table / stream | Newly detected source tables appear in the OLake UI list. You choose which ones to sync. Once added, OLake performs an initial full load and then switches to CDC. Tables not selected to sync are ignored. | No breakage. Pipelines for existing tables run as usual; disabled tables simply do not sync. | • Initial full loads run in parallel. • Default naming is source_db.table_name. |
Removing (dropping) a table / stream | No new data will get added to the deleted table. Existing table data and metadata remain queryable. | No breakage. Downstream queries on historic data still work; new inserts stop. | • If the table is recreated later with the same name but different structure, treat it as a brand-new stream to avoid field-ID collisions. |
Renaming a table | A renamed table is treated as a brand-new table: it is discovered as a new stream, and once sync is enabled for it, it is synced as a full load + CDC. • The old Iceberg table keeps historical data. | No breakage, but post-rename data lands in a separate table unless you merge histories. | For continuous history, enable the new table quickly and (optionally) set an alias so both names map to the same Iceberg table. |
Schema Data Type Changes
Schema data type changes refer to modifications to the data type of existing columns (e.g., changing INT to BIGINT). OLake leverages Apache Iceberg v2 tables' type promotion capabilities to handle compatible changes automatically.
Supported Data Type Promotions
OLake fully supports all Iceberg v2 data type promotions:
From | To | Notes |
---|---|---|
INT | LONG (BIGINT) | Widening integers is safe |
FLOAT | DOUBLE | Promoting to higher precision works without data loss |
DATE | TIMESTAMP, TIMESTAMP_NS | Dates can be safely converted to timestamps |
DECIMAL(P, S) | DECIMAL(P', S) where P' > P | Only widening precision is supported |
- Iceberg v2 supports widening type changes only. Narrowing changes (e.g., BIGINT to INT), along with any other data type changes, result in an error because they are not supported.
- Incompatible type changes will be handled by OLake with DLQ (Dead Letter Queue) tables (coming soon).
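A small checker that encodes the promotion table above (the type names are the spellings used in the table; decimal handling assumes the scale is unchanged):

```python
import re

# Widening promotions from the table above.
ALLOWED = {
    ("int", "long"),
    ("float", "double"),
    ("date", "timestamp"),
    ("date", "timestamp_ns"),
}

def is_safe_promotion(src, dst):
    if src == dst or (src, dst) in ALLOWED:
        return True
    # decimal(P, S) -> decimal(P', S): only widening the precision is allowed.
    m_src = re.fullmatch(r"decimal\((\d+),\s*(\d+)\)", src)
    m_dst = re.fullmatch(r"decimal\((\d+),\s*(\d+)\)", dst)
    if m_src and m_dst and m_src.group(2) == m_dst.group(2):
        return int(m_dst.group(1)) > int(m_src.group(1))
    return False

print(is_safe_promotion("int", "long"))                       # True
print(is_safe_promotion("decimal(10, 2)", "decimal(18, 2)"))  # True
print(is_safe_promotion("long", "int"))                       # False: narrowing
```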
Handling Incompatible Type Changes
For type changes not supported by Iceberg v2 (like STRING to INT), OLake offers the following options:
- Schema Data Type Changes Enabled with DLQ (coming soon):
  - Records with incompatible types are routed to a Dead Letter Queue (DLQ) column
  - The main pipeline continues processing compatible records
  - Full record information is preserved for troubleshooting
- Schema Data Type Changes Enabled without DLQ:
  - The sync fails with a clear error message about the incompatible type change
  - The message identifies the specific column and type conversion that failed
- Schema Data Type Changes Disabled:
  - Any data type change results in sync failure
  - Provides an explicit error about the type change detected
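Since the DLQ option is still in progress, the sketch below only illustrates the routing idea: values that fail the expected type go to a DLQ writer instead of failing the sync. All names here are hypothetical:

```python
def route_record(record, column, expected_type, main_writer, dlq_writer):
    """Send type-compatible records onward; quarantine incompatible values."""
    raw = record.get(column)
    try:
        record[column] = expected_type(raw)  # e.g. int("abc") raises ValueError
        main_writer(record)
    except (TypeError, ValueError):
        record[column] = None
        dlq_writer({"column": column, "raw_value": raw, "record": record})
```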
Production Best Practices
- Enable Schema Data Type Changes for production environments
- Implement robust monitoring for type change errors
- Test schema changes in non-production environments first
- Document your schema and track changes over time
Example Scenarios
Scenario 1: Adding a Column in Source
When a new column appears in your source data:
- OLake automatically detects the new column
- The column is added to your destination schema
- New data includes values for this column
- Historical data has null values for this column
Scenario 2: Adding a Table / Collection in Source
When a new table appears in your source database:
- OLake automatically detects the new table on the next scheduled run
- Once the table is enabled for sync, a new table is created in the destination schema
Scenario 3: Table Name Change
When a table name changes in your source database:
- OLake automatically detects the renamed table as a new stream
- A new table is created in the destination schema for the new name
- The old table is retained in the destination schema but will not be populated with new data
Scenario 4: INT to BIGINT Conversion
When a column changes from INT to BIGINT type:
- OLake detects the widening type change
- Column type is updated in the destination
- All values are properly converted
- Pipeline continues without interruption
Scenario 5: Incompatible Type Change
When a column changes from STRING to INT, a conversion not supported by Iceberg v2, the sync fails with an error identifying the column and the attempted type change.
For more detailed information on Iceberg's schema evolution capabilities, refer to the Apache Iceberg documentation.