Source​

A Source is the system from which OLake reads data. This could be a database such as MongoDB, Postgres, MySQL, or Oracle, or more generally any data service. When you create a source in OLake, you are telling the platform where the data should come from.

A source becomes active once it is linked to at least one job. This means OLake is actively pulling data from the source and sending it to a destination. Active sources are visible in the "Active Sources" section of the OLake UI.

A source is inactive when it has been created in OLake but is not yet assigned to any job. It exists in the system, but no data is read from it until a job is created. Inactive sources are visible in the "Inactive Sources" section of the OLake UI.

Destination​

A Destination is the system where OLake writes data after it has been extracted from a source. In OLake, destinations define where your data will be stored and in what format. Currently, OLake supports two types of destinations: Amazon S3 and Apache Iceberg.

A destination is active when it is assigned to at least one job. This means OLake is currently delivering data into the destination. Active destinations are visible in the "Active Destinations" section of the OLake UI.

A destination is inactive when it has been created in OLake but is not linked to any job. It is available for use but will not receive any data until a job is configured. Inactive destinations are visible in the "Inactive Destinations" section of the OLake UI.

Jobs​

A Job is the pipeline or process in OLake that moves data from a source to a destination. A job defines what data is moved, how it is moved (full refresh, incremental, CDC), and where it is delivered. Jobs are the central element of OLake, as they connect sources and destinations.

A job is active when it is running or scheduled to run. This means OLake is currently transferring data according to the job's configuration. Newly created jobs also appear in the "Active Jobs" section of the OLake UI.

A job is inactive when it has been paused. No data transfer takes place until it is resumed, but all configurations and previous states are preserved. Inactive jobs are visible in the "Inactive Jobs" section of the OLake UI.

A saved job is a stored configuration that can be used to create new job runs. When a saved job is scheduled or run, it appears under the Active Jobs tab. Saved jobs are visible in the "Saved Jobs" section of the OLake UI.

A job is marked failed when OLake encounters an error during execution (such as network issues, schema conflicts, or permission problems). Failed jobs require troubleshooting to identify the cause of the failure. They are visible in the "Failed Jobs" section of the OLake UI.

Streams​

A Stream in OLake represents a unit of data (such as a table or collection) discovered from a source. The streams panel lets you choose which streams to sync, how the data is synced (by selecting from different sync modes), what schema they use, and how the data is partitioned before it is loaded into the destination.

Streams Properties​

1. Normalization​

Normalization in OLake is the transformation step that performs Level-1 flattening of nested JSON data, mapping fields to proper columns so the data is ready to be written into Iceberg/Parquet format tables.

  • Detects schema evolution (adds, drops, type promotions) and writes according to Iceberg v2 spec.
  • Flattens nested structures so records become query-friendly.
  • Focuses on mapping source types to Iceberg/Parquet types.

For the input given below:

note

Normalization must be enabled at the schema configuration step per table or stream within the job.

[Image: Olake Partition]

Queried using AWS Athena.
Initial state: sync done with normalization off. Focus on the data column.

[Image: Olake Partition output]

After state: sync done with normalization on.

[Image: Olake Partition output]

Advantages:

  • Normalized streams can be queried with standard SQL—no unusual parsing or UDFs needed.
  • Schema evolution (adds/drops/type promotions) is detected and written per Iceberg v2 spec, so downstream tables continue to operate without pipeline breaks as source schemas evolve.
  • Because transformation and normalization are integrated into the pipeline (before the write), no extra post-processing is needed: no additional Spark/dbt/Glue jobs are required later to "fix" the shape.
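To make Level-1 flattening concrete, here is a minimal Python sketch of the idea (illustrative only, not OLake's implementation; the "parent_child" column naming and the handling of deeper nesting are assumptions):

```python
# Minimal sketch of Level-1 flattening (illustrative only, not OLake's code).
def flatten_level_1(record: dict) -> dict:
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Promote first-level nested fields to top-level columns.
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


nested = {"id": 1, "name": "olake", "address": {"city": "SF", "zip": "94105"}}
print(flatten_level_1(nested))
# {'id': 1, 'name': 'olake', 'address_city': 'SF', 'address_zip': '94105'}
```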

2. Sync Modes​

Sync modes in OLake define the strategy used to replicate data from a source system to a destination. Each mode represents a different approach to data synchronization, with specific behaviours, guarantees, and performance characteristics.

OLake supports 4 distinct sync modes:

  1. Full Refresh:
    Entire table is re-copied from source to destination in parallel chunks. Useful as a main sync mode or for initial loads.

  2. Full Refresh + CDC (Change Data Capture):
    Replication that first performs a full refresh, then streams changes (inserts, updates, deletes) in real time.

  3. Full Refresh + Incremental:
    A delta-sync strategy that only processes new or changed records since the last sync. It requires a primary cursor field (mandatory) and an optional secondary cursor field for change detection (see the sketch after this list). As with CDC sync, an initial full refresh takes place first.

    info

    Cursor fields are columns in the source table used to track the last synced records.
    OLake allows setting up to two cursor fields:

    • Primary Cursor: the mandatory cursor field through which changes in the records are captured and compared.
    • Secondary Cursor: used as a fallback; if the primary cursor's value is null, the secondary cursor's value is considered (when provided).
  4. CDC Only:
    A CDC variant that skips full-refresh entirely, focusing on processing only the new changes after CDC begins.
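A minimal sketch of how cursor-based incremental sync can be expressed (illustrative only, not OLake's internals; the table and column names are hypothetical, and the COALESCE fallback mirrors the primary/secondary cursor behaviour described above):

```python
# Illustrative sketch of cursor-based incremental sync (not OLake's internals).
from typing import Optional


def build_incremental_query(table: str, primary_cursor: str, last_value: str,
                            secondary_cursor: Optional[str] = None) -> str:
    # Fall back to the secondary cursor when the primary cursor value is NULL.
    cursor_expr = (f"COALESCE({primary_cursor}, {secondary_cursor})"
                   if secondary_cursor else primary_cursor)
    return f"SELECT * FROM {table} WHERE {cursor_expr} > '{last_value}'"


print(build_incremental_query("orders", "updated_at", "2025-08-21 17:38:35",
                              secondary_cursor="created_at"))
# SELECT * FROM orders WHERE COALESCE(updated_at, created_at) > '2025-08-21 17:38:35'
```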

[Image: Olake Partition output]

3. Data Filter​

The data filter feature allows selective ingestion from source databases by applying SQL-style WHERE clauses or BSON-based conditions during ingestion.

  • Ensures only selected data enters the pipeline, saving on transfer, storage, and processing.
  • Supports combining up to two conditions with logical operators (AND/OR).
  • Operators: >, <, =, !=, >=, <=
  • Values can be numbers, quoted strings/timestamps/IDs (e.g. created_at > "2025-08-21 17:38:35.017"), or null. A sketch of how such a filter maps to the underlying queries follows the driver list below.

Adoption of filter in drivers:

  • Postgres: During chunk processing, filters are applied alongside chunk conditions, ensuring only matching records are ingested—even with CTID-based chunking.
  • MySQL: During chunk processing, filters are applied within each chunk so only relevant rows are returned, even with limit-offset chunking.
  • MongoDB: During chunk processing, filters are enforced in the aggregation pipeline’s $match stage to ensure only compliant documents are processed.
  • Oracle: Similar to Postgres and MySQL, filters are applied within each chunk’s scan, guaranteeing only records satisfying conditions are ingested.
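As a rough illustration (not OLake's internals; the filter string, table name, and the "<chunk condition>" placeholder are assumptions), a two-condition filter could translate into per-chunk queries like this:

```python
# Illustrative sketch only: how a two-condition filter could map to a SQL WHERE
# clause and a MongoDB $match stage.
filter_expr = 'created_at > "2025-08-21 17:38:35.017" and status != "inactive"'

# Relational sources (Postgres/MySQL/Oracle): the filter is ANDed with each chunk's
# own range condition, so only matching rows in that chunk are read.
sql = ("SELECT * FROM orders "
       "WHERE (<chunk condition>) "
       "AND created_at > '2025-08-21 17:38:35.017' AND status != 'inactive'")

# MongoDB: the same conditions are expressed in the aggregation pipeline's $match stage.
mongo_match = {
    "$match": {
        "$and": [
            {"created_at": {"$gt": "2025-08-21 17:38:35.017"}},
            {"status": {"$ne": "inactive"}},
        ]
    }
}
```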

[Image: Olake Partition output]

4. Schema​

Schema is the structure of the tables (or streams) that OLake creates when it scans and discovers the source data.
The ability to adjust a schema (add/remove columns, change types) without rewriting the entire table is called Schema Evolution.
Usage: If a product table adds a new category field, a schema-evolved table format like Iceberg can incorporate this new column on the fly.

For more information, refer to Schema Evolution Feature
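For reference, the same evolution expressed manually against an Iceberg table in Spark SQL would look like the sketch below (table and column names are hypothetical); OLake applies the equivalent change automatically when it detects the new field in the source:

```python
# Hypothetical table/column names. Adding a column to an Iceberg table is a
# metadata-only change: no data files are rewritten, and rows written before the
# change simply read the new column as NULL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg-enabled Spark session

spark.sql("ALTER TABLE lakehouse.shop.products ADD COLUMNS (category STRING)")
spark.sql("SELECT id, name, category FROM lakehouse.shop.products").show()
```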

[Image: Olake Partition output]

5. Partitioning​

Partitioning organizes data by grouping similar rows based on column values or transformations, enabling efficient query processing and data management. In simple terms, it routes output by record values. The result is fewer data files scanned, less I/O, and faster queries.

info

Unlike traditional systems like Hive, Iceberg's approach uses "hidden partitioning," where partition details are managed in metadata rather than physical folders or explicit columns.

  • Usage Format: "partition_regex" = "/{field_name, transform}/{next_field, transform}"
    Example:
    • Partition Regex: "/{created_at, day}" => partitioning is done on the created_at field, transformed using 'day'.

For the input given below:

[Image: Olake Partition]

As seen below, the data has been partitioned on the created_at field using the 'day' transform.

[Image: Olake Partition output]

Benefits of Partitioning:

  • Faster Data Retrieval: Queries filtering on partitioned columns (e.g., time ranges) only scan relevant partitions, avoiding full table scans. For instance, in a logs table partitioned by date, a query for a specific day skips files from other dates, potentially speeding up results by orders of magnitude for large tables.
  • Automatic Partition Pruning: In case of Iceberg, it derives partition filters from logical predicates without requiring users to add extra filters, unlike Hive where missing partition conditions lead to scanning all data. This can significantly improve query times in real-world scenarios such as financial data analysis, and is vital for streaming logs, e-commerce orders, and IoT telemetry.
  • Efficient Handling of High-Volume Data: For high-cardinality fields (e.g., user IDs), transforms like bucketing distribute data evenly, preventing hotspots and enabling balanced query loads. Combined with sorting within partitions, this further optimizes scans by leveraging min/max statistics in file footers. For example, "/{user_id, bucket[512]}" hashes user_id and assigns it to one of 512 buckets (0-511); see the sketch after this list.
  • Diverse Transforms: Supports flexible operations like identity, year/month/day/hour extraction, bucketing, and truncation, enabling tailored partitioning for various use cases (e.g., time-series data or categorical grouping). For example, bucketing high-cardinality keys like UUIDs into fixed groups improves write distribution and query efficiency.
  • Schema Independence: Partitioning isn't tied to the table's schema, so you can evolve schemas (e.g., add columns) without breaking partitioning or queries.
  • Hidden Partitioning: Writers don't need to manually compute or supply partition values. For example, Iceberg derives them from source columns (e.g., converting a timestamp to a date). This avoids silent errors like incorrect formats or time zones (common in Hive).
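To make the 'day' and bucket transforms concrete, here is a minimal sketch (illustrative only; real Iceberg bucketing uses a Murmur3 hash, so the md5 below is just a stand-in):

```python
# Illustrative sketch of the 'day' and bucket[512] transforms (not Iceberg's code).
from datetime import datetime
import hashlib


def day_transform(ts: datetime) -> str:
    # "/{created_at, day}" groups rows by the calendar day of created_at.
    return ts.strftime("%Y-%m-%d")


def bucket_transform(value: str, num_buckets: int = 512) -> int:
    # "/{user_id, bucket[512]}" hashes user_id into one of 512 buckets (0-511).
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % num_buckets


print(day_transform(datetime(2025, 8, 21, 17, 38)))  # 2025-08-21
print(bucket_transform("user_42"))                   # a bucket id between 0 and 511
```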

6. Job Configuration​

The job configuration property refers to the options that define a job's name, schedule, and execution in OLake.
The user starts with job creation, followed by source configuration, destination configuration, and selecting and enabling the relevant streams from the schema for sync; finally, in job configuration, the job name and frequency are set.

  • Frequency Options:
    • Default options, i.e. every minute, hourly, daily, or weekly
    • Custom frequency: Specify a cron expression.

Guide to Cron Expression:

* * * * *
  • 1st field: minute (0-59)
  • 2nd field: hour (0-23)
  • 3rd field: day of the month (1-31)
  • 4th field: month (1-12)
  • 5th field: day of the week (0-6)

Cron Examples:

  • * * * * * = Every minute
  • 0 * * * * = Every hour
  • 0 0 * * * = Every day at 12:00 AM
  • 0 0 * * FRI = At midnight only on Fridays
  • 0 0 1 * * = At midnight on the 1st day of each month
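To sanity-check a custom cron expression before saving it as a job frequency, a quick sketch using the third-party croniter package (shown purely for illustration; it is not part of OLake) looks like this:

```python
# Print the next few run times for a cron expression (midnight on Fridays).
from datetime import datetime
from croniter import croniter

expr = "0 0 * * 5"  # minute=0, hour=0, any day of month, any month, Friday (5)
it = croniter(expr, datetime(2025, 8, 21))
for _ in range(3):
    print(it.get_next(datetime))
# 2025-08-22 00:00:00
# 2025-08-29 00:00:00
# 2025-09-05 00:00:00
```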

[Image: Olake Partition output]


💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!