OLake Go Terminologies
1. Source
A Source is the system from which OLake Go reads data. This could be a database (MongoDB, Postgres, MySQL, Oracle) or, more generally, a data service. When you create a source in OLake Go, you are telling the platform where the data should come from.
| Concept | Description |
|---|---|
| Active source | Linked to at least one job; OLake Go reads from this source and sends data to a destination. |
| Inactive source | Created in OLake Go but not assigned to any job yet; no data is read until a job uses it. |
2. Destination
A Destination is the system where OLake Go writes data after it has been extracted from a source. In OLake Go, destinations define where your data will be stored and in what format. Currently, OLake supports two types of destinations: Amazon S3 and Apache Iceberg.
| Concept | Description |
|---|---|
| Active destination | Assigned to at least one job; OLake Go is delivering data into this destination. |
| Inactive destination | Created in OLake Go but not linked to any job yet; no data until a job uses it. |
3. Jobs
A Job is the pipeline or process in OLake Go that moves data from a source to a destination. A job defines what data is moved, how it is moved (full refresh, incremental, CDC), and where it is delivered. Jobs are the central element of OLake, as they connect sources and destinations.
| Concept | Description |
|---|---|
| Active job | Running or scheduled; OLake Go transfers data per the job configuration. Newly created jobs appear here while active. |
| Inactive job | Paused; no transfer until resumed. Configuration and state are kept. |
| Saved job | Saved configuration used to start new runs; when scheduled or run, it appears under Active Jobs. |
| Failed job | Execution error (for example network, schema, or permissions); needs investigation. |
When a job is inactive, certain job-level features are unavailable:
- Sync Now: cannot trigger immediate syncs
- Edit Streams: stream configuration cannot be modified
- Clear Destination: cannot clear destination data
To use these features, resume the job to move it back to active status.
4. OLake Go-Generated Columns
When OLake Go replicates data from a source (such as PostgreSQL, MySQL, Oracle, or MongoDB) to a destination format (Apache Iceberg or Parquet), it automatically adds several metadata columns to track the lifecycle and processing history of each record. These columns help users understand how and when each row was captured and written to the destination.
OLake Go adds these metadata columns to every destination table: _op_type, _olake_id, _olake_timestamp, and _cdc_timestamp.
1. Operation Types (_op_type):
- Read/Snapshot (`r`): Appears during initial full table loads (snapshots) from the source database.
  - Example: When you first sync a table (a full refresh), all existing rows get `_op_type = 'r'`.
- Create/Insert (`c`): Generated during CDC, when new records are inserted into the source database.
  - Example: After the initial sync, inserting a new record creates a row with `_op_type = 'c'`.
- Update (`u`): Created when existing records are modified in the source database.
  - Example: Updating a column value in the table generates a new row with `_op_type = 'u'`.
- Delete (`d`): Generated when records are deleted from the source database.
  - When a record is deleted, the primary key column and the OLake-generated columns retain their values for tracking, while all other columns are set to `None` or appear blank.
  - Example: Deleting a record from the table creates a tombstone row with `_op_type = 'd'`.
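To make the four operation types concrete, here is a minimal Python sketch of how a downstream consumer might collapse a sequence of change rows into the current table state. It is an illustration only, not OLake Go's own logic: the function name `apply_changes`, the sample rows, and the choice of `id` as the primary key column are all assumptions.

```python
def apply_changes(rows, key="id"):
    """Collapse change rows (in arrival order) into the current table state."""
    state = {}
    for row in rows:
        op = row["_op_type"]
        if op in ("r", "c", "u"):
            # Snapshot reads, inserts, and updates all carry the full row;
            # metadata columns (leading underscore) are dropped here.
            state[row[key]] = {k: v for k, v in row.items() if not k.startswith("_")}
        elif op == "d":
            # A tombstone keeps only the key, so remove the record.
            state.pop(row[key], None)
        else:
            raise ValueError(f"unknown _op_type: {op!r}")
    return state


# Hypothetical change rows for a two-column table.
rows = [
    {"id": 1, "name": "Alice", "_op_type": "r"},
    {"id": 2, "name": "Bob", "_op_type": "c"},
    {"id": 1, "name": "Alicia", "_op_type": "u"},
    {"id": 2, "name": None, "_op_type": "d"},
]
current = apply_changes(rows)
```

After replaying these rows, only the updated record with `id` 1 remains, because the record with `id` 2 was inserted and then deleted.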
2. Timestamps:
- `_olake_timestamp`: Captures the exact time when OLake Go processed and wrote the record to the destination.
  - Useful for tracking when data was ingested into the data lake, debugging sync latency, and understanding processing order.
- `_cdc_timestamp`: Reflects the actual time when the change occurred in the source database.
  - Present only when the Update Method (set during Source Configuration) is CDC.
  - Whenever a full refresh is performed, snapshot records (`_op_type = 'r'`) have `_cdc_timestamp` set to the epoch time (1970-01-01), indicating that they are not CDC records. Even in Full Refresh + CDC mode, the very first sync run starts with a full refresh, so you may see 1970-01-01 timestamps in that case too.
  - For CDC records (`_op_type` of `'c'`, `'u'`, or `'d'`), it provides the precise timestamp of when the insert, update, or delete happened in the source.
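One use of the two timestamps is estimating sync latency per record. The sketch below assumes the columns have already been read into timezone-aware Python `datetime` values; the function name `sync_latency_seconds` and the sample rows are hypothetical. Snapshot rows are skipped because their `_cdc_timestamp` is the epoch placeholder rather than a real change time.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sync_latency_seconds(row):
    """Return ingestion lag in seconds, or None for snapshot rows."""
    if row["_op_type"] == "r" or row["_cdc_timestamp"] == EPOCH:
        # Snapshot rows carry the epoch placeholder, not a change time.
        return None
    return (row["_olake_timestamp"] - row["_cdc_timestamp"]).total_seconds()


# Hypothetical rows: one snapshot record and one CDC update.
snapshot_row = {
    "_op_type": "r",
    "_cdc_timestamp": EPOCH,
    "_olake_timestamp": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
}
update_row = {
    "_op_type": "u",
    "_cdc_timestamp": datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    "_olake_timestamp": datetime(2024, 1, 1, 12, 0, 2, 500000, tzinfo=timezone.utc),
}
```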
3. Record Identification (_olake_id):
OLake Go assigns a unique, deterministic identifier to every record as it is processed. The identifier is designed primarily for deduplication and internal ordering/tracking.
How it is generated:
- Single primary key present in the source table: `_olake_id` equals the record's primary key value, so duplicates on the key will be detected.
  - Example: If the primary key is `id` with value 123, then `_olake_id` will be 123.
- Composite primary key present in the source table: `_olake_id` is a stable hash of all primary key columns for the record, which ensures that duplicates on the key will be detected.
  - Example: If the composite primary key is `(order_id, product_id)` with values (456, 789), then `_olake_id` will be a hash of (456, 789).
- No primary key in the source table: `_olake_id` equals the hash of all columns of the record. In this situation, deduplication cannot be guaranteed, as two different records could have the same hash value.
  - Example: If a record has columns `(name, age, city)` with values (Alice, 30, NY), then `_olake_id` will be a hash of (Alice, 30, NY).
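The three cases above can be sketched in Python as follows. This is only an illustration of the rules as described: the hash function (SHA-256), the column separator, the string return type, and the name `sketch_olake_id` are assumptions, and the identifier OLake Go actually produces may be computed differently.

```python
import hashlib


def sketch_olake_id(record, primary_keys):
    """Derive a deterministic identifier following the three documented cases."""
    if len(primary_keys) == 1:
        # Single primary key: use its value directly.
        return str(record[primary_keys[0]])
    # Composite key: hash the key columns. No key: hash every column.
    columns = primary_keys if primary_keys else sorted(record)
    # Assumed encoding: values joined with a unit-separator character.
    joined = "\x1f".join(str(record[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


single = sketch_olake_id({"id": 123, "name": "x"}, ["id"])
composite = sketch_olake_id(
    {"order_id": 456, "product_id": 789, "qty": 1}, ["order_id", "product_id"]
)
no_key = sketch_olake_id({"name": "Alice", "age": 30, "city": "NY"}, [])
```

Because the composite hash covers only the key columns, a later update to a non-key column (such as `qty`) yields the same identifier, which is what makes deduplication on the key possible.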
4. CDC Metadata Columns
OLake Go also writes driver-specific CDC ordering metadata columns. These fields capture the exact log position and ordering information from each source system's native CDC mechanism and are supported for the MongoDB, Postgres, MySQL, and MSSQL drivers.
These columns are especially useful in scenarios where multiple changes happen within the same transaction (and thus share a single _cdc_timestamp), because they provide a stable, log-based key (for example, binlog position or LSN) that downstream systems can use to order events exactly as they occurred in the source.
MongoDB
- Column: `_cdc_resume_token`
  - `_cdc_resume_token` stores the MongoDB Change Streams resume token that identifies the exact position in the oplog where this change event was captured.
  - Example: `_cdc_resume_token: "82698C3243000000022B0429296E1404"`
MySQL
- Columns: `_cdc_binlog_file_name`, `_cdc_binlog_file_pos`
  - Together, these columns identify the exact location in MySQL's binary log where the change event was read. `_cdc_binlog_file_name` specifies which binlog file, and `_cdc_binlog_file_pos` indicates the byte position within that file.
  - Example: `_cdc_binlog_file_name = "mysql-bin.000003"`, `_cdc_binlog_file_pos = 1027`
Postgres
- Column: `_cdc_lsn`
  - `_cdc_lsn` stores the Write-Ahead Log (WAL) Log Sequence Number that represents the precise position in Postgres's transaction log where this change was recorded. LSN values are monotonically increasing and can be used to order events in the exact sequence they were applied by Postgres.
  - Example: `_cdc_lsn = "16/B374D848"`
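When ordering events by `_cdc_lsn`, note that the textual form is two hexadecimal numbers separated by a slash, so comparing the strings directly can give the wrong order. A small Python sketch, assuming the column is read as a string (the helper name `parse_pg_lsn` and the sample events are hypothetical):

```python
def parse_pg_lsn(lsn):
    """Convert a textual LSN such as '16/B374D848' into a 64-bit integer."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after the lower 32.
    return (int(high, 16) << 32) | int(low, 16)


# Hypothetical change rows that share a transaction timestamp.
events = [
    {"id": "b", "_cdc_lsn": "10/0"},
    {"id": "a", "_cdc_lsn": "9/FFFFFFF0"},
    {"id": "c", "_cdc_lsn": "16/B374D848"},
]
ordered = sorted(events, key=lambda e: parse_pg_lsn(e["_cdc_lsn"]))
ordered_ids = [e["id"] for e in ordered]
```

A plain string sort would place `"10/0"` before `"9/FFFFFFF0"`, whereas the numeric comparison puts `a` first.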
MSSQL
- Columns: `_cdc_start_lsn`, `_cdc_seqval`
  - `_cdc_start_lsn` indicates the starting Log Sequence Number for the change row in SQL Server's CDC change table, while `_cdc_seqval` is a sequence value that orders multiple changes that share the same LSN. Together, they provide a stable ordering key that matches SQL Server's internal CDC ordering.
  - Example: `_cdc_start_lsn = "000000bb000055d00003"`, `_cdc_seqval = "000000bb000055d00002"`
This feature is available starting from v0.3.16.
For jobs created before this feature was released, these columns are always present for Parquet-on-S3 destinations. For Iceberg destinations, they appear only if Normalization was turned on; older jobs with Iceberg Normalization off will not have them.
General Terminologies
1. Data Lake:
A Data Lake is a central storage system that holds large volumes of raw or processed data in open formats such as Parquet, CSV, or JSON. In OLake Go, data is written to object stores like Amazon S3 in Parquet format, making it easy to use these storage systems as data lakes for analytics and processing.
2. Data Lakehouse:
A Data Lakehouse is a modern architecture that combines the scalability and flexibility of data lakes with the reliability and performance features of data warehouses, such as transactions, schema enforcement, and query optimization. In OLake Go, this is achieved through Apache Iceberg, which adds ACID transactions and schema evolution on top of data lakes, enabling a true lakehouse environment.
3. Data Warehouse:
A Data Warehouse is a system where cleaned and structured data from many sources is stored for analysis and reporting.
4. Snapshot:
A Snapshot is a one-time, point-in-time capture of the entire dataset from the source. In OLake Go, an initial snapshot is taken when a table or collection is first onboarded.
5. Polymorphic (or Heterogeneous) Data:
Polymorphic (or Heterogeneous) Data means data in the same dataset that doesn't always follow the exact same structure. This is common in NoSQL systems like MongoDB, where one record might have extra fields that another record doesn't. For instance, in an e-commerce database, one product document might include a "color" field while another doesn't, and tools need to handle both without errors.
6. Chunking:
Chunking (or Parallel Chunk-Based Loading) is the process of splitting a large dataset, such as a table or collection, into smaller segments called chunks. In OLake Go, all the chunks are first created, and then processed in parallel during full refresh operations. This approach makes it possible to load large collections more efficiently, significantly reducing the overall time needed for high-volume data transfers.
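As a generic illustration of chunk creation, the sketch below splits a numeric key range into fixed-size chunks. It assumes a table with an integer key and a hypothetical `make_chunks` helper; the chunking strategy OLake Go actually uses for a given driver is not described here and may differ.

```python
def make_chunks(min_id, max_id, chunk_size):
    """Split the inclusive id range [min_id, max_id] into (start, end) chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = min_id
    while start <= max_id:
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, max_id)
        chunks.append((start, end))
        start = end + 1
    return chunks


chunks = make_chunks(1, 10, 4)
```

Each `(start, end)` pair can then be read independently, for example with a `WHERE id BETWEEN start AND end` style query, which is what makes parallel loading possible.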
7. Thread Count/Concurrency:
Thread Count (or Concurrency) is the number of parallel processes or threads used to read, transform, and write data simultaneously. Increasing concurrency often speeds up ingestion and improves throughput, but it can also put additional load on the source system. In OLake Go, this is managed using max_threads in the CLI or Max Threads in the UI, which defines how many chunks or streams are processed in parallel.
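The effect of a thread limit can be sketched with Python's standard `concurrent.futures` thread pool. The `load_chunk` function is a hypothetical stand-in for reading a chunk and writing it to the destination, and the `max_threads` argument only mirrors the setting named above; it is not OLake Go's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(chunk):
    """Stand-in for reading one chunk and writing it; returns rows handled."""
    start, end = chunk
    return end - start + 1


def run_full_refresh(chunks, max_threads):
    """Process chunks with at most max_threads running at the same time."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order and propagates worker exceptions.
        return sum(pool.map(load_chunk, chunks))


total_rows = run_full_refresh([(1, 4), (5, 8), (9, 10)], max_threads=2)
```

Raising `max_threads` lets more chunks run at once, which is also why it increases load on the source system.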
8. Writer:
A Writer in OLake Go is the component responsible for formatting and writing data from the source to the chosen destination, such as Parquet files or Apache Iceberg tables. By writing directly from the source to the destination, OLake Go removes the need for an intermediate data queue, which reduces latency and simplifies the overall pipeline.
9. Parquet:
Parquet is a columnar storage file format designed for efficient data compression and encoding, making it highly optimized for analytical workloads. In OLake Go, data is stored in Parquet format so that downstream systems like Spark or Trino can query it efficiently, improving both performance and storage utilisation.
10. Apache Iceberg:
Apache Iceberg is an open table format built for large-scale analytics that provides features like ACID transactions, schema evolution, partitioning, and time travel. In OLake Go, the Iceberg Writer outputs data as Iceberg-compatible Parquet files, allowing users to build a true Lakehouse architecture with reliable transaction semantics and efficient query performance.
11. Equality Deletes (Iceberg):
Equality Deletes in Apache Iceberg are a way to handle row-level deletions by matching specific record attributes, such as primary keys, to identify which rows should be removed. In OLake Go, equality deletes are used to implement upserts, ensuring that changes captured through CDC are correctly applied so that updated or deleted rows are accurately reflected in the Iceberg table.
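The upsert pattern can be modelled in a few lines of Python. This is a toy in-memory model, not Iceberg or OLake Go code: the table layout and the `upsert`, `delete`, and `scan` helpers are hypothetical. It does follow the Iceberg rule that an equality delete applies only to data written with a strictly lower sequence number, which is why a row written in the same commit as its delete survives.

```python
def new_table():
    return {"data": [], "eq_deletes": [], "seq": 0}


def upsert(table, row, key="id"):
    """Write an equality delete for the key, then the new row, in one commit."""
    table["seq"] += 1
    seq = table["seq"]
    table["eq_deletes"].append((row[key], seq))
    table["data"].append((row, seq))


def delete(table, key_value):
    """Record an equality delete only; no data row is written."""
    table["seq"] += 1
    table["eq_deletes"].append((key_value, table["seq"]))


def scan(table, key="id"):
    """Return rows not matched by any later equality delete."""
    live = []
    for row, data_seq in table["data"]:
        removed = any(
            value == row[key] and data_seq < delete_seq
            for value, delete_seq in table["eq_deletes"]
        )
        if not removed:
            live.append(row)
    return live


table = new_table()
upsert(table, {"id": 1, "name": "Alice"})
upsert(table, {"id": 2, "name": "Bob"})
upsert(table, {"id": 1, "name": "Alicia"})  # CDC update for id 1
delete(table, 2)                            # CDC delete for id 2
visible = scan(table)
```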
12. Real Time Replication:
Real-Time Replication is the process of continuously capturing and applying database changes with minimal delay, ensuring the destination stays closely in sync with the source. Change Data Capture (CDC) in OLake Go captures and applies database changes from the source to the destination. By default, CDC runs as a job: it captures recent changes, replicates them, and then stops. However, when combined with orchestration tools (such as Airflow), these jobs can be scheduled at frequent intervals, effectively enabling continuous or near real-time replication to keep destinations closely in sync with sources.
13. State Management:
State Management is the practice of tracking metadata about processed data, such as the last CDC offset or the last completed chunk, so that jobs can resume or continue without losing or duplicating records. In OLake Go, this is handled through a state file that stores checkpoint offsets or timestamps, allowing interrupted processes to restart from the exact point of failure and ensuring data consistency.
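A minimal sketch of checkpointing with a JSON state file is shown below. The file layout, the `orders` stream name, and the `cursor` key are hypothetical and are not the format of OLake Go's state file; the point is only that the checkpoint is written atomically and read back on the next run.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the checkpoint to a temp file, then swap it in atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # os.replace avoids leaving a half-written state file after a crash.
    os.replace(tmp_path, path)


def load_state(path):
    """Return the saved checkpoint, or an empty state on the first run."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


state_path = os.path.join(tempfile.mkdtemp(), "state.json")
first_run = load_state(state_path)                      # nothing saved yet
save_state(state_path, {"orders": {"cursor": 1027}})    # checkpoint after a run
resumed = load_state(state_path)                        # next run starts here
```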
14. Catalog:
A Catalog in Apache Iceberg is the metadata and namespace service that manages Iceberg tables. It acts as the central registry where tables are created, organized, and discovered. It stores table metadata locations (not the data itself) and provides a namespace structure (like databases and schemas in SQL).
15. Flattening (JSON Flattening):
JSON Flattening converts nested JSON fields into top-level columns for simpler and faster querying. OLake Go currently supports Level-1 JSON flattening through its "Normalization" feature, converting top-level nested fields into separate columns for easier querying.
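One way to picture Level-1 flattening is the Python sketch below. It assumes that each top-level key becomes a column and that any value that is still nested (an object or array) is kept as a JSON string; that handling of deeper levels, and the `flatten_level_1` name, are assumptions for illustration rather than a description of the Normalization feature's exact output.

```python
import json


def flatten_level_1(document):
    """Turn top-level keys into columns; keep deeper nesting as JSON text."""
    columns = {}
    for key, value in document.items():
        if isinstance(value, (dict, list)):
            # Assumed behaviour: nested values are serialized, not expanded.
            columns[key] = json.dumps(value, sort_keys=True)
        else:
            columns[key] = value
    return columns


row = flatten_level_1({"id": 1, "name": "Alice", "address": {"city": "NY"}})
```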
16. Performance Benchmarks:
Performance Benchmarks are structured tests that measure how efficiently a system processes data under defined conditions, such as dataset size, concurrency, or hardware resources. In OLake Go, benchmarks focus on metrics like throughput (rows per second) and total load times for large datasets, often demonstrating faster ingestion and replication compared to traditional ETL or CDC tools.
17. Monitoring & Alerting:
Monitoring & Alerting refers to the practices and tools used to track system activity, capture metrics, and generate notifications when issues or anomalies occur. In OLake Go, monitoring provides real-time visibility into all sync modes, while alerting notifies users about events like schema changes in the source or job failures, ensuring problems are quickly identified and addressed.
18. BYOC (Bring Your Own Cloud):
BYOC (Bring Your Own Cloud) is the approach of running software within the user's chosen cloud or infrastructure instead of being tied to a vendor-managed environment. In OLake Go, this means the platform is cloud-agnostic, supporting deployments across AWS, GCP, Azure, or even on-premises setups, giving users flexibility without vendor lock-in.
19. Query Engines (Trino, Spark, Flink, Snowflake):
Trino, Spark, Flink, and Snowflake are popular data processing and query engines that can work directly with open data formats like Parquet and Iceberg. In OLake Go, writing data in these open formats ensures seamless compatibility, allowing these engines to query and process the data without requiring proprietary connectors or vendor lock-in.
20. gRPC:
gRPC is a communication framework created by Google that lets different services talk to each other quickly and efficiently. It's often used in systems where data needs to move in real time, like event-driven applications or data pipelines. In data engineering, gRPC helps microservices share data safely and at high speed, supports streaming of large datasets (like logs or analytics data), and is also used to connect with machine learning models for fast predictions.
21. AWS S3:
AWS S3 (Simple Storage Service) is Amazon's cloud storage where you can keep any amount of data safely and at low cost. It's widely used in data engineering as a data lake to store raw and processed data, a landing zone for ETL pipelines, and for backups or long-term archiving. Because it integrates well with big data and analytics tools (like Apache Iceberg and Spark), S3 is one of the most common places where companies keep data for reporting, AI, and large-scale processing.
22. ETL (Extract, Transform, Load):
ETL (Extract, Transform, Load) is a process used to move data from one place to another in three steps. First, the data is extracted (pulled) from a source system like a database. Then it is transformed (cleaned, reformatted, or enriched) so it's consistent and ready to use. Finally, it is loaded into a target system such as a data warehouse or data lake, where it can be used for reporting or analysis.
23. ELT (Extract, Load, Transform):
ELT (Extract, Load, Transform) is a process where data is first extracted (pulled) from the source and then loaded directly into a central system like a data warehouse or data lake. After loading, the data is transformed inside that system using its own computing power (like Spark or Snowflake). This approach is popular for handling large amounts of raw data because the heavy transformations happen after the data is already stored in one place.
24. WAL (Write-Ahead Log):
Write-Ahead Log (WAL) is a mechanism, most commonly used in PostgreSQL, that records all database changes (inserts, updates, deletes) before they are applied to the main storage. This ensures durability, consistency, and crash recovery, and it also serves as a foundation for replication and Change Data Capture (CDC).
25. Binlog (MySQL):
Binlog (Binary Log) is a MySQL database log that records all changes to data (inserts, updates, deletes) in the exact order they occur. It enables replication, point-in-time recovery, and Change Data Capture (CDC), ensuring downstream systems stay in sync without requiring full reloads.
26. Oplog (MongoDB):
Oplog (Operations Log) is a log specific to MongoDB that records all changes made to a database (inserts, updates, deletes) in chronological order. It enables replication and real-time Change Data Capture (CDC), ensuring downstream systems or replica nodes stay up to date without full reloads.
27. Table Format (e.g., Apache Iceberg, Delta Lake, Hudi):
A table format manages your data lake files like a database table, with transactions, schema evolution, and query efficiency.
28. Schema Evolution:
Schema evolution lets your data tables adapt to schema changes without disrupting pipelines or queries.
29. ACID Transactions:
ACID stands for Atomicity, Consistency, Isolation, and Durability, a set of properties that guarantee reliable and safe database transactions. They ensure that updates happen fully or not at all (Atomicity), keep data valid (Consistency), let multiple users work without interfering with each other (Isolation), and never lose changes once saved (Durability).
30. Avro:
Avro is a file format that stores data in rows and is popular in streaming systems because it's fast and works well with changing schemas.
31. Orchestration (e.g., Airflow, Luigi, Dagster):
These are tools that schedule and manage data pipeline tasks, making sure jobs run in the right order, handle errors, and retry if something fails.
32. Batch vs. Streaming:
Batch means processing data in bulk at set times (like once a day), while streaming means handling data as it comes in, almost instantly.
33. Polyglot Persistence:
A strategy of using different storage technologies to handle different data needs. For instance, MongoDB for flexible documents, MySQL for transactions, and S3 for analytics.
34. Aggregation / Analytical Query:
Operations that summarise or derive insights from data (sums, counts, averages, group-by, etc.), typically run on columnar storage for efficiency.
35. Federation / Federated Query:
The ability to query data from different sources (databases, lakes, APIs) using one query engine.
36. Ingestion Latency:
The time it takes for new data to move from the source system to the destination or analytics tool.
37. Idempotency:
A property of an operation where repeating it has the same effect as doing it once (important for data pipelines to avoid duplicates).
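The difference is easy to see by replaying the same batch twice, as a retry after a failure might. In this Python sketch (the helper names and sample batch are hypothetical), a key-based upsert is idempotent while a blind append is not.

```python
def upsert_batch(target, batch, key="id"):
    """Idempotent: writing the same batch again leaves the target unchanged."""
    for row in batch:
        target[row[key]] = row
    return target


def append_batch(target, batch):
    """Not idempotent: every replay adds the rows again."""
    target.extend(batch)
    return target


batch = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

upserted = upsert_batch({}, batch)
upserted_twice = upsert_batch(dict(upserted), batch)   # simulated retry

appended = append_batch([], batch)
appended_twice = append_batch(list(appended), batch)   # simulated retry
```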
38. Observability (Logging, Metrics, Tracing):
Tools and processes that help you understand and diagnose the behavior, performance, and health of your data pipelines.
39. Change Data Capture:
Change Data Capture (CDC) is a mode where only changes in the source database (inserts, updates, and deletes) are captured and replicated downstream. This allows near real-time synchronization without performing a full table reload, keeping target systems up to date efficiently. By default, CDC runs as a job: it captures recent changes, replicates them, and then stops. However, when combined with orchestration tools (such as Airflow), these jobs can be scheduled at frequent intervals, effectively enabling continuous or near real-time replication.
40. Full Refresh:
Full Refresh is the process of reloading an entire table or collection from the source into the destination. This ensures the destination is a complete, up-to-date copy of the source, but it can be time-consuming and resource-intensive for large datasets. Full refresh is typically used for the first load of a dataset or when incremental tracking is not possible.
41. Incremental:
Incremental replication is a method of loading only the new or updated records from a source system since the last successful run, using a tracking column such as a timestamp or an incrementing ID (cursor key). In OLake Go, incremental replication works through a cursor-based approach. Each run tracks the highest cursor_key value (for example, last_updated_at or an increasing primary key) that was processed previously. On the next run, OLake Go fetches only the rows where cursor_key is greater than the saved checkpoint and appends those records to the destination. This avoids reloading the full dataset and makes ongoing synchronization more efficient.
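The cursor-based approach described above can be sketched in Python as follows. The in-memory `source` list, the `state` dictionary, and the `incremental_sync` helper are hypothetical stand-ins for the source query and the saved checkpoint.

```python
def incremental_sync(source_rows, state, cursor_key):
    """Return rows past the saved checkpoint and advance the checkpoint."""
    checkpoint = state.get("cursor")
    new_rows = [
        row for row in source_rows
        if checkpoint is None or row[cursor_key] > checkpoint
    ]
    if new_rows:
        # Remember the highest cursor value seen for the next run.
        state["cursor"] = max(row[cursor_key] for row in new_rows)
    return new_rows


source = [
    {"id": 1, "last_updated_at": "2024-01-01T10:00:00"},
    {"id": 2, "last_updated_at": "2024-01-01T11:00:00"},
]
state = {}
first_run = incremental_sync(source, state, "last_updated_at")   # both rows
second_run = incremental_sync(source, state, "last_updated_at")  # nothing new

source.append({"id": 3, "last_updated_at": "2024-01-02T09:00:00"})
third_run = incremental_sync(source, state, "last_updated_at")   # only id 3
```

Because the comparison is strictly greater than the checkpoint, a row written later with a cursor value equal to the saved one would be skipped, so a cursor column with sufficiently fine-grained or strictly increasing values is a safer choice.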
42. Position Deletes (Iceberg):
Position deletes in Apache Iceberg identify which rows to remove by file path and row position within that file, rather than by matching column values. They are used when the table has no unique key or when deletes are applied at write time. During compaction, position deletes are merged into the data files so that deleted rows are physically removed and storage can be reclaimed.
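As a toy model of the idea, the Python sketch below applies a list of (file path, row position) pairs to in-memory "data files". Positions are counted from 0 within each file, as in Iceberg; the file paths, rows, and the `apply_position_deletes` helper are hypothetical, and real readers apply these deletes while scanning rather than by rewriting lists.

```python
def apply_position_deletes(data_files, position_deletes):
    """Drop rows identified by (file_path, position) from each data file."""
    deleted = {}
    for path, pos in position_deletes:
        deleted.setdefault(path, set()).add(pos)
    return {
        path: [
            row for pos, row in enumerate(rows)
            if pos not in deleted.get(path, set())
        ]
        for path, rows in data_files.items()
    }


data_files = {
    "data/file-a.parquet": ["alice", "bob", "carol"],
    "data/file-b.parquet": ["dave", "erin"],
}
position_deletes = [("data/file-a.parquet", 1), ("data/file-b.parquet", 0)]
remaining = apply_position_deletes(data_files, position_deletes)
```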
43. Small Files:
Small files are data files that are much smaller than the ideal size for efficient querying and storage (for example, well below the target or block size). Having many small files can hurt query performance (more metadata and I/O) and increase storage overhead. Iceberg maintenance (compaction or optimization) combines small files into larger ones so that reads and scans are more efficient.
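To illustrate how compaction groups small files, here is a simple greedy packing sketch in Python. It is not Iceberg's actual planning algorithm: the `plan_compaction` helper, the sizes, and the 128 MB target are assumptions chosen only to show files being combined up to a target size.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Greedily group files (smallest first) so each group stays under target."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            # Adding this file would exceed the target; close the group.
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


groups = plan_compaction([5, 120, 10, 20, 100], target_mb=128)
```

Here the three smallest files end up in one group that would be rewritten as a single larger file, while the two files already near the target are left in groups of their own.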