General Terminologies

1. Data Lake:

A Data Lake is a central storage system that holds large volumes of raw or processed data in open formats such as Parquet, CSV, or JSON. In OLake, data is written to object stores like Amazon S3 in Parquet format, making it easy to use these storage systems as data lakes for analytics and processing.

2. Data Lakehouse:

A Data Lakehouse is a modern architecture that combines the scalability and flexibility of data lakes with the reliability and performance features of data warehouses, such as transactions, schema enforcement, and query optimization. In OLake, this is achieved through Apache Iceberg, which adds ACID transactions and schema evolution on top of data lakes, enabling a true lakehouse environment.

3. Data Warehouse:

A Data Warehouse is a system where cleaned and structured data from many sources is stored for analysis and reporting.

4. Snapshot:

A Snapshot is a one-time, point-in-time capture of the entire dataset from the source. In OLake, an initial snapshot is taken when a table or collection is first onboarded.

5. Polymorphic (or Heterogeneous) Data:

Polymorphic (or Heterogeneous) Data means data in the same dataset that doesn’t always follow the exact same structure. This is common in NoSQL systems like MongoDB, where one record might have extra fields that another record doesn’t. For instance, in an e-commerce database, one product document might include a “color” field while another doesn’t, and tools need to handle both without errors.
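As a rough illustration of what "handling both without errors" looks like in practice, the Python sketch below (not OLake code; the documents and field names are made up) reads two product documents, one with and one without a "color" field, and still produces rows with a consistent set of columns:

```python
# Illustrative only: two documents from the same collection with different shapes.
products = [
    {"_id": 1, "name": "T-shirt", "color": "blue"},
    {"_id": 2, "name": "Mug"},  # no "color" field at all
]

for doc in products:
    row = {
        "_id": doc["_id"],
        "name": doc["name"],
        "color": doc.get("color"),  # a missing field simply becomes None/NULL
    }
    print(row)
```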

6. Chunking:

Chunking (or Parallel Chunk-Based Loading) is the process of splitting a large dataset, such as a table or collection, into smaller segments called chunks. In OLake, all the chunks are first created, and then processed in parallel during full refresh operations. This approach makes it possible to load large collections more efficiently, significantly reducing the overall time needed for high-volume data transfers.

7. Thread Count/Concurrency:

Thread Count (or Concurrency) is the number of parallel processes or threads used to read, transform, and write data simultaneously. Increasing concurrency often speeds up ingestion and improves throughput, but it can also put additional load on the source system. In OLake, this is managed using max_threads in the CLI or Max Threads in the UI, which defines how many chunks or streams are processed in parallel.
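To make chunking and thread count concrete, here is a minimal Python sketch of the idea: the dataset is first split into chunks, and the chunks are then loaded in parallel by a bounded pool of workers. The chunk size, row count, and `load_chunk` function are hypothetical; the pool's `max_workers` plays the same role as `max_threads` / Max Threads described above.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THREADS = 4          # analogous to max_threads / Max Threads in OLake
CHUNK_SIZE = 100_000     # rows per chunk (illustrative value)
TOTAL_ROWS = 1_000_000

def load_chunk(start, end):
    # Hypothetical loader: read rows with start <= id < end from the source
    # and write them to the destination.
    print(f"loading rows {start}..{end}")

# Step 1: plan all chunks up front.
chunks = [(lo, min(lo + CHUNK_SIZE, TOTAL_ROWS))
          for lo in range(0, TOTAL_ROWS, CHUNK_SIZE)]

# Step 2: process the chunks in parallel, bounded by the thread count.
with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    pool.map(lambda c: load_chunk(*c), chunks)
```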

8. Writer:

A Writer in OLake is the component responsible for formatting and writing data from the source to the chosen destination, such as Parquet files or Apache Iceberg tables. By writing directly from the source to the destination, OLake removes the need for an intermediate data queue, which reduces latency and simplifies the overall pipeline.

9. Parquet:

Parquet is a columnar storage file format designed for efficient data compression and encoding, making it highly optimized for analytical workloads. In OLake, data is stored in Parquet format so that downstream systems like Spark or Trino can query it efficiently, improving both performance and storage utilisation.

10. Apache Iceberg:

Apache Iceberg is an open table format built for large-scale analytics that provides features like ACID transactions, schema evolution, partitioning, and time travel. In OLake, the Iceberg Writer outputs data as Iceberg-compatible Parquet files, allowing users to build a true Lakehouse architecture with reliable transaction semantics and efficient query performance.

11. Equality Deletes (Iceberg):

Equality Deletes in Apache Iceberg are a way to handle row-level deletions by matching specific record attributes, such as primary keys, to identify which rows should be removed. In OLake, equality deletes are used to implement upserts, ensuring that changes captured through CDC are correctly applied so that updated or deleted rows are accurately reflected in the Iceberg table.
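The sketch below is a conceptual illustration of the idea, not the Iceberg API: an equality delete lists key values, readers drop any data row whose key matches one of them, and an upsert becomes "equality delete on the key, then append the new version".

```python
# Conceptual only: rows and delete entries are plain dicts keyed on "id".
data_rows = [
    {"id": 1, "name": "old name"},
    {"id": 2, "name": "kept"},
]
equality_deletes = [{"id": 1}]          # "delete every row where id = 1"

delete_keys = {d["id"] for d in equality_deletes}
live_rows = [r for r in data_rows if r["id"] not in delete_keys]

# An upsert is then "equality delete on the key + append the new version":
live_rows.append({"id": 1, "name": "new name"})
print(live_rows)   # id 1 now appears only with its new values
```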

12. Real Time Replication:

Real-Time Replication is the process of continuously capturing and applying database changes with minimal delay, ensuring the destination stays closely in sync with the source. In OLake, this is achieved through Change Data Capture (CDC), which captures changes from the source and applies them to the destination. By default, CDC runs as a job — it captures recent changes, replicates them, and then stops. However, when combined with orchestration tools (such as Airflow), these jobs can be scheduled at frequent intervals, effectively enabling continuous or near real-time replication.

13. State Management:

State Management is the practice of tracking metadata about processed data such as the last CDC offset or the last completed chunk so that jobs can resume or continue without losing or duplicating records. In OLake, this is handled through a state file that stores checkpoint offsets or timestamps, allowing interrupted processes to restart from the exact point of failure and ensuring data consistency.
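A minimal sketch of checkpointed state handling is shown below. The file name and field names (`last_cdc_offset`, `completed_chunks`) are illustrative assumptions, not OLake's actual state file format; the point is that progress is persisted so an interrupted job can resume from its last checkpoint.

```python
import json
import os

STATE_FILE = "state.json"   # hypothetical path

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"last_cdc_offset": None, "completed_chunks": []}

def save_state(state):
    # Persist progress so an interrupted job can restart from this point.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

state = load_state()
state["completed_chunks"].append("chunk-0001")
save_state(state)
```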

14. Catalog:

A catalog in Apache Iceberg is the metadata and namespace service that manages Iceberg tables. It acts as the central registry where tables are created, organized, and discovered. It stores table metadata locations (not the data itself) and provides a namespace structure (like databases and schemas in SQL).

15. Flattening (JSON Flattening):

JSON Flattening converts nested JSON fields into top-level columns for simpler and faster querying. OLake currently supports Level-1 JSON flattening through its ‘Normalization’ feature, converting top-level nested fields into separate columns for easier querying.
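The Python sketch below shows what Level-1 flattening does to a record. The `key_subkey` column-naming convention is an assumption for illustration, not necessarily how OLake names normalized columns.

```python
def flatten_level_1(record):
    """Promote the keys of top-level nested objects to their own columns."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat

doc = {"id": 7, "address": {"city": "Pune", "zip": "411001"}}
print(flatten_level_1(doc))
# {'id': 7, 'address_city': 'Pune', 'address_zip': '411001'}
```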

16. Performance Benchmarks:

Performance Benchmarks are structured tests that measure how efficiently a system processes data under defined conditions, such as dataset size, concurrency, or hardware resources. In OLake, benchmarks focus on metrics like throughput (rows per second) and total load times for large datasets, often demonstrating faster ingestion and replication compared to traditional ETL or CDC tools.

17. Monitoring & Alerting:

Monitoring & Alerting refers to the practices and tools used to track system activity, capture metrics, and generate notifications when issues or anomalies occur. In OLake, monitoring provides real-time visibility into all sync modes, while alerting notifies users about events like schema changes in the source or job failures, ensuring problems are quickly identified and addressed.

18. BYOC (Bring Your Own Cloud):

BYOC (Bring Your Own Cloud) is the approach of running software within the user’s chosen cloud or infrastructure instead of being tied to a vendor-managed environment. In OLake, this means the platform is cloud-agnostic, supporting deployments across AWS, GCP, Azure, or even on-premises setups, giving users flexibility without vendor lock-in.

19. Query Engines (Trino, Spark, Flink, Snowflake):

Trino, Spark, Flink, and Snowflake are popular data processing and query engines that can work directly with open data formats like Parquet and Iceberg. In OLake, writing data in these open formats ensures seamless compatibility, allowing these engines to query and process the data without requiring proprietary connectors or vendor lock-in.

20. gRPC:

gRPC is a communication framework created by Google that lets different services talk to each other quickly and efficiently. It's often used in systems where data needs to move in real time, like event-driven applications or data pipelines. In data engineering, gRPC helps microservices share data safely and at high speed, supports streaming of large datasets (like logs or analytics data), and is also used to connect with machine learning models for fast predictions.

21. AWS S3:

AWS S3 (Simple Storage Service) is Amazon's cloud storage where you can keep any amount of data safely and at low cost. It's widely used in data engineering as a data lake to store raw and processed data, a landing zone for ETL pipelines, and for backups or long-term archiving. Because it integrates well with big data and analytics tools (like Apache Iceberg and Spark), S3 is one of the most common places where companies keep data for reporting, AI, and large-scale processing.

22. ETL (Extract, Transform, Load):

ETL (Extract, Transform, Load) is a process used to move data from one place to another in three steps. First, the data is extracted (pulled) from a source system like a database. Then it is transformed (cleaned, reformatted, or enriched) so it's consistent and ready to use. Finally, it is loaded into a target system such as a data warehouse or data lake, where it can be used for reporting or analysis.
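A toy end-to-end sketch of the three steps is shown below; the data, the cleaning rules, and the "load" target are all made up, and a print statement stands in for writing to a warehouse or lake.

```python
def extract():
    # Pull raw rows from a source system (hard-coded here for illustration).
    return [{"name": " Alice ", "amount": "10"}, {"name": "Bob", "amount": "25"}]

def transform(rows):
    # Clean and reformat so the data is consistent and ready to use.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

def load(rows):
    # Write to the target system.
    for r in rows:
        print("loaded:", r)

load(transform(extract()))
```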

23. ELT (Extract, Load, Transform):

ELT (Extract, Load, Transform) is a process where data is first extracted (pulled) from the source and then loaded directly into a central system like a data warehouse or data lake. After loading, the data is transformed inside that system using its own computing power (like Spark or Snowflake). This approach is popular for handling large amounts of raw data because the heavy transformations happen after the data is already stored in one place.

24. WAL (Write-Ahead Log):

Write-Ahead Log (WAL) is a mechanism, most notably associated with PostgreSQL, that records all database changes (inserts, updates, deletes) before they are applied to the main storage. This ensures durability, consistency, and crash recovery, and it also serves as a foundation for replication and Change Data Capture (CDC).

25. Binlog (MySQL):

Binlog (Binary Log) is a MySQL database log that records all changes to data (inserts, updates, deletes) in the exact order they occur. It enables replication, point-in-time recovery, and Change Data Capture (CDC), ensuring downstream systems stay in sync without requiring full reloads.

26. Oplog (MongoDB):

Oplog (Operations Log) is a log specific to MongoDB that records all changes made to a database (inserts, updates, deletes) in chronological order. It enables replication and real-time Change Data Capture (CDC), ensuring downstream systems or replica nodes stay up to date without full reloads.

27. Table Format (e.g., Apache Iceberg, Delta Lake, Hudi):

A table format manages your data lake files like a database table, with transactions, schema evolution, and query efficiency.

28. Schema Evolution:

Schema evolution lets your data tables adapt to schema changes without disrupting pipelines or queries.

29. ACID Transactions:

ACID stands for Atomicity, Consistency, Isolation, and Durability — a set of properties that guarantee reliable and safe database transactions. They ensure that updates happen fully or not at all (Atomicity), keep data valid (Consistency), let multiple users work concurrently without interfering with each other (Isolation), and never lose changes once they are saved (Durability).

30. Avro:

Avro is a file format that stores data in rows and is popular in streaming systems because it's fast and works well with changing schemas.

31. Orchestration (e.g., Airflow, Luigi, Dagster):

These are tools that schedule and manage data pipeline tasks, making sure jobs run in the right order, handle errors, and retry if something fails.

32. Batch vs. Streaming:

Batch means processing data in bulk at set times (like once a day), while streaming means handling data as it comes in, almost instantly.

33. Polyglot Persistence:

A strategy of using different storage technologies to handle different data needs. For instance, MongoDB for flexible documents, MySQL for transactions, and S3 for analytics.

34. Aggregation / Analytical Query:

Operations that summarise or derive insights from data (sums, counts, averages, group-by, etc.), typically run on columnar storage for efficiency.
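As a tiny illustration, the Python snippet below computes a group-by sum over some made-up order rows, the same operation a query engine would express as a GROUP BY:

```python
from collections import defaultdict

orders = [
    {"country": "IN", "amount": 10},
    {"country": "US", "amount": 25},
    {"country": "IN", "amount": 5},
]

# Equivalent of: SELECT country, SUM(amount) FROM orders GROUP BY country
totals = defaultdict(int)
for o in orders:
    totals[o["country"]] += o["amount"]
print(dict(totals))   # {'IN': 15, 'US': 25}
```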

35. Federation / Federated Query:

The ability to query data from different sources (databases, lakes, APIs) using one query engine.

36. Ingestion Latency:

The time it takes for new data to move from the source system to the destination or analytics tool.

37. Idempotency:

A property of an operation where repeating it has the same effect as doing it once (important for data pipelines to avoid duplicates).
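A minimal sketch of the idea: if writes are keyed by primary key, replaying the same write (for example, after a retry) leaves the destination unchanged instead of creating a duplicate. The in-memory dict below stands in for a real destination.

```python
destination = {}   # keyed by primary key, so re-writing a row is harmless

def write_idempotent(row):
    # Writing the same row twice leaves the destination in the same state
    # as writing it once, so retries and re-runs cannot create duplicates.
    destination[row["id"]] = row

write_idempotent({"id": 1, "status": "shipped"})
write_idempotent({"id": 1, "status": "shipped"})   # retry after a failure
print(destination)   # still exactly one row for id 1
```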

38. Observability (Logging, Metrics, Tracing):

Tools and processes that help you understand and diagnose the behavior, performance, and health of your data pipelines.

39. Change Data Capture:

CDC is a mode where only changes in the source database—inserts, updates, and deletes—are captured and replicated downstream. This allows near real-time synchronization without performing a full table reload, keeping target systems up to date efficiently. By default, CDC runs as a job — it captures recent changes, replicates them, and then stops. However, when combined with orchestration tools (such as Airflow), these jobs can be scheduled at frequent intervals, effectively enabling continuous or near real-time replication.
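The sketch below illustrates the apply side of CDC conceptually (it is not OLake's implementation): a small batch of captured change events is replayed against a destination keyed by primary key.

```python
# Illustrative only: event shapes and field names are assumptions.
destination = {1: {"id": 1, "name": "old"}}

events = [
    {"op": "update", "row": {"id": 1, "name": "new"}},
    {"op": "insert", "row": {"id": 2, "name": "added"}},
    {"op": "delete", "row": {"id": 1}},
]

for e in events:
    if e["op"] in ("insert", "update"):
        destination[e["row"]["id"]] = e["row"]
    elif e["op"] == "delete":
        destination.pop(e["row"]["id"], None)

print(destination)   # {2: {'id': 2, 'name': 'added'}}
```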

40. Full Refresh:

Full Refresh is the process of reloading an entire table or collection from the source into the destination. This ensures the destination is a complete, up-to-date copy of the source, but it can be time-consuming and resource-intensive for large datasets. Full refresh is typically used for the first load of a dataset or when incremental tracking is not possible.

41. Incremental:

Incremental replication is a method of loading only the new or updated records from a source system since the last successful run, using a tracking column such as a timestamp or an incrementing ID (cursor key). In OLake, incremental replication works through a cursor-based approach. Each run tracks the highest cursor_key value (for example, last_updated_at or an increasing primary key) that was processed previously. On the next run, OLake fetches only the rows where cursor_key is greater than the saved checkpoint and appends those records to the destination. This avoids reloading the full dataset and makes ongoing synchronization more efficient.
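A rough sketch of the cursor-based approach follows. The source rows, the `last_updated_at` column, and the checkpoint handling are illustrative assumptions rather than OLake internals; the key point is that only rows beyond the saved cursor value are fetched and appended.

```python
source = [
    {"id": 1, "last_updated_at": 100},
    {"id": 2, "last_updated_at": 150},
    {"id": 3, "last_updated_at": 200},
]

checkpoint = 150   # highest cursor_key value seen on the previous run
destination = []

# Fetch only rows whose cursor value is greater than the saved checkpoint...
new_rows = [r for r in source if r["last_updated_at"] > checkpoint]

# ...append them to the destination and advance the checkpoint.
if new_rows:
    destination.extend(new_rows)
    checkpoint = max(r["last_updated_at"] for r in new_rows)

print(new_rows, checkpoint)   # [{'id': 3, 'last_updated_at': 200}] 200
```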



💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!