Last updated:5/2/2025|... min read

Open OLake doc issue

OLake related Terminologies

Data Lake

Definition: A central repository designed to store large amounts of raw or processed data in its native format, typically using files (Parquet, CSV, JSON, etc.).
Context in OLake: OLake outputs data into open formats (e.g., Parquet) on object stores (e.g., Amazon S3) that are frequently used as data lakes.

Additional Materials:

What is Data Lake - Video

Lakehouse

Definition: A new architectural approach that combines the flexibility of data lakes (cheap, scalable storage for structured and unstructured data) with some features of data warehouses (transactions, schema enforcement, performance optimizations).
Context in OLake: OLake supports writing to Apache Iceberg, a table format that brings ACID transactions and schema evolution to a data lake, effectively creating a lakehouse environment.

Snapshot (Initial Snapshot)

Definition: A full, point-in-time export of data from the source system.
Context in OLake: OLake performs an initial snapshot of each table or collection before switching to incremental or CDC-based replication. This ensures the entire dataset is copied over once before capturing changes.

CDC (Change Data Capture)

Definition: A method of detecting and capturing changes (inserts, updates, deletes) in a source database and delivering these changes to downstream systems in near real-time.
Context in OLake: OLake utilizes CDC to continuously capture changes from systems like MongoDB (via change streams on the oplog). This ensures the destination (lake/lakehouse) remains up-to-date with minimal latency.

Additional Materials:

What is CDC - Video - 1 | Video - 2 | Video 3

Oplog (MongoDB)

Definition: An internal rolling log of operations in MongoDB’s replica sets that records all modifications to the data.
Context in OLake: OLake monitors MongoDB’s oplog to implement CDC by reading new entries as they happen and translating them into inserts, updates, or deletes in the target system.

Chunking / Parallel Chunk-Based Loading

Definition: Splitting a large dataset (e.g., a collection or table) into multiple, more manageable segments (“chunks”) that can be processed in parallel.
Context in OLake: OLake divides large collections into chunks and runs them concurrently to speed up full snapshot operations. This can drastically reduce the total load time for high-volume data.

Thread Count / Concurrency

Definition: The number of concurrent processes or threads used to ingest, transform, and write data. Higher concurrency often increases throughput but also adds load on the source system.
Context in OLake: The --thread_count (or a similar flag) controls how many chunks or streams can be processed in parallel.

Driver (Connector)

Definition: A plugin or module within OLake that handles reading from a specific source (MongoDB, Postgres, MySQL, etc.)
Context in OLake: Drivers encapsulate the logic for initial snapshots, query chunking, and CDC for each database or source technology. They talk natively to the source’s API or protocols.

Writer

Definition: A module within OLake that formats and writes data to a particular destination, such as local Parquet files, S3 Iceberg Parquet, Snowflake, or BigQuery.
Context in OLake: Destinations eliminate the need for an intermediate data queue; data flows directly from the Driver to the Destinations, reducing latency and complexity.

Parquet

Definition: A columnar storage file format commonly used in analytics systems for efficient data compression and encoding schemes.
Context in OLake: OLake uses Parquet to store data in a columnar format, making subsequent analytics (e.g., with Spark, Trino, or Snowflake external tables) more performant.

Apache Iceberg

Definition: An open table format for huge analytical datasets that supports ACID transactions, schema evolution, partitioning, and more.
Context in OLake: OLake’s Iceberg Writer outputs data in Iceberg-compatible Parquet files, enabling a full Lakehouse architecture with transaction semantics.

Additional Materials:

What is Apache Iceberg - Video - 1 | Video 2 | Video 3

Equality Deletes (Iceberg)

Definition: A method in Apache Iceberg to handle row-level deletes by matching record attributes (often primary keys) to identify rows for deletion.
Context in OLake: OLake uses equality deletes for upserts, ensuring that changes detected via CDC are accurately reflected in the Iceberg table (e.g., updated or removed rows).

Real-Time Replication

Definition: The process of capturing and applying database changes almost immediately, minimizing lag between the source and destination systems.
Context in OLake: Via CDC, OLake can push changes to the lakehouse or other destinations in near real-time, making data available for analytics soon after it’s updated in the source.

Schema Evolution

Definition: The ability to adapt to changes in a data schema (e.g., adding columns, changing data types) without requiring a complete rebuild of the dataset.
Context in OLake: OLake automatically detects new fields in MongoDB documents or other sources and can add those columns to the Parquet or Iceberg tables without a full reload, alerting you to potential schema conflicts.

State Management

Definition: The tracking and storage of metadata about what data has been processed (e.g., last offset in CDC, last chunk processed) to allow for resuming or continuing jobs without data loss or duplication.
Context in OLake: OLake keeps a “state” file containing checkpoint offsets or timestamps so that if a process is interrupted, it can resume from the exact point of failure.

Catalog

Definition: A registry of table metadata that helps data engines understand how to interpret files (schema, partition information, versioning).
Context in OLake: When writing to Apache Iceberg, OLake works with a catalog (e.g., in Hive metastore, AWS Glue, or a custom catalog) to keep track of table definitions.

Flattening (JSON Flattening)

Definition: Transforming nested or hierarchical data (JSON objects, arrays) into a more relational or tabular structure.
Context in OLake: OLake supports level-0 (simple) flattening at the moment, with deeper nested array/object flattening coming soon.

Performance Benchmarks

Definition: Quantitative tests measuring how fast a tool or system processes data under certain conditions (e.g., data volume, concurrency).
Context in OLake: OLake’s benchmarks highlight throughput (rows/second) and total load times for high-volume datasets, often outperforming comparable ETL/CDC tools.

Monitoring & Alerting

Definition: Methods and tools used to observe system operations (logging, metrics) and notify users about anomalies or errors.
Context in OLake: OLake can raise alerts (e.g., changes in source schema, job failures) and logs real-time progress on snapshot or CDC operations.

Full Refresh / Full Load

Definition: An operation that reloads an entire table or collection from scratch, overwriting or replacing existing data in the target.
Context in OLake: OLake can perform a full refresh on a specific table that requires immediate, up-to-date replication (often used after major schema or data changes).

Incremental Load

Definition: Capturing only new or changed records (added or updated) since the last successful load, rather than reloading the entire table.
Context in OLake: For large datasets, incremental loads via CDC drastically reduce system overhead compared to repeated full loads.

BYOC (Bring Your Own Cloud)

Definition: The concept of running software in an environment of your choice, rather than being locked to a vendor’s infrastructure.
Context in OLake: OLake aims to be cloud-agnostic, allowing deployments in AWS, GCP, Azure, or on-premises with no forced vendor dependencies.

Trino, Spark, Flink, Snowflake

Definition: Various data processing and query engines capable of reading open-format data (Parquet, Iceberg).
Context in OLake: Because OLake writes to open formats, these engines can query data natively without a proprietary or locked-in format.

Golang / Go

Golang (Go) is a statically typed, compiled programming language designed for efficiency, concurrency, and scalability. In the data engineering industry, Golang is widely used for:

Building high-performance data processing services: Due to its low-latency execution, it is preferred for ETL (Extract, Transform, Load) pipelines.
Concurrency in data streaming: Go’s goroutines make it ideal for parallel data processing in real-time analytics.
Developing microservices for data pipelines: Go is commonly used to build API-driven microservices that handle data ingestion, transformation, and movement.
Interfacing with cloud storage and databases: Many cloud-based data engineering applications use Go to interact with databases, message queues, and cloud storage solutions like AWS S3.

gRPC

gRPC (Google Remote Procedure Call) is a high-performance, open-source RPC (Remote Procedure Call) framework that enables efficient communication between distributed services. It is particularly relevant in data engineering for:

Real-time data exchange: gRPC is widely used in event-driven architectures where services need to exchange structured data with low latency.
Microservices communication: Data pipelines often consist of multiple microservices for ingestion, transformation, and storage, and gRPC ensures efficient, type-safe interactions between them.
Streaming large datasets: With gRPC’s support for bidirectional streaming, it is used in high-throughput data processing applications such as log aggregation, telemetry processing, and real-time analytics.
Interfacing with machine learning models: Many ML/AI services use gRPC to serve predictions and process large data payloads efficiently.

AWS S3

AWS S3 (Simple Storage Service) is a cloud-based object storage service that provides scalability, durability, and cost-effectiveness for storing and managing large volumes of data. In data engineering, AWS S3 is used for:

Data lake storage: Organizations store raw and processed data in S3 to power analytics, reporting, and AI/ML workflows.
ETL pipelines: Data is often ingested into S3 before being processed and transformed using tools like Apache Spark, AWS Glue, or Presto.
Backup and archival: S3 provides durable storage for long-term data retention with lifecycle management policies.
Integrations with big data tools: S3 integrates with platforms like Apache Iceberg, Snowflake, Databricks, and AWS Redshift for analytics and querying.

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!

OLake related Terminologies

Data Lake

Additional Materials:

Lakehouse

Snapshot (Initial Snapshot)

CDC (Change Data Capture)

Additional Materials:

Oplog (MongoDB)

Chunking / Parallel Chunk-Based Loading

Thread Count / Concurrency

Driver (Connector)

Writer

Parquet

Apache Iceberg

Additional Materials:

Equality Deletes (Iceberg)

Real-Time Replication

Schema Evolution

State Management

Catalog

Flattening (JSON Flattening)

Performance Benchmarks

Monitoring & Alerting

Full Refresh / Full Load

Incremental Load

BYOC (Bring Your Own Cloud)

Trino, Spark, Flink, Snowflake

Golang / Go

gRPC

AWS S3

Need Assistance?

Join our growing community

GitHub

Slack

Twitter

LinkedIn

YouTube

Data Lake​

Additional Materials:​

Lakehouse​

Snapshot (Initial Snapshot)​

CDC (Change Data Capture)​

Additional Materials:​

Oplog (MongoDB)​

Chunking / Parallel Chunk-Based Loading​

Thread Count / Concurrency​

Driver (Connector)​

Writer​

Parquet​

Apache Iceberg​

Additional Materials:​

Equality Deletes (Iceberg)​

Real-Time Replication​

Schema Evolution​

State Management​

Catalog​

Flattening (JSON Flattening)​

Performance Benchmarks​

Monitoring & Alerting​

Full Refresh / Full Load​

Incremental Load​

BYOC (Bring Your Own Cloud)​

Trino, Spark, Flink, Snowflake​

Golang / Go​

gRPC​

AWS S3​

Need Assistance?

Join our growing community

GitHub

Slack

Twitter

LinkedIn

YouTube

Data Lake

Additional Materials:

Lakehouse

Snapshot (Initial Snapshot)

CDC (Change Data Capture)

Additional Materials:

Oplog (MongoDB)

Chunking / Parallel Chunk-Based Loading

Thread Count / Concurrency

Driver (Connector)

Writer

Parquet

Apache Iceberg

Additional Materials:

Equality Deletes (Iceberg)

Real-Time Replication

Schema Evolution

State Management

Catalog

Flattening (JSON Flattening)

Performance Benchmarks

Monitoring & Alerting

Full Refresh / Full Load

Incremental Load

BYOC (Bring Your Own Cloud)

Trino, Spark, Flink, Snowflake

Golang / Go

gRPC

AWS S3