OLake-Related Terminology

Data Lake

  • Definition: A central repository designed to store large amounts of raw or processed data in its native format, typically using files (Parquet, CSV, JSON, etc.).
  • Context in OLake: OLake outputs data into open formats (e.g., Parquet) on object stores (e.g., Amazon S3) that are frequently used as data lakes.

Lakehouse

  • Definition: A new architectural approach that combines the flexibility of data lakes (cheap, scalable storage for structured and unstructured data) with some features of data warehouses (transactions, schema enforcement, performance optimizations).
  • Context in OLake: OLake supports writing to Apache Iceberg, a table format that brings ACID transactions and schema evolution to a data lake, effectively creating a lakehouse environment.

Snapshot (Initial Snapshot)

  • Definition: A full, point-in-time export of data from the source system.
  • Context in OLake: OLake performs an initial snapshot of each table or collection before switching to incremental or CDC-based replication. This ensures the entire dataset is copied over once before capturing changes.

CDC (Change Data Capture)

  • Definition: A method of detecting and capturing changes (inserts, updates, deletes) in a source database and delivering these changes to downstream systems in near real-time.
  • Context in OLake: OLake utilizes CDC to continuously capture changes from systems like MongoDB (via change streams on the oplog). This ensures the destination (lake/lakehouse) remains up-to-date with minimal latency.
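
As a rough illustration of how captured changes keep a destination in sync, the sketch below replays a list of change events against an in-memory dictionary. The event shape (`op`, `key`, `row`) and the `apply_change` function are hypothetical and are not OLake's internal format.

```python
from typing import Any, Dict


def apply_change(target: Dict[str, dict], event: Dict[str, Any]) -> None:
    """Apply one change event to an in-memory 'destination' keyed by record id."""
    op, key = event["op"], event["key"]
    if op == "insert":
        target[key] = dict(event["row"])
    elif op == "update":
        # Merge the changed fields into the existing row (create it if missing).
        target.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        target.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


# Hypothetical event stream: two inserts, one update, one delete.
destination: Dict[str, dict] = {}
events = [
    {"op": "insert", "key": "u1", "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": "u2", "row": {"name": "Lin", "plan": "free"}},
    {"op": "update", "key": "u1", "row": {"plan": "pro"}},
    {"op": "delete", "key": "u2"},
]
for event in events:
    apply_change(destination, event)

print(destination)
```

Replaying the events in order matters: applying the delete for `u2` before its insert would leave a stale row behind.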

Oplog (MongoDB)

  • Definition: An internal rolling log of operations in MongoDB’s replica sets that records all modifications to the data.
  • Context in OLake: OLake monitors MongoDB’s oplog to implement CDC by reading new entries as they happen and translating them into inserts, updates, or deletes in the target system.
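
To make the "rolling" aspect concrete, here is a toy model of a capped operation log. It assumes integer timestamps, and the `RollingLog` class is invented for illustration; it is not MongoDB's or OLake's API. The point it shows is that a reader who falls too far behind can no longer resume, because the entries it needs have been overwritten.

```python
from collections import deque


class RollingLog:
    """Toy capped log: once full, the oldest entries are discarded."""

    def __init__(self, capacity: int) -> None:
        self.entries = deque(maxlen=capacity)
        self.next_ts = 0

    def append(self, op: str) -> None:
        self.entries.append((self.next_ts, op))
        self.next_ts += 1

    def read_after(self, last_seen_ts: int) -> list:
        """Return entries newer than last_seen_ts, or fail if some were lost."""
        if self.entries:
            oldest_ts = self.entries[0][0]
            if oldest_ts > last_seen_ts + 1:
                raise LookupError("resume point has rolled off the log")
        return [(ts, op) for ts, op in self.entries if ts > last_seen_ts]


log = RollingLog(capacity=3)
for op in ["insert a", "update a", "insert b", "delete a", "update b"]:
    log.append(op)

# Only timestamps 2, 3 and 4 are still retained at this point.
print(log.read_after(1))
```

In this model, a reader whose last position was timestamp 0 would get a `LookupError`, which is the kind of situation where a fresh snapshot would be needed.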

Chunking / Parallel Chunk-Based Loading

  • Definition: Splitting a large dataset (e.g., a collection or table) into multiple, more manageable segments (“chunks”) that can be processed in parallel.
  • Context in OLake: OLake divides large collections into chunks and runs them concurrently to speed up full snapshot operations. This can drastically reduce the total load time for high-volume data.
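
One simple way to picture chunking is to split a numeric key range into fixed-size, non-overlapping ranges. The sketch below assumes an integer key and a hypothetical `split_into_chunks` helper; OLake's actual chunking strategy may differ by driver.

```python
def split_into_chunks(min_id: int, max_id: int, chunk_size: int) -> list:
    """Return half-open [start, end) ranges that cover min_id..max_id inclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = min_id
    while start <= max_id:
        # The last chunk is clipped so it never runs past max_id.
        end = min(start + chunk_size, max_id + 1)
        chunks.append((start, end))
        start = end
    return chunks


chunks = split_into_chunks(1, 10, 4)
print(chunks)  # [(1, 5), (5, 9), (9, 11)]
```

Because the ranges do not overlap, each chunk can be read independently without producing duplicate rows.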

Thread Count / Concurrency

  • Definition: The number of concurrent processes or threads used to ingest, transform, and write data. Higher concurrency often increases throughput but also adds load on the source system.
  • Context in OLake: The --thread_count (or a similar flag) controls how many chunks or streams can be processed in parallel.
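
The effect of a thread-count setting can be sketched with a bounded worker pool from Python's standard library. The `load_chunk` and `run_snapshot` functions are hypothetical stand-ins; only the idea of capping parallel work is taken from the text above.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(chunk: tuple) -> int:
    """Stand-in for reading rows with ids in [start, end) from the source."""
    start, end = chunk
    return end - start  # pretend every id in the range yields one row


def run_snapshot(chunks: list, thread_count: int) -> int:
    """Process chunks with at most thread_count running at the same time."""
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return sum(pool.map(load_chunk, chunks))


total_rows = run_snapshot(
    [(0, 250), (250, 500), (500, 750), (750, 1000)], thread_count=2
)
print(total_rows)  # 1000
```

Raising `thread_count` here only changes how many chunks are in flight at once, which is also why it increases load on the source.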

Driver (Connector)

  • Definition: A plugin or module within OLake that handles reading from a specific source (MongoDB, Postgres, MySQL, etc.).
  • Context in OLake: Drivers encapsulate the logic for initial snapshots, query chunking, and CDC for each database or source technology. They talk natively to the source’s API or protocols.

Writer

  • Definition: A module within OLake that formats and writes data to a particular destination, such as local Parquet files, S3 Iceberg Parquet, Snowflake, or BigQuery.
  • Context in OLake: Writers eliminate the need for an intermediate data queue; data flows directly from the Driver to the Writer, reducing latency and complexity.

Parquet

  • Definition: A columnar storage file format commonly used in analytics systems for its efficient compression and encoding schemes.
  • Context in OLake: OLake uses Parquet to store data in a columnar format, making subsequent analytics (e.g., with Spark, Trino, or Snowflake external tables) more performant.

Apache Iceberg

  • Definition: An open table format for huge analytical datasets that supports ACID transactions, schema evolution, partitioning, and more.
  • Context in OLake: OLake’s “S3 Iceberg Parquet” Writer outputs data in Iceberg-compatible Parquet files, enabling a full Lakehouse architecture with transaction semantics.

Equality Deletes (Iceberg)

  • Definition: A method in Apache Iceberg to handle row-level deletes by matching record attributes (often primary keys) to identify rows for deletion.
  • Context in OLake: OLake uses equality deletes for upserts, ensuring that changes detected via CDC are accurately reflected in the Iceberg table (e.g., updated or removed rows).
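
The following toy model sketches how an upsert can be expressed as an equality delete plus an append. It assumes a single key column and uses sequence numbers so that a delete only hides rows written before it, which mirrors how Iceberg scopes equality deletes. The `ToyTable` class is illustrative only and does not use any Iceberg library.

```python
class ToyTable:
    """Append-only data plus equality deletes, resolved at read time."""

    def __init__(self, key: str = "id") -> None:
        self.key = key
        self.seq = 0
        self.data = []        # list of (sequence_number, row)
        self.eq_deletes = []  # list of (sequence_number, key_value)

    def upsert(self, row: dict) -> None:
        # One commit: hide any older row with this key, then append the new one.
        self.seq += 1
        self.eq_deletes.append((self.seq, row[self.key]))
        self.data.append((self.seq, row))

    def delete(self, key_value) -> None:
        self.seq += 1
        self.eq_deletes.append((self.seq, key_value))

    def scan(self) -> list:
        live = []
        for data_seq, row in self.data:
            # A delete applies only to rows written in an earlier commit.
            hidden = any(
                del_seq > data_seq and value == row[self.key]
                for del_seq, value in self.eq_deletes
            )
            if not hidden:
                live.append(row)
        return live


table = ToyTable()
table.upsert({"id": 1, "v": "a"})
table.upsert({"id": 2, "v": "b"})
table.upsert({"id": 1, "v": "c"})  # replaces the first row
table.delete(2)
print(table.scan())  # [{'id': 1, 'v': 'c'}]
```

Nothing is rewritten in place: old rows stay in `data` and are filtered out when the table is read.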

Real-Time Replication

  • Definition: The process of capturing and applying database changes almost immediately, minimizing lag between the source and destination systems.
  • Context in OLake: Via CDC, OLake can push changes to the lakehouse or other destinations in near real-time, making data available for analytics soon after it’s updated in the source.

Schema Evolution

  • Definition: The ability to adapt to changes in a data schema (e.g., adding columns, changing data types) without requiring a complete rebuild of the dataset.
  • Context in OLake: OLake automatically detects new fields in MongoDB documents or other sources and can add those columns to the Parquet or Iceberg tables without a full reload, alerting you to potential schema conflicts.
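
A minimal sketch of the detection step, assuming the schema is tracked as a mapping from column name to a type name. The `evolve_schema` function and its return values are hypothetical and say nothing about how OLake represents schemas internally.

```python
def evolve_schema(schema: dict, record: dict) -> tuple:
    """Merge a record's fields into schema in place.

    Returns (added_columns, conflicts), where each conflict is
    (field, known_type, observed_type).
    """
    added, conflicts = [], []
    for field, value in record.items():
        if value is None:
            continue  # a null carries no type information
        observed = type(value).__name__
        if field not in schema:
            schema[field] = observed
            added.append(field)
        elif schema[field] != observed:
            conflicts.append((field, schema[field], observed))
    return added, conflicts


schema = {"_id": "str", "name": "str"}
added, conflicts = evolve_schema(schema, {"_id": "a1", "name": "Ada", "age": 36})
print(added, conflicts)  # ['age'] []

added2, conflicts2 = evolve_schema(schema, {"_id": "a2", "age": "unknown"})
print(added2, conflicts2)  # [] [('age', 'int', 'str')]
```

New fields extend the schema without touching existing data, while a type mismatch is reported rather than silently applied.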

State Management

  • Definition: The tracking and storage of metadata about what data has been processed (e.g., last offset in CDC, last chunk processed) to allow for resuming or continuing jobs without data loss or duplication.
  • Context in OLake: OLake keeps a “state” file containing checkpoint offsets or timestamps so that if a process is interrupted, it can resume from the exact point of failure.
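
The checkpoint-and-resume idea can be sketched with a small JSON file. The file layout (`last_offset`) and the helper names below are assumptions made for this example and do not describe OLake's actual state file.

```python
import json
import os
import tempfile


def load_state(path: str) -> dict:
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # first run: nothing processed yet


def save_state(path: str, state: dict) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def process(events: list, path: str) -> list:
    """Handle only events newer than the checkpoint, saving progress as we go."""
    last = load_state(path).get("last_offset", -1)
    handled = []
    for offset, payload in events:
        if offset <= last:
            continue  # already processed in an earlier run
        handled.append(payload)
        last = offset
        save_state(path, {"last_offset": last})
    return handled


with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "state.json")
    first_run = process([(0, "a"), (1, "b")], state_path)
    second_run = process([(0, "a"), (1, "b"), (2, "c")], state_path)

print(first_run, second_run)  # ['a', 'b'] ['c']
```

The second run skips offsets 0 and 1 because the checkpoint already covers them, which is what avoids duplication after a restart.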

Catalog

  • Definition: A registry of table metadata that helps data engines understand how to interpret files (schema, partition information, versioning).
  • Context in OLake: When writing to Apache Iceberg, OLake often works with a catalog (e.g., Hive Metastore, AWS Glue, or a custom catalog) to keep track of table definitions.

Flattening (JSON Flattening)

  • Definition: Transforming nested or hierarchical data (JSON objects, arrays) into a more relational or tabular structure.
  • Context in OLake: OLake supports level-0 (simple) flattening at the moment, with deeper nested array/object flattening coming soon.
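
One plausible reading of level-0 flattening is that top-level keys become columns while nested objects and arrays are kept as serialized JSON strings. The sketch below implements that interpretation only; it is an assumption, and OLake's exact behavior may differ.

```python
import json


def flatten_level0(doc: dict) -> dict:
    """Turn top-level keys into columns; keep nested values as JSON text."""
    row = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


row = flatten_level0(
    {"_id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["a", "b"]}
)
print(row)
```

Under this reading, `address` and `tags` each land in a single string column instead of being expanded into `address.city` or one row per tag.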

Performance Benchmarks

  • Definition: Quantitative tests measuring how fast a tool or system processes data under certain conditions (e.g., data volume, concurrency).
  • Context in OLake: OLake’s benchmarks highlight throughput (rows/second) and total load times for high-volume datasets, often outperforming comparable ETL/CDC tools.

Monitoring & Alerting

  • Definition: Methods and tools used to observe system operations (logging, metrics) and notify users about anomalies or errors.
  • Context in OLake: OLake can raise alerts (e.g., on source schema changes or job failures) and log real-time progress on snapshot or CDC operations.

Full Refresh / Full Load

  • Definition: An operation that reloads an entire table or collection from scratch, overwriting or replacing existing data in the target.
  • Context in OLake: OLake can perform a full refresh on a specific table that requires immediate, up-to-date replication (often used after major schema or data changes).

Incremental Load

  • Definition: Capturing only new or changed records (added or updated) since the last successful load, rather than reloading the entire table.
  • Context in OLake: For large datasets, incremental loads via CDC drastically reduce system overhead compared to repeated full loads.
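
The difference from a full load can be illustrated with a cursor-based filter. Note that this sketch selects rows by a cursor column (`updated_at`, a hypothetical field) rather than reading a CDC log, so it is a simpler stand-in for the technique described above, and it cannot see deletes.

```python
def incremental_batch(rows: list, last_cursor: int, cursor_field: str = "updated_at") -> tuple:
    """Return rows changed since last_cursor and the cursor to store for next time."""
    changed = [r for r in rows if r[cursor_field] > last_cursor]
    new_cursor = max((r[cursor_field] for r in changed), default=last_cursor)
    return changed, new_cursor


source_rows = [
    {"id": 1, "updated_at": 1},
    {"id": 2, "updated_at": 5},
    {"id": 3, "updated_at": 9},
]

batch, cursor = incremental_batch(source_rows, last_cursor=4)
print(batch, cursor)   # two rows, cursor 9

batch2, cursor2 = incremental_batch(source_rows, last_cursor=cursor)
print(batch2, cursor2)  # [] 9
```

The second call transfers nothing because no row has changed since the stored cursor, whereas a full load would copy all three rows again.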

BYOC (Bring Your Own Cloud)

  • Definition: The concept of running software in an environment of your choice, rather than being locked to a vendor’s infrastructure.
  • Context in OLake: OLake aims to be cloud-agnostic, allowing deployments in AWS, GCP, Azure, or on-premises with no forced vendor dependencies.

Query Engines

  • Definition: Various data processing and query engines capable of reading open-format data (Parquet, Iceberg).
  • Context in OLake: Because OLake writes to open formats, these engines can query data natively without a proprietary or locked-in format.

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: Discuss the roadmap, report bugs, and get help debugging issues you are facing.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!