General Terminologies

1. ETL (Extract, Transform, Load)

  • Definition: A classic process where data is pulled (extracted) from a source, transformed (e.g., cleaned, enriched), and loaded into a target system (like a data warehouse or lake).
  • Example: Extracting rows from MySQL, converting date fields to a uniform format, then loading them into a data warehouse for reporting.
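
A minimal ETL sketch in Python (the connection strings and table names below are hypothetical, and pandas/SQLAlchemy stand in for whatever extraction and loading tooling you actually use):

```python
# Minimal ETL sketch: extract from MySQL, transform dates, load into a warehouse.
# Connection strings and table names are illustrative placeholders.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("mysql+pymysql://user:pass@mysql-host/shop")
warehouse = create_engine("postgresql://user:pass@warehouse-host/analytics")

# Extract: pull raw rows from the operational database.
orders = pd.read_sql("SELECT order_id, amount, created_at FROM orders", source)

# Transform: normalize the date column to a uniform ISO format.
orders["created_at"] = pd.to_datetime(orders["created_at"]).dt.strftime("%Y-%m-%d")

# Load: write the cleaned rows into a reporting table in the warehouse.
orders.to_sql("orders_clean", warehouse, if_exists="replace", index=False)
```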

2. ELT (Extract, Load, Transform)

  • Definition: A variant of ETL where data is first loaded into a centralized data store (like a lake or warehouse) before transformations occur, often using the destination’s computing engine.
  • Example: Loading raw CSVs from various sources into a data lake (like S3) and then using Spark to transform that raw data into a curated, analytics-ready format.
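
A minimal ELT sketch with PySpark, assuming raw CSVs have already been landed in a hypothetical `s3a://my-lake/raw/` location:

```python
# Minimal ELT sketch: the raw data is loaded into the lake as-is, and the
# transformation runs inside the lake using Spark's compute. Paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-example").getOrCreate()

# Load step already happened: raw CSVs sit untouched in the lake.
raw = spark.read.option("header", "true").csv("s3a://my-lake/raw/orders/")

# Transform step: clean and curate using the destination's engine.
curated = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
)

curated.write.mode("overwrite").parquet("s3a://my-lake/curated/orders/")
```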

3. CDC (Change Data Capture)

  • Definition: A method to track and replicate changes (inserts, updates, deletes) in real-time or near real-time from source databases.
  • Example: Monitoring a PostgreSQL WAL (Write-Ahead Log) for new transactions and immediately applying them to a data lake.

4. WAL (Write-Ahead Log)

  • Definition: A log in databases like PostgreSQL that records all changes before they are written to the main data files, ensuring durability and making it possible to replay or replicate changes.
  • Example: A CDC tool reads the WAL of PostgreSQL to capture every row update or insert, then streams that data to a downstream system.
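
A small sketch of reading decoded WAL changes through a logical replication slot with psycopg2 (connection details are hypothetical, the server must run with `wal_level=logical`, and production CDC tools stream changes rather than polling like this):

```python
# Sketch: poll a PostgreSQL logical replication slot for changes decoded from the WAL.
# Connection string is a placeholder; requires wal_level=logical and replication privileges.
import psycopg2

conn = psycopg2.connect("dbname=shop user=replicator host=pg-host")
conn.autocommit = True
cur = conn.cursor()

# Create a slot once; the built-in 'test_decoding' plugin emits changes as text.
cur.execute(
    "SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding')"
)

# Each returned row describes an INSERT/UPDATE/DELETE replayed from the WAL.
cur.execute(
    "SELECT lsn, data FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL)"
)
for lsn, change in cur.fetchall():
    print(lsn, change)
```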

5. Oplog (MongoDB)

  • Definition: A rolling operation log used in MongoDB replica sets that records all data modifications. It is the backbone for MongoDB’s replication.
  • Example: To replicate MongoDB data in near-real time, a CDC tool tails the oplog to identify document changes and updates a target system accordingly.
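
A rough sketch of tailing the oplog with PyMongo (the connection URI is hypothetical and a replica set is required; many tools use change streams, which are built on the oplog, instead of reading it directly):

```python
# Sketch: tail MongoDB's oplog with a tailable cursor. URI is a placeholder.
import pymongo

client = pymongo.MongoClient("mongodb://mongo-host:27017/?replicaSet=rs0")
oplog = client["local"]["oplog.rs"]

# A tailable cursor keeps returning new entries as they are appended.
cursor = oplog.find({}, cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
for entry in cursor:
    # 'op' is the operation type (i=insert, u=update, d=delete), 'ns' the namespace.
    print(entry.get("op"), entry.get("ns"))
```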

6. Binlog (MySQL)

  • Definition: A binary log in MySQL that records all statements that modify data or could potentially modify data (e.g., INSERT, UPDATE, DELETE).
  • Example: Debezium can read MySQL’s binlog to capture changes and stream them to Kafka for downstream consumption.
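
A sketch using the python-mysql-replication package to read row events from the binlog (host, credentials, and `server_id` are hypothetical; `binlog_format=ROW` is assumed on the server):

```python
# Sketch: stream row-level change events from the MySQL binlog.
# Connection settings are placeholders; server_id must be unique among replicas.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "mysql-host", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,
    blocking=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:
    for row in event.rows:
        print(event.table, row)

stream.close()
```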

7. Debezium

  • Definition: An open-source platform for CDC (Change Data Capture) that supports databases like MySQL, Postgres, MongoDB, etc., and publishes changes to Kafka or other messaging systems.
  • Example: A company uses Debezium to capture updates from their production MySQL databases and stream them to Kafka, where analytics pipelines can consume near-real-time events.
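
A sketch of registering a Debezium MySQL connector through the Kafka Connect REST API (host names, credentials, and the topic prefix are hypothetical, and a real deployment needs additional settings such as schema history storage):

```python
# Sketch: submit a Debezium MySQL connector config to Kafka Connect.
# All hosts, credentials, and names below are placeholders.
import requests

connector = {
    "name": "shop-mysql-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "topic.prefix": "shop",
        "table.include.list": "shop.orders",
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=connector)
resp.raise_for_status()
```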

8. Snapshot

  • Definition: A full, point-in-time copy of a dataset, often used as the initial baseline in replication before switching to incremental updates.
  • Example: When setting up CDC, tools often perform a one-time full snapshot of the source database’s tables, then continue with changes from logs (oplog, WAL, etc.).

9. Polymorphic (or Heterogeneous) Data

  • Definition: Data that can take various forms or structures within the same dataset, common in NoSQL environments (e.g., JSON documents in MongoDB that may have slightly different fields).
  • Example: In an e-commerce collection, some products have a color field, while others do not. Polymorphism requires the integration tool to handle both scenarios gracefully.
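
A tiny Python sketch of tolerating an optional field while flattening such documents (field names are illustrative):

```python
# Sketch: handle records that may or may not carry a 'color' field.
products = [
    {"sku": "A1", "name": "T-shirt", "color": "blue"},
    {"sku": "B2", "name": "Gift card"},  # no 'color' field at all
]

rows = [
    {"sku": p["sku"], "name": p["name"], "color": p.get("color")}  # None when absent
    for p in products
]
print(rows)
```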

10. JSON Flattening

  • Definition: The process of converting nested JSON structures into a more tabular form (columns and rows). This often involves expanding out arrays or nested objects into separate tables or columns.
  • Example: A user document might have an embedded address object { "street": "123 Main St", "city": "Somewhere" }. Flattening would create columns like user.address.street and user.address.city.
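
A quick sketch with pandas' `json_normalize` (the data is illustrative):

```python
# Sketch: flatten a nested address object into dotted column names.
import pandas as pd

users = [{"id": 1, "address": {"street": "123 Main St", "city": "Somewhere"}}]

flat = pd.json_normalize(users)
print(flat.columns.tolist())  # ['id', 'address.street', 'address.city']
```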

11. Table Format (e.g., Apache Iceberg, Delta Lake, Hudi)

  • Definition: A structured way to store large analytics datasets in a data lake with features like ACID transactions, version control, partition pruning, and schema evolution.
  • Example: Apache Iceberg organizes data into manifests and metadata files, letting tools like Spark or Trino treat a folder of Parquet files as an updatable “table.”
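
A sketch of creating and querying an Iceberg table from PySpark (the catalog name `lake` and warehouse path are hypothetical, and the Iceberg Spark runtime must be on the classpath):

```python
# Sketch: configure an Iceberg catalog, create a table, and inspect its snapshots.
# Catalog name, warehouse path, and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-example")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-lake/warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.orders (
        order_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql("INSERT INTO lake.db.orders VALUES (1, 9.99, current_timestamp())")

# Each commit creates a new snapshot; older snapshots remain queryable.
spark.sql("SELECT * FROM lake.db.orders.snapshots").show()
```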

12. Query Engines (e.g., Spark, Trino, Presto, Hive)

  • Definition: Systems that can query data stored in files (e.g., Parquet) on a data lake, without needing a traditional data warehouse. They typically distribute compute across a cluster.
  • Example: Using Trino to run SQL queries against an Iceberg table stored on S3.

13. Schema Evolution

  • Definition: The ability to adjust a schema (add/remove columns, change types) without reloading the entire dataset. Often used in modern table formats and NoSQL integrations.
  • Example: If a product table adds a new category field, a table format with schema evolution support, such as Iceberg, can incorporate the new column on the fly without rewriting existing data.
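
A sketch of evolving an Iceberg table's schema from Spark SQL, assuming a SparkSession already configured with an Iceberg catalog as in the table-format sketch above (table and column names are hypothetical):

```python
# Sketch: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lake.db.products ADD COLUMN category STRING")

# Rows written before the change simply return NULL for the new column.
spark.sql("SELECT sku, category FROM lake.db.products").show()
```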

14. Schema Registry

  • Definition: A system or service that manages and enforces schemas (and their versions) for data streams, often used with Kafka or other event streaming platforms to ensure compatibility across producers and consumers.
  • Example: Confluent’s Schema Registry can store Avro or JSON schemas so that all microservices writing to a Kafka topic follow the same structure.
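
A sketch of registering an Avro schema against a Confluent-compatible Schema Registry over its REST API (the registry URL and subject name are hypothetical):

```python
# Sketch: register a schema version for the 'pageviews-value' subject.
# Registry URL and subject are placeholders.
import json
import requests

schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
}

resp = requests.post(
    "http://schema-registry:8081/subjects/pageviews-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(schema)},
)
resp.raise_for_status()
print(resp.json())  # e.g. {"id": 1}
```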

15. ACID Transactions

  • Definition: A set of properties (Atomicity, Consistency, Isolation, Durability) ensuring reliable transaction processing in databases or table formats.
  • Example: Apache Iceberg or Delta Lake can enforce ACID operations on top of a data lake, so queries see consistent snapshots of data even while new files are being written.

16. Parquet

  • Definition: A columnar file format that supports efficient compression and encoding. Widely used in big data analytical workloads for faster queries and smaller storage footprints.
  • Example: Instead of storing all user data in CSV, a data pipeline writes it to Parquet for quicker aggregation queries in Spark.
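
A small sketch of writing and reading Parquet with pandas and pyarrow (file path and columns are illustrative):

```python
# Sketch: write a DataFrame to Parquet, then read back only one column.
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [10.0, 25.5, 7.2]})
df.to_parquet("users.parquet", engine="pyarrow", compression="snappy")

# Columnar layout lets a reader fetch just the columns it needs.
spend_only = pd.read_parquet("users.parquet", columns=["spend"])
print(spend_only)
```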

17. Avro

  • Definition: A row-oriented binary format commonly used in streaming architectures (e.g., Kafka) due to its flexible schema definition and schema evolution capabilities.
  • Example: In a Kafka-based pipeline, events might be serialized as Avro, with the schema stored in a Schema Registry so consumers can deserialize properly.
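
A sketch of serializing and deserializing a record with fastavro (the schema is illustrative; in a Kafka pipeline the schema would typically be looked up via a Schema Registry):

```python
# Sketch: encode and decode one record with an Avro schema.
from io import BytesIO
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
    ],
})

buf = BytesIO()
schemaless_writer(buf, schema, {"user_id": 42, "url": "/pricing"})

buf.seek(0)
print(schemaless_reader(buf, schema))  # {'user_id': 42, 'url': '/pricing'}
```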

18. Data Lake

  • Definition: A large-scale repository of raw data in its original format, typically stored on object stores like Amazon S3 or Hadoop Distributed File System (HDFS).
  • Example: A company ingests website clickstream data, IoT sensor data, and transactional logs into an S3 bucket for future analytics without rigid schema requirements upfront.

19. Data Warehouse

  • Definition: A central repository of integrated data designed primarily for query and analysis, typically with a predefined schema.
  • Example: In Snowflake, Redshift, or BigQuery, users run SQL queries on structured data that has been loaded and optimized for fast analytical queries.

20. Data Lakehouse

  • Definition: An architectural concept combining the flexibility/cost-effectiveness of a data lake with data warehouse features such as ACID transactions, schema enforcement, and performance optimizations.
  • Example: Using Apache Iceberg on S3 with a query engine like Trino or Spark to achieve transactional updates and traditional BI analytics, all on lower-cost object storage.

21. Orchestration (e.g., Airflow, Luigi, Dagster)

  • Definition: Tools or frameworks that manage the scheduling and execution of data pipeline tasks, enforcing dependencies and handling retries.
  • Example: Airflow can schedule a job that first loads fresh data from Postgres to S3, then triggers a Spark job to partition and optimize that data.
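
A sketch of such a DAG in Airflow (the DAG id, schedule, and task callables are illustrative placeholders):

```python
# Sketch: a two-step Airflow DAG where the transform only runs after the load succeeds.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_postgres_to_s3():
    ...  # extract from Postgres and land files in S3 (placeholder)

def run_spark_optimize():
    ...  # launch a Spark job to partition/compact the landed data (placeholder)

with DAG(
    dag_id="daily_load_and_optimize",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load", python_callable=load_postgres_to_s3)
    optimize = PythonOperator(task_id="optimize", python_callable=run_spark_optimize)

    load >> optimize  # dependency: 'optimize' waits for 'load'
```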

22. Batch vs. Streaming

  • Definition: Batch processing handles data in “chunks” at scheduled intervals, while streaming processes data in near real-time or event-by-event.
  • Example: A daily batch job might reload all the day’s sales data, whereas a streaming pipeline using CDC logs replicates changes within seconds.

23. Lake Metadata Catalog (e.g., Hive Metastore, AWS Glue)

  • Definition: A centralized store that tracks table schemas, locations, and partitions for data files in a lake environment.
  • Example: Apache Iceberg can store table metadata in the Hive Metastore or an AWS Glue Catalog, so query engines like Spark know how to access files.
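
A sketch of pointing a PySpark session at an AWS Glue catalog for Iceberg tables (catalog name and warehouse path are hypothetical; the Iceberg AWS bundle and valid AWS credentials are assumed):

```python
# Sketch: resolve Iceberg table metadata (schema, partitions, file locations) via Glue.
# Catalog name, warehouse path, and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("glue-catalog-example")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3a://my-lake/warehouse")
    .getOrCreate()
)

spark.sql("SELECT * FROM glue.db.orders LIMIT 10").show()
```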

24. Polyglot Persistence

  • Definition: A strategy of using different storage technologies to handle different data needs.
  • Example: Using MongoDB for flexible document storage, MySQL for transactions, and a data lake on S3 for analytics—each store specialized for its use case.

25. Polymorphic Data / Semi-Structured Data

  • Definition: Data that can appear in varied forms within the same dataset, often JSON or XML, where records don’t always share an identical schema.
  • Example: One record might have a contact field with nested objects, while another record in the same collection has entirely different nested fields for contact details.

26. Aggregation / Analytical Query

  • Definition: Operations that summarize or derive insights from data (sums, counts, averages, group-by, etc.), typically run on columnar storage for efficiency.
  • Example: Running a Spark SQL job to compute the total sales by region from a Parquet table on S3.
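
A sketch of such an aggregation with PySpark over Parquet files (the path and column names are illustrative):

```python
# Sketch: group-by aggregation over columnar data in the lake.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("agg-example").getOrCreate()

sales = spark.read.parquet("s3a://my-lake/curated/sales/")
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_sales"))
totals.show()
```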

27. Federation / Federated Query

  • Definition: The ability to query multiple data sources (databases, data lakes, REST APIs) through a single query interface.
  • Example: A Trino cluster configured with multiple catalogs can query S3 data in Iceberg format, data in a MySQL database, and a Hive cluster, all in one SQL statement.

28. Ingestion Latency

  • Definition: The time delay between when data is generated in the source system and when it becomes available in the target or analytics platform.
  • Example: A streaming CDC approach using Debezium typically has lower ingestion latency (seconds) compared to batch ingestion (hours or days).

29. Idempotency

  • Definition: A property of an operation where repeating it has the same effect as doing it once (important for data pipelines to avoid duplicates).
  • Example: If a pipeline retries an upsert operation on a row with the same primary key, it won’t cause double inserts or conflicting state.
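
A sketch of an idempotent upsert against PostgreSQL with psycopg2 (table, columns, and connection details are illustrative):

```python
# Sketch: retrying this statement leaves the row in the same final state,
# so a pipeline retry cannot create duplicates. Names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=loader host=warehouse-host")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO customers (id, email)
        VALUES (%s, %s)
        ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email
        """,
        (42, "jane@example.com"),
    )
```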

30. Observability (Logging, Metrics, Tracing)

  • Definition: Tools and processes that help you understand and diagnose the behavior, performance, and health of your data pipelines.
  • Example: A pipeline might emit logs for each batch processed, metrics for rows per second, and traces for how long each step took to execute.
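
A minimal sketch of emitting logs and simple metrics from a pipeline step (names are illustrative; real pipelines typically ship metrics to Prometheus, StatsD, or OpenTelemetry rather than returning them):

```python
# Sketch: log each batch and report rows processed plus duration as metrics.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def process_batch(rows):
    start = time.monotonic()
    processed = len(rows)                  # stand-in for real work
    elapsed = time.monotonic() - start

    log.info("batch done rows=%d duration_s=%.3f", processed, elapsed)
    return {"rows_processed": processed, "duration_s": elapsed}  # metrics payload

process_batch([{"id": 1}, {"id": 2}])
```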

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: where we discuss future roadmaps, report bugs, help folks debug the issues they are facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!