General Terminologies
1. ETL (Extract, Transform, Load)
- Definition: A classic process where data is pulled (extracted) from a source, transformed (e.g., cleaned, enriched), and loaded into a target system (like a data warehouse or lake).
- Example: Extracting rows from MySQL, converting date fields to a uniform format, then loading them into a data warehouse for reporting.
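A minimal sketch of the pattern in Python using pandas and SQLAlchemy; the connection strings, table names, and columns are placeholders, not a reference implementation:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings -- adjust to your environment.
source = create_engine("mysql+pymysql://user:pass@mysql-host/sales")
warehouse = create_engine("postgresql://user:pass@warehouse-host/analytics")

# Extract: pull rows from the operational MySQL database.
orders = pd.read_sql("SELECT order_id, order_date, amount FROM orders", source)

# Transform: normalize the date format and drop obviously bad rows.
orders["order_date"] = pd.to_datetime(orders["order_date"]).dt.strftime("%Y-%m-%d")
orders = orders[orders["amount"] >= 0]

# Load: write the cleaned rows into a reporting table in the warehouse.
orders.to_sql("orders_clean", warehouse, if_exists="append", index=False)
```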
2. ELT (Extract, Load, Transform)
- Definition: A variant of ETL where data is first loaded into a centralized data store (like a lake or warehouse) before transformations occur, often using the destination’s computing engine.
- Example: Loading raw CSVs from various sources into a data lake (like S3) and then using Spark to transform that raw data into a curated, analytics-ready format.
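A rough PySpark sketch of the "T" happening inside the lake, after the raw CSVs have already been landed; the bucket paths and column names are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-curation").getOrCreate()

# The load step already happened: raw CSVs were copied as-is into the lake.
raw = spark.read.option("header", True).csv("s3a://my-lake/raw/orders/")

# Transform inside the lake, using the engine's own compute.
curated = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date"))
       .dropDuplicates(["order_id"])
)

curated.write.mode("overwrite").parquet("s3a://my-lake/curated/orders/")
```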
3. CDC (Change Data Capture)
- Definition: A method to track and replicate changes (inserts, updates, deletes) in real-time or near real-time from source databases.
- Example: Monitoring a PostgreSQL WAL (Write-Ahead Log) for new transactions and immediately applying them to a data lake.
4. WAL (Write-Ahead Log)
- Definition: A log in databases like PostgreSQL that records all changes before they are written to the main data files, ensuring durability and making it possible to replay or replicate changes.
- Example: A CDC tool reads the WAL of PostgreSQL to capture every row update or insert, then streams that data to a downstream system.
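As a rough illustration, PostgreSQL exposes decoded WAL changes through logical replication slots, which a script can poll. This sketch assumes a slot named `cdc_slot` created with the built-in `test_decoding` plugin and uses placeholder connection details:

```python
import psycopg2

# Assumes a logical replication slot created beforehand, e.g.:
#   SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');
conn = psycopg2.connect("dbname=app user=replicator host=pg-host")
conn.autocommit = True

with conn.cursor() as cur:
    # Consume the decoded WAL changes accumulated since the last call.
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL)"
    )
    for lsn, xid, data in cur.fetchall():
        # data looks roughly like: "table public.orders: INSERT: id[integer]:42 ..."
        print(lsn, xid, data)
```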
5. Oplog (MongoDB)
- Definition: A rolling operation log used in MongoDB replica sets that records all data modifications. It is the backbone for MongoDB’s replication.
- Example: To replicate MongoDB data in near-real time, a CDC tool tails the oplog to identify document changes and updates a target system accordingly.
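A simplified tailing loop with PyMongo; the connection string and replica-set name are placeholders, and a production reader would also persist its position and handle cursor restarts:

```python
from pymongo import MongoClient, CursorType

# Connect to a replica-set member; the oplog lives in the "local" database.
client = MongoClient("mongodb://mongo-host:27017/?replicaSet=rs0")
oplog = client.local["oplog.rs"]

# Start from the most recent entry and tail forward from there.
last = oplog.find().sort("$natural", -1).limit(1).next()
cursor = oplog.find(
    {"ts": {"$gt": last["ts"]}},
    cursor_type=CursorType.TAILABLE_AWAIT,
)

for entry in cursor:
    # entry["op"] is "i" (insert), "u" (update), or "d" (delete);
    # entry["ns"] is the namespace, e.g. "shop.products".
    print(entry["ns"], entry["op"], entry.get("o"))
```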
6. Binlog (MySQL)
- Definition: A binary log in MySQL that records all statements that modify data or could potentially modify data (e.g., `INSERT`, `UPDATE`, `DELETE`).
- Example: Debezium can read MySQL’s binlog to capture changes and stream them to Kafka for downstream consumption.
7. Debezium
- Definition: An open-source platform for CDC (Change Data Capture) that supports databases like MySQL, Postgres, MongoDB, etc., and publishes changes to Kafka or other messaging systems.
- Example: A company uses Debezium to capture updates from their production MySQL databases and stream them to Kafka, where analytics pipelines can consume near-real-time events.
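Connectors are usually registered by POSTing a JSON config to the Kafka Connect REST API. The host names, credentials, and table list below are placeholders, several required settings (e.g., schema history topics) are omitted, and property names should be checked against your Debezium version:

```python
import requests

connector = {
    "name": "inventory-mysql-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "topic.prefix": "prod",  # "database.server.name" on older Debezium versions
        "table.include.list": "inventory.orders,inventory.customers",
    },
}

# Register the connector with a (hypothetical) Kafka Connect cluster.
resp = requests.post("http://connect-host:8083/connectors", json=connector)
resp.raise_for_status()
```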
8. Snapshot
- Definition: A full, point-in-time copy of a dataset, often used as the initial baseline in replication before switching to incremental updates.
- Example: When setting up CDC, tools often perform a one-time full snapshot of the source database’s tables, then continue with changes from logs (oplog, WAL, etc.).
9. Polymorphic (or Heterogeneous) Data
- Definition: Data that can take various forms or structures within the same dataset, common in NoSQL environments (e.g., JSON documents in MongoDB that may have slightly different fields).
- Example: In an e-commerce collection, some products have a `color` field, while others do not. Polymorphism requires the integration tool to handle both scenarios gracefully.
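In code, this usually means never assuming a field exists; a tiny Python illustration:

```python
# Documents from the same collection may not share identical fields.
products = [
    {"sku": "A1", "name": "T-shirt", "color": "blue"},
    {"sku": "B2", "name": "Gift card"},  # no "color" field at all
]

# Tolerate the missing field instead of assuming a fixed schema.
for p in products:
    color = p.get("color", "unknown")
    print(p["sku"], color)
```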
10. JSON Flattening
- Definition: The process of converting nested JSON structures into a more tabular form (columns and rows). This often involves expanding out arrays or nested objects into separate tables or columns.
- Example: A `user` document might have an embedded address object `{ "street": "123 Main St", "city": "Somewhere" }`. Flattening would create columns like `user.address.street` and `user.address.city`.
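pandas' `json_normalize` does exactly this kind of expansion; a small illustrative example:

```python
import pandas as pd

users = [
    {"id": 1, "address": {"street": "123 Main St", "city": "Somewhere"}},
    {"id": 2, "address": {"street": "456 Oak Ave", "city": "Elsewhere"}},
]

# json_normalize expands the nested object into dotted column names.
flat = pd.json_normalize(users)
print(flat.columns.tolist())  # ['id', 'address.street', 'address.city']
```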
11. Table Format (e.g., Apache Iceberg, Delta Lake, Hudi)
- Definition: A structured way to store large analytics datasets in a data lake with features like ACID transactions, version control, partition pruning, and schema evolution.
- Example: Apache Iceberg organizes data into manifests and metadata files, letting tools like Spark or Trino treat a folder of Parquet files as an updatable “table.”
12. Query Engines (e.g., Spark, Trino, Presto, Hive)
- Definition: Systems that can query data stored in files (e.g., Parquet) on a data lake, without needing a traditional data warehouse. They typically distribute compute across a cluster.
- Example: Using Trino to run SQL queries against an Iceberg table stored on S3.
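As an illustration, the `trino` Python client can submit such a query; the cluster host, catalog, and schema names are assumptions:

```python
import trino

# Assumes an Iceberg catalog named "lake" is configured on the cluster.
conn = trino.dbapi.connect(
    host="trino-host", port=8080, user="analyst",
    catalog="lake", schema="analytics",
)
cur = conn.cursor()
cur.execute("SELECT region, count(*) FROM orders GROUP BY region")
for region, cnt in cur.fetchall():
    print(region, cnt)
```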
13. Schema Evolution
- Definition: The ability to adjust a schema (add/remove columns, change types) without reloading the entire dataset. Often used in modern table formats and NoSQL integrations.
- Example: If a product table adds a new `category` field, a schema-evolved table format like Iceberg can incorporate this new column on the fly.
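With Iceberg and Spark, the change is a metadata-only DDL statement; a sketch assuming a catalog named `lake` is already configured for Iceberg:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Adding the column rewrites only table metadata; existing data files are
# untouched, and old rows simply read the new column as NULL.
spark.sql("ALTER TABLE lake.db.products ADD COLUMN category STRING")
spark.sql("SELECT sku, category FROM lake.db.products LIMIT 5").show()
```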
14. Schema Registry
- Definition: A system or service that manages and enforces schemas (and their versions) for data streams, often used with Kafka or other event streaming platforms to ensure compatibility across producers and consumers.
- Example: Confluent’s Schema Registry can store Avro or JSON schemas so that all microservices writing to a Kafka topic follow the same structure.
15. ACID Transactions
- Definition: A set of properties (Atomicity, Consistency, Isolation, Durability) ensuring reliable transaction processing in databases or table formats.
- Example: Apache Iceberg or Delta Lake can enforce ACID operations on top of a data lake, so queries see consistent snapshots of data even while new files are being written.
16. Parquet
- Definition: A columnar file format that supports efficient compression and encoding. Widely used in big data analytical workloads for faster queries and smaller storage footprints.
- Example: Instead of storing all user data in CSV, a data pipeline writes it to Parquet for quicker aggregation queries in Spark.
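A small pandas example of writing and selectively reading Parquet (requires `pyarrow` or `fastparquet` to be installed):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "IN"],
    "spend": [120.5, 88.0, 42.3],
})

# Columnar layout plus compression on disk.
df.to_parquet("users.parquet", compression="snappy")

# Column pruning: only the requested columns are read back.
spend_only = pd.read_parquet("users.parquet", columns=["user_id", "spend"])
```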
17. Avro
- Definition: A row-oriented binary format commonly used in streaming architectures (e.g., Kafka) due to its flexible schema definition and schema evolution capabilities.
- Example: In a Kafka-based pipeline, events might be serialized as Avro, with the schema stored in a Schema Registry so consumers can deserialize properly.
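A self-contained sketch using the `fastavro` library; in a Kafka pipeline the schema would typically live in a Schema Registry rather than in each file, but the serialization idea is the same:

```python
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "status", "type": "string"},
    ],
})

records = [{"order_id": 1, "status": "created"}, {"order_id": 2, "status": "shipped"}]

# Serialize with the schema embedded in the file header...
with open("orders.avro", "wb") as out:
    writer(out, schema, records)

# ...and read it back; the embedded schema drives deserialization.
with open("orders.avro", "rb") as f:
    for rec in reader(f):
        print(rec)
```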
18. Data Lake
- Definition: A large-scale repository of raw data in its original format, typically stored on object stores like Amazon S3 or Hadoop Distributed File System (HDFS).
- Example: A company ingests website clickstream data, IoT sensor data, and transactional logs into an S3 bucket for future analytics without rigid schema requirements upfront.
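Landing raw data often amounts to copying files into object storage as-is; a boto3 sketch with a hypothetical bucket and key layout:

```python
import boto3

# Raw events are landed unchanged, with no schema enforced at write time.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="clickstream-2024-01-01.json.gz",
    Bucket="my-data-lake",
    Key="raw/clickstream/dt=2024-01-01/part-0001.json.gz",
)
```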
19. Data Warehouse
- Definition: A central repository of integrated data designed primarily for query and analysis, typically with a predefined schema.
- Example: In Snowflake, Redshift, or BigQuery, users run SQL queries on structured data that has been loaded and optimized for fast analytical queries.
20. Data Lakehouse
- Definition: An architectural concept combining the flexibility/cost-effectiveness of a data lake with data warehouse features such as ACID transactions, schema enforcement, and performance optimizations.
- Example: Using Apache Iceberg on S3 with a query engine like Trino or Spark to achieve transactional updates and traditional BI analytics, all on lower-cost object storage.
21. Orchestration (e.g., Airflow, Luigi, Dagster)
- Definition: Tools or frameworks that manage the scheduling and execution of data pipeline tasks, enforcing dependencies and handling retries.
- Example: Airflow can schedule a job that first loads fresh data from Postgres to S3, then triggers a Spark job to partition and optimize that data.
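A skeletal Airflow 2 DAG expressing that dependency; the script names are placeholders, and older Airflow versions use `schedule_interval` instead of `schedule`:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily pipeline: land fresh data in S3, then optimize it with Spark.
with DAG(
    dag_id="postgres_to_lake",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract_postgres_to_s3",
        bash_command="python extract_postgres.py",  # placeholder script
    )
    optimize = BashOperator(
        task_id="spark_partition_and_optimize",
        bash_command="spark-submit optimize_lake.py",  # placeholder job
    )

    extract >> optimize  # enforce the dependency between the two steps
```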
22. Batch vs. Streaming
- Definition: Batch processing handles data in “chunks” at scheduled intervals, while streaming processes data in near real-time or event-by-event.
- Example: A daily batch job might reload all the day’s sales data, whereas a streaming pipeline using CDC logs replicates changes within seconds.
23. Lake Metadata Catalog (e.g., Hive Metastore, AWS Glue)
- Definition: A centralized store that tracks table schemas, locations, and partitions for data files in a lake environment.
- Example: Apache Iceberg can store table metadata in the Hive Metastore or an AWS Glue Catalog, so query engines like Spark know how to access files.
24. Polyglot Persistence
- Definition: A strategy of using different storage technologies to handle different data needs.
- Example: Using MongoDB for flexible document storage, MySQL for transactions, and a data lake on S3 for analytics—each store specialized for its use case.
25. Polymorphic Data / Semi-Structured Data
- Definition: Data that can appear in varied forms within the same dataset, often JSON or XML, where records don’t always share an identical schema.
- Example: One record might have a `contact` field with nested objects, while another record in the same collection has entirely different nested fields for contact details.
26. Aggregation / Analytical Query
- Definition: Operations that summarize or derive insights from data (sums, counts, averages, group-by, etc.), typically run on columnar storage for efficiency.
- Example: Running a Spark SQL job to compute the total sales by region from a Parquet table on S3.
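A PySpark sketch of that kind of group-by over a hypothetical curated sales table:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-by-region").getOrCreate()

# Hypothetical curated sales data stored as Parquet on S3.
sales = spark.read.parquet("s3a://my-lake/curated/sales/")

totals = sales.groupBy("region").agg(F.sum("amount").alias("total_sales"))
totals.show()
```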
27. Federation / Federated Query
- Definition: The ability to query multiple data sources (databases, data lakes, REST APIs) through a single query interface.
- Example: A Trino cluster configured with multiple catalogs can query S3 data in Iceberg format, data in a MySQL database, and a Hive cluster, all in one SQL statement.
28. Ingestion Latency
- Definition: The time delay between when data is generated in the source system and when it becomes available in the target or analytics platform.
- Example: A streaming CDC approach using Debezium typically has lower ingestion latency (seconds) compared to batch ingestion (hours or days).
29. Idempotency
- Definition: A property of an operation where repeating it has the same effect as doing it once (important for data pipelines to avoid duplicates).
- Example: If a pipeline retries an upsert operation on a row with the same primary key, it won’t cause double inserts or conflicting state.
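One common way to achieve this in a relational target is an upsert keyed on the primary key; a psycopg2 sketch with placeholder connection details and table name:

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics user=loader host=warehouse-host")

row = {"user_id": 42, "plan": "pro"}

with conn, conn.cursor() as cur:
    # Re-running this statement with the same key overwrites the row
    # instead of inserting a duplicate, so retries are safe.
    cur.execute(
        """
        INSERT INTO subscriptions (user_id, plan)
        VALUES (%(user_id)s, %(plan)s)
        ON CONFLICT (user_id) DO UPDATE SET plan = EXCLUDED.plan
        """,
        row,
    )
```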
30. Observability (Logging, Metrics, Tracing)
- Definition: Tools and processes that help you understand and diagnose the behavior, performance, and health of your data pipelines.
- Example: A pipeline might emit logs for each batch processed, metrics for rows per second, and traces for how long each step took to execute.
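A toy Python example emitting a log line per batch that doubles as simple throughput metrics:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def process_batch(rows):
    start = time.monotonic()
    processed = len(rows)          # stand-in for real transformation work
    elapsed = time.monotonic() - start

    # The log records what happened; the numbers double as simple metrics
    # (duration, rows/sec) that a monitoring system could scrape.
    log.info("batch done: rows=%d duration_s=%.3f rate=%.1f rows/s",
             processed, elapsed, processed / max(elapsed, 1e-9))

process_batch(list(range(10_000)))
```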