Apache Iceberg: What It Is and Why It Matters
Data management has come a long way since the early days of Hadoop. If you’re running queries on petabyte-scale data lakes or managing real-time analytics, the challenges of data keep evolving with each passing year—and so do the tools.
Apache Iceberg is one tool that has rapidly gained traction for modern data management (and shows no signs of slowing down).
In this post, we’ll take a practical, example-driven look at the evolution of big data and see why Iceberg stands out.
Prerequisites: none. This is a beginner-friendly guide to Iceberg.
A good way to understand data management is by comparing it to a library:
- Storage
  - Physical and digital shelves where books (data) reside.
  - In tech, this can be on-premises (like HDFS) or in the cloud (like Amazon S3, Azure Blob, or Google Cloud Storage).
- Processing
  - The librarian who locates and retrieves the requested book.
  - In data terms, this could be engines such as Apache Spark, Presto, or even Apache Hive’s execution model.
- Metadata
  - The card catalog or indexing system that tells you where to find each book.
  - Tools like the Hive Metastore or Iceberg’s table metadata do something similar for data files.
The bigger the library (your data), the more crucial the metadata and indexing systems become.
A Brief History of Big Data
Hadoop and MapReduce: Early Data OGs
In the early 2000s, data volumes started exceeding what a single server could handle. Apache Hadoop emerged as a solution around 2005, introducing:
- Hadoop Distributed File System (HDFS): Distributed storage across many servers.
- MapReduce: A parallel processing model to handle data in batches.
This setup worked well for large-scale batch processing at the time, but writing Java-based MapReduce jobs was cumbersome, especially for analysts who preferred SQL.
Hive Brought SQL into the Picture
Apache Hive came along in 2008, allowing analysts to write SQL queries that Hive would translate into MapReduce jobs. It also introduced the Hive Metastore, which recorded where data files were located in HDFS.
However, times changed quickly:
- Cloud Adoption: More data moved to cloud storage like Amazon S3.
- On-Demand Analytics: Tools like Presto and Spark offered faster, more interactive querying.
Hive’s tight coupling to HDFS and MapReduce meant it wasn’t flexible enough for cloud storage or for the faster, more interactive analytics users had begun to expect.
Real-Time Demands?
By the 2010s, the “library” (our data ecosystem) faced two big shifts:
- Storing data in the cloud for cost efficiency and scalability, while still expecting fast query execution on that data.
- Running diverse workloads (batch, streaming, interactive queries) with multiple engines, not just MapReduce.
Yet many organizations still had huge investments in HDFS-based systems like Hive. The question became: How do we evolve our data management layer without throwing out everything we’ve built?
Apache Iceberg - The Metadata Layer Decoupling Storage and Compute
Introduced in 2017, Apache Iceberg doesn’t replace your storage or your processing engine (so, no migration burden). Instead, it adds a robust metadata layer between the two. Think of it as an upgraded “index” for your library that:
- Understands multiple “librarians” (processing engines): Spark, Presto, Trino, Flink, Hive—you name it.
- Works with multiple storage backends: HDFS, Amazon S3, Azure, Google Cloud, and beyond.
When queries come in from any engine, Iceberg’s metadata helps pinpoint exactly where the relevant data resides, drastically reducing unnecessary reads.
Catalog - Fine-Grained Metadata
Traditional systems (like our library) often store basic metadata—just enough to locate files or partitions. Iceberg goes further by maintaining a more granular index. This includes:
- File-level pointers (e.g., “This table partition is in these files, on that storage system”).
- Statistics like row counts, min/max values, and more (depending on your setup).
This granular detail makes queries faster because engines can skip reading irrelevant files. It’s like having a library index that not only tells you which shelf has a history book, but also which chapter or section you need.
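To make this concrete, Iceberg exposes its bookkeeping as queryable metadata tables that any supported engine can read. The sketch below is illustrative PySpark, not a prescription: the catalog name my_catalog, the warehouse path, and the table db.orders are assumptions you’d replace with your own.

```python
from pyspark.sql import SparkSession

# A minimal session with a single Iceberg catalog. Assumes the Iceberg
# Spark runtime jar is on the classpath; all names here are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-metadata-peek")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# The "files" metadata table lists every data file in the table along with
# per-file statistics: the raw material engines use to skip irrelevant files.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM my_catalog.db.orders.files
""").show(truncate=False)

# The "snapshots" metadata table records every committed change to the table.
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM my_catalog.db.orders.snapshots
""").show(truncate=False)
```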
Powerful Data Governance Features
On top of its performance benefits, Iceberg brings powerful data governance capabilities.
- Snapshotting and Versioning
  - Iceberg keeps a list of snapshots, so you can roll back or audit changes over time (see the time-travel sketch after this section).
  - This is especially useful for regulatory compliance and debugging.
- ACID Transactions
  - Operations like INSERT, DELETE, and MERGE are atomic, ensuring data consistency.
  - This is a big step up for systems previously limited to batch updates.
- Schema Evolution
  - You can add, rename, or remove columns without breaking your queries or rewriting entire datasets (sketched in code below).
  - In a traditional approach, such changes might require costly migrations.
- Partition Evolution [Read about OLake S3 data partition]
  - Changing how data is physically partitioned (e.g., by date, region, or other columns) no longer means rebuilding the entire table.
  - Iceberg can handle new partition layouts over time while keeping old data accessible.
For a practical example, consider a table that was initially partitioned by month but now needs daily partitioning for more granular analytics. With Iceberg, you can apply the new partition scheme going forward, while still querying older monthly partitions.
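In Spark SQL, that partition change (and schema evolution alongside it) comes down to a few metadata-only statements. Here is a hedged sketch, assuming Iceberg’s SQL extensions are enabled on the session and reusing the hypothetical my_catalog.db.orders table from earlier:

```python
# Schema evolution: pure metadata changes; no data files are rewritten.
spark.sql("ALTER TABLE my_catalog.db.orders ADD COLUMNS (discount_pct double)")
spark.sql("ALTER TABLE my_catalog.db.orders RENAME COLUMN amount TO order_amount")

# Partition evolution: switch from monthly to daily partitioning going
# forward. Files written under the old monthly spec stay where they are
# and remain fully queryable.
spark.sql("ALTER TABLE my_catalog.db.orders DROP PARTITION FIELD months(order_ts)")
spark.sql("ALTER TABLE my_catalog.db.orders ADD PARTITION FIELD days(order_ts)")
```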
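Snapshots and ACID guarantees are just as direct to use. Another hedged sketch: the timestamp, the snapshot id, and the "updates" staging view are placeholders, and time travel in this SQL form assumes a reasonably recent Spark (3.3+):

```python
# Time travel: query the table as it existed at an earlier point in time.
spark.sql("""
    SELECT count(*) AS orders_then
    FROM my_catalog.db.orders TIMESTAMP AS OF '2024-01-15 00:00:00'
""").show()

# Or pin a query to an exact snapshot id (listed in the snapshots table).
spark.sql("SELECT * FROM my_catalog.db.orders VERSION AS OF 123456789012345678").show()

# ACID upsert: MERGE commits atomically, so concurrent readers never see a
# half-applied change. Assumes a staged view of incoming rows named "updates".
spark.sql("""
    MERGE INTO my_catalog.db.orders t
    USING updates s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Roll back to a known-good snapshot with an Iceberg stored procedure.
spark.sql("CALL my_catalog.system.rollback_to_snapshot('db.orders', 123456789012345678)")
```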
Example Workflow: Spark and S3 with Iceberg
Let’s see how this might look in practice:
- Storage: You choose Amazon S3 for its scalability and cost-effectiveness.
- Compute: You have Spark jobs running on Kubernetes or EMR for both batch ETL and real-time analytics.
- Metadata: You set up Iceberg tables in S3. The metadata points to each data file, records statistics, and manages snapshots.
When a Spark job runs a query like `SELECT * FROM orders WHERE region = 'US'`, Iceberg’s metadata tells Spark which files actually contain rows with `region = 'US'`. Spark avoids scanning the entire dataset, leading to faster queries and lower cost.
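Putting it all together, here is one way the whole setup might look in PySpark. Treat it as a sketch: the package version, catalog name, bucket, and table schema are assumptions, and it presumes your environment already has S3 credentials configured.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-on-s3")
    # Pull in the Iceberg Spark runtime; match the build to your Spark/Scala.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Iceberg's SQL extensions enable MERGE, partition DDL, and CALL procedures.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # An Iceberg catalog whose warehouse (data + metadata) lives in S3.
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Create a table with hidden partitioning on the order timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.db.orders (
        order_id bigint,
        region   string,
        amount   double,
        order_ts timestamp
    )
    USING iceberg
    PARTITIONED BY (months(order_ts))
""")

spark.sql("""
    INSERT INTO my_catalog.db.orders VALUES
        (1, 'US', 120.0, TIMESTAMP '2024-01-10 09:30:00'),
        (2, 'EU',  80.0, TIMESTAMP '2024-02-02 14:00:00')
""")

# Iceberg's file-level stats let Spark prune files that cannot match the
# filter, so this scan touches only files that may contain 'US' rows.
spark.sql("SELECT * FROM my_catalog.db.orders WHERE region = 'US'").show()
```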
Why Iceberg Continues to Shine
- Fine-grained metadata reduces I/O, saving both time and money in cloud environments.
- Supports multiple data engines (that is, the query or processing frameworks that read, write, and process data stored in Iceberg tables) and cloud storage providers, letting you mix and match your technologies as you see fit.
- Snapshotting, schema evolution, and partition evolution help maintain data quality and adaptability over time.
- Iceberg has wide adoption at companies like Netflix, Apple, and Adobe. Its vibrant open-source community ensures continuous innovation.
Conclusion
Modern data management needs a system that can keep pace with:
- Explosive data growth
- Diverse processing requirements
- Multiple storage locations
- Strict governance demands
Iceberg addresses these challenges by providing a sophisticated (yet efficient) metadata layer that decouples storage from compute, streamlines performance, and builds in advanced data governance features.
If you’re looking to future-proof your data infrastructure, especially as AI and real-time analytics drive new demands, Iceberg is a technology worth exploring. And with OLake’s support for an Iceberg writer (with Databases as Sources), you can replicate your data in less time than it takes to order a pizza for a house party.
If you have any questions or want to learn more, drop us a line at hello@olake.io or book a quick demo meeting with us.