Why Move to Apache Iceberg? A Practical Guide to Building an Open, Multi-Engine Data Lake
OLake GitHub - https://github.com/datazip-inc/olake
Many data teams are exploring the benefits of Apache Iceberg for managing large datasets on a data lake. Iceberg promises features like multi-engine support, easier updates and deletes on massive volumes of data, and freedom from vendor lock-in.
However, it also comes with limitations and operational requirements: Iceberg may be the future of big data, but only if you have the kind of data where it makes a difference.
This blog post aims to give you a full picture of what Iceberg is, why organizations are interested in it, how it compares to other solutions, and what pitfalls you should be aware of before jumping in.
What Is Apache Iceberg?
Apache Iceberg is an open table format designed for large analytic datasets stored in data lakes (for example, on AWS S3). It introduces an additional metadata layer on top of files like Parquet, making it easier to:
- Track schemas and schema evolution.
- Manage snapshots (time travel). Every commit to an Iceberg table creates an immutable snapshot recorded in a metadata file. Each snapshot lists exactly which data files (and their row counts, statistics, etc.) belong to that version of the table. Because older snapshots are retained, you can query the table as it existed at any earlier commit (see the sketch after this list).
- Perform updates and deletes without rewriting entire files. (In Merge-on-Read mode, changes are written as separate delete files; Copy-on-Write mode rewrites the affected data files.)
- Work with multiple query engines (like Spark, Trino, and StarRocks).
- And many more, but this isn't an Iceberg features guide.
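To make the snapshot idea concrete, here is a minimal PySpark sketch of Iceberg time travel. The catalog name (`demo`), the warehouse path, the table `demo.db.events`, and the snapshot id are all illustrative assumptions for this post, not anything Iceberg prescribes:

```python
# A minimal, hypothetical sketch of Iceberg snapshots and time travel in PySpark.
# Catalog name "demo", the warehouse path, and table demo.db.events are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-time-travel")
    # Iceberg's SQL extensions enable time travel, DDL, and maintenance procedures.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Every commit is recorded as a snapshot in the table metadata.
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM demo.db.events.snapshots").show()

# Query the table as it existed at an earlier snapshot or point in time
# (the snapshot id below is a made-up placeholder).
spark.sql("SELECT * FROM demo.db.events VERSION AS OF 4348167554668442265").show()
spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2025-01-01 00:00:00'").show()
```

The later code sketches in this post reuse this `spark` session and these hypothetical names.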
While some services use Iceberg “under the hood,” the real advantage comes from using its multi-engine capabilities and decoupling storage from compute. This means you can switch engines or vendors as needed without being locked into a single platform.
Why Are Companies Considering Apache Iceberg?
1. Multi-query-engine Support
Many organizations have different teams using different tools. One team might use Athena or Trino for interactive analytics, while another relies on Snowflake or Spark for batch processing. Iceberg acts like a common layer, allowing these teams to access the same data with their preferred engine.
2. Avoiding Vendor Lock-In
Iceberg’s open nature lets you pick the query engine or cloud service that fits your needs. If you ever decide to leave a particular vendor, your data remains in open file formats, and your Iceberg table metadata can still be read by other platforms. Cool, right?
3. Handling Large Datasets
Some companies deal with hundreds of terabytes (or more) of event data. Iceberg supports partial updates and deletes (equality deletes, positional deletes), which can be very useful if you need to comply with regulations like GDPR. Instead of rewriting entire Parquet files, you can target specific data that must be removed or altered.
Read more about Iceberg deletes in our Merge-on-Read and Copy-on-Write Iceberg blog series.
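As a rough illustration of what a GDPR-style erasure can look like, here is a hedged Spark SQL sketch against the hypothetical `demo.db.events` table from earlier; the `user_id` column and value are assumptions for the example:

```python
# A hedged sketch of a GDPR-style row deletion, reusing the `spark` session
# and hypothetical demo.db.events table from the earlier sketch.

# Optional: switch deletes to merge-on-read so they write small delete files
# instead of rewriting whole data files (copy-on-write is Iceberg's default).
spark.sql("""
    ALTER TABLE demo.db.events
    SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read')
""")

# Remove one user's rows without touching unrelated Parquet files.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 'user-123'")
```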
4. A “Lakehouse” Approach
Iceberg is often seen as part of a broader lakehouse architecture, bringing warehouse-like features (ACID properties, versioning, etc.) to a data lake. This can be an attractive alternative to a traditional data warehouse if you have varied workloads and want to pay only for storage and compute as you use them.
Iceberg is not something you MUST adopt by abandoning everything you had in place before you heard about it. It is a use-case-driven technology, not something to deploy everywhere.
Potential Challenges and Limitations
1. Maintenance Overhead
Iceberg offers powerful features, but those features come with additional operational tasks. You may need to:
- Expire or vacuum outdated snapshots so metadata and storage don't grow without bound.
- Compact the small files that frequent ingestion leaves behind (covered in more detail below).
For smaller datasets or transactional workloads, a simple RDBMS (like MySQL) might require far less upkeep.
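For a sense of what that upkeep looks like in practice, here is a minimal sketch using Iceberg's built-in Spark maintenance procedures; the catalog and table names are the same illustrative assumptions as before:

```python
# A minimal maintenance sketch using Iceberg's Spark procedures, reusing
# the `spark` session and hypothetical names from the earlier sketches.

# Drop snapshots older than a cutoff, but always keep the last 10.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2025-01-01 00:00:00',
        retain_last => 10
    )
""")

# Delete files in the warehouse that no snapshot references anymore.
spark.sql("CALL demo.system.remove_orphan_files(table => 'db.events')")
```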
2. Security Limitations
Iceberg stores data in files (commonly Parquet), meaning entire files are accessed when queried. Row-level or column-level security has to be enforced at the catalog level. If you use multiple query engines with multiple catalogs, these security rules might not consistently apply everywhere. Organizations handling sensitive PII or requiring fine-grained access control may need a different approach.
The Iceberg community is working on a solution, but no concrete decision has been made so far, and the table formats differ in their implementation details here.
3. The Small File Problem
One well-known drawback in data lake solutions is the “small file problem.” Real-time or frequent mini-batch ingest can create thousands of tiny files, which slows queries because each file must be opened individually.
Iceberg itself won’t automatically solve this; you need to routinely run “optimize” or similar jobs to combine small files into larger ones.
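A typical compaction job can be as small as the sketch below, which uses Iceberg's `rewrite_data_files` procedure to merge small files toward a target size; the ~512 MB target and the table names are assumptions for illustration:

```python
# A hedged compaction sketch: merge small files into larger ones (here
# targeting roughly 512 MB per file) with Iceberg's rewrite_data_files
# procedure. Catalog and table names remain illustrative assumptions.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```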
4. Writing from Multiple Engines
Certain engines (for instance, Snowflake) do not support writing to external catalogs. You might end up in a situation where only one engine can write to the Iceberg table, and the other engine can only read. This constraint can limit some aspects of multi-engine interoperability.
Comparing Iceberg with Other Table Formats
1. Delta Lake
Pros
- Strong ACID guarantees with mature merge-on-read implementation.
- Deep Databricks integration: seamless notebooks, MLflow tie-ins, and performance features optimized for the Databricks runtime.
- Growing ecosystem of connectors, automatic optimizations (e.g., OPTIMIZE), and governance tooling.
Cons
- Databricks-centric features: several advanced capabilities—OPTIMIZE, Z-ORDER, serverless SQL endpoints—require a paid Databricks SKU or run best on its platform.
- Vendor-controlled governance: the roadmap and licensing are primarily driven by Databricks, so community influence is limited compared with truly vendor-neutral projects.
- Narrower community adoption: outside the Databricks stack, Delta Lake sees less use than Apache Iceberg, which has broad contributions from Apple, Netflix, Alibaba, AWS, Snowflake, and many others.
When to choose Delta Lake
- Your analytics stack already runs mainly on Databricks and you want the tightest possible integration and turnkey optimizations.
When to prefer Apache Iceberg
- You need engine-agnostic access (Spark, Trino, Flink, DuckDB, etc.) or want to avoid vendor lock-in and benefit from a large, neutral open-source community.
2. Hudi
- Pros: Good for scenarios requiring high-frequency upserts and incremental data.
- Cons: Lower adoption and community support compared to Iceberg. Some teams report challenges with scaling it, or with migrating away from it later.
3. Data Warehouse vs. Data Lakehouse
| When to Favour… | Snowflake-style Warehouse | Iceberg-powered Lakehouse |
|---|---|---|
| Typical Data Size | ≤ a few TB, mostly BI dashboards | Tens of TBs to PBs, diverse use cases |
| Primary Need | Turn-key SQL analytics with minimal ops | Unified platform for BI and Data Science / ML |
| Cost Profile | Optimized at small–medium scale (compute + premium storage bundled) | Cheap object storage + decoupled compute → 30–60% lower $/TB at scale |
| Flexibility | Single engine, limited ML tooling | Query the same table from Spark, Trino, Flink, Pandas, etc. |
| Advanced Features | Built-in governance & RBAC | Open-standard ACID tables, schema evolution, time travel, row-level deletes |
Key Takeaways
- Scale & Cost: Warehouses shine for smaller, pure-analytics workloads. Once your data crosses the tens-of-terabytes mark, Iceberg lets you keep costs predictable by separating cheap storage from on-demand compute.
- Breadth of Workloads: Iceberg isn’t just “warehouse-like SQL on a lake.” It unlocks the same governed datasets for both ad-hoc BI and data-science notebooks or Spark pipelines—something most proprietary warehouses struggle to match.
- Future-Proofing: Because Iceberg is an open format, you avoid vendor lock-in and can plug in new engines (e.g., Apache Flink, DuckDB, Ray) as your stack evolves, without rewriting ETL or migrating data.
When to Use Spark, Trino, and StarRocks
Iceberg is often paired with multiple engines, each serving a different use case:
- Spark: Ideal for large-scale batch processing and data transformations.
- Trino: Well-suited for interactive SQL queries and fast analytics across diverse data sources.
- StarRocks: Focused on high-performance, real-time analytical workloads.
By placing Iceberg underneath, you enable all these engines to operate on the same data without duplicating it or relying on different file layouts. This is a major advantage for teams that want specialized processing but still want to keep a single source of truth.
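One common way to make this work is to point every engine at the same shared catalog. The sketch below shows a Spark session configured against a hypothetical Iceberg REST catalog; the endpoint, catalog name, and warehouse path are all assumptions, and Trino or StarRocks would be configured against the same catalog on their side:

```python
# A hedged sketch: point Spark at a shared Iceberg REST catalog so other
# engines configured against the same catalog see the same tables.
# The catalog name "demo", endpoint, and S3 path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-multi-engine")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "rest")
    .config("spark.sql.catalog.demo.uri", "http://rest-catalog:8181")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Any engine attached to the same catalog now queries the same table.
spark.sql("SELECT count(*) FROM demo.db.events").show()
```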
Who Should (and Shouldn’t) Adopt Iceberg?
Strong Candidates for Iceberg:
- Large Data Volumes: Organizations with hundreds of terabytes or petabytes benefit from its advanced metadata and partial update/delete capabilities.
- Multi-Engine Environments: If you need Athena, Spark, Trino, and more to work on the same data, Iceberg is built for that.
- GDPR or Regulatory Compliance: Iceberg can simplify removing specific rows of data compared to plain Parquet on a data lake.
- Data Science & ML Experimentation: Because Iceberg is fully open source and engine-agnostic, data scientists can load the exact same table version into Pandas, PySpark, or notebooks without export steps; time-travel snapshots make experiments reproducible and support lineage/audit for model training (see the sketch after this list).
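For example, with the PyIceberg library a data scientist can pull a pinned table version straight into Pandas; the catalog name, table name, and snapshot id below are illustrative assumptions:

```python
# A hedged sketch using the PyIceberg library (pip install "pyiceberg[pyarrow]"):
# load a pinned version of the same governed table straight into Pandas.
# Catalog name, table name, and snapshot id are illustrative assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo")        # connection details come from PyIceberg config
table = catalog.load_table("db.events")

# Pinning a snapshot id makes the experiment reproducible later.
df = table.scan(snapshot_id=4348167554668442265).to_pandas()
print(df.head())
```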
Cases Where Iceberg May Not Fit:
- Strict Row/Column Security Requirements: Catalog-level security may not be enough if you need granular or engine-specific enforcement.
- Small or Transactional Workloads: A simple database like MySQL may be easier to manage if your data volume is relatively small.
Are Vendors Worth It?
One big question is whether to manage Iceberg yourself or use a vendor. Vendors may provide:
- One-Click Maintenance: Automated file compaction and snapshot cleanup.
- Managed Catalog: Simplified schema management and less trouble with region-based duplications.
- Support and Expertise: Help with tricky integrations or performance optimizations, saving you from constant troubleshooting.
While open-source offers independence and flexibility, a vendor solution can reduce your operational overhead—especially useful if you do not have a large in-house data platform team.
But what can be better than an open-source tool that replicates data into Iceberg, like OLake? Check us out at github.com/datazip-inc/olake
Some Parting Thoughts
Even if Iceberg helps unify data, it does not free you from housekeeping. Plan for regular optimize/vacuum tasks and track snapshots.
If you are on AWS, explore how Iceberg interacts with EMR, Glue, and Lake Formation in a sandbox environment. As of 2025, AWS has added strong support for Iceberg.
For highly sensitive data, confirm how you will handle row or column-level restrictions, especially if multiple engines are reading the same files.
Right Tool for the Right Job:
If your current setup is a single database (say, MySQL):
- Iceberg lets you offload growing, read-heavy data from MySQL onto cheap object storage while still getting full ACID guarantees, keeping MySQL fast for OLTP.
- A single Iceberg table can be queried by Spark, Trino, Flink, and more at the same time—eliminating data copies and unlocking multi-engine analytics at petabyte scale.
- You gain painless schema evolution, partition pruning, and time-travel queries that MySQL lacks, simplifying analytics workflows as your dataset and team grow.
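To illustrate that last point, here is a hedged sketch of schema and partition evolution, the kind of change that is painful on a large MySQL table but is a metadata-only operation in Iceberg (the table and column names are assumptions):

```python
# A hedged sketch of metadata-only schema and partition evolution, reusing
# the `spark` session and hypothetical table from the earlier sketches.
# None of these statements rewrite existing data files.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (referrer STRING)")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN referrer TO referrer_url")

# Partition evolution: new writes use the new spec; old files stay valid.
# (event_ts is an assumed timestamp column.)
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD days(event_ts)")
```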
In short, Apache Iceberg can transform your data lake into something that feels closer to a warehouse—providing ACID-like transactions, flexible updates, and broad engine compatibility.
Yet it is no magic bullet. You must be ready to handle its operational tasks, integration quirks, and potential security gaps for row-level or column-level data. For those that get it right, Iceberg’s open ecosystem and vendor-neutral approach can be a game-changer, especially in large, multi-tool environments.
Conclusion
Apache Iceberg is a powerful way to build a truly open, multi-engine data lake. It addresses many shortcomings of using raw Parquet or Hive tables, such as difficult schema evolution and complicated deletes.
Before adopting Iceberg, ask yourself:
- Do I have large datasets needing frequent updates or deletions?
- Will multiple teams or engines benefit from the same data?
- Are my security requirements covered by catalog-level rules?
- Do I have the capacity to manage and optimize Iceberg tables?
If the answer to these questions aligns with Iceberg’s strengths, it can become a solid cornerstone of your data platform strategy. Otherwise, you might be better served by simpler solutions or a different table format—at least until your data reaches a scale and complexity where Iceberg truly shines.
OLake
Achieve 5x faster data replication to Lakehouse formats with OLake, our open-source platform for efficient, quick, and scalable big-data ingestion for real-time analytics.