
Apache Spark 3.3+

The reference implementation for Apache Iceberg with comprehensive read/write support, full DML capabilities, and GA Format V3 support

Key Features

100
Full Support

Comprehensive Catalog Support

Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings

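As a minimal sketch of the spark.sql.catalog.* settings mentioned above, the helper below builds the key/value pairs that register an Iceberg catalog. The helper name iceberg_catalog_conf, the catalog name "lake", and the localhost URI are hypothetical; the catalog class and the type, uri, and warehouse suffixes follow the Iceberg Spark configuration docs as I understand them, and older Iceberg releases may need catalog-impl instead of type for some catalogs.

```python
ICEBERG_SPARK_CATALOG = "org.apache.iceberg.spark.SparkCatalog"


def iceberg_catalog_conf(name, catalog_type, warehouse=None, uri=None):
    """Return Spark settings that register an Iceberg catalog called `name`."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: ICEBERG_SPARK_CATALOG,
        f"{prefix}.type": catalog_type,  # e.g. hive, hadoop, rest, glue, jdbc, nessie
    }
    if warehouse is not None:
        conf[f"{prefix}.warehouse"] = warehouse
    if uri is not None:
        conf[f"{prefix}.uri"] = uri
    return conf


# Hypothetical REST catalog named "lake"; the endpoint is a placeholder.
conf = iceberg_catalog_conf("lake", "rest", uri="http://localhost:8181")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair can be passed to SparkSession.builder.config(key, value) or as --conf key=value on spark-submit.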
100
Full Support

Complete Read/Write Operations

Full table scans, metadata-table reads, INSERT INTO, atomic INSERT OVERWRITE, DataFrame writeTo, and stored procedures

100
Full Support

Advanced DML Operations

MERGE INTO, UPDATE, DELETE via Spark Session Extensions; Iceberg 0.14+ emits position/equality-delete files

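The row-level commands above rely on the Iceberg Spark session extensions being enabled. The sketch below shows the setting, as I understand it from the Iceberg docs, and assembles a simple upsert-style MERGE INTO statement. The helper merge_sql and the table, key, and column names are hypothetical, and identifiers are interpolated without quoting, so this is only suitable for trusted input.

```python
# Spark setting that enables Iceberg's SQL extensions (MERGE INTO, CALL, ...).
EXTENSIONS_CONF = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}


def merge_sql(target, source, key, columns):
    """Build an upsert: update matching rows, insert the rest."""
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    all_cols = [key, *columns]
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (
        f"MERGE INTO {target} t USING {source} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical target table and staging view.
statement = merge_sql("lake.db.customers", "updates", "id", ["name", "email"])
print(statement)
```

The resulting string would be passed to spark.sql(statement) on a session created with EXTENSIONS_CONF applied.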
100
Full Support

MoR/CoW Storage Strategies

Copy-on-Write default for delete commands; Merge-on-Read enabled via write.delete.mode=merge-on-read

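To illustrate the write.delete.mode switch, the sketch below generates an ALTER TABLE statement that sets the row-level write mode. The page only names write.delete.mode; write.update.mode and write.merge.mode are the companion Iceberg table properties for UPDATE and MERGE, included here as an assumption about what a typical setup would want. The helper name and the table name are hypothetical.

```python
ROW_LEVEL_MODE_KEYS = ("write.delete.mode", "write.update.mode", "write.merge.mode")
VALID_MODES = ("copy-on-write", "merge-on-read")


def row_level_mode_sql(table, mode="merge-on-read", keys=ROW_LEVEL_MODE_KEYS):
    """Build ALTER TABLE ... SET TBLPROPERTIES for the given write mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    props = ", ".join(f"'{key}'='{mode}'" for key in keys)
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


# Hypothetical table; switches DELETE, UPDATE and MERGE to merge-on-read.
mor_statement = row_level_mode_sql("lake.db.events")
print(mor_statement)
```

Passing keys=("write.delete.mode",) limits the change to DELETE commands only, which matches the single property named above.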
75
Partial Support

Streaming Capabilities

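Incremental reads with stream-from-timestamp; Append/Complete output modes; only append snapshots are streamed, so overwrite and delete snapshots must be skipped explicitly (streaming-skip-overwrite-snapshots, streaming-skip-delete-snapshots)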

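A streaming read is configured through reader options. The sketch below builds those options from a start time; stream-from-timestamp takes epoch milliseconds, and the two skip flags are set explicitly rather than relying on defaults. The helper name and the example start time are hypothetical.

```python
from datetime import datetime, timezone


def streaming_read_options(start, skip_overwrite=False, skip_delete=False):
    """Return reader options for an incremental Iceberg streaming read."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware to avoid ambiguity")
    return {
        # Epoch milliseconds from which to start streaming.
        "stream-from-timestamp": str(int(start.timestamp() * 1000)),
        "streaming-skip-overwrite-snapshots": str(skip_overwrite).lower(),
        "streaming-skip-delete-snapshots": str(skip_delete).lower(),
    }


# Hypothetical start point: midnight UTC on 2024-01-01.
options = streaming_read_options(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    skip_overwrite=True,
    skip_delete=True,
)
print(options)
```

With PySpark available, these would typically be applied as spark.readStream.format("iceberg").options(**options).load("lake.db.events"), where the table name is a placeholder.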
100
GA

Format V3 Support

GA read + write on Spark 3.5 with Iceberg 1.8+; Deletion Vectors, Row Lineage, new types, multi-arg transforms

100
Native SQL

Time Travel & Versioning

SQL VERSION AS OF / TIMESTAMP AS OF supported in Spark 3.3+; DataFrame as-of-timestamp option

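The two time-travel paths can be sketched as follows: SQL strings for VERSION AS OF and TIMESTAMP AS OF, and the as-of-timestamp reader option, which takes epoch milliseconds. The helper names, the table name, and the snapshot ID are hypothetical, and the timestamp literal in the SQL form is interpreted in the Spark session time zone.

```python
from datetime import datetime, timezone


def version_as_of_sql(table, snapshot_id):
    """Query the table as of a specific snapshot ID."""
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


def timestamp_as_of_sql(table, ts):
    """Query the table as of a wall-clock time (session time zone)."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{ts.strftime('%Y-%m-%d %H:%M:%S')}'"


def as_of_timestamp_option(ts):
    """Reader option for DataFrame reads; the value is epoch milliseconds."""
    if ts.tzinfo is None:
        raise ValueError("ts must be timezone-aware")
    return {"as-of-timestamp": str(int(ts.timestamp() * 1000))}


# Hypothetical table and snapshot ID.
by_version = version_as_of_sql("lake.db.events", 1234567890)
by_time = timestamp_as_of_sql("lake.db.events", datetime(2024, 1, 1, 12, 0, 0))
reader_option = as_of_timestamp_option(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(by_version, by_time, reader_option, sep="\n")
```

The SQL strings go to spark.sql(...); the option dictionary would be passed to spark.read.options(**reader_option).format("iceberg").load(...).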
100
Delegated

Enterprise Security

Delegates ACLs to underlying catalog (Hive Ranger, AWS IAM, Nessie policies); snapshot isolation; audit hooks


Apache Spark Iceberg Feature Matrix

Comprehensive breakdown of Iceberg capabilities in Apache Spark 3.3+

| Dimension | Support Level | Implementation Details | Min Spark Version |
|---|---|---|---|
| Catalog Types | Full (Complete) | Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, custom plug-ins | 3.0+ |
| Read/Write Operations | Full (Complete) | Table scans, metadata reads, INSERT INTO, INSERT OVERWRITE, DataFrame writeTo, stored procedures | 3.0+ |
| DML Operations | Full (Complete) | MERGE INTO, UPDATE, DELETE with position/equality delete files (Iceberg 0.14+) | 3.2+ |
| MoR/CoW Support | Full (Configurable) | Copy-on-Write default; Merge-on-Read via write.delete.mode configuration | 3.2+ |
| Streaming Capabilities | Partial (Limited) | Incremental reads, append/complete modes; overwrite and delete snapshots must be skipped explicitly | 3.0+ |
| Format V3 Support | Full (GA) | Deletion Vectors, Row Lineage, new types, multi-arg transforms (Spark 3.5+, Iceberg 1.8+) | 3.5+ |
| Time Travel | Full (Native SQL) | VERSION AS OF / TIMESTAMP AS OF SQL syntax; DataFrame as-of-timestamp option | 3.3+ |
| Security & Governance | Full (Delegated) | Delegates to catalog ACLs (Ranger, IAM, Nessie); snapshot isolation; audit metadata | 3.0+ |
| Schema Evolution | Full (Instant) | Add/drop/rename columns, type promotion; metadata-only operations | 3.0+ |
| Table Maintenance | Full (Built-in) | Stored procedures: expire_snapshots, rewrite_data_files, rewrite_manifests, remove_orphan_files | 3.0+ |
| Metadata Tables | Full (Complete) | history, snapshots, files, manifests, partitions, all_data_files, metadata_log_entries | 3.0+ |
| Branching & Tagging | Partial (API Only) | Java API support; SQL DDL and procedures for branch operations | 3.3+ |
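The maintenance procedures and metadata tables listed in the matrix are invoked through SQL. The sketch below assembles an expire_snapshots call with named arguments and a metadata-table query. The helper names, the catalog name "lake", and the table "db.events" are hypothetical; the argument names table, older_than, and retain_last follow the Iceberg procedure docs as I understand them.

```python
from datetime import datetime

# Metadata tables named in the feature matrix above.
METADATA_TABLES = {
    "history", "snapshots", "files", "manifests",
    "partitions", "all_data_files", "metadata_log_entries",
}


def expire_snapshots_sql(catalog, table, older_than, retain_last=1):
    """Build a CALL statement that expires snapshots older than a timestamp."""
    cutoff = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {int(retain_last)})"
    )


def metadata_table_sql(catalog, table, metadata_table):
    """Build a query against one of the table's metadata tables."""
    if metadata_table not in METADATA_TABLES:
        raise ValueError(f"unknown metadata table: {metadata_table!r}")
    return f"SELECT * FROM {catalog}.{table}.{metadata_table}"


# Hypothetical catalog and table names.
expire_stmt = expire_snapshots_sql("lake", "db.events", datetime(2024, 1, 1), 5)
snapshots_query = metadata_table_sql("lake", "db.events", "snapshots")
print(expire_stmt)
print(snapshots_query)
```

Both strings would be run with spark.sql(...); the CALL form needs the Iceberg session extensions enabled.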


Use Cases

Enterprise Data Lake Analytics

Large-scale analytical workloads with complex transformations

  • Multi-terabyte ETL pipelines with schema evolution
  • Complex analytical queries across multiple fact tables
  • Machine learning feature engineering on historical data
  • Cross-functional data sharing with fine-grained access control

Real-time and Batch Processing

Unified platform for both streaming and batch workloads

  • Lambda architecture with consistent data processing logic
  • CDC pipelines processing database changes in real-time
  • Event-driven architectures with exactly-once processing
  • Near real-time analytics with historical context

Data Engineering and Pipeline Development

Robust data transformation and pipeline orchestration

  • Multi-stage transformation pipelines with checkpointing
  • Data quality validation and cleansing at scale
  • Cross-system data synchronization and replication
  • Automated data archival and lifecycle management

Advanced Analytics and ML

Machine learning and advanced analytical workloads

  • Feature store implementations with time travel
  • A/B testing with historical data comparisons
  • Reproducible ML experiments with data versioning
  • Large-scale model training on partitioned datasets

