Apache Spark 3.3+
The reference engine for Apache Iceberg, with complete read/write support, full DML capabilities, and generally available (GA) Format V3 support
Key Features
Comprehensive Catalog Support
Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings
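As a rough illustration of the spark.sql.catalog.* settings mentioned above, the following sketch builds the configuration map for one Iceberg catalog. The helper name `iceberg_catalog_conf`, the catalog name `lake`, and the REST URI are hypothetical. The short `type` values are assumed to be accepted by a recent Iceberg release; older releases may need a `catalog-impl` class name instead.

```python
from typing import Dict, Optional

# Catalog types named in the feature list above. Assumption: these short names
# are valid values for the `type` catalog property in the Iceberg release used.
KNOWN_TYPES = {"hive", "hadoop", "rest", "glue", "jdbc", "nessie"}


def iceberg_catalog_conf(
    name: str,
    catalog_type: str,
    warehouse: Optional[str] = None,
    uri: Optional[str] = None,
) -> Dict[str, str]:
    """Build spark.sql.catalog.* settings for one Iceberg catalog (hypothetical helper)."""
    if catalog_type not in KNOWN_TYPES:
        raise ValueError(f"unsupported catalog type: {catalog_type}")
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        # Iceberg SQL extensions, used for row-level DML and CALL procedures.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
    }
    # Only emit optional properties that were actually supplied.
    if warehouse is not None:
        conf[f"{prefix}.warehouse"] = warehouse
    if uri is not None:
        conf[f"{prefix}.uri"] = uri
    return conf


# Hypothetical catalog name and endpoint, for illustration only.
conf = iceberg_catalog_conf("lake", "rest", uri="http://localhost:8181")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)` or supplied as `--conf key=value` on the command line.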
Complete Read/Write Operations
Full table scans, metadata-table reads, INSERT INTO, atomic INSERT OVERWRITE, DataFrame writeTo, and stored procedures
Advanced DML Operations
MERGE INTO, UPDATE, DELETE via Spark Session Extensions; Iceberg 0.14+ emits position/equality-delete files
MoR/CoW Storage Strategies
Copy-on-Write default for delete commands; Merge-on-Read enabled via write.delete.mode=merge-on-read
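To make the table-property switch concrete, here is a minimal sketch that renders the `ALTER TABLE ... SET TBLPROPERTIES` statement for the row-level write modes. The helper name `row_level_mode_sql` and the table name `lake.db.events` are hypothetical. Besides `write.delete.mode` from the text above, the sketch also accepts the sibling Iceberg properties `write.update.mode` and `write.merge.mode`.

```python
from typing import Optional

VALID_MODES = ("copy-on-write", "merge-on-read")


def row_level_mode_sql(
    table: str,
    delete_mode: Optional[str] = None,
    update_mode: Optional[str] = None,
    merge_mode: Optional[str] = None,
) -> str:
    """Return an ALTER TABLE statement that sets the chosen write modes."""
    requested = {
        "write.delete.mode": delete_mode,
        "write.update.mode": update_mode,
        "write.merge.mode": merge_mode,
    }
    props = []
    for key, mode in requested.items():
        if mode is None:
            continue  # leave this property at its current value
        if mode not in VALID_MODES:
            raise ValueError(f"{key} must be one of {VALID_MODES}, got {mode!r}")
        props.append(f"'{key}'='{mode}'")
    if not props:
        raise ValueError("at least one mode must be given")
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({', '.join(props)})"


# Hypothetical table name, for illustration only.
stmt = row_level_mode_sql("lake.db.events", delete_mode="merge-on-read")
print(stmt)
```

The resulting string would be run with `spark.sql(stmt)`. Leaving a mode unset keeps the table's existing behaviour for that command.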
Streaming Capabilities
Incremental reads with stream-from-timestamp; Append/Complete output modes; overwrite and delete snapshots skipped by default
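The streaming read options can be sketched as a small options map. The helper name `streaming_read_options` is hypothetical. `stream-from-timestamp` takes epoch milliseconds, and the two skip flags are set explicitly here so the behaviour does not depend on the defaults of a particular Iceberg version.

```python
from datetime import datetime, timezone
from typing import Dict


def streaming_read_options(
    start: datetime,
    skip_overwrites: bool = True,
    skip_deletes: bool = True,
) -> Dict[str, str]:
    """Build Iceberg streaming-read options starting at `start` (hypothetical helper)."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    millis = int(start.timestamp() * 1000)  # option value is epoch milliseconds
    return {
        "stream-from-timestamp": str(millis),
        "streaming-skip-overwrite-snapshots": str(skip_overwrites).lower(),
        "streaming-skip-delete-snapshots": str(skip_deletes).lower(),
    }


opts = streaming_read_options(datetime(2024, 1, 1, tzinfo=timezone.utc))
for key, value in opts.items():
    print(f"{key}={value}")
```

With a running session, the map could be applied as `spark.readStream.format("iceberg").options(**opts).load("lake.db.events")`, where the table name is again only an example.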
Format V3 Support
GA read + write on Spark 3.5 with Iceberg 1.8+; Deletion Vectors, Row Lineage, new types, multi-arg transforms
Time Travel & Versioning
SQL VERSION AS OF / TIMESTAMP AS OF supported since Spark 3.3; DataFrame as-of-timestamp option
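A short sketch of both time-travel paths follows: one helper renders the SQL form, the other builds the DataFrame read option. The helper names and the table name `lake.db.events` are hypothetical. The SQL timestamp literal is formatted in UTC, which assumes the Spark session time zone is also UTC.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def time_travel_sql(
    table: str,
    snapshot_id: Optional[int] = None,
    as_of: Optional[datetime] = None,
) -> str:
    """Render a time-travel query by snapshot ID or by timestamp (exactly one)."""
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Assumption: the session time zone is UTC, so the literal is rendered in UTC.
    literal = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{literal}'"


def as_of_read_options(as_of: datetime) -> Dict[str, str]:
    """Build the DataFrame read option; the value is epoch milliseconds."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    return {"as-of-timestamp": str(int(as_of.timestamp() * 1000))}


ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(time_travel_sql("lake.db.events", as_of=ts))
print(time_travel_sql("lake.db.events", snapshot_id=10963874102873))
print(as_of_read_options(ts))
```

The SQL strings would be passed to `spark.sql(...)`, and the option map to a read such as `spark.read.format("iceberg").options(**as_of_read_options(ts)).load("lake.db.events")`.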
Enterprise Security
Delegates access control to the underlying catalog (Apache Ranger for Hive, AWS IAM, Nessie policies); snapshot isolation; audit hooks
Apache Spark Iceberg Feature Matrix
Comprehensive breakdown of Iceberg capabilities in Apache Spark 3.3+
Dimension | Support Level | Implementation Details | Min Spark Version |
---|---|---|---|
Catalog Types | Full (Complete) | Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, custom plug-ins | 3.0+ |
Read/Write Operations | Full (Complete) | Table scans, metadata reads, INSERT INTO, INSERT OVERWRITE, DataFrame writeTo, stored procedures | 3.0+ |
DML Operations | Full (Complete) | MERGE INTO, UPDATE, DELETE with position/equality delete files (Iceberg 0.14+) | 3.2+ |
MoR/CoW Support | Full (Configurable) | Copy-on-Write default; Merge-on-Read via write.delete.mode configuration | 3.2+ |
Streaming Capabilities | Partial (Limited) | Incremental reads, append/complete modes; delete snapshots skipped by default | 3.0+ |
Format V3 Support | Full (GA) | Deletion Vectors, Row Lineage, new types, multi-arg transforms (Spark 3.5+, Iceberg 1.8+) | 3.5+ |
Time Travel | Full (Native SQL) | VERSION AS OF / TIMESTAMP AS OF SQL syntax; DataFrame as-of-timestamp option | 3.3+ |
Security & Governance | Full (Delegated) | Delegates to catalog ACLs (Ranger, IAM, Nessie); snapshot isolation; audit metadata | 3.0+ |
Schema Evolution | Full (Instant) | Add/drop/rename columns, type promotion; metadata-only operations | 3.0+ |
Table Maintenance | Full (Built-in) | Stored procedures: expire_snapshots, rewrite_data_files, rewrite_manifests, remove_orphan_files | 3.0+ |
Metadata Tables | Full (Complete) | history, snapshots, files, manifests, partitions, all_data_files, metadata_log_entries | 3.0+ |
Branching & Tagging | Partial (API Only) | Java API support; SQL DDL and procedures for branch operations | 3.3+ |
Use Cases
Enterprise Data Lake Analytics
Large-scale analytical workloads with complex transformations
- Multi-terabyte ETL pipelines with schema evolution
- Complex analytical queries across multiple fact tables
- Machine learning feature engineering on historical data
- Cross-functional data sharing with fine-grained access control
Real-time and Batch Processing
Unified platform for both streaming and batch workloads
- Lambda architecture with consistent data processing logic
- CDC pipelines processing database changes in real time
- Event-driven architectures with exactly-once processing
- Near real-time analytics with historical context
Data Engineering and Pipeline Development
Robust data transformation and pipeline orchestration
- Multi-stage transformation pipelines with checkpointing
- Data quality validation and cleansing at scale
- Cross-system data synchronization and replication
- Automated data archival and lifecycle management
Advanced Analytics and ML
Machine learning and advanced analytical workloads
- Feature store implementations with time travel
- A/B testing with historical data comparisons
- Reproducible ML experiments with data versioning
- Large-scale model training on partitioned datasets