
Apache Spark 3.3+

The reference implementation for Apache Iceberg with comprehensive read/write support, full DML capabilities, and GA Format V3 support

Key Features

100
Full Support

Comprehensive Catalog Support

Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings
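As a rough illustration of the spark.sql.catalog.* settings mentioned above, the sketch below builds the configuration keys for one catalog. The catalog name `lake` and the REST endpoint URL are hypothetical; the class names and the `type`, `uri`, and `warehouse` keys are the ones Iceberg's Spark integration documents, though older Iceberg releases may need `catalog-impl` instead of `type` for some catalogs.

```python
def iceberg_catalog_conf(name, catalog_type, warehouse=None, uri=None):
    """Build the spark.sql.catalog.* settings for a single Iceberg catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        # Session extensions enable MERGE INTO, UPDATE, DELETE and CALL.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # e.g. hive, hadoop, rest, glue, jdbc, nessie
        f"{prefix}.type": catalog_type,
    }
    if warehouse:
        conf[f"{prefix}.warehouse"] = warehouse
    if uri:
        conf[f"{prefix}.uri"] = uri
    return conf


# Hypothetical catalog name and endpoint, for illustration only.
conf = iceberg_catalog_conf("lake", "rest", uri="http://localhost:8181")
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

The printed lines can be passed to spark-submit, or the same key/value pairs can be applied one by one with `SparkSession.builder.config(key, value)`.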

100
Full Support

Complete Read/Write Operations

Full table scans, metadata-table reads, INSERT INTO, atomic INSERT OVERWRITE, DataFrame writeTo, and stored procedures
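A minimal sketch of the write and metadata-read paths listed above, assuming an existing SparkSession (`spark`) with an Iceberg catalog configured. The helper names and table names are hypothetical; the SQL statements and the `snapshots` metadata table columns are standard Iceberg.

```python
def append_rows(spark, target, source):
    """Append every row of `source` to the Iceberg table `target`."""
    return spark.sql(f"INSERT INTO {target} SELECT * FROM {source}")


def overwrite_rows(spark, target, source):
    """Replace data in `target` with the rows of `source` in one atomic commit."""
    return spark.sql(f"INSERT OVERWRITE {target} SELECT * FROM {source}")


def recent_snapshots(spark, table, limit=5):
    """Read the `snapshots` metadata table to see what the writes committed."""
    return spark.sql(
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {table}.snapshots ORDER BY committed_at DESC LIMIT {limit}"
    )
```

With a real session, `recent_snapshots(spark, "lake.db.events").show()` would list the latest commits. The DataFrame equivalent of the append is `df.writeTo("lake.db.events").append()`.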

100
Full Support

Advanced DML Operations

MERGE INTO, UPDATE, DELETE via Spark Session Extensions; Iceberg 0.14+ emits position/equality-delete files
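To make the MERGE INTO path concrete, here is a sketch that builds a change-data upsert statement. It assumes the Iceberg session extensions are enabled and that the source table carries a hypothetical `op` column where 'D' marks a delete; the table names and the key column are also illustrative.

```python
def upsert_statement(target, source, key="id"):
    """Build a MERGE INTO that applies inserts, updates and deletes from `source`."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        # `op` is a hypothetical change-type column on the source rows.
        "WHEN MATCHED AND s.op = 'D' THEN DELETE\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT *"
    )


print(upsert_statement("lake.db.customers", "lake.db.customer_changes"))
```

The resulting string can be passed to `spark.sql(...)`. The clauses are evaluated in order, so the delete condition is listed before the unconditional update.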

100
Full Support

MoR/CoW Storage Strategies

Copy-on-Write default for delete commands; Merge-on-Read enabled via write.delete.mode=merge-on-read
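The switch between the two strategies is a table property change. The sketch below generates the ALTER TABLE statement; besides write.delete.mode from the text above, it also sets write.update.mode and write.merge.mode, which are the companion Iceberg properties for UPDATE and MERGE (included here as an assumption about what most setups want). The table name is hypothetical.

```python
ROW_LEVEL_MODE_KEYS = ("write.delete.mode", "write.update.mode", "write.merge.mode")
VALID_MODES = ("copy-on-write", "merge-on-read")


def set_row_level_mode(table, mode="merge-on-read", keys=ROW_LEVEL_MODE_KEYS):
    """Build an ALTER TABLE statement that sets the row-level write mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    props = ", ".join(f"'{key}'='{mode}'" for key in keys)
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


print(set_row_level_mode("lake.db.events"))
```

Passing `keys=("write.delete.mode",)` limits the change to DELETE commands only, which matches the single property named above.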

75
Partial Support

Streaming Capabilities

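Incremental reads with stream-from-timestamp; Append/Complete output modes; only append snapshots are processed, so overwrite and delete snapshots must be skipped explicitly via streaming-skip-overwrite-snapshots and streaming-skip-delete-snapshots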
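A sketch of an incremental streaming read, assuming an existing SparkSession passed in as `spark`. The helper names and the table name are hypothetical; the three option keys are Iceberg's streaming read options. The two skip options are set explicitly so the handling of overwrite and delete snapshots does not depend on defaults.

```python
from datetime import datetime, timezone


def streaming_read_options(start, skip_overwrite=True, skip_delete=True):
    """Build Iceberg streaming read options starting at `start`."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    return {
        # Iceberg expects milliseconds since the Unix epoch.
        "stream-from-timestamp": str(int(start.timestamp() * 1000)),
        "streaming-skip-overwrite-snapshots": str(skip_overwrite).lower(),
        "streaming-skip-delete-snapshots": str(skip_delete).lower(),
    }


def open_stream(spark, table, options):
    """Return a streaming DataFrame over the Iceberg table."""
    return spark.readStream.format("iceberg").options(**options).load(table)


opts = streaming_read_options(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(opts)
```

With a real session, `open_stream(spark, "lake.db.events", opts)` would return a DataFrame that can be written out with `writeStream` in append or complete output mode.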

100
GA

Format V3 Support

GA read + write on Spark 3.5 with Iceberg 1.8+; Deletion Vectors, Row Lineage, new types, multi-arg transforms
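Format V3 is selected per table through the format-version property. The sketch below builds a CREATE TABLE statement for a V3 table; the table name, columns, and partition transform are hypothetical, and it assumes a Spark and Iceberg combination that supports V3 as described above.

```python
def create_v3_table(table, columns, partition_by=None):
    """Build a CREATE TABLE statement for an Iceberg table on format version 3."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    partition = f" PARTITIONED BY ({partition_by})" if partition_by else ""
    return (
        f"CREATE TABLE {table} ({cols}) USING iceberg{partition} "
        "TBLPROPERTIES ('format-version'='3')"
    )


print(create_v3_table(
    "lake.db.events_v3",
    [("id", "BIGINT"), ("ts", "TIMESTAMP")],
    partition_by="days(ts)",
))
```

An existing table can be upgraded with `ALTER TABLE ... SET TBLPROPERTIES ('format-version'='3')`. Format upgrades cannot be reverted, so it is worth trying this on a copy first.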

100
Native SQL

Time Travel & Versioning

SQL VERSION AS OF / TIMESTAMP AS OF supported on Spark 3.3+; DataFrame as-of-timestamp option
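The two SQL forms and the DataFrame option can be sketched as follows. The helper names, table name, and snapshot ID are hypothetical; the SQL clauses and the as-of-timestamp read option (milliseconds since the epoch) are the ones named above.

```python
from datetime import datetime, timezone


def as_of_version(table, snapshot_id):
    """Query the table as of a specific snapshot ID."""
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


def as_of_timestamp(table, when):
    """Query the table as of a wall-clock time (read in the session time zone)."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{when:%Y-%m-%d %H:%M:%S}'"


def as_of_timestamp_option(when):
    """Build the DataFrame reader option; `when` must be timezone-aware."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    return {"as-of-timestamp": str(int(when.timestamp() * 1000))}


cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(as_of_version("lake.db.events", 1234567890123))  # hypothetical snapshot ID
print(as_of_timestamp("lake.db.events", cutoff))
print(as_of_timestamp_option(cutoff))
```

With a real session, the option dictionary would be applied as `spark.read.options(**as_of_timestamp_option(cutoff)).format("iceberg").load("lake.db.events")`.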

100
Delegated

Enterprise Security

Delegates ACLs to underlying catalog (Hive Ranger, AWS IAM, Nessie policies); snapshot isolation; audit hooks


Apache Spark Iceberg Feature Matrix

Comprehensive breakdown of Iceberg capabilities in Apache Spark 3.3+

| Dimension | Support Level | Implementation Details | Min Version |
|---|---|---|---|
| Catalog Types | Full (Complete) | Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, custom plug-ins | 3.0+ |
| Read/Write Operations | Full (Complete) | Table scans, metadata reads, INSERT INTO, INSERT OVERWRITE, DataFrame writeTo, stored procedures | 3.0+ |
| DML Operations | Full (Complete) | MERGE INTO, UPDATE, DELETE with position/equality delete files (Iceberg 0.14+) | 3.2+ |
| MoR/CoW Support | Full (Configurable) | Copy-on-Write default; Merge-on-Read via write.delete.mode configuration | 3.2+ |
| Streaming Capabilities | Partial (Limited) | Incremental reads, append/complete modes; overwrite and delete snapshots can be skipped via read options | 3.0+ |
| Format V3 Support | Full (GA) | Deletion Vectors, Row Lineage, new types, multi-arg transforms (Spark 3.5+, Iceberg 1.8+) | 3.5+ |
| Time Travel | Full (Native SQL) | VERSION AS OF / TIMESTAMP AS OF SQL syntax; DataFrame as-of-timestamp option | 3.3+ |
| Security & Governance | Full (Delegated) | Delegates to catalog ACLs (Ranger, IAM, Nessie); snapshot isolation; audit metadata | 3.0+ |
| Schema Evolution | Full (Instant) | Add/drop/rename columns, type promotion; metadata-only operations | 3.0+ |
| Table Maintenance | Full (Built-in) | Stored procedures: expire_snapshots, rewrite_data_files, rewrite_manifests, remove_orphan_files | 3.0+ |
| Metadata Tables | Full (Complete) | history, snapshots, files, manifests, partitions, all_data_files, metadata_log_entries | 3.0+ |
| Branching & Tagging | Partial (API Only) | Java API support; SQL DDL and procedures for branch operations | 3.3+ |
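The maintenance procedures in the matrix are invoked with CALL against the catalog's `system` namespace. The sketch below builds one statement per procedure; the catalog name, table name, cutoff timestamp, and `retain_last` value are hypothetical, and the session extensions are assumed to be enabled.

```python
def maintenance_calls(catalog, table, older_than, retain_last=10):
    """Build CALL statements for a routine Iceberg maintenance pass."""
    return [
        # Compact small data files first.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
        # Then drop old snapshots, keeping at least `retain_last` of them.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}', retain_last => {retain_last})",
        # Finally remove files no longer referenced by any snapshot.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


for statement in maintenance_calls("lake", "db.events", "2024-01-01 00:00:00"):
    print(statement)
```

Each statement can be run with `spark.sql(statement)`. The ordering shown here is one common convention rather than a requirement.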


Use Cases

Enterprise Data Lake Analytics

Large-scale analytical workloads with complex transformations

  • Multi-terabyte ETL pipelines with schema evolution
  • Complex analytical queries across multiple fact tables
  • Machine learning feature engineering on historical data
  • Cross-functional data sharing with fine-grained access control

Real-time and Batch Processing

Unified platform for both streaming and batch workloads

  • Lambda architecture with consistent data processing logic
  • CDC pipelines processing database changes in real-time
  • Event-driven architectures with exactly-once processing
  • Near real-time analytics with historical context

Data Engineering and Pipeline Development

Robust data transformation and pipeline orchestration

  • Multi-stage transformation pipelines with checkpointing
  • Data quality validation and cleansing at scale
  • Cross-system data synchronization and replication
  • Automated data archival and lifecycle management

Advanced Analytics and ML

Machine learning and advanced analytical workloads

  • Feature store implementations with time travel
  • A/B testing with historical data comparisons
  • Reproducible ML experiments with data versioning
  • Large-scale model training on partitioned datasets

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: We discuss the roadmap, triage bugs, and help each other debug issues there.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!