Last updated: 6/30/2025

Apache Spark 3.3+

The reference implementation for Apache Iceberg with comprehensive read/write support, full DML capabilities, and GA Format V3 support

Key Features

Full Support

Comprehensive Catalog Support

Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings
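As a minimal sketch of the spark.sql.catalog.* settings mentioned above, the helper below assembles the properties for one Iceberg catalog. The function name, the catalog name lake, and the REST URL are hypothetical; the SparkCatalog class name and the type, uri, and warehouse keys are the ones Iceberg's Spark integration documents.

```python
def iceberg_catalog_conf(name, catalog_type, warehouse=None, uri=None):
    """Return the spark.sql.catalog.* settings for a single Iceberg catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        # Registers the catalog name with Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Short type name, e.g. hive, hadoop, rest, glue, jdbc, nessie.
        f"{prefix}.type": catalog_type,
    }
    if warehouse is not None:
        conf[f"{prefix}.warehouse"] = warehouse
    if uri is not None:
        conf[f"{prefix}.uri"] = uri
    return conf


# Hypothetical REST catalog called "lake".
for key, value in sorted(iceberg_catalog_conf("lake", "rest", uri="http://localhost:8181").items()):
    print(f"{key}={value}")
```

Each key/value pair can then be passed to SparkSession.builder.config(key, value), assuming the Iceberg Spark runtime jar is on the classpath.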

Full Support

Complete Read/Write Operations

Full table scans, metadata-table reads, INSERT INTO, atomic INSERT OVERWRITE, DataFrame writeTo, and stored procedures
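The operations listed above might be exercised roughly as follows. This is a sketch, not a reference implementation: the helper names are hypothetical, spark is assumed to be a SparkSession with an Iceberg catalog configured, and table names such as lake.db.events are placeholders.

```python
def append_rows(spark, table, source_view):
    """Append every row of a temp view to an Iceberg table with INSERT INTO."""
    return spark.sql(f"INSERT INTO {table} SELECT * FROM {source_view}")


def overwrite_rows(spark, table, source_view):
    """Replace existing data in one atomic commit with INSERT OVERWRITE.

    Which partitions get replaced depends on Spark's partition overwrite mode.
    """
    return spark.sql(f"INSERT OVERWRITE {table} SELECT * FROM {source_view}")


def list_snapshots(spark, table):
    """Read the table's `snapshots` metadata table."""
    return spark.sql(
        f"SELECT snapshot_id, committed_at, operation FROM {table}.snapshots"
    )


def append_dataframe(df, table):
    """Append a DataFrame through the DataFrameWriterV2 API (df.writeTo)."""
    df.writeTo(table).append()
```

A call such as append_rows(spark, "lake.db.events", "staged_events") followed by list_snapshots(spark, "lake.db.events").show() would show the new append snapshot.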

Full Support

Advanced DML Operations

MERGE INTO, UPDATE, DELETE via Spark Session Extensions; Iceberg 0.14+ emits position/equality-delete files
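To illustrate the row-level DML above, here is a small sketch that renders a MERGE INTO upsert. The helper name and the table, view, and column names are hypothetical; the extension class is the one Iceberg ships for Spark and is assumed to be set through spark.sql.extensions before the statement runs.

```python
# Value for the spark.sql.extensions setting; required for MERGE/UPDATE/DELETE.
ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def merge_sql(target, source, key, columns):
    """Render an upsert: update matching rows by key, insert the rest."""
    assignments = ", ".join(f"t.{col} = s.{col}" for col in columns)
    return (
        f"MERGE INTO {target} t USING {source} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {assignments} "
        "WHEN NOT MATCHED THEN INSERT *"
    )


print(merge_sql("lake.db.customers", "updates", "id", ["email", "updated_at"]))
```

The resulting string would be passed to spark.sql(...); the source view is assumed to have the same columns as the target so that INSERT * resolves.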

Full Support

MoR/CoW Storage Strategies

Copy-on-Write default for delete commands; Merge-on-Read enabled via write.delete.mode=merge-on-read
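Switching a table between the two strategies comes down to table properties. The sketch below renders the ALTER TABLE statement; the helper name and table name are hypothetical. Only write.delete.mode appears in the text above; write.update.mode and write.merge.mode are the analogous Iceberg table properties for UPDATE and MERGE and are included here as an assumption about what a reader may also want to set.

```python
VALID_MODES = {"copy-on-write", "merge-on-read"}
VALID_OPERATIONS = {"delete", "update", "merge"}


def set_row_level_mode_sql(table, mode, operations=("delete",)):
    """Render ALTER TABLE ... SET TBLPROPERTIES for write.<op>.mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    unknown = set(operations) - VALID_OPERATIONS
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    props = ", ".join(f"'write.{op}.mode'='{mode}'" for op in operations)
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


print(set_row_level_mode_sql("lake.db.events", "merge-on-read"))
```

Passing operations=("delete", "update", "merge") would apply the same mode to all three commands in a single statement.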

Partial Support

Streaming Capabilities

Incremental reads with stream-from-timestamp; Append/Complete output modes; overwrite and delete snapshots skipped by default
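A streaming read along these lines could be configured as sketched below. The helper names are hypothetical; stream-from-timestamp, streaming-skip-delete-snapshots, and streaming-skip-overwrite-snapshots are Iceberg read options. The skip options are set explicitly here so the behavior does not depend on the defaults of the Iceberg version in use.

```python
from datetime import datetime, timezone


def streaming_read_options(start, skip_deletes=True, skip_overwrites=True):
    """Build Iceberg streaming-read options; `start` must be timezone-aware."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    return {
        # Iceberg expects the start point as epoch milliseconds.
        "stream-from-timestamp": str(int(start.timestamp() * 1000)),
        "streaming-skip-delete-snapshots": str(skip_deletes).lower(),
        "streaming-skip-overwrite-snapshots": str(skip_overwrites).lower(),
    }


def open_stream(spark, table, options):
    """Open a streaming DataFrame over an Iceberg table with the given options."""
    reader = spark.readStream.format("iceberg")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load(table)


options = streaming_read_options(datetime(2025, 1, 1, tzinfo=timezone.utc))
print(options)
```

With a real session, open_stream(spark, "lake.db.events", options) returns a streaming DataFrame that can be written out with writeStream in append mode.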

Format V3 Support

GA read + write on Spark 3.5 with Iceberg 1.8+; Deletion Vectors, Row Lineage, new types, multi-arg transforms
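Opting a new table into Format V3 is done through the format-version table property. The sketch below only renders the DDL; the helper name, table name, and columns are hypothetical, and it assumes Spark 3.5 with Iceberg 1.8+ as stated above.

```python
def create_v3_table_sql(table, columns):
    """Render CREATE TABLE for an Iceberg table pinned to format version 3."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE {table} ({cols}) USING iceberg "
        "TBLPROPERTIES ('format-version'='3')"
    )


print(create_v3_table_sql("lake.db.events", [("id", "BIGINT"), ("ts", "TIMESTAMP")]))
```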

Native SQL

Time Travel & Versioning

SQL VERSION AS OF / TIMESTAMP AS OF supported since Spark 3.3; DataFrame as-of-timestamp option
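Both time-travel routes can be sketched as follows. The helper names, table name, and snapshot ID are hypothetical; the SQL clauses and the as-of-timestamp read option (epoch milliseconds) are the ones named above.

```python
def time_travel_sql(table, snapshot_id=None, timestamp=None):
    """Render a time-travel query by snapshot ID or by timestamp string."""
    if (snapshot_id is None) == (timestamp is None):
        raise ValueError("pass exactly one of snapshot_id or timestamp")
    if snapshot_id is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def read_as_of(spark, table, epoch_millis):
    """DataFrame route: read the table as of a timestamp in epoch milliseconds."""
    return (
        spark.read.format("iceberg")
        .option("as-of-timestamp", str(int(epoch_millis)))
        .load(table)
    )


print(time_travel_sql("lake.db.events", snapshot_id=1234567890123))
print(time_travel_sql("lake.db.events", timestamp="2025-01-01 00:00:00"))
```

Snapshot IDs for the first form can be looked up in the table's snapshots metadata table.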

Delegated

Enterprise Security

Delegates ACLs to underlying catalog (Hive Ranger, AWS IAM, Nessie policies); snapshot isolation; audit hooks

Apache Spark Iceberg Feature Matrix

Comprehensive breakdown of Iceberg capabilities in Apache Spark 3.3+

Dimension	Support Level	Implementation Details	Min Spark Version
Catalog Types	Full (Complete)	Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, custom plug-ins	3.0+
Read/Write Operations	Full (Complete)	Table scans, metadata reads, INSERT INTO, INSERT OVERWRITE, DataFrame writeTo, stored procedures	3.0+
DML Operations	Full (Complete)	MERGE INTO, UPDATE, DELETE with position/equality delete files (Iceberg 0.14+)	3.2+
MoR/CoW Support	Full (Configurable)	Copy-on-Write default; Merge-on-Read via write.delete.mode configuration	3.2+
Streaming Capabilities	Partial (Limited)	Incremental reads, append/complete modes; delete snapshots skipped by default	3.0+
Format V3 Support	Full (GA)	Deletion Vectors, Row Lineage, new types, multi-arg transforms (Spark 3.5+, Iceberg 1.8+)	3.5+
Time Travel	Full (Native SQL)	VERSION AS OF / TIMESTAMP AS OF SQL syntax; DataFrame as-of-timestamp option	3.3+
Security & Governance	Full (Delegated)	Delegates to catalog ACLs (Ranger, IAM, Nessie); snapshot isolation; audit metadata	3.0+
Schema Evolution	Full (Instant)	Add/drop/rename columns, type promotion; metadata-only operations	3.0+
Table Maintenance	Full (Built-in)	Stored procedures: expire_snapshots, rewrite_data_files, rewrite_manifests, remove_orphan_files	3.0+
Metadata Tables	Full (Complete)	history, snapshots, files, manifests, partitions, all_data_files, metadata_log_entries	3.0+
Branching & Tagging	Partial (API Only)	Java API support; SQL DDL and procedures for branch operations	3.3+
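The stored procedures in the Table Maintenance row are invoked with CALL against the catalog's system namespace. The following sketch renders two such calls; the helper names, catalog name, table name, and retention values are hypothetical, and the Iceberg SQL extensions are assumed to be enabled.

```python
def expire_snapshots_sql(catalog, table, older_than, retain_last=1):
    """Render a CALL that expires snapshots older than a timestamp."""
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}', "
        f"retain_last => {int(retain_last)})"
    )


def rewrite_data_files_sql(catalog, table):
    """Render a CALL that compacts data files with the default strategy."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


print(expire_snapshots_sql("lake", "db.events", "2025-01-01 00:00:00", retain_last=5))
print(rewrite_data_files_sql("lake", "db.events"))
```

Each string would be passed to spark.sql(...), for example from a scheduled maintenance job.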

Use Cases

Enterprise Data Lake Analytics

Large-scale analytical workloads with complex transformations

Multi-terabyte ETL pipelines with schema evolution
Complex analytical queries across multiple fact tables
Machine learning feature engineering on historical data
Cross-functional data sharing with fine-grained access control

Real-time and Batch Processing

Unified platform for both streaming and batch workloads

Lambda architecture with consistent data processing logic
CDC pipelines processing database changes in real-time
Event-driven architectures with exactly-once processing
Near real-time analytics with historical context

Data Engineering and Pipeline Development

Robust data transformation and pipeline orchestration

Multi-stage transformation pipelines with checkpointing
Data quality validation and cleansing at scale
Cross-system data synchronization and replication
Automated data archival and lifecycle management

Advanced Analytics and ML

Machine learning and advanced analytical workloads

Feature store implementations with time travel
A/B testing with historical data comparisons
Reproducible ML experiments with data versioning
Large-scale model training on partitioned datasets

Resources & Documentation

Official Documentation

Complete API reference and guides

Getting Started Guide

Quick start tutorials and examples

Spark Configuration

Spark Procedures

Performance Tuning

Structured Streaming

