Overview
This guide gives you an overview of OLake: where to run it, which sources and destinations it supports, and how to query the data it writes.
Where you can run OLake
OLake can run in several environments:
Platform / Orchestration | Quick-start or setup docs | Comments |
---|---|---|
OLake UI | LINK | (Recommended) |
Local system (bare metal / laptop) | Local Iceberg setup (with MinIO) | |
Stand-alone Docker container | Using the OLake Docker image | Easiest for a PoC; ships the CLI plus driver in one image (see the sketch below). |
Airflow on EC2 | EC2 DAG how-to | Spins up a short-lived EC2 worker that pulls the OLake image, runs `sync`, then terminates. |
Airflow on Kubernetes | K8s + Airflow example (OLake) | Same DAG, but the KubernetesPodOperator schedules OLake pods inside the cluster. |
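For the stand-alone Docker route, a sync boils down to mounting a config directory and invoking the driver image. The sketch below is a minimal illustration only; the image name and the flag names are assumptions modeled on the Docker quick-start pattern, so verify them against the linked docs.

```bash
# Minimal stand-alone sync (illustrative). The image name and the
# --config/--catalog/--destination flags are assumptions; confirm the exact
# invocation in the Docker quick-start linked above.
docker run --rm \
  -v "$(pwd)/olake-config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json
```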
OLake supports a variety of sources and destinations. The currently supported connectors are listed below.

Supported Sources
Source | Links |
---|---|
MongoDB | Docs / Code / Getting Started |
Postgres | Docs / Code / Getting Started |
MySQL | Docs / Code / Getting Started |
Oracle | Docs / Code / Getting Started |
AWS S3 | Check issue |
Kafka | WIP; check PR |
Visit the OLake roadmap for more information.
Supported Destinations
| Destination | Supported | Docs | Comments |
|---|---|---|---|
| | Yes | Link | |
| AWS S3 | Yes | Link | Supports both plain-Parquet and Iceberg writes; requires `aws_access_key` or an IAM role. |
| | Yes | Link | |
| S3-compatible object stores | Yes | | Any S3-protocol-compliant object store works with OLake. |
| | Yes | Link | |
File formats OLake can emit
Output format | Docs | Comments |
---|---|---|
Apache Iceberg tables | Iceberg writer overview | Full snapshot + incremental CDC → Iceberg; works with all catalogs listed below (see the config sketch below). |
Parquet files | Parquet writer modes | Simple columnar dumps (no table metadata); choose the local or S3 sub-mode. |
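The writer is selected in `destination.json`. As a purely hypothetical sketch (the field names below are illustrative placeholders, not the exact schema; the writer overviews linked above have the real one), an Iceberg-writer config might look like this:

```bash
# Hypothetical destination.json for the Iceberg writer. Field names are
# illustrative placeholders; the exact schema is in the Iceberg writer overview.
cat > destination.json <<'EOF'
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "rest",
    "rest_catalog_url": "http://localhost:8181",
    "iceberg_s3_path": "s3://warehouse/olake",
    "iceberg_db": "olake_db"
  }
}
EOF
```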
Query OLake-written data using different catalogs
- AWS Glue + AWS Athena
- AWS Glue + Spark
- AWS Glue + Snowflake
- AWS Glue + DuckDB
- AWS Glue + Trino
- REST Catalog + DuckDB (see the sketch below)
- REST Catalog + ClickHouse
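As one example from the list above, DuckDB can read an Iceberg table that OLake wrote by scanning its metadata with the iceberg extension. A minimal sketch; the metadata path is a placeholder, and S3 credentials (e.g., via the httpfs extension) are assumed to be configured already:

```bash
# Query an OLake-written Iceberg table from DuckDB. The metadata path is a
# placeholder; iceberg_scan() comes from DuckDB's iceberg extension.
duckdb -c "
INSTALL iceberg;
LOAD iceberg;
SELECT count(*)
FROM iceberg_scan('s3://warehouse/olake_db/my_table/metadata/v1.metadata.json');
"
```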
Iceberg catalogs OLake can register to
Catalog type | Docs / example | Comments |
---|---|---|
REST – Lakekeeper | LINK | [Officially supported] Rust-native catalog with optimistic locking; Helm chart available for K8s. |
REST – Unity | LINK | [Supported and tested] Unity Catalog (Databricks) with personal-access-token authentication. |
REST – Gravitino | LINK | [Supported, yet to be tested] Uses standard Iceberg REST; Gravitino adds multi-cloud routing. |
REST – Nessie | LINK | [Supported, yet to be tested] Time-travel branches/tags; supply `nessie.endpoint` in `destination.json`. |
REST – Polaris | LINK | [Supported, yet to be tested] |
Hive Metastore | Hive catalog config | Classic HMS; a good fit for on-prem Hadoop or EMR. |
JDBC Catalog (Postgres/MySQL) | JDBC catalog sample | Stores Iceberg metadata in an RDBMS (Postgres); easiest to spin up locally with Postgres (see the sketch below). |
AWS Glue Catalog | Glue catalog IAM & config | Best choice on AWS; lets Athena, EMR, and Redshift query the same tables instantly. |
Azure Purview | Not planned; submit a request | |
BigLake Metastore | Not planned; submit a request | |
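For local experiments, the JDBC catalog needs nothing more than a Postgres instance to hold the Iceberg metadata. A minimal sketch using the stock postgres image; the credentials and database name are placeholders:

```bash
# Throwaway Postgres to back the Iceberg JDBC catalog locally. Credentials
# and the database name are placeholders.
docker run -d --name iceberg-jdbc-catalog \
  -e POSTGRES_USER=iceberg \
  -e POSTGRES_PASSWORD=iceberg \
  -e POSTGRES_DB=iceberg_catalog \
  -p 5432:5432 \
  postgres:16

# The catalog is then reachable at:
#   jdbc:postgresql://localhost:5432/iceberg_catalog
# Wire these connection details into destination.json per the JDBC catalog sample.
```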
Apache Iceberg tables written by OLake can be queried from the following engines:
QUERY TOOL | AWS Glue Catalog | Hive Metastore | JDBC Catalog | Iceberg REST Catalog |
---|---|---|---|---|
Amazon Athena | ✅ | ❌ | ❌ | ❌ |
Apache Spark (v3.3+) | ✅ | ✅ | ✅ | ✅ |
Apache Flink (v1.18+) | ✅ | ✅ | ✅ | ✅ |
Trino (v475+) | ✅ | ✅ | ✅ | ✅ |
Starburst Enterprise | ✅ | ✅ | ✅ | ✅ |
Presto (v0.288+) | ✅ | ✅ | ✅ | ✅ |
Apache Hive v4.0 | ✅ | ✅ | ❌ | ✅ |
Apache Impala v4.4 | ❌ | ✅ | ❌ | ❌ |
Dremio v25/26 | ✅ | ✅ | ❌ | ✅ |
DuckDB v1.2.1 | ❌ | ❌ | ❌ | ✅ |
ClickHouse v24.3+ | ❌ | ❌ | ❌ | ✅ |
StarRocks v3.2+ | ❌ | ✅ | ❌ | ✅ |
Apache Doris v2.1+ | ✅ | ✅ | ❌ | ✅ |
Google BigQuery (BigLake) | ❌ | ❌ | ❌ | ❌* |
Snowflake (Iceberg GA) | ❌ | ❌ | ❌ | ✅ |
Databricks (Unity Catalog API) | ❌ | ❌ | ❌ | ✅ |
*BigQuery’s BigLake tables read Iceberg manifests directly without using an Iceberg catalog, so none of the four catalog types apply.
Sl | Query / Analytics Engine | Link | Supported Iceberg catalogs | Comments |
---|---|---|---|---|
1 | Amazon Athena | [Query Iceberg tables] AWS Documentation | Only the AWS Glue Data Catalog | Athena v3 can read and write Iceberg v2 tables only when they're registered in Glue. |
2 | Apache Spark (3.3 → 4.x) | [Spark catalog config] Apache Iceberg | Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings. | Configure via spark.sql.catalog.<name>.type (glue, hive, hadoop, rest, or jdbc); see the Spark sketch below. |
3 | Apache Flink (1.18+) | [Flink catalog options] Apache Iceberg | Hive Metastore, Hadoop catalog, REST catalog (incl. Nessie), AWS Glue, JDBC, plus any custom implementation via catalog-impl. | Set catalog-type (hive, hadoop, rest, glue, or jdbc). |
4 | Trino (≥ 475) | [Iceberg metastores] trino.io | Hive Metastore, AWS Glue, JDBC, REST, Nessie, or Snowflake | Set iceberg.catalog.type (hive_metastore, glue, jdbc, rest, nessie, or snowflake). |
5 | Starburst Enterprise (SEP 413 →) | [SEP Iceberg connector] Starburst | Hive Metastore, AWS Glue, JDBC, REST, Nessie, Snowflake, and Starburst Galaxy's managed metastore | Same config keys as Trino; extra features (Warp Speed, Polaris). |
6 | PrestoDB (0.288+) | [Lab guide using REST] IBM GitHub | Hive Metastore, AWS Glue, REST/Nessie (0.277+ with OAuth 2), Hadoop (file-based), JDBC | Glue needs the AWS SDK fat-jar; REST landed in 0.288. |
7 | Apache Hive (4.0) | [Hive integration] Apache Iceberg | Hive Metastore is the default catalog; Hadoop, REST/Nessie, AWS Glue, JDBC, or custom catalogs can also be configured. | Uses a StorageHandler; for Glue, add the AWS bundle. |
8 | Apache Impala (4.4) | [Impala Iceberg docs] Impala | Hive Metastore, HadoopCatalog, HadoopTables; other Iceberg catalog-impl values can be registered in the Hive site-config (but Impala itself cannot talk to Glue/REST/Nessie directly). | Can create and query Iceberg tables; Glue via HMS federation only. |
9 | Dremio (25/26) | [REST catalog support] Dremio Documentation | Polaris / "Dremio Catalog" (built-in REST, Nessie-backed); generic Iceberg REST catalogs (Tabular, Unity, Snowflake Open, etc.); Arctic (Nessie) sources via the UI; Hive Metastore (HMS); AWS Glue; Hadoop (file-based); stand-alone Nessie endpoints as sources | Add the source in the Lakehouse Catalogs UI; JDBC not yet. |
10 | DuckDB (1.2.1) | [Iceberg extension] DuckDB; [REST preview] DuckDB | Hadoop (file-system) is still the simplest path: iceberg_scan('/bucket/table/') or a direct metadata JSON. Iceberg REST catalogs (e.g., Nessie, Tabular) are supported via the rest option and accept bearer/OAuth tokens through the new rest_auth_token parameter. No native Hive/Glue catalog yet, but you can proxy them through REST. | INSTALL iceberg; LOAD iceberg; then ATTACH 'nessie://…' or ATTACH 'glue+rest://…'. |
11 | ClickHouse (24.3+) | [Iceberg table engine] ClickHouse | Path (Hadoop-style) since 24.3; REST catalog (Nessie, Polaris/Unity, Glue REST) in 24.12 via SETTINGS catalog_type='rest'; Hive Metastore experimental and AWS Glue integrations in testing; R2 (Cloudflare) REST catalog on the roadmap. | Read-only engine; the REST integration roadmap notes catalog auto-discovery. |
12 | StarRocks (3.2+) | [StarRocks Iceberg quick-start] StarRocks Documentation | Hive Metastore, AWS Glue Data Catalog (requires S3 storage), REST (default, with AWS Glue REST endpoint support), JDBC-compatible metastore | CREATE CATALOG iceberg … PROPERTIES('type'='iceberg','metastore'='…'). |
13 | Apache Doris (2.1+) | [Doris Iceberg catalog] doris.apache.org | Hive Metastore, REST, Hadoop (filesystem metadata), AWS Glue, Alibaba Cloud DLF, AWS S3 Tables Catalog | Multi-catalog; external queries and writes supported since v3.1. |
14 | Google BigQuery (BigLake) | [BigLake Iceberg tables] Google Cloud | None: BigQuery cannot be configured to use REST, Hive, or Glue catalogs for writing Iceberg tables. | Creates external tables; catalog-less, pointing directly at manifests in GCS/S3. |
15 | Snowflake (2025 GA) | [Snowflake Iceberg tables] Snowflake Documentation | Snowflake-native; REST (read-only). External-catalog Iceberg tables (whether via AWS Glue, Iceberg metadata files, Open Catalog/Polaris, or remote REST catalogs) are read-only in Snowflake. | Snowflake maintains its own catalog but can read external REST catalogs (Unity, Glue REST). |
16 | Databricks Unity Catalog | [UC Iceberg REST endpoint] Databricks Documentation | Full support for the Iceberg REST Catalog API, allowing external engines to read (Generally Available) and write (Public Preview) Unity Catalog–managed Iceberg tables. Iceberg catalog federation is in Public Preview, enabling you to govern and query Iceberg tables managed in AWS Glue, Hive Metastore, and Snowflake Horizon without copying data. | Endpoint /api/2.1/unity-catalog/iceberg; external clients include Spark, Flink, Trino, DuckDB, ClickHouse, etc. |
17 | e6data Lakehouse Compute Engine | [e6data × S3 Tables] e6data.com | AWS Glue, REST, Hive (read) | Serverless SQL engine; advertises compatibility with "all table formats & catalogs." |
- Glue support in Presto requires the hive.metastore=glue shim or running Presto inside AWS EMR.
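As referenced in the Spark row above, Iceberg catalogs are wired into Spark through spark.sql.catalog.* settings. A minimal sketch against a REST catalog; the catalog name olake, the endpoint, the warehouse path, and the runtime-jar version are all placeholders:

```bash
# Point spark-sql at an Iceberg REST catalog. Catalog name, URI, warehouse
# path, and the iceberg-spark-runtime version are placeholders.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.olake=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.olake.type=rest \
  --conf spark.sql.catalog.olake.uri=http://localhost:8181 \
  --conf spark.sql.catalog.olake.warehouse=s3://warehouse/olake \
  -e "SELECT * FROM olake.olake_db.my_table LIMIT 10"
```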
Notes & gotchas
- REST flavours. “REST” above covers standard Iceberg REST plus branded servers (Nessie, Lakekeeper, Gravitino, AWS Glue REST endpoint, Databricks Unity, Snowflake Polaris).
- JDBC catalog is production-ready in Spark, Flink, Trino/Starburst, and Presto. Engines not listed in that column (e.g., ClickHouse) cannot yet use it.
- Hive vs Hadoop. Some engines list “Hadoop Catalog” separately (path-based, no service). I’ve rolled those under Hive here if the engine simply re-uses the HMS client.
- Read-only vs read-write. ClickHouse and BigQuery are read-only (see the ClickHouse sketch below); Athena supports INSERT/UPDATE/MERGE (MoR only); most others are full read-write when using Glue/Hive/JDBC/REST.
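To illustrate the read-only path, ClickHouse can scan an Iceberg table through its iceberg() table function without registering a writable catalog. A sketch with a placeholder S3 URL and placeholder credentials:

```bash
# Read-only scan of an Iceberg table from ClickHouse. The S3 URL and the
# credentials are placeholders; iceberg() is ClickHouse's table function.
clickhouse-client --query "
SELECT count(*)
FROM iceberg('https://my-bucket.s3.amazonaws.com/olake_db/my_table/', 'AWS_KEY', 'AWS_SECRET')
"
```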