Overview
This guide gives you an overview of OLake: where to run it, which sources and destinations it supports, and how to query the data it writes.
Where you can run OLake
OLake can run in several environments:
Platform / Orchestration | Quick-start or setup docs | Comments |
---|---|---|
OLake UI | LINK | (Recommended) |
Local system (bare metal / laptop) | Local Iceberg setup (with MinIO) | |
Stand-alone Docker container | Using the OLake Docker image | Easiest for a PoC; ships the CLI plus driver in one image (see the sketch below). |
Airflow on EC2 | EC2 DAG how-to | Spins up a short-lived EC2 worker that pulls the OLake image, runs `sync`, then terminates. |
Airflow on Kubernetes | K8s + Airflow example (OLake) | Same DAG, but the KubernetesPodOperator schedules OLake pods inside the cluster. |
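For the stand-alone Docker route, a sync boils down to mounting a config directory and invoking the driver image. The sketch below is a minimal illustration only; the image name and the flag names are assumptions modeled on the Docker quick-start pattern, so verify them against the linked docs.

```bash
# Minimal stand-alone sync (illustrative). The image name and the
# --config/--catalog/--destination flags are assumptions; confirm the exact
# invocation in the Docker quick-start linked above.
docker run --rm \
  -v "$(pwd)/olake-config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json
```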
OLake supports a variety of sources and destinations. The currently supported connectors are listed below.

Supported Sources
Source | Links |
---|---|
MongoDB | Docs / Code / Getting Started |
Postgres | Docs / Code / Getting Started |
MySQL | Docs / Code / Getting Started |
Oracle | Docs / Code / Getting Started |
AWS S3 | Check issue |
Kafka | WIP; check PR |
Visit the OLake roadmap for more information.
Supported Destinations
| Destination | Supported | Docs | Comments |
|---|---|---|---|
| | Yes | Link | |
| AWS S3 | Yes | Link | Supports both plain-Parquet and Iceberg writes; requires `aws_access_key` or an IAM role. |
| | Yes | Link | |
| S3-compatible object stores | Yes | | Any S3-protocol-compliant object store works with OLake. |
| | Yes | Link | |
File formats OLake can emit
Output format | Docs | Comments |
---|---|---|
Apache Iceberg tables | Iceberg writer overview | Full snapshot + incremental CDC → Iceberg; works with all catalogs listed below (see the config sketch below). |
Parquet files | Parquet writer modes | Simple columnar dumps (no table metadata); choose the local or S3 sub-mode. |
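The writer is selected in `destination.json`. As a purely hypothetical sketch (the field names below are illustrative placeholders, not the exact schema; the writer overviews linked above have the real one), an Iceberg-writer config might look like this:

```bash
# Hypothetical destination.json for the Iceberg writer. Field names are
# illustrative placeholders; the exact schema is in the Iceberg writer overview.
cat > destination.json <<'EOF'
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "rest",
    "rest_catalog_url": "http://localhost:8181",
    "iceberg_s3_path": "s3://warehouse/olake",
    "iceberg_db": "olake_db"
  }
}
EOF
```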
Query OLake-written data using different catalogs
- AWS Glue + AWS Athena
- AWS Glue + Spark
- AWS Glue + Snowflake
- AWS Glue + DuckDB
- AWS Glue + Trino
- REST Catalog + DuckDB (see the sketch below)
- REST Catalog + ClickHouse
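As one example from the list above, DuckDB can read an Iceberg table that OLake wrote by scanning its metadata with the iceberg extension. A minimal sketch; the metadata path is a placeholder, and S3 credentials (e.g., via the httpfs extension) are assumed to be configured already:

```bash
# Query an OLake-written Iceberg table from DuckDB. The metadata path is a
# placeholder; iceberg_scan() comes from DuckDB's iceberg extension.
duckdb -c "
INSTALL iceberg;
LOAD iceberg;
SELECT count(*)
FROM iceberg_scan('s3://warehouse/olake_db/my_table/metadata/v1.metadata.json');
"
```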
Iceberg catalogs OLake can register to
Catalog type | Docs / example | Comments |
---|---|---|
REST – Lakekeeper | LINK | [Officially supported] Rust-native catalog with optimistic locking; Helm chart available for K8s. |
REST – Unity | LINK | [Supported and tested] Unity Catalog (Databricks) with personal-access-token authentication. |
REST – Gravitino | LINK | [Supported, yet to be tested] Uses standard Iceberg REST; Gravitino adds multi-cloud routing. |
REST – Nessie | LINK | [Supported, yet to be tested] Time-travel branches/tags; supply `nessie.endpoint` in `destination.json`. |
REST – Polaris | LINK | [Supported, yet to be tested] |
Hive Metastore | Hive catalog config | Classic HMS; a good fit for on-prem Hadoop or EMR. |
JDBC Catalog (Postgres/MySQL) | JDBC catalog sample | Stores Iceberg metadata in an RDBMS (Postgres); easiest to spin up locally with Postgres (see the sketch below). |
AWS Glue Catalog | Glue catalog IAM & config | Best choice on AWS; lets Athena, EMR, and Redshift query the same tables instantly. |
Azure Purview | Not planned; submit a request | |
BigLake Metastore | Not planned; submit a request | |
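For local experiments, the JDBC catalog needs nothing more than a Postgres instance to hold the Iceberg metadata. A minimal sketch using the stock postgres image; the credentials and database name are placeholders:

```bash
# Throwaway Postgres to back the Iceberg JDBC catalog locally. Credentials
# and the database name are placeholders.
docker run -d --name iceberg-jdbc-catalog \
  -e POSTGRES_USER=iceberg \
  -e POSTGRES_PASSWORD=iceberg \
  -e POSTGRES_DB=iceberg_catalog \
  -p 5432:5432 \
  postgres:16

# The catalog is then reachable at:
#   jdbc:postgresql://localhost:5432/iceberg_catalog
# Wire these connection details into destination.json per the JDBC catalog sample.
```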
Apache Iceberg tables written by OLake can be queried from the following engines:
QUERY TOOL | AWS Glue Catalog | Hive Metastore | JDBC Catalog | Iceberg REST Catalog |
---|---|---|---|---|
Amazon Athena | ✅ | ❌ | ❌ | ❌ |
Apache Spark (v3.3+) | ✅ | ✅ | ✅ | ✅ |
Apache Flink (v1.18+) | ✅ | ✅ | ✅ | ✅ |
Trino (v475+) | ✅ | ✅ | ✅ | ✅ |
Starburst Enterprise | ✅ | ✅ | ✅ | ✅ |
Presto (v0.288+) | ✅ | ✅ | ✅ | ✅ |
Apache Hive v4.0 | ✅ | ✅ | ❌ | ✅ |
Apache Impala v4.4 | ❌ | ✅ | ❌ | ❌ |
Dremio v25/26 | ✅ | ✅ | ❌ | ✅ |
DuckDB v1.2.1 | ❌ | ❌ | ❌ | ✅ |
ClickHouse v24.3+ | ❌ | ❌ | ❌ | ✅ |
StarRocks v3.2+ | ❌ | ✅ | ❌ | ✅ |
Apache Doris v2.1+ | ✅ | ✅ | ❌ | ✅ |
Google BigQuery (BigLake) | ❌ | ❌ | ❌ | ❌* |
Snowflake (Iceberg GA) | ❌ | ❌ | ❌ | ✅ |
Databricks (Unity Catalog API) | ❌ | ❌ | ❌ | ✅ |
*BigQuery’s BigLake tables read Iceberg manifests directly without using an Iceberg catalog, so none of the four catalog types apply.
Sl | Query / Analytics Engine | Link | Supported Iceberg catalogs | Comments |
---|---|---|---|---|
1 | Amazon Athena | [Query Iceberg tables] AWS Documentation | Only the AWS Glue Data Catalog | Athena v3 can read and write Iceberg v2 tables only when they're registered in Glue. |
2 | Apache Spark (3.3 → 4.x) | [Spark catalog config] Apache Iceberg | Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings. | Configure via spark.sql.catalog.<name>.type (glue, hive, hadoop, rest, or jdbc); see the Spark sketch below. |
3 | Apache Flink (1.18+) | [Flink catalog options] Apache Iceberg | Hive Metastore, Hadoop catalog, REST catalog (incl. Nessie), AWS Glue, JDBC, plus any custom implementation via catalog-impl. | Set catalog-type (hive, hadoop, rest, glue, or jdbc). |
4 | Trino (≥ 475) | [Iceberg metastores] trino.io | Hive Metastore, AWS Glue, JDBC, REST, Nessie, or Snowflake | Set iceberg.catalog.type (hive_metastore, glue, jdbc, rest, nessie, or snowflake). |
5 | Starburst Enterprise (SEP 413 →) | [SEP Iceberg connector] Starburst | Hive Metastore, AWS Glue, JDBC, REST, Nessie, Snowflake, and Starburst Galaxy's managed metastore | Same config keys as Trino; extra features (Warp Speed, Polaris). |
6 | PrestoDB (0.288+) | [Lab guide using REST] IBM GitHub | Hive Metastore, AWS Glue, REST/Nessie (0.277+ with OAuth 2), Hadoop (file-based), JDBC | Glue needs the AWS SDK fat-jar; REST landed in 0.288. |
7 | Apache Hive (4.0) | [Hive integration] Apache Iceberg | Hive Metastore is the default catalog; Hadoop, REST/Nessie, AWS Glue, JDBC, or custom catalogs can also be configured. | Uses a StorageHandler; for Glue, add the AWS bundle. |
8 | Apache Impala (4.4) | [Impala Iceberg docs] Impala | Hive Metastore, HadoopCatalog, HadoopTables; other Iceberg catalog-impl values can be registered in the Hive site-config (but Impala itself cannot talk to Glue/REST/Nessie directly). | Can create and query Iceberg tables; Glue via HMS federation only. |
9 | Dremio (25/26) | [REST catalog support] Dremio Documentation | Polaris / "Dremio Catalog" (built-in REST, Nessie-backed); generic Iceberg REST catalogs (Tabular, Unity, Snowflake Open, etc.); Arctic (Nessie) sources via the UI; Hive Metastore (HMS); AWS Glue; Hadoop (file-based); stand-alone Nessie endpoints as sources | Add the source in the Lakehouse Catalogs UI; JDBC not yet. |
10 | DuckDB (1.2.1) | [Iceberg extension] DuckDB; [REST preview] DuckDB | Hadoop (file-system) is still the simplest path: iceberg_scan('/bucket/table/') or a direct metadata JSON. Iceberg REST catalogs (e.g., Nessie, Tabular) are supported via the rest option and accept bearer/OAuth tokens through the new rest_auth_token parameter. No native Hive/Glue catalog yet, but you can proxy them through REST. | INSTALL iceberg; LOAD iceberg; then ATTACH 'nessie://…' or ATTACH 'glue+rest://…'. |
11 | ClickHouse (24.3+) | [Iceberg table engine] ClickHouse | Path (Hadoop-style) since 24.3; REST catalog (Nessie, Polaris/Unity, Glue REST) in 24.12 via SETTINGS catalog_type='rest'; Hive Metastore experimental and AWS Glue integrations in testing; R2 (Cloudflare) REST catalog on the roadmap. | Read-only engine; the REST integration roadmap notes catalog auto-discovery. |
12 | StarRocks (3.2+) | [StarRocks Iceberg quick-start] StarRocks Documentation | Hive Metastore, AWS Glue Data Catalog (requires S3 storage), REST (default, with AWS Glue REST endpoint support), JDBC-compatible metastore | CREATE CATALOG iceberg … PROPERTIES('type'='iceberg','metastore'='…'). |
13 | Apache Doris (2.1+) | [Doris Iceberg catalog] doris.apache.org | Hive Metastore, REST, Hadoop (filesystem metadata), AWS Glue, Alibaba Cloud DLF, AWS S3 Tables Catalog | Multi-catalog; external queries and writes supported since v3.1. |
14 | Google BigQuery (BigLake) | [BigLake Iceberg tables] Google Cloud | None: BigQuery cannot be configured to use REST, Hive, or Glue catalogs for writing Iceberg tables. | Creates external tables; catalog-less, pointing directly at manifests in GCS/S3. |
15 | Snowflake (2025 GA) | [Snowflake Iceberg tables] Snowflake Documentation | Snowflake-native; REST (read-only). External-catalog Iceberg tables (whether via AWS Glue, Iceberg metadata files, Open Catalog/Polaris, or remote REST catalogs) are read-only in Snowflake. | Snowflake maintains its own catalog but can read external REST catalogs (Unity, Glue REST). |
16 | Databricks Unity Catalog | [UC Iceberg REST endpoint] Databricks Documentation | Full support for the Iceberg REST Catalog API, allowing external engines to read (Generally Available) and write (Public Preview) Unity Catalog–managed Iceberg tables. Iceberg catalog federation is in Public Preview, enabling you to govern and query Iceberg tables managed in AWS Glue, Hive Metastore, and Snowflake Horizon without copying data. | Endpoint /api/2.1/unity-catalog/iceberg; external clients include Spark, Flink, Trino, DuckDB, ClickHouse, etc. |
17 | e6data Lakehouse Compute Engine | [e6data × S3 Tables] e6data.com | AWS Glue, REST, Hive (read) | Serverless SQL engine; advertises compatibility with "all table formats & catalogs." |
- Glue support in Presto requires the hive.metastore=glue shim or running Presto inside AWS EMR.
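As referenced in the Spark row above, Iceberg catalogs are wired into Spark through spark.sql.catalog.* settings. A minimal sketch against a REST catalog; the catalog name olake, the endpoint, the warehouse path, and the runtime-jar version are all placeholders:

```bash
# Point spark-sql at an Iceberg REST catalog. Catalog name, URI, warehouse
# path, and the iceberg-spark-runtime version are placeholders.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.olake=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.olake.type=rest \
  --conf spark.sql.catalog.olake.uri=http://localhost:8181 \
  --conf spark.sql.catalog.olake.warehouse=s3://warehouse/olake \
  -e "SELECT * FROM olake.olake_db.my_table LIMIT 10"
```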
Notes & gotchas
- REST flavours. “REST” above covers standard Iceberg REST plus branded servers (Nessie, Lakekeeper, Gravitino, AWS Glue REST endpoint, Databricks Unity, Snowflake Polaris).
- JDBC catalog is production-ready in Spark, Flink, Trino/Starburst, and Presto. Engines not listed in that column (e.g., ClickHouse) cannot yet use it.
- Hive vs Hadoop. Some engines list “Hadoop Catalog” separately (path-based, no service). I’ve rolled those under Hive here if the engine simply re-uses the HMS client.
- Read-only vs read-write. ClickHouse and BigQuery are read-only (see the ClickHouse sketch below); Athena supports INSERT/UPDATE/MERGE (MoR only); most others are full read-write when using Glue/Hive/JDBC/REST.
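To illustrate the read-only path, ClickHouse can scan an Iceberg table through its iceberg() table function without registering a writable catalog. A sketch with a placeholder S3 URL and placeholder credentials:

```bash
# Read-only scan of an Iceberg table from ClickHouse. The S3 URL and the
# credentials are placeholders; iceberg() is ClickHouse's table function.
clickhouse-client --query "
SELECT count(*)
FROM iceberg('https://my-bucket.s3.amazonaws.com/olake_db/my_table/', 'AWS_KEY', 'AWS_SECRET')
"
```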