
Overview

This getting-started guide gives you an overview of OLake and the different ways to run it.

Where you can run OLake

Below are the different ways you can run OLake:

| Platform / Orchestration | Quick-start or setup docs | Comments |
| --- | --- | --- |
| OLake UI | LINK | (Recommended) |
| Local system (bare metal / laptop) | Local Iceberg Setup (with MinIO) | |
| Stand-alone Docker container | Using OLake Docker image | Easiest for a PoC; ships the CLI plus driver in one image. |
| Airflow on EC2 | EC2 DAG how-to | Spins up a short-lived EC2 worker that pulls the OLake image, runs the sync, then terminates. |
| Airflow on Kubernetes | K8s + Airflow example (OLake) | Same DAG, but the KubernetesPodOperator schedules OLake pods inside the cluster. |

Supported Sources

OLake supports a variety of data sources and destinations; the currently supported source connectors are listed below.

| Sources | Links |
| --- | --- |
| MongoDB | Docs / Code / Getting Started |
| Postgres | Docs / Code / Getting Started |
| MySQL | Docs / Code / Getting Started |
| Oracle | Docs / Code / Getting Started |
| AWS S3 | Check issue |
| Kafka | WIP; check PR |

Visit the OLake roadmap for more information.

Supported Destinations

| Destination | Supported | Docs | Comments |
| --- | --- | --- | --- |
| Apache Iceberg | Yes | Link | |
| AWS S3 | Yes | Link | Supports both plain-Parquet and Iceberg-format writes; requires aws_access_key / IAM role. |
| Azure | Yes | Link | |
| Google Cloud Storage | Yes | Link (Iceberg) / Link (Parquet) | Any S3-protocol-compliant object store can work with OLake. |
| Local Filesystem | Yes | Link | |

File formats OLake can emit

| Output format | Docs | Comments |
| --- | --- | --- |
| Apache Iceberg tables | Iceberg writer overview | Full snapshot + incremental CDC → Iceberg; works with all catalogs listed below. |
| Parquet files | Parquet writer modes | Simple columnar dumps (no table metadata); choose the local or S3 sub-mode. |
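
Because the Parquet output carries no table metadata, any Parquet-aware engine can read it directly. A minimal sketch with DuckDB; the path below is a hypothetical output directory, not an OLake default:

```sql
-- Inspect OLake's plain-Parquet output with DuckDB.
-- '/data/olake/orders/*.parquet' is a placeholder path; point it at your writer's output directory.
SELECT *
FROM read_parquet('/data/olake/orders/*.parquet')
LIMIT 10;
```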

Querying OLake-written data using different catalogs

Data written by OLake can be queried through the following catalog + engine pairings (a worked Athena example follows the list):

1. AWS Glue + AWS Athena
2. AWS Glue + Spark
3. AWS Glue + Snowflake
4. AWS Glue + DuckDB
5. AWS Glue + Trino
6. REST Catalog + DuckDB
7. REST Catalog + ClickHouse
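
For instance, once OLake has registered a table in the AWS Glue catalog, Athena (engine v3) can query it directly. A minimal sketch; the database and table names (olake_db.orders) and columns are hypothetical, not something OLake creates by default:

```sql
-- Query an OLake-written Iceberg table registered in AWS Glue, via Athena.
SELECT order_id, status, updated_at
FROM olake_db.orders
WHERE updated_at > DATE '2024-01-01';

-- Iceberg time travel also works in Athena:
SELECT *
FROM olake_db.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
```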

Iceberg catalogs OLake can register to

| Catalog type | Docs / example | Comments |
| --- | --- | --- |
| REST – Lakekeeper | LINK | [Officially supported] Rust-native catalog with optimistic locking; Helm chart available for K8s. |
| REST – Unity | LINK | [Supported and tested] Unity Catalog (Databricks) with personal access token authentication. |
| REST – Gravitino | LINK | [Supported, yet to be tested] Uses standard Iceberg REST; Gravitino adds multi-cloud routing. |
| REST – Nessie | LINK | [Supported, yet to be tested] Time-travel branches/tags; supply nessie.endpoint in destination.json. |
| REST – Polaris | LINK | [Supported, yet to be tested] |
| Hive Metastore | Hive catalog config | Classic HMS; a good fit for on-prem Hadoop or EMR. |
| JDBC Catalog (Postgres/MySQL) | JDBC catalog sample | Stores Iceberg metadata in an RDBMS; easiest to spin up locally with Postgres. |
| AWS Glue Catalog | Glue catalog IAM & config | Best choice on AWS; lets Athena, EMR, and Redshift query the same tables instantly. |
| Azure Purview | Not planned; submit a request | |
| BigLake Metastore | Not planned; submit a request | |

Data from OLake on Apache Iceberg can be queried from the engines below (see the per-engine table that follows for specifics):

| Query tool | AWS Glue Catalog | Hive Metastore | JDBC Catalog | Iceberg REST Catalog |
| --- | --- | --- | --- | --- |
| Amazon Athena | ✅ | ❌ | ❌ | ❌ |
| Apache Spark (v3.3+) | ✅ | ✅ | ✅ | ✅ |
| Apache Flink (v1.18+) | ✅ | ✅ | ✅ | ✅ |
| Trino (v475+) | ✅ | ✅ | ✅ | ✅ |
| Starburst Enterprise | ✅ | ✅ | ✅ | ✅ |
| Presto (v0.288+) | ✅ | ✅ | ✅ | ✅ |
| Apache Hive v4.0 | ✅ | ✅ | ✅ | ✅ |
| Apache Impala v4.4 | ❌ | ✅ | ❌ | ❌ |
| Dremio v25/26 | ✅ | ✅ | ❌ | ✅ |
| DuckDB v1.2.1 | ❌ | ❌ | ❌ | ✅ |
| ClickHouse v24.3+ | ❌ | ❌ | ❌ | ✅ |
| StarRocks v3.2+ | ✅ | ✅ | ✅ | ✅ |
| Apache Doris v2.1+ | ✅ | ✅ | ❌ | ✅ |
| Google BigQuery (BigLake) | ❌* | ❌* | ❌* | ❌* |
| Snowflake (Iceberg GA) | ✅ | ❌ | ❌ | ✅ |
| Databricks (Unity Catalog API) | ❌ | ❌ | ❌ | ✅ |

*BigQuery’s BigLake tables read Iceberg manifests directly without using an Iceberg catalog, so none of the four catalog types apply.
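
To illustrate that catalog-less path, BigQuery defines a BigLake external table that points straight at the Iceberg metadata file. A sketch; the project, connection, dataset, and GCS paths below are all hypothetical placeholders:

```sql
-- BigQuery: BigLake Iceberg external table reading manifests directly from GCS.
-- Project, connection, dataset, and bucket names are placeholders.
CREATE EXTERNAL TABLE my_dataset.olake_orders
WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-bucket/warehouse/olake_db/orders/metadata/v1.metadata.json']
);
```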

| Sl | Query / Analytics Engine (“tool”) | Link | Supported Iceberg catalogs | Comments |
| --- | --- | --- | --- | --- |
| 1 | Amazon Athena | [Query Iceberg tables] AWS Documentation | Only the AWS Glue Data Catalog | Athena v3 can read & write Iceberg v2 tables only when they’re registered in Glue. |
| 2 | Apache Spark (3.3 → 4.x) | [Spark catalog config] Apache Iceberg | Hive Metastore, Hadoop warehouse, REST, AWS Glue, JDBC, Nessie, plus custom plug-ins via spark.sql.catalog.* settings. | Configure with `spark.sql.catalog.<name>.type` (e.g. `glue`, `hive`, `rest`, `jdbc`). |
| 3 | Apache Flink (1.18+) | [Flink catalog options] Apache Iceberg | Hive Metastore, Hadoop catalog, REST catalog (incl. Nessie), AWS Glue, JDBC, plus any custom implementation via catalog-impl. | Set `catalog-type` (or `catalog-impl` for custom catalogs such as Glue). |
| 4 | Trino (≥ 475) | [Iceberg metastores] trino.io | Hive Metastore, AWS Glue, JDBC, REST, Nessie, or Snowflake | Set `iceberg.catalog.type` (e.g. `glue`, `rest`, `jdbc`). |
| 5 | Starburst Enterprise (SEP 413 →) | [SEP Iceberg connector] Starburst | Hive Metastore, AWS Glue, JDBC, REST, Nessie, Snowflake, and Starburst Galaxy’s managed metastore | Same config keys as Trino; extra features (Warp Speed, Polaris). |
| 6 | PrestoDB (0.288+) | [Lab guide using REST] IBM GitHub | Hive Metastore, AWS Glue, REST/Nessie (0.277+ with OAuth 2), Hadoop (file-based), JDBC | Glue needs the AWS SDK fat-jar; REST landed in 0.288. |
| 7 | Apache Hive (4.0) | [Hive integration] Apache Iceberg | Hive Metastore is the default catalog; Hadoop, REST/Nessie, AWS Glue, JDBC, or custom catalogs can also be configured. | Uses StorageHandler; for Glue, add the AWS bundle. |
| 8 | Apache Impala (4.4) | [Impala Iceberg docs] Impala | Hive Metastore, HadoopCatalog, HadoopTables; other Iceberg catalog-impl values can be registered in Hive site-config (but Impala itself cannot talk to Glue/REST/Nessie directly). | Can create & query Iceberg; Glue via HMS federation only. |
| 9 | Dremio (25/26) | [REST catalog support] Dremio Documentation | Polaris / “Dremio Catalog” (built-in REST, Nessie-backed); generic Iceberg REST catalog (Tabular, Unity, Snowflake Open, etc.); Arctic (Nessie) sources via UI; Hive Metastore (HMS); AWS Glue; Hadoop (file-based); stand-alone Nessie endpoint supported as a source | Add the source in the Lakehouse Catalogs UI; JDBC not yet. |
| 10 | DuckDB (1.2.1) | [Iceberg extension] DuckDB / [REST preview] DuckDB | Hadoop (file-system) is still the simplest path: iceberg_scan('/bucket/table/') or a direct metadata JSON; Iceberg REST catalogs (e.g., Nessie, Tabular) supported via the rest option, now accepting bearer/OAuth tokens through the new rest_auth_token parameter; no native Hive/Glue catalog yet, but you can proxy them through REST. | INSTALL iceberg; LOAD iceberg; then ATTACH 'nessie://…' or ATTACH 'glue+rest://…'. |
| 11 | ClickHouse (24.3+) | [Iceberg table engine] ClickHouse | Path (Hadoop-style) since 24.3; REST catalog (Nessie, Polaris/Unity, Glue REST) in 24.12 via SETTINGS catalog_type='rest'; Hive Metastore experimental and AWS Glue integrations in testing; R2 (Cloudflare) REST catalog on the roadmap. | Read-only engine; the REST integration roadmap notes catalog auto-discovery. |
| 12 | StarRocks (3.2+) | [StarRocks Iceberg quick-start] StarRocks Documentation | Hive Metastore, AWS Glue Data Catalog (requires S3 storage), REST (default, with AWS Glue REST endpoint support), JDBC-compatible metastore | Create the catalog with CREATE CATALOG iceberg … PROPERTIES('type'='iceberg','metastore'='…'). |
| 13 | Apache Doris (2.1+) | [Doris Iceberg catalog] doris.apache.org | Hive Metastore, REST, Hadoop (filesystem metadata), AWS Glue, Alibaba Cloud DLF, AWS S3 Tables Catalog | Multi-catalog; external queries and writes supported since v3.1. |
| 14 | Google BigQuery (BigLake) | [BigLake Iceberg tables] Google Cloud | You cannot configure BigQuery to use REST, Hive, or Glue catalogs for writing Iceberg tables. | Creates external tables; catalog-less, pointing directly at manifests in GCS/S3. |
| 15 | Snowflake (2025 GA) | [Snowflake Iceberg tables] Snowflake Documentation | Snowflake-native; REST (read-only). External-catalog Iceberg tables (whether via AWS Glue, Iceberg metadata files, Open Catalog/Polaris, or remote REST catalogs) are read-only in Snowflake. | Snowflake maintains its own catalog but can read external REST catalogs (Unity, Glue REST). |
| 16 | Databricks Unity Catalog | [UC Iceberg REST endpoint] Databricks Documentation | Full support for the Iceberg REST Catalog API, allowing external engines to read (Generally Available) and write (Public Preview) to Unity Catalog–managed Iceberg tables. Iceberg catalog federation is in Public Preview, letting you govern and query Iceberg tables managed in AWS Glue, Hive Metastore, and Snowflake Horizon without copying data. | Endpoint /api/2.1/unity-catalog/iceberg; external clients include Spark, Flink, Trino, DuckDB, ClickHouse, etc. |
| 17 | e6data Lakehouse Compute Engine | [e6data × S3 Tables] e6data.com | AWS Glue, REST, Hive (read) | Serverless SQL engine; advertises compatibility with “all table formats & catalogs.” |
• Glue support in Presto requires the hive.metastore=glue shim or running Presto inside AWS EMR.
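
To make the DuckDB path concrete, here is a minimal sketch based on the commands shown in row 10; the table path is a placeholder, not an OLake default:

```sql
-- DuckDB: read an Iceberg table written by OLake via the iceberg extension.
INSTALL iceberg;
LOAD iceberg;

-- Scan directly by path (Hadoop/file-system style); '/warehouse/olake_db/orders'
-- is a hypothetical table location.
SELECT count(*)
FROM iceberg_scan('/warehouse/olake_db/orders');
```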

Notes & gotchas

• REST flavours. “REST” above covers standard Iceberg REST plus branded servers (Nessie, Lakekeeper, Gravitino, AWS Glue REST endpoint, Databricks Unity, Snowflake Polaris).
• JDBC catalog is production-ready in Spark, Flink, Trino/Starburst, and Presto. Engines not listed in that column (e.g., ClickHouse) cannot yet use it.
• Hive vs Hadoop. Some engines list a “Hadoop Catalog” separately (path-based, no service); those are rolled under Hive here if the engine simply re-uses the HMS client.
• Read-only vs read-write. ClickHouse and BigQuery are read-only; Athena supports INSERT/UPDATE/MERGE (merge-on-read tables only; see the sketch below); most others are full read-write when using Glue/Hive/JDBC/REST.
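
A sketch of Athena’s DML support on a merge-on-read Iceberg table; the table names (olake_db.orders, olake_db.orders_staging) and columns are hypothetical:

```sql
-- Athena DML against an Iceberg v2 (merge-on-read) table registered in Glue.
UPDATE olake_db.orders
SET status = 'shipped'
WHERE order_id = 42;

-- MERGE upserts rows from a (hypothetical) staging table.
MERGE INTO olake_db.orders AS t
USING olake_db.orders_staging AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET status = s.status
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
  VALUES (s.order_id, s.status, s.updated_at);
```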

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

• Email Support: Reach out to our team at hello@olake.io for prompt assistance.
• Join our Slack Community: where we discuss future roadmaps, report bugs, and help folks debug the issues they are facing.
• Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!