Docker Compose for local setup and testing

This page explains how to set up your local test environment using Docker Compose. The docker-compose.yml file used for this setup is listed at the end of the page.

The setup is simple:

Step 1. Clone the OLake repository and navigate to the local test directory.

Step 2. Start the Docker Compose services.

Step 3. Create the source configuration file (config.json) in any directory of your choosing. Just make sure to reference its file path in the sync command.

Step 4. Create the writer configuration file (writer.json).

Step 5. Run the discover process for stream schema discovery. (This command creates a catalog.json file with the complete schema of your database: all streams / tables / collections, column names, column data types, etc. You can also set partition_regex here; see the illustrative snippet after this list.)

Step 6. Run the sync process.

Step 7. Verify the data population via spark-sql.
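
For orientation, a catalog.json produced by discover looks roughly like the sketch below. Treat the field names as illustrative only; the exact structure depends on your OLake version and source, and the full file also carries per-column type information.

{
  "selected_streams": {
    "public": [
      {
        "stream_name": "table_name",
        "partition_regex": ""
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "table_name",
        "namespace": "public"
      }
    }
  ]
}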

Congratulations! You have successfully synced your data to Iceberg.

Prerequisites Installation

This section covers the installation of Java 17, Docker & Docker Compose, and Maven. The examples below use Homebrew on macOS (Apple Silicon & Intel); the linked guides cover Linux and Windows.

Java 17

Install via Homebrew:

brew install openjdk@17

Or follow the Adoptium Installation Guide.

Verification:

java -version

Ensure it mentions Java 17.

Note: If needed, add export PATH="/usr/local/opt/openjdk@17/bin:$PATH" (or equivalent) to your shell config (e.g., ~/.bashrc or ~/.zshrc).

Docker & Docker Compose

Option 1: Docker Desktop

  1. Install Docker Desktop from Docker Hub.
  2. Ensure Docker is running.

Option 2: Homebrew

brew install --cask docker

Verification:

docker --version

Maven

Install via Homebrew:

brew install maven

Or download from Apache Maven.

Verification:

mvn -version

Local Minio + JDBC (Local Test Setup)

Steps to run:

  1. Prerequisite: Ensure Docker is installed (instructions in the Prerequisites section above).

  2. Clone OLake and navigate to the local test directory:

    git clone git@github.com:datazip-inc/olake.git
    cd olake/writers/iceberg/local-test
  3. Start the docker-compose services:

    docker compose up

    This command starts the following services:

    • Postgres: Acts as the JDBC catalog.
    • Minio: Provides an AWS S3-like filesystem for storing Iceberg data.
    • Spark: For querying the Iceberg data.
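
    If you prefer to keep this terminal free, you can also start the stack in the background and check that everything came up:

    docker compose up -d   # start all services detached
    docker compose ps      # spark-iceberg, minio, mc, and iceberg-postgres should be running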
  4. Create the source configuration file:
    Get the source config (config.json) for the database from which you wish to sync data here; each database's docs list the exact fields.
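
    As a purely illustrative sketch (assuming a Postgres source; the real keys for your connector are listed in its docs), config.json holds the connection details of the source database:

    {
      "host": "localhost",
      "port": 5432,
      "database": "my_source_db",
      "username": "my_user",
      "password": "my_password"
    }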

  5. Create the writer configuration file:
    Create a file named writer.json in your working directory with the following content:

    {
      "type": "ICEBERG",
      "writer": {
        "catalog_type": "jdbc",
        "jdbc_url": "jdbc:postgresql://localhost:5432/iceberg",
        "jdbc_username": "iceberg",
        "jdbc_password": "password",
        "normalization": false,
        "iceberg_s3_path": "s3a://warehouse",
        "s3_endpoint": "http://localhost:9000",
        "s3_use_ssl": false,
        "s3_path_style": true,
        "aws_access_key": "admin",
        "aws_secret_key": "password",
        "iceberg_db": "olake_iceberg"
      }
    }

    These values match the services defined in docker-compose.yml below: the JDBC catalog credentials come from the postgres service, and the S3 settings point at Minio's API port with its root credentials.
  6. Run the Sync Process:

    • Discover Command: <DISCOVER_COMMAND>
    • Sync Command: <SYNC_COMMAND>
    • Sync with State Command: <SYNC_WITH_STATE_COMMAND>

Refer to the respective database docs for the exact commands to discover the schema and sync the data.
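
For a rough idea of their shape, the commands typically run the source image with Docker and mount a directory containing the config files. The image name, mount path, and flags below are assumptions for a Postgres source; copy the exact command from your database's docs.

docker run --pull=always \
  -v "$HOME/olake-config:/mnt/config" \
  olakego/source-postgres:latest \
  discover --config /mnt/config/config.json

docker run --pull=always \
  -v "$HOME/olake-config:/mnt/config" \
  olakego/source-postgres:latest \
  sync --config /mnt/config/config.json \
  --catalog /mnt/config/catalog.json \
  --destination /mnt/config/writer.json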

note
  1. The first time you run the sync command, it might take a while for the process to build and download the necessary dependencies and jar files. Subsequent runs will be faster.
  2. The writer.json file should be in the same directory where you run the sync command.

If the sync command completes with a final success screen, your data has been synced successfully to the Iceberg table.

  7. Verify Data Population:

    Connect to the spark-iceberg container:

    docker exec -it spark-iceberg bash

    Start Spark SQL:

    spark-sql

    Run a query to inspect your data:

    select * from olake_iceberg.olake_iceberg.table_name;

Now you can run queries on your Iceberg data using Spark SQL. Some useful commands are:

  • show databases;
  • use <database_name>;
  • show tables from olake_iceberg.olake_iceberg;
  • describe formatted olake_iceberg.olake_iceberg.table_name;
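
Iceberg also exposes metadata tables through Spark SQL, which are handy for checking what a sync actually wrote (same table name as above):

select * from olake_iceberg.olake_iceberg.table_name.snapshots; -- snapshot history of the table
select * from olake_iceberg.olake_iceberg.table_name.files;     -- data files backing the table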

Docker Compose file

Here is the docker-compose.yml file used for the local setup:

docker-compose.yml
version: "3"

services:
  # Spark container running a preconfigured Spark environment with Iceberg support.
  spark-iceberg:
    image: tabulario/spark-iceberg # Prebuilt image with Spark and Iceberg
    container_name: spark-iceberg
    build: spark/ # Build context for custom Spark configurations
    networks:
      iceberg_net:
    depends_on:
      - minio # Wait for Minio service to start
      - postgres # Wait for PostgreSQL service to start
    volumes:
      - ./spark-defaults.conf:/opt/spark/conf/spark-defaults.conf # Custom Spark configuration file
      - ./data/ivy-cache:/root/.ivy2 # Local cache for dependencies
    environment:
      - AWS_ACCESS_KEY_ID=admin # AWS access key (used by Spark to access S3/Minio)
      - AWS_SECRET_ACCESS_KEY=password # AWS secret key
      - AWS_REGION=us-east-1 # AWS region configuration
    ports:
      - 8888:8888 # Jupyter notebook port mapping
      - 8088:8080 # Spark web UI port mapping
      - 10000:10000 # JDBC/Thrift server port mapping
      - 10001:10001 # Additional Thrift/JDBC port

  # Minio container acting as a local S3-like storage service.
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin # Minio admin username
      - MINIO_ROOT_PASSWORD=password # Minio admin password
      - MINIO_DOMAIN=minio # Domain alias configuration
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio # Alias used for referencing the storage bucket
    ports:
      - 9001:9001 # Minio console port
      - 9000:9000 # S3 API port
    volumes:
      - ./data/minio-data:/data # Persist Minio data locally
    command: ["server", "/data", "--console-address", ":9001"] # Start the Minio server with its console

  # Minio Client container to manage bucket setup and policies.
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: |
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      if ! /usr/bin/mc ls minio/warehouse > /dev/null 2>&1; then
        /usr/bin/mc mb minio/warehouse;
        /usr/bin/mc policy set public minio/warehouse;
      fi;
      tail -f /dev/null
      "

  # PostgreSQL container serving as the JDBC catalog for Iceberg.
  postgres:
    image: postgres:13
    container_name: iceberg-postgres
    networks:
      iceberg_net:
    environment:
      - POSTGRES_USER=iceberg # PostgreSQL username for the catalog
      - POSTGRES_PASSWORD=password # PostgreSQL password for the catalog
      - POSTGRES_DB=iceberg # Database name used by Iceberg
    ports:
      - 5432:5432 # PostgreSQL port mapping
    volumes:
      - ./data/postgres-data:/var/lib/postgresql/data # Persist PostgreSQL data locally

networks:
  iceberg_net: # Custom network shared by all services

volumes:
  postgres-data:
  minio-data:

Services Overview

  • spark-iceberg:
    Runs Spark with Iceberg support to query and verify the data. It depends on both Minio (for S3-like storage) and PostgreSQL (as the metadata catalog).

  • minio:
    Simulates an AWS S3 environment locally. Iceberg stores its data and metadata here.

  • mc (Minio Client):
    Automates the creation of the storage bucket (warehouse) in Minio and sets the appropriate access policies.

  • postgres:
    Provides a PostgreSQL-based JDBC catalog to manage Iceberg metadata.
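
A few optional commands to sanity-check the stack, using the container names from the compose file (the Minio console is also reachable at http://localhost:9001 with admin / password):

docker compose ps                                                    # all four services should be running
docker exec -it iceberg-postgres psql -U iceberg -d iceberg -c '\dt' # catalog tables appear here after the first sync
docker exec -it mc mc ls -r minio/warehouse                          # Iceberg data/metadata files land here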

info

Refer to spark-defaults.conf for more information about the values set for catalog configurations.
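
For reference, the catalog-related entries in such a file typically look like the excerpt below. This is an illustrative sketch built from standard Iceberg Spark properties matching this compose setup, not a verbatim copy of the repository's file.

spark.sql.catalog.olake_iceberg                       org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.olake_iceberg.catalog-impl          org.apache.iceberg.jdbc.JdbcCatalog
spark.sql.catalog.olake_iceberg.uri                   jdbc:postgresql://postgres:5432/iceberg
spark.sql.catalog.olake_iceberg.jdbc.user             iceberg
spark.sql.catalog.olake_iceberg.jdbc.password         password
spark.sql.catalog.olake_iceberg.warehouse             s3a://warehouse/
spark.sql.catalog.olake_iceberg.io-impl               org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.olake_iceberg.s3.endpoint           http://minio:9000
spark.sql.catalog.olake_iceberg.s3.path-style-access  true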


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: discuss future roadmaps, report bugs, and get help debugging issues you are facing.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!