Docker Compose for local setup and testing
This page explains how to set up your local test environment using Docker Compose. Below is the docker-compose.yml file used for local setup and testing.
The setup is simple:
Step 1. Clone the OLake repository and navigate to the local test directory.
Step 2. Start the Docker Compose services.
Step 3. Create the source configuration file (config.json) in any directory of your choosing. Just make sure to reference its file path in the sync command.
Step 4. Create the writer configuration file (writer.json).
Step 5. Run the discover process for stream schema discovery. (This command creates a catalog.json file with the complete schema of your database: all streams / tables / collections, column names, column data types, etc. You can also set partition_regex.)
Step 6. Run the sync process.
Step 7. Verify the data population via spark-sql.
Congratulations! You have successfully synced your data to Iceberg.
Prerequisites Installation
This section covers the installation steps for Java 17, Docker & Docker Compose, and Maven for macOS (M1 & Intel), Linux, and Windows.
Java 17
- macOS (M1 & Intel)
- Linux
- Windows
Install via Homebrew:
brew install openjdk@17
Or follow the Adoptium Installation Guide.
Verification:
java -version
Ensure it mentions Java 17.
Note: If needed, add
export PATH="/usr/local/opt/openjdk@17/bin:$PATH"
(or equivalent) to your shell config (e.g., ~/.bashrc or ~/.zshrc).
For Ubuntu/Debian:
sudo apt update && sudo apt install openjdk-17-jdk
Verification:
java -version
Ensure it mentions Java 17.
Note: Add the Java binary to your PATH if your system doesn’t do it automatically.
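For example, on Ubuntu/Debian the JDK typically lands under /usr/lib/jvm; the exact directory name depends on your distribution and architecture:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"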
Docker & Docker Compose
- macOS (M1 & Intel)
- Linux
- Windows
Option 1: Docker Desktop
- Install Docker Desktop from Docker Hub.
- Ensure Docker is running.
Option 2: Homebrew
brew install --cask docker
Verification:
docker --version
Follow Docker Docs for your distribution.
For Ubuntu/Debian (example):
sudo apt update && sudo apt install docker.io docker-compose
Verification:
docker --version
- Download Docker Desktop from Docker Hub.
- Run the installer.
- Launch Docker Desktop.
Verification (Command Prompt):
docker --version
Maven Installation
- macOS (M1 & Intel)
- Linux
- Windows
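Install via Homebrew:
brew install maven
Verification:
mvn -version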
For Ubuntu/Debian:
sudo apt install maven
Or follow the Maven Installation Guide for other distros.
Verification:
mvn -version
- Download Maven from Apache Maven.
- Extract or install it.
- Add the Maven bin folder to your PATH.
Verification (Command Prompt):
mvn -version
Local Minio + JDBC (Local Test Setup)
Steps to run:
- Prerequisite: Ensure Docker is installed (instructions in the Prerequisites section above).
- Clone OLake and navigate to the local test directory:
git clone git@github.com:datazip-inc/olake.git
cd olake/writers/iceberg/local-test
- Start the Docker Compose services:
docker compose up
This command starts the following services:
- Postgres: Acts as the JDBC catalog.
- Minio: Provides an AWS S3-like filesystem for storing Iceberg data.
- Spark: For querying the Iceberg data.
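Once the stack is up, you can confirm that all three containers are running (from the same directory):
docker compose ps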
- Create the source configuration file:
The source configs for all supported databases are available here; pick the one for the database you wish to sync data from.
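As a rough illustration only, the config.json for a Postgres source holds the connection details of the database being synced; the field names below are placeholders, so take the exact schema from the source docs linked above:
{
  "host": "localhost",
  "port": 5432,
  "database": "postgres",
  "username": "postgres",
  "password": "password"
}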
- Create the writer configuration file:
Create a file named writer.json in your working directory with the following content:
{
"type": "ICEBERG",
"writer": {
"catalog_type": "jdbc",
"jdbc_url": "jdbc:postgresql://localhost:5432/iceberg",
"jdbc_username": "iceberg",
"jdbc_password": "password",
"normalization": false,
"iceberg_s3_path": "s3a://warehouse",
"s3_endpoint": "http://localhost:9000",
"s3_use_ssl": false,
"s3_path_style": true,
"aws_access_key": "admin",
"aws_secret_key": "password",
"iceberg_db": "olake_iceberg"
}
}
- Run the sync process:
- Discover Command:
<DISCOVER_COMMAND>
- Sync Command:
<SYNC_COMMAND>
- Sync with State Command:
<SYNC_WITH_STATE_COMMAND>
Refer to the respective database docs for the exact commands to discover the schema and sync the data; a rough sketch of the command shape follows the links below:
- MongoDB Discover and sync command
- Postgres Discover and sync command
- MySQL Discover and sync command
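As a rough, non-authoritative sketch of what these commands generally look like (the image name, mount path, and flags here are assumptions for a Postgres source; use the exact commands from the docs linked above), discover and sync are run through the connector's Docker image:
docker run -v "$(pwd):/mnt/config" olakego/source-postgres:latest discover --config /mnt/config/config.json
docker run -v "$(pwd):/mnt/config" olakego/source-postgres:latest sync --config /mnt/config/config.json --catalog /mnt/config/catalog.json --destination /mnt/config/writer.json
The sync-with-state variant additionally points at the state file saved by a previous run.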
- The first time you run the sync command, it might take a while for the process to build and download the necessary dependencies and jar files. Subsequent runs will be faster.
- The writer.json file should be in the same directory where you run the sync command.
If the sync command finishes with a final success screen, the data has been synced successfully to the Iceberg table.
- Verify Data Population:
Connect to the spark-iceberg container:
docker exec -it spark-iceberg bash
Start Spark SQL:
spark-sql
Run a query to inspect your data:
select * from olake_iceberg.olake_iceberg.table_name;
Now you can run queries on your Iceberg data using Spark SQL. Some useful commands are:
show databases;
use <database_name>;
show tables from olake_iceberg.olake_iceberg;
describe formatted olake_iceberg.olake_iceberg.table_name;
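Iceberg also exposes metadata tables through Spark SQL. Assuming the same table name as above, you can inspect the table's snapshots and data files, for example:
select * from olake_iceberg.olake_iceberg.table_name.snapshots;
select * from olake_iceberg.olake_iceberg.table_name.files;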
Docker Compose file
Here is the docker-compose.yml file used for the local setup:
version: "3"
services:
# Spark container running a preconfigured Spark environment with Iceberg support.
spark-iceberg:
image: tabulario/spark-iceberg # Prebuilt image with Spark and Iceberg
container_name: spark-iceberg
build: spark/ # Build context for custom Spark configurations
networks:
iceberg_net:
depends_on:
- minio # Wait for Minio service to start
- postgres # Wait for PostgreSQL service to start
volumes:
- ./spark-defaults.conf:/opt/spark/conf/spark-defaults.conf # Custom Spark configuration file
- ./data/ivy-cache:/root/.ivy2 # Local cache for dependencies
environment:
- AWS_ACCESS_KEY_ID=admin # AWS key (used by Spark to access S3/Minio)
- AWS_SECRET_ACCESS_KEY=password # AWS secret key
- AWS_REGION=us-east-1 # AWS region configuration
ports:
- 8888:8888 # Spark UI/Jupyter port mapping
- 8088:8080 # Spark web UI port mapping
- 10000:10000 # JDBC/Thrift server port mapping
- 10001:10001 # Additional port if needed
# Minio container acting as a local S3-like storage service.
minio:
image: minio/minio
container_name: minio
environment:
- MINIO_ROOT_USER=admin # Minio admin username
- MINIO_ROOT_PASSWORD=password # Minio admin password
- MINIO_DOMAIN=minio # Domain alias configuration
networks:
iceberg_net:
aliases:
- warehouse.minio # Alias used for referencing the storage bucket
ports:
- 9001:9001 # Minio console port
- 9000:9000 # S3 API port
volumes:
- ./data/minio-data:/data # Persist Minio data locally
command: ["server", "/data", "--console-address", ":9001"] # Command to start Minio server with console
# Minio Client container to manage bucket setup and policies.
mc:
depends_on:
- minio
image: minio/mc
container_name: mc
networks:
iceberg_net:
environment:
- AWS_ACCESS_KEY_ID=admin
- AWS_SECRET_ACCESS_KEY=password
- AWS_REGION=us-east-1
entrypoint: |
/bin/sh -c "
until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
if ! /usr/bin/mc ls minio/warehouse > /dev/null 2>&1; then
/usr/bin/mc mb minio/warehouse;
/usr/bin/mc policy set public minio/warehouse;
fi;
tail -f /dev/null
"
# PostgreSQL container serving as the JDBC catalog for Iceberg.
postgres:
image: postgres:13
container_name: iceberg-postgres
networks:
iceberg_net:
environment:
- POSTGRES_USER=iceberg # PostgreSQL username for the catalog
- POSTGRES_PASSWORD=password # PostgreSQL password for the catalog
- POSTGRES_DB=iceberg # Database name used by Iceberg
ports:
- 5432:5432 # PostgreSQL port mapping
volumes:
- ./data/postgres-data:/var/lib/postgresql/data # Persist PostgreSQL data locally
networks:
iceberg_net: # Custom network for all services
volumes:
postgres-data:
minio-data:
Services Overview
- spark-iceberg:
Runs Spark with Iceberg support to query and verify the data. It depends on both Minio (for S3-like storage) and PostgreSQL (as the metadata catalog).
- minio:
Simulates an AWS S3 environment locally. Iceberg stores its data and metadata here.
- mc (Minio Client):
Automates the creation of the storage bucket (warehouse) in Minio and sets the appropriate access policies.
- postgres:
Provides a PostgreSQL-based JDBC catalog to manage Iceberg metadata.
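To double-check the storage side, you can list the warehouse bucket through the mc container, or browse the Minio console at http://localhost:9001 using the admin / password credentials from the compose file. For example, assuming a sync has already completed:
docker exec mc mc ls minio/warehouse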
Refer to spark-defaults.conf for more information about the values set for catalog configurations.