
Docker Compose for local setup and testing

This page explains how to set up your local test environment using Docker Compose, based on the docker-compose.yml file used for local setup and testing.

The setup is simple:

Step 1. Clone the OLake repository and navigate to the local test directory.

git clone git@github.com:datazip-inc/olake.git
cd olake/writers/iceberg/local-test

Step 2. Start the Docker Compose services.

Step 3. Create the source configuration file (source.json) in any directory of your choosing. Just make sure to reference its file path in the sync command.

Step 4. Create the destination configuration file (destination.json).

Step 5. Run the discover process for stream schema discovery. (This command creates a streams.json file with the complete schema of your database: all streams / tables / collections, column names, column data types, etc. You can also set partition_regex here.)

Step 6. Run the sync process.

Step 7. Verify the data population via spark-sql.

Now let's move on to the detailed setup instructions.

Prerequisites Installation

This section covers the installation steps for Java 17, Docker & Docker Compose, and Maven for macOS (M1 & Intel), Linux, and Windows.

Java 17

Install via Homebrew:

brew install openjdk@17

Or follow the Adoptium Installation Guide.

Verification:

java -version

Ensure it mentions Java 17.

Note: If needed, add export PATH="/usr/local/opt/openjdk@17/bin:$PATH" (or equivalent) to your shell config (e.g., ~/.bashrc or ~/.zshrc).
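
For example, on an Apple Silicon Mac where Homebrew installs under /opt/homebrew, the update would look roughly like this (a minimal sketch, assuming zsh):

echo 'export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version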

Docker & Docker Compose

Option 1: Docker Desktop

  1. Install Docker Desktop from Docker Hub.
  2. Ensure Docker is running.

Option 2: Homebrew

brew install --cask docker

Verification:

docker --version
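
Since the setup also relies on Docker Compose, it is worth confirming that the Compose plugin is available (it ships with Docker Desktop):

docker compose version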

Maven Installation

Install via Homebrew:

brew install maven

Or download from Apache Maven.

Verification:

mvn -version

Local Catalog Test Setup

You can test the following catalogs using the local test setup:

  1. Glue
  2. Hive
  3. JDBC
  4. REST

Steps to run:

  1. Prerequisite: Ensure Docker is installed (instructions in the Prerequisites section above).

  2. Clone OLake and navigate to the local test directory:

    git clone git@github.com:datazip-inc/olake.git
    cd olake/writers/iceberg/local-test
  3. Start the Docker Compose services:

    docker compose up

    This command starts the following services:

    • Postgres: Acts as the JDBC catalog.
    • Lakekeeper: Acts as the Iceberg catalog.
    • hive-metastore: Acts as the Hive catalog.
    • Minio: Provides an AWS S3-like filesystem for storing Iceberg data.
    • Spark: For querying the Iceberg data.
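
    Once the stack is up, you can optionally confirm that all containers are running:

    docker compose ps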
  4. Create the source configuration file:
    Find the source configs for all the databases you wish to sync data from here.
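
    For orientation only, a PostgreSQL source.json looks roughly like the sketch below. The field names are illustrative assumptions rather than the authoritative schema, so copy the exact config from the source docs for your database:

source.json (illustrative)
{
  "host": "localhost",
  "port": 5432,
  "database": "postgres",
  "username": "postgres",
  "password": "password",
  "max_threads": 5
}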

  5. Create the destination configuration file:
    Create a file named destination.json in your working directory with the following content:

destination.json
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "glue",
    "normalization": false,
    "iceberg_s3_path": "s3://<BUCKET_NAME>/<S3_PREFIX_VALUE>",
    "aws_region": "ap-south-1",
    "aws_access_key": "XXX",
    "aws_secret_key": "XXX",
    "iceberg_db": "ICEBERG_DATABASE_NAME",
    "grpc_port": 50051,
    "server_host": "localhost"
  }
}

Refer here for more about catalog configs.
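
The example above targets the AWS Glue catalog. To write into the local stack started by docker compose, you would instead pick one of the local catalogs (JDBC, REST/Lakekeeper, or Hive) and point the writer at MinIO. A rough sketch for the REST (Lakekeeper) catalog follows; the catalog URL, S3 endpoint, and credential keys shown here are assumptions, so take the exact field names from the catalog configuration docs:

destination.json (illustrative, local REST catalog)
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "rest",
    "normalization": false,
    "rest_catalog_url": "http://localhost:8181/catalog",
    "iceberg_s3_path": "s3://warehouse/",
    "s3_endpoint": "http://localhost:9000",
    "aws_access_key": "minio-access-key",
    "aws_secret_key": "minio-secret-key",
    "iceberg_db": "olake_iceberg"
  }
}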

  6. Run the Sync Process:
    • Discover Command: <DISCOVER_COMMAND>
    • Sync Command: <SYNC_COMMAND>
    • Sync with State Command: <SYNC_WITH_STATE_COMMAND>

Refer to the respective database docs for the exact commands to discover the schema and sync the data.
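
For orientation only, these commands generally take the shape sketched below, assuming a PostgreSQL source and the config files in your current directory. The image name and flags here are assumptions modelled on the general OLake pattern, so treat the database-specific docs as the authoritative reference:

# Discover (illustrative)
docker run -v "$(pwd):/mnt/config" olakego/source-postgres:latest \
  discover --config /mnt/config/source.json

# Sync (illustrative)
docker run -v "$(pwd):/mnt/config" olakego/source-postgres:latest \
  sync --config /mnt/config/source.json \
       --catalog /mnt/config/streams.json \
       --destination /mnt/config/destination.json

# Sync with state (illustrative): add --state to resume from a previous run
#   --state /mnt/config/state.json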

note
  1. The first time you run the sync command, it may take a while to build and download the necessary dependencies, Docker images, and .jar files. Subsequent runs will be faster.
  2. The destination.json file should be in the same directory where you run the sync command.

If you see the final success screen after running the sync command, the data has been synced successfully to the Iceberg table.

  7. Verify Data Population:

    Connect to the spark-iceberg container:

    docker exec -it spark-iceberg bash

    Start Spark SQL:

    spark-sql

    Run a query to inspect your data:

    select * from olake_iceberg.olake_iceberg.table_name;
info

The __op column tracks the operation performed on each record. Its values are "r" for read/backfill, "c" for create, "u" for update, and "d" for delete.
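
For example, to see how many records were written per operation type, you can group on this column (using the same table as above):

select __op, count(*) from olake_iceberg.olake_iceberg.table_name group by __op;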

Using Spark UI

You can also query the data using Spark SQL from a Jupyter notebook, which is accessible through the Spark UI.

Step 1: Head over to http://localhost:8888 to access the Spark UI.

Step 2: Click on File > New > Notebook.


Step 3: Choose Python 3 as the kernel.


Step 4: The Jupyter Notebook will open in a new tab.

Step 5: To run SQL queries, use the following code snippet in a Jupyter Notebook cell:

%%sql

SELECT * FROM CATALOG_NAME.ICEBERG_DATABASE_NAME.TABLE_NAME;


  • CATALOG_NAME can be: jdbc_catalog, hive_catalog, rest_catalog, etc.
  • ICEBERG_DATABASE_NAME is the name of the Iceberg database you created / set as a value in the destination.json file.

Now you can run queries on your Iceberg data using Spark SQL. Some useful commands are:

  • show databases;
  • use <database_name>;
  • show tables from olake_iceberg.olake_iceberg;
  • describe formatted olake_iceberg.olake_iceberg.table_name;
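
Iceberg also exposes metadata tables through Spark SQL, which are handy for checking what a sync actually wrote (standard Iceberg metadata tables, shown here for the same table as above):

select * from olake_iceberg.olake_iceberg.table_name.snapshots;
select * from olake_iceberg.olake_iceberg.table_name.history;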
info

Refer to spark-defaults.conf for more information about the values set for catalog configurations.


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: where we discuss future roadmaps, report bugs, help folks debug the issues they face, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!