Docker Compose for local setup and testing
This page explains how to set up your local test environment using Docker Compose. Here is the docker-compose.yml file used for local setup and testing.
The setup is simple:
Step 1. Clone the OLake repository and navigate to the local test directory.
git clone git@github.com:datazip-inc/olake.git
Step 2. Start the Docker Compose services.
Step 3. Create the source configuration file (source.json) in any directory of your choosing. Just make sure to reference its file path in the sync command.
Step 4. Create the destination configuration file (destination.json).
Step 5. Run the discover process for stream schema discovery. (This command creates a streams.json file with the complete schema of your database: all streams / tables / collections, column names, column data types, etc. You can also set partition_regex.)
Step 6. Run the sync process.
Step 7. Verify the data population via spark-sql.
Now let's move on to detailed setup instructions.
Prerequisites Installation
This section covers the installation steps for Java 17, Docker & Docker Compose, and Maven for macOS (M1 & Intel), Linux, and Windows.
Java 17
- macOS (M1 & Intel)
- Linux
- Windows
Install via Homebrew:
brew install openjdk@17
Or follow the Adoptium Installation Guide.
Verification:
java -version
Ensure it mentions Java 17.
Note: If needed, add
export PATH="/usr/local/opt/openjdk@17/bin:$PATH"
(or equivalent) to your shell config (e.g., ~/.bashrc or ~/.zshrc).
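For example, on an Apple Silicon Mac the Homebrew prefix is /opt/homebrew rather than /usr/local, so a sketch of the equivalent would be (path assumed; confirm with brew --prefix openjdk@17):
# Append the OpenJDK 17 bin directory to PATH and reload the shell config
echo 'export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version   # should report version 17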
For Ubuntu/Debian:
sudo apt update && sudo apt install openjdk-17-jdk
Verification:
java -version
Ensure it mentions Java 17.
Note: Add the Java binary to your PATH if your system doesn't do it automatically.
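A minimal sketch for Ubuntu/Debian, assuming the openjdk-17-jdk package installed under /usr/lib/jvm (the exact directory name varies by architecture, e.g. java-17-openjdk-arm64 on ARM):
# Point JAVA_HOME at the installed JDK and put its bin directory on PATH
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should report version 17
Append the two export lines to ~/.bashrc to make the change persistent.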
Docker & Docker Compose
- macOS (M1 & Intel)
- Linux
- Windows
Option 1: Docker Desktop
- Install Docker Desktop from Docker Hub.
- Ensure Docker is running.
Option 2: Homebrew
brew install --cask docker
Verification:
docker --version
Follow Docker Docs for your distribution.
For Ubuntu/Debian (example):
sudo apt update && sudo apt install docker.io docker-compose
Verification:
docker --version
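To additionally confirm that the daemon and Compose both work, you can run a throwaway container (standard Docker commands; the Compose invocation depends on whether you installed the plugin or the standalone docker-compose binary):
# Pull and run a disposable test container to verify the daemon
docker run --rm hello-world
# Check Compose: plugin form first, falling back to the standalone binary
docker compose version || docker-compose --version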
- Download Docker Desktop from Docker Hub.
- Run the installer.
- Launch Docker Desktop.
Verification (Command Prompt):
docker --version
Maven Installation
- macOS (M1 & Intel)
- Linux
- Windows
For Ubuntu/Debian:
sudo apt install maven
Or follow the Maven Installation Guide for other distros.
Verification:
mvn -version
- Download Maven from Apache Maven.
- Extract or install it.
- Add the Maven bin folder to your PATH.
Verification (Command Prompt):
mvn -version
Local Catalog Test Setup
You can test the following catalogs using the local test setup:
- Glue
- Hive
- JDBC
- REST
Steps to run:
- Prerequisite: Ensure Docker is installed (instructions in the Prerequisites section above).
- Clone OLake and navigate to the local test directory:
git clone git@github.com:datazip-inc/olake.git
cd olake/writers/iceberg/local-test
- Start the Docker Compose services:
docker compose up
This command starts the following services:
- Postgres: Acts as the JDBC catalog.
- Lakekeeper: Acts as the Iceberg catalog.
- hive-metastore: Acts as the Hive catalog.
- Minio: Provides an AWS S3-like filesystem for storing Iceberg data.
- Spark: For querying the Iceberg data.
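Before moving on, it can help to confirm from a second terminal that the containers came up; a quick check (service names come from the repository's docker-compose.yml, so yours may differ slightly):
# List the Compose services and their published ports; expect entries for
# Postgres (5432), Lakekeeper (8181), hive-metastore (9083), Minio (9000), and Spark
docker compose ps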
- Create the source configuration file:
Get the source config here for the database you wish to sync data from.
- Create the destination configuration file:
Create a file named destination.json in your working directory with the following content:
- AWS Glue
- JDBC Catalog
- REST Catalog
- Hive Catalog
{
"type": "ICEBERG",
"writer": {
"catalog_type": "glue",
"normalization": false,
"iceberg_s3_path": "s3://<BUCKET_NAME>/<S3_PREFIX_VALUE>",
"aws_region": "ap-south-1",
"aws_access_key": "XXX",
"aws_secret_key": "XXX",
"iceberg_db": "ICEBERG_DATABASE_NAME",
"grpc_port": 50051,
"server_host": "localhost"
}
}
{
"type": "ICEBERG",
"writer": {
"catalog_type": "jdbc",
"jdbc_url": "jdbc:postgresql://host.docker.internal:5432/iceberg",
"jdbc_username": "iceberg",
"jdbc_password": "password",
"normalization": false,
"iceberg_s3_path": "s3a://warehouse",
"s3_endpoint": "http://host.docker.internal:9000",
"s3_use_ssl": false,
"s3_path_style": true,
"aws_access_key": "admin",
"aws_region": "ap-south-1",
"aws_secret_key": "password",
"iceberg_db": "ICEBERG_DATABASE_NAME"
}
}
{
"type": "ICEBERG",
"writer": {
"catalog_type": "rest",
"normalization": false,
"rest_catalog_url": "http://localhost:8181/catalog",
"iceberg_s3_path": "warehouse",
"iceberg_db": "ICEBERG_DATABASE_NAME"
}
}
{
"type": "ICEBERG",
"writer": {
"catalog_type": "hive",
"normalization": false,
"iceberg_s3_path": "s3a://warehouse/",
"aws_region": "us-east-1",
"aws_access_key": "admin",
"aws_secret_key": "password",
"s3_endpoint": "http://localhost:9000",
"hive_uri": "thrift://localhost:9083",
"s3_use_ssl": false,
"s3_path_style": true,
"hive_clients": 5,
"hive_sasl_enabled": false,
"iceberg_db": "ICEBERG_DATABASE_NAME"
}
}
Refer here for more about catalog configs.
- Run the Sync Process:
- Discover Command:
<DISCOVER_COMMAND>
- Sync Command:
<SYNC_COMMAND>
- Sync with State Command:
<SYNC_WITH_STATE_COMMAND>
Refer to the respective database docs for the discover and sync commands:
- MongoDB Discover and sync command
- Postgres Discover and sync command
- MySQL Discover and sync command
- The first time you run the sync command, it might take a while for the process to build and download the necessary dependencies, Docker images, and .jar files. Subsequent runs will be faster.
- The destination.json file should be in the same directory where you run the sync command.
If the final screen after running the sync command looks like the one below, the data has been synced successfully to the Iceberg table.
- Verify Data Population:
Connect to the spark-iceberg container:
docker exec -it spark-iceberg bash
Start Spark SQL:
spark-sql
Run a query to inspect your data:
select * from olake_iceberg.olake_iceberg.table_name;
The __op column values will be r, c, u, or d: "r" for read/backfill, "c" for create, "u" for update, and "d" for delete. The __op column is used to track the operation performed on the data.
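For example, to see how many rows each operation type produced (reusing the example table name from the query above; substitute your own):
-- Count rows per change type: r = read/backfill, c = create, u = update, d = delete
select __op, count(*) from olake_iceberg.olake_iceberg.table_name group by __op;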
Using Spark UI
You can also query the data using Spark SQL in a Jupyter notebook, accessible through the Spark UI.
Step 1: Head over to http://localhost:8888 to access the Spark UI.
Step 2: Click File > New > Notebook.
Step 3: Choose Python 3 as the kernel.
Step 4: The Jupyter notebook will open in a new tab.
Step 5: To run SQL queries, use the following code snippet in a notebook cell:
%%sql
SELECT * FROM CATALOG_NAME.ICEBERG_DATABASE_NAME.TABLE_NAME;
CATALOG_NAME can be jdbc_catalog, hive_catalog, rest_catalog, etc. ICEBERG_DATABASE_NAME is the name of the Iceberg database you created / added as a value in the destination.json file.
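For example, with the JDBC catalog and the placeholder names kept as-is (substitute the values from your destination.json and stream selection):
%%sql
SELECT COUNT(*) FROM jdbc_catalog.ICEBERG_DATABASE_NAME.TABLE_NAME;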
Now you can run queries on your Iceberg data using Spark SQL. Some useful commands are:
show databases;
use <database_name>;
show tables from olake_iceberg.olake_iceberg;
describe formatted olake_iceberg.olake_iceberg.table_name;
Refer to spark-defaults.conf for more information about the values set for catalog configurations.
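If you want to inspect table internals, Iceberg also exposes metadata tables through Spark SQL (a standard Iceberg feature; the table name below reuses the example from this guide):
-- List the snapshots created by each sync run
select * from olake_iceberg.olake_iceberg.table_name.snapshots;
-- Show the data files that currently back the table
select * from olake_iceberg.olake_iceberg.table_name.files;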