From Postgres to Iceberg: Creating OLake Jobs with Docker CLI and UI

· 7 min read
Vishal
OLake Maintainer

Data replication has become one of the most essential building blocks in modern data engineering. Whether it's keeping your analytics warehouse in sync with operational databases or feeding real-time pipelines for machine learning, companies rely on tools to move data quickly and reliably.

Today, there's no shortage of options: platforms like Fivetran, Airbyte, Debezium, and even custom-built Flink or Spark pipelines are widely used to handle replication. But each of these comes with trade-offs: infrastructure complexity, cost, or a lack of flexibility when you want to adapt replication to your specific needs.

That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control.

In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from Postgres → Apache Iceberg (Glue catalog) with CDC, normalization, filters, partitioning, and scheduling, all running seamlessly.

Two Setup Styles (pick what fits you)

Option A – UI "Job-first" (guided, all-in-one)

Perfect if you want a clear wizard and visual guardrails.

Option B – CLI (Docker)

Great if you prefer terminal, versioned JSON, or automation.

Both produce the same result. Choose the path that matches your workflow today.

Option A – OLake UI (Guided)

We'll take the "job-first" approach. It's straightforward and keeps you in one flow.

1) Create a Job

From the left nav, go to Jobs → Create Job.
You'll land on a wizard that starts with the source.

Job page

2) Configure the Source (Postgres)

Choose Set up a new source → select Postgres → keep OLake version at the latest stable.
Name it clearly, fill the Postgres endpoint config, and hit Test Connection.

Job source connector

Job source config

๐Ÿ“ Planning for CDC?
Make sure a replication slot exists in Postgres.
See: Replication Slot Guide.
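If you want to check that prerequisite from a terminal first, a couple of psql commands go a long way. A minimal sketch, assuming your user has the REPLICATION privilege; the slot name olake_slot and the pgoutput plugin are illustrative choices, not values OLake mandates:

# Logical decoding must be enabled for CDC
psql -h <host> -U postgres -d postgres -c "SHOW wal_level;"

# Create a logical replication slot ('olake_slot' and 'pgoutput' are example choices)
psql -h <host> -U postgres -d postgres \
  -c "SELECT pg_create_logical_replication_slot('olake_slot', 'pgoutput');"

# Confirm the slot exists and whether anything is consuming it
psql -h <host> -U postgres -d postgres \
  -c "SELECT slot_name, plugin, active FROM pg_replication_slots;"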

3) Configure the Destination (Iceberg + Glue)

Now we set where the data will land.
Pick Apache Iceberg as the destination, and AWS Glue as the catalog.

Job dest connector

Job dest catalog

Provide the connection details and Test Connection.

Job dest config

4) Configure Streams

This is where we dial in what to replicate and how.
For this walkthrough, we'll:

  • Include stream fivehundred
  • Sync mode: Full Refresh + CDC
  • Normalization: On
  • Filter: dropoff_datetime >= "2010-01-01 00:00:00"
  • Partitioning: by year extracted from dropoff_datetime
  • Schedule: every day at 12:00 AM

Job streams page

Select the checkbox for fivehundred, then click the stream name to open stream settings.
Pick the sync mode and toggle Normalization.

Select stream

Let's make the destination query-friendly. Open Partitioning → choose dropoff_datetime → year.
Want more? Read the Partitioning Guide.

Stream partitioning

Add the Data Filter so we only move rows from 2010 onward.

Stream filter

Click Next to continue.

5) Schedule the Job

Give the job a clear name, set Every Day @ 12:00 AM, and hit Create Job.

Job schedule

You're set! 🎉

Job created

Want results right away? Start a run immediately with Jobs → (⋮) → Sync Now.

Sync now

You'll see status badges on the right (Running / Failed / Completed).
For more details, open Job Logs & History.

  • Running
    Job running

  • Completed
    Job success

Finally, verify that data landed in S3/Iceberg as configured:

S3 data
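Prefer the terminal over the S3 console? A quick listing confirms that files are arriving. A sketch, assuming the AWS CLI is configured; substitute the bucket and warehouse path you set in the destination:

# Show the most recently written Iceberg data/metadata files
aws s3 ls s3://<your-bucket>/<your-iceberg-path>/ --recursive --human-readable | tail -20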

6) Manage Your Job (from the Jobs page)

Sync Now – Trigger a run without waiting.

Edit Streams – Change which streams are included and tweak replication settings.
Use the stepper to jump between Source and Destination.

Edit streams

By default, source/destination editing is locked. Click Edit to unlock.

Edit destination

🔄 Need to change Partitioning / Filter / Normalization for an existing stream?
Unselect the stream → Save → reopen Edit Streams → re-add it with new settings.

Pause Job – Temporarily stop runs. You'll find paused jobs under Inactive Jobs, where you can Resume any time.

Pause/Resume

Job Logs & History – See all runs. Use View Logs for per-run details.

Job logs list

Logs page

Job Settings – Rename, change frequency, pause, or delete.
Deleting a job moves its source/destination to inactive (if not used elsewhere).

Job settings

Option B – OLake CLI (Docker)

Prefer terminals, PR reviews, and repeatable runs? Let's do the same pipeline via Docker.

Prerequisites

  • Docker installed and running
  • OLake images: Docker Hub → olakego/*

How the CLI flow works

  1. Configure source & destination (JSON files)
  2. Discover streams → writes a streams.json
  3. Edit stream configuration (normalization, filters, partitions, sync mode)
  4. Run the sync
  5. Monitor with stats.json
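By the end of the flow, the config directory will look roughly like this (a sketch assembled from the files the steps below create):

/path/to/config/
├── source.json         # step 1: you write this
├── destination.json    # step 1: you write this
├── streams.json        # step 2: written by discover; step 3: edited by you
├── state.json          # step 6: written during sync for resume/CDC
├── stats.json          # step 5: progress metrics, rewritten as the sync runs
└── logs/               # per-run olake.log files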

What we'll build

  • Source: Postgres
  • Destination: Apache Iceberg (Glue catalog)
  • Table: fivehundred
  • CDC mode + Normalization
  • Filter: dropoff_datetime >= "2010-01-01 00:00:00"
  • Partition by year from dropoff_datetime

1) Create Config Files

We'll put everything under /path/to/config/.

Source – source.json

source.json
{
  "host": "dz-stag.postgres.database.azure.com",
  "port": 5432,
  "database": "postgres",
  "username": "postgres",
  "password": "XXX",
  "jdbc_url_params": {},
  "ssl": { "mode": "require" },
  "update_method": {
    "replication_slot": "replication_slot",
    "intital_wait_time": 120
  },
  "default_mode": "cdc",
  "max_threads": 6
}

๐Ÿ“ If you plan to run CDC, ensure a Postgres replication slot exists. See: Replication Slot Guide.

Destination – destination.json

destination.json
{
  "type": "ICEBERG",
  "writer": {
    "iceberg_s3_path": "s3://vz-testing-olake/olake_cli_demo",
    "aws_region": "XXX",
    "aws_access_key": "XXX",
    "aws_secret_key": "XXX",
    "iceberg_db": "olake_cli_demo",
    "grpc_port": 50051,
    "sink_rpc_server_host": "localhost"
  }
}
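It can also save a failed first run to verify the Glue and S3 side with the same credentials. A sketch using the AWS CLI; the database name matches iceberg_db above, and <aws_region> stands in for the region you configured:

# Confirm the Glue database is reachable (create it first if your setup doesn't auto-create it)
aws glue get-database --name olake_cli_demo --region <aws_region>

# Confirm the S3 path is listable with these credentials (empty on a first run is fine)
aws s3 ls s3://vz-testing-olake/olake_cli_demo/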

2) Discover Streams

This pulls available tables and writes streams.json.

docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  discover \
  --config /mnt/config/source.json

Discover start logs

Discover completion

โ„น๏ธ Logs are also written to: /path/to/config/logs/sync_[YYYY-MM-DD]_[HH-MM-SS]/olake.log

3) Edit streams.json

Select exactly what to move and how.

  • Select streams → keep only fivehundred under "selected_streams".
  • Normalization → "normalization": true
  • Filter → "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
  • Partitioning → "partition_regex": "/{dropoff_datetime, year}"
  • Sync mode → set the stream's "sync_mode" to "cdc"

Minimal selection block

streams.json (selection)
{
  "selected_streams": {
    "public": [
      {
        "partition_regex": "/{dropoff_datetime, year}",
        "stream_name": "fivehundred",
        "normalization": true,
        "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
      }
    ]
  }
}
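After hand-editing, it's worth confirming the file is still valid JSON and that only the intended stream is selected. A sketch, assuming jq is installed:

# Print the selected stream names per namespace (errors out if the JSON is malformed)
jq '.selected_streams | map_values([.[].stream_name])' /path/to/config/streams.json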

Full stream entry (showing supported modes)

streams.json (stream detail)
{
  "streams": [
    {
      "stream": {
        "name": "fivehundred",
        "namespace": "public",
        "type_schema": {
          "properties": {
            "dropoff_datetime": { "type": ["timestamp", "null"] }
          }
        },
        "supported_sync_modes": [
          "strict_cdc",
          "full_refresh",
          "incremental",
          "cdc"
        ],
        "source_defined_primary_key": [],
        "available_cursor_fields": ["id", "pickup_datetime", "rate_code_id"],
        "sync_mode": "cdc"
      }
    }
  ]
}

📚 Need a refresher on how modes differ? Check out our documentation on sync modes.

4) Run the Sync

Kick off replication:

docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json

Sync start

Sync completed

5) Monitor Progress with stats.json

A stats.json appears next to your configs:

stats.json
{
  "Estimated Remaining Time": "0.00 s",
  "Memory": "367 mb",
  "Running Threads": 0,
  "Seconds Elapsed": "34.01",
  "Speed": "14.70 rps",
  "Synced Records": 500
}
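Since the file is rewritten as the run progresses, you can poll it while the sync is going. A sketch, assuming watch and jq are available:

# Re-print the latest progress metrics every 5 seconds
watch -n 5 "jq . /path/to/config/stats.json"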

Confirm the data in your destination (S3 / Iceberg):

Data in Iceberg

6) About the state.json (Resumable & CDC-friendly)

When a sync starts, OLake writes a state.json that tracks progress and CDC offsets (e.g., Postgres LSN). This lets you resume without duplicates and continue CDC seamlessly.

To resume / keep streaming:

docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json \
  --state /mnt/config/state.json

More details: Check out our Postgres connector documentation for state file configuration.
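In the UI we scheduled the job for every day at 12:00 AM; with the CLI, the equivalent is whatever scheduler you already run. A minimal crontab sketch, reusing the paths and image from above, with the state file included so CDC picks up where it left off:

# Run the sync daily at midnight and append output to a log
0 0 * * * docker run --pull=always -v "/path/to/config:/mnt/config" olakego/source-postgres:latest sync --config /mnt/config/source.json --catalog /mnt/config/streams.json --destination /mnt/config/destination.json --state /mnt/config/state.json >> /path/to/config/logs/cron.log 2>&1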


Quick Q&A

UI or CLI: how should I choose? If you're new to OLake or prefer a guided setup, start with UI. If you're automating, versioning configs, or scripting in CI, use CLI.

Why "Full Refresh + CDC"? You get a baseline snapshot and continuous changesโ€”ideal for keeping downstream analytics fresh.

Can I change partitioning later?

  • UI: unselect the stream → save → re-add with updated partitioning/filter/normalization.
  • CLI: edit streams.json and re-run.