Create OLake Replication Jobs: Postgres to Iceberg Docker CLI

· 10 min read
Vishal
OLake Maintainer

Data replication has become one of the most essential building blocks in modern data engineering. Whether it's keeping your analytics warehouse in sync with operational databases or feeding real-time pipelines for machine learning, companies rely on tools to move data quickly and reliably.

Today, there's no shortage of options: platforms like Fivetran, Airbyte, Debezium, and even custom-built Flink or Spark pipelines are widely used to handle replication. But each of these comes with trade-offs: infrastructure complexity, cost, or lack of flexibility when you want to adapt replication to your specific needs.

That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control.

In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from Postgres to Apache Iceberg (Glue Catalog) with CDC, normalization, filters, partitioning, and scheduling – all running seamlessly.

Two Setup Styles (pick what fits you)

Option A – UI "Job-first" (guided, all-in-one)

Perfect if you want a clear wizard and visual guardrails.

Option B – CLI (Docker)

Great if you prefer terminal, versioned JSON, or automation.

Both produce the same result. Choose the path that matches your workflow today.

Option A – OLake UI (Guided)

We'll take the "job-first" approach. It's straightforward and keeps you in one flow.

1) Create a Job

From the left nav, go to Jobs → Create Job.
You'll land on a wizard that starts with the source.

OLake jobs dashboard with the Jobs tab, Create Job button, and Create your first Job button highlighted

2) Configure the Source (Postgres)

Choose Set up a new source → select Postgres → keep OLake version at the latest stable.
Name it clearly, fill the Postgres endpoint config, and hit Test Connection.

OLake Create Job step 2 screen, showing source connector options including Postgres, MongoDB, MySQL, and Oracle, with Postgres highlighted

OLake Create Job with Postgres source configuration fields and a side help panel with setup steps

๐Ÿ“ Planning for CDC?
Make sure a replication slot exists in Postgres.
See: Replication Slot Guide.
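
If the slot doesn't exist yet, you can create one from psql. A minimal sketch, assuming the pgoutput plugin and a placeholder slot name (confirm both against the Replication Slot Guide):

psql "host=<your-postgres-host> port=5432 dbname=postgres user=postgres" \
  -c "SELECT pg_create_logical_replication_slot('olake_slot', 'pgoutput');"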

3) Configure the Destination (Iceberg + Glue)

Now we set where the data will land.
Pick Apache Iceberg as the destination, and AWS Glue as the catalog.

OLake Create Job step 3 destination setup, showing connector selection with Amazon S3 and Apache Iceberg options, and Apache Iceberg highlighted

OLake Create Job destination setup for Apache Iceberg, with Catalog Type dropdown showing AWS Glue, JDBC, Hive, and REST options

Provide the connection details and Test Connection.

OLake Create Job destination config for Apache Iceberg with AWS Glue; right panel shows AWS Glue Catalog Write Guide with setup and prerequisites

4) Configure Streams

This is where we dial in what to replicate and how.
For this walkthrough, we'll:

  • Include stream fivehundred
  • Sync mode: Full Refresh + CDC
  • Normalization: On
  • Filter: dropoff_datetime >= "2010-01-01 00:00:00"
  • Partitioning: by year extracted from dropoff_datetime
  • Schedule: every day at 12:00 AM

OLake streams selection, employee_data and other tables checked, sync mode set to Full Refresh + CDC

Select the checkbox for fivehundred, then click the stream name to open stream settings.
Pick the sync mode and toggle Normalization.

OLake streams – only fivehundred selected, Full Refresh + CDC mode

Let's make the destination query-friendly. Open Partitioning → choose dropoff_datetime → year.
Want more? Read the Partitioning Guide.

OLake: fivehundred stream selected, partition by dropoff_datetime and year

Add the Data Filter so we only move rows from 2010 onward.

OLake: fivehundred stream, filter dropoff_datetime >= 2010-01-01

Click Next to continue.

5) Schedule the Job

Give the job a clear name, set Every Day @ 12:00 AM, and hit Create Job.

OLake Create Job page showing step 1, with job name, frequency dropdown (Every Day highlighted), and job start time settings

You're set! 🎉

OLake job created successfully for fivehundred stream, Full Refresh + CDC

Want results right away? Start a run immediately with Jobs → (⋮) → Sync Now.

Active jobs screen for OLake with job options menu expanded.

You'll see status badges on the right (Running / Failed / Completed).
For more details, open Job Logs & History.

  • Running
    OLake active jobs screen showing a running job

  • Completed
    OLake active jobs screen showing a completed job

Finally, verify that data landed in S3/Iceberg as configured:

Amazon S3 folder view showing two Parquet files under dropoff_datetime_year=2011
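
Prefer the terminal? The AWS CLI can list the partition folders too. A quick sketch – the bucket and prefix below are placeholders for whatever you configured in the destination step:

aws s3 ls s3://<your-bucket>/<your-iceberg-path>/ --recursive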

6) Manage Your Job (from the Jobs page)

Sync Now – Trigger a run without waiting.

Edit Streams – Change which streams are included and tweak replication settings.
Use the stepper to jump between Source and Destination.

Stream selection screen for OLake Postgres Iceberg job, with S3 folder and sync steps shown

By default, source/destination editing is locked. Click Edit to unlock.

OLake Postgres Iceberg job destination config with AWS Glue setup and edit option

🔄 Need to change Partitioning / Filter / Normalization for an existing stream?
Unselect the stream → Save → reopen Edit Streams → re-add it with new settings.

Pause Job – Temporarily stop runs. You'll find paused jobs under Inactive Jobs, where you can Resume any time.

Inactive jobs tab showing a PostgreSQL job with the option to resume in the OLake UI

Job Logs & History – See all runs. Use View Logs for per-run details.

Job log history for a Postgres Iceberg job, showing a completed status and option to view logs.

OLake Postgres Iceberg job logs showing system info and sync steps with Iceberg writer and Postgres source.

Job Settings – Rename, change frequency, pause, or delete.
Deleting a job moves its source/destination to inactive (if not used elsewhere).

Active Postgres Iceberg job settings screen; job runs daily at 12 AM UTC with pause and delete options

Option B – OLake CLI (Docker)

Prefer terminals, PR reviews, and repeatable runs? Let's do the same pipeline via Docker.

Prerequisites

  • Docker installed and running
  • OLake images: Docker Hub → olakego/*

How the CLI flow works

  1. Configure source & destination (JSON files)
  2. Discover streams → writes a streams.json
  3. Edit stream configuration (normalization, filters, partitions, sync mode)
  4. Run the sync
  5. Monitor with stats.json

What we'll build

  • Source: Postgres
  • Destination: Apache Iceberg (Glue catalog)
  • Table: fivehundred
  • CDC mode + Normalization
  • Filter: dropoff_datetime >= "2010-01-01 00:00:00"
  • Partition by year from dropoff_datetime

1) Create Config Files

We'll put everything under /path/to/config/.
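
A quick sketch of that layout (the host path is the placeholder used throughout this post):

mkdir -p /path/to/config
# After this step, /path/to/config will hold:
#   source.json        - Postgres connection settings
#   destination.json   - Iceberg + Glue writer settings
# Discover adds streams.json; sync adds state.json and stats.json alongside.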

Source – source.json

source.json
{
  "host": "dz-stag.postgres.database.azure.com",
  "port": 5432,
  "database": "postgres",
  "username": "postgres",
  "password": "XXX",
  "jdbc_url_params": {},
  "ssl": { "mode": "require" },
  "update_method": {
    "replication_slot": "replication_slot",
    "initial_wait_time": 120
  },
  "default_mode": "cdc",
  "max_threads": 6
}

๐Ÿ“ If you plan to run CDC, ensure a Postgres replication slot exists. See: Replication Slot Guide.

Destination – destination.json

destination.json
{
  "type": "ICEBERG",
  "writer": {
    "iceberg_s3_path": "s3://vz-testing-olake/olake_cli_demo",
    "aws_region": "XXX",
    "aws_access_key": "XXX",
    "aws_secret_key": "XXX",
    "iceberg_db": "olake_cli_demo",
    "grpc_port": 50051,
    "sink_rpc_server_host": "localhost"
  }
}
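
Before syncing, it can save a debugging loop to confirm those credentials actually reach AWS. A sketch using the AWS CLI, assuming it's configured with the same keys as destination.json:

aws sts get-caller-identity
aws s3 ls s3://vz-testing-olake/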

2) Discover Streams

This pulls available tables and writes streams.json.

docker run --pull=always \
-v "/path/to/config:/mnt/config" \
olakego/source-postgres:latest \
discover \
--config /mnt/config/source.json

Start logs – OLake sync job log: CDC found, Iceberg writer batch settings, stream discovery, type warnings, normalization steps

Completion – OLake log: CDC sync catalog output with trip and fare columns, Postgres discovery completed, cleanup in progress

โ„น๏ธ Logs are also written to: /path/to/config/logs/sync_[YYYY-MM-DD]_[HH-MM-SS]/olake.log

3) Edit streams.json

Select exactly what to move and how.

  • Select streams → keep only fivehundred under "selected_streams".
  • Normalization → "normalization": true
  • Filter → "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
  • Partitioning → "partition_regex": "/{dropoff_datetime, year}"
  • Sync mode → set the stream's "sync_mode" to "cdc"

Minimal selection block

streams.json (selection)
{
  "selected_streams": {
    "public": [
      {
        "partition_regex": "/{dropoff_datetime, year}",
        "stream_name": "fivehundred",
        "normalization": true,
        "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
      }
    ]
  }
}
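
Before running the sync, it's worth confirming the selection says what you think it says. A quick sketch, assuming jq is installed:

jq '.selected_streams.public[] | {stream_name, normalization, filter, partition_regex}' \
  /path/to/config/streams.json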

Full stream entry (showing supported modes)

streams.json (stream detail)
{
  "streams": [
    {
      "stream": {
        "name": "fivehundred",
        "namespace": "public",
        "type_schema": {
          "properties": {
            "dropoff_datetime": { "type": ["timestamp", "null"] }
          }
        },
        "supported_sync_modes": [
          "strict_cdc",
          "full_refresh",
          "incremental",
          "cdc"
        ],
        "source_defined_primary_key": [],
        "available_cursor_fields": ["id", "pickup_datetime", "rate_code_id"],
        "sync_mode": "cdc"
      }
    }
  ]
}

📚 Need a refresher on how modes differ? Check out our documentation on sync modes.

4) Run the Sync

Kick off replication:

docker run --pull=always \
-v "/path/to/config:/mnt/config" \
olakego/source-postgres:latest \
sync \
--config /mnt/config/source.json \
--catalog /mnt/config/streams.json \
--destination /mnt/config/destination.json

Sync start – OLake sync job log: CDC and Iceberg writer initialized, user-defined type warnings, skipping unselected streams

Sync completed – OLake job log: Iceberg commit confirmed, server shutdown, sync completed, records read and cleanup

5) Monitor Progress with stats.json

A stats.json appears next to your configs:

stats.json
{
  "Estimated Remaining Time": "0.00 s",
  "Memory": "367 mb",
  "Running Threads": 0,
  "Seconds Elapsed": "34.01",
  "Speed": "14.70 rps",
  "Synced Records": 500
}
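
The file is rewritten as the job progresses, so you can poll it for a live view; a simple sketch:

watch -n 2 cat /path/to/config/stats.json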

Confirm the data in your destination (S3 / Iceberg):

Amazon S3 dropoff_datetime_year=2011 partition showing two parquet files from OLake ETL pipeline

6) About the state.json (Resumable & CDC-friendly)

When a sync starts, OLake writes a state.json that tracks progress and CDC offsets (e.g., Postgres LSN). This lets you resume without duplicates and continue CDC seamlessly.

To resume / keep streaming:

docker run --pull=always \
-v "/path/to/config:/mnt/config" \
olakego/source-postgres:latest \
sync \
--config /mnt/config/source.json \
--catalog /mnt/config/streams.json \
--destination /mnt/config/destination.json \
--state /mnt/config/state.json

More details: Check out our Postgres connector documentation for state file configuration.
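
The CLI flow has no built-in scheduler, so bring your own. A minimal cron sketch mirroring the UI's daily 12:00 AM run, assuming the paths above and a user that can run docker:

0 0 * * * docker run --pull=always -v "/path/to/config:/mnt/config" olakego/source-postgres:latest sync --config /mnt/config/source.json --catalog /mnt/config/streams.json --destination /mnt/config/destination.json --state /mnt/config/state.json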


Quick Q&A

UI or CLI: how should I choose? If you're new to OLake or prefer a guided setup, start with UI. If you're automating, versioning configs, or scripting in CI, use CLI.

Why "Full Refresh + CDC"? You get a baseline snapshot and continuous changesโ€”ideal for keeping downstream analytics fresh.

Can I change partitioning later?

  • UI: unselect the stream → save → re-add with updated partitioning/filter/normalization.
  • CLI: edit streams.json and re-run.