
Create OLake Replication Jobs: Postgres to Iceberg (UI & Docker CLI)

· 10 min read
Vishal
OLake Maintainer

Creating OLake Replication Jobs

Data replication has become one of the most essential building blocks in modern data engineering. Whether it's keeping your analytics warehouse in sync with operational databases or feeding real-time pipelines for machine learning, companies rely on tools to move data quickly and reliably.

Today, there's no shortage of options: platforms like Fivetran, Airbyte, Debezium, and even custom-built Flink or Spark pipelines are widely used to handle replication. But each of these comes with trade-offs: infrastructure complexity, cost, or lack of flexibility when you want to adapt replication to your specific needs.

That's where OLake comes in. Instead of forcing you into one way of working, OLake focuses on making replication into Apache Iceberg (and other destinations) straightforward, fast, and adaptable. You can choose between a guided UI experience for simplicity or a Docker CLI flow for automation and DevOps-style control.

In this blog, we'll walk through how to set up a replication job in OLake, step by step. We'll start with the UI wizard for those who prefer a visual setup, then move on to the CLI-based workflow for teams that like to keep things in code. By the end, you'll have a job that continuously replicates from Postgres to Apache Iceberg (Glue Catalog) with CDC, normalization, filters, partitioning, and scheduling, all running seamlessly.

Two Setup Styles (pick what fits you)

Option A: UI "Job-first" (guided, all-in-one)

Perfect if you want a clear wizard and visual guardrails.

Option B: CLI (Docker)

Great if you prefer the terminal, versioned JSON, or automation.

Both produce the same result. Choose the path that matches your workflow today.

Option A: OLake UI (Guided)

We'll take the "job-first" approach. It's straightforward and keeps you in one flow.

1) Create a Job

From the left nav, go to Jobs → Create Job.
You'll land on a wizard that starts with the source.

[Screenshot: OLake jobs dashboard for new users, with the option to create your first job highlighted]

2) Configure the Source (Postgres)

Choose Set up a new source → select Postgres → keep the OLake version at the latest stable.
Name it clearly, fill in the Postgres endpoint config, and hit Test Connection.

[Screenshot: OLake create job interface with new source connector selection for MongoDB, Postgres, MySQL, Oracle]

[Screenshot: OLake create job screen showing Postgres source endpoint and CDC configuration with setup guide]

๐Ÿ“ Planning for CDC?
Make sure a replication slot exists in Postgres.
See: Replication Slot Guide.
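Want to double-check before you start? A quick psql session can confirm both prerequisites. The slot name and the pgoutput plugin below are illustrative, so match them to whatever the Replication Slot Guide and your setup actually use:

# Logical decoding must be enabled for CDC (must return 'logical')
psql -h <your-host> -U postgres -d postgres -c "SHOW wal_level;"

# Create a logical replication slot; use the same name you later give OLake.
# 'pgoutput' is an assumption here -- confirm the plugin in the Replication Slot Guide.
psql -h <your-host> -U postgres -d postgres \
  -c "SELECT pg_create_logical_replication_slot('replication_slot', 'pgoutput');"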

3) Configure the Destination (Iceberg + Glue)

Now we set where the data will land.
Pick Apache Iceberg as the destination, and AWS Glue as the catalog.

[Screenshot: OLake create job destination step showing Apache Iceberg and Amazon S3 connector selection]

[Screenshot: OLake create job destination endpoint config with catalog type selection AWS Glue, JDBC, Hive, REST]

Provide the connection details and Test Connection.

[Screenshot: OLake create job destination setup with Apache Iceberg, AWS Glue catalog, and S3 configuration form]

4) Configure Streams

This is where we dial in what to replicate and how.
For this walkthrough, we'll:

  • Include stream fivehundred
  • Sync mode: Full Refresh + CDC
  • Normalization: On
  • Filter: dropoff_datetime >= "2010-01-01 00:00:00"
  • Partitioning: by year extracted from dropoff_datetime
  • Schedule: every day at 12:00 AM

[Screenshot: OLake stream selection UI for a Postgres to Iceberg job with Full Refresh + CDC mode]

Select the checkbox for fivehundred, then click the stream name to open stream settings.
Pick the sync mode and toggle Normalization.

[Screenshot: OLake create job stream selection for Postgres to Iceberg with Full Refresh + CDC on fivehundred]

Let's make the destination query-friendly. Open Partitioning → choose dropoff_datetime → year.
Want more? Read the Partitioning Guide.

[Screenshot: OLake partitioning UI for stream fivehundred using the dropoff_datetime and year fields in Iceberg]

Add the Data Filter so we only move rows from 2010 onward.

[Screenshot: OLake create job with data filter for the Postgres to Iceberg pipeline on the dropoff_datetime column]

Click Next to continue.

5) Schedule the Job

Give the job a clear name, set Every Day @ 12:00 AM, and hit Create Job.

[Screenshot: OLake create job stream filter UI for the Postgres to Iceberg pipeline using the dropoff_datetime column and operators]

You're set! 🎉

[Screenshot: OLake job creation success dialog for the Postgres to Iceberg ETL pipeline]

Want results right away? Start a run immediately with Jobs → (⋮) → Sync Now.

[Screenshot: OLake jobs dashboard with actions menu for sync, edit streams, pause, logs, settings, delete]

You'll see status badges on the right (Running / Failed / Completed).
For more details, open Job Logs & History.

  • Running
    [Screenshot: OLake jobs dashboard showing an active job with status Running]

  • Completed
    [Screenshot: OLake jobs dashboard showing the job with status Completed]

Finally, verify that data landed in S3/Iceberg as configured:

[Screenshot: Amazon S3 browser showing parquet files in the dropoff_datetime_year=2011 partition folder]

6) Manage Your Job (from the Jobs page)

Sync Now: Trigger a run without waiting.

Edit Streams: Change which streams are included and tweak replication settings.
Use the stepper to jump between Source and Destination.

[Screenshot: OLake Postgres Iceberg job UI with stepper showing Job Config, Source, Destination, Streams steps]

By default, source/destination editing is locked. Click Edit to unlock.

[Screenshot: OLake destination config edit screen for the Postgres Iceberg job with AWS Glue write guide]

🔄 Need to change Partitioning / Filter / Normalization for an existing stream?
Unselect the stream → Save → reopen Edit Streams → re-add it with new settings.

Pause Job: Temporarily stop runs. You'll find paused jobs under Inactive Jobs, where you can Resume any time.

[Screenshot: OLake inactive jobs list with menu showing the resume job option for the Postgres Iceberg pipeline]

Job Logs & History: See all runs. Use View Logs for per-run details.

[Screenshot: OLake Postgres Iceberg job logs history screen showing a completed run and the View Logs action]

[Screenshot: OLake job logs screen displaying detailed execution logs for a Postgres to Iceberg sync job]

Job Settings: Rename, change frequency, pause, or delete.
Deleting a job moves its source/destination to inactive (if not used elsewhere).

[Screenshot: OLake job settings screen showing scheduling, pause, and delete options for the Postgres Iceberg job]

Option B: OLake CLI (Docker)

Prefer terminals, PR reviews, and repeatable runs? Let's do the same pipeline via Docker.

Prerequisites

  • Docker installed and running
  • OLake images: Docker Hub → olakego/*
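Before going further, a quick sanity check never hurts (the image tag is the one used throughout this post):

docker pull olakego/source-postgres:latest
docker version --format '{{.Server.Version}}'   # an error here means the daemon isn't running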

How the CLI flow works

  1. Configure source & destination (JSON files)
  2. Discover streams → writes a streams.json
  3. Edit stream configuration (normalization, filters, partitions, sync mode)
  4. Run the sync
  5. Monitor with stats.json

What we'll build

  • Source: Postgres
  • Destination: Apache Iceberg (Glue catalog)
  • Table: fivehundred
  • CDC mode + Normalization
  • Filter: dropoff_datetime >= "2010-01-01 00:00:00"
  • Partition by year from dropoff_datetime

1) Create Config Files

We'll put everything under /path/to/config/.

Source: source.json

source.json
{
  "host": "dz-stag.postgres.database.azure.com",
  "port": 5432,
  "database": "postgres",
  "username": "postgres",
  "password": "XXX",
  "jdbc_url_params": {},
  "ssl": { "mode": "require" },
  "update_method": {
    "replication_slot": "replication_slot",
    "initial_wait_time": 120
  },
  "default_mode": "cdc",
  "max_threads": 6
}
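If you'd like to verify the endpoint independently of OLake, a plain psql probe with the same values works (a minimal sketch; it will prompt for the password):

psql "host=dz-stag.postgres.database.azure.com port=5432 dbname=postgres user=postgres sslmode=require" \
  -c "SELECT version();"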

๐Ÿ“ If you plan to run CDC, ensure a Postgres replication slot exists. See: Replication Slot Guide.

Destination: destination.json

destination.json
{
  "type": "ICEBERG",
  "writer": {
    "iceberg_s3_path": "s3://vz-testing-olake/olake_cli_demo",
    "aws_region": "XXX",
    "aws_access_key": "XXX",
    "aws_secret_key": "XXX",
    "iceberg_db": "olake_cli_demo",
    "grpc_port": 50051,
    "sink_rpc_server_host": "localhost"
  }
}
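Assuming you have the AWS CLI configured with the same keys, two quick checks can catch credential or bucket problems before the sync does:

aws sts get-caller-identity        # do the keys resolve to the expected account?
aws s3 ls s3://vz-testing-olake/   # is the target bucket reachable?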

2) Discover Streams

This pulls available tables and writes streams.json.

docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  discover \
  --config /mnt/config/source.json

Start logs: [Screenshot: OLake sync job log showing CDC found, Iceberg writer batch settings, stream discovery, type warnings, and normalization steps]

Completion: [Screenshot: OLake log showing CDC sync catalog output with trip and fare columns, Postgres discovery completed, and cleanup in progress]

โ„น๏ธ Logs are also written to: /path/to/config/logs/sync_[YYYY-MM-DD]_[HH-MM-SS]/olake.log

3) Edit streams.json

Select exactly what to move and how.

  • Select streams → keep only fivehundred under "selected_streams".
  • Normalization → "normalization": true
  • Filter → "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
  • Partitioning → "partition_regex": "/{dropoff_datetime, year}"
  • Sync mode → set the stream's "sync_mode" to "cdc"

Minimal selection block

streams.json (selection)
{
  "selected_streams": {
    "public": [
      {
        "partition_regex": "/{dropoff_datetime, year}",
        "stream_name": "fivehundred",
        "normalization": true,
        "filter": "dropoff_datetime >= \"2010-01-01 00:00:00\""
      }
    ]
  }
}

Full stream entry (showing supported modes)

streams.json (stream detail)
{
  "streams": [
    {
      "stream": {
        "name": "fivehundred",
        "namespace": "public",
        "type_schema": {
          "properties": {
            "dropoff_datetime": { "type": ["timestamp", "null"] }
          }
        },
        "supported_sync_modes": [
          "strict_cdc",
          "full_refresh",
          "incremental",
          "cdc"
        ],
        "source_defined_primary_key": [],
        "available_cursor_fields": ["id", "pickup_datetime", "rate_code_id"],
        "sync_mode": "cdc"
      }
    }
  ]
}

📚 Need a refresher on how modes differ? Check out our documentation on sync modes.

4) Run the Sync

Kick off replication:

docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json

Sync start: [Screenshot: OLake sync job log showing CDC and Iceberg writer initialized, user-defined type warnings, and skipping of unselected streams]

Sync completed: [Screenshot: OLake job log showing Iceberg commit confirmed, server shutdown, sync completed, records read, and cleanup]

5) Monitor Progress with stats.json

A stats.json appears next to your configs:

stats.json
{
  "Estimated Remaining Time": "0.00 s",
  "Memory": "367 mb",
  "Running Threads": 0,
  "Seconds Elapsed": "34.01",
  "Speed": "14.70 rps",
  "Synced Records": 500
}
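The file is refreshed while the sync runs, so you can poll it from a second terminal; the jq line assumes jq is installed:

watch -n 5 cat /path/to/config/stats.json              # re-print the snapshot every 5s
jq -r '."Synced Records"' /path/to/config/stats.json   # or pull out a single field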

Confirm the data in your destination (S3 / Iceberg):

[Screenshot: Amazon S3 dropoff_datetime_year=2011 partition showing two parquet files from the OLake ETL pipeline]
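The same check from the terminal should show partition folders like dropoff_datetime_year=2011 holding parquet files:

aws s3 ls s3://vz-testing-olake/olake_cli_demo/ --recursive | head -20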

6) About the state.json (Resumable & CDC-friendly)

When a sync starts, OLake writes a state.json that tracks progress and CDC offsets (e.g., Postgres LSN). This lets you resume without duplicates and continue CDC seamlessly.

To resume / keep streaming:

docker run --pull=always \
  -v "/path/to/config:/mnt/config" \
  olakego/source-postgres:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json \
  --state /mnt/config/state.json

More details: Check out our Postgres connector documentation for state file configuration.
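One gap versus the UI flow: the CLI has no built-in scheduler, so the "Every Day @ 12:00 AM" cadence has to come from something like cron. A sketch (crontab entries are single lines; the log path is illustrative):

0 0 * * * docker run --pull=always -v "/path/to/config:/mnt/config" olakego/source-postgres:latest sync --config /mnt/config/source.json --catalog /mnt/config/streams.json --destination /mnt/config/destination.json --state /mnt/config/state.json >> /path/to/config/logs/cron.log 2>&1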


Quick Q&A

UI or CLI: how should I choose? If you're new to OLake or prefer a guided setup, start with the UI. If you're automating, versioning configs, or scripting in CI, use the CLI.

Why "Full Refresh + CDC"? You get a baseline snapshot and continuous changesโ€”ideal for keeping downstream analytics fresh.

Can I change partitioning later?

  • UI: unselect the stream → save → re-add with updated partitioning/filter/normalization.
  • CLI: edit streams.json and re-run.

OLake

Achieve 5x faster data replication to Lakehouse formats with OLake, our open-source platform for efficient, quick, and scalable big data ingestion for real-time analytics.

Contact us at hello@olake.io