Postgres → Iceberg → Doris: A Smooth Lakehouse Journey Powered by Olake
If you've been working with data lakes, you've probably felt the friction of keeping your analytics engine separate from your storage layer. With your data neatly sitting in Iceberg, the next challenge is querying it efficiently without moving it around.
That's a pretty fair reason to bring Doris in.
Building a modern data lakehouse shouldn't require stitching together a dozen tools or writing complex Spark jobs. In this guide, I'll show you how to create a complete, production-ready lakehouse architecture that:
- Captures real-time changes from PostgreSQL using CDC (Change Data Capture)
- Stores data in open Apache Iceberg format on object storage
- Queries data at lightning speed with Apache Doris
- All orchestrated seamlessly by OLake
By the end, you'll have a running system that syncs database changes in real-time and lets you query — without moving or duplicating data.
So, what is Apache Doris?
Apache Doris is a real-time analytical database built on MPP (Massively Parallel Processing) architecture, designed to handle complex analytical queries at scale — often delivering sub-second query latency, even on large datasets.
That's a lot of technical jargon, isn't it? Here's what it means in simple terms:
"A fast, intelligent query engine that lets you analyze your Iceberg tables directly, without having to move or duplicate your data anywhere else."
Why Doris for Your Lakehouse?
At its core, Doris combines three powerful execution capabilities:
Vectorized Execution Engine: Unlike traditional row-by-row processing, Doris processes data in batches (vectors), allowing it to leverage modern CPU capabilities like SIMD (Single Instruction, Multiple Data) instructions. This translates to faster query execution on the same hardware.
Pipeline Execution Model: Doris breaks down complex queries into pipeline stages that can execute in parallel across multiple cores and machines. Think of it like an assembly line where each stage processes data simultaneously, rather than waiting for the previous step to complete entirely.
Advanced Query Optimizer: The query optimizer automatically rewrites your SQL queries to find the most efficient execution plan. It handles complex operations like multi-table joins, aggregations, and sorting without you having to manually optimize your queries.
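If you want to see what the optimizer decides, Doris's EXPLAIN statement prints the chosen plan. A quick illustration, using hypothetical orders and customers tables:
-- orders and customers are placeholder tables for illustration
EXPLAIN
SELECT c.region, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= '2024-01-01'
GROUP BY c.region;
The plan output surfaces details like join order and predicate pushdown, without you changing the query itself.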
At the time of writing, Apache Doris is at version 4.0.0, which introduces comprehensive support for Apache Iceberg's core features.
What Doris brings to your Iceberg tables:
- Universal Catalog Support: Works with all major Iceberg catalog types — REST, AWS Glue, Hive Metastore, Hadoop, Google Dataproc Metastore, and DLF.
- Full Delete File Support: Reads both Equality Delete Files and Positional Delete Files, which is crucial for CDC workloads where updates and deletes happen frequently.
- Time Travel Queries: Query historical snapshots of your Iceberg tables to see how data looked at any point in time (see the examples after this list).
- Snapshot History: Access complete snapshot metadata via table functions to understand data evolution.
- Transparent Query Routing: Doris automatically routes queries to materialized views when available, accelerating common query patterns without changing your SQL.
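To make the time travel and snapshot features concrete, here's roughly what they look like in Doris SQL. The table name, timestamp, and snapshot ID below are placeholders:
-- Query the table as of a point in time (placeholder timestamp):
SELECT * FROM iceberg.db.orders FOR TIME AS OF '2024-06-01 10:00:00';
-- Or pin an exact snapshot by ID (placeholder ID):
SELECT * FROM iceberg.db.orders FOR VERSION AS OF 123456789;
-- List the table's snapshot history via the iceberg_meta table function:
SELECT * FROM iceberg_meta("table" = "iceberg.db.orders", "query_type" = "snapshots");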
The Data Pipeline: How the Pieces Fit Together
Let's understand the complete data flow from your operational database to real-time analytics.
The Architecture

Here's how data flows through our lakehouse stack:
- Source Database (PostgreSQL): Your operational database continues running normally, handling transactional workloads.
- OLake CDC Engine: Captures changes from PostgreSQL using logical replication and writes them directly to Apache Iceberg format.
- Apache Iceberg Tables: Your data lands in Iceberg tables stored in object storage (MinIO/S3), maintaining full ACID guarantees with snapshot isolation.
- REST Catalog: Tracks the current state of your Iceberg tables, managing metadata pointers so query engines always read the latest consistent snapshot.
- Apache Doris: Queries your Iceberg tables directly from object storage, delivering sub-second analytics without moving data.
Why This Stack?
No Data Duplication: Unlike traditional ETL pipelines that copy data multiple times, your source data is written once to Iceberg and queried directly by Doris.
Real-Time Insights: Changes in PostgreSQL appear in your analytics as they happen; OLake's CDC sync captures inserts, updates, and deletes in real time.
Cost-Effective Storage: Object storage (S3/MinIO) costs a fraction of traditional data warehouse storage, while Iceberg's efficient metadata handling keeps query performance high.
Decoupled Compute and Storage: Scale your query engine (Doris) independently from storage. Need more query power? Add Doris nodes. Need more storage? Just expand your object store.
About OLake
OLake is an open-source CDC tool specifically built for lakehouse architectures. It supports these sources: PostgreSQL, MySQL, MongoDB, Oracle, and Kafka. You can check out our official documentation for detailed source configurations.
What makes OLake different? It writes directly to Apache Iceberg format with proper metadata management, schema evolution support, and automatic handling of CDC operations (inserts, updates, deletes). No need for complex Spark jobs or Kafka pipelines — OLake handles the entire ingestion flow. We support all major Iceberg catalogs.
Our Demo Setup
For this tutorial, we're using:
- tabulario/iceberg-rest: A lightweight REST catalog implementation
- MinIO: S3-compatible object storage that runs locally
- Apache Doris: Hosted on a cloud instance for remote querying
This setup mirrors what Apache Doris recommends in their official lakehouse documentation.
Prerequisites
Before we dive into the setup:
- A cloud instance (AWS EC2, Azure VM, or GCP Compute Engine) with SSH access
- Docker installed
- At least 4GB RAM and 20GB disk space
- Basic familiarity with Linux terminal commands
Let's get started.
Step 1 – Start your REST Catalog + MinIO + Doris
Configure System Parameters
First, we need to configure a critical Linux kernel parameter. Apache Doris uses memory-mapped files extensively for its storage engine, and the default limit is too low.
sudo sysctl -w vm.max_map_count=2000000
This sets the maximum number of memory map areas a process can have. Without this, Doris Backend (BE) nodes will fail to start with memory allocation errors.
Make it permanent across reboots:
echo "vm.max_map_count=2000000" >> /etc/sysctl.conf
Verify the setting:
sysctl vm.max_map_count
You should see vm.max_map_count = 2000000.
Deploy the Lakehouse Stack
Apache Doris provides a convenient docker-compose setup that includes everything we need:
git clone https://github.com/apache/doris.git
Navigate to the lakehouse sample directory:
cd doris/samples/datalake/iceberg_and_paimon
This directory contains a complete lakehouse stack with the following services:
- Apache Doris: MPP query engine with Frontend (FE) and Backend (BE) nodes for analytics
- Iceberg REST Catalog: Manages table metadata and schema evolution
- MinIO: S3-compatible object storage for Iceberg table data files
- MinIO Client (mc): Automatically initializes buckets and sets permissions
- Sample configurations: Pre-configured with connectors and initialization scripts
Start all services:
bash ./start_all.sh
This script will take some time to execute. It will:
- Pull required Docker images (first run takes 5-10 minutes depending on your connection)
- Start MinIO and create necessary buckets
- Initialize the Iceberg REST catalog
- Start Doris Frontend and Backend nodes
- Wait for all services to be healthy
You'll see output as each service starts. Wait until you see:

Access the Doris CLI
bash start_doris_client.sh
This opens the Doris SQL client, and you can now run queries against your lakehouse. We'll use this later to query the Iceberg tables.
Step 2 – Set Up OLake for CDC Ingestion
Now we'll configure OLake to capture changes from your PostgreSQL database and write them to Iceberg tables.
Start OLake UI
Open a new terminal session on your cloud instance and deploy OLake:
curl -sSL https://raw.githubusercontent.com/datazip-inc/olake-ui/master/docker-compose.yml | docker compose -f - up -d
This starts the OLake UI and backend services. OLake runs on port 8000, but since it's on your remote cloud instance, you'll need to access it from your local machine.
Set Up SSH Port Forwarding
To access both OLake UI and MinIO from your local browser, create SSH tunnels. Run this command on your local machine (not on the cloud instance):
ssh -L 8000:localhost:8000 -L 19002:localhost:19002 olake-server
What's happening here?
- -L 8000:localhost:8000: Forwards local port 8000 to the instance's port 8000 (OLake UI)
- -L 19002:localhost:19002: Forwards local port 19002 to the instance's port 19002 (MinIO)
If you haven't configured an SSH alias, add this to your local ~/.ssh/config:
Host olake-server
HostName <YOUR_INSTANCE_PUBLIC_IP>
User azureuser
IdentityFile <PATH_TO_YOUR_PEM_FILE>
Replace the values with your instance details. This lets you use olake-server instead of typing the full SSH command each time.
Access the Web Interfaces
With port forwarding active, you can now access both services from your local browser:
- OLake UI: http://localhost:8000
- MinIO Console: http://localhost:19002
Since the SSH tunnels forward these ports to your local machine, you access them via localhost rather than the instance's public IP.
Prepare MinIO Storage
MinIO needs a bucket to store Iceberg table data:
- Open MinIO at http://localhost:19002
- Login with default credentials (typically admin/password)
- Create a new bucket named warehouse
This bucket will hold all your Iceberg table data files (Parquet) and metadata (JSON/Avro).
Configure OLake Job
Now let's configure OLake to sync data from your source database to Iceberg.
Create Source Connection:
- In OLake UI, navigate to Sources → Create Source
- Choose your source database type (PostgreSQL, MySQL, MongoDB, etc.)
- Enter connection details:
- Host, port, database name
- Username and password
- For PostgreSQL CDC: Enable logical replication and provide publication/slot names, etc.
You can follow the detailed doc for creating a Postgres source connection.
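For reference, the Postgres-side CDC prerequisites boil down to something like the following, run as a superuser. The publication and slot names are examples, and you should confirm the slot plugin OLake expects in the source setup docs:
-- Logical decoding requires wal_level = logical (needs a Postgres restart):
ALTER SYSTEM SET wal_level = logical;
-- Create a publication for the tables you want to sync (example name):
CREATE PUBLICATION olake_pub FOR ALL TABLES;
-- Create a logical replication slot (example name; verify the plugin):
SELECT pg_create_logical_replication_slot('olake_slot', 'pgoutput');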
Create Destination (Iceberg):
Set up the config like this:

For any other details, you can check out our official documentation.
Create and Run Job:
- Navigate to Jobs → Create Job
- Select your source and destination
- Choose tables/collections to sync
- Select sync mode:
- Full Refresh + CDC: Initial snapshot followed by real-time changes
- CDC Only: Stream only new changes
- Start the sync
You can check out our official documentation for a detailed workflow from creating your job pipeline to managing ongoing sync operations.
Step 3 – Query Your Iceberg Tables with Doris
With OLake continuously syncing data to Iceberg, it's time to query that data using Apache Doris. Let's explore your lakehouse!
Connect to Doris
Now, in the Doris CLI we opened in Step 1, run the following commands.
List all available catalogs in Doris:
SHOW CATALOGS;
You should see the iceberg catalog listed. This catalog was pre-configured in the docker-compose setup to point to the REST catalog.
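For reference, such a catalog is created with a statement along these lines. The demo has already done this for you; the endpoint and credential values below are assumptions matching the sample stack:
CREATE CATALOG iceberg PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "uri" = "http://rest:8181",
    "warehouse" = "s3://warehouse/",
    "s3.endpoint" = "http://minio:9000",
    "s3.access_key" = "admin",
    "s3.secret_key" = "password",
    "s3.region" = "us-east-1"
);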
Switch to Iceberg Catalog
Set the Iceberg catalog as your active context:
SWITCH iceberg;
Now all queries will run against Iceberg tables unless you explicitly specify another catalog.
Refresh Catalog Metadata
The Iceberg catalog might not immediately reflect newly created tables. Refresh it to pull the latest metadata:
REFRESH CATALOG iceberg;
Why refresh? Doris caches catalog metadata for performance. When OLake creates new tables or updates schemas, refreshing ensures Doris sees the latest state.
Explore Your Data
List all databases (namespaces in Iceberg terminology):
SHOW DATABASES;
Switch to your database:
USE <database_name>;
List all tables:
SHOW TABLES;
Query Your Synced Data
Now run a simple query:
SELECT * FROM iceberg.<database_name>.<table_name> LIMIT 10;

What you're seeing: Data from your source database, stored in Iceberg format, queried through Doris's MPP engine. No data movement, no duplication — just direct querying from object storage.
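From here, anything Doris can do applies to these tables. For instance, a hypothetical aggregation (substitute a real column name) runs in parallel across BE nodes:
-- <column_name> is a placeholder; Doris parallelizes the scan and aggregation:
SELECT <column_name>, COUNT(*) AS row_count
FROM iceberg.<database_name>.<table_name>
GROUP BY <column_name>
ORDER BY row_count DESC
LIMIT 10;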
Troubleshooting
ERROR 1105 (HY000): errCode = 2, detailMessage = There is no scanNode Backend available.[10002: not alive]
This means no Backend node is available to scan data, usually because vm.max_map_count was not set. Set it to 2000000 as root:
sudo sysctl -w vm.max_map_count=2000000
Then restart your Doris BE and rerun your table query; it should work fine.
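You can also verify BE liveness from the Doris CLI before rerunning the query:
-- The Alive column should read true for every backend:
SHOW BACKENDS;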
Happy Engineering! Happy Iceberg!
