Last updated: 6/27/2025

Getting Started with OLake for MongoDB

OLake helps you replicate data from MongoDB into local or S3-based data lakes using Parquet or Iceberg table formats. This tutorial walks you through every step of the setup—from creating necessary configuration files to running your first data sync.

info

OLake UI is live (beta)! You can now use the UI to configure your MongoDB source, discover streams, and sync data. See the OLake UI documentation for how to set it up with Docker Compose and run it locally.

Refer to MongoDB Connector documentation for more details.

TLDR:

Create a source.json with your MongoDB connection details.
Create a destination.json with your Writer (Apache Iceberg / AWS S3 / Azure ADLS / Google Cloud Storage) connection details.
Run discover to generate a streams.json of available streams.
Run sync to replicate data to your specified destination.

Introduction & Requirements

To use OLake, ensure you have:

Docker installed and running on your machine.
MongoDB credentials (hosts, replica set name, username/password if applicable).
- Docker Compose instructions to spin up MongoDB replica sets
- Need a sample dataset to ingest into MongoDB? Refer to Sample Datasets.
AWS S3 credentials (if you plan to write data to AWS S3).
Apache Iceberg and Catalog configuration credentials (if you plan to write data to Iceberg tables).

Refer here for more details on Writer requirements.

You will also need:

An empty directory to store OLake configuration files and outputs. This guide will refer to it as OLAKE_DIRECTORY.

note

To set up the project locally and debug your configs, follow the guide Setting up debugger in VS Code.

Step 1: Prepare Your Directory

Create a new directory on your local machine. Let’s call it OLAKE_DIRECTORY:
```
mkdir OLAKE_DIRECTORY
```
Inside this folder, create two files:
- destination.json: Specifies your output destination (local or S3).
- source.json: Contains connection settings for MongoDB (or other databases in the future).
```
cd OLAKE_DIRECTORY
touch destination.json
touch source.json
```

Folder Structure:

OLAKE_DIRECTORY/
  ├─ destination.json
  └─ source.json

1.1 Example `destination.json`

Refer to the Destination config section for details on each writer. The supported destinations are:

Destination	Supported	Docs	Comments
Apache Iceberg	Yes	Link
AWS S3	Yes	Link	Supports both plain-Parquet and Iceberg format writes; requires `aws_access_key` / IAM role.
Azure	Yes	Link
Google Cloud Storage	Yes	Link (Iceberg) Link (Parquet)	Any S3 protocol compliant object store can work with OLake
Local Filesystem	Yes	Link

1.2 Example `source.json` (MongoDB)

Below is a sample source.json for connecting to a MongoDB replica set. Customize each field to match your environment.

OLAKE_DIRECTORY/source.json
{
  "hosts": [
    "host1:27017",
    "host2:27017",
    "host3:27017"
  ],
  "username": "username",
  "password": "password",
  "authdb": "admin",
  "replica_set": "rs0",
  "read_preference": "secondaryPreferred",
  "srv": false,
  "database": "database",
  "max_threads": 5,
  "backoff_retry_count": 4,
  "chunking_strategy":""
}

Description of above parameters

Refer to source configuration for more details on source.json.
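
Before running any OLake command, it can help to sanity-check source.json. The following is a minimal Python sketch, not part of OLake: the key names come from the sample above, but treating all of them as required, and the `check_source_config` helper itself, are assumptions made for illustration.

```python
import json

# Keys and types taken from the sample source.json in this guide.
# Treating every one of them as required is an assumption of this sketch.
EXPECTED_KEYS = {
    "hosts": list,
    "username": str,
    "password": str,
    "authdb": str,
    "replica_set": str,
    "read_preference": str,
    "srv": bool,
    "database": str,
    "max_threads": int,
    "backoff_retry_count": int,
}


def check_source_config(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    # Each host in the sample is written as "hostname:port".
    hosts = config.get("hosts")
    if isinstance(hosts, list):
        for host in hosts:
            if not isinstance(host, str) or ":" not in host:
                problems.append(f"host entry does not look like host:port: {host!r}")
    return problems


# The sample config from this guide, parsed from a JSON string.
sample = json.loads("""
{
  "hosts": ["host1:27017", "host2:27017", "host3:27017"],
  "username": "username",
  "password": "password",
  "authdb": "admin",
  "replica_set": "rs0",
  "read_preference": "secondaryPreferred",
  "srv": false,
  "database": "database",
  "max_threads": 5,
  "backoff_retry_count": 4,
  "chunking_strategy": ""
}
""")

print(check_source_config(sample))
```

To check a real file, load it with `json.load(open("source.json"))` and pass the result to the same function.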

Step 2: Generate a Streams File

OLake needs to discover which collections (streams) exist in your MongoDB. This step will create a streams.json listing available streams, schemas, and default sync modes.

Open your terminal in the same directory (say OLAKE_DIRECTORY) containing source.json and destination.json.
Run the discover command using Docker:

OLake Docker

macOS / Linux

docker run --pull=always \
  -v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \
  olakego/source-mongodb:latest \
  discover \
  --config /mnt/config/source.json

CMD

docker run --pull=always ^
  -v "%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY:/mnt/config" ^
  olakego/source-mongodb:latest ^
  discover ^
  --config /mnt/config/source.json

PowerShell

docker run --pull=always `
  -v "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY:/mnt/config" `
  olakego/source-mongodb:latest `
  discover `
  --config /mnt/config/source.json

Locally run OLake

macOS / Linux

OLAKE_BASE_PATH="$HOME/PATH_TO_OLAKE_DIRECTORY/olake/drivers/mongodb/config" && \
./build.sh driver-mongodb discover \
  --config "$OLAKE_BASE_PATH/source.json"

CMD

set "OLAKE_BASE_PATH=%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY\olake\drivers\mongodb\config"
./build.sh driver-mongodb discover ^
  --config "%OLAKE_BASE_PATH%\source.json"

PowerShell

$OLAKE_BASE_PATH = "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY\olake\drivers\mongodb\config"; `
./build.sh driver-mongodb discover `
  --config "$OLAKE_BASE_PATH\source.json"

info

PATH_TO_OLAKE_DIRECTORY is the path to the directory you created in Step 1, relative to your home directory. For example, on macOS and Linux, -v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \ might expand to -v /Users/JOHN_DOE_USERNAME/Desktop/projects/OLAKE_DIRECTORY:/mnt/config \. Follow the same pattern on other systems.

Flag/Parameter	Description
`discover`	The OLake sub-command that scans MongoDB schemas.
`--config /mnt/config/source.json`	Tells OLake where to find your MongoDB connection details.
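
If you script these commands, the volume mapping can be built programmatically. The sketch below assembles the same `docker run` arguments used in this guide; the `olake_docker_args` helper is hypothetical and only illustrates how the host directory maps to /mnt/config inside the container.

```python
from pathlib import Path

IMAGE = "olakego/source-mongodb:latest"  # image name used in this guide
MOUNT_POINT = "/mnt/config"              # container path used in this guide


def olake_docker_args(olake_dir, subcommand, flag_files):
    """Build the argument list for `docker run`.

    flag_files maps a flag name (e.g. "config") to a file name inside olake_dir.
    """
    # Docker needs an absolute host path on the left side of the -v mapping.
    host_dir = Path(olake_dir).resolve()
    args = [
        "docker", "run", "--pull=always",
        "-v", f"{host_dir}:{MOUNT_POINT}",
        IMAGE,
        subcommand,
    ]
    for flag, file_name in flag_files.items():
        # Files are referenced by their path inside the container.
        args += [f"--{flag}", f"{MOUNT_POINT}/{file_name}"]
    return args


args = olake_docker_args(
    "/Users/JOHN_DOE_USERNAME/Desktop/projects/OLAKE_DIRECTORY",
    "discover",
    {"config": "source.json"},
)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` to execute the command.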

2.1 Understanding the `streams.json` File

After running discover, OLake generates streams.json in OLAKE_DIRECTORY with entries like:

OLAKE_DIRECTORY/streams.json
{
    "selected_streams": {
        "otter_db": [
            {
                "partition_regex": "{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
                "stream_name": "stream_0",
                "normalization": false,
                "append_only": false,
                "chunk_column": ""          // column name to be specified
            },
            {
                "partition_regex": "{,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
                "stream_name": "stream_8",
                "normalization": false,
                "append_only": false,
                "chunk_column": ""          // column name to be specified
            }
        ]
    },
    "streams": [
        {
            "stream": {
                "name": "stream_8",
                "namespace": "otter_db",
                "type_schema": { ... },
                "supported_sync_modes": [
                    "full_refresh",
                    "cdc"
                ],
                "source_defined_primary_key": [
                    "_id"
                ],
                "available_cursor_fields": [],
                "sync_mode": "cdc"
            }
        },
    // ... other streams
  ]
}

info

chunk_column is not yet supported for MongoDB source.

selected_streams: The streams / tables / collections OLake will replicate.
streams: Metadata for each discovered collection, including schemas and sync modes (e.g., cdc, full_refresh).
partition_regex: The pattern used to partition the output data. For more details, refer to the S3 docs.
normalization: If set to true, OLake will flatten nested JSON structures (Level 1 flattening).
append_only: Determines whether records can be written to the Iceberg delete file. If set to true, no records are written to the delete file. To learn more about delete files, see Iceberg MOR and COW.

tip

Exclude Streams: You can remove unneeded collections by editing selected_streams directly. For instance, delete the "customer" entry if you only want to sync "order".

Before (including customers):

OLAKE_DIRECTORY/streams.json
"selected_streams": {
        "otter_db": [
            {
                "stream_name": "order",
                "partition_regex": "",
                "normalization": false,
                "append_only": false,
                "chunk_column": ""          // column name to be specified
            },
            {
                "stream_name": "customer",
                "partition_regex": "",
                "normalization": false,
                "append_only": false,
                "chunk_column": ""          // column name to be specified
            }
        ]
    },

After (to exclude customers):

OLAKE_DIRECTORY/streams.json
"selected_streams": {
        "otter_db": [
            {
                "stream_name": "order",
                "partition_regex": "",
                "normalization": false,
                "append_only": false,
                "chunk_column": ""          // column name to be specified
            }
        ]
    },
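
The same edit can be scripted. The following is a minimal Python sketch, assuming streams.json has been loaded as plain JSON (without the `//` annotations shown in the examples above); the `exclude_stream` helper is hypothetical, not part of OLake.

```python
import copy


def exclude_stream(catalog, namespace, stream_name):
    """Return a copy of the catalog without the given stream in selected_streams."""
    updated = copy.deepcopy(catalog)
    selected = updated.get("selected_streams", {})
    if namespace in selected:
        # Keep every entry except the one whose stream_name matches.
        selected[namespace] = [
            entry for entry in selected[namespace]
            if entry.get("stream_name") != stream_name
        ]
    return updated


# In-memory version of the "Before" example above.
catalog = {
    "selected_streams": {
        "otter_db": [
            {"stream_name": "order", "partition_regex": "", "normalization": False,
             "append_only": False, "chunk_column": ""},
            {"stream_name": "customer", "partition_regex": "", "normalization": False,
             "append_only": False, "chunk_column": ""},
        ]
    }
}

result = exclude_stream(catalog, "otter_db", "customer")
print([entry["stream_name"] for entry in result["selected_streams"]["otter_db"]])
```

For a real file, read it with `json.load`, apply the function, and write the result back with `json.dump`.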

Step 3: Run Your First Data Sync

Now that you have streams.json, it’s time to sync data from MongoDB to your specified destination (local or S3).

OLake Docker

macOS / Linux

docker run --pull=always \
  -v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \
  olakego/source-mongodb:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json

CMD

docker run --pull=always ^
  -v "%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY:/mnt/config" ^
  olakego/source-mongodb:latest ^
  sync ^
  --config /mnt/config/source.json ^
  --catalog /mnt/config/streams.json ^
  --destination /mnt/config/destination.json

PowerShell

docker run --pull=always `
  -v "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY:/mnt/config" `
  olakego/source-mongodb:latest `
  sync `
  --config /mnt/config/source.json `
  --catalog /mnt/config/streams.json `
  --destination /mnt/config/destination.json

Locally run OLake

macOS / Linux

OLAKE_BASE_PATH="$HOME/PATH_TO_OLAKE_DIRECTORY/olake/drivers/mongodb/config" && \
./build.sh driver-mongodb sync \
  --config "$OLAKE_BASE_PATH/source.json" \
  --catalog "$OLAKE_BASE_PATH/streams.json" \
  --destination "$OLAKE_BASE_PATH/destination.json"

CMD

set "OLAKE_BASE_PATH=%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY\olake\drivers\mongodb\config"
./build.sh driver-mongodb sync ^
  --config "%OLAKE_BASE_PATH%\source.json" ^
  --catalog "%OLAKE_BASE_PATH%\streams.json" ^
  --destination "%OLAKE_BASE_PATH%\destination.json"

PowerShell

$OLAKE_BASE_PATH = "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY\olake\drivers\mongodb\config"; `
./build.sh driver-mongodb sync `
  --config "$OLAKE_BASE_PATH\source.json" `
  --catalog "$OLAKE_BASE_PATH\streams.json" `
  --destination "$OLAKE_BASE_PATH\destination.json"

Flag/Parameter	Description
`sync`	The OLake sub-command that runs a data replication (snapshot + CDC).
`--config /mnt/config/source.json`	MongoDB connection settings.
`--catalog /mnt/config/streams.json`	The file detailing which streams OLake will replicate.
`--destination /mnt/config/destination.json`	The output configuration file (local or S3).

This command performs both the initial snapshot and the subsequent CDC (change data capture) sync.
If you only want a full one-time snapshot, set the stream’s sync_mode to "full_refresh" in streams.json.

info

Example: If your sync_mode is "cdc", OLake will:

Do a one-time full snapshot of each selected collection.
Automatically begin listening to MongoDB’s oplog for near real-time changes.
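
Switching a stream between "cdc" and "full_refresh" means editing its sync_mode under streams in streams.json. A minimal Python sketch of that edit follows, assuming the catalog has the structure shown in section 2.1; the `set_sync_mode` helper is hypothetical, not part of OLake.

```python
def set_sync_mode(catalog, namespace, name, mode):
    """Set sync_mode for one stream in place. Return True if the stream was found."""
    for item in catalog.get("streams", []):
        stream = item.get("stream", {})
        if stream.get("namespace") == namespace and stream.get("name") == name:
            # Only accept modes the stream itself lists as supported.
            if mode not in stream.get("supported_sync_modes", []):
                raise ValueError(f"{mode!r} is not supported by {namespace}.{name}")
            stream["sync_mode"] = mode
            return True
    return False


# In-memory excerpt of the streams.json example from section 2.1.
catalog = {
    "streams": [
        {
            "stream": {
                "name": "stream_8",
                "namespace": "otter_db",
                "supported_sync_modes": ["full_refresh", "cdc"],
                "sync_mode": "cdc",
            }
        }
    ]
}

changed = set_sync_mode(catalog, "otter_db", "stream_8", "full_refresh")
print(changed, catalog["streams"][0]["stream"]["sync_mode"])
```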

When the sync finishes, you should see new files either:

Locally (in the volume-mapped directory).
On S3 (inside the specified s3_path).
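
For a local destination, a quick way to confirm that files were written is to list them. The sketch below assumes the output files carry a .parquet extension; the directory layout in the demo (otter_db/stream_0) is made up for illustration and may not match what OLake produces.

```python
import tempfile
from pathlib import Path


def find_parquet_files(output_dir):
    """Return all *.parquet files under output_dir, sorted by path."""
    return sorted(Path(output_dir).rglob("*.parquet"))


# Demo on a throwaway directory with a hypothetical layout.
with tempfile.TemporaryDirectory() as tmp:
    stream_dir = Path(tmp) / "otter_db" / "stream_0"
    stream_dir.mkdir(parents=True)
    (stream_dir / "part-0.parquet").touch()
    (stream_dir / "notes.txt").touch()  # ignored: not a Parquet file
    found = [p.relative_to(tmp).as_posix() for p in find_parquet_files(tmp)]
    print(found)
```

Point `find_parquet_files` at your OLAKE_DIRECTORY (or the local_path you configured) after a sync.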

Step 3.1 Synced Data

If you are using VS Code, install a Parquet reader extension to view the contents of the Parquet files created by the sync.

[Image: First Data Sync JSON]

Step 3.2 Synced Data Normalized

If you have turned on "normalization": true in streams.json for your streams, expect the Level 1 flattening of JSON data shown below.

Read more about JSON flattening here - Flatten Object Types and Query Arrays in Semi-Structured Nested JSON
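
As a rough illustration of the idea, the Python sketch below flattens a document one level deep. It assumes that Level 1 flattening means top-level fields become columns while nested objects and arrays are kept as serialized JSON strings; this is an interpretation for illustration, not a description of OLake's actual implementation, and the sample document is made up.

```python
import json


def flatten_level_1(document):
    """Turn top-level keys into columns; keep nested values as JSON strings."""
    row = {}
    for key, value in document.items():
        if isinstance(value, (dict, list)):
            # Nested structures are not expanded further (assumption).
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


doc = {
    "_id": "a1",
    "name": "otter",
    "address": {"city": "Pune", "zip": "411001"},
    "tags": ["x", "y"],
}

for column, value in flatten_level_1(doc).items():
    print(column, "=", value)
```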

Running the sync command with normalization turned on

[Image: JSON Normalized logs]

Output Data Dump

[Image: First Data Sync JSON Normalized]

Step 3.3 Change output directory

If you need to write the Parquet output to a different location, set local_path in destination.json to a subdirectory under /mnt/config, for example /mnt/config/my_directory:

OLAKE_DIRECTORY/destination.json
{
  "type": "PARQUET",
  "writer": {
    "local_path": "/mnt/config/my_directory"
  }
}

Here, /mnt/config is the path inside the container that OLAKE_DIRECTORY is mounted to, so the output appears in OLAKE_DIRECTORY/my_directory on your machine.
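
To find where a container path ends up on your machine, translate it through the volume mapping. The following Python sketch assumes the -v mapping used throughout this guide (OLAKE_DIRECTORY mounted at /mnt/config); the `host_path_for` helper and the example host path are hypothetical.

```python
from pathlib import Path, PurePosixPath

MOUNT_POINT = PurePosixPath("/mnt/config")  # container side of the -v mapping


def host_path_for(container_path, olake_dir):
    """Map a path inside the container to the matching path on the host.

    Raises ValueError if container_path is not under the mount point.
    """
    relative = PurePosixPath(container_path).relative_to(MOUNT_POINT)
    return Path(olake_dir).joinpath(*relative.parts)


host = host_path_for("/mnt/config/my_directory", "/home/me/OLAKE_DIRECTORY")
print(host)
```

A local_path outside /mnt/config raises ValueError here, which mirrors the fact that such a path is not covered by the volume mapping and would not be visible on the host.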

Step 4: Resume Sync with a State File

If a sync is interrupted or you need to resume from a previous checkpoint, OLake automatically saves progress in a state.json file. Use the --state parameter to continue from that point:

OLake Docker

macOS / Linux

docker run --pull=always \
  -v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \
  olakego/source-mongodb:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json \
  --state /mnt/config/state.json

CMD

docker run --pull=always ^
  -v "%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY:/mnt/config" ^
  olakego/source-mongodb:latest ^
  sync ^
  --config /mnt/config/source.json ^
  --catalog /mnt/config/streams.json ^
  --destination /mnt/config/destination.json ^
  --state /mnt/config/state.json

PowerShell

docker run --pull=always `
  -v "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY:/mnt/config" `
  olakego/source-mongodb:latest `
  sync `
  --config /mnt/config/source.json `
  --catalog /mnt/config/streams.json `
  --destination /mnt/config/destination.json `
  --state /mnt/config/state.json

Locally run OLake

macOS / Linux

OLAKE_BASE_PATH="$HOME/PATH_TO_OLAKE_DIRECTORY/olake/drivers/mongodb/config" && \
./build.sh driver-mongodb sync \
  --config "$OLAKE_BASE_PATH/source.json" \
  --catalog "$OLAKE_BASE_PATH/streams.json" \
  --destination "$OLAKE_BASE_PATH/destination.json" \
  --state "$OLAKE_BASE_PATH/state.json"

CMD

set "OLAKE_BASE_PATH=%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY\olake\drivers\mongodb\config"
./build.sh driver-mongodb sync ^
  --config "%OLAKE_BASE_PATH%\source.json" ^
  --catalog "%OLAKE_BASE_PATH%\streams.json" ^
  --destination "%OLAKE_BASE_PATH%\destination.json" ^
  --state "%OLAKE_BASE_PATH%\state.json"

PowerShell

$OLAKE_BASE_PATH = "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY\olake\drivers\mongodb\config"; `
./build.sh driver-mongodb sync `
  --config "$OLAKE_BASE_PATH\source.json" `
  --catalog "$OLAKE_BASE_PATH\streams.json" `
  --destination "$OLAKE_BASE_PATH\destination.json" `
  --state "$OLAKE_BASE_PATH\state.json"

Flag/Parameter	Description
`--state /mnt/config/state.json`	Points OLake to an existing state file.

state.json typically includes a resume token (for MongoDB) or an offset for other databases, ensuring OLake does not reprocess records it has already synced.

A typical state.json file has the following structure:

state.json
{
  "type": "STREAM",
  "streams": [
    {
      "stream": "stream_8",
      "namespace": "otter_db",
      "sync_mode": "",
      "state": {
        "_data": "8267B34D61000000022B0429296E1404"
      }
    },
    {
      "stream": "stream_0",
      "namespace": "otter_db",
      "sync_mode": "",
      "state": {
        "_data": "8267B34D61000000022B0429296E1404"
      }
    }
  ]
}

In this example, "_data": "8267B34D6..." is a MongoDB resume token that tells OLake where to pick up the CDC stream.

For more details on the state.json configuration, refer to the state docs.
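
To inspect a state file before resuming, you can list the resume token recorded for each stream. The Python sketch below assumes the state.json structure shown in the example above; the `resume_tokens` helper is hypothetical, not part of OLake.

```python
def resume_tokens(state):
    """Map "namespace.stream" to the resume token stored for that stream."""
    tokens = {}
    for entry in state.get("streams", []):
        key = f'{entry["namespace"]}.{entry["stream"]}'
        # "_data" holds the MongoDB resume token; it may be absent for a new stream.
        tokens[key] = entry.get("state", {}).get("_data")
    return tokens


# In-memory copy of the state.json example above.
state = {
    "type": "STREAM",
    "streams": [
        {"stream": "stream_8", "namespace": "otter_db", "sync_mode": "",
         "state": {"_data": "8267B34D61000000022B0429296E1404"}},
        {"stream": "stream_0", "namespace": "otter_db", "sync_mode": "",
         "state": {"_data": "8267B34D61000000022B0429296E1404"}},
    ],
}

for name, token in resume_tokens(state).items():
    print(name, token)
```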

Debugging

Follow the debugging instructions in this guide - Setting up debugger in VS Code

Docker Commands & Flags

Refer to the Docker Commands & Flags documentation for more information.

Next Steps & Wrap-Up

Check Your Output: Verify your Parquet files (or Iceberg tables) were created either locally or in your S3 bucket.
Explore Schema Evolution: If your MongoDB documents gain new fields, OLake can adapt automatically. Watch for updated schemas in subsequent runs.
Try More Destinations: OLake can also write to Iceberg on S3 (and more in the future). Update your destination config as needed.
Analytics & Querying: Connect your newly created Parquet/Iceberg data to engines like Trino, Spark, or Presto for powerful querying.

Congratulations! You’ve completed your first OLake data replication. If you encounter any issues or have feedback, please visit our GitHub repository to open an issue or contribute.

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!

Join our growing community

GitHub

Slack

Twitter

LinkedIn

YouTube
