
Getting Started with OLake (Docker)

OLake helps you replicate data from MongoDB (and in the future, other databases) into local or S3-based data lakes using Parquet or Iceberg table formats. This tutorial walks you through every step of the setup—from creating necessary configuration files to running your first data sync.

Find the OLake Docker image here: olakego/source-mongodb

info

Please join the #contributing-to-olake channel in the OLake Slack community for all information regarding contributions. Join the OLake Community Slack here.

Introduction & Requirements

To use OLake with Docker, you will need:

  • An empty directory to store OLake configuration files and outputs. This guide will refer to it as olake_folder.
note

To set up the project locally on your system and debug the configs you create, follow this guide: Setting up debugger in VS Code

Step 1: Prepare Your Directory

  1. Create a new directory on your local machine. Let’s call it olake_folder:

    mkdir olake_folder
  2. Inside this folder, create two files:

    • writer.json: Specifies your output destination (local or S3).
    • config.json: Contains connection settings for MongoDB (or other databases in the future).
    cd olake_folder
    touch writer.json
    touch config.json

Folder Structure:

olake_folder/
├─ writer.json
└─ config.json
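
If you prefer, the folder and both files can be created and pre-filled in one step with shell heredocs. A sketch; the writer values mirror section 1.1, and config.json is only a placeholder to be filled in from section 1.3:

```shell
# Create the folder and both config files in one go.
mkdir -p olake_folder

# Minimal local-Parquet writer config (see section 1.1).
cat > olake_folder/writer.json <<'EOF'
{
  "type": "PARQUET",
  "writer": {
    "normalization": false,
    "local_path": "/mnt/config"
  }
}
EOF

# Connection placeholder; fill in the real values from section 1.3.
cat > olake_folder/config.json <<'EOF'
{
  "hosts": ["host1:27017"],
  "username": "test",
  "password": "test",
  "authdb": "admin",
  "database": "database",
  "default_mode": "cdc"
}
EOF

ls olake_folder
```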


1.1 Example writer.json (Local)

olake_folder/writer.json
{
  "type": "PARQUET",
  "writer": {
    "normalization": false,
    "local_path": "/mnt/config"
  }
}
  • type: The output file format (e.g., PARQUET).
  • writer.normalization: If true, OLake applies level-1 JSON flattening to nested objects.
  • writer.local_path: Local directory inside the Docker container. We map this path to your host file system via a Docker volume.

1.2 Example writer.json (S3)

{
  "type": "PARQUET",
  "writer": {
    "normalization": false,
    "s3_bucket": "olake",
    "s3_region": "",
    "s3_access_key": "",
    "s3_secret_key": "",
    "s3_path": ""
  }
}
  • s3_bucket: Name of your S3 bucket.
  • s3_path: Subdirectory within the bucket for organizing data.
  • s3_access_key / s3_secret_key: AWS credentials (if you’re not using an IAM role).
tip

For basic JSON flattening, set "normalization": true in your writer.json.

Read more about S3 configuration in S3 Writer Docs

1.3 Example config.json (MongoDB)

Below is a sample config.json for connecting to a MongoDB replica set. Customize each field to match your environment.

olake_folder/config.json
{
  "hosts": [
    "host1:27017",
    "host2:27017",
    "host3:27017"
  ],
  "username": "test",
  "password": "test",
  "authdb": "admin",
  "replica-set": "rs0",
  "read-preference": "secondaryPreferred",
  "srv": true,
  "server-ram": 16,
  "database": "database",
  "max_threads": 50,
  "default_mode": "cdc",
  "backoff_retry_count": 2
}

Description of the above parameters

Refer to source configuration for more details on config.json.

Step 2: Generate a Catalog File

OLake needs to discover which collections (streams) exist in your MongoDB. This step will create a catalog.json listing available streams, schemas, and default sync modes.

  1. Open your terminal in the same directory (olake_folder) containing config.json and writer.json.
  2. Run the discover command using Docker:
docker run \
-v olake_folder:/mnt/config \
olakego/source-mongodb:latest \
discover \
--config /mnt/config/config.json
info
  • olake_folder must be the absolute path (not relative) to your olake_folder; it will look something like /Users/user_name/Desktop/projects/olake_folder and is mapped by OLake onto /mnt/config.

To find it, go to the root of olake_folder and run:

pwd

This prints the absolute path to substitute for olake_folder in the command above.
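
Alternatively, you can let the shell substitute the path for you. The sketch below only builds and prints the final command so you can inspect it before running; run it from inside olake_folder:

```shell
# $(pwd) expands to the absolute path of the current directory, so the
# same snippet works on any machine without hand-editing the path.
CMD="docker run -v $(pwd):/mnt/config olakego/source-mongodb:latest discover --config /mnt/config/config.json"
echo "$CMD"
```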


Final Command (Sample)

docker run \
-v /Users/priyansh_dz/Desktop/projects/olake-docker:/mnt/config \
olakego/source-mongodb:latest \
discover \
--config /mnt/config/config.json


  • -v olake_folder:/mnt/config: Maps your local olake_folder into the container at /mnt/config.
  • discover: The OLake sub-command that scans MongoDB schemas.
  • --config /mnt/config/config.json: Tells OLake where to find your MongoDB connection details.

Important: Use the absolute path to your folder in the -v argument.

2.1 Understanding the catalog.json File

After running discover, OLake generates catalog.json in olake_folder with entries like:

olake_folder/catalog.json
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YY}-{now(),2025,MM}-{now(),2025,DD}/{string_change_language,,}",
        "stream_name": "stream_0"
      },
      {
        "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8"
      }
    ]
  },
  "streams": [
    {
      "StreamMetadata": {
        "partition_regex": "",
        "stream_name": ""
      },
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": { ... },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    },
    // ... other streams
  ]
}
  • selected_streams: The streams / tables / collections OLake will replicate.
  • streams: Metadata for each discovered collection, including schemas and sync modes (e.g., cdc, full_refresh).
  • partition_regex: The pattern used to partition output files. For more details, refer to the S3 docs.
tip

Exclude Streams: You can remove unneeded collections by editing selected_streams directly. For instance, delete the "customer" entry if you only want to sync "order".

Before (including customers):

olake_folder/catalog.json
"selected_streams": {
  "otter_db": [
    {
      "stream_name": "order",
      "partition_regex": ""
    },
    {
      "stream_name": "customer",
      "partition_regex": ""
    }
  ]
},

After (to exclude customers):

olake_folder/catalog.json
"selected_streams": {
  "otter_db": [
    {
      "stream_name": "order",
      "partition_regex": ""
    }
  ]
},
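
The same exclusion can be scripted instead of hand-edited. A sketch using jq (assuming jq is installed; the file and stream names follow the example above):

```shell
# Start from a catalog fragment shaped like the "Before" example above.
cat > catalog_sample.json <<'EOF'
{
  "selected_streams": {
    "otter_db": [
      { "stream_name": "order", "partition_regex": "" },
      { "stream_name": "customer", "partition_regex": "" }
    ]
  }
}
EOF

# Keep every stream except "customer".
jq '.selected_streams.otter_db |= map(select(.stream_name != "customer"))' \
  catalog_sample.json > catalog_filtered.json

cat catalog_filtered.json
```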

Step 3: Run Your First Data Sync

Now that you have catalog.json, it’s time to sync data from MongoDB to your specified destination (local or S3).

docker run \
-v olake_folder:/mnt/config \
olakego/source-mongodb:latest \
sync \
--config /mnt/config/config.json \
--catalog /mnt/config/catalog.json \
--destination /mnt/config/writer.json

Final Command (Sample)

docker run \
-v /Users/priyansh_dz/Desktop/projects/olake-docker:/mnt/config \
olakego/source-mongodb:latest \
sync \
--config /mnt/config/config.json \
--catalog /mnt/config/catalog.json \
--destination /mnt/config/writer.json


  • sync: The OLake sub-command that runs a data replication (snapshot + CDC).
  • --config /mnt/config/config.json: MongoDB connection settings.
  • --catalog /mnt/config/catalog.json: The file detailing which streams OLake will replicate.
  • --destination /mnt/config/writer.json: The output configuration file (local or S3).
  • This command performs both the initial snapshot and, if configured, continuous CDC ("default_mode": "cdc").
  • If you only want a full one-time snapshot, set the stream’s sync_mode to "full_refresh" in catalog.json.
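
Switching a stream to a one-time snapshot can also be scripted. A sketch with jq (assuming jq is installed; the stream name comes from the example catalog above):

```shell
# A catalog fragment shaped like the discover output above.
cat > catalog_mode.json <<'EOF'
{
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "sync_mode": "cdc"
      }
    }
  ]
}
EOF

# Flip stream_8 from "cdc" to "full_refresh".
jq '(.streams[] | select(.stream.name == "stream_8") | .stream.sync_mode) = "full_refresh"' \
  catalog_mode.json > catalog_mode_updated.json

grep sync_mode catalog_mode_updated.json
```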
info

Example: If your sync_mode is "cdc", OLake will:

  1. Do a one-time full snapshot of each selected collection.
  2. Automatically begin listening to MongoDB’s oplog for near real-time changes.

When the sync finishes, you should see new files either:

  • Locally (in the volume-mapped directory).
  • On S3 (inside the specified s3_path).

Step 3.1 Synced Data

If you are using VS Code, install a Parquet reader extension to view the contents of the Parquet files created by the sync process.


Step 3.2 Synced Data Normalized

If you have set "normalization": true in your writer.json file, expect Level-1 flattening of the JSON data.

warning

A normalized data dump may be slower than a normal dump and heavier on your machine. We are optimizing this and are currently testing a new Parquet writer that will make normalized dumps fast and resource-efficient. Stay tuned.

Read more about JSON flattening here - Flatten Object Types and Query Arrays in Semi-Structured Nested JSON

Running the sync command with normalization turned on


Output Data Dump


Step 3.3 Change output directory

If you need to write the Parquet dump to a different location, edit writer.json and change local_path to a subdirectory, such as /mnt/config/my_directory:

olake_folder/writer.json
{
  "type": "PARQUET",
  "writer": {
    "normalization": true,
    "local_path": "/mnt/config/my_directory"
  }
}

Here, /mnt/config represents the olake_folder.
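
Because /mnt/config is just the volume-mapped olake_folder, the new output lands in a matching subdirectory on your host. If you want that directory to exist before the first run (an assumption; the writer may also create it on its own), a sketch:

```shell
# Host-side counterpart of the container path /mnt/config/my_directory.
mkdir -p olake_folder/my_directory
ls olake_folder
```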

Step 4: Resume Sync with a State File

If a sync is interrupted or you need to resume from a previous checkpoint, OLake automatically saves progress in a state.json file. Use the --state parameter to continue from that point:

docker run \
-v olake_folder:/mnt/config \
olakego/source-mongodb:latest \
sync \
--config /mnt/config/config.json \
--catalog /mnt/config/catalog.json \
--destination /mnt/config/writer.json \
--state /mnt/config/state.json

Final Command (Sample)

docker run \
-v /Users/priyansh_dz/Desktop/projects/olake-docker:/mnt/config \
olakego/source-mongodb:latest \
sync \
--config /mnt/config/config.json \
--catalog /mnt/config/catalog.json \
--destination /mnt/config/writer.json \
--state /mnt/config/state.json


  • --state /mnt/config/state.json: Points OLake to an existing state file.
info
  • state.json typically includes a resume token (for MongoDB) or an offset for other databases, ensuring OLake does not reprocess records it has already synced.

Sample contents of a state file might look like:

olake_folder/state.json
{
  "type": "STREAM",
  "streams": [
    {
      "stream": "stream_9",
      "namespace": "otter_db",
      "sync_mode": "",
      "state": {
        "_data": "8267B34D61000000022B0429296E1404"
      }
    },
    {
      "stream": "stream_0",
      "namespace": "otter_db",
      "sync_mode": "",
      "state": {
        "_data": "8267B34D61000000022B0429296E1404"
      }
    }
  ]
}

In this example, "_data": "8267B34D6..." is a MongoDB resume token that tells OLake where to pick up the CDC stream.

For more details on the state.json configuration, refer to the state docs.

Below is a sample log after running the above command:

{"type":"LOG","log":{"level":"info","message":"Running sync with state: \u0026{\u003cnil\u003e STREAM \u003cnil\u003e [0x40001862a0]}"}}
{"type":"LOG","log":{"level":"info","message":"local writer configuration found, writing at location[%s], /mnt/config/"}}
{"type":"LOG","log":{"level":"info","message":"Starting discover for MongoDB database otter_db"}}
{"type":"LOG","log":{"level":"info","message":"producing type schema for stream [stream_7]"}}
{"type":"LOG","log":{"level":"info","message":"producing type schema for stream [stream_5]"}}
...
{"type":"LOG","log":{"level":"info","message":"producing type schema for stream [stream_0]"}}
{"type":"LOG","log":{"level":"warn","message":"Skipping stream stream_0.otter_db; not in selected streams."}}
{"type":"LOG","log":{"level":"warn","message":"Skipping stream stream_6.otter_db; not in selected streams."}}
...
{"type":"LOG","log":{"level":"info","message":"Valid selected streams are otter_db.stream_9"}}
{"type":"LOG","log":{"level":"info","message":"Starting ChangeStream process in driver"}}
{"type":"LOG","log":{"level":"info","message":"Starting CDC sync for stream[otter_db.stream_9] with resume token[826799E292000000022B0429296E1404]"}}
{"type":"LOG","log":{"level":"info","message":"Read Process Completed"}}
{"type":"LOG","log":{"level":"debug","message":"Deleted file [/mnt/config/otter_db/stream_9/2025-1-29_9-33-0_01JJRPG9EFNTS1CZYP371AQKYG.parquet] with 0 records (no records written)."}}
{"type":"LOG","log":{"level":"info","message":"Total records read: 0"}}
{"type":"STATE","state":{"type":"STREAM","streams":[{"stream":"stream_9","namespace":"otter_db","sync_mode":"","state":{"_data":"826799E292000000022B0429296E1404"}}]}}
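
Since every log line is a JSON object, the final checkpoint can be pulled out of captured output with standard tools. A sketch (the log file name is illustrative; the lines are the samples from above):

```shell
# Pretend we captured the sync output to a file.
cat > sync.log <<'EOF'
{"type":"LOG","log":{"level":"info","message":"Read Process Completed"}}
{"type":"STATE","state":{"type":"STREAM","streams":[{"stream":"stream_9","namespace":"otter_db","sync_mode":"","state":{"_data":"826799E292000000022B0429296E1404"}}]}}
EOF

# The last STATE line holds the most recent resume token.
grep '"type":"STATE"' sync.log | tail -n 1
```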

Debugging

Follow the debugging instructions in this guide - Setting up debugger in VS Code

Docker Commands & Flags

Click here for more info about Docker Commands & Flags

Next Steps & Wrap-Up

  1. Check Your Output: Verify your Parquet files (or Iceberg tables) were created either locally or in your S3 bucket.
  2. Explore Schema Evolution: If your MongoDB documents gain new fields, OLake can adapt automatically. Watch for updated schemas in subsequent runs.
  3. Try More Destinations: OLake can also write to Iceberg on S3 (and more in the future). Update your writer config as needed.
  4. Analytics & Querying: Connect your newly created Parquet/Iceberg data to engines like Trino, Spark, or Presto for powerful querying.

Congratulations! You’ve completed your first OLake data replication. If you encounter any issues or have feedback, please visit our GitHub repository to open an issue or contribute.


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: we discuss future roadmaps, report bugs, help folks debug the issues they are facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!