Skip to main content

Docker (CLI)

Docker CLI can be installed and executed using the official Docker images. Each supported source driver has a dedicated image available on the OLake DockerHub page.

Prerequisites​

  • Docker installed and running on the host system
  • Internet access to pull Docker images

In the subsequent commands, replace [SOURCE-NAME] with the value corresponding to the required driver.

Driver Name[SOURCE-NAME]
Postgrespostgres
MySQLmysql
Oracleoracle
MongoDBmongodb

With the driver identified, the next step is configuring the source system.

Sources​

Each source driver requires specific configuration. Detailed setup instructions can be found in the following guides:

What's Next : Once the source has been configured, the next step is to define the destination.

Destination​

Destinations specify where ingested data will be written. OLake supports multiple destinations. The following guides provide setup of:

What's Next : Once both source and destination are configured, the next step is to discover available streams/tables.

Discover Command​

To discover the available streams from the source, run the following command:

discover command
docker run --pull=always \
-v "[PATH_OF_CONFIG_FOLDER]:/mnt/config" \
olakego/source-[SOURCE-NAME]:latest \
discover \
--config /mnt/config/source.json
Use version specific command

Using a version specific image is recommended instead of using latest in the commands for production use case.

Running this command generates a streams.json file, which contains the list of available streams along with their metadata.

What's Next : Understanding the streams config file for stream/table level modifications

Streams Config​

The streams.json file is organized into two main sections:

  • selected_streams: Lists the streams that have been chosen for processing. These are grouped by namespace.
  • streams: Contains an array of stream definitions. Each stream holds details about its schema, supported synchronization modes, primary keys, and other metadata.

1. Selected Streams​

The selected_streams section groups streams by their namespace. For example, the configuration might look like this:

streams.json
"selected_streams": {
"my_db": [
{
"partition_regex": "/{dropoff_datetime, year}",
"stream_name": "table1",
"normalization": true,
"append_mode": false,
"filter": "UPDATED_AT >= \"08-JUN-25 07.19.23.690870000 AM\""
},
{
"partition_regex": "",
"stream_name": "table2",
"normalization": false,
"append_mode": false,
"filter": "city = \"London\""
}
]
}

Details about all the fields mentioned in selected streams​

ComponentTypeExample ValueDescription
namespacestringmy_dbGroups streams that belong to a specific database or logical category
stream_namestring"table1", "table2"Identifier for the stream. Must match the stream name defined in the Streams configurations.
partition_regexstring"/{dropoff_datetime, year}"A pattern defining how to partition the data. To read more, refer the Partition Regex for S3 or Partition Regex for iceberg
normalizationbooleantrueConverts top-level JSON fields into columns, while nested JSON objects are stored as string values.
append_modebooleanfalseDisables upserts in Iceberg when set to true. (In upsert mode dedupe happens via equality delete)
filterstring"UPDATED_AT >= \"08-JUN-25 07.19.23.690870000 AM\""Ensures that only data matching the specified condition is synced.

2. Streams​

The streams section is an array where each element is an object that defines a specific data stream. For example, one stream definition looks like this:

streams.json
{
"stream": {
"name": "table1",
"namespace": "my_db",
"type_schema": {
"properties": {
"_id": {
"type": ["string"]
},
"name": {
"type": ["string"]
},
"marks": {
"type": ["integer"]
},
"updated_at": {
"type": ["timestamp"]
},
...
}
},
"supported_sync_modes": ["full_refresh", "cdc", "incremental", "strict_cdc"],
"source_defined_primary_key": ["_id"],
"available_cursor_fields": ["_id", "name", "marks", "updated_at"],
"sync_mode": "incremental",
"cursor_field": "updated_at",
}
}

2.1 Stream Configuration Elements​

ComponentExample ValueDescription & Possible Values
name"stream_8"Unique identifier for the stream. Each stream must have a unique name.
namespace"olake_db"The grouping or database name that the stream belongs to.
type_schema(JSON object with properties)Defines the columns in streams/tables.
supported_sync_modes["full_refresh", "cdc", "incremental","strict_cdc"]Lists the sync modes the stream supports. Includes "full_refresh", "cdc", "strict_cdc" and "incremental". To read more about sync modes read
source_defined_primary_key["_id"]Specifies the field(s) that is set as a primary key in the source.
available_cursor_fields["_id", "name", "marks", "updated_at"]Lists fields that can be used to track synchronization progress in incremental sync mode.
sync_mode"incremental"Indicates the active sync mode. Possible values are defined in supported_sync_modes.
cursor_field"updated_at"Defines the cursor field used to track incremental sync. To read more about incremental sync read this

Whats's Next: Once the streams.json file is configured as required, sync can be initiated.

Sync​

The Sync section describes how to start the sync process using the configuration files.It also explains how to run sync with a state file when operating in Incremental (Bookmark/cursor) or CDC mode.

Sync command​

This command is executed when a sync needs to be run in Full Refresh mode or when running a sync for the first time (i.e., before a state file has been created).

sync command
docker run --pull=always \
-v "[PATH_OF_CONFIG_FOLDER]:/mnt/config" \
olakego/source-[SOURCE-NAME]:latest \
sync \
--config /mnt/config/source.json \
--catalog /mnt/config/streams.json \
--destination /mnt/config/destination.json

Sync with state command​

When using Incremental (Bookmark/cursor) or CDC mode, the state gets saved, and this file is used to start the sync from last saved point.

sync with state command
docker run --pull=always \
-v "[PATH_OF_CONFIG_FOLDER]:/mnt/config" \
olakego/source-[SOURCE-NAME]:latest \
sync \
--config /mnt/config/source.json \
--catalog /mnt/config/streams.json \
--destination /mnt/config/destination.json \
--state /mnt/config/state.json

Whats's Next: Monitor sync progress and performance using the stats.json file.

Stats​

The stats.json file is created as soon as the sync starts. The file looks like this:

stats.json
{"Estimated Remaining Time": "1642.00","Memory": "2228 mb","Running Threads": 21,"Seconds Elapsed": "186.00","Speed": "76542.20 rps","Synced Records": 14236868}
ComponenetExample ValueDescription
Estimated Remaining Time1642.00A rough estimation in seconds, indicating when the sync is expected to complete
Memory2228 mbMemory which is currently getting utilizied
Running Threads21Number of parallel threads actively syncing data
Seconds Elapsed186.00Time in seconds elapsed since the sync started
Speed76542.20 rpsHow many rows per second are being synced in this sync.
Synced Records14236868Total number of records which have been synced since the sync started.

The stats.json file remains available after sync completion.

What's Next: Similarly logs get saved for every command that is executed. And those can be reviewed for additional information.

Logs​

During a sync, Olake automatically creates an olake.log file inside a folder named sync_<TIMESTAMP_IN_UTC> (for example: sync_2025-02-17_11-41-40). This folder is created in the same directory as the configuration files.

  • The olake.log file contains a complete record of all logs generated while the command is running.

logs

  • Console logs are displayed in real time with color formatting for easier identification.

logs console

"debug": "\033[36m", // Cyan
"info": "\033[32m", // Green
"warn": "\033[33m", // Yellow
"error": "\033[31m", // Red
"fatal": "\033[31m", // Red

With this, the sync setup using the OLake CLI is complete, from configuring source and destination to reviewing key outputs such as stats and logs.




πŸ’‘ Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
πŸ‘‰ Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. πŸš€

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!