Configurations
config.json Configuration

Below is a sample config.json for connecting to a MongoDB replica set. Customize each field to match your environment.
{
"hosts": [
"host1:27017",
"host2:27017",
"host3:27017"
],
"username": "test",
"password": "test",
"authdb": "admin",
"replica-set": "rs0",
"read-preference": "secondaryPreferred",
"srv": true,
"server-ram": 16,
"database": "database",
"max_threads": 50,
"default_mode": "cdc",
"backoff_retry_count": 2,
"partition_strategy": ""
}
Description of the above parameters
Field | Description | Example Value | Data Type |
---|---|---|---|
hosts | List of MongoDB hosts. Use DNS SRV if srv = true . | x.xxx.xxx.120:27017 , x.xxx.xxx.120:27017 , x.xxx.xxx.133:27017 (can be multiple) | []STRING |
username/password | Credentials for MongoDB authentication. | "test"/"test" | STRING |
authdb | Authentication database (often admin ). | "admin" | STRING |
replica-set | Name of the replica set, if applicable. | "rs0" | STRING |
read-preference | Which node to read from (e.g., secondaryPreferred ). | "secondaryPreferred" | STRING |
srv | Set to true if using DNS SRV connection strings. When true, only one host is allowed in the hosts field. | true, false | BOOL |
server-ram | Memory management hint for the OLake container. | 16 | UINT |
database | The MongoDB database name to replicate. | "database_name" | STRING |
max_threads | Maximum parallel threads for chunk-based snapshotting. | 50 | INT |
default_mode | Default sync mode ("cdc" or "full_refresh"). | "cdc", "full_refresh", "incremental" (WIP) | STRING |
backoff_retry_count | Number of retry attempts made if a sync fails. The wait between retries grows exponentially, in minutes (1, 2, 4, 8, 16, ...), up to the configured number of attempts. | Defaults to 3; the default is also used if set to -1 | INT |
partition_strategy | The partition strategy used for backfill. | timestamp; if left empty, the default Split-Vector strategy is used | STRING |
Refer here for more about sync modes.
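As an illustration, the constraints above (required fields, and the rule that srv: true allows only a single host) can be checked with a small Python sketch. The validate_config helper below is hypothetical, not part of OLake:

```python
import json

def validate_config(cfg: dict) -> list:
    """Return a list of problems found in a config.json dict.

    Illustrative sketch based on the field descriptions above,
    not an official OLake validator.
    """
    problems = []
    required = ["hosts", "username", "password", "authdb", "database"]
    for key in required:
        if key not in cfg:
            problems.append(f"missing required field: {key}")
    # When srv is true, the docs state only one host is allowed.
    if cfg.get("srv") and len(cfg.get("hosts", [])) > 1:
        problems.append("srv=true allows only a single host in 'hosts'")
    if not isinstance(cfg.get("max_threads", 1), int):
        problems.append("max_threads must be an integer")
    return problems

sample = json.loads("""{
  "hosts": ["host1:27017", "host2:27017"],
  "username": "test", "password": "test",
  "authdb": "admin", "database": "database",
  "srv": true, "max_threads": 50
}""")
print(validate_config(sample))  # flags srv with multiple hosts
```

Running such a check before starting a sync surfaces configuration mistakes early instead of at connection time.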
catalog.json Configuration

Here we explain the structure and contents of your catalog.json file, which is used to configure and manage data streams. It covers the following topics:
- Overall File Structure
- Selected Streams
- Streams and Their Configuration
- Type Schema: Properties and Data Types
- Key-Value Pair Explanation
- Synchronization Modes
1. Overall File Structure
The catalog.json file is organized into two main sections:

- selected_streams: Lists the streams that have been chosen for processing, grouped by namespace.
- streams: Contains an array of stream definitions. Each stream holds details about its data schema, supported synchronization modes, primary keys, and other metadata.
2. Selected Streams
The selected_streams section groups streams by their namespace (database name). For example, the configuration might look like this:
"selected_streams": {
"otter_db": [
{
"partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
"stream_name": "stream_0"
},
{
"partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
"stream_name": "stream_8"
}
]
}
Key components:
Component | Data Type | Example Value | Description & Possible Values |
---|---|---|---|
namespace | string | otter_db | Groups streams that belong to a specific database or logical category |
stream_name | string | "stream_0" , "stream_8" | The identifier for the stream. Should match the stream name defined in the stream configurations. |
partition_regex | string | "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}" or "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}" | A pattern defining how to partition the data. Includes date tokens (e.g., year, month, day) or other markers like language or revision indicators. |
For more information about partition_regex, refer to the S3 partition documentation.
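The shape of the selected_streams section can be traversed with a few lines of Python. The selected_stream_names helper below is a hypothetical sketch, not part of OLake:

```python
import json

catalog_text = """{
  "selected_streams": {
    "otter_db": [
      {"partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
       "stream_name": "stream_0"},
      {"partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
       "stream_name": "stream_8"}
    ]
  }
}"""

def selected_stream_names(catalog: dict) -> dict:
    """Map each namespace to the list of its selected stream names."""
    return {
        ns: [entry["stream_name"] for entry in entries]
        for ns, entries in catalog.get("selected_streams", {}).items()
    }

catalog = json.loads(catalog_text)
print(selected_stream_names(catalog))  # {'otter_db': ['stream_0', 'stream_8']}
```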
3. Streams
The streams section is an array where each element is an object that defines a specific data stream. Each stream object includes a stream key that holds the configuration details. For example, one stream definition looks like this:
{
"stream": {
"name": "stream_8",
"namespace": "otter_db",
"type_schema": {
"properties": {
"_id": {
"type": ["string"]
},
"authors": {
"type": ["array"]
},
"backreferences": {
"type": ["array"]
},
"birth_date": {
"type": ["string"]
},
...
}
},
"supported_sync_modes": ["full_refresh", "cdc"],
"source_defined_primary_key": ["_id"],
"available_cursor_fields": [],
"sync_mode": "cdc"
}
}
3.1 Stream Configuration Elements
Component | Data Type | Example Value | Description & Possible Values |
---|---|---|---|
name | string | "stream_8" | Unique identifier for the stream. Each stream should have a unique name. |
namespace | string | "otter_db" | Grouping or database name that the stream belongs to. It helps organize streams by logical or physical data sources. |
type_schema | object | (JSON object with properties) | Defines the structure of the records in the stream. Contains a properties object that maps each field (key) to its allowed data types (e.g., string, integer, array, object). |
supported_sync_modes | array | ["full_refresh", "cdc"] | Lists the synchronization modes the stream supports. Typically includes "full_refresh" for complete reloads and "cdc" (change data capture) for incremental updates. |
source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify a record within the stream. This key is used to ensure data uniqueness and integrity. |
available_cursor_fields | array | [] | Lists fields that can be used to track the synchronization progress. Often empty if no cursors are required or defined. |
sync_mode | string | "cdc" | Indicates the active synchronization mode. This field is set to either "cdc" for change data capture or "full_refresh" when a complete data reload is used. |
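Putting the elements above together, a short sketch can walk the streams array and confirm that each stream's active sync_mode is among its supported_sync_modes. The summarize_streams helper is hypothetical, not OLake code:

```python
import json

catalog = json.loads("""{
  "streams": [
    {"stream": {"name": "stream_8", "namespace": "otter_db",
                "supported_sync_modes": ["full_refresh", "cdc"],
                "source_defined_primary_key": ["_id"],
                "available_cursor_fields": [],
                "sync_mode": "cdc"}}
  ]
}""")

def summarize_streams(catalog: dict) -> list:
    """Return (name, namespace, active sync_mode) for each stream,
    verifying the active mode is one of the supported modes."""
    rows = []
    for item in catalog.get("streams", []):
        s = item["stream"]
        if s["sync_mode"] not in s["supported_sync_modes"]:
            raise ValueError(f"{s['name']}: unsupported sync_mode {s['sync_mode']}")
        rows.append((s["name"], s["namespace"], s["sync_mode"]))
    return rows

print(summarize_streams(catalog))  # [('stream_8', 'otter_db', 'cdc')]
```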
4. Type Schema: Properties and Data Types
The type_schema is central to defining what data each stream will handle. Under this schema:
4.1 Properties Object
The properties object is a collection of key-value pairs where:

- Key: Represents the name of the property or field (e.g., "title", "authors", "publish_date").
- Value: Is an object that describes the property's allowed data types.
For example:
"_id": {
"type": ["string"]
}
This definition means that the _id
property must be a string.
4.2 Supported Data Types
The schema supports several JSON data types. The table below summarizes them, with a brief description and a sample value for each:
Data Type | Description | Sample Value |
---|---|---|
Null | Represents a null or missing value. | null |
Int64 | Represents a 64-bit integer value. | 42 |
Float64 | Represents a 64-bit floating point number. | 3.14159 |
String | Represents a text string. | "Hello, world!" |
Bool | Represents a boolean value (true/false). | true |
Object | Represents a JSON object (key-value pairs). | { "name": "Alice", "age": 30 } |
Array | Represents an array of values. | [1, 2, 3] |
Unknown | Represents an unspecified or unrecognized type. | "unknown" |
Timestamp | Represents a timestamp with default precision. | "2025-02-18T10:00:00Z" |
TimestampMilli | Represents a timestamp with millisecond precision (3 decimal places). | "2025-02-18T10:00:00.123Z" |
TimestampMicro | Represents a timestamp with microsecond precision (6 decimal places). | "2025-02-18T10:00:00.123456Z" |
TimestampNano | Represents a timestamp with nanosecond precision (9 decimal places). | "2025-02-18T10:00:00.123456789Z" |
Support for more advanced data types is coming soon. Track the progress.
A field can allow multiple types. For instance:
"coordinates": {
"type": [
"null",
"object"
]
},
This means the coordinates field may be null or, when present, must be an object.
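A value can be checked against such a multi-type property definition with a small sketch. The mapping below is a simplified, assumed correspondence between the schema's type names and Python types, not OLake's internal validation:

```python
# Simplified mapping from schema type names to Python type checks.
# Illustrative only; not OLake's internal validation logic.
TYPE_CHECKS = {
    "null":   lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    "int64":  lambda v: isinstance(v, int) and not isinstance(v, bool),
    "bool":   lambda v: isinstance(v, bool),
    "object": lambda v: isinstance(v, dict),
    "array":  lambda v: isinstance(v, list),
}

def matches(value, allowed_types):
    """True if value satisfies at least one of the allowed type names."""
    return any(TYPE_CHECKS[t](value) for t in allowed_types if t in TYPE_CHECKS)

coordinates_schema = ["null", "object"]
print(matches(None, coordinates_schema))                      # True
print(matches({"lat": 1.0, "lng": 2.0}, coordinates_schema))  # True
print(matches("not-an-object", coordinates_schema))           # False
```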
5. Synchronization Modes
The catalog specifies how streams should be synchronized:
Component | Description | Possible Values |
---|---|---|
supported_sync_modes | Lists all the modes a stream supports. | full_refresh (a complete reload of the data), cdc (incremental updates capturing only changes) |
sync_mode | Indicates which mode is actively being used for that stream. | full_refresh or cdc |
Sample configuration
{
"selected_streams": {
"otter_db": [
{
"partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
"stream_name": "stream_0"
},
{
"partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
"stream_name": "stream_8"
}
]
},
"streams": [
{
"stream": {
"name": "stream_8",
"namespace": "otter_db",
"type_schema": { ... },
"supported_sync_modes": [
"full_refresh",
"cdc"
],
"source_defined_primary_key": [
"_id"
],
"available_cursor_fields": [],
"sync_mode": "cdc"
}
},
// ... other streams
]
}
Sample partition over the title column (not recommended)
state.json Configuration

The state.json file is a critical component in OLake's synchronization process. It tracks the point (via a cursor token, resume token, or offset) until which data has been processed. This mechanism allows OLake to resume syncing from where it left off, preventing duplicate processing of records.
File Structure
A typical state.json file has the following structure:
{
"type": "STREAM",
"streams": [
{
"stream": "stream_8",
"namespace": "otter_db",
"sync_mode": "",
"state": {
"_data": "8267B34D61000000022B0429296E1404"
}
},
{
"stream": "stream_0",
"namespace": "otter_db",
"sync_mode": "",
"state": {
"_data": "8267B34D61000000022B0429296E1404"
}
}
]
}
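Reading the stored token for a given stream is a simple lookup over the streams array. The resume_token helper below is a hypothetical sketch, not part of OLake:

```python
import json

state_text = """{
  "type": "STREAM",
  "streams": [
    {"stream": "stream_8", "namespace": "otter_db", "sync_mode": "",
     "state": {"_data": "8267B34D61000000022B0429296E1404"}}
  ]
}"""

def resume_token(state: dict, stream: str, namespace: str):
    """Look up the stored resume token for one stream, or None if absent."""
    for entry in state.get("streams", []):
        if entry["stream"] == stream and entry["namespace"] == namespace:
            return entry["state"].get("_data")
    return None

state = json.loads(state_text)
print(resume_token(state, "stream_8", "otter_db"))
```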
Key Components
Key | Data Type | Description | Sample Value |
---|---|---|---|
type | string | Identifies the type of state stored. Typically, it is set to "STREAM" . | "STREAM" |
streams | array | An array containing state objects for each stream. | [ { ... }, { ... } ] |
stream | string | The unique identifier for the stream whose state is recorded. | "stream_8" or "stream_0" |
namespace | string | The namespace or logical grouping the stream belongs to. | "otter_db" |
sync_mode | string | Indicates the active synchronization mode for the stream. This value may be empty or contain a specific mode. | "" (empty string) or sync modes like "cdc" , "full_refresh" , "incremental" (WIP) |
state | object | Contains the resume token or offset. This token determines the point until which data has been synced. | { "_data": "8267B34D61000000022B0429296E1404" } |
Refer here for more about sync modes.
How It Works
- Resume Token / Offset: The value stored in the state object (in the _data field) represents the cursor token, resume token (in MongoDB), or offset (in other databases) indicating the last processed record.
- Incremental Syncing: By keeping track of the token, OLake can start the next sync run from this point, ensuring that previously processed records are not re-fetched.
- Multiple Streams: Each stream in the streams array maintains its own synchronization state. This allows OLake to handle multiple data sources or partitions independently.
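The per-stream update at the end of a sync run can be sketched as follows. OLake manages state.json itself; the update_state helper is hypothetical and only shows the shape of the change:

```python
def update_state(state: dict, stream: str, namespace: str, token: str) -> dict:
    """Record a new resume token for a stream, adding the entry if absent.

    Illustrative sketch only; OLake writes state.json on its own.
    """
    for entry in state.setdefault("streams", []):
        if entry["stream"] == stream and entry["namespace"] == namespace:
            entry["state"] = {"_data": token}
            return state
    state["streams"].append({
        "stream": stream, "namespace": namespace,
        "sync_mode": "", "state": {"_data": token},
    })
    return state

s = {"type": "STREAM", "streams": []}
update_state(s, "stream_0", "otter_db", "ABC123")
print(s["streams"][0]["state"])  # {'_data': 'ABC123'}
```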
Benefits

- Efficiency: Incremental synchronization reduces data transfer and processing by fetching only new or changed records.
- Data Consistency: Tracking the synchronization state prevents duplicate processing, ensuring that data remains consistent.
- Flexibility: The state mechanism supports various data sources (e.g., MongoDB with resume tokens, other databases with offsets), making it adaptable to different backend systems.