Configurations

config.json Configuration

Below is a sample config.json for connecting to a Postgres database. Customize each field to match your environment.

config.json
{
  "host": "localhost",
  "port": 5432,
  "database": "main",
  "username": "main",
  "password": "password",
  "jdbc_url_params": {},
  "ssl": {
    "mode": "disable"
  },
  "update_method": {
    "replication_slot": "postgres_slot",
    "intial_wait_time": 10
  },
  "reader_batch_size": 100000,
  "default_mode": "cdc",
  "max_threads": 50
}

Description of above parameters

| Field | Description | Example Value | Data Type |
| --- | --- | --- | --- |
| host | The hostname or IP address of the database server. | localhost | String |
| port | The port number through which the database server is accessible. | 5432 | Integer |
| database | The name of the target database to connect to. | main | String |
| username | The username used for authenticating with the database. | main | String |
| password | The password corresponding to the provided username for authentication. | password | String |
| jdbc_url_params | A collection of additional JDBC URL parameters to fine-tune the connection. | {} | Object |
| ssl | SSL configuration for the database connection. Contains details such as the SSL mode. | {"mode": "disable"} | Object |
| update_method | Specifies the mechanism for updating data. Includes properties for a replication slot and an initial wait time. | {"replication_slot": "postgres_slot", "intial_wait_time": 10} | Object |
| reader_batch_size | The maximum number of records processed per batch during reading operations. | 100000 | Integer |
| default_mode | Defines the default mode of operation, for example, using CDC (Change Data Capture). | cdc | String |
| max_threads | The maximum number of threads allocated for parallel processing tasks. | 50 | Integer |
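As a quick sanity check before running a sync, the field types in the table above can be verified with a few lines of Python. This is only a sketch and not part of OLake: the helper name `check_config` is hypothetical, and treating every listed field as required is an assumption made for illustration.

```python
import json

# Expected top-level fields and their JSON types, taken from the table above.
# Assumption: every field is treated as required; the source does not say so.
EXPECTED_TYPES = {
    "host": str,
    "port": int,
    "database": str,
    "username": str,
    "password": str,
    "jdbc_url_params": dict,
    "ssl": dict,
    "update_method": dict,
    "reader_batch_size": int,
    "default_mode": str,
    "max_threads": int,
}


def check_config(config):
    """Return a list of human-readable problems found in a parsed config.json."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in config:
            problems.append("missing field: " + field)
            continue
        value = config[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                "%s: expected %s, got %s"
                % (field, expected.__name__, type(value).__name__)
            )
    return problems


def check_config_text(text):
    """Parse JSON text and run the same checks."""
    return check_config(json.loads(text))
```

Calling `check_config_text` on the contents of the sample file above should return an empty list; a config with `"port": "5432"` (a string) would be reported as a type mismatch.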

catalog.json Configuration

This section explains the structure and contents of the catalog.json file, which is used to configure and manage data streams. It covers the following topics:

  • Overall File Structure
  • Selected Streams
  • Streams and Their Configuration
  • Type Schema: Properties and Data Types
  • Key-Value Pair Explanation
  • Synchronization Modes

1. Overall File Structure

The catalog.json file is organized into two main sections:

  • selected_streams: Lists the streams that have been chosen for processing. These are grouped by namespace.
  • streams: Contains an array of stream definitions. Each stream holds details about its data schema, supported synchronization modes, primary keys, and other metadata.

2. Selected Streams

The selected_streams section groups streams by their namespace (the database name). For example, the configuration might look like this:

catalog.json
"selected_streams": {
  "otter_db": [
    {
      "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
      "stream_name": "stream_0"
    },
    {
      "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
      "stream_name": "stream_8"
    }
  ]
}

Key components:

| Component | Data Type | Example Value | Description & Possible Values |
| --- | --- | --- | --- |
| namespace | string | otter_db | Groups streams that belong to a specific database or logical category. |
| stream_name | string | "stream_0", "stream_8" | The identifier for the stream. Should match the stream name defined in the stream configurations. |
| partition_regex | string | "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}" or "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}" | A pattern defining how to partition the data. Includes date tokens (e.g., year, month, day) or other markers like language or revision indicators. |

For more information about partition_regex, refer to the S3 partition documentation.
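To make the shape of a partition_regex value easier to see, the sketch below splits a pattern into its brace-delimited tokens and the three comma-separated parts inside each one. The helper name `split_partition_tokens` is hypothetical, and the code deliberately does not assign a meaning to the three positions, since that is defined in the partition documentation rather than on this page.

```python
import re

# Matches one brace-delimited token, e.g. "{now(),2025,YY}".
TOKEN = re.compile(r"\{([^{}]*)\}")


def split_partition_tokens(pattern):
    """Return a list of tuples, one per {...} token, split on commas."""
    return [tuple(body.split(",")) for body in TOKEN.findall(pattern)]


if __name__ == "__main__":
    example = "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}"
    for parts in split_partition_tokens(example):
        print(parts)
```

An empty string in a tuple simply means that position was left blank in the pattern, as in `{latest_revision,,}`.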

3. Streams

The streams section is an array where each element is an object that defines a specific data stream. Each stream object includes a stream key that holds the configuration details. For example, one stream definition looks like this:

catalog.json
{
  "stream": {
    "name": "stream_8",
    "namespace": "otter_db",
    "type_schema": {
      "properties": {
        "_id": {
          "type": ["string"]
        },
        "authors": {
          "type": ["array"]
        },
        "backreferences": {
          "type": ["array"]
        },
        "birth_date": {
          "type": ["string"]
        },
        ...
      }
    },
    "supported_sync_modes": ["full_refresh", "cdc"],
    "source_defined_primary_key": ["_id"],
    "available_cursor_fields": [],
    "sync_mode": "cdc"
  }
}

3.1 Stream Configuration Elements

| Component | Data Type | Example Value | Description & Possible Values |
| --- | --- | --- | --- |
| name | string | "stream_8" | Unique identifier for the stream. Each stream should have a unique name. |
| namespace | string | "otter_db" | Grouping or database name that the stream belongs to. It helps organize streams by logical or physical data sources. |
| type_schema | object | (JSON object with properties) | Defines the structure of the records in the stream. Contains a properties object that maps each field (key) to its allowed data types (e.g., string, integer, array, object). |
| supported_sync_modes | array | ["full_refresh", "cdc"] | Lists the synchronization modes the stream supports. Typically includes "full_refresh" for complete reloads and "cdc" (change data capture) for incremental updates. |
| source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify a record within the stream. This key is used to ensure data uniqueness and integrity. |
| available_cursor_fields | array | [] | Lists fields that can be used to track the synchronization progress. Often empty if no cursors are required or defined. |
| sync_mode | string | "cdc" | Indicates the active synchronization mode. This field is set to either "cdc" for change data capture or "full_refresh" when a complete data reload is used. |
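Because each stream_name under selected_streams should match a stream defined in the streams array, a small consistency check can catch typos early. The following is a sketch under that assumption; `unmatched_selected_streams` is a hypothetical helper, not an OLake function, and it matches on the (namespace, name) pair.

```python
def unmatched_selected_streams(catalog):
    """Return (namespace, stream_name) pairs that are selected but not defined."""
    # Collect every stream that has a definition in the "streams" array.
    defined = {
        (entry["stream"]["namespace"], entry["stream"]["name"])
        for entry in catalog.get("streams", [])
    }
    missing = []
    # selected_streams maps a namespace to a list of selection objects.
    for namespace, selections in catalog.get("selected_streams", {}).items():
        for selection in selections:
            key = (namespace, selection["stream_name"])
            if key not in defined:
                missing.append(key)
    return missing
```

An empty result means every selected stream has a definition. Any pair that is returned points at a selection that has no matching stream definition.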

4. Type Schema: Properties and Data Types

The type_schema is central to defining what data each stream will handle. Under this schema:

4.1 Properties Object

The properties object is a collection of key-value pairs where:

  • Key: Represents the name of the property or field (e.g., "title", "authors", "publish_date").
  • Value: Is an object that describes the property's allowed data types.

For example:

"_id": {
  "type": ["string"]
}

This definition means that the _id property must be a string.

4.2 Supported Data Types

The schema supports the data types listed below. The table gives a brief description and a sample value for each:

| Data Type | Description | Sample Value |
| --- | --- | --- |
| Null | Represents a null or missing value. | null |
| Int64 | Represents a 64-bit integer value. | 42 |
| Float64 | Represents a 64-bit floating point number. | 3.14159 |
| String | Represents a text string. | "Hello, world!" |
| Bool | Represents a boolean value (true/false). | true |
| Object | Represents a JSON object (key-value pairs). | { "name": "Alice", "age": 30 } |
| Array | Represents an array of values. | [1, 2, 3] |
| Unknown | Represents an unspecified or unrecognized type. | "unknown" |
| Timestamp | Represents a timestamp with default precision. | "2025-02-18T10:00:00Z" |
| TimestampMilli | Represents a timestamp with millisecond precision (3 decimal places). | "2025-02-18T10:00:00.123Z" |
| TimestampMicro | Represents a timestamp with microsecond precision (6 decimal places). | "2025-02-18T10:00:00.123456Z" |
| TimestampNano | Represents a timestamp with nanosecond precision (9 decimal places). | "2025-02-18T10:00:00.123456789Z" |

Support for more advanced data types is coming soon. Track the progress.

A field can allow multiple types. For instance:

"coordinates": {
  "type": [
    "null",
    "object"
  ]
}

This means the coordinates field is of the object data type but may also be null.
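A multi-type entry can be read as "the value must match at least one of the listed types". The sketch below illustrates that reading for the lowercase type names that appear in the examples on this page ("null", "string", "object", "array"). The `matches_type` helper is hypothetical, and the catalog spellings of the other types in the table are not shown here, so they are left out rather than guessed.

```python
# Checks for the type names that appear in the catalog examples above.
# Assumption: other type names exist but their exact spelling is not shown,
# so they are intentionally not covered.
TYPE_CHECKS = {
    "null": lambda value: value is None,
    "string": lambda value: isinstance(value, str),
    "object": lambda value: isinstance(value, dict),
    "array": lambda value: isinstance(value, list),
}


def matches_type(value, allowed_types):
    """Return True if the value matches at least one known type in the list."""
    return any(
        TYPE_CHECKS[name](value) for name in allowed_types if name in TYPE_CHECKS
    )
```

With the coordinates definition above, both `None` and a dictionary pass the check, while a plain string does not.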

5. Synchronization Modes

The catalog specifies how streams should be synchronized:

| Component | Description | Possible Values |
| --- | --- | --- |
| supported_sync_modes | Lists all the modes a stream supports. | full_refresh (a complete reload of the data); cdc (incremental updates capturing only changes) |
| sync_mode | Indicates which mode is actively being used for that stream. | full_refresh or cdc |
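Since sync_mode names the active mode and supported_sync_modes lists what a stream allows, it is reasonable to check that the former appears in the latter. This is a sketch of that check only; `invalid_sync_modes` is a hypothetical helper, and whether OLake itself rejects such a mismatch is not stated on this page.

```python
def invalid_sync_modes(catalog):
    """Return (stream name, sync_mode) pairs where the mode is not supported."""
    problems = []
    for entry in catalog.get("streams", []):
        stream = entry["stream"]
        # The active mode should be one of the modes the stream declares.
        if stream["sync_mode"] not in stream["supported_sync_modes"]:
            problems.append((stream["name"], stream["sync_mode"]))
    return problems
```

For the stream_8 definition shown earlier, "cdc" is in ["full_refresh", "cdc"], so the function would report nothing.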

Sample configuration

olake_directory/catalog.json
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
        "stream_name": "stream_0"
      },
      {
        "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8"
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": { ... },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    },
    // ... other streams
  ]
}

Sample partition over title column (not recommended)

[Figure: partition image]

state.json Configuration

The state.json file is a critical component in OLake's synchronization process. It tracks the point (via a cursor token, resume token, or offset) up to which data has been processed. This mechanism allows OLake to resume syncing from where it left off, preventing duplicate processing of records.

File Structure

A typical state.json file has the following structure:

state.json
{
  "type": "GLOBAL",
  "global": {
    "state": {
      "lsn": "0/198C935"
    },
    "streams": [
      "public.sample_data"
    ]
  },
  "streams": [
    {
      "stream": "sample_data",
      "namespace": "public",
      "sync_mode": "",
      "state": {
        "chunks": []
      }
    }
  ]
}

Key Components

Below is a detailed property description table for the provided state.json structure:

| Key | Data Type | Description | Sample Value |
| --- | --- | --- | --- |
| type | string | Identifies the type of state stored. In this file, "GLOBAL" indicates that the state applies globally to the replication process. | "GLOBAL" |
| global | object | Contains global state information including overall replication progress and a list of globally tracked streams. | { "state": { "lsn": "0/198C935" }, "streams": ["public.sample_data"] } |
| global.state | object | Holds the global state details required to resume replication, such as log sequence markers. | { "lsn": "0/198C935" } |
| global.state.lsn | string | The Log Sequence Number (LSN) indicating the last processed position in the transaction log. | "0/198C935" |
| global.streams | array | An array listing fully-qualified stream identifiers tracked at the global level. | [ "public.sample_data" ] |
| streams | array | An array of objects, each representing the state of an individual stream. These objects contain details needed to resume replication for that stream. | [ { "stream": "sample_data", "namespace": "public", "sync_mode": "", "state": { "chunks": [] } } ] |
| streams[].stream | string | The identifier (typically a table name) of the stream whose state is being tracked. | "sample_data" |
| streams[].namespace | string | Indicates the namespace or schema to which the stream belongs. | "public" |
| streams[].sync_mode | string | Specifies the synchronization mode for the stream. It may be left empty if no specific mode is applied. | "" (empty string) |
| streams[].state | object | Contains state details specific to the stream, such as information on data segmentation or chunks. | { "chunks": [] } |
| streams[].state.chunks | array | An array used to store chunk information, useful for managing segmented or large datasets during replication. | [] |
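When inspecting a state file by hand, it can help to turn the LSN string into a number so that two positions can be compared. A Postgres LSN is written as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position. The sketch below relies on that format; the helper names `parse_lsn` and `resume_position` are hypothetical, and handling only the "GLOBAL" state type is an assumption based on the sample above.

```python
def parse_lsn(lsn):
    """Convert a Postgres LSN such as "0/198C935" into a 64-bit integer."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after the low 32.
    return (int(high, 16) << 32) | int(low, 16)


def resume_position(state):
    """Return the numeric LSN stored in a parsed state.json, or None."""
    # Assumption: only the "GLOBAL" layout shown in the sample is handled.
    if state.get("type") != "GLOBAL":
        return None
    lsn = state.get("global", {}).get("state", {}).get("lsn")
    return None if lsn is None else parse_lsn(lsn)
```

Comparing the values returned for two state files shows which one is further along in the transaction log: the larger integer is the later position.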

For more about sync modes, refer to the sync modes documentation.

