Configurations
source.json
Configuration
Below is a sample source.json for connecting to an Oracle database. Customize each field to match your environment.
{
  "host": "oracle-host",
  "username": "oracle-user",
  "password": "oracle-password",
  "service_name": "oracle-service-name",
  "port": 1521,
  "max_threads": 10,
  "retry_count": 0,
  "jdbc_url_params": {},
  "ssl": {
    "mode": "disable"
  }
}
Description of above parameters
Field | Description | Example Value | Data Type |
---|---|---|---|
host | Hostname or IP address of the Oracle database server. | oracle-host | String |
port | TCP port on which the Oracle listener is accepting connections. | 1521 | Integer |
service_name | Oracle service name that identifies the specific database service to connect to. | oracle-service-name | String |
username | Database user used to authenticate the connection. | oracle-user | String |
password | Password for the specified user. | oracle-password | String |
max_threads | Maximum number of worker threads the connector can spin up for parallel tasks. | 10 | Integer |
retry_count | Number of times the connector will retry a failed operation before giving up. | 0 | Integer |
jdbc_url_params | Extra JDBC URL parameters for fine-tuning the connection (left empty if none are needed). | {} | Object |
ssl | SSL settings for the connection (e.g., whether SSL is disabled, allowed, or required). | {"mode": "disable"} | Object |
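As a quick sanity check before running a sync, the field types from the table above can be verified with a short script. This is a minimal sketch, not part of OLake: the helper name `check_source_config` is hypothetical, and treating every field as mandatory is an assumption made for illustration.

```python
import json

# Expected Python types for each field, taken from the parameter table above.
EXPECTED_TYPES = {
    "host": str,
    "username": str,
    "password": str,
    "service_name": str,
    "port": int,
    "max_threads": int,
    "retry_count": int,
    "jdbc_url_params": dict,
    "ssl": dict,
}


def check_source_config(config: dict) -> list:
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper: assumes every field in EXPECTED_TYPES is required.
    """
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in config:
            problems.append(f"missing field: {field}")
            continue
        value = config[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# The sample source.json from above, inlined so the sketch is self-contained.
sample = json.loads("""
{
  "host": "oracle-host",
  "username": "oracle-user",
  "password": "oracle-password",
  "service_name": "oracle-service-name",
  "port": 1521,
  "max_threads": 10,
  "retry_count": 0,
  "jdbc_url_params": {},
  "ssl": {"mode": "disable"}
}
""")

print(check_source_config(sample))
```

In practice you would load the file with `json.load(open("source.json"))` instead of the inlined string.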
streams.json
Configuration
Here we explain the structure and contents of your streams.json file, which is used to configure and manage data streams. It covers the following topics:
- Overall File Structure
- Selected Streams
- Streams and Their Configuration
- Type Schema: Properties and Data Types
- Key-Value Pair Explanation
- Synchronization Modes
1. Overall File Structure
The streams.json file is organized into two main sections:
- selected_streams: Lists the streams that have been chosen for processing. These are grouped by namespace.
- streams: Contains an array of stream definitions. Each stream holds details about its data schema, supported synchronization modes, primary keys, and other metadata.
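To make the two-section layout concrete, here is a minimal sketch that reads a trimmed-down streams.json document and summarizes both sections. The `summarize` helper and the shortened document are illustrative assumptions, not something OLake ships.

```python
import json

# A trimmed-down streams.json document, inlined so the sketch is self-contained.
STREAMS_JSON = """
{
  "selected_streams": {
    "otter_db": [
      {"stream_name": "stream_0"},
      {"stream_name": "stream_8"}
    ]
  },
  "streams": [
    {"stream": {"name": "stream_8", "namespace": "otter_db"}}
  ]
}
"""


def summarize(doc: dict) -> dict:
    """Count selected streams per namespace and the number of stream definitions."""
    selected = doc["selected_streams"]  # object keyed by namespace
    streams = doc["streams"]            # array of {"stream": {...}} objects
    return {
        "selected_per_namespace": {ns: len(entries) for ns, entries in selected.items()},
        "defined_streams": len(streams),
    }


summary = summarize(json.loads(STREAMS_JSON))
print(summary)
```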
2. Selected Streams
The selected_streams section groups streams by their namespace (database name). For example, the configuration might look like this:
"selected_streams": {
  "otter_db": [
    {
      "partition_regex": "{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
      "stream_name": "stream_0",
      "normalization": false,
      "append_only": false,
      "chunk_column": ""
    },
    {
      "partition_regex": "{,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
      "stream_name": "stream_8",
      "normalization": false,
      "append_only": false,
      "chunk_column": ""
    }
  ]
}
Selected Streams Details
Component | Data Type | Example Value | Description & Possible Values |
---|---|---|---|
namespace | string | otter_db | Groups streams that belong to a specific database or logical category |
stream_name | string | "stream_0", "stream_8" | The identifier for the stream. Should match the stream name defined in the stream configurations. |
partition_regex | string | "{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}" or "{,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,}" | A pattern defining how to partition the data. Includes date tokens (e.g., year, month, day) or other markers like language or revision indicators. |
normalization | boolean | false | Determines whether OLake applies level-1 JSON flattening to nested objects. Set to true if you require normalized output; otherwise, use false. |
append_only | boolean | false | Determines whether records can be written to the Iceberg delete file. If set to true, no records are written to the delete file. To learn more about delete files, see Iceberg MOR and COW. |
chunk_column | string | "" | Name of the column used to divide data into chunks for efficient parallel querying and extraction from the database. |
For more information about partition_regex, refer to Iceberg Partition Documentation or S3 Partition Documentation.
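Because each stream_name is expected to match a stream defined in the stream configurations, a small cross-check can catch typos early. The sketch below is an illustration under that assumption; `unmatched_selections` is a hypothetical helper name, and the document is shortened to the keys it needs.

```python
def unmatched_selections(doc: dict) -> list:
    """Return (namespace, stream_name) pairs that are selected but not defined."""
    # Collect every (namespace, name) pair found in the streams array.
    defined = {
        (entry["stream"]["namespace"], entry["stream"]["name"])
        for entry in doc.get("streams", [])
    }
    missing = []
    for namespace, selections in doc.get("selected_streams", {}).items():
        for selection in selections:
            key = (namespace, selection["stream_name"])
            if key not in defined:
                missing.append(key)
    return missing


# Shortened example: stream_0 is selected but has no definition.
doc = {
    "selected_streams": {
        "otter_db": [
            {"stream_name": "stream_0"},
            {"stream_name": "stream_8"},
        ]
    },
    "streams": [
        {"stream": {"name": "stream_8", "namespace": "otter_db"}},
    ],
}

print(unmatched_selections(doc))
```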
3. Streams
The streams section is an array where each element is an object that defines a specific data stream. Each stream object includes a stream key that holds the configuration details. For example, one stream definition looks like this:
{
  "stream": {
    "name": "stream_8",
    "namespace": "otter_db",
    "type_schema": {
      "properties": {
        "_id": {
          "type": ["string"]
        },
        "authors": {
          "type": ["array"]
        },
        "backreferences": {
          "type": ["array"]
        },
        "birth_date": {
          "type": ["string"]
        },
        ...
      }
    },
    "supported_sync_modes": ["full_refresh", "cdc"],
    "source_defined_primary_key": ["_id"],
    "available_cursor_fields": [],
    "sync_mode": "cdc"
  }
}
3.1 Stream Configuration Elements
Component | Data Type | Example Value | Description & Possible Values |
---|---|---|---|
name | string | "stream_8" | Unique identifier for the stream. Each stream should have a unique name. |
namespace | string | "otter_db" | Grouping or database name that the stream belongs to. It helps organize streams by logical or physical data sources. |
type_schema | object | (JSON object with properties) | Defines the structure of the records in the stream. Contains a properties object that maps each field (key) to its allowed data types (e.g., string, integer, array, object). |
supported_sync_modes | array | ["full_refresh", "cdc"] | Lists the synchronization modes the stream supports. Typically includes "full_refresh" for complete reloads and "cdc" (change data capture) for incremental updates. |
source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify a record within the stream. This key is used to ensure data uniqueness and integrity. |
available_cursor_fields | array | [] | Lists fields that can be used to track the synchronization progress. Often empty if no cursors are required or defined. |
sync_mode | string | "cdc" | Indicates the active synchronization mode. This field is set to either "cdc" for change data capture or "full_refresh" when a complete data reload is used. |
4. Type Schema: Properties and Data Types
The type_schema is central to defining what data each stream will handle. Under this schema:
4.1 Properties Object
The properties object is a collection of key-value pairs where:
- Key: Represents the name of the property or field (e.g., "title", "authors", "publish_date").
- Value: Is an object that describes the property's allowed data types.
For example:
"_id": {
  "type": ["string"]
}
This definition means that the _id property must be a string.
A field can allow multiple types. For instance:
"coordinates": {
  "type": [
    "null",
    "object"
  ]
},
This means the coordinates field can hold either null or an object, that is, it is a nullable object field.
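One way to read a multi-type entry is that a value is acceptable when it matches any of the listed types. The sketch below illustrates that reading in Python. The mapping from type names to Python checks is an assumption that covers only the names mentioned in this document (null, string, integer, array, object), and `matches` is a hypothetical helper.

```python
# Assumed mapping from type names used in this document to Python checks.
TYPE_CHECKS = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it for "integer".
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}


def matches(value, property_schema: dict) -> bool:
    """Return True if value matches at least one type listed in the property."""
    type_names = property_schema["type"]
    for name in type_names:
        if name not in TYPE_CHECKS:
            raise ValueError(f"type name not covered by this sketch: {name}")
    return any(TYPE_CHECKS[name](value) for name in type_names)


coordinates_schema = {"type": ["null", "object"]}
id_schema = {"type": ["string"]}

print(matches(None, coordinates_schema))        # null is allowed
print(matches({"lat": 1.5}, coordinates_schema))  # an object is allowed
print(matches("abc", coordinates_schema))       # a string is not
print(matches(None, id_schema))                 # _id does not allow null
```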
5. Synchronization Modes
The streams file specifies how streams should be synchronized:
Component | Description | Possible Values |
---|---|---|
supported_sync_modes | Lists all the modes a stream supports. | full_refresh - (a complete reload of the data) |
sync_mode | Indicates which mode is actively being used for that stream. | full_refresh |
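Since sync_mode names the active mode and supported_sync_modes lists what the stream supports, it is reasonable to check that the first is contained in the second. The following is a minimal sketch of that check; `sync_mode_problem` is a hypothetical helper, and whether OLake itself enforces this rule is not stated here.

```python
from typing import Optional


def sync_mode_problem(entry: dict) -> Optional[str]:
    """Return a message if sync_mode is not among supported_sync_modes, else None."""
    stream = entry["stream"]
    mode = stream.get("sync_mode")
    supported = stream.get("supported_sync_modes", [])
    if mode not in supported:
        return (
            f"{stream['name']}: sync_mode {mode!r} "
            f"is not in supported_sync_modes {supported}"
        )
    return None


# A consistent entry and an inconsistent one, both shortened for illustration.
ok_entry = {
    "stream": {
        "name": "stream_8",
        "supported_sync_modes": ["full_refresh", "cdc"],
        "sync_mode": "cdc",
    }
}
bad_entry = {
    "stream": {
        "name": "stream_0",
        "supported_sync_modes": ["full_refresh"],
        "sync_mode": "cdc",
    }
}

print(sync_mode_problem(ok_entry))
print(sync_mode_problem(bad_entry))
```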
Sample configuration
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
        "stream_name": "stream_0",
        "normalization": false,
        "append_only": false,
        "chunk_column": ""
      },
      {
        "partition_regex": "{,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8",
        "normalization": false,
        "append_only": false,
        "chunk_column": ""
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": { ... },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    },
    // ... other streams
  ]
}