streams.json Configuration
Overall File Structure
The streams.json file is organized into two main sections:
selected_streams: Lists the streams that have been chosen for processing. These are grouped by namespace.
streams: Contains an array of stream definitions. Each stream holds details about its data schema, supported synchronization modes, primary keys, and other metadata.
1. Selected Streams
The selected_streams section groups streams by their namespace (database name). For example, the configuration might look like this:
"selected_streams": {
  "my_db": [
    {
      "partition_regex": "/{dropoff_datetime, year}",
      "stream_name": "table1",
      "normalization": true,
      "append_only": false,
      "filter": "UPDATED_AT >= \"08-JUN-25 07.19.23.690870000 AM\""
    },
    {
      "partition_regex": "",
      "stream_name": "table2",
      "normalization": false,
      "append_only": false,
      "filter": "city = \"London\""
    }
  ]
}
Details about all the fields mentioned in selected streams
Component | Data Type | Example Value | Description |
---|---|---|---|
namespace | string | my_db | Groups streams that belong to a specific database or logical category |
stream_name | string | "table1", "table2" | The identifier for the stream. Should match the stream name defined in the stream configurations. |
partition_regex | string | "/{dropoff_datetime, year}" | A pattern defining how to partition the data. To read more, refer to the Partition Regex Documentation. |
normalization | boolean | true | Determines whether OLake applies level-1 JSON flattening to level-0 nested objects. Set to true if you require normalized output; otherwise, use false. |
append_mode | boolean | false | Set to true to disable upserts in Iceberg. |
filter | string | "UPDATED_AT >= \"08-JUN-25 07.19.23.690870000 AM\"" | Only the data that satisfies the specified condition will be synced. |
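
Because each stream_name is expected to match a stream defined in the streams section, a quick consistency check can catch typos before a sync is started. The following is a minimal sketch, not part of OLake itself: the helper name `selected_without_definition` and the sample document are hypothetical, and it assumes that a selection matches a definition when both the namespace and the name are equal.

```python
def selected_without_definition(config: dict) -> list[tuple[str, str]]:
    """Return (namespace, stream_name) pairs that are selected but have no
    matching entry in the `streams` array.

    Assumption: a selection matches a definition when namespace and name
    are both equal.
    """
    defined = {
        (entry["stream"].get("namespace"), entry["stream"]["name"])
        for entry in config.get("streams", [])
    }
    missing = []
    for namespace, selections in config.get("selected_streams", {}).items():
        for selection in selections:
            key = (namespace, selection["stream_name"])
            if key not in defined:
                missing.append(key)
    return missing


# Hypothetical, trimmed-down document; a real file would be read with json.load().
sample_config = {
    "selected_streams": {
        "my_db": [
            {"stream_name": "table1", "normalization": True},
            {"stream_name": "table2", "normalization": False},
        ]
    },
    "streams": [
        {"stream": {"name": "table1", "namespace": "my_db"}},
    ],
}

print(selected_without_definition(sample_config))
```

In this sample, table2 is selected under my_db but has no definition in streams, so it is the only pair reported.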
2. Streams
The streams section is an array where each element is an object that defines a specific data stream. Each stream object includes a stream key that holds the configuration details. For example, one stream definition looks like this:
{
  "stream": {
    "name": "stream_8",
    "namespace": "olake_db",
    "type_schema": {
      "properties": {
        "_id": {
          "type": ["string"]
        },
        "name": {
          "type": ["string"]
        },
        "marks": {
          "type": ["integer"]
        },
        "updated_at": {
          "type": ["timestamp"]
        },
        ...
      }
    },
    "supported_sync_modes": ["full_refresh", "cdc", "incremental"],
    "source_defined_primary_key": ["_id"],
    "available_cursor_fields": ["_id", "name", "marks", "updated_at"],
    "sync_mode": "incremental",
    "cursor_field": "updated_at"
  }
}
2.1 Stream Configuration Elements
Component | Example Value | Description & Possible Values |
---|---|---|
name | "stream_8" | Unique identifier for the stream. Each stream must have a unique name. |
namespace | "olake_db" | The grouping or database name that the stream belongs to. Helps organize streams by logical or physical data sources. |
type_schema | (JSON object with properties) | Defines the structure of the records in the stream. Contains a properties object that maps each field (key) to its allowed data types (e.g., string, integer, array, object). |
supported_sync_modes | ["full_refresh", "cdc", "incremental", "strict_cdc"] | Lists the synchronization modes the stream supports. Typically includes "full_refresh", "cdc", "strict_cdc" and "incremental". |
source_defined_primary_key | ["_id"] | Specifies the field(s) set as the primary key in the source. |
available_cursor_fields | ["_id", "name", "marks", "updated_at"] | Lists fields that can be used to track synchronization progress in incremental sync mode. |
sync_mode | "incremental" | Indicates the active synchronization mode. Possible values are defined in supported_sync_modes. |
cursor_field | "updated_at" | Defines the cursor field used to track incremental sync. A secondary cursor field can also be specified after the primary one, separated by a colon. To read more, refer to the incremental sync documentation. |
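
The relationships between these fields can be checked mechanically. The sketch below is illustrative only and is not OLake's own validation: the function name `check_stream` is hypothetical, and it assumes that an incremental stream needs a cursor_field and that every colon-separated cursor (primary and secondary) should appear in available_cursor_fields.

```python
def check_stream(stream: dict) -> list[str]:
    """Return a list of human-readable problems found in one `stream` object."""
    problems = []

    # sync_mode should be one of the modes the stream declares as supported.
    mode = stream.get("sync_mode")
    if mode not in stream.get("supported_sync_modes", []):
        problems.append(f"sync_mode {mode!r} is not in supported_sync_modes")

    if mode == "incremental":
        cursor = stream.get("cursor_field", "")
        if not cursor:
            # Assumption: incremental sync cannot run without a cursor.
            problems.append("incremental sync needs a cursor_field")
        available = set(stream.get("available_cursor_fields", []))
        # "primary:secondary" is split on the colon; empty parts are ignored.
        for field in filter(None, cursor.split(":")):
            if field not in available:
                problems.append(
                    f"cursor field {field!r} is not in available_cursor_fields"
                )
    return problems


# The stream object from the example above, without its type_schema.
example_stream = {
    "name": "stream_8",
    "namespace": "olake_db",
    "supported_sync_modes": ["full_refresh", "cdc", "incremental"],
    "source_defined_primary_key": ["_id"],
    "available_cursor_fields": ["_id", "name", "marks", "updated_at"],
    "sync_mode": "incremental",
    "cursor_field": "updated_at",
}

print(check_stream(example_stream))
print(check_stream({**example_stream, "cursor_field": "updated_at:created_at"}))
```

The first call reports no problems; the second flags created_at, since that hypothetical secondary cursor is not listed in available_cursor_fields.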
For more information about partition_regex, refer to the Iceberg Partition Documentation or the S3 Partition Documentation.