streams.json
This document explains the structure and contents of your streams.json file, which is used to configure and manage data streams. It covers the following topics:
- Overall File Structure
- Selected Streams
- Streams and Their Configuration
- Type Schema: Properties and Data Types
- Key-Value Pair Explanation
- Synchronization Modes
Below is a consolidated summary table that captures the key components of the streams.json
file, followed by a single sample configuration file.
Streams Configuration Summary
Component | Data Type | Example Value | Description |
---|---|---|---|
Overall Structure | - | Two sections: selected_streams and streams | The file is divided into a section for selected streams (grouped by namespace) and detailed stream definitions. |
namespace (in selected_streams) | string | otter_db | Groups streams by a logical or physical data source. |
stream_name (in selected_streams) | string | "stream_0" , "stream_8" | Identifies the stream and must match the name used in the streams section. |
partition_regex (in selected_streams) | string | {now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,} or {,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,} | Defines how to partition data using tokens (like dates or revisions) to organize data into folders or segments. |
normalization | boolean | false | Determines whether OLake applies level-1 JSON flattening to nested objects. Set to true if you require normalized output; otherwise, use false . |
name (in streams) | string | "stream_8" | Unique identifier for a stream's configuration. |
namespace (in streams) | string | "otter_db" | Groups the stream definition under a specific database or logical category. |
type_schema | object | JSON schema (e.g., properties _id , authors , etc.) | Describes the structure and allowed data types for records in the stream. |
supported_sync_modes | array | ["full_refresh", "cdc"] | Lists synchronization modes supported by the stream—complete reload or incremental updates (CDC). |
source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify records in the stream. |
available_cursor_fields | array | [] | Lists fields that can track sync progress; typically left empty if not used. |
sync_mode | string | "cdc" | Indicates the active synchronization mode for the stream, either full_refresh or cdc . |
append_only | boolean | false | The append_only flag determines whether records can be written to the iceberg delete file. If set to true, no records will be written to the delete file. Know more about delete file: Iceberg MOR and COW |
Sample Configuration File
{
"selected_streams": {
"otter_db": [
{
"partition_regex": "{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
"stream_name": "stream_0",
"normalization": false,
"append_only": false
},
{
"partition_regex": "{,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
"stream_name": "stream_8",
"normalization": false,
"append_only": false
}
]
},
"streams": [
{
"stream": {
"name": "stream_8",
"namespace": "otter_db",
"type_schema": {
"properties": {
"_id": {
"type": ["string"]
},
"authors": {
"type": ["array"]
},
"backreferences": {
"type": ["array"]
},
"birth_date": {
"type": ["string"]
}
// ... additional fields as defined in your schema
}
},
"supported_sync_modes": [
"full_refresh",
"cdc"
],
"source_defined_primary_key": [<primary_key>],
"available_cursor_fields": [<cursor_field>],
"sync_mode": "cdc"
}
}
// ... additional streams if needed
]
}
Your streams.json
file gets updated (merged) by the discover command (in case new column type or new streams, etc gets added), which generates the latest schema and stream definitions based on your database streams. Refer here for more information.