catalog.json
This document explains the structure and contents of your catalog.json file, which is used to configure and manage data streams. It covers the following topics:
- Overall File Structure
- Selected Streams
- Streams and Their Configuration
- Type Schema: Properties and Data Types
- Key-Value Pair Explanation
- Synchronization Modes
Below is a consolidated summary table that captures the key components of the catalog.json file, followed by a single sample configuration file.
Catalog Configuration Summary
Component | Data Type | Example Value | Description |
---|---|---|---|
Overall Structure | - | Two sections: selected_streams and streams | The file is divided into a section for selected streams (grouped by namespace) and detailed stream definitions. |
namespace (in selected_streams) | string | otter_db | Groups streams by a logical or physical data source. |
stream_name (in selected_streams) | string | "stream_0", "stream_8" | Identifies the stream and must match the name used in the streams section. |
partition_regex (in selected_streams) | string | {now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,} or {,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,} | Defines how to partition data using tokens (like dates or revisions) to organize data into folders or segments. |
name (in streams) | string | "stream_8" | Unique identifier for a stream's configuration. |
namespace (in streams) | string | "otter_db" | Groups the stream definition under a specific database or logical category. |
type_schema | object | JSON schema (e.g., properties _id, authors, etc.) | Describes the structure and allowed data types for records in the stream. |
supported_sync_modes | array | ["full_refresh", "cdc"] | Lists the synchronization modes the stream supports: full_refresh (complete reload) and cdc (incremental updates via change data capture). |
source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify records in the stream. |
available_cursor_fields | array | [] | Lists fields that can track sync progress; typically left empty if not used. |
sync_mode | string | "cdc" | Indicates the active synchronization mode for the stream, either full_refresh or cdc. |
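The relationship between sync_mode and supported_sync_modes in the table can be checked mechanically. The following is a minimal Python sketch, not part of the connector: the helper name `check_sync_mode` is hypothetical, and the rule that the active sync_mode must appear in supported_sync_modes is an assumption inferred from the table descriptions rather than a documented constraint.

```python
# Sync modes named in the summary table.
KNOWN_SYNC_MODES = {"full_refresh", "cdc"}


def check_sync_mode(stream):
    """Return a list of problems found in one stream definition.

    `stream` is the inner object stored under the "stream" key.
    Hypothetical helper; the rules below are assumptions, not connector behavior.
    """
    problems = []
    active = stream.get("sync_mode")
    supported = stream.get("supported_sync_modes", [])

    if active not in KNOWN_SYNC_MODES:
        problems.append(f"unknown sync_mode: {active!r}")
    elif active not in supported:
        problems.append(
            f"sync_mode {active!r} is not in supported_sync_modes {supported}"
        )
    return problems


# Values taken from the stream_8 row examples in the table.
stream_8 = {
    "name": "stream_8",
    "namespace": "otter_db",
    "supported_sync_modes": ["full_refresh", "cdc"],
    "sync_mode": "cdc",
}
print(check_sync_mode(stream_8))  # prints []
```

An empty list means the stream passes both checks; any other result lists what to fix before running a sync.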
Sample Configuration File
catalog.json
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
        "stream_name": "stream_0"
      },
      {
        "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8"
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": {
          "properties": {
            "_id": {
              "type": ["string"]
            },
            "authors": {
              "type": ["array"]
            },
            "backreferences": {
              "type": ["array"]
            },
            "birth_date": {
              "type": ["string"]
            }
            // ... additional fields as defined in your schema
          }
        },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    }
    // ... additional streams if needed
  ]
}
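Because each stream_name in selected_streams must match a name in the streams section, it can be useful to cross-check the two sections before a sync. The sketch below is one way to do that in Python; the function name `unmatched_selected_streams` is hypothetical, and matching on the (namespace, name) pair is an assumption based on the sample layout. Note that the `// ...` placeholder lines in the sample are not valid JSON, so they would need to be removed before a real file could be parsed with `json.load`; the example builds the catalog in memory instead.

```python
def unmatched_selected_streams(catalog):
    """Return (namespace, stream_name) pairs that are selected but not defined.

    Hypothetical helper: assumes a selection matches a definition when both
    the namespace and the stream name are equal.
    """
    defined = {
        (entry["stream"]["namespace"], entry["stream"]["name"])
        for entry in catalog.get("streams", [])
    }
    missing = []
    for namespace, selections in catalog.get("selected_streams", {}).items():
        for selection in selections:
            key = (namespace, selection["stream_name"])
            if key not in defined:
                missing.append(key)
    return missing


# Trimmed-down version of the sample: two selections, one definition.
catalog = {
    "selected_streams": {
        "otter_db": [
            {"stream_name": "stream_0"},
            {"stream_name": "stream_8"},
        ]
    },
    "streams": [
        {"stream": {"name": "stream_8", "namespace": "otter_db"}},
    ],
}
print(unmatched_selected_streams(catalog))  # prints [('otter_db', 'stream_0')]
```

Here stream_0 is reported because the trimmed catalog only defines stream_8; in a complete file, where every selected stream has a definition, the result would be an empty list.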
For more information, refer to the MongoDB Connector catalog file.