catalog.json
This document explains the structure and contents of your catalog.json file, which is used to configure and manage data streams. It covers the following topics:
- Overall File Structure
- Selected Streams
- Streams and Their Configuration
- Type Schema: Properties and Data Types
- Key-Value Pair Explanation
- Synchronization Modes
1. Overall File Structure
The catalog.json file is organized into two main sections:
- selected_streams: Lists the streams that have been chosen for processing. These are grouped by namespace.
- streams: Contains an array of stream definitions. Each stream holds details about its data schema, supported synchronization modes, primary keys, and other metadata.
2. Selected Streams
The selected_streams section groups streams by their namespace (database name). For example, the configuration might look like this:
"selected_streams": {
  "otter_db": [
    {
      "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
      "stream_name": "stream_0"
    },
    {
      "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
      "stream_name": "stream_8"
    }
  ]
}
Key components:
Component | Data Type | Example Value | Description & Possible Values |
---|---|---|---|
namespace | string | "otter_db" | Groups streams that belong to a specific database or logical category. |
stream_name | string | "stream_0", "stream_8" | The identifier for the stream. It should match the name of a stream defined in the streams section. |
partition_regex | string | "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}" or "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}" | A pattern defining how to partition the data. Includes date tokens (e.g., year, month, day) or other markers like language or revision indicators. |
For more information about partition_regex, refer to the S3 partition documentation.
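Because each stream_name should match a stream defined in the stream configurations, a quick consistency check can be useful. The Python sketch below lists selected streams that have no matching definition. The function name `find_unmatched_selected` and the trimmed inline catalog are hypothetical, and the sketch assumes that every element of streams wraps its configuration in a stream key holding name and namespace.

```python
import json

# Hypothetical, trimmed-down catalog used only for illustration.
CATALOG_TEXT = """
{
  "selected_streams": {
    "otter_db": [
      {"partition_regex": "", "stream_name": "stream_0"},
      {"partition_regex": "", "stream_name": "stream_8"}
    ]
  },
  "streams": [
    {"stream": {"name": "stream_8", "namespace": "otter_db"}}
  ]
}
"""


def find_unmatched_selected(catalog: dict) -> list:
    """Return (namespace, stream_name) pairs that are selected but not defined."""
    # Collect every (namespace, name) pair that has a stream definition.
    defined = {
        (entry["stream"].get("namespace"), entry["stream"].get("name"))
        for entry in catalog.get("streams", [])
    }
    unmatched = []
    for namespace, selections in catalog.get("selected_streams", {}).items():
        for selection in selections:
            key = (namespace, selection["stream_name"])
            if key not in defined:
                unmatched.append(key)
    return unmatched


catalog = json.loads(CATALOG_TEXT)
print(find_unmatched_selected(catalog))
```

In this sample, stream_0 is selected but has no definition, so it is the only pair reported.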
3. Streams
The streams section is an array where each element is an object that defines a specific data stream. Each stream object includes a stream key that holds the configuration details. For example, one stream definition looks like this:
{
  "stream": {
    "name": "stream_8",
    "namespace": "otter_db",
    "type_schema": {
      "properties": {
        "_id": {
          "type": ["string"]
        },
        "authors": {
          "type": ["array"]
        },
        "backreferences": {
          "type": ["array"]
        },
        "birth_date": {
          "type": ["string"]
        },
        ...
      }
    },
    "supported_sync_modes": ["full_refresh", "cdc"],
    "source_defined_primary_key": ["_id"],
    "available_cursor_fields": [],
    "sync_mode": "cdc"
  }
}
3.1 Stream Configuration Elements
Component | Data Type | Example Value | Description & Possible Values |
---|---|---|---|
name | string | "stream_8" | Unique identifier for the stream. Each stream should have a unique name. |
namespace | string | "otter_db" | Grouping or database name that the stream belongs to. It helps organize streams by logical or physical data sources. |
type_schema | object | (JSON object with properties) | Defines the structure of the records in the stream. Contains a properties object that maps each field (key) to its allowed data types (e.g., string, integer, array, object). |
supported_sync_modes | array | ["full_refresh", "cdc"] | Lists the synchronization modes the stream supports. Typically includes "full_refresh" for complete reloads and "cdc" (change data capture) for incremental updates. |
source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify a record within the stream. This key is used to ensure data uniqueness and integrity. |
available_cursor_fields | array | [] | Lists fields that can be used to track the synchronization progress. Often empty if no cursors are required or defined. |
sync_mode | string | "cdc" | Indicates the active synchronization mode. This field is set to either "cdc" for change data capture or "full_refresh" when a complete data reload is used. |
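To make the elements in the table concrete, the following Python sketch reads one element of the streams array into a small data class. The class `StreamConfig`, its `from_entry` and `field_names` helpers, and the defaults chosen for missing keys are all hypothetical; this is one possible in-memory representation, not part of the catalog format.

```python
from dataclasses import dataclass, field


@dataclass
class StreamConfig:
    """Hypothetical in-memory view of the "stream" object of one entry."""
    name: str
    namespace: str
    type_schema: dict = field(default_factory=dict)
    supported_sync_modes: list = field(default_factory=list)
    source_defined_primary_key: list = field(default_factory=list)
    available_cursor_fields: list = field(default_factory=list)
    sync_mode: str = ""

    @classmethod
    def from_entry(cls, entry: dict) -> "StreamConfig":
        """Build a StreamConfig from one element of the streams array."""
        stream = entry["stream"]
        return cls(
            name=stream["name"],
            namespace=stream["namespace"],
            type_schema=stream.get("type_schema", {}),
            supported_sync_modes=list(stream.get("supported_sync_modes", [])),
            source_defined_primary_key=list(stream.get("source_defined_primary_key", [])),
            available_cursor_fields=list(stream.get("available_cursor_fields", [])),
            sync_mode=stream.get("sync_mode", ""),
        )

    def field_names(self) -> list:
        """Return the field names declared under type_schema.properties, sorted."""
        return sorted(self.type_schema.get("properties", {}))


# Illustrative entry with only two of the fields from the example definition.
entry = {
    "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": {
            "properties": {
                "_id": {"type": ["string"]},
                "authors": {"type": ["array"]},
            }
        },
        "supported_sync_modes": ["full_refresh", "cdc"],
        "source_defined_primary_key": ["_id"],
        "available_cursor_fields": [],
        "sync_mode": "cdc",
    }
}

config = StreamConfig.from_entry(entry)
print(config.name, config.sync_mode, config.field_names())
```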
4. Type Schema: Properties and Data Types
The type_schema is central to defining what data each stream will handle. Under this schema:
4.1 Properties Object
The properties object is a collection of key-value pairs where:
- Key: Represents the name of the property or field (e.g., "title", "authors", "publish_date").
- Value: Is an object that describes the property's allowed data types.
For example:
"_id": {
  "type": ["string"]
}
This definition means that the _id property must be a string.
4.2 Supported Data Types
The schema supports the data types summarized in the table below, each with a brief description and a sample value:
Data Type | Description | Sample Value |
---|---|---|
Null | Represents a null or missing value. | null |
Int64 | Represents a 64-bit integer value. | 42 |
Float64 | Represents a 64-bit floating point number. | 3.14159 |
String | Represents a text string. | "Hello, world!" |
Bool | Represents a boolean value (true/false). | true |
Object | Represents a JSON object (key-value pairs). | { "name": "Alice", "age": 30 } |
Array | Represents an array of values. | [1, 2, 3] |
Unknown | Represents an unspecified or unrecognized type. | "unknown" |
Timestamp | Represents a timestamp with default precision. | "2025-02-18T10:00:00Z" |
TimestampMilli | Represents a timestamp with millisecond precision (3 decimal places). | "2025-02-18T10:00:00.123Z" |
TimestampMicro | Represents a timestamp with microsecond precision (6 decimal places). | "2025-02-18T10:00:00.123456Z" |
TimestampNano | Represents a timestamp with nanosecond precision (9 decimal places). | "2025-02-18T10:00:00.123456789Z" |
Support for more advanced data types is coming soon.
A field can allow multiple types. For instance:
"coordinates": {
  "type": [
    "null",
    "object"
  ]
},
This means the coordinates field can hold either null or an object.
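The multi-type rule can be illustrated with a short Python sketch that tests a value against the type list of a property. The helper `matches_property` and the `TYPE_CHECKS` mapping are hypothetical, and the mapping covers only the lowercase type names that appear in this document's examples ("null", "string", "integer", "array", "object"); how other type names are spelled in a real catalog is not assumed here.

```python
def _is_integer(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Assumed mapping from type names to Python-side checks (illustrative only).
TYPE_CHECKS = {
    "null": lambda value: value is None,
    "string": lambda value: isinstance(value, str),
    "integer": _is_integer,
    "array": lambda value: isinstance(value, list),
    "object": lambda value: isinstance(value, dict),
}


def matches_property(value, property_schema: dict) -> bool:
    """Return True if value satisfies at least one type listed for the property."""
    allowed = property_schema.get("type", [])
    # Type names missing from TYPE_CHECKS are skipped rather than guessed at.
    return any(TYPE_CHECKS[name](value) for name in allowed if name in TYPE_CHECKS)


coordinates_schema = {"type": ["null", "object"]}

print(matches_property(None, coordinates_schema))
print(matches_property({"lat": 12.5, "lon": 77.1}, coordinates_schema))
print(matches_property("12.5,77.1", coordinates_schema))
```

The first two calls satisfy the schema (null and object are both allowed), while the string in the third call does not.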
5. Synchronization Modes
The catalog specifies how streams should be synchronized:
Component | Description | Possible Values |
---|---|---|
supported_sync_modes | Lists all the modes a stream supports. | full_refresh (a complete reload of the data) or cdc (incremental updates capturing only changes) |
sync_mode | Indicates which mode is actively being used for that stream. | full_refresh or cdc |
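Since sync_mode names the active mode and supported_sync_modes lists what the stream supports, it is reasonable to check that the former appears in the latter before running a sync. The Python sketch below does that; the function `validate_sync_mode` is hypothetical, and treating a mismatch as an error is an assumption rather than documented behavior.

```python
def validate_sync_mode(stream: dict) -> None:
    """Raise ValueError if sync_mode is not listed in supported_sync_modes."""
    supported = stream.get("supported_sync_modes", [])
    active = stream.get("sync_mode")
    if active not in supported:
        raise ValueError(
            f"stream {stream.get('name')!r}: sync_mode {active!r} "
            f"is not one of {supported}"
        )


# Illustrative stream object using the values from this document.
stream = {
    "name": "stream_8",
    "supported_sync_modes": ["full_refresh", "cdc"],
    "sync_mode": "cdc",
}

validate_sync_mode(stream)  # passes silently because "cdc" is supported
print("sync_mode is valid for", stream["name"])
```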
Sample configuration
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
        "stream_name": "stream_0"
      },
      {
        "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8"
      }
    ]
  },
  "streams": [
    {
      "StreamMetadata": {
        "partition_regex": "",
        "stream_name": ""
      },
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": { ... },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    },
    // ... other streams
  ]
}