catalog.json

This document explains the structure and contents of your catalog.json file, which is used to configure and manage data streams. It covers the following topics:

  • Overall File Structure
  • Selected Streams
  • Streams and Their Configuration
  • Type Schema: Properties and Data Types
  • Key-Value Pair Explanation
  • Synchronization Modes

Below is a consolidated summary table that captures the key components of the catalog.json file, followed by a single sample configuration file.

Catalog Configuration Summary

| Component | Data Type | Example Value | Description |
|---|---|---|---|
| Overall Structure | - | Two sections: selected_streams and streams | The file is divided into a section for selected streams (grouped by namespace) and detailed stream definitions. |
| namespace (in selected_streams) | string | "otter_db" | Groups streams by a logical or physical data source. |
| stream_name (in selected_streams) | string | "stream_0", "stream_8" | Identifies the stream and must match the name used in the streams section. |
| partition_regex (in selected_streams) | string | {now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,} or {,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,} | Defines how to partition data using tokens (like dates or revisions) to organize data into folders or segments. |
| name (in streams) | string | "stream_8" | Unique identifier for a stream's configuration. |
| namespace (in streams) | string | "otter_db" | Groups the stream definition under a specific database or logical category. |
| type_schema | object | JSON schema (e.g., properties _id, authors, etc.) | Describes the structure and allowed data types for records in the stream. |
| supported_sync_modes | array | ["full_refresh", "cdc"] | Lists synchronization modes supported by the stream: complete reload (full_refresh) or incremental updates (cdc). |
| source_defined_primary_key | array | ["_id"] | Specifies the field(s) that uniquely identify records in the stream. |
| available_cursor_fields | array | [] | Lists fields that can track sync progress; typically left empty if not used. |
| sync_mode | string | "cdc" | Indicates the active synchronization mode for the stream, either full_refresh or cdc. |
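Two rules in the table lend themselves to an automated check: every stream_name in selected_streams must match a name in the streams section, and sync_mode should be one of the stream's supported_sync_modes. The following is a minimal Python sketch of such a check, not part of OLake itself. The function name check_catalog is hypothetical, and matching on the namespace as well as the name is an assumption made for this illustration.

```python
def check_catalog(catalog: dict) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []

    # Index stream definitions by (namespace, name).
    # Assumption: a selected stream must match on namespace as well as name.
    defined = {}
    for entry in catalog.get("streams", []):
        stream = entry.get("stream", {})
        defined[(stream.get("namespace"), stream.get("name"))] = stream

    # Rule 1: each selected stream_name needs a definition in "streams".
    for namespace, selections in catalog.get("selected_streams", {}).items():
        for selection in selections:
            name = selection.get("stream_name")
            if (namespace, name) not in defined:
                problems.append(
                    f"selected stream {namespace}.{name} has no entry in streams"
                )

    # Rule 2: the active sync_mode must be listed in supported_sync_modes.
    for (namespace, name), stream in defined.items():
        mode = stream.get("sync_mode")
        if mode not in stream.get("supported_sync_modes", []):
            problems.append(
                f"{namespace}.{name}: sync_mode {mode!r} is not in supported_sync_modes"
            )

    return problems


if __name__ == "__main__":
    example = {
        "selected_streams": {"otter_db": [{"stream_name": "stream_8"}]},
        "streams": [
            {
                "stream": {
                    "name": "stream_8",
                    "namespace": "otter_db",
                    "supported_sync_modes": ["full_refresh", "cdc"],
                    "sync_mode": "cdc",
                }
            }
        ],
    }
    print(check_catalog(example))
```

Returning a list of messages instead of raising on the first failure lets you report every mismatch in one pass before starting a sync.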

Sample Configuration File

catalog.json
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
        "stream_name": "stream_0"
      },
      {
        "partition_regex": "{,1999,YY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8"
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": {
          "properties": {
            "_id": {
              "type": ["string"]
            },
            "authors": {
              "type": ["array"]
            },
            "backreferences": {
              "type": ["array"]
            },
            "birth_date": {
              "type": ["string"]
            }
            // ... additional fields as defined in your schema
          }
        },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    }
    // ... additional streams if needed
  ]
}
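To see at a glance what the sample above configures, the stream definitions can be flattened into a short summary. The sketch below uses only Python's standard json module; the function name describe_streams is hypothetical and not part of OLake. Note that the `// ...` placeholder lines in the sample are not valid JSON, so they are left out of the embedded copy here and would need to be removed from a real file before parsing.

```python
import json

# Copy of the sample's "streams" section with the "// ..." placeholders removed,
# since standard JSON does not allow comments.
SAMPLE = """
{
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": {
          "properties": {
            "_id": {"type": ["string"]},
            "authors": {"type": ["array"]},
            "backreferences": {"type": ["array"]},
            "birth_date": {"type": ["string"]}
          }
        },
        "supported_sync_modes": ["full_refresh", "cdc"],
        "source_defined_primary_key": ["_id"],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    }
  ]
}
"""


def describe_streams(catalog: dict) -> dict[str, dict]:
    """Map 'namespace.name' to the stream's sync mode, primary key and field types."""
    summary = {}
    for entry in catalog.get("streams", []):
        stream = entry["stream"]
        properties = stream.get("type_schema", {}).get("properties", {})
        summary[f"{stream['namespace']}.{stream['name']}"] = {
            "sync_mode": stream.get("sync_mode"),
            "primary_key": stream.get("source_defined_primary_key", []),
            # Each property lists its allowed types under "type".
            "fields": {field: spec.get("type", []) for field, spec in properties.items()},
        }
    return summary


if __name__ == "__main__":
    for stream_id, info in describe_streams(json.loads(SAMPLE)).items():
        print(stream_id, info["sync_mode"], info["primary_key"], sorted(info["fields"]))
```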

For more information, refer to the MongoDB Connector catalog file.

