state.json
The state.json
file is a critical component in OLake's synchronization process. It tracks the point (via a cursor token, resume token, or offset) until which data has been processed. This mechanism allows OLake to resume syncing from where it left off, preventing duplicate processing of records.
File Structure
A typical state.json
file has the following structure:
{
"type": "STREAM",
"streams": [
{
"stream": "stream_8",
"namespace": "otter_db",
"sync_mode": "",
"state": {
"_data": "8267B34D61000000022B0429296E1404"
}
},
{
"stream": "stream_0",
"namespace": "otter_db",
"sync_mode": "",
"state": {
"_data": "8267B34D61000000022B0429296E1404"
}
}
]
}
Key Components
Key | Data Type | Description | Sample Value |
---|---|---|---|
type | string | Identifies the type of state stored. Typically, it is set to "STREAM" . | "STREAM" |
streams | array | An array containing state objects for each stream. | [ { ... }, { ... } ] |
stream | string | The unique identifier for the stream whose state is recorded. | "stream_8" or "stream_0" |
namespace | string | The namespace or logical grouping the stream belongs to. | "otter_db" |
sync_mode | string | Indicates the active synchronization mode for the stream. This value may be empty or contain a specific mode. | "" (empty string) or sync modes like "cdc" , "full_refresh" , "incremental" (WIP) |
state | object | Contains the resume token or offset. This token determines the point until which data has been synced. | { "_data": "8267B34D61000000022B0429296E1404" } |
Refer here for more about sync modes.
How It Works
- Resume Token / Offset:
The value stored in thestate
object (in the_data
field) represents the cursor token, resume token (in MongoDB), or offset (in other databases) indicating the last processed record. - Incremental Syncing:
By keeping track of the token, OLake can start the next sync run from this point, ensuring that previously processed records are not re-fetched. - Multiple Streams:
Each stream in thestreams
array maintains its own synchronization state. This allows OLake to handle multiple data sources or partitions independently.
Benefits
- Efficiency:
Incremental synchronization reduces data transfer and processing by only fetching new or changed records. - Data Consistency:
Tracking the synchronization state prevents duplicate processing, ensuring that data remains consistent. - Flexibility:
The state mechanism supports various data sources (e.g., MongoDB with resume tokens, other databases with offsets), making it adaptable to different backend systems.
If you have any further questions or need additional guidance on setting up your state configuration, please refer to the OLake documentation or contact support.