state.json
The state.json file is a critical component in OLake's synchronization process. It tracks the point (via a cursor token, resume token, or offset) up to which data has been processed. This mechanism allows OLake to resume syncing from where it left off, preventing duplicate processing of records.
The layout of the file depends on the source database. This page covers, in order:
- MongoDB
- Postgres
- MySQL
File Structure
A typical state.json file for MongoDB has the following structure:
{
  "type": "STREAM",
  "streams": [
    {
      "stream": "stream_8",
      "namespace": "otter_db",
      "sync_mode": "",
      "state": {
        "_data": "8267B34D61000000022B0429296E1404"
      }
    },
    {
      "stream": "stream_0",
      "namespace": "otter_db",
      "sync_mode": "",
      "state": {
        "_data": "8267B34D61000000022B0429296E1404"
      }
    }
  ]
}
Key Components
Key | Data Type | Description | Sample Value |
---|---|---|---|
type | string | Identifies the type of state stored. Typically, it is set to "STREAM". | "STREAM" |
streams | array | An array containing state objects for each stream. | [ { ... }, { ... } ] |
stream | string | The unique identifier for the stream whose state is recorded. | "stream_8" or "stream_0" |
namespace | string | The namespace or logical grouping the stream belongs to. | "otter_db" |
sync_mode | string | Indicates the active synchronization mode for the stream. This value may be empty or contain a specific mode. | "" (empty string) or sync modes like "cdc", "full_refresh", "incremental" (WIP) |
state | object | Contains the resume token or offset. This token determines the point until which data has been synced. | { "_data": "8267B34D61000000022B0429296E1404" } |
Refer here for more about sync modes.
How It Works
- Resume Token / Offset: The value stored in the state object (in the _data field) represents the cursor token, resume token (in MongoDB), or offset (in other databases) indicating the last processed record.
- Incremental Syncing: By keeping track of the token, OLake can start the next sync run from this point, ensuring that previously processed records are not re-fetched.
- Multiple Streams: Each stream in the streams array maintains its own synchronization state. This allows OLake to handle multiple data sources or partitions independently.
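As a rough illustration of the per-stream tracking described above, the following Python sketch reads the MongoDB sample state and collects the resume token stored for each stream. It is not OLake's implementation; the helper name `resume_tokens` and the `namespace.stream` key format are assumptions made for this example.

```python
import json

# Sample state copied from the MongoDB structure shown earlier on this page.
STATE_JSON = """
{
  "type": "STREAM",
  "streams": [
    {"stream": "stream_8", "namespace": "otter_db", "sync_mode": "",
     "state": {"_data": "8267B34D61000000022B0429296E1404"}},
    {"stream": "stream_0", "namespace": "otter_db", "sync_mode": "",
     "state": {"_data": "8267B34D61000000022B0429296E1404"}}
  ]
}
"""


def resume_tokens(state_text):
    """Map 'namespace.stream' to its stored resume token (hypothetical helper)."""
    state = json.loads(state_text)
    tokens = {}
    for entry in state.get("streams", []):
        key = f'{entry["namespace"]}.{entry["stream"]}'
        # A stream without "_data" has no saved point to resume from yet.
        tokens[key] = entry.get("state", {}).get("_data")
    return tokens


tokens = resume_tokens(STATE_JSON)
for name, token in tokens.items():
    print(name, "->", token)
```

Because each stream carries its own entry, a reader of the file can look up one stream's token without touching the others.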
Benefits
- Efficiency: Incremental synchronization reduces data transfer and processing by only fetching new or changed records.
- Data Consistency: Tracking the synchronization state prevents duplicate processing, ensuring that data remains consistent.
- Flexibility: The state mechanism supports various data sources (e.g., MongoDB with resume tokens, other databases with offsets), making it adaptable to different backend systems.
File Structure
A typical state.json file for Postgres has the following structure:
{
  "type": "GLOBAL",
  "global": {
    "state": {
      "lsn": "0/198C935"
    },
    "streams": [
      "public.sample_data"
    ]
  },
  "streams": [
    {
      "stream": "sample_data",
      "namespace": "public",
      "sync_mode": "",
      "state": {
        "chunks": []
      }
    }
  ]
}
Key Components
Below is a detailed property description table for the provided state.json structure:
Key | Data Type | Description | Sample Value |
---|---|---|---|
type | string | Identifies the type of state stored. In this file, "GLOBAL" indicates that the state applies globally to the replication process. | "GLOBAL" |
global | object | Contains global state information including overall replication progress and a list of globally tracked streams. | { "state": { "lsn": "0/198C935" }, "streams": ["public.sample_data"] } |
global.state | object | Holds the global state details required to resume replication, such as log sequence markers. | { "lsn": "0/198C935" } |
global.state.lsn | string | The Log Sequence Number (LSN) indicating the last processed position in the transaction log. | "0/198C935" |
global.streams | array | An array listing fully-qualified stream identifiers tracked at the global level. | [ "public.sample_data" ] |
streams | array | An array of objects, each representing the state of an individual stream. These objects contain details needed to resume replication for that stream. | [ { "stream": "sample_data", "namespace": "public", "sync_mode": "", "state": { "chunks": [] } } ] |
streams[].stream | string | The identifier (typically a table name) of the stream whose state is being tracked. | "sample_data" |
streams[].namespace | string | Indicates the namespace or schema to which the stream belongs. | "public" |
streams[].sync_mode | string | Specifies the synchronization mode for the stream. It may be left empty if no specific mode is applied. | "" (empty string) |
streams[].state | object | Contains state details specific to the stream, such as information on data segmentation or chunks. | { "chunks": [] } |
streams[].state.chunks | array | An array used to store chunk information, useful for managing segmented or large datasets during replication. | [] |
This table serves as a reference for each property in the state.json file, making it easier for developers to understand and work with the replication state details.
Refer here for more about sync modes.
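To make the global.state.lsn value more concrete, here is a minimal Python sketch that reads it from the Postgres sample state and compares it against another position. A PostgreSQL LSN is written as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position), so it should be compared numerically rather than as a string. The helpers `lsn_to_int` and `already_processed` are hypothetical names, and treating positions at or below the saved LSN as already processed is an assumption of this sketch, not a statement about OLake internals.

```python
import json

# Sample state copied from the Postgres structure shown earlier on this page.
STATE_JSON = """
{
  "type": "GLOBAL",
  "global": {
    "state": {"lsn": "0/198C935"},
    "streams": ["public.sample_data"]
  },
  "streams": [
    {"stream": "sample_data", "namespace": "public", "sync_mode": "",
     "state": {"chunks": []}}
  ]
}
"""


def lsn_to_int(lsn):
    """Convert an LSN such as '0/198C935' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


state = json.loads(STATE_JSON)
saved_lsn = state["global"]["state"]["lsn"]


def already_processed(event_lsn, saved=saved_lsn):
    """Assumption: positions at or before the saved LSN were already synced."""
    return lsn_to_int(event_lsn) <= lsn_to_int(saved)


print(saved_lsn, "=", lsn_to_int(saved_lsn))
print("tracked streams:", state["global"]["streams"])
```

Plain string comparison would order "0/A" after "0/10", even though the first position is numerically smaller, which is why the sketch converts to integers first.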
File Structure
A typical state.json file for MySQL has the following structure:
{
  "type": "STREAM",
  "streams": [
    {
      "stream": "sample_table",
      "namespace": "main",
      "sync_mode": "",
      "state": {
        "binlog_file": "mysql-bin.000003",
        "binlog_position": 1027,
        "chunks": [],
        "server_id": 1000
      }
    }
  ]
}
Key Components
Key | Data Type | Description | Sample Value |
---|---|---|---|
type | string | Identifies the type of state stored. For streaming replication, it is typically set to "STREAM". | "STREAM" |
streams | array | An array that contains one or more stream state objects. Each object represents the replication state for a specific table or partition. | [ { ... } ] |
stream | string | Within each stream object, this specifies the unique identifier (often the table name) whose state is being tracked. | "sample_table" |
namespace | string | Indicates the database name or logical grouping the stream belongs to. | "main" |
sync_mode | string | Represents the active synchronization mode for the stream. It may be left empty or specify a mode such as "cdc". | "" (empty string) |
state | object | Contains the details required to resume replication. This nested object holds the exact point up to which data has been processed. | { "binlog_file": "mysql-bin.000003", "binlog_position": 1027, "chunks": [], "server_id": 1000 } |
binlog_file | string | (Nested in state) The name of the binary log file from which replication will resume. | "mysql-bin.000003" |
binlog_position | integer | (Nested in state) The specific position in the binary log file indicating where to resume. | 1027 |
chunks | array | (Nested in state) An array for storing chunk information, useful for managing segmented or large datasets. | [] |
server_id | integer | (Nested in state) The identifier of the source MySQL server that generated the binary logs, used for ensuring correct replication tracking. | 1000 |
How It Works
- State Tracking: The type field declares the kind of state (here, a streaming state), while the streams array holds one or more stream objects. Each stream object tracks the replication state for a particular table or partition.
- Resuming Synchronization: The state object inside each stream contains fields like binlog_file and binlog_position, which tell the system exactly where to resume data replication. This prevents reprocessing already synced records.
- Handling Data Chunks: The chunks field, although empty in this sample, can be used to manage segmented data, which is useful when handling large datasets.
- Source Identification: The server_id field helps identify which MySQL server’s binary logs are being tracked, ensuring consistency in multi-server replication setups.
Refer here for more about sync modes.
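The way binlog_file and binlog_position together identify a resume point can be sketched in Python as follows. MySQL binary log files carry a numeric suffix that increases as logs rotate, so a position is ordered first by that suffix and then by the offset within the file. The helpers `binlog_key` and `is_after_saved` are hypothetical names for this illustration and are not part of OLake.

```python
import re

# The "state" object of the MySQL sample stream shown earlier on this page.
saved = {
    "binlog_file": "mysql-bin.000003",
    "binlog_position": 1027,
    "chunks": [],
    "server_id": 1000,
}


def binlog_key(binlog_file, binlog_position):
    """Return a sortable (file sequence number, offset) tuple."""
    match = re.search(r"\.(\d+)$", binlog_file)
    if match is None:
        raise ValueError(f"unexpected binlog file name: {binlog_file!r}")
    return (int(match.group(1)), binlog_position)


def is_after_saved(event_file, event_position, state=saved):
    """True if the event lies strictly after the saved resume point."""
    saved_key = binlog_key(state["binlog_file"], state["binlog_position"])
    return binlog_key(event_file, event_position) > saved_key


print(is_after_saved("mysql-bin.000003", 2000))  # later offset, same file
print(is_after_saved("mysql-bin.000002", 99999))  # earlier file
```

Comparing tuples keeps the file rotation order ahead of the byte offset, so a small offset in a newer file still counts as later than a large offset in an older one.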