MongoDB Synchronization Strategies Explained - Pros, Cons, and Practical Tips
MongoDB, a widely used NoSQL database, offers developers significant advantages, particularly its schema-less architecture. This design allows for direct data dumping without the constraints of a fixed schema, providing enhanced flexibility and ease of development.
However, implementing real-time analytics services with MongoDB presents several challenges that require careful management.
In this blog, we will delve into the real-time data synchronization strategies available in MongoDB, as well as the specific challenges that data engineers face in achieving effective real-time data synchronization.
Sync Strategies
Three key strategies are commonly used to keep MongoDB data in sync in near real time. These strategies are typically applied after an initial full load has completed; the full load itself can be executed by splitting the data into streams and running .find() queries for each stream to fetch and load the documents (a minimal sketch of such a full load appears just after the list below). The three strategies are:
- Incremental Based
- Oplogs Based
- Change Streams Based
Note that the oplog-based and change-stream-based strategies do not work with a standalone MongoDB instance; they require a replica set.
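Before looking at each strategy, here is a minimal, hedged sketch of the full load mentioned above, written against the mongo-go-driver v1 API. The connection string and the analysts/datazip namespace are placeholders; in practice you would split the collection into several ranges and run one such .find() scan per stream.
package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	// Placeholder connection string and namespace.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)
	coll := client.Database("analysts").Collection("datazip")

	// Full load: an empty filter scans the entire collection.
	cur, err := coll.Find(ctx, bson.D{})
	if err != nil {
		log.Fatal(err)
	}
	defer cur.Close(ctx)

	for cur.Next(ctx) {
		var doc bson.M
		if err := cur.Decode(&doc); err != nil {
			log.Fatal(err)
		}
		log.Println(doc) // replace with a write to your destination
	}
	if err := cur.Err(); err != nil {
		log.Fatal(err)
	}
}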
Incremental Data Sync
Many data engineers start with an incremental synchronization strategy using a cursor-based approach. This method generally relies on an updated_at column to track and read only the data that has changed since the last sync. It is intuitive, but it comes with some drawbacks and pitfalls, discussed below.
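As a rough sketch of what this looks like with the Go driver (reusing coll and ctx from the full-load example; lastSyncedAt is a checkpoint you would persist between runs, the updated_at field name is an assumption about your schema, and the primitive package is go.mongodb.org/mongo-driver/bson/primitive):
// Read only documents modified since the last checkpoint, oldest first.
filter := bson.D{{"updated_at", bson.D{{"$gt", lastSyncedAt}}}}
opts := options.Find().SetSort(bson.D{{"updated_at", 1}})

cur, err := coll.Find(ctx, filter, opts)
if err != nil {
	log.Fatal(err)
}
defer cur.Close(ctx)

for cur.Next(ctx) {
	var doc bson.M
	if err := cur.Decode(&doc); err != nil {
		log.Fatal(err)
	}
	// Upsert doc into the destination, then advance the checkpoint.
	if ts, ok := doc["updated_at"].(primitive.DateTime); ok {
		lastSyncedAt = ts.Time()
	}
}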
Drawbacks of Incremental Sync Strategy
- Data skew and missed updates: Relying on updated_at may miss changes, particularly if timestamp updates occur at the same instant or system clocks are not perfectly in sync, leading to missed updates or duplicated entries during synchronization.
- Performance Bottlenecks: As the dataset grows, queries filtering on the updated_at column become inefficient, which leads to slower queries and higher latency in the sync process. This is particularly problematic in high-velocity environments where updates arrive at an enormous rate.
- Handling Deletes: The incremental sync method normally tracks only updated or newly inserted records. Deletions in the source database are not inherently captured by this approach, which can leave stale or inconsistent data downstream.
Oplog
Oplog stands for the operation log of MongoDB. It keeps track of all insert, update, and delete operations. To sync data in real time, you iterate over the oplog sequentially (tail it) to capture every change made to the database.
Some Key Points about oplogs:
- It is used by MongoDB to keep all the nodes in a replica set consistent.
- Oplogs are stored in the oplog.rs collection of the local database in MongoDB.
- Oplogs can be tailed.
- It is not available for standalone instances.
An oplog object looks something like this:
{
  "ts": 6642324230736186000,
  "h": -1695843663874728400,
  "v": 2,
  "op": "u",
  "ns": "analysts.datazip",
  "o": {
    "$set": {
      "r": 0
    }
  },
  "o2": {
    "_id": "878262a2f76853482c9bc3c1"
  }
}
Where each field is defined as follows:
field | optional | description |
---|---|---|
ts | NO | BSON Timestamp. Often converted to a 64-bit INT for JSON |
h | YES | Random hash |
v | NO | Oplog protocol version. Default 1. |
op | NO | Type of op. One of i, u, d, n, c |
ns | NO | BSON Namespace string. Serialized as db.collection |
o | NO | Operation applied. Object |
o2 | YES | Object 2. Additional information, usually the operand |
fromMigrate | YES | Boolean. Indicates if this operation is part of a chunk migration between shards |
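To make the "oplogs can be tailed" point concrete, here is a hedged sketch of tailing local.oplog.rs with the Go driver. It reuses client and ctx from the full-load sketch, assumes the primitive and time packages are imported, and uses a made-up checkpoint of one hour ago as the starting position.
// Tail the oplog with a tailable, awaiting cursor, starting after a checkpoint.
oplog := client.Database("local").Collection("oplog.rs")

lastTS := primitive.Timestamp{T: uint32(time.Now().Add(-time.Hour).Unix())} // placeholder checkpoint
filter := bson.D{{"ts", bson.D{{"$gt", lastTS}}}}
opts := options.Find().SetCursorType(options.TailableAwait)

cur, err := oplog.Find(ctx, filter, opts)
if err != nil {
	log.Fatal(err)
}
defer cur.Close(ctx)

for cur.Next(ctx) {
	var entry bson.M
	if err := cur.Decode(&entry); err != nil {
		log.Fatal(err)
	}
	log.Println(entry["op"], entry["ns"]) // apply the change downstream
}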
Change Streams
Change Streams, on the other hand, provide a real-time stream of captured changes from MongoDB to your application. They are a much more convenient and optimized approach than tailing the oplog manually, although internally they are built on top of the oplog. For more info, check the official doc.
Some Key Points about change streams:
- Must be using the WiredTiger storage engine. You can check it with: db.serverStatus().storageEngine // check the storage engine via this command
- A change stream can be opened on any database or even on a single collection, which is not possible when manually tailing oplogs.
- It is easy to open a stream and filter the data based on a resume token and operation time.
- Change streams are available regardless of read concern.
- Can look at the pre-image and post-image of the document:
  - pre-image: the document image before the change is applied
  - post-image: the document image after the change is applied
- Go code to open a change stream:
// Match only events for the user "datazip" or delete operations.
pipeline := mongo.Pipeline{bson.D{{"$match", bson.D{{"$or",
	bson.A{
		bson.D{{"fullDocument.username", "datazip"}},
		bson.D{{"operationType", "delete"}}}}},
}}}

cs, err := coll.Watch(ctx, pipeline)
if err != nil {
	log.Fatal(err)
}
defer cs.Close(ctx)

// Iterate over the stream; cs.Current holds the raw change event document.
for cs.Next(ctx) {
	fmt.Println(cs.Current)
}
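Building on the snippet above, here is a hedged sketch of saving the resume token and re-opening the stream from it after a restart. SetResumeAfter is the driver option for this; where and how you persist the token is up to you.
// After processing an event, remember its resume token somewhere durable.
savedToken := cs.ResumeToken()

// Later (e.g. after a restart): re-open the stream just after that event.
resumed, err := coll.Watch(ctx, pipeline, options.ChangeStream().SetResumeAfter(savedToken))
if err != nil {
	log.Fatal(err)
}
defer resumed.Close(ctx)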
Problems Faced by Data Engineers During Real-Time Synchronization
Achieving real-time data synchronization in MongoDB presents a unique set of challenges for data engineers. These challenges can impact the effectiveness and reliability of data pipelines, making it essential for engineers to identify and address them proactively. Below are some of the primary issues encountered during real-time synchronization:
- Oplog Cursor Mismatch: Discrepancies between the sync cursor and the oplog (for example, when the oplog rolls over before the reader has caught up) can lead to inconsistencies in data capture, resulting in missed changes or duplicated entries.
  Solution: To avoid losing the cursor position, we can use some simple math to size the oplog so that it retains roughly 7 days of operations (a worked sizing example follows this list):
  - Use rs.printReplicationInfo() to get the total oplog size and the number of hours it currently covers.
  - Based on that, calculate the per-hour oplog size and increase the total oplog size accordingly: db.adminCommand({replSetResizeOplog: 1, size: Double(16000)}) // size is in MBs
  - Also increase the oplog retention time if it is set to less than 7 days: db.adminCommand({ getParameter: 1, oplogMinRetentionHours: 1 }) // check the current minimum oplog retention time
- Slow Write Operations: Suboptimal write performance can create bottlenecks in data pipelines, affecting the overall throughput and responsiveness of the system.
  Solution: Use a secondary readPreference while reading from change streams or the oplog, so that reads do not compete with writes on the primary (a short read-preference sketch also follows this list). If you are worried that reading from a secondary might miss changes, you can check the official answer.
- Data Consistency Challenges: Ensuring data consistency across distributed systems is complex, particularly when multiple sources are writing to the same MongoDB instance.
  Solution: The strategies described above address this problem; according to the MongoDB docs, no changes will be missed as long as any of these strategies is applied properly.
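To make the oplog-sizing math above concrete, here is a small hedged sketch; the 10 GB / 20 hour figures are made-up example numbers, not a recommendation.
// Example numbers only: suppose rs.printReplicationInfo() reports a
// 10 GB oplog that currently covers a 20-hour window.
currentOplogMB := 10 * 1024.0 // 10 GB expressed in MB
currentWindowHours := 20.0
targetWindowHours := 7 * 24.0 // we want roughly 7 days of oplog

perHourMB := currentOplogMB / currentWindowHours // ~512 MB of oplog per hour
requiredMB := perHourMB * targetWindowHours      // ~86,016 MB, i.e. about 84 GB
fmt.Printf("resize the oplog to about %.0f MB\n", requiredMB) // value for replSetResizeOplog

And for the read-preference suggestion, a hedged sketch of opening the change stream against a secondary with the Go driver (readpref is go.mongodb.org/mongo-driver/mongo/readpref; client, pipeline, and ctx come from the earlier examples):
// Open the change stream on a collection handle that reads from a secondary,
// keeping analytical reads off the primary.
secondaryColl := client.Database("analysts").Collection("datazip",
	options.Collection().SetReadPreference(readpref.Secondary()))

cs, err := secondaryColl.Watch(ctx, pipeline)
if err != nil {
	log.Fatal(err)
}
defer cs.Close(ctx)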