MongoDB Benchmarks
In the fast-paced world of data management, every second counts. When it comes to syncing massive datasets from MongoDB into a data warehouse or even a lakehouse, you need a tool that is not just reliable but also blazing fast and cost-effective.
This is where Olake comes into the picture.
Speed Comparison: Full Load Performance
For a collection of 230 million rows (664.81 GB) of Twitter data*, here's how Olake compares to leading competitors:
Tool | Full Load Time | Performance |
---|---|---|
Olake | 46 mins | baseline |
Fivetran | 4 hours 39 mins (279 mins) | 6x slower |
Airbyte | 16 hours (960 mins) | 20x slower |
Debezium (Embedded) | 11.65 hours (699 mins) | 15x slower |
Olake is up to 20x faster than competitors like Airbyte, significantly reducing the time and resources required for full data syncs.
No more waiting for hours or even days for your data to be loaded into your warehouse. 46 minutes is all you need to process 230 million rows with Olake.
We used Debezium v2.6.2 to carry out these benchmarks.
Incremental Sync Performance
Testing with 1 million rows (2.88 GB, 999,450 records) across 10 collections showed how efficiently each tool handles incremental syncs:
Tool | Incremental Sync Time | Records per Second (r/s) | Performance |
---|---|---|---|
Olake | 28.3 sec | 35,694 r/s | baseline |
Fivetran | 3 min 10 sec | 5,260 r/s | 6.7x slower |
Airbyte | 12 min 44 sec | 1,308 r/s | 27.3x slower |
Debezium (Embedded) | 12 min 44 sec | 1,308 r/s | 27.3x slower |
Olake processes 1 million records in just 28.3 seconds, achieving 35,694 records per second (r/s): 6.7x faster than Fivetran and 27.3x faster than Airbyte and Debezium (Embedded).
Stability & Reliability
Performance means little without stability. While Airbyte struggled with multiple sync failures during testing, Olake handled larger datasets without a single failure, ensuring a seamless, uninterrupted sync experience. This means less downtime and more reliable data for your business.
Cost Comparison (Considering 230M first full load & 50M incremental rows per month as of 30th Sep)
When it comes to pricing, Olake is not just faster; it's cost-efficient. Here's the breakdown based on a typical use case involving a 230 million-row first full sync and 50 million incremental rows per month:
Tool | First Full Sync Cost | Incremental Sync Cost (Monthly) | Total Monthly Cost | Info | Factor |
---|---|---|---|---|---|
Olake | 10-50 USD | 250 USD | 300 USD | Heavier instance required only for 1-2 hours | baseline |
Fivetran | Free | 6000 USD | 6000 USD | 15 min sync frequency, pricing for 50M rows & standard plan | 20x costlier |
Airbyte | 6000 USD | 1408 USD | 7400 USD | First Load - 1.15 TB data synced | 24.6x costlier |
Debezium MSK Connect + AWS MSK Serverless | - | - | 100 USD + 800 USD = 900 USD | 1.2 TB total data (incremental & first full sync) | 3x costlier |
Olake offers a total cost of just 300 USD per month, compared to 6000 USD for Fivetran and a staggering 7400 USD for Airbyte.
That's 20x more cost-effective than Fivetran and 24x cheaper than Airbyte.
Why Choose Olake?
- Speed: Olake is up to 20x faster than competitors for full data syncs and 27.3x faster for incremental syncs.
- Stability: No failed syncs, no downtime. Olake delivers a reliable experience even for the largest datasets.
- Cost-Effective: At 300 USD per month, Olake is 20x cheaper than Fivetran, 24x more affordable than Airbyte, and 3x cheaper than a Debezium MSK Connect + AWS MSK Serverless setup, all without sacrificing performance.
Testing Infrastructure
The impressive performance metrics of Olake were achieved using a robust infrastructure setup, which included:
- Virtual Machine: Standard_D64as_v5
- CPU: 64 vCPUs
- Memory: 256 GiB RAM
- Storage: 250 GB of shared storage
MongoDB Setup:
- 3 Nodes running in a replica set configuration (a minimal initiation sketch follows below):
  - 1 Primary Node (Master) that handles all write operations.
  - 2 Secondary Nodes (Replicas) that replicate data from the primary node.
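For reference, a replica set of this shape can be initiated with a single admin command. Below is a minimal pymongo sketch; the hostnames, ports, and replica set name are assumptions for illustration, not the exact test configuration.

```python
# Minimal sketch: initiating a 3-node replica set like the test setup.
# Hostnames and the replica set name ("rs0") are illustrative assumptions.
from pymongo import MongoClient

# Connect directly to the node that will seed the replica set.
client = MongoClient("mongodb://mongo-1:27017", directConnection=True)

client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo-1:27017"},  # becomes primary after election
        {"_id": 1, "host": "mongo-2:27017"},  # secondary
        {"_id": 2, "host": "mongo-3:27017"},  # secondary
    ],
})
```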
What We Do Differently
Faster Full Load
- When syncing large collections, instead of pulling everything at once, we split the data into smaller virtual chunks (think of them as manageable pieces of the collection).
- Each chunk is processed in parallel, using an optimal number of threads so the CPU is not overloaded, which drastically speeds up the sync process.
- By breaking up the collection, we avoid bottlenecks and can handle much larger datasets efficiently (see the sketch after this list).
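To make the chunking idea concrete, here is a minimal pymongo sketch that splits a collection into _id ranges and loads them in parallel. The connection string, namespace, chunk count, splitting strategy, and sink are illustrative assumptions, not Olake's actual implementation.

```python
# Minimal sketch: chunked, parallel full load with pymongo.
from concurrent.futures import ThreadPoolExecutor

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["twitter"]["tweets"]                 # assumed namespace

N_CHUNKS = 8  # tune to available cores; too many threads overloads the CPU

def boundary_ids(n_chunks):
    """Pick n_chunks - 1 _id values that split the collection into roughly
    equal ranges. skip() is fine for a sketch; a production splitter would
    use a cheaper strategy (e.g. a $bucketAuto aggregation)."""
    total = coll.estimated_document_count()
    step = max(total // n_chunks, 1)
    ids = []
    for i in range(1, n_chunks):
        cursor = coll.find({}, {"_id": 1}).sort("_id", 1).skip(i * step).limit(1)
        ids.extend(doc["_id"] for doc in cursor)
    return ids

def load_chunk(bounds):
    """Stream one _id range; a lo/hi of None marks an open end."""
    lo, hi = bounds
    query = {}
    if lo is not None:
        query["_id"] = {"$gte": lo}
    if hi is not None:
        query.setdefault("_id", {})["$lt"] = hi
    n = 0
    for doc in coll.find(query):
        n += 1  # write `doc` to the warehouse/lakehouse sink here
    return n

ids = boundary_ids(N_CHUNKS)
chunks = list(zip([None] + ids, ids + [None]))  # (None, id0), ..., (idN, None)
with ThreadPoolExecutor(max_workers=N_CHUNKS) as pool:
    print(sum(pool.map(load_chunk, chunks)), "documents loaded")
```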
Efficient Incremental Sync
- For ongoing data updates, we use MongoDB's change streams (a real-time feed of changes in the database) to parallelize syncing for each collection.
- This means multiple collections can be updated simultaneously, making the process faster and ensuring near real-time sync: your data stays fresh with minimal lag (see the sketch below).
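As a concrete illustration, here is a minimal pymongo sketch of that pattern: one change-stream listener per collection, each on its own thread. The collection names and the apply_change sink are hypothetical.

```python
# Minimal sketch: parallel change-stream listeners, one per collection.
from concurrent.futures import ThreadPoolExecutor

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["twitter"]

def apply_change(name, change):
    """Hypothetical sink: upsert/delete in the warehouse based on the event."""
    op = change["operationType"]          # insert / update / replace / delete
    print(name, op, change["documentKey"])

def tail_collection(name):
    # full_document="updateLookup" returns the whole post-update document,
    # not just the changed fields; stream.resume_token allows safe restarts.
    with db[name].watch(full_document="updateLookup") as stream:
        for change in stream:  # long-lived: runs until interrupted
            apply_change(name, change)

collections = ["tweets", "users", "hashtags"]  # illustrative names
with ThreadPoolExecutor(max_workers=len(collections)) as pool:
    pool.map(tail_collection, collections)
```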
Optimized Data Pull
- Instead of directly transforming data from MongoDB during extraction, we first pull it in its native BSON format (Binary JSON, MongoDB's data format, which stores JSON-like documents more efficiently).
- Once we have the data, we decode it on the ETL side. This reduces the workload on MongoDB itself and allows us to pull data faster, improving ingestion speed (see the sketch below).
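Here is a minimal pymongo sketch of this pattern: passing document_class=RawBSONDocument makes the driver hand back undecoded BSON bytes, so decoding can be deferred to the ETL workers. The connection string and namespace are illustrative assumptions.

```python
# Minimal sketch: pull raw BSON and decode on the ETL side.
import bson
from bson.raw_bson import RawBSONDocument
from pymongo import MongoClient

# document_class=RawBSONDocument makes find() yield undecoded BSON bytes.
client = MongoClient("mongodb://localhost:27017",
                     document_class=RawBSONDocument)
coll = client["twitter"]["tweets"]

for raw_doc in coll.find().limit(10):
    payload = raw_doc.raw        # raw BSON bytes, shipped as-is by MongoDB
    doc = bson.decode(payload)   # decoded later, on the ETL worker
    # Flatten/transform `doc` here before writing to the warehouse sink.
```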
*Twitter dataset sourced from Archive.org (this JSON dataset has 4 levels of nesting).