MongoDB Benchmarks

In the fast-paced world of data management, every second counts. When it comes to syncing massive datasets from MongoDB into a data warehouse or even a lakehouse, you need a tool that is not just reliable but also blazing fast and cost-effective.

This is where OLake comes into the picture.

Speed Comparison: Full Load Performance

For a collection of 230 million rows (664.81 GB) of Twitter data*, here's how OLake compares to leading competitors:

Tool                | Full Load Time             | Performance
OLake               | 46 mins                    | Baseline
Fivetran            | 4 hours 39 mins (279 mins) | 6x slower
Airbyte             | 16 hours (960 mins)        | 20x slower
Debezium (Embedded) | 11.65 hours (699 mins)     | 15x slower

OLake is up to 20x faster than competitors like Airbyte, significantly reducing the time and resources required for full data syncs.

No more waiting for hours or even days for your data to be loaded into your warehouse: 46 minutes is all you need to process 230 million rows with OLake.

note

We used Debezium Server v2.6.2 to carry out these benchmarks.

Incremental Sync Performance

Testing with roughly 1 million rows (2.88 GB, 999,450 records) across 10 collections showed how efficiently each tool handles incremental syncs:

Tool                | Incremental Sync Time | Records per Second (r/s) | Performance
OLake               | 28.3 sec              | 35,694 r/s               | Baseline
Fivetran            | 3 min 10 sec          | 5,260 r/s                | 6.7x slower
Airbyte             | 12 min 44 sec         | 1,308 r/s                | 27.3x slower
Debezium (Embedded) | 12 min 44 sec         | 1,308 r/s                | 27.3x slower

OLake processes 1 million records in just 28.3 seconds, achieving 35,694 records per second (r/s): 6.7x faster than Fivetran and 27.3x faster than Airbyte and Debezium (Embedded).

Stability & Reliability

Performance means little without stability. While Airbyte struggled with multiple sync failures during testing, OLake handled larger datasets without a single failed sync, ensuring a seamless, uninterrupted experience. This means less downtime and more reliable data for your business.

Cost Comparison (230M-row first full load and 50M incremental rows per month, as of 30th Sep)

When it comes to pricing, OLake is not just faster; it's also more cost-efficient. Here's the breakdown based on a typical use case involving a 230 million-row first full sync and 50 million incremental rows per month:

Tool                                      | First Full Sync Cost | Incremental Sync Cost (Monthly) | Total Monthly Cost          | Info                                                         | Factor
OLake                                     | 10-50 USD            | 250 USD                         | 300 USD                     | Heavier instance required only for 1-2 hours                | Baseline
Fivetran                                  | Free                 | 6000 USD                        | 6000 USD                    | 15 min sync frequency, pricing for 50M rows & standard plan | 20x costlier
Airbyte                                   | 6000 USD             | 1408 USD                        | 7400 USD                    | First load: 1.15 TB data synced                             | 24.6x costlier
Debezium MSK Connect + AWS MSK Serverless | -                    | -                               | 100 USD + 800 USD = 900 USD | 1.2 TB total data (incremental & first full sync)           | 3x costlier

OLake offers a total cost of just 300 USD per month, compared to 6000 USD for Fivetran and a staggering 7400 USD for Airbyte.

That's 20x more cost-effective than Fivetran and 24x cheaper than Airbyte.

Why Choose OLake?

  • Speed: OLake is up to 20x faster than competitors for full data syncs and 27.3x faster for incremental syncs.
  • Stability: No failed syncs, no downtime. OLake delivers a reliable experience even for the largest datasets.
  • Cost-Effective: At 300 USD per month, OLake is 20x cheaper than Fivetran and 24x more affordable than Airbyte, with 3x savings against a Debezium MSK Connect + AWS MSK Serverless setup, all without sacrificing performance.

Testing Infrastructure

The impressive performance metrics of OLake were achieved using a robust infrastructure setup, which included:

  • Virtual Machine: Standard_D64as_v5
  • CPU: 64 vCPUs
  • Memory: 256 GiB RAM
  • Storage: 250 GB of shared storage

MongoDB Setup:

  • 3 Nodes running in a replica set configuration:
    • 1 Primary Node (Master) that handles all write operations.
    • 2 Secondary Nodes (Replicas) that replicate data from the primary node (see the connection sketch below).
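
For context, here is a minimal sketch of how a client connects to a replica set like the one above, assuming the official Go driver (go.mongodb.org/mongo-driver) and hypothetical hostnames mongo-1/mongo-2/mongo-3 with replica set name rs0. It is not part of OLake; it only illustrates the topology.

```go
// Minimal replica-set connection sketch; hostnames and the replica set name
// ("rs0") are hypothetical placeholders, not the benchmark's actual values.
package main

import (
	"context"
	"fmt"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

func main() {
	ctx := context.Background()

	// List all three members plus the replica set name; the driver discovers
	// the current primary and routes writes to it automatically.
	uri := "mongodb://mongo-1:27017,mongo-2:27017,mongo-3:27017/?replicaSet=rs0"

	client, err := mongo.Connect(ctx, options.Client().
		ApplyURI(uri).
		SetReadPreference(readpref.SecondaryPreferred())) // bulk reads may be served by the replicas
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	if err := client.Ping(ctx, nil); err != nil {
		panic(err)
	}
	fmt.Println("connected to the replica set")
}
```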

What We Do Differently

Faster Full Load

  • When syncing large collections, instead of pulling everything at once, we split the data into smaller virtual chunks (think of them as manageable pieces of the collection).
  • Each chunk is processed in parallel (with an optimal number of threads, so the CPU is not overloaded), which drastically speeds up the sync process.
  • By breaking up the collection, we avoid bottlenecks and can handle much larger datasets efficiently (see the sketch below).
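
As a concrete illustration, here is a minimal sketch of chunked, parallel full load using the official Go driver (go.mongodb.org/mongo-driver). The database and collection names, the _id chunk boundaries, and the worker wiring are assumptions made for the example; this is not OLake's actual implementation.

```go
// Sketch of a chunked full load: the collection is split into _id ranges and
// each range is read by its own goroutine. Names and boundaries are placeholders.
package main

import (
	"context"
	"fmt"
	"sync"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// chunk is one virtual slice of the collection, bounded by _id values.
// In practice the boundaries would be computed from the _id distribution
// (e.g. ObjectIDs); plain ints are used here only for illustration.
type chunk struct {
	min, max interface{}
}

// syncChunk reads one _id range and forwards the raw documents downstream.
func syncChunk(ctx context.Context, coll *mongo.Collection, c chunk, out chan<- bson.Raw) error {
	bounds := bson.M{"$gte": c.min}
	if c.max != nil {
		bounds["$lt"] = c.max
	}
	cur, err := coll.Find(ctx, bson.M{"_id": bounds})
	if err != nil {
		return err
	}
	defer cur.Close(ctx)
	for cur.Next(ctx) {
		// Copy the buffer because the cursor reuses it on the next iteration.
		out <- bson.Raw(append([]byte{}, cur.Current...))
	}
	return cur.Err()
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)
	coll := client.Database("twitter").Collection("tweets") // hypothetical names

	chunks := []chunk{
		{min: 0, max: 1_000_000},
		{min: 1_000_000, max: 2_000_000},
		{min: 2_000_000}, // open-ended last chunk
	}

	out := make(chan bson.Raw, 1024)
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c chunk) { // one worker per chunk; a real pipeline caps concurrency
			defer wg.Done()
			if err := syncChunk(ctx, coll, c, out); err != nil {
				fmt.Println("chunk failed:", err)
			}
		}(c)
	}
	go func() { wg.Wait(); close(out) }()

	count := 0
	for range out {
		count++ // a real pipeline would write these documents to the warehouse
	}
	fmt.Println("documents synced:", count)
}
```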

Efficient Incremental Sync

  • For ongoing data updates, we use MongoDB's change streams (a real-time feed of changes in the database) to parallelize syncing for each collection.
  • This means multiple collections can be updated simultaneously, keeping your data fresh with near real-time sync and minimal lag (see the sketch below).
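
A minimal sketch of this pattern with the official Go driver is shown below: each collection gets its own change stream, consumed by its own goroutine. The collection names, the connection URI, and the event handling are assumptions for the example, not OLake's actual code (change streams require MongoDB to run as a replica set, as in the setup above).

```go
// Sketch of parallel incremental sync: one change stream per collection, each
// tailed by its own goroutine. Collection names and the URI are placeholders.
package main

import (
	"context"
	"fmt"
	"sync"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// tailCollection opens a change stream on one collection and forwards every
// insert/update/delete event downstream as raw BSON.
func tailCollection(ctx context.Context, db *mongo.Database, name string, out chan<- bson.Raw) error {
	stream, err := db.Collection(name).Watch(ctx, mongo.Pipeline{})
	if err != nil {
		return err
	}
	defer stream.Close(ctx)
	for stream.Next(ctx) {
		out <- bson.Raw(append([]byte{}, stream.Current...))
	}
	return stream.Err()
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017/?replicaSet=rs0"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)
	db := client.Database("twitter")

	collections := []string{"tweets", "users", "hashtags"} // hypothetical names
	events := make(chan bson.Raw, 1024)

	var wg sync.WaitGroup
	for _, name := range collections {
		wg.Add(1)
		go func(name string) { // each collection is tailed concurrently
			defer wg.Done()
			if err := tailCollection(ctx, db, name, events); err != nil {
				fmt.Println(name, "stream ended:", err)
			}
		}(name)
	}
	go func() { wg.Wait(); close(events) }()

	for ev := range events {
		// A real pipeline would batch these events and apply them to the destination.
		fmt.Println("received change event of", len(ev), "bytes")
	}
}
```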

Optimized Data Pull

  • Instead of directly transforming data from MongoDB during extraction, we first pull it in its native BSON format (Binary JSON, MongoDB's data format, which stores JSON-like documents more efficiently).
  • Once we have the data, we decode it on the ETL side. This reduces the workload on MongoDB itself and allows us to pull data faster, improving ingestion speed (a sketch follows below).
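
The sketch below shows the two phases with the official Go driver: during extraction, documents stay as raw BSON bytes; decoding into structured values happens afterwards, away from MongoDB. The names and the in-memory hand-off are simplifications for illustration, not OLake's actual code.

```go
// Sketch of the "pull raw, decode later" idea: extraction keeps documents as
// raw BSON, and unmarshalling happens in a separate ETL-side step.
package main

import (
	"context"
	"fmt"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("twitter").Collection("tweets") // hypothetical names
	cur, err := coll.Find(ctx, bson.M{})
	if err != nil {
		panic(err)
	}
	defer cur.Close(ctx)

	// Extraction phase: MongoDB only streams bytes; no per-field decoding here.
	var rawDocs []bson.Raw
	for cur.Next(ctx) {
		rawDocs = append(rawDocs, bson.Raw(append([]byte{}, cur.Current...)))
	}
	if err := cur.Err(); err != nil {
		panic(err)
	}

	// Decode phase (ETL side): unmarshal only when field-level access is needed,
	// e.g. to flatten the 4 levels of nesting in the Twitter dataset.
	for _, raw := range rawDocs {
		var doc bson.M
		if err := bson.Unmarshal(raw, &doc); err != nil {
			panic(err)
		}
		fmt.Println("decoded document with", len(doc), "top-level fields")
	}
}
```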

*Twitter dataset - Archive.org (This JSON dataset has 4 levels of complex nesting).


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: this is where we discuss future roadmaps, report bugs, help folks debug issues they're facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!