Benchmarks
PostgreSQL to Apache Iceberg Connector Benchmark
(OLake vs. Popular Data-Movement Tools)
1. Speed Comparison: Full-Load Performance
Tool | Rows Synced | Throughput (rows / sec) | Relative to OLake |
---|---|---|---|
OLake | 4.01 B | 46,262 RPS | Baseline |
Fivetran | 4.01 B | 46,395 RPS | Parity (≤1 % faster) |
Debezium (memiiso) | 1.28 B | 14,839 RPS | 3.1 × slower |
Estuary | 0.34 B | 3,982 RPS | 11.6 × slower¹ |
Airbyte Cloud | 12.7 M | 457 RPS | 101 × slower |
¹ Estuary ran the same 24-hour window but processed a ~10× smaller dataset, so its throughput looks even lower when normalized.
- Every tool was given the same 24-hour window. OLake, Fivetran, Debezium and Estuary ran for the full window, and the Rows Synced column shows how much each one completed in that time. Airbyte's sync failed after 7.5 hours, so its throughput covers only that first part of the test.
Key takeaway: OLake sustains the same top-tier bulk-load speed as Fivetran while outpacing every other open-source option by 3× to 100×.
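The Relative to OLake column follows directly from the throughput figures. As a minimal sketch, assuming only the numbers published in the table above (the function name `slowdown_vs_baseline` is illustrative and not part of OLake), the factors can be recomputed like this:

```python
# Throughput values (rows per second) copied from the full-load table above.
FULL_LOAD_RPS = {
    "OLake": 46_262,
    "Fivetran": 46_395,
    "Debezium (memiiso)": 14_839,
    "Estuary": 3_982,
    "Airbyte Cloud": 457,
}


def slowdown_vs_baseline(rps: dict, baseline: str = "OLake") -> dict:
    """Return baseline throughput divided by each tool's throughput.

    Values above 1.0 mean the tool is slower than the baseline;
    values below 1.0 mean it is faster.
    """
    base = rps[baseline]
    return {tool: base / value for tool, value in rps.items()}


if __name__ == "__main__":
    for tool, factor in slowdown_vs_baseline(FULL_LOAD_RPS).items():
        print(f"{tool:<20} {factor:6.1f}x")
```

A factor just under 1.0 for Fivetran corresponds to the "parity" entry in the table.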
2. Speed Comparison: Change-Data-Capture (CDC)
Tool | CDC Window | Throughput (rows / sec) | Relative to OLake |
---|---|---|---|
OLake | 22.5 min | 36,982 RPS | Baseline |
Fivetran | 31 min | 26,910 RPS | 1.4 × slower |
Debezium (memiiso) | 60 min | 13,808 RPS | 2.7 × slower |
Estuary | 4.5 h | 3,085 RPS | 12 × slower |
Airbyte Cloud | 23 h | 585 RPS | 63 × slower |
Every tool processed the same 50 million changes in the CDC test; OLake finished in 22.5 minutes. The CDC Window column shows how long each tool needed for that same change set.
Key takeaway: For incremental workloads OLake leads the pack, moving 50 million PostgreSQL changes into Iceberg about 40 % faster than Fivetran and roughly 3× to 60× faster than the other tools tested.
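Because every tool replayed the same 50 million changes, the throughput column is simply the change count divided by the CDC window. The sketch below assumes the windows exactly as printed in the table; since those are rounded, the results land within a few percent of the published throughput rather than matching it digit for digit. The helper name `rows_per_second` is illustrative.

```python
CDC_CHANGES = 50_000_000  # changes replayed in the CDC test, per the text above

# CDC windows from the table, expressed in seconds.
CDC_WINDOW_SECONDS = {
    "OLake": 22.5 * 60,
    "Fivetran": 31 * 60,
    "Debezium (memiiso)": 60 * 60,
    "Estuary": 4.5 * 3600,
    "Airbyte Cloud": 23 * 3600,
}


def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput for `rows` changes applied over `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return rows / seconds


if __name__ == "__main__":
    for tool, seconds in CDC_WINDOW_SECONDS.items():
        rate = rows_per_second(CDC_CHANGES, seconds)
        print(f"{tool:<20} {rate:>10,.0f} rows/sec")
```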
3. Cost Comparison (Vendor List Prices)
Tool | Scenario | Spend (USD) | Rows Synced |
---|---|---|---|
OLake | Full Load / CDC | < $75 (one Standard D64ls v5 VM, 64 vCPUs, 128 GiB memory, running for 24 hours) | 4.01 B / 50 M |
Fivetran | Full Load | $0 (free full sync) | 4.01 B |
Estuary | Full Load | $1,668 | 0.34 B |
Airbyte Cloud | Full Load | $5,560 | 12.7 M |
Fivetran | CDC | $2,375.80 | 50 M |
Estuary | CDC | $17.63 | 50 M |
Airbyte Cloud | CDC | $148.95 | 50 M |
- OLake is open-source and can be deployed on your own Kubernetes cluster or cloud VMs; you pay only for the compute and storage you provision.
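The full-load rows in the cost table cover very different row counts, so raw spend is hard to compare. One way to put the entries on a common footing is cost per million rows. This is a sketch using only the table's figures; the normalization itself and the name `cost_per_million_rows` are our own illustration, not a metric published with the benchmark. OLake is left out because its row gives only an upper bound for a VM shared by both scenarios.

```python
# (tool, scenario, spend in USD, rows synced) copied from the cost table above.
COST_ROWS = [
    ("Fivetran", "Full Load", 0.00, 4_010_000_000),
    ("Estuary", "Full Load", 1_668.00, 340_000_000),
    ("Airbyte Cloud", "Full Load", 5_560.00, 12_700_000),
    ("Fivetran", "CDC", 2_375.80, 50_000_000),
    ("Estuary", "CDC", 17.63, 50_000_000),
    ("Airbyte Cloud", "CDC", 148.95, 50_000_000),
]


def cost_per_million_rows(spend_usd: float, rows: int) -> float:
    """Spend divided by the number of rows, scaled to one million rows."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    return spend_usd / (rows / 1_000_000)


if __name__ == "__main__":
    for tool, scenario, spend, rows in COST_ROWS:
        unit = cost_per_million_rows(spend, rows)
        print(f"{tool:<14} {scenario:<10} ${unit:,.2f} per million rows")
```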
Dataset and Table Schemas
Please refer to this GitHub repository for the dataset we used to conduct these benchmarks.
We first performed a full-load sync of empty dummy tables. Afterwards, we inserted the top 25 million records from both trips and fhv_trips into these tables and ran a CDC sync.
trips table
CREATE TABLE trips (
    id bigserial NOT NULL,
    cab_type_id int4 NULL,
    vendor_id int4 NULL,
    pickup_datetime timestamp NULL,
    dropoff_datetime timestamp NULL,
    store_and_fwd_flag bool NULL,
    rate_code_id int4 NULL,
    pickup_longitude numeric NULL,
    pickup_latitude numeric NULL,
    dropoff_longitude numeric NULL,
    dropoff_latitude numeric NULL,
    passenger_count int4 NULL,
    trip_distance numeric NULL,
    fare_amount numeric NULL,
    extra numeric NULL,
    mta_tax numeric NULL,
    tip_amount numeric NULL,
    tolls_amount numeric NULL,
    ehail_fee numeric NULL,
    improvement_surcharge numeric NULL,
    congestion_surcharge numeric NULL,
    airport_fee numeric NULL,
    total_amount numeric NULL,
    payment_type int4 NULL,
    trip_type int4 NULL,
    pickup_nyct2010_gid int4 NULL,
    dropoff_nyct2010_gid int4 NULL,
    pickup_location_id int4 NULL,
    dropoff_location_id int4 NULL,
    CONSTRAINT trips_pkey PRIMARY KEY (id)
);
fhv_trips table
CREATE TABLE fhv_trips (
    id bigserial NOT NULL,
    hvfhs_license_num text NULL,
    dispatching_base_num text NULL,
    originating_base_num text NULL,
    request_datetime timestamp NULL,
    on_scene_datetime timestamp NULL,
    pickup_datetime timestamp NULL,
    dropoff_datetime timestamp NULL,
    pickup_location_id int4 NULL,
    dropoff_location_id int4 NULL,
    trip_miles numeric NULL,
    trip_time numeric NULL,
    base_passenger_fare numeric NULL,
    tolls numeric NULL,
    black_car_fund numeric NULL,
    sales_tax numeric NULL,
    congestion_surcharge numeric NULL,
    airport_fee numeric NULL,
    tips numeric NULL,
    driver_pay numeric NULL,
    shared_request bool NULL,
    shared_match bool NULL,
    access_a_ride bool NULL,
    wav_request bool NULL,
    wav_match bool NULL,
    legacy_shared_ride int4 NULL,
    affiliated_base_num text NULL,
    CONSTRAINT fhv_trips_pkey PRIMARY KEY (id)
);
We used AWS Glue as the Iceberg catalog and AWS S3 as the storage layer on the destination side for these benchmarks.
Bottom line: If you need to land terabytes of PostgreSQL data into Apache Iceberg quickly, and keep it continually up to date, OLake delivers enterprise-grade speed without the enterprise-grade bill.
MongoDB Benchmarks
In the fast-paced world of data management, every second counts. When it comes to syncing massive datasets from MongoDB into a data warehouse or even a lakehouse, you need a tool that is not just reliable but also blazing fast and cost-effective.
This is where OLake comes into the picture.
Speed Comparison: Full Load Performance
For a collection of 230 million rows (664.81 GB) from Twitter data*, here's how OLake compares to leading competitors:
Tool | Full Load Time | Performance |
---|---|---|
OLake | 46 mins | Baseline |
Fivetran | 4 hours 39 mins (279 mins) | 6x slower |
Airbyte | 16 hours (960 mins) | 20x slower |
Debezium (Embedded) | 11.65 hours (699 mins) | 15x slower |
OLake is up to 20x faster than competitors like Airbyte, significantly reducing the time and resources required for full data syncs.
No more waiting for hours or even days for your data to be loaded into your warehouse. 46 minutes is all you need to process 230 million rows with OLake.
We used Debezium Server v2.6.2 for these benchmarks.
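The full-load times can also be read as sustained rates. The sketch below derives rows per second and MB per second from the dataset size and load times quoted above. It assumes decimal units (1 GB = 1000 MB), which the source does not specify, and the function name `full_load_rates` is illustrative.

```python
# Dataset size and full-load times copied from the text and table above.
DATASET_ROWS = 230_000_000
DATASET_GB = 664.81

FULL_LOAD_MINUTES = {
    "OLake": 46,
    "Fivetran": 279,
    "Airbyte": 960,
    "Debezium (Embedded)": 699,
}


def full_load_rates(minutes: float) -> tuple:
    """Return (rows per second, MB per second) for one full load.

    Assumes decimal units: 1 GB = 1000 MB.
    """
    seconds = minutes * 60
    return DATASET_ROWS / seconds, DATASET_GB * 1000 / seconds


if __name__ == "__main__":
    for tool, minutes in FULL_LOAD_MINUTES.items():
        rows_s, mb_s = full_load_rates(minutes)
        print(f"{tool:<20} {rows_s:>10,.0f} rows/s {mb_s:>8.1f} MB/s")
```

Dividing any tool's load time by OLake's 46 minutes gives the slowdown factors in the Performance column (for example, 279 / 46 is about 6).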
CDC Sync Performance
A CDC test with about 1 million rows (2.88 GB, 999,450 records) across 10 collections produced the following results:
Tool | CDC Sync Time | Records per Second (r/s) | Performance |
---|---|---|---|
OLake | 28.3 sec | 35,694 r/s | Baseline |
Fivetran | 3 min 10 sec | 5,260 r/s | 6.7x slower |
Airbyte | 12 min 44 sec | 1,308 r/s | 27.3x slower |
Debezium (Embedded) | 12 min 44 sec | 1,308 r/s | 27.3x slower |
OLake processes 1 million records in just 28.3 seconds, achieving 35,694 records per second (r/s), which is 6.7x faster than Fivetran and 27.3x faster than Airbyte and Debezium (Embedded).
Cost Comparison (Considering 230M first full load & 50M CDC rows per month as of 30th Sep)
When it comes to pricing, OLake is not just faster; it's cost-efficient. Here's the breakdown based on a typical use case involving a 230 million-row first full sync and 50 million CDC rows per month:
Tool | First Full Sync Cost | CDC Sync Cost (Monthly) | Total Monthly Cost | Info | Factor |
---|---|---|---|---|---|
OLake | 10-50 USD | 250 USD | 300 USD | Heavier instance required only for 1-2 hours | Baseline |
Fivetran | Free | 6000 USD | 6000 USD | 15 min sync frequency, pricing for 50M rows & standard plan | 20x costlier |
Airbyte | 6000 USD | 1408 USD | 7400 USD | First Load - 1.15 TB data synced | 24.6x costlier |
Debezium MSK connect + AWS MSK serverless | - | - | 100 USD + 800 USD = 900 USD | 1.2 TB total data (CDC & first full sync) | 3x costlier |
OLake offers a total cost of just 300 USD per month, compared to 6000 USD for Fivetran and a staggering 7400 USD for Airbyte.
That's 20x more cost-effective than Fivetran and 24x cheaper than Airbyte.
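The Factor column is each tool's total monthly cost divided by OLake's. As a rough sketch using the table's figures (taking the upper end of OLake's 10-50 USD full-sync range, and the unrounded Airbyte sum of 7,408 USD that the table rounds to 7,400), with `cost_factor` as an illustrative name:

```python
# Total monthly cost in USD, built from the table above.
# OLake uses the upper end (50 USD) of its 10-50 USD first-sync range.
TOTAL_USD = {
    "OLake": 50 + 250,
    "Fivetran": 0 + 6000,
    "Airbyte": 6000 + 1408,
    "Debezium MSK Connect + AWS MSK Serverless": 100 + 800,
}


def cost_factor(totals: dict, baseline: str = "OLake") -> dict:
    """Return each tool's total cost as a multiple of the baseline's."""
    base = totals[baseline]
    return {tool: total / base for tool, total in totals.items()}


if __name__ == "__main__":
    for tool, factor in cost_factor(TOTAL_USD).items():
        print(f"{tool:<45} {factor:5.1f}x")
```

With the unrounded Airbyte total the factor comes out just under 24.7, which the table reports as 24.6x.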
Why Choose OLake?
- Speed: OLake is up to 20x faster than competitors for full data syncs and 27.3x faster for CDC syncs.
- Stability: No failed syncs, no downtime. OLake delivers a reliable experience even for the largest datasets.
- Cost-Effective: At 300 USD per month, OLake is 20x cheaper than Fivetran, 24x cheaper than Airbyte, and 3x cheaper than a Debezium MSK Connect + AWS MSK Serverless setup, without sacrificing performance.
Testing Infrastructure
The impressive performance metrics of OLake were achieved using a robust infrastructure setup, which included:
- Virtual Machine: Standard_D64as_v5
- CPU: 64 vCPUs
- Memory: 256 GiB RAM
- Storage: 250 GB of shared storage
MongoDB Setup:
- 3 Nodes running in a replica set configuration:
  - 1 Primary Node (Master) that handles all write operations.
  - 2 Secondary Nodes (Replicas) that replicate data from the primary node.
*Twitter dataset - Archive.org (This JSON dataset has 4 levels of complex nesting).
MySQL Benchmarks
Coming Soon