Benchmarks
PostgreSQL → Apache Iceberg Connector Benchmark
(OLake vs. Popular Data-Movement Tools)
1. Speed Comparison – Full-Load Performance
Tool | Rows Synced | Throughput (rows / sec) | Relative to OLake |
---|---|---|---|
OLake | 4.01 B | 46,262 RPS | – |
Fivetran | 4.01 B | 46,395 RPS | Parity (≤1 % faster) |
Debezium (memiiso) | 1.28 B | 14,839 RPS | 3.1 × slower |
Estuary | 0.34 B | 3,982 RPS | 11.6 × slower¹ |
Airbyte Cloud | 12.7 M | 457 RPS | 101 × slower |
¹ Estuary ran the same 24-hour window but processed a ~10× smaller dataset, so its throughput looks even lower when normalized.
- Every tool was given the same 24-hour window. OLake, Debezium, Estuary and Fivetran ran for the full window; Airbyte's sync failed after 7.5 hours, so its throughput covers only that part of the test.
Key takeaway: OLake sustains the same top-tier bulk-load speed as Fivetran while outpacing every other open-source option by 3-to-100×.
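The "Relative to OLake" column can be reproduced from the throughput figures alone. The Python sketch below divides OLake's throughput by each tool's throughput; the dictionary and function names are illustrative and not part of any benchmark tooling.

```python
# Full-load throughput in rows/sec, copied from the table above.
FULL_LOAD_RPS = {
    "OLake": 46_262,
    "Fivetran": 46_395,
    "Debezium (memiiso)": 14_839,
    "Estuary": 3_982,
    "Airbyte Cloud": 457,
}


def slowdown_vs(baseline, rps_by_tool):
    """Return how many times slower each tool is than the baseline tool."""
    base = rps_by_tool[baseline]
    return {
        tool: base / rps
        for tool, rps in rps_by_tool.items()
        if tool != baseline
    }


full_load_ratios = slowdown_vs("OLake", FULL_LOAD_RPS)
for tool, ratio in full_load_ratios.items():
    print(f"{tool}: {ratio:.2f}x")
```

A ratio below 1, as for Fivetran, means the tool was marginally faster than OLake rather than slower.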
2. Speed Comparison – Change-Data-Capture (CDC)
Tool | CDC Window | Throughput (rows / sec) | Relative to OLake |
---|---|---|---|
OLake | 22.5 min | 36,982 RPS | – |
Fivetran | 31 min | 26,910 RPS | 1.4 × slower |
Debezium (memiiso) | 60 min | 13,808 RPS | 2.7 × slower |
Estuary | 4.5 h | 3,085 RPS | 12 × slower |
Airbyte Cloud | 23 h | 585 RPS | 63 × slower |
Every tool processed the same 50 million changes in the CDC test. The CDC window is the time each tool took to sync them, starting from 22.5 minutes for OLake.
Key takeaway: For incremental workloads OLake leads the pack, moving 50 million PostgreSQL changes into Iceberg about 40 % faster than Fivetran and roughly 3× to 60× faster than other OSS connectors.
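As a sanity check, the CDC throughput column should roughly equal 50 million rows divided by each tool's window. The sketch below recomputes it in Python. The names are illustrative, and the small gaps between implied and published values are assumed to come from rounded window durations.

```python
CDC_ROWS = 50_000_000

# CDC windows from the table above, converted to minutes.
CDC_WINDOW_MINUTES = {
    "OLake": 22.5,
    "Fivetran": 31,
    "Debezium (memiiso)": 60,
    "Estuary": 4.5 * 60,
    "Airbyte Cloud": 23 * 60,
}

# Published throughput in rows/sec, copied from the table above.
PUBLISHED_CDC_RPS = {
    "OLake": 36_982,
    "Fivetran": 26_910,
    "Debezium (memiiso)": 13_808,
    "Estuary": 3_085,
    "Airbyte Cloud": 585,
}


def implied_rps(rows, minutes):
    """Rows per second implied by syncing `rows` in `minutes`."""
    return rows / (minutes * 60)


cdc_implied = {
    tool: implied_rps(CDC_ROWS, minutes)
    for tool, minutes in CDC_WINDOW_MINUTES.items()
}

for tool, rps in cdc_implied.items():
    gap = (rps - PUBLISHED_CDC_RPS[tool]) / PUBLISHED_CDC_RPS[tool]
    print(f"{tool}: implied {rps:,.0f} RPS, published {PUBLISHED_CDC_RPS[tool]:,} ({gap:+.1%})")
```

Under these inputs every implied value lands within a few percent of the published figure.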
3. Cost Comparison (Vendor List Prices)
Tool | Scenario | Spend (USD) | Rows Synced |
---|---|---|---|
OLake | Full Load / CDC | < $ 75 (cost of a Standard D64ls v5 with 64 vCPUs and 128 GiB memory running for 24 hours) | 4.01 B / 50 M |
Fivetran | Full Load | $ 0 (free full sync) | 4.01 B |
Estuary | Full Load | $ 1,668 | 0.34 B |
Airbyte Cloud | Full Load | $ 5,560 | 12.7 M |
Fivetran | CDC | $ 2,375.80 | 50 M |
Estuary | CDC | $ 17.63 | 50 M |
Airbyte Cloud | CDC | $ 148.95 | 50 M |
- OLake is open-source and can be deployed on your own Kubernetes cluster or cloud VMs; you pay only for the compute and storage you provision.
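Because the runs synced very different row counts, spend is easier to compare once normalized per million rows. The sketch below does that arithmetic for the priced runs in the table. It leaves out OLake, whose figure is an upper bound covering both scenarios, and Fivetran's free full load. All names are illustrative.

```python
# (tool, scenario, spend in USD, rows synced), copied from the table above.
PRICED_RUNS = [
    ("Estuary", "Full Load", 1668.00, 0.34e9),
    ("Airbyte Cloud", "Full Load", 5560.00, 12.7e6),
    ("Fivetran", "CDC", 2375.80, 50e6),
    ("Estuary", "CDC", 17.63, 50e6),
    ("Airbyte Cloud", "CDC", 148.95, 50e6),
]


def usd_per_million_rows(spend_usd, rows):
    """Normalize a run's spend to USD per one million rows synced."""
    return spend_usd / rows * 1_000_000


cost_per_million = {
    (tool, scenario): usd_per_million_rows(spend, rows)
    for tool, scenario, spend, rows in PRICED_RUNS
}

for (tool, scenario), cost in cost_per_million.items():
    print(f"{tool} ({scenario}): ${cost:,.2f} per million rows")
```

This is plain arithmetic on the list prices shown here, not a pricing model; actual vendor billing may depend on factors this table does not capture.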
Footnotes
- Airbyte: Please find attached data for the Airbyte issues we faced during the test - here.
Dataset and Table Schemas
Please refer to this GitHub repository for the dataset we used to conduct these benchmarks.
We first performed a full-load sync of empty dummy tables. Afterwards, we inserted the top 25 million records from both trips and fhv_trips into these tables and ran a CDC sync.
trips table
CREATE TABLE trips (
id bigserial NOT NULL,
cab_type_id int4 NULL,
vendor_id int4 NULL,
pickup_datetime timestamp NULL,
dropoff_datetime timestamp NULL,
store_and_fwd_flag bool NULL,
rate_code_id int4 NULL,
pickup_longitude numeric NULL,
pickup_latitude numeric NULL,
dropoff_longitude numeric NULL,
dropoff_latitude numeric NULL,
passenger_count int4 NULL,
trip_distance numeric NULL,
fare_amount numeric NULL,
extra numeric NULL,
mta_tax numeric NULL,
tip_amount numeric NULL,
tolls_amount numeric NULL,
ehail_fee numeric NULL,
improvement_surcharge numeric NULL,
congestion_surcharge numeric NULL,
airport_fee numeric NULL,
total_amount numeric NULL,
payment_type int4 NULL,
trip_type int4 NULL,
pickup_nyct2010_gid int4 NULL,
dropoff_nyct2010_gid int4 NULL,
pickup_location_id int4 NULL,
dropoff_location_id int4 NULL,
CONSTRAINT trips_pkey PRIMARY KEY (id)
);
- Column count: 29
- Type mix: 1 x bigserial, 10 x int4, 2 x timestamp, 1 x bool, 15 x numeric
fhv_trips table
CREATE TABLE fhv_trips (
id bigserial NOT NULL,
hvfhs_license_num text NULL,
dispatching_base_num text NULL,
originating_base_num text NULL,
request_datetime timestamp NULL,
on_scene_datetime timestamp NULL,
pickup_datetime timestamp NULL,
dropoff_datetime timestamp NULL,
pickup_location_id int4 NULL,
dropoff_location_id int4 NULL,
trip_miles numeric NULL,
trip_time numeric NULL,
base_passenger_fare numeric NULL,
tolls numeric NULL,
black_car_fund numeric NULL,
sales_tax numeric NULL,
congestion_surcharge numeric NULL,
airport_fee numeric NULL,
tips numeric NULL,
driver_pay numeric NULL,
shared_request bool NULL,
shared_match bool NULL,
access_a_ride bool NULL,
wav_request bool NULL,
wav_match bool NULL,
legacy_shared_ride int4 NULL,
affiliated_base_num text NULL,
CONSTRAINT fhv_trips_pkey PRIMARY KEY (id)
);
- Column count: 27
- Type mix: 1 x bigserial, 4 x text, 4 x timestamp, 3 x int4, 10 x numeric, 5 x bool
Average row size & storage footprint
Sync Mode | Table | Rows | Raw CSV size | Size ÷ rows |
---|---|---|---|---|
Full Load | trips + fhv_trips | ≈ 3.96 B | ≈ 585 GB uncompressed | ≈ 158.62 bytes/row |
CDC | trips + fhv_trips | ≈ 50 M | ≈ 6.8 GB uncompressed | ≈ 60.13 bytes/row |
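The full-load figure in the last column is consistent with dividing the raw CSV size by the row count, provided "GB" is read as GiB (2^30 bytes). That reading is an assumption, and so are the exact inputs, since the table gives approximate values. A minimal Python sketch of the arithmetic, with an illustrative function name:

```python
GIB = 2**30  # bytes per GiB; assumes the table's "GB" means GiB


def avg_row_bytes(size_gib, rows):
    """Average bytes per row for a dataset of `size_gib` GiB and `rows` rows."""
    return size_gib * GIB / rows


# Full-load inputs from the table above, taken at face value.
full_load_row_bytes = avg_row_bytes(585, 3.96e9)
print(f"{full_load_row_bytes:.2f} bytes/row")  # prints "158.62 bytes/row"
```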
We used AWS Glue as the Iceberg catalog and AWS S3 as the storage layer on the destination side for these benchmarks.
What These Numbers Mean for You
- Peak throughput without lock-in: OLake matches or beats proprietary SaaS speeds while letting you keep data and infrastructure in your own account.
- Superior CDC latency: Faster change propagation means fresher downstream analytics and near-real-time feature generation for ML.
- Predictable TCO: Because OLake is self-hosted, you scale resources up or down to hit your desired SLA at the lowest cloud cost—no opaque credit systems.
- Resource profile: In these tests OLake used 57.6 GB of RAM (roughly one Standard D64ls v5 VM with 64 vCPUs and 128 GiB memory) for both full-load and CDC runs; adjust sizing linearly with your workload.
Bottom line: If you need to land terabytes of PostgreSQL data into Apache Iceberg quickly—and keep it continually up-to-date—OLake delivers enterprise-grade speed without the enterprise-grade bill.