
Benchmarks

PostgreSQL → Apache Iceberg Connector Benchmark

(OLake vs. Popular Data-Movement Tools)

1. Speed Comparison – Full-Load Performance

| Tool | Rows Synced | Throughput (rows/sec) | Relative to OLake |
|------|-------------|-----------------------|-------------------|
| OLake | 4.01 B | 46,262 RPS | Baseline |
| Fivetran | 4.01 B | 46,395 RPS | Parity (≤1 % faster) |
| Debezium (memiiso) | 1.28 B | 14,839 RPS | 3.1× slower |
| Estuary | 0.34 B | 3,982 RPS | 11.6× slower¹ |
| Airbyte Cloud | 12.7 M | 457 RPS | 101× slower |

¹ Estuary ran for the same 24-hour window but synced a ~10× smaller dataset, which is reflected directly in its lower throughput figure.

info

The benchmark window for every tool was 24 hours. OLake and Fivetran processed the entire dataset within that window; Debezium and Estuary ran for the full 24 hours but synced only part of the dataset (see the rows-synced column above). Airbyte's sync failed after 7.5 hours, so its throughput reflects only that partial run.

Key takeaway: OLake sustains the same top-tier bulk-load speed as Fivetran while outpacing every other open-source option by 3-to-100×.
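For readers who want to sanity-check the table above, throughput is simply rows synced divided by elapsed seconds, and the slowdown factor is the ratio of throughputs. The query below is an illustrative back-of-the-envelope check, not part of the benchmark harness; the small differences from the published figures come from the exact measured elapsed times.

```sql
-- Back-of-the-envelope check: throughput = rows synced / elapsed seconds,
-- slowdown = OLake throughput / tool throughput. Illustrative only.
SELECT
    4.01e9 / (24 * 60 * 60)  AS olake_rps,         -- ≈ 46.4 K rows/sec
    1.28e9 / (24 * 60 * 60)  AS debezium_rps,      -- ≈ 14.8 K rows/sec
    0.34e9 / (24 * 60 * 60)  AS estuary_rps,       -- ≈ 3.9 K rows/sec
    12.7e6 / (7.5 * 60 * 60) AS airbyte_rps,       -- ≈ 470 rows/sec (sync failed at 7.5 h)
    46262.0 / 14839          AS debezium_slowdown, -- ≈ 3.1×
    46262.0 / 457            AS airbyte_slowdown;  -- ≈ 101×
```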

2. Speed Comparison – Change-Data-Capture (CDC)

| Tool | CDC Window | Throughput (rows/sec) | Relative to OLake |
|------|------------|-----------------------|-------------------|
| OLake | 22.5 min | 36,982 RPS | Baseline |
| Fivetran | 31 min | 26,910 RPS | 1.4× slower |
| Debezium (memiiso) | 60 min | 13,808 RPS | 2.7× slower |
| Estuary | 4.5 h | 3,085 RPS | 12× slower |
| Airbyte Cloud | 23 h | 585 RPS | 63× slower |

info

Every tool synced the same 50 million changes that OLake processed in 22.5 minutes; the CDC window column shows how long each tool took to move them.

Key takeaway: For incremental workloads OLake leads the pack, moving 50 million PostgreSQL changes into Iceberg roughly 40 % faster than Fivetran and 3–63× faster than the other connectors tested.
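Because every tool moved the same 50 million changes, the ratio of CDC windows is the same as the ratio of throughputs. A quick illustrative check (again, not part of the benchmark scripts):

```sql
-- 50 M changes divided by each tool's CDC window; the window ratio
-- equals the throughput ratio shown in the table above.
SELECT
    50e6 / (22.5 * 60) AS olake_cdc_rps,     -- ≈ 37.0 K rows/sec
    50e6 / (31.0 * 60) AS fivetran_cdc_rps,  -- ≈ 26.9 K rows/sec
    50e6 / (60.0 * 60) AS debezium_cdc_rps,  -- ≈ 13.9 K rows/sec
    31.0 / 22.5        AS fivetran_slowdown, -- ≈ 1.4×
    (4.5 * 60) / 22.5  AS estuary_slowdown;  -- ≈ 12×
```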

3. Cost Comparison (Vendor List Prices)

| Tool | Scenario | Spend (USD) | Rows Synced |
|------|----------|-------------|-------------|
| OLake | Full Load / CDC | < $75 (one Standard D64ls v5, 64 vCPUs / 128 GiB memory, running for 24 hours) | 4.01 B / 50 M |
| Fivetran | Full Load | $0 (free full sync) | 4.01 B |
| Estuary | Full Load | $1,668 | 0.34 B |
| Airbyte Cloud | Full Load | $5,560 | 12.7 M |
| Fivetran | CDC | $2,375.80 | 50 M |
| Estuary | CDC | $17.63 | 50 M |
| Airbyte Cloud | CDC | $148.95 | 50 M |
  • OLake is open-source and can be deployed on your own Kubernetes cluster or cloud VMs; you pay only for the compute and storage you provision.
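To make the OLake row above concrete, the only meaningful cost is the VM (plus whatever S3/Glue usage you incur). A rough illustrative calculation, assuming a pay-as-you-go rate of about $3/hour for a Standard D64ls v5 (the actual rate depends on your region and pricing model):

```sql
-- Hypothetical cost check for the self-hosted OLake run; the $3.00/hour rate
-- is an assumed illustrative figure, not a quoted Azure price.
SELECT 24 * 3.00 AS approx_vm_cost_usd;  -- ≈ $72, consistent with "< $75" above
```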

Footnotes

  1. Airbyte: Data for the Airbyte issues we faced during the test is available here.

Dataset and Table Schemas

Please refer to this GitHub repository for the dataset we used to conduct these benchmarks.

trips table (used for Full Load)

```sql
CREATE TABLE trips (
trip_id BIGINT,
vendor_id VARCHAR,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1),
rate_code_id BIGINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count BIGINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type VARCHAR,
trip_type VARCHAR,
pickup VARCHAR,
dropoff VARCHAR,
cab_type VARCHAR,
precipitation BIGINT,
snow_depth BIGINT,
snowfall BIGINT,
max_temperature BIGINT,
min_temperature BIGINT,
average_wind_speed BIGINT,
pickup_nyct2010_gid BIGINT,
pickup_ctlabel VARCHAR,
pickup_borocode BIGINT,
pickup_boroname VARCHAR,
pickup_ct2010 VARCHAR,
pickup_boroct2010 BIGINT,
pickup_cdeligibil VARCHAR(1),
pickup_ntacode VARCHAR,
pickup_ntaname VARCHAR,
pickup_puma VARCHAR,
dropoff_nyct2010_gid BIGINT,
dropoff_ctlabel VARCHAR,
dropoff_borocode BIGINT,
dropoff_boroname VARCHAR,
dropoff_ct2010 VARCHAR,
dropoff_boroct2010 BIGINT,
dropoff_cdeligibil VARCHAR(1),
dropoff_ntacode VARCHAR,
dropoff_ntaname VARCHAR,
dropoff_puma VARCHAR
);
```
  • Column count: 51
  • Type mix: 21 × VARCHAR (18 label/categorical-code columns plus 3 single-character flags), 28 × numeric (15 BIGINT, 13 DOUBLE), and 2 × TIMESTAMP.
  • Why so wide? Weather columns (precipitation → wind) and geography lookup columns (borough/census codes) are pre-joined to make BI queries single-table.

fhv_trips table (used for CDC)

| # | Column | Type | Note |
|---|--------|------|------|
| 1 | dispatching_base_num | VARCHAR(6) | TLC base licence |
| 2 | pickup_datetime | TIMESTAMP | ISO 8601 |
| 3 | dropoff_datetime | TIMESTAMP | ISO 8601 |
| 4 | PUlocationID | INT | TLC zone ID |
| 5 | DOlocationID | INT | TLC zone ID |
| 6 | SR_Flag | SMALLINT NULL | 1 = shared ride |
| 7 | affiliated_base_number | VARCHAR(6) | vehicle's own base |
  • Column count: 7
  • All core types are represented: timestamp, integer, varchar, and a boolean-style flag stored as SMALLINT (a DDL sketch follows below).
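For convenience, here is an illustrative PostgreSQL DDL reconstructed from the column list above. It is a sketch, not the exact script used in the benchmark; note that mixed-case names such as PUlocationID need double quotes in PostgreSQL to preserve their casing.

```sql
-- Sketch of the fhv_trips source table, reconstructed from the column list above.
CREATE TABLE fhv_trips (
    dispatching_base_num   VARCHAR(6),      -- TLC base licence
    pickup_datetime        TIMESTAMP,       -- ISO 8601
    dropoff_datetime       TIMESTAMP,
    "PUlocationID"         INT,             -- TLC zone ID
    "DOlocationID"         INT,
    "SR_Flag"              SMALLINT NULL,   -- 1 = shared ride
    affiliated_base_number VARCHAR(6)       -- vehicle's own base
);
```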

Average row size & storage footprint

| Table | Rows | Raw CSV size | Size ÷ rows |
|-------|------|--------------|-------------|
| trips | ≈ 3.96 B | ≈ 518 GB uncompressed | ≈ 505 B/row |
| fhv_trips | ≈ 50 M | ≈ 2.8 GB uncompressed | ≈ 56 B/row |

note

We used AWS Glue as the Iceberg catalog and Amazon S3 as the storage layer on the destination side for these benchmarks.
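As a concrete illustration of that setup: once the data has landed, the Iceberg tables can be queried through any engine attached to the Glue catalog. The Spark SQL snippet below is a hypothetical validation step; the catalog and database names (glue, benchmark) are placeholders, not taken from the benchmark scripts.

```sql
-- Hypothetical row-count validation via Spark SQL, assuming an Iceberg catalog
-- named `glue` backed by AWS Glue with an S3 warehouse; names are placeholders.
SELECT COUNT(*) FROM glue.benchmark.trips;      -- expect ≈ 4.01 B after the full load
SELECT COUNT(*) FROM glue.benchmark.fhv_trips;  -- expect ≈ 50 M after the CDC run
```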

What These Numbers Mean for You

  • Peak throughput without lock-in: OLake matches or beats proprietary SaaS speeds while letting you keep data and infrastructure in your own account.
  • Superior CDC latency: Faster change propagation means fresher downstream analytics and near-real-time feature generation for ML.
  • Predictable TCO: Because OLake is self-hosted, you scale resources up or down to hit your desired SLA at the lowest cloud cost—no opaque credit systems.
  • Resource profile: In these tests OLake used 57.6 GB of RAM (comfortably within a single Standard D64ls v5 VM with 64 vCPUs and 128 GiB of memory) for both the full-load and CDC runs; adjust sizing roughly linearly with your workload.

Bottom line: If you need to land terabytes of PostgreSQL data into Apache Iceberg quickly—and keep it continually up-to-date—OLake delivers enterprise-grade speed without the enterprise-grade bill.


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: discuss the roadmap, report bugs, get help debugging issues you're facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!