Skip to main content

OLake vs AWS DMS Benchmark​

In today's data-driven world, organizations are migrating massive datasets from traditional databases to cloud storage. But here's the million-dollar question: Which tool can handle multi billion records efficiently without breaking the bank?

The following benchmark evaluates performance, environment configuration, and cost considerations for migrating PostgreSQL data to Parquet (AWS S3) using AWS Database Migration Service (DMS) and OLake.

Workload​

  • Dataset: This benchmark uses two large tables from NYC Taxi Data trips2 and fhv_trips. For details on how the data is generated, check here: How the data is generated!
  • Total rows: 4,008,587,913 rows including both tables
  • Modes evaluated: We are performing Full Refresh and CDC workload (50 million rows)

Environment​

  • Source database: Azure PostgreSQL (32 vCores, 128 GB RAM)
  • Destination: Amazon S3 (Parquet)
  • Migration compute: c6i.16xlarge (64 vCPU, 128 GB RAM)
  • Reference compute price: ~$3.07/hour
note

The same dataset and compute were used for both the tools.

Full Refresh Results​

The full-load test evaluates the time and throughput required to transfer the complete dataset from the source PostgreSQL database to Parquet files in S3, establishing a baseline for bulk data movement under sustained load conditions.

ToolRows ProcessedTotal TimeAvg Throughput (rows/sec)
OLake4,008,587,9131 h 59 m558,765
AWS DMS4,008,587,9139 h 8 m122,000

Key observations:

  • OLake achieved a 4.6Γ— faster full refresh performance compared to AWS DMS.
  • OLake eliminates the need for manual partition-boundary scripting, while DMS relies on manual boundary generation to achieve PostgreSQL parallelization.

Parallelism used:

  • OLake ran with 32 threads.
  • DMS ran with: 40 parallel tasks.

CDC Results (Insert-only; 50M rows)​

The CDC test evaluates the sustained ingestion of incremental changes from PostgreSQL into Parquet, validating plugin configuration, replication slots, and overall throughput. Each of the tables trips2 and fhv_trips will ingest 25 million records.

ToolRows ProcessedTotal TimeAvg Throughput (rows/sec)
AWS DMS50,000,00022 m 41 s36,738
OLake50,000,00016 m 24 s50,812

Key Observations:

  • OLake performed the CDC workload approximately 1.38Γ— faster than AWS DMS.
  • OLake requires minimal configuration , selecting the Full Refresh + CDC Sync Mode option is sufficient, whereas DMS requires enabling pglogical via Azure extensions, performing a detailed pglogical setup, and specifying the starting LSN for replication tasks.

Parallelism used:

  • OLake ran with 32 threads.
  • DMS ran with: 40 parallel tasks.

Resource Utilization​

We’re focusing on resource utilization for the Full Refresh process, as it’s significantly more resource-intensive than CDC.

The memory utilization shown corresponds to a transfer of ~4 billion records. As the data volumes increase, being memory efficiency becomes highly crucial β€” an area where OLake excels.

Memory StatsOLakeDMS
Min2.01 GB20 GB
Max46.72 GB40 GB
Mean25.70 GB30 GB

OLake's memory usage ranged from 2.12 GB to 46.72 GB, averaging 25.70 GB throughout the transfer. For DMS, the memory usage ranged from 20 GB to 40 GB, averaging 30 GB throughout the transfer.

Cost Comparison (Compute Only)​

Compute costs scale linearly with runtime at a given instance class for both OLake and DMS.

Here is the cost comparison for Full Refresh process.

ScenarioInstanceRuntimeApprox. Cost
DMS Full Refreshc6i.16xlarge9h 08m~$28.03
OLake Full Refreshc6i.16xlarge1h 59m~$6.08

Important-

OLake is delivering 4.61x cost savings as compared to DMS on compute alone. When you're moving terabytes monthly, those savings add up fast!

Scaling Impact-

Let us take a closer look at how the costs evolve over time for both the tools.

Assume that the same dataset is migrated "once daily" for the durations specified below.

Tool1 Month Cost2 Months Cost6 Months Cost
DMS~$840~$1,681~$5045
OLake~$182~$364~$1094

The cost trends over time make it evident which tool is more cost-efficient. Considering, typical workflows involve more than one daily sync, so the costs would increase from the baseline shown here.

Point to Remember

OLake is open source and incurs no licensing fees ; costs depend solely on the user’s infrastructure and storage consumption

Dataset and Table Schemas​

The OLake benchmarks page provides NYC Taxi table schemas designed to support both bulk transfer and CDC scenarios at scale, ensuring comparability across different tools.

The dataset and reproducible setup are available in the GitHub repository: NYC Taxi Data Benchmark.

The repository includes scripts and instructions to generate the NYC trips tables used in this benchmark, enabling users to replicate the setup and results.

trips2 table​

Schema for the trips2 table used in this benchmark:

CREATE TABLE public.trips2 (
id BIGSERIAL NOT NULL,
cab_type_id INT NULL,
vendor_id INT NULL,
pickup_datetime TIMESTAMP NULL,
dropoff_datetime TIMESTAMP NULL,
store_and_fwd_flag BOOLEAN NULL,
rate_code_id INT NULL,
pickup_longitude NUMERIC NULL,
pickup_latitude NUMERIC NULL,
dropoff_longitude NUMERIC NULL,
dropoff_latitude NUMERIC NULL,
passenger_count INT NULL,
trip_distance NUMERIC NULL,
fare_amount NUMERIC NULL,
extra NUMERIC NULL,
mta_tax NUMERIC NULL,
tip_amount NUMERIC NULL,
tolls_amount NUMERIC NULL,
ehail_fee NUMERIC NULL,
improvement_surcharge NUMERIC NULL,
congestion_surcharge NUMERIC NULL,
airport_fee NUMERIC NULL,
total_amount NUMERIC NULL,
payment_type INT NULL,
trip_type INT NULL,
pickup_nyct2010_gid INT NULL,
dropoff_nyct2010_gid INT NULL,
pickup_location_id INT NULL,
dropoff_location_id INT NULL,
CONSTRAINT trips2_pkey PRIMARY KEY (id)
);

fhv_trips table​

Schema for the fhv_trips table used in this benchmark:

CREATE TABLE fhv_trips (
id bigserial NOT NULL,
hvfhs_license_num text NULL,
dispatching_base_num text NULL,
originating_base_num text NULL,
request_datetime timestamp NULL,
on_scene_datetime timestamp NULL,
pickup_datetime timestamp NULL,
dropoff_datetime timestamp NULL,
pickup_location_id int4 NULL,
dropoff_location_id int4 NULL,
trip_miles numeric NULL,
trip_time numeric NULL,
base_passenger_fare numeric NULL,
tolls numeric NULL,
black_car_fund numeric NULL,
sales_tax numeric NULL,
congestion_surcharge numeric NULL,
airport_fee numeric NULL,
tips numeric NULL,
driver_pay numeric NULL,
shared_request bool NULL,
shared_match bool NULL,
access_a_ride bool NULL,
wav_request bool NULL,
wav_match bool NULL,
legacy_shared_ride int4 NULL,
affiliated_base_num text NULL,
CONSTRAINT fhv_trips_pkey PRIMARY KEY (id)
);

Implementation Considerations​

  • DMS PostgreSQL full loads may require manual task mapping with partition boundaries to achieve higher parallelism.
  • DMS CDC for Azure PostgreSQL required enabling pglogical via azure.extensions and configuring replication sets with aligned start LSN.
  • OLake parallelization and schema handling are automatic; no manual boundary generation was required in this benchmark.


πŸ’‘ Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
πŸ‘‰ Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. πŸš€

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!