OLake vs AWS DMS Benchmark
In today's data-driven world, organizations are migrating massive datasets from traditional databases to cloud storage. But here's the million-dollar question: which tool can move billions of records efficiently without breaking the bank?
The following benchmark evaluates performance, environment configuration, and cost considerations for migrating PostgreSQL data to Parquet (AWS S3) using AWS Database Migration Service (DMS) and OLake.
Workload
- Dataset: This benchmark uses two large tables from NYC Taxi Data, `trips2` and `fhv_trips`. For details on how the data is generated, check here: How the data is generated!
- Total rows: 4,008,587,913 rows across both tables (a sanity-check query is sketched below)
- Modes evaluated: Full Refresh and CDC workload (insert-only, 50 million rows)
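For reference, a query along these lines (a hypothetical sanity check, assuming both tables live in the `public` schema) can confirm the combined row count once the data has been generated:

```sql
-- Hypothetical sanity check: combined row count across both benchmark tables
-- (expected total: 4,008,587,913).
SELECT
    (SELECT count(*) FROM public.trips2)    AS trips2_rows,
    (SELECT count(*) FROM public.fhv_trips) AS fhv_trips_rows,
    (SELECT count(*) FROM public.trips2)
  + (SELECT count(*) FROM public.fhv_trips) AS total_rows;
```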
Environment
- Source database: Azure PostgreSQL (32 vCores, 128 GB RAM)
- Destination: Amazon S3 (Parquet)
- Migration compute: c6i.16xlarge (64 vCPU, 128 GB RAM)
- Reference compute price: ~$3.07/hour
The same dataset and compute were used for both tools.
Full Refresh Results
The full-load test evaluates the time and throughput required to transfer the complete dataset from the source PostgreSQL database to Parquet files in S3, establishing a baseline for bulk data movement under sustained load conditions.
Tool | Rows Processed | Total Time | Avg Throughput (rows/sec) |
---|---|---|---|
OLake | 4,008,587,913 | 1 h 59 m | 558,765 |
AWS DMS | 4,008,587,913 | 9 h 8 m | 122,000 |
Key observations:
- OLake completed the full refresh roughly 4.6× faster than AWS DMS.
- OLake eliminates the need for manual partition-boundary scripting, while DMS relies on manually generated boundaries to parallelize the PostgreSQL full load (a sketch of such a mapping follows below).
Parallelism used:
- OLake ran with 32 threads.
- DMS ran with 40 parallel tasks.
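To give a sense of what that boundary generation involves, here is a minimal sketch of a DMS table-mapping rule that uses ranged parallel load on the `id` primary key. The boundary values and rule names are illustrative placeholders, not the ones used in this benchmark; a real mapping would need a rule per table and many more ranges:

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-trips2",
      "object-locator": { "schema-name": "public", "table-name": "trips2" },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "parallel-load-trips2",
      "object-locator": { "schema-name": "public", "table-name": "trips2" },
      "parallel-load": {
        "type": "ranges",
        "columns": ["id"],
        "boundaries": [["500000000"], ["1000000000"], ["1500000000"]]
      }
    }
  ]
}
```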
CDC Results (Insert-only; 50M rows)
The CDC test evaluates the sustained ingestion of incremental changes from PostgreSQL into Parquet, validating plugin configuration, replication slots, and overall throughput. Each of the tables `trips2` and `fhv_trips` ingests 25 million records.
Tool | Rows Processed | Total Time | Avg Throughput (rows/sec) |
---|---|---|---|
AWS DMS | 50,000,000 | 22 m 41 s | 36,738 |
OLake | 50,000,000 | 16 m 24 s | 50,812 |
Key observations:
- OLake performed the CDC workload approximately 1.38× faster than AWS DMS.
- OLake requires minimal configuration: selecting the Full Refresh + CDC sync mode is sufficient. DMS, by contrast, requires enabling pglogical via Azure extensions, performing a detailed pglogical setup, and specifying the starting LSN for the replication task (a sketch of that setup follows this list).
Parallelism used:
- OLake ran with 32 threads.
- DMS ran with 40 parallel tasks.
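For context, the DMS path involves roughly the following kind of pglogical setup on the source. This is a minimal sketch with placeholder connection details, not the exact commands used in this benchmark; on Azure PostgreSQL, `pglogical` must first be allow-listed via the `azure.extensions` / `shared_preload_libraries` server parameters before the extension can be created:

```sql
-- Minimal pglogical setup sketch on the source PostgreSQL server (placeholders, not benchmark values).
CREATE EXTENSION IF NOT EXISTS pglogical;

-- Register this server as a pglogical provider node.
SELECT pglogical.create_node(
    node_name := 'benchmark_provider',
    dsn       := 'host=<server> port=5432 dbname=<database> user=<replication_user>'
);

-- Add the benchmark tables to the default replication set that the DMS task consumes.
SELECT pglogical.replication_set_add_table('default', 'public.trips2');
SELECT pglogical.replication_set_add_table('default', 'public.fhv_trips');
```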
Resource Utilization
We're focusing on resource utilization for the Full Refresh process, as it's significantly more resource-intensive than CDC.
The memory utilization shown corresponds to a transfer of roughly 4 billion records. As data volumes grow, memory efficiency becomes increasingly important, and this is an area where OLake excels.
Memory Stats | OLake | DMS |
---|---|---|
Min | 2.01 GB | 20 GB |
Max | 46.72 GB | 40 GB |
Mean | 25.70 GB | 30 GB |
OLake's memory usage ranged from 2.01 GB to 46.72 GB, averaging 25.70 GB throughout the transfer. DMS ranged from 20 GB to 40 GB, averaging 30 GB.
Cost Comparison (Compute Only)
Compute costs scale linearly with runtime for a given instance class, so the tool that finishes faster costs proportionally less. At the reference rate of ~$3.07/hour, DMS's 9 h 08 m run works out to roughly 9.13 × $3.07 ≈ $28, while OLake's 1 h 59 m run works out to roughly 1.98 × $3.07 ≈ $6.1.
Here is the cost comparison for the Full Refresh process.
Scenario | Instance | Runtime | Approx. Cost |
---|---|---|---|
DMS Full Refresh | c6i.16xlarge | 9h 08m | ~$28.03 |
OLake Full Refresh | c6i.16xlarge | 1h 59m | ~$6.08 |
Important:
OLake's compute cost is roughly 4.61× lower than DMS's for the same full refresh. When you're moving terabytes monthly, those savings add up fast!
Scaling Impact:
Let's take a closer look at how costs evolve over time for both tools.
Assume the same dataset is migrated once daily for the durations below; each figure is simply the per-run cost multiplied by the number of runs.
Tool | 1 Month Cost | 2 Months Cost | 6 Months Cost |
---|---|---|---|
DMS | ~$840 | ~$1,681 | ~$5,045 |
OLake | ~$182 | ~$364 | ~$1,094 |
The cost trend over time makes it clear which tool is more cost-efficient. Note that typical workflows involve more than one sync per day, so real-world costs would be higher than the baseline shown here.
OLake is open source and incurs no licensing fees; costs depend solely on the user's infrastructure and storage consumption.
Dataset and Table Schemas
The OLake benchmarks page provides NYC Taxi table schemas designed to support both bulk transfer and CDC scenarios at scale, ensuring comparability across different tools.
The dataset and reproducible setup are available in the GitHub repository: NYC Taxi Data Benchmark.
The repository includes scripts and instructions to generate the NYC trips tables used in this benchmark, enabling users to replicate the setup and results.
`trips2` table
Schema for the `trips2` table used in this benchmark:
```sql
CREATE TABLE public.trips2 (
id BIGSERIAL NOT NULL,
cab_type_id INT NULL,
vendor_id INT NULL,
pickup_datetime TIMESTAMP NULL,
dropoff_datetime TIMESTAMP NULL,
store_and_fwd_flag BOOLEAN NULL,
rate_code_id INT NULL,
pickup_longitude NUMERIC NULL,
pickup_latitude NUMERIC NULL,
dropoff_longitude NUMERIC NULL,
dropoff_latitude NUMERIC NULL,
passenger_count INT NULL,
trip_distance NUMERIC NULL,
fare_amount NUMERIC NULL,
extra NUMERIC NULL,
mta_tax NUMERIC NULL,
tip_amount NUMERIC NULL,
tolls_amount NUMERIC NULL,
ehail_fee NUMERIC NULL,
improvement_surcharge NUMERIC NULL,
congestion_surcharge NUMERIC NULL,
airport_fee NUMERIC NULL,
total_amount NUMERIC NULL,
payment_type INT NULL,
trip_type INT NULL,
pickup_nyct2010_gid INT NULL,
dropoff_nyct2010_gid INT NULL,
pickup_location_id INT NULL,
dropoff_location_id INT NULL,
CONSTRAINT trips2_pkey PRIMARY KEY (id)
);
```
`fhv_trips` table
Schema for the `fhv_trips` table used in this benchmark:
```sql
CREATE TABLE fhv_trips (
id bigserial NOT NULL,
hvfhs_license_num text NULL,
dispatching_base_num text NULL,
originating_base_num text NULL,
request_datetime timestamp NULL,
on_scene_datetime timestamp NULL,
pickup_datetime timestamp NULL,
dropoff_datetime timestamp NULL,
pickup_location_id int4 NULL,
dropoff_location_id int4 NULL,
trip_miles numeric NULL,
trip_time numeric NULL,
base_passenger_fare numeric NULL,
tolls numeric NULL,
black_car_fund numeric NULL,
sales_tax numeric NULL,
congestion_surcharge numeric NULL,
airport_fee numeric NULL,
tips numeric NULL,
driver_pay numeric NULL,
shared_request bool NULL,
shared_match bool NULL,
access_a_ride bool NULL,
wav_request bool NULL,
wav_match bool NULL,
legacy_shared_ride int4 NULL,
affiliated_base_num text NULL,
CONSTRAINT fhv_trips_pkey PRIMARY KEY (id)
);
```
Implementation Considerations
- DMS PostgreSQL full loads may require manual task mapping with partition boundaries to achieve higher parallelism.
- DMS CDC for Azure PostgreSQL required enabling `pglogical` via `azure.extensions` and configuring replication sets with an aligned start LSN.
- OLake parallelization and schema handling are automatic; no manual boundary generation was required in this benchmark.