
Iceberg Partitioning Support

Partitioning groups rows that share common values at write-time, so that queries filtering on those values read only the relevant files. The result is fewer data files scanned, less I/O, and faster queries.

OLake writes seamlessly into partitioned Apache Iceberg™ tables and supports all of Iceberg's built-in partition transforms.

How Does Iceberg Handle Partitions?

Apache Iceberg stores partition data in both its metadata and manifest files to enable efficient querying and file pruning. Rather than embedding partition information in file paths, Iceberg records the partition values for each file in manifests and maintains a snapshot-based metadata tree that tracks those manifests.

When you run a query like WHERE date > …, Iceberg consults the metadata to skip over any data files whose recorded partition values cannot satisfy the filter.
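The pruning step can be sketched in a few lines of Python. This is illustrative only: real Iceberg evaluates predicates against transformed partition value ranges stored in manifests, and the `DataFile`/`prune` names here are our own, not Iceberg's.

```python
from dataclasses import dataclass
from datetime import date

# Toy model of a manifest entry: each data file records its partition values.
@dataclass
class DataFile:
    path: str
    partition: dict  # partition field name -> value recorded in the manifest

def prune(files, field, predicate):
    # Keep only files whose recorded partition value can satisfy the filter;
    # everything else is skipped without ever being opened.
    return [f for f in files if predicate(f.partition[field])]

files = [
    DataFile("data-001.parquet", {"event_date": date(2025, 1, 14)}),
    DataFile("data-002.parquet", {"event_date": date(2025, 1, 15)}),
    DataFile("data-003.parquet", {"event_date": date(2025, 1, 16)}),
]

# A filter like WHERE event_date > '2025-01-14' touches only two of the files.
survivors = prune(files, "event_date", lambda v: v > date(2025, 1, 14))
print([f.path for f in survivors])  # ['data-002.parquet', 'data-003.parquet']
```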

Key Features

  • Hidden partitioning
  • Evolvable specs
  • Rich partition transforms
  • Manifest metadata
  • Snapshot isolation
  • Efficient pruning

Learn more about Iceberg Partitioning.

How to Add a Partition in OLake?

When adding a partition, you need to configure two components:

| Component | Meaning |
| --- | --- |
| `field_name` | Column in the table (`created_at`, `user_id`, etc.) |
| `transform` | Iceberg transform that converts the value, e.g. `identity` |

Example 1: Time-based partitioning for event data

/{event_date, day}/{event_hour, hour}
  • Partitions data by day and hour for efficient time-range queries
  • Useful for analytics workloads with time-based filtering

Example 2: Geographic and temporal partitioning

/{region, identity}/{created_at, month}
  • Partitions by geographic region and creation month
  • Ideal for multi-region applications with time-based analysis
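As a sketch of how such a spec can be read programmatically, the parser below splits the `/{field_name, transform}` shape shown above into (field, transform) pairs. The `parse_partition_regex` helper is hypothetical, not part of OLake.

```python
import re

# Matches one "{field, transform}" group; the "/{field, transform}" shape is
# the one documented on this page, but this parser is our own illustration.
PART = re.compile(r"\{\s*([^,}]+?)\s*,\s*([^}]+?)\s*\}")

def parse_partition_regex(spec: str):
    # "/{region, identity}/{created_at, month}" ->
    # [("region", "identity"), ("created_at", "month")]
    return PART.findall(spec)

print(parse_partition_regex("/{region, identity}/{created_at, month}"))
# [('region', 'identity'), ('created_at', 'month')]
```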

Check all the available transforms below.

Configuration

  1. Before adding partitioning, make sure you have configured your destination.
  2. Select your table.
  3. Keep Normalization enabled.
  4. Select Partitioning in the right-hand tab.
  5. Add your partition field along with the transform.
  6. Finally, proceed to Schedule a Job.


For hierarchical partitioning, you can have multiple /{field_name, transform} entries in the Partition regex. Example: /{created_at, year}/{user_id, bucket[32]}

Special case

You may use now() as a pseudo-column; it evaluates to the _olake_timestamp column, the writer's current timestamp (useful when your records lack a suitable time field).

Supported Transformations

All transforms create partition folders with predictable naming patterns for efficient querying:

| Transform | Typical use-case | How It Works | Spec Snippet | Example Input → Partition Value |
| --- | --- | --- | --- | --- |
| `identity` | Columns with only a few distinct values, e.g. country, status | Writes the raw value unchanged | `/{status, identity}` | `status = 'active'` → folder `status=active/` |
| `year` | Time-series data where queries slice whole years | Extracts the calendar year as an integer | `/{created_at, year}` | `created_at = 2025-04-24 15:42` → `year=2025/` |
| `month` | Monthly reporting, rollups, retention | Returns the month number (1-12) of the timestamp | `/{created_at, month}` | `created_at = 2025-04-24 15:42` → `month=4/` |
| `day` | Daily dashboards, 30-day TTL jobs | Returns the calendar day (1-31) | `/{event_date, day}` | `event_date = 2025-04-24` → `day=24/` |
| `hour` | High-volume logs where analysts drill into single hours | Returns hour-of-day (0-23) | `/{event_time, hour}` | `event_time = 2025-04-24 15:42` → `hour=15/` |
| `bucket[N]` | Very high-cardinality keys (email, UUID) to distribute writes | Hashes the value and assigns it to one of N numbered buckets (0…N-1) | `/{user_id, bucket[64]}` | `user_id = 'f47ac10b...'` → `user_id_bucket=17/` (one of 64) |
| `truncate[N]` | Long strings where only the prefix matters (domains, URLs) | Keeps the first N UTF-8 characters | `/{domain, truncate[4]}` | `domain = 'example.com'` → `domain=exam/` |
| `void` | Dropping partition fields without removing them from the spec | Always produces null values | `/{old_field, void}` | `old_field = 'any_value'` → `old_field=null/` |
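To make the folder values above concrete, here are toy Python versions of the transforms. They are illustrative only: real Iceberg buckets values with a Murmur3 hash and tracks time transforms internally as units since the 1970 epoch, while these helpers just mirror the example outputs in the table.

```python
from datetime import datetime

# Toy transform sketches (our own, not Iceberg's internal representations).
def identity(v): return v              # raw value, unchanged
def year(ts): return ts.year           # calendar year
def month(ts): return ts.month         # month number 1-12
def day(ts): return ts.day             # calendar day 1-31
def hour(ts): return ts.hour           # hour of day 0-23
def truncate(n, s): return s[:n]       # first N characters
def bucket(n, v): return hash(v) % n   # placeholder hash, NOT Murmur3
def void(_): return None               # always null

ts = datetime(2025, 4, 24, 15, 42)
print(year(ts), month(ts), day(ts), hour(ts))  # 2025 4 24 15
print(truncate(4, "example.com"))              # exam
print(void("any_value"))                       # None
```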
Void Transform and Non-Linear Transformations

The void transform in Apache Iceberg always produces null values regardless of input, effectively dropping a partition field without removing it from the partition specification. In format v1 tables, Iceberg replaces a dropped partition field's transform with void to preserve the spec's field ordering.

Non-linear transformations like bucket and void do not preserve ordering, meaning there's no predictable relationship between input values and their transformed output positions.

Important

Iceberg does not allow redundant partition fields. In particular, do not apply multiple time transforms to the same column in a single partition specification.

Redundant transform combinations:

  • year(ts) is redundant with month(ts), day(ts), and hour(ts)
  • month(ts) is redundant with day(ts) and hour(ts)
  • day(ts) is redundant with hour(ts)

For example, this is incorrect: /{timestamp_col, day}/{timestamp_col, hour}
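A quick way to enforce this rule is to scan the spec for repeated time transforms on one column. The checker below is a hypothetical helper, not something OLake ships.

```python
# Hypothetical validator: at most one time transform per column in a spec.
TIME_TRANSFORMS = {"year", "month", "day", "hour"}

def check_no_redundant_time_transforms(spec):
    # spec is a list of (field_name, transform) pairs.
    seen = {}
    for field, transform in spec:
        if transform in TIME_TRANSFORMS:
            if field in seen:
                raise ValueError(
                    f"{transform}({field}) is redundant with {seen[field]}({field})"
                )
            seen[field] = transform

# Fine: one time transform plus an unrelated bucket field.
check_no_redundant_time_transforms([("created_at", "year"), ("user_id", "bucket[32]")])

# Rejected: two time transforms on the same column.
try:
    check_no_redundant_time_transforms([("ts", "hour"), ("ts", "day")])
except ValueError as e:
    print(e)  # day(ts) is redundant with hour(ts)
```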

Putting it all together

Let's look at a real-world example: an e-commerce analytics table with three partition fields chosen for query performance.

Example: E-commerce Events Table

Partition Configuration:

/{event_date, day}/{region, identity}/{user_id, bucket[256]}

This creates a hierarchical partitioning strategy:

  1. Daily partitions for time-based queries
  2. Regional partitions for geographic analysis
  3. User bucketing to distribute high-cardinality data
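To illustrate where a single row lands under this spec, the sketch below builds the folder path by hand, following the `event_date_day=...` naming used in the directory listing. The hash is a stand-in for Iceberg's Murmur3 bucketing, so the bucket number is only representative.

```python
from datetime import date

# Illustrative only: compute the folder a row would land in under
# /{event_date, day}/{region, identity}/{user_id, bucket[256]}.
def partition_path(row, n_buckets=256):
    b = hash(row["user_id"]) % n_buckets  # placeholder hash, NOT Murmur3
    return (f"event_date_day={row['event_date'].isoformat()}/"
            f"region={row['region']}/"
            f"user_id_bucket={b}/")

row = {"event_date": date(2025, 1, 15), "region": "us-east", "user_id": "f47ac10b"}
print(partition_path(row))
# e.g. event_date_day=2025-01-15/region=us-east/user_id_bucket=<0..255>/
```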

Directory Structure

```
s3://warehouse/ecommerce_events/
├── metadata/
│   ├── v0.metadata.json
│   ├── v1.metadata.json
│   ├── snap-0001.avro
│   └── manifest-01.avro
└── data/
    ├── event_date_day=2025-01-15/
    │   ├── region=us-east/
    │   │   └── user_id_bucket=0/
    │   │       └── data-001.parquet
    │   └── region=eu-west/
    │       └── user_id_bucket=128/
    │           └── data-002.parquet
    └── event_date_day=2025-01-16/
        └── region=ap-south/
            └── user_id_bucket=64/
                └── data-003.parquet
```

SQL Query Examples

1. Daily Report (scans only 1 day's data):

```sql
SELECT event_type, COUNT(*)
FROM ecommerce_events
WHERE event_date = '2025-01-15'
GROUP BY event_type;
```

Iceberg scans only: event_date_day=2025-01-15/ directory

2. Regional Analysis (scans specific region across time):

```sql
SELECT DATE(event_date) AS day, SUM(revenue)
FROM ecommerce_events
WHERE region = 'us-east'
  AND event_date BETWEEN '2025-01-15' AND '2025-01-17'
GROUP BY DATE(event_date);
```

Iceberg scans only: region=us-east/ subdirectories within the date range

Performance Benefits

  • Time queries: Skip irrelevant days entirely
  • Regional queries: Skip other geographic regions
  • User queries: Skip 255 out of 256 user buckets
  • Combined filters: Maximum pruning when filtering on multiple partition fields

This partitioning strategy enables sub-second queries on petabyte-scale datasets by reading only the relevant data files.

Real-World Examples

1. Analytics Data Warehouse

Partition Regex:

/{event_year, year}/{event_month, month}/{event_day, day}

Configuration:

```json
{
  "stream_name": "analytics_events",
  "partition_regex": "/{event_year, year}/{event_month, month}/{event_day, day}"
}
```

Folder Structure:

event_year_year=2025/event_month_month=4/event_day_day=24/part-00001.parquet

SQL Query:

```sql
SELECT event_type, COUNT(*)
FROM analytics.events
WHERE event_year = 2025
  AND event_month = 4
GROUP BY event_type;
```

2. Financial Transactions

Partition Regex:

/{transaction_date, day}/{account_type, identity}

Configuration:

```json
{
  "stream_name": "transactions",
  "partition_regex": "/{transaction_date, day}/{account_type, identity}"
}
```

Folder Structure:

transaction_date_day=2025-04-24/account_type=premium/part-00001.parquet

SQL Query:

```sql
SELECT SUM(amount)
FROM transactions
WHERE transaction_date = '2025-04-24'
  AND account_type = 'premium';
```

3. IoT Sensor Data

Partition Regex:

/{timestamp, hour}/{device_id, bucket[256]}/{sensor_type, identity}

Configuration:

```json
{
  "stream_name": "sensor_readings",
  "partition_regex": "/{timestamp, hour}/{device_id, bucket[256]}/{sensor_type, identity}"
}
```

Folder Structure:

timestamp_hour=2025-04-24-14/device_id_bucket=128/sensor_type=temperature/part-00001.parquet

SQL Query:

```sql
SELECT AVG(value)
FROM sensor_readings
WHERE timestamp >= '2025-04-24 14:00'
  AND sensor_type = 'temperature';
```

4. E-commerce Orders

Partition Regex:

/{order_date, month}/{region, identity}/{customer_tier, identity}

Configuration:

```json
{
  "stream_name": "orders",
  "partition_regex": "/{order_date, month}/{region, identity}/{customer_tier, identity}"
}
```

Folder Structure:

order_date_month=2025-04/region=us-west/customer_tier=gold/part-00001.parquet

SQL Query:

```sql
SELECT COUNT(*), SUM(total_amount)
FROM orders
WHERE order_date >= '2025-04-01'
  AND region = 'us-west'
  AND customer_tier = 'gold';
```

Checklist before you decide

| Question | If Yes, Consider |
| --- | --- |
| Do your workloads frequently filter by date or time? | Use day(), month(), or year() transforms for time-based partitioning |
| Do you have columns with many distinct values (user IDs, session IDs)? | Use bucket(n) to evenly distribute high-cardinality data into n buckets |
| Do you have columns with few distinct values (country, status)? | Direct partitioning on the column using the identity transform is sufficient |
| Do your queries often target numeric ranges (price, timestamps)? | Use truncate(width) to group values into fixed-size ranges |
| Do you frequently filter by region, department, or category? | Partition by identity on that column for geographic/categorical grouping |
| Might your schema or query patterns change over time? | Choose hidden partitioning and allow spec evolution for flexibility |
| Are you concerned about metadata overhead from too many partitions? | Balance granularity: avoid partitions smaller than a few MB to optimize performance vs. file count |
| Do you have data skew or potential hot partitions? | Combine high-cardinality and low-cardinality transforms (e.g., bucket within day) |
| Do you prioritize flat directory structures? | Limit nested transforms to two or three levels for storage layout simplicity |
Quick Reference

  • Cardinality — Aim for 100 – 10,000 files per partition folder.
  • Skew — Use bucket[N] when one value dominates (e.g., a single large tenant).
  • Time range queries — Put the time transform first if nearly every query filters on a date.
  • Evolution — Start simple (e.g., {event_date, day}) and add more fields later; Iceberg keeps old snapshots readable.



💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don't hesitate to contact us if you need any help or further clarification!