Iceberg Partitioning Support

Partitioning groups rows that share common values at write-time, so that queries filtering on those values read only the relevant files. The result is fewer data files scanned, less I/O, and faster queries.

How Iceberg handles partitions

Iceberg automatically stores the derived partition values for each row in table metadata.
Because the physical layout is hidden, your SQL can stay simple (WHERE date > …); Iceberg prunes away files that cannot match.
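
For instance, with a table partitioned by day(created_at), a query only filters the raw column and the daily partitions are pruned automatically. A minimal sketch (the table and column names here are illustrative, not from this guide):

-- Illustrative: the filter references the source column, never a partition folder
SELECT order_id, grand_total
FROM analytics.orders
WHERE created_at >= DATE '2025-04-01'
AND created_at < DATE '2025-05-01';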

Partition spec syntax used by the writer

In the streams.json file, set the partition_regex string to a spec in the format below; it describes one or more partition fields, in order:

"partition_regex": "/{field_name, transform}/{next_field, transform}"
Component     Meaning
field_name    Column in the incoming record (created_at, user_id, etc.)
transform     Iceberg transform that converts the value.

Special case
You may use now() as a pseudo-column that evaluates to the writer’s current timestamp (useful when your records lack a suitable time field).
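
For example, a stream entry using this pseudo-column could look like the snippet below (the stream name is illustrative; the entry format matches the examples later in this guide):

{
  "stream_name": "audit_events",
  "partition_regex": "/{now(), day}"
}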

Complete examples

# Partition rows by calendar year of a timestamp:
/{created_at, year}

# Partition by whole day:
/{event_date, day}

# Two-level spec: customer id bucket then month of event time
/{customer_id, bucket[32]}/{event_time, month}

# Daily partitions based on the writer clock
/{now(), day}

Supported transforms—what they do and what you’ll see on disk

identity
  • Typical use-case: columns with only a few distinct values, e.g. country, status
  • How it works: writes the raw value unchanged
  • Spec snippet: /{status, identity}
  • Example: status = 'active' → status=active/

year
  • Typical use-case: time-series data where queries slice whole years
  • How it works: extracts the calendar year as an integer
  • Spec snippet: /{created_at, year}
  • Example: created_at = 2025-04-24 15:42 → year=2025/

month
  • Typical use-case: monthly reporting, rollups, retention
  • How it works: returns the month number (1-12) of the timestamp
  • Spec snippet: /{created_at, month}
  • Example: created_at = 2025-04-24 15:42 → month=4/

day
  • Typical use-case: daily dashboards, 30-day TTL jobs
  • How it works: returns the local calendar day of the month (1-31)
  • Spec snippet: /{event_date, day}
  • Example: event_date = 2025-04-24 → day=24/

hour
  • Typical use-case: high-volume logs where analysts drill into single hours
  • How it works: returns the hour of day (0-23)
  • Spec snippet: /{event_time, hour}
  • Example: event_time = 2025-04-24 15:42 → hour=15/

bucket[N]
  • Typical use-case: very high-cardinality keys (email, UUID) to distribute writes
  • How it works: hashes the value and assigns it to one of N numbered buckets (0…N-1)
  • Spec snippet: /{user_id, bucket[64]}
  • Example: user_id = 'f47ac10b...' → user_id_bucket=17/ (one of 64)

truncate[N]
  • Typical use-case: long strings where only the prefix matters (domains, URLs)
  • How it works: keeps the first N UTF-8 characters
  • Spec snippet: /{domain, truncate[4]}
  • Example: domain = 'example.com' → domain=exam/

Quick guide to data types

  • identity and truncate[N] keep string values as strings.
  • year, month, day, hour and bucket[N] store integers under the hood (e.g. month=4, bucket=17).
    That makes partition pruning cheap and predictable.

What if a field is missing?

If the incoming record lacks a declared partition column, the writer sets that partition value to null. Iceberg treats null like any other value, so the row is written successfully and queries can still skip non-matching files.
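
For illustration, under a spec such as /{timestamp_col, day}/{region, identity} (used in the next section), a record that arrives without a region value might land in a layout like this (the exact folder name used for a null value depends on the writer):

my_stream/
└─ day=2025-04-14/
   └─ region=null/
      └─ data-file.parquet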

Putting it all together

config/snippets/stream.json
{
  "selected_streams": {
    "my_namespace": [
      {
        "stream_name": "my_stream",
        "partition_regex": "/{timestamp_col, day}/{region, identity}"
      }
    ]
  }
}

Run your sync as usual; the writer produces an Iceberg table whose directories reflect the spec:

my_stream/
└─ day=2025-04-14/
   └─ region=in/
      └─ data-file.parquet

Querying partitioned tables

-- Only partitions for 1 May 2023 and region 'us-east' are read
SELECT *
FROM olake_iceberg.olake_iceberg.my_stream
WHERE timestamp_col = DATE '2023-05-01'
AND region = 'us-east';

Because Iceberg pushes the predicates (timestamp_col, region) down to the metadata layer, files outside those partitions are skipped entirely.
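
If you want to verify the layout, many engines expose Iceberg's metadata tables. A sketch in Spark SQL, reusing the catalog and table names above (column availability can vary by Iceberg version):

-- List each partition with its file and record counts
SELECT partition, file_count, record_count
FROM olake_iceberg.olake_iceberg.my_stream.partitions;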

Partition-spec “cheat-sheet”

All the examples below follow the same four-block layout so you can scan, copy, and adapt them quickly:

  • Partition Spec – the exact string to paste into partition_regex
  • Config – the stream entry as it appears in your config file
  • Folder – one real path the writer produces
  • Query – a SQL snippet that shows pruning in action

1 Streaming click-stream logs

Partition Spec

/{event_time, hour}/{user_id, bucket[512]}

Config

{
  "stream_name": "web_events",
  "partition_regex": "/{event_time, hour}/{user_id, bucket[512]}"
}

Folder created

event_time_hour=2025-04-24-15/user_id_bucket=344/part-00001.parquet

Query that prunes

SELECT count(*)
FROM web.events
WHERE event_time BETWEEN TIMESTAMP '2025-04-24 15:00'
AND TIMESTAMP '2025-04-24 15:59';
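
If a query also filters on the bucketed key, the scan narrows further to a single bucket within the hour. A sketch (the user id is hypothetical):

SELECT *
FROM web.events
WHERE user_id = 'u-9042'
AND event_time BETWEEN TIMESTAMP '2025-04-24 15:00'
AND TIMESTAMP '2025-04-24 15:59';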

2 Global e-commerce orders

Partition Spec

/{order_date, month}/{ship_country, identity}

Config

{
  "stream_name": "orders",
  "partition_regex": "/{order_date, month}/{ship_country, identity}"
}

Folder

order_date_month=2025-04/ship_country=IN/part-00037.parquet

Query

SELECT sum(grand_total)
FROM sales.orders
WHERE order_date BETWEEN DATE '2025-04-01' AND DATE '2025-04-30'
AND ship_country = 'IN';

3 IoT device telemetry

Partition Spec

/{event_date, month}/{device_id, bucket[64]}

Config

{
  "stream_name": "telemetry",
  "partition_regex": "/{event_date, month}/{device_id, bucket[64]}"
}

Folder

event_date_month=2025-04/device_id_bucket=22/part-00003.parquet

Query

SELECT *
FROM iot.telemetry
WHERE device_id = 'sensor-9f2c'
AND event_date >= DATE '2025-04-01'
AND event_date < DATE '2025-05-01';

4 Modernising a Hive yyyy/mm/dd layout

Partition Spec

/{order_ts, year}/{order_ts, month}/{order_ts, day}

Config

{
  "stream_name": "hive_migrated_orders",
  "partition_regex": "/{order_ts, year}/{order_ts, month}/{order_ts, day}"
}

Folder

order_ts_year=2025/order_ts_month=04/order_ts_day=24/part-99983.parquet

Query

SELECT *
FROM sales.orders
WHERE order_ts = DATE '2025-04-24';
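
A coarser filter on the same column also prunes, just at the year level of this spec, for example:

SELECT count(*)
FROM sales.orders
WHERE order_ts >= DATE '2025-01-01'
AND order_ts < DATE '2026-01-01';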

5 GDPR / CCPA expiry window (day-level)

Partition Spec

/{delete_after, day}

Config

{
  "stream_name": "regulated_events",
  "partition_regex": "/{delete_after, day}"
}

Folder

delete_after_day=2025-05-24/part-00017.parquet
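
Query

Day-level expiry partitions make bulk deletes cheap. A sketch (the table name is illustrative, and it assumes your query engine supports Iceberg row-level deletes):

-- Expire everything whose retention window has ended
DELETE FROM compliance.regulated_events
WHERE delete_after < CURRENT_DATE;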

6 Pure identity on low-cardinality code

Partition Spec

/{status, identity}

Config

{
  "stream_name": "user_status",
  "partition_regex": "/{status, identity}"
}

Folder

status=active/part-00009.parquet

Query

SELECT *
FROM users.status
WHERE status = 'active';

7 Hierarchical calendar (year → month → day)

Partition Spec

/{event_ts, year}/{event_ts, month}/{event_ts, day}

Config

{
  "stream_name": "event_log",
  "partition_regex": "/{event_ts, year}/{event_ts, month}/{event_ts, day}"
}

Folder

event_ts_year=2025/event_ts_month=04/event_ts_day=24/part-00571.parquet

Query

SELECT *
FROM logs.events
WHERE event_ts >= DATE '2025-04-24'
AND event_ts < DATE '2025-04-25';

8 Prefix truncate on long strings

Partition Spec

/{domain, truncate[4]}

Config

{
  "stream_name": "web_domains",
  "partition_regex": "/{domain, truncate[4]}"
}

Folder

domain=exam/part-00001.parquet          # holds rows like 'example.com'

Query

SELECT *
FROM web.domains
WHERE domain LIKE 'exam%';

9 Boolean flag shield

Partition Spec

/{is_test, identity}

Config

{
  "stream_name": "mixed_traffic",
  "partition_regex": "/{is_test, identity}"
}

Folder

is_test=true/part-00005.parquet

Query

SELECT *
FROM prod.events
WHERE is_test = false; -- skips the whole test partition

10 Schema-versioned blobs

Partition Spec

/{schema_version, identity}/{update_ts, month}

Config

{
  "stream_name": "json_blobs",
  "partition_regex": "/{schema_version, identity}/{update_ts, month}"
}

Folder

schema_version=v3/update_ts_month=2025-04/part-00077.parquet

Query

SELECT payload->>'new_field'
FROM repo.blobs
WHERE schema_version >= 'v3';

Quick-select flowchart

Rule of thumb – Aim for 100 – 10 000 files inside each partition folder.
Adjust bucket counts or combine fields until you hit that sweet spot.
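
One way to sanity-check file counts is Iceberg's files metadata table. A sketch in Spark SQL, reusing the table from the earlier example:

SELECT partition, count(*) AS data_files
FROM olake_iceberg.olake_iceberg.my_stream.files
GROUP BY partition
ORDER BY data_files DESC;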

Use these patterns as a baseline, evolve when your query patterns or data volume change, and you’ll keep scans fast without over-engineering your layout.

Checklist before you decide

If the answer to a question below is yes, consider the suggestion after the arrow:

  • Do most queries filter on a date range? → Put a time transform first (day, month).
  • Is one field extremely high-cardinality? → Add bucket[N].
  • Do you need easy bulk-deletes? → Partition on expiry or ingest date.
  • Do analysts group by a specific code (country, status)? → Use identity on that column.

Stick to one or two partition fields at first; you can evolve later if query patterns change.

Quick checklist before you pick a spec

  1. Cardinality — Aim for 100 – 10 000 files per partition folder.
  2. Skew — Use bucket[N] when one value dominates (e.g., a single large tenant).
  3. Time range queries — Put the time transform first if nearly every query filters on a date.
  4. Evolution — Start simple (e.g., {event_date, day}) and add more fields later; Iceberg keeps old snapshots readable.

These patterns should cover 90 % of use-cases we’ve seen in the field.
If your workload looks different, open an issue or drop us a message—we love weird corner-cases!



Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: discuss future roadmaps, report bugs, and get help debugging issues you are facing.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!