Types of Compaction
Before diving into the types of compaction, it helps to understand how OLake Fusion categorizes files in an Iceberg table:
- Small files (fragments): Tiny files well below the target size. These pile up quickly when data is written frequently, such as in streaming or high-frequency CDC scenarios.
- Medium-sized files (segments): Files bigger than fragments but still not at the ideal size. These are partially compacted files that haven't yet reached the target.
- Optimally sized files: Files at exactly the configured target file size. These are what compaction aims to produce.
The goal of compaction is to turn fragments and segments into optimally sized files.
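The three categories can be sketched as a simple size check. This is only an illustration, not OLake Fusion's actual logic: the function name `classify_file`, the `fragment_ratio` parameter, and the 128 MiB target are hypothetical, and the 1/8 cutoff is an assumption borrowed from the Medium output range in the table at the end of this section.

```python
MIB = 1024 * 1024


def classify_file(size_bytes: int, target_bytes: int, fragment_ratio: float = 1 / 8) -> str:
    """Label a data file by size relative to the target file size.

    The fragment_ratio cutoff is an assumed value for illustration,
    not a documented OLake Fusion threshold.
    """
    if size_bytes < target_bytes * fragment_ratio:
        return "fragment"   # tiny file, well below the target
    if size_bytes < target_bytes:
        return "segment"    # partially compacted, not yet at the target
    return "optimal"        # at the target size


# Example with an assumed 128 MiB target file size.
target = 128 * MIB
labels = [classify_file(size * MIB, target) for size in (4, 64, 128)]
```

With these assumed values, a 4 MiB file counts as a fragment, a 64 MiB file as a segment, and a 128 MiB file as optimally sized.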
OLake supports three types of compaction:
1. Lite Compaction
Lite compaction is the lightest and most frequently run type. It focuses on two things: merging fragments into larger files and converting Equality Delete Files into Position Delete Files. Position Delete Files are cheaper for query engines to process, so this conversion alone improves read performance without a heavy rewrite. Because streaming writes and high-frequency CDC constantly produce small fragment files, Lite compaction is typically scheduled to run frequently, keeping the table tidy before clutter builds up.
2. Medium Compaction
Medium compaction goes a step further. It merges segment files up to the target file size, and when too many Position Delete Files have accumulated, it merges them directly into the corresponding Data Files, physically removing the deleted rows from the table. This is more thorough than Lite compaction but still does not rewrite the entire table. Medium compaction is typically scheduled less frequently than Lite compaction, which keeps the table efficient without spending too much compute.
3. Full Compaction
Full compaction is the deepest and most comprehensive type. It rewrites all files (fragments, segments, and delete files) into optimally sized files that exactly match the configured target file size. Because it rewrites the entire table, it is the most compute-intensive option and is typically run the least frequently. Use it when tables have accumulated heavy fragmentation over time or when you need the best possible query performance.
When more than one type of compaction is scheduled to run on a table at the same time, only the highest-priority type runs: Full overrides Medium and Lite, and Medium overrides Lite. For example, if Full, Medium, and Lite are all due together, Full runs alone; if Medium and Lite are due together, Medium runs alone.
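The override rule above amounts to picking the maximum of an ordered set. The following is a minimal sketch of that rule only; the names `CompactionType` and `select_compaction` are hypothetical and do not correspond to any OLake API.

```python
from enum import IntEnum
from typing import Iterable, Optional


class CompactionType(IntEnum):
    # Higher value means higher priority, mirroring Full > Medium > Lite.
    LITE = 1
    MEDIUM = 2
    FULL = 3


def select_compaction(due: Iterable[CompactionType]) -> Optional[CompactionType]:
    """Return the single compaction type that runs, or None if nothing is due."""
    return max(due, default=None)


# Full, Medium, and Lite all due together: Full runs alone.
all_due = select_compaction(
    [CompactionType.LITE, CompactionType.MEDIUM, CompactionType.FULL]
)

# Medium and Lite due together: Medium runs alone.
two_due = select_compaction([CompactionType.LITE, CompactionType.MEDIUM])
```

Encoding the priority in the enum values keeps the rule in one place, so the selection stays a single `max` call however many types are due.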
Choosing the Right Compaction Type
| Compaction Type | Output | What it Does | Cost Incurred | When to Use |
|---|---|---|---|---|
| Lite | Equality delete files are converted to position delete files and small files are merged | Improves query engine compatibility without rewriting data files | Low | Use when the table has too many small files and you want lightweight maintenance with low compute. |
| Medium | Deletes are applied and data files are merged; output sizes fall between 1/8 of target file size and the target file size itself | Reduces fragmentation by merging data files into larger files up to the target size | Medium | Use when you need more than Lite: deletes fully applied and files merged toward the target size without a full table rewrite. |
| Full | Data files are completely rewritten into files aligned with the target file size | Performs a full copy-on-write rewrite of the table to produce the best file layout | High | Use when tables are heavily fragmented or when maximum query performance and the best file layout are required. |