
Types of Compaction

Before diving into the types of compaction, it helps to understand how OLake Fusion categorizes files in an Iceberg table:

  • Small files (fragments): Tiny files well below the target size. These pile up quickly when data is written frequently, such as in streaming or high-frequency CDC scenarios.
  • Medium-sized files (segments): Files bigger than fragments but still not at the ideal size. These are partially compacted files that haven't yet reached the target.
  • Optimally sized files: Files at exactly the configured target file size. These are what compaction aims to produce.

The goal of compaction is to turn fragments and segments into optimally sized files.
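The three categories above can be sketched as a simple size check. This is only an illustration, not OLake Fusion's actual logic: the 128 MiB target, the `classify_file` helper, and the 1/8 fragment threshold are assumptions (the 1/8 figure is borrowed from the Medium compaction output range in the comparison table further down).

```python
from enum import Enum

MiB = 1024 * 1024


class FileClass(Enum):
    FRAGMENT = "fragment"
    SEGMENT = "segment"
    OPTIMAL = "optimal"


def classify_file(size_bytes: int,
                  target_bytes: int = 128 * MiB,  # hypothetical target file size
                  fragment_ratio: int = 8) -> FileClass:
    """Classify a data file by size relative to the target file size.

    Assumption: a file smaller than target / fragment_ratio is a fragment,
    a file at or above that but below the target is a segment, and a file
    that has reached the target counts as optimally sized.
    """
    if size_bytes < target_bytes // fragment_ratio:
        return FileClass.FRAGMENT
    if size_bytes < target_bytes:
        return FileClass.SEGMENT
    return FileClass.OPTIMAL


if __name__ == "__main__":
    for size in (4 * MiB, 64 * MiB, 128 * MiB):
        print(f"{size // MiB} MiB -> {classify_file(size).value}")
```

Under these assumed thresholds, a 4 MiB file is a fragment, a 64 MiB file is a segment, and a 128 MiB file is optimally sized.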

OLake supports three types of compaction:

1. Lite Compaction

Lite compaction is the lightest and most frequently run type. It focuses on two things: merging fragments into larger files and converting Equality Delete Files into Position Delete Files. Position Delete Files are cheaper for query engines to process, so this conversion alone improves read performance without a heavy rewrite. Because streaming writes and high-frequency CDC constantly produce small fragment files, Lite compaction is typically scheduled to run frequently, keeping the table tidy before clutter builds up.
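To make the delete-file conversion concrete, here is a minimal conceptual sketch. It models data files as in-memory lists of rows and is not how OLake Fusion or Iceberg implement the conversion; the `equality_to_position_deletes` helper, the `id` key column, and the file names are hypothetical. It also assumes every data file was written before the delete, so sequence-number checks are left out.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def equality_to_position_deletes(
    data_files: Dict[str, List[Row]],
    deleted_keys: Set[object],
    key: str = "id",  # hypothetical equality column
) -> List[Tuple[str, int]]:
    """Resolve an equality delete ("delete rows where key is in deleted_keys")
    into position deletes: explicit (file path, row position) pairs.

    The scan over every row happens once here, at compaction time, so a
    reader only has to skip the listed positions afterwards.
    """
    position_deletes = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if row[key] in deleted_keys:
                position_deletes.append((path, pos))
    return position_deletes


if __name__ == "__main__":
    files = {
        "a.parquet": [{"id": 1}, {"id": 2}],
        "b.parquet": [{"id": 2}, {"id": 3}],
    }
    print(equality_to_position_deletes(files, {2}))
```

The sketch shows why the converted form is cheaper to read: matching rows against key values is done once during compaction instead of on every query.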

2. Medium Compaction

Medium compaction goes a step further. It merges segment files up to the target file size and, when too many Position Delete Files have accumulated, merges them directly into the corresponding Data Files, physically removing the deleted rows from the table. This is more thorough than Lite compaction but still does not rewrite the entire table. Medium compaction is typically scheduled less frequently than Lite compaction, which keeps the table efficient without spending too much compute.
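Merging position deletes into data files can be pictured with the small sketch below. It is a conceptual model only: files are in-memory lists of rows, and the `apply_position_deletes` helper and file names are hypothetical rather than part of OLake Fusion.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def apply_position_deletes(
    data_files: Dict[str, List[Row]],
    position_deletes: Iterable[Tuple[str, int]],
) -> Dict[str, List[Row]]:
    """Return rewritten data files with the deleted positions dropped.

    After this step the delete entries are no longer needed, because the
    rows they pointed at are physically gone from the rewritten files.
    """
    deleted = set(position_deletes)
    return {
        path: [row for pos, row in enumerate(rows) if (path, pos) not in deleted]
        for path, rows in data_files.items()
    }


if __name__ == "__main__":
    files = {
        "a.parquet": [{"id": 1}, {"id": 2}],
        "b.parquet": [{"id": 2}, {"id": 3}],
    }
    deletes = [("a.parquet", 1), ("b.parquet", 0)]
    print(apply_position_deletes(files, deletes))
```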

3. Full Compaction

Full compaction is the deepest and most comprehensive type. It rewrites everything (fragments, segments, and delete files) into optimally sized data files that match the configured target file size. Because it rewrites the entire table, it is the most compute-intensive option and is typically run less frequently than Lite or Medium compaction. Use it when tables have accumulated heavy fragmentation over time or when you need the best possible query performance.

Compaction precedence

When more than one type of compaction is scheduled to run on a table at the same time, only the highest-precedence type runs: Full overrides Medium and Lite, and Medium overrides Lite. For example, if Full, Medium, and Lite are all due together, Full runs alone; if Medium and Lite are due together, Medium runs alone.
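The precedence rule can be expressed in a few lines. This is a sketch of the rule as described above, not OLake's scheduler; the `select_compaction` helper and the lowercase type names are hypothetical.

```python
from typing import Iterable, Optional

# Highest precedence first, as described above: Full > Medium > Lite.
PRECEDENCE = ("full", "medium", "lite")


def select_compaction(due: Iterable[str]) -> Optional[str]:
    """Return the single compaction type that should run, or None if none is due."""
    due_set = set(due)
    for kind in PRECEDENCE:
        if kind in due_set:
            return kind
    return None


if __name__ == "__main__":
    print(select_compaction({"full", "medium", "lite"}))  # full
    print(select_compaction({"medium", "lite"}))          # medium
```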

Choosing the Right Compaction Type

| Compaction Type | Output | What It Does | Cost Incurred | When to Use |
| --- | --- | --- | --- | --- |
| Lite | Equality delete files are converted to positional delete files and small files are merged | Improves query engine compatibility without rewriting data files | Low | Use when the table has too many small files and you want lightweight maintenance with low compute. |
| Medium | Deletes are applied and data files are merged; output sizes fall between 1/8 of the target file size and the target file size itself | Reduces fragmentation by merging data files into larger files up to the target size | Medium | Use when you need more than Lite: deletes fully applied and files merged toward the target size without a full table rewrite. |
| Full | Data files are completely rewritten into files aligned with the target file size | Performs a full copy-on-write rewrite of the table to produce the best file layout | High | Use when tables are heavily fragmented or when maximum query performance and the best file layout are required. |

