What is OLake Go?
OLake Go is a high-performance, open-source EL (Extract-Load) platform that bridges operational databases and open lakehouse storage. It writes Apache Iceberg tables or Parquet files on Amazon S3, enabling organizations to replicate data at scale with minimal overhead. With support for Incremental Sync, Change Data Capture (CDC), and stateful, resumable syncs, OLake Go keeps your tables fresh, organized, and optimized for analytics.
Supported Sources
OLake Go supports ingestion from the following sources:
| Source | Full Refresh | Full Refresh + Incremental | Full Refresh + CDC | CDC Only |
|---|---|---|---|---|
| PostgreSQL | β | β | β | β |
| MySQL | β | β | β | β |
| MongoDB | β | β | β | β |
| Oracle Database | β | β | β | β |
| Apache Kafka | β | β | β | β |
| DB2 | β | β | β | β |
| MSSQL | β | β | β | β |
| S3 | β | β | β | β |
Destinations
OLake Go writes data to open lakehouse storage:
- Parquet files on object storage such as Amazon S3, MinIO, and Google Cloud Storage
- Apache Iceberg tables with support for multiple catalog integrations including:
  - AWS Glue Data Catalog
  - Apache Hive Metastore
  - REST catalogs such as Nessie, Polaris and Unity Catalog
  - JDBC catalogs
To learn more, see OLake Catalog Integration.
Capabilities
1. Parallelised Chunking
Parallel chunking is a technique that splits large datasets or collections into smaller virtual chunks, allowing them to be read and processed simultaneously. It is used in sync modes such as Full Refresh, Full Refresh + CDC, and Full Refresh + Incremental.
What it does:
- Splits big collections into manageable pieces without altering the underlying data.
- Each chunk can be processed in parallel and independently.
Benefit:
- Enables parallel reads, dramatically reducing the time needed to perform full snapshots or scans of large datasets.
- Improves ingestion speed, scalability, and overall system performance.
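The idea can be sketched in a few lines of Python. This is only an illustration of the technique, not OLake's implementation: the helper names (`make_chunks`, `read_chunk`, `parallel_snapshot`), the integer `id` key, and the in-memory `table` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(min_id, max_id, chunk_size):
    """Split the inclusive key range [min_id, max_id] into half-open (start, end) chunks."""
    chunks = []
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size, max_id + 1)
        chunks.append((start, end))
        start = end
    return chunks


def read_chunk(table, chunk):
    """Stand-in for a ranged source query such as: WHERE id >= start AND id < end."""
    start, end = chunk
    return [row for row in table if start <= row["id"] < end]


def parallel_snapshot(table, chunk_size, max_workers=4):
    """Read every chunk on a worker thread; the source data itself is never modified."""
    if not table:
        return [], []
    ids = [row["id"] for row in table]
    chunks = make_chunks(min(ids), max(ids), chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in chunk order even though reads run concurrently.
        parts = list(pool.map(lambda c: read_chunk(table, c), chunks))
    return chunks, [row for part in parts for row in part]


# Hypothetical in-memory "table" with ids 1..10.
table = [{"id": i, "name": f"user-{i}"} for i in range(1, 11)]
chunks, rows = parallel_snapshot(table, chunk_size=4)
```

Because the chunks are only key ranges, they are "virtual": nothing is copied or rewritten on the source, and each range can be retried on its own.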
2. Stateful, Resumable Syncs
Stateful, resumable syncs ensure that data syncs resume automatically from the last checkpoint after interruptions. This applies to the Full Refresh + CDC, Full Refresh + Incremental, and Strict CDC sync modes.
What it does:
- Maintains state of in-progress syncs.
- Automatically resumes after crashes, network failures, or manual pauses.
- Eliminates the need to resync data from the beginning. Interrupted runs resume from the last checkpoint instead.
Benefit:
- Reduces data duplication and processing time.
- Ensures reliable, fault-tolerant pipelines.
- Minimizes manual intervention for operational teams.
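A minimal sketch of checkpoint-based resumption follows. It assumes a JSON state file and integer chunk ids, and the names (`load_state`, `save_state`, `run_sync`, `fail_on`) are hypothetical; OLake's actual state format is not described here.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved checkpoint, or an empty one on the first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed_chunks": []}


def save_state(path, state):
    """Write to a temp file and rename, so a crash never leaves a half-written state file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_sync(chunks, state_path, process, fail_on=None):
    """Process chunks in order, skipping any that the checkpoint marks as done."""
    state = load_state(state_path)
    done = set(state["completed_chunks"])
    for chunk_id in chunks:
        if chunk_id in done:
            continue
        if chunk_id == fail_on:
            raise RuntimeError(f"simulated crash on chunk {chunk_id}")
        process(chunk_id)
        done.add(chunk_id)
        state["completed_chunks"] = sorted(done)
        save_state(state_path, state)


state_path = os.path.join(tempfile.mkdtemp(), "state.json")
processed = []
chunks = [0, 1, 2, 3]

try:
    run_sync(chunks, state_path, processed.append, fail_on=2)  # interrupted run
except RuntimeError:
    pass

run_sync(chunks, state_path, processed.append)  # resumes at chunk 2
```

The second call does not touch chunks 0 and 1 again, which is what keeps a resumed run from redoing or duplicating earlier work.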
3. Configurable Max Connections
Each job can set its own maximum number of database connections to the source. This limit is per job, not shared across jobs: even when several jobs use the same source, each job's setting applies only to that job. This helps prevent overload and ensures stable performance on the source system.
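One way to picture a per-job limit is a semaphore owned by each job. The sketch below is an assumption-laden illustration (the `Job` class, its `max_connections` argument, and the simulated query are hypothetical), not a description of how OLake enforces the setting.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Job:
    """Each job owns its own semaphore, so its limit is never shared with other jobs."""

    def __init__(self, name, max_connections):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous "connections" observed

    def query(self, chunk):
        with self._slots:  # blocks while this job already holds max_connections slots
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            time.sleep(0.01)  # stand-in for a real source query
            with self._lock:
                self._active -= 1
        return chunk


def run(job, chunks, workers=8):
    """Submit more workers than connection slots to show the cap being enforced."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job.query, chunks))


# Two jobs against the same (imaginary) source, each with its own limit.
job_a = Job("orders", max_connections=2)
job_b = Job("users", max_connections=5)
results_a = run(job_a, range(10))
results_b = run(job_b, range(10))
```

Even with eight worker threads, `job_a.peak` never exceeds 2 and `job_b.peak` never exceeds 5, because each job only counts its own connections.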
4. Data Deduplication
Data deduplication ensures that only unique records are stored and processed in Upsert ingestion mode, which saves space, reduces costs, and improves data quality. OLake automatically deduplicates data using the primary key from the source tables, so each primary key maps to a single row in the destination along with its corresponding olake_id.
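Conceptually, an upsert keyed on the primary key behaves like writing into a dictionary. The sketch below assumes a hypothetical `upsert` helper and an in-memory destination; it does not show how olake_id is derived, since that is not described here.

```python
def upsert(destination, records, primary_key):
    """Insert new keys and overwrite existing ones, so each key keeps exactly one row."""
    for record in records:
        destination[record[primary_key]] = record
    return destination


destination = {}

# First batch: two new rows.
upsert(destination, [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}], "id")

# Second batch: id 1 arrives again with a newer value, id 3 is new.
upsert(destination, [{"id": 1, "status": "shipped"}, {"id": 3, "status": "new"}], "id")
```

After both batches the destination holds three rows, and the row for `id` 1 reflects the latest value rather than appearing twice.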
5. Hive Style Partitioning
Partitioning is the process of dividing large datasets into smaller, more manageable segments based on specific column values (e.g., date, region, or category), improving query performance, scalability, and data organization.
- Iceberg partitioning: Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution.
- S3-style partitioning: Traditional folder-based layout (e.g., year=2025/month=08/day=22/) for compatibility with external tools.
- Normalization: Automatically expands level-1 nested JSON fields into top-level columns.
What it does:
- Converts nested JSON objects into flat columns for easier querying.
- Preserves all data while simplifying structure.
Benefit:
- Makes SQL queries simpler and faster.
- Reduces the need for complex JSON parsing in queries.
- Improves readability and downstream analytics efficiency.
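The normalization step can be illustrated with a short sketch. It assumes that "level-1" means the first level of keys in a record, and that anything nested deeper is kept as a JSON string so no data is lost; both points are assumptions for this example, and `normalize_level_1` is a hypothetical name.

```python
import json


def normalize_level_1(raw_json):
    """Turn each first-level key of a JSON record into its own column."""
    record = json.loads(raw_json)
    columns = {}
    for key, value in record.items():
        if isinstance(value, (dict, list)):
            # Assumption: deeper structures are preserved as JSON text, not expanded further.
            columns[key] = json.dumps(value, sort_keys=True)
        else:
            columns[key] = value
    return columns


raw = '{"id": 7, "name": "Ada", "address": {"city": "Pune", "zip": "411001"}}'
row = normalize_level_1(raw)
```

With the record expanded this way, a query can select `id` or `name` directly as columns instead of parsing the whole JSON document on every read.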
6. Schema Evolution & Data Type Changes
OLake automatically handles changes in your table's schema without breaking downstream jobs. For details, see Schema Evolution in OLake.
7. Dead Letter Queue Columns (Coming soon)
The DLQ column captures values whose data type changes fall outside the type promotions supported by the Iceberg and Parquet destinations, storing them safely without loss. This prevents sync failures and keeps downstream models stable. Because incompatible values are isolated, users can continue syncing and address type mismatches at their convenience, which improves reliability and reduces manual intervention.
8. Two Phase Commit
OLake uses a two-phase commit for the Iceberg destination during full refresh, incremental, and CDC (MongoDB, PostgreSQL, MySQL, MSSQL). Per-chunk commit status is tracked in the destination state, so syncs avoid inconsistencies and duplicate writes.
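The general pattern can be sketched as follows. This is a generic illustration of a prepare-then-commit write with per-chunk status, not OLake's Iceberg writer: the `Destination` class, `write_chunk`, and the `crash_before_commit` flag are all hypothetical.

```python
class Destination:
    """Toy destination that stores commit status alongside the data."""

    def __init__(self):
        self.rows = []                 # committed, visible data
        self.staged = {}               # chunk_id -> rows written but not yet visible
        self.committed_chunks = set()  # per-chunk commit status

    def prepare(self, chunk_id, rows):
        """Phase 1: stage the rows; re-staging a chunk replaces any leftover attempt."""
        self.staged[chunk_id] = list(rows)

    def commit(self, chunk_id):
        """Phase 2: make the staged rows visible and record the chunk as committed."""
        self.rows.extend(self.staged.pop(chunk_id))
        self.committed_chunks.add(chunk_id)


def write_chunk(dest, chunk_id, rows, crash_before_commit=False):
    if chunk_id in dest.committed_chunks:
        return "skipped"  # already committed, so a retry must not write it again
    dest.prepare(chunk_id, rows)
    if crash_before_commit:
        raise RuntimeError("simulated crash between phases")
    dest.commit(chunk_id)
    return "committed"


dest = Destination()
outcomes = [write_chunk(dest, "c1", [1, 2])]
try:
    write_chunk(dest, "c2", [3, 4], crash_before_commit=True)
except RuntimeError:
    outcomes.append("crashed")

# Retry the whole sync: c1 is skipped, c2 is staged again and committed once.
outcomes.append(write_chunk(dest, "c1", [1, 2]))
outcomes.append(write_chunk(dest, "c2", [3, 4]))
```

Because the retry consults the commit status held by the destination, the interrupted chunk is written exactly once and the already committed chunk is not duplicated.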