
What is OLake Go?

OLake Go is a high-performance, open-source EL (Extract–Load) platform that bridges operational databases and open lakehouse storage. It writes Apache Iceberg tables or Parquet files on Amazon S3, enabling organizations to replicate data at scale with minimal overhead. With support for Incremental Sync, Change Data Capture (CDC), and stateful, resumable syncs, OLake Go keeps your tables fresh, organized, and optimized for analytics.

Supported Sources

OLake Go supports ingestion from the following sources:

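| Source          | Full Refresh | Full Refresh + Incremental | Full Refresh + CDC | CDC Only |
|-----------------|--------------|----------------------------|--------------------|----------|
| PostgreSQL      | ✅           | ✅                         | ✅                 | ✅       |
| MySQL           | ✅           | ✅                         | ✅                 | ✅       |
| MongoDB         | ✅           | ✅                         | ✅                 | ✅       |
| Oracle Database | ✅           | ✅                         | -                  | -        |
| Apache Kafka    | -            | -                          | -                  | ✅       |
| DB2             | ✅           | ✅                         | -                  | -        |
| MSSQL           | ✅           | ✅                         | ✅                 | ✅       |
| S3              | ✅           | ✅                         | -                  | -        |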

Destinations

OLake Go writes data to open lakehouse storage:

  • Parquet files on object storage such as Amazon S3, MinIO, and Google Cloud Storage

  • Apache Iceberg tables with support for multiple catalog integrations including:

    • AWS Glue Data Catalog
    • Apache Hive Metastore
    • REST catalogs such as Nessie, Polaris and Unity Catalog
    • JDBC catalogs

    To learn more, read OLake Catalog Integration.

Capabilities

1. Parallelised Chunking

Parallel chunking is a technique that splits large datasets or collections into smaller virtual chunks, allowing them to be read and processed simultaneously. It is used in sync modes such as Full Refresh, Full Refresh + CDC, and Full Refresh + Incremental.

What it does:

  • Splits big collections into manageable pieces without altering the underlying data.
  • Each chunk can be processed in parallel and independently.

Benefit:

  • Enables parallel reads, dramatically reducing the time needed to perform full snapshots or scans of large datasets.
  • Improves ingestion speed, scalability, and overall system performance.
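
The chunking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OLake's implementation: the integer `id` key, the fixed `chunk_size`, and the helper names `split_into_chunks`, `read_chunk`, and `parallel_snapshot` are all hypothetical, and an in-memory list stands in for the source table.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(min_id, max_id, chunk_size):
    """Return inclusive (start, end) key ranges covering [min_id, max_id]."""
    chunks = []
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        chunks.append((start, end))
        start = end + 1
    return chunks


def read_chunk(rows, bounds):
    """Stand-in for a range query such as WHERE id BETWEEN start AND end."""
    start, end = bounds
    return [row for row in rows if start <= row["id"] <= end]


def parallel_snapshot(rows, chunk_size, workers=4):
    """Read every chunk concurrently and stitch the results back together."""
    ids = [row["id"] for row in rows]
    chunks = split_into_chunks(min(ids), max(ids), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results line up with `chunks`.
        results = list(pool.map(lambda bounds: read_chunk(rows, bounds), chunks))
    return [row for chunk_rows in results for row in chunk_rows]


# Hypothetical source table with ids 1..10.
table = [{"id": i, "value": i * i} for i in range(1, 11)]
snapshot = parallel_snapshot(table, chunk_size=4)
```

The chunks are virtual: they are only key ranges, so the source data itself is never modified or copied into pieces.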

2. Stateful, Resumable Syncs

This capability ensures that data syncs resume automatically from the last checkpoint after interruptions. It applies to the Full Refresh + CDC, Full Refresh + Incremental, and Strict CDC sync modes.

What it does:

  • Maintains state of in-progress syncs.
  • Automatically resumes after crashes, network failures, or manual pauses.
  • Eliminates the need to resync data from the beginning. Interrupted runs resume from the last checkpoint instead.

Benefit:

  • Reduces data duplication and processing time.
  • Ensures reliable, fault-tolerant pipelines.
  • Minimizes manual intervention for operational teams.
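
A minimal sketch of checkpoint-based resumption, assuming a JSON state file and chunk-level granularity. The file layout, the `completed_chunks` key, and the `fail_on` parameter (used only to simulate a crash) are illustrative assumptions and do not describe OLake's actual state format.

```python
import json
import os
import tempfile


def load_state(path):
    """Load the checkpoint, or start fresh if no state file exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed_chunks": []}


def save_state(path, state):
    """Write to a temp file and rename it, so a crash cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_sync(chunks, state_path, process, fail_on=None):
    """Process chunks in order, skipping any that a previous run completed."""
    state = load_state(state_path)
    done = set(state["completed_chunks"])
    for chunk in chunks:
        if chunk in done:
            continue  # finished before the interruption
        if chunk == fail_on:
            raise RuntimeError(f"simulated crash while processing {chunk}")
        process(chunk)
        done.add(chunk)
        state["completed_chunks"] = sorted(done)
        save_state(state_path, state)  # checkpoint after every chunk


state_file = os.path.join(tempfile.mkdtemp(), "state.json")
processed = []
try:
    run_sync(["c1", "c2", "c3"], state_file, processed.append, fail_on="c2")
except RuntimeError:
    pass  # the first run is interrupted after c1
run_sync(["c1", "c2", "c3"], state_file, processed.append)  # resumes at c2
```

In this sketch the second run never touches `c1` again, which is the behaviour the bullets above describe: work resumes from the last checkpoint rather than from the beginning.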

3. Configurable Max Connections

Each job can set its own maximum number of database connections to the source. This limit is per job, not shared across jobs: even when several jobs use the same source, each job's setting applies only to that job. This helps prevent overload and ensures stable performance on the source system.
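
One way to picture a per-job connection cap is a semaphore owned by each job. The sketch below is an assumption-laden illustration: the `Job` class, its `max_connections` argument, and the `peak` counter are hypothetical names, and a short sleep stands in for a real source query.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Job:
    """Hypothetical job whose connection limit is private to that job."""

    def __init__(self, name, max_connections):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous "connections" observed

    def read_chunk(self, chunk):
        with self._slots:  # blocks while the job is at its limit
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                time.sleep(0.001)  # stand-in for a query against the source
                return chunk * 2
            finally:
                with self._lock:
                    self._active -= 1

    def run(self, chunks, workers=8):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(self.read_chunk, chunks))


# Two jobs on the same source: each one enforces only its own limit.
job_a = Job("orders", max_connections=2)
job_b = Job("users", max_connections=5)
result_a = job_a.run(range(20))
result_b = job_b.run(range(20))
```

Because each `Job` owns its semaphore, the total load on a shared source is the sum of the per-job limits, which is worth keeping in mind when several jobs read from the same database.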

4. Data Deduplication

Data Deduplication ensures that only unique records are stored and processed in Upsert ingestion mode, saving space, reducing costs, and improving data quality. OLake automatically deduplicates data using the primary key from the source tables, guaranteeing that each primary key maps to a single row in the destination along with its corresponding olake_id.
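
The upsert behaviour can be illustrated with a dictionary keyed by primary key. This is a simplified sketch: the `upsert` helper and the sample records are hypothetical, and it does not model how olake_id is derived.

```python
def upsert(destination, records, primary_key):
    """Insert new keys and overwrite existing ones, so each key keeps one row."""
    for record in records:
        destination[record[primary_key]] = record
    return destination


destination = {}
upsert(destination, [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}], "id")
# A later change to id=1 replaces the earlier row instead of adding a second one.
upsert(destination, [{"id": 1, "status": "shipped"}], "id")
```

After both batches the destination still holds two rows, and the row for `id` 1 reflects the latest change.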

5. Hive Style Partitioning

Partitioning is the process of dividing large datasets into smaller, more manageable segments based on specific column values (e.g., date, region, or category), improving query performance, scalability, and data organization.
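
A folder-based (Hive-style) layout encodes partition values in directory names, for example `year=2025/month=08/day=22/`. The short sketch below builds such a path from a date; the `partition_path` helper is hypothetical and only illustrates the naming convention, not how OLake constructs paths.

```python
from datetime import date


def partition_path(record_date):
    """Build a Hive-style directory path with zero-padded month and day."""
    return (
        f"year={record_date.year:04d}/"
        f"month={record_date.month:02d}/"
        f"day={record_date.day:02d}/"
    )


path = partition_path(date(2025, 8, 22))
```

Query engines that understand this convention can skip whole directories when a filter on `year`, `month`, or `day` rules them out.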

  • Iceberg partitioning → Metadata-driven, no need for directory-based partitioning; enables efficient pruning and schema evolution.
  • S3-style partitioning → Traditional folder-based layout (e.g., year=2025/month=08/day=22/) for compatibility with external tools.
  • Normalization → Automatically expands level-1 nested JSON fields into top-level columns.

What normalization does:

  • Converts nested JSON objects into flat columns for easier querying.
  • Preserves all data while simplifying structure.

Benefit:

  • Makes SQL queries simpler and faster.
  • Reduces the need for complex JSON parsing in queries.
  • Improves readability and downstream analytics efficiency.
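
As an illustration of level-1 normalization, the sketch below turns each top-level field of a JSON document into its own column. It assumes that values nested more deeply are kept as JSON strings rather than expanded further; that reading, along with the `normalize_level_1` name, is an assumption and not a statement of OLake's exact behaviour.

```python
import json


def normalize_level_1(document):
    """Map each top-level key to a column; keep deeper nesting as a JSON string."""
    row = {}
    for key, value in document.items():
        if isinstance(value, (dict, list)):
            # Assumption: only the first level is expanded into columns.
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


row = normalize_level_1({"id": 1, "name": "Ada", "address": {"city": "Pune"}})
```

With columns such as `id` and `name` available directly, a query can filter on them without any JSON parsing.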

6. Schema Evolution & Data Type Changes

OLake automatically handles changes in your table's schema without breaking downstream jobs. To learn more, read Schema Evolution in OLake.

7. Dead Letter Queue Columns (Coming soon)

The DLQ column captures values whose data type changes are not covered by the type promotions that the Iceberg and Parquet destinations support, storing them safely without loss. This prevents sync failures and keeps downstream models stable. By isolating incompatible values, it lets users continue syncing data while addressing type mismatches at their convenience, improving reliability and reducing manual intervention.
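
Since this feature is not yet released, the following is only a speculative sketch of the routing idea. The two promotions listed (int to long, float to double) are ones the Apache Iceberg specification allows; the `route_value` function, the shape of the DLQ entry, and the type names as plain strings are all hypothetical.

```python
# Widening promotions permitted by the Iceberg spec: (current type, incoming type).
SUPPORTED_PROMOTIONS = {("int", "long"), ("float", "double")}


def route_value(column, column_type, incoming_type, value):
    """Return (column_value, dlq_entry); incompatible values go to the DLQ entry."""
    same_type = incoming_type == column_type
    promotable = (column_type, incoming_type) in SUPPORTED_PROMOTIONS
    if same_type or promotable:
        return value, None
    # Hypothetical DLQ payload: keep the raw value so nothing is lost.
    return None, {"column": column, "type": incoming_type, "value": value}


kept = route_value("qty", "int", "long", 2**40)
diverted = route_value("qty", "int", "string", "ten")
```

In this sketch the sync keeps going in both cases: a promotable value lands in its column, while an incompatible one is set aside with enough context to repair it later.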

8. Two Phase Commit

OLake uses a two-phase commit for the Iceberg destination during full refresh, incremental, and CDC syncs (MongoDB, PostgreSQL, MySQL, MSSQL). Per-chunk commit status is tracked in the destination state, so syncs avoid inconsistencies and duplicate writes.
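
The general shape of a per-chunk two-phase commit can be sketched as follows. This is a toy model under stated assumptions, not OLake's Iceberg writer: the `Destination` class, its `prepare` and `commit` methods, and the `committed_chunks` set are hypothetical stand-ins for staged data files and commit state recorded at the destination.

```python
class Destination:
    """Toy destination that records which chunks have been committed."""

    def __init__(self):
        self.staged = {}               # phase 1: written but not yet visible
        self.rows = []                 # visible data
        self.committed_chunks = set()  # commit status kept with the destination

    def prepare(self, chunk_id, rows):
        self.staged[chunk_id] = list(rows)

    def commit(self, chunk_id):
        # Phase 2: publish the rows and record the chunk id together.
        self.rows.extend(self.staged.pop(chunk_id))
        self.committed_chunks.add(chunk_id)


def write_chunk(dest, chunk_id, rows):
    """Write a chunk once; a retry of a committed chunk is a no-op."""
    if chunk_id in dest.committed_chunks:
        return False
    dest.prepare(chunk_id, rows)
    dest.commit(chunk_id)
    return True


dest = Destination()
first = write_chunk(dest, "chunk-1", [1, 2])
retry = write_chunk(dest, "chunk-1", [1, 2])  # e.g. replayed after a restart
```

Because the commit marker lives with the destination, a retried chunk can be detected and skipped, which is how this sketch avoids duplicate writes.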



💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don't hesitate to contact us if you need any help or further clarification!