What is the Iceberg Writer?
The Iceberg Writer syncs data from databases (MySQL, MongoDB, PostgreSQL) into Apache Iceberg. Apache Iceberg is an open table format that manages data files stored in formats such as Parquet and ORC, and it offers several benefits over traditional Hive-style tables. Iceberg tables are designed for efficient reads and writes, and they support schema evolution, ACID transactions, and time travel.
Supported Catalogs
Catalog Type | Description |
---|---|
JDBC | Uses PostgreSQL as the metadata catalog (local testing) |
AWS Glue | Uses AWS Glue for metadata catalog and AWS S3 for storage |
REST | Uses a REST API for metadata catalog and storage |
Hive | Uses Hive Metastore as the metadata catalog |
For more catalog options, please refer to the OLake roadmap.
Quick Start Guide
It's a simple three-step process:
- Create a source config file named source.json.
- Create another config file named destination.json.
- Run the discover and sync commands to fetch the schema and start syncing the data, respectively.
- source.json holds the source database information, such as host, port, username, password, and database name.
- destination.json holds the Iceberg destination configuration, such as the Iceberg table name, the Iceberg database name, and catalog information.
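Since both files need to be in place before anything runs, a small pre-flight check can catch a missing file early. The sketch below is illustrative only: `check_olake_config_dir` is a hypothetical helper name, not part of OLake, and it checks only that the two files exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of OLake): verify that a config directory
# contains both files described above.
# Prints one "missing: <name>" line per absent file and returns non-zero
# if any file is missing.
check_olake_config_dir() {
  dir=$1
  missing=0
  for name in source.json destination.json; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_olake_config_dir /path/to/config` before running the commands gives a clear message instead of a failure inside the container.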
Depending on the source and destination you want to sync between, choose one of the configurations below.
- PostgreSQL to Iceberg | Postgres Source Config
- MongoDB to Iceberg | MongoDB Source Config
- MySQL to Iceberg | MySQL Source Config
Now that the source configuration is set, let's move on to the destination configuration.
OLake supports four catalog types for the Iceberg writer: JDBC, AWS Glue, REST, and Hive Metastore. Choose the destination configuration that fits your requirements.
Run Sync Commands:
To replicate data from the source database to the Iceberg table, you need to run the sync commands. The sync command reads the data from the source database and writes it to the Iceberg table.
- Discover Command:
<DISCOVER_COMMAND>
- Sync Command:
<SYNC_COMMAND>
- Sync with State Command:
<SYNC_WITH_STATE_COMMAND>
Refer to the docs for your database for the exact commands to discover the schema and sync the data.
- MongoDB Discover and sync command
- Postgres Discover and sync command
- MySQL Discover and sync command
A sample discover command looks like this:
docker run --pull=always \
-v /Users/USERNAME/Desktop/projects/OLAKE_DIRECTORY:/mnt/config \
olakego/source-mongodb:latest \
discover \
--config /mnt/config/source.json
olakego/source-mongodb is the OLake image for the MongoDB source. You can replace it with the corresponding source image for PostgreSQL (source-postgres) or MySQL (source-mysql), or build one locally.
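To make the image swap concrete, here is a minimal sketch that prints the sample discover command for each source. It assumes the PostgreSQL and MySQL images use the same `olakego/` prefix and `:latest` tag as the MongoDB image shown above. The function names `olake_image` and `discover_command` are hypothetical helpers, not OLake tooling, and the command is only printed, never executed.

```shell
# Hypothetical helper: map a source name to its OLake image.
# Assumes the postgres and mysql images follow the same naming as the
# MongoDB image in the sample command.
olake_image() {
  case "$1" in
    mongodb)  echo "olakego/source-mongodb:latest" ;;
    postgres) echo "olakego/source-postgres:latest" ;;
    mysql)    echo "olakego/source-mysql:latest" ;;
    *)        echo "unknown source: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print (do not run) the discover command.
#   $1 = source name, $2 = host directory containing source.json
# Note: a host path containing spaces would need quoting before the
# printed command is pasted into a shell.
discover_command() {
  image=$(olake_image "$1") || return 1
  printf 'docker run --pull=always -v %s:/mnt/config %s discover --config /mnt/config/source.json\n' "$2" "$image"
}

# Example: show the command for a PostgreSQL source.
discover_command postgres "$HOME/olake"
```

Printing the command instead of running it lets you review the image and the mounted path before anything connects to the database.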