What is the Iceberg Writer?
The Iceberg Writer syncs data from databases (MySQL, MongoDB, PostgreSQL) into Apache Iceberg. Apache Iceberg is an open table format that sits on top of data files stored in file formats such as Parquet and ORC, and it offers a number of benefits over managing those files directly. Iceberg tables are designed to be efficient for both reads and writes, and they support schema evolution, ACID transactions, and time travel.
Supported Catalogs
Catalog Type | Description |
---|---|
JDBC | Uses PostgreSQL as the metadata catalog (local testing) |
AWS Glue | Uses AWS Glue for the metadata catalog and AWS S3 for storage |
REST | Uses a REST API for the metadata catalog and storage |
For more catalog options, please refer to the OLake roadmap.
Quick Start Guide
It's a simple three-step process:
- Create a config file named config.json,
- Create another config file named writer.json, and
- Run the discover and sync commands to fetch the schema and start syncing the data, respectively.
- config.json holds the source database information, such as host, port, username, password, and database name.
- writer.json holds the Iceberg writer configuration, such as the Iceberg table name, Iceberg database name, and catalog information.
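Since both files are plain JSON, a quick check before running any command can catch a missing file or a stray comma early. The following is a minimal sketch, not part of OLake; the helper name `load_json_config` is hypothetical.

```python
import json
from pathlib import Path


def load_json_config(path):
    """Parse a JSON config file and fail with a readable error.

    Hypothetical helper for illustration; OLake does not ship it.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"config file not found: {file_path}")
    try:
        return json.loads(file_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        # Re-raise with the file name so the broken file is obvious.
        raise ValueError(f"{file_path} is not valid JSON: {err}") from err
```

Calling `load_json_config("config.json")` and `load_json_config("writer.json")` before the discover step returns the parsed dictionaries, or raises if either file is missing or malformed.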
Depending on the source and destination you want to sync between, choose one of the configurations below.
- PostgreSQL to Iceberg | Postgres Source Config
- MongoDB to Iceberg | MongoDB Source Config
- MySQL to Iceberg | MySQL Source Config
Now that you have the source configuration set, let's move on to the destination configuration.
Here's what writer.json looks like for the AWS Glue catalog configuration:
{
  "type": "ICEBERG",
  "writer": {
    "normalization": false,
    "s3_path": "s3://bucket_name/olake_iceberg/test_olake",
    "aws_region": "ap-south-1",
    "aws_access_key": "XXX",
    "aws_secret_key": "XXX",
    "database": "olake_iceberg",
    "grpc_port": 50051,
    "server_host": "localhost"
  }
}
For more information, refer here.
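To avoid pasting AWS credentials into writer.json by hand, the file can be generated from environment variables. The sketch below is an illustration under assumptions: the function names are hypothetical, and reading the keys from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` is a choice made here, not something the writer requires. The JSON keys and default values mirror the AWS Glue sample above.

```python
import json
import os


def build_glue_writer_config(s3_path, aws_region, database, env=None):
    """Build the writer.json structure for the AWS Glue catalog.

    Hypothetical helper; key names are copied from the sample config.
    """
    env = os.environ if env is None else env
    return {
        "type": "ICEBERG",
        "writer": {
            "normalization": False,
            "s3_path": s3_path,
            "aws_region": aws_region,
            # Assumption: credentials are supplied through these env vars.
            "aws_access_key": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
            "database": database,
            "grpc_port": 50051,
            "server_host": "localhost",
        },
    }


def render_writer_json(config):
    """Serialize the config with the same two-space layout as the sample."""
    return json.dumps(config, indent=2) + "\n"
```

Writing the result of `render_writer_json(build_glue_writer_config("s3://bucket_name/olake_iceberg/test_olake", "ap-south-1", "olake_iceberg"))` to writer.json yields a file with the same shape as the sample, with the two key fields filled from the environment.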
- Run Sync Commands:
  - Discover Command: <DISCOVER_COMMAND>
  - Sync Command: <SYNC_COMMAND>
  - Sync with State Command: <SYNC_WITH_STATE_COMMAND>
Refer to the respective database docs for the commands to discover the schema and sync the data.
- MongoDB Discover and sync command
- Postgres Discover and sync command
- MySQL Discover and sync command
A sample discover and sync command would look like this:
- Using Docker:
docker run \
  -v /Users/USERNAME/Desktop/projects/olake-docker:/mnt/config \
  olakego/source-mongodb:latest \
  discover \
  --config /mnt/config/config.json
- Using OLake Local Build:
./build.sh driver-mongodb sync \
  --config /Users/USERNAME/Desktop/projects/olake/drivers/mongodb/config/config.json \
  --catalog /Users/USERNAME/Desktop/projects/olake/drivers/mongodb/config/catalog.json \
  --destination /Users/USERNAME/Desktop/projects/olake/drivers/mongodb/config/writer.json
olakego/source-mongodb is the OLake image for the MongoDB source. You can replace it with the respective source image for PostgreSQL (source-postgres) or MySQL (source-mysql), or build one locally.
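Swapping the source only changes the image or driver name, so the sample commands can be assembled programmatically. This is a rough sketch with hypothetical helper names. It assumes the PostgreSQL and MySQL images share the `olakego/` prefix of the MongoDB image and that local drivers follow the `driver-<source>` pattern; only the MongoDB names appear verbatim in the samples above.

```python
import shlex

# Assumption: the postgres and mysql images use the same "olakego/" prefix
# as the MongoDB image shown in the sample command.
SOURCE_IMAGES = {
    "mongodb": "olakego/source-mongodb",
    "postgres": "olakego/source-postgres",
    "mysql": "olakego/source-mysql",
}


def docker_discover_command(source, host_config_dir, tag="latest"):
    """Return the Docker discover command as an argument list."""
    image = SOURCE_IMAGES[source]
    return [
        "docker", "run",
        "-v", f"{host_config_dir}:/mnt/config",
        f"{image}:{tag}",
        "discover",
        "--config", "/mnt/config/config.json",
    ]


def local_sync_command(source, config_dir):
    """Return the local-build sync command as an argument list.

    Assumption: drivers other than MongoDB are named "driver-<source>".
    """
    return [
        "./build.sh", f"driver-{source}", "sync",
        "--config", f"{config_dir}/config.json",
        "--catalog", f"{config_dir}/catalog.json",
        "--destination", f"{config_dir}/writer.json",
    ]


if __name__ == "__main__":
    cmd = docker_discover_command(
        "mongodb", "/Users/USERNAME/Desktop/projects/olake-docker"
    )
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues, while `shlex.join` gives a readable one-line form for logs.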