Getting Started with OLake for Oracle
OLake helps you replicate data from Oracle into local or S3-based data lakes using Parquet or Iceberg table formats. This tutorial walks you through every step of the setup—from creating necessary configuration files to running your first data sync.
OLake UI is live (beta)! You can now use the UI to configure your Oracle source, discover streams, and sync data. See OLake UI for how to set it up with Docker Compose and run it locally.
Refer to Oracle Connector documentation for more details.
TLDR:
- Create a source.json with your Oracle connection details.
- Create a destination.json with your Writer (Apache Iceberg / AWS S3 / Azure ADLS / Google Cloud Storage) connection details.
- Run discover to generate a streams.json of available streams.
- Run sync to replicate data to your specified destination.
Introduction & Requirements
To use OLake, ensure you have:
- Docker installed and running on your machine.
- Oracle credentials (host, port, service name, username/password).
- Docker Compose instructions to spin up an Oracle instance
- Need a sample dataset to ingest into Oracle? See Sample Datasets
- AWS S3 credentials (if you plan to write data to AWS S3).
- Apache Iceberg and Catalog configuration credentials (if you plan to write data to Iceberg tables).
Refer here for more details on Writer requirements.
You will also need:
- An empty directory to store OLake configuration files and outputs. This guide will refer to it as OLAKE_DIRECTORY.
To set up the project locally and create the debug configurations it needs, follow this guide - Setting up debugger in VS Code
Step 1: Prepare Your Directory
- Create a new directory on your local machine. Let’s call it OLAKE_DIRECTORY:
mkdir OLAKE_DIRECTORY
- Inside this folder, create two files:
  - destination.json: Specifies your output destination (local or S3).
  - source.json: Contains connection settings for Oracle (or other databases in the future).
cd OLAKE_DIRECTORY
touch destination.json
touch source.json
Folder Structure:
OLAKE_DIRECTORY/
├─ destination.json
└─ source.json
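For shells without mkdir and touch (for example Windows CMD), the same layout can be produced from Python. This is a minimal sketch using only the standard library; the function name prepare_olake_directory is made up for illustration and is not part of OLake.

```python
from pathlib import Path


def prepare_olake_directory(base: Path) -> list:
    """Create the directory and empty destination.json / source.json files."""
    base.mkdir(parents=True, exist_ok=True)
    created = []
    for name in ("destination.json", "source.json"):
        path = base / name
        path.touch(exist_ok=True)  # like `touch`: existing content is left alone
        created.append(path)
    return created
```

Calling prepare_olake_directory(Path("OLAKE_DIRECTORY")) from the parent folder yields the structure shown above.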
1.1 Example destination.json
Refer to the Destination config section for individual writers, or refer to them here.
Destination | Supported | Docs | Comments |
---|---|---|---|
![]() | Yes | Link | |
![]() | Yes | Link | Supports both plain-Parquet and Iceberg format writes; requires aws_access_key / IAM role. |
![]() | Yes | Link | |
![]() | Yes | | Any S3 protocol compliant object store can work with OLake |
![]() | Yes | Link | |
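As a starting point, the sketch below writes a destination.json for the local Parquet writer. The "type" / "writer" / "local_path" layout is copied from the local Parquet example in Step 3.3 of this guide; the helper names are hypothetical, and other writers (S3, Iceberg, and so on) need the fields described in their own docs.

```python
import json
from pathlib import Path


def local_parquet_destination(local_path: str) -> dict:
    """Build the local Parquet destination config used in Step 3.3."""
    return {"type": "PARQUET", "writer": {"local_path": local_path}}


def write_destination(olake_dir: Path, config: dict) -> Path:
    """Write config to destination.json inside olake_dir and return the file path."""
    target = olake_dir / "destination.json"
    target.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return target
```

Here local_path is a path inside the container, so a value such as /mnt/config/my_directory refers to a folder within the mounted OLAKE_DIRECTORY.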
1.2 Example source.json (Oracle)
Below is a sample source.json for connecting to an Oracle database. Customize each field to match your environment.
{
  "host": "oracle-host",
  "username": "oracle-user",
  "password": "oracle-password",
  "service_name": "oracle-service-name",
  "port": 1521,
  "max_threads": 10,
  "retry_count": 0,
  "jdbc_url_params": {},
  "ssl": {
    "mode": "disable"
  }
}
Description of above parameters
Refer to source configuration for more details on source.json.
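Before running any OLake command, it can help to sanity-check source.json locally. The sketch below is not OLake's own validation: it assumes that the five connection fields from the sample (host, username, password, service_name, port) are the ones worth checking, and the connector's real rules may differ.

```python
import json

# Assumption: field names and types are taken from the sample source.json above.
EXPECTED_FIELDS = {
    "host": str,
    "username": str,
    "password": str,
    "service_name": str,
    "port": int,
}


def check_source_config(text: str) -> list:
    """Return a list of problems found in the JSON text; empty means none found."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in config:
            problems.append(f"missing field: {field}")
            continue
        value = config[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems
```

A common mistake this catches is quoting the port ("1521" instead of 1521).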
Step 2: Generate a Streams File
OLake needs to discover which tables (streams) exist in your Oracle database. This step will create a streams.json listing available streams, schemas, and default sync modes.
- Open your terminal in the same directory (say OLAKE_DIRECTORY) containing source.json and destination.json.
- Run the discover command using Docker:
OLake Docker

macOS / Linux:
docker run --pull=always \
  -v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \
  olakego/source-oracle:latest \
  discover \
  --config /mnt/config/source.json

CMD:
docker run --pull=always ^
  -v "%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY:/mnt/config" ^
  olakego/source-oracle:latest ^
  discover ^
  --config /mnt/config/source.json

PowerShell:
docker run --pull=always `
  -v "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY:/mnt/config" `
  olakego/source-oracle:latest `
  discover `
  --config /mnt/config/source.json

Locally run OLake

macOS / Linux:
OLAKE_BASE_PATH="$HOME/PATH_TO_OLAKE_DIRECTORY/olake/drivers/oracle/config" && \
  ./build.sh driver-oracle discover \
  --config "$OLAKE_BASE_PATH/source.json"

CMD (set runs on its own line so that %OLAKE_BASE_PATH% is defined before the next command is parsed):
set "OLAKE_BASE_PATH=%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY\olake\drivers\oracle\config"
./build.sh driver-oracle discover ^
  --config "%OLAKE_BASE_PATH%\source.json"

PowerShell:
$OLAKE_BASE_PATH = "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY\olake\drivers\oracle\config"
./build.sh driver-oracle discover `
  --config "$OLAKE_BASE_PATH\source.json"
PATH_TO_OLAKE_DIRECTORY is the path, relative to your home directory, of the directory you created above. For example,
-v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \
expands to
-v /Users/JOHN_DOE_USERNAME/Desktop/projects/OLAKE_DIRECTORY:/mnt/config \
on macOS and Linux systems. Follow the same pattern on other systems.
Flag/Parameter | Description |
---|---|
discover | The OLake sub-command that scans Oracle schemas. |
--config /mnt/config/source.json | Tells OLake where to find your Oracle connection details. |
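If you prefer to launch discovery from a script, the Docker command above can be assembled as an argument list, which avoids shell-specific quoting and line-continuation differences. This is a sketch only: docker_discover_args is a hypothetical helper, and the image name and flags are copied from the command shown in this step.

```python
def docker_discover_args(olake_dir: str, image: str = "olakego/source-oracle:latest") -> list:
    """Return the argv for the `docker run ... discover` command shown above.

    olake_dir should be the absolute host path of OLAKE_DIRECTORY.
    """
    return [
        "docker", "run", "--pull=always",
        "-v", f"{olake_dir}:/mnt/config",  # mount the config directory into the container
        image,
        "discover",
        "--config", "/mnt/config/source.json",
    ]
```

Passing the returned list to subprocess.run(args, check=True) would start the container, provided Docker is installed and running.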
2.1 Understanding the streams.json File
After running discover, OLake generates streams.json in OLAKE_DIRECTORY with entries like:
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{now(),2025,YYYY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}",
        "stream_name": "stream_0",
        "normalization": false,
        "append_only": false,
        "chunk_column": "" //column name to be specified
      },
      {
        "partition_regex": "{,1999,YYYY}-{,09,MM}-{,31,DD}/{latest_revision,,}",
        "stream_name": "stream_8",
        "normalization": false,
        "append_only": false,
        "chunk_column": "" //column name to be specified
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "stream_8",
        "namespace": "otter_db",
        "type_schema": { ... },
        "supported_sync_modes": [
          "full_refresh",
          "cdc"
        ],
        "source_defined_primary_key": [
          "_id"
        ],
        "available_cursor_fields": [],
        "sync_mode": "cdc"
      }
    },
    // ... other streams
  ]
}
- selected_streams: The streams (tables) OLake will replicate.
- streams: Metadata for each discovered stream, including schemas and sync modes (e.g., full_refresh).
- partition_regex: The pattern used to partition the written data. For more details, refer to S3 docs.
- normalization: If set to true, OLake will flatten nested JSON structures (Level 1 flattening).
append_only and chunk_column are not yet supported for the Oracle source.
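To see at a glance what a streams.json will replicate, the two sections can be joined on namespace and stream name. The sketch below assumes the key names shown in the sample above and that the generated file is plain JSON (the // comments in the sample are illustrative only); selected_stream_modes is a hypothetical helper.

```python
def selected_stream_modes(streams_doc: dict) -> dict:
    """Map (namespace, stream_name) of each selected stream to its sync_mode."""
    modes = {}
    for entry in streams_doc.get("streams", []):
        stream = entry["stream"]
        modes[(stream["namespace"], stream["name"])] = stream.get("sync_mode")
    result = {}
    for namespace, selections in streams_doc.get("selected_streams", {}).items():
        for selection in selections:
            key = (namespace, selection["stream_name"])
            result[key] = modes.get(key)  # None when no metadata entry matches
    return result
```

Load the file with json.load and pass the resulting dictionary to the function.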
Exclude Streams: You can remove unneeded streams by editing selected_streams directly. For instance, delete the "customer" entry if you only want to sync order.
Before (including customer):
"selected_streams": {
  "otter_db": [
    {
      "stream_name": "order",
      "partition_regex": "",
      "normalization": false,
      "append_only": false,
      "chunk_column": "" //column name to be specified
    },
    {
      "stream_name": "customer",
      "partition_regex": "",
      "normalization": false,
      "append_only": false,
      "chunk_column": "" //column name to be specified
    }
  ]
},
After (to exclude customer):
"selected_streams": {
  "otter_db": [
    {
      "stream_name": "order",
      "partition_regex": "",
      "normalization": false,
      "append_only": false,
      "chunk_column": "" //column name to be specified
    }
  ]
},
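The same edit can be scripted, which is handy when discover returns many streams. A minimal sketch, assuming the selected_streams layout shown above; exclude_stream is a hypothetical helper, not an OLake command.

```python
import copy


def exclude_stream(streams_doc: dict, namespace: str, stream_name: str) -> dict:
    """Return a copy of streams_doc with one stream removed from selected_streams."""
    updated = copy.deepcopy(streams_doc)  # leave the caller's data untouched
    selections = updated.get("selected_streams", {}).get(namespace, [])
    kept = [s for s in selections if s.get("stream_name") != stream_name]
    if len(kept) == len(selections):
        raise KeyError(f"{namespace}.{stream_name} is not in selected_streams")
    updated["selected_streams"][namespace] = kept
    return updated
```

For the example above, exclude_stream(doc, "otter_db", "customer") leaves only the order entry; write the result back to streams.json with json.dump.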
Step 3: Run Your First Data Sync
Now that you have streams.json, it’s time to sync data from Oracle to your specified destination (local or S3).
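Since sync reads all three files, a quick pre-flight check can catch a missing or still-empty file before the container starts. This sketch only checks that each file exists and parses as JSON; it is not part of OLake, and the function name preflight is hypothetical.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("source.json", "streams.json", "destination.json")


def preflight(olake_dir: Path) -> dict:
    """Return {file name: problem} for files that are missing or not valid JSON."""
    problems = {}
    for name in REQUIRED_FILES:
        path = olake_dir / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[name] = f"invalid JSON: {exc.msg}"
    return problems
```

An empty dictionary means all three files are present and parseable. A destination.json left empty after touch is reported as invalid JSON.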
OLake Docker

macOS / Linux:
docker run --pull=always \
  -v "$HOME/PATH_TO_OLAKE_DIRECTORY:/mnt/config" \
  olakego/source-oracle:latest \
  sync \
  --config /mnt/config/source.json \
  --catalog /mnt/config/streams.json \
  --destination /mnt/config/destination.json

CMD:
docker run --pull=always ^
  -v "%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY:/mnt/config" ^
  olakego/source-oracle:latest ^
  sync ^
  --config /mnt/config/source.json ^
  --catalog /mnt/config/streams.json ^
  --destination /mnt/config/destination.json

PowerShell:
docker run --pull=always `
  -v "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY:/mnt/config" `
  olakego/source-oracle:latest `
  sync `
  --config /mnt/config/source.json `
  --catalog /mnt/config/streams.json `
  --destination /mnt/config/destination.json

Locally run OLake

macOS / Linux:
OLAKE_BASE_PATH="$HOME/PATH_TO_OLAKE_DIRECTORY/olake/drivers/oracle/config" && \
  ./build.sh driver-oracle sync \
  --config "$OLAKE_BASE_PATH/source.json" \
  --catalog "$OLAKE_BASE_PATH/streams.json" \
  --destination "$OLAKE_BASE_PATH/destination.json"

CMD (set runs on its own line so that %OLAKE_BASE_PATH% is defined before the next command is parsed):
set "OLAKE_BASE_PATH=%USERPROFILE%\PATH_TO_OLAKE_DIRECTORY\olake\drivers\oracle\config"
./build.sh driver-oracle sync ^
  --config "%OLAKE_BASE_PATH%\source.json" ^
  --catalog "%OLAKE_BASE_PATH%\streams.json" ^
  --destination "%OLAKE_BASE_PATH%\destination.json"

PowerShell:
$OLAKE_BASE_PATH = "$env:USERPROFILE\PATH_TO_OLAKE_DIRECTORY\olake\drivers\oracle\config"
./build.sh driver-oracle sync `
  --config "$OLAKE_BASE_PATH\source.json" `
  --catalog "$OLAKE_BASE_PATH\streams.json" `
  --destination "$OLAKE_BASE_PATH\destination.json"
Flag/Parameter | Description |
---|---|
sync | The OLake sub-command that runs a data replication (a snapshot, i.e. a full sync). |
--config /mnt/config/source.json | Oracle connection settings. |
--catalog /mnt/config/streams.json | The file detailing which streams OLake will replicate. |
--destination /mnt/config/destination.json | The output configuration file (local or S3). |
- This command performs the initial snapshot (full sync).
- If you only want a full one-time snapshot, set the stream’s sync_mode to "full_refresh" in streams.json.
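Changing sync_mode by hand is easy for one stream; for several, a small script is less error-prone. The sketch below assumes the streams layout from section 2.1 (name, namespace, supported_sync_modes, sync_mode); set_sync_mode is a hypothetical helper.

```python
import copy


def set_sync_mode(streams_doc: dict, namespace: str, name: str, mode: str = "full_refresh") -> dict:
    """Return a copy of streams_doc with sync_mode changed for one stream."""
    updated = copy.deepcopy(streams_doc)
    for entry in updated.get("streams", []):
        stream = entry["stream"]
        if stream.get("namespace") == namespace and stream.get("name") == name:
            supported = stream.get("supported_sync_modes", [])
            if mode not in supported:
                raise ValueError(f"{mode!r} is not in supported_sync_modes {supported}")
            stream["sync_mode"] = mode
            return updated
    raise KeyError(f"no stream named {namespace}.{name}")
```

The check against supported_sync_modes guards against typos such as "full-refresh".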
When the sync finishes, you should see new files either:
- Locally (in the volume-mapped directory).
- On S3 (inside the specified s3_path).
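For a local destination, one way to confirm that the sync produced output is to list the Parquet files under the mapped directory. This sketch assumes the output files carry a .parquet extension; the folder layout beneath the directory is not assumed.

```python
from pathlib import Path


def find_parquet_files(output_dir: Path) -> list:
    """Return all *.parquet files under output_dir, searched recursively and sorted."""
    return sorted(p for p in output_dir.rglob("*.parquet") if p.is_file())
```

An empty list after a sync suggests checking local_path in destination.json and the -v mapping used in the Docker command.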
Step 3.1 Synced Data
If you are using VS Code, install a Parquet reader extension to view the contents of the Parquet files produced by the sync.
Step 3.2 Synced Data Normalized
If you have turned on "normalization": true in streams.json for your streams, expect the Level 1 flattening of JSON data shown below.
Read more about JSON flattening here - Flatten Object Types and Query Arrays in Semi-Structured Nested JSON
[Figure: Running the sync command with normalization turned on]
[Figure: Output data dump]
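To make "Level 1 flattening" concrete, here is one plausible reading sketched in Python: each top-level key becomes its own column, while nested objects and arrays are kept as JSON strings rather than being expanded further. This is an assumption for illustration, not OLake's implementation, and the actual column naming and encoding may differ.

```python
import json


def flatten_level_1(record: dict) -> dict:
    """Promote top-level keys to columns; keep nested values as JSON strings."""
    row = {}
    for key, value in record.items():
        if isinstance(value, (dict, list)):
            row[key] = json.dumps(value, sort_keys=True)  # nested data is not expanded
        else:
            row[key] = value
    return row
```

Under this reading, a record such as {"id": 1, "address": {"city": "Pune"}} yields an id column holding 1 and an address column holding the string {"city": "Pune"}.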
Step 3.3 Change output directory
If you need to write the Parquet output to a different location, set local_path in destination.json to a directory under /mnt/config, for example /mnt/config/my_directory:
{
  "type": "PARQUET",
  "writer": {
    "local_path": "/mnt/config/my_directory"
  }
}
Here, /mnt/config represents the OLAKE_DIRECTORY.
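Because /mnt/config is only the container-side name of OLAKE_DIRECTORY, it can be useful to translate a local_path back to where the files land on the host. A small sketch, assuming the -v mapping used throughout this guide; host_path_for is a hypothetical helper and uses POSIX-style paths.

```python
from pathlib import PurePosixPath

CONTAINER_MOUNT = PurePosixPath("/mnt/config")


def host_path_for(container_path: str, olake_directory: str) -> str:
    """Translate a path under /mnt/config to the matching path on the host."""
    # relative_to raises ValueError if container_path is outside the mount.
    relative = PurePosixPath(container_path).relative_to(CONTAINER_MOUNT)
    return str(PurePosixPath(olake_directory) / relative)
```

A ValueError here is a useful signal: a local_path outside /mnt/config is written inside the container only and will not appear in OLAKE_DIRECTORY.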
Debugging
Follow the debugging instructions in this guide - Setting up debugger in VS Code
Docker Commands & Flags
Click here for more info about Docker Commands & Flags
Next Steps & Wrap-Up
- Check Your Output: Verify your Parquet files (or Iceberg tables) were created either locally or in your S3 bucket.
- Explore Schema Evolution: If your Oracle tables gain new columns, OLake can adapt automatically. Watch for updated schemas in subsequent runs.
- Try More Destinations: OLake can also write to Iceberg on S3 (and more in the future). Update your destination config as needed.
- Analytics & Querying: Connect your newly created Parquet/Iceberg data to engines like Trino, Spark, or Presto for powerful querying.
Congratulations! You’ve completed your first OLake data replication. If you encounter any issues or have feedback, please visit our GitHub repository to open an issue or contribute.