Get Started With Your First Job!
This guide is for end-users who want to replicate data between the various sources and destinations that OLake supports. Using the OLake UI, you can configure a source, set up a destination, and create a job to move data between them.
By the end of this tutorial, you'll have a complete replication workflow running in OLake.
Prerequisites
Follow the Quickstart Setup Guide to ensure the OLake UI is running at localhost:8000.
What is a Job?
A job in OLake is a pipeline that defines how data should be synchronized from a source (where your data comes from) to a destination (where your data goes).
Sources and destinations can be:
- New - configured during job creation.
- Existing - already set up and reused across multiple jobs.
Two ways to create a Job
1. Job-first workflow:
Start from the Jobs page and set up everything in one flow.
- Go to Jobs in the left menu and click Create Job.
- Configure the job name & schedule.
- Configure the source.
- Configure the destination.
- Configure streams and save.
2. Resource-first workflow:
Set up your source and destination first, then link them in a job.
- Create a source from the Sources page.
- Create a destination from the Destinations page.
- Go to Jobs → Create Job, and configure the job name & schedule.
- Select the existing source and destination.
- Configure streams and save.
The two methods achieve the same result. Choose Job-first if you want a guided setup in one go. Choose Resource-first if your source and destination are already configured, or if you prefer to prepare them in advance.
Tutorial: Creating a Job
In this guide, we'll use the Job-first workflow to set up a job from configuring the source and destination to running it. If you prefer video, check out our video walkthrough.
First things first, every job needs a source and a destination before it can run. For this demonstration, we'll use Postgres as the source and Iceberg with Glue catalog as the destination.
Let's get started!
1. Create a New Job
Navigate to the Jobs section and select the + Create Job button in the top-right corner. This opens the job creation wizard, starting with the Configure Job Name & Schedule step.
2. Configure Job Name & Schedule
Give your job a descriptive name. For this guide, set the Frequency dropdown to Every Day and choose 12:00 AM as the Time.
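If it helps to reason about the schedule, "Every Day" at 12:00 AM corresponds to the standard cron expression 0 0 * * *. Here's a small Python sketch, purely for illustration, that previews the next run times using the third-party croniter package; the OLake UI manages the schedule itself.

```python
# Illustrative only: preview the next runs of a daily-at-midnight schedule.
# Requires `pip install croniter`; OLake handles scheduling internally.
from datetime import datetime
from croniter import croniter

schedule = croniter("0 0 * * *", datetime.now())  # every day at 12:00 AM
for _ in range(3):
    print(schedule.get_next(datetime))  # next three scheduled run times
```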
3. Configure Source
Since we're following the Job-first workflow, select the Set up a new source option.
For this guide, choose Postgres from the connector dropdown, and keep the OLake version set to the latest stable version.
Give your source a descriptive name, then fill in the required Postgres connection details in the Endpoint Config form.
Once the test connection succeeds, OLake shows a success message and takes you to the destination configuration step.
You can find the configuration and troubleshooting guides for all supported source connectors below.
Sources | Config |
---|---|
MySQL | Config |
Postgres | Config |
MongoDB | Config |
Oracle | Config |
If you plan to enable CDC (Change Data Capture), make sure a replication slot already exists on your Postgres database. You can learn how to check or create one in our Replication Slot Guide.
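If you prefer to check or create the slot yourself, below is a minimal Python sketch using psycopg2. The slot name olake_slot and the pgoutput plugin are illustrative assumptions; use the values from your source configuration, and note that logical slots require wal_level=logical on the server.

```python
# Check for an existing logical replication slot and create one if missing.
# The slot name "olake_slot" and the "pgoutput" plugin are illustrative;
# substitute the values your OLake source config expects.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="postgres",
    user="postgres", password="password",
)
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute(
        "SELECT slot_name FROM pg_replication_slots WHERE slot_name = %s",
        ("olake_slot",),
    )
    if cur.fetchone() is None:
        # pg_create_logical_replication_slot is a built-in Postgres function.
        cur.execute(
            "SELECT pg_create_logical_replication_slot(%s, 'pgoutput')",
            ("olake_slot",),
        )
        print("Created replication slot olake_slot")
    else:
        print("Replication slot olake_slot already exists")

conn.close()
```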
4. Configure Destination
Here, we'll use Iceberg with AWS Glue Catalog as the destination.
For this guide, select Apache Iceberg from the connector dropdown, and keep the OLake version set to the latest stable version.
Choose AWS Glue from the Catalog Type dropdown.
Give your destination a descriptive name, then fill in the required connection details in the Endpoint Config form.
Once the test connection succeeds, OLake shows a success message and takes you to the streams configuration step.
You can find the configuration and troubleshooting guides for all supported destination connectors below.
- Parquet S3

Destinations | Config |
---|---|
Parquet | Config |

- Iceberg

Catalogs | Config |
---|---|
AWS Glue Catalog | Config |
Hive Catalog | Config |
JDBC Catalog | Config |
REST Catalog | Config |
Nessie Catalog | Config |
LakeKeeper | Config |
S3 Tables | Config |
Polaris | Config |
Unity | Config |
5. Configure Streams
The Streams page is where you select which streams to replicate to the destination. Here, you can choose your preferred sync mode and configure partitioning, the Destination Database, and other stream-level settings.
For this guide, we'll configure the following:
- Replicate the fivehundred stream (the name of the table).
- Use Full Refresh + CDC as the sync mode.
- Enable data Normalization.
- Modify the Destination Database name (if required).
- Replicate only data where dropoff_datetime >= 2010-01-01 00:00:00 (basically, data from 2010 onward).
- Partition the data by the year extracted from a timestamp column in the selected stream.
- Run the sync every day at 12:00 AM.
Let's start by selecting the fivehundred stream (or any stream from your source) by checking its checkbox to include it in the replication.
Click the stream name to open the stream-level settings panel on the right side.
In the panel, set the sync mode to Full Refresh + CDC, and enable Normalization by toggling the switch on.
To learn more about sync modes, refer to our Sync Modes Guide in the documentation.
To partition the data, click the Partitioning tab and configure it based on the required details.
In our case, the fivehundred stream has a timestamp column named dropoff_datetime, which we will partition by year. Learn more about partitioning in the Partitioning Guide.
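For intuition, a year partition on dropoff_datetime corresponds to an Iceberg partition spec with a year transform. Below is a minimal sketch using pyiceberg; the field IDs are illustrative assumptions, and OLake builds the actual spec for you from the UI.

```python
# Illustrative Iceberg partition spec: partition by year(dropoff_datetime).
# Field IDs are made up for the example; OLake derives the real ones.
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.transforms import YearTransform

spec = PartitionSpec(
    PartitionField(
        source_id=5,     # schema field ID of dropoff_datetime (illustrative)
        field_id=1000,   # partition field IDs conventionally start at 1000
        transform=YearTransform(),
        name="dropoff_datetime_year",
    )
)
print(spec)
```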
To replicate only data from 2010 onward, we'll use a Data Filter on the dropoff_datetime column.
Make sure the Value provided is in the same format as the column schema.
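A quick way to sanity-check the value before saving is to confirm it parses with the column's timestamp layout. The format string below is an assumption for a YYYY-MM-DD HH:MM:SS timestamp column; adjust it to match your column's schema.

```python
# Verify the filter value matches the column's timestamp format before saving.
# The format string is an assumption for a "YYYY-MM-DD HH:MM:SS" column.
from datetime import datetime

filter_value = "2010-01-01 00:00:00"
try:
    datetime.strptime(filter_value, "%Y-%m-%d %H:%M:%S")
    print("Filter value format OK")
except ValueError as err:
    print(f"Filter value does not match the column format: {err}")
```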
To edit the Destination Database name, select the edit icon beside the Destination Database (Iceberg DB or S3 Folder) and make the changes.
Once configured, click Create Job in the bottom-right corner. Tada! You've successfully created your first job!
The sync will start at the next scheduled time. You can also start it manually by going to the Jobs section, finding your job, clicking the options menu, and selecting Sync Now.
You can verify the sync status by checking the badge at the right end of the job row. Possible statuses include Running, Failed, and Completed. You can also monitor the sync logs by selecting Job Logs and History from the job options menu.
- Job running:
- Job completed:
Yay! The sync is complete, and our data has been replicated to Iceberg exactly as we configured it.
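If you'd like to verify from code as well, here's a hedged sketch that reads the replicated table back through pyiceberg's Glue catalog support. The catalog name, database, and table names are assumptions based on this guide, and AWS credentials/region are taken from your environment.

```python
# Minimal sketch: read the replicated table back via pyiceberg's Glue support.
# Assumes AWS credentials/region in the environment and pyiceberg installed.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("olake_db.fivehundred")  # <destination_db>.<stream>
arrow_table = table.scan().to_arrow()               # full scan; filter if large
print(f"Rows replicated: {arrow_table.num_rows}")
```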
6. Manage Your Job
Once your job is created, you can manage it from the Jobs page using the Actions menu (⋮).
Here's what each option does:
1. Sync Now
Run the job immediately without waiting for the next scheduled time.
2. Edit Streams
Use this option to modify which streams are included in your job and adjust their replication settings. When you click Edit Streams, you'll be redirected to the Stream Configuration page.
Here you can:
- Add new streams from your source.
- Change the sync mode for selected streams.
- Adjust partitioning or Normalization for newly added streams.
- You can also navigate to Source and Destination settings using the stepper at the top-right of the page.
- By default, source and destination editing is locked; click Edit to unlock them.
You cannot directly change the Normalization, data filter, or partition scheme for existing streams. To update these:
- Unselect the stream.
- Save the job.
- Reopen Edit Streams and re-add the stream with the updated settings.
3. Pause Job
Stops the job from running until resumed. Paused jobs appear under Inactive Jobs. Resume them anytime from the Inactive Jobs tab.
4. Cancel Job
Stops a currently running sync safely. Available only while a sync is in "Running" status.
What the Cancel Button Does:
- When the cancel button is clicked, it sends a cancellation signal to the running sync
- The sync stops gracefully at a safe checkpoint (no partial/corrupted data)
- Performs cleanup (closes connections, releases resources)
- Marks the sync run as "Canceled"
- Saves progress only if the sync mode generates a state file
Cancellation takes up to 1 minute to complete. If the sync finishes before cancellation completes, the status will show "Completed" instead of "Canceled"; whichever happens first determines the final status.
What Happens to Your Data?
i. Data already synced before cancellation:
- ✅ Stays in the destination
- ✅ Not rolled back or deleted
- ✅ Remains in a consistent state (no partial records)
ii. Data being processed during cancellation:
- The system completes the current batch/record being processed
- Stops before starting new work
- Prevents incomplete writes
Whether your sync resumes from where it stopped depends on whether it generates a state file:
Sync Mode | State File? | What Happens on Restart? |
---|---|---|
Full Refresh | ❌ No | No state file is generated, so position is not tracked; the entire dataset is reloaded on each run. |
Full Refresh + CDC | ✅ Yes | Generates a state file for both the Full Refresh and CDC phases, so the sync resumes from the last saved state. |
Full Refresh + Incremental | ✅ Yes | Generates a state file for both the Full Refresh and Incremental phases, so the sync resumes from the last saved state. |
Strict CDC | ✅ Yes | Generates a state file whenever the CDC sync runs, so it resumes from the last saved state. |
After cancellation completes, you can start the job again with Sync Now or wait for the next scheduled run. The next run resumes from the last saved state if a state file exists, or starts fresh if not.
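To make the resume behavior concrete, here's a purely illustrative sketch of the kind of cursor a state file captures; this is not OLake's actual on-disk format.

```python
# Purely illustrative: the kind of cursor a state file captures so a sync can
# resume after cancellation. This is NOT OLake's real state-file format.
illustrative_state = {
    "stream": "fivehundred",
    "sync_mode": "full_refresh_cdc",
    "cdc_cursor": {"lsn": "0/16B3748"},  # last acknowledged WAL position
}

def resume_point(state: dict) -> str:
    """Return the saved CDC position, or signal a fresh start if none exists."""
    cursor = state.get("cdc_cursor")
    return cursor["lsn"] if cursor else "start fresh"

print(resume_point(illustrative_state))  # -> 0/16B3748
print(resume_point({"stream": "other"})) # -> start fresh
```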
5. Job Logs & History
This page lets you view and monitor a job's sync history and logs. You'll see a list of all current and past job runs. To view logs for a specific run, click View Logs in the Actions column.
Once you click View Logs, you'll see the logs for the selected job run.
6. Job Settings
Here, you can edit the frequency and other configuration settings. You can also pause or delete the job.
When a job is deleted, its associated source and destination are automatically moved to the inactive state, provided they are not being used by any other job.