
Get Started With Your First Job!

This guide is for end-users who want to replicate data between the various sources and destinations that OLake supports. Using the OLake UI, you can configure a source, set up a destination, and create a job to move data between them.

By the end of this tutorial, you'll have a complete replication workflow running in OLake.

Prerequisites

Follow the Quickstart Setup Guide to ensure the OLake UI is running at localhost:8000.

What is a Job?

A job in OLake is a pipeline that defines how data should be synchronized from a source (where your data comes from) to a destination (where your data goes).

Sources and destinations can be:

  • New - configured during job creation.
  • Existing - already set up and reused across multiple jobs.

Two ways to create a Job

1. Job-first workflow:

Start from the Jobs page and set up everything in one flow.

  1. Go to Jobs in the left menu and click Create Job.
  2. Configure the job name & schedule.
  3. Configure the source.
  4. Configure the destination.
  5. Configure streams and save.

2. Resource-first workflow:

Set up your source and destination first, then link them in a job.

  1. Create a source from the Sources page.
  2. Create a destination from the Destinations page.
  3. Go to Jobs → Create Job and configure the job name & schedule.
  4. Select the existing source and destination.
  5. Configure streams and save.
tip

The two methods achieve the same result. Choose Job-first if you want a guided setup in one go. Choose Resource-first if your source and destination are already configured, or if you prefer to prepare them in advance.


Tutorial: Creating a Job

In this guide, we'll use the Job-first workflow to set up a job from configuring the source and destination to running it. If you prefer video, check out our video walkthrough.

First things first, every job needs a source and a destination before it can run. For this demonstration, we'll use Postgres as the source and Apache Iceberg with Glue Catalog as the destination.

Let's get started!

1. Create a New Job

Navigate to the Jobs section and select the + Create Job button in the top-right corner. This opens the Job creation wizard, starting with the Configure Job Name & Schedule step.

OLake jobs dashboard with the Jobs tab, Create Job button, and Create your first Job button highlighted

2. Configure Job Name & Schedule

Give your job a descriptive name. For this guide, set the Frequency dropdown to Every Day and choose 12:00 AM as the Time.

OLake Create Job page showing step 1, with job name, frequency dropdown (Every Day highlighted), and job start time settings
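If you are used to thinking in cron terms, "Every Day at 12:00 AM" is the same schedule as the standard cron expression below. This is only the equivalent notation for reference; this guide makes no assumption about how OLake stores schedules internally.

```
0 0 * * *    # minute hour day-of-month month day-of-week → every day at 12:00 AM
```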

3. Configure Source

Since we're following the Job-first workflow, select the Set up a new source option.

For this guide, choose Postgres from the connector dropdown, and keep the OLake version set to the latest stable version.

Job source
creation

Give your source a descriptive name, then fill in the required Postgres connection details in the Endpoint Config form.

Job source
creation

Once the test connection succeeds, OLake shows a success message and takes you to the destination configuration step.
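The Endpoint Config form boils down to a small set of connection details. The sketch below is only illustrative, with placeholder values and field names chosen for readability; the authoritative field list for the Postgres connector is in its configuration guide linked below.

```json
{
  "host": "postgres.example.com",
  "port": 5432,
  "database": "analytics",
  "username": "olake_user",
  "password": "<your-password>"
}
```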

You can find the configuration and troubleshooting guides for all supported source connectors below.

  • MySQL Config
  • Postgres Config
  • MongoDB Config
  • Oracle Config
note

If you plan to enable CDC (Change Data Capture), make sure a replication slot already exists on your Postgres database. You can learn how to check or create one in our Replication Slot Guide.
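To check whether a slot exists, or to create one, you can use standard Postgres SQL. The slot name olake_slot below is just an example; `pgoutput` is Postgres's built-in logical decoding plugin. See the Replication Slot Guide for the specifics OLake expects.

```sql
-- List existing replication slots
SELECT slot_name, plugin, active FROM pg_replication_slots;

-- Create a logical replication slot (example name)
SELECT pg_create_logical_replication_slot('olake_slot', 'pgoutput');
```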

4. Configure Destination

Similarly, here we'll be using Iceberg with AWS Glue Catalog as the destination.

For this guide, select Apache Iceberg from the connector dropdown, and keep the OLake version set to the latest stable version.

Job destination
creation

Choose AWS Glue from the Catalog Type dropdown.

Job destination
catalog

Give your destination a descriptive name, then fill in the required connection details in the Endpoint Config form.

Job destination
config

Once the test connection succeeds, OLake shows a success message and takes you to the streams configuration step.
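As with the source, the destination form resolves to a handful of settings: the catalog type, an AWS region, and an S3 warehouse path for the Iceberg tables. The sketch below is illustrative only, with placeholder values and assumed field names; consult the Iceberg destination guide for the exact schema.

```json
{
  "catalog_type": "glue",
  "aws_region": "us-east-1",
  "s3_path": "s3://my-bucket/iceberg-warehouse"
}
```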

You can find the configuration and troubleshooting guides for all supported destination connectors below.

5. Configure Streams

The Streams page is where you select which streams to replicate to the destination. Here, you can choose your preferred sync mode and configure partitioning, the Destination Database, and other stream-level settings.

OLake streams selection, employee_data and other tables checked, sync mode set to Full Refresh + CDC

For this guide, we'll configure the following:

  • Replicate the fivehundred stream (name of the table).
  • Use Full Refresh + CDC as the sync mode.
  • Enable data Normalization.
  • Modify Destination Database name (if required).
  • Replicate only data where dropoff_datetime >= 2010-01-01 00:00:00 (i.e., data from 2010 onward).
  • Partition the data by the year extracted from a timestamp column in the selected stream.
  • Run the sync every day at 12:00 AM.

Let's start by selecting the fivehundred stream (or any stream from your source) by checking its checkbox to include it in the replication. Click the stream name to open the stream-level settings panel on the right side. In the panel, set the sync mode to Full Refresh + CDC, and enable Normalization by toggling the switch on.

Job streams
selection

To learn more about sync modes, refer to our Sync Modes Guide in the documentation.

To partition the data, click the Partitioning tab and configure it based on the required details. In our case, the fivehundred stream has a timestamp column named dropoff_datetime, which we will partition by year. Learn more about partitioning in the Partitioning Guide.

Job stream
partitioning
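To build intuition for what year-granularity partitioning does, here is a small, purely illustrative sketch of how a timestamp maps to a Hive-style partition folder such as the dropoff_datetime_year=2011 folder shown in the S3 screenshot later in this guide. OLake computes the actual layout itself; the function name and logic here are hypothetical, not OLake code.

```python
from datetime import datetime

def partition_folder(ts: str, column: str = "dropoff_datetime") -> str:
    """Illustrative only: map a row's timestamp to a year-partition folder name."""
    year = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").year
    return f"{column}_year={year}"

print(partition_folder("2011-03-15 08:30:00"))  # dropoff_datetime_year=2011
```

Every row whose dropoff_datetime falls in the same year lands in the same folder, which is what makes year-based pruning cheap at query time.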

To replicate only data from 2010 onward, we'll use a Data Filter to filter the data based on the dropoff_datetime column. Make sure the Value provided is in the same format as the column schema.

Job stream data
filter

To edit the Destination Database name, select the edit icon beside the Destination Database (Iceberg DB or S3 Folder) and make the changes.

Job stream Destination
DB

Once configured, click Create Job in the bottom-right corner. Tada! You've successfully created your first job!

Job
create

The sync will start at the next scheduled time. You can also start it manually by going to the Jobs section, finding your job, clicking the options menu, and selecting Sync Now.

OLake jobs dashboard with actions menu for sync, edit streams, pause, logs, settings, delete

You can verify the sync status by checking the badge at the right end of the job row. Possible statuses include Running, Failed, and Completed. You can also monitor the sync logs by selecting Job Logs and History from the job options menu.

  • Job running: OLake jobs dashboard showing active job status as running for Postgres to Iceberg pipeline

  • Job completed: OLake jobs dashboard showing completed status for Postgres to Iceberg pipeline job

Yay! The sync is complete, and our data has been replicated to Iceberg exactly as we configured it.

Amazon S3 browser showing parquet files for dropoff_datetime_year=2011 partition folder


6. Manage Your Job

Once your job is created, you can manage it from the Jobs page using the Actions menu (⋮).

To learn more about job-level features, refer to the Job Level Features guide.



💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don't hesitate to contact us if you need any help or further clarification!