Skip to main content

S3 Writer

OLake’s Parquet S3 writer is now live. This component allows you to efficiently write your database data into Amazon S3 in Parquet format. For more background details, please refer to the README. Before proceeding, make sure you have completed the getting started instructions.

S3 Writer Configuration

To enable S3 writes, you must create a writer.json file with the configuration parameters listed below. The sample configuration provided here outlines the necessary keys and their expected values.

{
"type": "PARQUET",
"writer": {
"normalization": false,
"s3_bucket": "olake-s3-test",
"s3_region": "ap-south-1",
"s3_access_key": "xxxxxxxxxxxxxxxxxxxx",
"s3_secret_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"s3_path": "/data"
}
}

Configuration Key Details

KeyDescriptionData TypeProbable Values
typeSpecifies the output file format for writing data. Currently, only the "PARQUET" format is supported.string"PARQUET"
writer.normalizationIndicates whether data normalization (JSON flattening) should be applied before writing data to S3. Use true if you require normalized output, or false if not.booleantrue or false (this example uses false)
writer.s3_bucketThe name of the Amazon S3 bucket where your output files will be stored. Ensure that the bucket exists and that you have proper access.stringA valid S3 bucket name (e.g. "olake-s3-test")
writer.s3_regionThe AWS region where the specified S3 bucket is hosted.stringAWS region codes such as "ap-south-1", "us-west-2", etc.
writer.s3_access_keyThe AWS access key used for authenticating S3 requests. This is typically a 20-character alphanumeric string.stringA valid AWS access key
writer.s3_secret_keyThe AWS secret key used for S3 authentication. This key is generally longer (often 40+ characters) and should be kept secure.stringA valid AWS secret key
writer.s3_pathThe specific path (or prefix) within the S3 bucket where data files will be written. This is typically a folder path that starts with a / (e.g. "/data").stringA valid path string

OLake S3 Bucket

info
  1. The generated .parquet files use SNAPPY compression (Read more). Note that SNAPPY is no longer supported by S3 Select when performing queries.
  2. OLake creates a test folder named olake_writer_test containing a single text file (.txt) with the content:
    S3 write test
    This is used to verify that you have the necessary permissions to write to S3.

IAM Permissions Required

To successfully sync your database data with S3, ensure that your AWS IAM policy grants at least minimal permissions for listing, creating, reading, and writing to the S3 bucket. Below is a sample IAM policy. Note that this example may provide more permissions than the absolute minimum required.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3ConsoleAccess",
"Effect": "Allow",
"Action": [
"s3:GetAccountPublicAccessBlock",
"s3:GetBucketAcl",
"s3:GetBucketLocation",
"s3:GetBucketPolicyStatus",
"s3:GetBucketPublicAccessBlock",
"s3:ListAccessPoints",
"s3:ListAllMyBuckets"
],
"Resource": "*"
},
{
"Sid": "ListObjectsInBucket",
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": [
"arn:aws:s3:::olake-s3-test"
]
},
{
"Sid": "AllObjectActions",
"Effect": "Allow",
"Action": "s3:*Object",
"Resource": [
"arn:aws:s3:::olake-s3-test/*"
]
}
]
}

S3 Data Partitioning

OLake supports data partitioning when writing to S3. After the initial schema discovery, you can define a partition_regex within your catalog.json file. This regex determines how data is partitioned into folders within your S3 bucket. Partitioning is defined on a per-stream (table or collection) basis.

Sample catalog.json with Partitioning

{
"selected_streams": {
"namespace": [
{
"stream_name": "table1",
"partition_regex": "/{column_name, default_value, granularity}"
},
{
"stream_name": "table2",
"partition_regex": ""
}
]
},
"streams": [
{
"stream": {
"name": "table1",
"namespace": "namespace",
"sync_mode": "cdc"
}
},
{
"stream": {
"name": "table2",
"namespace": "namespace",
"sync_mode": "cdc"
}
}
]
}
info
  • The partition_regex is a stream-level property; you can define different patterns for each stream.
  • For example, the following sample demonstrates partitioning for a namespace named otter_db:
    {
    "selected_streams": {
    "otter_db": [
    {
    "stream_name": "stream_8",
    "partition_regex": "{title,,}/{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{latest_revision,,}"
    },
    {
    "stream_name": "stream_0",
    "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}"
    }
    ]
    }
    // Additional configuration...
    }

In this context, title and latest_revision are column names. The partition is dynamically generated based on the column values as well as the current date.

When the partitioning is applied, the resulting S3 folder structure will include the database name and the table/collection name. For instance, if your table is named table1 in a database called test_db, the generated path might look like:

olake-s3-test/data-s3/test_db/table1/<value_of_the_partition_column1>/YY-MM-DD/<value_of_latest_version>

Partition Regex Attributes

You can partition data based on column values using the partition_regex. A typical partition pattern looks like:

"partition_regex": "/{column_name, default_value, granularity}"
  • column_name:
    (Required) The name of the column whose value will determine the partition folder. You can also use the now() function to represent the current date.
  • default_value:
    (Optional) If the column value is null or unparseable, a default value (prefixed with default_) will be used. Default value is a fallback value.
  • granularity:
    (Optional) For time-based columns, you can specify the granularity. Supported values include HH (hour), DD (day), WW (week), MM (month), and YY (year). granularity is not needed if the column_name is not a timestamp field and can be left empty.
info
  1. Each segment enclosed between slashes (/.../) in the partition regex can include up to three parameters.
  2. You can define as many directory path partitions as you wish to.
  3. Path name other than english are also supported (as long as they are STRING values).

Supported Regex Patterns

The partition regex supports several patterns to handle both column values and timestamps. Below is a summary of the supported patterns:

PatternDescriptionExample
now()Uses the current local timestamp to extract date components such as day (DD), month (MM), or year (YY).
defaultIf a specified column does not exist, the value is treated as a constant string prefixed with default_.
{col_name}Inserts the value of the specified column.
{now(),2025,YY}Extracts only the year part of the current timestamp.2025
{now(),2025,MM}Extracts only the month part of the current timestamp.1, 12
{now(),2025,DD}Extracts only the day part of the current timestamp.1, 31
{now(),2025,YY}-{now(),06,MM}Combines year and month from the current timestamp.2025-02
{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}Combines year, month, and day from the current timestamp.2025-02-13, 2025-12-31

Usage Examples:

  • Correct:

    • "partition_regex": "/{title,,}/{now(),2025,DD}/{latest_version,,}"
    • "partition_regex": "{title,,}/{now(),2025,DD}/{latest_version,,}"
  • Incorrect:

    • "partition_regex": "{title},,}/{now(),2025,DD}/{latest_version,,}"
      (Note the misplaced comma and bracket.)

Default value vs Parsed value

Default value vs Parsed value

Handling Special Characters in Column Values

When data is converted to Parquet format, escape characters or special symbols (e.g., \u0026 for &) are converted to their corresponding characters while constructing the folder name. However, if a column value contains the / character, it will be interpreted as a directory separator, thus creating additional nested folders.

For example, consider a column named title with the value:

Hi/my name is / priyansh

This value will generate a nested directory structure as shown below:

olake-s3-test
└── data-s3
└── test_db
└── table1
└── Hi
└── my name is
└── priyansh
└── 423878fsd87fsd78XXXXXX.parquet
note

If you intend the entire value to be treated as a single folder name, ensure that the data does not include the / character or consider sanitizing your column values before ingestion.

Supported Timestamp Formats for Partitioning

OLake attempts to parse a wide range of timestamp formats when generating partition folders. The following formats are currently supported:

"2006-01-02",
"2006-01-02 15:04:05",
"2006-01-02 15:04:05 -07:00",
"2006-01-02 15:04:05-07:00",
"2006-01-02T15:04:05",
"2006-01-02T15:04:05.000000",
"2006-01-02T15:04:05.999999999Z07:00",
"2006-01-02T15:04:05+0000",
"2020-08-17T05:50:22.895Z",
"2006-01-02 15:04:05.999999-07",
"2006-01-02 15:04:05.999999+00"

For the latest updates or to add support for additional timestamp formats, please refer to the implementation code or create an issue.

For a column mname that is a timestamp, replace now() with the name of the column and specify which value you want to extract from the timestamp field.

{
"selected_streams": {
"otter_db": [
{
"partition_regex": "{timestamp_column,2025,YY}-{timestamp_column,12,MM}-{timestamp_column,13,DD}",
"stream_name": "stream_0"
}
]
},
"streams": [
...
}

Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: where we discuss future roadmaps, discuss bugs, help folks to debug issues they are facing and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!