S3 Writer
OLake’s Parquet S3 writer is now live. It lets you write your database data to Amazon S3 in Parquet format efficiently. For more background, refer to the README, and before proceeding, make sure you have completed the getting started instructions.
S3 Writer Configuration
To enable S3 writes, create a `writer.json` file with the configuration parameters listed below. The sample configuration outlines the required keys and their expected values.
```json
{
  "type": "PARQUET",
  "writer": {
    "normalization": false,
    "s3_bucket": "olake-s3-test",
    "s3_region": "ap-south-1",
    "s3_access_key": "xxxxxxxxxxxxxxxxxxxx",
    "s3_secret_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "s3_path": "/data"
  }
}
```
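If you want to sanity-check a `writer.json` before running a sync, a minimal Go sketch like the following can load and validate it. The struct and field names here are illustrative, not OLake's internal types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// WriterConfig mirrors the writer.json layout shown above.
// Illustrative only; not OLake's internal types.
type WriterConfig struct {
	Type   string `json:"type"`
	Writer struct {
		Normalization bool   `json:"normalization"`
		S3Bucket      string `json:"s3_bucket"`
		S3Region      string `json:"s3_region"`
		S3AccessKey   string `json:"s3_access_key"`
		S3SecretKey   string `json:"s3_secret_key"`
		S3Path        string `json:"s3_path"`
	} `json:"writer"`
}

func main() {
	raw, err := os.ReadFile("writer.json")
	if err != nil {
		panic(err)
	}
	var cfg WriterConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	// Basic sanity checks before starting a sync.
	if cfg.Type != "PARQUET" {
		panic(fmt.Sprintf("unsupported type %q: only PARQUET is supported", cfg.Type))
	}
	if cfg.Writer.S3Bucket == "" || cfg.Writer.S3Region == "" {
		panic("s3_bucket and s3_region are required")
	}
	fmt.Printf("writing to s3://%s%s in %s\n", cfg.Writer.S3Bucket, cfg.Writer.S3Path, cfg.Writer.S3Region)
}
```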
Configuration Key Details
Key | Description | Data Type | Probable Values |
---|---|---|---|
`type` | Specifies the output file format for writing data. Currently, only the `"PARQUET"` format is supported. | string | `"PARQUET"` |
`writer.normalization` | Indicates whether data normalization (JSON flattening) should be applied before writing data to S3. Use `true` if you require normalized output, or `false` if not. | boolean | `true` or `false` (this example uses `false`) |
`writer.s3_bucket` | The name of the Amazon S3 bucket where your output files will be stored. Ensure that the bucket exists and that you have proper access. | string | A valid S3 bucket name (e.g. `"olake-s3-test"`) |
`writer.s3_region` | The AWS region where the specified S3 bucket is hosted. | string | AWS region codes such as `"ap-south-1"`, `"us-west-2"`, etc. |
`writer.s3_access_key` | The AWS access key used for authenticating S3 requests. This is typically a 20-character alphanumeric string. | string | A valid AWS access key |
`writer.s3_secret_key` | The AWS secret key used for S3 authentication. This key is generally longer (often 40+ characters) and should be kept secure. | string | A valid AWS secret key |
`writer.s3_path` | The path (or prefix) within the S3 bucket where data files will be written, typically a folder path starting with `/` (e.g. `"/data"`). | string | A valid path string |
- The generated `.parquet` files use SNAPPY compression (Read more). Note that SNAPPY is no longer supported by S3 Select when performing queries.
- OLake creates a test folder named `olake_writer_test` containing a single text file (`.txt`) with the content `S3 write test`. This is used to verify that you have the necessary permissions to write to S3 (see the sketch below).
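You can reproduce the same write-permission probe yourself. Here is a minimal sketch using the AWS SDK for Go v2, assuming the bucket and region from the `writer.json` above; the object key name is a hypothetical choice, not OLake's exact one:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	// Loads credentials from the environment or ~/.aws; supply static
	// credentials here if you want to mirror writer.json exactly.
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("ap-south-1"))
	if err != nil {
		panic(err)
	}
	client := s3.NewFromConfig(cfg)

	// Mimic OLake's probe: write a tiny text object and fail fast
	// if the credentials lack s3:PutObject on the bucket.
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("olake-s3-test"),
		Key:    aws.String("olake_writer_test/permission_check.txt"), // hypothetical key
		Body:   strings.NewReader("S3 write test"),
	})
	if err != nil {
		panic(fmt.Errorf("write test failed: %w", err))
	}
	fmt.Println("write permission confirmed")
}
```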
IAM Permissions Required
To successfully sync your database data with S3, ensure that your AWS IAM policy grants at least minimal permissions for listing, creating, reading, and writing to the S3 bucket. Below is a sample IAM policy. Note that this example may provide more permissions than the absolute minimum required.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ConsoleAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetAccountPublicAccessBlock",
        "s3:GetBucketAcl",
        "s3:GetBucketLocation",
        "s3:GetBucketPolicyStatus",
        "s3:GetBucketPublicAccessBlock",
        "s3:ListAccessPoints",
        "s3:ListAllMyBuckets"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ListObjectsInBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": [
        "arn:aws:s3:::olake-s3-test"
      ]
    },
    {
      "Sid": "AllObjectActions",
      "Effect": "Allow",
      "Action": "s3:*Object",
      "Resource": [
        "arn:aws:s3:::olake-s3-test/*"
      ]
    }
  ]
}
```
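One way to verify that the attached policy actually grants what a sync needs is IAM's policy simulator. The rough Go sketch below shows the idea; the principal ARN is a placeholder, and the exact set of actions you check should match your own policy:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/iam"
	"github.com/aws/aws-sdk-go-v2/service/iam/types"
)

// checkActions simulates the given actions against one resource and
// reports any that are not allowed for the principal.
func checkActions(ctx context.Context, client *iam.Client, principal string, actions []string, resource string) error {
	out, err := client.SimulatePrincipalPolicy(ctx, &iam.SimulatePrincipalPolicyInput{
		PolicySourceArn: &principal,
		ActionNames:     actions,
		ResourceArns:    []string{resource},
	})
	if err != nil {
		return err
	}
	for _, r := range out.EvaluationResults {
		if r.EvalDecision != types.PolicyEvaluationDecisionTypeAllowed {
			fmt.Printf("NOT allowed: %s on %s (%s)\n", *r.EvalActionName, resource, r.EvalDecision)
		}
	}
	return nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	client := iam.NewFromConfig(cfg)

	// Placeholder: the user or role OLake will authenticate as.
	principal := "arn:aws:iam::123456789012:user/olake-sync"

	// Bucket-level action against the bucket ARN...
	if err := checkActions(ctx, client, principal,
		[]string{"s3:ListBucket"}, "arn:aws:s3:::olake-s3-test"); err != nil {
		panic(err)
	}
	// ...and object-level actions against a concrete object ARN.
	if err := checkActions(ctx, client, principal,
		[]string{"s3:GetObject", "s3:PutObject", "s3:DeleteObject"},
		"arn:aws:s3:::olake-s3-test/olake_writer_test/test.txt"); err != nil {
		panic(err)
	}
}
```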
S3 Data Partitioning
OLake supports data partitioning when writing to S3. After the initial schema discovery, you can define a `partition_regex` within your `catalog.json` file. This regex determines how data is partitioned into folders within your S3 bucket. Partitioning is defined on a per-stream (table or collection) basis.
Sample `catalog.json` with Partitioning
```json
{
  "selected_streams": {
    "namespace": [
      {
        "stream_name": "table1",
        "partition_regex": "/{column_name, default_value, granularity}"
      },
      {
        "stream_name": "table2",
        "partition_regex": ""
      }
    ]
  },
  "streams": [
    {
      "stream": {
        "name": "table1",
        "namespace": "namespace",
        "sync_mode": "cdc"
      }
    },
    {
      "stream": {
        "name": "table2",
        "namespace": "namespace",
        "sync_mode": "cdc"
      }
    }
  ]
}
```
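If you are processing `catalog.json` in your own tooling, the `selected_streams` block can be modeled with a couple of small structs. A minimal Go sketch; the type names are illustrative, not OLake's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// SelectedStream captures the per-stream partition setting from
// catalog.json. Names here are illustrative, not OLake's own types.
type SelectedStream struct {
	StreamName     string `json:"stream_name"`
	PartitionRegex string `json:"partition_regex"`
}

type Catalog struct {
	// Keyed by namespace, e.g. "namespace" or "otter_db".
	SelectedStreams map[string][]SelectedStream `json:"selected_streams"`
}

func main() {
	raw, err := os.ReadFile("catalog.json")
	if err != nil {
		panic(err)
	}
	var cat Catalog
	if err := json.Unmarshal(raw, &cat); err != nil {
		panic(err)
	}
	for ns, streams := range cat.SelectedStreams {
		for _, s := range streams {
			fmt.Printf("%s.%s -> partition_regex=%q\n", ns, s.StreamName, s.PartitionRegex)
		}
	}
}
```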
- The `partition_regex` is a stream-level property; you can define different patterns for each stream.
- For example, the following sample demonstrates partitioning for a namespace named `otter_db`:

```json
{
  "selected_streams": {
    "otter_db": [
      {
        "stream_name": "stream_8",
        "partition_regex": "{title,,}/{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{latest_revision,,}"
      },
      {
        "stream_name": "stream_0",
        "partition_regex": "{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{string_change_language,,}"
      }
    ]
  }
  // Additional configuration...
}
```
In this context, `title` and `latest_revision` are column names. The partition path is generated dynamically from the column values as well as the current date.
When partitioning is applied, the resulting S3 folder structure includes the database name and the table/collection name. For instance, if your table is named `table1` in a database called `test_db`, the generated path might look like:
```
olake-s3-test/data-s3/test_db/table1/<value_of_the_partition_column1>/YY-MM-DD/<value_of_latest_version>
```
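Putting the pieces together, the object key is roughly `<s3_path>/<database>/<table>/<partition>/<file>.parquet` under the bucket. A small, hypothetical helper illustrating that composition (OLake's real file naming may differ):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// buildObjectKey sketches how the final S3 key is composed from the
// configured path, the database, the table, and the rendered
// partition. Hypothetical helper; not OLake's actual naming code.
func buildObjectKey(s3Path, db, table, partition, file string) string {
	return path.Join(strings.TrimPrefix(s3Path, "/"), db, table, partition, file)
}

func main() {
	key := buildObjectKey("/data-s3", "test_db", "table1", "olake/2025-06-13/v3", "part-0001.parquet")
	fmt.Println("s3://olake-s3-test/" + key)
	// s3://olake-s3-test/data-s3/test_db/table1/olake/2025-06-13/v3/part-0001.parquet
}
```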
Partition Regex Attributes
You can partition data based on column values using the `partition_regex`. A typical partition pattern looks like:
`"partition_regex": "/{column_name, default_value, granularity}"`
- **column_name:** (Required) The name of the column whose value determines the partition folder. You can also use the `now()` function to represent the current date.
- **default_value:** (Optional) A fallback: if the column value is null or unparseable, this value (prefixed with `default_`) is used instead.
- **granularity:** (Optional) For time-based columns, the granularity to extract. Supported values are `HH` (hour), `DD` (day), `WW` (week), `MM` (month), and `YY` (year). `granularity` is not needed (and can be left empty) if `column_name` is not a timestamp field.
- Each segment enclosed between slashes (`/.../`) in the partition regex can include up to three parameters.
- You can define as many directory path partitions as you wish; a sketch of how such a pattern might be evaluated follows this list.
- Path names other than English are also supported (as long as they are `STRING` values).
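To make the three-parameter segments concrete, here is a rough Go sketch of how a pattern such as `{title,,}/{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{latest_revision,,}` might be expanded for one record. It illustrates the semantics described above and is not OLake's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// placeholder matches one {column, default, granularity} group.
var placeholder = regexp.MustCompile(`\{([^,{}]*),([^,{}]*),([^,{}]*)\}`)

// renderPartition expands a partition pattern for a single record.
// Semantics follow the docs above: now() uses the current time,
// granularity picks a date component, and a missing column falls
// back to "default_" + default_value. Not OLake's actual code.
func renderPartition(pattern string, record map[string]string) string {
	now := time.Now()
	return placeholder.ReplaceAllStringFunc(pattern, func(m string) string {
		parts := placeholder.FindStringSubmatch(m)
		col := strings.TrimSpace(parts[1])
		def := strings.TrimSpace(parts[2])
		gran := strings.TrimSpace(parts[3])

		if col == "now()" {
			switch gran {
			case "YY":
				return now.Format("2006")
			case "MM":
				return now.Format("01")
			case "DD":
				return now.Format("02")
			case "HH":
				return now.Format("15")
			case "WW":
				_, wk := now.ISOWeek()
				return fmt.Sprintf("%02d", wk)
			}
			return now.Format("2006-01-02")
		}
		if v, ok := record[col]; ok {
			return v
		}
		return "default_" + def
	})
}

func main() {
	rec := map[string]string{"title": "olake", "latest_revision": "v3"}
	p := "{title,,}/{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}/{latest_revision,,}"
	fmt.Println(renderPartition(p, rec)) // e.g. olake/2025-06-13/v3
}
```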
Supported Regex Patterns
The partition regex supports several patterns to handle both column values and timestamps. Below is a summary of the supported patterns:
Pattern | Description | Example |
---|---|---|
`now()` | Uses the current local timestamp to extract date components such as day (DD), month (MM), or year (YY). | – |
`default` | If a specified column does not exist, the value is treated as a constant string prefixed with `default_`. | – |
`{col_name}` | Inserts the value of the specified column. | – |
`{now(),2025,YY}` | Extracts only the year part of the current timestamp. | `2025` |
`{now(),2025,MM}` | Extracts only the month part of the current timestamp. | `1`, `12` |
`{now(),2025,DD}` | Extracts only the day part of the current timestamp. | `1`, `31` |
`{now(),2025,YY}-{now(),06,MM}` | Combines year and month from the current timestamp. | `2025-02` |
`{now(),2025,YY}-{now(),06,MM}-{now(),13,DD}` | Combines year, month, and day from the current timestamp. | `2025-02-13`, `2025-12-31` |
Usage Examples:

- Correct: ✅
  - `"partition_regex": "/{title,,}/{now(),2025,DD}/{latest_version,,}"`
  - `"partition_regex": "{title,,}/{now(),2025,DD}/{latest_version,,}"`
- Incorrect: ❌
  - `"partition_regex": "{title},,}/{now(),2025,DD}/{latest_version,,}"` (note the stray closing brace after `title`)
*Figure: default value vs. parsed value.*
Handling Special Characters in Column Values
When data is converted to Parquet format, escape sequences for special symbols (e.g., `\u0026` for `&`) are converted to their corresponding characters while constructing the folder name. However, if a column value contains the `/` character, it will be interpreted as a directory separator, creating additional nested folders.
For example, consider a column named `title` with the value:
`Hi/my name is / priyansh`
This value will generate a nested directory structure as shown below:
```
olake-s3-test
└── data-s3
    └── test_db
        └── table1
            └── Hi
                └── my name is
                    └── priyansh
                        └── 423878fsd87fsd78XXXXXX.parquet
```
If you intend the entire value to be treated as a single folder name, ensure that the data does not include the `/` character, or sanitize your column values before ingestion; a small sketch of such sanitization follows.
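For instance, a pre-ingestion cleanup step might replace path separators before the value ever reaches the writer. A minimal, illustrative sketch; the replacement character is an arbitrary choice:

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizePartitionValue replaces path separators so the whole
// value maps to a single S3 "folder". Illustrative only; pick a
// replacement that cannot collide with real data.
func sanitizePartitionValue(v string) string {
	return strings.ReplaceAll(v, "/", "_")
}

func main() {
	fmt.Println(sanitizePartitionValue("Hi/my name is / priyansh"))
	// Output: Hi_my name is _ priyansh
}
```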
Supported Timestamp Formats for Partitioning
OLake attempts to parse a wide range of timestamp formats when generating partition folders. The following formats are currently supported:
"2006-01-02",
"2006-01-02 15:04:05",
"2006-01-02 15:04:05 -07:00",
"2006-01-02 15:04:05-07:00",
"2006-01-02T15:04:05",
"2006-01-02T15:04:05.000000",
"2006-01-02T15:04:05.999999999Z07:00",
"2006-01-02T15:04:05+0000",
"2020-08-17T05:50:22.895Z",
"2006-01-02 15:04:05.999999-07",
"2006-01-02 15:04:05.999999+00"
For the latest updates or to add support for additional timestamp formats, please refer to the implementation code or create an issue.
For a column name that is a timestamp, replace `now()` with the name of the column and specify which component you want to extract from the timestamp field:
```json
{
  "selected_streams": {
    "otter_db": [
      {
        "partition_regex": "{timestamp_column,2025,YY}-{timestamp_column,12,MM}-{timestamp_column,13,DD}",
        "stream_name": "stream_0"
      }
    ]
  },
  "streams": [
    ...
  ]
}
```