
S3 Parquet

OLake supports writing in parquet format directly to S3. Before proceeding with the S3 Parquet destination, we recommend reviewing the Getting Started and Installation sections.

Prerequisites​

Configuration​

| Key | Description | Data Type | Probable Values |
|---|---|---|---|
| S3 Bucket | The name of the Amazon/Google S3 bucket (without `s3://` or `gs://`) where your output files will be stored. Ensure that the bucket exists and that you have proper access. | string | A valid S3 bucket name (e.g. `"olake-s3-test"`) |
| S3 Region | The AWS/MinIO/GCS region where the specified S3 bucket is hosted. | string | AWS/GCS region codes such as `"ap-south-1"`, `"us-west-2"`, etc. |
| S3 Access Key | The AWS/MinIO/GCS HMAC access key used for authenticating S3 requests. | string | A valid AWS/GCS HMAC access key |
| S3 Secret Key | The AWS/MinIO/GCS HMAC secret key used for S3 authentication. Keep this key secure. | string | A valid AWS/GCS HMAC secret key |
| S3 Path | The specific path (or prefix) within the S3 bucket where data files will be written. This is typically a folder path that starts with a `/` (e.g. `"/data"`). | string | A valid path string |
| S3 Endpoint | (Optional) Custom S3-compatible endpoint. Required when using GCS HMAC keys or MinIO. | string | `"https://storage.googleapis.com"`, `"https://<MinIO-Storage-Endpoint>:9000"` |
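Putting the fields above together, a destination configuration file might look like the following. This is a sketch only: the exact key names (`type`, `writer`, `s3_bucket`, and so on) are assumptions based on the table above, so check them against your OLake version before use.

```json
{
  "type": "PARQUET",
  "writer": {
    "s3_bucket": "olake-s3-test",
    "s3_region": "ap-south-1",
    "s3_access_key": "<aws-access-key>",
    "s3_secret_key": "<aws-secret-key>",
    "s3_path": "/data"
  }
}
```

For plain AWS S3 the endpoint field can be omitted; it is only needed for S3-compatible stores such as GCS or MinIO.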

Create an S3-compatible destination in OLake UI​

Steps to get started:

  1. Navigate to the Destinations tab.
  2. Click + Create Destination.
  3. Select AWS S3 as the destination type from the Connector dropdown.
  4. Fill in the required connection details in the form.
  5. Click Create ->.
  6. OLake will test the destination connection and display the results. If the connection succeeds, you will see a success message; if there are any issues, OLake provides error messages to help you troubleshoot.

This creates an S3 destination in OLake; you can now use it in your Jobs Pipeline to sync data from any source to AWS S3.

olake-destination

Using AWS Amazon S3 Credentials​

OLake supports direct syncing of data from a source to AWS S3 using Amazon S3 credentials.
For this, refer to the IAM permissions needed for Amazon-powered AWS S3. The user needs to provide:

  • AWS S3 bucket path
  • AWS S3 access key and secret key
  • AWS S3 region
  • AWS S3 path (if any)
  • AWS S3 endpoint (required only for S3-compatible storage systems other than AWS S3; refer to the S3 Endpoint doc)

olake-s3-bucket

note

If using AWS IAM Role with the required permissions, the AWS Access Key and Secret Key fields can be left blank.

Using GCS-compatible S3 Credentials​

OLake supports writing data to Google Cloud Storage (GCS) in parquet format.

Google Cloud Storage provides an S3-compatible interface, allowing you to use S3-compatible tools and libraries to interact with GCS buckets and objects. Because GCS supports the Amazon S3 API, you can reuse existing S3 tools and workflows with minimal changes: update the endpoint to https://storage.googleapis.com and authenticate via HMAC keys (discussed in the next section). This compatibility simplifies migrations, data transfers, and tool usage across platforms. For role-based permissions, refer to GCP IAM Permission.

Creation of HMAC Keys​

  • HMAC keys act as the access key and secret key for the S3 writer.
  • In the Google Cloud Console, go to Storage, then Settings, then select Interoperability.
  • Copy the request endpoint and provide it as the S3 endpoint in OLake.
  • Create HMAC keys for the service account; these consist of an access key and a corresponding secret key.
  • Use those HMAC keys as the S3 access key and secret key.
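The steps above translate into a destination configuration along these lines. The key names are illustrative assumptions (mirroring the fields in the configuration table), and the HMAC values are placeholders:

```json
{
  "type": "PARQUET",
  "writer": {
    "s3_bucket": "olake-gcs-test",
    "s3_region": "us-central1",
    "s3_access_key": "<gcs-hmac-access-key>",
    "s3_secret_key": "<gcs-hmac-secret-key>",
    "s3_path": "/data",
    "s3_endpoint": "https://storage.googleapis.com"
  }
}
```

The endpoint is what redirects the S3 writer to GCS; everything else behaves as with AWS S3.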
info

HMAC (Hash-based Message Authentication Code) keys in Google Cloud Storage are used for authentication when accessing GCS resources, particularly through the S3-compatible API. They consist of an access key and a secret key, which provide a way to sign requests and verify identity without using Google account credentials.

Refer to https://cloud.google.com/storage/docs/authentication/hmackeys

olake-gcs-s3

Using MinIO S3 Credentials

OLake supports S3-compatible MinIO as well in its S3 destination configuration.

You can create a MinIO service account, create a bucket in it, and provide the MinIO access key and secret key along with the bucket URL in the config.
For the S3 endpoint, provide the URL through which the MinIO bucket is accessible.
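For a local MinIO setup, the same configuration shape applies with the endpoint pointing at the MinIO server. Key names and the endpoint value are illustrative assumptions; adjust the host and port to wherever your MinIO instance is reachable:

```json
{
  "type": "PARQUET",
  "writer": {
    "s3_bucket": "olake-minio-test",
    "s3_region": "us-east-1",
    "s3_access_key": "<minio-access-key>",
    "s3_secret_key": "<minio-secret-key>",
    "s3_path": "/data",
    "s3_endpoint": "http://<MinIO-Storage-Endpoint>:9000"
  }
}
```

If OLake runs inside Docker alongside MinIO, use a hostname that resolves from the OLake container (for example, the MinIO service name) rather than localhost.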

note

The MinIO destination configuration was tested by spinning up a MinIO Docker container in the same directory where OLake's UI backend is present, alongside the OLake UI Docker container.

olake-minio-s3


info
  1. The generated .parquet files use SNAPPY compression (Read more). Note that SNAPPY is no longer supported by S3 Select when performing queries.
  2. OLake creates a test folder named olake_writer_test containing a single text file (.txt) with the content:
    S3 write test
    This is used to verify that you have the necessary permissions to write to S3.


πŸ’‘ Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
πŸ‘‰ Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. πŸš€

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!