
Hive Catalog

info

Use iceberg_s3_path with the s3a prefix if your Hive is configured that way; this works for most use cases. Otherwise, use iceberg_path with the s3 prefix.

Parameter | Sample Value | Description
Iceberg S3 Path | s3a://warehouse/ or gs://hive-dataproc-generated-bucket/hive-warehouse | Determines the S3 path or storage location for Iceberg data. The value "s3a://warehouse/" represents the designated S3 bucket or directory. If using GCP, use the Dataproc Hive Metastore bucket.
AWS Region | us-east-1 | Specifies the AWS region associated with the S3 bucket where the data is stored.
AWS Access Key | admin | Provides the AWS access key used for authentication when connecting to S3.
AWS Secret Key | password | Provides the AWS secret key used for authentication when connecting to S3.
S3 Endpoint | http://localhost:9000 | Specifies the endpoint URL for the S3 service. This may be used when connecting to an S3-compatible storage service like MinIO running on localhost.
Hive URI | thrift://localhost:9083 or thrift://METASTORE_IP:9083 | Defines the URI of the Hive Metastore service that the writer connects to for catalog interactions. METASTORE_IP is provided by GCP's Dataproc Hive Metastore.
Use SSL for S3 | false | Indicates whether SSL is enabled for S3 connections. "false" means that SSL is disabled for these communications.
Use Path Style for S3 | true | Determines whether path-style access is used for S3. "true" means that the writer uses path-style addressing instead of the default virtual-hosted style.
Hive Clients | 5 | Specifies the number of Hive clients allocated for managing interactions with the Hive Metastore.
Enable SASL for Hive | false | Indicates whether SASL authentication is enabled for the Hive connection. "false" means that SASL is disabled.
Iceberg Database | olake_iceberg | Specifies the name of the Iceberg database to be used by the destination configuration.
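
For reference, here is a minimal sketch of a destination.json using the sample values above (a local Hive Metastore with MinIO as S3-compatible storage). Only the keys already shown in the GCP example later on this page are included, and all values are placeholders.

destination.json (local Hive + MinIO, illustrative)
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "hive",
    "hive_uri": "thrift://localhost:9083",
    "hive_clients": 5,
    "hive_sasl_enabled": false,
    "iceberg_db": "olake_iceberg",
    "iceberg_s3_path": "s3a://warehouse/",
    "aws_region": "us-east-1"
  }
}

The remaining parameters from the table (AWS access key, AWS secret key, S3 endpoint, SSL, and path-style settings) go in the same writer block; refer to the OLake destination configuration reference for their exact JSON key names.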

You can query the data via:

SELECT * FROM CATALOG_NAME.ICEBERG_DATABASE_NAME.TABLE_NAME;
  • CATALOG_NAME can be: jdbc_catalog, hive_catalog, rest_catalog, etc.
  • ICEBERG_DATABASE_NAME is the name of the Iceberg database you created / added as a value in the destination.json file.
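
For example, with the Hive catalog and the Iceberg database above (the table name orders is only a placeholder):

SELECT * FROM hive_catalog.olake_iceberg.orders;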

For the S3 permissions needed to write data to S3, refer to the AWS S3 Permissions documentation.
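
As a rough illustration of what those permissions typically cover (the linked documentation is authoritative; the actions and bucket name below are illustrative assumptions), the IAM policy attached to the writer's credentials usually needs object read/write/delete plus bucket listing:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<your-bucket>/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::<your-bucket>"
    }
  ]
}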

Using GCP Dataproc Metastore (Hive) with Google Cloud Storage (GCS)

OLake supports using Google Cloud Dataproc Metastore (Hive) as the Iceberg catalog and Google Cloud Storage (GCS) as the data lake destination. This allows you to leverage GCP-native services for scalable, managed metadata and storage.

Dataproc Metastore

[Image: Dataproc Metastore]

Step-by-Step Setup

  1. Create a GCP Project (if you don't have one).
  2. Provision a Dataproc Metastore (Hive):
    • Go to the GCP Console → Dataproc → Metastore services.
    • Click "Create Metastore Service".
    • Fill in service name, location, version, release channel, port (default: 9083), and service tier.
    • Set the endpoint protocol to Thrift.
    • Expose the service to the network (VPC/subnet) where OLake will run.
    • Enable the Data Catalog sync option if desired.
    • Choose database type and other options as needed.
    • Click Submit. Creation may take 20–30 minutes.
  3. Expose the Metastore endpoint to the network where OLake will run (ensure network connectivity and firewall rules allow access to the Thrift port).
  4. Create or choose a GCS bucket for Iceberg data.
  5. Deploy OLake in the same network (or with access to the Metastore endpoint).

GCP-Hive OLake Destination Config

destination.json (GCP Hive + GCS)
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "hive",
    "hive_uri": "thrift://<METASTORE_IP>:9083",
    "hive_clients": 10,
    "hive_sasl_enabled": false,
    "iceberg_db": "olake_iceberg",
    "iceberg_s3_path": "gs://<hive-dataproc-generated-bucket>/hive-warehouse",
    "aws_region": "us-central1"
  }
}
  • Replace <METASTORE_IP> with your Dataproc Metastore’s internal IP or hostname.
  • Replace <hive-dataproc-generated-bucket> with the Dataproc Metastore generated hive.metastore.warehouse.dir bucket.
  • Set aws_region to your GCP region (e.g., us-central1).

Notes

  • The hive_uri must use the Thrift protocol and point to your Dataproc Metastore endpoint.
  • The iceberg_s3_path can use the gs:// prefix for GCS buckets.
  • Ensure OLake has network access to the Metastore and permissions to write to the GCS bucket.
  • Data written will be in Iceberg format, queryable via compatible engines (e.g., Spark, Trino) configured with the same Hive Metastore and GCS bucket; see the sketch below.
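
As an illustrative sketch of that engine-side setup (assuming a Spark build with the Iceberg runtime and GCS connector jars on the classpath; the catalog name hive_catalog is arbitrary, and these are standard Apache Iceberg Spark properties rather than OLake-specific settings):

spark-defaults.conf (illustrative)
spark.sql.extensions                      org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.hive_catalog            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_catalog.type       hive
spark.sql.catalog.hive_catalog.uri        thrift://<METASTORE_IP>:9083
spark.sql.catalog.hive_catalog.warehouse  gs://<hive-dataproc-generated-bucket>/hive-warehouse

Tables can then be queried as hive_catalog.olake_iceberg.<table> from Spark SQL.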

Troubleshooting

  • If you encounter connection issues, verify firewall rules and VPC peering between OLake and the Dataproc Metastore.
  • Ensure the Dataproc Metastore is in a running state and the Thrift port is open.
info

If you wish to test out the REST Catalog locally, you can use the docker-compose setup. The local test setup uses MinIO as S3-compatible storage and includes all the other supported catalog types.

You can then set up local Spark to run queries on the Iceberg tables created in the local test setup.


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: where we discuss future roadmaps and bugs, help folks debug the issues they are facing, and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!