
Hive Catalog Write Guide

OLake integrates with Hive Catalog to provide full support for Apache Iceberg tables.

With this setup:

  • Data is stored in object storage (S3, GCS, MinIO, or any S3-compatible system).
  • Metadata is managed by Hive Metastore.
  • OLake seamlessly writes into Iceberg tables using Hive Metastore + Object storage.

Prerequisites

Before configuring OLake with Hive Catalog, ensure the following:

1. Hive Metastore

A Hive Metastore service will serve as the Iceberg metadata catalog. This can be:

  • Managed service: GCP Dataproc Metastore, AWS EMR, or Azure HDInsight
  • Self-hosted: Apache Hive Metastore running on your infrastructure
  • Local development: Docker-based Hive Metastore

Required Metastore Configuration:

  • Thrift protocol enabled (default port 9083)
  • A database backend (PostgreSQL, MySQL, or another JDBC-supported database) for metadata storage. The metastore's database user needs create, insert, update, delete, and select permissions on that database (a setup sketch follows this list).
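For a self-hosted metastore backed by PostgreSQL, a minimal backend setup might look like the sketch below. The database, user, and password names (metastore, hive, hivepw) are placeholders, not values OLake requires.

```bash
# Create the metastore backend database and user (all names are placeholders)
psql -h <db-host> -U postgres <<'SQL'
CREATE DATABASE metastore;
CREATE USER hive WITH PASSWORD 'hivepw';
GRANT ALL PRIVILEGES ON DATABASE metastore TO hive;
SQL
```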

2. Object Storage

A bucket for storing Iceberg data files (Parquet + metadata).
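Creating the bucket up front avoids first-write failures. For example (bucket names and regions are placeholders; the MinIO line assumes an `mc` alias named `minio` is already configured):

```bash
aws s3 mb s3://warehouse --region us-east-1          # AWS S3
gsutil mb -l us-central1 gs://my-iceberg-warehouse   # Google Cloud Storage
mc mb minio/warehouse                                # MinIO / S3-compatible
```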

Configuration

  • Before setting up the destination, make sure you have successfully set up the source.

Hive Catalog

| Parameter | Sample Value | Description |
| --- | --- | --- |
| Iceberg S3 Path (Warehouse) | `s3://warehouse/` or `gs://<hive-dataproc-generated-bucket>/hive-warehouse` | The S3 path or storage location for Iceberg data. `s3://warehouse/` represents the designated S3 bucket or directory. On GCP, use the Dataproc Metastore's Hive warehouse bucket. Use the `s3a://` prefix when writing to MinIO. |
| AWS Region | `us-east-1` | The AWS region of the S3 bucket where the data is stored. |
| AWS Access Key | `admin` | AWS access key with sufficient S3 permissions. Optional if an IAM role is attached to the running instance/pod. |
| AWS Secret Key | `password` | AWS secret key with sufficient S3 permissions. Optional if an IAM role is attached to the running instance/pod. |
| S3 Endpoint | `http://S3_ENDPOINT` | Endpoint URL for the S3 service. Used when connecting to an S3-compatible service such as MinIO running on localhost. |
| Hive URI | `thrift://<hostname>:9083` or `thrift://METASTORE_IP:9083` | URI of the Hive Metastore the writer connects to for catalog interactions. On GCP, `METASTORE_IP` is provided by the Dataproc Metastore. Use `thrift://localhost:9083` for a local Docker Compose setup, or `thrift://host.docker.internal:9083` if Hive runs in a separate Docker container. |
| Use SSL for S3 | `false` | Whether SSL is enabled for S3 connections. `false` disables SSL. |
| Use Path Style for S3 | `true` | Whether path-style access is used for S3. `true` uses path-style addressing instead of the default virtual-hosted style. |
| Hive Clients | `5` | Number of Hive clients allocated for interactions with the Hive Metastore. |
| Enable SASL for Hive | `false` | Whether SASL authentication is enabled for the Hive connection. `false` disables SASL. |
| Iceberg Database | `iceberg_db` | Name of the Iceberg database used by the destination. |
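Pulled together, a destination config with these values might look like the following sketch. The file name `writer.json` and the key names/nesting are assumptions inferred from the snake_case fields referenced elsewhere in this guide (`hive_uri`, `iceberg_s3_path`, `s3_path_style`, `s3_use_ssl`); verify them against your OLake version before use.

```bash
# A minimal writer.json sketch for the sample values in the table above.
# Key names are assumptions; confirm against your OLake version.
cat > writer.json <<'EOF'
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "hive",
    "hive_uri": "thrift://localhost:9083",
    "hive_clients": 5,
    "hive_sasl_enabled": false,
    "iceberg_s3_path": "s3a://warehouse",
    "iceberg_db": "iceberg_db",
    "aws_region": "us-east-1",
    "aws_access_key": "admin",
    "aws_secret_key": "password",
    "s3_endpoint": "http://localhost:9000",
    "s3_use_ssl": false,
    "s3_path_style": true
  }
}
EOF
```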

After you have successfully set up the destination, configure your streams.


GCP Dataproc Metastore

OLake supports using Google Cloud Dataproc Metastore (Hive) as the Iceberg catalog and Google Cloud Storage (GCS) as the data lake destination. This allows you to leverage GCP-native services for scalable, managed metadata and storage.


Step-by-Step Setup

  1. Create a GCP Project (if you don't have one).
  2. Provision a Dataproc Metastore (Hive) via the console (a CLI sketch follows this list):
    • Go to the GCP Console → Dataproc → Metastore services.
    • Click "Create Metastore Service".
    • Fill in service name, location, version, release channel, port (default: 9083), and service tier.
    • Set the endpoint protocol to Thrift.
    • Expose the service to the network (VPC/subnet) where OLake will run.
    • Enable the Data Catalog sync option if desired.
    • Choose database type and other options as needed.
    • Click Submit. Creation may take 20–30 minutes.
  3. Expose the Metastore endpoint to the network where OLake will run (ensure network connectivity and firewall rules allow access to the Thrift port).
  4. Create or choose a GCS bucket for Iceberg data.
  5. Deploy OLake in the same network (or with access to the Metastore endpoint).
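If you prefer a CLI over the console, steps 2–3 roughly correspond to the sketch below. The service name, location, network, and tier are placeholders, and flag availability can vary by gcloud version, so check `gcloud metastore services create --help` first.

```bash
gcloud metastore services create olake-metastore \
  --location=us-central1 \
  --network=default \
  --port=9083 \
  --tier=developer \
  --endpoint-protocol=thrift
```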

GCP-Hive OLake Destination Config

| Parameter | Sample Value | Description |
| --- | --- | --- |
| Iceberg S3 Path (Warehouse) | `gs://<hive-dataproc-generated-bucket>/hive-warehouse` | Use the Hive warehouse bucket generated by the Dataproc Metastore. |
| AWS Region | `us-central1` | Your GCP region. |
| Hive URI | `thrift://<METASTORE_IP>:9083` | Dataproc Metastore Thrift endpoint. |
| Hive Clients | `10` | Number of concurrent Hive clients. |
| Enable SASL for Hive | `false` | Leave disabled unless your metastore requires SASL. |
| Iceberg Database | `olake_iceberg` | Target Iceberg database name. |
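Before running a sync, it is worth confirming from the machine running OLake that both the metastore and the bucket are reachable, for example:

```bash
# Is the metastore's Thrift port reachable? (requires netcat)
nc -vz <METASTORE_IP> 9083
# Is the warehouse bucket accessible with the active GCP credentials?
gsutil ls gs://<hive-dataproc-generated-bucket>/hive-warehouse
```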

Notes

  • The hive_uri must use the Thrift protocol and point to your Dataproc Metastore endpoint.
  • The iceberg_s3_path can use the gs:// prefix for GCS buckets.
  • Ensure OLake has network access to the Metastore and permissions to write to the GCS bucket.
  • Data written will be in Iceberg format, queryable via compatible engines (e.g., Spark, Trino) configured with the same Hive Metastore and GCS bucket.
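As an illustration of the last point, a Spark SQL session pointed at the same metastore and warehouse can read the tables OLake writes. The catalog name `olake` and the runtime package version below are illustrative, and GCS access additionally requires the GCS connector on Spark's classpath:

```bash
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
  --conf spark.sql.catalog.olake=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.olake.type=hive \
  --conf spark.sql.catalog.olake.uri=thrift://<METASTORE_IP>:9083 \
  --conf spark.sql.catalog.olake.warehouse=gs://<hive-dataproc-generated-bucket>/hive-warehouse \
  -e 'SELECT * FROM olake.olake_iceberg.<table_name> LIMIT 10'
```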

Troubleshooting

The OLake Hive Catalog connector stops immediately upon encountering errors to ensure data accuracy. Below are common issues and their fixes:

  • Hive Metastore JAR Dependencies Missing
    • Cause: Required JAR files not available in Hive Metastore classpath for S3 and PostgreSQL connectivity.
    • Fix:
      • Verify the following essential JARs are present in /opt/hive/lib/ (a download sketch for any missing JARs follows this troubleshooting list):
        # Check for required JARs
        ls -la /opt/hive/lib/ | grep -E "(hadoop-aws|postgresql|aws-java-sdk)"
      • Required JARs for S3 Integration:
        • hadoop-aws-3.3.4.jar - Enables S3A filesystem support
        • aws-java-sdk-bundle-1.12.262.jar - AWS SDK for S3 operations
      • Required JARs for PostgreSQL Backend:
        • postgresql-42.5.4.jar - PostgreSQL JDBC driver for metadata storage
  • Connection Refused to Hive Metastore
    • Cause: Hive Metastore service not accessible or network connectivity issues.
    • Fix:
      • Verify Hive Metastore is running and accessible:
        telnet <hive-metastore-host> 9083
      • Check hive_uri format: thrift://<hostname>:9083
      • Use host.docker.internal instead of localhost when running in Docker.
      • Ensure firewall rules allow access to port 9083.
      • For GCP Dataproc Metastore, verify VPC connectivity and service status.
  • Database Backend Connection Failed
    • Cause: PostgreSQL/MySQL backend database not accessible or misconfigured.
    • Fix:
      • Verify database backend is running:
        psql -h <db-host> -p 5432 -U <username> -d <database>
      • Ensure the database user has the required permissions (in PostgreSQL, SELECT/INSERT/UPDATE/DELETE are granted on tables or schemas, not on the database itself):
        GRANT ALL PRIVILEGES ON DATABASE <database_name> TO <username>;
        GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO <username>;
      • Check Hive Metastore configuration for correct JDBC URL.
  • S3/Object Storage Access Denied
    • Cause: Invalid AWS credentials or insufficient S3 bucket permissions.
    • Fix:
      • Verify AWS credentials and permissions:
        aws s3 ls s3://<bucket-name>
      • For MinIO, ensure correct endpoint and credentials:
        mc alias set minio http://<endpoint> <access-key> <secret-key>
        mc ls minio/<bucket-name>
      • Check s3_endpoint, aws_access_key, and aws_secret_key configuration.
      • Ensure bucket exists and is in the correct region.
  • Path Style Access Error with MinIO
    • Cause: S3 addressing configuration issue with MinIO or non-AWS S3.
    • Fix:
      • Set s3_path_style: true for MinIO and non-AWS S3 services.
      • Use correct endpoint format: http://minio:9000 (no bucket in URL).
      • Ensure s3_use_ssl: false for HTTP endpoints.
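If the classpath check in the first item above shows missing JARs, they can be fetched from Maven Central. A sketch using the versions named above:

```bash
cd /opt/hive/lib
# S3A filesystem support + AWS SDK bundle
curl -fLO https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
curl -fLO https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
# PostgreSQL JDBC driver for the metastore backend
curl -fLO https://repo1.maven.org/maven2/org/postgresql/postgresql/42.5.4/postgresql-42.5.4.jar
# Restart the Hive Metastore so the new JARs are loaded
```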


💡 Join the OLake Community!

Got questions, ideas, or just want to connect with other data engineers?
👉 Join our Slack Community to get real-time support, share feedback, and shape the future of OLake together. 🚀

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!