Hive Catalog
Use `iceberg_s3_path` with the `s3a://` prefix if your Hive installation is configured for it; this works for most use cases. Otherwise, use `iceberg_s3_path` with the `s3://` prefix.
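For example, with the sample bucket `warehouse` used in the configuration below, the two variants differ only in the prefix:

```json
{ "iceberg_s3_path": "s3a://warehouse/" }
```

and, if your Hive is not configured for the `s3a` filesystem:

```json
{ "iceberg_s3_path": "s3://warehouse/" }
```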
OLake UI
Parameter | Sample Value | Description |
---|---|---|
Iceberg S3 Path | s3a://warehouse/ or gs://hive-dataproc-generated-bucket/hive-warehouse | Determines the S3 path or storage location for Iceberg data. The value "s3a://warehouse/" represents the designated S3 bucket or directory. On GCP, use the bucket generated by the Dataproc Metastore. |
AWS Region | us-east-1 | Specifies the AWS region associated with the S3 bucket where the data is stored. |
AWS Access Key | admin | Provides the AWS access key used for authentication when connecting to S3. |
AWS Secret Key | password | Provides the AWS secret key used for authentication when connecting to S3. |
S3 Endpoint | http://localhost:9000 | Specifies the endpoint URL for the S3 service. This may be used when connecting to an S3-compatible storage service like MinIO running on localhost. |
Hive URI | thrift://localhost:9083 or thrift://METASTORE_IP:9083 | Defines the URI of the Hive Metastore service that the writer connects to for catalog interactions. On GCP, METASTORE_IP is the endpoint of your Dataproc Metastore service. |
Use SSL for S3 | false | Indicates whether SSL is enabled for S3 connections. "false" means that SSL is disabled for these communications. |
Use Path Style for S3 | true | Determines if path-style access is used for S3. "true" means that the writer will use path-style addressing instead of the default virtual-hosted style. |
Hive Clients | 5 | Specifies the number of Hive clients allocated for managing interactions with the Hive Metastore. |
Enable SASL for Hive | false | Indicates whether SASL authentication is enabled for the Hive connection. "false" means that SASL is disabled. |
Iceberg Database | olake_iceberg | Specifies the name of the Iceberg database to be used by the destination configuration. |
OLake CLI

```json
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "hive",
    "iceberg_s3_path": "s3a://warehouse/",
    "aws_region": "us-east-1",
    "aws_access_key": "admin",
    "aws_secret_key": "password",
    "s3_endpoint": "http://localhost:9000",
    "hive_uri": "thrift://localhost:9083",
    "s3_use_ssl": false,
    "s3_path_style": true,
    "hive_clients": 5,
    "hive_sasl_enabled": false,
    "iceberg_db": "ICEBERG_DATABASE_NAME"
  }
}
```
Hive Configuration Parameters
Parameter | Sample Value | Description |
---|---|---|
catalog_type | hive | Indicates the catalog type used by the writer. "hive" means that the writer uses the Hive Metastore for catalog operations. |
iceberg_s3_path | s3a://warehouse/ or gs://hive-dataproc-generated-bucket/hive-warehouse | Determines the S3 path or storage location for Iceberg data. The value "s3a://warehouse/" represents the designated S3 bucket or directory. On GCP, use the bucket generated by the Dataproc Metastore. |
aws_region | us-east-1 | Specifies the AWS region associated with the S3 bucket where the data is stored. |
aws_access_key | admin | Provides the AWS access key used for authentication when connecting to S3. |
aws_secret_key | password | Provides the AWS secret key used for authentication when connecting to S3. |
s3_endpoint | http://localhost:9000 | Specifies the endpoint URL for the S3 service. This may be used when connecting to an S3-compatible storage service like MinIO running on localhost. |
hive_uri | thrift://localhost:9083 or thrift://METASTORE_IP:9083 | Defines the URI of the Hive Metastore service that the writer connects to for catalog interactions. On GCP, METASTORE_IP is the endpoint of your Dataproc Metastore service. |
s3_use_ssl | false | Indicates whether SSL is enabled for S3 connections. "false" means that SSL is disabled for these communications. |
s3_path_style | true | Determines if path-style access is used for S3. "true" means that the writer will use path-style addressing instead of the default virtual-hosted style. |
hive_clients | 5 | Specifies the number of Hive clients allocated for managing interactions with the Hive Metastore. |
hive_sasl_enabled | false | Indicates whether SASL authentication is enabled for the Hive connection. "false" means that SASL is disabled. |
iceberg_db | olake_iceberg | Specifies the name of the Iceberg database to be used by the destination configuration. |
You can query the data via:

```sql
SELECT * FROM CATALOG_NAME.ICEBERG_DATABASE_NAME.TABLE_NAME;
```

- `CATALOG_NAME` can be `jdbc_catalog`, `hive_catalog`, `rest_catalog`, etc.
- `ICEBERG_DATABASE_NAME` is the name of the Iceberg database you created / added as a value in the `destination.json` file.

For the sample configuration above, the query would be `SELECT * FROM hive_catalog.olake_iceberg.TABLE_NAME;`.
For the S3-related permissions needed to write data to S3, refer to the AWS S3 Permissions documentation.
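As a minimal sketch, a policy granting what a writer typically needs might look like the following, assuming the sample bucket `warehouse` (a placeholder; treat the linked documentation as the authoritative list):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WarehouseBucketAccess",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::warehouse"
    },
    {
      "Sid": "WarehouseObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::warehouse/*"
    }
  ]
}
```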
Using GCP Dataproc Metastore (Hive) with Google Cloud Storage (GCS)
OLake supports using Google Cloud Dataproc Metastore (Hive) as the Iceberg catalog and Google Cloud Storage (GCS) as the data lake destination. This allows you to leverage GCP-native services for scalable, managed metadata and storage.
Dataproc Metastore
Step-by-Step Setup
- Create a GCP Project (if you don't have one).
- Provision a Dataproc Metastore (Hive) via the GCP Console (a scripted alternative is sketched after this list):
  - Go to the GCP Console → Dataproc → Metastore services.
  - Click "Create Metastore Service".
  - Fill in the service name, location, version, release channel, port (default: 9083), and service tier.
  - Set the endpoint protocol to Thrift.
  - Expose the service to your running network (VPC/subnet).
  - Enable the Data Catalog sync option if desired.
  - Choose the database type and other options as needed.
  - Click Submit. Creation may take 20–30 minutes.
- Expose the Metastore endpoint to the network where OLake will run (ensure network connectivity and firewall rules allow access to the Thrift port).
- Create or choose a GCS bucket for Iceberg data.
- Deploy OLake in the same network (or with access to the Metastore endpoint).
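If you prefer to script the provisioning step rather than use the console, the fields above correspond to a `services.create` call on the Dataproc Metastore REST API (`metastore.googleapis.com`). A hedged sketch of the request body, assuming Hive Metastore version 3.1.2, the Developer tier, and a VPC named `default` (verify the field names against the current API reference):

```json
{
  "network": "projects/MY_PROJECT/global/networks/default",
  "port": 9083,
  "tier": "DEVELOPER",
  "releaseChannel": "STABLE",
  "hiveMetastoreConfig": {
    "version": "3.1.2",
    "endpointProtocol": "THRIFT"
  }
}
```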
GCP-Hive OLake Destination Config
```json
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "hive",
    "hive_uri": "thrift://<METASTORE_IP>:9083",
    "hive_clients": 10,
    "hive_sasl_enabled": false,
    "iceberg_db": "olake_iceberg",
    "iceberg_s3_path": "gs://<hive-dataproc-generated-bucket>/hive-warehouse",
    "aws_region": "us-central1"
  }
}
```
- Replace `<METASTORE_IP>` with your Dataproc Metastore's internal IP or hostname.
- Replace `<hive-dataproc-generated-bucket>` with the bucket from the Metastore's generated `hive.metastore.warehouse.dir` setting.
- Set `aws_region` to your GCP region (e.g., `us-central1`).
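Filled in with hypothetical values (a Metastore reachable at internal IP 10.128.0.5 and a generated bucket named gcs-hive-metastore-bucket), the writer block would read:

```json
{
  "type": "ICEBERG",
  "writer": {
    "catalog_type": "hive",
    "hive_uri": "thrift://10.128.0.5:9083",
    "hive_clients": 10,
    "hive_sasl_enabled": false,
    "iceberg_db": "olake_iceberg",
    "iceberg_s3_path": "gs://gcs-hive-metastore-bucket/hive-warehouse",
    "aws_region": "us-central1"
  }
}
```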
Notes
- The `hive_uri` must use the Thrift protocol and point to your Dataproc Metastore endpoint.
- The `iceberg_s3_path` can use the `gs://` prefix for GCS buckets.
- Ensure OLake has network access to the Metastore and permissions to write to the GCS bucket.
- Data written will be in Iceberg format, queryable via compatible engines (e.g., Spark, Trino) configured with the same Hive Metastore and GCS bucket.
Troubleshooting
- If you encounter connection issues, verify firewall rules and VPC peering between OLake and the Dataproc Metastore.
- Ensure the Dataproc Metastore is in a running state and the Thrift port is open.
If you wish to test out the REST Catalog locally, you can use the docker-compose setup. The local test setup uses MinIO as S3-compatible storage and all the other supported catalog types. You can then set up local Spark to run queries on the Iceberg tables created in the local test setup.