Skip to main content

Generic PostgreSQL CDC Setup Guide

This guide covers setting up Change Data Capture (CDC) for PostgreSQL running on any server infrastructure including EC2 instances, virtual machines, bare metal servers, or on-premises installations. The setup uses logical replication with the wal2json plugin for real-time data streaming.

Prerequisites:

  • PostgreSQL version 9.4+ (version 10+ recommended for full logical replication support)
  • Superuser access or ability to modify postgresql.conf and pg_hba.conf
  • The wal2json plugin installed (or ability to install it)
  • TLS/SSL enabled for secure connections (recommended)
  • Sufficient disk space for WAL retention
Read Replica Support

PostgreSQL 16+: Logical replication on read replicas is supported but may have limitations. Row deletions and vacuum operations on the primary can affect replication slot stability.

PostgreSQL 15 and earlier: Read replicas cannot act as publishers for logical replication. Connect OLake to the primary instance for CDC functionality.

Recommendation: Use the primary instance for production CDC workloads to ensure reliability.

Steps:

1. Install wal2json Plugin

The wal2json plugin is required for logical decoding. Installation varies by system:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install postgresql-15-wal2json # Replace 15 with your version

CentOS/RHEL/Amazon Linux:

sudo yum install postgresql15-wal2json  # Replace 15 with your version
# Or for newer versions:
sudo dnf install postgresql15-wal2json

From Source (if package not available):

git clone https://github.com/eulerto/wal2json.git
cd wal2json
make USE_PGXS=1
sudo make USE_PGXS=1 install

Verify Installation:

SELECT * FROM pg_available_extensions WHERE name = 'wal2json';

2. Configure PostgreSQL Parameters

Edit postgresql.conf (usually in /etc/postgresql/15/main/ or /var/lib/pgsql/15/data/):

# Enable logical replication
wal_level = logical

# Set replication limits
max_replication_slots = 10
max_wal_senders = 10

# Optional: Prevent timeouts during long snapshots
wal_sender_timeout = 0

# Optional: Control WAL retention
max_slot_wal_keep_size = 1GB

# Performance tuning (adjust based on your workload)
checkpoint_timeout = 15min
max_wal_size = 2GB
ParameterRecommended ValuePurpose
wal_levellogicalEnable logical decoding
max_replication_slots≥5Number of concurrent CDC connections
max_wal_senders≥7Should exceed replication slots
wal_sender_timeout0Prevents snapshot timeouts
max_slot_wal_keep_size1GB+Prevents WAL accumulation

3. Configure Client Authentication

Edit pg_hba.conf to allow replication connections. Add entries for your OLake client:

# Allow replication connections from OLake
host replication cdc_user 10.0.0.0/16 md5
host all cdc_user 10.0.0.0/16 md5

# For SSL connections (recommended)
hostssl replication cdc_user 10.0.0.0/16 md5
hostssl all cdc_user 10.0.0.0/16 md5

Replace 10.0.0.0/16 with your actual network range or specific IP addresses.

4. Restart PostgreSQL

Apply configuration changes by restarting PostgreSQL:

# Ubuntu/Debian
sudo systemctl restart postgresql

# CentOS/RHEL
sudo systemctl restart postgresql-15 # Replace with your version

# Or using pg_ctl
sudo -u postgres pg_ctl restart -D /var/lib/pgsql/15/data/

Verify Configuration:

SELECT name, setting FROM pg_settings 
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders');

5. Create Replication User

Connect as a superuser (usually postgres) and create a dedicated CDC user:

-- Create user with replication privileges
CREATE USER cdc_user WITH PASSWORD 'strongpassword';
ALTER ROLE cdc_user WITH REPLICATION;

Grant Database Access:

-- Grant connection to specific database
GRANT CONNECT ON DATABASE mydb TO cdc_user;

Optional: Read-only Access for Initial Snapshot

For stricter security, grant explicit read permissions:

-- Example for the "public" schema in database mydb
GRANT USAGE ON SCHEMA public TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO cdc_user;
-- Ensure future tables are covered automatically
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO cdc_user;

Multiple Schema Access:

-- Repeat for each schema you want to sync
GRANT USAGE ON SCHEMA schema1 TO cdc_user;
GRANT SELECT ON ALL TABLES IN SCHEMA schema1 TO cdc_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA schema1 GRANT SELECT ON TABLES TO cdc_user;

6. Create Logical Replication Slot

Set up the replication slot for OLake to consume changes:

Slot Name Configuration

Use the exact replication slot name that you configure in OLake UI or CLI. The slot name must match the replication_slot parameter in your OLake source configuration.

-- Replace 'your_olake_slot_name' with the slot name from your OLake configuration
SELECT pg_create_logical_replication_slot('your_olake_slot_name', 'wal2json');

Example with common slot names:

-- If your OLake source config uses "olake_slot"
SELECT pg_create_logical_replication_slot('olake_slot', 'wal2json');

-- If your OLake source config uses "postgres_slot"
SELECT pg_create_logical_replication_slot('postgres_slot', 'wal2json');

Verify Slot Creation:

SELECT * FROM pg_replication_slots WHERE slot_name = 'your_olake_slot_name';

7. Test Logical Decoding

Verify that CDC is working correctly:

-- Create test data
CREATE TABLE IF NOT EXISTS test_table (id SERIAL PRIMARY KEY, data TEXT);
INSERT INTO test_table (data) VALUES ('Generic PostgreSQL CDC test');

-- Check for changes (replace 'your_olake_slot_name' with your actual slot name)
SELECT data FROM pg_logical_slot_get_changes(
'your_olake_slot_name', null, null,
'pretty-print', '1'
);

Success: Returns JSON change records from wal2json
Permission denied: Check user has REPLICATION role
Plugin not found: Verify wal2json installation and wal_level setting

For secure connections, configure SSL in postgresql.conf:

# Enable SSL
ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
ssl_ca_file = 'ca.crt' # Optional for client certificate verification

Generate Self-signed Certificate (for testing):

sudo -u postgres openssl req -new -x509 -days 365 -nodes -text \
-out /var/lib/pgsql/15/data/server.crt \
-keyout /var/lib/pgsql/15/data/server.key \
-subj "/CN=your-server-hostname"

sudo -u postgres chmod 600 /var/lib/pgsql/15/data/server.key

9. Firewall Configuration

Ensure PostgreSQL port (default 5432) is accessible:

Ubuntu/Debian (ufw):

sudo ufw allow from 10.0.0.0/16 to any port 5432

CentOS/RHEL (firewalld):

sudo firewall-cmd --permanent --add-rich-rule="rule family='ipv4' source address='10.0.0.0/16' port protocol='tcp' port='5432' accept"
sudo firewall-cmd --reload

AWS EC2 Security Groups:

  • Add inbound rule for port 5432 from OLake's IP addresses
  • Ensure outbound rules allow responses

Now that you have configured the PostgreSQL server and created the CDC user, you can add the PostgreSQL source in OLake to build an ELT pipeline to Iceberg or Parquet. See the PostgreSQL connector overview for a high-level walkthrough.

Troubleshooting

Configuration Issues

  • wal2json Plugin Not Found:

    ERROR: could not access file "wal2json": No such file or directory

    Solution: Install the wal2json package for your PostgreSQL version or compile from source.

  • WAL Level Not Logical:

    ERROR: logical decoding requires wal_level >= logical

    Solution: Set wal_level = logical in postgresql.conf and restart PostgreSQL.

  • Configuration Not Applied: If settings don't take effect:

    • Verify you edited the correct postgresql.conf file: SHOW config_file;
    • Ensure PostgreSQL was restarted after changes
    • Check PostgreSQL logs for configuration errors

Permission Issues

  • Replication Slot Creation Failed:

    ERROR: permission denied to create replication slot

    Solution: Ensure the user has REPLICATION role: ALTER ROLE cdc_user WITH REPLICATION;

  • Table Access Denied:

    ERROR: permission denied for table "some_table"

    Solution: Grant SELECT privileges on the schema and tables the user needs to access.

Connection Issues

  • Connection Refused: Verify:

    • PostgreSQL is running: sudo systemctl status postgresql
    • Firewall allows connections on port 5432
    • pg_hba.conf has appropriate entries for your client IP
    • listen_addresses = '*' in postgresql.conf (or specific IP)
  • SSL/TLS Issues:

    • Ensure certificates have correct permissions (600 for private keys)
    • Verify certificate hostname matches connection string
    • Check client SSL configuration matches server requirements

Performance and Monitoring

  • WAL Accumulation: Monitor disk usage as inactive slots retain WAL files:

    SELECT slot_name, active, restart_lsn, confirmed_flush_lsn 
    FROM pg_replication_slots;

    Solution: Drop unused slots or ensure CDC consumers are actively reading.

  • High CPU/Memory Usage: Tune checkpoint settings:

    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.9
    max_wal_size = 2GB
  • Slow Snapshots: For large tables, consider:

    • Increasing work_mem for the CDC session
    • Using wal_sender_timeout = 0 to prevent timeouts
    • Running snapshots during low-activity periods

Schema and Data Issues

  • Missing Changes for Updates/Deletes: Ensure tables have primary keys or set replica identity:

    ALTER TABLE my_table REPLICA IDENTITY FULL;  -- For tables without PK
  • DDL Changes Not Captured: wal2json captures DDL in WAL, but OLake may need schema refresh for structural changes.

  • Large Transaction Handling: For very large transactions, monitor:

    • WAL disk space usage
    • Replication slot lag
    • Memory usage during decoding

Security Best Practices

  1. Use Dedicated User: Create a specific user for CDC with minimal required privileges
  2. Enable TLS: Always use encrypted connections for production
  3. Network Security: Restrict access using firewalls and security groups
  4. Regular Monitoring: Set up alerts for replication lag and disk usage
  5. Password Security: Use strong passwords and consider certificate-based authentication

This setup provides a robust foundation for CDC from any PostgreSQL server to OLake, enabling real-time data integration with your data lake infrastructure.


Need Assistance?

If you have any questions or uncertainties about setting up OLake, contributing to the project, or troubleshooting any issues, we’re here to help. You can:

  • Email Support: Reach out to our team at hello@olake.io for prompt assistance.
  • Join our Slack Community: where we discuss future roadmaps, discuss bugs, help folks to debug issues they are facing and more.
  • Schedule a Call: If you prefer a one-on-one conversation, schedule a call with our CTO and team.

Your success with OLake is our priority. Don’t hesitate to contact us if you need any help or further clarification!