Implementing real-time change data capture with Debezium for Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL

Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for PostgreSQL support native logical replication, which lets data changes stream out of a source database. Even so, many organizations struggle to capture those changes and propagate them to downstream systems in real time without degrading database performance or introducing lag. Traditional batch-based extract, transform, and load (ETL) pipelines often run minutes or hours behind, which makes it difficult to respond to events promptly. The result is outdated inventory data, delayed notifications, and missed opportunities to act on transactional signals as they arise.

Debezium, an open-source distributed platform for change data capture (CDC), addresses this by monitoring databases and streaming their changes to Kafka topics, where applications and data pipelines can consume them. This supports event-driven architectures and lets businesses keep data current across multiple systems, reduce synchronization delays, and respond quickly to business events.

Solution Overview

The CDC solution discussed here combines the native logical replication capabilities of PostgreSQL with the change capture framework of Debezium. Both Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL support logical replication, so either can serve as the source. This walkthrough focuses on Amazon Aurora PostgreSQL.

The implementation begins by enabling logical replication on Amazon Aurora PostgreSQL through a DB cluster parameter group. The Debezium connector then reads the database’s write-ahead log (WAL) through a logical replication slot and transforms the log entries into structured event streams for downstream consumption.

The Key Components of This Solution Architecture

  • Amazon Aurora PostgreSQL as the source database with logical replication enabled
  • A Debezium PostgreSQL connector running on MSK Connect for managed change capture
  • Amazon MSK for reliable, scalable message streaming
  • An Amazon EC2 instance for testing and consuming change events

Note: Additional downstream integration targets shown in the architecture diagram can be configured based on specific use case requirements but are not part of this core CDC implementation.

Implement the Solution

This solution can be implemented through the AWS Management Console or by utilizing the latest version of the AWS Command Line Interface (AWS CLI).

Create an Amazon Aurora PostgreSQL DB Cluster and Enable Logical Replication

To create an Aurora PostgreSQL DB cluster and enable logical replication, follow these steps:

  1. Create a DB cluster parameter group. Choose a Parameter group family that matches the PostgreSQL major version of your database instance, preferably the latest version available for Aurora PostgreSQL.
  2. Modify the DB cluster parameter group to set the rds.logical_replication parameter to 1.
  3. Associate the DB cluster parameter group with the Aurora PostgreSQL DB cluster, and stop and start the DB cluster to synchronize the parameter group with the database.
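For readers working from the AWS CLI rather than the console, the first three actions above could be sketched roughly as follows. The parameter group name, family, and cluster identifier are hypothetical placeholders, not values from this walkthrough, and the family must match your engine's major version. The stop and start of the cluster is left out of the sketch.

```shell
# Hypothetical names; replace with your own values.
PARAM_GROUP="aurora-pg-cdc-params"   # assumed parameter group name
PARAM_FAMILY="aurora-postgresql16"   # assumed family; must match your engine major version
CLUSTER_ID="my-aurora-cluster"       # assumed DB cluster identifier

enable_logical_replication() {
  # 1. Create the DB cluster parameter group.
  aws rds create-db-cluster-parameter-group \
    --db-cluster-parameter-group-name "$PARAM_GROUP" \
    --db-parameter-group-family "$PARAM_FAMILY" \
    --description "Logical replication for CDC"

  # 2. rds.logical_replication is a static parameter, so it only
  #    takes effect after the cluster is restarted.
  aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name "$PARAM_GROUP" \
    --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"

  # 3. Attach the parameter group to the DB cluster.
  aws rds modify-db-cluster \
    --db-cluster-identifier "$CLUSTER_ID" \
    --db-cluster-parameter-group-name "$PARAM_GROUP"
}
```

Calling `enable_logical_replication` runs the three commands in order. After the restart, running `SHOW wal_level;` in the database should report `logical`.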

Create an Amazon MSK Cluster

Once the database is set up for replication, create an Amazon MSK Serverless cluster:

  1. Sign in to the AWS Management Console and navigate to the Amazon MSK console.
  2. Select Create cluster.
  3. Choose Custom create to specify a virtual private cloud (VPC), subnets, and security groups.
  4. Enter a descriptive name for your cluster.
  5. Select Serverless for Cluster type and proceed.
  6. On the Networking page, select the VPC where the database was created.
  7. Select at least two subnets of the chosen VPC.
  8. Use the same security group attached to the database and continue.
  9. Proceed through the Security and Metrics and tags pages.
  10. Review your selections and choose Create cluster.
  11. Wait for the cluster status to change from Creating to Active.
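The console steps above have a rough AWS CLI equivalent, sketched below under the assumption that you already know the subnet and security group IDs. The function names and argument order are illustrative choices, not part of the original walkthrough.

```shell
create_msk_serverless() {
  # $1 = cluster name, $2 and $3 = subnet IDs, $4 = security group ID
  aws kafka create-cluster-v2 \
    --cluster-name "$1" \
    --serverless "{\"VpcConfigs\":[{\"SubnetIds\":[\"$2\",\"$3\"],\"SecurityGroupIds\":[\"$4\"]}],\"ClientAuthentication\":{\"Sasl\":{\"Iam\":{\"Enabled\":true}}}}"
}

msk_cluster_state() {
  # $1 = cluster ARN; prints the current state, for example CREATING or ACTIVE
  aws kafka describe-cluster-v2 --cluster-arn "$1" \
    --query 'ClusterInfo.State' --output text
}
```

`create_msk_serverless` returns the cluster ARN in its JSON output. Polling `msk_cluster_state` with that ARN until it prints `ACTIVE` corresponds to the final step of the list.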

Set Up the EC2 Instance

Launch an Amazon EC2 instance to act as the Kafka client. On this instance you will install Kafka, download dependencies, authenticate with the Amazon MSK cluster using IAM, configure a Kafka client, and connect to the database. Ensure that the EC2 instance is in the same VPC and security group as the Amazon MSK cluster and the Aurora PostgreSQL database. Additionally, attach a security group that allows SSH access to the instance.

  1. To install Kafka on Amazon Linux or Red Hat Enterprise Linux, execute the following commands:

    # Install dependencies
    sudo yum install -y java-17-amazon-corretto

    # Download Apache Kafka binary distribution
    wget https://archive.apache.org/dist/kafka/4.0.0/kafka_2.13-4.0.0.tgz

    # Extract the archive in the home directory
    tar -xzf kafka_2.13-4.0.0.tgz
  2. Follow the instructions in the Amazon MSK Developer Guide to configure clients for IAM access control. Download the Amazon MSK Library for IAM (v2.3.2 in this example) into the libs directory of the extracted Kafka distribution:

    wget https://github.com/aws/aws-msk-iam-auth/releases/download/v2.3.2/aws-msk-iam-auth-2.3.2-all.jar -P ~/kafka_2.13-4.0.0/libs/
  3. Create a client.properties file in the ~/kafka_2.13-4.0.0/config/ directory to configure a Kafka client for IAM authentication.
  4. Source the environment variables to add Kafka binaries to the PATH and the Amazon MSK Library for IAM to the CLASSPATH.
  5. Install the PostgreSQL client and related dependencies on your Amazon EC2 instance:

    sudo yum install postgresql17 -y
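The client.properties file and the environment variables mentioned in the steps above are not spelled out, so here is a minimal sketch of what they might look like. The four properties are the standard client settings for the Amazon MSK Library for IAM. The `KAFKA_HOME` and `MSK_IAM_JAR` variable names are assumptions introduced for this example; point `MSK_IAM_JAR` at wherever you actually downloaded the JAR.

```shell
# Assumed locations; adjust to match where Kafka and the IAM JAR live.
KAFKA_HOME="${KAFKA_HOME:-$HOME/kafka_2.13-4.0.0}"
MSK_IAM_JAR="${MSK_IAM_JAR:-$KAFKA_HOME/libs/aws-msk-iam-auth-2.3.2-all.jar}"

mkdir -p "$KAFKA_HOME/config"

# Kafka client settings for IAM authentication against Amazon MSK.
cat > "$KAFKA_HOME/config/client.properties" <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
EOF

# Make the Kafka CLI tools and the IAM library visible to the shell.
export PATH="$KAFKA_HOME/bin:$PATH"
export CLASSPATH="$MSK_IAM_JAR"
```

Placing the two `export` lines in a file such as `~/.bashrc` and sourcing it keeps the settings available across SSH sessions.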

Create a Custom Plugin

Next, create a custom plugin for MSK Connect. MSK Connect installs the plugin on its workers, which allows the connector to replicate changes from the PostgreSQL database. Download the Debezium PostgreSQL connector plugin archive and repackage it in ZIP format for compatibility with MSK Connect.

  1. Create a directory for Debezium plugins:

    mkdir -p ~/opt/debezium
  2. Change the directory:

    cd ~/opt/debezium
  3. Download the Debezium connector:

    wget https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/3.1.0.Final/debezium-connector-postgres-3.1.0.Final-plugin.tar.gz
  4. Extract the downloaded file:

    tar -xzvf debezium-connector-postgres-3.1.0.Final-plugin.tar.gz
  5. Zip the plugin files for upload to Amazon S3.

Upload the custom plugin in ZIP format to an Amazon S3 bucket in the same AWS Region as your MSK Connect setup. This enables MSK Connect to distribute the connector code across workers for CDC capabilities.
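The zip and upload steps could look something like the sketch below. It assumes the archive extracted into a `debezium-connector-postgres/` directory and that the S3 bucket already exists. The ZIP file name, bucket name, and function names are hypothetical.

```shell
PLUGIN_DIR="${PLUGIN_DIR:-$HOME/opt/debezium}"              # where the archive was extracted
PLUGIN_ZIP="debezium-connector-postgres-3.1.0.Final.zip"    # assumed ZIP file name

package_and_upload_plugin() {
  # $1 = S3 bucket name (hypothetical; must be in the same Region as MSK Connect)
  (
    cd "$PLUGIN_DIR" || exit 1
    # Assumes the tarball extracted into debezium-connector-postgres/
    zip -r "$PLUGIN_ZIP" debezium-connector-postgres/
    aws s3 cp "$PLUGIN_ZIP" "s3://$1/$PLUGIN_ZIP"
  )
}

register_plugin() {
  # $1 = custom plugin name, $2 = S3 bucket name
  aws kafkaconnect create-custom-plugin \
    --name "$1" \
    --content-type ZIP \
    --location "{\"s3Location\":{\"bucketArn\":\"arn:aws:s3:::$2\",\"fileKey\":\"$PLUGIN_ZIP\"}}"
}
```

For example, `package_and_upload_plugin my-plugin-bucket` followed by `register_plugin debezium-postgres my-plugin-bucket` would leave a custom plugin ready to select when creating the connector.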

Store the necessary credentials in AWS Secrets Manager and create an IAM role with specific permissions for MSK Connect to access these secrets and interact with Amazon MSK clusters.
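As an illustration only, the sketch below stores database credentials as a secret and writes a minimal Debezium PostgreSQL connector configuration. The property names are standard Debezium settings, but every value (topic prefix, slot name, publication name) is a hypothetical placeholder. How the secret is injected into the connector, and the IAM role itself, are not shown here, so the user and password lines are left as placeholders to fill in.

```shell
store_db_secret() {
  # $1 = secret name, $2 = database user, $3 = database password
  aws secretsmanager create-secret \
    --name "$1" \
    --secret-string "{\"username\":\"$2\",\"password\":\"$3\"}"
}

write_connector_config() {
  # $1 = output file, $2 = Aurora writer endpoint, $3 = database name
  # All values below are illustrative assumptions, not taken from the source.
  cat > "$1" <<EOF
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
database.hostname=$2
database.port=5432
database.dbname=$3
database.user=<db-user>
database.password=<db-password>
topic.prefix=cdc
plugin.name=pgoutput
slot.name=debezium_slot
publication.name=debezium_pub
EOF
}
```

The generated file can be pasted into the connector configuration field when creating the connector in the MSK Connect console. `plugin.name=pgoutput` selects PostgreSQL's built-in logical decoding output plugin, so nothing extra needs to be installed on the database.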

Test the Solution

Connect to the Amazon EC2 instance and create the BOOTSTRAP_SERVERS environment variable to store the bootstrap servers of your Amazon MSK cluster. After retrieving the bootstrap server endpoints, run commands to test the connection and verify replication.
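One possible way to populate BOOTSTRAP_SERVERS and check connectivity is sketched here. It assumes the Kafka tools are on the PATH and that a client.properties file for IAM authentication exists at the path shown. The function names and the `KAFKA_HOME` and `CLIENT_CONFIG` variables are made up for the example.

```shell
KAFKA_HOME="${KAFKA_HOME:-$HOME/kafka_2.13-4.0.0}"
CLIENT_CONFIG="$KAFKA_HOME/config/client.properties"

set_bootstrap_servers() {
  # $1 = MSK cluster ARN
  BOOTSTRAP_SERVERS="$(aws kafka get-bootstrap-brokers --cluster-arn "$1" \
    --query 'BootstrapBrokerStringSaslIam' --output text)"
  export BOOTSTRAP_SERVERS
}

list_topics() {
  # Lists the topics visible to this client; Debezium topics should appear here.
  kafka-topics.sh --bootstrap-server "$BOOTSTRAP_SERVERS" \
    --command-config "$CLIENT_CONFIG" --list
}

consume_topic() {
  # $1 = topic name; reads the topic from the beginning until interrupted
  kafka-console-consumer.sh --bootstrap-server "$BOOTSTRAP_SERVERS" \
    --consumer.config "$CLIENT_CONFIG" --topic "$1" --from-beginning
}
```

After `set_bootstrap_servers <cluster-arn>`, running `list_topics` confirms that IAM authentication works, and `consume_topic` on a Debezium topic shows the initial snapshot events.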

Test Real-Time Changes

After confirming the initial data replication, insert new records into the PostgreSQL database. These changes should automatically stream through Debezium to your Amazon MSK topics, demonstrating the live CDC functionality.
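A quick way to generate a change is to insert a row with psql while a console consumer is running in another terminal. The `public.orders` table and its columns are hypothetical, as is the `cdc` topic prefix used in the usage note. By default, Debezium names each topic `<topic.prefix>.<schema>.<table>`.

```shell
insert_test_row() {
  # $1 = database host, $2 = user, $3 = database name
  # public.orders is a hypothetical table used only for illustration.
  psql -h "$1" -U "$2" -d "$3" \
    -c "INSERT INTO public.orders (customer, amount) VALUES ('test-customer', 42.50);"
}

cdc_topic_name() {
  # $1 = topic prefix, $2 = schema, $3 = table
  # Debezium's default topic naming: <prefix>.<schema>.<table>
  echo "$1.$2.$3"
}
```

With a connector whose topic prefix is `cdc`, `cdc_topic_name cdc public orders` prints `cdc.public.orders`, which is the topic where the new row's change event should appear shortly after the insert commits.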

Monitoring and Troubleshooting

Monitor your CDC pipeline using key metrics available through CloudWatch Logs and the MSK Connect console. Common issues such as replication slot lag, connector failures, schema evolution, and network connectivity can be addressed with specific resolution steps outlined in the documentation.
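For replication slot lag and connector failures specifically, two spot checks along the following lines may help. This is a sketch rather than a complete monitoring setup, and the function names are illustrative.

```shell
replication_slot_status() {
  # $1 = database host, $2 = user, $3 = database name
  # Shows each slot, whether a consumer is attached, and how much WAL
  # the database is retaining because the slot has not confirmed it yet.
  psql -h "$1" -U "$2" -d "$3" -c "
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained_wal
    FROM pg_replication_slots;"
}

connector_state() {
  # $1 = MSK Connect connector ARN; prints a state such as RUNNING or FAILED
  aws kafkaconnect describe-connector --connector-arn "$1" \
    --query 'connectorState' --output text
}
```

A slot that shows `active = f` with steadily growing retained WAL usually means the connector has stopped consuming. Left unattended, the retained WAL continues to use storage on the source database.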

Clean Up

To avoid incurring future charges, delete the resources created during this implementation in the specified order, ensuring that all necessary data is backed up beforehand.
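The MSK-side teardown might be scripted as in the sketch below. The order shown (connector, then custom plugin, then cluster) is an assumption based on dependencies between the resources, not an order given in this walkthrough. Deleting the Aurora cluster, the EC2 instance, the S3 bucket, and the secret is deliberately left as a manual step.

```shell
cleanup_cdc_resources() {
  # $1 = connector ARN, $2 = custom plugin ARN, $3 = MSK cluster ARN
  # Assumed order: remove the connector before the plugin it uses,
  # and both before the cluster they depend on.
  aws kafkaconnect delete-connector --connector-arn "$1"
  aws kafkaconnect delete-custom-plugin --custom-plugin-arn "$2"
  aws kafka delete-cluster --cluster-arn "$3"
}
```

Connector deletion is asynchronous, so the plugin deletion may need to be retried once the connector has finished deleting. If the database is kept, also drop the replication slot that Debezium created so that it stops retaining WAL.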
