AI-powered tuning tools for Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL databases: PI Reporter

October 31, 2025

AWS provides a suite of services for collecting and analyzing database performance metrics for the Amazon Relational Database Service (Amazon RDS), including Amazon CloudWatch and CloudWatch Database Insights. You can also build custom dashboards to monitor Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition using various tools. For further insights, refer to resources such as Create an Amazon CloudWatch dashboard and Monitor performance using PGSnapper.

With the rise of artificial intelligence and machine learning (AI/ML), numerous solutions have emerged that apply ML to database monitoring. These tools offer features ranging from identifying performance bottlenecks to addressing operational issues, along with prescriptive recommendations for both proactive and reactive resolution.

Considerations for monitoring

In this section, we review the metrics that should be monitored regularly for every database, so that historical data is available for comparison and benchmarking against workload changes, database parameter changes, and other performance-affecting factors. The following are the recommended metrics to monitor for your AWS-managed databases, grouped by source.

Amazon CloudWatch:
  • DatabaseConnections
  • CPUUtilization
  • FreeableMemory
  • FreeStorageSpace
  • ReadLatency, WriteLatency
  • DiskQueueDepth
  • ReadIOPS, WriteIOPS
  • WriteThroughput, ReadThroughput
  • ReplicaLag
  • OldestReplicationSlotLag
  • ReplicationSlotDiskUsage
  • MaximumUsedTransactionIDs

Amazon CloudWatch Database Insights:
  • DatabaseLoad
  • IO latency
  • EBS IO
  • LongestIdleInTransaction
  • CPU utilization (%)
  • Sessions
  • Tuples
  • Transactions, transactions in progress
  • IO cache vs. disk reads
  • Deadlocks
  • OS processes

pg_stat_progress_vacuum:
  • Vacuum progress

pg_stat_activity:
  • State of the query (for example, active or idle in transaction)

For comprehensive lists, consult the following resources: OS metrics in Enhanced Monitoring, SQL statistics for RDS PostgreSQL, CloudWatch Database Insights counters for Amazon RDS for PostgreSQL, Vacuum Progress Reporting, and pg_stat_activity.
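The CloudWatch metrics listed above can also be pulled programmatically. As an illustration, the following sketch builds the parameters for a GetMetricStatistics call; a boto3 CloudWatch client would consume this dict, and the instance name and time window here are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Sketch: build the parameters for a CloudWatch GetMetricStatistics call for
# one of the RDS metrics listed above. A boto3 CloudWatch client would consume
# this dict as cloudwatch.get_metric_statistics(**params); the instance name
# and time window are hypothetical.
def rds_metric_params(metric_name, instance_id, minutes=15, period=60):
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }

params = rds_metric_params("FreeableMemory", "myinstance")
```

The same request shape maps directly onto the aws cloudwatch get-metric-statistics CLI command if you prefer the shell.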

We now turn our attention to an AI/ML-driven database monitoring and troubleshooting tool: PI Reporter, exploring its capabilities and use cases.

PI Reporter

PI Reporter is an open-source tool crafted by an AWS solutions architect, designed to capture performance metrics and workload snapshots, generate detailed comparison reports for Amazon Aurora PostgreSQL-Compatible Edition, and offer optional report analysis through Amazon Bedrock.

PI Reporter integrates with Amazon Bedrock, using large language models (LLMs) such as Anthropic’s Claude or Amazon Nova models to analyze individual snapshots and comparative data. Users must ensure they have access to the required models. The analysis yields comprehensive summaries, root cause analyses, and actionable recommendations for database performance issues identified during the snapshot period.

With PI Reporter, users can:

  • Obtain a detailed HTML report on instance-related information within minutes
  • Compare periodic reports to identify performance, workload, or configuration changes
  • Evaluate whether the instance can accommodate the workload and determine right-sizing needs
  • Share instance statistics with third parties while maintaining system security
  • Receive LLM analysis, including root cause identification and recommendations

This tool proves beneficial in various scenarios. For instance, if a sudden decline in database performance occurs, PI Reporter can capture snapshots from both the affected and a comparable normal activity period. The resulting HTML comparison report quickly highlights changes and potential root causes.

Another significant use case involves assessing workload changes following planned database modifications. This includes scenarios such as launching new major applications, upgrading to significant database versions, migrating to new clusters via blue-green deployment, or implementing other substantial changes. In these instances, snapshots can be taken before and after the modifications to generate comparison reports.

For monitoring Amazon Aurora PostgreSQL Serverless instances, PI Reporter aids in identifying reasons for unexpected high Aurora Capacity Units (ACU) usage by comparing expected and unexpected ACU utilization patterns.

These examples illustrate just a few ways the tool can be utilized; its applications extend to any scenario requiring performance comparison and analysis.

Designed for simplicity and efficiency, PI Reporter can be deployed as a Node.js script on Amazon Elastic Compute Cloud (Amazon EC2) or on-premises. A portable version of the script is also available, compiled for Linux x86 systems. For detailed setup instructions, refer to the GitHub repository.

Solution overview

The architecture of PI Reporter, utilizing AWS services, is illustrated in the following diagram.

The solution operates across four layers:

  1. Data collection layer:
    • Performance metrics and workload statistics are collected from Amazon CloudWatch and CloudWatch Database Insights, along with instance and cluster configuration details.
  2. Processing layer:
    • The PI Reporter tool aggregates data from all sources.
    • An instance role with appropriate permissions is required, configured via the pireporterPolicy.json IAM policy.
    • The tool processes the collected data and generates a JSON snapshot file containing consolidated metrics.
  3. Analysis layer:
    • The solution integrates with Amazon Bedrock for enhanced analysis.
    • Performance metrics, resource utilization data, and workload information (including SQL statistics) are sent to Amazon Bedrock, enriched with relevant knowledge.
    • Amazon Bedrock’s LLM capabilities yield:
      • Comprehensive summaries
      • In-depth analysis
      • Actionable recommendations
  4. Output:
    • The final output is an HTML report that includes both raw metrics and LLM-powered insights.

This architecture supports thorough performance monitoring and analysis while upholding security through proper IAM roles and permissions.

Prerequisites

To use PI Reporter, you must enable Amazon CloudWatch Database Insights, because the tool relies on its data.

The following PostgreSQL-specific requirements are common for tuning and troubleshooting tools:

  • Enable the pg_stat_statements extension (enabled by default in Aurora PostgreSQL) to collect per-query statistics.
  • By default, PostgreSQL truncates the tracked query text to 1,024 bytes. To capture longer statements, increase the track_activity_query_size parameter in the DB parameter group associated with your DB instance. A database restart is required for this change to take effect.
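Why this matters for per-query analysis can be illustrated with a small sketch that emulates the truncation: once the captured text reaches the limit, the tail of the statement is lost to any downstream tool. The helper names and the sample query below are illustrative, not PostgreSQL or PI Reporter internals.

```python
# Illustrative only: emulate how PostgreSQL truncates the query text it tracks
# (for example, in pg_stat_activity) to track_activity_query_size bytes,
# 1,024 by default. Captured text that reaches the limit was likely cut off.
DEFAULT_TRACK_ACTIVITY_QUERY_SIZE = 1024

def captured_query_text(query, limit=DEFAULT_TRACK_ACTIVITY_QUERY_SIZE):
    return query.encode()[:limit].decode(errors="ignore")

def looks_truncated(captured, limit=DEFAULT_TRACK_ACTIVITY_QUERY_SIZE):
    return len(captured.encode()) >= limit

long_query = "INSERT INTO employee (name) VALUES " + ", ".join(
    f"('employee_{i}')" for i in range(200)
)
snippet = captured_query_text(long_query)
```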

For further information, please refer to the GitHub repository.

Install and run PI Reporter

PI Reporter is designed for ease of use. Download it from the GitHub repository onto an Amazon EC2 instance running Linux, or onto an on-premises Linux host with access to the AWS Region where the Aurora PostgreSQL cluster operates.

To install PI Reporter, follow these steps:

  1. Clone the repository to your local file system:
    git clone https://github.com/awslabs/pireporter.git

    After cloning, locate the pireporterPolicy.json AWS Identity and Access Management (IAM) policy file within the pireporter directory. This policy encompasses the permissions necessary to operate PI Reporter, which are read-only and limited to essential requirements. The policy stipulates that only instances and clusters tagged with pireporter:allow can be accessed. Modifications to the policy file can be made to adjust these restrictions.

  2. Attach the pireporterPolicy to the instance role of the EC2 instance designated to run the tool. Ensure the latest version of the AWS CLI is installed on the EC2 instance for programmatic connection to the RDS instance. The AWS Region automatically defaults to the Region of the hosting EC2 instance (based on instance metadata), but can be overridden by setting the AWS_REGION environment variable. If operating on an on-premises Linux host instead, place your access key and secret key in the shared credentials file ~/.aws/credentials; the AWS SDK used by PI Reporter reads this file automatically, and the policy must be attached to the IAM entity associated with the access key.
  3. Run the tool using one of the following options:
    1. To execute the PI Reporter tool using Node.js, ensure Node.js is installed on the host from which you wish to generate the report:
      cd pireporter
      npm install
      node pireporter.js --help
    2. For a portable version (if Node.js installation is not preferred), use the following command:
      cd pireporter/portable
      ./pireporter --help
  4. To generate a snapshot for a specific period, use the following command:
    ./pireporter --create-snapshot --rds-instance myinstance --start-time 2025-01-20T10:00 --end-time 2025-01-20T10:15 --comment "Unusually slow inserts"

    In this example, the create-snapshot command captures data for a 15-minute interval during which suspicious activity or unusual behavior was observed. If no data is available, the tool exits with the message "No performance data available from Performance Insights for the selected time frame. Please choose a time frame with application load or user activity." Snapshot creation takes anywhere from a few seconds to a minute, depending on the specified time window. Restrict the snapshot boundaries to the period of interest to minimize dilution of metric averages. The command generates a JSON snapshot file in the snapshots subfolder, and the --comment argument associates a comment with the generated snapshot, which can influence the LLM’s reasoning.

  5. To create an HTML report with generative AI analysis and recommendations for the captured snapshot, use the following command:
    ./pireporter --create-report --snapshot snapshot_myinstance_202501201000_202501201015.json --ai-analyzes

    The --ai-analyzes argument incorporates LLM analysis from Amazon Bedrock into the HTML report, which is saved in the reports subfolder.

To check which LLM (Region and model ID) is utilized by the tool, refer to the conf.json file. LLMs in Amazon Bedrock that support the Converse API can be employed.
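The Converse API gives every supported model the same request shape, which is why changing the model ID in conf.json is enough to switch LLMs. As an illustration, the sketch below builds such a request; the model ID and prompt are placeholders (not values taken from conf.json), and a boto3 bedrock-runtime client would accept these arguments as client.converse(**request).

```python
# Sketch of a request for the Amazon Bedrock Converse API. The model ID and
# prompt are placeholders; a boto3 bedrock-runtime client would accept these
# arguments as client.converse(**request).
def build_converse_request(model_id, prompt, max_tokens=2048):
    return {
        "modelId": model_id,
        "messages": [
            {"role": "user", "content": [{"text": prompt}]},
        ],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": 0.2},
    }

request = build_converse_request(
    "anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    "Summarize this Performance Insights snapshot and suggest likely root causes.",
)
```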

Considerations and recommendations

PI Reporter is designed to identify changes in instance behavior, thereby minimizing the problem detection phase. It is recommended to generate two snapshots:

  • One for the problematic period when the instance exhibits unusual behavior
  • One for a similar period when the instance functioned normally

Use --create-compare-report to generate a comparative HTML report that helps you review metrics and SQL statements that changed significantly.

The generative AI analysis for the comparative report will be more insightful with data from both periods. Both periods must meet the following criteria:

  • Both periods must be of identical length
  • The problematic snapshot must start when the issue began
  • The problematic snapshot must end when the issue concluded or, if the problem persists, a reasonable time (such as 60 minutes) after the start time
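These criteria can be checked before a comparison report is generated. The helper below is a hypothetical pre-flight check, not part of PI Reporter; it only reuses the timestamp format the tool's --start-time/--end-time arguments accept.

```python
from datetime import datetime

# Hypothetical pre-flight check (not part of PI Reporter): verify that two
# snapshot periods satisfy the comparison criteria above. Timestamps use the
# same format as the tool's --start-time/--end-time arguments.
FMT = "%Y-%m-%dT%H:%M"

def periods_comparable(normal, problem, max_problem_minutes=60):
    n_len = datetime.strptime(normal[1], FMT) - datetime.strptime(normal[0], FMT)
    p_len = datetime.strptime(problem[1], FMT) - datetime.strptime(problem[0], FMT)
    same_length = n_len == p_len
    capped = p_len.total_seconds() / 60 <= max_problem_minutes
    return same_length and capped

ok = periods_comparable(
    ("2025-01-19T10:00", "2025-01-19T10:15"),  # normal period
    ("2025-01-20T10:00", "2025-01-20T10:15"),  # problematic period
)
```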

Additionally, providing meaningful comments to the snapshots is beneficial. Comments can guide the LLM, directing it to specific areas or observations made by the user.

Generative AI may occasionally produce inaccurate assumptions. Efforts have been made to mitigate this by supplying additional context to the LLM, including valuable database engine-specific knowledge. It is advisable to use generative AI analysis in conjunction with a database specialist for evaluation.

Interpretation of the report

The section of the report generated by the LLM will appear in a light blue box at the top of the HTML report. This section begins with a general summary, outlining its main findings and root causes of any identified problems.

Following the general summary, there is a breakdown for each report section, covering general instance configuration, non-default parameters, wait events, OS metrics, DB metrics, additional metrics (such as overall network throughput of the DB instance, calculated from other statistics), and SQL sections. The report may also include analysis of database log files for the snapshot period if the --include-logfiles argument was specified during snapshot creation.

The summary section for the generative AI analysis includes observations regarding high resource usage, such as CPU, memory, and network throughput. It also identifies the workload type responsible for the high load (insert statements and other write activities) and presents two SQL statements: an insert and an autovacuum activity on the public.employee table. Selecting the SQL reference ID allows for viewing the full SQL text, along with the root cause of the performance issues.

The recommendations section outlines steps generated by the LLM to address the root cause of the identified problems. In our example, the LLM suggests scaling the instance up to a recommended instance type capable of handling the observed workload, alongside reviewing the workload and adjusting specific autovacuum parameters to mitigate its impact on the system.

Reactive use case: Bulk data insert

This section examines a use case involving a bulk data load on the database, monitoring the utilization of various resources such as CPU, disk, IOPS, and network. Elevated resource utilization may trigger alerts, allowing for proactive management of resource upgrades.

Prerequisites for bulk data insert

To explore this use case, a table for bulk-inserting data is required. For example, consider the following code:

BEGIN;

CREATE TABLE employee(
  emp_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name TEXT NOT NULL
);

COMMIT;

Bulk data insert use case

To perform a bulk insert, use the following code:

BEGIN;
INSERT INTO employee (name)
SELECT substr(md5(random()::text), 1, 12)
FROM generate_series(1, 100000000) AS _g;
ANALYZE;
COMMIT;

This test case executes the preceding INSERT statement with a generate_series count of 100000000, loading 100 million rows into the employee table and producing several gigabytes of data.
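A back-of-envelope calculation gives a feel for the data volume this insert produces. The per-row figure below is an assumption approximating PostgreSQL's heap tuple layout, not a measured value.

```python
# Rough estimate of the heap size the bulk insert above produces. PostgreSQL
# stores roughly: a ~24-byte tuple header, a 4-byte line pointer, a 4-byte
# integer, and a short text value; with alignment, a row holding a
# 12-character name lands near 50 bytes on disk. The primary key index and
# WAL are extra. These constants are approximations, not measured values.
ROWS = 100_000_000
APPROX_BYTES_PER_ROW = 50

approx_heap_gb = ROWS * APPROX_BYTES_PER_ROW / 1024**3
```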

To create an AI/ML-based report using PI Reporter, run the following command:

./pireporter --create-snapshot --rds-instance myinstance --start-time 2024-10-15T09:00 --end-time 2024-10-16T08:00 --comment "Bulk inserts"
./pireporter --create-report --snapshot snapshot_myinstance_202410150900_202410160800.json --ai-analyzes

Here, the --start-time and --end-time parameters should correspond to the timestamps before and after the execution of the insert statement. Once the report is generated, open the HTML report file to review its contents. The report will provide recommendations, starting with details about the report’s start and end times and the total duration for which it was generated. It will also highlight the primary cause of elevated resource utilization, such as the executed bulk insert statement, and emphasize the necessity for regular vacuum and checkpoint tuning.

The recommendations section will reiterate this information but with more detailed point-by-point insights.

The initial part of the recommendations covers general instance information, including high availability practices, backup, and monitoring configurations. The subsequent section presents static metrics regarding memory and CPU usage for the report period. The final section addresses any non-default parameters present.

The AI-generated analysis provides insights into resource utilization and IO events, discussing the current instance size, its utilization, and suitable types for the instance. In this case, resources were underutilized, leading to a recommendation for selecting the right instance type for cost optimization. When generating these recommendations, ensure that a sufficiently long period is considered, and validate them with database experts before implementation.

The recommendations regarding database metrics are particularly valuable, as they include suggestions for bulk insert transaction activity, checkpoint tuning, and regular vacuuming.

The final generative AI analysis offers insights into instance underutilization. Before considering downsizing, evaluate overall instance performance, factoring in CPU, network bandwidth, caching efficiency, disk IO, checkpointing, and autovacuum. These insights are vital for optimizing database performance, with combined recommendations covering query optimization, instance upgrades, deletion protection, backup retention, and the addition of a reader instance for high availability.

Idle in transaction use case

The PI Reporter tool operates on data collected through snapshots; statistics on idle-in-transaction sessions captured in a snapshot are reflected in the final report. It is advisable to set the idle_in_transaction_session_timeout parameter to 5 minutes (300 seconds) to automatically terminate idle transactions and prevent resource contention.
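The kind of detection described here can be sketched as a scan over session rows shaped like pg_stat_activity output, flagging sessions that have sat idle in transaction past a threshold. The dict layout and sample data are hypothetical, not PI Reporter's internal snapshot format.

```python
from datetime import datetime, timedelta, timezone

# Illustrative scan over rows shaped like pg_stat_activity output: flag
# sessions that have been idle in transaction longer than a threshold. The
# dict layout is hypothetical, not PI Reporter's internal snapshot format.
def idle_in_transaction_offenders(sessions, now, threshold=timedelta(minutes=5)):
    return [
        s["pid"]
        for s in sessions
        if s["state"] == "idle in transaction"
        and now - s["state_change"] > threshold
    ]

now = datetime(2025, 1, 20, 10, 15, tzinfo=timezone.utc)
sessions = [
    {"pid": 101, "state": "active",
     "state_change": now - timedelta(minutes=30)},
    {"pid": 102, "state": "idle in transaction",
     "state_change": now - timedelta(minutes=12)},
    {"pid": 103, "state": "idle in transaction",
     "state_change": now - timedelta(minutes=2)},
]
offenders = idle_in_transaction_offenders(sessions, now)
```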

In the report, PI Reporter effectively identifies a session that was idle in transaction, recommending an appropriate value for the related parameter. It also suggests reviewing application code to identify any transaction blocks lacking a commit or rollback. Additionally, it is prudent to establish monitoring alerts for transactions that remain idle for over 5 minutes, allowing for timely intervention to terminate such transactions or rectify the underlying issues before they consume significant database resources. Amazon CloudWatch Database Insights provides metrics for maximum idle-in-transaction sessions under the Database Telemetry -> Metrics option.

The database metrics section offers an estimate of the duration that the session has been idle in transaction. The final analysis section summarizes overall recommendations for enhancing performance and optimizing costs.

Clean up

Amazon Bedrock usage incurs costs for each model invocation. Each time a report that includes generative AI analysis is generated, PI Reporter displays the number of input and output tokens used during that invocation. If you created a test environment to follow along with this post, clean up the associated resources and scripts once they are no longer needed.

Feature summary

The following table provides a high-level feature summary for the PI Reporter tool.

Parameter/Feature PI Reporter
Cloud-agnostic No
On-premises database No
Configuration recommendation Yes
Database health Yes
Index recommendation Yes
Inefficient SQLs Yes
Autovacuum Yes
Performance charts No
Agent type No
Production-ready Yes
Cost Apache-2.0 license; costs associated with infrastructure and Amazon Bedrock

*Factors such as deployment options, the size of your database fleet, and multi-year contracts significantly influence the final cost.
