This post was co-written with Sandeep Koppula, Senior Manager, Software Engineering, and Rahul Gupta, Senior Software Engineer, at Mindbody, and Mukesh Agrawal, Database Specialist, Senior Solutions Architect at AWS.
Mindbody is a leading provider of cloud-based technology for the fitness and wellness industry, offering a platform that helps businesses grow. Its suite of solutions spans client booking, scheduling, integrated payments, marketing, and analytics, all designed to streamline operations and deepen customer engagement. With thousands of businesses relying on Mindbody for comprehensive management and millions of consumers booking experiences through its application, the platform demands scalability, reliability, and continuous innovation.
The recent introduction of Amazon Aurora PostgreSQL Optimized Reads is a significant enhancement to Amazon Aurora PostgreSQL-Compatible Edition. The feature is designed to make read operations more efficient, reducing query latency and increasing throughput. For an in-depth look at the feature and its use cases, see Amazon Aurora Optimized Reads for Aurora PostgreSQL with up to 8x query latency improvement for I/O-intensive applications.
Current architecture
Mindbody’s email marketing platform is built on an Aurora PostgreSQL cluster, tailored to meet their operational needs. The database, currently at version 13.8, spans approximately 17 TB and features partitioned tables containing billions of rows. To accommodate peak workloads, Mindbody employs the largest Aurora instances, with a workload distribution of 80% reads and 20% writes.
Scaling and query performance challenges
Despite the robust capabilities of Aurora, Mindbody’s Marketing Suite application encountered significant scaling and performance challenges, primarily due to architectural limitations and escalating data demands. Although Aurora supports read scaling with up to 15 read replicas, the legacy application stack (Rails) lacks the ability to effectively split read and write operations, resulting in all workloads being directed to the writer node while utilizing readers solely as failover targets. The complexity of dynamic SQL queries, which involve joins across billions of rows in partitioned tables, further exacerbates the difficulties in query optimization.
The absence of an integrated data warehouse necessitates reliance on the online transaction processing (OLTP) database, thereby placing additional strain on the Aurora writer instance. With vertical scaling maxed out at db.r6i.32xlarge, the application experiences slower query execution, extended response times, and limited scalability. Consequently, Mindbody sought expert guidance on scaling their cluster to enhance query performance without resorting to extensive query tuning, application rewrites, or database refactoring involving sharding.
Root cause analysis
Upon analyzing performance data from Amazon CloudWatch and Amazon RDS Performance Insights, alongside cost analysis from AWS Cost Explorer, several critical findings emerged.
First, the average BufferCacheHitRatio consistently remained below 80%, whereas a healthy ratio typically exceeds 95%. This indicates that queries were frequently reading pages from storage rather than being served from the buffer cache. Performance Insights corroborated this, showing IO:DataFileRead as the top wait event; this wait occurs when a connection is waiting on a backend process to read a required page from storage because the page is not available in shared memory. Such reads increase CPU and network utilization on the database instance, resulting in higher query latencies and I/O costs.
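As a rough illustration of how this kind of check can be automated, the following Python sketch pulls the daily average BufferCacheHitRatio for a writer instance with boto3 and flags days below the 95% guideline; the instance identifier and Region are placeholders, not Mindbody's actual tooling:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Daily average buffer cache hit ratio for the last 14 days.
# "marketing-suite-writer" and the Region are placeholder values.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="BufferCacheHitRatio",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "marketing-suite-writer"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=86400,            # one data point per day
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    status = "LOW" if point["Average"] < 95 else "OK"
    print(f"{point['Timestamp']:%Y-%m-%d}  {point['Average']:6.2f}%  {status}")
```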
To reduce the impact of the network I/O latency associated with reading data pages from Aurora storage, Mindbody had provisioned larger instances, db.r6i.32xlarge, so that memory capacity aligned with their working datasets and business SLAs could be met. However, this raised concerns about the cost of over-provisioning instances solely to satisfy memory requirements. Analysis from Cost Explorer showed that the Mindbody Aurora cluster was highly I/O-intensive on reads, with monthly I/O costs averaging approximately 48% of total Aurora spend.
Reason for choosing Aurora Optimized Reads
In light of the root cause analysis, Mindbody opted to adopt Aurora Optimized Reads for three primary reasons:
- It offers an out-of-the-box tiered cache capability that enhances DB instance caching capacity by leveraging local NVMe storage.
- It includes a temporary objects capability, which hosts temporary objects on NVMe storage and improves latency and throughput for queries that sort, join, or merge data volumes exceeding the configured memory limits (see the sketch after this list for a quick way to check how much a workload spills to temporary storage).
- Given the I/O-intensive workloads, the Aurora I/O-Optimized feature facilitates superior price-performance.
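As a quick way to gauge how much a workload spills to temporary storage, and therefore how much the temporary objects tier might help, the following sketch queries PostgreSQL's cumulative statistics with psycopg2; the connection settings are placeholders, not Mindbody's actual configuration:

```python
import psycopg2

# Connection settings are placeholders; point this at the cluster endpoint.
conn = psycopg2.connect(
    host="marketing-suite.cluster-xxxx.us-west-2.rds.amazonaws.com",
    dbname="marketing",
    user="app_ro",
    password="...",
)

with conn, conn.cursor() as cur:
    # pg_stat_database tracks how often queries spilled to temporary files
    # and how many bytes they wrote since the last statistics reset.
    cur.execute(
        """
        SELECT datname,
               temp_files,
               pg_size_pretty(temp_bytes) AS temp_written
        FROM pg_stat_database
        WHERE temp_bytes > 0
        ORDER BY temp_bytes DESC
        """
    )
    for datname, temp_files, temp_written in cur.fetchall():
        print(f"{datname}: {temp_files} temp files, {temp_written} written")
```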
Transitioning to Aurora PostgreSQL Optimized Reads
Transitioning to Aurora Optimized Reads required upgrading the database cluster from version 13.8 to 14.9 or higher. Given the technical complexity and strict high-availability requirements, the Mindbody team performed extensive testing in a proof-of-concept environment to gather key performance metrics for business-critical queries before making changes to the production cluster.
To minimize production disruption, Mindbody employed blue/green deployment and adhered to the recommended upgrade path, initially updating the cluster to the latest minor version before proceeding with the major version upgrade.
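A minimal sketch of how such a blue/green deployment could be created and later switched over with boto3 follows; the cluster ARN, deployment name, and Region are placeholders, and in practice the green environment is validated before the switchover:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")

# Create a green environment running the target major version alongside the
# existing (blue) cluster. The ARN and names below are placeholders.
deployment = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="marketing-suite-14-upgrade",
    Source="arn:aws:rds:us-west-2:123456789012:cluster:marketing-suite",
    TargetEngineVersion="14.9",
)
bg_id = deployment["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# ... validate the green environment, then promote it.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg_id,
    SwitchoverTimeout=300,  # seconds allowed for the switchover to complete
)
```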
Process followed to create a proof-of-concept environment
The creation of a production-sized proof-of-concept environment aimed to achieve two objectives:
- To conduct comprehensive testing that demonstrates the benefits and develop a detailed runbook for replicating the process in the production environment.
- To assess the impact by identifying a representative set of top business-critical queries and running them in the proof-of-concept environment, enabling side-by-side performance comparisons.
The workflow for establishing the proof-of-concept environment is illustrated in the accompanying diagram.
Top business-critical queries were executed concurrently on both the blue (non-optimized reads) and green (optimized reads) Aurora clusters to capture runtime metrics. A comparison of these runtimes is presented in the following table.
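As a rough illustration, capturing comparable runtimes against both clusters could be scripted along the following lines; the endpoints, credentials, and example query are hypothetical stand-ins rather than Mindbody's actual test harness:

```python
import time

import psycopg2

# Hypothetical endpoints for the blue (non-Optimized Reads) and green
# (Optimized Reads) clusters, plus a stand-in business-critical query.
ENDPOINTS = {
    "blue": "blue.cluster-xxxx.us-west-2.rds.amazonaws.com",
    "green": "green.cluster-xxxx.us-west-2.rds.amazonaws.com",
}
QUERIES = {
    "campaign_contacts": "SELECT count(*) FROM contacts WHERE campaign_id = 42",
}

def time_query(host: str, sql: str) -> float:
    """Run one query and return its wall-clock runtime in seconds."""
    with psycopg2.connect(host=host, dbname="marketing", user="bench", password="...") as conn:
        with conn.cursor() as cur:
            start = time.perf_counter()
            cur.execute(sql)
            cur.fetchall()
            return time.perf_counter() - start

for name, sql in QUERIES.items():
    runtimes = {env: time_query(host, sql) for env, host in ENDPOINTS.items()}
    print(f"{name}: blue={runtimes['blue']:.2f}s green={runtimes['green']:.2f}s")
```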
At this point, the Mindbody team had the metrics needed to confidently proceed with adopting Aurora Optimized Reads for the production migration, along with a detailed runbook for replicating the steps in the production environment.
Benefits of switching to Aurora PostgreSQL Optimized Reads
Following the runbook developed during the proof of concept, the Mindbody team first upgraded the minor version from 13.8 to 13.12 and then performed the major version upgrade to 14.9 using an Aurora blue/green deployment. After the switchover to the green environment, the cluster's storage configuration was changed to Aurora I/O-Optimized, an online operation that required no downtime.
At this stage, the Aurora cluster instances were ready to be replaced with Optimized Reads-compatible instances, an operation that requires downtime. To minimize disruption, the team first modified the reader instance to an Optimized Reads-compatible db.r6id.32xlarge instance, leaving the primary writer on a non-Optimized Reads instance type while the reader ran on the new instance class.
On June 21, Mindbody executed a manual failover, swapping the roles of the writer and reader instances so that the cluster's primary (writer) instance was now running on an Optimized Reads-compatible configuration. After validating the benefits, Mindbody modified the new reader (previously the writer) to an Optimized Reads-compatible db.r6id.32xlarge instance as well.
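A condensed boto3 sketch of these cutover steps is shown below; the cluster and instance identifiers are placeholders, and in practice each modification is allowed to complete before the next step begins:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")

# 1. Switch the cluster to the Aurora I/O-Optimized storage configuration
#    (an online change). Identifiers below are placeholders.
rds.modify_db_cluster(
    DBClusterIdentifier="marketing-suite",
    StorageType="aurora-iopt1",
    ApplyImmediately=True,
)

# 2. Move the reader to an Optimized Reads-compatible (NVMe-backed) class first.
rds.modify_db_instance(
    DBInstanceIdentifier="marketing-suite-reader",
    DBInstanceClass="db.r6id.32xlarge",
    ApplyImmediately=True,
)

# 3. Fail over so the NVMe-backed instance becomes the writer, then modify
#    the former writer (now the reader) the same way.
rds.failover_db_cluster(
    DBClusterIdentifier="marketing-suite",
    TargetDBInstanceIdentifier="marketing-suite-reader",
)
rds.modify_db_instance(
    DBInstanceIdentifier="marketing-suite-writer",
    DBInstanceClass="db.r6id.32xlarge",
    ApplyImmediately=True,
)
```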
The following sections describe the key performance improvements and cost savings that the Mindbody team observed after transitioning to Aurora PostgreSQL Optimized Reads.
Key performance improvements
Key metrics collected after the writer instance began operating on the Optimized Reads-compatible Aurora instance revealed significant enhancements in runtime for top modules. This improvement translated to a reduction in overall failures while fetching contacts for automations and campaigns.
Data from CloudWatch showed that average CPUUtilization dropped to 6% with Optimized Reads, compared to 12% without it, a 50% reduction in average daily CPUUtilization.
Additionally, the CloudWatch ReadIOPS metric showed a 90% reduction with Optimized Reads relative to the previous configuration.
Performance Insights further showed a significant drop in the IO:DataFileRead wait event, down to 0.81 with Optimized Reads from 7.79 without it.
The AuroraOptimizedReadsCacheHitRatio CloudWatch metric indicated that, on average, 85% of read requests were served from the Aurora Optimized Reads cache.
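As a sketch, the before-and-after values for these metrics can be pulled in a single CloudWatch GetMetricData call; the instance identifier, Region, and time window below are placeholders:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
dimensions = [{"Name": "DBInstanceIdentifier", "Value": "marketing-suite-writer"}]

# One GetMetricData call covering the metrics discussed above, at daily
# granularity over a placeholder 90-day window spanning the cutover.
queries = [
    {
        "Id": metric_name.lower(),
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/RDS",
                "MetricName": metric_name,
                "Dimensions": dimensions,
            },
            "Period": 86400,
            "Stat": "Average",
        },
    }
    for metric_name in ("CPUUtilization", "ReadIOPS", "AuroraOptimizedReadsCacheHitRatio")
]

result = cloudwatch.get_metric_data(
    MetricDataQueries=queries,
    StartTime=datetime.utcnow() - timedelta(days=90),
    EndTime=datetime.utcnow(),
)

for series in result["MetricDataResults"]:
    print(series["Id"], [round(value, 2) for value in series["Values"]])
```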
Cost benefits
The detailed cost breakdown for the Aurora service, accessible via Cost Explorer, provides insights into the tangible cost savings and enhanced price predictability achieved by Mindbody in their Aurora usage.
Following the transition to Optimized Reads on June 21, Cost Explorer graphs of monthly Aurora costs by usage type over a six-month period (three months before and three months after the transition) show an approximate 23% cost reduction beginning in June. Notably, the Aurora:StorageIOUsage usage type appears from March through June and disappears from July onward, because I/O is no longer billed separately under the I/O-Optimized configuration. The result is a more predictable overall cost structure, despite a substantial increase in InstanceUsageIOOptimized:db.r6id.32xlarge.
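A sketch of how this breakdown can be reproduced with the Cost Explorer API, grouping Amazon RDS costs (which include Aurora) by usage type, is shown below; the date range is a placeholder:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Monthly Amazon RDS (including Aurora) costs grouped by usage type over a
# placeholder six-month window around the cutover.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-03-01", "End": "2024-09-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Relational Database Service"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for month in response["ResultsByTime"]:
    print(month["TimePeriod"]["Start"])
    for group in month["Groups"]:
        usage_type = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        print(f"  {usage_type}: ${cost:,.2f}")
```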
With a 50% reduction in CPUUtilization, Mindbody now has the potential to downsize their Aurora instances from db.r6id.32xlarge to db.r6id.24xlarge. This adjustment could unlock further cost savings, enabling more efficient resource utilization while maintaining performance.
About the Authors
Sandeep Koppula is a Senior Engineering Leader at Mindbody with over 18 years of IT experience, including 7 years in Technical Architecture and 8 years in Technical Management. He specializes in architecting, designing, developing, and deploying complex, high-volume, and scalable enterprise applications across diverse domains such as Fitness & Wellness services, Email Marketing, FinTech, PropTech, LMS/LXP, e-Commerce, Entertainment Industry Payroll, and Credit Bureau.
Rahul Gupta is a Senior Software Engineer at Mindbody with over 8 years of experience in full-stack development and cloud technologies. He has played a pivotal role in building scalable systems, enhancing application performance, and optimizing workflows. Proficient in ReactJS, Redux, Ruby on Rails, PostgreSQL, and AWS, Rahul has been instrumental in enabling Mindbody to deliver over a billion emails and texts annually.
Mukesh Agrawal is a Database Specialist and Senior Solutions Architect at AWS, dedicated to assisting customers in designing scalable, optimized, and innovative database solutions to maximize the value of AWS services.