How Firmex used AWS SCT and AWS DMS to move 65,000 on-premises Microsoft SQL Server databases to an Amazon Aurora PostgreSQL cluster

This post is co-authored with Eric Boyer and Maria Hristova of Firmex.

Firmex stands as a prominent provider of Virtual Data Rooms, facilitating the opening of over 20,000 new rooms annually. Customers rely on Firmex to upload and share thousands of documents, ensuring stringent access controls and permissions. The company has established a solid reputation for security and availability, crucial for clients navigating due diligence processes.

In utilizing Firmex’s SaaS service, clients prioritize the isolation and protection of their data from other users. To fulfill this need, Firmex implements a rigorous data isolation strategy, confining customer data within individual databases. Each client is allocated a tenant database, along with additional databases for various projects. The application architecture is heavily reliant on over a hundred stored procedures for data access and modification.

The data architecture employed by Firmex presents numerous advantages for Virtual Data Rooms, including straightforward data partitioning, security isolation, access control, and rebalancing. Firmex previously operated approximately 65,000 databases across four Microsoft SQL Servers within an on-premises data center, adhering to this architecture. The multitude of small databases allowed engineers to focus on horizontal scaling rather than investing in larger machines. However, challenges arose from managing such a vast number of databases, including nearing the maximum database limit of 32,000 per server, handling backups, and implementing changes across the databases. Additionally, the licensing terms for this configuration were not favorable, complicating upgrades.

Challenges with the Current Architecture

The total number of databases managed by Firmex was a critical metric due to SQL Server’s hard limits. In 2020, engineers recognized that they were approaching the database cap, with less than a year remaining before reaching the limit at their current creation rate. Cloud providers impose restrictions on the number of databases that can be created on any instance, presenting a significant barrier to adopting cloud solutions for data storage. Prioritizing a solution that aligned with their data model in the cloud became essential.

To tackle the burgeoning number of databases, Firmex shifted from a database-based isolation model to a schema-based isolation approach. This transition aimed to maintain tenant separation while operating within database limit constraints. A comparative analysis of SQL Server schema isolation versus PostgreSQL schema isolation revealed that PostgreSQL’s search_path feature offered enhanced control over object resolution. This feature allowed the specification of a customer data schema, a shared code schema for functions, and a common schema, a level of control that SQL Server could not match without extensive code deployment across individual schemas.

With the potential for schema isolation evaluated, Firmex’s engineers collaborated with AWS Specialist Solutions Architects to explore additional features, costs, reliability, and performance. Aurora PostgreSQL-Compatible Edition presented an appealing price-performance ratio, enabling Firmex to leverage AWS Graviton processors for increased memory allocation, enhancing application performance. The multi-AZ failover capability of Aurora simplified high availability processes, with Aurora managing replication across Availability Zones.

The transition from SQL Server to PostgreSQL was a formidable undertaking, involving not just data migration but also a comprehensive transformation of stored procedures, DDL, and data isolation. Firmex had already begun migrating some services to the cloud, utilizing Aurora PostgreSQL-Compatible for components like a session handler and an email scheduler. The favorable pricing and performance of Aurora PostgreSQL-Compatible made it a viable option for storing customer data. For the migration, Firmex opted to employ AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT).

The accompanying diagram illustrates the heterogeneous migration architecture from Microsoft SQL Server to an Aurora PostgreSQL database using AWS Database Migration Service Schema Conversion (DMS SC), a fully managed service. While this diagram showcases the DMS SC approach, the focus here is on achieving similar results with the AWS Schema Conversion Tool (SCT). For further details about the managed service option, refer to Migrating SQL Server databases to Amazon Aurora PostgreSQL with DMS Schema Conversion.

A Path Forward

Prior to initiating the migration project, Firmex’s team began converting their stored procedures to ensure the application functioned correctly on PostgreSQL instances. Given the extensive logic embedded within the T-SQL stored procedures, automated migration tools proved challenging without manual intervention. This endeavor required over nine months of effort to develop replacement patterns and rigorously test each of the 260 stored procedures to confirm their functionality in the new server environment. The differences between the database engines and the data access libraries necessitated a more complex approach than a simple drop-in replacement for most database calls.

The following diagram illustrates the simplified architecture of the Firmex migration.

SCT Challenges

The application’s data definition language for both tenant-isolation and project-isolation databases was converted to PostgreSQL schemas using the AWS SCT. However, Firmex’s SQL Server databases utilized a case-insensitive collation that did not directly map to PostgreSQL. This required engineers to define a case-insensitive collation and apply it to any columns needing case insensitivity.

Data Migration Challenges

AWS DMS facilitated the migration of data from SQL Server to PostgreSQL, but several constraints emerged due to the extensive number of databases and the requirement for all databases to reside in proximity to the application for low latency. The following challenges were encountered during the database migration:

Endpoint Limits

For optimal performance with AWS DMS, the ideal migration strategy involves creating one task per database for continuous data transfer over an extended period. However, for Firmex’s application, this would necessitate 65,000 tasks, exceeding the quota of 600 tasks per account and the endpoint limit of 1,000 per account. (If your migration requires a limit increase, please consult AWS Support and your AWS account team – see DMS API limits).

AWS DMS Task Limits

Due to the vast number of databases and the limits on the number of AWS DMS Tasks, engineers were unable to create sufficient tasks simultaneously to migrate all databases continuously. To circumvent this limitation, they developed one-shot migration tasks on an as-needed basis. Custom tooling was crafted using the C# AWS SDK to effectively set up replication endpoints, create AWS DMS Tasks, execute one-shot migration tasks, and document the final state while adhering to DMS API limits.

Data Transfer

Firmex’s engineers devised a comprehensive process that included provisioning a VPN tunnel, configuring AWS DMS tasks and endpoints for migration, placing the customer in maintenance mode, initiating AWS DMS migration tasks, monitoring progress, and subsequently removing tasks and endpoints after migration completion. Many migration tasks were executed in parallel to ensure timely completion and optimal resource utilization. The migration progress and state were meticulously recorded in their application database to track which customers had been migrated and their respective statuses.

API Limits

The aggressive parallelization and substantial data volume led to multiple issues during the migration process. One significant constraint was bandwidth limitations within their data center, resulting in dropped packets and connection issues when the bandwidth was exceeded. Additionally, the rapid creation and deletion of tasks targeted at different databases encountered AWS DMS limits. The team identified that endpoints were not being deleted quickly enough, contributing to the limit challenges. Collaborating with their AWS account team, they successfully increased the number of endpoints per replication instance from 100 to 200 and raised the account limit from 1,000 to 3,000.

To navigate the API limit that could not be raised, Firmex strategically scheduled as many API calls as possible outside the migration window, establishing replication instances and endpoints to enhance the likelihood of successful calls during the migration period.

Maintenance Windows

Ensuring compliance with their availability SLA for customers posed another challenge. The initial migration waves targeted tenants used for testing and validation, providing confidence in the approach and timing related to project sizes and numbers. Migrating actual customers necessitated temporarily rendering individual tenants unavailable, utilizing a built-in application feature. Given the variable number of projects per customer, meticulous analysis of migration behavior was essential to align with available migration windows. Targeted customer windows ensured that downtime affected only a limited number of customers. Firmex’s engineering team collaborated closely with customer support to accommodate customers within multiple small windows, executing over 45 migration waves over three months. Some waves were dedicated to individual customers with extensive databases to migrate.

Database Connection Counts

As more tenants transitioned to AWS, a notable increase in connection counts to the databases was observed in Amazon RDS Performance Insights. To connect to a tenant schema, the search_path was included in the connection string. The application utilized the npgsql library, resulting in thousands of connections to the database with small per-schema connection pools. Unique connection strings per schema hindered connection sharing, leading to high memory usage and occasional I/O pressure on the machines. The application and migration process competed for connections, pushing against maximum connection limits. With Aurora, they could easily scale the database’s memory to accommodate the increased load until they refined connection management. By separating the setting of the search_path from the initial connection string, they achieved a tenfold improvement, reducing connection counts from 4,000-6,000 to 200-300.

Performance

Firmex’s T-SQL stored procedures heavily relied on temporary tables for intermediate results and data transfer between calls. In PostgreSQL, this necessitated extensive use of system tables for tracking, placing additional stress on the auto-vacuum process. They encountered issues with incomplete auto-vacuum processes and unresponsive stats_collector errors, primarily attributed to temporary tables. Refactoring one of the most utilized stored procedures that targeted the writer node to instead target the reader node significantly alleviated the load on the writer database, reducing vCPU usage from 2-4 to below 0.5 while improving average response times by 75% (from 113ms to 27ms).

The integrated Performance Insights monitoring tool in Aurora PostgreSQL-Compatible has proven invaluable for Firmex in analyzing database and I/O metrics, facilitating informed decisions regarding instance sizing. They employed a combination of vCPU load and memory metrics to determine optimal sizing. With the ease of adding additional reader nodes to an Aurora cluster, efforts are ongoing to redirect read-only workloads to these nodes, enhancing resource utilization on the writer node and allowing the application to scale effectively.

The comprehensive redevelopment of the application to function on Aurora PostgreSQL-Compatible demanded dedicated effort. Over 18 months, Firmex successfully migrated their T-SQL to PL/pgSQL and navigated a clear migration plan. Post-migration, they have realized benefits from using Aurora PostgreSQL-Compatible, including the ability to right-size database instances, reduce operating costs, and leverage automatic and fast failovers, ultimately aiding Firmex in meeting availability targets.

Forward Thinking

As their application was originally designed for SQL Server, Firmex is contemplating future enhancements to optimize their PostgreSQL utilization. Key areas of focus include database connection management via Amazon RDS Proxy, minimizing temporary table usage, tuning auto-vacuum processes, employing the more expressive type system, optimizing reader node utilization, and leveraging blue/green deployments to streamline engine upgrade times.

In the future, they plan to harness the connection management capabilities of RDS Proxy to enhance value within their environment.


About the Authors

Eric Boyer is the Senior Lead Architect at Firmex Inc., specializing in designing secure, scalable, and responsive solutions that meet customer needs and technical requirements. He has been instrumental in guiding Firmex’s transition to the cloud and breaking down their monolithic application into manageable AWS cloud-native components. Outside of work, he enjoys reading science fiction, biking, and playing board games with his family.

Maria Hristova is a technology executive focused on strategic technical transformations and building highly capable, diverse teams. She has overseen Firmex’s evolution into a cloud-first development organization, better positioned to achieve the company’s growth objectives. Maria’s strategic vision and leadership have been pivotal in successfully guiding numerous technical projects to completion. In her leisure time, she enjoys discussing whisky distilling and Taylor Swift over a glass of wine while in the company of her cats.

Kiran Singh is a Senior Database Specialist Solutions Architect at AWS, specializing in Amazon RDS and Amazon Aurora, with expertise in PostgreSQL. Kiran collaborates with partners to design and implement database migration and modernization strategies, assisting customers in building scalable, secure, performant, and robust database architectures in the AWS Cloud.

Radhika Chakravarty is a Database Solutions Architect with Amazon Web Services, providing technical assistance to customers and partners in designing and implementing cloud migration projects, facilitating the migration and modernization of existing databases to AWS Cloud.

Dylan Souvage is a Partner Solutions Architect based in Austin, Texas, who enjoys working with customers to understand their business needs and enable their cloud journey. In his spare time, he loves exploring nature and embarking on long road trips.

Daniel Rios is a WWSO Data & AIML Sales Specialist, acting as a trusted advisor to customers by offering guidance on unlocking new value streams, addressing key business challenges, and delivering results using AWS Data & AIML Services.

Tech Optimizer
How Firmex used AWS SCT and AWS DMS to move 65,000 on-premises Microsoft SQL Server databases to an Amazon Aurora PostgreSQL cluster