Reliability

Introduction

Ensuring service availability and resilience has always been a top priority at VGS. To enhance our Disaster Recovery (DR) capabilities, we have made a significant investment in Cross-Region Disaster Recovery (CRDR), enabling seamless failover and continuity even in the event of a regional failure. This document provides an overview of our CRDR architecture, the rationale behind this investment, our testing approach, the migration and replication plan, and key information for customers in the unlikely event of a regional failure.

Architecture Overview

VGS services have long run in a multi-site active-active architecture across Availability Zones (AZs) within a single region, with database snapshots copied to a secondary region. In the last few years, we recognized the need to raise our infrastructure's resilience to meet growing customer demands, in particular the ability to recover from region-level failures. Our Cross-Region Disaster Recovery framework, currently available in the US geography only, follows the AWS Warm Standby strategy: a scaled-down but fully functional copy of our tier 1 production services runs in a secondary region. This reduces recovery time because the workload is always running in the other region, ready to scale up when needed. The secondary region can handle traffic immediately at reduced capacity, so service continues with minimal disruption. Because everything is already deployed and running, failing over only requires switching the database master and re-routing traffic.

Key Features of VGS CRDR Architecture:

Always-On Infrastructure: Services remain operational in the secondary region at reduced capacity and can immediately process requests after the switch over.
Auto Scaling for Full Recovery: Because services are already running in the secondary region, they can autoscale rapidly to meet full production demand during a failure.
Configuration: Our primary services within the US geography operate in Virginia (us-east-1), with the failover region in Ohio (us-east-2).
Failover & Failback: Failover is triggered through manually invoked automations, ensuring minimal disruption in the event of a regional failure.
Cross-Region Data Replication: We use cross-region replication to synchronize databases, keeping storage highly available across regions.
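
The failover sequence described above (switch the database master, re-route traffic, then scale up) can be sketched as a small state model. This is an illustrative sketch only, not VGS's actual automation: the `Region` and `WarmStandby` classes, the `fail_over` method, and the instance counts are all hypothetical names and values chosen for the example, and no real cloud APIs are called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    name: str
    db_role: str           # "primary" (writable) or "replica" (replicated copy)
    instances: int         # running service instances; counts are illustrative
    serving_traffic: bool


@dataclass
class WarmStandby:
    active: Region
    standby: Region
    full_capacity: int     # hypothetical instance count for full production load
    steps: List[str] = field(default_factory=list)

    def fail_over(self) -> None:
        """Manually invoked: promote the standby database, re-route, scale up."""
        old, new = self.active, self.standby

        # 1. Switch the database master to the standby region.
        old.db_role, new.db_role = "replica", "primary"
        self.steps.append(f"promoted database in {new.name}")

        # 2. Re-route traffic; the standby serves at reduced capacity right away.
        old.serving_traffic, new.serving_traffic = False, True
        self.steps.append(f"routed traffic to {new.name} at {new.instances} instances")

        # 3. Scale the new active region up to full production capacity.
        new.instances = max(new.instances, self.full_capacity)
        self.steps.append(f"scaled {new.name} to {new.instances} instances")

        # Roles swap, so invoking fail_over() again models failback.
        self.active, self.standby = new, old


if __name__ == "__main__":
    dr = WarmStandby(
        active=Region("us-east-1", "primary", 10, True),
        standby=Region("us-east-2", "replica", 2, False),
        full_capacity=10,
    )
    dr.fail_over()
    for step in dr.steps:
        print(step)
```

Because the standby already runs a working copy of the services, step 2 can happen before step 3: traffic is accepted at reduced capacity while scaling proceeds, which is the main recovery-time advantage of Warm Standby over a Backup & Restore approach.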

Architecture Diagram

Current VGS Services with CRDR

Our Proxy services are currently covered by Cross-Region Disaster Recovery (CRDR) using the Warm Standby option, which provides high availability and resilience in the event of a regional failure. All other VGS services use a Backup & Restore approach. We continually evaluate roadmap priorities for moving additional services from the Backup & Restore cohort to Warm Standby.

Please note: the Warm Standby option is not available for Proxy users with private connectivity.

Why We Made This Investment

We recognize the increasing risks of regional cloud outages and the critical importance of uninterrupted service for our customers. The following factors drove the investment in CRDR:

Business Continuity: Ensuring that customers experience minimal downtime in the face of major regional disruptions.
Regulatory & Compliance Needs: Many industries require data redundancy across geographically separated locations.
Enterprise Readiness & Reliability: Our enhanced DR strategy provides confidence to businesses relying on VGS infrastructure.
Scaling for Future Growth: A cross-region framework positions us for scalability and resilience as customer needs evolve.

Testing and Validation

To ensure the effectiveness of our CRDR framework, we have implemented a structured plan to run CRDR tests annually. These recurring tests validate our failover processes, confirm service resilience, and proactively identify areas for improvement.

Our testing has included:

Cross-Region Switchover & Switchback: Successfully transitioned workloads between Virginia and Ohio in less than ten minutes in a controlled environment.
Data Consistency Verification: Validated that replicated data remained accurate and complete across regions.
Service Performance Under Failover: Ensured latency and performance remained within acceptable thresholds post-failover.
Failure Scenario Simulations: Modeled real-world failure scenarios to confirm system resilience and response effectiveness.
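
As an illustration of what a data consistency check can look like, the sketch below compares an order-independent digest of the records in each region and lists the record IDs that differ. It is a minimal example under stated assumptions, not a description of VGS's test tooling: the `(record_id, payload)` row shape and the `table_digest` and `diff_ids` helpers are hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple

# Hypothetical row shape: (record id, serialized record contents).
Row = Tuple[str, str]


def table_digest(rows: Iterable[Row]) -> str:
    """Return a SHA-256 digest of the rows that does not depend on row order."""
    h = hashlib.sha256()
    for record_id, payload in sorted(rows):
        for part in (record_id, payload):
            data = part.encode("utf-8")
            # Length-prefix each field so adjacent fields cannot be confused.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def diff_ids(primary: Iterable[Row], replica: Iterable[Row]) -> List[str]:
    """List record IDs that are missing from, or differ between, the two sides."""
    p, r = dict(primary), dict(replica)
    return sorted(k for k in p.keys() | r.keys() if p.get(k) != r.get(k))


if __name__ == "__main__":
    primary = [("a", "1"), ("b", "2")]
    replica = [("b", "2"), ("a", "1")]      # same data, different order
    stale = [("a", "1"), ("b", "old")]      # one record out of date

    print(table_digest(primary) == table_digest(replica))
    print(diff_ids(primary, stale))
```

Comparing digests first is cheap when the two sides match; the per-record diff is only needed when the digests disagree.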

These annual tests let us continuously refine our disaster recovery strategy and keep our systems prepared for a regional failure. To further verify consistency and correctness, we are exploring plans to continuously run a small percentage of traffic in our secondary region.

What Customers Should Know in the Event of a Disaster

In the unlikely event of a failure affecting our primary region, customers should be aware of the following:

Data Integrity Assurance: All replicated data remains consistent, ensuring the continuity of business operations.
Customer Communication: Our status page will provide real-time updates and transparency regarding service status and expected recovery timelines.
Planned Failback: Once the primary region is restored, services will be carefully transitioned back with minimal impact.

Conclusion

Our investment in Cross-Region Disaster Recovery provides a higher level of resilience and reliability for our customers. This architecture reflects our commitment to keeping service available even during major cloud provider outages.

Please contact our support team at [email protected] for further details or assistance regarding our CRDR capabilities, our warm standby roadmap for other VGS services, and potential expansion to additional regions.

