# Reliability

Ensuring service availability and resilience has always been a top priority at VGS. To enhance our Disaster Recovery (DR) capabilities, we have made a significant investment in Cross-Region Disaster Recovery (CRDR), enabling seamless failover and continuity even in the event of a regional failure. This document provides an overview of our CRDR architecture, the rationale behind this investment, our testing approach, the migration and replication plan, and key information for customers in the unlikely event of a regional failure.

## Architecture Overview

VGS services have long used a multi-site active-active architecture across Availability Zones (AZs) within a region, with database snapshots copied to our secondary region. Over the last few years, we recognized the need to strengthen our infrastructure's resilience to meet the growing demands of our customers, including the ability to recover from region-level failures.

Our Cross-Region Disaster Recovery framework, currently available in the US geography only, follows the [AWS Warm Standby](https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html#warm-standby) strategy: a scaled-down but fully functional copy of our tier 1 production services runs continuously in a secondary region. Because the workload is always running there, recovery time is shorter, and the secondary region can handle traffic immediately at reduced capacity and scale up when needed. Since everything is already deployed and running, failing over only requires switching the database master and re-routing traffic, which makes rapid failover efficient.

### Key Features of VGS CRDR Architecture

* **Always-On Infrastructure:** Services remain operational in the secondary region at reduced capacity and can process requests immediately after the switchover.
* **Auto Scaling for Full Recovery:** Our infrastructure remains active in the secondary region, allowing for rapid autoscaling to meet production demand during a failure.
* **Configuration:** Our primary services within the US geography operate in Virginia (us-east-1), with the failover region in Ohio (us-east-2).
* **Failover & Failback:** Failover is triggered through manually invoked automations, ensuring minimal disruption in the event of a regional failure.
* **Cross-Region Data Replication:** Databases are synchronized through cross-region replication, keeping a replicated copy of the data available in the secondary region.
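
The warm standby failover described above (switch the database master, re-route traffic, then scale up) can be pictured as a small state model. This is an illustrative sketch only: the `Region` and `FailoverPlan` classes, the 25% standby capacity figure, and the exact step ordering are assumptions made for the example, not VGS's actual automation or any AWS API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Hypothetical model of one region's state (not a VGS or AWS API)."""
    name: str
    is_db_primary: bool
    serves_traffic: bool
    capacity_pct: int  # share of full production capacity currently running


@dataclass
class FailoverPlan:
    primary: Region
    standby: Region
    log: List[str] = field(default_factory=list)

    def run(self) -> None:
        # Step 1: switch the database master to the standby region.
        self.primary.is_db_primary = False
        self.standby.is_db_primary = True
        self.log.append(f"database master switched to {self.standby.name}")

        # Step 2: re-route traffic; the standby serves requests right away,
        # but only at its reduced warm-standby capacity.
        self.primary.serves_traffic = False
        self.standby.serves_traffic = True
        self.log.append(
            f"traffic re-routed to {self.standby.name} "
            f"at {self.standby.capacity_pct}% capacity"
        )

        # Step 3: scale the standby up to full production capacity.
        self.standby.capacity_pct = 100
        self.log.append(f"{self.standby.name} scaled to 100% capacity")


# Assumed starting state: the 25% standby capacity is an illustrative value.
plan = FailoverPlan(
    primary=Region("us-east-1", is_db_primary=True, serves_traffic=True, capacity_pct=100),
    standby=Region("us-east-2", is_db_primary=False, serves_traffic=False, capacity_pct=25),
)
plan.run()
for entry in plan.log:
    print(entry)
```

The point of the model is that nothing has to be deployed during the failover itself: the standby already exists, so the steps only change roles, routing, and capacity.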

### Architecture Diagram

<figure><img src="/files/4ZrDgXTObGhRJJj0SXHr" alt="VGS Cross-Region Disaster Recovery architecture diagram"><figcaption></figcaption></figure>

## Current VGS Services with CRDR

Our Proxy services are currently covered by Cross-Region Disaster Recovery (CRDR) using the [Warm Standby](https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html#warm-standby) option, which provides high availability and resilience in the event of a regional failure. All other VGS services use a [Backup & Restore](https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html#backup-and-restore) approach to maintain reliability across our infrastructure. We continually evaluate roadmap priorities for moving additional services from the Backup & Restore cohort to Warm Standby.

Please note: the Warm Standby option is not available for Proxy users with private connectivity.

## Why We Made This Investment

We recognize the increasing risks of regional cloud outages and the critical importance of uninterrupted service for our customers. The following factors drove the investment in CRDR:

* **Business Continuity:** Ensuring that customers experience minimal downtime in the face of major regional disruptions.
* **Regulatory & Compliance Needs:** Many industries require data redundancy across geographically separated locations.
* **Enterprise Readiness & Reliability:** Our enhanced DR strategy provides confidence to businesses relying on VGS infrastructure.
* **Scaling for Future Growth:** A cross-region framework positions us for scalability and resilience as customer needs evolve.

## Testing and Validation

To ensure the effectiveness of our CRDR framework, we have implemented a structured testing plan and conduct CRDR tests annually. These recurring tests validate our failover processes, confirm service resilience, and proactively identify areas for improvement.

We conducted rigorous testing, including:

* **Cross-Region Switchover & Switchback:** Successfully transitioned workloads between Virginia and Ohio in less than ten minutes in a controlled environment.
* **Data Consistency Verification:** Validated that replicated data remained accurate and complete across regions.
* **Service Performance Under Failover:** Ensured latency and performance remained within acceptable thresholds post-failover.
* **Failure Scenario Simulations:** Modeled real-world failure scenarios to confirm system resilience and response effectiveness.
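
One way to picture the data consistency verification step is to compare the primary and replicated copies of a table by digest, then list the keys that differ. The sketch below is a minimal illustration under stated assumptions: the function names, the `(key, value)` row shape, and the sample token IDs are hypothetical and do not describe VGS's internal tooling.

```python
import hashlib

_MISSING = object()  # sentinel for keys absent from one side


def table_digest(rows):
    """Return a SHA-256 digest of (key, value) rows, independent of row order."""
    digest = hashlib.sha256()
    for key, value in sorted(rows):
        digest.update(repr((key, value)).encode("utf-8"))
    return digest.hexdigest()


def find_mismatches(primary_rows, replica_rows):
    """Return the sorted keys whose values differ or exist on only one side."""
    primary = dict(primary_rows)
    replica = dict(replica_rows)
    return sorted(
        key
        for key in primary.keys() | replica.keys()
        if primary.get(key, _MISSING) != replica.get(key, _MISSING)
    )


# Hypothetical sample data standing in for rows read from each region.
primary_rows = [("tok_1", "alpha"), ("tok_2", "beta"), ("tok_3", "gamma")]
replica_in_sync = [("tok_3", "gamma"), ("tok_1", "alpha"), ("tok_2", "beta")]
replica_lagging = [("tok_1", "alpha"), ("tok_3", "gamma")]

print(table_digest(primary_rows) == table_digest(replica_in_sync))  # True
print(find_mismatches(primary_rows, replica_lagging))               # ['tok_2']
```

Comparing digests first keeps the common case cheap; the key-by-key comparison is only needed when the digests disagree.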

We continuously refine our disaster recovery strategy by performing these tests annually and ensuring our systems are prepared for any potential regional failure. To ensure consistency and correctness, we are exploring plans to continuously run a small percentage of traffic in our secondary region.
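
To illustrate what running a small percentage of traffic in the secondary region could look like, the following sketch assigns each request to a region deterministically by hashing a request ID. It is not a description of VGS's routing: `pick_region`, the request ID format, and the 5% share are assumptions chosen for the example, and a production setup would more likely rely on weighted routing at the DNS or load balancer layer.

```python
import hashlib

PRIMARY = "us-east-1"
SECONDARY = "us-east-2"


def pick_region(request_id: str, secondary_pct: float) -> str:
    """Send roughly secondary_pct percent of requests to the secondary region.

    The choice is derived from a hash of the request ID, so the same ID
    always maps to the same region.
    """
    if not 0 <= secondary_pct <= 100:
        raise ValueError("secondary_pct must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return SECONDARY if bucket < secondary_pct * 100 else PRIMARY


# Hypothetical request IDs; the 5% share is an illustrative value.
regions = [pick_region(f"req-{i}", 5.0) for i in range(10_000)]
share = regions.count(SECONDARY) / len(regions)
print(f"share routed to {SECONDARY}: {share:.1%}")
```

Hashing rather than random sampling makes the split reproducible, which helps when comparing behavior between regions for the same request.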

## What Customers Should Know in the Event of a Disaster

In the unlikely event of a failure affecting our primary region, customers should be aware of the following:

* **Data Integrity Assurance:** All replicated data remains consistent, ensuring the continuity of business operations.
* **Customer Communication:** Our [status page](https://status.verygoodsecurity.com/) will provide real-time updates and transparency regarding service status and expected recovery timelines.
* **Planned Failback:** Once the primary region is restored, services will be carefully transitioned back with minimal impact.

## Conclusion

Our investment in Cross-Region Disaster Recovery ensures a higher level of resilience and reliability for our customers. By implementing this architecture, we are committed to providing uninterrupted service even during major cloud provider outages.

Please contact our support team at <support@vgs.io> for further details or assistance regarding our CRDR capabilities, our warm standby roadmap for other VGS services, and potential expansion to additional regions.


