Service level agreements
Grand Central offers tiered availability commitments:

| Tier | Uptime SLA | Use case |
|---|---|---|
| Signature | 99.9% | Mission-critical banking operations |
| Premium | 99.5% | Production banking services |
| Essential | 99.5% | Standard banking workloads |
- Recovery Time Objective (RTO): 6 hours for nominal situations
- Recovery Point Objective (RPO): Varies by data criticality; critical financial data uses synchronous replication for minimal data loss
High availability architecture
The platform uses redundant infrastructure and automated recovery to maintain continuous operation during failures.

Multi-zone deployment
The platform deploys services across multiple Azure availability zones within each region (a scheduling sketch follows this list):

- Availability zones: Each zone operates independently with isolated power, cooling, and networking.
- Container distribution: Kubernetes pods spread across zones using anti-affinity rules and zone-aware scheduling.
- Data redundancy: Databases and storage replicate synchronously across zones for automatic failover without data loss.
- Network resilience: Load balancers distribute traffic across zones and reroute automatically when zones become unhealthy.
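The anti-affinity and zone-aware scheduling described above are set on the workload spec. Below is a minimal sketch using the kubernetes Python client; the service name, labels, and image are illustrative placeholders rather than platform-defined values.

```python
# Sketch: spread replicas across availability zones and keep them off shared nodes.
from kubernetes import client

labels = {"app": "payments-api"}  # hypothetical service label

pod_spec = client.V1PodSpec(
    containers=[
        client.V1Container(name="payments-api", image="example.azurecr.io/payments-api:1.0")
    ],
    # Spread replicas evenly across availability zones.
    topology_spread_constraints=[
        client.V1TopologySpreadConstraint(
            max_skew=1,
            topology_key="topology.kubernetes.io/zone",
            when_unsatisfiable="DoNotSchedule",
            label_selector=client.V1LabelSelector(match_labels=labels),
        )
    ],
    # Prefer not to place two replicas of the same app on the same node.
    affinity=client.V1Affinity(
        pod_anti_affinity=client.V1PodAntiAffinity(
            preferred_during_scheduling_ignored_during_execution=[
                client.V1WeightedPodAffinityTerm(
                    weight=100,
                    pod_affinity_term=client.V1PodAffinityTerm(
                        label_selector=client.V1LabelSelector(match_labels=labels),
                        topology_key="kubernetes.io/hostname",
                    ),
                )
            ]
        )
    ),
)
```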
Auto-scaling and load balancing
The platform automatically adjusts capacity based on demand (a configuration sketch follows the table):

| Component | Function |
|---|---|
| Horizontal Pod Autoscaler | Scales pod count based on CPU, memory, or custom metrics |
| Vertical Pod Autoscaler | Adjusts resource limits for right-sizing |
| Layer 4 load balancing | TCP/UDP distribution for high-throughput scenarios |
| Layer 7 load balancing | HTTP/HTTPS routing with SSL termination |
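As an illustration of the Horizontal Pod Autoscaler row, here is a sketch that builds a CPU-based autoscaler with the kubernetes Python client, assuming a recent client with the autoscaling/v2 models; the deployment name, replica bounds, and utilization target are assumptions, not platform defaults.

```python
# Sketch: scale a deployment between 3 and 12 replicas on average CPU utilization.
from kubernetes import client

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="payments-api"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="payments-api"
        ),
        min_replicas=3,
        max_replicas=12,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                ),
            )
        ],
    ),
)
# To apply it:
# client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler("payments", hpa)
```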
Health monitoring
Kubernetes probes ensure application health (a probe configuration sketch follows this list):

- Readiness probes: Check if pods are ready to receive traffic.
- Liveness probes: Restart unhealthy containers automatically.
- Startup probes: Allow extended initialization time for slow-starting containers.
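A minimal sketch of the three probe types on a single container, again using the kubernetes Python client; the endpoints, port, and timings are illustrative assumptions.

```python
# Sketch: readiness, liveness, and startup probes on a container.
from kubernetes import client

container = client.V1Container(
    name="payments-api",
    image="example.azurecr.io/payments-api:1.0",
    # Readiness: only route traffic once the service reports ready.
    readiness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/ready", port=8080),
        period_seconds=10,
        failure_threshold=3,
    ),
    # Liveness: restart the container if it stops responding.
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        period_seconds=15,
        failure_threshold=3,
    ),
    # Startup: allow up to 5 minutes of initialization before liveness checks apply.
    startup_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        period_seconds=10,
        failure_threshold=30,
    ),
)
```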
Fault tolerance patterns
The platform implements patterns to prevent cascade failures (a minimal sketch of the backoff and circuit-breaker patterns follows this list):

- Circuit breakers: Detect failing services and fail fast to protect the system.
- Exponential backoff: Retry with increasing delays to prevent retry storms.
- Fallback services: Serve cached responses when primary services are unavailable.
- Dead letter queues: Route permanently failed requests for later analysis.
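A minimal, dependency-free sketch of the first two patterns: exponential backoff with jitter and a circuit breaker with a fallback. Thresholds and the wrapped call are illustrative; the platform's actual implementations may differ.

```python
# Sketch: exponential backoff with jitter, and a minimal circuit breaker.
import random
import time


def retry_with_backoff(call, retries=5, base_delay=0.2, max_delay=5.0):
    """Retry a callable, doubling the delay (with jitter) after each failure."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter avoids retry storms


class CircuitBreaker:
    """Fail fast once a downstream service has failed too often recently."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback=None):
        # Open circuit: skip the downstream call until the reset timeout elapses.
        if self.opened_at and time.monotonic() - self.opened_at < self.reset_timeout:
            return fallback() if fallback else None
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback() if fallback else None
        # Success closes the circuit again.
        self.failures = 0
        self.opened_at = None
        return result
```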
GitOps and self-healing
The applications-live repository acts as the single source of truth for your runtime environment. ArgoCD continuously monitors cluster state (a sketch of a self-healing Application definition follows this list):
- Drift detection: ArgoCD detects when cluster resources deviate from Git configuration, such as manual changes.
- Automated sync: The SelfHeal feature automatically reverts the cluster to the desired state defined in Git.
- Configuration reload: Reloader watches for ConfigMap and Secret changes and performs rolling restarts to apply new configurations.
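A sketch of what a self-healing Argo CD Application pointing at the applications-live repository could look like, created through the Kubernetes custom-objects API. The repository URL, path, application name, and namespaces are placeholders; only the repository name and the selfHeal behaviour come from the description above.

```python
# Sketch: register an Argo CD Application with automated sync and self-healing.
from kubernetes import client, config

application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "payments-api", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://example.com/git/applications-live.git",  # placeholder URL
            "path": "payments-api/overlays/prod",
            "targetRevision": "main",
        },
        "destination": {"server": "https://kubernetes.default.svc", "namespace": "payments"},
        "syncPolicy": {
            "automated": {
                "prune": True,     # remove resources that were deleted from Git
                "selfHeal": True,  # revert manual drift back to the Git-defined state
            }
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="argocd",
    plural="applications", body=application,
)
```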
Observability stack
The platform includes a pre-configured monitoring stack. Access is controlled through roles assigned in the self-service.tfvars file.
Datadog
Primary tool for aggregating logs, infrastructure metrics, and distributed traces.

| Attribute | Value |
|---|---|
| Access URL | https://gc-ecos.datadoghq.eu |
| Requirements | Runtime role assigned (such as dev-ro, stg-rw) |
| Note | The first login may return an error because the account is provisioned just-in-time |
Grafana
Visualization dashboards for cluster health, performance metrics, and business metrics.

| Attribute | Value |
|---|---|
| Access URL | https://ecosecos.grafana.net |
| Requirements | Azure AD credentials mapped to self-service roles |
Additional observability tools
| Tool | Purpose |
|---|---|
| Prometheus | Metrics collection from all platform components |
| Jaeger | Distributed tracing across services |
| ELK Stack | Log aggregation with full-text search |
Disaster recovery strategy
The platform implements multi-region failover to ensure business continuity during regional outages.

Cross-region replication
Data and services replicate across multiple Azure regions for geographic redundancy:

| Region type | Function |
|---|---|
| Primary | Active production workloads with real-time data processing |
| Secondary | Standby environments with continuous data replication |
Data replication patterns
| Pattern | RPO | Use case |
|---|---|---|
| Synchronous | 0 (zero data loss) | Financial transactions, account balances, authentication data |
| Asynchronous | < 15 minutes | Analytics, audit logs, configuration |
Backup strategy
| Backup type | Frequency | Description |
|---|---|---|
| Full | Weekly | Complete backup of all data and configurations |
| Incremental | Daily | Changed data since last backup |
| Point-in-time | Continuous | Change data capture for precise recovery |
Failover procedures
Failover triggers include:

- Critical failures: Region outages, database corruption, security breaches
- Performance degradation: Response time thresholds, error rate spikes, SLA breaches
- Planned maintenance: Scheduled updates, infrastructure upgrades
The failover process follows four steps (a simplified sketch follows this list):

- Detection: Automated monitoring identifies issues
- Assessment: Evaluate scope and determine recovery strategy
- Redirection: Update DNS and load balancer configurations
- Validation: Health checks verify recovered services
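A simplified, hypothetical sketch of that flow; every function it calls (check_region_health, promote_secondary, update_traffic_routing) is a placeholder standing in for the platform's monitoring, data-promotion, and traffic-routing mechanisms, not a platform API.

```python
# Sketch of the failover flow above: detect, assess, redirect, validate.
import time


def check_region_health(region: str) -> bool:
    """Placeholder: query monitoring for the region's overall health."""
    raise NotImplementedError


def promote_secondary(region: str) -> None:
    """Placeholder: promote the standby region's replicated data stores."""
    raise NotImplementedError


def update_traffic_routing(region: str) -> None:
    """Placeholder: repoint DNS and load balancers to the target region."""
    raise NotImplementedError


def fail_over(primary: str, secondary: str, failures_required: int = 3) -> None:
    # Detection: require several consecutive failed checks before acting.
    consecutive_failures = 0
    while consecutive_failures < failures_required:
        consecutive_failures = 0 if check_region_health(primary) else consecutive_failures + 1
        time.sleep(30)
    # Assessment and redirection: promote the standby, then move traffic to it.
    promote_secondary(secondary)
    update_traffic_routing(secondary)
    # Validation: confirm the secondary region is healthy before declaring success.
    if not check_region_health(secondary):
        raise RuntimeError("Failover validation failed; escalate to on-call")
```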
Recovery testing
| Test type | Frequency | Scope |
|---|---|---|
| Component testing | Ongoing | Database failover, load balancer switching, backup validation |
| Partial DR testing | Quarterly | Service layer failover, data replication |
| Full DR testing | Bi-annually | Complete region failover with end-to-end validation |
Incident management
Incident classification
| Priority | Description | Response time |
|---|---|---|
| P1 Critical | Service down | 15 minutes |
| P2 High | Major degradation | 1 hour |
| P3 Medium | Minor impact | 4 hours |
| P4 Low | No service impact | 24 hours |
Escalation procedures
- Automated escalation: Time-based rules, severity-based routing, on-call integration
- Communication protocols: Stakeholder notifications, status page updates, customer communication
Regulatory compliance
The platform supports banking regulatory requirements for business continuity:

| Standard | Description |
|---|---|
| Basel III | Operational resilience requirements |
| PCI DSS | Availability standards for payment processing |
| ISO 22301 | Business continuity management |
| ISO 27031 | ICT readiness for business continuity |