Designing Multi-CDN Resilience: Practical Architecture to Survive a Cloudflare Outage
Step-by-step multi-CDN runbook: add a second CDN, automate health checks, and implement DNS-level failover while preserving cache and WAF controls.
When Cloudflare (or any single CDN) fails: why you need a multi-CDN plan in 2026
On Jan 16, 2026 a major Cloudflare incident caused widespread service disruptions, a clear reminder that relying on a single CDN or DNS provider is a business risk. For engineering teams facing unpredictable cloud and CDN outages, the questions are: how do you add a second CDN quickly, automate health checks, and switch traffic at DNS without tearing down caches or security controls?
Executive summary — what this runbook delivers
This article gives a pragmatic, step-by-step architecture and Infrastructure-as-Code (IaC) examples for implementing multi-CDN resilience in production:
- How to add a second CDN without changing origin behavior or invalidating existing caches
- How to automate active health checks that monitor CDN edge and origin behavior
- How to implement DNS-level failover and traffic steering using Route53/NS1 (examples in Terraform)
- How to keep caching, WAF rules, and origin access controls in sync across CDNs
- Operational playbook: monitoring, cache warming, and verification steps for failover
Context: trends in 2025–2026 that change the calculus
By 2026 multi-CDN is a mainstream resilience pattern. Enterprises and high-scale platforms now expect:
- DNS providers (Route53, NS1, and others) offering richer health checks, traffic steering and native failover policies
- Increased adoption of multi-CDN orchestration platforms and vendor-neutral edge policies
- Greater emphasis on origin authentication (mTLS and signed requests) to limit origin access to CDN POPs
- A focus on consistent caching semantics at the edge (surrogate-control, stale-while-revalidate) so failovers don’t produce cache storms
High-level architecture
Goal: run two CDNs in parallel (CDN-A and CDN-B), keep origin and security configurations identical, and let DNS automatically steer traffic based on health checks. Architecture components:
- Authoritative DNS: primary DNS at provider that supports health checks/steering (AWS Route53 or NS1). Avoid putting authoritative DNS behind the CDN you want to protect.
- Two CDN fronts: CDN-A (existing, e.g., Cloudflare) and CDN-B (secondary, e.g., Fastly or AWS CloudFront) each configured to pull from the same origin.
- Origin: origin servers or origin load balancer with origin ACLs allowing only CDN POP IP ranges (and your monitoring IPs) via mTLS or IP allowlists.
- Health checks: DNS provider checks the CDN front domain (via a special heartbeat endpoint) and optionally origin checks.
- Failover records: DNS weighted/primary-secondary records configured to respond instantly to health failures within TTL constraints.
- Automation & IaC: Terraform modules for DNS / health checks / CDN config / WAF rules so changes are reproducible.
Diagram (text)
Client -> DNS (Route53/NS1) -> CDN-A or CDN-B (CNAME to edge) -> Origin (mTLS/IP allowlist). Health checks target CDN-A and CDN-B endpoint (path: /__health?no_cache=1).
Step 1 — Choose a second CDN and align features
Pick a CDN-B that matches the features you actually use: TLS support, WAF, edge compute, Cache-Control semantics, purge APIs, and pricing. In 2026 the usual candidates are Fastly, AWS CloudFront (with CloudFront Functions / Lambda@Edge), Akamai, and performant niche players like BunnyCDN for cost-sensitive workloads.
- If you rely on edge compute (JS/VCL/Wasm), choose a CDN with comparable runtimes to avoid re-architecture.
- If you use managed WAF rules, ensure the second CDN supports equivalent protections or that you manage rules via IaC so they stay functionally identical.
- Check origin authentication options: mTLS and signed origin headers are essential to reject direct-to-origin attacks.
Step 2 — Configure canonical hostnames and origin behavior
Key principle: both CDNs must present identical behavior for origin pulls and caching keys. Use a single canonical origin hostname (origin.example.net) and set both CDNs' origin to that host. Use consistent request headers so cache keys align.
- Use Cache-Control + Surrogate-Control headers to control edge TTLs.
- Use a cache key that matches your application expectations (e.g., host + path + query keys) and implement the same rules on both CDNs.
- Use an origin heartbeat endpoint: /__health that returns 200 fast and can be set to bypass cache when called with a special header.
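A minimal origin heartbeat handler can be sketched with Python's standard library. This is an illustrative sketch, not a prescribed implementation: the bypass header name (X-Health-No-Cache) and the 5-second edge TTL are assumptions you should adapt to your own CDN semantics.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

BYPASS_HEADER = "x-health-no-cache"  # hypothetical header name; pick your own

def health_response_headers(request_headers):
    """Headers for /__health: briefly cacheable by default, fully
    bypassed when the special probe header is present."""
    if BYPASS_HEADER in {k.lower() for k in request_headers}:
        return {"Cache-Control": "no-store"}
    # Short edge TTL so routine health probes don't hammer the origin
    return {"Cache-Control": "max-age=5", "Surrogate-Control": "max-age=5"}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/__health"):
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            for k, v in health_response_headers(self.headers).items():
                self.send_header(k, v)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Both CDNs should be configured to honor the Surrogate-Control value so the heartbeat behaves identically at either edge.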
Step 3 — Secure origin access so both CDNs can fetch safely
Don’t rely on origin IP allowlists alone — CDNs rotate POPs. Use one or more of:
- mTLS between CDN and origin (many CDNs now support mTLS for origin pulls in 2026)
- Signed origin headers or token-based origin authentication
- Short-lived origin credentials rotated by automation
Store mTLS certs/secrets in your secrets manager and deploy via IaC along with origin LB settings.
Step 4 — DNS strategy: authoritative, health checks, and routing policies
Authoritative DNS must be outside of the CDN you are protecting (don’t use Cloudflare DNS as the only authoritative if Cloudflare is your primary CDN). Route53 and NS1 are common choices because of mature health checks and advanced steering.
DNS options
- Weighted records — simple: split traffic 100/0, switch to 0/100 when primary fails.
- Failover (primary/secondary) — built-in Route53 failover records with health checks.
- Latency / Geo-steering — direct clients to the best CDN per region; combine with health checks to make it resilient.
- Secondary DNS — keep a secondary authoritative DNS provider that will take over if the primary DNS provider becomes unavailable.
Practical DNS: Route53 + health check example (Terraform)
Below is a minimal Terraform example that creates two CNAME records for www.example.com pointing to CDN-A and CDN-B, with weighted routing (100/0) and a health check so Route53 prefers CDN-A unless the check fails. This pattern keeps caching intact because we use CNAMEs; each CDN continues to serve its cached content. Set TTL to 30–60 seconds for faster propagation.
# Terraform (AWS Route53) - health check & failover
resource "aws_route53_health_check" "cdn_a" {
  fqdn              = "cdn-a.example-cdn.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/__health?no_cache=1"
  failure_threshold = 2
  request_interval  = 10
}
resource "aws_route53_record" "www_primary" {
  zone_id        = aws_route53_zone.primary.zone_id
  name           = "www.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "cdn-a-primary"

  weighted_routing_policy {
    weight = 100
  }

  records         = ["cdn-a.example-cdn.com"]
  health_check_id = aws_route53_health_check.cdn_a.id
}

resource "aws_route53_record" "www_secondary" {
  zone_id        = aws_route53_zone.primary.zone_id
  name           = "www.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "cdn-b-secondary"

  weighted_routing_policy {
    weight = 0
  }

  records = ["cdn-b.example-cdn.com"]
}
When the health check for CDN-A fails, Route53 excludes the unhealthy primary record from its answers, so the weight-0 secondary record serves all traffic until CDN-A recovers. If you need geo-failover, use Route53 geoproximity or NS1's filters with the same health check pattern.
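Verifying which CDN resolvers are actually handing out can be scripted around dig. A small sketch, assuming dig +short output and illustrative CDN hostnames; the parsing functions are pure so they can run anywhere, and only the __main__ block touches the network.

```python
import subprocess

def resolved_targets(dig_output: str) -> list[str]:
    """Parse `dig +short` output into a list of CNAME/A answers."""
    return [line.strip().rstrip(".") for line in dig_output.splitlines()
            if line.strip()]

def serving_cdn(dig_output: str, primary: str, secondary: str) -> str:
    """Classify which CDN a resolver is currently handing out."""
    targets = resolved_targets(dig_output)
    if primary in targets:
        return "primary"
    if secondary in targets:
        return "secondary"
    return "unknown"

if __name__ == "__main__":
    # Run the same check from probes in several regions for a real picture
    out = subprocess.run(["dig", "+short", "www.example.com"],
                         capture_output=True, text=True).stdout
    print(serving_cdn(out, "cdn-a.example-cdn.com", "cdn-b.example-cdn.com"))
```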
Step 5 — Health checks that detect real user impact (not just TCP)
Health checks must validate: TCP connect, TLS handshake, valid 200 from /__health, and also a test for actual response payload integrity (e.g., specific JSON key). Configure checks to hit the CDN edge (example: cdn-a.example-cdn.com/__health?no_cache=1) with a special header to bypass cache, and include an origin check so you know whether the issue is edge or origin.
- Set checks to run from multiple geographic locations (Route53/NS1 support multi-location checks).
- Use synthetic monitoring (Checkly, Datadog Synthetics, Pingdom) to run multi-step checks (login, API call, page render) and alert on failures.
- Log health-check failures to your incident system (PagerDuty) and to your GitOps pipeline (operator review).
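The payload-integrity part of such a check can be sketched in Python's standard library. Assumptions are flagged inline: the required JSON key ("status": "ok") and the X-Health-No-Cache probe header are illustrative, matching the heartbeat convention used elsewhere in this article.

```python
import json
import ssl
import urllib.error
import urllib.request

def validate_health_payload(status: int, body: bytes,
                            required_key: str = "status") -> bool:
    """A 200 alone isn't enough: require a well-formed JSON body
    containing the expected key (required_key is illustrative)."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        return False
    return payload.get(required_key) == "ok"

def probe_edge(url: str, timeout: float = 5.0) -> bool:
    """Full-path probe: TCP connect, TLS handshake, HTTP GET, payload check."""
    req = urllib.request.Request(url, headers={"X-Health-No-Cache": "1"})
    ctx = ssl.create_default_context()  # verifies the edge certificate
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return validate_health_payload(resp.status, resp.read())
    except (urllib.error.URLError, ssl.SSLError, TimeoutError):
        return False
```

Running probe_edge against both cdn-a and cdn-b hostnames plus the origin host tells you in one pass whether a failure is edge-specific or origin-wide.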
Step 6 — Keep caching and security controls in sync
When you failover, you want to avoid cache stampedes or inconsistent responses. Strategies:
- Use identical cache-control policies in responses. Where CDNs allow edge rules, version them in IaC so both CDNs apply the same cache key logic.
- Implement stale-while-revalidate and stale-if-error so requests are served during short origin/CDN flapping.
- Replicate WAF rules and bot protections into both CDN vendors via IaC (Terraform provider for Fastly, CloudFront managed rules, Akamai rules, etc.).
IaC example: keep WAF rules in Terraform
Use provider-specific modules but keep rule definitions in a central format (YAML/JSON) that your pipeline translates into provider calls. Example pseudo-structure:
# Pseudo: shared WAF rule source
locals {
  waf_rules = jsondecode(file("${path.module}/waf-rules.json"))
}

# Deploy to CloudFront / Fastly using provider modules
module "cloudfront_waf" {
  source = "./modules/cloudfront-waf"
  rules  = local.waf_rules
}

module "fastly_waf" {
  source = "./modules/fastly-waf"
  rules  = local.waf_rules
}
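The pipeline translation step can be sketched in Python. To be clear about assumptions: the central rule schema (id, priority, action, match) and both provider payload shapes below are simplified stand-ins, not the real AWS WAFv2 or Fastly APIs; a real pipeline would emit the exact structures those providers' Terraform resources expect.

```python
def to_cloudfront(rule: dict) -> dict:
    """Hypothetical translation into a WAFv2-style rule statement."""
    return {
        "Name": rule["id"],
        "Priority": rule["priority"],
        "Action": {rule["action"].capitalize(): {}},
        "Statement": {"ByteMatchStatement": {"SearchString": rule["match"]}},
    }

def to_fastly(rule: dict) -> dict:
    """Hypothetical translation into a Fastly-style condition/response pair."""
    return {
        "name": rule["id"],
        "priority": rule["priority"],
        "condition": f'req.url ~ "{rule["match"]}"',
        "response": "block" if rule["action"] == "block" else "pass",
    }

def translate(rules: list[dict]) -> dict:
    """One source of truth, two provider payloads."""
    return {
        "cloudfront": [to_cloudfront(r) for r in rules],
        "fastly": [to_fastly(r) for r in rules],
    }
```

Because both outputs derive from one input, any rule change lands on both CDNs in the same CI run, which is the property you need for failover to preserve security posture.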
Step 7 — Automation: cache warm-up and purge orchestration
Failover is smoother if the secondary CDN has some warmed cache. Two practical approaches:
- Proactive seeding: Use a controlled crawler to request the most popular URLs via CDN-B on a schedule (e.g., when you deploy a new version or after a failover) — run from multiple regions to populate regional POPs.
- On-demand seeding post-failover: A failover webhook triggers a cache-seed job and partial purge for seamless transition.
Example cache-warm script (bash + GNU parallel):
#!/usr/bin/env bash
set -euo pipefail
urls_file=popular-urls.txt
# {} is replaced by each path from the file; plain GETs let the edge cache the responses
parallel -a "$urls_file" -j50 curl -s -o /dev/null "https://www.example.com/{}"
Step 8 — Observability and verification
Operational success depends on observability:
- Synthetic checks (global) and real-user monitoring (RUM) to detect degradations not captured by health checks.
- Edge metrics from both CDNs (requests, cache hit ratio, error rates) pushed into a single dashboard (Prometheus/Grafana or Datadog).
- Alerting rules that combine multiple signals: DNS health check failure + increased origin 5xx rate = urgent incident.
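The combined-signal rule can be sketched as a small pure function. The thresholds here (5% origin 5xx, three synthetic failures, 1% warning floor) are illustrative placeholders that should be replaced with your own SLO-derived values.

```python
def incident_severity(dns_health_failed: bool, origin_5xx_rate: float,
                      synthetic_failures: int) -> str:
    """Combine independent signals into one severity; thresholds are
    illustrative and should match your own SLOs."""
    if dns_health_failed and origin_5xx_rate > 0.05:
        return "urgent"    # edge AND origin degraded: page immediately
    if dns_health_failed or synthetic_failures >= 3:
        return "high"      # one strong signal: investigate now
    if origin_5xx_rate > 0.01:
        return "warning"
    return "ok"
```

Combining signals this way avoids paging on a single flapping health check while still escalating fast when edge and origin degrade together.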
Step 9 — Runbook for failover (play-by-play)
Keep a short checklist for on-call engineers. Example automated-first runbook:
- Alert triggers: Route53 health check fails for CDN-A OR synthetic checks degrade above threshold.
- Validate: Inspect CDN provider status pages and vendor BGP/POPs (ThousandEyes/BGPStream) to confirm it's CDN-A.
- DNS: Confirm Route53/NS1 has marked CDN-A unhealthy and switched traffic to CDN-B (run dig +short www.example.com from multiple regions).
- Cache: Trigger the cache-warm job for CDN-B for the top N URLs and verify 200 responses and cache headers from the edge.
- Security: Verify WAF logs on CDN-B and ensure rate limits are engaged to avoid overloads.
- Rollback: If CDN-B shows errors, revert DNS weight back and open incident for origin debugging.
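The decision branches above can be encoded so automation and on-call engineers agree on the next action. A sketch under assumptions: the action names and the 2% CDN-B error-rate rollback threshold are illustrative, not a standard.

```python
def next_runbook_action(cdn_a_healthy: bool, dns_on_secondary: bool,
                        cdn_b_error_rate: float) -> str:
    """Encode the runbook's branches; the 2% cutoff is illustrative."""
    if cdn_a_healthy and dns_on_secondary:
        return "restore-primary"            # incident over, fail back
    if not cdn_a_healthy and not dns_on_secondary:
        return "force-failover"             # DNS hasn't caught up yet
    if dns_on_secondary and cdn_b_error_rate > 0.02:
        return "rollback-and-open-incident" # secondary is also failing
    if dns_on_secondary:
        return "warm-cache-and-verify"      # failover landed, stabilize
    return "monitor"
```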
Advanced strategies and considerations (2026)
Advanced teams in 2026 add:
- Multi-provider traffic orchestration platforms that provide A/B/geo routing and observability across CDNs.
- Using RPKI and BGP monitoring to detect upstream routing anomalies that manifest as CDN outages.
- Edge function parity testing in CI so feature parity across CDNs is validated before cutovers.
- Using DNS over HTTPS (DoH) and encrypted-DNS-aware steering where applicable to bypass ISP DNS caching issues.
Practical pitfalls and how to avoid them
- Don’t make your CDN the DNS authority: If you use Cloudflare as both authoritative DNS and primary CDN, an outage at Cloudflare can take DNS down. Keep authoritative DNS independent or have a multi-authoritative strategy.
- TTL illusions: Very low TTLs speed failover but increase DNS query volume. Use ~30–60s for critical records and cache fingerprinting to reduce load.
- Cache incoherence: If CDNs use different cache keys, users may see inconsistent content. Standardize cache keys and headers across CDNs via IaC.
- WAF divergence: If WAF rules differ, security posture changes during failover. Keep rules in a single source of truth and deploy to all CDNs automatically.
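Drift between the two deployed rulesets can be detected mechanically as part of CI. A minimal sketch, assuming each ruleset is exported as a mapping from rule id to its definition (the export format is an assumption, not a provider API).

```python
def waf_divergence(rules_a: dict[str, dict], rules_b: dict[str, dict]) -> dict:
    """Compare two deployed rulesets (rule id -> definition) and report
    drift so failover doesn't silently change security posture."""
    only_a = sorted(set(rules_a) - set(rules_b))
    only_b = sorted(set(rules_b) - set(rules_a))
    changed = sorted(k for k in set(rules_a) & set(rules_b)
                     if rules_a[k] != rules_b[k])
    return {"missing_on_b": only_a, "missing_on_a": only_b, "changed": changed}
```

Fail the pipeline whenever any of the three lists is non-empty, so divergence is caught at deploy time rather than mid-incident.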
Real-world example: From Cloudflare primary to Fastly secondary (condensed)
Concrete steps a team followed after the Jan 2026 Cloudflare incident:
- Moved authoritative DNS to Route53 (kept Cloudflare as a proxied CDN for the subdomain) to decouple DNS.
- Provisioned Fastly service matching cache key and edge logic; configured mTLS to the origin.
- Created Route53 health checks for cdn-primary.example-cdn.com and cdn-secondary.example-cdn.com pointing to /__health?no_cache=1.
- Configured Route53 weighted records with Terraform and set TTL=60s; automated seeding and WAF sync via CI pipeline.
- Verified failover by simulating an edge outage (blocking Cloudflare IP ranges from health check) and observed automatic switch to Fastly within two minutes; warm-up script executed and cache hit ratio normalized.
Checklist to implement today (actionable takeaways)
- Move authoritative DNS off any single CDN provider or enable secondary DNS replication.
- Standardize cache keys and responses; add /__health endpoint that bypasses cache with a header.
- Provision a second CDN and configure origin authentication (mTLS or signed headers).
- Create DNS health checks targeting the CDN edge and configure failover/weighted records via IaC (Terraform).
- Automate WAF and edge logic deployment to both CDNs using a central ruleset and CI pipeline.
- Implement cache warm-up scripts and synthetic monitoring; add alerts for combined signals.
Closing thoughts — the cost of resilience
Multi-CDN resilience is not free: it requires engineering time, additional vendor costs, and operational discipline. In 2026, however, the cost of not having multi-CDN protection is higher — outages cascade, customers lose trust, and regulatory scrutiny increases when availability lapses. Design your multi-CDN strategy to be automated, testable, and reversible. Treat the system as code: health checks, steering, cache key rules and WAF definitions must be under version control and part of your CI/CD pipeline.
If you only remember one thing: keep DNS independent, standardize cache rules, and automate health-driven routing.
Call to action
Ready to deploy a resilient multi-CDN stack? Get the companion Git repo with Terraform modules, CI examples, and cache-warm scripts we used for the examples above. Download the runbook, or contact us for an architecture review tailored to your traffic patterns and compliance needs.