Runbook: Emergency DNS and CDN Switches During a Wide-Scale Outage
Concise, executable runbook for on-call teams to switch DNS, enable CDN failover, and update cache rules during major control-plane outages.
When a major CDN or cloud provider control plane goes dark, as in the high-profile Cloudflare/AWS incidents that spiked outage reports in January 2026, your on-call team must move faster than typical playbooks allow. This runbook gives concise, executable steps to switch DNS, enable failover origins, update caching rules, and keep stakeholders informed with minimal manual friction.
Why this matters in 2026
Control-plane incidents are now more frequent and more consequential as enterprises consolidate on managed CDNs and global edge platforms. In late 2025 and early 2026 the industry accelerated toward multi-CDN and multi-control-plane orchestration, but many deployments still depend on a single provider's APIs and routing. This runbook assumes you need to perform an urgent provider switch or failover with minimal tooling — or to execute quick API-driven changes from your incident bridge.
Quick decisions (first 5 minutes)
- Confirm the incident: Check provider status pages, BGP visibility, and internal telemetry. If your origin continues to respond but the CDN control plane is failing, a DNS-level or alternate-CDN switch is viable.
- Declare the scope: Is the outage global or regional? Are only control-plane APIs affected while data-plane can still serve cached content? This drives actions (DNS TTL changes vs immediate DNS switch).
- Choose the path: (A) DNS failover to an alternate CDN or direct-to-origin; (B) Enable failover/origin fallback in the existing CDN (if control plane available); (C) Short-term route via a secondary provider using DNS weighted/geoproximity routing.
- Assign roles: Ops lead, DNS engineer, CDN engineer, communications owner. Keep roles fixed to avoid overlap.
Pre-incident configuration (what to have ready)
These are the hygiene items every team should prepare in advance; they dramatically reduce time-to-failover.
- Emergency TTLs: Keep TTLs low (60–300s) on critical records, or have a tested plan to lower them instantly via API.
- Pre-provisioned secondary providers: Have accounts and service mappings with at least one alternative CDN and a DNS provider that supports weighted/geolocation and API updates.
- Origin access & auth: Ensure origin accepts direct traffic (CORS, host header, TLS certificate) and test origin response headers for caching behavior.
- Automated scripts: Maintain tested scripts for DNS updates, CDN config toggles, and cache purges in a secure repo. Store API tokens in secrets manager with on-call access.
- Monitoring & health checks: Configure Route 53 or external health checks that can trigger failover routing automatically when needed.
- DNSSEC & delegation knowledge: Know when DNSSEC or registrar locks will block quick changes. Keep registrar contacts for emergency changes.
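The automated-scripts item above is easiest to enforce with a preflight check that fails loudly when a credential is missing. This is a minimal sketch; the variable names are assumptions, so match them to whatever your secrets manager actually injects.

```shell
#!/usr/bin/env bash
# Preflight: confirm on-call API credentials are loaded before an incident.
# The variable names below are assumptions; align them with your secrets manager.
REQUIRED_VARS="CF_API_TOKEN FASTLY_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY"

preflight() {
  missing=0
  for v in $REQUIRED_VARS; do
    # Indirect expansion via eval so the list stays a plain string
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "MISSING: $v"
      missing=$((missing + 1))
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "preflight OK"
  else
    return 1
  fi
}
```

Call `preflight` at the top of every automation script so a missing token aborts before any DNS or CDN change is attempted.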
Runbook: Step-by-step execution
Phase A — Triage and immediate mitigation (0–10 minutes)
- Open the incident channel and document: affected domains, start time, symptoms, initial hypothesis.
- Run fast checks from multiple vantage points: dig, curl, traceroute, and a few public HTTP probes (e.g., perf-tools or custom probes).
- If cached content is still accessible and acceptable, extend cache TTLs or pause purges to preserve traffic.
- If the CDN control plane is the problem but data-plane responds, avoid mass cache purges and postpone origin switches.
# Example checks: confirm current resolution, then probe a specific IP directly
dig +short yoursite.com
# --resolve pins yoursite.com to 1.2.3.4, bypassing DNS to test an origin or edge IP
curl -I https://yoursite.com --resolve yoursite.com:443:1.2.3.4
Phase B — DNS-level failover (10–30 minutes)
DNS changes are the most reliable method when you must retarget traffic away from a provider with a failing control plane.
- Lower TTL if it isn’t already low. Short TTLs let you flip quickly in future waves.
- Decide routing strategy:
- Weighted routing: split traffic to a secondary CDN gradually for verification.
- Failover routing: full cutover to alternate IPs or CNAMEs when health checks fail.
- Geolocation routing: divert only affected regions.
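The weighted option above can be pre-staged as a Route 53 change batch kept in the incident repo. The sketch below writes one; the record name, CDN hostnames, and 90/10 split are illustrative assumptions to adjust before storing.

```shell
# Generate a pre-staged Route 53 change batch that splits www traffic 90/10
# between the primary and a secondary CDN. Hostnames and weights are
# illustrative assumptions; edit before committing to your incident repo.
cat > change.json <<'EOF'
{
  "Comment": "Emergency weighted failover to secondary CDN",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.yoursite.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{"Value": "primary.cdn.example.net"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.yoursite.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "secondary.cdn.example.net"}]
      }
    }
  ]
}
EOF
```

Shifting more traffic later is just another change batch with new Weight values on the same SetIdentifiers.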
- Execute the DNS switch via API. An AWS CLI example for Route 53 is below, followed by a Cloudflare example.
- Verify propagation with dig +short from multiple public resolvers and via CDN logs. Use several vantage points (Cloud monitoring, RIPE Atlas, or runbook-specified probes).
# Route53 example (change-resource-record-sets JSON pre-made)
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://change.json
# Cloudflare: set CNAME to secondary
curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/YOUR_ZONE/dns_records/RECORD_ID" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"type":"CNAME","name":"www","content":"secondary.cdn.example.net","ttl":120,"proxied":false}'
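The propagation verification step above can be scripted as a poll across public resolvers. This is a sketch under assumptions: the resolver list and expected CNAME target are placeholders, and `lookup` wraps `dig` so drills can stub it out.

```shell
#!/usr/bin/env bash
# Poll several public resolvers until they all return the expected CNAME
# target. Resolver list and EXPECTED are assumptions; adjust per incident.
EXPECTED="secondary.cdn.example.net"
RESOLVERS="1.1.1.1 8.8.8.8 9.9.9.9"

# lookup <resolver> <name>: thin wrapper around dig so tests can stub it
lookup() { dig +short @"$1" CNAME "$2"; }

check_propagation() {
  local name="$1" ok=0 total=0
  for r in $RESOLVERS; do
    total=$((total + 1))
    answer="$(lookup "$r" "$name")"
    case "$answer" in
      "$EXPECTED"*) ok=$((ok + 1)) ;;  # match with or without trailing dot
    esac
  done
  echo "$ok/$total resolvers see $EXPECTED"
  [ "$ok" -eq "$total" ]
}
```

Run `check_propagation www.yoursite.com` after the DNS change and repeat until it exits 0, then confirm against CDN logs as above.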
Phase C — CDN failover and origin fallback (parallel)
If your primary CDN supports origin failover or secondary origins, toggle those before a DNS cut if the control plane is available.
- Cloudflare: enable Origin Rules or configure a Load Balancer with a secondary pool — use API if the portal is flaky.
- Fastly: enable Shielding fallback/backends via API and switch the active backend.
- Akamai: adjust property IP allowlists and host rules via your automation, especially if the Luna (Akamai Control Center) UI is unresponsive.
# Fastly backend switch (clone the active version first; locked versions are immutable)
curl -X PUT "https://api.fastly.com/service/SERVICE_ID/version/VERSION/backend/BACKEND_NAME" \
-H "Fastly-Key: $FASTLY_KEY" \
--data "address=secondary-origin.example.net&port=443&use_ssl=true"
# Then activate the edited version:
# curl -X PUT "https://api.fastly.com/service/SERVICE_ID/version/VERSION/activate" -H "Fastly-Key: $FASTLY_KEY"
Phase D — Cache control and invalidation (10–45 minutes)
Cache strategy during failover matters. You want to avoid thrashing origin while ensuring users get fresh content.
- Set short cache-control for dynamic assets if you need rapid rollbacks (Cache-Control: public, max-age=60, s-maxage=60).
- Use surrogate keys for bulk invalidation. Purge by key rather than URL where possible.
- Execute targeted purges: avoid purging the entire CDN if you can purge only critical assets. Full purges cause high load and may be rate-limited.
- Monitor origin load and throttle cache TTL changes if origin saturates.
# Cloudflare purge by URL
curl -X POST "https://api.cloudflare.com/client/v4/zones/YOUR_ZONE/purge_cache" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"files":["https://yoursite.com/index.html"]}'
# Fastly purge by surrogate key (keys are sent in the Surrogate-Key header)
curl -X POST "https://api.fastly.com/service/SERVICE_ID/purge" \
-H "Fastly-Key: $FASTLY_KEY" \
-H "Fastly-Soft-Purge: 1" \
-H "Surrogate-Key: my-key"
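The targeted-purge guidance above can be enforced with a small batching wrapper so a long URL list never becomes one mass purge. This is a sketch: the batch size of 30 assumes Cloudflare's per-request file limit (verify against current docs), and `purge_batch` is a stub to replace with the real purge call.

```shell
#!/usr/bin/env bash
# Purge a long URL list in capped batches instead of one mass purge.
# BATCH=30 assumes Cloudflare's per-request file limit; verify for your plan.
BATCH=30

purge_batch() {
  # Stub: replace the echo with the real purge call (e.g. the Cloudflare
  # purge_cache curl above), passing "$@" as the file list.
  echo "purging $# urls"
}

purge_all() {
  batch=()
  for url in "$@"; do
    batch+=("$url")
    if [ "${#batch[@]}" -eq "$BATCH" ]; then
      purge_batch "${batch[@]}"
      batch=()
    fi
  done
  # Flush the final partial batch, if any
  if [ "${#batch[@]}" -gt 0 ]; then
    purge_batch "${batch[@]}"
  fi
}
```

Invoke as `purge_all "${critical_urls[@]}"`; add a short sleep between batches if the API starts rate-limiting.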
Phase E — Verification and observability (10–60 minutes)
- Run smoke tests on key pages and APIs from multiple locations.
- Confirm TLS handshake and certificate validity when switching providers — SNI and host header mismatches are common.
- Watch logs for 5xx spikes and origin latency increases.
- Use synthetic checks and real user metrics to determine when traffic is stable.
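The smoke tests above can be a simple loop over critical endpoints. This is a sketch under assumptions: the endpoint list is a placeholder, and `fetch_status` wraps curl so drills can stub it without network access.

```shell
#!/usr/bin/env bash
# Smoke-test critical endpoints after a failover. ENDPOINTS is a placeholder
# list; replace with your key pages and API health checks.
ENDPOINTS="https://yoursite.com/ https://yoursite.com/api/health"

# fetch_status <url>: prints the HTTP status code; stub this in drills
fetch_status() { curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"; }

smoke() {
  fail=0
  for url in $ENDPOINTS; do
    code="$(fetch_status "$url")"
    case "$code" in
      2*|3*) echo "OK   $code $url" ;;
      *)     echo "FAIL $code $url"; fail=1 ;;
    esac
  done
  return "$fail"
}
```

Run `smoke` from several vantage points; a nonzero exit means at least one endpoint is unhealthy and failover verification is not complete.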
Automation scripts & quick templates
Keep these minimal, trusted scripts in your incident repo and accessible via your secrets manager. Replace tokens with secure references.
# Minimal Cloudflare DNS switch script (bash)
CF_ZONE_ID=your-zone-id   # Cloudflare zone ID (hex string, not the domain name)
RECORD_ID=record-id
NEW_CNAME=secondary.cdn.example.net
curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$RECORD_ID" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data "{\"type\":\"CNAME\",\"name\":\"www\",\"content\":\"$NEW_CNAME\",\"ttl\":120}" | jq .
# Route53 weighted record change example (change.json must be prepared ahead)
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://change.json
Incident communication — concise status updates
On-call teams must publish short, factual updates every 10–15 minutes until stabilized. Use this template and pin it to the incident channel.
Status: Investigating — CDN control-plane degraded for provider X. Impact: partial site outage for US-EAST. Next update: +15 minutes. Action: DNS weighted failover to secondary CDN initiated; TTL set to 120s. Owner: @dns-lead.
Include: timestamp, impact, mitigation actions completed, next steps, and expected customer-facing behavior.
Rollback and post-incident
- Plan rollback criteria before cutover: monitored error rate below X% for Y minutes, or provider confirms restoration.
- Reverse steps in the same controlled manner: re-enable primary CDN origin, switch DNS weight back to primary gradually.
- Don’t immediately purge caches during rollback — prefer gradual weight shifting to avoid origin floods.
- Run a postmortem within 72 hours. Capture timelines, decisions, and automation gaps. Assign remediation tasks.
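The gradual weight-shift in the rollback steps above can be sketched as a staged loop. Assumptions: `apply_weight` is a stub for your Route 53 weighted-record UPSERT, and the step schedule and hold time are illustrative.

```shell
#!/usr/bin/env bash
# Shift traffic back to the primary in stages rather than one cut.
# STEPS and HOLD_SECONDS are assumptions; tune to your error-rate criteria.
STEPS="10 25 50 100"
HOLD_SECONDS=300

# Stub: replace with your Route 53 UPSERT of the primary record's Weight
apply_weight() { echo "primary weight -> $1"; }

shift_back() {
  for w in $STEPS; do
    apply_weight "$w"
    # Hold at each step; abort manually if error rates climb
    if [ "$w" -lt 100 ]; then
      sleep "$HOLD_SECONDS"
    fi
  done
}
```

Run `shift_back` only after the rollback criteria above are met, and watch origin load at every hold point.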
Common pitfalls and mitigations
- TTL illusions: Long TTLs can delay failover. Mitigation: plan for emergency TTLs or use low TTLs on CNAME/ALIAS records for critical hosts.
- DNSSEC/Registrar blocks: Registrar locks can prevent quick changes. Keep emergency contacts and permission delegations on record.
- Certificate and SNI mismatches: Secondary CDN must present valid certificates for your host or use ACM/managed certs on both providers.
- Rate limits: API rate limits on purge/change endpoints may block mass operations. Use surrogate-key purges and staged DNS changes.
- Origin overload: Failing over to origin directly can swamp backend. Use staged traffic re-route and autoscaling rules.
2026 trends and how they affect this runbook
As of 2026, three trends shape incident response for CDN/control-plane outages:
- Multi-CDN orchestration platforms matured: Many teams adopted orchestration to automate fast failovers. If you have a multi-CDN controller, integrate this runbook's actions into its playbooks.
- Edge compute increases complexity: When logic runs at the edge, a provider control-plane outage can silently break routing rules. Validate edge worker fallbacks in runbook tests.
- API-based DNS and CDN automation is mainstream: Rely on well-tested API scripts, not manual UI steps. 2026 saw organizations standardize incident automation and expose it via secure runbooks to on-call.
Checklist — Incident cheat-sheet
- Confirm incident source & scope
- Assign roles in incident channel
- Lower TTL (if safe) or ensure TTL already low
- Decide DNS vs CDN config failover
- Execute API-driven DNS change or enable secondary origin
- Update cache headers & perform targeted purges
- Monitor origin load and user-facing errors
- Communicate every 10–15 minutes using the template
- Initiate rollback only after defined stability criteria
Appendix: Useful commands and verification
- Verify DNS resolution from multiple resolvers
dig +short @1.1.1.1 yoursite.com
dig +short @8.8.8.8 yoursite.com
- Check HTTP headers and CDN response
curl -I -s https://yoursite.com | egrep -i "server:|via:|x-cache|cache-control"
- Trace to identify where traffic drops
traceroute yoursite.com
mtr -r -c 50 yoursite.com
Final actionable takeaways
- Prepare multi-provider playbooks: Pre-provisioned secondary CDNs and DNS endpoints cut 30–90 minutes from incident time in our benchmarks.
- Automate the common path: Keep minimal, idempotent API scripts for DNS updates, backend switches, and cache invalidations in your secrets-backed runbook repo.
- Communicate with cadence: Short, factual updates calm stakeholders and reduce interruption costs.
- Test your runbook: Run quarterly chaos drills simulating control-plane failures to verify TLS, origin capacity, and TTL behaviors.
Call to action
If your team doesn't already have these scripts and health checks in a secured runbook repo, start today: create the API scripts listed above, pre-provision a secondary CDN, and run a dry failover drill within the next 30 days. Need a checklist or automation templates tailored to your environment? Reach out to our team for an incident-ready runbook audit and custom automation pack.