Runbook: Emergency DNS and CDN Switches During a Wide-Scale Outage
Concise, executable runbook for on-call teams to switch DNS, enable CDN failover, and update cache rules during major control-plane outages.
When a major CDN or cloud provider control plane goes dark, as in the high-profile Cloudflare/AWS incidents that spiked outage reports in January 2026, your on-call team must move faster than typical playbooks allow. This runbook gives concise, executable steps to switch DNS, enable failover origins, update caching rules, and keep stakeholders informed with minimal manual friction.
Why this matters in 2026
Control-plane incidents are now more frequent and more consequential as enterprises consolidate on managed CDNs and global edge platforms. In late 2025 and early 2026 the industry accelerated toward multi-CDN and multi-control-plane orchestration, but many deployments still depend on a single provider's APIs and routing. This runbook assumes you need to perform an urgent provider switch or failover with minimal tooling — or to execute quick API-driven changes from your incident bridge.
Quick decisions (first 5 minutes)
- Confirm the incident: Check provider status pages, BGP visibility, and internal telemetry. If your origin continues to respond but the CDN control plane is failing, a DNS-level or alternate-CDN switch is viable.
- Declare the scope: Is the outage global or regional? Are only control-plane APIs affected while data-plane can still serve cached content? This drives actions (DNS TTL changes vs immediate DNS switch).
- Choose the path: (A) DNS failover to an alternate CDN or direct-to-origin; (B) Enable failover/origin fallback in the existing CDN (if control plane available); (C) Short-term route via a secondary provider using DNS weighted/geoproximity routing.
- Assign roles: Ops lead, DNS engineer, CDN engineer, communications owner. Keep roles fixed to avoid overlap.
Pre-incident configuration (what to have ready)
These are the hygiene items every team should prepare in advance; they dramatically reduce time-to-failover.
- Emergency TTLs: Keep TTLs low (60–300s) on critical records, or have a tested plan to lower them instantly via API.
- Pre-provisioned secondary providers: Have accounts and service mappings with at least one alternative CDN and a DNS provider that supports weighted/geolocation and API updates.
- Origin access & auth: Ensure origin accepts direct traffic (CORS, host header, TLS certificate) and test origin response headers for caching behavior.
- Automated scripts: Maintain tested scripts for DNS updates, CDN config toggles, and cache purges in a secure repo. Store API tokens in secrets manager with on-call access.
- Monitoring & health checks: Configure Route 53 or external health checks that can trigger failover routing automatically when needed.
- DNSSEC & delegation knowledge: Know when DNSSEC or registrar locks will block quick changes. Keep registrar contacts for emergency changes.
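The automated-scripts item above is easiest to enforce with a preflight check that fails loudly when a credential is missing. This is a minimal sketch; the variable names are assumptions, so match them to whatever your secrets manager actually injects.

```shell
#!/usr/bin/env bash
# Preflight: confirm on-call API credentials are loaded before an incident.
# The variable names below are assumptions; align them with your secrets manager.
REQUIRED_VARS="CF_API_TOKEN FASTLY_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY"

preflight() {
  missing=0
  for v in $REQUIRED_VARS; do
    # Indirect expansion via eval so the list stays a plain string
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "MISSING: $v"
      missing=$((missing + 1))
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "preflight OK"
  else
    return 1
  fi
}
```

Call `preflight` at the top of every automation script so a missing token aborts before any DNS or CDN change is attempted.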
Runbook: Step-by-step execution
Phase A — Triage and immediate mitigation (0–10 minutes)
- Open the incident channel and document: affected domains, start time, symptoms, initial hypothesis.
- Run fast checks from multiple vantage points: dig, curl, traceroute, and a few public HTTP probes (e.g., perf-tools or custom probes).
- If cached content is still accessible and acceptable, extend cache TTLs or pause purges to preserve traffic.
- If the CDN control plane is the problem but data-plane responds, avoid mass cache purges and postpone origin switches.
# Example checks: confirm current resolution, then probe a specific IP directly
dig +short yoursite.com
# --resolve pins yoursite.com to 1.2.3.4, bypassing DNS to test an origin or edge IP
curl -I https://yoursite.com --resolve yoursite.com:443:1.2.3.4
Phase B — DNS-level failover (10–30 minutes)
DNS changes are the most reliable method when you must retarget traffic away from a provider with a failing control plane.
- Lower TTL if it isn’t already low. Short TTLs let you flip quickly in future waves.
- Decide routing strategy:
- Weighted routing: split traffic to a secondary CDN gradually for verification.
- Failover routing: full cutover to alternate IPs or CNAMEs when health checks fail.
- Geolocation routing: divert only affected regions.
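The weighted option above can be pre-staged as a Route 53 change batch kept in the incident repo. The sketch below writes one; the record name, CDN hostnames, and 90/10 split are illustrative assumptions to adjust before storing.

```shell
# Generate a pre-staged Route 53 change batch that splits www traffic 90/10
# between the primary and a secondary CDN. Hostnames and weights are
# illustrative assumptions; edit before committing to your incident repo.
cat > change.json <<'EOF'
{
  "Comment": "Emergency weighted failover to secondary CDN",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.yoursite.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{"Value": "primary.cdn.example.net"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.yoursite.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "secondary.cdn.example.net"}]
      }
    }
  ]
}
EOF
```

Shifting more traffic later is just another change batch with new Weight values on the same SetIdentifiers.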
- Execute the DNS switch via API. An AWS CLI example for Route 53 is below, followed by a Cloudflare example.
- Verify propagation with dig +short from multiple public resolvers and via CDN logs. Use several vantage points (Cloud monitoring, RIPE Atlas, or runbook-specified probes).
# Route53 example (change-resource-record-sets JSON pre-made)
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://change.json
# Cloudflare: set CNAME to secondary
curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/YOUR_ZONE/dns_records/RECORD_ID" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"type":"CNAME","name":"www","content":"secondary.cdn.example.net","ttl":120,"proxied":false}'
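The propagation verification step above can be scripted as a poll across public resolvers. This is a sketch under assumptions: the resolver list and expected CNAME target are placeholders, and `lookup` wraps `dig` so drills can stub it out.

```shell
#!/usr/bin/env bash
# Poll several public resolvers until they all return the expected CNAME
# target. Resolver list and EXPECTED are assumptions; adjust per incident.
EXPECTED="secondary.cdn.example.net"
RESOLVERS="1.1.1.1 8.8.8.8 9.9.9.9"

# lookup <resolver> <name>: thin wrapper around dig so tests can stub it
lookup() { dig +short @"$1" CNAME "$2"; }

check_propagation() {
  local name="$1" ok=0 total=0
  for r in $RESOLVERS; do
    total=$((total + 1))
    answer="$(lookup "$r" "$name")"
    case "$answer" in
      "$EXPECTED"*) ok=$((ok + 1)) ;;  # match with or without trailing dot
    esac
  done
  echo "$ok/$total resolvers see $EXPECTED"
  [ "$ok" -eq "$total" ]
}
```

Run `check_propagation www.yoursite.com` after the DNS change and repeat until it exits 0, then confirm against CDN logs as above.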
Phase C — CDN failover and origin fallback (parallel)
If your primary CDN supports origin failover or secondary origins, toggle those before a DNS cut if the control plane is available.
- Cloudflare: enable Origin Rules or configure a Load Balancer with a secondary pool — use API if the portal is flaky.
- Fastly: enable Shielding fallback/backends via API and switch the active backend.
- Akamai: adjust property IP allowlists and host rules via your automation, especially if the Luna (Akamai Control Center) UI is unresponsive.
# Fastly backend switch (clone the active version first; locked versions are immutable)
curl -X PUT "https://api.fastly.com/service/SERVICE_ID/version/VERSION/backend/BACKEND_NAME" \
-H "Fastly-Key: $FASTLY_KEY" \
--data "address=secondary-origin.example.net&port=443&use_ssl=true"
# Then activate the edited version:
# curl -X PUT "https://api.fastly.com/service/SERVICE_ID/version/VERSION/activate" -H "Fastly-Key: $FASTLY_KEY"
Phase D — Cache control and invalidation (10–45 minutes)
Cache strategy during failover matters. You want to avoid thrashing origin while ensuring users get fresh content.
- Set short cache-control for dynamic assets if you need rapid rollbacks (Cache-Control: public, max-age=60, s-maxage=60).
- Use surrogate keys for bulk invalidation. Purge by key rather than URL where possible.
- Execute targeted purges: avoid purging the entire CDN if you can purge only critical assets. Full purges cause high load and may be rate-limited.
- Monitor origin load and throttle cache TTL changes if origin saturates.
# Cloudflare purge by URL
curl -X POST "https://api.cloudflare.com/client/v4/zones/YOUR_ZONE/purge_cache" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"files":["https://yoursite.com/index.html"]}'
# Fastly purge by surrogate key (keys are sent in the Surrogate-Key header)
curl -X POST "https://api.fastly.com/service/SERVICE_ID/purge" \
-H "Fastly-Key: $FASTLY_KEY" \
-H "Fastly-Soft-Purge: 1" \
-H "Surrogate-Key: my-key"
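The targeted-purge guidance above can be enforced with a small batching wrapper so a long URL list never becomes one mass purge. This is a sketch: the batch size of 30 assumes Cloudflare's per-request file limit (verify against current docs), and `purge_batch` is a stub to replace with the real purge call.

```shell
#!/usr/bin/env bash
# Purge a long URL list in capped batches instead of one mass purge.
# BATCH=30 assumes Cloudflare's per-request file limit; verify for your plan.
BATCH=30

purge_batch() {
  # Stub: replace the echo with the real purge call (e.g. the Cloudflare
  # purge_cache curl above), passing "$@" as the file list.
  echo "purging $# urls"
}

purge_all() {
  batch=()
  for url in "$@"; do
    batch+=("$url")
    if [ "${#batch[@]}" -eq "$BATCH" ]; then
      purge_batch "${batch[@]}"
      batch=()
    fi
  done
  # Flush the final partial batch, if any
  if [ "${#batch[@]}" -gt 0 ]; then
    purge_batch "${batch[@]}"
  fi
}
```

Invoke as `purge_all "${critical_urls[@]}"`; add a short sleep between batches if the API starts rate-limiting.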
Phase E — Verification and observability (10–60 minutes)
- Run smoke tests on key pages and APIs from multiple locations.
- Confirm TLS handshake and certificate validity when switching providers — SNI and host header mismatches are common.
- Watch logs for 5xx spikes and origin latency increases.
- Use synthetic checks and real user metrics to determine when traffic is stable.
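The smoke tests above can be a simple loop over critical endpoints. This is a sketch under assumptions: the endpoint list is a placeholder, and `fetch_status` wraps curl so drills can stub it without network access.

```shell
#!/usr/bin/env bash
# Smoke-test critical endpoints after a failover. ENDPOINTS is a placeholder
# list; replace with your key pages and API health checks.
ENDPOINTS="https://yoursite.com/ https://yoursite.com/api/health"

# fetch_status <url>: prints the HTTP status code; stub this in drills
fetch_status() { curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"; }

smoke() {
  fail=0
  for url in $ENDPOINTS; do
    code="$(fetch_status "$url")"
    case "$code" in
      2*|3*) echo "OK   $code $url" ;;
      *)     echo "FAIL $code $url"; fail=1 ;;
    esac
  done
  return "$fail"
}
```

Run `smoke` from several vantage points; a nonzero exit means at least one endpoint is unhealthy and failover verification is not complete.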
Automation scripts & quick templates
Keep these minimal, trusted scripts in your incident repo and accessible via your secrets manager. Replace tokens with secure references.
# Minimal Cloudflare DNS switch script (bash)
CF_ZONE_ID=your-zone-id   # Cloudflare zone ID (hex string, not the domain name)
RECORD_ID=record-id
NEW_CNAME=secondary.cdn.example.net
curl -s -X PATCH "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$RECORD_ID" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data "{\"type\":\"CNAME\",\"name\":\"www\",\"content\":\"$NEW_CNAME\",\"ttl\":120}" | jq .
# Route53 weighted record change example (change.json must be prepared ahead)
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://change.json
Incident communication — concise status updates
On-call teams must publish short, factual updates every 10–15 minutes until stabilized. Use this template and pin it to the incident channel.
Status: Investigating — CDN control-plane degraded for provider X. Impact: partial site outage for US-EAST. Next update: +15 minutes. Action: DNS weighted failover to secondary CDN initiated; TTL set to 120s. Owner: @dns-lead.
Include: timestamp, impact, mitigation actions completed, next steps, and expected customer-facing behavior.
Rollback and post-incident
- Plan rollback criteria before cutover: monitored error rate below X% for Y minutes, or provider confirms restoration.
- Reverse steps in the same controlled manner: re-enable primary CDN origin, switch DNS weight back to primary gradually.
- Don’t immediately purge caches during rollback — prefer gradual weight shifting to avoid origin floods.
- Run a postmortem within 72 hours. Capture timelines, decisions, and automation gaps. Assign remediation tasks.
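The gradual weight-shift in the rollback steps above can be sketched as a staged loop. Assumptions: `apply_weight` is a stub for your Route 53 weighted-record UPSERT, and the step schedule and hold time are illustrative.

```shell
#!/usr/bin/env bash
# Shift traffic back to the primary in stages rather than one cut.
# STEPS and HOLD_SECONDS are assumptions; tune to your error-rate criteria.
STEPS="10 25 50 100"
HOLD_SECONDS=300

# Stub: replace with your Route 53 UPSERT of the primary record's Weight
apply_weight() { echo "primary weight -> $1"; }

shift_back() {
  for w in $STEPS; do
    apply_weight "$w"
    # Hold at each step; abort manually if error rates climb
    if [ "$w" -lt 100 ]; then
      sleep "$HOLD_SECONDS"
    fi
  done
}
```

Run `shift_back` only after the rollback criteria above are met, and watch origin load at every hold point.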
Common pitfalls and mitigations
- TTL illusions: Long TTLs can delay failover. Mitigation: plan for emergency TTLs or use low TTLs on CNAME/ALIAS records for critical hosts.
- DNSSEC/Registrar blocks: Registrar locks can prevent quick changes. Keep emergency contacts and permission delegations on record.
- Certificate and SNI mismatches: Secondary CDN must present valid certificates for your host or use ACM/managed certs on both providers.
- Rate limits: API rate limits on purge/change endpoints may block mass operations. Use surrogate-key purges and staged DNS changes.
- Origin overload: Failing over to origin directly can swamp backend. Use staged traffic re-route and autoscaling rules.
2026 trends and how they affect this runbook
As of 2026, three trends shape incident response for CDN/control-plane outages:
- Multi-CDN orchestration platforms matured: Many teams adopted orchestration to automate fast failovers. If you have a multi-CDN controller, integrate this runbook's actions into its playbooks.
- Edge compute increases complexity: When logic runs at the edge, a provider control-plane outage can silently break routing rules. Validate edge worker fallbacks in runbook tests.
- API-based DNS and CDN automation is mainstream: Rely on well-tested API scripts, not manual UI steps. 2026 saw organizations standardize incident automation and expose it via secure runbooks to on-call.
Checklist — Incident cheat-sheet
- Confirm incident source & scope
- Assign roles in incident channel
- Lower TTL (if safe) or ensure TTL already low
- Decide DNS vs CDN config failover
- Execute API-driven DNS change or enable secondary origin
- Update cache headers & perform targeted purges
- Monitor origin load and user-facing errors
- Communicate every 10–15 minutes using the template
- Initiate rollback only after defined stability criteria
Appendix: Useful commands and verification
- Verify DNS resolution from multiple resolvers
dig +short @1.1.1.1 yoursite.com
dig +short @8.8.8.8 yoursite.com
- Check HTTP headers and CDN response
curl -I -s https://yoursite.com | egrep -i "server:|via:|x-cache|cache-control"
- Trace to identify where traffic drops
traceroute yoursite.com
mtr -r -c 50 yoursite.com
Final actionable takeaways
- Prepare multi-provider playbooks: Pre-provisioned secondary CDNs and DNS endpoints cut 30–90 minutes from incident time in our benchmarks.
- Automate the common path: Keep minimal, idempotent API scripts for DNS updates, backend switches, and cache invalidations in your secrets-backed runbook repo.
- Communicate with cadence: Short, factual updates calm stakeholders and reduce interruption costs.
- Test your runbook: Run quarterly chaos drills simulating control-plane failures to verify TLS, origin capacity, and TTL behaviors.
Call to action
If your team doesn't already have these scripts and health checks in a secured runbook repo, start today: create the API scripts listed above, pre-provision a secondary CDN, and run a dry failover drill within the next 30 days. Need a checklist or automation templates tailored to your environment? Reach out to our team for an incident-ready runbook audit and custom automation pack.