Domain 7 · Lesson 5 of 6
Business Continuity & Disaster Recovery
BCP vs DRP — Understanding the Distinction
| | BCP (Business Continuity Plan) | DRP (Disaster Recovery Plan) |
|---|---|---|
| Focus | Business operations — people, process, and communication | IT systems and data recovery |
| Perspective | Business continuity (keep the company running) | Technical recovery (restore IT systems) |
| Relationship | Parent plan — the umbrella strategy | Subset of BCP — the IT-specific component |
| Goal | Keep business running at minimum viable capacity during disaster | Restore normal IT operations as quickly as possible |
BIA — Business Impact Analysis
The BIA is the foundation of both BCP and DRP. It identifies: (1) which business processes are critical, (2) the maximum tolerable downtime for each, and (3) the financial/operational impact of each process being unavailable. BIA output DRIVES the RTO/RPO targets. The business owns these tolerances; IT designs the recovery to meet them.
Recovery Objectives — Set by Business, Not IT
| Metric | Full Name | Meaning | Who Sets It |
|---|---|---|---|
| MTD | Maximum Tolerable Downtime | The absolute maximum time before the business FAILS — cannot operate at all | Executive / Business Owner |
| RTO | Recovery Time Objective | Target time to restore systems; must be short enough that RTO + WRT ≤ MTD | IT (constrained by MTD) |
| RPO | Recovery Point Objective | Maximum acceptable data loss, measured in TIME (e.g., RPO=15min = lose max 15 min of transactions) | Business / Operations |
| WRT | Work Recovery Time | Time to validate and restore data AFTER systems are back up (re-enter transactions, reconcile) | Operations team |
Critical Formula: RTO + WRT ≤ MTD
The total recovery timeline (IT systems up + data validated) must fit within the maximum tolerable downtime. If RTO + WRT exceeds MTD, the business will fail before IT can recover it.
Example:
MTD = 8 hours. RTO = 4 hours (systems back up). WRT = 2 hours (data reconciliation). Total = 6 hours ≤ 8 hours. ✓ Valid plan.
If RTO = 6 hours and WRT = 3 hours, the total is 9 hours, which exceeds the MTD of 8 hours. The business fails before recovery completes, so the plan must be improved.
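The check above is simple enough to automate. The following is a minimal sketch, assuming all three values are expressed in hours; the `RecoveryPlan` class and its method names are hypothetical and only illustrate the formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Recovery targets for one business process, all in hours."""
    mtd_hours: float  # Maximum Tolerable Downtime
    rto_hours: float  # Recovery Time Objective
    wrt_hours: float  # Work Recovery Time

    def total_recovery_hours(self) -> float:
        return self.rto_hours + self.wrt_hours

    def is_valid(self) -> bool:
        # The plan is valid only when RTO + WRT fits inside MTD.
        return self.total_recovery_hours() <= self.mtd_hours

    def margin_hours(self) -> float:
        # Positive = slack remaining; negative = hours past the MTD.
        return self.mtd_hours - self.total_recovery_hours()


valid_plan = RecoveryPlan(mtd_hours=8, rto_hours=4, wrt_hours=2)
failing_plan = RecoveryPlan(mtd_hours=8, rto_hours=6, wrt_hours=3)

print(valid_plan.is_valid(), valid_plan.margin_hours())      # True 2
print(failing_plan.is_valid(), failing_plan.margin_hours())  # False -1
```

Reporting the margin as well as the pass/fail result shows how much slack a plan has before a slower-than-expected recovery would breach MTD.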
Recovery Site Types
| Type | Tiếng Việt | Readiness | Cost | Typical RTO |
|---|---|---|---|---|
| Hot Site | Nóng | Fully operational, real-time data replication, ready to take over immediately | Highest | Minutes |
| Warm Site | Ấm | Partially equipped with hardware, needs configuration and data restore | Medium | Hours |
| Cold Site | Lạnh | Empty facility — power, space, connectivity. Bring your own hardware and software. | Lowest | Days to Weeks |
| Cloud DR | Đám mây | Elastic spin-up on demand — pay only when activated | Variable (pay-per-use) | Minutes to Hours |
Trade-off Rule
More ready = more expensive (hot site = always running). Less ready = cheaper but slower recovery (cold site = days to weeks). Choose based on RTO requirements from BIA. For Partner A (MTD=8hrs), a warm site on GCP multi-region is the right trade-off. For Bank A payments (MTD=2hrs), a hot standby is required.
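The trade-off rule can be sketched as "pick the cheapest site type that still meets the required RTO". The hour figures below are illustrative assumptions chosen to stand in for "minutes", "hours", and "days to weeks"; they are not values from the table, and Cloud DR is left out because its cost is variable. The function name is hypothetical.

```python
from typing import Optional

# Assumed achievable RTO per site type, in hours. These figures are
# illustrative placeholders; replace them with numbers measured in
# your own recovery tests.
# Ordered cheapest first so the first match is the cheapest fit.
ASSUMED_ACHIEVABLE_RTO_HOURS = [
    ("cold", 168.0),  # "days to weeks": assume one week
    ("warm", 4.0),    # "hours": assume four hours
    ("hot", 0.25),    # "minutes": assume fifteen minutes
]


def cheapest_site_for(required_rto_hours: float) -> Optional[str]:
    """Return the cheapest site type whose assumed RTO meets the target."""
    for site_type, achievable_hours in ASSUMED_ACHIEVABLE_RTO_HOURS:
        if achievable_hours <= required_rto_hours:
            return site_type
    return None  # no site type is fast enough under these assumptions


print(cheapest_site_for(4))    # warm
print(cheapest_site_for(1))    # hot
print(cheapest_site_for(0.1))  # None
```

A `None` result signals that no standby site alone meets the target under these assumptions, which usually means the requirement needs an always-on design rather than a failover.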
BCP Testing Types — Least to Most Disruptive
Checklist / Document Review
Review plan completeness on paper. No systems, no people mobilized. Cheapest and least informative. Good for initial validation.
Tabletop Exercise
Scenario discussion — managers and key staff talk through response. No actual system changes. Good for validating decision flows and identifying gaps. Annual compliance staple.
Simulation
A more formal walkthrough in which teams practice their roles and procedures. No actual failover: teams go through the motions without touching production systems.
Parallel Test
Recovery systems are activated WHILE production continues to run. Both systems operate simultaneously. Validates recovery without production risk — safe but expensive (running dual systems).
Full Interruption Test
Actually fail over — production is shut down, recovery site takes over for real. Most realistic and most informative, but causes real downtime. Rarely used; requires extensive planning and executive approval.
Backup Strategies & Key Terms
Full Backup
All data. Slowest to create, fastest to restore.
Differential
Changes since last full backup. Medium speed. Restore = full + latest differential.
Incremental
Changes since the last backup of any type (full or incremental). Fastest to create, slowest to restore (full + all incrementals, applied in order).
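The restore rules for the three backup types can be expressed as a small function that works out which backups are needed. This is a sketch under simple assumptions: the backup history is an ordered list of `(label, kind)` pairs, and the `restore_chain` helper and the labels are hypothetical.

```python
from typing import List, Tuple

# (label, kind) where kind is "full", "differential", or "incremental"
Backup = Tuple[str, str]


def restore_chain(history: List[Backup]) -> List[str]:
    """Return the labels of the backups needed to restore the latest state.

    `history` is ordered oldest to newest and must contain a full backup.
    """
    last_full = max(i for i, (_, kind) in enumerate(history) if kind == "full")
    chain = [history[last_full][0]]
    after_full = history[last_full + 1:]

    # Only the most recent differential matters: it already contains
    # every change made since the full backup.
    diff_positions = [i for i, (_, kind) in enumerate(after_full)
                      if kind == "differential"]
    start = 0
    if diff_positions:
        latest_diff = diff_positions[-1]
        chain.append(after_full[latest_diff][0])
        start = latest_diff + 1

    # Every incremental taken after that point must be replayed in order.
    chain.extend(label for label, kind in after_full[start:]
                 if kind == "incremental")
    return chain


weekly_diff = [("sun-full", "full"), ("mon-diff", "differential"),
               ("tue-diff", "differential")]
weekly_incr = [("sun-full", "full"), ("mon-incr", "incremental"),
               ("tue-incr", "incremental")]

print(restore_chain(weekly_diff))  # ['sun-full', 'tue-diff']
print(restore_chain(weekly_incr))  # ['sun-full', 'mon-incr', 'tue-incr']
```

The differential schedule always restores from two backups, while the incremental chain grows by one for every backup taken since the last full, which is why incrementals are the slowest to restore.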
3-2-1 Backup Rule
3: Total copies of data
2: Different storage media types
1: Copy stored offsite (different location)
Ransomware can encrypt every copy it can reach online. An offsite copy, especially an immutable one, is the copy most likely to survive the attack.
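A backup inventory can be checked against the rule mechanically. The sketch below assumes the inventory lists every copy of the data, including the production copy; the `BackupCopy` type, its field values, and the `satisfies_3_2_1` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    media: str     # e.g. "disk", "tape", "object-storage"
    location: str  # e.g. "primary-dc", "secondary-region"
    immutable: bool = False


def satisfies_3_2_1(copies: List[BackupCopy], primary_location: str) -> bool:
    """Check for at least 3 copies, 2 media types, and 1 offsite copy."""
    media_types = {c.media for c in copies}
    offsite = [c for c in copies if c.location != primary_location]
    return len(copies) >= 3 and len(media_types) >= 2 and len(offsite) >= 1


inventory = [
    BackupCopy("disk", "primary-dc"),  # production data
    BackupCopy("disk", "primary-dc"),  # local backup
    BackupCopy("object-storage", "secondary-region", immutable=True),
]

print(satisfies_3_2_1(inventory, "primary-dc"))      # True
print(satisfies_3_2_1(inventory[:2], "primary-dc"))  # False
```

The `immutable` flag is recorded but not required by the check, because the 3-2-1 rule itself only counts copies, media types, and locations.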
Immutable Backups
Backups that cannot be modified or deleted for a defined retention period, even by administrators. Examples: Google Cloud Storage retention policies with Bucket Lock (WORM) and AWS S3 Object Lock. This is critical ransomware protection: attackers who compromise admin credentials cannot destroy your recovery capability while the lock is in force.
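The behavior described here can be illustrated with a toy in-memory model. This is not a cloud SDK call and does not reflect any provider's API; the class and exception names are hypothetical, and the point is only that the delete path ignores the caller's privileges until the retention window ends.

```python
from datetime import datetime, timedelta, timezone


class RetentionLockedError(Exception):
    """Raised when a delete is attempted inside the retention window."""


class ImmutableBackup:
    """Toy model of a write-once backup with a fixed retention period."""

    def __init__(self, name: str, created_at: datetime, retention: timedelta):
        self.name = name
        self.retain_until = created_at + retention

    def delete(self, now: datetime, is_admin: bool = False) -> None:
        # The admin flag is deliberately ignored: no role can bypass the lock.
        if now < self.retain_until:
            raise RetentionLockedError(
                f"{self.name} is locked until {self.retain_until.isoformat()}"
            )


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
backup = ImmutableBackup("db-2024-01-01", created, timedelta(days=30))

try:
    backup.delete(now=created + timedelta(days=10), is_admin=True)
    blocked = False
except RetentionLockedError:
    blocked = True

print(blocked)  # True
backup.delete(now=created + timedelta(days=31))  # allowed after retention ends
```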
Key Terms
- RTO must be ≤ MTD. If IT cannot recover systems within MTD, the business fails. The business sets MTD; IT must design to meet it.
- BCP is the PARENT plan. DRP is a technical subset of BCP (IT recovery is part of business continuity).
- RPO = data loss tolerance measured in TIME. RPO=1hr means you can lose at most 1 hour of transactions — backups must run at least hourly.
- Hot site = most expensive, minutes to failover. Cold site = cheapest, days to weeks to recover. Choose based on RTO requirement.
- Full interruption test is the most realistic but causes real production downtime, so it is rarely used. Tabletop is low-cost with zero system disruption, which makes it a good fit for annual compliance testing.
- 3-2-1 backup rule: 3 copies, 2 media types, 1 offsite. Ransomware can encrypt all online copies — the 1 offsite (immutable) copy saves you.
- Formula: RTO + WRT ≤ MTD. This is frequently tested in scenario questions.
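The RPO rule in the list above (data loss measured in time, so backup frequency must match) can be checked against a backup schedule. A minimal sketch, assuming each timestamp marks a completed, restorable backup; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def data_loss_window(backup_times: List[datetime],
                     failure_time: datetime) -> timedelta:
    """Time between the last backup before the failure and the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(usable)


def meets_rpo(backup_times: List[datetime], failure_time: datetime,
              rpo: timedelta) -> bool:
    # Data written after the last backup is lost, so that gap must fit the RPO.
    return data_loss_window(backup_times, failure_time) <= rpo


# Hourly backups at 09:00, 10:00, and 11:00, then a failure at 11:40.
hourly = [datetime(2024, 1, 1, hour) for hour in (9, 10, 11)]
failure = datetime(2024, 1, 1, 11, 40)

print(data_loss_window(hourly, failure))                  # 0:40:00
print(meets_rpo(hourly, failure, timedelta(hours=1)))     # True
print(meets_rpo(hourly, failure, timedelta(minutes=15)))  # False
```

The same hourly schedule passes a 1-hour RPO and fails a 15-minute one, which is why a tighter RPO forces more frequent backups or continuous replication.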
Partner A VN (live production loans, MTD=8hrs): RTO=4hrs, RPO=15min. Solution: GCP multi-region warm site, with Temporal workflow replay on recovery (Temporal preserves workflow state so in-flight workflows can resume after recovery). Backup: PostgreSQL logical replication to a secondary region, with replication lag kept within the 15-minute RPO.
Platform B / Bank A (loyalty and payments — MTD=2hrs): RTO=1hr, RPO=5min. Hot standby required for payment processing. Sub-5-minute RPO means near-real-time replication. Any downtime greater than 2 hours violates Bank A SLA — direct financial penalty and regulatory exposure.
Partner E (planned card product — MTD=2hrs): RTO=30min, RPO=2min. Hot site + immutable GCS backups. Card transaction processing demands the highest availability tier — BSP may require formal DRP testing.
Action: Conduct tabletop exercise with Partner A stakeholders to test "PostgreSQL primary failure" scenario. Document: who makes the failover decision, what the decision criteria are, how long each recovery step takes, and what the communication plan is for Partner A during the outage. Output: gap analysis and updated DRP runbook.
Practice Quiz
Q1. Business sets MTD=6hrs. IT proposes RTO=5hrs and WRT=2hrs. Is this a valid DR plan?
Q2. RPO of 15 minutes for Partner A — what does this mean for backup frequency?
Q3. What happens during a tabletop exercise?
Q4. Bank A payment processing requires immediate failover. Which recovery site type is required?
Q5. The 3-2-1 backup rule — what does each number represent?