Domain 7 · Lesson 5 of 6

Business Continuity & Disaster Recovery

Liên tục Kinh doanh & Phục hồi Thảm họa

BCP vs DRP — Understanding the Distinction

Aspect	BCP (Business Continuity Plan)	DRP (Disaster Recovery Plan)
Focus	Business operations — people, process, and communication	IT systems and data recovery
Perspective	Business continuity (keep the company running)	Technical recovery (restore IT systems)
Relationship	Parent plan — the umbrella strategy	Subset of BCP — the IT-specific component
Goal	Keep business running at minimum viable capacity during disaster	Restore normal IT operations as quickly as possible

BIA — Business Impact Analysis

The foundation of both BCP and DRP. BIA identifies: (1) which business processes are critical, (2) the maximum tolerable downtime for each, and (3) the financial/operational impact of each process being unavailable. BIA output DRIVES the RTO/RPO targets. The business sets the limits (MTD and RPO), not IT; IT then commits to an RTO that fits inside them.

Recovery Objectives — Set by Business, Not IT

Metric	Full Name	Meaning	Who Sets It
MTD	Maximum Tolerable Downtime	The absolute maximum time before the business FAILS — cannot operate at all	Executive / Business Owner
RTO	Recovery Time Objective	Target time to restore systems — must be ≤ MTD	IT (constrained by MTD)
RPO	Recovery Point Objective	Maximum acceptable data loss, measured in TIME (e.g., RPO=15min = lose max 15 min of transactions)	Business / Operations
WRT	Work Recovery Time	Time to validate and restore data AFTER systems are back up (re-enter transactions, reconcile)	Operations team

Critical Formula: RTO + WRT ≤ MTD

The total recovery timeline (IT systems up + data validated) must fit within the maximum tolerable downtime. If RTO + WRT exceeds MTD, the business will fail before IT can recover it.

Example:

MTD = 8 hours. RTO = 4 hours (systems back up). WRT = 2 hours (data reconciliation). Total = 6 hours ≤ 8 hours. ✓ Valid plan.

If instead RTO = 6 hours and WRT = 3 hours, the total is 9 hours > MTD of 8 hours. ✗ The business fails before recovery completes, so the plan must be improved.
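The check above is simple arithmetic, so it can be sketched in a few lines of Python. The function names below are illustrative only, and all values are in hours.

```python
def recovery_margin_hours(mtd: float, rto: float, wrt: float) -> float:
    """Slack left inside MTD after RTO + WRT; negative means the plan overshoots."""
    return mtd - (rto + wrt)


def plan_is_valid(mtd: float, rto: float, wrt: float) -> bool:
    """A plan is valid when RTO + WRT <= MTD."""
    return recovery_margin_hours(mtd, rto, wrt) >= 0


# The two scenarios from the example above (MTD = 8 hours in both).
print(plan_is_valid(mtd=8, rto=4, wrt=2), recovery_margin_hours(8, 4, 2))
print(plan_is_valid(mtd=8, rto=6, wrt=3), recovery_margin_hours(8, 6, 3))
```

The first scenario leaves 2 hours of slack; the second overshoots MTD by 1 hour.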

Recovery Site Types

Type	Vietnamese	Readiness	Cost	Typical RTO
Hot Site	Nóng	Fully operational, real-time data replication, ready to take over immediately	Highest	Minutes
Warm Site	Ấm	Partially equipped with hardware, needs configuration and data restore	Medium	Hours
Cold Site	Lạnh	Empty facility — power, space, connectivity. Bring your own hardware and software.	Lowest	Days to Weeks
Cloud DR	Đám mây	Elastic spin-up on demand — pay only when activated	Variable (pay-per-use)	Minutes to Hours

Trade-off Rule

More ready = more expensive (hot site = always running). Less ready = cheaper but slower recovery (cold site = days to weeks). Choose based on RTO requirements from BIA. For Partner A (MTD=8hrs), a warm site on GCP multi-region is the right trade-off. For Bank A payments (MTD=2hrs), a hot standby is required.
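One way to picture the trade-off rule is "pick the cheapest site type whose achievable RTO still meets the requirement". The sketch below assumes rough achievable-RTO figures for each site type (cold: two weeks, warm: 4 hours, hot: 15 minutes). These numbers are illustrative assumptions, not measured values, and the function name is hypothetical.

```python
from typing import Optional

# (site type, assumed achievable RTO in hours), ordered cheapest first.
# The hour figures are illustrative assumptions, not vendor guarantees.
SITES_CHEAPEST_FIRST = [
    ("cold", 14 * 24),  # up to about two weeks
    ("warm", 4),
    ("hot", 0.25),      # about 15 minutes
]


def cheapest_site_for(required_rto_hours: float) -> Optional[str]:
    """Return the cheapest site type that can meet the required RTO, or None."""
    for name, achievable_rto in SITES_CHEAPEST_FIRST:
        if achievable_rto <= required_rto_hours:
            return name
    return None  # no listed option is fast enough


print(cheapest_site_for(4))  # Partner A, RTO = 4 hours -> warm
print(cheapest_site_for(1))  # Bank A payments, RTO = 1 hour -> hot
```

In practice the achievable RTO for each option should come from tested recovery runs rather than from a fixed table like this one.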

BCP Testing Types — Least to Most Disruptive

Checklist / Document Review

Review plan completeness on paper. No systems, no people mobilized. Cheapest and least informative. Good for initial validation.

Tabletop Exercise

Scenario discussion — managers and key staff talk through response. No actual system changes. Good for validating decision flows and identifying gaps. Annual compliance staple.

Simulation

More formal walkthrough, practice roles and procedures. No actual failover — teams go through motions without touching production systems.

Parallel Test

Recovery systems are activated WHILE production continues to run. Both systems operate simultaneously. Validates recovery without production risk — safe but expensive (running dual systems).

Full Interruption Test

Actually fail over — production is shut down, recovery site takes over for real. Most realistic and most informative, but causes real downtime. Rarely used; requires extensive planning and executive approval.

Backup Strategies & Key Terms

Full Backup

All data. Slowest to create, fastest to restore.

Differential

Changes since last full backup. Medium speed. Restore = full + latest differential.

Incremental

Changes since the last backup of any type (full or incremental). Fastest to create, slowest to restore (full + every incremental since that full, applied in order).
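The difference in restore effort can be sketched by computing which backups are needed to restore to the latest point. This is a simplified model with hypothetical names (`Backup`, `restore_chain`); real backup tools track this chain for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "differential", or "incremental"


def restore_chain(backups: list[Backup]) -> list[Backup]:
    """Backups needed, in restore order, to reach the most recent backup."""
    ordered = sorted(backups, key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_positions:
        raise ValueError("cannot restore without at least one full backup")
    start = full_positions[-1]
    chain = [ordered[start]]
    for b in ordered[start + 1:]:
        if b.kind == "differential":
            # A differential holds every change since the full,
            # so it replaces anything collected after the full so far.
            chain = [chain[0], b]
        elif b.kind == "incremental":
            chain.append(b)
    return chain


incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]

print(len(restore_chain(incremental_week)))             # 4 (full + 3 incrementals)
print([b.day for b in restore_chain(differential_week)])  # [0, 3]
```

With incrementals, every link in the chain must be intact; with differentials, only the full and the latest differential are needed.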

3-2-1 Backup Rule

3 total copies of data

2 different storage media types

1 copy stored offsite (different location)

Ransomware can encrypt every online copy it can reach. An offsite copy that is offline or immutable survives the attack.
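As a rough illustration, the three conditions of the rule can be checked against an inventory of copies. The `DataCopy` type, the media labels, and the region names below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str     # e.g. "disk", "object-storage", "tape"
    location: str  # site or region identifier


def check_3_2_1(copies: list[DataCopy], primary_location: str) -> dict[str, bool]:
    """Evaluate each part of the 3-2-1 rule separately."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.location != primary_location for c in copies),
    }


# Hypothetical inventory: region names are placeholders.
inventory = [
    DataCopy("live database", "disk", "region-a"),
    DataCopy("daily snapshot", "object-storage", "region-a"),
    DataCopy("weekly export", "object-storage", "region-b"),
]

print(check_3_2_1(inventory, primary_location="region-a"))
```

Note that this only checks the counts. It does not verify that the offsite copy is immutable or that a restore from it actually works.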

Immutable Backups

Backups that cannot be modified or deleted for a defined period, even by administrators. Examples: GCS Bucket Lock and Object Retention Lock (WORM-style retention), AWS S3 Object Lock. Critical ransomware protection: attackers who compromise admin credentials cannot destroy your recovery capability.

Key Terms

BCP, DRP, BIA, MTD, RTO, RPO, WRT, Hot Site, Warm Site, Cold Site, Tabletop Exercise, Full Interruption Test, 3-2-1 Backup, Immutable Backup

Exam Tips

RTO must be ≤ MTD. If IT cannot recover systems within MTD, the business fails. The business sets MTD, IT must design to meet it.
BCP is the PARENT plan. DRP is a technical subset of BCP (IT recovery is part of business continuity).
RPO = data loss tolerance measured in TIME. RPO=1hr means you can lose at most 1 hour of transactions — backups must run at least hourly.
Hot site = most expensive, minutes to failover. Cold site = cheapest, days to weeks to recover. Choose based on RTO requirement.
Full interruption test is most realistic but causes real production downtime, so it is rarely used. Tabletop = low cost, zero system disruption, good for annual compliance testing. (Only a checklist review is cheaper, and it reveals less.)
3-2-1 backup rule: 3 copies, 2 media types, 1 offsite. Ransomware can encrypt all online copies — the 1 offsite (immutable) copy saves you.
Formula: RTO + WRT ≤ MTD. This is frequently tested in scenario questions.

Work Application — Platform C BCP Parameters by Tenant

Partner A VN (live production loans, MTD=8hrs): RTO=4hrs, RPO=15min. Solution: GCP multi-region warm site, Temporal workflow replay on recovery (Temporal preserves workflow state for replay after recovery). Data replication: PostgreSQL logical replication to the secondary region, monitored so that replication lag stays under 15 minutes.

Platform B / Bank A (loyalty and payments, MTD=2hrs): RTO=1hr, RPO=5min. Hot standby required for payment processing. A 5-minute RPO means near-real-time replication. Any downtime greater than 2 hours violates the Bank A SLA, bringing direct financial penalty and regulatory exposure.

Partner E (planned card product — MTD=2hrs): RTO=30min, RPO=2min. Hot site + immutable GCS backups. Card transaction processing demands the highest availability tier — BSP may require formal DRP testing.
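The tenant parameters above specify MTD and RTO but not WRT. Since RTO + WRT ≤ MTD, the time left for validation and reconciliation is at most MTD − RTO. The sketch below computes that budget from the figures stated above; the dictionary layout and function name are illustrative assumptions.

```python
# Values in minutes, taken from the tenant descriptions above.
TENANTS = {
    "Partner A": {"mtd": 480, "rto": 240, "rpo": 15},
    "Platform B / Bank A": {"mtd": 120, "rto": 60, "rpo": 5},
    "Partner E": {"mtd": 120, "rto": 30, "rpo": 2},
}


def wrt_budget_minutes(params: dict[str, int]) -> int:
    """Maximum WRT that still satisfies RTO + WRT <= MTD."""
    return params["mtd"] - params["rto"]


for tenant, params in TENANTS.items():
    print(f"{tenant}: WRT must be <= {wrt_budget_minutes(params)} min")
```

For Bank A this leaves only 60 minutes for reconciliation after systems are back, which is worth confirming with the operations team before the RTO is signed off.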

Action: Conduct tabletop exercise with Partner A stakeholders to test "PostgreSQL primary failure" scenario. Document: who makes the failover decision, what the decision criteria are, how long each recovery step takes, and what the communication plan is for Partner A during the outage. Output: gap analysis and updated DRP runbook.

Practice Quiz

Q1. Business sets MTD=6hrs. IT proposes RTO=5hrs and WRT=2hrs. Is this a valid DR plan?

No. RTO (5hrs) + WRT (2hrs) = 7 hours total recovery time. This exceeds the MTD of 6 hours — the business will fail before IT can complete recovery. IT must reduce RTO or WRT so their sum is ≤ 6 hours.

The formula RTO + WRT ≤ MTD is a hard constraint, not a guideline. WRT is often forgotten — systems being back up doesn't mean business operations resume immediately. Staff still need to reconcile transactions, validate data integrity, re-process failed jobs, and communicate status to customers. These WRT activities take time and must be accounted for in the total recovery timeline. If the numbers don't work, you need a better recovery solution (hot site instead of warm site, more automated recovery, etc.).

Q2. RPO of 15 minutes for Partner A — what does this mean for backup frequency?

Data must be backed up (or replicated) at least every 15 minutes. An RPO of 15 minutes means you can tolerate losing at most 15 minutes of loan transaction data. If backups run every hour, a failure just before the next backup could lose up to 60 minutes of data, violating the RPO.

RPO is measured in time and directly determines backup/replication frequency. RPO=15min → backup at least every 15 minutes. RPO=5min → near-real-time replication (streaming replication, not point-in-time snapshots). RPO=0 → synchronous replication (zero data loss, highest cost). For Partner A loans, losing 15 minutes of loan decisions is manageable with a defined reconciliation process. Losing 1 hour of decisions would be a regulatory and business problem — customers approved for loans might not get them, and records might be inconsistent.
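The relationship between backup interval and RPO can be sketched as a worst-case calculation. This simplified model assumes each backup captures the state at the moment it starts and only counts once it has finished, so the worst-case loss is the interval plus the time a backup takes to complete. The function names are illustrative.

```python
def worst_case_loss_minutes(interval_min: float, copy_duration_min: float = 0) -> float:
    """Worst-case data loss: failure strikes just before the next backup completes."""
    return interval_min + copy_duration_min


def meets_rpo(rpo_min: float, interval_min: float, copy_duration_min: float = 0) -> bool:
    """True when the worst-case loss stays within the RPO."""
    return worst_case_loss_minutes(interval_min, copy_duration_min) <= rpo_min


print(meets_rpo(rpo_min=15, interval_min=15))  # True
print(meets_rpo(rpo_min=15, interval_min=60))  # False
print(meets_rpo(rpo_min=15, interval_min=15, copy_duration_min=2))  # False
```

The last case shows why an interval exactly equal to the RPO leaves no margin once backup duration is taken into account.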

Q3. What happens during a tabletop exercise?

Key staff and managers gather to discuss a disaster scenario — talking through what they would do at each stage. No actual system changes are made. It tests decision-making, communication, and plan completeness without any production risk. The discussion typically reveals gaps in the plan (who calls whom? who has authority to declare a disaster?).

Tabletop exercises are the most common form of BCP testing because they are cheap, low-risk, and surprisingly revealing. You typically present a scenario ("A flood has taken out the main data center — it's 3am. What happens?") and let the team work through it verbally. Common discoveries: nobody knows where the emergency contact list is, the person who knows the failover procedure is on vacation, or the recovery steps assume tools that aren't available at the recovery site. This is why it's good for annual compliance — it costs almost nothing and catches real gaps.

Q4. Bank A payment processing requires immediate failover. Which recovery site type is required?

Hot site — fully operational with real-time data replication. Only a hot site (or cloud DR equivalent with pre-provisioned infrastructure) can achieve the minutes-level RTO required for payment processing. A warm site (hours) or cold site (days) cannot meet Bank A's 2-hour MTD.

Payment processing has zero tolerance for extended outages. Every minute of downtime means failed transactions, frustrated customers, potential SLA penalties, and regulatory scrutiny. Hot sites maintain real-time data synchronization and can fail over in minutes; the RTO is limited only by the time to redirect traffic and validate the system. For Platform B/Bank A with MTD=2hrs and RTO=1hr, a hot standby on GCP (active-passive multi-region) is the minimum viable solution. The cost of the hot site is justified by the SLA penalty risk of downtime.

Q5. The 3-2-1 backup rule — what does each number represent?

3 = three total copies of data (including the original). 2 = stored on two different types of storage media (e.g., disk and tape, or local SSD and cloud). 1 = one copy stored offsite (geographically separate from primary). The offsite copy survives disasters that destroy the primary site.

The 3-2-1 rule provides defense-in-depth for data protection. Three copies means you can lose two without losing data. Two media types means a failure mode that affects one type (e.g., disk controller failure) doesn't affect the other. One offsite means a site-level disaster (fire, flood, ransomware spreading across all local systems) cannot destroy your ability to recover. For Platform C, a practical 3-2-1 implementation: (1) live database on primary GCP region, (2) daily snapshot on GCS (same region, different media), (3) weekly export to a different GCP region or a separate GCS bucket with a locked retention policy (Bucket Lock), which makes it immutable.
