This is one of those situations that looks good on paper… until it happens at 2AM.
You set up automated backups. You tested them once. The dashboard shows “success” every week. Everyone sleeps better.
Then reality hits:
- RAID array dies (two disks down, no redundancy left)
- Or worse, root-level compromise wipes the server clean
- You go to restore from JetBackup, R1Soft, or similar
- And the backup storage itself is corrupted, half-missing, or won’t mount
At that moment, the phrase “we have backups” suddenly feels… very optimistic.
I’ve had a few of these incidents over the years. The kind where you stare at a broken filesystem and realize your job is no longer “restore backup”, but “salvage whatever still exists and keep the business alive”.
Let’s walk through how I handle it when things go this bad.
The Reality Check: Backups Are Not Recovery Until Verified
Before anything technical, there’s one uncomfortable truth:
A backup is not a backup until you’ve restored it successfully.
I’ve seen systems that:
- Ran weekly backups for months
- Reported “success” every time
- Failed completely when actually needed
Common causes:
- Corrupted backup chain (incremental dependency broken)
- Storage filesystem damage (NFS, S3 gateway, local disk pool)
- Silent partial writes
- Encryption or compression errors that only show during restore
So when disaster hits, the first assumption is simple:
Treat backups as untrusted data until proven otherwise.
That mindset saves time.
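One way to act on that assumption is a small restore drill: unpack the archive into a scratch directory and check that files you know must exist actually come out. This is a minimal sketch, assuming a gzip-compressed tar archive. The function name restore_drill and the paths in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash
# restore_drill ARCHIVE REQUIRED_FILE...
# Extracts ARCHIVE into a throwaway directory and confirms that each
# REQUIRED_FILE (relative to the archive root) exists and is non-empty.
# Returns 0 only if extraction succeeds and every file is present.
restore_drill() {
    local archive=$1; shift
    local scratch rc=0 path
    scratch=$(mktemp -d) || return 1

    if ! tar -xzf "$archive" -C "$scratch" 2>/dev/null; then
        echo "FAIL: $archive does not extract cleanly"
        rc=1
    else
        for path in "$@"; do
            if [ -s "$scratch/$path" ]; then
                echo "ok: $path"
            else
                echo "FAIL: $path missing or empty"
                rc=1
            fi
        done
    fi

    # The scratch copy is only for checking; throw it away afterwards.
    rm -rf "$scratch"
    return "$rc"
}
```

Something like restore_drill /backup/site.tar.gz wp-config.php (hypothetical paths) could run from cron, with a non-zero exit code raising an alert. It proves far less than a full restore to staging, but it catches archives that cannot be read at all.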
Step 1: Stop and Stabilize the Damage
If this is a ransomware or active compromise scenario, the first goal is containment.
I usually do:
- Isolate the affected server from the network immediately
- Block outbound traffic (especially 80/443/25 depending on role)
- Snapshot or freeze disks if possible (for forensic recovery)
If it’s a hardware failure (like a RAID collapse), I avoid reboot loops and unnecessary writes.
Why this matters:
Every write operation can destroy recoverable data blocks.
At this stage, panic actions cause more damage than the incident itself.
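As an illustration of the outbound block, here is a sketch using iptables. It is an assumption that the host uses iptables at all; nftables or a cloud security group may be the right layer on your system. The isolate_outbound function and the DRY_RUN switch are invented for this example. By default it only prints the rules, so they can be reviewed before anything is applied.

```shell
#!/usr/bin/env bash
# Print (or, with DRY_RUN=0 and root, apply) rules that drop outbound
# web and mail traffic. Inbound SSH sessions are not matched by these
# rules, so an existing admin connection stays up.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

isolate_outbound() {
    # Insert at the top of OUTPUT so the DROP wins over later ACCEPTs.
    run iptables -I OUTPUT 1 -p tcp -m multiport --dports 25,80,443 -j DROP
    # Inserted above the DROP: log what the box still tries to reach.
    # LOG does not stop processing, so the packet goes on to the DROP.
    run iptables -I OUTPUT 1 -p tcp -m multiport --dports 25,80,443 \
        -j LOG --log-prefix "isolated-out: "
}
```

Running isolate_outbound with the default DRY_RUN=1 prints two lines prefixed with "+". Adjust the port list to the role of the server before applying anything.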
Step 2: Emergency Filesystem Repair (fsck and Friends)
If the storage is still partially accessible, I try a controlled filesystem check.
For ext4 systems:
fsck -f /dev/md0
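Or against an individual disk or partition, if the RAID is degraded and that device carries a complete filesystem by itself (a member of a striped array does not):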
fsck -f /dev/sdX
But here’s the important rule I learned the hard way:
Never run fsck blindly on production data without imaging first if the disk is failing.
If the system is severely damaged, I prefer to:
- Mount it read-only
- Or clone the disk first using ddrescue: ddrescue -f -n /dev/sdX /recovery/image.img /recovery/logfile.log
Why ddrescue matters:
It skips bad sectors instead of dying immediately, which is often the difference between partial recovery and total loss.
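A common ddrescue pattern is two passes against the same map file: a fast pass that grabs everything readable, then a slower pass that retries only the bad areas. The sketch below just prints that plan for review and executes nothing. rescue_plan is a hypothetical helper and the device and paths are placeholders. The flags themselves (-n, -d, -r3 for ddrescue, and the losetup options) are standard.

```shell
#!/usr/bin/env bash
# rescue_plan DEVICE IMAGE MAPFILE
# Prints the commands for a two-pass image of a failing disk, followed
# by a read-only way to inspect the image. Nothing is executed.
rescue_plan() {
    local dev=$1 img=$2 map=$3
    # Pass 1: copy the easy areas and skip the slow scraping phase (-n).
    echo "ddrescue -n $dev $img $map"
    # Pass 2: direct disc access (-d), retry bad areas 3 times (-r3).
    # The shared map file lets ddrescue resume instead of starting over.
    echo "ddrescue -d -r3 $dev $img $map"
    # Work on the copy, never the original: attach the image as a
    # read-only loop device and scan it for partitions.
    echo "losetup --find --show --read-only --partscan $img"
}
```

Calling rescue_plan /dev/sdX /recovery/image.img /recovery/image.map prints three commands in that order. Any fsck or mount attempt then targets the loop device, which keeps the failing disk out of the picture.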
Step 3: Extract Whatever Survives From Backup Storage
Now we move to backup systems like:
- JetBackup
- R1Soft / Idera
- Custom rsync or tar-based backups
First thing I do is NOT trust the UI.
I go CLI.
Check raw storage:
ls -lah /backup/
Then verify archive integrity manually:
For tar backups:
tar -tzf backup-file.tar.gz
For gzip issues:
gzip -t backup-file.tar.gz
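For database dumps, the real test is loading the dump into a scratch MySQL instance, never the live one, because this executes every statement in the file:
mysql -u root -p -e "source backup.sql"
Or at least check that the file is not cut off. A complete mysqldump file normally ends with a “-- Dump completed” comment, so look at the tail as well as the header:
head -n 20 backup.sql
tail -n 5 backup.sql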
What I’m looking for:
- Missing incremental chain files
- Zero-byte archives
- Truncated SQL dumps
- Partially uploaded backups
A backup system that says “completed” but produces broken archives is unfortunately more common than people admit.
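Those manual checks can be rolled into one pass over the backup directory. This is a rough sketch rather than a replacement for a real restore. The function name audit_backups is hypothetical, and it assumes archives end in .tar.gz or .tgz and that SQL dumps come from mysqldump with comments enabled (the default), so a complete dump ends with a “-- Dump completed” line.

```shell
#!/usr/bin/env bash
# audit_backups DIR
# Walks DIR and reports files that are empty, archives that fail the
# gzip or tar checks, and SQL dumps without mysqldump's closing marker.
# Prints one "BAD <reason> <file>" line per problem; returns 1 if any.
audit_backups() {
    local dir=$1 bad=0 f
    while IFS= read -r -d '' f; do
        if [ ! -s "$f" ]; then
            echo "BAD zero-byte $f"; bad=1
            continue
        fi
        case $f in
            *.tar.gz|*.tgz)
                # gzip -t checks the compressed stream,
                # tar -t checks the archive structure inside it.
                if ! gzip -t "$f" 2>/dev/null; then
                    echo "BAD gzip $f"; bad=1
                elif ! tar -tzf "$f" >/dev/null 2>&1; then
                    echo "BAD tar $f"; bad=1
                fi ;;
            *.sql)
                # A dump cut off mid-write lacks the closing comment.
                if ! tail -n 5 "$f" | grep -q '^-- Dump completed'; then
                    echo "BAD truncated-sql $f"; bad=1
                fi ;;
        esac
    done < <(find "$dir" -type f -print0 | sort -z)
    return "$bad"
}
```

Run as audit_backups /backup, it gives a quick list of what is obviously broken, which tells you where to spend your time first. A file that passes still has to survive an actual restore.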
Step 4: Incremental Rebuild from Partial Data
This is where things get messy but also interesting.
You rarely get a perfect restore.
Instead, you rebuild from fragments:
- Old full backup (if available)
- Partial incrementals
- Database dumps (even incomplete ones)
- File-level snapshots
For databases:
- Restore last known good full dump first
- Apply incremental logs if available
- Repair tables if needed: mysqlcheck -u root -p --auto-repair --all-databases
For filesystems:
- Extract only valid tar segments
- Keep reading past zeroed blocks that tar would otherwise treat as the end of the archive, instead of stopping there: tar --ignore-zeros -xzf backup.tar.gz
Then I manually compare:
- File counts
- Critical application directories (WordPress wp-content, uploads, configs)
- Recent changes vs last known good state
This is slow work, but it’s where systems are actually saved.
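The file-count comparison is easy to script. Here is a minimal sketch that compares a restored tree against a reference tree, for example an older full backup extracted next to it. The names list_files and compare_trees are made up for this example, and it assumes file names without embedded newlines.

```shell
#!/usr/bin/env bash
# list_files DIR: sorted relative paths of all regular files under DIR.
list_files() {
    (cd "$1" && find . -type f | LC_ALL=C sort)
}

# compare_trees REFERENCE RESTORED
# Prints both file counts, then every path present in REFERENCE but
# absent from RESTORED, prefixed with "MISSING ".
# Returns 1 if anything is missing, 0 otherwise.
compare_trees() {
    local ref=$1 restored=$2 missing
    echo "reference: $(list_files "$ref" | wc -l | tr -d ' ') files"
    echo "restored:  $(list_files "$restored" | wc -l | tr -d ' ') files"
    # comm -23 keeps only lines unique to the first (reference) listing.
    # Both listings are sorted in the C locale, so comm uses it too.
    missing=$(LC_ALL=C comm -23 <(list_files "$ref") <(list_files "$restored"))
    [ -z "$missing" ] && return 0
    printf '%s\n' "$missing" | sed 's/^/MISSING /'
    return 1
}
```

This only compares names, not contents. Files that exist in both trees can still differ, so critical directories deserve a closer look.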
Step 5: Temporary Node and Traffic Rerouting
While recovery is happening, business doesn’t wait.
So I usually spin up a temporary node:
- Fresh VPS or clean bare-metal
- Minimal stack: Nginx, PHP, database
- Restore partial data first
Then I repoint DNS quickly:
- Lower TTL beforehand (if you were lucky enough to plan)
- Point A record to temporary IP
- Or use load balancer fallback if available
Why this matters:
Even a degraded system online is better than total downtime.
Customers don’t care if everything is perfect. They care if the site loads.
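Before touching DNS, it helps to confirm the temporary node really serves the site. curl can force a hostname to resolve to a specific IP with --resolve, which allows testing the new node while the public records still point at the old one. A small sketch follows. check_node is a hypothetical helper, and it assumes the temporary node already answers HTTPS with a certificate curl accepts; otherwise test over plain HTTP or add --insecure for the check.

```shell
#!/usr/bin/env bash
# check_node HOSTNAME IP
# Requests https://HOSTNAME/ but forces the connection to IP, so the
# temporary node can be tested before any DNS record changes.
# Returns 0 when the response status is 2xx or 3xx, 1 otherwise.
check_node() {
    local host=$1 ip=$2 status
    status=$(curl --silent --output /dev/null --max-time 10 \
        --resolve "$host:443:$ip" \
        --write-out '%{http_code}' "https://$host/")
    echo "$host via $ip -> HTTP $status"
    case $status in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}
```

For example, check_node example.com 203.0.113.10 (placeholder values) reports the status code the temporary node returns for the real hostname. curl reports 000 when it cannot connect at all, which the function treats as a failure.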
Step 6: Verify Backup System Integrity (The Hard Lesson)
Once the fire is under control, I go back and inspect the backup pipeline.
Usually I find one of these:
- Backup job silently failing for weeks
- Storage pool corruption (common with cheap NAS setups)
- No checksum verification
- No restore testing ever done
- Disk full warnings ignored
Then I fix the real system:
- Add checksum validation (sha256 per archive)
- Schedule restore testing on a staging server
- Separate backup storage from production network
- Use immutable storage if possible
- Rotate backups properly (not just overwrite last file)
Why this matters:
Backups that are never tested are just expensive hope.
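Checksum validation can start very small. The sketch below writes a sha256 manifest next to the archives and verifies it later. It assumes GNU coreutils (sha256sum) and findutils. The function names and the SHA256SUMS file name are choices made for this example.

```shell
#!/usr/bin/env bash
# write_manifest DIR
# Records a sha256 for every file under DIR in DIR/SHA256SUMS.
write_manifest() {
    (cd "$1" && find . -type f ! -name SHA256SUMS -print0 \
        | LC_ALL=C sort -z | xargs -0 -r sha256sum > SHA256SUMS)
}

# verify_manifest DIR
# Re-hashes the files and compares them with the manifest. Exits
# non-zero on any mismatch or missing file; --quiet prints only failures.
verify_manifest() {
    (cd "$1" && sha256sum --check --quiet SHA256SUMS)
}
```

A reasonable flow is write_manifest right after the backup job finishes and verify_manifest on a schedule and again before any restore. Keeping a copy of the manifest somewhere other than the backup storage is worth considering, since a checksum stored next to a corrupted archive can be lost along with it.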
Common Mistakes That Make Recovery Worse
I’ve done some of these myself in my early days, so no judgment, but they are worth calling out.
Mistake 1: Trusting backup “success” logs
Success message ≠ usable backup.
Mistake 2: Restoring directly to production
Always restore to staging first if possible.
Mistake 3: Skipping filesystem imaging
You lose forensic recovery options permanently.
Mistake 4: Not isolating compromised systems
Ransomware loves to spread during recovery.
Final Thoughts: Backups Are a System, Not a File
When everything breaks at once—RAID failure, ransomware, corrupted storage—the problem is no longer just “restore data”.
It becomes:
- Data archaeology
- System reconstruction
- Traffic rerouting under pressure
The real lesson I keep learning (sometimes the hard way) is this:
A backup strategy is only as strong as its last successful restore test.
Not the dashboard. Not the logs. The actual restore.
And when things go wrong, your job shifts from “sysadmin” to “emergency engineer keeping the business breathing”.
Messy, stressful, sometimes ugly work—but also the kind that quietly keeps the internet running while everyone else sleeps.