Backup Restoration Failures After Ransomware or Hardware Crash: When “We Have Backups” Is Not Enough

This is one of those situations that looks good on paper… until it happens at 2 AM.

You set up automated backups. You tested them once. The dashboard shows “success” every week. Everyone sleeps better.

Then reality hits:

RAID array dies (two disks down, no redundancy left)
Or worse, root-level compromise wipes the server clean
You go to restore from JetBackup, R1Soft, or similar
And the backup storage itself is corrupted, half-missing, or won’t mount

At that moment, the phrase “we have backups” suddenly feels… very optimistic.

I’ve had a few of these incidents over the years. The kind where you stare at a broken filesystem and realize your job is no longer “restore backup”, but “salvage whatever still exists and keep the business alive”.

Let’s walk through how I handle it when things go this bad.

The Reality Check: Backups Are Not Recovery Until Verified

Before anything technical, there’s one uncomfortable truth:

A backup is not a backup until you’ve restored it successfully.

I’ve seen systems that:

Ran weekly backups for months
Reported “success” every time
Failed completely when actually needed

Common causes:

Corrupted backup chain (incremental dependency broken)
Storage filesystem damage (NFS, S3 gateway, local disk pool)
Silent partial writes
Encryption or compression errors that only show during restore

So when disaster hits, the first assumption is simple:

Treat backups as untrusted data until proven otherwise.

That mindset saves time.

Step 1: Stop and Stabilize the Damage

If this is a ransomware or active compromise scenario, the first goal is containment.

What I usually do:

Isolate the affected server from the network immediately
Block outbound traffic (especially 80/443/25 depending on role)
Snapshot or freeze disks if possible (for forensic recovery)

If it’s a hardware failure (like a RAID collapse), I avoid reboot loops and unnecessary writes.

Why this matters:
Every write operation can destroy recoverable data blocks.

At this stage, panic actions cause more damage than the incident itself.

Step 2: Emergency Filesystem Repair (fsck and Friends)

If the storage is still partially accessible, I try a controlled filesystem check.

For ext4 systems:

fsck -f /dev/md0

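Or for a standalone disk that is not part of an array, against its filesystem partition:

fsck -f /dev/sdX1

If the RAID is degraded, I don’t point fsck at individual member disks: they hold RAID metadata and, on striped levels, only fragments of the filesystem.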

But here’s the important rule I learned the hard way:

Never run a repairing fsck on a failing disk without imaging it first. fsck -f can write fixes to the disk as it goes; adding -n makes it report problems without changing anything.

If the system is severely damaged, I prefer:

Mount as read-only
Or clone the disk first using ddrescue:

ddrescue -f -n /dev/sdX /recovery/image.img /recovery/logfile.log

Why ddrescue matters:
It skips bad sectors instead of dying immediately, which is often the difference between partial recovery and total loss.
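A common way to run ddrescue is in two passes: grab everything that reads cleanly first, then go back for the damaged areas. Below is a minimal sketch of that idea, assuming GNU ddrescue is installed. The function names (run, image_disk), the DRY_RUN variable, and the paths are placeholders of mine, not part of any tool.

```shell
#!/bin/sh
# Two-pass imaging sketch with GNU ddrescue.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

image_disk() {
    src="$1"    # failing device, e.g. /dev/sdX
    img="$2"    # destination image file on healthy storage
    map="$3"    # ddrescue mapfile, lets an interrupted run resume

    # Pass 1: copy what reads cleanly and skip the slow scraping phase (-n)
    run ddrescue -n "$src" "$img" "$map" || return 1
    # Pass 2: revisit the bad areas with direct disk access (-d), up to 3 retry passes (-r3)
    run ddrescue -d -r3 "$src" "$img" "$map"
}
```

Calling image_disk /dev/sdX /recovery/image.img /recovery/image.map prints the two commands; setting DRY_RUN=0 runs them for real. Keeping the same mapfile across both passes is what lets the second pass focus only on the areas the first one could not read.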

Step 3: Extract Whatever Survives From Backup Storage

Now we move to backup systems like:

JetBackup
R1Soft / Idera
Custom rsync or tar-based backups

First thing I do is NOT trust the UI.

I go CLI.

Check raw storage:

ls -lah /backup/

Then verify archive integrity manually:

For tar backups:

tar -tzf backup-file.tar.gz

For gzip issues:

gzip -t backup-file.tar.gz

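For database dumps, load them into a scratch instance, never the live one, because this executes the dump rather than just checking it:

mysql -u root -p -e "source backup.sql"

Or at least check that the file starts and ends properly (mysqldump normally finishes with a “-- Dump completed” line, so a missing footer suggests truncation):

head -n 20 backup.sql
tail -n 5 backup.sql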

What I’m looking for:

Missing incremental chain files
Zero-byte archives
Truncated SQL dumps
Partially uploaded backups

A backup system that says “completed” but produces broken archives is unfortunately more common than people admit.
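The checks above can be wrapped into a quick triage pass over the whole backup directory, so you know within minutes which files are worth your time. This is only a sketch: the function names and the status labels are made up for illustration, and the SQL check assumes dumps were produced by mysqldump with its default trailing comment.

```shell
#!/bin/sh
# Triage sketch: classify every file in a backup directory without extracting anything.

check_backup_file() {
    f="$1"
    if [ ! -s "$f" ]; then
        echo "EMPTY      $f"
        return 1
    fi
    case "$f" in
        *.tar.gz|*.tgz)
            # gzip -t validates the compressed stream and its CRC
            gzip -t "$f" 2>/dev/null || { echo "BAD-GZIP   $f"; return 1; }
            # tar -t walks the archive headers without writing files
            tar -tzf "$f" >/dev/null 2>&1 || { echo "BAD-TAR    $f"; return 1; }
            ;;
        *.sql)
            # mysqldump normally ends with a "-- Dump completed" comment
            tail -n 5 "$f" | grep -q -- '-- Dump completed' \
                || { echo "TRUNCATED  $f"; return 1; }
            ;;
    esac
    echo "OK         $f"
}

triage_backups() {
    dir="$1"
    status=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        check_backup_file "$f" || status=1
    done
    return "$status"
}
```

Running triage_backups /backup prints one line per file and returns non-zero if anything looks broken. An OK here only means the container is readable; it says nothing about whether the contents are current or complete.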

Step 4: Incremental Rebuild from Partial Data

This is where things get messy but also interesting.

You rarely get a perfect restore.

Instead, you rebuild from fragments:

Old full backup (if available)
Partial incrementals
Database dumps (even incomplete ones)
File-level snapshots

For databases:

Restore last known good full dump first
Apply incremental logs if available
Repair tables if needed (this helps MyISAM-style tables; InnoDB does not support repair this way):

mysqlcheck -u root -p --auto-repair --all-databases

For filesystems:

Extract only valid tar segments
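Keep reading past zeroed blocks instead of stopping at the first one:

tar --ignore-zeros -xzf backup.tar.gz

This helps with concatenated or zero-padded archives. If the gzip stream itself is damaged, tar cannot read past the damaged point, so expect to recover only what comes before it.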

Then I manually compare:

File counts
Critical application directories (WordPress wp-content, uploads, configs)
Recent changes vs last known good state

This is slow work, but it’s where systems are actually saved.
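The manual comparison can be partly automated by diffing file manifests. Here is a rough sketch that lists files present in a reference tree but missing from the restored one. The function names are hypothetical, and it assumes file names without embedded newlines.

```shell
#!/bin/sh
# Sketch: list files that exist in a reference tree but not in the restored tree.

manifest() {
    # Relative paths of all regular files, sorted so comm(1) can compare them
    ( cd "$1" && find . -type f | LC_ALL=C sort )
}

missing_after_restore() {
    ref="$1"         # last known good copy, e.g. an older full backup extracted to disk
    restored="$2"    # the tree rebuilt from partial data
    tmp="$(mktemp -d)" || return 1
    manifest "$ref" > "$tmp/ref.txt"
    manifest "$restored" > "$tmp/restored.txt"
    # comm -23 prints lines that appear only in the first file
    LC_ALL=C comm -23 "$tmp/ref.txt" "$tmp/restored.txt"
    rm -rf "$tmp"
}
```

For example, missing_after_restore /recovery/old-full /recovery/rebuilt prints one relative path per missing file. It only compares names, not contents, so a file that exists but is truncated will not show up here.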

Step 5: Temporary Node and Traffic Rerouting

While recovery is happening, business doesn’t wait.

So I usually spin up a temporary node:

Fresh VPS or clean bare-metal
Minimal stack: Nginx, PHP, database
Restore partial data first

Then I repoint DNS quickly:

Lower TTL beforehand (if you were lucky enough to plan)
Point A record to temporary IP
Or use load balancer fallback if available

Why this matters:
Even a degraded system online is better than total downtime.

Customers don’t care if everything is perfect. They care if the site loads.
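After changing the record, it helps to confirm what resolvers actually return instead of assuming the change went through. The sketch below assumes dig is installed; the function names, the default resolver 1.1.1.1, and the example addresses are illustrative choices of mine.

```shell
#!/bin/sh
# Sketch: confirm that a domain's A record points at the temporary node.

dns_points_to() {
    domain="$1"
    expected_ip="$2"
    resolver="${3:-1.1.1.1}"    # public resolver to ask; change as needed
    # +short prints only the answer data; grep -x requires an exact line match
    dig +short A "$domain" "@$resolver" | grep -qxF "$expected_ip"
}

current_ttl() {
    # The second column of the answer section is the remaining TTL in seconds
    dig +noall +answer A "$1" | awk '{print $2; exit}'
}
```

Something like dns_points_to example.com 203.0.113.10 && echo "cutover visible" tells you when the resolver has picked up the new address, and current_ttl example.com gives a rough idea of how long cached answers may linger elsewhere.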

Step 6: Verify Backup System Integrity (The Hard Lesson)

Once the fire is under control, I go back and inspect the backup pipeline.

Usually I find one of these:

Backup job silently failing for weeks
Storage pool corruption (common with cheap NAS setups)
No checksum verification
No restore testing ever done
Disk full warnings ignored

Then I fix the real system:

Add checksum validation (sha256 per archive)
Schedule restore testing on a staging server
Separate backup storage from the production network
Use immutable storage if possible
Rotate backups properly (not just overwrite last file)

Why this matters:
Backups that are never tested are just expensive hope.
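Checksum validation is the cheapest of these fixes to add. A minimal sketch, assuming GNU coreutils sha256sum and archives named *.tar.gz in a single directory; the function names and the SHA256SUMS file name are my own choices.

```shell
#!/bin/sh
# Sketch: write a SHA-256 manifest next to the archives, verify it before any restore.

write_checksums() {
    # Run right after the backup job finishes
    ( cd "$1" && sha256sum -- *.tar.gz > SHA256SUMS )
}

verify_checksums() {
    # Exit status is non-zero if any archive is missing or has changed
    ( cd "$1" && sha256sum --quiet -c SHA256SUMS )
}
```

In practice I would call write_checksums /backup at the end of each job and verify_checksums /backup from a scheduled check and before every restore. Keeping a copy of the manifest somewhere other than the backup storage makes it harder for corruption or an attacker to alter both at once.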

Common Mistakes That Make Recovery Worse

I’ve done some of these myself in my early days, so no judgment, but they’re worth calling out.

Mistake 1: Trusting backup “success” logs

Success message ≠ usable backup.

Mistake 2: Restoring directly to production

Always restore to staging first if possible.

Mistake 3: Skipping filesystem imaging

You lose forensic recovery options permanently.

Mistake 4: Not isolating compromised systems

Ransomware loves to spread during recovery.

Final Thoughts: Backups Are a System, Not a File

When everything breaks at once—RAID failure, ransomware, corrupted storage—the problem is no longer just “restore data”.

It becomes:

Data archaeology
System reconstruction
Traffic rerouting under pressure

The real lesson I keep learning (sometimes the hard way) is this:

A backup strategy is only as strong as its last successful restore test.

Not the dashboard. Not the logs. The actual restore.

And when things go wrong, your job shifts from “sysadmin” to “emergency engineer keeping the business breathing”.

Messy, stressful, sometimes ugly work—but also the kind that quietly keeps the internet running while everyone else sleeps.
