Introduction
Recently, the HDD storage pool in my homelab started acting “funny”. The funkiness showed up as the pool clogging up whenever heavy traffic hit it. I lived with that for a month or so, restarting the machine occasionally, being busy with life.
But one day I thought that maybe it was time to back up the data on that pool. Oh boy, was I wrong and right at the same time. Making backups was definitely the right thing to do, just not at that specific time.
After I started the backup process, the pool began to clog again. “No problem, another reboot and we’re alive”, I thought. But this time it was different - the pool didn’t import anymore.
All disks in the pool had a clean SMART status before the malfunction, so I started digging into the zpool status output. And it was not good:
(the output showed one disk unavailable and the pool metadata corrupted)
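For reference, these are the commands I mean; the pool name below is a placeholder:

```sh
# For an imported pool: overall health, per-device state and error counters.
# "tank" stands in for the real pool name.
zpool status tank

# For a pool that refuses to import: scan devices and list importable pools,
# along with the reason they can't be imported (e.g. corrupted metadata).
zpool import
```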
One of the disks in the pool was dead. And it was so dead that it interfered with the controller’s operation. Moreover, the last reboot not only didn’t help, but (I guess) caused the pool to be exported forcefully, which broke the metadata.
Troubleshooting
First, I tried to import the pool with the -f flag, then with -f -F, but neither helped. The pool was definitely dead.
Normally, I would just swap the broken disk for a new one, replace it in ZFS and do a resilver. But the problem was that the metadata was corrupted, a pool with corrupted metadata can’t be imported, and without importing the pool I couldn’t replace the disk.
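For completeness, this is roughly what the normal path looks like when the pool still imports (pool and device names are placeholders):

```sh
# Swap the failed disk for a new one and let ZFS resilver onto it.
# "tank" and both device paths are placeholder names.
zpool replace tank /dev/disk/by-id/ata-OLD_DISK /dev/disk/by-id/ata-NEW_DISK

# Watch the resilver progress.
zpool status tank
```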
After some Googling and more or less proper troubleshooting, I found a way to import a pool with broken metadata. These commands did the trick:
```sh
# Allow importing a pool even with a missing top-level vdev.
echo 1 > /sys/module/zfs/parameters/zfs_max_missing_tvds
# Skip metadata and data verification while loading the pool.
echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 > /sys/module/zfs/parameters/spa_load_verify_data
```
The first one makes it possible to import a pool with a missing disk. The other two disable metadata and data verification during import. Using these knobs in production is not recommended, but in my case I didn’t have anything to lose.
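One reassuring detail: these are runtime module parameters, so they revert to their defaults after a reboot or module reload, and you can always inspect what is currently set:

```sh
# Print the current values of the recovery-related parameters.
cat /sys/module/zfs/parameters/zfs_max_missing_tvds
cat /sys/module/zfs/parameters/spa_load_verify_metadata
cat /sys/module/zfs/parameters/spa_load_verify_data
```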
After that, wiser with the knowledge gathered during the Googling, I imported the pool in read-only mode:
```sh
# Import read-only so nothing gets written to the damaged pool.
# "tank" is a placeholder for the real pool name.
zpool import -f -o readonly=on tank
```
And it worked! The pool was imported in read-only mode. I checked the data and it was there. Mostly. Some files were corrupted, but most of them were intact.
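Helpfully, ZFS can tell you exactly which files it considers damaged, which is useful for knowing what to backfill later (pool name is a placeholder again):

```sh
# The -v flag lists files with permanent (unrecoverable) errors.
zpool status -v tank
```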
Data recovery
After the pool was imported, I tried to copy the data to another disk: first with a simple cp, then with rsync. But both of them hung on broken files.
One more round of research later, I found a tool called cpio. It’s a tool that can copy files to and from archives, but with some flags and some pipe magic I managed to use it to copy the data file by file.
```sh
# Walk the source tree and copy it file by file in cpio's
# pass-through mode (-p): -d creates directories, -m preserves
# modification times, -v lists each file, --null handles odd names.
cd /hdd
find . -depth -print0 | cpio --null -pdmv /target
```
This command copies all files from /hdd to /target.
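If I were doing it again, I would also capture the errors, so that afterwards there’s a ready-made list of files to restore from backup:

```sh
# Same copy, minus -v, so stderr contains only the read failures;
# keep those in a file as a to-restore list.
cd /hdd
find . -depth -print0 | cpio --null -pdm /target 2> /root/cpio-errors.log
```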
After a few hours, the data was copied. I managed to rescue about 70% of the files intact; the rest were corrupted. But it was better than nothing :)
Conclusion
Backups, people!
Shame on me, because it was the second time I lost data to a broken disk, and the second time I didn’t have a proper, automated backup. I’ve learned my lesson and started making backups of my data, and I recommend you do the same.
The good thing is, I had an old (manual) backup of the very same pool, so I managed to backfill some of the lost files from it.
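An automated backup doesn’t have to be fancy; even a cron-driven snapshot-and-send would have saved me here. A minimal sketch, with made-up pool and dataset names:

```sh
#!/bin/sh
# Minimal ZFS backup sketch: snapshot the dataset, then replicate it.
# "tank/data" and "backup/data" are placeholder names.
SNAP="tank/data@backup-$(date +%Y-%m-%d)"
zfs snapshot "$SNAP"
# First run: full send into a not-yet-existing dataset. Later runs
# would use an incremental send (zfs send -i old-snap new-snap).
zfs send "$SNAP" | zfs receive backup/data
```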
RAID (or RAIDZ) is not a backup
I knew that, but I didn’t act on that knowledge. RAID is not a backup; it’s redundancy. It’s good to have, but it’s not enough. You should have a backup of your data.
SMART monitoring won’t help you sometimes
All disks in the pool had a clean SMART status before the malfunction. It just happens that a disk can die without any SMART errors, so don’t rely on SMART monitoring alone.
But it’s still better to have the disks monitored than not. I’ve had a few disks that started showing SMART errors, and I was fast enough to replace them before they died.
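If you don’t monitor SMART yet, smartmontools makes the manual checks easy (the device path is a placeholder), and its smartd daemon can run them on a schedule and alert you about new errors:

```sh
# Overall health verdict (PASSED/FAILED).
smartctl -H /dev/sda
# Vendor attributes: reallocated sectors, pending sectors, etc.
smartctl -A /dev/sda
# Kick off a short self-test; results show up in "smartctl -a".
smartctl -t short /dev/sda
```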