<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Proxmox on /home/andrzejgor.ski</title>
    <link>https://andrzejgor.ski/tags/proxmox/</link>
    <description>Recent content in Proxmox on /home/andrzejgor.ski</description>
    <image>
      <title>/home/andrzejgor.ski</title>
      <url>https://andrzejgor.ski/metaimage.png</url>
      <link>https://andrzejgor.ski/metaimage.png</link>
    </image>
    <generator>Hugo -- 0.145.0</generator>
    <language>en-us</language>
    <lastBuildDate>Sat, 22 Feb 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://andrzejgor.ski/tags/proxmox/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>How I (almost) rescued data from a failed ZFS pool</title>
      <link>https://andrzejgor.ski/posts/zfs_rescue_when_disk_failed/</link>
      <pubDate>Sat, 22 Feb 2025 00:00:00 +0000</pubDate>
      <guid>https://andrzejgor.ski/posts/zfs_rescue_when_disk_failed/</guid>
      <description>How I&amp;rsquo;ve managed to rescue some data from a failed ZFS pool, when one of the disks died and the metadata was corrupted.</description>
      <content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Recently the HDD storage pool in my homelab started acting &ldquo;funny&rdquo;. The funkiness showed up as the pool clogging whenever heavy traffic hit it. I lived with that for a month or so, restarting the machine occasionally, being busy with life.</p>
<p>But one day I thought that maybe it was time to make a backup of the data on that pool. Oh boy, was I wrong and right at the same time. Making backups was definitely the right thing to do, just not at that specific time.</p>
<p>After I started the backup process, the pool began to clog again. &ldquo;No problem, another reboot and we&rsquo;re alive&rdquo;, I thought. But this time it was different - the pool didn&rsquo;t import anymore.</p>
<p>All disks in the pool had a clean SMART status before the malfunction. So I started looking at the <code>zpool import</code> output. And it was not good:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span><span class="lnt">4
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="c1"># zpool import hdd</span>
</span></span><span class="line"><span class="cl">              cannot import <span class="s1">&#39;hdd&#39;</span>: I/O error
</span></span><span class="line"><span class="cl">        Destroy and re-create the pool from
</span></span><span class="line"><span class="cl">        a backup source.
</span></span></code></pre></td></tr></table>
</div>
</div><p>One of the disks in the pool was dead. It was so dead that it interfered with the controller&rsquo;s operation. Moreover, the last reboot not only didn&rsquo;t help, but (I guess) caused the pool to be forcefully exported, which broke the metadata.</p>
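<p>If you end up in a similar spot, the kernel log is usually the quickest way to see which disk is dragging the controller down. A minimal sketch of the checks I mean (the device name <code>/dev/sdb</code> is just an example, not from my setup):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"># look for I/O and ATA errors reported by the kernel
dmesg -T | grep -iE 'i/o error|ata[0-9]'

# print the SMART health summary for a suspect disk (example device)
smartctl -H /dev/sdb
</code></pre></div>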
<h2 id="troubleshooting">Troubleshooting</h2>
<p>First, I tried to import the pool with the <code>-f</code> flag, then with <code>-f -F</code>, but neither helped. The pool was definitely dead.</p>
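<p>For context, those attempts looked roughly like this: <code>-f</code> forces the import even though the pool wasn&rsquo;t exported cleanly, and <code>-F</code> asks ZFS to rewind to an earlier, hopefully importable transaction group.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"># force the import of a pool that wasn't cleanly exported
zpool import -f hdd

# additionally try to rewind to the last importable transaction group
zpool import -f -F hdd
</code></pre></div>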
<p>Normally, I would just swap the broken disk for a new one, replace it in ZFS and let the pool resilver. But the problem was that the metadata was corrupted, and a pool with corrupted metadata can&rsquo;t be imported. And without importing the pool, I couldn&rsquo;t replace the disk.</p>
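<p>For comparison, this is roughly what the happy path looks like when the pool still imports (the device paths below are placeholders, not my actual disks):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"># swap the failed disk for a new one and let ZFS resilver onto it
zpool replace hdd /dev/disk/by-id/ata-OLD_DISK /dev/disk/by-id/ata-NEW_DISK

# watch the resilver progress
zpool status hdd
</code></pre></div>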
<p>After some Googling and more or less proper troubleshooting, I found a way to import a pool with broken metadata. These commands did the trick:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span><span class="lnt">2
</span><span class="lnt">3
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="m">1</span> &gt; /sys/module/zfs/parameters/zfs_max_missing_tvds
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="m">0</span> &gt; /sys/module/zfs/parameters/spa_load_verify_metadata
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="m">0</span> &gt; /sys/module/zfs/parameters/spa_load_verify_data
</span></span></code></pre></td></tr></table>
</div>
</div><p>The first one makes it possible to import a pool with a missing device (top-level vdev). The other two disable metadata and data verification during import. <strong>It&rsquo;s not recommended to use these settings in production</strong>. But in my case, I didn&rsquo;t have anything to lose.</p>
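<p>These are runtime module parameters, so it&rsquo;s worth confirming they actually took effect before the next import attempt (they reset to defaults on a module reload or reboot):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"># print each tunable with its current value
grep . /sys/module/zfs/parameters/zfs_max_missing_tvds \
       /sys/module/zfs/parameters/spa_load_verify_metadata \
       /sys/module/zfs/parameters/spa_load_verify_data
</code></pre></div>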
<p>After that, wiser with the knowledge gathered during Googling, I&rsquo;ve imported the pool in read-only mode:</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">zpool import -f -o <span class="nv">readonly</span><span class="o">=</span>on hdd
</span></span></code></pre></td></tr></table>
</div>
</div><p>And it worked! The pool was imported in read-only mode. I&rsquo;ve checked the data and it was there. Mostly. Some files were corrupted, but most of them were intact.</p>
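<p>A read-only import also lets you ask ZFS which files it already considers damaged, which helps to decide what to copy first:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"># list files with known permanent errors (only as good as the surviving metadata)
zpool status -v hdd
</code></pre></div>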
<h2 id="data-recovery">Data recovery</h2>
<p>After the pool was imported, I tried to copy the data to another disk, first with a simple <code>cp</code>, then with <code>rsync</code>. But both hung on broken files.</p>
<p>After some more research I found a tool called <code>cpio</code>. It&rsquo;s a tool that copies files to and from archives, but with the right flags and a bit of pipe magic I managed to use it to copy the data file by file.</p>
<div class="highlight"><div class="chroma">
<table class="lntable"><tr><td class="lntd">
<pre tabindex="0" class="chroma"><code><span class="lnt">1
</span></code></pre></td>
<td class="lntd">
<pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl">find /hdd -depth -print0 <span class="p">|</span> cpio -pdmv0 /target
</span></span></code></pre></td></tr></table>
</div>
</div><p>This command copies everything from <code>/hdd</code> to <code>/target</code>: <code>-p</code> runs cpio in pass-through mode, <code>-d</code> creates missing directories, <code>-m</code> preserves modification times, <code>-v</code> prints each file as it is processed, and <code>-0</code> expects the NUL-separated paths produced by <code>find -print0</code>.</p>
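<p>If you want a record of what didn&rsquo;t make it, a small variation keeps cpio&rsquo;s error messages in a log for later review (the log path is arbitrary):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"># same copy, minus the verbose listing, with errors saved for later review
find /hdd -depth -print0 | cpio -pdm0 /target 2&gt; /root/cpio-errors.log
</code></pre></div>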
<p>After a few hours, the data was copied. I managed to rescue about 70% of the files intact; the rest were corrupted. But it was better than nothing :)</p>
<h2 id="conclusion">Conclusion</h2>
<h3 id="backups-people">Backups, people!</h3>
<p>Shame on me, because it was the second time I lost data to a broken disk, and the second time I didn&rsquo;t have a proper, automated backup. I&rsquo;ve learned my lesson and started making backups of my data. And I recommend you do the same.</p>
<p>The good thing is, I had an old (manual) backup of the very same pool, so I managed to backfill some of the lost files from it.</p>
<h3 id="raid-or-zraid-is-not-a-backup">RAID (or ZRAID) is not a backup</h3>
<p>I knew that, but I didn&rsquo;t act on that knowledge. RAID is not a backup. It&rsquo;s redundancy. It&rsquo;s good to have, but it&rsquo;s not enough. You should have a backup of your data.</p>
<h3 id="smart-monitoring-wont-help-you-sometimes">SMART monitoring won&rsquo;t help you sometimes</h3>
<p>All disks in the pool had a clean SMART status before the malfunction. It just happens that a disk can die without showing any SMART errors. So don&rsquo;t rely only on SMART monitoring.</p>
<p>But it&rsquo;s still better to have disks monitored than not. I&rsquo;ve had a few disks that started showing SMART errors, and I was fast enough to replace them before they died.</p>
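<p>If you want the monitoring part without much effort, <code>smartd</code> from the smartmontools package can watch the disks and alert you when SMART attributes change. A minimal sketch of <code>/etc/smartd.conf</code> (the mail address is a placeholder):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code># monitor all detected disks with the default set of checks and mail on problems
DEVICESCAN -a -m admin@example.com
</code></pre></div>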
]]></content:encoded>
    </item>
  </channel>
</rss>
