That WAS the title of this post when I wrote up most of it yesterday. But at least for the time being, there are no deaths to report. I was going to write a whole post about the process of realizing you have a failing drive and then replacing it, but the drive is fine for now. So I'm going to write about that instead! We'll see how much of the old post gets carried over. (Editor's Note: not a lot)
So a quick preface. I'm running TrueNAS Scale, which uses ZFS for everything on the storage backend and basically just provides a nice way to manage and share that, plus tends to make it really hard to do Bad Practices. Other stuff too, but mainly that.
What happened for this story is, I found the pool in a degraded state because one of the drives had faulted. That sounds very scary, and it kind of is, but I need to explain what all of that actually means.
Errors in ZFS
ZFS, unlike most file systems in common use today, actively verifies data as it's read. Every block is checksummed and those checksums roll all the way up the tree, with enough verification to isolate exactly where the issue(s) are, be that in the data itself or in a checksum. Device errors happen in ZFS when some stage of this process fails because a device didn't respond as expected.
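If you want to see where those errors are landing, zpool status reports per-device READ, WRITE, and CKSUM counters. A quick look, assuming a pool named tank (swap in your own pool name):

    # Pool health plus per-device READ/WRITE/CKSUM error counters,
    # with -v also listing any files known to be damaged
    zpool status -v tank

    # Overall health state (ONLINE/DEGRADED/etc.) for every pool
    zpool list -o name,health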
So ZFS can potentially cough up an error for one of two reasons. First, if a file is accessed and doesn't come back correctly. Second, during a scrub, which is effectively going through and reading everything that has been stored. You'll note that file access errors will only happen when, well, files get accessed. On top of that, scrubs don't happen automatically by default in ZFS. This is to say: you need to have scrubs scheduled, and if you have disks with good and bad regions, you may not run into the bad ones until your data grows into them. It also means scrubs take longer the fuller your pool is, since a scrub only reads what's actually been written.
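For anyone running ZFS outside of an appliance that handles this for you, kicking off and scheduling scrubs is just a couple of commands. A rough sketch, again with tank as a placeholder pool name:

    # Start a scrub by hand
    zpool scrub tank

    # Check on it; the "scan:" line in the output shows progress
    zpool status tank

    # Example cron entry: scrub at 02:00 on the 1st of every month
    0 2 1 * * /sbin/zpool scrub tank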
It should be noted here that I believe TrueNAS will set monthly scrubs by default. A good example of how it helps avoid Bad Practices.
Also, these are errors in ZFS, which is to say the filesystem, not errors coming from the disks themselves via SMART. I very intentionally ignore SMART errors. I can't say for sure that this is good advice, there's some amount of evidence to support that it's bad advice, but in my experience SMART errors are just a lot of noise over drives that still behave fine in ZFS. I've actually got like 3 drives that will fail short SMART tests in my array right now, and none of them are the ones that started spitting errors in ZFS. Take that for what you will.
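If you do want to poke at SMART yourself, smartctl is the usual tool. A quick sketch, with /dev/sda standing in for whichever disk you're curious about:

    # Kick off a short self-test (runs in the background on the drive)
    smartctl -t short /dev/sda

    # Dump health status, attributes (reallocated sectors and friends),
    # and the results of past self-tests
    smartctl -a /dev/sda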
Notifications
This is something I knew about but was reminded I really should have set up. TrueNAS gives you lots of notifications for anything important, in the web UI. Now, I'm the kind of person that likes to just pop into the web UI on a regular basis, but that's not really reliable. I found out about these errors because I had popped into the web UI for a completely different reason, which happened to be about 75% of the way through a monthly scrub. I could have very easily not seen any of this for days or even a couple of weeks. With how my array is set up that's not necessarily a big deal (since I can lose 3 disks without losing data), but it's still not good.
It's not something where I think I want my phone blowing up, but email or something along those lines would be good. Something needs to show up outside of the web UI to tell me there's a problem.
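TrueNAS can send alert emails once you point it at a mail server, which is the proper fix here. But even a dumb cron job covers the basics; a minimal sketch, assuming a working mail command and that admin@example.com is wherever you want the nag to land:

    #!/bin/sh
    # zpool status -x prints "all pools are healthy" when everything is fine,
    # and the details of any degraded/faulted pool when it isn't
    STATUS="$(zpool status -x)"
    if [ "$STATUS" != "all pools are healthy" ]; then
        echo "$STATUS" | mail -s "ZFS problem on $(hostname)" admin@example.com
    fi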
What actually happened
So, originally I was going to go through the whole process of what the error looked like, what drive replacements are like, etc etc, but I also included a section with what I deemed Bad Advice. The bad advice was: whenever I have errors in ZFS, the first thing I do once the scrub is done is reboot the TrueNAS system and run the scrub again. Rebooting the system clears out any errors logged against the pool in ZFS (something I still need to look into the reasoning behind), so by doing this I'm basically saying "ok, try it again to see if that was just a fluke". In my personal experience, it's more often a fluke than a persistent issue.
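For what it's worth, you shouldn't actually need the reboot to get a clean slate; zpool clear is meant to reset those error counters from the command line. Not what I do, but the rough equivalent looks like this, tank again being a placeholder:

    # Clear the logged error counters (and un-fault devices where possible)
    zpool clear tank

    # Then run the scrub again and see if the errors come back
    zpool scrub tank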
This is bad advice because it's very common, when a drive fails in any kind of array, for the process of rebuilding that drive to cause others to fail. Scrubs and resilvers1 are similar in terms of the load they put on the drives, so running another scrub can potentially cause more issues rather than fewer. For me personally, I'm going to redo it every time unless I have reason to believe it's very likely for 2+ drives to flunk out.
In this case, I had one drive with 87 read errors, which got faulted, and another with 4 read errors. Faulted means ZFS stopped trying to access it and just let it chill. ZFS is also smart enough not to fault a drive over a single error; instead the threshold is some amount of errors over a window of time. So if I never reboot my system and a drive racks up 200 read errors over the course of two years, it should still be online in the pool. What I did was let the monthly scrub finish, reboot the system, start another scrub, and watch the error count for a bit. No errors in the first 15 minutes, so I let it run. A full scrub takes something in the range of 16 hours for me (17 hours 32 minutes this time around), so it's very much a wait and see deal. Scrub completed, no errors, all is good. See you next month.
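The "watch the error count for a bit" part is nothing fancy, just re-checking zpool status on a loop. On TrueNAS Scale (Linux underneath) something like this works, tank being the placeholder pool name again:

    # Re-run zpool status every 60 seconds to watch scan progress and the
    # per-device error counters while the scrub runs
    watch -n 60 zpool status -v tank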
So why does this happen? I'm not exactly sure. I've had the occasional issue with my disk shelf not playing nice with my UPS, which causes errors to show up in ZFS. When that has happened in the past I tend to see errors on all of the drives though, not just one or two. Realistically, I think it's possible the drive hit a bad sector and stopped responding and/or remapped the sector. This is obviously Not Good, but the drive still works, soooooooo....
I do have drives where specifically the reallocated sector count is freaking SMART out, but again, those aren't any of the ones that caused me issues today. One of the things I had written in my original post was that this is not the best batch of drives. They're kind of shit. But they're what I've got, for better or worse, so until they give me more persistent issues, they stay.
Angry storage admins, now you can chime in below! :)
-
For traditional resilvers. In ZFS, if you add the replacement drive while the faulted one is still connected, it will try to pull the data directly from the faulted drive (with verification) to do as much of the rebuild as possible, and fall back to the rest of the array when that data is bad. This is REALLY COOL and is why you should always have a spare connection on your array if you can. More about that when I actually have a disk that keeps faulting.... maybe I should make a separate post....
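A sketch of what that looks like in practice, with tank and the device names as placeholders; because the old disk stays attached, ZFS can copy straight from it during the resilver instead of reconstructing everything from redundancy:

    # Replace a flaky disk while it's still connected to the system
    zpool replace tank /dev/sdd /dev/sde

    # Watch the resilver progress
    zpool status tank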