Sunday, November 8, 2020

One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

This is not a message I wanted to see in my FreeNAS GUI when running brand-new disks, but oh well.

First scrub on July 19 2020

I discovered the error on July 29 2020 while moving some 300 GB of data to one of my mirrored pools. After the initial panic of "what is this?!" I logged in to my server over ssh and checked what had happened with zpool status.
zpool status
  pool: Tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 24.8M in 0 days 00:00:01 with 0 errors on Sun Jul 19 00:00:32 2020
config:

        NAME                                                STATE     READ WRITE CKSUM
        SafeHaven                                           ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            gptid/blip_disk1.eli  ONLINE       0     0     3
            gptid/blip_disk2.eli  ONLINE       0     0     0

errors: No known data errors


Well, well, well. It seems that there was a mismatch of a few MB of data between the members of the mirror. As these are new disks, I am not particularly panicked for the moment, especially after reading the link above and reflecting a bit on recent events.

The likely culprit

During a maintenance/upgrade about two weeks ago, one of the drives "fell out" of the pool due to a loosely attached SATA power cable, and my pool therefore became "DEGRADED" (another word that one does not see with great pleasure in the GUI...). Since this pool was at the time used for the system log as well as for my jails, the remaining disk kept carrying out read/write operations, thereby getting out of sync with the other, at that point OFFLINE, drive. In the end I managed to get the disk back into the pool; however, I imagine that the changes that happened on the first disk were not mirrored automatically upon re-attaching the second one. The July 19 midnight event appears to have been a scrub, which must have caught the data mismatch and fixed it.
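
For reference, re-attaching a dropped mirror member and then forcing a consistency check looks roughly like this from the shell (pool and gptid names are just taken from the output above; this is a sketch, not necessarily the exact steps I took at the time):

# bring the previously OFFLINE member back into the pool; ZFS resilvers it automatically
sudo zpool online Tank gptid/blip_disk1.eli
# optionally force a full consistency check of the mirror afterwards
sudo zpool scrub Tank
# watch the resilver/scrub progress
sudo zpool status Tank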

In this case it is probably not a huge issue. I cleared the error message by dismissing it in the GUI as well as from the terminal via
sudo zpool clear Tank gptid/blip_disk1.eli
and will continue to monitor the situation.
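
As for the monitoring itself, nothing fancy is needed; a couple of commands along these lines are enough to spot a recurrence (just a sketch):

# prints "all pools are healthy" if there is nothing to report
sudo zpool status -x
# otherwise, inspect the per-device READ/WRITE/CKSUM counters in detail
sudo zpool status -v Tank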

Another scrub on Aug 9 2020

This time the scrub also caught something, and zpool status gave the following.

  pool: Tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 12K in 0 days 06:29:41 with 0 errors on Sun Aug  9 06:53:43 2020
config:

        NAME                                                STATE     READ WRITE CKSUM
        SafeHaven                                           ONLINE       0     0     0
          mirror-0                                          ONLINE       0     0     0
            gptid/blip_disk1.eli  ONLINE       0     0     3
            gptid/blip_disk2.eli  ONLINE       0     0     0

errors: No known data errors
 
Now, during a nightly scrub, another 12K was discovered and repaired. This was again on the same disk as before, and I am still wondering whether it is not some leftover of the previously described issue. Perhaps something that was not caught last time? According to the "Yikes, scrub repaired 172K" forum thread, it could be anything or nothing, since I am running server-grade hardware with ECC memory. Either way, as a precaution I am doing the following (rough commands after the list):
  • create a snapshot,
  • refresh my backup,
  • schedule a long SMART test and
  • (if time allows) run a memtest.
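
In terms of commands, the snapshot and the long SMART test boil down to something like the following (the snapshot label is just a placeholder, and the backup and memtest steps obviously depend on the setup):

# recursive snapshot of the pool as a known-good restore point
sudo zfs snapshot -r Tank@pre-diagnosis
# start a long (extended) SMART self-test on the suspect drive
sudo smartctl -t long /dev/ada1
# check back later; the self-test log at the bottom of the output shows the result
sudo smartctl -a /dev/ada1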

Note: I know that some people just love recommending a memtest. However, looking at the issue, it is statistically very unlikely to be a memory problem: proper memory - which server-grade memory is - should pass quality checks after manufacturing, and it really rarely goes bad.

If the SMART tests pass, I will call it a day and keep observing the system. If the SMART test throws back errors, or if the error happens again on the same drive, I will contact the retailer, as the drive is well within its warranty.

Drive S.M.A.R.T. status 

Checking the drive SMART status with

sudo smartctl -a /dev/ada1
revealed no apparent errors with the disk. All previous SMART tests had completed without errors.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1404         -
# 2  Extended offline    Completed without error       00%      1329         -
# 3  Short offline       Completed without error       00%      1164         -
# 4  Short offline       Completed without error       00%       996         -
# 5  Short offline       Completed without error       00%       832         -
# 6  Short offline       Completed without error       00%       664         -
# 7  Short offline       Completed without error       00%       433         -
# 8  Short offline       Completed without error       00%       265         -
# 9  Extended offline    Completed without error       00%       190         -
#10  Extended offline    Completed without error       00%        18         -
#11  Short offline       Completed without error       00%         0         -

Memtest

The memtest came back clean. I am not particularly surprised here.

Status on 08 November 2020

A few months have passed since I started writing this post. In the meantime I have been monitoring the situation and have not discovered any further issues. The pool is running fine and no subsequent scrub has reported any errors. I therefore conclude that the issue was most likely caused by the malfunction described above and has nothing to do with the drive itself.
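
For completeness, the scrub results I have been watching can also be pulled from the command line at any time (the GUI shows much the same information); a quick sketch:

# the scan: line shows the outcome of the most recent scrub or resilver
sudo zpool status Tank | grep scan
# zpool history lists the scrubs that were started on the pool over time
sudo zpool history Tank | grep scrub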