Simple MD5-based integrity checks
Since the storage array is a RAID-6 rather than ZFS or Btrfs, there should be regular checks to verify the integrity of the data. This can be done by computing an MD5 once the file is uploaded and then, on a regular basis, comparing the file's current hash against the saved MD5.
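As a minimal sketch of that scheme (function names are just illustrative, and the saved hash could live in a database, a sidecar file, or an xattr):

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large files never need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, stored_md5):
    """Compare the file's current hash against the MD5 saved at upload time."""
    return md5_of_file(path) == stored_md5
```

At upload time you'd store `md5_of_file(path)`; the periodic job just calls `verify` and flags any mismatch.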
I guess this is a simple way, yes.
You could index new objects every X interval, and then run a long-running, low-priority (high nice value) process to verify the full index on another interval.
If you wanted to get a bit fancier, you could split the index up and verify it in chunks, to spread the I/O load out.
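Roughly, verifying one chunk of the index at reduced priority might look like this (a sketch; the index format and niceness value are assumptions):

```python
import hashlib
import os


def verify_index_chunk(index_chunk, niceness=19):
    """Verify one slice of the index.

    index_chunk: list of (path, stored_md5) pairs.
    Raising the nice value makes the scan yield CPU to real workloads;
    pass niceness=0 to leave priority unchanged.
    """
    os.nice(niceness)
    corrupted = []
    for path, stored_md5 in index_chunk:
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                digest.update(block)
        if digest.hexdigest() != stored_md5:
            corrupted.append(path)
    return corrupted
```

A scheduler (cron, or a daemon loop) would hand this function a different chunk on each run until the whole index has been covered, then start over.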
Can I ask if you've considered a filesystem for your RAID-6 that includes error-checking as an option?
I've thought about using other file systems, but they were either messy to work with (ZFS can't be expanded one hard drive at a time), had strange requirements (ZFS's ~20% free space recommendation), or are too new to use with confidence (Btrfs).
In the end I settled on ext4, but it will need a simple background daemon to check file integrity. If there are any better file system options, I'd be happy to try something other than ext4.
http://en.wikipedia.org/wiki/List_of_file_systems#File_systems_with_built-in_fault-tolerance
Looks like there are close to zero production-ready, CRC-checking file systems on Linux? Not without doing some backflips which are likely to catch you out at some stage.
Perhaps there is a userspace package under your distro of choice that can handle the checksumming work automatically for now? i.e. a simple daemon that you configure to checksum some paths on interval X?
http://clusterbuffer.wordpress.com/2011/10/09/checksumming-files-to-find-bit-rot/ has an interesting implementation that may be worth integrating into a daemon.
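I haven't adapted the script from that post, but the core idea it describes (walk a tree, record checksums, and report anything that changed on the next pass) might look roughly like this in Python. Everything here is a sketch: function names, the in-memory store, and the "changed = suspect" heuristic are my assumptions, and a real daemon would also compare mtimes to separate bitrot from legitimate edits:

```python
import hashlib
import os


def scan_tree(root):
    """Return {relative_path: md5_hex} for every regular file under root."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(full, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    digest.update(block)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums


def compare_runs(old, new):
    """Files present in both runs whose checksum changed: bitrot candidates
    (or legitimate edits -- mtime would be needed to tell them apart)."""
    return [path for path in old if path in new and old[path] != new[path]]
```

A daemon would persist the result of `scan_tree` (e.g. to a JSON file or SQLite), sleep for the configured interval, rescan, and alert on whatever `compare_runs` returns.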
Very interesting, I will be implementing something like this on my NAS :)
Checking for bitrot is the easy part; I need to read that article properly and do some research on combating bitrot. Thanks for introducing me to this :)
Also, just noting a crazy idea: I wonder if it would be possible to set up a local-loopback distributed fault-tolerant file system?
So, think distributed fault-tolerant file system, but on a single physical machine. Perhaps that could provide an answer to checksumming and bitrot?
Glad you enjoyed the article. RAID-6 with regular scrubbing works pretty well against bit-rot and failing hard drives.
As it stands, with regards to tarbackup, error correction is not a requirement for the service. It is enough to be aware of the error in a timely fashion and notify the user to re-upload.
This was a new thing for me
regular scrubbing?
This makes interesting reading: http://www.nerdblog.com/2009/03/bitrot-huge-disks-and-raid.html
As an example, it looks like my NAS does scrubbing weekly through a cron job, which calls a binary called vs_refresh. I will message the manufacturer and check for sure.
Thanks for the insights
scrubbing is critical to keep things recoverable on a raid5/raid6; unrecoverable errors are a nightmare if you only learn about them after a disk failure :D
as you mentioned, mdadm can be scheduled via cron to scrub weekly (on Sunday). you can see the load increase on the machine every Sunday here: http://www.piqd.com/tarbackup_load_average.png
on centos the command is /usr/sbin/raid-check and it's scheduled via /etc/cron.d/raid-check
# Run system wide raid-check once a week on Sunday at 1am by default
0 1 * * Sun root /usr/sbin/raid-check
This relates to tarbackup because the data storage is done on a RAID-6 with regular scrubbing. The only thing better would be keeping extra copies of the data, but that's expensive. A good low-overhead middle ground for detecting data corruption is an MD5 check.
Thanks for the share, very interesting :)