Simple MD5-based integrity checks
Since the storage array is a RAID-6 rather than ZFS or Btrfs, there should be regular checks to verify the integrity of the data. This can be done by computing an MD5 once the file is uploaded and then, on a regular basis, comparing the file's current hash against the saved MD5.
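As a minimal sketch of that scheme (function names are just illustrative, and the saved hash could live in a database, a sidecar file, or an xattr):

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large files never need to fit in RAM."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, stored_md5):
    """Compare the file's current hash against the MD5 saved at upload time."""
    return md5_of_file(path) == stored_md5
```

At upload time you'd store `md5_of_file(path)`; the periodic job just calls `verify` and flags any mismatch.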
I guess this is a simple way, yes.
You could index new objects every X interval, and then run a long-running, low-priority (high nice value) process to verify the full index on another interval.
If you wanted to get a bit fancier, you could split the index up and verify it in chunks, to spread the I/O load out.
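Roughly, verifying one chunk of the index at reduced priority might look like this (a sketch; the index format and niceness value are assumptions):

```python
import hashlib
import os


def verify_index_chunk(index_chunk, niceness=19):
    """Verify one slice of the index.

    index_chunk: list of (path, stored_md5) pairs.
    Raising the nice value makes the scan yield CPU to real workloads;
    pass niceness=0 to leave priority unchanged.
    """
    os.nice(niceness)
    corrupted = []
    for path, stored_md5 in index_chunk:
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                digest.update(block)
        if digest.hexdigest() != stored_md5:
            corrupted.append(path)
    return corrupted
```

A scheduler (cron, or a daemon loop) would hand this function a different chunk on each run until the whole index has been covered, then start over.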
Can I ask if you've considered a filesystem for your RAID-6 that includes error-checking as an option?
I've thought about using other file systems, but they were either messy to work with (ZFS can't be expanded one hard drive at a time), had strange requirements (ZFS's ~20% free space recommendation), or are too new to use with confidence (Btrfs).
In the end I settled on ext4, but it will need a simple background daemon to check file integrity. If there are any better file system options, I'd be happy to try something other than ext4.
http://en.wikipedia.org/wiki/List_of_file_systems#File_systems_with_built-in_fault-tolerance
Looks like there are close to zero production-ready, CRC-checking file systems on Linux? Not without doing some backflips which are likely to catch you out at some stage.
Perhaps there is a userspace package under your distro of choice that can handle the checksumming work automatically for now? i.e. a simple daemon that you configure to checksum some paths on interval X?
http://clusterbuffer.wordpress.com/2011/10/09/checksumming-files-to-find-bit-rot/ has an interesting implementation that may be worth integrating into a daemon.
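I haven't adapted the script from that post, but the core idea it describes (walk a tree, record checksums, and report anything that changed on the next pass) might look roughly like this in Python. Everything here is a sketch: function names, the in-memory store, and the "changed = suspect" heuristic are my assumptions, and a real daemon would also compare mtimes to separate bitrot from legitimate edits:

```python
import hashlib
import os


def scan_tree(root):
    """Return {relative_path: md5_hex} for every regular file under root."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.md5()
            with open(full, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    digest.update(block)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums


def compare_runs(old, new):
    """Files present in both runs whose checksum changed: bitrot candidates
    (or legitimate edits -- mtime would be needed to tell them apart)."""
    return [path for path in old if path in new and old[path] != new[path]]
```

A daemon would persist the result of `scan_tree` (e.g. to a JSON file or SQLite), sleep for the configured interval, rescan, and alert on whatever `compare_runs` returns.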
Very interesting, I will be implementing something like this on my NAS :)
Checking for bitrot is the easy part; I need to read that article properly and do some research on combating bitrot. Thanks for introducing me to this :)
Also, just noting a crazy idea: I wonder if it would be possible to set up a local-loopback distributed fault-tolerant file system?
So, think distributed fault-tolerant file system, but on a single physical machine. Perhaps that could provide an answer to checksumming and bitrot?
Glad you enjoyed the article. RAID-6 with regular scrubbing works pretty well against bit-rot and failing hard drives.
As it stands, with regards to tarbackup, error correction is not a requirement for the service. It is enough to be aware of the error in a timely fashion and notify the user to re-upload.
This was a new thing for me
regular scrubbing?
This makes interesting reading: http://www.nerdblog.com/2009/03/bitrot-huge-disks-and-raid.html
As an example, it looks like my NAS does scrubbing weekly through a cron job, which calls a binary called vs_refresh. I will message the manufacturer and check for sure.
Thanks for the insights
scrubbing is critical to keep things recoverable on a raid5/raid6; unrecoverable errors are a nightmare if you only learn about them after a disk failure :D
as you mentioned, mdadm can be scheduled via cron to scrub weekly (on Sunday). you can see the load increase on the machine every Sunday here: http://www.piqd.com/tarbackup_load_average.png
on centos the command is /usr/sbin/raid-check and it's scheduled via /etc/cron.d/raid-check
# Run system wide raid-check once a week on Sunday at 1am by default
0 1 * * Sun root /usr/sbin/raid-check
This relates to tarbackup because the data storage is done on a RAID-6 with regular scrubbing. The only thing better would be keeping extra copies of the data, but that's expensive. A good low-overhead middle ground for detecting data corruption is an MD5 check.
Thanks for the share, very interesting :)