Speed up directory scanning by ordering by inode
Spent most of today experimenting with ways to speed up the directory scanning. Even for cached files, the scanning was taking too long on one of my machines (slow HD, other IO going on, and the backup running at idle IO priority).
On ext3/ext4, sorting the directory entries by inode number makes a big difference. I've not tested it on other filesystems yet, but I doubt it will make things worse.
See - https://github.com/exobuzz/attic/commit/ae77443d4d0c3d7881a911d992e3a97611a8e2ed
I'm still testing this on the machine in question, but on my desktop machine, which has a much faster HD, the speed-up was significant: over 30% quicker. Of course, this means the backup order isn't alphabetical. I'm not sure that matters much, but it could be made optional, or perhaps a sort could be done on the repository at the end.
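The core of the change is just sorting each directory's entries by inode number before doing the per-file work. Roughly like this sketch (illustrative, not the exact commit):

```python
import os

def listdir_inode_order(path):
    """Return the names in `path` sorted by inode number.

    The heavy per-file work (lstat, open, read) then walks the inode
    table roughly sequentially on ext3/ext4 instead of seeking around.
    """
    entries = []
    for name in os.listdir(path):
        try:
            st = os.lstat(os.path.join(path, name))
        except OSError:
            continue  # entry vanished between listdir() and lstat()
        entries.append((st.st_ino, name))
    entries.sort()
    return [name for _, name in entries]
```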
Did you try to simply leave the directory entries unsorted?
os.listdir() returns entries in "arbitrary order", which may well mean the entries are already in inode order, due to the b-tree usage in most filesystems.
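Something like this would show whether listdir() already happens to return inode order on a given filesystem:

```python
import os

names = os.listdir('.')
inodes = [os.lstat(name).st_ino for name in names]
print(inodes == sorted(inodes))  # True if listdir() order already matches inode order
```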
Yeah, that was my first thought too, but at least on ext filesystems it doesn't seem to work that way (ext3/ext4 index directories by name hash, so readdir order is effectively random with respect to inode numbers). I actually read someone mentioning this in relation to Python and os.walk etc. (they had even written some code to use a supposedly faster scandir). The current code could also be improved by using readdir directly, as that could reduce the lstats somewhat (readdir can return some data, such as the entry type).
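For example, the entry type from readdir()'s d_type field is enough to tell files from directories without stat'ing every entry. A sketch using a scandir-style API (the third-party scandir module back then, os.scandir() in the stdlib since Python 3.5):

```python
import os

def walk_fast(path):
    # entry.is_dir() is answered from readdir()'s d_type field on
    # filesystems that fill it in, so no lstat() per entry is needed.
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_fast(entry.path)
        else:
            yield entry.path
```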
Try it yourself anyway and see if it makes a difference.
The main bottleneck with attic seems to be the mmap'd chunk file. I have experimented with madvise calls, but need to do some more tests. I think having a constantly written mmap'd file on the same HD as the backup is the biggest problem for throughput. In the case of one of my servers, which lacks fast IO and is serving up pages and doing database queries, I actually wonder whether reading data from the remote repository into a memory cache, rather than keeping it disk-based, would be more efficient. I'm really eager to replace my current rdiff-backup solution with attic, but even with a complete file cache (caching everything), the mmap'd chunk file is constantly updated during a backup.
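For the curious, the madvise experiments were along these lines (illustrative only: mmap.madvise() exists since Python 3.8, so older code would go through ctypes, and the 'chunks' path is just a placeholder):

```python
import mmap

with open('chunks', 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    # Hint that access is random so the kernel skips readahead and
    # doesn't flood the page cache with pages we won't touch again.
    mm.madvise(mmap.MADV_RANDOM)
    # ... index reads/writes against mm ...
    mm.close()
```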
I also think that's why the inode order has such an impact: it reduces seeking. Unfortunately I don't understand the code well enough yet to have a better solution myself (I've just spent the whole weekend on it though, as I really like the concept of attic).
If you get time, I would be happy to test out any other scenarios. The situation now is that rdiff-backup takes about an hour to scan/back up this filesystem with few changes, while attic takes 4-5 hours, with massive IO overhead due to the chunk file writes (this is one part I'm confused about: even with a complete file cache, it still constantly updates the mmap'd chunk file, so I don't fully understand the system yet). With the inode ordering, the backup takes about 3 hours, but there is still significant IO overhead from the writes. I suspect that on my system it could be faster to use a memory buffer and request data from the attic server on the other end.
Cheers.
Hi,
Sorry for jumping in, but I think I can help with regard to the chunk file updates: even if none of your files have changed, the reference count of each chunk in the chunk file is still incremented by one, and I think this is what causes the IO you observe.
This is extremely visible when the chunk file in the cache is rebuilt (I started a separate thread about that a while ago - "Speed of cache sync").
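To illustrate the mechanism: the chunk cache roughly maps chunk id -> (refcount, size, csize), and creating an archive bumps the refcount of every chunk it references, even when no file content changed. Conceptually (a sketch, not attic's actual code):

```python
# Why an unchanged backup still dirties the cache: every chunk referenced
# by the new archive gets its refcount bumped, and those writes land on
# pages scattered all over the mmap'd chunks file.
def chunk_incref(chunks, chunk_id):
    count, size, csize = chunks[chunk_id]
    chunks[chunk_id] = (count + 1, size, csize)
```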
I had the impression this was fixed in #119 - did you try with the latest git?
FWIW, os.scandir() has since landed in the stdlib in Python 3.5 (PEP 471): https://docs.python.org/3.5/whatsnew/3.5.html#whatsnew-pep-471
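With that API, the inode ordering needs no extra lstat() calls at all, since DirEntry.inode() is filled from readdir()'s d_ino field. A minimal sketch:

```python
import os

def scandir_inode_order(path):
    # DirEntry.inode() comes from readdir()'s d_ino, so sorting by
    # inode here does not add a single stat() call.
    return sorted(os.scandir(path), key=lambda entry: entry.inode())
```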