Update to Python 3, add Windows support and an interactive mode
Made it work with Python 3. It now has an interactive mode (see the readme file) and supports Windows (previously a Windows user could not use the code, because the Windows command prompt does not provide the Unix find command that the script expects on the other end of the pipe).
I should add that I didn't test it on Linux, so you might want to do that before merging.
Also, I don't know what the performance hit is of using os.walk() rather than $ find ... | ./duplicatefinder.py
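For illustration only, here is a minimal sketch of how both input styles could coexist, so Linux users keep the find pipe and Windows users get the walker. The function name iter_paths and its parameters are hypothetical and are not taken from duplicatefinder.py.

```python
import os
import sys


def iter_paths(root=None, stream=None):
    """Yield file paths from a directory walk or from a stream of paths.

    Hypothetical helper: if `root` is given, walk it with os.walk();
    otherwise read newline-separated paths (e.g. the output of find)
    from `stream`, which defaults to standard input.
    """
    if root is not None:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                yield os.path.join(dirpath, name)
    else:
        stream = sys.stdin if stream is None else stream
        for line in stream:
            path = line.rstrip("\r\n")
            if path:  # ignore blank lines
                yield path
```

With something like this, the scanning code only sees an iterator of paths and does not need to know where they came from.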
Awesome. I need to check this, but I have no time now.
This is very nice. I've borrowed the Python 3 fixes. For the rest, I'm not sure. Deleting files is really dangerous and should be well thought out. Also, using find and a pipe was so much simpler, because now you will want to add a whitelist/blacklist matcher, and a size threshold, and... it has no end :)
And note that this program was more or less a prototype for a GUI available here: http://kassoulet.blogspot.fr/2010/09/jankis-duplicate-finder.html
I was tempted to do a tkinter version so I can use it on Windows and OS X, but an interactive mode is a great idea.
And if you want to go further, here are some ideas I've had no time to implement yet:
- In addition to checking the first KB of each file, also check the last one. I'm pretty sure this removes the need to read whole files to detect partial matches.
- Store the file information (name, size, checksums) in a file, and use it in interactive mode. With this, the user can dynamically change the minimum size, or the minimum number of matches, without re-scanning.
- Add size/pattern options to the walker.
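As a rough sketch of the last idea in the list, a walker with a size threshold and name patterns could look like the following. The name walk_filtered and the parameters min_size, include and exclude are assumptions made for this example, not options that exist in the script.

```python
import fnmatch
import os


def walk_filtered(root, min_size=0, include=("*",), exclude=()):
    """Yield (path, size) for files under `root` that pass the filters.

    Hypothetical helper: `include` and `exclude` are shell-style
    patterns matched against the file name only, and `min_size`
    is a threshold in bytes.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not any(fnmatch.fnmatch(name, pat) for pat in include):
                continue
            if any(fnmatch.fnmatch(name, pat) for pat in exclude):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if size >= min_size:
                yield path, size
```

For example, walk_filtered(".", min_size=1024, include=("*.jpg",)) would only yield JPEG files of at least 1 KB.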
Thank you for the advice and feedback.
I do agree that on Linux using find is the best (and faster) option, but does 'find' work on Windows?
The main reason I wrote the walker and the deleting function is that my housemate needs to get rid of duplicate files. She has Windows and doesn't want to spend a lot of time on it, nor pay money to solve such a problem.
I would like to add that, for now, there is no way to delete files except in interactive mode.
The idea of reading only the first KB and the last one is pretty great, but I don't know if I will have time to implement it.
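For whoever picks this up, the first-KB/last-KB check could be sketched roughly as below. This is not code from the PR: the names head_tail_signature and group_candidates, the 1 KB chunk constant and the choice of SHA-256 are all assumptions for the example. Note that a matching signature only marks files as candidates, since two files can share their size, head and tail and still differ in the middle.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1024  # 1 KB, as suggested in the discussion


def head_tail_signature(path, chunk=CHUNK):
    """Return (size, digest) computed from the first and last `chunk` bytes."""
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        digest.update(f.read(chunk))
        if size > chunk:
            # Start the tail after the head so the two never overlap.
            f.seek(max(size - chunk, chunk))
            digest.update(f.read(chunk))
    return size, digest.hexdigest()


def group_candidates(paths):
    """Group paths whose size, head and tail all match.

    Returns only groups with more than one file; these are
    candidates that still need a full comparison before deletion.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[head_tail_signature(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Because the file size is part of the signature, files of different lengths never end up in the same group, and at most 2 KB is read per file during this pass.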