find-dupes.awk: edits for Linux?
Hi,
Any chance you could help me adapt your find-dupes.awk script to work on a Linux system? Based on your notes, I was able to figure out the following changes:
- Instead of `ls -lTR`, use `ls -l --full-time -R | grep -v ^d`
- Use `md5_exec = "md5sum"`
- Change `$9` to `$8`: `file = substr($0, match($0, $8) + length($8) + 1, length($0))`
- Change `$2` to `$1`, since we are using `md5sum`: `hash = $1`
I couldn't figure out the rest, starting with the line `sizes[$5]`, as I don't know awk. I'd appreciate the help: I'm trying to find dupes using the md5sum approach from the Stack Exchange thread you referenced, and it's still running after a day on 1.3 TB of data.
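For anyone comparing the two listing formats: GNU `ls -l --full-time` splits the timestamp into three fields (date, time, timezone), so the filename's field position differs from BSD `ls -lT`. A quick sanity check on a made-up sample line (field positions can vary with locale and `ls` options, so check your own output; names containing spaces span several fields, which is why the script extracts the name with `substr`/`match` instead of a bare `$N`):

```shell
# A made-up GNU `ls -l --full-time` line; count the fields to see where
# the filename lands (here it is the 9th and last field):
line='-rw-r--r-- 1 user group 1234 2024-01-01 12:00:00.000000000 +0000 photo.jpg'
echo "$line" | awk '{ print NF, $NF }'
```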
Thanks in advance.
Actually, I was able to get things to work with the following:
- Use `md5sum --tag`, which gives BSD-style results.
- Revert back to `hash = $2` (the last bullet in my original post).
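For reference, a quick way to see the two GNU coreutils output formats side by side (the hash below is the well-known MD5 of empty input):

```shell
printf '' | md5sum        # default format:  "<hash>  -"
printf '' | md5sum --tag  # BSD-style:       "MD5 (-) = <hash>"
```

With `FS = " = "`, the BSD-style line splits so the hash lands in `$2`, which is why reverting to `hash = $2` works.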
Thank you. Let's see how fast this goes.
Actually, the script errored out after 20 minutes:
```
sh: 1: Syntax error: Unterminated quoted string
(the line above repeated 11 times)
awk: ./find-dupes.awk:72: (FILENAME=- FNR=139463) fatal: cannot open pipe `md5sum --tag 'amazon_drive/Amazon Photos Downloads/Pictures/Web/IMG_4636 (2022-02-23T15_55_55.366).jpg'': Too many open files
```
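For what it's worth, both errors have likely causes: the "Unterminated quoted string" lines suggest a filename containing a single quote (the command is built by wrapping the name in `'...'` for the shell), and "Too many open files" happens because the command pipe is never `close()`d, so every `getline` leaks a file descriptor. A sketch of the quoting fix, tested on a name with an embedded quote (the escaping scheme here is my own suggestion, not part of the original script):

```shell
# Build a shell-safe md5sum command for a filename containing a single
# quote, then run it. "\047" is the single-quote character in awk.
dir=$(mktemp -d)
touch "$dir/it's a file.jpg"
cmd=$(printf '%s\n' "$dir/it's a file.jpg" | awk '{
    q = "\047"
    gsub(q, q "\"" q "\"" q)   # close the quote, emit a double-quoted quote, reopen
    print "md5sum " q $0 q
}')
eval "$cmd"
# In the script itself the same idea applies, plus close(cmd) after getline.
```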
@taltman Actually, could you provide an example of `ls -lTR` output on FreeBSD? That would make it easier to match our own output. Thanks.
These changes worked for me on Debian Linux (I also added a `close()` on the command pipe, which fixes the "Too many open files" error reported above):

```awk
BEGIN {
    OFS = "\t"
    #md5_exec = "md5"    # FS in the report section has to be " = "
    md5_exec = "md5sum"  # FS in the report section has to be " "
}

# A line ending in ":" is a directory header from the recursive listing:
/:$/ {
    sub(/:$/, "")
    dir = $0
    next
}

# A non-empty line starting with neither "t" (for "total") nor "d" (for a
# directory) describes a regular file, so parse it:
NF && !/^[td]/ {
    # Delete a trailing "*" (the executable marker):
    gsub(/\*$/, "")
    #file = substr($0, index($0, $9))
    file = substr($0, match($0, $9), length($9))
    #file = substr($0, match($0, $9) + length($9), length($0))
    #file = substr($0, match($0, $9) + length($9) + 1, length($0))
    file_size[$5, ++file_size[$5, "length"]] = dir "/" file
    if (file_size[$5, "length"] > 1 && $5 > 35)
        sizes[$5]
}

END {
    # Find the files that have identical sizes, and then get their MD5 hash:
    for (size in sizes)
        for (i = 1; i <= file_size[size, "length"]; i++) {
            file = file_size[size, i]
            FS = " "
            #print "'" file "'"
            cmd = md5_exec " '" file "'"
            cmd | getline
            close(cmd)   # without close(), pipes accumulate until
                         # the "Too many open files" fatal error
            hash = $1
            file_hash[hash, ++file_hash[hash, "length"]] = file
            if (file_hash[hash, "length"] > 1)
                hashes[hash]
        }
    # Report files that have identical MD5 hashes:
    for (hash in hashes) {
        print hash
        for (i = 1; i <= file_hash[hash, "length"]; i++)
            print OFS file_hash[hash, i]
    }
}
```
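To sanity-check the overall approach, here is a tiny self-contained run (my own example files, not part of the script): two identical files should end up grouped under one hash.

```shell
dir=$(mktemp -d)
printf 'same bytes\n' > "$dir/a.jpg"
printf 'same bytes\n' > "$dir/b.jpg"
md5sum "$dir"/*.jpg | awk '{
    files[$1] = files[$1] "\t" $2   # collect names per hash
    count[$1]++
}
END {
    for (h in count)
        if (count[h] > 1)           # only hashes seen more than once
            print h files[h]
}'
```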
@karlic A bit late, but you only need the size and the name: use `find . -type f -printf "%s %p\n"`, adjust the `index` call to get the name (`file = substr($0, index($0, $2))`), and take the size from `$1`.
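A sketch of that variant, assuming GNU find's `-printf`; taking everything after the first space (rather than `index($0, $2)`) keeps filenames with spaces intact:

```shell
find . -type f -printf '%s %p\n' | awk '{
    size = $1
    name = substr($0, index($0, " ") + 1)   # everything after the size
    print size "\t" name
}'
```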