
find-dupes.awk: edits for Linux?

Open vinhdizzo opened this issue 3 years ago • 5 comments

Hi,

Any chance you could help me adapt your find-dupes.awk script to work on a Linux system? Based on your notes, I was able to figure out the following changes:

  • Instead of ls -lTR, use ls -l --full-time -R | grep -v ^d
  • Use md5_exec = "md5sum"
  • Change $9 to $8: file = substr($0,match($0, $8)+length($8)+1,length($0))
  • Change $2 to $1 since we are using md5sum: hash = $1
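A minimal sketch of another way to pull the name out of ls -l --full-time output, without passing a field to match() as a regular expression. With --full-time, GNU ls prints eight fields before the name (mode, links, owner, group, size, date, time, zone), so the sketch strips those eight and keeps the rest. The helper name ls_sizes_names is hypothetical, and the sketch assumes GNU ls and owner/group names without spaces.

```shell
#!/bin/sh
# Hypothetical helper: print "size<TAB>name" for each regular file in DIR.
# Assumes GNU ls (for --full-time).
ls_sizes_names() {
    ls -l --full-time "$1" | awk '
    /^-/ {                      # regular files only; skips "total", dirs, links
        size = $5
        name = $0
        # Drop the 8 fields before the name (mode, links, owner, group,
        # size, date, time, zone) without touching spaces inside the name.
        for (i = 1; i <= 8; i++) {
            sub(/^ +/, "", name)
            sub(/^[^ ]+/, "", name)
        }
        sub(/^ /, "", name)     # the single separator before the name
        print size "\t" name
    }'
}
```

Called as ls_sizes_names some/dir, this should keep names such as "my file.txt" whole.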

I couldn't figure out the rest, starting with the line sizes[$5], as I don't know awk. I would appreciate the help: I'm trying to find dupes using the md5sum approach from the StackExchange thread that you referenced, and it's still running after 1 day on 1.3 TB of data.

Thanks in advance.

vinhdizzo avatar Nov 19 '22 03:11 vinhdizzo

Actually, I was able to get things to work with the following:

  • Use md5sum --tag, which gives BSD-style output.
  • Revert to hash = $2 (undoing the last bullet in my original post).

Thank you. Let's see how fast this goes.

vinhdizzo avatar Nov 19 '22 03:11 vinhdizzo

Actually, the script errored out after 20 minutes:

sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
awk: ./find-dupes.awk:72: (FILENAME=- FNR=139463) fatal: cannot open pipe `md5sum --tag 'amazon_drive/Amazon Photos Downloads/Pictures/Web/IMG_4636 (2022-02-23T15_55_55.366).jpg'': Too many open files

vinhdizzo avatar Nov 19 '22 17:11 vinhdizzo

@taltman Actually, could you provide an example of ls -lTR output on FreeBSD? That would make it easier to match our own output. Thanks.

karlic avatar Sep 25 '23 13:09 karlic

These changes worked for me on Debian Linux:

BEGIN{
    OFS = "\t"
    #md5_exec = "md5" # FS in Report Section has to be " = "
    md5_exec = "md5sum" # FS in Report Section has to be " "
}

/:$/ {
    sub(/:$/, "")
    dir = $0
    next
}

# Parse ls -ltR output.
# A line that has fields and does not start with "t" (for "total") or
# "d" (for directory) describes a file, not a directory.
NF && !/^[td]/ {
    # Delete a trailing "*" from the line
    gsub(/\*$/,"")
    # Take everything from the first occurrence of $9 to the end of the line,
    # so that file names containing spaces are kept whole
    file = substr($0, index($0, $9))
    file_size[$5, ++file_size[$5,"length"]] = dir "/" file
    # Only sizes shared by more than one file (and above 35 bytes) get hashed
    if(file_size[$5, "length"] > 1 && $5 > 35)
        sizes[$5]
}

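END {
    # Find the files that have identical sizes, and then get their MD5 hash:
    for(size in sizes)
        for(i=1; i<=file_size[size,"length"]; i++) {
            file = file_size[size,i]
            # Note: a file name containing a single quote still breaks this quoting
            cmd = md5_exec " '" file "'"
            # Read the md5sum line into its own variable; the hash is its first field
            if ((cmd | getline md5_line) > 0) {
                split(md5_line, md5_fields, " ")
                hash = md5_fields[1]
                file_hash[hash,++file_hash[hash,"length"]] = file
                if (file_hash[hash,"length"]>1)
                    hashes[hash]
            }
            # Close the pipe; otherwise awk can fail with "Too many open files"
            close(cmd)
        }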

# Report files that have identical MD5 hashes:
    for(hash in hashes) {
        print hash
        for(i=1; i<=file_hash[hash,"length"]; i++)
            print OFS file_hash[hash,i]
    }
}

hasifantasy avatar Mar 22 '25 17:03 hasifantasy

@karlic A bit late, but you only need the size and the name: use find . -type f -printf "%s %p\n", change the index call to get the name with file = substr($0, index($0, $2)), and take the size from $1.
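A rough sketch of the size-grouping step on top of that find output, assuming GNU find (for -printf). It splits each line at the first space rather than searching for $2, so the size digits cannot be confused with the path. The helper name list_same_size is hypothetical, and paths containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: list_same_size DIR prints "size<TAB>path" for every file
# whose size is shared with at least one other file under DIR.
# Assumes GNU find (for -printf).
list_same_size() {
    find "$1" -type f -printf '%s %p\n' |
    awk '
    {
        size = $1
        # The path is everything after the first space, so spaces in names survive.
        path = substr($0, index($0, " ") + 1)
        paths[size, ++count[size]] = path
    }
    END {
        # Only sizes seen more than once can contain duplicates.
        for (size in count)
            if (count[size] > 1)
                for (i = 1; i <= count[size]; i++)
                    print size "\t" paths[size, i]
    }'
}
```

Only the files printed here would then need to be hashed, which is the same idea as the sizes[$5] filter in the script above.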

JuanHdzCrr avatar Jun 07 '25 01:06 JuanHdzCrr