Memory allocation issue when running on Windows
Hi.
I'm trying to count files on a 30M-file dataset on SMB. Can anything be done to overcome this, or have I reached the maximum scale of dust? Thanks!
.\dust -F -j -r -d 4 -n 100 -s 400000 -f \\server\share$\Groups
Indexing: \\server\share$\Groups 9949070 files, 9.5M ...
memory allocation of 262144 bytes failed
Just an update: running against the same repository from a Linux client completes successfully. I suspect this issue is relevant only to the Windows version.
Can you try running dust with more memory, e.g. dust -S 1073741824? -S lets you specify the stack size, so you can try increasing/decreasing the number and see if Windows sorts itself out.
C:\DUST>C:\DUST\dust.exe -S 1073741824 -D -p -j -r -f -n 100 -d 7 -z 200000 "\\srv\c$\folder"
Indexing: \\srv\c$\folder 12401021 files, 11M ...
memory allocation of 262144 bytes failed
I'm not sure I can do anything here. If Windows is failing to assign enough memory to run dust, I'm not sure there is anything I can do.
I'd recommend repeatedly halving the number in -S and then repeatedly doubling it and seeing if you can get a good run.
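If it helps, here is a rough bash sketch of that search (the target path and candidate sizes are placeholders, and on Windows the same loop can be reproduced from PowerShell):

```bash
#!/bin/bash
# Rough sketch: try a few candidate stack sizes and note which runs complete.
# TARGET is a placeholder for the directory that triggers the failure.
TARGET="/mnt/share/Groups"
for size in 134217728 268435456 536870912 1073741824 2147483648; do
    echo "=== trying -S $size ==="
    if dust -S "$size" -d 4 -n 100 "$TARGET"; then
        echo "completed with stack size $size"
    else
        echo "failed with stack size $size"
    fi
done
```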
I see the same on Linux on file systems with many millions of files.
I will try playing with -S, but as far as I can see it is a general scalability issue.
BTW, did you try it on file systems with 20-30 million files or more?
The same on Linux? OK, let me try to recreate it on Linux.
Using these 2 scripts I made a large number of files on my ext4 filesystem:
cat ~/temp/many_files/make.sh
#! /bin/bash
for n in {1..1000}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).bin bs=1 count=$(( RANDOM + 1024 ))
done
cat ~/temp/many_files/silly4/make.sh
#! /bin/bash
for n in {1..1000}; do
    mkdir $n
    touch $n/bspl{00001..09009}.$n
done
Gives:
(collapse)andy:(0):~/dev/rust/dust$ dust -f ~/temp/ -n 10
99,003 ┌── many_small │█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 0%
599,419 ├── many_small2│██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 1%
900,982 ├── silly2 │███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 2%
999,031 ├── silly │████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 2%
2,232,767 ├── silly3 │████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 5%
9,009,001 ├── silly4 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
9,009,001 ├── silly5 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
9,009,001 ├── silly6 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
9,009,001 ├── silly7 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
40,887,211 ┌─┴ many_files │████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ │ 100%
40,887,212 ┌─┴ temp │████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ │ 100%
(collapse)andy:(0):~/dev/rust/dust$
I think by the time you are tracking a few tens of millions of files you are pushing the memory limits of your average system. htop certainly wasn't very happy when I ran the above.
I ran an identical command to yours and it worked. In my use case there are a few differences that may be related:
- I use SMB or NFS to access the fs over the network
- My directory structure is more complex (can get deep and narrow)
- There are long directory and file names
Anyway, the servers I use have 32G of RAM and are doing nothing else. Is there any way I can use them to debug this?
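For example, I could run it under GNU time on one of the Linux servers to see how far memory climbs before it fails (a rough sketch; the path is just a placeholder):

```bash
# Rough sketch: capture dust's peak resident memory on one of the Linux servers.
# GNU time (-v) prints "Maximum resident set size" even if the run fails.
/usr/bin/time -v dust -d 7 -n 100 /mnt/share/folder 2> dust_mem.log
grep "Maximum resident set size" dust_mem.log
```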
Thanks!
I'm not sure I can offer much more.
Adding '-d' doesn't make it use less memory.
I can only suggest cd-ing into a subdirectory so it has less data to trawl through.
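Something like this (a rough sketch; the mount point is a placeholder) would let you cover the whole share while dust only ever holds one top-level subtree in memory:

```bash
#!/bin/bash
# Rough sketch: summarise each top-level directory separately so dust only
# has to hold one subtree in memory at a time. ROOT is a placeholder.
ROOT="/mnt/share/Groups"
for dir in "$ROOT"/*/; do
    echo "=== $dir ==="
    dust -d 3 -n 20 "$dir"
done
```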
Thanks!
I will learn some Rust and do some debugging myself.
I will let you know if something pops up.
Hi.
I could easily reproduce the dust crash with the following script. One problem is the length of the file names and the number of sub-directories.
```bash
#!/bin/bash
BASE_DIR="/files"
NUM_DIRS=10000
NUM_FILES=10000
FILENAME_LENGTH=50

generate_random_string() {
    local length=$1
    tr -dc A-Za-z0-9 </dev/urandom | head -c $length
}

create_structure() {
    local current_depth=$1
    local current_dir=$2
    if [ $current_depth -gt 10 ]; then
        return
    fi
    for ((i=0; i<$NUM_DIRS; i++)); do
        dir_name=$(generate_random_string $FILENAME_LENGTH)
        new_dir="$current_dir/$dir_name"
        mkdir -p "$new_dir"
        for ((j=0; j<$NUM_FILES; j++)); do
            file_name=$(generate_random_string $FILENAME_LENGTH)
            touch "$new_dir/$file_name"
        done
        create_structure $((current_depth + 1)) "$new_dir"
    done
}

create_structure 1 "$BASE_DIR"
```
Is that only on Windows?
I tried the above on my Linux box and dust handled it OK.
Not only on Windows; it also happens on Linux on a VM with 64G of RAM.
I have a 300TB volume on Linux with billions of files. It takes 30GB RSS + 170GB kmem and goes OOM for the container. I limited the depth to 3, so theoretically it could be done with memory proportional only to the number of directories at depth 3 or less.
I am using `parallel du -hs ::: */*/*` instead and it works quite well (the catch is that the workload is not balanced between processes, and the last, largest directory takes a long time).
I don't think this is possible to fix. du runs and dumps its output as it goes; dust loads it all into memory to make a decision. If there is too much to load, dust will run out of memory.
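For what it's worth, that streaming behaviour can be approximated with plain GNU du (a rough sketch; the depth and path are placeholders): du prints each directory total as it finishes, so only the comparatively small summary lines ever need to be held in memory by sort.

```bash
# Rough sketch: a bounded-memory "largest directories" report built on du's
# streaming output; only the per-directory summary lines are buffered by sort.
du -x --max-depth=3 /mnt/bigvolume 2>/dev/null | sort -rn | head -n 100
```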