Memory allocation issue when running on Windows
Hi.
I'm trying to count files on a 30M-file dataset on SMB. Can anything be done to overcome this, or have I reached the maximum scale of dust? Thanks!
.\dust -F -j -r -d 4 -n 100 -s 400000 -f \\server\share$\Groups
Indexing: \\server\share$\Groups 9949070 files, 9.5M ...
memory allocation of 262144 bytes failed
Just an update: running against the same repository from a Linux client completes successfully. I suspect this issue is relevant only to the Windows version.
Can you try running dust with more memory, e.g. dust -S 1073741824? -S lets you specify the stack size, so you can try increasing/decreasing the number and see if Windows sorts itself out.
C:\DUST>C:\DUST\dust.exe -S 1073741824 -D -p -j -r -f -n 100 -d 7 -z 200000 "\\srv\c$\folder"
Indexing: \\srv\c$\folder 12401021 files, 11M ...
memory allocation of 262144 bytes failed
I'm not sure I can do anything here. If Windows is failing to assign enough memory to run dust, I'm not sure there is anything I can do.
I'd recommend repeatedly halving the number in -S and then repeatedly doubling it and seeing if you can get a good run.
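If it helps, here is a rough bash sketch of that search (the target path and candidate sizes are placeholders, and on Windows the same loop can be reproduced from PowerShell):

```bash
#!/bin/bash
# Rough sketch: try a few candidate stack sizes and note which runs complete.
# TARGET is a placeholder for the directory that triggers the failure.
TARGET="/mnt/share/Groups"
for size in 134217728 268435456 536870912 1073741824 2147483648; do
    echo "=== trying -S $size ==="
    if dust -S "$size" -d 4 -n 100 "$TARGET"; then
        echo "completed with stack size $size"
    else
        echo "failed with stack size $size"
    fi
done
```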
I see the same on Linux on file systems with many millions of files.
I will try playing with -S, but as far as I can see it is a general scalability issue.
BTW, did you try it on file systems with 20-30 million files or more?
The same on Linux? OK, let me try to recreate it on Linux.
Using these 2 scripts I made a large number of files on my ext4 filesystem:
cat ~/temp/many_files/make.sh
#! /bin/bash
for n in {1..1000}; do
    dd if=/dev/urandom of=file$( printf %03d "$n" ).bin bs=1 count=$(( RANDOM + 1024 ))
done
cat ~/temp/many_files/silly4/make.sh
#! /bin/bash
for n in {1..1000}; do
    mkdir $n
    touch $n/bspl{00001..09009}.$n
done
Gives:
(collapse)andy:(0):~/dev/rust/dust$ dust -f ~/temp/ -n 10
99,003 ┌── many_small │█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 0%
599,419 ├── many_small2│██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 1%
900,982 ├── silly2 │███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 2%
999,031 ├── silly │████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 2%
2,232,767 ├── silly3 │████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 5%
9,009,001 ├── silly4 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
9,009,001 ├── silly5 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
9,009,001 ├── silly6 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
9,009,001 ├── silly7 │██████████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ 22%
40,887,211 ┌─┴ many_files │████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ │ 100%
40,887,212 ┌─┴ temp │████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ │ 100%
(collapse)andy:(0):~/dev/rust/dust$
I think by the time you are tracking a few tens of millions of files you are pushing the memory limits of your average system. htop certainly wasn't very happy when I ran the above.
I ran an identical command to yours and it worked. In my use case there are a few differences that may be related:
- I use SMB or NFS to access the fs over the network
- My directory structure is more complex (can get deep and narrow)
- There are long directory and file names
Anyway, the servers I use have 32G of RAM and are doing nothing else. Is there any way I can use them to debug this?
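For example, I could run it under GNU time on one of the Linux servers to see how far memory climbs before it fails (a rough sketch; the path is just a placeholder):

```bash
# Rough sketch: capture dust's peak resident memory on one of the Linux servers.
# GNU time (-v) prints "Maximum resident set size" even if the run fails.
/usr/bin/time -v dust -d 7 -n 100 /mnt/share/folder 2> dust_mem.log
grep "Maximum resident set size" dust_mem.log
```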
Thanks!
I'm not sure I can offer much more.
Adding '-d' doesn't make it use less memory.
I can only suggest cd-ing into a subdirectory so it has less data to trawl through.
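Something like this (a rough sketch; the mount point is a placeholder) would let you cover the whole share while dust only ever holds one top-level subtree in memory:

```bash
#!/bin/bash
# Rough sketch: summarise each top-level directory separately so dust only
# has to hold one subtree in memory at a time. ROOT is a placeholder.
ROOT="/mnt/share/Groups"
for dir in "$ROOT"/*/; do
    echo "=== $dir ==="
    dust -d 3 -n 20 "$dir"
done
```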
Thanks!
I will learn some Rust and do some debugging myself.
I will let you know if something pops up.
Hi.
I could easily reproduce the dust crash with the following script. One problem is the length of the file names and the number of sub-directories.
```bash
#!/bin/bash
BASE_DIR="/files"
NUM_DIRS=10000
NUM_FILES=10000
FILENAME_LENGTH=50

generate_random_string() {
    local length=$1
    tr -dc A-Za-z0-9 </dev/urandom | head -c $length
}

create_structure() {
    local current_depth=$1
    local current_dir=$2
    if [ $current_depth -gt 10 ]; then
        return
    fi
    for ((i=0; i<$NUM_DIRS; i++)); do
        dir_name=$(generate_random_string $FILENAME_LENGTH)
        new_dir="$current_dir/$dir_name"
        mkdir -p "$new_dir"
        for ((j=0; j<$NUM_FILES; j++)); do
            file_name=$(generate_random_string $FILENAME_LENGTH)
            touch "$new_dir/$file_name"
        done
        create_structure $((current_depth + 1)) "$new_dir"
    done
}

create_structure 1 "$BASE_DIR"
```
Is that only on Windows?
I tried the above on my Linux box and dust handled it OK.
Not only on Windows; it also happens on Linux on a VM with 64G of RAM.
I have a 300TB volume on Linux with billions of files. It takes 30GB RSS + 170GB kmem and goes OOM for the container. I limited the depth to 3, so theoretically it could be done with memory proportional only to the number of directories at depth 3 or less.
I am using `parallel du -hs ::: */*/*` instead and it works quite well (the catch is that the workload is not balanced between processes, and the last, largest directory takes a long time).
I don't think this is possible to fix. du runs and dumps its output as it goes; dust loads it all into memory to make a decision. If there is too much to load, dust will run out of memory.
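For what it's worth, that streaming behaviour can be approximated with plain GNU du (a rough sketch; the depth and path are placeholders): du prints each directory total as it finishes, so only the comparatively small summary lines ever need to be held in memory by sort.

```bash
# Rough sketch: a bounded-memory "largest directories" report built on du's
# streaming output; only the per-directory summary lines are buffered by sort.
du -x --max-depth=3 /mnt/bigvolume 2>/dev/null | sort -rn | head -n 100
```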