
Exceeding max DB size?

Open spikebike opened this issue 5 years ago • 40 comments

I compiled duc-1.4.4, I ran:

duc index --debug -p -d /projects/foo/.duc-index/duc.db /projects/foo

It ran much longer than I expected, and if I strace it I get a seemingly infinite and unchanging:

pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256

Maybe I've hit the max DB size:

$ ls -alh /projects/foo/.duc-index/duc.db
-rwxrwx--- 1 130348 foo 1.8G Aug 20 20:08 /projects/foo/.duc-index/duc.db

I ran strace for a few seconds and put the log in /tmp/foo.log

$ cat /tmp/foo.log | grep pread64 | wc -l
275003
$ cat /tmp/foo.log | grep pread64 | sort | uniq
pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256
pread64(3, strace: Process 9187 detached
$

File handle 3 above is pointing at the duc.db file:

$ ls -al /proc/9187/fd/3
lrwx------ 1 root root 64 Aug 21 09:37 /proc/9187/fd/3 -> /projects/foo/.duc-index/duc.db

The filesystem has 4PB free, and I don't believe it's been full since I started the index.

spikebike avatar Aug 21 '20 17:08 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> I compiled duc-1.4.4, I ran:

Bill> duc index --debug -p -d /projects/foo/.duc-index/duc.db /projects/foo

Bill> It ran much longer than I expected, and if I strace I get an
Bill> seemingly infinite and unchanging:

Bill> pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256

That doesn't help me unfortunately, I'd really need to see a backtrace. But I wonder if the depth of your directories is hitting the PATH_MAX by any chance? Do you want to try a silly hack and see if that makes any difference?

In src/libduc/duc.h is

#define PATH_MAX 16384

and I'd like you to try doubling it if possible and see if that makes it through your index.
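In other words, a quick hack along these lines (value simply doubled for the experiment, nothing tuned about it):

/* src/libduc/duc.h: double the existing limit as a quick experiment */
#define PATH_MAX 32768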

Bill> Maybe I've hit the max DB size:

Bill> $ ls -alh /projects/foo/.duc-index/duc.db
Bill> -rwxrwx--- 1 130348 foo 1.8G Aug 20 20:08 /projects/foo/.duc-index/duc.db

Nah, I've run into indexes of that size myself, where I have 14Tb and 100 million files because my users are idiots.

Bill> I ran strace for a few seconds and put the log in /tmp/foo.log

Bill> $ cat /tmp/foo.log | grep pread64 | wc -l
Bill> 275003
Bill> $ cat /tmp/foo.log | grep pread64 | sort | uniq
Bill> pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256
Bill> pread64(3, strace: Process 9187 detached
Bill> $

Bill> File handle 3 above is pointing at the duc.db file:

Bill> $ ls -al /proc/9187/fd/3
Bill> lrwx------ 1 root root 64 Aug 21 09:37 /proc/9187/fd/3 -> /lustre/eaglefs/projects/foo/.duc-index/duc.db

Bill> The filesystem has 4PB free, and I don't believe it's been full since I started the index.

How many files does your filesystem have? If you have lots of projects under /lustre/eaglefs/projects/, maybe it's time to just index each project separately? I know it's not ideal, but maybe it will get you going.

In any case, maybe you can do:

$ gdb duc
gdb> run index -d /projects/foo/.duc-index/duc.db /projects/foo

I'd also suggest not putting the duc DB into the same directory as the one you are indexing, though it shouldn't really matter.

Anyway, let the above run until it crashes, and then do a 'bt' to get a backtrace.

Or if you have a core file from a previous crash, do:

$ gdb duc core
gdb> bt

and send that to me directly if you want to keep details out of github. I won't share. I can't promise I'll have a good answer right away though...

Thanks for all your testing! John

l8gravely avatar Aug 22 '20 01:08 l8gravely

I found another wedged duc; I don't think it's PATH_MAX, unless the limit is around 100 characters.

I checked on the duc process and none of the file handles are changing, even when I look 10 minutes later. duc.db is suspiciously close to 2GB again:

-rwxrwx--- 1 130348 baz 2019534848 Aug 30 11:21 /projects/baz/.duc-index/duc.db

Is there any hard limit to the number of subdirectories duc can handle in a single directory?

Not seeing any I/O related timeouts, just things like the below that don't change:

lrwx------ 1 root root 64 Sep 21 07:38 0 -> /dev/pts/3
lrwx------ 1 root root 64 Sep 21 07:38 1 -> /dev/pts/3
lr-x------ 1 root root 64 Sep 21 07:38 10 -> /projects/foo/results/simulation_output/run112/depth119/run
l-wx------ 1 root root 64 Sep 16 14:10 2 -> /projects/baz/.duc-index/duc.log
lrwx------ 1 root root 64 Sep 21 07:38 3 -> /projects/baz/.duc-index/duc.db
lr-x------ 1 root root 64 Sep 21 07:38 4 -> /projects/baz
lr-x------ 1 root root 64 Sep 21 07:38 5 -> /projects/foo
lr-x------ 1 root root 64 Sep 21 07:38 6 -> /projects/foo/results
lr-x------ 1 root root 64 Sep 21 07:38 7 -> /projects/foo/results/simulation_output
lr-x------ 1 root root 64 Sep 21 07:38 8 -> /projects/foo/results/simulation_output/run112
lr-x------ 1 root root 64 Sep 21 07:38 9 -> /projects/foo/results/simulation_output/run112/depth119

strace -p on the duc process shows a near-infinite number of the same pread64 call as before, i.e. near-infinite reads of the same string from duc.db.

I forced it to core with gcore, and used gdb for the backtrace:

Thread 1 (Thread 0x7fffedadd900 (LWP 10850)):
#0  0x00007fffebbec033 in __pread_nocancel () at /lib64/libc.so.6
#1  0x00007fffec99e3c4 in tchdbseekreadtry () at /lib64/libtokyocabinet.so.9
#2  0x00007fffec99e492 in tchdbreadrec () at /lib64/libtokyocabinet.so.9
#3  0x00007fffec99f31e in tchdbputimpl () at /lib64/libtokyocabinet.so.9
#4  0x00007fffec9a5871 in tchdbput () at /lib64/libtokyocabinet.so.9
#5  0x00007fffec9a9621 in tcbdbleafsave () at /lib64/libtokyocabinet.so.9
#6  0x00007fffec9a9c55 in tcbdbleafcacheout () at /lib64/libtokyocabinet.so.9
#7  0x00007fffec9ad441 in tcbdbcacheadjust () at /lib64/libtokyocabinet.so.9
#8  0x00007fffec9ada72 in tcbdbputimpl () at /lib64/libtokyocabinet.so.9
#9  0x00007fffec9ae98f in tcbdbput () at /lib64/libtokyocabinet.so.9
#10 0x000000000040450c in db_put (db=<optimized out>, key=key@entry=0x7fffffff9000, key_len=<optimized out>, val=<optimized out>, val_len=<optimized out>) at src/libduc/db-tokyo.c:106
        r = <optimized out>
#11 0x0000000000406086 in scanner_free (scanner=scanner@entry=0x1958bf0) at src/libduc/index.c:549
        key = "83d08f32/20012fe3e015fa9\000\213\225\001\000\000\000"
        devino = 0x1958c50
        keyl = <optimized out>
        r = <optimized out>
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
#12 0x0000000000407ad1 in scanner_scan (scanner_dir=scanner_dir@entry=0xea02a0) at src/libduc/index.c:479
        scanner_ent = 0x1958bf0
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144136071247126441,
          st_nlink = 2,
          st_mode = 16877,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597972316,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x1871ca3 "run",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144136071247126441
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#13 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0xfcc280) at src/libduc/index.c:478
        scanner_ent = 0xea02a0
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144136071247126438,
          st_nlink = 3,
          st_mode = 16888,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597972316,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x5be5933 "depth119",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144136071247126438
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#14 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0x1e9d8e0) at src/libduc/index.c:478
        scanner_ent = 0xfcc280
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135898408178300,
          st_nlink = 1,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 20205568,
          st_blksize = 4096,
          st_blocks = 39512,
          st_atim = {
            tv_sec = 1597971033,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565969670,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565969670,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x1961c5b "run112",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 20230144,
            apparent = 20205568,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135898408178300
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#15 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0xf80050) at src/libduc/index.c:478
        scanner_ent = 0x1e9d8e0
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135898173277655,
          st_nlink = 33,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597912516,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565870580,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565870580,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x116895b "simulation_output",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135898173277655
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#16 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0x64c430) at src/libduc/index.c:478
        scanner_ent = 0xf80050
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135898173277654,
          st_nlink = 5,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597912516,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565969783,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565969783,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x64cecb "results",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135898173277654
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#17 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0x643f60) at src/libduc/index.c:478
        scanner_ent = 0x64c430
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135941123016226,
          st_nlink = 5,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 36864,
          st_blksize = 4096,
          st_blocks = 72,
          st_atim = {
            tv_sec = 1597912516,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1583785545,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1583785545,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x64431b "parametric_master",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 36864,
            apparent = 36864,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135941123016226
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#18 0x0000000000407ca6 in duc_index (req=0x62e690, path=<optimized out>, flags=flags@entry=(unknown: 0)) at src/libduc/index.c:632
        duc = 0x62e430
        path_canon = 0x63fe90 "/projects/baz99"
        report = 0x63ff00
        scanner = 0x643f60
#19 0x000000000040d8f8 in index_main (duc=0x62e430, argc=<optimized out>, argv=<optimized out>) at src/duc/cmd-index.c:102
        report = <optimized out>
        siz_apparent = '\000' <repeats 31 times>
        siz_actual = '\000' <repeats 31 times>
        index_flags = (unknown: 0)
        open_flags = <optimized out>
        r = <optimized out>
#20 0x000000000040390d in main (argc=1, argv=0x7fffffffda88) at src/duc/main.c:177
        r = <optimized out>
        duc = 0x62e430
        cmd = 0x618de0 <cmd_index>
        ducrc = 0x62e450
        home = 0x7fffffffed83 "/root"
        log_level = <optimized out>
(gdb) quit

spikebike avatar Sep 21 '20 14:09 spikebike

Bill, thanks for the report. I've looked at my own DBs and the largest one I have is 1.2GB, but using TokyoCabinet (like you're using) we already do

int opts = BDBTLARGE;
if(compress) opts |= BDBTDEFLATE;
tcbdbtune(db->hdb, -1, -1, -1, -1, -1, opts);

before we open the DB, so duc should be able to open and use DBs larger than 2GB. I'm not sure how to test this without writing some code to fill one in by hand... but maybe we need to do some tuning for bigger numbers. Can you try the patch at the end for me?

Looking at your gdb trace (thank you!) I'm interested in:

    st_ent = {
          st_dev = 2211483442,
          st_ino = 144136071247126441,
          st_nlink = 2,
          st_mode = 16877,
          st_uid = 131078,
          st_gid = 130988,

And how large the st_ino and st_uid values are. On Linux at least, st_uid is a 32-bit integer, so that st_uid is fine. st_ino is a 64-bit number, so even though it looks huge, it fits within 64 bits. Hmmm... Let's try this if you don't mind:

diff --git a/src/libduc/db-tokyo.c b/src/libduc/db-tokyo.c
index 1eb7925..3eae431 100644
--- a/src/libduc/db-tokyo.c
+++ b/src/libduc/db-tokyo.c
@@ -61,8 +61,12 @@ struct db *db_open(const char *path_db, int flags, duc_errno *e)
 
 	int opts = BDBTLARGE;
 	if(compress) opts |= BDBTDEFLATE;
-	tcbdbtune(db->hdb, -1, -1, -1, -1, -1, opts);
+	int ret = tcbdbtune(db->hdb, 256, 512, 131071, 512, 2048, opts);
+	if(ret == 0) {
+		*e = tcdb_to_errno(db->hdb);
+		goto err2;
+	}
 
 	int r = tcbdbopen(db->hdb, path_db, mode);
 	if(r == 0) {
 		*e = tcdb_to_errno(db->hdb);

Then compile and test this out and let us know if it craps out again. Basically, I just bumped all the values to twice their defaults. It's not clear to me whether this will break things with older DBs, so I'd recommend being careful writing to old DBs with this new code.

Thanks, John

l8gravely avatar Sep 21 '20 20:09 l8gravely

And I'm a moron: I misread the docs and sent my patch before I did a test compile. Please use this patch instead. Those two parameters, 512 and 2048, needed to be the exponents, not the values. Duh... but I still don't know if this will fix the problem:

diff --git a/src/libduc/db-tokyo.c b/src/libduc/db-tokyo.c
index 1eb7925..3006113 100644
--- a/src/libduc/db-tokyo.c
+++ b/src/libduc/db-tokyo.c
@@ -61,8 +61,12 @@ struct db *db_open(const char *path_db, int flags, duc_errno *e)
 
 	int opts = BDBTLARGE;
 	if(compress) opts |= BDBTDEFLATE;
-	tcbdbtune(db->hdb, -1, -1, -1, -1, -1, opts);
+	int ret = tcbdbtune(db->hdb, 256, 512, 131072, 9, 11, opts);
+	if(ret == 0) {
+		*e = tcdb_to_errno(db->hdb);
+		goto err2;
+	}
 
 	int r = tcbdbopen(db->hdb, path_db, mode);
 	if(r == 0) {
 		*e = tcdb_to_errno(db->hdb);

Also, how much memory was your duc index process using when it got into the loop? How much memory is on the indexing system?

John

l8gravely avatar Sep 21 '20 20:09 l8gravely

I will try your patch and rerun on that directory. The system has 192GB of RAM (188GB as reported by free -h). I didn't look at the memory size, but it didn't cause any problems; next time it hangs I'll get you the memory size. I suspect this will take a few days, so I'll let you know. I'll delete the old database before launching.

spikebike avatar Sep 22 '20 19:09 spikebike

I recompiled, deleted the DB, and it's running. I did find a directory with:

$ ls -al |grep drw| wc -l
350002

So far:

[-#------] Indexed 5.0Gb in 72.8K files and 24.2K directories

spikebike avatar Sep 22 '20 21:09 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> I recompiled, deleted the GB, and It's running, I did find a directory:
Bill> $ ls -al |grep drw| wc -l
Bill> 350002

Bill> So far:

Bill> [-#------] Indexed 5.0Gb in 72.8K files and 24.2K directories

Do you know how many files the filesystem has? What does 'df -i' tell you? Also, what is the size of the DB now? We should be supporting DBs larger than 2GB without any problem... but who knows.

I wish I had a way to build a test case for this. Maybe I can feed in junk data somehow to catch this type of error.

l8gravely avatar Sep 23 '20 18:09 l8gravely

df -i works, but it covers the entire /projects filesystem, while I'm only running duc on one of its 250 subdirectories.

Filesystem                          Inodes IUsed IFree IUse% Mounted on
<ip>@ib:<ip>@ib:/eaglefs   6.6G  1.6G  5.0G   25% /projects

At one point there was a ton of directories, only somewhat more files, and not much storage. So 2-3 files per dir, and obviously very small files.

[----#---] Indexed 594.8Gb in 8.5M files and 2.8M directories

Currently:

[-----#--] Indexed 8.0Tb in 12.6M files and 4.2M directories
[--#-----] Indexed 8.4Tb in 18.0M files and 5.9M directories
154M Sep 23 21:45 /projects/<foo>/.duc-index/duc.db
228M Sep 24 08:45 /projects/<foo>/.duc-index/duc.db

I believe the directory (last I looked) is around 12.3TB. It would be very cool if this rate keeps up, but I fear quite a few million files remain to be scanned.

spikebike avatar Sep 23 '20 21:09 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> df -i works, but is for the entire projects dir, but I'm only
Bill> running duc on one of 250 sub directories.

Bill> Filesystem                          Inodes IUsed IFree IUse% Mounted on
Bill> <ip>@ib:<ip>@ib:/eaglefs   6.6G  1.6G  5.0G   25% /projects

That's a crap load of files! I bet incremental backups suck suck suck... unless you can do block level snapshots or something like that.

Bill> At one point there was a ton of directories, only somewhat more
Bill> files, and not much storage. So 2-3 files per dir, and obviously
Bill> very small files.

Bill> [----#---] Indexed 594.8Gb in 8.5M files and 2.8M directories

Bill> Currently:

Bill> [------#-] Indexed 7.8Tb in 10.0M files and 3.3M directories
Bill> 122M Sep 23 14:48 /projects/<foo>/.duc-index/duc.db

I've got filesystems with 75+ million files in under 8Tb of space, which duc handles without any problems.

Bill> I believe the directory (last I looked) and it's around
Bill> 12.3TB. It would be very cool if this rate keeps up, but I fear
Bill> quite a few million files remain to be scanned.

It should handle that many without problems. I'm starting to wonder if it's the depth that's killing us, but that might be because the hashing of TokyoCabinet starts crapping out due to hash collisions.

Just scanning all those files takes a long time, so getting something repeatable isn't going to be quick.

Do you have any idea how long the path to the deepest level of the filesystem is? I'm thinking that maybe PATH_MAX is getting hit somewhere... which should be easy enough to test with a simple test case.

John

l8gravely avatar Sep 24 '20 15:09 l8gravely

No particular idea on PATH_MAX, but I'm looking at all open file handles once every 10 minutes. So far I've not seen much over 110 characters for the path. And the max number of file handles duc is using is around 10.

Not sure what hash is used, but this mix does seem pretty challenging if it's not a good hash. There are directories called upXX from 01 to 40. In each of those are subdirectories named baz0000001 to baz0350000, and each of those has 6 files and 2 directories.

So the number of dirs is something like 40 * 350,000 * 2 = 28M directories, and 40 * 350,000 * 6 = 84M files.

Do you think it's worth writing a wrapper for tokyocabinet to accept paths on <STDIN> so we can pump millions of pathnames in quickly? I could write a bit of code to generate the paths, which would likely be 1000s of times faster than walking a filesystem.

Something like:

#!/bin/bash
base=/this/is/just/to/take/up/90/chars/or/so/to/be/an/accurate/test/of/the/hash/function
for i in `seq -w 1 40`; do
	for j in `seq -w 1 350000`; do
#dirs
		echo $base/bazfoo$i/$j;
		echo $base/bazfoo$i/$j/run;
#files
		echo $base/bazfoo$i/$j/in.osw
		echo $base/bazfoo$i/$j/out.osw
		echo $base/bazfoo$i/$j/singularity_output.log
		echo $base/bazfoo$i/$j/run/data_point_out.json
		echo $base/bazfoo$i/$j/run/data_point.zip
		echo $base/bazfoo$i/$j/run/finished.job
	done
done

On a random server I ran 1 of the 40 loops and it took under a minute, so the above script should take approximately 40 minutes to produce the listing of the 28M directories and 84M files.

It's possible that your patch fixed things, so maybe we should just wait. Currently:

[#-------] Indexed 8.5Tb in 19.5M files and 6.4M directories

spikebike avatar Sep 24 '20 17:09 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> No particular idea on PATH_MAX, but I'm looking at all open file
Bill> handles once every 10 minutes. So far I've not seen much over
Bill> 110 characters for the path. And the max number of file handles
Bill> duc is using is around 10.

That's good info.

Bill> Not sure what hash is used, but this mix does seem pretty
Bill> challenging, if it's not a good hash. There's directories
Bill> called upXX from 01 to 40. In each of those is a sub directory
Bill> called baz0000001 to baz0350000, and each of those has 6 files
Bill> and 2 directories.

Yikes! It's like an old-time NNTP news spool directory structure! Are the six files and two directories in there also named in a regular pattern, or are they much more variable?

Bill> So then number of dirs is something like 40*350000*2 = 28M
Bill> directories and 40*350000*6 = 84M files.

Pretty big, but as I've said I've got one filesystem with 75+ million files, so now I'm suspecting we're hitting hash bucket collisions here.

Bill> Do you think it's worth writing a wrapper for tokyocabinet to
Bill> accept paths on <STDIN> so we can pump millions of pathnames quickly? I
Bill> could write a bit of code to generate the paths, which would likely be
Bill> 1000s of times faster than walking a filesystem.

Yeah, it would be a simpler test to feed all this data in and see what it does without hitting the filesystem scan. I'll have to think about how to do this.

Another thought is to try and compile with a new backend DB if you have the time and inclination. I guess I'd go with LevelDB first off, but maybe KyotoCabinet would also work since it's a newer(ish) version of the same DB type as Tokyocabinet, but without the problems.

Bill> Something like:

Bill> #!/bin/bash
Bill> base=/this/is/just/to/take/up/90/chars/or/so/to/be/an/accurate/test/of/the/hash/function
Bill> for i in `seq -w 1 40`; do
Bill>   for j in `seq -w 1 350000`; do
Bill> #dirs
Bill>     echo $base/bazfoo$i/$j;
Bill>     echo $base/bazfoo$i/$j/run;
Bill> #files
Bill>     echo $base/bazfoo$i/$j/in.osw
Bill>     echo $base/bazfoo$i/$j/out.osw
Bill>     echo $base/bazfoo$i/$j/singularity_output.log
Bill>     echo $base/bazfoo$i/$j/run/data_point_out.json
Bill>     echo $base/bazfoo$i/$j/run/data_point.zip
Bill>     echo $base/bazfoo$i/$j/run/finished.job
Bill>   done
Bill> done

Yeah, time to see if it's simpler to hack duc to do this, or to just push data into a TokyoCabinet DB directly. Hmm... it might be simplest to use the 'tcbtest' and 'tcbmgr' tools to build a crappy test to see if it blows up. More in a bit as I play with this...

Bill> It's possible that your patch fixed things, so maybe we should just wait. Currently:

All I really did was double the hash structures and such, not any smart tuning by any means.

Bill> [#-------] Indexed 8.5Tb in 19.5M files and 6.4M directories

Good to see!

l8gravely avatar Sep 25 '20 00:09 l8gravely

Bill, I'm running the following script... though it's not very fast; it takes about 30 seconds to do 1000 inserts of the two directories and six files into the test DB.

My script looks like this, thanks to your base it was easy to whip up!

#!/bin/bash

db=/tmp/test.db

base=/this/is/just/to/take/up/90/chars/or/so/to/be/an/accurate/test/of/the/hash/function

tcbmgr create -tl $db
for i in `seq -w 1 40`; do
    echo i=$i
    for j in `seq 1 350000`; do
        #dirs
        tcbmgr put $db "$base/bazfoo$i/$j" 1
        tcbmgr put $db "$base/bazfoo$i/$j/run" 1

        #files
        tcbmgr put $db "$base/bazfoo$i/$j/in.osw" 10
        tcbmgr put $db "$base/bazfoo$i/$j/out.osw" 20
        tcbmgr put $db "$base/bazfoo$i/$j/singularity_output.log" 30
        tcbmgr put $db "$base/bazfoo$i/$j/run/data_point_out.json" 40
        tcbmgr put $db "$base/bazfoo$i/$j/run/data_point.zip" 50
        tcbmgr put $db "$base/bazfoo$i/$j/run/finished.job" 60
        if (( $j % 1000 == 0 )); then
            echo -n "."
        fi
    done
done

l8gravely avatar Sep 25 '20 00:09 l8gravely

Bill, Can you do the following on your broken DB? It would be interesting to see if we get numbers out of it:

tcbmgr inform /path/to/duc.db

Here's some numbers I got from a random duc DB I had hanging around:

$ tcbmgr inform /tmp/duc.db
path: /tmp/duc.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 128
max node member: 256
leaf number: 1594
node number: 11
bucket number: 32749
used bucket number: 1605
alignment: 256
free block pool: 1024
inode number: 2228415
modified time: 2020-08-17T19:26:55-05:00
options: large deflate
record number: 38490
file size: 8370944

And here's the duc info -d /tmp/duc.db info as well.

$ ./duc info -d /tmp/duc.db | head -2
Date       Time       Files    Dirs    Size Path
2020-08-17 20:26:49  618.9K   37.8K  413.5G /home/john

So this is just a small DB of my home machine, and you can see the "record number" field pretty much gels with the number of dirs in the DB. So actually thinking even more about it, the test I'm running is going to push even more info into the DB than we really need, since duc itself only records directory size and file counts. Eh... it will be an ok test to see if we can break it quicker than actually scanning the filesystem.

I'll let it run until it breaks or it ends and post more info then.

John

l8gravely avatar Sep 25 '20 00:09 l8gravely

First the update:

[-------#] Indexed 9.4Tb in 33.5M files and 11.1M directories
432M Sep 25 12:14 /projects/foo/.duc-index/duc.d

l8gravely> Yikes! It's like you're an old time NNTP news spool directory
l8gravely> structure! Are the six files and two directories in there also named
l8gravely> in a regular pattern, or are they much more variable?

Heh, indeed, not many around recognize NNTP these days. When I was a wee lad I used to read 100% of usenet, of course that didn't last long.

The particularly problematic directory is nearly identical to the one produced by my example script: one of 40 dirs with 350k dirs in each, and each of those has 6 files and one more subdir.

Other dirs are fine, like this one:

Date        Time     Files     Dirs    Size Path
2020-08-13 18:37:49  222.0M   52.5M   95.1T /projects/civid
2.4G Sep  9 06:43 civid.db

So clearly it's not the case that 83M files and 23M directories is, by itself, the problem.

l8gravely> Another thought is to try and compile with a new backend DB if you l8gravely> have the time and inclination. I guess I'd go with LevelDB first off, l8gravely> but maybe KyotoCabinet would also work since it's a newer(ish) version l8gravely> of the same DB type as Tokyocabinet, but without the problems.

I'm game, and thanks for mentioning KyotoCabinet vs. TokyoCabinet; I hadn't noticed it. I'll let this one finish (or fail) first though.

I had hoped that some 750 directories (250 under /projects and 500 under /scratch) would be fine grained enough to allow reasonable runtimes by running them in parallel. I'm considering a few options:

  1. A parallel scanner written in go piped to duc or the storage layer more directly
  2. Something that can parse Lustre changelogs and pipe to duc and/or storage
  3. Some way to run multiple duc scans on each of /projects/foo/up[00-40] and get the result into a single duc.db

Your tcbmgr script is the right idea, but each put requires launching tcbmgr, which then has to do a full walk of the database/index before inserting. Considering that, 30 per second is pretty impressive. I'll look into writing a small wrapper that accepts the directories on a pipe so we can just stream 100M test file paths to it, without doing any I/O beyond writing to the DB. On an SSD that should be pretty quick and would let us figure out what patterns cause it to explode much more quickly; once that works I can walk all 15-20PB looking for any problematic patterns.

If that's too slow it might be better to slurp up, say, 1M paths into a small btree, and then walk both btrees (the small one and TokyoCabinet); that way you don't walk the entire tree per file.

l8gravely> Can you do the following on your broken DB?

Next time, yes. I don't have any broken ones at the moment. Here's one of the larger working ones:

$ tcbmgr inform ./civic.db
path: ./civic.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 128
max node member: 256
leaf number: 624808
node number: 3590
bucket number: 32749
alignment: 256
free block pool: 1024
inode number: 162133303107829870
modified time: 2020-09-09T05:43:06-07:00
options: large deflate
record number: 52472956
file size: 2483456512

spikebike avatar Sep 25 '20 18:09 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> First the update:
Bill> [-------#] Indexed 9.4Tb in 33.5M files and 11.1M directories
Bill> 432M Sep 25 12:14 /projects/foo/.duc-index/duc.d

l8gravely> Yikes! It's like you're an old time NNTP news spool directory
l8gravely> structure! Are the six files and two directories in there also named
l8gravely> in a regular pattern, or are they much more variable?

Bill> Heh, indeed, not many around recognize NNTP these days. When I
Bill> was a wee lad I used to read 100% of usenet, of course that
Bill> didn't last long.

Bill> The particularly problematic directory is near identical to that
Bill> produced by my example script, one of 40 dirs with 350k dirs in
Bill> each one, and that subdir has 6 files and one more subdir.

I'm suspecting more and more that we're hitting a pathological case here with the hashing function. If you look in libduc/index.c near the end, you'll see this:

char key[32];
struct duc_devino *devino = &scanner->ent.devino;
size_t keyl = snprintf(key, sizeof(key), "%jx/%jx", (uintmax_t)devino->dev, (uintmax_t)devino->ino);
int r = db_put(duc->db, key, keyl, scanner->buffer->data, scanner->buffer->len);

And I suspect that maybe this key isn't big enough in your case to get around hash collisions because of how similar your paths are. So this is another place where we could possibly hack the code. But if the change I made before fixes things for you, then that's probably a better way, since I think it's backwards compatible with older DBs.
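As a quick sanity check of my own (illustration only, plugging in the st_dev/st_ino values from frame #11 of your backtrace), the key really is just those two numbers in hex and fits comfortably in the 32-byte buffer:

/* standalone check: rebuild the key the same way index.c does, using the
 * st_dev/st_ino values from frame #11 of the gdb backtrace above */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uintmax_t dev = 2211483442;            /* st_dev from the backtrace */
	uintmax_t ino = 144136071247126441ULL; /* st_ino from the backtrace */
	char key[32];
	int keyl = snprintf(key, sizeof(key), "%jx/%jx", dev, ino);
	printf("%s (%d bytes)\n", key, keyl); /* 83d08f32/20012fe3e015fa9, 24 bytes */
	return 0;
}

That matches the key string visible in the scanner_free frame of the backtrace.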

Your directory setup probably requires some tweaks to how we hash things to make it work better. I'm not sure how to do this without custom code at this point if the previous hack doesn't work.

Would it be possible for you to change your setup to NOT have 320,000 directories in a single directory? Do the old NNTP trick of making a level or two of directories down so that you only have a few thousand directories inside any one directory?

Bill> Other dirs are fine, like this one:

Bill> Date        Time     Files     Dirs    Size Path
Bill> 2020-08-13 18:37:49  222.0M   52.5M   95.1T /projects/civid
Bill> 2.4G Sep  9 06:43 civid.db

Bill> So it's clearly not that 83M files and 23M directories is the problem.

Yup, that's not the problem. But is your civid/ tree as regular as the problem tree?

l8gravely> Another thought is to try and compile with a new backend DB
l8gravely> if you have the time and inclination. I guess I'd go with
l8gravely> LevelDB first off, but maybe KyotoCabinet would also work
l8gravely> since it's a newer(ish) version of the same DB type as
l8gravely> Tokyocabinet, but without the problems.

And just to expand on this, we're thinking of changing the backend support with the next major release to drop support for older backends, to make things simpler for us. Especially since TokyoCabinet is not actively developed any more. And I think even Kyotocabinet is a bit un-maintained as well.

Bill> I'm game, and thanks for mentioning KyotoCabinet vs Tokyocabinet
Bill> I hadn't noticed. I'll let this one finish (or fail) first
Bill> though.

Bill> I had hoped that some 750 directories (250 under /projects and
Bill> 500 under /scratch) would be fine grained enough to allow
Bill> reasonable runtimes by running them in parallel. I'm considering
Bill> a few options:

Bill> 1. A parallel scanner written in go piped to duc or the storage layer more directly

Unless your storage backend can handle it, it's not going to give you much gain. The big cost is the actual opendir(), readdir(), stat(), closedir() loop.

Way back when I was first poking at this, when it was Philesight and written in Ruby, I had hacked together a parallel descent version which did N parallel threads. This was before I realized that Ruby had the problem of the Global Thread Lock, so while you could write parallel code, it wasn't really parallel.

Zevv rewrote the core code in C, which did speed things up quite a bit, but making it a multi-writer DB tool is a lot more work. It might work if you had one writer, and multiple children which descended into a tree to read for you. But keeping it all in sync isn't trivial.

Bill> 2. Something that can parse Lustre changelogs and pipe to duc and/or storage

Is there any way to break it down by Lustre filesystem?

Bill> 3. Some way to run multiple duc scans on each of
Bill> /projects/foo/up[00-40] and get the result into a single
Bill> duc.db

This is not an easy thing to do. But maybe an offline merge to take two DBs and merge them together might be an interesting tool to write. Have to think about it.

But

Bill> Your tcbmgr script is the right idea, but each put requires
Bill> launching tcpmgr which then has to do a full walk of the
Bill> database/index before inserting. Considering that, 30 per second
Bill> is pretty impressive. I'll look into writing a small wrapper to
Bill> allow piping the directories so we can just stream 100M test
Bill> file paths to it, without doing any I/O outside of writing to
Bill> the DB. On a SSD that should be pretty quick and would let us
Bill> figure out what patterns cause it to explode much more quickly,
Bill> once that works I can walk all 15-20PB looking for any
Bill> problematic patterns.

Great! I started this last night, but ended up going with the tcbmgr script, which I couldn't get to blow up. I suspect I wasn't pushing data into it quite the same way.

Mostly because duc builds its key with "device_id/inode_num" and pushes in the data that way. So ... replicating that part is what needs to happen, I suspect.

So I just stopped my shell script for now since I had grown the DB file to 3+ gb without any problems.

Bill> If that's too slow might be be better to slurp up say 1M paths
Bill> into a small btree, and then walk both btrees (small and
Bill> Tokyocabinet) that way you don't walk the entire tree per file.

l8gravely> Can you do the following on your broken DB?

Bill> Next time, yes. I don't have any broken ones at the
Bill> moment. Here's one of the larger working ones:

Bill> $ tcbmgr inform ./civic.db
Bill> path: ./civic.db
Bill> database type: btree
Bill> additional flags:
Bill> comparison function: lexical
Bill> max leaf member: 128
Bill> max node member: 256
Bill> leaf number: 624808
Bill> node number: 3590
Bill> bucket number: 32749
Bill> alignment: 256
Bill> free block pool: 1024
Bill> inode number: 162133303107829870
Bill> modified time: 2020-09-09T05:43:06-07:00
Bill> options: large deflate
Bill> record number: 52472956
Bill> file size: 2483456512

This is interesting: we see approx 52+ million entries, matching the number of directories you have. So they must not be nearly as pathological as the other tree.

How is your Lustre filesystem built? What is it based on for underlying filesystems? I suspect XFS if only because I think ext3/4 would simply bog down with 320,000 directories in a single directory. But I could be wrong here.

But as I've said before, splitting up your 250 projects into individual DBs gives you all the parallelism you need; then you have a single index script (see the examples/index.cgi) which takes all your DBs and generates the CGI files for each one, as well as the master index file you use to examine each project.

Now finding some way to tie them together into a single view would be interesting. It might be possible with a new DB format which is purely acting as a way to join DBs into a single view. Have to think about this.

It would keep the individual DBs for each tree/project/fileystem/etc, but give you a way to merge them into a single CGI interface so that users can browse things nicely.

Off the cuff thinking:

duc join -d master.db -a projects/*.db

-a adds one or more DBs into the master DB.

This 'join' command just does a duc info on each DB and sums it up, and builds the tree. So if you index four DBs holding:

foo.db - /projects/foo
bar.db - /projects/bar
blah-one.db - /projects/blah/one
blah-two.db - /projects/blah/two

It would sum up and present you with /projects and then links to the four DBs to then be browsed through. When you jumped into a different DB, it would open that new DB file and present the data.

The trick is getting back up the tree from /projects/foo into the merge.db properly. I guess we'd just need to have a 'global' DB when we encountered one of these join'd DBs. It would be a special case.

Lots to think about here. Have a good weekend! John

l8gravely avatar Sep 25 '20 23:09 l8gravely

Good news, it finished:

[------#-] Indexed 21.0Tb in 159.7M files and 52.6M directories
/projects/foo done Writing to database "/projects/foo/.duc-index/duc.db"

And :

$ tcbmgr inform ./duc.db
path: ./duc.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 256
max node member: 512
leaf number: 583608
node number: 1749
bucket number: 147451
alignment: 512
free block pool: 2048
inode number: 162133238280620164
modified time: 2020-10-02T12:16:55-07:00
options: large deflate
record number: 52608733
file size: 2165248000

From the debug log:

<< /projects/foo actual:22465400747513 apparent:23083084365824
Indexed 159694681 files and 52608730 directories, (20.4TB apparent, 21.0TB actual) in 9d, 22 hours, 59 minutes, and 49.54 seconds.
238:59:52 3%CPU

So it looks like your patch fixed it.

l8gravely> Would it be possible for you to change your setup to NOT have 320,000
l8gravely> directories in a single directory? Do the old NNTP trick of making a
l8gravely> level or two of directories down so that you only have a few thousand
l8gravely> directories inside any one directory?

We can provide feedback to the user, but it's up to them; ideally, of course, duc would walk any legal filesystem.

l8gravely> Yup, that's not the problem. But is your civid tree as regular as
l8gravely> the problem tree?

It was smaller and less regular, and it scanned previously with no problem.

l8gravely> Unless your storage backend can handle it, it's not going to give you
l8gravely> much gain. The big cost is the actual openddir(), readdir(), stat(),
l8gravely> closedir() loop.

I think it can; it's the latency that's killing it. It's got 3 metadata servers and 48 object stores, and even when running 20 copies of duc in parallel I don't see a noticeable difference in the load.

l8gravely> Way back when I was first poking at this when it was Philesight and
l8gravely> written in Ruby, I had hacked together a parallel descent version
l8gravely> which did N parallel threads. This was before I realized that Ruby
l8gravely> had the problem if the Global Thread Lock, so while you could write
l8gravely> parallel code, it wasn't really parallel.

Heh, we've been down many of the same paths. I wanted a high-performance crawler. First I tried Java, only to find that I had to do weird things to get stat() working; it wasn't part of the language standard and I didn't want an external call to C. So I rewrote it in Python, before I knew of the GIL, and hit a serious performance bottleneck. I then wrote it in Go and was quite pleased with the result.

l8gravely> Zevv re-wrote the core code into C which did speed up things quite a
l8gravely> bit, but making it a multi-writer DB tool is alot more work. It might
l8gravely> work if you had one writer, and multiple children which descended into
l8gravely> a tree to read for you. But keeping it all in sync isn't trivial.

I was planning on just having a single database thread being fed by N walkers. Something like:

walker
   dir=getDirFromQueue
   foreach <itemInDir>
      if IsDir:
          if dir_not_queued_or_scanned(dir):
              queueDir(dir)
      if IsFile
          st=stat(file)
          queueFile(st.Size)

I've not benchmarked the database, but hopefully it's not the bottleneck.

l8gravely> Is there any way to break it down by Lustre filesystem?

Not sure I get exactly what you mean; hopefully this answers your question. It's tricky: there are metadata servers, but they mostly track the filename, which dir it's in, and which object stores have the blocks. They do NOT track file size. There is a Lustre changelog, and a tool called robinhood was commonly used to track file sizes. It wasn't perfect, errors could accumulate, but it was reasonable. Unfortunately the changelog format changed, and we also had some problems with robinhood not keeping up; the changelogs would accumulate, and the filesystem would go down.

l8gravely> How is your Lustre filesystem built? What is it based on for
l8gravely> underlying filesystems? I suspect XFS if only because I think ext3/4
l8gravely> would simply bog down with 320,000 directories in a single directory.
l8gravely> But I could be wrong here.

I'll have to check, I believe Lustre is relatively FS agnostic and I think the road map includes support for ZFS. But old ext2/3 filesystems did use a linked list which had major issues. But since then there's a commonly supported flag called dir_index which allows using a hashed btree (sometimes called an htree) which should handle large directories reasonably.

l8gravely> But as I've said before, splitting up your 250 projects into
l8gravely> individual DBs gives your all the parallelism you need, then you have
l8gravely> a single index script (see the examples/index.cgi) which takes all
l8gravely> your DBs and generates the CGI files for each one, as well as the
l8gravely> master index file you use to examine each project.

Sadly, splitting it into 250 projects and some hundreds of /scratch dirs isn't enough. A single /project dir took 9 days and we'd like to be able to scan more often than that. Having a separate DB per /project dir and a simple awk script can give us a total for all directories. So a parallel scanner could work, or manual parallelism where something like /project/foo/up<1 to 100> is scanned in parallel and then merged in post-processing.

The proposed duc join does seem like a reasonable approach. I could walk 246 of the 250 directories and put them all in their own duc.db; then, for the directories that take more than a week to scan, I could do those in parallel into a database for each one, and then join them together afterwards. Interesting, I'll have to ponder.

spikebike avatar Oct 05 '20 07:10 spikebike

Bill> Good news, it finished:
Bill> [------#-] Indexed 21.0Tb in 159.7M files and 52.6M directories
Bill> /projects/foo done Writing to database "/projects/foo/.duc-index/duc.db"

Nice! Glad that it finally finished. That's a crap load of directories and files too, about 2.5x my biggest collection. So the simple thing is to just apply those changes to the next release.

Thanks for your testing, it's been a big help.

Would you be willing to run some tests where we make a couple of different changes to see which one made the difference, or if we really need all of them?

While it's a simple patch, it's five different tweaks, but I suspect in your case it's the bnum, since the man page says:

`bnum' specifies the number of elements of the bucket array.  If it is
not more than 0, the default value is specified.  The default value is
16381.  Suggested size of the bucket array is about from 1 to 4 times
of the number of all pages to be stored.

So if you feel like changing the '131072' back to '-1' in the tcbdbtune() call in db-tokyo.c and re-running the test, that might tell us something.
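If it helps, that single change would make the call look something like this (a hypothetical test build; everything else stays as in the patch):

/* keep the other bumps, but let bnum fall back to the library default,
 * to isolate whether the bucket count alone made the difference */
int ret = tcbdbtune(db->hdb, 256, 512, -1, 9, 11, opts);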

But in any case, I think this is probably a safe change to make since it shouldn't do anything but increase the overhead for smaller DBs, and let monster ones like yours actually work.

Unfortunately, the 'tcbmgr inform' doesn't really give us enough indication on what actually made the difference here. So for now, I'll push this change.

I really need to get my butt in gear to spin out a new release.

John

l8gravely avatar Oct 05 '20 22:10 l8gravely

Would you be willing to run some tests where we make a couple of different changes to see which one made the difference, or if we really need all of them?

In theory, yes. If each run takes 9 days I'm not sure it's worth it. I've yet to find time to look, but it seems like something like:

#ifdef DEBUG
   sscanf(...)   /* get the path data from stdin */
#else
   readdir(...)
   stat(...)
#endif
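For what it's worth, the harness doesn't have to live inside duc at all. A minimal standalone sketch (my own, not duc code; it assumes the tokyocabinet headers and library are installed) that reads one path per line from stdin and inserts it into a Tokyo Cabinet B-tree, using the same tuning values as the patch, would look something like:

/* feed.c: read one path per line from stdin and insert it into a Tokyo
 * Cabinet B-tree, to exercise the DB with zero filesystem scanning.
 * Build with: cc -o feed feed.c -ltokyocabinet */
#include <stdio.h>
#include <string.h>
#include <tcbdb.h>

int main(int argc, char **argv)
{
	if(argc != 2) {
		fprintf(stderr, "usage: %s <db-file>\n", argv[0]);
		return 1;
	}

	TCBDB *bdb = tcbdbnew();
	tcbdbtune(bdb, 256, 512, 131072, 9, 11, BDBTLARGE);
	if(!tcbdbopen(bdb, argv[1], BDBOWRITER | BDBOCREAT)) {
		fprintf(stderr, "open failed: %s\n", tcbdberrmsg(tcbdbecode(bdb)));
		return 1;
	}

	char line[16384];
	unsigned long n = 0;
	while(fgets(line, sizeof(line), stdin)) {
		line[strcspn(line, "\n")] = '\0';  /* strip the newline */
		tcbdbput2(bdb, line, "1");         /* dummy value, like the tcbmgr test */
		if(++n % 1000000 == 0)
			fprintf(stderr, "%lu records\n", n);
	}

	tcbdbclose(bdb);
	tcbdbdel(bdb);
	return 0;
}

Piping the output of the path-generating script from earlier in the thread into something like that would exercise the DB at SSD speed without touching Lustre at all.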

I suspect an SSD could spew 160M file names and insert them into a DB in less than an hour instead of 9 days. Hrm, well maybe in a day anyway. I'll guesstimate 10-100x faster.

Which do you think would be easier to implement:

  1. A parallel scanner combined with a sequential database thread. Each reader takes an entry from the directory list and appends to that list when it finds a directory. You'd need thread-safe queues, and a thread-safe way to check whether each directory has been visited yet. Go makes this easy; I'm less sure of the best approach in C/C++. I've written pthreads code before, but nothing particularly complicated.
  2. Manual parallelism with some post-processing step to merge databases.

As for maximum sizes, my biggest single directory tree (out of 400) as measured by files:

/projects/foo1 files=   221966601 dirs=    52472953 sumBytesTB=    95 filesPerDir=     4 averageFileSizeMB=     0

By dirs:

  /projects/foo2 files=   159694681 dirs=    52608730 sumBytesTB=    20 filesPerDir=     3 averageFileSizeMB=     0

By size:

/projects/foo3 files=    25555665 dirs=     2458317 sumBytesTB=  1821 filesPerDir=    10 averageFileSizeMB=    74

Files per dir:

/projects/foo4 files=     4074022 dirs=          40 sumBytesTB=     0 filesPerDir=  101850 averageFileSizeMB=     0

spikebike avatar Oct 15 '20 22:10 spikebike

Hrm, I do think the patches do make the databases kind of weird and broken if they're not updated.

For instance I ran a scan today and then:

# /nopt/nrel/utils/bin/duc info -d /projects/foo12/.duc-index/duc.db /projects/foo12
Date       Time       Files    Dirs    Size Path
2020-10-06 23:00:37   42.0M    6.8M    9.7T /projects/foo12
# ls -al /projects/foo12/.duc-index/duc.db
349777664 Oct 15 13:58 /projects/foo12/.duc-index/duc.db

I'll zero out all my databases and see if it acts normally.

spikebike avatar Oct 16 '20 03:10 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> Hrm, I do think the patched do make the databases kinda of weird
Bill> and broken if not updated. For instance I ran a scan today and
Bill> then:

I must be suffering from -ENOCAFFEINE because I'm not seeing the problem. Oh wait, did your /projects/novonutr come out as foo12 instead?

Bill> # /nopt/nrel/utils/bin/duc info -d /projects/foo12r/.duc-index/duc.db /projects/novonutr
Bill> Date       Time       Files    Dirs    Size Path
Bill> 2020-10-06 23:00:37   42.0M    6.8M    9.7T /projects/foo12
Bill> # ls -al /projects/foo12/.duc-index/duc.db
Bill> 349777664 Oct 15 13:58 /projects/foo12/.duc-index/duc.db

Bill> I'll zero out all my databases and see if it acts normally.

Just to be clear, is this using the new version of duc with the updated DB settings, causing problems with older DBs created without those settings?

In any case, I need more details please.

l8gravely avatar Oct 16 '20 12:10 l8gravely

I was careful, but made some mistakes pasting in the output. It seems like the old (pre-patch) duc makes a database that silently doesn't update with the patched duc: duc info and duc ls just report the old date, and the rescans run without error, but the date never updates. Once I delete the databases it seems to work great.

spikebike avatar Oct 16 '20 20:10 spikebike

I wrote a scanner in Go; so far it just counts files, counts directories, and counts total bytes. It doesn't store anything in a database, but it is quite a bit faster:

$ time duc index /projects/foo/running/130/trashFolder/136
real	1m8.265s
user	0m16.402s
sys	0m28.423s

$ time go run dir1.go /projects/foo/running/130/trashFolder/136
Total Files=593566 Total Dirs=220418 Total Size=422181633387
real	0m6.541s
user	0m8.294s
sys	1m13.236s

$ time du -hs /projects/foo/running/130/trashFolder/136
395G	/projects/foo/running/130/trashFolder/136
real	0m45.030s
user	0m2.402s
sys	0m24.926s

Not sure how interesting that is though; it doesn't write to a database, and it is only likely to be a significant speedup on parallel filesystems. There are Tokyo Cabinet bindings for Go, so a Go scanner could still be compatible with the rest of duc. So far my implementation is 73 lines of Go.

spikebike avatar Nov 04 '20 10:11 spikebike

Bill> I wrote a scanner in Go, so far it just counts files, counts
Bill> directories, and counts total bytes. It doesn't store anything
Bill> in a database, but it is quite a bit faster:

Interesting! But you do need to be sure to drop any/all caches between tests. You need to do the following before each test, and ideally run the test three times and average the results.

echo 3 > /proc/sys/vm/drop_caches

And you will need to be root to do this.

Bill> $ time duc index /projects/foo/running/130/trashFolder/136
Bill> real 1m8.265s
Bill> user 0m16.402s
Bill> sys 0m28.423s

Bill> $ time go run dir1.go /projects/foo/running/130/trashFolder/136
Bill> Total Files=593566 Total Dirs=220418 Total Size=422181633387
Bill> real 0m6.541s
Bill> user 0m8.294s
Bill> sys 1m13.236s

Bill> $ time du -hs /projects/foo/running/130/trashFolder/136
Bill> 395G /projects/foo/running/130/trashFolder/136
Bill> real 0m45.030s
Bill> user 0m2.402s
Bill> sys 0m24.926s

Bill> Not sure how interesting that is though, it doesn't write to a
Bill> database, and is only likely to be a significant speedup on
Bill> parallel filesystems. There are tokoyo cabinet bindings for go,
Bill> so a go scanner could still be compatible with the rest of
Bill> duc. So far my implementation is 73 lines of go.

The duc format is pretty simple, so you might be able to write something that will interoperate, even if only for the generation of the index at first.

But I really want to see your numbers from a cold cache before we really take this any farther.

John

l8gravely avatar Nov 10 '20 22:11 l8gravely

With a parallel file system, dropping caches is tricky. There are client-side caches, metadata caches (on the metadata server), and caches on the object stores as well. To complicate things, there are 2600 other nodes making requests of the parallel file system. I ran my code twice, first with cold caches, then with warm, and then duc (also with a warm cache).

I'll try to quantify what difference dropping the client caches makes ... later.

spikebike avatar Nov 11 '20 10:11 spikebike

"Bill" == Bill Broadley [email protected] writes:

Bill> With a parallel file system dropping caches is tricky. There's
Bill> client side caches, metadata caches (on the meta data server),
Bill> and cache on the object stores as well. To complicate things
Bill> there's 2600 other nodes making requests of the parallel file
Bill> system. I ran my code twice, the first with cold caches, the 2nd
Bill> with warm, and then duc (also with a warm cache).

Can you explain your setup better? I really just want you to drop the caches on the Linux system doing the indexing. I don't want to tell the storage to drop caches. Once they're big enough, they should spill no matter what.

Bill> I'll try to quantify what differences dropping the client caches
Bill> changes things ... later.

So maybe run the test three times and average things. Or four times and drop the first run as you warm the caches.

John

l8gravely avatar Nov 11 '20 22:11 l8gravely

Well, part of the problem is that the node in question is running duc scans; the length of these scans (one running since Oct 15th) is why I'm looking for faster solutions:

$  ps acux | grep duc
root     24403  2.4  0.0  57488 32612 pts/3    S+   Nov09  79:56 duc
root     26617  1.0  0.0  72464 46036 pts/1    R    Oct15 412:06 duc
root     27126  3.0  0.0 119756 88712 pts/2    S+   Nov02 407:47 duc

It's one of 2600 or so clients of Lustre, which has 3 metadata servers and 48 or so object stores. The file system is pretty much always busy.

Truly cold, my code gets around 2 minutes 15 seconds for the benchmark dir. But I can only test that once, and I can't easily flush the remote caches. If I run with warm caches:

real	0m8.124s
real	0m7.881s
real	0m7.909s

If I do this in a loop with the drop caches I get:

Total Files=742634 Total Dirs=273695 Total Size=380835906433
real	0m36.399s
Total Files=742634 Total Dirs=273695 Total Size=380835906433
real	0m28.817s
Total Files=742634 Total Dirs=273695 Total Size=380835906433
real	0m32.781s

If I do the same with duc, again with drop_caches immediately before every run:

for i in 1 2 3; do echo $i;  echo 3 > /proc/sys/vm/drop_caches; time duc index /projects/test/running/130/trashFolder/137; done
real	10m32.848s
real	14m53.847s
real	7m33.314s

That's more variation than usual, but you get the idea.

spikebike avatar Nov 12 '20 02:11 spikebike

Hi Folks,

Sorry to drop into the discussion this late.

Quoting Bill Broadley (2020-11-04 11:32:40)

I wrote a scanner in Go, so far it just counts files, counts directories, and counts total bytes. It doesn't store anything in a database, but it is quite a bit faster:

It is significantly faster indeed. I assume you do your scanning in parallel with multiple threads/goroutines, or is it just sequential like Duc?

I'm 100% sure that for parallel file systems like Lustre, CEPH and others, Duc is far from optimal: it scans the whole tree recursively from a single thread and thus makes no use of the inherent parallelism offered by the underlying file system.

In the early days of Duc I made some attempts at properly parallelizing the scanning, but I felt the gains on "normal" file systems were not worth the additional complexity.

Can you share some more details about your scanner implementation, possibly sharing the source, or maybe show a few thousand lines of strace output from a run?

Thanks!

Ico

-- :wq ^X^Cy^K^X^C^C^C^C

zevv avatar Nov 12 '20 07:11 zevv

zevv> I'm 100% sure that for parallel file systems like Lustre, CEPH and zevv> others, Duc is far from optimal: it scans the whole tree zevv> recursively from a single thread and thus makes no use of the inherent zevv> parallelism offered by the underlying file system.

I did try breaking up a large (several PB) filesystem into 400 directories and running 30-40 copies of duc in parallel. For 95% or so of our directories it's fine. But for others I end up with week- or even month-long scans, one started on Oct 15th (28 days ago).

Lustre has metadata servers, but the metadata only includes the filename, permissions, and which object stores the file is using. Sadly it does NOT include the file size. So a simple stat() call for size requires client -> metadata server -> object stores (OSTs). The latency of a single stat is clearly higher when it involves network traffic to multiple other servers, so a parallel approach really helps throughput.
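
To put rough (assumed, not measured) numbers on it: at ~1 ms of round-trip time per remote stat(), the ~740k files in the benchmark directory would cost over 12 minutes of pure latency stat()ed one at a time, but only around 46 seconds with 16 stats in flight.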

With warm caches I see things like:

THREADS  runtime
16   real	0m8.200s // 9.3x faster with 16x threads
  8   real	0m11.263s
  4   real	0m20.874s
  2   real	0m39.100s
  1   real	1m16.306s

For this reason my first attempt was:

for {
   list, err := file.Readdirnames(1024) // names only, hoping to avoid a stat() call per entry
   if err != nil { // io.EOF once the directory is exhausted
      break
   }
   for _, name := range list {
      fullName := dirName + "/" + name
      fileNames <- fullName // send the path to a channel so workers can run stat()
   }
}

Unfortunately that wasn't any faster than the simpler:

      for {
         list, err := dir.Readdir(256) // read up to 256 entries per call (0 would read them all at once)
         if err != nil {
            break
         }
         for _, file := range list {
            totalSize += file.Size()
            if file.IsDir() {
               totalDirs++
               dirNames <- dirName + "/" + file.Name() // queue any found dirs
            } else {
               totalFiles++
            }
         }
      }

The above is available as a working example at https://github.com/spikebike/go-dirscan

spikebike avatar Nov 12 '20 18:11 spikebike

So a simple stat() call for size requires client -> metadata server -> talk to object stores (OSTs).

Right, and that kind of latency really adds up.

Part of the problem I ran into with parallelizing the scan as a whole is that at some point the subtrees meet and have to be consolidated in the common root. Traversing the FS in parallel is not hard, but keeping the bookkeeping straight is harder.

Part of my problem is that threading and parallelization is just a pain-in-the-ass in C, in general. I got this to work one time, but the resulting code was just getting too complicated. I guess the scanner logic could greatly benefit from being written in a language with better threading primitives - like Go.

If the stat()ing is really the main problem, I can see a relatively clean solution where Duc could introduce a thread pool only for doing the stats in parallel. So in this case one directory will still be traversed at a time, all entries will be retrieved with readdir(), and then all the individual lstat() calls will be performed on N threads in parallel. I guess this also helps on non-distributed network file systems as it just parallelizes the server roundtrip time.
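
For what it's worth, here's a rough sketch of that idea. It's in Go only because the other snippets in this thread are Go; in duc itself this would obviously be C around index.c:scanner_scan(), and all the names here (statDir, nworkers) are made up for illustration:

package main

import (
   "fmt"
   "os"
   "path/filepath"
   "sync"
   "sync/atomic"
)

// statDir reads a single directory sequentially, then fans the per-entry
// lstat() calls out over nworkers goroutines and sums the apparent sizes.
func statDir(dir string, nworkers int) (int64, error) {
   f, err := os.Open(dir)
   if err != nil {
      return 0, err
   }
   defer f.Close()

   names, err := f.Readdirnames(-1) // names only, no stat yet
   if err != nil {
      return 0, err
   }

   work := make(chan string)
   var total int64
   var wg sync.WaitGroup
   for i := 0; i < nworkers; i++ {
      wg.Add(1)
      go func() {
         defer wg.Done()
         for name := range work {
            fi, err := os.Lstat(filepath.Join(dir, name))
            if err != nil {
               continue // skip entries that disappear mid-scan
            }
            atomic.AddInt64(&total, fi.Size())
         }
      }()
   }
   for _, name := range names {
      work <- name
   }
   close(work)
   wg.Wait()
   return total, nil
}

func main() {
   dir := "."
   if len(os.Args) > 1 {
      dir = os.Args[1]
   }
   size, err := statDir(dir, 16)
   if err != nil {
      fmt.Fprintln(os.Stderr, err)
      os.Exit(1)
   }
   fmt.Println("total bytes:", size)
}

The same shape should translate to a pthreads pool in C without touching anything outside the scanner, since it only parallelizes the per-entry stats and leaves the directory traversal sequential.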

For this reason my first attempt was: [...]

Unfortunately that wasn't any faster than the simpler: [...]

I think we could learn a lot by properly measuring and/or tracing the individual libc/system calls that are run during a parallel scan to see where the low-hanging fruit is.

That said: I stalled my active development on Duc for a long time, mostly because it Works For Me, and I kind of lost interest in the project. John is a terrific help for me, as he actually takes the time and effort to respond to most (or even all?) of the GitHub issues. I'm not sure if I can find the time and motivation to really work on this. But if we think the solution above (parallelizing the stats only) would help enough, I think we can give that a try, as it would be a nice local change affecting only index.c:scanner_scan().

-- :wq ^X^Cy^K^X^C^C^C^C

zevv avatar Nov 12 '20 18:11 zevv

zevv> Part of the problem I ran into with parallelizing the scan as a whole is that at some point the subtrees meet and have to be consolidated in the common root. Traversing the FS in parallel is not hard, but keeping the bookkeeping straight is harder.

Not really: just queue one dir, then queue new dirs as you find them, and that's it. No testing whether you've seen them before, no merging trees, no consolidation. Hrm, my code doesn't check for symlinks; I should just ignore them, since with links it's not strictly a tree.
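
A minimal sketch of that pattern (not the actual go-dirscan code; the names queue, worker, and pending are made up here): the only shared state is a handful of atomic counters, so nothing ever has to be merged back up the tree.

package main

import (
   "fmt"
   "os"
   "sync"
   "sync/atomic"
)

var (
   totalFiles int64
   totalDirs  int64
   totalSize  int64

   dirs    = make(chan string, 4096)
   pending sync.WaitGroup // directories queued but not yet scanned
)

// queue hands a directory to the worker pool without blocking the caller;
// a worker blocking on a full channel would otherwise deadlock the pool.
func queue(dir string) {
   pending.Add(1)
   atomic.AddInt64(&totalDirs, 1)
   select {
   case dirs <- dir:
   default:
      go func() { dirs <- dir }()
   }
}

func worker() {
   for dir := range dirs {
      if f, err := os.Open(dir); err == nil {
         entries, _ := f.Readdir(0) // Readdir does an lstat() per entry
         f.Close()
         for _, fi := range entries {
            atomic.AddInt64(&totalSize, fi.Size())
            if fi.IsDir() {
               queue(dir + "/" + fi.Name()) // newly found dirs just go back on the queue
            } else {
               atomic.AddInt64(&totalFiles, 1)
            }
         }
      }
      pending.Done()
   }
}

func main() {
   root := "."
   if len(os.Args) > 1 {
      root = os.Args[1]
   }
   for i := 0; i < 16; i++ {
      go worker()
   }
   queue(root)
   pending.Wait() // returns once every queued directory has been scanned
   close(dirs)
   fmt.Printf("Total Files=%d Total Dirs=%d Total Size=%d\n",
      totalFiles, totalDirs, totalSize)
}

The trade-off is that this only produces grand totals, not the per-directory breakdown duc stores in its database, so a real interoperable scanner would still need to accumulate sizes per directory before writing the index.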

zevv> If the stat()ing is really the main problem, I can see a relatively clean solution where Duc could introduce a thread pool only for doing the stats in parallel. So in this case one directory will still be traversed at a time, all entries will be retrieved with readdir(), and then all the individual lstat() calls will be performed on N threads in parallel.

Several of our problematic directories have something like 3-4 files per directory, so I didn't want to stick to a single directory at a time.

zevv> That said: I stalled my active development on Duc for a long time, mostly because it Works For Me, and I kind of lost interest in the project. John is a terrific help for me, as he actually takes the time and effort to respond to most (or even all?) of the GitHub issues. I'm not sure if I can find the time and motivation to really work on this. But if we think the solution above (parallelizing the stats only) would help enough, I think we can give that a try, as it would be a nice local change affecting only index.c:scanner_scan().

Very fair, totally makes sense. For the common case the parallel scan is a waste; on my desktop with a fast NVMe (Samsung 950 Pro):

left:~/git/go-dirscan$ time go run dirscan.go .. 8; time du -hs ..
workers = 8
Total Files=1074232 Total Dirs=994209 Total Size=27519874975

real	0m7.369s
user	0m21.723s
sys	0m30.572s
26G	..

real	0m9.901s
user	0m4.233s
sys	0m5.595s

So that's not much of a speedup (7.4 seconds for dirscan vs 9.9 seconds for du), and clearly not worth any additional complexity. I think the best way forward is to add the data I collect directly to the same backend duc uses. That way anyone with a similar problem can use this scanner and still use the other duc tools.

Any idea which of the database backends would be fastest for writes (i.e. fastest for writing scan metadata)? I believe it was mentioned that tokyocabinet is not under active development, but kyotocabinet is.

spikebike avatar Nov 12 '20 19:11 spikebike