`dbfs ls -l` output familiar to UNIX `ls -l`

Open hzpc-joostk opened this issue 3 years ago • 0 comments

I find the output of dbfs ls -l a bit odd. In particular, modification_time is hard to read because it is printed as a raw epoch timestamp in milliseconds. So I wrote a small gawk script that post-processes the output to look more like that of UNIX ls -l:

dbfs ls -l dbfs:/
dir         0  FileStore           1656588772000
dir         0  cluster-logs        1631877073000
dir         0  databricks          1596191581000
file  4608000  foo                 1657881982000
dir         0  init_scripts        1655196362000
dir         0  mnt                 1596027464000
dir         0  tmp                 1657843457000
dir         0  user                1588850628000
dbfs ls dbfs:/ -l | gawk 'BEGIN { hy=systime()-3600*24*30*6} 1 { if($1 == "dir") {d="d";m="rwx";s="/"} else {d="-";m="rw-";s=""}; t=$4/1000; tfmt = t < hy ? "%b %d  %Y" : "%b %d %H:%M"; printf "%s%s%s%s  root root %12d %s %s\n", d,m,m,m, $2, strftime(tfmt,$4/1000), $3 s }'
drwxrwxrwx  root root            0 Jun 30 13:32 FileStore/
drwxrwxrwx  root root            0 Sep 17  2021 cluster-logs/
drwxrwxrwx  root root            0 Jul 31  2020 databricks/
-rw-rw-rw-  root root      4608000 Jul 15 12:46 foo
drwxrwxrwx  root root            0 Jun 14 10:46 init_scripts/
drwxrwxrwx  root root            0 Jul 29  2020 mnt/
drwxrwxrwx  root root            0 Jul 15 02:04 tmp/
drwxrwxrwx  root root            0 May 07  2020 user/

For me, this reads more naturally. Of course, the file modes are fictitious. I picked what is likely to show up when running ls -l via %sh in a notebook cell, with two differences: there, all files are marked as executable, and directories often show a size of 4096 bytes.

I then turned to rewriting this in databricks_cli.dbfs.api.FileInfo itself.

import time

...

class FileInfo(object):
    ...

    def to_row_unix(self, is_absolute):
        path = self.dbfs_path.absolute_path if is_absolute else self.dbfs_path.basename
        stylized_path = click.style(path, 'cyan') if self.is_dir else path

        if self.is_dir:
            mod = 'drwxrwxrwx'
            stylized_path += '/'
        else:
            mod = '-rw-rw-rw-'
        
        size = self.file_size

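        # Scale the byte count down to the largest fitting binary unit.
        for p in " KMGTP":
            if size < 1024:
                if size < 10 and p != " ":
                    # one decimal for small scaled values, e.g. 4.4M
                    size = f"{size:.1f}{p}"
                else:
                    size = f"{size:.0f}{p}"
                break

            size /= 1024
        else:
            # Ran out of units: undo the last division so the value
            # matches the final suffix (P).
            size = f"{size * 1024:.0f}{p}"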

        if self.modification_time is None:
            mtime = 0
        else:
            mtime = self.modification_time // 1000

        timet = time.gmtime(mtime)

        if mtime < (time.time() - 3600*24*365/2):
            # format times longer than six months ago with year
            tfmt = "%b %d  %Y"
        else:
            # format times within last six months with time
            tfmt = "%b %d %H:%M"

        ftime = time.strftime(tfmt, timet)

        return [mod, "root", "root", size, ftime, stylized_path]
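
The size handling above can also be sketched as a standalone helper, which makes it easier to check in isolation. This is only a sketch: the name human_size is hypothetical and not part of databricks-cli, and it caps at the P suffix for very large values.

```python
def human_size(size):
    """Render a byte count with a binary suffix, e.g. 4608000 -> '4.4M'.

    Hypothetical helper for illustration; not part of databricks-cli.
    """
    units = ["", "K", "M", "G", "T", "P"]
    value = float(size)
    index = 0
    # Divide by 1024 until the value fits, but never go past the last unit.
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    unit = units[index]
    # Small scaled values get one decimal; plain byte counts never do.
    if unit and value < 10:
        return f"{value:.1f}{unit}"
    return f"{value:.0f}{unit}"


if __name__ == "__main__":
    for n in (0, 1023, 1024, 10240, 4608000):
        print(n, human_size(n))
```

Using a bounded while loop instead of for/else keeps the value and its suffix in step once the units run out.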

Along with some additional code in databricks_cli.dbfs.cli.ls_cli:

...
@click.option('--unix', is_flag=True, default=False,
              help="""Output as UNIX `ls` command.""")
def ls_cli(api_client, l, absolute, dbfs_path, unix): #  NOQA
    ...

    files = DbfsApi(api_client).list_files(dbfs_path)

    if unix:
        # to_row_unix always prints the long form, so it only takes is_absolute
        rows = [f.to_row_unix(is_absolute=absolute) for f in files]

    else:
        rows = [f.to_row(is_long_form=l, is_absolute=absolute) for f in files]

    table = tabulate(rows, tablefmt='plain')
    click.echo(table)

With the new option, the listing looks like this:

drwxrwxrwx  root  root     0  Jun 30 11:32  FileStore/
drwxrwxrwx  root  root     0  Sep 17  2021  cluster-logs/
drwxrwxrwx  root  root     0  Jul 31  2020  databricks/
-rw-rw-rw-  root  root  4.4M  Jul 15 10:46  foo
drwxrwxrwx  root  root     0  Jun 14 08:46  init_scripts/
drwxrwxrwx  root  root     0  Jul 29  2020  mnt/
drwxrwxrwx  root  root     0  Jul 15 00:04  tmp/
drwxrwxrwx  root  root     0  May 07  2020  user/
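
The times here are two hours earlier than in the gawk output (11:32 versus 13:32 for FileStore). A likely explanation is that gawk's strftime formats in local time by default, while to_row_unix uses time.gmtime, which is UTC. Below is a minimal sketch of the timestamp step as a standalone function with a switch between the two. The names format_mtime, now and utc are assumptions for illustration, not part of databricks-cli.

```python
import time

# Same six-month cutoff as in to_row_unix.
SIX_MONTHS = 3600 * 24 * 365 // 2


def format_mtime(mtime_ms, now=None, utc=True):
    """Format a millisecond epoch timestamp in the style of `ls -l`.

    Hypothetical helper: `now` allows a fixed reference time for testing,
    `utc` switches between time.gmtime and time.localtime.
    """
    now = time.time() if now is None else now
    seconds = 0 if mtime_ms is None else mtime_ms // 1000
    to_struct = time.gmtime if utc else time.localtime
    if seconds < now - SIX_MONTHS:
        tfmt = "%b %d  %Y"     # older than six months: show the year
    else:
        tfmt = "%b %d %H:%M"   # recent: show the time of day
    return time.strftime(tfmt, to_struct(seconds))


if __name__ == "__main__":
    reference = 1657881982  # mtime of foo in the listing, in seconds
    print(format_mtime(1656588772000, now=reference))             # UTC
    print(format_mtime(1656588772000, now=reference, utc=False))  # local time
```

Whether a --unix option should default to UTC or to local time is a design choice. UNIX ls -l uses local time, so utc=False would match it more closely.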

If this is useful to others, I can create a pull request implementing a --unix or --posix option.

hzpc-joostk, Jul 15 '22 10:07