databricks-cli
`dbfs ls -l` output similar to UNIX `ls -l`
I find the output of dbfs ls -l a bit odd; the modification_time column in particular is hard to read. So I wrote a small awk (gawk) script that post-processes the output to look more like UNIX ls -l:
dbfs ls -l dbfs:/
dir 0 FileStore 1656588772000
dir 0 cluster-logs 1631877073000
dir 0 databricks 1596191581000
file 4608000 foo 1657881982000
dir 0 init_scripts 1655196362000
dir 0 mnt 1596027464000
dir 0 tmp 1657843457000
dir 0 user 1588850628000
dbfs ls dbfs:/ -l | gawk 'BEGIN { hy=systime()-3600*24*30*6} 1 { if($1 == "dir") {d="d";m="rwx";s="/"} else {d="-";m="rw-";s=""}; t=$4/1000; tfmt = t < hy ? "%b %d %Y" : "%b %d %H:%M"; printf "%s%s%s%s root root %12d %s %s\n", d,m,m,m, $2, strftime(tfmt,$4/1000), $3 s }'
drwxrwxrwx root root 0 Jun 30 13:32 FileStore/
drwxrwxrwx root root 0 Sep 17 2021 cluster-logs/
drwxrwxrwx root root 0 Jul 31 2020 databricks/
-rw-rw-rw- root root 4608000 Jul 15 12:46 foo
drwxrwxrwx root root 0 Jun 14 10:46 init_scripts/
drwxrwxrwx root root 0 Jul 29 2020 mnt/
drwxrwxrwx root root 0 Jul 15 02:04 tmp/
drwxrwxrwx root root 0 May 07 2020 user/
For me, this reads more naturally. Of course, the file modes are fictitious. I picked what is likely to show up when running ls -l via %sh in a notebook cell, except that there all files are marked as executable and directories are often reported as 4096 bytes.
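For anyone who would rather not read awk, here is a rough Python sketch of the same per-line transformation. It is only an illustration under a few assumptions: the function name format_line and the now parameter are made up for this example, and it formats times in UTC (time.gmtime), whereas the gawk one-liner uses local time.

```python
import time

SIX_MONTHS = 3600 * 24 * 30 * 6  # same approximation as the awk script


def format_line(line, now=None):
    """Turn one `dbfs ls -l` line (type, size, name, mtime in ms) into an ls-like row."""
    now = time.time() if now is None else now
    kind, size, rest = line.split(None, 2)
    # Split the timestamp off the right so names containing spaces survive.
    name, mtime_ms = rest.rsplit(None, 1)
    mtime = int(mtime_ms) // 1000  # milliseconds to seconds
    if kind == "dir":
        flag, mode, suffix = "d", "rwx", "/"
    else:
        flag, mode, suffix = "-", "rw-", ""
    # Old entries show the year, recent ones show the time of day.
    tfmt = "%b %d %Y" if mtime < now - SIX_MONTHS else "%b %d %H:%M"
    stamp = time.strftime(tfmt, time.gmtime(mtime))  # UTC, unlike gawk's local time
    return f"{flag}{mode * 3} root root {int(size):12d} {stamp} {name}{suffix}"
```

Passing a fixed now, as in format_line("file 4608000 foo 1657881982000", now=1657900000), keeps the year-versus-time decision reproducible; leaving it out falls back to the current time, like systime() in the awk version.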
I then turned to rewriting this in databricks_cli.dbfs.api.FileInfo itself.
import time
...
class FileInfo(object):
    ...
    def to_row_unix(self, is_absolute):
        path = self.dbfs_path.absolute_path if is_absolute else self.dbfs_path.basename
        stylized_path = click.style(path, 'cyan') if self.is_dir else path
        # DBFS reports no permissions, so the mode strings are fixed placeholders
        if self.is_dir:
            mod = 'drwxrwxrwx'
            stylized_path += '/'
        else:
            mod = '-rw-rw-rw-'
        # scale the size by 1024 until it fits, then append the unit suffix
        size = self.file_size
        for p in " KMGTP":
            if size < 1024:
                if size < 10 and p != " ":
                    # one decimal for small scaled values, e.g. 4.4M
                    size = f"{size:.1f}{p}"
                else:
                    size = f"{size:.0f}{p}"
                break
            size /= 1024
        else:
            size = f"{size:.0f}{p}"
        # modification_time is in milliseconds since the epoch
        if self.modification_time is None:
            mtime = 0
        else:
            mtime = self.modification_time // 1000
        timet = time.gmtime(mtime)
        if mtime < (time.time() - 3600*24*365/2):
            # format times longer than six months ago with year
            tfmt = "%b %d %Y"
        else:
            # format times within last six months with time
            tfmt = "%b %d %H:%M"
        ftime = time.strftime(tfmt, timet)
        return [mod, "root", "root", size, ftime, stylized_path]
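The size formatting is the part I would most like feedback on, so here is a standalone sketch of it that can be tried without databricks-cli. The name human_size is hypothetical and this is not the code from the method above: it produces the same results for ordinary sizes, but it omits the trailing space on plain byte counts and stops scaling at the largest unit.

```python
def human_size(num_bytes):
    """Format a byte count with binary (1024-based) unit suffixes."""
    size = float(num_bytes)
    units = ["", "K", "M", "G", "T", "P"]
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if size < 1024 or unit == units[-1]:
            break
        size /= 1024
    # One decimal for small scaled values (e.g. 4.4M); no decimals otherwise.
    if unit and size < 10:
        return f"{size:.1f}{unit}"
    return f"{size:.0f}{unit}"


if __name__ == "__main__":
    for n in (0, 1023, 1024, 10240, 4608000):
        print(n, human_size(n))
```

The one-decimal rule only applies below 10 of a scaled unit, which keeps the column narrow while still distinguishing, say, 1.0K from 1.5K.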
The new method needs some additional code in databricks_cli.dbfs.cli.ls_cli:
...
@click.option('--unix', is_flag=True, default=False,
              help="""Format output like the UNIX `ls -l` command.""")
def ls_cli(api_client, l, absolute, dbfs_path, unix):  # NOQA
    ...
    files = DbfsApi(api_client).list_files(dbfs_path)
    if unix:
        # to_row_unix always emits the long form, so it takes no is_long_form argument
        rows = [f.to_row_unix(is_absolute=absolute) for f in files]
    else:
        rows = [f.to_row(is_long_form=l, is_absolute=absolute) for f in files]
    table = tabulate(rows, tablefmt='plain')
    click.echo(table)
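With these changes, the listing looks like this (times are in UTC, since the code uses time.gmtime):
drwxrwxrwx root root 0 Jun 30 11:32 FileStore/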
drwxrwxrwx root root 0 Sep 17 2021 cluster-logs/
drwxrwxrwx root root 0 Jul 31 2020 databricks/
-rw-rw-rw- root root 4.4M Jul 15 10:46 foo
drwxrwxrwx root root 0 Jun 14 08:46 init_scripts/
drwxrwxrwx root root 0 Jul 29 2020 mnt/
drwxrwxrwx root root 0 Jul 15 00:04 tmp/
drwxrwxrwx root root 0 May 07 2020 user/
If this is of any use to others, I can create a pull request implementing a --unix or --posix option.