torchx

Support saving logs to target paths in various filesystems

Open: mannatsingh opened this issue 4 years ago (0 comments)

Description

As a user, I would like to be able to store logs at a specific path. The challenge is that this should work across different compute environments, and the path can be on various filesystems (local, S3, etc.).

Motivation/Background

torchx currently doesn't support saving job logs to a specific path; the current logging API is torchx log <app_handle>. While this is useful, the logging isn't user-configurable.

For research experiment management, we like to keep all of a job's details (code, logs, checkpoints) at a specific location, e.g. filesystem://path/to/experiment. These experiment directories can then share a retention policy, be shared across researchers (even across different companies), and be inspected without any knowledge of the launcher or the job ID. Being able to view log files outside of torchx is also a simpler experience overall.

Detailed Proposal

I am not sure what the best approach to support this is. Here are a couple of options we discussed:

  • Moving the logs to the destination path when a worker exits.
    • Pros: Conceptually simple.
    • Cons: The logs only appear at the end of training, which isn't a great experience; we generally view the logs mid-training. If the worker crashes, the logs might never reach the destination. This also needs some integration with the worker process, which might make things unnecessarily complex.
  • An API that accepts a user-defined file-like object providing an output stream (at FAIR we use iopath for this, which can be implemented for various filesystems), or a conceptually similar mechanism.
    • Pros: Once the API is defined, this is a simple and seamless experience: the logs are saved to the default torchx logging location and to the user-defined path at the same time!
    • Cons: The API would need a lot of thought.
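The first option can be sketched as a wrapper that copies the worker's log file to the destination when the worker exits. This is a minimal illustration only: `run_and_copy_logs` is a hypothetical helper (not a torchx API), it assumes the worker writes its log to a local file, and it handles only local destinations (an S3 target would need a filesystem library instead of `shutil`). It also makes the drawbacks concrete: nothing reaches the destination until the worker finishes, and the `finally` block cannot run if the process is killed outright.

```python
import shutil
from pathlib import Path
from typing import Callable


def run_and_copy_logs(
    worker: Callable[[Path], None], log_path: Path, dest_path: Path
) -> None:
    """Hypothetical helper: run `worker`, then copy its log to `dest_path`.

    The copy sits in a `finally` block, so it also runs when `worker`
    raises a Python exception. It does NOT run if the process is killed
    (e.g. SIGKILL or an OOM kill), and the destination stays empty until
    the worker has exited.
    """
    try:
        worker(log_path)
    finally:
        # Assumption: both paths are on the local filesystem.
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        if log_path.exists():
            shutil.copyfile(log_path, dest_path)
```

Because the copy only happens on exit, anyone watching `dest_path` mid-training sees nothing, which is the experience problem described above.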

I personally prefer the second option.
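The second option boils down to writing every log line to more than one stream at once. A minimal sketch of that idea, under stated assumptions: `TeeStream` is a hypothetical name (not an existing torchx API), both destinations are assumed to be writable text file-like objects, and `io.StringIO` stands in for them here. In practice the user-defined stream might be one opened through iopath or fsspec, which is not shown.

```python
import io
import logging
from typing import IO, Iterable, List


class TeeStream:
    """Hypothetical file-like writer that fans each write out to several streams.

    Assumption: the default torchx log location and the user-defined
    destination are both exposed as writable text streams.
    """

    def __init__(self, streams: Iterable[IO[str]]) -> None:
        self._streams: List[IO[str]] = list(streams)

    def write(self, s: str) -> int:
        # Forward the same text to every destination immediately, so logs
        # are visible mid-training rather than only at exit.
        for stream in self._streams:
            stream.write(s)
        return len(s)

    def flush(self) -> None:
        for stream in self._streams:
            stream.flush()


# Usage: StringIO objects stand in for the real destinations.
default_log = io.StringIO()
user_log = io.StringIO()

logger = logging.getLogger("trainer")
logger.propagate = False  # keep this example independent of root handlers
logger.setLevel(logging.INFO)
# logging.StreamHandler calls write() and flush() on the stream it is given.
logger.addHandler(logging.StreamHandler(TeeStream([default_log, user_log])))

logger.info("epoch 1: loss=0.5")
print(user_log.getvalue(), end="")  # epoch 1: loss=0.5
```

Duck typing keeps the interface small (just `write` and `flush`), which fits the idea of accepting any user-supplied file-like object; the open questions the issue raises, such as buffering and what happens when one destination fails, are not addressed by this sketch.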

Posted by mannatsingh on Feb 18 '22, 02:02