Tiresias
Tiresias is a GPU cluster manager for distributed deep learning training.
https://github.com/SymbioticLab/Tiresias/blob/959f9b08f44fa1b4f2b3aed79b01e4d439264a94/simulator/run_sim.py#L1226 Hi! I am using the dlas scheduler in your simulator, but I am confused about how `jump_time` is calculated. As I understand it, it represents the next time when a...
480_job
Hi, I'm using the 60_job.csv trace for my implementation. You mentioned that you used the 480_job trace in the NSDI paper, but I can't find it in your GitHub repository. Could you provide the trace file? Thank you!
Hi, I'm using the shortest first scheduler in your simulator but get the output "This cluster is not large enough to run the job", which seems unreasonable as shortest first scheduler...
I hope you are well. In Figure 4 of your paper, you show that the time overhead of pausing VGG19 is roughly ten times smaller than that of pausing ResNet152....