Spark UI for AZTK jobs
Is there a way to ssh into the cluster and look at the Spark UI for the job being run?
No, there is currently no way to ssh into a job. Jobs are intended for operationalized workflows: the idea is that you develop and test in cluster mode, and once your app is functioning as expected, you bundle it up as an aztk job and schedule it as needed.
Can you explain your use case in more detail?
I'm trying to run an AZTK job as part of a VSTS pipeline.
I've tried using job.yaml so that my whole job configuration lives in one file.
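For context, a job.yaml bundles the cluster definition and the applications to run into a single file. The sketch below is from memory and the exact field names may vary between AZTK versions; `my-job`, `my-app`, and `./app.py` are placeholder values:

```yaml
# Hypothetical job.yaml sketch; check your AZTK version's template for exact keys
job:
  id: my-job
  cluster_configuration:
    vm_size: standard_d2_v2
    size: 2          # number of nodes; the value I was experimenting with
  applications:
    - name: my-app
      application: ./app.py
      application_args: []
```

Submitting is then a single command, e.g. `aztk spark job submit --id my-job --configuration job.yaml`.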
Unfortunately, when I experimented with the cluster size, I ran into an error where executors were not being provisioned, and the logs suggested having a look at the Spark UI.
I had to switch to the `aztk spark cluster submit` command to debug this issue.
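Concretely, the debugging path looks something like the following. These commands are based on the aztk CLI as I understand it (cluster ssh port-forwards the Spark web UIs to localhost); the cluster id, app name, and script path are placeholders:

```sh
# Create a debug cluster sized like the job's cluster_configuration
aztk spark cluster create --id debug-cluster --size 2 --vm-size standard_d2_v2

# Submit the same application to the cluster
aztk spark cluster submit --id debug-cluster --name my-app ./app.py

# SSH into the master node; the Spark web UIs are forwarded locally,
# e.g. the master UI on localhost:8080 and the job UI on localhost:4040
aztk spark cluster ssh --id debug-cluster
```

None of this is available when the same application is submitted as an aztk job, which is the asymmetry described below.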
Generally I find it cumbersome to maintain two completely different ways of submitting the same job: one for debugging and one for production use. It would probably be fine if I had to switch from a manual cluster to an automatic one only once, but the job code changes over time, so I can't guarantee it will keep working as before. Thus some means of debugging jobs is required.
And it seems strange that I have to give up Azure Batch managing the cluster lifecycle just to get access to debugging.
Thanks for the feedback. We will definitely bring the debug and monitoring tools from clusters to jobs. The goal is to make transitioning between cluster mode and job submit mode as streamlined as possible, and I agree that adding these features to jobs will help realize that goal.
When you say "the job code is changing over time" what do you mean?
I mean that the code and logic in the Spark application (which we run on AZTK) change over time.
In this particular case we run an AZTK job as part of the CI pipeline for a Spark application, to ensure that it works after every build. Using AZTK job submission would be ideal here because we wouldn't need to worry about test clusters lingering if something goes wrong. But at the same time, if the application gets stuck, we'd want to be able to look at it.