mhuguesaws
I was unable to reproduce that.
I can see the problem arising when you want to run a higher CUDA version than the driver supports. I can re-test on HyperPod to double-check.
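To make that failure mode concrete, here is a minimal sketch of the check I have in mind. The function names are hypothetical and nothing here comes from the repo. It only compares `major.minor` strings, and it deliberately ignores NVIDIA's forward-compatibility packages and minor-version compatibility, so treat a positive result as "worth investigating" rather than a guaranteed failure.

```python
def parse_version(text):
    """Turn a 'major' or 'major.minor' string such as '12.2' into (12, 2)."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def runtime_exceeds_driver(runtime_cuda, driver_max_cuda):
    """Return True when the CUDA version in the container is newer than
    the highest CUDA version the host driver reports supporting.

    Simplifying assumption: a plain tuple comparison, with no handling of
    forward-compatibility packages or minor-version compatibility.
    """
    return parse_version(runtime_cuda) > parse_version(driver_max_cuda)


if __name__ == "__main__":
    # Example values only; in practice they would come from the container
    # image and from the driver on the host.
    print(runtime_exceeds_driver("12.4", "12.2"))
    print(runtime_exceeds_driver("11.8", "12.2"))
```

The driver-side value is the "CUDA Version" shown in the `nvidia-smi` header; the runtime-side value is whatever CUDA toolkit the container image ships.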
Please stop numbering things. It is useless and will make it harder to add and remove entries. Why is AMI 1. and container 2.? What's the logic? `efa_version.sh` in...
Here is my proposal ``` - docs/ - core_infra/ - orchestrators - aws-parallelcluster/ - sagemaker-hyperpod/ - slurm/ - lifecycle-scripts/ - amazon-eks/ - aws-batch/ - observability/ - ml-frameworks/ - [FRAMEWORK_NAME] -...
@awsankur if you can comment here.
> I like this structure. A couple of comments: > > 1. Observability solution will depend on the orchestrator. So we should have an observability section as part of each...
> @mhuguesaws how about CloudWatch or profilers like Nsight? Profiler in profiler ;)
> We should add Nsight profilers.