cloudml
Error setting up cloud instance
I am submitting my first job through cloudml, and the package installation on the cloud instance fails.
Log from the Google Cloud console:
The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-kgurwctg/setup.py", line 163, in <module>
cmdclass = { "install": CustomCommands }
File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 161, in setup
return distutils.core.setup(**attrs)
File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-req-build-kgurwctg/setup.py", line 138, in run
self.RunCustomCommandList(PIP_INSTALL_KERAS)
File "/tmp/pip-req-build-kgurwctg/setup.py", line 119, in RunCustomCommandList
self.RunCustomCommand(command, True)
File "/tmp/pip-req-build-kgurwctg/setup.py", line 102, in RunCustomCommand
raise RuntimeError(message)
RuntimeError: Command ['pip', 'install', 'h5py', 'pyyaml', 'requests', 'Pillow', 'scipy', '--upgrade'] failed: exit code 1
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=894990183050&resource=ml_job%2Fjob_id%2Fcloudml_2021_02_18_030427919&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22cloudml_2021_02_18_030427919%22
To explore this a bit more, I:
- installed cloudml from this repository rather than CRAN
- used the example mnist script:
library(cloudml)
dir.create("mnist-train")
file.copy(system.file("examples/mnist/train.R", package = "cloudml"), "mnist-train")
setwd("mnist-train")
cloudml_train()
The first error that pops up is probably not consequential:
ERROR: You have configured your Cloud SDK installation to be fixed to version [220.0.0]. Make sure this is a valid archived Cloud SDK version.
But things seem to go wrong when installing Matrix 1.3-2, where I get:
curl: (22) The requested URL returned error: 404 Not Found
FAILED
Error in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePackagesSource(repos = repos), :
Failed to retrieve package sources for Matrix 1.3-2 from CRAN (internet connectivity issue?)
Calls: retrieve_packrat_packages ... restoreImpl -> playActions -> installPkg -> getSourceForPkgRecord
Execution halted
Command ['Rscript', '/root/.local/lib/python3.7/site-packages/cloudml-model/cloudml/deploy.R'] failed: exit code 1
Command '['python3', '-m', 'cloudml-model.cloudml.deploy', 'Rscript', '--job-dir', 'gs://jm-dl-r-2/r-cloudml/staging']' returned non-zero exit status 1.
Full logs in CSV and JSON are attached.
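To narrow down the 404 on Matrix 1.3-2, it may help to check by hand where that source tarball lives on CRAN. The sketch below only builds the two candidate URLs from the standard CRAN layout (the current `src/contrib` directory and the per-package `Archive` directory). The helper name `cran_source_urls` is hypothetical, and whether packrat requests exactly these URLs on the cloud instance is an assumption, not something the logs confirm.

```r
# Hypothetical helper: build the two places a source tarball for a given
# package version can live on a CRAN mirror (current release vs. archive).
cran_source_urls <- function(pkg, version, repo = "https://cran.r-project.org") {
  repo <- sub("/+$", "", repo)                 # drop any trailing slash
  tarball <- paste0(pkg, "_", version, ".tar.gz")
  c(
    current = paste0(repo, "/src/contrib/", tarball),
    archive = paste0(repo, "/src/contrib/Archive/", pkg, "/", tarball)
  )
}

# Candidate locations for the version named in the error message
urls <- cran_source_urls("Matrix", "1.3-2")
print(urls)
```

Opening each URL in a browser (or with `curl -I`) shows which of the two returns 404. If only the archive URL resolves, the version recorded in the packrat snapshot is probably no longer the current CRAN release.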