restserver does not respond when using frameworkcontroller

Open hviet2603 opened this issue 3 years ago • 0 comments

Describe the issue: Hi, I am running the example with frameworkcontroller, but requests to the REST server almost always time out; it seems to hang somewhere. I also tried sending "GET /check-status" with Postman, and the REST server does not respond to that either. The web interface usually does not show up; occasionally it does, but unfortunately most of the time it doesn't. Has anyone tried this and faced the same issue?
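For anyone trying to reproduce the hang without Postman, here is a minimal sketch of the same probe in Python. It only separates "timed out" from "connection refused" from "HTTP error", which helps tell a hung server from one that never started. The helper name `probe` and the 5-second timeout are illustrative choices, and the `/api/v1/nni` prefix in the usage comment is an assumption about where the REST handler is mounted; the log below only confirms port 8080 and the `/check-status` route.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> str:
    """Send one GET request and classify the outcome as a short string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"ok {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 404).
        return f"http-error {exc.code}"
    except urllib.error.URLError as exc:
        # Connection-level failure: refused, unreachable, or connect timeout.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"unreachable: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # Connected, but the server never sent a response in time.
        return "timeout"


# Usage (URL prefix is an assumption, adjust to your setup):
# print(probe("http://localhost:8080/api/v1/nni/check-status"))
```

A result of `timeout` would match the hang described above, while `unreachable` would suggest the server process is not listening at all.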

Environment:

  • NNI version: 2.8
  • Training service (local|remote|pai|aml|etc): frameworkcontroller on minikube
  • Client OS: ubuntu
  • Python version: 3.8

Log message:

  • nnimanager.log:

[2022-07-26 18:30:09] INFO (main) Start NNI manager
[2022-07-26 18:30:09] DEBUG (SqlDB) Database directory: /home/viet/nni-experiments/2a1e3f6u/db
[2022-07-26 18:30:09] INFO (NNIDataStore) Datastore initialization done
[2022-07-26 18:30:09] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2022-07-26 18:30:09] WARNING (NNITensorboardManager) Tensorboard may not installed, if you want to use tensorboard, please check if tensorboard installed.
[2022-07-26 18:30:09] INFO (RestServer) REST server started.
[2022-07-26 18:30:09] DEBUG (main) start() returned.
[2022-07-26 18:30:10] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2022-07-26 18:30:10] DEBUG (NNIRestHandler) POST: /experiment: body: { searchSpace: { features: { _type: 'choice', _value: [Array] }, lr: { _type: 'loguniform', _value: [Array] }, momentum: { _type: 'uniform', _value: [Array] } }, trialCodeDirectory: '/home/viet/Programming/python/ba-hpo/workspace/NNI/experiments/nni-kube-cluster/frameworkcontroller-python', trialConcurrency: 2, maxTrialNumber: 10, nniManagerIp: '192.168.0.101', useAnnotation: false, debug: false, logLevel: 'info', experimentWorkingDirectory: '/home/viet/nni-experiments', tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } }, trainingService: { platform: 'frameworkcontroller', trialCommand: '', trialCodeDirectory: '/home/viet/Programming/python/ba-hpo/workspace/NNI/experiments/nni-kube-cluster/frameworkcontroller-python', nniManagerIp: '192.168.0.101', debug: false, storage: { storageType: 'nfs', server: '192.168.0.101', path: '/exports', storage: 'nfs' }, serviceAccountName: 'frameworkcontroller', taskRoles: [ [Object] ], reuseMode: true, namespace: 'default' } }
[2022-07-26 18:30:10] INFO (NNIManager) Starting experiment: 2a1e3f6u
[2022-07-26 18:30:10] INFO (NNIManager) Setup training service...
[2022-07-26 18:30:10] DEBUG (TrialDispatcher) current folder /home/viet/.local/lib/python3.8/site-packages/nni_node/training_service/reusable
[2022-07-26 18:30:10] INFO (NNIManager) Setup tuner...
[2022-07-26 18:30:10] DEBUG (NNIManager) dispatcher command: /usr/bin/python,-m,nni,--exp_params,eyJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvaG9tZS92aWV0L1Byb2dyYW1taW5nL3B5dGhvbi9iYS1ocG8vd29ya3NwYWNlL05OSS9leHBlcmltZW50cy9ubmkta3ViZS1jbHVzdGVyL2ZyYW1ld29ya2NvbnRyb2xsZXItcHl0aG9uIiwidHJpYWxDb25jdXJyZW5jeSI6MiwibWF4VHJpYWxOdW1iZXIiOjEwLCJubmlNYW5hZ2VySXAiOiIxOTIuMTY4LjAuMTAxIiwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvaG9tZS92aWV0L25uaS1leHBlcmltZW50cyIsInR1bmVyIjp7Im5hbWUiOiJUUEUiLCJjbGFzc0FyZ3MiOnsib3B0aW1pemVfbW9kZSI6Im1heGltaXplIn19LCJ0cmFpbmluZ1NlcnZpY2UiOnsicGxhdGZvcm0iOiJmcmFtZXdvcmtjb250cm9sbGVyIiwidHJpYWxDb21tYW5kIjoiIiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL2hvbWUvdmlldC9Qcm9ncmFtbWluZy9weXRob24vYmEtaHBvL3dvcmtzcGFjZS9OTkkvZXhwZXJpbWVudHMvbm5pLWt1YmUtY2x1c3Rlci9mcmFtZXdvcmtjb250cm9sbGVyLXB5dGhvbiIsIm5uaU1hbmFnZXJJcCI6IjE5Mi4xNjguMC4xMDEiLCJkZWJ1ZyI6ZmFsc2UsInN0b3JhZ2UiOnsic3RvcmFnZVR5cGUiOiJuZnMiLCJzZXJ2ZXIiOiIxOTIuMTY4LjAuMTAxIiwicGF0aCI6Ii9leHBvcnRzIiwic3RvcmFnZSI6Im5mcyJ9LCJzZXJ2aWNlQWNjb3VudE5hbWUiOiJmcmFtZXdvcmtjb250cm9sbGVyIiwidGFza1JvbGVzIjpbeyJuYW1lIjoid29ya2VyIiwiZG9ja2VySW1hZ2UiOiJtc3Jhbm5pL25uaTpsYXRlc3QiLCJ0YXNrTnVtYmVyIjoxLCJjb21tYW5kIjoicHl0aG9uMyBtb2RlbC5weSIsImdwdU51bWJlciI6MCwiY3B1TnVtYmVyIjoxLCJtZW1vcnlTaXplIjoiMSBnYiIsImZyYW1ld29ya0F0dGVtcHRDb21wbGV0aW9uUG9saWN5Ijp7Im1pbkZhaWxlZFRhc2tDb3VudCI6MSwibWluU3VjY2VlZFRhc2tDb3VudCI6MX19XSwicmV1c2VNb2RlIjp0cnVlLCJuYW1lc3BhY2UiOiJkZWZhdWx0In19
[2022-07-26 18:30:10] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
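The value after `--exp_params` in the dispatcher command appears to be the experiment config as base64-encoded JSON, so decoding it is one way to see fields that the POST body prints only as `[Object]` (for example `taskRoles`). A minimal sketch, assuming standard base64 and UTF-8 JSON; `decode_exp_params` is a hypothetical helper name, not part of NNI:

```python
import base64
import json


def decode_exp_params(encoded: str) -> dict:
    """Decode a base64 string (padding optional) into a JSON object."""
    # Re-add '=' padding in case the string was copied without it.
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.b64decode(padded).decode("utf-8"))


# Usage: paste the string that follows "--exp_params," in nnimanager.log.
# config = decode_exp_params("eyJ0cmlhbENvZGVEaXJlY3Rvcnki...")
# print(json.dumps(config["trainingService"]["taskRoles"], indent=2))
```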

  • dispatcher.log:

[2022-07-26 18:30:09] INFO (nni.experiment) Creating experiment, Experiment ID: 2a1e3f6u
[2022-07-26 18:30:09] INFO (nni.experiment) Starting web server...
[2022-07-26 18:30:10] INFO (nni.experiment) Setting up...
[2022-07-26 18:30:30] ERROR (nni.experiment) Create experiment failed
[2022-07-26 18:30:30] INFO (nni.experiment) Stopping experiment, please wait...
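The dispatcher log shows a long pause between "Setting up..." and "Create experiment failed", which is consistent with a client-side request timing out. A small sketch for locating such pauses in either log; `find_gaps` and the 5-second threshold are illustrative choices, and the timestamp pattern is assumed from the entries shown here:

```python
import re
from datetime import datetime

# Matches the "[YYYY-MM-DD HH:MM:SS]" prefix of each log entry.
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")

DISPATCHER_LOG = """
[2022-07-26 18:30:09] INFO (nni.experiment) Creating experiment, Experiment ID: 2a1e3f6u
[2022-07-26 18:30:09] INFO (nni.experiment) Starting web server...
[2022-07-26 18:30:10] INFO (nni.experiment) Setting up...
[2022-07-26 18:30:30] ERROR (nni.experiment) Create experiment failed
[2022-07-26 18:30:30] INFO (nni.experiment) Stopping experiment, please wait...
"""


def find_gaps(log_text: str, threshold_s: float = 5.0):
    """Return (seconds, start, end) for consecutive entries at least threshold_s apart."""
    stamps = STAMP.findall(log_text)
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    gaps = []
    for i in range(1, len(times)):
        delta = (times[i] - times[i - 1]).total_seconds()
        if delta >= threshold_s:
            gaps.append((delta, stamps[i - 1], stamps[i]))
    return gaps


print(find_gaps(DISPATCHER_LOG))
```

On the log above this reports a single 20-second gap, from 18:30:10 to 18:30:30.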

How to reproduce it?: I just followed the tutorial from the doc page

hviet2603 · Jul 26 '22 17:07