List Clusters API call returns Internal Server Error after installation
I have cluster with 4 nodes. I have installed tendrl and after all was installed I had to wait about 2 minutes before I was able to use tendrl ui because List Clusters API call was returning 500 Internal Server Error. Most of other API calls were working correctly. (I was able to log in and navigate through sites)
After 2 minutes the site and API call started working correctly.
This is how it looks right after installation:

Tested with:
tendrl-commons-1.5.2-20171004T104238.6d5ffef.noarch
tendrl-api-1.5.2-20170927T205654.a9e16c0.noarch
tendrl-ui-1.5.2-20170928T151925.1cdc907.noarch
tendrl-node-agent-1.5.2-20171004T105957.709badc.noarch
tendrl-monitoring-integration-1.5.2-20171005T132717.1b8a4c1.noarch
tendrl-grafana-plugins-1.5.2-20171005T132717.1b8a4c1.noarch
tendrl-notifier-1.5.2-20170927T205657.566bfa0.noarch
tendrl-api-httpd-1.5.2-20170927T205654.a9e16c0.noarch
glusterfs-4.0dev-0.176.gitfabee9d.el7.centos.x86_64
@fbalak It would be really helpful if you can provide the API logs when it was in error state.
@anivargi Sure, here is /var/log/messages from server where the api is run:
messages-server.txt
If you want access to the machine, I can grant it to you if you send me ssh public key.
@fbalak can you share /var/log/tendrl/api/error.log
@anivargi here: error.log
@r0h4n @shtripat looks like TendrlContext is temporarily empty even if the cluster_id is present, which causes the API to fail, I can add a check for that, but does this need to be handled by the backend?
Assuming you have not imported this cluster, tendrl-node-agent will start syncing all the data after the default "sync_interval" (as given in node-agent.conf.yaml). This is expected behavior and the UI and API should respond appropriately before the first sync completes
If the cluster is a Tendrl managed cluster, then we need to investigate further.
@fbalak
@anivargi @r0h4n I have been testing this and from the UI point of view it looks good now:
I am just concerned that the API call still returns 500 Internal Server Error and it provide no information for user about what is happening. If we know where the problem is in the backend can you please update error message and exit code? I would suggest 503 Service Unavailable.
After discussion with @dahorak we came to conclusion that there should not be any 5xx error. It would be best to return not entirely imported cluster in response and add some loading API attribute that would be reflected in UI but It would also make sense to return code 202 Accepted and return empty list.
Based on discussion with @r0h4n today, I would propose the following.
Since Tendrl needs to wait for all node agents to send all data required to describe the cluster, the cluster can't be shown immediately. Only when Tendrl has all the data, it makes sense to return the cluster in the api/1.0/clusters api call.
Unfortunately there are few problems with the current behavior:
- when the admin logs in for the first time, there are no indication that the cluster is being detected
- the api returns 500 (Internal Server Error) status code while tendrl is collecting the data
- ui shows just "no clusters detected", as if nothing happened (ignoring 500 http status code)
I don't think it makes sense to return not fully detected cluster in clusters list for multiple reasons:
- data model tendrl uses doesn't include not fully detected cluster, cluster is either fully detected or it doesn't exist, so listing cluster still being detected in the list would require changes in the data model
- ui would need to be updated to handle this
So one possible approach would be to return empty list in the clusters call response until the cluster is fully detected. And then to let the user know about cluster detecting taking place, tendrl could send the notification about this.
I searched in the design documents and the ui design doesn't anticipate with any delay in detection of the cluster for import: https://redhat.invisionapp.com/share/8QCOEVEY9#/screens/244820318
@julienlim How would you suggest to address this delay in cluster detection? Would the notification as suggester be ok?
@fbalak @dahorak I don't think that returning 202 Accepted makes sense, as the cluster detection is not triggered by clusters api call.
@fbalak I disagree with your description in https://github.com/Tendrl/api/issues/306#issuecomment-336105767 - showing spinner during cluster detection looks much more reasonable compared current 500 status code which is ignored by the ui. Unfortunately it seems that the 500 status code was misused to indicate that the cluster detection is running, which is not ok.
@mbukatov I agree with you, I suggested @fbalak 200 OK with empty clusters list (if it is not possible to mark cluster as not fully detected).
Current takeaway on answers we need to the following questions:
- How to convey the information about the ongoing detection on the api level
- How to show this ongoing detection in the UI; it would need be clear for a first time user as well. Note: Troubleshooting section in documentation should also mention / address this.
@Tendrl/tendrl-qe @[email protected] @mcarrano @julienlim
Thinking about my previous suggestion to use notification: sending notification that tendrl is detecting a cluster makes sense from backend perspective, but could we use this notification to control when to show spinner in the ui?
- Tendrl can provide notifications as to whenever cluster detection is in progress or is complete. It can also set a "detection_in_progress" flag (per node since we havent detected any cluster yet).
@Tendrl/tendrl-core do you guys think we should add this right away or wait for the next release?
More updates on this issue discussed in today's daily meeting
- We need help from @julienlim to understand possibilities of how to highlight the cluster detection progress in case of single cluster and in case of multiple clusters.
@Tendrl/tendrl-core @Tendrl/tendrl-qe @mcarrano @r0h4n
Here's our recommendation from a UI perspective:
(1) When there is only 1 cluster that is being detected (but not yet ready for importing), show the following blank slate:
Title: Cluster detected Summary: Tender server is connecting to the Tendrl node agents to detect the cluster resources.
(2) When there are multiple clusters and a detected cluster shows up in the Cluster list (but is not yet ready for Import), in the column before the button, show the following message “Cluster detection in progress” and disable Import button. When cluster detection has completed, then enable the “Import” button. Currently, the status of the cluster prior to being imported shows a “?” which is probably still okay to leave as-is.
If there's any other scenarios that I've not addressed, please advise and we can come up with other recommendations accordingly.
Fixed in https://github.com/Tendrl/api/pull/332
@julienlim This issue is pretty old. Is it still persisting for you?