ASDs cannot be claimed after a node is removed and then added back
Hi,
I ran into an issue after a node was removed from the cluster. The ticket's title may not be entirely accurate at this point.
Let's say A, B and C are the three nodes of the cluster, installed and set up in the order A, B, C. I did the following:
- removed node A while it was offline
- reinstalled the OS on A
- configured the network with the same IP as before the removal
- installed OVS
- ran ovs setup
- assigned roles
- added this new ASD node to the backend
- initialized the ASDs
- claimed them
Now the claiming icon keeps spinning and never finishes.
This issue also happened when I added a fourth node to the cluster, so I would conclude that claiming ASDs gets stuck whenever more nodes join. However, the issue does not occur if no node has been removed beforehand.
BTW, ovs setup on the fourth node had no problem. It was set up as an extra node without the master role, and all OVS services are running.
All my nodes are physical servers.
I found there is no alba maintenance running on this node.
Claiming ASDs still doesn't work after I ran the alba maintenance command line per JeffreyDevloo's suggestion in #1703.
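(For reference, here is roughly how I checked that no maintenance process was running; the name pattern is just my guess at what such a process would be called:)
root@NODE-181:~# pgrep -af alba | grep -i maintenance
root@NODE-181:~# systemctl list-units | grep -i alba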
Do you have logging from the ovs-workers? I'd like to follow the trail on this one.
The task name is alba.add_units.
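Something along these lines should surface the relevant entries (adjust the unit name / log path to your setup):
$> journalctl -u ovs-workers | grep add_units
$> grep add_units /var/log/syslog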
Best regards
Jeffrey Devloo
Attached is the log from journalctl -u ovs-workers. I clicked claim again just now, so the most recent 10 minutes should be of interest. I didn't see alba.add_units in it.
workers.zip
Nov 6 10:28:40 NODE-181 gunicorn[3563]: 2018-11-06 03:28:40 85200 +0100 - NODE-181 - 4073/140101558844304 - extensions/api - 34 - INFO - [albabackends.add_units] - 83f524a7-2642-41ae-86f8-b068be3322f1 - [] - {"pk": "02d3b439-4688-42c7-b33a-c6a81844da92"} - {"cookies": {"csrftoken": "vDDLVeCu2ZPidZRrRRqNNxTlYiToESE5", "sessionid": "tyq5w0mw1fww6bt0hzlinr7cv3461xpl"}, "meta": {"HTTP_AUTHORIZATION": "Bearer WLBaF+W.ZS+PbM,?sFOj.lTJ|dz5pP*]<S+xNU1TbsO|H<yTlr>Elvb{UbqrqLn|", "wsgi.multiprocess": "True", "HTTP_COOKIE": "csrftoken=vDDLVeCu2ZPidZRrRRqNNxTlYiToESE5; sessionid=tyq5w0mw1fww6bt0hzlinr7cv3461xpl", "HTTP_X_FORWARDED_SSL": "on", "SERVER_SOFTWARE": "gunicorn/19.4.5", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/02d3b439-4688-42c7-b33a-c6a81844da92/add_units/", "SERVER_PROTOCOL": "HTTP/1.0", "QUERY_STRING": "timestamp=1541471322141", "HTTP_X_REAL_IP": "192.168.3.138", "CONTENT_LENGTH": "84", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0", "HTTP_CONNECTION": "close", "HTTP_REFERER": "https://192.168.2.181/", "SERVER_NAME": "127.0.0.1", "REMOTE_PORT": "50262", "wsgi.url_scheme": "https", "SERVER_PORT": "8002", "HTTP_X_SCHEME": "https", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7f6bef7ddfd0>", "HTTP_HOST": "192.168.2.181", "wsgi.multithread": "True", "HTTP_ACCEPT": "application/json; version=*", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/02d3b439-4688-42c7-b33a-c6a81844da92/add_units/?timestamp=1541471322141", "wsgi.run_once": "False", "wsgi.errors": "<gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7f6bef7dd950>", "REMOTE_ADDR": "127.0.0.1", "HTTP_ACCEPT_LANGUAGE": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2", "gunicorn.socket": "<socket fileno=61 sock=127.0.0.1:8002 peer=127.0.0.1:50262>", "CONTENT_TYPE": "application/json", "wsgi.file_wrapper": "<class 'gunicorn.http.wsgi.FileWrapper'>", "CSRF_COOKIE": "vDDLVeCu2ZPidZRrRRqNNxTlYiToESE5", "HTTP_ACCEPT_ENCODING": "gzip, deflate, br"}, "requ
I reinstalled and set up a cluster and replicated this issue. I removed an offline node and reinstalled OS+OVS.
I noticed the following on the newly added node:
No arakoon directory under the mount point of the drive the roles were assigned to:
root@NODE-181:~# ls /mnt/hdd1/
lost+found
Some services are missing (not even shown as inactive or activating):
OVS running processes
=====================
ovs-arakoon-config active 5045
ovs-arakoon-ovsdb active 5353
ovs-scheduled-tasks active 9779
ovs-support-agent active 9819
ovs-volumerouter-consumer active 9773
ovs-watcher-config active 3931
ovs-watcher-framework active 9770
ovs-webapp-api active 9775
ovs-workers active 10105
unlike the other nodes:
OVS running processes
=====================
ovs-albaproxy_pool-2_0 active 23069
ovs-albaproxy_pool-2_1 active 23162
ovs-arakoon-config active 35058
ovs-arakoon-ovsdb active 35968
ovs-arakoon-sata-back-abm active 3735
ovs-arakoon-sata-back-nsm_0 active 17278
ovs-arakoon-voldrv active 3834
ovs-dtl_pool-2 active 22994
ovs-scheduled-tasks active 880
ovs-support-agent active 1098
ovs-volumedriver_pool-2 active 23261
ovs-volumerouter-consumer active 873
ovs-watcher-config active 4098
ovs-watcher-framework active 871
ovs-watcher-volumedriver active 22912
ovs-webapp-api active 876
ovs-workers active 1341
Attached are new logs of ovs-workers and syslog.
Best regards,
Hi yongshengma
Try to grep for add_units on all nodes within the cluster. Because of the distributed nature of the ovs-workers, a task might be executed on a different host than the one you sent the API call to.
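For example, from any node (the hostnames below are placeholders):
$> for node in nodeA nodeB nodeC; do ssh $node 'grep add_units /var/log/syslog'; done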
The 'missing' services appear to be the volumedriver and abm/nsm services. Volumedriver-related services are only added once you extend the vPool to the new host. The ovsdb arakoon is not deployed on the DB role; it resides under /opt/OpenvStorage/db instead. The ABM and NSM arakoons will be added again after some time (a checkup is scheduled every 30 minutes by default). You can manually trigger the checkup using:
from ovs.lib.alba import AlbaController
AlbaController.scheduled_alba_arakoon_checkup()
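(This is meant for a Python shell on a master node; as a shell one-liner, assuming the OVS packages are importable with the system Python, it would look something like:)
$> python -c "from ovs.lib.alba import AlbaController; AlbaController.scheduled_alba_arakoon_checkup()"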
Best regards
Jeffrey Devloo
Yes, I found the related info on another node, as you said: grep add_units /var/log/syslog
Nov 8 11:44:53 Node-182 gunicorn[876]: 2018-11-08 04:44:53 41000 +0100 - Node-182 - 1594/140495241835696 - extensions/api - 7271 - INFO - [albabackends.add_units] - caff05ee-c451-4e89-8f10-7bfefc53411b - [] - {"pk": "e8831b18-a552-4abe-b254-171d2261beb2"} - {"cookies": {"csrftoken": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "sessionid": "6wio9zbd2tkdtxvwm0cw97tvn16isbu0"}, "meta": {"HTTP_AUTHORIZATION": "Bearer H!f:E=0RUZr4smA={4bF/{[#.C0C2WR25eo@an0[1}v1_CYt{=hh>6anOm{*]/@v", "wsgi.multiprocess": "True", "HTTP_COOKIE": "csrftoken=OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb; sessionid=6wio9zbd2tkdtxvwm0cw97tvn16isbu0", "HTTP_X_FORWARDED_SSL": "on", "SERVER_SOFTWARE": "gunicorn/19.4.5", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_units/", "SERVER_PROTOCOL": "HTTP/1.0", "QUERY_STRING": "timestamp=1541648695066", "HTTP_X_REAL_IP": "192.168.3.138", "CONTENT_LENGTH": "84", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0", "HTTP_CONNECTION": "close", "HTTP_REFERER": "https://192.168.2.182/", "SERVER_NAME": "127.0.0.1", "REMOTE_PORT": "34316", "wsgi.url_scheme": "https", "SERVER_PORT": "8002", "HTTP_X_SCHEME": "https", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7fc7991969d0>", "HTTP_HOST": "192.168.2.182", "wsgi.multithread": "True", "HTTP_ACCEPT": "application/json; version=*", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_units/?timestamp=1541648695066", "wsgi.run_once": "False", "wsgi.errors": "<gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7fc798b28850>", "REMOTE_ADDR": "127.0.0.1", "HTTP_ACCEPT_LANGUAGE": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2", "gunicorn.socket": "<socket fileno=65 sock=127.0.0.1:8002 peer=127.0.0.1:34316>", "CONTENT_TYPE": "application/json", "wsgi.file_wrapper": "<class 'gunicorn.http.wsgi.FileWrapper'>", "CSRF_COOKIE": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "HTTP_ACCEPT_ENCODING": "gzip, deflate, br"}, "re
Nov 8 11:44:53 Node-182 gunicorn[876]: 2018-11-08 04:44:53 41000 +0100 - Node-182 - 1594/140495241835696 - log/api - 7271 - INFO - [albabackends.add_units] - caff05ee-c451-4e89-8f10-7bfefc53411b - [] - {"pk": "e8831b18-a552-4abe-b254-171d2261beb2"} - {"cookies": {"csrftoken": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "sessionid": "6wio9zbd2tkdtxvwm0cw97tvn16isbu0"}, "meta": {"HTTP_AUTHORIZATION": "Bearer H!f:E=0RUZr4smA={4bF/{[#.C0C2WR25eo@an0[1}v1_CYt{=hh>6anOm{*]/@v", "wsgi.multiprocess": "True", "HTTP_COOKIE": "csrftoken=OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb; sessionid=6wio9zbd2tkdtxvwm0cw97tvn16isbu0", "HTTP_X_FORWARDED_SSL": "on", "SERVER_SOFTWARE": "gunicorn/19.4.5", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_units/", "SERVER_PROTOCOL": "HTTP/1.0", "QUERY_STRING": "timestamp=1541648695066", "HTTP_X_REAL_IP": "192.168.3.138", "CONTENT_LENGTH": "84", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0", "HTTP_CONNECTION": "close", "HTTP_REFERER": "https://192.168.2.182/", "SERVER_NAME": "127.0.0.1", "REMOTE_PORT": "34316", "wsgi.url_scheme": "https", "SERVER_PORT": "8002", "HTTP_X_SCHEME": "https", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7fc7991969d0>", "HTTP_HOST": "192.168.2.182", "wsgi.multithread": "True", "HTTP_ACCEPT": "application/json; version=*", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_unit/?timestamp=1541648695066", "wsgi.run_once": "False", "wsgi.errors": "<gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7fc798b28850>", "REMOTE_ADDR": "127.0.0.1", "HTTP_ACCEPT_LANGUAGE": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2", "gunicorn.socket": "<socket fileno=65 sock=127.0.0.1:8002 peer=127.0.0.1:34316>", "CONTENT_TYPE": "application/json", "wsgi.file_wrapper": "<class 'gunicorn.http.wsgi.FileWrapper'>", "CSRF_COOKIE": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "HTTP_ACCEPT_ENCODING": "gzip, deflate, br"}, "request":
Yes, the abm/nsm services are up when I check today.
I just noticed a message:
Nov 8 11:44:56 Node-182 alba[23162]: 2018-11-08 11:44:56 066588 +0800 - Node-182 - 23162/0000 - alba/proxy - 13125 - info - connect_with failed: 192.168.2.181 8602 None Net_fd.TCP (fd:31): (Unix.Unix_error "Connection refused" connect ""); backtrace:; Raised at file "format.ml" (inlined), line 239, characters 35-52; Called from file "format.ml", line 465, characters 8-33; Called from file "format.ml", line 480, characters 6-24
I find it's not always the case that ASDs cannot be claimed after a node is removed, reinstalled and added back. Most of the time they can be claimed and everything works. But sometimes it doesn't work, and ASDs cannot be claimed even on a brand new node, i.e. one whose IP and hostname are completely new.
[albabackends.add_osds]
Jan 24 15:31:23 NODE-3 gunicorn: 2019-01-24 08:31:23 54000 +0100 - NODE-3 - 2657/139807204896592 - api/decorators.py - new_function - 138 - INFO - [albabackends.add_osds] - 17805a67-f2ce-426c-9323-b699faa472c7 - [] - {"pk": "7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8"} - {"cookies": {"csrftoken": "In8uYjduj9i1qO4amQQ79CMi3nviUqJD", "sessionid": "obnmkapfq9wm4ns9ii1xu59jdcpxbwkb"}, "meta": {"HTTP_AUTHORIZATION": "Bearer ?c,y>xl]rR}wNxbB|3t{dM4yNAA4@4F>S+[jHybm8GI3~>JN3By7QnU,gr?sS99", "wsgi.multiprocess": "True", "HTTP_REFERER": "https://192.168.0.43:443/", "SERVER_PROTOCOL": "HTTP/1.0", "SERVER_SOFTWARE": "gunicorn/18.0", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8/add_osds/", "HTTP_X_FORWARDED_SSL": "on", "QUERY_STRING": "timestamp=1548315083286", "HTTP_X_REAL_IP": "192.168.3.53", "CONTENT_LENGTH": "169", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0", "HTTP_CONNECTION": "close", "HTTP_COOKIE": "csrftoken=In8uYjduj9i1qO4amQQ79CMi3nviUqJD; sessionid=obnmkapfq9wm4ns9ii1xu59jdcpxbwkb", "SERVER_NAME": "192.168.0.43", "REMOTE_PORT": "56114", "wsgi.url_scheme": "https", "SERVER_PORT": "443", "HTTP_ACCEPT": "application/json; version=", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7f27667e44d0>", "HTTP_HOST": "192.168.0.43:443", "wsgi.multithread": "False", "HTTP_X_SCHEME": "https", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8/add_osds/?timestamp=1548315083286", "wsgi.run_once": "False", "wsgi.errors": "<open file '
Hi JeffreyDevloo ,
I just noticed that the code I'm using differs. On my node the URL for claiming is the following:
/api/alba/backends/7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8/add_osds/?timestamp=1548300223638
which is add_osds instead of add_units.
I think you were asking me to search for albabackends.add_units, but I'm sure I hit the same issue on the original F version.
There are some things you can check:
- Are the ASDs you want to claim present in the output of the list-available-osds command?
$> alba list-available-osds --config <abm-url>
- Are the ASDs up and running? (The connection refused error message hints that this is not the case.)
- Can you claim the ASDs via the command line? (See the sketch right after this list.)
Hi toolslive, any clue about the abm-url? An example would be nice.
it's an arakoon with the role of abm, so
$> pgrep -a arakoon | grep abm
...
5992 /usr/bin/arakoon --node wZ4GAcmR2MZl5x7S -config arakoon://config/ovs/arakoon/ny1-hddbackend01-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini -autofix -start
....
so the abm's url for that alba is
arakoon://config/ovs/arakoon/ny1-hddbackend01-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini
it's needed when you want to do things with that backend (add namespaces, delete namespaces, add osds, claim osds, purge osds, ....)
Mind: Your environment might have multiple backends and not all nodes run an arakoon for a backend's abm.
Nice, very nice! I have only one backend so far.
[root@NODE-3 api]# alba list-available-osds --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%OpenvStorage%2Fconfig%2Farakoon_cacc.ini
2019-01-24 17:26:19 029073 +0800 - NODE-3 - 117990/0000 - alba/cli - 0 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 17:26:19 033158 +0800 - NODE-3 - 117990/0000 - alba/cli - 1 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 17:26:19 033349 +0800 - NODE-3 - 117990/0000 - alba/cli - 2 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:7) succeeded
2019-01-24 17:26:19 034312 +0800 - NODE-3 - 117990/0000 - alba/cli - 3 - info - Found 0 available osds: []
192.168.0.49 is the target node whose ASD claiming I'm struggling with. There should be 4 hard drives and therefore 4 ASDs.
you can add it via the cli
alba add-osd --help
(to add an asd, you will need its host and port, and the abm-url for the backend)
After you've added it, you can list it with list-available-osds, and you can claim it (via the cli).
Normally, ASDs are discovered by alba components (maintenance, proxies) via UDP multicast, but multicast does not always work because of network (configuration) issues.
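(If discovery seems off, a quick TCP reachability check towards an ASD's port can rule out plain connectivity problems; the host/port below are only an example:)
$> nc -zv <asd-host> 8600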
I just can't figure out add-osd.
alba add-osd -h 10.10.10.9 -p 8600 --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%OpenvStorage%2Fconfig%2Farakoon_cacc.ini
I believe something is missing. For example, how do I assign the OSD id?
alba add-osd -h 10.10.10.9 -p 8600 --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --node-id SNqRbMGvYW63Arm8
2019-01-24 18:01:05 845582 +0800 - NODE-3 - 12767/0000 - alba/cli - 0 - info - Connecting to ADDR_INET(10.10.10.9,8600)
2019-01-24 18:01:05 846951 +0800 - NODE-3 - 12767/0000 - alba/cli - 1 - info - connect_with 10.10.10.9 8600 None Net_fd.TCP (fd:3) succeeded
2019-01-24 18:01:06 047671 +0800 - NODE-3 - 12767/0000 - alba/cli - 2 - info - Connecting to ADDR_INET(10.10.10.9,8600)
2019-01-24 18:01:06 047885 +0800 - NODE-3 - 12767/0000 - alba/cli - 3 - info - connect_with 10.10.10.9 8600 None Net_fd.TCP (fd:3) succeeded
2019-01-24 18:01:06 048270 +0800 - NODE-3 - 12767/0000 - alba/cli - 4 - info - long_id :"MOVHEthMdvzgp7QxFnihNUGwixtEE4eJ"
2019-01-24 18:01:06 051266 +0800 - NODE-3 - 12767/0000 - alba/cli - 5 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 18:01:06 052220 +0800 - NODE-3 - 12767/0000 - alba/cli - 6 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 18:01:06 052381 +0800 - NODE-3 - 12767/0000 - alba/cli - 7 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:7) succeeded
all right.
alba claim-osd --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --long-id MOVHEthMdvzgp7QxFnihNUGwixtEE4eJ
2019-01-24 18:11:19 362473 +0800 - NODE-3 - 39533/0000 - alba/cli - 0 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 18:11:19 365499 +0800 - NODE-3 - 39533/0000 - alba/cli - 1 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 18:11:19 366503 +0800 - NODE-3 - 39533/0000 - alba/cli - 2 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 18:11:19 366653 +0800 - NODE-3 - 39533/0000 - alba/cli - 3 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 18:11:19 366762 +0800 - NODE-3 - 39533/0000 - alba/cli - 4 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:7) succeeded
2019-01-24 18:11:19 366931 +0800 - NODE-3 - 39533/0000 - alba/cli - 5 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:8) succeeded
2019-01-24 18:11:19 368114 +0800 - NODE-3 - 39533/0000 - alba/cli - 6 - info - Connecting to ADDR_INET(10.10.10.9,8600)
2019-01-24 18:11:19 368368 +0800 - NODE-3 - 39533/0000 - alba/cli - 7 - info - connect_with 10.10.10.9 8600 None Net_fd.TCP (fd:3) succeeded
2019-01-24 18:11:19 374118 +0800 - NODE-3 - 39533/0000 - alba/cli - 8 - info - Connecting to ADDR_INET(10.10.10.9,8600)
Cool ! Appreciate!
Hi Yongshengma
The underlying commands are what the Framework also invokes. I'm still interested in why the claiming was unsuccessful in the first place. If you ever find yourself back in that situation, please provide me with the worker logging and I'll look into it.
Best regards
Hi JeffreyDevloo
Sure. No problem.
The add-osd command didn't provide an option to specify an osd id, did it? Does it only pick up one available osd at a time?
There are 2 ids in this context (see the example after this list):
osd_id: the number the abm gives the osd when it's added (it simply increments)
long_id: the world-wide unique identifier of the osd. The osd creates this when launched for the first time (and communicates it when you connect to it).
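For example, in the add-osd output above the long_id is "MOVHEthMdvzgp7QxFnihNUGwixtEE4eJ", while the osd_id is assigned by the abm when the osd is added. Something like list-osds (if your alba build has it) shows both ids for the claimed osds:
$> alba list-osds --config <abm-url>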
ftr, which alba version, and which framework version is this?
alba version
1.3.25-33-ge43faca-dirty
git_revision: "heads/master-0-ge43faca-dirty"
git_repo: "https://github.com/openvstorage/alba.git"
compile_time: "07/12/2017 22:03:11 UTC"
machine: "localhost.localdomain 3.10.0-693.el7 x86_64 x86_64 x86_64 GNU/Linux"
model_name: "Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz"
compiler_version: "4.04.2"
is_alba_test: false
dependencies:
arakoon_client 1.9.22 Arakoon client
bz2 0.6.0 [n/a]
cmdliner v1.0.2 Declarative definition of command line interfaces
core v0.9.1
ctypes 0.13.1 Combinators for binding to C libraries without writing any C.
ctypes.foreign 0.13.1 Dynamic linking of C functions
kinetic-client 0.0.6 Kinetic client
lwt 3.0.0 Lightweight thread library for OCaml (core library)
lwt.unix 3.0.0 Unix support for Lwt
oUnit 2.0.6 Unit testing framework
ocplib-endian 1.0 Optimised functions to read and write int16/32/64 from strings and bigarrays
ppx_deriving.enum 4.1 [@@deriving enum]
ppx_deriving.show 4.1 [@@deriving show]
ppx_deriving_yojson 3.1 [@@deriving yojson]
redis 0.3.3 Ocaml bindings for Redis
rocks 0.3.0 Rocksdb binding
sexplib v0.9.2
snappy 0.1.0 Bindings to snappy compression library
ssl 0.5.3 OCaml bindings to libssl
tiny_json 1.1.4 A small Json library from OCAMLTTER
uri 1.9.4
yojson 1.4.0 JSON parsing and printing (successor of json-wheel)
framework version openvstorage-backend-core-1.10.2_dev.248ae6a openvstorage-hc-1.10.2_dev.248ae6a openvstorage-core-2.10.3_dev.b57ccf4 openvstorage-backend-webapps-1.10.2_dev.248ae6a openvstorage-2.10.3_dev.b57ccf4 openvstorage-extensions-0.2.2_dev.461c67b openvstorage-webapps-2.10.3_dev.b57ccf4 openvstorage-sdm-1.10.1_dev.936e27e openvstorage-backend-1.10.2_dev.248ae6a
The suffix such as 248ae6a is the git revision.
The previous claiming issue still exists even after I powered the network switch off and on again. I just used the cli to claim the osds one by one.
However, I got a new issue afterwards. When I extended the vPool to this node, the action failed with this error:
Jan 25 11:29:17 NODE-3 celery: 2019-01-25 11:29:1707400 +0800 - NODE-3 - 42057/140345664415552 - celery/log.py - log - 109 - ERROR - Task ovs.storagerouter.add_vpool[dc313fcc-49cf-4281-9c1d-3e33d3df7993] raised unexpected: ConnectionError(MaxRetryError('None: Max retries exceeded with url: /api/oauth2/token/ (Caused by None)',),)
Stuck again. NODE-3 is the node whose web API I'm accessing; NODE-9 is the target node to extend the vPool to.
Two more pieces of info from just before that:
Jan 25 11:29:16 NODE-3 celery: 2019-01-25 11:29:16 78900 +0800 - NODE-3 - 42187/140345664415552 - lib/storagerouter.py - add_vpool - 45 - ERROR - Something went wrong during the validation or modeling of vPool pool on StorageRouter NODE-9
Jan 25 11:29:16 NODE-3 celery: ConnectionError: HTTPSConnectionPool(host='192.168.0.41', port=443): Max retries exceeded with url: /api/oauth2/token/ (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fa4a4643950>: Failed to establish a new connection: [Errno 113] No route to host',))
It tries to connect to host='192.168.0.41', but this host has been removed and the NEW node no longer has this IP. It looks like the old endpoint is still recorded somewhere.
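(To check whether the framework model still references the old node, I suppose something like this would list the StorageRouter IPs it knows about; the import path is my assumption based on the framework layout:)
$> python -c "from ovs.dal.lists.storagerouterlist import StorageRouterList; print([sr.ip for sr in StorageRouterList.get_storagerouters()])"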
Hi @JeffreyDevloo
The attached logs are from the F version. You will find logs from 3 nodes: 192.168.2.31, 192.168.2.32, 192.168.2.33. The cluster was set up in this order.
Steps I did:
- shutdown 192.168.2.31 (first installed node)
- removed it; success
- reinstalled 192.168.2.31
- ovs setup
- assigned roles
- added ASD node
- initialized drives ; success
- claimed osds; spinning forever
To continue:
- claimed via the alba cli
- extended the vPool; it went away silently without returning success
My workaround:
- extended the vPool before claiming osds // but I didn't see the backend's NSM/ABM services show up on the new node (192.168.2.31) even after 2 hours
I accessed the web page on node 192.168.2.33, so the log of 192.168.2.33 contains the whole process described above.
So far this issue always occurs on the first installed node, without exception. One more thing worth mentioning: node 192.168.2.31 has been removed, but the vPool's detail page still shows this vPool's connection as 192.168.2.31:443. That info is wrong, isn't it?
Hi @toolslive
What are the alba commands for the reverse operations? I mistakenly pulled a hard drive that was running as an ASD, and its data has been wiped. This caused the framework-alba-plugin UI to hang. I think I have to remove this ASD from the backend. Should I use alba asd-delete first and then purge-osd? What does the key required by asd-delete look like?
Best regards, Yongsheng