ASDs cannot be claimed after a node is removed and then added back
Hi,
I ran into an issue after a node was removed from the cluster. The ticket's title may not be entirely accurate at this point.
Let's say A, B and C are the three nodes of the cluster, installed and set up in the order A, B, C. I did the following:
- removed node A while it was offline
- reinstalled the OS on A
- configured the network with the same IP as before the removal
- installed OVS
- ran ovs setup
- assigned roles
- added this new ASD node to the backend
- initialized the ASDs
- claimed them
Now the claiming icon keeps spinning and never finishes.
This issue also happened when I added a fourth node to the cluster, so I would conclude that claiming ASDs gets stuck whenever more nodes join. However, the issue does not occur if no node has been removed beforehand.
BTW, ovs setup on the fourth node had no problem. It was set up as an extra node without the master role, and all OVS services are running.
All my nodes are physical servers.
I found there is no alba maintenance running on this node.
Claiming ASDs still doesn't work after I ran the alba maintenance command line per JeffreyDevloo's suggestion in #1703.
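(For reference, here is roughly how I checked that no maintenance process was running; the name pattern is just my guess at what such a process would be called:)
root@NODE-181:~# pgrep -af alba | grep -i maintenance
root@NODE-181:~# systemctl list-units | grep -i alba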
Do you have logging from the ovs-workers? I'd like to follow the trail on this one.
The task name is alba.add_units.
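Something along these lines should surface the relevant entries (adjust the unit name / log path to your setup):
$> journalctl -u ovs-workers | grep add_units
$> grep add_units /var/log/syslog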
Best regards
Jeffrey Devloo
Attached is the log from journalctl -u ovs-workers. I clicked claim again just now, so the most recent 10 minutes should be of interest. I didn't see alba.add_units in it.
workers.zip
Nov 6 10:28:40 NODE-181 gunicorn[3563]: 2018-11-06 03:28:40 85200 +0100 - NODE-181 - 4073/140101558844304 - extensions/api - 34 - INFO - [albabackends.add_units] - 83f524a7-2642-41ae-86f8-b068be3322f1 - [] - {"pk": "02d3b439-4688-42c7-b33a-c6a81844da92"} - {"cookies": {"csrftoken": "vDDLVeCu2ZPidZRrRRqNNxTlYiToESE5", "sessionid": "tyq5w0mw1fww6bt0hzlinr7cv3461xpl"}, "meta": {"HTTP_AUTHORIZATION": "Bearer WLBaF+W.ZS+PbM,?sFOj.lTJ|dz5pP*]<S+xNU1TbsO|H<yTlr>Elvb{UbqrqLn|", "wsgi.multiprocess": "True", "HTTP_COOKIE": "csrftoken=vDDLVeCu2ZPidZRrRRqNNxTlYiToESE5; sessionid=tyq5w0mw1fww6bt0hzlinr7cv3461xpl", "HTTP_X_FORWARDED_SSL": "on", "SERVER_SOFTWARE": "gunicorn/19.4.5", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/02d3b439-4688-42c7-b33a-c6a81844da92/add_units/", "SERVER_PROTOCOL": "HTTP/1.0", "QUERY_STRING": "timestamp=1541471322141", "HTTP_X_REAL_IP": "192.168.3.138", "CONTENT_LENGTH": "84", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0", "HTTP_CONNECTION": "close", "HTTP_REFERER": "https://192.168.2.181/", "SERVER_NAME": "127.0.0.1", "REMOTE_PORT": "50262", "wsgi.url_scheme": "https", "SERVER_PORT": "8002", "HTTP_X_SCHEME": "https", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7f6bef7ddfd0>", "HTTP_HOST": "192.168.2.181", "wsgi.multithread": "True", "HTTP_ACCEPT": "application/json; version=*", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/02d3b439-4688-42c7-b33a-c6a81844da92/add_units/?timestamp=1541471322141", "wsgi.run_once": "False", "wsgi.errors": "<gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7f6bef7dd950>", "REMOTE_ADDR": "127.0.0.1", "HTTP_ACCEPT_LANGUAGE": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2", "gunicorn.socket": "<socket fileno=61 sock=127.0.0.1:8002 peer=127.0.0.1:50262>", "CONTENT_TYPE": "application/json", "wsgi.file_wrapper": "<class 'gunicorn.http.wsgi.FileWrapper'>", "CSRF_COOKIE": "vDDLVeCu2ZPidZRrRRqNNxTlYiToESE5", "HTTP_ACCEPT_ENCODING": "gzip, deflate, br"}, "requ
I reinstalled and set up a cluster and replicated this issue. I removed an offline node and reinstalled OS+OVS.
I noticed the following on the newly added node:
No arakoon directory under the mount point of the drive the roles were assigned to:
root@NODE-181:~# ls /mnt/hdd1/
lost+found
Some services are missing (not even shown as inactive or activating):
OVS running processes
=====================
ovs-arakoon-config active 5045
ovs-arakoon-ovsdb active 5353
ovs-scheduled-tasks active 9779
ovs-support-agent active 9819
ovs-volumerouter-consumer active 9773
ovs-watcher-config active 3931
ovs-watcher-framework active 9770
ovs-webapp-api active 9775
ovs-workers active 10105
unlike the other nodes:
OVS running processes
=====================
ovs-albaproxy_pool-2_0 active 23069
ovs-albaproxy_pool-2_1 active 23162
ovs-arakoon-config active 35058
ovs-arakoon-ovsdb active 35968
ovs-arakoon-sata-back-abm active 3735
ovs-arakoon-sata-back-nsm_0 active 17278
ovs-arakoon-voldrv active 3834
ovs-dtl_pool-2 active 22994
ovs-scheduled-tasks active 880
ovs-support-agent active 1098
ovs-volumedriver_pool-2 active 23261
ovs-volumerouter-consumer active 873
ovs-watcher-config active 4098
ovs-watcher-framework active 871
ovs-watcher-volumedriver active 22912
ovs-webapp-api active 876
ovs-workers active 1341
Attached are new logs of ovs-workers and syslog.
Best regards,
Hi yongshengma
Try to grep for add_units on all nodes within the cluster. Because of the distributed nature of the ovs-workers, a task might be executed on a different host than the one you sent the API call to.
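For example, from any node (the hostnames below are placeholders):
$> for node in nodeA nodeB nodeC; do ssh $node 'grep add_units /var/log/syslog'; done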
The 'missing' services appear to be the volumedriver and abm/nsm services. Volumedriver-related services are only added once you extend the vPool to the new host. The ovsdb arakoon is not deployed on the DB role; it resides under /opt/OpenvStorage/db instead. The ABM and NSM arakoons will be added again after some time (a checkup is scheduled every 30 minutes by default). You can manually trigger the checkup using:
from ovs.lib.alba import AlbaController
AlbaController.scheduled_alba_arakoon_checkup()
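(This is meant for a Python shell on a master node; as a shell one-liner, assuming the OVS packages are importable with the system Python, it would look something like:)
$> python -c "from ovs.lib.alba import AlbaController; AlbaController.scheduled_alba_arakoon_checkup()"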
Best regards
Jeffrey Devloo
Yes, I found the related info on another node, as you said: grep add_units /var/log/syslog
Nov 8 11:44:53 Node-182 gunicorn[876]: 2018-11-08 04:44:53 41000 +0100 - Node-182 - 1594/140495241835696 - extensions/api - 7271 - INFO - [albabackends.add_units] - caff05ee-c451-4e89-8f10-7bfefc53411b - [] - {"pk": "e8831b18-a552-4abe-b254-171d2261beb2"} - {"cookies": {"csrftoken": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "sessionid": "6wio9zbd2tkdtxvwm0cw97tvn16isbu0"}, "meta": {"HTTP_AUTHORIZATION": "Bearer H!f:E=0RUZr4smA={4bF/{[#.C0C2WR25eo@an0[1}v1_CYt{=hh>6anOm{*]/@v", "wsgi.multiprocess": "True", "HTTP_COOKIE": "csrftoken=OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb; sessionid=6wio9zbd2tkdtxvwm0cw97tvn16isbu0", "HTTP_X_FORWARDED_SSL": "on", "SERVER_SOFTWARE": "gunicorn/19.4.5", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_units/", "SERVER_PROTOCOL": "HTTP/1.0", "QUERY_STRING": "timestamp=1541648695066", "HTTP_X_REAL_IP": "192.168.3.138", "CONTENT_LENGTH": "84", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0", "HTTP_CONNECTION": "close", "HTTP_REFERER": "https://192.168.2.182/", "SERVER_NAME": "127.0.0.1", "REMOTE_PORT": "34316", "wsgi.url_scheme": "https", "SERVER_PORT": "8002", "HTTP_X_SCHEME": "https", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7fc7991969d0>", "HTTP_HOST": "192.168.2.182", "wsgi.multithread": "True", "HTTP_ACCEPT": "application/json; version=*", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_units/?timestamp=1541648695066", "wsgi.run_once": "False", "wsgi.errors": "<gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7fc798b28850>", "REMOTE_ADDR": "127.0.0.1", "HTTP_ACCEPT_LANGUAGE": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2", "gunicorn.socket": "<socket fileno=65 sock=127.0.0.1:8002 peer=127.0.0.1:34316>", "CONTENT_TYPE": "application/json", "wsgi.file_wrapper": "<class 'gunicorn.http.wsgi.FileWrapper'>", "CSRF_COOKIE": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "HTTP_ACCEPT_ENCODING": "gzip, deflate, br"}, "re
Nov 8 11:44:53 Node-182 gunicorn[876]: 2018-11-08 04:44:53 41000 +0100 - Node-182 - 1594/140495241835696 - log/api - 7271 - INFO - [albabackends.add_units] - caff05ee-c451-4e89-8f10-7bfefc53411b - [] - {"pk": "e8831b18-a552-4abe-b254-171d2261beb2"} - {"cookies": {"csrftoken": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "sessionid": "6wio9zbd2tkdtxvwm0cw97tvn16isbu0"}, "meta": {"HTTP_AUTHORIZATION": "Bearer H!f:E=0RUZr4smA={4bF/{[#.C0C2WR25eo@an0[1}v1_CYt{=hh>6anOm{*]/@v", "wsgi.multiprocess": "True", "HTTP_COOKIE": "csrftoken=OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb; sessionid=6wio9zbd2tkdtxvwm0cw97tvn16isbu0", "HTTP_X_FORWARDED_SSL": "on", "SERVER_SOFTWARE": "gunicorn/19.4.5", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_units/", "SERVER_PROTOCOL": "HTTP/1.0", "QUERY_STRING": "timestamp=1541648695066", "HTTP_X_REAL_IP": "192.168.3.138", "CONTENT_LENGTH": "84", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0", "HTTP_CONNECTION": "close", "HTTP_REFERER": "https://192.168.2.182/", "SERVER_NAME": "127.0.0.1", "REMOTE_PORT": "34316", "wsgi.url_scheme": "https", "SERVER_PORT": "8002", "HTTP_X_SCHEME": "https", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7fc7991969d0>", "HTTP_HOST": "192.168.2.182", "wsgi.multithread": "True", "HTTP_ACCEPT": "application/json; version=*", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/e8831b18-a552-4abe-b254-171d2261beb2/add_unit/?timestamp=1541648695066", "wsgi.run_once": "False", "wsgi.errors": "<gunicorn.http.wsgi.WSGIErrorsWrapper object at 0x7fc798b28850>", "REMOTE_ADDR": "127.0.0.1", "HTTP_ACCEPT_LANGUAGE": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2", "gunicorn.socket": "<socket fileno=65 sock=127.0.0.1:8002 peer=127.0.0.1:34316>", "CONTENT_TYPE": "application/json", "wsgi.file_wrapper": "<class 'gunicorn.http.wsgi.FileWrapper'>", "CSRF_COOKIE": "OvBaCC5SnhUpQzSM0yGv7Zs7BB6QFvEb", "HTTP_ACCEPT_ENCODING": "gzip, deflate, br"}, "request":
Yes, the abm/nsm services are up when I check today.
I just noticed a message:
Nov 8 11:44:56 Node-182 alba[23162]: 2018-11-08 11:44:56 066588 +0800 - Node-182 - 23162/0000 - alba/proxy - 13125 - info - connect_with failed: 192.168.2.181 8602 None Net_fd.TCP (fd:31): (Unix.Unix_error "Connection refused" connect ""); backtrace:; Raised at file "format.ml" (inlined), line 239, characters 35-52; Called from file "format.ml", line 465, characters 8-33; Called from file "format.ml", line 480, characters 6-24
I find it's not always the case that ASDs cannot be claimed after a node is removed, reinstalled and added back. Most of the time they can be claimed and everything works. But sometimes it doesn't work, and ASDs cannot be claimed even on a brand new node, i.e. one whose IP and hostname are completely new.
[albabackends.add_osds]
Jan 24 15:31:23 NODE-3 gunicorn: 2019-01-24 08:31:23 54000 +0100 - NODE-3 - 2657/139807204896592 - api/decorators.py - new_function - 138 - INFO - [albabackends.add_osds] - 17805a67-f2ce-426c-9323-b699faa472c7 - [] - {"pk": "7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8"} - {"cookies": {"csrftoken": "In8uYjduj9i1qO4amQQ79CMi3nviUqJD", "sessionid": "obnmkapfq9wm4ns9ii1xu59jdcpxbwkb"}, "meta": {"HTTP_AUTHORIZATION": "Bearer ?c,y>xl]rR}wNxbB|3t{dM4yNAA4@4F>S+[jHybm8GI3~>JN3By7QnU,gr?sS99", "wsgi.multiprocess": "True", "HTTP_REFERER": "https://192.168.0.43:443/", "SERVER_PROTOCOL": "HTTP/1.0", "SERVER_SOFTWARE": "gunicorn/18.0", "SCRIPT_NAME": "/api", "REQUEST_METHOD": "POST", "PATH_INFO": "/alba/backends/7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8/add_osds/", "HTTP_X_FORWARDED_SSL": "on", "QUERY_STRING": "timestamp=1548315083286", "HTTP_X_REAL_IP": "192.168.3.53", "CONTENT_LENGTH": "169", "HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0", "HTTP_CONNECTION": "close", "HTTP_COOKIE": "csrftoken=In8uYjduj9i1qO4amQQ79CMi3nviUqJD; sessionid=obnmkapfq9wm4ns9ii1xu59jdcpxbwkb", "SERVER_NAME": "192.168.0.43", "REMOTE_PORT": "56114", "wsgi.url_scheme": "https", "SERVER_PORT": "443", "HTTP_ACCEPT": "application/json; version=", "HTTP_X_REQUESTED_WITH": "XMLHttpRequest", "wsgi.input": "<gunicorn.http.body.Body object at 0x7f27667e44d0>", "HTTP_HOST": "192.168.0.43:443", "wsgi.multithread": "False", "HTTP_X_SCHEME": "https", "wsgi.version": "(1, 0)", "RAW_URI": "/alba/backends/7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8/add_osds/?timestamp=1548315083286", "wsgi.run_once": "False", "wsgi.errors": "<open file '
Hi JeffreyDevloo ,
I just noticed that the code I'm using differs. On my node the URL for claiming is the following:
/api/alba/backends/7dcb2c30-3cb5-4cd9-85c4-92f27e4468b8/add_osds/?timestamp=1548300223638
which is add_osds instead of add_units.
I think you were asking me to search for albabackends.add_units, but I'm sure I hit the same issue on the original F version.
There are some things you can check:
- Are the ASDs you want to claim present in the output of the list-available-osds command?
$> alba list-available-osds --config <abm-url>
- Are the ASDs up and running? (The connection refused error message hints that this is not the case.)
- Can you claim the ASDs via the command line? (See the sketch right after this list.)
Hi toolslive, any clue about the abm-url? An example would be nice.
it's an arakoon with the role of abm, so
$> pgrep -a arakoon | grep abm
...
5992 /usr/bin/arakoon --node wZ4GAcmR2MZl5x7S -config arakoon://config/ovs/arakoon/ny1-hddbackend01-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini -autofix -start
....
so the abm's url for that alba is
arakoon://config/ovs/arakoon/ny1-hddbackend01-abm/config?ini=%2Fmnt%2Fssd1%2Farakoon%2Fexternal_arakoon_cacc.ini
it's needed when you want to do things with that backend (add namespaces, delete namespaces, add osds, claim osds, purge osds, ....)
Mind: Your environment might have multiple backends and not all nodes run an arakoon for a backend's abm.
Nice, very nice! I have only one backend so far.
[root@NODE-3 api]# alba list-available-osds --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%OpenvStorage%2Fconfig%2Farakoon_cacc.ini
2019-01-24 17:26:19 029073 +0800 - NODE-3 - 117990/0000 - alba/cli - 0 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 17:26:19 033158 +0800 - NODE-3 - 117990/0000 - alba/cli - 1 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 17:26:19 033349 +0800 - NODE-3 - 117990/0000 - alba/cli - 2 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:7) succeeded
2019-01-24 17:26:19 034312 +0800 - NODE-3 - 117990/0000 - alba/cli - 3 - info - Found 0 available osds: []
192.168.0.49 is the target node whose ASD claiming I'm struggling with. There should be 4 hard drives and therefore 4 ASDs.
you can add it via the cli
alba add-osd --help
(to add an asd, you will need its host and port, and the abm-url for the backend)
After you've added it, you can list it with list-available-osds, and you can claim it (via the cli).
Normally, ASDs are discovered by alba components (maintenance, proxies) via UDP multicast, but multicast does not always work because of network (configuration) issues.
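(If discovery seems off, a quick TCP reachability check towards an ASD's port can rule out plain connectivity problems; the host/port below are only an example:)
$> nc -zv <asd-host> 8600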
I just can't figure out add-osd.
alba add-osd -h 10.10.10.9 -p 8600 --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%OpenvStorage%2Fconfig%2Farakoon_cacc.ini
I believe something is missing. For example, how do I assign the OSD id?
alba add-osd -h 10.10.10.9 -p 8600 --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --node-id SNqRbMGvYW63Arm8
2019-01-24 18:01:05 845582 +0800 - NODE-3 - 12767/0000 - alba/cli - 0 - info - Connecting to ADDR_INET(10.10.10.9,8600)
2019-01-24 18:01:05 846951 +0800 - NODE-3 - 12767/0000 - alba/cli - 1 - info - connect_with 10.10.10.9 8600 None Net_fd.TCP (fd:3) succeeded
2019-01-24 18:01:06 047671 +0800 - NODE-3 - 12767/0000 - alba/cli - 2 - info - Connecting to ADDR_INET(10.10.10.9,8600)
2019-01-24 18:01:06 047885 +0800 - NODE-3 - 12767/0000 - alba/cli - 3 - info - connect_with 10.10.10.9 8600 None Net_fd.TCP (fd:3) succeeded
2019-01-24 18:01:06 048270 +0800 - NODE-3 - 12767/0000 - alba/cli - 4 - info - long_id :"MOVHEthMdvzgp7QxFnihNUGwixtEE4eJ"
2019-01-24 18:01:06 051266 +0800 - NODE-3 - 12767/0000 - alba/cli - 5 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 18:01:06 052220 +0800 - NODE-3 - 12767/0000 - alba/cli - 6 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 18:01:06 052381 +0800 - NODE-3 - 12767/0000 - alba/cli - 7 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:7) succeeded
all right.
alba claim-osd --config arakoon://config/ovs/arakoon/ceng-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini --long-id MOVHEthMdvzgp7QxFnihNUGwixtEE4eJ
2019-01-24 18:11:19 362473 +0800 - NODE-3 - 39533/0000 - alba/cli - 0 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 18:11:19 365499 +0800 - NODE-3 - 39533/0000 - alba/cli - 1 - info - Albamgr_client.make_client :ceng-abm
2019-01-24 18:11:19 366503 +0800 - NODE-3 - 39533/0000 - alba/cli - 2 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 18:11:19 366653 +0800 - NODE-3 - 39533/0000 - alba/cli - 3 - info - Connecting to ADDR_INET(192.168.0.49,26406)
2019-01-24 18:11:19 366762 +0800 - NODE-3 - 39533/0000 - alba/cli - 4 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:7) succeeded
2019-01-24 18:11:19 366931 +0800 - NODE-3 - 39533/0000 - alba/cli - 5 - info - connect_with 192.168.0.49 26406 None Net_fd.TCP (fd:8) succeeded
2019-01-24 18:11:19 368114 +0800 - NODE-3 - 39533/0000 - alba/cli - 6 - info - Connecting to ADDR_INET(10.10.10.9,8600)
2019-01-24 18:11:19 368368 +0800 - NODE-3 - 39533/0000 - alba/cli - 7 - info - connect_with 10.10.10.9 8600 None Net_fd.TCP (fd:3) succeeded
2019-01-24 18:11:19 374118 +0800 - NODE-3 - 39533/0000 - alba/cli - 8 - info - Connecting to ADDR_INET(10.10.10.9,8600)
Cool ! Appreciate!
Hi Yongshengma
The underlying commands are what the Framework also invokes. I'm still interested in why the claiming was unsuccessful in the first place. If you ever find yourself back in that situation, please provide me with the worker logging and I'll look into it.
Best regards
Hi JeffreyDevloo
Sure. No problem.
The add-osd command didn't provide an option to specify an osd id, did it? Does it only pick up one available osd at a time?
There are 2 ids in this context (see the example after this list):
osd_id: the number the abm gives the osd when it's added (it simply increments)
long_id: the world-wide unique identifier of the osd. The osd creates this when launched for the first time (and communicates it when you connect to it).
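For example, in the add-osd output above the long_id is "MOVHEthMdvzgp7QxFnihNUGwixtEE4eJ", while the osd_id is assigned by the abm when the osd is added. Something like list-osds (if your alba build has it) shows both ids for the claimed osds:
$> alba list-osds --config <abm-url>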
ftr, which alba version, and which framework version is this?
alba version
1.3.25-33-ge43faca-dirty
git_revision: "heads/master-0-ge43faca-dirty"
git_repo: "https://github.com/openvstorage/alba.git"
compile_time: "07/12/2017 22:03:11 UTC"
machine: "localhost.localdomain 3.10.0-693.el7 x86_64 x86_64 x86_64 GNU/Linux"
model_name: "Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz"
compiler_version: "4.04.2"
is_alba_test: false
dependencies:
arakoon_client 1.9.22 Arakoon client
bz2 0.6.0 [n/a]
cmdliner v1.0.2 Declarative definition of command line interfaces
core v0.9.1
ctypes 0.13.1 Combinators for binding to C libraries without writing any C.
ctypes.foreign 0.13.1 Dynamic linking of C functions
kinetic-client 0.0.6 Kinetic client
lwt 3.0.0 Lightweight thread library for OCaml (core library)
lwt.unix 3.0.0 Unix support for Lwt
oUnit 2.0.6 Unit testing framework
ocplib-endian 1.0 Optimised functions to read and write int16/32/64 from strings and bigarrays
ppx_deriving.enum 4.1 [@@deriving enum]
ppx_deriving.show 4.1 [@@deriving show]
ppx_deriving_yojson 3.1 [@@deriving yojson]
redis 0.3.3 Ocaml bindings for Redis
rocks 0.3.0 Rocksdb binding
sexplib v0.9.2
snappy 0.1.0 Bindings to snappy compression library
ssl 0.5.3 OCaml bindings to libssl
tiny_json 1.1.4 A small Json library from OCAMLTTER
uri 1.9.4
yojson 1.4.0 JSON parsing and printing (successor of json-wheel)
framework version openvstorage-backend-core-1.10.2_dev.248ae6a openvstorage-hc-1.10.2_dev.248ae6a openvstorage-core-2.10.3_dev.b57ccf4 openvstorage-backend-webapps-1.10.2_dev.248ae6a openvstorage-2.10.3_dev.b57ccf4 openvstorage-extensions-0.2.2_dev.461c67b openvstorage-webapps-2.10.3_dev.b57ccf4 openvstorage-sdm-1.10.1_dev.936e27e openvstorage-backend-1.10.2_dev.248ae6a
The suffix such as 248ae6a is the git revision.
The previous claiming issue still exists even after I powered the network switch off and on again. I just used the cli to claim the osds one by one.
However, I got a new issue afterwards. When I extended the vPool to this node, the action failed with this error:
Jan 25 11:29:17 NODE-3 celery: 2019-01-25 11:29:1707400 +0800 - NODE-3 - 42057/140345664415552 - celery/log.py - log - 109 - ERROR - Task ovs.storagerouter.add_vpool[dc313fcc-49cf-4281-9c1d-3e33d3df7993] raised unexpected: ConnectionError(MaxRetryError('None: Max retries exceeded with url: /api/oauth2/token/ (Caused by None)',),)
Stuck again. NODE-3 is the node whose web API I'm accessing; NODE-9 is the target node to extend the vPool to.
Two more pieces of info from just before that:
Jan 25 11:29:16 NODE-3 celery: 2019-01-25 11:29:16 78900 +0800 - NODE-3 - 42187/140345664415552 - lib/storagerouter.py - add_vpool - 45 - ERROR - Something went wrong during the validation or modeling of vPool pool on StorageRouter NODE-9
Jan 25 11:29:16 NODE-3 celery: ConnectionError: HTTPSConnectionPool(host='192.168.0.41', port=443): Max retries exceeded with url: /api/oauth2/token/ (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7fa4a4643950>: Failed to establish a new connection: [Errno 113] No route to host',))
It tries to connect to host='192.168.0.41', but this host has been removed and the NEW node no longer has this IP. It looks like the old endpoint is still recorded somewhere.
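(To check whether the framework model still references the old node, I suppose something like this would list the StorageRouter IPs it knows about; the import path is my assumption based on the framework layout:)
$> python -c "from ovs.dal.lists.storagerouterlist import StorageRouterList; print([sr.ip for sr in StorageRouterList.get_storagerouters()])"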
Hi @JeffreyDevloo
The attached logs are from the F version. You will find logs from 3 nodes: 192.168.2.31, 192.168.2.32, 192.168.2.33. The cluster was set up in this order.
Steps I did:
- shutdown 192.168.2.31 (first installed node)
- removed it; success
- reinstalled 192.168.2.31
- ovs setup
- assigned roles
- added ASD node
- initialized drives ; success
- claimed osds; spinning forever
To continue:
- claimed via the alba cli
- extended the vPool; it went away silently without returning success
My workaround:
- extended the vPool before claiming osds // but I didn't see the backend's NSM/ABM services show up on the new node (192.168.2.31) even after 2 hours
I accessed the web page on node 192.168.2.33, so the log of 192.168.2.33 contains the whole process described above.
So far this issue always occurs on the first installed node, without exception. One more thing worth mentioning: node 192.168.2.31 has been removed, but the vPool's detail page still shows this vPool's connection as 192.168.2.31:443. That info is wrong, isn't it?
Hi @toolslive
What are the alba commands for the reverse operations? I mistakenly pulled a hard drive that was running as an ASD, and its data has been wiped. This caused the framework-alba-plugin UI to hang. I think I have to remove this ASD from the backend. Should I use alba asd-delete first and then purge-osd? What does the key required by asd-delete look like?
Best regards, Yongsheng