
microceph radosgw crashes after snap refresh

Open · wolsen opened this issue 2 years ago • 8 comments

The microceph radosgw services crash after a snap refresh with the following stack trace:

Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]: warning: unable to create /var/snap/microceph/483/runNo such file or directory
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]: terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:   what():  filesystem error: cannot set permissions: No such file or directory [/var/snap/microceph/483/run]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]: *** Caught signal (Aborted) **
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  in thread 7fdd0f663e40 thread_name:radosgw
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7fdd1377e520]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  2: pthread_kill()
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  3: raise()
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  4: abort()
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2bbe) [0x7fdd11eaebbe]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae24c) [0x7fdd11eba24c]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae2b7) [0x7fdd11eba2b7]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae518) [0x7fdd11eba518]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  9: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa702e) [0x7fdd11eb302e]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  10: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_>
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  11: (radosgw_Main(int, char const**)+0x213) [0x7fdd13f0de13]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  12: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fdd13765d90]
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  13: __libc_start_main()
Jul 11 22:38:25 juju-ba47b1-default-0 microceph.rgw[328259]:  14: _start()

This is because the rgw config file embeds the snap revision from the $SNAP_DATA environment variable, which is used when rendering the run dir portion of the ceph mon config file. The rendered path is pinned to a specific revision (e.g. /var/snap/microceph/483/run), while the active revision moves on with each refresh (e.g. to /var/snap/microceph/509); after enough refreshes the pinned revision directory is rotated out and no longer exists.
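A quick way to spot the stale path on an affected node is to compare the rendered run dir against the active revision (an illustrative check assuming the standard snap layout, not a microceph command):

grep "run dir" /var/snap/microceph/current/conf/ceph.conf   # e.g. run dir = /var/snap/microceph/483/run
readlink /var/snap/microceph/current                        # prints the active revision, e.g. 509
ls -d /var/snap/microceph/483 2>/dev/null || echo "revision dir rotated out"

If the grep output names a revision other than the one current points at, the run dir in the config may no longer exist.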

wolsen avatar Jul 11 '23 23:07 wolsen

Thanks @wolsen for reporting.

I'm wondering if $SNAP_COMMON/conf wouldn't be a better location for this kind of data (i.e. ceph.conf, keyrings, metadata).
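For context on the suggestion (this is standard snap behaviour, not microceph-specific): $SNAP_DATA resolves to the revisioned data directory that rotates on every refresh, while $SNAP_COMMON survives refreshes:

ls -d /var/snap/microceph/*   # revision dirs (e.g. 483, 509), common, and the current symlink
# $SNAP_DATA   -> /var/snap/microceph/<revision>   (rotated on refresh)
# $SNAP_COMMON -> /var/snap/microceph/common       (stable across refreshes)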

sabaini avatar Jul 12 '23 16:07 sabaini

Hi @wolsen, can you share the snap versions you have been using?

I don't see this when refreshing from the default channel to quincy/edge; possibly it's been fixed(?)

Steps I took to reproduce:

Install microceph default

ubuntu@bn-0:~$ snap info microceph | grep installed
installed:          0+git.fdf6d5e            (338) 93MB -
ubuntu@bn-0:~$ sudo microceph status
MicroCeph deployment summary:
- bn-0 (10.0.8.27)
  Services: mds, mgr, mon, rgw, osd
  Disks: 1
- bn-1 (10.0.8.131)
  Services: mds, mgr, mon, osd
  Disks: 1
- bn-2 (10.0.8.50)
  Services: mds, mgr, mon, osd
  Disks: 1

Refresh to quincy/edge on all nodes

sudo snap refresh microceph --channel quincy/edge # 3 times

ubuntu@bn-0:~$ snap info microceph | grep installed
installed:          0+git.c31263e            (532) 94MB -

Check radosgw is present and accessible

ubuntu@bn-0:~$ ps auxw | grep rados
root        5826  0.5  1.3 6036408 88380 ?       Ssl  08:09   0:03 radosgw -f --cluster ceph --name client.radosgw.gateway -c /var/snap/microceph/532/conf/radosgw.conf

ubuntu@bn-0:~$ sudo ss -tnlp | grep rados
LISTEN 0      4096         0.0.0.0:80        0.0.0.0:*    users:(("radosgw",pid=5826,fd=51))       
LISTEN 0      4096            [::]:80           [::]:*    users:(("radosgw",pid=5826,fd=52))       

ubuntu@bn-0:~$ curl localhost
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult ...

It seems microceph correctly rendered radosgw.conf for both the previous and the current revision:

ubuntu@bn-0:~$ sudo find /var/snap/microceph -name radosgw.conf
/var/snap/microceph/532/conf/radosgw.conf
/var/snap/microceph/338/conf/radosgw.conf
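If in doubt, diffing the two rendered files should show whether they differ only in the revision paths (an illustrative command, not part of the original report):

sudo diff /var/snap/microceph/338/conf/radosgw.conf /var/snap/microceph/532/conf/radosgw.conf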

sabaini avatar Aug 03 '23 08:08 sabaini

I have a similar issue, however for me it affects the Ceph MON.

The snapd changes history has unfortunately been purged, but looking at the snap mount entries in the system log provides a nice history:

Dec 28 22:38:51 gandalf systemd[1]: Mounted snap-microceph-707.mount - Mount unit for microceph, revision 707.
Dec 29 06:59:02 gandalf systemd[1]: Mounted snap-microceph-817.mount - Mount unit for microceph, revision 817.
Feb 14 16:32:37 gandalf systemd[1]: Mounted snap-microceph-862.mount - Mount unit for microceph, revision 862.
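For reference, entries like the above can be pulled from the journal with something along these lines; the grep pattern is an assumption:

sudo journalctl --no-pager | grep -E 'Mounted snap-microceph-[0-9]+\.mount'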

The ceph.conf still points to 707:

# # Generated by MicroCeph, DO NOT EDIT.
[global]
run dir = /var/snap/microceph/707/run
fsid = e745a888-ef7a-4013-878f-43a03dd87399
mon host = 192.168.50.11,192.168.50.13,192.168.50.12
auth allow insecure global id reclaim = false
public addr = 192.168.50.11
ms bind ipv4 = true
ms bind ipv6 = false
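Cross-checking against the active revision makes the mismatch obvious (hypothetical commands for this host):

readlink /var/snap/microceph/current              # -> 862 after the last refresh
ls -d /var/snap/microceph/707 2>/dev/null || echo "revision 707 directory is gone"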

fnordahl avatar Mar 03 '24 10:03 fnordahl

@fnordahl we had some backward compat issues when upgrading from quincy to reef, cf. #318, where the regular refresh of the ceph.conf failed due to missing configuration -- this was fixed in #322. Does this look like it could be your case?

sabaini avatar Mar 04 '24 16:03 sabaini

Given that the revision number is 707 and there is a public addr entry in the ceph.conf file, I think this is the same issue.

UtkarshBhatthere avatar Mar 05 '24 06:03 UtkarshBhatthere

Seen this again, with the conf file pointing to the snap revision directory /var/snap/microceph/862.

This should be fairly easy to recreate; essentially it takes 2 refreshes of the snap. The first refresh brings in a new snap revision, and the second rotates out the original revision directory, which means the folder referenced in the config no longer exists and the service fails.

I assume those who follow the instructions to snap refresh --hold will not run into this until they have manually refreshed the snap a couple of times. The conf files should either move to a more permanent directory that isn't revisioned (e.g. $SNAP_COMMON) or use the current symlink in place of the specific revision.
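For reference, a minimal repro sketch along the lines described above (it assumes a newer revision is available in the channel for each refresh):

sudo snap refresh microceph   # refresh 1: new revision installed, old data dir kept as fallback
sudo snap refresh microceph   # refresh 2: the original revision's data dir is rotated out
grep "run dir" /var/snap/microceph/current/conf/ceph.conf
# if the path above still names the rotated-out revision, the daemons abort on start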

But to confirm, this is NOT yet fixed.

wolsen avatar Jul 10 '24 09:07 wolsen

Think this could be the same as #356, which was fixed in #357 (but was still present in revision 862).

For current releases I don't see mismatches between what run dir points to and the current symlink after 2 refreshes:

peter@pirx ~ » sudo snap install microceph --channel quincy/stable
microceph (quincy/stable) 0+git.4a608fc from Canonical✓ installed
peter@pirx ~ » sudo microceph cluster bootstrap                   
peter@pirx ~ » sudo microceph disk add loop,4G,3
peter@pirx ~ » ls -la /var/snap/microceph 
total 16
drwxr-xr-x  4 root root 4096 Jul 10 11:54 .
drwxr-xr-x 33 root root 4096 Jul 10 11:54 ..
drwxr-xr-x  4 root root 4096 Jul 10 11:55 793
drwxr-xr-x  5 root root 4096 Jul 10 11:54 common
lrwxrwxrwx  1 root root    3 Jul 10 11:54 current -> 793
peter@pirx ~ » sudo microceph.ceph -s
  cluster:
    id:     3522ee24-6287-4e9d-8019-7365a98f54e6
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum pirx (age 60s)
    mgr: pirx(active, since 54s)
    osd: 3 osds: 3 up (since 27s), 3 in (since 29s)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   27 MiB used, 12 GiB / 12 GiB avail
    pgs:     1 active+clean
 
peter@pirx ~ » sudo snap refresh microceph --channel reef/stable
2024-07-10T11:56:34+02:00 INFO Waiting for "snap.microceph.mds.service" to stop.
microceph (reef/stable) 18.2.0+snapcba31e8c75 from Canonical✓ refreshed
peter@pirx ~ » ls -la /var/snap/microceph                       
total 20
drwxr-xr-x  5 root root 4096 Jul 10 11:56 .
drwxr-xr-x 33 root root 4096 Jul 10 11:54 ..
drwxr-xr-x  4 root root 4096 Jul 10 11:55 793
drwxr-xr-x  4 root root 4096 Jul 10 11:55 999
drwxr-xr-x  5 root root 4096 Jul 10 11:54 common
lrwxrwxrwx  1 root root    3 Jul 10 11:56 current -> 999
peter@pirx ~ » sudo microceph.ceph -s                           
  cluster:
    id:     3522ee24-6287-4e9d-8019-7365a98f54e6
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum pirx (age 29s)
    mgr: pirx(active, since 24s)
    osd: 3 osds: 3 up (since 19s), 3 in (since 92s)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   83 MiB used, 12 GiB / 12 GiB avail
    pgs:     1 active+clean
 
  io:
    recovery: 0 B/s, 0 objects/s
 
peter@pirx ~ » sudo snap refresh microceph --channel reef/candidate
2024-07-10T11:57:32+02:00 INFO Waiting for "snap.microceph.mds.service" to stop.
microceph (reef/candidate) 18.2.0+snap6d794545e6 from Canonical✓ refreshed
peter@pirx ~ » ls -la /var/snap/microceph                          
total 20
drwxr-xr-x  5 root root 4096 Jul 10 11:57 .
drwxr-xr-x 33 root root 4096 Jul 10 11:54 ..
drwxr-xr-x  4 root root 4096 Jul 10 11:55 1052
drwxr-xr-x  4 root root 4096 Jul 10 11:55 999
drwxr-xr-x  5 root root 4096 Jul 10 11:54 common
lrwxrwxrwx  1 root root    4 Jul 10 11:57 current -> 1052
peter@pirx ~ » sudo microceph.ceph -s                              
  cluster:
    id:     3522ee24-6287-4e9d-8019-7365a98f54e6
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum pirx (age 16s)
    mgr: pirx(active, since 11s)
    osd: 3 osds: 3 up (since 7s), 3 in (since 2m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 577 KiB
    usage:   84 MiB used, 12 GiB / 12 GiB avail
    pgs:     1 active+clean
 
  io:
    client:   30 KiB/s rd, 0 B/s wr, 13 op/s rd, 2 op/s wr
    recovery: 0 B/s, 0 objects/s
 
peter@pirx ~ » grep "run dir" /var/snap/microceph/current/conf/ceph.conf 
run dir = /var/snap/microceph/1052/run

sabaini avatar Jul 10 '24 10:07 sabaini

~Ok, it seems after trying a few more times I could repro after all; smells like we're still racing~ This actually appears to be a separate issue specific to reef/candidate; I've filed a bug about this in #385

sabaini avatar Jul 10 '24 13:07 sabaini