qa: test new client with old cluster
Fixes: https://tracker.ceph.com/issues/53573
Signed-off-by: Dhairya Parmar <[email protected]>
Contribution Guidelines
- To sign and title your commits, please refer to Submitting Patches to Ceph.
- If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
- Tracker (select at least one)
- [x] References tracker ticket
- [ ] Very recent bug; references commit where it was introduced
- [ ] New feature (ticket optional)
- [ ] Doc update (no ticket needed)
- [ ] Code cleanup (no ticket needed)
- Component impact
- [ ] Affects Dashboard, opened tracker ticket
- [ ] Affects Orchestrator, opened tracker ticket
- [x] No impact that needs to be tracked
- Documentation (select at least one)
- [ ] Updates relevant documentation
- [x] No doc update is appropriate
- Tests (select at least one)
- [x] Includes unit test(s)
- [ ] Includes integration test(s)
- [ ] Includes bug reproducer
- [ ] No tests
If we're testing a new client with an old cluster, why are we moving to install Pacific (which we're still doing point releases and backports on, so might miss actual protocol issues with) instead of Nautilus (which is a fixed quantity)?
I don't precisely remember, but I had a discussion with Venky on IRC or in a 1:1 where we decided to go with a Quincy client against a Pacific cluster. If Pacific can hide such things, I'll move back to Nautilus, but things aren't good with the Pacific cluster either: almost all the tests have been stuck for hours in teuthology http://pulpito.front.sepia.ceph.com/dparmar-2022-10-10_13:22:59-fs:upgrade-main-distro-default-smithi/
Okay, so there is definitely some issue with ceph-fuse:
- http://pulpito.front.sepia.ceph.com/dparmar-2022-10-11_17:01:35-fs:upgrade-main-distro-default-smithi/
- http://pulpito.front.sepia.ceph.com/dparmar-2022-10-11_17:48:59-fs:upgrade-main-distro-default-smithi/
yamls used in [1]:

```yaml
tasks:
- ceph-fuse: [client.0]
- workunit:
    clients:
      client.0:
        - suites/iozone.sh
```

and

```yaml
tasks:
- ceph-fuse: [client.0]
- workunit:
    clients:
      client.0:
        - suites/blogbench.sh
```
[1] never completed; I had to kill the run. The jobs run into timeout issues, and the hang happens at `mkdir -p -v /home/ubuntu/cephtest/mnt.0`.
When I looked into the machine, I could not list the cephtest dir and could not run `df -hT`. `ceph osd blocklist ls` shows one entry, which I suspect is the client. Doing `umount -fl /home/ubuntu/cephtest/mnt.0` and clearing the blocklist can un-hang the job, but ultimately the test would fail.
[2], which is all green, uses the exact same yaml with just `- ceph-fuse: [client.0]` removed; those yamls are https://github.com/ceph/ceph/pull/48280/commits/7c0d7a28de9f7f0f366a940ccd0af417d6181cd2#diff-a609885c2c09cab572df7a707110532bb78b235691d3c51a28db1128d5f3ae6a and https://github.com/ceph/ceph/pull/48280/commits/7c0d7a28de9f7f0f366a940ccd0af417d6181cd2#diff-f9147c01c698bb81576733e51a936f4ce490271b768450147043808b7baa78b1, and both jobs succeed without a single hiccup.
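With that line dropped, the fragment in [2] reduces to roughly the following (reconstructed from the description above rather than copied from the linked commits; the blogbench variant is analogous):

```yaml
tasks:
- workunit:
    clients:
      client.0:
        - suites/iozone.sh
```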
But I can also see this in the logs:

```
2022-10-11T18:06:16.269 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi088.stderr:ceph-fuse[39786]: starting ceph client
2022-10-11T18:07:44.542 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi057.stderr:ceph-fuse[39794]: starting ceph client
```
So these tests are also using ceph-fuse and PASS; the question is why there is no hang issue here, as I can see the dir at the mount point is created successfully:
```
2022-10-11T18:06:15.846 DEBUG:teuthology.orchestra.run.smithi088:> mkdir -p -v /home/ubuntu/cephtest/mnt.0
2022-10-11T18:06:15.867 INFO:teuthology.orchestra.run.smithi088.stdout:mkdir: created directory '/home/ubuntu/cephtest/mnt.0'
2022-10-11T18:07:44.117 DEBUG:teuthology.orchestra.run.smithi057:> mkdir -p -v /home/ubuntu/cephtest/mnt.0
2022-10-11T18:07:44.140 INFO:teuthology.orchestra.run.smithi057.stdout:mkdir: created directory '/home/ubuntu/cephtest/mnt.0'
```
I've tested this multiple times with different cluster setups, e.g. with the client on the same node as the other daemons and with the client on a different node, and both yield similar results, i.e. the test hangs doing `mkdir -p -v` at the mountpoint. Another failure that occurred intermittently was `Command failed on smithi203 with status 1: 'chmod 0000 /home/ubuntu/cephtest/mnt.1'`, which @rishabh-d-dave helped me fix by passing `omit_sudo=False` in `create_mntpt()` in `mount.py`.
> Okay, so there is definitely some issue with ceph-fuse:
> - http://pulpito.front.sepia.ceph.com/dparmar-2022-10-11_17:01:35-fs:upgrade-main-distro-default-smithi/
It seems you are doing the ceph-fuse mount twice:
```yaml
- ceph-fuse:
    client.0: null
- print: '**** done remount client'
- ceph-fuse:
  - client.0
```
I am not sure this will work for the netns; I never tested this.
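A minimal sketch of the de-duplicated sequence, assuming the existing remount step is kept and the second `ceph-fuse` task is simply dropped from the workload fragment, would look like:

```yaml
tasks:
- ceph-fuse:
    client.0: null           # single mount only
- print: '**** done remount client'
- workunit:                  # the workload fragment no longer adds its own ceph-fuse task
    clients:
      client.0:
        - suites/iozone.sh
```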
Thanks!
As we discussed, please just append the iozone, dbench, kernel untar and blogbench tests after the newops test for from_nautilus/, and add a new directory from_pacific/ for a Pacific cluster, etc.
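For illustration only, the suggested layout could look roughly like this (directory and file names below are a hypothetical sketch of the suggestion, not the final structure):

```
qa/suites/fs/upgrade/upgraded_client/
├── from_nautilus/   # existing newops test, with iozone, dbench, kernel untar, blogbench appended
└── from_pacific/    # new directory for the Pacific cluster variant
```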
I've re-structured upgraded_client and added tests. Also addressed all the failures and issues that prevented tests from running successfully.
I was experiencing failures like `Command failed on smithi157 with status 1: 'chmod 0000 /home/ubuntu/cephtest/mnt.1'` in teuthology, therefore I added https://github.com/ceph/ceph/pull/48280/commits/5105d1b8fc1d712ae1438724e6529d17ea299034 to address it. Should I move this to a completely separate PR, or is it okay to ship it with this PR? Asking because qa/tasks/cephfs/mount.py is widely used, so it can cause similar failures in the future too. Any suggestions/thoughts?
http://pulpito.front.sepia.ceph.com/dparmar-2022-10-12_10:31:59-fs:upgrade-main-distro-default-smithi/ - run with the latest patch. Ignore 7063339 and 7063346; they are from fs:upgrade/featureful_client/upgraded_client and not fs:upgrade/upgraded_client (a mistake with --filter in the teuthology-suite cmd).
All tests on Pacific passed, while the Nautilus tests failed with

```
teuthology.exceptions.CommandFailedError: Command failed on smithi017 with status 22: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd blocklist ls'
```

which I think @lxbsz is already working on?
The PR https://github.com/ceph/ceph/pull/48182 already merged!
Thanks @lxbsz! It should not be an issue then. As discussed, this is still pending backports, right? We need this to land in the nautilus branch, else it's not going to pass anyway.
Removed the commit addressing the sudo/su issue and created a separate PR: https://github.com/ceph/ceph/pull/48476
This issue was never reported previously in any runs since the introduction of the commit https://github.com/ceph/ceph/commit/60d5d7cf9cac250905ec4cfaee0c8d508f98ed3e. It occurred when I was running https://github.com/ceph/ceph/pull/48280 in teuthology, where I found my yamls were mounting ceph-fuse twice, and that led to almost all the failures. I feel this should not occur with correct yamls, so now that the correct yamls have been pushed to that PR, and in order to confirm this assumption, I ran all the tests again and it went as expected: https://pulpito.ceph.com/dparmar-2022-10-14_11:09:18-fs:upgrade-main-distro-default-smithi/.
jenkins test api
All tests pass after rebase (with PR https://github.com/ceph/ceph/pull/48182 in my branch).
http://pulpito.front.sepia.ceph.com/dparmar-2022-10-15_11:27:50-fs:upgrade-main-distro-default-smithi/
I'll rename them
Maybe we should just link the files, keeping the old names as they are, as we do in most other test cases; for example, just do:
.qa/suites/fs/workload/tasks/5-workunit/iozone.yaml --> fs/upgrade/upgraded_client/tasks/2-workload/stress_tests/iozone.yaml
But it's trivial.
done
changes made: https://github.com/ceph/ceph/compare/8df3225a2dbbe7ad482f8847891dff1ad352d64d..179e4bcae9bc7b6b0c049120ed5795c143180e0a