qa: test new client with old cluster
Fixes: https://tracker.ceph.com/issues/53573
Signed-off-by: Dhairya Parmar <[email protected]>
Contribution Guidelines
- To sign and title your commits, please refer to Submitting Patches to Ceph.
- If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
- Tracker (select at least one)
- [x] References tracker ticket
- [ ] Very recent bug; references commit where it was introduced
- [ ] New feature (ticket optional)
- [ ] Doc update (no ticket needed)
- [ ] Code cleanup (no ticket needed)
- Component impact
- [ ] Affects Dashboard, opened tracker ticket
- [ ] Affects Orchestrator, opened tracker ticket
- [x] No impact that needs to be tracked
- Documentation (select at least one)
- [ ] Updates relevant documentation
- [x] No doc update is appropriate
- Tests (select at least one)
- [x] Includes unit test(s)
- [ ] Includes integration test(s)
- [ ] Includes bug reproducer
- [ ] No tests
If we're testing a new client with an old cluster, why are we moving to install Pacific (which we're still doing point releases and backports on, so might miss actual protocol issues with) instead of Nautilus (which is a fixed quantity)?
I don't precisely remember, but I had a discussion with Venky on IRC or in a 1:1 where we decided to go with a Quincy client against a Pacific cluster. If Pacific can hide such things, I'll move back to Nautilus, but things aren't good with the Pacific cluster either: almost all the tests have been stuck for hours in teuthology http://pulpito.front.sepia.ceph.com/dparmar-2022-10-10_13:22:59-fs:upgrade-main-distro-default-smithi/
Okay, so there is definitely some issue with ceph-fuse:
- http://pulpito.front.sepia.ceph.com/dparmar-2022-10-11_17:01:35-fs:upgrade-main-distro-default-smithi/
- http://pulpito.front.sepia.ceph.com/dparmar-2022-10-11_17:48:59-fs:upgrade-main-distro-default-smithi/
yamls used in [1]:

```yaml
tasks:
- ceph-fuse: [client.0]
- workunit:
    clients:
      client.0:
        - suites/iozone.sh
```

and

```yaml
tasks:
- ceph-fuse: [client.0]
- workunit:
    clients:
      client.0:
        - suites/blogbench.sh
```
[1] never completed; I had to kill the run. The jobs run into timeout issues, and the hang happens at `mkdir -p -v /home/ubuntu/cephtest/mnt.0`.
When I looked into the machine, I could not list the cephtest dir and could not run `df -hT`. `ceph osd blocklist ls` shows one entry, which I suspect is the client. Doing `umount -fl /home/ubuntu/cephtest/mnt.0` and clearing the blocklist can un-hang the job, but ultimately the test would fail.
[2], which is all green, uses the exact same yaml with just `- ceph-fuse: [client.0]` removed; those yamls are https://github.com/ceph/ceph/pull/48280/commits/7c0d7a28de9f7f0f366a940ccd0af417d6181cd2#diff-a609885c2c09cab572df7a707110532bb78b235691d3c51a28db1128d5f3ae6a and https://github.com/ceph/ceph/pull/48280/commits/7c0d7a28de9f7f0f366a940ccd0af417d6181cd2#diff-f9147c01c698bb81576733e51a936f4ce490271b768450147043808b7baa78b1, and both jobs succeed without a single hiccup.
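With that line dropped, the fragment in [2] reduces to roughly the following (reconstructed from the description above rather than copied from the linked commits; the blogbench variant is analogous):

```yaml
tasks:
- workunit:
    clients:
      client.0:
        - suites/iozone.sh
```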
But I can also see this in the logs:

```
2022-10-11T18:06:16.269 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi088.stderr:ceph-fuse[39786]: starting ceph client
2022-10-11T18:07:44.542 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi057.stderr:ceph-fuse[39794]: starting ceph client
```
So these tests are also using ceph-fuse and PASS; the question is why there is no hang issue here, as I can see the dir at the mount point is created successfully:
```
2022-10-11T18:06:15.846 DEBUG:teuthology.orchestra.run.smithi088:> mkdir -p -v /home/ubuntu/cephtest/mnt.0
2022-10-11T18:06:15.867 INFO:teuthology.orchestra.run.smithi088.stdout:mkdir: created directory '/home/ubuntu/cephtest/mnt.0'
2022-10-11T18:07:44.117 DEBUG:teuthology.orchestra.run.smithi057:> mkdir -p -v /home/ubuntu/cephtest/mnt.0
2022-10-11T18:07:44.140 INFO:teuthology.orchestra.run.smithi057.stdout:mkdir: created directory '/home/ubuntu/cephtest/mnt.0'
```
I've tested this multiple times with different cluster setups, e.g. with the client on the same node as the other daemons and with the client on a different node, and both yield similar results, i.e. the test hangs doing `mkdir -p -v` at the mountpoint. Another failure that occurred intermittently was `Command failed on smithi203 with status 1: 'chmod 0000 /home/ubuntu/cephtest/mnt.1'`, which @rishabh-d-dave helped me fix by passing `omit_sudo=False` in `create_mntpt()` in `mount.py`.
> Okay, so there is definitely some issue with ceph-fuse:
> - http://pulpito.front.sepia.ceph.com/dparmar-2022-10-11_17:01:35-fs:upgrade-main-distro-default-smithi/
It seems you are doing the ceph-fuse mount twice:
```yaml
- ceph-fuse:
    client.0: null
- print: '**** done remount client'
- ceph-fuse:
  - client.0
```
I am not sure this will work for the netns; I never tested this.
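A minimal sketch of the de-duplicated sequence, assuming the existing remount step is kept and the second `ceph-fuse` task is simply dropped from the workload fragment, would look like:

```yaml
tasks:
- ceph-fuse:
    client.0: null           # single mount only
- print: '**** done remount client'
- workunit:                  # the workload fragment no longer adds its own ceph-fuse task
    clients:
      client.0:
        - suites/iozone.sh
```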
Thanks!
As we discussed, please just append the iozone, dbench, kernel untar and blogbench tests after the newops test for from_nautilus/, and add a new directory from_pacific/ for a Pacific cluster, etc.
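For illustration only, the suggested layout could look roughly like this (directory and file names below are a hypothetical sketch of the suggestion, not the final structure):

```
qa/suites/fs/upgrade/upgraded_client/
├── from_nautilus/   # existing newops test, with iozone, dbench, kernel untar, blogbench appended
└── from_pacific/    # new directory for the Pacific cluster variant
```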
I've re-structured upgraded_client and added tests. Also addressed all the failures and issues that prevented tests from running successfully.
I was experiencing failures like `Command failed on smithi157 with status 1: 'chmod 0000 /home/ubuntu/cephtest/mnt.1'` in teuthology, therefore I added https://github.com/ceph/ceph/pull/48280/commits/5105d1b8fc1d712ae1438724e6529d17ea299034 to address it. Should I move this to a completely separate PR, or is it okay to ship it with this PR? Asking because qa/tasks/cephfs/mount.py is widely used, so it can cause similar failures in the future too. Any suggestions/thoughts?
http://pulpito.front.sepia.ceph.com/dparmar-2022-10-12_10:31:59-fs:upgrade-main-distro-default-smithi/ - run with the latest patch. Ignore 7063339 and 7063346; they are from fs:upgrade/featureful_client/upgraded_client and not fs:upgrade/upgraded_client (a mistake with --filter in the teuthology-suite cmd).
All tests on Pacific passed, while the Nautilus tests failed with

```
teuthology.exceptions.CommandFailedError: Command failed on smithi017 with status 22: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd blocklist ls'
```

which I think @lxbsz is already working on?
The PR https://github.com/ceph/ceph/pull/48182 already merged!
Thanks @lxbsz! It should not be an issue then. As discussed, this is still pending backports, right? We need this to land in the nautilus branch, else it's not going to pass anyway.
Removed the commit addressing the sudo/su issue and created a separate PR: https://github.com/ceph/ceph/pull/48476
This issue was never reported previously in any runs since the introduction of the commit https://github.com/ceph/ceph/commit/60d5d7cf9cac250905ec4cfaee0c8d508f98ed3e. It occurred when I was running https://github.com/ceph/ceph/pull/48280 in teuthology, where I found my yamls were mounting ceph-fuse twice, and that led to almost all the failures. I feel this should not occur with correct yamls, so now that the correct yamls have been pushed to that PR, and in order to confirm this assumption, I ran all the tests again and it went as expected: https://pulpito.ceph.com/dparmar-2022-10-14_11:09:18-fs:upgrade-main-distro-default-smithi/.
jenkins test api
All tests pass after rebase (with PR https://github.com/ceph/ceph/pull/48182 in my branch).
http://pulpito.front.sepia.ceph.com/dparmar-2022-10-15_11:27:50-fs:upgrade-main-distro-default-smithi/
I'll rename them
Maybe we should just link the files, keeping the old names as they are, as we do in most other test cases; for example, just do:
.qa/suites/fs/workload/tasks/5-workunit/iozone.yaml --> fs/upgrade/upgraded_client/tasks/2-workload/stress_tests/iozone.yaml
But it's trivial.
done
changes made: https://github.com/ceph/ceph/compare/8df3225a2dbbe7ad482f8847891dff1ad352d64d..179e4bcae9bc7b6b0c049120ed5795c143180e0a