DAOS-13292 control: Use cart API to detect fabric

Open kjacque opened this issue 1 year ago • 2 comments

  • Add a lib/hardware package to collect fabric interface information through the CART API.
  • Remove custom OFI and UCX packages and dependencies.
  • Update Go githook to ignore deleted files.

Required-githooks: true

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed, or there is a reason documented in the PR why it should be force-landed and the forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

kjacque avatar Mar 15 '24 00:03 kjacque

Ticket title is 'Update control plane fabric scans to use new mercury API'. Status is 'In Review'. https://daosio.atlassian.net/browse/DAOS-13292

github-actions[bot] avatar Mar 15 '24 00:03 github-actions[bot]

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13989/1/testReport/

daosbuild1 avatar Mar 15 '24 01:03 daosbuild1

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13989/2/testReport/

daosbuild1 avatar Mar 30 '24 00:03 daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13989/2/testReport/

daosbuild1 avatar Apr 02 '24 23:04 daosbuild1

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13989/2/execution/node/1545/log

daosbuild1 avatar Apr 02 '24 23:04 daosbuild1

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13989/4/execution/node/1540/log

daosbuild1 avatar Apr 05 '24 04:04 daosbuild1

Test failure is https://daosio.atlassian.net/browse/DAOS-15598

kjacque avatar Apr 08 '24 16:04 kjacque

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13989/7/execution/node/1451/log

daosbuild1 avatar Apr 19 '24 03:04 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13989/8/execution/node/1173/log

daosbuild1 avatar Apr 19 '24 22:04 daosbuild1

Test failure is DAOS-15686. I'm going to exclude the affected test and re-run.

kjacque avatar Apr 22 '24 18:04 kjacque

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13989/9/testReport/

daosbuild1 avatar Apr 23 '24 17:04 daosbuild1

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13989/10/execution/node/1198/log

daosbuild1 avatar Apr 25 '24 02:04 daosbuild1

Discussed at gatekeeping and with @kjacque. This change re-uses the same information that the engines are already using, so the landing risk is low in terms of breaking other systems/clusters. It is also the right way forward: we don't want to rely on the libfabric/UCX APIs and need to use the unified CART-level ones to retrieve this info.

frostedcmos avatar Apr 29 '24 16:04 frostedcmos