ucc
TL/UCP: add special service worker
What
Enables the creation and use of a special UCP worker for service collectives.
This feature is configured through the environment variable `UCC_TL_UCP_SERVICE_TLS`:
- If `UCC_TL_UCP_SERVICE_TLS=""` (the default), the feature is disabled and the same worker is used for both regular and service collectives.
- If `UCC_TL_UCP_SERVICE_TLS != ""`, a special worker is created with its `UCX_TLS` configuration set to that string (see the UCX documentation). This worker is used for all service collectives. If the environment variable `UCC_TL_UCP_SERVICE_NET_DEVICES` is not the empty string, the `UCX_NET_DEVICES` field of the service worker is set to that string.
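The configuration rules above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `service_cfg_t`, `service_cfg_resolve` and `service_cfg_from_env` are hypothetical names, and the sketch only decides which values would be applied, without creating any UCX object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the resolved settings; not the struct from the PR. */
typedef struct service_cfg {
    bool        is_used;     /* true when a dedicated service worker is requested */
    const char *tls;         /* string to apply as UCX_TLS, or NULL */
    const char *net_devices; /* string to apply as UCX_NET_DEVICES, or NULL */
} service_cfg_t;

/* Apply the rules described above to two raw strings (NULL means unset). */
static service_cfg_t service_cfg_resolve(const char *tls,
                                         const char *net_devices)
{
    service_cfg_t cfg = { false, NULL, NULL };

    /* Unset or empty UCC_TL_UCP_SERVICE_TLS: feature disabled. */
    if (tls == NULL || strlen(tls) == 0) {
        return cfg;
    }
    cfg.is_used = true;
    cfg.tls     = tls;

    /* The device list is only forwarded when it is a non-empty string. */
    if (net_devices != NULL && strlen(net_devices) > 0) {
        cfg.net_devices = net_devices;
    }
    return cfg;
}

/* Convenience wrapper that reads the two variables from the environment. */
static service_cfg_t service_cfg_from_env(void)
{
    return service_cfg_resolve(getenv("UCC_TL_UCP_SERVICE_TLS"),
                               getenv("UCC_TL_UCP_SERVICE_NET_DEVICES"));
}
```

Keeping the decision in a function of two plain strings makes the default path (empty string, no extra worker) easy to check without touching the process environment.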
How?
- The special worker is created at context initialization. The context `struct ucc_tl_ucp_context` has been extended with an attribute `struct ucc_tl_ucp_service`, which stores all the relevant data for the special service worker. This attribute contains a bool flag `is_used`, which specifies whether the feature is enabled. If `is_used=0`, no worker is created, so the new feature adds no overhead to the default use case.
- At the task level, `struct ucc_coll_task` has been extended with the boolean flag `is_service`. This flag is then passed to the function `ucc_tl_ucp_connect_team_ep`, and so on, to specify which worker to use.
- `ucc_tl_ucp_get_context_attr` has been modified to include the address exchange for the service worker during wire-up.
- Context cleanup has been refactored to avoid code duplication.
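The worker selection described above might look roughly like the sketch below. All types and the `select_worker` helper are hypothetical stand-ins, not the actual UCC structures, and the fallback to the regular worker when the feature is disabled is an assumption based on the description of the default case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a UCP worker handle (hypothetical). */
typedef struct worker {
    int id;
} worker_t;

/* Mirrors the idea of the service attribute: a flag plus the extra worker. */
typedef struct service {
    bool     is_used; /* false: no service worker was created */
    worker_t worker;
} service_t;

/* Simplified context holding the regular worker and the service data. */
typedef struct context {
    worker_t  worker;
    service_t service;
} context_t;

/* Simplified task carrying only the flag relevant here. */
typedef struct task {
    bool is_service;
} task_t;

/* Pick the worker a task should use. Service tasks get the dedicated
 * worker only when the feature is enabled; everything else, including
 * service tasks in the default configuration, uses the regular worker. */
static worker_t *select_worker(context_t *ctx, const task_t *task)
{
    assert(ctx != NULL && task != NULL);
    if (task->is_service && ctx->service.is_used) {
        return &ctx->service.worker;
    }
    return &ctx->worker;
}
```

With `is_used` false, the function always returns the regular worker, which matches the claim that the default use case is unaffected.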
CC @kingchc
Ping @kingchc
The linter warnings seem to be relevant.
[1661245989.981908] [fv-az72-584:44720:561] flush.c:28 UCX ERROR req 0x62e000e26080: error during flush: Endpoint timeout, flush comp 0x62e000e26110 count reduced to 2
[1661245989.981912] [fv-az72-584:44720:561] flush.c:28 UCX ERROR req 0x62e000e26080: error during flush: Endpoint timeout, flush comp 0x62e000e26110 count reduced to 1
[1661245989.981916] [fv-az72-584:44720:561] flush.c:28 UCX ERROR req 0x62e000e26080: error during flush: Endpoint timeout, flush comp 0x62e000e26110 count reduced to 0
[1661245989.981922] [fv-az72-584:44720:561] tl_ucp_ep.c:116 TL_UCP ERROR error during ucp ep close, ep 0x7f356256f240, status Endpoint timeout
[1661245990.021879] [fv-az72-584:44720:562] tl_ucp_ep.c:116 TL_UCP ERROR error during ucp ep close, ep 0x7f356254f380, status Connection reset by remote peer
bot:retest
bot:retest