
TL/UCP: add special service worker

Open samnordmann opened this issue 3 years ago • 3 comments

What

Enables the creation and usage of a special ucp worker for service collectives.

This feature is configured through the environment variable UCC_TL_UCP_SERVICE_TLS:

  • If UCC_TL_UCP_SERVICE_TLS="" (the default), the feature is disabled and the same worker is used for both regular collectives and service collectives.
  • If UCC_TL_UCP_SERVICE_TLS != "", a special worker is created with its UCX_TLS configuration set to that string (see the UCX documentation). This worker is used for all service collectives. If the environment variable UCC_TL_UCP_SERVICE_NET_DEVICES is also non-empty, the UCX_NET_DEVICES field of the service worker is set to that string.
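The two cases above can be sketched as a small decision function. This is only an illustration of the rules as described, not the code in this PR: the type service_cfg_t and the functions non_empty and service_cfg_init are hypothetical names, and the real implementation presumably goes through UCC's own configuration parser rather than raw strings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical holder for the service-worker settings (not the UCC type). */
typedef struct service_cfg {
    bool        is_used;     /* true when a dedicated service worker is requested */
    const char *tls;         /* string to apply as the service worker's UCX_TLS   */
    const char *net_devices; /* string for UCX_NET_DEVICES, or NULL if not set    */
} service_cfg_t;

/* Treat an unset variable (NULL) the same as the empty string. */
static const char *non_empty(const char *value)
{
    return (value != NULL && value[0] != '\0') ? value : NULL;
}

/*
 * tls         : value of UCC_TL_UCP_SERVICE_TLS (may be NULL or "")
 * net_devices : value of UCC_TL_UCP_SERVICE_NET_DEVICES (may be NULL or "")
 */
static void service_cfg_init(service_cfg_t *cfg, const char *tls,
                             const char *net_devices)
{
    assert(cfg != NULL);
    cfg->tls     = non_empty(tls);
    cfg->is_used = (cfg->tls != NULL);
    /* The device list only matters when the service worker exists. */
    cfg->net_devices = cfg->is_used ? non_empty(net_devices) : NULL;
}
```

A caller could feed it the raw environment, for example service_cfg_init(&cfg, getenv("UCC_TL_UCP_SERVICE_TLS"), getenv("UCC_TL_UCP_SERVICE_NET_DEVICES")), and create the extra worker only when cfg.is_used is true.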

How?

  • The special worker is created at context initialization. The context struct ucc_tl_ucp_context now has an attribute struct ucc_tl_ucp_service, which stores all the data relevant to the special service worker. This attribute contains a bool flag is_used that specifies whether the feature is enabled. If is_used=0, no worker is created, so the new feature adds no overhead to the default use case.
  • At the task level, struct ucc_coll_task now has a boolean flag is_service. This flag is passed to ucc_tl_ucp_connect_team_ep and related functions to specify which worker to use.
  • ucc_tl_ucp_get_context_attr has been modified to include the service worker's address in the address exchange during wireup.
  • Context cleanup has been refactored to avoid code duplication.
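The worker selection described above can be sketched roughly as follows. The types worker_t, service_t, context_t and task_t and the function select_worker are simplified, hypothetical stand-ins for ucc_tl_ucp_context, ucc_tl_ucp_service and ucc_coll_task; only the is_used and is_service flags come from the description, and the fallback to the regular worker when the feature is disabled is an assumption based on the "same worker is used" default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a UCP worker handle. */
typedef struct worker {
    const char *name;
} worker_t;

/* Data for the special service worker (cf. struct ucc_tl_ucp_service). */
typedef struct service {
    bool      is_used; /* false: feature disabled, no worker was created */
    worker_t *worker;  /* valid only when is_used is true */
} service_t;

/* Simplified context holding both workers (cf. ucc_tl_ucp_context). */
typedef struct context {
    worker_t *worker;  /* regular worker, always present */
    service_t service;
} context_t;

/* Simplified task carrying the new flag (cf. struct ucc_coll_task). */
typedef struct task {
    bool is_service;
} task_t;

/* Pick the worker a task should use. */
static worker_t *select_worker(const context_t *ctx, const task_t *task)
{
    assert(ctx != NULL && task != NULL);
    if (task->is_service && ctx->service.is_used) {
        return ctx->service.worker;
    }
    /* Regular collectives, or service collectives with the feature off. */
    return ctx->worker;
}
```

With the feature disabled the branch is never taken, which matches the claim that the default use case sees no extra worker and no added overhead beyond a flag check.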

samnordmann, Jul 06 '22 16:07

CC @kingchc

manjugv, Jul 20 '22 17:07

Ping @kingchc

manjugv, Aug 10 '22 17:08

The linter warnings seem to be relevant:

[1661245989.981908] [fv-az72-584:44720:561]           flush.c:28   UCX  ERROR req 0x62e000e26080: error during flush: Endpoint timeout, flush comp 0x62e000e26110 count reduced to 2
[1661245989.981912] [fv-az72-584:44720:561]           flush.c:28   UCX  ERROR req 0x62e000e26080: error during flush: Endpoint timeout, flush comp 0x62e000e26110 count reduced to 1
[1661245989.981916] [fv-az72-584:44720:561]           flush.c:28   UCX  ERROR req 0x62e000e26080: error during flush: Endpoint timeout, flush comp 0x62e000e26110 count reduced to 0
[1661245989.981922] [fv-az72-584:44720:561]       tl_ucp_ep.c:116  TL_UCP ERROR error during ucp ep close, ep 0x7f356256f240, status Endpoint timeout
[1661245990.021879] [fv-az72-584:44720:562]       tl_ucp_ep.c:116  TL_UCP ERROR error during ucp ep close, ep 0x7f356254f380, status Connection reset by remote peer

Sergei-Lebedev, Aug 26 '22 14:08

bot:retest

vspetrov, Oct 11 '22 10:10

bot:retest

vspetrov, Oct 12 '22 05:10