Some subtree actions dispatched via dispatcher fail if the dispatch server is run with the "-c" option
**Affiliation** EPFL / Swiss Plasma Center

**Version(s) Affected / Platform(s)** RHEL9 - mdsplus-alpha 7.139-60, RHEL7 - mdsplus-stable 7.142.81

**Installation Method(s)** yum/dnf
**Describe the bug**
- We have a main tree, which has a few subtrees with actions.
- We run 1 dispatch server and several action servers, all with the `-c 9` (compression) argument.
- We observed that some actions failed when dispatched from the dispatch server to the action servers.
- With the help of @joshStillerman I tracked it down to the `-c 9` argument.
- For me, the issue is ONLY reproducible in the presence of multiple subtrees, and with the dispatch server running with the `-c` option.
**To Reproduce**
- On 2 separate terminals, start 2 servers (change the ports if these ones are not available):
```
mdsip -p 9999 -s -h ./mdsip.hosts -c 9  # Dispatch server
mdsip -p 9998 -s -h ./mdsip.hosts -c 9  # Action server
```
- Run the following Python program, which creates the sub-trees and the main tree, sets the tree paths everywhere (current shell and action servers), then dispatches the actions:
```python
#!/usr/bin/env python
# Make a test tree to test the dispatcher.
# Before running this script, start 2 servers on 2 terminals:
#   mdsip -p 9999 -s -h ./mdsip.hosts -c 9  # Dispatch server
#   mdsip -p 9998 -s -h ./mdsip.hosts -c 9  # Action server
import os
import time

import MDSplus

# Create tree path dir
user = os.environ['USER']
distest_path = "/tmp/{}/distest".format(user)
os.system('mkdir -p {}'.format(distest_path))
os.environ['distest_path'] = distest_path
os.environ['dissub1_path'] = distest_path
os.environ['dissub2_path'] = distest_path

# Set environment vars for tree paths in the servers
for port in ["9998", "9999"]:
    c = MDSplus.Connection('localhost:' + port)
    for t in ['distest', 'dissub1', 'dissub2']:
        c.get('setenv("{}_path=/tmp/{}/distest/")'.format(t, user))
    c.disconnect()

# Create subtrees
for treename in ['dissub1', 'dissub2']:
    t = MDSplus.Tree(treename, 1, 'NEW')
    template = 'Build_Action(Build_Dispatch(2, "{}", "INIT", 10, *), {}, *, *, *)'
    # Each action is a simple write expression with tree/action name
    for i in range(1, 4):
        nodename = 'act{:03d}'.format(i)
        node = t.addNode(nodename, usage="ACTION")
        server = "localhost:9998"
        function_expr = 'write(*,"Action of {}:{}")'.format(treename, nodename)
        expr = template.format(server, function_expr)
        node.record = t.tdiCompile(expr)
    t.write()
    t.close()

# Create main tree
t = MDSplus.Tree('distest', 1, 'NEW')
t.addNode('DISSUB1', usage='SUBTREE')
t.addNode('DISSUB2', usage='SUBTREE')
t.write()
t.close()

# Run actions
MDSplus.tcl('dispatch /command /server=localhost:9999 set tree distest /shot=1')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/build')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/phase init')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/close')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 close/shot=1 distest')
```
- Check the logs of the dispatch server for errors.
- Repeat the experiment removing the `-c 9` argument.
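The two configurations differ only in the compression argument; as a minimal sketch, the server command lines for the A/B comparison can be generated programmatically (the `mdsip_cmd` helper here is purely illustrative, not part of MDSplus):

```python
# Build the mdsip command lines for the A/B test described above.
# Helper name is hypothetical; the flags match the repro steps.

def mdsip_cmd(port, role, compression=None):
    """Return the mdsip command line for one server, as a string."""
    cmd = 'mdsip -p {} -s -h ./mdsip.hosts'.format(port)
    if compression is not None:
        cmd += ' -c {}'.format(compression)  # e.g. -c 9, the failing case
    return cmd + '  # ' + role

# Failing configuration: both servers started with -c 9
with_c = [mdsip_cmd(9999, 'Dispatch server', 9),
          mdsip_cmd(9998, 'Action server', 9)]

# Control configuration: same servers without -c
without_c = [mdsip_cmd(9999, 'Dispatch server'),
             mdsip_cmd(9998, 'Action server')]

print(with_c[0])  # mdsip -p 9999 -s -h ./mdsip.hosts -c 9  # Dispatch server
```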
**Log of dispatch server at localhost:9999** (this is from RHEL9, alpha build):
```
Mon Nov 18 23:26:40 2024 (4) (pid 1106189) Connection received from yildiz@localhost [127.0.0.1]
Connected: yildiz
dispatched client 127.0.0.1
Mon Nov 18 23:26:40 2024 (4) (pid 1106189) Connection disconnected from yildiz@localhost [127.0.0.1]
Mon Nov 18 23:26:40 2024 (4) (pid 1106189) Connection received from yildiz@localhost [127.0.0.1]
Connected: yildiz
dispatched client 127.0.0.1
I, 1106189:1106305, 1731968800.998966397: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:951 setup_client() setup connection 8 SvrJob(op=6, jobid=0, addr=127.0.0.1, port=8800)
Mon Nov 18 23:26:43 2024 (4) (pid 1106189) Connection disconnected from yildiz@localhost [127.0.0.1]
E, 1106189:1106305, 1731968804.784366888: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:860 open_socket() Cannot connect to 127.0.0.1:8800: Connection refused
I, 1106189:1106305, 1731968804.784389394: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:1008 send_reply() SvrJob(op=6, jobid=1, addr=127.0.0.1, port=8800) break connection
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB1:ACT001 to localhost:9998
W, 1106189:1106305, 1731968804.786785578: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:229 ServerSendMessage() no response from server
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT001 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT001 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB1:ACT002 to localhost:9998
W, 1106189:1106305, 1731968804.787646046: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:229 ServerSendMessage() no response from server
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT002 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT002 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB1:ACT003 to localhost:9998
W, 1106189:1106305, 1731968804.788400405: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:229 ServerSendMessage() no response from server
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT003 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Action .DISSUB1:ACT003 failed, %MDSPLUS-E-ERROR, Error
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB2:ACT001 to localhost:9998
Mon Nov 18 23:26:44 2024, localhost:9998 is beginning action .DISSUB2:ACT001
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB2:ACT002 to localhost:9998
Mon Nov 18 23:26:44 2024, Action .DISSUB2:ACT001 completed
Mon Nov 18 23:26:44 2024, localhost:9998 is beginning action .DISSUB2:ACT002
Mon Nov 18 23:26:44 2024, Dispatching node .DISSUB2:ACT003 to localhost:9998
Mon Nov 18 23:26:44 2024, Action .DISSUB2:ACT002 completed
Mon Nov 18 23:26:44 2024, localhost:9998 is beginning action .DISSUB2:ACT003
Mon Nov 18 23:26:44 2024, Action .DISSUB2:ACT003 completed
I, 1106189:1106305, 1731968804.791971854: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerQAction.c:951 setup_client() setup connection 17 SvrJob(op=6, jobid=2, addr=127.0.0.1, port=8800)
W, 1106189:1106311, 1731968804.792059552: /opt/jenkins/workspace/MDSplus_alpha/rhel9/servershr/ServerSendMessage.c:620 accept_client() Dropped connection from 127.0.0.1:43762
```
**Expected behavior** All actions are dispatched and complete successfully.
**Additional context** We do not require this bug to be fixed for our site; we don't need to use the `-c` option. I'm just reporting it for the sake of completeness.
I can reproduce this on the current stable without the `-c` on either of the servers. The test script works the first time you run it, and chokes the second time. The dispatcher segfaults when trying to look at the tree header of the parent tree the second time.
That's due to the missing `dispatch/close`. Adding the `dispatch/close` and `close` statements, you can run it many times.
I updated the ticket description with the correct clean-up actions.
Yes, thanks - however I think this is a bug in TreeOpen.
If I run your script twice, the dispatcher (port 9999) does in fact segfault. However, it segfaults regardless of the compression flag. What is happening is that the Python program that drives the whole thing needs to:
- tell the dispatcher to execute the command `dispatch/close`
- tell the dispatcher to execute the command `close`

Otherwise the edited tree, shot 1, is still on its stack, and TreeOpen is quite confused.
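The full per-shot command sequence, ending with that clean-up, can be sketched as a small helper. The function name is made up for illustration, but the tcl commands are the ones used throughout this thread:

```python
# Ordered dispatcher commands for one shot, ending with the clean-up
# (dispatch/close, then close) that avoids the stale-tree segfault.
# Helper name is hypothetical; the tcl commands come from this thread.

DISPATCHER = 'localhost:9999'

def shot_commands(shot, server=DISPATCHER):
    """Return the tcl commands to run, in order, for one shot."""
    prefix = 'dispatch /command /server={} '.format(server)
    return [prefix + 'set tree distest /shot={}'.format(shot),
            prefix + 'dispatch/build',
            prefix + 'dispatch/phase init',
            prefix + 'dispatch/close',  # drop the dispatch table
            prefix + 'close']           # close the tree on the dispatcher's stack

# e.g. feed each command to MDSplus.tcl(cmd) for shot 1, then shot 2
```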
Here is the new test program (I added a second shot, which had no impact):
```python
#!/usr/bin/env python
# Make a test tree to test the dispatcher.
# Before running this script, start 2 servers on 2 terminals:
#   mdsip -p 9999 -s -h ./mdsip.hosts -c 9  # Dispatch server
#   mdsip -p 9998 -s -h ./mdsip.hosts -c 9  # Action server
import os
import time

import MDSplus

# Create tree path dir
user = os.environ['USER']
distest_path = "/tmp/{}/distest".format(user)
os.system('mkdir -p {}'.format(distest_path))
os.environ['distest_path'] = distest_path
os.environ['dissub1_path'] = distest_path
os.environ['dissub2_path'] = distest_path

# Set environment vars for tree paths in the servers
for port in ["9998", "9999"]:
    c = MDSplus.Connection('localhost:' + port)
    for t in ['distest', 'dissub1', 'dissub2']:
        c.get('setenv("{}_path=/tmp/{}/distest/")'.format(t, user))
    c.disconnect()

# Create subtrees (model, shot -1)
for treename in ['dissub1', 'dissub2']:
    t = MDSplus.Tree(treename, -1, 'NEW')
    template = 'Build_Action(Build_Dispatch(2, "{}", "INIT", 10, *), {}, *, *, *)'
    # Each action is a simple write expression with tree/action name
    for i in range(1, 4):
        nodename = 'act{:03d}'.format(i)
        node = t.addNode(nodename, usage="ACTION")
        server = "localhost:9998"
        function_expr = 'write(*,"Action of {}:{}")'.format(treename, nodename)
        expr = template.format(server, function_expr)
        node.record = t.tdiCompile(expr)
    t.write()
    t.close()

# Create main tree (model, shot -1)
t = MDSplus.Tree('distest', -1, 'NEW')
t.addNode('DISSUB1', usage='SUBTREE')
t.addNode('DISSUB2', usage='SUBTREE')
t.write()
t.close()

# Run actions
MDSplus.tcl('set tree distest/shot=-1')
time.sleep(1)
MDSplus.tcl('create pulse 1')
MDSplus.tcl('dispatch /command /server=localhost:9999 set tree distest /shot=1')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/build')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/phase init')
time.sleep(1)
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/close')
MDSplus.tcl('dispatch /command /server=localhost:9999 close')
print('done with shot 1')

MDSplus.tcl('create pulse 2')
time.sleep(1)
print('open shot 2')
MDSplus.tcl('dispatch /command /server=localhost:9999 set tree distest /shot=2')
time.sleep(1)
print('dispatch build')
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/build')
time.sleep(1)
print('dispatch the phase')
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/phase init')
MDSplus.tcl('dispatch /command /server=localhost:9999 dispatch/close')
MDSplus.tcl('dispatch /command /server=localhost:9999 close')
```
I assume that what prompted this bug report originally was not a scenario where the pulse file was being edited between shots.
Also, I do not know if we can detect that this is happening to us. One possibility is to have a way to signal processes that the tree has been edited and needs to be reopened - kind of like the 'close_all_trees' event.
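One conceivable shape for such a signal, sketched with the MDSplus event mechanism (assuming the Python bindings' `MDSplus.Event.setevent`; the event-name convention and helper names here are entirely hypothetical):

```python
# Sketch of a "tree edited" notification, modeled on the existing
# event mechanism ('close_all_trees'-style). Names are hypothetical.

def tree_edited_event(treename):
    """Event name announcing that a tree's model was edited."""
    return '{}_tree_edited'.format(treename.lower())

def notify_tree_edited(treename):
    """Writer side: broadcast after an edit (assumes MDSplus.Event.setevent)."""
    import MDSplus  # imported here so the pure helper above has no dependency
    MDSplus.Event.setevent(tree_edited_event(treename))

# Reader side (sketch): a long-lived process such as the dispatcher would
# subscribe to tree_edited_event('distest') and reopen the tree when it fires.
```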