jScope freezes during loading with a "broken pipe" error
Affiliation LLNL / DIII-D (submitted by @mwinkel-dev of MIT PSFC on behalf of Brian V. of LLNL)
Version(s) Affected Client MDSplus: TBD Server MDSplus: TBD
Platform Client: GA's Iris cluster, CentOS 6.10 (Final) Server: GA's Atlas cluster, TBD
Describe the bug Intermittent socket failures when using jScope to display DIII-D data. Causes jScope to freeze when loading / displaying data.
To Reproduce This description is from the email of 8-Feb-2024 that reported the issue.
Here are four screenshots that explain what happens.
- Typical jScope display showing many signals. This is working correctly.
- Many of the signals fail to load. jScope loads signals top to bottom, left to right, so in this case it failed to load ‘i_boot’ from the automatic onetwo tree. After that failure, none of the other signals load.
  Notes: The failure doesn’t always occur on the same signal. Often it fails on Ip. I don’t know why it failed on this shot. Anecdotally, some shots seem to fail more often than others. This morning it took longer than usual to reproduce this error.
- If I try any other shot in the same jScope session, I get this error. The only thing I can do is restart jScope.
- The error indicates a 'socket Exception: broken pipe'.
Expected behavior All signals should load and display in jScope.
Screenshots These screenshots are from the 8-Feb-2024 email (and presented in the same order).
Additional context n/a
This is likely a network issue regarding the mdsip protocol. The error message probably indicates that the mdsip socket for the jScope connection is being killed for some reason.
We will attempt to reproduce the issue using GA's computers. And then using MIT PSFC's computers.
It is likely that eventually the troubleshooting will require the assistance of GA's networking specialists.
Brian reported (via email) that this problem started ~4 months ago. Prior to that he used jScope for years without any issues.
Client computer details:
- iris cluster at DIII-D
- I don't know the version of Linux or MDSplus installed on the cluster
MDSplus Archive computer details:
- Host name = atlas.gat.com
- I don't know the version of Linux or MDSplus installed on the server
Network details:
- I'm physically at DIII-D running on the iris cluster through noMachine.
- I have the most problems with jScope, but I also use it the most. Programs like efitviewer and reviewplus have occasional data loading issues, too.
Hi @victorbs -- Thanks for the information. Much appreciated.
Hi @ModestMC -- Am hoping you can provide some additional context regarding this issue.
- Would appreciate it if you can provide the operating system and MDSplus version for GA's Iris and Atlas clusters.
- Also, are you aware of any networking issues at GA in the past ~4 months that would account for the intermittent mdsip socket errors that @victorbs has encountered?
- And have any other users at GA reported problems with mdsip sockets in recent months?
Thanks, -MarkW
Just for information, jScope has not been changed at least in the last year.
Hi there, just a hunch, but since it's a broken pipe: what protocol are you using to connect to the server (plain mdsip, via tunnel, ssh)? I am not familiar with noMachine. If I am not mistaken, you get a 'broken pipe' if you try to send something down a socket that no longer has a receiver. Can you check the data server's logs (though this may be an mdsip process spawned on your user machine) for any potential crashes of mdsip sessions? If the error is new but the software is old, there may be a memory-heavy process causing sporadic OutOfMemory issues.
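To illustrate the 'broken pipe' mechanism described above, here is a minimal sketch using plain Python sockets on the loopback interface. It does not use mdsip or jScope at all; the function name `provoke_write_error` is made up for this example, and the stand-in "server" simply accepts one connection and closes it, which is only an assumption about what a dying server-side session looks like to the client.

```python
import socket
import threading
import time


def provoke_write_error(max_sends: int = 50) -> str:
    """Write to a peer that has already closed; return the exception name."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))        # ephemeral port on loopback
    listener.listen(1)
    port = listener.getsockname()[1]

    def accept_and_drop() -> None:
        conn, _ = listener.accept()
        conn.close()                        # stand-in for a session that died

    worker = threading.Thread(target=accept_and_drop)
    worker.start()

    client = socket.create_connection(("127.0.0.1", port))
    worker.join()                           # the peer has now closed its end
    listener.close()
    try:
        for _ in range(max_sends):
            # The first write usually succeeds; a later one fails once the
            # peer's reset has been seen by the local TCP stack.
            client.sendall(b"x" * 1024)
            time.sleep(0.05)
    except ConnectionError as exc:          # BrokenPipeError, ConnectionResetError, ...
        return type(exc).__name__
    finally:
        client.close()
    return "no error"


if __name__ == "__main__":
    print(provoke_write_error())
```

The point of the sketch is that the client only learns about the dead peer when it next writes, which would be consistent with jScope failing partway through loading a set of signals rather than at connect time.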
Hi @ModestMC,
Would also appreciate it if you can check the mdsip logs on GA's Atlas server. Typically, the log files are in /var/log/mdsplus/mdsipd, and there can be clues in both the access and the errors files.
And as per @zack-vii's post above, would also be good to check the various system logs on the Atlas server for networking issues.
Thanks,
-MarkW
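For narrowing down a large log file, a rough sketch like the following may help. It assumes only that the mdsip logs are plain text with one entry per line; the actual log format is not known here, so the filter just keeps lines that contain every search term (for example a username, a client host name, or a date string). The function name `matching_entries` and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator, Sequence


def matching_entries(lines: Iterable[str], needles: Sequence[str]) -> Iterator[str]:
    """Yield the lines that contain every needle (case-insensitive)."""
    lowered = [needle.lower() for needle in needles]
    for line in lines:
        text = line.lower()
        if all(needle in text for needle in lowered):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical entries; real mdsip log lines will look different.
    sample = [
        "2024-02-13 14:01:07 connect user1@iris12",
        "2024-02-13 14:01:09 connect user2@omega03",
        "2024-02-13 14:03:44 disconnect user1@iris12",
    ]
    for entry in matching_entries(sample, ["user1", "iris"]):
        print(entry)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large log does not need to be read into memory first.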
Iris uses 6.1.84 as its default, Atlas was updated in November from 7.96.?? to 7.139.59 (also the OS went from RHEL6 to RHEL8). Without an exact timeframe of the change from Brian, it's hard to say whether this was or was not what gave rise to the issue. We will not be updating the version on Iris, so this might not be worth trying to reproduce.
Our recommendation is that @victorbs try using JScope on Omega (which also runs 7.139.59) to see if the bug persists. Many users here also use Reviewplus or OMFIT for visualization. As for the log files, we tried looking but there are too many entries to have any idea who is associated with what (see #2683).
That change in November seems consistent with when the problems started to occur. My understanding from an earlier email that Sterling sent out to all users is that there are also problems with data visualization using reviewplus. I am able to reproduce this error. @ModestMC, is it possible to look at the log files at the time the error occurs?
I will also test if similar issues occur on Omega.
Sincerely,
Brian Victor, Lawrence Livermore National Laboratory, 13-352, Phone: 858-455-3098
Hi @ModestMC and @victorbs,
Thanks to both of you for the additional information.
If the problem is reproducible on Omega, let me know. I will then see if I can reproduce the issue at MIT using MDSplus 7.139.59 and RHEL8.
@victorbs the simplest way for me to attempt (though I'm not optimistic) to reproduce your errors would be for you to give me a basic example that breaks, which I would then run at a time when Atlas usage is minimal (e.g. wee hours on a weekend or something) until I can see something interesting. Realistically, I think this is a good sign that we should find you a more stable long-term workflow.
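One possible shape for such a basic example is a small loop that repeats a data fetch and records each failure with a timestamp, so the failure times can later be lined up against server logs. The sketch below is generic: `probe` is a made-up helper that accepts any callable. The commented-out fetch assumes the MDSplus Python thin client (`MDSplus.Connection`) is installed on the client machine, and `TREE`, `SHOT` and `EXPR` are placeholders that would need to be filled in with a case known to fail.

```python
import time
from typing import Callable, Dict, List


def probe(fetch: Callable[[], object], attempts: int,
          pause_s: float = 0.0) -> List[Dict[str, object]]:
    """Call fetch() repeatedly, recording failures instead of stopping."""
    failures: List[Dict[str, object]] = []
    for attempt in range(1, attempts + 1):
        started = time.time()
        try:
            fetch()
        except Exception as exc:            # keep going so every failure is logged
            failures.append({
                "attempt": attempt,
                "when": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(started)),
                "elapsed_s": time.time() - started,
                "error": f"{type(exc).__name__}: {exc}",
            })
        time.sleep(pause_s)
    return failures


# Assumed usage with the MDSplus thin client (placeholders, not run here):
#
# from MDSplus import Connection
# conn = Connection("atlas.gat.com")
#
# def fetch():
#     conn.openTree(TREE, SHOT)
#     return conn.get(EXPR)
#
# for failure in probe(fetch, attempts=1000, pause_s=1.0):
#     print(failure)
```

Reusing one connection across attempts, as in the commented usage, is meant to mimic a long-lived jScope session; opening a fresh connection per attempt would be a different test and might not show the same failure.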
The reviewplus issues I'm recalling were the result of network changes which have since been patched, but Sterling would know better than I would. Definitely let me know what happens when you try using JScope on Omega, as it's a datapoint worth having. Feel welcome to email me from the original email thread if you'd like.
As for @mwinkel-dev, my hunch is that this is some kind of incompatibility between 6.x.x and 7.x.x in a manner like what @zack-vii described, specifically when the server is updated. If Brian has no issue with the same versions communicating (Omega <--> Atlas), I think this bug can be closed as a known version incompatibility.
I haven't had time to test jScope on Omega extensively, but in a couple days of use, I haven't had any problems with the data loading. I will continue to use jScope on Omega and will keep you posted if I begin to have any issues.
Hi @victorbs -- Thanks for the update. If jScope on Omega works well for you during the next two weeks or so, then let me know if this issue should be closed.
Hi @mwinkel-dev I'm starting to have similar problems using jScope on omega that I was having on iris. I get an error that 'the connection to atom.gat.com' was lost. After I get that error, signals will no longer load.
Hi @victorbs -- That is unfortunate news. But thanks for the update.
Hi @ModestMC -- What is the atom.gat.com server? And what version of MDSplus is it running? Could this be another cross-version incompatibility similar to your conjecture regarding Iris and Atlas?
https://github.com/MDSplus/mdsplus/issues/2704#issuecomment-1944980825
@mwinkel-dev: ATOM is a Linux server similar to Omega and is restricted to the team that operates D3D. It does not have an MDSplus server at all, only clients. Perhaps the use case is misunderstood?
The available versions are
- mdsplus/core/alpha-7.130.1
- mdsplus/core/alpha-7.139.39
- mdsplus/core/alpha-7.139.40
- mdsplus/core/alpha-7.139.59
Hi @margomw -- Thanks for explaining the purpose of the atom.gat.com server. That is useful to know.
Hi @sflanagan and @ModestMC -- Any idea why jScope on Omega would be connecting to Atom? For details, see the post from @victorbs .
https://github.com/MDSplus/mdsplus/issues/2704#issuecomment-2004560127
Note though that jScope (from Omega to Atlas) has apparently worked well for about a month.
I was mistaken above. I lost connection to 'atlas.gat.com' not 'atom.gat.com'. Sorry for the confusion. Here are screenshots of my connection and the error message.
Hi @victorbs -- Thanks for the clarification. According to a previous post, both Omega and Atlas are running MDSplus alpha-7.139.59. Therefore, I will see if I can reproduce the problem using that version of MDSplus for both client and server.
Hi. Is there any update on this?
There was a period about a month ago when I wasn't having data loading issues. In the last week or two, I have had more connection issues than usual.
Hi @victorbs,
Thanks for reminding us to look at this. (We've been swamped with tasks associated with the startup of DIII-D.)
Hi @sflanagan and @ModestMC,
Have there been any changes regarding Omega and/or GA's networking that would explain why jScope is freezing for @victorbs? When he switched to Omega (instead of Iris) the problem vanished for a month or so. Strikes me as odd that the problem has arisen again. (My guess is that we'll probably have to fix Issue #2683 to troubleshoot this jScope issue at GA.)