shell: Command hangs when `--conda` is used and the runtime has not been set up
Scope
Computers with /bin/bash version <4.3 (macOS ships with 3.2)
Current Behavior
I thought I had the Conda runtime set up on my computer so I used nextstrain shell --conda. It resulted in a process that hung indefinitely using 100% CPU even upon closing the terminal. I was able to terminate the process through kill -9 <PID> (SIGTERM) and Activity Monitor's "Force Quit" option.
Console output:
$ nextstrain check-setup
…
# conda is not supported
✔ yes: operating system is supported
✔ yes: Rosetta 2 is enabled
✔ yes: runtime data dir doesn't have spaces
✘ no: runtime appears set up
The Conda runtime appears supported but not yet set up.
Try running `nextstrain setup conda` first.
…
$ nextstrain shell --conda .
Entering the Nextstrain runtime (conda)
Run the command "exit" to leave the runtime.
^C^C^C^C^C^C
Closed that terminal and opened another. Here you can see it's been running for 4 minutes.
$ ps -ax | grep bash
37226 ?? 4:14.93 bash --rcfile /var/folders/ld/41vgpmnj5t7bgmj5b4cn8l9h0000gn/T/tmp6w1u7o58bashrc
Expected behavior
This is partly user error since --conda is useless without the runtime being set up. But the command should at least be easily terminable, or even better exit with a useful error saying that the Conda runtime has not been set up yet.
How to reproduce
Steps to reproduce the current behavior:
- Run
nextstrain shell --conda .in an environment without the Conda runtime available. - Try to quit.
Your environment: if running Nextstrain locally
Hardware: macOS 14.3.1 Apple M2 chip
$ nextstrain --version
nextstrain.cli 8.2.0
Possible solutions
- ✅ Error when runtime isn't available
- ⛔️ Adjust
HISTSIZEto work with bash <4.3
I'm curious why it just hangs 🤔
Tangentialy, it seems like the error message for nextstrain build is not super informative when the runtime hasn't been set up either:
$ nextstrain build --conda .
Error executing into ['snakemake', '--printshellcmds', '--cores=all']: [Errno 2] No such file or directory: b'/usr/bin/snakemake'
even better exit with a useful error saying that the Conda runtime has not been set up yet.
I think this we can accomplish this by checking runtime setup before trying to run anything.
diff --git a/nextstrain/cli/runner/__init__.py b/nextstrain/cli/runner/__init__.py
index 0aaefd0..6bc1388 100644
--- a/nextstrain/cli/runner/__init__.py
+++ b/nextstrain/cli/runner/__init__.py
@@ -14,7 +14,7 @@ from .. import config, env, hostenv
from ..argparse import DirectoryPath, SKIP_AUTO_DEFAULT_IN_HELP
from ..errors import UserError
from ..types import EllipsisType, Env, Options, RunnerModule
-from ..util import prose_list, runner_name, runner_module, runner_help, warn
+from ..util import prose_list, runner_name, runner_module, runner_help, warn, runner_tests_ok, print_runner_tests
from ..volume import NamedVolume
@@ -277,6 +277,13 @@ def run(opts: Options, working_volume: NamedVolume = None, extra_env: Env = {},
**dict(hostenv.forwarded_values()),
**extra_env }
+ tests = opts.__runner__.test_setup()
+ if not runner_tests_ok(tests):
+ print(f"The {runner_name(opts.__runner__)} is not supported.")
+ print()
+ print_runner_tests(tests)
+ return 1
+
return opts.__runner__.run(opts, argv, working_volume = working_volume, extra_env = extra_env, cpus = cpus, memory = memory)
I think this we can accomplish this by checking runtime setup before trying to run anything.
Sure, but probably not with that patch as-is? Runner tests sometimes still produce a bunch of junk output to stdout (should fix that!) and they're also sometimes slow (seconds instead of milliseconds). I'd suggest a more cursory check for "is it good?" instead as the guard, or perhaps running the full tests only on failure to actually run, as part of better error reporting.
The hanging is interesting. Perhaps something about macOS or something about your specific system?
If I try I get a shell, but of course no runtime is available (bad, but better than hang):
$ NEXTSTRAIN_HOME=/tmp/bogus nextstrain shell --conda /var/empty
Entering the Nextstrain runtime (conda)
Run the command "exit" to leave the runtime.
$ echo $PATH
/tmp/bogus/runtimes/conda/env/bin:/bin:/usr/bin
$ stat /tmp/bogus/runtimes/conda/env/bin
stat: cannot stat '/tmp/bogus/runtimes/conda/env/bin': No such file or directory
I'm on a different computer now (macOS 13.6 Apple M1 chip) and can't reproduce. I'll try again on my home computer which is where I observed this initially. Is there any extra logging that would be helpful here, maybe something around these lines?
https://github.com/nextstrain/cli/blob/25075faebad904ff3d9df9422abb16bedb662f80/nextstrain/cli/runner/conda.py#L388-L391
Well, that snippet of code is in nextstrain.cli.runner.conda.micromamba(), not .run(), but I think you probably meant something like:
https://github.com/nextstrain/cli/blob/25075faebad904ff3d9df9422abb16bedb662f80/nextstrain/cli/runner/conda.py#L170
https://github.com/nextstrain/cli/blob/25075faebad904ff3d9df9422abb16bedb662f80/nextstrain/cli/util.py#L322-L328
If the hang is somewhere in Python before the exec, then logging around it could be helpful. But I suspect it's after the exec and we're no longer in Python. What output did you see before it hung? Anything? That would be a useful clue.
Would also be useful to enable existing debug logging (which is minimal, but a little relevant here), e.g.
NEXTSTRAIN_DEBUG=1 nextstrain shell --conda .
Beyond that, what I'd reach for first is process inspection (from the outside). i.e. looking at the process tree, the output of lsof on the process(es), strace on the process(es) (well, dtruss on macOS I guess), etc.
Thanks, I'll poke around on my home computer sometime and report back.
Output is in the issue description – I had it collapsed but just edited so it's more visible.
Ah! Thanks. I clearly missed it! :upside_down_face:
So seeing bash --rcfile /var/folders/ld/41vgpm… in the process list means it exec'd out of Python and into bash. I wonder what the process was up to.
I was able to reproduce on my home computer, but only by removing ~/.nextstrain/runtimes/conda/. Setting NEXTSTRAIN_HOME=/tmp/bogus fails to reproduce. It's what I tried on my work computer, so I'm hopeful that I can go into work later and reproduce by removing the runtime directory, then we can investigate further.
Here's the debug output:
$ NEXTSTRAIN_DEBUG=1 nextstrain shell --conda /var/empty
DEBUG: Resolved /var/empty to /private/var/empty
DEBUG: Looking for nextstrain-pathogen.yaml as pathogen root dir marker
DEBUG: /private/var/empty/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/var/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/nextstrain-pathogen.yaml: does not exist
DEBUG: /nextstrain-pathogen.yaml: does not exist
DEBUG: Using /private/var/empty as build volume
DEBUG: Using /private/var/empty as working (build) volume
Entering the Nextstrain runtime (conda)
Run the command "exit" to leave the runtime.
DEBUG: exec_or_return(['bash', '--rcfile', '/var/folders/ld/41vgpmnj5t7bgmj5b4cn8l9h0000gn/T/tmp7ma5b0iabashrc'], {'NEXTSTRAIN_PS1': '\\[\\e[1;97m\\]\\[\\e[48;2;67;119;205m\\] \\[\\e[48;2;67;119;205m\\]N\\[\\e[48;2;80;151;186m\\]e\\[\\e[48;2;99;172;154m\\]x\\[\\e[48;2;124;184;121m\\]t\\[\\e[48;2;154;190;92m\\]s\\[\\e[48;2;185;188;74m\\]t\\[\\e[48;2;212;177;63m\\]r\\[\\e[48;2;228;153;56m\\]a\\[\\e[48;2;230;112;48m\\]i\\[\\e[48;2;222;60;38m\\]n\\[\\e[48;2;222;60;38m\\] \\[\\e[0m\\] \\w\\[\\e[38;2;255;0;255m\\] \\$ \\[\\e[0m\\]', 'NEXTSTRAIN_HISTFILE': '/Users/victor/.nextstrain/shell-history', 'PATH': '/Users/victor/.nextstrain/runtimes/conda/env/bin:/bin:/usr/bin', 'PYTHONHOME': None, 'PYTHONPATH': None, 'PYTHONUSERBASE': '/Users/victor/.nextstrain/runtimes/conda/python-user-base', 'PYTHONNOUSERSITE': '1'})
and the same with NEXTSTRAIN_HOME=/tmp/bogus which is successful:
$ NEXTSTRAIN_DEBUG=1 NEXTSTRAIN_HOME=/tmp/bogus nextstrain shell --conda /var/empty
DEBUG: Resolved /var/empty to /private/var/empty
DEBUG: Looking for nextstrain-pathogen.yaml as pathogen root dir marker
DEBUG: /private/var/empty/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/var/nextstrain-pathogen.yaml: does not exist
DEBUG: /private/nextstrain-pathogen.yaml: does not exist
DEBUG: /nextstrain-pathogen.yaml: does not exist
DEBUG: Using /private/var/empty as build volume
DEBUG: Using /private/var/empty as working (build) volume
Entering the Nextstrain runtime (conda)
Run the command "exit" to leave the runtime.
DEBUG: exec_or_return(['bash', '--rcfile', '/var/folders/ld/41vgpmnj5t7bgmj5b4cn8l9h0000gn/T/tmpudxmics7bashrc'], {'NEXTSTRAIN_PS1': '\\[\\e[1;97m\\]\\[\\e[48;2;67;119;205m\\] \\[\\e[48;2;67;119;205m\\]N\\[\\e[48;2;80;151;186m\\]e\\[\\e[48;2;99;172;154m\\]x\\[\\e[48;2;124;184;121m\\]t\\[\\e[48;2;154;190;92m\\]s\\[\\e[48;2;185;188;74m\\]t\\[\\e[48;2;212;177;63m\\]r\\[\\e[48;2;228;153;56m\\]a\\[\\e[48;2;230;112;48m\\]i\\[\\e[48;2;222;60;38m\\]n\\[\\e[48;2;222;60;38m\\] \\[\\e[0m\\] \\w\\[\\e[38;2;255;0;255m\\] \\$ \\[\\e[0m\\]', 'NEXTSTRAIN_HISTFILE': '/private/tmp/bogus/shell-history', 'PATH': '/tmp/bogus/runtimes/conda/env/bin:/bin:/usr/bin', 'PYTHONHOME': None, 'PYTHONPATH': None, 'PYTHONUSERBASE': '/tmp/bogus/runtimes/conda/python-user-base', 'PYTHONNOUSERSITE': '1'})
The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
Nextstrain /private/var/empty $
Can reproduce on my work computer and the blab Mac Mini, but not a GitHub Codespace.
- Work computer ~uses bash 5.2.15 installed via Homebrew~ uses bash 3.2.57 pre-installed on macOS 13.6.
- blab Mac Mini uses bash 3.2.57 pre-installed on macOS 12.6.8.
- GitHub Codespace uses bash 5.0.17 pre-installed on Ubuntu 20.04.
We had a debugging session this morning. Here's a reproducible example:
cat >bad-histfile <<~~
#anything here
ls
~~
cat >bad-rcfile <<~~
HISTSIZE=-1
~~
HISTFILE=bad-histfile /bin/bash --rcfile bad-rcfile
Scratch the comment that my work computer uses bash 5.2.15. It does in my normal startup shell (zsh) that adds /opt/homebrew/bin to PATH in .zshrc. In the exec, PATH is set to /Users/victor/.nextstrain/runtimes/conda/env/bin:/bin:/usr/bin which uses /bin/bash, version 3.2.57.
A negative value for HISTSIZE is only supported in bash 4.3 and above, though the behavior of bash 3.2 hanging at 100% CPU when given a negative value is still noteworthy and mentioned elsewhere.
Ah, really good to know that 5.2.15 isn't actually affected. I can repro the issue on Linux with Bash 3.2.57. This lets me strace it, which shows that immediately after reading the history file it enters an infinite loop of segfaults, each of which invokes its registered SIGSEGV handler.
We could adjust HISTSIZE to not trigger this on Bash 3.2.57, but our Conda runtime will have a newer Bash where it's not an issue, so the real fix here is still just that we should error out ourselves when the runtime isn't available.
The flip side to this where it "works" on newer versions of bash can also be considered a bug - the shell appears to enter a conda runtime that doesn't exist. This is also solved by erroring when the runtime isn't available.