
Hangs if connection lost after mounting

Open Soukyuu opened this issue 10 years ago • 23 comments

I'm running arch x64 with a few sshfs mounts.

  • if the server is offline when I start my PC, the mounts are not mounted -> Ok
  • if the server is online when I start my PC, the mounts are mounted -> Ok
  • if the server goes down while my PC has already mounted the shares, any program that accesses my home folder will freeze for 10-20 seconds per access attempt.

The shares are mounted with

[email protected]:/data/SomeShare           /home/myuser/SomeShare       fuse.sshfs      nofail,x-systemd.automount,idmap=user,_netdev,identityfile=/home/myuser/.ssh/id_rsa,allow_other,default_permissions,uid=1000,gid=1000,umask=0,reconnect,cache=no,kernel_cache,ciphers=arcfour,compression=no

See here for more info and stuff I've already tried. To me it seems like there is no handling of a connection being dead rather than a mounting issue per se.

Soukyuu avatar Jan 24 '16 10:01 Soukyuu

On Jan 24 2016, Ivan Pilipenko [email protected] wrote:

  • if the server goes down while my PC has already mounted the shares, any program that accesses my home folder will freeze for 10-20 seconds per access attempt.

What happens after the 10-20 seconds?

And what behavior did you expect to see? I think the only reasonable behavior would be to block indefinitely until the server is online again, or to make the mountpoint unavailable immediately (so that any request gives a "Transport endpoint not connected" error).

Best, Nikolaus

GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

         »Time flies like an arrow, fruit flies like a Banana.«

Nikratio avatar Jan 24 '16 20:01 Nikratio

It unfreezes, then freezes again. The whole PC is basically unusable. I expected it to fail, maybe after a configured timeout, so that the user could adjust the behavior to the speed of the server connection.

Soukyuu avatar Jan 25 '16 07:01 Soukyuu

On Jan 24 2016, Ivan Pilipenko [email protected] wrote:

It unfreezes, then freezes again.

What does that mean? Are you saying it makes progress, but very slowly?

I expected it to fail, maybe after a configured timeout

Patches are welcome :-).

Best, -Nikolaus

Nikratio avatar Jan 25 '16 18:01 Nikratio

What I mean is that a whole program becomes unusable. This applies to any program that happens to access the home folder (this is where my sshfs mounts are located) in any way, even bash autocompletion outside the home folder. The program then reacts after unfreezing, but only as long as you don't cause it to access the home folder again.

As a specific example, opening an archive with ark makes it not appear for about 10-20 seconds. It then appears and works fine, until you attempt to extract a file. The program freezes again for 10-20 seconds, then the file dialog is displayed.

Another example: audio players that have a media library, for example foobar2000 in wine, hang for 10-20 seconds upon startup, then proceed to hang every 10 seconds, with playback stopping, because they're trying to access the media files located on one of the mounts. Listening to local music is impossible because of that.

I wish I could provide a patch, but I have no idea about FUSE or file systems in general =/

Soukyuu avatar Jan 26 '16 20:01 Soukyuu

@Tharbad, are you sure you are commenting on the right bug? This is about sshfs. Samba or SMB has never been mentioned here.

Nikratio avatar Apr 18 '16 16:04 Nikratio

I get this same issue. I sshfs mount a remote directory (on an OSX server). When that server goes to sleep, basically anything using the filesystem with the mount will freeze. This seems to have gotten worse in the 16.04 Ubuntu release, whereas previously (12.04), it would at least time out after 30-40 seconds.

Is there some setting to make it time out immediately, or after just 1-2 seconds?

mortoray avatar May 30 '16 17:05 mortoray

I'm experiencing something similar that I haven't been able to debug. Basically my entire system just locks up. I'm not sure what is causing it, except that it seems to be related to sshfs, as it only happens when I'm running sshfs.

If the issue is related, I wouldn't mark this as an enhancement, more a bug. ATM I'm going to try different configuration options to see if one of them is causing it.

james-lawrence avatar Nov 01 '16 13:11 james-lawrence

Hi! I am not sure if this is the right forum or not, but I too am seeing similar issues related to "Transport endpoint not connected". I am seeing this error occur much more frequently on Ubuntu 16.04.

We are using sshfs (2.5-1ubuntu1) and autofs (5.1.1-1ubuntu3.1).

xpros avatar Jan 11 '18 21:01 xpros

Hi, I can confirm this issue and would be interested in a solution. Arch Linux, everything vanilla and current. sshfs connection on the local network. Server goes down --> every process accessing the filesystem freezes while (supposedly) waiting for a confirmation that the mount (which is unaffected by the read operation) is indeed unavailable.

jtiemer avatar Jan 27 '20 23:01 jtiemer

Based on the documentation on this topic, I tried running sshfs with -o reconnect -o ServerAliveInterval=5. In this configuration, I can reliably reproduce the following sequence:

  1. Attempt to read a file on the server (ok)
  2. Disconnect the server (client still has network access, but server won't respond)
  3. Immediately attempt to read file again; client is now frozen
  4. After 15-20 seconds, client gets IO error (ok; probably ssh disconnected automatically)
  5. Attempt to read file again; client is now frozen indefinitely (would prefer immediate IO error)
  6. Reconnect server; client still remains frozen indefinitely
  7. While first client is frozen, attempt to read from the server in a second client. Now sshfs reconnects and the second client gets a successful read (ok) and the first client gets an IO error

So it seems like just using ServerAliveInterval is not a complete solution here; preferably sshfs would immediately generate IO errors for any access attempt that is made while there is no established ssh connection.
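A rough way to reason about the 15-20 second delay in step 4: per the ssh_config documentation, ssh disconnects after roughly ServerAliveInterval × ServerAliveCountMax seconds without a reply, and ServerAliveCountMax defaults to 3. The sketch below only does that arithmetic. The function name is made up for illustration, and it says nothing about what sshfs itself does once ssh has exited.

```python
def keepalive_detection_seconds(server_alive_interval: int,
                                server_alive_count_max: int = 3) -> int:
    """Approximate seconds before ssh gives up on an unresponsive server.

    Hypothetical helper: it mirrors the ssh_config rule of thumb
    (interval * count), not any sshfs-specific behaviour.
    """
    if server_alive_interval <= 0:
        # ServerAliveInterval=0 means ssh sends no keepalives at all.
        raise ValueError("ServerAliveInterval must be positive")
    if server_alive_count_max <= 0:
        raise ValueError("ServerAliveCountMax must be positive")
    return server_alive_interval * server_alive_count_max


# ServerAliveInterval=5 with the default ServerAliveCountMax of 3
print(keepalive_detection_seconds(5))  # 15
```

With ServerAliveInterval=5 this gives about 15 seconds, which lines up with the IO error arriving after 15-20 seconds. It does not explain the indefinite freeze in steps 5 and 6.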

campagnola avatar Mar 10 '20 23:03 campagnola

Has this issue been consistently present for the last 4 1/2 years? I remember running into it a long time ago, and am still running into it now. I don't understand how people use sshfs with this issue. Is there a workaround that involves using sshfs differently so that the computer is useable even without the connection to the server?

Radvendii avatar Aug 20 '20 23:08 Radvendii

@Radvendii It appears to still be an issue, and I'm not aware of any workaround. Unfortunately, sshfs has been largely unmaintained for years, so I wouldn't count on it getting fixed any time soon. I'm also not aware of any comparable alternative projects for secure remote files over spotty connections. It looks like you just have to live with broken and buggy network drives on Linux, at least until a better alternative appears.

justinlovinger avatar Aug 21 '20 14:08 justinlovinger

Bumping, in an attempt to bring more attention to this.

I have the exact same issue. I'm using sshfs with OSX Fuse. Even my wireless network stalls until it finishes trying to reconnect to the offline host.

Hope this gets a fix at some point. Does anyone know of an alternative to sshfs and OSX Fuse for macOS?

ccostel avatar Sep 29 '20 19:09 ccostel

I currently use a script to work around this issue: https://askubuntu.com/a/1274431/83134

campagnola avatar Sep 29 '20 20:09 campagnola

The network is not currently down for me, I can ssh to the target server, and I have these options (among others) set:

reconnect,ConnectTimeout=10,ServerAliveInterval=10

Yet every process that touches the mount (including umount) goes into uninterruptible sleep and never comes back. The only solution (short of hard reboot) was sudo killall -9 sshfs followed by another umount.
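On Linux, a process in uninterruptible sleep shows state D in /proc/<pid>/stat. The following is a minimal sketch for listing such processes, in case it helps others confirm they are seeing the same symptom. The function names are hypothetical, it is Linux-only, and a process in D state is not by itself proof that sshfs is the cause.

```python
from __future__ import annotations

from pathlib import Path


def parse_stat_state(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split at the LAST ')' instead of on spaces.
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, command = head.partition("(")
    return int(pid_text), command, tail.split()[0]


def uninterruptible_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, command) for every process currently in state D."""
    root = Path(proc_root)
    if not root.is_dir():
        return []  # not Linux, or /proc is not mounted
    found = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, command, state = parse_stat_state((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file is unreadable
        if state == "D":
            found.append((pid, command))
    return sorted(found)


if __name__ == "__main__":
    for pid, command in uninterruptible_processes():
        print(pid, command)
```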

OrangeDog avatar Mar 29 '21 11:03 OrangeDog

In my case, sshfs outputs

Timeout, server 10.0.0.1 not responding.
remote host has disconnected

so it notices the disconnect, but then doesn't seem to do anything about it (that is, reconnect).

Maybe it tries, but cannot, because all kinds of operations (e.g. df) are stuck, and sshfs needs to do one of those?

If I strace the sshfs process in such a situation, it shows me that sshfs is stuck in the following syscalls:

# strace -fyp "$(pidof sshfs)"
strace: Process 1250 attached with 12 threads
[pid 23895] futex(0x7fabfc102fa8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 23892] read(3</dev/fuse>,  <unfinished ...>
[pid 23891] read(3</dev/fuse>,  <unfinished ...>
[pid 23888] futex(0x415bb0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 23887] read(3</dev/fuse>,  <unfinished ...>
[pid 23882] futex(0x415bb0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  5839] futex(0x415bb0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  1312] read(3</dev/fuse>,  <unfinished ...>
[pid  1311] read(3</dev/fuse>,  <unfinished ...>
[pid  1299] read(3</dev/fuse>,  <unfinished ...>
[pid  1257] futex(0x415bb0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  1250] futex(0x7ffdff62d8b0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
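For longer traces, it can help to tally which syscall each thread is blocked in. The sketch below parses strace lines in the `[pid N] name(` format shown above; the helper name and the shortened sample are illustrative only. As far as I understand it, threads sitting in read on /dev/fuse are usually just idle workers waiting for the next kernel request, so the futex waiters are probably the more interesting ones, but treat that as an assumption.

```python
import re
from collections import Counter

# Matches lines such as "[pid  1299] read(3</dev/fuse>,  <unfinished ...>"
STRACE_LINE = re.compile(r"^\[pid\s+(\d+)\]\s+(\w+)\(")


def blocked_syscalls(strace_output: str) -> Counter:
    """Count how many threads are blocked in each syscall."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = STRACE_LINE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


# Four lines taken from the trace above, shortened for the example.
SAMPLE = """\
[pid 23895] futex(0x7fabfc102fa8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 23892] read(3</dev/fuse>,  <unfinished ...>
[pid  1299] read(3</dev/fuse>,  <unfinished ...>
[pid  1257] futex(0x415bb0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
"""

if __name__ == "__main__":
    for name, count in sorted(blocked_syscalls(SAMPLE).items()):
        print(name, count)
```

Applied to the full twelve-thread trace, this should report six threads in futex and six in read.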

nh2 avatar Apr 05 '21 23:04 nh2

This has been an issue for a very long time, and it's not isolated to sshfs. I've fixed it MANY times by writing wrappers around most remote mounts, but this has never been the correct answer. I posted about this yesterday to systemd regarding a feature request whereby systemd/remote-fs.target could potentially also monitor remote fs response times, and give us a timeout option: when response time is > timeout, systemd would gracefully but immediately terminate the connection. This is roughly how I've solved the problem in the past 25 years, but just done through various bash scripts/wrappers.

Of course, the systemd devs' response is that the filesystem itself should be responsible for monitoring response time and automatically terminating the connection. I agree, but at the same time, this strategy would only work if ALL of the maintainers of each remote fs could agree on a single unified strategy. systemd thread here: https://github.com/systemd/systemd/issues/20148#issuecomment-875538262

SSHFS I believe actually used to have an option (or combination of options) which allowed this behaviour, but either I can't remember it, or it no longer exists.

As others have mentioned, if the remote endpoint disappears while sshfs is still connected, any process attempting to access the mount point, will simply hang indefinitely. This is an absurd problem imo and should be addressed by each of the offending fs's. Currently, if you're super sneaky about your mount options, you can get the system to remain relatively responsive, but in the case of SSHFS, there's still no good solution.

For example: [email protected]:/mnt/dhauz/bens /home/bens/remote/dhauz fuse.sshfs noauto,x-systemd.idle-timeout=5,x-systemd.mount-timeout=3,x-systemd.automount,x-systemd.TimeoutSec=1,transform_symlinks,_netdev,users,idmap=user,IdentityFile=/home/bens/.ssh/id_rsa,allow_other,ServerAliveInterval=1,ServerAliveCountMax=1,reconnect 0 0

With this configuration, if 192.168.1.13 becomes unresponsive, it will at least not hang up accessing processes indefinitely, but it also doesn't recognise that the mountpoint is dead. E.g., if you ls the directory, you'll receive a cached result. If you try to ls something further down the tree, you'll get an IO error, which is good, but not ideal. When the endpoint becomes available again, SSHFS will correctly reconnect, but I haven't been able to figure out WHEN it reconnects. It seems to be pretty arbitrary. Sometimes it'll be 1 second, sometimes 2-3 minutes.

What I'd LOVE to see, would be for all the remote fs's to agree to monitor themselves for response time, and give us a single unified option for a "hard_unmount_timeout", which is dependent on response time.
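To make the idea concrete, here is a minimal sketch of what such a timeout could look like as an external wrapper. Nothing here is an existing sshfs or systemd option: "hard_unmount_timeout" and all function names are hypothetical. The probe runs in a child process so the wrapper itself never touches the mount, and the unmount step uses `fusermount -u -z` (lazy unmount), which may be called `fusermount3` on systems with FUSE 3. Note that a stat of the mount root may be answered from cache, so a real probe would probably need to touch something deeper in the tree.

```python
import subprocess
from typing import List, Optional


def probe_exit_code(command: List[str], timeout_s: float) -> Optional[int]:
    """Run command in a child process; return its exit code, or None on timeout."""
    child = subprocess.Popen(command,
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    try:
        return child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        child.kill()
        try:
            # Try to reap the child, but never block on one that is stuck.
            child.wait(timeout=1)
        except subprocess.TimeoutExpired:
            pass
        return None


def mount_hangs(mountpoint: str, hard_unmount_timeout: float = 2.0) -> bool:
    """True if a stat of the mountpoint does not return within the timeout."""
    return probe_exit_code(["stat", "--", mountpoint], hard_unmount_timeout) is None


def enforce_timeout(mountpoint: str, hard_unmount_timeout: float = 2.0,
                    dry_run: bool = True) -> bool:
    """Lazily unmount the mountpoint if it hangs; return True if action was taken."""
    if not mount_hangs(mountpoint, hard_unmount_timeout):
        return False
    command = ["fusermount", "-u", "-z", mountpoint]
    if dry_run:
        print("would run:", " ".join(command))
    else:
        subprocess.run(command, check=False)
    return True
```

Calling `enforce_timeout("/home/bens/remote/dhauz")` from a timer would only print the unmount command; pass `dry_run=False` once the behaviour looks right.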

blistovmhz avatar Jul 07 '21 13:07 blistovmhz

I am using the flags from https://github.com/libfuse/sshfs/issues/3#issuecomment-809305107 but it is still dissatisfying:

  • Suspending the laptop by closing the lid is completely broken: Suspend will just not work while sshfs is running, so if I close the lid and put the machine in a bag without noticing that it's still running, it might actually catch fire :fire:. I'll certainly run out of battery without noticing.
  • sshfs will re-prompt the SSH agent for the passphrase upon reconnect, meaning I'll get random popups. I understand why this happens technically, but it is still bad UX; given that the sshfs process is long-lived, I should be able to "give it credentials" once, and not have to do that again every time it wants to do some background action (that is, reconnect).

nh2 avatar Jul 07 '21 14:07 nh2

  • sshfs will re-prompt the SSH agent for the passphrase upon reconnect, meaning I'll get random popups.

I've never seen that behaviour and I suspect your suspend issue is likely unrelated. I've just tested again right now, with no issue suspending. Are you positive this only occurs when sshfs is connected to a remote endpoint?

blistovmhz avatar Jul 07 '21 14:07 blistovmhz

I've never seen that behaviour and I suspect your suspend issue is likely unrelated. I've just tested again right now, with no issue suspending. Are you positive this only occurs when sshfs is connected to a remote endpoint?

@blistovmhz It depends what the settings of your ssh-agent are: Whether it's using a timeout.

I'm using gpg-agent as the SSH agent, which gives nice GUI popups when the SSH private key needs to be accessed. It makes sense to configure this with a timeout, e.g. 5 minutes, so that not just anybody or any new program can access your SSH key indefinitely without having to enter the passphrase.

The default ssh-agent / ssh-add does not have a timeout.

Currently, sshfs will ask for the passphrase when its reconnect functionality triggers; if that happens after the agent's timeout, you'll get a new popup. I'm arguing that that is problematic UX; it would make sense to give the sshfs process a permanent capability to access the key / request it from the agent without passphrase entering.

nh2 avatar Jul 07 '21 14:07 nh2

Well I just use authorised_keys in the first place, so the password/auth isn't an issue. Though I don't see what the auth has to do with your machine failing to suspend?

blistovmhz avatar Jul 07 '21 15:07 blistovmhz

@blistovmhz it's the passphrase for the key, not the remote password.

@nh2 if you don't want to be prompted for a passphrase then you have to remove the passphrase. If you don't want ssh-agent to time out then you have to remove the timeout. There's no per-process ssh-agent permanent auth system.

OrangeDog avatar Jul 07 '21 15:07 OrangeDog

There's no per-process ssh-agent permanent auth system.

@OrangeDog Yes, that's what I meant with

  • I understand why this happens technically, but it is still bad UX

nh2 avatar Jul 09 '21 17:07 nh2

@h4sh5 What's the resolution?

nh2 avatar Feb 27 '24 04:02 nh2

This issue seems to be caused by processes outside of sshfs' control (ssh and ssh agent)

if the problem is caused by re-auth with password after the ssh session is disconnected, using key authentication instead should mitigate the issue

h4sh5 avatar Mar 24 '24 20:03 h4sh5

@h4sh5, I don't think this issue is resolved. I observe this behaviour on sshfs 3.7.3 with reconnect and ServerAliveInterval.

One data point which might be useful for identifying the root cause: I just noticed that the underlying ssh process (not sshfs) was wedged, and it did not respond to SIGTERM; I had to SIGKILL it. When I did this the filesystem became unstuck without a restart of sshfs. I'm a bit surprised the ssh process wouldn't respond to SIGTERM. Has anyone else seen this? Any ideas why ssh itself would get stuck?

peterwaller-arm avatar Jul 02 '24 09:07 peterwaller-arm

This issue seems to be caused by processes outside of sshfs' control (ssh and ssh agent)

if the problem is caused by re-auth with password after the ssh session is disconnected, using key authentication instead should mitigate the issue

Can you explain how you came to that conclusion? I'm not doubting you, I'm just trying to learn the ins and outs of ssh better.

Even if that's the case, there's prior art of processes handling reauth with password after the ssh session is disconnected. VSCode's remote development extension works via SSH. If I suspend my laptop and reopen it, the ssh session is disconnected, and VSCode prompts me for my ssh password again. It then fails to reestablish the connection and lets me know in a dialog box that I need to reload the VSCode window. It's a different design philosophy (Unix vs Microsoft), but it's probably still possible for sshfs to handle this gracefully.

Jacob-Stevens-Haas avatar Jul 10 '24 17:07 Jacob-Stevens-Haas