Infinite loop in dht when lookup fails with ENODATA
Description of problem: When two clients simultaneously create and unlink the same file in a loop (stress testing), the client doing the unlink hung and did not respond to CTRL-C. Investigation showed that when dht_lookup_cbk() failed with ENODATA (because the other client had created the file but had not yet set its gfid), it triggered a recursive loop of lookups on the client doing the unlink:
dht_lookup_cbk
  ──► dht_lookup_directory
  ──► dht_lookup_dir_cbk
  ──► dht_lookup_everywhere
  ──► dht_lookup_everywhere_done
  ──► dht_lookup_cbk (the cycle repeats)
The exact commands to reproduce the issue:
- mount -t glusterfs IP:volname /mnt/1
- mount -t glusterfs IP:volname /mnt/2
- On /mnt/1: while true; do touch f1; done
- On /mnt/2: while true; do rm -f f1; done
- Hit CTRL-C on both mounts. The loop on /mnt/1 exits, while the one on /mnt/2 hangs because it is stuck in an infinite lookup loop.
Terminal 2:
[root@host ~]# cd /mnt/2/
[root@host 2]# while true; do rm -f f1; done
^C <---------Hung
Terminal 1:
[root@host ~]# cd /mnt/1/
[root@host 1]# while true; do touch f1; done
touch: setting times of ‘f1’: Stale file handle
touch: setting times of ‘f1’: Stale file handle
touch: setting times of ‘f1’: Stale file handle
^C <--------Not hung, it exits.
[root@host 1]#
- The output of the gluster volume info command:
Volume Name: distvol
Type: Distribute
Volume ID: c2f2b9a6-ab33-4344-ae17-fe7c8c8288a0
Status: Started
Snapshot Count: 0
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: IP1:/brick1
Brick2: IP2:/brick2
Brick3: IP3:/brick3
Brick4: IP4:/brick4
Brick5: IP5:/brick5
Options Reconfigured:
cluster.lookup-optimize: on
diagnostics.client-log-level: INFO
features.read-only: off
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
storage.reserve: 42949672960
config.client-threads: 4
network.inode-lru-limit: 90000
features.ctime: off
auth.allow: *
diagnostics.client-sys-log-level: WARNING
diagnostics.brick-sys-log-level: WARNING
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on