bulk requests stuck on STARTED state
Running dCache version 9.2.25 and Enstore.
When a recall fails due to a problem reading the tape (not due to a missing file, a checksum error, etc.), the bulk request remains STARTED, with the file in RUNNING state, and without any "rh" request on any pool.
Example:
[dccore12] (local) admin > \s bulk request ls 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
ID | ARRIVED | MODIFIED | OWNER | STATUS | UID
3182252 | 2025/02/04-10:40:14 | 2025/02/04-10:40:14 | 31101:1399 | STARTED | 2eda9e78-fbaa-4f5a-b285-787dc7f29bec

[dccore12] (local) admin > \s bulk request info 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
2eda9e78-fbaa-4f5a-b285-787dc7f29bec:
status: STARTED
arrived at: 2025-02-04 10:40:14.781
started at: 2025-02-04 10:40:14.793
last modified at: 2025-02-04 10:40:14.793
target prefix: /
targets:
CREATED | STARTED | COMPLETED | STATE | TARGET
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | ? | RUNNING | /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.808 | FAILED | /pnfs/pic.es/data/cms/store/test/loadtest/source/T1_ES_PIC_Tape/urandom.270MB.file0000 -- (ERROR: diskCacheV111.util.CacheException : File not on tape.)

[dccore12] (local) admin > \sn pnfsidof /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
0000AFE37C9682A641AE99358012553B0CE8
$ echo "\s dc* rh ls"| ssh -p 22224 dccore.pic.es|grep 0000AFE37C9682A641AE99358012553B0CE8
In this example after running \bulk request reset
Don't hesitate to request any information you need.
Cheers, Elena
Hi Elena.
Make sure you set:
rc onerror fail
rc set max retries 3
(max retries 3 basically means "a small number"; you do not want this number to be large)
As for the current request, I suggest cancelling it via the bulk admin API.
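Something along these lines should do it; a sketch only, using the same admin ssh endpoint as in your first mail, and assuming the bulk service exposes the usual cancel/clear sub-commands (check "\s bulk help" on your version first):

#!/bin/bash
# Sketch: cancel a stuck bulk request via the bulk admin interface, then clear it.
ADMIN="ssh -p 22224 dccore.pic.es"
REQID=2eda9e78-fbaa-4f5a-b285-787dc7f29bec   # the UID column from "request ls"

echo "\s bulk request cancel $REQID" | $ADMIN
echo "\s bulk request clear $REQID"  | $ADMIN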
Hi Dmitry, the parameters you suggested relate to the recall processing in PoolManager, but the stuck requests came in through bulk and have no entries in PoolManager; they seem to be managed directly by the bulk service, which sends the requests to the pools, bypassing PoolManager. Or maybe I'm wrong?
This is how it works:
bulk -> PinManager -> PoolManager -> pool
All staging requests in dCache are handled in PoolManager
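So for a stuck file you can follow that chain by hand; roughly like this (a sketch over the admin ssh interface, using the pnfsid and request UUID from your example, and the same commands used elsewhere in this thread):

#!/bin/bash
# Sketch: follow one stuck file down the chain bulk -> PinManager -> PoolManager -> pool.
ADMIN="ssh -p 22224 dccore.pic.es"
REQID=2eda9e78-fbaa-4f5a-b285-787dc7f29bec
PNFSID=0000AFE37C9682A641AE99358012553B0CE8

echo "\s bulk request info $REQID" | $ADMIN                  # bulk: is the target still RUNNING?
echo "\s PinManager ls $PNFSID"    | $ADMIN                  # PinManager: pin / staging request state
echo "\sp rc ls $PNFSID.*"         | $ADMIN                  # PoolManager: restore entry and retry counter (r=...)
echo "\s dc* rh ls"                | $ADMIN | grep $PNFSID   # pools: any active restore from tape?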
Hi,
Our pool manager configuration related to recalls is as follows:
--rc (Request container) ---
Retry Timeout : 900 seconds
Thread Controller : ***@***.***[Running, pool size = 65, active threads = 0, queued tasks = 0, completed tasks = 148951997]
Maximum Retries : 3
Pool Ping Timer : 600 seconds
On Error : fail
Allow p2p : on oncost=on fortransfer=on
Allow staging : on
Allow stage on cost : off
Restore Limit : unlimited
Restore Exceeded : 0
As far as I understand, when a restore attempt fails, the error should be reported to the corresponding bulk request, and the request should be marked as finished. However, instead of that, the pool manager or the bulk service is continuously launching recall requests to the pools.
This has become a problem because we have more than 3K targets pending restore that keep being retried and failing repeatedly.
For now, to reduce the load on Enstore, we’ve set the gets to zero for the ATLAS pools, which are the ones affected. However, we’d like to know how this retry number can be limited, or whether we should cancel the bulk requests or the affected targets.
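For reference, a rough way to count how many targets are still RUNNING across the started bulk requests, over the admin ssh interface (a sketch only, assuming the listing formats shown earlier in this thread):

#!/bin/bash
# Sketch: count RUNNING targets for every STARTED bulk request.
ADMIN="ssh -p 22224 dccore.pic.es"
for req in $(echo "\s bulk request ls" | $ADMIN | grep ' STARTED ' | awk '{print $NF}'); do
  n=$(echo "\s bulk request info $req" | $ADMIN | grep -c '| RUNNING |')
  echo "$req: $n RUNNING targets"
done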
Maybe if we set on error = suspend in the poolmanager, that will stop the retries?
Thank you in advance.
--
Elena Planas
PIC / Institut de Física d’Altes Energies (IFAE)
The Barcelona Institute of Science and Technology
Campus UAB, 08193 Bellaterra (Barcelona)
+34-93 164 16 55
Hi Elena,
These are the settings we have on our systems at Fermilab:
rc onerror fail
rc set max retries 10
rc set retry 900
rc set poolpingtimer 600
rc set max restore unlimited
rc set sameHostCopy besteffort
cm set magic on
And we do see the bulk request fail once the 10 retries are exhausted.
You can set
rc set max retries 1
Then you can manually fail one of the requests by:
rc failed <id> [<errornumber>] [<errorstring>]
And see what happens to the corresponding request in bulk.
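Putting those steps together, the test would look roughly like this (a sketch over the admin ssh interface; <id> and <uid> are placeholders for the restore id from rc ls and the bulk request UUID):

#!/bin/bash
# Sketch: cap retries at 1, fail one waiting restore by hand, then check the bulk request.
ADMIN="ssh -p 22224 dccore.pic.es"
echo "\sp rc set max retries 1"                 | $ADMIN
echo "\sp rc ls"                                | $ADMIN   # pick the id of one "Waiting for stage" entry
echo "\sp rc failed <id> 1 manual-test-failure" | $ADMIN
echo "\s bulk request info <uid>"               | $ADMIN   # did the corresponding target go to FAILED?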
One thing that may be creating problems here...
"..a restore attempt fails...:"
This means HSM script failed on pool, right? Number one thing to check - has the corresponding request on PoolManager request queue failed or it is still there "Staging"..
HSM script may behave differently (give different exit codes and dCache reacts differently based on these codes)
I have suspicion that PoolManager is not notified on fail.
Or are you saying the failed restore ends up w/ another restore, on another pool and so on?
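On the exit-code point: a minimal, hypothetical wrapper around the real HSM get script can make the failure mode visible in the pool log. The only firm assumption here is that 0 means success and any non-zero exit means failure; how the pool interprets specific non-zero codes depends on your dCache version and driver, and the wrapper path is a placeholder:

#!/bin/bash
# Hypothetical wrapper around the real Enstore HSM script: run it, log its exit code, pass it on.
REAL=/usr/local/bin/real-enstore-hsm.sh     # placeholder path to the real script
LOG=/var/log/dcache/hsm-wrapper.log

"$REAL" "$@"
rc=$?
echo "$(date '+%F %T') op=$1 pnfsid=$2 rc=$rc" >> "$LOG"
# 0 = success; any non-zero exit tells dCache the copy failed. Whether a given code is
# treated as fatal or retryable is version/driver dependent - verify for your setup.
exit $rc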
As you mentioned, the request fails in the pool, but it doesn’t appear when checking with rc ls. What we have observed is that multiple restore requests keep appearing in different pools for the same file, and they all fail because the tape is in a NOACCESS state.
At the moment, we have on error=suspend, since it’s the only way we can stop the infinite retries.
We’ve tested setting the parameters as you suggested and forced a restore with rc retry for one of the requests that appeared as Suspended. However, at the time we changed the definitions, new restore requests started appearing in multiple pools for the targets that were RUNNING in the bulk requests and were not Suspended in the poolmanager.
It's weird that poolmanager don't stop retrying when de max retries is reached.
We've had to revert to "on error:suspend" while the tape is migrated.
We've done a new test. Setting:
[dccore12] (local) admin > \sp rc set max retries 1
[dccore12] (local) admin > \sp rc onerror fail
onerror fail
The file we forced the retry for was "Waiting for stage" on one pool:
00002633B9A06D584080AF0F7C5021BD4C07@ugroup-world-net-*/* m=1 r=0 [dc035_2] [Waiting for stage: dc035_2 11.12 11:56:53] {0,}
And after it failed, it was requested from another pool:
00002633B9A06D584080AF0F7C5021BD4C07@ugroup-world-net-*/* m=1 r=1 [dc061_1] [Waiting for stage: dc061_1 11.12 12:45:07] {0,}
After the first retry it failed and there was no more "Waiting for stage", but the bulk request is still showing the file as RUNNING.
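A simple way to watch a single pnfsid hop between pools is to poll rc ls; a sketch, using the same admin ssh endpoint as before:

#!/bin/bash
# Sketch: poll PoolManager every 30s and print the current restore entry for one pnfsid.
ADMIN="ssh -p 22224 dccore.pic.es"
PNFSID=00002633B9A06D584080AF0F7C5021BD4C07
while true; do
  date
  echo "\sp rc ls $PNFSID.*" | $ADMIN
  sleep 30
done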
And we could see in the qos-activity log that there are lots of retries:
[root@dccore12 dcache]# grep 00002633B9A06D584080AF0F7C5021BD4C07 qos-activity.2025-11-111*
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:44:37 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc035_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:57:40 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:58:11 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:58:42 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:59:13 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:59:43 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:00:14 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:00:45 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:01:16 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:01:47 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:02:18 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:02:48 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:03:19 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:03:50 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:04:21 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:04:51 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:05:22 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:05:53 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:06:24 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:06:55 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:07:26 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:07:57 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:08:27 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:08:59 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:09:29 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:10:00 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:10:31 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:11:02 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:11:33 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:12:04 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:12:35 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
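A quick way to quantify those retries per pool from the same log (a sketch, assuming the file naming shown above):

# Count how many times each pool cleared the file, i.e. how many restore attempts were thrown away.
grep 00002633B9A06D584080AF0F7C5021BD4C07 qos-activity.2025-11-11* \
  | grep -o 'pool dc[0-9_]*' \
  | sort | uniq -c | sort -rn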
We've set "on error" back to suspend.
See below an example of the number of retries for Suspended recalls:
000051258FE57C544716B63FC662BE48648D@ugroup-world-net-*/* m=1 r=1999 [dc032_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:24] {0,}
00007DFFF23CC3B94711BD76ABAF32E84D07@ugroup-world-net-*/* m=1 r=1998 [dc070_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:37] {0,}
0000E2E58054DE1B46FD8840E7FA01911ED6@ugroup-world-net-*/* m=1 r=1897 [dc056_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:27] {0,}
0000F5F789E3345D462D8E851430709263BD@ugroup-world-net-*/* m=1 r=1859 [dc035_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:42] {0,}
0000B309669D63AB4E90864B3B6B531F356F@ugroup-world-net-*/* m=1 r=1818 [dc026_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:47] {0,}
000000854A57B3474236A3D282AE1C818F8C@ugroup-world-net-*/* m=1 r=1813 [dc041_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:47] {0,}
0000EDB9093583F047478A89F76B20153544@ugroup-world-net-*/* m=1 r=1813 [dc060_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:47] {0,}
000082517B73186846C5BFDF62B3FA3403E7@ugroup-world-net-*/* m=1 r=1931 [dc026_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:08] {0,}
00003332D2BB5CB048BAAADBC448FCB7B6E3@ugroup-world-net-*/* m=1 r=534 [dc048_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 13:50:50] {0,}
00003D481D25601444F7904D9C04F85F63B5@ugroup-world-net-*/* m=1 r=1792 [dc020_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:27] {0,}
0000BBC1DC348AD04775BED2A0D2E691DCF9@ugroup-world-net-*/* m=1 r=1838 [dc012_2] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:42] {0,}
0000D18D81C449814A6DB2A7B3242AA660BF@ugroup-world-net-*/* m=2 r=3 [dc011_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:32] {0,}
0000B4801A4CADF440EBA99C0645879FC2E1@ugroup-world-net-*/* m=1 r=3 [dc026_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:42] {0,}
0000DA88EA6577AA49F38B1407B8CF09FDEB@ugroup-world-net-*/* m=1 r=1918 [dc079_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:37] {0,}
0000070815C955B5421BADD6CF2164BCEB16@ugroup-world-net-*/* m=1 r=1864 [dc061_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:53] {0,}
0000BFA995CE05504CC9B87DD388184E3316@ugroup-world-net-*/* m=1 r=1849 [dc060_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:37] {0,}
000007E36790083940E692B9B60D88A04626@ugroup-world-net-*/* m=1 r=1782 [dc012_2] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:39] {0,}
00007DD536BF91F54F339BCE11391DC34E74@ugroup-world-net-*/* m=1 r=513 [dc042_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 13:50:24] {0,}
00007122A818145D48A9B4D6542CBF09C531@ugroup-world-net-*/* m=1 r=1731 [dc061_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:32] {0,}
00007961AB3E3B6E419EB52091B43FE892DB@ugroup-world-net-*/* m=1 r=493 [dc019_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 13:50:13] {0,}
00000341ACC2E8A54A5E8BA4546FA53EFA55@ugroup-world-net-*/* m=1 r=1898 [dc032_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:20] {0,}
0000652B2D08B98040DDB7026E1B3DD8A98A@ugroup-world-net-*/* m=1 r=1754 [dc077_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:00] {0,}
000052367426CFA14A10AD1E64D67C012344@ugroup-world-net-*/* m=1 r=1925 [dc011_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:08] {0,}
0000637A4B216D3344B6BD10335910508DE1@ugroup-world-net-*/* m=1 r=508 [dc019_2] [Suspended: HSM script failed (script reported: 32: ) 11.11 14:27:55] {0,}
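For the record, the retry counters above can be pulled out in one go; a sketch over the admin ssh interface:

# List suspended restores as "r=<retries> [<pool>]", sorted by retry count.
echo "\sp rc ls" | ssh -p 22224 dccore.pic.es \
  | grep Suspended \
  | grep -o 'r=[0-9]* \[dc[0-9_]*\]' \
  | sort -t= -k2 -rn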
While "on error" was set to fail, the number of retries kept growing.
Staging is not triggered by bulk; staging is triggered by PinManager (on request from bulk). I wonder if you could repeat that exercise (if you have a test system you can play with, that would be ideal):
When you see the request staging, what does
ls <pnfsid>
in PinManager show?
And when you fail the request in PoolManager, what does
ls <pnfsid>
show?
I will have a closer look later this week. I must say, it looks like PoolManager ignores the maximum number of retries set with
rc set max retries <maxNumberOfRetries>
I just did this exercise with:
rc set max retries 1
And:
[fndca] (PoolManager@dCacheDomain) enstore > rc ls 0000F46FB27E381247B1917EA301FF011442.*
0000F46FB27E381247B1917EA301FF011442@world-net-*/* m=1 r=11 [rw-protodune-stkendca2038-4] [Waiting for stage: rw-protodune-stkendca2038-4 11.12 08:11:19] {0,}
Note the r=11 retry count.
Question: why does the HSM keep failing?
If the tape is broken, then you can use the Enstore DB to extract the list of pnfsids and then set the corresponding locations as unavailable in Chimera. That will stop this.
I can provide a bash script for that.
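The shape of such a script would be something like the sketch below. The table and column names (Enstore's file/volume tables, Chimera's t_locationinfo with an istate flag), the DB hosts, and the volume label are all assumptions for illustration, so treat it as pseudocode and verify against your actual schemas before running anything:

#!/bin/bash
# HYPOTHETICAL sketch - verify every host, table and column name against your Enstore and Chimera setups.
VOLUME=VR1234        # placeholder: the NOACCESS tape label

# 1. Extract the pnfsids of all files on that volume from the Enstore DB (assumed schema).
psql -h enstoredb -U enstore_reader enstoredb -At -c "
  SELECT f.pnfs_id FROM file f JOIN volume v ON f.volume = v.id
  WHERE v.label = '${VOLUME}';" > pnfsids-${VOLUME}.txt

# 2. For each pnfsid, mark the tape location unavailable in Chimera (assumed istate flag, itype 0 = tape).
while read pnfsid; do
  psql -h chimeradb -U chimera chimera -c "
    UPDATE t_locationinfo SET istate = 0
    WHERE itype = 0 AND inumber = (SELECT inumber FROM t_inodes WHERE ipnfsid = '${pnfsid}');"
done < pnfsids-${VOLUME}.txt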
The tape is not completely broken; we're migrating it. Only a few files are marked as bad for the moment, but the migration is failing from time to time, setting the tape to NOACCESS. We expect to finish the migration soon.
Thanks for everything.