bulk requests stuck on STARTED state
Running dCache version 9.2.25 and Enstore.
When a recall fails due to a problem reading the tape (not due to a missing file, a checksum error, etc.), the bulk request remains STARTED, with the file in RUNNING state, and without any "rh" request on any pool.
Example:
[dccore12] (local) admin > \s bulk request ls 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
ID | ARRIVED | MODIFIED | OWNER | STATUS | UID
3182252 | 2025/02/04-10:40:14 | 2025/02/04-10:40:14 | 31101:1399 | STARTED | 2eda9e78-fbaa-4f5a-b285-787dc7f29bec

[dccore12] (local) admin > \s bulk request info 2eda9e78-fbaa-4f5a-b285-787dc7f29bec
2eda9e78-fbaa-4f5a-b285-787dc7f29bec:
status: STARTED
arrived at: 2025-02-04 10:40:14.781
started at: 2025-02-04 10:40:14.793
last modified at: 2025-02-04 10:40:14.793
target prefix: /
targets:
CREATED | STARTED | COMPLETED | STATE | TARGET
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | ? | RUNNING | /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.782 | 2025-02-04 10:40:14.808 | FAILED | /pnfs/pic.es/data/cms/store/test/loadtest/source/T1_ES_PIC_Tape/urandom.270MB.file0000 -- (ERROR: diskCacheV111.util.CacheException : File not on tape.)

[dccore12] (local) admin > \sn pnfsidof /pnfs/pic.es/data/cms/store/data/Run2024G/ZeroBias/AOD/PromptReco-v1/000/384/202/00000/f58968b7-5890-4970-abcc-b5ace5d645e5.root
0000AFE37C9682A641AE99358012553B0CE8
$ echo "\s dc* rh ls"| ssh -p 22224 dccore.pic.es|grep 0000AFE37C9682A641AE99358012553B0CE8
In this example after running \bulk request reset
Don't hesitate to request any information you need.
Cheers, Elena
Hi Elena.
Make sure you set:
rc onerror fail
rc set max retries 3
(max retries 3 basically means "a small number"; you do not want this number to be large)
As for the current request, I suggest cancelling it via the bulk admin API.
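Something along these lines should do it; a sketch only, using the same admin ssh endpoint as in your first mail, and assuming the bulk service exposes the usual cancel/clear sub-commands (check "\s bulk help" on your version first):

#!/bin/bash
# Sketch: cancel a stuck bulk request via the bulk admin interface, then clear it.
ADMIN="ssh -p 22224 dccore.pic.es"
REQID=2eda9e78-fbaa-4f5a-b285-787dc7f29bec   # the UID column from "request ls"

echo "\s bulk request cancel $REQID" | $ADMIN
echo "\s bulk request clear $REQID"  | $ADMIN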
Hi Dmitry, the parameters you suggested relate to the recall processing in PoolManager, but the stuck requests came in through bulk and have no entries in PoolManager; they seem to be managed directly by the bulk service, which sends the requests to the pools, bypassing PoolManager. Or maybe I'm wrong?
This is how it works:
bulk -> PinManager -> PoolManager -> pool
All staging requests in dCache are handled in PoolManager
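So for a stuck file you can follow that chain by hand; roughly like this (a sketch over the admin ssh interface, using the pnfsid and request UUID from your example, and the same commands used elsewhere in this thread):

#!/bin/bash
# Sketch: follow one stuck file down the chain bulk -> PinManager -> PoolManager -> pool.
ADMIN="ssh -p 22224 dccore.pic.es"
REQID=2eda9e78-fbaa-4f5a-b285-787dc7f29bec
PNFSID=0000AFE37C9682A641AE99358012553B0CE8

echo "\s bulk request info $REQID" | $ADMIN                  # bulk: is the target still RUNNING?
echo "\s PinManager ls $PNFSID"    | $ADMIN                  # PinManager: pin / staging request state
echo "\sp rc ls $PNFSID.*"         | $ADMIN                  # PoolManager: restore entry and retry counter (r=...)
echo "\s dc* rh ls"                | $ADMIN | grep $PNFSID   # pools: any active restore from tape?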
Hi,
Our pool manager configuration related to recalls is as follows:
--rc (Request container) ---
Retry Timeout : 900 seconds
Thread Controller : ***@***.***[Running, pool size = 65, active threads = 0, queued tasks = 0, completed tasks = 148951997]
Maximum Retries : 3
Pool Ping Timer : 600 seconds
On Error : fail
Allow p2p : on oncost=on fortransfer=on
Allow staging : on
Allow stage on cost : off
Restore Limit : unlimited
Restore Exceeded : 0
As far as I understand, when a restore attempt fails, the error should be reported to the corresponding bulk request, and the request should be marked as finished. However, instead of that, the pool manager or the bulk service is continuously launching recall requests to the pools.
This has become a problem because we have more than 3K targets pending restore that keep being retried and failing repeatedly.
For now, to reduce the load on Enstore, we’ve set the gets to zero for the ATLAS pools, which are the ones affected. However, we’d like to know how this retry number can be limited, or whether we should cancel the bulk requests or the affected targets.
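For reference, a rough way to count how many targets are still RUNNING across the started bulk requests, over the admin ssh interface (a sketch only, assuming the listing formats shown earlier in this thread):

#!/bin/bash
# Sketch: count RUNNING targets for every STARTED bulk request.
ADMIN="ssh -p 22224 dccore.pic.es"
for req in $(echo "\s bulk request ls" | $ADMIN | grep ' STARTED ' | awk '{print $NF}'); do
  n=$(echo "\s bulk request info $req" | $ADMIN | grep -c '| RUNNING |')
  echo "$req: $n RUNNING targets"
done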
Maybe if we set on error = suspend in the poolmanager, that will stop the retries?
Thank you in advance.
--
Elena Planas
PIC / Institut de Física d’Altes Energies (IFAE)
The Barcelona Institute of Science and Technology
Campus UAB, 08193 Bellaterra (Barcelona)
+34-93 164 16 55
Hi Elena,
These are the settings we have on our systems at Fermilab:
rc onerror fail
rc set max retries 10
rc set retry 900
rc set poolpingtimer 600
rc set max restore unlimited
rc set sameHostCopy besteffort
cm set magic on
And we do see the bulk request fail once the 10 retries are exhausted.
You can set
rc set max retries 1
Then you can manually fail one of the requests by:
rc failed <id> [<errornumber>] [<errorstring>]
And see what happens to the corresponding request in bulk.
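Putting those steps together, the test would look roughly like this (a sketch over the admin ssh interface; <id> and <uid> are placeholders for the restore id from rc ls and the bulk request UUID):

#!/bin/bash
# Sketch: cap retries at 1, fail one waiting restore by hand, then check the bulk request.
ADMIN="ssh -p 22224 dccore.pic.es"
echo "\sp rc set max retries 1"                 | $ADMIN
echo "\sp rc ls"                                | $ADMIN   # pick the id of one "Waiting for stage" entry
echo "\sp rc failed <id> 1 manual-test-failure" | $ADMIN
echo "\s bulk request info <uid>"               | $ADMIN   # did the corresponding target go to FAILED?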
One thing that may be creating problems here...
"..a restore attempt fails...:"
This means HSM script failed on pool, right? Number one thing to check - has the corresponding request on PoolManager request queue failed or it is still there "Staging"..
HSM script may behave differently (give different exit codes and dCache reacts differently based on these codes)
I have suspicion that PoolManager is not notified on fail.
Or are you saying the failed restore ends up w/ another restore, on another pool and so on?
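On the exit-code point: a minimal, hypothetical wrapper around the real HSM get script can make the failure mode visible in the pool log. The only firm assumption here is that 0 means success and any non-zero exit means failure; how the pool interprets specific non-zero codes depends on your dCache version and driver, and the wrapper path is a placeholder:

#!/bin/bash
# Hypothetical wrapper around the real Enstore HSM script: run it, log its exit code, pass it on.
REAL=/usr/local/bin/real-enstore-hsm.sh     # placeholder path to the real script
LOG=/var/log/dcache/hsm-wrapper.log

"$REAL" "$@"
rc=$?
echo "$(date '+%F %T') op=$1 pnfsid=$2 rc=$rc" >> "$LOG"
# 0 = success; any non-zero exit tells dCache the copy failed. Whether a given code is
# treated as fatal or retryable is version/driver dependent - verify for your setup.
exit $rc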
As you mentioned, the request fails in the pool, but it doesn’t appear when checking with rc ls. What we have observed is that multiple restore requests keep appearing in different pools for the same file, and they all fail because the tape is in a NOACCESS state.
At the moment, we have on error=suspend, since it’s the only way we can stop the infinite retries.
We’ve tested setting the parameters as you suggested and forced a restore with rc retry for one of the requests that appeared as Suspended. However, at the time we changed the definitions, new restore requests started appearing in multiple pools for the targets that were RUNNING in the bulk requests and were not Suspended in the poolmanager.
It's weird that poolmanager don't stop retrying when de max retries is reached.
We've had to revert to "on error:suspend" while the tape is migrated.
We've done a new test. Setting:
[dccore12] (local) admin > \sp rc set max retries 1
[dccore12] (local) admin > \sp rc onerror fail
onerror fail
The file we forced the retry for was "Waiting for stage" on one pool:
00002633B9A06D584080AF0F7C5021BD4C07@ugroup-world-net-*/* m=1 r=0 [dc035_2] [Waiting for stage: dc035_2 11.12 11:56:53] {0,}
And after it failed, it was requested from another pool:
00002633B9A06D584080AF0F7C5021BD4C07@ugroup-world-net-*/* m=1 r=1 [dc061_1] [Waiting for stage: dc061_1 11.12 12:45:07] {0,}
After the first retry it failed and there was no more "Waiting for stage", but the bulk request is still showing the file as RUNNING.
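A simple way to watch a single pnfsid hop between pools is to poll rc ls; a sketch, using the same admin ssh endpoint as before:

#!/bin/bash
# Sketch: poll PoolManager every 30s and print the current restore entry for one pnfsid.
ADMIN="ssh -p 22224 dccore.pic.es"
PNFSID=00002633B9A06D584080AF0F7C5021BD4C07
while true; do
  date
  echo "\sp rc ls $PNFSID.*" | $ADMIN
  sleep 30
done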
And we could see in the qos-activity log that there are lots of retries:
[root@dccore12 dcache]# grep 00002633B9A06D584080AF0F7C5021BD4C07 qos-activity.2025-11-111*
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:44:37 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc035_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:57:40 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:58:11 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:58:42 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:59:13 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 12:59:43 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:00:14 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:00:45 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:01:16 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:01:47 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:02:18 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:02:48 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:03:19 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:03:50 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:04:21 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:04:51 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:05:22 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:05:53 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:06:24 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:06:55 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:07:26 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:07:57 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:08:27 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:08:59 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc033_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:09:29 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:10:00 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:10:31 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc078_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:11:02 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:11:33 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc019_2 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:12:04 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc079_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
qos-activity.2025-11-1116991717320308549.tmp:12 Nov 2025 13:12:35 (qos-engine) [PnfsManager PnfsClearCacheLocation 00002633B9A06D584080AF0F7C5021BD4C07] Received notice that pool dc061_1 cleared file 00002633B9A06D584080AF0F7C5021BD4C07.
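A quick way to quantify those retries per pool from the same log (a sketch, assuming the file naming shown above):

# Count how many times each pool cleared the file, i.e. how many restore attempts were thrown away.
grep 00002633B9A06D584080AF0F7C5021BD4C07 qos-activity.2025-11-11* \
  | grep -o 'pool dc[0-9_]*' \
  | sort | uniq -c | sort -rn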
We've set "on error" back to suspend.
See below an example of the number of retries for Suspended recalls:
000051258FE57C544716B63FC662BE48648D@ugroup-world-net-*/* m=1 r=1999 [dc032_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:24] {0,}
00007DFFF23CC3B94711BD76ABAF32E84D07@ugroup-world-net-*/* m=1 r=1998 [dc070_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:37] {0,}
0000E2E58054DE1B46FD8840E7FA01911ED6@ugroup-world-net-*/* m=1 r=1897 [dc056_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:27] {0,}
0000F5F789E3345D462D8E851430709263BD@ugroup-world-net-*/* m=1 r=1859 [dc035_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:42] {0,}
0000B309669D63AB4E90864B3B6B531F356F@ugroup-world-net-*/* m=1 r=1818 [dc026_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:47] {0,}
000000854A57B3474236A3D282AE1C818F8C@ugroup-world-net-*/* m=1 r=1813 [dc041_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:47] {0,}
0000EDB9093583F047478A89F76B20153544@ugroup-world-net-*/* m=1 r=1813 [dc060_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:47] {0,}
000082517B73186846C5BFDF62B3FA3403E7@ugroup-world-net-*/* m=1 r=1931 [dc026_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:08] {0,}
00003332D2BB5CB048BAAADBC448FCB7B6E3@ugroup-world-net-*/* m=1 r=534 [dc048_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 13:50:50] {0,}
00003D481D25601444F7904D9C04F85F63B5@ugroup-world-net-*/* m=1 r=1792 [dc020_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:27] {0,}
0000BBC1DC348AD04775BED2A0D2E691DCF9@ugroup-world-net-*/* m=1 r=1838 [dc012_2] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:42] {0,}
0000D18D81C449814A6DB2A7B3242AA660BF@ugroup-world-net-*/* m=2 r=3 [dc011_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:32] {0,}
0000B4801A4CADF440EBA99C0645879FC2E1@ugroup-world-net-*/* m=1 r=3 [dc026_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:42] {0,}
0000DA88EA6577AA49F38B1407B8CF09FDEB@ugroup-world-net-*/* m=1 r=1918 [dc079_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:37] {0,}
0000070815C955B5421BADD6CF2164BCEB16@ugroup-world-net-*/* m=1 r=1864 [dc061_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:53] {0,}
0000BFA995CE05504CC9B87DD388184E3316@ugroup-world-net-*/* m=1 r=1849 [dc060_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:37] {0,}
000007E36790083940E692B9B60D88A04626@ugroup-world-net-*/* m=1 r=1782 [dc012_2] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:00:39] {0,}
00007DD536BF91F54F339BCE11391DC34E74@ugroup-world-net-*/* m=1 r=513 [dc042_1] [Suspended: HSM script failed (script reported: 1: ) 11.11 13:50:24] {0,}
00007122A818145D48A9B4D6542CBF09C531@ugroup-world-net-*/* m=1 r=1731 [dc061_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:00:32] {0,}
00007961AB3E3B6E419EB52091B43FE892DB@ugroup-world-net-*/* m=1 r=493 [dc019_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 13:50:13] {0,}
00000341ACC2E8A54A5E8BA4546FA53EFA55@ugroup-world-net-*/* m=1 r=1898 [dc032_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:20] {0,}
0000652B2D08B98040DDB7026E1B3DD8A98A@ugroup-world-net-*/* m=1 r=1754 [dc077_2] [Suspended: HSM script failed (script reported: 1: ) 11.11 11:01:00] {0,}
000052367426CFA14A10AD1E64D67C012344@ugroup-world-net-*/* m=1 r=1925 [dc011_1] [Suspended: HSM script failed (script reported: 32: ) 11.11 11:01:08] {0,}
0000637A4B216D3344B6BD10335910508DE1@ugroup-world-net-*/* m=1 r=508 [dc019_2] [Suspended: HSM script failed (script reported: 32: ) 11.11 14:27:55] {0,}
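For the record, the retry counters above can be pulled out in one go; a sketch over the admin ssh interface:

# List suspended restores as "r=<retries> [<pool>]", sorted by retry count.
echo "\sp rc ls" | ssh -p 22224 dccore.pic.es \
  | grep Suspended \
  | grep -o 'r=[0-9]* \[dc[0-9_]*\]' \
  | sort -t= -k2 -rn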
While "on error" was set to fail, the number of retries kept growing.
Staging is not triggered by bulk; staging is triggered by PinManager (on request from bulk). I wonder if you could repeat that exercise (if you have a test system you can play with, that would be ideal):
When you see the request staging, what does
ls <pnfsid>
in PinManager show?
And when you fail the request in PoolManager, what does
ls <pnfsid>
show?
I will have a closer look later this week. I must say, it looks like PoolManager ignores the maximum number of retries set with
rc set max retries <maxNumberOfRetries>
I just did this exercise with:
rc set max retries 1
And:
[fndca] (PoolManager@dCacheDomain) enstore > rc ls 0000F46FB27E381247B1917EA301FF011442.*
0000F46FB27E381247B1917EA301FF011442@world-net-*/* m=1 r=11 [rw-protodune-stkendca2038-4] [Waiting for stage: rw-protodune-stkendca2038-4 11.12 08:11:19] {0,}
Note the r=11 retry count.
Question: why does the HSM keep failing?
If the tape is broken, then you can use the Enstore DB to extract the list of pnfsids and then set the corresponding locations as unavailable in Chimera. That will stop this.
I can provide a bash script for that.
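The shape of such a script would be something like the sketch below. The table and column names (Enstore's file/volume tables, Chimera's t_locationinfo with an istate flag), the DB hosts, and the volume label are all assumptions for illustration, so treat it as pseudocode and verify against your actual schemas before running anything:

#!/bin/bash
# HYPOTHETICAL sketch - verify every host, table and column name against your Enstore and Chimera setups.
VOLUME=VR1234        # placeholder: the NOACCESS tape label

# 1. Extract the pnfsids of all files on that volume from the Enstore DB (assumed schema).
psql -h enstoredb -U enstore_reader enstoredb -At -c "
  SELECT f.pnfs_id FROM file f JOIN volume v ON f.volume = v.id
  WHERE v.label = '${VOLUME}';" > pnfsids-${VOLUME}.txt

# 2. For each pnfsid, mark the tape location unavailable in Chimera (assumed istate flag, itype 0 = tape).
while read pnfsid; do
  psql -h chimeradb -U chimera chimera -c "
    UPDATE t_locationinfo SET istate = 0
    WHERE itype = 0 AND inumber = (SELECT inumber FROM t_inodes WHERE ipnfsid = '${pnfsid}');"
done < pnfsids-${VOLUME}.txt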
The tape is not completely broken; we're migrating it. Only a few files are marked as bad for the moment, but the migration is failing from time to time, setting the tape to NOACCESS. We expect to finish the migration soon.
Thanks for everything.