
Bulk Rest-API does not stage files with broken disk locations

christianvoss opened this issue · 4 comments

Hi,

we've been observing some curious behaviour with the bulk staging service. It appears the bulk service does not trigger stages if there are disk locations known to dCache, even when those pools are offline. We observed this recently when a storage node had to be taken out of production for a week and we wanted to stage back some files needed by our users.

I've also reproduced this with the latest 9.2 dCache release, 9.2.21. What we see when we want to stage a NEARLINE file is:

{ "nextId": -1, "uid": "f9b987ee-02b5-4ba8-a334-df4b24ed4b6a", "arrivedAt": 1719238168621, "startedAt": 1719238168720, "lastModified": 1719238168753, "status": "COMPLETED", "targetPrefix": "/", "targets": [ { "target": "/pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5", "state": "SKIPPED", "submittedAt": 1719238168635, "startedAt": 1719238168635, "finishedAt": 1719238168750, "id": 242049 } ] }

The operation is always skipped, even though dCache correctly reports the file as NEARLINE: "fileLocality": "NEARLINE".
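The locality check goes through the REST namespace endpoint (a sketch reusing the session from above; the host is again a placeholder):

# Sketch: query fileLocality via the REST namespace endpoint.
path = ("/pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/"
        "p002271/r0081/RAW-R0081-LPD09-S00003.h5")
r = session.get(f"https://dcache-door-xfel01.desy.de:3880/api/v1/namespace{path}",
                params={"locality": "true"},
                headers={"accept": "application/json"})
r.raise_for_status()
print(r.json().get("fileLocality"))  # prints: NEARLINE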

In contrast, staging via SRM triggers a restore from tape immediately:

[vossc@naf-it01] [dev/vossc/no-macaroon-voms-directly] pnfs_qos_api $ srm-bring-online -lifetime=864000 srm://dcache-door-xfel01.desy.de:8443/pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5

[dcache-head-xfel02] (local) vossc > \sn pnfsidof /pnfs/desy.de/exfel/archive/XFEL/raw/FXE/201802/p002271/r0081/RAW-R0081-LPD09-S00003.h5
00005283EB13A8A943E9938C32E0BFFF47FC

[dcache-head-xfel02] (local) vossc > \sp rc ls
00005283EB13A8A943E9938C32E0BFFF47FC@world-net-/ m=1 r=0 [dcache-xfel499-01] [Waiting for stage: dcache-xfel499-01 06.24 16:10:40] {0,}

[dcache-head-xfel02] (local) vossc > \s dcache-xfel499-01 rh ls
a928e3c3-6454-4151-b186-0f3ab7b93757 ACTIVE Mon Jun 24 16:10:40 CEST 2024 Mon Jun 24 16:10:40 CEST 2024 00005283EB13A8A943E9938C32E0BFFF47FC xfel:FXE-2018

Is it possible for bulk to behave like SRM did in the past, or would the procedure be to 'disable' the location in chimera before triggering the stage?

Thanks a lot, Christian

christianvoss commented Jun 24, 2024

Yes. Bulk relies purely on the location information in chimera.
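For illustration, those are the entries bulk consults; a sketch assuming the standard chimera schema (verify column names against your chimera version):

-- List the locations recorded for a file, i.e. what bulk consults
-- before deciding whether to stage. itype distinguishes disk from
-- tape locations (semantics to be verified against your schema).
SELECT l.itype, l.ilocation, l.istate
  FROM t_locationinfo l
  JOIN t_inodes i ON i.inumber = l.inumber
 WHERE i.ipnfsid = '00005283EB13A8A943E9938C32E0BFFF47FC';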

DmitryLitvintsev commented Jun 24, 2024

Hi Dmitry,

thanks a lot for the clarification. Shouldn't it stage from tape even when a pool is offline, though?

Thanks a lot, Christian

christianvoss commented Jul 4, 2024

Hi Christian,

sorry it took so long to get to this.

After disabling the pool (pool disable -strict) or stopping it, the system works as designed and the stage proceeds:

 [uqbar] (bulk@bulk1Domain) admin > \sn cacheinfoof 000081BB1343AB6A4783B342B28269116004 
 rw-uqbar-3


#  systemctl stop dcache@<poolDomain>.service

[uqbar] (local) admin > \c rw-uqbar-3 
(1) Cell does not exist.
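The disable variant mentioned above, instead of stopping the domain, would be:

[uqbar] (local) admin > \s rw-uqbar-3 pool disable -strict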

Execute the script to stage the file:

$ python pin_many.py /pnfs/fs/usr/fermilab/users/litvinse/apache-maven-3.9.8-bin.tar.gz 
201 https://uqbar.fnal.gov:3880/api/v1/bulk-requests/92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd
Checking status
200 {
  "nextId" : -1,
  "uid" : "92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd",
  "arrivedAt" : 1726695461378,
  "startedAt" : 1726695461395,
  "lastModified" : 1726695461395,
  "status" : "STARTED",
  "targetPrefix" : "/pnfs/fs/usr/fermilab",
  "targets" : [ {
    "target" : "/pnfs/fs/usr/fermilab/users/litvinse/apache-maven-3.9.8-bin.tar.gz",
    "state" : "RUNNING",
    "submittedAt" : 1726695461389,
    "startedAt" : 1726695461389,
    "id" : 363
  } ]
}

Observe:

ID           | ARRIVED             |            MODIFIED |        OWNER |     STATUS | UID
...
170          | 2024/09/18-16:37:41 | 2024/09/18-16:37:41 |    8637:3200 |  COMPLETED | 92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd
[uqbar] (bulk@bulk1Domain) admin > request ls 92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd 
ID           | ARRIVED             |            MODIFIED |        OWNER |     STATUS | UID
170          | 2024/09/18-16:37:41 | 2024/09/18-16:37:41 |    8637:3200 |  COMPLETED | 92edcdd1-c04d-4d63-a85c-c9cdc28e7fcd
[uqbar] (bulk@bulk1Domain) admin > \sn cacheinfoof 000081BB1343AB6A4783B342B28269116004 
 rw-uqbar-3 rw-uqbar-9

The script:

#!/usr/bin/env python
"""Submit a bulk PIN request for the given paths via the dCache
REST API and print the resulting request status."""

import json
import os
import sys

import requests
from requests.exceptions import HTTPError

import urllib3
urllib3.disable_warnings()

base_url = "https://uqbar.fnal.gov:3880/api/v1/bulk-requests"
#base_url = "https://cmsdcatape.fnal.gov:3880/api/v1/bulk-requests"

if __name__ == "__main__":
    session = requests.Session()
    session.verify = "/etc/grid-security/certificates"
    # The VOMS proxy contains both certificate and key in one file.
    session.cert = f"/tmp/x509up_u{os.getuid()}"

    headers = {"accept": "application/json",
               "content-type": "application/json"}

    # Pin every path given on the command line for 24 hours.
    data = {
        "target": sys.argv[1:],
        "clearOnFailure": "true",
        "expandDirectories": "none",
        "activity": "PIN",
        "arguments": {
            "lifetime": "24",
            "lifetime-unit": "HOURS"
        }
    }

    # Submit the bulk request; dCache returns the URL of the new
    # request in the 'request-url' response header.
    try:
        r = session.post(base_url,
                         data=json.dumps(data),
                         headers=headers)
        r.raise_for_status()
        print(r.status_code, r.headers['request-url'])
    except HTTPError as exc:
        print(exc)
        sys.exit(1)

    # Fetch the current status of the request once.
    rq = r.headers['request-url']
    print("Checking status")
    r = session.get(rq, headers=headers)
    r.raise_for_status()
    print(r.status_code, r.text)
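The script checks the status only once; to wait for completion one could poll the request URL, roughly like this (a sketch reusing session, headers, and rq from above; the set of terminal states is an assumption):

    # Poll the bulk request until it reaches a terminal state.
    # COMPLETED is observed above; CANCELLED is assumed terminal too.
    import time

    while True:
        r = session.get(rq, headers=headers)
        r.raise_for_status()
        status = r.json()["status"]
        print(status)
        if status in ("COMPLETED", "CANCELLED"):
            break
        time.sleep(5)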

I will re-check (by eye) the condition that sets the SKIPPED status.

DmitryLitvintsev commented Sep 18, 2024

Could you do me a favor and run the exact same exercise with the exact same script (you will need a VOMS proxy)?

DmitryLitvintsev commented Sep 18, 2024