python-irodsclient

Catching and translating iRODS errors

Open kript opened this issue 3 years ago • 14 comments

Hi folks,

I've been trying to use RErrorStack to display the errors from a failed upload via put(), but I'm running into issues when following the documentation. Can you tell me what I might have missed?

Create the 1024 test files (2.1G) with:

dd if=/dev/urandom of=1gb count=1024 bs=1048576
split -b 1048576 -a 3 1gb 1mb.

Run with:

python ~iRODS_small_file_stressor.py 1mb.*

where the script is:

import os
import ssl
import sys
from irods.session import iRODSSession
from irods.manager.data_object_manager import Server_Checksum_Warning
import irods.keywords as kw
from irods.message import RErrorStack

try:
    env_file = os.environ["IRODS_ENVIRONMENT_FILE"]
except KeyError:
    env_file = os.path.expanduser("~/.irods/irods_environment.json")

ssl_context = ssl.create_default_context(
    purpose=ssl.Purpose.SERVER_AUTH, cafile=None, capath=None, cadata=None
)
ssl_settings = {"ssl_context": ssl_context}

session = iRODSSession(irods_env_file=env_file, **ssl_settings)
session.connection_timeout = 300
opts = {kw.VERIFY_CHKSUM_KW: ""}
r_err_stk = RErrorStack()
warn = None

session.collections.create("/seq-dev/home/jc18#Sanger1-dev/irods_test")

retries = 0
fails = 0
for base in sys.argv[1:]:
    remote = "/seq-dev/home/jc18#Sanger1-dev/irods_test/" + base
    print(remote)

    for attempt in range(10):
        warn = None
        try:
            try:
                session.data_objects.put(base, remote, r_error = r_err_stk)
            except Exception as exc:
                print("put failed")
                warn = exc
                print(warn)
                print(r_err_stk)
            try:
                obj = session.data_objects.get(remote, r_error = r_err_stk)
            except Exception as exc:
                print("get failed")
                warn = exc
                print(warn)
                print(r_err_stk)

            try:
                obj.chksum(**opts, r_error = r_err_stk)
            except Server_Checksum_Warning as exc:
                print("some checksums are missing or wrong")
                warn = exc
                print(r_err_stk)

        except KeyboardInterrupt:
            sys.exit(1)
        except:
            print(" retrying " + remote)
            retries += 1
            continue
        else:
            break
    else:
        fails += 1
        print(base + " failed")

session.cleanup()

print(f"retries: {retries}; fails: {fails}")

When running, I see:

./iRODS_small_file_stressor.py 1mb.*
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aaa
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aab
put failed
None
[]
get failed

[]
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aac
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aad
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aae
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aaf

kript avatar Oct 07 '22 13:10 kript

So, how many worked? Did you know that 1mb.aab was going to fail? But the others continued? Where are the other 1018?

trel avatar Oct 07 '22 13:10 trel

I edited the output for brevity, and no, I expected them all to work (that's another ticket, I suspect 😇). The fact that the errors are not shown at the client end is what I am reporting and asking for help with in this issue, as it appears RErrorStack is not getting populated...

kript avatar Oct 07 '22 13:10 kript

@kript Unfortunately, put(...) in PRC isn't quite the same as iput. One difference is that it doesn't use the PUT API, so verifying checksums will have to be done another way. Perhaps something like:

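import base64, hashlib
def put_with_checksum_verify(session, src, dst):
    # Hash the local file with SHA-256 before uploading.
    with open(src, 'rb') as f:
        h = hashlib.sha256(f.read())
    obj = session.data_objects.put(src, dst, return_data_object=True)
    # Decode to str so it can be compared with the server's checksum string.
    src_checksum = base64.b64encode(h.digest()).decode()
    dst_checksum = obj.chksum()
    # The server-side value is expected to be 'sha2:' + base64 digest.
    return dst_checksum.startswith('sha2:') and dst_checksum[5:] == src_checksum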
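
For large files, reading the whole source into memory may be undesirable. A minimal sketch of computing the same `sha2:`-prefixed, base64-encoded SHA-256 string in chunks is shown here. The helper name `irods_style_sha256` and the chunk size are illustrative assumptions, not part of python-irodsclient.

```python
import base64
import hashlib


def irods_style_sha256(path, chunk_size=1024 * 1024):
    """Return 'sha2:' + base64(SHA-256 digest) for a local file.

    Hypothetical helper: it only mirrors the 'sha2:' prefix check
    used in the snippet above.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so memory use stays bounded.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha2:" + base64.b64encode(h.digest()).decode("ascii")
```

The result could then be compared directly with the string returned by `obj.chksum()`, assuming the zone is configured for SHA-256 checksums.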

d-w-moore avatar Oct 07 '22 18:10 d-w-moore

@kript If the seq-dev zone is running an iRODS version earlier than 4.2.11, you will probably also find that Server_Checksum_Warning is not raised, so trapping warnings via the RErrorStack mechanism wouldn't work.

d-w-moore avatar Oct 07 '22 18:10 d-w-moore

OK... but there is no mention of the version requirements in the README.

If an example requires a particular version, can it be mentioned in the example, please?

So... How does one get the iRODS errors from a system earlier than 4.2.11 (yes, seq-dev is currently on 4.2.7)?

kript avatar Oct 09 '22 18:10 kript

@kript I believe the mechanism itself (for getting the iRODS errors back) would work as early as 4.2.7, or even before; but in this case the checksum API changed in version 4.2.11. So, prior to that version, you wouldn't see the Server_Checksum_Warning exception, and therefore no messages would come back via the RErrorStack.
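
A client script could guard on the server version before relying on Server_Checksum_Warning. The following is only a sketch: the function name is hypothetical, and it assumes the version is available as a tuple of integers (for example, from something like `session.server_version`, which is an assumption here rather than something confirmed in this thread).

```python
def checksum_warning_expected(server_version):
    """Return True if the server is at least iRODS 4.2.11.

    `server_version` is assumed to be a tuple such as (4, 2, 7).
    """
    # Tuples compare element by element, so (4, 2, 7) < (4, 2, 11).
    return tuple(server_version) >= (4, 2, 11)


print(checksum_warning_expected((4, 2, 7)))
print(checksum_warning_expected((4, 2, 11)))
```

Comparing tuples rather than version strings avoids the pitfall where "4.2.7" sorts after "4.2.11" lexicographically.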

d-w-moore avatar Oct 09 '22 21:10 d-w-moore

I've gone for a simplified version of the script, but even with this I still can't see the errors:

#!/usr/bin/env python
# vim: tabstop=8 expandtab shiftwidth=4 softtabstop=4
import os
import ssl
import sys
from irods.session import iRODSSession

try:
    env_file = os.environ["IRODS_ENVIRONMENT_FILE"]
except KeyError:
    env_file = os.path.expanduser("~/.irods/irods_environment.json")

ssl_context = ssl.create_default_context(
    purpose=ssl.Purpose.SERVER_AUTH, cafile=None, capath=None, cadata=None
)
ssl_settings = {"ssl_context": ssl_context}

session = iRODSSession(irods_env_file=env_file, **ssl_settings)
session.connection_timeout = 300

session.collections.create("/seq-dev/home/jc18#Sanger1-dev/irods_test")

retries = 0
fails = 0
for base in sys.argv[1:]:
    remote = "/seq-dev/home/jc18#Sanger1-dev/irods_test/" + base
    print(remote)

    for attempt in range(10):
        try:
            session.data_objects.put(base, remote)
            obj = session.data_objects.get(remote)
        except KeyboardInterrupt:
            sys.exit(1)
        except Exception as exc:
            print(" retrying " + remote)
            print(exc)
            #print(f"retries: {retries}; fails: {fails}")
            retries += 1
            continue
        else:
            print("Uploaded: " + remote)
            break
    else:
        fails += 1
        print(base + " failed")

session.cleanup()

print(f"retries: {retries}; fails: {fails}")

Sample output:

Uploaded: /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhv
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
 retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
1mb.bhw failed
...
Uploaded: /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bni
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bnj
Uploaded: /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bnj
retries: 160; fails: 16

real    6m23.979s
user    0m11.760s
sys     0m3.770s

kript avatar Oct 10 '22 10:10 kript

Do we know what the errors were exactly, from the server logs?

d-w-moore avatar Oct 10 '22 12:10 d-w-moore

Ah, this was it then - the "out of space on device" error - for the run just above? https://github.com/irods/python-irodsclient/issues/399#issue-1402946364 And we'd prefer that the "out of space on device" appear on the client output? (Just trying to make sure I'm clear on the central issue itself.)

d-w-moore avatar Oct 10 '22 12:10 d-w-moore

Not really, no. I started looking into it because a variant of this script was getting intermittent SYS_INTERNAL_NULL_INPUT_ERR, but nothing was appearing in the logs on the provider or consumer.

In triaging, the logs variously had UNIX_FILE_WRITE_ERR and UNIX_FILE_MKDIR_ERR, because in testing we had filled up some very small resources. However, as I wasn't seeing any errors client-side, it was hard to know when the disk-full condition occurred from a client perspective, or whether the error was something different.

To sum up, using iRODS 4.2.7 and python-irodsclient==1.1.5, I can't seem to get errors back in the client whether I use:

  1. A normal Python Exception
  2. irods.message.RErrorStack

kript avatar Oct 10 '22 13:10 kript

@kript . Ah, ok, thanks.

d-w-moore avatar Oct 10 '22 13:10 d-w-moore

None

@kript The PUTs that cause an out-of-disk-space error in your script above are in fact raising irods.exception.UNIX_FILE_WRITE_ERR(None,) instances, but only the None part is printed because print uses the default __str__ translation for Exception objects, which shows only the args member. That will probably always be None for any iRODS API call that returns an errno code without a message. So maybe try this instead, to print out the exception type as well:

print(repr(e))

When using formats, this means using 'r' instead of 's':

# If e is defined locally then these are equivalent:
"{e!r}".format(**locals())
f"{e!r}"
'%r' % e
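
To see the difference without a server, here is a small self-contained sketch. The class below is a local stand-in defined only for illustration; it is not the real irods.exception.UNIX_FILE_WRITE_ERR.

```python
class UNIX_FILE_WRITE_ERR(Exception):
    """Local stand-in; NOT the real irods.exception class."""


try:
    raise UNIX_FILE_WRITE_ERR(None)
except Exception as e:
    as_str = str(e)    # shows only the single argument
    as_repr = repr(e)  # also includes the exception class name

print(as_str)
print(as_repr)
```

The first line printed carries no hint of the exception type, while the second names the class, which is what makes the retry loop's output diagnosable.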

d-w-moore avatar Oct 15 '22 03:10 d-w-moore

@kript Looking at this again, all we get as repr( )-style output is UNIX_FILE_WRITE_ERR(None,), and .... We could have at least included the errno code (which was 28) and/or the symbol and strerror-message [ (ENOSPC) / 'No space left on device']. I'll make this an issue in its own right.
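
As a rough illustration of what such a message could contain, the standard library can already map an errno code to its symbol and strerror text. This is only a sketch; `describe_errno` is a hypothetical helper, not an existing python-irodsclient function.

```python
import errno
import os


def describe_errno(code):
    """Format an errno code as 'SYMBOL (code): message'."""
    # errno.errorcode maps numeric codes to names such as 'ENOSPC'.
    symbol = errno.errorcode.get(code, "UNKNOWN")
    return "{} ({}): {}".format(symbol, code, os.strerror(code))


print(describe_errno(errno.ENOSPC))
```

The exact message text comes from the platform's C library, so it may differ between operating systems.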

d-w-moore avatar Oct 16 '22 14:10 d-w-moore

Thanks Daniel!

kript avatar Oct 16 '22 17:10 kript