Catching and translating iRODS errors
Hi folks,
I've been trying to use RErrorStack to display the errors from a failed upload via put(), but I'm running into issues following the documentation. Can you tell me what I might have missed?
Create the 1024 test files (2.1G) with:
dd if=/dev/urandom of=1gb count=1024 bs=1048576
split -b 1048576 -a 3 1gb 1mb.
Run with:
python ~/iRODS_small_file_stressor.py 1mb.*
import os
import ssl
import sys
from irods.session import iRODSSession
from irods.manager.data_object_manager import Server_Checksum_Warning
import irods.keywords as kw
from irods.message import RErrorStack

try:
    env_file = os.environ["IRODS_ENVIRONMENT_FILE"]
except KeyError:
    env_file = os.path.expanduser("~/.irods/irods_environment.json")

ssl_context = ssl.create_default_context(
    purpose=ssl.Purpose.SERVER_AUTH, cafile=None, capath=None, cadata=None
)
ssl_settings = {"ssl_context": ssl_context}
session = iRODSSession(irods_env_file=env_file, **ssl_settings)
session.connection_timeout = 300
opts = {kw.VERIFY_CHKSUM_KW: ""}
r_err_stk = RErrorStack()
warn = None
session.collections.create("/seq-dev/home/jc18#Sanger1-dev/irods_test")
retries = 0
fails = 0
for base in sys.argv[1:]:
    remote = "/seq-dev/home/jc18#Sanger1-dev/irods_test/" + base
    print(remote)
    for attempt in range(10):
        warn = None
        try:
            try:
                session.data_objects.put(base, remote, r_error = r_err_stk)
            except Exception as exc:
                print("put failed")
                warn = exc
                print(warn)
                print(r_err_stk)
            try:
                obj = session.data_objects.get(remote, r_error = r_err_stk)
            except Exception as exc:
                print("get failed")
                warn = exc
                print(warn)
                print(r_err_stk)
            try:
                obj.chksum(**opts, r_error = r_err_stk)
            except Server_Checksum_Warning as exc:
                print("some checksums are missing or wrong")
                warn = exc
                print(r_err_stk)
        except KeyboardInterrupt:
            sys.exit(1)
        except:
            print(" retrying " + remote)
            retries += 1
            continue
        else:
            break
    else:
        fails += 1
        print(base + " failed")
session.cleanup()
print(f"retries: {retries}; fails: {fails}")
When running, I see:
./iRODS_small_file_stressor.py 1mb.*
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aaa
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aab
put failed
None
[]
get failed
[]
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aac
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aad
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aae
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.aaf
So, how many worked? Did you know that 1mb.aab was going to fail? But the others continued? Where are the other 1018?
I edited the output for brevity, and no, I expected them all to work (that's another ticket, I suspect 😇). It's the fact that I'm not seeing the errors at the client end that I am reporting/asking for help with in this issue, as it appears RErrorStack is not getting populated...
@kript Unfortunately put( ...) in PRC isn't quite the same as iput - one difference being it doesn't use the PUT api - so verifying checksums will have to be done another way. Something perhaps like:
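import base64, hashlib

def put_with_checksum_verify(session, src, dst):
    # Hash the local file with SHA-256 to match the server's "sha2:" checksums.
    h = hashlib.sha256()
    with open(src, 'rb') as f:
        h.update(f.read())
    obj = session.data_objects.put(src, dst, return_data_object=True)
    # b64encode returns bytes; decode it so it compares equal to the server's string.
    src_checksum = base64.b64encode(h.digest()).decode()
    dst_checksum = obj.chksum()
    return dst_checksum.startswith('sha2:') and dst_checksum[5:] == src_checksum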
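For files too large to read into memory in one go, the local half of that comparison can be computed in chunks. This is only a minimal sketch, assuming the server reports SHA-256 checksums in the `sha2:<base64 digest>` form used above; `local_sha2_checksum` is a hypothetical helper name, not part of python-irodsclient:

```python
import base64
import hashlib


def local_sha2_checksum(path, chunk_size=1024 * 1024):
    """Return the checksum of a local file in 'sha2:<base64>' form."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so memory use stays bounded.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha2:" + base64.b64encode(h.digest()).decode("ascii")
```

The returned string can then be compared directly with the value returned by obj.chksum().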
@kript In the case that seq-dev zone is using an iRODS before 4.2.11, you will probably also find that Server_Checksum_Warning is not raised, so trapping warnings via the RErrorStack mechanism wouldn't work.
OK... but there is no mention of the version requirements in the README.
If an example requires a particular version, can it be mentioned in the example, please?
So... How does one get the iRODS errors from a system earlier than 4.2.11 (yes, seq-dev is currently on 4.2.7)?
@kript I believe the mechanism itself (for getting the iRODS errors back) would work as early as 4.2.7 or even before; but in this case the checksum API changed in version 4.2.11. So, prior to that version, you wouldn't see the Server_Checksum_Warning exception and therefore no messages would come back via the RErrorStack.
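A client script could check for this up front rather than wait for a warning that never arrives. The following is a minimal sketch, assuming the server version is available as a tuple of integers (python-irodsclient exposes it as `session.server_version`); the helper name `checksum_warnings_supported` and the constant are hypothetical:

```python
# First server release with the changed checksum API, per the comment above.
CHECKSUM_API_CHANGE = (4, 2, 11)


def checksum_warnings_supported(server_version):
    """True if the server is new enough to raise Server_Checksum_Warning."""
    return tuple(server_version) >= CHECKSUM_API_CHANGE


# With a live session this would be called as:
#   checksum_warnings_supported(session.server_version)
print(checksum_warnings_supported((4, 2, 7)))  # prints False
```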
I've gone for a simplified version of the script, but even with this I don't get visibility of the errors:
#!/usr/bin/env python
# vim: tabstop=8 expandtab shiftwidth=4 softtabstop=4
import os
import ssl
import sys
from irods.session import iRODSSession

try:
    env_file = os.environ["IRODS_ENVIRONMENT_FILE"]
except KeyError:
    env_file = os.path.expanduser("~/.irods/irods_environment.json")

ssl_context = ssl.create_default_context(
    purpose=ssl.Purpose.SERVER_AUTH, cafile=None, capath=None, cadata=None
)
ssl_settings = {"ssl_context": ssl_context}
session = iRODSSession(irods_env_file=env_file, **ssl_settings)
session.connection_timeout = 300
session.collections.create("/seq-dev/home/jc18#Sanger1-dev/irods_test")
retries = 0
fails = 0
for base in sys.argv[1:]:
    remote = "/seq-dev/home/jc18#Sanger1-dev/irods_test/" + base
    print(remote)
    for attempt in range(10):
        try:
            session.data_objects.put(base, remote)
            obj = session.data_objects.get(remote)
        except KeyboardInterrupt:
            sys.exit(1)
        except Exception as exc:
            print(" retrying " + remote)
            print(exc)
            #print(f"retries: {retries}; fails: {fails}")
            retries += 1
            continue
        else:
            print("Uploaded: " + remote)
            break
    else:
        fails += 1
        print(base + " failed")
session.cleanup()
print(f"retries: {retries}; fails: {fails}")
Sample output:
Uploaded: /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhv
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
retrying /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bhw
None
1mb.bhw failed
...
Uploaded: /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bni
/seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bnj
Uploaded: /seq-dev/home/jc18#Sanger1-dev/irods_test/1mb.bnj
retries: 160; fails: 16
real 6m23.979s
user 0m11.760s
sys 0m3.770s
Do we know what the errors were exactly, from the server logs?
Ah, this was it then - the "out of space on device" error - for the run just above? https://github.com/irods/python-irodsclient/issues/399#issue-1402946364 And we'd prefer that the "out of space on device" appear on the client output? (Just trying to make sure I'm clear on the central issue itself.)
Not really, no. I started looking into it because a variant of this script was getting intermittent SYS_INTERNAL_NULL_INPUT_ERR, but nothing was appearing in the logs on the Provider or consumer.
During triage, the logs variously showed UNIX_FILE_WRITE_ERR and UNIX_FILE_MKDIR_ERR, because in testing we had filled up some very small resources. However, since I wasn't seeing any errors client side, it was hard to tell from the client's perspective when the disk filled up, or whether the error was something different.
To sum up, using iRODS 4.2.7 and python-irodsclient==1.1.5, I can't seem to get errors back in the client, whether I use:
- a normal Python Exception
- irods.message.RErrorStack
@kript . Ah, ok, thanks.
None
@kript The PUTs that cause an out-of-disk-space error in your script above are in fact irods.exception.UNIX_FILE_WRITE_ERR(None,) instances but only the None part is printing out because print is using the default __str__ translation for Exception objects, which just prints out the args member. That will probably always be a None in any iRODS API call that returns an errno code without a message. So maybe try this instead, to print out the exception type:
print(repr(e))
When using formats, this means using 'r' instead of 's':
# If e is defined locally then these are equivalent:
"{e!r}".format(**locals())
f"{e!r}"
'%r' % e
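The difference is easy to reproduce without a server. In this sketch, `UNIX_FILE_WRITE_ERR` is a local stand-in class defined only for illustration, not the real one from irods.exception, and `describe` is a hypothetical helper:

```python
class UNIX_FILE_WRITE_ERR(Exception):
    """Stand-in for the iRODS exception of the same name (illustration only)."""


def describe(exc):
    # str() shows only the args; repr() also includes the exception class name.
    return str(exc), repr(exc)


try:
    raise UNIX_FILE_WRITE_ERR(None)
except Exception as e:
    as_str, as_repr = describe(e)
    print(as_str)   # the bare argument, here "None"
    print(as_repr)  # includes "UNIX_FILE_WRITE_ERR"
```

The exact repr text varies slightly between Python versions (older ones add a trailing comma after a single argument), so it is safer to match on the class name than on the full string.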
@kript Looking at this again, all we get as repr( )-style output is UNIX_FILE_WRITE_ERR(None,), and .... We could have at least included the errno code (which was 28) and/or the symbol and strerror-message [ (ENOSPC) / 'No space left on device']. I'll make this an issue in its own right.
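For reference, the symbol and message for an errno value can be looked up with the Python standard library. This is a minimal sketch, assuming the numeric code (28 in this case) has already been obtained by some other means; `describe_errno` is a hypothetical helper, and the exact message text depends on the platform:

```python
import errno
import os


def describe_errno(code):
    """Return (symbol, message) for an errno value."""
    # errno.errorcode maps numeric codes to names such as 'ENOSPC'.
    symbol = errno.errorcode.get(code, "UNKNOWN")
    return symbol, os.strerror(code)


symbol, message = describe_errno(28)
print(f"errno 28: {symbol} / {message}")
```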
Thanks Daniel!