
Capture more volumedriver event messages

Open JeffreyDevloo opened this issue 8 years ago • 2 comments

Problem description

The volumedriver emits certain events that we do not capture. @redlicha, please provide us a list so we can handle them.

JeffreyDevloo avatar Jan 08 '18 11:01 JeffreyDevloo

Protobuf descriptions of the events:

  • https://github.com/openvstorage/volumedriver/blob/dev/src/volumedriver/Events.proto
  • https://github.com/openvstorage/volumedriver/blob/dev/src/volumedriver/VolumeDriverEvents.proto
  • https://github.com/openvstorage/volumedriver/blob/dev/src/filesystem/FileSystemEvents.proto

The former provides a base message type that is extended by the latter two.

The generated code is part of the -base package:

$ find . -name \*_pb2.py
./usr/lib/python2.7/dist-packages/volumedriver/storagerouter/Events_pb2.py
./usr/lib/python2.7/dist-packages/volumedriver/storagerouter/VolumeDriverEvents_pb2.py
./usr/lib/python2.7/dist-packages/volumedriver/storagerouter/FileSystemEvents_pb2.py
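The base-message-plus-extensions layout described above can be sketched in plain Python (no protobuf dependency). All class names, extension keys, and payload fields below are illustrative assumptions, not the actual generated `_pb2` API:

```python
# Conceptual sketch of the pattern: a base Event message that concrete
# event types from the two extending .proto files hook into as
# extensions. Names are hypothetical, not the generated protobuf API.

class Event(object):
    """Base message; concrete events attach themselves as extensions."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, key, payload):
        self._extensions[key] = payload

    def get_extension(self, key):
        # Returns None if this event does not carry the extension.
        return self._extensions.get(key)

    def extension_keys(self):
        return sorted(self._extensions)


# Hypothetical extension keys, mirroring the two extending .proto files.
VOLUMEDRIVER_ERROR = "volumedriver_error_event"
FILESYSTEM_VOLUME_CREATE = "filesystem_volume_create_event"

event = Event()
event.set_extension(VOLUMEDRIVER_ERROR, {"code": "WriteTLog", "info": "I/O error"})

# A consumer that wants to capture more events would check which
# extensions are present on each incoming message and dispatch:
for key in event.extension_keys():
    print(key, event.get_extension(key))
```

The real generated modules expose this through protobuf's extension mechanism rather than a dict, but the dispatch idea (inspect which extensions a received base message carries, then handle each) is the same.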

redlicha avatar Jan 09 '18 08:01 redlicha

Some more info on the VolumeDriverErrorCodes:

  • Unknown: (unused) placeholder.
  • ReadFromDisposableSCO: error reading data from a disposable SCO in the SCO cache (I/O error on device?). Will trigger SCO cache mountpoint offlining.
  • ReadFromNonDisposableSCO: error reading data from a non-disposable SCO in the SCO cache (I/O error on device?). Will trigger SCO cache mountpoint offlining, and the SCO will be fetched from the DTL to another mountpoint (if there are any).
  • PutSCOToBackend: problem uploading a SCO to the backend (due to backend errors, local I/O errors, checksum mismatch, ...). Depending on the exact cause this could trigger a SCO cache mountpoint getting offlined (and the SCO getting fetched again from the DTL to another mountpoint).
  • PutTLogToBackend: problem uploading a TLog to the backend (due to backend errors, local I/O errors, checksum mismatch, ...). Depending on the exact cause this could lead to the volume getting put into 'halted' state.
  • PutSnapshotsToBackend: problem uploading snapshots.xml to the backend (due to backend errors, local I/O errors, ...). Depending on the exact cause this could lead to the volume getting put into 'halted' state.
  • GetSCOFromBackend: failure to download a SCO from the backend to the SCO cache. Obsolete, as we use partial reads.
  • GetTLogFromBackend: failure to fetch a TLog from the backend (due to backend errors, local I/O errors, ...). This can happen on MDS slave updates or volume restarts; the former will be logged and ignored, the latter will lead to a failed restart.
  • GetSnapshotsFromBackend: analogous to GetTLogFromBackend, for snapshots.xml.
  • ReadSourceSCOWhileMoving: unused.
  • MetaDataStore: unused.
  • ReadTLog: error reading a local TLog (I/O error, ...). Can happen on MDS slave updates and volume restarts; cf. GetTLogFromBackend.
  • ReadSnapshots: analogous to ReadTLog, for snapshots.xml.
  • ApplyScrubbingRelocs: error while applying scrub results to the volume's metadata. Might leak scrub result data.
  • GetScrubbingResultsFromBackend: failure to fetch scrub result info from the backend; unused at the moment.
  • WriteToSCO: failure to write to a SCO in the SCO cache (I/O error, ...). Will lead to the mountpoint getting offlined.
  • WriteDestinationSCOWhileMoving: unused.
  • WriteTLog: failure to write to a TLog (I/O error). Will lead to the volume getting halted.
  • WriteSnapshots: failure to write snapshots.xml locally (I/O error). Will lead to the volume getting halted.
  • ApplyScrubbingToSnapshotMamager: unused (and misspelt in the source).
  • SCOCacheMountPointOfflined: a SCO cache mountpoint was offlined, usually as a consequence of another error, or of an error discovered by the cleanup thread.
  • ClusterCacheMountPointOfflined: a cluster cache mountpoint was offlined (I/O error).
  • GetSCOFromFOC: failure to fetch a SCO from the DTL. Might lead to the mountpoint getting offlined and the volume getting halted.
  • VolumeHalted: the volume entered the halted state due to one of the errors listed here, or due to fencing.
  • DiskSpace: unused.
  • MDSFailover: a volume failed over to an MDS slave.
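The consequences documented above can be condensed into a lookup table. The sketch below is a hypothetical triage helper, not part of the volumedriver API; the groupings follow the descriptions in the list (codes whose effect can include mountpoint offlining, codes that can halt the volume, and codes marked unused or obsolete):

```python
# Hypothetical classification of VolumeDriverErrorCodes by documented
# consequence. Illustrative only; not part of the volumedriver API.

UNUSED = {
    "Unknown", "ReadSourceSCOWhileMoving", "MetaDataStore",
    "GetScrubbingResultsFromBackend", "WriteDestinationSCOWhileMoving",
    "ApplyScrubbingToSnapshotMamager", "DiskSpace",
    "GetSCOFromBackend",  # obsolete since partial reads
}

# Codes whose documented effect is (or can include) a SCO / cluster
# cache mountpoint getting offlined.
MOUNTPOINT_OFFLINING = {
    "ReadFromDisposableSCO", "ReadFromNonDisposableSCO",
    "PutSCOToBackend", "WriteToSCO",
    "SCOCacheMountPointOfflined", "ClusterCacheMountPointOfflined",
    "GetSCOFromFOC",
}

# Codes whose documented effect is (or can include) the volume
# entering the 'halted' state.
VOLUME_HALTING = {
    "PutTLogToBackend", "PutSnapshotsToBackend",
    "WriteTLog", "WriteSnapshots", "GetSCOFromFOC", "VolumeHalted",
}


def severity(code):
    """Rough triage for a captured error event (illustrative only)."""
    if code in VOLUME_HALTING:        # worst case first
        return "volume-halting"
    if code in MOUNTPOINT_OFFLINING:
        return "mountpoint-offlining"
    if code in UNUSED:
        return "unused"
    return "informational"            # e.g. MDSFailover, ReadTLog, ...


print(severity("WriteTLog"))  # -> volume-halting
```

Note that GetSCOFromFOC appears in both sets because its description mentions both effects; the helper reports the more severe one.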

redlicha avatar Feb 21 '19 10:02 redlicha