
Capture more volumedriver event messages

Open JeffreyDevloo opened this issue 8 years ago • 2 comments

Problem description

The volumedriver emits certain events that we do not capture. @redlicha, please provide us a list so we can handle them.

JeffreyDevloo avatar Jan 08 '18 11:01 JeffreyDevloo

Protobuf descriptions of the events:

  • https://github.com/openvstorage/volumedriver/blob/dev/src/volumedriver/Events.proto
  • https://github.com/openvstorage/volumedriver/blob/dev/src/volumedriver/VolumeDriverEvents.proto
  • https://github.com/openvstorage/volumedriver/blob/dev/src/filesystem/FileSystemEvents.proto

The former provides a base message type that is extended by the latter two.

The generated code is part of the -base package:

$ find . -name \*_pb2.py
./usr/lib/python2.7/dist-packages/volumedriver/storagerouter/Events_pb2.py
./usr/lib/python2.7/dist-packages/volumedriver/storagerouter/VolumeDriverEvents_pb2.py
./usr/lib/python2.7/dist-packages/volumedriver/storagerouter/FileSystemEvents_pb2.py
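The base-message-plus-extensions layout described above can be sketched in plain Python (no protobuf dependency). All class names, extension keys, and payload fields below are illustrative assumptions, not the actual generated `_pb2` API:

```python
# Conceptual sketch of the pattern: a base Event message that concrete
# event types from the two extending .proto files hook into as
# extensions. Names are hypothetical, not the generated protobuf API.

class Event(object):
    """Base message; concrete events attach themselves as extensions."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, key, payload):
        self._extensions[key] = payload

    def get_extension(self, key):
        # Returns None if this event does not carry the extension.
        return self._extensions.get(key)

    def extension_keys(self):
        return sorted(self._extensions)


# Hypothetical extension keys, mirroring the two extending .proto files.
VOLUMEDRIVER_ERROR = "volumedriver_error_event"
FILESYSTEM_VOLUME_CREATE = "filesystem_volume_create_event"

event = Event()
event.set_extension(VOLUMEDRIVER_ERROR, {"code": "WriteTLog", "info": "I/O error"})

# A consumer that wants to capture more events would check which
# extensions are present on each incoming message and dispatch:
for key in event.extension_keys():
    print(key, event.get_extension(key))
```

The real generated modules expose this through protobuf's extension mechanism rather than a dict, but the dispatch idea (inspect which extensions a received base message carries, then handle each) is the same.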

redlicha avatar Jan 09 '18 08:01 redlicha

Some more info on the VolumeDriverErrorCodes:

  • Unknown: (unused) placeholder.
  • ReadFromDisposableSCO: error reading data from a disposable SCO in the SCO cache (I/O error on device?). Will trigger SCO cache mountpoint offlining.
  • ReadFromNonDisposableSCO: error reading data from a non-disposable SCO in the SCO cache (I/O error on device?). Will trigger SCO cache mountpoint offlining, and the SCO will be fetched from the DTL to another mountpoint (if there are any).
  • PutSCOToBackend: problem uploading a SCO to the backend (due to backend errors, local I/O errors, checksum mismatch, ...). Depending on the exact cause this could trigger a SCO cache mountpoint getting offlined (and the SCO getting fetched again from the DTL to another mountpoint).
  • PutTLogToBackend: problem uploading a TLog to the backend (due to backend errors, local I/O errors, checksum mismatch, ...). Depending on the exact cause this could lead to the volume getting put into 'halted' state.
  • PutSnapshotsToBackend: problem uploading snapshots.xml to the backend (due to backend errors, local I/O errors, ...). Depending on the exact cause this could lead to the volume getting put into 'halted' state.
  • GetSCOFromBackend: failure to download a SCO from the backend to the SCO cache. Obsolete, as we use partial reads.
  • GetTLogFromBackend: failure to fetch a TLog from the backend (due to backend errors, local I/O errors, ...). This can happen on MDS slave updates or volume restarts; the former will be logged and ignored, the latter will lead to a failed restart.
  • GetSnapshotsFromBackend: analogous to GetTLogFromBackend, for snapshots.xml.
  • ReadSourceSCOWhileMoving: unused.
  • MetaDataStore: unused.
  • ReadTLog: error reading a local TLog (I/O error, ...). Can happen on MDS slave updates and volume restarts; cf. GetTLogFromBackend.
  • ReadSnapshots: analogous to ReadTLog, for snapshots.xml.
  • ApplyScrubbingRelocs: error while applying scrub results to the volume's metadata. Might leak scrub result data.
  • GetScrubbingResultsFromBackend: failure to fetch scrub result info from the backend; unused at the moment.
  • WriteToSCO: failure to write to a SCO in the SCO cache (I/O error, ...). Will lead to the mountpoint getting offlined.
  • WriteDestinationSCOWhileMoving: unused.
  • WriteTLog: failure to write to a TLog (I/O error). Will lead to the volume getting halted.
  • WriteSnapshots: failure to write snapshots.xml locally (I/O error). Will lead to the volume getting halted.
  • ApplyScrubbingToSnapshotMamager: unused (and misspelt in the source).
  • SCOCacheMountPointOfflined: a SCO cache mountpoint was offlined, usually as a consequence of another error, or of an error discovered by the cleanup thread.
  • ClusterCacheMountPointOfflined: a cluster cache mountpoint was offlined (I/O error).
  • GetSCOFromFOC: failure to fetch a SCO from the DTL. Might lead to the mountpoint getting offlined and the volume getting halted.
  • VolumeHalted: the volume entered the halted state due to one of the errors listed here, or due to fencing.
  • DiskSpace: unused.
  • MDSFailover: a volume failed over to an MDS slave.
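The consequences documented above can be condensed into a lookup table. The sketch below is a hypothetical triage helper, not part of the volumedriver API; the groupings follow the descriptions in the list (codes whose effect can include mountpoint offlining, codes that can halt the volume, and codes marked unused or obsolete):

```python
# Hypothetical classification of VolumeDriverErrorCodes by documented
# consequence. Illustrative only; not part of the volumedriver API.

UNUSED = {
    "Unknown", "ReadSourceSCOWhileMoving", "MetaDataStore",
    "GetScrubbingResultsFromBackend", "WriteDestinationSCOWhileMoving",
    "ApplyScrubbingToSnapshotMamager", "DiskSpace",
    "GetSCOFromBackend",  # obsolete since partial reads
}

# Codes whose documented effect is (or can include) a SCO / cluster
# cache mountpoint getting offlined.
MOUNTPOINT_OFFLINING = {
    "ReadFromDisposableSCO", "ReadFromNonDisposableSCO",
    "PutSCOToBackend", "WriteToSCO",
    "SCOCacheMountPointOfflined", "ClusterCacheMountPointOfflined",
    "GetSCOFromFOC",
}

# Codes whose documented effect is (or can include) the volume
# entering the 'halted' state.
VOLUME_HALTING = {
    "PutTLogToBackend", "PutSnapshotsToBackend",
    "WriteTLog", "WriteSnapshots", "GetSCOFromFOC", "VolumeHalted",
}


def severity(code):
    """Rough triage for a captured error event (illustrative only)."""
    if code in VOLUME_HALTING:        # worst case first
        return "volume-halting"
    if code in MOUNTPOINT_OFFLINING:
        return "mountpoint-offlining"
    if code in UNUSED:
        return "unused"
    return "informational"            # e.g. MDSFailover, ReadTLog, ...


print(severity("WriteTLog"))  # -> volume-halting
```

Note that GetSCOFromFOC appears in both sets because its description mentions both effects; the helper reports the more severe one.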

redlicha avatar Feb 21 '19 10:02 redlicha