
Get checksum of actual record bytes

Open sree018 opened this issue 11 months ago • 1 comments

Background

We have a requirement to validate the checksum supplied by the upstream system against the checksum of the actual data.

Currently we strip the BDW and RDW headers, write the remaining record bytes to a file, and calculate the checksum of that file.

Feature

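import java.io.{File, FileInputStream, FileOutputStream}
import org.apache.commons.io.IOUtils

// Strips the BDW and RDW headers from infile and writes the record payloads to outfile.
def getDataBytes(infile: String, outfile: String): Unit = {
  val inputStream = new FileInputStream(new File(infile))
  val outStream = new FileOutputStream(new File(outfile))
  try {
    while (inputStream.available() > 0) {
      // The first 2 bytes of the 4-byte BDW hold the block length (header included)
      val bdw = BigInt(IOUtils.toByteArray(inputStream, 2)).toInt
      val bdwBuffer = Array.ofDim[Byte](bdw - 2)
      IOUtils.read(inputStream, bdwBuffer)
      // Drop the remaining 2 bytes of the BDW
      val block: Array[Byte] = bdwBuffer.drop(2)
      var rdw = 0
      while (rdw < bdw - 4) {
        // The first 2 bytes of the 4-byte RDW hold the record length (header included)
        val rdwSize = BigInt(block.slice(rdw, rdw + 2)).toInt
        val record = block.slice(rdw + 4, rdw + rdwSize)
        outStream.write(record)
        rdw = rdw + rdwSize
      }
    }
  } finally {
    inputStream.close()
    outStream.close()
  }
}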

Can we avoid creating a temp file and calculate the hash (MD5 or SHA-256) of the record bytes directly?
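One way to skip the temp file is to feed each record payload into a `java.security.MessageDigest` as the blocks are parsed. The following is a minimal sketch rather than a Cobrix feature. It assumes the same layout as the snippet above: 4-byte BDW and RDW headers whose first two bytes are a big-endian length that includes the header itself. The names `RecordChecksum` and `recordDigest` are hypothetical.

```scala
import java.io.{DataInputStream, InputStream}
import java.security.MessageDigest

object RecordChecksum {
  /** Hashes record payloads only, skipping the 4-byte BDW and RDW headers (assumed layout). */
  def recordDigest(in: InputStream, algorithm: String = "SHA-256"): String = {
    val digest = MessageDigest.getInstance(algorithm)
    val data = new DataInputStream(in)
    var first = data.read() // -1 signals end of input
    while (first >= 0) {
      // Block length: big-endian, first 2 bytes of the BDW, header included
      val blockLen = (first << 8) | data.readUnsignedByte()
      require(blockLen >= 4, s"Bad BDW length $blockLen")
      data.readUnsignedShort() // remaining 2 BDW bytes, unused here
      val block = new Array[Byte](blockLen - 4)
      data.readFully(block)
      var pos = 0
      while (pos < block.length) {
        // Record length: big-endian, first 2 bytes of the RDW, header included
        val recLen = ((block(pos) & 0xFF) << 8) | (block(pos + 1) & 0xFF)
        require(recLen >= 4 && pos + recLen <= block.length,
          s"Bad RDW length $recLen at block offset $pos")
        digest.update(block, pos + 4, recLen - 4) // payload only
        pos += recLen
      }
      first = data.read()
    }
    // Render the digest as lowercase hex
    digest.digest().map(b => f"$b%02x").mkString
  }
}
```

Calling `RecordChecksum.recordDigest(new java.io.FileInputStream(infile), "MD5")` would then return the hex digest without writing anything to disk. Only one block is held in memory at a time.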

sree018 avatar Feb 28 '25 12:02 sree018

Generating a checksum won't be easy after you convert your data to another format. Parquet data types have a different internal representation, so the file bytes will differ. If you want to ensure completeness, you can get the original bytes of each record:

.option("generate_record_id", "true")
.option("generate_record_bytes", "true")

Then you can order the records by File_Id and Record_Id and concatenate all values from Record_Bytes. You can collect the result as an array of bytes and stream it via ByteBufferInputStream into your checksum calculation.
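As a rough sketch of the last step, the ordered Record_Bytes values can be folded into a `MessageDigest` one record at a time, which gives the same result as hashing the concatenated bytes without building one large array. The helper below is hypothetical and is not part of Cobrix. It assumes the caller supplies the byte arrays already ordered by File_Id and Record_Id.

```scala
import java.security.MessageDigest

object OrderedRecordDigest {
  /** Hashes the records in iteration order; equivalent to hashing their concatenation. */
  def digestOf(records: Iterator[Array[Byte]], algorithm: String = "SHA-256"): String = {
    val digest = MessageDigest.getInstance(algorithm)
    // Each update appends to the running hash, so the order of records matters
    records.foreach(bytes => digest.update(bytes))
    // Render the digest as lowercase hex
    digest.digest().map(b => f"$b%02x").mkString
  }
}
```

With Spark, the iterator could come from the ordered DataFrame, for example by selecting Record_Bytes and iterating over the rows locally. Whether this matches the upstream checksum depends on the upstream system hashing exactly the same bytes as Record_Bytes contains, so it is worth verifying on a small file first.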

yruslan avatar Mar 04 '25 10:03 yruslan