
tfio.IODataset.from_parquet changing the magic number of jpeg files

HoltSpalding opened this issue 4 years ago · 1 comment

Hello, I have Parquet files containing the raw bytes of many JPEG images that I want to read in as a tf.data.Dataset. Iterating through each row of the Parquet file with pyspark, I can successfully read every JPEG, and each image byte string begins with the expected JPEG magic number (b'\xff\xd8\xff\xe0'). However, when using the tfio.IODataset.from_parquet method, I noticed the magic numbers of 85-90% of the images in my dataset are modified, making them impossible to decode (random examples of modified starting bytes include b'\xd8\xff\xe0\x00' and b'G\xf8\xff\x00'). The remaining 10-15% of the images' bytes appear to be unchanged.

I'm not exactly sure what's happening yet, but I can provide more details if others would like to recreate the issue. Thank you.
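For anyone trying to reproduce this, a minimal sketch of the magic-number check described above (the helper function is hypothetical, not part of tensorflow-io; the byte prefixes are the examples from this report):

```python
# Hypothetical helper: check whether byte strings read back from a Parquet
# column still begin with the JPEG magic number, as described in the report.

JPEG_MAGIC = b'\xff\xd8\xff'  # every JPEG stream starts with these 3 bytes


def is_valid_jpeg(data: bytes) -> bool:
    """Return True if the byte string starts with the JPEG magic number."""
    return data[:3] == JPEG_MAGIC


# Example byte prefixes from the report: one intact, two corrupted.
samples = [
    b'\xff\xd8\xff\xe0',  # expected JPEG/JFIF prefix -> valid
    b'\xd8\xff\xe0\x00',  # leading 0xFF byte missing -> invalid
    b'G\xf8\xff\x00',     # garbled -> invalid
]
results = [is_valid_jpeg(s) for s in samples]
print(results)  # [True, False, False]
```

Running a check like this over every row makes it easy to measure the 85-90% corruption rate mentioned above.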

HoltSpalding · Nov 29 '21 18:11

I'm having the same issue with encoded PNG byte strings in tensorflow-io 0.33.0.

Some additional details:

  • The first 1024 images are always read correctly.
  • Smaller files aren't affected. Every image in a file with 2048 rows was read correctly, but increasing that to 3000 rows (or more) consistently reproduces the error.
  • In some (but not all) cases, corrupted images return different bytes each time they are accessed.
  • Decreasing the step value in ParquetIODataset.__init__ decreases the number of corrupted records, but also seems to increase the time it takes to read into memory. I'm not sure how significant or relevant that is.

I'm happy to share an example Parquet file with code to reproduce the error.
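A sketch of the diagnostic I used to find the pattern above (the helper is hypothetical; it simply checks each row's prefix against the 8-byte PNG signature and records which indices fail):

```python
# Hypothetical diagnostic: list the indices of rows whose bytes no longer
# start with the PNG signature, to see whether corruption begins after a
# fixed offset (the first 1024 rows were always intact in my tests).

PNG_MAGIC = b'\x89PNG\r\n\x1a\n'  # 8-byte PNG file signature


def corrupted_indices(rows):
    """Return indices of byte strings that lack the PNG signature."""
    return [i for i, data in enumerate(rows) if data[:8] != PNG_MAGIC]


# Toy data: rows 0-2 intact, row 3 corrupted.
rows = [PNG_MAGIC + b'...', PNG_MAGIC + b'...', PNG_MAGIC, b'\x00' * 8]
print(corrupted_indices(rows))  # [3]
```

Applied to a real ParquetIODataset read, this is how I confirmed that files with 2048 rows come back clean while 3000+ rows reliably produce corrupted indices.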

aazuspan · Aug 09 '23 18:08