HybridBackend

to_sparse failed for Value with ragged_rank > 1 read from parquet file

Open SamJia opened this issue 3 years ago • 9 comments

Current behavior

When hb reads nested lists with ragged_rank > 1, the resulting Value cannot be converted to a SparseTensor by hb.data.to_sparse.

For example, dense_feature is one of the features read by hb.data.ParquetDataset, and to_sparse does not work for it.

Moreover, if I swap the order of the two nested_row_splits, to_sparse succeeds.

So maybe the order of the nested_row_splits is incorrect when reading a parquet file?

Expected behavior

The Value read from a parquet file can be converted to a SparseTensor.

System information

  • GPU model and memory: No
  • OS Platform: Ubuntu
  • Docker version: No
  • GCC/CUDA/cuDNN version: 7.4/No/No
  • Python/conda version: 3.6.13/4.13.0
  • TensorFlow/PyTorch version: 1.14.0

Code to reproduce

import tensorflow as tf
import hybridbackend.tensorflow as hb
dataset = hb.data.ParquetDataset("test2.zstd.parquet", batch_size=1)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
sess = tf.Session()
vals = sess.run(next_element)

# One more simple demo:
import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(
    values=np.array([1, 2, 3, 4, 5]),
    nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4])))
sess = tf.Session()
sess.run(val.to_sparse())

Willing to contribute

Yes

SamJia avatar Aug 02 '22 06:08 SamJia

Thanks for your report, I will look into it.

2sin18 avatar Aug 02 '22 06:08 2sin18

An example to create a parquet dataset file and reproduce the error:

# Create parquet file
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[[1], [2, 3]], [[4], [5]]], pa.list_(pa.list_(pa.int64())))
table = pa.Table.from_arrays([arr], ['test'])
pq.write_table(table, 'test.zstd.parquet', compression='ZSTD')

# Reading the parquet file
import tensorflow as tf
import hybridbackend.tensorflow as hb

dataset = hb.data.ParquetDataset("test.zstd.parquet", batch_size=2)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()  
sess = tf.Session()
vals = sess.run(next_element)
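
For reference, the nested list written above can be flattened by hand into the values and row splits that a ragged representation would carry. This is a minimal pure-Python sketch, assuming the TensorFlow convention that row splits are listed outermost first; flatten_two_levels is a hypothetical helper for illustration and is not part of HybridBackend or pyarrow.

```python
def flatten_two_levels(data):
    """Flatten a list of lists of lists into values plus (outer, inner) row splits."""
    values, inner, outer = [], [0], [0]
    for row in data:
        for sub in row:
            values.extend(sub)
            # Each inner split marks where a sub-list ends in the flat values.
            inner.append(len(values))
        # Each outer split marks where a row ends, counted in inner sub-lists.
        outer.append(len(inner) - 1)
    return values, (outer, inner)


data = [[[1], [2, 3]], [[4], [5]]]
values, nested_row_splits = flatten_two_levels(data)
print(values)             # [1, 2, 3, 4, 5]
print(nested_row_splits)  # ([0, 2, 4], [0, 1, 3, 4, 5])
```

Under that assumption the outer splits [0, 2, 4] come first, which is the reverse of the tuple used in the simple demo.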

SamJia avatar Aug 03 '22 06:08 SamJia

It seems this error still exists in 0.8.0

DelightRun avatar May 29 '23 11:05 DelightRun

@DelightRun Could you try the latest commit ?

2sin18 avatar May 29 '23 11:05 2sin18

@DelightRun Could you try the latest commit ?

I use your pre-built v0.8.0 wheel package with TensorFlow 1.15.0. It's not very convenient for me to compile from source (I use this in our prod env, which has several limits).

However, it seems the problem is that nested_row_splits needs to be reversed:

WRONG CODE

# One more simple demo:
import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(
    values=np.array([1, 2, 3, 4, 5]),
    nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4])))
sess = tf.Session()
sess.run(val.to_sparse())

RIGHT CODE

# One more simple demo:
import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(
    values=np.array([1, 2, 3, 4, 5]),
    # Reversing the tuple puts the outer splits [0, 2, 4] first.
    nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4]))[::-1])
sess = tf.Session()
sess.run(val.to_sparse())
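
Why reversing helps can be checked without HybridBackend. The sketch below assumes the outermost-first ordering that tf.RaggedTensor.from_nested_row_splits uses: the last entry of each level must equal the number of rows of the next level, and the last entry of the innermost level must equal the number of values. Whether HybridBackend's Value is meant to follow the same convention is an assumption here, and splits_are_consistent is a hypothetical helper.

```python
def splits_are_consistent(nested_row_splits, num_values):
    """Check that splits chain correctly when read outermost first."""
    for i, splits in enumerate(nested_row_splits):
        splits = list(splits)
        if i + 1 < len(nested_row_splits):
            # An outer level partitions the rows of the next level.
            covered = len(nested_row_splits[i + 1]) - 1
        else:
            # The innermost level partitions the flat values.
            covered = num_values
        if not splits or splits[0] != 0 or splits[-1] != covered:
            return False
        # Splits must be non-decreasing.
        if any(a > b for a, b in zip(splits, splits[1:])):
            return False
    return True


inner = [0, 1, 3, 4, 5]
outer = [0, 2, 4]
as_read = splits_are_consistent((inner, outer), 5)
reversed_order = splits_are_consistent((inner, outer)[::-1], 5)
print(as_read, reversed_order)  # False True
```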

DelightRun avatar May 30 '23 02:05 DelightRun

You are right. The issue has been fixed, but the fix might not have been released for your platform. Which Python version, CUDA version (or CPU-only), and TensorFlow version do you use? I plan to release v1.0 in the next few days.

2sin18 avatar May 30 '23 02:05 2sin18

  • Python == 3.6
  • TensorFlow == 1.15.5
  • CPU-Only

HybridBackend is installed via pip from this wheel: https://files.pythonhosted.org/packages/11/de/1408b520b9e4eed382ee068aba170d6089d3731f0d51ce3e898b0cb2aef6/hybridbackend_tf115_cpu-0.8.0-cp36-cp36m-manylinux_2_24_x86_64.whl

DelightRun avatar May 30 '23 04:05 DelightRun

@DelightRun Could you try the latest commit ?

I tried the latest commit (compiled via Docker) and still get this error. RaggedTensor with rank >= 2 seems pretty buggy.

DelightRun avatar May 30 '23 10:05 DelightRun

Hi @DelightRun, I tried your previous demo (with the API adjusted accordingly) against the latest commit (4486ba138515a1dbdb6f7d542d7ad23a27476524):

# Create parquet file
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[[1], [2, 3]], [[4], [5]]], pa.list_(pa.list_(pa.int64())))
table = pa.Table.from_arrays([arr], ['test'])
pq.write_table(table, './test.zstd.parquet', compression='ZSTD')

# Reading the parquet file
import tensorflow as tf
import hybridbackend.tensorflow as hb

dataset = hb.data.ParquetDataset("./test.zstd.parquet", batch_size=2)
dataset = dataset.apply(hb.data.parse())
next_element = tf.data.make_one_shot_iterator(dataset).get_next()
sess = tf.Session()
vals = sess.run(next_element)
print(vals)

The output is

{'test': SparseTensorValue(indices=array([[0, 0, 0],
       [0, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [1, 1, 0]]), values=array([1, 2, 3, 4, 5]), dense_shape=array([2, 2, 2]))}
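
As a cross-check that does not depend on HybridBackend, the indices and dense_shape above can be derived by hand from the row splits. This is a pure-Python sketch under the assumption that the splits are ordered outermost first; expand and to_sparse_indices are hypothetical helpers, not library functions.

```python
def expand(splits):
    """For each element covered by splits, return (row, offset within row)."""
    coords = []
    for row in range(len(splits) - 1):
        for pos in range(splits[row], splits[row + 1]):
            coords.append((row, pos - splits[row]))
    return coords


def to_sparse_indices(nested_row_splits):
    """Derive sparse indices and dense shape from outermost-first row splits."""
    # The innermost level gives every value a (row, column) pair.
    indices = [list(c) for c in expand(nested_row_splits[-1])]
    # Walking outwards, the leading row id becomes (outer row, offset).
    for splits in reversed(nested_row_splits[:-1]):
        coords = expand(splits)
        indices = [list(coords[idx[0]]) + idx[1:] for idx in indices]
    # First dimension: number of outer rows; others: longest row per level.
    dense_shape = [len(nested_row_splits[0]) - 1]
    for splits in nested_row_splits:
        lengths = (b - a for a, b in zip(splits, splits[1:]))
        dense_shape.append(max(lengths, default=0))
    return indices, dense_shape


indices, dense_shape = to_sparse_indices(([0, 2, 4], [0, 1, 3, 4, 5]))
print(indices)      # [[0, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
print(dense_shape)  # [2, 2, 2]
```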

It seems OK. Could you reproduce this result? My env is python == 3.6, tensorflow == 1.15.5, hybridbackend == 1.0.0 (cpu-only).

francktcheng avatar May 30 '23 10:05 francktcheng