Apache ORC Support in TensorFlow IO

Open oliverhu opened this issue 4 years ago • 7 comments

(Creating this issue for visibility so that anyone interested can join the discussion.)

Overview

Load Apache ORC-formatted data natively into TensorFlow from any file system that TensorFlow supports, e.g. HDFS or local disk.

Motivation

We traditionally use Avro to store our datasets, but a row-based format is becoming inefficient for big data analytics processing. Historically, we selected ORC as our columnar storage format. (Not planning to argue Parquet vs ORC here ;))

Design Discussions

  • Apache ORC would be brought in as a build dependency via https://github.com/bazelbuild/rules_foreign_cc
  • Feature-wise, I expect the APIs to be similar to those of the Parquet or Arrow readers.

Milestones

  • [x] Add Apache ORC build dependency.
  • [x] Implement a simple ORC dataset that maps records in ORC files into Tensors.
  • [x] Add a tutorial for the ORC reader.
  • [ ] Feature schema support: support SparseTensor and VarLenFeature.
  • [ ] Feature schema support: support dense tensors via FixedLenFeature only (following parse_example_v2).
  • [ ] Usability improvements
  • [ ] Performance tuning
  • [ ] Feature schema support: support RaggedTensor
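
To make the three feature-schema milestones concrete, here is a minimal pure-Python sketch of the layouts they refer to: a fixed-length dense layout (what FixedLenFeature expects), a COO triple of indices, values, and dense shape (the layout of a SparseTensor), and flat values plus row splits (the layout of a RaggedTensor). The helper names `to_dense`, `to_sparse`, and `to_ragged` are hypothetical and are not part of tensorflow-io or the ORC reader; this only illustrates what a variable-length ORC list column would have to be mapped to.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def to_dense(rows: Sequence[Row], length: int) -> List[List[float]]:
    """Fixed-length layout: every row must hold exactly `length` values."""
    for i, row in enumerate(rows):
        if len(row) != length:
            raise ValueError(f"row {i} has {len(row)} values, expected {length}")
    return [list(row) for row in rows]


def to_sparse(rows: Sequence[Row]) -> Tuple[List[List[int]], List[float], List[int]]:
    """COO layout: (indices, values, dense_shape), as used by a SparseTensor."""
    indices = [[r, c] for r, row in enumerate(rows) for c in range(len(row))]
    values = [v for row in rows for v in row]
    widest = max((len(row) for row in rows), default=0)
    return indices, values, [len(rows), widest]


def to_ragged(rows: Sequence[Row]) -> Tuple[List[float], List[int]]:
    """Ragged layout: flat values plus row_splits marking row boundaries."""
    values = [v for row in rows for v in row]
    row_splits = [0]
    for row in rows:
        row_splits.append(row_splits[-1] + len(row))
    return values, row_splits


# Hypothetical variable-length column: three records with 2, 0, and 1 values.
rows = [[1.0, 2.0], [], [3.0]]
sparse = to_sparse(rows)
ragged = to_ragged(rows)
```

For this input, the sparse form records only the three populated positions, while the ragged form keeps the empty second record visible as a repeated entry in `row_splits`. Calling `to_dense(rows, 2)` raises a `ValueError`, which is why a FixedLenFeature-only reader cannot represent such a column.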

oliverhu avatar Apr 21 '21 17:04 oliverhu

@oliverhu Any update on this?

kvignesh1420 avatar Jun 15 '21 17:06 kvignesh1420

No recent updates, @kvignesh1420.

oliverhu avatar Jun 15 '21 18:06 oliverhu

@oliverhu Can we document the current feature in the form of a tutorial?

kvignesh1420 avatar Jun 15 '21 18:06 kvignesh1420

Sure, will add that!

oliverhu avatar Jun 15 '21 22:06 oliverhu

For reference, FYI: https://github.com/tensorflow/io/tree/master/docs/tutorials

kvignesh1420 avatar Jun 16 '21 07:06 kvignesh1420

Is HDFS supported now? Loading from an HDFS path results in a core dump:

dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

372046933 avatar Mar 18 '22 03:03 372046933

Is HDFS supported now? Loading from an HDFS path results in a core dump:

dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

HDFS is supported (with Kerberos) by https://github.com/tensorflow/io/pull/1674

372046933 avatar Apr 28 '22 07:04 372046933