Igor Dvorzhak
I cannot find any documentation in the [GCS API](https://cloud.google.com/storage/docs/json_api/v1/objects/list) for server-side glob filtering support. Could you provide a curl command that demonstrates filter pushdown in the format: ```bash curl "https://www.googleapis.com/storage/v1/b//o?prefix="...
We need to address globbing performance in two phases: 1. Run the default glob and flat glob algorithms concurrently and return the result as soon as one finishes. 2. Parallelize flat glob...
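The first phase above can be sketched as a race between two strategies. This is only an illustrative sketch in Python, not the connector's code: `race`, `default_glob`, `flat_glob`, and the in-memory `OBJECTS` list are hypothetical names, and the `time.sleep` merely simulates one strategy being slower than the other.

```python
import concurrent.futures
import fnmatch
import time

# Hypothetical in-memory stand-in for the object names in a bucket.
OBJECTS = ["logs/2020/a.orc", "logs/2020/b.orc", "logs/2021/c.orc", "tmp/x.txt"]


def default_glob(pattern):
    # Stand-in for the default algorithm; the sleep simulates it being slower.
    time.sleep(0.2)
    return sorted(n for n in OBJECTS if fnmatch.fnmatchcase(n, pattern))


def flat_glob(pattern):
    # Stand-in for flat glob: one flat listing, filtered on the client side.
    return sorted(n for n in OBJECTS if fnmatch.fnmatchcase(n, pattern))


def race(strategies, pattern):
    """Run all strategies concurrently and return the first successful result."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=len(strategies))
    futures = [executor.submit(strategy, pattern) for strategy in strategies]
    try:
        last_error = None
        for future in concurrent.futures.as_completed(futures):
            try:
                return future.result()
            except Exception as error:
                # A failed strategy is skipped; the other one may still succeed.
                last_error = error
        raise last_error
    finally:
        # Do not block on the slower strategy. A thread that is already
        # running cannot be interrupted, so it finishes in the background.
        executor.shutdown(wait=False)


result = race([default_glob, flat_glob], "logs/2020/*.orc")
```

Both strategies must return equivalent results for the race to be safe. The losing strategy still consumes requests until it completes, which is the price of the lower latency.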
Thanks for the report, I will take a look at this.
Is this reproducible with GCS connector 2.2.4?
It seems we can still fix this issue as suggested above?
@sidseth I have made some optimizations to address this issue in https://github.com/GoogleCloudPlatform/bigdata-interop/pull/110, and I plan to mainline them soon. Could you check whether they help your use case?
We just released GCS connector [1.9.2](https://github.com/GoogleCloudPlatform/bigdata-interop/releases/tag/v1.9.2), which includes all the performance optimizations. To take advantage of all available optimizations, set the following properties:

`fs.gs.inputstream.fadvise=RANDOM`
`fs.gs.io.buffersize=524288`
`fs.gs.inputstream.footer.prefetch.size=65536`
`fs.gs.performance.cache.enable=true`
`fs.gs.performance.cache.max.entry.age.ms=1800000`
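For a Spark job, one way to pass the properties above is as `--conf` flags with the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration. The sketch below only builds the argument list. The helper `spark_conf_args` is a hypothetical name, and the unit comments are plain arithmetic on the values, not statements from the connector documentation.

```python
# Property values copied from the comment above.
GCS_PROPS = {
    "fs.gs.inputstream.fadvise": "RANDOM",
    "fs.gs.io.buffersize": "524288",                       # 512 * 1024
    "fs.gs.inputstream.footer.prefetch.size": "65536",     # 64 * 1024
    "fs.gs.performance.cache.enable": "true",
    "fs.gs.performance.cache.max.entry.age.ms": "1800000", # 30 minutes in ms
}


def spark_conf_args(props):
    """Turn Hadoop properties into spark-submit '--conf' arguments."""
    args = []
    for key, value in props.items():
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


# Print one flag pair per line, ready to paste after 'spark-submit'.
args = spark_conf_args(GCS_PROPS)
for flag, setting in zip(args[::2], args[1::2]):
    print(flag, setting)
```

Outside Spark, the same key/value pairs would go into the Hadoop configuration (for example `core-site.xml`) without the prefix.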
The buffer size [limits](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/v1.9.2/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageReadChannel.java#L872) the minimum HTTP range request size in `RANDOM` mode. In my SparkSQL tests with ORC files it led to redundant data transfer (up to 2x) - my guess...
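The interaction described above can be illustrated with a deliberately simplified model. It is not the connector's logic. It assumes every read is non-adjacent and triggers its own range request, that over-fetched bytes are never reused, and that the request size is simply floored at the buffer size. The function names and the read sizes are hypothetical and are not measurements from the tests mentioned above.

```python
def range_request_size(bytes_needed, min_range_size, bytes_remaining):
    """Size of one range request when requests are floored at min_range_size.

    The request never extends past the end of the object.
    """
    return min(max(bytes_needed, min_range_size), bytes_remaining)


def transfer_overhead(read_sizes, min_range_size, bytes_remaining=10**9):
    """Ratio of bytes transferred to bytes actually needed (1.0 = no waste)."""
    transferred = sum(
        range_request_size(n, min_range_size, bytes_remaining) for n in read_sizes
    )
    return transferred / sum(read_sizes)


# Hypothetical workload: four scattered 256 KiB reads with a 512 KiB floor.
overhead = transfer_overhead([256 * 1024] * 4, min_range_size=524288)
print(overhead)
```

In this model the overhead shrinks toward 1.0 as reads approach or exceed the floor, which is why the choice of buffer size depends on the typical read size of the file format.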
Thanks for sharing the test results! 1. Yes, for footer reads alone `RANDOM` mode is not necessarily beneficial, because the footer is relatively small and at the end of the file, so...