Igor Dvorzhak comments

Results: 52 comments of Igor Dvorzhak

globStatus call runs very slowly from Hadoop to GCS compared with gsutil

I cannot find any documentation in the [GCS API](https://cloud.google.com/storage/docs/json_api/v1/objects/list) for server-side glob filtering support. Could you provide a curl command that demonstrates filter pushdown, in the format: ```bash curl "https://www.googleapis.com/storage/v1/b//o?prefix="...
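To make the question concrete, here is a minimal sketch of what a listing looks like when only the `prefix` parameter is pushed to the server and the rest of the glob is applied on the client. It assumes, as the comment suggests, that no server-side glob filter is available. The bucket name, object names, and helper functions are hypothetical; no request is actually sent.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlencode

API_ROOT = "https://www.googleapis.com/storage/v1"


def literal_prefix(pattern):
    """Return the part of the glob before its first wildcard character."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            return pattern[:i]
    return pattern


def list_url(bucket, pattern):
    """Build an objects.list URL that pushes down only the literal prefix."""
    query = urlencode({"prefix": literal_prefix(pattern)})
    return f"{API_ROOT}/b/{bucket}/o?{query}"


def matches(name, pattern):
    """Match segment by segment so that '*' does not cross a '/'."""
    name_parts, pattern_parts = name.split("/"), pattern.split("/")
    return len(name_parts) == len(pattern_parts) and all(
        fnmatchcase(n, p) for n, p in zip(name_parts, pattern_parts)
    )


def filter_names(names, pattern):
    """Apply the full glob on the client to the names a listing returned."""
    return [n for n in names if matches(n, pattern)]
```

With a pattern such as `logs/2019-*/part-?.orc`, only `logs/2019-` reaches the server; every object under that prefix is returned and the wildcards are resolved afterwards by `filter_names`.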

globStatus call runs very slowly from Hadoop to GCS compared with gsutil

We need to address globbing performance in two phases: 1. Run the default glob and flat glob algorithms concurrently and return the result as soon as one of them finishes. 2. Parallelize flat glob...
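Phase 1 can be sketched as a race between two strategies. This is only an illustration over an in-memory list of object names, not the connector's implementation: `default_glob`, `flat_glob`, and `race_globs` are hypothetical stand-ins, with the default strategy expanding one path segment at a time and the flat strategy filtering a single prefix listing.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from fnmatch import fnmatchcase


def default_glob(names, pattern):
    """Expand the pattern one path segment at a time, like a directory walk."""
    paths = {()}  # matched path prefixes, stored as tuples of segments
    for depth, segment in enumerate(pattern.split("/")):
        next_paths = set()
        for name in names:
            parts = tuple(name.split("/"))
            if (len(parts) > depth and parts[:depth] in paths
                    and fnmatchcase(parts[depth], segment)):
                next_paths.add(parts[:depth + 1])
        paths = next_paths
    return sorted(n for n in names if tuple(n.split("/")) in paths)


def flat_glob(names, pattern):
    """Take everything under the literal prefix, then filter on the client."""
    prefix = pattern
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            prefix = pattern[:i]
            break
    name_parts = pattern.split("/")
    return sorted(
        n for n in names
        if n.startswith(prefix)
        and len(n.split("/")) == len(name_parts)
        and all(fnmatchcase(a, b) for a, b in zip(n.split("/"), name_parts))
    )


def race_globs(names, pattern):
    """Run both strategies concurrently and return whichever finishes first."""
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        futures = [
            pool.submit(default_glob, names, pattern),
            pool.submit(flat_glob, names, pattern),
        ]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Do not block on the slower strategy; its result is discarded.
        pool.shutdown(wait=False)
```

Because both strategies return the same set of names, the caller does not need to know which one won. If the first strategy to finish raises, this sketch propagates the error instead of waiting for the other one.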

GCS Connector runs into StackOverflow while creating hadoop credential

Thanks for the report, I will look into this.

GCS Connector runs into StackOverflow while creating hadoop credential

Is this reproducible with GCS connector 2.2.4?

Improve HTTP request statistics gathering

/gcbrun

Status of hsync/hflush and suitability for backing HBase

It seems we can still fix this issue as suggested above?

High latency reads in GCS connector

@sidseth I have made some optimizations to address this issue in https://github.com/GoogleCloudPlatform/bigdata-interop/pull/110, and I plan to mainline them soon. Could you check whether they help your use case?

High latency reads in GCS connector

We just released GCS connector [1.9.2](https://github.com/GoogleCloudPlatform/bigdata-interop/releases/tag/v1.9.2), which includes all the performance optimizations. To take advantage of all available optimizations, set the following properties: `fs.gs.inputstream.fadvise=RANDOM`, `fs.gs.io.buffersize=524288`, `fs.gs.inputstream.footer.prefetch.size=65536`, `fs.gs.performance.cache.enable=true`, `fs.gs.performance.cache.max.entry.age.ms=1800000`
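One way to apply these settings is through Hadoop's `core-site.xml`. The sketch below renders the five properties into the standard `<configuration>`/`<property>` layout; the helper name `to_core_site_xml` is hypothetical, and the values are copied from the comment unchanged (524288 bytes is 512 KiB, 65536 bytes is 64 KiB, and 1800000 ms is 30 minutes).

```python
import xml.etree.ElementTree as ET

# Property names and values exactly as listed in the comment above.
GCS_READ_TUNING = {
    "fs.gs.inputstream.fadvise": "RANDOM",
    "fs.gs.io.buffersize": "524288",
    "fs.gs.inputstream.footer.prefetch.size": "65536",
    "fs.gs.performance.cache.enable": "true",
    "fs.gs.performance.cache.max.entry.age.ms": "1800000",
}


def to_core_site_xml(props):
    """Render a dict of Hadoop properties as a core-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_core_site_xml(GCS_READ_TUNING))
```

The generated properties would normally be merged into an existing `core-site.xml` rather than replacing it.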

High latency reads in GCS connector

The buffer size [limits](https://github.com/GoogleCloudPlatform/bigdata-interop/blob/v1.9.2/gcsio/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageReadChannel.java#L872) the minimum HTTP range request size in `RANDOM` mode. In my SparkSQL tests with ORC files it led to redundant data transfer (up to 2x) - my guess...
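The effect of such a floor can be illustrated with a little arithmetic. The sketch below is an assumption about the general shape of the rule (request at least `buffer_size` bytes, clamped to the end of the object); it does not reproduce the linked connector code, and both function names are hypothetical.

```python
def range_request(position, bytes_wanted, object_size, buffer_size):
    """Return the (start, end_inclusive) byte range a read would request."""
    length = max(bytes_wanted, buffer_size)  # buffer size acts as a floor
    end = min(position + length, object_size) - 1  # never pass the object end
    return position, end


def overfetch_ratio(position, bytes_wanted, object_size, buffer_size):
    """Ratio of bytes transferred to bytes the caller actually asked for."""
    start, end = range_request(position, bytes_wanted, object_size, buffer_size)
    fetched = end - start + 1
    useful = min(bytes_wanted, object_size - position)
    return fetched / useful


if __name__ == "__main__":
    ten_mib = 10 * 1024 * 1024
    # A 256 KiB read with a 512 KiB buffer transfers twice the needed bytes.
    print(overfetch_ratio(0, 256 * 1024, ten_mib, 512 * 1024))
```

Under this assumption, any read smaller than the buffer size transfers extra bytes, so a smaller `fs.gs.io.buffersize` reduces overfetch for small random reads at the cost of more requests for larger ones.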

High latency reads in GCS connector

Thanks for sharing the test results! 1. Yes, for footer-only reads `RANDOM` mode is not necessarily beneficial, because the footer is relatively small and sits at the end of the file, so...