Retry issues caused by DNS resolution
Environment details
- Specify the API at the beginning of the title: bigquery
- OS type and version: macOS
- Java version: 1.8
- bigquery version(s): 1.117.1
Steps to reproduce
Even with retry settings configured, an error during DNS resolution on the HTTP/gRPC API does not seem to be retried. Google API Extensions for Java (gax-java) treats this as a non-retryable case, and the API does not seem to allow retries for errors of this sort:
stacktrace:
com.google.cloud.bigquery.BigQueryException: www.googleapis.com
    at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:113)
    at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.create(HttpBigQueryRpc.java:213)
    at com.google.cloud.bigquery.BigQueryImpl$5.call(BigQueryImpl.java:327)
    at com.google.cloud.bigquery.BigQueryImpl$5.call(BigQueryImpl.java:324)
    at
Is there any option to retry these exceptions? Could this be added as a feature, given that the gax retry API is internal to the client?
Thank you for your help.
The BigQuery client uses gax-java for retrying. Hence, transferring this ticket to gax-java for triaging.
Hello @miraleung, I see you labeled the issue as a feature request. Could you elaborate a bit on your thoughts after triage? I am happy to contribute; I would rather have a retry option for this case handled by the client than build a wrapper.
The issue persists, and is now becoming a bit worse for us as we move more jobs to a new implementation that does not use a custom-made retry wrapper.
We know there is an underlying issue causing java.net.UnknownHostException: www.googleapis.com to appear, but this should be retried.
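To make the failure mode concrete, here is a minimal sketch of how calling code could recognise this case: it walks the cause chain of a thrown exception looking for java.net.UnknownHostException. The class and method names (DnsFailures, isDnsFailure) are hypothetical and are not part of gax-java or the BigQuery client; only the JDK is used.

```java
import java.net.UnknownHostException;

// Hypothetical helper, not part of gax-java or google-cloud-bigquery.
public final class DnsFailures {
    private DnsFailures() {}

    /**
     * Returns true if the given throwable, or any throwable in its cause
     * chain, is an UnknownHostException (i.e. a DNS resolution failure).
     * The depth limit guards against pathological cyclic cause chains.
     */
    public static boolean isDnsFailure(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }
}
```

In the case reported here, the BigQueryException carries the UnknownHostException as its cause, so a check of this kind would match even though the client reports the exception as not retryable.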
Thanks @pegerto, @stephaniewang526 and I will take a look. In the meantime, if you could put a repro case and steps here, that would be great. :)
Hello @miraleung @stephaniewang526
Thank you very much for your help. It may be my misunderstanding about the retry scope that we can expect from the client.
Imagine a continuous thread processing data. To avoid going into our specific use case, this example just counts the datasets instead of instantiating BigQuery jobs.
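import com.google.api.gax.retrying.RetrySettings
import com.google.cloud.bigquery.{BigQuery, BigQueryException, BigQueryOptions}
// Assumption: Stopwatch and the `10.seconds` syntax come from Twitter util
// (older versions expose the conversions as com.twitter.conversions.time._).
import com.twitter.conversions.DurationOps._
import com.twitter.util.Stopwatch
import org.threeten.bp.Duration
import scala.annotation.tailrec
import scala.collection.JavaConverters._
import scala.util.Try

val retrySettings = RetrySettings.newBuilder()
  .setMaxAttempts(0) // 0 = no attempt limit; retries are bounded by the total timeout
  .setTotalTimeout(Duration.ofMinutes(2))
  .build()
val client: BigQuery = BigQueryOptions.newBuilder()
  .setRetrySettings(retrySettings)
  .build().getService
val watch = Stopwatch.start()

// Count the datasets every 10 seconds until a call fails.
@tailrec
def workLoop: Unit = {
  println(s"${watch()} - ${client.listDatasets().iterateAll().asScala.size}")
  Thread.sleep(10.seconds.toMillis)
  workLoop
}

// Report whether the client considers the failure retryable, and its cause.
Try(workLoop).recover {
  case e: BigQueryException =>
    println(s"${watch()} - $e \n retryable: ${e.isRetryable} \n ${e.getCause}")
}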
If I disconnect the Wi-Fi while this loop is running, to simulate the kind of glitch we see in our production systems, I get the following output:
73.milliseconds+581.microseconds+738.nanoseconds - 42
10.seconds+954.milliseconds+122.microseconds+715.nanoseconds - 42
21.seconds+474.milliseconds+156.microseconds+770.nanoseconds - 42
32.seconds+757.milliseconds+828.microseconds+672.nanoseconds - com.google.cloud.bigquery.BigQueryException: www.googleapis.com
retryable: false
java.net.UnknownHostException: www.googleapis.com
Process finished with exit code 0
It seems clear that the DNS resolution failure is not a retryable exception for gax, but it is a recoverable one. We would expect it to be handled by the retry settings and retried linearly or exponentially for 2 minutes.
We could implement a code solution for the network glitches we are observing, but this would duplicate the retry wrapper offered by the gax implementation. So we would rather have a discussion first to clarify our expectations.
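For reference, the kind of wrapper we would otherwise have to write looks roughly like the sketch below. It is only an illustration of the behaviour we expected from the retry settings (exponential backoff bounded by a total timeout, retrying only when the cause chain contains UnknownHostException). The class name DnsRetry, its method names, and its parameters are hypothetical; it uses only the JDK and does not reflect how gax-java implements retries.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical wrapper, not part of gax-java or google-cloud-bigquery.
public final class DnsRetry {
    private DnsRetry() {}

    /** True if t or any of its causes is an UnknownHostException. */
    static boolean causedByDnsFailure(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the call, retrying with exponential backoff while it fails with a
     * DNS resolution error and the total timeout has not elapsed. Any other
     * exception, or a DNS failure after the deadline, is rethrown unchanged.
     */
    public static <T> T callWithRetry(Callable<T> call,
                                      long initialDelayMs,
                                      double multiplier,
                                      long maxDelayMs,
                                      long totalTimeoutMs) throws Exception {
        long deadline = System.nanoTime() + totalTimeoutMs * 1_000_000L;
        long delayMs = initialDelayMs;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (!causedByDnsFailure(e) || remainingMs <= 0) {
                    throw e;
                }
                // Never sleep past the deadline.
                Thread.sleep(Math.min(delayMs, remainingMs));
                // Grow the delay for the next attempt, capped at maxDelayMs.
                delayMs = Math.min((long) (delayMs * multiplier), maxDelayMs);
            }
        }
    }
}
```

With the client from the example above, a call such as DnsRetry.callWithRetry(() -> client.listDatasets(), 100, 2.0, 10_000, 120_000) would keep retrying for up to 2 minutes while the host cannot be resolved. The delay values are arbitrary. This is exactly the duplication of the gax retry logic that we would like to avoid.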
Thank you very much for your response.