gax-java

Retry issues caused by DNS resolution

Open pegerto opened this issue 4 years ago • 5 comments

Environment details

  1. Specify the API at the beginning of the title: bigquery
  2. OS type and version: MacOS
  3. Java version: 1.8
  4. bigquery version(s): 1.117.1

Steps to reproduce

Even with retry settings configured, an error during DNS resolution on the HTTP/gRPC API does not seem to be retried. Google API Extensions for Java (gax-java) treats this as a non-retryable case, and the API does not seem to offer a way to enable retries for errors of this sort:

 stacktrace: com.google.cloud.bigquery.BigQueryException: www.googleapis.com
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:113)
	at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.create(HttpBigQueryRpc.java:213)
	at com.google.cloud.bigquery.BigQueryImpl$5.call(BigQueryImpl.java:327)
	at com.google.cloud.bigquery.BigQueryImpl$5.call(BigQueryImpl.java:324)
	at 

Is there any option to retry these exceptions? If not, could this be added as a feature, since the gax retry API is internal to the client?

Thank you for your help.

pegerto • Feb 12 '21 14:02

The BigQuery client uses gax-java for retrying. Hence, transferring this ticket to gax-java for triaging.

stephaniewang526 • Feb 15 '21 17:02

Hello @miraleung, I see you labeled the issue as a feature request. Could you elaborate a bit on your thoughts after triage? I am happy to contribute; I would rather have a retry option for this case handled by the client than build a wrapper.

pegerto • Feb 19 '21 19:02

The issue persists and is now becoming a bit worse for us as we move more jobs to a new implementation that does not use a custom-made retry wrapper.

We know there is an underlying issue causing java.net.UnknownHostException: www.googleapis.com to appear, but this should be retried.

pegerto • Apr 19 '21 16:04

Thanks @pegerto, @stephaniewang526 and I will take a look. In the meantime, if you could put a repro case and steps here, that would be great. :)

miraleung • Apr 19 '21 17:04

Hello @miraleung @stephaniewang526

Thank you very much for your help. I may be misunderstanding the retry scope that we can expect from the client.

Imagine a thread that continuously processes data. To avoid going into our specific use case, this example just counts the datasets instead of instantiating BigQuery jobs.

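  import com.google.api.gax.retrying.RetrySettings
  import com.google.cloud.bigquery.{BigQuery, BigQueryException, BigQueryOptions}
  // Assumed: Stopwatch is com.twitter.util.Stopwatch (consistent with the output format below)
  import com.twitter.util.Stopwatch
  // Assumed: this gax version takes org.threeten.bp.Duration
  import org.threeten.bp.Duration
  import scala.annotation.tailrec
  import scala.collection.JavaConverters._
  import scala.concurrent.duration.DurationInt
  import scala.util.Try

  // maxAttempts = 0 means no attempt limit; retries are bounded by the total timeout only
  val retrySettings = RetrySettings.newBuilder()
    .setMaxAttempts(0)
    .setTotalTimeout(Duration.ofMinutes(2))
    .build()

  val client: BigQuery = BigQueryOptions.newBuilder()
    .setRetrySettings(retrySettings)
    .build().getService

  val watch = Stopwatch.start()

  // Count the datasets every 10 seconds, forever, printing the elapsed time
  @tailrec
  def workLoop: Unit = {
    println(s"${watch()} -  ${client.listDatasets().iterateAll().asScala.size}")
    Thread.sleep(10.seconds.toMillis)
    workLoop
  }

  // Report whether the failure that ends the loop is considered retryable, and its cause
  Try(workLoop).recover {
    case e: BigQueryException =>
      println(s"${watch()} - $e \n retryable: ${e.isRetryable} \n ${e.getCause}")
  }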

If I disconnect the Wi-Fi during the execution of this loop, to simulate the kind of glitch we see in our production systems, I get the following output:

73.milliseconds+581.microseconds+738.nanoseconds -  42
10.seconds+954.milliseconds+122.microseconds+715.nanoseconds -  42
21.seconds+474.milliseconds+156.microseconds+770.nanoseconds -  42
32.seconds+757.milliseconds+828.microseconds+672.nanoseconds - com.google.cloud.bigquery.BigQueryException: www.googleapis.com 
 retryable: false 
 java.net.UnknownHostException: www.googleapis.com

Process finished with exit code 0

It seems clear that a DNS resolution failure is not a retryable exception for gax, but it is a recoverable one. We would expect it to be handled by the retry settings and retried, linearly or exponentially, for 2 minutes.

We could implement a code solution for the network glitches we are observing, but that would duplicate the retry wrapper offered by the gax implementation. So we would rather have a discussion first to clarify our expectations.
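
For illustration, here is a minimal sketch of what such a duplicated wrapper might look like. The names (`DnsRetry`, `retryOnDnsFailure`, `causedByDnsFailure`) and the backoff defaults are hypothetical and not part of gax-java. The sketch only relies on the failure carrying a `java.net.UnknownHostException` somewhere in its cause chain, as the output above shows for `BigQueryException`.

```scala
import java.net.UnknownHostException
import scala.annotation.tailrec
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of gax-java or the BigQuery client.
object DnsRetry {

  // True if the throwable, or any throwable in its cause chain,
  // is an UnknownHostException (i.e. a DNS resolution failure).
  @tailrec
  def causedByDnsFailure(t: Throwable): Boolean =
    if (t == null) false
    else if (t.isInstanceOf[UnknownHostException]) true
    else causedByDnsFailure(t.getCause)

  // Runs `call`, retrying DNS failures with exponential backoff until
  // `totalTimeout` has elapsed. Any other failure is rethrown immediately.
  def retryOnDnsFailure[A](
      totalTimeout: FiniteDuration,
      initialDelay: FiniteDuration = 1.second,
      multiplier: Double = 2.0,
      maxDelay: FiniteDuration = 30.seconds
  )(call: => A): A = {
    val deadline = totalTimeout.fromNow

    @tailrec
    def attempt(delay: FiniteDuration): A =
      Try(call) match {
        case Success(value) => value
        case Failure(e) if causedByDnsFailure(e) && deadline.hasTimeLeft() =>
          // Sleep for the current delay, but never past the deadline.
          val sleepMillis =
            math.max(0L, math.min(delay.toMillis, deadline.timeLeft.toMillis))
          Thread.sleep(sleepMillis)
          // Grow the delay geometrically, capped at maxDelay.
          val nextMillis =
            math.min((delay.toMillis * multiplier).toLong, maxDelay.toMillis)
          attempt(nextMillis.millis)
        case Failure(e) => throw e
      }

    attempt(initialDelay)
  }
}
```

In the loop above, the call would become something like `DnsRetry.retryOnDnsFailure(2.minutes) { client.listDatasets().iterateAll().asScala.size }`. This is exactly the kind of logic we would prefer the client's retry settings to cover.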

Thank you very much for your response.

pegerto • Apr 20 '21 12:04