fluent-plugin-opensearch opensearch auth times out when using opensearch plugin

(check apply)

[x] read the contribution guideline
[ ] (optional) already reported 3rd party upstream repository or mailing list if you use k8s addon or helm charts.

Steps to replicate

Provide example config and message

<label "@#{ENV['ENDPOINT_NAME']}">
  <match **>
    @type "opensearch_data_stream"
    @id "out-aws-es-#{worker_id}"
    @log_level "#{ENV['OUTPUT_LOG_LEVEL']}"
    log_es_400_reason true
    logstash_format false
    data_stream_name "ds-#{ENV['ENDPOINT_NAME']}"
    include_timestamp true
    include_tag_key true
    time_key timestamp
    flush_interval 5s
    slow_flush_log_threshold 135.0
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    request_timeout 300s

    <buffer>
      @type memory

      chunk_limit_size 20MB
      flush_mode interval
      flush_interval 5s
      flush_thread_count 12
      flush_at_shutdown true
      retry_max_times 2
      retry_wait 60s
      retry_type exponential_backoff
      retry_exponential_backoff_base 3
      retry_timeout 30m
      overflow_action drop_oldest_chunk
      disable_chunk_backup true
      total_limit_size "#{ENV['TOTAL_BUFFER_SIZE']}MB"
    </buffer>

    <endpoint>
      url "https://#{ENV['ES_ENDPOINT']}"
      region us-east-2
      assume_role_arn "#{ENV['COLLECTOR_SVC_ROLE']}"
    </endpoint>
  </match>
</label>

When using the opensearch plugin, we now get lots of errors like this on our fluentd collectors:

"error": "#<Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure: could not push logs to OpenSearch cluster (ds-janus): [400] {"Message":"You have exceeded the number of permissible concurrent requests with unique IAM Identities. Please retry."}>"

Expected Behavior or What you need to ask

We're wondering if this is due to commit fb04e91d0cb9981e6174b1adc128b3a4016ae577.

Prior to implementing this plugin within our collectors we did not have this problem.

Using Fluentd and OpenSearch plugin versions

Fluentd v1.14.4-1.0, AWS OpenSearch 1.2, fluent-plugin-opensearch 1.0.7

Jul 11 '22 18:07 ngamber

We've also noticed that the auth request doesn't seem to pass in a maximum session duration. It would make sense to set this to the same as the refresh_credentials_interval so that it doesn't expire before then.

RecoverableRequestFailure error=\"could not push logs to OpenSearch cluster (datastream-test): [403] {\\\"message\\\":\\\"The security token included in the request is expired

This is after changing refresh_credentials_interval to 10h and the maximum session duration on the role has been set to 12h.

Per AWS:

To learn how to view the maximum value for your role, see View the maximum session duration setting for a role. If you do not pass this parameter, the temporary credentials expire in one hour.
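To put numbers on that mismatch, here is a rough Ruby sketch. It is not plugin code: `seconds_without_valid_credentials` is a made-up helper, and it only assumes the one-hour default from the AWS note above when no duration is passed.

```ruby
# Hypothetical helper, not part of fluent-plugin-opensearch.
# Default session lifetime when no duration is passed (one hour, per the AWS note).
DEFAULT_STS_SESSION_SECONDS = 3600

# Returns how many seconds the credentials sit expired before the next refresh,
# or 0 if the session lasts at least as long as the refresh interval.
def seconds_without_valid_credentials(refresh_interval_seconds,
                                      session_seconds = DEFAULT_STS_SESSION_SECONDS)
  [refresh_interval_seconds - session_seconds, 0].max
end

ten_hours = 10 * 3600
gap = seconds_without_valid_credentials(ten_hours)
puts "expired for #{gap / 3600}h before each refresh"
# prints "expired for 9h before each refresh"
```

So with a 10h refresh interval and the default session length, raising the maximum session duration on the role alone would not close the gap; the request itself would also need to ask for a longer session.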

Jul 12 '22 15:07 ngamber

Not too good with Ruby but I assume adding a line like this here might help?

https://github.com/Barracuda-CloudOps/fluent-plugin-opensearch/blob/main/lib/fluent/plugin/out_opensearch.rb#L239

duration_seconds: conf[:refresh_credentials_interval].to_i
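A slightly fuller sketch of that idea, in case it helps. This is not the plugin's actual code: `assume_role_options` and the shape of `conf` are assumptions for illustration. The only library fact relied on is that `Aws::AssumeRoleCredentials.new` in aws-sdk-core accepts a `:duration_seconds` option, and that STS AssumeRole allows durations from 900 seconds up to 12 hours (further capped by the role's own maximum session duration).

```ruby
# Sketch only: builds an options hash that could be passed to
# Aws::AssumeRoleCredentials.new. Helper name and conf keys are hypothetical.
STS_MIN_DURATION = 900      # 15 minutes, AssumeRole lower bound
STS_MAX_DURATION = 43_200   # 12 hours, AssumeRole upper bound

def assume_role_options(conf)
  # Normalize to whole seconds and keep the value inside the range STS accepts.
  duration = conf[:refresh_credentials_interval].to_i.clamp(STS_MIN_DURATION, STS_MAX_DURATION)
  {
    role_arn: conf[:assume_role_arn],
    role_session_name: conf[:assume_role_session_name],
    duration_seconds: duration
  }
end

opts = assume_role_options({
  assume_role_arn: "arn:aws:iam::123456789012:role/example-collector", # placeholder ARN
  assume_role_session_name: "fluentd",
  refresh_credentials_interval: 36_000 # 10h expressed in seconds
})
p opts[:duration_seconds]
# With aws-sdk-core loaded (plus a region or STS client), this might be used as:
#   Aws::AssumeRoleCredentials.new(**opts).credentials
```

The clamp only guards against values STS rejects outright. A request for more than the role's configured maximum session duration would still fail, so that setting on the role has to be at least as large as the interval.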

Jul 12 '22 15:07 ngamber

Does https://github.com/fluent/fluent-plugin-opensearch/pull/78 work for you?

Sep 01 '22 07:09 cosmo0920

Is there a solution for this? We are also having the same issue :thinking:

Dec 13 '22 12:12 antoniocascais

Same issue here. This is critical. Any workaround? I thought that by setting refresh_credentials_interval it would work.

Jun 28 '23 16:06 kaiohenricunha

What version are you on?

Doesn't #74 solve the problem? It has been merged and released in v1.1.1.

Do you think the problem still exists in v1.1.1 or above?

Jun 28 '23 19:06 Jonniedev

The fluent-operator automatically updates the plugin's version. I only noticed this update you mentioned because fluentd started throwing errors as discussed here: https://github.com/fluent/fluent-operator/issues/814

I tried setting the maximum session duration of my IAM role to match the plugin's default refresh_credentials_interval: 5h

It works for a few errors, then gets stuck again.

Jun 29 '23 13:06 kaiohenricunha

Issue persists and is even worse now: https://github.com/fluent/fluent-plugin-opensearch/issues/107

Jun 30 '23 14:06 kaiohenricunha