opensearch auth times out when using opensearch plugin
(check apply)
- [x] read the contribution guideline
- [ ] (optional) already reported 3rd party upstream repository or mailing list if you use k8s addon or helm charts.
Steps to replicate
Provide example config and message
<label "@#{ENV['ENDPOINT_NAME']}">
  <match **>
    @type "opensearch_data_stream"
    @id "out-aws-es-#{worker_id}"
    @log_level "#{ENV['OUTPUT_LOG_LEVEL']}"
    log_es_400_reason true
    logstash_format false
    data_stream_name "ds-#{ENV['ENDPOINT_NAME']}"
    include_timestamp true
    include_tag_key true
    time_key timestamp
    flush_interval 5s
    slow_flush_log_threshold 135.0
    reconnect_on_error true
    reload_on_failure true
    reload_connections false
    request_timeout 300s
    <buffer>
      @type memory
      chunk_limit_size 20MB
      flush_mode interval
      flush_interval 5s
      flush_thread_count 12
      flush_at_shutdown true
      retry_max_times 2
      retry_wait 60s
      retry_type exponential_backoff
      retry_exponential_backoff_base 3
      retry_timeout 30m
      overflow_action drop_oldest_chunk
      disable_chunk_backup true
      total_limit_size "#{ENV['TOTAL_BUFFER_SIZE']}MB"
    </buffer>
    <endpoint>
      url "https://#{ENV['ES_ENDPOINT']}"
      region us-east-2
      assume_role_arn "#{ENV['COLLECTOR_SVC_ROLE']}"
    </endpoint>
  </match>
</label>
When using the opensearch plugin, we now get lots of errors like this on our fluentd collectors:
"error": "#<Fluent::Plugin::OpenSearchOutput::RecoverableRequestFailure: could not push logs to OpenSearch cluster (ds-janus): [400] {"Message":"You have exceeded the number of permissible concurrent requests with unique IAM Identities. Please retry."}>"
Expected Behavior or What you need to ask
We're wondering if this is due to fb04e91d0cb9981e6174b1adc128b3a4016ae577
Prior to implementing this plugin within our collectors we did not have this problem.
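The error text suggests that too many distinct credential sets are in flight at once. Whether the plugin really assumes the role separately per flush thread or worker is not confirmed in this thread, so treat the following as a sketch of one possible mitigation only: a thread-safe cache that lets all flush threads share a single set of credentials and refreshes them shortly before expiry. `SharedCredentialCache`, `refresh_before` and the fetcher block are hypothetical names, not part of fluent-plugin-opensearch.

```ruby
# Hypothetical illustration; not code from fluent-plugin-opensearch.
class SharedCredentialCache
  # The block must return [credentials, expires_at]. In a real setup it would
  # perform the AssumeRole call; here it is whatever the caller supplies.
  def initialize(refresh_before: 300, clock: -> { Time.now }, &fetcher)
    @fetcher = fetcher
    @refresh_before = refresh_before # seconds of safety margin before expiry
    @clock = clock
    @mutex = Mutex.new
    @credentials = nil
    @expires_at = nil
  end

  # Returns the cached credentials. The fetcher runs only when nothing is
  # cached yet or the cached credentials expire within `refresh_before` seconds.
  def credentials
    @mutex.synchronize do
      if @credentials.nil? || @clock.call >= @expires_at - @refresh_before
        @credentials, @expires_at = @fetcher.call
      end
      @credentials
    end
  end
end

# Usage: 12 threads (matching flush_thread_count above) share one fetch.
calls = 0
cache = SharedCredentialCache.new do
  calls += 1
  ["creds-#{calls}", Time.now + 3600]
end
12.times.map { Thread.new { cache.credentials } }.each(&:join)
```

The mutex serializes the first fetch, so concurrent callers wait for one result instead of each requesting their own session.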
Using Fluentd and OpenSearch plugin versions
Fluentd v1.14.4-1.0, AWS OpenSearch 1.2, fluent-plugin-opensearch 1.0.7
We've also noticed that the auth request doesn't seem to pass in a maximum session duration. It would make sense to set this to the same as the refresh_credentials_interval so that it doesn't expire before then.
RecoverableRequestFailure error=\"could not push logs to OpenSearch cluster (datastream-test): [403] {\\\"message\\\":\\\"The security token included in the request is expired
This is after changing refresh_credentials_interval to 10h and the maximum session duration on the role has been set to 12h.
Per AWS:
To learn how to view the maximum value for your role, see View the maximum session duration setting for a role. If you do not pass this parameter, the temporary credentials expire in one hour.
I'm not too good with Ruby, but I assume adding a line like this here might help:
https://github.com/Barracuda-CloudOps/fluent-plugin-opensearch/blob/main/lib/fluent/plugin/out_opensearch.rb#L239
duration_seconds: conf[:refresh_credentials_interval.to_s]
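To make the relationship between the two settings concrete, here is a minimal Ruby sketch. It assumes the refresh interval and the role's maximum session duration are both known in seconds. `session_duration_for` and its `margin` parameter are hypothetical and not part of the plugin. The 900-second minimum and the one-hour default are the documented STS AssumeRole values.

```ruby
# Hypothetical helper; not part of fluent-plugin-opensearch.
STS_MIN_DURATION = 900       # AssumeRole rejects sessions shorter than 15 minutes
STS_DEFAULT_DURATION = 3600  # used by STS when DurationSeconds is not passed

# Returns the DurationSeconds to request so the session outlives the refresh
# interval by `margin` seconds. Raises if the role cannot grant that long a session.
def session_duration_for(refresh_interval, role_max_duration, margin: 300)
  wanted = [refresh_interval + margin, STS_MIN_DURATION].max
  if wanted > role_max_duration
    raise ArgumentError,
          "refresh interval #{refresh_interval}s needs a #{wanted}s session, " \
          "but the role allows at most #{role_max_duration}s"
  end
  wanted
end

# The case from this thread: refresh every 10h, role maximum of 12h.
refresh = 10 * 3600
role_max = 12 * 3600
requested = session_duration_for(refresh, role_max)

# Without an explicit duration the session lasts STS_DEFAULT_DURATION (1h),
# which is far shorter than the 10h refresh interval, hence the expired token.
expires_before_refresh = STS_DEFAULT_DURATION < refresh

# With the aws-sdk gem the value would be passed along these lines:
#   Aws::AssumeRoleCredentials.new(role_arn: arn, role_session_name: name,
#                                  duration_seconds: requested)
```

Requesting slightly more than the refresh interval leaves room for a refresh that runs late. Failing loudly when the role's maximum is too low is easier to diagnose than a 403 hours later.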
Does https://github.com/fluent/fluent-plugin-opensearch/pull/78 work for you?
Is there a solution for this? We are also having the same issue :thinking:
Same issue here. This is critical. Any workaround? I thought that by setting refresh_credentials_interval it would work.
What version are you on?
Doesn't #74 solve the problem? It has been merged and released in v1.1.1.
Do you think the problem still exists in v1.1.1 or above?
The fluent-operator automatically updates the plugin's version. I only noticed this update you mentioned because fluentd started throwing errors as discussed here: https://github.com/fluent/fluent-operator/issues/814
I tried setting the session duration of my IAM role to the same value as the plugin's default, 5h.
It works for a few errors, then gets stuck again.
Issue persists and is even worse now: https://github.com/fluent/fluent-plugin-opensearch/issues/107