Error_class=URI::InvalidURIError error="bad URI(is not URI) .."
Describe the bug
I’m getting this error continuously when using the @http plugin. I haven’t been able to find the root cause, but I’ve noticed it coincides with my external endpoint being down for restarts. I have buffering enabled, which writes to my local disk, and I do not drop any log chunks, i.e. I have retry_forever set to true. But when the service is back up, this one chunk goes into periodic retries forever, because the dynamic tag in the http endpoint is not resolved during retries.
The whole error looks like this: Error_class=URI::InvalidURIError error="bad URI(is not URI) \"https://myexternalendpoint.com/v0/${tag}\""
fluentd version: 1.16.5
To Reproduce
Use the http output plugin with an endpoint containing ${tag}, with retry_forever set to true.
Expected behavior
The buffer chunk should be sent; Fluentd should not complain about an invalid URI.
Your Environment
- Fluentd version: 1.16.5
- Package version:
- Operating system: Red Hat Enterprise Linux Server 7.9 (Maipo)
- Kernel version: 2024 x86_64 GNU/Linux
Your Configuration
<match **>
@type http
endpoint http://externalxx.com/v0/${tag}
content_type application/json
<format>
@type json
</format>
json_array true
<buffer tag>
@type file
path /local/data/fluentd
flush_interval 12s
flush_thread_count 1
overflow_action block
chunk_limit_size 4MB
retry_type periodic
retry_wait 60s
total_limit_size 6GB
retry_forever true
</buffer>
</match>
Your Error Log
Error_class=URI::InvalidURIError error="bad URI(is not URI) \"https://myexternalendpoint.com/v0/${tag}\""
Additional context
No response
@raulgupto Thanks for your report. However, I can't reproduce this.
The placeholder is replaced when retrying if the chunk has tag key info.
If you set tag to the chunk keys, ${tag} should be replaced when retrying.
<match test>
@type http
endpoint http://localhost:9880/${tag}
<format>
@type json
</format>
<buffer tag>
@type file
path ...
flush_mode immediate
</buffer>
</match>
Can you try killing the fluentd process? I can’t figure out the exact scenario to reproduce this issue. What I’ve noticed is that, in the normal scenario, we have two buffer files for a chunk of messages. But in this case, I’ve noticed that only one is present most of the time.
Can you try killing the fluentd process?
I have tried. When Fluentd restarts, Fluentd loads the existing chunks and sends them correctly.
But in this case, I’ve noticed that only one is present most of the time.
This should be the cause. I can reproduce this issue as follows.
- Make some buffer files.
- Stop Fluentd with some buffer files remaining.
- Delete some .meta buffer files manually.
- Restart Fluentd.
- This error happens.
The file buffer (buf_file) needs a .meta file to process the placeholders.
If it is removed, Fluentd can't process the placeholders.
If the .meta file is removed accidentally, it means the information about the tag is lost.
So, it is very difficult for Fluentd to recover such data.
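For illustration (the chunk ID and exact file names below are hypothetical, based on the path in the config above), a healthy file buffer keeps a data file and a matching .meta file for each chunk, roughly like:

/local/data/fluentd/buffer.b61f0d8284cd2929f0880a99ad25feea9.log
/local/data/fluentd/buffer.b61f0d8284cd2929f0880a99ad25feea9.log.meta

The .meta file stores the chunk metadata, including the tag, so if only the .log file survives, ${tag} in the endpoint can no longer be filled in.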
I understand that without a destination you don’t know where to send it. But since retry_forever is true, fluentd keeps retrying this chunk. What I’ve noticed is that, instead of just waiting for this one chunk to be flushed, the Fluentd process does not go down, but it consumes the whole buffer space and remains stuck forever. There is a workaround of manually clearing that buffer, but requiring manual intervention to delete buffers in a production environment is not sustainable. Either we should drop the chunk that is corrupted, i.e. without an endpoint address, or we should flush it to the current address. The latter does not seem correct, because ${tag} and fields like it were supposed to be resolved dynamically. Also, if someone had changed the config to a new http address, a chunk meant for the old address would go to the new one. I’d go with dropping the broken buffer chunks.
Another approach is to find a way to prevent this problem from appearing in the first place. I’ve seen it appear frequently: around 3-5 unique hosts out of 160 face this on a monthly basis. Is there any existing config change that would fix this issue?
To address the root cause, please investigate why some buffer files are disappearing. Is it a bug in Fluentd or an external factor?
If this may be a bug in Fluentd, we need to find out how to reproduce this phenomenon to fix the bug. (I can reproduce the error by manually removing some buffer files. On the other hand, some buffer files must have been lost for some reason in your environment. We need to find out the cause.)
But since retry_forever is true, fluentd keeps retrying this chunk. What I’ve noticed is that, instead of just waiting for this one chunk to be flushed, the Fluentd process does not go down, but it consumes the whole buffer space and remains stuck forever.
Some errors are considered non-retriable, and Fluentd gives up retrying.
- https://docs.fluentd.org/buffer#handling-unrecoverable-errors
As for the error in this issue, Fluentd retries it; it is considered retriable in the current implementation.
So, if using retry_forever, Fluentd retries to flush the chunk forever.
The issue may be improved if this can be fixed so that the error can be determined as non-retriable.
There is a workaround of manually clearing that buffer, but requiring manual intervention to delete buffers in a production environment is not sustainable.
You can stop using retry_forever and add <secondary>.
This allows Fluentd to automatically save unexpected data to a file or another location without manual intervention.
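A minimal sketch of that combination, based on the configuration above (the retry_max_times value and the dump directory are placeholder choices to adjust for your environment):

<match **>
  @type http
  endpoint http://externalxx.com/v0/${tag}
  content_type application/json
  json_array true
  <format>
    @type json
  </format>
  <buffer tag>
    @type file
    path /local/data/fluentd
    flush_interval 12s
    overflow_action block
    chunk_limit_size 4MB
    total_limit_size 6GB
    retry_type periodic
    retry_wait 60s
    # give up on a single chunk after a bounded number of attempts instead of retrying forever
    retry_max_times 120
  </buffer>
  <secondary>
    # chunks that exhaust their retries are dumped here instead of blocking the buffer
    @type secondary_file
    directory /local/data/fluentd-error
  </secondary>
</match>

With this, a chunk that cannot be flushed (for example, because its .meta file is gone) is eventually written out by secondary_file instead of occupying the buffer forever.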
Either we should drop the chunk that is corrupted, i.e. without an endpoint address, or we should flush it to the current address.
Certainly, we should improve the handling of buffers about this point.
If there is no corresponding .meta buffer file, it would be better for Fluentd to drop or back up the chunk.
I’ll definitely add secondary_file. One question: if I use retry_timeout / retry_max_times, how will my retries work in this case?
- If one chunk has exhausted the retry parameter, Fluentd stops sending all buffer chunks, or
- If one chunk has exhausted the retry parameter, only that chunk is no longer retried, while the others are retried the same number of times.
I don’t want to stop after n tries or n duration; I want to keep retrying, assuming my endpoint will be back after recovering from a failure or release. Edit: I tried secondary_file. It doesn’t resolve ${tag}. I have <match **> as my match condition. I wanted the dump to show which log chunks have failed so that I could manually send them to the endpoint.
@raulgupto Sorry for my late response.
If I use retry_timeout / retry_max_times, how will my retries work in this case.
1. If one chunk has exhausted the retry parameter, Fluentd stops sending all buffer chunks, or 2. If one chunk has exhausted the retry parameter, only that chunk is no longer retried, while the others are retried the same number of times.
2 is correct.
Fluentd handles retries for each chunk.
Edit: I tried secondary_file. It doesn’t resolve ${tag}. I have <match **> as my match condition. I wanted the dump to show which log chunks have failed so that I could manually send them to the endpoint.
Chunks that cannot resolve placeholders due to missing metafiles fail to be transferred.
The secondary_file output handles such chunks, so it cannot resolve ${tag} either.
If the metafile is lost, the tag information cannot be recovered.
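If the goal is only to tell the dumped chunks apart without relying on ${tag}, one option is to put the chunk ID into the dump file name. As far as I know, out_secondary_file accepts the ${chunk_id} placeholder (please verify against your version); a sketch:

<secondary>
  @type secondary_file
  # placeholder directory; pick any local path outside the primary buffer
  directory /local/data/fluentd-error
  # use the chunk ID instead of the tag, since the tag may be unrecoverable
  basename dump.${chunk_id}
</secondary>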
Thank you for the secondary_file workaround. It will help to manually recover and send logs in case of failures. It would, however, be great if we could have retries or some other mechanism that helps recover buffers when the .meta file is lost.
If the .meta file is removed accidentally, it means the information about the tag is lost. So, it is very difficult for Fluentd to recover such data.
So, it would be better to avoid the disappearance of buffer files.
Do you have any idea as to why the buffer file disappears?
Is Fluentd running in duplicate?
I’ve added graceful kill commands to stop the running process and around 10 seconds of sleep before restarts. However, we have a process monitor that checks whether fluentd is running and restarts it if not. So even during host maintenance or clean restarts, I don’t think there will be process duplication. But there are chances of process kills and restarts, which ideally should not leave half-written metadata. Is there any flag I can use that prevents metadata corruption during restarts?
Sorry for my late response.
So even during host maintenance or clean restarts, I don’t think there will be process duplication.
I see...
Is there any flag I can use that prevents metadata corruption during restarts?
No. It's a very unusual case that some buffer files disappear. It is highly likely that the factor is external to Fluentd, and without identifying it, it is difficult to consider specific measures.
We need a way to reproduce the phenomenon.
Hi all.
We've been experiencing the same problem with our fluentd 1.16.5.
Though we haven't been able to reproduce it, we can offer some clues regarding when/how we started seeing it.
Whilst running it under EKS we applied a VPA component to it, which meant an automatic adjustment of CPU and memory limits versus the hardcoded limits we had before. In some situations (lower load periods) the VPA would dynamically lower the memory and CPU limits to much lower values than we had ever run fluentd with. Subsequently, when under load, our fluentd would sometimes restart after OOMing and we'd start seeing those errors appear in the logs.
Our current running hypothesis is that perhaps when fluentd hit one of these conditions (OOM) it would fail to write the meta file for one of the received logs and thus leave the log in an unprocessable state.
@pecastro Thanks for your report! If OOM forcibly kills the process, files that are being written may become broken. That's not a problem on the application side. Please operate the system so that OOM does not occur.
Yep, we learned the lesson the hard way. :) I'm not the original creator of this issue, but as far as I am concerned the memory condition clearly explains this behaviour, as we stopped seeing it once we ensured there was plenty of memory for fluentd to run.
I see! Thanks for your report! So, the cause of this issue may be the memory shortage. If you find other causes, please reopen this issue!