upgrade "files" input with watchdog

Open DpoBoceka opened this issue 6 years ago • 10 comments

It would be nice to be able to use Benthos instead of filebeat or rsyslog for simple log shipping, so it could expand its influence and cover more use cases. Currently, however, the Benthos "files" input reads each path just once, so in order to ship new logs we have to restart the instance. I also wonder whether it keeps any metadata that would tell it where it stopped reading if it were restarted.

DpoBoceka avatar Sep 06 '19 10:09 DpoBoceka

Hey @DpoBoceka, I'm not opposed to adding the ability to watch and track input files. However, it's a fairly large task, so I'm not likely to take this on myself any time soon.

Jeffail avatar Sep 06 '19 11:09 Jeffail

I'll just leave this here in case someone is interested: https://github.com/radovskyb/watcher With that library we could implement Input.Connect() and Read() the bytes of a file from its channel, as we do now. I would like to try that out later.

DpoBoceka avatar Sep 23 '19 08:09 DpoBoceka

Any advice before I get to it?

DpoBoceka avatar Jan 31 '20 15:01 DpoBoceka

So I think this behaviour should be added to the file input rather than files because files specifically consumes each discrete file as a payload instead of line by line.

I would propose the following additions:

  • Allow consuming >1 files with the file input. We need to preserve backwards compatibility here so path needs to still allow a string value, but we can either add another field paths which is an array, or allow path to be either a string or array of strings.
  • Allow wildcard paths (optional for now, we can do it later)
  • Add a field cache which allows users to specify a cache resource to store metadata about when and where we last read from each file being consumed.
  • When cache is specified, for each file path being consumed we store the consumed position in the cache using the path as the key (maybe hashed). It might be worth storing this in a structured way so that we can add more context later (JSON format?) We should also flush these offsets in a separate goroutine in intervals.
  • On start up, if a cache is specified, for each file we query the cache to see if there's a pre-existing position to consume from. If there is not, or if the position is greater than the file's current size (meaning it's been rotated), then we consume from the beginning.

Allowing users to specify their own cache resource not only means they can store this metadata however they like but it also gives them control over things like TTLs. It probably makes sense to eventually flesh out the file cache type to support TTLs itself as it's the most likely candidate for this purpose.

Jeffail avatar Feb 01 '20 09:02 Jeffail

I wish "file" input could support tail mode (with truncation/move detection, as in https://github.com/hpcloud/tail) and "super asterisk" as in https://github.com/influxdata/telegraf/tree/master/plugins/inputs/tail

Use case: reading syslog-generated log files (rotated and/or created based on current time)

miko avatar Oct 20 '20 23:10 miko

@Jeffail Since the file plugin has been deprecated, should this feature be in a new tail-file plugin or be added as a feature to files after all?

abh avatar Aug 08 '21 08:08 abh

Hey @abh, it's actually the files input that has been deprecated in favour of file. The reason is that the file input got a new field codec, along with support for multiple paths via the new paths field, so it supports everything that the files input did (and more).

However, I think it might be difficult to map over all the different codec options to a watcher because they expect to consume an io.Reader, whereas a file watcher will want to chop the file byte stream into discrete lines (or follow a custom delimiter), so I think it might be sensible to go with a separate implementation for now.

Maybe a good path would be to create a new input marked as experimental, iterate on it a few times, and if we can eventually find a way to introduce the codecs from the normal file input then we can combine them, otherwise they'll remain separate.

Is this something you're considering working on? If so let me know if I can help or provide any guidance, it would be awesome to finally get it done.

Jeffail avatar Aug 08 '21 08:08 Jeffail

Looks like https://github.com/influxdata/tail is a maintained version of https://github.com/hpcloud/tail

mihaitodor avatar Dec 09 '21 01:12 mihaitodor

There's also https://github.com/nxadm/tail which looks a bit more active.

Jeffail avatar Jul 31 '22 09:07 Jeffail

Just had a quick look in there and it doesn’t look like that much code, TBH. Might be worth maintaining that logic directly in Benthos.

Later edit: This is definitely not something we want in Benthos: https://github.com/nxadm/tail/blob/master/winfile/winfile.go I wonder if there's a separate library for it...

mihaitodor avatar Jul 31 '22 09:07 mihaitodor

Also need this.

I already started to use https://github.com/nxadm/tail and it's been good.

gedw99 avatar Apr 01 '23 20:04 gedw99

For consistency consider following the SFTP "watcher" pattern. https://www.benthos.dev/docs/components/inputs/sftp

Thanks for an excellent project.

terryherron avatar May 18 '23 18:05 terryherron

Had a look. It's using polling; is that your point? I think polling is a good base to start from. We can add debounce too.

gedw99 avatar May 22 '23 08:05 gedw99

This could be used as a base: https://github.com/loov/watchrun/tree/master

It's using polling and also high-resolution timers.

gedw99 avatar May 22 '23 08:05 gedw99