
Whether the output batch supports multi threading

Open skyoct opened this issue 3 years ago • 1 comments

Hello, I read the code related to the batcher and found that it seems to process batches on a single thread. Is there any way to make this conversion to Parquet files faster? https://github.com/benthosdev/benthos/blob/811c58786a46085861a828f7fd606e659f872253/internal/component/output/batcher/batcher.go#L63-L156

    batching:
      byte_size: 125829120
      count: 20000
      period: 30s
      processors:
        - parquet:
            compression: snappy
            operator: from_json
            schema: ''

skyoct avatar Jun 16 '22 08:06 skyoct

Hey @skyoct, you could move the batching mechanism up to the input level, and then perform the processing within pipeline.processors where you can have parallel processing threads, something like this:

    input:
      foo:
        batching:
          byte_size: 125829120
          count: 20000
          period: 30s

    pipeline:
      processors:
        - parquet:
            compression: snappy
            operator: from_json
            schema: ''

    output:
      bar: {}

If the specific input you're using doesn't have a batching field then place it within a broker:

    input:
      broker:
        inputs:
          - foo: {}
        batching:
          byte_size: 125829120
          count: 20000
          period: 30s
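
The number of parallel processing threads is set with the `pipeline.threads` field. Here is a minimal sketch, assuming Benthos v4, where the default of `-1` should match the number of logical CPUs. The value `4` is only an illustrative choice, and `foo` and `bar` are still placeholders for a real input and output:

```yaml
input:
  broker:
    inputs:
      - foo: {}          # placeholder input
    batching:
      byte_size: 125829120
      count: 20000
      period: 30s

pipeline:
  threads: 4             # illustrative value; -1 (assumed default) matches logical CPUs
  processors:
    - parquet:
        compression: snappy
        operator: from_json
        schema: ''

output:
  bar: {}                # placeholder output
```

Each batch is still converted by a single thread, so any speed-up comes from several batches being processed at the same time rather than from splitting one batch across threads.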

Jeffail avatar Jun 18 '22 08:06 Jeffail