vector icon indicating copy to clipboard operation
vector copied to clipboard

enhancement(aws_s3 sink): Add ability to configure request errors to retry

Open jchap-pnnl opened this issue 7 months ago • 1 comments

Summary

Adds a new field, a retry_logic struct, to S3SinkConfig that allows user to specify types of response errors to retry. Users can specify specific status codes of error responses of failed requests they want to be automatically retried, or they can specify all failed requests to be automatically retried.

Change Type

  • [ ] Bug fix
  • [x] New feature
  • [ ] Non-functional (chore, refactoring, docs)
  • [ ] Performance

Is this a breaking change?

  • [ ] Yes
  • [x] No

How did you test this PR?

  1. Ran existing integration tests and verified they passed.
  2. Manually tested to verify the new functionality. Generated 403 errors by changing the access_key_id to an invalid one. Verified the failed authentication service was retried when the configuration required it to and didn't when the configuration didn't require it to. Ran tests with different values for retry_logic, retry_all_errors, and errors_to_retry:
    • retry_logic: omitted |
      • retry_all_errors: true | false | omitted
      • errors_to_retry: [] | [403] | omitted

Used this configuration:

# vector.yaml
sources:
  generate_syslog:
    type: "demo_logs"
    format: "syslog"
    count: 10

transforms:
  remap_syslog:
    inputs:
      - "generate_syslog"
    type: "remap"
    source: |
      structured = parse_syslog!(.message)
      . = merge(., structured)

sinks:
  s3:
    type: aws_s3
    encoding:
      codec: "json"
    region: <actual region>
    inputs:
      - remap_syslog
    bucket: <bucket name>
    auth:
      access_key_id: <key id value>
      secret_access_key: <key value>

    retry_logic:
      retry_all_errors: true
      errors_to_retry: [403]
  1. Considered adding a new test to src/sinks/aws_s3/integration_tests.rs, but didn't see a good option for detecting service retries. Considered a test using a timer that could measure how long a successful request takes and determine if retries are happening if the failed request takes longer. We rejected this option because the results could vary depending on the load of system resources or other factors. But please let us know if you recommend a testing approach that would work.

Does this PR include user facing changes?

  • [x] Yes. Please add a changelog fragment based on our guidelines.
  • [ ] No. A maintainer will apply the "no-changelog" label to this PR.

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • The CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
      • ./scripts/check_changelog_fragments.sh
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.
  • Closes: #10870

jchap-pnnl avatar Jun 13 '25 22:06 jchap-pnnl

CLA assistant check
All committers have signed the CLA.

bits-bot avatar Jun 13 '25 22:06 bits-bot

We will review again once the merge conflicts are resolved.

pront avatar Jul 07 '25 20:07 pront

Just ran git fetch && git checkout origin/master -- Cargo.toml Cargo.lock LICENSE-3rdparty.csv && git merge origin/master to fix your merge conflicts. ~Cargo.lock still shows changes which is weird but~ you should be good now to implement the suggestions made in this PR 👍

thomasqueirozb avatar Jul 08 '25 17:07 thomasqueirozb

Awesome, thanks for your help, @thomasqueirozb!

jchap-pnnl avatar Jul 08 '25 18:07 jchap-pnnl

I'm trying to figure out the two failing checks -- Test Suite / Checks and Test Suite / Test Suite.

In Test Suite / Checks, the failure is happening when cargo vdev check docs is run. Looks like the auto-generated documentation for retry_strategy in website/cue/reference/components/sinks/generated/aws_s3.cue isn't matching the required schema defined in website/cue/reference.cue. Seems to be complaining that a description isn't getting documented for status_codes. That's curious because there is a description on the StatusCodes variant of the RetryStrategy enum defined in src/sinks/s3_common/config.rs, but it's not getting pulled into the auto-generated documentation. Furthermore, status_codes isn't put in with the other enum variants of retry_strategy in the auto-generated documentation, I guess because they're strings but status_codes is a list. I wonder if the variants being different types is contributing to the problem.

Looking at .github/workflows/test.yml, maybe the Test Suite / Test Suite failure is simply because Test Suite / Checks failed.

Appreciate any help.

jchap-pnnl avatar Jul 11 '25 18:07 jchap-pnnl

I will take a look @jchap-pnnl

pront avatar Jul 11 '25 18:07 pront

Great, thanks!

jchap-pnnl avatar Jul 11 '25 18:07 jchap-pnnl

image

pront avatar Jul 11 '25 18:07 pront

Please take another look at the current state of this PR and let us know if this is ready for the final review.

pront avatar Jul 11 '25 18:07 pront

I reran the tests, verified the functionality still works, and updated the PR description. I've addressed everything I'm aware of. Please resume reviewing. Thanks!

jchap-pnnl avatar Jul 11 '25 21:07 jchap-pnnl