enhancement(aws_s3 sink): Add ability to configure request errors to retry
Summary
Adds a new field, a retry_logic struct, to S3SinkConfig that allows user to specify types of response errors to retry. Users can specify specific status codes of error responses of failed requests they want to be automatically retried, or they can specify all failed requests to be automatically retried.
Change Type
- [ ] Bug fix
- [x] New feature
- [ ] Non-functional (chore, refactoring, docs)
- [ ] Performance
Is this a breaking change?
- [ ] Yes
- [x] No
How did you test this PR?
- Ran existing integration tests and verified they passed.
- Manually tested to verify the new functionality. Generated 403 errors by changing the
access_key_idto an invalid one. Verified the failed authentication service was retried when the configuration required it to and didn't when the configuration didn't require it to. Ran tests with different values forretry_logic,retry_all_errors, anderrors_to_retry:- retry_logic: omitted |
- retry_all_errors: true | false | omitted
- errors_to_retry: [] | [403] | omitted
- retry_logic: omitted |
Used this configuration:
# vector.yaml
sources:
generate_syslog:
type: "demo_logs"
format: "syslog"
count: 10
transforms:
remap_syslog:
inputs:
- "generate_syslog"
type: "remap"
source: |
structured = parse_syslog!(.message)
. = merge(., structured)
sinks:
s3:
type: aws_s3
encoding:
codec: "json"
region: <actual region>
inputs:
- remap_syslog
bucket: <bucket name>
auth:
access_key_id: <key id value>
secret_access_key: <key value>
retry_logic:
retry_all_errors: true
errors_to_retry: [403]
- Considered adding a new test to src/sinks/aws_s3/integration_tests.rs, but didn't see a good option for detecting service retries. Considered a test using a timer that could measure how long a successful request takes and determine if retries are happening if the failed request takes longer. We rejected this option because the results could vary depending on the load of system resources or other factors. But please let us know if you recommend a testing approach that would work.
Does this PR include user facing changes?
- [x] Yes. Please add a changelog fragment based on our guidelines.
- [ ] No. A maintainer will apply the "no-changelog" label to this PR.
Notes
- Please read our Vector contributor resources.
- Do not hesitate to use
@vectordotdev/vectorto reach out to us regarding this PR. - The CI checks run only after we manually approve them.
- We recommend adding a
pre-pushhook, please see this template. - Alternatively, we recommend running the following locally before pushing to the remote branch:
-
cargo fmt --all -
cargo clippy --workspace --all-targets -- -D warnings -
cargo nextest run --workspace(alternatively, you can runcargo test --all) -
./scripts/check_changelog_fragments.sh
-
- We recommend adding a
- After a review is requested, please avoid force pushes to help us review incrementally.
- Feel free to push as many commits as you want. They will be squashed into one before merging.
- For example, you can run
git merge origin masterandgit push.
- If this PR introduces changes Vector dependencies (modifies
Cargo.lock), please runcargo vdev build licensesto regenerate the license inventory and commit the changes (if any). More details here.
- Closes: #10870
We will review again once the merge conflicts are resolved.
Just ran git fetch && git checkout origin/master -- Cargo.toml Cargo.lock LICENSE-3rdparty.csv && git merge origin/master to fix your merge conflicts. ~Cargo.lock still shows changes which is weird but~ you should be good now to implement the suggestions made in this PR 👍
Awesome, thanks for your help, @thomasqueirozb!
I'm trying to figure out the two failing checks -- Test Suite / Checks and Test Suite / Test Suite.
In Test Suite / Checks, the failure is happening when cargo vdev check docs is run. Looks like the auto-generated documentation for retry_strategy in website/cue/reference/components/sinks/generated/aws_s3.cue isn't matching the required schema defined in website/cue/reference.cue. Seems to be complaining that a description isn't getting documented for status_codes. That's curious because there is a description on the StatusCodes variant of the RetryStrategy enum defined in src/sinks/s3_common/config.rs, but it's not getting pulled into the auto-generated documentation. Furthermore, status_codes isn't put in with the other enum variants of retry_strategy in the auto-generated documentation, I guess because they're strings but status_codes is a list. I wonder if the variants being different types is contributing to the problem.
Looking at .github/workflows/test.yml, maybe the Test Suite / Test Suite failure is simply because Test Suite / Checks failed.
Appreciate any help.
I will take a look @jchap-pnnl
Great, thanks!
Please take another look at the current state of this PR and let us know if this is ready for the final review.
I reran the tests, verified the functionality still works, and updated the PR description. I've addressed everything I'm aware of. Please resume reviewing. Thanks!