Scan oversized lines by chunks
Description
Please include a summary of the change and which is fixed.
- Add chunk-based scan for oversized lines with overlap to do not skip a credential between chunks
- slowdown ~26%
How has this been tested?
Please describe the tests that you ran to verify your changes.
- [x] UnitTest
- [x] Benchmark
Codecov Report
Attention: Patch coverage is 78.57143% with 3 lines in your changes are missing coverage. Please review.
Project coverage is 90.53%. Comparing base (
21a54c9) to head (903ea75).
| Files | Patch % | Lines |
|---|---|---|
| credsweeper/scanner/scan_type/scan_type.py | 66.66% | 2 Missing and 1 partial :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #522 +/- ##
=======================================
Coverage 90.53% 90.53%
=======================================
Files 127 127
Lines 4394 4397 +3
Branches 707 707
=======================================
+ Hits 3978 3981 +3
Misses 274 274
Partials 142 142
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Thank you for suggest great idea to scanning oversize lines! It seems there is a problem when scanning full benchmark data... So how about set oversize limit const like as below?
OVERSIZE_LINE_LIMIT = MAX_LINE_LENGTH * 10 # You can change the number.I think it would be good to find a number which is 10 above, by testing with benchmark data. I hope the slowdown should under 10%..
Thank you for the feedback!
There should be no limit for text lines. The workflow jobs compares performance relatively for last release and base version. After new release it will be the base time - no failure for the workflow
The jobs finds overhead in unususal cases. IMO, the feature deserves the price.
We can more reduce benchmark set for performance test but total duration of performance test is less than the main benchmark test, so there is no speedup for common workflow.
Markup and scores were fixed.
@csh519 The overlap duplicates issue is fixed. https://github.com/Samsung/CredSweeper/pull/522/files#diff-a308e37dd0360462596f1887de9d165bf5fb16132a748481cc54a1b64b77b62f
- two the same Candidates were used in the test. The purging is processed with unique positions of value, variable, separator and line.