CredSweeper icon indicating copy to clipboard operation
CredSweeper copied to clipboard

Scan oversized lines by chunks

Open babenek opened this issue 1 year ago • 1 comments

Description

Please include a summary of the change and which is fixed.

  • Add chunk-based scan for oversized lines with overlap to do not skip a credential between chunks
  • slowdown ~26%

How has this been tested?

Please describe the tests that you ran to verify your changes.

  • [x] UnitTest
  • [x] Benchmark

babenek avatar Mar 03 '24 08:03 babenek

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 90.53%. Comparing base (21a54c9) to head (903ea75).

Files Patch % Lines
credsweeper/scanner/scan_type/scan_type.py 66.66% 2 Missing and 1 partial :warning:
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #522   +/-   ##
=======================================
  Coverage   90.53%   90.53%           
=======================================
  Files         127      127           
  Lines        4394     4397    +3     
  Branches      707      707           
=======================================
+ Hits         3978     3981    +3     
  Misses        274      274           
  Partials      142      142           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Mar 03 '24 08:03 codecov-commenter

Thank you for suggest great idea to scanning oversize lines! It seems there is a problem when scanning full benchmark data... So how about set oversize limit const like as below?

OVERSIZE_LINE_LIMIT = MAX_LINE_LENGTH * 10  # You can change the number.

I think it would be good to find a number which is 10 above, by testing with benchmark data. I hope the slowdown should under 10%..

Thank you for the feedback!

There should be no limit for text lines. The workflow jobs compares performance relatively for last release and base version. After new release it will be the base time - no failure for the workflow

The jobs finds overhead in unususal cases. IMO, the feature deserves the price.

We can more reduce benchmark set for performance test but total duration of performance test is less than the main benchmark test, so there is no speedup for common workflow.

Markup and scores were fixed.

babenek avatar May 27 '24 11:05 babenek

@csh519 The overlap duplicates issue is fixed. https://github.com/Samsung/CredSweeper/pull/522/files#diff-a308e37dd0360462596f1887de9d165bf5fb16132a748481cc54a1b64b77b62f

  • two the same Candidates were used in the test. The purging is processed with unique positions of value, variable, separator and line.

babenek avatar May 28 '24 09:05 babenek