
Detection Evasion with Unicode

Open QuinceyJames opened this issue 3 years ago • 1 comments

Problem

Hi! I just read an interesting article on how bad actors can evade text-based static analysis tools using Unicode. Ever since PEP 3131, Python has allowed non-ASCII characters in identifiers so that developers can "define classes and functions with names in their native languages". As a consequence, there are now many ways that names like eval can be written. (See: https://lingojam.com/BoldTextGenerator)

Proposal

Guarddog could preprocess all source files by normalizing Unicode text to NFKC, which maps look-alike characters such as 𝐞 to their ASCII equivalents. According to the PEP, "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
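
As a rough illustration of that normalization step, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `normalize_identifier` is hypothetical and not part of Guarddog; it only shows that the bold variant from the linked generator collapses to plain `eval` under NFKC.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFKC form of a name (hypothetical helper, not a Guarddog API)."""
    return unicodedata.normalize("NFKC", name)


# U+1D41E is MATHEMATICAL BOLD SMALL E, followed by plain ASCII "val".
obfuscated = "\U0001D41Eval"

# The raw text differs from "eval", so a plain-text rule would miss it,
# but the NFKC form is the same name the Python parser ends up using.
normalized = normalize_identifier(obfuscated)
```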

Alternatively, Guarddog could define a new heuristic that warns if non-ASCII characters are found.
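
One possible shape for such a heuristic, sketched with the standard `tokenize` module so that only identifiers are inspected and string literals or comments are left alone. The function name `find_non_ascii_identifiers` and the tuple layout are assumptions made for illustration, not an existing Guarddog rule.

```python
import io
import tokenize
import unicodedata


def find_non_ascii_identifiers(source: str):
    """Return (line, identifier, NFKC form) for each non-ASCII NAME token.

    Hypothetical helper for illustration; not an existing Guarddog rule.
    """
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only NAME tokens are checked, so non-ASCII text inside string
        # literals and comments does not trigger a warning.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            nfkc = unicodedata.normalize("NFKC", tok.string)
            findings.append((tok.start[0], tok.string, nfkc))
    return findings


# Line 2 calls eval with a bold "e" (U+1D41E) in place of the ASCII letter.
sample = "x = 1\n\U0001D41Eval(\"print('hi')\")\n"
findings = find_non_ascii_identifiers(sample)
```

Reporting the NFKC form next to the raw identifier would let a reviewer see at a glance which name the interpreter actually resolves.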

Test

  1. Generate a bolded Unicode variant of the letter e to obtain 𝐞
  2. Append the following code to tests/analyzer/sourcecode/code-execution.py:
    # ruleid: code-execution
    𝐞val("print('malicious print statement')")
    
  3. From the root of the project, run semgrep --metrics off --test --config guarddog/analyzer/sourcecode tests/analyzer/sourcecode
  4. Verify all of the unit tests pass

QuinceyJames avatar Mar 28 '23 16:03 QuinceyJames

Interesting post!

I've seen this solved a few ways, one of them being what you suggest. The preprocessing/replacement part can be tricky, as it could break functionality if you incorrectly replace a Unicode character.
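
To make that risk concrete, here is a small sketch (an assumed example, not taken from Guarddog) of what happens when NFKC is applied to a whole source file rather than to identifiers only: the contents of a string literal change too.

```python
import unicodedata

# A harmless line whose string literal contains U+00B2 (SUPERSCRIPT TWO).
original = 'unit = "m\u00b2"\n'

# Naively normalizing the whole file rewrites the literal as well,
# so the analyzed source no longer matches what the package ships.
naive = unicodedata.normalize("NFKC", original)

literal_changed = naive != original
```

Restricting normalization to identifier tokens, or analyzing a normalized copy while leaving the original file untouched, would be one way to avoid this.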

zmallen avatar Apr 10 '23 23:04 zmallen