Detection Evasion with Unicode
Problem
Hi! I just read an interesting article on how bad actors can evade text-based static analysis tools using Unicode. Ever since PEP 3131, Python has allowed non-ASCII characters in identifiers so that developers can "define classes and functions with names in their native languages". As a consequence, there are now many ways a keyword like `eval` can be written. (See: https://lingojam.com/BoldTextGenerator)
Proposal
Guarddog could preprocess all source files by NFKC-normalizing them, which maps most homoglyph characters back to their ASCII equivalents. According to the PEP, "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
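As a minimal sketch of that preprocessing step (the `normalize_source` helper name is hypothetical, not an existing GuardDog function), the standard library's `unicodedata` module can apply the same NFKC normalization the PEP describes:

```python
import unicodedata

def normalize_source(source: str) -> str:
    """Apply NFKC normalization so homoglyph identifiers collapse to ASCII."""
    return unicodedata.normalize("NFKC", source)

# MATHEMATICAL BOLD SMALL E (U+1D41E) followed by "val"
obfuscated = '\U0001d41eval("print(1)")'
print(normalize_source(obfuscated))  # prints: eval("print(1)")
```

After normalization, the existing `eval`-matching Semgrep rules would fire without any changes.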
Alternatively, Guarddog could define a new heuristic that warns if non-ASCII characters are found.
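A rough sketch of such a heuristic (the helper name is hypothetical, not GuardDog's actual API) could use the standard library's `tokenize` module, which preserves the raw, un-normalized text of identifier tokens:

```python
import io
import tokenize

def find_non_ascii_names(source: str) -> list[tuple[int, str]]:
    """Return (line, token) pairs for identifier tokens containing non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits

sample = 'x = 1\n\U0001d41eval("print(1)")\n'
print(find_non_ascii_names(sample))  # flags the bolded identifier on line 2
```

Scanning tokens rather than raw characters keeps the heuristic from flagging legitimate non-ASCII content inside string literals and comments.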
Test
- Generate a bolded Unicode variant of the letter `e` to obtain `𝐞`
- Append `tests/analyzer/sourcecode/code-execution.py` with the following code:
  ```python
  # ruleid: code-execution
  𝐞val("print('malicious print statement')")
  ```
- From the root of the project, run `semgrep --metrics off --test --config guarddog/analyzer/sourcecode tests/analyzer/sourcecode`
- Verify all of the unit tests pass
Interesting post!
I've seen this solved a few ways, one of them being what you suggest. The preprocessing/replacement part can be tricky: it could break functionality if you incorrectly replace a piece of Unicode.
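A minimal sketch of that failure mode, plus a token-level workaround (the `normalize_identifiers_only` helper is hypothetical, not part of GuardDog): NFKC-normalizing a whole file also rewrites string literals, silently changing runtime behavior, whereas normalizing only identifier tokens leaves literals intact.

```python
import io
import tokenize
import unicodedata

# "ﬁle" uses the LATIN SMALL LIGATURE FI (U+FB01), so len(...) is 3.
code = 'print(len("\ufb01le"))'

# Naive whole-file normalization rewrites the string literal too,
# so this program would now print 4 instead of 3.
broken = unicodedata.normalize("NFKC", code)

def normalize_identifiers_only(source: str) -> str:
    """NFKC-normalize only NAME tokens, leaving string literals untouched."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            text = unicodedata.normalize("NFKC", text)
        result.append((tok.type, text))
    # untokenize in compat mode may alter whitespace, but preserves semantics
    return tokenize.untokenize(result)
```

For a detection tool the normalized text is only scanned, never executed, so even the naive approach may be acceptable there; the token-level version matters if the rewritten source is ever run or re-published.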