Detection Evasion with Unicode
Problem
Hi! I just read an interesting article on how bad actors can evade text-based static analysis tools using Unicode. Ever since PEP 3131, Python has allowed non-ASCII characters in identifiers so that developers can "define classes and functions with names in their native languages". As a consequence, there are now many ways a keyword like `eval` can be written. (See: https://lingojam.com/BoldTextGenerator)
Proposal
Guarddog could preprocess all source files by NFKC-normalizing them, which maps most homoglyph characters back to their ASCII equivalents. According to the PEP, "All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
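As a minimal sketch of that preprocessing step (the `normalize_source` helper name is hypothetical, not an existing GuardDog function), the standard library's `unicodedata` module can apply the same NFKC normalization the PEP describes:

```python
import unicodedata

def normalize_source(source: str) -> str:
    """Apply NFKC normalization so homoglyph identifiers collapse to ASCII."""
    return unicodedata.normalize("NFKC", source)

# MATHEMATICAL BOLD SMALL E (U+1D41E) followed by "val"
obfuscated = '\U0001d41eval("print(1)")'
print(normalize_source(obfuscated))  # prints: eval("print(1)")
```

After normalization, the existing `eval`-matching Semgrep rules would fire without any changes.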
Alternatively, Guarddog could define a new heuristic that warns if non-ASCII characters are found.
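A rough sketch of such a heuristic (the helper name is hypothetical, not GuardDog's actual API) could use the standard library's `tokenize` module, which preserves the raw, un-normalized text of identifier tokens:

```python
import io
import tokenize

def find_non_ascii_names(source: str) -> list[tuple[int, str]]:
    """Return (line, token) pairs for identifier tokens containing non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits

sample = 'x = 1\n\U0001d41eval("print(1)")\n'
print(find_non_ascii_names(sample))  # flags the bolded identifier on line 2
```

Scanning tokens rather than raw characters keeps the heuristic from flagging legitimate non-ASCII content inside string literals and comments.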
Test
- Generate a bolded Unicode variant of the letter `e` to obtain `𝐞`
- Append `tests/analyzer/sourcecode/code-execution.py` with the following code:
  ```python
  # ruleid: code-execution
  𝐞val("print('malicious print statement')")
  ```
- From the root of the project, run `semgrep --metrics off --test --config guarddog/analyzer/sourcecode tests/analyzer/sourcecode`
- Verify all of the unit tests pass
Interesting post!
I've seen this solved a few ways, one of them being what you suggest. The preprocessing/replacement part can be tricky: it could break functionality if you incorrectly replace a piece of Unicode.
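A minimal sketch of that failure mode, plus a token-level workaround (the `normalize_identifiers_only` helper is hypothetical, not part of GuardDog): NFKC-normalizing a whole file also rewrites string literals, silently changing runtime behavior, whereas normalizing only identifier tokens leaves literals intact.

```python
import io
import tokenize
import unicodedata

# "ﬁle" uses the LATIN SMALL LIGATURE FI (U+FB01), so len(...) is 3.
code = 'print(len("\ufb01le"))'

# Naive whole-file normalization rewrites the string literal too,
# so this program would now print 4 instead of 3.
broken = unicodedata.normalize("NFKC", code)

def normalize_identifiers_only(source: str) -> str:
    """NFKC-normalize only NAME tokens, leaving string literals untouched."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            text = unicodedata.normalize("NFKC", text)
        result.append((tok.type, text))
    # untokenize in compat mode may alter whitespace, but preserves semantics
    return tokenize.untokenize(result)
```

For a detection tool the normalized text is only scanned, never executed, so even the naive approach may be acceptable there; the token-level version matters if the rewritten source is ever run or re-published.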