datadog-static-analyzer icon indicating copy to clipboard operation
datadog-static-analyzer copied to clipboard

[STAL-2195] Initial implementation of intra-method taint analysis in Java

Open jasonforal opened this issue 1 year ago • 0 comments

What problem are you trying to solve?

Some significant security vulnerabilities (e.g. SQL injection) can be caught with taint analysis.

Currently, writing accurate rules to detect these vulnerabilities is difficult because we can currently only very crudely approximate the flow of variables through a program. To address this, we will build "native support" for (presently: intra-method) taint analysis for Java.

What is your solution?

[!NOTE] This is a large PR. All commits are intentionally organized and purely additive to each other. It's likely easier to review commits individually, in sequential order.

Preface

Our goal is to implement simple taint analysis in a way that builds on our current CST-based technique.

From a theoretical perspective, a CST-based analysis hits an accuracy ceiling pretty quickly. More traditional approaches lower an AST to an intermediate form like SSA, and then construct an abstract program state: constructing/utilizing facts like symbol/name resolution, type resolution, constant propagation and control flow to improve analysis accuracy.

Doing this would require re-architecting and re-implementing some core assumptions of the analyzer, and so that's an out-of-scope consideration.

Methodology

[!NOTE] Terminology: Source: A place in a source code where data originates (e.g. the headers from an HTTP request). Sink: A place in a source code where that data can end up (and particularly that this data could have harmful side effects -- e.g. a SQL statement).

  1. Rule author defines tree-sitter query only for a "sink". Sink captures are sent to the JavaScript runtime and a standard analysis begins.
  2. The sink node's containing method is located.
  3. The MethodFlow class recursively traverses that containing method, iteratively constructing a graph of relationships between CST nodes. This graph is an unfiltered interpretation of the relationship between nodes in a method, and so it contains superfluous nodes that a rule author will not care about.
  4. A graph reduction is performed to cull nodes that aren't useful for taint analysis and to draw new edges between those that are.
  5. The reduced graph is then traversed and all paths from the sink to any "source" (technically defined: a graph node with an out-degree of 0 -- note that we perform a backward data flow: graph edges go from sink to source) are collected.
  6. We expose these paths to the rule author as a TaintFlow, and so the rule can perform its own logic to determine true vs false positive. This unit test shows how this information is surfaced.

Limitations

Some limitations are artificial and were chosen for simplicity. These can be addressed in future iterations:

  • Mutually exclusive control-flow branches are treated as if they were sequential.
    • This is probably the highest priority task in terms of improving analysis accuracy, but requires a refactor in how we construct the initial graph.
  • Variable scoping is not supported. Supporting should be pretty straightforward -- we'd need to add the hooks I commented here, as well as walk a stack of HashMaps here:
  • Some Java language constructs have not been implemented, as they are dependent on other limitations we've chosen (e.g. field_access)
  • We hardcoded some heuristics to propagate taint that aren't always true (one, two). We should eventually implement hooks either we or the user can specify custom "propagator" patterns that are aware of, e.g. specific standard library methods.

Out of Scope

Features/abstractions this PR does not implement:

  • Support for multiple physical locations (both in a Rust RuleResult and in the SARIF output.
  • Taint sanitizers
  • Fine-grained definition of taint propagators (detailed in "Limitations")
  • Inter-method graph building

Alternatives considered

What the reviewer should know

  • Because our methodology uses a lot of known simplifications that we eventually might want to revisit, these are organized into separate modules so the behavior is both a) easily seen as intentional and b) easy to see what needs to be fixed.
  • Renaming from TreeSitterNode.type to TreeSitterNode.cstType was overdue and so was squeezed in here.

jasonforal avatar Aug 15 '24 00:08 jasonforal