datadog-static-analyzer icon indicating copy to clipboard operation
datadog-static-analyzer copied to clipboard

[STAL-2635] Add JavaScript `Digraph`

Open jasonforal opened this issue 1 year ago • 2 comments

What problem are you trying to solve?

To support more sophisticated analysis techniques, we want to be able to construct arbitrary graphs from the source code we're analyzing.

We're initially iterating on the construction and traversal logic in JavaScript for iteration speed. These can be ported up to Rust if/when it makes sense.

This PR has two main goals:

  1. Implement the graph data structure.
  2. Establish a pattern to write unit tests against an expected graph structure.

What is your solution?

JavaScript-based Digraph

We'll initially be implementing taint analysis, and so our initial graph will be stored as an adjacency list containing directed edges that indicate the flow of tainted values from sources to targets. We need to be able to type the edges to distinguish between various semantics (for now: "assignment" and "dependence")

Thus, in pseudo-code, an edge may look like

type NodeId = u16;
#[repr(u8)]
enum EdgeKind {
    Assignment,
    Dependence,
}
#[derive(Copy)]
struct Edge {
    target: NodeId,
    kind: EdgeKind,
}
type Adjacencies = Vec<EdgeKind>;

Defining and parsing edges as structs makes sense in Rust, as LLVM can optimize these into register allocations.

The straightforward JavaScript equivalent of this would be an object:

const edge1 = {
    target: 123,
    kind: EdgeKind_DEPENDENCE,
};
const adjacencies = [edge1, /* edge2, edge3, ... */];

However, this isn't ideal because it would always perform a heap allocation for each edge, as v8 cannot optimize in a similar manner.

Instead, we use bit packing to represent the edge as a v8 smi (31 bit int), which additionally lets us take advantage of a specific v8 array optimization: PACKED_SMI_ELEMENTS). Edges are expected to be created frequently, so this represents a significant speedup.

(The max safe integer in JavaScript is 53 bits, so reserving 4 bits for edge type isn't a concern)

The logic for this serialization and deserialization is round-trip tested via these Rust tests.

Unit Tests

We'll be building more complex graphs that are specific to the semantics of each programming language. We need to be able to easily read (and write) unit tests that make assertions about the structure of a graph generated from source code. To that end, this PR introduces the following pattern:

  • Expected graphs are specified using DOT Language.
  • Unit tests specify a transformer function to map a manually-written graph to an algorithmically generated one.

To illustrate: while not implemented in this PR, a future PR will use this pattern to ergonomically specify CST nodes as graph nodes and edges as taint flow. For example, for the given Java code:

String value;
if (foo) {
    value = "safe";
} else {
    value = userInput;
}
System.out.println(value);

A test writer will be able to specify that they expect at line 7, "value is dependence on userInput or "safe"" by simply writing:

digraph expected {
    value [line=7]
    userInput [line=5]
    s1 [text=safe,line=3]
  
    value -> s1 [kind=dependence]
    value -> userInput [kind=dependence]
}

An example of how the graph assertions are currently done in this PR is located here.

Graph Reduction

When we traverse a CST, we encounter a lot of nodes that do not contribute to our analysis. Even so, we still construct the graph with those "unnecessary" nodes for simplicity's sake: it makes it so our initial graph traversal doesn't need to be overly complex/stateful. We then reduce the graph in a secondary pass.

For example, given the Java code:

int value = (1 + 2) * 3 + magic;

We will be able to construct the following graph with a straightforward traversal:

digraph {
    var1 [text=value,kind=variable]
    var2 [text=magic,kind=variable]
    bin1 [text="3 + magic",kind=bin_op]
    bin2 [text="1 + 2",kind=bin_op]
    bin3 [text="(1 + 2) * 3 + magic",kind=bin_op]
    par1 [text="(1 + 2)",kind=parenthesized]
    n1 [text=1,kind=integer]
    n2 [text=2,kind=integer]
    n3 [text=3,kind=integer]

    bin1 -> var2 [kind=dependence]
    bin1 -> n3 [kind=dependence]
    bin2 -> n2 [kind=dependence]
    bin2 -> n1 [kind=dependence]
    bin3 -> bin1 [kind=dependence]
    bin3 -> bin2 [kind=dependence]
    var1 -> bin3 [kind=assignment]
}

And then by reducing it (calling digraph.reduce()), we can surface only the relationships our analysis cares about (e.g. that the variable value is dependent on the variable magic):

digraph {
    var1 [text=value,kind=variable]
    var2 [text=magic,kind=variable]
    bin3 [text="(1 + 2) * 3 + magic",kind=bin_op]
    
    var -> var2 [kind=dependence]
    var1 -> bin3 [kind=assignment]
}

For simplicity, all future analysis can then operate on this reduced graph.

Alternatives considered

What the reviewer should know

  • New third party dependency graphviz-rust is only used to parse the DOT language format used for unit tests.

jasonforal avatar Aug 09 '24 20:08 jasonforal

My initial benchmarks showed ~40% perf increase by using a single v8::Number instead of allocating objects (especially since hundreds of edges are expected per file). I think the only somewhat complex thing is in using a function call rather than property access: so

const packed = makeEdge(1, EDGE_DEPENDENCE);
// vs
const edge = new Edge(1, EDGE_DEPENDENCE);

//////
getEdgeTarget(packed);
// vs
edge.target;

//////
getEdgeKind(packed);
// vs
edge.kind;

but doesn't seem like a huge deviation from our current patterns (e.g. we do ddsa.getChildren(node) instead of node.children).

Additionally, I think this also makes parts of our implementation more simple. To detect a cycle in the graph, we simply do:

// Set<number>
const visited = new Set();
// Marking visited:
visited.add(packedEdge);
// Checking visited
if (visited.has(packedEdge)) {
    // ...
}

To do this with an edge object, we'd have to do something like:

// Option 1
// Map<TargetId, EdgeKind[]>
const visited = new Map();
// Marking visited:
let existing = visited.get(edge.target);
if (existing === undefined) {
    const newArr = [];
    visited.set(edge.target, newArr);
    existing = newArr
}
existing.push(edge.kind);

// Checking visited:
const existing = visited.get(edge.target);
if (existing !== undefined) {
    if (existing.contains(edge.kind)) {
        // ...
    }
}

Or

// Option 2
// Set<string>
function serializeEdge(edge) {
    return `${edge.target}-${edge.kind}`
}
// Marking visited:
const edgeStr = serializeEdge(edge);
visited.add(edgeStr);
// Checking visited:
if (visited.has(edgeStr)) {
    // ...
}

Option 1 is more complex than my implementation, Option 2 is the same complexity. Both will perform markedly worse.

--

I think the complexity is a toss up given the requirement to detect cycles and how an object-based approach introduces its own complexities. Given the magnitude of the perf improvement, I'd err on the side of going with the PackedEdge implementation.

jasonforal avatar Aug 14 '24 15:08 jasonforal

Rebased to fix merge conflicts from #⁠485

jasonforal avatar Aug 14 '24 20:08 jasonforal

Force pushed to clean up the implementation:

  • Removed graph reduction
  • Removed "markIdentifier" API (intention here was to specify "core" nodes to preserve during a graph reduction/edge contraction, but this was leaky)
  • Improved documentation
  • Renamed graph "Node" to "Vertex" -- this will end up being much less confusing as we will want to distinguish a tree-sitter "node" from a graph "node".

jasonforal avatar Aug 27 '24 15:08 jasonforal