codeql Java: Call Graph

Hello,

I am trying to use CodeQL to get the control flow of a program. More specifically I want to get the control flow into methods that I have marked as sensitive in a classes named SensitiveMethodCall. The goal is to parse the results to get the path. However, right now I am not generating a path. It is hard to tell if a path isn't being generated because of the query itself (I am used to using dataflow, so this is new to me) or if it has to do with this error

Error was: Expected result pattern(s) are not present for path-problem query: Expected at least two result patterns. These should include at least an 'edges' result set (see https://codeql.github.com/docs/writing-codeql-queries/creating-path-queries/). [INVALID_RESULT_PATTERNS]

All of the examples I see on this link have to deal with dataflow and not control flow, so I am having trouble figuring out what I am doing wrong to get the correct output. Here is my query.

import java
import semmle.code.java.ControlFlowGraph
import SensitiveInfo.SensitiveInfo

/**
 * @name Control flow path from normal method call to sensitive method call
 * @description This query identifies the full control flow path from a normal method call to a sensitive method call.
 *              It traces every control flow node between the normal and sensitive calls.
 * @kind path-problem
 * @id java/custom/control-flow-path
 * @problem.severity warning
 * @tags control-flow, data-flow, security, sensitive-data
 * @precision medium
 * @security-severity 5.0
 * @cwe CWE-200, CWE-201, CWE-209
 * @sub-severity high
 * 
 * Get the control flow path from a normal method call to a sensitive method call.
 * Returns the full sequence of control flow nodes.
 */
predicate controlFlowPath(MethodCall start, MethodCall end, ControlFlowNode node) {
  exists(ControlFlowNode startNode, ControlFlowNode endNode |
    startNode = start.getControlFlowNode() and
    endNode = end.getControlFlowNode() and
    startNode.getANormalSuccessor*() = node and
    node.getANormalSuccessor*() = endNode
  )
}

/**
 * Get the full control flow path between normal and sensitive method calls.
 */
from MethodCall normalCall, SensitiveMethodCall sensitiveCall, ControlFlowNode node
where controlFlowPath(normalCall, sensitiveCall, node)
select normalCall.getControlFlowNode(), normalCall, sensitiveCall, "Full control flow path from normal method call to sensitive method call."

Any help would be greatly appreciated. Thank you.

Sep 13 '24 09:09 KylerKatz

Hi again @KylerKatz 👋🏻

Queries of type path-problem are intended to capture the path that data takes through a program in a data flow query. I do not believe that you can write control flow queries of type path-problem. So you should use a different query type here.

To then understand where your query goes wrong, the first step would be to identify where the problem is by starting with the smallest components involved and then working your way up. For example, if you query all instances of SensitiveMethodCall, do you get the results you would expect? Similarly, if you use the quick evaluation feature to evaluate all possible results for controlFlowPath, do you get the results you expect?

Also, query metadata should normally be at the very top of the file, above the import statements.

Sep 13 '24 11:09 mbg

Hello @mbg,

Thank you for all your help.

I have broken the problem down and done some quick evaluations. Using this example,

public class BAD_APIEndpointDebugging {
    private static final boolean API_DEBUG = Boolean.parseBoolean(System.getenv("API_DEBUG"));

    public static void main(String[] args) {
        String apiEndpoint = "https://api.example.com/data";
        String apiKey = "api_key_123456789"; 

        if (API_DEBUG) {
            testAPI(apiEndpoint, apiKey);
        }
    }

    private static void testAPI(String url, String key) {
        System.out.println("DEBUG: Testing API endpoint: " + url + " with API Key: " + key);
        // Assume some testing logic here
    }
}

SensitiveMethodCall

I get these results

Which lines up with the data that I have in my YML file

extensions:
  - addsTo:
      pack: custom-codeql-queries
      extensible: sinks
    data:
    - ["fileName", "sinkName", "type"]
    - ["BAD_APIEndpointDebugging.java", "println", "Log Sink"]
    - ["BAD_APIEndpointDebugging.java", "parseBoolean", "Log Sink"]
    - ["BAD_APIEndpointDebugging.java", "testAPI", "Log Sink"]
    - ["BAD_APIEndpointDebugging.java", "getenv", "Log Sink"]

So SensitiveMethodCall seems to be returning the desired results

ControlFlowPath

I did the same for this

Looking at this, it seems to almost be alright Results 1 - 3 correspond to this line

private static final boolean API_DEBUG = Boolean.parseBoolean(System.getenv("API_DEBUG"));

4 & 5 correspond to their own self use, however, there should be one starting at testAPI and ending at println I am not sure why this control flow is being missed?

Also, I am still looking for a way to get the actual path. Something like this.

Is this possible for control flow?

Also, here is my updated query for reference

/**
 * @name Control flow path from normal method call to sensitive method call
 * @description This query identifies the control flow path from a normal method call to a sensitive method call.
 *              It traces every control flow node between the normal and sensitive calls.
 * @kind problem
 * @id java/custom/control-flow-path
 * @problem.severity warning
 * @tags control-flow, security, sensitive-data
 * @precision medium
 * @security-severity 5.0
 * @cwe CWE-200, CWE-201, CWE-209
 * @sub-severity high
 */

 import java
 import semmle.code.java.ControlFlowGraph
 import SensitiveInfo.SensitiveInfo
 
 /**
  * Predicate to define a control flow path between a start method call and an end method call.
  */
 predicate controlFlowPath(MethodCall start, MethodCall end) {
   exists(ControlFlowNode startNode, ControlFlowNode endNode |
     startNode = start.getControlFlowNode() and
     endNode = end.getControlFlowNode() and
     startNode.getANormalSuccessor*() = endNode // Track the normal successors between start and end
   )
 }
 
 from MethodCall normalCall, SensitiveMethodCall sensitiveCall
 where controlFlowPath(normalCall, sensitiveCall)
 select normalCall, sensitiveCall.getControlFlowNode().toString(), sensitiveCall.getMethod().getReturnType(),
   "Control flow path from normal method call to sensitive method call."

Thank you once again for your help.

Sep 13 '24 19:09 KylerKatz

After some more research, I believe I was referring to the wrong thing, instead of wanting the control-flow paths. I believe it's actually the call graph that I am looking for. Sorry for the confusion. I found this discussion That talks about doing it using Methods.

However, if I use this on my previous example,

public class BAD_APIEndpointDebugging {
    private static final boolean API_DEBUG = Boolean.parseBoolean(System.getenv("API_DEBUG"));

    public static void main(String[] args) {
        String apiEndpoint = "https://api.example.com/data";
        String apiKey = "api_key_123456789"; 

        if (API_DEBUG) {
            testAPI(apiEndpoint, apiKey);
        }
    }

    private static void testAPI(String url, String key) {
        System.out.println("DEBUG: Testing API endpoint: " + url + " with API Key: " + key);
        // Assume some testing logic here
    }
}

I only get

This is going in the right direction because I am finally getting the path that I have been trying to get. However, it is missing the paths from

parseBoolean() -> getenv() And testAPI() -> println()

I believe the reason for this is because they are MethodCalls.

Does anyone know of a way of making something like this work with MethodCalls instead? The issue is that MethodCall doesn't have a call predicate, which allows for the path to be generated.

/**
 * @kind path-problem
 */

import java

class StartMethod extends Method {
  StartMethod() { getName() = "validateExpression" }
}

class TargetMethod extends Method {
  TargetMethod() { getName() = "findValue" }
}

query predicate edges(Method a, Method b) { a.calls(b) }

from TargetMethod end, StartMethod entryPoint
where edges+(entryPoint, end)
select end, entryPoint, end, "Found a path from start to target."

Thank you.

Sep 20 '24 21:09 KylerKatz

These are getting filtered out because the first column, here end, gives the associated location for the alert, and CodeQL will only show in VS Code / export as SARIF those results with an associated location in user code (i.e., a .java file located within the extracted repository). By contrast, println and parseBoolean are defined in the standard library, so we don't get an alert.

Change the final line to

select entryPoint, entryPoint, end, "Found a path from start to target."

Then you'll get alerts whenever the start point is in user code, and so you will indeed see paths like main -> println.

Note that you will not see a path getenv -> parseBoolean because neither of those methods calls the other. parseBoolean consumes the output of getenv, but this isn't a call-graph relation, it's a data-flow relation, for which the standard CodeQL dataflow library is probably the right tool.

Sep 23 '24 07:09 smowton

Hello @smowton,

Thank you for your response,

I updated the query with your suggestion, and it is now working. However, I am noticing a few limitations of this query. Just to make sure that we're on the same page here is my current version.

/**
 * @kind path-problem
 * @id java/sinks/call-graph
 */

 import java
 import SensitiveInfo.SensitiveInfo

 query predicate edges(Method a, Method b) { a.calls(b) }
 
 from DetectedMethod end, Method entryPoint
 where edges+(entryPoint, end)
 select entryPoint, entryPoint, end, "Found a path from start to target."

There are two main limitations, the first being the one that you pointed out at the end of your response concerning seeing a path getenv -> parseBoolean because neither of those methods calls the other.

The second is that if we have a standard method such as println the location for this will be a class file where it is implemented not where it is called. I have a workaround for this when parsing, however, I still feel like there is a better way of doing so.

To test this, I went back to trying dataflow, I have a helper class DetectedMethodCall which is defined as so,

    class DetectedMethodCall extends MethodCall {
      DetectedMethodCall() {
        // Check for matches against the sinks
        exists(File f |
          sinks(f.getBaseName(), this.getMethod().getName(), _)
        )
      }
    }

Running a quick eval on this results in these

1 | parseBoolean(...)
2 | getenv(...)
3 | testAPI(...)
4 | println(...)
5 | parseBoolean(...)
6 | getProperty(...)
7 | processData(...)
8 | println(...)
9 | getDataSnippet(...)
10 | substring(...)
11 | parseBoolean(...)
12 | getProperty(...)
13 | println(...)

This checks out and corresponds to the sinks that I have in my YML file.

So, I decided to integrate this class into my dataflow query as the sink.

 predicate isSink(DataFlow::Node sink) {
    exists(DetectedMethodCall dmc |
      sink.asExpr() = dmc
    )
  }

However, running a quick eval on this predicate, I only get these paths

  |   | Path
  |   | 1 | "API_DEBUG" : String |   | BAD_APIEndpointDebugging.java:2:81
  |   | 2 | getenv(...) |   | BAD_APIEndpointDebugging.java:2:67
  |   | String | BAD_APIEndpointDebugging.java:1:14
  |   | Path
  |   | 1 | "API_DEBUG" : String |   | BAD_APIEndpointDebugging.java:2:81
  |   | 2 | getenv(...) : String |   | BAD_APIEndpointDebugging.java:2:67
  |   | 3 | parseBoolean(...) |   | BAD_APIEndpointDebugging.java:2:46

This excludes testAPI and println is there something you know of that I could add to include all method class such as these two?

I know if I do something like this,

predicate isSink(DataFlow::Node sink) {
  exists(DetectedMethodCall dmc |
    sink.asExpr() = dmc.getAnArgument()
  )
}

The I can get a result like this which is used by println however, I want the method call name, not the arguments as the ending point.

  |   | 1 | "api_key_123456789" : String |   | BAD_APIEndpointDebugging.java:6:25
  |   | 2 | apiKey : String |   | BAD_APIEndpointDebugging.java:9:34
  |   | 3 | key : String |   | BAD_APIEndpointDebugging.java:13:45
  |   | 4 | ... + ... |   | BAD_APIEndpointDebugging.java:14:28

Lastly, here is the full query for reference, I am mostly considering almost everything as a source and just want to see what flows into my methods.

/**
 * @name Backward slicing (optimized with extended flow tracking)
 * @description Identifies the backward slice of a sink node with performance improvements and additional tracking for transformations
 * @kind path-problem
 * @problem.severity warning
 * @id java/backward-slice-extended
 */

 import java
 private import semmle.code.java.dataflow.ExternalFlow
 import semmle.code.java.dataflow.TaintTracking
 import semmle.code.java.dataflow.FlowSources
 import CommonSinks.CommonSinks
 import SensitiveInfo.SensitiveInfo
 
 module Flow = TaintTracking::Global<SensitiveInfoInErrorMsgConfig>;
 import Flow::PathGraph
 
 /** A configuration for tracking sensitive information flow into error messages. */
 module SensitiveInfoInErrorMsgConfig implements DataFlow::ConfigSig {
 
   // Optimized source predicate to reduce performance overhead and track additional types
   predicate isSource(DataFlow::Node source) {
     // Include all string literals (including numeric-like strings)
     source.asExpr() instanceof StringLiteral
     or
     // Track method calls if they are assigned to variables (right-hand side of assignment)
     exists(AssignExpr assign |
       assign.getRhs() = source.asExpr() and
       source.asExpr() instanceof MethodCall
     )
     or
     // Track static method calls
     exists(AssignExpr assign |
       assign.getRhs() = source.asExpr() and
       source.asExpr() instanceof StaticMethodCall
     )
     or
     // Track field accesses if they are assigned to variables (right-hand side of assignment)
     exists(AssignExpr assign |
       assign.getRhs() = source.asExpr() and
       source.asExpr() instanceof FieldAccess
     )
     or
     // Include constructor arguments as sources to capture flow into objects
     exists(NewClassExpr newExpr |
       newExpr.getAnArgument() = source.asExpr()
     )
     or
     // Include cast expressions where sensitive data might flow between different types
     source.asExpr() instanceof CastExpr
   }
 
   predicate isSink(DataFlow::Node sink) {
    exists(DetectedMethodCall dmc |
      sink.asExpr() = dmc
    )
  }
  

 
   predicate isAdditionalFlowStep(DataFlow::Node pred, DataFlow::Node succ) {
    // Flow from arguments to method calls
    exists(MethodCall call |
      pred.asExpr() = call.getArgument(_) and
      succ.asExpr() = call
    )
    or
    // Flow through binary expressions (e.g., string concatenations)
    exists(BinaryExpr binExpr |
      pred.asExpr() = binExpr.getAnOperand() and
      succ.asExpr() = binExpr
    )
    // or
    // Existing steps (if any)
    // ...
  }
}
 
 from Flow::PathNode source, Flow::PathNode sink
 where
   // Capture flows between optimized sources and sinks
   Flow::flowPath(source, sink)
    
 select
   sink.getNode().getEnclosingCallable(),
   source,
   sink,
   source.getNode().getType().toString(),
   sink.getNode().getType().toString(),
   "Dataflow from " + source.getNode().toString() + " to " + sink.getNode().toString() + ""

Thank you for your help.

Sep 24 '24 06:09 KylerKatz

A few points about this:

If you're looking to both query control-flow relations (what method calls what) and data-flow (what results can flow into arguments of other methods), I would strongly advise querying the two things in different queries, rather than try to cram them into a single query seeking to conflate the two kinds of relationship.
A sink like exists(DetectedMethodCall dmc | sink.asExpr() = dmc) will never work for void-typed methods like println, because there is no data-flow method for something of void type. Setting the argument as the sink instead is the correct approach for something like this.
A good way to report the method called at the final step could be the use of $@ interpolation in the alert message, which allows you to report additional locatable things alongside the result:

from Method finalCallee
...
exists(MethodCall mc | sink.getNode().asExpr() = mc.getAnArgument() | finalCallee = mc.getCallee())
...
 select
   sink.getNode().getEnclosingCallable(),
   source,
   sink,
   "Dataflow from $@ to $@",
   source,
   source.toString(),
   finalCallee,
   finalCallee.toString()

Note that each of the $@ tokens obliges you to append one locatable (source and finalCallee here) followed by one string (used to construct the alert message). In VSCode these will be the target and the text of a hotlink respectively.

Sep 24 '24 16:09 smowton

Hi @KylerKatz,

Were you able to use @smowton suggestions to improve your query. I didn't follow the entire thread, but based on your initial question, that I interpreted as your intention to generate a path graph of some control flow of interest, the following blog post might be interesting CodeQL Path Graphs.

Oct 16 '24 22:10 rvermeulen