
[Bug Report] Data Flow Interruption with Function Parameters and Variable Arguments in Python

Open gravingPro opened this issue 1 year ago • 7 comments

I've encountered issues in CodeQL regarding data flow interruption. Here are the details:

1. Function Parameter Passing Interruption

In the code below:

def read_sql(sql):
    spark.sql()  # sink custom

def process(func, args): 
    func(*args) 

sql = request.json['data']  # Source
process(func=read_sql, args=sql)

CodeQL fails to detect that the tainted variable sql is passed into read_sql when the process function performs the call on its behalf. This shows an interruption in data flow tracking when a function and its arguments are passed as parameters and then invoked with variable arguments.
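For reference, a minimal runnable sketch (using hypothetical stand-ins for the source and sink, since request and spark are not available outside their frameworks) confirms that the tainted value really does reach the sink at runtime, so the missing result is a false negative rather than an impossible path:

```python
# Stand-ins: `captured` plays the role of the spark.sql sink and the
# string literal plays the role of request.json['data'].
captured = []

def read_sql(sql):
    captured.append(sql)  # sink stand-in

def process(func, args):
    func(*args)  # indirect call with argument unpacking

tainted = "data-from-request"  # source stand-in
process(func=read_sql, args=(tainted,))  # args wrapped in a tuple for unpacking
assert captured == ["data-from-request"]
```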

2. *args and **kwargs Interruption

The problem is that CodeQL cannot accurately track data flow through *args (variable positional arguments) and **kwargs (variable keyword arguments) when they are used in a way that affects the flow. In the example above, unpacking *args inside the process function causes the flow of sql to be lost. This issue extends to similar scenarios involving these constructs.
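The **kwargs side of the claim can be illustrated with the same shape of example (function and variable names below are hypothetical stand-ins); the flow is equally indirect:

```python
seen = {}

def handle(payload=None):
    seen["payload"] = payload  # sink stand-in

def process(func, **kwargs):
    func(**kwargs)  # indirect call with keyword-argument unpacking

process(handle, payload="tainted-value")  # source stand-in
assert seen["payload"] == "tainted-value"
```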

Moreover, these problems also occur with multithreading and multiprocessing APIs such as threading.Thread, multiprocessing.Process, concurrent.futures.ThreadPoolExecutor, and concurrent.futures.ProcessPoolExecutor.
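For example, the ThreadPoolExecutor case follows the same pattern, since submit forwards its extra positional arguments to the callable via *args internally (the sink and source names below are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

received = []

def handle(payload):
    received.append(payload)  # sink stand-in

tainted = "payload-from-request"  # source stand-in
with ThreadPoolExecutor(max_workers=1) as ex:
    ex.submit(handle, tainted).result()  # arguments forwarded via *args internally

assert received == ["payload-from-request"]
```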

I hope this description helps in identifying and resolving these problems. Looking forward to a timely fix or further guidance on handling such complex data flow tracking scenarios.

Best regards

gravingPro avatar Oct 14 '24 13:10 gravingPro

Hi @gravingPro,

Thanks for the bug report. We will inform the Python team and get back to you on possible further guidance.

rvermeulen avatar Oct 14 '24 19:10 rvermeulen

Hi @gravingPro,

A quick follow-up question. How is your custom sink defined? In read_sql the argument sql is currently unused.

rvermeulen avatar Oct 14 '24 21:10 rvermeulen

Hi @gravingPro,

> A quick follow-up question. How is your custom sink defined? In read_sql the argument sql is currently unused.

It's just a simple example. Any sink can be used here, whether for SQL injection or SSRF.

gravingPro avatar Oct 15 '24 15:10 gravingPro

Hi @gravingPro,

Is read_sql your sink? The comment suggests spark.sql() # sink custom is the sink. If spark.sql() is the sink, then the flow will stop at read_sql's unused parameter sql; it will never reach your sink, so I would like to exclude that possibility.

rvermeulen avatar Oct 17 '24 22:10 rvermeulen

Hi @gravingPro,

> Is read_sql your sink? The comment suggests spark.sql() # sink custom is the sink. If spark.sql() is the sink, then the flow will stop at read_sql's unused parameter sql; it will never reach your sink, so I would like to exclude that possibility.

Oh, I made a mistake there. I omitted the taint data. The correct code is:

def read_sql(sql_data):
    spark.sql(sql_data)  # A simple sink for the example. Any other official sink could be applied here.

def process(func, args): 
    func(*args) 

sql = request.json['data']  # Source
process(func=read_sql, args=(sql,))  # args must be a tuple so that *args unpacks a single argument

gravingPro avatar Oct 18 '24 11:10 gravingPro

As mentioned in the description, this is also found in projects using multithreading and multiprocessing APIs such as threading.Thread, concurrent.futures.Executor, and multiprocessing.Process. CodeQL cannot recover the path.

# example
import threading

def print_greeting(greeting):
    sink(greeting)

taint_data = request.json['data']  # Source
thread = threading.Thread(target=print_greeting, args=(taint_data,))
thread.start()

We have inspected the underlying code of the threading module and found that the call self._target(*self._args, **self._kwargs) in the run method cannot be correctly tracked by CodeQL.

class Thread(BaseThread):
    # ...

    def run(self):
        """A virtual method to be overridden by the subclass."""
        if self._target is not None:
            self._target(*self._args, **self._kwargs)
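The construct can be reduced to a few lines that reproduce the problematic shape without any of the threading machinery; this is only a sketch of the pattern that defeats the tracking, not of CodeQL internals:

```python
# Minimal reduction: the target and its arguments are stored on self,
# then the call unpacks self._args with * -- two indirections at once.
class Runner:
    def __init__(self, target, args):
        self._target = target
        self._args = args

    def run(self):
        self._target(*self._args)

out = []
Runner(out.append, ("tainted",)).run()
assert out == ["tainted"]
```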

gravingPro avatar Oct 18 '24 12:10 gravingPro

@gravingPro Thanks for clarifying your example. We will provide an update when we know more!

rvermeulen avatar Oct 18 '24 17:10 rvermeulen

Hi @gravingPro,

Sorry for the late reply :(. This indeed seems to be a limitation of our current data flow tracking. The team has been notified, but I don't have timelines w.r.t. a solution.

Thanks for sharing these cases.

If you have any further questions/comments, feel free to open a new issue!

rvermeulen avatar Mar 21 '25 20:03 rvermeulen