feat: identify functions only referenced from library code
Follow up on https://github.com/mandiant/capa/issues/989. Marking this as a draft, because I believe we should add a test for this function.
@mr-tz @williballenthin let me know if you have a test binary, otherwise I can compile one.
Feel free to grab any of the files from capa testfiles. Using zlib routines might be a good place to start because they're obvious and common.
tests/data/038476f1705f3ac1237ac57f4c1753e0aa085dd7cda5669d4e93399cf7a565af.exe_ contains a few functions to test with:
- 0x40ca70
- 0x40bf40
- 0x40b06c
Once we have the logic and tests working correctly, we may want to find a way to cache the results and/or do all analysis in a single pass.
Recursive functions that are called in a loop tend to have O(n**2) worst case runtime. We can avoid this by ensuring each function is only evaluated once.
We could cache the result in the viv workspace, like we do in the flirt analyzer. And/or, we could build the global call graph once and extract all the library functions in a single traversal, saving the results in an intermediate object.
Anyways, let's get the core behavior specified and tested before we optimize the implementation.
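The single-traversal idea above could look roughly like the sketch below. It assumes the call graph has already been extracted into a plain dict mapping each function address to the addresses it calls, and that library functions are already labeled (for example by FLIRT). The name `find_library_only_functions` and the dict-based graph are hypothetical stand-ins, not part of the viv or viv_utils API.

```python
from collections import deque


def find_library_only_functions(calls, library_functions):
    """
    calls: dict mapping a function address to the set of addresses it calls
           (hypothetical pre-extracted call graph).
    library_functions: set of addresses already labeled as library code.

    Returns the non-library functions that are not reachable from user code
    without passing through a library function.
    """
    functions = set(calls)
    called = set()
    for callees in calls.values():
        functions.update(callees)
        called.update(callees)

    # roots: non-library functions that nobody calls, such as entry points
    roots = functions - called - library_functions

    # walk forward from the roots, never descending into library functions,
    # so each function is visited at most once
    user_code = set()
    queue = deque(roots)
    while queue:
        fva = queue.popleft()
        if fva in user_code:
            continue
        user_code.add(fva)
        for callee in calls.get(fva, ()):
            if callee not in library_functions and callee not in user_code:
                queue.append(callee)

    return functions - user_code - library_functions
```

Because every function is visited at most once, the cost is linear in the size of the call graph, and cycles need no special handling. One caveat of this sketch: a cycle of functions that is never called from anywhere else also ends up in the result, so a real implementation would need to decide how to treat unreachable code.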
Great, thanks. I will use this binary as a test bed.
Here are some tests I will implement. Let me know if you can think of other test cases.
from fixtures import sample_038476
from viv_utils.flirt import is_only_called_from_library_functions


def test_invalid(sample_038476):
    """
    test an invalid function address
    """
    # placeholder: an address that is not a function
    func_addr = 0x400000
    assert is_only_called_from_library_functions(sample_038476, func_addr) is False


def test_not_called(sample_038476):
    """
    test a function that is not called by any other function
    """
    # placeholder: a function that is not called by any other function
    func_addr = 0x400000
    assert is_only_called_from_library_functions(sample_038476, func_addr) is False


def test_positive(sample_038476):
    """
    test a library function
    """
    # placeholder: an existing library function
    func_addr = 0x400000
    assert is_only_called_from_library_functions(sample_038476, func_addr) is True


def test_negative(sample_038476):
    """
    test a function called by both library and non-library functions,
    where at least one caller is not a library function
    """
    # placeholder: a function with mixed callers, where at least one
    # caller is neither a library function nor called only from
    # library functions, so the expected result is False
    func_addr = 0x400000
    assert is_only_called_from_library_functions(sample_038476, func_addr) is False


def test_circular(sample_038476):
    """
    test a function with a circular function call graph
    """
    # placeholders: two functions calling each other in a loop
    func_addr1 = 0x400001  # calls 0x400002
    func_addr2 = 0x400002  # calls 0x400001
    assert is_only_called_from_library_functions(sample_038476, func_addr1) is True
    assert is_only_called_from_library_functions(sample_038476, func_addr2) is True
The test ideas look great! I recommend using descriptive names for each: ones we can make sense of just by reading the name (as much as possible).
We should also test the transitive call and labeling (A (lib) -> B -> C) as discussed above.
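A minimal sketch of what the transitive case might look like, using a toy caller map instead of a viv workspace. The helper `only_called_from_library`, the caller map, and all addresses are hypothetical stand-ins for illustration; they are not the `viv_utils` implementation or real addresses from the sample.

```python
def only_called_from_library(callers, library, fva):
    """
    Toy stand-in: True if `fva` has callers and every backward call path
    from it hits a library function before reaching an uncalled function.

    callers: dict mapping a function address to the set of its callers.
    library: set of addresses labeled as library functions.
    """
    if not callers.get(fva):
        # unknown address, or a function that nobody calls
        return False

    seen = {fva}
    todo = [fva]
    while todo:
        current = todo.pop()
        for caller in callers.get(current, ()):
            if caller in library or caller in seen:
                continue
            if not callers.get(caller):
                # reached a non-library function with no callers,
                # such as an entry point
                return False
            seen.add(caller)
            todo.append(caller)
    return True


def test_transitive_library_caller():
    # MAIN -> A (lib) -> B -> C, and MAIN -> D -> C2
    A, B, C, MAIN, D, C2 = 0x1000, 0x2000, 0x3000, 0x4000, 0x5000, 0x6000
    callers = {A: {MAIN}, B: {A}, C: {B}, D: {MAIN}, C2: {D}, MAIN: set()}
    library = {A}
    assert only_called_from_library(callers, library, B)
    assert only_called_from_library(callers, library, C)
    assert not only_called_from_library(callers, library, C2)
```

Here B is called directly by the library function A, and C is only reachable through B, so both should be labeled. C2 has the same call depth but is reached from MAIN through non-library code, so it should not be.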
I included some tests in 06f2fba0834f4b9e6efcb1b0d65c73ce6389bbf1. For reference, these are the functions I used in testing (from tests/data/038476f1705f3ac1237ac57f4c1753e0aa085dd7cda5669d4e93399cf7a565af.exe_):
- 0x40CAA3
- 0x408155 (this is the main (entry) function)
- 0x407660
- 0x40B06C
I've researched this a bit further. While we could do some more advanced computation based on graph algorithms, I think we can get away with the current approach plus an additional check to verify that the code lies within a certain range (start/end of library functions).
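As a rough illustration of the range check, here is a sketch that assumes the library functions occupy one contiguous address region and approximates its end by the start address of the last known library function. The function names and addresses are hypothetical; a real check would likely use the actual end of the last library function.

```python
def library_address_range(library_functions):
    """Return (start, end) spanning all known library function addresses."""
    return min(library_functions), max(library_functions)


def is_within_library_range(fva, library_functions):
    """
    True if `fva` lies between the lowest and highest known library
    function addresses (inclusive). Assumes library code is contiguous.
    """
    if not library_functions:
        return False
    start, end = library_address_range(library_functions)
    return start <= fva <= end


# hypothetical addresses, for illustration only
known_library = {0x40B000, 0x40B800, 0x40C000}
print(is_within_library_range(0x40B400, known_library))
print(is_within_library_range(0x401000, known_library))
```

Combined with the caller check, this would reject candidates whose callers happen to be library code but whose own code sits outside the library region.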