Improve performance of heuristics
We apply a number of heuristics in swupd before installing a file. The current code is performing a big number of strcmp in all filenames to check if that file belongs to one category. Instead we could use sorted lists and a merge-like algorithms to search for heuristics in a list of files. On big updates or bundle-adds the time we take to apply heuristics is considerable.
We also need to check if mixer is performing the same task and if we can remove heuristics from swupd
I tried my best to understand the source code and I wish my idea helps.
Currently, heuristics is applied to to_install_files, which is a doubly linked-list of files. What if we have a hashmap-like structure.
The keys are dirname of those files/dirs, while values are flags such as is_boot, is_state, and is_config, as long as several global variables.
We don't need to apply heuristic on each file. Instead, we only update the flags for the keys, which indicates the status for all files/dirs it contains.
There are a few exceptions like /data and /usr/bin/bootctl, that flags indicate status of not only their child nodes, and also themselves. This can be handled easily.
When constructing the table, the keys are already known, which are the directories checked in heuristics functions. For files not matching any of them, just imagine they belong to some else keys.
For example, the left column are the keys, and right column
/home <- is_state = 1, is_boot = 0, is_config = 0,
...
...
/etc <- is_state = 0, is_boot = 0, is_config = 1,
...
...
/usr/bin/bootctl <- is_state = 1, is_boot = 1, is_config = 0,
globals.need_update_bootloader = true
...
...
/usr/lib/kernel <- is_state = 1, is_boot = 1, is_config = 0,
globals.need_update_boot = true
/usr/lib/modules <- is_state = 0, is_boot = 1, is_config = 0,
(no change to global flags)
...
...
/usr/lib/systemd/systemd <- is_state = 1, is_boot = 0, is_config = 0,
globals.need_systemd_reexec = true
/usr/src/kernel <- is_state = 0, is_boot = 0, is_config = 0,
...
...
__else <- is_state = 0, is_boot = 0, is_config = 0,
__usr/else <- is_state = 1, is_boot = 0, is_config = 0,
So to infer the status of /usr/bin/foo and /tmp/bar for example, we refer to their keys, /usr/bin and /elserespectively, in which thekeysare retrieved by taking thefilepath` of those files.
Given a HashMap M, and a file/directory x
let f = filepath(x)
while f not empty:
if M contains f:
return M.get(f)
else:
f = filepath(f)
if first 5 chars of x is "/usr/":
return M.get("__usr/else")
else:
return M.get("__else")
Though there's a recursion, the full filepath usually is not that deep. Currently, in worst case we perform strncmp on each file for more than 20 times, and it seems that it definitely takes less recursion to find a match/mismatch. Below is the depth of files on my system, the mean and median is around 9.

Thanks for the analysis :)
Using a hash to keep the heuristics to be applied is a good solution, but we still would need to use multiple strcmp and hash computes for each file. And in this case we can use the fact that the to_install_files are a sorted list to prevent us to compare the same file name multiple times.
If we also save all heuristics in a sorted list, something like:
/etc <- is_state = 0, is_boot = 0, is_config = 1,
/home <- is_state = 1, is_boot = 0, is_config = 0,
/usr/bin/ <- is_state = 1, is_boot = 1, is_config = 0,
...
We can use a sorted list compare algorithm to look for the best match for each file. Something like:
file = to_install_files;
heuristic = heuristic_list;
while(file) {
int result = compare(heuristic, file);
if (result == 0) { //file matches the heuristic
apply_the_heuristics_to_file(file, heuristic);
file = file->next;
} else if (result < 0) { //There's no heuristic for this file
file = file->next;
} else { // We already applied this heuristic to all files that we should
heuristic = heuristic->next;
}
}
Some examples for this pseudo-code: Imagine a heuristic list of: {"/usr/bin/", "/usr/lib/"} And a file list: {"/usr/bin/vim", "/usr/include/my_include", "/usr/lib/my_lib" }
heuristic = /usr/bin/ file = /usr/bin/vim compare should match. So we apply the heuristic and move to next file
heuristic = /usr/bin/ file = /usr/include/my_include compare will return a negative number ("/usr/bin/" < "/usr/include/my_include"), so there's no file left in the list that is under /usr/bin and we can move the heuristic to the next heuristic.
heuristic = /usr/lib/ file = /usr/include/my_include compare will return a positve number ("/usr/lib/" > "/usr/include/my_include"), so there's no heuristic to this file. We can move to next file.
heuristic = /usr/lib/ file = /usr/lib/my_lib compare should match. So we apply the heuristic and move to next file
There are still some exceptions that we may need to handle, like the fact that we have dirs and files (kernel, for example), but this can be fixed with a good compare and sorting function.
The good about this algorithm, assuming the file list is always sorted beforehand, for other reasons, is that it's a O(n + m), where n is the number of files and m is the number of heuristics. So the number of strcmps will be n+m. As the number of files is usually way bigger than the number of heuristics (in the cases where performance matters) we are going to have around 1 strcmp per file.
Thanks for the analysis :) ...
@otaviobp I thoroughly think about this and I think your implementation will work well in practice.
As for the file list, it's reasonable to assume it's at least partially sorted. Because it's generated by summarizing the files mentioned in manifest files of bundles, in which files are sorted, I suppose. Then for most consecutive rows of to_install_files are sorted, except for corner cases where the two consecutive rows may come from two different manifest files.
Say we have two MoMs:
/usr/bin/foo
/usr/lib/systemd/systemd/foo.service
and
/etc/bar.conf
/usr/include/bar.h
Say the first one require the second one and after merge, the order is mixed.
Then how about let's sort files in manifest files such that files are grouped into sections corresponds to heuristic groups.
[bin]
/usr/bin/foo
[systmd]
/usr/systmed/systemd/foo.service
and
[etc]
/etc/bar.conf
[header]
/usr/include/bar.h
When generating the to_install_file list, we are not blindly concatenating the contents, but doing it smartly that keeps the group information.
Regardless of whether files in each group is sorted or not, the heuristic applied to it is the same! And this may also help to make the MoM more meaningful to human.
Sent with GitHawk