Optimize matchcompiler: create ConstStrings once
Reduces IR with ~2.5%.
This looks really interesting. Could you also provide a small diff example of the generated code?
I did a small testrun on mame_regtest with DISABLE_VALUEFLOW=1 and --enable=all --inconclusive on Clang 13 and the Ir actually increased:
2,472,706,153-> 2,500,915,833
I guess you used GCC.
Digging into it (via findLambdaEndToken()) I see operator StringRef() which wasn't there before and more spend within equalN(). So I guess the const object prevents some better inlining which was happening before.
I dabbled in getting rid of some of the indirections in MatchCompiler::makeConstString() in the past and it also have mixed results with being less performant although there's less code. That's when I hit another case where the compilers are behaving quite differently.
I tried constexpr on ConstString on top of your changes and that reduced the Ir to 2,468,887,538. The operator StringRef() calls are now gone. I still see slight increases but they are hard to track. It seems they are related to std::string:find*() now being more expensive!? I remember coming across this before - I think it was when we moved from Clang 13 to Clang 14.
Here's the diff:
diff --git a/lib/matchcompiler.h b/lib/matchcompiler.h
--- a/lib/matchcompiler.h (revision 46984ae45eee127ce52730540a3583454ee315d1)
+++ b/lib/matchcompiler.h (date 1659509930132)
@@ -28,7 +28,7 @@
class ConstString {
public:
typedef const char(&StringRef)[n];
- explicit ConstString(StringRef s)
+ explicit constexpr ConstString(StringRef s)
: _s(s) {}
operator StringRef() const {
@@ -64,7 +64,7 @@
}
template<unsigned int n>
- inline ConstString<n> makeConstString(const char (&s)[n])
+ inline constexpr ConstString<n> makeConstString(const char (&s)[n])
{
return ConstString<n>(s);
}
diff --git a/tools/matchcompiler.py b/tools/matchcompiler.py
--- a/tools/matchcompiler.py (revision 46984ae45eee127ce52730540a3583454ee315d1)
+++ b/tools/matchcompiler.py (date 1659509556848)
@@ -721,7 +721,7 @@
constants = u''
for constString, constStringId in self._constStringCache.items():
- constants += 'static const auto ' + constStringId + ' = MatchCompiler::makeConstString("' + constString + '");\n'
+ constants += 'static constexpr auto ' + constStringId + ' = MatchCompiler::makeConstString("' + constString + '");\n'
lineno = u''
if line_directive:
And dropping your changes with only ConstString supporting constexpr I got 2,472,695,908. If I dig into it there's part which slightly regress and other which improve.
I will give GCC a spin later today.
In checkunusedfunctions.cpp the code creates an object and replaces it for
// Taking care of funcName == "operator", which is no valid operator
We at least need to skip lines which start with //. This would also fix such a case in checktype.cpp.
In symboldatabase.cpp there's one where the comment starts within the line which would be harder to support.
All of those generate -Wunused-const-variable compiler warnings with Clang.
I came across this in other MatchCompiler replacements but somehow didn't fix it yet.
GCC 12
2,395,429,828 - main
2,421,359,726 - this PR
2,395,496,634 - this PR + constexpr ConstString + static constexpr
2,395,496,606 - this PR + constexpr ConstString
2,395,429,826 - constexpr ConstString
Interestingly with your patch it uses more Ir with GCC as well - what toolchain did you use? I am aware less instructions is not automatically better. There's better but more complicated ways to measure it than callgrind.
This looks remarkedly similar to what I experienced while working on #3725.
Here's the overview for Clang 13:
2,472,706,153 - main
2,500,915,833 - this PR
2,468,887,538 - this PR + constexpr ConstString + static constexpr
2,468,875,670 - this PR + constexpr ConstString
2,472,695,908 - constexpr ConstString
This obviously needs a bigger testrun.
This is why I don't like optimizations :)
I'm using gcc 12.1.1, valgrind 3.19.0, linux. I did the measurements using callgrind when running cppcheck on main.cpp in simplecpp (I used --enable=all but if I recall correctly, that didn't matter).
I didn't notice the unused variable warning, I guess we could add a check for comments in MatchCompiler.
Another thing I noticed was that make didn't realize it needed to recompile after changes in matchcompiler.h, but that's a separate issue.
Finally, I'm running this in VirtualBox on a very old computer but I don't think that should matter for IR?
I also tested with constexpr, and gcc didn't seem to care about it, so I didn't add it. I'll see if I can reproduce the result with value flow disabled tonight.
I'm using gcc 12.1.1, valgrind 3.19.0, linux. I did the measurements using callgrind when running
cppcheckonmain.cppinsimplecpp(I used--enable=allbut if I recall correctly, that didn't matter).
mame_regtest is C only. I disable the valueflow for my tests since it slows things down to bad it's kinds unusable even outside of callgrind (but optimizing the actual core isn't bad either). main.cpp of simplecpp doesn't have much code at all - you should be using simplecpp.cpp instead - or better a compilation database of simplecpp.
Another thing I noticed was that
makedidn't realize it needed to recompile after changes in matchcompiler.h, but that's a separate issue.
Funny - I was wondering if CMake is correctly picking it up - and it does. Didn't think the Makefile didn't - but since it only applies to MATCHCOMPILER=1 builds that makes sense. Will take a look.
Finally, I'm running this in VirtualBox on a very old computer but I don't think that should matter for IR?
Not that I am aware of. perf is apparently much better since it counts the actual cycles used but I have zero experience with that and it requires kernel support so me using WSL1 (which doesn't have a kernel) cannot utilize it. llvm-mca looks at the cost of the actual instructions but that requires an actual code sample. I only used it once so I am not too familiar with it either.
This looks really interesting. Could you also provide a small diff example of the generated code?
So given the following silly example file:
bool isF(const Scope *s)
{
for (const Token* tok = scope->bodyStart; tok != scope->bodyEnd; tok = tok->next()) {
if (tok->str() == "f")
return true;
}
}
Previously, MatchCompiler generated this:
#include "matchcompiler.h"
#include <string>
#include <cstring>
bool isF(const Scope *s)
{
for (const Token* tok = scope->bodyStart; tok != scope->bodyEnd; tok = tok->next()) {
if (tok->str() == MatchCompiler::makeConstString("f"))
return true;
}
}
With these changes, this is generated:
#include "matchcompiler.h"
#include <string>
#include <cstring>
static const auto constString0 = MatchCompiler::makeConstString("f");
bool isF(const Scope *s)
{
for (const Token* tok = scope->bodyStart; tok != scope->bodyEnd; tok = tok->next()) {
if (tok->str() == constString0)
return true;
}
}
So, my results with clang-14 (still using main.cpp in simplecpp) is the improvements with this PR improves IR with clang too (without the constexpr), but I guess I should try a bigger file (but it takes forever with callgrind). The improvements are actually bigger with clang compared to gcc, but this clearly needs more investigation...
Another thing I noticed was that
makedidn't realize it needed to recompile after changes in matchcompiler.h, but that's a separate issue.
I filed https://trac.cppcheck.net/ticket/11241 with a suggestion on how to fix it.
I did some tests with godbolt and llvm-mca and none of the variations have any effect the code at all in a very simple test case. So it seems to be related to more complex functions.
Here's a link to play around with: https://godbolt.org/z/K9nvqrzax.
We now have a callgrind step in the selfcheck so you get information about the impact. So please rebase this PR.
You can also download the callgrind.<pid> from the workflow and view it locally with kcachegrind. The callgrind_annotate output to be a off on some functions.
On a side note. Being able to optimize this is good, but as seen in the callgrind output the problem are the non-matchcompiler Token::Match() calls.
Right, I rebased and tested it on the latest master. With the constexpr-changes you suggested, there is a slight improvement, but not as much as I initially thought. So maybe we just close this?
Yeah, the changes unfortunately don't seem to have that much effect. I will take another look in the next few days. Let's leave it open for now.
Still Clang profits from it and it would be great to get a sample to file a report upstream.
Some std::string comparison related benchmarks: https://github.com/llvm/llvm-project/issues/58002#issuecomment-1396950031
I will also do some tests with the stuff we discussed in this PR.
short string: Clang 15 - https://quick-bench.com/q/uMa0PbTdG7CZhp94pk2-aZ9VySE GCC 12.2 - https://quick-bench.com/q/HehK-2CaXR4-g3znvZlh21tzdH4
GCC seems to yield more consistent code compared to Clang.
longer string: Clang 15 - https://quick-bench.com/q/2jfuoRSk5wq5bQrvRaqC-gRZxKE GCC 12.2 - https://quick-bench.com/q/_3hA3P1HGN_7juNpdznXvn3iTps
constexpr hurts GCC but produces more consistent results with Clang.
Turns out all of this is not necessary - see https://github.com/llvm/llvm-project/issues/58002#issuecomment-1396950031. Simply moving the string to compare with into a global static const std::string object is enough to produce even faster results than the MatchCompiler. Clang is slightly slower with single character strings but GCC is faster. So that optimization should be kept for now. We could also make it optional based on if the compiler has __clang__ defined though.
I updated your changes to apply my findings and added some additional improvements. With those we no longer have need for matchcompiler.h which would be great. See https://github.com/firewave/cppcheck/tree/mc-opt with the changes applied in bulk.
Unfortunately callgrind gives Ir regressions on that.
GCC 12
1,910,631,872 - current
2,148,962,096 - static const std::string
2,057,846,913 - static const std::string + single-character optimization
But given these functions are called millions of times a change of just a few instructions will have a visible impact. But as more instructions does not mean more cycles the code might actually be faster. So we need actual cycle profiling for this.