Add git-based build information for better issue tracking
Description:
This pull request adds a build-info.h header file, which will contain git-related build information such as branch and commit count. This will help users and maintainers to identify the build being used when reporting issues or bugs. The build-info.h file is generated using CMake and Makefile whenever .git/index changes, and is printed to stderr in main.cpp.
Changes:
- Modified
.gitignoreto includebuild-info.h. - Added build information generation to
CMakeLists.txt(and updated main'sCMakeLists.txtto add a dependency). - Added build information generation to
Makefile. - Modified
examples/main/main.cppto print build information at runtime.
This will allow us to track the build version in a more granular way and may be helpful in identifying and addressing issues or bugs until an official versioning system is introduced.
I think this is a good idea. One question, currently it looks like this outputs the number of commits in HEAD, but how do we find what version they are using from this information? I think I would prefer a commit hash id, such as obtained by git rev-parse --short HEAD.
@slaren
but how do we find what version they are using from this information?
To go the other direction:
git rev-list HEAD --skip=$(( $(git rev-list HEAD --count) - BUILD_NUMBER)) --max-count=1
So if you were looking for 461:
$ git rev-list HEAD --skip=$(( $(git rev-list HEAD --count) - 461)) --max-count=1
305eb5afd51325e3142c01c17431febb7c67de87
It's not as straight forward I suppose. I just thought a build number would be something easier for people to understand and without looking up the hash, judge how old the build is, and at the same time allow people (or documentation) to say "you need at least build number 461 to do X."
If you want the direct hash, how would you feel about a combination? Something like:
build = 461 (305eb5afd51325e3142c01c17431febb7c67de87)
seed = 189273487
How about pulling out the commit date?
build = 2022-04-29 (305eb5afd51325e3142c01c17431febb7c67de87)
Did you test how this behaves when it is a cmake subdirectory?
edit: also git submodules. which have a .git file that points to the root.
@j-f1:
The date might be better for a quick judge of how old something is. I thought a straight number component would be more useful in the long run because it could be used more easily programmatically.
For example, if we wanted to tag all the new quantizations with which build number they were quantized in and there were some quantization bugs fixed in commit number 670, you just have to test if the build number in the quantized file was < 760 and then warn the user they should requantize for best performance. The hash alone wouldn't be good enough and a date would mostly work, but it's not exact (and time zones?) and no one wants to parse date strings.
@Green-Sky:
I tried it with main (which is in another subdirectory, if that's what you mean?) and it worked fine. The BUILD_INFO project it makes is itself relative to the source root so even when I write CMAKE_CURRENT_SOURCE_DIR it's still using that path and should use only the .git at that that particular location but there might be some edge cases I'm not thinking of? I can replace CMAKE_CURRENT_SOURCE_DIR with CMAKE_SOURCE_DIR just to cover my bases.
It's still all configured on my dev machine if there's a particular cmake command you want me to try out.
I tried it with main (which is in another subdirectory, if that's what you mean?)
hm no.
The BUILD_INFO project it makes is itself relative to the source root so even when I write CMAKE_CURRENT_SOURCE_DIR it's still using that path and should use only the .git at that that particular location but there might be some edge cases I'm not thinking of?
I was thinking of including llama.cpp as a git submodule or git subtree and including the cmake via eg add_subirectory(external/llama.cpp) .
CMAKE_CURRENT_SOURCE_DIR
In the llama.cpp's project root directory is the way to go, I think.
I committed a change so it shows the hash and the build number. I think this is better after all.
main: build = 464 (5e7d90a838c2543687fcff2e603b1189b6bc1084)
main: seed = 1682785118
Previously, this would have shown my branch instead (conditionally, because it wasn't master), but now it just always shows the hash, which I do admit is easier. The example even points to this exact commit: 5e7d90a838c2543687fcff2e603b1189b6bc1084
And while I don't think it was necessary (because BUILD_INFO is treated as its own project in cmake), I didn't see any harm in replacing all the CMAKE_CURRENT_SOURCE_DIR with CMAKE_SOURCE_DIR just to be extra careful. (Spoiler, I undid this below.)
@Green-Sky Oh I see what you're saying. If someone else is using this project in their own project, cmake should be smart about that since it would still be a sub directory, but I haven't tested it and now I think CMAKE_CURRENT_SOURCE_DIR was actually the right thing to do. I need to go back undo that.
Alright, I put CMAKE_CURRENT_SOURCE_DIR back and checked for any other spots it may have caused issues when used with add_subdirectory(). I also made the building comments a little more descript.
@sw Yeah, it's embarrassingly easy to fix. Everything else is all set up to generate dummy values. I just overlooked that dependency line because I originally wasn't going to add it (since BUILD_INFO is it's own thing and part of ALL) but added it the last minute in case someone builds only main (which could cause other issues anyway), and it missed the .git testing. Give me a second.
Edit: git will still spit out an error claiming "not a git repository", but only when configuring, not during builds. I can also hide that but I think maybe it's good people see that so if they look at their logs they will understand why their build info is "unknown"?
Thanks @DannyDaemonic, I've noticed another problem, though. CMake doesn't find the git executable:
-- Could NOT find Git (missing: GIT_EXECUTABLE)
This is on Ubuntu 22.04 with cmake 3.22.1 and git 2.34.1
Also seems to happen in the CI build: https://github.com/ggerganov/llama.cpp/actions/runs/4840382315/jobs/8625989141?pr=1232#step:4:22
I think you have to use find_package(Git) in the top-level CMakeLists.txt as well, not just in the generated BUILD_INFO.cmake.
Couldn't you add BUILD_INFO.cmake as a regular file in scripts/, instead of writing it to the build directory?
I think the goal of this PR is good, the implementation just seems very complicated. But I'm no Cmake expert.
@sw
I think you have to use find_package(Git) in the top-level CMakeLists.txt as well, not just in the generated BUILD_INFO.cmake.
Hmm, I don't know why that would make a difference when BUILD_INFO has its own top-level project (at build/BUILD_INFO.vcxproj. Unless you mean it didn't work when called during the config but then it worked after? Either way, I added a fallback on which, that always works (outside of Windows anyway). I tested it and it was able to find git even in a subshell. I also double checked and it was working in Windows without the which work around. :shrug:
As for making a separate cmake file, all of the other .cmake files are generated by the CMakeLists.txt files, so it seemed natural for it to create BUILD_INFO.cmake as well. It was also easier to work with the code in one place, but it has grown quite a bit since so I separated it out and left just the project stuff behind in CMakeLists.txt.
The other downside to separating it into scripts was it messed with the path stuff Green-Sky and I were talking about earlier. Even thought I was using CMAKE_CURRENT_SOURCE_DIR it was using the build directory and apparently you can't override the source directories with -D. I was able to find a way to do it with nmake 3.12, which luckily, we already require, it but it took longer than I'd like to admit to figure out what was wrong...
@slaren
Alright. I've put it in all the examples. I tried to place it after parameter checks that could cause an exit, but before the seed is written out so that they'd be printed together. This also means I had to update the makefile to depend on the .h file for generation. The only thing is this made the Makefile a bit messy. We could probably clean it up with a pattern. But I'm not sure that's desired. I can always try to clean it up in a separate PR to see if that's prefered.
This is very minor, but I wonder if the filter-out in the Makefile could be replaced by a more generic $(filter-out %.h,$^).
@slaren Oh, Yeah, I could do that much without reshaping the makefile. Previously, I was thinking of something like this:
# Define the targets
TARGETS := main quantize quantize-stats perplexity embedding
# Define dependencies for each target
main_DEPS := ggml.o llama.o common.o $(OBJS)
quantize_DEPS := ggml.o llama.o $(OBJS)
quantize-stats_DEPS := ggml.o llama.o $(OBJS)
perplexity_DEPS := ggml.o llama.o common.o $(OBJS)
embedding_DEPS := ggml.o llama.o common.o $(OBJS)
# Pattern rule for compiling and linking
%.out: %.cpp $(%_DEPS) build-info.h
$(CXX) $(CXXFLAGS) $(filter-out *.h,$^) -o $@ $(LDFLAGS)
But I don't know if that's too far removed from a ~normal~ simple makefile.
Yeah that probably would be better in the long run. I also noticed that the LLAMA_GPROF and LLAMA_PERF flags don't work with cuBLAS. There is some refactoring to do there, but that would be a different PR.
@ggerganov: Oh, good catch.
@slaren I tried to make that particular pattern/rule work but it was really not having it. The only way to join the target with a variable is to use an $eval.
It still looks great to use:
TARGETS_CPP += quantize
DEPS_quantize := examples/quantize/quantize.cpp ggml.o llama.o
TARGETS_CPP += quantize-stats
DEPS_quantize-stats := examples/quantize-stats/quantize-stats.cpp ggml.o llama.o
TARGETS_CPP += perplexity
DEPS_perplexity := examples/perplexity/perplexity.cpp ggml.o llama.o common.o
TARGETS_CPP += embedding
DEPS_embedding := examples/embedding/embedding.cpp ggml.o llama.o common.o
But then you have something like this at the end that includes all the extras (OBJS, build-info, etc):
define template_cpp
OUTP_$(1) ?= $(1)
$(1): $$(DEPS_$(1)) $$(OBJS) build-info.h
$$(CXX) $$(CXXFLAGS) $$(filter-out build-info.h,$$^) -o $$(OUTP_$(1)) $$(LDFLAGS)
$$(if $$(value EXEC_$(1)),$$(EXEC_$(1)))
endef
$(foreach target,$(TARGETS_CPP),$(eval $(call template_cpp,$(target))))
I commented it nicely but I don't know if that's still preferred.
Thoughts?
Rebased with less "fancy" Makefile.
Used shorter hash as originally recommended by @slaren. If this is no longer desired, let me know and I'll put it back.