Templar Seg-faults after clearing the graph scene

I was just trying to run templar a bit and got several seg-faults and other memory issues. I fixed a couple of them in the pull request that I just put up. But there are still several that I can't fix myself (without knowing how things are supposed to work).

The latest one I got triggers the following valgrind log entry:

==23753== Invalid read of size 4
==23753==    at 0x48AA20: QNode::getId() const (qgraph.h:137)
==23753==    by 0x4897F4: QGraph::colorizeUpToNode(int) (qgraph.cpp:452)
==23753==    by 0x47CD16: Templar::GraphHandler::handleEvent(Templar::TraceEntry const&) (graphhandler.cpp:30)
==23753==    by 0x4820FC: Templar::DebugManager::next() (debugmanager.cpp:61)
==23753==    by 0x4A9459: Templar::DebugManager::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) (moc_debugmanager.cxx:53)
==23753==    by 0x5C7B879: QMetaObject::activate(QObject*, QMetaObject const*, int, void**) (in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.6)
==23753==    by 0x4FFAA61: QAction::triggered(bool) (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==    by 0x4FFC432: QAction::activate(QAction::ActionEvent) (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==    by 0x53B3B01: ??? (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==    by 0x53B3C2B: QAbstractButton::mouseReleaseEvent(QMouseEvent*) (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==    by 0x546AA49: QToolButton::mouseReleaseEvent(QMouseEvent*) (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==    by 0x5050509: QWidget::event(QEvent*) (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==  Address 0x20277430 is 64 bytes inside a block of size 80 free'd
==23753==    at 0x4C2C2BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==23753==    by 0x4AA191: QNode::~QNode() (qgraph.h:122)
==23753==    by 0x55CE225: QGraphicsScene::clear() (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)
==23753==    by 0x47CD7C: Templar::GraphHandler::reset(Templar::TraceEntry const&) (graphhandler.cpp:54)
==23753==    by 0x48225E: Templar::DebugManager::reset() (debugmanager.cpp:81)
==23753==    by 0x46A1BB: MainWindow::reset() (mainwindow.cpp:270)
==23753==    by 0x46AFB4: MainWindow::openTrace(QString const&) (mainwindow.cpp:373)
==23753==    by 0x46AD06: MainWindow::on_actionOpen_trace_triggered() (mainwindow.cpp:350)
==23753==    by 0x4A8DE8: MainWindow::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) (moc_mainwindow.cxx:81)
==23753==    by 0x4A8F6B: MainWindow::qt_metacall(QMetaObject::Call, int, void**) (moc_mainwindow.cxx:127)
==23753==    by 0x5C7BA77: QMetaObject::activate(QObject*, QMetaObject const*, int, void**) (in /usr/lib/x86_64-linux-gnu/libQtCore.so.4.8.6)
==23753==    by 0x4FFAA61: QAction::triggered(bool) (in /usr/lib/x86_64-linux-gnu/libQtGui.so.4.8.6)

Which pretty much speaks for itself when you look at the corresponding source locations. When the GraphHandler calls the "theGraph->getScene()->clear();" during its "reset" callback, it causes all the QNode objects to be deleted and then, the next time the QGraph gets colored via a call to "colorizeUpToNode" it accesses freed memory.

I believe that the fix would be to call "theGraph->clearGraph()" before (or instead of) the call to clear the scene. I think that clearGraph should include the clearing of the scene. Thus, having:

void QGraph::clearGraph()
{
    nodeList.clear();
    QList<QGraphicsItem*> items(scene->items());
    while (!items.isEmpty())
        delete items.takeFirst();
    scene->clear();
}

and

void GraphHandler::reset(const TraceEntry &/*entry*/)
{
    undoStack.clear();
    nodeColor.clear();
    theGraph->clearGraph();
}

That just looks cleaner to me.

Dec 27 '14 21:12 mikael-s-persson

While I can confirm that this resolves the segfault and looks cleaner, I am not sure if it leads to the desired behavior. I am not that accustomed to the code myself. (Almost all changes I made so far were rather mechanical bugfixes).

For now I will apply this change but in the long run we need a better strategy which would center around having some documentation/system specification to code against. (This again would require to hash out some usage details which I should not do on my own. I am not exactly a power user of Templar.)

Dec 29 '14 15:12 schulmar

we need a better strategy which would center around having some documentation/system specification to code against.

Well, I would say, a specification like that is not really needed for a GUI application, it's more for the design of libraries or tools. What you need is a clearer "user experience" design.

I'll offer a bit of feedback from my perspective:

For example, as a potential user of templar, what I would want to see (somehow represented graphically) is a clear picture of what parts of the code takes the most time / memory to be instantiated. Then, I would want some intuitive way to zoom into or go down the paths that I see as most problematic. Gprof2dot is a decent graphviz-based solution for that, and kcachegrind is a more elaborate GUI design. I think that both are pretty ugly and overly cluttered. I think that templar could aim at something much more intuitive and purified. The main point is to keep in mind that the typical user wants to find out where most time / memory is (proportionally) spent during compilation.

In that perspective, I don't quite understand the purpose of the kind of "walk" features for a profiler. If you want to take a walk along the sequence of template instantiations, you would only do so on small pieces of code (e.g., "toy" examples) and you would use something like the templight debugger, the meta-debugger (from metashell), or a GUI front-end to either of those. The point is that a walk or step-through user experience is typical of the use of a debugger, not of a profiler. That's where I think templar is missing the mark a bit. By trying to implement such a "walk" feature, you end up duplicating the logic of the meta-shell / meta-debugger programs (which also use templight to produce a trace and then provide a command-line interface to walk back and forth in the sequence of instantiations and corresponding AST nodes).

As a GUI front-end for a profiler, you should tailor to tasks that people want to use a profiler for, which is to identify performance bottlenecks in large projects.

And just a hint about GUI programming. It appears to me that templar doesn't have a lot of intuitive clues about what actions can or cannot be done at any given time. You have a lot of "dead buttons", i.e., buttons (or menu items) that can be clicked but don't actually do anything because the current state of the application means that nothing can or will happen if you click it. It is much better (for you and for the user) to disable all buttons that cannot be meaningfully clicked at any given point in time. And Qt has great features for that. It makes things intuitive for the user because he gets an immediate visual clue about how to use the application. And it makes things easier for you as a programmer because you have far fewer checks to do in the code to handle all the possible "stupid clicks" from users. Some of the seg-faults I fixed in my earlier pull request were caused by me making "stupid clicks" because I couldn't understand how to use the app, and caused seg-faults in even handlers that didn't check for those stupid situations. For example, clicking "next" when the history is empty caused a seg-fault, but why would the "next" button be enabled when the history is empty? That's my point. Prevent the "stupid clicks" at the source.

Dec 30 '14 01:12 mikael-s-persson

Well, I would say, a specification like that is not really needed for a GUI application, it's more for the design of libraries or tools. What you need is a clearer "user experience" design.

What I am aiming at is to have the program develop alongside a user manual which it is currently lacking. Of course more intuitive controls will help but they can't solve everything.

For example, as a potential user of templar, what I would want to see (somehow represented graphically) is a clear picture of what parts of the code takes the most time / memory to be instantiated.

Am am working on this (I want to do something like this) but it is way back in the pipeline. Current top things on the list are:

fixing bugs/crashes
being able to load the established formats (old XML, protobuf, ...?)
improving visualization/interaction/GUI

In that perspective, I don't quite understand the purpose of the kind of "walk" features for a profiler.

I always understood Templar as a Trace inspector not as a profiler (after all the profiling is done by Templight). However, I also feel that the graphical nature of the program should offer better support for seeing the performance implications of the code.

From my point of view these are the main tasks I would expect Templar to accomplish:

Validate that the correct template specializations are invoked (this is somewhat related to the stepping)
Have an eye on the time consumption
Look for memory problems/excessive recursion

The current state of the program is rather bad. The original design was a bit lacking. Mering in the changes in the fork of Thomas Benard brought in some nice ideas but made the program even more complicated to the point that some things don't work anymore.

I am not sure if going on from this code base is the right thing or if a clean start should be made that borrows some of the old ideas code (e.g. file loading).

Dec 30 '14 09:12 schulmar

Am am working on this (I want to do something like this) but it is way back in the pipeline.

Yeah, that is exactly the kind of thing I would love to see.

being able to load the established formats (old XML, protobuf, ...?)

Hopefully my pull request on updated protobuf loading code will help ;)

after all the profiling is done by Templight

Profiling is pretty much pointless without visualization. Templight generates the profiling data, but how is someone supposed to inspect it? For example, I just generated (finally!) some templight profiling data for some template-heavy code that I want to optimize, the traces are 20-50MB per file in protobuf format (so, roughly 200-500MB in other formats), for a total of more than 1GB of profiling data. Am I supposed to look at this in a text editor? The post-processing tasks, including visualization and data aggregation, are an essential part of the overall process. Whether Templar aims to fulfil role or not is your decision, of course, but if not, some other application will have to fill this role.

Validate that the correct template specializations are invoked (this is somewhat related to the stepping)

I noticed that you had prepared fields for "declarationBegin" and "declarationEnd" in your trace entry objects. You will be happy to know that I have added (in my develop branch of templight) a field for recording the "template_origin", which points to the file location of the template (or specialization) being instantiated (as opposed to the "location" field which marks the point of instantiation). That should help you for this (and it was requested by someone on the boost mailing list).

Have an eye on the time consumption Look for memory problems/excessive recursion

I think that the kind of visual representation that you linked to (picture) is a great start for this, if you can produce one of those for time and one for memory, and of course, the excessive recursion is immediately seen by spotting very deep branches on that picture, then that's like 90% of what anyone would desire as far as inspecting the profiling data. The other 10% would be aggregated statistics, something like compiling the time spent on any instantiation related to a base template name like std::tuple (any template parameters and all member functions).

I am not sure if going on from this code base is the right thing or if a clean start should be made that borrows some of the old ideas code (e.g. file loading).

GUI programming is never pretty, that's why I don't like it much. Templar's code is definitely a bit messy, it looks like a lot of haphazard code put together. Things are a bit all over the place.

As far as making a clean start-over, that's not a bad idea. The main thing is that it really needs to be more self-contained modules with minimal interactions. Take for instance this issue with the seg-faults on clearing the graphs, as you said, you're not even sure that the fix is correct (neither am I), because of this weird spread of the graph handling code over three or four classes (in addition to graphviz). That's the kind of thing that makes things messy very quickly.

Dec 30 '14 19:12 mikael-s-persson

Hopefully my pull request on updated protobuf loading code will help :) It sure takes some work off me :)

Profiling is pretty much pointless without visualization. True, it was more of a language nitpick because you referred to Templar as profiler where it is a visualizer.

Whether Templar aims to fulfil role or not is your decision, of course, but if not, some other application will have to fill this role.

I am not the owner of Templar and as I already said I don't feel confident enough to decide on my own which direction it should take. I am glad that you are discussing some of the problems with me but I feel there should be more people giving their opinion so we can make sure we don't code against a wall.

I noticed that you had prepared fields for "declarationBegin" and "declarationEnd" in your trace entry objects. You will be happy to know that I have added ...

These fields were added by Thomas Benard but you are right, I am quite fond of them :) Thanks for the work there (to both of you), it really helps understand what is going on with different specializations. (The next step would be to have an explanation by the compiler why some specialization was chosen over the other, but that is another story.)

if you can produce one of those for time and one for memory ... The other 10% would be aggregated statistics

Would you mind opening issues for that, so I don't loose track of them? :)

As far as making a clean start-over, that's not a bad idea.

I would be in favor for it but I am not sure when to do that. Currently, we have no working instance and I would not like to maintain this situation for a prolonged period while rewriting. So the way to go is probably to fix the most important issues with the current version and then aim for the rewrite.

This time of fixing may even be used to think of better ways (than walking) to present the data.

Dec 31 '14 08:12 schulmar