Performance analysis: High CPU load when LMMS is doing nothing
@unfa's issue #2290 made me curious which methods are taxing the CPU when LMMS should be idle. On my machine a release build of LMMS keeps the CPU at around 60-70% in the KDE task manager while doing exactly nothing. The CPU is an Intel Core i5-3570K. The ALSA driver is used.
I have therefore started an analysis with Valgrind with the following rather simple use case:
- Build release version.
- Start LMMS in Valgrind.
- Let LMMS sit there "idle" for some time.
- Close LMMS.
The results are the following. The percentages indicate how much of LMMS's total running time each method consumed.
- MixHelpers::addSanitizedMultiplied: 24.92%
- AudioDevice::convertToS16: 16.21%
- Mixer::peakValueLeft: ~4%
- Mixer::peakValueRight: ~4%

So these four methods alone seem to take almost half of the time.
The critical call to MixHelpers::addSanitizedMultiplied can be found in FxMixer::masterMix. Part of the cost comes from the calls to isinff and isnanf. I think it might make sense to replace it with MixHelpers::addMultiplied (at least in release builds), which still takes some time but not nearly as much.
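For reference, a minimal sketch of what the two helpers presumably do (the real MixHelpers code may differ in details, but the per-sample inf/NaN guard is the part that shows up in the profile):

```cpp
#include <cmath>

typedef float sampleFrame[2]; // interleaved stereo, as in LMMS

// Plain add-multiply: one tight loop, easy for the compiler to vectorize.
void addMultipliedSketch(sampleFrame *dst, const sampleFrame *src,
                         float coeff, int frames)
{
    for (int f = 0; f < frames; ++f)
    {
        dst[f][0] += src[f][0] * coeff;
        dst[f][1] += src[f][1] * coeff;
    }
}

// Sanitized variant: every sample is checked for inf/NaN before being added.
// The real code calls isinff/isnanf; std::isinf/std::isnan are used here to
// keep the sketch portable.
void addSanitizedMultipliedSketch(sampleFrame *dst, const sampleFrame *src,
                                  float coeff, int frames)
{
    for (int f = 0; f < frames; ++f)
    {
        for (int ch = 0; ch < 2; ++ch)
        {
            const float s = src[f][ch];
            dst[f][ch] += (std::isinf(s) || std::isnan(s)) ? 0.0f : s * coeff;
        }
    }
}
```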
With regards to AudioDevice::convertToS16: why is this method used in the first place? The method seems to convert float values to signed 16 bit integers. I would assume that most ALSA devices should be capable of consuming float directly, shouldn't they?
Another interesting thing happened when I simplified the implementation of AudioDevice::convertToS16 so that it only writes zeros into the buffers. The method still showed up with a significant share of the time. This might indicate that the buffers are accessed in an unfortunate way, which leads to a huge number of cache misses.
After I had replaced the four aforementioned methods with empty or trivial implementations, the CPU load was more or less unchanged (still at 60-70%). I assume that this might be caused by the busy waiting done in AudioAlsa::run. Is it possible to use ALSA similarly to Jack, with a callback method that is called whenever ALSA decides that it wants to have the next buffer filled?
> I would assume that most ALSA devices should be capable of consuming float directly, shouldn't they?
Not sure. Audio devices have become stupider over time, as CPUs are more powerful, so I'd be surprised if any large number of them actually accepted floats. I tried arecord -D hw:0,0 --dump-hw-params /dev/null and got only S16_LE, which is probably supported by everything since the SoundBlaster 16 or so. The same trick for the default device shows U8 S16_LE S16_BE S32_LE S32_BE FLOAT_LE FLOAT_BE MU_LAW A_LAW.
So, it seems there's some ALSA magic involved unless you go straight to the hardware. There might be some plugin or other way to configure ALSA to accept floats for any device, but if it's up to the user to do that configuration, I'd say it's not a very good solution. How about feeding ALSA floats whenever possible and keeping the current conversion code as a fallback? That might even give 24-bit output on capable hardware (if using the default device).
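As a rough sketch of that fallback idea (assuming the existing hw_params setup code; error handling omitted), one could test for float support first and only fall back to S16:

```cpp
#include <alsa/asoundlib.h>

// Ask ALSA for native float output if the device (or plug layer) supports it,
// otherwise keep the current S16 path and convertToS16 as a fallback.
snd_pcm_format_t negotiateFormat(snd_pcm_t *pcm, snd_pcm_hw_params_t *params)
{
    if (snd_pcm_hw_params_test_format(pcm, params, SND_PCM_FORMAT_FLOAT_LE) == 0)
    {
        snd_pcm_hw_params_set_format(pcm, params, SND_PCM_FORMAT_FLOAT_LE);
        return SND_PCM_FORMAT_FLOAT_LE; // mixer output can be written directly
    }
    snd_pcm_hw_params_set_format(pcm, params, SND_PCM_FORMAT_S16_LE);
    return SND_PCM_FORMAT_S16_LE; // keep using convertToS16
}
```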
> Is it possible to use ALSA similarly to Jack, with a callback method that is called whenever ALSA decides that it wants to have the next buffer filled?
Something is surely possible. According to http://www.alsa-project.org/alsa-doc/alsa-lib/pcm.html, polling and asynchronous notifications are available. Using mmap-based access should also be an improvement, but it isn't necessarily supported by the default driver (at least it looks like that on my Ubuntu 14.04), so the current code would still be needed as a fallback.
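For illustration, a minimal sketch of the asynchronous notification route (names and the period size are assumptions; in LMMS the handler would pull the next buffer from the mixer instead of writing silence):

```cpp
#include <alsa/asoundlib.h>
#include <cstdint>
#include <vector>

static const snd_pcm_uframes_t periodSize = 1024; // assumed period size
static const int channelCount = 2;

// Stand-in for the part of AudioAlsa that renders and writes one period;
// here it just writes silence so the sketch is self-contained.
static void renderAndWritePeriod(snd_pcm_t *pcm)
{
    std::vector<int16_t> buf(periodSize * channelCount, 0);
    snd_pcm_writei(pcm, buf.data(), periodSize);
}

// Invoked by alsa-lib (via SIGIO) whenever at least one period of space
// becomes available, so no busy waiting is needed.
static void periodElapsed(snd_async_handler_t *handler)
{
    snd_pcm_t *pcm = snd_async_handler_get_pcm(handler);
    while (snd_pcm_avail_update(pcm) >= (snd_pcm_sframes_t)periodSize)
    {
        renderAndWritePeriod(pcm);
    }
}

// Registration, done once after the PCM device has been configured:
//   snd_async_handler_t *handler;
//   snd_async_add_pcm_handler(&handler, pcm, periodElapsed, nullptr);
```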
The convertToS16 code could possibly be improved, if we assume endianness usually matches on a system (i.e. the ALSA/SDL/whatever subsystem expects the same endianness LMMS is using). In that case, endian swapping would mostly be relevant when rendering to a file, which isn't quite as time critical.
Something like the following could help (a rough code sketch follows the list):
- invert the order of the if clause, put the non-swapping case first. This might help with the cache performance of the code or whatever.
- maybe even make it a 2-pass loop, first going through the buffer to convert the samples, then doing the endianness swap if necessary.
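A rough sketch of that restructuring could look like this (names and the 32767 scaling factor are assumptions, not the actual AudioDevice::convertToS16 code):

```cpp
#include <cstdint>

void convertToS16Sketch(const float *src, int16_t *dst, int samples,
                        bool swapEndian)
{
    // Pass 1: clip and convert everything in one tight, branch-light loop.
    for (int i = 0; i < samples; ++i)
    {
        const float s = src[i] > 1.0f ? 1.0f : (src[i] < -1.0f ? -1.0f : src[i]);
        dst[i] = static_cast<int16_t>(s * 32767.0f);
    }
    // Pass 2: byte swap only if the target endianness differs, e.g. when
    // rendering to a file for another platform; usually skipped entirely.
    if (swapEndian)
    {
        for (int i = 0; i < samples; ++i)
        {
            const uint16_t v = static_cast<uint16_t>(dst[i]);
            dst[i] = static_cast<int16_t>(static_cast<uint16_t>((v >> 8) | (v << 8)));
        }
    }
}
```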
And then there's Mixer::clip() being called a lot; it might or might not be faster using ternary operators, like this: http://stackoverflow.com/a/16659263
Plus the nested loops. Could be worth a shot to see if for( fpp_t frame = 0; frame < _frames*channels(); ++frame ) optimizes better.
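A minimal sketch of the ternary version from that answer (illustrative only; the real Mixer::clip signature may differ):

```cpp
// Ternary-based clip; with optimization enabled this can compile down to
// minss/maxss instructions instead of conditional branches.
static inline float clipTernary(float s)
{
    return s < -1.0f ? -1.0f : (s > 1.0f ? 1.0f : s);
}
```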
Not related to ALSA or sound devices at all, just want to put this heads-up in here because it concerns CPU usage while LMMS is idle - but how is the current (master-as-of-today) auto-save method doing? That feature was a CPU hog, but it has improved with each release.
@musikBear Your question sounds rather speculative. This issue, on the other hand, covers performance problems that have been measured and are therefore known to be problems.
If you are sure that auto-save causes performance problems, please open a separate issue. This will also make it easier to make progress and to keep an overview of the issues. Thanks!
@michaelgregorius I am currently refactoring the updateFaders method to improve decoupling. While I am at it, it will be perfectly possible to only perform the peak calculations if we have an active listener for that data (i.e. only if the mixer window is open), which should remove 8% of idle CPU usage under the opposite condition.
I haven't peaked into the getPeakLeft/Right functions yet (hah), but I have a hunch it's just a loop of p = max(p, abs(bufferLeft[i++])), in which case there are likely other algorithms that avoid as much branching.
p.s. 70% idle seems VERY high compared to anything I ever get. Opening an empty project for me results in 1% on a mobile i5 processor.
Well darn. If our channels weren't interleaved, we could use std::minmax_element, but no luck.
I took a look at the assembly for some variants of the peak functions, and they're all much more complex than one would imagine. The important thing is that qMax, std::max and fmax all contain conditional branching when compiled with CMAKE_BUILD_TYPE=RelWithDebInfo. Same with qAbs, fabs and maybe std::abs (it seems some of the implementations are not easy to find).
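One source-level option for sidestepping the branching in the abs part is clearing the sign bit directly; a hedged sketch, not code from LMMS or from any of those libraries:

```cpp
#include <cstdint>
#include <cstring>

// Branch-free absolute value for IEEE-754 floats: clear the sign bit
// through an integer view (memcpy keeps the type punning well-defined).
static inline float absNoBranch(float s)
{
    uint32_t bits;
    std::memcpy(&bits, &s, sizeof(bits));
    bits &= 0x7fffffffu;
    std::memcpy(&s, &bits, sizeof(s));
    return s;
}
```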
Even with -march=native (for SSE), there is very little difference. Thoughts:
- It may be better to calculate the peaks in lockstep instead of two separate iterations - increased data locality (see the sketch after this list)
- Maybe explicitly vectorize the code (http://stackoverflow.com/questions/15238978/sse3-intrinsics-how-to-find-the-maximum-of-a-large-array-of-floats). Would have to be done using intrinsics that still work with SSE off though.
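Regarding the first point, a scalar sketch of computing both peaks in lockstep over the interleaved buffer (illustrative names, not LMMS code):

```cpp
typedef float sampleFrame[2]; // interleaved stereo

void peakValuesLockstep(const sampleFrame *buf, int frames,
                        float &peakL, float &peakR)
{
    peakL = 0.0f;
    peakR = 0.0f;
    // Touch each frame exactly once and track both channel peaks together.
    for (int f = 0; f < frames; ++f)
    {
        const float l = buf[f][0] < 0.0f ? -buf[f][0] : buf[f][0];
        const float r = buf[f][1] < 0.0f ? -buf[f][1] : buf[f][1];
        if (l > peakL) peakL = l;
        if (r > peakR) peakR = r;
    }
}
```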
> p.s. 70% idle seems VERY high compared to anything I ever get. Opening an empty project for me results in 1% on a mobile i5 processor.
Argh, sorry guys! Due to some update I performed on my master branch, my audio configuration got borked so that I had the null driver set during the profiler runs. This also caused the 70% idle usage. It seems that the null driver does not simulate a normal progress of time but performs all the calls as fast as possible. You can see this if you set the null driver and press play in the Song Editor: the position handle will just fly away. :)
I have now repeated the measurements with the correct ALSA driver. The aforementioned methods still show up, but with much smaller numbers. For example, MixHelpers::addSanitizedMultiplied only takes 2.76% with the correct ALSA driver. The peak algorithms also seem to take only around 1%.
Sorry for the confusion and work that this may have caused! Shall I just close this issue?
@michaelgregorius keep this open for now. Idle CPU usage really shouldn't be measurable - we can definitely improve in this area, and your basic performance measurements will be useful to reference.
Ah, so auto-vectorization appears to only be enabled at -O3. Compiling the code as-is, but with CMAKE_CXX_FLAGS=-march=native -O3, results in much more efficient SSE code for peakValueLeft/peakValueRight (though I suspect they would still benefit from being computed in lockstep).
Performance comparison between CMAKE_BUILD_TYPE=RelWithDebInfo (1) and CMAKE_CXX_FLAGS=-march=native -O3 (2), obtained via (perf record ./lmms &); sleep 120 && killall lmms and perf report:
- (1): Mixer::peakValueRight: 0.22%; Mixer::peakValueLeft: 0.22%
- (2): Mixer::peakValueRight: 0.12%; Mixer::peakValueLeft: 0.17%
So there appears to be at least some difference between the SSE and non-vectorized code, but it's not very clear how strong the difference is - not many samples seemed to land in this area of the code, so there's a lot of noise. Anyway, at some point it may be worth shipping both an lmmscore.so with SSE disabled and an lmmscore.sseN.so with SSE enabled, and dynamically linking the appropriate one at runtime.
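A minimal sketch of such runtime dispatch (the library names follow the suggestion above and are hypothetical; __builtin_cpu_supports is a GCC/Clang extension):

```cpp
#include <dlfcn.h>
#include <cstdio>

// Pick the SSE2 build of the core library if the CPU supports it,
// otherwise fall back to the plain build.
void *loadCoreLibrary()
{
    const char *name = __builtin_cpu_supports("sse2") ? "lmmscore.sse2.so"
                                                      : "lmmscore.so";
    void *lib = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
    if (lib == nullptr)
    {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    }
    return lib;
}
```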
@Wallacoloo Is it worth the effort to compile a non-SSE version at all? I think SSE(2) has existed for quite some time and should be available in most modern (and not so modern) processors.
Another interesting question might be how the common Linux distributions compile their software packages. Do they assume SSE to be present? If yes, then it might be worth enabling these optimizations by default.
https://github.com/LMMS/lmms/issues/835
Worth mentioning https://github.com/LMMS/lmms/issues/2005#issuecomment-96410262
@Wallacoloo
> If you are sure that auto-save causes performance problems, please open a separate issue.
Already exists: #181. But that is not the same as yours here, so the reference is borderline. (However, auto-save does need customizing / improvements.)
@michaelgregorius @tresf The links provided by Tres are pretty clear - it's safe to enable SSE(v1) for win32 and SSE2 for 64-bit builds of any OS. Apparently, the latter is done by default (I was slightly uninformed, it seems). So we just need to make sure we enable the appropriate auto-vectorization flags on shipped builds (i.e. whichever relevant ones are enabled in -O3) & then we don't have to worry about dropping support for anyone.
I did speak rather prematurely though - the difference between 0.22% and 0.15% is too negligible to warrant much serious effort on this front.
> Maybe explicitly vectorize the code (http://stackoverflow.com/questions/15238978/sse3-intrinsics-how-to-find-the-maximum-of-a-large-array-of-floats). Would have to be done using intrinsics that still work with SSE off though.
IMO that answer on Stack Overflow fits pretty perfectly; the _mm_max_ps technique would find the peak for both channels in one loop (data[0] and data[2] are one channel, data[1] and data[3] the other). Finding both peaks could happen up to twice as fast as finding one now.
Update: actually, if both channels are handled in one loop there would be a need for a "richer" return format or something like that. So why not go the full distance and return a sampleFrame[2] array containing the maximum and minimum for both channels? That might be useful for something (jellyfish display?). There's really no extra effort involved in getting them, as the absolute value can then be left out of the loop.
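A hedged sketch of that combined loop, adapted from the linked answer (SSE1 intrinsics only; the names and the min/max output format are made up):

```cpp
#include <xmmintrin.h>

// buf is interleaved L R L R ...; lanes 0/2 of each SSE register hold the
// left channel, lanes 1/3 the right, so one max/min per register covers both.
void stereoMinMax(const float *buf, int frames, float outMax[2], float outMin[2])
{
    __m128 vmax = _mm_set1_ps(-1e30f);
    __m128 vmin = _mm_set1_ps(1e30f);
    int f = 0;
    for (; f + 1 < frames; f += 2) // two stereo frames (4 floats) per step
    {
        const __m128 v = _mm_loadu_ps(buf + 2 * f); // L0 R0 L1 R1
        vmax = _mm_max_ps(vmax, v);
        vmin = _mm_min_ps(vmin, v);
    }
    float ma[4], mi[4];
    _mm_storeu_ps(ma, vmax);
    _mm_storeu_ps(mi, vmin);
    outMax[0] = ma[0] > ma[2] ? ma[0] : ma[2]; // left max
    outMax[1] = ma[1] > ma[3] ? ma[1] : ma[3]; // right max
    outMin[0] = mi[0] < mi[2] ? mi[0] : mi[2]; // left min
    outMin[1] = mi[1] < mi[3] ? mi[1] : mi[3]; // right min
    for (; f < frames; ++f) // possible odd trailing frame, handled scalarly
    {
        if (buf[2 * f]     > outMax[0]) outMax[0] = buf[2 * f];
        if (buf[2 * f + 1] > outMax[1]) outMax[1] = buf[2 * f + 1];
        if (buf[2 * f]     < outMin[0]) outMin[0] = buf[2 * f];
        if (buf[2 * f + 1] < outMin[1]) outMin[1] = buf[2 * f + 1];
    }
}
```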
Is there anything else that would be useful and cheap to calculate as long as there's a loop going on? Average level?
Closing this issue as the effects cannot be reproduced anymore.