Metrogram Transform for Time Signature Detection
Hey, first of all thank you for maintaining this very useful library.
I'm working with my team (@ethan89009, @ValerianCoelho) on our college project for time signature (meter) detection. We are planning to use the Metrogram Transform introduced in this paper.
In short, it looks for rhythmic ratios in the tempogram; I found that `librosa.feature.tempogram_ratio` does something similar. The major difference is that the Metrogram Transform doesn't require the tempo of the track: it works even if the tempo changes over time.
It would be nice to get some pointers on existing librosa code that we might reuse or draw from when implementing this transform. Many librosa functions build on more basic functions from the library (e.g. autocorrelation) instead of coding the math from scratch, which is neat. We will also try to figure this out ourselves.
Would this transform be a good fit for a PR to librosa? There was some related discussion in #194.
Thanks for the proposal. I'm reading through the paper now, and here are my thoughts about what a librosa implementation would look like:
- Everything up to equation 7 is essentially onset strength envelope calculation (`D(t)` in the paper's notation). I would separate functionality at this point, and implement a method that operates on any precomputed onset strength envelope.
- Equation 8 looks a bit like a Fourier (power) tempogram, except using a wavelet basis instead of a DFT.
- Equation 11 looks like an adapted version of the Peeters 2011 idea of DFT/AC product. This is essentially a cleanup step on a tempogram representation to suppress subharmonics. It might be nice to have a function to implement just this step, but I would keep it separate from the metrogram idea.
- The key equation here is 12, which is very similar to our `salience` function, which first uses `interp_harmonics` and then aggregates out the harmonic dimension. Metrogram basically does the same thing, with two differences: 1) the frequency (tempo) dimension is aggregated out, not the harmonic dimension, and 2) the aggregation is a sum-product against the fundamental. (1) is easy enough to achieve by changing the target axis parameter. (2) is a little more delicate, but you could achieve it easily by doing something like the following:
```python
# Assume T is the tempogram we begin with, and k are the harmonics/ratios
# we're interested in
T_interp = interp_harmonics(T, freqs=tempogram_freqs, harmonics=k, ...)  # and other parameters as needed

# T[np.newaxis, ...] is a view of T with dimensions compatible with T_interp.
# Now we can multiply them and aggregate out the tempo dimension.
# Default to aggregate=np.sum, but the user could supply others as needed.
M = aggregate(T_interp * T[np.newaxis, ...], axis=-2)

# M will now have shape (*T.shape[:-2], len(k), T.shape[-1])
```
This should be efficient and handle multichannel inputs readily.
So, TL;DR: yeah, this sounds like an interesting idea and something that would require a very small amount of new code. Fully reimplementing everything in their paper (i.e., their onset envelope method) would be out of scope IMO (or at least for this PR; improvements to onset envelopes are also welcome). OTOH, the product tempogram idea would be useful to include here, since it seems like subharmonic suppression is probably critical to getting this method to work well.
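To make that concrete, here's a rough, untested sketch of a Peeters-style DFT/AC product assembled from existing librosa pieces (the librosa calls are real; interpolating the AC tempogram onto the Fourier BPM axis is just one possible choice of common grid):

```python
import numpy as np
import scipy.interpolate
import librosa

y, sr = librosa.load(librosa.ex('nutcracker'))
hop_length, win_length = 512, 384

oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)

# Autocorrelation tempogram and its BPM axis (the first bin is infinite, so drop it)
ac = librosa.feature.tempogram(onset_envelope=oenv, sr=sr,
                               hop_length=hop_length, win_length=win_length)
ac_freqs = librosa.tempo_frequencies(win_length, sr=sr, hop_length=hop_length)

# Magnitude Fourier tempogram and its BPM axis
ft = np.abs(librosa.feature.fourier_tempogram(onset_envelope=oenv, sr=sr,
                                              hop_length=hop_length,
                                              win_length=win_length))
ft_freqs = librosa.fourier_tempo_frequencies(sr=sr, hop_length=hop_length,
                                             win_length=win_length)

# Interpolate the AC tempogram onto the Fourier BPM axis, then multiply:
# peaks survive only where both representations agree, which suppresses the
# spurious harmonics/subharmonics that each one produces on its own
interp = scipy.interpolate.interp1d(ac_freqs[1:], ac[1:], axis=0,
                                    bounds_error=False, fill_value=0)
product = ft * interp(ft_freqs)
```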
If you want to go ahead with this, it would be helpful to see a proof of concept implementation with minimal new parts (ie using existing onset calculation and tempogram methods, etc). That would help focus the development on the metrogram idea independently of improvements to more general components of the pipeline.
> The major difference is that the Metrogram Transform doesn't require the tempo of the track: it works even if the tempo changes over time.
Minor quibble: `tempogram_ratio` does require tempo, but it does not require *constant* tempo. It uses the `f0_harmonics` function to interpolate against a time-varying fundamental.
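For example, a minimal usage sketch (real librosa API; the example audio and default parameters are just for illustration):

```python
import librosa

y, sr = librosa.load(librosa.ex('nutcracker'))
oenv = librosa.onset.onset_strength(y=y, sr=sr)

# One row per ratio factor, computed against the (possibly time-varying)
# tempo estimate rather than a single global tempo
tgr = librosa.feature.tempogram_ratio(onset_envelope=oenv, sr=sr)
```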
Hey Brian, thank you very much for your helpful advice.
I've made a gist with our progress so far. Initially, I tried the metrogram transform directly on the Fourier / autocorrelation tempograms, but the results looked strange. So then I tried the fundamental tempogram idea, but it's even weirder (the harmonic k=1 is blank). The outputs for all of these are in the gist notebook. Do you have any suggestions we can try?
About the PR, we were also planning to focus on just the metrogram function right now, so this sounds perfect! We do plan to later build a robust ODF using machine learning. We're not yet sure what it'll look like, but we'll be happy to share it to see if it's a good fit.
> it seems like subharmonic suppression is probably critical to getting this method to work well.
My intuition is that the meter is found through the subharmonics of the tempo: if the tempo is 100 BPM and we see a strong subharmonic at 100/5 = 20 BPM, then we know there are 5 beats in a bar. I'm curious why suppressing the subharmonics helps.
> `tempogram_ratio` does require tempo, but it does not require *constant* tempo. It uses the `f0_harmonics` function to interpolate against a time-varying fundamental.
Thank you for the correction. That is honestly very cool!
> So then I tried the fundamental tempogram idea, but it's even weirder (the harmonic k=1 is blank). The outputs for all of these are in the gist notebook. Do you have any suggestions we can try?
The fundamental tempogram itself looks good - and I like the way you've implemented it (by interpolating on a common frequency axis)!
I suspect the issue with your metrogram implementation is that the harmonics given to `interp_harmonics` should actually be defined as subharmonics. Remember that these are multiplicative factors on the frequency axis, so `k=1/2` corresponds to tempo/2 (i.e., two beats), and `k = [1/2, 1/3, 1/4, 1/5, ...]` would correspond to 2/4, 3/4, 4/4, 5/4, etc.
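Concretely, an end-to-end sketch might look like this (untested; the Fourier tempogram, example audio, and this particular set of factors are just for illustration):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex('nutcracker'))
oenv = librosa.onset.onset_strength(y=y, sr=sr)

# Magnitude Fourier tempogram and its BPM axis
T = np.abs(librosa.feature.fourier_tempogram(onset_envelope=oenv, sr=sr))
freqs = librosa.fourier_tempo_frequencies(sr=sr)

# Subharmonic factors: k = 1/m picks out tempo/m, i.e. an m-beat grouping
k = np.array([1/2, 1/3, 1/4, 1/5, 1/6, 1/7])
T_sub = librosa.interp_harmonics(T, freqs=freqs, harmonics=k, axis=-2)

# Sum-product against the fundamental, aggregating out the tempo axis
M = np.sum(T_sub * T[np.newaxis, ...], axis=-2)  # shape: (len(k), n_frames)
```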
> My intuition is that the meter is found through the subharmonics of the tempo: if the tempo is 100 BPM and we see a strong subharmonic at 100/5 = 20 BPM, then we know there are 5 beats in a bar. I'm curious why suppressing the subharmonics helps.
Yes, you're right. I meant the subharmonics in the autocorrelation tempogram, not the Fourier one.
Hey Brian!
Using subharmonic values in k did end up working! Here is the new gist. We made a very simple metronome beat that moves from 4/4 to 3/4 and then 5/4 to test whether the metrogram can pick it up, and it works.
More complex tracks do not have as clean an output (sweetwaltz); we feel this is where onset improvements become important. Do you have any suggestions for improvements we can make here?
The smoothing and confidence code is how we are extracting the time signature from the metrogram.
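Roughly, that extraction step is along these lines (a simplified, illustrative sketch; the filter size and confidence measure here are placeholders, and the actual gist code may differ):

```python
import numpy as np
import scipy.ndimage

# M: metrogram of shape (len(k), n_frames), as computed above.
# Smooth over time, then pick the strongest ratio per frame.
M_smooth = scipy.ndimage.median_filter(M, size=(1, 31))

meters = np.array([2, 3, 4, 5, 6, 7])  # beats per bar for k = [1/2, ..., 1/7]
estimate = meters[np.argmax(M_smooth, axis=0)]

# A simple confidence measure: margin between the best and second-best ratio
M_sorted = np.sort(M_smooth, axis=0)
confidence = M_sorted[-1] - M_sorted[-2]
```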
> I like the way you've implemented it (by interpolating on a common frequency axis)!
Thanks! I got the interpolation idea from the Peeters paper you mentioned. I generalized it so that the user can choose the resulting frequency axis to come from either tempogram. I wonder if we should generalize it even further, to just a product of two spectrograms?
Are there any other changes you'd like to see before the PR? I think we'd have to add error handling and docs.
> Using subharmonic values in k did end up working! Here is the new gist. We made a very simple metronome beat that moves from 4/4 to 3/4 and then 5/4 to test whether the metrogram can pick it up, and it works.
Excellent! This looks great.
> More complex tracks do not have as clean an output (sweetwaltz); we feel this is where onset improvements become important. Do you have any suggestions for improvements we can make here?
Definitely not as clean, but it does seem to be hitting the right answer. It's not necessarily obvious in the plot, so I added a normalization option in a local copy of your gist:
```python
M = librosa.util.normalize(M, axis=-2)
```
scaling each column by its max value. The original paper didn't do this, but it does make the results a lot easier to parse visually. Here's sweetwaltz:
I also tried modifying the onset envelope calculation to use median aggregation (does well with strong transients, which we have here):
So, slightly cleaner, but I think it doesn't materially affect the high-level result. I also experimented with plugging in the PLP-cleaned onset envelope instead of the raw envelope, but this discarded too much information to be useful.
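For reference, that's just the `aggregate` parameter of `onset_strength` (real librosa API; the example audio here is only illustrative):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex('nutcracker'))

# Median across frequency bands instead of the default mean; tends to be
# more robust when strong transients dominate a few bands
oenv = librosa.onset.onset_strength(y=y, sr=sr, aggregate=np.median)
```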
> I wonder if we should generalize it even further, to just a product of two spectrograms?
That's an interesting idea. I could imagine a general util function that takes two arrays (x1, x2) with the following properties:
- arrays have broadcast-compatible dimensions except along a chosen axis (in this case, frequency or tempo)
- each array has positioning data given for the chosen axis
- a third set of positioning data is given as the interpolation grid
- x1 and x2 are interpolated to the common positions -> y1, y2
- a user-supplied ufunc (e.g. `np.multiply`) combines them: `return ufunc(y1, y2)`
Making the ufunc user-supplied would allow for other combinations (a sum could make sense?). It's maybe a lot of generalization, but it doesn't actually cost anything more than an interpolation and a multiply, so I see no downside.
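Something like this hypothetical helper (the name, signature, and defaults are placeholders; only the scipy `interp1d` calls are real API):

```python
import numpy as np
import scipy.interpolate

def interp_combine(x1, x2, *, positions1, positions2, positions_out,
                   ufunc=np.multiply, axis=-2, fill_value=0):
    """Interpolate x1 and x2 onto a common grid along `axis`, then combine.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    f1 = scipy.interpolate.interp1d(positions1, x1, axis=axis,
                                    bounds_error=False, fill_value=fill_value)
    f2 = scipy.interpolate.interp1d(positions2, x2, axis=axis,
                                    bounds_error=False, fill_value=fill_value)
    return ufunc(f1(positions_out), f2(positions_out))
```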
> Are there any other changes you'd like to see before the PR? I think we'd have to add error handling and docs.
Those would be the big ones 😁 and test coverage.
Thanks again for your work on this!
I'm using this for a paper! We found the results were stronger when we used Fourier tempogram frequencies instead of autocorrelation, and max instead of sum aggregation.
We evaluated on a dataset of the 222 tracks from the Hainsworth dataset plus 32 tracks from the McGill Billboard dataset, which we selected to be in one of the following meters: 3, 5, 6, 7, or 12. On these 254 tracks, the method with the two changes I suggested achieved an F-measure of 0.812, precision 0.70, and recall 0.967. It works great as a 4/4 estimator, which is our use case.
Took a while, but the PR is finally here! ( #1986 ) It is still a draft, and any feedback is very welcome.
Brian, thank you for experimenting with the notebook - the normalization tip is particularly helpful for seeing things more clearly!
Elena, thank you for rigorously testing out the implementation and improving on it. I think we're all glad this code was useful so quickly. :)
It would be great to also showcase the Fourier tempogram + max aggregation method as an example in the docs.