Support masking
Description of the problem/new feature
For example, I want to make a baseline fit for this spectrum, but with the -200 < x < 200 region masked out. Currently I need to mask the data manually:
```python
mask = (velocities < -200) | (velocities > 200)
v_masked = velocities[mask]
spec_masked = spectra[mask]
```
I think it would be better to have a mask function integrated into pybaselines.
Description of a possible solution or alternative
I would recommend just writing your own helper function that takes the mask and handles the interpolation and baseline correction internally. That is essentially how I would implement it here, and it is much too simple to warrant inclusion in the library.
With that being said, I've considered implementing masking before, more in the context of missing data, but it would support your use case as well. However, I think it's ultimately better suited to living in user code. It's not exactly clear how you want masking support to be implemented, but, as I see it, there are two options:
- (A) Support it internally on a method-by-method basis
- (B) An external wrapper that just interpolates the input data and then fits the baseline (it seems like this is what you're doing now?)
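To make option (B) concrete, here is a minimal sketch of such a wrapper (all names are hypothetical, and linear interpolation via `np.interp` is just one possible choice):

```python
import numpy as np

def fit_baseline_masked(x, y, mask, baseline_func, **kwargs):
    """Hypothetical option-(B) wrapper: fill the masked-out region by
    interpolating from the kept points, then fit the baseline as usual.

    mask is True for points to KEEP; baseline_func(x, y, **kwargs) is any
    baseline fitter that returns the baseline array.
    """
    y_filled = np.asarray(y, dtype=float).copy()
    # linear interpolation across the masked-out gap (a data-dependent choice!)
    y_filled[~mask] = np.interp(x[~mask], x[mask], y[mask])
    return baseline_func(x, y_filled, **kwargs)

# usage with a simple polynomial fit standing in for a real baseline method
def poly_baseline(x, y, deg=1):
    return np.polyval(np.polyfit(x, y, deg), x)

x = np.linspace(-500.0, 500.0, 201)
y = 0.01 * x + 5.0 + 20.0 * np.exp(-x**2 / (2 * 50.0**2))  # line + peak at 0
mask = (x < -200) | (x > 200)
baseline = fit_baseline_masked(x, y, mask, poly_baseline, deg=1)
```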
For option (A), I would say all of the algorithms in pybaselines fall into one of three categories:
- (1) Methods that already directly support masking by inputting the mask as weights, which includes all classification methods and all polynomial methods except for loess and quant_reg.
- (2) Methods that do iterative reweighting, such as Whittaker smoothing methods, most spline methods, loess, and quant_reg. On the surface, supporting masking would be as simple as setting the weights outside of the mask to 0 on each iteration; however, after some preliminary testing in response to this issue, I've found that most weighting schemes would also need to be made mask-aware, which would be quite an effort, especially to catch the edge cases.
- (3) All other methods, which would have to be handled on a case-by-case basis. Most would probably just interpolate the input data and then do the underlying baseline algorithm, more or less just reproducing the output of option (B).
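To illustrate what "inputting the mask as weights" in category (1) amounts to, here is a plain numpy sketch: a weight of zero removes a point from the least-squares fit entirely (`np.polyfit`'s `w` argument is used as a stand-in for a pybaselines method's weights input):

```python
import numpy as np

x = np.linspace(-500.0, 500.0, 201)
true_baseline = 0.01 * x + 5.0
y = true_baseline + 20.0 * np.exp(-x**2 / (2 * 50.0**2))  # peak near x = 0
mask = (x < -200) | (x > 200)  # True = treat the point as baseline

# Zero weight excludes the peak region from the fit entirely, which is
# what passing the mask as weights does for category (1) methods.
coeffs = np.polyfit(x, y, deg=1, w=mask.astype(float))
baseline = np.polyval(coeffs, x)
```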
Option (A) would be a major amount of work to support what I would consider a niche use, and in most cases would be approximated closely by option (B), so I don't intend on doing it.
Now, my issues with option (B):
- Its very existence could make users think it is the sole correct way to do masking, even for the algorithms that already directly support masking in category (1) above
- For methods in category (3) above, their performance would be directly tied to how interpolation on the input is performed (demonstrated in the simple example plot below using the `mor` method). Interpolation is often a non-trivial task and is very data-specific. You've given a simple example where linear interpolation would work, but many use cases would require more sophisticated spline or penalized spline interpolation. I don't want to be responsible for handling interpolation; that is up to the user. And if the user handles interpolation, then the data can be input directly into whatever baseline algorithm, so option (B) becomes useless on my end.
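As a quick demonstration of how data-specific the interpolation choice is, compare linear and cubic-spline interpolation across a masked gap on a curved baseline (scipy's `CubicSpline` is used here; on a straight baseline the two would agree):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(-500.0, 500.0, 101)
true_baseline = 1e-4 * x**2            # curved baseline
mask = (x < -200) | (x > 200)          # mask out the middle region

# fill the gap two different ways
linear_fill = np.interp(x, x[mask], true_baseline[mask])
cubic_fill = CubicSpline(x[mask], true_baseline[mask])(x)

# linear interpolation flattens the curvature across the wide gap,
# while the cubic spline tracks it closely
linear_err = float(np.max(np.abs(linear_fill - true_baseline)))
cubic_err = float(np.max(np.abs(cubic_fill - true_baseline)))
```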
I think adding an example to the documentation showing a simple implementation of option (B) with a few different interpolation methods, along with a discussion of when it is actually needed, would be the best route. The example could also show how to handle algorithms in category (2) such that they produce a weighted interpolation that closely resembles their mask-aware versions from option (A) and is independent of the interpolation of the input, as shown below for the arpls method (gist for the reproducer code).
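For reference, the "set the weights outside the mask to 0 each iteration" idea for category (2) can be sketched with a minimal asymmetric-least-squares Whittaker smoother. This is a simplified stand-in for illustration, not pybaselines' actual arpls implementation:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def asls_masked(y, mask, lam=1e6, p=0.01, n_iter=10):
    """Minimal AsLS-style smoother with a hard mask.

    mask is True for points allowed to influence the fit; masked-out points
    get zero weight on every iteration, so the baseline there is determined
    purely by the smoothness penalty.
    """
    n = len(y)
    diff = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (diff.T @ diff)
    w = mask.astype(float)
    z = np.zeros(n)
    for _ in range(n_iter):
        W = sparse.diags(w)
        z = spsolve((W + penalty).tocsc(), w * y)
        # asymmetric reweighting: points above the baseline look like peak
        w = np.where(y > z, p, 1.0 - p)
        w[~mask] = 0.0  # keep masked-out points at zero weight each iteration
    return z

# usage on a line-plus-peak spectrum with the peak region masked out
x = np.linspace(-500.0, 500.0, 201)
y = 0.01 * x + 5.0 + 20.0 * np.exp(-x**2 / (2 * 50.0**2))
mask = (x < -200) | (x > 200)
baseline = asls_masked(y, mask)
```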
Finally, just in case this is the actual data you want to mask and not just an example (if not, ignore this), I wanted to mention that positive peaks don't need to be masked for almost any baseline correction algorithm and especially not arpls. lam is just too small in this case and makes the baseline too flexible.
What about masking based on y-values, such as masking out all y > 10 or y > 3 sigma?
If you're asking to add a new function that does masking (based on whatever metric) and then inputs that masked data into another baseline correction method, then, to reiterate what I said above, that is much better suited to living in user code, and masking out positive peaks before calling a baseline correction method is almost never necessary to begin with. If you're asking for a baseline correction method that defines such a mask internally, then classification methods such as golotvin, dietrich, fastchrom, fabc, etc. already do this using more robust metrics.
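If the first interpretation is what's wanted, a y-based mask is only a few lines of user code anyway; for example, a hypothetical iterative 3-sigma clip using the MAD as a robust sigma estimate:

```python
import numpy as np

def sigma_clip_mask(y, n_sigma=3.0, max_iter=10):
    """Return a boolean mask that is True for points within n_sigma robust
    deviations above the running median; iterate until the mask stabilizes.

    Hypothetical helper for illustration only.
    """
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_iter):
        med = np.median(y[keep])
        sigma = 1.4826 * np.median(np.abs(y[keep] - med))  # MAD -> std
        new_keep = y < med + n_sigma * sigma
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return keep

# usage: noisy flat baseline plus a large peak near x = 0
rng = np.random.default_rng(0)
x = np.linspace(-500.0, 500.0, 201)
y = 5.0 + rng.normal(0.0, 0.1, x.size) + 20.0 * np.exp(-x**2 / (2 * 50.0**2))
keep = sigma_clip_mask(y)
```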
If neither of the above cover what you actually want, then can you please provide code for a complete, working example of what you're expecting? It is very unclear to me right now because "support masking" could have several different meanings.
Hey @Firestar-Reimu,
I added an example to the documentation to cover how to handle masking for the various algorithms in pybaselines (see https://pybaselines--45.org.readthedocs.build/en/45/generated/examples/general/plot_masked_data.html#sphx-glr-generated-examples-general-plot-masked-data-py for the built documentation). Let me know if you have any additional questions or issues with that example, otherwise I will merge it in 2 weeks and close this issue.
Good, thanks