Sylvia whittle/431 ignore high grains
Closes #431
Adds a step to grains.py that removes grains from the mask based on their median heights. This lets users specifically remove tall grains, such as proteins, when imaging DNA.
The upper and lower thresholds for this removal are both configurable, so users can tune the removal to their sample.
If the thresholds are not set, no grains are removed.
Codecov Report
Patch coverage: 97.36% and project coverage change: +0.22 :tada:
Comparison is base (7b931c1) 80.84% compared to head (4854e7d) 81.06%.
Additional details and impacted files
@@ Coverage Diff @@
## main #574 +/- ##
==========================================
+ Coverage 80.84% 81.06% +0.22%
==========================================
Files 19 19
Lines 2814 2852 +38
==========================================
+ Hits 2275 2312 +37
- Misses 539 540 +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| topostats/processing.py | 90.85% <ø> (ø) | |
| topostats/validation.py | 100.00% <ø> (ø) | |
| topostats/grains.py | 97.84% <97.36%> (-0.18%) | :arrow_down: |
I've only briefly looked at the code but would like to make the following observation...
Removes grains from a labelled mask based on the median height of the grains. Takes two values (lower, upper) in a tuple, height_thresholds_std_dev_mult, which are multiplied by the standard deviation of the data image to act as thresholds. Grains whose median value is less than the lower threshold or greater than the upper threshold are removed from the mask.
For example, for the values (1.0, 2.0) and an image whose standard deviation is 3.0, the lower grain height threshold would be 1.0 * 3.0 = 3.0, and the upper height threshold would be 2.0 * 3.0 = 6.0. So any grain whose median pixel value is outside the range 3.0 to 6.0 will be removed from the mask.
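The thresholding described in the quoted docstring can be sketched roughly as follows. This is a minimal illustration, not the code in grains.py: the function name `remove_grains_by_median_height` and its signature are hypothetical, and using `None` to disable a bound is an assumption made for the example.

```python
import numpy as np


def remove_grains_by_median_height(image, labelled_mask, std_dev_multipliers=(None, None)):
    """Hypothetical helper: zero out grains whose median height is outside the thresholds."""
    lower_mult, upper_mult = std_dev_multipliers
    image_std = np.std(image)  # standard deviation of the whole image
    # A bound of None (an assumption for this sketch) means "do not filter on this side".
    lower = None if lower_mult is None else lower_mult * image_std
    upper = None if upper_mult is None else upper_mult * image_std

    cleaned = labelled_mask.copy()
    for label in np.unique(labelled_mask):
        if label == 0:  # 0 is background
            continue
        grain_pixels = labelled_mask == label
        median_height = np.median(image[grain_pixels])
        too_low = lower is not None and median_height < lower
        too_high = upper is not None and median_height > upper
        if too_low or too_high:
            cleaned[grain_pixels] = 0
    return cleaned


# Toy data: grain 1 is low (height 1), grain 2 is tall (height 10).
image = np.array(
    [[0, 0, 0, 0], [0, 1, 0, 10], [0, 1, 0, 10], [0, 0, 0, 0]], dtype=float
)
labels = np.array([[0, 0, 0, 0], [0, 1, 0, 2], [0, 1, 0, 2], [0, 0, 0, 0]])

untouched = remove_grains_by_median_height(image, labels)
tall_removed = remove_grains_by_median_height(image, labels, (None, 2.0))
low_removed = remove_grains_by_median_height(image, labels, (0.5, None))
```

With both bounds left unset the mask comes back unchanged, which matches the default behaviour described in the PR.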
it should only remove grains whose mean height is higher than 100 standard deviations away from the median
A perhaps petty but I think important statistical issue here. The standard deviation is a measure of dispersion around the mean, not the median.
It's calculated as...
$$\sqrt{\frac{\sum_{i=1}^{n} \left( x_i - \text{mean} \right)^2}{n}}$$
(doesn't render quite right, not an expert in GitHub Markdown formulae).
The mean is
$$\frac{\sum_{i=1}^{n} x_i}{n}$$
The median is the 50th percentile, and dispersion around the median is measured by the other percentiles. Common ones are the 25th and 75th percentiles, which together give the inter-quartile range.
Typically means and standard deviations are used for normal (Gaussian) distributions. The median and percentiles are useful for these too, but are more appropriate when the distribution is skewed by extreme outliers in one direction or another.
It doesn't make sense to me, as an ex-statistician, to calculate 5 or 100 standard deviations from the median. In distributions that are roughly Gaussian, about 99.7% of observations lie within the mean ± 3 standard deviations (the empirical rule), so there would probably be no observations at all as far out as 100 standard deviations from the mean. Under Gaussian distributions the mean and median are often roughly equal, so even setting aside that pairing the standard deviation with the median doesn't make sense, you probably wouldn't expect to see many observations so far from the mean (any such values would be massive outliers and would skew the mean a lot).
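To put rough numbers on this, here is a small sketch using only the Python standard library. It assumes an exactly Gaussian distribution for the coverage figures, and the list of heights is made up purely to show how one extreme value moves the mean but not the median.

```python
import math
import statistics


def fraction_within(k):
    """Fraction of a Gaussian distribution lying within mean +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 5):
    print(f"within {k} SD: {fraction_within(k):.7f}")

# Made-up heights: 99 ordinary values and one extreme outlier.
heights = [1.0] * 99 + [100.0]
mean_height = statistics.mean(heights)      # pulled upwards by the outlier
median_height = statistics.median(heights)  # unaffected by the outlier
print(mean_height, median_height)
```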
In this regard I think the implementation should perhaps be based on some observed data (apologies if this has already been done; I see in #431 that @SylviaWhittle may have done some investigations). But perhaps let's collate statistics from a range of samples (not just minicircle.spm) and work out what these parameters are (after filtering, which cleans things up)...
- whole scan mean
- whole scan variance (standard deviation is just square-root of this)
- whole scan percentiles (5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 85, 90, 95)
For each grain the same statistics...
- grain mean
- grain variance
- grain percentiles
Then we can start seeing how these stack up and whether using standard deviations across the whole image away from median of a grain is reasonable.
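A rough sketch of how these statistics could be collected for one scan is below. The function names `height_stats` and `scan_and_grain_stats` are hypothetical, and the inputs are assumed to be a filtered height image plus a labelled grain mask where 0 is background.

```python
import numpy as np

PERCENTILES = (5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 85, 90, 95)


def height_stats(values):
    """Mean, variance and the listed percentiles for a set of height values."""
    values = np.asarray(values, dtype=float)
    stats = {"mean": float(np.mean(values)), "variance": float(np.var(values))}
    for q, value in zip(PERCENTILES, np.percentile(values, PERCENTILES)):
        stats[f"p{q}"] = float(value)
    return stats


def scan_and_grain_stats(image, labelled_mask):
    """Whole-scan statistics plus the same statistics for every labelled grain."""
    per_grain = {
        int(label): height_stats(image[labelled_mask == label])
        for label in np.unique(labelled_mask)
        if label != 0  # skip background
    }
    return {"scan": height_stats(image), "grains": per_grain}


# Toy example: a single two-pixel grain with heights 2 and 4.
image = np.array([[0, 0, 0], [0, 2, 4], [0, 0, 0]], dtype=float)
labels = np.array([[0, 0, 0], [0, 1, 1], [0, 0, 0]])
result = scan_and_grain_stats(image, labels)
```

Running this over a range of samples and tabulating the results would give the data needed to compare grain medians against whole-image standard deviations.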
Would we prefer this over the arrays? It fills up more area in the config file, which we might want to keep looking less intimidating. It's a tradeoff between simplicity and conciseness.
Concise but less simple:
Verbose but simple:
Thanks for your comments @Jean-Du, I've fixed the bug that caused the problem you found. It was a single minus sign.
Config file for reference:
# Configuration from TopoStats run completed : 2023-06-01 17:44:31
# For more information on configuration : https://afm-spm.github.io/TopoStats/main/configuration.html
base_dir: G:\My Drive\PhD\Data\HU_project\Tests\High_objects_test
output_dir: output_new
log_level: info
cores: 2
file_ext: .spm
loading:
  channel: Height
filter:
  run: true
  row_alignment_quantile: 0.5
  threshold_method: std_dev
  otsu_threshold_multiplier: 1.0
  threshold_std_dev:
    below: 10.0
    above: 1.0
  threshold_absolute:
    below: -1.0
    above: 1.0
  gaussian_size: 1.0121397464510862
  gaussian_mode: nearest
  remove_scars:
    run: true
    removal_iterations: 2
    threshold_low: 0.25
    threshold_high: 0.666
    max_scar_width: 4
    min_scar_length: 16
grains:
  run: true
  threshold_method: std_dev
  otsu_threshold_multiplier: 1.0
  threshold_std_dev:
    below: 10.0
    above: 1.0
  threshold_absolute:
    below: -1.0
    above: 1.0
  direction: above
  grain_height_removal_thresholds_std_dev:
    below:
      -
      -
    above:
      -
      - 2.75
  smallest_grain_size_nm2: 50
  absolute_area_threshold:
    above:
      -
      -
    below:
      -
      -
grainstats:
  run: true
  edge_detection_method: binary_erosion
  cropped_size: 40.0
dnatracing:
  run: true
  min_skeleton_size: 10
plotting:
  run: true
  save_format: png
  pixel_interpolation:
  image_set: core
  zrange:
    -
    -
  colorbar: true
  axes: true
  cmap: nanoscope
  mask_cmap: blu
  histogram_log_axis: false
  histogram_bins: 200
  dpi: 100
summary_stats:
  run: true
  config:
Looking to close up old issues that are lingering and I'm curious if this might be addressed by #666 ?
This feature is slightly different. It removes grains whose median value is higher than a threshold. #666 prevents pixels above a threshold from being added to the mask.
In the former case, a protein that is very high up but has low pixels around the edge would be entirely removed. In the latter case, the ring of low pixels around the protein would be masked and kept as a grain. Does that make sense?
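The difference between the two approaches can be seen on a toy image. This is only an illustrative sketch: the image, the absolute thresholds `lower` and `upper`, and the variable names are all made up, and neither branch reproduces the actual code in this PR or in #666.

```python
import numpy as np

# Toy "protein": a tall 5x5 core (height 10) surrounded by a one-pixel low rim (height 2).
image = np.zeros((9, 9))
image[1:8, 1:8] = 2.0
image[2:7, 2:7] = 10.0

lower, upper = 1.0, 5.0  # illustrative absolute thresholds

# Grain-level removal (this PR): mask everything above `lower`, then drop the
# whole grain if its median height exceeds `upper`.
grain = image > lower
grain_level_mask = grain.copy()
if np.median(image[grain]) > upper:
    grain_level_mask[:] = False

# Pixel-level thresholding (the #666 behaviour as described above): pixels above
# `upper` never enter the mask, so the low rim survives as a ring.
pixel_level_mask = (image > lower) & (image < upper)

grain_level_pixels = int(grain_level_mask.sum())
pixel_level_pixels = int(pixel_level_mask.sum())
print(grain_level_pixels, pixel_level_pixels)
```

In this toy case the grain-level approach leaves nothing behind, while the pixel-level approach keeps the 24 rim pixels as a ring-shaped grain.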
@SylviaWhittle will finish this off in the coming weeks as she is familiar with the work and it doesn't require too much more effort.
Concept approved by experimentalists. Use max value. Users will be able to guess a good height to set as the maximum for their sample.
Revive this PR