TopoStats Sylvia whittle/431 ignore high grains

Closes #431

Adds a step to grains.py that removes grains from the mask based on their median heights. This enables users to specifically remove tall grains such as proteins when imaging DNA.

The upper and lower thresholds for this removal are both configurable, to give users as much control as possible.

If not set, this will default to not removing any grains.

May 17 '23 17:05 SylviaWhittle

Codecov Report

Patch coverage: 97.36% and project coverage change: +0.22 :tada:

Comparison is base (7b931c1) 80.84% compared to head (4854e7d) 81.06%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #574      +/-   ##
==========================================
+ Coverage   80.84%   81.06%   +0.22%     
==========================================
  Files          19       19              
  Lines        2814     2852      +38     
==========================================
+ Hits         2275     2312      +37     
- Misses        539      540       +1

| Impacted Files | Coverage Δ |
| --- | --- |
| topostats/processing.py | `90.85% <ø> (ø)` |
| topostats/validation.py | `100.00% <ø> (ø)` |
| topostats/grains.py | `97.84% <97.36%> (-0.18%)` :arrow_down: |

May 17 '23 17:05 codecov-commenter

I've only briefly looked at the code but would like to make the following observation...

Removes grains from a labelled mask based on the median height of the grains. Takes two values (lower, upper) in a tuple, height_thresholds_std_dev_mult, which get multiplied by the standard deviation of the data image to act as thresholds. Grains whose median value is less or greater than these values respectively are removed from the mask.

For example, for the values (1.0, 2.0), and an image whose standard deviation is 3.0, the lower grain height threshold would be 1.0 * 3.0 = 3.0, and the upper height threshold would be 2.0 * 3.0 = 6.0. So any grain mask whose median pixel value is outside of the range 3.0 to 6.0 will be removed.
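To make the description above concrete, here is a minimal sketch of that kind of filter using only NumPy. It is not the code in grains.py: the function name `remove_grains_by_median_height`, its signature, and the convention that `None` disables a bound are all assumptions made for illustration.

```python
import numpy as np


def remove_grains_by_median_height(image, labelled_mask, std_dev_mult):
    """Hypothetical helper: drop grains whose median height is outside the thresholds.

    std_dev_mult is a (lower, upper) tuple of multipliers for the standard
    deviation of the whole image. A None entry disables that bound (an
    assumption of this sketch, not necessarily what TopoStats does).
    """
    lower_mult, upper_mult = std_dev_mult
    std_dev = np.std(image)
    lower = None if lower_mult is None else lower_mult * std_dev
    upper = None if upper_mult is None else upper_mult * std_dev

    cleaned = labelled_mask.copy()
    for label in np.unique(labelled_mask):
        if label == 0:  # label 0 is treated as background
            continue
        median_height = np.median(image[labelled_mask == label])
        too_low = lower is not None and median_height < lower
        too_high = upper is not None and median_height > upper
        if too_low or too_high:
            cleaned[labelled_mask == label] = 0
    return cleaned


# Toy image with three labelled grains of different heights.
image = np.zeros((6, 6))
labels = np.zeros((6, 6), dtype=int)
image[0:2, 0:2] = 1.0
labels[0:2, 0:2] = 1  # short grain
image[0:2, 4:6] = 4.0
labels[0:2, 4:6] = 2  # mid-height grain
image[4:6, 0:2] = 20.0
labels[4:6, 0:2] = 3  # tall grain

cleaned = remove_grains_by_median_height(image, labels, (0.5, 2.0))
print(np.unique(cleaned))  # labels that survive the filter
```

In this toy image the standard deviation is roughly 6.2, so the multipliers (0.5, 2.0) give thresholds of about 3.1 and 12.4, and only the mid-height grain falls inside the range.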

it should only remove grains whose mean height is higher than 100 standard deviations away from the median

A perhaps petty but I think important statistical issue here. The standard deviation is a measure of dispersion around the mean, not the median.

It's calculated as...

$$\sqrt{\frac{\sum_{i} \left( x_i - \text{mean} \right)^2}{n}}$$

(doesn't render quite right, not an expert in GitHub Markdown formulae).

The mean is

$$\frac{\sum_{i} x_i}{n}$$

The median is the 50th percentile, and dispersion around the median is described by the other percentiles. Common ones are the 25th and 75th percentiles, which together give the inter-quartile range.

Typically means and standard deviations are used for normal (Gaussian) distributions. Medians and percentiles are useful for these too, but are more appropriate when the distribution is skewed by extreme outliers in one direction or another.

It doesn't make sense to me, as an ex-statistician, to calculate 5 or 100 standard deviations from the median. Statistically, in distributions that are roughly Gaussian, about 99.7% of observations lie within the mean -/+ 3 x Standard Deviation (the 68-95-99.7 rule), so a threshold 100 standard deviations from the mean would probably never be reached by any observation. Under Gaussian distributions the mean and median are often roughly equivalent, and even though it doesn't make sense to use the standard deviation with the median, you probably wouldn't expect to see many observations so far from the mean (any such values would be massive outliers and would be skewing the mean a lot).
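The points above can be checked numerically. The sketch below uses synthetic, roughly Gaussian "heights" (the location, scale and outlier values are made up for illustration and are not taken from any AFM scan) and compares the mean/standard deviation pair with the median/inter-quartile range pair, before and after adding a block of extreme outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, roughly Gaussian heights (values are illustrative only).
heights = rng.normal(loc=2.0, scale=0.5, size=100_000)

mean, std_dev = heights.mean(), heights.std()
median = np.median(heights)
q25, q75 = np.percentile(heights, [25, 75])

# Fraction of observations within 3 standard deviations of the mean,
# and the number of observations more than 100 standard deviations away.
within_3_sd = np.mean(np.abs(heights - mean) <= 3 * std_dev)
beyond_100_sd = int(np.sum(np.abs(heights - mean) > 100 * std_dev))

# Add about 1% extreme outliers and see which statistics move.
skewed = np.concatenate([heights, np.full(1_000, 50.0)])
mean_shift = abs(skewed.mean() - mean)
median_shift = abs(np.median(skewed) - median)

print(f"mean={mean:.3f} median={median:.3f} sd={std_dev:.3f} IQR={q75 - q25:.3f}")
print(f"within 3 SD: {within_3_sd:.4f}, beyond 100 SD: {beyond_100_sd}")
print(f"shift from outliers: mean {mean_shift:.3f}, median {median_shift:.3f}")
```

The outliers pull the mean noticeably while the median barely moves, which is the reason for preferring percentiles when the distribution is skewed.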

In this regard I think the implementation should perhaps be based on some observed data (apologies if this has already been done; I see in #431 that @SylviaWhittle may have done some investigations). But perhaps let's collate statistics from a range of samples (not just minicircle.spm) and work out what these parameters are (after filtering, which cleans things up)...

whole scan mean
whole scan variance (standard deviation is just square-root of this)
whole scan percentiles (5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 85, 90, 95)

For each grain the same statistics...

grain mean
grain variance
grain percentiles

Then we can start seeing how these stack up and whether using standard deviations across the whole image away from median of a grain is reasonable.
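As a starting point, collating the statistics listed above could look something like the following sketch. The function names `summarise_heights` and `collate_statistics` are hypothetical, and the sketch assumes the filtered image and a labelled grain mask (0 for background) are already available as NumPy arrays.

```python
import numpy as np

PERCENTILES = [5, 10, 20, 25, 30, 40, 50, 60, 70, 75, 80, 85, 90, 95]


def summarise_heights(values):
    """Hypothetical helper: mean, variance and percentiles of a set of heights."""
    values = np.asarray(values, dtype=float).ravel()
    stats = {"mean": float(values.mean()), "variance": float(values.var())}
    for p, v in zip(PERCENTILES, np.percentile(values, PERCENTILES)):
        stats[f"p{p}"] = float(v)
    return stats


def collate_statistics(image, labelled_mask):
    """Hypothetical helper: the same statistics for the whole scan and for each grain."""
    rows = {"scan": summarise_heights(image)}
    for label in np.unique(labelled_mask):
        if label != 0:  # label 0 is treated as background
            rows[f"grain_{label}"] = summarise_heights(image[labelled_mask == label])
    return rows


# Tiny toy example: a 4x4 "scan" with one two-pixel grain.
image = np.arange(16, dtype=float).reshape(4, 4)
labels = np.zeros((4, 4), dtype=int)
labels[0, :2] = 1

rows = collate_statistics(image, labels)
for name, stats in rows.items():
    print(name, stats["mean"], stats["variance"], stats["p50"])
```

Each row is a plain dictionary, so results from several scans could be gathered into a table and compared across samples.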

May 23 '23 19:05 ns-rse

Would we prefer this over the arrays? It fills up more area in the config file, which we might want to keep looking less intimidating. It's a tradeoff between simplicity and conciseness.

Concise but less simple:

Verbose but simple:

May 24 '23 14:05 SylviaWhittle

Thanks for your comments @Jean-Du, I've fixed the bug that caused the problem you found. It was a single minus sign.

May 26 '23 12:05 SylviaWhittle

Config file for reference:

# Configuration from TopoStats run completed : 2023-06-01 17:44:31
# For more information on configuration : https://afm-spm.github.io/TopoStats/main/configuration.html
base_dir: G:\My Drive\PhD\Data\HU_project\Tests\High_objects_test
output_dir: output_new
log_level: info
cores: 2
file_ext: .spm
loading:
  channel: Height
filter:
  run: true
  row_alignment_quantile: 0.5
  threshold_method: std_dev
  otsu_threshold_multiplier: 1.0
  threshold_std_dev:
    below: 10.0
    above: 1.0
  threshold_absolute:
    below: -1.0
    above: 1.0
  gaussian_size: 1.0121397464510862
  gaussian_mode: nearest
  remove_scars:
    run: true
    removal_iterations: 2
    threshold_low: 0.25
    threshold_high: 0.666
    max_scar_width: 4
    min_scar_length: 16
grains:
  run: true
  threshold_method: std_dev
  otsu_threshold_multiplier: 1.0
  threshold_std_dev:
    below: 10.0
    above: 1.0
  threshold_absolute:
    below: -1.0
    above: 1.0
  direction: above
  grain_height_removal_thresholds_std_dev:
    below:
    - 
    - 
    above:
    - 
    - 2.75
  smallest_grain_size_nm2: 50
  absolute_area_threshold:
    above:
    - 
    - 
    below:
    - 
    - 
grainstats:
  run: true
  edge_detection_method: binary_erosion
  cropped_size: 40.0
dnatracing:
  run: true
  min_skeleton_size: 10
plotting:
  run: true
  save_format: png
  pixel_interpolation:
  image_set: core
  zrange:
  - 
  - 
  colorbar: true
  axes: true
  cmap: nanoscope
  mask_cmap: blu
  histogram_log_axis: false
  histogram_bins: 200
  dpi: 100
summary_stats:
  run: true
  config:

Jun 01 '23 16:06 Jean-Du

Looking to close up old issues that are lingering and I'm curious if this might be addressed by #666 ?

Oct 04 '23 11:10 ns-rse

> Looking to close up old issues that are lingering and I'm curious if this might be addressed by #666 ?

This feature is slightly different. It removes grains whose median value is higher than a threshold. #666 prevents pixels above a threshold from being added to the mask.

In the former case, a protein that is very high up but has low pixels around the edge would be entirely removed. In the latter case, the ring of low pixels around the protein would be masked and kept as a grain. Does that make sense?
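A toy example of the difference, as a sketch only: the image, the threshold values and the single-grain simplification are all made up for illustration, and neither branch is the actual code from this PR or from #666.

```python
import numpy as np

# Toy "protein": a tall 5x5 core surrounded by a one-pixel ring of low pixels.
image = np.zeros((11, 11))
image[2:9, 2:9] = 2.0   # low ring of pixels around the edge
image[3:8, 3:8] = 10.0  # tall core

lower, upper = 1.0, 5.0  # illustrative absolute thresholds

# Approach 1 (this feature, as described): mask everything above the lower
# threshold, then remove the whole grain if its median height is too high.
# The toy image contains a single grain, so no labelling step is needed.
grain_mask = image > lower
median_height = np.median(image[grain_mask])
if median_height > upper:
    grain_mask = np.zeros_like(grain_mask)

# Approach 2 (#666, as described): pixels above the upper threshold are
# never added to the mask in the first place.
pixel_mask = (image > lower) & (image < upper)

print("median-height removal keeps", int(grain_mask.sum()), "pixels")
print("pixel thresholding keeps", int(pixel_mask.sum()), "pixels")
```

With the first approach nothing of the protein is left in the mask, while with the second the ring of low pixels survives and would be carried forward as a grain.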

Oct 10 '23 11:10 SylviaWhittle

@SylviaWhittle will finish this off in the coming weeks as she is familiar with the work and it doesn't require too much more effort.

Dec 19 '23 11:12 ns-rse

Concept approved by experimentalists. Use max value. Users will be able to guess a good height to set as the maximum for their sample.

Revive this PR

Jan 24 '24 13:01 SylviaWhittle