
supercell_purity entropy measure

Open simozhou opened this issue 7 months ago • 1 comments

Hello,

in the function documentation you state this about the supercell_purity function:

#' @return a vector of super-cell purity, which is defined as:
#' - proportion of the most abundant cluster within super-cell for \code{method = "max_proportion"} or
#' - Shanon entropy for \code{method = "entropy"}.
#' With 1 meaning that super-cell consists of single cells from one cluster (reference assignment)

Could you explain in more detail how the entropy method works? I get strange results, and I don't understand whether in that case a higher value means purer supercells (I guess not?).

Thank you

simozhou • Jun 16 '25 13:06

Hello @simozhou,

Thanks for your question!

The entropy approach to purity estimation computes the Shannon entropy of the cell-type frequency vector; it is 0 for fully pure metacells and larger than 0 otherwise.

For example:

For a metacell that consists of 9 cells of the same cell type (type B here), the cell-type frequency vector looks like:

type A type B type C type D 
     0      9      0      0 

and the entropy is 0 (high purity)

For a metacell that consists of 9 cells of different cell types, the cell-type frequency vector looks like:

type A type B type C type D 
     3      2      1      3 

and the entropy is ~1.3. (low purity)
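Both values can be reproduced by hand. The following is a minimal sketch in Python, not the SuperCell implementation: the helper name `shannon_entropy` is hypothetical, and it assumes the natural logarithm, which is consistent with the ~1.3 quoted for the mixed metacell.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (natural log) of a vector of cell-type counts."""
    total = sum(counts)
    # Zero counts contribute nothing (p * log(p) -> 0 as p -> 0), so skip them.
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in probs)

pure = [0, 9, 0, 0]    # all 9 cells share one cell type
mixed = [3, 2, 1, 3]   # 9 cells spread over four cell types

print(shannon_entropy(pure))             # 0.0
print(round(shannon_entropy(mixed), 2))  # 1.31
```

With a base-2 logarithm the mixed metacell would instead give about 1.89, so the log base matters when comparing numbers from different tools.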

The maximum value depends on the size of a metacell (and on the total number of cell types, but that is constant per dataset). Therefore, direct comparison of entropy across metacells of different sizes can be less meaningful. However, as a general rule, lower entropy indicates higher purity.
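To illustrate how that ceiling moves with metacell size, here is a small Python sketch. It is an illustration only, not part of SuperCell: the helper name `max_entropy` is hypothetical and the natural logarithm is assumed. A metacell of n cells in a dataset with K cell types can contain at most min(n, K) distinct types, so its entropy cannot exceed log(min(n, K)).

```python
import math

def max_entropy(n_cells, n_types):
    """Upper bound on Shannon entropy (natural log) for a metacell of
    n_cells cells when n_types cell types exist in the dataset."""
    return math.log(min(n_cells, n_types))

n_types = 4  # as in the four-type example (types A to D)
for n_cells in (1, 2, 3, 9):
    print(n_cells, round(max_entropy(n_cells, n_types), 2))
```

With four cell types, a 2-cell metacell is bounded by about 0.69 and a 9-cell metacell by about 1.39, so the same entropy value can mean "as mixed as possible" for a small metacell and "moderately mixed" for a large one.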

Best, Mariia

mariiabilous • Jun 26 '25 12:06