supercell_purity entropy measure
Hello,
in the function documentation you state this about the supercell_purity function:
#' @return a vector of super-cell purity, which is defined as:
#' - proportion of the most abundant cluster within super-cell for \code{method = "max_proportion"} or
#' - Shanon entropy for \code{method = "entropy"}.
#' With 1 meaning that super-cell consists of single cells from one cluster (reference assignment)
Could you explain in more detail how the entropy method works? I get strange results, and I don't understand whether in that case a higher value means purer supercells (I guess not?).
Thank you
Hello @simozhou,
Thanks for your question!
The entropy approach to purity estimation computes the Shannon entropy of the cell-type frequency vector of each metacell. It is 0 for fully pure metacells and larger than 0 otherwise.
For example:
For a metacell that consists of 9 cells, all of the same cell type B, the cell-type frequency vector looks like:
type A   type B   type C   type D
0        9        0        0
and the entropy is 0 (high purity).
For a metacell that consists of 9 cells of different cell types, the cell-type frequency vector looks like:
type A   type B   type C   type D
3        2        1        3
and the entropy is ~1.3, computed with the natural logarithm (low purity).
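The two examples above can be reproduced with a minimal sketch. It is written in Python for illustration only and is not the SuperCell implementation: the function name `shannon_entropy` is hypothetical, and the natural logarithm is an assumption, chosen because it reproduces the value of about 1.3.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (natural log) of a vector of cell-type counts.

    Hypothetical helper for illustration; not part of the SuperCell package.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one cell")
    # Cell types with zero cells contribute nothing (convention: 0 * log(0) = 0).
    probs = [c / total for c in counts if c > 0]
    # log(1 / p) is non-negative, so a pure metacell gives exactly 0.
    return sum(p * math.log(1 / p) for p in probs)

pure = shannon_entropy([0, 9, 0, 0])    # all 9 cells share one cell type
mixed = shannon_entropy([3, 2, 1, 3])   # 9 cells spread over four cell types
print(pure, round(mixed, 2))
```

The pure metacell gives 0 and the mixed one gives roughly 1.31, so lower values correspond to purer metacells.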
The maximum value depends on the size of a metacell (and on the total number of cell types, but that is constant per dataset). Direct comparison across metacells of different sizes can therefore be less meaningful. However, as a general rule, lower entropy indicates higher purity.
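The dependence on metacell size can be illustrated with a short sketch. It assumes natural-log entropy and uses a hypothetical helper, `max_entropy`. The bound log(min(n, k)) for n cells and k cell types is a general property of Shannon entropy rather than something taken from the package code.

```python
import math

def max_entropy(n_cells, n_types):
    """Upper bound on the natural-log Shannon entropy of one metacell.

    Hypothetical helper for illustration. A metacell of n_cells cells can
    contain at most min(n_cells, n_types) distinct cell types, and entropy
    is largest when those types are equally frequent.
    """
    if n_cells < 1 or n_types < 1:
        raise ValueError("n_cells and n_types must be positive")
    return math.log(min(n_cells, n_types))

# With 4 annotated cell types, small metacells cannot reach the same
# entropy as large ones, even when they are as mixed as possible.
for size in (1, 2, 3, 9):
    print(size, round(max_entropy(size, 4), 3))
```

Under these assumptions a 2-cell metacell can reach at most log(2), about 0.69, while a 9-cell metacell can reach up to log(4), about 1.39. This is why equal entropy values need not mean equal purity across metacells of different sizes.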
Best, Mariia