scCATCH

Cell type annotation for more than 3 subtypes

Open • gkc251 opened this issue 2 years ago • 1 comment

Dear developers,

When building a custom cell marker data frame for cells with more than 3 subtypes (e.g. different subtypes of T cells), what is the best approach? Do you plan to extend the software to support N subtypes or hierarchical cell types (e.g. immune cells)? Thank you in advance.

gkc251, Aug 29 '23 08:08

Hi @gkc251,

Thank you for your question! I'd like to introduce mLLMCelltype, a new tool that addresses exactly the challenges you mentioned with annotating multiple cell subtypes and hierarchical cell types.

mLLMCelltype Features Relevant to Your Needs:

1. Unlimited Subtype Support

mLLMCelltype can handle any number of cell subtypes without limitations. Whether you have 3, 10, or even 50+ T cell subtypes, our multi-model consensus framework can distinguish between them based on their marker gene profiles.

2. Hierarchical Annotation Support

The tool naturally handles hierarchical cell type relationships. For example:

  • Level 1: Immune cells vs Non-immune cells
  • Level 2: T cells, B cells, Myeloid cells, etc.
  • Level 3: CD4+ T cells, CD8+ T cells, Tregs, etc.
  • Level 4: Naive CD4+, Memory CD4+, Th1, Th2, Th17, etc.
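The four levels above can be pictured as a nested mapping that is walked from the broadest label down to the most specific one. The sketch below is purely illustrative and is not part of mLLMCelltype: `HIERARCHY` and `paths` are hypothetical names, and the tree contains only the example labels from the list above.

```python
# Illustrative only: the example hierarchy from the list above as a nested dict.
# HIERARCHY and paths() are hypothetical names, not part of any package.
HIERARCHY = {
    "Immune cells": {
        "T cells": {
            "CD4+ T cells": {
                "Naive CD4+": {},
                "Memory CD4+": {},
                "Th1": {},
                "Th2": {},
                "Th17": {},
            },
            "CD8+ T cells": {},
            "Tregs": {},
        },
        "B cells": {},
        "Myeloid cells": {},
    },
    "Non-immune cells": {},
}


def paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of labels."""
    for label, children in tree.items():
        current = prefix + (label,)
        if children:
            # Descend into the next, more specific level.
            yield from paths(children, current)
        else:
            # No children: this label is the most specific one on this branch.
            yield current


if __name__ == "__main__":
    for path in paths(HIERARCHY):
        print(" > ".join(path))
```

Each printed path reads from the broadest label to the most specific one, which is a convenient way to check that a set of annotations is consistent across levels.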

3. How It Works

  • Input: Simply provide marker genes for each cluster/subtype in CSV/TSV/Excel format
  • Processing: Multiple state-of-the-art LLMs (GPT-4.1, Claude 4, Gemini 2.5, etc.) analyze your markers
  • Output: Consensus cell type annotations with confidence scores
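As a rough sketch of the input step, a marker table can be kept as one row per cluster and gene and written out as CSV. The column names `cluster` and `gene` and the helper functions below are assumptions made for illustration; please check the documentation for the layout the tool actually expects.

```python
import csv
import io

# Hypothetical long-format marker table: one row per (cluster, gene) pair.
# The column names "cluster" and "gene" are assumptions for illustration.
MARKERS = {
    "Cluster_1": ["CD3D", "CD4", "IL7R", "CCR7"],
    "Cluster_5": ["CD3D", "CD4", "FOXP3", "IL2RA"],
}


def markers_to_csv(markers):
    """Serialize a {cluster: [genes]} mapping to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["cluster", "gene"])
    for cluster, genes in markers.items():
        for gene in genes:
            writer.writerow([cluster, gene])
    return buffer.getvalue()


def csv_to_markers(text):
    """Parse CSV text back into a {cluster: [genes]} mapping."""
    markers = {}
    for row in csv.DictReader(io.StringIO(text)):
        markers.setdefault(row["cluster"], []).append(row["gene"])
    return markers


if __name__ == "__main__":
    print(markers_to_csv(MARKERS))
```

Nothing in this layout depends on how many clusters there are, so adding more subtypes only adds more rows.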

4. Key Advantages for Complex Cell Types

  • Context-aware: You can specify tissue type to improve subtype resolution
  • Multi-model consensus: Reduces errors common when distinguishing similar subtypes
  • Confidence metrics: Provides Consensus Proportion and Shannon Entropy to flag uncertain annotations
  • No predefined limits: Works with any number of clusters/subtypes
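To give a feel for the two confidence metrics, here is a minimal sketch that computes them from a list of per-model labels for one cluster. It assumes the textbook definitions (the share of models voting for the most common label, and the base-2 Shannon entropy of the label distribution); the tool's exact formulas may differ, and `consensus_metrics` is a hypothetical helper name.

```python
import math
from collections import Counter


def consensus_metrics(votes):
    """Return (top_label, consensus_proportion, shannon_entropy) for one cluster.

    votes: list of labels, one per model. Textbook definitions are assumed;
    the tool's own implementation may differ.
    """
    if not votes:
        raise ValueError("votes must contain at least one label")
    counts = Counter(votes)
    total = len(votes)
    top_label, top_count = counts.most_common(1)[0]
    proportion = top_count / total
    # Entropy in bits: 0 when all models agree, larger when votes are split.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return top_label, proportion, entropy


if __name__ == "__main__":
    print(consensus_metrics(["Tregs", "Tregs", "Tregs"]))
    print(consensus_metrics(["Naive CD4+", "Naive CD4+", "Memory CD4+"]))
```

A low consensus proportion together with a high entropy is the pattern that flags a cluster for manual review.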

Try It Now

Web Interface (No Installation)

Visit https://www.mllmcelltype.com to try it immediately with your data.

Python Package

pip install mllmcelltype

R Package (Great for Seurat users)

devtools::install_github("cafferychen777/mLLMCelltype")
# CRAN submission in progress

Example Use Case for T Cell Subtypes

# Your marker genes dataframe might look like:
# Cluster_1: CD3D, CD4, IL7R, CCR7  # Naive CD4+
# Cluster_2: CD3D, CD4, IL7R, S100A4  # Memory CD4+ 
# Cluster_3: CD3D, CD8A, CCR7, LEF1  # Naive CD8+
# Cluster_4: CD3D, CD8A, GZMK, DUSP2  # Memory CD8+
# Cluster_5: CD3D, CD4, FOXP3, IL2RA  # Tregs
# ... and many more subtypes

import mllmcelltype

# marker_genes_df: your marker table, one entry per cluster as outlined above
results = mllmcelltype.annotate(
    marker_genes_df,
    species="human",
    tissue="PBMC",  # Helps with subtype resolution
    models=["gpt-4.1", "claude-4", "gemini-2.5"]
)

Why mLLMCelltype Excels at Complex Annotations

  1. LLMs understand biological context: They've been trained on vast amounts of biological literature and understand subtle differences between cell subtypes
  2. Multi-model validation: Different models cross-check each other, reducing misclassification of similar subtypes
  3. Continuous updates: As newly discovered cell types are published, updated versions of the underlying LLMs can incorporate this knowledge

Paper and Resources

  • Paper: bioRxiv
  • GitHub: https://github.com/cafferychen777/mLLMCelltype
  • Documentation: Available on GitHub

Feel free to try it with your T cell data! The tool handles complex immune cell hierarchies particularly well since there's extensive literature on immune cell subtypes that the models can leverage.

Would love to hear about your experience if you give it a try!

Best regards, Chen

cafferychen777, Jun 27 '25 20:06