Cell type annotation for more than 3 subtypes

Open gkc251 opened this issue 2 years ago • 1 comment

Dear developers,

When making a custom cell marker dataframe for cells with more than 3 subtypes (e.g. different subtypes of T cells), what is the best approach? Do you have plans to extend the software to support N subtypes, or hierarchical cell types (e.g. immune cells)? Thank you in advance.

Aug 29 '23 08:08 gkc251

Hi @gkc251,

Thank you for your question! I'd like to introduce mLLMCelltype, a new tool that addresses exactly the challenges you mentioned with annotating multiple cell subtypes and hierarchical cell types.

mLLMCelltype Features Relevant to Your Needs:

1. Unlimited Subtype Support

mLLMCelltype can handle any number of cell subtypes without limitations. Whether you have 3, 10, or even 50+ T cell subtypes, our multi-model consensus framework can distinguish between them based on their marker gene profiles.

2. Hierarchical Annotation Support

The tool naturally handles hierarchical cell type relationships. For example:

Level 1: Immune cells vs Non-immune cells
Level 2: T cells, B cells, Myeloid cells, etc.
Level 3: CD4+ T cells, CD8+ T cells, Tregs, etc.
Level 4: Naive CD4+, Memory CD4+, Th1, Th2, Th17, etc.
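
To make the levels above concrete, here is a minimal sketch of how such a hierarchy could be represented and queried on the user side. The `HIERARCHY` mapping and the `lineage` helper are hypothetical names for illustration only; they are not part of mLLMCelltype's API, and the tree simply mirrors the example levels listed above.

```python
# Hypothetical nested mapping of the example levels; not mLLMCelltype's API.
HIERARCHY = {
    "Immune cells": {
        "T cells": {
            "CD4+ T cells": {
                "Naive CD4+": {},
                "Memory CD4+": {},
                "Th1": {},
                "Th2": {},
                "Th17": {},
            },
            "CD8+ T cells": {},
            "Tregs": {},
        },
        "B cells": {},
        "Myeloid cells": {},
    },
    "Non-immune cells": {},
}


def lineage(label, tree=HIERARCHY, path=()):
    """Return the path from the top level down to `label`, or None if absent."""
    for name, children in tree.items():
        current = path + (name,)
        if name == label:
            return list(current)
        found = lineage(label, children, current)
        if found is not None:
            return found
    return None


print(" > ".join(lineage("Th17")))
# Immune cells > T cells > CD4+ T cells > Th17
```

A lookup like this lets you roll a fine-grained label back up to any coarser level when comparing annotations across resolutions.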

3. How It Works

Input: Simply provide marker genes for each cluster/subtype in CSV/TSV/Excel format
Processing: Multiple state-of-the-art LLMs (GPT-4.1, Claude 4, Gemini 2.5, etc.) analyze your markers
Output: Consensus cell type annotations with confidence scores
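
As a rough illustration of the input step, the sketch below writes a cluster-to-marker table as CSV using only the Python standard library. The long-format layout and the column names `cluster` and `gene` are assumptions made for this example; please check the documentation for the exact layout the tool expects. The markers are the T cell examples used further down in this comment.

```python
import csv
import io

# Example markers per cluster (same T cell examples as later in this comment).
markers = {
    "Cluster_1": ["CD3D", "CD4", "IL7R", "CCR7"],
    "Cluster_2": ["CD3D", "CD4", "IL7R", "S100A4"],
    "Cluster_3": ["CD3D", "CD8A", "CCR7", "LEF1"],
    "Cluster_4": ["CD3D", "CD8A", "GZMK", "DUSP2"],
    "Cluster_5": ["CD3D", "CD4", "FOXP3", "IL2RA"],
}


def markers_to_csv(markers):
    """Serialize {cluster: [genes]} as long-format CSV text.

    The 'cluster' and 'gene' column names are assumed for illustration.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["cluster", "gene"])
    for cluster, genes in markers.items():
        for gene in genes:
            writer.writerow([cluster, gene])
    return buffer.getvalue()


csv_text = markers_to_csv(markers)
print(csv_text)
```

Nothing here limits the number of clusters: adding more subtypes just adds more keys to the dictionary.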

4. Key Advantages for Complex Cell Types

Context-aware: You can specify tissue type to improve subtype resolution
Multi-model consensus: Reduces errors common when distinguishing similar subtypes
Confidence metrics: Provides Consensus Proportion and Shannon Entropy to flag uncertain annotations
No predefined limits: Works with any number of clusters/subtypes
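
For intuition about the two confidence metrics, here is a minimal sketch that computes them from one cluster's per-model labels. It uses the textbook definitions (share of votes for the most common label, and Shannon entropy in bits over the label distribution); the exact formulas inside mLLMCelltype may differ, and `consensus_metrics` is a hypothetical helper name.

```python
import math
from collections import Counter


def consensus_metrics(votes):
    """Return (top_label, consensus_proportion, shannon_entropy_bits).

    Textbook definitions, assumed here for illustration only.
    """
    if not votes:
        raise ValueError("votes must contain at least one label")
    counts = Counter(votes)
    total = len(votes)
    top_label, top_count = counts.most_common(1)[0]
    proportion = top_count / total
    # Entropy is 0 when all models agree and grows as votes spread out.
    entropy = sum(
        -(c / total) * math.log2(c / total) for c in counts.values()
    )
    return top_label, proportion, entropy


# Three hypothetical model votes for a single cluster.
label, proportion, entropy = consensus_metrics(
    ["Naive CD4+ T", "Naive CD4+ T", "Memory CD4+ T"]
)
print(label, round(proportion, 3), round(entropy, 3))
```

A low proportion together with a high entropy marks a cluster where the models disagree, which is typical for closely related subtypes and a good cue for manual review.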

Try It Now

Web Interface (No Installation)

Visit https://www.mllmcelltype.com to try it immediately with your data.

Python Package

pip install mllmcelltype

R Package (Great for Seurat users)

devtools::install_github("cafferychen777/mLLMCelltype")
# CRAN submission in progress

Example Use Case for T Cell Subtypes

# Your marker genes dataframe might look like:
# Cluster_1: CD3D, CD4, IL7R, CCR7  # Naive CD4+
# Cluster_2: CD3D, CD4, IL7R, S100A4  # Memory CD4+
# Cluster_3: CD3D, CD8A, CCR7, LEF1  # Naive CD8+
# Cluster_4: CD3D, CD8A, GZMK, DUSP2  # Memory CD8+
# Cluster_5: CD3D, CD4, FOXP3, IL2RA  # Tregs
# ... and many more subtypes

import mllmcelltype

# marker_genes_df holds the cluster-to-marker table sketched above
results = mllmcelltype.annotate(
    marker_genes_df,
    species="human",
    tissue="PBMC",  # Helps with subtype resolution
    models=["gpt-4.1", "claude-4", "gemini-2.5"]
)

Why mLLMCelltype Excels at Complex Annotations

LLMs understand biological context: They've been trained on vast amounts of biological literature and understand subtle differences between cell subtypes
Multi-model validation: Different models cross-check each other, reducing misclassification of similar subtypes
Continuous updates: As new cell types are discovered and published, newer LLM versions trained on that literature can incorporate this knowledge

Paper and Resources

Paper: bioRxiv
GitHub: https://github.com/cafferychen777/mLLMCelltype
Documentation: Available on GitHub

Feel free to try it with your T cell data! The tool handles complex immune cell hierarchies particularly well since there's extensive literature on immune cell subtypes that the models can leverage.

Would love to hear about your experience if you give it a try!

Best regards, Chen

Jun 27 '25 20:06 cafferychen777