How to choose reference data? Should test data and reference data come from the same group, e.g., using the control group to infer control group cell types and the disease group to infer disease group labels?

Open lkqnaruto opened this issue 4 years ago • 5 comments

Hi

I'm working on cell type annotation. My test data has one control group with 4 samples, one vehicle treatment group with 8 samples, and one drug treatment group with 8 samples. My reference dataset also contains three groups: one control group, one vehicle treatment group, and one drug treatment group (a different drug from the one in the test data). My goal is to assign cell type names within EACH group.

My question is: when I use SingleR to annotate cell types for each group, should I subset the reference dataset to its control group when annotating the control group of the test dataset, and do the same for the other two groups?

If so, what should I do when annotating a test dataset treated with one drug using reference data from a different drug treatment?

I really appreciate your kind help. Many thanks in advance!

lkqnaruto avatar May 20 '21 00:05 lkqnaruto

It really depends on what your labels are.

If your labels are coarse entities like "T cell" and "B cell", I would tell you not to worry about it and just mix all the references together. A drug isn't going to cause cells to change their broad identities. Probably.

If your labels are very fine-grained and specific, e.g., "cell type that activates upon drug X", then some more thinking is required. Your suggestion sounds reasonable enough, and as long as the drug treatments are roughly similar, I wouldn't worry about the difference. (If they're not and the labels are highly drug-specific, I would just omit the drug reference.)

Personally, I would pick one of the strategies here; you can treat each group as a separate reference and then do a multi-reference analysis. Under this suggestion, a cell in the control test dataset might be assigned a label from a treatment reference group - however, this is a good quality control measure, because if this happens, it means that the differences between treatment/control weren't very strong and you shouldn't be putting too much weight on them.
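
The quality control check described above can be sketched as a simple cross-tabulation. This is a toy Python illustration rather than SingleR code: the group names and the `assigned_refs` values are made up, standing in for whichever per-cell "winning reference" a multi-reference analysis reports, and the function name is hypothetical.

```python
from collections import Counter


def cross_group_fraction(test_groups, assigned_refs):
    """For each test group, return the fraction of cells whose winning
    reference came from a different group."""
    totals = Counter(test_groups)
    mismatches = Counter(
        group for group, ref in zip(test_groups, assigned_refs) if group != ref
    )
    return {group: mismatches[group] / totals[group] for group in totals}


# Hypothetical per-cell data: the group each test cell belongs to, and the
# reference group that supplied its final label.
test_groups = ["control"] * 4 + ["drug"] * 4
assigned_refs = [
    "control", "control", "drug", "vehicle",  # control cells
    "drug", "drug", "drug", "control",        # drug-treated cells
]

fractions = cross_group_fraction(test_groups, assigned_refs)
print(fractions)
```

A high fraction for a group suggests that the treatment/control differences are weak for those labels, so the group-specific references are adding little.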

LTLA avatar May 21 '21 04:05 LTLA

@LTLA Thank you for the reply. When you say "A drug isn't going to cause cells to change their broad identities", do you mean statistically or biologically?

And yes, the labels in my reference dataset are not drug-specific; they are all very straightforward, like "T cell" and "B cell".

lkqnaruto avatar May 21 '21 04:05 lkqnaruto

Biologically. To be more specific, I would expect that most perturbations would not change how different broad cell types are defined. (Individual cells can of course shift between cell types, e.g., if the drug triggers differentiation.) In other words, a cell type's defining characteristics should generally be the same across different treatments; if this is the case, SingleR will mostly not care about the treatments, as long as the identities of the markers and their expression profiles are consistent.

On the other hand, if you have a treatment that, say, causes a cell type to stop expressing its defining markers, you need to be more careful. For example, if your drug induced a deletion of CD3 in T cells, the presence/absence of CD3 is no longer a good defining characteristic of T cells in your treated samples. In this case, matching to an appropriate treated reference makes sense.
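
To illustrate why consistent marker profiles matter more than the treatment itself: SingleR scores cells against references using rank-based (Spearman) correlation on marker genes. The sketch below is a heavily simplified toy in plain Python, not the package's implementation. The gene list, expression values, and function names are invented for illustration, and the ranking helper assumes no tied values.

```python
def ranks(values):
    """Rank values from 0 (smallest) upward; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def best_label(cell, references):
    """Return the reference label whose profile correlates best with the cell."""
    return max(references, key=lambda label: spearman(cell, references[label]))


# Made-up expression over four marker genes: CD3E, CD2, CD79A, MS4A1.
references = {
    "T cell": [5.0, 4.0, 0.2, 0.1],
    "B cell": [0.3, 0.5, 5.5, 4.5],
}
untreated = [4.2, 3.1, 0.4, 0.0]
# A hypothetical treatment that rescales and shifts expression but keeps
# the ordering of the markers intact.
treated = [0.5 * value + 1.0 for value in untreated]

print(best_label(untreated, references), best_label(treated, references))
```

Because only the ordering of the markers enters the score, the treated cell receives the same label here. A treatment that removes a defining marker would change the ordering, which is the situation where a matched reference becomes important.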

LTLA avatar May 21 '21 07:05 LTLA

@LTLA Thank you. If the drug or disease does trigger differentiation, does the multi-reference strategy still work? And the truth is that I have no idea whether there will be any deletion across the different groups. In that case, would the same diagnostic strategy as in chapter 4 of the SingleR book still apply?

lkqnaruto avatar May 21 '21 07:05 lkqnaruto

> If the drug or disease does trigger differentiation, does the multi-reference strategy still work?

Differentiation is not the concern here. You need to ask yourself how the cell types were defined. It doesn't matter if cells move between cell types, as long as you can define the cell types in the same way across different treatments.

> And the truth is that I have no idea whether there will be any deletion across the different groups. In that case, would the same diagnostic strategy as in chapter 4 of the SingleR book still apply?

Possibly. Nothing beats some biological context, though. If you see assignments to "T cells" and they don't express CD3, alarm bells should be going off in your head.
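
A sanity check along these lines could look like the toy Python sketch below. It is only an illustration under stated assumptions: the per-cell dictionaries, the threshold, and the function name are hypothetical, and CD3D/CD3E are used as example genes encoding CD3 subunits.

```python
def flag_marker_negative(labels, expression, label, markers, threshold=0.0):
    """Return indices of cells assigned `label` in which none of `markers`
    is expressed above `threshold`. Missing genes count as zero."""
    flagged = []
    for index, (assigned, profile) in enumerate(zip(labels, expression)):
        if assigned != label:
            continue
        if all(profile.get(marker, 0.0) <= threshold for marker in markers):
            flagged.append(index)
    return flagged


# Hypothetical assigned labels and per-cell expression values.
labels = ["T cell", "B cell", "T cell", "T cell"]
expression = [
    {"CD3E": 2.1, "CD79A": 0.0},
    {"CD3E": 0.0, "CD79A": 3.4},
    {"CD3E": 0.0, "CD3D": 0.0, "CD79A": 1.2},
    {"CD3D": 1.5},
]

suspicious = flag_marker_negative(labels, expression, "T cell", ["CD3D", "CD3E"])
print(suspicious)
```

Single cells can lack detectable counts for a marker purely through dropout, so a check like this is probably more informative when summarized per cluster or per label than when read cell by cell.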

LTLA avatar May 21 '21 07:05 LTLA