phylogenetic analysis with missing d_gene
Dear immunarch team,
thank you for the user-friendly and highly useful package.
After loading 10x VDJ data filtered_contig_annotations.csv via repLoad, I run into problems when running repGermline which is required for the phylogenetic analysis I would like to perform.
When running the following commands just copied from the tutorial:
align_dt <- bcr_data %>%
seqCluster(seqDist(bcr_data, .col = 'CDR3.nt', .group_by_seqLength = TRUE),
.perc_similarity = 0.6) %>%
repGermline(.threads = 1) %>%
repAlignLineage(.min_lineage_sequences = 6, .align_threads = 2, .nofail = TRUE)
fails with (repGermline being the problematic command)
Error in validate_genes_edges(., sample_name) : No data in mandatory V.end column in sample filtered_contig_annotations!
I think this is because the d_gene column is missing in my sample. However, I have several samples, and the d_gene column is missing in all samples and this seems to be quite common in 10x data 10x FAQ. Is there a workaround or does this mean phylogenetic analysis is impossible in all those samples?
Thank you, Mischko
Hi, Mischko!
I'm Aleksandr Popov, a developer of Immunarch package. Thank you for using our software!
For now, V.end can be manually calculated as sum of lengths of FWR1, CDR1, FWR2, CDR2 and FWR3. On the next version of Immunarch I plan to update repGermline function to remove dependencies from columns that can be calculated from other data. After this, V.end will not be needed.
If you have more questions, feel free to ask them!
Best regards, Aleksandr
Hi Aleksandr,
thanks for the swift response.
Sorry, could you provide the code how to calculate the V.end? E.g. there is FR1.nt and FR1.aa, and the length (is this nchar ?) seems always to be the same?
Additionally, when providing some random numbers for V.end, there is a problem with J.start missing. Is it also possible to calculate J.start then?
Here is an overview of my data, when reading the filtered_annotations.csv from 10x with repLoad.
$ airr_rearrangement: tibble [3,727 × 15] (S3: tbl_df/tbl/data.frame)
..$ Clones : int [1:3727] 3634 3230 2986 2982 2938 2910 2824 2798 2596 2576 ...
..$ Proportion: num [1:3727] 0.00205 0.00182 0.00168 0.00168 0.00166 ...
..$ CDR3.nt : chr [1:3727] "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" ...
..$ CDR3.aa : chr [1:3727] "CQSSDNSRILGIF" "CQSSDNSRILGIF" "CQSSDNSRILGIF" "CQSSDNSRILGIF" ...
..$ V.name : chr [1:3727] "IGLV3-25" "IGLV3-25" "IGLV3-25" "IGLV3-25" ...
..$ D.name : chr [1:3727] NA NA NA NA ...
..$ J.name : chr [1:3727] "IGLJ2" "IGLJ2" "IGLJ2" "IGLJ2" ...
..$ V.end : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ D.start : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ D.end : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ J.start : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ VJ.ins : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ VD.ins : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ DJ.ins : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
..$ Sequence : chr [1:3727] "GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTTGACACAGC"| __truncated__ "GCTGTGCTGTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTT"| __truncated__ "GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTTGACACAGC"| __truncated__ "GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTTGACACAGC"| __truncated__ ...
Thank you, Mischko
You can use values from FR1.nt, CDR1.nt, ... columns and stringr::str_length() function to calculate lengths of these strings. J.start is str_length(seq) - str_length(fr4_nt) + 1, where seq is value from Sequence column, and fr4_nt is from FR4.nt column.
Best regards, Aleksandr
Thanks again.
Converting this into code with the example data results in:
bcr_data$full_clones |>
mutate(V.end2 = str_length(FR1.nt) + str_length(CDR1.nt) + str_length(FR2.nt) + str_length(CDR2.nt) + str_length(FR3.nt)) |>
mutate(J.start2 = str_length(Sequence) - str_length(FR4.nt) + 1) |>
select(V.end, V.end2, J.start, J.start2)
# A tibble: 1,000 × 4
V.end V.end2 J.start J.start2
<int> <int> <int> <dbl>
1 180 288 204 216
2 125 282 152 169
3 174 288 198 216
4 150 285 190 216
5 166 294 194 215
6 182 288 199 216
7 169 285 199 216
8 161 282 191 216
9 182 285 192 216
10 174 285 197 216
# … with 990 more rows
# ℹ Use `print(n = ...)` to see more rows
Not sure why there are different values in V.end and V.end2, and J.start and J.start2.
Also in my own data, which are filtered_contig_annotations.csv from 10x, there is no FR1, FR2, CDR2 ... column.
List of 1
$ filtered_contig_annotations:'data.frame': 20 obs. of 20 variables:
..$ Clones : int [1:20] 2368 588 179 7 5 3 2 2 2 1 ...
..$ Proportion : num [1:20] 0.74771 0.18566 0.05652 0.00221 0.00158 ...
..$ CDR3.nt : chr [1:20] "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTACAATTCTAGGAATATTC" ...
..$ CDR3.aa : chr [1:20] "CQSSDNSRILGIF" "CARGSRLKGAIFFDFW;CQSSDNSRILGIF" "CARGSRLKGAIFFDFW" "CARGSRLKGAIFFDFW;CQSSDNSTILGIF" ...
..$ V.name : chr [1:20] "IGLV3-25" "IGHV4-34;IGLV3-25" "IGHV4-34" "IGHV4-34;IGLV3-25" ...
..$ D.name : chr [1:20] "NA" "NA;NA" "NA" "NA;NA" ...
..$ J.name : chr [1:20] "IGLJ2" "IGHJ4;IGLJ2" "IGHJ4" "IGHJ4;IGLJ2" ...
..$ V.end : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ D.start : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ D.end : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ J.start : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ VJ.ins : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ VD.ins : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ DJ.ins : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
..$ Sequence : chr [1:20] "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTACAATTCTAGGAATATTC" ...
..$ chain : chr [1:20] "IGL" "IGH;IGL" "IGH" "IGH;IGL" ...
..$ Barcode : chr [1:20] "AAACCTGCAGTACACT-1;AAACCTGGTCCCTTGT-1;AAACCTGTCCTGTAGA-1;AAACGGGAGAGTTGGC-1;AAACGGGAGCACCGCT-1;AAACGGGAGGCCCTTG"| __truncated__ "AAACGGGAGAGCTGGT-1;AAACGGGTCCGGGTGT-1;AAACGGGTCGGCTTGG-1;AAAGATGAGAATCTCC-1;AAAGATGGTTCGGCAC-1;AAAGCAAAGGTCATCT"| __truncated__ "AACTCAGTCCGCATCT-1;AACTCCCAGTCCGTAT-1;AACTGGTCATAGAAAC-1;AACTTTCCACCCAGTG-1;ACACCGGAGAGCTGCA-1;ACACCGGGTCCTAGCG"| __truncated__ "ACGGGCTTCCAAACAC-1;ACTGCTCGTTAAGACA-1;CTAACTTGTCTTTCAT-1;CTCATTATCGTACCGG-1;CTGTGCTTCCAGTATG-1;GCATGTAAGAGGTACC"| __truncated__ ...
..$ raw_clonotype_id: chr [1:20] "1;147;41;52;53;50;163;42;99;19;82;18;130;5;37;4;67;134;65;2;139;136;84;118;89;162;83;80;51;81;161;135;35;140;43"| __truncated__ "1" "1;174;193;187;166;205;176;197;170;167;190;210;215;175;194;214;191;188;213;172;168;182;183;208;209;211;206;200;2"| __truncated__ "1" ...
..$ ContigID : chr [1:20] "AAACCTGCAGTACACT-11;AAACCTGGTCCCTTGT-11;AAACCTGTCCTGTAGA-11;AAACGGGAGAGTTGGC-11;AAACGGGAGCACCGCT-11;AAACGGGAGGC"| __truncated__ "AAACGGGAGAGCTGGT-11;AAACGGGAGAGCTGGT-12;AAACGGGTCCGGGTGT-11;AAACGGGTCCGGGTGT-12;AAACGGGTCGGCTTGG-12;AAACGGGTCGG"| __truncated__ "AACTCAGTCCGCATCT-11;AACTCCCAGTCCGTAT-11;AACTGGTCATAGAAAC-11;AACTTTCCACCCAGTG-11;ACACCGGAGAGCTGCA-11;ACACCGGGTCC"| __truncated__ "ACGGGCTTCCAAACAC-11;ACGGGCTTCCAAACAC-12;ACTGCTCGTTAAGACA-11;ACTGCTCGTTAAGACA-12;CTAACTTGTCTTTCAT-11;CTAACTTGTCT"| __truncated__ ...
..$ C.name : chr [1:20] "IGLC2" "IGHM;IGLC2" "IGHM" "IGHM;IGLC2" ...
I gave you the output of the airr_rearrangement.tsv in the previous post, sorry. But no FR1 etc. columns in either of those files.
Thank you, Mischko
Now I'm working on improving repGermline(); in the next release (Immunarch 0.8.0) it will not need V.end and J.start columns, only segment columns (FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4) will be mandatory. Most pre-processing utilities now have options to save these segment columns. For now, only MiXCR input format is supported in Immunarch BCR pipeline; in the future versions we plan to expand the functionality and add support for 10x format and single cell data.
Best regards, Aleksandr