immunarch phylogenetic analysis with missing d

Dear immunarch team,

thank you for the user-friendly and highly useful package. After loading 10x VDJ data filtered_contig_annotations.csv via repLoad, I run into problems when running repGermline which is required for the phylogenetic analysis I would like to perform. When running the following commands just copied from the tutorial:

align_dt <- bcr_data %>%
  seqCluster(seqDist(bcr_data, .col = 'CDR3.nt', .group_by_seqLength = TRUE), 
             .perc_similarity = 0.6) %>%
    repGermline(.threads = 1) %>%
    repAlignLineage(.min_lineage_sequences = 6, .align_threads = 2, .nofail = TRUE)

fails with (repGermline being the problematic command)

Error in validate_genes_edges(., sample_name) : No data in mandatory V.end column in sample filtered_contig_annotations!

I think this is because the d_gene column is missing in my sample. However, I have several samples, and the d_gene column is missing in all samples and this seems to be quite common in 10x data 10x FAQ. Is there a workaround or does this mean phylogenetic analysis is impossible in all those samples?

Thank you, Mischko

Aug 19 '22 16:08 mihem

Hi, Mischko!

I'm Aleksandr Popov, a developer of Immunarch package. Thank you for using our software!

For now, V.end can be manually calculated as sum of lengths of FWR1, CDR1, FWR2, CDR2 and FWR3. On the next version of Immunarch I plan to update repGermline function to remove dependencies from columns that can be calculated from other data. After this, V.end will not be needed.

If you have more questions, feel free to ask them!

Best regards, Aleksandr

Aug 24 '22 15:08 Alexander230

Hi Aleksandr,

thanks for the swift response.

Sorry, could you provide the code how to calculate the V.end? E.g. there is FR1.nt and FR1.aa, and the length (is this nchar ?) seems always to be the same?

Additionally, when providing some random numbers for V.end, there is a problem with J.start missing. Is it also possible to calculate J.start then?

Here is an overview of my data, when reading the filtered_annotations.csv from 10x with repLoad.

 $ airr_rearrangement: tibble [3,727 × 15] (S3: tbl_df/tbl/data.frame)
  ..$ Clones    : int [1:3727] 3634 3230 2986 2982 2938 2910 2824 2798 2596 2576 ...
  ..$ Proportion: num [1:3727] 0.00205 0.00182 0.00168 0.00168 0.00166 ...
  ..$ CDR3.nt   : chr [1:3727] "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" ...
  ..$ CDR3.aa   : chr [1:3727] "CQSSDNSRILGIF" "CQSSDNSRILGIF" "CQSSDNSRILGIF" "CQSSDNSRILGIF" ...
  ..$ V.name    : chr [1:3727] "IGLV3-25" "IGLV3-25" "IGLV3-25" "IGLV3-25" ...
  ..$ D.name    : chr [1:3727] NA NA NA NA ...
  ..$ J.name    : chr [1:3727] "IGLJ2" "IGLJ2" "IGLJ2" "IGLJ2" ...
  ..$ V.end     : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ D.start   : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ D.end     : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ J.start   : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ VJ.ins    : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ VD.ins    : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ DJ.ins    : int [1:3727] NA NA NA NA NA NA NA NA NA NA ...
  ..$ Sequence  : chr [1:3727] "GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTTGACACAGC"| __truncated__ "GCTGTGCTGTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTT"| __truncated__ "GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTTGACACAGC"| __truncated__ "GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGATCCCTCTACTTCTCCCCCTCCTCACTCTCTGCACAGGCTCTGACGCCTGCAATGAGTTGACACAGC"| __truncated__ ...

Thank you, Mischko

Aug 24 '22 15:08 mihem

You can use values from FR1.nt, CDR1.nt, ... columns and stringr::str_length() function to calculate lengths of these strings. J.start is str_length(seq) - str_length(fr4_nt) + 1, where seq is value from Sequence column, and fr4_nt is from FR4.nt column.

Best regards, Aleksandr

Aug 24 '22 16:08 Alexander230

Thanks again.

Converting this into code with the example data results in:

bcr_data$full_clones |>
    mutate(V.end2 = str_length(FR1.nt) + str_length(CDR1.nt) + str_length(FR2.nt) +  str_length(CDR2.nt) + str_length(FR3.nt)) |>
    mutate(J.start2 = str_length(Sequence) - str_length(FR4.nt) + 1) |>
    select(V.end, V.end2, J.start, J.start2)

# A tibble: 1,000 × 4
   V.end V.end2 J.start J.start2
   <int>  <int>   <int>    <dbl>
 1   180    288     204      216
 2   125    282     152      169
 3   174    288     198      216
 4   150    285     190      216
 5   166    294     194      215
 6   182    288     199      216
 7   169    285     199      216
 8   161    282     191      216
 9   182    285     192      216
10   174    285     197      216
# … with 990 more rows
# ℹ Use `print(n = ...)` to see more rows

Not sure why there are different values in V.end and V.end2, and J.start and J.start2.

Also in my own data, which are filtered_contig_annotations.csv from 10x, there is no FR1, FR2, CDR2 ... column.


List of 1
 $ filtered_contig_annotations:'data.frame':	20 obs. of  20 variables:
  ..$ Clones          : int [1:20] 2368 588 179 7 5 3 2 2 2 1 ...
  ..$ Proportion      : num [1:20] 0.74771 0.18566 0.05652 0.00221 0.00158 ...
  ..$ CDR3.nt         : chr [1:20] "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTACAATTCTAGGAATATTC" ...
  ..$ CDR3.aa         : chr [1:20] "CQSSDNSRILGIF" "CARGSRLKGAIFFDFW;CQSSDNSRILGIF" "CARGSRLKGAIFFDFW" "CARGSRLKGAIFFDFW;CQSSDNSTILGIF" ...
  ..$ V.name          : chr [1:20] "IGLV3-25" "IGHV4-34;IGLV3-25" "IGHV4-34" "IGHV4-34;IGLV3-25" ...
  ..$ D.name          : chr [1:20] "NA" "NA;NA" "NA" "NA;NA" ...
  ..$ J.name          : chr [1:20] "IGLJ2" "IGHJ4;IGLJ2" "IGHJ4" "IGHJ4;IGLJ2" ...
  ..$ V.end           : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ D.start         : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ D.end           : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ J.start         : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ VJ.ins          : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ VD.ins          : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ DJ.ins          : int [1:20] NA NA NA NA NA NA NA NA NA NA ...
  ..$ Sequence        : chr [1:20] "TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTAGAATTCTAGGAATATTC" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG" "TGTGCGAGAGGGTCCAGATTAAAGGGAGCCATATTCTTTGACTTCTGG;TGTCAGTCATCAGACAACAGTACAATTCTAGGAATATTC" ...
  ..$ chain           : chr [1:20] "IGL" "IGH;IGL" "IGH" "IGH;IGL" ...
  ..$ Barcode         : chr [1:20] "AAACCTGCAGTACACT-1;AAACCTGGTCCCTTGT-1;AAACCTGTCCTGTAGA-1;AAACGGGAGAGTTGGC-1;AAACGGGAGCACCGCT-1;AAACGGGAGGCCCTTG"| __truncated__ "AAACGGGAGAGCTGGT-1;AAACGGGTCCGGGTGT-1;AAACGGGTCGGCTTGG-1;AAAGATGAGAATCTCC-1;AAAGATGGTTCGGCAC-1;AAAGCAAAGGTCATCT"| __truncated__ "AACTCAGTCCGCATCT-1;AACTCCCAGTCCGTAT-1;AACTGGTCATAGAAAC-1;AACTTTCCACCCAGTG-1;ACACCGGAGAGCTGCA-1;ACACCGGGTCCTAGCG"| __truncated__ "ACGGGCTTCCAAACAC-1;ACTGCTCGTTAAGACA-1;CTAACTTGTCTTTCAT-1;CTCATTATCGTACCGG-1;CTGTGCTTCCAGTATG-1;GCATGTAAGAGGTACC"| __truncated__ ...
  ..$ raw_clonotype_id: chr [1:20] "1;147;41;52;53;50;163;42;99;19;82;18;130;5;37;4;67;134;65;2;139;136;84;118;89;162;83;80;51;81;161;135;35;140;43"| __truncated__ "1" "1;174;193;187;166;205;176;197;170;167;190;210;215;175;194;214;191;188;213;172;168;182;183;208;209;211;206;200;2"| __truncated__ "1" ...
  ..$ ContigID        : chr [1:20] "AAACCTGCAGTACACT-11;AAACCTGGTCCCTTGT-11;AAACCTGTCCTGTAGA-11;AAACGGGAGAGTTGGC-11;AAACGGGAGCACCGCT-11;AAACGGGAGGC"| __truncated__ "AAACGGGAGAGCTGGT-11;AAACGGGAGAGCTGGT-12;AAACGGGTCCGGGTGT-11;AAACGGGTCCGGGTGT-12;AAACGGGTCGGCTTGG-12;AAACGGGTCGG"| __truncated__ "AACTCAGTCCGCATCT-11;AACTCCCAGTCCGTAT-11;AACTGGTCATAGAAAC-11;AACTTTCCACCCAGTG-11;ACACCGGAGAGCTGCA-11;ACACCGGGTCC"| __truncated__ "ACGGGCTTCCAAACAC-11;ACGGGCTTCCAAACAC-12;ACTGCTCGTTAAGACA-11;ACTGCTCGTTAAGACA-12;CTAACTTGTCTTTCAT-11;CTAACTTGTCT"| __truncated__ ...
  ..$ C.name          : chr [1:20] "IGLC2" "IGHM;IGLC2" "IGHM" "IGHM;IGLC2" ...

I gave you the output of the airr_rearrangement.tsv in the previous post, sorry. But no FR1 etc. columns in either of those files.

Thank you, Mischko

Aug 24 '22 16:08 mihem

Now I'm working on improving repGermline(); in the next release (Immunarch 0.8.0) it will not need V.end and J.start columns, only segment columns (FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4) will be mandatory. Most pre-processing utilities now have options to save these segment columns. For now, only MiXCR input format is supported in Immunarch BCR pipeline; in the future versions we plan to expand the functionality and add support for 10x format and single cell data.

Best regards, Aleksandr

Sep 29 '22 12:09 Alexander230

phylogenetic analysis with missing d_gene