seqkit icon indicating copy to clipboard operation
seqkit copied to clipboard

[Feature Request] Include GC content calculation for seqkit stats -a

Open jolespin opened this issue 3 years ago • 3 comments

Prerequisites

  • [ x ] make sure you're are using the latest version by seqkit version
  • [ x ] read the usage

Describe your issue

  • [ ] describe the problem I'm trying to replace my Quast workflow with seqkit stats but the one bit that's missing is the GC calculation.

  • [ ] provide a reproducible example file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC

jolespin avatar Mar 25 '22 20:03 jolespin

seqkit fx2tab has a flag:

  -g, --gc                     print GC content

GC content is more meaningful to individual sequences than a file, I think.

shenwei356 avatar Mar 31 '22 07:03 shenwei356

For metagenomics, I usually have metagenome assembled genomes split up in multiple contigs but organized as individual files. I admit, it would probably only be useful in that situation so I can see why it wouldn't be included in the stats module.

jolespin avatar Mar 31 '22 07:03 jolespin

Thanks for the feedback. If the file contains only one sequence, it would be easy to extract:

$ seqkit head -n 1 ../tests/hairpin.fa \
    | seqkit fx2tab -ni -g  | cut -f 2
43.43

For multiple sequences:

gc_cnt=$(cat ../tests/hairpin.fa | seqkit fx2tab -ni -C gc | csvtk summary  -Ht -f 2:sum -w 0)
sum_len=$(cat ../tests/hairpin.fa | seqkit stats -T | csvtk cut -t -f sum_len | csvtk del-header)
gc=$(echo "scale=4;" $gc_cnt / $sum_len "*" 100  | bc)

echo $gc
45.7700

No... It's too verbose, I'll add GC content to seqkit stats.

shenwei356 avatar Mar 31 '22 07:03 shenwei356

Added. I think It's useful for checking bacterial assemblies.

$ seqkit stats -a ../tests/hairpin.fa
file                 format  type  num_seqs    sum_len  min_len  avg_len  max_len  Q1  Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  GC(%)
../tests/hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,354  76  91  111        0  101       0       0  45.77


shenwei356 avatar Aug 12 '22 08:08 shenwei356

Available in v2.3.0 : https://github.com/shenwei356/seqkit/releases/tag/v2.3.0

shenwei356 avatar Aug 12 '22 15:08 shenwei356

Awesome, thanks for adding in this feature. This will streamline my workflow as I'm trying to replace most of my sequencing manipulations/stats with seqkit.

jolespin avatar Aug 12 '22 16:08 jolespin