seqkit icon indicating copy to clipboard operation
seqkit copied to clipboard

seqkit split with regexp does not respect letter case overwriting file output

Open ammaraziz opened this issue 1 year ago • 9 comments

Dear ShenWei,

Thank you again for creating and maintaining seqkit. Congrats on seqkit2 publication!

I need to split a fasta file from gisaid. An example fasta looks like this

>hRSV/a/test/123/2021
ATGC
>hRSV/b/test/234/2022
ATGC
>hRSV/A/test/345/2023
ATGC
>hRSV/B/test/567/2024
ATGC

The goal is to split the fasta files into 2 files. The pattern is essentially hRSV/A/ and hRSV/B/. However the fasta file contains capitals and lowercase A/a and B/b in the designation name.

The command I am using:

seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d

Terminal output looks correct:

[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta

Folder output:

test.part_B.fasta 
test.part_a.fasta

Note there are only 2 files when there should be 4. The contents are also incorrect, they contain the upper case designations.

I suspect the code which writes out the fasta files is ignoring the letter case, resulting in the overwriting of files.

Using seqkit v2.8.1 on macos (x86 rosetta) installed via conda.

ammaraziz avatar Apr 29 '24 08:04 ammaraziz

Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.

ammaraziz avatar Apr 29 '24 08:04 ammaraziz

Supported now. Added a flag --ignore-case.

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d 
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d  --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta

$ seqkit split  -i --id-regexp "hRSV/(\w)/.+" test.fasta -d  --ignore-case -2
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] create or read FASTA index ...
[INFO] create FASTA index for test.fasta
[INFO]   4 records loaded from test.fasta.seqkit.fai
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta

seqkit_darwin_amd64.tar.gz seqkit_darwin_arm64.tar.gz seqkit_linux_amd64.tar.gz

Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.

Yes, I forgot to bump the version number in the tool.

shenwei356 avatar Apr 29 '24 09:04 shenwei356

Thank you for the super quick response. For my usecase this is solves the issue.

But I think the bug still exists. Seqkit sends a message that 4 files are created, but creates 2 due to case conflict. Seqkit split message needs to reflect the output files.

ammaraziz avatar Apr 29 '24 23:04 ammaraziz

But I think the bug still exists. Seqkit sends a message that 4 files are created

Where's it?

shenwei356 avatar Apr 30 '24 06:04 shenwei356

Seqkit prints this:

[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta

But it writes out only 2 files.

It needs to print either this:

[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta

or alternatively actually write case sensitive files (the preferred option in my opinion).

ammaraziz avatar Apr 30 '24 06:04 ammaraziz

I just read this issue once again.

seqkit split with regexp does not respect letter case overwriting file output Note there are only 2 files when there should be 4

And find that SeqKit does so in a case-sensitive way.

seqkit 2.8.1 works as you expect.

$ ./seqkit-2.8.1 split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta

$ seqkit stats test.fasta.split/*
processed files:  4 / 4 [======================================] ETA: 0s. done
file                                format  type  num_seqs  sum_len  min_len  avg_len  max_len
test.fasta.split/test.part_a.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_A.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_b.fasta  FASTA   DNA          1        4        4        4        4
test.fasta.split/test.part_B.fasta  FASTA   DNA          1        4        4        4        4

$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit head -n 1 $f | seqkit seq -n)"; done
test.fasta.split/test.part_a.fasta      hRSV/a/test/123/2021
test.fasta.split/test.part_A.fasta      hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta      hRSV/b/test/234/2022
test.fasta.split/test.part_B.fasta      hRSV/B/test/567/2024

Only adding --ignore-case (added yesterday) would ignore the case.

$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta

$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit seq -ni $f | paste -sd ,)"; done
test.fasta.split/test.part_a.fasta      hRSV/a/test/123/2021,hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta      hRSV/b/test/234/2022,hRSV/B/test/567/2024

shenwei356 avatar Apr 30 '24 06:04 shenwei356

This is likely to be an issue related to the case (in)sensitivity of MacOS file system.

botond-sipos avatar Apr 30 '24 08:04 botond-sipos

MacOS is not a case sensitive file system by default. So you can't have two files named File.txt and file.txt. You can choose to configure the OS as case sensitive if you want to.

OMG, I just learned this.

shenwei356 avatar Apr 30 '24 12:04 shenwei356

That's exactly what's happening. Thanks @botond-sipos

@shenwei356 do you think adding a disclaimer to recommend macos users use the ignore case flag? This will hopefully stop future issues.

ammaraziz avatar May 02 '24 00:05 ammaraziz

Added.

  1. For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive, output files might be overwritten if they are only different in cases, like Abc and ABC.

shenwei356 avatar May 17 '24 15:05 shenwei356

Updated:

  1. For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive, output files might be overwritten if they are only different in cases, like Abc and ABC. To avoid this, please switch one -I/--ignore-case.

shenwei356 avatar May 17 '24 15:05 shenwei356