seqkit split with regexp does not respect letter case overwriting file output
Dear ShenWei,
Thank you again for creating and maintaining seqkit. Congrats on seqkit2 publication!
I need to split a fasta file from gisaid. An example fasta looks like this
>hRSV/a/test/123/2021
ATGC
>hRSV/b/test/234/2022
ATGC
>hRSV/A/test/345/2023
ATGC
>hRSV/B/test/567/2024
ATGC
The goal is to split the fasta files into 2 files. The pattern is essentially hRSV/A/ and hRSV/B/. However the fasta file contains capitals and lowercase A/a and B/b in the designation name.
The command I am using:
seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d
Terminal output looks correct:
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
Folder output:
test.part_B.fasta
test.part_a.fasta
Note there are only 2 files when there should be 4. The contents are also incorrect, they contain the upper case designations.
I suspect the code which writes out the fasta files is ignoring the letter case, resulting in the overwriting of files.
Using seqkit v2.8.1 on macos (x86 rosetta) installed via conda.
Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.
Supported now. Added a flag --ignore-case.
$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta
$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta -d --ignore-case -2
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] create or read FASTA index ...
[INFO] create FASTA index for test.fasta
[INFO] 4 records loaded from test.fasta.seqkit.fai
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta
seqkit_darwin_amd64.tar.gz seqkit_darwin_arm64.tar.gz seqkit_linux_amd64.tar.gz
Not related, I installed seqkit via mamba which reports the version as 2.8.1 but seqkit version reports 2.8.0.
Yes, I forgot to bump the version number in the tool.
Thank you for the super quick response. For my usecase this is solves the issue.
But I think the bug still exists. Seqkit sends a message that 4 files are created, but creates 2 due to case conflict. Seqkit split message needs to reflect the output files.
But I think the bug still exists. Seqkit sends a message that 4 files are created
Where's it?
Seqkit prints this:
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
But it writes out only 2 files.
It needs to print either this:
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
or alternatively actually write case sensitive files (the preferred option in my opinion).
I just read this issue once again.
seqkit split with regexp does not respect letter case overwriting file output Note there are only 2 files when there should be 4
And find that SeqKit does so in a case-sensitive way.
seqkit 2.8.1 works as you expect.
$ ./seqkit-2.8.1 split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 1 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_A.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_B.fasta
[INFO] write 1 sequences to file: test.fasta.split/test.part_a.fasta
$ seqkit stats test.fasta.split/*
processed files: 4 / 4 [======================================] ETA: 0s. done
file format type num_seqs sum_len min_len avg_len max_len
test.fasta.split/test.part_a.fasta FASTA DNA 1 4 4 4 4
test.fasta.split/test.part_A.fasta FASTA DNA 1 4 4 4 4
test.fasta.split/test.part_b.fasta FASTA DNA 1 4 4 4 4
test.fasta.split/test.part_B.fasta FASTA DNA 1 4 4 4 4
$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit head -n 1 $f | seqkit seq -n)"; done
test.fasta.split/test.part_a.fasta hRSV/a/test/123/2021
test.fasta.split/test.part_A.fasta hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta hRSV/b/test/234/2022
test.fasta.split/test.part_B.fasta hRSV/B/test/567/2024
Only adding --ignore-case (added yesterday) would ignore the case.
$ seqkit split -i --id-regexp "hRSV/(\w)/.+" test.fasta --force --ignore-case
[INFO] split by ID. idRegexp: hRSV/(\w)/.+
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: test.fasta.split/test.part_b.fasta
[INFO] write 2 sequences to file: test.fasta.split/test.part_a.fasta
$ for f in test.fasta.split/*; do echo -e "$f\t$(seqkit seq -ni $f | paste -sd ,)"; done
test.fasta.split/test.part_a.fasta hRSV/a/test/123/2021,hRSV/A/test/345/2023
test.fasta.split/test.part_b.fasta hRSV/b/test/234/2022,hRSV/B/test/567/2024
This is likely to be an issue related to the case (in)sensitivity of MacOS file system.
MacOS is not a case sensitive file system by default. So you can't have two files named File.txt and file.txt. You can choose to configure the OS as case sensitive if you want to.
OMG, I just learned this.
That's exactly what's happening. Thanks @botond-sipos
@shenwei356 do you think adding a disclaimer to recommend macos users use the ignore case flag? This will hopefully stop future issues.
Added.
- For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive, output files might be overwritten if they are only different in cases, like Abc and ABC.
Updated:
- For splitting by sequence IDs in Windows/MacOS, where the file systems might be case-insensitive, output files might be overwritten if they are only different in cases, like Abc and ABC. To avoid this, please switch one -I/--ignore-case.