Index multiple VCF files
Hi - When I try to index multiple VCF files data/*.gz, it only takes the first VCF file for indexing and not all in the directory. Can it only take one input VCF file at a time?
I suspect the issue lies in how you are quoting the data/*.gz bit. You'll want to use single quotes so that the shell (bash for most people) won't expand the "*" into all the .gz files.
So you'll want to do the following:
giggle index -i 'data/*.gz' # ... and so on
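To see the difference quoting makes without involving giggle at all, here is a minimal sketch. The count_args helper and the scratch directory are made up for illustration; the helper only reports how many arguments the shell handed to it. Double quotes are used here because the pattern contains a variable; they suppress glob expansion just as single quotes do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the number of arguments it received.
count_args() { echo "$#"; }

# Scratch directory with three empty files standing in for real inputs.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.gz" "$demo_dir/b.gz" "$demo_dir/c.gz"

# Unquoted glob: the shell expands it before the command runs.
unquoted=$(count_args "$demo_dir"/*.gz)

# Quoted glob: the pattern is passed through as one literal string.
quoted=$(count_args "$demo_dir/*.gz")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -rf "$demo_dir"
```

The unquoted call receives one argument per matching file, while the quoted call receives a single argument containing the pattern itself.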
I have almost the same issue on Ubuntu 20.04.
In addition, for some reason giggle index can only read a file if it is run from the directory containing that file. So I can't use giggle index data/*.gz #...; I have to cd into the directory with the file I want to index before giggle can read it.
Using the wildcard character to apply giggle index to all files in the directory (e.g. giggle index -i *.gz #...) does not work. For some reason, it only processes the first file it encounters in the directory.
This is not fixed by adding single quotes around the wildcard expression (e.g. '*.gz').
It is also not fixed by adding double quotes (e.g. "*.gz").
Hmm, I've never had to cd into the directory of bed files just to index. So that I can try to reproduce this, can you let me know which versions of the dependencies you used to install giggle? Also, what shell are you using? Bash? If so, what version? Giggle wants a string that contains the glob itself, not the actual list of files (i.e. it needs the literal *.bed.gz, *.gz, etc.). At least with bash, an unquoted glob like *.gz is expanded into a space-separated list of files, so when giggle parses the -i option, it only takes the first one in the list.
In the meantime, you can try to build/use the singularity container mentioned in #68.
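The "only the first file is taken" behaviour described above can be illustrated with a small stand-in. This is a sketch only: parse_input is a hypothetical function that uses the shell's getopts builtin to model a single-valued -i option, and it is not giggle's actual argument parser.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a CLI with a single-valued -i option.
# The body runs in a subshell so OPTIND and the variables stay local.
parse_input() (
  OPTIND=1
  input=""
  while getopts "i:" opt; do
    case "$opt" in
      i) input=$OPTARG ;;
    esac
  done
  shift $((OPTIND - 1))
  # Anything left over was never consumed by -i.
  echo "input=$input leftover=$#"
)

# What an unquoted *.bed.gz becomes after the shell expands it:
expanded=$(parse_input -i A.bed.gz B.bed.gz C.bed.gz)

# What a quoted pattern looks like to the program:
literal=$(parse_input -i '*.bed.gz')

echo "$expanded"
echo "$literal"
```

In the expanded case the option only sees A.bed.gz and the other two names are left over, which matches the symptom of a single file being indexed.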
Yes, like I said, I tried *.gz with and without single or double quotes and none had worked.
I am using bash on Ubuntu 20.04: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Here are the versions of the dependencies installed for giggle
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)
GNU Make 4.2.1
autoconf (GNU Autoconf) 2.69
zlib1g-dev 1:1.2.11.dfsg-2ubuntu1.5
libbz2-dev 1.0.8-2
libcurl4-openssl-dev 7.68.0-1ubuntu2.24
libssl-dev 1.1.1f-1ubuntu2.23
ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]
I worked around this issue by making separate index directories for each of my *.bed.gz files with the following:
cd /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/giggle_db_bed_gz/
for file in *.gz; do
    index_path="/media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/giggle_indices/${file%_sort_peaks.narrowPeak.bed.gz}/"
    /home/kyle/giggle/giggle/bin/giggle index -i "$file" -o "$index_path" -f -s
done
But, it would be nice if the giggle index ... command worked with multiple reference BED files.
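For anyone adapting the loop above, the per-file index directory name comes from the shell's ${var%suffix} expansion, which strips a trailing suffix from a value. A minimal sketch of just that step follows; the relative giggle_indices/ prefix is shortened for illustration and is an assumption, not the real path.

```shell
#!/usr/bin/env bash
# File name taken from the thread; the index prefix is a shortened placeholder.
file="GSM1356202_sort_peaks.narrowPeak.bed.gz"

# ${file%suffix} removes the shortest match of suffix from the end of $file.
sample="${file%_sort_peaks.narrowPeak.bed.gz}"

index_path="giggle_indices/${sample}/"
echo "$index_path"
```

If a file name does not end with the given suffix, the expansion leaves it unchanged, so the full file name would end up in the index path.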
Try these options as well:
- escape the asterisk with a backslash (e.g. \*.gz)
- in a bash script: precede the giggle index step with set -o noglob and re-enable with set +o noglob afterwards.
Also post the exact giggle index command you were using (and any surrounding context) so I can recreate the conditions exactly.
also here's a little test script that summarises the different behaviors of strings with the glob expansion asterisk. This will tell me whether glob expansion is even the real culprit here.
#!/usr/bin/env bash
# assuming we have a directory called bed with some bed files in it.
echo bed/*.bed # should expand
echo 'bed/*.bed' # shouldn't expand
echo bed/\*.bed # shouldn't expand
set -o noglob
echo bed/*.bed # shouldn't expand
set +o noglob
echo bed/*.bed # should expand
Output on my MacBook (using bash not zsh), one of our centos servers, and a singularity container based on the one I mentioned earlier (Ubuntu 18.04) all have the same output:
bed/A.bed bed/B.bed
bed/*.bed
bed/*.bed
bed/*.bed
bed/A.bed bed/B.bed
If yours is different, I'll try to make a singularity container that more closely matches your conditions to understand why there's a difference.
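As a variation on the noglob suggestion, the setting can be confined to a subshell so that it never needs to be switched back on by hand. This is a sketch under the assumption that a scratch directory with two dummy files is an acceptable stand-in for real data; echo takes the place of the giggle index call.

```shell
#!/usr/bin/env bash
# Scratch directory with two empty files standing in for real inputs.
demo_dir=$(mktemp -d)
touch "$demo_dir/A.gz" "$demo_dir/B.gz"

# Command substitution runs in a subshell, so noglob only applies inside it.
with_noglob=$(set -o noglob; echo "$demo_dir"/*.gz)

# Back in the parent shell, globbing is still enabled.
after=$(echo "$demo_dir"/*.gz)

echo "inside subshell: $with_noglob"
echo "afterwards:      $after"
rm -rf "$demo_dir"
```

The first line shows the literal pattern and the second shows the expanded file list, which confirms the setting did not leak out of the subshell.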
The original command I was trying to run is the following. But the issue is that it only indexes the first file in the directory:
$ giggle index -i *.gz \
> -o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ \
> -f -s
Indexed 2339 intervals.
If I try to escape the asterisk with a backslash, I get an error that says giggle can't read one of the files:
$ giggle index -i \*.gz \
> -o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ \
> -f -s
Could not open file 'GSM1356202_sort_peaks.narrowPeak.bed.gz'
giggle: Could not open GSM1356202_sort_peaks.narrowPeak.bed.gz.
# But the file seems fine when I examine it independently
$ zcat GSM1356202_sort_peaks.narrowPeak.bed.gz | head
chr1 3292658 3292886 peak1 65 . 5.35251 8.67630 6.59574 83
chr1 3540503 3540664 peak2 56 . 4.97019 7.73483 5.68759 40
chr1 4318168 4318454 peak3 348 . 14.52825 37.48520 34.89425 143
chr1 4417821 4418240 peak4 730 . 20.91680 76.03189 73.06583 221
chr1 4432069 4432345 peak5 137 . 8.02877 16.00201 13.73194 131
chr1 4623389 4623535 peak6 75 . 5.73484 9.64725 7.53552 67
chr1 4724217 4724366 peak7 65 . 5.35251 8.67630 6.59574 118
chr1 4764443 4764649 peak8 126 . 7.64645 14.88650 12.64020 113
chr1 4766792 4766995 peak9 159 . 8.79342 18.29233 15.97714 89
chr1 4775302 4775561 peak10 219 . 10.70503 24.32764 21.91055 137
# ... And giggle runs fine on that file when it is called by its name:
$ giggle index -i GSM1356202_sort_peaks.narrowPeak.bed.gz -o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ -f -s
Indexed 110650 intervals.
If I try it in a bash script, preceding the giggle index step with set -o noglob and re-enabling with set +o noglob afterwards, I again get the error that giggle cannot open the bed file:
$ ./test_giggle_indexing.sh
Could not open file 'GSM1356202_sort_peaks.narrowPeak.bed.gz'
giggle: Could not open GSM1356202_sort_peaks.narrowPeak.bed.gz.
... where test_giggle_indexing.sh is:
cd /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/giggle_db_bed_gz/
set -o noglob
giggle index -i *.gz \
-o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ \
-f -s
set +o noglob
Here is the output on my Ubuntu 20.04 system, given the script (and .bed file context) you provided above:
$ ./test_glob_expansion.sh
bed/A.bed bed/B.bed
bed/*.bed
bed/*.bed
bed/*.bed
bed/A.bed bed/B.bed
OK, it seems that escaping the glob achieves the desired effect, but giggle then runs into a different issue. I'll go ahead and try to recreate this with a singularity container and see if I can resolve it.
In the meantime, I'd like you to try one last thing. Instead of running giggle index from within the directory containing the bed files, try supplying the absolute path to the directory, like so:
giggle index -i "/path/to/your/beds/*.bed.gz" -o /path/to/output/index -f -s
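One way to build that absolute, quoted pattern and sanity-check it before handing it to giggle is sketched below. The scratch directory, the dummy file names, and the out_dir placeholder are assumptions for illustration; the script only prints the command rather than running giggle.

```shell
#!/usr/bin/env bash
# Scratch directory with two empty files standing in for real bed files.
bed_dir=$(mktemp -d)
touch "$bed_dir/A.bed.gz" "$bed_dir/B.bed.gz"

# Resolve the directory to an absolute physical path.
abs_dir=$(cd "$bed_dir" && pwd -P)

# Keep the pattern in a variable; quoting it later prevents shell expansion.
input_glob="$abs_dir/*.bed.gz"
out_dir="/path/to/output/index"   # placeholder, replace with a real location

# Let the shell expand the same pattern once, just to count the matches.
match_count=0
for f in "$abs_dir"/*.bed.gz; do
  if [ -e "$f" ]; then
    match_count=$((match_count + 1))
  fi
done
echo "pattern matches $match_count file(s)"

# Dry run: print the command with the pattern still unexpanded.
printf 'giggle index -i "%s" -o "%s" -f -s\n' "$input_glob" "$out_dir"
rm -rf "$bed_dir"
```

If the match count is zero, the pattern or the directory is wrong, which is cheaper to find out here than from a failed indexing run.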