Index multiple VCF files

Open bio-bench opened this issue 3 years ago • 9 comments

Hi - When I try to index multiple VCF files with data/*.gz, giggle indexes only the first VCF file rather than all of the files in the directory. Can it only take one input VCF file at a time?

bio-bench avatar Aug 19 '22 06:08 bio-bench

I suspect the issue lies in how you are quoting the data/*.gz part. You'll want to use single quotes so that the shell (bash for most people) won't expand the "*" into the list of all the .gz files.

So you'll want to do the following: giggle index -i 'data/*.gz' # ... and so on

mchowdh200 avatar Aug 13 '24 20:08 mchowdh200

I have almost the same issue using Ubuntu 20.04.

In addition, for some reason giggle index can only read a file when it is run from the directory containing that file. So I can't use giggle index data/*.gz #...; I have to cd into the directory containing the file I want to index. Then it can read the file.

Using a wildcard does not apply the giggle index command to all files in the directory (e.g. giggle index -i *.gz #...). For some reason, it only works on the first file it encounters in the directory. Adding single quotes around the wildcard expression (e.g. '*.gz') does not fix this, and neither do double quotes (e.g. "*.gz").

KyleFerchen avatar Sep 18 '24 18:09 KyleFerchen

Hmm, I've never had to cd into the directory of bed files just to index. So that I can try to reproduce this, can you let me know which versions of the dependencies you used to install giggle? Also, what shell are you using? Bash? If so, what version? Giggle wants a string that contains the glob, not the actual list of files (i.e. it needs the literal *.bed.gz, *.gz, etc.). At least with bash, I know that an unquoted glob like *.gz is expanded into a space-separated list of files, so when giggle parses that -i option, it'll only take the first one in the list.
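
To make the expansion concrete, here is a minimal sketch that does not involve giggle at all. count_args is a hypothetical helper, and the scratch directory and file names are made up for illustration; it only shows how many arguments a command receives in each case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Scratch directory with three empty placeholder files (names are made up).
demo_dir=$(mktemp -d)
touch "$demo_dir/a.bed.gz" "$demo_dir/b.bed.gz" "$demo_dir/c.bed.gz"

# Unquoted glob: the shell expands it before the command runs,
# so the command receives one argument per matching file.
unquoted=$(count_args "$demo_dir"/*.gz)

# Quoted glob: the literal pattern is passed through as a single argument.
quoted=$(count_args "$demo_dir/*.gz")

echo "unquoted: $unquoted, quoted: $quoted"
rm -r "$demo_dir"
```

This prints "unquoted: 3, quoted: 1". A program that reads a single value after -i would, in the unquoted case, only see the first of those three names as that value.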

In the meantime, you can try to build/use the singularity container mentioned in #68.

mchowdh200 avatar Sep 26 '24 17:09 mchowdh200

Yes, like I said, I tried *.gz with and without single or double quotes, and none of them worked.

I am using bash on Ubuntu 20.04: GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)

Here are the versions of the dependencies installed for giggle:

gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.2)
GNU Make 4.2.1
autoconf (GNU Autoconf) 2.69
zlib1g-dev 1:1.2.11.dfsg-2ubuntu1.5
libbz2-dev 1.0.8-2
libcurl4-openssl-dev 7.68.0-1ubuntu2.24
libssl-dev 1.1.1f-1ubuntu2.23
ruby 2.7.0p0 (2019-12-25 revision 647ee6f091) [x86_64-linux-gnu]

I worked around this issue by making separate index directories for each of my *.bed.gz files with the following:

cd /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/giggle_db_bed_gz/

for file in *.gz; do
    index_path="/media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/giggle_indices/${file%_sort_peaks.narrowPeak.bed.gz}/"
    /home/kyle/giggle/giggle/bin/giggle index -i "$file" -o "$index_path" -f -s
done

But, it would be nice if the giggle index ... command worked with multiple reference BED files.
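
The index directory names in the loop above come from bash's ${file%pattern} expansion, which removes a matching suffix from the end of a value. A small sketch of that step in isolation (the second file name is made up for illustration):

```shell
#!/usr/bin/env bash
# Suffix shared by the BED files in the workaround loop.
suffix="_sort_peaks.narrowPeak.bed.gz"

# ${var%pattern} strips the shortest match of pattern from the end of $var.
file="GSM1356202_sort_peaks.narrowPeak.bed.gz"
sample=${file%"$suffix"}
echo "$sample"

# A name that does not end in the suffix is left unchanged, so its index
# directory would keep the full file name (made-up example name).
other="plain.bed.gz"
unchanged=${other%"$suffix"}
echo "$unchanged"
```

The first echo prints GSM1356202 and the second prints plain.bed.gz, so files that do not follow the naming pattern may be worth checking before running the loop.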

KyleFerchen avatar Sep 30 '24 23:09 KyleFerchen

Try these options as well:

  • escape the asterisk with a backslash (e.g. \*.gz)
  • in a bash script: precede the giggle index step with set -o noglob and re-enable globbing with set +o noglob afterwards.

Also post the exact giggle index command you were using (and any surrounding context) so I can recreate the conditions exactly.

mchowdh200 avatar Oct 01 '24 00:10 mchowdh200

Also, here's a little test script that summarises the different behaviors of strings containing the glob asterisk. This will tell me whether glob expansion is even the real culprit here.

#!/usr/bin/env bash
# assuming we have a directory called bed with some bed files in it.
echo bed/*.bed # should expand
echo 'bed/*.bed' # shouldn't expand
echo bed/\*.bed # shouldn't expand

set -o noglob
echo bed/*.bed # shouldn't expand
set +o noglob
echo bed/*.bed # should expand

My MacBook (using bash, not zsh), one of our CentOS servers, and a Singularity container based on the one I mentioned earlier (Ubuntu 18.04) all produce the same output:

bed/A.bed bed/B.bed
bed/*.bed
bed/*.bed
bed/*.bed
bed/A.bed bed/B.bed

If yours is different, I'll try to make a singularity container that more closely matches your conditions to understand why there's a difference.

mchowdh200 avatar Oct 01 '24 01:10 mchowdh200

The original command I was trying to run is the following. But the issue is that it only indexes the first file in the directory:

$ giggle index -i *.gz \
> -o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ \
> -f -s

Indexed 2339 intervals.

If I try to escape the asterisk with a backslash, I get an error that says giggle can't read one of the files:

$ giggle index -i \*.gz \
> -o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ \
> -f -s

Could not open file 'GSM1356202_sort_peaks.narrowPeak.bed.gz'
giggle: Could not open GSM1356202_sort_peaks.narrowPeak.bed.gz.


# But the file seems fine when I examine it independently
$ zcat GSM1356202_sort_peaks.narrowPeak.bed.gz | head

chr1	3292658	3292886	peak1	65	.	5.35251	8.67630	6.59574	83
chr1	3540503	3540664	peak2	56	.	4.97019	7.73483	5.68759	40
chr1	4318168	4318454	peak3	348	.	14.52825	37.48520	34.89425	143
chr1	4417821	4418240	peak4	730	.	20.91680	76.03189	73.06583	221
chr1	4432069	4432345	peak5	137	.	8.02877	16.00201	13.73194	131
chr1	4623389	4623535	peak6	75	.	5.73484	9.64725	7.53552	67
chr1	4724217	4724366	peak7	65	.	5.35251	8.67630	6.59574	118
chr1	4764443	4764649	peak8	126	.	7.64645	14.88650	12.64020	113
chr1	4766792	4766995	peak9	159	.	8.79342	18.29233	15.97714	89
chr1	4775302	4775561	peak10	219	.	10.70503	24.32764	21.91055	137


# ... And giggle runs fine on that file when it is called by its name:
$ giggle index -i GSM1356202_sort_peaks.narrowPeak.bed.gz -o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ -f -s

Indexed 110650 intervals.




If I instead run the command from a bash script, with set -o noglob before the giggle index step and set +o noglob after it, I again get the error that giggle cannot open the bed file:

$ ./test_giggle_indexing.sh

Could not open file 'GSM1356202_sort_peaks.narrowPeak.bed.gz'
giggle: Could not open GSM1356202_sort_peaks.narrowPeak.bed.gz.

... where test_giggle_indexing.sh is:

cd /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/giggle_db_bed_gz/

set -o noglob

giggle index -i *.gz \
-o /media/kyle_storage/kyle_ferchen/grimes_lab_main/data/CistromeDB/new_giggle_index_test/ \
-f -s

set +o noglob

KyleFerchen avatar Oct 14 '24 21:10 KyleFerchen

Here is the output on my Ubuntu 20.04 system, given the script (and .bed file context) you provided above:

$ ./test_glob_expansion.sh
bed/A.bed bed/B.bed
bed/*.bed
bed/*.bed
bed/*.bed
bed/A.bed bed/B.bed

KyleFerchen avatar Oct 14 '24 21:10 KyleFerchen

OK, it seems that escaping the glob achieves the desired effect, but giggle then runs into another issue. I'll go ahead and try to recreate this issue with a Singularity container and see if I can resolve it.

In the meantime, I'd like you to try one last thing. Instead of running giggle index from within the directory containing the bed files, try supplying the absolute path to the directory like so:

giggle index -i "path/to/your/beds/*.bed.gz" -o path/to/output/index -f -s
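
Because the shell no longer expands a quoted pattern, it may help to check what the pattern would match before handing it to giggle. A rough sketch, assuming a hypothetical helper count_matches and a scratch directory standing in for the real BED directory:

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts the existing files that a glob pattern matches.
# The pattern is expanded unquoted on purpose, so paths containing spaces
# are not handled by this sketch.
count_matches() {
    n=0
    for f in $1; do
        if [ -e "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Scratch directory standing in for the real BED directory (names are made up).
bed_dir=$(mktemp -d)
touch "$bed_dir/A.bed.gz" "$bed_dir/B.bed.gz"

pattern="$bed_dir/*.bed.gz"   # absolute path, glob kept literal by the quotes
matched=$(count_matches "$pattern")
missing=$(count_matches "$bed_dir/*.vcf.gz")
echo "pattern matches $matched file(s); a non-matching pattern gives $missing"

# Only if the count looks right would one then run, for example:
#   giggle index -i "$pattern" -o path/to/output/index -f -s
rm -r "$bed_dir"
```

Here the first count is 2 and the second is 0. A count of 0 for the real pattern would point to a wrong path rather than a quoting problem.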

mchowdh200 avatar Oct 17 '24 22:10 mchowdh200