How to use a custom input directory?
Hi, thanks for reaching out and using BGCFlow :)
I just wish to know how to use a custom input directory.
To use a custom input directory, you need to set up two things:
- Define the input directory path (and optionally the input file type) in the project configuration (`project_config.yaml`)
- Add your custom input samples to the `samples.csv`

Please find an example here: config.zip
This is how the project structure will look:
```
config/
├── Lactobacillus_delbrueckii
│   ├── input_files                  # your custom input directory
│   │   └── my_custom_genome.gbk
│   ├── gtdbtk.bac120.summary.tsv    # an optional GTDB-Tk style taxonomic assignment
│   ├── project_config.yaml          # project-level configuration; rules set here override the global parameters
│   └── samples.csv
└── config.yaml                      # global parameter configuration
```
And this is how the project configuration (`project_config.yaml`) looks:
```yaml
name: Lactobacillus_delbrueckii_custom_input
pep_version: 2.1.0
description: "An example of using custom input files in BGCFlow projects."
input_folder: input_files # the folder where the input files are located, relative to this file
input_type: gbk # the default type of input files; can be gbk or fna. Note that samples from NCBI will default to fna format.
gtdb-tax: gtdbtk.bac120.summary.tsv # you can also provide custom taxonomy information in GTDB-Tk output style
sample_table: samples.csv
#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  ...
```
And finally, you need to add the custom sample to the `samples.csv`:
| genome_id | source | organism | genus | species | strain | closest_placement_reference | input_file |
|---|---|---|---|---|---|---|---|
| GCA_000056065.1 | ncbi | | | | | | |
| GCA_000182835.1 | ncbi | | | | | | |
| GCA_000191165.1 | ncbi | | | | | | |
| GCA_000014405.1 | ncbi | | | | | | |
| strain_01 | custom | | | | | | my_custom_genome.gbk |
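If you manage many samples, generating the `samples.csv` programmatically can be less error-prone than editing it by hand. Here is a minimal sketch using Python's `csv` module; the column layout is taken from the table above, and the sample entries are illustrative:

```python
import csv

# Column layout of a BGCFlow samples.csv (from the table above)
COLUMNS = [
    "genome_id", "source", "organism", "genus", "species",
    "strain", "closest_placement_reference", "input_file",
]

# Illustrative entries: one NCBI sample and one custom-input sample
samples = [
    {"genome_id": "GCA_000056065.1", "source": "ncbi"},
    {"genome_id": "strain_01", "source": "custom",
     "input_file": "my_custom_genome.gbk"},
]

with open("samples.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    for row in samples:
        # Columns not given for a sample are left empty, as in the table
        writer.writerow({col: row.get(col, "") for col in COLUMNS})
```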
and you should get this message:
```
Step 2.1 Getting sample information from: config/Lactobacillus_delbrueckii/project_config.yaml
 - Processing project [config/Lactobacillus_delbrueckii/project_config.yaml]
 - Custom input directory: True
 - Getting input files from: /datadrive/bgcflow/config/Lactobacillus_delbrueckii/input_files
 - Custom input format: True
 - Default input file type: gbk
 - ! WARNING: GCA_000056065.1 is from ncbi. Enforcing format to `fna`.
 - ! WARNING: GCA_000182835.1 is from ncbi. Enforcing format to `fna`.
 - ! WARNING: GCA_000191165.1 is from ncbi. Enforcing format to `fna`.
 - ! WARNING: GCA_000014405.1 is from ncbi. Enforcing format to `fna`.
 - Found user-provided taxonomic information
```
Why can this workflow still run after I delete all the input files?
I would assume that you didn't change anything in the example template configuration and only deleted the default input files located in data/raw. If that is the case, the project can still run because all the samples are fetched online from NCBI. You can check whether this is the case from the `samples.csv`.
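A quick way to check this in the `samples.csv` is to split the samples by their `source` column with a few lines of Python. The CSV content below is illustrative; in practice you would read your project's own file:

```python
import csv
import io

# Illustrative samples.csv content (replace with your project's file)
SAMPLES_CSV = """\
genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_000056065.1,ncbi,,,,,,
strain_01,custom,,,,,,my_custom_genome.gbk
"""

ncbi, local = [], []
for row in csv.DictReader(io.StringIO(SAMPLES_CSV)):
    (ncbi if row["source"] == "ncbi" else local).append(row["genome_id"])

print("fetched from NCBI:", ncbi)   # these still run after deleting local inputs
print("local input files:", local)  # these require the files on disk
```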
Thank you again for the question; we will be sure to add this to the FAQ section and improve the wiki.
Hi WJ,
Glad to hear it works :)
The CLI for `bgcflow run` is just a thin wrapper around the snakemake CLI, so you can always invoke snakemake directly and use any parameter available in the snakemake documentation.
If you prefer to use the bgcflow_wrapper CLI, you can check which parameters are available using the help command:
```
$ bgcflow run --help
Usage: bgcflow run [OPTIONS]

  A snakemake CLI wrapper to run BGCFlow. Automatically run panoptes.

Options:
  -d, --bgcflow_dir TEXT  Location of BGCFlow directory. (DEFAULT: Current
                          working directory.)
  --workflow TEXT         Select which snakefile to run. Available
                          subworkflows: {BGC | Database | Report | Metabase |
                          lsagbc | ppanggolin}. (DEFAULT: workflow/Snakefile)
  --monitor-off           Turn off Panoptes monitoring workflow. (DEFAULT:
                          False)
  --wms-monitor TEXT      Panoptes address. (DEFAULT: http://127.0.0.1:5000)
  -c, --cores INTEGER     Use at most N CPU cores/jobs in parallel. (DEFAULT:
                          8)
  -n, --dryrun            Test run.
  --unlock                Remove a lock on the snakemake working directory.
  --until TEXT            Runs the pipeline until it reaches the specified
                          rules or files.
  --profile TEXT          Path to a directory containing snakemake profile.
  -t, --touch             Touch output files (mark them up to date without
                          really changing them).
  -h, --help              Show this message and exit.
```
Note that the current bgcflow_wrapper package uses an older snakemake version; we are currently working on an update.
Also, there is another question. After `bgcflow build report`, the command from the comment `# use conda or mamba: mamba env create -f bgcflow_notes.yaml # or r_notebook.yaml` doesn't work.
I hope this means the command `bgcflow build report` works, and that you want to manually edit the notebook templates?
By snakemake convention, the environment files can be found in `workflow/envs/<environment name>.yaml`. Therefore, you can create the conda environment using `mamba env create -f workflow/envs/bgcflow_notes.yaml`.
You can actually reuse the conda environments built by snakemake by checking the snakemake log; they can be found in the `.snakemake/conda` folder.
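To find those environments without digging through the log, a short Python sketch can pair each environment directory with its spec file. This assumes snakemake's usual cache layout, where a copy of the spec `<hash>.yaml` sits next to the environment directory `<hash>_` (note the trailing underscore); adjust the path if your layout differs:

```python
from pathlib import Path

def list_snakemake_envs(conda_dir=".snakemake/conda"):
    """Pair each snakemake-built conda env directory with its spec file.

    Assumes snakemake's usual layout: a spec copy <hash>.yaml stored
    next to the environment directory <hash>_.
    """
    base = Path(conda_dir)
    envs = {}
    for spec in base.glob("*.yaml"):
        env_dir = base / (spec.stem + "_")
        if env_dir.is_dir():
            # Keep the first lines of the spec so you can tell envs apart
            envs[str(env_dir)] = spec.read_text().splitlines()[:3]
    return envs

for env_dir, spec_head in list_snakemake_envs().items():
    print(env_dir, spec_head)
```

Once you have identified the environment you want, you can activate it directly by prefix, e.g. `conda activate .snakemake/conda/<hash>_`.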
PS: If you find any misleading or wrong instructions in the wiki, please do let us know so we can correct them.
Hi,
Thanks very much for the help.
I got another problem when the workflow installs the environment for roary, shown below:

```
Channels:
 - bioconda
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - package roary-3.13.0-pl526h516909a_0 requires parallel >=20180522, but none of the providers can be installed
  - package r-ggplot2-3.3.6-r42h6115d3f_0 is excluded by strict repo priority
  - package r-base-4.1.3-h0887e52_11 requires libtiff >=4.6.0,<4.7.0a0, but none of the providers can be installed
```
I can use roary in an individual conda env, but when I export that yaml to replace the yaml in the workflow, it generates new errors. So may I ask how to modify the yaml to solve this?
Thanks for the help!
Best Regards, Jay
Hi Jay,
Unfortunately, I cannot reproduce the error when creating the roary environment, and the test seems to work fine.
From the message `package r-ggplot2-3.3.6-r42h6115d3f_0 is excluded by strict repo priority`, it seems that your conda channel priority is set to strict.
Can you check your conda channel priority and set it to flexible to see if that solves the problem?
Detailed instructions are available in the wiki.
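For reference, the setting can be changed with `conda config --set channel_priority flexible`, which records it in your `~/.condarc`. A sketch of what the relevant part of that file would then contain (the channel list here is taken from the error output above and may differ on your machine):

```yaml
# ~/.condarc -- channel priority relaxed to flexible (conda's default)
channel_priority: flexible
channels:
  - bioconda
  - conda-forge
  - defaults
```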
After setting the priority to flexible, you should see this warning message nagging you about it while running the snakemake jobs, which is fine:
```
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Creating conda environment workflow/envs/roary.yaml...
Downloading and installing remote packages.
Environment for /datadrive_cemist/test/workflow/rules/../envs/roary.yaml created (location: .snakemake/conda/b39a961a250810ddef5ab2698703b6ab_)
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Singularity containers: ignored
```
We will probably replace roary with another, newer pangenome builder. Hopefully there will be support for singularity containers in the future for better reproducibility.
Hi Matinnuhamunada,
Thanks for your help. I fixed the roary issue under your guidance.
But there is another problem. After generating the automlst tree-roary matrix figure, the whole workflow just stops: no error, no crash, it simply stops working there. I tried to find out what happened, but the log showed no errors.
Also, the automlst tree figures show all NaN instead of the names of the strains.
Thanks very much for the help.
Best Regards, Jay
Hi Jay,
I'm currently on summer vacation, so forgive me in advance if I can't reply to your issues swiftly.
We do encounter some issues with Roary, and there are future plans to replace it with newer alternatives.
Building a pangenome can be tricky, as it depends on the sample set that is given. If the genomes are complex (say, Streptomyces) or too distantly related (at the strain vs species vs genus level), Roary might fail because the number of orthologue clusters is too high (we set the limit to 80,000). If this is the case, you might want to increase the maximum number or rethink your sample dataset.
But there is another problem. After generating the automlst tree-roary matrix figure, the whole workflow just stops: no error, no crash, it simply stops working there. I tried to find out what happened, but the log showed no errors.
Can you elaborate on which step this error happens at? If you can provide the log file here, that would be great :)
Also, the automlst tree figures show all NaN instead of the names of the strains.
It's true that I haven't put too much effort into pangenome visualization, as there is another project going on at our center (see https://pankb.org/). I don't think I can work much on this, as I need to finish my PhD in August 😬. But maybe @JackSun1997 can help or give suggestions on how to process the BGCFlow roary output for further visualization?
Hi Matinnuhamunada,
Congratulations on your PhD!
I attached the log below (the 60 lines before it stopped). The workflow stops after that until I terminate it the next morning.
Thanks for your help!
```
[Tue Jul 16 20:15:45 2024]
rule checkm_out:
    input: data/interim/checkm/Jeni_isolates/storage/bin_stats_ext.tsv
    output: data/processed/Jeni_isolates/tables/df_checkm_stats.csv
    log: logs/checkm/checkm_out_Jeni_isolates.log
    jobid: 37
    reason: Missing output files: data/processed/Jeni_isolates/tables/df_checkm_stats.csv; Input files updated by another job: data/interim/checkm/Jeni_isolates/storage/bin_stats_ext.tsv
    wildcards: name=Jeni_isolates
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/99b85a6c79ba929c0f380ce8472bd644_
Activating conda environment: .snakemake/conda/fdc4fd4e3e5776a2b25e63e13fba7ea5_
[Tue Jul 16 20:15:46 2024]
Finished job 37.
207 of 223 steps (93%) done
Select jobs to execute...
[Tue Jul 16 20:30:31 2024]
Finished job 129.
208 of 223 steps (93%) done

[Tue Jul 16 20:30:31 2024]
rule summarize_bigslice_query:
    input: data/interim/bigslice/query/Jeni_isolates_antismash_7.1.0
    output: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv, data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.json, data/processed/Jeni_isolates/bigslice/query_as_7.1.0/query_network.csv
    log: logs/bigslice/summarize_bigslice_query/summarize_bigslice_query_Jeni_isolates-antismash-7.1.0.log
    jobid: 128
    reason: Missing output files: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv; Input files updated by another job: data/interim/bigslice/query/Jeni_isolates_antismash_7.1.0
    wildcards: name=Jeni_isolates, version=7.1.0
    resources: tmpdir=/tmp

[Tue Jul 16 20:30:31 2024]
rule bigslice:
    input: data/interim/bigslice/tmp/Jeni_isolates_antismash_7.1.0
    output: data/processed/Jeni_isolates/bigslice/cluster_as_7.1.0
    log: logs/bigslice/bigslice/bigslice_Jeni_isolates-antismash-7.1.0.log
    jobid: 136
    reason: Missing output files: data/processed/Jeni_isolates/bigslice/cluster_as_7.1.0; Input files updated by another job: data/interim/bigslice/tmp/Jeni_isolates_antismash_7.1.0
    wildcards: name=Jeni_isolates, version=7.1.0
    threads: 16
    resources: tmpdir=data/interim/tempdir

Activating conda environment: .snakemake/conda/fdc4fd4e3e5776a2b25e63e13fba7ea5_
Activating conda environment: .snakemake/conda/99b85a6c79ba929c0f380ce8472bd644_
[Tue Jul 16 20:31:27 2024]
Finished job 128.
209 of 223 steps (94%) done
Select jobs to execute...

[Tue Jul 16 20:31:28 2024]
rule annotate_bigfam_hits:
    input: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv
    output: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_annotation.csv
    log: logs/bigslice/summarize_bigslice_query/annotate_bigslice_query_Jeni_isolates-antismash-7.1.0.log
    jobid: 127
    reason: Missing output files: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_annotation.csv; Input files updated by another job: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv
    wildcards: name=Jeni_isolates, version=7.1.0
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/fdc4fd4e3e5776a2b25e63e13fba7ea5_
[Tue Jul 16 20:31:56 2024]
Finished job 136.
210 of 223 steps (94%) done
Removing temporary output data/interim/bigslice/tmp/Jeni_isolates_antismash_7.1.0.
Select jobs to execute...
[Tue Jul 16 20:32:10 2024]
Finished job 127.
211 of 223 steps (95%) done
Terminating processes on user request, this might take some time.
[Wed Jul 17 09:34:21 2024]
```