How to use a custom input directory?
Hi, thanks for reaching out and using BGCFlow :)
I just wish to know how to use a custom input directory.
To use a custom input directory, you need to set up two things:
- Define the input directory path (and optionally the input file type) in the project configuration (`project_config.yaml`)
- Add your custom input samples to the `samples.csv`

Please find an example here: config.zip
This is how the project structure will look:
```
config/
├── Lactobacillus_delbrueckii
│   ├── input_files                  # your custom input directory
│   │   └── my_custom_genome.gbk
│   ├── gtdbtk.bac120.summary.tsv    # an optional GTDB-Tk style taxonomic assignment
│   ├── project_config.yaml          # project-level configuration; rules set here override the global parameters
│   └── samples.csv
└── config.yaml                      # global parameter configuration
```
And this is how the project configuration (`project_config.yaml`) looks:
```yaml
name: Lactobacillus_delbrueckii_custom_input
pep_version: 2.1.0
description: "An example of using custom input files in BGCFlow projects."
input_folder: input_files # the folder where the input files are located, relative to this file
input_type: gbk # the default type of input files; can be gbk or fna. Note that samples from NCBI will default to fna format.
gtdb-tax: gtdbtk.bac120.summary.tsv # you can also provide custom taxonomy information in GTDB-Tk output style
sample_table: samples.csv
#### RULE CONFIGURATION ####
# rules: set value to TRUE if you want to run the analysis or FALSE if you don't
rules:
  seqfu: TRUE
  ...
```
And finally, you need to add the custom sample to the `samples.csv`:
| genome_id | source | organism | genus | species | strain | closest_placement_reference | input_file |
|---|---|---|---|---|---|---|---|
| GCA_000056065.1 | ncbi | | | | | | |
| GCA_000182835.1 | ncbi | | | | | | |
| GCA_000191165.1 | ncbi | | | | | | |
| GCA_000014405.1 | ncbi | | | | | | |
| strain_01 | custom | | | | | | my_custom_genome.gbk |
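If you manage many samples, generating the `samples.csv` programmatically can be less error-prone than editing it by hand. Here is a minimal sketch using Python's `csv` module; the column layout is taken from the table above, and the sample entries are illustrative:

```python
import csv

# Column layout of a BGCFlow samples.csv (from the table above)
COLUMNS = [
    "genome_id", "source", "organism", "genus", "species",
    "strain", "closest_placement_reference", "input_file",
]

# Illustrative entries: one NCBI sample and one custom-input sample
samples = [
    {"genome_id": "GCA_000056065.1", "source": "ncbi"},
    {"genome_id": "strain_01", "source": "custom",
     "input_file": "my_custom_genome.gbk"},
]

with open("samples.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    for row in samples:
        # Columns not given for a sample are left empty, as in the table
        writer.writerow({col: row.get(col, "") for col in COLUMNS})
```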
and you should get this message:
```
Step 2.1 Getting sample information from: config/Lactobacillus_delbrueckii/project_config.yaml
 - Processing project [config/Lactobacillus_delbrueckii/project_config.yaml]
 - Custom input directory: True
 - Getting input files from: /datadrive/bgcflow/config/Lactobacillus_delbrueckii/input_files
 - Custom input format: True
 - Default input file type: gbk
 - ! WARNING: GCA_000056065.1 is from ncbi. Enforcing format to `fna`.
 - ! WARNING: GCA_000182835.1 is from ncbi. Enforcing format to `fna`.
 - ! WARNING: GCA_000191165.1 is from ncbi. Enforcing format to `fna`.
 - ! WARNING: GCA_000014405.1 is from ncbi. Enforcing format to `fna`.
 - Found user-provided taxonomic information
```
Why can this workflow still run after I delete all the input files?
I would assume that you didn't change anything in the example template configuration and only deleted the default input files located in data/raw. If that is the case, the project can still run because all the samples are fetched online from NCBI. You can check whether this is the case from the `samples.csv`.
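A quick way to check this in the `samples.csv` is to split the samples by their `source` column with a few lines of Python. The CSV content below is illustrative; in practice you would read your project's own file:

```python
import csv
import io

# Illustrative samples.csv content (replace with your project's file)
SAMPLES_CSV = """\
genome_id,source,organism,genus,species,strain,closest_placement_reference,input_file
GCA_000056065.1,ncbi,,,,,,
strain_01,custom,,,,,,my_custom_genome.gbk
"""

ncbi, local = [], []
for row in csv.DictReader(io.StringIO(SAMPLES_CSV)):
    (ncbi if row["source"] == "ncbi" else local).append(row["genome_id"])

print("fetched from NCBI:", ncbi)   # these still run after deleting local inputs
print("local input files:", local)  # these require the files on disk
```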
Thank you again for the question; we will be sure to add this to the FAQ section and improve the wiki.
Hi WJ,
Glad to hear it works :)
The CLI for `bgcflow run` is just a thin wrapper around the snakemake CLI, so you can always invoke snakemake directly and use any parameter available in the snakemake documentation.
If you prefer to use the bgcflow_wrapper CLI, you can check which parameters are available using the help command:
```
$ bgcflow run --help
Usage: bgcflow run [OPTIONS]

  A snakemake CLI wrapper to run BGCFlow. Automatically run panoptes.

Options:
  -d, --bgcflow_dir TEXT  Location of BGCFlow directory. (DEFAULT: Current
                          working directory.)
  --workflow TEXT         Select which snakefile to run. Available
                          subworkflows: {BGC | Database | Report | Metabase |
                          lsagbc | ppanggolin}. (DEFAULT: workflow/Snakefile)
  --monitor-off           Turn off Panoptes monitoring workflow. (DEFAULT:
                          False)
  --wms-monitor TEXT      Panoptes address. (DEFAULT: http://127.0.0.1:5000)
  -c, --cores INTEGER     Use at most N CPU cores/jobs in parallel. (DEFAULT:
                          8)
  -n, --dryrun            Test run.
  --unlock                Remove a lock on the snakemake working directory.
  --until TEXT            Runs the pipeline until it reaches the specified
                          rules or files.
  --profile TEXT          Path to a directory containing snakemake profile.
  -t, --touch             Touch output files (mark them up to date without
                          really changing them).
  -h, --help              Show this message and exit.
```
Note that the current bgcflow_wrapper package uses an older snakemake version; we are currently working on an update.
Also, there is another question. After `bgcflow build report`, the command from the comment `# use conda or mamba: mamba env create -f bgcflow_notes.yaml # or r_notebook.yaml` doesn't work.
I hope this means the command `bgcflow build report` works, and that you want to manually edit the notebook templates?
By snakemake convention, the environment files can be found in `workflow/envs/<environment name>.yaml`. Therefore, you can create the conda environment using `mamba env create -f workflow/envs/bgcflow_notes.yaml`.
You can actually reuse the conda environments built by snakemake by checking the snakemake log; they can be found in the `.snakemake/conda` folder.
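To find those environments without digging through the log, a short Python sketch can pair each environment directory with its spec file. This assumes snakemake's usual cache layout, where a copy of the spec `<hash>.yaml` sits next to the environment directory `<hash>_` (note the trailing underscore); adjust the path if your layout differs:

```python
from pathlib import Path

def list_snakemake_envs(conda_dir=".snakemake/conda"):
    """Pair each snakemake-built conda env directory with its spec file.

    Assumes snakemake's usual layout: a spec copy <hash>.yaml stored
    next to the environment directory <hash>_.
    """
    base = Path(conda_dir)
    envs = {}
    for spec in base.glob("*.yaml"):
        env_dir = base / (spec.stem + "_")
        if env_dir.is_dir():
            # Keep the first lines of the spec so you can tell envs apart
            envs[str(env_dir)] = spec.read_text().splitlines()[:3]
    return envs

for env_dir, spec_head in list_snakemake_envs().items():
    print(env_dir, spec_head)
```

Once you have identified the environment you want, you can activate it directly by prefix, e.g. `conda activate .snakemake/conda/<hash>_`.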
PS: If you find any misleading or wrong instructions in the wiki, please do let us know so we can correct them.
Hi,
Thanks very much for the help.
I got another problem when the workflow installs the environment for roary, shown below:

```
Channels:
 - bioconda
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - package roary-3.13.0-pl526h516909a_0 requires parallel >=20180522, but none of the providers can be installed
  - package r-ggplot2-3.3.6-r42h6115d3f_0 is excluded by strict repo priority
  - package r-base-4.1.3-h0887e52_11 requires libtiff >=4.6.0,<4.7.0a0, but none of the providers can be installed
```
I can use roary in an individual conda env, but when I export that yaml to replace the yaml in the workflow, it generates new errors. So may I ask how to modify the yaml to solve this?
Thanks for the help!
Best Regards, Jay
Hi Jay,
Unfortunately, I cannot reproduce the error when creating the roary environment, and the test seems to work fine.
From the message `package r-ggplot2-3.3.6-r42h6115d3f_0 is excluded by strict repo priority`, it seems that your conda channel priority is set to strict.
Can you check your conda channel priority and set it to flexible to see if that solves the problem?
Detailed instructions are available in the wiki.
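For reference, the setting can be changed with `conda config --set channel_priority flexible`, which records it in your `~/.condarc`. A sketch of what the relevant part of that file would then contain (the channel list here is taken from the error output above and may differ on your machine):

```yaml
# ~/.condarc -- channel priority relaxed to flexible (conda's default)
channel_priority: flexible
channels:
  - bioconda
  - conda-forge
  - defaults
```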
After setting the priority to flexible, you should see this warning message nagging you about it while running the snakemake jobs, which is fine:
```
Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Creating conda environment workflow/envs/roary.yaml...
Downloading and installing remote packages.
Environment for /datadrive_cemist/test/workflow/rules/../envs/roary.yaml created (location: .snakemake/conda/b39a961a250810ddef5ab2698703b6ab_)
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Singularity containers: ignored
```
We will probably replace roary with another, newer pangenome builder. Hopefully there will be support for singularity containers in the future for better reproducibility.
Hi Matinnuhamunada,
Thanks for your help. I fixed the roary issue under your guidance.
But there is another problem. After generating the automlst tree-roary matrix figure, the whole workflow just stops: no error, no crash, it simply stops working there. I tried to find out what happened, but the log showed no errors.
Also, the automlst tree figures show all NaN instead of the names of the strains.
Thanks very much for the help.
Best Regards, Jay
Hi Jay,
I'm currently on summer vacation, so forgive me in advance if I can't reply to your issues swiftly.
We do encounter some issues with Roary, and there are future plans to replace it with newer alternatives.
Building a pangenome can be tricky, as it depends on the sample set that is given. If the genomes are complex (say, Streptomyces) or too distantly related (at the strain vs species vs genus level), Roary might fail because the number of orthologue clusters is too high (we set the limit to 80,000). If this is the case, you might want to increase the maximum number or rethink your sample dataset.
But there is another problem. After generating the automlst tree-roary matrix figure, the whole workflow just stops: no error, no crash, it simply stops working there. I tried to find out what happened, but the log showed no errors.
Can you elaborate on which step this error happens at? If you can provide the log file here, that would be great :)
Also, the automlst tree figures show all NaN instead of the names of the strains.
It's true that I haven't put too much effort into pangenome visualization, as there is another project going on at our center (see https://pankb.org/). I don't think I can work much on this, as I need to finish my PhD in August 😬. But maybe @JackSun1997 can help or give suggestions on how to process the BGCFlow roary output for further visualization?
Hi Matinnuhamunada,
Congratulations on your PhD!
I attached the log below (the 60 lines before it stopped). The workflow stops after that until I terminate it the next morning.
Thanks for your help!
```
[Tue Jul 16 20:15:45 2024]
rule checkm_out:
    input: data/interim/checkm/Jeni_isolates/storage/bin_stats_ext.tsv
    output: data/processed/Jeni_isolates/tables/df_checkm_stats.csv
    log: logs/checkm/checkm_out_Jeni_isolates.log
    jobid: 37
    reason: Missing output files: data/processed/Jeni_isolates/tables/df_checkm_stats.csv; Input files updated by another job: data/interim/checkm/Jeni_isolates/storage/bin_stats_ext.tsv
    wildcards: name=Jeni_isolates
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/99b85a6c79ba929c0f380ce8472bd644_
Activating conda environment: .snakemake/conda/fdc4fd4e3e5776a2b25e63e13fba7ea5_
[Tue Jul 16 20:15:46 2024]
Finished job 37.
207 of 223 steps (93%) done
Select jobs to execute...
[Tue Jul 16 20:30:31 2024]
Finished job 129.
208 of 223 steps (93%) done

[Tue Jul 16 20:30:31 2024]
rule summarize_bigslice_query:
    input: data/interim/bigslice/query/Jeni_isolates_antismash_7.1.0
    output: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv, data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.json, data/processed/Jeni_isolates/bigslice/query_as_7.1.0/query_network.csv
    log: logs/bigslice/summarize_bigslice_query/summarize_bigslice_query_Jeni_isolates-antismash-7.1.0.log
    jobid: 128
    reason: Missing output files: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv; Input files updated by another job: data/interim/bigslice/query/Jeni_isolates_antismash_7.1.0
    wildcards: name=Jeni_isolates, version=7.1.0
    resources: tmpdir=/tmp

[Tue Jul 16 20:30:31 2024]
rule bigslice:
    input: data/interim/bigslice/tmp/Jeni_isolates_antismash_7.1.0
    output: data/processed/Jeni_isolates/bigslice/cluster_as_7.1.0
    log: logs/bigslice/bigslice/bigslice_Jeni_isolates-antismash-7.1.0.log
    jobid: 136
    reason: Missing output files: data/processed/Jeni_isolates/bigslice/cluster_as_7.1.0; Input files updated by another job: data/interim/bigslice/tmp/Jeni_isolates_antismash_7.1.0
    wildcards: name=Jeni_isolates, version=7.1.0
    threads: 16
    resources: tmpdir=data/interim/tempdir

Activating conda environment: .snakemake/conda/fdc4fd4e3e5776a2b25e63e13fba7ea5_
Activating conda environment: .snakemake/conda/99b85a6c79ba929c0f380ce8472bd644_
[Tue Jul 16 20:31:27 2024]
Finished job 128.
209 of 223 steps (94%) done
Select jobs to execute...

[Tue Jul 16 20:31:28 2024]
rule annotate_bigfam_hits:
    input: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv
    output: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_annotation.csv
    log: logs/bigslice/summarize_bigslice_query/annotate_bigslice_query_Jeni_isolates-antismash-7.1.0.log
    jobid: 127
    reason: Missing output files: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_annotation.csv; Input files updated by another job: data/processed/Jeni_isolates/bigslice/query_as_7.1.0/gcf_summary.csv
    wildcards: name=Jeni_isolates, version=7.1.0
    resources: tmpdir=/tmp

Activating conda environment: .snakemake/conda/fdc4fd4e3e5776a2b25e63e13fba7ea5_
[Tue Jul 16 20:31:56 2024]
Finished job 136.
210 of 223 steps (94%) done
Removing temporary output data/interim/bigslice/tmp/Jeni_isolates_antismash_7.1.0.
Select jobs to execute...
[Tue Jul 16 20:32:10 2024]
Finished job 127.
211 of 223 steps (95%) done
Terminating processes on user request, this might take some time.
[Wed Jul 17 09:34:21 2024]
```