ColabFold How to use templates found by mmseqs

If you run colabfold_search with the --use-templates 1 option, it generates a single .m8 file containing the template hits, in addition to the MSAs. I am wondering how I can make colabfold_batch use this file.

In detail, running this:

 colabfold_search --threads 
  --db2 pdb70_220313 \
  --use-templates 1 \
  --use-env 1 \
  --db-load-mode 0 \
  fasta \
  $DATA_DIR \
  msas

would result in the following files in the msas directory:

0.a3m
1.a3m
2.a3m
3.a3m
4.a3m
5.a3m
6.a3m
7.a3m
8.a3m
9.a3m
...
pdb70_220313.m8

There is also a pdb directory containing .cif.gz files, downloaded by the setup_databases.sh script, which could be used.
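
To check which query IDs the template hits refer to, the .m8 file can be inspected directly. The following is a minimal sketch, assuming the file is tab-separated with the query ID in the first column (the usual MMseqs2 tabular layout); the function name count_hits_per_query is hypothetical and not part of ColabFold:

```shell
#!/usr/bin/env bash
# Count template hits per query in an .m8 file.
# Assumption: tab-separated columns, query ID in column 1.
count_hits_per_query() {
  local m8_file="$1"
  # Tally column 1, then print "<query><TAB><number of hits>" sorted by query ID.
  awk -F'\t' '{ n[$1]++ } END { for (q in n) printf "%s\t%d\n", q, n[q] }' "$m8_file" | sort
}

# Usage: count_hits_per_query msas/pdb70_220313.m8
```

The first column of the output shows the IDs that any downstream step would have to match against the queries.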

Feb 10 '23 22:02 alirezaomidi

@alirezaomidi did you find a solution here? (maybe also @milot-mirdita)? I am also confused about how to use the predefined PDB database (70/100) further, and also how to use custom templates in the prediction step.

Dec 18 '23 13:12 paoslaos

No I didn't.

Dec 18 '23 18:12 alirezaomidi

The current version of ColabFold implements the --pdb-hit-file and --local-pdb-path arguments, which let colabfold_batch use an .m8 file containing template hits.

INPUTFILE="RAS_RAF.a3m"
PDBHITFILE="RAS_RAF_pdb100_230517.m8"
LOCALPDBPATH="/path/to/pdb_mmcif/mmcif_files"
# e.g. if you have a template file at "/path/to/pdb/divided/xy/2xya.cif.gz",
# LOCALPDBPATH="/path/to/pdb" 
OUTPUTDIR="/path/to/output"
RANDOMSEED=0

colabfold_batch \
  --amber \
  --templates \
  --use-gpu-relax \
  --pdb-hit-file ${PDBHITFILE} \
  --local-pdb-path ${LOCALPDBPATH} \
  --random-seed ${RANDOMSEED} \
  ${INPUTFILE} \
  ${OUTPUTDIR}

We also modified the colabfold_search script to accommodate the change, so please update your locally installed ColabFold first and then use these commands.

Dec 19 '23 07:12 YoshitakaMo

This is terrific! Thanks!

Dec 19 '23 10:12 paoslaos

I was wondering whether we can input multiple m8 files into a single colabfold_batch call.

So in my case I ran colabfold_search for a long list of dimers (AAAAAA:BBBBBB) and now have a long list of paired a3m and m8 files (one of each per dimer). I was planning to run colabfold_batch with the folder of a3m files as input, but I cannot provide the m8 files (the PDBHITFILEs) in the same way. Am I constrained to running a few thousand colabfold_batch calls instead of a single multi-query call? I first thought of concatenating all the m8 files, but since the ID of each sequence is changed to 101 and 102 (for the A and B proteins), I am not sure whether this is safe.

Apr 25 '24 12:04 CBorreda
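
If a single multi-query call cannot take per-query hit files, the fallback described above (one colabfold_batch call per dimer) can at least be scripted. The following is a minimal sketch, not a confirmed ColabFold workflow. It assumes every NAME.a3m has a matching NAME plus suffix .m8 file in the same directory, following the RAS_RAF.a3m / RAS_RAF_pdb100_230517.m8 pattern from the example earlier in the thread; the suffix, the function names, and the DRY_RUN and M8_SUFFIX variables are all hypothetical and should be adjusted to the files colabfold_search actually produced. Only flags already shown in this thread are used.

```shell
#!/usr/bin/env bash
# Run one colabfold_batch call per a3m/m8 pair.
# Assumption: NAME.a3m is paired with NAME${M8_SUFFIX} in the same directory.

# Print the command when DRY_RUN=1, otherwise execute it.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_pairs() {
  local msa_dir="$1"
  local pdb_path="$2"
  local out_dir="$3"
  local m8_suffix="${M8_SUFFIX:-_pdb100_230517.m8}"
  local a3m name m8
  for a3m in "$msa_dir"/*.a3m; do
    [ -e "$a3m" ] || continue              # directory holds no a3m files
    name="$(basename "$a3m" .a3m)"
    m8="$msa_dir/$name$m8_suffix"
    if [ ! -f "$m8" ]; then
      echo "skipping $name: no hit file $m8" >&2
      continue
    fi
    # Each dimer gets its own hit file and its own output subdirectory.
    run_or_print colabfold_batch \
      --templates \
      --pdb-hit-file "$m8" \
      --local-pdb-path "$pdb_path" \
      "$a3m" \
      "$out_dir/$name"
  done
}

# Usage (prints the commands without running them):
# DRY_RUN=1 run_pairs msas /path/to/pdb /path/to/output
```

Keeping one hit file per call sidesteps the question of whether concatenated m8 files with repeated 101/102 IDs are handled correctly, at the cost of starting colabfold_batch once per dimer.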