How to use templates found by mmseqs
If you run colabfold_search with the --use-templates 1 option, it generates a single .m8 file containing the template hits, in addition to the MSAs. How can I make colabfold_batch use this file?
In detail, running this:
colabfold_search --threads
--db2 pdb70_220313 \
--use-templates 1 \
--use-env 1 \
--db-load-mode 0 \
fasta \
$DATA_DIR \
msas
would result in the following files in the msas directory:
0.a3m
1.a3m
2.a3m
3.a3m
4.a3m
5.a3m
6.a3m
7.a3m
8.a3m
9.a3m
...
pdb70_220313.m8
There is also a pdb directory containing .cif.gz files, downloaded by the setup_databases.sh script, which can be used.
@alirezaomidi did you find a solution here? (maybe also @milot-mirdita)? I am also unsure how to make further use of the predefined PDB database (70/100), and how to use custom templates in the prediction step.
No, I didn't.
The current version of ColabFold implements the --pdb-hit-file and --local-pdb-path arguments, which let colabfold_batch use an .m8 file containing template hits.
INPUTFILE="RAS_RAF.a3m"
PDBHITFILE="RAS_RAF_pdb100_230517.m8"
LOCALPDBPATH="/path/to/pdb_mmcif/mmcif_files"
# e.g. if you have a template file at "/path/to/pdb/divided/xy/2xya.cif.gz",
# LOCALPDBPATH="/path/to/pdb"
OUTPUTDIR="/path/to/output"
RANDOMSEED=0
colabfold_batch \
--amber \
--templates \
--use-gpu-relax \
--pdb-hit-file ${PDBHITFILE} \
--local-pdb-path ${LOCALPDBPATH} \
--random-seed ${RANDOMSEED} \
${INPUTFILE} \
${OUTPUTDIR}
We also modified the colabfold_search script to accommodate the change, so please update your locally installed ColabFold first and then use these commands.
This is terrific! Thanks!
I was wondering whether we can input multiple m8 files into a single colabfold_batch call.
In my case, I ran colabfold_search for a long list of dimers (AAAAAA:BBBBBB) and now have a long list of paired a3m and m8 files (one of each per dimer). I was planning to run colabfold_batch with the folder of a3m files as input, but I cannot provide the m8 files (the PDBHITFILEs) in the same way. Am I constrained to running a few thousand colabfold_batch calls instead of a single, multi-query call?
I first thought of concatenating all the m8 files, but since the ID of each sequence is changed to 101 and 102 (for the A and B proteins), I am not sure whether this is safe.
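For reference, the per-dimer fallback described above (one colabfold_batch call per a3m/m8 pair) could be scripted roughly as follows. This is only a sketch, not a confirmed ColabFold workflow: it uses just the flags shown earlier in this thread, and it assumes each NAME.a3m has a hit file named NAME_<DB_SUFFIX>.m8 in the same directory (as in the RAS_RAF example). The variable names MSAS_DIR, OUTPUT_ROOT, DB_SUFFIX and DRY_RUN and the helper functions are hypothetical. The list_query_ids helper assumes the usual tab-separated .m8 layout with the query ID in the first column; it can be used to check which IDs (e.g. 101, 102) a hit file contains before deciding whether concatenation is safe.

```shell
#!/usr/bin/env bash
# Sketch: run colabfold_batch once per (a3m, m8) pair.
# All variable names and defaults below are illustrative assumptions.
set -euo pipefail

MSAS_DIR="${MSAS_DIR:-msas}"                # folder with NAME.a3m and NAME_<DB_SUFFIX>.m8
LOCALPDBPATH="${LOCALPDBPATH:-/path/to/pdb}"
OUTPUT_ROOT="${OUTPUT_ROOT:-predictions}"   # one sub-folder per dimer (assumption)
DB_SUFFIX="${DB_SUFFIX:-pdb100_230517}"     # taken from the example file name above
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Print the distinct query IDs found in column 1 of a tab-separated .m8 file.
list_query_ids() {
    cut -f1 "$1" | sort -u
}

# Build one colabfold_batch call for a single a3m file, then print or run it.
run_pair() {
    local a3m="$1"
    local name
    name="$(basename "$a3m" .a3m)"
    local m8="${MSAS_DIR}/${name}_${DB_SUFFIX}.m8"
    if [[ ! -f "$m8" ]]; then
        echo "skip ${name}: no hit file ${m8}" >&2
        return 0
    fi
    local cmd=(colabfold_batch --templates
               --pdb-hit-file "$m8"
               --local-pdb-path "$LOCALPDBPATH"
               "$a3m" "${OUTPUT_ROOT}/${name}")
    if [[ "$DRY_RUN" == "1" ]]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Loop over every a3m file in MSAS_DIR; an empty folder is a no-op.
run_all() {
    local a3m
    for a3m in "$MSAS_DIR"/*.a3m; do
        [[ -e "$a3m" ]] || continue
        run_pair "$a3m"
    done
}

run_all
```

With the default DRY_RUN=1 the script only prints the commands it would run; setting DRY_RUN=0 executes them. Each call reloads the model, so this is slower than a single multi-query run would be.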