creating a custom taxonomic database using MAGs

Open clb21565 opened this issue 10 months ago • 0 comments

Hi there,

I am embarking on putting together a taxonomic database using MAGs with GTDB-style taxonomy annotations. Specifically, I want to combine my MAGs with the current version of GTDB. To do that I am doing the following:

final update: success!

proposing a final workflow of:

Download the GTDB taxonomy data - the ar53 and bac120 taxonomy tsv files as well as the gtdb.tar.gz archive.
Extract the GTDB-style lineage classifications from the gtdb-tk output where the first column is the MAG id and the second is the lineage classification.
Code in species/other taxonomic rank information into GTDB style for MAGs with incomplete lineages (see example code below). The first column of the output file should be the gzipped protein sequence file name and the second column should be the complete lineage. For example, :

Concatenate the MAG/custom taxonomic rank info with the corresponding GTDB data.
Generate the NCBI taxonomy dmp files using script - see below for example.

then, you can generate the database as in the below code:

mmseqs createdb dir_with_proteins/*.faa.gz database 
mmseqs createtaxdb database tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file taxonomy/mapping_genomes --tax-mapping-mode 1 --threads 32 -v 3

some caveats:

I am not too bright, this worked for me, but no warranty included ;-)
I don't know if there are any downstream impacts of non-standard taxids, I would guess perhaps not as GTDB uses tax labels that do not correspond directly to NCBI labels.

will keep this thread open for now in case anything else opens up.

R code for adding species/etc:

library(tidyr)
library(dplyr)
library(data.table)

getwd()

mfdAR=fread("mfd_AR.tsv",sep="\t",header=FALSE) # the taxonomy data frame, no header columns 

needs_nothing=mfdAR %>% subset(grepl("s__$",V2)==FALSE)

needs_genus = mfdAR %>% 
    separate(., V2, into=c("domain","phylum","class","order","family","genus","species"),
    sep=";") %>% subset(genus=="g__") %>% 
    mutate(genus = paste("g__",gsub(".faa.gz","",V1),"_genus",sep=""))%>% 
    mutate(species = paste("s__",gsub(".faa.gz","",V1),".gs",sep=""))

needs_species = mfdAR %>% 
    separate(., V2, into=c("domain","phylum","class","order","family","genus","species"),
    sep=";") %>% subset(genus!="g__"&species=="s__") %>% 
    #mutate(genus = paste("g__",gsub(".faa.gz","",V1),"_genus",sep=""))%>% 
    mutate(species = paste("s__",gsub(".faa.gz","",V1),".gs",sep=""))

#needs_species=needs_genus%>%subset(genus=="g__") %>% 
#    mutate(genus=paste("g__",gsub(".faa.gz","",V1),"_genus",sep="" ))

ns=needs_species%>%mutate(lineage=trimws(paste(domain,phylum,class,order,family,genus,species,sep=";")))
ng=needs_genus%>%mutate(lineage=trimws(paste(domain,phylum,class,order,family,genus,species,sep=";")))

needed=rbind(ns,ng)%>%unique()
no=needed[,c("V1","lineage")]
nno=needs_nothing%>%rename(lineage=V2)

out=rbind(no,nno)
nrow(out)
nrow(mfdAR) # confirm that you retained all of the original entries

out%>%fwrite("AR-MAGs-with-new-sp.txt",sep="\t",col.names = FALSE)

old stuff

Downloaded the GTDB taxonomy data - the ar53 and bac120 taxonomy tsv files as well as the gtdb.tar.gz archive
Extracted the GTDB-style lineage classifications from the gtdb-tk output where the first column is the MAG id and the second is the classification
Concatenated the BAC120/ar53s with the corresponding MAG classifications
Generated the NCBI taxonomy dmp files using script below extracted from the download.sh generated while running mmseqs databases
unzipped the gtdb.tar.gz directory, and added in the protein sequences predicted from the MAGs to the corresponding sections (labeled as MAGID.faa.gz) , then rezipped
reran the latter steps of mmseqs databases

OK Update (25/04/04, one week later)-

some things that may seem obvious, but weren't to me at the time:

1) to generate the correct taxdump directory, you need to have the actual file names as the first column of the taxonomy tsv file (I started by using just what was downloaded from gtdb directly, but that did not include _proteins.faa.gz) 2) some of the genomes are missing species or higher level taxonomic classifications- which seems to be causing issues down the line as after generating the tax dmp files many are listed with the same taxid despite being different

the solution i will try next is to put in distinct placeholder names for those without direct species classifications as I think this particular code chunk is not compatible with missing classifications:

{
    prevTaxon=1;
    for (i = 2; i <= NF; i++) {
        if ($i in ids) {
            prevTaxon = ids[$i];
        } else {
            taxCnt++;
            ids[$i] = taxCnt;
            r = substr($i, 0, 1);
            name = substr($i, 4);
            gsub(/_/, " ", name);
            printf("%s\t|\t%s\t|\t%s\t|\t-\t|\n", taxCnt, prevTaxon, rank[r]) > taxdir"/nodes.dmp";
            printf("%s\t|\t%s\t|\t-\t|\tscientific name\t|\n", taxCnt, name) > taxdir"/names.dmp";
            prevTaxon = taxCnt;
        }
    }
    printf("%s\t%s\n", $1, ids[$NF]) > taxdir"/mapping_genomes";
}'

use this code to generate the NCBI-style taxonomy metadata (taken from download.sh generated during databases):

#!/bin/bash

CMD='BEGIN {
    FS = "[\t;]";
    rank["c"] = "class";
    rank["d"] = "superkingdom";
    rank["f"] = "family";
    rank["g"] = "genus";
    rank["o"] = "order";
    rank["p"] = "phylum";
    rank["s"] = "species";
    taxCnt = 1;
    ids["root"] = 1;
    print "1\t|\t1\t|\tno rank\t|\t-\t|" > taxdir"/nodes.dmp";
    print "1\t|\troot\t|\t-\t|\tscientific name\t|" > taxdir"/names.dmp";
}

{
    prevTaxon=1;
    for (i = 2; i <= NF; i++) {
        if ($i in ids) {
            prevTaxon = ids[$i];
        } else {
            taxCnt++;
            ids[$i] = taxCnt;
            r = substr($i, 0, 1);
            name = substr($i, 4);
            gsub(/_/, " ", name);
            printf("%s\t|\t%s\t|\t%s\t|\t-\t|\n", taxCnt, prevTaxon, rank[r]) > taxdir"/nodes.dmp";
            printf("%s\t|\t%s\t|\t-\t|\tscientific name\t|\n", taxCnt, name) > taxdir"/names.dmp";
            prevTaxon = taxCnt;
        }
    }
    printf("%s\t%s\n", $1, ids[$NF]) > taxdir"/mapping_genomes";
}'

TMP_PATH=.
mkdir -p "${TMP_PATH}/taxonomy" 

awk -v taxdir="${TMP_PATH}/taxonomy" "$CMD" "${TMP_PATH}/updated-bac.tsv" "${TMP_PATH}/updated-AR.tsv"
touch "${TMP_PATH}/taxonomy/merged.dmp"
touch "${TMP_PATH}/taxonomy/delnodes.dmp"

Mar 26 '25 10:03 clb21565