Inconsistencies in the GTDB taxonomy
Thanks for putting HumGut up online!
I find some inconsistencies in the GTDB taxonomy names when comparing them to the taxonomy names in the GTDB 226.0 release.
I find 3294 unique bacterial names in the HumGut.tsv file (gtdbtk_taxonomy column). I managed to match only 1918 of these to the most recent GTDB release. Which GTDB release are the names in HumGut.tsv based on?
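For anyone who wants to repeat this kind of comparison, here is a minimal sketch of how the matching could be done. It assumes that HumGut.tsv is tab-separated with a header row containing the gtdbtk_taxonomy column, and that the GTDB reference is a headerless two-column table (accession, taxonomy string), as in the bac120 taxonomy file GTDB distributes. The function names are hypothetical and not part of HumGut or GTDB-Tk.

```python
import csv
import io


def unique_taxa(tsv_text, column="gtdbtk_taxonomy"):
    """Collect the unique, non-empty taxonomy strings from one TSV column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    taxa = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            taxa.add(value)
    return taxa


def reference_taxa(reference_text):
    """Collect taxonomy strings from a headerless (accession, taxonomy) table.

    Assumption: the taxonomy string is in the second tab-separated field.
    """
    taxa = set()
    for line in reference_text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2 and parts[1].strip():
            taxa.add(parts[1].strip())
    return taxa


def match_report(humgut, reference):
    """Summarise how many HumGut taxonomy strings occur in the reference."""
    matched = humgut & reference
    return {
        "total": len(humgut),
        "matched": len(matched),
        "unmatched": sorted(humgut - matched),
    }
```

Both readers take the file contents as text, so a call would look like `match_report(unique_taxa(open("HumGut.tsv").read()), reference_taxa(open(reference_path).read()))`, where `reference_path` is whatever taxonomy file you downloaded. Note that this compares full lineage strings, so a name counts as unmatched if any rank was renamed between releases. Comparing only the final `s__` field would be a looser alternative.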
The version you find here is based on the 2.14 release of GTDB. I have updated this to the 2.20 release, but I have it only on our HPC. I haven't had time to update with the latest 2.26 release from this year yet. Maybe I will find time for that this summer.
Many thanks @larssnip, I'll use that version for my mapping.
There are also some (500) rows in the tsv file that are missing the species in the gtdbtk_taxonomy column (the string ends with ";s__"). Am I right in assuming that you didn't manage to fully identify the GTDB taxonomy for these, or is there another reason?
Yes, this is standard output from GTDB-Tk, which we used for the classification. Some genomes belong to a species that is not covered by the GTDB taxonomy, and are only classified down to the genus, or even higher up in the hierarchy.
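To see how far down each genome was classified, a small sketch like the following could be used. It assumes the usual semicolon-separated lineage with rank prefixes `d__` through `s__`, ordered from domain to species, where an unresolved rank is left as a bare prefix such as `s__`. The helper names are hypothetical.

```python
from collections import Counter

# Rank prefixes used in GTDB-style lineage strings, from domain to species.
RANKS = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def lowest_rank(taxonomy):
    """Return (rank_name, taxon_name) for the deepest rank with a name.

    Returns (None, None) if no rank carries a name.
    """
    deepest = (None, None)
    for field in taxonomy.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and name and prefix in RANKS:
            deepest = (RANKS[prefix], name)
    return deepest


def rank_summary(taxonomies):
    """Count how many lineage strings resolve to each lowest rank."""
    return Counter(lowest_rank(t)[0] for t in taxonomies)
```

Running `rank_summary` over the gtdbtk_taxonomy column should then show how many genomes stop at genus, family and so on, rather than only counting the strings that end with ";s__".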
I started the work on updating this GitHub site yesterday. It will take many days. We have a HumGut 2.0 version with slightly more genomes, and I will now add the GTDB 2.26 taxonomy and then put it out here. We will no longer support the NCBI taxonomy, since there are now too many differences to make it worthwhile.
That is great to hear @larssnip.
Would it be possible to get access to the metadata file for HumGut 2.0, even if it only maps to GTDB 2.20?
I just want to mention that I have now uploaded a new version of HumGut, the HumGut2. It is only a smallish extension of the genome collection, but the taxonomy has now been upgraded to the GTDB release 2.26 (April 2025).
We have now omitted the NCBI taxonomy of the HumGut genomes, but in the GTDB_names.dmp and GTDB_nodes.dmp files the full NCBI Taxonomy for all non-prokaryotic taxa is included. Thus, you can now use these to build kraken2 databases where you use HumGut together with libraries for human, fungi, viruses etc.
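If you want to check which taxa are in the names file before building a database, a sketch like this might help. It assumes that GTDB_names.dmp follows the NCBI names.dmp layout (fields separated by tab, pipe, tab in the order tax id, name, unique name, name class), which I have not verified for the HumGut2 files. The function name is hypothetical.

```python
def parse_names(dmp_text, name_class="scientific name"):
    """Map tax id (kept as a string) to name for one name class.

    Assumption: NCBI-style names.dmp lines such as
    '9606<TAB>|<TAB>Homo sapiens<TAB>|<TAB><TAB>|<TAB>scientific name<TAB>|'.
    """
    names = {}
    for line in dmp_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Expect at least: tax id, name, unique name, name class.
        if len(fields) >= 4 and fields[0] and fields[3] == name_class:
            names[fields[0]] = fields[1]
    return names
```

For example, `parse_names(open("GTDB_names.dmp").read())` would give a dictionary you can search for the human, fungal or viral names you expect to find.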