psm_utils

Quicker mzid parser

Open julianu opened this issue 10 months ago • 2 comments

Hej,

I rewrote the mzIdentML (mzid) reader and made it much faster, albeit maybe a bit less complete. For now, I added the new reader alongside the old one. It does not use Pyteomics but parses the XML structure more directly. Hence, it is less complete for complicated files, but it should work well for most "normal" ones that originate from a single search engine and contain only one search run. I tested the conversion to TSV on some larger files from MS-GF+ and Comet (2-20 GB), and the output was exactly identical to the files created by the original reader, but the conversion took only about a tenth of the time (with equal memory consumption). It would be great if you could add this new reader, if you like it, as the conversion of the bigger files (like a combination of timsTOF files and proteogenomics databases) otherwise takes days :)

Cheers, Julian

julianu avatar Mar 17 '25 08:03 julianu
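
For readers unfamiliar with what "parsing the structure more directly" can look like, here is a minimal sketch of a streaming mzIdentML reader built only on the Python standard library. It is not the reader proposed in this issue: the function name `iter_psms`, the output dictionary layout, and the tiny sample document are assumptions made for illustration. Like the proposed reader, it assumes a simple file with a single search run.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample document, only for illustration (not real search output).
SAMPLE = b"""<?xml version="1.0"?>
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SequenceCollection>
    <Peptide id="pep1"><PeptideSequence>ACDEK</PeptideSequence></Peptide>
  </SequenceCollection>
  <DataCollection><AnalysisData><SpectrumIdentificationList id="SIL1">
    <SpectrumIdentificationResult id="SIR1" spectrumID="scan=1" spectraData_ref="SD1">
      <SpectrumIdentificationItem id="SII1" rank="1" chargeState="2" peptide_ref="pep1"
          experimentalMassToCharge="283.1" calculatedMassToCharge="283.1" passThreshold="true">
        <cvParam cvRef="PSI-MS" accession="MS:1002049" name="MS-GF:RawScore" value="42"/>
      </SpectrumIdentificationItem>
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList></AnalysisData></DataCollection>
</MzIdentML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_psms(source):
    """Yield one dict per SpectrumIdentificationItem (hypothetical layout)."""
    peptides = {}       # Peptide id -> sequence, filled from SequenceCollection
    spectrum_id = None  # spectrumID of the enclosing SpectrumIdentificationResult
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            # Attributes are already available on "start" events.
            if name == "SpectrumIdentificationResult":
                spectrum_id = elem.get("spectrumID")
        elif name == "Peptide":
            for child in elem:
                if local_name(child.tag) == "PeptideSequence":
                    peptides[elem.get("id")] = child.text
            elem.clear()  # drop the children we no longer need
        elif name == "SpectrumIdentificationItem":
            yield {
                "spectrum_id": spectrum_id,
                "peptide": peptides.get(elem.get("peptide_ref")),
                "charge": int(elem.get("chargeState")),
                "rank": int(elem.get("rank")),
                # Scores are kept as the raw strings found in the file.
                "scores": {
                    p.get("name"): p.get("value")
                    for p in elem
                    if local_name(p.tag) == "cvParam"
                },
            }
        elif name == "SpectrumIdentificationResult":
            elem.clear()  # free the finished result and its items


for psm in iter_psms(io.BytesIO(SAMPLE)):
    print(psm)
```

The sketch reads only a handful of attributes and ignores modifications, protein references, and multiple search runs, which is the kind of trade-off between speed and completeness discussed in this thread. Emptied elements still stay attached to the root, so a reader meant for multi-gigabyte files would likely also need to detach them.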

How much work would it be for you to list any limitations to your parser, especially those that are actually relevant in the context of psm_utils? It would probably be interesting to have this as part of the documentation so people can make an informed decision when selecting the parser. On top of that, it might also be interesting to see how computationally expensive it would be to implement any relevant missing features, and use your parser as the default.

paretje avatar Apr 16 '25 13:04 paretje

I will look it over and check what information is actually missing or could be missing. This might take some time due to other things on my list; I will get back to you when I am done.

julianu avatar Apr 17 '25 14:04 julianu