fix parsing PSMs and complete protein names in XTandem
[edited after adding fix for PSM parsing]
-
XTandem's protein names tend to be abbreviated in the protein "label" tag, so the protein name is now read from the "note" tag instead.
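To illustrate the idea, here is a minimal sketch using the standard library XML parser. The XML fragment is hand-written and simplified (an assumption about the layout, not copied from real XTandem output), and `protein_name` is a hypothetical helper, not the function changed in this PR.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment (assumed layout); real output carries
# more attributes and child elements.
SNIPPET = """
<protein label="sp|P12345|EXAMPLE_HUMAN Example prot...">
  <note label="description">sp|P12345|EXAMPLE_HUMAN Example protein OS=Homo sapiens</note>
</protein>
"""


def protein_name(protein_elem):
    """Prefer the full text of the <note> child; fall back to the 'label' attribute."""
    note = protein_elem.find("note")
    if note is not None and note.text:
        return note.text.strip()
    return protein_elem.get("label", "")


protein = ET.fromstring(SNIPPET.strip())
print(protein_name(protein))
```

The fallback to "label" is a design choice of this sketch, so that entries without a note still get a name.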
-
While XTandem saves only the highest-scoring PSMs per spectrum, there can still be more than one PSM per spectrum, with different peptidoforms, if the scores are exactly the same. This is not an extremely rare case, especially with near-identical peptides (think of a single AA flip in the sequence). This fix merges the identifications with the same peptidoform into one new PSM, with only the relevant proteins assigned to each PSM. Previously, there were spurious matches of proteins to peptides that did not occur in the databases used by XTandem.
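The grouping can be sketched roughly as follows. This is not the code from the PR: `merge_by_peptidoform`, the record layout, and the dictionary keys are hypothetical and only show how tied identifications for one spectrum could be collapsed per peptidoform.

```python
from collections import defaultdict


def merge_by_peptidoform(identifications):
    """Collapse (spectrum_id, peptidoform, protein) records into one entry per
    spectrum and peptidoform, keeping only the proteins reported for it."""
    grouped = defaultdict(list)
    for spectrum_id, peptidoform, protein in identifications:
        proteins = grouped[(spectrum_id, peptidoform)]
        if protein not in proteins:  # avoid listing a protein twice
            proteins.append(protein)
    return [
        {"spectrum_id": s, "peptidoform": p, "protein_list": prots}
        for (s, p), prots in grouped.items()
    ]


# Hypothetical tied hits for one spectrum: two peptidoforms differing by an AA flip.
records = [
    ("scan=1", "PEPTIDEK/2", "PROT_A"),
    ("scan=1", "PEPTIDEK/2", "PROT_B"),
    ("scan=1", "PEPTDIEK/2", "PROT_C"),
]
merged = merge_by_peptidoform(records)
```

In this sketch, `PROT_C` is attached only to the second peptidoform, so no protein ends up on a peptide it was not reported with.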
-
Also, it seems that the remark that only one protein per peptide/PSM is parsed is therefore no longer true.
I updated the comment for the initial PR, as there were some further additions to it.
Codecov Report
Attention: Patch coverage is 15.38462% with 11 lines in your changes missing coverage. Please review.
Project coverage is 63.97%. Comparing base (6e51896) to head (5d01b6f). Report is 2 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| psm_utils/io/xtandem.py | 15.38% | 11 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #83 +/- ##
==========================================
- Coverage 64.12% 63.97% -0.16%
==========================================
Files 26 26
Lines 2492 2498 +6
==========================================
Hits 1598 1598
- Misses 894 900 +6
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 63.97% <15.38%> (-0.16%) | :arrow_down: |