PDB.read() function with automatic handling of gzipped files
-
[X] I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.
-
[X] I have read the
CONTRIBUTING.rstfile, have runpre-commitlocally, and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit tests and style checks pass with these changes. -
[X] I have added my name to the alphabetical contributors listings in the files
NEWS.rstandCONTRIB.rstas part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
Adds support for pdb.gz and cif.gz files for PDBParser() and MMCIFParser() as discussed in #3498
Added a new function for PDBParser and MMCIFParser that detects if file is gzipped or not and then decodes the line.
Codecov Report
:x: Patch coverage is 74.41860% with 11 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 81.85%. Comparing base (36ad781) to head (ee8e957).
:warning: Report is 1488 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| Bio/PDB/__init__.py | 74.41% | 11 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #3499 +/- ##
==========================================
- Coverage 84.27% 81.85% -2.42%
==========================================
Files 316 296 -20
Lines 51711 48703 -3008
==========================================
- Hits 43577 39864 -3713
- Misses 8134 8839 +705
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Following discussion in the issue #3498 I have changed the read mode in gzip to text instead of binary and removed the decoding.
Technically users could also directly pass gzip.open("filename","rt") to PDBParser or MMCIFParser but I guess for beginners it might be more user friendly if this was handled automatically by the parser.
Hi @duerrsimon
I will have a look at this over the weekend for sure. Thank you for the contribution!
I just fixed the doctest that was still failing and I added the tests for the PDB.read() with the mmtf parser to test_mmtf so that they run only when mmtf is available.
I also changed the title of the Pull request to reflect the changes to the implementation discussed in #3498
Great to make id optional. Then, the file name should be the first argument of the PDB parser function.
So I think I have implemented everything that was discussed above:
- format uppercase or lowercase
- rename format to fmt
- add
_check_formathelper function + allow kwargs for the parsers - allow str, PathLike, Bytes or handle as input (had to abandon guess_types because it supports PathLike only in 3.8+)
- add helpful message in case structure cannot be parsed because of wrong specified fmt (invalid string literal ... is not so helpful imho)
There should corresponding unittests for all the changes above. All the test for the read function with very basic structure comparisons reside in test_PDB, all further tests for correct structure parsing are with the parsers.
Sorry for taking a bit of time on this.
Any feedback on this @JoaoRodrigues ? I am happy to also implement a PDB.write() and PDB.fetch() function (perhaps in a seperate pull request) if you want to unify the the interface a bit more :)
Hi @duerrsimon
Definitely will give you feedback soon! Sorry for the delay - a lot happening at the same time. If you want, go ahead and work on the write/fetch methods in separate PRs, I'd be happy to work on them with you/review them!
Hi @duerrsimon ,
Sorry for the delay but these have been tough months to have time to spare. I had a look at the PR and it is in great shape. I added some logic to parse the format from the file name (or raise an error), to make it simpler to use PDB.read("1xyz.pdb"). You can still force the format if you want.
I reviewed your tests too, some were failing and I think there was some unnecessary parsing going on. Since the changes were a little too much, I preferred to push to a branch on my own fork that you can review and cherry-pick from. Let me know what you think and again, thanks for the patience and the code! This will be a very useful piece of code, thus the long review and long list of changes :)
https://github.com/JoaoRodrigues/biopython/commit/94d59686e1454e48f0ffe5cfba5e0667d7319d6f