biopython icon indicating copy to clipboard operation
biopython copied to clipboard

PDB.read() function with automatic handling of gzipped files

Open duerrsimon opened this issue 4 years ago • 10 comments

  • [X] I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.

  • [X] I have read the CONTRIBUTING.rst file, have run pre-commit locally, and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit tests and style checks pass with these changes.

  • [X] I have added my name to the alphabetical contributors listings in the files NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)

Adds support for pdb.gz and cif.gz files for PDBParser() and MMCIFParser() as discussed in #3498

Added a new function for PDBParser and MMCIFParser that detects if file is gzipped or not and then decodes the line.

duerrsimon avatar Mar 10 '21 17:03 duerrsimon

Codecov Report

:x: Patch coverage is 74.41860% with 11 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 81.85%. Comparing base (36ad781) to head (ee8e957). :warning: Report is 1488 commits behind head on master.

Files with missing lines Patch % Lines
Bio/PDB/__init__.py 74.41% 11 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3499      +/-   ##
==========================================
- Coverage   84.27%   81.85%   -2.42%     
==========================================
  Files         316      296      -20     
  Lines       51711    48703    -3008     
==========================================
- Hits        43577    39864    -3713     
- Misses       8134     8839     +705     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Mar 10 '21 17:03 codecov[bot]

Following discussion in the issue #3498 I have changed the read mode in gzip to text instead of binary and removed the decoding. Technically users could also directly pass gzip.open("filename","rt") to PDBParser or MMCIFParser but I guess for beginners it might be more user friendly if this was handled automatically by the parser.

duerrsimon avatar Mar 11 '21 11:03 duerrsimon

Hi @duerrsimon

I will have a look at this over the weekend for sure. Thank you for the contribution!

JoaoRodrigues avatar Mar 17 '21 22:03 JoaoRodrigues

I just fixed the doctest that was still failing and I added the tests for the PDB.read() with the mmtf parser to test_mmtf so that they run only when mmtf is available.

duerrsimon avatar Mar 18 '21 09:03 duerrsimon

I also changed the title of the Pull request to reflect the changes to the implementation discussed in #3498

duerrsimon avatar Mar 18 '21 09:03 duerrsimon

Great to make id optional. Then, the file name should be the first argument of the PDB parser function.

Lucioric2000 avatar Mar 20 '21 02:03 Lucioric2000

So I think I have implemented everything that was discussed above:

  • format uppercase or lowercase
  • rename format to fmt
  • add _check_format helper function + allow kwargs for the parsers
  • allow str, PathLike, Bytes or handle as input (had to abandon guess_types because it supports PathLike only in 3.8+)
  • add helpful message in case structure cannot be parsed because of wrong specified fmt (invalid string literal ... is not so helpful imho)

There should corresponding unittests for all the changes above. All the test for the read function with very basic structure comparisons reside in test_PDB, all further tests for correct structure parsing are with the parsers.

Sorry for taking a bit of time on this.

duerrsimon avatar Apr 30 '21 09:04 duerrsimon

Any feedback on this @JoaoRodrigues ? I am happy to also implement a PDB.write() and PDB.fetch() function (perhaps in a seperate pull request) if you want to unify the the interface a bit more :)

duerrsimon avatar Jun 01 '21 13:06 duerrsimon

Hi @duerrsimon

Definitely will give you feedback soon! Sorry for the delay - a lot happening at the same time. If you want, go ahead and work on the write/fetch methods in separate PRs, I'd be happy to work on them with you/review them!

JoaoRodrigues avatar Jun 03 '21 16:06 JoaoRodrigues

Hi @duerrsimon ,

Sorry for the delay but these have been tough months to have time to spare. I had a look at the PR and it is in great shape. I added some logic to parse the format from the file name (or raise an error), to make it simpler to use PDB.read("1xyz.pdb"). You can still force the format if you want.

I reviewed your tests too, some were failing and I think there was some unnecessary parsing going on. Since the changes were a little too much, I preferred to push to a branch on my own fork that you can review and cherry-pick from. Let me know what you think and again, thanks for the patience and the code! This will be a very useful piece of code, thus the long review and long list of changes :)

https://github.com/JoaoRodrigues/biopython/commit/94d59686e1454e48f0ffe5cfba5e0667d7319d6f

JoaoRodrigues avatar Jun 19 '21 00:06 JoaoRodrigues