Pull genbank files with Accession numbers
The NCBI is phasing out GI numbers per this announcement. The code works for now but vdb.parse needs to be updated to get genbank files by accession number and not gi number.
So the NCBI and BioPython don't actually let you pull genbank files via accession numbers yet. An issue was opened on the BioPython repository to track changes for this: https://github.com/biopython/biopython/issues/926
Any updates on the change-over from GI to accession numbers? I use a similar script to automate generating knock-out vectors. Even after updating xcode and biopython, I still see this error message:
File "dictyko.py", line 624, in
@pkundert I'd recommend asking on the BioPython page. I haven't heard anything yet. https://github.com/biopython/biopython/issues/926
I think this issue has now become pressing; running dengue_upload, my accessions list and query are being formed correctly, but this returns giList==[].
https://github.com/nextstrain/fauna/blob/master/vdb/parse.py#L195
I'm investigating now (starting with the biopython issue thread @chacalle mentioned above), but wanted to give people a heads up in the meantime.
Seems like people are on it, but it's also a bit of a mess for the time being: https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/comment-page-1/#comment-35754
@sidneymbell Hey Sidney, I can try helping with this later. Do you know if running update on test_vdb is also failing? python vdb/zika_update.py -db test_vdb -v zika. I feel like people would be creating issues on biopython if this is failing for others as well.
Hey @chacalle - I wondered about that as well. I'm planning to spend the morning investigating in more detail, and will certainly start with the zika implementation to see if it's just something specific about the way my code interacts with the base scripts.
It's definitely failing at the step where it tries to run the query with GI numbers (the query itself is being created and formatted correctly), and it hasn't in the past, which makes me rather suspicious though. I'll update here with what I figure out today. Thanks!
@chacalle --
So, the good news is it's a false alarm. It is failing on the esearch step (just returning an empty ID list with an error message that sounds a whole lot like it's a GI number issue), but luckily I don't think it's the case (I totally leapt to conclusions here).
The less awesome news is that I'm pretty sure it's related to the number of accessions. This doesn't make a whole lot of sense given that, from the docs:
Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 100,000 records.
and retmax == 10**9 for our queries (in my case, n==6000). But, it's reproducible.
Shouldn't be hard to fix, I'll patch it and submit a PR for your thoughts. Thanks for looking at this, and sorry for the confusion!
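For anyone hitting the same empty ID list, here is a minimal sketch of one possible workaround: split the accession list into smaller batches and run one search per batch. This is not the actual patch. The names chunked, build_term, search_in_batches and entrez_search are hypothetical, and the batch size of 500, the " OR " separator and the "nuccore" database are assumptions to adjust. The cause of the failure is also only suspected to be the number of accessions per query.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_term(accessions):
    """Join accessions into one search term (the " OR " separator is an assumption)."""
    return " OR ".join(accessions)


def search_in_batches(accessions, search, batch_size=500):
    """Call `search(term)` once per batch and concatenate the returned ID lists."""
    ids = []
    for batch in chunked(accessions, batch_size):
        ids.extend(search(build_term(batch)))
    return ids


def entrez_search(term, db="nuccore", retmax=100000):
    """One esearch call via Biopython; db and retmax defaults are assumptions."""
    # Imported here so the helpers above work without Biopython installed.
    # The caller is expected to set Entrez.email beforehand.
    from Bio import Entrez

    handle = Entrez.esearch(db=db, term=term, retmax=retmax)
    try:
        return Entrez.read(handle)["IdList"]
    finally:
        handle.close()
```

With Biopython installed, something like search_in_batches(accessions, entrez_search) would then replace the single large query. Passing the search function in as an argument keeps the batching logic testable without network access.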
Has there been any solution to this? Entrez (efetch/epost) won't accept accession.version, but most of their results are given as an accession. Otherwise, is there a way to replace thousands of accession.version with GI numbers?
After upgrading to biopython 1.68
pip install biopython --upgrade
Successfully installed biopython-1.68
--update_citations is working again for me. I don't know if the underlying bug is actually resolved, however.
@trvrb --update_citations stopped working? It seems like biopython 1.68 was released in August 2016 (http://biopython.org/wiki/Download) so I don't think the underlying problem is fixed. I wonder why it wasn't working.
According to this comment (https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/comment-page-1/#comment-35754) they are supposed to blog about it when they do finally change things.
Oh. This was entirely me then. Thanks for the update.