Pull genbank files with Accession numbers

Open chacalle opened this issue 9 years ago • 12 comments

The NCBI is phasing out GI numbers per this announcement. The code works for now, but vdb.parse needs to be updated to fetch GenBank files by accession number rather than GI number.

chacalle avatar Sep 02 '16 03:09 chacalle

So the NCBI and BioPython don't actually let you pull GenBank files via accession numbers. An issue was started on the BioPython page to track changes for this: https://github.com/biopython/biopython/issues/926

chacalle avatar Sep 03 '16 19:09 chacalle

Any updates on the change-over from GI to accession numbers? I use a similar script to automate generating knock-out vectors. Even after updating xcode and biopython, I still see this error message:

  File "dictyko.py", line 624, in <module>
    gbrecord = locus_maps(gene, flank) ###returns outfile name to use in primer stuff
  File "dictyko.py", line 89, in locus_maps
    gi_id, ORF_start, ORF_end, strand = fetch_gene_coordinates(gene)
  File "dictyko.py", line 59, in fetch_gene_coordinates
    rec = Entrez.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/__init__.py", line 376, in read
    record = handler.read(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 205, in read
    self.parser.ParseFile(handle)
  File "/Library/Python/2.7/site-packages/Bio/Entrez/Parser.py", line 513, in externalEntityRefHandler
    self.dtd_urls.append(url)
UnboundLocalError: local variable 'url' referenced before assignment

pkundert avatar Sep 15 '16 21:09 pkundert

@pkundert I'd recommend asking on the BioPython page. I haven't heard anything yet. https://github.com/biopython/biopython/issues/926

chacalle avatar Sep 16 '16 03:09 chacalle

I think this issue has now become pressing; running dengue_upload, my accessions list and query are being formed correctly, but this returns giList==[]. https://github.com/nextstrain/fauna/blob/master/vdb/parse.py#L195

I'm investigating now (starting with the biopython issue thread @chacalle mentioned above), but wanted to give people a heads up in the meantime.

sidneymbell avatar Dec 08 '16 00:12 sidneymbell

Seems like people are on it, but it's also a bit of a mess for the time being: https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/comment-page-1/#comment-35754

sidneymbell avatar Dec 08 '16 01:12 sidneymbell

@sidneymbell Hey Sidney, I can try helping with this later. Do you know if running update on test_vdb is also failing? python vdb/zika_update.py -db test_vdb -v zika. I feel like people would be creating issues on biopython if this is failing for others as well.

chacalle avatar Dec 09 '16 18:12 chacalle

Hey @chacalle - I wondered about that as well. I'm planning to spend the morning investigating in more detail, and will certainly start with the zika implementation to see if it's just something specific about the way my code interacts with the base scripts.

It's definitely failing at the step where it tries to run the query with GI numbers (the query itself is being created and formatted correctly). It hasn't failed there in the past, which makes me rather suspicious. I'll update here with what I figure out today. Thanks!

sidneymbell avatar Dec 09 '16 18:12 sidneymbell

@chacalle -- So, the good news is it's a false alarm. It is failing on the esearch step (just returning an empty ID list with an error message that sounds a whole lot like it's a GI number issue), but luckily I don't think it's the case (I totally leapt to conclusions here).

The less awesome news is that I'm pretty sure it's related to the number of accessions. This doesn't make a whole lot of sense given that, from the docs:

Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 100,000 records.

and retmax == 10**9 for our queries (in my case, n==6000). But, it's reproducible.

Shouldn't be hard to fix, I'll patch it and submit a PR for your thoughts. Thanks for looking at this, and sorry for the confusion!
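
For anyone hitting the same empty ID list, here is a minimal sketch of the batching idea: split the accession list into smaller chunks and run one search per chunk. This is not the actual patch to vdb/parse.py. The names `chunked` and `search_in_batches`, the batch size of 500, and the `[accn]` query format are all assumptions for illustration, and the search call is passed in as a function so the logic can be tried without hitting NCBI.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def search_in_batches(
    accessions: Sequence[str],
    search: Callable[[str], List[str]],
    batch_size: int = 500,  # hypothetical value, not taken from the thread
) -> List[str]:
    """Run `search` once per batch of accessions and concatenate the returned IDs.

    `search` takes a query string and returns a list of IDs. With Biopython it
    could be a thin wrapper (assumed, untested here) such as:

        def entrez_search(term):
            handle = Entrez.esearch(db="nuccore", term=term, retmax=batch_size)
            return Entrez.read(handle)["IdList"]
    """
    ids: List[str] = []
    for batch in chunked(list(accessions), batch_size):
        # Assumed query format: one accession-field term per entry, OR-ed together.
        term = " OR ".join("%s[accn]" % acc for acc in batch)
        ids.extend(search(term))
    return ids
```

With n==6000 accessions and a batch size of 500, this would issue 12 smaller queries instead of one very long one. Whether query length is really what trips esearch is a guess based on the behaviour described above.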

sidneymbell avatar Dec 09 '16 19:12 sidneymbell

Has there been any solution to this? Entrez (efetch/epost) won't accept accession.version, but most of their results are given as an accession. Otherwise, is there a way to replace thousands of accession.version with GI numbers?

pawlowac avatar Feb 04 '17 03:02 pawlowac

After upgrading to biopython 1.68

pip install biopython --upgrade
Successfully installed biopython-1.68

--update_citations is working again for me. I don't know if the underlying bug is actually resolved, however.

trvrb avatar Feb 25 '17 21:02 trvrb

@trvrb --update_citations stopped working? It seems like biopython 1.68 was released in August 2016 (http://biopython.org/wiki/Download) so I don't think the underlying problem is fixed. I wonder why it wasn't working.

According to this comment (https://ncbiinsights.ncbi.nlm.nih.gov/2016/07/15/ncbi-is-phasing-out-sequence-gis-heres-what-you-need-to-know/comment-page-1/#comment-35754) they are supposed to blog about it when they do finally change things.

chacalle avatar Feb 25 '17 22:02 chacalle

Oh. This was entirely me then. Thanks for the update.

trvrb avatar Feb 25 '17 22:02 trvrb