Gaia: change the signature of the method load_data
The present implementation of the method GaiaClass.load_data makes use of the parameter output_file that defines a user defined zipped file were the data is saved.
We would like to change this parameter by the boolean parameter dump_to_file to save the results in the hard-code file name "datalink_output.zip" that will be saved in the current working directory.
This PR also contains 2 bug fixes in the present implementation:
-
if the value of the parameter output_file is a simple name, i.e., without a path, the code will collect all the files with extensions '.fits', '.xml', '.csv', or '.ecsv' from the current working directory. This must be changed, since this implies that any file with any of those extensions will be included in zipped file, i.e., independently if they come from the execution of the method load_data, or from previous executions.
-
If the parameter output_file is used, the built zip file is used to save the files coming from by the datalink service. Then the file is unzipped and the files that it contains are read, so that a list of Tables is built and returned to the user. But the the unzipped files are never removed. We think that this is a bug and it would be good to remove the unzipped files.
New units tests were developed.
cc @esdc-esac-esa-int
jira: GAIAMNGT-1700
Hello @cosmoJFH! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2024-10-16 22:42:55 UTC
I would also raise the idea of maybe having a separate method for the download instead of the dump_to_file parameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether the dump_to_file is the best phrase to use.)
Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.
I would also raise the idea of maybe having a separate method for the download instead of the
dump_to_fileparameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether thedump_to_fileis the best phrase to use.)Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.
The proposed change is a bit counterintuitive - it's hard to tell what drives it (maybe subsequent processing on the resulting file?). If the user has no control on the name of the file I would expect it to be at least timestamped. Also, current working directory is a bit vague. What is current working directory when a script calls the astroquery library, or a test is run with pytest or astroquery is invoked in a Jupyter notebook. At least make the location of the file available of the file available in the class.
Sorry I'm a bit late to this, but I agree that removing users' ability to specify the output location does not seem like a good idea.
However, I had a hard time parsing the original intent: is the idea that datalink_output.zip is supposed to be a temporary file that is always removed after the files are extracted? It seems instead what is implemented is the opposite: the datalink_output.zip is kept, but all the files extracted from it are deleted. To me, this seems backward; what's the use case for preserving a generically-named zip file?
Dear all,
Thanks for your valuable feedback. Let me give you an overview of the way this functionality is currently implemented and the reasons behind our proposal. The load_data() method retrieves ancillary products from the Gaia Archive that are served via the DataLink protocol. These products are stored inside a (Python) dictionary, whose keys are named according to the values of the following arguments:
- data_structure : options are: “INDIVIDUAL”, “RAW”, and “COMBINED” – this last one is going to be deprecated)
- format: options are: VOTable and VOTable_plain (translates to “xml”), FITS, CSV, and ECSV
- retrieval_type: options are: 'ALL' , 'EPOCH_PHOTOMETRY', 'RVS', 'XP_CONTINUOUS', 'XP_SAMPLED', 'MCMC_GSPPHOT' and 'MCMC_MSC'
Examples of key names are: “EPOCH_PHOTOMETRY-Gaia DR3 2263166706630078848.xml” or “EPOCH_PHOTOMETRY_RAW.xml”.
This file-naming procedure reproduces the behavior of the web interface of the Gaia ESA Archive (which does not allow to specify the names of the files to be downloaded). That said, it is possible to retrieve:
- One product for one source, or
- One product for many sources, or
- All the products available for one source, or
- All the products available for many sources.
From the web interface, the data is always downloaded as a single compressed (.zip) file that is named as: <user_name><job_id>.
- It mimics the behavior of the Gaia ESA Archive web GUI: the name of the retrieved file is set by the Archive.
- It avoids problems if users specify a name with or without adding the “.zip” suffix (it is possible to update the method to check this issue and correct the name as necessary, but that requires extra work), and
- It is very easy to modify the (fixed) file name in a Python script by e.g., appending a timestamp suffix.
We could consider the possibility of adding a separated method to download the ancillary files, but there are few reasons that make me reluctant to implement it (although I am not totally against this):
- The output of the load_data method is not a TAP job, but a Python dictionary whose elements (keys) contain ancillary products retrieved via DataLink.
- In DR4 the Gaia Archive will serve images in FITS format. Our tests show that these products can be written in a .fits file with the proposed implementation. However, retrieving them without using the “dump_to_file = True” option results in an incomplete download, as only the tabular HDUs from the fits file are retrieved, and the HDU that contains the image part is missed.
All in all, we considered that the proposed change would benefit the users by simplifying the way that the DataLink products can be stored using Astroquery.Gaia). But if you think that this change can be improved (by e.g. automatically adding a timestamp to the currently fixed file name), please let us know.
Kind regards,
================================== Dr. Héctor Cánovas Cabrera (ORCID)
Telespazio for ESA - European Space Agency SCO-09 – Support Archive Scientist for the Gaia mission European Space Astronomy Centre (ESAC) Camino Bajo del Castillo s/n 28692 Villanueva de la Cañada (Madrid), Spain
Thank you for your great description @hectorcanovas. Yes, trying to emulate the Web Interface functionality is a great goal as it will make it easier for user to use both mechanisms to access their data. I'm not familiar with the Gaia Archive, but in general, the Web interface is used for exploration while the library is used for processing.
Even so, the Web browser can prompt for the location where the file to be dowloaded or, in case it goes into the Downloads directory, it adds a number suffix each time the file with the same name is downloaded so versions are not overridden - maybe this is different in Gaia.
The astroquery library is used for processing in which multiple downloads are likely to be required and users will need a way to distinguish between these files.
Again, I'm not familiar with Gaia and please ignore if these general comments do not apply to your case. It is a design decision that people most familiar with the service/user base must take.
Dear all, following your feedback I have proposed to our team to add a timestamp suffix to the name of the file generated by the load_data method. With this change, the output filename would be: "datalink_output_<time_stamp>.zip" where <time_stamp> corresponds to the UTC time associated to the request received at the Gaia Archive DataLink server. The <time_stamp> format should follow the ISO 8601 standard: "yyyymmddThhmmss". Based on our internal records, a 1-second granularity should be enough to avoid duplicated filenames (but this format can be updated as needed). If you agree with this proposal we do our best to prepare a new, dedicated Pull Request right after the summer break.
Changes committed to the branch: now the output file name follows the patter "datalink_output_<time_stamp>.zip" where the time stamp is obtained as
now = datetime.now(timezone.utc)
output_file = 'datalink_output_' + now.strftime("%Y%m%dT%H%M%S") + '.zip'
Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD.
Codecov Report
Attention: Patch coverage is 70.96774% with 9 lines in your changes missing coverage. Please review.
Project coverage is 67.49%. Comparing base (
6959406) to head (58f8692). Report is 158 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| astroquery/gaia/core.py | 70.96% | 9 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #3014 +/- ##
==========================================
+ Coverage 67.39% 67.49% +0.10%
==========================================
Files 233 233
Lines 18405 18413 +8
==========================================
+ Hits 12404 12428 +24
+ Misses 6001 5985 -16
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Test failures are real, while the RTD docs build failure is not. Could you please rebase? That would take care of both the conflict and RTD.
The tests and conflicts were fixed.
Hi @bsipocz, what pending tasks do we need to address on our end?
Hi @bsipocz, what pending tasks do we need to address on our end?
No, it waits on my todo to a final review, I'll try to get to it this week.
Note to the other maintainers: I should have cleaned up the commit history here with a rebase, sometimes mistakes happen 🤷♀️
Dear @bsipocz,
Thank you very much for your help and feedback in implementing this PR. This update will support our ongoing preparations for Gaia DR4.
Regards,
Dr. Héctor Cánovas Cabrera (ORCID)
Telespazio for ESA - European Space Agency SCO-09 – Support Archive Scientist for the Gaia mission
European Space Astronomy Centre (ESAC) Camino Bajo del Castillo s/n 28692 Villanueva de la Cañada (Madrid), Spain
I would also raise the idea of maybe having a separate method for the download instead of the
dump_to_fileparameter? That is what we have for most of the other modules, though I am not against the idea of having parameters instead. (But then we can also bike-shed on whether thedump_to_fileis the best phrase to use.)Anyway, this API question points further than this PR, and for that I would like to also ping @keflavich and @andamian for their opinions.
@bsipocz, could you give us an example in other module for a method for the download?
https://github.com/astropy/astroquery/blob/main/astroquery/alma/core.py#L896 is a download example from alma