
Fido: implement the "limited support" attributes available in IDL's vso_search

Open starfleetjames opened this issue 8 years ago • 21 comments

There are several parameters in IDL's vso_search that aren't available in net/vso/attrs, mostly the ones listed in the IDL header under "Keywords with limited support", for example layout and wavetype.

This issue was discovered when I was looking for a way to filter SDO/EVE data down to just the extracted-emission-lines product. In particular, I want to exclude the spectra product because its files are much larger and I don't want to download them.

Some possible workarounds have been identified.

Workaround 1:

```python
results = Fido.search(attrs.Time("2010-09-05 00:00:00", "2010-09-05 01:00:00"), attrs.Instrument('eve'))
results.response_block_properties()
```

Within `response_block_properties` there are `fileid` and `info` parameters whose values can be used to filter the search results down to just the lines product. To access them, the actual response from the VSO must be obtained (bypassing the `UnifiedResponse`), e.g.,

```python
vr = results.get_response(0)
vr[2]['fileid']
```

results in `EVE_L2_lines_2010248_01`, or

```python
vr[2]['info']
```

results in `L2Lines (merged)`.
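Putting these together, a post-hoc filter could look like the following sketch. The record dicts here are mocked from the values shown above; in real use they would come from `results.get_response(0)`.

```python
# Sketch of post-search filtering on 'fileid'/'info'. The record
# dicts are mocked from the values shown above; real ones come back
# in the VSO response blocks.
records = [
    {"fileid": "EVE_L2_lines_2010248_01", "info": "L2Lines (merged)"},
    {"fileid": "EVE_L2_spectra_2010248_01", "info": "L2Spectra (MEGS)"},
]

# Keep only the extracted-emission-lines product.
lines_only = [r for r in records if "lines" in r["fileid"].lower()]
print([r["fileid"] for r in lines_only])  # ['EVE_L2_lines_2010248_01']
```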

Workaround 2: it should be possible to distinguish lines and spectra with `attrs.Wavelength(wavemin=, wavemax=)`. The lines product has a wavelength range of [93.0 .. 1033.0] while the spectra span [60.0 .. 1060.0]. So something like:

```python
from astropy import units as u
results = Fido.search(attrs.Time("2010-09-05 00:00:00", "2010-09-05 01:00:00"), attrs.Instrument('eve'), attrs.Wavelength(wavemin=62 * u.angstrom, wavemax=1035 * u.angstrom))
```

should work, but doesn't. What comes back is, somehow, the spectra again, which obviously doesn't satisfy the input conditions. This is a problem with the VSO, not Fido, which can be confirmed by trying the same filter on the VSO website. I plan to raise this with the SDAC people at Goddard; I only include it here for completeness and future reference.
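In the meantime, the wavelength idea can at least be applied client-side: request the broad result and keep only records whose reported range matches the lines range exactly. The dicts below mock the `wave` structure the VSO returns.

```python
# Client-side stand-in for workaround 2: keep records whose reported
# wavelength range is exactly the lines range [93, 1033]. The dicts
# mock the 'wave' structure seen in VSO response blocks.
records = [
    {"fileid": "EVE_L2_lines_2010248_01",
     "wave": {"wavemin": "93", "wavemax": "1033"}},
    {"fileid": "EVE_L2_spectra_2010248_01",
     "wave": {"wavemin": "60", "wavemax": "1060"}},
]

def is_lines(rec, lo=93.0, hi=1033.0):
    """True when the record's range matches the lines product exactly."""
    w = rec["wave"]
    return float(w["wavemin"]) == lo and float(w["wavemax"]) == hi

print([r["fileid"] for r in records if is_lines(r)])  # ['EVE_L2_lines_2010248_01']
```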

starfleetjames avatar Jan 11 '18 17:01 starfleetjames

VSO folks looked at this issue today and agree that it seems to be an issue with how we are constructing the various queries we send. We will try to figure out what we are doing differently between Web/IDL/Python and fix it!

AlisdairDavey avatar Feb 06 '18 01:02 AlisdairDavey

I'm running a test trying to duplicate your EVE query and am getting the error:

```
Traceback (most recent call last):
  File "./test_sunpy.py", line 3, in <module>
    from sunpy.net import vso
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/net/__init__.py", line 5, in <module>
    from sunpy.net.fido_factory import Fido
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/net/fido_factory.py", line 19, in <module>
    from sunpy.net.dataretriever.clients import CLIENTS
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/net/dataretriever/__init__.py", line 13, in <module>
    from . import clients
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/net/dataretriever/clients.py", line 4, in <module>
    from .sources.eve import EVEClient
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/net/dataretriever/sources/__init__.py", line 8, in <module>
    from .eve import EVEClient
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/net/dataretriever/sources/eve.py", line 9, in <module>
    from sunpy.util.scraper import Scraper
  File "/Applications/anaconda2/lib/python2.7/site-packages/sunpy/util/scraper.py", line 8, in <module>
    from bs4 import BeautifulSoup
  File "/Applications/anaconda2/lib/python2.7/site-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/Applications/anaconda2/lib/python2.7/site-packages/bs4/builder/__init__.py", line 311, in <module>
    from . import _html5lib
  File "/Applications/anaconda2/lib/python2.7/site-packages/bs4/builder/_html5lib.py", line 57, in <module>
    class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
AttributeError: 'module' object has no attribute '_base'
```

The suggestions I see on Stack Overflow say to upgrade both beautifulsoup and html5lib, but I already have the latest versions of both.

The script itself is simple:

```python
#!/Applications/anaconda2/bin/python

from sunpy.net import vso
client = vso.VSOClient()
results = client.query(vso.attrs.Time("2010-09-05 00:00:00", "2010-09-05 22:00:00"), vso.attrs.Instrument('eve'))
```

This is with the latest version of SunPy in my Anaconda environment.

Do I need to downgrade html5lib to version 1.0b8 or is it something else?

--Ed

ejm4567 avatar Feb 07 '18 04:02 ejm4567

OK, I corrected the code in BeautifulSoup (changing `_base` to `base`) and now get results back:

```
volterra:anaconda2 mansky$ ./test_sunpy.py
./test_sunpy.py:5: SunpyDeprecationWarning: The query function is deprecated and may be removed in a future version. Use VSOClient.search instead.
  results = client.query(vso.attrs.Time("2010-09-05 00:00:00", "2010-09-05 22:00:00"), vso.attrs.Instrument('eve'))
   Start Time [1]       End Time [1]    Source Instrument   Type   Wavelength [2]
                                                                      Angstrom
------------------- ------------------- ------ ---------- -------- --------------
2010-09-05 00:00:00 2010-09-06 00:00:00    SDO        EVE FULLDISK   1.0 .. 304.0
2010-09-05 00:00:00 2010-09-05 01:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 01:00:00 2010-09-05 02:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 02:00:00 2010-09-05 03:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 03:00:00 2010-09-05 04:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 04:00:00 2010-09-05 05:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 05:00:00 2010-09-05 06:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 06:00:00 2010-09-05 07:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 07:00:00 2010-09-05 08:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
2010-09-05 08:00:00 2010-09-05 09:00:00    SDO        EVE FULLDISK 93.0 .. 1033.0
                ...                 ...    ...        ...      ...            ...
2010-09-05 14:00:00 2010-09-05 15:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 15:00:00 2010-09-05 16:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 16:00:00 2010-09-05 17:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 17:00:00 2010-09-05 18:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 18:00:00 2010-09-05 19:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 19:00:00 2010-09-05 20:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 20:00:00 2010-09-05 21:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 21:00:00 2010-09-05 22:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 22:00:00 2010-09-05 23:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 00:00:00 2010-09-06 00:00:00    SDO        EVE FULLDISK  1.0 .. 1050.0
Length = 48 rows
```

ejm4567 avatar Feb 07 '18 14:02 ejm4567

@ejm4567 It's really strange that you had to edit the source file for BS4.

I will have to run the python 2 tests to make sure that something isn't amiss with our current setup!

nabobalis avatar Feb 07 '18 14:02 nabobalis

I installed a fresh copy of Anaconda and SunPy and ran the SunPy tests with self_test. All came back OK. My problem with bs4 probably stemmed from a downgrade of bs4 triggered by the installation of some other modules in my earlier Anaconda env.

I tested the full search without wavelengths again and got the full set (shown above).

Then tested with the wavemin and wavemax specified and got a reduced set:

```
volterra:anaconda2 mansky$ ./test_sunpy.py
results = client.query(vso.attrs.Time("2010-09-05 00:00:00", "2010-09-05 22:00:00"),
                       vso.attrs.Instrument('eve'),
                       vso.attrs.Wavelength(wavemin=93 * u.angstrom, wavemax=1033 * u.angstrom))
   Start Time [1]       End Time [1]    Source Instrument   Type   Wavelength [2]
                                                                      Angstrom
------------------- ------------------- ------ ---------- -------- --------------
2010-09-05 00:00:00 2010-09-05 01:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 01:00:00 2010-09-05 02:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 02:00:00 2010-09-05 03:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 03:00:00 2010-09-05 04:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 04:00:00 2010-09-05 05:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 05:00:00 2010-09-05 06:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 06:00:00 2010-09-05 07:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 07:00:00 2010-09-05 08:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 08:00:00 2010-09-05 09:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
                ...                 ...    ...        ...      ...            ...
2010-09-05 14:00:00 2010-09-05 15:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 15:00:00 2010-09-05 16:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 16:00:00 2010-09-05 17:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 17:00:00 2010-09-05 18:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 18:00:00 2010-09-05 19:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 19:00:00 2010-09-05 20:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 20:00:00 2010-09-05 21:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 21:00:00 2010-09-05 22:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 22:00:00 2010-09-05 23:00:00    SDO        EVE FULLDISK 60.0 .. 1060.0
2010-09-05 00:00:00 2010-09-06 00:00:00    SDO        EVE FULLDISK  1.0 .. 1050.0
Length = 24 rows
```

I would have expected the wavelength range 93-1033 to be in the result set, not the other range 60-1060.

Seems like the WSDL is set to do an OR instead of an AND.

I am looking into that now to see how it is coded.

ejm4567 avatar Feb 12 '18 16:02 ejm4567

The code in the VSO is currently written to basically do an OR on the wavemin/wavemax range given.

In the EVE Perl package there is a private method, `_ProcessParam_wave`, that adds the following SQL:

```perl
$self->_AddSQL( 'p.minwave < ? and p.maxwave > ?', $wavemin, $wavemax );
```

Hence the VSO search will return everything outside the wavelength range specified for the given date range for EVE.
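Assuming the placeholders bind in the order shown, that predicate requires the product's wavelength range to strictly contain the query range, which reproduces the behavior reported above: a query of [93, 1033] matches the spectra's [60, 1060] but not the lines product itself. A minimal sketch:

```python
# The predicate the Perl data provider adds, assuming the placeholders
# bind as (p.minwave < wavemin AND p.maxwave > wavemax): the product's
# range must strictly contain the query range.
def vso_matches(minwave, maxwave, qmin, qmax):
    return minwave < qmin and maxwave > qmax

print(vso_matches(60.0, 1060.0, 93.0, 1033.0))  # True: spectra match
print(vso_matches(93.0, 1033.0, 93.0, 1033.0))  # False: lines are dropped
```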

Let me know if you'd like an additional optional operator added to the overall query, applying to the wavelength attribute only, that would allow one to request either an OR selection or an AND selection.

Here's a snippet from the log file on sdo5 that shows the full WHERE clause:

```
[Tue Feb 13 11:40:29.301056 2018] [cgi:error] [pid 2925] [client 146.5.21.121:38499] AH01215: WHERE 1=1
[Tue Feb 13 11:40:29.301367 2018] [cgi:error] [pid 2925] [client 146.5.21.121:38499] AH01215: AND ((e.start_date BETWEEN DATE_ADD(?, INTERVAL -1 DAY) AND ?) AND e.end_date > ?) and (p.minwave < ? and p.maxwave > ?) GROUP BY fileid
```

ejm4567 avatar Feb 13 '18 17:02 ejm4567

The VSO's matching of wavelengths is 'non-vanishing intersection': if there's any overlap between the query range and the data's range, the record is returned. The same logic is used for time ranges (which can get annoying when there's a mission-long data product).

The idea was that we'd rather the VSO return too much data (which you can filter out) than fail to show something you might have wanted. This philosophy also means that when we don't have information about a given facet (e.g., the data layout for many providers, including EVE: https://sdac.virtualsolar.org/cgi/show_details?instrument=EVE), it will always match.

If it doesn't differentiate by layout, it might be a difference in 'wave_type' (see http://docs.virtualsolar.org/wiki/SpectralRange), but either one would require someone with access to the search code at the SDAC to update it. (I'm not a solar physicist, so I'm not sure if it makes sense for spectra, or just for the imager data that's the bulk of the VSO.)

The 'info' field is returned by the VSO and is useful for filtering after the fact, but it's not searchable.

-Joe

jhourcle avatar Dec 03 '19 23:12 jhourcle

In talking to one of the VSO scientists, he said that the notion of 'wave type' as defined in the VSO data model makes no sense when talking about instruments like EVE. The best thing within the VSO data model to differentiate the two products would likely be 'wave bands' (FITS keyword: WV_NBAND / the number of wavelength bands in the observation), but that one never actually made it into the API, so we don't catalog on it nor have a way to send it via any of the clients.

I suspect that the best way right now to deal with the situation is to filter after the fact on the 'info' field.

-Joe

jhourcle avatar Dec 04 '19 22:12 jhourcle

vso_search has a "level" keyword that can be used for filtering. EVE version 4, releasing in March 2020, will have this implemented; PSP already uses it. So Fido should implement this kwarg.

starfleetjames avatar Feb 18 '20 17:02 starfleetjames

Fido already supports Level so if VSO provides it we will work with it.

Cadair avatar Feb 18 '20 17:02 Cadair

@jmason86 Is this resolved now?

abhijeetmanhas avatar Jul 17 '20 11:07 abhijeetmanhas

I'm trying to determine that but failing. It looks like some syntax has changed since my original post.

```python
vr = results.get_response(0)
vr[2]['fileid']
```

that was suggested now fails, saying: `TypeError: list indices must be integers or slices, not str`. Ditto when I try `vr[2]['Level']`. When I just do `vr[2]`, the table only includes parameters for start time, end time, source, instrument, type, and wavelength. So I'm stuck on checking whether the idea to use level for filtering works now. And because I now get an error when trying to look at `fileid`, the old workaround doesn't seem like it would work anymore either. Did these parameters get moved somewhere else in the data structures?

The other workaround from above that didn't work, specifying the wavelength range, still does not work (I get basically the same result as before).

starfleetjames avatar Jul 21 '20 18:07 starfleetjames

Hi @jmason86, thanks for the update. To extract `fileid` from the 3rd response, i.e. `vr[2]`, you have to do `vr.blocks[2]['fileid']`; this will work. Can you please check again? I also looked at the data of `vr[2]`, which looks like this:

```python
[{
     'provider': 'LASP',
     'source': 'SDO',
     'instrument': 'EVE',
     'physobs': 'irradiance',
     'time': {
         'start': '20100905010000',
         'end': '20100905020000',
         'near': None
     },
     'wave': {
         'wavemin': '60',
         'wavemax': '1060',
         'waveunit': 'Angstrom',
         'wavetype': None
     },
     'extent': {
         'x': None,
         'y': None,
         'width': None,
         'length': None,
         'type': 'FULLDISK'
     },
     'size': -1.0,
     'extra': None,
     'info': 'L2Spectra (MEGS)',
     'datatype': None,
     'fileurl': None,
     'fileid': 'EVE_L2_spectra_2010248_01'
 }]
```

So there is a `fileid` here but no level. However, `vr.blocks[2]['info']` contains the level details. Is the first part of your issue that sunpy's VSO client doesn't show the level in the response table?
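Since the `info` field does encode the level, a client-side helper could recover it. This is purely hypothetical; the pattern below is guessed from the two EVE examples in this thread.

```python
import re

# Hypothetical helper: pull a level-like token out of the 'info'
# string (e.g. 'L2Lines (merged)' -> 'L2Lines'), since the response
# blocks carry no explicit 'level' field. The pattern is guessed
# from the two EVE examples in this thread.
def level_from_info(info):
    m = re.match(r"(L\d\w*)", info or "")
    return m.group(1) if m else None

print(level_from_info("L2Spectra (MEGS)"))  # L2Spectra
print(level_from_info("L2Lines (merged)"))  # L2Lines
```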

abhijeetmanhas avatar Jul 21 '20 22:07 abhijeetmanhas

Sweet; that does work for me. So that confirms that the first workaround from my original post is still viable (using either info or fileid).

Yep, I would agree that one part of the issue is that the level keyword isn't part of the response. If it were, it should then be possible to use the Level attribute in the initial `Fido.search(...)` as @Cadair pointed out a few comments up.

starfleetjames avatar Jul 21 '20 22:07 starfleetjames

@jmason86 it is possible to specify Level in a Fido search and it works. If you run `Fido.search(a.Time("2010-09-05 00:00:00", "2010-09-05 01:00:00"), a.Instrument('eve'), a.Level(1))` or `Fido.search(a.Time("2010-09-05 00:00:00", "2010-09-05 01:00:00"), a.Instrument('eve'), a.Level(2))`, results are returned according to the correct level.

It's true that the level is not shown in the response table, though it can be specified in the Fido search. That needs to be worked out.

abhijeetmanhas avatar Jul 22 '20 02:07 abhijeetmanhas

That's good to know. I think it may come down to supporting non-numerical level types. For example, it's common to have levels like '0b', '0c', '0d', etc. In most LASP datasets it's also common to have levels like '2Lines'. I'm not sure whether this numeric implementation is in sunpy or the VSO, but if the former, my suggestion is to generalize it to a string. Is that possible? Maybe this issue would then be resolved if the VSO reported e.g. 'L2Lines (merged)' not only for the info but also for the level.

starfleetjames avatar Jul 22 '20 19:07 starfleetjames

> On Jul 22, 2020, at 3:02 PM, James Mason [email protected] wrote:
>
> That's good to know. I think it may come down then to having support for non-numerical level types. For example, it's common to have e.g., level '0b', '0c', '0d' etc. At least in most LASP datasets, it's also common to have these kinds of level '2Lines'. I'm not sure if this numeric implementation is in sunpy or VSO, but if the former then my suggestion is that it be generalized to a string. Is that possible? Maybe this issue would then be resolved if VSO is reporting e.g., 'L2Lines (merged)' not only for the info but for the level.

There are two problems here:

The first is that different communities have different scales / standards for the concept of “level”.

LASP and NOAA follow the definitions used by the earth science community, as defined by the NASA EOSDIS.

When the VSO started, it was rare for solar physics remote sensing data to be anything other than “Level 0”. But our community’s “Level 0” (what we call “raw” data, still in sensor units), might be what the EOSDIS considers to be “Level 1A” because it has the metadata to be located in space and has the necessary metadata for use. (although we also rely on some ancillary data in external files)

But what they consider to be “Level 1B” is considered somewhere past what the solar physics community considers to be “Level 1” data. There was actually a big fight in the community because SDO/AIA wanted to make the only public data be irreversibly transformed. It’s what they call their “Level 1.5” data, which has been adjusted for point spread. (And this would be “Level 4” by the CODMAC standard*).

The second problem is that the VSO didn't plan for there to be non-numeric levels. When we started, there was only one non-integer "Level" used by the solar physics community: SOHO/LASCO raw data was considered "Level 0.5" because lossy compression was used when downlinking it from the spacecraft, so it is not the exact raw values read from the instrument.

I mentioned NOAA earlier, and that's because a few weeks ago, the VSO added GOES SUVI data. And because GOES is operated by an earth science group, they label the data they distribute as "1A" and "1B". And because the VSO is perl under the hood, it will actually do coercion to numbers if you try doing a test like if ( 1 == "1B" ), so we tested what would happen if we put a level in as "1b". And it works, but because of some other settings in place to try to prevent it from being too lax, it's spewing warnings right now.

So now you’re wondering why VSO’s “level” field is a string, and not a float. It’s because you can send it things like “> 1” or “ge 2”.

But this whole thing leaves us with a question — do we formally define the scale that the solar physics community uses, and index the data that the VSO searches using those definitions, or do we catalog them according to whatever the provider says they are and the solar physics scale be damned?

Or do we index it by the solar physics community standards, but add some coercion in there so if you search for ‘1b’ it still does what people think it should do. (likely by translating ‘1b’ to ‘1.5’ … I have no idea what the various LASP level 0s mean)

There’s actually precedence for doing this in the VSO. We do a few obscure things when searching … so when Stuart started adding in instrument tab expansion into sunpy, he was surprised that EUVI didn’t show up as a VSO instrument … but if you search for it, you get data that you’d expect. That’s because we use the FITS “INSTRUME” field for the VSO “instrument”, which is SECCHI. EUVI is the “detector”. But if the VSO doesn’t find a match in its registry when you’ve defined an instrument or detector, it will swap the two fields in the query, and try again.

In the past, when I worked at the SDAC, I would wander the halls and ask solar physicists in the building how they think it should behave, then run whatever I planned by the VSO science advisors (Joe Gurman, Piet Martens, Frank Hill, and Rick Bogart … two of whom have since retired, and the other two haven't called into our telecons in months if not a year). For the really big things, I would set up a VSO related poster at SPD meetings, and then survey whoever would talk to me.

...

Here’s the basic level scale that the solar physics (and heliophysics) community uses:

0 : raw sensor data
1 : calibrated data, in sensor units
2 : calibrated data, in physical units

It’s possible that heliophysics has formal definitions for levels higher than 2, but they’re usually just discussed as “higher level data” in the solar physics community. (any sort of results of analysis … things like activity maps, event catalogs, etc.)

… and any fractional levels are in some way between those integers, but there's no fixed definition. This may actually relate back to issue #3505 — AIA has multiple designations of lev0 data (level 0.1, level 0.3, etc.) … with the lowest level ones having "as planned" metadata and one of the later versions (0.3? 0.5?) having the "as run" metadata, and I think there was another designation for once the metadata had been calibrated (which gets processed in daily batches, so might not happen 'til 24hrs after the data is collected, and wouldn't be available for NRT data)

-Joe

  • CODMAC = Committee On Data Management And Computation. Their "Level 1" is "raw data", but their definition of "raw" is telemetry frames from the spacecraft. What we consider "raw data" for analysis is their "Level 2" ("Edited"). It took me years to track down the original CODMAC report, which I don't think has ever been posted online, but someone from the PDS (Planetary Data System) has mapped the EOSDIS levels to CODMAC: https://pds-smallbodies.astro.umd.edu/holdings/nh-p-mvic-3-pluto-v2.0/document/codmac_level_definitions.pdf (but CODMAC went higher … at least 7). And because there are two definitions of "raw" data, I recommended against using the term. (See http://virtualsolar.org/vocab )

jhourcle avatar Jul 23 '20 01:07 jhourcle

Eesh yeah. I didn't doubt there was a lot of history here. I'm generally a zealot about standards to unify things but I think a push to define a standard level definition for all the datasets that sunpy already has and will have access to is beyond the scope of this project, though certainly we could band together to try to push for one. That sounds like a decade-ish effort.

For the immediate case, I think it makes sense to let the instruments use whatever level definitions they already have and be able to filter for them via Fido, regardless of the backend source (VSO, JSOC, and hopefully soon LISIRD, etc.). It'll fall to the user to understand what each level means for whatever particular instrument data they're looking at. Of course, that's already the case right now. So for sunpy that means (imho):

  1. a.Level() should allow string input and do string matching
  2. Fido's response table should include Level as a viewable parameter table, to allow users to quickly identify what options they can choose from
  3. Perhaps down the line, allow string comparisons (e.g., a.Level('>1'))
  4. Perhaps down the line, do fuzzy string matching?

Why? At least for me, as a barely-contributor but a long-time user, that's the behavior I naively expect when I start using Fido.

starfleetjames avatar Jul 23 '20 16:07 starfleetjames

> On Jul 23, 2020, at 12:01 PM, James Mason [email protected] wrote:
>
> Eesh yeah. I didn't doubt there was a lot of history here. I'm generally a zealot about standards to unify things but I think a push to define a standard level definition for all the datasets that sunpy already has and will have access to is beyond the scope of this project, though certainly we could band together to try to push for one. That sounds like a decade-ish effort.

I was thinking that the VSO should work on the standard … but because sunpy indexes some data that’s not VSO data, sunpy would then have to decide if they want to follow that standard or not.

> For the immediate case, I think it makes sense to let the instruments use whatever level definitions they already have and be able to filter for them via Fido, regardless of the backend source (VSO, JSOC, and hopefully soon LISIRD, etc). It'll fall to the user to understand what each level means for whatever particular instrument data they're looking at. Of course, that's already the case right now. So for sunpy that means (imho):
>
> - a.Level() should allow string input and do string matching
> - Fido's response table should include Level as a viewable parameter table, to allow users to quickly identify what options they can choose from
> - Perhaps down the line, allow string comparisons (e.g., a.Level('>1'))

The VSO does … at least for the ones in which level is actually defined. (one of my long-running tasks is to go back and figure out which VSO data providers support which features …. but I don’t have access to the NASA hosted ones anymore, so I’ve been doing code review of what’s in CVS and a little bit of black box analysis when the code wasn’t checked in … so it’s been slow going, especially as I’ve been sick and/or run down since mid-January)

> - Perhaps down the line, do fuzzy string matching?
>
> Why? At least for me, as a barely-contributor but a long-time user, that's the behavior I naively expect when I start using Fido.

What do you count as 'fuzzy string matching'? Like '1B' matching '1' and vice versa? Or '1B' matching > 1.0 but < 2?

I guess give me some examples of what you would expect to match, and what you wouldn’t.

-Joe

jhourcle avatar Jul 23 '20 16:07 jhourcle

That's really good that the VSO already supports 1 and 2; that will make implementation on the sunpy side easier, I imagine, if everyone agrees it's something that should be implemented.

The fuzzy matching was something of a fuzzy suggestion. I don't really know. Trying to think of examples now, one thing I would not want to happen is the user searching for a.Level('1') and getting a null result because all level definitions for that particular instrument are, e.g., '1A', '1.5', or whatever. That would be stymying and confusing. I would hope that it would instead return all of the above levels. Then the user could look at the levels in the response table (suggestion 2) and refine their search further if they like.

starfleetjames avatar Jul 23 '20 16:07 starfleetjames

> On Jul 23, 2020, at 12:28 PM, James Mason [email protected] wrote:
>
> That's really good that VSO already supports 1 and 2; that will make implementation on the sunpy side easier, I imagine, if it's agreed upon by everyone that's something that should be implemented.
>
> The fuzzy matching was something of a fuzzy suggestion. I don't really know. Trying to think of examples now, one thing I would not want to happen is the user searches for a.Level('1') but gets a null result because all level definitions for that particular instrument are e.g., ''1A, 1.5" or whatever. That would be stymying and confusing. I would hope that instead, it would return all of the above levels. Then the user could look at the Levels in the response table (suggestion 2) and further refine their search if they like.

Right now, a VSO query for level=‘1’ should match ‘1’ and ‘1A’ or ‘1B’ but not 1.5.

I guess I could maybe do something like parse the expression for ‘1*’ and have that match all four cases.
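One reading of the "1*" idea, purely as an illustration and not the VSO's actual matching code: a bare integer query matches itself plus lettered sub-levels, but not fractional levels.

```python
import re

# Illustrative sketch of the "1*" idea: a bare integer query matches
# itself and lettered sub-levels ('1A', '1B') but not fractional
# levels like '1.5'. This rule is an assumption for discussion, not
# the VSO's actual matching code.
def level_matches(query, level):
    if query.isdigit():
        return re.fullmatch(re.escape(query) + r"[A-Za-z]*", level) is not None
    return query.lower() == level.lower()

print([lvl for lvl in ["1", "1A", "1B", "1.5", "2"] if level_matches("1", lvl)])
# ['1', '1A', '1B']
```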

It’s also worth mentioning that all VSO queries are matched twice —

  1. The query is sent to the “VSO registry” which has a general description of the data held by each data provider. From that, it gets back a list of which data providers to send the query to

  2. The query is then sent to that list of "data providers” (code that knows about the workings of various archives) … which may use slightly different matching rules.

But we can use this to our advantage as we can list 1A and 1B data both as ‘1’ in the registry, but then have the NOAA or LASP data providers do the more fine-grained work if you had actually asked for ‘1B’.

Although, we also then have to decide on defaults — if someone asked for SUVI level ‘1’, do we return 1A or 1B? (or both, although I think this would do more harm than good, and has been a precedent I’ve been trying to avoid for many, many years now)

So basically, some global changes can be made relatively quickly (although there are multiple instances of the VSO, so we try to only make changes to the registry logic when we have people available at all of the "main" sites, currently NASA/SDAC, NSO, and maybe Stanford). If we don't do this, then we have issues where people get inconsistent results across sunpy, IDL, the web interface, etc. (we have a mechanism to synchronize the data, but not the code … and there's a couple of ancillary files that sometimes need updating (like the ones that know the full names for instruments) that are still done manually)

... but there may be cases where we need to go in and tweak individual data providers … which can take some time if it's not hosted at one of the organizations where people working on the VSO are. (this is why some data providers take longer to fix than others … but when it drags on for a really long time, we may put up a proxy that does some query re-writing before sending it to the external organization, or tweaks the results, but we try to avoid doing this as it's just more things that can break)

and there's also been talks through the years (10+? … I know they predate helioviewer) about whether the VSO needs to be less file-centric, and rebuild itself around a concept where you pre-define some concept of 'collections', so someone would drill down to a collection level (like the 'series' in DRMS), and then ask for a time or spectral slice within that collection.

And it starts to get messy, as there are a lot of missions where they’re not taking similar data through the life of the mission like SDO, but running a bunch of ad-hoc observing campaigns … so do we need to just catalog the general observing modes & types of processing applied to the data, or do we need to index each individual campaign?

In many ways, it kinda sucks that VSO has done as good of a job as it has through the years, even with the various nagging problems that crop up … because making it so some scientist can send a query and have a high chance that they’ll get the data that they’re asking for, given the disparate systems that we have to interface with (FTP sites, DRMS, etc.) requires a lot of careful planning and some black magic behind the scenes.

When I started on the project (back in 2004), not having come from solar physics or even the sciences, I was amazed how a group called the “Virtual Solar Observatory” couldn’t even give me a clear definition of what a “solar observation” is*. (I mean … LASCO … they’ve blocked out the sun … how the hell is that ‘solar’? And now SECCHI/HI_2 ? it’s not even trying to point towards the sun. And I don’t think we’d have all of the in situ data from SOHO and STEREO if our NASA project scientist wasn’t also the SOHO project scientist.)

I’m often amazed that the VSO works as reliably as it does being that most of the internal workings haven’t been touched in ~10 years and given all of the things that can go wrong. (like bureaucratic crap, which is why I’m no longer at NASA)

*and we still don't … it's whatever data the solar physicists tell us is of interest to them, that the science advisory/oversight group doesn't nix. So not outputs from model runs, as the CCMC deals with that.

-Joe

jhourcle avatar Jul 23 '20 18:07 jhourcle