pyvo Support for other output formats (tsv, csv) missing in TAP

Only VOTable supported

Jun 07 '19 16:06 andamian

Could it be done seamlessly via astropy's table parsing? Just asking; I don't have any insight into what TAP should be able to read, etc.

Jun 07 '19 16:06 bsipocz

It's an interesting question. At the TAP level, technically it does support different output formats, and even custom ones. I also agree that if you have a VOTable you pretty much have everything you need to make a tsv or csv file from that (although that's extra logic you'd have to write).
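To give an idea of the "extra logic" involved, here is a minimal sketch using only the Python standard library. It assumes a single-table VOTable in TABLEDATA serialization; the sample document, `local_name` and `votable_to_delimited` are made up for illustration and are not part of pyvo or astropy. Note that the column unit is silently dropped on the way to csv/tsv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, hand-written VOTable using the TABLEDATA serialization.
SAMPLE_VOTABLE = """<?xml version="1.0"?>
<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE type="results">
    <TABLE>
      <FIELD name="source_id" datatype="long"/>
      <FIELD name="ra" datatype="double" unit="deg"/>
      <DATA><TABLEDATA>
        <TR><TD>1</TD><TD>10.5</TD></TR>
        <TR><TD>2</TD><TD>20.25</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def votable_to_delimited(votable_xml, delimiter=","):
    """Flatten a single-table TABLEDATA VOTable into delimited text."""
    root = ET.fromstring(votable_xml)
    names, rows = [], []
    for elem in root.iter():
        tag = local_name(elem.tag)
        if tag == "FIELD":
            names.append(elem.get("name"))
        elif tag == "TR":
            rows.append([td.text or "" for td in elem])
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(names)   # header line with the column names
    writer.writerows(rows)   # one line per TR
    return out.getvalue()


csv_text = votable_to_delimited(SAMPLE_VOTABLE)
tsv_text = votable_to_delimited(SAMPLE_VOTABLE, delimiter="\t")
print(csv_text)
```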

Are you thinking more that there should be a generic way to support any format, or that we should add support for these specific formats? Are there any use cases that we should keep in mind when implementing this?

Jun 07 '19 18:06 cbanek

I can think of a couple of reasons (not 100% sure on the second):

1. XML is verbose, so tsv and csv would reduce the amount of traffic. For simple exploratory queries I would rather use tsv or csv.
2. Streaming. I didn't look at pyvo closely, but I suspect it waits and reads the entire result set before returning the VOTable. That could be suboptimal for large result sets.

Jun 07 '19 18:06 andamian

On Fri, Jun 07, 2019 at 11:41:22AM -0700, andamian wrote:

> I can think of a couple of reasons (not 100% sure on the second):
>
> 1. XML is verbose, so tsv and csv would reduce the amount of traffic. For simple exploratory queries I would rather use tsv or csv.
>
> 2. Streaming. I didn't look at pyvo closely, but I suspect it waits and reads the entire result set before returning the VOTable. That could be suboptimal for large result sets.

Using non-VOTable response formats in TAP loses important metadata (e.g., query status, overflow, not to mention the column metadata and, at least for some data providers, licensing information, source links, and the like).

As to andamian's reason 1, I doubt tsv or csv save a lot over (binary) VOTable in typical cases; once a gzip transfer encoding is put on top, remaining differences will most likely become very small indeed. If size is to become a motivation, we'd need to research this a whole lot more.
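A rough starting point for that kind of research could look like the sketch below. It is only an assumption-laden toy: the rows are synthetic random numbers, the XML side is a bare TABLEDATA fragment rather than a full VOTable, and the BINARY serialization is not covered at all. It prints raw and gzip-compressed sizes so they can be compared; real catalogs will behave differently.

```python
import gzip
import random

random.seed(0)  # reproducible synthetic data, not a real catalog
rows = [(i, random.uniform(0, 360), random.uniform(-90, 90))
        for i in range(10000)]

# The same rows serialized as CSV ...
csv_bytes = "\n".join(
    ["source_id,ra,dec"] + ["%d,%.6f,%.6f" % r for r in rows]
).encode("ascii")

# ... and as a VOTable-style TABLEDATA fragment.
xml_bytes = (
    "<TABLEDATA>"
    + "".join("<TR><TD>%d</TD><TD>%.6f</TD><TD>%.6f</TD></TR>" % r
              for r in rows)
    + "</TABLEDATA>"
).encode("ascii")

sizes = {
    "csv_raw": len(csv_bytes),
    "csv_gz": len(gzip.compress(csv_bytes)),
    "xml_raw": len(xml_bytes),
    "xml_gz": len(gzip.compress(xml_bytes)),
}
for label, size in sizes.items():
    print("%-8s %8d bytes" % (label, size))
```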

As to reason 2, it's not terribly hard to tweak the VOTable parser to stream (TOPCAT, for instance, does it). I'd say that's work better spent.
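To illustrate what streaming could mean here, a minimal sketch with the standard library's `iterparse` follows. It is not how astropy or TOPCAT do it; `iter_rows` and the sample document are hypothetical, only the TABLEDATA serialization is handled, and cell values stay plain strings.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, hand-written VOTable in TABLEDATA serialization.
SAMPLE = b"""<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
<RESOURCE><TABLE>
<FIELD name="source_id" datatype="long"/>
<FIELD name="ra" datatype="double"/>
<DATA><TABLEDATA>
<TR><TD>1</TD><TD>10.5</TD></TR>
<TR><TD>2</TD><TD>20.25</TD></TR>
</TABLEDATA></DATA>
</TABLE></RESOURCE>
</VOTABLE>"""


def iter_rows(fileobj):
    """Yield each TR as a list of cell strings while the document is read."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "TR":
            yield [td.text or "" for td in elem]
            # Free the cells already handed out; the emptied TR element
            # itself stays attached to its parent.
            elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE)))
print(rows)
```

In real use the file object would be the HTTP response body rather than a `BytesIO`, so rows become available before the download has finished.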

However, I actually once used TAP's FORMAT with pyVO: This was a large catalog dump which I wanted to have in FITS (for seekability), and since it was ~1e9 records, I didn't want to funnel everything through astropy's VOTable parser and then to FITS writer. That might be a use case. Back then I did something like

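  from urllib.request import urlopen  # Python 3; this was urllib.urlopen in Python 2

  # svc is a pyvo TAPService; cols, low, high and dest_name are defined elsewhere.
  job = svc.submit_job(
      "SELECT {cols} from gdr2mock.main where"
      " source_id between {low} and {high}".format(**locals()),
      maxrec=20000000, format="fits")  # ask the service for FITS output
  try:
      job.run()
      job.wait()

      # Copy the result to disk in 10 MB chunks, bypassing the VOTable parser.
      with open(dest_name, "wb") as dest:
          src = urlopen(job.result_uri)
          while True:
              stuff = src.read(10000000)
              if not stuff:
                  break
              dest.write(stuff)
  finally:
      job.delete()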

I seem to remember I had to do one little tweak in pyVO to make this work; I could look it up if you want.

Anyway, given that VOTable is used for protocol signalling in TAP sync, I'm pretty sure changing the format should be an async-only thing, whatever we do.

Jun 11 '19 12:06 msdemlei

@msdemlei I agree with you on the merits of the VOTable format. That's the default and the standard says that all the TAP services must support it.

Those reasons aside, the standard also states that services can support other formats through the RESPONSEFORMAT (or FORMAT for backwards compatibility) keywords and I think that the library should support that explicitly with a method attribute and a returned result holder in the case of a sync request (internally it should probably check if the specified format is supported by the service before sending the request).

BTW, my impression is that RESPONSEFORMAT can be used in both sync and async: "The RESPONSEFORMAT parameter is used so the client can specify the format of the response (e.g. the output of the job). For DALI-sync requests, this is the content-type of the response. For DALI-async requests, this is the content-type of the result resource(s) the client can retrieve from the UWS result list resource;..." (http://www.ivoa.net/documents/DALI/20170517/REC-DALI-1.1.html#tth_sEc3.4.3). Am I misreading this?
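For concreteness, a sync request carrying RESPONSEFORMAT could be put together roughly as below. This is a sketch with the standard library only, not pyvo's API; `build_sync_request` is a hypothetical helper, the query is just an example, and whether a given service accepts `csv` is something its capabilities would have to confirm. Older TAP 1.0 services expect `REQUEST=doQuery` and `FORMAT` instead.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sync_request(base_url, adql, response_format=None, maxrec=None):
    """Build (but do not send) a POST request for a TAP sync endpoint."""
    params = {"LANG": "ADQL", "QUERY": adql}
    if response_format is not None:
        params["RESPONSEFORMAT"] = response_format
    if maxrec is not None:
        params["MAXREC"] = str(maxrec)
    body = urlencode(params).encode("ascii")
    # Passing data= makes urllib issue a POST.
    return Request(base_url.rstrip("/") + "/sync", data=body)


req = build_sync_request(
    "http://dc.g-vo.org/tap",
    "SELECT TOP 5 * FROM tap_schema.tables",
    response_format="csv",
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
# urllib.request.urlopen(req) would actually send it.
```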

I also agree that streaming is an entirely different issue.

Jun 11 '19 16:06 andamian

On Tue, Jun 11, 2019 at 09:51:46AM -0700, andamian wrote:

> Those reasons aside, the standard also states that services can support other formats through the RESPONSEFORMAT (or FORMAT for backwards compatibility) keywords and I think that the library should support that explicitly with a method attribute and a returned result holder in the case of a sync request (internally it should probably check if the specified format is supported by the service before sending the request).

The problem is that then a lot of the error checking goes haywire, because errors and overflows are, by the standard, signalled in magic VOTable elements. You can argue that that's questionable behaviour in the first place, but it's been what VO protocols have done since the very first one, and so we'll have to live with it I'm afraid.
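To make the "magic VOTable elements" concrete: per DALI, the result resource carries an INFO element named QUERY_STATUS with value OK or ERROR, and a truncated result gets a trailing INFO with value OVERFLOW after the table. The sketch below reads that status with the standard library; the two sample documents and `query_status` are invented for illustration and this is not how pyvo implements it. A csv or tsv response has no place to put this information.

```python
import xml.etree.ElementTree as ET

# Hypothetical responses, written by hand for this example.
SAMPLE_OVERFLOW = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE type="results">
    <INFO name="QUERY_STATUS" value="OK"/>
    <TABLE>
      <FIELD name="source_id" datatype="long"/>
      <DATA><TABLEDATA><TR><TD>1</TD></TR></TABLEDATA></DATA>
    </TABLE>
    <INFO name="QUERY_STATUS" value="OVERFLOW"/>
  </RESOURCE>
</VOTABLE>"""

SAMPLE_ERROR = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE type="results">
    <INFO name="QUERY_STATUS" value="ERROR">Unknown table: nosuch.table</INFO>
  </RESOURCE>
</VOTABLE>"""


def query_status(votable_xml):
    """Return (status, message) from the last QUERY_STATUS INFO element."""
    status, message = None, ""
    for elem in ET.fromstring(votable_xml).iter():
        is_info = elem.tag.rsplit("}", 1)[-1] == "INFO"
        if is_info and elem.get("name") == "QUERY_STATUS":
            # A later INFO (the trailing OVERFLOW marker) overrides an earlier OK.
            status, message = elem.get("value"), (elem.text or "").strip()
    return status, message


print(query_status(SAMPLE_OVERFLOW))
print(query_status(SAMPLE_ERROR))
```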

> BTW, my impression is that RESPONSEFORMAT can be used in both sync and async: "The RESPONSEFORMAT parameter is used so the client can specify the format of the response (e.g. the output of the job). For DALI-sync requests, this is the content-type of the response. For DALI-async requests, this is the content-type of the result resource(s) the client can retrieve from the UWS result list resource;..." (http://www.ivoa.net/documents/DALI/20170517/REC-DALI-1.1.html#tth_sEc3.4.3). Am I misreading this?

Not at all. RESPONSEFORMAT can be applied to sync as well, of course, and it does have a purpose there, too, in particular for curlbashware el-cheapo TAP clients. For instance, I have a shell function

function synctap() {
	tapurl=http://dc.g-vo.org/tap/sync 
	curl -s -FLANG=ADQL -FREQUEST=doQuery -FQUERY="$1" -FFORMAT=votable/td \
		"$tapurl" |\
	xmlstarlet fo | less
}

in my bashrc that quickly queries my TAP service, and I'm using FORMAT to make sure I'm getting xmlstarlet-formattable TABLEDATA (rather than BINARY) back.

But outside of quick hacks like that I'd not touch FORMAT in sync queries.

There is an aspect of FORMAT we might want to look into, though: If the TAP capabilities indicate that a service supports BINARY2 VOTables, I'd say we should use that; it's somewhat more robust than BINARY, but it needs to be explicitly requested, so if we don't do it, we won't get it. Doing that would also reduce the likelihood that stuff working in TOPCAT won't work with pyVO, because TOPCAT works the same way.
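A sketch of that capabilities check, using only the standard library, might look as follows. The capabilities document is a hand-written sample in the TAPRegExt style (outputFormat elements with mime and alias children); the exact MIME spelling a real service advertises is an assumption here, and `advertised_formats` and `pick_votable_format` are hypothetical helpers, not pyvo functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical capabilities document; real services may advertise other strings.
SAMPLE_CAPABILITIES = """<cap:capabilities
    xmlns:cap="http://www.ivoa.net/xml/VOSICapabilities/v1.0"
    xmlns:tr="http://www.ivoa.net/xml/TAPRegExt/v1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <capability standardID="ivo://ivoa.net/std/TAP" xsi:type="tr:TableAccess">
    <outputFormat>
      <mime>application/x-votable+xml</mime>
      <alias>votable</alias>
    </outputFormat>
    <outputFormat>
      <mime>application/x-votable+xml;serialization=BINARY2</mime>
      <alias>votable/b2</alias>
    </outputFormat>
    <outputFormat>
      <mime>text/csv</mime>
      <alias>csv</alias>
    </outputFormat>
  </capability>
</cap:capabilities>"""


def advertised_formats(capabilities_xml):
    """Return a list of (mime, [aliases]) for every outputFormat element."""
    formats = []
    for fmt in ET.fromstring(capabilities_xml).iter("outputFormat"):
        mime = (fmt.findtext("mime") or "").strip()
        aliases = [a.text.strip() for a in fmt.findall("alias") if a.text]
        formats.append((mime, aliases))
    return formats


def pick_votable_format(capabilities_xml, default="application/x-votable+xml"):
    """Prefer a BINARY2 VOTable MIME type if the service advertises one."""
    for mime, _aliases in advertised_formats(capabilities_xml):
        if "serialization=binary2" in mime.lower().replace(" ", ""):
            return mime
    return default


chosen = pick_votable_format(SAMPLE_CAPABILITIES)
print(chosen)
```

The same `advertised_formats` list could also back a check that a user-requested format such as csv is offered before the query is sent.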

Jun 11 '19 18:06 msdemlei

Not to derail this great conversation (I can see both sides of it to be honest, but it is part of the spec), we are also interested in the binary2 or binary format, just for a bit of performance. I'm not sure what that performance would actually be, when you throw in compression, but it has been talked about.

Jun 11 '19 18:06 cbanek

Like @msdemlei said, the VOTable Response contains essential metadata, so it wouldn't make sense to parse other formats.

> Could it be done seamlessly via astropy's table parsing? Just asking; I don't have any insight into what TAP should be able to read, etc.

It could be, if you take care of the requests obstacle.

Getting the astropy VOTable parser to yield rows would be at least a medium amount of work (I looked into this a while ago).

At row level, Python is way too slow, so it's only useful for small amounts of data anyway (in which case streaming or not makes no difference).

Referring to https://github.com/astropy/astropy/issues/6519#issuecomment-379423604

I guess there's no way around some C code, as Python is just too slow in general.

About getting a different output format, it can be done using the astropy table object.

Jun 15 '19 14:06 funbaker