Module import pattern causes randomly timed exceptions to be raised
This line, which allows people to do queries in the following way, causes significant bad side effects, as follows:
```python
from astroquery.simbad import Simbad
result = Simbad.query_object("HD12345")
# ... etc ...
```
Apologies if these words are a little imprecise, but because `Simbad` is a module-level instance rather than a class, the resulting "object" never drops out of scope: it lives as long as the import does, instead of being cleaned up like a normal class instance. This means that when the underlying connection goes stale or times out, an exception is raised at some seemingly random time in the future, causing whatever process is running to fall over with a very confusing error, as it doesn't relate to any code that is running at that point.
There are likely many other options, but a simple fix (which would, however, mean an API change) would, I think, be just to remove that line, so as to enforce people explicitly making class instances, which will then be cleaned up as one would expect, e.g. forcing
```python
from astroquery.simbad import SimbadClass
simbad = SimbadClass()
# ... etc ...
```
as the usage pattern. It might also work to simply have people call `simbad.Simbad()` rather than use `simbad.Simbad` (without the parens).
Finally, it might also be worth noting that this Simbad is an instance, not a class definition, and therefore has the wrong capitalization.
The actual traceback I'm seeing at random times is:
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 543, in query_object
response = self.query_object_async(object_name, wildcard=wildcard,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 580, in query_object_async
response = self._request("POST", self.SIMBAD_URL, data=request_payload,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 251, in _request
raise ex
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 235, in _request
response = super()._request(*args, **kwargs)
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/query.py", line 317, in _request
response = query.request(self._session,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/query.py", line 71, in request
return session.request(self.method, self.url, params=self.params,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/requests/adapters.py", line 553, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='simbad.u-strasbg.fr', port=80): Max retries exceeded with url: /simbad/sim-script (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb6a5c200a0>, 'Connection to simbad.u-strasbg.fr timed out. (connect timeout=60)'))
I don't understand how this can be happening. The existence of the Simbad object does not create any open connections to go stale; no HTTP connections are created at initialization time.
Beyond that, though, your suggested change constitutes a major backward-compatibility-breaking interface change that I would not support.
Ah, let me add: the traceback you showed is not at a random time, but is happening when query_object is called.
I think the connections are initialised, as you say, not at import time but when you use the object to query something; they then continue to live because the object hangs around in scope ~forever, being a module-level import rather than a class instance. Does that make sense?
I don't think so, but it's possible there's something I don't know about garbage collection playing a role here. You can always remove the Simbad object as you would any other, i.e., del Simbad. But the Simbad class does not create any persistent connections - there's no reason it would trigger a traceback like the one you showed except when Simbad.query_object is called.
OK, I'm happy to concede that the traceback is indeed only raised when I issue the query, using a pre-made object which is old. I tried adding del Simbad after each use and that did indeed fix it, which is what led us to dig further and find this pattern.
But that does feel like a significant issue to me. And a few Rubin folk thought that this was likely a module-import-looking object that's stopping the GC from sorting this out as it usually would.
Is it possible you could adopt the pattern
```python
from astroquery.simbad.core import SimbadClass

def my_thing_that_uses_simbad():
    simbad_instance = SimbadClass()
    ...
```
to avoid ever having the class-that-is-an-instance in scope? I'm not convinced yet that there is a difference between how object instances and classes are garbage collected, but I don't claim a comprehensive knowledge of how the garbage collector works (except that it generally comes once per week...).
Oh yes, that is exactly what I've done - without that all my tests on GitHub Actions were failing, so I couldn't wait for a fix here. I was just filing the issue because it feels like a bug that others will surely trip over, and it was really quite hard to track down, so I wanted to contribute to making things more robust.
Right, fair enough, that's helpful.
Any chance you could point us to a more complete MWE - e.g., one of the failing GitHub tests? It would be helpful to see the full workflow in which this failure mode is (was) happening.
Sure. I've got a very involved one, which I don't think is so helpful as it's basically the opposite of an MWE, and weirdly the MWE that @erykoff made here didn't actually reproduce the problem... Let me see if I can make that "work" (i.e. fail)
Thanks. Given that your failures are timeout errors, I wonder if part of the problem that led you to identify this as a random occurrence is a delayed error message?
A problem we frequently have is that upstream services are not perfectly reliable (or, sometimes, just the internet isn't), which leads to intermittent failures in some modules. There might be a way to make your tests more robust against connection failures by, e.g., caching the expected SIMBAD result early on in the tests such that the later calls don't need to make remote connections. If there's a fundamental issue in the garbage collection that's causing the problem, this won't help, but if it's just that sometimes the connection is timing out, it might at least help isolate the error.
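Roughly what I mean by caching, as an untested sketch (the fixture name and the final assertion are just placeholders):

```python
# Sketch: do the remote SIMBAD query once, up front, in a session-scoped pytest
# fixture, and let later tests work from the cached result so they never need
# their own network connection.
import pytest
from astroquery.simbad import Simbad


@pytest.fixture(scope="session")
def hd12345_table():
    # The only remote call in the test session; if SIMBAD is unreachable the
    # failure happens here, early and visibly, rather than mid-way through CI.
    return Simbad.query_object("HD12345")


def test_object_was_found(hd12345_table):
    # Works purely from the cached astropy Table; no network access needed.
    assert hd12345_table is not None and len(hd12345_table) == 1
```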
Sadly, and even more confusingly to me at the time, this was a 100% consistent failure, rather than something timing out from time to time. I do, however, only see it on GHA, and not when I run locally, or on our Jenkins CI builds, which I do find odd. I'm just working on the MWE now...
I will also be the first to admit that having actual test code reach out to the internet is a huge anti-pattern in general, but this was 3rd party code which I was trying to fix, rather than tests I was writing from scratch (it's more of a CI test than a unit test that's failing in this case).
I think you're seeing an example here of the issue described by @5j9 at https://github.com/psf/requests/issues/6064#issuecomment-1044011057. Specifically, what's happening is that Simbad holds an instance of requests.Session. After the first query is made using that Session, it holds open an HTTP connection to simbad.u-strasbg.fr ready for the next one; if something interrupts that connection, the Session will simply give up rather than attempting to retry.
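You can see that default no-retry behaviour on a plain Session (a small illustrative snippet, not taken from the linked test):

```python
# Illustrative only: a default requests.Session mounts HTTPAdapters with zero
# retries, so once the pooled keep-alive connection is interrupted, the next
# request fails immediately instead of being retried.
import requests

session = requests.Session()
adapter = session.get_adapter("http://simbad.u-strasbg.fr")
print(adapter.max_retries)  # Retry(total=0, ...) -> no transport-level retries
```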
You can see this in action at https://github.com/jdswinbank/Spectractor/blob/lsst-dev/tests/test_extractor.py#L36; this uses requests.Session directly (i.e., taking Astroquery out of the equation), and reliably fails; adding a retry causes it to pass.
What this doesn't answer is what's causing the connection between GitHub Actions and U. Strasbourg to be interrupted; like @mfisherlevine, I can't replicate that from my laptop, and I don't see the same interruptions when I connect from GitHub to other hosts (e.g. Google works fine).
As far as I can tell, the discussion of module import patterns, garbage collection, etc is all a bit of a red herring here. Any long-lived instance of SimbadClass (or anything else which holds and reuses a Session) would presumably exhibit the same behaviour, regardless of how it's created.
Potential solutions which are not Astroquery API changes would be to add a retry to the HTTPAdapter (e.g. following the pattern described by @5j9) or to catch the timeout within BaseQuery and create a new Session when needed. Or just not to worry about it, since this seems to be a weird corner case only being triggered between GHA and Strasbourg.
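For concreteness, the first of those could look something like the sketch below. Note that it reaches into the private `_session` attribute of BaseQuery (visible in the traceback above), so treat it as an illustration of the idea rather than a supported recipe:

```python
# Sketch of the retry idea from the linked requests issue: mount an HTTPAdapter
# configured with urllib3 retries on the session that the Simbad instance
# reuses, so a dropped keep-alive connection is retried instead of raising.
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

from astroquery.simbad import SimbadClass

simbad = SimbadClass()
# allowed_methods=None retries all verbs, including POST (urllib3 >= 1.26;
# older versions call this parameter method_whitelist).
retries = Retry(total=3, backoff_factor=1, allowed_methods=None)
adapter = HTTPAdapter(max_retries=retries)

# _session comes from astroquery's BaseQuery and is not public API.
simbad._session.mount("http://", adapter)
simbad._session.mount("https://", adapter)

result = simbad.query_object("HD12345")
```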
For randomly happening remote failures that only/mostly affect CI, I would suggest using pytest-rerunfailures.
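Typical usage, for reference, is either a command-line flag or a per-test marker; the test below is just a placeholder:

```python
# pytest-rerunfailures: rerun flaky tests automatically, either globally with
#   pytest --reruns 3 --reruns-delay 5
# or per test with the marker shown here.
import pytest
from astroquery.simbad import Simbad


@pytest.mark.flaky(reruns=3, reruns_delay=5)
def test_simbad_lookup():
    assert Simbad.query_object("HD12345") is not None
```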
That is a great thing to know about in general, thank you! Sadly though, this problem, while it's not seen everywhere (and is therefore inconsistent in that regard), seems to happen 100% of the time on GHA when the conditions are right to trigger it, so I don't think this fix will work.
Is it still an issue?