Module import pattern causes randomly timed exceptions to be raised
This line, which allows people to do queries in the following way, causes significant bad side effects, as follows:
```python
from astroquery.simbad import Simbad
result = Simbad.query_object("HD12345")
# ... etc ...
```
Apologies if these words are a little imprecise, but because `Simbad` is a module-level instance rather than a class, the resulting "object" never drops out of scope: it lives as long as the import does, instead of being cleaned up like a normal class instance. This means that when the underlying connection goes stale or times out, an exception is raised at some seemingly random time in the future, causing whatever process is running to fall over with a very confusing error, as it doesn't relate to any code that is running at that point.
There are likely many other options, but a simple fix (which would, however, mean an API change) would, I think, be just to remove that line, so as to enforce people explicitly making class instances, which will then be cleaned up as one would expect, e.g. forcing
```python
from astroquery.simbad import SimbadClass
simbad = SimbadClass()
# ... etc ...
```
as the usage pattern. It might also work to simply have people call `simbad.Simbad()` rather than use `simbad.Simbad` (without the parens).
Finally, it might also be worth noting that this Simbad is an instance, not a class definition, and therefore has the wrong capitalization.
The actual traceback I'm seeing at random times is:
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 543, in query_object
response = self.query_object_async(object_name, wildcard=wildcard,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 580, in query_object_async
response = self._request("POST", self.SIMBAD_URL, data=request_payload,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 251, in _request
raise ex
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/simbad/core.py", line 235, in _request
response = super()._request(*args, **kwargs)
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/query.py", line 317, in _request
response = query.request(self._session,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/astroquery/query.py", line 71, in request
return session.request(self.method, self.url, params=self.params,
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/share/miniconda/envs/test/lib/python3.8/site-packages/requests/adapters.py", line 553, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='simbad.u-strasbg.fr', port=80): Max retries exceeded with url: /simbad/sim-script (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fb6a5c200a0>, 'Connection to simbad.u-strasbg.fr timed out. (connect timeout=60)'))
I don't understand how this can be happening. The existence of the Simbad object does not create any open connections to go stale; no HTTP connections are created at initialization time.
Beyond that, though, your suggested change constitutes a major backward-compatibility-breaking interface change that I would not support.
Ah, let me add: the traceback you showed is not at a random time, but is happening when query_object is called.
I think the connections are initialised, as you say, not at import time but when you use the object to query something; they then continue to live because the object hangs around in scope ~forever, being a module-level import rather than a class instance. Does that make sense?
I don't think so, but it's possible there's something I don't know about garbage collection playing a role here. You can always remove the Simbad object as you would any other, i.e., del Simbad. But the Simbad class does not create any persistent connections - there's no reason it would trigger a traceback like the one you showed except when Simbad.query_object is called.
OK, I'm happy to concede that the traceback is indeed only raised when I issue the query, using a pre-made object which is old. I tried adding del Simbad after each use and that did indeed fix it, which is what led us to dig further and find this pattern.
But that does feel like a significant issue to me. And a few Rubin folk thought that this was likely a module-import-looking object that's stopping the GC from sorting this out as it usually would.
Is it possible you could adopt the pattern
```python
from astroquery.simbad.core import SimbadClass

def my_thing_that_uses_simbad():
    simbad_instance = SimbadClass()
    ...
```
to avoid ever having the class-that-is-an-instance in scope? I'm not convinced yet that there is a difference between how object instances and classes are garbage collected, but I don't claim a comprehensive knowledge of how the garbage collector works (except that it generally comes once per week...).
Oh yes, that is exactly what I've done - without that all my tests on GitHub Actions were failing, so I couldn't wait for a fix here. I was just filing the issue because it feels like a bug that others will surely trip over, and it was really quite hard to track down, so I wanted to contribute to making things more robust.
Right, fair enough, that's helpful.
Any chance you could point us to a more complete MWE - e.g., one of the failing GitHub tests? It would be helpful to see the full workflow in which this failure mode is (was) happening.
Sure. I've got a very involved one, which I don't think is so helpful as it's basically the opposite of an MWE, and weirdly the MWE that @erykoff made here didn't actually reproduce the problem... Let me see if I can make that "work" (i.e. fail)
Thanks. Given that your failures are timeout errors, I wonder if part of the problem that led you to identify this as a random occurrence is a delayed error message?
A problem we frequently have is that upstream services are not perfectly reliable (or, sometimes, just the internet isn't), which leads to intermittent failures in some modules. There might be a way to make your tests more robust against connection failures by, e.g., caching the expected SIMBAD result early on in the tests such that the later calls don't need to make remote connections. If there's a fundamental issue in the garbage collection that's causing the problem, this won't help, but if it's just that sometimes the connection is timing out, it might at least help isolate the error.
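Roughly what I mean by caching, as an untested sketch (the fixture name and the final assertion are just placeholders):

```python
# Sketch: do the remote SIMBAD query once, up front, in a session-scoped pytest
# fixture, and let later tests work from the cached result so they never need
# their own network connection.
import pytest
from astroquery.simbad import Simbad


@pytest.fixture(scope="session")
def hd12345_table():
    # The only remote call in the test session; if SIMBAD is unreachable the
    # failure happens here, early and visibly, rather than mid-way through CI.
    return Simbad.query_object("HD12345")


def test_object_was_found(hd12345_table):
    # Works purely from the cached astropy Table; no network access needed.
    assert hd12345_table is not None and len(hd12345_table) == 1
```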
Sadly, and even more confusingly to me at the time, this was a 100% consistent failure, rather than something timing out from time to time. I do, however, only see it on GHA, and not when I run locally, or on our Jenkins CI builds, which I do find odd. I'm just working on the MWE now...
I will also be the first to admit that having actual test code reach out to the internet is a huge anti-pattern in general, but this was 3rd party code which I was trying to fix, rather than tests I was writing from scratch (it's more of a CI test than a unit test that's failing in this case).
I think you're seeing an example here of the issue described by @5j9 at https://github.com/psf/requests/issues/6064#issuecomment-1044011057. Specifically, what's happening is that Simbad holds an instance of requests.Session. After the first query is made using that Session, it holds open an HTTP connection to simbad.u-strasbg.fr ready for the next one; if something interrupts that connection, the Session will simply give up rather than attempting to retry.
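You can see that default no-retry behaviour on a plain Session (a small illustrative snippet, not taken from the linked test):

```python
# Illustrative only: a default requests.Session mounts HTTPAdapters with zero
# retries, so once the pooled keep-alive connection is interrupted, the next
# request fails immediately instead of being retried.
import requests

session = requests.Session()
adapter = session.get_adapter("http://simbad.u-strasbg.fr")
print(adapter.max_retries)  # Retry(total=0, ...) -> no transport-level retries
```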
You can see this in action at https://github.com/jdswinbank/Spectractor/blob/lsst-dev/tests/test_extractor.py#L36; this uses requests.Session directly (i.e., taking Astroquery out of the equation), and reliably fails; adding a retry causes it to pass.
What this doesn't answer is what's causing the connection between GitHub Actions and U. Strasbourg to be interrupted; like @mfisherlevine, I can't replicate that from my laptop, and I don't see the same interruptions when I connect from GitHub to other hosts (e.g. Google works fine).
As far as I can tell, the discussion of module import patterns, garbage collection, etc is all a bit of a red herring here. Any long-lived instance of SimbadClass (or anything else which holds and reuses a Session) would presumably exhibit the same behaviour, regardless of how it's created.
Potential solutions which are not Astroquery API changes would be to add a retry to the HTTPAdapter (e.g. following the pattern described by @5j9) or to catch the timeout within BaseQuery and create a new Session when needed. Or just not to worry about it, since this seems to be a weird corner case only being triggered between GHA and Strasbourg.
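For concreteness, the first of those could look something like the sketch below. Note that it reaches into the private `_session` attribute of BaseQuery (visible in the traceback above), so treat it as an illustration of the idea rather than a supported recipe:

```python
# Sketch of the retry idea from the linked requests issue: mount an HTTPAdapter
# configured with urllib3 retries on the session that the Simbad instance
# reuses, so a dropped keep-alive connection is retried instead of raising.
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

from astroquery.simbad import SimbadClass

simbad = SimbadClass()
# allowed_methods=None retries all verbs, including POST (urllib3 >= 1.26;
# older versions call this parameter method_whitelist).
retries = Retry(total=3, backoff_factor=1, allowed_methods=None)
adapter = HTTPAdapter(max_retries=retries)

# _session comes from astroquery's BaseQuery and is not public API.
simbad._session.mount("http://", adapter)
simbad._session.mount("https://", adapter)

result = simbad.query_object("HD12345")
```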
For randomly happening remote failures that only/mostly affect CI, I would suggest using pytest-rerunfailures.
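Typical usage, for reference, is either a command-line flag or a per-test marker; the test below is just a placeholder:

```python
# pytest-rerunfailures: rerun flaky tests automatically, either globally with
#   pytest --reruns 3 --reruns-delay 5
# or per test with the marker shown here.
import pytest
from astroquery.simbad import Simbad


@pytest.mark.flaky(reruns=3, reruns_delay=5)
def test_simbad_lookup():
    assert Simbad.query_object("HD12345") is not None
```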
That is a great thing to know about in general, thank you! Sadly though, this problem, while it's not seen everywhere (and is therefore inconsistent in that regard), seems to happen 100% of the time on GHA when the conditions are right to trigger it, so I don't think this fix will work.
Is it still an issue?