Add method to differentiate e-mail outage from test failure
Investigate a way to distinguish between the signup/invite e-mail being missing (indicating a test failure) and no e-mails being present at all (indicating a Mailosaur outage).
I'm not sure offhand how long e-mails are retained in the Mailosaur system, but if it's an acceptably long time maybe we could have one static e-mail that we query for first to ensure the system is operational.
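A minimal sketch of that idea, assuming a promise-returning `fetchEmail(criteria)` wrapper around the Mailosaur client and a `SENTINEL_ADDRESS` whose mailbox permanently holds one known e-mail (both names are hypothetical, not their actual API):

```js
// Hypothetical address whose mailbox permanently holds one known e-mail.
const SENTINEL_ADDRESS = 'healthcheck@example.mailosaur.net';

// `fetchEmail` is a hypothetical promise-returning wrapper around the
// Mailosaur client; this is a sketch of the idea, not their actual API.
async function mailosaurIsOperational(fetchEmail) {
  try {
    // If the known static e-mail is retrievable, the service is up, so a
    // missing signup/invite e-mail would point to a test failure instead.
    const sentinel = await fetchEmail({ sentTo: SENTINEL_ADDRESS });
    return sentinel != null;
  } catch (err) {
    // Failing to fetch an e-mail we know exists suggests an outage.
    return false;
  }
}
```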
I had a thought on this that I haven't had a chance to try out yet. It wouldn't catch every possible Mailosaur outage, but if we wrapped their API call in a try/catch we should be able to capture the response code; a 5XX status means a server error, which is a pretty strong indication the issue is on their side, and we could handle it gracefully.
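A sketch of that, using the same hypothetical `fetchEmail` wrapper as above; the error's status-code property name is an assumption, since it depends on how their client surfaces HTTP errors:

```js
async function getSignupEmail(fetchEmail, criteria) {
  try {
    return await fetchEmail(criteria);
  } catch (err) {
    // Assumed property name; depends on how the client reports HTTP errors.
    const status = err.statusCode;
    if (status >= 500 && status < 600) {
      // 5XX means a server error on their side: handle gracefully, e.g.
      // mark the run inconclusive instead of failing the test outright.
      throw new Error(`Mailosaur outage suspected (HTTP ${status})`);
    }
    throw err; // anything else is more likely a genuine test failure
  }
}
```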
Smart plan — similar to retrying e2e tests on their first failure. Could include a few tries.
We actually already have retry logic in place, querying the API every 500ms for up to 60s. With PR #593 I increased that interval to 1500ms, because once I saw our request traffic in the Charles proxy I realized we were stacking up requests rather than waiting for each response to come back.
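For reference, a sketch of that polling shape (hypothetical `fetchEmail` wrapper again), with each request awaited before the next one is issued so they can't stack up:

```js
async function pollForEmail(fetchEmail, criteria, intervalMs = 1500, timeoutMs = 60000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // Await each response before sleeping, so requests never pile up.
    const email = await fetchEmail(criteria).catch(() => null); // treat errors as "not yet"
    if (email) return email;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`E-mail did not arrive within ${timeoutMs}ms`);
}
```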
That PR was spawned from my investigation into how to catch those 502 errors from their API. Unfortunately, I found that because their NodeJS library executes the requests asynchronously, it's not possible to catch the exception with a plain try/catch. They do actively throw an exception, though, which in turn stomps on any effort to simply handle the error code.
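To illustrate why, here's a self-contained demo of the failure mode (not their actual code): an exception thrown from inside an async callback unwinds a different call stack, so the surrounding try/catch has already returned by the time it fires.

```js
function callLibrary() {
  setTimeout(() => {
    // Stands in for the library throwing when it sees a 502 response.
    throw new Error('502 Bad Gateway');
  }, 0);
}

try {
  callLibrary();
} catch (err) {
  // Never reached: the throw happens on a later tick of the event loop,
  // after this try/catch has already exited.
  console.log('caught', err);
}
// In Node this surfaces as an uncaughtException and crashes the process.
```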
I've reached out to their support to see if they have an alternate suggestion before I fork their repo and submit a PR. I didn't want to go too far down that path before talking to them, because if they refused the merge we'd be stuck running and maintaining our own version of the library.
I got a response from Mailosaur support saying that they're aware of the 502 errors, and that our account is particularly susceptible to them, likely due to the large volume of e-mail flowing through our mailboxes. I then received a follow-up a few hours later saying they had found a latency issue on their side that they were able to resolve, which should hopefully fix our issues. So fingers crossed.
They also recommended looking at https://caolan.github.io/async/docs.html#retry for a method to catch the exception they're throwing. I'll investigate that if the errors start recurring, but for now I want to leave the system as-is to test out their fix.
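For future reference, the usage pattern from those docs looks roughly like this; `getEmailFromMailosaur` is a hypothetical node-style wrapper that calls back with `(err, email)`, and the times/interval values are placeholders. Note that `async.retry` retries tasks that report failure through their callback, so the wrapper would still need to convert the thrown exception into a callback error.

```js
const async = require('async');

function getEmailFromMailosaur(callback) {
  // Placeholder: a real version would query the Mailosaur API and call
  // back with (err, email); here we just simulate a successful fetch.
  callback(null, { subject: 'Welcome!' });
}

async.retry(
  { times: 5, interval: 1500 }, // placeholder values
  getEmailFromMailosaur,
  (err, email) => {
    if (err) {
      // Still failing after every attempt: treat as a probable outage.
      return console.error('Mailosaur outage suspected:', err);
    }
    // Otherwise continue with the normal assertions on the e-mail.
    console.log('Got e-mail:', email.subject);
  }
);
```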