Lost messages

Open bernardd opened this issue 7 years ago • 3 comments

I've been trying to pin down a possible cause for an issue we saw the other day: it looks like we made 18 calls to APNS.push/2 in quick succession, each with an associated on_response function and a 60-second timeout, but the response function was only called 14 times.

Looking at the code, it seems possible that if the Connection process has its connection closed by the remote end after accepting a set of events from the Worker GenStage process, but before all the results have been received, those events will be lost and on_response will never be called for them. Is that a reasonable conclusion, or am I missing something? (I started digging into Kadabra to try to answer the question myself, but I'm so many layers down at this point that I would doubt any conclusion I came to anyway.)
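To make the suspected failure mode concrete, here is a minimal sketch in plain Elixir. It does not use Pigeon or Kadabra code at all: `LostCallbackSketch`, `connection/1` and `run/2` are hypothetical names, and the "connection" is just a process that answers a fixed number of requests and then exits, as if the remote end had closed the socket.

```elixir
defmodule LostCallbackSketch do
  # Hypothetical stand-in for a connection process (NOT Pigeon/Kadabra code).
  # It answers the first `reply_limit` requests, then exits as if the remote
  # end had closed the socket. Anything still in its mailbox is dropped.
  def connection(reply_limit) do
    receive do
      {:push, id, on_response} when reply_limit > 0 ->
        on_response.({:ok, id})
        connection(reply_limit - 1)

      {:push, _id, _on_response} ->
        # Simulated remote close: exit without answering this request.
        exit(:normal)
    end
  end

  # Sends `total` requests and returns how many callbacks actually fired.
  # Assumes total > reply_limit, otherwise the connection never exits.
  def run(total, reply_limit) do
    parent = self()
    conn = spawn(fn -> connection(reply_limit) end)
    ref = Process.monitor(conn)

    for id <- 1..total do
      send(conn, {:push, id, fn resp -> send(parent, {:response, resp}) end})
    end

    # Wait for the connection to die; replies it sent arrive before :DOWN.
    receive do
      {:DOWN, ^ref, :process, ^conn, _reason} -> :ok
    after
      1_000 -> :timeout
    end

    count_responses(0)
  end

  defp count_responses(n) do
    receive do
      {:response, _} -> count_responses(n + 1)
    after
      0 -> n
    end
  end
end
```

With the numbers from our logs, `LostCallbackSketch.run(18, 14)` returns 14: the remaining four requests were accepted by the process but their callbacks are never invoked, and the caller gets no error either.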

Cheers.

bernardd, Jul 31 '18 07:07

You've actually addressed one of the issues I'm working toward fixing in subsequent pigeon/kadabra releases. Ideally pigeon wouldn't use GenStage at all, and kadabra would queue all of the requests, triggering an error callback on already dispatched pushes.

That said, APNS never closes its connections, and FCM doesn't close as long as you send something every minute or two. Theoretically something like this should rarely ever happen.
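As a rough illustration of the error-callback idea mentioned above, here is a sketch in plain Elixir. It is not kadabra's implementation or its planned API: `InFlightSketch`, the message shapes and the `:connection_closed` reason are all hypothetical. The point is only that a process which remembers every dispatched request can fail the unanswered ones when the connection goes away.

```elixir
defmodule InFlightSketch do
  # Hypothetical connection loop (NOT kadabra code). `pending` maps each
  # dispatched request id to the callback that is still owed a result.
  def loop(pending) do
    receive do
      {:push, id, on_response} ->
        loop(Map.put(pending, id, on_response))

      {:response, id} ->
        {callback, rest} = Map.pop(pending, id)
        if callback, do: callback.({:ok, id})
        loop(rest)

      :closed ->
        # Fail every request that was dispatched but never answered.
        pending
        |> Enum.sort_by(&elem(&1, 0))
        |> Enum.each(fn {id, callback} ->
          callback.({:error, id, :connection_closed})
        end)
    end
  end

  # Dispatches three requests, answers one, then closes the connection.
  def demo do
    parent = self()
    conn = spawn(fn -> loop(%{}) end)
    ref = Process.monitor(conn)
    report = fn result -> send(parent, {:result, result}) end

    for id <- 1..3, do: send(conn, {:push, id, report})
    send(conn, {:response, 1})
    send(conn, :closed)

    # Results sent by `conn` arrive before its :DOWN message.
    receive do
      {:DOWN, ^ref, :process, ^conn, _reason} -> :ok
    after
      1_000 -> :timeout
    end

    collect([])
  end

  defp collect(acc) do
    receive do
      {:result, result} -> collect([result | acc])
    after
      0 -> Enum.reverse(acc)
    end
  end
end
```

In this sketch every dispatched request gets exactly one callback, either `{:ok, id}` or `{:error, id, :connection_closed}`, so the caller can retry instead of waiting on a response that will never come.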

Couple questions:

  • Are you using the latest kadabra v0.4.3?
  • Could you post a snippet of how you're testing it?

hpopp, Jul 31 '18 15:07

Hi. I work with @bernardd and can answer your questions.

> Are you using the latest kadabra v0.4.3?

We updated to 0.4.3 yesterday, but that hasn't made it onto our production servers yet.

> Could you post a snippet of how you're testing it?

Right now, we are working from reconstructions based on our production logs. I will talk to Bernard and see if we could put together a synthetic test.

toland, Jul 31 '18 19:07

> Theoretically something like this should rarely ever happen.

I should add that we know for sure it has happened once, but can't say with any confidence if it has happened more than that. Another wrinkle is that we have been having some internal networking issues that may have caused the socket to close prematurely. Isn't software fun? 😄

toland, Jul 31 '18 19:07