squid icon indicating copy to clipboard operation
squid copied to clipboard

Fix FD accept() error handling

Open yadij opened this issue 4 years ago • 4 comments

FD begin their lifecycle belonging to TcpAcceptor which is responsible for closing per Comm::Connection API.

That means FD which have issues within the TcpAcceptor logic should be close()'d with a specific ERROR or WARNING message displayed about the reason for closure.

Do not enter connections for orphanage until their ownership is passed into the async queue for a Subscriber.

yadij avatar Jun 08 '21 06:06 yadij

PR description: Do not enter connections for orphanage until their ownership is passed into the async queue for a Subscriber.

IMO, this PR is a step backwards and should not be merged: The proposed code is less exception safe (exceptions may lead to bogus complains about orphaned connections) and is unnecessary explicit (requiring more thought/maintenance/etc. instead of properly relying on RAII).

This PR is properly using RAII. When exceptions happen the connection is closed by RAII implicit destruct and claims of bug happening are logged.

The rest of my comments are essentially attempts at clarifications.

FD begin their lifecycle belonging to TcpAcceptor which is responsible for closing per Comm::Connection API.

The responsibility to close does not imply the need to close explicitly.

The collection of seriously nasty issues represented by Bug 3329 are what impose that requirement. To catch such issues we have owners responsibly calling close() and the RAII implicit case logging a bug situation to be identified and fixed.

"Belonging" is not a very accurate adjective in this context IMO. A Connection actually does not belong to any job when it is born -- it is always born orphaned. And this particular job never really accepts the ownership of the connection it creates.

False. TcpAcceptor is the owner until it calls notify() to begin delivery to some Subscriber, or when it drops its local reference. This PR is about making those two points be the "safe" exits from TcpAcceptor.

That means FD which have issues within the TcpAcceptor logic should be close()'d with a specific ERROR or WARNING message displayed about the reason for closure.

Yes, errors and warnings are needed in most cases, but no explicit closure is required.

see Comm::~Connection() code. Owners either call close() explicitly or bogus log entries about being an orphan are recorded by RAII implicit closure.

I would dearly like to let RAII implicit closure not be a worry. But we still have issues with fd_table and signals causing things like two samples below (yes, until recently server connection orphans were easier to replicate):

" Bug 3329: Christos Tsantilas 2015-04-24 11:05:29 UTC

A description of the problem:

When an idle server-side pinned connection closes then:

  • The ConnStateData::clientPinnedConnectionRead, the server side pinned connection read handler, will be called with the io.flag set to Comm::ERR_CLOSING. The server-side Comm::Connection object will not closed inside this method in this case, the close handler expected to do the final clean up

  • The ConnStateData::clientPinnedConnectionClosed close handler will be called but it does not close the server-side Comm::Connection object too.

  • The related fde object is marked as non open fde::flags.open = false, but the Comm::Connection server-side object is still marked as valid. "

" Christos Tsantilas 2015-06-02 11:09:51 UTC

I am also attaching an actions sequence which can cause this bug:

  • Ssl::PeerConnector successfully connects to SSL server.
  • The ConnStateData::httpsPeeked called and pins the connection
  • A pinned squid-to-server connection expired and closes.
  • There is not any timeout Handler so the connection is closed with comm_close(fd).
  • The FD is reused by ICAP or HTTP or other client connection
  • The ConnStateData object of the pinned connection destroyed so the ConnStateData::pinning::serverConnection destructor called which closes an FD which is uded by other objects
  • Bad thinks happens. "

Do not enter connections for orphanage until their ownership is passed into the async queue for a Subscriber.

The "enter orphanage when ownership is passed from X to Y" reads like an oxymoron to me: A connection is either an orphan or is owned.

Between TcpAcceptor and FooServer Jobs there is an indefinite amount of time where the queue of AsyncCall are waiting to be dialed. If things go bad during that period Bug 3329 applies: "Bad Things Happen" and an orphan dies.

The same could be true of the periods spent between other Jobs too.

yadij avatar Jun 08 '21 17:06 yadij

IMO, this PR is a step backwards and should not be merged: The proposed code is less exception safe (exceptions may lead to bogus complains about orphaned connections) and is unnecessary explicit (requiring more thought/maintenance/etc. instead of properly relying on RAII).

When exceptions happen the connection is closed by RAII implicit destruct and claims of bug happening are logged.

Yes, the connection will be closed by RAII, but exceptions are not bugs, so any logged "claims of bug" would be wrong. This problem does not exist in the official code, so introducing such log records is a step backwards.

The rest of my comments are essentially attempts at clarifications.

FD begin their lifecycle belonging to TcpAcceptor which is responsible for closing per Comm::Connection API.

The responsibility to close does not imply the need to close explicitly.

The collection of seriously nasty issues represented by Bug 3329 are what impose that requirement.

I disagree that connection ownership problems (that still do exist) require or force an explicit connection closure (in this PR context). The whole idea of supporting a known orphanage state is to prevent the need for many such closures (and most explicit closures are a problem rather than a desirable requirement or goal because sooner or later they become missing, forgotten, or misplaced).

To catch such issues we have owners responsibly calling close() and the RAII implicit case logging a bug situation to be identified and fixed.

There are two very different implicit closure cases: A known/expected orphan and an unexpected orphan. The former is not a "bug situation" -- there is nothing to "fix" there. The latter is indeed a bug that should be triaged and fixed. The official code (in this PR scope) creates known orphans. That is not a bug.

"Belonging" is not a very accurate adjective in this context IMO. A Connection actually does not belong to any job when it is born -- it is always born orphaned. And this particular job never really accepts the ownership of the connection it creates.

False. TcpAcceptor is the owner until it calls notify() to begin delivery to some Subscriber, or when it drops its local reference.

It is a matter of definition -- the concept of "ownership" is not well defined in this scope. Our use of the words "own" and "belong" differ. Neither usage is "true" or "false" or, to be more precise, each is either true or false depending on the definition in use by the reader.

This PR is about making those two points be the "safe" exits from TcpAcceptor.

They are already (equally) safe in the official code.

see Comm::~Connection() code. Owners either call close() explicitly or bogus log entries about being an orphan are recorded by RAII implicit closure.

Please interpret my comments as if I am well aware of that code. AFAICT, you are missing the fact that some orphans (those marked with COMM_ORPHANED) are expected, and those orphans are logged as such. Being an orphan is not a bug per se. Only unexpected orphans point to (and logged as) bugs.

Between TcpAcceptor and FooServer Jobs there is an indefinite amount of time where the queue of AsyncCall are waiting to be dialed. If things go bad during that period Bug 3329 applies: "Bad Things Happen" and an orphan dies.

That part of the discussion is outside this PR scope because this PR does not change the orphaned state of the Connection during async queuing time -- the queued connection was orphaned before this PR and remains orphaned after this PR. As you know, we agree that there are many Connection ownership problems in Squid code (including in-scope code). I am working on a comprehensive solution to address them, but all that does not affect this PR.

rousskov avatar Jun 08 '21 17:06 rousskov

AFAICT, you are missing the fact that some orphans (those marked with COMM_ORPHANED) are expected, and those orphans are logged as such. Being an orphan is not a bug per se. Only unexpected orphans point to (and logged as) bugs.

Sigh. (A) expected crash/orphan vs (B) unexpected crash/orphan. Which is better ? neither. So why call code (C) [explicit close()] that prevents a crash entirely worse and block its creation in favour of (A).

yadij avatar Jun 09 '21 13:06 yadij

(A) expected crash/orphan vs (B) unexpected crash/orphan. Which is better ? neither. So why call code (C) [explicit close()] that prevents a crash entirely worse and block its creation in favour of (A).

The above logic assumes that expected orphans lead to crashes. That assumption is false -- bugs notwithstanding, expected orphans do not lead to crashes. That is the primary difference between an unexpectedly orphaned Connection (a bug) and a purposefully orphaned one (an intended behavior that is expected to be harmless). Are there bugs? Yes! Does this PR fix bugs? None were identified so far.

Connection lifetime management is a difficult subject related to many misconceptions in the official code and developer knowledge. I have studied the issue, can explain what is actually going on in most cases, can help resolve most misunderstandings, and am working on a comprehensive solution. We can fix specific bugs and should finalize a comprehensive long-term solution. AFAICT, this PR currently does neither and complicates future application of any comprehensive solution.

I suggest closing this spur-of-the-moment PR and revisiting the issue when the comprehensive solution is ready for your review. This will save time and give us a much better chance of success than trying to find bugs and negotiate their fixes in this PR environment, where you are not trying to fix a specific bug (as in "here is a test case that crashes official Squid and that this PR fixes"), and we do not even have a common vocabulary to talk about these complicated issues!

If you insist, we do not have to actually close this PR; we can just suspend this discussion instead.

If none of the above is acceptable to you, then I see no other option than to request a precise description of a specific "crash" that this PR is allegedly "preventing". Referring to bug reports about already fixed crashes is not sufficient, of course, especially when those bug reports are dealing with a different underlying problem. If a specific crash is identified, then we will be able to discuss whether this PR needs any adjustments to properly address that specific issue. I bet this PR fixes no crashes -- the "crash" you are thinking about either does not exist or is not addressed by this PR -- but, if you insist, you can force me to evaluate bug candidates.

rousskov avatar Jun 09 '21 15:06 rousskov

Closing as superseded by PR #1238

yadij avatar Jan 13 '23 02:01 yadij