james-project icon indicating copy to clipboard operation
james-project copied to clipboard

JAMES-3799 Optimize memory requirements of SimpleMessageSearchIndex

Open ottoka opened this issue 3 years ago • 13 comments

DRAFT: This adds a step to analyze the provided SearchQuery criteria and sort options to figure out the highest required mail fetch type.

Someone please check my classification in getFetchTypeForCriterion() and getFetchTypeForSort() - I have not worked with these before and am not sure I got them all right.

Unfortunately, I cannot really test this, as the InMemoryMessageMapper used by SimpleMessageSearchIndexTest seems to ignore the fetch type completely. Any suggestions? :(

ottoka avatar Aug 08 '22 13:08 ottoka

Read it.

quantranhong1999 avatar Aug 09 '22 04:08 quantranhong1999

Latest push includes suggestions so far.

Re testability, I also added a second commit to make InMemoryMessageMapper honor the fetch type option - which immediately broke SimpleMessageSearchIndexTest.sortOnSentDateShouldWork(). Seems the UID optimization branch also needs a maxFetchType based on requested sort order. So apparently this was a worthwhile addition in itself. Lets see if Jenkins finds more affected tests :D

ottoka avatar Aug 09 '22 10:08 ottoka

OK, turns out honoring the fetch type breaks MemoryMessageMapperTest (and MemoryMessageWithAttachmentMapperTest), in different ways:

  1. The messagesRetrieved... tests want bodyStartOctet to stay the same, regardless whether there actually is a body (HEADERS), or whether there are any preceding headers in the content at all (BODY). Effectively, have the messages lie about what they actually have in their content byte array. I can change the code to make it so, but I worry what happens if someone actually wants to retrieve the content?
  2. Some messagesRetrieved... tests basically want BODY and FULL to be the same thing in regard to findInMailbox(). A quick comparison shows that CassandraMessageMapper does exactly that, so fine. And incidentally, JPAMessageMapper also ignores the fetch type, meaning it will use a lot of memory on bulk search/retrieve operations. Oh well.
  3. The deleteMessages... tests are doing something strange during AbstractMessageMapper.updateFlag(). They first retrieve messages with fetch type METADATA and the save them via InMemoryMessageMapper.save(), which copies the truncated message and thus winds up with a broken message in the store?? So, this mechanism actually relies on MessageMappers to ignore the fetch type???

At this point things are getting too weird for me, and I am not sure how all this is actually supposed to behave. Help please?

ottoka avatar Aug 09 '22 14:08 ottoka

Help please?

I will try to look at it in the coming week.

Effectively, have the messages lie about what they actually have in their content byte array.

No they are supposed to tell what the underlying data is regardless of what they did actually load.

Tests basically want BODY and FULL to be the same thing in regard to findInMailbox().

Good catch. Maybe we should just drop Body fetch type and use FULL instead, which would simplify the code, and avoid some confusions. We don't use it (FetchType.BODY) so it effectively looks like accidental complexity we might want get rid of...

That would simplify your issue?

The deleteMessages... tests are doing something strange during AbstractMessageMapper.updateFlag(). They first retrieve messages with fetch type METADATA and the save them via InMemoryMessageMapper.save(), which copies the truncated message and thus winds up with a broken message in the store?? So, this mechanism actually relies on MessageMappers to ignore the fetch type???

What you describe looks like a behavior specific to the in-memory implementation (ignoring fetch type, etc..). So ok memory mailbox, implemented for testing purposes, might benefit from a rework...

turns out honoring the fetch type breaks MemoryMessageMapperTest (and MemoryMessageWithAttachmentMapperTest), in different ways

Do we care about memory honoring the fetch type stuff? Ok, it makes it closer to real implementations (cassandra / JPA) meaning catching some issues earlier in the development cycle, but integration tests (here MPT IMAP tests?) still get us covered. While making memory fetch type aware the way Cassandra implem is would be a contribution I would support, I do not know if I would consider it 'essential'

save them via InMemoryMessageMapper.save(), which copies the truncated message and thus winds up with a broken message in the store?? So, this mechanism actually relies on MessageMappers to ignore the fetch type???

We could adapt "save" behaviour to actually update message metadata instead of replacing it, if the message is already present?

chibenwa avatar Aug 10 '22 04:08 chibenwa

Effectively, have the messages lie about what they actually have in their content byte array.

No they are supposed to tell what the underlying data is regardless of what they did actually load.

Thanks for the explanation. In this case, I am OK with the necessary changes and will adapt the code accordingly.

Tests basically want BODY and FULL to be the same thing in regard to findInMailbox().

Good catch. Maybe we should just drop Body fetch type and use FULL instead, which would simplify the code, and avoid some confusions. We don't use it (FetchType.BODY) so it effectively looks like accidental complexity we might want get rid of...

That would simplify your issue?

On a quick look I saw that BODY and FULL are different things in the Cassandra Mapper when it loads the message into memory. I do wonder what the use case is for having a message with full (potentially large) body but without the headers. BODY is used in a lot of tests, and I cannot really tell if the difference is significant there, or if they really do want FULL and are confused by the distinction. In the latter case, I agree it makes sense to drop BODY, but that will be a bigger/different ticket.

The deleteMessages... tests are doing something strange during AbstractMessageMapper.updateFlag(). They first retrieve messages with fetch type METADATA and the save them via InMemoryMessageMapper.save(), which copies the truncated message and thus winds up with a broken message in the store?? So, this mechanism actually relies on MessageMappers to ignore the fetch type???

What you describe looks like a behavior specific to the in-memory implementation (ignoring fetch type, etc..). So ok memory mailbox, implemented for testing purposes, might benefit from a rework...

turns out honoring the fetch type breaks MemoryMessageMapperTest (and MemoryMessageWithAttachmentMapperTest), in different ways

Do we care about memory honoring the fetch type stuff? Ok, it makes it closer to real implementations (cassandra / JPA) meaning catching some issues earlier in the development cycle, but integration tests (here MPT IMAP tests?) still get us covered. While making memory fetch type aware the way Cassandra implem is would be a contribution I would support, I do not know if I would consider it 'essential'

Agreed it is not essential, but it DID catch a possible bug, so I believe it is worth doing. After your explanations I have a better understanding of the issues involved and feel more comfortable in doing the necessary code changes.

save them via InMemoryMessageMapper.save(), which copies the truncated message and thus winds up with a broken message in the store?? So, this mechanism actually relies on MessageMappers to ignore the fetch type???

We could adapt "save" behaviour to actually update message metadata instead of replacing it, if the message is already present?

Again, thanks for the clarification. I compared with JPAMessageMapper, which does exactly that, i.e. re-fetch an existing message to update its flags and modseq. So I will do this for InMemoryMessageMapper as well, which should fix the remaining test failures in MemoryMessageMapperTest.

ottoka avatar Aug 10 '22 08:08 ottoka

I do wonder what the use case is for having a message with full (potentially large) body but without the headers.

That's the point, this use case do not really exist...

but that will be a bigger/different ticket.

+1

chibenwa avatar Aug 10 '22 09:08 chibenwa

This is interesting. In the latest version, InMemoryMessageMapper does lookup up an existing message by mailbox+UID and updates it if it exists. This breaks MemoryMessageIdMapperTest>MessageIdMapperTest.findMailboxesShouldReturnTwoMailboxesWhenMessageExistsInTwoMailboxes(). Turns out that when the test saves its original four messages, it puts message4 into mailbox 2. During this the message4 gets UID 1 from the AbstractMessageMappers built-in InMemoryUidProvider, since the latter knows about mailboxes and it was empty before, so UID 1 is fine. But then the test creates and saves copy of message1, and gives it a UID via the MessageUidProvider from the tests own InMemoryMapperProvider. MessageUidProvider is a different implementation that does not now about mailboxes but provides UIDs based on a simple counter. So message1copy also gets UID 1 by accident. When trying to save it in mailbox 2, it already finds message4 with UID 1 there due to the UID conflict, and updates it instead, which breaks the test. In contrast, the original implementation just overwrites message4 with message1copy in this case, which seems like a bug to me!

Now I wonder if this kind of UID generator conflict can happen in production as well? I had thought a UID would be unique wherever it came from. Or is it again an artifact of the simplified InMemoryXXX implementations? In the first case we should dig into the issue some more. In the second case, I believe the test should be adjusted to prevent the conflict, i.e. just fast-forwarding the tests MessageUidProvider to a non-conflicting UID. Any suggestions?

ottoka avatar Aug 10 '22 14:08 ottoka

UID (as defined per IMAP) are mailbox scopped...

So copy/move the mail should allocate a new UID, using the UID generation mechanism of the target mailbox.

Yes maybe it would be good to re-implement InMemoryMessageMappers without depending on AbstractMessageMapper (thus we could handle all subscenari in smarter ways....

chibenwa avatar Aug 10 '22 15:08 chibenwa

Unfortunately I do not understand the implications of removing BODY fetch type enough to tackle it :(

ottoka avatar Aug 11 '22 08:08 ottoka

Making InMemoryMessageMapper honor the fetch type seems to be a game of whack-a-mole. Issues crop up in parts of James I have never touched before, and I don't know how to handle them. It has reached a point where it distracts me too much from the original purpose of the PR, and the effort of making it work outweigh its usefulness to me.

Do we care about memory honoring the fetch type stuff? Ok, it makes it closer to real implementations (cassandra / JPA) meaning catching some issues earlier in the development cycle, but integration tests (here MPT IMAP tests?) still get us covered. While making memory fetch type aware the way Cassandra implem is would be a contribution I would support, I do not know if I would consider it 'essential'

I will follow your reasoning here and will regrettably remove the InMemoryMessageMapper commit from this PR. If you are interested in this topic, I can keep it in a different branch in my personal repo, or create a separate PR here.

ottoka avatar Aug 11 '22 12:08 ottoka

Separate branch looks the way to go!

chibenwa avatar Aug 11 '22 13:08 chibenwa

I put the InMemoryMessageMapper changes into https://github.com/ottoka/james-project/tree/JAMES-3799-MemoryMapper

ottoka avatar Aug 11 '22 13:08 ottoka

I'll try to get my hands on it, when I got time!

chibenwa avatar Aug 11 '22 13:08 chibenwa