activemq-artemis icon indicating copy to clipboard operation
activemq-artemis copied to clipboard

ARTEMIS-3057 Add min-disk-free feature

Open haanhvu opened this issue 4 years ago • 21 comments

fixes https://issues.apache.org/jira/browse/ARTEMIS-3057

@clebertsuconic please review

haanhvu avatar Oct 24 '21 19:10 haanhvu

@clebertsuconic I got BUILD SUCCESS with mvn -Pdev install -Derrorprone

But now I realize that my implementation lets the broker measure both max-disk-usage and min-free-disk-bytes at the same time. That can cause some problems:

  • What if max-disk-usage is violated but min-free-disk-bytes is not violated (or vice versa)? Will the broker block or not?
  • Why forces users to set two parameters for the same purpose?

For the solution, I'm thinking of two options:

  • Allow only either max-disk-usage or min-free-disk-bytes activated. If going this route, we need two FileStoreMonitor constructor options, and checking that broker.xml sets only one of them.
  • Allow both max-disk-usage or min-free-disk-bytes activated, however only when both are violated the system will block. We can do this by editing the tick() method.

What do you think?

haanhvu avatar Oct 25 '21 12:10 haanhvu

I would probably make them mutually exclusive personally, as they will most likely never agree exactly on when to block and unblock so it seems odd to set both. Although that could be awkward since max-disk-usage has a built in default, so its currently 'almost always set' in a way, and your PR currently proposes having a default for the 'min available' too.

If you allow them both to be configured then it really has to start blocking when either is breached, or else one of them simply wouldn't be doing what was asked. It would have to ensure both are not breached before unblocking though, or again it wouldnt be doing what one of them asked.

Since the existing functionality uses 'max-disk-usage' as a name, I think something like 'min-disk-...' would be more consistent than 'min-free-disk-bytes' is. I wouldn't put the 'bytes' units in the name, allowing for more friendly unit-specifying strings to also be supplied, e.g. as is supported for the "global-max-size".

If allowing for units, a different route and obvious simplification to avoid needing to consider any conflicting 'max usage' and 'min available' configurations would be to instead just extend what the existing 'max-disk-usage' can do and allow it to specify the actual usage amount rather than a percentage of the total. Values without units (or even just values <=100) could retain the existing behaviour as a percentage of total, whilst enabling configuration strings with units to pass a literal size for the actual disk usage amount. (That could still be done even if adding 'min available' functionality, but returns us to the question of whether to allow both to be set and what to do with them)

gemmellr avatar Oct 25 '21 14:10 gemmellr

I would probably make them mutually exclusive personally, as they will most likely never agree exactly on when to block and unblock so it seems odd to set both. Although that could be awkward since max-disk-usage has a built in default, so its currently 'almost always set' in a way, and your PR currently proposes having a default for the 'min available' too.

To handle that I guess I would leave the existing max-disk-usage default to retain existing behaviour, and remove any default for the new 'min available' config. That way you know it was configured explicitly if it is set, and we could choose to document that it takes precedence over the max-disk-usage setting, i.e they can choose to use 'max-disk-usage' or to instead use 'min-disk-...'

gemmellr avatar Oct 25 '21 14:10 gemmellr

@gemmellr

Since the existing functionality uses 'max-disk-usage' as a name, I think something like 'min-disk-...' would be more consistent than 'min-free-disk-bytes' is. I wouldn't put the 'bytes' units in the name, allowing for more friendly unit-specifying strings to also be supplied, e.g. as is supported for the "global-max-size".

Agree. I actually thought unit-specified strings would be better too. About the name, does "min-free-disk" sound better?

If allowing for units, a different route and obvious simplification to avoid needing to consider any conflicting 'max usage' and 'min available' configurations would be to instead just extend what the existing 'max-disk-usage' can do and allow it to specify the actual usage amount rather than a percentage of the total.

This is of course a simpler way. However, it will force users to make a calculation of the actual usage amount. Also, users will need to update this value every time their disk size increases. With "min-free-disk", users don't have to do those. So I still vote for "min-free-disk" as a better choice.

To handle that I guess I would leave the existing max-disk-usage default to retain existing behaviour, and remove any default for the new 'min available' config. That way you know it was configured explicitly if it is set, and we could choose to document that it takes precedence over the max-disk-usage setting, i.e they can choose to use 'max-disk-usage' or to instead use 'min-disk-...'

I think this approach is the optimal one now. I'll implement it. If I'm stuck somewhere, I'll ping you.

Do you agree?

haanhvu avatar Oct 26 '21 16:10 haanhvu

@gemmellr

Since the existing functionality uses 'max-disk-usage' as a name, I think something like 'min-disk-...' would be more consistent than 'min-free-disk-bytes' is. I wouldn't put the 'bytes' units in the name, allowing for more friendly unit-specifying strings to also be supplied, e.g. as is supported for the "global-max-size".

Agree. I actually thought unit-specified strings would be better too. About the name, does "min-free-disk" sound better?

I was suggesting consistency/alignment between 'max-disk-...' and 'min-disk-...' so 'min-free-disk' misses the mark there slightly for me. Perhaps 'min-disk-free'.

(Obviously these often arent 'disks' these days...but its still a generally used name for storage)

If allowing for units, a different route and obvious simplification to avoid needing to consider any conflicting 'max usage' and 'min available' configurations would be to instead just extend what the existing 'max-disk-usage' can do and allow it to specify the actual usage amount rather than a percentage of the total.

This is of course a simpler way. However, it will force users to make a calculation of the actual usage amount.

Not really an arduous task though, specfying a known amount less than a known amount.

. Also, users will need to update this value every time their disk size increases. With "min-free-disk", users don't have to do those. So I still vote for "min-free-disk" as a better choice.

This is clearly true, hence I'm also happy to have both. Personally I would probably change the max usage bits as discussed either way.

(I would say though that the typical response to hitting such a value, which shouldnt really be getting hit, is often not going to be changing the size of storage).

To handle that I guess I would leave the existing max-disk-usage default to retain existing behaviour, and remove any default for the new 'min available' config. That way you know it was configured explicitly if it is set, and we could choose to document that it takes precedence over the max-disk-usage setting, i.e they can choose to use 'max-disk-usage' or to instead use 'min-disk-...'

I think this approach is the optimal one now. I'll implement it. If I'm stuck somewhere, I'll ping you.

Do you agree?

Seems a reasonable approach to me. Not sure what others think. @clebertsuconic @jbertram ?

gemmellr avatar Oct 26 '21 17:10 gemmellr

I definitely prefer leaving the existing behavior as-is and adding a new configuration parameter. Changing defaults in a minor a release typically results in a poor user experience.

Also, I agree with Robbie about aligning the name with max-disk-usage so my vote would be to use min-disk-free.

Lastly, I agree with Robbie that in case both max-disk-usage and min-disk-free are set then you must start blocking as soon as either one is reached. I think it would also be good to log a WARN message if both are configured stating that the configuration is ambiguous and the most conservative value will be used.

jbertram avatar Oct 26 '21 19:10 jbertram

The issue with respecting both if set is that the existing max-disk-usage has a default and so its essentially always on even without config, unless manually disabled with config, plus the two settings are very likely to disagree on when to block if both have values.

Fair enough if it logs a warning about that, but it seems like it just complicates things to try handlng both, only to almost certainly not give actual desired behaviour, plus always warning about both being set. Thats effectively just requiring users must always explicitly manually configure a disabling value for max-disk-usage in addition to configuring a value for min-disk-free, in order to get the desired behaviour and suppress the warning.

It seems more straight forward to me for both impl and users to just say min-disk-free always overrides max-disk-usage when it is configured, i.e you are explicitly choosing to use that feature over the other by setting it to anything.

gemmellr avatar Oct 26 '21 21:10 gemmellr

I agree, min-disk-free should override max-disk-usage.. If both are specified we could log a warn.

(sorry I have been not much active this week.. I had a few personal issues, but I will come backing this).

clebertsuconic avatar Oct 26 '21 23:10 clebertsuconic

I'm amending the PR as we discussed. Please wait.

haanhvu avatar Nov 01 '21 13:11 haanhvu

@clebertsuconic @gemmellr @jbertram

I amended the PR as we discussed. Please review. (Sorry I had an important certificate exam last week, so I didn't have much time on the PR).

I got BUILD SUCCESS with mvn -Pdev install -Derrorprone. Not sure if it's the same with the Github workflow. Please help activate the workflow.

Also, I feel like I need to add more tests, especially the configuration tests, for the overriding behavior. But I'm not sure which test cases should be prioritized here. Any advice?

haanhvu avatar Nov 01 '21 20:11 haanhvu

this Pull Request is in good shape for your outreachy application already.. thanks for this...

I may still need some minor tweaks before merging.. but for the purpose of your application this is good enough already.

I will merge it later today... or if there's some minor changed needed I will let you know. (I could do it myself if it's really simple)

clebertsuconic avatar Nov 02 '21 13:11 clebertsuconic

@haanhvu Perhaps Wildfly is using this API on their integration. I'm adding a commit to deprecate the older method and adding a test.

I'm just going to run the complete test suite on my CI. and I will look into merge this soon.

clebertsuconic avatar Nov 02 '21 21:11 clebertsuconic

Can you check on testSimpleTickforMinDiskFree(). it's always failing.

clebertsuconic avatar Nov 02 '21 21:11 clebertsuconic

another real failure is org.apache.activemq.artemis.tests.integration.jms.cluster.TopicClusterPageStoreSizeTest.testPageStoreSizeWithClusteredDurableSub

I will check this later...

(If you could check that one please?)

But again, for the purpose of your Outreachy application you should consider your task done...

I'm just asking you now in the spirit of open source :) ... I will also look for why these failures tomorrow.

clebertsuconic avatar Nov 03 '21 01:11 clebertsuconic

Can you check on testSimpleTickforMinDiskFree(). it's always failing.

That test is just a following copy of testSimpleTickforMaxDiskUsage() (initally testSimpleTick())

The test fails because I haven't fully understood the logic of testSimpleTick. Can you pls explain to me the logic of this code:

      storeMonitor.tick();

      Assert.assertEquals(0, overMaxUsage.get());
      Assert.assertEquals(1, tick.get());
      Assert.assertEquals(1, underMaxUsage.get());

      storeMonitor.setMaxUsage(0);

      storeMonitor.tick();

      Assert.assertEquals(1, overMaxUsage.get());
      Assert.assertEquals(2, tick.get());
      Assert.assertEquals(1, underMaxUsage.get());

haanhvu avatar Nov 03 '21 09:11 haanhvu

another real failure is org.apache.activemq.artemis.tests.integration.jms.cluster.TopicClusterPageStoreSizeTest.testPageStoreSizeWithClusteredDurableSub

this also, what does this test do?

haanhvu avatar Nov 03 '21 10:11 haanhvu

Can you check on testSimpleTickforMinDiskFree(). it's always failing.

That test is just a following copy of testSimpleTickforMaxDiskUsage() (initally testSimpleTick())

The test fails because I haven't fully understood the logic of testSimpleTick. Can you pls explain to me the logic of this code:

      storeMonitor.tick();

      Assert.assertEquals(0, overMaxUsage.get());
      Assert.assertEquals(1, tick.get());
      Assert.assertEquals(1, underMaxUsage.get());

      storeMonitor.setMaxUsage(0);

      storeMonitor.tick();

      Assert.assertEquals(1, overMaxUsage.get());
      Assert.assertEquals(2, tick.get());
      Assert.assertEquals(1, underMaxUsage.get());

The existing test creates a storeMonitor and passes a maxUsage value of 0.999, such that its checking a limit of 99.9% usage, i.e something it assumes it should hopefully be under. After the first tick shown there, it checks the 'over' callback was not called and that the 'under' callback was. It then alters the maxUsage value to be 0, meaning 0% usage, which it assumes the file it wrote originally plus any other existing usage, will hopefully push it usage over the new low limit. It then calls tick again and ensures the 'over' callback fired but not the 'under' callback. (Its assumptions are a little brittle, but will typically work).

Your new test would do something similar but use extreme values to compare against the actual free space limit, say 0/1 byte and Long.MAX_VALUE, to try and ensure the actual value went over/under it.

gemmellr avatar Nov 12 '21 12:11 gemmellr

I got BUILD SUCCESS with mvn -Pdev install -Derrorprone. Not sure if it's the same with the Github workflow. Please help activate the workflow.

I cant approve the workflow here, but you can run it in your own fork regardless by enabling the actions there, at e.g https://github.com/haanhvu/activemq-artemis/actions. Then if you push to a branch which is not already in use for a PR then only your fork will run the workflow, and you can check the results etc there. Then you can raise a PR when ready (or update an existing branch already used for an open PR, e.g this one).

gemmellr avatar Nov 12 '21 12:11 gemmellr

The existing test creates a storeMonitor and passes a maxUsage value of 0.999, such that its checking a limit of 99.9% usage, i.e something it assumes it should hopefully be under. After the first tick shown there, it checks the 'over' callback was not called and that the 'under' callback was. It then alters the maxUsage value to be 0, meaning 0% usage, which it assumes the file it wrote originally plus any other existing usage, will hopefully push it usage over the new low limit. It then calls tick again and ensures the 'over' callback fired but not the 'under' callback. (Its assumptions are a little brittle, but will typically work).

@gemmellr Thanks a lot! I tweaked the test and it passed.

@clebertsuconic Can you review this again? It passed on my own fork.

another real failure is org.apache.activemq.artemis.tests.integration.jms.cluster.TopicClusterPageStoreSizeTest.testPageStoreSizeWithClusteredDurableSub

I didn't even touch this. Not sure why it passes now :)

Do you think we need to add more tests?

haanhvu avatar Nov 13 '21 18:11 haanhvu

I will merge this right after 2.20 is released...

clebertsuconic avatar Dec 14 '21 18:12 clebertsuconic

I will merge this right after 2.20 is released...

Hello @clebertsuconic .

As far as I can see 2.20 is released. Do you still plan on merging this work?

chibenwa avatar Mar 22 '22 05:03 chibenwa

@clebertsuconic, it's been a year since this was apparently ready to be merged. Let me know if you still think it should be merged and I'll resolve the conflicts and merge it. Thanks!

jbertram avatar Dec 17 '22 01:12 jbertram

@clebertsuconic, it's been almost two years since this was apparently ready to be merged. Let me know if you still think it should be merged and I'll resolve the conflicts and merge it. Thanks!

jbertram avatar Aug 29 '23 19:08 jbertram

@jbertram I think it needs some testing and reviewing. but if you could handle it? don't assume it's ready to merge without reviewing it.

clebertsuconic avatar Aug 29 '23 20:08 clebertsuconic

@clebertsuconic, I was going by what you said back in December 2021:

I will merge this right after 2.20 is released...

jbertram avatar Aug 29 '23 20:08 jbertram

@jbertram yeah.. but then I was not sure if it was 100% tested. and I never got to do it.

let me know if you're doing this... otherwise I will make sure I do it :)

clebertsuconic avatar Aug 29 '23 23:08 clebertsuconic

I'm resolving the conflicts and refactoring a few changes that I'm not sold on. I'll probably close this PR and send a new one, but the author will still be @haanhvu.

jbertram avatar Aug 29 '23 23:08 jbertram

Superseded by #4599.

jbertram avatar Aug 31 '23 01:08 jbertram