HDDS-14110. [DiskBalancer] Show EstimatedBytesToMove only during active balancing and improve threshold check message
What changes were proposed in this pull request?
- EstimatedBytesToMove and EstimatedTimeLeft should not be shown when no container movement is happening. However, it is not a bug for EstimatedBytesToMove to be non-zero while there is no container to move: this can happen when the configured threshold is very small and none of the DN's containers is smaller than that value. For this case, we are adding notes to the output of the status CLI.
- Improve the threshold validation error message. When running the DiskBalancer update command with a threshold value of 100.0, the operation fails on all datanodes with the following error:
bash> ozone admin datanode diskbalancer update -t 100.0 --in-service-datanodes
Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100.
A threshold of 0 means any deviation from ideal usage (even 0.01%) triggers container movement. This leads to excessive, continuous balancing operations and results in unnecessary I/O overhead and resource consumption. A threshold value can never be 100.0% either, as it would mean allowing 100% of a disk's contents to be moved, effectively emptying one disk.
Suggested improvement: the error message should clarify that 0 and 100 are excluded. The validation is being updated to exclude 0, requiring the threshold to be in the open range (0, 100). New error message:
Error on node [DN-1]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
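For reviewers, the exclusive range check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the class name `ThresholdCheck` and the methods `isValidThreshold` and `validate` are hypothetical, and only the error text is taken from the new message.

```java
// Hypothetical sketch; class and method names are illustrative, not the actual Ozone code.
final class ThresholdCheck {
  private ThresholdCheck() { }

  /** Returns true only for values strictly between 0 and 100. */
  static boolean isValidThreshold(double threshold) {
    // NaN fails both comparisons, so it is rejected as well.
    return threshold > 0.0 && threshold < 100.0;
  }

  /** Throws if the threshold is outside the open range (0, 100). */
  static void validate(double threshold) {
    if (!isValidThreshold(threshold)) {
      throw new IllegalArgumentException(
          "Threshold must be a percentage(double) in the range 0 to 100 both exclusive.");
    }
  }

  public static void main(String[] args) {
    // Boundary values 0 and 100 are rejected; values strictly inside are accepted.
    for (double t : new double[] {0.0, 0.001, 99.9, 100.0}) {
      System.out.println(t + " -> " + (isValidThreshold(t) ? "accepted" : "rejected"));
    }
  }
}
```

With a check of this shape, both `-t 0` and `-t 100` fail with the same message, while a small positive value such as `-t 0.001` passes.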
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14110
How was this patch tested?
Added checks for estimatedBytes and DiskBalancerConfiguration in the unit test TestDiskBalancerService.
Tested manually:
before patch:
bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode Status Threshold(%) BandwidthInMB Threads StopAfterDiskEven SuccessMove FailureMove BytesMoved(MB) EstBytesToMove(MB) EstTimeLeft(min)
ozone-datanode-5.ozone_default RUNNING 0.0001 10 5 false 0 0 0 638 2
ozone-datanode-3.ozone_default RUNNING 0.0001 10 5 false 0 0 0 1 1
ozone-datanode-4.ozone_default RUNNING 0.0001 10 5 false 0 0 0 1 1
ozone-datanode-2.ozone_default RUNNING 0.0001 10 5 false 0 0 0 698 2
ozone-datanode-1.ozone_default RUNNING 0.0001 10 5 false 0 0 0 3 1
Note: Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
After the code changes, the output is fixed:
bash-5.1$ ozone admin datanode diskbalancer report --in-service-datanodes
Report result:
Datanode VolumeDensity
ozone-datanode-2.ozone_default 8.413243594594944E-4
ozone-datanode-5.ozone_default 8.296842069073773E-4
ozone-datanode-3.ozone_default 7.682500684380311E-4
ozone-datanode-1.ozone_default 7.585499413112762E-4
ozone-datanode-4.ozone_default 7.507898396098833E-4
bash-5.1$ ozone admin datanode diskbalancer status --in-service-datanodes
Status result:
Datanode Status Threshold(%) BandwidthInMB Threads StopAfterDiskEven SuccessMove FailureMove BytesMoved(MB) EstBytesToMove(MB) EstTimeLeft(min)
ozone-datanode-5.ozone_default RUNNING 0.0001 10 5 false 0 0 0 638 2
ozone-datanode-3.ozone_default RUNNING 0.0001 10 5 false 0 0 0 1 1
ozone-datanode-4.ozone_default RUNNING 0.0001 10 5 false 0 0 0 1 1
ozone-datanode-2.ozone_default RUNNING 0.0001 10 5 false 0 0 0 698 2
ozone-datanode-1.ozone_default RUNNING 0.0001 10 5 false 0 0 0 3 1
Note:
- Estimated time left is calculated based on the estimated bytes to move and the configured disk bandwidth.
- EstimatedBytesToMove may be non-zero even when no containers are being moved, especially if the threshold is very small.
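As a rough illustration of the first note, the estimate could be derived as in the sketch below. It assumes that BandwidthInMB means MB per second and that the result is rounded up to whole minutes. Both are assumptions that happen to match the sample rows above (for example 638 MB at a bandwidth of 10 gives 2 minutes); the actual implementation may differ. The class and method names are hypothetical.

```java
// Hypothetical sketch; not the actual DiskBalancer implementation.
final class TimeLeftEstimate {
  private TimeLeftEstimate() { }

  /**
   * Estimates the remaining minutes, assuming bandwidth is in MB per second
   * and the result is rounded up to whole minutes (both are assumptions).
   */
  static long estimatedMinutesLeft(long bytesToMoveMB, long bandwidthMBPerSec) {
    if (bytesToMoveMB <= 0 || bandwidthMBPerSec <= 0) {
      return 0; // nothing to move, or no usable bandwidth value
    }
    double seconds = (double) bytesToMoveMB / bandwidthMBPerSec;
    return (long) Math.ceil(seconds / 60.0);
  }

  public static void main(String[] args) {
    // Values taken from the EstBytesToMove(MB) column of the status output.
    for (long mb : new long[] {638, 1, 698, 3}) {
      System.out.println(mb + " MB -> " + estimatedMinutesLeft(mb, 10) + " min");
    }
  }
}
```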
Threshold error output:
bash-5.1$ ozone admin datanode diskbalancer start -t 0 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 100 --in-service-datanodes
Error on node [172.18.0.11:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.10:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.8:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.9:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Error on node [172.18.0.7:19864]: Threshold must be a percentage(double) in the range 0 to 100 both exclusive.
Failed to start DiskBalancer on nodes: [172.18.0.11:19864, 172.18.0.10:19864, 172.18.0.8:19864, 172.18.0.9:19864, 172.18.0.7:19864]
bash-5.1$ ozone admin datanode diskbalancer start -t 0.001 --in-service-datanodes
Started DiskBalancer on all IN_SERVICE nodes.
@ChenSammi Please have a look at this patch. I have resolved the review comments.
Thanks @Gargi-jais11 .