[optimize](cooldown) check remote meta path exists before trying to follow cooldowned data
Proposed changes
Issue Number: close #xxx
If a storage policy is set for a tablet, Doris chooses one replica to cooldown, and the other replicas follow it. However, the chosen replica may not have finished its cooldown when the others try to follow, so the remote meta file does not exist yet and Doris logs an exception like this:
W0531 13:28:06.202108 367095 file_system.cpp:34] [IO_ERROR]failed to get file size xxx/136930872/140650777.0.meta, (endpoint: http://xxx, bucket: xxx, key:xxx/136930872/140650777.0.meta, ), No response body., error code 404, request id
0# doris::io::S3FileSystem::file_size_impl(std::filesystem::__cxx11::path const&, long*) const at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
1# doris::io::S3FileSystem::open_file_internal(doris::io::FileDescription const&, std::filesystem::__cxx11::path const&, std::shared_ptr<doris::io::FileReader>*) at /root/jdolap-engine/be/src/common/status.h:446
2# doris::io::RemoteFileSystem::open_file_impl(doris::io::FileDescription const&, std::filesystem::__cxx11::path const&, doris::io::FileReaderOptions const&, std::shared_ptr<doris::io::FileReader>*) at /root/jdolap-engine/be/src/common/status.h:446
3# doris::io::FileSystem::open_file(doris::io::FileDescription const&, doris::io::FileReaderOptions const&, std::shared_ptr<doris::io::FileReader>*) at /root/jdolap-engine/be/src/common/status.h:357
4# doris::Tablet::_read_cooldown_meta(std::shared_ptr<doris::io::RemoteFileSystem> const&, doris::TabletMetaPB*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
5# doris::Tablet::_follow_cooldowned_data() at /root/jdolap-engine/be/src/common/status.h:446
6# doris::Tablet::cooldown() at /root/jdolap-engine/be/src/common/status.h:446
7# std::_Function_handler<void (), doris::StorageEngine::_cooldown_tasks_producer_callback()::$_1>::_M_invoke(std::_Any_data const&) at /root/jdolap-engine/be/src/olap/olap_server.cpp:1076
8# doris::WorkThreadPool<true>::work_thread(int) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
9# execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
10# start_thread
11# clone
W0531 13:28:06.202123 367095 olap_server.cpp:1080] failed to cooldown, tablet: 136930872 err: [INTERNAL_ERROR]cannot read cooldown meta
Optimization: check whether the remote tablet meta path exists before opening it.
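The check-before-open idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `FakeRemoteFs`, `FollowResult`, and this `follow_cooldowned_data` are hypothetical stand-ins, and the in-memory map only models a remote object store so that the request count is visible.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for a remote (S3-like) file system.
// It is NOT doris::io::RemoteFileSystem; it only counts requests.
class FakeRemoteFs {
public:
    void put(const std::string& key, std::string body) {
        objects_[key] = std::move(body);
    }

    // Models an existence probe: one remote request.
    bool exists(const std::string& key) {
        ++requests_;
        return objects_.count(key) > 0;
    }

    // Models open + read: one remote request; nullopt stands for a 404.
    std::optional<std::string> read(const std::string& key) {
        ++requests_;
        auto it = objects_.find(key);
        if (it == objects_.end()) return std::nullopt;
        return it->second;
    }

    int requests() const { return requests_; }

private:
    std::map<std::string, std::string> objects_;
    int requests_ = 0;
};

// Hypothetical result type: separates "meta not uploaded yet" from a real read failure.
enum class FollowResult { kFollowed, kMetaNotReady, kReadFailed };

// Probe for the meta object first, and only read it when it is present.
FollowResult follow_cooldowned_data(FakeRemoteFs& fs, const std::string& meta_key,
                                    std::string* meta_out) {
    if (!fs.exists(meta_key)) {
        return FollowResult::kMetaNotReady;  // expected while the chosen replica is still cooling down
    }
    auto body = fs.read(meta_key);
    if (!body) {
        return FollowResult::kReadFailed;  // object disappeared between probe and read
    }
    *meta_out = *body;
    return FollowResult::kFollowed;
}
```

In this sketch a missing meta file costs one request and yields a distinct "not ready" result, while the successful path costs two requests (probe plus read) instead of one.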
clang-tidy review says "All clean, LGTM! :+1:"
run buildall
TeamCity be ut coverage result: Function Coverage: 36.29% (9232/25442) Line Coverage: 27.63% (75708/273970) Region Coverage: 26.85% (39195/145995) Branch Coverage: 23.61% (19896/84286) Coverage Report: http://coverage.selectdb-in.cc/coverage/ea91ae352720cb4c608d003382a027e1916dbdb4_ea91ae352720cb4c608d003382a027e1916dbdb4/report/index.html
This seems to only replace one warning message with another, at the cost of one more S3 call.
TPC-H: Total hot run time: 41194 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ea91ae352720cb4c608d003382a027e1916dbdb4, data reload: false
------ Round 1 ----------------------------------
query  cold run (ms)  hot run 1 (ms)  hot run 2 (ms)  best hot run (ms)
q1 17589 4390 4217 4217
q2 2038 200 201 200
q3 10436 1249 1131 1131
q4 10204 813 789 789
q5 7481 2685 2699 2685
q6 221 131 134 131
q7 965 632 607 607
q8 9223 2107 2076 2076
q9 9163 6770 6760 6760
q10 9510 3915 3924 3915
q11 442 242 238 238
q12 472 244 226 226
q13 17348 3218 3270 3218
q14 249 214 216 214
q15 516 463 485 463
q16 503 409 400 400
q17 988 801 733 733
q18 8487 7771 7778 7771
q19 6296 1541 1597 1541
q20 653 321 319 319
q21 5207 3233 4037 3233
q22 407 333 327 327
Total cold run time: 118398 ms
Total hot run time: 41194 ms
----- Round 2, with runtime_filter_mode=off -----
query  cold run (ms)  hot run 1 (ms)  hot run 2 (ms)  best hot run (ms)
q1 4666 4439 4414 4414
q2 368 264 272 264
q3 3152 2902 2921 2902
q4 1945 1609 1634 1609
q5 5417 5499 5493 5493
q6 215 121 125 121
q7 2216 1831 1839 1831
q8 3267 3389 3380 3380
q9 8581 8723 8678 8678
q10 4071 3722 3793 3722
q11 599 490 515 490
q12 800 625 643 625
q13 17187 3158 3151 3151
q14 321 281 273 273
q15 524 479 491 479
q16 508 428 446 428
q17 1878 1526 1493 1493
q18 7747 7556 7420 7420
q19 1682 1494 1535 1494
q20 2059 1765 1775 1765
q21 11319 4697 4760 4697
q22 624 528 532 528
Total cold run time: 79146 ms
Total hot run time: 55257 ms
TPC-DS: Total hot run time: 168536 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ea91ae352720cb4c608d003382a027e1916dbdb4, data reload: false
query  cold run (ms)  hot run 1 (ms)  hot run 2 (ms)  best hot run (ms)
query1 911 373 370 370
query2 6451 2603 2559 2559
query3 6646 202 201 201
query4 19570 17140 17336 17140
query5 4104 422 423 422
query6 241 156 162 156
query7 4582 304 305 304
query8 324 289 300 289
query9 8466 2410 2369 2369
query10 457 286 277 277
query11 10506 9967 9961 9961
query12 132 91 89 89
query13 1661 386 357 357
query14 9728 6195 7680 6195
query15 233 189 189 189
query16 7899 256 258 256
query17 1741 511 516 511
query18 1951 270 263 263
query19 200 156 150 150
query20 92 84 82 82
query21 201 133 132 132
query22 4230 3959 3868 3868
query23 33741 33119 33087 33087
query24 9306 2896 2859 2859
query25 565 354 357 354
query26 706 158 176 158
query27 2185 324 321 321
query28 5574 2062 2053 2053
query29 871 613 594 594
query30 230 149 155 149
query31 962 777 743 743
query32 95 52 55 52
query33 650 271 262 262
query34 861 475 472 472
query35 702 603 597 597
query36 1072 916 946 916
query37 103 70 67 67
query38 2872 2770 2747 2747
query39 860 817 802 802
query40 194 126 125 125
query41 53 51 49 49
query42 105 97 97 97
query43 568 552 542 542
query44 1064 726 739 726
query45 185 171 168 168
query46 1065 748 698 698
query47 1850 1755 1757 1755
query48 370 300 297 297
query49 849 376 379 376
query50 773 386 385 385
query51 6769 6712 6612 6612
query52 104 100 91 91
query53 359 288 291 288
query54 865 434 438 434
query55 73 72 72 72
query56 258 246 269 246
query57 1094 1019 1040 1019
query58 230 205 227 205
query59 3558 3380 3236 3236
query60 286 264 256 256
query61 95 87 85 85
query62 604 457 450 450
query63 317 294 300 294
query64 8536 2293 1748 1748
query65 3199 3108 3130 3108
query66 788 326 328 326
query67 15367 15026 14791 14791
query68 4594 545 533 533
query69 477 269 275 269
query70 1061 1069 1132 1069
query71 395 268 271 268
query72 7531 2708 2528 2528
query73 718 327 322 322
query74 6089 5599 5588 5588
query75 3353 2602 2650 2602
query76 2740 1089 927 927
query77 597 267 275 267
query78 10228 9698 9851 9698
query79 2168 521 530 521
query80 843 448 442 442
query81 519 220 218 218
query82 650 91 90 90
query83 235 171 167 167
query84 243 85 97 85
query85 1140 284 262 262
query86 450 284 309 284
query87 3290 3076 3068 3068
query88 4088 2375 2371 2371
query89 475 397 386 386
query90 1960 195 192 192
query91 136 111 110 110
query92 63 51 54 51
query93 1571 510 504 504
query94 1312 201 189 189
query95 411 322 309 309
query96 578 267 266 266
query97 3198 3066 3034 3034
query98 233 221 213 213
query99 1112 865 851 851
Total cold run time: 263717 ms
Total hot run time: 168536 ms
ClickBench: Total hot run time: 30.11 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ea91ae352720cb4c608d003382a027e1916dbdb4, data reload: false
query  cold run (s)  hot run 1 (s)  hot run 2 (s)
query1 0.03 0.03 0.03
query2 0.08 0.04 0.04
query3 0.23 0.05 0.04
query4 1.69 0.06 0.07
query5 0.50 0.47 0.49
query6 1.14 0.72 0.72
query7 0.02 0.01 0.01
query8 0.06 0.04 0.04
query9 0.55 0.49 0.51
query10 0.55 0.56 0.57
query11 0.16 0.11 0.12
query12 0.14 0.12 0.12
query13 0.60 0.59 0.59
query14 0.78 0.77 0.79
query15 0.82 0.81 0.81
query16 0.37 0.36 0.36
query17 1.02 1.00 0.95
query18 0.22 0.22 0.26
query19 1.79 1.71 1.65
query20 0.02 0.01 0.01
query21 15.60 0.67 0.66
query22 4.04 7.82 1.58
query23 18.27 1.34 1.25
query24 1.76 0.25 0.23
query25 0.13 0.08 0.09
query26 0.26 0.16 0.17
query27 0.09 0.08 0.08
query28 13.32 1.02 1.00
query29 13.78 3.42 3.34
query30 0.24 0.06 0.06
query31 2.87 0.39 0.38
query32 3.29 0.45 0.46
query33 2.89 2.89 2.87
query34 17.17 4.42 4.41
query35 4.49 4.45 4.65
query36 0.66 0.46 0.46
query37 0.18 0.15 0.16
query38 0.16 0.15 0.14
query39 0.04 0.04 0.03
query40 0.17 0.14 0.15
query41 0.09 0.05 0.05
query42 0.06 0.05 0.05
query43 0.05 0.04 0.04
Total cold run time: 110.38 s
Total hot run time: 30.11 s
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and feel free to ask a maintainer to remove the Stale tag!