[optimize](cooldown) check remote meta path exists before trying to follow cooldowned data

Open DarvenDuan opened this issue 1 year ago • 4 comments

Proposed changes

Issue Number: close #xxx

If a storage policy is set for a tablet, Doris chooses one replica to perform the cooldown, and the other replicas follow it. However, the chosen replica may not have cooled down its data yet when the others try to follow, so Doris hits an exception like this:

W0531 13:28:06.202108 367095 file_system.cpp:34] [IO_ERROR]failed to get file size xxx/136930872/140650777.0.meta, (endpoint: http://xxx, bucket: xxx, key:xxx/136930872/140650777.0.meta, ), No response body., error code 404, request id

	0#  doris::io::S3FileSystem::file_size_impl(std::filesystem::__cxx11::path const&, long*) const at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
	1#  doris::io::S3FileSystem::open_file_internal(doris::io::FileDescription const&, std::filesystem::__cxx11::path const&, std::shared_ptr<doris::io::FileReader>*) at /root/jdolap-engine/be/src/common/status.h:446
	2#  doris::io::RemoteFileSystem::open_file_impl(doris::io::FileDescription const&, std::filesystem::__cxx11::path const&, doris::io::FileReaderOptions const&, std::shared_ptr<doris::io::FileReader>*) at /root/jdolap-engine/be/src/common/status.h:446
	3#  doris::io::FileSystem::open_file(doris::io::FileDescription const&, doris::io::FileReaderOptions const&, std::shared_ptr<doris::io::FileReader>*) at /root/jdolap-engine/be/src/common/status.h:357
	4#  doris::Tablet::_read_cooldown_meta(std::shared_ptr<doris::io::RemoteFileSystem> const&, doris::TabletMetaPB*) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/basic_string.h:187
	5#  doris::Tablet::_follow_cooldowned_data() at /root/jdolap-engine/be/src/common/status.h:446
	6#  doris::Tablet::cooldown() at /root/jdolap-engine/be/src/common/status.h:446
	7#  std::_Function_handler<void (), doris::StorageEngine::_cooldown_tasks_producer_callback()::$_1>::_M_invoke(std::_Any_data const&) at /root/jdolap-engine/be/src/olap/olap_server.cpp:1076
	8#  doris::WorkThreadPool<true>::work_thread(int) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:646
	9#  execute_native_thread_routine at /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/unique_ptr.h:85
	10# start_thread
	11# clone
W0531 13:28:06.202123 367095 olap_server.cpp:1080] failed to cooldown, tablet: 136930872 err: [INTERNAL_ERROR]cannot read cooldown meta

Optimization: check whether the remote tablet meta path exists before opening it.
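A minimal sketch of the idea, not the actual patch: the follower checks for the remote meta object first and skips quietly when it is not there yet. `FakeRemoteFs`, `FollowResult`, and `follow_cooldowned_data` are hypothetical names invented for this illustration; they are assumptions and do not reflect the real `doris::io::RemoteFileSystem` or `doris::Tablet` interfaces.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical in-memory stand-in for a remote file system (assumption, not the Doris API).
struct FakeRemoteFs {
    std::unordered_map<std::string, std::string> objects;
    int calls = 0;  // number of simulated remote requests

    // Simulates an existence check on the remote store.
    bool exists(const std::string& path) {
        ++calls;
        return objects.count(path) > 0;
    }

    // Simulates reading an object; returns std::nullopt if it is missing.
    std::optional<std::string> read(const std::string& path) {
        ++calls;
        auto it = objects.find(path);
        if (it == objects.end()) return std::nullopt;
        return it->second;
    }
};

enum class FollowResult { kFollowed, kNotReadyYet, kReadFailed };

// Follower-side logic: if the chosen replica has not uploaded its meta yet,
// report "not ready" instead of attempting the read and logging an IO error.
FollowResult follow_cooldowned_data(FakeRemoteFs& fs, const std::string& meta_path,
                                    std::string* meta_out) {
    assert(meta_out != nullptr);
    if (!fs.exists(meta_path)) {
        return FollowResult::kNotReadyYet;
    }
    std::optional<std::string> content = fs.read(meta_path);
    if (!content) {
        // The object disappeared between the check and the read.
        return FollowResult::kReadFailed;
    }
    *meta_out = *content;
    return FollowResult::kFollowed;
}
```

In this sketch a caller would treat `kNotReadyYet` as a normal condition and retry on the next cooldown round, keeping warnings for `kReadFailed` only.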

DarvenDuan avatar May 31 '24 07:05 DarvenDuan

Thank you for your contribution to Apache Doris. Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website. See Doris Document.

doris-robot avatar May 31 '24 07:05 doris-robot

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] avatar May 31 '24 07:05 github-actions[bot]

run buildall

DarvenDuan avatar May 31 '24 07:05 DarvenDuan

TeamCity be ut coverage result: Function Coverage: 36.29% (9232/25442) Line Coverage: 27.63% (75708/273970) Region Coverage: 26.85% (39195/145995) Branch Coverage: 23.61% (19896/84286) Coverage Report: http://coverage.selectdb-in.cc/coverage/ea91ae352720cb4c608d003382a027e1916dbdb4_ea91ae352720cb4c608d003382a027e1916dbdb4/report/index.html

doris-robot avatar May 31 '24 08:05 doris-robot

This seems to just replace one warning message with another, at the cost of one more S3 call.
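To make the cost concrete, here is a rough sketch that counts simulated remote requests for two strategies. `CountingStore`, `follow_with_precheck`, and `follow_direct` are hypothetical names used only for illustration. The direct-read variant is an assumed alternative, not something proposed in this PR, and none of this is the Doris S3 client API.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical object store that counts every simulated request (assumption).
struct CountingStore {
    std::unordered_set<std::string> keys;
    int requests = 0;

    // Simulated existence check: one request.
    bool head(const std::string& key) {
        ++requests;
        return keys.count(key) > 0;
    }

    // Simulated read: one request; returns false when the key is not found.
    bool get(const std::string& key) {
        ++requests;
        return keys.count(key) > 0;
    }
};

// Strategy A: check existence first, then read.
// Costs two requests whenever the object exists.
bool follow_with_precheck(CountingStore& store, const std::string& key) {
    assert(!key.empty());
    if (!store.head(key)) return false;  // not ready yet, skip quietly
    return store.get(key);
}

// Strategy B (assumed alternative): read directly and treat "not found"
// as "not ready yet". Always costs exactly one request.
bool follow_direct(CountingStore& store, const std::string& key) {
    assert(!key.empty());
    return store.get(key);
}
```

Under these assumptions, the pre-check costs an extra request only when the meta file is already there, which is the case the followers eventually reach for every cooled-down tablet.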

GoGoWen avatar May 31 '24 08:05 GoGoWen

TPC-H: Total hot run time: 41194 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ea91ae352720cb4c608d003382a027e1916dbdb4, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17589	4390	4217	4217
q2	2038	200	201	200
q3	10436	1249	1131	1131
q4	10204	813	789	789
q5	7481	2685	2699	2685
q6	221	131	134	131
q7	965	632	607	607
q8	9223	2107	2076	2076
q9	9163	6770	6760	6760
q10	9510	3915	3924	3915
q11	442	242	238	238
q12	472	244	226	226
q13	17348	3218	3270	3218
q14	249	214	216	214
q15	516	463	485	463
q16	503	409	400	400
q17	988	801	733	733
q18	8487	7771	7778	7771
q19	6296	1541	1597	1541
q20	653	321	319	319
q21	5207	3233	4037	3233
q22	407	333	327	327
Total cold run time: 118398 ms
Total hot run time: 41194 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4666	4439	4414	4414
q2	368	264	272	264
q3	3152	2902	2921	2902
q4	1945	1609	1634	1609
q5	5417	5499	5493	5493
q6	215	121	125	121
q7	2216	1831	1839	1831
q8	3267	3389	3380	3380
q9	8581	8723	8678	8678
q10	4071	3722	3793	3722
q11	599	490	515	490
q12	800	625	643	625
q13	17187	3158	3151	3151
q14	321	281	273	273
q15	524	479	491	479
q16	508	428	446	428
q17	1878	1526	1493	1493
q18	7747	7556	7420	7420
q19	1682	1494	1535	1494
q20	2059	1765	1775	1765
q21	11319	4697	4760	4697
q22	624	528	532	528
Total cold run time: 79146 ms
Total hot run time: 55257 ms

doris-robot avatar May 31 '24 08:05 doris-robot

TPC-DS: Total hot run time: 168536 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ea91ae352720cb4c608d003382a027e1916dbdb4, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	911	373	370	370
query2	6451	2603	2559	2559
query3	6646	202	201	201
query4	19570	17140	17336	17140
query5	4104	422	423	422
query6	241	156	162	156
query7	4582	304	305	304
query8	324	289	300	289
query9	8466	2410	2369	2369
query10	457	286	277	277
query11	10506	9967	9961	9961
query12	132	91	89	89
query13	1661	386	357	357
query14	9728	6195	7680	6195
query15	233	189	189	189
query16	7899	256	258	256
query17	1741	511	516	511
query18	1951	270	263	263
query19	200	156	150	150
query20	92	84	82	82
query21	201	133	132	132
query22	4230	3959	3868	3868
query23	33741	33119	33087	33087
query24	9306	2896	2859	2859
query25	565	354	357	354
query26	706	158	176	158
query27	2185	324	321	321
query28	5574	2062	2053	2053
query29	871	613	594	594
query30	230	149	155	149
query31	962	777	743	743
query32	95	52	55	52
query33	650	271	262	262
query34	861	475	472	472
query35	702	603	597	597
query36	1072	916	946	916
query37	103	70	67	67
query38	2872	2770	2747	2747
query39	860	817	802	802
query40	194	126	125	125
query41	53	51	49	49
query42	105	97	97	97
query43	568	552	542	542
query44	1064	726	739	726
query45	185	171	168	168
query46	1065	748	698	698
query47	1850	1755	1757	1755
query48	370	300	297	297
query49	849	376	379	376
query50	773	386	385	385
query51	6769	6712	6612	6612
query52	104	100	91	91
query53	359	288	291	288
query54	865	434	438	434
query55	73	72	72	72
query56	258	246	269	246
query57	1094	1019	1040	1019
query58	230	205	227	205
query59	3558	3380	3236	3236
query60	286	264	256	256
query61	95	87	85	85
query62	604	457	450	450
query63	317	294	300	294
query64	8536	2293	1748	1748
query65	3199	3108	3130	3108
query66	788	326	328	326
query67	15367	15026	14791	14791
query68	4594	545	533	533
query69	477	269	275	269
query70	1061	1069	1132	1069
query71	395	268	271	268
query72	7531	2708	2528	2528
query73	718	327	322	322
query74	6089	5599	5588	5588
query75	3353	2602	2650	2602
query76	2740	1089	927	927
query77	597	267	275	267
query78	10228	9698	9851	9698
query79	2168	521	530	521
query80	843	448	442	442
query81	519	220	218	218
query82	650	91	90	90
query83	235	171	167	167
query84	243	85	97	85
query85	1140	284	262	262
query86	450	284	309	284
query87	3290	3076	3068	3068
query88	4088	2375	2371	2371
query89	475	397	386	386
query90	1960	195	192	192
query91	136	111	110	110
query92	63	51	54	51
query93	1571	510	504	504
query94	1312	201	189	189
query95	411	322	309	309
query96	578	267	266	266
query97	3198	3066	3034	3034
query98	233	221	213	213
query99	1112	865	851	851
Total cold run time: 263717 ms
Total hot run time: 168536 ms

doris-robot avatar May 31 '24 08:05 doris-robot

ClickBench: Total hot run time: 30.11 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ea91ae352720cb4c608d003382a027e1916dbdb4, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.03	0.03	0.03
query2	0.08	0.04	0.04
query3	0.23	0.05	0.04
query4	1.69	0.06	0.07
query5	0.50	0.47	0.49
query6	1.14	0.72	0.72
query7	0.02	0.01	0.01
query8	0.06	0.04	0.04
query9	0.55	0.49	0.51
query10	0.55	0.56	0.57
query11	0.16	0.11	0.12
query12	0.14	0.12	0.12
query13	0.60	0.59	0.59
query14	0.78	0.77	0.79
query15	0.82	0.81	0.81
query16	0.37	0.36	0.36
query17	1.02	1.00	0.95
query18	0.22	0.22	0.26
query19	1.79	1.71	1.65
query20	0.02	0.01	0.01
query21	15.60	0.67	0.66
query22	4.04	7.82	1.58
query23	18.27	1.34	1.25
query24	1.76	0.25	0.23
query25	0.13	0.08	0.09
query26	0.26	0.16	0.17
query27	0.09	0.08	0.08
query28	13.32	1.02	1.00
query29	13.78	3.42	3.34
query30	0.24	0.06	0.06
query31	2.87	0.39	0.38
query32	3.29	0.45	0.46
query33	2.89	2.89	2.87
query34	17.17	4.42	4.41
query35	4.49	4.45	4.65
query36	0.66	0.46	0.46
query37	0.18	0.15	0.16
query38	0.16	0.15	0.14
query39	0.04	0.04	0.03
query40	0.17	0.14	0.15
query41	0.09	0.05	0.05
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 110.38 s
Total hot run time: 30.11 s

doris-robot avatar May 31 '24 08:05 doris-robot

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and feel free to ask a maintainer to remove the Stale tag!

github-actions[bot] avatar Nov 28 '24 00:11 github-actions[bot]