
[opt](inverted index) the "unicode" tokenizer can be configured to disable stop words

Open zzzxl1993 opened this issue 1 year ago • 18 comments

Proposed changes

  1. Setting the index properties "parser" = "unicode" and "use_stopwords" = "none" disables stop words for the "unicode" tokenizer.
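To illustrate what disabling stop words changes, here is a minimal Python sketch. It is not Doris's tokenizer: the `tokenize` function, its `use_stopwords` flag, and the `STOP_WORDS` set are hypothetical names made up for this example, and the word list is a tiny stand-in for whatever list the real "unicode" parser uses.

```python
import re

# Hypothetical stop-word list, for illustration only.
# The real "unicode" parser in Doris has its own built-in list.
STOP_WORDS = {"a", "an", "the", "is", "of", "to"}


def tokenize(text, use_stopwords=True):
    """Lowercase the text, split it into word tokens, and optionally drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    if not use_stopwords:
        # Stop words disabled: every token is kept and would be indexed.
        return tokens
    return [t for t in tokens if t not in STOP_WORDS]


with_filter = tokenize("To be or not to be")
without_filter = tokenize("To be or not to be", use_stopwords=False)
print(with_filter)
print(without_filter)
```

With filtering on, a query made up only of stop words has nothing to match against, which is the kind of case where turning the filter off can matter.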

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

zzzxl1993 • Apr 22 '24 11:04

Thank you for your contribution to Apache Doris. Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website. See Doris Document.

doris-robot • Apr 22 '24 11:04

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] • Apr 22 '24 11:04

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] • Apr 25 '24 13:04

run buildall

zzzxl1993 • May 01 '24 05:05

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] • May 01 '24 05:05

TPC-H: Total hot run time: 40210 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c9bfa966b2fb67d9b8de0a5c39dc53ab57b8793b, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	18514	4570	4374	4374
q2	2534	197	195	195
q3	10995	1189	1181	1181
q4	10460	783	777	777
q5	7487	2805	2670	2670
q6	212	131	133	131
q7	1034	609	573	573
q8	9237	2132	2115	2115
q9	9084	6575	6533	6533
q10	8975	3703	3740	3703
q11	460	234	231	231
q12	482	220	217	217
q13	17774	2959	2953	2953
q14	257	214	217	214
q15	512	461	469	461
q16	506	389	369	369
q17	976	632	684	632
q18	8084	7432	7417	7417
q19	4400	1531	1525	1525
q20	659	318	315	315
q21	5051	3342	4007	3342
q22	353	296	282	282
Total cold run time: 118046 ms
Total hot run time: 40210 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4376	4230	4212	4212
q2	378	255	264	255
q3	2965	2780	2776	2776
q4	1866	1608	1608	1608
q5	5306	5300	5249	5249
q6	207	124	123	123
q7	2238	1926	1882	1882
q8	3194	3374	3384	3374
q9	8446	8480	8467	8467
q10	3915	3756	3640	3640
q11	572	475	473	473
q12	765	626	595	595
q13	16612	2979	2991	2979
q14	295	267	261	261
q15	508	470	470	470
q16	476	403	411	403
q17	1764	1486	1472	1472
q18	7705	7441	7483	7441
q19	2360	1565	1530	1530
q20	1966	1765	1753	1753
q21	5013	4751	4841	4751
q22	577	485	490	485
Total cold run time: 71504 ms
Total hot run time: 54199 ms
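The bot does not label the table columns. The Round 1 numbers are consistent with this reading: column 1 is the cold run, columns 2 and 3 are two hot runs, and column 4 is the faster of the two hot runs. Under that assumption, "Total cold run time" is the sum of column 1 and "Total hot run time" is the sum of column 4. The following Python sketch checks this against the Round 1 rows; `summarize` and `ROUND1` are names invented for this example.

```python
# Round 1 rows copied from the TPC-H table: query, run 1, run 2, run 3, reported best.
ROUND1 = """
q1 18514 4570 4374 4374
q2 2534 197 195 195
q3 10995 1189 1181 1181
q4 10460 783 777 777
q5 7487 2805 2670 2670
q6 212 131 133 131
q7 1034 609 573 573
q8 9237 2132 2115 2115
q9 9084 6575 6533 6533
q10 8975 3703 3740 3703
q11 460 234 231 231
q12 482 220 217 217
q13 17774 2959 2953 2953
q14 257 214 217 214
q15 512 461 469 461
q16 506 389 369 369
q17 976 632 684 632
q18 8084 7432 7417 7417
q19 4400 1531 1525 1525
q20 659 318 315 315
q21 5051 3342 4007 3342
q22 353 296 282 282
"""


def summarize(table):
    """Return (cold total, hot total) in ms for a whitespace-separated result table."""
    cold_total = 0
    hot_total = 0
    for line in table.strip().splitlines():
        name, cold, hot1, hot2, best = line.split()
        cold, hot1, hot2, best = int(cold), int(hot1), int(hot2), int(best)
        # Assumption: the last column is the faster of the two hot runs.
        assert best == min(hot1, hot2), name
        cold_total += cold
        hot_total += best
    return cold_total, hot_total


cold_total, hot_total = summarize(ROUND1)
print(cold_total, hot_total)  # matches the reported 118046 ms and 40210 ms
```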

doris-robot • May 01 '24 06:05

TPC-DS: Total hot run time: 184579 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c9bfa966b2fb67d9b8de0a5c39dc53ab57b8793b, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	918	357	352	352
query2	6444	2404	2362	2362
query3	6658	212	216	212
query4	23434	21129	21085	21085
query5	4201	428	428	428
query6	270	182	185	182
query7	4602	294	295	294
query8	253	189	185	185
query9	8415	2408	2379	2379
query10	444	247	247	247
query11	14668	14118	14099	14099
query12	139	87	94	87
query13	1642	355	366	355
query14	9394	6540	8202	6540
query15	261	175	175	175
query16	8193	260	258	258
query17	1896	559	548	548
query18	2120	271	267	267
query19	315	151	146	146
query20	92	86	82	82
query21	200	129	125	125
query22	5031	4909	4800	4800
query23	33952	33329	33045	33045
query24	11099	2833	2860	2833
query25	625	352	355	352
query26	1643	150	146	146
query27	3041	319	328	319
query28	7801	2037	2052	2037
query29	951	615	595	595
query30	293	148	151	148
query31	1001	734	721	721
query32	93	51	55	51
query33	742	245	241	241
query34	1018	481	489	481
query35	801	662	665	662
query36	1068	900	887	887
query37	140	67	65	65
query38	3105	3015	2984	2984
query39	1628	1562	1537	1537
query40	274	126	126	126
query41	41	38	37	37
query42	103	95	98	95
query43	599	533	534	533
query44	1276	738	740	738
query45	268	255	256	255
query46	1072	751	733	733
query47	1956	1880	1906	1880
query48	363	294	301	294
query49	1116	389	406	389
query50	778	397	391	391
query51	6721	6586	6598	6586
query52	106	88	98	88
query53	345	276	271	271
query54	289	231	224	224
query55	76	71	74	71
query56	243	215	222	215
query57	1259	1164	1130	1130
query58	228	190	194	190
query59	3464	3133	3165	3133
query60	251	232	231	231
query61	89	86	90	86
query62	647	439	444	439
query63	303	276	274	274
query64	9562	7225	7219	7219
query65	3107	3047	3031	3031
query66	1379	339	336	336
query67	15675	15032	14949	14949
query68	5136	546	542	542
query69	471	295	301	295
query70	1120	1127	1118	1118
query71	417	268	264	264
query72	7994	2515	2353	2353
query73	703	318	326	318
query74	6462	6127	6042	6042
query75	3323	2646	2612	2612
query76	2849	1053	954	954
query77	419	266	268	266
query78	10851	10161	10327	10161
query79	2874	533	527	527
query80	2006	430	423	423
query81	528	218	222	218
query82	775	96	96	96
query83	299	180	171	171
query84	283	89	90	89
query85	2037	273	263	263
query86	499	290	315	290
query87	3337	3090	3058	3058
query88	4595	2318	2313	2313
query89	492	380	374	374
query90	1975	183	196	183
query91	128	98	97	97
query92	59	50	48	48
query93	4927	520	512	512
query94	1259	191	186	186
query95	398	311	310	310
query96	610	264	261	261
query97	3146	2919	2984	2919
query98	234	220	214	214
query99	1259	846	873	846
Total cold run time: 291646 ms
Total hot run time: 184579 ms

doris-robot • May 01 '24 06:05

TeamCity be ut coverage result: Function Coverage: 35.69% (8960/25102) Line Coverage: 27.28% (73898/270842) Region Coverage: 26.47% (38176/144211) Branch Coverage: 23.23% (19447/83722) Coverage Report: http://coverage.selectdb-in.cc/coverage/c9bfa966b2fb67d9b8de0a5c39dc53ab57b8793b_c9bfa966b2fb67d9b8de0a5c39dc53ab57b8793b/report/index.html

doris-robot • May 01 '24 07:05

PR approved by at least one committer and no changes requested.

github-actions[bot] • May 01 '24 12:05

PR approved by anyone and no changes requested.

github-actions[bot] • May 01 '24 12:05

run buildall

zzzxl1993 • May 01 '24 15:05

run buildall

zzzxl1993 • May 01 '24 15:05

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] • May 01 '24 15:05

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] • May 01 '24 15:05

run buildall

zzzxl1993 • May 01 '24 15:05

clang-tidy review says "All clean, LGTM! :+1:"

github-actions[bot] • May 01 '24 15:05

TeamCity be ut coverage result: Function Coverage: 35.69% (8959/25103) Line Coverage: 27.29% (73908/270843) Region Coverage: 26.47% (38178/144212) Branch Coverage: 23.23% (19448/83722) Coverage Report: http://coverage.selectdb-in.cc/coverage/e492979421e48575145d1f2fefdbd36539300ee5_e492979421e48575145d1f2fefdbd36539300ee5/report/index.html

doris-robot • May 01 '24 15:05

TPC-DS: Total hot run time: 187132 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e492979421e48575145d1f2fefdbd36539300ee5, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	919	365	346	346
query2	6289	2367	2343	2343
query3	6656	210	215	210
query4	23230	21709	21876	21709
query5	3900	437	409	409
query6	264	189	179	179
query7	4532	308	308	308
query8	230	187	192	187
query9	8583	2440	2410	2410
query10	412	260	251	251
query11	15239	14812	14811	14811
query12	127	89	91	89
query13	1653	386	374	374
query14	9618	6886	8456	6886
query15	285	184	165	165
query16	8199	265	268	265
query17	1801	572	553	553
query18	2109	284	275	275
query19	333	157	155	155
query20	92	83	84	83
query21	195	135	125	125
query22	5050	4910	4840	4840
query23	33995	33079	33032	33032
query24	10293	2932	2909	2909
query25	574	386	388	386
query26	696	168	155	155
query27	2083	318	311	311
query28	5990	2086	2042	2042
query29	886	684	606	606
query30	225	153	147	147
query31	944	715	717	715
query32	95	51	50	50
query33	635	238	246	238
query34	899	476	473	473
query35	824	669	662	662
query36	1085	898	886	886
query37	100	64	66	64
query38	3304	3007	3027	3007
query39	1566	1543	1544	1543
query40	204	128	128	128
query41	39	37	39	37
query42	104	97	96	96
query43	574	569	548	548
query44	1095	734	737	734
query45	272	255	258	255
query46	1072	692	721	692
query47	1951	1864	1841	1841
query48	403	300	302	300
query49	827	388	390	388
query50	785	391	391	391
query51	6793	6563	6551	6551
query52	105	95	93	93
query53	356	286	283	283
query54	316	238	261	238
query55	83	78	77	77
query56	237	269	218	218
query57	1204	1134	1117	1117
query58	222	202	202	202
query59	3470	3107	3099	3099
query60	244	228	232	228
query61	89	86	85	85
query62	634	439	464	439
query63	309	282	290	282
query64	8330	7332	7228	7228
query65	3113	3061	3078	3061
query66	835	327	346	327
query67	15925	15338	15316	15316
query68	10645	547	548	547
query69	600	305	307	305
query70	1393	1128	1118	1118
query71	528	276	280	276
query72	8816	2533	2385	2385
query73	1600	329	318	318
query74	6612	6121	6130	6121
query75	5495	2700	2676	2676
query76	5895	1008	970	970
query77	702	268	273	268
query78	11096	10352	10236	10236
query79	11445	535	518	518
query80	1878	441	433	433
query81	509	226	219	219
query82	233	87	90	87
query83	209	170	163	163
query84	269	83	85	83
query85	949	278	268	268
query86	347	310	302	302
query87	3291	3122	3106	3106
query88	6308	2363	2326	2326
query89	512	389	386	386
query90	2449	186	180	180
query91	126	97	95	95
query92	60	47	47	47
query93	7656	512	503	503
query94	1563	178	182	178
query95	403	313	300	300
query96	597	270	263	263
query97	3149	2965	2985	2965
query98	235	217	211	211
query99	1106	879	826	826
Total cold run time: 310177 ms
Total hot run time: 187132 ms

doris-robot • May 01 '24 16:05

PR approved by at least one committer and no changes requested.

github-actions[bot] • May 03 '24 04:05