[Bug]: Cpu Time statistics value less then 0
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
Branch Name
main
Commit ID
f214303c6
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
MO Regression On TKE Failed after #14871 https://github.com/matrixorigin/matrixone/actions/runs/8245287678/job/22549973813
2024-03-12 08:29:29 FATAL jTPCCTerminal:212 - [UNEXPECTED][TT_PAYMENT][EXECUTION] ErrorCode : 20101, ErrorMessage : internal error: panic counter cannot decrease in value:
github.com/matrixorigin/matrixone/pkg/util/metric.(*ratecounter).Add
/go/src/github.com/matrixorigin/matrixone/pkg/util/metric/mtype_counter.go:106
github.com/matrixorigin/matrixone/pkg/util/trace/impl/motrace.EndStatement
/go/src/github.com/matrixorigin/matrixone/pkg/util/trace/impl/motrace/report_statement.go:595
github.com/matrixorigin/matrixone/pkg/frontend.logStatementStringStatus
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/util.go:517
github.com/matrixorigin/matrixone/pkg/frontend.logStatementStatus
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/util.go:504
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt.func1
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3200
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt.func4
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3263
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt.func5
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3275
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:4165
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:4371
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:4607
github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine.go:284
github.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine_manager.go:541
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).handleMessage
/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:145
github.com/fagongzi/goetty/v2.(*server).doStart.func2.1
/go/pkg/mod/github.com/matrixorigin/goetty/[email protected]/application.go:356
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1650
Expected Behavior
Steps to Reproduce
1. run action [MO Checkin Regression On TKE](https://github.com/matrixorigin/matrixone/actions/workflows/merge-trigger-tke.yaml)
Additional information
No response
取值远远没有越界
http://175.178.192.213:30088/explore?panes=%7B%22kPb%22:%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22sum%20by%28account,%20sql_source_type%29%20%28sql_statement_cu%7Bnamespace%3D%5C%22mo-checkin-regression-14871%5C%22,%20matrixorigin_io_component%3D%5C%22CNSet%5C%22%7D%29%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22editorMode%22:%22builder%22,%22legendFormat%22:%22__auto%22,%22useBackend%22:false,%22disableTextWrap%22:false,%22fullMetaSearch%22:false,%22includeNullMetadata%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-6h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1
原因:exec_plan 统计的 cpu 时间为 负数,导致 cu 计算为负数
复现方法:
- mo_tpcc 10 warehouse
- mo_tpcc run 30 min
mysql> select *, duration / 1e6 ms from system.statement_info where statement_id = '018e32e6-ee3e-74d7-a5e5-536ef9a345a3'\G
*************************** 1. row ***************************
statement_id: 018e32e6-ee3e-74d7-a5e5-536ef9a345a3
transaction_id: 018e32e6-edcf-74a3-af1c-124375388832
session_id: 018e32e1-dad6-71cc-9f3a-e9dfeef84377
account: sys
user: dump
host: 127.0.0.1:61667
database: tpcc_10
statement: execute __mo_stmt_id_6 // prepare __mo_stmt_id_6 from SELECT s_quantity, s_data, s_dist_01, s_dist_02, s_dist_03, s_dist_04, s_dist_05, s_dist_06, s_dist_07, s_dist_08, s_dist_09, s_dist_10 FROM bmsql_stock WHERE s_w_id = ? AND s_i_id = ? FOR UPDATE ; 2 ; 59599
statement_tag:
statement_fingerprint:
node_uuid: 7c4dccb4-4d3c-41f8-b482-5251dc7a41bf
node_type: ALL
request_at: 2024-03-12 21:42:58.878267
response_at: 2024-03-12 21:42:58.988380
duration: 110114000
status: Success
err_code: 0
error:
exec_plan: {"steps":null,"code":200,"message":"sql query ignore execution plan","uuid":"018e32e6-ee3e-74d7-a5e5-536ef9a345a3","NewPlanStats":{"ParseDuration":0,"PlanDuration":0,"CompileDuration":0,"ExecutionDuration":0,"IOAccessTimeConsumption":0,"LockTimeConsumption":0,"ParseStartTime":"0001-01-01T00:00:00Z","PlanStartTime":"0001-01-01T00:00:00Z","CompileStartTime":"0001-01-01T00:00:00Z","ExecutionStartTime":"0001-01-01T00:00:00Z","ExecutionEndTime":"0001-01-01T00:00:00Z"}}
rows_read: 0
bytes_scan: 0
stats: [4,-48304378,2278.000,0,0,1990,0,16,-0.0961]
statement_type: Execute
query_type: Other
role_id: 0
sql_source_type: external_sql
aggr_count: 0
result_count: 1
ms: 110.114
1 row in set (0.12 sec)
结论:statsInfo.IOAccessTimeConsumption 和 statsInfo.LockTimeConsumption 远大于 ExecPlan中统计的 cpu耗时 导致 CPU 计算出来的 CU 小于 0.
debug 代码
@Morranto @ouyuanning ptal
working on it
tpcc 10仓,跑了1小时以上,暂时未复现不知道有没有更好的复现方法
正在排查
working on moc2927
初步原因怀疑是阻塞时间并发计入导致偏大
working on it
在处理prepare重构,还没时间看
在处理prepare重构,还没时间看
在处理prepare重构
在处理prepare重构,还没时间看
many negative cpu warning during merge run & daily regression
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22fzn%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-checkin-regression-16004%5C%22%7D%20%7C%3D%20%60negative%20cpu%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221715402579623%22,%22to%22:%221715405306593%22%7D%7D%7D&schemaVersion=1&orgId=1
据徐鹏反馈,tpcc修改配置跑的时候比较容易复现。还没验证
等这周跟喜亮交接的时候讨论一下。我估计是retry的时候,io等待记录多次,而执行时长记录一次造成的。 不过还没验证
分析中
待处理
经和喜亮沟通,1.2.1版本来不及,推迟到1.3.0版本解决
Pending processing
not working on it
not working on it
not working on it
not working on it
not working on it
not working on it
代码梳理重构中
analzye代码重构中
analzye代码重构中