matrixone icon indicating copy to clipboard operation
matrixone copied to clipboard

[Bug]: Cpu Time statistics value less then 0

Open xzxiong opened this issue 1 year ago • 25 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch Name

main

Commit ID

f214303c6

Other Environment Information

- Hardware parameters: 
- OS type:
- Others:

Actual Behavior

MO Regression On TKE Failed after #14871 https://github.com/matrixorigin/matrixone/actions/runs/8245287678/job/22549973813

2024-03-12 08:29:29 FATAL jTPCCTerminal:212 - [UNEXPECTED][TT_PAYMENT][EXECUTION] ErrorCode : 20101, ErrorMessage : internal error: panic counter cannot decrease in value: 
github.com/matrixorigin/matrixone/pkg/util/metric.(*ratecounter).Add
	/go/src/github.com/matrixorigin/matrixone/pkg/util/metric/mtype_counter.go:106
github.com/matrixorigin/matrixone/pkg/util/trace/impl/motrace.EndStatement
	/go/src/github.com/matrixorigin/matrixone/pkg/util/trace/impl/motrace/report_statement.go:595
github.com/matrixorigin/matrixone/pkg/frontend.logStatementStringStatus
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/util.go:517
github.com/matrixorigin/matrixone/pkg/frontend.logStatementStatus
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/util.go:504
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt.func1
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3200
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt.func4
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3263
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt.func5
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:3275
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).executeStmt
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:4165
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).doComQuery
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:4371
github.com/matrixorigin/matrixone/pkg/frontend.(*MysqlCmdExecutor).ExecRequest
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/mysql_cmd_executor.go:4607
github.com/matrixorigin/matrixone/pkg/frontend.(*Routine).handleRequest
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine.go:284
github.com/matrixorigin/matrixone/pkg/frontend.(*RoutineManager).Handler
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/routine_manager.go:541
github.com/matrixorigin/matrixone/pkg/frontend.(*MOServer).handleMessage
	/go/src/github.com/matrixorigin/matrixone/pkg/frontend/server.go:145
github.com/fagongzi/goetty/v2.(*server).doStart.func2.1
	/go/pkg/mod/github.com/matrixorigin/goetty/[email protected]/application.go:356
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1650

Expected Behavior

image

Steps to Reproduce

1. run action [MO Checkin Regression On TKE](https://github.com/matrixorigin/matrixone/actions/workflows/merge-trigger-tke.yaml)

Additional information

No response

xzxiong avatar Mar 12 '24 10:03 xzxiong

取值远远没有越界

http://175.178.192.213:30088/explore?panes=%7B%22kPb%22:%7B%22datasource%22:%22prometheus%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22sum%20by%28account,%20sql_source_type%29%20%28sql_statement_cu%7Bnamespace%3D%5C%22mo-checkin-regression-14871%5C%22,%20matrixorigin_io_component%3D%5C%22CNSet%5C%22%7D%29%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22prometheus%22%7D,%22editorMode%22:%22builder%22,%22legendFormat%22:%22__auto%22,%22useBackend%22:false,%22disableTextWrap%22:false,%22fullMetaSearch%22:false,%22includeNullMetadata%22:true%7D%5D,%22range%22:%7B%22from%22:%22now-6h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1

image

xzxiong avatar Mar 12 '24 11:03 xzxiong

原因:exec_plan 统计的 cpu 时间为 负数,导致 cu 计算为负数

复现方法:

  1. mo_tpcc 10 warehouse
  2. mo_tpcc run 30 min

image

mysql> select *, duration / 1e6 ms from system.statement_info where statement_id = '018e32e6-ee3e-74d7-a5e5-536ef9a345a3'\G
*************************** 1. row ***************************
         statement_id: 018e32e6-ee3e-74d7-a5e5-536ef9a345a3
       transaction_id: 018e32e6-edcf-74a3-af1c-124375388832
           session_id: 018e32e1-dad6-71cc-9f3a-e9dfeef84377
              account: sys
                 user: dump
                 host: 127.0.0.1:61667
             database: tpcc_10
            statement: execute __mo_stmt_id_6 // prepare __mo_stmt_id_6 from SELECT s_quantity, s_data,        s_dist_01, s_dist_02, s_dist_03, s_dist_04,        s_dist_05, s_dist_06, s_dist_07, s_dist_08,        s_dist_09, s_dist_10     FROM bmsql_stock     WHERE s_w_id = ? AND s_i_id = ?     FOR UPDATE ; 2 ; 59599
        statement_tag:
statement_fingerprint:
            node_uuid: 7c4dccb4-4d3c-41f8-b482-5251dc7a41bf
            node_type: ALL
           request_at: 2024-03-12 21:42:58.878267
          response_at: 2024-03-12 21:42:58.988380
             duration: 110114000
               status: Success
             err_code: 0
                error:
            exec_plan: {"steps":null,"code":200,"message":"sql query ignore execution plan","uuid":"018e32e6-ee3e-74d7-a5e5-536ef9a345a3","NewPlanStats":{"ParseDuration":0,"PlanDuration":0,"CompileDuration":0,"ExecutionDuration":0,"IOAccessTimeConsumption":0,"LockTimeConsumption":0,"ParseStartTime":"0001-01-01T00:00:00Z","PlanStartTime":"0001-01-01T00:00:00Z","CompileStartTime":"0001-01-01T00:00:00Z","ExecutionStartTime":"0001-01-01T00:00:00Z","ExecutionEndTime":"0001-01-01T00:00:00Z"}}

            rows_read: 0
           bytes_scan: 0
                stats: [4,-48304378,2278.000,0,0,1990,0,16,-0.0961]
       statement_type: Execute
           query_type: Other
              role_id: 0
      sql_source_type: external_sql
           aggr_count: 0
         result_count: 1
                   ms: 110.114
1 row in set (0.12 sec)

xzxiong avatar Mar 12 '24 13:03 xzxiong

结论:statsInfo.IOAccessTimeConsumption 和 statsInfo.LockTimeConsumption 远大于 ExecPlan中统计的 cpu耗时 导致 CPU 计算出来的 CU 小于 0.

image debug 代码 image

xzxiong avatar Mar 12 '24 14:03 xzxiong

@Morranto @ouyuanning ptal

xzxiong avatar Mar 12 '24 14:03 xzxiong

working on it

Morranto avatar Mar 13 '24 08:03 Morranto

tpcc 10仓,跑了1小时以上,暂时未复现不知道有没有更好的复现方法

Morranto avatar Mar 14 '24 10:03 Morranto

正在排查

Morranto avatar Apr 02 '24 10:04 Morranto

working on moc2927

Morranto avatar Apr 09 '24 10:04 Morranto

初步原因怀疑是阻塞时间并发计入导致偏大

Morranto avatar Apr 15 '24 11:04 Morranto

working on it

Morranto avatar Apr 18 '24 10:04 Morranto

在处理prepare重构,还没时间看

ouyuanning avatar Apr 23 '24 10:04 ouyuanning

在处理prepare重构,还没时间看

ouyuanning avatar Apr 26 '24 13:04 ouyuanning

在处理prepare重构

ouyuanning avatar May 06 '24 11:05 ouyuanning

在处理prepare重构,还没时间看

ouyuanning avatar May 09 '24 10:05 ouyuanning

many negative cpu warning during merge run & daily regression

https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22fzn%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-checkin-regression-16004%5C%22%7D%20%7C%3D%20%60negative%20cpu%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221715402579623%22,%22to%22:%221715405306593%22%7D%7D%7D&schemaVersion=1&orgId=1

sukki37 avatar May 11 '24 06:05 sukki37

据徐鹏反馈,tpcc修改配置跑的时候比较容易复现。还没验证

ouyuanning avatar May 16 '24 10:05 ouyuanning

等这周跟喜亮交接的时候讨论一下。我估计是retry的时候,io等待记录多次,而执行时长记录一次造成的。 不过还没验证

ouyuanning avatar May 21 '24 10:05 ouyuanning

分析中

qingxinhome avatar May 24 '24 14:05 qingxinhome

待处理

qingxinhome avatar May 29 '24 13:05 qingxinhome

经和喜亮沟通,1.2.1版本来不及,推迟到1.3.0版本解决

aressu1985 avatar May 30 '24 06:05 aressu1985

Pending processing

qingxinhome avatar Jun 04 '24 14:06 qingxinhome

not working on it

qingxinhome avatar Jun 13 '24 12:06 qingxinhome

not working on it

qingxinhome avatar Jun 19 '24 14:06 qingxinhome

not working on it

qingxinhome avatar Jun 24 '24 14:06 qingxinhome

not working on it

qingxinhome avatar Jun 28 '24 13:06 qingxinhome

not working on it

qingxinhome avatar Jul 04 '24 12:07 qingxinhome

not working on it

qingxinhome avatar Jul 10 '24 14:07 qingxinhome

代码梳理重构中

qingxinhome avatar Jul 16 '24 11:07 qingxinhome

analzye代码重构中

qingxinhome avatar Jul 22 '24 14:07 qingxinhome

analzye代码重构中

qingxinhome avatar Jul 25 '24 11:07 qingxinhome