
Master/slave switchover does not work

Open evilstar2016 opened this issue 6 years ago • 0 comments

Scenario: manually stop the master node of the CacheServer.

Router log:
2019-10-09 14:05:26|663|doSwitchCheck|SwitchThread::doSwitchCheck find TimeOut groupName:myAppMKVtest2MKVGroup1 masterServerName:DCache.myAppMKVtest2MKVCacheServer1-1 now:1570601126 rePortTime:1570601062 Timeout:60
2019-10-09 14:05:26|665|doSwitchCheck|1570601126|60|1570601126
2019-10-09 14:05:26|678|doSwitchCheck|moduleName:myAppMKVtest2 | itrGroupInfo: myAppMKVtest2MKVGroup1
2019-10-09 14:05:26|681|doSwitchCheck|
2019-10-09 14:05:26|684|doSwitchCheck|
2019-10-09 14:05:26|695|doSwitchCheck|
2019-10-09 14:05:26|706|doSwitchCheck|master heartbeat overtime
2019-10-09 14:05:26|check server setting state.
2019-10-09 14:05:26|heartBeatSend to masterName:DCache.myAppMKVtest2MKVCacheServer1-1
2019-10-09 14:05:26|652|doSwitchCheck|SwitchThread::doSwitchCheck find master reportTime==0 myAppOneCacheKVGroup1 masterServerName:DCache.myAppOneCacheKVCacheServer1-1
2019-10-09 14:05:26|729|doSwitchCheck|SwitchThread::doSwitchCheck find slave reportTime==0 myAppOneCacheKVGroup1 slaveServerName:DCache.myAppOneCacheKVCacheServer1-2
2019-10-09 14:05:26|652|doSwitchCheck|SwitchThread::doSwitchCheck find master reportTime==0 secondKVGroup1 masterServerName:DCache.secondKVCacheServer1-1
2019-10-09 14:05:26|729|doSwitchCheck|SwitchThread::doSwitchCheck find slave reportTime==0 secondKVGroup1 slaveServerName:DCache.secondKVCacheServer1-2
2019-10-09 14:05:26|652|doSwitchCheck|SwitchThread::doSwitchCheck find master reportTime==0 thirdKVGroup1 masterServerName:DCache.thirdKVCacheServer1-1
2019-10-09 14:05:26|729|doSwitchCheck|SwitchThread::doSwitchCheck find slave reportTime==0 thirdKVGroup1 slaveServerName:DCache.thirdKVCacheServer1-2
2019-10-09 14:05:29|SwitchThread::doSwitch catch exception: [ServantProxy::invoke timeout:3000,servant:DCache.myAppMKVtest2MKVCacheServer1-1.RouterClientObj,func:helloBaby,adaptertcp -h 10.4.120.136 -p 19046,reqid:5]
2019-10-09 14:05:32|SwitchThread::doSwitch catch exception: [ServantProxy::invoke timeout:3000,servant:DCache.myAppMKVtest2MKVCacheServer1-1.RouterClientObj,func:helloBaby,adaptertcp -h 10.4.120.136 -p 19046,reqid:6]
2019-10-09 14:05:32|heartBeatSend to slaveName:DCache.myAppMKVtest2MKVCacheServer1-2
2019-10-09 14:05:32|SwitchThread::doSwitch send heartBeat ok ServerName:DCache.myAppMKVtest2MKVCacheServer1-2
2019-10-09 14:05:32|heartBeatSend to slaveName ok:DCache.myAppMKVtest2MKVCacheServer1-2
2019-10-09 14:05:32|query slaveBinlogdif from slaveName:DCache.myAppMKVtest2MKVCacheServer1-2
2019-10-09 14:05:32|SwitchThread:: slaveBinlogdif diffBinlogTime(1570590077) > 300 DCache.myAppMKVtest2MKVCacheServer1-2
2019-10-09 14:05:32|removeSwitchModule moduleName : myAppMKVtest2| switch tasks : 0

Error message shown for the Router service in the console:
2019-10-09 14:05:32|SwitchThread:: slaveBinlogdif diffBinlogTime(1570590077) > 300 DCache.myAppMKVtest2MKVCacheServer1-2

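My guess is that the slave node's binlog is more than 300 seconds behind the master's, which causes the switch to fail. Looking into the cause, I found that the clocks on the two hosts were not synchronized. (Note: I cannot confirm that the clock skew caused this problem; I am only ruling out anomalous factors.)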
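The numbers in the Router log can be checked by hand. The sketch below only does arithmetic on values copied from that log. Reading `diffBinlogTime` as a Unix timestamp is an assumption based on its magnitude; the log itself does not say what the value represents.

```python
from datetime import datetime, timezone

# Values copied from the Router log.
now = 1570601126
report_time = 1570601062
heartbeat_timeout = 60
diff_binlog_time = 1570590077
binlog_diff_limit = 300

# Seconds since the master last reported. The "find TimeOut" line appears
# to be logged once this exceeds the Timeout value.
heartbeat_age = now - report_time
print(heartbeat_age, heartbeat_age > heartbeat_timeout)  # 64 True

# Assumption: a value this large looks like an absolute timestamp,
# not a lag measured in seconds.
as_utc = datetime.fromtimestamp(diff_binlog_time, tz=timezone.utc)
print(as_utc.isoformat())  # 2019-10-09T03:01:17+00:00

# How far the reported value is above the 300-second limit.
print(diff_binlog_time - binlog_diff_limit)
```

If that reading is right, the reported difference is roughly the size of a full timestamp, which is far more than a clock skew of seconds or minutes would explain. One possible explanation is that one side of the comparison was zero or unset, but that is a guess from the log and not confirmed behavior.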

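Question 1: What exactly causes this error? If I want to suppress it so that master/slave switchover works normally, which file's time value should I change? Or is there some other way?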

After I synchronized the clocks and waited for the binlog to update, a new problem appeared.

Router log:
2019-10-09 14:08:26|663|doSwitchCheck|SwitchThread::doSwitchCheck find TimeOut groupName:myAppMKVtest2MKVGroup1 masterServerName:DCache.myAppMKVtest2MKVCacheServer1-1 now:1570601306 rePortTime:1570601243 Timeout:60
2019-10-09 14:08:26|665|doSwitchCheck|1570601306|60|1570601306
2019-10-09 14:08:26|678|doSwitchCheck|moduleName:myAppMKVtest2 | itrGroupInfo: myAppMKVtest2MKVGroup1
2019-10-09 14:08:26|681|doSwitchCheck|
2019-10-09 14:08:26|684|doSwitchCheck|
2019-10-09 14:08:26|695|doSwitchCheck|
2019-10-09 14:08:26|706|doSwitchCheck|master heartbeat overtime
2019-10-09 14:08:26|removeSwitchModule moduleName : myAppMKVtest2| switch tasks : 0

The log prints the lines above in a loop and nothing in it looks abnormal, but the console shows this error for the Router service:
DoSwitchThread::doSwitchMaterSlave switch times over the SwitchMaxTimes: 3, so not do switch groupName:myAppMKVtest2MKVGroup1, masterServer:DCache.myAppMKVtest2MKVCacheServer1-1, slaveServer:DCache.myAppMKVtest2MKVCacheServer1-2

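The switch was attempted more than 3 times without success. Client requests also fail with timeout errors, so no switch actually took place. I do not know the exact cause.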
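To make the sequence in the two Router logs easier to reason about, here is a minimal model of the checks the log messages seem to imply. It is inferred from the log text only and is not DCache's implementation: `GroupState`, `switch_decision`, the order of the checks, and the `>=` comparison against the switch limit are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class GroupState:
    """Hypothetical snapshot of one cache group, for illustration only."""
    now: int                  # current Unix time on the Router
    master_report_time: int   # last heartbeat time reported by the master
    slave_heartbeat_ok: bool  # whether the slave answered the heartbeat probe
    binlog_diff: int          # value logged as diffBinlogTime
    switch_count: int         # switch attempts already counted for the group


def switch_decision(state, timeout=60, max_binlog_diff=300, max_switch_times=3):
    """Return a short reason string; the defaults are the limits seen in the logs."""
    if state.now - state.master_report_time <= timeout:
        return "no switch: master heartbeat is fresh"
    # Assumption: the attempt limit is checked before any probing, because the
    # second log jumps from "master heartbeat overtime" to removeSwitchModule.
    if state.switch_count >= max_switch_times:
        return "refused: switch times over SwitchMaxTimes"
    if not state.slave_heartbeat_ok:
        return "refused: slave did not answer heartbeat"
    if state.binlog_diff > max_binlog_diff:
        return "refused: slave binlog diff over limit"
    return "switch"


# First log: timestamps and diffBinlogTime are from the log; switch_count is assumed.
first = GroupState(1570601126, 1570601062, True, 1570590077, 0)
# Second log: timestamps are from the log; binlog_diff and switch_count are assumed.
second = GroupState(1570601306, 1570601243, True, 0, 3)
print(switch_decision(first))
print(switch_decision(second))
```

Under this model, attempts that were refused earlier because of the binlog check would still count toward the limit, which would explain why the switch stays blocked even after the binlog catches up. Whether DCache counts refused attempts this way is not confirmed by the logs.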

The slave CacheServer's log keeps printing the following in a loop:
2019-10-09 14:18:26|24350|ERROR|[MKBinLogThread::syncCompress] getLogCompress exception:server unknown exception: ret:-10 msg:[ServantProxy::invoke errno:-10,info:,servant:DCache.myAppMKVtest2MKVCacheServer1-1.BinLogObj,func:getLogCompressWithPart,reqid:0]
2019-10-09 14:18:26|24345|ERROR|[TARS][CommunicatorEpoll::handleInputImp] connect error tcp -h 10.4.120.136 -p 19044,DCache.myAppMKVtest2MKVCacheServer1-1.BinLogObj,_connExcCnt=6,Connection refused
2019-10-09 14:18:26|24345|ERROR|[TARS][ObjectProxy::invoke, objname:DCache.myAppMKVtest2MKVCacheServer1-1.ControlAckObj,selectAdapterProxy is null]
2019-10-09 14:18:26|24355|ERROR|HeartBeatThread::Run connect hb exception: server unknown exception: ret:-10 msg:[ServantProxy::invoke errno:-10,info:,servant:DCache.myAppMKVtest2MKVCacheServer1-1.ControlAckObj,func:connectHb,reqid:0]
2019-10-09 14:18:26|24345|ERROR|[TARS][CommunicatorEpoll::handleInputImp] connect error tcp -h 10.4.120.136 -p 19049,DCache.myAppMKVtest2MKVCacheServer1-1.ControlAckObj,_connExcCnt=6,Connection refused

Here, 10.4.120.136 is the IP of the master node.
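"Connection refused" usually means nothing is listening on the target port, which is what one would expect after stopping the master by hand. A quick way to confirm this from the slave host is a plain TCP connect test. The helper below is a generic sketch using the Python standard library; `port_accepts_connections` is a made-up name and has nothing to do with DCache or TARS.

```python
import socket


def port_accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_accepts_connections("10.4.120.136", 19044)` and the same for port 19049 on the slave host would show whether the master's BinLogObj and ControlAckObj endpoints are reachable. If both return `False` while the master is stopped, the slave's errors are a symptom of the stopped master and not a separate fault.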

Question 2: What exactly causes this error, and how can it be fixed?

evilstar2016, Oct 11 '19 03:10