codis: sentinels often show stats ERROR in the codis-fe UI although they are actually healthy

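In the codis-fe UI, sentinels often show a stats ERROR state, but checking them with redis-cli shows they are healthy. Clicking sync does not resync them and produces no error message. After removing and re-adding the sentinel, the status returns to normal.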

Sep 06 '17 11:09 zhaomingzhu

The log should record the cause of the error. Please check what it says.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CodisLabs/codis/issues/1345, or mute the thread https://github.com/notifications/unsubscribe-auth/AAsHpazcaHiBrNU44nbNtXMYkucbjfneks5sfn1ZgaJpZM4PONWB .

Sep 06 '17 16:09 spinlock

@spinlock The following log messages appear:
2017/09/13 10:50:48 sentinel.go:67: [WARN] sentinel subscribe canceled (context canceled)
2017/09/13 10:50:48 topom_cache.go:224: [WARN] update sentinel:

Sep 13 '17 02:09 zhaomingzhu

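@spinlock I observed for a while longer. On the 3.2 versions from before the sentinel_client_timeout parameter was added, this did not happen. After upgrading to a build that has the sentinel_client_timeout configuration parameter, this error (sentinel subscribe canceled (context canceled)) shows up easily. Restarting the dashboard, or removing and re-adding the sentinel, restores the normal state; clicking sync directly does not resync. Could you please check whether there is a good way to solve this?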

Oct 16 '17 03:10 zhaomingzhu

@zhaomingzhu Could you send me your contact details privately? Let's communicate another way: wnzheng AT gmail.com

Oct 16 '17 06:10 spinlock

I have run into the same problem and don't know how to solve it.

Oct 19 '17 09:10 vipwangtian

Which version? Is it a branch or a release?

Oct 19 '17 09:10 spinlock

release3.2. Following the suggestion above, I set the sentinel_client_timeout parameter to 1000. I will restart the dashboard and keep observing.

Oct 19 '17 14:10 vipwangtian

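The problem persists. Removing and re-adding the sentinel resolves it. Below is the error information from the dashboard's topom. The dashboard log shows nothing abnormal, the sentinel log is also normal, and I can connect with redis-cli and run commands.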

"192.168.112.155:26379": {
                    "error": {
                        "Cause": "redigo: unexpected type for String, got type []interface {}",
                        "Stack": [
                            {
                                "Name": "github.com/CodisLabs/codis/pkg/utils/redis.(*Client).Info",
                                "File": "/root/go/src/github.com/CodisLabs/codis/pkg/utils/redis/client.go",
                                "Line": 105
                            },
                            {
                                "Name": "github.com/CodisLabs/codis/pkg/topom.(*Topom).RefreshRedisStats.func3",
                                "File": "/root/go/src/github.com/CodisLabs/codis/pkg/topom/topom_stats.go",
                                "Line": 83
                            },
                            {
                                "Name": "github.com/CodisLabs/codis/pkg/topom.(*Topom).newRedisStats.func1",
                                "File": "/root/go/src/github.com/CodisLabs/codis/pkg/topom/topom_stats.go",
                                "Line": 33
                            }
                        ]
                    },
                    "unixtime": 1508458576
                },

Oct 20 '17 00:10 vipwangtian

Sorry, I have checked this code and it should be fine. In particular, this is a RESP reply-parsing error: the INFO command should return a String, not []interface{}. Very strange.
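
For readers unfamiliar with the error text, here is a minimal Go sketch of how RESP2 announces the reply type in the first byte of each reply. The helper replyKind is hypothetical and is not part of Codis or redigo. INFO answers with a bulk string, while a command such as SENTINEL masters answers with an array, which redigo surfaces as []interface{}.

```go
package main

import "fmt"

// replyKind names the RESP2 reply type announced by the first byte of a reply.
// Hypothetical helper for illustration only.
func replyKind(firstByte byte) string {
	switch firstByte {
	case '+':
		return "simple string"
	case '-':
		return "error"
	case ':':
		return "integer"
	case '$':
		return "bulk string"
	case '*':
		return "array"
	default:
		return "unknown"
	}
}

func main() {
	// INFO replies with a bulk string; SENTINEL masters replies with an array.
	fmt.Println("INFO reply starts with '$':", replyKind('$'))
	fmt.Println("SENTINEL masters reply starts with '*':", replyKind('*'))
}
```

So the reported cause means that the code asking for the INFO reply was handed a reply that began with '*', that is, a reply belonging to some other command.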

Oct 20 '17 02:10 spinlock

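@spinlock It is strange. I do not have dedicated machines for a separate sentinel cluster, and the sentinel that fails is always the one deployed on the same machine as the dashboard.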

Oct 23 '17 09:10 vipwangtian

The same problem occurred in my production environment. It seems that the sentinel pipelining misbehaves.

Nov 05 '17 09:11 fancy-rabbit

My environment has this problem too.

Dec 21 '17 03:12 2002wmj

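Hmm, I suspect that in my sentinel pipeline handling, the error path does not promptly close the connection that failed.

Let me explain my guess: because (1) the cluster has many groups and (2) sentinel processes commands slowly, the sync times out. But the client that timed out is not closed promptly, so when that client is reused, replies no longer match the commands that were sent.

So I reworked the sentinel pipeline handling logic. You can swap in the new dashboard and give it a try.

Looking forward to your feedback, thanks.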
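
The hypothesized failure mode can be illustrated with a small, self-contained Go simulation. This is only a sketch under stated assumptions: fakeConn, send, recv and infoString are hypothetical stand-ins, not Codis or redigo code, and the "server" is just an in-memory queue that returns replies in the order the commands were sent.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeConn stands in for one pipelined connection: replies come back
// strictly in the order the commands were sent. Hypothetical type.
type fakeConn struct {
	pending []interface{} // replies queued by the "server", oldest first
}

// send queues the reply the server would produce for cmd.
func (c *fakeConn) send(cmd string) {
	switch cmd {
	case "SENTINEL masters":
		c.pending = append(c.pending, []interface{}{"master-1"}) // array reply
	case "INFO":
		c.pending = append(c.pending, "# Server\r\n") // bulk string reply
	}
}

// recv returns the oldest unread reply on the connection.
func (c *fakeConn) recv() (interface{}, error) {
	if len(c.pending) == 0 {
		return nil, errors.New("no reply available")
	}
	r := c.pending[0]
	c.pending = c.pending[1:]
	return r, nil
}

// infoString sends INFO and expects a string back.
func infoString(c *fakeConn) (string, error) {
	c.send("INFO")
	r, err := c.recv()
	if err != nil {
		return "", err
	}
	s, ok := r.(string)
	if !ok {
		return "", fmt.Errorf("unexpected type for String, got type %T", r)
	}
	return s, nil
}

func main() {
	// A caller sends SENTINEL masters, times out, and gives up without
	// reading the reply or closing the connection.
	stale := &fakeConn{}
	stale.send("SENTINEL masters")

	// The same connection is reused for INFO: the first reply read is the
	// stale array, so the string conversion fails.
	_, err := infoString(stale)
	fmt.Println("reused connection:", err)

	// A fresh connection (what closing the failed one would force) works.
	_, err = infoString(&fakeConn{})
	fmt.Println("fresh connection:", err)
}
```

In this sketch the reused connection stays off by one reply until it is replaced, which would be consistent with sync not helping while removing and re-adding the sentinel does.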

Dec 21 '17 08:12 spinlock

I upgraded the dashboard on January 12 to version 2017-12-28 13:21:33 +0000 @9fde2809cca131e3da1a7e0920ea151029301fb4 @3.2.1-10-g9fde280, and the problem still persists.

Jan 24 '18 06:01 vipwangtian

@vipwangtian Sorry, I only just saw this. I can hardly reproduce this bug now; could you provide more information?

@fancy-rabbit If possible, could you help me debug this? Thanks!

Feb 09 '18 05:02 spinlock

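@zhaomingzhu Previously, communication with sentinel was not pipelined. The advantage was simpler code; the drawback was that on a larger cluster a single sentinel operation could take tens of seconds or even minutes, which is unacceptable. That is why I changed it to use a pipeline. Unfortunately, I have never hit this error in my own maintenance, and with the limited setup I have it is hard to debug.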

Feb 09 '18 05:02 spinlock

Using the latest version, the problem still exists. Please keep following up, thanks.

version = 2017-12-28 13:21:33 +0000 @9fde2809cca131e3da1a7e0920ea151029301fb4 @3.2.2
compile = 2018-02-08 15:31:11 +0800 by go version go1.9.4 linux/amd64

Feb 21 '18 04:02 ecvjacky

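Happy New Year!

This looks like a fairly serious problem. I will find time next week to debug it. Since I have no environment at the moment, I may not be able to find the real cause.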

Feb 23 '18 02:02 spinlock

@vipwangtian Your stack trace is very helpful, thanks!

Feb 23 '18 02:02 spinlock

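@fancy-rabbit Hello, I just made some changes; you can review them.

The main change: wherever Pipeline is used, Client.Pipeline.Send and Client.Pipeline.Recv are compared, and if they do not match, the client is closed immediately.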
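
The idea of comparing sends against receives can be sketched in Go as follows. This is not the actual Codis change: countedPipeline and its methods are hypothetical names, and the counters only model the bookkeeping, not real network I/O.

```go
package main

import "fmt"

// countedPipeline is a hypothetical wrapper that counts pipelined commands
// sent and replies received on one connection.
type countedPipeline struct {
	sent     int
	received int
	closed   bool
}

// Send records that one command was written to the connection.
func (p *countedPipeline) Send() { p.sent++ }

// Recv records that one reply was read from the connection.
func (p *countedPipeline) Recv() { p.received++ }

// Release reports whether the connection may be reused. If any reply is
// still unread, the connection is marked closed instead of being reused.
func (p *countedPipeline) Release() bool {
	if p.sent != p.received {
		p.closed = true
		return false
	}
	return true
}

func main() {
	// Every command got its reply: safe to reuse.
	healthy := &countedPipeline{}
	healthy.Send()
	healthy.Recv()
	fmt.Println("healthy reusable:", healthy.Release())

	// A command timed out before its reply was read: close, do not reuse.
	timedOut := &countedPipeline{}
	timedOut.Send()
	fmt.Println("timed-out reusable:", timedOut.Release(), "closed:", timedOut.closed)
}
```

Closing on any imbalance trades an extra reconnect for the guarantee that a later command never reads a reply left over from an earlier one.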

Feb 23 '18 03:02 spinlock

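@spinlock Comparing them and closing immediately is a sound approach, though I still can't see where the previous code would go wrong. Scratching my head. The new version is now running in production for verification.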

Feb 25 '18 15:02 fancy-rabbit

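Sorry, I only just saw this. We have already removed sentinel from our production environment. I can upgrade the dashboard at the next maintenance window and observe again. @spinlock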

Mar 12 '18 07:03 vipwangtian

After long-term observation, I have never seen all three nodes in error at the same time. Sentinels also go into error even when the cluster has very few groups.

Mar 28 '18 06:03 zhaomingzhu

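This still happens. At worst, two sentinels show error (while actually running normally). It is at line 284 of sentinel.go, masterCommand.

It does not appear right after the sentinels are added; it only shows up after a few days.

The error is probably raised by values, err := redigo.Values(client.Do("SENTINEL", "masters"))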

Jan 18 '19 07:01 pengdafu

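Problem description

Version: Codis Latest release 3.2.2
Commit: 9fde280

In the Dashboard, the sentinels show Status Error one after another at intervals, and eventually all sentinels end up in that state, which affects Redis HA failover. The workaround is to remove and re-add them. The logs resemble those reported by others above.

In addition, to improve safety and failover speed, our core production environment uses Codis Proxy + Redis master-replica, with a Keepalived VIP entered as the Server IP. This is extravagant in that it uses somewhat more resources, but it is very stable. If your deployment is not large, you could try this combination.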

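Analysis

From the logs we initially suspected that the problem was related to Sentinel or the Dashboard.
After reading the source, we found that the master branch contains commits that modify the Dashboard code to fix the Status Error problem, but they have not been included in Codis Latest release 3.2.2: https://github.com/CodisLabs/codis/commits/release3.2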

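My question

Would you recommend that users download the master branch source, build the Dashboard binary themselves, and replace the Latest release 3.2.2 binary with it to fix this problem? This concerns production systems, so if anyone affected by the same problem has tried it, please report whether it was fixed.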

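Outlook

Many thanks to the author for developing and maintaining Codis for so long. It at least gives us a Redis cluster solution that is easy to scale out, relatively stable, and client-friendly.

Redis 6.0 adds redis cluster proxy, and I believe it will bring new breakthroughs in technical solutions as well.

I previously wrote an article about Codis that I hope is helpful: Redis(Codis) Distributed Cluster Deployment in Practice https://wsgzao.github.io/post/codis/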

Jan 16 '20 03:01 wsgzao