
Galera cluster check fails when the ProxySQL monitor user has REQUIRE SSL

Open phongdt2019 opened this issue 5 years ago • 8 comments

Hi,

When using the script galera_check.pl while the ProxySQL monitor user has REQUIRE SSL, access for that user is denied by all PXC nodes. As a result, galera_check.pl cannot determine the proper state of the PXC nodes and therefore fails to find a candidate for failover.

I think ProxySQL itself checks PXC node states in a different way when SSL encryption is enforced: while galera_check.pl reports Cannot connect to DBI:mysql:host=10.238.63.98;port=3306;, all the PXC nodes stay ONLINE in runtime_mysql_servers.

  1. I am using proxysql as the ProxySQL monitor user.

In ProxySQL:

| mysql-monitor_username | proxysql |
| mysql-monitor_password | secret   |

In MySQL:


mysql> select * from mysql.user where User like 'proxysql'\G
*************************** 1. row ***************************
                  Host: %
                  User: proxysql
              ssl_type: ANY
  2. My PXC servers as configured in ProxySQL's mysql_servers:
mysql> select * from mysql_servers;
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname     | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 20           | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 20           | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8010         | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8020         | 172.16.3.91  | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8020         | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8010         | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8010         | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8020         | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
9 rows in set (0.00 sec)
  3. ProxySQL scheduler:
+----+--------+-------------+-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+------+---------+
| id | active | interval_ms | filename                          | arg1                                                                                                                                                                                                      | arg2 | arg3 | arg4 | arg5 | comment |
+----+--------+-------------+-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+------+---------+
| 10 | 1      | 2000        | /var/lib/proxysql/galera_check.pl | -u=proxyadmin -p=proxyadmin -h=127.0.0.1 -P=6032 -H=10:W,20:R --active_failover=3 --debug=1 --retry_up=3 --retry_down=2 --single_writer=1 --writer_is_also_reader=0 --log=/var/lib/proxysql/galera_check_ | NULL | NULL | NULL | NULL |         |
+----+--------+-------------+-----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+------+------+------+---------+
  4. After enabling REQUIRE SSL for the monitor user

The PXC nodes start denying access to the ProxySQL monitor user:

2020-12-25T05:58:27.171725Z 2079 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES)
2020-12-25T05:58:29.484811Z 2080 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES)
2020-12-25T05:58:29.512120Z 2081 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES)
2020-12-25T05:58:29.553124Z 2082 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES)
2020-12-25T05:58:31.864506Z 2083 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES)
2020-12-25T05:58:31.896786Z 2084 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES)
2020-12-25T05:58:31.951574Z 2085 [Note] Access denied for user 'proxysql'@'vagrant-phong-anydbver-default-891d4c331b800bee.lxd' (using password: YES) 

then galera_check.pl reports that it cannot connect to the PXC nodes:

2020/12/25 05:58:31.999:[ERROR] Cannot connect to DBI:mysql:host=172.16.2.104;port=3306;mysql_connect_timeout=6 as proxysql
2020/12/25 05:58:31.999:[ERROR]  Node is not responding setting it as SHUNNED (internally) (ProxySQL bug - #2658)172.16.2.104:3306:HG8020

Meanwhile, the node states remain ONLINE in runtime_mysql_servers.

phongdt2019 avatar Dec 25 '20 06:12 phongdt2019

The root of the problem is that DBD::mysql has mysql_ssl set to false by default, which disables SSL. See https://metacpan.org/pod/DBD::mysql (look for mysql_ssl) for details.

Modifying the get_nodes function in galera_check.pl to always use mysql_ssl=1 can be done by changing this line:

$node->dns("DBI:mysql:host=".$ref->{hostname}.";port=".$ref->{port}.";mysql_ssl=1;mysql_connect_timeout=$mysql_connect_timeout");

Note the "mysql_ssl=1". That causes connections to succeed when the monitor user has "REQUIRE SSL" in MySQL.
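For reference, here is a minimal standalone sketch of such a connection (hypothetical host and credentials, mirroring the DSN above), just to show where mysql_ssl=1 fits:

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Hypothetical values for illustration only.
my $host    = '172.16.1.199';
my $port    = 3306;
my $timeout = 6;

# Without mysql_ssl=1, DBD::mysql connects in cleartext, and a
# REQUIRE SSL account rejects the login with "Access denied".
my $dsn = "DBI:mysql:host=$host;port=$port;"
        . "mysql_ssl=1;mysql_connect_timeout=$timeout";

my $dbh = DBI->connect($dsn, 'proxysql', 'secret',
                       { RaiseError => 0, PrintError => 1 })
    or die "Cannot connect to $dsn\n";

print "Connected over SSL\n";
$dbh->disconnect;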

In theory this should work too, but for some weird reason it doesn't, and I don't know enough Perl to say why:

$node->dns("DBI:mysql:host=".$ref->{hostname}.";port=".$ref->{port}.";mysql_ssl=".$ref->{use_ssl}.";mysql_connect_timeout=$mysql_connect_timeout");
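One guess (consistent with the use_ssl copy bug mentioned in the next comment, and not something I have verified): if $ref->{use_ssl} is undef, it interpolates as an empty string, so the DSN ends up containing a bare mysql_ssl=; and SSL stays off. A defensive sketch, assuming that is the cause:

# Sketch: default use_ssl to 0 when it was never populated on the
# node hashref, so the DSN never contains a bare "mysql_ssl=;".
my $use_ssl = defined $ref->{use_ssl} ? $ref->{use_ssl} : 0;

$node->dns( "DBI:mysql:host=" . $ref->{hostname}
          . ";port=" . $ref->{port}
          . ";mysql_ssl=$use_ssl"
          . ";mysql_connect_timeout=$mysql_connect_timeout" );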

Additionally, being able to set the client cert, key, CA, etc. via command line parameters would be awesome.

samiahlroos avatar Dec 26 '20 13:12 samiahlroos

Wait, there are two things here:

  1. The error is because the use_ssl value was not copied over (I assume), which is fixed in this pull request: https://github.com/Tusamarco/proxy_sql_tools/pull/20
  2. The scheduler is not using SSL to connect, which can be the other issue IF SSL is mandatory for the monitor user.

Keep in mind that this check MUST be as fast as possible, and it opens a new connection every time it checks a node. Given that, adding SSL will increase the time each check takes to execute, since the TLS handshake happens on every new connection. I can implement SSL no problem, but it needs to be used with care. @samiahlroos @phongdt2019 ^^
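To put a rough number on that overhead, a sketch like the following could compare connect times with and without TLS (hypothetical host and credentials; it assumes an account that accepts both modes):

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical node; connect once without and once with TLS
# and report how long each connection setup takes.
my $base = "DBI:mysql:host=172.16.1.199;port=3306;mysql_connect_timeout=6";

for my $ssl (0, 1) {
    my $t0  = [gettimeofday];
    my $dbh = DBI->connect("$base;mysql_ssl=$ssl", 'proxysql', 'secret',
                           { RaiseError => 1 });
    printf "mysql_ssl=%d connect took %.1f ms\n",
           $ssl, 1000 * tv_interval($t0);
    $dbh->disconnect;
}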

Tusamarco avatar Dec 26 '20 17:12 Tusamarco

@samiahlroos adding all the SSL files as separate parameters is too much. I can add a single ssl_certificates_path parameter, and the files must then have static names:

  • client-key.pem
  • client-cert.pem
  • ca.pem

So it will be: SSL_PATH = ssl_certificates_path=/my/path/to/ssl, and each file is loaded as SSL_PATH/<file>.

Sounds good enough?
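For illustration, a sketch of how that single parameter could expand inside the script, using DBD::mysql's documented certificate attributes (the ssl_certificates_path name is just the proposal above, not an existing option):

# Sketch: expand one ssl_certificates_path value into the three
# statically named files proposed above.
my $ssl_path = '/my/path/to/ssl';    # value of ssl_certificates_path
my $host     = '172.16.1.199';       # hypothetical node
my $port     = 3306;
my $timeout  = 6;

my $ssl_opts = "mysql_ssl=1"
             . ";mysql_ssl_client_key=$ssl_path/client-key.pem"
             . ";mysql_ssl_client_cert=$ssl_path/client-cert.pem"
             . ";mysql_ssl_ca_file=$ssl_path/ca.pem";

my $dsn = "DBI:mysql:host=$host;port=$port;$ssl_opts"
        . ";mysql_connect_timeout=$timeout";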

Tusamarco avatar Dec 26 '20 17:12 Tusamarco

Please check the branch https://github.com/Tusamarco/proxy_sql_tools/tree/ssl_implementation and let me know if it works as expected. @samiahlroos @phongdt2019 Thanks!

Tusamarco avatar Dec 26 '20 18:12 Tusamarco

Hi @Tusamarco

Thanks for the prompt response.

I tested the new patch; unfortunately, it doesn't work as expected.

The script galera_check.pl still cannot check the PXC node states properly when ProxySQL's monitor user has REQUIRE SSL on the PXC nodes. I still receive the same error as reported earlier in this issue.

I then removed REQUIRE SSL for the ProxySQL monitor user and tested failover with this new patch, but failover also gets stuck.

  • Before failover
proxysql> select * from runtime_mysql_servers;
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname     | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8020         | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8020         | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8020         | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8010         | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8010         | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 8010         | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 20           | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
| 20           | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |         |
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
9 rows in set (0.01 sec)
  • Shut down the main writer node.

  • Log from galera_check.pl --> the script is stuck repeating the writer-group delete: SQL:DELETE from mysql_servers where hostgroup_id in (10,9010) AND STATUS = 'ONLINE'

2020/12/28 01:19:35.209:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 3
2020/12/28 01:19:35.217:[WARN]  Move node:172.16.0.239;3306;8020;4010 SQL:INSERT INTO mysql_servers (hostgroup_id, hostname,port,gtid_port,status,weight,compression,max_connections,max_replication_lag,use_ssl,max_latency_ms,comment)  VALUES(20,'172.16.0.239',3306 ,0 ,'ONLINE' ,1,0,1000,0,1,0,'')
2020/12/28 01:19:35.217:[DEBUG] Check retry Node:172.16.0.239 port:3306 hg:8020 Time IN
2020/12/28 01:19:35.223:[WARN]  Reset retry to UP:4 Down:0for node:172.16.0.239;3306;8020;4010 SQL:UPDATE mysql_servers SET comment='10_W_20_R_retry_up=4;10_W_20_R_retry_down=0;' WHERE hostgroup_id=8020 AND hostname='172.16.0.239' AND port='3306'
2020/12/28 01:19:35.223:[DEBUG] Check retry Node:172.16.0.239 port:3306 hg:8020 Time OUT
2020/12/28 01:19:35.229:[WARN]  Move node:172.16.2.104;3306;8020;4010 SQL:INSERT INTO mysql_servers (hostgroup_id, hostname,port,gtid_port,status,weight,compression,max_connections,max_replication_lag,use_ssl,max_latency_ms,comment)  VALUES(20,'172.16.2.104',3306 ,0 ,'ONLINE' ,1,0,1000,0,1,0,'')
2020/12/28 01:19:35.229:[DEBUG] Check retry Node:172.16.2.104 port:3306 hg:8020 Time IN
2020/12/28 01:19:35.235:[WARN]  Reset retry to UP:3 Down:0for node:172.16.2.104;3306;8020;4010 SQL:UPDATE mysql_servers SET comment='10_W_20_R_retry_up=3;10_W_20_R_retry_down=0;' WHERE hostgroup_id=8020 AND hostname='172.16.2.104' AND port='3306'
2020/12/28 01:19:35.235:[DEBUG] Check retry Node:172.16.2.104 port:3306 hg:8020 Time OUT
2020/12/28 01:19:35.235:[INFO] Special Backup - Group found! I am electing a node to writer following the indications
 This Node Try to become the new WRITER for HG 10 Server details: 172.16.0.239:3306:HG8010
2020/12/28 01:19:35.235:[INFO] This Node Try to become a WRITER promoting to HG 10 172.16.0.239:3306:HG 8010
2020/12/28 01:19:35.235:[WARN]  DELETE from writer group as:  SQL:DELETE from mysql_servers where hostgroup_id in (10,9010) AND STATUS = 'ONLINE'
2020/12/28 01:19:37.376:[INFO] START EXECUTION
  • ProxySQL runtime_mysql_servers status --> showing no ONLINE node in the writer hostgroup.
proxysql> select * from runtime_mysql_servers;
+--------------+--------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+----------------------------------------------+
| hostgroup_id | hostname     | port | gtid_port | status       | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment                                      |
+--------------+--------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+----------------------------------------------+
| 10           | 172.16.1.199 | 3306 | 0         | OFFLINE_SOFT | 1      | 0           | 1000            | 0                   | 1       | 0              | 10_W_20_R_retry_up=0;10_W_20_R_retry_down=0; |
| 8020         | 172.16.0.239 | 3306 | 0         | ONLINE       | 1      | 0           | 1000            | 0                   | 1       | 0              | 10_W_20_R_retry_up=4;10_W_20_R_retry_down=0; |
| 8020         | 172.16.2.104 | 3306 | 0         | ONLINE       | 1      | 0           | 1000            | 0                   | 1       | 0              | 10_W_20_R_retry_up=3;10_W_20_R_retry_down=0; |
| 8020         | 172.16.1.199 | 3306 | 0         | SHUNNED      | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 8010         | 172.16.2.104 | 3306 | 0         | ONLINE       | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 8010         | 172.16.0.239 | 3306 | 0         | ONLINE       | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 8010         | 172.16.1.199 | 3306 | 0         | SHUNNED      | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 20           | 172.16.2.104 | 3306 | 0         | ONLINE       | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 20           | 172.16.0.239 | 3306 | 0         | ONLINE       | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
+--------------+--------------+------+-----------+--------------+--------+-------------+-----------------+---------------------+---------+----------------+----------------------------------------------+
9 rows in set (0.01 sec)
  • Start the former writer node; now the galera checker script keeps swapping the candidate writer:
2020/12/28 01:27:49.615:[WARN]  Reset retry to UP:1 Down:0for node:172.16.1.199;3306;10;1000 SQL:UPDATE mysql_servers SET comment='10_W_20_R_retry_up=1;10_W_20_R_retry_down=0;' WHERE hostgroup_id=10 AND hostname='172.16.1.199' AND port='3306'
2020/12/28 01:27:49.615:[DEBUG] Check retry Node:172.16.1.199 port:3306 hg:10 Time OUT
2020/12/28 01:27:49.615:[INFO] This Node Try to become a WRITER promoting to HG 10 172.16.0.239:3306:HG 20
2020/12/28 01:27:49.615:[WARN]  DELETE from writer group as:  SQL:DELETE from mysql_servers where hostgroup_id in (10,9010) AND STATUS = 'ONLINE'

2020/12/28 01:27:51.961:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 3
2020/12/28 01:27:51.976:[WARN]  Move node:172.16.0.239;3306;8020;4010 SQL:INSERT INTO mysql_servers (hostgroup_id, hostname,port,gtid_port,status,weight,compression,max_connections,max_replication_lag,use_ssl,max_latency_ms,comment)  VALUES(20,'172.16.0.239',3306 ,0 ,'ONLINE' ,1,0,1000,0,1,0,'')
2020/12/28 01:27:51.976:[DEBUG] Check retry Node:172.16.0.239 port:3306 hg:8020 Time IN
2020/12/28 01:27:51.985:[WARN]  Reset retry to UP:5 Down:0for node:172.16.0.239;3306;8020;4010 SQL:UPDATE mysql_servers SET comment='10_W_20_R_retry_up=5;10_W_20_R_retry_down=0;' WHERE hostgroup_id=8020 AND hostname='172.16.0.239' AND port='3306'
2020/12/28 01:27:51.985:[DEBUG] Check retry Node:172.16.0.239 port:3306 hg:8020 Time OUT
2020/12/28 01:27:51.985:[DEBUG] Check retry Node:172.16.1.199 port:3306 hg:10 Time IN
2020/12/28 01:27:51.994:[WARN]  Reset retry to UP:2 Down:0for node:172.16.1.199;3306;10;1000 SQL:UPDATE mysql_servers SET comment='10_W_20_R_retry_up=2;10_W_20_R_retry_down=0;' WHERE hostgroup_id=10 AND hostname='172.16.1.199' AND port='3306'
2020/12/28 01:27:51.994:[DEBUG] Check retry Node:172.16.1.199 port:3306 hg:10 Time OUT
2020/12/28 01:27:51.994:[INFO] This Node Try to become a WRITER promoting to HG 10 172.16.2.104:3306:HG 20
2020/12/28 01:27:51.994:[WARN]  DELETE from writer group as:  SQL:DELETE from mysql_servers where hostgroup_id in (10,9010) AND STATUS = 'ONLINE'

2020/12/28 01:27:54.389:[WARN] PXC maintenance on single writer, is asking for failover. Fail-over in action Using Method = 3
2020/12/28 01:27:54.399:[WARN]  Move node:172.16.1.199;3306;10;1000 SQL: UPDATE mysql_servers SET status='ONLINE' WHERE hostgroup_id=10 AND hostname='172.16.1.199' AND port='3306'
2020/12/28 01:27:54.399:[DEBUG] Check retry Node:172.16.1.199 port:3306 hg:10 Time IN
2020/12/28 01:27:54.405:[WARN]  Reset retry to UP:3 Down:0for node:172.16.1.199;3306;10;1000 SQL:UPDATE mysql_servers SET comment='10_W_20_R_retry_up=3;10_W_20_R_retry_down=0;' WHERE hostgroup_id=10 AND hostname='172.16.1.199' AND port='3306'
2020/12/28 01:27:54.405:[DEBUG] Check retry Node:172.16.1.199 port:3306 hg:10 Time OUT
2020/12/28 01:27:54.405:[INFO] This Node Try to become a WRITER promoting to HG 10 172.16.0.239:3306:HG 20
2020/12/28 01:27:54.405:[WARN]  DELETE from writer group as:  SQL:DELETE from mysql_servers where hostgroup_id in (10,9010) AND STATUS = 'ONLINE'
  • ProxySQL runtime_mysql_servers status after starting the former writer node --> there is still no ONLINE node in the writer hostgroup.
proxysql> select * from runtime_mysql_servers;
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+----------------------------------------------+
| hostgroup_id | hostname     | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment                                      |
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+----------------------------------------------+
| 20           | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 8020         | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              | 10_W_20_R_retry_up=7;10_W_20_R_retry_down=0; |
| 8020         | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              | 10_W_20_R_retry_up=5;10_W_20_R_retry_down=0; |
| 8020         | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              | 10_W_20_R_retry_up=3;10_W_20_R_retry_down=0; |
| 8010         | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 8010         | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 8010         | 172.16.1.199 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 20           | 172.16.0.239 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
| 20           | 172.16.2.104 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 1       | 0              |                                              |
+--------------+--------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+----------------------------------------------+
9 rows in set (0.01 sec)

phongdt2019 avatar Dec 28 '20 01:12 phongdt2019

I tested the ssl_implementation branch. After the fix in https://github.com/Tusamarco/proxy_sql_tools/pull/21 it works for me, for both REQUIRE SSL and REQUIRE NONE cases.

samiahlroos avatar Dec 28 '20 09:12 samiahlroos

@samiahlroos @phongdt2019 I have done some additional editing on the code and fixed the bug Sami indicated. I have also run some failover-scenario tests, and everything seems to work as expected. Given that, the changes are now merged into main and the ssl_implementation branch will be deleted. Please let me know if there is any additional issue.

Tusamarco avatar Dec 28 '20 12:12 Tusamarco

Just for the record, I also tested the latest changes with and without REQUIRE SSL and found them to work. The use_ssl and other fields are also kept as they should be. Looking good to me!

samiahlroos avatar Dec 30 '20 14:12 samiahlroos