frontend - TAPE REST API: Transient `SSL handshake failed: sslv3 alert certificate unknown` errors
Hello,
We observed a large number of transient SSL errors when FTS queries the status of staging requests against the frontend servers:
STAGING [13] [Tape REST API] Stage pooling call failed: (Neon): SSL handshake failed: sslv3 alert certificate unknown
See https://fts.usatlas.bnl.gov:8449/fts3/ftsmon/#/job/fd576cd4-2c56-11ef-8623-00163e1051a4
The error seems to correspond to the server failing to validate the client certificate and its certification authority. These servers also host gPlazma, and we did not observe any authentication failures at that time. We suspect it may be related to the CRLs being renewed and reloaded on the frontends.
The error message can be reproduced with the Python code below by emptying the /etc/grid-security/certificates directory on the frontend.
Any help appreciated.
#!/usr/bin/env python3
import requests
id = "32485037-df6d-4c96-ab79-c409e0e2f238"
url = f'https://dcint-frontend001.sdcc.bnl.gov:3880/api/v1/tape/stage/{id}'
headers = {'Content-Type': 'application/json'}
cert_path = '/tmp/x509up_u0'
response = requests.get(url, headers=headers, cert=(cert_path, cert_path), verify='/etc/grid-security/certificates')
print(response.text)
print(response)
requests.exceptions.SSLError: HTTPSConnectionPool(host='dcint-frontend001.sdcc.bnl.gov', port=3880): Max retries exceeded with url: /api/v1/tape/stage/27ec6771-7d28-483d-97e6-99e2df30f959 (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:877)'),))
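To help rule the empty-directory scenario in or out, a small check like the one below could be run on the frontend around the time of the errors. It is only a sketch: the function name `summarize_ca_dir` and the 24-hour staleness threshold are hypothetical, and it assumes the usual hashed layout where CA certificates are named `<hash>.0` and CRLs `<hash>.r0`. It looks at file modification times only and does not parse the CRLs themselves.

```python
#!/usr/bin/env python3
import time
from pathlib import Path


def summarize_ca_dir(ca_dir, max_crl_age_hours=24.0):
    """Count hashed CA certificates (*.0) and CRLs (*.r0) in ca_dir.

    max_crl_age_hours is an arbitrary, illustrative threshold applied to the
    file modification time; the CRL's own nextUpdate field is not inspected.
    """
    path = Path(ca_dir)
    summary = {"exists": path.is_dir(), "ca_certs": 0, "crls": 0, "stale_crls": []}
    if not summary["exists"]:
        return summary
    now = time.time()
    for entry in path.iterdir():
        try:
            mtime = entry.stat().st_mtime
        except OSError:
            # Broken symlink, or a file removed while an update is running.
            continue
        if entry.name.endswith(".r0"):
            summary["crls"] += 1
            if now - mtime > max_crl_age_hours * 3600:
                summary["stale_crls"].append(entry.name)
        elif entry.name.endswith(".0"):
            summary["ca_certs"] += 1
    summary["stale_crls"].sort()
    return summary
```

Calling `summarize_ca_dir('/etc/grid-security/certificates')` periodically and logging the result would show whether the number of CA certificates or CRLs ever drops to zero while the handshake errors occur.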
Do you run fetch-crl?
Vincent:
$ cat rest.py
#!/usr/bin/env python3
import os
import requests
id = "32485037-df6d-4c96-ab79-c409e0e2f238"
url = f'https://cmsdcatape.fnal.gov:3880/api/v1/tape/stage/{id}'
headers = {'Content-Type': 'application/json'}
uid = os.getuid()
cert_path = f'/tmp/x509up_u{uid}'
response = requests.get(url, headers=headers, cert=(cert_path, cert_path), verify='/etc/grid-security/certificates')
print(response.text)
print(response)
Running without a VOMS proxy:
$ python3 rest.py
Traceback (most recent call last):
File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connection.py", line 424, in connect
tls_in_tls=tls_in_tls,
File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 450, in ssl_wrap_socket
sock, context, tls_in_tls, server_hostname=server_hostname
File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/lib64/python3.6/ssl.py", line 365, in wrap_socket
_context=self, _session=session)
File "/usr/lib64/python3.6/ssl.py", line 776, in __init__
self.do_handshake()
File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
self._sslobj.do_handshake()
File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:877)
Running with a VOMS proxy:
$ voms-proxy-info
subject : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Dmitry Litvintsev/CN=UID:litvinse/CN=4175574056
issuer : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Dmitry Litvintsev/CN=UID:litvinse
identity : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Dmitry Litvintsev/CN=UID:litvinse
type : RFC compliant proxy
strength : 2048 bits
path : /tmp/x509up_u8637
timeleft : 119:58:03
$ python3 rest.py
{"detail":"request 32485037-df6d-4c96-ab79-c409e0e2f238 not found","title":"Not Found","status":"404"}
<Response [404]>
Do you run fetch-crl?
Every 6 hours.
@DmitryLitvintsev
ssl.SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:877)
Good catch. The error might indicate a problem on the client side as well, which makes it harder to debug.
What would the corresponding error messages for such SSL errors look like in the dCache domain logs or access logs on the door and frontend?
As we understand from the discussion at the Tier-1 support meeting, the certificate directory temporarily becomes empty. Can you confirm that?
Yes, also, may I ask you how you update certificates? On our system we have never seen any issues.
# ls -al /etc/grid-security/
total 7616
drwxr-xr-x 5 root root 4096 Jun 20 12:14 .
drwxr-xr-x. 141 root root 12288 Jun 25 08:01 ..
lrwxrwxrwx 1 root root 21 Jun 20 11:44 certificates -> certificates-1.119NEW
drwxr-xr-x 2 root root 40960 Jun 25 11:45 certificates-1.119NEW
...
The /etc/grid-security/certificates is a soft link to /etc/grid-security/certificates-1.119NEW
The CRLs are updated by cron:
10 * * * * root [ ! -f /var/lock/subsys/osg-update-certs-cron ] || /usr/sbin/osg-update-certs --random-sleep 2700 --called-from-cron > /dev/null 2>&1
provided by the osg-ca-scripts package. It works like this: it creates a new directory, fills it, moves the symbolic link to point at it, and then removes the old directory, which is no longer visible to applications. It has never failed to work with dCache.
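For sites that update the directory in place, the symlink-switch approach described above can be sketched in a few lines of Python. This is not the actual osg-update-certs implementation, which I have not checked; the function name `switch_symlink` and the temporary link name are made up for illustration. The idea is that the new link is created under a temporary name and then renamed over the old one, so applications always see either the old or the new directory, never an empty or missing one.

```python
#!/usr/bin/env python3
import os


def switch_symlink(link_path, new_target):
    """Repoint the symlink link_path at new_target in a single rename.

    The temporary name is hypothetical; it only needs to live in the same
    directory as link_path so that the rename stays on one filesystem.
    """
    tmp_link = f"{link_path}.tmp.{os.getpid()}"
    os.symlink(new_target, tmp_link)
    try:
        # rename(2) replaces the old symlink itself and does not follow it.
        os.replace(tmp_link, link_path)
    except OSError:
        os.unlink(tmp_link)
        raise
```

With the layout shown above, a call such as `switch_symlink('/etc/grid-security/certificates', 'certificates-1.119NEW')` would be made only after the new directory has been completely filled, and the old directory would be removed afterwards.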
We have some updates:
- This error was seen not only at BNL, but also at SARA and FZK
- A probe checking the HTTPS connections for all doors and frontends has been deployed
- The underlying HTTP library in FTS has been changed from Davix/Neon to libcurl due to different SSL connection errors observed at INFN
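For reference, a probe of the kind mentioned in the list above could look roughly like the following. This is a minimal sketch and not the probe that was actually deployed; the function name `probe_tls` and its parameters are hypothetical. It performs one TLS handshake per call, optionally presenting a proxy certificate, and reports whether it succeeded.

```python
#!/usr/bin/env python3
import socket
import ssl


def probe_tls(host, port, ca_dir=None, proxy=None, timeout=5.0):
    """Attempt one TLS handshake with host:port and return (ok, detail).

    ca_dir is a hashed CA directory used to verify the server; proxy is a
    file holding both the client certificate chain and its private key.
    """
    try:
        context = ssl.create_default_context(capath=ca_dir)
        if proxy:
            context.load_cert_chain(certfile=proxy, keyfile=proxy)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLError as exc:
        # Covers handshake alerts such as "certificate unknown".
        return False, f"TLS error: {exc.reason or exc}"
    except OSError as exc:
        # Connection refused, timeouts, DNS failures, unreadable proxy file.
        return False, f"connection error: {exc}"
```

A call such as `probe_tls('dcint-frontend001.sdcc.bnl.gov', 3880, ca_dir='/etc/grid-security/certificates', proxy='/tmp/x509up_u0')` would exercise the same endpoint as the reproducer. One caveat: with TLS 1.3 a rejected client certificate may only surface on the first read after the handshake, so a real probe would probably also send a small HTTP request.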
As we understand from the discussion at the Tier-1 support meeting, the certificate directory temporarily becomes empty. Can you confirm that?
To clarify, the certificate directory was not observed to be empty at the time of the issue or afterwards. In the meeting we were discussing possible scenarios in which CRL availability is compromised.
Yes, also, may I ask you how you update certificates? On our system we have never seen any issues.
# ls -al /etc/grid-security/
total 7616
drwxr-xr-x 5 root root 4096 Jun 20 12:14 .
drwxr-xr-x. 141 root root 12288 Jun 25 08:01 ..
lrwxrwxrwx 1 root root 21 Jun 20 11:44 certificates -> certificates-1.119NEW
drwxr-xr-x 2 root root 40960 Jun 25 11:45 certificates-1.119NEW
...
The /etc/grid-security/certificates is a soft link to /etc/grid-security/certificates-1.119NEW
The CRLs are updated by cron:
10 * * * * root [ ! -f /var/lock/subsys/osg-update-certs-cron ] || /usr/sbin/osg-update-certs --random-sleep 2700 --called-from-cron > /dev/null 2>&1
provided by the osg-ca-scripts package. It works like this: it creates a new directory, fills it, moves the symbolic link to point at it, and then removes the old directory, which is no longer visible to applications. It has never failed to work with dCache.
In our setup there is no symlink involved: fetch-crl runs directly against /etc/grid-security/certificates. Thanks for sharing your configuration.
Since FTS@BNL was changed from libneon to libcurl, this issue has not been observed anymore.