Tailscale step runs successfully but subsequent steps that connect to the DB fail
We created the correct tags and set the scope to device.
The Tailscale step runs (though I don't see any confirmation that we are connected), but the step that runs my tests fails with:
ERROR tests/mycode/code/test_my_code.py - sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'mysqlserver.us-east-1.rds.amazonaws.com' (timed out)")
We also see the node being created in the Tailscale admin console, but I keep getting a timeout when I run pytest.
```yaml
name: Python application

on:
  push:
    branches: [ "feature/github-actions" ]
  pull_request:
    branches: [ "feature/github-actions" ]

env:
  AWS_CONFIG_FILE: .github/workflows/aws_config
  DB_NAME: "mydbname"
  DB_READ_SERVER: "mysqlserver.us-east-1.rds.amazonaws.com"
  DB_USERNAME: "root"
  DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
  AWS_PROFILE: "dev"
  API_VERSION: "v1"
  FRONT_END_KEY: ${{ secrets.FRONT_END_KEY }}
  LOG_LEVEL: "INFO"
  DB_USER_ID: 32
  SENTRY_SAMPLE_RATE: 1
  NUMEXPR_MAX_THREADS: "8"
  LOG_LEVEL_CONSOLE: True
  LOG_LEVEL_ALGORITHM: "INFO"
  LOG_LEVEL_DB: "WARNING"

permissions:
  contents: read

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Tailscale
        uses: tailscale/github-action@v2
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:cicd
      - uses: actions/checkout@v4
      - name: Set up Python 3.12
        uses: actions/setup-python@v3
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
      - name: Test with pytest
        env:
          PYTHONPATH: ${{ github.workspace }}/src
        run: |
          pytest
```
Switching the URL to a direct IP did the trick, so it looks like a DNS issue. I will leave this issue open, as I'd prefer not to use a direct IP.
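To confirm it is DNS rather than connectivity, a debug step along these lines can help (a sketch, not from the workflow above; `getent` and `resolvectl` are standard on the ubuntu-latest runner, and the hostname is the RDS endpoint used earlier):

```yaml
      - name: Debug DNS resolution
        run: |
          # Does the runner resolve the RDS hostname at all?
          getent hosts mysqlserver.us-east-1.rds.amazonaws.com || echo "hostname did not resolve"
          # What DNS configuration has Tailscale applied (MagicDNS / split DNS)?
          resolvectl status || true
          tailscale status
```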
I'm encountering a similar timeout error, although it doesn't seem to be DNS in my case, as the IP is resolved properly:
Error: Error connecting to PostgreSQL server database.us-east-1.rds.amazonaws.com (scheme: awspostgres): dial tcp correct.ip.address:5432: connect: connection timed out
@henworth Have you set up your security policies correctly for your Tailscale instance?
Yep, I've done all this. It was working fine before, and now I'm not sure what's wrong.
Connectivity to this DB works fine from other non-GitHub nodes, using either the hostname or the IP.
I also started having issues two weeks ago. I have also verified that things work fine outside of GitHub Actions using the same configuration.
I am having the same issue. It had been working perfectly so far, but today I'm getting random i/o timeouts.
Same here! I had random failures, especially on the first connection to our RDS instance (running in AWS) from a GitHub Actions worker (running in Azure). Subsequent connections after the first failure would succeed. I did some debugging and found that the connection was going through DERP despite the inbound WireGuard port being open for IPv4/IPv6 on the AWS side.
I changed our workflow to run a single ping to the subnet router's DNS hostname right after bringing up Tailscale, and that dramatically improved reliability, though I still had 1 failure in 10 (and that time it was the ping itself that failed).
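The extra step looks roughly like this (a sketch; `my-subnet-router` is a placeholder for the subnet router's MagicDNS name):

```yaml
      - name: Ping the subnet router
        run: |
          # "my-subnet-router" is a placeholder; use your subnet router's MagicDNS name or Tailscale IP
          ping -c 1 my-subnet-router
```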
I then set up Split DNS and haven't had a failure since, though I've only had 10 or so runs since then.
My issue turned out to be related to the stateful filtering added in v1.66.0. Once I disabled that on my subnet routers the problem disappeared.
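In case it helps others, disabling it looks roughly like this (a sketch; run on the subnet router itself, not in the workflow, and it requires Tailscale v1.66.0 or later):

```sh
# On the subnet router host: turn off stateful filtering
sudo tailscale set --stateful-filtering=false
```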
I wonder if there's a propagation delay here? E.g. a new node comes up but doesn't propagate fast enough. I wonder if adding a wait of 5 seconds or so would help. Maybe that's why pinging may have helped?
The stateful filtering is interesting, but it seems to be disabled by default.
@henworth Can you describe which flags you changed? I think I'm seeing something similar, but in the Helm world this time.
Update:
> `--stateful-filtering`: Enable stateful filtering for [subnet routers](https://tailscale.com/kb/1019/subnets) and [exit nodes](https://tailscale.com/kb/1103/exit-nodes). When enabled, inbound packets with another node's destination IP are dropped, unless they are a part of a tracked outbound connection from that node. Defaults to disabled.
Seems like the default is false?
At the time I wrote that comment the default was true; it has since been changed to false in a subsequent release.
v4.0.0 of the action now includes a `ping` parameter that you can use to specify which devices need to be reachable before your CI job proceeds. We are hopeful that this will resolve your issue. If it does not, please let us know by reopening the ticket.
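A sketch of how that could look in the workflow above (assuming the device the job needs to reach is a subnet router, here called `my-subnet-router` as a placeholder):

```yaml
      - name: Tailscale
        uses: tailscale/github-action@v4
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:cicd
          # wait until this device is reachable before later steps run
          ping: my-subnet-router
```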