[Bug] Server crashes when I select the incorrect set of primary keys for Value diff
Current Behavior
I selected the wrong primary keys in Value Diff and the server crashed.
logs:
Future exception was never retrieved
future: <Future finished exception=RecceException('Invalid primary key: date_year. The column should be unique. Please check by this sql: \'\n\nselect\n date_year as unique_field,\n count(*) as n_records\n\nfrom "live"."sponsored_collections"."partner_revenue_status_changes_yearly"\nwhere date_year is not null\ngroup by date_year\nhaving count(*) > 1\n\n\'')>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/site-packages/recce/apis/run_func.py", line 139, in fn
raise e
File "/usr/local/lib/python3.10/site-packages/recce/apis/run_func.py", line 132, in fn
result = task.execute()
File "/usr/local/lib/python3.10/site-packages/recce/tasks/valuediff.py", line 230, in execute
self._verify_primary_key(dbt_adapter, primary_key, model)
File "/usr/local/lib/python3.10/site-packages/recce/tasks/valuediff.py", line 71, in _verify_primary_key
raise RecceException(
recce.exceptions.RecceException: Invalid primary key: date_year. The column should be unique. Please check by this sql: '
select
date_year as unique_field,
count(*) as n_records
from "live"."sponsored_collections"."partner_revenue_status_changes_yearly"
where date_year is not null
group by date_year
having count(*) > 1
'
./run-scripts/start-recce.sh: line 40: 12986 Killed recce server --host "$RECCE_SERVER_HOST" --port "$RECCE_SERVER_PORT"
make: *** [makefile:95: start-recce] Error 137
/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Expected Behavior
I'd expect an error message instead and then the ability to select again. Potentially showing me a sample of the duplicate values and a query I can use to double check.
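As a rough illustration of the "sample of the duplicate values plus a query" idea, here is a minimal sketch using an in-memory SQLite table. The `find_duplicate_keys` helper, the `revenue` table, and its rows are hypothetical and only for illustration; this is not Recce's implementation, and the real check runs against the warehouse through dbt.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_columns, limit=5):
    """Return the check query and up to `limit` key values that occur more than once.

    Hypothetical helper: identifiers are interpolated directly, so `table` and
    `key_columns` must come from trusted metadata, never from raw user input.
    """
    cols = ", ".join(f'"{c}"' for c in key_columns)
    sql = (
        f"select {cols}, count(*) as n_records "
        f'from "{table}" '
        f"group by {cols} "
        f"having count(*) > 1 "
        f"order by n_records desc "
        f"limit ?"
    )
    return sql, conn.execute(sql, (limit,)).fetchall()


# Made-up sample data: date_year alone is not unique, (date_year, partner) is.
conn = sqlite3.connect(":memory:")
conn.execute("create table revenue (date_year integer, partner text)")
conn.executemany(
    "insert into revenue values (?, ?)",
    [(2023, "a"), (2023, "b"), (2024, "a")],
)

sql, duplicates = find_duplicate_keys(conn, "revenue", ["date_year"])
print(sql)         # the query a user could rerun to double check
print(duplicates)  # a small sample of offending key values
```

Returning both the query text and a few offending rows would let the UI show why the key was rejected without the user leaving the page.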
Steps To Reproduce
- I started recce locally (v0.54) and opened the UI
- On lineage the changed model was already selected
- navigated to explore change > value diff
- selected the wrong primary keys
- server crashes
Relevant log output
Environment
- recce: 0.54
- OS: macOS 15.3.1
- Python: 3.10.16
- Data Warehouse: AWS Redshift
- dbt: 1.8.3
Hi @LePeti
Thanks for opening the issue. I've tried to reproduce it but am currently unable to. I'll escalate this to the dev team to take a look.
When you say the server crashes, do you mean that:
- An error message is displayed, as in this screenshot:
- Or, does the actual server process crash on the CLI, resulting in a server disconnect message in the web UI, like this:
Thanks,
Dave
hi @DaveFlynn ,
it's the latter of the two. My expected behavior would be the former. I created a video recording: https://drive.google.com/file/d/1rhjiSouSDruIvNNDoB-Md7J1JatGzh0C/view?usp=sharing
Thanks @LePeti. Our development team is looking into this issue, and we'll get back to you soon.
Hi @LePeti
Thanks for providing the video recording of the reproduction. Based on your video and the logs you provided, here is what we currently know:
- The `recce server` command is running in a Dev Containers environment with VS Code.
- When the Value Diff task is executed with a non-unique primary key column, the `recce server` process is killed.
- The `recce server` process is killed by a `SIGKILL` signal (exit code 137).
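
For reference, exit code 137 follows the common POSIX shell convention of 128 plus the signal number, and signal 9 is `SIGKILL`. A small sketch of that decoding, assuming a POSIX system (the `describe_exit_code` helper is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status (POSIX shells report 128 + signal number)."""
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number exists on this platform.
            return f"exit status {code} (no matching signal)"
        return f"terminated by {name} (signal {code - 128})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```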
In general, SIGKILL is sent by an external process rather than by the application itself.
The reasons could be:
- A manual `kill -9 [PID]` command
- The OS's OOM killer
- Process management tools (Supervisor, Docker, Kubernetes, etc.)
- Security or monitoring software
In your case, since the recce server command is running in a Dev Containers environment, we suspect it could be caused by Docker's resource usage limit or by security software.
Unfortunately, we are unable to reproduce the process-killed behavior in our environment.
It's hard to know exactly why the recce server process is killed while handling the exception.
However, we will modify Recce's error handling for SQL query execution so that it only shows an error message when a query against the database fails. Whatever the reason Recce's process is killed, we should not throw the exception directly. Ideally, the fix in #623 will resolve this issue, and we will deliver it in the next release.
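
To illustrate the general idea of reporting the error instead of re-raising it from the worker thread, here is a minimal sketch. It is not Recce's code and not the change in #623; `TaskError`, `run_task_safely`, and `bad_value_diff` are hypothetical names chosen for this example.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class TaskError(Exception):
    """Hypothetical stand-in for a task-level error such as an invalid primary key."""


def run_task_safely(task_fn):
    """Run task_fn and turn any exception into a result payload.

    Because nothing is re-raised, the future never carries an unretrieved
    exception; the caller always gets a dict it can send back to the UI.
    """
    try:
        return {"status": "finished", "result": task_fn(), "error": None}
    except Exception as exc:
        return {"status": "failed", "result": None, "error": str(exc)}


def bad_value_diff():
    # Simulates the primary key check failing.
    raise TaskError("Invalid primary key: date_year. The column should be unique.")


async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        return await loop.run_in_executor(pool, run_task_safely, bad_value_diff)


outcome = asyncio.run(main())
print(outcome["status"], "-", outcome["error"])
```

A wrapper like this only changes how the exception is surfaced. It would not stop an external SIGKILL, so the container's memory limits would still be worth checking separately.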