[Bug] Server crashes when I select the incorrect set of primary keys for Value diff

Open LePeti opened this issue 11 months ago • 4 comments

Current Behavior

I've selected the wrong primary keys in value diff and the server crashed.

logs:

Future exception was never retrieved
future: <Future finished exception=RecceException('Invalid primary key: date_year. The column should be unique. Please check by this sql: \'\n\nselect\n    date_year as unique_field,\n    count(*) as n_records\n\nfrom "live"."sponsored_collections"."partner_revenue_status_changes_yearly"\nwhere date_year is not null\ngroup by date_year\nhaving count(*) > 1\n\n\'')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/site-packages/recce/apis/run_func.py", line 139, in fn
    raise e
  File "/usr/local/lib/python3.10/site-packages/recce/apis/run_func.py", line 132, in fn
    result = task.execute()
  File "/usr/local/lib/python3.10/site-packages/recce/tasks/valuediff.py", line 230, in execute
    self._verify_primary_key(dbt_adapter, primary_key, model)
  File "/usr/local/lib/python3.10/site-packages/recce/tasks/valuediff.py", line 71, in _verify_primary_key
    raise RecceException(
recce.exceptions.RecceException: Invalid primary key: date_year. The column should be unique. Please check by this sql: '

select
    date_year as unique_field,
    count(*) as n_records

from "live"."sponsored_collections"."partner_revenue_status_changes_yearly"
where date_year is not null
group by date_year
having count(*) > 1

'
./run-scripts/start-recce.sh: line 40: 12986 Killed                  recce server --host "$RECCE_SERVER_HOST" --port "$RECCE_SERVER_PORT"
make: *** [makefile:95: start-recce] Error 137
/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Expected Behavior

I'd expect an error message instead and then the ability to select again. Potentially showing me a sample of the duplicate values and a query I can use to double check.
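A minimal sketch of what such a check could return, using the same shape of query that appears in the log above. It runs against an in-memory SQLite table rather than Redshift, and the table name `yearly_changes`, the sample rows, and the helper `find_duplicate_keys` are all hypothetical; this is not Recce's code.

```python
import sqlite3

# Hypothetical stand-in for the real model, with one duplicated year.
conn = sqlite3.connect(":memory:")
conn.execute("create table yearly_changes (date_year integer, partner text)")
conn.executemany(
    "insert into yearly_changes values (?, ?)",
    [(2022, "a"), (2023, "b"), (2023, "c"), (2024, "d")],
)


def find_duplicate_keys(conn, table, column, limit=5):
    """Return up to `limit` (value, n_records) pairs where `column` is not unique."""
    # Identifiers are interpolated directly, so this sketch assumes trusted names.
    sql = f"""
        select {column} as unique_field, count(*) as n_records
        from {table}
        where {column} is not null
        group by {column}
        having count(*) > 1
        order by {column}
        limit {int(limit)}
    """
    return conn.execute(sql).fetchall()


duplicates = find_duplicate_keys(conn, "yearly_changes", "date_year")
if duplicates:
    # Report the problem instead of raising, so the user can pick another key.
    print(f"Invalid primary key: date_year is not unique, sample: {duplicates}")
```

Returning the sample rows alongside the message would let the UI show both the duplicates and the query used to find them.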

Steps To Reproduce

  1. I started recce locally (v0.54) and opened the UI
  2. On lineage, the changed model was already selected
  3. Navigated to explore change > value diff
  4. Selected the wrong primary keys
  5. The server crashed

Relevant log output


Environment

  • recce: 0.54
  • OS: MacOS 15.3.1
  • Python: 3.10.16
  • Data Warehouse: aws redshift
  • dbt: 1.8.3

Additional Context

No response

LePeti avatar Feb 20 '25 14:02 LePeti

Hi @LePeti

Thanks for opening the issue. I've tried but am currently unable to reproduce it. I'll escalate this to the dev team to take a look.

When you say the server crashes, do you mean that:

  1. An error message is displayed, as in this screenshot:

Image

  2. Or does the actual server process crash on the CLI, resulting in a server disconnect message in the web UI, like this:

Image

Thanks,

Dave

DaveFlynn avatar Feb 21 '25 00:02 DaveFlynn

hi @DaveFlynn ,

it's the latter of the two. My expected behavior would be the former. I created a video recording: https://drive.google.com/file/d/1rhjiSouSDruIvNNDoB-Md7J1JatGzh0C/view?usp=sharing

LePeti avatar Feb 21 '25 08:02 LePeti

Thanks @LePeti. Our development team is looking into this issue and we'll get back to you soon.

DaveFlynn avatar Feb 24 '25 23:02 DaveFlynn

Hi @LePeti

Thanks for providing the video recording of the reproduction. Based on your video and the logs you provided, here is what we currently know:

  • The recce server command is running in a Devcontainers environment with VSCode.
  • When the Value Diff task is executed with a non-unique column as the primary key, the recce server process is killed.
  • The recce server process is killed by a SIGKILL signal (exit code 137).
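The link between exit code 137 and SIGKILL follows the common shell convention that a status above 128 means "terminated by signal (status - 128)", so 137 - 128 = 9. A small sketch of that arithmetic, assuming a POSIX system; the helper `describe_exit_status` is a hypothetical name for illustration:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (above 128 means killed by a signal)."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```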

In general, SIGKILL is sent by an external process. Possible sources are:

  • A manual kill -9 [PID] command
  • OS's OOM killer
  • Process Management Tools (Supervisor, Docker, Kubernetes, etc.)
  • Security/Monitoring Software

In your case, because the recce server command runs inside a Devcontainers environment, we suspect the cause is Docker's resource usage limit or security software. Unfortunately, we are unable to reproduce the process-killed behavior in our environment, so it's hard to know exactly why the recce server process is killed while handling the exception.

However, we will modify Recce's error handling for SQL query execution so that it only shows an error message when a query against the database fails. Whatever the reason the Recce process is killed, we should not throw the exception directly. Ideally, the fix in #623 will resolve this issue, and we will deliver it in the next release.
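As a rough illustration of that idea (not the actual change in #623), a task running in a worker thread can catch its own exception and return it as data, so nothing is re-raised and no "Future exception was never retrieved" warning is left behind. The `RecceException` class here is a local stand-in, and `failing_task` and `run_safely` are hypothetical names:

```python
import asyncio


class RecceException(Exception):
    """Local stand-in for recce.exceptions.RecceException."""


def failing_task():
    # Simulates the primary key check rejecting a non-unique column.
    raise RecceException("Invalid primary key: date_year. The column should be unique.")


def run_safely(fn):
    """Run fn and convert any exception into an error payload instead of raising."""
    try:
        return {"result": fn(), "error": None}
    except Exception as exc:
        return {"result": None, "error": str(exc)}


async def main():
    loop = asyncio.get_running_loop()
    # The wrapper runs in the default thread pool; the future always completes normally.
    return await loop.run_in_executor(None, run_safely, failing_task)


outcome = asyncio.run(main())
print(outcome["error"])
```

The caller can then forward `outcome["error"]` to the UI as a normal response and let the user choose a different primary key.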

kentwelcome avatar Feb 26 '25 03:02 kentwelcome