
LLM as a Judge checking Debugger/Executor tool calls

Open Grigorij-Dudnik opened this issue 1 year ago • 2 comments

Sometimes the Executor or Debugger agents provide wrong line numbers when calling the "replace_code" or "insert_code" tools. We can use fast, low-cost LLMs (such as 3.5-haiku or gpt-4o-mini) to check that the code about to be inserted will not break existing code.

Currently we have syntax checker functions (src/utilities/syntax_checker_functions.py) that check whether a change is going to break the syntax of the code. The checker creates a copy of the file being changed, introduces the change, checks the syntax of that temporary copy, and only if the syntax is OK allows the change to be applied to the original file.
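For illustration, for a Python file such a check can be as simple as applying the proposed replacement to an in-memory copy and trying to parse it. The function and argument names below are illustrative, not the actual ones from syntax_checker_functions.py:

```python
# Minimal sketch of the "dumb" check described above; names are illustrative,
# not the actual functions in src/utilities/syntax_checker_functions.py.
import ast

def change_keeps_syntax_valid(file_path: str, start_line: int, end_line: int, new_code: str) -> bool:
    """Apply the proposed replacement to an in-memory copy and try to parse it."""
    with open(file_path, encoding="utf-8") as f:
        lines = f.readlines()
    # Replace the targeted lines (1-indexed, inclusive) with the proposed code.
    patched = lines[:start_line - 1] + [new_code + "\n"] + lines[end_line:]
    try:
        ast.parse("".join(patched))  # Python-only; other languages need their own parser
        return True
    except SyntaxError:
        return False
```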

Such dumb syntax checking catches many bad changes, but it cannot catch bad changes that leave the syntax valid.

We need an LLM as a Judge that sees the file before and after the change, knows what the agent actually intended to change (from the last agent message or the plan, for example), and can evaluate whether the lines to change were selected correctly.

Such "smart" check shouldbe done after "dumb" check by sntax checkers.

Grigorij-Dudnik avatar Jan 24 '25 14:01 Grigorij-Dudnik

Honestly, this might be better with a rethink.

You've built a sort of coding team here. A Change Review Board process would reduce failures and thereby conserve tokens. Use Qwen 2.5 coder 32B (or similar) to make a file diff of the proposed changes, then apply the diff after a code review. This would limit breakages.
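For what it's worth, producing a reviewable unified diff of a proposed change only needs the standard library; the reviewer model (or a human) would then be shown this diff before it is applied. A sketch, not part of the Clean Coder codebase:

```python
# Illustrative sketch: produce a unified diff of the proposed change so a
# reviewer model (or a human) can approve it before it touches the file.
import difflib

def proposed_diff(old_text: str, new_text: str, filename: str) -> str:
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    ))
```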

Also I'm not seeing test coverage.

I would definitely recommend writing unit tests and building the code to pass them, then building the integration tests, then the integration itself. By writing the integration tests first, the errors and failing tests can be used as the driver. Once that is in place, you get early warnings when something breaks.

devlux76 avatar Jan 30 '25 07:01 devlux76

Hey @devlux76, thanks for the ideas!

A Change Review Board process would reduce failures and thereby conserve tokens. Use Qwen 2.5 coder 32B (or similar) to make a file diff of the proposed changes, then apply the diff after a code review.

As I understand it, you mean changing the way Clean Coder presents its proposed changes. Instead of just providing the new code and the line numbers to replace, it would provide both the old and the new code. I like that idea, as it increases visibility, so I created a separate issue for it (https://github.com/Grigorij-Dudnik/Clean-Coder-AI/issues/50).

Although improving the printing will help with manual checks, it will not solve the problem completely. In Clean Coder we want to automate as much as possible, so that one day it becomes a fully autonomous programmer that does not require human attention. That is why creating AI checks is still worthwhile - it will let humans spend less time on manual checking.

Such checks could be optional - anyone who wants to save tokens could disable them.

Also I'm not seeing test coverage.

I would definitely recommend writing unit tests and building the code to pass them, then building the integration tests, then the integration itself. By writing the integration tests first, the errors and failing tests can be used as the driver.

Once that is in place, you get early warnings when something breaks.

If I understand correctly, you mean making Clean Coder write code using test-driven development - writing tests before writing the actual code, something like what has been done in micro-agent. It could be an interesting approach that has already proven itself in AI coding, but it requires careful planning of how to realize it (I have no clear vision yet). Please feel free to open a separate issue and write down all your thoughts on how exactly you see the agent organization in Clean Coder for a TDD approach.

Grigorij-Dudnik avatar Jan 31 '25 08:01 Grigorij-Dudnik