stagehand
Write a custom scorer for the eval.
Thought: The eval should be more detailed for extracts. There is a big difference between completely missing 10 out of 20 commits and one commit starting with a lowercase letter, but we score both as 0. The first should score around 0.5 and the second around 0.9.
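A minimal sketch of such a partial-credit scorer, assuming `output` and `expected` are arrays of commit-message strings and that the scorer plugs into the Braintrust eval as a plain function; the scorer name and the 0.9 weight for case-only mismatches are illustrative choices, not a fixed rule:

```typescript
// Partial-credit scorer: exact matches earn 1, case-only mismatches
// earn 0.9, and missing commits earn 0, so the final score reflects
// how wrong the extraction actually was. Weights are illustrative.
interface ScorerArgs {
  output: string[];
  expected: string[];
}

function commitExtractionScorer({ output, expected }: ScorerArgs) {
  let total = 0;
  for (const want of expected) {
    if (output.includes(want)) {
      total += 1; // exact match
    } else if (output.some((got) => got.toLowerCase() === want.toLowerCase())) {
      total += 0.9; // only capitalization differs
    }
    // completely missing commits add 0
  }
  return {
    name: "CommitExtraction",
    score: expected.length ? total / expected.length : 1,
  };
}
```

With these weights, missing 10 of 20 expected commits lands at 0.5 overall, while a commit that differs only in capitalization earns 0.9 credit instead of 0.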
And implement fuzzy search scoring.
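One way to sketch fuzzy scoring is normalized Levenshtein similarity, hand-rolled below so the example is self-contained; the helper names are hypothetical, and Braintrust's autoevals package also ships a Levenshtein-based scorer that may fit without hand-rolling:

```typescript
// Edit distance between two strings via standard dynamic programming.
function levenshtein(a: string, b: string): number {
  // dp[i][j] = edit distance between a[..i] and b[..j]
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalized similarity in [0, 1]; 1 means identical strings.
function similarity(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

// Score each expected commit by its best fuzzy match in the output,
// then average; a near-miss like a typo scores close to 1 instead of 0.
function fuzzyScore(output: string[], expected: string[]): number {
  if (expected.length === 0) return 1;
  const perCommit = expected.map((want) =>
    Math.max(0, ...output.map((got) => similarity(got, want)))
  );
  return perCommit.reduce((sum, s) => sum + s, 0) / expected.length;
}
```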
Also, if you get extra time, make more information available on the Braintrust dashboard.
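One way to surface more detail is through the scorer result itself: Braintrust scorer results can include a `metadata` object that appears alongside the score for each row. A sketch, where the `missing` and `caseMismatches` field names are illustrative, not a fixed schema:

```typescript
// Scorer that reports WHICH commits were missed or mis-cased,
// not just the aggregate number, so failures are inspectable per row.
function commitScorerWithMetadata({ output, expected }: { output: string[]; expected: string[] }) {
  const missing = expected.filter(
    (want) => !output.some((got) => got.toLowerCase() === want.toLowerCase())
  );
  const caseMismatches = expected.filter(
    (want) =>
      !output.includes(want) &&
      output.some((got) => got.toLowerCase() === want.toLowerCase())
  );
  // Exact matches earn 1, case mismatches 0.9, missing commits 0.
  const score = expected.length
    ? (expected.length - missing.length - 0.1 * caseMismatches.length) / expected.length
    : 1;
  return {
    name: "CommitExtraction",
    score,
    metadata: { missing, caseMismatches }, // shown per-row in the dashboard
  };
}
```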
Fuzzy search scoring is an option; how about embedding vector similarity?
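A sketch of an embedding-based scorer, assuming the OpenAI embeddings API and the `text-embedding-3-small` model (both assumptions); autoevals' `EmbeddingSimilarity` scorer is likely the off-the-shelf option to try first:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed a blob of text with an assumed model choice.
async function embed(text: string): Promise<number[]> {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Plain cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare the extracted commits to the expected ones as whole documents.
async function embeddingScore(output: string[], expected: string[]): Promise<number> {
  const [got, want] = await Promise.all([
    embed(output.join("\n")),
    embed(expected.join("\n")),
  ]);
  return cosine(got, want);
}
```

Embedding similarity is forgiving of rewording but coarse: it won't flag a single missing commit the way the per-commit scorers above do, so it may fit better as a secondary score than a replacement.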