
Make Zingg More Usable - Part 1. Blocking

Open sonalgoyal opened this issue 1 year ago • 5 comments

Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons. For example, a user may add significantly more trainingSamples than the pairs labeled through Zingg, or the data may have a natural bias, with many nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.

Let us add a new phase, debugBlocking, which will block the incoming data and output:

  • Counts per block: getPipeUtil().write(blocked.select(ColName.HASH_COL).groupByCount(ColName.HASH_COL, ColName.HASH_COL + "_count"), getPipeForDebugBlockingLocation(timestamp));
  • 10% of the records from the top 3 blocks by count, so that people can see which records are contributing to the issue and add appropriate training data

We can save results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples

The timestamp is the same for both locations.
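As a rough illustration of the two outputs described above, here is a minimal sketch in plain Java collections. It does not use Zingg's ZFrame or pipe APIs; the class BlockDebugSketch, the Rec holder, and all method names are hypothetical and exist only for this example. It counts records per block hash, picks the largest blocks, and samples a percentage of each.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

class BlockDebugSketch {

    /** Hypothetical stand-in for a blocked row: its block hash plus the raw record. */
    static final class Rec {
        final int hash;
        final String row;
        Rec(int hash, String row) { this.hash = hash; this.row = row; }
    }

    /** Counts how many records fall into each block, keyed by block hash. */
    static Map<Integer, Long> countsPerBlock(List<Integer> hashes) {
        return hashes.stream()
                .collect(Collectors.groupingBy(h -> h, Collectors.counting()));
    }

    /** Returns the hashes of the n largest blocks, largest first (ties broken by hash). */
    static List<Integer> topBlocks(Map<Integer, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<Integer, Long>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Samples roughly `percent`% of the records (at least one) from each of the topN blocks. */
    static List<Rec> sampleTopBlocks(List<Rec> records, int topN, int percent, long seed) {
        Map<Integer, Long> counts = countsPerBlock(
                records.stream().map(r -> r.hash).collect(Collectors.toList()));
        Random rnd = new Random(seed);
        List<Rec> out = new ArrayList<>();
        for (int hash : topBlocks(counts, topN)) {
            List<Rec> block = records.stream()
                    .filter(r -> r.hash == hash)
                    .collect(Collectors.toCollection(ArrayList::new));
            Collections.shuffle(block, rnd);
            int k = Math.max(1, block.size() * percent / 100);
            out.addAll(block.subList(0, k));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        for (int i = 0; i < 20; i++) records.add(new Rec(7, "row-" + i)); // one oversized block
        for (int i = 0; i < 5; i++) records.add(new Rec(8, "row-" + (20 + i)));
        for (Rec r : sampleTopBlocks(records, 3, 10, 42L)) {
            System.out.println(r.hash + " -> " + r.row);
        }
    }
}
```

In the actual phase, the counts and the samples would be written to the counts and blockSamples locations respectively, under the same timestamp.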

sonalgoyal avatar Oct 02 '24 08:10 sonalgoyal

This is a new phase. Define a new class Blocker which has the blocking logic copied from Matcher. It will take the blocking tree and return blocks. In Matcher.getBlocked, call new Blocker<S,D,R,C,T>().getBlocked(getBlockingTreeUtil()).

In BlockingTreeDebugger, call the same method.

sonalgoyal avatar Oct 03 '24 07:10 sonalgoyal

If there is more than one source, we need to group the hashes by source.
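A minimal sketch of the per-source grouping, again in plain Java rather than Zingg's own data frame API. The class PerSourceBlockCounts, the Rec holder, and the source names are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PerSourceBlockCounts {

    /** Hypothetical stand-in for a blocked row: which source it came from and its block hash. */
    static final class Rec {
        final String source;
        final int hash;
        Rec(String source, int hash) { this.source = source; this.hash = hash; }
    }

    /** Returns source -> (block hash -> record count), sorted by source and then by hash. */
    static Map<String, Map<Integer, Long>> countsPerSource(List<Rec> records) {
        Map<String, Map<Integer, Long>> out = new TreeMap<>();
        for (Rec r : records) {
            out.computeIfAbsent(r.source, s -> new TreeMap<>())
               .merge(r.hash, 1L, Long::sum);
        }
        return out;
    }
}
```

Keeping the counts separate per source makes it visible when a block is oversized in only one of the inputs.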

sonalgoyal avatar Oct 03 '24 07:10 sonalgoyal

See also https://github.com/zinggAI/zingg/issues/893

sonalgoyal avatar Oct 03 '24 11:10 sonalgoyal

zingg.sh --phase debugBlocking --conf config.json --zinggDir /location

what will the run command look like?

sania-16 avatar Oct 06 '24 13:10 sania-16

--zinggDir is optional

sonalgoyal avatar Oct 06 '24 16:10 sonalgoyal