Duke
Duke is a fast and flexible deduplication engine written in Java.
Maven will run all tests in a single forked VM by default. This can be problematic if there are a lot of tests or some very memory-hungry ones. We can...
Hi, I am trying to dedupe real-time streaming JSON with Couchbase as the destination. I am trying to make this dedupe call from Flink but am not able to perform...
Hi! I am new to Duke. I have been trying out deduplication within a dataset, but I am encountering this issue: "ERROR: Couldn't instantiate class no.priv.garshol.duke.databases.LuceneDatabase: java.lang.ClassNotFoundException: no.priv.garshol.duke.databases.LuceneDatabase". Please guide. Thank...
Hello, I can successfully run the deduplication code with Duke, but each pair of matched records appears twice in the match result. Why? For example, if records of ID1 and ID2...
I spotted a mistake in the implementation of the Levenshtein.distance and WeightedLevenshtein.distance methods. The errors described in #268, #239 and #244 come from using the wrong indexing in the "matrix"...
Found a bug in the Levenshtein and WeightedLevenshtein distance implementations. In more detail, the following methods give wrong results: `Levenshtein.distance("abc", "a")`: Expected: 2, Actual: 1 `Levenshtein.distance("a", "abc")`: Expected: 2, Actual:...
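For reference, the expected values quoted in this report match the textbook Levenshtein distance. The following is a minimal sketch of that textbook algorithm using two rolling rows, not Duke's actual implementation; the class name `LevenshteinSketch` is hypothetical and exists only for illustration.

```java
// Hypothetical reference class; NOT part of Duke. It implements the
// standard dynamic-programming Levenshtein distance with unit costs.
final class LevenshteinSketch {
    private LevenshteinSketch() {}

    static int distance(String s, String t) {
        if (s.isEmpty()) return t.length();
        if (t.isEmpty()) return s.length();

        // prev[j] holds the distance between the first i-1 chars of s
        // and the first j chars of t; curr is the row being filled.
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning "" into t[0..j) takes j insertions
        }

        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so the freshly computed row becomes "previous".
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("abc", "a")); // prints 2
        System.out.println(distance("a", "abc")); // prints 2
    }
}
```

Note that the row index `i` ranges over the first string and the column index `j` over the second; mixing the two up is the kind of indexing slip that typically only shows when the inputs differ in length, as in the two calls above.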
Hi, Given that the current codebase on master is on version 1.3-SNAPSHOT, are there any plans to release a 1.3 version to Maven Central? Version 1.2 seems rather old (from...
Hi, Is it possible to check for duplicates within an unbounded streaming data set, not checking against another static data source but against the data that has streamed so far?...
Hi, how can I create a Couchbase data source with CouchbaseDBIterator, convert the results into CompactRecord objects, and run dedupe only on the records fetched as candidate matches via indexing?
How does the genetic algorithm in passive mode support an incremental data set?