Duke issues

Improve Maven build performance

Maven will run all tests in a single forked VM by default. This can be problematic if there are a lot of tests or some very memory-hungry ones. We can...

SilverSteven

Dedupe on Couchbase for real-time streaming JSON (Flink)

4

Hi, I am trying to dedupe real-time streaming JSON with Couchbase as the destination. I am trying to make this dedupe call from Flink but am not able to perform...

ashubitm

SemanticDogfood issue

3

Hi! I am new to Duke. I have been trying out deduplication within a dataset, but I am encountering this issue: "ERROR: Couldn't instantiate class no.priv.garshol.duke.databases.LuceneDatabase: java.lang.ClassNotFoundException: no.priv.garshol.duke.databases.LuceneDatabase". Please guide. Thank...

xinelim

Why are there repeated matches?

1

Hello, I can successfully run the deduplication code with Duke, but each matched pair of records appears twice in the match result. Why? For example, if records of ID1 and ID2...

wjfjw
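The question above is truncated, so the cause is not stated. If the repetition is the same pair reported once in each direction (ID1 with ID2, then ID2 with ID1), which is only an assumption here, one way to collapse it on the caller's side is to key each pair on its IDs in sorted order. The class and method names below are hypothetical and are not part of Duke's API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Duke: collapses symmetric match pairs.
final class MatchPairSketch {

    // Builds an order-independent key so (a, b) and (b, a) map to the same value.
    // Assumes record IDs never contain the NUL character used as a separator.
    static String pairKey(String id1, String id2) {
        return id1.compareTo(id2) <= 0
                ? id1 + "\u0000" + id2
                : id2 + "\u0000" + id1;
    }

    // Keeps the first occurrence of each unordered pair and drops later repeats.
    static List<String[]> uniquePairs(List<String[]> matches) {
        Set<String> seen = new HashSet<>();
        List<String[]> result = new ArrayList<>();
        for (String[] match : matches) {
            if (seen.add(pairKey(match[0], match[1]))) {
                result.add(match);
            }
        }
        return result;
    }
}
```

Given the pairs (ID1, ID2), (ID2, ID1) and (ID1, ID3), `uniquePairs` keeps (ID1, ID2) and (ID1, ID3).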

This fixes Classical and Weighted Levenshtein distances.

I spotted a mistake in the implementation of the Levenshtein.distance and WeightedLevenshtein.distance methods. The errors described in #268, #239 and #244 come from using the wrong indexing in the "matrix"...

ibuda

Levenshtein distances bug

Found a bug in the Levenshtein and WeightedLevenshtein distance implementations. In more detail, the following methods give wrong results: `Levenshtein.distance("abc", "a")`: Expected: 2, Actual: 1 `Levenshtein.distance("a", "abc")`: Expected: 2, Actual:...

ibuda
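For reference, the expected values quoted in the report match the textbook Levenshtein recurrence. The sketch below is a standalone two-row dynamic-programming version with unit costs for insertion, deletion and substitution; it is not Duke's implementation, and the class name `LevenshteinSketch` is made up for illustration:

```java
// Standalone illustration of classical Levenshtein distance; not Duke's code.
final class LevenshteinSketch {

    static int distance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        if (n == 0) return m;
        if (m == 0) return n;

        // prev[j] holds the distance between the first i-1 chars of s1
        // and the first j chars of s2; curr[j] is the row being filled.
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // turning "" into s2[0..j) takes j insertions
        }

        for (int i = 1; i <= n; i++) {
            curr[0] = i; // turning s1[0..i) into "" takes i deletions
            for (int j = 1; j <= m; j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }

    public static void main(String[] args) {
        System.out.println(distance("abc", "a")); // prints 2
        System.out.println(distance("a", "abc")); // prints 2
    }
}
```

Both calls from the report return 2 here, since two deletions (or two insertions) are needed in either direction.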

Latest released version is from 15 Feb 2014

Hi, given that the current codebase on master is on version 1.3-SNAPSHOT, are there any plans to release a 1.3 version to Maven Central? Version 1.2 seems rather old (from...

malamili

Streaming data deduplication

3

Hi, is it possible to check for duplicates within an unbounded streaming data set, checking not against another static data source but against the data that has streamed so far?...

sridharpattem

Couchbase Data Source Support for Dedupe to fetch limited set of records as per index

Hi, how can I create a Couchbase data source with CouchbaseDBIterator, convert the results into Compact Record, and dedupe only the records fetched as the right matches via indexing?

ashubitm

Genetic algorithm

1

How does the genetic algorithm in passive mode support an incremental data set?

xinelim
