
feat: add dataframe duplicated issue - #667

Open RahulDas-dev opened this issue 9 months ago • 0 comments

This merge request adds a new duplicated() method to the DataFrame class that identifies duplicate rows within a DataFrame. Detecting duplicate rows is a common step in data cleaning and exploration workflows.

Resolves issue #667.

Features

  • Identifies duplicate rows in a DataFrame based on specified columns
  • Returns a Series of boolean values marking duplicate entries
  • Supports flexible options for handling duplicates:
    • keep: 'first' - Mark duplicates except for the first occurrence (default)
    • keep: 'last' - Mark duplicates except for the last occurrence
    • keep: false - Mark all duplicates
  • Allows focusing on specific columns with the subset option

Implementation Details

  • Optimized to handle large datasets efficiently with a hash-based approach
  • Comprehensive input validation for better error handling
  • Well-documented with JSDoc comments and examples
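
The hash-based approach and the input validation mentioned above can be sketched in plain JavaScript. This is only an illustration of the general technique, not the code from this merge request: `duplicatedRows` is a hypothetical helper that works on an array of row objects, and serializing each row with `JSON.stringify` is an assumed way of building the hash key.

```javascript
// Illustrative sketch only: `duplicatedRows` is a hypothetical helper,
// not the danfojs implementation. Rows are plain objects; `subset` and
// `keep` mirror the options listed under Features.
function duplicatedRows(rows, { subset, keep = "first" } = {}) {
  // Basic input validation for the `keep` option.
  if (keep !== "first" && keep !== "last" && keep !== false) {
    throw new Error(`keep must be "first", "last", or false; got ${String(keep)}`);
  }
  const columns = subset ?? (rows.length > 0 ? Object.keys(rows[0]) : []);
  // Serialize the selected values so that equal rows map to the same key.
  const keyOf = (row) => JSON.stringify(columns.map((col) => row[col]));
  const result = new Array(rows.length).fill(false);

  if (keep === false) {
    // Count every key, then flag each row whose key occurs more than once.
    const counts = new Map();
    for (const row of rows) {
      const key = keyOf(row);
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
    rows.forEach((row, i) => {
      result[i] = counts.get(keyOf(row)) > 1;
    });
    return result;
  }

  // Scan forward for "first" and backward for "last": the first row seen
  // for each key is kept, every later row with that key is flagged.
  const seen = new Set();
  const n = rows.length;
  for (let step = 0; step < n; step++) {
    const i = keep === "first" ? step : n - 1 - step;
    const key = keyOf(rows[i]);
    if (seen.has(key)) result[i] = true;
    else seen.add(key);
  }
  return result;
}

const rows = [
  { A: 1, B: "a" },
  { A: 2, B: "b" },
  { A: 2, B: "b" },
  { A: 3, B: "c" },
  { A: 3, B: "c" },
];
console.log(duplicatedRows(rows));                   // [false, false, true, false, true]
console.log(duplicatedRows(rows, { keep: "last" })); // [false, true, false, true, false]
console.log(duplicatedRows(rows, { keep: false }));  // [false, true, true, true, true]
```

Each row is hashed once (twice when keep is false), so the work grows linearly with the number of rows instead of comparing every pair. One caveat of this particular key choice: `JSON.stringify` turns `NaN`, `undefined`, and `null` inside an array into the same text, so a real implementation would need a key that keeps them apart if that distinction matters.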
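
// Assumes the danfojs-node package (use "danfojs" in the browser);
// duplicated() is the method proposed in this merge request.
import { DataFrame } from "danfojs-node";

// Create a DataFrame with duplicate rows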
const df = new DataFrame({
  'A': [1, 2, 2, 3, 3],
  'B': ['a', 'b', 'b', 'c', 'c']
});

// Find duplicates keeping first occurrence (default)
const dups = df.duplicated();
// Returns: [false, false, true, false, true]

// Find duplicates keeping last occurrence
const dupsLast = df.duplicated({ keep: 'last' });
// Returns: [false, true, false, true, false]

// Find duplicates based on specific columns
const dupsSubset = df.duplicated({ subset: ['B'] });
// Returns: [false, false, true, false, true]
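
For the data-cleaning use case, a boolean mask like the ones above is typically used to drop the flagged rows. The sketch below shows that step on plain arrays so that it does not assume any particular danfojs filtering call; `records` and `mask` are hypothetical stand-ins for the DataFrame rows and for the values returned by df.duplicated().

```javascript
// Hypothetical stand-ins: `records` mirrors the rows of `df` above and
// `mask` mirrors the values shown for df.duplicated() with the default keep.
const records = [
  { A: 1, B: "a" },
  { A: 2, B: "b" },
  { A: 2, B: "b" },
  { A: 3, B: "c" },
  { A: 3, B: "c" },
];
const mask = [false, false, true, false, true];

// Keep only the rows that are not flagged as duplicates.
const deduped = records.filter((_, i) => !mask[i]);
console.log(deduped.map((r) => `${r.A}${r.B}`).join(",")); // 1a,2b,3c
```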

RahulDas-dev, May 03 '25 09:05