danfojs
feat: add DataFrame duplicated() - issue #667
This merge request adds a new duplicated() method to the DataFrame class that identifies duplicate rows within a DataFrame. This functionality is essential for data cleaning and exploration workflows.
Resolves issue #667.
Features
- Identifies duplicate rows in a DataFrame based on specified columns
- Returns a Series of boolean values marking duplicate entries
- Supports flexible options for handling duplicates:
  - keep: 'first' - Mark duplicates except for the first occurrence (default)
  - keep: 'last' - Mark duplicates except for the last occurrence
  - keep: false - Mark all duplicates
- Allows focusing on specific columns with the subset option
Implementation Details
- Optimized to handle large datasets efficiently with a hash-based approach
- Comprehensive input validation for better error handling
- Well-documented with JSDoc comments and examples
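As a rough illustration of what a hash-based approach to duplicate detection can look like, here is a minimal standalone sketch. It is not the code from this merge request: duplicatedMask is a hypothetical helper that works on plain column-oriented objects rather than a DataFrame, and it assumes that serializing each row with JSON.stringify is an acceptable row key.

```javascript
// Hypothetical helper, NOT the danfojs implementation: marks duplicate rows
// in column-oriented data such as { A: [1, 2, 2], B: ["a", "b", "b"] }.
function duplicatedMask(columns, { subset, keep = "first" } = {}) {
  if (!["first", "last", false].includes(keep)) {
    throw new Error(`keep must be "first", "last", or false; got ${String(keep)}`);
  }
  const names = subset ?? Object.keys(columns);
  for (const name of names) {
    if (!(name in columns)) throw new Error(`Unknown column: ${name}`);
  }
  const nRows = names.length === 0 ? 0 : columns[names[0]].length;

  // One string key per row. JSON.stringify keeps 1 and "1" distinct, but it
  // maps NaN, null, and undefined to the same key (a simplification here).
  const keys = [];
  for (let i = 0; i < nRows; i++) {
    keys.push(JSON.stringify(names.map((name) => columns[name][i])));
  }

  const mask = new Array(nRows).fill(false);

  if (keep === false) {
    // Count occurrences, then mark every row whose key occurs more than once.
    const counts = new Map();
    for (const key of keys) counts.set(key, (counts.get(key) ?? 0) + 1);
    for (let i = 0; i < nRows; i++) mask[i] = counts.get(keys[i]) > 1;
    return mask;
  }

  // Scan forward for "first" and backward for "last"; any key that has
  // already been seen during the scan marks a duplicate row.
  const seen = new Set();
  for (let step = 0; step < nRows; step++) {
    const i = keep === "first" ? step : nRows - 1 - step;
    if (seen.has(keys[i])) mask[i] = true;
    else seen.add(keys[i]);
  }
  return mask;
}

const data = { A: [1, 2, 2, 3, 3], B: ["a", "b", "b", "c", "c"] };
console.log(duplicatedMask(data));                  // [false, false, true, false, true]
console.log(duplicatedMask(data, { keep: "last" })); // [false, true, false, true, false]
console.log(duplicatedMask(data, { keep: false }));  // [false, true, true, true, true]
```

Each row is hashed once into a Set or Map, so the sketch runs in time roughly linear in the number of rows instead of comparing every pair of rows.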
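Usage
// Assumes the Node.js package; with a browser bundler, import from "danfojs" instead
import { DataFrame } from "danfojs-node";
// Create a DataFrame with duplicate rows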
const df = new DataFrame({
'A': [1, 2, 2, 3, 3],
'B': ['a', 'b', 'b', 'c', 'c']
});
// Find duplicates keeping first occurrence (default)
const dups = df.duplicated();
// Returns: [false, false, true, false, true]
// Find duplicates keeping last occurrence
const dupsLast = df.duplicated({ keep: 'last' });
// Returns: [false, true, false, true, false]
// Find duplicates based on specific columns
const dupsSubset = df.duplicated({ subset: ['B'] });
// Returns: [false, false, true, false, true]