Add `is-not-in` to complement `is-in`

Open kiil opened this issue 3 years ago • 0 comments

Related problem

I am trying to remove stopwords from a tokenized corpus.

Removing all words except the stopwords is easily achievable:

let stopwords = (open stopwords.txt | lines | into df)
let corpus = (open corpus.txt | split words | into df)
let mask = ($corpus | is-in $stopwords)
let result = ($corpus | filter-with $mask)

But I need the opposite, to get rid of the stopwords and keep the other words.

Describe the solution you'd like

The elegant solution would be a new command called `is-not-in`.

(I think other systems call this an anti-join.)

An example:

let stopwords = (open stopwords.txt | lines | into df)
let corpus = (open corpus.txt | split words | into df)
let mask = ($corpus | is-not-in $stopwords)  # <-- requested feature
let tidy = ($corpus | filter-with $mask)

Then $tidy would contain the words in $corpus minus the words in $stopwords.

Describe alternatives you've considered

I tried to negate the mask so that it selects the false entries instead of the true ones, which would also work, but I found no way to negate a boolean mask in filter-with.

EDIT:

let tidy = ($corpus | filter-with ($mask | df-not))

can be used, so is-not-in is more of a "nice to have", I guess.
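
For plain lists rather than dataframes, the same filtering might be sketched with Nushell's `not-in` operator. This is only an aside under assumptions: the word lists are made-up sample data standing in for stopwords.txt and corpus.txt, and it does not cover the dataframe case that this issue is about.

```nu
# Plain-list sketch; no dataframe commands are involved.
# Both lists are hypothetical sample data, not the real files.
let stopwords = [the a of]
let corpus = [the quick brown fox of a forest]

# Keep each word that is not a member of $stopwords.
let tidy = ($corpus | where $it not-in $stopwords)
$tidy
```

Here `$tidy` should hold only the corpus words that do not appear in `$stopwords`, in their original order.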

Additional context and details

No response

kiil, Nov 06 '22 09:11