nushell
Add `is-not-in` to complement `is-in`
Related problem
I am trying to remove stopwords from a tokenized corpus.
Removing all words except the stopwords is easily achievable:
let stopwords = (open stopwords.txt | lines | into df)
let corpus = (open corpus.txt | split words | into df)
let mask = ($corpus | is-in $stopwords)
let result = ($corpus | filter-with $mask)
But I need the opposite, to get rid of the stopwords and keep the other words.
Describe the solution you'd like
An elegant solution would be a new command called `is-not-in`.
(I think this is also called an anti-join in other systems.)
An example:
let stopwords = (open stopwords.txt | lines | into df)
let corpus = (open corpus.txt | split words | into df)
let mask = ($corpus | is-not-in $stopwords)  # <-- requested feature
let tidy = ($corpus | filter-with $mask)
Then `$tidy` would contain the words in `$corpus` minus the words in `$stopwords`.
Describe alternatives you've considered
I tried to negate the mask so that it selects false instead of true, which would also work, but I found no way to negate a boolean mask in `filter-with`.
EDIT: the mask can be negated with `df-not`:
let tidy = ($corpus | filter-with ($mask | df-not))
So `is-not-in` is more of a "nice to have", I guess.
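For plain lists rather than dataframes, a similar result may be possible without any mask, using Nushell's built-in `not-in` operator inside `where`. This is only a minimal sketch, assuming the words are held as ordinary lists; the inline sample data is hypothetical and stands in for stopwords.txt and corpus.txt:

```nushell
# Hypothetical sample data standing in for the two files.
let stopwords = [the a of]
let corpus = [the cat sat on a mat]

# Keep only the words that do not appear in $stopwords.
let tidy = ($corpus | where $it not-in $stopwords)
```

This skips `into df` entirely, so it may not scale as well on a large corpus; it is meant mainly to illustrate the semantics `is-not-in` would have.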
Additional context and details
No response