readtext
docvarsfrom = "filepaths" not working as expected
Error
With docvarsfrom = "filepaths", readtext should create docvars from the file paths only. Instead it parses both the file paths and the filenames.
> (rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"),
+ docvarsfrom = "filepaths", docvarnames = "sentiment"))
readtext object consisting of 10 documents and 4 docvars.
# data.frame [10 × 6]
doc_id text sentiment docvar2 docvar3 docvar4
<chr> <chr> <chr> <chr> <chr> <chr>
1 neg_cv000_2… "\"plot : t… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv000 29416.…
2 neg_cv001_1… "\"the happ… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv001 19502.…
3 neg_cv002_1… "\"it is mo… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv002 17424.…
4 neg_cv003_1… "\" \" ques… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv003 12683.…
5 neg_cv004_1… "\"synopsis… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv004 12641.…
6 pos_cv000_2… "\"films ad… /Library/Frameworks/R.framework/Versions/3.… reviews/p… cv000 29590.…
# ... with 4 more rows
Warning message:
In get_docvars_filenames(files, dvsep, docvarnames, docvarsfrom == :
Fewer docnames supplied than existing docvars - last 3 docvars given generic names.
Expected behaviour
The idea behind docvarsfrom = "filepaths" is not to parse the filenames, but to take as docvars the directory parts of the paths that the supplied file pattern matched.
So in the example:
DATA_DIR <- system.file("extdata/", package = "readtext")
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"),
docvarsfrom = "filepaths", docvarnames = "sentiment"))
it should return:
readtext object consisting of 10 documents and 1 docvar.
# data.frame [10 × 3]
doc_id text sentiment
<chr> <chr> <chr>
1 neg_cv000_29416.txt "\"plot : two\"..." neg
2 neg_cv001_19502.txt "\"the happy \"..." neg
3 neg_cv002_17424.txt "\"it is movi\"..." neg
4 neg_cv003_12683.txt "\" \" quest f\"..." neg
5 neg_cv004_12641.txt "\"synopsis :\"..." neg
6 pos_cv000_29590.txt "\"films adap\"..." pos
# ... with 4 more rows
where the neg and pos labels come not from the filenames but from the path at the level the pattern matched, i.e. the part before the / in:
> list.files(path = paste0(DATA_DIR, "txt/movie_reviews/"), recursive = TRUE)
[1] "neg/neg_cv000_29416.txt" "neg/neg_cv001_19502.txt" "neg/neg_cv002_17424.txt"
[4] "neg/neg_cv003_12683.txt" "neg/neg_cv004_12641.txt" "pos/pos_cv000_29590.txt"
[7] "pos/pos_cv001_18431.txt" "pos/pos_cv002_15918.txt" "pos/pos_cv003_11664.txt"
[10] "pos/pos_cv004_11636.txt"
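As a rough illustration of the expected result, the directory part can be split off relative paths like the ones listed above with base R's dirname() and basename(). This is only a sketch of the idea, not how readtext does it; rel_paths and docvars are illustrative names, and the vector is a subset of the listing above.

```r
# Relative paths copied from the list.files() output above (subset).
rel_paths <- c("neg/neg_cv000_29416.txt",
               "neg/neg_cv001_19502.txt",
               "pos/pos_cv000_29590.txt")

# dirname() keeps the part before the final "/", basename() the part after it.
docvars <- data.frame(doc_id = basename(rel_paths),
                      sentiment = dirname(rel_paths),
                      stringsAsFactors = FALSE)
print(docvars)
```

Note that this only works for a single directory level; deeper nesting would need the path split into several components.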
When docvarsfrom = "filepaths", the filenames should not be parsed into docvars.
The root cause is that Sys.glob() returns only the matched paths; it does not tell us which part of each path the "*" matched.
https://github.com/quanteda/readtext/blob/555aa7222c255a0cde3e17e983dede0e240857f5/R/utils.R#L164
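One possible workaround, sketched here under the assumption that the pattern contains a single "*" at its very end, is to strip the fixed prefix of the pattern from each path that Sys.glob() returns. The directory layout below is a temporary stand-in for movie_reviews, and root, pattern, prefix and wildcard_part are hypothetical names, not part of readtext.

```r
# Build a small temporary tree that mimics the neg/pos layout (stand-in data).
root <- file.path(normalizePath(tempdir(), winslash = "/"), "movie_reviews_demo")
for (label in c("neg", "pos")) {
  dir.create(file.path(root, label), recursive = TRUE, showWarnings = FALSE)
  writeLines("some text", file.path(root, label, paste0(label, "_cv000.txt")))
}

pattern <- paste0(root, "/*")

# Sys.glob() returns full paths only, with no record of what "*" matched.
matches <- Sys.glob(pattern)

# Assumption: exactly one "*" at the end of the pattern, so everything
# after the fixed prefix is what the wildcard matched.
prefix <- sub("*", "", pattern, fixed = TRUE)
wildcard_part <- substring(matches, nchar(prefix) + 1)
print(wildcard_part)
```

A pattern with several wildcards, or with a wildcard in the middle, would need a more general approach such as translating the glob into a regular expression with capture groups.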