The running time of flatxml
flatxml is an excellent package for parsing XML data, but I ran into a problem with the fxml_toDataFrame() function: it takes a long time when running over many files, parsing each whole file into a dataframe. For example:
```r
elemids <- unique(test$elemid.)
yy.dat <- list()  # collected dataframes
j <- 0
for (i in elemids) {
  yy <- fxml_toDataFrame(test, siblings.of = i)
  if (nrow(yy) >= 1) {
    j <- j + 1
    yy.dat[[j]] <- yy
    print(c(i, j))
  }
}
```
For a test file with more than 300 elemids, the loop above needs over a minute. I work with more than 1000 files, which takes more than 10 hours in total. Could you please give some advice?
I worked out a solution to my problem: using fxml_hasChildren() to keep only the essential elemids reduced the loop from 300+ iterations to about 30 (roughly a tenfold reduction).
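In case it helps others, the filtering step could be sketched roughly like this, assuming `test` is the flat dataframe produced by fxml_importXMLFlat() and that fxml_hasChildren(test, id) returns TRUE when the element has child elements:

```r
library(flatxml)

# Sketch of the filtering idea: keep only the "essential" elemids,
# i.e. elements that actually have children, before running the loop.
# Assumes 'test' is a flat dataframe from fxml_importXMLFlat().
elemids <- unique(test$elemid.)
essential <- elemids[sapply(elemids, function(id) fxml_hasChildren(test, id))]

# The original loop then iterates over 'essential' instead of 'elemids'.
```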
Yes, I was also thinking that the problem may be that when you have two siblings A and B, your loop will find the siblings for both of them (so when i is the elemid of A, fxml_toDataFrame() will find B as a sibling, and when i is the elemid of B, it will find A, which was already processed before). When you have many elements on the same level, you do a lot of unnecessary work. Therefore, another way to solve this would be to save the elemids of the elements already captured in yy and not run fxml_toDataFrame() for any of them again. So if elemid x is already covered by one of your yy dataframes, you would never run fxml_toDataFrame(test, siblings.of = x). This should reduce the number of loop iterations significantly.
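A rough sketch of that idea, assuming fxml_getSiblings() returns the flat rows of an element's siblings so their elemid. values can be recorded and skipped in later iterations:

```r
library(flatxml)

processed <- integer(0)  # elemids already covered by an earlier call
yy.dat <- list()
j <- 0
for (i in unique(test$elemid.)) {
  if (i %in% processed) next  # sibling group already handled
  yy <- fxml_toDataFrame(test, siblings.of = i)
  if (nrow(yy) >= 1) {
    j <- j + 1
    yy.dat[[j]] <- yy
    # Record i and its siblings so fxml_toDataFrame() is never
    # called again for elements of the same sibling group.
    # (Assumes fxml_getSiblings() returns a flat dataframe with an elemid. column.)
    sibs <- fxml_getSiblings(test, i)
    processed <- c(processed, i, unique(sibs$elemid.))
  }
}
```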