
The running time of flatxml

Open · HarwayZ opened this issue 6 years ago • 2 comments

flatxml is an excellent package for parsing XML data, but I ran into a problem with the fxml_toDataFrame() function. It takes a long time when I run it over many files, parsing each whole file into a data frame, for example:

# `test` is assumed to be a flat XML data frame (e.g. created with fxml_importXMLFlat())
library(flatxml)

j <- 0            # counter for non-empty results
yy.dat <- list()  # container for the extracted data frames

elemids <- unique(test$elemid.)
for (i in elemids) {
  yy <- fxml_toDataFrame(test, siblings.of = i)
  if (nrow(yy) >= 1) {
    j <- j + 1
    yy.dat[[j]] <- yy
    print(c(i, j))
  }
}

For a test dataset with more than 300 elemids, the loop above needs more than a minute. I work with more than 1000 files, which takes more than 10 hours. Could you please give some advice?

HarwayZ · Dec 04 '19 14:12

I worked out a solution to my problem by using fxml_hasChildren() to keep only the essential elemids, which reduced the loop from 300+ iterations to about 30+ (roughly a 10-fold reduction).
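
For reference, a minimal sketch of that filtering step might look like the following (this is not the exact code used; it assumes `test` is the flat XML data frame from the loop above and that fxml_hasChildren() returns TRUE/FALSE for a single element ID):

library(flatxml)

elemids <- unique(test$elemid.)

# Keep only the elements that actually have children
has_kids <- vapply(elemids, function(id) fxml_hasChildren(test, id), logical(1))
parent_ids <- elemids[has_kids]

# Run the extraction loop only over the reduced set of IDs
for (i in parent_ids) {
  yy <- fxml_toDataFrame(test, siblings.of = i)
  # ... collect yy as in the original loop ...
}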

HarwayZ · Dec 05 '19 03:12

Yes, I was also thinking that the problem may be that when you have two siblings A and B, your loop finds the siblings for both of them: when i is the elemid of A, fxml_toDataFrame() will find B as a sibling, and when i is the elemid of B, it will find A again, which was already processed before. When you have many elements on the same level, you do a lot of "unnecessary work". Another way to solve this would therefore be to save the elemids of the elements that end up in yy and not run fxml_toDataFrame() for any of them again. So if elemid x is already covered by one of your yy data frames, you would never run fxml_toDataFrame(test, siblings.of = x). This should reduce the number of loop iterations significantly.
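
A minimal sketch of this idea (not the package author's exact code; it assumes `test` is the flat XML data frame and that fxml_getSiblings() returns the elemids of an element's siblings, which should be checked against the flatxml documentation):

library(flatxml)

yy.dat <- list()
seen <- c()   # elemids already covered by a previous fxml_toDataFrame() call

for (i in unique(test$elemid.)) {
  if (i %in% seen) next   # this element was already handled as another element's sibling

  yy <- fxml_toDataFrame(test, siblings.of = i)
  if (nrow(yy) >= 1) {
    yy.dat[[length(yy.dat) + 1]] <- yy
  }

  # mark i and its siblings as done so they are not revisited
  seen <- c(seen, i, fxml_getSiblings(test, i))
}

Combined with the fxml_hasChildren() filter above, each group of siblings should then trigger only one call to fxml_toDataFrame().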

jsugarelli · Dec 05 '19 17:12