Not working correctly on file with too many lines

Open LinjianLi opened this issue 6 years ago • 5 comments

Environment

  • Ubuntu 18.04
  • CLion 2019.1.4

Issue

I tested the program on a CSV file, 20k_rows_data.csv.txt, with 20K lines, and the program does not work correctly. (I changed the filename extension to .txt because GitHub issues do not support uploading .csv files.)

#include <iostream>
#include <string>
#include <csv/reader.hpp>

int main() {
  csv::Reader csv;
  csv.read("../tests/inputs/20k_rows_data.csv.txt");
  auto rows = csv.rows();
  auto cols = csv.cols();
  int row_count = 0;
  for (auto row : rows) {
    std::string s = std::to_string(++row_count);
    for (auto col : cols) {
      s += ' ' + (std::string)(row[col]);
    }
    std::cout << s << std::endl;
  }
}

Part of the output looks like this (copied from my console):

5332     
5333     
5334 1 1 1 1 1
5335     
5336     
5337 1 1 1 1 1
5338     
5339     
5340     
5341 1 1 1 1 1
5342     
5343     

Note that the output is not the same each time I run it.

LinjianLi avatar Oct 16 '19 13:10 LinjianLi

I also have this issue. Is there a fix planned soon?

benguela avatar Jan 28 '20 05:01 benguela

@p-ranav perhaps it makes sense to provide a simpler, single-threaded version of the reader implementation and make the threaded one opt-in via flags, at either runtime or compile time? This issue kind of discouraged me.
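
To make the suggestion concrete, here is a minimal sketch of what a single-threaded read path could look like. It is not the library's code: `split_line` and `read_rows` are hypothetical names, and the sketch assumes a single-character delimiter and no quoted fields. Everything runs on the calling thread, so the rows come back in file order and the result is the same on every run.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the csv library): split one line on a
// single-character delimiter. Quoted fields are NOT handled in this sketch.
std::vector<std::string> split_line(const std::string& line, char delim = ',') {
  std::vector<std::string> fields;
  std::string field;
  for (char c : line) {
    if (c == delim) {
      fields.push_back(field);
      field.clear();
    } else {
      field += c;
    }
  }
  fields.push_back(field);  // last field, possibly empty
  return fields;
}

// Hypothetical single-threaded reader: consumes the stream line by line on
// the calling thread, so no queues or synchronization are involved.
std::vector<std::vector<std::string>> read_rows(std::istream& in, char delim = ',') {
  std::vector<std::vector<std::string>> rows;
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
    if (line.empty()) continue;                                 // skip blank lines
    rows.push_back(split_line(line, delim));
  }
  return rows;
}
```

Passing a `std::ifstream` or a `std::istringstream` to `read_rows` returns one vector of fields per non-blank line. A threaded reader could then be layered on top as an opt-in, with this path as the default.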

nevion avatar Mar 05 '20 08:03 nevion

Just fooling around with these changes: the test passes with std::map or std::unordered_map on my computer, but going back to unordered_flat_map causes it to produce blank records again, so I'm of the opinion that there are race conditions going on.

diff --git a/include/csv/reader.hpp b/include/csv/reader.hpp
index 56542d7..0793d3b 100644
--- a/include/csv/reader.hpp
+++ b/include/csv/reader.hpp
@@ -46,8 +46,13 @@ SOFTWARE.
 #include <iterator>
 #include <atomic>
 #include <string_view>
+#include <map>
 
 namespace csv {
+    template<typename K, typename V>
+    using map_type = std::map<K, V>;
+    //using map_type = std::unordered_map<K, V>;
+    //using map_type = unordered_flat_map<K, V>;
 
   class Reader {
   public:
@@ -121,16 +126,16 @@ namespace csv {
 
     bool ready() {
       size_t rows = 0;
-      number_of_rows_processed_.try_dequeue(rows);
-      row_iterator_queue_.try_dequeue(ready_index_);
-      bool result = (ready_index_ < expected_number_of_rows_ && ready_index_ < rows);
+      auto firstValid = number_of_rows_processed_.try_dequeue(rows);
+      auto secondValid = row_iterator_queue_.try_dequeue(ready_index_);
+      bool result = firstValid && secondValid && (ready_index_ < expected_number_of_rows_ && ready_index_ < rows);
       return result;
     }
 
-    unordered_flat_map<std::string_view, std::string> next_row() {
+    map_type<std::string_view, std::string> next_row() {
       row_iterator_queue_.enqueue(next_index_);
       next_index_ += 1;
-      unordered_flat_map<std::string_view, std::string> result;
+      map_type<std::string_view, std::string> result;
       rows_.try_dequeue(rows_ctoken_, result);
       return result;
     }
@@ -218,8 +223,8 @@ namespace csv {
       }
     }
 
-    std::vector<unordered_flat_map<std::string_view, std::string>> rows() {
-      std::vector<unordered_flat_map<std::string_view, std::string>> rows;
+    std::vector<map_type<std::string_view, std::string>> rows() {
+      std::vector<map_type<std::string_view, std::string>> rows;
       while (!done()) {
         if (ready()) {
           rows.push_back(next_row());
@@ -448,9 +453,9 @@ namespace csv {
     std::string filename_;
     std::ifstream stream_;
     std::vector<std::string> headers_;
-    unordered_flat_map<std::string_view, std::string> current_row_;
+    map_type<std::string_view, std::string> current_row_;
     std::string current_value_;
-    ConcurrentQueue<unordered_flat_map<std::string_view, std::string>> rows_;
+    ConcurrentQueue<map_type<std::string_view, std::string>> rows_;
     ProducerToken rows_ptoken_;
     ConsumerToken rows_ctoken_;
     ConcurrentQueue<size_t> number_of_rows_processed_;
@@ -473,7 +478,7 @@ namespace csv {
     ProducerToken values_ptoken_;
     ConsumerToken values_ctoken_;
     std::string current_dialect_name_;
-    unordered_flat_map<std::string, Dialect> dialects_;
+    map_type<std::string, Dialect> dialects_;
     Dialect current_dialect_;
     size_t done_index_;
     size_t ready_index_;

I noticed that try_dequeue returns a bool, but this is never checked. I'm also not sure why, in the next_row pathways, we can somehow have a ready() that completes while next_row() returns an empty record from the concurrent queue.
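
To illustrate why the unchecked return value matters, here is a small self-contained sketch. `TinyQueue` is a hypothetical stand-in that only mimics a bool-returning `try_dequeue`; it is not the queue the library uses, and `next_row_unchecked` and `drain_checked` are made-up names. The point is only that ignoring the bool turns "nothing was available yet" into a default-constructed, blank row.

```cpp
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a concurrent queue. It only reproduces the
// contract that try_dequeue returns false when nothing is available.
template <typename T>
class TinyQueue {
 public:
  void enqueue(T value) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push(std::move(value));
  }

  bool try_dequeue(T& out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (items_.empty()) return false;  // 'out' is left untouched
    out = std::move(items_.front());
    items_.pop();
    return true;
  }

 private:
  std::mutex mutex_;
  std::queue<T> items_;
};

using Row = std::map<std::string, std::string>;

// Same shape as next_row() in the diff: the bool is discarded, so a failed
// dequeue is indistinguishable from a genuinely empty row.
Row next_row_unchecked(TinyQueue<Row>& q) {
  Row result;
  q.try_dequeue(result);
  return result;
}

// Checked variant: a row is only kept when try_dequeue reports success.
std::vector<Row> drain_checked(TinyQueue<Row>& q) {
  std::vector<Row> rows;
  Row row;
  while (q.try_dequeue(row)) {
    rows.push_back(std::move(row));
  }
  return rows;
}
```

If the consumer gets ahead of the producer, `next_row_unchecked` returns an empty map, which would print as a blank record, while the checked variant simply yields no row until one is actually in the queue.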

nevion avatar Mar 05 '20 10:03 nevion

This code also gives the wrong answer: it prints 0 instead of 1.

csv::Writer csvFile("Test.csv");
csvFile.configure_dialect()
    .delimiter(", ")
    .column_names("D", "O", "H", "L", "C", "V", "M");
csvFile.write_row("1", "2", "3", "4", "5", "6", "7");
csvFile.close();

csv::Reader csv;
csv.read("Test.csv");
auto rows = csv.rows();

cout << rows.size() << "\n";

If I write another row, then the answer is correct (2).

Edit: I see that the issue is closed, but the problem still exists, at least in a version provided by vcpkg.

wdznak avatar Mar 07 '20 02:03 wdznak

Hello,

I'm working on a second implementation of this library: https://github.com/p-ranav/csv2. The reader is ready for use. Check it out. Hopefully it works better. I'm planning to archive this repo in favor of csv2.

Sorry again for all the issues you've faced with this library.

p-ranav avatar Apr 19 '20 02:04 p-ranav