
bug/get_delimiter() fails when file read stops in the middle of a line

Open Sammi-Smith opened this issue 1 year ago • 3 comments

Describe the bug When ingesting CSV files, the process sometimes fails with "Error("Could not determine delimiter")". This only happens for some CSV files; for others, it works as expected. The bug arises from the get_delimiter() function.

To Reproduce Provide a code snippet that reproduces the issue.

PMID35839768_Correlation_matrix.csv

Code snippet, using the above attached file:

from unstructured.partition.csv import get_delimiter
get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")

Output:

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
Cell In[9], line 1
----> 1 get_delimiter(file_path = "PMID35839768_Correlation_matrix.csv")

File /usr/local/lib/python3.11/site-packages/unstructured/partition/csv.py:124, in get_delimiter(file_path, file)
    121     with open(file_path) as f:
    122         data = f.read(num_bytes)
--> 124 return sniffer.sniff(data, delimiters=[",", ";"]).delimiter

File /usr/local/lib/python3.11/csv.py:187, in Sniffer.sniff(self, sample, delimiters)
    183     delimiter, skipinitialspace = self._guess_delimiter(sample,
    184                                                         delimiters)
    186 if not delimiter:
--> 187     raise Error("Could not determine delimiter")
    189 class dialect(Dialect):
    190     _name = "sniffed"

Error: Could not determine delimiter

Expected behavior The function returns the delimiter, which is ',' for this file.

Screenshots Not applicable.

Environment Info Python 3.11.8, unstructured 0.12.5

Additional context After looking into this issue for a bit, I found this similar issue for another Python module: https://github.com/Textualize/rich-cli/issues/54#issuecomment-1135595885

Scrolling down further on that same issue thread, I found another comment (https://github.com/Textualize/rich-cli/issues/54#issuecomment-1252186953) noting that the example in the official Python csv.Sniffer docs has the same issue. That is likely the source of this bug, since the implementation in unstructured is nearly identical: reading a fixed number of bytes can cut the sample off in the middle of a row, and the sniffer then fails to find a consistent delimiter.
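For anyone else hitting this, here is a minimal, self-contained illustration of the failure mode. The data is synthetic (a few wide rows standing in for the attached correlation matrix, not taken from it): a sample cut off mid-row can make csv.Sniffer give up, while the same rows sampled on a line boundary sniff fine.

import csv

# Synthetic stand-in for a wide correlation matrix: a few long rows, so a
# fixed-byte read is likely to stop in the middle of a row.
row = ",".join(["0.123456"] * 50)
text = "\n".join([row] * 4) + "\n"

sniffer = csv.Sniffer()

# Sample cut off partway through the last row -> sniffing fails.
truncated = text[: len(row) * 3 + 100]
try:
    sniffer.sniff(truncated, delimiters=[",", ";"])
except csv.Error as exc:
    print("truncated sample:", exc)  # Could not determine delimiter

# Same rows, but the sample ends on a line boundary -> sniffing succeeds.
aligned = "\n".join(text.splitlines()[:3]) + "\n"
print("aligned sample:", sniffer.sniff(aligned, delimiters=[",", ";"]).delimiter)  # ,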

Here is a code snippet I used to fix the issue by reading in whole lines instead of truncating the read mid-line. The same change should be applied to both instances of .read() that appear in the get_delimiter() function; both should be switched to readlines().

import csv
sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv") as f:
    line_strs = f.readlines(max_bytes)  # returns whole lines, stopping once the total size of the lines read exceeds max_bytes, so the sample always ends at a line boundary
    data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter

Output: ','
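To make that concrete, here is a rough sketch of what the patched get_delimiter() could look like. The overall shape (the file branch, the num_bytes value of 8192, and a text-mode file object) is assumed from the traceback above rather than copied from the library source, so treat it as an illustration of the proposed change, not the actual implementation.

import csv
from typing import IO, Optional

def get_delimiter(file_path: Optional[str] = None, file: Optional[IO] = None) -> str:
    """Sketch of the proposed fix: read whole lines so the sample handed to
    csv.Sniffer never stops in the middle of a row."""
    sniffer = csv.Sniffer()
    num_bytes = 8192  # assumed sample-size cap, same spirit as max_bytes above

    if file is not None:
        # was: data = file.read(num_bytes); assumes a text-mode file object
        data = "".join(file.readlines(num_bytes))
    else:
        with open(file_path) as f:
            # was: data = f.read(num_bytes)
            data = "".join(f.readlines(num_bytes))

    return sniffer.sniff(data, delimiters=[",", ";"]).delimiter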

Sammi-Smith avatar Mar 13 '24 13:03 Sammi-Smith

@awalker4 it looks like this is the commit where the bug was introduced: https://github.com/Unstructured-IO/unstructured/commit/d594c06a3e3a6c3b19fefc3fbcc316f3f872c530

Tagging you since you would probably be the best one to make this minor fix! :)

Sammi-Smith avatar Mar 13 '24 13:03 Sammi-Smith

Sorry for the delay! This does look like a good fix. I can try to get to it soon, but if you have a chance to make a PR, that would be a huge help 🙏

awalker4 avatar Mar 25 '24 19:03 awalker4

In addition, this portion of the code sometimes errors out with a UnicodeDecodeError while reading the file. I'd recommend passing errors='ignore' to the open() call so the delimiter can still be determined instead of the whole thing failing over a stray character that can't be decoded.

import csv
sniffer = csv.Sniffer()
max_bytes = 8192
with open("PMID35839768_Correlation_matrix.csv", errors='ignore') as f:
    line_strs = f.readlines(max_bytes)  # returns whole lines, stopping once the total size of the lines read exceeds max_bytes, so the sample always ends at a line boundary
    data = "".join(line_strs)
sniffer.sniff(data, delimiters=[",", ";"]).delimiter
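If it helps to see why, here is a tiny self-contained example (a synthetic file written on the spot, not the attached one, and assuming a UTF-8 locale): a single undecodable byte makes a strict read blow up, while errors='ignore' simply drops it and the sniffer still gets usable text.

import csv

# Write a small CSV containing one byte that is not valid UTF-8.
with open("stray_byte.csv", "wb") as f:
    f.write(b"a,b,c\n1,2,\xff3\n4,5,6\n")

# Default strict decoding raises as soon as the bad byte is hit.
try:
    with open("stray_byte.csv") as f:
        f.read()
except UnicodeDecodeError as exc:
    print("strict read:", exc)

# With errors='ignore' the stray byte is dropped and sniffing still works.
with open("stray_byte.csv", errors="ignore") as f:
    data = "".join(f.readlines(8192))
print(csv.Sniffer().sniff(data, delimiters=[",", ";"]).delimiter)  # ','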

Sammi-Smith avatar Apr 19 '24 13:04 Sammi-Smith